Asynchronous Optimization over Weakly Coupled Renewal Systems
ASYNCHRONOUS OPTIMIZATION OVER WEAKLY COUPLED RENEWAL SYSTEMS

by

Xiaohan Wei

Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2019
Copyright 2019 Xiaohan Wei

Approved by

Professor Michael Neely, Committee Chair, Department of Electrical Engineering, University of Southern California
Professor Stanislav Minsker, Committee Chair, Department of Mathematics, University of Southern California
Professor Larry Goldstein, Department of Mathematics, University of Southern California
Professor Mihailo Jovanovic, Department of Electrical Engineering, University of Southern California
Professor Ashutosh Nayyar, Department of Electrical Engineering, University of Southern California

Dedication
To my parents and my wife, Yuhong, who supported me both mentally and financially over the years.

Acknowledgements
First, I would like to thank my advisor, Professor Michael J. Neely, for guiding me throughout the PhD journey since Summer 2013. He is a man of accuracy and rigor, always passionate about discussing concrete research problems, and willing to roll up his sleeves and grind through technical details with me. His way of treating research topics has significantly influenced me. Rather than blindly following existing works and producing incremental results when trying to get into a new area, I learned to ask fundamental mathematical questions, make connections to the tools and theories we are already familiar with, and not be afraid of getting my hands dirty. His blazing new ideas were my morale boost when groping in the dark.

Next, I would like to thank Professor Stanislav Minsker, who is the advisor on my high-dimensional statistics research. I got to know him during the Math-547 statistical learning course in Fall 2015. Though not much more senior than me, he is already extremely knowledgeable in the area of statistical learning and has been widely recognized for his works on robust high-dimensional statistics. He is a quick thinker and can always point out meaningful new directions hiding rather deeply, which eventually led to high-quality publications. I would have published no paper in this area had I never met him. Along the way, he also taught me how to sell my work and helped me practice my seminar talks, which led to impressive presentations and Ming-Hsieh scholarships.

Also, I would like to thank Professor Larry Goldstein, whom I met during a small paper reading group in Spring 2016. He is an expert on Stein's method and, as a senior professor, surprisingly accessible to PhD students and active in various research areas. Together with Prof. Minsker, we had quite a few fruitful discussions and made some nice progress on robust statistics.

I would also like to thank Professors Mihailo Jovanovic and Ashutosh Nayyar for discussing research problems with me and sitting on my qualifying exam committee.
I appreciate them for their valuable comments and suggestions.

Moreover, I thank my senior lab mates Hao Yu and Sucha Supittayapornpong, who were always available to discuss problems with me and come up with new research ideas. Also, Ruda Zhang, Lang Wang, and Jie Ruan studied various math courses and interesting math problems with me and helped me clear the hurdles at different stages, which I really appreciate. Special thanks to Professor Qing Ling, who was my undergraduate advisor but continues to influence various aspects of my academic career.

Last but not least, I would like to take the chance to express my gratitude to those who contributed at various stages of my research. In particular, I thank Zhuoran Yang, for lighting up new areas and expanding my research horizon; Dongsheng Ding, who brings ideas from a control perspective and is always passionate to try out research ideas with me; Sheng Chen, for sharing with me his perspective on robust LASSO problems; Professor Jason D. Lee, for working on the geometric median problem with me; and Jianshu Chen from Tencent AI, who introduced me to the area of reinforcement learning.

Table of Contents
Dedication
Acknowledgements
Abstract
1 Introduction to Renewal Systems
3 Data Center Server Provision via Theory of Coupled Renewal Systems
   3.5.1 N queues system
   3.5.2 Real data center traffic trace and performance evaluation
   3.6 Additional lemmas and proofs
   4.4.1 N single-buffer queues
   4.4.2 Optimality of the indexing algorithm
   4.4.3 Preliminaries on stochastic coupling
   4.4.4 Stochastic ordering of buffer state process
   4.4.5 Extending to non-work-conserving policies
   4.5 Simulation experiments
   4.5.1 DPP ratio indexing with geometric file length
   4.5.2 DPP ratio indexing with non-memoryless file lengths
   4.6 Additional lemmas and proofs
   4.6.1 Comparison of Max-λ and Min-λ
   4.6.2 Proof of Lemma 4.4.2
5 Opportunistic Scheduling over Renewal Systems
   5.5.1 θ[n] and θ[n]
   5.5.2 Towards near optimality (I): Truncation
   5.5.3 Towards near optimality (II): Exponential supermartingale
   5.5.4 An asymptotic upper bound on θ[n]
   5.5.5 Finishing the proof of near optimality
   5.6 Simulation experiments
   5.7 Additional proofs
   5.8 Computation of Asymptotics
Abstract

A renewal system divides the slotted timeline into back-to-back time periods called “renewal frames”. At the beginning of each frame, it chooses a policy from a set of options for that frame. The policy determines the duration of the frame, the penalty incurred during the frame (such as energy expenditure), and a vector of performance metrics (such as the instantaneous number of jobs served). The starting points of this line of research are Chapter 7 of the book [Nee10a], the seminal work [Nee13a], and Chapter 5 of the PhD thesis of Chih-ping Li [Li11], who graduated before I came to USC. These works consider stochastic optimization over a single renewal system. By way of contrast, this thesis considers optimization over multiple parallel renewal systems, which is computationally more challenging and admits many more applications. The goal is to minimize the time average overall penalty subject to time average overall constraints on the corresponding performance metrics. The main difficulty, which is not present in earlier works, is that these systems act asynchronously: the renewal frames of different renewal systems are not aligned. The goal of the thesis is to resolve this difficulty head-on via a new asynchronous algorithm and a novel supermartingale stopping time analysis, which shows that our algorithms not only converge to the optimal solution but also enjoy fast convergence rates. Based on this general theory, we further develop novel algorithms for data center server provision problems with performance guarantees, as well as new heuristics for multi-user file downloading problems.

We start by reviewing existing works on optimization over a single renewal system in Chapter 1. Then, in Chapter 2, we propose a new algorithm for asynchronous renewal optimization so that each system can make its own decision after observing a global multiplier that is updated every slot.
We show that this algorithm satisfies the desired constraints and achieves O(ε) near optimality with O(1/ε²) convergence time. Based on the new algorithm, we formulate the data center server provision problem as an asynchronous renewal optimization in Chapter 3 and develop a corresponding algorithm which exceeds the state of the art. In Chapter 4, we look at another application, namely multi-user file downloading, which can be formulated as a constrained multi-armed bandit problem. We show that our proposed algorithm leads to a useful heuristic approximately solving the problem with experimentally near optimal performance.

In Chapter 5, we consider constrained optimization over a renewal system with random events observed at the beginning of each renewal frame. We propose an online algorithm which does not need knowledge of the distributions of the random events. We prove that this proposed algorithm is feasible and achieves O(ε) near optimality by constructing an exponential supermartingale. Simulation experiments demonstrate the near optimal performance of the proposed algorithm.

Finally, in Chapter 6, we consider online learning over weakly coupled Markov decision processes. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have a tight O(√T) regret and constraint violations simultaneously over a time horizon T.

Chapter 1
Introduction to Renewal Systems

1.1 Optimization over a single renewal system: A review

Figure 1.1: The sample timeline of a renewal system.

Renewal systems are generalizations of the renewal processes studied in probability and random processes courses. Parallel to Markov decision processes versus Markov chains, renewal systems are controlled renewal processes.
Since this is not a widely used term, to set the tone of the thesis, we start with a review of optimization over a single renewal system. Consider a dynamical system operating over a discrete slotted timeline t ∈ {0, 1, 2, . . .}. The timeline is segmented into back-to-back intervals of time slots called renewal frames. The start of each renewal frame for a system is called a renewal time, or simply a renewal, for that system. The duration of each renewal frame is a random positive integer whose distribution depends on a control action chosen at the start of the frame. We use k = 0, 1, 2, · · · to index the renewals. Let t_k be the time slot corresponding to the k-th renewal, with the convention that t_0 = 0. Let 𝒯_k be the set of all slots from t_k to t_{k+1} − 1. See Fig. 1.1 for a graphical illustration.

At time t_k, the decision maker chooses a possibly random decision α_k in a set A. This action determines the distributions of the following random variables:

• The duration of the k-th renewal frame T_k := t_{k+1} − t_k, which is a positive integer.
• A vector of performance metrics at each slot of that frame z[t] := (z_1[t], z_2[t], · · · , z_L[t]), t ∈ 𝒯_k, where L is a fixed positive integer.
• A penalty incurred at each slot of the frame y[t], t ∈ 𝒯_k.

In the special case where T_k = 1, ∀k, this reduces to the classical slotted stochastic system, which has been relatively well understood. Let F_k be the system history up to t_k − 1, which includes {y[t]}_{t=0}^{t_k − 1}, {z[t]}_{t=0}^{t_k − 1} and {T_j}_{j=0}^{k−1}. The key property we rely on is as follows.

Definition 1.1.1 (Renewal property). A system is said to satisfy the renewal property if the random variables T_k, z[t] and y[t], t ∈ 𝒯_k, are conditionally independent of the history F_k given α_k = α ∈ A.

The goal is to minimize the time average penalty subject to L time average constraints on the performance metrics, i.e. we aim to solve the following optimization problem:

min  lim sup_{T→∞} (1/T) ∑_{t=0}^{T−1} E(y[t])   (1.1)
s.t.  lim sup_{T→∞} (1/T) ∑_{t=0}^{T−1} E(z_l[t]) ≤ d_l,  l ∈ {1, 2, · · · , L},   (1.2)

where {d_l}_{l=1}^{L} are known constants. Let

y(α_k) := ∑_{t∈𝒯_k} y[t],  z_l(α_k) := ∑_{t∈𝒯_k} z_l[t],  T(α_k) := T_k

be realizations during the k-th frame using an action α_k. Under mild technical conditions (e.g. existence of second moments, see Section 2.1 for details), the problem (1.1)-(1.2) can also be written in a fractional program form:

min  lim sup_{K→∞} E(∑_{k=0}^{K−1} y(α_k)) / E(∑_{k=0}^{K−1} T(α_k))   (1.3)
s.t.  lim sup_{K→∞} E(∑_{k=0}^{K−1} z_l(α_k)) / E(∑_{k=0}^{K−1} T(α_k)) ≤ d_l,  l ∈ {1, 2, · · · , L},   (1.4)
      α_k ∈ A, ∀k.

1.1.1 Optimization over i.i.d. actions

Suppose the system adopts an i.i.d. sequence of random actions {α*_k}_{k=0}^{∞}, where the decision α*_k ∈ A made on frame k is independent of the past. Then, by the renewal property, it is easy to see that {y(α*_k), z(α*_k), T(α*_k)} are i.i.d. random variables.
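As a numerical sanity check of this fact, the renewal reward identity behind (1.3), namely that the long-run time average penalty equals E(y(α*_k))/E(T(α*_k)) under i.i.d. actions, can be verified by simulation. The single action below, with geometric frame lengths and Bernoulli per-slot penalties, is a made-up example and not one from the thesis:

```python
import random

random.seed(0)

# Hypothetical single action: frame length T ~ Geometric(p) on {1, 2, ...},
# per-slot penalty y[t] ~ Bernoulli(q), independent across slots and frames.
p, q = 0.4, 0.7
E_T = 1 / p            # expected frame length
E_y_frame = q * E_T    # expected total penalty accumulated over one frame

def run_frames(num_frames):
    """Simulate renewal frames; return (total penalty, total slots)."""
    penalty, slots = 0, 0
    for _ in range(num_frames):
        T = 1
        while random.random() > p:       # geometric frame length
            T += 1
        penalty += sum(1 for _ in range(T) if random.random() < q)
        slots += T
    return penalty, slots

penalty, slots = run_frames(200_000)
time_avg = penalty / slots               # time average penalty per slot
ratio_of_means = E_y_frame / E_T         # equals q here
print(abs(time_avg - ratio_of_means))    # small, by the renewal reward theorem
```

With i.i.d. actions the per-frame totals are i.i.d., so the ratio of expectations matches the observed time average up to sampling error.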
We have

lim sup_{K→∞} E(∑_{k=0}^{K−1} y(α*_k)) / E(∑_{k=0}^{K−1} T(α*_k)) = lim_{K→∞} (1/K) E(∑_{k=0}^{K−1} y(α*_k)) / lim_{K→∞} (1/K) E(∑_{k=0}^{K−1} T(α*_k)) = E(y(α*_k)) / E(T(α*_k)),

lim sup_{K→∞} E(∑_{k=0}^{K−1} z_l(α*_k)) / E(∑_{k=0}^{K−1} T(α*_k)) = lim_{K→∞} (1/K) E(∑_{k=0}^{K−1} z_l(α*_k)) / lim_{K→∞} (1/K) E(∑_{k=0}^{K−1} T(α*_k)) = E(z_l(α*_k)) / E(T(α*_k)).

As a consequence, if we consider solving (1.3)-(1.4) over the set of i.i.d. random actions, then it is equivalent to solving:

min  E(y(α*_k)) / E(T(α*_k))   (1.5)
s.t.  E(z_l(α*_k)) / E(T(α*_k)) ≤ d_l,  l ∈ {1, 2, · · · , L}.   (1.6)

Assumption 1.1.1.
The problem (1.5)-(1.6) is feasible, i.e. there exists α*_k such that the constraints (1.6) are satisfied. Furthermore, we assume the set of all feasible performance vectors (E(y(α*_k))/E(T(α*_k)), E(z(α*_k))/E(T(α*_k))) over all i.i.d. actions α*_k is compact.

The compactness assumption is adopted so that there exists at least one i.i.d. action which solves (1.5)-(1.6). In fact, one can show that under proper technical conditions the minimum achieved by (1.3)-(1.4) is the same as that of (1.5)-(1.6) (see, for example, Lemma 2.3.2 in the next section).
As one of the main motivations for this line of research, in this section we show that the well-known MDP is a special case of the renewal system. Consider a discrete time MDP over an infinite horizon. It consists of a finite state space S and an action space U at each state s ∈ S. (To simplify the notation, we assume each state has the same action space; all our analysis generalizes trivially to states with different action spaces.) For each state s ∈ S, we use P_u(s, s′) to denote the transition probability from s ∈ S to s′ ∈ S when taking action u ∈ U, i.e.

P_u(s, s′) = Pr(s[t + 1] = s′ | s[t] = s, u[t] = u),

where s[t] and u[t] are the state and action at time slot t. At time slot t, after observing the state s[t] ∈ S and choosing the action u[t] ∈ U, the MDP receives a penalty y(u[t], s[t]) and L types of resource costs z_1(u[t], s[t]), · · · , z_L(u[t], s[t]), where these functions are all bounded mappings from S × U to R. For simplicity we write y[t] = y(u[t], s[t]) and z_l[t] = z_l(u[t], s[t]). The goal is to minimize the time average penalty with constraints on time average overall costs. This problem can be written in the form (1.1)-(1.2).

In order to define the renewal frame, we need one more assumption on the MDP. We assume the MDP is ergodic, i.e. there exists a state which is recurrent and the corresponding Markov chain is aperiodic under any randomized stationary policy, with bounded expected recurrence time. Under this assumption, the renewals for the MDP can be defined as successive revisitations to the recurrent state, and the action set A in such a scenario is defined as the set of all randomized stationary policies that can be implemented in one renewal frame. Thus, our renewal system formulation includes ergodic MDPs. We refer to [Alt99a], [Ber01], and [Ros02] for more details on MDP theory and related topics. We also refer readers to Chapter 5 for more MDP-specific algorithms and analysis.
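To make the renewal construction concrete, the toy script below simulates a hypothetical two-state ergodic chain (standing in for an MDP under one fixed randomized stationary policy; the chain and its numbers are purely illustrative) and measures the recurrence times to a designated recurrent state, which are exactly the renewal frame lengths:

```python
import random

random.seed(1)

# Hypothetical two-state ergodic Markov chain under a fixed randomized
# stationary policy; P[s][s2] is the resulting transition probability.
P = {0: {0: 0.6, 1: 0.4},
     1: {0: 0.5, 1: 0.5}}
RECURRENT = 0  # renewals are defined as successive returns to this state

def step(s):
    """Sample the next state from state s."""
    return 0 if random.random() < P[s][0] else 1

def recurrence_times(num_renewals):
    """Lengths of successive renewal frames (returns to RECURRENT)."""
    s, frames, length = RECURRENT, [], 0
    while len(frames) < num_renewals:
        s = step(s)
        length += 1
        if s == RECURRENT:
            frames.append(length)
            length = 0
    return frames

frames = recurrence_times(100_000)
# Stationary distribution: pi(0) = 5/9, so by Kac's formula the expected
# return time to state 0 is 1/pi(0) = 9/5 = 1.8.
print(sum(frames) / len(frames))  # close to 1.8
```

The empirical mean frame length agrees with Kac's formula, illustrating the bounded expected recurrence time assumed above.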
In this section, we introduce the classical DPP ratio algorithm solving (1.3)-(1.4) ([Nee10a], [Nee13a]). It is a frame-based algorithm which updates parameters at the beginning of each frame. We start by defining the “virtual queues” Q_l[k] for each constraint, with Q_l[0] = 0 and

Q_l[k + 1] = max{Q_l[k] + z_l(α_k) − d_l T(α_k), 0}.

(A randomized stationary policy π is an algorithm which chooses actions at state s ∈ S according to a fixed conditional distribution π(u | s), u ∈ U, independently of all other past information, i.e. Pr(u[t] | F_t) = π(u[t] | s[t]), u[t] ∈ U, s[t] ∈ S, where F_t is the past information up to time t.)

Let Q[k] = (Q_1[k], · · · , Q_L[k]) be the vector of virtual queues. Define the drift as follows:

Δ[k] := (1/2)(‖Q[k + 1]‖² − ‖Q[k]‖²).

Let F_k be the system history up to t_k − 1, which includes {y(α_j)}_{j=0}^{k−1} and {z(α_j)}_{j=0}^{k−1}. Assuming that the second moment of z_l(α_k) − d_l T(α_k) exists, there exists a constant B such that B ≥ (1/2) ∑_{l=1}^{L} E((z_l(α_k) − d_l T(α_k))² | F_k), and it is easy to show that

E(Δ[k] | F_k) ≤ B + ∑_{l=1}^{L} Q_l[k] E(z_l(α_k) − d_l T(α_k) | F_k).

We define the DPP expression as Δ[k] + V y(α_k), where V > 0 is a trade-off parameter. Then,

E(Δ[k] + V y(α_k) | F_k) ≤ B + ∑_{l=1}^{L} Q_l[k] E(z_l(α_k) − d_l T(α_k) | F_k) + V E(y(α_k) | F_k)   (1.7)
= B + E(T(α_k) | F_k) · [V E(y(α_k) | F_k) + ∑_{l=1}^{L} Q_l[k] E(z_l(α_k) − d_l T(α_k) | F_k)] / E(T(α_k) | F_k).   (1.8)

Then, the algorithm (Algorithm 1) aims at minimizing the bracketed ratio on the right hand side.
Algorithm 1 (DPP ratio algorithm). Fix a trade-off parameter V > 0.

• At the beginning of each frame, the proposed algorithm takes the action α_k that minimizes the ratio

[V E(y(α_k) | F_k) + ∑_{l=1}^{L} Q_l[k] E(z_l(α_k) | F_k)] / E(T(α_k) | F_k).   (1.9)

• Update the virtual queues via

Q_l[k + 1] = max{Q_l[k] + z_l(α_k) − d_l T(α_k), 0}.   (1.10)

Note that, due to the renewal property of the system, minimizing the ratio (1.9) is the same as minimizing the ratio

[V E(y(α_k)) + ∑_{l=1}^{L} Q_l[k] E(z_l(α_k))] / E(T(α_k)).

The performance of this algorithm has been shown in a number of works ([Nee10a, Nee13a]). We reproduce the proof here, but from a somewhat different perspective compared to previous works, since it is more illustrative for our purpose and serves as the foundation of our new analysis later.

The key step, which is used repeatedly throughout the thesis, is as follows. Since our proposed algorithm minimizes (1.9), it must satisfy

[V E(y(α_k) | F_k) + ∑_{l=1}^{L} Q_l[k] E(z_l(α_k) | F_k)] / E(T(α_k) | F_k) ≤ [V E(y(α*_k)) + ∑_{l=1}^{L} Q_l[k] E(z_l(α*_k))] / E(T(α*_k))   (1.11)

for any i.i.d. decisions α*_k, where we use the fact that α*_k is independent of the history F_k, and thus the conditioning on the right hand side can be omitted. In particular, we can choose α*_k to be the solution to (1.5)-(1.6) and let [f*, g*] = [E(y(α*_k))/E(T(α*_k)), E(z(α*_k))/E(T(α*_k))] be the optimal performance vector. Rearranging terms in the above inequality gives

E(V(y(α_k) − f* T(α_k)) + ∑_{l=1}^{L} Q_l[k](z_l(α_k) − g*_l T(α_k)) | F_k) ≤ 0.

This implies that the expression inside the expectation is a supermartingale difference sequence, a fact not strictly needed here but key to our new analysis later. Now, taking expectations on both sides and summing from k = 0 to K − 1 gives

∑_{k=0}^{K−1} E(V(y(α_k) − f* T(α_k)) + ∑_{l=1}^{L} Q_l[k](z_l(α_k) − g*_l T(α_k))) ≤ 0.
Substituting g*_l ≤ d_l gives

∑_{k=0}^{K−1} E(V(y(α_k) − f* T(α_k)) + ∑_{l=1}^{L} Q_l[k](z_l(α_k) − d_l T(α_k))) ≤ 0.   (1.12)

On the other hand, taking expectations on both sides of the inequality (1.7) and summing from k = 0 to K − 1 (the drift terms telescope, and Q[0] = 0) gives

(1/2) E(‖Q[K]‖²) + ∑_{k=0}^{K−1} E(V y(α_k)) ≤ ∑_{k=0}^{K−1} E(V y(α_k) + ∑_{l=1}^{L} Q_l[k](z_l(α_k) − d_l T(α_k))) + BK.

Summing the above inequality and (1.12) gives

(1/2) E(‖Q[K]‖²) + ∑_{k=0}^{K−1} E(V y(α_k)) ≤ V f* ∑_{k=0}^{K−1} E(T(α_k)) + BK.   (1.13)

This bound kills two birds with one stone, allowing us to obtain the objective bound and the constraint violations at the same time. On one hand, since E(‖Q[K]‖²) ≥ 0 and ∑_{k=0}^{K−1} E(T(α_k)) ≥ K, we have

∑_{k=0}^{K−1} E(y(α_k)) / ∑_{k=0}^{K−1} E(T(α_k)) ≤ f* + B/V.   (1.14)

On the other hand, let C and T_max be constants such that C ≥ |E(y(α))| and T_max ≥ E(T(α)) for any action α ∈ A. Then, (1.13) implies

E(‖Q[K]‖) ≤ √(BK + 4VK(C + T_max))  ⇒  ∑_{k=0}^{K−1} E(z_l(α_k)) / ∑_{k=0}^{K−1} E(T(α_k)) ≤ d_l + √((B + 4V(C + T_max))/K),   (1.15)

which follows from the virtual queue updating rule (1.10), which gives E(‖Q[K]‖) ≥ E(Q_l[K]) ≥ ∑_{k=0}^{K−1} E(z_l(α_k) − d_l T(α_k)), together with ∑_{k=0}^{K−1} E(T(α_k)) ≥ K.
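To make the mechanics of Algorithm 1 concrete, here is a minimal simulation sketch. The two-action set, its per-frame expectations, the constants d and V, and the simplified per-frame realizations (which scale linearly with the frame length) are all hypothetical choices for illustration, not quantities from the thesis:

```python
import random

random.seed(2)

# Hypothetical action set: each action fixes the mean frame length E_T, the
# mean per-frame penalty E_y, and the mean per-frame constraint cost E_z.
ACTIONS = [
    {"E_y": 4.0, "E_z": 1.0, "E_T": 2.0},  # high penalty, low cost
    {"E_y": 1.0, "E_z": 3.0, "E_T": 2.0},  # low penalty, high cost
]
d, V = 1.0, 50.0  # constraint level and trade-off parameter

def sample_len(mean):
    """Random positive frame length with roughly the given mean."""
    return max(1, round(random.expovariate(1 / mean)))

Q, tot_y, tot_z, tot_T = 0.0, 0.0, 0.0, 0
for k in range(20_000):
    # Minimize the DPP ratio (1.9) by enumeration over the finite action set.
    a = min(ACTIONS, key=lambda a: (V * a["E_y"] + Q * a["E_z"]) / a["E_T"])
    T = sample_len(a["E_T"])           # realized frame length
    y = a["E_y"] / a["E_T"] * T        # realized penalty (simplified model)
    z = a["E_z"] / a["E_T"] * T        # realized cost (simplified model)
    Q = max(Q + z - d * T, 0.0)        # virtual queue update (1.10)
    tot_y, tot_z, tot_T = tot_y + y, tot_z + z, tot_T + T

print(tot_y / tot_T, tot_z / tot_T)    # time averages; second is near d
```

The virtual queue Q settles near the threshold where the two actions tie in (1.9), time-sharing between them so that the constraint is met while the penalty approaches the optimal mixture value.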
The bounds (1.14), (1.15) are not the tightest possible bounds, but (I believe) simple enough to highlight the key steps.

So far readers have gained some understanding of the renewal systems we will talk about throughout the thesis. In this section, we introduce our coupled renewal system model. Many of the notations are the same as those of the last section, except that we add a superscript n to index the renewal systems. Consider N renewal systems that operate over a slotted timeline (t ∈ {0, 1, 2, . . .}). The timeline for each system n ∈ {1, . . . , N} is segmented into back-to-back intervals, which are renewal frames. The duration of each renewal frame is a random positive integer with a distribution that depends on a control action chosen by the system at the start of the frame. The decision at each renewal frame also determines the penalty and a vector of performance metrics during this frame. The systems are coupled by time average constraints placed on these metrics over all systems. The goal is to design a decision strategy for each system so that the overall time average penalty is minimized subject to the time average constraints.

We use k = 0, 1, 2, · · · to index the renewals. Let t^n_k be the time slot corresponding to the k-th renewal of the n-th system, with the convention that t^n_0 = 0. Let 𝒯^n_k be the set of all slots from t^n_k to t^n_{k+1} − 1. At time t^n_k, the n-th system chooses a possibly random decision α^n_k in a set A^n. This action determines the distributions of the following random variables:

• The duration of the k-th renewal frame T^n_k := t^n_{k+1} − t^n_k, which is a positive integer.
• A vector of performance metrics at each slot of that frame z^n[t] := (z^n_1[t], z^n_2[t], · · · , z^n_L[t]), t ∈ 𝒯^n_k.
• A penalty incurred at each slot of the frame y^n[t], t ∈ 𝒯^n_k.

We assume each system has the renewal property as defined in Definition 1.1.1: given α^n_k = α^n ∈ A^n, the random variables T^n_k, z^n[t] and y^n[t], t ∈ 𝒯^n_k, are independent of the information of all systems from the slots before t^n_k, with the following known conditional expectations: E(T^n_k | α^n_k = α^n), E(∑_{t∈𝒯^n_k} y^n[t] | α^n_k = α^n) and E(∑_{t∈𝒯^n_k} z^n[t] | α^n_k = α^n). Fig. 1.2 plots a sample timeline of three parallel renewal systems.

Figure 1.2: The sample timelines of three asynchronous parallel renewal systems, where the numbers underneath the figure index time slots and the numbers inside the blocks index the renewals of each system.

To make the framework a little more general, we introduce an uncontrollable external i.i.d. random process {d[t]}_{t=0}^{∞} ⊆ R^L which can be observed during each time slot. Let d_l := E(d_l[t]). The expectation of d[t] often serves as the constraint level for the corresponding performance metric. As we shall see in the example application on an energy-aware scheduling problem, z^n[t] and d[t] could represent vectors of job services and arrivals for different classes, respectively, and the constraints are that the time average service is no less than the time average of arrivals for all classes of jobs.
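The asynchronicity of the renewals can be visualized with a short, purely illustrative script: N = 3 systems draw frame lengths from made-up distributions, and the printed timelines show that their renewal epochs rarely align:

```python
import random

random.seed(3)

N, HORIZON = 3, 20
# Hypothetical frame-length distributions, one per system.
frame_len = [lambda: random.randint(1, 3),
             lambda: random.randint(2, 4),
             lambda: random.choice([1, 4])]

# renewals[n] collects the slots t_k^n at which system n starts a new frame.
renewals = []
for n in range(N):
    t, points = 0, [0]  # every system renews at t = 0
    while t < HORIZON:
        t += frame_len[n]()
        points.append(t)
    renewals.append(points)

for n, pts in enumerate(renewals):
    line = "".join("|" if t in pts else "." for t in range(HORIZON))
    print(f"system {n}: {line}")
```

Each printed row marks one system's renewal slots with "|"; beyond the common renewal at slot 0, the marks generally fall on different slots, which is exactly the situation depicted in Fig. 1.2.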
The goal is to minimize the total time average penalty of these N renewal systems subject to L total time average constraints on the performance metrics related to the external i.i.d. process, i.e. we aim to solve the following optimization problem:

min  lim sup_{T→∞} (1/T) ∑_{t=0}^{T−1} ∑_{n=1}^{N} E(y^n[t])   (1.16)
s.t.  lim sup_{T→∞} (1/T) ∑_{t=0}^{T−1} ∑_{n=1}^{N} E(z^n_l[t]) ≤ d_l,  l ∈ {1, 2, · · · , L}.   (1.17)

Consider a slotted time system with L classes of jobs and N servers. Job arrivals are Poisson distributed with rates λ_1, · · · , λ_L, respectively. These jobs are stored in separate queues, denoted Q_1[t], · · · , Q_L[t], in a router waiting to be served. Assume the system is empty at time t = 0, so that Q_l[0] = 0, ∀l ∈ {1, 2, · · · , L}. Let λ_l[t] be the precise number of class l job arrivals at slot t; then E(λ_l[t]) = λ_l, ∀l ∈ {1, 2, · · · , L}. Let μ^n_l[t] and e^n[t] be the number of class l jobs served and the energy consumption for server n at time slot t, respectively. Fig. 1.3 sketches an example architecture of the system with 3 classes of jobs and 10 servers.

Each server makes decisions over renewal frames, and the first frame starts at time slot t = 0. Successive renewals can happen at different slots for different servers. For the n-th server, at the beginning of the k-th frame (k ∈ N), it chooses a processing mode m^n_k within the set of all modes M^n. The processing mode m^n_k determines distributions on the number of jobs served, the service time, and the energy expenditure, with conditional expectations:

• T̂^n(m^n_k) := E(T^n_k | m^n_k), the expected frame size.
• μ̂^n_l(m^n_k) := E(∑_{t∈𝒯^n_k} μ^n_l[t] | m^n_k), the expected number of class l jobs served.
• ê^n(m^n_k) := E(∑_{t∈𝒯^n_k} e^n[t] | m^n_k), the expected energy consumption.

The goal is to minimize the time average energy consumption, subject to the queue stability constraints, i.e.

min  lim sup_{T→∞} (1/T) ∑_{t=0}^{T−1} ∑_{n=1}^{N} E(e^n[t])   (1.18)
s.t.  lim inf_{T→∞} (1/T) ∑_{t=0}^{T−1} ∑_{n=1}^{N} E(μ^n_l[t]) ≥ λ_l,  ∀l ∈ {1, 2, · · · , L}.   (1.19)

Thus, we have formulated the problem in the form (1.16)-(1.17). Note that the external process in this example is the arrival process of the L classes of jobs, with potentially unknown arrival rates λ_l.

Figure 1.3: Illustration of an energy-aware scheduling system with 3 classes of jobs and 10 parallel servers.

Consider N discrete time Markov decision processes (MDPs) over an infinite horizon. Each MDP consists of a finite state space S^n and an action space U^n at each state s ∈ S^n. (To simplify the notation, we assume each state has the same action space; all our analysis generalizes trivially to states with different action spaces.) For each state s ∈ S^n, we use P^n_u(s, s′) to denote the transition probability from s ∈ S^n to s′ ∈ S^n when taking action u ∈ U^n, i.e.

P^n_u(s, s′) = Pr(s[t + 1] = s′ | s[t] = s, u[t] = u),

where s[t] and u[t] are the state and action at time slot t. At time slot t, after observing the state s[t] ∈ S^n and choosing the action u[t] ∈ U^n, the n-th MDP receives a penalty y^n(u[t], s[t]) and L types of resource costs z^n_1(u[t], s[t]), · · · , z^n_L(u[t], s[t]), where these functions are all bounded mappings from S^n × U^n to R. For simplicity we write y^n[t] = y^n(u[t], s[t]) and z^n_l[t] = z^n_l(u[t], s[t]). The goal is to minimize the time average overall penalty with constraints on time average overall costs, where these MDPs are weakly coupled through the time average constraints. This problem can be written in the form (1.16)-(1.17).

In order to define the renewal frames, we need one more assumption on the MDPs. We assume each of the MDPs is ergodic, i.e.
there exists a state which is recurrent and the corresponding Markov chain is aperiodic under any randomized stationary policy, with bounded expected recurrence time. Under this assumption, the renewals for each MDP can be defined as successive revisitations to the recurrent state, and the action set A^n in such a scenario is defined as the set of all randomized stationary policies that can be implemented in one renewal frame. Thus, our formulation includes coupled ergodic MDPs. We refer to [Alt99a], [Ber01], and [Ros02] for more details on MDP theory and related topics.

As a side remark, this multi-MDP problem can be viewed as a single MDP on an enlarged state space. Constrained MDPs are discussed previously in [Alt99a]. One can show that under the previous ergodic assumption, the minimum of (1.16)-(1.17) is achieved by a randomized stationary policy, and furthermore, such a policy can be obtained by solving offline a linear program reformulated from (1.16)-(1.17). However, formulating such an LP requires knowledge of all the parameters in the problem, including the statistics of the external process {d[t]}_{t=0}^{∞}, and the resulting LP is often computationally intractable when the number of MDPs is very large.

Compared to (1.1)-(1.2), this problem is much more challenging because these N systems are weakly coupled by the time average constraints (1.17), yet each of them operates over its own renewal frames. The renewals of different systems do not have to be synchronized, and they do not have to occur at the same rate (e.g. see Fig. 1.2). Our goal is to develop an algorithm that does not need knowledge of d_l = E(d_l[t]) and has a provable performance guarantee.

Note that due to the asynchronicity, the DPP ratio algorithm (Algorithm 1) does not apply. More specifically, in order to cope with the time average constraints, Algorithm 1 introduces virtual queues to penalize constraint violations.
These virtual queues are then updated frame-wise, and the analysis is also on the per-frame scale of that particular system. For parallel renewal systems, however, it is not clear on what scale the virtual queues should be updated.

Naturally, one would think of introducing a virtual queue for each constraint and updating the queue whenever at least one of the systems starts a new renewal frame. However, this means that for those systems which have yet to reach a renewal, we are updating algorithm parameters in the middle of their renewals. This creates grave difficulties in piecing together the analyses of the individual systems. On the other hand, since time is slotted, one could also think of “giving up” the notion of renewals, synchronizing all systems on the slot scale, and designing a slot-based algorithm. However, this does not make the problem any simpler, since by doing so the algorithm can still update algorithm parameters in the middle of renewals.

Prior approaches treat this challenge only in special cases. The works [Nee12a] and [Nee12b] consider a special case where all quantities introduced above are deterministic functions of the actions. The work in [Nee11] develops a two-stage algorithm for stochastic multi-renewal systems, but the first stage must be solved offline.

On the other hand, consider the special case where the system is a set of coupled Markov decision processes (MDPs). Classical methods for MDPs, such as dynamic programming and linear programming [Ber95][Put14][Ros02], can be used to solve this problem. However, this can be impractical for two reasons: First, the state space has a dimension that depends on the number of renewal systems, making solutions difficult when the number of renewal systems is large. Second, some statistics of the system, such as the mean of the d[t] process governing the resource constraints, can be unknown.

The problem considered in this thesis is a generalization of optimization over a single renewal system.
It is shown in [Nee13b] that for a single renewal system with a finite action set, the problem can be solved (offline) via a linear fractional program. Methods for solving linear fractional programs can be found in [BV04] and [Sch83]. The drift-plus-penalty ratio approach is also developed in [Nee10b] and [Nee13a] for the single renewal system.

Note that there are also many other algorithms which consider "asynchronous optimization" in a different sense compared to ours. More specifically, the works [BT97][BGPS06][SN11][PXYY16] consider the scenario where the asynchronicity shown in Fig. 1.2 results from uncontrollable delays due to environmental uncertainties. These delays have fixed distributions independent of the actions, or are even deterministic. Thus, the delays do not appear in the optimization objectives.

On the other hand, our problem is also related to multi-server scheduling, as is shown in one of the example applications. When assuming proper statistics of the arrivals and/or services, energy optimization problems in multi-server systems can also be treated via queueing theory. Specifically, by assuming both arrivals and services are Poisson distributed, [GDHBSW13] treats the multi-server system as an M/M/k/setup queue and explicitly computes several performance metrics via the renewal reward theorem. By assuming arrivals are Poisson and only one server, [LN14] and [Yao02] treat the system as a multi-class M/G/1 queue and optimize the average energy consumption via polymatroid optimization.
The rest of the thesis is organized as follows:•
Chapter 2: (published in [WN18]) We develop a new algorithm for general asynchronous renewal optimization, where each system operates on its own renewal frame. It is fully analyzed with convergence as well as convergence time results. As a first technical contribution, we fully characterize the fundamental performance region of the problem (5.1)-(1.17). We then construct a supermartingale along with a stopping time to "synchronize" all systems on a slot basis, by which we can piece together the analysis of each individual system to prove the convergence of the proposed algorithm. Furthermore, encapsulating this new idea into convex analysis tools, we prove the O(1/ε) convergence time of the proposed algorithm to reach O(ε) near optimality under a mild assumption on the existence of a Lagrange multiplier. Specifically, we show that for any accuracy ε > 0 and any T ≥ 1/ε, the sequences {y^n[t]} and {z^n[t]} produced by our algorithm satisfy

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(y^n[t]) ≤ f* + O(ε),
  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(z_l^n[t]) ≤ d̄_l + O(ε),  l ∈ {1, 2, ..., L},

where f* denotes the optimal objective value of (5.1)-(1.17). Simulation experiments on the aforementioned multi-server energy-aware scheduling problem also demonstrate the effectiveness of the proposed algorithm.

• Chapter 3, Data center server provisioning: (published in [WN17]) We consider a cost minimization problem for data centers with N servers and randomly arriving service requests. A central router decides which server to use for each new request. We formulate this problem as an asynchronous renewal optimization and develop a distributed control algorithm so that each server makes its own decisions, the request queues are bounded, and the overall time average cost is near optimal with probability 1. The algorithm does not need probability information for the arrival rate or job sizes.
Next, an improved algorithm that uses a single queue is developed via a "virtualization" technique, which is shown to provide the same (near optimal) costs. Simulation experiments on a real data center traffic trace demonstrate the efficiency of our algorithm compared to other existing algorithms.
•
Chapter 4, Multi-user file downloading: (published in [WN15]) We treat power-aware throughput maximization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes for each user act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user. The algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy. In a special case with no power constraint, the multi-user heuristic is shown to be throughput optimal. Simulations are used to demonstrate the effectiveness of the heuristic in the general case. For simple cases where the optimal solution can be computed offline, the heuristic is shown to be near-optimal for a wide range of parameters.
•
Chapter 5, Opportunistic scheduling over renewal systems: (published in [WN19]) In this chapter, we consider an opportunistic scheduling problem over a single renewal system. Different from previous chapters, we consider the scenario where at the beginning of each renewal frame, the controller observes a random event and then chooses an action in response to the event, which affects the duration of the frame, the amount of resources used, and a penalty metric. The goal is to make frame-wise decisions so as to minimize the time average penalty subject to time average resource constraints. This problem has applications to task processing and communication in data networks, as well as to certain classes of Markov decision problems. We formulate the problem as a dynamic fractional program and propose an adaptive algorithm which uses an empirical accumulation as a feedback parameter. A key feature of the proposed algorithm is that it does not require knowledge of the random event statistics and potentially allows (uncountably) infinite event sets. We prove the algorithm satisfies all desired constraints and achieves O(ε) near optimality with probability 1.

• Chapter 6, Online learning in weakly coupled Markov decision processes: (published in [WYN18]) In this chapter, we consider a special case of multiple parallel renewal systems, namely, parallel Markov decision processes coupled by global constraints, where the time varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in T slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information.
While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have tight O(√T) regret and constraint violations simultaneously. To obtain such a bound, we combine several new ingredients, including an ergodicity and mixing time bound for weakly coupled MDPs, a new regret analysis for online constrained optimization, a drift analysis for queue processes, and a perturbation analysis based on Farkas' Lemma.

Chapter 2
Asynchronous Optimization over Weakly Coupled Renewal Systems

In this chapter, we present our asynchronous algorithm along with the new analysis. Along the way, we try to provide some intuitions and high level ideas of the analysis.

Consider N renewal systems that operate over a slotted timeline (t ∈ {0, 1, 2, ...}). The timeline for each system n ∈ {1, ..., N} is segmented into back-to-back intervals, which are renewal frames. The duration of each renewal frame is a random positive integer with distribution that depends on a control action chosen by the system at the start of the frame. The decision at each renewal frame also determines the penalty and a vector of performance metrics during this frame. The systems are coupled by time average constraints placed on these metrics over all systems. The goal is to design a decision strategy for each system so that the overall time average penalty is minimized subject to the time average constraints.

Recall that we use k = 0, 1, 2, ... to index the renewals. Let t_k^n be the time slot corresponding to the k-th renewal of the n-th system, with the convention that t_0^n = 0. Let 𝒯_k^n be the set of all slots from t_k^n to t_{k+1}^n − 1. At time t_k^n, the n-th system chooses a possibly random decision α_k^n in a set A^n. This action determines the distributions of the following random variables:

• The duration of the k-th renewal frame T_k^n := t_{k+1}^n − t_k^n, which is a positive integer.
• A vector of performance metrics at each slot of that frame z^n[t] := (z_1^n[t], z_2^n[t], ..., z_L^n[t]), t ∈ 𝒯_k^n.
• A penalty incurred at each slot of the frame y^n[t], t ∈ 𝒯_k^n.

We assume each system has the renewal property as defined in Definition 1.1.1: given α_k^n = α^n ∈ A^n, the random variables T_k^n, z^n[t] and y^n[t], t ∈ 𝒯_k^n, are independent of the information of all systems from the slots before t_k^n, with the following known conditional expectations: E(T_k^n | α_k^n = α^n), E(Σ_{t∈𝒯_k^n} y^n[t] | α_k^n = α^n) and E(Σ_{t∈𝒯_k^n} z^n[t] | α_k^n = α^n). Throughout the chapter, we make the following basic assumptions.
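Before stating the assumptions, here is a minimal simulation sketch of one such renewal system; the two actions and all numbers are hypothetical, invented purely for illustration. Because the per-slot penalty and metric are held constant within a frame in this toy example, the ratios of expected frame totals to expected frame length (the quantities formalized in Definition 2.1.1 below) are exact regardless of the random frame lengths:

```python
import random

# A toy single renewal system (hypothetical numbers, not from the thesis).
# Action 0: one-slot frames, high penalty, low resource use.
# Action 1: random frame length in {1, 2, 3}, low penalty, high resource use.
def run_frame(action, rng):
    """Simulate one renewal frame; returns (T, sum of y[t], sum of z[t])."""
    if action == 0:
        T, y_slot, z_slot = 1, 3.0, 1.0
    else:
        T, y_slot, z_slot = 1 + rng.randrange(3), 1.0, 2.0
    return T, y_slot * T, z_slot * T

def performance_vector(action, num_frames=5000, seed=0):
    """Estimate (f_hat, g_hat) = (E[frame y] / E[T], E[frame z] / E[T])
    by averaging over many i.i.d. frames (cf. Definition 2.1.1)."""
    rng = random.Random(seed)
    tot_T = tot_y = tot_z = 0.0
    for _ in range(num_frames):
        T, y, z = run_frame(action, rng)
        tot_T += T
        tot_y += y
        tot_z += z
    return tot_y / tot_T, tot_z / tot_T

f0, g0 = performance_vector(0)   # per-slot (penalty, resource) of action 0
f1, g1 = performance_vector(1)   # per-slot (penalty, resource) of action 1
```

In this sketch, action 0 yields the performance vector (3, 1) and action 1 yields (1, 2); a controller must trade the lower penalty of action 1 against its higher resource rate, which is exactly the tension the coupled constraints create.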
Assumption 2.1.1.
The problem (5.1)-(1.17) is feasible, i.e. there are action sequences {α_k^n}_{k=0}^∞ for all n ∈ {1, 2, ..., N} so that the corresponding process {z^n[t]}_{t=0}^∞ satisfies the constraints (1.17). Following this assumption, we define f* as the infimum objective value for (5.1)-(1.17) over all decision sequences that satisfy the constraints.

Assumption 2.1.2 (Boundedness). For any k ∈ N and any n ∈ {1, 2, ..., N}, there exist absolute constants y_max, z_max and d_max such that

  |y^n[t]| ≤ y_max,  |z_l^n[t]| ≤ z_max,  |d_l[t]| ≤ d_max,  ∀t ∈ 𝒯_k^n, ∀l ∈ {1, 2, ..., L}.

Furthermore, there exists an absolute constant B ≥ 1 such that for every fixed α^n ∈ A^n and every s ∈ N for which Pr(T_k^n ≥ s | α_k^n = α^n) > 0,

  E( (T_k^n − s)² | α_k^n = α^n, T_k^n ≥ s ) ≤ B.  (2.1)

Remark 2.1.1.
The quantity T_k^n − s is usually referred to as the residual lifetime. In the special case where s = 0, (2.1) gives the uniform second moment bound of the renewal frames:

  E( (T_k^n)² | α_k^n = α^n ) ≤ B.

Note that (2.1) is satisfied for a large class of problems. In particular, it can be shown to hold in the following three cases:
1. If the inter-renewal time T_k^n is deterministically bounded.
2. If the inter-renewal time T_k^n is geometrically distributed.
3. If each system is a finite state ergodic MDP with a finite action set.

Definition 2.1.1.
For any α^n ∈ A^n, let

  ŷ^n(α^n) := E( Σ_{t∈𝒯_k^n} y^n[t] | α_k^n = α^n ),  ẑ_l^n(α^n) := E( Σ_{t∈𝒯_k^n} z_l^n[t] | α_k^n = α^n ),

and T̂^n(α^n) := E(T_k^n | α_k^n = α^n). Define

  f̂^n(α^n) := ŷ^n(α^n) / T̂^n(α^n),  ĝ_l^n(α^n) := ẑ_l^n(α^n) / T̂^n(α^n),  ∀l ∈ {1, 2, ..., L},

and let (f̂^n(α^n), ĝ^n(α^n)) be the performance vector under the action α^n.

Note that by Assumption 2.1.2, ŷ^n(α^n) and ẑ^n(α^n) in Definition 2.1.1 are both bounded, and T_k^n ≥ 1, ∀k ∈ N; thus, the set {(f̂^n(α^n), ĝ^n(α^n)) : α^n ∈ A^n} is also bounded. The following mild assumption states that this set is also closed.

Assumption 2.1.3.
The set {(f̂^n(α^n), ĝ^n(α^n)) : α^n ∈ A^n} is compact.

The motivation for this assumption is to guarantee that there always exists at least one solution to each subproblem in our algorithm. Finally, we define the performance region of each individual system as follows.
Definition 2.1.2.
Let S^n be the convex hull of {(ŷ^n(α^n), ẑ^n(α^n), T̂^n(α^n)) : α^n ∈ A^n} ⊆ R^{L+2}. Define

  P^n := { (y/T, z/T) : (y, z, T) ∈ S^n } ⊆ R^{L+1}

as the performance region of system n.

In this section, we propose an algorithm where each system can make its own decision after observing a global vector of multipliers which is updated using the global information from all systems. We start by defining a vector of virtual queues Q[t] := (Q_1[t], Q_2[t], ..., Q_L[t]), which are 0 at t = 0 and updated as follows:

  Q_l[t+1] = max( Q_l[t] + Σ_{n=1}^N z_l^n[t] − d_l[t], 0 ),  l ∈ {1, 2, ..., L}.  (2.2)

These virtual queues will serve as global multipliers to control the growth of the corresponding resource consumptions. The proposed algorithm is then presented in Algorithm 2.

Algorithm 2.
Fix a trade-off parameter
V > 0:

• At the beginning of the k-th frame of system n, the system observes the vector of virtual queues Q[t_k^n] and makes a decision α_k^n ∈ A^n so as to solve the following subproblem:

  D_k^n := min_{α^n ∈ A^n} E( Σ_{t∈𝒯_k^n} ( V y^n[t] + ⟨Q[t_k^n], z^n[t]⟩ ) | α_k^n = α^n, Q[t_k^n] ) / E( T_k^n | α_k^n = α^n, Q[t_k^n] ).  (2.3)

• Update the virtual queue after each slot:

  Q_l[t+1] = max( Q_l[t] + Σ_{n=1}^N z_l^n[t] − d_l[t], 0 ),  l ∈ {1, 2, ..., L}.  (2.4)

Note that using the notation specified in Definition 2.1.1, we can rewrite (2.3) more concisely as

  min_{α^n ∈ A^n} { V f̂^n(α^n) + ⟨Q[t_k^n], ĝ^n(α^n)⟩ },  (2.5)

which is a deterministic optimization problem. Then, by the compactness assumption (Assumption 2.1.3), there always exists a solution to this subproblem.

Remark 2.2.1.
We would like to compare this algorithm to the DPP ratio algorithm (Algorithm 1). For each renewal system, both algorithms update the decision variable frame-wise based on the virtual queue value at the beginning of each frame. The major difference is that the proposed algorithm updates the virtual queues slot-wise while Algorithm 1 updates the virtual queues per frame. Such a seemingly small change, somewhat surprisingly, requires significant generalizations of the analysis of Algorithm 1.
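To make the structure concrete, the following minimal sketch simulates Algorithm 2 on two toy systems with one coupling constraint. All numbers here (the value of V, the action tuples, the per-slot budget d) are hypothetical choices for illustration, not from the thesis. Each system solves the ratio subproblem (2.5) only at its own renewal instants (frame lengths are deterministic here, so the performance vector of an action is just its per-slot pair), while the single virtual queue is updated every slot as in (2.4):

```python
# Hypothetical parameters (not from the thesis): one constraint (L = 1),
# per-slot budget d, trade-off parameter V.
V = 10.0
d = 2.0
T_HORIZON = 10000

# Per-system action sets: tuples of (frame length, per-slot penalty y,
# per-slot resource use z). Frame lengths are deterministic, so the
# performance vector (f_hat, g_hat) of an action is its per-slot (y, z).
ACTION_SETS = [
    [(1, 2.0, 0.5), (2, 0.0, 1.5)],   # system 0
    [(1, 2.0, 0.5), (3, 0.0, 1.5)],   # system 1: longer "cheap" frames
]

Q = 0.0                                # virtual queue, updated every slot
remaining = [0, 0]                     # slots left in each system's frame
cur = [(0.0, 0.0), (0.0, 0.0)]         # current per-slot (y, z) per system
tot_y = tot_z = max_Q = 0.0

for t in range(T_HORIZON):
    for n, actions in enumerate(ACTION_SETS):
        if remaining[n] == 0:
            # Renewal for system n: solve subproblem (2.5),
            # min_a V*f_hat(a) + Q*g_hat(a), by direct enumeration.
            a = min(actions, key=lambda act: V * act[1] + Q * act[2])
            remaining[n] = a[0]
            cur[n] = (a[1], a[2])
    y_slot = sum(c[0] for c in cur)
    z_slot = sum(c[1] for c in cur)
    tot_y += y_slot
    tot_z += z_slot
    Q = max(Q + z_slot - d, 0.0)       # slot-wise queue update, as in (2.4)
    max_Q = max(max_Q, Q)
    remaining = [r - 1 for r in remaining]

avg_y = tot_y / T_HORIZON
avg_z = tot_z / T_HORIZON
```

Because Q[t+1] = max(Q[t] + z − d, 0) ≥ Q[t] + z − d, the final queue value upper-bounds the accumulated constraint violation Σ_t (z[t] − d) (the telescoping step used in Lemma 2.2.1 below), so observing a bounded queue in this run certifies near-satisfaction of the time average constraint, even though the two systems renew at unsynchronized instants.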
This algorithm requires knowledge of the conditional expectations associated with the performance vectors (f̂^n(α^n), ĝ^n(α^n)), α^n ∈ A^n, but only requires each individual system n to know its own (f̂^n(α^n), ĝ^n(α^n)), α^n ∈ A^n, and therefore decouples these systems. Furthermore, the virtual queue update uses the observed d_l[t] and does not require knowledge of the distribution or mean of d_l[t].

In addition, we introduce Q[t] as "virtual queues" for the following two reasons: First, they can be mapped to real queues in applications (such as the server scheduling problem mentioned in Section 1.3.1), where d[t] stands for the arrival process and z[t] is the service process. Second, stabilizing these virtual queues implies the constraints (1.17) are satisfied, as is illustrated in the following lemma.

Lemma 2.2.1. If Q_l[0] = 0 and lim_{T→∞} (1/T) E(Q_l[T]) = 0, then

  lim sup_{T→∞} (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(z_l^n[t]) ≤ d̄_l.

Proof of Lemma 2.2.1. Fix l ∈ {1, 2, ..., L}. For any fixed T, Q_l[T] = Σ_{t=0}^{T−1} (Q_l[t+1] − Q_l[t]). For each summand, by the queue updating rule (2.4),

  Q_l[t+1] − Q_l[t] = max( Q_l[t] + Σ_{n=1}^N z_l^n[t] − d_l[t], 0 ) − Q_l[t] ≥ Σ_{n=1}^N z_l^n[t] − d_l[t].

Thus, by the assumption Q_l[0] = 0,

  Q_l[T] ≥ Σ_{t=0}^{T−1} ( Σ_{n=1}^N z_l^n[t] − d_l[t] ).

Taking expectations of both sides with E(d_l[t]) = d̄_l, ∀l, gives

  E(Q_l[T]) ≥ Σ_{t=0}^{T−1} ( Σ_{n=1}^N E(z_l^n[t]) − d̄_l ).

Dividing both sides by T and passing to the limit gives

  lim sup_{T→∞} (1/T) Σ_{t=0}^{T−1} ( Σ_{n=1}^N E(z_l^n[t]) − d̄_l ) ≤ lim_{T→∞} (1/T) E(Q_l[T]) = 0,

finishing the proof.

2.2.2 Computing subproblems

Since a key step in the algorithm is to solve the optimization problem (2.5), we make several comments on the computation of the ratio minimization (2.5).
In general, one can solve the ratio optimization problem (2.3) (and therefore (2.5)) via a bisection search algorithm. For more details, see Section 7 of [Nee10b]. However, more often than not, bisection search is not the most efficient method. We will discuss two special cases arising from applications where we can find a simpler way of solving the subproblem.

First, when there are only a finite number of actions in the set A^n, one can solve (2.5) simply via enumeration. This is a typical scenario in energy-aware scheduling, where a finite action set consists of different processing modes that can be chosen by servers.

Second, when the set {(ŷ^n(α^n), ẑ^n(α^n), T̂^n(α^n)) : α^n ∈ A^n} specified in Definition 2.1.2 is itself the convex hull of a finite sequence {(y_j, z_j, T_j)}_{j=1}^m, then (2.5) can be rewritten as a simple enumeration:

  min_{i ∈ {1, 2, ..., m}} { V y_i/T_i + ⟨Q[t_k^n], z_i/T_i⟩ }.

To see this, note that by the definition of convex hull, for any α^n ∈ A^n, (ŷ^n(α^n), ẑ^n(α^n), T̂^n(α^n)) = Σ_{j=1}^m p_j (y_j, z_j, T_j) for some {p_j}_{j=1}^m with p_j ≥ 0 and Σ_{j=1}^m p_j = 1. Thus,

  V f̂^n(α^n) + ⟨Q[t_k^n], ĝ^n(α^n)⟩ = V (Σ_{j=1}^m p_j y_j)/(Σ_{j=1}^m p_j T_j) + ⟨Q[t_k^n], (Σ_{j=1}^m p_j z_j)/(Σ_{j=1}^m p_j T_j)⟩
    = Σ_{i=1}^m ( p_i T_i / Σ_{j=1}^m p_j T_j ) ( V y_i/T_i + ⟨Q[t_k^n], z_i/T_i⟩ ) =: Σ_{i=1}^m q_i ( V y_i/T_i + ⟨Q[t_k^n], z_i/T_i⟩ ),

where we let q_i = p_i T_i / Σ_{j=1}^m p_j T_j. Note that q_i ≥ 0 and Σ_{i=1}^m q_i = 1 because T_i ≥ 1. Hence, solving (2.5) is equivalent to choosing {q_i}_{i=1}^m to minimize the above expression, which boils down to choosing a single (y_i, z_i, T_i) among {(y_j, z_j, T_j)}_{j=1}^m which achieves the minimum.

Note that such a convex hull case stands out not only because it yields a simple solution, but also because ergodic coupled MDPs discussed in Section 1.3.2 have the region {(ŷ^n(α^n), ẑ^n(α^n), T̂^n(α^n)) : α^n ∈ A^n} being the convex hull of a finite sequence of points {(y_j, z_j, T_j)}_{j=1}^m, where each point (y_j, z_j, T_j) results from a pure stationary policy ([Alt99a]). Thus, solving (2.5) for ergodic coupled MDPs reduces to choosing a pure policy among a finite number of pure policies.
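The two computational routes above can be sketched in a few lines: direct enumeration over a finite list of frame-total tuples (y, z, T), and, for comparison, a generic Dinkelbach-style bisection on the optimal ratio, which is one standard way to implement the bisection search mentioned above. Function names and the bracketing bounds are our own illustrative choices:

```python
def ratio_min_enumerate(actions, Q, V):
    """Solve (2.5) by enumeration. `actions` is a list of frame-total
    tuples (y_hat, z_hat, T_hat); returns (best index, best ratio)."""
    best_i, best_val = None, float("inf")
    for i, (y, z, T) in enumerate(actions):
        val = (V * y + sum(q * zl for q, zl in zip(Q, z))) / T
        if val < best_val:
            best_i, best_val = i, val
    return best_i, best_val

def ratio_min_bisection(actions, Q, V, tol=1e-9):
    """Dinkelbach-style bisection sketch: the optimal ratio theta* is the
    root of g(theta) = min_a (V*y + <Q,z> - theta*T), which is strictly
    decreasing in theta since every T > 0."""
    def g(theta):
        return min(V * y + sum(q * zl for q, zl in zip(Q, z)) - theta * T
                   for (y, z, T) in actions)
    lo, hi = -1e6, 1e6   # crude a-priori bounds on the optimal ratio
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) >= 0:   # theta* still above mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Illustrative instance: three actions, one constraint, Q = (4,), V = 1.
actions_demo = [(2.0, (1.0,), 1), (3.0, (0.5,), 2), (10.0, (0.0,), 4)]
best_i, best_val = ratio_min_enumerate(actions_demo, (4.0,), 1.0)
theta_star = ratio_min_bisection(actions_demo, (4.0,), 1.0)
```

Both routines agree on the minimum ratio; enumeration is exact and O(m) per renewal, while bisection pays roughly log((hi − lo)/tol) evaluations of g but extends to action sets that are only accessible through an oracle for the inner minimization.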
In this section, we provide the performance analysis of Algorithm 2. Let f* be the optimal objective value for problem (5.1)-(1.17). The goal is to show the following bounds, similar to those for Algorithm 1:

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(y^n[t]) ≤ f* + C/V,  E(‖Q[T]‖) ≤ C′ √(V T),

for some constants C, C′ > 0. Then, by Lemma 2.2.1, one readily obtains the constraint satisfaction result.

For the rest of the chapter, the underlying probability space is denoted as the tuple (Ω, F, P). Let F[t] be the system history up until time slot t. Formally, {F[t]}_{t=0}^∞ is a filtration with F[0] = {∅, Ω} and each F[t], t ≥ 1, is the σ-algebra generated by all random variables from slot 0 to t − 1.

In this section, we present several lemmas on the fundamental properties of the optimization problem (5.1)-(1.17). The following lemma demonstrates the convexity of P^n in Definition 2.1.2.

Lemma 2.3.1.
The performance region P^n specified in Definition 2.1.2 is convex for any n ∈ {1, 2, ..., N}. Furthermore, it is the convex hull of the set {(f̂^n(α^n), ĝ^n(α^n)) : α^n ∈ A^n}, and thus compact, where (f̂^n(α^n), ĝ^n(α^n)) is specified in Definition 2.1.1.

(Footnote: a pure stationary policy is an algorithm where the decision taken at any time t is a deterministic function of the state at time t, and independent of all other past information.)

Lemma 2.3.2.
For each n ∈ {1, 2, ..., N}, there exists a pair (f̄^n_*, ḡ^n_*) ∈ P^n such that the following hold:

  Σ_{n=1}^N f̄^n_* = f*,  Σ_{n=1}^N ḡ^n_{l,*} ≤ d̄_l,  l ∈ {1, 2, ..., L},

where f* is the optimal objective value for problem (5.1)-(1.17); i.e., optimality is achievable within ⊗_{n=1}^N P^n, the Cartesian product of the P^n.

Furthermore, for any (f̄^n, ḡ^n) ∈ P^n, n ∈ {1, 2, ..., N}, satisfying Σ_{n=1}^N ḡ^n_l ≤ d̄_l, l ∈ {1, 2, ..., L}, we have Σ_{n=1}^N f̄^n ≥ f*; i.e., one cannot achieve a better performance than (5.1)-(1.17) in ⊗_{n=1}^N P^n.

The proof of this lemma is delayed to Section 2.6. In particular, the proof uses the following lemma, which also plays an important role in several lemmas later.
Lemma 2.3.3.
Suppose {y^n[t]}_{t=0}^∞, {z^n[t]}_{t=0}^∞ and {T_k^n}_{k=0}^∞ are processes resulting from any algorithm; then, ∀T ∈ N,

  (1/T) Σ_{t=0}^{T−1} E( f̄^n[t] − y^n[t] ) ≤ B₁/T,  (2.6)
  (1/T) Σ_{t=0}^{T−1} E( ḡ^n_l[t] − z^n_l[t] ) ≤ B₂/T,  l ∈ {1, 2, ..., L},  (2.7)

where B₁ = 2 y_max √B, B₂ = 2 z_max √B, and f̄^n[t], ḡ^n[t] are constant over each renewal frame of system n, defined by

  f̄^n[t] = f̂^n(α^n), if t ∈ 𝒯_k^n and α_k^n = α^n,
  ḡ^n[t] = ĝ^n(α^n), if t ∈ 𝒯_k^n and α_k^n = α^n,

and (f̂^n(α^n), ĝ^n(α^n)) are defined in Definition 2.1.1. Note that this algorithm might make decisions using past information.
Remark 2.3.1.
Note that directly computing f̄^n_* and ḡ^n_{l,*} as indicated by Lemma 2.3.2 would be difficult because of the fractional nature of P^n, the coupling between different systems through the time average constraints, and the fact that d̄_l = E(d_l[t]) might be unknown. However, Lemma 2.3.2 can be used to prove important performance theorems regarding our proposed algorithm. The following theorem gives the performance bound of our proposed algorithm.
Theorem 2.3.1.
The sequences {y^n[t]}_{t=0}^∞ and {z^n[t]}_{t=0}^∞ produced by the proposed algorithm satisfy all the constraints in (1.17) and achieve O(1/V) near optimality, i.e.,

  lim sup_{T→∞} (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(y^n[t]) ≤ f* + (N C₁ + C₂)/V,

where f* is the optimal objective of (5.1)-(1.17), C₁ = 6 L z_max (N z_max + d_max) B and C₂ := (N z_max + d_max)² L / 2.

Proof of Theorem 2.3.1. Define the drift-plus-penalty (DPP) expression at time slot t as

  P[t] := E( Σ_{n=1}^N V y^n[t] + (1/2)( ‖Q[t+1]‖² − ‖Q[t]‖² ) ).  (2.8)

By the queue updating rule (2.4), we have

  P[t] ≤ E( Σ_{n=1}^N V y^n[t] + (1/2) Σ_{l=1}^L ( Σ_{n=1}^N z_l^n[t] − d_l[t] )² + Σ_{l=1}^L Q_l[t] ( Σ_{n=1}^N z_l^n[t] − d_l[t] ) )
      ≤ (1/2)(N z_max + d_max)² L + E( Σ_{n=1}^N V y^n[t] + Σ_{l=1}^L Q_l[t] ( Σ_{n=1}^N z_l^n[t] − d_l[t] ) )
      = (1/2)(N z_max + d_max)² L + E( Σ_{n=1}^N V y^n[t] + Σ_{l=1}^L Q_l[t] ( Σ_{n=1}^N z_l^n[t] − d̄_l ) ),

where the second inequality follows from the boundedness assumption (Assumption 2.1.2), which gives Σ_{l=1}^L ( Σ_{n=1}^N z_l^n[t] − d_l[t] )² ≤ (N z_max + d_max)² L, and the equality follows from the fact that d_l[t] is i.i.d. and independent of Q_l[t]; thus,

  E(Q_l[t] d_l[t]) = E( Q_l[t] · E(d_l[t] | Q_l[t]) ) = E(Q_l[t] d̄_l).

For simplicity, define C₂ = (1/2)(N z_max + d_max)² L. Now, by the achievability of optimality in ⊗_{n=1}^N P^n (Lemma 2.3.2), we have Σ_{n=1}^N ḡ^n_{l,*} ≤ d̄_l; thus, substituting this inequality into the above bound for P[t] gives

  P[t] ≤ C₂ + E( Σ_{n=1}^N V y^n[t] + Σ_{n=1}^N Σ_{l=1}^L Q_l[t] ( z_l^n[t] − ḡ^n_{l,*} ) )
      = C₂ + Σ_{n=1}^N E( V y^n[t] + ⟨Q[t], z^n[t] − ḡ^n_*⟩ )
      = C₂ + Σ_{n=1}^N E(X^n[t]) + V Σ_{n=1}^N f̄^n_*
      = C₂ + Σ_{n=1}^N E(X^n[t]) + V f*,

where we use the definition of X^n[t] in (2.15), substituting (f̄^n, ḡ^n) with (f̄^n_*, ḡ^n_*), i.e. X^n[t] = V(y^n[t] − f̄^n_*) + ⟨Q[t], z^n[t] − ḡ^n_*⟩, in the second-to-last equality, and the optimality condition (Lemma 2.3.2) in the final equality. Thus, it follows that

  (1/T) Σ_{t=0}^{T−1} P[t] ≤ C₂ + V f* + Σ_{n=1}^N (1/T) Σ_{t=0}^{T−1} E(X^n[t]).

By the virtual queue updating rule (2.4) and the trivial bound Q_l[t] ≤ O(t), we readily get

  Σ_{t=0}^{T−1} E(X^n[t]) ≤ C (T² + V T),

for some constant C >
0. However, this bound is too weak to allow us to prove the convergence result. The key to this proof is to improve the bound to

  Σ_{t=0}^{T−1} E(X^n[t]) ≤ C₁ T + C₁ V,

where C₁ is a constant independent of V and T; this is Lemma 2.3.8. As a consequence, for any T ∈ N,

  (1/T) Σ_{t=0}^{T−1} P[t] ≤ (N C₁ + C₂) + V f* + N C₁ V / T.  (2.9)

On the other hand, by the definition of P[t] in (2.8) and a telescoping sum with Q[0] = 0, we have

  (1/T) Σ_{t=0}^{T−1} P[t] = (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N V E(y^n[t]) + (1/(2T)) E(‖Q[T]‖²).

Combining this with inequality (2.9) gives

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N V E(y^n[t]) + (1/(2T)) E(‖Q[T]‖²) ≤ N C₁ + C₂ + V f* + N C₁ V / T.  (2.10)

Since (1/(2T)) E(‖Q[T]‖²) ≥ 0, we can drop this term and the inequality still holds; dividing by V gives

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(y^n[t]) ≤ f* + (N C₁ + C₂)/V + N C₁ / T.  (2.11)

Taking lim sup_{T→∞} on both sides gives the near optimality claimed in the theorem.

To get the constraint violation bound, we use Assumption 2.1.2, which gives |y^n[t]| ≤ y_max; then, by (2.10) again, we have

  (1/T) E(‖Q[T]‖²) ≤ 2(N C₁ + C₂) + 4 N V y_max + 2 N C₁ V / T.

By Jensen's inequality, E(‖Q[T]‖)² ≤ E(‖Q[T]‖²). This implies

  E(‖Q[T]‖) ≤ √( (2(N C₁ + C₂) + 4 N V y_max) T + 2 N C₁ V ),

which implies

  (1/T) E(‖Q[T]‖) ≤ √( (2(N C₁ + C₂) + 4 N V y_max) / T ) + √(2 N C₁ V) / T.  (2.12)

Sending T → ∞ gives lim_{T→∞} (1/T) E(Q_l[T]) = 0, ∀l ∈ {1, 2, ..., L}.

For any accuracy ε > 0, let V = 1/ε; then, for all T ≥ 1/ε, (2.11) implies

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(y^n[t]) ≤ f* + O(ε).

However, (2.12) suggests a larger convergence time is required for constraint satisfaction. For V = 1/ε, it can be shown that (2.12) implies

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E(z_l^n[t]) ≤ d̄_l + O(ε),

whenever T ≥ 1/ε³. The next section shows a tighter O(1/ε) convergence time under a mild Lagrange multiplier assumption.

The rest of this section is devoted to proving Lemma 2.3.8. In this section and the next, our goal is to show that

  Σ_{t=0}^{T−1} E( V(y^n[t] − f̄^n_*) + Σ_{l=1}^L Q_l[t] ( z_l^n[t] − ḡ^n_{l,*} ) ) ≤ C₁ (V + T).  (2.13)

Learning from the single renewal analysis (equation (1.11)), we have the following key-feature inequality connecting our proposed algorithm with the performance vectors inside P^n.

Lemma 2.3.4.
Consider the stochastic processes {y^n[t]}_{t=0}^∞, {z^n[t]}_{t=0}^∞, and {T_k^n}_{k=0}^∞ resulting from the proposed algorithm. For any system n, the following holds for any k ∈ N and any (f̄^n, ḡ^n) ∈ P^n:

  E( Σ_{t∈𝒯_k^n} ( V y^n[t] + ⟨Q[t_k^n], z^n[t]⟩ ) | Q[t_k^n] ) / E( T_k^n | Q[t_k^n] ) ≤ V f̄^n + ⟨Q[t_k^n], ḡ^n⟩.  (2.14)
Proof of Lemma 2.3.4. First of all, since the proposed algorithm solves (2.3) over all possible decisions in A^n, it must achieve a value less than or equal to that of any action α^n ∈ A^n on the same frame. This gives

  D_k^n ≤ E( Σ_{t∈𝒯_k^n} ( V y^n[t] + ⟨Q[t_k^n], z^n[t]⟩ ) | Q[t_k^n], α_k^n = α^n ) / E( T_k^n | Q[t_k^n], α_k^n = α^n )
       = ( V ŷ^n(α^n) + ⟨Q[t_k^n], ẑ^n(α^n)⟩ ) / T̂^n(α^n),

where D_k^n is defined in (2.3) and the equality follows from the renewal property of the system, namely that T_k^n, Σ_{t∈𝒯_k^n} y^n[t] and Σ_{t∈𝒯_k^n} z^n[t] are conditionally independent of Q[t_k^n] given α_k^n = α^n. Since T_k^n ≥ 1, this implies

  T̂^n(α^n) · D_k^n ≤ V ŷ^n(α^n) + ⟨Q[t_k^n], ẑ^n(α^n)⟩;

thus, for any α^n ∈ A^n,

  V ŷ^n(α^n) + ⟨Q[t_k^n], ẑ^n(α^n)⟩ − D_k^n · T̂^n(α^n) ≥ 0.

Since S^n specified in Definition 2.1.2 is the convex hull of {(ŷ^n(α^n), ẑ^n(α^n), T̂^n(α^n)) : α^n ∈ A^n}, it follows that for any vector (y, z, T) ∈ S^n,

  V y + ⟨Q[t_k^n], z⟩ − D_k^n · T ≥ 0.

Dividing both sides by T and using the definition of P^n in Definition 2.1.2 gives

  D_k^n ≤ V f̄^n + ⟨Q[t_k^n], ḡ^n⟩,  ∀(f̄^n, ḡ^n) ∈ P^n.

Finally, since {y^n[t]}_{t=0}^∞, {z^n[t]}_{t=0}^∞, and {T_k^n}_{k=0}^∞ result from the proposed algorithm and the action chosen is determined by Q[t_k^n] as in (2.3),

  D_k^n = E( Σ_{t∈𝒯_k^n} ( V y^n[t] + ⟨Q[t_k^n], z^n[t]⟩ ) | Q[t_k^n] ) / E( T_k^n | Q[t_k^n] ).

This finishes the proof.

Our next step is to give a frame-based analysis for each system by constructing a supermartingale on the per-frame timescale. We start with the definition of a supermartingale:
Definition 2.3.1 (Supermartingale). Consider a probability space (Ω, F, P) and a filtration {F_i}_{i=0}^∞ on this space with F_0 = {∅, Ω}, F_i ⊆ F_{i+1}, ∀i, and F_i ⊆ F, ∀i. Consider a process {X_i}_{i=0}^∞ ⊆ R adapted to this filtration, i.e. X_i ∈ F_{i+1}, ∀i. Then {X_i}_{i=0}^∞ is a supermartingale if E(|X_i|) < ∞ and E(X_{i+1} | F_{i+1}) ≤ X_i. Furthermore, {X_{i+1} − X_i}_{i=0}^∞ is called a supermartingale difference sequence.

Note that by the definition of a supermartingale, we always have E(X_{i+1} − X_i | F_{i+1}) ≤ 0. Along the way, we also have a standard definition of a stopping time, which will be used later:
Definition 2.3.2 (Stopping time). Given a probability space (Ω, F, P) and a filtration {∅, Ω} = F_0 ⊆ F_1 ⊆ F_2 ⊆ ··· in F, a stopping time τ with respect to the filtration {F_i}_{i=0}^∞ is a random variable such that for any i ∈ N, {τ = i} ∈ F_i, i.e. the event that the stopping time occurs at time i is contained in the information during slots 0, 1, 2, ..., i − 1.

Recall that {F[t]}_{t=0}^∞ is a filtration (with F[t] representing the system history during slots {0, ..., t − 1}). Fix a system n and recall that t_k^n is the time slot where the k-th renewal occurs for system n. We would like to define a filtration corresponding to the random times t_k^n. To this end, define the collection of sets {F_k^n}_{k=0}^∞ such that for each k,

  F_k^n := { A ∈ F : A ∩ {t_k^n ≤ t} ∈ F[t], ∀t ∈ {0, 1, 2, ...} }.

For example, the following set A is an element of F_1^n:

  A = {t_1^n = 5} ∩ {y^n[0] = y_0, y^n[1] = y_1, y^n[2] = y_2, y^n[3] = y_3, y^n[4] = y_4},

where y_0, ..., y_4 are specific values. Then A ∈ F_1^n because for i ∈ {0, 1, 2, 3, 4} we have A ∩ {t_1^n ≤ i} = ∅ ∈ F[i], and for i ∈ {5, 6, 7, ...} we have A ∩ {t_1^n ≤ i} = A ∈ F[i]. The following technical lemma is proved in Section 2.6.

Lemma 2.3.5.
The sequence {F_k^n}_{k=0}^∞ is a valid filtration, i.e. F_k^n ⊆ F_{k+1}^n, ∀k ≥ 0. Furthermore, for any real-valued adapted process {Z^n[t−1]}_{t=1}^∞ with respect to {F[t]}_{t=1}^∞, the process

  { G_{t_k^n}( Z^n[0], Z^n[1], ..., Z^n[t_k^n − 1] ) }_{k=1}^∞

is also adapted to {F_k^n}_{k=1}^∞, where for any t ∈ N, G_t(·) is a fixed real-valued measurable mapping. That is, for any k, the value of any measurable function of (Z^n[0], ..., Z^n[t_k^n − 1]) is determined by events in F_k^n.

With Lemma 2.3.4 and Lemma 2.3.5, we can construct a supermartingale as follows.
Lemma 2.3.6.
Consider the stochastic processes {y^n[t]}_{t=0}^∞, {z^n[t]}_{t=0}^∞, and {T_k^n}_{k=0}^∞ resulting from the proposed algorithm. For any (f̄^n, ḡ^n) ∈ P^n, let

  X^n[t] := V( y^n[t] − f̄^n ) + ⟨Q[t], z^n[t] − ḡ^n⟩;  (2.15)

then,

  E( Σ_{t∈𝒯_k^n} X^n[t] | F_k^n ) ≤ L z_max (N z_max + d_max) B := C₃,

where B, z_max and d_max are as defined in Assumption 2.1.2. Furthermore, define a real-valued process {Y_K^n}_{K=0}^∞ on the frames such that Y_0^n = 0 and

  Y_K^n = Σ_{k=0}^{K−1} ( Σ_{t∈𝒯_k^n} X^n[t] − C₃ ),  K ≥ 1.

Then {Y_K^n}_{K=0}^∞ is a supermartingale adapted to the aforementioned filtration {F_K^n}_{K=0}^∞.
Note that in the above lemma the quantity X^n[t] is the term we aim to bound in (2.13). Having {Y_K^n}_{K=0}^∞ be a supermartingale implies E(Y_K^n) ≤ 0, ∀K. This implies

  E( Σ_{τ=0}^{t_K^n − 1} X^n[τ] ) ≤ C₃ K ≤ C₃ t_K^n,

since t_K^n ≥ K. Thus, this lemma proves (2.13) is true when T is taken to be the end of any renewal frame of system n. Our goal in the next section is to get rid of this restriction and finish the proof via a stopping time argument.

(Recall from Lemma 2.3.5 that {Z^n[t−1]} being adapted to {F[t]} means that for each t in {1, 2, 3, ...}, the random variable Z^n[t−1] is determined by events in F[t].)

Proof of Lemma 2.3.6. Consider any t ∈ 𝒯_k^n; then we can decompose X^n[t] as follows:

  X^n[t] = V( y^n[t] − f̄^n ) + ⟨Q[t_k^n], z^n[t] − ḡ^n⟩ + ⟨Q[t] − Q[t_k^n], z^n[t] − ḡ^n⟩.  (2.16)

By the queue updating rule (2.4), we have for any l ∈ {1, 2, ..., L} and any t > t_k^n,

  | Q_l[t] − Q_l[t_k^n] | ≤ Σ_{s=t_k^n}^{t−1} | Σ_{m=1}^N z_l^m[s] − d_l[s] | ≤ (t − t_k^n)(N z_max + d_max).  (2.17)

Thus, for the last term in (2.16), by Hölder's inequality,

  ⟨Q[t] − Q[t_k^n], z^n[t] − ḡ^n⟩ ≤ ‖Q[t] − Q[t_k^n]‖₁ · ‖z^n[t] − ḡ^n‖_∞ ≤ (t − t_k^n) L (N z_max + d_max) · 2 z_max,

where the second inequality follows from (2.17) and the boundedness assumption (Assumption 2.1.2) on the corresponding quantities. Substituting the above bound into (2.16) gives a bound on E( Σ_{t∈𝒯_k^n} X^n[t] | F_k^n ) as

  E( Σ_{t∈𝒯_k^n} X^n[t] | F_k^n ) ≤ E( Σ_{t∈𝒯_k^n} ( V( y^n[t] − f̄^n ) + ⟨Q[t_k^n], z^n[t] − ḡ^n⟩ ) | F_k^n ) + E( Σ_{t∈𝒯_k^n} (t − t_k^n) | F_k^n ) · 2 L (N z_max + d_max) z_max
    ≤ E( Σ_{t∈𝒯_k^n} ( V( y^n[t] − f̄^n ) + ⟨Q[t_k^n], z^n[t] − ḡ^n⟩ ) | F_k^n ) + E( (T_k^n)² | F_k^n ) · L (N z_max + d_max) z_max,  (2.18)

where we use the fact that 0 + 1 + ··· + (T_k^n − 1) = T_k^n(T_k^n − 1)/2 ≤ (T_k^n)²/2 in the last inequality. Next, by the queue updating rule (2.4), Q_l[t_k^n] is determined by z_l^n[0], ..., z_l^n[t_k^n − 1] (n = 1, 2, ..., N) and d_l[0], ..., d_l[t_k^n −
1] for any l ∈ { , , · · · , L } . Thus, by Lemma 2.3.5, Q [ t nk ] isdetermined by F nk . For the proposed algorithm, each system makes decisions purely based onthe virtual queue state Q [ t nk ], and by the renewal property of each system, given the decision31t the k -th renewal, the random quantities T nk , z n [ t ] and y n [ t ], t ∈ T nk are independent of theoutcomes from the slots before t nk . This implies the following display, E X t ∈T nk (cid:16) V (cid:16) y n [ t ] − f n (cid:17) + h Q [ t nk ] , z n [ t ] − g n i (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F nk = E X t ∈T nk V (cid:16) y n [ t ] − f n (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F nk + * Q [ t nk ] , E X t ∈T nk ( z n [ t ] − g n ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F nk + = E X t ∈T nk V (cid:16) y n [ t ] − f n (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q [ t nk ] + * Q [ t nk ] , E X t ∈T nk ( z n [ t ] − g n ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q [ t nk ] + = E X t ∈T nk (cid:16) V (cid:16) y n [ t ] − f n (cid:17) + h Q [ t nk ] , z n [ t ] − g n i (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q [ t nk ] , (2.19)By Lemma 5.5.4, we have the following: E X t ∈T nk ( V y n [ t ] + h Q [ t nk ] , z n [ t ] i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q [ t nk ] ≤ (cid:16) V f n + h Q [ t nk ] , g n i (cid:17) · E ( T nk | Q [ t nk ]) . Thus, rearranging terms in above inequality gives the expectation on the right hand side of(2.19) is no greater than 0 and hence the first expectation on the right hand side of (2.18) is alsono greater than 0. For the second expectation in (2.18), using (2.1) in Assumption 5.2.1 gives E (cid:0) ( T nk ) (cid:12)(cid:12) F nk (cid:1) ≤ B and the first part of the lemma is proved.For the second part of the lemma, by Lemma 2.3.5 and the definition of Y nK , the process { Y nK } ∞ K =0 is adapted to {F nk } ∞ K =0 . 
Moreover, by Assumption 5.2.1, E (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X t ∈T nk X n [ t ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ E X t ∈T nk | X n [ t ] | < ∞ , ∀ k. Thus, E ( | Y nK | ) < ∞ , ∀ K ∈ N , i.e. it is absolutely integrable. Furthermore, by the first part ofthe lemma, E (cid:0) Y nK +1 | F nk (cid:1) = Y nK + E X t ∈T nK X n [ t ] − C (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F nk ≤ Y nK , finishing the proof. 32 .3.4 Synchronization lemma So far, we have analyzed the processes related to each individual system over its renewalframes. However, due the asynchronous behavior of different systems, the supermartingales ofeach system cannot be immediately summed.In order to prove the result (2.13) and get a global performance bound, we have to get ridof any index related to individual renewal frames only. In other words, we need to look at thesystem property at any time slot T as opposed to any renewal t nk .For any fixed slot T >
0, let S n [ T ] be the number of renewals up to (and including) time slot T , with the convention that the first renewal occurs at time t = 0, so t n = 0 and S n [0] = 1, i.e. t n = 0. The next lemma shows S n [ T ] is a valid stopping time, whose proof is in the appendix. Lemma 2.3.7.
For each n ∈ {1, 2, ..., N}, the random variable S_n[T] is a stopping time with respect to the filtration {F_k^n}_{k=0}^∞, i.e. {S_n[T] = k} ∈ F_k^n, ∀ k ∈ N.

The following theorem tells us that a stopping-time-truncated supermartingale is still a supermartingale.
Theorem 2.3.2 (Theorem 5.2.6 in [Dur13]). If τ is a stopping time and Z[i] is a supermartingale with respect to {F_i}_{i=0}^∞, then Z[i ∧ τ] is also a supermartingale, where a ∧ b := min{a, b}.

With this theorem and the above stopping time construction, we have the following lemma, which finishes the argument proving (2.13):
Lemma 2.3.8.
For each n ∈ {1, 2, ..., N} and any fixed T ∈ N, we have

  (1/T) Σ_{t=0}^{T−1} E( X_n[t] ) ≤ C_2 + C_3 V / T,

where X_n[t] is defined in (2.15) and

  C_2 := 6 L z_max ( N z_max + d_max ) B,  C_3 := 2 y_max √B.

Proof. First, note that the renewal index k starts from 0. Thus, for any fixed T ∈ N, t_{S_n[T]−1}^n ≤ T < t_{S_n[T]}^n, and

  E( Σ_{t=0}^{T−1} X_n[t] ) = E( Σ_{t=0}^{t_{S_n[T]}^n − 1} X_n[t] − Σ_{t=T}^{t_{S_n[T]}^n − 1} X_n[t] )
   = E( Σ_{t=0}^{t_{S_n[T]}^n − 1} X_n[t] ) − E( Σ_{t=T}^{t_{S_n[T]}^n − 1} X_n[t] )
   = E( Y_{S_n[T]}^n ) + C_1 E( S_n[T] ) − E( Σ_{t=T}^{t_{S_n[T]}^n − 1} X_n[t] )
   ≤ E( Y_{S_n[T]}^n ) + C_1 ( T + 1 ) − E( Σ_{t=T}^{t_{S_n[T]}^n − 1} X_n[t] ),  (2.20)

where the third equality follows from the definition of Y_K^n in Lemma 2.3.6 and the last inequality follows from the fact that the number of renewals up to time slot T is no more than the total number of slots, i.e. S_n[T] ≤ T + 1.

For the term E( Y_{S_n[T]}^n ), we apply Theorem 2.3.2 with τ = S_n[T] and index K to obtain that {Y_{K∧S_n[T]}^n}_{K=0}^∞ is a supermartingale. This implies

  E( Y_{K∧S_n[T]}^n ) ≤ E( Y_{0∧S_n[T]}^n ) = E( Y_0^n ) = 0, ∀ K ∈ N.

Since S_n[T] ≤ T + 1, it follows by substituting K = T + 1 that

  E( Y_{S_n[T]}^n ) = E( Y_{(T+1)∧S_n[T]}^n ) ≤ 0.

For the last term in (2.20), by the queue updating rule (5.5), for any l ∈ {1, ..., L},

  | Q_l[t] | ≤ Σ_{s=0}^{t−1} | Σ_{m=1}^N z_ml[s] − d_l[s] | ≤ t ( N z_max + d_max ).

It then follows from Hölder's inequality again that

  | E( Σ_{t=T}^{t_{S_n[T]}^n − 1} X_n[t] ) | = | E( Σ_{t=T}^{t_{S_n[T]}^n − 1} ( V( y_n[t] − f_n ) + ⟨ Q[t], z_n[t] − g_n ⟩ ) ) |
   ≤ E( Σ_{t=T}^{t_{S_n[T]}^n − 1} ( V | y_n[t] − f_n | + ‖ Q[t] ‖_1 · ‖ z_n[t] − g_n ‖_∞ ) )
   ≤ E( Σ_{t=T}^{t_{S_n[T]}^n − 1} ( 2 V y_max + L ( N z_max + d_max ) t · z_max ) )
   ≤ 2 V y_max · E( t_{S_n[T]}^n − T ) + L z_max ( N z_max + d_max ) · ( T · E( t_{S_n[T]}^n − T ) + E( ( t_{S_n[T]}^n − T )² ) )
   ≤ 2 V y_max √B + L z_max ( N z_max + d_max ) √B T + L z_max ( N z_max + d_max ) B
   ≤ 2 V y_max √B + 2 L z_max ( N z_max + d_max ) B ( T + 1 ),

where in the second-to-last inequality we use (2.1) of Assumption 5.2.1: the residual life t_{S_n[T]}^n − T satisfies

  E( ( t_{S_n[T]}^n − T )² ) = E( E( ( t_{S_n[T]}^n − T )² | t_{S_n[T]}^n − t_{S_n[T]−1}^n ≥ T − t_{S_n[T]−1}^n ) ) ≤ B

and E( t_{S_n[T]}^n − T ) ≤ √B, and in the last inequality we use the fact that B ≥ 1, thus √B ≤ B. Substituting the above bound into (2.20) gives

  E( Σ_{t=0}^{T−1} X_n[t] ) ≤ C_1 ( T + 1 ) + 2 V y_max √B + 2 L z_max ( N z_max + d_max ) B ( T + 1 )
   = 2 V y_max √B + 3 L z_max ( N z_max + d_max ) B ( T + 1 )
   ≤ 2 V y_max √B + 6 L z_max ( N z_max + d_max ) B T,

where we use the definition C_1 = L z_max ( N z_max + d_max ) B from Lemma 2.3.6 in the equality and use T + 1 ≤ 2T in the final inequality. Dividing both sides by T finishes the proof.

2.4 Convergence Time Analysis

Consider the following optimization problem:

  min  Σ_{n=1}^N f_n  (2.21)
  s.t.  Σ_{n=1}^N g_nl ≤ d_l, ∀ l ∈ {1, 2, ..., L},  (2.22)
    ( f_n, g_n ) ∈ P_n, ∀ n ∈ {1, 2, ..., N}.  (2.23)

Since each P_n is convex, ⊗_{n=1}^N P_n is also convex. Thus, (2.21)-(2.23) is a convex program. Furthermore, by Lemma 2.3.2, (2.21)-(2.23) is feasible if and only if (5.1)-(1.17) is feasible, and when assuming feasibility, they have the same optimal value f* as specified in Lemma 2.3.2.

Since P_n is convex, one can show (see Proposition 5.1.1 of [Ber09a]) that there always exists a sequence (γ_0, γ_1, ..., γ_L) with γ_i ≥ 0, i = 0, 1, ..., L, such that

  γ_0 Σ_{n=1}^N f_n + Σ_{l=1}^L γ_l Σ_{n=1}^N g_nl ≥ γ_0 f* + Σ_{l=1}^L γ_l d_l, ∀ ( f_n, g_n ) ∈ P_n,

i.e. there always exists a hyperplane parametrized by (γ_0, γ_1, ..., γ_L), supported at (f*, d_1, ..., d_L) and containing the set { ( Σ_{n=1}^N f_n, Σ_{n=1}^N g_n ) : ( f_n, g_n ) ∈ P_n, ∀ n ∈ {1, ..., N} } on one side. This hyperplane is called a "separating hyperplane". The following assumption stems from this property and simply assumes this separating hyperplane to be non-vertical (i.e. γ_0 > 0):

Assumption 2.4.1.
There exist non-negative finite constants γ_1, γ_2, ..., γ_L such that the following holds:

  Σ_{n=1}^N f_n + Σ_{l=1}^L γ_l Σ_{n=1}^N g_nl ≥ f* + Σ_{l=1}^L γ_l d_l, ∀ ( f_n, g_n ) ∈ P_n,

i.e. there exists a separating hyperplane parametrized by (1, γ_1, ..., γ_L).

Remark 2.4.1.
The parameters γ_1, ..., γ_L are called Lagrange multipliers, and this assumption is equivalent to the existence of Lagrange multipliers for the constrained convex program (2.21)-(2.23). It is known that Lagrange multipliers exist if Slater's condition holds ([Ber09a]), which states that the feasible region of the convex program has a nonempty interior. Slater's condition is very common in convex optimization theory and plays an important role in convergence rate analysis, such as the analysis of interior point algorithms ([BV04]). In the current context, this condition is satisfied, for example, in energy-aware server scheduling problems if the highest possible sum of service rates from all servers is strictly higher than the arrival rate.

Lemma 2.4.1.
Suppose {y_n[t]}_{t=0}^∞, {z_n[t]}_{t=0}^∞ and {T_k^n}_{k=0}^∞ are processes resulting from the proposed algorithm. Under Assumption 2.4.1,

  (1/T) Σ_{t=0}^{T−1} ( f* − Σ_{n=1}^N E( y_n[t] ) ) ≤ (1/T) Σ_{t=0}^{T−1} Σ_{l=1}^L γ_l ( Σ_{n=1}^N E( z_nl[t] ) − d_l ) + C_4 / T,

where C_4 := B_1 N + B_2 N Σ_{l=1}^L γ_l, and B_1, B_2 are defined in Lemma 2.3.3.

Proof. First of all, from the statement of Lemma 2.3.3, for the proposed algorithm we can define the corresponding processes ( f_n[t], g_n[t] ) for all n as

  f_n[t] = f̂_n(α_n) = ŷ_n(α_n) / T̂_n(α_n), if t ∈ T_k^n, α_k^n = α_n,
  g_n[t] = ĝ_n(α_n) = ẑ_n(α_n) / T̂_n(α_n), if t ∈ T_k^n, α_k^n = α_n,

where the last equalities follow from the definitions of f̂_n(α_n) and ĝ_n(α_n) in Definition 2.1.1. Since ( ŷ_n(α_n), ẑ_n(α_n), T̂_n(α_n) ) ∈ S_n, by the definition of P_n in Definition 2.1.2, ( f_n[t], g_n[t] ) ∈ P_n, ∀ n, ∀ t. Since P_n is a convex set by Lemma 2.3.1, it follows that

  ( E( f_n[t] ), E( g_n[t] ) ) ∈ P_n, ∀ t, ∀ n.

By Assumption 2.4.1, we have

  Σ_{n=1}^N E( f_n[t] ) + Σ_{l=1}^L γ_l Σ_{n=1}^N E( g_nl[t] ) ≥ f* + Σ_{l=1}^L γ_l d_l, ∀ t,

hence

  f* − Σ_{n=1}^N E( f_n[t] ) ≤ Σ_{l=1}^L γ_l ( Σ_{n=1}^N E( g_nl[t] ) − d_l ), ∀ t.

Taking the time average from 0 to T − 1 gives

  (1/T) Σ_{t=0}^{T−1} ( f* − Σ_{n=1}^N E( f_n[t] ) ) ≤ (1/T) Σ_{t=0}^{T−1} Σ_{l=1}^L γ_l ( Σ_{n=1}^N E( g_nl[t] ) − d_l ).  (2.24)

For the left hand side of (2.24), we have

  l.h.s. = (1/T) Σ_{t=0}^{T−1} ( f* − Σ_{n=1}^N E( y_n[t] ) ) + (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E( y_n[t] − f_n[t] )
   ≥ (1/T) Σ_{t=0}^{T−1} ( f* − Σ_{n=1}^N E( y_n[t] ) ) − B_1 N / T,  (2.25)

where the inequality follows from (2.6) in Lemma 2.3.3. For the right hand side of (2.24), we have

  r.h.s. = (1/T) Σ_{t=0}^{T−1} Σ_{l=1}^L γ_l ( Σ_{n=1}^N E( z_nl[t] ) − d_l ) + (1/T) Σ_{t=0}^{T−1} Σ_{l=1}^L γ_l Σ_{n=1}^N E( g_nl[t] − z_nl[t] )
   ≤ (1/T) Σ_{t=0}^{T−1} Σ_{l=1}^L γ_l ( Σ_{n=1}^N E( z_nl[t] ) − d_l ) + B_2 N Σ_{l=1}^L γ_l / T,  (2.26)

where the inequality follows from the fact that γ_l ≥ 0, ∀ l, and (2.7) in Lemma 2.3.3. Substituting (2.25) and (2.26) into (2.24) finishes the proof.

Theorem 2.4.1.
Fix ε ∈ (0, 1) and define V = 1/ε. If the problem (5.1)-(1.17) is feasible and Assumption 2.4.1 holds, then for all T ≥ 1/ε²,

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E( y_n[t] ) ≤ f* + O(ε),  (2.27)
  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E( z_nl[t] ) ≤ d_l + O(ε), ∀ l ∈ {1, 2, ..., L}.  (2.28)

Thus, the algorithm provides an O(ε) approximation with convergence time O(1/ε²).

Proof. First of all, by the queue updating rule (5.5),

  Σ_{t=0}^{T−1} ( Σ_{n=1}^N E( z_nl[t] ) − d_l ) ≤ E( Q_l[T] ).  (2.29)

By Lemma 2.4.1, we have

  (1/T) Σ_{t=0}^{T−1} ( f* − Σ_{n=1}^N E( y_n[t] ) ) ≤ (1/T) Σ_{t=0}^{T−1} Σ_{l=1}^L γ_l ( Σ_{n=1}^N E( z_nl[t] ) − d_l ) + C_4 / T
   ≤ Σ_{l=1}^L ( γ_l / T ) E( Q_l[T] ) + C_4 / T.  (2.30)

Combining this with (2.10), whose additive constant we denote by C_5, gives

  (1/2T) E( ‖ Q[T] ‖² ) ≤ N C_2 + C_5 + (V/T) Σ_{t=0}^{T−1} ( f* − Σ_{n=1}^N E( y_n[t] ) ) + N C_3 V / T
   ≤ N C_2 + C_5 + ( N C_3 + C_4 ) V / T + V Σ_{l=1}^L ( γ_l / T ) E( Q_l[T] )
   ≤ N C_2 + C_5 + ( N C_3 + C_4 ) V / T + (V/T) ‖ γ ‖ · ‖ E( Q[T] ) ‖,  (2.31)

where γ := (γ_1, ..., γ_L), the second inequality follows from (2.30), and the final inequality follows from Cauchy-Schwarz. Then, by Jensen's inequality, we have ‖ E( Q[T] ) ‖² ≤ E( ‖ Q[T] ‖² ). Thus, it follows from (2.31) that

  ‖ E( Q[T] ) ‖² − 2 V ‖ γ ‖ · ‖ E( Q[T] ) ‖ − 2 ( N C_2 + C_5 ) T − 2 ( N C_3 + C_4 ) V ≤ 0.

The left hand side is a quadratic form in ‖ E( Q[T] ) ‖, and the inequality implies that ‖ E( Q[T] ) ‖ is deterministically upper bounded by the largest root of the equation x² − b x − c = 0 with b = 2 V ‖ γ ‖ and c = 2 ( N C_2 + C_5 ) T + 2 ( N C_3 + C_4 ) V. Thus,

  ‖ E( Q[T] ) ‖ ≤ ( b + √( b² + 4c ) ) / 2 ≤ b + √c ≤ 2 V ‖ γ ‖ + √( 2 ( N C_2 + C_5 ) T ) + √( 2 ( N C_3 + C_4 ) V ).
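The root bound used in the last step (any x ≥ 0 satisfying x² − bx − c ≤ 0 with b, c ≥ 0 obeys x ≤ b + √c, since the largest root (b + √(b² + 4c))/2 is at most b + √c) can be sanity-checked numerically. The values below are arbitrary placeholders, not the constants of the theorem:

```python
import math
import random

def largest_root(b, c):
    # Largest root of x^2 - b*x - c = 0 for b, c >= 0.
    return (b + math.sqrt(b * b + 4.0 * c)) / 2.0

random.seed(0)
for _ in range(1000):
    b = random.uniform(0.0, 10.0)
    c = random.uniform(0.0, 10.0)
    x_star = largest_root(b, c)
    # x_star really is a root, and the proof's simpler bound dominates it.
    assert abs(x_star ** 2 - b * x_star - c) <= 1e-6
    assert x_star <= b + math.sqrt(c) + 1e-12
```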
Thus, for any l ∈ {1, 2, ..., L},

  (1/T) E( Q_l[T] ) ≤ 2 V ‖ γ ‖ / T + √( 2 ( N C_2 + C_5 ) / T ) + √( 2 ( N C_3 + C_4 ) V ) / T.

By (2.29) again,

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E( z_nl[t] ) ≤ d_l + 2 V ‖ γ ‖ / T + √( 2 ( N C_2 + C_5 ) / T ) + √( 2 ( N C_3 + C_4 ) V ) / T.

Substituting V = 1/ε and T ≥ 1/ε² into the above inequality gives, for all l ∈ {1, 2, ..., L},

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E( z_nl[t] ) ≤ d_l + ( 2 ‖ γ ‖ + √( 2 ( N C_2 + C_5 ) ) ) ε + √( 2 ( N C_3 + C_4 ) ) ε^{3/2} = d_l + O(ε).

Finally, substituting V = 1/ε and T ≥ 1/ε² into (2.11) gives

  (1/T) Σ_{t=0}^{T−1} Σ_{n=1}^N E( y_n[t] ) ≤ f* + O(ε),

finishing the proof.

Here, we apply the algorithm introduced in Section 2.2 to the energy-aware scheduling problem described in Section 1.3. To be specific, we consider a scenario with 5 homogeneous servers and 3 different classes of jobs, i.e. N = 5 and L = 3. We assume that each server can only choose one class of jobs to serve during each frame, so the mode set M_n contains three actions {1, 2, 3}, where action i stands for serving the i-th class of jobs, and we count the number of serviced jobs at the end of each service duration.

Table 2.1: Problem parameters

           λ_i   Ĥ_n(i)   μ̂_n(i)                      ê_n(i)   Î_n(i)
  Class 1   2     5.5      15 (Uniform [9, 21] ∩ N)     16       2.5
  Class 2   3     4.6      21 (Uniform [15, 27] ∩ N)    20       4.3
  Class 3   4     3.8      17 (Uniform [11, 23] ∩ N)    13       3.7

The action m_k^n determines the following quantities:
• The uniformly distributed total number of class l jobs that can be served, with expectation E( Σ_{t∈T_k^n} μ_nl[t] | m_k^n ) := μ̂_nl(m_k^n).
• The geometrically distributed service duration H_k^n (in slots) with expectation E( H_k^n | m_k^n ) := Ĥ_n(m_k^n).
• The energy consumption ê_n(m_k^n) for serving all these jobs.
• The geometrically distributed idle/setup time I_k^n (in slots), with constant energy consumption p_n per slot and zero job service, and expectation E( I_k^n | m_k^n ) := Î_n(m_k^n).

The idle/setup cost is p_n = 3 units per slot, and the rest of the parameters are listed in Table 2.1. Following the algorithm description in Section 2.2, the proposed algorithm has the queue updating rule

  Q_l[t+1] = max{ Q_l[t] + λ_l[t] − Σ_{n=1}^N μ_nl[t], 0 },

and each system minimizes (2.3) each frame, which can be written as

  min_{m_k^n ∈ M_n}  [ V ( ê_n(m_k^n) + p_n Î_n(m_k^n) ) − ⟨ Q[t_k^n], μ̂_n(m_k^n) ⟩ ] / [ Ĥ_n(m_k^n) + Î_n(m_k^n) ].

Each plot for the proposed algorithm is the result of running 1 million slots and taking the time average as the performance of the proposed algorithm. The benchmark is the optimal stationary performance obtained by performing a change of variables and solving a linear program, knowing the arrival rates (see also [Nee12b] for details).

Fig. 5.3 shows that as the trade-off parameter V gets larger, the time average energy consumption under the proposed algorithm approaches the optimal energy consumption. Fig. 5.4 shows that as V gets large, the time average number of services also approaches the optimal service rate for each class of jobs. In Fig. 5.5, we plot the time average queue backlog for each class of jobs versus the V parameter. We see that the queue backlog for the first class is always low, whereas the other queue backlogs scale up linearly with V. This is because the service rate for the first class is always strictly larger than the arrival rate, whereas for the other classes, as V gets larger, the service rates approach the arrival rates. This plot, together with Fig. 5.3, also demonstrates that V is indeed a trade-off parameter which trades queue backlog for near-optimality.

Figure 2.1: Time average energy consumption versus the V parameter over 1 million slots.

Proof (of Lemma 2.3.1).
We first prove the convexity of P_n. Consider any two points ( f_1, g_1 ), ( f_2, g_2 ) ∈ P_n. We aim to show that for any q ∈ (0, 1), ( q f_1 + (1−q) f_2, q g_1 + (1−q) g_2 ) ∈ P_n. Notice that by the definition of P_n, there exist ( y_1, z_1, T_1 ), ( y_2, z_2, T_2 ) ∈ S_n such that f_1 = y_1/T_1, g_1 = z_1/T_1, f_2 = y_2/T_2, and g_2 = z_2/T_2. Thus, it is enough to show that

  ( q y_1/T_1 + (1−q) y_2/T_2 , q z_1/T_1 + (1−q) z_2/T_2 ) ∈ P_n.  (2.32)

Figure 2.2: Time average services versus the V parameter over 1 million slots.
Figure 2.3: Time average queue size versus the V parameter over 1 million slots.

To show this, we make a change of variables by letting p = q T_2 / ( q T_2 + (1−q) T_1 ). It is obvious that p ∈ (0, 1), q = p T_1 / ( p T_1 + (1−p) T_2 ), and

  q y_1/T_1 + (1−q) y_2/T_2 = ( p y_1 + (1−p) y_2 ) / ( p T_1 + (1−p) T_2 ),
  q z_1/T_1 + (1−q) z_2/T_2 = ( p z_1 + (1−p) z_2 ) / ( p T_1 + (1−p) T_2 ).

Since S_n is convex,

  ( p y_1 + (1−p) y_2, p z_1 + (1−p) z_2, p T_1 + (1−p) T_2 ) ∈ S_n.

Thus, by the definition of P_n again, (2.32) holds and the first part of the proof is finished.

To show the second part of the claim, let

  Q_n := { ( f̂_n(α_n), ĝ_n(α_n) ) : α_n ∈ A_n } = { ( ŷ_n(α_n)/T̂_n(α_n), ẑ_n(α_n)/T̂_n(α_n) ) : α_n ∈ A_n },

and let conv(Q_n) be the convex hull of Q_n. First of all, by Definition 2.1.2,

  P_n = { ( y/T, z/T ) : ( y, z, T ) ∈ S_n } ⊆ R^{L+1},

for S_n the convex hull of { ( ŷ_n(α_n), ẑ_n(α_n), T̂_n(α_n) ) : α_n ∈ A_n }; thus, in view of the definition of Q_n, we have Q_n ⊆ P_n. Since both P_n and conv(Q_n) are convex, and by the definition of convex hull ([Roc15]) conv(Q_n) is the smallest convex set containing Q_n, we have conv(Q_n) ⊆ P_n.

To show the reverse inclusion P_n ⊆ conv(Q_n), note that any point in P_n can be written in the form ( y/T, z/T ), where ( y, z, T ) ∈ S_n.
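Both change-of-variable steps in this proof rest on the same identity: a convex combination of ratio points equals a ratio of convex combinations under a reweighting of the coefficients. A quick numeric spot-check (arbitrary sample values, not tied to any particular S_n):

```python
import random

def reweight(q, T1, T2):
    # The proof's change of variables: p = q*T2 / (q*T2 + (1-q)*T1).
    return q * T2 / (q * T2 + (1 - q) * T1)

def ratios(q, y1, T1, y2, T2):
    p = reweight(q, T1, T2)
    lhs = q * y1 / T1 + (1 - q) * y2 / T2                    # convex combination of ratios
    rhs = (p * y1 + (1 - p) * y2) / (p * T1 + (1 - p) * T2)  # ratio of convex combinations
    return lhs, rhs

random.seed(1)
for _ in range(100):
    q = random.uniform(0.01, 0.99)
    y1, y2 = random.uniform(-5, 5), random.uniform(-5, 5)
    T1, T2 = random.uniform(0.5, 10), random.uniform(0.5, 10)
    lhs, rhs = ratios(q, y1, T1, y2, T2)
    assert abs(lhs - rhs) < 1e-9
```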
Since S_n by definition is the convex hull of { ( ŷ_n(α_n), ẑ_n(α_n), T̂_n(α_n) ) : α_n ∈ A_n } ⊆ R^{L+2}, by the definition of convex hull, ( y, z, T ) can be written as a convex combination of m points in the above set. Let { ( ŷ_n(α_i^n), ẑ_n(α_i^n), T̂_n(α_i^n) ) }_{i=1}^m be these points, so that

  ( y, z, T ) = Σ_{i=1}^m p_i · ( ŷ_n(α_i^n), ẑ_n(α_i^n), T̂_n(α_i^n) ),  p_i ≥ 0, Σ_{i=1}^m p_i = 1.

As a result, we have

  ( y/T, z/T ) = ( Σ_{i=1}^m p_i ŷ_n(α_i^n) / Σ_{i=1}^m p_i T̂_n(α_i^n) , Σ_{i=1}^m p_i ẑ_n(α_i^n) / Σ_{i=1}^m p_i T̂_n(α_i^n) ).

We make a change of variables by letting q_j = p_j T̂_n(α_j^n) / Σ_{i=1}^m p_i T̂_n(α_i^n), ∀ j = 1, 2, ..., m; then p_j = ( q_j / T̂_n(α_j^n) ) · Σ_{i=1}^m p_i T̂_n(α_i^n), and it follows that

  ( y/T, z/T ) = Σ_{i=1}^m q_i · ( ŷ_n(α_i^n)/T̂_n(α_i^n) , ẑ_n(α_i^n)/T̂_n(α_i^n) ) = Σ_{i=1}^m q_i · ( f̂_n(α_i^n), ĝ_n(α_i^n) ).

Since Σ_{i=1}^m q_i = 1 and q_i ≥ 0, it follows that any point in P_n can be written as a convex combination of a finite number of points in Q_n, which implies P_n ⊆ conv(Q_n). Overall, we have P_n = conv(Q_n).

Finally, by Assumption 2.1.3, Q_n = { ( f̂_n(α_n), ĝ_n(α_n) ) : α_n ∈ A_n } is compact. Thus P_n, being the convex hull of a compact set, is also compact.

Proof (of Lemma 2.3.3).
We prove bound (2.6); (2.7) is proved similarly. By the definition of f̂_n(α_n) in Definition 2.1.1, we have for any α_n ∈ A_n,

  f̂_n(α_n) = E( Σ_{t∈T_k^n} y_n[t] | α_k^n = α_n ) / E( T_k^n | α_k^n = α_n ),

thus

  E( Σ_{t∈T_k^n} ( f̂_n(α_k^n) − y_n[t] ) | α_k^n = α_n ) = 0.

By the renewal property of the system, given α_k^n = α_n, the quantities T_k^n and Σ_{t∈T_k^n} y_n[t] are independent of the past information before t_k^n. Thus, the same equality holds if we condition also on F_k^n, i.e.

  E( Σ_{t∈T_k^n} ( f̂_n(α_k^n) − y_n[t] ) | α_k^n = α_n, F_k^n ) = 0.

Hence,

  E( Σ_{t∈T_k^n} ( f̂_n(α_k^n) − y_n[t] ) | F_k^n ) = 0.

By the definition of f_n[t], this further implies that

  E( Σ_{t∈T_k^n} ( f_n[t] − y_n[t] ) | F_k^n ) = 0.

Since | y_n[t] | ≤ y_max and E( T_k^n ) ≤ √B, it follows that E( | Σ_{t∈T_k^n} ( f_n[t] − y_n[t] ) | ) < ∞, and the process {F_K^n}_{K=0}^∞ defined as

  F_K^n = Σ_{k=0}^{K−1} Σ_{t∈T_k^n} ( f_n[t] − y_n[t] ), K ≥ 1,  F_0^n = 0,

is a martingale.

Consider any fixed T ∈ N and define S_n[T] as the number of renewals up to T. Lemma 2.3.7 shows S_n[T] is a valid stopping time with respect to the filtration {F_k^n}_{k=0}^∞. Furthermore, {F_{K∧S_n[T]}^n}_{K=0}^∞ is a supermartingale by Theorem 2.3.2, where a ∧ b := min{a, b}.

For this fixed T, we have

  E( Σ_{t=0}^{T−1} ( f_n[t] − y_n[t] ) ) = E( Σ_{t=0}^{t_{S_n[T]}^n − 1} ( f_n[t] − y_n[t] ) ) − E( Σ_{t=T}^{t_{S_n[T]}^n − 1} ( f_n[t] − y_n[t] ) )
   = E( F_{S_n[T]}^n ) − E( Σ_{t=T}^{t_{S_n[T]}^n − 1} ( f_n[t] − y_n[t] ) ).

Since the number of renewals is always bounded by the number of slots at any time, i.e. S_n[T] ≤ T + 1, it follows that

  E( F_{S_n[T]}^n ) = E( F_{(T+1)∧S_n[T]}^n ) ≤ 0.

On the other hand,

  | E( Σ_{t=T}^{t_{S_n[T]}^n − 1} ( f_n[t] − y_n[t] ) ) | ≤ E( t_{S_n[T]}^n − T ) · 2 y_max ≤ 2 y_max √B,

where the last inequality follows from Assumption 5.2.1 applied to the residual lifetime. Thus,

  E( Σ_{t=0}^{T−1} ( f_n[t] − y_n[t] ) ) ≤ 2 y_max √B.

Dividing both sides by T finishes the proof.

Proof (of Lemma 2.3.5).
Recall that t_k^n is the time slot where the k-th renewal occurs (k = 0, 1, 2, ...); it follows from the definition of stopping times ([Dur13]) that {t_k^n}_{k=0}^∞ is a sequence of stopping times with respect to {F[t]}_{t=0}^∞ satisfying t_k^n < t_{k+1}^n, ∀ k. Thus, by the definition of F_k^n, for any set A ∈ F_k^n,

  A ∩ {t_{k+1}^n ≤ t} = A ∩ {t_k^n ≤ t} ∩ {t_{k+1}^n ≤ t} ∈ F[t].

Thus, A ∈ F_{k+1}^n, which implies F_k^n ⊆ F_{k+1}^n, ∀ k, and {F_k^n}_{k=0}^∞ is indeed a filtration. This finishes the first part of the proof.

Next, we would like to show that G_{t_k^n}( Z_n[0], ..., Z_n[t_k^n − 1] ) is measurable with respect to F_k^n, ∀ k ≥ 1, i.e.

  { G_{t_k^n}( Z_n[0], ..., Z_n[t_k^n − 1] ) ∈ B } ∈ F_k^n,

for any Borel set B ⊆ R. By the definition of F_k^n, this is equivalent to showing that { G_{t_k^n}( Z_n[0], ..., Z_n[t_k^n − 1] ) ∈ B } ∩ { t_k^n ≤ s } ∈ F[s] for any slot s ≥ 0. For s = 0, this is obvious because { t_k^n ≤ 0 } = ∅, ∀ k ≥ 1. Consider any s ≥ 1:

  { G_{t_k^n}( Z_n[0], ..., Z_n[t_k^n − 1] ) ∈ B } ∩ { t_k^n ≤ s }
   = ∪_{i=1}^s ( { G_i( Z_n[0], ..., Z_n[i − 1] ) ∈ B } ∩ { t_k^n = i } )
   = ∪_{i=1}^s ( { ( Z_n[0], ..., Z_n[i − 1] ) ∈ G_i^{−1}(B) } ∩ { t_k^n = i } ) ∈ F[s], ∀ k ≥ 1,

where the last step follows from the assumption that the random variable Z_n[t − 1] is measurable with respect to F[t] for any t ≥ 1, together with the fact that t_k^n is a stopping time with respect to {F[t]}_{t=0}^∞ for all k ≥ 1. This gives the second part of the claim.

2.6.4 Proof of Lemma 2.3.7
Proof.
We aim to prove { S_n[T] = k } ∈ F_k^n, ∀ k ∈ N. First of all, recall that the renewal index starts from k = 0 and t_0^n = 0; thus, for any k ∈ N,

  { S_n[T] = k } = { t_k^n > T } ∩ { t_{k−1}^n ≤ T },

and for any t ∈ N,

  { S_n[T] = k } ∩ { t_k^n ≤ t } = { t_k^n > T } ∩ { t_{k−1}^n ≤ T } ∩ { t_k^n ≤ t }.  (2.33)

Consider two cases as follows:
1. t ≤ T. In this case, the set (2.33) is empty and obviously belongs to F[t].
2. t > T. In this case, we have { t_k^n > T } ∩ { t_k^n ≤ t } = { T < t_k^n ≤ t } ∈ F[t], as well as { t_{k−1}^n ≤ T } ∈ F[T] ⊆ F[t]. Thus, the set (2.33) belongs to F[t].

Overall, we have { S_n[T] = k } ∩ { t_k^n ≤ t } ∈ F[t], ∀ t ∈ N. Thus, { S_n[T] = k } ∈ F_k^n and S_n[T] is indeed a valid stopping time with respect to the filtration {F_k^n}_{k=0}^∞.

Proof (of Lemma 2.3.2).
To prove the first part of the claim, we define the following notation: N M n =1 P n := ( N X n =1 p n , p n ∈ P n , ∀ n ) is the Minkowski sum of sets P n , n ∈ { , , · · · , N } , and for any sequence { x [ t ] } ∞ t =0 taking valuesin R d , define lim sup T →∞ x [ T ] := (cid:18) lim sup T →∞ x [ T ] , · · · , lim sup T →∞ x d [ T ] (cid:19) is a vector of lim sups. By definition, any vector in ⊕ Nn =1 P n can be constructed from ⊗ Nn =1 P n ,thus, it is enough to show that there exists a vector r ∗ ∈ ⊕ Nn =1 P n such that r ∗ = f ∗ and the restof the entries r ∗ l ≤ d l , l = 1 , , · · · , L .By the feasibility assumption for (5.1)-(1.17), we can consider any algorithm that achievesthe optimality of (5.1)-(1.17) and the corresponding process { ( f n [ t ] , g n [ t ]) } ∞ t =0 defined in Lemma2.3.3 for any system n . Notice that ( f n [ t ] , g n [ t ]) ∈ P n , ∀ n, ∀ t . This follows from the definition48f b f n ( α n ) and b g n ( α n ) in Definition 2.1.1 that f n [ t ] = b f n ( α n ) = b y n ( α n ) / b T n ( α n ) , if t ∈ T nk , α nk = α n g n [ t ] = b g n ( α n ) = b z n ( α n ) / b T n ( α n ) , if t ∈ T nk , α nk = α n , and (cid:16)b y n ( α n ) , b z n ( α n ) , b T n ( α n ) (cid:17) ∈ S n . By definition of P n in Definition 2.1.2, ( f n [ t ] , g n [ t ]) ∈P n , ∀ n, ∀ t .Since P n is convex by Lemma 2.3.1, it follows that ( E ( f n [ t ]) , E ( g n [ t ])) ∈ P n , ∀ n, ∀ t . Hence, T T − X t =1 E ( f n [ t ]) , T T − X t =1 E ( g n [ t ]) ! ∈ P n , ∀ T, ∀ n. This further implies that r ( T ) := T T − X t =1 N X n =1 E ( f n [ t ]) , T T − X t =1 N X n =1 E ( g n [ t ]) ! ∈ N M n =1 P n . By Lemma 2.3.1, P n is compact in R L +1 . Thus, ⊕ Nn =1 P n is also compact. 
This implies thatthe sequence { r ( T ) } ∞ T =1 has at least one limit point, and any such limit point is contained in ⊕ Nn =1 P n .We consider a specific limit point of { r ( T ) } ∞ T =1 denoted as r ∗ ∈ ⊕ Nn =1 P n , with the first entrydenoted as r ∗ satisfying r ∗ = lim sup T →∞ T T − X t =0 N X n =1 E ( f n [ t ]) . Then, we have the rest of the entries of r ∗ must satisfy r ∗ l ≤ lim sup T →∞ T T − X t =0 N X n =1 E ( g n [ t ]) , ∀ l ∈ { , , · · · , L } . Now, by Lemma 2.3.3, we can connect the lim sup with respect to f n [ t ] and g n [ t ] to that of y n [ t ]49nd z n [ t ] as follows:lim sup T →∞ T T − X t =0 N X n =1 E ( y n [ t ])= lim sup T →∞ T T − X t =0 N X n =1 ( E ( y n [ t ] − f n [ t ]) + E ( f n [ t ]))= lim T →∞ T T − X t =0 N X n =1 E ( y n [ t ] − f n [ t ]) + lim sup T →∞ T T − X t =0 N X n =1 E ( f n [ t ])= lim sup T →∞ T T − X t =0 N X n =1 E ( f n [ t ]) . Similarly, we can show thatlim sup T →∞ T T − X t =0 N X n =1 E ( z n [ t ]) = lim sup T →∞ T T − X t =0 N X n =1 E ( g n [ t ]) . Thus, by our preceeding assumption that the algorithm under consideration achieves the opti-mality of (5.1)-(1.17), we have r ∗ = lim sup T →∞ T T − X t =0 N X n =1 E ( y n [ t ]) = f ∗ r ∗ l ≤ lim sup T →∞ T T − X t =0 N X n =1 E ( z nl [ t ]) ≤ d l , ∀ i ∈ { , , · · · , L } . Overall, we have shown that r ∗ ∈ ⊕ Nn =1 P n achieves the optimality of (5.1)-(1.17), and the firstpart of the lemma is proved.To prove the second part of the lemma, we show that any point in ⊗ Nn =1 P n is achievable by thecorresponding time averages of some algorithm. Specifically, consider the following class of ran-domized stationary algorithms : For each system n , at the beginning of k -th frame, the controllerindependently chooses an action α nk from the set A n with a fixed probability distribution.Thus, the actions { α nk } ∞ k =0 result from any randomized stationary algorithm is i.i.d.. 
By therenewal property of each system, we have X t ∈T nk y n [ t ] , X t ∈T nk z n [ t ] , T nk ∞ k =0 , is also an i.i.d. process for each system n . 50ext, we would like to show that any point in S n can be achieved by the correspondingexpectations of some randomized stationary algorithm. Recall that S n defined in Definition2.1.2 is the convex hull of G n := n(cid:16)b y n ( α n ) , b z n ( α n ) , b T n ( α n ) (cid:17) , α n ∈ A n o ⊆ R L +2 , By definition of convex hull, any point ( y, z , T ) ∈ S n , can be written as a convex combination ofa finite number of points from the set G n . Let n(cid:16)b y n ( α ni ) , b z n ( α ni ) , b T n ( α ni ) (cid:17)o mi =1 be these points,then, we have there exists a finite sequence { p i } mi =1 , such that( y, z , T ) = m X i =1 p i · (cid:16)b y n ( α ni ) , b z n ( α ni ) , b T n ( α ni ) (cid:17) ,p i ≥ , m X i =1 p i = 1 . We can then use { p i } mi =1 to construct the following randomized stationary algorithm: At thestart of each frame k , the controller independently chooses action α i ∈ A n with probability p i defined above for i = 1 , , · · · , m . Then, the one-shot expectation of this particular randomizedstationary algorithm on system n satisfies E X t ∈T nk y n [ t ] , E X t ∈T nk z n [ t ] , E ( T nk ) = m X i =1 p i · (cid:16)b y n ( α ni ) , b z n ( α ni ) , b T n ( α ni ) (cid:17) = ( y, z , T ) , which implies any point in S n can be achieved by the corresponding expectations of a randomizedstationary algorithm.Next, by definition of P n in Definition 2.1.2, any ( f n , g n ) ∈ P n can be written as ( f n , g n ) =( y/T, z /T ), where ( y, z , T ) ∈ S n . Thus, it is achievable by the ratio of one-shot expectationsfrom a randomized stationary algorithm, i.e. E (cid:16)P t ∈T nk y n [ t ] (cid:17) E ( T nk ) = yT = f n , E (cid:16)P t ∈T nk z n [ t ] (cid:17) E ( T nk ) = z T = g n . 
It remains to show that, when $y^n[t]$, $\mathbf{z}^n[t]$ and $T_k^n$ result from the randomized stationary algorithm, the time averages coincide with these ratios of expectations:
$$\lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(y^n[t]) = \frac{\mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n}y^n[t]\right)}{\mathbb{E}(T_k^n)},\quad(2.34)$$
$$\lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(\mathbf{z}^n[t]) = \frac{\mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n}\mathbf{z}^n[t]\right)}{\mathbb{E}(T_k^n)}.\quad(2.35)$$
We prove (2.34); (2.35) is shown in a similar way. Consider any fixed $T$, and let $S^n[T]$ be the number of renewals up to (and including) time $T$. Then, from Lemma 2.3.7 in Section 2.3, $S^n[T]$ is a valid stopping time with respect to the filtration $\{\mathcal{F}_k^n\}_{k=0}^{\infty}$. We write
$$\frac1T\sum_{t=0}^{T-1}\mathbb{E}(y^n[t]) = \frac1T\mathbb{E}\left(\sum_{k=0}^{S^n[T]-1}\sum_{t\in\mathcal{T}_k^n}y^n[t]\right) - \frac1T\mathbb{E}\left(\sum_{t=T}^{t^n_{S^n[T]}-1}y^n[t]\right).\quad(2.36)$$
For the first part on the right hand side of (2.36), since $\left\{\sum_{t\in\mathcal{T}_k^n}y^n[t]\right\}_{k=0}^{\infty}$ is an i.i.d. process, by Wald's equality (Theorem 4.1.5 of [Dur13]),
$$\frac1T\mathbb{E}\left(\sum_{k=0}^{S^n[T]-1}\sum_{t\in\mathcal{T}_k^n}y^n[t]\right) = \mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n}y^n[t]\right)\cdot\frac{\mathbb{E}(S^n[T])}{T}.$$
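The identity (2.34) being established here can be sanity-checked numerically: simulate i.i.d. frames and compare the long-run per-slot average of the reward against $\mathbb{E}(\sum_{t\in\mathcal{T}_k}y[t])/\mathbb{E}(T_k)$. The frame-length and reward distributions below are made up purely for illustration:

```python
import random

random.seed(1)

def run(T_total):
    """Simulate i.i.d. frames: frame length uniform in {1,2,3}; each slot of a
    frame earns reward 2.0 if the frame has length 1, else 0.5."""
    t, total = 0, 0.0
    while t < T_total:
        L = random.randint(1, 3)        # frame length T_k
        r = 2.0 if L == 1 else 0.5      # per-slot reward within the frame
        for _ in range(L):
            if t >= T_total:
                break
            total += r
            t += 1
    return total / T_total

# Ratio of one-shot expectations E(sum_{t in frame} y) / E(T_k):
# E(sum y) = (2.0*1 + 0.5*2 + 0.5*3)/3 = 1.5 and E(T_k) = 2.
ratio = 1.5 / 2.0
avg = run(500_000)
print(avg, ratio)
```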
By the renewal reward theorem (Theorem 4.4.2 of [Dur13]),
$$\lim_{T\to\infty}\frac{\mathbb{E}(S^n[T])}{T} = \frac{1}{\mathbb{E}(T_k^n)}.$$
Thus,
$$\lim_{T\to\infty}\frac1T\mathbb{E}\left(\sum_{k=0}^{S^n[T]-1}\sum_{t\in\mathcal{T}_k^n}y^n[t]\right) = \frac{\mathbb{E}\left(\sum_{t\in\mathcal{T}_k^n}y^n[t]\right)}{\mathbb{E}(T_k^n)}.$$
For the second part on the right hand side of (2.36), by Assumption 5.2.1,
$$\left|\mathbb{E}\left(\sum_{t=T}^{t^n_{S^n[T]}-1}y^n[t]\right)\right| \le y_{\max}\cdot\mathbb{E}\left(t^n_{S^n[T]}-T\right) \le \sqrt{B}\,y_{\max},$$
which implies $\lim_{T\to\infty}\frac1T\mathbb{E}\left(\sum_{t=T}^{t^n_{S^n[T]}-1}y^n[t]\right) = 0$. Overall, we have that (2.34) holds.

To this point, we have shown that for any $(f^n,\mathbf{g}^n)\in\mathcal{P}_n$, $n\in\{1,2,\cdots,N\}$, there exists a randomized stationary algorithm so that
$$\lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(y^n[t]) = f^n,~~\lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(\mathbf{z}^n[t]) = \mathbf{g}^n,$$
for any $n\in\{1,2,\cdots,N\}$. Since $f^*$ is the optimal solution to (5.1)-(1.17) over all algorithms, it follows that for any $(f^n,\mathbf{g}^n)\in\mathcal{P}_n$, $n\in\{1,2,\cdots,N\}$ satisfying $\sum_{n=1}^N g_l^n\le d_l$, $\forall l\in\{1,2,\cdots,L\}$, we have $\sum_{n=1}^N f^n\ge f^*$, and the second part of the lemma is proved.

Chapter 3
Data Center Server Provision via Theory of Coupled Renewal Systems

The previous chapter introduces a new algorithm and analysis framework for coupled parallel renewal systems. In this chapter, we show that the previous algorithm can be applied (extended) to solve a data center power minimization problem consisting of a central controller, which makes load balancing decisions per slot, and parallel servers with multiple states, which make decisions per renewal frame. In particular, the analysis in this chapter, which is customized to the data center application, is stronger than that of the previous general algorithm in the sense that we obtain a probability 1 convergence of the algorithm rather than a convergence in expectation.
Consider a data center that consists of a central controller and $N$ servers that serve randomly arriving requests. The system operates in slotted time with time slots $t\in\{0,1,2,\ldots\}$. Each server $n\in\{1,\ldots,N\}$ has three basic states:

• Active: The server is available to serve requests. Server $n$ incurs a cost of $e_n\ge0$ on each active slot.
• Idle: The server is in one of several sleep modes and serves no requests. A per-slot idle cost is incurred that depends on the chosen sleep mode.
• Setup: The server is transitioning from idle back to active and serves no requests. A per-slot setup cost is incurred that depends on the preceding sleep mode.

An active server can choose to transition to the idle state at any time. When it does so, it chooses the specific sleep mode to use and the amount of time to sleep. For example, deeper sleep modes can shut down more electronics and thereby save on per-slot idling costs. However, a deeper sleep incurs a longer setup time when transitioning back to the active state. Each server makes separate decisions about when to transition and what sleep mode to use. The resulting transition times for each server are asynchronous. On top of this, a central controller makes slot-wise decisions for routing requests to servers. It can also reject requests (with a certain amount of cost) if it decides they cannot be supported. The goal is to minimize the overall time average cost.

This problem is challenging mainly for two reasons. First, since each setup state generates cost but serves no requests, it is not clear whether or not transitioning to idle from the active state indeed saves power. It is also not clear which sleep mode the server should switch to. Second, if one server is currently in a setup state, it cannot make another decision until it reaches the active state (which typically takes more than one slot), whereas other active servers can make decisions during this time. Thus, this problem can be viewed as a system of coupled Markov decision processes (MDPs) making decisions asynchronously.

Experimental work on power and delay minimization in data centers is treated in [Gan13], which proposes to turn each server ON and OFF according to the rule of an
M/M/k/setup queue. The work in [UKIN10] applies Lyapunov optimization to optimize power in virtualized data centers. However, it assumes each server has negligible setup time and that ON/OFF decisions are made synchronously at each server. The work [YHS+12] considers load balancing across geographically distributed data centers, and [LWAT13] considers provisioning over a finite time interval and introduces an online 3-approximation algorithm.

Prior works [HS08, MGW09, MSB+11] consider servers with multiple hypothetical sleep states with different levels of power consumption and setup times. Although empirical evaluations in these works show significant power savings from introducing sleep states, they are restricted to the scenario where the setup time from sleep to active is on the order of milliseconds, which is not realistic for today's data centers. Realistic sleep states with setup times on the order of seconds are considered in [GHBK12], where effective heuristic algorithms are proposed and evaluated via extensive testbed simulations. However, little is known about theoretical performance bounds for these algorithms.
At each time slot $t\in\{0,1,2,\ldots\}$, $\lambda(t)$ new requests arrive at the system (see Fig. 3.1). We assume $\lambda(t)$ takes values in a finite set $\Lambda$. Let $R_n(t)$, $n\in\mathcal{N}$, denote the number of requests routed to server $n$ at time $t$. In addition, the system is allowed to reject requests. Let $r(t)$ be the number of requests that are rejected on slot $t$, and let $c(t)$ be the corresponding per-request cost for such rejection. Assume $c(t)$ takes values in a finite state space $\mathcal{C}$. The $R_n(t)$ and $r(t)$ decision variables on slot $t$ must be nonnegative integers that satisfy:
$$\sum_{n=1}^N R_n(t) + r(t) = \lambda(t),$$
$$\sum_{n=1}^N R_n(t) \le R_{\max},$$
for a given integer $R_{\max} > 0$. The vector process $(\lambda(t),c(t))$ takes values in $\Lambda\times\mathcal{C}$ and is assumed to be an independent and identically distributed (i.i.d.) vector over slots $t\in\{0,1,2,\ldots\}$ with an unknown probability mass function.

Each server $n$ maintains a request queue $Q_n(t)$ that stores the requests routed to it. Requests are served in a FIFO manner with queueing dynamics as follows:
$$Q_n(t+1) = \max\{Q_n(t) + R_n(t) - \mu_n(t)H_n(t),~0\},\quad(3.1)$$
where $H_n(t)$ is an indicator variable that is 1 if server $n$ is active on slot $t$, and 0 otherwise, and $\mu_n(t)$ is a random variable that represents the number of requests that can be served on slot $t$. Each queue is initialized to $Q_n(0)=0$. Assume that, on every slot in which server $n$ is active, $\mu_n(t)$ is independent and identically distributed with a known mean $\mu_n$. This randomness can model variation in job sizes.

Figure 3.1: Illustration of a data center structure which contains a front-end load balancer, $N$ application servers with $N$ request queues and a back-end database (omitted here for brevity).

Assumption 3.1.1.
The process $\{(\lambda(t),c(t))\}_{t=0}^{\infty}$ is observable, i.e. the router can observe the $(\lambda(t),c(t))$ realization each time slot $t$ before making decisions. In contrast, the process $\{\mu_n(t)\}_{t=0}^{\infty}$ is not observable, i.e. given that $H_n(t)=1$, server $n$ cannot observe the realization of $\mu_n(t)$ until the end of slot $t$. Moreover, $\lambda(t)$, $c(t)$ and $\mu_n(t)$ are all bounded by $\lambda_{\max}$, $c_{\max}$ and $\mu_{\max}$ respectively.

Each server $n\in\mathcal{N}$ has three types of states: active, idle, and setup (see Fig. 3.2). The idle state of each server $n$ is further decomposed into a collection of distinct sleep modes. Each server $n\in\mathcal{N}$ makes decisions over its own renewal frames. Define the renewal frame for server $n$ as the time period between successive visits to the active state (with each renewal period ending in an active state). Let $T_n[f]$ denote the frame size of the $f$-th renewal frame for server $n$, for $f\in\{0,1,2,\ldots\}$. Let $t_f^n$ denote the start of frame $f$, so that $T_n[f] = t_{f+1}^n - t_f^n$. Assume that $t_0^n = 0$ for all $n\in\mathcal{N}$, so that time slot 0 is the start of the first renewal frame (labeled frame $f=0$) for all servers. For simplicity, assume all servers are "active" on slot $t=-1$. Thus, the slot just before each renewal frame is an active slot.

Figure 3.2: Illustration of a typical renewal frame construction, where $T_n[i]$ is the length of frame $i$ and $t_i^{(n)}$ is the start slot of frame $i$.

Fix a server $n\in\mathcal{N}$ and a frame index $f\in\{0,1,2,\ldots\}$. Time $t_f^n$ marks the start of renewal frame $f$. At this time, server $n$ must decide whether to remain active or to go idle. If it remains active then the renewal frame lasts for one slot, so that $T_n[f]=1$. If it goes idle, it chooses an idle mode from a finite set $\mathcal{L}_n$, representing the set of idle mode options. Let $\alpha_n[f]$ represent this initial decision for server $n$ at the start of frame $f$, so that
$$\alpha_n[f]\in\{\text{active}\}\cup\mathcal{L}_n,$$
where $\alpha_n[f]=\text{active}$ means the server chooses to remain active. If the server chooses to go idle, so that $\alpha_n[f]\in\mathcal{L}_n$, it then chooses a variable $I_n[f]$ that represents how much time it remains idle. The decision variable $I_n[f]$ is chosen as an integer in the set $\{1,\ldots,I_{\max}\}$ for some given integer $I_{\max}>0$. The consequences of these decisions are described below.

• Case $\alpha_n[f]=\text{active}$: The frame starts at time $t_f^n$ and has size $T_n[f]=1$. The active variable becomes $H_n(t_f^n)=1$ and an activation cost of $e_n$ is incurred on this slot $t_f^n$. A random service variable $\mu_n(t_f^n)$ is generated and requests are served according to the queue update (3.1). Recall that, under Assumption 3.1.1, the value of $\mu_n(t)$ is not known until the end of the slot.

• Case $\alpha_n[f]\in\mathcal{L}_n$: In this case, the server chooses to go idle and $\alpha_n[f]$ represents the specific sleep mode chosen. The idle duration $I_n[f]$ is also chosen as an integer in the set $[1,I_{\max}]$. After the idle duration completes, the setup duration starts and has an independent and random duration $\tau_n[f] = \hat{\tau}_n(\alpha_n[f])$, where $\hat{\tau}_n(\alpha_n[f])$ is an integer random variable with a known mean and variance that depends on the sleep mode $\alpha_n[f]$. At the end of the setup time the system goes active and serves with a random $\mu_n(t)$ as before. The active variable is $H_n(t)=0$ for all slots $t$ in the idle and setup times, and is 1 at the very last slot of the frame. Further:
  – Idle cost: On every slot $t$ of the idle time of frame $f$, an idle cost of $g_n(t) = \hat{g}_n(\alpha_n[f])$ is incurred (so that the idle cost depends on the sleep mode). We have $g_n(t)=0$ if server $n$ is not idle on slot $t$. The idle cost can be zero, but can also be a small but positive value if some electronics are still running in the sleep mode chosen.
  – Setup cost: On every slot $t$ of the setup time of frame $f$, a cost of $W_n(t) = \hat{W}_n(\alpha_n[f])$ is incurred. We have $W_n(t)=0$ if server $n$ is not in a setup duration on slot $t$.

Thus, the length of frame $f$ for server $n$ is:
$$T_n[f] = \begin{cases}1, & \text{if }\alpha_n[f]=\text{active};\\ I_n[f]+\tau_n[f]+1, & \text{if }\alpha_n[f]\in\mathcal{L}_n.\end{cases}\quad(3.2)$$
In summary, the costs $\hat{g}_n(\alpha_n)$, $\hat{W}_n(\alpha_n)$ and the setup time $\hat{\tau}_n(\alpha_n)$ are functions of $\alpha_n\in\mathcal{L}_n$.
We further make the following assumption regarding $\hat{\tau}_n(\alpha_n)$:

Assumption 3.1.2. For any $\alpha_n\in\mathcal{L}_n$, the setup time $\hat{\tau}_n(\alpha_n)$ is an integer random variable with known mean and variance, as well as bounded first four moments. Denote $\mathbb{E}(\hat{\tau}_n(\alpha_n)) = m_{\alpha_n}$ and $\text{Var}[\hat{\tau}_n(\alpha_n)] = \sigma_{\alpha_n}^2$.

Note that this is a very mild assumption in view of the fact that the setup time of a real server is always bounded. The motivation behind emphasizing the fourth moment here, instead of simply proceeding with a boundedness assumption, is more of theoretical interest than practical importance.

Table 3.1 summarizes the parameters introduced in this section. The data center architecture is shown in Fig. 3.1. Since different servers might make different decisions, the renewal frames are not necessarily aligned.
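The frame construction above (one-slot frames while active; idle plus setup plus one active slot otherwise, per (3.2)) can be sketched as follows. The sleep-mode setup distributions here are hypothetical placeholders, not values from the thesis:

```python
import random

random.seed(2)

# Hypothetical sleep modes; setup time tau is drawn uniformly from [lo, hi]
# (a bounded integer random variable, consistent with Assumption 3.1.2).
SLEEP_MODES = {"light": (1, 2), "deep": (3, 6)}

def frame_length(alpha, I=None):
    """Frame length per (3.2): 1 slot if active, else idle + setup + 1 active slot."""
    if alpha == "active":
        return 1
    lo, hi = SLEEP_MODES[alpha]
    tau = random.randint(lo, hi)  # setup duration tau_n[f]
    return I + tau + 1

lengths = [frame_length("active"),
           frame_length("light", I=4),
           frame_length("deep", I=2)]
print(lengths)
```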
For each $n\in\mathcal{N}$, let $\overline{C}$, $\overline{W}_n$, $\overline{E}_n$, $\overline{G}_n$ be the time average costs resulting from rejection, setup, service and idle, respectively. They are defined as follows:
$$\overline{C} = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(r(t)c(t)),~~\overline{W}_n = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(W_n(t)),$$
$$\overline{E}_n = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(e_nH_n(t)),~~\overline{G}_n = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(g_n(t)).$$
The goal is to design a joint routing and service policy so that the time average overall cost is minimized and all queues are stable, i.e.
$$\min~\overline{C}+\sum_{n=1}^N\left(\overline{W}_n+\overline{E}_n+\overline{G}_n\right),~~\text{s.t. } Q_n(t)\text{ stable }\forall n.\quad(3.3)$$

Table 3.1: Parameters

Control parameters | Meaning
$R_n(t)$ | Requests routed to server $n$ at slot $t$
$r(t)$ | Requests rejected at slot $t$
$\alpha_n[f]$ | The option (active/idle) server $n$ takes in frame $f$
$I_n[f]$ | Number of slots server $n$ stays idle in frame $f$

Other parameters | Meaning
$\lambda(t)$ | Number of arrivals at time $t$
$c(t)$ | Per-request rejection cost at time $t$
$e_n$ | Per-slot active service cost for server $n$
$T_n[f]$ | The length of frame $f$ for server $n$
$t_f^{(n)}$ | Starting slot of frame $f$ for server $n$
$\tau_n[f]$ | Setup duration in frame $f$
$\mu_n(t)$ | Number of requests served on server $n$ at time $t$
$H_n(t)$ | Server active indicator (equal to 1 if active, 0 if not)
$g_n(t)$ | Idle cost of server $n$ at time $t$
$W_n(t)$ | Setup cost of server $n$ at time $t$

Notice that the constraint in (3.3) is not easy to work with. In order to get an optimization problem one can deal with, we further define the time average request rate, rejection rate, routing rate and service rate as $\overline{\lambda}$, $\overline{r}$, $\overline{R}_n$ and $\overline{\mu}_n$ respectively:
$$\overline{\lambda} = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\lambda(t) = \mathbb{E}(\lambda(t)),~~\overline{r} = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(r(t)),$$
$$\overline{R}_n = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(R_n(t)),~~\overline{\mu}_n = \lim_{T\to\infty}\frac1T\sum_{t=0}^{T-1}\mathbb{E}(\mu_n(t)H_n(t)).$$
Then, rewrite the problem (3.3) as follows:
$$\min~\overline{C}+\sum_{n=1}^N\left(\overline{W}_n+\overline{E}_n+\overline{G}_n\right)\quad(3.4)$$
s.t.
$$\overline{R}_n \le \overline{\mu}_n,~\forall n\in\mathcal{N},\quad(3.5)$$
$$\sum_{n=1}^N R_n(t)\le R_{\max},~~\sum_{n=1}^N R_n(t)+r(t)=\lambda(t),~\forall t.\quad(3.6)$$
Constraint (3.5) requires the time average arrival rate to server $n$ to be less than the time average service rate. We aim to develop an algorithm so that each server can make its own decisions (without looking at the workload or service decisions of any other server) and prove its near optimality.

3.2 Coupled renewal optimization

In this section, we show one can apply the algorithm introduced in the previous section to solve (3.4)-(3.6). But before jumping into details, we would like to discuss some intuition behind solving this problem. As a side remark, this data center work was written and published before the general algorithm introduced in the last section, so this intuition is the origin of the thesis.
First of all, from the queueing model described in the last section and Fig. 3.1, it is intuitive that an efficient algorithm would have each server make decisions regarding its own queue state $Q_n(t)$, whereas the front-end load balancer makes routing and rejection decisions slot-wise based on the global information $(\lambda(t),c(t),\mathbf{Q}(t))$.

Next, to get an idea of what exactly the decisions should be, by virtue of Lyapunov optimization, one would introduce a trade-off parameter $V>0$ and use the queue states $\mathbf{Q}(t)$ to solve the following slot-wise optimization problem:
$$\min~V\left(c(t)r(t)+\sum_{n=1}^N\left(W_n(t)+e_nH_n(t)+g_n(t)\right)\right)+\sum_{n=1}^N Q_n(t)\left(R_n(t)-\mu_n(t)\right),~~\text{s.t. constraint (3.6)},\quad(3.7)$$
which is naturally separable into the load-balancing decision ($r(t)$, $R_n(t)$) and the service decision ($W_n(t)$, $H_n(t)$, $g_n(t)$, $\mu_n(t)$). However, because of the existence of a setup state (during which no decision can be made), the server does not have an identical decision set every slot and, furthermore, the decision set itself depends on previous decisions. This poses a significant difficulty in analyzing the above optimization (3.7).

In order to resolve this difficulty, we try to find the smallest "identical time unit" for each individual server in lieu of slots. This motivates the notion of the renewal frame in the previous section (see Fig. 3.2). Specifically, from Fig. 3.2 and the related renewal frame construction, at the starting slot of each renewal, the server faces an identical decision set (remain active or go idle for a certain number of slots) regardless of previous decisions. Following this idea, we modify (3.7) as follows:

• For the front-end load balancer, we observe $(\lambda(t),c(t),\mathbf{Q}(t))$ and solve
$$\min~Vc(t)r(t)+\sum_{n=1}^N Q_n(t)R_n(t),~\text{s.t.}$$
(3.6), which is detailed in Section 3.2.3.

• For each server, instead of the per-slot optimization $\min~V(W_n(t)+e_nH_n(t)+g_n(t))-Q_n(t)\mu_n(t)$, we propose to minimize the time average of this quantity per renewal frame $T_n[f]$.

In order to apply Algorithm 2 to this scenario, we can view the admission control (which chooses $r(t)$ and $R_n(t)$) as another system besides the $N$ servers. Thus, this problem is equivalent to an asynchronous optimization over $N+1$ parallel renewal systems where one of them is just a slotted system. This falls into the form of (5.1)-(1.17) when setting $L=N$,
$$y^n[t] = r(t)c(t)+W_n(t)+e_nH_n(t)+g_n(t),$$
$$z_l[t] = R_l(t)-\mu_l(t)H_l(t),~l\in\{1,2,\cdots,N\},~~d_l=0,$$
and the control variables $r(t)$, $R_n(t)$ are non-negative and must satisfy the following instantaneous constraints:
$$\sum_{n=1}^N R_n(t)\le R_{\max},~~\sum_{n=1}^N R_n(t)+r(t)=\lambda(t).$$
The only difference compared to (3.4)-(3.6) is that here the decision variables $r(t)$ and $R_n(t)$ must take values in time-varying ranges per slot and they must be chosen after observing the random variable $c(t)$. However, since $r(t)$ and $R_n(t)$ are updated slot-wise, this minor difference is easy to handle via our renewal optimization framework and we have the following Algorithm 3.

Algorithm 3. Fix a trade-off parameter
$V>0$, and at each time slot $t$:

• The admission controller chooses $r(t)$ and $R_n(t)$ according to
$$\min~Vc(t)r(t)+\sum_{n=1}^N Q_n(t)R_n(t)~~\text{s.t.}~\sum_{n=1}^N R_n(t)\le R_{\max},~\sum_{n=1}^N R_n(t)+r(t)=\lambda(t).\quad(3.8)$$

• Each server chooses the service options $\alpha_n[f]$ and $I_n[f]$ via the following:
$$\min~\frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_{f+1}^n-1}\left(VW_n(t)+Ve_nH_n(t)+Vg_n(t)-Q_n(t_f^n)\mu_n(t)H_n(t)\right)~\Big|~Q_n(t_f^n)\right)}{\mathbb{E}\left(T_n[f]~\big|~Q_n(t_f^n)\right)}\quad(3.9)$$

• Update $Q_n(t)$:
$$Q_n(t+1)=\max\{Q_n(t)+R_n(t)-\mu_n(t)H_n(t),~0\}.$$

Note first that in Algorithm 3, the solution to problem (3.8) admits a simple thresholding rule (with shortest-queue ties broken arbitrarily):
$$r(t)=\begin{cases}\max\{\lambda(t)-R_{\max},0\}, & \text{if }\exists n\in\mathcal{N}\text{ s.t. }Q_n(t)\le Vc(t);\\ \lambda(t), & \text{otherwise}.\end{cases}\quad(3.10)$$
$$R_n(t)=\begin{cases}\min\{\lambda(t),R_{\max}\}, & \text{if }Q_n(t)\text{ is the shortest queue and }Q_n(t)\le Vc(t);\\ 0, & \text{otherwise}.\end{cases}\quad(3.11)$$
Next, for the problem (3.9), recall the definition of $T_n[f]$ and $\alpha_n[f]\in\{\text{active}\}\cup\mathcal{L}_n$. If the server chooses to remain active, then the frame length is exactly 1; otherwise, the server is allowed to choose how long it stays idle, with $\mathbb{E}\left(T_n[f]~|~\mathbf{Q}(t_f^n)\right) = I_n[f]+m_{\alpha_n[f]}+1$, where $I_n[f]\in\{1,\cdots,I_{\max}\}$. It can easily be shown that, over all randomized decisions between staying active and going to the different idle states, it is optimal to make a pure decision which either stays active or goes to one of the idle states with probability 1.

More specifically, let
$$D_n[f] = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_{f+1}^n-1}\left(VW_n(t)+Ve_nH_n(t)+Vg_n(t)-Q_n(t_f^n)\mu_n(t)H_n(t)\right)~\Big|~Q_n(t_f^n)\right)}{\mathbb{E}\left(T_n[f]~\big|~Q_n(t_f^n)\right)}.\quad(3.12)$$
When server $n$ chooses to be active, then
$$D_n[f] = Ve_n-Q_n(t_f^n)\mu_n.$$
(3.13)

Otherwise, choosing a specific idle option $\alpha_n[f]\in\mathcal{L}_n$ gives
$$D_n[f] = \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]}+Ve_n-Q_n(t_f^n)\mu_n+V\hat{g}_n(\alpha_n[f])I_n[f]}{I_n[f]+m_{\alpha_n[f]}+1},\quad(3.14)$$
which follows from the fact that if the server goes idle, then $H_n(t)$ is zero during the frame except for the last slot. Then, solving (3.9) is equivalent to choosing the option which achieves the smaller value of $D_n[f]$ between (3.13) and (3.14).

A closer look at the optimization problem (3.14) indicates that the best idle period $I_n[f]$ solving (3.14) is either 1 or $I_{\max}$. This is unfortunately problematic for the data center application, since it means the server either does not idle at all or goes idle for a very long time. When the arriving task stream is highly volatile, this could cause significant delay. In the next section, we introduce our proposed algorithm for the servers, which makes relatively "smooth" decisions.

Our main idea for pushing the server away from the binary decision is to add a term to the ratio (3.12) which is quadratic in the renewal frame length. Specifically, for server $n$, at the beginning of its $f$-th renewal frame $t_f^n$, it observes its current queue state $Q_n(t_f^n)$ and makes decisions on $\alpha_n[f]\in\{\text{active}\}\cup\mathcal{L}_n$ and $I_n[f]$ so as to minimize the ratio of expectations in (3.15) as follows:
$$D_n[f] \triangleq \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_{f+1}^n-1}\left(VW_n(t)+Ve_nH_n(t)+Vg_n(t)-Q_n(t_f^n)\mu_n(t)H_n(t)+\left(t-t_f^n\right)B\right)~\Big|~Q_n(t_f^n)\right)}{\mathbb{E}\left(T_n[f]~\big|~Q_n(t_f^n)\right)},\quad(3.15)$$
where $B = (R_{\max}+\mu_{\max})\mu_{\max}$.
Compared to the objective (3.12), the quantity $D_n[f]$ has an extra term $\sum_{t=t_f^n}^{t_{f+1}^n-1}\left(t-t_f^n\right)B = \frac{B}{2}T_n[f]\left(T_n[f]-1\right)$ in the numerator that is quadratic in $T_n[f]$. Similar to the last section, we are then able to simplify the problem by computing $D_n[f]$ for the active and idle options separately.

• If the server chooses to go active, i.e. $\alpha_n[f]=\text{active}$, then
$$D_n[f] = Ve_n-Q_n(t_f^n)\mu_n.\quad(3.16)$$

• If the server chooses to go idle, i.e. $\alpha_n[f]\in\mathcal{L}_n$, then
$$D_n[f] = \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]}+Ve_n-Q_n(t_f^n)\mu_n+\mathbb{E}\left(V\hat{g}_n(\alpha_n[f])I_n[f]+\frac{B}{2}T_n[f]\left(T_n[f]-1\right)~\Big|~Q_n(t_f^n)\right)}{\mathbb{E}\left(T_n[f]~\big|~Q_n(t_f^n)\right)},\quad(3.17)$$
which follows from the fact that if the server goes idle, then $H_n(t)$ is zero during the frame except for the last slot. Now we compute the optimal idle option $\alpha_n[f]\in\mathcal{L}_n$ and idle time length $I_n[f]$ given that the server chooses to go idle. The following lemma shows that the decision on $I_n[f]$ can also be reduced to a pure decision.

Lemma 3.2.1.
The best decision minimizing (3.17) is a pure decision which takes one $\alpha_n[f]\in\mathcal{L}_n$ and one integer value $I_n[f]\in\{1,\cdots,I_{\max}\}$ minimizing the deterministic function:
$$D_n[f] = \frac{V\hat{W}_n(\alpha_n[f])m_{\alpha_n[f]}+Ve_n-Q_n(t_f^n)\mu_n+\frac{B}{2}\sigma_{\alpha_n[f]}^2+V\hat{g}_n(\alpha_n[f])I_n[f]}{I_n[f]+m_{\alpha_n[f]}+1}+\frac{B}{2}\left(I_n[f]+m_{\alpha_n[f]}+1\right).\quad(3.18)$$
The proof of the above lemma is given in Appendix A.

Then, the server computes the minimum of (3.18), which is a deterministic optimization problem. It proceeds in the following two steps:
1. For each $\alpha_n\in\mathcal{L}_n$, first differentiate (3.18) with respect to $I_n[f]$ to get a real-valued minimizer. Then choose $I_n[f]$ as the one of the two integer values bracketing the real-valued minimizer which achieves a smaller value in (3.18).
2. Compare (3.18) across the different $\alpha_n\in\mathcal{L}_n$ and choose the one achieving the minimum.

Thus, the server compares (3.16) with the minimum of (3.18). If (3.16) is less than the minimum of (3.18), then the server chooses to go active. Otherwise, the server chooses to go idle and stays idle for $I_n[f]$ time slots.

Overall, our final algorithm is summarized in Algorithm 4.

Algorithm 4.
• At each time slot $t$, the data center observes $\lambda(t)$, $c(t)$ and $\mathbf{Q}(t)$, chooses the rejection decision $r(t)$ according to (3.10), and chooses the routing decision $R_n(t)$ according to (3.11).
• For each server $n\in\mathcal{N}$, at the beginning of its $f$-th frame $t_f^n$, observe its queue state $Q_n(t_f^n)$ and compute (3.16) and the minimum of (3.18). If (3.16) is less than the minimum of (3.18), then the server stays active. Otherwise, the server switches to the idle state minimizing (3.18) and stays idle for the $I_n[f]$ achieving the minimum of (3.18).
• Update $Q_n(t)$, $\forall n\in\mathcal{N}$, according to
$$Q_n(t+1)=\max\{Q_n(t)+R_n(t)-\mu_n(t)H_n(t),~0\}.$$

In this section, we prove a probability 1 convergence result for the proposed algorithm (Algorithm 4).
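A minimal sketch of the server-side decision in Algorithm 4, using a ratio objective of the generic form in (3.18) (the mode parameters and constants below are placeholders, not values from the thesis). It computes the active cost (3.16), scans all pure (mode, idle-length) pairs, and picks whichever is smaller; for clarity it scans all $I\in\{1,\ldots,I_{\max}\}$ rather than using the derivative-and-bracket shortcut described above:

```python
def D_active(V, e_n, Q_n, mu_n):
    # (3.16): objective value if the server stays active for a one-slot frame.
    return V * e_n - Q_n * mu_n

def D_idle(V, e_n, Q_n, mu_n, W, m, var, g, I, B):
    # Deterministic objective in the spirit of (3.18) for an idle option with
    # mean setup m, setup-cost W, setup variance var, idle cost g and idle
    # length I. The exact constants here are illustrative placeholders.
    T = I + m + 1  # expected frame length
    return (V * W * m + V * e_n - Q_n * mu_n + B * var + V * g * I) / T + B * T

def best_decision(V, e_n, Q_n, mu_n, modes, I_max, B):
    """Compare staying active against every pure (mode, I) decision."""
    best = ("active", None, D_active(V, e_n, Q_n, mu_n))
    for name, (W, m, var, g) in modes.items():
        for I in range(1, I_max + 1):
            d = D_idle(V, e_n, Q_n, mu_n, W, m, var, g, I, B)
            if d < best[2]:
                best = (name, I, d)
    return best

# Hypothetical sleep modes: name -> (setup cost W, mean setup m, setup var, idle cost g)
modes = {"light": (0.2, 1, 0.0, 0.05), "deep": (0.5, 3, 1.0, 0.01)}
# Long queue: serving is valuable, so the server should stay active.
a_dec = best_decision(V=10.0, e_n=1.0, Q_n=500.0, mu_n=2.0, modes=modes, I_max=20, B=0.5)
# Empty queue: going idle is cheaper than paying the activation cost.
i_dec = best_decision(V=10.0, e_n=1.0, Q_n=0.0, mu_n=2.0, modes=modes, I_max=20, B=0.5)
print(a_dec, i_dec)
```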
More specifically, we prove that the online algorithm introduced in the last section keeps all request queues $Q_n(t)$ bounded (on the order of $V$) and achieves near optimality with a sub-optimality gap on the order of $1/V$, with probability 1.

In this section, we show that the request queues are deterministically bounded due to the special thresholding nature of the admission control. Such a result is stronger (yet simpler) than the expected virtual queue analysis presented in the last section.
Lemma 3.3.1. If $Q_n(0)=0$, $\forall n\in\mathcal{N}$, then each request queue $Q_n(t)$ is deterministically bounded with bound
$$Q_n(t)\le Vc_{\max}+R_{\max},~\forall t,~\forall n\in\mathcal{N},$$
where $c_{\max}\triangleq\max_{c\in\mathcal{C}}c$.

Proof. We use induction to prove the claim. The base case is trivial since $Q_n(0)=0\le Vc_{\max}+R_{\max}$. Suppose the claim holds at the beginning of slot $t=i$ for some $i\ge0$, so that $Q_n(i)\le Vc_{\max}+R_{\max}$. Then:
1. If $Q_n(i)\le Vc_{\max}$, then it is possible for the queue to increase during slot $i$. However, the increase of the queue within one slot is bounded by $R_{\max}$, which implies that at the beginning of slot $i+1$, $Q_n(i+1)\le Vc_{\max}+R_{\max}$.
2. If $Vc_{\max}<Q_n(i)\le Vc_{\max}+R_{\max}$, then, according to (3.11), it is impossible to route any request to server $n$ during slot $i$, so $R_n(i)=0$, which results in $Q_n(i+1)\le Vc_{\max}+R_{\max}$.
This completes the proof of the lemma.
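The deterministic bound of Lemma 3.3.1 is easy to check in simulation: under the threshold rules (3.10)-(3.11), no queue ever exceeds $Vc_{\max}+R_{\max}$. The arrival, cost, and service parameters below are arbitrary placeholders, and servers are kept always active ($H_n(t)=1$) for simplicity:

```python
import random

random.seed(3)

V, R_MAX, LAM_MAX, C_MAX, N = 5.0, 4, 6, 2.0, 3
MU = [1, 2, 1]  # per-slot service at each (always-active) server

Q = [0] * N
bound = V * C_MAX + R_MAX
ok = True
for t in range(100_000):
    lam = random.randint(0, LAM_MAX)
    c = random.choice([0.5, 1.0, C_MAX])
    # (3.11): route min(lam, R_max) to the shortest queue if Q_n(t) <= V*c(t);
    # otherwise, by (3.10), everything is rejected.
    n = min(range(N), key=lambda i: Q[i])
    R = [0] * N
    if Q[n] <= V * c:
        R[n] = min(lam, R_MAX)
    for i in range(N):
        Q[i] = max(Q[i] + R[i] - MU[i], 0)   # queue update (3.1) with H=1
        ok = ok and Q[i] <= bound
print(ok, max(Q), bound)
```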
Lemma 3.3.2.
The proposed algorithm meets the constraint (3.5) with probability 1.

Proof.
From the queue update rule (3.1), it follows that $Q_n(t+1)\ge Q_n(t)+R_n(t)-\mu_n(t)H_n(t)$. Taking a telescoping sum from $0$ to $T-1$ gives
$$Q_n(T)\ge Q_n(0)+\sum_{t=0}^{T-1}R_n(t)-\sum_{t=0}^{T-1}\mu_n(t)H_n(t).$$
Since $Q_n(0)=0$, dividing both sides by $T$ gives
$$\frac{Q_n(T)}{T}\ge\frac1T\sum_{t=0}^{T-1}R_n(t)-\frac1T\sum_{t=0}^{T-1}\mu_n(t)H_n(t).$$
Substituting the bound $Q_n(T)\le Vc_{\max}+R_{\max}$ from Lemma 3.3.1 into the above inequality and taking the limit as $T\to\infty$ gives the desired result.

In this section, we introduce a class of algorithms which are theoretically helpful for the analysis, but practically impossible to implement.

Since servers are coupled only through the time average constraint (3.5), each server $n$ can be viewed as a separate renewal system; thus, it can be shown that any possible time average service rate $\overline{\mu}_n$ can be achieved through a frame-based stationary randomized service decision, meaning that the decisions are i.i.d. over frames. Furthermore, it can be shown that the optimality of (3.4)-(3.6) can be achieved over the following randomized stationary algorithms: at the beginning of each time slot $t$, the data center observes the incoming requests $\lambda(t)$ and the rejection cost $c(t)$, then routes $R_n^*(t)$ incoming requests to server $n$ and rejects $d^*(t)$ requests, both of which are random functions of $(\lambda(t),c(t))$. They satisfy the same instantaneous relation as (3.6). Meanwhile, server $n$ chooses a frame-based stationary randomized service decision $(\alpha_n^*[f],I_n^*[f])$, so that the optimal service rate is achieved.

If one knows the stationary distribution of $(\lambda(t),c(t))$, then this optimal control algorithm can be computed using dynamic programming or linear programming. Moreover, the optimal setup cost $W_n^*(t)$, idle cost $g_n^*(t)$, and active state indicator $H_n^*(t)$ can also be deduced. Since the algorithm is stationary, these three cost processes are all ergodic Markov processes. Let $T_n^*[f]$ be the frame length process under this algorithm.
Thus, it follows from the renewal reward theorem that
$$\left\{\sum_{t=t_f^n}^{t_{f+1}^n-1}W_n^*(t)\right\}_{f=0}^{+\infty},~\left\{\sum_{t=t_f^n}^{t_{f+1}^n-1}g_n^*(t)\right\}_{f=0}^{+\infty},~\left\{\sum_{t=t_f^n}^{t_{f+1}^n-1}e_nH_n^*(t)\right\}_{f=0}^{+\infty},~\left\{\sum_{t=t_f^n}^{t_{f+1}^n-1}\mu_n(t)H_n^*(t)\right\}_{f=0}^{+\infty}$$
and $\{T_n^*[f]\}_{f=0}^{+\infty}$ are all i.i.d. random variables over frames. Let $\overline{C}^*$, $\overline{W}_n^*$, $\overline{G}_n^*$ and $\overline{E}_n^*$ be the optimal time average costs, and let $\overline{R}_n^*$, $\overline{\mu}_n^*$ and $\overline{d}^*$ be the optimal time average routing rate, service rate and rejection rate respectively. Then, by the strong law of large numbers,
$$\overline{W}_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^{(n)}+T_n^*[f]-1}W_n^*(t)\right)}{\mathbb{E}(T_n^*[f])},\quad(3.19)$$
$$\overline{E}_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^{(n)}+T_n^*[f]-1}e_nH_n^*(t)\right)}{\mathbb{E}(T_n^*[f])},\quad(3.20)$$
$$\overline{G}_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^{(n)}+T_n^*[f]-1}g_n^*(t)\right)}{\mathbb{E}(T_n^*[f])},\quad(3.21)$$
$$\overline{\mu}_n^* = \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^{(n)}+T_n^*[f]-1}\mu_n(t)H_n^*(t)\right)}{\mathbb{E}(T_n^*[f])}.\quad(3.22)$$
Also, notice that $R_n^*(t)$ and $d^*(t)$ depend only on the random variables $\lambda(t)$ and $c(t)$, which are i.i.d. over slots. Thus, $R_n^*(t)$ and $d^*(t)$ are also i.i.d. random variables over slots. By the law of large numbers,
$$\overline{R}_n^* = \mathbb{E}(R_n^*(t)),\quad(3.23)$$
$$\overline{C}^* = \mathbb{E}(c(t)d^*(t)).\quad(3.24)$$

Remark 3.3.1.
Since the idle time $I_n^*[f]\in[1,I_{\max}]$ and the first two moments of the setup time are bounded, it follows that the first two moments of $T_n^*[f]$ are bounded.

In this part, we compare the algorithm deduced from the two optimization problems (3.8) and (3.15) to the best stationary algorithm in Section 3.3.2, illustrating the key features of the proposed online algorithm. Define $\mathcal{F}(t)$ as the system history up until slot $t$, which includes all the decisions taken and all the random events before slot $t$. We first consider (3.8). For simplicity of notation, define two random processes $\{X_n[f]\}_{f=0}^{\infty}$ and $\{Z[t]\}_{t=0}^{\infty}$ as follows:
$$X_n[f] = \sum_{t=t_f^n}^{t_{f+1}^n-1}\left(V\left(W_n(t)-\overline{W}_n^*\right)+V\left(e_nH_n(t)-\overline{E}_n^*\right)+V\left(g_n(t)-\overline{G}_n^*\right)-Q_n(t_f^n)\left(\mu_nH_n(t)-\overline{\mu}_n^*\right)+\left(t-t_f^n\right)B-\Psi_n\right),$$
$$Z[t] = V\left(c(t)r(t)-\overline{C}^*\right)+\sum_{n=1}^N Q_n(t)\left(R_n(t)-\overline{R}_n^*\right),$$
where $\Psi_n = \frac{B\,\mathbb{E}\left(T_n^*[f]\left(T_n^*[f]-1\right)\right)}{2\,\mathbb{E}(T_n^*[f])}$ and $B = (R_{\max}+\mu_{\max})\mu_{\max}$.

Given the system information $\mathcal{F}(t)$ and the random events $c(t)$ and $\lambda(t)$, the solutions (3.10) and (3.11) take rejection and routing decisions so as to minimize (3.8) over all possible routing and rejection decisions at time slot $t$. Thus, the proposed algorithm achieves a smaller value in (3.8) than the best stationary algorithm in Section 3.3.2. Formally, this idea can be stated as the following inequality:
$$\mathbb{E}\left(Vc(t)r(t)+\sum_{n=1}^N Q_n(t)R_n(t)~\Big|~c(t),\lambda(t),\mathcal{F}(t)\right) \le \mathbb{E}\left(Vc(t)d^*(t)+\sum_{n=1}^N Q_n(t)R_n^*(t)~\Big|~c(t),\lambda(t),\mathcal{F}(t)\right).$$
Taking expectations with respect to $c(t)$ and $\lambda(t)$, using the fact that under the best stationary algorithm $R_n^*(t)$ and $d^*(t)$ are i.i.d.
over slots (independent of $\mathcal{F}(t)$), together with (3.23) and (3.24), we get
$$\mathbb{E}\left(Z[t]~|~\mathcal{F}(t)\right)\le 0.\quad(3.25)$$
Similarly, for (3.15), the proposed service decisions within frame $f$ minimize $D_n[f]$ in (3.15); thus, compared to the best stationary policy, the following inequality holds:
$$\frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_{f+1}^n-1}\left(V\left(W_n(t)+e_nH_n(t)+g_n(t)\right)-Q_n(t_f^n)\mu_n(t)H_n(t)+\left(t-t_f^n\right)B\right)~\Big|~\mathcal{F}(t_f^n)\right)}{\mathbb{E}\left(T_n[f]~|~\mathcal{F}(t_f^n)\right)}$$
$$\le \frac{\mathbb{E}\left(\sum_{t=t_f^n}^{t_f^{(n)}+T_n^*[f]-1}\left(V\left(W_n^*(t)+e_nH_n^*(t)+g_n^*(t)\right)-Q_n(t_f^n)\mu_nH_n^*(t)\right)+\frac{B}{2}T_n^*[f]\left(T_n^*[f]-1\right)~\Big|~\mathcal{F}(t_f^n)\right)}{\mathbb{E}\left(T_n^*[f]~|~\mathcal{F}(t_f^n)\right)}.\quad(3.26)$$
Again, using the fact that the optimal stationary algorithm gives i.i.d. $W_n^*(t)$, $g_n^*(t)$, $H_n^*(t)$ and $T_n^*[f]$ over frames (independent of $\mathcal{F}(t_f^n)$), as well as (3.19), (3.20) and (3.22), we get
$$\mathbb{E}\left(X_n[f]~\big|~\mathcal{F}(t_f^n)\right)\Big/\mathbb{E}\left(T_n[f]~\big|~\mathcal{F}(t_f^n)\right)\le 0.\quad(3.27)$$
The key feature inequalities (3.25) and (3.27) provide us with bounds on the expectations. The following lemma serves as a stepping stone for passing from expectation bounds to probability 1 bounds. Recall the basic definition of a supermartingale in Definition 2.3.1. We have the following strong law of large numbers for supermartingale difference sequences:
Lemma 3.3.3 (Corollary 4.2 of [Nee12c]). Let $\{X_t\}_{t=0}^\infty$ be a supermartingale difference sequence. If
$$\sum_{t=1}^\infty \mathbb{E}\big(X_t^2\big)\big/t^2 < \infty,$$
then
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} X_t \le 0, \quad \text{with probability 1.}$$
With this lemma, we are ready to prove the following result:

Lemma 3.3.4.
Under the proposed algorithm, the following hold with probability 1:
$$\limsup_{F\to\infty} \frac{1}{F}\sum_{f=0}^{F-1} X_n[f] \le 0, \tag{3.28}$$
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} Z[t] \le 0. \tag{3.29}$$

Proof.
The key to the proof is treating these two sequences as supermartingale difference sequences and applying the law of large numbers for supermartingale difference sequences (Theorem 4.1 and Corollary 4.2 in [Nee12c]).

We first look at the sequence $\{X_n[f]\}_{f=0}^\infty$. Let $Y_n[F] = \sum_{f=0}^{F-1} X_n[f]$. We first prove that $Y_n[F]$ is a supermartingale. Notice that $Y_n[F] \in \mathcal{F}(t_F^n)$, i.e. it is measurable given all the information before frame $F$, and $|Y_n[F]| < \infty$ for all $F < \infty$. Furthermore,
$$\mathbb{E}\big( Y_n[F+1] - Y_n[F] \,\big|\, \mathcal{F}(t_F^n) \big) = \mathbb{E}\big( X_n[F] \,\big|\, \mathcal{F}(t_F^n) \big) \le 0 \cdot \mathbb{E}\big( T_n[F] \,\big|\, \mathcal{F}(t_F^n) \big) = 0,$$
where the only inequality follows from (3.27). Thus, $Y_n[F]$ is a supermartingale. Next, we show that the second moment of the supermartingale differences, i.e. $\mathbb{E}\big(X_n[f]^2\big)$, is deterministically bounded by a fixed constant for any $f$. This part of the proof is given in Appendix B. It follows that $\sum_{f=1}^\infty \mathbb{E}\big(X_n[f]^2\big)\big/f^2 < \infty$. Now, applying Lemma 3.3.3 immediately gives (3.28).

Similarly, we can prove (3.29) by showing that $M[T] = \sum_{t=0}^{T-1} Z[t]$ is a supermartingale with bounded second moments on the differences, using (3.23), (3.24) and (3.25). The procedure is almost the same as above and we omit the details for brevity.

Corollary 3.3.1.
The following ratio of time averages is upper bounded with probability 1:
$$\limsup_{F\to\infty} \frac{\sum_{f=0}^{F-1} X_n[f]}{\sum_{f=0}^{F-1} T_n[f]} \le 0.$$

Proof. From (3.28), it follows that for any $\epsilon >$
$0$, there exists an $F_0(\epsilon)$ such that $F \ge F_0(\epsilon)$ implies $\frac{1}{F}\sum_{f=0}^{F-1} X_n[f] \le \epsilon$. Thus, since $T_n[f] \ge 1$ for all $f$,
$$\frac{\sum_{f=0}^{F-1} X_n[f]}{\sum_{f=0}^{F-1} T_n[f]} \le \frac{\epsilon F}{\sum_{f=0}^{F-1} T_n[f]} \le \epsilon.$$
Thus, $\limsup_{F\to\infty} \sum_{f=0}^{F-1} X_n[f] \big/ \sum_{f=0}^{F-1} T_n[f] \le \epsilon$. Since $\epsilon$ is arbitrary, taking $\epsilon \to 0$ finishes the proof.

The ratio of time averages in Corollary 3.3.1 and the true time average share the same bound, which is proved by the following lemma:

Lemma 3.3.5.
The following time average is bounded with probability 1:
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \Big( V\big(W_n(t) + e_n H_n(t) + g_n(t)\big) - Q_n(t_f^n)\big(\mu_n H_n(t) - \mu_n^*\big) + \big(t - t_f^n\big)B_0 \Big) \le V\big(W_n^* + E_n^* + G_n^*\big) + \Psi_n, \tag{3.30}$$
where $\Psi_n = B_0\,\mathbb{E}\big(T_n^*[f](T_n^*[f]-1)\big)\big/\big(2\,\mathbb{E}(T_n^*[f])\big)$ and $B_0 = (R_{\max}+\mu_{\max})\mu_{\max}$.

The idea of the proof is similar to that of basic renewal theory: derive upper and lower bounds on the partial sums for each $T$ within any frame $F$ using Corollary 3.3.1, thereby showing that as $T\to\infty$ the upper and lower bounds meet. See Appendix C for details. With the help of this lemma, we are able to prove the following near-optimal performance theorem:

Theorem 3.3.1. If $Q_n(0) = 0$ for all $n \in \mathcal{N}$, then the time average total cost under the algorithm is near optimal on the order of $O(1/V)$, i.e. with probability 1,
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \Bigg( c(t)r(t) + \sum_{n=1}^N \big(W_n(t) + e_n H_n(t) + g_n(t)\big) \Bigg) \le \underbrace{C^* + \sum_{n=1}^N \big(W_n^* + E_n^* + G_n^*\big)}_{\text{Optimal cost}} + \frac{\sum_{n=1}^N \Psi_n + B}{V}, \tag{3.31}$$
where $B \triangleq \frac{1}{2}\sum_{n=1}^N (R_{\max} + \mu_n)^2$, $\Psi_n = B_0\,\mathbb{E}\big(T_n^*[f](T_n^*[f]-1)\big)\big/\big(2\,\mathbb{E}(T_n^*[f])\big)$ and $B_0 = (R_{\max}+\mu_{\max})\mu_{\max}$. See Appendix D for details of the proof.
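The supermartingale strong law of large numbers (Lemma 3.3.3) underpinning these probability-1 bounds is easy to illustrate numerically. The sketch below uses an arbitrary Gaussian difference sequence with nonpositive conditional mean (the drift and variance are made-up parameters, not taken from the model); its condition $\sum_t \mathbb{E}(X_t^2)/t^2 < \infty$ holds trivially here, and the time average indeed settles at or below zero:

```python
import random

def running_average_tail(drift=-0.1, sigma=1.0, horizon=20000, seed=1):
    """Simulate a supermartingale difference sequence X_t with
    E(X_t | past) = drift <= 0 and bounded second moments, and return
    the time average (1/T) * sum_{t<T} X_t at the horizon."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(horizon):
        total += rng.gauss(drift, sigma)  # E(X_t^2)/t^2 is summable here
    return total / horizon

# The time average concentrates near the drift, hence at or below 0,
# as Lemma 3.3.3 predicts.
print(running_average_tail())
```

With 20000 slots the standard deviation of the average is about $1/\sqrt{20000} \approx 0.007$, so the output sits close to the drift of $-0.1$.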
The algorithm in the previous sections optimizes the time average cost. However, it can route requests to idle queues, which increases system delay. This section considers an improvement to the algorithm that maintains the same average cost guarantees but reduces delay. This is done by a "virtualization" technique that reduces the $N$ server request queues to a single request queue $Q(t)$. Specifically, the same Algorithm 1 is run, with queue updates (3.1) for each of the $N$ queues $Q_n(t)$. However, the $Q_n(t)$ processes are now virtual queues rather than actual queues: their values are only kept in software. Every slot $t$, the data center observes the incoming requests $\lambda(t)$, the rejection cost $c(t)$ and the virtual queue values, making the rejection decision according to (3.10) as before. The admitted requests are queued in $Q(t)$. Meanwhile, each server $n$ makes active/idle decisions observing its own virtual queue $Q_n(t)$, exactly as before. Whenever a server is active, it grabs requests from the request queue $Q(t)$ and serves them. This results in an actual queue update for the system:
$$Q(t+1) = \max\Bigg( Q(t) + \lambda(t) - r(t) - \sum_{n=1}^N \mu_n(t) H_n(t),\ 0 \Bigg). \tag{3.32}$$
Fig. 3.3 shows this data center architecture.

Figure 3.3: Illustration of the basic data center architecture.

Since this algorithm does not look at the actual queue $Q(t)$, it is not clear whether the actual request queue is stabilized under the proposed algorithm. The following lemma answers this question. For simplicity, we call the system with $N$ queues, where our algorithm applies, the virtual system, and the system with only one queue the actual system.

Lemma 3.4.1. If $Q(0) = 0$ and $Q_n(0) = 0$ for all $n \in \mathcal{N}$, then the virtualization technique stabilizes the queue $Q(t)$ with the bound $Q(t) \le N(Vc_{\max} + R_{\max})$.

Proof.
Notice that this bound is $N$ times the individual queue bound in Lemma 3.3.1. We prove the lemma by showing that the sum of the weights $\sum_{n=1}^N Q_n(t)$ in the virtual system always dominates the queue length $Q(t)$. We proceed by induction. The base case is obvious since $Q(0) = \sum_{n=1}^N Q_n(0) = 0$. Suppose at the beginning of time $t$ we have $Q(t) \le \sum_{n=1}^N Q_n(t)$. During time $t$, we distinguish between the following two cases:

1. Not all active servers in the actual system have requests to serve. This case happens if and only if there are not enough requests in $Q(t)$ to be served, i.e. $\lambda(t) - r(t) + Q(t) < \sum_{n=1}^N \mu_n(t) H_n(t)$. Thus, according to the queue update rule (3.32), at the beginning of slot $t+1$ there are no requests in the actual queue, i.e. $Q(t+1) = 0$. Hence $Q(t+1) \le \sum_{n=1}^N Q_n(t+1)$ is guaranteed.

2. All active servers in the actual system have requests to serve. Notice that the virtual system and the actual system have exactly the same arrivals, rejections and server active/idle states. Thus,
$$Q(t+1) = Q(t) + \lambda(t) - r(t) - \sum_{n=1}^N \mu_n(t)H_n(t) \le \sum_{n=1}^N Q_n(t) + \sum_{n=1}^N R_n(t) - \sum_{n=1}^N \mu_n(t)H_n(t) \le \sum_{n=1}^N \max\big\{Q_n(t) + R_n(t) - \mu_n(t)H_n(t),\ 0\big\} = \sum_{n=1}^N Q_n(t+1),$$
where the first inequality follows from the induction hypothesis as well as the fact that $\sum_{n=1}^N R_n(t) = \lambda(t) - r(t)$.

In summary, we have proved $Q(t) \le \sum_{n=1}^N Q_n(t)$ for all $t$. Since each $Q_n(t) \le Vc_{\max} + R_{\max}$ for all $t$, the lemma follows.

Since the virtual system and the actual system have exactly the same cost, and it can be shown that the optimal cost in the one-queue system is lower bounded by the optimal cost in the $N$-queue system, the near-optimal performance guarantee is preserved.

In this section, we demonstrate the performance of our proposed algorithm via extensive simulations. The first simulation runs over i.i.d. traffic.
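Before turning to the simulations, note that the domination argument of Lemma 3.4.1 can be checked numerically. The sketch below uses arbitrary admissible decisions rather than Algorithm 1 itself (the arrival range, routing, and service values are hypothetical), since the inequality $Q(t) \le \sum_n Q_n(t)$ relies only on the two queue update rules (3.1) and (3.32):

```python
import random

def simulate(T=5000, N=5, seed=0):
    """Check that the single actual queue Q(t) under update (3.32) is
    dominated by the sum of the N virtual queues Q_n(t) under (3.1),
    for arbitrary admissible arrivals, rejections, and routing."""
    rng = random.Random(seed)
    Q = 0.0                 # actual queue
    Qn = [0.0] * N          # virtual queues
    for _ in range(T):
        lam = rng.randint(10, 30)          # arrivals (hypothetical range)
        r = rng.randint(0, lam)            # rejections
        admitted = lam - r
        R = [0] * N                        # arbitrary routing of admitted requests
        for _ in range(admitted):
            R[rng.randrange(N)] += 1
        H = [rng.randint(0, 1) for _ in range(N)]   # active/idle states
        mu = [rng.randint(1, 8) for _ in range(N)]  # service amounts
        served = sum(m * h for m, h in zip(mu, H))
        Q = max(Q + admitted - served, 0)
        Qn = [max(q + a - m * h, 0) for q, a, m, h in zip(Qn, R, mu, H)]
        assert Q <= sum(Qn) + 1e-9   # domination from Lemma 3.4.1
    return Q, sum(Qn)

print(simulate())
```

The in-loop assertion is exactly the induction step of the proof: both systems see identical arrivals, rejections, and active/idle states, and the actual queue either empties or evolves linearly below the virtual sum.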
We show that our algorithm indeed achieves $O(1/V)$ near optimality with $O(V)$ delay (the $[O(1/V), O(V)]$ trade-off), as predicted by Lemma 3.3.1 and Theorem 3.3.1. We then apply our algorithm to a real data center traffic trace with realistic scale and setup times, the cost being the power consumption. We compare the performance of the proposed algorithm with several other heuristic algorithms and show that our algorithm indeed delivers lower delay and saves power.

Table 3.2: Problem parameters
Server | $\mu_n$ | $e_n$ | $\hat W_n(\alpha_n)$ | $\mathbb{E}(\hat\tau(\alpha_n))$

N-queues system

In the first simulation, we consider a relatively small scale problem with i.i.d. generated traffic. We set the number of servers $N = 5$. The incoming requests $\lambda(t)$ are integers following a uniform distribution in $[10, \cdot\,]$, and the rejection costs $c(t)$ are also integers following a uniform distribution in $[1, \cdot\,]$. We set $R_{\max} = 40$ and the maximum idle time $I_{\max} = 1000$. There is only one idle option $\alpha_n$ for each server, with idle cost $\hat g(\alpha_n) = 0$. The setup time follows a geometric distribution with mean $\mathbb{E}(\hat\tau(\alpha_n))$, the setup cost is $\hat W_n(\alpha_n)$ per slot, the service cost is $e_n$ per slot, and the service amount $\mu_n$ follows a uniform distribution over integers. The values $1/\mathbb{E}(\hat\tau(\alpha_n))$ are generated uniformly at random within $[0,$
$1]$ and specified in Table 3.2. The algorithm is run for 1 million slots in each trial and each plot takes the average over these 1 million slots. We compare our algorithm to the optimal stationary algorithm, which is computed by a linear program [Fox66a] with full knowledge of the statistics of the requests and rejection costs.

In Fig. 3.4, we show that as the trade-off parameter $V$ gets larger, the average cost approaches the optimal value, achieving near-optimal performance. Furthermore, the cost curve drops rapidly when $V$ is small and becomes relatively flat when $V$ gets large, demonstrating the $O(1/V)$ optimality gap of Theorem 3.3.1. Fig. 3.5 plots the average sum queue size $\sum_{n=1}^5 Q_n(t)$ and shows that as $V$ gets larger, the average sum queue size becomes larger. We also plot the sum of the individual queue bounds from Lemma 3.3.1 for comparison. We can see that the real queue size grows linearly with $V$ (although the constant in Lemma 3.3.1 is not tight, given the much better delay obtained here), which demonstrates the $O(V)$ delay bound.

We then tune the requests $\lambda(t)$ to be uniform in $[20,$
$40]$ and keep the other parameters unchanged. In Fig. 3.6, we see that since the request rate is larger, $V$ must be larger in order to obtain near optimality, but the optimality gap still scales roughly as $O(1/V)$. Fig. 3.7 gives the sum average queue length in this case. The average queue length is larger than that of Fig. 3.5, again with linear growth in $V$.

Figure 3.4: Time average cost versus the $V$ parameter over 1 million slots.
Figure 3.5: Time average sum request queue length versus the $V$ parameter over 1 million slots.
Figure 3.6: Time average cost versus the $V$ parameter over 1 million slots.
Figure 3.7: Time average sum request queue length versus the $V$ parameter over 1 million slots.

The second simulation considers real data center traffic obtained from the open source data sets of the paper [BAM10]. The trace is plotted in Fig. 3.8. We synthesize different data chunks from the source so that the trace contains both a steady phase and an increasing phase. The total time duration is 2800 seconds with each slot equal to 20 ms. The peak traffic is 2120 requests per 20 ms, and the time average traffic over the whole interval is 654 requests per 20 ms.

We consider a data center consisting of 1060 homogeneous servers. We assume each server has only one sleep state and that the service quantity of each server per slot follows a Zipf's law with parameters $K = 10$ and $p = 1.$
$9$. This gives a service rate of each server of approximately $1.99 \approx 2$ requests per slot. We set the maximum idle time $I_{\max} = 5000$, while no such limit is imposed on any of the other benchmark algorithms.

We first run our proposed algorithm over the trace with virtualization (Section 3.4) for different $V$ values. We set the initial virtual queue backlogs $Q_n(0) = 2000$ for all $n$, and keep 20 servers always on. Fig. 3.9 and Fig. 3.10 plot the running average power consumption and the corresponding queue length for $V = 400,$
$800$, and $1200$, respectively. It can be seen that as $V$ gets large, the average power consumption does not improve much but the queue length changes drastically. This phenomenon results from the $[O(1/V), O(V)]$ trade-off of our proposed algorithm. In view of this fact, we choose $V = 600$, which gives a reasonable delay performance in Fig. 3.10.

Next, we compare our proposed algorithm, with the same initial setup and $V = 600$, to the following algorithms:

• Always-on with $N = 327$ active servers, the remaining servers staying in sleep mode. Note that 327 servers can support the average traffic over the whole interval, which is 654 requests per 20 ms.

The pdf of Zipf's law with parameters
$K, p$ is defined as $f(n; K, p) = \frac{1/n^p}{\sum_{i=1}^K 1/i^p}$, $n = 1, 2, \dots, K$. Thus, the mean of the distribution is $\sum_{i=1}^K 1/i^{p-1} \big/ \sum_{i=1}^K 1/i^p$.

Figure 3.9: Running average power consumption for different $V$ values.
Figure 3.10: Instantaneous queue length for different $V$ values.

• Always-on with full capacity. This corresponds to keeping all 1060 servers on in every slot.

• Reactive. This algorithm, developed in [GHBK12], reacts to the current traffic $\lambda(t)$ and maintains $k_{\text{react}}(t) = \big\lceil \overline{\lambda}(t)/2 \big\rceil$ servers on. In the simulation, we choose $\overline{\lambda}(t)$ to be the average of the traffic over the latest 10 slots. If the current number of active servers satisfies $k(t) > k_{\text{react}}(t)$, then we turn $k(t) - k_{\text{react}}(t)$ servers off; otherwise, we turn $k_{\text{react}}(t) - k(t)$ servers to the setup state.

• Reactive with extra capacity. This algorithm is similar to Reactive except that we introduce a virtual traffic flow of $p$ jobs per slot, so during each slot $t$ the algorithm maintains $k_{\text{react}}(t) = \big\lceil (\overline{\lambda}(t) + p)/2 \big\rceil$ servers on.

Fig. 3.11-3.13 plot the average power consumption, queue length and number of active servers, respectively. It can be seen that all algorithms perform quite well during the first half of the trace. During the second half, the traffic load is increasing. The Always-on algorithm with mean capacity does not adapt to the traffic, so its queue length blows up quickly. Because of the long setup time, the number of active servers under the Reactive algorithm fails to catch up with the increasing traffic, so its queue length also blows up. Our proposed algorithm minimizes the power consumption while stabilizing the queues, thereby outperforming both the Always-on and the Reactive algorithms. Note that Reactive with 200 jobs of extra capacity achieves a delay performance similar to our proposed algorithm, but with significantly more power consumption.

Figure 3.11: Running average power consumption from slot 1 to the current slot for different algorithms.
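As an aside, the Zipf mean formula above is easy to verify directly; with the simulation's parameters $K = 10$ and $p = 1.9$ it gives a per-server service rate close to 2 requests per slot:

```python
def zipf_pmf(K, p):
    """pmf f(n; K, p) = (1/n^p) / sum_{i=1}^K 1/i^p for n = 1..K."""
    norm = sum(1.0 / i**p for i in range(1, K + 1))
    return [(1.0 / n**p) / norm for n in range(1, K + 1)]

def zipf_mean(K, p):
    """Mean = (sum 1/i^(p-1)) / (sum 1/i^p), the closed form in the footnote."""
    return sum(1.0 / i**(p - 1) for i in range(1, K + 1)) / \
           sum(1.0 / i**p for i in range(1, K + 1))

pmf = zipf_pmf(10, 1.9)
direct = sum((n + 1) * w for n, w in enumerate(pmf))  # E[n] from the pmf itself
print(zipf_mean(10, 1.9), direct)  # both are approximately 1.99
```

With 1060 servers this matches the stated peak capacity of roughly $1060 \times 2 = 2120$ requests per 20 ms.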
Figure 3.12: Instantaneous queue length for different algorithms.
Figure 3.13: Number of active servers over time.

Finally, we evaluate the influence of different sleep modes on the performance. We keep the setup the same as before and consider sleep modes with sleep power consumption equal to 2 W and 4 W per slot, respectively. Since the Always-on and Reactive algorithms do not look at the sleep power consumption, their decisions remain the same as before; thus, we superpose the queue length of our proposed algorithm onto the previous Fig. 3.12 and get the queue length comparison in Fig. 3.14. We see from the plot that increasing the power consumption during the sleep mode only slightly increases the queue length under our proposed algorithm. Fig. 3.16 plots the running average power consumption under the different sleep modes. Despite spending more power in the sleep mode, the proposed algorithm still saves a considerable amount of power compared to the other algorithms while keeping the request queue stable. This shows that our algorithm is empirically robust to changes in the sleep mode.

Figure 3.14: Instantaneous queue length for different algorithms.

Appendix A— Proof of Lemma 3.2.1
We have that (3.33), shown at the bottom of this page, holds, where the first equality follows from the definition $T_n[f] = I_n[f] + \tau_n[f] + 1$ and the second equality follows from iterated expectations, conditioning on $I_n[f]$ and $\alpha_n[f]$:
$$D_n[f] = \frac{ V\hat W_n(\alpha_n[f])\, m_{\alpha_n[f]} + V e_n - Q_n(t_f^n)\mu_n + \mathbb{E}\Big( \tfrac{B_0}{2}\big(I_n[f] + \tau_n[f] + 1\big)\big(I_n[f] + \tau_n[f]\big) + V\hat g(\alpha_n[f]) I_n[f] \,\Big|\, Q_n(t_f^n) \Big) }{ \mathbb{E}\big( I_n[f] + \tau_n[f] + 1 \,\big|\, Q_n(t_f^n) \big) } - B_0$$
$$= \frac{ V\hat W_n(\alpha_n[f])\, m_{\alpha_n[f]} + V e_n - Q_n(t_f^n)\mu_n + \mathbb{E}\Big( \tfrac{B_0}{2}\Big(\big(I_n[f] + m_{\alpha_n[f]} + 1\big)\big(I_n[f] + m_{\alpha_n[f]}\big) + \sigma^2_{\alpha_n[f]}\Big) + V\hat g(\alpha_n[f]) I_n[f] \,\Big|\, Q_n(t_f^n) \Big) }{ \mathbb{E}\big( I_n[f] + m_{\alpha_n[f]} + 1 \,\big|\, Q_n(t_f^n) \big) } - B_0. \tag{3.33}$$

Figure 3.16: Running average power consumption for 0 W sleep cost (left), 2 W sleep cost (middle) and 4 W sleep cost (right).

For simplicity of notation, let
$$F(\alpha_n[f], I_n[f]) = V\hat W_n(\alpha_n[f])\, m_{\alpha_n[f]} + V e_n - Q_n(t_f^n)\mu_n + \frac{B_0}{2}\Big( \big(I_n[f] + m_{\alpha_n[f]} + 1\big)\big(I_n[f] + m_{\alpha_n[f]}\big) + \sigma^2_{\alpha_n[f]} \Big) + V\hat g(\alpha_n[f]) I_n[f],$$
$$G(\alpha_n[f], I_n[f]) = I_n[f] + m_{\alpha_n[f]} + 1,$$
so that
$$D_n[f] = \frac{\mathbb{E}\big( F(\alpha_n[f], I_n[f]) \,\big|\, Q_n(t_f^n) \big)}{\mathbb{E}\big( G(\alpha_n[f], I_n[f]) \,\big|\, Q_n(t_f^n) \big)} - B_0.$$
Given $Q_n(t_f^n)$ at frame $f$, denote the benchmark value over pure decisions as
$$m \triangleq \min_{I_n[f] \in \mathbb{N},\ I_n[f] \in [1, I_{\max}],\ \alpha_n[f] \in \mathcal{L}_n} \frac{F(\alpha_n[f], I_n[f])}{G(\alpha_n[f], I_n[f])}. \tag{3.34}$$
Then, for any randomized decision on $\alpha_n[f]$ and $I_n[f]$, its realization within frame $f$ satisfies
$$\frac{F(\alpha_n[f], I_n[f])}{G(\alpha_n[f], I_n[f])} \ge m, \quad \text{which implies} \quad F(\alpha_n[f], I_n[f]) \ge m\, G(\alpha_n[f], I_n[f]).$$
Taking conditional expectations on both sides gives
$$\mathbb{E}\big( F(\alpha_n[f], I_n[f]) \,\big|\, Q_n(t_f^n) \big) \ge m\, \mathbb{E}\big( G(\alpha_n[f], I_n[f]) \,\big|\, Q_n(t_f^n) \big) \;\Rightarrow\; \frac{\mathbb{E}\big( F(\alpha_n[f], I_n[f]) \,\big|\, Q_n(t_f^n) \big)}{\mathbb{E}\big( G(\alpha_n[f], I_n[f]) \,\big|\, Q_n(t_f^n) \big)} \ge m.$$
Thus, it is enough to consider pure decisions only, which boils down to computing (3.34). This proves the lemma.
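The step above is an instance of the general fact that a ratio of expectations under any randomized decision rule can never beat the best pure-decision ratio. A small numeric sketch with made-up $F$ and $G$ values (one pair per pure decision, $G > 0$):

```python
import random

# Hypothetical per-decision values of F and G, one pair per pure decision.
F_vals = [5.0, 2.0, 7.0, 3.0]
G_vals = [2.0, 1.0, 4.0, 3.0]

m = min(f / g for f, g in zip(F_vals, G_vals))  # best pure-decision ratio, as in (3.34)

rng = random.Random(0)
for _ in range(1000):
    # A random mixture over pure decisions plays the role of a randomized rule.
    w = [rng.random() for _ in F_vals]
    s = sum(w)
    w = [x / s for x in w]
    EF = sum(x * f for x, f in zip(w, F_vals))  # E(F) under the mixture
    EG = sum(x * g for x, g in zip(w, G_vals))  # E(G) under the mixture
    assert EF / EG >= m - 1e-12  # ratio of expectations >= best pure ratio

print("best pure ratio:", m)
```

The assertion follows exactly the argument in the proof: $F_i \ge m G_i$ termwise, so any nonnegative mixture preserves the inequality.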
Appendix B— Proof of Lemma 3.3.4
This section is dedicated to proving that $\mathbb{E}\big(X_n[f]^2\big)$ is bounded. First of all, since the idle option set $\mathcal{L}_n$ is finite, define
$$W_{\max} = \max_{\alpha_n \in \mathcal{L}_n} \hat W_n(\alpha_n), \qquad g_{\max} = \max_{\alpha_n \in \mathcal{L}_n} \hat g_n(\alpha_n).$$
It is obvious that $|W_n(t) - W_n^*| \le W_{\max}$, $|g_n(t) - G_n^*| \le g_{\max}$, $|e_n H_n(t) - E_n^*| \le e_n$, and $|\mu_n H_n(t) - \mu_n^*| \le \mu_n$. Combining these with the boundedness of the queues in Lemma 3.3.1, it follows that
$$|X_n[f]| \le \sum_{t=t_f^n}^{t_{f+1}^n - 1} \Big( V(W_{\max} + e_n + g_{\max}) + (Vc_{\max} + R_{\max})\mu_n + \big(t - t_f^n\big)B_0 + \Psi_n \Big) \le \big( V(W_{\max} + e_n + g_{\max}) + (Vc_{\max} + R_{\max})\mu_n + \Psi_n \big) T_n[f] + \frac{T_n[f](T_n[f]-1)}{2}B_0.$$
Defining $B_1 \triangleq V(W_{\max} + e_n + g_{\max}) + (Vc_{\max} + R_{\max})\mu_n + \Psi_n + B_0/2$, it follows that $|X_n[f]| \le B_1 T_n[f] + B_1 T_n[f]^2$. Thus,
$$\mathbb{E}\big(X_n[f]^2\big) \le B_1^2\, \mathbb{E}\big(T_n[f]^2\big) + 2B_1^2\, \mathbb{E}\big(T_n[f]^3\big) + B_1^2\, \mathbb{E}\big(T_n[f]^4\big).$$
Notice that $T_n[f] \le I_n[f] + \tau_n[f] + 1$ by (3.2), where $I_n[f]$ is upper bounded by $I_{\max}$ and $\tau_n[f]$ has its first four moments bounded by Assumption 3.1.2. Thus, $\mathbb{E}\big(X_n[f]^2\big)$ is bounded by a fixed constant.

Appendix C— Proof of Lemma 3.3.5
Proof.
Let us first abbreviate notation by defining
$$Y(t) = V\big(W_n(t) + e_n H_n(t) + g_n(t)\big) - Q_n(t_f^n)\big(\mu_n H_n(t) - \mu_n^*\big) + \big(t - t_f^n\big)B_0.$$
For any $T \in [t_F^n, t_{F+1}^n)$, we can bound the partial sums from above as
$$\sum_{t=0}^{T-1} Y(t) \le \sum_{t=0}^{t_F^n - 1} Y(t) + B_2 T_n[F] + B_2 T_n[F]^2,$$
where $B_0 = (R_{\max} + \mu_{\max})\mu_{\max}$ is defined in (3.15) and $B_2 \triangleq V(W_{\max} + e_n + g_{\max}) + (Vc_{\max} + R_{\max})\mu_n + B_0/2$. Thus,
$$\frac{1}{T}\sum_{t=0}^{T-1} Y(t) \le \frac{1}{T}\sum_{t=0}^{t_F^n - 1} Y(t) + \frac{1}{T}\big( B_2 T_n[F] + B_2 T_n[F]^2 \big) \le \max\{a[F], b[F]\},$$
where
$$a[F] \triangleq \frac{1}{t_F^n}\sum_{t=0}^{t_F^n - 1} Y(t) + \frac{1}{t_F^n}\big( B_2 T_n[F] + B_2 T_n[F]^2 \big), \qquad b[F] \triangleq \frac{1}{t_{F+1}^n}\sum_{t=0}^{t_F^n - 1} Y(t) + \frac{1}{t_{F+1}^n}\big( B_2 T_n[F] + B_2 T_n[F]^2 \big).$$
This implies that
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} Y(t) \le \limsup_{F\to\infty} \max\{a[F], b[F]\} = \max\Big\{ \limsup_{F\to\infty} a[F],\ \limsup_{F\to\infty} b[F] \Big\}.$$
We then work out upper bounds for $\limsup_{F\to\infty} a[F]$ and $\limsup_{F\to\infty} b[F]$, respectively.

1. Bound for $\limsup_{F\to\infty} a[F]$:
$$\limsup_{F\to\infty} a[F] \le \limsup_{F\to\infty} \frac{1}{t_F^n}\sum_{t=0}^{t_F^n - 1} Y(t) + \limsup_{F\to\infty} \frac{1}{t_F^n}\big( B_2 T_n[F] + B_2 T_n[F]^2 \big) \le V\big(W_n^* + E_n^* + G_n^*\big) + \Psi_n + \limsup_{F\to\infty} \frac{1}{t_F^n}\big( B_2 T_n[F] + B_2 T_n[F]^2 \big),$$
where the second inequality follows from Corollary 3.3.1. It remains to show that
$$\limsup_{F\to\infty} \frac{1}{t_F^n}\big( B_2 T_n[F] + B_2 T_n[F]^2 \big) \le 0. \tag{3.35}$$
Since $t_F^n \ge F$, it is enough to show that
$$\limsup_{F\to\infty} \frac{T_n[F]}{F} = 0, \tag{3.36}$$
$$\limsup_{F\to\infty} \frac{T_n[F]^2}{F} = 0. \tag{3.37}$$
We prove (3.37); the proof of (3.36) is similar. Since each $T_n[F] = I_n[F] + \tau_n[F] + 1$, where $I_n[F] \le I_{\max}$ and $\tau_n[F]$ has bounded first four moments, the first four moments of $T_n[F]$ must also be bounded, so there exists a constant $C > 0$ such that $\mathbb{E}\big(T_n[F]^4\big) \le C$. For any $\epsilon > 0$, define a sequence of events $A_F^\epsilon \triangleq \big\{ T_n[F]^2 > \epsilon F \big\}$. By the Markov inequality,
$$\Pr[A_F^\epsilon] \le \frac{\mathbb{E}\big(T_n[F]^4\big)}{\epsilon^2 F^2} \le \frac{C}{\epsilon^2 F^2}.$$
Thus,
$$\sum_{F=1}^\infty \Pr[A_F^\epsilon] \le \frac{C}{\epsilon^2} \sum_{F=1}^\infty \frac{1}{F^2} < \infty.$$
By the Borel-Cantelli lemma (Lemma 1.6.1 in [Dur13]),
$$\Pr\big[ A_F^\epsilon \text{ occurs infinitely often} \big] = 0,$$
which implies
$$\Pr\Big[ \limsup_{F\to\infty} \frac{T_n[F]^2}{F} > \epsilon \Big] = 0.$$
Since $\epsilon$ is arbitrary, this implies (3.37). Similarly, (3.36) can be proved. Thus, (3.35) holds and
$$\limsup_{F\to\infty} a[F] \le V\big(W_n^* + E_n^* + G_n^*\big) + \Psi_n.$$
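The Markov/Borel-Cantelli step can be illustrated numerically: when the frame lengths have bounded fourth moments, the event $T_n[F]^2 > \epsilon F$ occurs only finitely often. A sketch with a bounded idle time and a geometric setup time standing in for $\tau_n[F]$ (all parameters hypothetical):

```python
import random

def tail_fraction(eps=0.5, horizon=50000, seed=3):
    """Count how often T[F]^2 > eps*F among frames F = 1..horizon, where
    T[F] = I[F] + tau[F] + 1 with bounded I and a geometric tau (all
    moments finite). Returns (number of exceedances, last exceeding F)."""
    rng = random.Random(seed)
    exceed = 0
    last_exceed = 0
    for F in range(1, horizon + 1):
        I = rng.randint(1, 10)          # idle time, bounded
        tau = 0                         # geometric setup time (mean 3)
        while rng.random() > 0.25:
            tau += 1
        T = I + tau + 1
        if T * T > eps * F:
            exceed += 1
            last_exceed = F
    return exceed, last_exceed

exceed, last = tail_fraction()
# Only finitely many early frames exceed eps*F; none appear near the horizon.
print(exceed, last)
```

Since $T[F]^2$ stays bounded in probability while $\epsilon F$ grows linearly, all exceedances cluster at small $F$, which is exactly the almost-sure statement behind (3.37).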
2. Bound for $\limsup_{F\to\infty} b[F]$:
$$\limsup_{F\to\infty} b[F] \le \limsup_{F\to\infty} \frac{1}{t_F^n}\sum_{t=0}^{t_F^n - 1} Y(t) \cdot \frac{t_F^n}{t_{F+1}^n} + \limsup_{F\to\infty} \frac{1}{t_{F+1}^n}\big( B_2 T_n[F] + B_2 T_n[F]^2 \big) \le \limsup_{F\to\infty} \frac{1}{t_F^n}\sum_{t=0}^{t_F^n - 1} Y(t) \cdot \limsup_{F\to\infty} \frac{t_F^n}{t_{F+1}^n} \le \Big( V\big(W_n^* + E_n^* + G_n^*\big) + \Psi_n \Big) \cdot \limsup_{F\to\infty} \frac{t_F^n}{t_{F+1}^n} \le V\big(W_n^* + E_n^* + G_n^*\big) + \Psi_n,$$
where the second inequality follows from (3.35), the third inequality follows from Corollary 3.3.1, and the last inequality follows from the facts that $V\big(W_n^* + E_n^* + G_n^*\big) + \Psi_n > 0$ and $t_F^n / t_{F+1}^n \le 1$.

Appendix D— Proof of Theorem 3.3.1
Proof.
Define the drift-plus-penalty (DPP) expression $P(t)$ as follows:
$$P(t) = V\Bigg( c(t)r(t) + \sum_{n=1}^N \big(W_n(t) + e_n H_n(t) + g_n(t)\big) \Bigg) + \frac{1}{2}\sum_{n=1}^N \big( Q_n(t+1)^2 - Q_n(t)^2 \big).$$
By simple algebra using the queue update rule (3.1), we can bound $P(t)$ from above as follows:
$$P(t) \le \frac{1}{2}\sum_{n=1}^N \big(R_n(t) + \mu_n\big)^2 + V\Bigg( c(t)r(t) + \sum_{n=1}^N \big(W_n(t) + e_n H_n(t) + g_n(t)\big) \Bigg) + \sum_{n=1}^N Q_n(t)\big(R_n(t) - \mu_n H_n(t)\big)$$
$$\le B + V\Bigg( c(t)r(t) + \sum_{n=1}^N \big(W_n(t) + e_n H_n(t) + g_n(t)\big) \Bigg) + \sum_{n=1}^N Q_n(t)\big(R_n(t) - \mu_n H_n(t)\big)$$
$$\le B + V c(t)r(t) + \sum_{n=1}^N Q_n(t)\big(R_n(t) - R_n^*\big) + V\sum_{n=1}^N \big(W_n(t) + e_n H_n(t) + g_n(t)\big) + \sum_{n=1}^N Q_n(t)\big(\mu_n^* - \mu_n H_n(t)\big),$$
where $B = \frac{1}{2}\sum_{n=1}^N (R_{\max} + \mu_n)^2$, and the last inequality follows from adding $\sum_{n=1}^N Q_n(t)\mu_n^*$ and subtracting $\sum_{n=1}^N Q_n(t)R_n^*$, using the fact that the best randomized stationary algorithm must also satisfy the constraint (3.5), i.e. $\mu_n^* \ge R_n^*$.

Now we take the partial average of $P(t)$ from $0$ to $T-1$ and let $T\to\infty$:
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} P(t) \le B + \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \Bigg( V c(t)r(t) + \sum_{n=1}^N Q_n(t)\big(R_n(t) - R_n^*\big) \Bigg) + \sum_{n=1}^N \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \Big( V\big(W_n(t) + e_n H_n(t) + g_n(t)\big) + Q_n(t)\big(\mu_n^* - \mu_n H_n(t)\big) \Big). \tag{3.38}$$
According to (3.29),
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \Bigg( V c(t)r(t) + \sum_{n=1}^N Q_n(t)\big(R_n(t) - R_n^*\big) \Bigg) \le V C^*.$$
(3.39)

On the other hand,
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \Big( V\big(W_n(t) + e_n H_n(t) + g_n(t)\big) + Q_n(t)\big(\mu_n^* - \mu_n H_n(t)\big) \Big) \le \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \Big( V\big(W_n(t) + e_n H_n(t) + g_n(t)\big) + Q_n(t_f^n)\big(\mu_n^* - \mu_n H_n(t)\big) + \big(t - t_f^n\big)B_0 \Big) \le V\big(W_n^* + E_n^* + G_n^*\big) + \Psi_n, \tag{3.40}$$
where $B_0 = (R_{\max} + \mu_{\max})\mu_{\max}$ as defined below (3.15), the first inequality follows from the fact that for any $t \in \big(t_f^n, t_{f+1}^n\big)$,
$$Q_n(t)\big(\mu_n^* - \mu_n H_n(t)\big) \le Q_n(t_f^n)\big(\mu_n^* - \mu_n H_n(t)\big) + \big(Q_n(t) - Q_n(t_f^n)\big)\big(\mu_n^* - \mu_n H_n(t)\big) \le Q_n(t_f^n)\big(\mu_n^* - \mu_n H_n(t)\big) + \sum_{s=t_f^n}^{t-1} \big(R_n(s) - \mu_n H_n(s)\big)\big(\mu_n^* - \mu_n H_n(t)\big) \le Q_n(t_f^n)\big(\mu_n^* - \mu_n H_n(t)\big) + \big(t - t_f^n\big)B_0,$$
and the second inequality in (3.40) follows from Lemma 3.3.5. Substituting (3.39) and (3.40) into (3.38) gives
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} P(t) \le V\Bigg( C^* + \sum_{n=1}^N \big(W_n^* + E_n^* + G_n^*\big) \Bigg) + B + \sum_{n=1}^N \Psi_n. \tag{3.41}$$
Finally, notice that by telescoping sums,
$$\limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} P(t) = \limsup_{T\to\infty} \Bigg( \frac{V}{T}\sum_{t=0}^{T-1} \bigg( c(t)r(t) + \sum_{n=1}^N \big(W_n(t) + e_n H_n(t) + g_n(t)\big) \bigg) + \frac{1}{2T}\sum_{n=1}^N Q_n(T)^2 \Bigg) \ge V \cdot \limsup_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T-1} \bigg( c(t)r(t) + \sum_{n=1}^N \big(W_n(t) + e_n H_n(t) + g_n(t)\big) \bigg).$$
Substituting this inequality into (3.41) and dividing both sides by $V$ gives the desired result.

Chapter 4
Power Aware Wireless File Downloading and Restless Bandit via Renewal Optimization

In this chapter, we look at another application of renewal optimization, namely wireless file downloading. We start with a simple single-user file downloading problem and show that it can be characterized by a 2-state Markov decision process (MDP) with constraints, for which the drift-plus-penalty (DPP) ratio algorithm (Algorithm 1) applies. We then consider a more realistic multi-user file downloading problem and show that it is a constrained version of the well-known restless bandit problem, for which we develop a
DPP ratio indexing heuristicbased on the coupled renewal optimization.
Consider a wireless access point, such as a base station or femto node, that delivers files to $N$ different wireless users. The system operates in slotted time with slots $t \in \{0, 1, 2, \dots\}$. Each user can download at most one file at a time. File sizes are random and complete delivery of a file requires a random number of time slots. A new file request is made by each user at a random time after it finishes its previous download. Let $F_n(t) \in \{0, 1\}$ represent the binary file state process for user $n \in \{1, \dots, N\}$. The state $F_n(t) = 1$ means that user $n$ is currently downloading a file, while the state $F_n(t) = 0$ means that user $n$ is currently idle.

Idle times are assumed to be independent and geometrically distributed with parameter $\lambda_n$ for each user $n$, so that the average idle time is $1/\lambda_n$. Active times depend on the random file size and the transmission decisions that are made. Every slot $t$, the access point observes which users are active and decides to serve a subset of at most $M$ users, where $M$ is the maximum number of simultaneous transmissions allowed in the system ($M < N$ is assumed throughout). The goal is to maximize a weighted sum of throughput subject to a total average power constraint.

The file state processes $F_n(t)$ are coupled controlled Markov chains that form a total state $(F_1(t), \dots, F_N(t))$ that can be viewed as a restless multi-armed bandit system. Such problems are complex due to the inherent curse of dimensionality.

We first compute an online optimal algorithm for 1-user systems, i.e., the case $N = 1$. This simple case avoids the curse of dimensionality and provides valuable intuition. The optimal policy here is computed via the drift-plus-penalty (DPP) ratio algorithm. The resulting algorithm makes a greedy transmission decision that affects success probability and power usage. Next, the algorithm is extended as a low complexity online heuristic for the $N$-user problem, which we call "DPP ratio indexing".
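The coupled file-state dynamics can be sketched in a few lines. The scheduler below is a placeholder (it simply serves the lowest-numbered active users), not the DPP ratio indexing policy developed later, and all parameters are hypothetical:

```python
import random

def simulate_file_states(N=4, M=2, lam=0.3, phi=0.5, T=10000, seed=7):
    """Simulate the coupled binary file-state processes F_n(t).
    A served active user finishes its file w.p. phi; an idle user
    becomes active w.p. lam. At most M users are served per slot."""
    rng = random.Random(seed)
    F = [0] * N                      # all users idle initially
    completions = 0
    for _ in range(T):
        active = [n for n in range(N) if F[n] == 1]
        served = active[:M]          # placeholder scheduler: lowest indices first
        for n in range(N):
            if F[n] == 1:
                if n in served and rng.random() < phi:
                    F[n] = 0         # download completes
                    completions += 1
            else:
                if rng.random() < lam:
                    F[n] = 1         # new file request arrives
    return completions / T           # files completed per slot

print(simulate_file_states())
```

Even this toy version exhibits the restless structure: unserved arms (idle users) keep evolving, and the completion rate is jointly limited by the idle-time parameter and by the $M$-server constraint.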
The heuristic has the following desirable properties:

• Implementation of the $N$-user heuristic is as simple as comparing indices for $N$ different 1-user problems.
• The $N$-user heuristic is analytically shown to meet the desired average power constraint.
• The $N$-user heuristic is shown in simulation to perform well over a wide range of parameters. Specifically, it is very close to optimal in example cases where an offline optimal can be computed.
• The $N$-user heuristic is shown to be optimal in a special case with no power constraint and with certain additional assumptions. The optimality proof uses a theory of stochastic coupling for queueing systems [TE93].

Prior work on wireless optimization uses Lyapunov functions to maximize throughput in cases where the users are assumed to have an infinite amount of data to send [NML08, ES07, GNT+] (or to drop data if the arrival rate vector is outside of the capacity region, e.g. [NML08]). These models do not consider the interplay between arrivals at the transport layer and file delivery at the network layer. For example, a web user in a coffee shop may want to evaluate the file she downloaded before initiating another download. The current work captures this interplay through the binary file state processes $F_n(t)$. This creates a complex problem of coupled Markov chains. This problem is fundamental to file downloading systems. The modeling and analysis of these systems is a significant contribution of the current thesis.

To understand this issue, suppose the data arrival rate is fixed and does not adapt to the service received over the network. If this arrival rate exceeds the network capacity by a factor of two, then at least half of all data must be dropped. This can result in an unusable data stream, possibly one that contains every odd-numbered packet. A more practical model assumes that full files must be downloaded and that new downloads are only initiated when previous ones are completed.
A general model in this direction would allow each user to download up to $K$ files simultaneously. This thesis considers the case $K = 1$, so that each user is either actively downloading a file or is idle. The resulting system for $N$ users has a nontrivial Markov structure with $2^N$ states.

Since the current problem includes both time-average constraints (on average power expenditure) and instantaneous constraints that restrict the number of users that can be served in one slot, it is more complicated than the weakly coupled systems discussed in previous chapters. More specifically, the latter service restriction is similar to a traditional restless multi-armed bandit (RMAB) system [Whi88]. The RMAB problem considers a population of $N$ parallel MDPs that continue evolving whether in operation or not (although according to different rules). The goal is to choose the MDPs in operation during each time slot so as to maximize the expected reward subject to a constraint on the number of MDPs in operation. The problem is in general complex (see P-SPACE hardness results in [PT99]). A standard low-complexity heuristic for such problems is the Whittle index technique [Whi88]. However, the Whittle index framework applies only when there are two options in each state (active and passive). Further, it does not consider additional time average cost constraints. The
DPP ratio indexing algorithm developed in the current work can be viewed as an alternative indexing scheme that can always be implemented and that incorporates additional time average constraints. It is likely that the techniques of the current work can be extended to other constrained RMAB problems. Prior work in [TE93] develops a Lyapunov drift method to show that the longest connected queue algorithm is delay optimal in a multi-dimensional queueing system with special symmetric assumptions. The problem in [TE93] is different from that of the current work. However, a similar coupling approach is used below to show that, for a special case with no power constraint, the DPP ratio indexing algorithm is throughput optimal in certain asymmetric cases. As a consequence, the proof shows the policy is also optimal for a different setting with $M$ servers, $N$ single-buffer queues, and arbitrary packet arrival rates $(\lambda_1, \dots, \lambda_N)$.

(Footnote: One way to allow a user $n$ to download up to $K$ files simultaneously is as follows. Define $K$ virtual users with separate binary file state processes. The transition probability from idle to active for each of these virtual users is $\lambda_n/K$. The conditional rate of total new arrivals for user $n$, given that $m$ files are currently in progress, is then $\lambda_n(1 - m/K)$ for $m \in \{0, 1, \dots, K\}$.)

Consider a file downloading system that consists of only one user that repeatedly downloads files. Let $F(t) \in \{0, 1\}$ be the file state process of the user. State "1" means there is a file in the system that has not completed its download, and "0" means no file is waiting. The length of each file is independent and is either exponentially distributed or geometrically distributed (described in more detail below). Let $B$ denote the expected file size in bits. Time is slotted. At each slot in which there is an active file for downloading, the user makes a service decision that affects both the downloading success probability and the power expenditure.
After a file is downloaded, the system goes idle (state 0) and remains in the idle state for a random amount of time that is independent and geometrically distributed with parameter $\lambda > 0$. The user makes a decision on each slot $t$ in which $F(t) = 1$. The decision affects the number of bits that are sent, the probability these bits are successfully received, and the power usage. Let $\alpha(t)$ denote the decision variable at slot $t$ and let $\mathcal{A}$ represent an abstract action set. The set $\mathcal{A}$ can represent a collection of modulation and coding options for each transmission. Assume also that $\mathcal{A}$ contains an idle action denoted as "0." The decision $\alpha(t)$ determines the following two values:
• The probability of successfully downloading a file, $\phi(\alpha(t))$, where $\phi(\cdot) \in [0, 1]$ with $\phi(0) = 0$.
• The power expenditure $p(\alpha(t))$, where $p(\cdot)$ is a nonnegative function with $p(0) = 0$.

The user chooses $\alpha(t) = 0$ whenever $F(t) = 0$. The user chooses $\alpha(t) \in \mathcal{A}$ for each slot $t$ in which $F(t) = 1$, with the goal of maximizing throughput subject to a time-average power constraint. An example where the decision set $\mathcal{A}$ is finite can be found in the simulation experiment section. Here is a simple example where the decision can be continuous:

Example 1.
Let $\mathcal{A}$ be the set of all possible power allocation options, i.e., $\mathcal{A} := [p_{min}, p_{max}] \cup \{0\}$, where $p_{min}, p_{max} > 0$ are constants. Then $\alpha(t) \in [p_{min}, p_{max}] \cup \{0\}$, $p(\alpha(t)) = \alpha(t)$, and the success probability of downloading a file can be $\phi(\alpha(t)) = 1 - \exp(-\alpha(t))$.

The problem can be described by a two-state Markov decision process with binary state $F(t)$. Given $F(t) = 1$, a file is currently in the system. This file will finish its download at the end of the slot with probability $\phi(\alpha(t))$. Hence, the transition probabilities out of state 1 are:

$Pr[F(t+1) = 0 \mid F(t) = 1] = \phi(\alpha(t))$   (4.1)
$Pr[F(t+1) = 1 \mid F(t) = 1] = 1 - \phi(\alpha(t))$   (4.2)

Given $F(t) = 0$, the system is idle and will transition to the active state in the next slot with probability $\lambda$, so that:

$Pr[F(t+1) = 1 \mid F(t) = 0] = \lambda$   (4.3)
$Pr[F(t+1) = 0 \mid F(t) = 0] = 1 - \lambda$   (4.4)

Define the throughput, measured in bits per slot, as:

$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} B\phi(\alpha(t))$

The file downloading problem reduces to the following:

Maximize: $\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} B\phi(\alpha(t))$   (4.5)
Subject to: $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} p(\alpha(t)) \leq \beta$   (4.6)
$\alpha(t) \in \mathcal{A}$ for all $t \in \{0, 1, 2, \ldots\}$ such that $F(t) = 1$   (4.7)
Transition probabilities satisfy (4.1)-(4.4)   (4.8)

where $\beta$ is a positive constant that determines the desired average power constraint. The above model assumes that file completion success on slot $t$ depends only on the transmission decision $\alpha(t)$, independent of history. This implicitly assumes that file length distributions have a memoryless property where the residual file length is independent of the amount already delivered. Further, it is assumed that if the controller selects a transmission rate that is larger than the residual bits in the file, the remaining portion of the transmission is padded with fill bits. This ensures error events provide no information about the residual file length beyond the already known 0/1 binary file state.
Of course, error probability might be improved by removing padded bits. However, this affects only the last transmission of a file and has negligible impact when expected file size is large in comparison to the amount that can be transmitted in one slot. Note that padding is not needed in the special case when all transmissions send one fixed-length packet.

The memoryless property holds when each file $i$ has independent length $B_i$ that is exponentially distributed with mean length $B$ bits, so that $Pr[B_i > x] = e^{-x/B}$ for $x > 0$. Suppose the transmission rate $r(t)$ (in units of bits/slot) and the transmission success probability $q(t)$ are given by general functions of $\alpha(t)$:

$r(t) = \hat{r}(\alpha(t))$
$q(t) = \hat{q}(\alpha(t))$

Then the file completion probability $\phi(\alpha(t))$ is the probability that the residual amount of bits in the file is less than or equal to $r(t)$, and that the transmission of these residual bits is a success. By the memoryless property of the exponential distribution, the residual file length is distributed the same as the original file length. Thus:

$\phi(\alpha(t)) = \hat{q}(\alpha(t)) Pr[B_i \leq \hat{r}(\alpha(t))] = \hat{q}(\alpha(t)) \int_0^{\hat{r}(\alpha(t))} \frac{1}{B} e^{-x/B} dx$   (4.9)

Alternatively, history independence holds when each file $i$ consists of a random number $Z_i$ of fixed-length packets, where $Z_i$ is geometrically distributed with mean $Z = 1/\mu$. Assume each transmission sends exactly one packet, but different power levels affect the transmission success probability $q(t) = \hat{q}(\alpha(t))$. Then:

$\phi(\alpha(t)) = \mu \hat{q}(\alpha(t))$   (4.10)

The memoryless file length assumption allows the file state to be modeled by a simple binary-valued process $F(t) \in \{0, 1\}$. However, actual file sizes may not have an exponential or geometric distribution. One way to treat general distributions is to approximate the file sizes as being memoryless by using a $\phi(\alpha(t))$ function defined by either (4.9) or (4.10), formed by matching the average file size $B$ or average number of packets $Z$.
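The completion probabilities (4.9) and (4.10) are straightforward to compute; the exponential-model integral evaluates in closed form, since $\int_0^r \frac{1}{B}e^{-x/B}\,dx = 1 - e^{-r/B}$. The sketch below assumes illustrative rate and success functions $\hat{r}$, $\hat{q}$ (these particular maps are not from the text):

```python
import math

def phi_exponential(alpha, r_hat, q_hat, B):
    """File completion probability under the exponential model (4.9):
    phi = q_hat(alpha) * Pr[B_i <= r_hat(alpha)] = q_hat(alpha) * (1 - e^{-r/B})."""
    return q_hat(alpha) * (1.0 - math.exp(-r_hat(alpha) / B))

def phi_geometric(alpha, q_hat, mu):
    """Geometric packet-count model (4.10): one fixed-length packet per
    transmission, file ends w.p. mu on each successful transmission."""
    return mu * q_hat(alpha)

# Hypothetical action-to-rate and action-to-success maps (illustrative only).
r_hat = lambda a: 500.0 * a           # bits sent per slot
q_hat = lambda a: 1.0 - math.exp(-a)  # success probability (cf. Example 1)
B = 1000.0                            # mean file size, bits

p_complete = phi_exponential(2.0, r_hat, q_hat, B)  # a value in [0, 1]
```

Note that $\phi(0) = 0$ falls out automatically here, since the idle action has zero rate and zero success probability.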
The decisions $\alpha(t)$ are made according to the algorithm below, but the actual event outcomes that arise from these decisions are not memoryless. A simulation comparison of this approximation is provided in Section 4.5, where it is shown to be remarkably accurate (see Fig. 4.7).

The algorithm in this section optimizes over the class of all algorithms that do not use residual file length information. This maintains low complexity by ensuring a user has a binary-valued Markov state $F(t) \in \{0, 1\}$. While a system controller might know the residual file length, incorporating this knowledge creates a Markov decision problem with an infinite number of states (one for each possible value of residual length), which significantly complicates the scenario.

This subsection develops an online algorithm for problem (4.5)-(4.8). This algorithm follows from Algorithm 1 in Chapter 1 with some customizations towards this application. First, notice that file state "1" is recurrent under any decisions for $\alpha(t)$. Denote $t_k$ as the $k$-th time when the system returns to state "1." Define the renewal frame as the time period between $t_k$ and $t_{k+1}$. Define the frame size:

$T[k] = t_{k+1} - t_k$

Note that $T[k] = 1$ for any frame $k$ in which the file does not complete its download. If the file is completed on frame $k$, then $T[k] = 1 + G_k$, where $G_k$ is a geometric random variable with mean $E(G_k) = 1/\lambda$. Each frame $k$ involves only a single decision $\alpha(t_k)$ that is made at the beginning of the frame. Thus, the total power used over the duration of frame $k$ is:

$\sum_{t=t_k}^{t_{k+1}-1} p(\alpha(t)) = p(\alpha(t_k))$   (4.11)

We treat the time-average constraint in (4.6) using a virtual queue $Q[k]$ that is updated every frame $k$ by:

$Q[k+1] = \max\{Q[k] + p(\alpha(t_k)) - \beta T[k], \; 0\}$   (4.12)

with initial condition $Q[0] = 0$.
The algorithm is then parameterized by a constant $V \geq 0$. At the beginning of the $k$-th renewal frame, the user observes virtual queue $Q[k]$ and chooses $\alpha(t_k)$ to maximize the following drift-plus-penalty (DPP) ratio:

$\max_{\alpha(t_k) \in \mathcal{A}} \frac{V B \phi(\alpha(t_k)) - Q[k] p(\alpha(t_k))}{E[T[k] \mid \alpha(t_k)]}$   (4.13)

The numerator of the above ratio adds a "queue drift term" $-Q[k] p(\alpha(t_k))$ to the "current reward term" $V B \phi(\alpha(t_k))$. The intuition is that it is desirable to have a large value of current reward, but it is also desirable to have a large drift (since this tends to decrease queue size). Creating a weighted sum of these two terms and dividing by the expected frame size gives a simple index. The next subsections show that, for the context of the current work, this index leads to an algorithm that pushes throughput arbitrarily close to optimal (depending on the chosen $V$ parameter) with a strong sample path guarantee on average power expenditure.

The denominator in (4.13) can easily be computed via the transition model (4.1)-(4.4):

$E[T[k] \mid \alpha(t_k)] = 1 - \phi(\alpha(t_k)) + \phi(\alpha(t_k)) \left(1 + \frac{1}{\lambda}\right) = 1 + \frac{\phi(\alpha(t_k))}{\lambda}$   (4.14)

Thus, (4.13) is equivalent to:

$\max_{\alpha(t_k) \in \mathcal{A}} \frac{V B \phi(\alpha(t_k)) - Q[k] p(\alpha(t_k))}{1 + \phi(\alpha(t_k))/\lambda}$   (4.15)

This gives the following Algorithm 5 for the single-user case:

Algorithm 5.
• At each time $t_k$, the user observes virtual queue $Q[k]$ and chooses $\alpha(t_k)$ as the solution to (4.15) (where ties are broken arbitrarily).
• The value $Q[k+1]$ is computed according to (4.12) at the end of the $k$-th frame.

The expected performance analysis of this algorithm follows from that of Section 1.1.4, and we omit the details for brevity. In the following, we give a stronger probability-1 performance analysis taking into account the special properties of the algorithm in this customized setting. In this section, we show that the proposed algorithm makes the virtual queue deterministically bounded.
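Before the analysis, Algorithm 5 can be sketched in a few lines. This is a minimal simulation under stated assumptions: a finite action set containing the idle action 0, $p(\alpha) = \alpha$, and $\phi(\alpha) = 1 - e^{-\alpha}$ as in Example 1; all parameter values are illustrative.

```python
import math
import random

def algorithm5(actions, phi, p, B, lam, beta, V, num_frames, seed):
    """Single-user DPP ratio indexing (Algorithm 5), a sketch.
    Each frame: maximize the ratio (4.15), then update Q via (4.12)."""
    rng = random.Random(seed)
    Q, bits, power, slots = 0.0, 0.0, 0.0, 0
    for _ in range(num_frames):
        # Choose alpha maximizing (V*B*phi - Q*p) / (1 + phi/lam), cf. (4.15).
        alpha = max(actions,
                    key=lambda a: (V * B * phi(a) - Q * p(a)) / (1 + phi(a) / lam))
        T = 1                             # frame lasts 1 slot if no completion
        if rng.random() < phi(alpha):     # file completes: geometric idle time,
            T += 1                        # so T = 1 + G with E(G) = 1/lam
            while rng.random() >= lam:
                T += 1
        Q = max(Q + p(alpha) - beta * T, 0.0)   # virtual queue update (4.12)
        bits += B * phi(alpha)
        power += p(alpha)
        slots += T
    return bits / slots, power / slots, Q

phi = lambda a: 1.0 - math.exp(-a)   # success probability, as in Example 1
p = lambda a: a                      # power equals the allocated level
actions = [0.0, 0.5, 1.0, 2.0]       # finite subset of [p_min, p_max], plus idle
throughput, avg_power, Q_final = algorithm5(
    actions, phi, p, B=1000.0, lam=0.3, beta=0.5, V=1.0, num_frames=5000, seed=0)
```

With these parameters the bound of Lemma 4.2.2 below gives $Q[k] \leq VB/p_{min} + p_{max} - \beta = 2001.5$ on every sample path, which the run respects.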
Lemma 4.2.1.
If there is a constant $C \geq 0$ such that $Q[k] \leq C$ for all $k \in \{0, 1, 2, \ldots\}$, then:

$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} p(\alpha(t)) \leq \beta$

Proof.
From (4.12), we know that for each frame $k$:

$Q[k+1] \geq Q[k] + p(\alpha(t_k)) - T[k]\beta$

Rearranging terms and using $T[k] = t_{k+1} - t_k$ gives:

$p(\alpha(t_k)) \leq (t_{k+1} - t_k)\beta + Q[k+1] - Q[k]$

Fix $K > 0$. Summing over $k \in \{0, 1, \ldots, K-1\}$ gives:

$\sum_{k=0}^{K-1} p(\alpha(t_k)) \leq (t_K - t_0)\beta + Q[K] - Q[0] \leq t_K \beta + C$

The sum power over the first $K$ frames is the same as the sum up to time $t_K - 1$, and so:

$\sum_{t=0}^{t_K - 1} p(\alpha(t)) \leq t_K \beta + C$

Dividing by $t_K$ gives:

$\frac{1}{t_K} \sum_{t=0}^{t_K - 1} p(\alpha(t)) \leq \beta + C/t_K$

Taking $K \to \infty$, then:

$\limsup_{K \to \infty} \frac{1}{t_K} \sum_{t=0}^{t_K - 1} p(\alpha(t)) \leq \beta$   (4.16)

Now for each positive integer $T$, let $K(T)$ be the integer such that $t_{K(T)} \leq T < t_{K(T)+1}$. Since power is only used at the first slot of a frame, one has:

$\frac{1}{T} \sum_{t=0}^{T-1} p(\alpha(t)) \leq \frac{1}{t_{K(T)}} \sum_{t=0}^{t_{K(T)} - 1} p(\alpha(t))$

Taking a lim sup as $T \to \infty$ and using (4.16) yields the result.

In order to show that the queue process under our proposed algorithm is deterministically bounded, we need the following assumption:

Assumption 4.2.1.
The following quantities are finite and strictly positive:

$p_{min} = \min_{\alpha \in \mathcal{A} \setminus \{0\}} p(\alpha)$
$p_{max} = \max_{\alpha \in \mathcal{A} \setminus \{0\}} p(\alpha)$

Lemma 4.2.2.
Suppose Assumption 4.2.1 holds. If $Q[0] = 0$, then under our algorithm we have for all $k > 0$:

$Q[k] \leq \max\left\{\frac{VB}{p_{min}} + p_{max} - \beta, \; 0\right\}$

Proof.
First, consider the case when $p_{max} \leq \beta$. From (4.12) and the fact that $T[k] \geq 1$, it is clear the queue can never increase, and so $Q[k] \leq Q[0] = 0$ for all $k > 0$.

Next, consider the case when $p_{max} > \beta$. We prove the assertion by induction on $k$. The result trivially holds for $k = 0$. Suppose it holds at $k = l$ for some $l > 0$, so that:

$Q[l] \leq \frac{VB}{p_{min}} + p_{max} - \beta$

We are going to prove that the same holds for $k = l + 1$. There are two cases:

1. $Q[l] \leq \frac{VB}{p_{min}}$. In this case, we have by (4.12):

$Q[l+1] \leq Q[l] + p_{max} - \beta \leq \frac{VB}{p_{min}} + p_{max} - \beta$

2. $\frac{VB}{p_{min}} < Q[l] \leq \frac{VB}{p_{min}} + p_{max} - \beta$. In this case, we use proof by contradiction. If $p(\alpha(t_l)) = 0$, then the queue cannot increase, so:

$Q[l+1] \leq Q[l] \leq \frac{VB}{p_{min}} + p_{max} - \beta$

On the other hand, if $p(\alpha(t_l)) > 0$, then $p(\alpha(t_l)) \geq p_{min}$, and so the numerator in (4.15) satisfies:

$V B \phi(\alpha(t_l)) - Q[l] p(\alpha(t_l)) \leq V B - Q[l] p_{min} < 0$

But the maximum ratio in (4.15) cannot be negative, because the alternative choice $\alpha(t_l) = 0$ increases the ratio to 0. This contradiction implies that we cannot have $p(\alpha(t_l)) > 0$ in this case, which completes the induction.

The above is a sample path result that only assumes the parameters satisfy $\lambda > 0$, $B > 0$, and $0 \leq \phi(\cdot) \leq 1$. Thus, the algorithm meets the average power constraint even if it uses incorrect values for these parameters. The next subsection provides a throughput optimality result when these parameters match the true system values.
Consider the following class of i.i.d. randomized algorithms: Let $\theta(\alpha)$ be non-negative numbers defined for each $\alpha \in \mathcal{A}$, and suppose they satisfy $\sum_{\alpha \in \mathcal{A}} \theta(\alpha) = 1$. Let $\alpha^*(t)$ represent a policy that, on every slot $t$ for which $F(t) = 1$, chooses $\alpha^*(t) \in \mathcal{A}$ by independently selecting strategy $\alpha$ with probability $\theta(\alpha)$. Then $(p(\alpha^*(t_k)), \phi(\alpha^*(t_k)))$ are independent and identically distributed (i.i.d.) over frames $k$. Under this algorithm, it follows by the law of large numbers that the throughput and power expenditure satisfy (with probability 1):

$\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} B\phi(\alpha^*(t)) = \frac{B \, E(\phi(\alpha^*(t_k)))}{1 + E(\phi(\alpha^*(t_k)))/\lambda}$

$\lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} p(\alpha^*(t)) = \frac{E(p(\alpha^*(t_k)))}{1 + E(\phi(\alpha^*(t_k)))/\lambda}$

It can be shown that optimality of problem (4.5)-(4.8) can be achieved over this class. Thus, there exists an i.i.d. randomized algorithm $\alpha^*(t)$ that satisfies:

$\frac{B \, E(\phi(\alpha^*(t_k)))}{1 + E(\phi(\alpha^*(t_k)))/\lambda} = \mu^*$   (4.17)

$\frac{E(p(\alpha^*(t_k)))}{1 + E(\phi(\alpha^*(t_k)))/\lambda} \leq \beta$   (4.18)

where $\mu^*$ is the optimal throughput for the problem (4.5)-(4.8).

Define $\mathcal{F}(t_k)$ as the system history up to frame $k$, which includes the actions taken $\alpha(t_0), \ldots, \alpha(t_{k-1})$, the frame lengths $T[0], \ldots, T[k-1]$, and the queue value $Q[k]$ (since this is determined by the random events before frame $k$). Consider the algorithm that, on frame $k$, observes $Q[k]$ and chooses $\alpha(t_k)$ according to (4.15). The following key feature of this algorithm can be shown (see [Nee10b] for related results):

$\frac{E(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k)) \mid \mathcal{F}(t_k))}{E(1 + \phi(\alpha(t_k))/\lambda \mid \mathcal{F}(t_k))} \leq \frac{E(-VB\phi(\alpha^*(t_k)) + Q[k]p(\alpha^*(t_k)) \mid \mathcal{F}(t_k))}{E(1 + \phi(\alpha^*(t_k))/\lambda \mid \mathcal{F}(t_k))}$

where $\alpha^*(t_k)$ is any (possibly randomized) alternative decision that is based only on $\mathcal{F}(t_k)$.
This is an intuitive property: By design, the algorithm in (4.15) observes $\mathcal{F}(t_k)$ and then chooses a particular action $\alpha(t_k)$ to minimize the ratio over all deterministic actions. Thus, as can be shown, it also minimizes the ratio over all potentially randomized actions. Using the (randomized) i.i.d. decision $\alpha^*(t_k)$ from (4.17)-(4.18) in the above, and noting that this alternative decision is independent of $\mathcal{F}(t_k)$, gives:

$\frac{E(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k)) \mid \mathcal{F}(t_k))}{E(1 + \phi(\alpha(t_k))/\lambda \mid \mathcal{F}(t_k))} \leq -V\mu^* + Q[k]\beta$   (4.19)

Theorem 4.2.1.
Suppose Assumption 4.2.1 holds. The proposed algorithm achieves the constraint $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} p(\alpha(t)) \leq \beta$ and yields throughput satisfying (with probability 1):

$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} B\phi(\alpha(t)) \geq \mu^* - \frac{C}{V}$   (4.20)

where $C$ is a constant independent of $V$ (given in the proof). Proof.
First, for any fixed $V$, Lemma 4.2.2 implies that the queue is deterministically bounded. Thus, according to Lemma 4.2.1, the proposed algorithm achieves the constraint:

$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} p(\alpha(t)) \leq \beta$

The rest of the proof is devoted to the throughput guarantee (4.20). Define:

$L(Q[k]) = \frac{1}{2} Q[k]^2$

We call this a
Lyapunov function. Define a frame-based Lyapunov drift as:

$\Delta[k] = L(Q[k+1]) - L(Q[k])$

According to (4.12), we get $Q[k+1]^2 \leq (Q[k] + p(\alpha(t_k)) - T[k]\beta)^2$. Thus:

$\Delta[k] \leq \frac{1}{2}(p(\alpha(t_k)) - T[k]\beta)^2 + Q[k](p(\alpha(t_k)) - T[k]\beta)$

Taking conditional expectations given $\mathcal{F}(t_k)$, and recalling that $\mathcal{F}(t_k)$ includes the information $Q[k]$, gives:

$E(\Delta[k] \mid \mathcal{F}(t_k)) \leq C + Q[k] \, E(p(\alpha(t_k)) - \beta T[k] \mid \mathcal{F}(t_k))$   (4.21)

where $C$ is a constant (independent of $V$) that satisfies the following for all possible histories $\mathcal{F}(t_k)$:

$E\left(\frac{1}{2}(p(\alpha(t_k)) - T[k]\beta)^2 \,\Big|\, \mathcal{F}(t_k)\right) \leq C$

Such a constant $C$ exists because the power $p(\alpha(t_k))$ is deterministically bounded due to Assumption 4.2.1, and the frame sizes $T[k]$ are bounded in second moment regardless of history according to (4.14).

Adding the "penalty" $-E(VB\phi(\alpha(t_k)) \mid \mathcal{F}(t_k))$ to both sides of (4.21) gives:

$E(\Delta[k] - VB\phi(\alpha(t_k)) \mid \mathcal{F}(t_k)) \leq C + E(-VB\phi(\alpha(t_k)) + Q[k](p(\alpha(t_k)) - \beta T[k]) \mid \mathcal{F}(t_k))$
$= C - Q[k]\beta \, E(T[k] \mid \mathcal{F}(t_k)) + E(T[k] \mid \mathcal{F}(t_k)) \cdot \frac{E(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k)) \mid \mathcal{F}(t_k))}{E(T[k] \mid \mathcal{F}(t_k))}$

Expanding $T[k]$ in the denominator of the last term gives:

$E(\Delta[k] - VB\phi(\alpha(t_k)) \mid \mathcal{F}(t_k)) \leq C - Q[k]\beta \, E(T[k] \mid \mathcal{F}(t_k)) + E(T[k] \mid \mathcal{F}(t_k)) \cdot \frac{E(-VB\phi(\alpha(t_k)) + Q[k]p(\alpha(t_k)) \mid \mathcal{F}(t_k))}{E(1 + \phi(\alpha(t_k))/\lambda \mid \mathcal{F}(t_k))}$

Substituting (4.19) into the above expression gives:

$E(\Delta[k] - VB\phi(\alpha(t_k)) \mid \mathcal{F}(t_k)) \leq C - Q[k]\beta \, E(T[k] \mid \mathcal{F}(t_k)) + E(T[k] \mid \mathcal{F}(t_k))(-V\mu^* + \beta Q[k]) = C - V\mu^* E(T[k] \mid \mathcal{F}(t_k))$   (4.22)

Rearranging gives:

$E(\Delta[k] + V(\mu^* T[k] - B\phi(\alpha(t_k))) \mid \mathcal{F}(t_k)) \leq C$   (4.23)

This implies that $\Delta[k] + V(\mu^* T[k] - B\phi(\alpha(t_k))) - C$
is a supermartingale difference sequence. Furthermore, since we already know the queue $Q[k]$ is deterministically bounded, it follows that:

$\sum_{k=1}^{\infty} \frac{E(\Delta[k]^2)}{k^2} < \infty$

This, together with (4.23), implies by Lemma 3.3.3 that (with probability 1):

$\limsup_{K \to \infty} \frac{1}{K} \sum_{k=0}^{K-1} \left[\mu^* T[k] - B\phi(\alpha(t_k))\right] \leq \frac{C}{V}$

Thus, for any $\epsilon > 0$, the following holds for all sufficiently large $K$:

$\frac{1}{K} \sum_{k=0}^{K-1} \left[\mu^* T[k] - B\phi(\alpha(t_k))\right] \leq \frac{C}{V} + \epsilon$

Rearranging implies that for all sufficiently large $K$:

$\frac{\sum_{k=0}^{K-1} B\phi(\alpha(t_k))}{\sum_{k=0}^{K-1} T[k]} \geq \mu^* - \frac{(C/V + \epsilon)K}{\sum_{k=0}^{K-1} T[k]} \geq \mu^* - (C/V + \epsilon)$

where the final inequality holds because $T[k] \geq 1$. Thus:

$\liminf_{K \to \infty} \frac{\sum_{k=0}^{K-1} B\phi(\alpha(t_k))}{\sum_{k=0}^{K-1} T[k]} \geq \mu^* - (C/V + \epsilon)$

The above holds for all $\epsilon > 0$. Taking a limit as $\epsilon \to 0$ gives:

$\liminf_{K \to \infty} \frac{\sum_{k=0}^{K-1} B\phi(\alpha(t_k))}{\sum_{k=0}^{K-1} T[k]} \geq \mu^* - C/V$

Notice that $\phi(\alpha(t))$ only changes at the boundary of each frame and remains 0 within the frame. Thus, we can replace the sum over frames $k$ by a sum over slots $t$. The desired result follows.

The theorem shows that throughput can be pushed within $O(1/V)$ of the optimal value $\mu^*$, where $V$ can be chosen as large as desired to ensure throughput is arbitrarily close to optimal. The tradeoff is a queue bound that grows linearly with $V$ according to Lemma 4.2.2, which affects the convergence time required for the constraints to be close to the desired time averages (as described in the proof of Lemma 4.2.1).
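As a numerical sanity check on the renewal-reward expressions used in the optimality argument (cf. (4.17)), the two-state chain (4.1)-(4.4) can be simulated under a fixed i.i.d. randomized policy and compared against $B\,E(\phi(\alpha^*(t_k)))/(1 + E(\phi(\alpha^*(t_k)))/\lambda)$. A minimal sketch; the action set, policy probabilities, and parameters are illustrative assumptions:

```python
import math
import random

def simulate_iid_policy(actions, probs, phi, lam, B, num_slots, seed):
    """Simulate the two-state chain (4.1)-(4.4) under a fixed i.i.d.
    randomized policy; return the empirical throughput (bits/slot)."""
    rng = random.Random(seed)
    F, bits = 1, 0.0
    for _ in range(num_slots):
        if F == 1:
            a = rng.choices(actions, weights=probs)[0]
            bits += B * phi(a)           # objective credits B*phi(alpha(t))
            if rng.random() < phi(a):    # file completes: go idle, cf. (4.1)
                F = 0
        elif rng.random() < lam:         # idle -> active, cf. (4.3)
            F = 1
    return bits / num_slots

phi = lambda a: 1.0 - math.exp(-a)       # success probability, as in Example 1
lam, B = 0.25, 1.0
actions, probs = [0.0, 1.0], [0.3, 0.7]  # hypothetical theta(.) distribution

emp = simulate_iid_policy(actions, probs, phi, lam, B, 200_000, seed=0)
E_phi = sum(pr * phi(a) for a, pr in zip(actions, probs))
pred = B * E_phi / (1.0 + E_phi / lam)   # renewal-reward formula, cf. (4.17)
```

Over a long horizon the empirical rate `emp` should agree with `pred` to within simulation noise, reflecting that each frame contains exactly one active slot and has mean length $1 + E(\phi)/\lambda$.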
Figure 4.1: A system with $N$ users. The shaded node for each user $n$ indicates the current file state $F_n(t)$ of that user. There are $2^N$ different state vectors.

This section considers a multi-user file downloading system that consists of $N$ single-user subsystems. Each subsystem is similar to the single-user system described in the previous section. Specifically, for the $n$-th user (where $n \in \{1, \ldots, N\}$):
• The file state process is $F_n(t) \in \{0, 1\}$.
• The transmission decision is $\alpha_n(t) \in \mathcal{A}_n$, where $\mathcal{A}_n$ is an abstract set of transmission options for user $n$.
• The power expenditure on slot $t$ is $p_n(\alpha_n(t))$.
• The success probability on a slot $t$ for which $F_n(t) = 1$ is $\phi_n(\alpha_n(t))$, where $\phi_n(\cdot)$ is the function that describes file completion probability for user $n$.
• The idle period parameter is $\lambda_n > 0$, and the average file size is $B_n$ bits.

Assume that the random variables associated with different subsystems are mutually independent. The resulting Markov decision problem has $2^N$ states, as shown in Fig. 4.1. The transition probabilities for each active user depend on which users are selected for transmission and on the corresponding transmission modes. This is a restless bandit system because there can also be transitions for non-selected users (specifically, it is possible to transition from inactive to active). To control the downloading process, there is a central server with only $M$ threads ($M < N$), meaning that at most $M$ jobs can be processed simultaneously. So at each time slot, the server has to make decisions selecting at most $M$ out of $N$ users to transmit a portion of their files. These decisions are further restricted by a global time-average power constraint. The goal is to maximize the aggregate throughput, which is defined as

$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} c_n B_n \phi_n(\alpha_n(t))$

where $c_1, c_2, \ldots, c_N$ are a collection of positive weights that can be used to prioritize users.
Thus, this multi-user file downloading problem reduces to the following:

Maximize: $\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} c_n B_n \phi_n(\alpha_n(t))$   (4.24)
Subject to: $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} p_n(\alpha_n(t)) \leq \beta$   (4.25)
$\sum_{n=1}^{N} I(\alpha_n(t)) \leq M$ for all $t \in \{0, 1, 2, \ldots\}$   (4.26)
$Pr[F_n(t+1) = 1 \mid F_n(t) = 0] = \lambda_n$   (4.27)
$Pr[F_n(t+1) = 0 \mid F_n(t) = 1] = \phi_n(\alpha_n(t))$   (4.28)

where the constraints (4.27)-(4.28) hold for all $n \in \{1, \ldots, N\}$ and $t \in \{0, 1, 2, \ldots\}$, and where $I(\cdot)$ is the indicator function defined as $I(x) = 0$ if $x = 0$, and $I(x) = 1$ otherwise.

This section develops our indexing algorithm for the multi-user case using the single-user case as a stepping stone. The major difficulty is the instantaneous constraint $\sum_{n=1}^{N} I(\alpha_n(t)) \leq M$. Temporarily neglecting this constraint, we use Lyapunov optimization to deal with the time-average power constraint first. We introduce a virtual queue $Q(t)$, which is again 0 at $t = 0$. Instead of updating it on a frame basis, the server updates this queue every slot as follows:

$Q(t+1) = \max\left\{Q(t) + \sum_{n=1}^{N} p_n(\alpha_n(t)) - \beta, \; 0\right\}$   (4.29)

Define $\mathcal{N}(t)$ as the set of users beginning their renewal frames at time $t$, so that $F_n(t) = 1$ for all such users. In general, $\mathcal{N}(t)$ is a subset of $\mathcal{N} = \{1, 2, \ldots, N\}$. Define $|\mathcal{N}(t)|$ as the number of users in the set $\mathcal{N}(t)$. At each time slot $t$, the server observes the queue state $Q(t)$ and chooses $(\alpha_1(t), \ldots, \alpha_N(t))$ in a manner similar to the single-user case. Specifically, for each user $n \in \mathcal{N}(t)$ define:

$g_n(\alpha_n(t)) \triangleq \frac{V c_n B_n \phi_n(\alpha_n(t)) - Q(t) p_n(\alpha_n(t))}{1 + \phi_n(\alpha_n(t))/\lambda_n}$   (4.30)

This is similar to the expression (4.15) used in the single-user optimization. Call $g_n(\alpha_n(t))$ a reward. Now define an index for each subsystem $n$ by:

$\gamma_n(t) \triangleq \max_{\alpha_n(t) \in \mathcal{A}_n} g_n(\alpha_n(t))$   (4.31)

which is the maximum possible reward one can get from the $n$-th subsystem at time slot $t$.
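One slot of the resulting index policy (formalized as Algorithm 6 below) can be sketched as follows. The per-user parameters are illustrative assumptions, and `users` is a hypothetical container for the tuples $(\mathcal{A}_n, \phi_n, p_n, c_n, B_n, \lambda_n)$:

```python
def dpp_index_slot(F, Q, users, V, beta, M):
    """One slot of the multi-user DPP ratio indexing policy, cf. (4.29)-(4.31):
    compute each active user's index, activate the top min(M, |N(t)|) users,
    then update the virtual queue. users[n] = (A_n, phi_n, p_n, c_n, B_n, lam_n)."""
    actions = [0] * len(users)
    scored = []
    for n, (A_n, phi_n, p_n, c_n, B_n, lam_n) in enumerate(users):
        if F[n] != 1:
            continue  # only users in N(t) hold an index on this slot
        def g(a, phi_n=phi_n, p_n=p_n, c_n=c_n, B_n=B_n, lam_n=lam_n):
            # the per-user reward (4.30)
            return (V * c_n * B_n * phi_n(a) - Q * p_n(a)) / (1 + phi_n(a) / lam_n)
        best = max(A_n, key=g)
        scored.append((g(best), n, best))   # (index gamma_n, user, action)
    scored.sort(reverse=True)               # largest indices served first
    for _, n, a in scored[:M]:
        actions[n] = a
    power = sum(u[2](actions[n]) for n, u in enumerate(users))
    Q_next = max(Q + power - beta, 0.0)     # per-slot queue update (4.29)
    return actions, Q_next

# Two illustrative users with binary action sets {0 (idle), 1 (download)}.
users = [
    ([0, 1], lambda a: 0.3 * a, lambda a: 1.0 * a, 1.0, 10.0, 0.2),
    ([0, 1], lambda a: 0.6 * a, lambda a: 1.5 * a, 1.0, 6.0, 0.4),
]
actions, Q1 = dpp_index_slot(F=[1, 1], Q=0.0, users=users, V=10.0, beta=0.5, M=1)
```

With $Q = 0$ the indices are $\gamma_1 = 30/2.5 = 12$ and $\gamma_2 = 36/2.5 = 14.4$, so with $M = 1$ only the second user is activated on this slot.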
Thus, it is natural to define the following myopic algorithm: find the (at most) $M$ subsystems in $\mathcal{N}(t)$ with the greatest rewards, and serve these with their corresponding optimal $\alpha_n(t)$ options in $\mathcal{A}_n$ that maximize $g_n(\alpha_n(t))$.

Algorithm 6.
• At each time slot $t$, the server observes virtual queue state $Q(t)$ and computes the indices using (4.31) for all $n \in \mathcal{N}(t)$.
• Activate the $\min[M, |\mathcal{N}(t)|]$ subsystems with greatest indices, using their corresponding actions $\alpha_n(t) \in \mathcal{A}_n$ that maximize $g_n(\alpha_n(t))$.
• Update $Q(t)$ according to (4.29) at the end of each slot $t$.

4.3.2 Theoretical performance analysis

In this subsection, we show that the above algorithm always satisfies the desired time-average power constraint. We adopt the following assumption:
Assumption 4.3.1.
The following quantities are finite and strictly positive:

$p_n^{min} = \min_{\alpha_n \in \mathcal{A}_n \setminus \{0\}} p_n(\alpha_n)$, $\quad p^{min} = \min_n p_n^{min}$
$p_n^{max} = \max_{\alpha_n \in \mathcal{A}_n} p_n(\alpha_n)$
$c^{max} = \max_n c_n$, $\quad B^{max} = \max_n B_n$

Lemma 4.3.1.
Suppose Assumption 4.3.1 holds. Then the queue $\{Q(t)\}_{t=0}^{\infty}$ is deterministically bounded under Algorithm 6. Specifically, we have for all $t \in \{0, 1, 2, \ldots\}$:

$Q(t) \leq \max\left\{\frac{V c^{max} B^{max}}{p^{min}} + \sum_{n=1}^{N} p_n^{max} - \beta, \; 0\right\}$

Proof.
First, consider the case when $\sum_{n=1}^{N} p_n^{max} \leq \beta$. Since $Q(0) = 0$, it is clear from the updating rule (4.29) that $Q(t)$ will remain 0 for all $t$.

Next, consider the case when $\sum_{n=1}^{N} p_n^{max} > \beta$. We prove the assertion by induction on $t$. The result trivially holds for $t = 0$. Suppose at $t = t_0$ we have:

$Q(t_0) \leq \frac{V c^{max} B^{max}}{p^{min}} + \sum_{n=1}^{N} p_n^{max} - \beta$

We are going to prove that the same statement holds for $t = t_0 + 1$. We further divide it into two cases:

1. $Q(t_0) \leq \frac{V c^{max} B^{max}}{p^{min}}$. In this case, since the queue increases by at most $\sum_{n=1}^{N} p_n^{max} - \beta$ on one slot, we have:

$Q(t_0 + 1) \leq \frac{V c^{max} B^{max}}{p^{min}} + \sum_{n=1}^{N} p_n^{max} - \beta$

2. $\frac{V c^{max} B^{max}}{p^{min}} < Q(t_0) \leq \frac{V c^{max} B^{max}}{p^{min}} + \sum_{n=1}^{N} p_n^{max} - \beta$. In this case, since $\phi_n(\alpha_n(t)) \leq 1$ and $p_n(\alpha_n(t)) \geq p^{min}$ whenever $\alpha_n(t) \neq 0$, we cannot have $V c_n B_n \phi_n(\alpha_n(t)) \geq Q(t_0) p_n(\alpha_n(t))$ unless $\alpha_n(t) = 0$. Thus, the DPP ratio indexing algorithm of maximizing (4.30) chooses $\alpha_n(t) = 0$ for all $n$, and all indices are 0. This implies that $Q(t_0 + 1)$ cannot increase, and we get $Q(t_0 + 1) \leq \frac{V c^{max} B^{max}}{p^{min}} + \sum_{n=1}^{N} p_n^{max} - \beta$.
The proposed DPP ratio indexing algorithm achieves the constraint:

$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} p_n(\alpha_n(t)) \leq \beta$

Proof.
First of all, similar to Lemma 4.2.1, one can show that if $Q(t) \leq C$ for some constant $C > 0$ and all $t \in \{0, 1, 2, \ldots\}$, then $\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} p_n(\alpha_n(t)) \leq \beta$. Using Lemma 4.3.1, we finish the proof.

In general, it is very difficult to prove optimality of the above multi-user algorithm. There are mainly two reasons. The first reason is that multiple users might renew themselves asynchronously, making it difficult to define a "renewal frame" for the whole system. Thus, the proof technique of Theorem 4.2.1 is infeasible. The second reason is that, even without the time-average constraint, the problem degenerates into a standard restless bandit problem where the optimality of indexing is not guaranteed.

This section considers a special case of the multi-user file downloading problem where the DPP ratio indexing algorithm is provably optimal. The special case has no time-average power constraint. Further, for each user $n \in \{1, \ldots, N\}$:
• Each file consists of a random number of fixed-length packets with mean $B_n = 1/\mu_n$.
• The decision set is $\mathcal{A}_n = \{0, 1\}$, where 0 stands for "idle" and 1 stands for "download." If $\alpha_n(t) = 1$, then user $n$ successfully downloads a single packet.
• $\phi_n(\alpha_n(t)) = \mu_n \alpha_n(t)$.
• Idle time is geometrically distributed with mean $1/\lambda_n$.
• The special case $\mu_n = 1 - \lambda_n$ is assumed.

The assumption that the file length and idle time parameters $\mu_n$ and $\lambda_n$ satisfy $\mu_n = 1 - \lambda_n$ is restrictive. However, there exists a certain queueing system that admits exactly the same Markov dynamics as the system considered here when the assumption holds (described in Section 4.4.1 below). More importantly, it allows us to implement the stochastic coupling idea to prove optimality. The goal is to maximize the sum throughput (in units of packets/slot), which is defined as:

$\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} B_n \phi_n(\alpha_n(t))$   (4.32)

In this special case, the multi-user file downloading problem reduces to the following:

Maximize: $\liminf_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \alpha_n(t)$   (4.33)
Subject to: $\sum_{n=1}^{N} \alpha_n(t) \leq M$ for all $t \in \{0, 1, 2, \ldots\}$   (4.34)
$\alpha_n(t) \in \{0, F_n(t)\}$   (4.35)
$Pr[F_n(t+1) = 1 \mid F_n(t) = 0] = \lambda_n$   (4.36)
$Pr[F_n(t+1) = 0 \mid F_n(t) = 1] = \alpha_n(t)(1 - \lambda_n)$   (4.37)

where the equality (4.37) uses the fact that $\mu_n = 1 - \lambda_n$. A picture that illustrates the Markov structure of constraints (4.35)-(4.37) is given in Fig. 4.2.

4.4.1 A system of $N$ single-buffer queues

The above model, with the assumption $\mu_n = 1 - \lambda_n$, is structurally equivalent to the following: Consider a system of $N$ single-buffer queues, $M$ servers, and independent Bernoulli packet arrivals with rates $\lambda_n$ to each queue $n \in \{1, \ldots, N\}$. This considers packet arrivals rather than file arrivals, so there are no file length variables and no parameters $\mu_n$ in this interpretation. Let $A(t) = (A_1(t), \ldots, A_N(t))$ be the binary-valued vector of packet arrivals on slot $t$, assumed to be i.i.d. over slots and independent in each coordinate. Assume all packets have the same size and each queue has a single buffer that can store just one packet. Let $F_n(t)$ be 1 if queue $n$ has a packet at the beginning of slot $t$, and 0 else. Each server can transmit at most 1 packet per slot.
Figure 4.2: Markovian dynamics of the $n$-th system.

Let $\alpha_n(t)$ be 1 if queue $n$ is served on slot $t$, and 0 else. An arrival $A_n(t)$ occurs at the end of slot $t$ and is accepted only if queue $n$ is empty at the end of the slot (such as when it was served on that slot). Packets that are not accepted are dropped. The Markov dynamics are described by the same figure as before, namely, Fig. 4.2. Further, the problem of maximizing throughput is given by the same equations (4.33)-(4.37). Thus, although the variables of the two problems have different interpretations, the problems are structurally equivalent. For simplicity of exposition, the remainder of this section uses this single-buffer queue interpretation.

Since there is no power constraint, for any $V > 0$, with $c_n = 1$ for all $n$, we have $Q(t) \equiv 0$, and the algorithm reduces to the following: if there are fewer than $M$ non-empty queues, serve all of them; else, serve the $M$ non-empty queues with the largest values of $\gamma_n$, where (taking $V = 1$ without loss of generality, since the ordering of the indices does not depend on $V$):

$\gamma_n = \frac{1}{1 + (1 - \lambda_n)/\lambda_n} = \lambda_n$

Thus, the DPP ratio indexing algorithm in this context reduces to serving the (at most $M$) non-empty queues with the largest $\lambda_n$ values on each time slot. For the remainder of this section, this is called the Max-$\lambda$ policy. The following theorem shows that Max-$\lambda$ is optimal in this context.

Theorem 4.4.1.
The Max-$\lambda$ policy is optimal for the problem (4.33)-(4.37). In particular, under the single-buffer queue interpretation, it maximizes throughput over all policies that transmit on each slot $t$ without knowledge of the arrival vector $A(t)$.

For the $N$ single-buffer queue interpretation, the total throughput is equal to the raw arrival rate $\sum_{i=1}^{N} \lambda_i$ minus the packet drop rate. Intuitively, the reason Max-$\lambda$ is optimal is that it chooses to leave packets in the queues that are least likely to induce packet drops. An example comparison of the throughput gap between the Max-$\lambda$ and Min-$\lambda$ policies is given in Section 4.6. The proof of Theorem 4.4.1 is divided into two parts. The first part uses stochastic coupling techniques to prove that Max-$\lambda$ dominates all alternative work-conserving policies. A policy is work-conserving if it does not allow any server to be idle when it could be used to serve a non-empty queue. The second part of the proof shows that throughput cannot be increased by considering non-work-conserving policies.

Consider two discrete time processes $X \triangleq \{X(t)\}_{t=0}^{\infty}$ and $Y \triangleq \{Y(t)\}_{t=0}^{\infty}$. The notation $X =_{st} Y$ means that $X$ and $Y$ are stochastically equivalent, in that they are described by the same probability law. Formally, this means that their joint distributions are the same, so for all $t \in \{0, 1, 2, \ldots\}$ and all $(z_0, \ldots, z_t) \in \mathbb{R}^{t+1}$:

$Pr[X(0) \leq z_0, \ldots, X(t) \leq z_t] = Pr[Y(0) \leq z_0, \ldots, Y(t) \leq z_t]$

The notation $X \leq_{st} Y$ means that $X$ is stochastically less than or equal to $Y$, as defined by the following theorem.

Theorem 4.4.2. ([TE93]) The following three statements are equivalent:
1. $X \leq_{st} Y$.
2. $Pr[g(X(0), X(1), \ldots, X(t)) > z] \leq Pr[g(Y(0), Y(1), \ldots, Y(t)) > z]$ for all $t \in \mathbb{Z}_+$, all $z$, and all functions $g: \mathbb{R}^{t+1} \to \mathbb{R}$ that are measurable and nondecreasing in all coordinates.
3. There exist two stochastic processes $X'$ and $Y'$ on a common probability space that satisfy $X' =_{st} X$, $Y' =_{st} Y$, and $X'(t) \leq Y'(t)$ for every $t \in \mathbb{Z}_+$.

The following additional notation is used in the proof of Theorem 4.4.1:
• Arrival vector $\{A(t)\}_{t=0}^{\infty}$, where $A(t) \triangleq [A_1(t), A_2(t), \ldots, A_N(t)]$. Each $A_n(t)$ is an independent binary random variable that takes the value 1 w.p. $\lambda_n$ and 0 w.p. $1 - \lambda_n$.
• Buffer state vector $\{F(t)\}_{t=0}^{\infty}$, where $F(t) \triangleq [F_1(t), F_2(t), \ldots, F_N(t)]$. So $F_n(t) = 1$ if queue $n$ has a packet at the beginning of slot $t$, and $F_n(t) = 0$ else.
• Total packet process $U \triangleq \{U(t)\}_{t=0}^{\infty}$, where $U(t) \triangleq \sum_{n=1}^{N} F_n(t)$ represents the total number of packets in the system on slot $t$. Since each queue can hold at most one packet, we have $0 \leq U(t) \leq N$ for all slots $t$.

The next lemma is the key to proving Theorem 4.4.1. The lemma considers the multi-queue system with a fixed but arbitrary initial buffer state $F(0)$. The arrival process $A(t)$ is as defined above. Let $U^{Max\text{-}\lambda}$ be the total packet process under the Max-$\lambda$ policy. Let $U^{\pi}$ be the corresponding process starting from the same initial state $F(0)$ and having the same arrivals $A(t)$, but with an arbitrary work-conserving policy $\pi$.

Lemma 4.4.1.
The total packet processes $U^{\pi}$ and $U^{\text{Max-}\lambda}$ satisfy:

$$U^{\pi} \leq_{st} U^{\text{Max-}\lambda} \qquad (4.38)$$

Proof.
Without loss of generality, assume the queues are sorted so that $\lambda_n \leq \lambda_{n+1}$, $n = 1, 2, \cdots, N-1$. Define $\{F^{\pi}(t)\}_{t=0}^{\infty}$ as the buffer state vector under policy π. Define $\{F^{\text{Max-}\lambda}(t)\}_{t=0}^{\infty}$ as the corresponding buffer states under the Max-λ policy. By assumption the initial states satisfy $F^{\pi}(0) = F^{\text{Max-}\lambda}(0)$. Next, we construct a third process $U^{\lambda}$ with a modified arrival vector process $\{A^{\lambda}(t)\}_{t=0}^{\infty}$ and a corresponding buffer state vector $\{F^{\lambda}(t)\}_{t=0}^{\infty}$ (with the same initial state $F^{\lambda}(0) = F^{\pi}(0)$), which satisfies:

1. $U^{\lambda}$ is also generated from the Max-λ policy.

2. $U^{\lambda} =_{st} U^{\text{Max-}\lambda}$. Since the total packet process is completely determined by the initial state, the scheduling policy, and the arrival process, it is enough to construct $\{A^{\lambda}(t)\}_{t=0}^{\infty}$ so that it is of the same probability law as $\{A(t)\}_{t=0}^{\infty}$.

3. $U^{\pi}(t) \leq U^{\lambda}(t)$ for all $t \geq 0$.

Since the arrival process A(t) is i.i.d. over slots, in order to guarantee 2) and 3), it is sufficient to construct $A^{\lambda}(t)$ coupled with $A(t)$ for each t so that the following two properties hold for all $t \geq 0$:

• $A(t)$ and $A^{\lambda}(t)$ have the same probability law. Specifically, both produce arrivals according to Bernoulli processes that are independent over queues and over time, with $Pr[A_n(t) = 1] = Pr[A^{\lambda}_n(t) = 1] = \lambda_n$ for all $n \in \{1, \ldots, N\}$.

• For all $j \in \{1, 2, \cdots, N\}$,

$$\sum_{n=1}^{j} F^{\pi}_n(t) \leq \sum_{n=1}^{j} F^{\lambda}_n(t). \qquad (4.39)$$

The construction is based on an induction. At t = 0 we have $F^{\pi}(0) = F^{\lambda}(0)$. Thus, (4.39) naturally holds for t = 0. Now fix $\tau \geq 0$ and assume (4.39) holds for $t = \tau$. If $\tau \geq 1$, further assume the arrivals $\{A^{\lambda}(t)\}_{t=0}^{\tau-1}$ have been constructed to have the same probability law as $\{A(t)\}_{t=0}^{\tau-1}$. Since arrivals on slot τ occur at the end of slot τ, the arrivals $A^{\lambda}(\tau)$ must be constructed. We are going to show there exists an $A^{\lambda}(\tau)$ that is coupled with $A(\tau)$ so that it has the same probability law and it also ensures (4.39) holds for $t = \tau + 1$.

Since arrivals occur after the transmitting action, we divide the analysis into two parts. First, we analyze the temporary buffer states after the transmitting action but before arrivals occur. Then, we define arrivals $A^{\lambda}(\tau)$ at the end of slot τ to achieve the desired coupling.

Define $\tilde{F}^{\pi}(\tau)$ and $\tilde{F}^{\lambda}(\tau)$ as the temporary buffer states right after the transmitting action at slot τ but before arrivals occur under policy π and policy Max-λ, respectively. Thus, for each queue $n \in \{1, \ldots, N\}$:

$$\tilde{F}^{\pi}_n(\tau) = F^{\pi}_n(\tau) - \alpha^{\pi}_n(\tau) \qquad (4.40)$$
$$\tilde{F}^{\lambda}_n(\tau) = F^{\lambda}_n(\tau) - \alpha^{\lambda}_n(\tau) \qquad (4.41)$$

where $\alpha^{\pi}_n(\tau)$ and $\alpha^{\lambda}_n(\tau)$ are the slot τ decisions under policy π and Max-λ, respectively. Since (4.39) holds for j = N on slot τ, the total number of packets at the start of slot τ under policy π is less than or equal to that of using Max-λ. Since both policies π and Max-λ are work-conserving, it is impossible for policy π to transmit more packets than Max-λ during slot τ. This implies:

$$\sum_{n=1}^{N} \tilde{F}^{\pi}_n(\tau) \leq \sum_{n=1}^{N} \tilde{F}^{\lambda}_n(\tau). \qquad (4.42)$$

Indeed, if π transmits the same number of packets as Max-λ on slot τ, then (4.42) clearly holds. On the other hand, if π transmits fewer packets than Max-λ, it must transmit fewer than M packets (since M is the number of servers). In this case, the work-conserving nature of π implies that all non-empty queues were served, so that $\tilde{F}^{\pi}_n(\tau) = 0$ for all n and (4.42) again holds. We now claim the following holds:

Lemma 4.4.2.

$$\sum_{n=1}^{j} \tilde{F}^{\pi}_n(\tau) \leq \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) \quad \forall j \in \{1, 2, \cdots, N\}. \qquad (4.43)$$

Proof.
See Section 4.6.2.

Now let $j^{\pi}(l)$ and $j^{\lambda}(l)$ be the subscript of the l-th empty temporary buffer (with order starting from the first queue) corresponding to $\tilde{F}^{\pi}(\tau)$ and $\tilde{F}^{\lambda}(\tau)$, respectively. It follows from (4.43) that the π system on slot τ has at least as many empty temporary buffer states as the Max-λ policy, and:

$$j^{\pi}(l) \leq j^{\lambda}(l) \quad \forall l \in \{1, 2, \cdots, K(\tau)\} \qquad (4.44)$$

where $K(\tau) \leq N$ is the number of empty temporary buffer states under Max-λ at time slot τ. Since $\lambda_i \leq \lambda_j$ if and only if $i \leq j$, (4.44) further implies that

$$\lambda_{j^{\pi}(l)} \leq \lambda_{j^{\lambda}(l)} \quad \forall l \in \{1, 2, \cdots, K(\tau)\}. \qquad (4.45)$$

Now construct the arrival vector $A^{\lambda}(\tau)$ for the system with the Max-λ policy in the following way:

$$A_{j^{\pi}(l)}(\tau) = 1 \;\Rightarrow\; A^{\lambda}_{j^{\lambda}(l)}(\tau) = 1, \text{ w.p. } 1; \qquad (4.46)$$

$$A_{j^{\pi}(l)}(\tau) = 0 \;\Rightarrow\; \begin{cases} A^{\lambda}_{j^{\lambda}(l)}(\tau) = 0, & \text{w.p. } \dfrac{1 - \lambda_{j^{\lambda}(l)}}{1 - \lambda_{j^{\pi}(l)}}; \\[2mm] A^{\lambda}_{j^{\lambda}(l)}(\tau) = 1, & \text{w.p. } \dfrac{\lambda_{j^{\lambda}(l)} - \lambda_{j^{\pi}(l)}}{1 - \lambda_{j^{\pi}(l)}}. \end{cases} \qquad (4.47)$$

Notice that (4.47) uses valid probability distributions because of (4.45). This establishes the slot τ arrivals for the Max-λ policy for all of its K(τ) queues with empty temporary buffer states. The slot τ arrivals for its queues with non-empty temporary buffers will be dropped and hence do not affect the queue states on slot τ + 1. Thus, we define arrivals $A^{\lambda}_j(\tau)$ to be independent of all other quantities and to be Bernoulli with $Pr[A^{\lambda}_j(\tau) = 1] = \lambda_j$ for all j in the set:

$$j \in \{1, 2, \cdots, N\} \setminus \{j^{\lambda}(1), \cdots, j^{\lambda}(K(\tau))\}$$

Now we verify that $A(\tau)$ and $A^{\lambda}(\tau)$ have the same probability law. First condition on knowledge of K(τ) and the particular $j^{\pi}(l)$ and $j^{\lambda}(l)$ values for $l \in \{1, \ldots, K(\tau)\}$. All queues j with non-empty temporary buffer states on slot τ under Max-λ were defined to have arrivals $A^{\lambda}_j(\tau)$ as independent Bernoulli variables with $Pr[A^{\lambda}_j(\tau) = 1] = \lambda_j$. It remains to verify those queues within $\{j^{\lambda}(1), \cdots, j^{\lambda}(K(\tau))\}$.
According to (4.47), for any queue $j^{\lambda}(l)$ in the set $\{j^{\lambda}(1), \cdots, j^{\lambda}(K(\tau))\}$, it follows that

$$Pr\left[A^{\lambda}_{j^{\lambda}(l)}(\tau) = 0\right] = (1 - \lambda_{j^{\pi}(l)}) \cdot \frac{1 - \lambda_{j^{\lambda}(l)}}{1 - \lambda_{j^{\pi}(l)}} = 1 - \lambda_{j^{\lambda}(l)}$$

and so $Pr[A^{\lambda}_j(\tau) = 1] = \lambda_j$ for all $j \in \{j^{\lambda}(l)\}_{l=1}^{K(\tau)}$. Further, mutual independence of $\{A_{j^{\pi}(l)}(\tau)\}_{l=1}^{K(\tau)}$ implies mutual independence of $\{A^{\lambda}_{j^{\lambda}(l)}(\tau)\}_{l=1}^{K(\tau)}$. Finally, these quantities are conditionally independent of events before slot τ, given knowledge of K(τ) and the particular $j^{\pi}(l)$ and $j^{\lambda}(l)$ values for $l \in \{1, \ldots, K(\tau)\}$. Thus, conditioned on this knowledge, $A(\tau)$ and $A^{\lambda}(\tau)$ have the same probability law. This holds for all possible values of the conditional knowledge K(τ), $j^{\pi}(l)$, and $j^{\lambda}(l)$. It follows that $A(\tau)$ and $A^{\lambda}(\tau)$ have the same (unconditioned) probability law.

Finally, we show that the coupling relations (4.46) and (4.47) produce $F^{\lambda}(\tau+1)$ satisfying

$$\sum_{n=1}^{j} F^{\pi}_n(\tau+1) \leq \sum_{n=1}^{j} F^{\lambda}_n(\tau+1), \quad \forall j \in \{1, 2, \cdots, N\}. \qquad (4.48)$$

According to (4.46) and (4.47),

$$A_{j^{\pi}(l)}(\tau) \leq A^{\lambda}_{j^{\lambda}(l)}(\tau), \quad \forall l \in \{1, \cdots, K(\tau)\},$$

and hence

$$\sum_{i=1}^{l} A_{j^{\pi}(i)}(\tau) \leq \sum_{i=1}^{l} A^{\lambda}_{j^{\lambda}(i)}(\tau), \quad \forall l \in \{1, \cdots, K(\tau)\}. \qquad (4.49)$$

Pick any $j \in \{1, 2, \cdots, N\}$. Let $l^{\pi}$ be the number of empty temporary buffers within the first j queues under policy π, i.e.

$$l^{\pi} = \max_{j^{\pi}(l) \leq j} l.$$

Similarly define:

$$l^{\lambda} = \max_{j^{\lambda}(l) \leq j} l.$$

Then, it follows:

$$\sum_{n=1}^{j} F^{\pi}_n(\tau+1) = \sum_{n=1}^{j} \tilde{F}^{\pi}_n(\tau) + \sum_{i=1}^{l^{\pi}} A_{j^{\pi}(i)}(\tau) \qquad (4.50)$$

$$\sum_{n=1}^{j} F^{\lambda}_n(\tau+1) = \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) + \sum_{i=1}^{l^{\lambda}} A^{\lambda}_{j^{\lambda}(i)}(\tau) \qquad (4.51)$$

We know that $l^{\pi} \geq l^{\lambda}$. So there are two cases:

• If $l^{\pi} = l^{\lambda}$, then from (4.50):

$$\sum_{n=1}^{j} F^{\pi}_n(\tau+1) = \sum_{n=1}^{j} \tilde{F}^{\pi}_n(\tau) + \sum_{i=1}^{l^{\lambda}} A_{j^{\pi}(i)}(\tau) \leq \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) + \sum_{i=1}^{l^{\lambda}} A^{\lambda}_{j^{\lambda}(i)}(\tau) = \sum_{n=1}^{j} F^{\lambda}_n(\tau+1)$$

where the inequality follows from (4.43) and from (4.49) with $l = l^{\lambda}$.
Thus, (4.48) holds.

• If $l^{\pi} > l^{\lambda}$, then from (4.50):

$$\sum_{n=1}^{j} F^{\pi}_n(\tau+1) = \sum_{n=1}^{j} \tilde{F}^{\pi}_n(\tau) + \sum_{i=1}^{l^{\lambda}} A_{j^{\pi}(i)}(\tau) + \sum_{i=l^{\lambda}+1}^{l^{\pi}} A_{j^{\pi}(i)}(\tau) \leq \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) + \sum_{i=1}^{l^{\lambda}} A_{j^{\pi}(i)}(\tau) \leq \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) + \sum_{i=1}^{l^{\lambda}} A^{\lambda}_{j^{\lambda}(i)}(\tau) = \sum_{n=1}^{j} F^{\lambda}_n(\tau+1),$$

where the first inequality follows from the fact that

$$\sum_{i=l^{\lambda}+1}^{l^{\pi}} A_{j^{\pi}(i)}(\tau) \leq l^{\pi} - l^{\lambda} = (j - l^{\lambda}) - (j - l^{\pi}) = \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) - \sum_{n=1}^{j} \tilde{F}^{\pi}_n(\tau),$$

and the second inequality follows from (4.49).

Thus, (4.39) holds for $t = \tau + 1$ and the induction step is done.

Corollary 4.4.1.
The Max-λ policy maximizes throughput within the class of work-conserving policies.

Proof. Let $S^{\pi}(t)$ be the number of packets transmitted under any work-conserving policy π on slot t, and let $S^{\text{Max-}\lambda}(t)$ be the corresponding process under policy Max-λ. Lemma 4.4.1 implies $U^{\pi} \leq_{st} U^{\text{Max-}\lambda}$. Then:

$$E(S^{\pi}(t)) = E(\min[U^{\pi}(t), M]) \leq E(\min[U^{\text{Max-}\lambda}(t), M]) = E(S^{\text{Max-}\lambda}(t))$$

where the inequality follows from Theorem 4.4.2, with the understanding that $g(U(0), \ldots, U(t)) \triangleq \min[U(t), M]$ is a function that is nondecreasing in all coordinates.

Corollary 4.4.1 establishes optimality of Max-λ over the class of all work-conserving policies. To complete the proof of Theorem 4.4.1, it remains to show that throughput cannot be increased by allowing for non-work-conserving policies. It suffices to show that for any non-work-conserving policy, there exists a work-conserving policy that gets the same or better throughput. The proof is straightforward and we give only a proof sketch for brevity. Consider any non-work-conserving policy π, and let $F^{\pi}_n(t)$ be its buffer state process on slot t for each queue n. For the same initial buffer state and arrival process, define the work-conserving policy π′ as follows: Every slot t, policy π′ initially allocates the M servers to exactly the same queues as policy π. However, if some of these queues are empty under policy π′, it reallocates those servers to any non-empty queues that are not yet allocated servers (in keeping with the work-conserving property). Let $F^{\pi'}_n(t)$ be the buffer state process for queue n under policy π′. It is not difficult to show that $F^{\pi}_n(t) \geq F^{\pi'}_n(t)$ for all queues n and all slots t. Therefore, on every slot t, the amount of blocked arrivals under policy π is always greater than or equal to that under policy π′. This implies the throughput under policy π is less than or equal to that of policy π′.
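As a concrete numerical illustration of this throughput ordering, the two-queue example treated in Section 4.6 can be checked by solving the priority chain directly. The sketch below (function names are ours) builds the four-state Markov chain for two single-buffer queues and one server with strict priority to queue 1, computes the stationary distribution by power iteration, and evaluates the throughput as one minus the steady-state probability of an empty system:

```python
def step_dist(state, lam1, lam2):
    """One-slot transition distribution. Service happens first (strict
    priority to queue 1), then independent Bernoulli arrivals; an arrival
    to a full single-packet buffer is dropped."""
    f1, f2 = state
    if f1:
        f1 = 0          # serve queue 1 first
    elif f2:
        f2 = 0          # otherwise serve queue 2
    dist = {}
    for a1 in (0, 1):
        for a2 in (0, 1):
            p = (lam1 if a1 else 1 - lam1) * (lam2 if a2 else 1 - lam2)
            ns = (max(f1, a1), max(f2, a2))   # full buffers drop arrivals
            dist[ns] = dist.get(ns, 0.0) + p
    return dist

def throughput(lam1, lam2, iters=1000):
    """Steady-state throughput = P(system non-empty) = 1 - pi[(0,0)]."""
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    pi = {s: 0.25 for s in states}
    for _ in range(iters):
        nxt = {s: 0.0 for s in states}
        for s, ps in pi.items():
            for ns, p in step_dist(s, lam1, lam2).items():
                nxt[ns] += ps * p
        pi = nxt
    return 1.0 - pi[(0, 0)]

# Priority to the larger-rate queue (Max-lambda) vs. the smaller (Min-lambda):
print(throughput(0.5, 0.25))   # approximately 0.7
print(throughput(0.25, 0.5))   # approximately 19/28 = 0.679
```

The numerical values agree with the closed-form solution of the global balance equations for this chain, confirming the strict throughput gap between the two work-conserving orderings.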
In this section, we demonstrate near optimality of the multi-user DPP ratio indexing algorithm by extensive simulations. In the first part, we simulate the case in which the file length distribution is geometric, and show that the suboptimality gap is extremely small. In the second part, we test the robustness of our algorithm for more general scenarios in which the file length distribution is not geometric. For simplicity, it is assumed throughout that all transmissions send a fixed sized packet, all files are an integer number of these packets, and that decisions $\alpha_n(t) \in \mathcal{A}_n$ affect the success probability of the transmission as well as the power expenditure.

In the first simulation we use N = 8, M = 4 with action set $\mathcal{A}_n = \{0, 1\}$ for all n. The settings are generated randomly and specified in Table 4.1, and the constraint is β = 5.

Table 4.1: Problem parameters

User   λn       μn       φn(1)    cn       pn(1)
1      0.0028   0.5380   0.4842   4.7527   3.9504
2      0.4176   0.5453   0.4908   2.0681   3.7391
3      0.0888   0.5044   0.4540   2.8656   3.5753
4      0.3181   0.6103   0.5493   2.4605   2.1828
5      0.4151   0.9839   0.8855   4.5554   3.1982
6      0.2546   0.5975   0.5377   3.9647   3.5290
7      0.1705   0.5517   0.4966   1.5159   2.5226
8      0.2109   0.7597   0.6837   3.6364   2.5376

The algorithm is run for 1 million slots in each trial and each point is the average of 100 trials. We compare the performance of our algorithm with the optimal randomized policy. The optimal policy is computed by constructing composite states (i.e., if there are three users where user 1 is at state 0, user 2 is at state 1 and user 3 is at state 1, we view 011 as a composite state), and then reformulating this MDP into a linear program (see [Fox66a]).

In Fig. 4.3, we show that as our tradeoff parameter V gets larger, the objective value approaches the optimal value and achieves a near optimal performance. Fig. 4.4 and Fig. 4.5 show that V also affects the virtual queue size and the constraint gap.
As V gets larger, the average virtual queue size becomes larger and the constraint gap becomes smaller. We also plot the upper bound on the queue size derived from Lemma 4.3.1 in Fig. 4.5, demonstrating that the queue is bounded. In order to show that V is indeed a trade-off parameter affecting the convergence time, we plot Fig. 4.6. It can be seen from the figure that as V gets larger, the number of time slots needed for the running average to roughly converge to the optimal power expenditure becomes larger.

In the second simulation, we explore the parameter space and demonstrate that in general the suboptimality gap of our algorithm is negligible. First, we define the relative error as follows:

$$\text{relative error} = \frac{|OBJ - OPT|}{OPT} \qquad (4.52)$$

where OBJ is the objective value after running 1 million slots of our algorithm and OPT is the optimal value. We first explore the system parameters by letting the $\lambda_n$'s and $\mu_n$'s take random values within 0 and 1, letting $c_n$ take random values within 1 and 5, choosing V = 70, and fixing the remaining parameters the same as in the last experiment. We conduct 1000 Monte Carlo experiments and calculate the average relative error.
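The metric (4.52) and its Monte Carlo average are straightforward to compute; a minimal helper (names are ours) is:

```python
def relative_error(obj, opt):
    """Relative error of one run, per (4.52): |OBJ - OPT| / OPT."""
    return abs(obj - opt) / opt

def average_relative_error(objs, opt):
    """Average of (4.52) over a list of Monte Carlo runs."""
    return sum(relative_error(o, opt) for o in objs) / len(objs)

# e.g. two runs ending at objective 4.0 and 4.4 against an optimal value 4.0:
print(average_relative_error([4.0, 4.4], 4.0))   # approximately 0.05
```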
Figure 4.3: Throughput versus tradeoff parameter V (curves: Lyapunov Indexing and Optimal Throughput).
Figure 4.4: The time average power consumption versus tradeoff parameter V (curves: Lyapunov Indexing and Power Constraint).
Figure 4.5: Average virtual queue backlog versus tradeoff parameter V (curves: Average Queue size and Queue size bound).
Figure 4.6: Running average power consumption versus tradeoff parameter V (curves for V = 4, 10, 40, 70).

Table 4.2: Problem parameters under geometric, uniform and Poisson distributions

User   μn     Unif. interval   Poiss. mean   λn       φn(1)    cn       pn(1)
1      1/3    [1,5]            3             0.4955   0.1832   4.3261   2.8763
2      1/2    [1,3]            2             0.1181   0.4187   1.6827   2.0549
3      1/2    [1,3]            2             0.1298   0.4491   1.9483   2.1469
4      1/7    [1,13]           7             0.4660   0.0984   2.7495   3.4472
5      1/4    [1,7]            4             0.1661   0.1742   1.5535   3.2801
6      1/3    [1,5]            3             0.2124   0.3101   4.3151   3.5648
7      1/2    [1,3]            2             0.5295   0.4980   3.6701   2.4680
8      1/5    [1,9]            5             0.2228   0.1971   4.0185   2.2984
9      1/4    [1,7]            4             0.0332   0.1986   3.0411   2.5747

Next, we explore the control parameters by letting $p_n(1)$ take random values within 2 and 4, letting the $\phi_n(1)/\mu_n$ values take random values between 0 and 1, choosing V = 70, and fixing the remaining parameters the same as in the first simulation. Both experiments show that the suboptimality gap is extremely small.

In this part, we test the sensitivity of the algorithm to different file length distributions. In particular, the uniform distribution and the Poisson distribution are implemented respectively, while our algorithm still treats them as a geometric distribution with the same mean. We then compare their throughputs with the geometric case.

We use N = 9, M = 4 with action set $\mathcal{A}_n = \{0, 1\}$ for all n. The settings are specified in Table 4.2 with constraint β = 5. Notice that for the geometric and uniform distributions, the file lengths are taken to be integer values. The algorithm is run for 1 million slots in each trial and each point is the average of 100 trials.

While the decisions are made using these values, the effect of these decisions incorporates the actual (non-memoryless) file sizes. Fig. 4.7 shows the throughput-versus-V relation for the two non-memoryless cases and the memoryless case with matched means. The performance of all three is similar. This illustrates that the indexing algorithm is robust under different file length distributions.
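The matched-mean construction in Table 4.2 pairs a geometric length of mean $k = 1/\mu_n$ with a uniform length on the integers $[1, 2k-1]$ and a Poisson length of mean k. A sampling sketch of this construction (sampler names are ours; how the thesis handles a zero-valued Poisson draw is not specified) is:

```python
import math
import random

def sample_geometric(k, rng):
    """Geometric file length on {1,2,...} with success probability 1/k (mean k)."""
    n = 1
    while rng.random() > 1.0 / k:
        n += 1
    return n

def sample_uniform(k, rng):
    """Uniform file length on the integers 1..2k-1 (mean k), as in Table 4.2."""
    return rng.randint(1, 2 * k - 1)

def sample_poisson(k, rng):
    """Poisson file length with mean k via Knuth's product-of-uniforms method."""
    limit, prod, n = math.exp(-k), 1.0, 0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return n
        n += 1

rng = random.Random(0)
k, trials = 3, 100000   # e.g. user 1 of Table 4.2: mu = 1/3, mean length 3
for sampler in (sample_geometric, sample_uniform, sample_poisson):
    mean = sum(sampler(k, rng) for _ in range(trials)) / trials
    print(sampler.__name__, round(mean, 2))   # all three empirical means near 3
```

All three samplers share the mean seen by the controller, while only the geometric one is memoryless, which is exactly the mismatch the robustness experiment probes.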
Figure 4.7: Throughput versus tradeoff parameter V under different file length distributions (geometric, uniform, Poisson).

4.6.1 Throughput comparison between Max-λ and Min-λ

This section shows that different work conserving policies can give different throughput for the N single-buffer queue problem of Section 4.4.1. Suppose we have two single-buffer queues and one server. Let $\lambda_1, \lambda_2$ be the arrival rates of the i.i.d. Bernoulli arrival processes for queues 1 and 2. Assume $\lambda_1 \neq \lambda_2$. There are 4 system states: (0,0), (0,1), (1,0), (1,1), where state (i,j) means queue 1 has i packets and queue 2 has j packets. Consider the (work conserving) policy of giving queue 1 strict priority over queue 2. This is equivalent to the Max-λ policy when $\lambda_1 > \lambda_2$, and is equivalent to the Min-λ policy when $\lambda_1 < \lambda_2$. Let $\theta(\lambda_1, \lambda_2)$ be the steady state throughput. Then:

$$\theta(\lambda_1, \lambda_2) = p_{0,1} + p_{1,0} + p_{1,1}$$

where $p_{i,j}$ is the steady state probability of the resulting discrete time Markov chain. One can solve the global balance equations to show that $\theta(1/2, 1/4) > \theta(1/4, 1/2)$, so the Max-λ policy has a higher throughput than the Min-λ policy. In particular, it can be shown that:

• Max-λ throughput: $\theta(1/2, 1/4) = 0.7$

• Min-λ throughput: $\theta(1/4, 1/2) \approx 0.679$

4.6.2 Proof of Lemma 4.4.2

This section proves that:

$$\sum_{n=1}^{j} \tilde{F}^{\pi}_n(\tau) \leq \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) \quad \forall j \in \{1, 2, \cdots, N\}. \qquad (4.53)$$

The case j = N is already established from (4.42). Fix $j \in \{1, 2, \ldots, N-1\}$. Since π cannot transmit more packets than Max-λ during slot τ, inequality (4.53) is proved by considering two cases:

1. Policy π transmits fewer packets than policy Max-λ. Then π transmits fewer than M packets during slot τ. The work-conserving nature of π implies all non-empty queues were served, so $\tilde{F}^{\pi}_n(\tau) = 0$ for all n and (4.53) holds.

2. Policy π transmits the same number of packets as policy Max-λ. In this case, consider the temporary buffer states of the last N − j queues under policy Max-λ. If $\sum_{n=j+1}^{N} \tilde{F}^{\lambda}_n(\tau) = 0$, then clearly the following holds:

$$\sum_{n=j+1}^{N} \tilde{F}^{\pi}_n(\tau) \geq \sum_{n=j+1}^{N} \tilde{F}^{\lambda}_n(\tau). \qquad (4.54)$$

Subtracting (4.54) from (4.42) immediately gives (4.53). If $\sum_{n=j+1}^{N} \tilde{F}^{\lambda}_n(\tau) >$
0, then all M servers of the Max-λ system were devoted to serving the largest-λ queues. So only packets in the last N − j queues could be transmitted by Max-λ during slot τ. In particular, $\alpha^{\lambda}_n(\tau) = 0$ for all $n \in \{1, \ldots, j\}$, and so (by (4.41)):

$$\sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau) = \sum_{n=1}^{j} F^{\lambda}_n(\tau) \qquad (4.55)$$

Thus:

$$\sum_{n=1}^{j} \tilde{F}^{\pi}_n(\tau) \leq \sum_{n=1}^{j} F^{\pi}_n(\tau) \qquad (4.56)$$
$$\leq \sum_{n=1}^{j} F^{\lambda}_n(\tau) \qquad (4.57)$$
$$= \sum_{n=1}^{j} \tilde{F}^{\lambda}_n(\tau), \qquad (4.58)$$

where (4.56) holds by (4.40), (4.57) holds because (4.39) is true on slot t = τ, and the last equality holds by (4.55). This proves (4.53).

Chapter 5

Opportunistic Scheduling over Renewal Systems

This chapter considers an opportunistic scheduling problem over a single renewal system. Different from previous chapters, we consider the scenario where at the beginning of each renewal frame, the controller observes a random event and then chooses an action in response to the event, which affects the duration of the frame, the amount of resources used, and a penalty metric. The goal is to make frame-wise decisions so as to minimize the time average penalty subject to time average resource constraints. This problem has applications to task processing and communication in data networks, as well as to certain classes of Markov decision problems. We formulate the problem as a dynamic fractional program and propose an adaptive algorithm which uses an empirical accumulation as a feedback parameter. A key feature of the proposed algorithm is that it does not require knowledge of the random event statistics and potentially allows (uncountably) infinite event sets. We prove the algorithm satisfies all desired constraints and achieves $O(\epsilon)$ near optimality with probability 1.

Consider a system that operates over the timeline of real numbers $t \geq$
0. The timeline is divided into back-to-back periods called renewal frames and the start of each frame is called a renewal (see Fig. 5.1). The system state is refreshed at each renewal. At the start of each renewal frame $n \in \{0, 1, 2, \ldots\}$ the controller observes a random event $\omega[n] \in \Omega$ and then takes an action $\alpha[n]$ from an action set $\mathcal{A}$ in response to $\omega[n]$. The pair $(\omega[n], \alpha[n])$ affects: (i) the duration of that renewal frame; (ii) a vector of resource expenditures for that frame; (iii) a penalty incurred on that frame. The goal is to choose actions over time to minimize the time average penalty subject to time average constraints on the resources without knowing any statistics of $\omega[n]$. We call such a problem opportunistic scheduling over renewal systems.

Figure 5.1: An illustration of a sequence of renewal frames.

This problem has applications to task processing in computer networks, and certain generalizations of Markov decision problems.

• Task processing networks: Consider a device that processes tasks back-to-back. Each renewal period corresponds to the time required to complete a single task. The random event $\omega[n]$ observed corresponds to a vector of task parameters, including the type, size, and resource requirements for that particular task. The action set $\mathcal{A}$ consists of different processing mode options, and the specific action $\alpha[n]$ determines the processing time, energy expenditure, and task quality. In this case, task quality can be defined as a negative penalty, and the goal is to maximize time average quality subject to power constraints and task completion rate constraints. A specific example of this sort is the following file downloading problem: Consider a wireless device that repeatedly downloads files. The device has two states: active (wants to download a file) and idle (does not want to download a file). Renewals occur at the start of each new active state.
Here, $\omega[n]$ denotes the observed wireless channel state, which affects the success probability of downloading a file (and thereby affects the transition probability from active to idle). This example is discussed further in the simulation section (Section 5.6).

• Hierarchical Markov decision problems: Consider a slotted two-timescale Markov decision process (MDP) over an infinite horizon and with constraints on average cost per slot. An MDP is run on the lower level, with a special state that is recurrent under any sequence of actions. The renewals are defined as revisitation times to that state. On a higher level, a random event ω is observed upon each revisitation to the renewal state on the lower level. Then, a decision is made on the higher level in response to ω, which in turn affects the transition probability and penalty/cost received per slot on the lower level until the next renewal. Such a problem is a generalization of the classical MDP problem (e.g. [Ros02], [Ber01]) and has been considered previously in [Wer13], [CFMS03] with discrete finite states and full information on both levels. A heuristic method is also proposed in [Wer13] when some of the information is unknown. The algorithm of the current chapter does not require knowledge of the statistics of ω and allows the event set Ω to be potentially (uncountably) infinite.

Most works on optimization over renewal systems consider the simpler scenario of knowing the probability distribution of $\omega[n]$. In such a case, one can show via the renewal-reward theory that the problem can be solved (offline) by finding the solution to a linear fractional program. This idea has been applied to solve MDPs in the seminal work [Fox66b]. Methods for solving linear fractional programs can also be found, for example, in [Sch83, BV04].
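To make the offline route concrete: with finitely many candidate policies, each inducing a known pair of expected penalty and expected frame length $(\hat{y}, \hat{T})$, the fractional objective can be minimized by the classical parametric (Dinkelbach-type) iteration, which repeatedly minimizes the linearized expression $y - \theta T$ and resets θ to the minimizer's ratio. The sketch below is illustrative only (our own toy construction, not the algorithm of this chapter):

```python
def min_ratio(pairs, theta0=0.0, tol=1e-12, max_iter=100):
    """Dinkelbach-type iteration for min y/T over a finite set of (y, T)
    pairs with T > 0. Each step minimizes y - theta*T and resets theta to
    the minimizer's ratio; theta converges to the optimal ratio."""
    theta = theta0
    for _ in range(max_iter):
        y, T = min(pairs, key=lambda p: p[0] - theta * p[1])
        if abs(y / T - theta) < tol:
            return theta
        theta = y / T
    return theta

# Ratios here are 1.5, 1.25 and 0.4, so the minimum ratio is 0.4:
print(min_ratio([(3.0, 2.0), (5.0, 4.0), (2.0, 5.0)]))   # 0.4
```

This works offline because the pairs, i.e. the statistics of ω[n], are assumed known in advance; the limitations discussed next are precisely about what happens when they are not.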
However, the practical limitations of such an offline algorithm are twofold: First, if the event set Ω is large, then there are too many probabilities $Pr(\omega[n] = \omega)$, $\omega \in \Omega$, to estimate, and the corresponding offline optimization problem may be difficult to solve even if all probabilities are estimated accurately. Second, generic offline optimization solvers may not take advantage of the special renewal structure of the system. One notable example is the treatment of power and delay minimization for a multi-class M/G/1 queue in [Yao02, LN14], where the renewal structure allows the well known c-μ rule for delay minimization to be extended to treat both power and delay constraints.

The work in [Nee10b, Nee13b] presents a drift-plus-penalty (DPP) ratio algorithm solving renewal optimizations knowing the distribution of $\omega[n]$. The algorithm treats the constraints via virtual queues so that one only needs to minimize an unconstrained ratio during every renewal frame. The algorithm provably meets all constraints and achieves asymptotic near-optimality. The works [WUZ+
15, UWH +
15] show that the edge cloud server migration problem can be formulated as a specific renewal optimization. Using a variant of the DPP ratio algorithm, they show that solving a simple stochastic shortest path problem during every renewal frame gives near-optimal performance. The work [WN18] solves a more general asynchronous optimization over parallel renewal systems, though knowledge of the random event statistics is still required. It is worth noting that the work [Nee13b] also proposes a heuristic algorithm for the case when the distribution of $\omega[n]$ is not known. That algorithm is partially analyzed: It is shown that if a certain process converges, then the algorithm converges to a near-optimal point. However, whether or not such a process converges is unknown.

The renewal optimization problem considered in this chapter is a generalization of stochastic optimization over fixed time slots. Such problems are categorized based on whether or not the random event is observed before the decision is made. Cases where the random event is observed before taking actions are often referred to as opportunistic scheduling problems. Over the past decades, many algorithms have been proposed, including max-weight ([TE90, TE93]) and Lyapunov optimization ([ES06, ES07, Nee10b, GNT+]). Cases where the random event is not observed before taking actions are often referred to as online learning problems. Various algorithms have been developed for unconstrained learning, including the weighted majority algorithm ([LW94]), the multiplicative weighting algorithm ([FS99]), following the perturbed leader ([HP05]) and online gradient descent ([Zin03, HK14]). The resource constrained learning problem is studied in [MJY12] and [WSLJ15]. Online learning with an underlying MDP structure is also treated using modified multiplicative weighting ([EDKM05]) and improved following the perturbed leader ([YMS09]).

In this work, we focus on opportunistic scheduling over renewal systems and propose a new algorithm that runs online (i.e. takes actions in response to each observed $\omega[n]$).
Unlike prior works, the proposed algorithm requires neither the statistics of $\omega[n]$ nor explicit estimation of them, and it is fully analyzed with convergence properties that hold with probability 1. From a technical perspective, we prove near-optimality of the algorithm by showing asymptotic stability of a customized process, relying on a novel construction of exponential supermartingales which could be of independent interest. We complement our theoretical results with simulation experiments on a time varying constrained MDP.

Consider a system where the timeline is divided into back-to-back time periods called frames. At the beginning of frame n ($n \in \{0, 1, 2, \cdots\}$), a controller observes the realization of a random variable $\omega[n]$, which is an i.i.d. copy of a random variable taking values in a compact set $\Omega \subseteq \mathbb{R}^q$ with distribution function unknown to the controller. Then, after observing the random event $\omega[n]$, the controller chooses an action vector $\alpha[n] \in \mathcal{A}$. Then, the tuple $(\omega[n], \alpha[n])$ induces the following random variables:

• The penalty received during frame n: $y[n]$.

• The length of frame n: $T[n]$.

• A vector of resource consumptions during frame n: $z[n] = [z_1[n], z_2[n], \cdots, z_L[n]]$.

We assume that given $\alpha[n] = \alpha$ and $\omega[n] = \omega$ at frame n, $(y[n], T[n], z[n])$ is a random vector independent of the outcomes of previous frames, with known expectations. We then denote these conditional expectations as

$$\hat{y}(\omega, \alpha) = E(y[n] \mid \omega, \alpha), \quad \hat{T}(\omega, \alpha) = E(T[n] \mid \omega, \alpha), \quad \hat{z}(\omega, \alpha) = E(z[n] \mid \omega, \alpha),$$

which are all deterministic functions of ω and α. This notation is useful when we want to highlight the action α we choose. The analysis assumes a single action in response to the observed $\omega[n]$ at each frame.
Nevertheless, an ergodic MDP can fit into this model by defining the action as a selection of a policy to implement over that frame, so that the corresponding $\hat{y}(\omega, \alpha)$, $\hat{T}(\omega, \alpha)$ and $\hat{z}(\omega, \alpha)$ are expectations over the frame under the chosen policy.

Let

$$\overline{y}[N] = \frac{1}{N} \sum_{n=0}^{N-1} y[n], \quad \overline{T}[N] = \frac{1}{N} \sum_{n=0}^{N-1} T[n], \quad \overline{z}_l[N] = \frac{1}{N} \sum_{n=0}^{N-1} z_l[n], \quad l \in \{1, 2, \cdots, L\}.$$

The goal is to minimize the time average penalty subject to L constraints on resource consumptions. Specifically, we aim to solve the following fractional programming problem:

$$\min \quad \limsup_{N \to \infty} \frac{\overline{y}[N]}{\overline{T}[N]} \qquad (5.1)$$
$$\text{s.t.} \quad \limsup_{N \to \infty} \frac{\overline{z}_l[N]}{\overline{T}[N]} \leq c_l, \quad \forall l \in \{1, 2, \cdots, L\}, \qquad (5.2)$$
$$\alpha[n] \in \mathcal{A}, \quad \forall n \in \{0, 1, 2, \cdots\}, \qquad (5.3)$$

where $c_l$, $l \in \{1, 2, \cdots, L\}$, are nonnegative constants, and both the minimum and the constraints are taken in an almost sure sense. Finally, we use $\theta^*$ to denote the minimum that can be achieved by solving the above optimization problem. For simplicity of notation, let

$$K[n] = \sqrt{\sum_{l=1}^{L} (z_l[n] - c_l T[n])^2}. \qquad (5.4)$$

Our main result requires the following assumptions; their importance will become clear as we proceed. We begin with the following boundedness assumption:
Assumption 5.2.1 (Exponential type). Given $\omega[n] = \omega \in \Omega$ and $\alpha[n] = \alpha \in \mathcal{A}$ for a fixed n, it holds that $T[n] \geq 1$ with probability 1 and $y[n]$, $K[n]$, $T[n]$ are of exponential type, i.e. there exists a constant $\eta > 0$ such that

$$E\left(\exp\left(\eta \left|y[n]\right|\right) \,\middle|\, \omega, \alpha\right) \leq B + 1,$$
$$E\left(\exp\left(\eta \left|K[n]\right|\right) \,\middle|\, \omega, \alpha\right) \leq B + 1,$$
$$E\left(\exp\left(\eta \left|T[n]\right|\right) \,\middle|\, \omega, \alpha\right) \leq B + 1,$$

where B is a positive constant.

The following proposition is a simple consequence of the above assumption:
Proposition 1. Suppose Assumption 5.2.1 holds. Let $X[n]$ be any of the three random variables $y[n]$, $K[n]$ and $T[n]$ for a fixed n. Then, given any $\omega[n] = \omega \in \Omega$ and $\alpha[n] = \alpha \in \mathcal{A}$,

$$E\left(\left|X[n]\right| \,\middle|\, \omega, \alpha\right) \leq B/\eta, \qquad E\left(X[n]^2 \,\middle|\, \omega, \alpha\right) \leq 2B/\eta^2.$$

The proof follows from the inequality:

$$B + 1 \geq E\left(e^{\eta\left|X[n]\right|} \,\middle|\, \omega, \alpha\right) \geq 1 + \eta \cdot E\left(\left|X[n]\right| \,\middle|\, \omega, \alpha\right) + \frac{\eta^2}{2} \cdot E\left(X[n]^2 \,\middle|\, \omega, \alpha\right).$$

Assumption 5.2.2.
There exists a positive constant $\theta_{\max}$ large enough so that the optimal objective of (5.1)-(5.3), denoted as $\theta^*$, falls into $[0, \theta_{\max})$ with probability 1.

Remark 5.2.1. If $\theta^* < 0$, then we can find a constant c large enough so that $\theta^* + c \geq 0$. Then, define a new penalty $y'[n] = y[n] + cT[n]$. It is easy to see that minimizing $\limsup_{N \to \infty} \overline{y'}[N]/\overline{T}[N]$ is equivalent to minimizing $\limsup_{N \to \infty} \overline{y}[N]/\overline{T}[N]$, and the optimal objective of the new problem is $\theta^* + c$, which is nonnegative.

Assumption 5.2.3.
Let $\left(\hat{y}(\omega, \alpha), \hat{T}(\omega, \alpha), \hat{z}(\omega, \alpha)\right)$ be the performance vector under a given $(\omega, \alpha)$ pair. Then, for any fixed $\omega \in \Omega$, the set of achievable performance vectors over all $\alpha \in \mathcal{A}$ is compact.

In order to state the next assumption, we need the notion of a randomized stationary policy. We start with the definition:
Definition 5.2.1 (Randomized stationary policy). A randomized stationary policy is an algorithm in which, at the beginning of each frame n, after observing the random event $\omega[n]$, the controller chooses $\alpha^*[n]$ with a conditional probability that is the same for all n.

Assumption 5.2.4 (Bounded achievable region). Let

$$(\overline{y}, \overline{T}, \overline{z}) \triangleq E\left(\hat{y}(\omega[0], \alpha^*[0]),\ \hat{T}(\omega[0], \alpha^*[0]),\ \hat{z}(\omega[0], \alpha^*[0])\right)$$

be the one-shot average of one randomized stationary policy. Let $\mathcal{R} \subseteq \mathbb{R}^{L+2}$ be the set of all achievable one-shot averages $(\overline{y}, \overline{T}, \overline{z})$. Then, $\mathcal{R}$ is bounded.

Assumption 5.2.5 (ξ-slackness). There exists a randomized stationary policy $\alpha^{(\xi)}[n]$ such that the following holds:

$$\frac{E\left(\hat{z}_l\left(\omega[n], \alpha^{(\xi)}[n]\right)\right)}{E\left(\hat{T}\left(\omega[n], \alpha^{(\xi)}[n]\right)\right)} = c_l - \xi, \quad \forall l \in \{1, 2, \cdots, L\},$$

where $\xi > 0$ is a constant.

Remark 5.2.2 (Measurability issue). We implicitly assume the policies for choosing α in reaction to ω result in a measurable α, so that $T[n]$, $y[n]$, $z[n]$ are valid random variables and the expectations in Assumptions 5.2.4 and 5.2.5 are well defined. This assumption is mild. For example, when the sets Ω and $\mathcal{A}$ are finite, it holds for any randomized stationary policy. More generally, if Ω and $\mathcal{A}$ are measurable subsets of some separable metric spaces, this holds whenever the conditional probability in Definition 5.2.1 is "regular" (see [Dur13] for discussions on regular conditional probability), and $T[n]$, $y[n]$, $z[n]$ are continuous functions on $\Omega \times \mathcal{A}$.

We define a vector of virtual queues $Q[n] = [Q_1[n]\ Q_2[n]\ \cdots\ Q_L[n]]$, which are 0 at n = 0 and updated as follows:

$$Q_l[n+1] = \max\{Q_l[n] + z_l[n] - c_l T[n],\ 0\}. \qquad (5.5)$$

The intuition behind this virtual queue idea is that if the algorithm can stabilize $Q_l[n]$, then the "arrival rate" $\overline{z}_l[N]/\overline{T}[N]$ is below the "service rate" $c_l$ and the constraint is satisfied.
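A direct transcription of the update (5.5) (helper name is ours) illustrates the stabilization intuition: when the per-frame consumption $z_l[n]$ exceeds $c_l T[n]$, the backlog grows linearly, and otherwise it stays at zero:

```python
def update_virtual_queues(Q, z, T, c):
    """Virtual queue update (5.5): Q_l[n+1] = max(Q_l[n] + z_l[n] - c_l*T[n], 0)."""
    return [max(q + zl - cl * T, 0.0) for q, zl, cl in zip(Q, z, c)]

# One constraint with c_1 = 0.5 and unit frame lengths:
Q = [0.0]
for _ in range(10):
    Q = update_virtual_queues(Q, z=[1.0], T=1.0, c=[0.5])
print(Q)   # [5.0]: consumption 1.0 > 0.5 per frame, so the backlog grows

Q = [0.0]
for _ in range(10):
    Q = update_virtual_queues(Q, z=[0.25], T=1.0, c=[0.5])
print(Q)   # [0.0]: constraint satisfied with slack, the queue stays empty
```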
The proposed algorithm then proceeds as in Algorithm 5.1 via two fixed parameters V > 0, δ > 0, and an additional process θ[n] initialized as θ[0] = 0. For any real number x, the notation [x]_0^{θmax} stands for the ceiling-and-floor truncation:

[x]_0^{θmax} = θmax, if x ∈ (θmax, +∞); x, if x ∈ [0, θmax]; 0, if x ∈ (−∞, 0).

Since the decision α[n] is made after observing ω[n], minimizing the conditional expectation (5.6) reduces to minimizing the deterministic function

V( ŷ(ω[n], α[n]) − θ[n] T̂(ω[n], α[n]) ) + Σ_{l=1}^L Q_l[n]( ẑ_l(ω[n], α[n]) − c_l T̂(ω[n], α[n]) ).

Thus, Algorithm 5.1 proceeds by observing ω[n] on each frame n and then choosing α[n] ∈ A to minimize the above deterministic function. We can now see that only knowledge of the current realization ω[n] is used, not the statistics of ω[n]. Also, the compactness assumption (Assumption 5.2.3) guarantees that the minimum of (5.6) is always achievable.

Algorithm 5.1
Online renewal optimization:

• At the beginning of each frame n, the controller observes Q_l[n], θ[n], ω[n] and chooses action α[n] ∈ A to minimize the following function:

E( V( y[n] − θ[n]T[n] ) + Σ_{l=1}^L Q_l[n]( z_l[n] − c_l T[n] ) | Q_l[n], θ[n], ω[n] ).    (5.6)

• Update θ[n]:

θ[n + 1] = [ (1/(n+1)^δ) Σ_{i=0}^n ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) ]_0^{θmax}.

• Update the virtual queues:

Q_l[n + 1] = max{ Q_l[n] + z_l[n] − c_l T[n], 0 }, l = 1, 2, · · · , L.

In this section, we prove that the proposed algorithm gives a sequence of actions {α[n]}_{n=0}^∞ which satisfies all desired constraints with probability 1. Specifically, we show that all virtual queues are stable with probability 1, leveraging an important lemma from [Haj82] to obtain an exponential bound for the norm of Q[n].

5.4.1 The drift-plus-penalty bound

The start of our proof uses the drift-plus-penalty methodology; for a general introduction to this topic, see [Nee12c]. We define the norm of the virtual queue vector as

‖Q[n]‖ ≜ ( Σ_{l=1}^L Q_l[n]² )^{1/2}. Define the
Lyapunov drift ∆(Q[n]) as

∆(Q[n]) ≜ (1/2)( ‖Q[n+1]‖² − ‖Q[n]‖² ).

Next, define the penalty function at frame n as V( y[n] − θ[n]T[n] ), where V > 0 is a fixed trade-off parameter. The proposed algorithm chooses α[n] ∈ A to greedily minimize the following drift-plus-penalty expression, with the observed Q[n], ω[n] and θ[n]:

E( V( y[n] − θ[n]T[n] ) + ∆(Q[n]) | Q_l[n], θ[n], ω[n] ).

The penalty term V( y[n] − θ[n]T[n] ) uses the θ[n] variable, which depends on events from all previous frames. This penalty does not fit the rubric of [Nee12c], and convergence of the algorithm does not follow from prior work. A significant thrust of the current chapter is convergence analysis under such a penalty function.

In order to obtain an upper bound on ∆(Q[n]), we square both sides of (5.5) and use the fact that max{x, 0}² ≤ x²:

Q_l[n+1]² ≤ Q_l[n]² + ( z_l[n] − c_l T[n] )² + 2 Q_l[n]( z_l[n] − c_l T[n] ).    (5.7)

Summing the above over all l ∈ {1, . . . , L} and dividing by 2 gives

∆(Q[n]) ≤ (1/2) Σ_{l=1}^L ( z_l[n] − c_l T[n] )² + Σ_{l=1}^L Q_l[n]( z_l[n] − c_l T[n] ).

Adding V( y[n] − θ[n]T[n] ) to both sides and taking conditional expectations gives

E( V( y[n] − θ[n]T[n] ) + ∆(Q[n]) | Q_l[n], θ[n], ω[n] )
≤ E( V( y[n] − θ[n]T[n] ) + Σ_{l=1}^L Q_l[n]( z_l[n] − c_l T[n] ) | Q_l[n], θ[n], ω[n] ) + (1/2) Σ_{l=1}^L E( ( z_l[n] − c_l T[n] )² )
≤ E( V( y[n] − θ[n]T[n] ) + Σ_{l=1}^L Q_l[n]( z_l[n] − c_l T[n] ) | Q_l[n], θ[n], ω[n] ) + B/η²,    (5.8)

where the last inequality follows from Proposition 1. Thus, as we have already seen in Algorithm 5.1, the proposed algorithm observes the vector Q[n], the random event ω[n] and θ[n] at frame n, and minimizes the right hand side of (5.8).
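For a finite action set, one frame of this per-frame minimization can be sketched as follows (a minimal illustration, not the chapter's implementation; the callables `y_hat`, `T_hat`, `z_hat` are hypothetical placeholders modeling the performance of each (ω, α) pair, and all numeric values in the usage are assumptions):

```python
def run_frame(n, Q, theta, omega, actions, y_hat, T_hat, z_hat, c, V, delta,
              running_sum, theta_max):
    """One frame of Algorithm 5.1 over a finite action set A.

    y_hat(omega, a), T_hat(omega, a), z_hat(omega, a) model the performance
    of action a under event omega; they are problem-specific placeholders.
    """
    L = len(Q)

    # Choose alpha[n] minimizing the deterministic form of (5.6).
    def score(a):
        T = T_hat(omega, a)
        z = z_hat(omega, a)
        return (V * (y_hat(omega, a) - theta * T)
                + sum(Q[l] * (z[l] - c[l] * T) for l in range(L)))
    alpha = min(actions, key=score)

    y, T, z = y_hat(omega, alpha), T_hat(omega, alpha), z_hat(omega, alpha)

    # Update the truncated pseudo average theta[n+1].
    running_sum += (y - theta * T
                    + sum(Q[l] * (z[l] - c[l] * T) for l in range(L)) / V)
    theta_next = min(max(running_sum / (n + 1) ** delta, 0.0), theta_max)

    # Virtual queue update (5.5).
    Q_next = [max(Q[l] + z[l] - c[l] * T, 0.0) for l in range(L)]
    return alpha, theta_next, Q_next, running_sum
```

The sketch makes explicit that only the current realization ω[n] and the locally kept state (Q[n], θ[n], running sum) are needed on each frame.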
In this section, we show how the bound (5.8) leads to the feasibility of the proposed algorithm. Define H_n as the system history up until frame n. Formally, {H_n}_{n=0}^∞ is a filtration where each H_n is the σ-algebra generated by all the random variables before frame n. Notice that since Q[n] and θ[n] depend only on the events before frame n, H_n contains both Q[n] and θ[n]. The following important lemma gives a stability criterion for any real random process with a certain negative drift property:

Lemma 5.4.1 (Theorem 2.3 of [Haj82]). Let R[n] be a real random process over n ∈ {0, 1, 2, · · · } satisfying the following two conditions for a fixed r > 0:

1. For any n, E( e^{r(R[n+1]−R[n])} | H_n ) ≤ Γ, for some Γ > 0.
2. Given R[n] ≥ σ, E( e^{r(R[n+1]−R[n])} | H_n ) ≤ ρ, for some ρ ∈ (0, 1).

Suppose further that R[0] ∈ ℝ is given and finite. Then, at every n ∈ {0, 1, 2, · · · }, the following bound holds:

E( e^{rR[n]} ) ≤ ρⁿ e^{rR[0]} + ( (1 − ρⁿ)/(1 − ρ) ) Γ e^{rσ}.

Thus, in order to show the stability of the virtual queue process, it is enough to verify the above two conditions with R[n] = ‖Q[n]‖. The following lemma shows that ‖Q[n]‖ satisfies these two conditions:
Lemma 5.4.2 (Drift condition). Let R[n] = ‖Q[n]‖. Then, R[n] satisfies the two conditions in Lemma 5.4.1 with the following constants:

Γ = B, r = min{ η, ξη²/B }, σ = CV, ρ = 1 − rξ/(Bη) < 1,

where C = B/(Vξη) + (θmax + 1)B/(ξη) − ξ/V.

The central idea of the proof is to plug the ξ-slackness policy of Assumption 5.2.5 into the right hand side of (5.8). A similar idea was presented in Lemma 6 of [WYN15] under bounded increments of the virtual queue process. Here, we generalize the idea to the case where the increments of the virtual queues involve the exponential-type random variables z_l[n] and T[n]. Note that the boundedness of θ[n] is crucial for the argument to hold, which justifies the truncation of the pseudo average in the algorithm. Lemma 5.4.2 is proved in Appendix 5.7.

Combining the above two lemmas, we immediately have the following corollary:

Corollary 5.4.1 (Exponential decay). Given Q[0] = 0, the following holds for any n ∈ {0, 1, 2, · · · } under the proposed algorithm:

E( e^{r‖Q[n]‖} ) ≤ D,    (5.9)

where D = 1 + B e^{rCV}/(1 − ρ), and r, ρ, C are as defined in Lemma 5.4.2. Furthermore,

E( ‖Q[n]‖ ) ≤ (1/r) log( 1 + B e^{rCV}/(1 − ρ) ),

i.e. the queue size is O(V).

The bound on E( ‖Q[n]‖ ) follows readily from (5.9) via Jensen's inequality. With Corollary 5.4.1 in hand, we can prove the following theorem:

Theorem 5.4.1 (Feasibility). All constraints in (5.1)-(5.3) are satisfied under the proposed algorithm with probability 1.

Proof of Theorem 5.4.1. By the queue updating rule (5.5), for any n and any l ∈ {1, 2, · · · , L}, one has

Q_l[n + 1] ≥ Q_l[n] + z_l[n] − c_l T[n].

Fix N as a positive integer. Then, summing over all n ∈ {0, 1, 2, · · · , N − 1},

Q_l[N] ≥ Q_l[0] + Σ_{n=0}^{N−1} ( z_l[n] − c_l T[n] ).

Since Q_l[0] = 0, ∀l, and T[n] ≥ 1, ∀n,

Σ_{n=0}^{N−1} z_l[n] / Σ_{n=0}^{N−1} T[n] − c_l ≤ Q_l[N] / Σ_{n=0}^{N−1} T[n] ≤ Q_l[N] / N.    (5.10)

Define the event A_N^(ε) = { Q_l[N] > εN }. By the Markov inequality and Corollary 5.4.1, for any ε > 0, we have

Pr( Q_l[N] > εN ) ≤ Pr( r‖Q[N]‖ > rεN ) = Pr( e^{r‖Q[N]‖} > e^{rεN} ) ≤ E( e^{r‖Q[N]‖} ) / e^{rεN} ≤ D e^{−rεN},

where r is defined in Corollary 5.4.1. Thus, we have

Σ_{N=0}^∞ Pr( Q_l[N] > εN ) ≤ D Σ_{N=0}^∞ e^{−rεN} < +∞.

Thus, by the Borel-Cantelli lemma [Dur13],
Pr( A_N^(ε) occurs infinitely often ) = 0.

Since ε > 0 is arbitrary, letting ε → 0 gives

Pr( lim_{N→∞} Q_l[N]/N = 0 ) = 1.

Taking the limit N → ∞ on both sides of (5.10) and substituting in the above equation gives the claim.

In this section, we show that the proposed algorithm achieves a time average penalty within O(1/V) of the optimal objective θ*. Since the algorithm meets all the constraints, it follows that

lim sup_{n→∞} Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] ≥ θ*, w.p.1.

Thus, it is enough to prove the following theorem:
Theorem 5.5.1 (Near optimality). For any δ ∈ (1/2, 1) and V ≥ 1, the objective value produced by the proposed algorithm is near optimal, with

lim sup_{n→∞} Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] ≤ θ* + B/(η²V), w.p.1,

i.e. the algorithm achieves an O(1/V) near optimality. Remark 5.5.1.
Combining Theorem 5.5.1 with Corollary 5.4.1, we see that the tuning parameter V controls a trade-off between the sub-optimality and the virtual queue bound (i.e., the constraint violation). In particular, our result recovers the classical [O(1/V), O(V)] trade-off from the opportunistic scheduling literature [Nee10b].

In order to prove Theorem 5.5.1, we introduce the following notation (for n ≥ 1, with both quantities equal to 0 at n = 0):

original pseudo average: θ̂[n] ≜ (1/n^δ) Σ_{i=0}^{n−1} ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ),

tamed pseudo average: θ[n] ≜ [ (1/n^δ) Σ_{i=0}^{n−1} ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) ]_0^{θmax}.

5.5.1 Relation between θ̂[n] and θ[n]

We start with a preliminary lemma illustrating that the original pseudo average θ̂[n] behaves almost the same as the tamed pseudo average θ[n]. Note that θ[n] can be written as θ[n] = [θ̂[n]]_0^{θmax}.

Lemma 5.5.1 (Equivalence relation). For any x ∈ (0, θmax):

1. θ[n] ≥ x if and only if θ̂[n] ≥ x.
2. θ[n] ≤ x if and only if θ̂[n] ≤ x.
3. lim sup_{n→∞} θ[n] ≤ x if and only if lim sup_{n→∞} θ̂[n] ≤ x.
4. lim sup_{n→∞} θ[n] ≥ x if and only if lim sup_{n→∞} θ̂[n] ≥ x.

This lemma is intuitive and its proof is given in Appendix 5.7. We will prove results on θ̂[n] which extend naturally to θ[n] via Lemma 5.5.1. The key idea in proving Theorem 5.5.1 is to bound the original pseudo average process θ̂[n] asymptotically from above by θ*, which is Theorem 5.5.2 below.
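The relation θ[n] = [θ̂[n]]_0^{θmax} can be sketched with a few lines of code (a minimal numeric illustration; the summand values in the usage are arbitrary placeholders):

```python
def truncate(x, theta_max):
    """The ceiling-and-floor operator [x]_0^{theta_max}."""
    return min(max(x, 0.0), theta_max)

def pseudo_averages(summands, delta, theta_max):
    """Given the per-frame summands
    y[i] - theta[i]T[i] + (1/V) sum_l Q_l[i](z_l[i] - c_l T[i]),
    return (theta_hat[n], theta[n]) for n = len(summands)."""
    n = len(summands)
    theta_hat = sum(summands) / n ** delta
    return theta_hat, truncate(theta_hat, theta_max)
```

With four summands equal to 2 and δ = 1/2, the untruncated average is 8/√4 = 4, which the operator clamps to θmax when θmax = 3, illustrating how the tamed process stays in [0, θmax] while tracking θ̂[n] inside the interval.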
We then prove Theorem 5.5.2 through the following three steps:

• We construct a truncated version of θ̂[n], namely θ̃[n], which has the same limit as θ̂[n] (Lemma 5.5.3 below), so that it is enough to show θ̃[n] ≤ θ* asymptotically.
• For the process θ̃[n], we bound the moments of the hitting time, namely the time interval between two consecutive visits to the region { θ̃[n] ≤ θ* }, by constructing a dominating exponential supermartingale and bounding its size (Lemmas 5.5.6 and 5.5.7 below).
• We show that θ̃[n] > θ* only finitely often asymptotically (with probability 1) using the bounded moments of the hitting time.

The following lemma states that the optimum of (5.1)-(5.3) is achievable within the closure of the set of one-shot averages specified in Assumption 5.2.4:

Lemma 5.5.2 (Stationary optimality). Let θ* be the optimal objective of (5.1)-(5.3). Then, there exists a tuple (y*, T*, z*) ∈ R̄, the closure of R, such that the following hold:

y*/T* = θ*,    (5.11)
z*_l/T* ≤ c_l, ∀l ∈ {1, 2, · · · , L},    (5.12)

i.e. the optimum is achievable within R̄. The proof of this lemma is similar to the proof of Theorem 4.5 as well as Lemma 7.1 of [Nee10b]; we omit the details for brevity.

We start the truncation by picking an ε > 0 such that θ* + ε/V < θmax. We aim to show lim sup_{n→∞} θ[n] ≤ θ* + ε/V. By Lemma 5.5.1, it is enough to show lim sup_{n→∞} θ̂[n] ≤ θ* + ε/V. The following lemma tells us that it is enough to prove this for a further term-wise truncated version of θ̂[n].

Lemma 5.5.3 (Truncation lemma). Consider the following alternative pseudo average { θ̃[n] }_{n=0}^∞, obtained by truncating each summand, with θ̃[0] = 0 and

θ̃[n + 1] = (1/(n+1)^δ) Σ_{i=0}^n ( ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) ∧ ( 1/η + 4√L/(ηrV) ) log²(i + 1) ),

where a ∧ b ≜ min{a, b}, η is defined in Assumption 5.2.1, and r is defined in Lemma 5.4.2. Then, we have

lim sup_{n→∞} θ̂[n] = lim sup_{n→∞} θ̃[n].

Proof of Lemma 5.5.3.
Consider any frame i ∈ {0, 1, 2, . . . } in which the summands of θ̂[n] and θ̃[n] differ, i.e.

y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) > ( 1/η + 4√L/(ηrV) ) log²(i + 1).    (5.13)

By the Cauchy-Schwarz inequality, this implies

y[i] − θ[i]T[i] + (1/V) ( Σ_{l=1}^L Q_l[i]² )^{1/2} ( Σ_{l=1}^L (z_l[i] − c_l T[i])² )^{1/2} > ( 1/η + 4√L/(ηrV) ) log²(i + 1).

Thus, at least one of the following three events occurred:

1. A_i ≜ { y[i] − θ[i]T[i] > (1/η) log²(i + 1) }.
2. B_i ≜ { ( Σ_{l=1}^L Q_l[i]² )^{1/2} > (2√L/r) log(i + 1) }.
3. E_i ≜ { K[i] > (2/η) log(i + 1) },

where K[i] is defined in (5.4). Indeed, the occurrence of at least one of the three events is necessary for (5.13) to happen. We then argue that these three events jointly occur only finitely many times; thus, as n → ∞, the discrepancies are negligible.

Assume the event A_i occurs. Then, since y[i] − θ[i]T[i] ≤ y[i], it follows that y[i] > (1/η) log²(i + 1), and we have

Pr(A_i) ≤ Pr( y[i] > (1/η) log²(i + 1) ) = Pr( e^{ηy[i]} > e^{log²(i+1)} ) ≤ E( e^{ηy[i]} ) / (i + 1)^{log(i+1)} ≤ B / (i + 1)^{log(i+1)},

where the second-to-last inequality follows from the Markov inequality and the last inequality follows from Assumption 5.2.1.

Assume the event B_i occurs. Then, we have

‖Q[i]‖ = ( Σ_{l=1}^L Q_l[i]² )^{1/2} > (2√L/r) log(i + 1) ≥ (2/r) log(i + 1).

Thus,

Pr(B_i) ≤ Pr( ‖Q[i]‖ > (2/r) log(i + 1) ) = Pr( e^{r‖Q[i]‖} > e^{2 log(i+1)} ) ≤ E( e^{r‖Q[i]‖} ) / (i + 1)² ≤ D / (i + 1)²,

where the second-to-last inequality follows from the Markov inequality and the last inequality follows from Corollary 5.4.1.

Assume the event E_i occurs. Again, by Assumption 5.2.1 and the Markov inequality,

Pr(E_i) = Pr( K[i] > (2/η) log(i + 1) ) = Pr( e^{ηK[i]} > e^{2 log(i+1)} ) ≤ E( e^{ηK[i]} ) / (i + 1)² ≤ B / (i + 1)²,

where the last inequality follows from Assumption 5.2.1 again.

Now, by a union bound,

Pr( A_i ∪ B_i ∪ E_i ) ≤ Pr(A_i) + Pr(B_i) + Pr(E_i) ≤ B / (i + 1)^{log(i+1)} + (B + D) / (i + 1)²,

and thus,

Σ_{i=0}^∞ Pr( A_i ∪ B_i ∪ E_i ) ≤ Σ_{i=0}^∞ ( B / (i + 1)^{log(i+1)} + (B + D) / (i + 1)² ) < ∞.

By the Borel-Cantelli lemma, the event A_i ∪ B_i ∪ E_i occurs only finitely many times with probability 1, and the proof is finished.

Lemma 5.5.3 is crucial for the rest of the proof. Specifically, it creates an alternative sequence θ̃[n] with the following two properties:

1. We know exactly the upper bound of each summand, whereas in θ̂[n] there is no exact bound for the summand due to Q_l[i] and the other exponential-type random variables.
2. For any n ∈ ℕ, we have θ̃[n] ≤ θ̂[n]. Thus, if θ̃[n] ≥ θ* + ε/V for some n, then θ̂[n] ≥ θ* + ε/V.

The following preliminary lemma demonstrates a negative drift property for each of the summands in θ̃[n].

Lemma 5.5.4 (Key feature inequality). For any ε > 0, if θ[i] ≥ θ* + ε/V, then

E( ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) ∧ ( 1/η + 4√L/(ηrV) ) log²(i + 1) | H_i ) ≤ −ε/V.

Proof of Lemma 5.5.4.
Since the proposed algorithm minimizes (5.6) over all possible decisions in A, it achieves a value less than or equal to that of any randomized stationary algorithm α*[i]. This in turn implies

E( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) | H_i, ω[i] )
≤ E( ŷ(ω[i], α*[i]) − θ[i]T̂(ω[i], α*[i]) + (1/V) Σ_{l=1}^L Q_l[i]( ẑ_l(ω[i], α*[i]) − c_l T̂(ω[i], α*[i]) ) | H_i, ω[i] ).

Taking expectations on both sides with respect to ω[i] and using the fact that randomized stationary algorithms are i.i.d. over frames and independent of H_i, we have

E( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) | H_i ) ≤ y − θ[i]T + (1/V) Σ_{l=1}^L Q_l[i](z_l − c_l T),

for any (y, T, z) ∈ R. Since the tuple (y*, T*, z*) specified in Lemma 5.5.2 is in the closure R̄ of R, we can replace (y, T, z) by (y*, T*, z*) and the inequality still holds. This gives

E( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) | H_i )
≤ y* − θ[i]T* + (1/V) Σ_{l=1}^L Q_l[i](z*_l − c_l T*)
= T*( y*/T* − θ[i] + (1/V) Σ_{l=1}^L Q_l[i](z*_l/T* − c_l) )
≤ T*( θ* − θ[i] ) ≤ −ε/V,

where the second-to-last inequality follows from (5.11) and (5.12), and the last inequality follows from θ[i] ≥ θ* + ε/V and T* ≥ 1. Finally, since a ∧ b ≤ a for any real numbers a, b, it follows that

E( ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) ∧ ( 1/η + 4√L/(ηrV) ) log²(i + 1) | H_i )
≤ E( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) | H_i ) ≤ −ε/V,

and the claim follows.

Define n_k as the frame on which θ̃[n] visits the set (−∞, θ* + ε/V) for the k-th time, with the following conventions: 1. If θ̃[n] ∈ (−∞, θ* + ε/V) and θ̃[n + 1] ∈ (−∞, θ* + ε/V), then we count them as two visits. 2. When k = 1, n_1 is equal to 0. Define the hitting time

S_{n_k} = n_{k+1} − n_k.

The goal is to obtain a moment bound on this quantity when θ̃[n_k + 1] ≥ θ* + ε/V (otherwise, this quantity is 1). In order to do so, we introduce a new process as follows. For any n_k, define

F[n] ≜ Σ_{i=n_k}^{n−1} ( ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) ∧ ( 1/η + 4√L/(ηrV) ) log²(i + 1) ), ∀n > n_k.    (5.14)

The following lemma shows that this F[n] is closely related to θ̃[n]; it plays an important role in proving Lemma 5.5.7:

Lemma 5.5.5.
For any n > n_k, if θ̃[n] ≥ θ* + ε/V, then F[n] ≥ 0.

Proof of Lemma 5.5.5. Suppose θ̃[n] ≥ θ* + ε/V. Then, the following holds:

θ* + ε/V ≤ θ̃[n] = ( n_k^δ / n^δ ) θ̃[n_k] + (1/n^δ) F[n].

Thus,

F[n] ≥ n^δ( θ* + ε/V ) − n_k^δ θ̃[n_k].

Since at frame n_k we have θ̃[n_k] < θ* + ε/V, it follows that

F[n] ≥ ( n^δ − n_k^δ )( θ* + ε/V ).

Since θ* + ε/V ≥ 0, it follows that F[n] ≥ 0.

We now bound the hitting time S_{n_k} of the process θ̃[n] on the event { θ̃[n_k + 1] ≥ θ* + ε/V }, using the strictly negative drift property of Lemma 5.5.4. A classical approach to analyzing the hitting time of a stochastic process comes from Wald's construction of martingales for sequential analysis (see, for example, [Wal44]). Later, [Haj82] extended this idea to analyze the stability of a queueing system with a drift condition via a supermartingale construction. Here, we take one step further by considering the following supermartingale construction based on F[n]:

Lemma 5.5.6 (Exponential supermartingale). Fix ε > 0 and V ≥ max{ εη, 2√L/r, 1 } such that θ* + ε/V < θmax. Define a new random process G[n] starting from n_k + 1 with

G[n] ≜ ( exp( λ_n F[n ∧ (n_k + S_{n_k})] ) / Π_{i=n_k+1}^{n ∧ (n_k + S_{n_k})} ρ_i ) · 1_{ θ̃[n_k+1] ≥ θ*+ε/V },

where, for any set A, 1_A is the indicator function taking the value 1 if A is true and 0 otherwise. For any n ≥ n_k + 1, λ_n and ρ_n are defined as follows:

λ_n = ε / ( V e ( 1/η + 4√L/(ηrV) )² log⁴(n + 1) ),
ρ_n = 1 − ε² / ( 2V² e ( 1/η + 4√L/(ηrV) )² log⁴(n + 1) ).

Then, the process G[n] is measurable with respect to H_n, ∀n ≥ n_k + 1, and furthermore, it is a supermartingale with respect to the filtration {H_n}_{n ≥ n_k+1}. The proof of Lemma 5.5.6 is shown in Appendix 5.7.
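The bookkeeping behind the visit times n_k and the hitting times S_{n_k} can be sketched directly (a minimal illustration; the sample sequence in the usage below is an arbitrary placeholder):

```python
def visit_times_and_gaps(theta_tilde, threshold):
    """Return the visit times n_k at which theta_tilde[n] < threshold,
    together with the inter-visit gaps S_{n_k} = n_{k+1} - n_k."""
    visits = [n for n, x in enumerate(theta_tilde) if x < threshold]
    gaps = [b - a for a, b in zip(visits, visits[1:])]
    return visits, gaps
```

Since θ̃[0] = 0 is below any positive threshold, n_1 = 0 automatically, matching the convention above; consecutive visits yield gaps equal to 1.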
Remark 5.5.2.
If the increments F[n + 1] − F[n] were bounded, we could adopt a construction similar to that of [Haj82]. However, in our scenario F[n + 1] − F[n] is of order log²(n + 1), which is increasing and unbounded. Thus, we need decreasing exponents λ_n and increasing weights ρ_n to account for this. Furthermore, the indicator function reflects that we are only interested in the scenario { θ̃[n_k + 1] ≥ θ* + ε/V }.

The following lemma uses the previous result to bound the conditional fourth moment of the hitting time S_{n_k}.

Lemma 5.5.7.
Given V ≥ max{ εη, 2√L/r, 1 } as in Lemma 5.5.6, for any β ∈ (0, 1/2) and any ε > 0 such that θ* + ε/V < θmax, there exists a positive constant C_{β,V,ε} = O( V⁴β⁻⁴ε⁻⁴ ) such that

E( S_{n_k}⁴ | H_{n_k} ) ≤ C_{β,V,ε} (n_k + 2)^β, ∀k ≥ 1.

Proof of Lemma 5.5.7.
First, Lemma 5.5.6 gives that G[n] is a supermartingale starting from n_k + 1; thus, we have the following chain of inequalities for any n ≥ n_k + 1:

G[n_k + 1] = E( G[n_k + 1] | H_{n_k+1} ) ≥ E( G[n] | H_{n_k+1} )
= E( ( e^{λ_n F[n ∧ (n_k+S_{n_k})]} / Π_{i=n_k+1}^{n ∧ (n_k+S_{n_k})} ρ_i ) 1_{ θ̃[n_k+1] ≥ θ*+ε/V } | H_{n_k+1} )
≥ E( ( e^{λ_n F[n ∧ (n_k+S_{n_k})]} / Π_{i=n_k+1}^n ρ_i ) 1_{ S_{n_k} ≥ n−n_k+1 } 1_{ θ̃[n_k+1] ≥ θ*+ε/V } | H_{n_k+1} )
≥ ( 1 / Π_{i=n_k+1}^n ρ_i ) Pr( S_{n_k} ≥ n − n_k + 1, θ̃[n_k + 1] ≥ θ* + ε/V | H_{n_k+1} ),

where the first inequality uses the supermartingale property and the last inequality uses Lemma 5.5.5: on the set { S_{n_k} ≥ n − n_k + 1 }, n ∧ (n_k + S_{n_k}) = n and F[n] ≥ 0. By the definition of G[n_k + 1],

G[n_k + 1] = e^{λ_{n_k+1} F[n_k+1]} / ρ_{n_k+1} ≤ e^{λ_{n_k+1}( 1/η + 4√L/(ηrV) ) log²(n_k+2)} / ρ_{n_k+1} ≤ e,

where the first inequality follows from the definition of F[n], and the second from the assumption on V, which guarantees λ_{n_k+1}( 1/η + 4√L/(ηrV) ) log²(n_k + 2) ≤ 1/2 and ρ_{n_k+1} ≥ e^{−1/2}. Thus, it follows that

Pr( S_{n_k} ≥ n − n_k + 1, θ̃[n_k + 1] ≥ θ* + ε/V | H_{n_k+1} ) ≤ ( Π_{i=n_k+1}^n ρ_i ) · e.

Now, we bound the fourth moment of the hitting time:

E( S_{n_k}⁴ | H_{n_k+1} ) = Σ_{m=1}^∞ m⁴ Pr( S_{n_k} = m | H_{n_k+1} )
≤ Σ_{m=1}^∞ ( (m+1)⁴ − m⁴ ) Pr( S_{n_k} ≥ m + 1, θ̃[n_k + 1] ≥ θ* + ε/V | H_{n_k+1} ) + 1
≤ Σ_{m=1}^∞ 4(m+1)³ Pr( S_{n_k} ≥ m + 1, θ̃[n_k + 1] ≥ θ* + ε/V | H_{n_k+1} ) + 1
≤ 4e Σ_{m=1}^∞ (m+1)³ Π_{i=n_k+1}^{n_k+m} ρ_i + 1.

It can be shown that there exists a constant C on the order of O( V⁴β⁻⁴ε⁻⁴ ) such that

Σ_{m=1}^∞ (m+1)³ Π_{i=n_k+1}^{n_k+m} ρ_i ≤ C (n_k + 2)^β,

which is given in Appendix 5.8. This implies there exists a C_{β,V,ε} so that

E( S_{n_k}⁴ | H_{n_k+1} ) ≤ C_{β,V,ε} (n_k + 2)^β.

Thus,

E( S_{n_k}⁴ | H_{n_k} ) = E( E( S_{n_k}⁴ | H_{n_k+1} ) | H_{n_k} ) ≤ E( C_{β,V,ε} (n_k + 2)^β | H_{n_k} ) = C_{β,V,ε} (n_k + 2)^β,

where the last equality follows from the fact that n_k is H_{n_k}-measurable. This finishes the proof.

5.5.2 Convergence of θ[n]

So far, we have proved that for any ε > 0 such that θ* + ε/V < θmax, the inter-visiting time has a bounded conditional fourth moment. We aim to show that lim sup_{n→∞} θ̂[n] ≤ θ* with probability 1. By Lemma 5.5.3, it is enough to show lim sup_{n→∞} θ̃[n] ≤ θ*. To do so, we need the following second Borel-Cantelli lemma:

Lemma 5.5.8 (Theorem 5.3.2 of [Dur13]).
Let F_k, k ≥ 0, be a filtration with F_0 = {∅, Ω}, and let A_k, k ≥ 1, be a sequence of events with A_k ∈ F_{k+1}. Then,

{ A_k occurs infinitely often } = { Σ_{k=1}^∞ Pr( A_k | F_k ) = ∞ }.

Theorem 5.5.2 (Asymptotic upper bound). For any δ ∈ (1/2, 1) and V ≥ 1, the following hold:

lim sup_{n→∞} θ̂[n] ≤ θ*, w.p.1, and lim sup_{n→∞} θ[n] ≤ θ*, w.p.1.

Proof of Theorem 5.5.2.
First, since the inter-hitting time S_{n_k} has a finite fourth moment, each inter-hitting time is finite with probability 1, and thus the process { θ̃[n] }_{n=0}^∞ visits (−∞, θ* + ε/V) infinitely many times with probability 1. Then, we pick any ϵ > 0 and define the events

A_k ≜ { S_{n_k} / n_k^{1/2} > ϵ }, k = 1, 2, · · · .    (5.15)

For any fixed k, by the conditional Markov inequality, the following holds with probability 1:

Pr( A_k | H_{n_k} ) = Pr( S_{n_k} > ϵ n_k^{1/2} | H_{n_k} ) ≤ E( S_{n_k}⁴ | H_{n_k} ) / ( ϵ⁴ n_k² )
≤ C_{β,V,ε} (n_k + 2)^β / ( ϵ⁴ n_k² )
≤ ( C_{β,V,ε}/ϵ⁴ ) n_k^{β−2} + ( C_{β,V,ε} 2^β/ϵ⁴ ) n_k^{−2}
≤ ( C_{β,V,ε}/ϵ⁴ ) k^{β−2} + ( C_{β,V,ε} 2^β/ϵ⁴ ) k^{−2},

where the second inequality follows from Lemma 5.5.7 with β ∈ (0, 1/2), the third from the fact that (a + b)^x ≤ a^x + b^x for all a, b ≥ 0 and x ∈ (0, 1), and the last from n_k ≥ k.

Choose F_k = H_{n_k} and A_k as defined in (5.15). Then, for any β ∈ (0, 1/2),

Σ_{k=1}^∞ Pr( A_k | H_{n_k} ) ≤ Σ_{k=1}^∞ ( ( C_{β,V,ε}/ϵ⁴ ) k^{β−2} + ( C_{β,V,ε} 2^β/ϵ⁴ ) k^{−2} ) < ∞.

Now by Lemma 5.5.8,

Pr( A_k occurs infinitely often ) = 0.

Since the process { θ̃[n] }_{n=0}^∞ visits (−∞, θ* + ε/V) infinitely many times with probability 1,

lim sup_{k→∞} S_{n_k} / n_k^{1/2} ≤ ϵ, w.p.1.

Since ϵ > 0 is arbitrary, letting ϵ → 0 gives

lim_{k→∞} S_{n_k} / n_k^{1/2} = 0, w.p.1.    (5.16)

Finally, we show how this convergence result leads to the bound on θ̃[n]. According to the updating rule of θ̃[n], for any frame n such that n_k < n ≤ n_{k+1},

θ̃[n] = (n_k/n)^δ θ̃[n_k] + (1/n^δ) Σ_{i=n_k}^{n−1} ( ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) ∧ ( 1/η + 4√L/(ηrV) ) log²(i + 1) )
≤ (n_k/n)^δ ( θ* + ε/V ) + (1/n^δ) Σ_{i=n_k}^{n−1} ( 1/η + 4√L/(ηrV) ) log²(i + 1)
≤ (n_k/n)^δ ( θ* + ε/V ) + (1/n^δ) S_{n_k} ( 1/η + 4√L/(ηrV) ) log² n,

where the first inequality follows from θ̃[n_k] < θ* + ε/V. Now, we take lim sup_{n→∞} on both sides and analyze each term on the right hand side:

1 ≥ lim sup_{n→∞} (n_k/n)^δ ≥ lim sup_{k→∞} ( n_k / (n_k + S_{n_k}) )^δ = lim sup_{k→∞} ( 1 / (1 + S_{n_k}/n_k) )^δ = 1, w.p.1,

lim sup_{n→∞} ( S_{n_k}/n^δ )( 1/η + 4√L/(ηrV) ) log² n ≤ lim sup_{k→∞} ( S_{n_k}/n_k^{1/2} ) · lim sup_{n→∞} ( 1/η + 4√L/(ηrV) ) log² n / n^{δ−1/2} = 0, w.p.1,

where we apply the convergence result (5.16) in the second line. Thus,

lim sup_{n→∞} θ̃[n] ≤ θ* + ε/V, w.p.1.

By Lemma 5.5.3 we have lim sup_{n→∞} θ̂[n] ≤ θ* + ε/V. Finally, by Lemma 5.5.1 and the fact that θ* + ε/V ∈ (0, θmax), we have lim sup_{n→∞} θ[n] ≤ θ* + ε/V. Since this holds for any ε > 0, letting ε → 0 finishes the proof.

With the help of the previous analysis on θ[n], we are ready to prove our main theorem, using the following lemma on the strong law of large numbers for martingale difference sequences:

Lemma 5.5.9 (Corollary 4.2 of [Nee12c]). Let {F_i}_{i=0}^∞ be a filtration and let {X(i)}_{i=0}^∞ be a real-valued random process such that X(i) ∈ F_{i+1}, ∀i.
Suppose there is a finite constant C such that E( X(i) | F_i ) ≤ C, ∀i, and

Σ_{i=1}^∞ E( X(i)² ) / i² < ∞.

Then,

lim sup_{n→∞} (1/n) Σ_{i=0}^{n−1} X(i) ≤ C, w.p.1.

Proof of Theorem 5.5.1.
Recall that, for any n, the pseudo average without the ceiling-and-floor truncation is

θ̂[n] = (1/n^δ) Σ_{i=0}^{n−1} ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ).

Dividing both sides by Σ_{i=0}^{n−1} T[i] / n^δ yields

θ̂[n] n^δ / Σ_{i=0}^{n−1} T[i]
= Σ_{i=0}^{n−1} ( y[i] − θ[i]T[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) / Σ_{i=0}^{n−1} T[i]
= Σ_{i=0}^{n−1} ( y[i] + (1/V) Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) ) / Σ_{i=0}^{n−1} T[i] − Σ_{i=0}^{n−1} θ[i]T[i] / Σ_{i=0}^{n−1} T[i].

Moving the last term to the left hand side and taking lim sup_{n→∞} on both sides gives

lim sup_{n→∞} ( θ̂[n] n^δ / Σ_{i=0}^{n−1} T[i] + Σ_{i=0}^{n−1} θ[i]T[i] / Σ_{i=0}^{n−1} T[i] )
≥ lim sup_{n→∞} ( Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] + (1/V) Σ_{i=0}^{n−1} Σ_{l=1}^L Q_l[i](z_l[i] − c_l T[i]) / Σ_{i=0}^{n−1} T[i] )
≥ lim sup_{n→∞} ( Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] + ( (1/2)‖Q[n]‖² − (1/2) Σ_{i=0}^{n−1} Σ_{l=1}^L (z_l[i] − c_l T[i])² ) / ( V Σ_{i=0}^{n−1} T[i] ) )
≥ lim sup_{n→∞} Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] − (1/(2V)) lim sup_{n→∞} (1/n) Σ_{i=0}^{n−1} K[i]²,

where the second inequality follows from inequality (5.7) and a telescoping sum, and the last inequality follows from T[n] ≥ 1, ‖Q[n]‖² ≥ 0 and K[i] = ( Σ_{l=1}^L (z_l[i] − c_l T[i])² )^{1/2}. Now we use Lemma 5.5.9 with X(i) = K[i]² to bound the last term. Since K[i] is of exponential type by Assumption 5.2.1, we know that E( K[i]² | H_i ) ≤ 2B/η². Furthermore, E( K[i]⁴ ) ≤ 24B/η⁴, and thus

Σ_{i=1}^∞ E( K[i]⁴ ) / i² < ∞.

Thus, all assumptions of Lemma 5.5.9 are satisfied and we conclude that

lim sup_{n→∞} (1/n) Σ_{i=0}^{n−1} K[i]² ≤ 2B/η², w.p.1.

Hence,

lim sup_{n→∞} ( θ̂[n] n^δ / Σ_{i=0}^{n−1} T[i] + Σ_{i=0}^{n−1} θ[i]T[i] / Σ_{i=0}^{n−1} T[i] ) ≥ lim sup_{n→∞} Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] − B/(η²V).
By Theorem 5.5.2, θ̂[n] is asymptotically bounded above. Since δ < 1 and T[n] ≥ 1, it follows that Σ_{i=0}^{n−1} T[i] / n^δ ≥ n^{1−δ}, which goes to infinity as n → ∞. Thus,

lim sup_{n→∞} θ̂[n] n^δ / Σ_{i=0}^{n−1} T[i] ≤ 0,

and therefore

lim sup_{n→∞} Σ_{i=0}^{n−1} θ[i]T[i] / Σ_{i=0}^{n−1} T[i] ≥ lim sup_{n→∞} Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] − B/(η²V).
By Theorem 5.5.2 again, θ[n] is asymptotically upper bounded by θ*. Based on this result, it is straightforward to show that

lim sup_{n→∞} Σ_{i=0}^{n−1} θ[i]T[i] / Σ_{i=0}^{n−1} T[i] ≤ θ*.

Thus, we finally get

lim sup_{n→∞} Σ_{i=0}^{n−1} y[i] / Σ_{i=0}^{n−1} T[i] ≤ θ* + B/(η²V),

finishing the proof.
In this section, we demonstrate the performance of the proposed algorithm through an application to single-user file downloading. We show that this problem can be formulated as a two-state constrained online MDP and solved using the proposed algorithm.

Consider a slotted time system with t ∈ {0, 1, 2, · · · }, where one user repeatedly downloads files. We use F(t) ∈ {0, 1} to denote the system file state at time slot t. State "1" indicates that there is an active file in the system for downloading, and state "0" means there is no file and the system is idle. Suppose the user can only download one file at a time, and the user cannot observe the file length. Each file contains an integer number of packets that is independent and geometrically distributed with expected length 1.

During each time slot in which there is an active file for downloading (i.e. F(t) = 1), the user first observes the channel state ω(t), an i.i.d. random variable taking three values in Ω with equal probabilities, and the delay penalty s(t), also an i.i.d. random variable taking three values with equal probabilities. Then, the user takes a service action α(t) ∈ A = {0, 0.3, 0.6, 0.9}. The pair (ω(t), α(t)) affects the following quantities:

• The success probability of downloading a file at time t: φ(α(t), ω(t)) ≜ α(t) · ω(t).
• The resource consumption p(α(t)) at time t. We assume p(0) = 0, p(0.3) = 1, p(0.6) = 2 and p(0.9) = 4.

After a file is downloaded, the system goes idle (i.e. F(t) = 0) and stays there for a random amount of time that is independent and geometrically distributed with mean 2. The goal is to minimize the time average delay penalty subject to the resource constraint that the time average resource consumption cannot exceed 1.

In [WN15], a similar optimization problem is considered, but without the random events ω(t) and s(t); it can be formulated as a two-state constrained MDP. Here, using the same logic, we can formulate our optimization problem as a two-state constrained online MDP. Given F(t) = 1, the file finishes its download at the end of the slot with probability φ(α(t), ω(t)). Thus, the transition probabilities out of state 1 are:

Pr( F(t+1) = 0 | F(t) = 1 ) = φ(α(t), ω(t)),
Pr( F(t+1) = 1 | F(t) = 1 ) = 1 − φ(α(t), ω(t)).

On the other hand, given F(t) = 0, the system is idle and transitions to the active state in the next slot with probability λ:

Pr( F(t+1) = 1 | F(t) = 0 ) = λ,
Pr( F(t+1) = 0 | F(t) = 0 ) = 1 − λ.

Now, we characterize this online MDP through renewal frames and show that it can be solved using the proposed algorithm of Section 5.2. First, notice that state "1" is recurrent under any action α(t). We denote by t_n the n-th time slot at which the system returns to state "1". Define the renewal frame as the time period between t_n and t_{n+1}, with frame size T[n] = t_{n+1} − t_n. Furthermore, since the system has no control options in state "0", the controller makes exactly one decision during each frame, at the beginning of the frame. Thus, we can write the optimization problem as follows:

min lim sup_{N→∞} Σ_{n=0}^{N−1} α(t_n) s(t_n) / Σ_{n=0}^{N−1} T[n]
s.t. lim sup_{N→∞} Σ_{n=0}^{N−1} p(α(t_n)) / Σ_{n=0}^{N−1} T[n] ≤ 1,
α(t_n) ∈ A, ∀n.
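The per-frame decision for this example can be sketched as follows. Since the idle period is geometric with mean 2 (so λ = 1/2), a frame that ends with a successful download lasts 1 + 2 = 3 slots in expectation, giving E(T[n]) = 1 + 2α(t_n)ω(t_n). The concrete ω and s values in the usage below are illustrative placeholders, not values from the chapter:

```python
ACTIONS = [0.0, 0.3, 0.6, 0.9]
RESOURCE = {0.0: 0, 0.3: 1, 0.6: 2, 0.9: 4}  # p(alpha) from the chapter

def expected_frame_length(alpha, omega):
    """E(T[n]) = (1 - phi) * 1 + phi * (1 + 2), with phi = alpha * omega."""
    return 1.0 + 2.0 * alpha * omega

def choose_alpha(Q, theta, omega, s, V):
    """Minimize this example's instance of (5.6):
    V * (alpha * s - theta * E(T)) + Q * (p(alpha) - 1 * E(T)),
    where the constraint bound is c = 1."""
    def score(a):
        ET = expected_frame_length(a, omega)
        return V * (a * s - theta * ET) + Q * (RESOURCE[a] - ET)
    return min(ACTIONS, key=score)
```

When θ and Q are both near zero, the score is dominated by the delay term α·s, so the idle action α = 0 wins; as θ grows, larger service rates become attractive, mirroring the trade-off that the algorithm navigates online.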
Subsequently, in order to apply our algorithm, we define the virtual queue Q[n] with Q[0] = 0 and updating rule

  Q[n+1] = max{ Q[n] + p(α(t_n)) − T[n], 0 }.

Notice that for any particular action α(t_n) ∈ A and random event ω(t_n) ∈ Ω, we can always compute E(T[n]):

  E(T[n]) = 1 − φ(α(t_n), ω(t_n)) + φ(α(t_n), ω(t_n))·(1 + 1/λ) = 1 + 2 α(t_n) ω(t_n),

where the second equality follows by substituting λ = 1/2 and φ(α(t_n), ω(t_n)) = α(t_n) ω(t_n). Thus, for each α(t_n) ∈ A, the expression (5.6) can be computed.

In the simulations, each data point is the time average over 2 million slots. We compare the performance of the proposed algorithm with the optimal randomized policy. The optimal policy is computed by formulating the MDP as a linear program using knowledge of the distributions of ω(t) and s(t); see [Fox66b] for details of this linear program formulation.

In Fig. 5.2, we plot the performance of our algorithm versus the parameter V for different values of δ. We see from the plots that as V gets larger, the time average approaches the optimal value, achieving near-optimal performance for δ in an intermediate range. The effect of the δ value is shown in Fig. 5.3, where we fix V = 300 and plot the performance of the algorithm versus δ. It is clear from the plots that the algorithm fails whenever δ is too small (δ < 0.3) or too large. In Fig. 5.4, we plot the time-average resource consumption versus V. We see from the plots that the algorithm is always feasible for different V's and δ's, which meets the statement of Theorem 5.4.1. Also, as V gets larger, the constraint gap tends to be smaller. In Fig. 5.5, we plot the average virtual queue size versus V. It shows that the average queue size gets larger as V gets larger. To see the implication, recall from the proof of Theorem 5.4.1 that inequality (5.10) implies the virtual queue size Q_l[N] affects the rate at which the algorithm converges to the feasible region. Thus, if the average virtual queue size is large, it takes longer for the algorithm to converge. This demonstrates that V is indeed a trade-off parameter, trading the sub-optimality gap for the convergence rate.

Figure 5.3: Time average penalty versus δ parameter with fixed V = 300.
Figure 5.4: Time average resource consumption versus tradeoff parameter V.
Figure 5.5: Time average virtual queue size versus tradeoff parameter V.

Proof of Lemma 5.4.2.
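As a sketch of how the per-frame decision and the virtual-queue update could be implemented for the file-downloading example: the action set, resource costs, and E(T[n]) = 1 + 2αω below come from this section, while the drift-plus-penalty form of the per-frame minimization and the running ratio estimate `theta` are assumptions modeled on the algorithm of Section 5.2, not a verbatim transcription.

```python
A = [0.0, 0.3, 0.6, 0.9]                    # service actions
P_COST = {0.0: 0, 0.3: 1, 0.6: 2, 0.9: 4}   # resource cost p(alpha)

def expected_T(alpha, omega):
    """E(T[n]) = 1 + 2*alpha*omega, using lambda = 1/2 and phi = alpha*omega."""
    return 1.0 + 2.0 * alpha * omega

def choose_action(Q, V, theta, omega, s):
    """Greedy per-frame choice: minimize Q*(p(a) - E[T]) + V*(a*s - theta*E[T]).

    This drift-plus-penalty form is a sketch of the Section 5.2 minimization;
    `theta` plays the role of the algorithm's running ratio estimate.
    """
    def cost(alpha):
        ET = expected_T(alpha, omega)
        return Q * (P_COST[alpha] - ET) + V * (alpha * s - theta * ET)
    return min(A, key=cost)

def queue_update(Q, alpha, T):
    """Virtual queue update: Q[n+1] = max{Q[n] + p(alpha(t_n)) - T[n], 0}."""
    return max(Q + P_COST[alpha] - T, 0.0)
```

A large queue backlog pushes the choice toward the zero-cost action, while a small backlog lets the V-weighted penalty term dominate, mirroring the trade-off discussed above.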
We begin by bounding the difference |‖Q[n+1]‖ − ‖Q[n]‖| for any n:

  |‖Q[n+1]‖ − ‖Q[n]‖| ≤ ‖Q[n+1] − Q[n]‖
   = √( Σ_{l=1}^L ( max{Q_l[n] + z_l[n] − c_l T[n], 0} − Q_l[n] )² )
   ≤ √( Σ_{l=1}^L ( z_l[n] − c_l T[n] )² ) = K[n],

where the first inequality follows from the triangle inequality and the last inequality follows from the fact that |max{a + b, 0} − a| ≤ |b| for any a ≥ 0 and b ∈ R. Thus, it follows that

  |E( ‖Q[n+1]‖ − ‖Q[n]‖ | H_n )| ≤ E( K[n] | H_n ) ≤ B/η,
  E( e^{r(‖Q[n+1]‖ − ‖Q[n]‖)} | H_n ) ≤ E( e^{r K[n]} | H_n ) ≤ E( e^{η K[n]} | H_n ) ≤ B ≜ Γ,

where the second-to-last inequality follows by substituting the definition r = min{η, ξη²/(8B)} ≤ η, and the last inequality follows from Assumption 5.2.1.

Next, suppose ‖Q[n]‖ > σ ≜ C·V. Since the proposed algorithm minimizes the term on the right-hand side of (5.8) over all possible decisions at frame n, it must achieve a smaller value on that term than the ξ-slackness policy α^{(ξ)}[n] specified in Assumption 5.2.5. Formally,

  E( Σ_{l=1}^L Q_l[n]( z_l[n] − c_l T[n] ) + V( y[n] − θ[n]T[n] ) | H_n, ω[n] )
   ≤ E( Σ_{l=1}^L Q_l[n]( z_l^{(ξ)}[n] − c_l T^{(ξ)}[n] ) + V( y^{(ξ)}[n] − θ[n]T^{(ξ)}[n] ) | H_n, ω[n] ),

where we used the fact that θ[n] and Q[n] are in H_n. Substituting this bound into the right-hand side of (5.8) and taking expectations of both sides with respect to ω[n] gives

  E( Δ[n] + V( y[n] − θ[n]T[n] ) | H_n )
   ≤ E( Σ_{l=1}^L Q_l[n]( z_l^{(ξ)}[n] − c_l T^{(ξ)}[n] ) + V( y^{(ξ)}[n] − θ[n]T^{(ξ)}[n] ) | H_n ) + B²/η².

Since Δ[n] = ( ‖Q[n+1]‖² − ‖Q[n]‖² )/2, this implies

  E( ‖Q[n+1]‖² − ‖Q[n]‖² | H_n )
   ≤ 2B²/η² + 2 E( Σ_{l=1}^L Q_l[n]( z_l^{(ξ)}[n] − c_l T^{(ξ)}[n] ) + V( y^{(ξ)}[n] − θ[n]T^{(ξ)}[n] ) − V( y[n] − θ[n]T[n] ) | H_n )
   ≤ 2B²/η² + 2 Σ_{l=1}^L Q_l[n] E( z_l^{(ξ)}[n] − c_l T^{(ξ)}[n] | H_n ) + 4V( B + θ_max B/η )
   ≤ 2B²/η² + 4V( B + θ_max B/η ) − 2ξ Σ_{l=1}^L Q_l[n]
   ≤ 2B²/η² + 4V( B + θ_max B/η ) − 2ξ ‖Q[n]‖,

where the second inequality follows from the boundedness of y[n] and of E(T[n] | H_n) as well as the fact that 0 < θ[n] < θ_max, the third inequality follows from the ξ-slackness property together with the assumption that z_l^{(ξ)}[n] is i.i.d. over frames and hence independent of Q_l[n], and the last inequality uses Σ_{l=1}^L Q_l[n] ≥ ‖Q[n]‖. This further implies

  E( ‖Q[n+1]‖² | H_n ) ≤ ‖Q[n]‖² − 2ξ‖Q[n]‖ + 2B²/η² + 4V( B + θ_max B/η )
   = ‖Q[n]‖² − ξ‖Q[n]‖ + ξ²/4 − ( ξ‖Q[n]‖ − 2B²/η² − 4V( B + θ_max B/η ) − ξ²/4 )
   ≤ ‖Q[n]‖² − ξ‖Q[n]‖ + ξ²/4 = ( ‖Q[n]‖ − ξ/2 )²,

where the last inequality uses the assumption ‖Q[n]‖ ≥ C·V with

  C = 2B²/(ξη²V) + 4( B + θ_max B/η )/ξ + ξ/(4V),

so that ξ‖Q[n]‖ ≥ 2B²/η² + 4V( B + θ_max B/η ) + ξ²/4. Taking the square root of both sides gives

  √( E( ‖Q[n+1]‖² | H_n ) ) ≤ ‖Q[n]‖ − ξ/2.

By concavity of the function √x, we have E( ‖Q[n+1]‖ | H_n ) ≤ √( E( ‖Q[n+1]‖² | H_n ) ); thus,

  E( ‖Q[n+1]‖ | H_n ) ≤ ‖Q[n]‖ − ξ/2.   (5.17)

Finally, we claim that under the condition ‖Q[n]‖ > σ ≜ C·V, this gives

  E( e^{r(‖Q[n+1]‖ − ‖Q[n]‖)} | H_n ) ≤ ρ ≜ 1 − rξ/2 + (2B/η²) r² < 1.   (5.18)

To see this, we expand E( e^{r(‖Q[n+1]‖ − ‖Q[n]‖)} | H_n ) using a Taylor series as follows:

  E( e^{r(‖Q[n+1]‖ − ‖Q[n]‖)} | H_n )
   = 1 + r E( ‖Q[n+1]‖ − ‖Q[n]‖ | H_n ) + r² Σ_{k=2}^∞ ( r^{k−2}/k! ) E( ( ‖Q[n+1]‖ − ‖Q[n]‖ )^k | H_n )
   ≤ 1 − rξ/2 + r² Σ_{k=2}^∞ ( η^{k−2}/k! ) E( ( ‖Q[n+1]‖ − ‖Q[n]‖ )^k | H_n )
   = 1 − rξ/2 + ( r²/η² ) ( E( e^{η(‖Q[n+1]‖ − ‖Q[n]‖)} | H_n ) − 1 − η E( ‖Q[n+1]‖ − ‖Q[n]‖ | H_n ) )
   ≤ 1 − rξ/2 + ( r²/η² ) ( B + η·(B/η) )
   ≤ 1 − rξ/2 + ( 2B/η² ) r² = ρ,

where the first inequality follows from (5.17), the second inequality follows from r ≤ η, and the second-to-last inequality follows from Proposition 1 together with the two moment bounds established above. Finally, notice that the above quadratic function of r attains its minimum at the point r = ξη²/(8B) with value 1 − ξ²η²/(32B) < 1, and this function is strictly decreasing for r ∈ (0, ξη²/(8B)). Thus, our choice of r = min{η, ξη²/(8B)} ≤ ξη²/(8B) ensures that ρ is strictly less than 1, and the proof is finished.

Proof of Lemma 5.5.1. If θ[n] = y for some y ∈ [0, θ_max], then θ̂[n] falls into one of the following three cases:

• θ̂[n] = y;
• y = θ_max and θ̂[n] > θ_max;
• y = 0 and θ̂[n] < 0.

1) If θ[n] = y ≥ x for some y, then the first two cases immediately imply θ̂[n] ≥ x. If y = 0, then we have x ≤ 0, which violates the assumption that x ∈ (0, θ_max). Thus, the third case is ruled out. On the other hand, if θ̂[n] ≥ x, then obviously θ[n] ≥ x.

2) If θ[n] = y ≤ x for some y, then the last two cases immediately imply θ̂[n] ≤ x. If y = θ_max, then we have x ≥ θ_max, which violates the assumption that x ∈ (0, θ_max). Thus, the first case is ruled out. On the other hand, if θ̂[n] ≤ x, then obviously θ[n] ≤ x.

3) If limsup_{n→∞} θ[n] ≤ x, then for any ε > 0 with x + ε < θ_max, there exists N large enough so that θ[n] ≤ x + ε for all n ≥ N. Then, by property 2), θ̂[n] ≤ x + ε for all n ≥ N, which implies limsup_{n→∞} θ̂[n] ≤ x + ε. Letting ε → 0 gives limsup_{n→∞} θ̂[n] ≤ x. On the other hand, if limsup_{n→∞} θ̂[n] ≤ x, then obviously limsup_{n→∞} θ[n] ≤ x.

4) If liminf_{n→∞} θ[n] ≥ x, then for any ε > 0 with x − ε > 0, there exists N large enough so that θ[n] ≥ x − ε for all n ≥ N. Then, by property 1), θ̂[n] ≥ x − ε for all n ≥ N, which implies liminf_{n→∞} θ̂[n] ≥ x − ε. Letting ε → 0 gives liminf_{n→∞} θ̂[n] ≥ x. On the other hand, if liminf_{n→∞} θ̂[n] ≥ x, then obviously liminf_{n→∞} θ[n] ≥ x.

Proof of Lemma 5.5.6.
The proof is divided into two parts. The first part contains some technical preliminaries showing that G[n] is measurable with respect to H_n for all n ≥ n_k + 1, and the second part contains the computations proving the supermartingale claim.

• Technical preliminaries: First of all, for any fixed k, since n_k is a random variable on the integers, we need to justify that {H_n}_{n ≥ n_k+1} is indeed a filtration. First, it is obvious that n_k is a valid stopping time, i.e. {n_k ≤ t} ∈ H_t for all t ∈ N. Then, any n = n_k + s with constant s ∈ N₊ is also a valid stopping time, because

  {n ≤ t} = {n_k ≤ t − s} ∈ H_{(t−s)∨0} ⊆ H_t, ∀ t ∈ N,

where a∨b ≜ max{a, b}. Thus, by the definition of the stopping-time σ-algebra from [Dur13], we know that for any n ≥ n_k + 1, H_n can be written as the collection of all sets A that satisfy A ∩ {n ≤ t} ∈ H_t for all t ∈ N. (An intuitive interpretation is that when n ≤ t, the set A is contained in the information known up to time t.) Now, pick constants 1 ≤ s₁ ≤ s₂. If a set A ∈ H_{n_k+s₁}, then

  A ∩ {n_k + s₂ ≤ t} = A ∩ {n_k + s₁ ≤ t − (s₂ − s₁)} ∈ H_{(t−(s₂−s₁))∨0} ⊆ H_t.

Thus, H_{n_k+s₁} ⊆ H_{n_k+s₂} and {H_n}_{n ≥ n_k+1} is indeed a filtration.

Since θ̃[n_k + 1] is determined by the realization up to frame n_k, it follows that, for any t ∈ N₊,

  { θ̃[n_k+1] ≥ θ* + ε₀/V } ∩ { n_k + 1 ≤ t } = ∪_{s=1}^t { θ̃[s] ≥ θ* + ε₀/V } ∈ H_t,

which implies { θ̃[n_k+1] ≥ θ* + ε₀/V } ∈ H_{n_k+1}. Since {H_n}_{n ≥ n_k+1} is a filtration, it follows that { θ̃[n_k+1] ≥ θ* + ε₀/V } ∈ H_n for any n ≥ n_k + 1. By the same methodology, we can show that { θ̃[n] < θ* + ε₀/V } ∈ H_n for all n ≥ n_k + 1, which in turn implies { S_{n_k} + n_k ≤ n } ∈ H_n and { S_{n_k} ≥ n − n_k + 1 } ∈ H_n. Overall, the function G[n] is measurable with respect to H_n for all n ≥ n_k + 1.

• Proof of the supermartingale claim: It is obvious that |G[n]| < ∞; thus, in order to prove that G[n] is a supermartingale, it is enough to show that

  E( G[n+1] − G[n] | H_n ) ≤ 0, ∀ n ≥ n_k + 1.   (5.19)

First, on the set { S_{n_k} + n_k ≤ n }, we have

  E( ( G[n+1] − G[n] ) · 1{S_{n_k}+n_k ≤ n} | H_n ) = E( ( G[n] − G[n] ) · 1{S_{n_k}+n_k ≤ n} | H_n ) = 0,

since (n+1) ∧ (n_k + S_{n_k}) = n ∧ (n_k + S_{n_k}) on that set. It remains to consider the set { S_{n_k} ≥ n − n_k + 1 }, on which (n+1) ∧ (n_k + S_{n_k}) = n + 1. Since

  E( G[n+1] · 1{S_{n_k} ≥ n−n_k+1} | H_n )
   = E( e^{λ_{n+1} F[(n+1)∧(n_k+S_{n_k})]} / Π_{i=n_k+1}^{(n+1)∧(n_k+S_{n_k})} ρ_i | H_n ) · 1{θ̃[n_k+1] ≥ θ*+ε₀/V} · 1{S_{n_k} ≥ n−n_k+1}
   = E( e^{λ_{n+1} F[n+1]} / Π_{i=n_k+1}^{n+1} ρ_i | H_n ) · 1{θ̃[n_k+1] ≥ θ*+ε₀/V} · 1{S_{n_k} ≥ n−n_k+1}
   = ( e^{λ_{n+1} F[n]} / Π_{i=n_k+1}^{n} ρ_i ) · E( e^{λ_{n+1}(F[n+1]−F[n])} / ρ_{n+1} | H_n ) · 1{θ̃[n_k+1] ≥ θ*+ε₀/V} · 1{S_{n_k} ≥ n−n_k+1}
   ≤ ( e^{λ_n F[n]} / Π_{i=n_k+1}^{n} ρ_i ) · E( e^{λ_{n+1}(F[n+1]−F[n])} / ρ_{n+1} | H_n ) · 1{θ̃[n_k+1] ≥ θ*+ε₀/V} · 1{S_{n_k} ≥ n−n_k+1}
   = G[n] · E( e^{λ_{n+1}(F[n+1]−F[n])} / ρ_{n+1} | H_n ) · 1{S_{n_k} ≥ n−n_k+1},

where 1{θ̃[n_k+1] ≥ θ*+ε₀/V} and 1{S_{n_k} ≥ n−n_k+1} can be moved out of the expectation because { θ̃[n_k+1] ≥ θ*+ε₀/V } ∈ H_n and { S_{n_k} ≥ n−n_k+1 } ∈ H_n, and the only inequality follows from the following argument: on the set { S_{n_k} ≥ n−n_k+1 }, we have θ̃[n] ≥ θ* + ε₀/V; thus, by Lemma 5.5.5, F[n] ≥ 0, and since λ_n > λ_{n+1}, we have λ_{n+1}F[n] ≤ λ_n F[n]. Thus, it is sufficient to show that on the set { S_{n_k} ≥ n−n_k+1 } ∩ { θ̃[n_k+1] ≥ θ*+ε₀/V } we have

  E( e^{λ_{n+1}(F[n+1]−F[n])} / ρ_{n+1} | H_n ) ≤ 1.
By Taylor expansion, we have

  E( e^{λ_{n+1}(F[n+1]−F[n])} | H_n )
   = 1 + λ_{n+1} E( F[n+1] − F[n] | H_n ) + Σ_{k=2}^∞ ( λ_{n+1}^k / k! ) E( (F[n+1]−F[n])^k | H_n )
   = 1 + λ_{n+1} E( F[n+1] − F[n] | H_n ) + λ_{n+1}² Σ_{k=2}^∞ ( λ_{n+1}^{k−2} / k! ) E( (F[n+1]−F[n])^k | H_n )
   ≤ 1 − λ_{n+1} ε₀/V + λ_{n+1}² Σ_{k=2}^∞ ( λ_{n+1}^{k−2} / k! ) E( (F[n+1]−F[n])^k | H_n ),

where the last inequality comes from the following argument: on the set { S_{n_k} ≥ n−n_k+1 }, θ̃[n_k+1] ≥ θ* + ε₀/V; thus, by the definition of θ̃[n], we have θ̂[n] ≥ θ̃[n] ≥ θ* + ε₀/V, and Lemma 5.5.1 gives θ[n] ≥ θ* + ε₀/V; then, by Lemma 5.5.4, we have

  E( F[n+1] − F[n] | H_n ) ≤ −ε₀/V.

Now, by the assumption on V, we have λ_{n+1} ≤ 1/( 4(η + √L·ηr/V) log(n+1) ), which follows from simple algebraic manipulations. Using the fact that |F[n+1] − F[n]| ≤ 4(η + √L·ηr/V) log(n+1), we have

  E( e^{λ_{n+1}(F[n+1]−F[n])} | H_n )
   ≤ 1 − λ_{n+1} ε₀/V + λ_{n+1}² Σ_{k=2}^∞ (1/k!) ( 4(η + √L·ηr/V) log(n+1) )²
   ≤ 1 − λ_{n+1} ε₀/V + λ_{n+1}² e ( 4(η + √L·ηr/V) log(n+1) )² = ρ_{n+1},

where the middle inequality uses λ_{n+1} · 4(η + √L·ηr/V) log(n+1) ≤ 1 to bound the k-th term, and the final inequality follows by completing the third term back to the Taylor series of e. Overall, the inequality (5.19) holds and G[n] is a supermartingale.

In this appendix, we show that there exists a constant C′ such that

  Σ_{m=1}^∞ (m+1) Π_{i=n_k+1}^{n_k+m} ρ_i ≤ C′ (n_k+2)^{4β}.

We first bound ρ_i. Let C ≜ ε₀ β / ( V e ( 4(η + √L·ηr/V) )² ). Then

  ρ_i = 1 − ε₀ / ( V e ( 4(η + √L·ηr/V) )² log(i+1) ) = 1 − C / ( β log(i+1) ) < 1 − C/(i+1)^β,

where we used the fact that β log(i+1) < (i+1)^β for all β > 0 and i ≥
0. Next, to bound Π_{i=n_k+1}^{n_k+m} ρ_i, we take the logarithm:

  log( Π_{i=n_k+1}^{n_k+m} ρ_i ) = Σ_{i=n_k+1}^{n_k+m} log ρ_i = Σ_{i=n_k+1}^{n_k+m} log( 1 − C/(i+1)^β )
   ≤ − Σ_{i=n_k+1}^{n_k+m} C/(i+1)^β ≤ −C ∫_{n_k+2}^{n_k+m+1} x^{−β} dx,

where the first inequality follows from the first-order Taylor expansion log(1 − x) ≤ −x. Since β < 1, we can compute the integral, which gives

  −C ∫_{n_k+2}^{n_k+m+1} x^{−β} dx = −( C/(1−β) ) ( (n_k+m+1)^{1−β} − (n_k+2)^{1−β} ).

Thus,

  Σ_{m=1}^∞ (m+1) Π_{i=n_k+1}^{n_k+m} ρ_i
   ≤ Σ_{m=1}^∞ (m+1) exp( −( C/(1−β) ) ( (n_k+m+1)^{1−β} − (n_k+2)^{1−β} ) )
   ≤ ∫_0^∞ (x+2) exp( −( C/(1−β) ) ( (x+n_k+2)^{1−β} − (n_k+2)^{1−β} ) ) dx + ( 3C(1−β) )²,

where the last inequality follows from the fact that the integrand is monotonically decreasing when x > 3C(1−β), so the integral dominates the sum on the tail x > 3C(1−β); for the part where x ≤ 3C(1−β), the maximum of the integrand is bounded by ( 3C(1−β) )², and thus the total difference of such an approximation is bounded by ( 3C(1−β) )². Then, we estimate the integral. Noticing that

  d/dx exp( −( C/(1−β) ) (x+n_k+2)^{1−β} ) = −C (x+n_k+2)^{−β} exp( −( C/(1−β) ) (x+n_k+2)^{1−β} ),

we integrate by parts, which gives

  ∫_0^∞ (x+2) exp( −( C/(1−β) ) ( (x+n_k+2)^{1−β} − (n_k+2)^{1−β} ) ) dx
   = e^{( C/(1−β) ) (n_k+2)^{1−β}} ∫_0^∞ (x+2)(x+n_k+2)^β · (x+n_k+2)^{−β} e^{−( C/(1−β) ) (x+n_k+2)^{1−β}} dx
   = (2/C)(n_k+2)^β + (1/C) ∫_0^∞ ( (x+n_k+2)^β + β(x+2)(x+n_k+2)^{β−1} ) exp( −( C/(1−β) ) ( (x+n_k+2)^{1−β} − (n_k+2)^{1−β} ) ) dx.

Since β ≤ 1 and x + 2 ≤ x + n_k + 2, we have β(x+2)(x+n_k+2)^{β−1} ≤ (x+n_k+2)^β; thus,

  ∫_0^∞ (x+2) exp( −( C/(1−β) ) ( (x+n_k+2)^{1−β} − (n_k+2)^{1−β} ) ) dx
   ≤ (2/C)(n_k+2)^β + (2/C) ∫_0^∞ (x+n_k+2)^β exp( −( C/(1−β) ) ( (x+n_k+2)^{1−β} − (n_k+2)^{1−β} ) ) dx.

Repeating the above procedure three more times, we have

  ∫_0^∞ (x+2) exp( −( C/(1−β) ) ( (x+n_k+2)^{1−β} − (n_k+2)^{1−β} ) ) dx
   ≤ (8/C)(n_k+2)^β + (16/C²)(n_k+2)^{2β} + (24/C³)(n_k+2)^{3β} + (24/C⁴)(n_k+2)^{4β}
     + (24/C⁴) ∫_0^∞ (x+n_k+2)^{4β−4} exp( −( C/(1−β) ) ( (x+n_k+2)^{1−β} − (n_k+2)^{1−β} ) ) dx
   ≤ (8/C)(n_k+2)^β + (16/C²)(n_k+2)^{2β} + (24/C³)(n_k+2)^{3β} + (24/C⁴)(n_k+2)^{4β} + 24/C⁵
   ≤ C′ (n_k+2)^{4β},

for some constant C′ on the order of 1/C⁵ (which is O( V⁵ β^{−5} ε₀^{−5} )), where the second-to-last inequality uses 5β ≤ 4, so that 4β − 4 ≤ −β; we may therefore replace (x+n_k+2)^{4β−4} by (x+n_k+2)^{−β} and carry out a direct integration. Overall, we proved the claim.

Chapter 6

Online Learning in Weakly Coupled Markov Decision Processes

In this chapter, we consider online learning over weakly coupled Markov decision processes. We develop a new distributed online algorithm in which each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to achieve a tight O(√T) regret and constraint violations simultaneously over a time horizon T.

This chapter considers online constrained Markov decision processes (OCMDP), where both the objective and constraint functions can vary each time slot after the decision is made. We assume a slotted-time scenario with time slots t ∈ {0, 1, 2, ...}. The OCMDP consists of K parallel Markov decision processes with indices k ∈ {1, 2, ..., K}. The k-th MDP has state space S^{(k)}, action space A^{(k)}, and transition probability matrix P_a^{(k)}, which depends on the chosen action a ∈ A^{(k)}.
Specifically, P_a^{(k)} = ( P_a^{(k)}(s, s') ), where

  P_a^{(k)}(s, s') = Pr( s_{t+1}^{(k)} = s' | s_t^{(k)} = s, a_t^{(k)} = a ),

and s_t^{(k)} and a_t^{(k)} are the state and action for system k on slot t. We assume that both the state space and the action space are finite for all k ∈ {1, 2, ..., K}.

After each MDP k ∈ {1, ..., K} makes its decision at time t (and assuming the current state is s_t^{(k)} = s and the action is a_t^{(k)} = a), the following information is revealed:

1. The next state s_{t+1}^{(k)}.
2. A penalty function f_t^{(k)}(s, a) that depends on the current state s and the current action a.
3. A collection of m constraint functions g_{1,t}^{(k)}(s, a), ..., g_{m,t}^{(k)}(s, a) that depend on s and a.

The functions f_t^{(k)} and g_{i,t}^{(k)} are all bounded mappings from S^{(k)} × A^{(k)} to R and represent different types of costs incurred by system k on slot t (depending on the current state and action). For example, in a multi-server data center, the different systems k ∈ {1, ..., K} can represent different servers, the cost function for a particular server k might represent energy or monetary expenditure for that server, and the constraint costs for server k can represent negative rewards such as service rates or qualities. Coupling between the server systems comes from using all of them to collectively support a common stream of arriving jobs.

A key aspect of this general problem is that the functions f_t^{(k)} and g_{i,t}^{(k)} are unknown until after the slot-t decision is made. Thus, the precise costs incurred by each system are only known at the end of the slot. For a fixed time horizon of T slots, the overall penalty and constraint accumulation resulting from a policy P is

  F_T(d₀, P) := E( Σ_{t=1}^T Σ_{k=1}^K f_t^{(k)}( a_t^{(k)}, s_t^{(k)} ) | d₀, P ),   (6.1)

and

  G_{i,T}(d₀, P) := E( Σ_{t=1}^T Σ_{k=1}^K g_{i,t}^{(k)}( a_t^{(k)}, s_t^{(k)} ) | d₀, P ),

where d₀ represents a given distribution on the initial joint state vector ( s_0^{(1)}, ..., s_0^{(K)} ). Note that ( a_t^{(k)}, s_t^{(k)} ) denotes the state-action pair of the k-th MDP, which is a pair of random variables determined by d₀ and P. Define the constraint set

  G := { (P, d₀) : G_{i,T}(d₀, P) ≤ 0, i = 1, 2, ..., m }.   (6.2)

Define the regret of a policy P with respect to a particular joint randomized stationary policy Π, along with an arbitrary starting state distribution d₀, as

  F_T(d₀, P) − F_T(d₀, Π).

The goal is to choose a policy P so that both the regret and the constraint violations grow sublinearly with respect to T, where regret is measured against all feasible joint randomized stationary policies Π.

Here we give a brief review of the works related to online optimization and online MDPs.

• Online convex optimization (OCO): This concerns multi-round cost minimization with arbitrarily-varying convex loss functions. Specifically, on each slot t the decision maker chooses a decision x(t) within a convex set X (before observing the loss function f_t(x)) in order to minimize the total regret compared to the best fixed decision in hindsight, expressed as

  regret(T) = Σ_{t=1}^T f_t( x(t) ) − min_{x ∈ X} Σ_{t=1}^T f_t(x).

See [H+16] for an introduction to OCO. Zinkevich introduced OCO in [Zin03] and showed that an online projected gradient descent (OGD) algorithm achieves O(√T) regret. This O(√T) regret is proven to be the best possible in [HAK07], although improved performance is possible if all convex loss functions are strongly convex. The OGD decision requires computing the projection of a vector onto the set X. For complicated sets X with functional inequality constraints, e.g., X = { x ∈ X₀ : g_k(x) ≤ 0, k ∈ {1, 2, ..., m} }, the projection can have high complexity. To circumvent the projection, work in [MJY12, JHA16, YN16, CLG17] proposes alternative algorithms with simpler per-slot complexity that satisfy the inequality constraints in the long term (rather than on every slot). Recently, new primal-dual-type algorithms with low complexity were proposed in [NY17, YNW17] to solve the more challenging OCO with time-varying functional inequality constraints.

• Online Markov decision processes: This extends OCO to systems with a more complex Markov structure. It is similar to the setup of the current paper of minimizing the expression (6.1), but does not have the constraint set (6.2). Unlike traditional OCO, the current penalty depends not only on the current action and the current (unknown) penalty function, but also on the current system state (which depends on the history of previous actions). Further, the number of policies can grow exponentially with the sizes of the state and action spaces, so that solutions can be computationally intensive. The work [EDKM09] develops an algorithm in this context with O(√T) regret. Extended algorithms and regularization methods are developed in [YMS09][GRW14][DGS14] to reduce complexity and improve the dependencies on the number of states and actions. Online MDPs under bandit feedback (where the decision maker can only observe the penalty corresponding to the chosen action) are considered in [YMS09][NAGS10].

• Constrained MDPs: This aims to solve classical MDP problems with known cost functions but subject to additional constraints on the budget or resources. Linear programming methods for MDPs are found, for example, in [Alt99b], and algorithms beyond LP are found in [Nee11][CDM14]. Formulations closest to our setup appear in recent work on weakly coupled MDPs in [BL16][WN16], which assumes known cost and resource functions.

• Reinforcement learning (RL): This concerns MDPs with some unknown parameters (such as unknown functions and transition probabilities). Typically, RL makes stronger assumptions than the online setting, such as an environment that is unknown but fixed, whereas the unknown environment in the online context can change over time. Methods for RL are developed in [Ber95][SB98][LHS+].

Throughout this paper, given an MDP with state space S and action space A, a policy P defines a (possibly probabilistic) method of choosing actions a ∈ A at state s ∈ S based on past information. We start with some basic definitions of important classes of policies:

Definition 6.2.1.
For an MDP, a randomized stationary policy π defines an algorithm which, whenever the system is in state s ∈ S, chooses an action a ∈ A according to a fixed conditional probability function π(a|s), defined for all a ∈ A and s ∈ S.

Definition 6.2.2. For an MDP, a pure policy π is a randomized stationary policy with all probabilities equal to either 0 or 1. That is, a pure policy is defined by a deterministic mapping between states s ∈ S and actions a ∈ A: whenever the system is in a state s ∈ S, it always chooses a particular action a_s ∈ A (with probability 1).

Note that if an MDP has finite state and action spaces, the set of all pure policies is also finite. Consider the MDP associated with a particular system k ∈ {1, ..., K}. For any randomized stationary policy π, it holds that Σ_{a ∈ A^{(k)}} π(a|s) = 1 for all s ∈ S^{(k)}. Define the transition probability matrix P_π^{(k)} under policy π to have components

  P_π^{(k)}(s, s') = Σ_{a ∈ A^{(k)}} π(a|s) P_a^{(k)}(s, s'),  s, s' ∈ S^{(k)}.   (6.3)

It is easy to verify that P_π^{(k)} is indeed a stochastic matrix, that is, its rows have nonnegative components that sum to 1. Let d_0^{(k)} ∈ [0, 1]^{|S^{(k)}|} be an (arbitrary) initial distribution for the k-th MDP. Define the state distribution at time t under π as d_{π,t}^{(k)}. By the Markov property of the system, we have d_{π,t}^{(k)} = d_0^{(k)} ( P_π^{(k)} )^t. A transition probability matrix P_π^{(k)} is ergodic if it gives rise to a Markov chain that is irreducible and aperiodic. Since the state space is finite, an ergodic matrix P_π^{(k)} has a unique stationary distribution, denoted d_π^{(k)}, so that d_π^{(k)} is the unique probability vector solving d = d P_π^{(k)}.

Assumption 6.2.1 (Unichain model). There exists a universal integer r̂ ≥ 1 such that for any integer r ≥ r̂ and every k ∈ {1, ..., K}, the product P_{π₁}^{(k)} P_{π₂}^{(k)} ··· P_{π_r}^{(k)} is a transition matrix with strictly positive entries for any sequence of pure policies π₁, π₂, ..., π_r associated with the k-th MDP.

Remark 6.2.1. Assumption 6.2.1 implies that each MDP k ∈ {1, ..., K} is ergodic under any pure policy. This follows by taking π₁, π₂, ..., π_r all the same in Assumption 6.2.1. Since the transition matrix of any randomized stationary policy can be formed as a convex combination of those of pure policies, any randomized stationary policy results in an ergodic MDP for which there is a unique stationary distribution. Assumption 6.2.1 is easy to check via the following simple sufficient condition.

Proposition 6.2.1. Assumption 6.2.1 holds if, for every k ∈ {1, ..., K}, there is a fixed ergodic matrix P^{(k)} (i.e., a transition probability matrix that defines an irreducible and aperiodic Markov chain) such that for any pure policy π on MDP k we have the decomposition

  P_π^{(k)} = δ_π P^{(k)} + (1 − δ_π) Q_π^{(k)},

where δ_π ∈ (0, 1] depends on the pure policy π and Q_π^{(k)} is a stochastic matrix depending on π.

Proof. Fix k ∈ {1, ..., K} and assume every pure policy on MDP k has the above decomposition. Since there are only finitely many pure policies, there exists a lower bound δ_min > 0 such that δ_π ≥ δ_min for every pure policy π. Since P^{(k)} is an ergodic matrix, there exists an integer r^{(k)} > 0 such that ( P^{(k)} )^r has strictly positive components for all r ≥ r^{(k)}. Fix r ≥ r^{(k)} and let π₁, ..., π_r be any sequence of r pure policies on MDP k. Then

  P_{π₁}^{(k)} ··· P_{π_r}^{(k)} ≥ δ_min^r ( P^{(k)} )^r > 0,

where the inequality is treated entrywise. The universal integer r̂ can be taken as the maximum integer r^{(k)} over all k ∈ {1, ..., K}.

Definition 6.2.3. A joint randomized stationary policy Π on K parallel MDPs defines an algorithm which chooses a joint action a := ( a^{(1)}, a^{(2)}, ..., a^{(K)} ) ∈ A^{(1)} × A^{(2)} × ··· × A^{(K)} given the joint state s := ( s^{(1)}, s^{(2)}, ..., s^{(K)} ) ∈ S^{(1)} × S^{(2)} × ··· × S^{(K)} according to a fixed conditional probability Π(a|s).

The following special class of separable policies can be implemented separately over each of the K MDPs and plays a role in both algorithm design and performance analysis.
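To illustrate (6.3) numerically, the following sketch builds the policy-induced transition matrix from per-action transition matrices and computes its stationary distribution; the two-state, two-action example used in any demonstration is hypothetical, not from the text.

```python
import numpy as np

def policy_transition_matrix(P_a, pi):
    """P_pi(s, s') = sum_a pi(a|s) * P_a(s, s'), as in (6.3).

    P_a: dict mapping action -> (S x S) transition matrix.
    pi:  (S x A) array with pi[s, a] = probability of action a in state s.
    """
    S = next(iter(P_a.values())).shape[0]
    P_pi = np.zeros((S, S))
    for a, Pa in P_a.items():
        P_pi += pi[:, a][:, None] * Pa   # weight each row s of P_a by pi(a|s)
    return P_pi

def stationary_distribution(P_pi):
    """Unique probability vector d solving d = d P_pi (ergodic chain assumed)."""
    vals, vecs = np.linalg.eig(P_pi.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])  # eigenvector for eigenvalue 1
    return v / v.sum()
```

For an ergodic chain the Perron eigenvalue of the transposed matrix is 1, so the normalized leading eigenvector is the stationary distribution d_π.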
Definition 6.2.4. A joint randomized stationary policy π is separable if the conditional probabilities π := ( π^{(1)}, π^{(2)}, ..., π^{(K)} ) decompose as a product

  π(a|s) = Π_{k=1}^K π^{(k)}( a^{(k)} | s^{(k)} )

for all a ∈ A^{(1)} × ··· × A^{(K)}, s ∈ S^{(1)} × ··· × S^{(K)}.

The functions f_t^{(k)} and g_{i,t}^{(k)} are determined by random processes defined over t = 0, 1, 2, .... Specifically, let Ω be a finite-dimensional vector space, and let {ω_t}_{t=0}^∞ and {μ_t}_{t=0}^∞ be two sequences of random vectors in Ω. Then for all a ∈ A^{(k)}, s ∈ S^{(k)}, i ∈ {1, 2, ..., m}, we have

  g_{i,t}^{(k)}(a, s) = ĝ_i^{(k)}(a, s, ω_t),
  f_t^{(k)}(a, s) = f̂^{(k)}(a, s, μ_t),

where ĝ_i^{(k)} and f̂^{(k)} formally define the time-varying functions in terms of the random processes ω_t and μ_t. It is assumed that the processes {ω_t}_{t=0}^∞ and {μ_t}_{t=0}^∞ are generated at the start of slot 0 (before any control actions are taken) and revealed gradually over time, so that the functions g_{i,t}^{(k)} and f_t^{(k)} are only revealed at the end of slot t.

Remark 6.2.2. The functions generated at time 0 in this way are also called oblivious functions because they are not influenced by control actions. Such an assumption is commonly adopted in previous unconstrained online MDP works (e.g., [EDKM09], [YMS09], and [DGS14]). Further, it is shown in [YMS09] that without this assumption, one can choose a sequence of objective functions against the decision maker in a specially designed MDP scenario so that sublinear regret is never achieved.

The functions are also assumed to be bounded by a universal constant Ψ, so that

  |ĝ_i^{(k)}(a, s, ω)| ≤ Ψ, |f̂^{(k)}(a, s, μ)| ≤ Ψ, ∀ k ∈ {1, ..., K}, ∀ a ∈ A^{(k)}, s ∈ S^{(k)}, ∀ ω, μ ∈ Ω.   (6.4)

It is assumed that {ω_t}_{t=0}^∞ is independent, identically distributed (i.i.d.) and independent of {μ_t}_{t=0}^∞. Hence, the constraint functions can be arbitrarily correlated on the same slot, but appear i.i.d. over different slots. On the other hand, no specific model is imposed on {μ_t}_{t=0}^∞; thus, the functions f_t^{(k)} can be arbitrarily time-varying. Let H_t be the system information up to time t; then, for any t ∈ {0, 1, 2, ...}, H_t contains the state and action information up to time t, i.e., s₀, ..., s_t, a₀, ..., a_t, together with {ω_t}_{t=0}^∞ and {μ_t}_{t=0}^∞. Throughout this paper, we make the following assumptions.

Assumption 6.2.2 (Independent transition). For each MDP, given the state s_t^{(k)} ∈ S^{(k)} and action a_t^{(k)} ∈ A^{(k)}, the next state s_{t+1}^{(k)} is independent of all other past information up to time t as well as the state transitions s_{t+1}^{(j)}, ∀ j ≠ k; i.e., for all s' ∈ S^{(k)} it holds that

  Pr( s_{t+1}^{(k)} = s' | H_t, s_{t+1}^{(j)}, ∀ j ≠ k ) = Pr( s_{t+1}^{(k)} = s' | s_t^{(k)}, a_t^{(k)} ),

where H_t contains all past information up to time t.

Intuitively, this assumption means that all MDPs run independently in the joint probability space, so that the only coupling among them comes from the constraints, which reflects the notion of weakly coupled MDPs in our title. Furthermore, by the definition of H_t, given s_t^{(k)} and a_t^{(k)}, the next transition s_{t+1}^{(k)} is also independent of the function paths {ω_t}_{t=0}^∞ and {μ_t}_{t=0}^∞. The following assumption states that the constraint set is strictly feasible.

Assumption 6.2.3 (Slater's condition).
There exists a real value $\eta>0$ and a fixed separable randomized stationary policy $\tilde\pi$ such that
$$\mathbb E\left[\sum_{k=1}^K g_{i,t}^{(k)}\left(a_t^{(k)},s_t^{(k)}\right)\,\middle|\,d_{\tilde\pi},\tilde\pi\right] \le -\eta, \quad \forall i\in\{1,2,\cdots,m\},$$
where the initial state distribution $d_{\tilde\pi}$ is the unique stationary distribution of the policy $\tilde\pi$, and the expectation is taken with respect to the random initial state and the stochastic function $g_{i,t}^{(k)}(a,s)$ (i.e., $\omega_t$).

Slater's condition is a common assumption in the convergence time analysis of constrained convex optimization (e.g. [NO09], [Ber09b]). Note that this assumption readily implies that the constraint set $\mathcal G$ can be achieved by the above randomized stationary policy. Specifically, take $d_0^{(k)} = d_{\tilde\pi^{(k)}}$ and $\Pi = \tilde\pi$; then, we have
$$G_{i,T}(d_0,\tilde\pi) = \sum_{t=0}^{T-1}\mathbb E\left[\sum_{k=1}^K g_{i,t}^{(k)}\left(a_t^{(k)},s_t^{(k)}\right)\,\middle|\,d_{\tilde\pi},\tilde\pi\right] \le -\eta T < 0.$$

In this section, we recall the well-known linear program formulation of an MDP (see, for example, [Alt99b] and [Fox66a]). Consider an MDP with a state space $\mathcal S$ and an action space $\mathcal A$. Let $\Delta\subseteq\mathbb R^{|\mathcal S||\mathcal A|}$ be the probability simplex, i.e.
$$\Delta = \left\{\theta\in\mathbb R^{|\mathcal S||\mathcal A|}:\ \sum_{(s,a)\in\mathcal S\times\mathcal A}\theta(s,a)=1,\ \theta(s,a)\ge 0\right\}.$$
Given a randomized stationary policy $\pi$ with stationary state distribution $d_\pi$, the MDP is a Markov chain with transition matrix $P_\pi$ given by (6.3). Thus, it must satisfy the following balance equation:
$$\sum_{s\in\mathcal S} d_\pi(s)P_\pi(s,s') = d_\pi(s'),\quad \forall s'\in\mathcal S.$$
Defining $\theta(a,s) = \pi(a|s)d_\pi(s)$ and substituting the definition of the transition probability (6.3) into the above equation gives
$$\sum_{s\in\mathcal S}\sum_{a\in\mathcal A}\theta(s,a)P_a(s,s') = \sum_{a\in\mathcal A}\theta(s',a),\quad \forall s'\in\mathcal S.$$
The quantity $\theta(a,s)$ is often interpreted as the stationary probability of being at state $s\in\mathcal S$ and taking action $a\in\mathcal A$ under some randomized stationary policy. The state-action polyhedron $\Theta$ is then defined as
$$\Theta := \left\{\theta\in\Delta:\ \sum_{s\in\mathcal S}\sum_{a\in\mathcal A}\theta(s,a)P_a(s,s') = \sum_{a\in\mathcal A}\theta(s',a),\ \forall s'\in\mathcal S\right\}.$$
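For concreteness, the balance equation and the policy-induced transition matrix can be checked numerically on a toy two-state, two-action MDP; all transition probabilities and the policy below are made up for illustration, not taken from the text:

```python
import numpy as np

# Toy MDP: 2 states, 2 actions; P[a][s, s'] is the action-conditioned transition matrix.
P = {0: np.array([[0.9, 0.1], [0.4, 0.6]]),
     1: np.array([[0.2, 0.8], [0.7, 0.3]])}
# An arbitrary fixed randomized stationary policy pi(a|s) (rows indexed by state).
pi = np.array([[0.5, 0.5], [0.3, 0.7]])

# Policy-induced transition matrix P_pi(s, s') = sum_a pi(a|s) P_a(s, s')  (cf. (6.3)).
P_pi = sum(pi[:, a][:, None] * P[a] for a in (0, 1))

# Stationary distribution: left eigenvector of P_pi for eigenvalue 1, normalized.
vals, vecs = np.linalg.eig(P_pi.T)
d = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
d = d / d.sum()

# The balance equation sum_s d(s) P_pi(s, s') = d(s') holds.
assert np.allclose(d @ P_pi, d)
```

The same computation, applied per MDP, is what the algorithm in Section 6.3 uses at $t=0$ to obtain $\theta_0^{(k)}(a,s) = d_{\pi_0^{(k)}}(s)\,\pi_0^{(k)}(a|s)$ from an initial policy.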
Given any $\theta\in\Theta$, one can recover a randomized stationary policy $\pi$ at any state $s\in\mathcal S$ as
$$\pi(a|s) = \begin{cases} \dfrac{\theta(a,s)}{\sum_{a\in\mathcal A}\theta(a,s)}, & \text{if } \sum_{a\in\mathcal A}\theta(a,s)\ne 0,\\[1ex] \dfrac{1}{|\mathcal A|}, & \text{otherwise}.\end{cases} \qquad (6.5)$$
Given any fixed penalty function $f(a,s)$, the best policy minimizing the penalty (without constraints) is a randomized stationary policy given by the solution to the following linear program (LP):
$$\min\ \langle \mathbf f,\theta\rangle,\quad \text{s.t. } \theta\in\Theta, \qquad (6.6)$$
where $\mathbf f := [f(a,s)]_{a\in\mathcal A,\,s\in\mathcal S}$. Note that for any policy $\pi$ given by the state-action pair $\theta$ according to (6.5),
$$\langle\mathbf f,\theta\rangle = \mathbb E_{s\sim d_\pi,\ a\sim\pi(\cdot|s)}\left[f(a,s)\right].$$
Thus, $\langle\mathbf f,\theta\rangle$ is often referred to as the stationary state penalty of the policy $\pi$. It can also be shown that any state-action pair in the set $\Theta$ can be achieved by a convex combination of state-action vectors of pure policies, and thus all corner points of the polyhedron $\Theta$ come from pure policies. As a consequence, the best randomized stationary policy solving (6.6) is always a pure policy.

In this section, we give preliminary results regarding the properties of our weakly coupled MDPs under randomized stationary policies. The proofs can be found in Appendix 6.6.1. We start with a lemma on the uniform mixing of MDPs.

Lemma 6.2.1.
Suppose Assumptions 6.2.1 and 6.2.2 hold. There exist a positive integer $r$ and a constant $\tau\ge 1$ such that for any two state distributions $d_1$ and $d_2$,
$$\sup_{\pi_1^{(k)},\cdots,\pi_r^{(k)}}\left\|\left(d_1^{(k)}-d_2^{(k)}\right)P^{(k)}_{\pi_1^{(k)}}P^{(k)}_{\pi_2^{(k)}}\cdots P^{(k)}_{\pi_r^{(k)}}\right\|_1 \le e^{-1/\tau}\left\|d_1^{(k)}-d_2^{(k)}\right\|_1,\quad \forall k\in\{1,2,\cdots,K\},$$
where the supremum is taken with respect to any sequence of $r$ randomized stationary policies $\left\{\pi_1^{(k)},\cdots,\pi_r^{(k)}\right\}$.

For the $k$-th MDP, let $\Theta^{(k)}$ be its state-action polyhedron according to the definition in Section 6.2.3. For any joint randomized stationary policy, let $\theta^{(k)}$ be the marginal state-action probability vector on the $k$-th MDP, i.e. for any joint state-action distribution $\Phi(\mathbf a,\mathbf s)$ where $\mathbf a\in\mathcal A^{(1)}\times\cdots\times\mathcal A^{(K)}$ and $\mathbf s\in\mathcal S^{(1)}\times\cdots\times\mathcal S^{(K)}$, we have $\theta^{(k)}(a^{(k)},s^{(k)}) = \sum_{a^{(j)},s^{(j)},\,j\ne k}\Phi(\mathbf a,\mathbf s)$. We have the following lemma:

Lemma 6.2.2.
Suppose Assumptions 6.2.1 and 6.2.2 hold. Consider the product MDP with product state space $\mathcal S^{(1)}\times\cdots\times\mathcal S^{(K)}$ and action space $\mathcal A^{(1)}\times\cdots\times\mathcal A^{(K)}$. Then, for any joint randomized stationary policy, the following hold:
1. The product MDP is irreducible and aperiodic.
2. The marginal stationary state-action probability vector satisfies $\theta^{(k)}\in\Theta^{(k)}$, $\forall k\in\{1,2,\cdots,K\}$.

An immediate conclusion we can draw from this lemma is that given any penalty and constraint functions $f^{(k)}$ and $g_i^{(k)}$, $k=1,2,\cdots,K$, the stationary penalty and constraint values of any joint randomized stationary policy can be expressed as
$$\sum_{k=1}^K\left\langle f^{(k)},\theta^{(k)}\right\rangle,\qquad \sum_{k=1}^K\left\langle g_i^{(k)},\theta^{(k)}\right\rangle,\quad i=1,2,\cdots,m,$$
with $\theta^{(k)}\in\Theta^{(k)}$. This in turn implies that such stationary state-action probabilities $\{\theta^{(k)}\}_{k=1}^K$ can also be realized via a separable randomized stationary policy $\pi$ with
$$\pi^{(k)}(a|s) = \frac{\theta^{(k)}(a,s)}{\sum_{a\in\mathcal A^{(k)}}\theta^{(k)}(a,s)},\quad a\in\mathcal A^{(k)},\ s\in\mathcal S^{(k)}, \qquad (6.7)$$
and the corresponding stationary penalty and constraint values can also be achieved via this policy. This fact implies that when considering the stationary state performance only, the class of separable randomized stationary policies is large enough to cover all possible stationary penalty and constraint values.

In particular, let $\tilde\pi = \left(\tilde\pi^{(1)},\cdots,\tilde\pi^{(K)}\right)$ be the separable randomized stationary policy associated with Slater's condition (Assumption 6.2.3). Using the fact that the constraint functions $g_{i,t}^{(k)}$, $k=1,2,\cdots,K$ (i.e. $\omega_t$) are i.i.d., together with Assumption 6.2.2 on the independence of probability transitions, the constraint functions $g_{i,t}^{(k)}$ and the state-action pairs at any time $t$ are mutually independent.
Thus,
$$\mathbb E\left[\sum_{k=1}^K g_{i,t}^{(k)}\left(a_t^{(k)},s_t^{(k)}\right)\,\middle|\,d_{\tilde\pi},\tilde\pi\right] = \sum_{k=1}^K\left\langle\mathbb E\left(g_{i,t}^{(k)}\right),\tilde\theta^{(k)}\right\rangle,$$
where $\tilde\theta^{(k)}$ corresponds to $\tilde\pi$ according to (6.7). Then, Slater's condition can be translated into the following: There exists a sequence of state-action probabilities $\{\tilde\theta^{(k)}\}_{k=1}^K$ resulting from a separable randomized stationary policy such that $\tilde\theta^{(k)}\in\Theta^{(k)}$, $\forall k$, and
$$\sum_{k=1}^K\left\langle\mathbb E\left(g_{i,t}^{(k)}\right),\tilde\theta^{(k)}\right\rangle \le -\eta,\quad i=1,2,\cdots,m. \qquad (6.8)$$
The assumption of separability does not lose generality, in the sense that if no separable randomized stationary policy satisfies (6.8), then no joint randomized stationary policy satisfies (6.8) either.

The current state of an MDP depends on previous states and actions. As a consequence, the slot-$t$ penalty depends not only on the current penalty function and current action, but also on the system history. This complication does not arise in classical online convex optimization ([H+]), where the slot-$t$ penalty depends only on the slot-$t$ penalty function and action. Now imagine a virtual system where, on each slot $t$, a policy $\pi_t$ is chosen (rather than an action). Further imagine the MDP immediately reaching its corresponding stationary distribution $d_{\pi_t}$. Then the states and actions on previous slots do not matter, and the slot-$t$ performance depends only on the chosen policy $\pi_t$ and on the current penalty and constraint functions. This imaginary system now has a structure similar to classical online convex optimization as in the Zinkevich scenario [Zin03].

A key feature of online convex optimization algorithms as in [Zin03] is that they update their decision variables slowly. For a fixed time scale $T$ over which $O(\sqrt T)$ regret is desired, the decision variables are typically changed no more than a distance $O(1/\sqrt T)$ from one slot to the next. An important insight in prior (unconstrained) MDP works (e.g.
[DGS14], [EDKM09], and [YMS09]) is that such slow updates also guarantee the "approximate" convergence of an MDP to its stationary distribution. As a consequence, one can design the decision policies under the imaginary assumption that the system instantly reaches its stationary distribution, and later bound the error between the true system and the imaginary system. If the error is on the same order as the desired $O(\sqrt T)$ regret, then this approach works. This idea serves as a cornerstone of our algorithm design in the next section, which treats the case of multiple weakly coupled systems with both objective functions and constraint functions.

Our proposed algorithm is distributed in the sense that at each time slot, each MDP solves its own subproblem, and the constraint violations are controlled by a simple update of global multipliers called "virtual queues" at the end of each slot. Let $\Theta^{(1)},\Theta^{(2)},\cdots,\Theta^{(K)}$ be the state-action polyhedra of the $K$ MDPs, respectively. Let $\theta_t^{(k)}\in\Theta^{(k)}$ be a state-action vector at time slot $t$. At $t=0$, each MDP chooses its initial state-action vector $\theta_0^{(k)}$ resulting from any separable randomized stationary policy $\pi_0^{(k)}$. For example, one could choose a uniform policy $\pi_0^{(k)}(a|s) = 1/|\mathcal A^{(k)}|$, $\forall s\in\mathcal S^{(k)}$, solve the equation $d_{\pi_0^{(k)}} = d_{\pi_0^{(k)}}P^{(k)}_{\pi_0^{(k)}}$ to get a probability vector $d_{\pi_0^{(k)}}$, and obtain $\theta_0^{(k)}(a,s) = d_{\pi_0^{(k)}}(s)/|\mathcal A^{(k)}|$. For each constraint $i\in\{1,2,\cdots,m\}$, let $Q_i(t)$ be a virtual queue defined over slots $t=0,1,2,\cdots$ with the initial condition $Q_i(0)=Q_i(1)=0$ and the update equation
$$Q_i(t+1) = \max\left\{Q_i(t) + \sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle,\ 0\right\},\quad \forall t\in\{1,2,3,\cdots\}.$$
(6.9)

Our algorithm uses two parameters $V>0$ and $\alpha>0$ and proceeds as follows. At each time slot $t\in\{1,2,3,\cdots\}$:
• The $k$-th MDP observes $Q_i(t)$, $i=1,2,\cdots,m$, and chooses $\theta_t^{(k)}$ to solve the following subproblem:
$$\theta_t^{(k)} = \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \left\langle V f_{t-1}^{(k)} + \sum_{i=1}^m Q_i(t)\, g_{i,t-1}^{(k)},\ \theta\right\rangle + \alpha\left\|\theta-\theta_{t-1}^{(k)}\right\|^2. \qquad (6.10)$$
• Construct the randomized stationary policy $\pi_t^{(k)}$ according to (6.5) with $\theta=\theta_t^{(k)}$, and choose the action $a_t^{(k)}$ at the $k$-th MDP according to the conditional distribution $\pi_t^{(k)}\big(\cdot\,|\,s_t^{(k)}\big)$.
• Update the virtual queues $Q_i(t)$ according to (6.9) for all $i=1,2,\cdots,m$.

Remark 6.3.1.
Note that for any slot $t\ge 1$, this algorithm gives a separable randomized stationary policy, so that each MDP chooses its own policy based on its own functions $f_{t-1}^{(k)}$, $g_{i,t-1}^{(k)}$, $i\in\{1,2,\cdots,m\}$, and a common multiplier vector $\mathbf Q(t) := (Q_1(t),\cdots,Q_m(t))$. Furthermore, note that (6.10) is a convex quadratic program (QP). Standard theory of QP (e.g. [YT89]) shows that the computational complexity of solving (6.10) is $\mathrm{poly}\big(|\mathcal S^{(k)}||\mathcal A^{(k)}|\big)$ for each $k$. Thus, the total computational complexity over all MDPs during each round is $\mathrm{poly}\big(K|\mathcal S^{(k)}||\mathcal A^{(k)}|\big)$.

Remark 6.3.2.
The quadratic term $\alpha\big\|\theta-\theta_{t-1}^{(k)}\big\|^2$ in (6.10) penalizes the deviation of $\theta$ from the previous decision variable $\theta_{t-1}^{(k)}$. Thus, under a proper choice of $\alpha$, the distance between $\theta_t^{(k)}$ and $\theta_{t-1}^{(k)}$ is very small, which is the slow-update condition we need according to Section 6.2.5. The next lemma shows that solving (6.10) is in fact a projection onto the state-action polyhedron. For any set
$\mathcal X\subseteq\mathbb R^n$ and a vector $\mathbf y\in\mathbb R^n$, define the projection operator $\mathcal P_{\mathcal X}(\mathbf y)$ as
$$\mathcal P_{\mathcal X}(\mathbf y) = \operatorname*{arginf}_{\mathbf x\in\mathcal X}\|\mathbf x-\mathbf y\|.$$

Lemma 6.3.1.
Fix an $\alpha>0$ and $t\in\{1,2,3,\cdots\}$. The $\theta_t^{(k)}$ that solves (6.10) is
$$\theta_t^{(k)} = \mathcal P_{\Theta^{(k)}}\left(\theta_{t-1}^{(k)} - \frac{w_t^{(k)}}{2\alpha}\right),\quad\text{where}\quad w_t^{(k)} = V f_{t-1}^{(k)} + \sum_{i=1}^m Q_i(t)\, g_{i,t-1}^{(k)}\in\mathbb R^{|\mathcal A^{(k)}||\mathcal S^{(k)}|}.$$
Proof. By definition, we have
$$\begin{aligned}
\theta_t^{(k)} &= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \left\langle w_t^{(k)},\theta\right\rangle + \alpha\left\|\theta-\theta_{t-1}^{(k)}\right\|^2\\
&= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \left\langle w_t^{(k)},\theta-\theta_{t-1}^{(k)}\right\rangle + \alpha\left\|\theta-\theta_{t-1}^{(k)}\right\|^2 + \left\langle w_t^{(k)},\theta_{t-1}^{(k)}\right\rangle\\
&= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \alpha\cdot\left(\left\langle w_t^{(k)}/\alpha,\ \theta-\theta_{t-1}^{(k)}\right\rangle + \left\|\theta-\theta_{t-1}^{(k)}\right\|^2\right) + \left\langle w_t^{(k)},\theta_{t-1}^{(k)}\right\rangle\\
&= \operatorname*{argmin}_{\theta\in\Theta^{(k)}}\ \alpha\cdot\left\|\theta-\theta_{t-1}^{(k)} + w_t^{(k)}/(2\alpha)\right\|^2 = \mathcal P_{\Theta^{(k)}}\left(\theta_{t-1}^{(k)} - w_t^{(k)}/(2\alpha)\right),
\end{aligned}$$
finishing the proof.

The intuition behind this algorithm follows from the discussion in Section 6.2.5. Instead of the Markovian regret (6.1) and constraint set (6.2), we work with the imaginary system in which, after the decision maker chooses any joint policy $\Pi_t$ and the penalty/constraint functions are revealed, the $K$ parallel Markov chains reach their stationary state distributions right away, with state-action probability vectors $\{\theta_t^{(k)}\}_{k=1}^K$ for the $K$ parallel MDPs. Thus there is no Markov state in such a system anymore, and the corresponding stationary penalty and constraint function values at time $t$ can be expressed as $\sum_{k=1}^K\langle f_t^{(k)},\theta_t^{(k)}\rangle$ and $\sum_{k=1}^K\langle g_{i,t}^{(k)},\theta_t^{(k)}\rangle$, $i=1,2,\cdots,m$, respectively.
As a consequence, we now face the relatively easier task of minimizing the following regret:
$$\sum_{t=0}^{T-1}\sum_{k=1}^K\mathbb E\left(\left\langle f_t^{(k)},\theta_t^{(k)}\right\rangle\right) - \sum_{t=0}^{T-1}\sum_{k=1}^K\mathbb E\left(\left\langle f_t^{(k)},\theta_*^{(k)}\right\rangle\right), \qquad (6.11)$$
where $\{\theta_*^{(k)}\}_{k=1}^K$ are the state-action probabilities corresponding to the best fixed joint randomized stationary policy within the following stationary constraint set:
$$\mathcal G := \left\{\theta^{(k)}\in\Theta^{(k)},\ k\in\{1,2,\cdots,K\}:\ \sum_{k=1}^K\left\langle\mathbb E\left(g_{i,t}^{(k)}\right),\theta^{(k)}\right\rangle\le 0,\ i=1,2,\cdots,m\right\}, \qquad (6.12)$$
with the assumption that Slater's condition (6.8) holds.

To analyze the proposed algorithm, we need to tackle the following two major challenges:
• Whether the policy decisions of the proposed algorithm yield $O(\sqrt T)$ regret and constraint violation in the imaginary system that reaches steady state instantaneously on each slot.
• Whether the error between the imaginary and true systems can be bounded by $O(\sqrt T)$.

In the next section, we answer these questions via a multi-stage analysis piecing together the results on MDPs from Section 6.2.4 with multiple ingredients from convex analysis and stochastic queue analysis. We first show the $O(\sqrt T)$ regret and constraint violation in the imaginary online linear program, incorporating a new regret analysis procedure with a stochastic drift analysis for queue processes. Then, we show that if the benchmark randomized stationary algorithm always starts from its stationary state, the discrepancy between the regrets of the imaginary and true systems can be controlled via the slow-update property of the proposed algorithm together with the properties of MDPs developed in Section 6.2.4. Finally, for the problem with an arbitrary non-stationary starting state, we reformulate it as a perturbation of the aforementioned stationary-state problem and analyze the perturbation via Farkas' Lemma.
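As an illustrative sketch (not the exact experimental setup of this chapter), the following loop runs the virtual queue update (6.9) together with the projection form of subproblem (6.10) from Lemma 6.3.1 on a toy instance in which each $\Theta^{(k)}$ is just a probability simplex (i.e., single-state MDPs), so the polyhedral projection reduces to the standard simplex projection. The dimensions, random functions, and the `proj_simplex` helper are assumptions of this sketch:

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection onto the probability simplex (sort-based method)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(y) + 1) > 0)[0][-1]
    return np.maximum(y - css[rho] / (rho + 1), 0.0)

rng = np.random.default_rng(1)
K, m, n, T = 2, 1, 4, 200
V, alpha = np.sqrt(T), float(T)               # parameter choice from the analysis
theta = [np.ones(n) / n for _ in range(K)]    # theta_0 from a uniform policy
Q = np.zeros(m)
max_step = 0.0

for t in range(1, T + 1):
    f = [rng.random(n) for _ in range(K)]                              # f_{t-1}^{(k)}
    g = [[rng.random(n) - 0.6 for _ in range(K)] for _ in range(m)]    # g_{i,t-1}^{(k)}
    for k in range(K):
        # Subproblem (6.10) solved as a projection (Lemma 6.3.1).
        w = V * f[k] + sum(Q[i] * g[i][k] for i in range(m))
        new = proj_simplex(theta[k] - w / (2 * alpha))
        max_step = max(max_step, np.linalg.norm(new - theta[k]))
        theta[k] = new
    # Virtual queue update (6.9).
    for i in range(m):
        Q[i] = max(Q[i] + sum(g[i][k] @ theta[k] for k in range(K)), 0.0)
```

With $V=\sqrt T$ and $\alpha=T$, the per-slot movement `max_step` stays small, which is the slow-update behavior the analysis relies on.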
Let $\mathbf Q(t) := [Q_1(t),Q_2(t),\cdots,Q_m(t)]$ be the virtual queue vector and $L(t) = \frac12\|\mathbf Q(t)\|^2$. Define the drift $\Delta(t) := L(t+1)-L(t)$.

Sample-path analysis
This section develops a couple of bounds given a sequence of penalty functions $f_0^{(k)},f_1^{(k)},\cdots,f_{T-1}^{(k)}$ and constraint functions $g_{i,0}^{(k)},g_{i,1}^{(k)},\cdots,g_{i,T-1}^{(k)}$. The following lemma provides bounds for the virtual queue processes:

Lemma 6.4.1.
For any $i\in\{1,2,\cdots,m\}$ and $T\in\{1,2,\cdots\}$, the following holds under the virtual queue update (6.9):
$$\sum_{t=1}^T\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_{t-1}^{(k)}\right\rangle \le Q_i(T+1) - Q_i(1) + \Psi\sum_{t=1}^T\sum_{k=1}^K\sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|,$$
where $\Psi>0$ is the constant defined in (6.4).

Proof. By the queue updating rule (6.9), for any $t\in\mathbb N$,
$$\begin{aligned}
Q_i(t+1) &= \max\left\{Q_i(t) + \sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle,\ 0\right\} \ge Q_i(t) + \sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle\\
&= Q_i(t) + \sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_{t-1}^{(k)}\right\rangle + \sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle\\
&\ge Q_i(t) + \sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_{t-1}^{(k)}\right\rangle - \sum_{k=1}^K\left\|g_{i,t-1}^{(k)}\right\|\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|.
\end{aligned}$$
Note that the constraint functions are deterministically bounded:
$$\left\|g_{i,t-1}^{(k)}\right\| \le \sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi.$$
Substituting this bound into the above queue bound and rearranging the terms finishes the proof.

The next lemma provides a bound for the drift $\Delta(t)$.

Lemma 6.4.2.
For any slot $t\ge 1$, we have
$$\Delta(t) \le \frac12 mK^2\Psi^2 + \sum_{i=1}^m Q_i(t)\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle.$$

Proof.
By definition, we have
$$\begin{aligned}
\Delta(t) &= \frac12\|\mathbf Q(t+1)\|^2 - \frac12\|\mathbf Q(t)\|^2 \le \frac12\sum_{i=1}^m\left[\left(Q_i(t)+\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle\right)^2 - Q_i(t)^2\right]\\
&= \sum_{i=1}^m Q_i(t)\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle + \frac12\sum_{i=1}^m\left(\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle\right)^2.
\end{aligned}$$
Moreover,
$$\left|\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle\right| \le \sum_{k=1}^K\left\|g_{i,t-1}^{(k)}\right\|_\infty\left\|\theta_t^{(k)}\right\|_1 \le K\Psi.$$
Substituting this bound into the drift bound finishes the proof.

Consider a convex set
$\mathcal X\subseteq\mathbb R^n$. Recall that for a fixed real number $c>0$, a function $h:\mathcal X\to\mathbb R$ is said to be $c$-strongly convex if $h(\mathbf x)-\frac{c}{2}\|\mathbf x\|^2$ is convex over $\mathbf x\in\mathcal X$. It is easy to see that if $q:\mathcal X\to\mathbb R$ is convex, $c>0$ and $\mathbf b\in\mathbb R^n$, then the function $q(\mathbf x)+\frac{c}{2}\|\mathbf x-\mathbf b\|^2$ is $c$-strongly convex. Furthermore, if the function $h$ is $c$-strongly convex and is minimized at a point $\mathbf x_{\min}\in\mathcal X$, then (see, e.g., Corollary 1 in [YN17]):
$$h(\mathbf x_{\min}) \le h(\mathbf y) - \frac{c}{2}\|\mathbf y-\mathbf x_{\min}\|^2,\quad \forall\mathbf y\in\mathcal X. \qquad (6.13)$$
The following lemma is a direct consequence of the above strongly convex result. It also demonstrates the key property of our minimization subproblem (6.10).

Lemma 6.4.3.
The following bound holds for any $k\in\{1,2,\cdots,K\}$ and any fixed $\theta_*^{(k)}\in\Theta^{(k)}$:
$$\begin{aligned}
&V\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \sum_{i=1}^m Q_i(t)\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle + \alpha\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2\\
&\le V\left\langle f_{t-1}^{(k)},\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \sum_{i=1}^m Q_i(t)\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle + \alpha\left\|\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\|^2 - \alpha\left\|\theta_*^{(k)}-\theta_t^{(k)}\right\|^2. \qquad (6.14)
\end{aligned}$$
This lemma follows easily from the fact that the proposed algorithm (6.10) gives $\theta_t^{(k)}\in\Theta^{(k)}$ minimizing the left-hand side, which is a $2\alpha$-strongly convex function, and then applying (6.13) with
$$h\left(\theta_*^{(k)}\right) = V\left\langle f_{t-1}^{(k)},\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \sum_{i=1}^m Q_i(t)\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle + \alpha\left\|\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\|^2.$$
Combining the previous two lemmas gives the following "drift-plus-penalty" bound.
Lemma 6.4.4.
For any fixed $\{\theta_*^{(k)}\}_{k=1}^K$ such that $\theta_*^{(k)}\in\Theta^{(k)}$ and any $t\in\mathbb N$, we have the following bound:
$$\begin{aligned}
&\Delta(t) + V\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \alpha\sum_{k=1}^K\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2\\
&\le \frac32 mK^2\Psi^2 + V\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \sum_{i=1}^m Q_i(t-1)\cdot\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle\\
&\quad + \alpha\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\|^2 - \alpha\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_t^{(k)}\right\|^2. \qquad (6.15)
\end{aligned}$$

Proof.
Using Lemma 6.4.2 and then Lemma 6.4.3, we obtain
$$\begin{aligned}
&\Delta(t) + V\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \alpha\sum_{k=1}^K\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2\\
&\le \frac12 mK^2\Psi^2 + \sum_{i=1}^m Q_i(t)\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle + V\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \alpha\sum_{k=1}^K\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2\\
&\le \frac12 mK^2\Psi^2 + V\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \sum_{i=1}^m Q_i(t)\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle\\
&\quad + \alpha\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\|^2 - \alpha\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_t^{(k)}\right\|^2. \qquad (6.16)
\end{aligned}$$
Note that by the queue updating rule (6.9), for any $t\ge 2$,
$$|Q_i(t)-Q_i(t-1)| \le \left|\sum_{k=1}^K\left\langle g_{i,t-2}^{(k)},\theta_{t-1}^{(k)}\right\rangle\right| \le \sum_{k=1}^K\left\|g_{i,t-2}^{(k)}\right\|_\infty\left\|\theta_{t-1}^{(k)}\right\|_1 \le K\Psi,$$
and for $t=1$, $Q_i(t)-Q_i(t-1)=0$ by the initial condition of the algorithm. Also, for any $\theta_*^{(k)}\in\Theta^{(k)}$,
$$\left|\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle\right| \le \sum_{k=1}^K\left\|g_{i,t-1}^{(k)}\right\|_\infty\left\|\theta_*^{(k)}\right\|_1 \le K\Psi.$$
Thus, we have
$$\sum_{i=1}^m Q_i(t)\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle \le \sum_{i=1}^m Q_i(t-1)\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle + mK^2\Psi^2.$$
Substituting this bound into (6.16) finishes the proof.

Objective bound

Theorem 6.4.1.
For any $\{\theta_*^{(k)}\}_{k=1}^K$ in the constraint set (6.12) and any $T\in\{1,2,3,\cdots\}$, the proposed algorithm achieves the following stationary state performance bound:
$$\frac1T\sum_{t=0}^{T-1}\mathbb E\left(\sum_{k=1}^K\left\langle f_t^{(k)},\theta_t^{(k)}\right\rangle\right) \le \frac1T\sum_{t=0}^{T-1}\mathbb E\left(\sum_{k=1}^K\left\langle f_t^{(k)},\theta_*^{(k)}\right\rangle\right) + \frac{2\alpha K}{TV} + \frac{mK^2\Psi^2}{T} + \frac{V\Psi^2}{4\alpha}\sum_{k=1}^K|\mathcal S^{(k)}||\mathcal A^{(k)}| + \frac{3mK^2\Psi^2}{2V}.$$
In particular, choosing $\alpha=T$ and $V=\sqrt T$ gives the $O(\sqrt T)$ regret bound
$$\frac1T\sum_{t=0}^{T-1}\mathbb E\left(\sum_{k=1}^K\left\langle f_t^{(k)},\theta_t^{(k)}\right\rangle\right) \le \frac1T\sum_{t=0}^{T-1}\mathbb E\left(\sum_{k=1}^K\left\langle f_t^{(k)},\theta_*^{(k)}\right\rangle\right) + \left(2K + \frac{\Psi^2}{4}\sum_{k=1}^K|\mathcal S^{(k)}||\mathcal A^{(k)}| + \frac52 mK^2\Psi^2\right)\frac{1}{\sqrt T}.$$
Proof. First, note that $\{g_{i,t-1}^{(k)}\}_{k=1}^K$ is i.i.d. and independent of all system history up to time $t-1$, and thus independent of $Q_i(t-1)$, $i=1,2,\cdots,m$. We have
$$\mathbb E\left(Q_i(t-1)\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle\right) = \mathbb E\left(Q_i(t-1)\right)\cdot\mathbb E\left(\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_*^{(k)}\right\rangle\right) \le 0, \qquad (6.17)$$
since $\{\theta_*^{(k)}\}_{k=1}^K$ is in the constraint set (6.12). Substituting $\theta_*^{(k)}$ into (6.15), taking expectations on both sides and using (6.17) gives
$$\begin{aligned}
&\mathbb E(\Delta(t)) + V\,\mathbb E\left(\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle\right) + \alpha\,\mathbb E\left(\sum_{k=1}^K\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2\right)\\
&\le \frac32 mK^2\Psi^2 + V\,\mathbb E\left(\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\rangle\right) + \alpha\,\mathbb E\left(\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\|^2\right) - \alpha\,\mathbb E\left(\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_t^{(k)}\right\|^2\right).
\end{aligned}$$
Note that for any $k$, completing the square gives
$$V\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \alpha\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2 = \left\|\sqrt\alpha\left(\theta_t^{(k)}-\theta_{t-1}^{(k)}\right) + \frac{V}{2\sqrt\alpha}f_{t-1}^{(k)}\right\|^2 - \frac{V^2\left\|f_{t-1}^{(k)}\right\|^2}{4\alpha} \ge -\frac{V^2\Psi^2|\mathcal S^{(k)}||\mathcal A^{(k)}|}{4\alpha}.$$
Substituting this inequality into the previous bound and rearranging the terms gives
$$\begin{aligned}
V\,\mathbb E\left(\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_{t-1}^{(k)}\right\rangle\right) &\le V\,\mathbb E\left(\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_*^{(k)}\right\rangle\right) - \mathbb E(\Delta(t)) + \frac{V^2\Psi^2\sum_{k=1}^K|\mathcal S^{(k)}||\mathcal A^{(k)}|}{4\alpha} + \frac32 mK^2\Psi^2\\
&\quad + \alpha\,\mathbb E\left(\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_{t-1}^{(k)}\right\|^2\right) - \alpha\,\mathbb E\left(\sum_{k=1}^K\left\|\theta_*^{(k)}-\theta_t^{(k)}\right\|^2\right).
\end{aligned}$$
Taking a telescoping sum from $t=1$ to $T$ and dividing both sides by $TV$ gives
$$\begin{aligned}
\frac1T\sum_{t=1}^T\mathbb E\left(\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_{t-1}^{(k)}\right\rangle\right) &\le \frac1T\sum_{t=1}^T\mathbb E\left(\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_*^{(k)}\right\rangle\right) + \frac{L(0)-\mathbb E(L(T+1))}{VT} + \frac{V\Psi^2\sum_{k=1}^K|\mathcal S^{(k)}||\mathcal A^{(k)}|}{4\alpha} + \frac{3mK^2\Psi^2}{2V}\\
&\quad + \frac{\alpha\,\mathbb E\left(\sum_{k=1}^K\|\theta_*^{(k)}-\theta_0^{(k)}\|^2\right) - \alpha\,\mathbb E\left(\sum_{k=1}^K\|\theta_*^{(k)}-\theta_T^{(k)}\|^2\right)}{VT}\\
&\le \frac1T\sum_{t=1}^T\mathbb E\left(\sum_{k=1}^K\left\langle f_{t-1}^{(k)},\theta_*^{(k)}\right\rangle\right) + \frac{V\Psi^2\sum_{k=1}^K|\mathcal S^{(k)}||\mathcal A^{(k)}|}{4\alpha} + \frac{3mK^2\Psi^2}{2V} + \frac{2\alpha K}{VT},
\end{aligned}$$
where we use the facts that $L(0)=0$ and, since the vectors are entrywise nonnegative with $\ell_1$ norm at most 1, $\|\theta_*^{(k)}-\theta_0^{(k)}\|^2 \le \|\theta_*^{(k)}\|^2+\|\theta_0^{(k)}\|^2 \le 2$.

A drift lemma and its implications
From Lemma 6.4.1, we know that in order to obtain the constraint violation bound, we need to control the size of the virtual queues $Q_i(T+1)$, $i=1,2,\cdots,m$. The following drift lemma serves as a cornerstone for this goal.

Lemma 6.4.5 (Lemma 5 of [YNW17]). Let $\{\Omega,\mathcal F,P\}$ be a probability space. Let $\{Z(t),\ t\ge 1\}$ be a discrete time stochastic process adapted to a filtration $\{\mathcal F_{t-1},\ t\ge 1\}$ with $Z(1)=0$ and $\mathcal F_0=\{\emptyset,\Omega\}$. Suppose there exist an integer $t_0>0$ and real constants $\lambda\in\mathbb R$, $\delta_{\max}>0$ and $0<\zeta\le\delta_{\max}$ such that
$$|Z(t+1)-Z(t)| \le \delta_{\max}, \qquad (6.18)$$
$$\mathbb E\left[Z(t+t_0)-Z(t)\,|\,\mathcal F_{t-1}\right] \le \begin{cases} t_0\delta_{\max}, & \text{if } Z(t)<\lambda,\\ -t_0\zeta, & \text{if } Z(t)\ge\lambda,\end{cases} \qquad (6.19)$$
hold for all $t\in\{1,2,\ldots\}$. Then, the following holds:
$$\mathbb E[Z(t)] \le \lambda + t_0\delta_{\max} + t_0\frac{4\delta_{\max}^2}{\zeta}\log\left[1+\frac{8\delta_{\max}^2}{\zeta^2}\right],\quad \forall t\in\{1,2,\ldots\}.$$

Note that a special case of the above drift lemma for $t_0=1$ dates back to the seminal paper of Hajek ([Haj82]) bounding the size of a random process with strongly negative drift. Since then, its power has been demonstrated in various scenarios ranging from steady-state queue bounds ([ES12]) to the feasibility analysis of stochastic optimization ([WN19]). The current generalization to a multi-step drift was first considered in [YNW17]. This lemma is useful in the current context due to the following lemma, whose proof can be found in Appendix 6.6.2.

Lemma 6.4.6.
Let $\mathcal F_t$, $t\ge 1$, be the system history functions up to time $t$, including $f_0^{(k)},\cdots,f_{t-1}^{(k)}$, $g_{i,0}^{(k)},\cdots,g_{i,t-1}^{(k)}$, $i=1,2,\cdots,m$, $k=1,2,\cdots,K$, and let $\mathcal F_0$ be a null set. Let $t_0$ be an arbitrary positive integer. Then, we have
$$\big|\,\|\mathbf Q(t+1)\| - \|\mathbf Q(t)\|\,\big| \le \sqrt m\, K\Psi,$$
$$\mathbb E\left[\|\mathbf Q(t+t_0)\| - \|\mathbf Q(t)\|\,\middle|\,\mathcal F_{t-1}\right] \le \begin{cases} t_0\sqrt m\, K\Psi, & \text{if } \|\mathbf Q(t)\| < \lambda,\\ -t_0\eta, & \text{if } \|\mathbf Q(t)\| \ge \lambda,\end{cases}$$
where
$$\lambda = \frac{VK\Psi + 3mK^2\Psi^2 + 4K\alpha + t_0(t_0-1)m\Psi^2 + 2mK^2\Psi^2}{\eta t_0} + \eta t_0.$$
Combining the previous two lemmas gives the virtual queue bound
$$\mathbb E\left(\|\mathbf Q(t)\|\right) \le \frac{VK\Psi + 3mK^2\Psi^2 + 4K\alpha + t_0(t_0-1)m\Psi^2 + 2mK^2\Psi^2}{\eta t_0} + \eta t_0 + t_0\sqrt m\, K\Psi + t_0\frac{4mK^2\Psi^2}{\eta}\log\left[1+\frac{8mK^2\Psi^2}{\eta^2}\right].$$
Choosing $t_0=\lceil\sqrt T\rceil$, $V=\sqrt T$ and $\alpha=T$ implies that
$$\mathbb E\left(\|\mathbf Q(t)\|\right) \le C(m,K,\Psi,\eta)\sqrt T, \qquad (6.20)$$
where
$$C(m,K,\Psi,\eta) = \frac{K\Psi + 3mK^2\Psi^2 + 4K + m\Psi^2 + 2mK^2\Psi^2}{\eta} + \eta + \sqrt m\, K\Psi + \frac{4mK^2\Psi^2}{\eta}\log\left[1+\frac{8mK^2\Psi^2}{\eta^2}\right].$$

The slow-update condition and constraint violation
In this section, we prove the slow-update property of the proposed algorithm, which not only implies the $O(\sqrt T)$ constraint violation bound, but also plays a key role in the Markov analysis.

Lemma 6.4.7.
The sequence of state-action vectors $\theta_t^{(k)}$, $t\in\{1,2,\cdots,T\}$, satisfies
$$\mathbb E\left(\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|\right) \le \frac{\sqrt{m|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi\,\mathbb E\left(\|\mathbf Q(t)\|\right)}{2\alpha} + \frac{\sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi V}{2\alpha}.$$
In particular, choosing $V=\sqrt T$ and $\alpha=T$ gives the slow-update condition
$$\mathbb E\left(\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|\right) \le \frac{\sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi + C\sqrt{m|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi}{2\sqrt T}, \qquad (6.21)$$
where $C=C(m,K,\Psi,\eta)$ is defined in (6.20).

Proof of Lemma 6.4.7. First, choosing $\theta_*^{(k)}=\theta_{t-1}^{(k)}$ in (6.14) gives
$$V\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle + \sum_{i=1}^m Q_i(t)\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}\right\rangle + \alpha\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2 \le \sum_{i=1}^m Q_i(t)\left\langle g_{i,t-1}^{(k)},\theta_{t-1}^{(k)}\right\rangle - \alpha\left\|\theta_{t-1}^{(k)}-\theta_t^{(k)}\right\|^2.$$
Rearranging the terms gives
$$\begin{aligned}
2\alpha\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|^2 &\le -V\left\langle f_{t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle - \sum_{i=1}^m Q_i(t)\left\langle g_{i,t-1}^{(k)},\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\rangle\\
&\le V\left\|f_{t-1}^{(k)}\right\|\cdot\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\| + \sum_{i=1}^m Q_i(t)\left\|g_{i,t-1}^{(k)}\right\|\cdot\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|\\
&\le V\left\|f_{t-1}^{(k)}\right\|\cdot\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\| + \|\mathbf Q(t)\|\sqrt{\sum_{i=1}^m\left\|g_{i,t-1}^{(k)}\right\|^2}\,\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|,
\end{aligned}$$
which implies
$$\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\| \le \frac{V\left\|f_{t-1}^{(k)}\right\| + \|\mathbf Q(t)\|\cdot\sqrt{\sum_{i=1}^m\left\|g_{i,t-1}^{(k)}\right\|^2}}{2\alpha}.$$
Applying the facts that $\|f_{t-1}^{(k)}\| \le \sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi$ and $\|g_{i,t-1}^{(k)}\| \le \sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi$ and taking expectations on both sides gives the first bound in the lemma. The second bound follows directly from the first bound by further substituting (6.20).
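The mechanism behind this slow-update bound can also be seen directly from the projection form of the update (Lemma 6.3.1): the Euclidean projection onto a convex set is nonexpansive, and $\theta_{t-1}^{(k)}\in\Theta^{(k)}$ is its own projection, so $\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\| \le \|w_t^{(k)}\|/(2\alpha)$. A minimal numerical check, using the probability simplex as a stand-in for $\Theta^{(k)}$ (the `proj_simplex` helper and all numbers are assumptions of this sketch):

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection onto the probability simplex (sort-based method)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(y) + 1) > 0)[0][-1]
    return np.maximum(y - css[rho] / (rho + 1), 0.0)

rng = np.random.default_rng(0)
n, alpha = 8, 50.0
theta_prev = rng.random(n); theta_prev /= theta_prev.sum()  # a point of the simplex
w = rng.standard_normal(n)           # stands in for w_t = V*f_{t-1} + sum_i Q_i(t)*g_{i,t-1}

theta_next = proj_simplex(theta_prev - w / (2 * alpha))
# Nonexpansiveness of the projection gives the per-slot movement bound.
assert np.linalg.norm(theta_next - theta_prev) <= np.linalg.norm(w) / (2 * alpha) + 1e-12
```

With $V=\sqrt T$, $\alpha=T$ and $\|\mathbf Q(t)\|=O(\sqrt T)$, the right-hand side is $O(1/\sqrt T)$, which is exactly (6.21).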
Theorem 6.4.2.
The proposed algorithm achieves the following stationary state constraint violation bound:
$$\frac1T\sum_{t=0}^{T-1}\mathbb E\left(\sum_{k=1}^K\left\langle g_{i,t}^{(k)},\theta_t^{(k)}\right\rangle\right) \le \frac{1}{\sqrt T}\left(C + \sum_{k=1}^K\sqrt{m|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi^2 C + \sum_{k=1}^K|\mathcal A^{(k)}||\mathcal S^{(k)}|\Psi^2\right),$$
where $C=C(m,K,\Psi,\eta)$ is defined in (6.20).

Proof. Taking expectations on both sides of Lemma 6.4.1 gives
$$\sum_{t=1}^T\mathbb E\left(\sum_{k=1}^K\left\langle g_{i,t-1}^{(k)},\theta_{t-1}^{(k)}\right\rangle\right) \le \mathbb E\left(Q_i(T+1)\right) + \Psi\sum_{t=1}^T\sum_{k=1}^K\sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\mathbb E\left(\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|\right).$$
Substituting the bounds (6.20) and (6.21) into the above inequality gives the desired result.
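The drift machinery behind these queue bounds can be visualized with a quick simulation of the special case $t_0=1$ of Lemma 6.4.5 (Hajek's bound): a nonnegative process with increments bounded by $\delta_{\max}$ and mean drift $-\zeta$ whenever it exceeds $\lambda$ stays concentrated around $\lambda$ instead of growing with the horizon. All parameter values below are illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
delta_max, zeta, lam, T = 1.0, 0.25, 5.0, 20000
Z = 0.0
samples = []
for t in range(T):
    # Increment bounded by delta_max; conditional mean drift is -zeta once Z >= lam.
    drift = -zeta if Z >= lam else 0.5
    Z = max(Z + drift + rng.uniform(-0.5, 0.5), 0.0)
    samples.append(Z)

# The time-averaged size of Z remains O(lambda + delta_max^2/zeta), uniformly in T.
avg_tail = np.mean(samples[T // 2:])
```

In the chapter's setting, $Z(t)=\|\mathbf Q(t)\|$ plays the role of this process and the Slater constant $\eta$ supplies the negative drift, which is how (6.20) keeps the queues at $O(\sqrt T)$.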
So far, we have shown that our algorithm achieves $O(\sqrt T)$ regret and constraint violation simultaneously for the stationary online linear program (6.11) with constraint set (6.12) in the imaginary system. In this section, we show how these stationary state results lead to a tight performance bound on the original true online MDP problem (6.1)-(6.2), compared to any joint randomized stationary algorithm starting from its stationary state.

Approximate mixing of MDPs
Let $\mathcal F_t$, $t\ge 1$, be the system history functions up to time $t$, including $f_0^{(k)},\cdots,f_{t-1}^{(k)}$, $g_{i,0}^{(k)},\cdots,g_{i,t-1}^{(k)}$, $i=1,2,\cdots,m$, $k=1,2,\cdots,K$, and let $\mathcal F_0$ be a null set. Let $d_{\pi_t^{(k)}}$ be the stationary state distribution of the $k$-th MDP under the randomized stationary policy $\pi_t^{(k)}$ in the proposed algorithm. Let $v_t^{(k)}$ be the true state distribution at time slot $t$ under the proposed algorithm given the function path $\mathcal F_T$ and starting state distribution $d_0^{(k)}$, i.e. for any $s\in\mathcal S^{(k)}$, $v_t^{(k)}(s) := Pr\big(s_t^{(k)}=s\,\big|\,\mathcal F_T\big)$ and $v_0^{(k)}=d_0^{(k)}$.

The following lemma provides a key estimate of the distance between the stationary distribution and the true distribution at each time slot $t$. It builds upon the slow-update condition (Lemma 6.4.7) of the proposed algorithm and the uniform mixing bound for general MDPs (Lemma 6.2.1).

Lemma 6.4.8.
Consider the proposed algorithm with $V=\sqrt T$ and $\alpha=T$. For any initial state distributions $\{d_0^{(k)}\}_{k=1}^K$ and any $t\in\{0,1,2,\cdots,T-1\}$, we have
$$\mathbb E\left(\left\|d_{\pi_t^{(k)}} - v_t^{(k)}\right\|_1\right) \le \frac{\tau r\left(|\mathcal A^{(k)}||\mathcal S^{(k)}|\Psi + C\sqrt m\,|\mathcal A^{(k)}||\mathcal S^{(k)}|\Psi\right)}{\sqrt T} + 2e^{-\frac{t}{\tau r}+1},$$
where $\tau$ and $r$ are the mixing parameters defined in Lemma 6.2.1 and $C$ is the absolute constant defined in (6.20).

Proof of Lemma 6.4.8. By Lemma 6.4.7 we know that for any $t\in\{1,2,\cdots,T\}$,
$$\mathbb E\left(\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|\right) \le \frac{\sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi + C\sqrt{m|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\Psi}{2\sqrt T}.$$
Thus, since $\|\cdot\|_1 \le \sqrt{|\mathcal A^{(k)}||\mathcal S^{(k)}|}\,\|\cdot\|_2$,
$$\mathbb E\left(\left\|\theta_t^{(k)}-\theta_{t-1}^{(k)}\right\|_1\right) \le \frac{|\mathcal A^{(k)}||\mathcal S^{(k)}|\Psi + C\sqrt m\,|\mathcal A^{(k)}||\mathcal S^{(k)}|\Psi}{2\sqrt T}.$$
Since for any $s\in S^{(k)}$,
\[
\big|d_{\pi^{(k)}_t}(s)-d_{\pi^{(k)}_{t-1}}(s)\big|
=\Big|\sum_{a\in A^{(k)}}\theta^{(k)}_t(a,s)-\theta^{(k)}_{t-1}(a,s)\Big|
\le \sum_{a\in A^{(k)}}\big|\theta^{(k)}_t(a,s)-\theta^{(k)}_{t-1}(a,s)\big|,
\]
it then follows that
\[
\mathbb{E}\Big(\big\|d_{\pi^{(k)}_t}-d_{\pi^{(k)}_{t-1}}\big\|_1\Big)
\le \mathbb{E}\Big(\big\|\theta^{(k)}_t-\theta^{(k)}_{t-1}\big\|_1\Big)
\le \frac{(1+C\sqrt{m})\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt{T}}. \tag{6.22}
\]
Now, we use the above relation to bound $\mathbb{E}\big(\|d_{\pi^{(k)}_t}-v^{(k)}_t\|_1\big)$ for any $t\ge r$:
\begin{align}
\mathbb{E}\Big(\big\|d_{\pi^{(k)}_t}-v^{(k)}_t\big\|_1\Big)
&\le \mathbb{E}\Big(\big\|d_{\pi^{(k)}_t}-d_{\pi^{(k)}_{t-1}}\big\|_1\Big)
 +\mathbb{E}\Big(\big\|d_{\pi^{(k)}_{t-1}}-v^{(k)}_t\big\|_1\Big)\nonumber\\
&\le \frac{(1+C\sqrt{m})\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt{T}}
 +\mathbb{E}\Big(\big\|d_{\pi^{(k)}_{t-1}}-v^{(k)}_t\big\|_1\Big)\nonumber\\
&=\frac{(1+C\sqrt{m})\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt{T}}
 +\mathbb{E}\Big(\big\|\big(d_{\pi^{(k)}_{t-1}}-v^{(k)}_{t-1}\big)P^{(k)}_{\pi^{(k)}_{t-1}}\big\|_1\Big), \tag{6.23}
\end{align}
where the second inequality follows from the slow-update condition (6.22), and the final equality follows from the fact that, given the function path $\mathcal{F}_T$,
\[
d_{\pi^{(k)}_{t-1}}-v^{(k)}_t=\big(d_{\pi^{(k)}_{t-1}}-v^{(k)}_{t-1}\big)P^{(k)}_{\pi^{(k)}_{t-1}}. \tag{6.24}
\]
To see this, note that from the proposed algorithm, the policy $\pi^{(k)}_t$ is determined by $\mathcal{F}_T$. Thus, by definition of the stationary distribution, given $\mathcal{F}_T$, we know that $d_{\pi^{(k)}_{t-1}}=d_{\pi^{(k)}_{t-1}}P^{(k)}_{\pi^{(k)}_{t-1}}$, and it is enough to show that, given $\mathcal{F}_T$, $v^{(k)}_t=v^{(k)}_{t-1}P^{(k)}_{\pi^{(k)}_{t-1}}$. First of all, the state distribution $v^{(k)}_t$ is determined by $v^{(k)}_{t-1}$, $\pi^{(k)}_{t-1}$, and the transition probability from $s_{t-1}$ to $s_t$, which are in turn determined by $\mathcal{F}_T$. Thus, given $\mathcal{F}_T$, for any $s\in S^{(k)}$,
\[
v^{(k)}_t(s)=\sum_{s'\in S^{(k)}}Pr\big(s_t=s\,\big|\,s_{t-1}=s',\,\mathcal{F}_T\big)\,v^{(k)}_{t-1}(s'),
\]
and
\begin{align*}
Pr\big(s_t=s\,\big|\,s_{t-1}=s',\mathcal{F}_T\big)
&=\sum_{a\in A^{(k)}}Pr\big(s_t=s\,\big|\,a_{t-1}=a,\,s_{t-1}=s',\,\mathcal{F}_T\big)\,Pr\big(a_{t-1}=a\,\big|\,s_{t-1}=s',\,\mathcal{F}_T\big)\\
&=\sum_{a\in A^{(k)}}P_a(s',s)\,Pr\big(a_{t-1}=a\,\big|\,s_{t-1}=s',\,\mathcal{F}_T\big)\\
&=\sum_{a\in A^{(k)}}P_a(s',s)\,\pi^{(k)}_{t-1}(a\,|\,s')
=P_{\pi^{(k)}_{t-1}}(s',s),
\end{align*}
where the second equality follows from Assumption 6.2.2, the third equality follows from the fact that $\pi^{(k)}_{t-1}$ is determined by $\mathcal{F}_T$, so that for any $t$,
\[
\pi^{(k)}_t(a\,|\,s)=Pr\big(a_t=a\,\big|\,s_t=s,\,\mathcal{F}_T\big),\quad\forall a\in A^{(k)},\ s\in S^{(k)},
\]
and the last equality follows from the definition of the transition probability (6.3).
This gives
\[
v^{(k)}_t(s)=\sum_{s'\in S^{(k)}}P_{\pi^{(k)}_{t-1}}(s',s)\,v^{(k)}_{t-1}(s'),
\]
and thus (6.24) holds. We can iteratively apply the procedure (6.23) $r$ times as follows:
\begin{align*}
\mathbb{E}\Big(\big\|d_{\pi^{(k)}_t}-v^{(k)}_t\big\|_1\Big)
\le{}&\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+\mathbb{E}\Big(\big\|\big(d_{\pi^{(k)}_{t-1}}-d_{\pi^{(k)}_{t-2}}\big)P^{(k)}_{\pi^{(k)}_{t-1}}\big\|_1\Big)
+\mathbb{E}\Big(\big\|\big(d_{\pi^{(k)}_{t-2}}-v^{(k)}_{t-1}\big)P^{(k)}_{\pi^{(k)}_{t-1}}\big\|_1\Big)\\
\le{}&2\cdot\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+\mathbb{E}\Big(\big\|\big(d_{\pi^{(k)}_{t-2}}-v^{(k)}_{t-1}\big)P^{(k)}_{\pi^{(k)}_{t-1}}\big\|_1\Big)\\
={}&2\cdot\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+\mathbb{E}\Big(\big\|\big(d_{\pi^{(k)}_{t-2}}-v^{(k)}_{t-2}\big)P^{(k)}_{\pi^{(k)}_{t-2}}P^{(k)}_{\pi^{(k)}_{t-1}}\big\|_1\Big)\\
\le{}&\cdots\le r\cdot\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+\mathbb{E}\Big(\big\|\big(d_{\pi^{(k)}_{t-r}}-v^{(k)}_{t-r}\big)P^{(k)}_{\pi^{(k)}_{t-r}}\cdots P^{(k)}_{\pi^{(k)}_{t-1}}\big\|_1\Big),
\end{align*}
where the second inequality follows from the nonexpansive property in $\ell_1$ norm of the stochastic matrix $P^{(k)}_{\pi^{(k)}_{t-1}}$, namely
\[
\Big\|\big(d_{\pi^{(k)}_{t-1}}-d_{\pi^{(k)}_{t-2}}\big)P^{(k)}_{\pi^{(k)}_{t-1}}\Big\|_1
\le \Big\|d_{\pi^{(k)}_{t-1}}-d_{\pi^{(k)}_{t-2}}\Big\|_1,
\]
and then using the slow-update condition (6.22) again. By Lemma 6.2.1, we have
\[
\mathbb{E}\Big(\big\|d_{\pi^{(k)}_t}-v^{(k)}_t\big\|_1\Big)
\le r\cdot\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+e^{-1/\tau}\,\mathbb{E}\Big(\big\|d_{\pi^{(k)}_{t-r}}-v^{(k)}_{t-r}\big\|_1\Big).
\]
Iterating this inequality down to $t=0$ gives
\begin{align*}
\mathbb{E}\Big(\big\|d_{\pi^{(k)}_t}-v^{(k)}_t\big\|_1\Big)
&\le \sum_{j=0}^{\lfloor t/r\rfloor}e^{-j/\tau}\cdot r\cdot\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+\mathbb{E}\Big(\big\|d_{\pi^{(k)}_0}-v^{(k)}_0\big\|_1\Big)e^{-\lfloor t/r\rfloor/\tau}\\
&\le \sum_{j=0}^{\lfloor t/r\rfloor}e^{-j/\tau}\cdot r\cdot\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+2e^{-\lfloor t/r\rfloor/\tau}\\
&\le \Big(1+\int_0^{\infty}e^{-x/\tau}\,dx\Big)\cdot r\cdot\frac{(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{2\sqrt T}
+2e^{1-\frac{t}{r\tau}}\\
&\le \frac{\tau r(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{\sqrt T}
+2e^{1-\frac{t}{r\tau}},
\end{align*}
where the third inequality uses $e^{-\lfloor t/r\rfloor/\tau}\le e^{1-\frac{t}{r\tau}}$ and the last uses $1+\tau\le 2\tau$ (which holds whenever $\tau\ge1$; we can always enlarge $\tau$ so that this is the case), finishing the proof.

Benchmarking against policies starting from stationary state
Combining the results derived so far, we have the following regret bound against any randomized stationary policy $\Pi$ that starts from its stationary state distribution $d_\Pi$, i.e., such that $(d_\Pi,\Pi)$ lies in the constraint set $\mathcal G$ defined in (6.2).

Theorem 6.4.3.
Let $\mathcal P$ be the sequence of randomized stationary policies resulting from the proposed algorithm with $V=\sqrt T$ and $\alpha=T$. Let $d_0$ be the starting state distribution of the proposed algorithm. For any randomized stationary policy $\Pi$ starting from its stationary state distribution $d_\Pi$ such that $(d_\Pi,\Pi)\in\mathcal G$, we have
\begin{align*}
F_T(d_0,\mathcal P)-F_T(d_\Pi,\Pi)&\le O\bigg(m^{3/2}K\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T\bigg),\\
G_{i,T}(d_0,\mathcal P)&\le O\bigg(m^{3/2}K\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T\bigg),\quad i=1,2,\cdots,m.
\end{align*}

Proof of Theorem 6.4.3.
First of all, by Lemma 6.2.2, for any randomized stationary policy $\Pi$, there exist stationary state-action probability vectors $\{\theta^{(k)}_*\}_{k=1}^K$ such that $\theta^{(k)}_*\in\Theta^{(k)}$,
\[
F_T(d_\Pi,\Pi)=\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle \mathbb{E}(f_t),\,\theta^{(k)}_*\Big\rangle,
\]
and $G_{i,T}(d_\Pi,\Pi)=\sum_{t=0}^{T-1}\sum_{k=1}^K\big\langle \mathbb{E}(g_{i,t}),\theta^{(k)}_*\big\rangle$. As a consequence, $(d_\Pi,\Pi)\in\mathcal G$ implies
\[
G_{i,T}(d_\Pi,\Pi)=\sum_{t=0}^{T-1}\sum_{k=1}^K\big\langle \mathbb{E}(g_{i,t}),\theta^{(k)}_*\big\rangle\le 0,\quad\forall i\in\{1,2,\cdots,m\},
\]
and it follows that $\{\theta^{(k)}_*\}_{k=1}^K$ is in the imaginary constraint set defined in (6.12). Thus, we are in a good shape to apply Theorem 6.4.1 for the imaginary system. We then split $F_T(d_0,\mathcal P)-F_T(d_\Pi,\Pi)$ into two terms:
\begin{align*}
F_T(d_0,\mathcal P)-F_T(d_\Pi,\Pi)\le{}&\underbrace{\Bigg|\mathbb{E}\Bigg(\sum_{t=0}^{T-1}\sum_{k=1}^K f^{(k)}_t\big(a^{(k)}_t,s^{(k)}_t\big)\,\Bigg|\,d_0,\mathcal P\Bigg)-\sum_{t=0}^{T-1}\sum_{k=1}^K\mathbb{E}\Big(\big\langle f^{(k)}_t,\theta^{(k)}_t\big\rangle\Big)\Bigg|}_{\textrm{(I)}}\\
&+\underbrace{\sum_{t=0}^{T-1}\sum_{k=1}^K\Big(\mathbb{E}\Big(\big\langle f^{(k)}_t,\theta^{(k)}_t\big\rangle\Big)-\big\langle \mathbb{E}(f_t),\theta^{(k)}_*\big\rangle\Big)}_{\textrm{(II)}}.
\end{align*}
By Theorem 6.4.1, we get
\[
\textrm{(II)}\le\bigg(K+\Psi\sum_{k=1}^K\big|S^{(k)}\big|\big|A^{(k)}\big|+52\,mK\Psi\bigg)\sqrt T. \tag{6.25}
\]
We then bound (I). Consider each time slot $t\in\{0,1,\cdots,T-1\}$. We have
\begin{align*}
\mathbb{E}\Big(\big\langle f^{(k)}_t,\theta^{(k)}_t\big\rangle\Big)&=\sum_{s\in S^{(k)}}\sum_{a\in A^{(k)}}\mathbb{E}\Big(d_{\pi^{(k)}_t}(s)\,\pi^{(k)}_t(a|s)\,f^{(k)}_t(a,s)\Big),\\
\mathbb{E}\Big(f^{(k)}_t\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d_0,\mathcal P\Big)&=\sum_{s\in S^{(k)}}\sum_{a\in A^{(k)}}\mathbb{E}\Big(v^{(k)}_t(s)\,\pi^{(k)}_t(a|s)\,f^{(k)}_t(a,s)\Big),
\end{align*}
where the first equality follows from the definition of $\theta^{(k)}_t$ and the second equality follows from the following: given a specific function path $\mathcal F_T$, the policy $\pi^{(k)}_t$ and the true state distribution $v^{(k)}_t$ are fixed.
Thus, we have
\[
\mathbb{E}\Big(f^{(k)}_t\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d_0,\mathcal P,\mathcal F_T\Big)=\sum_{s\in S^{(k)}}\sum_{a\in A^{(k)}}v^{(k)}_t(s)\,\pi^{(k)}_t(a|s)\,f^{(k)}_t(a,s).
\]
It follows that
\begin{align*}
\Big|\mathbb{E}\Big(f^{(k)}_t\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d_0,\mathcal P\Big)-\mathbb{E}\Big(\big\langle f^{(k)}_t,\theta^{(k)}_t\big\rangle\Big)\Big|
&\le\Bigg|\sum_{s\in S^{(k)}}\sum_{a\in A^{(k)}}\mathbb{E}\Big(\big(v^{(k)}_t(s)-d_{\pi^{(k)}_t}(s)\big)\pi^{(k)}_t(a|s)\Big)\Bigg|\,\Psi\\
&\le\mathbb{E}\Big(\big\|v^{(k)}_t-d_{\pi^{(k)}_t}\big\|_1\Big)\,\Psi
\le\frac{\tau r(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi}{\sqrt T}\,\Psi+2e^{1-\frac{t}{r\tau}}\,\Psi,
\end{align*}
where the last inequality follows from Lemma 6.4.8. Thus, it follows that
\begin{align}
\textrm{(I)}&\le\sum_{t=0}^{T-1}\sum_{k=1}^K\bigg(\frac{\tau r(1+C\sqrt m)\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi^2}{\sqrt T}+2e^{1-\frac{t}{r\tau}}\,\Psi\bigg)\nonumber\\
&\le\tau r\Psi^2(1+C\sqrt m)\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T+2\Psi K\int_0^{\infty}e^{1-\frac{x}{r\tau}}\,dx\nonumber\\
&\le\tau r\Psi^2(1+C\sqrt m)\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T+2e\Psi K\tau r. \tag{6.26}
\end{align}
Overall, combining (6.25) and (6.26) and substituting the constant $C=C(m,K,\Psi,\eta)$ defined in (6.20) gives the objective regret bound.

For the constraint violation, we have
\begin{align*}
G_{i,T}(d_0,\mathcal P)={}&\underbrace{\mathbb{E}\Bigg(\sum_{t=0}^{T-1}\sum_{k=1}^K g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Bigg|\,d_0,\mathcal P\Bigg)-\sum_{t=0}^{T-1}\sum_{k=1}^K\mathbb{E}\Big(\big\langle g^{(k)}_{i,t},\theta^{(k)}_t\big\rangle\Big)}_{\textrm{(IV)}}\\
&+\underbrace{\sum_{t=0}^{T-1}\sum_{k=1}^K\mathbb{E}\Big(\big\langle g^{(k)}_{i,t},\theta^{(k)}_t\big\rangle\Big)}_{\textrm{(V)}}.
\end{align*}
The term (V) can be readily bounded using Theorem 6.4.2 as
\[
\textrm{(V)}=\sum_{t=0}^{T-1}\mathbb{E}\Bigg(\sum_{k=1}^K\Big\langle g^{(k)}_{i,t},\theta^{(k)}_t\Big\rangle\Bigg)
\le\Bigg(C+\sum_{k=1}^K\sqrt{m\big|A^{(k)}\big|\big|S^{(k)}\big|}\,\Psi C+\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\Psi\Bigg)\sqrt T.
\]
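The bound on (V) ultimately rests on the virtual-queue mechanism of the drift-plus-penalty method used in this chapter. A minimal numeric sketch of why a bound on the final queue size controls the cumulative constraint value (the update rule below is the generic $\max(\cdot,0)$ virtual-queue form, and the random per-slot values are hypothetical, not the algorithm's actual iterates):

```python
import random

def run_virtual_queue(T, seed=0):
    """Simulate one virtual queue Q(t+1) = max(Q(t) + g_t, 0) driven by random
    per-slot constraint values g_t; return (sum of the g_t, final queue size)."""
    rng = random.Random(seed)
    Q, total_g = 0.0, 0.0
    for _ in range(T):
        g = rng.uniform(-1.0, 1.0)  # hypothetical per-slot constraint value
        total_g += g
        Q = max(Q + g, 0.0)         # virtual-queue update
    return total_g, Q

# Since Q(t+1) >= Q(t) + g_t and Q(0) = 0, summing over t gives
# sum_t g_t <= Q(T): cumulative violation is dominated by the queue size.
total_g, Q_final = run_virtual_queue(10000)
assert total_g <= Q_final + 1e-9
```

Bounding the expected queue size by $O(\sqrt T)$, as Theorem 6.4.2 does, therefore yields an $O(\sqrt T)$ constraint-violation bound.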
For the term (IV), we have
\begin{align*}
\mathbb{E}\Big(\big\langle g^{(k)}_{i,t},\theta^{(k)}_t\big\rangle\Big)&=\sum_{s\in S^{(k)}}\sum_{a\in A^{(k)}}\mathbb{E}\Big(d_{\pi^{(k)}_t}(s)\,\pi^{(k)}_t(a|s)\,g^{(k)}_{i,t}(a,s)\Big),\\
\mathbb{E}\Big(g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d_0,\mathcal P\Big)&=\sum_{s\in S^{(k)}}\sum_{a\in A^{(k)}}\mathbb{E}\Big(v^{(k)}_t(s)\,\pi^{(k)}_t(a|s)\,g^{(k)}_{i,t}(a,s)\Big),
\end{align*}
where the first equality follows from the definition of $\theta^{(k)}_t$ and the second equality follows from the following: given a specific function path $\mathcal F_T$, the policy $\pi^{(k)}_t$ and the true state distribution $v^{(k)}_t$ are fixed. Thus, we have
\[
\mathbb{E}\Big(g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d_0,\mathcal P,\mathcal F_T\Big)=\sum_{s\in S^{(k)}}\sum_{a\in A^{(k)}}v^{(k)}_t(s)\,\pi^{(k)}_t(a|s)\,g^{(k)}_{i,t}(a,s).
\]
Taking the full expectation over the function path gives the result. Then, repeating the same proof as that of (6.26) gives
\[
\textrm{(IV)}\le\tau r\Psi^2(1+C\sqrt m)\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T+2e\Psi K\tau r.
\]
This finishes the proof of the constraint violation bound.
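The proof above repeatedly passes between a randomized stationary policy $\pi$, its stationary state distribution $d_\pi$, and the stationary state-action probabilities $\theta(a,s)=d_\pi(s)\pi(a|s)$. A quick numeric sanity check of this correspondence on a toy two-state, two-action MDP (all transition numbers below are made up for illustration):

```python
import numpy as np

# Toy 2-state, 2-action MDP: P[a][s, s'] (hypothetical numbers).
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.7, 0.3]])}
pi = np.array([[0.3, 0.7], [0.6, 0.4]])  # pi[s, a] = Pr(action a | state s)

# Chain induced by pi: P_pi(s, s') = sum_a pi(a|s) P_a(s, s').
P_pi = sum(pi[:, a][:, None] * P[a] for a in (0, 1))

# Stationary distribution d_pi via power iteration (the chain is ergodic).
d_pi = np.array([0.5, 0.5])
for _ in range(500):
    d_pi = d_pi @ P_pi

# Stationary state-action probabilities theta(a, s) = d_pi(s) * pi(a|s).
theta = d_pi[:, None] * pi  # indexed as theta[s, a]

# Balance property defining the set Theta: for every s',
# sum_a theta(a, s') = sum_s sum_a theta(a, s) P_a(s, s').
lhs = theta.sum(axis=1)                        # marginal over actions, i.e. d_pi
rhs = sum(theta[:, a] @ P[a] for a in (0, 1))  # one-step push-forward
assert np.allclose(lhs, rhs) and abs(theta.sum() - 1.0) < 1e-9
```

The balance check is exactly the stationarity condition verified abstractly in the proof of Lemma 6.2.2 below in the appendix.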
Recall that Theorem 6.4.3 compares the proposed algorithm with any randomized stationary policy $\Pi$ starting from its stationary state distribution $d_\Pi$, so that $(d_\Pi,\Pi)\in\mathcal G$. In this section, we generalize Theorem 6.4.3 and obtain a bound on the regret against all $(d,\Pi)\in\mathcal G$, where $d$ is an arbitrary starting state distribution (not necessarily the stationary one). The main technical difficulty in such a generalization is as follows. For any randomized stationary policy $\Pi$ such that $(d,\Pi)\in\mathcal G$, let $\{\theta^{(k)}_*\}_{k=1}^K$ be the stationary state-action probabilities such that $\theta^{(k)}_*\in\Theta^{(k)}$ and $G_{i,T}(d_\Pi,\Pi)=\sum_{t=0}^{T-1}\sum_{k=1}^K\langle\mathbb{E}(g_{i,t}),\theta^{(k)}_*\rangle$. For a finite horizon $T$, there might exist some "low-cost" starting state distribution $d$ such that $G_{i,T}(d,\Pi)<G_{i,T}(d_\Pi,\Pi)$ for some $i\in\{1,2,\cdots,m\}$. As a consequence, one could have
\[
G_{i,T}(d,\Pi)\le 0\quad\text{and}\quad\sum_{t=0}^{T-1}\sum_{k=1}^K\big\langle\mathbb{E}(g_{i,t}),\theta^{(k)}_*\big\rangle>0.
\]
That is, although $(d,\Pi)$ is feasible for our true system, its stationary state-action probabilities $\{\theta^{(k)}_*\}_{k=1}^K$ can be infeasible with respect to the imaginary constraint set (6.12), and all our analysis so far fails to cover such randomized stationary policies. To resolve this issue, we have to "enlarge" the imaginary constraint set (6.12) so as to cover all state-action probabilities $\{\theta^{(k)}_*\}_{k=1}^K$ arising from any randomized stationary policy $\Pi$ such that $(d,\Pi)\in\mathcal G$. But a perturbation of the constraint set results in a perturbation of the objective in the imaginary system as well. Our main goal in this section is to bound this perturbation and show that the perturbation bound leads to the final $O(\sqrt T)$ regret bound.

A relaxed constraint set
We begin with a supporting lemma on the uniform mixing time bound over all joint randomized stationary policies. The proof is given in Appendix 6.6.3.
Lemma 6.5.1.
Consider any randomized stationary policy $\Pi$ in (6.2) with arbitrary starting state distribution $d$ on $S^{(1)}\times\cdots\times S^{(K)}$. Let $P_\Pi$ be the corresponding transition matrix on the product state space. Then, the following holds:
\[
\Big\|\big(d-d_\Pi\big)\big(P_\Pi\big)^t\Big\|_1\le 2e^{(r-t)/r},\quad\forall t\in\{0,1,2,\cdots\}, \tag{6.27}
\]
where $r$ is a fixed positive constant independent of $\Pi$.

The following lemma shows that a relaxation of $O(1/T)$ on the imaginary constraint set (6.12) is enough to cover all the $\{\theta^{(k)}_*\}_{k=1}^K$ discussed at the beginning of this section. The proof is given in Appendix 6.6.3.

Lemma 6.5.2.
For any $T\in\{1,2,\cdots\}$ and any randomized stationary policy $\Pi$ in (6.2), with arbitrary starting state distribution $d$ on $S^{(1)}\times\cdots\times S^{(K)}$ and stationary state-action probabilities $\{\theta^{(k)}_*\}_{k=1}^K$,
\begin{align}
\sum_{t=0}^{T-1}\Bigg|\mathbb{E}\Bigg(\sum_{k=1}^K f^{(k)}_t\big(a^{(k)}_t,s^{(k)}_t\big)\,\Bigg|\,d,\Pi\Bigg)-\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}_*\Big\rangle\Bigg|&\le C_1K\Psi, \tag{6.28}\\
\sum_{t=0}^{T-1}\Bigg|\mathbb{E}\Bigg(\sum_{k=1}^K g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Bigg|\,d,\Pi\Bigg)-\sum_{k=1}^K\Big\langle\mathbb{E}\big(g^{(k)}_{i,t}\big),\theta^{(k)}_*\Big\rangle\Bigg|&\le C_1K\Psi, \tag{6.29}
\end{align}
where $C_1$ is an absolute constant. In particular, $\{\theta^{(k)}_*\}_{k=1}^K$ is contained in the following relaxed constraint set:
\[
\mathcal G^+:=\bigg\{\theta^{(k)}\in\Theta^{(k)},\ k=1,2,\cdots,K:\ \sum_{k=1}^K\Big\langle\mathbb{E}\big(g^{(k)}_{i,t}\big),\theta^{(k)}\Big\rangle\le\frac{C_1K\Psi}{T},\ i=1,2,\cdots,m\bigg\}.
\]

Best stationary performance over the relaxed constraint set
Recall that the best stationary performance in hindsight over all randomized stationary policies in the constraint set $\mathcal G$ is the minimum of the following linear program:
\begin{align}
\min_{\theta^{(k)}\in\Theta^{(k)}}\ {}&\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}\Big\rangle \tag{6.30}\\
\text{s.t.}\ {}&\sum_{k=1}^K\Big\langle\mathbb{E}\big(g^{(k)}_{i,t}\big),\theta^{(k)}\Big\rangle\le 0,\quad i=1,2,\cdots,m. \tag{6.31}
\end{align}
On the other hand, if we consider all the randomized stationary policies contained in the original constraint set (6.2), then, by Lemma 6.5.2, the relaxed constraint set $\mathcal G^+$ contains all such policies, and the best stationary performance over this relaxed set is the minimum of the following perturbed linear program:
\begin{align}
\min_{\theta^{(k)}\in\Theta^{(k)}}\ {}&\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}\Big\rangle \tag{6.32}\\
\text{s.t.}\ {}&\sum_{k=1}^K\Big\langle\mathbb{E}\big(g^{(k)}_{i,t}\big),\theta^{(k)}\Big\rangle\le\frac{C_1K\Psi}{T},\quad i=1,2,\cdots,m. \tag{6.33}
\end{align}
We aim to show that the minimum achieved by (6.32)-(6.33) is not far from that of (6.30)-(6.31). In general, such a conclusion is not true, due to the potential unboundedness of the Lagrange multipliers in constrained optimization. However, since Slater's condition holds in our case, the perturbation can be bounded via the following well-known Farkas' lemma ([Ber09b]):

Lemma 6.5.3 (Farkas' Lemma). Consider a convex program with objective $f(x)$ and constraint functions $g_i(x)$, $i=1,2,\cdots,m$:
\begin{align}
\min\ {}&f(x), \tag{6.34}\\
\text{s.t.}\ {}&g_i(x)\le b_i,\quad i=1,2,\cdots,m, \tag{6.35}\\
&x\in\mathcal X, \tag{6.36}
\end{align}
for some convex set $\mathcal X\subseteq\mathbb R^n$. Let $x^*$ be one of the solutions to the above convex program. Suppose there exists $\tilde x\in\mathcal X$ such that $g_i(\tilde x)<b_i$, $\forall i\in\{1,2,\cdots,m\}$. Then, there exists a separating hyperplane parametrized by $(1,\mu_1,\mu_2,\cdots,\mu_m)$ such that $\mu_i\ge 0$ and
\[
f(x)+\sum_{i=1}^m\mu_i g_i(x)\ge f(x^*)+\sum_{i=1}^m\mu_i b_i,\quad\forall x\in\mathcal X.
\]
The parameter $\mu=(\mu_1,\mu_2,\cdots,\mu_m)$ is usually referred to as a Lagrange multiplier.
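Farkas' Lemma is the device that converts a constraint relaxation of size $\varepsilon$ into an objective perturbation of size at most $\|\mu\|\varepsilon$. A hedged one-dimensional illustration (the toy program, its closed-form solution, and the multiplier value below are chosen by hand for this example, not taken from the chapter):

```python
def solve(b):
    """Minimize f(x) = -x over x in [0, 1] subject to g(x) = x - 0.6 <= b.
    For b in [-0.6, 0.4] the optimum is x* = 0.6 + b, so f*(b) = -(0.6 + b)."""
    return -min(1.0, 0.6 + b)

mu = 1.0            # Lagrange multiplier of the constraint at level b = 0
f_star = solve(0.0)

# Separating-hyperplane inequality: f(x) + mu*g(x) >= f(x*) + mu*b for all x in X.
assert all(-x + mu * (x - 0.6) >= f_star - 1e-12
           for x in [i / 100 for i in range(101)])

# Perturbation bound: relaxing the constraint level from 0 to eps can improve
# the optimal objective by at most mu * eps (tight in this toy case).
for eps in (0.01, 0.05, 0.1):
    assert solve(eps) >= f_star - mu * eps - 1e-12
```

This is the mechanism used below: the relaxation level is the $O(1/T)$ slack of the relaxed set, and the multiplier is bounded via Slater's condition.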
From the geometric perspective, Farkas' Lemma states that if Slater's condition holds, then there exists a non-vertical separating hyperplane supported at $\big(f(x^*),b_1,\cdots,b_m\big)$ that contains the set $\big\{\big(f(x),g_1(x),\cdots,g_m(x)\big),\ x\in\mathcal X\big\}$ on one side. Thus, in order to bound the perturbation of the objective with respect to the perturbation of the constraint level, we need to bound the slope of the supporting hyperplane from above, which boils down to controlling the magnitude of the Lagrange multiplier. This is summarized in the following lemma:

Lemma 6.5.4 (Lemma 1 of [NO09]). Consider the convex program (6.34)-(6.36), and define the Lagrange dual function
\[
q(\mu)=\inf_{x\in\mathcal X}\bigg\{f(x)+\sum_{i=1}^m\mu_i\big(g_i(x)-b_i\big)\bigg\}.
\]
Suppose there exists $\tilde x\in\mathcal X$ such that $g_i(\tilde x)-b_i\le-\eta$, $\forall i\in\{1,2,\cdots,m\}$, for some positive constant $\eta>0$. Then, the level set $\mathcal V_{\bar\mu}=\{\mu_1,\mu_2,\cdots,\mu_m\ge0,\ q(\mu)\ge q(\bar\mu)\}$ is bounded for any nonnegative $\bar\mu$. Furthermore, we have
\[
\max_{\mu\in\mathcal V_{\bar\mu}}\|\mu\|\le\frac{1}{\min_{1\le i\le m}\{b_i-g_i(\tilde x)\}}\big(f(\tilde x)-q(\bar\mu)\big).
\]
The technical importance of these two lemmas in the current context is contained in the following corollary.
Corollary 6.5.1.
Let $\{\theta^{(k)}_*\}_{k=1}^K$ and $\{\bar\theta^{(k)}_*\}_{k=1}^K$ be solutions to (6.30)-(6.31) and (6.32)-(6.33), respectively. Then, the following holds:
\[
\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\bar\theta^{(k)}_*\Big\rangle\ge\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}_*\Big\rangle-\frac{2C_1\sqrt m\,K^2\Psi^2}{\eta T},
\]
where $\eta$ is the constant defined in Assumption 6.2.3.

Proof of Corollary 6.5.1. Take
\begin{align*}
f\big(\theta^{(1)},\cdots,\theta^{(K)}\big)&=\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}\Big\rangle,\\
g_i\big(\theta^{(1)},\cdots,\theta^{(K)}\big)&=\sum_{k=1}^K\Big\langle\mathbb{E}\big(g^{(k)}_{i,t}\big),\theta^{(k)}\Big\rangle,\\
\mathcal X&=\Theta^{(1)}\times\Theta^{(2)}\times\cdots\times\Theta^{(K)},
\end{align*}
and $b_i=0$ in Farkas' Lemma, and we have the following display:
\[
\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}\Big\rangle+\sum_{i=1}^m\mu_i\sum_{k=1}^K\Big\langle\mathbb{E}\big(g^{(k)}_{i,t}\big),\theta^{(k)}\Big\rangle\ge\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}_*\Big\rangle,
\]
for any $\big(\theta^{(1)},\cdots,\theta^{(K)}\big)\in\mathcal X$ and some $\mu_1,\mu_2,\cdots,\mu_m\ge0$. In particular, substituting $\big(\bar\theta^{(1)}_*,\cdots,\bar\theta^{(K)}_*\big)$ into the above display gives
\begin{align}
\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\bar\theta^{(k)}_*\Big\rangle&\ge\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}_*\Big\rangle-\sum_{i=1}^m\mu_i\sum_{k=1}^K\Big\langle\mathbb{E}\big(g^{(k)}_{i,t}\big),\bar\theta^{(k)}_*\Big\rangle\nonumber\\
&\ge\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}_*\Big\rangle-\frac{C_1K\Psi}{T}\sum_{i=1}^m\mu_i, \tag{6.37}
\end{align}
where the final inequality follows from the fact that $\big(\bar\theta^{(1)}_*,\cdots,\bar\theta^{(K)}_*\big)$ satisfies the relaxed constraint $\sum_{k=1}^K\langle\mathbb{E}(g^{(k)}_{i,t}),\bar\theta^{(k)}_*\rangle\le C_1K\Psi/T$ and $\mu_i\ge0$, $\forall i\in\{1,2,\cdots,m\}$. Now we need to bound the magnitude of the Lagrange multiplier $(\mu_1,\cdots,\mu_m)$. Note that in our scenario,
\[
\Big|f\big(\theta^{(1)},\cdots,\theta^{(K)}\big)\Big|=\Bigg|\frac1T\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}\Big\rangle\Bigg|\le\Psi K,
\]
and $\mu$ is the solution to the maximization problem
\[
\max_{\mu_i\ge0,\ i\in\{1,2,\cdots,m\}}q(\mu),
\]
where $q(\mu)$ is the dual function defined in Lemma 6.5.4. Thus, it must be in any superlevel set $\mathcal V_{\bar\mu}=\{\mu_1,\mu_2,\cdots,\mu_m\ge0,\ q(\mu)\ge q(\bar\mu)\}$. In particular, taking $\bar\mu=0$ in Lemma 6.5.4 and using Slater's condition (6.8), there exist $\tilde\theta^{(1)},\cdots,\tilde\theta^{(K)}$ such that
\[
\sum_{i=1}^m\mu_i\le\sqrt m\,\|\mu\|\le\frac{\sqrt m}{\eta}\bigg(f\big(\tilde\theta^{(1)},\cdots,\tilde\theta^{(K)}\big)-\inf_{(\theta^{(1)},\cdots,\theta^{(K)})\in\mathcal X}f\big(\theta^{(1)},\cdots,\theta^{(K)}\big)\bigg)\le\frac{2\sqrt m\,\Psi K}{\eta},
\]
where the final inequality follows from the deterministic bound of $|f(\theta^{(1)},\cdots,\theta^{(K)})|$ by $\Psi K$. Substituting this bound into (6.37) gives the desired result.

As a simple consequence of the above corollary, we have our final bound on the regret and constraint violation regarding any $(d_0,\Pi)\in\mathcal G$.

Theorem 6.5.1.
Let $\mathcal P$ be the sequence of randomized stationary policies resulting from the proposed algorithm with $V=\sqrt T$ and $\alpha=T$. Let $d_0$ be the starting state distribution of the proposed algorithm. For any randomized stationary policy $\Pi$ starting from the same distribution $d_0$ such that $(d_0,\Pi)\in\mathcal G$, we have
\begin{align*}
F_T(d_0,\mathcal P)-F_T(d_0,\Pi)&\le O\bigg(m^{3/2}K\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T\bigg),\\
G_{i,T}(d_0,\mathcal P)&\le O\bigg(m^{3/2}K\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T\bigg),\quad i=1,2,\cdots,m.
\end{align*}

Proof.
Let $\Pi^*$ be the randomized stationary policy corresponding to the solution $\{\theta^{(k)}_*\}_{k=1}^K$ of (6.30)-(6.31), and let $\Pi$ be any randomized stationary policy such that $(d_0,\Pi)\in\mathcal G$. Since $G_{i,T}(d_{\Pi^*},\Pi^*)=\sum_{t=0}^{T-1}\sum_{k=1}^K\langle\mathbb{E}(g_{i,t}),\theta^{(k)}_*\rangle\le0$, it follows that $(d_{\Pi^*},\Pi^*)\in\mathcal G$. By Theorem 6.4.3, we know that
\[
F_T(d_0,\mathcal P)-F_T(d_{\Pi^*},\Pi^*)\le O\bigg(m^{3/2}K\sum_{k=1}^K\big|A^{(k)}\big|\big|S^{(k)}\big|\cdot\sqrt T\bigg),
\]
and $G_{i,T}(d_0,\mathcal P)$ satisfies the bound in the statement. It is then enough to bound $F_T(d_{\Pi^*},\Pi^*)-F_T(d_0,\Pi)$. We split it into two terms:
\[
F_T(d_{\Pi^*},\Pi^*)-F_T(d_0,\Pi)\le\underbrace{F_T(d_{\Pi^*},\Pi^*)-F_T(d_\Pi,\Pi)}_{\textrm{(I)}}+\underbrace{F_T(d_\Pi,\Pi)-F_T(d_0,\Pi)}_{\textrm{(II)}}.
\]
By (6.28) in Lemma 6.5.2, the term (II) is bounded by $C_1K\Psi$. It remains to bound the first term. Since $(d_0,\Pi)\in\mathcal G$, by Lemma 6.5.2, the corresponding state-action probabilities $\{\theta^{(k)}\}_{k=1}^K$ of $\Pi$ satisfy $\sum_{k=1}^K\langle\mathbb{E}(g_{i,t}),\theta^{(k)}\rangle\le C_1K\Psi/T$, i.e., $\{\theta^{(k)}\}_{k=1}^K$ is feasible for (6.32)-(6.33). Since $\{\bar\theta^{(k)}_*\}_{k=1}^K$ is the solution to (6.32)-(6.33), we must have
\[
F_T(d_\Pi,\Pi)=\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}\Big\rangle\ge\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\bar\theta^{(k)}_*\Big\rangle.
\]
On the other hand, by Corollary 6.5.1,
\[
\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\bar\theta^{(k)}_*\Big\rangle\ge\sum_{t=0}^{T-1}\sum_{k=1}^K\Big\langle\mathbb{E}\big(f^{(k)}_t\big),\theta^{(k)}_*\Big\rangle-\frac{2C_1\sqrt m\,K^2\Psi^2}{\eta}=F_T(d_{\Pi^*},\Pi^*)-\frac{2C_1\sqrt m\,K^2\Psi^2}{\eta}.
\]
Combining the above two displays gives $\textrm{(I)}\le\frac{2C_1\sqrt m\,K^2\Psi^2}{\eta}$, and the proof is finished.

We prove Lemmas 6.2.1 and 6.2.2 in this section.
Proof of Lemma 6.2.1.
For simplicity of notation, we drop the dependencies on $k$ throughout this proof. We first show that for any $r\ge\hat r$, where $\hat r$ is specified in Assumption 6.2.1, $P_{\pi_1}P_{\pi_2}\cdots P_{\pi_r}$ is a strictly positive stochastic matrix. Since the MDP has a finite state set and a finite action set, the set of all pure policies (Definition 6.2.2) is finite. Let $P_1,P_2,\cdots,P_N$ be the probability transition matrices corresponding to these pure policies. Consider any sequence of randomized stationary policies $\pi_1,\cdots,\pi_r$. Their transition matrices can be expressed as convex combinations of those of pure policies, i.e.,
\[
P_{\pi_1}=\sum_{i=1}^N\alpha^{(1)}_iP_i,\quad P_{\pi_2}=\sum_{i=1}^N\alpha^{(2)}_iP_i,\quad\cdots,\quad P_{\pi_r}=\sum_{i=1}^N\alpha^{(r)}_iP_i,
\]
where $\sum_{i=1}^N\alpha^{(j)}_i=1$, $\forall j\in\{1,2,\cdots,r\}$, and $\alpha^{(j)}_i\ge0$. Thus, we have the following display:
\begin{align}
P_{\pi_1}P_{\pi_2}\cdots P_{\pi_r}&=\bigg(\sum_{i=1}^N\alpha^{(1)}_iP_i\bigg)\bigg(\sum_{i=1}^N\alpha^{(2)}_iP_i\bigg)\cdots\bigg(\sum_{i=1}^N\alpha^{(r)}_iP_i\bigg)\nonumber\\
&=\sum_{(i_1,\cdots,i_r)\in\mathcal G_r}\alpha^{(1)}_{i_1}\cdots\alpha^{(r)}_{i_r}\cdot P_{i_1}P_{i_2}\cdots P_{i_r}, \tag{6.38}
\end{align}
where $\mathcal G_r$ ranges over all $N^r$ configurations. Since $\big(\sum_{i=1}^N\alpha^{(1)}_i\big)\cdots\big(\sum_{i=1}^N\alpha^{(r)}_i\big)=1$, it follows that (6.38) is a convex combination of all possible products $P_{i_1}P_{i_2}\cdots P_{i_r}$. By Assumption 6.2.1, $P_{i_1}P_{i_2}\cdots P_{i_r}$ is strictly positive for any $(i_1,\cdots,i_r)\in\mathcal G_r$, and since there are only finitely many configurations, there exists a universal constant $\delta\in(0,1)$ such that every entry of $P_{i_1}P_{i_2}\cdots P_{i_r}$ is at least $\delta/|S|$, uniformly over all configurations $(i_1,\cdots,i_r)\in\mathcal G_r$. This implies $P_{\pi_1}P_{\pi_2}\cdots P_{\pi_r}$ is also strictly positive with the same entrywise lower bound, for any sequence of randomized stationary policies $\pi_1,\cdots,\pi_r$.

Now, we proceed to prove the mixing bound. Choose $r=\hat r$, and decompose any $P_{\pi_1}P_{\pi_2}\cdots P_{\pi_r}$ as follows:
\[
P_{\pi_1}\cdots P_{\pi_r}=\delta\,\Pi_0+(1-\delta)\,Q,
\]
where $\Pi_0$ has each entry equal to $1/|S|$ (recall that $|S|$ is the number of states, which equals the size of the matrix) and $Q$ depends on $\pi_1,\cdots,\pi_r$. The entrywise lower bound $\delta/|S|$ ensures that $Q$ is also a stochastic matrix (nonnegative with rows summing to 1), since both $P_{\pi_1}\cdots P_{\pi_r}$ and $\Pi_0$ are stochastic matrices. Thus, for any two distribution vectors $d_1$ and $d_2$, we have
\[
(d_1-d_2)P_{\pi_1}\cdots P_{\pi_r}=\delta(d_1-d_2)\Pi_0+(1-\delta)(d_1-d_2)Q=(1-\delta)(d_1-d_2)Q,
\]
where we use the fact that for distribution vectors,
\[
(d_1-d_2)\Pi_0=\frac{1}{|S|}\mathbf 1-\frac{1}{|S|}\mathbf 1=0.
\]
Since $Q$ is a stochastic matrix, it is nonexpansive in $\ell_1$-norm; namely, for any vector $x$, $\|xQ\|_1\le\|x\|_1$. To see this, simply compute
\[
\|xQ\|_1=\sum_{j=1}^{|S|}\Bigg|\sum_{i=1}^{|S|}x_iQ_{ij}\Bigg|\le\sum_{j=1}^{|S|}\sum_{i=1}^{|S|}|x_i|Q_{ij}=\sum_{i=1}^{|S|}|x_i|\sum_{j=1}^{|S|}Q_{ij}=\sum_{i=1}^{|S|}|x_i|=\|x\|_1. \tag{6.39}
\]
Overall, we obtain
\[
\big\|(d_1-d_2)P_{\pi_1}\cdots P_{\pi_r}\big\|_1=(1-\delta)\big\|(d_1-d_2)Q\big\|_1\le(1-\delta)\|d_1-d_2\|_1.
\]
We can then take $\tau=-\frac{1}{\log(1-\delta)}$ to finish the proof.

Proof of Lemma 6.2.2.
Since the probability transition matrix of any randomized stationary policy is a convex combination of those of pure policies, it is enough to show that the product MDP is irreducible and aperiodic under any joint pure policy. For simplicity, let $s_t=\big(s^{(1)}_t,\cdots,s^{(K)}_t\big)$ and $a_t=\big(a^{(1)}_t,\cdots,a^{(K)}_t\big)$. Consider any joint pure policy $\Pi$ which, given a joint state $s\in S^{(1)}\times\cdots\times S^{(K)}$, selects a fixed joint action $a\in A^{(1)}\times\cdots\times A^{(K)}$ with probability 1. By Assumption 6.2.2, we have
\begin{align}
Pr\big(s^{(1)}_{t+1},\cdots,s^{(K)}_{t+1}\,\big|\,s^{(1)}_t,\cdots,s^{(K)}_t,\,a^{(1)}_t,\cdots,a^{(K)}_t\big)
={}&Pr\big(s^{(1)}_{t+1}\,\big|\,s^{(1)}_t,\cdots,s^{(K)}_t,\,a^{(1)}_t,\cdots,a^{(K)}_t,\,s^{(2)}_{t+1},\cdots,s^{(K)}_{t+1}\big)\nonumber\\
&\cdot Pr\big(s^{(2)}_{t+1},\cdots,s^{(K)}_{t+1}\,\big|\,s^{(1)}_t,\cdots,s^{(K)}_t,\,a^{(1)}_t,\cdots,a^{(K)}_t\big)\nonumber\\
={}&Pr\big(s^{(1)}_{t+1}\,\big|\,s^{(1)}_t,a^{(1)}_t\big)\cdot Pr\big(s^{(2)}_{t+1},\cdots,s^{(K)}_{t+1}\,\big|\,s^{(1)}_t,\cdots,s^{(K)}_t,\,a^{(1)}_t,\cdots,a^{(K)}_t\big)\nonumber\\
={}&\cdots=\prod_{k=1}^KPr\big(s^{(k)}_{t+1}\,\big|\,s^{(k)}_t,a^{(k)}_t\big), \tag{6.40}
\end{align}
where the second equality follows from the independence relation in Assumption 6.2.2. Thus, we obtain the equality
\[
Pr\big(s_{t+1}=\tilde s\,\big|\,s_t=s,\,a_t=a\big)=\prod_{k=1}^KPr\big(s^{(k)}_{t+1}=\tilde s^{(k)}\,\big|\,s^{(k)}_t=s^{(k)},\,a^{(k)}_t=a^{(k)}\big).
\]
Then, the one-step transition probability between any two states $s,\tilde s\in S^{(1)}\times\cdots\times S^{(K)}$ can be computed as
\begin{align*}
Pr\big(s_{t+1}=\tilde s\,\big|\,s_t=s\big)&=\sum_{a}Pr\big(s_{t+1}=\tilde s\,\big|\,s_t=s,\,a_t=a\big)\cdot Pr\big(a_t=a\,\big|\,s_t=s\big)\\
&=\sum_{a}\prod_{k=1}^KPr\big(s^{(k)}_{t+1}=\tilde s^{(k)}\,\big|\,s^{(k)}_t=s^{(k)},\,a^{(k)}_t=a^{(k)}\big)\cdot Pr\big(a_t=a\,\big|\,s_t=s\big)\\
&=\prod_{k=1}^KP_{a^{(k)}(s)}\big(s^{(k)},\tilde s^{(k)}\big),
\end{align*}
where we can remove the summation over $a$ because $\Pi$ is a pure policy. The notation $a^{(k)}(s)$ denotes a fixed mapping from the product state space $S^{(1)}\times\cdots\times S^{(K)}$ to the individual action space $A^{(k)}$ resulting from the pure policy, and $P_{a^{(k)}(s)}\big(s^{(k)},\tilde s^{(k)}\big)$ is the Markov transition probability from state $s^{(k)}$ to $\tilde s^{(k)}$ under the action $a^{(k)}(s)$. One can then further compute the $r$-step ($r\ge2$) transition probability between any two states $s,\tilde s\in S^{(1)}\times\cdots\times S^{(K)}$ as
\begin{align}
Pr\big(s_{t+r}=\tilde s\,\big|\,s_t=s\big)&=\sum_{s_{t+r-1}}\cdots\sum_{s_{t+1}}\prod_{k=1}^KP_{a^{(k)}(s)}\big(s^{(k)},s^{(k)}_{t+1}\big)\cdot\prod_{k=1}^KP_{a^{(k)}(s_{t+1})}\big(s^{(k)}_{t+1},s^{(k)}_{t+2}\big)\cdots\prod_{k=1}^KP_{a^{(k)}(s_{t+r-1})}\big(s^{(k)}_{t+r-1},\tilde s^{(k)}\big)\nonumber\\
&=\sum_{s_{t+r-1}}\cdots\sum_{s_{t+1}}\prod_{k=1}^KP_{a^{(k)}(s)}\big(s^{(k)},s^{(k)}_{t+1}\big)\cdot P_{a^{(k)}(s_{t+1})}\big(s^{(k)}_{t+1},s^{(k)}_{t+2}\big)\cdots P_{a^{(k)}(s_{t+r-1})}\big(s^{(k)}_{t+r-1},\tilde s^{(k)}\big). \tag{6.41}
\end{align}
For any $k\in\{1,2,\cdots,K\}$, the term
\[
P_{a^{(k)}(s)}\big(s^{(k)},s^{(k)}_{t+1}\big)\cdot P_{a^{(k)}(s_{t+1})}\big(s^{(k)}_{t+1},s^{(k)}_{t+2}\big)\cdots P_{a^{(k)}(s_{t+r-1})}\big(s^{(k)}_{t+r-1},\tilde s^{(k)}\big)
\]
denotes the probability of moving from $s^{(k)}$ to $\tilde s^{(k)}$ along a certain path under a certain sequence of fixed decisions $a^{(k)}(s),a^{(k)}(s_{t+1}),\cdots,a^{(k)}(s_{t+r-1})$. Let
\[
\mathbf s^{(k)}=\big(s^{(k)}_{t+1},s^{(k)}_{t+2},\cdots,s^{(k)}_{t+r-1}\big)\in S^{(k)}\times\cdots\times S^{(k)},\quad k\in\{1,2,\cdots,K\},
\]
be the state path of the $k$-th MDP. One can then change the order of summation in (6.41) and sum over the state paths of each MDP:
\[
\text{(6.41)}=\sum_{\mathbf s^{(K)}}\cdots\sum_{\mathbf s^{(1)}}\prod_{k=1}^KP_{a^{(k)}(s)}\big(s^{(k)},s^{(k)}_{t+1}\big)\cdot P_{a^{(k)}(s_{t+1})}\big(s^{(k)}_{t+1},s^{(k)}_{t+2}\big)\cdots P_{a^{(k)}(s_{t+r-1})}\big(s^{(k)}_{t+r-1},\tilde s^{(k)}\big).
\]
We would like to exchange the order of the product and the sums so that we can take the path sum over each individual MDP separately. However, the problem is that the transition probabilities are coupled through the actions.
The idea is to first apply a "hard" decoupling by taking the infimum of the transition probabilities of each MDP over all pure policies, and then use Assumption 6.2.1 to bound the transition probability from below uniformly. We have
\begin{align*}
\text{(6.41)}\ge{}&\inf_{\mathbf s^{(1)}}\sum_{\mathbf s^{(K)}}\cdots\sum_{\mathbf s^{(2)}}\prod_{k=2}^KP_{a^{(k)}(s)}\big(s^{(k)},s^{(k)}_{t+1}\big)\cdots P_{a^{(k)}(s_{t+r-1})}\big(s^{(k)}_{t+r-1},\tilde s^{(k)}\big)\\
&\cdot\inf_{\mathbf s^{(j)},\,j\neq1}\sum_{\mathbf s^{(1)}}P_{a^{(1)}(s)}\big(s^{(1)},s^{(1)}_{t+1}\big)\cdots P_{a^{(1)}(s_{t+r-1})}\big(s^{(1)}_{t+r-1},\tilde s^{(1)}\big)\\
\ge{}&\inf_{\mathbf s^{(1)}}\sum_{\mathbf s^{(K)}}\cdots\sum_{\mathbf s^{(2)}}\prod_{k=2}^KP_{a^{(k)}(s)}\big(s^{(k)},s^{(k)}_{t+1}\big)\cdots P_{a^{(k)}(s_{t+r-1})}\big(s^{(k)}_{t+r-1},\tilde s^{(k)}\big)\\
&\cdot\inf_{\pi^{(1)}_1,\cdots,\pi^{(1)}_r}\sum_{\mathbf s^{(1)}}P_{\pi^{(1)}_1}\big(s^{(1)},s^{(1)}_{t+1}\big)\cdots P_{\pi^{(1)}_r}\big(s^{(1)}_{t+r-1},\tilde s^{(1)}\big),
\end{align*}
where $\pi^{(1)}_1,\cdots,\pi^{(1)}_r$ range over all pure policies, and the second inequality follows from the fact that, for any fixed path of the other MDPs (i.e., $\mathbf s^{(j)}$, $j\neq1$), the term
\[
\sum_{\mathbf s^{(1)}}P_{a^{(1)}(s)}\big(s^{(1)},s^{(1)}_{t+1}\big)\cdots P_{a^{(1)}(s_{t+r-1})}\big(s^{(1)}_{t+r-1},\tilde s^{(1)}\big)
\]
is the probability of reaching $\tilde s^{(1)}$ from $s^{(1)}$ in $r$ steps using a sequence of actions $a^{(1)}(s),\cdots,a^{(1)}(s_{t+r-1})$, where each action is a deterministic function of the previous state of the 1st MDP only (the other components of the joint state being fixed). Thus, it dominates the infimum over all sequences of pure policies $\pi^{(1)}_1,\cdots,\pi^{(1)}_r$ on this MDP.
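In the fully decoupled special case, where each action $a^{(k)}$ is a pure function of its own chain's state $s^{(k)}$, no infimum is needed: the joint transition matrix is exactly the Kronecker product of the per-chain matrices, and entrywise positivity of each factor passes directly to the product chain. A small numeric check of this baseline picture (the two transition matrices are hypothetical toy values):

```python
import numpy as np

# Per-chain transition matrices under some fixed per-chain pure policies.
P1 = np.array([[0.6, 0.4], [0.3, 0.7]])
P2 = np.array([[0.1, 0.9], [0.8, 0.2]])

# Independent chains: Pr(joint s' | joint s) = P1(s1, s1') * P2(s2, s2'),
# i.e. the joint matrix is the Kronecker product.
P_joint = np.kron(P1, P2)
assert np.allclose(P_joint.sum(axis=1), 1.0)  # still a stochastic matrix

# The joint r-step matrix factorizes as the Kronecker product of the
# per-chain r-step matrices (mixed-product property), so entrywise
# positivity of each factor implies positivity of the product chain.
r = 3
lhs = np.linalg.matrix_power(P_joint, r)
rhs = np.kron(np.linalg.matrix_power(P1, r), np.linalg.matrix_power(P2, r))
assert np.allclose(lhs, rhs)
assert (lhs > 0).all()
```

The coupled case treated in this proof reduces to this picture only after the "hard" decoupling, which is why the infimum over pure-policy sequences appears.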
Similarly,we can decouple the rest of the sums and obtain the follow display:(6.41) ≥ K Y k =1 inf π ( k )1 , ··· ,π ( k ) r X s ( k ) P π ( k )1 (cid:16) s ( k ) , s ( k ) t +1 (cid:17) · · · P π ( k ) r (cid:16) s ( k ) t + r − , ˜ s ( k ) (cid:17) = K Y k =1 inf π ( k )1 , ··· ,π ( k ) r P π ( k )1 , ··· ,π ( k ) r (cid:16) s ( k ) , ˜ s ( k ) (cid:17) , P π ( k )1 , ··· ,π ( k ) r (cid:0) s ( k ) , ˜ s ( k ) (cid:1) denotes the (cid:0) s ( k ) , ˜ s ( k ) (cid:1) -th entry of the product matrix P ( k ) π ( k )1 · · · P ( k ) π ( k ) r .Now, by Assumption 6.2.1, there exists a large enough integer b r such that P ( k ) π ( k )1 · · · P ( k ) π ( k ) r is astrictly positive matrix for any sequence of r ≥ b r randomized stationary policy. As a consequence,the above probability is strictly positive and (6.41) is also strictly positive.This implies, if we choose ˜ s = s , then, starting from any arbitrary product state s ∈ S (1) ×· · ·×S ( K ) , there is a positive probability of returning to this state after r steps for all r ≥ b r , whichgives the aperiodicity. Similarly, there is a positive probability of reaching any other compositestate after r steps for all r ≥ b r , which gives the irreducibility. This implies the product stateMDP is irreducible and aperiodic under any joint pure policy, and thus, any joint randomizedstationary policy.For the second part of the claim, we consider any randomized stationary policy Π and thecorresponding joint transition probability matrix P Π , there exists a stationary state-action prob-ability vector Φ( a , s ) , a ∈ A (1) × · · · × A ( K ) , s ∈ S (1) × · · · × S ( K ) , such that X a Φ( a , ˜ s ) = X s X a Φ( a , s ) P a ( s , ˜ s ) , ∀ ˜ s ∈ S (1) × · · · × S ( K ) . 
(6.42)

Then, the state-action probability of the $k$-th MDP is $\theta^{(k)}\big(a^{(k)},\tilde s^{(k)}\big)=\sum_{\tilde s^{(j)},a^{(j)},\,j\neq k}\Phi(a,\tilde s)$. Thus,
$$
\sum_{a^{(k)}}\theta^{(k)}\big(a^{(k)},\tilde s^{(k)}\big) = \sum_{\tilde s^{(j)},\,j\neq k}\sum_{a}\Phi(a,\tilde s) = \sum_{s}\sum_{a}\Phi(a,s)\sum_{\tilde s^{(j)},\,j\neq k} P_{a}(s,\tilde s)
= \sum_{s}\sum_{a}\Phi(a,s)\cdot Pr\big(\tilde s^{(k)}\,\big|\,a,s\big)
= \sum_{s}\sum_{a}\Phi(a,s)\cdot Pr\big(\tilde s^{(k)}\,\big|\,a^{(k)},s^{(k)}\big)
= \sum_{a^{(k)}}\sum_{s^{(k)}}\theta^{(k)}\big(a^{(k)},s^{(k)}\big)\cdot Pr\big(\tilde s^{(k)}\,\big|\,a^{(k)},s^{(k)}\big)
= \sum_{a^{(k)}}\sum_{s^{(k)}}\theta^{(k)}\big(a^{(k)},s^{(k)}\big)\cdot P_{a^{(k)}}\big(s^{(k)},\tilde s^{(k)}\big),
$$
where the third-from-last equality follows from Assumption 6.2.2. This finishes the proof.

Proof of Lemma 6.4.6.
Consider the state-action probabilities $\{\tilde\theta^{(k)}\}_{k=1}^K$ which achieve the Slater condition in (6.8). First of all, note that $Q_i(t)\in\mathcal F_{t-1}$, $\forall t\geq 1$. Then, using the assumption that $\{g^{(k)}_{i,t-1}\}_{k=1}^K$ is i.i.d. and independent of all system information up to $t-1$, we have
$$
\mathbb E\left(Q_i(t-1)\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,t-1},\tilde\theta^{(k)}\Big\rangle\,\Big|\,\mathcal F_{t-1}\right) = \mathbb E\left(\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,t-1},\tilde\theta^{(k)}\Big\rangle\right) Q_i(t-1) \leq -\eta\,Q_i(t-1). \qquad (6.43)
$$
Now, by the drift-plus-penalty bound (6.15) with $\theta^{(k)}=\tilde\theta^{(k)}$,
$$
\Delta(t) \leq -V\sum_{k=1}^{K}\Big\langle f^{(k)}_{t-1},\theta^{(k)}_t-\theta^{(k)}_{t-1}\Big\rangle - \alpha\sum_{k=1}^{K}\big\|\theta^{(k)}_t-\theta^{(k)}_{t-1}\big\|^2 + \frac{3}{2}mK^2\Psi^2 + V\sum_{k=1}^{K}\Big\langle f^{(k)}_{t-1},\tilde\theta^{(k)}-\theta^{(k)}_{t-1}\Big\rangle + \sum_{i=1}^{m}Q_i(t-1)\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,t-1},\tilde\theta^{(k)}\Big\rangle + \alpha\sum_{k=1}^{K}\big\|\tilde\theta^{(k)}-\theta^{(k)}_{t-1}\big\|^2 - \alpha\sum_{k=1}^{K}\big\|\tilde\theta^{(k)}-\theta^{(k)}_{t}\big\|^2
$$
$$
\leq 4VK\Psi + \frac{3}{2}mK^2\Psi^2 + \sum_{i=1}^{m}Q_i(t-1)\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,t-1},\tilde\theta^{(k)}\Big\rangle + \alpha\sum_{k=1}^{K}\Big(\big\|\tilde\theta^{(k)}-\theta^{(k)}_{t-1}\big\|^2 - \big\|\tilde\theta^{(k)}-\theta^{(k)}_{t}\big\|^2\Big),
$$
where the second inequality drops the non-positive term $-\alpha\sum_k\|\theta^{(k)}_t-\theta^{(k)}_{t-1}\|^2$ and bounds each of the two $V$-terms via Hölder's inequality,
$$
\Big|\Big\langle f^{(k)}_{t-1},\theta^{(k)}_t-\theta^{(k)}_{t-1}\Big\rangle\Big| \leq \big\|f^{(k)}_{t-1}\big\|_\infty\big\|\theta^{(k)}_t-\theta^{(k)}_{t-1}\big\|_1 \leq 2\Psi,
$$
and similarly for the $\tilde\theta^{(k)}$ term. Summing the drift from $t$ to $t+t_0-1$ and taking $\mathbb E(\,\cdot\,|\,\mathcal F_{t-1})$ gives
$$
\mathbb E\Big(\|Q(t+t_0)\|^2-\|Q(t)\|^2\,\Big|\,\mathcal F_{t-1}\Big) \leq \big(8VK\Psi + 3mK^2\Psi^2\big)t_0 + 2\sum_{i=1}^{m}\mathbb E\left(\sum_{\tau=t}^{t+t_0-1}Q_i(\tau-1)\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,\tau-1},\tilde\theta^{(k)}\Big\rangle\,\Big|\,\mathcal F_{t-1}\right) + 2\alpha\,\mathbb E\left(\sum_{k=1}^{K}\Big(\big\|\tilde\theta^{(k)}-\theta^{(k)}_{t-1}\big\|^2 - \big\|\tilde\theta^{(k)}-\theta^{(k)}_{t+t_0-1}\big\|^2\Big)\,\Big|\,\mathcal F_{t-1}\right)
$$
$$
\leq \big(8VK\Psi + 3mK^2\Psi^2\big)t_0 + 4K\alpha + 2\sum_{i=1}^{m}\mathbb E\left(\sum_{\tau=t}^{t+t_0-1}Q_i(\tau-1)\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,\tau-1},\tilde\theta^{(k)}\Big\rangle\,\Big|\,\mathcal F_{t-1}\right).
$$
Using the tower property of conditional expectations (iterating $\mathbb E(\,\cdot\,|\,\mathcal F_{t+t_0-2}),\dots,\mathbb E(\,\cdot\,|\,\mathcal F_{t})$ inside the conditional expectation) together with the bound (6.43), we have
$$
\mathbb E\left(\sum_{\tau=t}^{t+t_0-1}Q_i(\tau-1)\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,\tau-1},\tilde\theta^{(k)}\Big\rangle\,\Big|\,\mathcal F_{t-1}\right) \leq -\eta\,\mathbb E\left(\sum_{\tau=t}^{t+t_0-1}Q_i(\tau-1)\,\Big|\,\mathcal F_{t-1}\right) \leq -\eta t_0 Q_i(t-1) + \frac{\eta t_0(t_0-1)}{2}K\Psi \leq -\eta t_0 Q_i(t) + \frac{\eta t_0(t_0-1)}{2}K\Psi + \eta t_0 K\Psi,
$$
where the last two inequalities follow from the queue updating rule (6.9), which implies
$$
|Q_i(t-1)-Q_i(t)| \leq \left|\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,t-1},\theta^{(k)}_{t-1}\Big\rangle\right| \leq K\max_k\big\|g^{(k)}_{i,t-1}\big\|_\infty\big\|\theta^{(k)}_{t-1}\big\|_1 \leq K\Psi.
$$
Thus, we have
$$
\mathbb E\Big(\|Q(t+t_0)\|^2-\|Q(t)\|^2\,\Big|\,\mathcal F_{t-1}\Big) \leq \big(8VK\Psi + 3mK^2\Psi^2\big)t_0 + 4K\alpha + \eta t_0(t_0-1)mK\Psi + 2\eta t_0 mK\Psi - 2\eta t_0\sum_{i=1}^{m}Q_i(t)
\leq \big(8VK\Psi + 3mK^2\Psi^2\big)t_0 + 4K\alpha + \eta t_0(t_0-1)mK\Psi + 2\eta t_0 mK\Psi - 2\eta t_0\|Q(t)\|.
$$
Suppose
$$
\|Q(t)\| \geq \frac{\big(8VK\Psi + 3mK^2\Psi^2\big)t_0 + 4K\alpha + \eta t_0(t_0-1)mK\Psi + 2\eta t_0 mK\Psi}{\eta t_0} + \frac{\eta t_0}{2};
$$
then it follows that
$$
\mathbb E\Big(\|Q(t+t_0)\|^2-\|Q(t)\|^2\,\Big|\,\mathcal F_{t-1}\Big) \leq -\eta t_0\|Q(t)\|,
$$
which implies
$$
\mathbb E\Big(\|Q(t+t_0)\|^2\,\Big|\,\mathcal F_{t-1}\Big) \leq \Big(\|Q(t)\|-\frac{\eta t_0}{2}\Big)^2.
$$
Since $\|Q(t)\|\geq \eta t_0/2$, taking the square root of both sides and using Jensen's inequality gives
$$
\mathbb E\Big(\|Q(t+t_0)\|\,\Big|\,\mathcal F_{t-1}\Big) \leq \|Q(t)\|-\frac{\eta t_0}{2}.
$$
On the other hand, the per-slot change of $\|Q(t)\|$ is bounded:
$$
\Big|\|Q(t+1)\|-\|Q(t)\|\Big| = \left|\sqrt{\sum_{i=1}^{m}\max\Big(Q_i(t)+\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,t},\theta^{(k)}_t\Big\rangle,\,0\Big)^2} - \sqrt{\sum_{i=1}^{m}Q_i(t)^2}\right| \leq \left(\sum_{i=1}^{m}\Big(\sum_{k=1}^{K}\Big\langle g^{(k)}_{i,t},\theta^{(k)}_t\Big\rangle\Big)^2\right)^{1/2} \leq \sqrt m\,K\Psi.
$$
This completes the proof.
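For intuition, the queue update rule (6.9) and the per-slot bounds on $Q_i(t)$ and $\|Q(t)\|$ used in the proof can be checked numerically. The sketch below is illustrative only; the dimensions, the value of $\Psi$, and the random draws of $g$ and $\theta$ are hypothetical, not from the thesis:

```python
import math
import random

random.seed(0)
m, K, Psi, dim = 4, 3, 2.0, 5   # illustrative sizes, not from the thesis

def norm2(v):
    return math.sqrt(sum(x * x for x in v))

Q = [0.0] * m                    # virtual queues Q_i(t)
for t in range(200):
    Q_prev = Q[:]
    for i in range(m):
        # total = sum_k <g_{i,t-1}^{(k)}, theta_{t-1}^{(k)}> with
        # ||g||_inf <= Psi and theta a probability vector (||theta||_1 = 1)
        total = 0.0
        for _ in range(K):
            g = [random.uniform(-Psi, Psi) for _ in range(dim)]
            theta = [random.random() for _ in range(dim)]
            s = sum(theta)
            theta = [x / s for x in theta]
            total += sum(gj * tj for gj, tj in zip(g, theta))
        assert abs(total) <= K * Psi + 1e-9      # per-queue change at most K*Psi
        Q[i] = max(Q_prev[i] + total, 0.0)       # queue update rule (6.9)
    # per-slot change of ||Q||, as at the end of the proof
    assert abs(norm2(Q) - norm2(Q_prev)) <= math.sqrt(m) * K * Psi + 1e-9
```

Both assertions encode exactly the two Hölder-type bounds above: each queue moves by at most $K\Psi$ per slot, so the Euclidean norm of the queue vector moves by at most $\sqrt m\,K\Psi$.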
Proof of Lemma 6.5.1.
Consider any joint randomized stationary policy $\Pi$ and a starting state distribution $d$ on the product state space $\mathcal S^{(1)}\times\mathcal S^{(2)}\times\cdots\times\mathcal S^{(K)}$. Let $P_\Pi$ be the corresponding transition matrix on the product state space, let $d_t$ be the state distribution at time $t$ under $\Pi$, and let $d_\Pi$ be the stationary state distribution. By Lemma 6.2.2, this product-state MDP is irreducible and aperiodic (ergodic) under any randomized stationary policy; in particular, it is ergodic under any pure policy. Since there are only finitely many pure policies, let $P_{\Pi_1},\dots,P_{\Pi_N}$ be the probability transition matrices corresponding to these pure policies. By Proposition 1.7 of [LPW06], for any $\Pi_i$, $i\in\{1,2,\dots,N\}$, there exists an integer $\tau_i>0$ such that $(P_{\Pi_i})^t$ is strictly positive for any $t\geq\tau_i$. Let $\tau=\max_i\tau_i$; then $(P_{\Pi_i})^\tau$ is strictly positive uniformly over all $\Pi_i$'s. Let $\delta>0$ be the least entry of $(P_{\Pi_i})^\tau$ over all $\Pi_i$'s. Since the probability transition matrix $P_\Pi$ is a convex combination of those of pure policies, i.e. $P_\Pi=\sum_{i=1}^N\alpha_i P_{\Pi_i}$, $\alpha_i\geq 0$, $\sum_{i=1}^N\alpha_i=1$, the matrix $(P_\Pi)^\tau$ is also strictly positive. To see this, note that
$$
(P_\Pi)^\tau = \left(\sum_{i=1}^N\alpha_i P_{\Pi_i}\right)^{\tau} \geq \sum_{i=1}^N\alpha_i^\tau\,(P_{\Pi_i})^\tau > 0,
$$
where the inequality is taken entry-wise. Furthermore, the least entry of $(P_\Pi)^\tau$ is lower bounded by $\delta/N^{\tau-1}$ uniformly over all joint randomized stationary policies $\Pi$, which follows from the fact that
$$
\sum_{i=1}^N\alpha_i^\tau\,\delta \geq \frac{1}{N^{\tau-1}}\left(\sum_{i=1}^N\alpha_i\right)^{\tau}\delta = \frac{\delta}{N^{\tau-1}}.
$$
The rest is a standard bookkeeping argument from Markov chain mixing time theory (Theorem 4.9 of [LPW06]). Let $D_\Pi$ be a matrix of the same size as $P_\Pi$ with each row equal to the stationary distribution $d_\Pi$, and let $\varepsilon=\delta/N^{\tau-1}$. We claim that for any integer $n>0$ and any $\Pi$,
$$
P_\Pi^{\tau n} = \big(1-(1-\varepsilon)^n\big)D_\Pi + (1-\varepsilon)^n Q_n, \qquad (6.44)
$$
for some stochastic matrix $Q_n$. We prove this claim by induction. First, for $n=1$, since $(P_\Pi)^\tau$ is a positive matrix whose least entry is uniformly lower bounded by $\varepsilon$ over all policies $\Pi$, we can write
$$
(P_\Pi)^\tau = \varepsilon D_\Pi + (1-\varepsilon)Q_1,
$$
for some stochastic matrix $Q_1$, where we use the fact that $\varepsilon\in(0,1)$. Supposing the claim holds for $n=1,2,\dots,\ell$, we show that it also holds for $n=\ell+1$. Using the facts that $D_\Pi P_\Pi = D_\Pi$ and $Q D_\Pi = D_\Pi$ for any stochastic matrix $Q$, we can write out $P_\Pi^{\tau(\ell+1)}$:
$$
P_\Pi^{\tau(\ell+1)} = P_\Pi^{\tau\ell}P_\Pi^{\tau}
= \Big(\big(1-(1-\varepsilon)^{\ell}\big)D_\Pi + (1-\varepsilon)^{\ell}Q_\ell\Big)P_\Pi^{\tau}
= \big(1-(1-\varepsilon)^{\ell}\big)D_\Pi + (1-\varepsilon)^{\ell}Q_\ell\big(\varepsilon D_\Pi + (1-\varepsilon)Q_1\big)
= \big(1-(1-\varepsilon)^{\ell+1}\big)D_\Pi + (1-\varepsilon)^{\ell+1}Q_{\ell+1},
$$
where $Q_{\ell+1}\triangleq Q_\ell Q_1$ is again a stochastic matrix. Thus, (6.44) holds. For any integer $t>0$, write $t=\tau n+j$ for some integer $j\in[0,\tau)$ and $n\geq 0$. Then,
$$
(P_\Pi)^t - D_\Pi = (1-\varepsilon)^n\big(Q_n P_\Pi^j - D_\Pi\big).
$$
Let $P_\Pi^t(i,\cdot)$ be the $i$-th row of $P_\Pi^t$; then we obtain
$$
\max_i\big\|P_\Pi^t(i,\cdot)-d_\Pi\big\|_1 \leq 2(1-\varepsilon)^n,
$$
where we use the fact that the $\ell_1$-norm of the row difference is bounded by 2. Finally, for any starting state distribution $d$, we have
$$
\big\|dP_\Pi^t - d_\Pi\big\|_1 = \left\|\sum_i d(i)\big(P_\Pi^t(i,\cdot)-d_\Pi\big)\right\|_1 \leq \sum_i d(i)\big\|P_\Pi^t(i,\cdot)-d_\Pi\big\|_1 \leq \max_i\big\|P_\Pi^t(i,\cdot)-d_\Pi\big\|_1 \leq 2(1-\varepsilon)^n.
$$
Taking $r=\tau/\log\frac{1}{1-\varepsilon}$ finishes the proof.

Proof of Lemma 6.5.2.
Let $v_t$ be the joint state distribution on $\mathcal S^{(1)}\times\cdots\times\mathcal S^{(K)}$ at time $t$ under policy $\Pi$. Using the fact that $\Pi$ is a fixed policy independent of $g^{(k)}_{i,t}$, together with Assumption 6.2.2 that the probability transition is independent of the function path given any state and action, the function $g^{(k)}_{i,t}$ and the state-action pair $\big(a^{(k)}_t,s^{(k)}_t\big)$ are mutually independent. Thus, for any $t\in\{0,1,2,\dots,T-1\}$,
$$
\mathbb E\left(\sum_{k=1}^{K}g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d,\Pi\right) = \sum_{s\in\mathcal S^{(1)}\times\cdots\times\mathcal S^{(K)}}\ \sum_{a\in\mathcal A^{(1)}\times\cdots\times\mathcal A^{(K)}} v_t(s)\,\Pi(a|s)\sum_{k=1}^{K}\mathbb E\Big(g^{(k)}_{i,t}\big(a^{(k)},s^{(k)}\big)\Big),
$$
where $s=[s^{(1)},\dots,s^{(K)}]$, $a=[a^{(1)},\dots,a^{(K)}]$, and the latter expectation is taken with respect to $g^{(k)}_{i,t}$ (i.e. the random variable $w_t$). On the other hand, by Lemma 6.2.2, for any randomized stationary policy $\Pi$, the corresponding stationary state-action probability can be expressed as $\{\theta^{(k)}_*\}_{k=1}^K$ with $\theta^{(k)}_*\in\Theta^{(k)}$. Thus,
$$
\sum_{k=1}^{K}\Big\langle\mathbb E\big(g^{(k)}_{i,t}\big),\theta^{(k)}_*\Big\rangle = \sum_{s\in\mathcal S^{(1)}\times\cdots\times\mathcal S^{(K)}}\ \sum_{a\in\mathcal A^{(1)}\times\cdots\times\mathcal A^{(K)}} d_\Pi(s)\,\Pi(a|s)\sum_{k=1}^{K}\mathbb E\Big(g^{(k)}_{i,t}\big(a^{(k)},s^{(k)}\big)\Big).
$$
As a result,
$$
\sum_{t=0}^{T-1}\left|\mathbb E\left(\sum_{k=1}^{K}g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d,\Pi\right) - \sum_{k=1}^{K}\Big\langle\mathbb E\big(g^{(k)}_{i,t}\big),\theta^{(k)}_*\Big\rangle\right|
\leq \sum_{t=0}^{T-1}\left|\sum_{s}\sum_{a}\big(v_t(s)-d_\Pi(s)\big)\,\Pi(a|s)\right| K\Psi
\leq K\Psi\sum_{t=0}^{T-1}\big\|v_t-d_\Pi\big\|_1
\leq 2K\Psi\sum_{t=0}^{T-1}e^{(r-t)/r}
\leq 2eK\Psi\int_{0}^{\infty}e^{-t/r}\,dt = 2erK\Psi,
$$
where the third inequality follows from Lemma 6.5.1.
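The geometric mixing bound borrowed from Lemma 6.5.1 in the third inequality can be sanity-checked numerically. The sketch below verifies $\|dP^t-d_\Pi\|_1\leq 2(1-\varepsilon)^{\lfloor t/\tau\rfloor}$ for a hypothetical strictly positive 3-state chain, so that $\tau=1$ and $\varepsilon$ is the least entry (all numbers illustrative, not from the text):

```python
# Hypothetical 3-state chain with strictly positive transition matrix,
# so tau = 1 and epsilon = least entry (illustrative, not from the thesis).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
n = len(P)
eps = min(min(row) for row in P)   # least entry; plays the role of epsilon

def step(d, P):
    # One step of the chain: row vector times matrix.
    return [sum(d[i] * P[i][j] for i in range(n)) for j in range(n)]

# Approximate the stationary distribution by iterating many steps.
pi = [1.0 / n] * n
for _ in range(10_000):
    pi = step(pi, P)

d = [1.0, 0.0, 0.0]                # an arbitrary starting distribution
for t in range(1, 50):
    d = step(d, P)
    l1 = sum(abs(d[j] - pi[j]) for j in range(n))
    assert l1 <= 2 * (1 - eps) ** t + 1e-9   # geometric mixing bound
```

The decomposition $(P_\Pi)^\tau=\varepsilon D_\Pi+(1-\varepsilon)Q_1$ from the previous proof is exactly what makes this assertion hold: each application of $P$ mixes an $\varepsilon$-fraction of the stationary distribution into every row.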
Taking $C=2er$ finishes the proof of (6.29); (6.28) can be proved in a similar way. In particular, for any randomized stationary policy $\Pi$ that satisfies the constraint (6.2), we have
$$
T\cdot\sum_{k=1}^{K}\Big\langle\mathbb E\big(g^{(k)}_{i,t}\big),\theta^{(k)}_*\Big\rangle
\leq \sum_{t=0}^{T-1}\left|\mathbb E\left(\sum_{k=1}^{K}g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d,\Pi\right) - \sum_{k=1}^{K}\Big\langle\mathbb E\big(g^{(k)}_{i,t}\big),\theta^{(k)}_*\Big\rangle\right| + \sum_{t=0}^{T-1}\mathbb E\left(\sum_{k=1}^{K}g^{(k)}_{i,t}\big(a^{(k)}_t,s^{(k)}_t\big)\,\Big|\,d,\Pi\right)
\leq 2erK\Psi + 0 = 2erK\Psi,
$$
finishing the proof.

Bibliography

[Alt99a] E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC Press, 1999.
[Alt99b] E. Altman.
Constrained Markov Decision Processes, volume 7. CRC Press, 1999.
[BAM10] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, pages 267–280. ACM, 2010.
[Ber95] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 1995.
[Ber01] D. P. Bertsekas. Dynamic Programming and Optimal Control, 2nd edition, Vol. I. Athena Scientific, Nashua, NH, 2001.
[Ber09a] D. Bertsekas. Convex Optimization Theory. Athena Scientific, 2009.
[Ber09b] D. P. Bertsekas. Convex Optimization Theory. Athena Scientific, Belmont, 2009.
[BGPS06] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking, 14:2508–2530, 2006.
[BL16] C. Boutilier and T. Lu. Budget allocation using weakly coupled, constrained Markov decision processes. In UAI, 2016.
[BT97] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Nashua, NH, 1997.
[BV04] S. Boyd and L. Vandenberghe.
Convex Optimization. Cambridge University Press, 2004.
[CDM14] C. Caramanis, N. B. Dimitrov, and D. P. Morton. Efficient algorithms for budget-constrained Markov decision processes. IEEE Transactions on Automatic Control, 59(10):2813–2817, 2014.
[CFMS03] H. S. Chang, P. J. Fard, S. I. Marcus, and M. Shayman. Multitime scale Markov decision processes. IEEE Transactions on Automatic Control, 48(6):976–987, 2003.
[CLG17] T. Chen, Q. Ling, and G. B. Giannakis. An online convex optimization approach to dynamic network resource allocation. arXiv preprint arXiv:1701.03974, 2017.
[CW16] Y. Chen and M. Wang. Stochastic primal-dual methods and sample complexity of reinforcement learning. arXiv preprint arXiv:1612.02516, 2016.
[DGS14] T. Dick, A. Gyorgy, and C. Szepesvari. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 512–520, 2014.
[Dur13] R. Durrett. Probability: Theory and Examples, 4th edition. Cambridge University Press, 2013.
[EDKM05] E. Even-Dar, S. M. Kakade, and Y. Mansour. Experts in a Markov decision process. In Advances in Neural Information Processing Systems, pages 401–408, 2005.
[EDKM09] E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
[ES06] A. Eryilmaz and R. Srikant. Joint congestion control, routing, and MAC for stability and fairness in wireless networks. IEEE Journal on Selected Areas in Communications, 24(8):1514–1524, 2006.
[ES07] A. Eryilmaz and R. Srikant. Fair resource allocation in wireless networks using queue-length-based scheduling and congestion control. IEEE/ACM Transactions on Networking (TON), 15(6):1333–1344, 2007.
[ES12] A. Eryilmaz and R. Srikant. Asymptotically tight steady-state queue length bounds implied by drift conditions.
Queueing Systems, 72(3-4):311–359, 2012.
[Fox66a] B. Fox. Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14(6):1418–1432, 1966.
[Fox66b] B. Fox. Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14(6):1418–1432, 1966.
[FS99] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
[Gan13] A. Gandhi. Dynamic Server Provisioning for Data Center Power Management. PhD thesis, Carnegie Mellon University, 2013.
[GDHBSW13] A. Gandhi, S. Doroudi, M. Harchol-Balter, and A. Scheller-Wolf. Exact analysis of the M/M/k/setup class of Markov chains via recursive renewal reward. Proc. ACM Sigmetrics, pages 153–166, 2013.
[GHBK12] A. Gandhi, M. Harchol-Balter, and M. A. Kozuch. Are sleep states effective in data centers? In Green Computing Conference (IGCC), 2012 International, pages 1–10. IEEE, 2012.
[GNT+06] L. Georgiadis, M. J. Neely, L. Tassiulas, et al. Resource allocation and cross-layer control in wireless networks. Foundations and Trends® in Networking, 1(1):1–144, 2006.
[GRW14] P. Guan, M. Raginsky, and R. M. Willett. Online Markov decision processes with Kullback–Leibler control cost. IEEE Transactions on Automatic Control, 59(6):1423–1438, 2014.
[H+16] E. Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
[Haj82] B. Hajek. Hitting-time and occupation-time bounds implied by drift analysis with applications. Advances in Applied Probability, 14(3):502–525, 1982.
[HAK07] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization.
Machine Learning, 69:169–192, 2007.
[HK14] E. Hazan and S. Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[HP05] M. Hutter and J. Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6(Apr):639–660, 2005.
[HS08] T. Horvath and K. Skadron. Multi-mode energy management for multi-tier server clusters. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 270–279. ACM, 2008.
[JHA16] R. Jenatton, J. Huang, and C. Archambeau. Adaptive algorithms for online convex optimization with long-term constraints. In International Conference on Machine Learning, pages 402–411, 2016.
[LHS+13] T. Lattimore, M. Hutter, P. Sunehag, et al. The sample-complexity of general reinforcement learning. In Proceedings of the 30th International Conference on Machine Learning. Journal of Machine Learning Research, 2013.
[Li11] C.-p. Li. Stochastic Optimization over Parallel Queues: Channel-Blind Scheduling, Restless Bandit, and Optimal Delay. Citeseer, 2011.
[LN14] C. Li and M. J. Neely. Solving convex optimization with side constraints in a multi-class queue by adaptive cµ rule. Queueing Systems, 77(3):331–372, 2014.
[LPW06] D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2006.
[LS04] X. Lin and N. B. Shroff. Joint rate control and scheduling in multihop wireless networks. In, volume 2, pages 1484–1489. IEEE, 2004.
[LW94] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[LWAT13] M. Lin, A. Wierman, L. L. Andrew, and E. Thereska. Dynamic right-sizing for power-proportional data centers.
IEEE/ACM Transactions on Networking, 21(5):1378–1391, 2013.
[MGW09] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: eliminating server idle power. In ACM Sigplan Notices, volume 44, pages 205–216. ACM, 2009.
[MJY12] M. Mahdavi, R. Jin, and T. Yang. Trading regret for efficiency: online convex optimization with long term constraints. Journal of Machine Learning Research, 13(Sep):2503–2528, 2012.
[MSB+11] D. Meisner, C. M. Sadler, L. A. Barroso, W.-D. Weber, and T. F. Wenisch. Power management of online data-intensive services. In ACM SIGARCH Computer Architecture News, volume 39, pages 319–330. ACM, 2011.
[NAGS10] G. Neu, A. Antos, A. György, and C. Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, pages 1804–1812, 2010.
[Nee10a] M. J. Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010.
[Nee10b] M. J. Neely. Stochastic Network Optimization with Application to Communication and Queueing Systems. Morgan & Claypool, 2010.
[Nee11] M. J. Neely. Online fractional programming for Markov decision systems. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 353–360. IEEE, 2011.
[Nee12a] M. J. Neely. Asynchronous control for coupled Markov decision systems. Information Theory Workshop (ITW), 2012.
[Nee12b] M. J. Neely. Asynchronous scheduling for energy optimality in systems with multiple servers. Proceedings of 46th Annual Conference on Information Sciences and Systems (CISS), 2012.
[Nee12c] M. J. Neely. Stability and probability 1 convergence for queueing networks via Lyapunov optimization. Journal of Applied Mathematics, 2012, 2012.
[Nee13a] M. J. Neely. Dynamic optimization and learning for renewal systems.
IEEE Transactions on Automatic Control, 58(1):32–46, 2013.
[Nee13b] M. J. Neely. Dynamic optimization and learning for renewal systems. IEEE Transactions on Automatic Control, 58(1):32–46, 2013.
[New05] M. E. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46(5):323–351, 2005.
[NML08] M. J. Neely, E. Modiano, and C.-P. Li. Fairness and optimal stochastic control for heterogeneous networks. IEEE/ACM Transactions on Networking, 16(2):396–409, 2008.
[NO09] A. Nedić and A. Ozdaglar. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19(4):1757–1780, 2009.
[NY17] M. J. Neely and H. Yu. Online convex optimization with time-varying constraints. arXiv preprint arXiv:1702.04783, 2017.
[PT99] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of optimal queuing network control. Mathematics of Operations Research, 24(2):293–305, 1999.
[Put14] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[PXYY16] Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: an algorithmic framework for asynchronous parallel coordinate updates. To appear in SIAM Journal on Scientific Computing, 2016.
[Rib10] A. Ribeiro. Ergodic stochastic optimization algorithms for wireless communication and networking. IEEE Transactions on Signal Processing, 58(12):6369–6386, 2010.
[Roc15] R. T. Rockafellar. Convex Analysis. Princeton University Press, 2015.
[Ros02] S. Ross. Introduction to Probability Models, 8th edition. Academic Press, 2002.
[SB98] R. S. Sutton and A. G. Barto.
Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.
[Sch83] S. Schaible. Fractional programming. Zeitschrift für Operations Research, 27(1):39–54, 1983.
[SN11] K. Srivastava and A. Nedic. Distributed asynchronous constrained stochastic optimization. IEEE Journal of Selected Topics in Signal Processing, 5(4):772–790, 2011.
[Sto05] A. L. Stolyar. Maximizing queueing network utility subject to stability: Greedy primal-dual algorithm. Queueing Systems, 50(4):401–457, 2005.
[TE90] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. In, pages 2130–2132. IEEE, 1990.
[TE93] L. Tassiulas and A. Ephremides. Dynamic server allocation to parallel queues with randomly varying connectivity. IEEE Transactions on Information Theory, 39(2):466–478, 1993.
[UKIN10] R. Urgaonkar, U. C. Kozat, K. Igarashi, and M. J. Neely. Dynamic resource allocation and power management in virtualized data centers. In Network Operations and Management Symposium (NOMS), 2010 IEEE, pages 479–486. IEEE, 2010.
[UWH+15] R. Urgaonkar, S. Wang, T. He, M. Zafer, K. Chan, and K. K. Leung. Dynamic service migration and workload scheduling in edge-clouds. Performance Evaluation, 91:205–228, 2015.
[Wal44] A. Wald. On cumulative sums of random variables. The Annals of Mathematical Statistics, 15(3):283–296, 1944.
[Wer13] C. Wernz. Multi-time-scale Markov decision processes for organizational decision-making. EURO Journal on Decision Processes, 1(3-4):299–324, 2013.
[Whi88] P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.
[WN15] X. Wei and M. J. Neely. Power-aware wireless file downloading: A Lyapunov indexing approach to a constrained restless bandit problem.
IEEE/ACM Transactions on Networking, 24(4):2264–2277, 2015.
[WN16] X. Wei and M. J. Neely. On the theory and application of distributed asynchronous optimization over weakly coupled renewal systems. arXiv preprint arXiv:1608.00195, 2016.
[WN17] X. Wei and M. J. Neely. Data center server provision: Distributed asynchronous control for coupled renewal systems. IEEE/ACM Transactions on Networking (TON), 25(4):2180–2194, 2017.
[WN18] X. Wei and M. J. Neely. Asynchronous optimization over weakly coupled renewal systems. Stochastic Systems, 8(3):167–191, 2018.
[WN19] X. Wei and M. J. Neely. Opportunistic scheduling over time varying renewal systems: An empirical method. arXiv preprint arXiv:1606.03463, 2019.
[WSLJ15] H. Wu, R. Srikant, X. Liu, and C. Jiang. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. In Advances in Neural Information Processing Systems, pages 433–441, 2015.
[WUZ+15] S. Wang, R. Urgaonkar, M. Zafer, T. He, K. Chan, and K. K. Leung. Dynamic service migration in mobile edge-clouds. In, pages 1–9. IEEE, 2015.
[WYN15] X. Wei, H. Yu, and M. J. Neely. A probabilistic sample path convergence time analysis of drift-plus-penalty algorithm for stochastic optimization. arXiv preprint arXiv:1510.02973, 2015.
[WYN18] X. Wei, H. Yu, and M. J. Neely. Online learning in weakly coupled Markov decision processes: A convergence time study. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(1):12, 2018.
[Yao02] D. D. Yao. Dynamic scheduling via polymatroid optimization. Proceeding Performance Evaluation of Complex Systems: Techniques and Tools, pages 89–113, 2002.
[YHS+12] Y. Yao, L. Huang, A. Sharma, L. Golubchik, and M. Neely. Data centers power reduction: A two time scale approach for delay tolerant workloads. In INFOCOM, 2012 Proceedings IEEE, pages 1431–1439. IEEE, 2012.
[YMS09] J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.
[YN16] H. Yu and M. J. Neely. A low complexity algorithm with O(√T) regret and finite constraint violations for online convex optimization with long term constraints. arXiv preprint arXiv:1604.02218, 2016.
[YN17] H. Yu and M. J. Neely. A simple parallel algorithm with an O(1/t) convergence rate for general convex programs. SIAM Journal on Optimization, 27(2):759–783, 2017.
[YNW17] H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. arXiv preprint arXiv:1708.03741, 2017.
[YT89] Y. Ye and E. Tse. An extension of Karmarkar's projective algorithm for convex quadratic programming. Mathematical Programming, 44(1):157–179, 1989.
[Zin03] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In