Low-Power Status Updates via Sleep-Wake Scheduling
Ahmed M. Bedewy, Yin Sun, Senior Member, IEEE, Rahul Singh, and Ness B. Shroff, Fellow, IEEE
Abstract—We consider the problem of optimizing the freshness of status updates that are sent from a large number of low-power sources to a common access point. The source nodes utilize carrier sensing to reduce collisions and adopt an asynchronized sleep-wake scheduling strategy to achieve a target network lifetime (e.g., 10 years). We use age of information (AoI) to measure the freshness of status updates, and design sleep-wake parameters for minimizing the weighted-sum peak AoI of the sources, subject to per-source battery lifetime constraints. When the sensing time (i.e., the time duration of carrier sensing) is zero, this sleep-wake design problem can be solved by resorting to a two-layer nested convex optimization procedure; however, for positive sensing times, the problem is non-convex. We devise a low-complexity solution to this problem and prove that, for practical sensing times that are short, the solution is within a small gap from the optimum AoI performance. When the mean transmission time of status-update packets is unknown, we devise a reinforcement learning algorithm that adaptively performs the following two tasks in an "efficient way": a) it learns the unknown parameter, and b) it generates efficient controls that make channel access decisions. We analyze its performance by quantifying its "regret", i.e., the sub-optimality gap between its average performance and the average performance of a controller that knows the mean transmission time. Our numerical and NS-3 simulation results show that our solution can indeed elongate the battery lifetime of information sources, while providing a competitive AoI performance.
This paper was presented in part at ACM MobiHoc 2020 [1]. This work has been supported in part by ONR grants N00014-17-1-2417 and N00014-15-1-2166, Army Research Office grants W911NF-14-1-0368 and MURI W911NF-12-1-0385, National Science Foundation grants CNS-1446582, CNS-1421576, CNS-1518829, and CCF-1813050, and a grant from the Defense Threat Reduction Agency HDTRA1-14-1-0058.

A. M. Bedewy is with the Department of ECE, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]). Y. Sun is with the Department of ECE, Auburn University, Auburn, AL 36849 USA (e-mail: [email protected]). R. Singh is with the Department of ECE, Indian Institute of Science, Bangalore 560012, India (e-mail: [email protected]). N. B. Shroff is with the Department of ECE and the Department of CSE, The Ohio State University, Columbus, OH 43210 USA (e-mail: [email protected]).
I. INTRODUCTION
In applications such as networked monitoring and control systems, wireless sensor networks, and autonomous vehicles, it is crucial for the destination node to receive timely status updates so that it can make accurate decisions.
Age of information (AoI) has been used to measure the freshness of status updates. More specifically, AoI [2] is the age of the freshest update at the destination, i.e., it is the time elapsed since the freshest received update was generated. It should be noted that optimizing traditional network performance metrics, such as throughput or delay, does not attain the goal of timely updating. For instance, it is well known that AoI could become very large when the offered load is high or low [2]. In other words, AoI captures the information lag at the destination, and is hence more apt for achieving the goal of timely updates. Thus, AoI has recently attracted a lot of interest (see [3], [4] and references therein).

In a variety of information update systems, energy consumption is also a critical concern. For example, wireless sensor networks are used for monitoring crucial natural and human-related activities, e.g., forest fires, earthquakes, tsunamis, etc. Since such applications often require the deployment of sensor nodes in remote or hard-to-reach areas, they need to be able to operate unattended for long durations. Likewise, in medical sensor networks, battery replacement/recharging involves a series of medical procedures, leading to disutility to patients. Hence, energy consumption must be constrained in order to support a long battery life of 10-15 years [5]. For networks serving such real-time applications, prolonging battery life is crucial. Existing works on multi-source networks, e.g., [8]–[20], focused exclusively on minimizing the AoI and overlooked the need to reduce power consumption. This motivates us to derive scheduling algorithms that achieve a trade-off between the competing tasks of minimizing AoI and reducing the energy consumption in multi-source networks.

Footnote: The computations performed in [5] are based on the specifications of commercially used devices. For example, the used transceiver is the 2.4 GHz CC2420 chipset from Chipcon [6], and the used microcontroller is the Motorola 8-bit microcontroller MC9S08RE8 [7]. For more detail about the supply voltage and current consumption, please see the aforementioned references.

Additionally, some status-update systems consist of a large number (e.g., hundreds of thousands) of densely packed wireless nodes, which are serviced by a single access point (AP). Examples include massive machine-type communications [21]. The data loads in such "dense networks" [21], [22] are created by applications such as home security and automation, oilfield and pipeline monitoring, smart agriculture, animal and livestock tracking, etc. This introduces high variability in the data packet sizes, so that the transmission times of data packets are random. Thus, scheduling algorithms designed for time-slotted systems with a fixed transmission duration are not applicable to these systems. Besides that, synchronized schedulers for time-slotted systems are feasible when there are relatively few sources and each source has sufficient energy. However, if there are a huge number of sources, the signaling overhead could be quite high. Since each source may have limited energy and a low traffic rate, the system could be highly inefficient. This motivates us to design asynchronized medium access protocols that coordinate the transmissions of multiple conflicting transmitters connected to a single AP.

Towards that end, we consider a wireless network with M sources that contend for channel access and communicate their update packets to an AP. Each source is equipped with a battery that may get charged by a renewable source of energy, e.g., solar.
Moreover, each source employs a sleep-wake scheduling scheme [23] under which the source transmits a packet if the channel is idle, and sleeps if either: (i) the channel is busy, or (ii) it has completed a packet transmission. This enables each source to save its precious battery energy by switching off when it is unlikely to gain channel access for packet transmissions. However, since a source cannot transmit during the sleep period, this causes the AoI to increase. We carefully design the sleep-wake parameters to minimize the weighted-sum peak age of the sources, while ensuring that the battery lifetime constraint of each source is satisfied.

A. Related Works
There have been significant recent efforts on analyzing the AoI performance of popular queueing service disciplines, e.g., First-Come, First-Served (FCFS) [2], Last-Come, First-Served (LCFS) with and without preemption [24], and queueing systems with packet management [25]. In [18], [26]–[29], the age-optimality of Last-Generated, First-Served (LGFS)-type policies in multi-server and multi-hop networks was established, where it was shown that these policies can minimize any non-decreasing functional of the age processes. The design of data sampling and transmission in information update systems was investigated in [30], [31], where sampling policies were derived to minimize nonlinear age functions in single-source systems. In [31], it was shown that a variety of information freshness metrics can be represented as monotonic functions of the age. The studies in [30], [31] were later extended to a multi-source scenario in [32], [33].

Designing scheduling policies for minimizing AoI in multi-source networks has recently received increasing attention, e.g., [8]–[17]. Of particular interest are those pertaining to designing distributed scheduling policies [8]–[13]. The work in [8] considered a slotted ALOHA-like random access scheme in which each node accesses the channel with a certain access probability. These probabilities were then optimized in order to minimize the AoI. However, the model of [8] allows multiple interfering users to gain channel access simultaneously, and hence allows for collisions. The authors in [9] generalized the work in [8] to a wireless network in which the interference is described by a general interference model. The Round Robin or Maximum Age First policy was shown to be (near) age-optimal for different system models, e.g., in [10]–[13], [18]. Carrier sensing distributed medium access mechanisms, e.g., Carrier Sense Multiple Access (CSMA), have been widely adopted in many wireless networks; see [34], [35] for a recent survey.
There has been an interest in designing CSMA-based scheduling schemes that optimize the AoI [36], [37]. In [36], the authors designed an idealized CSMA (similar to that in [38]) to minimize the AoI with exponentially distributed packet transmission times. In [37], the authors designed a slotted Carrier Sense Multiple Access/Collision Avoidance (CSMA/CA) scheme (similar to that in [39]) to minimize the broadcast age of information, which is defined, from a sender's perspective, as the age of the freshest successfully broadcasted packet. Contrary to these works, the sleep-wake scheduling scheme proposed by us emphasizes reducing the cumulative energy consumption in multi-source networks in addition to minimizing the total weighted AoI. Moreover, in our study, the transmission times are not necessarily random variables with some commonly used parametric density [36], or deterministic [37], but can be any generally distributed random variables with a finite mean.
B. Key Contributions
Our key contributions are summarized as follows:

• In our model, sources utilize an asynchronized sleep-wake scheduling strategy to achieve an extended battery lifetime. We aim at designing the mean sleeping period of each source, which controls its channel access probability, in order to minimize the total weighted average peak age of the sources while simultaneously meeting per-source battery lifetime constraints. Although the aforementioned optimization problem is non-convex, we devise a solution. In the regime in which the sensing time is negligible compared to the packet transmission time, the proposed solution is near-optimal (Theorem 1 and Theorem 3). Our near-optimality results hold for general distributions of the packet transmission times.

• We propose an algorithm that can be easily implemented in many practical control systems. In particular, our solution requires the knowledge of only two variables in its implementation. These two variables are functions of the network parameters. An implementation procedure to compute these two variables is provided.

• As the ratio between the sensing time and the packet transmission time reduces to zero, we show that the age performance of our proposed algorithm is as good as that of the optimal synchronized scheduler (e.g., for time-slotted systems).

• Finally, since our solution is a function of the mean transmission time of data packets, the network operator needs to know this quantity in order to implement the algorithm. The transmission times, however, depend upon the environmental conditions, which in turn are hard to predict before the system operation begins. To overcome this challenge, we develop a reinforcement learning (RL) [40]–[42] algorithm that maintains an estimate of the (unknown) mean transmission time, and then utilizes this estimate in order to derive a solution that is "seemingly optimal" for the true system.
We show that the regret of the proposed RL algorithm scales as Õ(√H), where H is the operating time horizon.

II. MODEL AND FORMULATION
A. Network Model and Sleep-wake Scheduling
Consider a wireless network composed of M source nodes, each observing a time-varying signal. The sources generate update packets of the observed signals and send the packets to an access point (AP) over a shared spectrum band. If multiple sources transmit packets simultaneously, a packet collision occurs and these packet transmissions fail.

Footnote: Õ hides factors that are logarithmic in H.

The sources use a sleep-wake scheduling scheme to access the shared spectrum, where each source switches between a sleep mode and a transmission mode over time, according to the following rules: Upon waking from the sleep mode, a source first performs carrier sensing to check whether the channel is occupied by another source, as illustrated in Figure 1. The time duration of carrier sensing is denoted as t_s, which is sufficiently long to ensure a high sensing accuracy. If the channel is sensed to be busy, the source enters the sleep mode directly; otherwise, the source generates an update packet and sends it over the channel. The source hereafter goes back to the sleep mode.

In the above sleep-wake scheduling scheme, if two sources start transmitting within a time duration of t_s, then their sensing periods overlap and they may not be able to detect the transmission of each other. In order to obtain a robust system design, we consider that they cannot detect each other's transmission and a collision occurs. Upon completing a packet transmission, sources switch to the reception mode and wait for an acknowledgement (ACK) that indicates the outcome of their transmission (successful transmission or collision). They then go back to the sleep mode.

A sleep-wake cycle, or simply a cycle, is defined as the time period between the ends of two successive packet transmission or collision events. Each cycle consists of an idle period and a transmission/collision period.
As depicted in Figure 1, the packet transmissions in Cycles 1-2 are successful, but a collision occurs in Cycle 3 because Sources 1 and 2 wake up within a short duration t_s. We use T_j, j ∈ {1, 2, ...}, to represent the time incurred by the j-th packet transmission or collision event, which includes the transmission/collision time and feedback delays. For example, in Figure 1, T_1 is the time duration of the packet transmission event by Source 1, while T_3 is the time duration of the collision event between Sources 1 and 2.

Footnote: To make the sleep-wake scheduling problem solvable analytically, we make several approximations. For example, in the 802.11b frame structure, there exists a Short Inter-frame Space (SIFS) between the packet transmission frame and the ACK frame (i.e., the CTS frame). If another source wakes up during the SIFS, then it may not detect the transmission/ACK frames, leading to unexpected collisions. In our analytical model, such collision events are omitted. In other words, we suppose that each cycle must start with an idle period, where all sources are in the sleep mode, followed by a transmission/collision period. NS-3 simulation results will be provided in Section VI-B to show that these approximations have a negligible impact on the age performance of our solution.

We assume that the T_j's are i.i.d. for all transmission and collision events, with a general distribution. This assumption does not hold
in practice. Nonetheless, NS-3 simulation results in Section VI-B show that this assumption has a negligible impact on the performance of the proposed algorithm. When there is no confusion, we omit the subscript j of T_j for simplicity, and use T to denote the transmission/collision time, which is assumed to have a finite mean, i.e., E[T] < ∞. The sleep periods of source l are exponentially distributed random variables with mean value E[T]/r_l, and are independent across sources and i.i.d. across time. Notice that the sleep period parameter r_l > 0 has been normalized by the mean transmission time E[T]. Let r = (r_1, ..., r_M) be the vector comprising these sleep period parameters.

Figure 1: Illustration of the sleep-wake cycles. In Cycles 1-2, we have successful packet transmissions. Let S_1 and S_2 represent the remaining sleeping times of Sources 1 and 2, respectively, after a successful transmission. Then, a collision occurs in Cycle 3 because the difference between the wake-up times of Sources 1 and 2 is less than t_s, i.e., S_1 − S_2 < t_s. As we can observe, each cycle consists of an idle period before a transmission/collision event.

B. Total Weighted Average Peak Age
Let U_l(t) represent the generation time of the most recently delivered packet from source l by time t. Then, the age of information, or simply the age, of source l is defined as [2]

∆_l(t) = t − U_l(t),   (1)

where ∆_l(t) is right-continuous. As shown in Figure 2, the age increases linearly with t, but is reset to a smaller value upon the delivery of a fresher packet. Observe that a small age ∆_l(t) indicates that the AP has a fresh status update packet that was generated at source l recently. Hence, it is desirable to keep ∆_l(t) small for all the sources.

Let us introduce some notation and definitions. Let i_l be the index of the i-th delivered packet from source l. We use t_{l,i} and t'_{l,i} to denote the generation and delivery times, respectively, of the i-th delivered packet from source l, such that t'_{l,i} − t_{l,i} = T_{i_l}.

Figure 2: The age ∆_l(t) of source l.

Let I_{l,i} = t'_{l,i} − t'_{l,i−1} denote the i-th inter-departure time of source l, which satisfies E[I_{l,i}] = E[I_l] for all i. The i-th peak age of source l, denoted by ∆^peak_{l,i}, is defined as the AoI of source l right before the i-th packet delivery from source l. As shown in Figure 2, we have

∆^peak_{l,i} = ∆_l(t'^−_{l,i}),   (2)

where t'^−_{l,i} is the time instant just before the delivery time t'_{l,i}. One can observe from Figure 2 that the peak age is [25]

∆^peak_{l,i} = T_{(i−1)_l} + I_{l,i}.   (3)

Hence, the average peak age of source l is given by

E[∆^peak_l] = E[T] + E[I_l],   (4)

where we omit the subscripts i and i_l since the I_{l,i}'s and T_{i_l}'s are i.i.d. across time.

Footnote: A packet of a particular source is deemed delivered when the source receives the feedback.

The average peak age metric provides information regarding the worst-case age, with the advantage
Thus, it is suitable for applications that have an upperbound restriction on AoI.We now derive an expression for E [ I l ] . Let α l be theprobability of the event that the source l obtains channel accessand successfully transmits a packet within a sleep-wake cycle.As shown in [23], one can utilize the memoryless property ofexponential distributed sleep periods to get α l = r l e r l ts E [ T ] e P Mi =1 r i ts E [ T ] P Mi =1 r i . (5)To keep the paper self-contained, we provide the derivation of(5) in Appendix A. Let N l denote the total number of sleep-wake cycles between two subsequent successful transmissionsof source l . Because the probability that source l obtainschannel access and transmits successfully in a given cycle is α l , N l is geometrically distributed with mean α l . By this and(5), we get E [ N l ] = e P Mi =1 r i ts E [ T ] P Mi =1 r i r l e r l ts E [ T ] . (6)An inter-departure time duration of source l is composed of N l consecutive sleep-wake cycles. With a slight abuse of notation,let cycle l,k denote the duration of the k -th sleep-wake cycleafter a successful transmission of source l . Hence, E [ I l ] = E " N l X k =1 cycle l,k . (7)Note that cycle l,k ’s are i.i.d. across time. Moreover, since theevent ( N l = n ) depends only on the history, N l is a stoppingtime [43]. Hence, it follows from Wald’s identity [44] that E [ I l ] = E [ N l ] E [ cycle ] , (8)where E [ cycle ] is the mean duration of a sleep-wake cy-cle. Each cycle consists of an idle period and a transmis-sion/collision time, see Figure 1. Using the memoryless prop-erty of exponential distribution, we observe that the idle periodis the minimum of i.i.d. exponential random variables. Thus, itcan be shown that the idle period in each cycle is exponentiallydistributed with mean value equal to E [ T ] / P Mi =1 r i , where E [ T ] /r l is the mean of sleep periods of source l . Hence, wehave E [ cycle ] = E [ T ] P Mi =1 r i + E [ T ] . 
Substituting the expressions for E[N_l] and E[cycle] from (6) and (9), respectively, into (8) and (4), we obtain

E[∆^peak_l] = (E[T]/r_l) e^{−r_l t_s/E[T]} e^{(∑_{i=1}^M r_i) t_s/E[T]} (∑_{i=1}^M r_i + 1) + E[T].   (10)

In this paper, we aim to minimize the total weighted average peak age, which is given by

∑_{l=1}^M w_l E[∆^peak_l] = ∑_{l=1}^M (w_l E[T]/r_l) e^{−r_l t_s/E[T]} e^{(∑_{i=1}^M r_i) t_s/E[T]} (∑_{i=1}^M r_i + 1) + ∑_{l=1}^M w_l E[T],   (11)

where w_l > 0 is the weight of source l. These weights enable us to prioritize the sources according to their relative importance [9], [15].

C. Energy Constraint
Each source is equipped with a battery that can possibly be recharged by a renewable energy source, such as solar. In typical wireless sensor networks, sources have a much smaller power consumption in the sleep mode than in the transmission mode. For example, if the sensor is equipped with the radio unit TR 1000 from RF Monolithics [45], [46], the power consumption in the sleep mode is 15 µW, while the power consumption in the transmission mode is 24.75 mW. Motivated by this, we assume that the energy dissipation during the sleep mode is negligible as compared to the power consumption in the transmission mode. Moreover, we assume that the sensing time duration t_s is much shorter than the transmission time, and hence neglect the energy consumed during channel sensing. In Section VI-B, we show that these assumptions have a negligible effect on the performance of the proposed sleep-wake scheduling algorithm. Under these assumptions, the amount of energy used by a source is equal to the amount of energy consumed in packet transmissions and feedback receptions.

The energy constraint on source l is described by the following parameters: a) the initial battery level B_l, which denotes the initial amount of energy stored in the battery; b) the target lifetime D_l, which is the minimum time duration that source l should be active before its battery is depleted; c) the average energy replenishment rate R_l, which is the rate at which the battery of source l receives energy from its energy source. If source l does not have access to an energy source, then we have R_l = 0.

Footnote: It is assumed that R_l is either known, or it can be estimated accurately.
Define P_max,l for source l as

P_max,l = B_l/D_l + R_l, ∀l,   (12)

where P_max,l is the maximum allowable power consumption of source l such that the target lifetime D_l is met.

For the sleep-wake scheduling mechanism under consideration, it has been shown in [23] that the fraction of time in which source l is in the transmission mode is given by

σ_l = ([1 − e^{−r_l t_s/E[T]}] ∑_{i=1}^M r_i + r_l e^{−r_l t_s/E[T]}) / (∑_{i=1}^M r_i + 1).   (13)

For the sake of completeness, the derivation of σ_l is provided in Appendix B. Let P_avg,l denote the average power consumption of source l in the transmission mode. Then the actual power consumption of source l, denoted by P_act,l, is given by

P_act,l = σ_l P_avg,l, ∀l.   (14)

For source l to achieve its target lifetime D_l, we must have

P_act,l ≤ P_max,l, ∀l.   (15)

Define b_l ≜ P_max,l/P_avg,l as the target power efficiency of source l. By using (13)-(14), the constraints in (15) can be rewritten as

σ_l = ([1 − e^{−r_l t_s/E[T]}] ∑_{i=1}^M r_i + r_l e^{−r_l t_s/E[T]}) / (∑_{i=1}^M r_i + 1) ≤ b_l, ∀l.   (16)

Because σ_l ≤ 1, if b_l ≥ 1, then constraint (16) is always satisfied.

D. Problem Formulation
Our goal is to find the optimal sleep-wake parameters r that minimize the total weighted average peak age in (11), while simultaneously ensuring the energy constraints (16) for all sources. Dividing the objective function (11) by E[T], we obtain the following optimization problem:

(Problem 1)

¯∆^w-peak_opt ≜ min_{r_l > 0} ∑_{l=1}^M (w_l/r_l) e^{−r_l t_s/E[T]} e^{(∑_{i=1}^M r_i) t_s/E[T]} (∑_{i=1}^M r_i + 1) + ∑_{l=1}^M w_l

s.t. ([1 − e^{−r_l t_s/E[T]}] ∑_{i=1}^M r_i + r_l e^{−r_l t_s/E[T]}) / (∑_{i=1}^M r_i + 1) ≤ b_l, ∀l,   (17)

where ¯∆^w-peak_opt is the optimal objective value of Problem 1. We will use ¯∆^w-peak(r) to denote the objective value for given sleeping period parameters r. One can notice from (17) that the optimal sleeping period parameters depend on the sensing time t_s and the mean transmission time E[T] only through their ratio t_s/E[T]. This insight plays a crucial role in the subsequent analysis of Problem 1.

III. MAIN RESULTS
When t_s = 0, although Problem 1 is non-convex, it can be solved by defining an auxiliary variable y = ∑_{i=1}^M r_i + 1 and applying a nested optimization algorithm: In the inner layer, we optimize r_l for a given y. Then, we write the optimized objective as a function of y. In the outer layer, we optimize y. It happens that the inner and outer layer optimization problems are both convex. The details can be found in Section III-C. However, this method does not work for positive sensing times t_s > 0, and Problem 1 remains non-convex. Hence, it is challenging to optimize r for positive t_s. In this section, we develop a low-complexity closed-form solution which is shown to be near-optimal if the sensing time t_s is short as compared with the mean transmission time E[T]. Our solution is developed by considering the following two regimes separately: (i) the energy-adequate regime, denoted by ∑_{i=1}^M b_i ≥ 1, where this condition means that the sources have a sufficient amount of total energy to ensure that at least one source is awake at any time; (ii) the energy-scarce regime, represented by ∑_{i=1}^M b_i < 1, which indicates that the sources have to sleep for some time to meet the sources' energy constraints.

A. Energy-adequate Regime
In the energy-adequate regime ∑_{i=1}^M b_i ≥ 1, our solution r⋆ := (r⋆_1, ..., r⋆_M) is given as

r⋆_l = min{b_l, β⋆ √w_l} x⋆, ∀l,   (18)

where x⋆ and β⋆ are expressed in terms of the parameters {b_i, w_i}_{i=1}^M and t_s/E[T] as follows:

x⋆ = −1/2 + √(1/4 + E[T]/t_s),   (19)

and β⋆ is the unique root of

∑_{i=1}^M min{b_i, β⋆ √w_i} = 1.   (20)

The performance of the above solution r⋆ is manifested in the following theorem:

Theorem 1 (Near-optimality). If ∑_{i=1}^M b_i ≥ 1, then the solution r⋆ in (18)-(20) is near-optimal for solving (17) when t_s/E[T] is sufficiently small, in the following sense:

|¯∆^w-peak(r⋆) − ¯∆^w-peak_opt| ≤ √(t_s/E[T]) C_1 + o(√(t_s/E[T])),   (21)

where

C_1 = ∑_{i=1}^M w_i / min{b_i, β⋆ √w_i}.   (22)

Proof.
See Section IV-A.

From Theorem 1, we can obtain the following corollary:
Corollary 2 (Asymptotic optimality). If ∑_{i=1}^M b_i ≥ 1, then the solution r⋆ in (18)-(20) is asymptotically optimal for Problem 1 in (17) as t_s/E[T] → 0, i.e.,

lim_{t_s/E[T] → 0} |¯∆^w-peak(r⋆) − ¯∆^w-peak_opt| = 0.   (23)

Moreover, the asymptotic optimal objective value of Problem 1 as t_s/E[T] → 0 is

lim_{t_s/E[T] → 0} ¯∆^w-peak_opt = ∑_{i=1}^M [w_i / min{b_i, β⋆ √w_i} + w_i].   (24)

Proof.
See Section IV-A.
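The solution (18)-(20) is straightforward to compute: β⋆ is the root of the nondecreasing map β ↦ ∑_i min{b_i, β√w_i}, which bisection finds quickly. The sketch below (our own code and naming, not from the paper) computes r⋆ and also evaluates the normalized objective of (17) and the duty cycle (13), so that the energy constraint (16) can be checked numerically:

```python
import math

def solve_energy_adequate(w, b, tau):
    """Compute (x*, beta*, r*) from (18)-(20); tau = t_s/E[T].
    Requires the energy-adequate condition sum(b) >= 1."""
    assert sum(b) >= 1.0
    sw = [math.sqrt(wl) for wl in w]
    g = lambda beta: sum(min(bl, beta * s) for bl, s in zip(b, sw))
    lo, hi = 0.0, max(bl / s for bl, s in zip(b, sw)) + 1.0  # g(hi) = sum(b) >= 1
    for _ in range(100):                                     # bisection on (20)
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 1.0 else (lo, mid)
    beta = 0.5 * (lo + hi)
    x = -0.5 + math.sqrt(0.25 + 1.0 / tau)                   # (19)
    r = [min(bl, beta * s) * x for bl, s in zip(b, sw)]      # (18)
    return x, beta, r

def objective(r, w, tau):
    """Normalized objective of Problem 1 in (17)."""
    R = sum(r)
    return sum(wl / rl * math.exp((R - rl) * tau) * (R + 1.0)
               for wl, rl in zip(w, r)) + sum(w)

def duty_cycle(r, tau):
    """Fraction of time each source transmits, eq. (13)."""
    R = sum(r)
    return [((1.0 - math.exp(-rl * tau)) * R + rl * math.exp(-rl * tau))
            / (R + 1.0) for rl in r]
```

For instance, with w = (1, 2), b = (0.8, 0.9), and t_s/E[T] = 0.01, neither source is clamped by b_l, so β⋆ = 1/(1 + √2), the resulting duty cycles satisfy (16) with slack, and ∑_i min{b_i, β⋆√w_i} = 1 up to bisection accuracy.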
B. Energy-scarce Regime
Now, we present a solution to Problem 1 in the energy-scarce regime ∑_{i=1}^M b_i < 1, and show that it is near-optimal. The solution r⋆ of the energy-scarce regime is again given by (18), where x⋆ and β⋆ are

x⋆ = (min_l c_l) / (1 − ∑_{i=1}^M b_i),   β⋆ = ∑_{i=1}^M √w_i,   (25)

and

c_l = 2 b_l (1 − ∑_{i=1}^M b_i) / Q_l,   (26)

Q_l = b_l (1 − ∑_{i=1}^M b_i) + √( b_l^2 (1 − ∑_{i=1}^M b_i)^2 + 4 b_l (1 − ∑_{i=1}^M b_i)(∑_{i=1}^M b_i − b_l) t_s/E[T] ).   (27)

Footnote: We use the standard order notation: f(h) = O(g(h)) means z_1 ≤ lim_{h→0} f(h)/g(h) ≤ z_2 for some constants z_1 > 0 and z_2 > 0, while f(h) = o(g(h)) means lim_{h→0} f(h)/g(h) = 0.

Footnote: Observe that, according to (24), the asymptotic optimal average peak age of source l is (1/min{b_l, β⋆√w_l} + 1), which decreases with the weight w_l. The weighted average peak age is w_l(1/min{b_l, β⋆√w_l} + 1), which increases with w_l. This phenomenon is reasonable and agrees with our expectation.

The near-optimality of the proposed solution (i.e., r⋆) in the energy-scarce regime is explained in the following theorem:

Theorem 3 (Near-optimality). If ∑_{i=1}^M b_i < 1, then the solution r⋆ in (18) and (25)-(27) is near-optimal for solving (17) when t_s/E[T] is sufficiently small, in the following sense:

|¯∆^w-peak(r⋆) − ¯∆^w-peak_opt| ≤ (t_s/E[T]) C_2 + o(t_s/E[T]),   (28)

where

C_2 = ∑_{l=1}^M (w_l / (b_l (1 − ∑_{i=1}^M b_i))) (∑_{i=1}^M b_i − min_j b_j).   (29)

Proof.
See Section IV-B.

We obtain the following corollary from Theorem 3.
Corollary 4 (Asymptotic optimality). If ∑_{i=1}^M b_i < 1, then (23) holds for the solution r⋆ in (18) and (25)-(27). In other words, our proposed solution is asymptotically optimal for Problem 1 in (17) as t_s/E[T] → 0. Moreover, the asymptotic optimal objective value of Problem 1 as t_s/E[T] → 0 is

lim_{t_s/E[T] → 0} ¯∆^w-peak_opt = ∑_{i=1}^M [w_i / min{b_i, β⋆ √w_i} + w_i] = ∑_{i=1}^M [w_i/b_i + w_i].   (30)

Proof.
See Section IV-B.

Interestingly, the asymptotic optimal objective values of Problem 1 in both regimes, given by (24) and (30), are of an identical expression. However, in the energy-scarce regime, we can observe that β⋆, which is defined in (25), always satisfies min{b_l, β⋆√w_l} = b_l for all l.

Remark 1.
We would like to point out that the condition t_s/E[T] ≈ 0 is satisfied in many practical applications. For instance, in a wireless sensor network that is equipped with low-power UHF transceivers [47], the carrier sensing time is t_s = 40 µs, while the transmission time is on the order of milliseconds. Hence, t_s/E[T] ≈ 0.

C. Discussion

In this subsection, we present a simple implementation of our proposed solution, discuss the nested convex optimization method that can be used to solve Problem 1 when t_s = 0, provide some useful insights about our proposed solution at the limit point t_s/E[T] → 0, and provide a comparison with the performance of synchronized schedulers.
1) Implementation of Sleep-wake Scheduling:
We devise a simple algorithm to compute our solution r⋆, which is provided in Algorithm 1. Notice that r⋆ has the same expression (18) in the energy-adequate and energy-scarce regimes. We exploit this fact to simplify the implementation of sleep-wake scheduling. In particular, the sources report w_l and b_l to the AP, which computes β⋆ and x⋆, and broadcasts them back to the sources. After receiving β⋆ and x⋆, source l computes r⋆_l based on (18). In practical wireless sensor networks, e.g., smart city networks and industrial control sensor networks [48], [49], the sensors report their measurements via an access point (AP). Hence, it is reasonable to employ the AP in implementing the sleep-wake scheduler.

Algorithm 1:
Implementation of sleep-wake scheduler.

  The AP gathers the parameters {(w_i, b_i)}_{i=1}^M and t_s/E[T];
  if ∑_{i=1}^M b_i ≥ 1 then
    The AP computes x⋆, β⋆ from (19) and (20);
  else
    The AP computes x⋆, β⋆ from (25)-(27);
  end
  The AP broadcasts x⋆, β⋆ to all the M sources;
  Upon hearing x⋆, β⋆, source l computes r⋆_l from (18);

In the above implementation procedure, the sources do not need to know whether the overall network is in the energy-adequate or energy-scarce regime; only the AP knows about it. Further, the amount of downlink signaling overhead is small, because only two parameters, β⋆ and x⋆, are broadcast to the sources. Moreover, when the node density is high, the scalability of the network is a crucial concern and reporting w_l and b_l for each source is impractical. In this case, the AP can compute β⋆ and x⋆ by estimating the distribution of w_l and b_l, as well as the number of source nodes, which reduces the uplink signaling overhead. Finally, when sources are not in the hearing range of each other, hidden/exposed source problems arise. These problems are challenging to solve analytically. However, they can be addressed by designing practical heuristic solutions based on the theoretical solutions. One design method was given in [23].
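The else-branch of Algorithm 1 admits the same treatment. Below is a sketch of the energy-scarce computation (our own code and naming; the grouping of terms follows our reading of (25)-(27), so treat the exact constants as an assumption rather than the paper's verbatim recipe). Note that c_l ∈ (0, 1], and the sketch recovers the limit (36) as t_s/E[T] → 0:

```python
import math

def solve_energy_scarce(w, b, tau):
    """Compute (x*, beta*, r*) from (18) and (25)-(27) in the
    energy-scarce regime sum(b) < 1; tau = t_s/E[T]."""
    B = sum(b)
    assert B < 1.0
    c = []
    for bl in b:
        A = bl * (1.0 - B)                                  # b_l (1 - sum_i b_i)
        Q = A + math.sqrt(A * A + 4.0 * A * (B - bl) * tau)  # (27)
        c.append(2.0 * A / Q)                                # (26); c_l in (0, 1]
    x = min(c) / (1.0 - B)                                   # (25)
    beta = sum(math.sqrt(wl) for wl in w)                    # (25)
    # In this regime min{b_l, beta*sqrt(w_l)} = b_l for all l, so (18) reduces to:
    r = [bl * x for bl in b]
    return x, beta, r
```

As t_s/E[T] → 0, Q_l → 2b_l(1 − ∑_i b_i), hence c_l → 1 and r⋆_l → b_l/(1 − ∑_i b_i), which is the limit (36); for positive sensing times, x⋆ shrinks, lengthening the sleep periods.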
2) The Nested Convex Optimization Method for $t_s = 0$: If $t_s = 0$, Problem 1 reduces to the following optimization problem:

$\bar{\Delta}^{\text{w-peak}}_{\text{opt}} \triangleq \min_{r_l > 0} \sum_{l=1}^M \frac{w_l \left(\sum_{i=1}^M r_i + 1\right)}{r_l} + \sum_{l=1}^M w_l$, s.t. $r_l \le b_l \left(\sum_{i=1}^M r_i + 1\right), \forall l$. (31)

Observe that the optimization problem in (31) is non-convex. To bypass this difficulty, we use an auxiliary variable $y = \sum_{i=1}^M r_i + 1$. Hence, we obtain the following optimization problem for given $y$:

$\min_{r_i > 0} \sum_{i=1}^M \left[ \frac{w_i y}{r_i} + w_i \right]$ (32)
s.t. $r_l \le b_l y, \forall l$, (33)
$\sum_{i=1}^M r_i + 1 = y$. (34)

The objective function in (32) is convex. Moreover, the constraints in (33) and (34) are affine. Hence, Problem (32) is convex. Exploiting (32), we solve (31) by using a two-layer nested convex optimization method: in the inner layer, we optimize $r$ for given $y$; having solved for $r$, we then optimize $y$ in the outer layer. This technique is used in the proof of Lemma 8 in Appendix D, where the reader can find the detailed solution.
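The two-layer procedure can be prototyped numerically. The sketch below solves the inner problem (32)-(34) via its KKT structure, $r_i = \min\{b_i y, \sqrt{w_i y/\lambda}\}$ with the multiplier $\lambda$ chosen by bisection so that (34) holds, and then scans $y$ in the outer layer. The grid bounds, tolerances, and input values are arbitrary illustrative choices, not part of the paper's solution (the actual derivation is in Appendix D).

```python
import math

def inner(w, b, y, tol=1e-12):
    """Inner layer: min sum_i w_i*y/r_i  s.t. r_i <= b_i*y, sum_i r_i = y - 1.
    Stationarity gives r_i = min(b_i*y, sqrt(w_i*y/lam)); bisect on lam."""
    target = y - 1.0
    assert sum(bi * y for bi in b) >= target, "infeasible y"
    r_of = lambda lam: [min(bi * y, math.sqrt(wi * y / lam))
                        for wi, bi in zip(w, b)]
    lo, hi = 1e-12, 1.0
    while sum(r_of(hi)) > target:      # sum(r) is decreasing in lam
        hi *= 2
    while hi / lo > 1 + tol:           # geometric bisection on lam
        mid = math.sqrt(lo * hi)
        lo, hi = (lo, mid) if sum(r_of(mid)) < target else (mid, hi)
    return r_of(math.sqrt(lo * hi))

def nested(w, b, y_grid):
    """Outer layer: scan y and keep the best inner solution."""
    best = (float("inf"), None)
    for y in y_grid:
        r = inner(w, b, y)
        obj = sum(wi * y / ri for wi, ri in zip(w, r)) + sum(w)
        best = min(best, (obj, r))
    return best

w, b = [4.0, 1.0], [0.9, 0.9]                            # illustrative inputs
y_grid = [1.0 + 10 ** (k / 10) for k in range(-10, 41)]  # y up to ~1e4
obj, r = nested(w, b, y_grid)
print(obj)   # approaches sum_i [w_i/min{b_i, beta*sqrt(w_i)} + w_i] = 14
```

Consistent with the asymptotic behavior discussed in this section ($r^\star_l \to \infty$ in the energy-adequate regime when $t_s = 0$), the outer objective keeps improving as $y$ grows and approaches the lower-bound value from below-threshold saturation only in the limit.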
3) Asymptotic Behavior of the Optimal Solution: In the energy-adequate regime, the sleeping-period parameter $r^\star_l$ of source $l$ tends to infinity as $t_s/E[T] \to 0$, while the ratio $r^\star_l/r^\star_i$ between sources $l$ and $i$ is kept constant for all $l$ and $i$. Hence, the sleeping time of the sources tends to zero. Meanwhile, since $t_s/E[T] \to 0$, the sensing time becomes negligible. The channel access probability of source $l$ in this limit can be computed as

$\lim_{t_s/E[T] \to 0} \sigma^\star_l = \min\{ b_l, \beta^\star \sqrt{w_l} \}$. (35)

Because of (20), $\lim_{t_s/E[T] \to 0} \sum_{i=1}^M \sigma^\star_i = 1$. Hence, the channel is occupied by the sources at all times, without any time overhead spent on sensing or sleeping.

On the other hand, in the energy-scarce regime, the sleeping-period parameter $r^\star_l$ of source $l$ converges to a constant value as $t_s/E[T] \to 0$, i.e., we have

$\lim_{t_s/E[T] \to 0} r^\star_l = \frac{b_l}{1 - \sum_{i=1}^M b_i}$. (36)

Since the cumulative energy is scarce, the sources necessarily need to stay idle for some time in order to meet their target lifetimes. Hence, sleep periods are imposed to achieve the optimal trade-off between minimizing AoI and energy consumption.
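The limit (36) can be checked to be the fixed point at which the energy constraint of Problem 1 holds with equality, $r_l = b_l(\sum_i r_i + 1)$: summing over $l$ gives $\sum_i r_i = \sum_i b_i/(1 - \sum_i b_i)$, and substituting back yields $r_l = b_l/(1 - \sum_i b_i)$. A quick numerical check (the $b_l$ values are arbitrary energy-scarce choices):

```python
# Verify that (36) is the fixed point of r_l = b_l * (sum_i r_i + 1).
b = [0.2, 0.3, 0.1]                   # energy-scarce: sum(b) = 0.6 < 1
r = [bl / (1 - sum(b)) for bl in b]   # the limit in (36)
S = sum(r)
for bl, rl in zip(b, r):
    assert abs(rl - bl * (S + 1)) < 1e-12
print(r)                              # approximately [0.5, 0.75, 0.25]
```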
4) Comparison with the Performance of Synchronized Schedulers: We would like to show that the performance of our proposed algorithm is asymptotically no worse than that of any synchronized (e.g., centralized) scheduler. Consider a scheduler in which the fraction of time during which source $l$ transmits update packets is equal to $a_l$, where $a = \{a_l\}_{l=1}^M$ and $\sum_{i=1}^M a_i \le 1$. In this scheduler, only one source is allowed to access the channel at a time, i.e., there are no collisions (this can be achieved either by a deterministic scheduler or by assigning a channel access probability $a_l$ to each source $l$ after each packet transmission). We can perform an analysis similar to that of Section II-B, and show that the total weighted average peak age of a synchronized scheduler is given by

$\sum_{i=1}^M \left[ \frac{w_i E[T]}{a_i} + w_i E[T] \right]$. (37)

Hence, the problem of designing an optimal synchronized scheduler that minimizes the total weighted average peak age under energy constraints can be cast as

$\bar{\Delta}^{\text{w-peak}}_{\text{opt-s}} \triangleq \min_{a_i > 0} \sum_{i=1}^M \left[ \frac{w_i}{a_i} + w_i \right]$ (38)
s.t. $a_l \le b_l, \forall l$, (39)
$\sum_{i=1}^M a_i \le 1$, (40)

where we have divided the objective function by $E[T]$. Next, we show that the performance of our proposed algorithm converges to that of the optimal synchronized scheduler as $t_s/E[T] \to 0$.

Corollary 5.
For any $\{(w_i, b_i)\}_{i=1}^M$, we have

$\lim_{t_s/E[T] \to 0} \bar{\Delta}^{\text{w-peak}}_{\text{opt}} = \bar{\Delta}^{\text{w-peak}}_{\text{opt-s}}$. (41)

Proof. The proof is provided in Appendix G, which is placed at the end, just before Appendix H, as it requires some results from the preceding appendices.

Synchronized schedulers were recently studied in [15] for the case without energy constraints, i.e., $b_l \ge 1$ for all $l$. (Note that if $\sum_{i=1}^M a_i < 1$, then it is possible that the scheduler decides not to serve any source after the transmission of some packet. In this case, the scheduler waits for a random time that has the same distribution as the transmission time $T$ before deciding to serve another source.) According to Corollary 5, the channel access probability of the synchronized scheduler in [15] is a special case of our solution (35) with $b_l \ge 1$ for all $l$.

IV. PROOFS OF THE MAIN RESULTS
In this section, we provide the proofs of Theorem 1, Corollary 2, Theorem 3, and Corollary 4.
A. The Proofs of Theorem 1 and Corollary 2
We prove Theorem 1 and Corollary 2 in three steps:
Step 1: We show that our solution $r^\star$ in (18)-(20) is feasible for Problem 1.

Lemma 6. If $\sum_{i=1}^M b_i \ge 1$, then the solution $r^\star$ in (18)-(20) is feasible for Problem 1.
Proof. See Appendix C.

Hence, by substituting this solution $r^\star$ into the objective function of Problem 1 in (17), we get an upper bound on the optimal value $\bar{\Delta}^{\text{w-peak}}_{\text{opt}}$, which is expressed in the following lemma:

Lemma 7. If $\sum_{i=1}^M b_i \ge 1$, then

$\bar{\Delta}^{\text{w-peak}}_{\text{opt}} \le \bar{\Delta}^{\text{w-peak}}(r^\star) \le \sum_{i=1}^M \left[ \frac{w_i\, e^{x^\star t_s/E[T]} \left(1 + \frac{1}{x^\star}\right)}{\min\{b_i, \beta^\star \sqrt{w_i}\}} + w_i \right]$, (42)

where $x^\star, \beta^\star$ are defined in (19), (20).
Proof. In Lemma 6, we showed that our proposed solution $r^\star$ in (18)-(20) is feasible for Problem 1. Hence, we substitute this solution into Problem 1 to obtain the following upper bound:

$\sum_{i=1}^M \left[ \frac{w_i\, e^{x^\star t_s/E[T]} \left(1 + \frac{1}{x^\star}\right) e^{-\min\{b_i, \beta^\star \sqrt{w_i}\}\, x^\star t_s/E[T]}}{\min\{b_i, \beta^\star \sqrt{w_i}\}} + w_i \right]$. (43)

Next, we replace $e^{-\min\{b_i, \beta^\star \sqrt{w_i}\}\, x^\star t_s/E[T]}$ by 1 to derive another upper bound with a simpler expression, which is given by (42). This completes the proof.

Step 2: We now construct a lower bound on the optimal value of Problem 1. Suppose that $r = (r_1, \ldots, r_M)$ is a feasible solution to Problem 1, such that $r_l > 0$ and

$\frac{\left[1 - e^{-r_l t_s/E[T]}\right] \sum_{i=1}^M r_i + r_l e^{-r_l t_s/E[T]}}{\sum_{i=1}^M r_i + 1} \le b_l, \quad \forall l$. (44)

Because $\left[1 - e^{-r_l t_s/E[T]}\right] \sum_{i=1}^M r_i + r_l e^{-r_l t_s/E[T]} > r_l$ for all $l$, $r$ satisfies $r_l/(\sum_{i=1}^M r_i + 1) \le b_l$. Hence, the following Problem 2 has a larger feasible set than Problem 1:

(Problem 2)
$\bar{\Delta}^{\text{w-peak}}_{\text{opt},2} \triangleq \min_{r_l > 0} \sum_{l=1}^M \frac{w_l\, e^{-r_l t_s/E[T]}}{r_l}\, e^{\left(\sum_{i=1}^M r_i\right) t_s/E[T]} \left(\sum_{i=1}^M r_i\right) + \sum_{l=1}^M w_l$ (45)
s.t. $r_l \le b_l \left(\sum_{i=1}^M r_i + 1\right), \forall l$, (46)

where $\bar{\Delta}^{\text{w-peak}}_{\text{opt},2}$ is the optimal value of Problem 2. The optimal objective value of Problem 2 is a lower bound on that of Problem 1. We note that the constraint set corresponding to Problem 2 is convex. Thus, this relaxation converts the constraint set of Problem 1 to a convex one, and hence enables us to obtain a lower bound on the optimal value of Problem 1, which is expressed in the following lemma:

Lemma 8. If $\sum_{i=1}^M b_i \ge 1$, then

$\bar{\Delta}^{\text{w-peak}}_{\text{opt}} \ge \bar{\Delta}^{\text{w-peak}}_{\text{opt},2} \ge \sum_{i=1}^M \left[ \frac{w_i}{\min\{b_i, \beta^\star \sqrt{w_i}\}} + w_i \right]$, (47)

where $\beta^\star$ is the root of (20).
Proof. See Appendix D.
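Before carrying out Step 3, the per-source gap factor between the upper bound (42) and the lower bound (47), namely $e^{x^\star t_s/E[T]}\left(1 + 1/x^\star\right) - 1$ with $x^\star = -1/2 + \sqrt{1/4 + E[T]/t_s}$, can be checked numerically to behave like $2\sqrt{\epsilon}$ for small $\epsilon = t_s/E[T]$, anticipating the Taylor analysis below. A minimal check:

```python
import math

def gap_factor(eps):
    """e^{x*eps} * (1 + 1/x*) - 1, with x* = -1/2 + sqrt(1/4 + 1/eps)."""
    x = -0.5 + math.sqrt(0.25 + 1 / eps)
    return math.exp(x * eps) * (1 + 1 / x) - 1

for eps in (1e-2, 1e-4, 1e-6):
    g = gap_factor(eps)
    print(eps, g, g / (2 * math.sqrt(eps)))   # last ratio tends to 1
```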
Step 3: Having derived upper and lower bounds on $\bar{\Delta}^{\text{w-peak}}_{\text{opt}}$ in Steps 1-2, we are ready to analyze their gap. By combining (42) and (47), the sub-optimality gap of the solution $r^\star$ in (18)-(20) is upper bounded by

$\left| \bar{\Delta}^{\text{w-peak}}(r^\star) - \bar{\Delta}^{\text{w-peak}}_{\text{opt}} \right| \le \sum_{i=1}^M \frac{w_i \left( e^{x^\star t_s/E[T]} \left(1 + \frac{1}{x^\star}\right) - 1 \right)}{\min\{b_i, \beta^\star \sqrt{w_i}\}}$, (48)

where $x^\star, \beta^\star$ are defined in (19), (20). Next, we characterize the right-hand side (RHS) of (48) by Taylor expansion. For simplicity, let $\epsilon = t_s/E[T]$. Using the expression for $x^\star$ from (19), we have

$x^\star \epsilon = -\frac{\epsilon}{2} + \sqrt{\frac{\epsilon^2}{4} + \epsilon} = \sqrt{\epsilon} + o(\sqrt{\epsilon})$. (49)

Moreover,

$x^\star = -\frac{1}{2} + \sqrt{\frac{1}{4} + \frac{1}{\epsilon}} = \frac{1}{\sqrt{\epsilon}} + o\left(\frac{1}{\sqrt{\epsilon}}\right)$. (50)

Substituting (49) and (50) into (48), we obtain

$\left| \bar{\Delta}^{\text{w-peak}}(r^\star) - \bar{\Delta}^{\text{w-peak}}_{\text{opt}} \right|$ (51)
$\le \sum_{i=1}^M \frac{w_i \left[ e^{\sqrt{\epsilon} + o(\sqrt{\epsilon})} \left(1 + \sqrt{\epsilon} + o(\sqrt{\epsilon})\right) - 1 \right]}{\min\{b_i, \beta^\star \sqrt{w_i}\}}$
$= \sum_{i=1}^M \frac{w_i \left[ \left(1 + \sqrt{\epsilon} + o(\sqrt{\epsilon})\right)\left(1 + \sqrt{\epsilon} + o(\sqrt{\epsilon})\right) - 1 \right]}{\min\{b_i, \beta^\star \sqrt{w_i}\}}$
$= 2\sqrt{\epsilon} \sum_{i=1}^M \frac{w_i}{\min\{b_i, \beta^\star \sqrt{w_i}\}} + o(\sqrt{\epsilon})$, (52)

where the second step uses the Taylor expansion of the exponential. This proves Theorem 1.

We can observe that the gap $\left| \bar{\Delta}^{\text{w-peak}}(r^\star) - \bar{\Delta}^{\text{w-peak}}_{\text{opt}} \right|$ in the energy-adequate regime converges to zero at a speed of $O(\sqrt{\epsilon})$ as $\epsilon \to 0$. Further, both the upper and lower bounds (42), (47) converge to $\sum_{i=1}^M \left[ w_i/\min\{b_i, \beta^\star \sqrt{w_i}\} + w_i \right]$ as $t_s/E[T] \to 0$. Thus, this value is the asymptotic optimal objective value of Problem 1. This proves Corollary 2.

B. The Proofs of Theorem 3 and Corollary 4
Similar to Section IV-A, we prove Theorem 3 and Corollary 4 in three steps as well:
Step 1: We show that the proposed solution $r^\star$ in (18) and (25)-(27) is a feasible solution for Problem 1.

Lemma 9. If $\sum_{i=1}^M b_i < 1$, then the solution $r^\star$ in (18) and (25)-(27) is feasible for Problem 1.
Proof. See Appendix E.

Now, we construct an upper bound on the optimal value of Problem 1 using our proposed solution as follows:

Lemma 10. If $\sum_{i=1}^M b_i < 1$, then

$\bar{\Delta}^{\text{w-peak}}_{\text{opt}} \le \bar{\Delta}^{\text{w-peak}}(r^\star) \le \sum_{l=1}^M \frac{w_l}{b_l}\, e^{\left(\sum_{i=1}^M b_i\right) x^\star t_s/E[T]} \left( \frac{1}{x^\star} + \sum_{i=1}^M b_i \right) + \sum_{l=1}^M w_l$, (53)

where $x^\star$ is defined in (25).
Proof. In Lemma 9, we showed that our proposed solution $r^\star$ in (18) and (25)-(27) is feasible for Problem 1. Hence, we substitute this solution into Problem 1 to obtain the following upper bound:

$\sum_{l=1}^M \frac{w_l\, e^{-b_l x^\star t_s/E[T]}}{b_l}\, e^{\left(\sum_{i=1}^M b_i\right) x^\star t_s/E[T]} \left( \frac{1}{x^\star} + \sum_{i=1}^M b_i \right) + \sum_{l=1}^M w_l$. (54)

Next, we replace $e^{-b_l x^\star t_s/E[T]}$ by 1 to derive another upper bound with a simpler expression, which is given by (53). This completes the proof.

Step 2: Similar to the proof in Section IV-A, we use the relaxed problem, Problem 2, to construct a lower bound as follows:

Lemma 11. If $\sum_{i=1}^M b_i < 1$, then

$\bar{\Delta}^{\text{w-peak}}_{\text{opt}} \ge \bar{\Delta}^{\text{w-peak}}_{\text{opt},2} \ge \sum_{l=1}^M \frac{w_l}{b_l}\, e^{-\frac{\sum_{i=1}^M b_i}{1 - \sum_{i=1}^M b_i} \cdot \frac{t_s}{E[T]}} + \sum_{l=1}^M w_l$. (55)

Proof.
See Appendix F.
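The $O(\epsilon)$ behavior established in Step 3 below is visible already in the lower bound (55): its distance from the limiting value $\sum_l (w_l/b_l + w_l)$ equals $\sum_l (w_l/b_l)\left(1 - e^{-Z\epsilon}\right) \approx Z\epsilon \sum_l w_l/b_l$, so dividing $\epsilon$ by 10 shrinks the gap by almost exactly 10. A quick check with arbitrary energy-scarce parameters:

```python
import math

w = [1.0, 2.0]                 # illustrative weights
b = [0.2, 0.3]                 # energy-scarce regime: sum(b) = 0.5 < 1
Z = sum(b) / (1 - sum(b))      # here Z = 1.0

def lower_bound(eps):
    """Right-hand side of (55) as a function of eps = t_s/E[T]."""
    return sum(wl / bl for wl, bl in zip(w, b)) * math.exp(-Z * eps) + sum(w)

limit = sum(wl / bl + wl for wl, bl in zip(w, b))   # value at eps = 0
gap = lambda eps: limit - lower_bound(eps)

ratio = gap(1e-4) / gap(1e-5)
print(ratio)                   # close to 10: the distance to the limit is O(eps)
```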
Step 3: We now characterize the sub-optimality gap by analyzing the upper and lower bounds constructed above. By combining (53) and (55), the sub-optimality gap of the solution $r^\star$ in (18) and (25)-(27) is upper bounded by

$\left| \bar{\Delta}^{\text{w-peak}}(r^\star) - \bar{\Delta}^{\text{w-peak}}_{\text{opt}} \right| \le \sum_{l=1}^M \frac{w_l}{b_l} \left[ e^{\left(\sum_{i=1}^M b_i\right) x^\star t_s/E[T]} \left( \frac{1}{x^\star} + \sum_{i=1}^M b_i \right) - e^{-\frac{\sum_{i=1}^M b_i}{1 - \sum_{i=1}^M b_i} \cdot \frac{t_s}{E[T]}} \right]$, (56)

where $x^\star$ is defined in (25). Next, we characterize the RHS of (56) by Taylor expansion. For simplicity, let $\epsilon = t_s/E[T]$, $Z = \left(\sum_{i=1}^M b_i\right)/\left(1 - \sum_{i=1}^M b_i\right)$, and $k_l = \left(\sum_{i=1}^M b_i - b_l\right)/\left(1 - \sum_{i=1}^M b_i\right)$. Using Taylor expansion, we obtain

$\min_l c_l = 1 + \left(\min_l k_l\right) \epsilon + o(\epsilon)$, (57)
$\max_l c_l = 1 + \left(\max_l k_l\right) \epsilon + o(\epsilon)$. (58)

Using (57), (58), $x^\star$ from (25), and Taylor expansion again, we get

$e^{\left(\sum_{i=1}^M b_i\right) x^\star \epsilon} = 1 + Z\left(1 + \left(\min_l k_l\right)\epsilon + o(\epsilon)\right)\epsilon + o(\epsilon) = 1 + Z\epsilon + o(\epsilon)$, (59)

$\frac{1}{x^\star} + \sum_{i=1}^M b_i = \frac{1 - \sum_{i=1}^M b_i}{\min_l c_l} + \sum_{i=1}^M b_i = 1 - \left(\min_l k_l\right)\left(1 - \sum_{i=1}^M b_i\right)\epsilon + o(\epsilon)$, (60)

$e^{-Z\epsilon} = 1 - Z\epsilon + o(\epsilon)$. (61)

Substituting (59)-(61) into (56), we get (28). This proves Theorem 3. Moreover, we observe that the gap $\left| \bar{\Delta}^{\text{w-peak}}(r^\star) - \bar{\Delta}^{\text{w-peak}}_{\text{opt}} \right|$ in the energy-scarce regime converges to zero at a speed of $O(\epsilon)$ as $\epsilon \to 0$. Further, both the upper and lower bounds (53), (55) converge to $\sum_{i=1}^M \left[ w_i/b_i + w_i \right]$ as $t_s/E[T] \to 0$. Thus, this value is the asymptotic optimal objective value of Problem 1. This proves Corollary 4.

V. LEARNING TO OPTIMIZE AGE

Note that the optimal rate $r^\star$ in Theorem 1 depends upon the mean transmission time $E[T]$.
Since the transmission time also depends upon (possibly) time-varying channel conditions, estimating $E[T]$ accurately a priori could be cumbersome. Thus, in this section, we derive learning algorithms that optimize the total weighted average peak age of all sources when the mean transmission time $E[T]$ is unknown to the scheduler. We begin by reducing our system to an equivalent discrete-time Markov chain.

Contributions and Challenges: The simplest learning algorithm is the certainty equivalence (CE) rule [50]-[53]. Under this rule, the scheduler maintains an empirical estimate of $E[T]$ and utilizes the sleep parameters that would be optimal if the true mean transmission time were equal to this estimate. The regret of a learning algorithm is the sub-optimality in performance that results because the algorithm does not know the system parameters. What we are able to show is that the CE rule achieves $o(H)$ regret, where $H$ is the time horizon. This further implies that the long-term time-average performance of our CE algorithm is asymptotically optimal.

This result is important since it is well known by now [54] that in many reinforcement learning problems [40], the CE rule fails to yield optimal long-term time-average performance, because it does not yield a correct estimate of the optimal choices. Thus, more complex learning rules, such as optimism in the face of uncertainty [41], [53], which utilize confidence balls in addition to the empirical estimates and therefore have a significantly higher computational complexity, are required to ensure optimality. Our main contribution is to show that the vanilla CE rule yields asymptotically the same long-term time-average performance as a scheduler that knows the system parameters in advance, i.e., with high probability, the "sub-optimality gap" of the CE rule is $o(H)$, where $H$ is the operating time horizon. This means that, instead of using more complex learning algorithms such as UCRL [41] or RBMLE [53], one could use CE, thereby saving precious computing power while attaining the optimal average performance (asymptotically). We perform a finite-time performance analysis of the CE rule and explicitly quantify its sub-optimality by deriving an upper bound on its "regret", i.e., the gap between its average expected performance and that resulting from applying the optimal sleep parameters. The problem of designing and analyzing learning algorithms for our setup poses several challenges, primarily because the age process evolves in continuous time on a continuous state space that is not compact. To address this difficulty, we show that, for the purpose of optimizing average age, we can equivalently work with a discrete-time process. We then utilize several techniques from the theory of general state-space Markov chains [55] for analyzing the learning regret.

Sampling the Continuous-Time Process: Consider the multi-source system in which the sleep durations are modulated according to the parameter vector $r = (r_1, r_2, \ldots, r_M)$. Throughout this section, we let $n \in \mathbb{N}$ be the discrete time of the sampled system. We sample the original continuous-time system at those time instants when one of the following events occurs:
- a source $l$ gets channel access and starts transmitting; we say that it wakes up, denoted by $m_l(n) = 1$;
- a source $l$ completes a packet transmission, and hence goes into sleep mode, so that $m_l(n) = 0$.

In what follows, we make the following assumption.

Assumption 1.
The transmission times are bounded, i.e., $0 \le T \le T_{\max}$ almost surely, where $T_{\max} > 0$. Moreover, the probability density function $f(\cdot)$ of $T$ satisfies

$lb \le f(y) \le ub, \quad \forall y \in [0, T_{\max}]$,

where $lb, ub > 0$ are lower and upper bounds on the density function. $\square$

Define $s_l(n) := (\Delta_l(n), m_l(n))$, where $\Delta_l(n)$ is the age and $m_l(n) \in \{0, 1\}$ is the mode of source $l$. Define

$s(n) := (s_1(n), s_2(n), \ldots, s_M(n))$. (62)

As shown in Lemma 15 (see Appendix H), for the purpose of adaptively choosing sleep parameters, the process $s(n)$ serves as a sufficient statistic [56] for the optimization problem (17). In other words, $s(n)$ is the state of a Markov decision process. Hence, we will work exclusively with the discrete-time system obtained by sampling the original continuous-time system. We use $\mathcal{S}$ to denote the state space of a single source, i.e., we have $s_l(n) \in \mathcal{S}$. Consider operation over a time horizon of $H$ discrete time steps, and let $K_l$ denote the (random) number of packets delivered by source $l$ until time $H$. The cumulative cost incurred is given by

$C(H) := \sum_{l=1}^M \sum_{i=1}^{K_l} w_l \Delta^{\text{peak}}_{l,i}$, (63)

where $\Delta^{\text{peak}}_{l,i}$ denotes the $i$-th peak age of source $l$. We let $r_l(n) \in \mathbb{R}_+$ denote the sleep period parameter for source $l$, and denote $r(n) := (r_1(n), r_2(n), \ldots, r_M(n))$. As shown in Lemma 15, the expected value of the cumulative peak age can be written as

$E\left( \sum_{n=1}^{H-1} g(s(n)) \right)$, (64)

where the function $g$ is described in Lemma 15. However, in our setup, the controller that chooses $r(n)$ does not know the density function $f$ of the packet transmission time, and has to adaptively choose the sleeping period parameter $r(n)$ so as to minimize the operating cost (64). Let $\mathcal{F}^{(d)}_n$ denote the sigma-algebra generated by the random variables $\{s(i)\}_{i=1}^n, \{r(i)\}_{i=1}^{n-1}$ (the superscript $d$ denotes the fact that we are working with the discretized system).
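For intuition, the event-driven sampling just described can be illustrated for a single source in isolation (no contention, generate-at-will packet generation at the start of each transmission); the sleep-rate and transmission-time laws below are illustrative assumptions, not the paper's model parameters.

```python
import random

random.seed(3)
r = 2.0                        # sleeping-period parameter (illustrative)
sample = []                    # the sampled chain s(n) = (Delta(n), m(n))

t = 0.0                        # continuous time
last_gen = 0.0                 # generation time of the freshest delivered packet
for _ in range(5):
    t += random.expovariate(r)          # sleep period ends: source wakes up
    sample.append((t - last_gen, 1))    # sampled event: wake-up, mode m = 1
    gen = t                             # generate-at-will: packet stamped now
    t += random.uniform(0.0, 0.01)      # bounded transmission time (Assumption 1)
    last_gen = gen                      # delivery: age drops to t - gen
    sample.append((t - last_gen, 0))    # sampled event: completion, mode m = 0

modes = [m for _, m in sample]
print(sample[:2])              # wake/completion events alternate; ages positive
```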
A learning policy is a collection of maps that, at each time $n = 1, 2, \ldots, H$, chooses the sleep period parameter $r(n)$ adaptively based on the past operation history $\mathcal{F}^{(d)}_n$ of the system. The performance of a learning policy is measured by its regret $R(H)$, defined as follows:

$R(H) := \sum_{n=1}^H g(s(n)) - H \bar{\Delta}^{\text{w-peak}}(r^\star)$, (65)

where $\bar{\Delta}^{\text{w-peak}}(r^\star)$ is the optimal performance when the true system parameter is known, and hence the scheduler can implement the optimal rate vector. Throughout this section, we use $\theta$ to denote the mean transmission time $E[T]$. Since the optimal rate depends upon the probability density function $f(\cdot)$ only through its mean $E[T]$, we also denote it by $r^\star_\theta$.

Certainty Equivalence Learning Algorithm: We begin with some notation. Let $col(i)$, $i = 1, 2, \ldots$, be a random variable that is equal to 1 if there is no collision at time $i$, and is 0 otherwise. The empirical estimate of $\theta$ at time $n$ is denoted by $\hat{\theta}(n)$, and given as

$\hat{\theta}(n) := \frac{\sum_{i=1}^n T(i)\, col(i)}{N(n) \vee 1}$, (66)

where

$N(n) := \sum_{i=1}^n col(i)$, (67)

and $T(i) \in [0, T_{\max}]$ is the time taken to deliver the packet at time $i$. The learning rule operates in episodes. We let $\tau_k$ be the start time of the $k$-th episode, and let $\mathcal{E}_k := \{\tau_k, \tau_k + 1, \ldots, \tau_{k+1} - 1\}$ be the time slots that comprise the $k$-th episode, so that the duration of $\mathcal{E}_k$ is $\tau_{k+1} - \tau_k$ time slots. We let $\tau_k = 2^k$, use $k(n)$ to denote the index of the current episode at time $n$, and $\theta(n)$ to denote the empirical estimate at the beginning of the current episode, defined by

$k(n) := \max\{ k : \tau_k \le n \}$, (68)
$\theta(n) := \hat{\theta}(\tau_{k(n)})$. (69)

Within each episode, the algorithm implements a single stationary controller that makes decisions only on the basis of the state $s(n)$ and the estimate $\theta(\tau_k)$ obtained at the beginning of the current episode $k(n)$.
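This estimate-and-freeze structure is easy to prototype. The sketch below simulates the estimator (66)-(69) alone: collided slots ($col(i) = 0$) are discarded, and the estimate used by the controller is refreshed only at episode boundaries $\tau_k = 2^k$. The collision probability, the uniform transmission-time law, and the horizon are illustrative choices, and the map $\theta \mapsto r^\star_\theta$ from Theorem 1 is abstracted away.

```python
import random

random.seed(7)
T_MAX = 0.01                  # bound on T (Assumption 1); true mean is 0.005
H = 1 << 15                   # horizon: 2**15 slots

num, N = 0.0, 0               # running sums for (66)-(67)
theta_hat, theta_episode = 0.0, 0.0
tau = 1                       # next episode boundary, tau_k = 2**k
for n in range(1, H + 1):
    if n == tau:              # episode boundary: refresh the frozen estimate
        theta_episode = theta_hat            # (68)-(69)
        tau *= 2
    # r(n) = r_star(theta_episode) would be applied here (placeholder).
    col = 0 if random.random() < 0.1 else 1  # 10% collisions (illustrative)
    if col:
        num += random.uniform(0.0, T_MAX)    # T(i) is recorded only if col = 1
        N += 1
    theta_hat = num / max(N, 1)              # (66)

print(theta_episode)          # close to the true mean E[T] = T_MAX/2 = 0.005
```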
It chooses the sleep period parameter as $r(n) = r^\star_{\theta(n)}$ for all $n \in \mathcal{E}_k$, i.e., it utilizes the rate vector that is optimal for a system whose mean transmission time equals $\theta(n)$. Thus, $r(n) = r^\star_{\theta(\tau_k)}$ for $\tau_k \le n \le \tau_{k+1} - 1$. We summarize our learning rule in Algorithm 2.

Algorithm 2: Certainty Equivalence Learning for Age Optimization.
  Input: $N$, $\gamma$.
  Initialize $\hat{\theta}(1)$.
  for $n = 1, 2, \ldots$ do
    if $n = \tau_k$ then
      Calculate $\hat{\theta}(n)$ as in (66) and set $\theta(n)$ as in (68)-(69).
    end if
    Use $r(n) = r^\star_{\theta(n)}$.
  end for

We will analyze its performance under the following assumptions. Throughout, for a vector $x$, we let $\|x\|$ denote its Euclidean norm and $\|x\|_1$ its $\ell_1$-norm.

Assumption 2.
With high probability, say greater than $1 - \delta$, where $\delta > 0$ is a small constant, the state value $s(\tau_k)$ at the beginning of each episode $k$ belongs to a compact set $\mathcal{K} := \{ x \in \mathcal{S}^M : \|x\| \le K \}$, where $\mathcal{S}$ is the state space of a single source. $\square$

The above is not a restrictive assumption, since the scheduler can always ensure that, towards the end of each episode, each source receives a sufficient amount of service so as to satisfy this condition. We now make a few assumptions regarding the set $\Theta$ of "allowable parameters".

Assumption 3. Recall that $r^\star_\theta$ is the optimal sleep parameter when the mean transmission time is equal to $\theta$. The following two properties hold for the scheduler that uses $r(n) \equiv r^\star_\theta$, $n \in \mathbb{N}$:

(i) The average cost is finite, i.e.,

$\limsup_{H \to \infty} \frac{1}{H} \sum_{n=1}^H E_{r^\star_\theta}\left( g(s(n)) \right) \le K < \infty$. (70)

(ii) Each source gets channel access with a non-zero probability:

$\inf_{\theta \in \Theta,\, l \in [M]} P\left( ca_l(n) = 1 \mid r(n) = r^\star_\theta \right) > 0$, (71)

where $ca_l(i)$ is a random variable that equals 1 if source $l$ gets channel access at time $i$, and 0 otherwise. We denote

$p_{\min} := \inf_{\theta \in \Theta,\, l \in [M]} P\left( ca_l(n) = 1 \mid r(n) = r^\star_\theta \right)$. (72) $\square$

It is easily verified that (70) and (71) hold true whenever the rate vector $r$ is bounded. The following result quantifies the learning regret of Algorithm 2.

Theorem 12.
Consider the problem of designing a learning algorithm that does not know the statistics of the transmission time $T$, and adaptively chooses the sleep period parameters $r(n)$ in order to minimize the cumulative peak age of the $M$ sources. Let $\delta \in (0, p_{\min})$ be a constant. Then, under Assumptions 1-3, the regret of Algorithm 2 can be bounded as follows:

$E[R(H)] \le K \max\left\{ \frac{\gamma \log H}{(p_{\min} - \sqrt{\delta})\, \delta},\; O\!\left( \frac{\log H}{\delta} \right) \right\} + \frac{K \pi L \sqrt{H \gamma}\, \log H}{p_{\min} - \sqrt{\delta}}$,

where $H$ is the operating time horizon, $\gamma$ is a constant, $K$ and $p_{\min}$ are as in Assumption 3, and the parameters $\delta, L > 0$ are as in Lemma 19.
Proof. See Appendix H.

VI. NUMERICAL AND SIMULATION RESULTS
We use Matlab and NS-3 to evaluate the performance of our algorithm. We use "age-optimal scheduler" to denote the sleep-wake scheduler with the sleep period parameters $r^\star_l$ as in (18), which was shown to be near-optimal in Theorems 1 and 3. By "throughput-optimal scheduler", we refer to the sleep-wake algorithm of [23], which is known to achieve the optimal trade-off between throughput and the reduction of energy consumption. Moreover, we use "fixed sleep-rate scheduler" to denote the sleep-wake scheduler in which the sleep period parameters $r_l$ are equal for all sources, i.e., $r_l = k$ for all $l$, where the parameter $k$ has been chosen so as to satisfy the energy constraints of Problem 1. We also let $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ denote the unnormalized total weighted average peak age in (11). Finally, we do not compare the performance of our proposed algorithm with the CSMA algorithms of [36], [37], whose goal was solely to minimize the age; since they do not incorporate energy constraints, such a comparison would not be fair.

Figure 3: Total weighted average peak age $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ in (11) versus the ratio $t_s/E[T]$ for $M = 10$ sources.

Unless stated otherwise, our setup is as follows: the average transmission time is $E[T] = 5$ ms; the weights $w_l$ attached to different sources are generated by sampling from a uniform distribution; and the target power efficiencies $b_l$ are randomly generated according to a uniform distribution.

A. Numerical Evaluations
Figure 3 plots the total weighted average peak age $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ in (11) as a function of the ratio $t_s/E[T]$, where the number of sources is $M = 10$. The age-optimal scheduler is seen to outperform the throughput-optimal and fixed sleep-rate schedulers. This implies that what maximizes the throughput does not necessarily minimize the AoI, and vice versa. Moreover, we observe that the total weighted average peak age of all schedulers increases as the sensing time increases. This is expected, since an increase in the sensing time leads to an increase in the probability of packet collisions, which in turn degrades the age performance of these schedulers.

We then scale the number of sources $M$ and plot $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ in (11) as a function of $M$ in Figure 4, normalizing the performance by the number of sources $M$ while plotting.

Figure 4: Total weighted average peak age $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ in (11) versus the number of sources $M$, where $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ has been normalized by $M$ while plotting.

The sensing time is fixed at $t_s = 40\ \mu$s. The weights $w_l$ corresponding to different sources are randomly generated uniformly. The age-optimal scheduler is shown to outperform the other schedulers uniformly for all values of $M$. Moreover, as we can observe, the average peak age of the sources under the age-optimal scheduler increases only up to around 0.55 seconds as the number of sources rises from 1 to 100. This indicates the robustness of our algorithm to changes in the number of sources in a network.

Figure 5: Total weighted average peak age $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ in (11) versus the target power efficiency $b$ for $M = 100$ sources, where $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ has been normalized by $M$ while plotting.

In Figure 5, we fix the value of $M$ at 100 and set the target power efficiencies of all sources to the same value, i.e., $b_l = b$ for all $l$. We then vary the parameter $b$ and plot the resulting performance, normalized by the number of sources $M$ while plotting. We exclude the simulation of the throughput-optimal scheduler for $b < 0.01$ (i.e., when $\sum_{i=1}^M b_i < 1$), since the sleeping period parameters proposed in [23] are not feasible for Problem 1 in the energy-scarce regime. The age-optimal scheduler outperforms the other schedulers. Moreover, its performance is a decreasing function of $b$ that eventually settles at a constant value. This occurs because our proposed solution in (18) becomes a function solely of the weights $w_l$ and $\beta^\star$ once $b$ exceeds a certain value. Thus, the performance of the proposed scheduler saturates beyond this value of $b$.

We now show the effectiveness of the proposed scheduler when deployed in "dense networks" [21], [22]. Dense networks are characterized by a large number of sources connected to a single AP. We fix the number of sources $M$ and take the target lifetimes of the sources to be equal, i.e., $D_l = D$ for all $l$. The weights $w_l$ corresponding to different sources are generated randomly by sampling from the uniform distribution. We let the initial battery level be $B_l = 8$ mAh for all $l$, with an output voltage of 5 V, and let the power consumption in the transmission mode be 24.75 mW for all sources. We vary the parameter $D$ and plot the resulting performance in Figure 6, normalized by the number of sources $M$ while plotting. We exclude simulations of the throughput-optimal scheduler for values of $D$ for which the scheduler is infeasible, i.e., for which its cumulative energy consumption exceeds the total allowable energy consumption.

Figure 6: Total weighted average peak age $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ in (11) versus the target lifetime $D$ for a dense network with $M = 10$ sources, where $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ has been normalized by $M$ while plotting. Since the throughput-optimal scheduler is infeasible for large values of $D$, we do not plot its performance for those values.

The age-optimal scheduler is seen to outperform the others. As observed in Figure 6, under the age-optimal scheduler, sources can remain active for up to 25 years while simultaneously achieving a decent average peak age of around 0.2 hour, i.e., 12 minutes. This makes the scheduler suitable for dense networks, where it is crucial that the sources remain active for many years.

Figure 7: The average actual lifetime versus the target lifetime $D$.

B. NS-3 Simulation
We use NS-3 [57] to investigate the effect of our modeling assumptions on the performance of the age-optimal scheduler in a more practical setting. We simulate the age-optimal scheduler using IEEE 802.11b, while disabling RTS/CTS and modifying the back-off times in the MAC layer to be exponentially distributed. Our simulation results are averaged over 5 system realizations. The UDP saturation conditions are satisfied, so that the source nodes always have packets to send.

Our simulation consists of a WiFi network with 1 AP and 3 associated source nodes in a field of size 50 m × 50 m. Moreover, all weights are set to unity, i.e., $w_l = 1$ for all $l$.

Figure 7 plots the average actual lifetime of the sources versus the target lifetime, where we take the target lifetimes of all sources to be equal, i.e., $D_l = D$ for all $l$. As we can observe, the actual lifetime under the age-optimal scheduler always achieves the target lifetime. This suggests that our assumptions (i.e., (i) omitting the power dissipation in the sleep mode and during the sensing times, and (ii) taking the average transmission and collision times to be equal to each other) do not prevent the algorithm from reaching its target lifetime.

Figure 8: Total weighted average peak age $\bar{\Delta}^{\text{w-peak}}_{\text{un}}(r)$ versus the target lifetime $D$.

Figure 8 plots the total weighted average peak age versus the target lifetime, where again we take the target lifetimes of all sources to be equal, i.e., $D_l = D$ for all $l$. The age-optimal scheduler (theoretical) curve is obtained using (11), while the age-optimal scheduler (from NS-3) curve is obtained using the NS-3 simulator. As we can observe, the difference between the plotted curves does not exceed 2% of the age-optimal scheduler (theoretical) performance. This emphasizes the negligible impact of our assumptions on the performance of our proposed algorithm.

VII. CONCLUSIONS
We designed an efficient sleep-wake scheduling algorithm for wireless networks that attains the optimal trade-off between minimizing the AoI and reducing energy consumption. Since the associated optimization problem is non-convex, in general we could not hope to solve it for all values of the system parameters. However, in the regime where the carrier sensing time $t_s$ is negligible compared to the average transmission time $E[T]$, we were able to provide a near-optimal solution. Moreover, the proposed solution has a simple form that allowed us to design an easy-to-implement algorithm for computing it. Furthermore, we showed that the performance of our proposed algorithm is asymptotically no worse than that of the optimal synchronized scheduler as $t_s/E[T] \to 0$. Finally, for the case where the mean transmission time is unknown, we devised a reinforcement learning algorithm that adaptively learns the unknown parameter.

VIII. ACKNOWLEDGEMENTS
The authors thank Jiayu Pan and Shaoyi Li for their great efforts in obtaining the NS-3 simulation results.

REFERENCES

[1] A. M. Bedewy, Y. Sun, R. Singh, and N. B. Shroff, "Optimizing information freshness using low-power status updates via sleep-wake scheduling," in
Proc. MobiHoc, 2020, pp. 51–60.
[2] S. Kaul, R. D. Yates, and M. Gruteser, "Real-time status: How often should one update?" in Proc. IEEE INFOCOM, 2012, pp. 2731–2735.
[3] Y. Sun, I. Kadota, R. Talak, and E. Modiano, "Age of information: A new metric for information freshness," Synthesis Lectures on Communication Networks, vol. 12, no. 2, pp. 1–224, 2019.
[4] R. D. Yates, Y. Sun, D. R. Brown III, S. K. Kaul, E. Modiano, and S. Ulukus, "Age of information: An introduction and survey," arXiv preprint arXiv:2007.08564, 2020.
[5] N. F. Timmons and W. G. Scanlon, "Analysis of the performance of IEEE 802.15.4 for medical sensor body area networking," in First Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (IEEE SECON 2004), 2004, pp. 16–24.
[6] Chipcon AS, SmartRF CC2420 preliminary datasheet, rev. 1.0, 17 November 2003.
[7] Datasheet for the MC9S08RE8 Motorola microcontroller.
[8] R. D. Yates and S. K. Kaul, "Status updates over unreliable multiaccess channels," in Proc. IEEE ISIT, 2017, pp. 331–335.
[9] R. Talak, S. Karaman, and E. Modiano, "Distributed scheduling algorithms for optimizing information freshness in wireless networks," in Proc. IEEE SPAWC, 2018, pp. 1–5.
[10] R. Li, A. Eryilmaz, and B. Li, "Throughput-optimal wireless scheduling with regulated inter-service times," in Proc. IEEE INFOCOM, 2013, pp. 2616–2624.
[11] I. Kadota, A. Sinha, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, "Scheduling policies for minimizing age of information in broadcast wireless networks," IEEE/ACM Trans. Netw., vol. 26, no. 6, pp. 2637–2650, 2018.
[12] Y. Hsu, E. Modiano, and L. Duan, "Scheduling algorithms for minimizing age of information in wireless broadcast networks with random arrivals," IEEE Transactions on Mobile Computing, 2019.
[13] Z. Jiang, B. Krishnamachari, X. Zheng, S. Zhou, and Z. Niu, "Timely status update in massive IoT systems: Decentralized scheduling for wireless uplinks," arXiv preprint arXiv:1801.03975, 2018.
[14] I. Kadota, A. Sinha, and E. Modiano, "Optimizing age of information in wireless networks with throughput constraints," in Proc. IEEE INFOCOM, 2018, pp. 1844–1852.
[15] R. Talak, S. Karaman, and E. Modiano, "Optimizing information freshness in wireless networks under general interference constraints," in Proc. MobiHoc, 2018, pp. 61–70.
[16] Q. He, D. Yuan, and A. Ephremides, "Optimal link scheduling for age minimization in wireless systems," IEEE Trans. Inf. Theory, vol. 64, no. 7, pp. 5381–5394, 2017.
[17] X. Guo, R. Singh, P. R. Kumar, and Z. Niu, "A risk-sensitive approach for packet inter-delivery time optimization in networked cyber-physical systems," IEEE/ACM Trans. Netw., vol. 26, no. 4, pp. 1976–1989, 2018.
[18] Y. Sun, E. Uysal-Biyikoglu, and S. Kompella, "Age-optimal updates of multiple information flows," in IEEE INFOCOM – the 1st Workshop on the Age of Information (AoI Workshop), 2018, pp. 136–141.
[19] I. Kadota, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, "Minimizing the age of information in broadcast wireless networks," in Proc. Allerton, 2016, pp. 844–851.
[20] R. Singh, X. Guo, and P. R. Kumar, "Index policies for optimal mean-variance trade-off of inter-delivery times in real-time sensor networks," in Proc. IEEE INFOCOM, 2015, pp. 505–512.
[21] S. S. Kowshik, K. Andreev, A. Frolov, and Y. Polyanskiy, "Energy efficient coded random access for the wireless uplink," arXiv preprint arXiv:1907.09448, 2019.
[22] S. S. Kowshik and Y. Polyanskiy, "Fundamental limits of many-user MAC with finite payloads and fading," arXiv preprint arXiv:1901.06732, 2019.
[23] S. Chen, T. Bansal, Y. Sun, P. Sinha, and N. B. Shroff, "Life-Add: Lifetime adjustable design for WiFi networks with heterogeneous energy supplies," in
Proc. WiOpt , 2013, pp. 508–515.[24] R. D. Yates and S. K. Kaul, “The age of information: Real-time statusupdating by multiple sources,”
IEEE Trans. Inf. Theory , vol. 65, no. 3,pp. 1807–1827, 2018.[25] M. Costa, M. Codreanu, and A. Ephremides, “On the age of informationin status update systems with packet management,”
IEEE Trans. Inf.Theory , vol. 62, no. 4, pp. 1897–1910, 2016.[26] A. M. Bedewy, Y. Sun, and N. B. Shroff, “Optimizing data freshness,throughput, and delay in multi-server information-update systems,” in
Proc. IEEE ISIT , 2016, pp. 2569–2573.[27] A. M. Bedewy, Y. Sun, and N. B. Shroff, “Minimizing the age ofinformation through queues,”
IEEE Trans. Inf. Theory , vol. 65, no. 8,pp. 5215–5232, 2019.[28] A. M. Bedewy, Y. Sun, and N. B. Shroff, “Age-optimal informationupdates in multihop networks,” in
Proc. IEEE ISIT , 2017, pp. 576–580.[29] A. M. Bedewy, Y. Sun, and N. B. Shroff, “The age of informationin multihop networks,”
IEEE/ACM Trans. Netw. , vol. 27, no. 3, pp.1248–1257, 2019.[30] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff,“Update or wait: How to keep your data fresh,”
IEEE Trans. Inf. Theory ,vol. 63, no. 11, pp. 7492–7508, 2017.[31] Y. Sun and B. Cyr, “Sampling for data freshness optimization: Non-linear age functions,”
Journal of Communications and Networks , vol.21, no. 3, pp. 204–219, 2019.[32] A. M. Bedewy, Y. Sun, S. Kompella, and N. B. Shroff, “Age-optimalsampling and transmission scheduling in multi-source systems,” in
Proc.MobiHoc , 2019, pp. 121–130.[33] A. M. Bedewy, Y. Sun, S. Kompella, and N. B. Shroff, “Optimalsampling and scheduling for timely status updates in multi-sourcenetworks,”
IEEE Trans. Inf. Theory , pp. 1–1, 2021.[34] S. Yun, Y. Yi, J. Shin, et al., “Optimal CSMA: a survey,” in
Proc. ICCS ,2012, pp. 199–204.[35] R. Singh and P. R. Kumar, “Adaptive CSMA for decentralizedscheduling of multi-hop networks with end-to-end deadline constraints,”accepted by
IEEE/ACM Trans. Netw. , 2021.[36] A. Maatouk, M. Assaad, and A. Ephremides, “Minimizing the age ofinformation in a CSMA environment,” arXiv preprint arXiv:1901.00481 ,2019.[37] M. Wang and Y. Dong, “Broadcast age of information in CSMA/CAbased wireless networks,” arXiv preprint arXiv:1904.03477 , 2019.[38] L. Jiang and J. Walrand, “A distributed CSMA algorithm for throughputand utility maximization in wireless networks,”
IEEE/ACM Trans. Netw. ,vol. 18, no. 3, pp. 960–972, 2010.[39] G. Bianchi, “Performance analysis of the IEEE 802.11 distributedcoordination function,”
IEEE J. Sel. Areas Commun. , vol. 18, no. 3,pp. 535–547, 2000.[40] R. S. Sutton and A. G. Barto,
Reinforcement learning: An introduction ,MIT press, 1998.[41] T. Jaksch, R. Ortner, and P. Auer, “Near-optimal regret bounds for reinforcement learning.,”
Journal of Machine Learning Research , vol.11, no. 4, 2010.[42] R. Singh, A. Gupta, and N. B. Shroff, “Learning in Markov decisionprocesses under constraints,” arXiv preprint arXiv:2002.12435 , 2020.[43] A. N. Shiryaev,
Optimal stopping rules , New York: Springer-Verlag,1978.[44] A. Wald,
Sequential analysis , New York: Courier Corporation, 1973.[45] ASH transceiver TR1000 data sheet, RF Monolithic Inc.[46] K. F. Ramadan, M. I. Dessouky, M. Abd-Elnaby, and F. E. A. El-Samie,“Energy-efficient dual-layer MAC protocol with adaptive layer durationfor wsns,” in , 2016, pp. 47–52.[47] A. El-Hoiydi, “Spatial TDMA and CSMA with preamble sampling forlow power ad hoc wireless sensor networks,” in
Proc. IEEE Int. Symp.Comput. Commun. (ISCC) , 2002, pp. 685–692.[48] C. Lu, A. Saifullah, B. Li, M. Sha, H. Gonzalez, D. Gunatilaka, C. Wu,L. Nie, and Y. Chen, “Real-time wireless sensor-actuator networks forindustrial cyber-physical systems,”
Proceedings of the IEEE , vol. 104,no. 5, pp. 1013–1024, 2016.[49] P. Hsieh and I. Hou, “A decentralized medium access protocol forreal-time wireless ad hoc networks with unreliable transmissions,” in
IEEE 38th International Conference on Distributed Computing Systems(ICDCS) , 2018, pp. 972–982.[50] P. Mandl, “Estimation and control in markov chains,”
Advances inApplied Probability , pp. 40–60, 1974.[51] H. Van de Water and J. Willems, “The certainty equivalence propertyin stochastic control theory,”
IEEE Transactions on Automatic Control ,vol. 26, no. 5, pp. 1080–1087, 1981.[52] H. Mania, S. Tu, and B. Recht, “Certainty equivalence is efficient forlinear quadratic control,” in
Advances in Neural Information ProcessingSystems , 2019, pp. 10154–10164.[53] A. Mete, R. Singh, and P.R. Kumar, “Reward biased maximumlikelihood estimation for reinforcement learning,” arXiv preprintarXiv:2011.07738 , 2020.[54] V. Borkar and P. Varaiya, “Adaptive control of Markov chains, i: Finiteparameter set,”
IEEE Transactions on Automatic Control , vol. 24, no.6, pp. 953–957, 1979.[55] E. Nummelin,
General irreducible Markov chains and non-negativeoperators , vol. 83, Cambridge University Press, 2004.[56] C. Striebel, “Sufficient statistics in the optimum control of stochasticsystems,”
Journal of Mathematical Analysis and Applications
Discrete stochastic processes , Boston: Kluwer AcademicPublishers, 1996.[59] S. Boyd and L. Vandenberghe,
Convex optimization , New York, NY,USA: Cambridge University Press, 2004.[60] S. P. Meyn and R. L. Tweedie,
Markov chains and stochastic stability ,Springer Science & Business Media, 2012.[61] M. L. Puterman,
Markov Decision Processes: Discrete StochasticDynamic Programming (Wiley Series in Probability and Statistics) ,Wiley-Interscience, 2005.[62] P. Billingsley,
Probability and measure , John Wiley & Sons, 2008.[63] O. Hern´andez-Lerma and J. B. Lasserre,
Further topics on discrete-timeMarkov control processes , vol. 42, Springer Science & Business Media,2012.[64] J. K. Hunter, “Introduction to applied mathematics,” lecture notes ofMath 207A, University of California Davis, CA, Fall Quarter, 2011.[65] W. Hoeffding, “Probability inequalities for sums of bounded randomvariables,” in
The Collected Works of Wassily Hoeffding , pp. 409–426.Springer, 1994. [66] R. Ayoub, “Euler and the zeta function,” The American MathematicalMonthly , vol. 81, no. 10, pp. 1067–1086, 1974.
IX. Appendix

Appendix A: Derivation of (5)

Define S_l as the residual sleeping period of source l after a sleep-wake cycle is over. Since the sleeping period of source l is exponentially distributed with mean E[T]/r_l, the memoryless property of the exponential distribution implies that S_l is also exponentially distributed with mean E[T]/r_l. According to the proposed sleep-wake scheduler, source l gains access to the channel and transmits successfully in a given cycle if S_i ≥ S_l + t_s for all i ≠ l. Hence, we have

α_l = P(S_i ≥ S_l + t_s, ∀ i ≠ l)  (73)
    (a)= E[ P(S_i ≥ S_l + t_s, ∀ i ≠ l | S_l) ]  (74)
    (b)= E[ ∏_{i≠l} P(S_i ≥ S_l + t_s | S_l) ]  (75)
    = ∫_0^∞ ∏_{i≠l} e^{−r_i (s_l + t_s)/E[T]} (r_l/E[T]) e^{−r_l s_l/E[T]} ds_l  (76)
    = r_l e^{r_l t_s/E[T]} / ( e^{(∑_{i=1}^M r_i) t_s/E[T]} ∑_{i=1}^M r_i ),  (77)

where (a) is due to P[A] = E[P(A|B)], and (b) is due to the fact that S_l is independent across sources.

Appendix B: Derivation of (13)

Recall the definition of S_l at the beginning of Appendix A. Moreover, define P_l as the probability that source l transmits a packet in a given cycle, regardless of whether a packet collision occurs or not. For the sleep-wake scheduling mechanism that we are utilizing here, source l transmits in a given cycle as long as no other source wakes up before S_l − t_s, i.e., S_i ≥ S_l − t_s for all i ≠ l. Hence, we have

P_l = P(S_i ≥ S_l − t_s, ∀ i ≠ l)  (78)
    = P(S_i ≥ S_l − t_s, ∀ i ≠ l, S_l ≥ t_s) + P(S_l < t_s),  (79)

where the first term on the RHS is given by

P(S_i ≥ S_l − t_s ≥ 0, ∀ i ≠ l)  (80)
  = E[ P(S_i ≥ S_l − t_s ≥ 0, ∀ i ≠ l | S_l) ]  (81)
  = E[ ∏_{i≠l} P(S_i ≥ S_l − t_s ≥ 0 | S_l) ]  (82)
  = ∫_{t_s}^∞ ∏_{i≠l} e^{−r_i (s_l − t_s)/E[T]} (r_l/E[T]) e^{−r_l s_l/E[T]} ds_l  (83)
  = e^{−r_l t_s/E[T]} r_l / ∑_{i=1}^M r_i.  (84)

Since S_l is exponentially distributed with mean E[T]/r_l, the second term on the RHS of (79) is

P(S_l < t_s) = 1 − e^{−r_l t_s/E[T]}.  (85)

Substituting (84) and (85) back into (79), we get

P_l = 1 − e^{−r_l t_s/E[T]} + e^{−r_l t_s/E[T]} r_l / ∑_{i=1}^M r_i.  (86)

Let α_col denote the collision probability in a given cycle. We have α_col = 1 − ∑_{i=1}^M α_i, because each cycle includes either a successful transmission or a collision. Moreover, let E[Idle] denote the mean idle duration in a cycle. By renewal theory [58], σ_l is given by

σ_l = P_l E[T] / [ (∑_{i=1}^M α_i + α_col) E[T] + E[Idle] ]  (87)
    = P_l E[T] / ( E[T] + E[T]/∑_{i=1}^M r_i )  (88)
    = ( [1 − e^{−r_l t_s/E[T]}] ∑_{i=1}^M r_i + r_l e^{−r_l t_s/E[T]} ) / ( ∑_{i=1}^M r_i + 1 ).  (89)

Appendix C: Proof of Lemma 6

We first establish the existence and uniqueness of the solution β⋆ of (20).

Lemma 13. Suppose that w_l > 0 and b_l > 0 for all l. If ∑_{i=1}^M b_i ≥ 1, then (20) has a unique solution on [0, max_l(b_l/√w_l)]; otherwise, (20) has no solution.

Proof. It is clear that if ∑_{i=1}^M b_i = 1, then β⋆ satisfies (20) if and only if β⋆ ≥ max_l(b_l/√w_l). Hence, (20) has a unique solution on [0, max_l(b_l/√w_l)] in this case. We now focus on the case ∑_{i=1}^M b_i > 1, in which we have the following:
• If β⋆ = 0, then ∑_{i=1}^M min{b_i, β⋆√w_i} = 0.
• If β⋆ = max_l(b_l/√w_l), then ∑_{i=1}^M min{b_i, β⋆√w_i} > 1.
• The left-hand side (LHS) of (20) is strictly increasing and continuous in β⋆ on [0, max_l(b_l/√w_l)].
As a result, (20) has a unique solution on [0, max_l(b_l/√w_l)] in this case as well. Finally, if ∑_{i=1}^M b_i < 1, then ∑_{i=1}^M min{b_i, β⋆√w_i} ≤ ∑_{i=1}^M b_i < 1, so (20) has no solution. This completes the proof.

Since we have ∑_{i=1}^M b_i ≥ 1, Lemma 13 implies that (20) has a solution for β⋆. Now, we are ready to prove Lemma 6. Consider the following constraints:

( (r_l t_s/E[T]) ∑_{i=1}^M r_i + r_l ) / ( ∑_{i=1}^M r_i + 1 ) ≤ b_l, ∀ l.  (90)

Since

1 − e^{−r_l t_s/E[T]} ≤ r_l t_s/E[T],  (91)
e^{−r_l t_s/E[T]} ≤ 1,  (92)

we have

[1 − e^{−r_l t_s/E[T]}] ∑_{i=1}^M r_i + r_l e^{−r_l t_s/E[T]} ≤ (r_l t_s/E[T]) ∑_{i=1}^M r_i + r_l.  (93)

Thus, if the constraints in (90) are satisfied for a given solution r, then the constraints of the original problem are satisfied as well. We can observe that the constraints in (90) are equivalent to the following set of constraints:

r_l ≤ b_l (x + 1) / (1 + (t_s/E[T]) x), ∀ l,  with  ∑_{i=1}^M r_i = x.  (94)

Now, it is easy to show that if x ≤ √(E[T]/t_s), then x ≤ (x + 1)/[1 + (t_s/E[T]) x]. Meanwhile, our proposed solution r⋆ in (18)-(20) satisfies ∑_{i=1}^M r⋆_i = x⋆.
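Lemma 13 also suggests a direct way to compute β⋆ numerically: the LHS of (20) is continuous and nondecreasing in β⋆, so bisection on [0, max_l(b_l/√w_l)] converges. The sketch below (a minimal Python illustration; the function names and tolerance are our own choices, not from the paper) computes β⋆ and then assembles the sleep-rate vector via r⋆_l = min{b_l, β⋆√w_l} x⋆ with x⋆ from (96), for the energy-adequate regime ∑ b_i ≥ 1:

```python
import math

def solve_beta(w, b, tol=1e-12):
    """Bisection for the fixed point of (20): sum_i min(b_i, beta*sqrt(w_i)) = 1.
    By Lemma 13 a unique solution exists on [0, max_l b_l/sqrt(w_l)]
    whenever sum(b) >= 1."""
    assert sum(b) >= 1.0, "energy-adequate regime required"
    lo, hi = 0.0, max(bl / math.sqrt(wl) for bl, wl in zip(b, w))
    lhs = lambda beta: sum(min(bl, beta * math.sqrt(wl)) for bl, wl in zip(b, w))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sleep_rates(w, b, ET, ts):
    """Proposed sleep-rate vector: r*_l = min(b_l, beta*sqrt(w_l)) * x*,
    with x* = -1/2 + sqrt(1/4 + E[T]/t_s) as in (96)."""
    beta = solve_beta(w, b)
    x_star = -0.5 + math.sqrt(0.25 + ET / ts)
    return [min(bl, beta * math.sqrt(wl)) * x_star for bl, wl in zip(b, w)]
```

Note that, by construction, the returned rates satisfy ∑_l r⋆_l/x⋆ = 1 and, by (98), ∑_l r⋆_l ≤ √(E[T]/t_s).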
Thus, if we can show that x⋆ ≤ √(E[T]/t_s), then

r⋆_l = min{b_l, β⋆√w_l} x⋆ ≤ b_l x⋆ ≤ b_l (x⋆ + 1) / (1 + (t_s/E[T]) x⋆),  (95)

and the constraints in (94) hold for our proposed solution r⋆. What remains is to prove that x⋆ ≤ √(E[T]/t_s). We have

x⋆ = −1/2 + √(1/4 + E[T]/t_s)  (96)
   = (E[T]/t_s) / ( 1/2 + √(1/4 + E[T]/t_s) )  (97)
   ≤ (E[T]/t_s) / √(E[T]/t_s) = √(E[T]/t_s).  (98)

Hence, our proposed solution r⋆ in (18)-(20) satisfies (94), which implies (90). This completes the proof.

Appendix D

Replacing e^{−r_l t_s/E[T]} e^{(∑_{i=1}^M r_i) t_s/E[T]} by 1 in (45), we obtain the following optimization problem:

min_{r_l>0} ∑_{l=1}^M (w_l/r_l) ( ∑_{i=1}^M r_i + 1 ) + ∑_{l=1}^M w_l  (99)
s.t. r_l ≤ b_l ( ∑_{i=1}^M r_i + 1 ), ∀ l.  (100)

Since e^{−r_l t_s/E[T]} e^{(∑_{i=1}^M r_i) t_s/E[T]} ≥ 1, the optimal objective value of Problem (99) is a lower bound on that of the original problem. Define the auxiliary variable y = ∑_{i=1}^M r_i + 1. With this, we solve a two-layer nested optimization problem: in the inner layer, we optimize r for a given y; having solved for r, we optimize y in the outer layer. Fixing the value of y, we obtain the following inner-layer problem:

min_{r_i>0} ∑_{i=1}^M [ w_i y/r_i + w_i ]  (101)
s.t. r_l ≤ b_l y, ∀ l,  (102)
     ∑_{i=1}^M r_i + 1 = y.  (103)

The objective function in (101) is convex, and the constraints in (102) and (103) are affine. Hence, Problem (101) is convex. We use the Lagrangian duality approach to solve it. Problem (101) satisfies Slater's condition; thus, the Karush-Kuhn-Tucker (KKT) conditions are both necessary and sufficient for optimality [59]. Let γ = (γ_1, ..., γ_M) and µ be the Lagrange multipliers associated with constraints (102) and (103), respectively. The Lagrangian of Problem (101) is

L(r, γ, µ) = ∑_{i=1}^M [ w_i y/r_i + w_i ] + ∑_{i=1}^M γ_i (r_i − b_i y) + µ ( ∑_{i=1}^M r_i + 1 − y ).  (104)

Taking the derivative of (104) with respect to r_l and setting it equal to 0, we get

−w_l y/r_l² + γ_l + µ = 0.  (105)

This and the KKT conditions imply

r_l = √( w_l y / (γ_l + µ) ),  (106)
γ_l ≥ 0,  r_l − b_l y ≤ 0,  (107)
γ_l (r_l − b_l y) = 0,  (108)
∑_{i=1}^M r_i + 1 = y.  (109)

If γ_l = 0, then r_l = √(w_l y/µ) and r_l ≤ b_l y; otherwise, if γ_l > 0, then r_l = b_l y and r_l < √(w_l y/µ). Hence, we have

r_l = min{ b_l y, √(w_l y/µ⋆) },  (110)

where, by (103), µ⋆ satisfies

∑_{i=1}^M min{ b_i y, √(w_i y/µ⋆) } + 1 = y.  (111)

We can observe that µ⋆ is a function of y. Accordingly, define β⋆(y) = √(1/(y µ⋆)), which is a function of y as well. Then, the optimal solution to (101) can be rewritten as

r_l = min{ b_l, β⋆(y)√w_l } y, ∀ l,  (112)

where β⋆(y) satisfies

∑_{i=1}^M min{ b_i, β⋆(y)√w_i } + 1/y = 1.  (113)

Substituting (112) and (113) back into Problem (101), we get the following outer-layer problem:

min_{y>0} ∑_{i=1}^M [ w_i / min{ b_i, β⋆(y)√w_i } + w_i ]  (114)
s.t. ∑_{i=1}^M min{ b_i, β⋆(y)√w_i } + 1/y = 1.  (115)

The optimal objective value of Problem (114) is again a lower bound on that of the original problem. We can observe that the objective function in (114) is decreasing in β⋆(y). Moreover, (115) implies that β⋆(y) is strictly increasing in y if ∑_{i=1}^M b_i ≥ 1. As a result, y = ∞ is the optimal solution of Problem (114). In the limit, the constraint (115) converges to (20). Since β⋆ solves (20), we deduce that lim_{y→∞} β⋆(y) = β⋆. Thus, we have the following lower bound:

¯∆^{w-peak}_{opt} ≥ ∑_{i=1}^M [ w_i / min{ b_i, β⋆√w_i } + w_i ].  (116)

This completes the proof.

Appendix E

Using 1 − e^{−x} ≤ x, we can obtain

r_l e^{−r_l t_s/E[T]} + [1 − e^{−r_l t_s/E[T]}] ∑_{i=1}^M r_i
  = r_l + [1 − e^{−r_l t_s/E[T]}] ( ∑_{i=1}^M r_i − r_l )
  ≤ r_l + (r_l t_s/E[T]) ( ∑_{i=1}^M r_i − r_l ).  (117)

Hence, if r satisfies the constraint

[ r_l + (r_l t_s/E[T]) ( ∑_{i=1}^M r_i − r_l ) ] / ( ∑_{i=1}^M r_i + 1 ) ≤ b_l,  (118)

then r also satisfies the constraint in (17). Consider the following set of solutions, indexed by a parameter c > 0:

r_l = c u_l, ∀ l,  (119)
u_l = b_l / (1 − ∑_{i=1}^M b_i), ∀ l.  (120)

We want to find a c such that the solution in (119) and (120) is feasible. To achieve this, we first substitute (119) and (120) into the constraint (118), and get

[ c u_l + c² u_l (t_s/E[T]) ( ∑_{i=1}^M u_i − u_l ) ] / ( c ∑_{i=1}^M u_i + 1 ) ≤ b_l.  (121)

If equality is satisfied in (121), we obtain the following quadratic equation in c:

c² [ u_l (t_s/E[T]) ( ∑_{i=1}^M u_i − u_l ) ] + c ( u_l − b_l ∑_{i=1}^M u_i ) − b_l = 0.  (122)

The solution to (122) is given by c_l in (26). Hence, r_l = c_l u_l is feasible for the constraint (118) of source l. As feasibility for one source alone is insufficient, we further prove that the solution in (119) and (120) with c = min_l c_l is feasible for the energy constraints of all sources l = 1, ..., M. To that end, consider the monotonicity of the LHS of (121). Taking the derivative with respect to c, we get

[ u_l (t_s/E[T]) ( ∑_{i=1}^M u_i − u_l ) ( c² ∑_{i=1}^M u_i + 2c ) + u_l ] / ( c ∑_{i=1}^M u_i + 1 )² > 0.  (123)

Hence,

r_l = ( min_l c_l ) u_l, ∀ l,  (124)

is feasible for the energy constraints of all sources l = 1, ..., M. After some manipulations, the solution in (120) and (124) can be equivalently expressed as (18) and (25)-(27). This completes the proof.

Appendix F

Replacing e^{−r_l t_s/E[T]}/r_l by e^{−(∑_{i=1}^M r_i) t_s/E[T]} / [ b_l ( ∑_{i=1}^M r_i + 1 ) ] and e^{(∑_{i=1}^M r_i) t_s/E[T]} by 1 in (45), we obtain the following optimization problem:

min_{r_l>0} ∑_{l=1}^M w_l e^{−(∑_{i=1}^M r_i) t_s/E[T]} / b_l + ∑_{l=1}^M w_l
s.t. r_l ≤ b_l ( ∑_{i=1}^M r_i + 1 ), ∀ l.  (125)
Since r_l ≤ b_l ( ∑_{i=1}^M r_i + 1 ), we have

e^{−r_l t_s/E[T]} / r_l ≥ e^{−(∑_{i=1}^M r_i) t_s/E[T]} / [ b_l ( ∑_{i=1}^M r_i + 1 ) ].  (126)

Moreover, we have e^{(∑_{i=1}^M r_i) t_s/E[T]} ≥ 1. Thus, the optimal objective value of Problem (125) is a lower bound on that of the original problem. By removing the constant term ∑_{l=1}^M w_l from the objective of Problem (125) and then taking the logarithm, Problem (125) is reformulated as

min_{r_i>0} log( ∑_{i=1}^M w_i/b_i ) − ∑_{i=1}^M r_i t_s/E[T]
s.t. r_l ≤ b_l ( ∑_{i=1}^M r_i + 1 ), ∀ l.  (127)

Obviously, Problem (127) is a convex optimization problem and satisfies Slater's condition. Thus, the KKT conditions are necessary and sufficient for optimality. Let τ = (τ_1, ..., τ_M) be the Lagrange multipliers associated with the constraints of Problem (127). The Lagrangian of Problem (127) is

L(r, τ) = log( ∑_{i=1}^M w_i/b_i ) − ∑_{i=1}^M r_i t_s/E[T] + ∑_{i=1}^M τ_i [ r_i − b_i ( ∑_{j=1}^M r_j + 1 ) ].  (128)

Taking the derivative of (128) with respect to r_l and setting it equal to 0, we get

−t_s/E[T] + τ_l (1 − b_l) − ∑_{i≠l} τ_i b_i = 0.  (129)

This and the KKT conditions imply

τ_l = [ t_s/E[T] + ∑_{i≠l} τ_i b_i ] / (1 − b_l),  (130)
τ_l ≥ 0,  r_l − b_l ( ∑_{i=1}^M r_i + 1 ) ≤ 0,  (131)
τ_l [ r_l − b_l ( ∑_{i=1}^M r_i + 1 ) ] = 0.  (132)

Since ∑_{i=1}^M b_i < 1, (130) implies that τ_l > 0 for all l. This and (132) result in

r_l = b_l ( ∑_{i=1}^M r_i + 1 ), ∀ l.  (133)

Because ∑_{i=1}^M b_i < 1, (133) has a unique solution, which is given by

r_l = b_l / (1 − ∑_{i=1}^M b_i), ∀ l.  (134)

Hence, the solution to (125) and (127) is given by (134). Substituting (134) into (125), we get the following lower bound:

¯∆^{w-peak}_{opt} ≥ ∑_{l=1}^M w_l e^{−( ∑_{i=1}^M b_i / (1 − ∑_{i=1}^M b_i) ) t_s/E[T]} / b_l + ∑_{l=1}^M w_l.  (135)

This completes the proof.

Appendix G

Problem (38) is a convex optimization problem and satisfies Slater's condition. Thus, the KKT conditions are necessary and sufficient for optimality. Let λ = (λ_1, ..., λ_M) and ν be the Lagrange multipliers associated with the constraints (39) and (40), respectively. The Lagrangian of Problem (38) is

L(a, λ, ν) = ∑_{i=1}^M [ w_i/a_i + w_i ] + ∑_{i=1}^M λ_i (a_i − b_i) + ν ( ∑_{i=1}^M a_i − 1 ).  (136)

Taking the derivative of (136) with respect to a_l and setting it equal to 0, we get

−w_l/a_l² + λ_l + ν = 0.  (137)

This and the KKT conditions imply

a_l = √( w_l / (λ_l + ν) ),  (138)
λ_l ≥ 0,  a_l − b_l ≤ 0,  (139)
λ_l (a_l − b_l) = 0,  (140)
ν ≥ 0,  ∑_{i=1}^M a_i − 1 ≤ 0,  (141)
ν ( ∑_{i=1}^M a_i − 1 ) = 0.  (142)

If λ_l = 0, then a_l = √(w_l/ν) and a_l ≤ b_l. This implies ν > 0 and hence ∑_{i=1}^M a_i = 1, which holds when ∑_{i=1}^M b_i ≥ 1. If λ_l > 0, then a_l = b_l and a_l ≤ √(w_l/ν). In this case, we either have ν > 0, which implies ∑_{i=1}^M a_i = 1 and holds when ∑_{i=1}^M b_i ≥ 1; or ν = 0, which implies ∑_{i=1}^M a_i ≤ 1 and holds when ∑_{i=1}^M b_i ≤ 1. From the above argument, the solution can be derived according to the following two cases.

Case 1 (Energy-adequate regime, ∑_{i=1}^M b_i ≥ 1): In this case, the optimal solution is given by

a⋆_l = min{ b_l, √(w_l/ν⋆) }, ∀ l,  (143)

where we must have ν⋆ > 0, which implies ∑_{i=1}^M a⋆_i = 1. Hence, ν⋆ satisfies

∑_{i=1}^M min{ b_i, √(w_i/ν⋆) } = 1.  (144)

By comparing (144) with (20), we can deduce that √(1/ν⋆) = β⋆, where β⋆ satisfies

∑_{i=1}^M min{ b_i, β⋆√w_i } = 1.  (145)

Since ∑_{i=1}^M b_i ≥ 1, (145) has a solution for β⋆, as shown in Lemma 13. Hence, the solution to Problem (38) can be rewritten as

a⋆_l = min{ b_l, β⋆√w_l }, ∀ l.  (146)

Substituting (146) into (38), we obtain

¯∆^{w-peak}_{opt-s} = ∑_{i=1}^M [ w_i / min{ b_i, β⋆√w_i } + w_i ],  (147)

which is equal to the asymptotically optimal objective value in the energy-adequate regime in (24).

Case 2 (Energy-scarce regime, ∑_{i=1}^M b_i < 1): In this case, the optimal solution is

a⋆_l = b_l, ∀ l.  (148)

Substituting this into (38), we obtain

¯∆^{w-peak}_{opt-s} = ∑_{i=1}^M [ w_i/b_i + w_i ],  (149)

which is equal to the asymptotically optimal objective value in the energy-scarce regime in (30). This completes the proof.

Appendix H

A. Notation and Background on General State-Space Markov Processes
While analyzing the learning algorithm, we will have to work with Markov processes on a general state space [55], [60]. In this section, we provide a brief account of such processes.
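Before introducing the formal machinery, note that the per-cycle channel-access competition underlying the controlled process is easy to simulate directly: draw exponential residual sleep times S_i with means E[T]/r_i and apply the wake-up rule of Appendix A; the empirical success frequency of source l then matches the closed form (77). A minimal Python sketch (the function names are ours, not from the paper):

```python
import math, random

def alpha_closed_form(r, l, ET, ts):
    # Equation (77): probability that source l transmits successfully
    # in a given cycle under sleep rates r and sensing time ts.
    R = sum(r)
    return r[l] * math.exp(r[l] * ts / ET) * math.exp(-R * ts / ET) / R

def alpha_monte_carlo(r, l, ET, ts, cycles=200_000, seed=1):
    # Draw residual sleep times S_i ~ Exp(mean ET/r_i) and count the cycles
    # in which source l wakes up at least ts before every other source,
    # i.e., S_i >= S_l + ts for all i != l.
    rng = random.Random(seed)
    wins = 0
    for _ in range(cycles):
        S = [rng.expovariate(ri / ET) for ri in r]
        if all(S[i] >= S[l] + ts for i in range(len(r)) if i != l):
            wins += 1
    return wins / cycles
```

For example, with r = [1, 2, 3], E[T] = 1, and t_s = 0.05, the Monte Carlo estimate for source 0 agrees with e^{−0.25}/6 from (77) to within sampling error.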
Notation: For a set of random variables X, we let F(X) denote the smallest sigma-algebra with respect to which every random variable in X is measurable. For a set X, we let X^c denote its complement. For an event X, we let 1(X) denote its indicator random variable. For a set X, we let B(X) denote the sigma-algebra of Borel subsets of X.

We begin by showing that s(n) can be taken to be the system state/sufficient statistic [56] in order to describe the sampled process. In what follows, we let S := R_+ × {0, 1}. Denote by θ := ∫_0^{T_max} y f(y) dy the mean transmission time of a packet of any source, i.e., we use the abbreviation θ = E[T]. The proof of the following result is omitted for brevity.

Lemma 14. Consider the system in which M sources share a channel and utilize the sleep period parameters r(n) ≡ r in order to modulate the sleep durations of the sources. We then have that

P( s(n+1) ∈ A | F_t ) = K( s(n), r, A; f ),  (150)

where F_t denotes the sigma-algebra generated by all the random variables until the n-th discrete sampling instant. The function K is the kernel [55] associated with the controlled transition probabilities of the process s(n),

K : S^M × R_+^M × B(R^M) → [0, 1].  (151)

Thus, K(s, r, A; f) is the probability with which the state at time n+1 belongs to the set A, given that the state at time n is equal to s and the vector comprising the sleep period parameters at time n is equal to r. Note that the kernel is parametrized by the density function f of the transmission time.

We begin by stating some definitions associated with Markov chains on general state spaces. Though these can be found in standard textbooks on general state-space Markov chains such as [55], [60], we include them here in order to make the paper self-contained.

Let us now fix the controls at r(n) ≡ r and consider the resulting discrete-time Markov chain s(n) ∈ S^M. If A is a Borel set, we let P^n(x, A) denote the probability of the event s(n) ∈ A, given that s(0) = x.

Definition 1 (Small Set). A set C ∈ B(S^M) is called ν_m-small if for all x ∈ C we have that P^m(x, A) ≥ ν_m(A), ∀ A ∈ B(S^M), for some non-trivial measure ν_m(·) and some m ∈ N.

Definition 2 (Petite Set). Let q = {q_n}_{n∈N} be a probability distribution on N. A set C ∈ B(S^M) and a non-trivial sub-probability measure ν_q(·) are called petite if we have that ∑_{n∈N} q_n P^n(x, A) ≥ ν_q(A), ∀ A ∈ B(S^M), ∀ x ∈ C.

Definition 3 (Strong Aperiodicity). If there exists a ν_1-small set C such that ν_1(C) > 0, then the chain s(n) is strongly aperiodic.

B. Preliminary Results

We now show that, in order to minimize the expected value of C(H), it suffices to design controllers that "work directly" with the sampled system. Thus, the quantity s(n) as described in (62) serves as a sufficient statistic for the purpose of optimizing the expected cumulative peak age [56]. We also show that this objective can be posed as a constrained Markov decision process [61].

Lemma 15.
Let s(n), n = 1, 2, ..., be the sampled controlled Markov process. There exists a function g : S^M → R such that E[C(H)] in (63) is given by E[ ∑_{n=1}^H g(s(n)) ].

Proof. Consider the cumulative peak-age cost (63), in which the l-th source incurs a penalty of ∆^peak_{l,i} upon delivery of the i-th packet. Let this delivery occur at the end of the n-th discrete time slot (note that this time n is random). Let us denote by a^peak_l(n) the peak age of source l during the (continuous) time interval (in the non-discretized system) corresponding to the discrete time slots n−1 and n. We could (instead of charging a penalty of ∆^peak_{l,i} units at the end of the n-th slot) charge the quantity E{ a^peak(n) | s(n−1), r(n−1) } at the discrete time instant n−1. For sources k ≠ l that are not transmitting between n−1 and n, and have m_k(n−1) = 0, the corresponding contribution to g(s(n−1)) is zero. It then follows from the law of iterated expectations [62] that the expected cost of the system under this modified cost function remains the same as that of the original system. This completes the proof.

Ergodicity of s(n): We now derive a few useful results about the Markov process s(n).

Lemma 16.
Consider the multi-source wireless network operating under the controls r(n) ≡ r, and assume that the sensing time t_s is sufficiently small. Consider the associated process s(n), n = 1, 2, .... We then have the following:
1) Define e_i := (M − i)ε and m_i = 0 for all i ∈ [N], where ε > 0 is chosen to be sufficiently small. Consider the set

C := ⊗_{i=1}^N [ [(M − i), (M − i) + e_i] × {m_i} ].  (152)

The set C is small for the process s(n).
2) For the process s(n), each compact set is petite.
3) The process s(n) is strongly aperiodic.

Proof. 1) Consider s(0) ∈ C. It follows that at time n = 0 all the sources are sleeping. Consider the following set, denoted C′: Sources 1 and 2 wake up within t_s time duration of each other, while the other sources wake up much later than these two. Consequently, there is a collision between Source 1 and Source 2, and hence at time n = 1 these two sources enter sleep mode, so that at time n = 1 all the sources are asleep. Also assume that the cumulative time elapsed for this event to occur is approximately equal to t_s + δ, where δ > 0 is a sufficiently small parameter. The probability of the event {s(1) ∈ C′} can be lower bounded as follows:

P( s(1) ∈ C′ ) ≥ ( r_1 ∫_δ^{δ+ε} exp(−r_1 x) dx ) × t_s r_2 exp(−r_2 (δ + ε)) × [ ∏_{i=3}^N ∫_{δ+ε}^∞ r_i exp(−r_i x) dx ].

Since the above lower bound on the probability of "reaching C′" holds for all s(0) ∈ C, it follows from Definition 1 that the set C is small.

2) Consider the process s(n) starting in state s(0), and let the age vector a(0) belong to a compact set, so that s(0) also belongs to a compact set. We will derive a lower bound on the probability of the event {s(N) ∈ C}, where C is as in (152). This will prove (ii), since we have already shown in (i) that the set C is small. Consider the following sample path: at each time i ∈ [1, M], source i successfully transmits a packet, and moreover the age of the packet received is approximately equal to 0. We will derive a lower bound on the probability of this event. In the following discussion we use b > 0 and η ∈ (0, 1 − t_s − b), where η denotes the wake-up time of the source under consideration. Since the counter of the i-th source has a probability density equal to r_i e^{−r_i x}, the probability that during the i-th slot source i gets channel access is lower bounded by (1 − exp(−η r_i)) ∏_{j≠i} e^{−r_j}; while the probability that the age of its delivered packet is around 0, given that it wakes up at η, is lower bounded by ∫_0^b f(y) dy. Thus, the probability of this sample path is lower bounded by

∏_{i=1}^N (1 − exp(−η r_i)) ∏_{j≠i} e^{−r_j} ∫_0^b f(y) dy.

This concludes the proof, since along this sample path we have s(N) ∈ C.

3) It follows from the discussion on page 121 of [60] that, in order to prove the claim, it suffices to show that the volume of the set C ∩ C′ is greater than 0. This condition holds true if the parameter δ in (i) above has been chosen so as to satisfy t_s + δ < ε.

We now show that the process s(n) has a certain "mixing property". For a measure µ and a function f, we define ||µ||_f := ∫ f(x) dµ(x).

Lemma 17 (Geometric Ergodicity). Consider the controlled Markov process s(n), n = 1, 2, ..., associated with the network in which the controller utilizes r(n) ≡ r. The process s(n) has an invariant probability measure, which we denote by π(∞, r). Moreover,

∫ ( ||y|| + 1 ) d( P^n(x, ·) − π(∞, r) )(y) ≤ R ( ||s(0)|| + 1 ) ρ^n, n ∈ N,  (153)

where R > 0 and ρ < 1.

Proof. Since we have shown in Lemma 16 that s(n) is strongly aperiodic, it follows from Theorem 6.3 of [60] that, in order to prove the claim, it suffices to show that the following holds true when ||s(n+1)|| is sufficiently large:

E( ||s(n+1)|| | F_n ) ≤ λ ||s(n)|| + L,  (154)

where λ < 1. Note that each source gets to transmit with a probability at least min_l α_l, and also the expected value of the inter-sampling time is upper bounded by max{ E[T], E[T]/∑_{i=1}^M r_i + t_s }. It then follows that (154) holds true with λ set equal to min_l α_l and L equal to max{ E[T], E[T]/∑_{i=1}^M r_i + t_s }.

Lemma 18 (Differential Cost Function). Consider the process s(n), n = 1, 2, ..., that describes the evolution of the network in which the controller utilizes r(n) ≡ r.
Then, there exists afunction V : S M R that satisfies V ( x ) + Z g ( x ) dπ ( ∞ , r ) = g ( x ) + Z K ( x , r , y ; f ) V ( y ) dy, (155) where K is the transition kernel as described in Lemma 14,the function g is the one-step cost function as in Lemma (15) .Moreover, the function V satisfies the following, V ( x ) ≤ R − ρ ( k ∆(0) k + 1) , (156) where the constant R is as in Lemma 17.Proof. We have shown in Lemma 17 that the process s ( n ) isgeometrically ergodic. Hence, it follows from Theorem 7.5.10of [63] that there exists a function V ( · ) that satisfies (155),and moreover it is given as follows, V ( x ) = ∞ X n =1 (cid:20) E x ( g ( x ( n ))) − Z S M g ( y ) dπ ( ∞ , r )( y ) (cid:21) , x ∈ S . Substituting the geometric bound (153) into the above, we obtain the following V ( x ) = ∞ X n =1 (cid:20) E x g ( x ( n )) − Z S M g ( y ) dπ ( ∞ , r )( y ) (cid:21) ≤ ∞ X n =1 (cid:12)(cid:12)(cid:12) E x g ( x ( n )) − Z S M g ( y ) dπ ( ∞ , r )( y ) (cid:12)(cid:12)(cid:12) ≤ R ( k x (0) k + 1) ∞ X n =1 ρ n = R ( k x (0) k + 1)1 − ρ , where ρ < . Lemma 19 (Smoothness properties of the optimal averagecost) . The optimal sleep period parameters r ⋆θ and averagecost ¯∆ w − peak satisfy the following:1) We have that the function r ⋆θ : Θ R M + that maps themean transmission time θ to the optimal sleep periodparameter, is a continuous function of θ . Similarly, theaverage peak age is a continuous function of r , i.e., lim r → r ⋆θ lim H →∞ H H X n =1 E r [ g ( s ( n ))] → lim H →∞ H H X n =1 E r ⋆θ [ g ( s ( n ))] , where the sub-script r in the expectation E r above refersto the fact that the averaging is performed w.r.t. themeasure induced by the policy that uses sleep rates equalto r .2) The cumulative peak-age is locally Lipschitz continuousfunction of r . 
Thus,
$$\left|\bar\Delta_{\text{w-peak}}(r^\star_\theta; \theta) - \bar\Delta_{\text{w-peak}}(r; \theta)\right| \le L\,\|r^\star_\theta - r\|,$$
whenever $\|r^\star_\theta - r\|$ is sufficiently small, where the Lipschitz constant at sleep-period parameter $r$ is given by
$$L := \max_{i \in [M]} \frac{\partial \bar\Delta_{\text{w-peak}}}{\partial r_i}(r).$$
Similarly, the optimal sleep-period parameter is a locally Lipschitz function of $\theta$, so that we have
$$\|r^\star_{\theta_1} - r^\star_{\theta_2}\| \le L\,|\theta_1 - \theta_2|, \quad L > 0,$$
whenever $|\theta_1 - \theta_2|$ is sufficiently small.
In summary, there exists a $\delta > 0$ such that whenever $|\theta_1 - \theta_2| \le \delta$, we have
$$\left|\bar\Delta_{\text{w-peak}}(r^\star_{\theta_1}; \theta) - \bar\Delta_{\text{w-peak}}(r^\star_{\theta_2}; \theta)\right| \le L\,|\theta_1 - \theta_2|.$$

Proof.
Continuity of the functions under discussion is immediate from the relations (18), (19), (20), (25), (26), and (27). To prove the statement about Lipschitz continuity, it suffices to show that the average peak age is a Lipschitz continuous function of $r$, and that the optimal rate $r^\star_\theta$ is a Lipschitz continuous function of $\theta$. To prove this, it suffices to show that the average peak age is a continuously differentiable function of $r$, and also that $r^\star_\theta$ is a continuously differentiable function of $\theta$ (see [64] for more details). The continuous differentiability is evident from the relations (11), (18)-(20), and (25)-(27). This completes the proof.

Bounds on the Estimation Error: We now derive some concentration results for the estimate $\hat\theta(n)$ around the true value $\theta^\star$. Let $C(n)$ be the confidence interval associated with the estimate $\hat\theta(n)$, i.e.,
$$C(n) := \left\{\theta : |\theta - \hat\theta(n)| \le \xi(n),\ \theta > 0\right\}, \qquad (157)$$
where
$$\xi(n) := T_{\max}\sqrt{\frac{\gamma \log n}{N(n)}}, \quad 1 \le n \le H,$$
$\gamma \ge 4$ is a constant, $N(n)$ is the total number of packet deliveries until time $n$, and $T_{\max}$ is the maximum possible transmission time. We begin by showing that, with high probability, our confidence intervals are true at all times.

Lemma 20.
Define
$$G(n) := \{\omega : \theta^\star \in C(n)\},$$
where $C(n)$ is as in (157), and $\theta^\star$ is the true parameter value. We then have that
$$P\big(G^c(n)\big) \le \frac{1}{n^{\gamma-1}}.$$

Proof.
Fix a positive integer $n$, and let $\hat\theta$ denote the empirical estimate obtained from $n$ samples $T(1), T(2), \ldots, T(n)$ of the service times. It follows from the Azuma-Hoeffding inequality [65] that
$$P\left(|\hat\theta - \theta^\star| > x\right) \le \exp\left(-\frac{n x^2}{T_{\max}^2}\right).$$
By using $x = T_{\max}\sqrt{\frac{\gamma \log n}{n}}$ in the above, we obtain
$$P\left(|\hat\theta - \theta^\star| > T_{\max}\sqrt{\frac{\gamma \log n}{n}}\right) \le \exp\left(-\gamma \log n\right) = \frac{1}{n^\gamma}.$$
Since the total number of samples can assume values from the set $\{1, 2, 3, \ldots, n\}$, the proof then follows by using the union bound over $n$.

Lemma 21.
Fix a $\delta \in (0, p_{\min}^2)$, where $p_{\min}$ is as in (72). Define the event
$$G_1(n) := \left\{\omega : N(n) > \left(p_{\min} - \sqrt{\delta}\right) n\right\}, \qquad (158)$$
where $N(n)$ denotes the number of samples that have been obtained until time $n$ for estimating transmission times. We then have that
$$P\big(G_1^c(n)\big) \le \exp(-\delta n).$$

Proof.
Consider the following martingale difference sequence:
$$m(i) = \mathbb{E}\left\{c(i) \,\middle|\, \mathcal{F}_{i-1}\right\} - c(i).$$
Since $\mathbb{E}\{c(i) \mid \mathcal{F}_{i-1}\} \ge p_{\min}$, we have that
$$\sum_{i=1}^{n} m(i) \ge p_{\min}\, n - N(n). \qquad (159)$$
Since $|m(i)| \le 1$, we have the following from the Azuma-Hoeffding inequality [65]:
$$P\left(\left|\sum_{i=1}^{n} m(i)\right| \ge x\right) \le \exp\left(-\frac{x^2}{n}\right).$$
Letting $x = \sqrt{\delta}\, n$, we get
$$P\left(\left|\sum_{i=1}^{n} m(i)\right| \ge \sqrt{\delta}\, n\right) \le \exp(-\delta n). \qquad (160)$$
Substituting (159) into the above inequality, we obtain
$$P\left(N(n) \le \left(p_{\min} - \sqrt{\delta}\right) n\right) \le \exp(-\delta n).$$
This completes the proof.
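The two concentration results above are straightforward to evaluate numerically. The following Python sketch (ours, not part of the paper; the function names are illustrative, and we take $\gamma = 4$ for concreteness) computes the confidence radius $\xi(n)$ from (157), the failure probability of Lemma 20, and the delivery-count threshold of Lemma 21:

```python
import math

def confidence_radius(n, num_deliveries, t_max, gamma=4):
    """Radius xi(n) = T_max * sqrt(gamma * log(n) / N(n)) of the
    confidence interval C(n) in (157)."""
    return t_max * math.sqrt(gamma * math.log(n) / num_deliveries)

def confidence_failure_bound(n, gamma=4):
    """Lemma 20: P(theta_star not in C(n)) <= 1 / n^(gamma - 1)."""
    return 1.0 / n ** (gamma - 1)

def min_delivery_count(n, p_min, delta):
    """Lemma 21: with probability at least 1 - exp(-delta * n), the
    number of deliveries satisfies N(n) > (p_min - sqrt(delta)) * n."""
    return (p_min - math.sqrt(delta)) * n

# The radius shrinks as more packet deliveries are observed ...
assert confidence_radius(1000, 200, t_max=1.0) < confidence_radius(1000, 50, t_max=1.0)
# ... and the probability that the interval fails decays polynomially in n.
assert confidence_failure_bound(10) == 1e-3
```

Note that `min_delivery_count` returns a positive threshold precisely because $\delta < p_{\min}^2$; the regret analysis below combines this event with the radius formula to argue that, after enough episodes, the estimation error drops below the $\delta$ of Lemma 19.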
C. Regret Analysis
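Before bounding the episodic regret terms, it is useful to sanity-check the episode bookkeeping that the analysis relies on. The sketch below (ours; `episode_starts` and the doubling schedule are illustrative assumptions consistent with the property $\tau_{k+1} - \tau_k \le \tau_k$ used in this subsection) verifies numerically that the number of episodes grows only logarithmically in the horizon $H$, and that the Cauchy-Schwarz step $\sum_k \sqrt{\tau_k} \le \sqrt{K \sum_k \tau_k}$ holds:

```python
import math

def episode_starts(horizon):
    """Episode start times tau_1, tau_2, ... under a doubling schedule,
    which satisfies tau_{k+1} - tau_k <= tau_k."""
    taus = [1]
    while taus[-1] * 2 <= horizon:
        taus.append(taus[-1] * 2)
    return taus

H = 10_000
taus = episode_starts(H)
K = len(taus)

# K = O(log H): the number of episodes grows logarithmically in the horizon.
assert K <= math.log2(H) + 1

# sum_k tau_k stays within a constant factor of H, which is what makes
# the sqrt(H K) bound via Cauchy-Schwarz possible.
assert sum(taus) <= 2 * H

# Cauchy-Schwarz step: sum_k sqrt(tau_k) <= sqrt(K * sum_k tau_k).
assert sum(math.sqrt(t) for t in taus) <= math.sqrt(K * sum(taus))
```

With $K = O(\log H)$ and $\sum_k \tau_k = O(H)$, the chain of inequalities checked here is exactly what turns the per-episode confidence radii into the $\tilde O(\sqrt{H})$ regret term derived at the end of this subsection.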
The cumulative regret $R(H)$ defined in (65) decomposes into the sum of "episodic regrets" $R^{(e)}(k)$ as follows:
$$\mathbb{E}[R(H)] = \sum_{k=1}^{K} \mathbb{E}\left[R^{(e)}(k)\right], \qquad (161)$$
where
$$R^{(e)}(k) := \mathbb{E}\left\{\sum_{n \in \mathcal{E}_k} g(s(n)) - \bar\Delta_{\text{w-peak}}(r^\star) \,\middle|\, \mathcal{F}_{\tau_k}\right\}. \qquad (162)$$
Combining the regret decomposition with the smoothness properties of the optimal average cost that were derived in Lemma 19, we obtain the following key result, which allows us to upper bound $R(H)$.

Lemma 22.
The cumulative expected regret (161) of a learning algorithm can be upper bounded as follows:
$$\mathbb{E}[R(H)] \le K \sum_{k=1}^{K} (\tau_{k+1} - \tau_k)\, P\left(|\hat\theta(\tau_k) - \theta^\star| > \delta\right) + L \sum_{k=1}^{K} (\tau_{k+1} - \tau_k)\, \mathbb{E}\left(|\hat\theta(\tau_k) - \theta^\star|\; \mathbb{1}\left\{|\hat\theta(\tau_k) - \theta^\star| \le \delta\right\}\right), \qquad (163)$$
where the constant $\delta > 0$ is as in Lemma 19.

Proof. It follows from the ergodicity properties of the process $s(n)$ that were proved in Lemma 18, and from Assumption 2 regarding $s(n)$, that the episodic regret can be bounded as follows ($\rho, R$ are as in Lemma 18 and Assumption 2):
$$R^{(e)}(k) \le \frac{R}{1-\rho}(K+1) \qquad (164)$$
$$+ \left|\bar\Delta_{\text{w-peak}}(r^\star_{\hat\theta(\tau_k)}; \theta) - \bar\Delta_{\text{w-peak}}(r^\star)\right| (\tau_{k+1} - \tau_k). \qquad (165)$$
The following two events are possible:
(i) $|\theta^\star - \hat\theta(\tau_k)| < \delta$: In this case, it follows from Lemma 19 that
$$\left|\bar\Delta_{\text{w-peak}}(r^\star_{\hat\theta(\tau_k)}; \theta) - \bar\Delta_{\text{w-peak}}(r^\star)\right| \le L\,|\theta^\star - \hat\theta(\tau_k)|.$$
(ii) $|\theta^\star - \hat\theta(\tau_k)| > \delta$: It follows from Assumption 3 that the average performance under any sleep parameter cannot exceed $K$, and hence we can bound $\left|\bar\Delta_{\text{w-peak}}(r^\star_{\hat\theta(\tau_k)}; \theta) - \bar\Delta_{\text{w-peak}}(r^\star)\right|$ by $K$.
The proof then follows by substituting the bounds discussed above for the two cases into (164), and using the regret decomposition result.

We now separately bound the expressions obtained in the two events ($|\theta^\star - \hat\theta(\tau_k)| < \delta$ and $|\theta^\star - \hat\theta(\tau_k)| > \delta$).

Regret when $|\theta^\star - \hat\theta(\tau_k)| > \delta$: Choose a sufficiently large $k_1 \in \mathbb{N}$ that satisfies
$$\tau_{k_1} = O\left(\frac{\log H}{\delta}\right). \qquad (166)$$
Define the event $G := \cap_{k \ge k_1} G_1(\tau_k)$, where $G_1$ is the event defined in (158). By combining the result of Lemma 21 with the union bound and using (166), we conclude that $G$ has probability greater than $1 - \sum_{k > k_1} \exp(-\delta \tau_k) = 1 - O\left(\frac{1}{H}\right)$. On $G$, the number of samples $N(\tau_k)$ at the beginning of each episode $k > k_1$ is greater than $\left(p_{\min} - \sqrt{\delta}\right)\tau_k$.
Thus on $G$, for episodes $k > k_1$, the radius of $C(\tau_k)$ is less than $T_{\max}\sqrt{\frac{\gamma \log H}{(p_{\min} - \sqrt{\delta})\,\tau_k}}$. Let $k_2$ be the smallest integer that satisfies $T_{\max}\sqrt{\frac{\gamma \log H}{(p_{\min} - \sqrt{\delta})\,\tau_{k_2}}} \le \delta$, i.e.,
$$\tau_{k_2} \ge \frac{\gamma\, T_{\max}^2 \log H}{(p_{\min} - \sqrt{\delta})\,\delta^2}, \qquad (167)$$
where the constant $\delta > 0$ is as in Lemma 19. Thus on $G$, for episodes $k \ge \max\{k_1, k_2\}$, the radius of the confidence intervals is less than $\delta$. Note that on $\cap_k G(\tau_k)$ the confidence intervals (157) at the beginning of each episode are true. Hence, on $\{\cap_k G(\tau_k)\} \cap G$ we have $|\hat\theta(\tau_k) - \theta^\star| < \delta$ for episodes $k \ge \max\{k_1, k_2\}$. Thus, on $\{\cap_k G(\tau_k)\} \cap G$, this regret is bounded by $K \max\{\tau_{k_1}, \tau_{k_2}\}$. Now consider sample paths on which some of the confidence intervals fail. The probability that $C(\tau_k)$ fails is less than $\tau_k^{1-\gamma}$ (Lemma 20); moreover, since the duration $(\tau_{k+1} - \tau_k)$ of episode $\mathcal{E}_k$ is less than $\tau_k$, the expected value of the regret during $\mathcal{E}_k$ in the event of failure of $C(\tau_k)$ is less than $K \tau_k^{2-\gamma}$. Since $\gamma \ge 4$, the cumulative expected regret arising from this is bounded by $K \sum_k \tau_k^{2-\gamma} \le \frac{K \pi^2}{6}$ [66]. We summarize our discussion as follows.

Lemma 23.
Under Algorithm 2, the following holds:
$$\sum_{k=1}^{K} (\tau_{k+1} - \tau_k)\, P\left(|\hat\theta(\tau_k) - \theta^\star| > \delta\right) \le K \max\left\{\frac{\gamma\, T_{\max}^2 \log H}{(p_{\min} - \sqrt{\delta})\,\delta^2},\ O\left(\frac{\log H}{\delta}\right)\right\} + \frac{K \pi^2}{6}, \qquad (168)$$
where $\gamma \ge 4$.

Regret when $|\theta^\star - \hat\theta(\tau_k)| < \delta$: As discussed above, on $\{\cap_k G(\tau_k)\} \cap G$ we have $|\hat\theta(\tau_k) - \theta^\star| < \delta$ for episodes $k > \bar{k}$, where $\bar{k} := \max\{k_1, k_2\}$. Thus, after using the smoothness property of the optimal average cost that was developed in Lemma 19, we obtain that the second summation on the r.h.s. of (163) can be bounded by the quantity
$$\sum_{k > \bar{k}} (\tau_{k+1} - \tau_k)\, T_{\max}\sqrt{\frac{\gamma \log H}{(p_{\min} - \sqrt{\delta})\,\tau_k}}.$$
Since we have $\tau_{k+1} - \tau_k \le \tau_k$, the above can be bounded by $T_{\max}\sqrt{\frac{\gamma \log H}{p_{\min} - \sqrt{\delta}}} \sum_{k > \bar{k}} \sqrt{\tau_k}$. By using the Cauchy-Schwarz inequality, the quantity $\sum_{k > \bar{k}} \sqrt{\tau_k}$ can be upper bounded by $\sqrt{HK}$, where $K$ denotes the number of episodes. Since $K = O(\log H)$, this regret is bounded by $T_{\max}\sqrt{\frac{\gamma H (\log H)^2}{p_{\min} - \sqrt{\delta}}}$. The bound we discussed is summarized below.

Lemma 24.