Efficient Learning-based Scheduling for Information Freshness in Wireless Networks

Abstract

Motivated by the recent trend of integrating artificial intelligence into the Internet-of-Things (IoT), we consider the problem of scheduling packets from multiple sensing sources to a central controller over a wireless network. Here, packets from different sensing sources have different values or degrees of importance to the central controller for intelligent decision making. In such a setup, it is critical to provide timely and valuable information for the central controller. In this paper, we develop a parameterized maximum-weight type scheduling policy that combines both Age-of-Information (AoI) metrics and Upper Confidence Bound (UCB) estimates in its weight measure with parameter \eta. Here, UCB estimates balance the tradeoff between exploration and exploitation in learning and are critical for yielding a small cumulative regret. We show that our proposed algorithm keeps the running average total age at most O(N^2\eta). We also prove that our proposed algorithm achieves a cumulative regret over time horizon T of at most O(NT/\eta+\sqrt{NT\log T}). This reveals a tradeoff between the cumulative regret and the running average total age: when \eta increases, the cumulative regret becomes smaller, but at the cost of a larger running average total age. Simulation results are provided to evaluate the efficiency of our proposed algorithm.


arXiv [cs.PF]

Bin Li
Department of Electrical, Computer, and Biomedical Engineering
University of Rhode Island, Kingston, RI 02881, USA
Email: [email protected]

I. INTRODUCTION

With the recent advances in artificial intelligence (AI), there is a trend toward incorporating AI into the Internet-of-Things (IoT), which consists of multiple wireless sensing sources, to support intelligent decision making. In such an IoT system, it is critical to ensure that the received sensing information is both valuable and timely. As such, in this paper, we consider the problem of scheduling packets from multiple sensing sources to a central controller over a wireless network as shown in Fig. 1, where packets from different sensing sources have different values or degrees of importance to the central controller for decision making.

In particular, we assume that each sensing source constantly generates packets with random values that are independently and identically distributed (i.i.d.) with an unknown distribution. The value of a packet is revealed only after the central controller successfully receives it. On the one hand, we would like to deliver packets from the most important sensing sources to the central controller so that it can make better decisions, subject to the wireless interference constraints. However, the controller does not have any prior knowledge of the degree of importance of these sensing sources and needs to gradually learn these statistics while scheduling the best sensing sources (a.k.a. the exploration-exploitation tradeoff in online learning). On the other hand, we should also ensure that the received packets have a low Age-of-Information (AoI), which measures the duration between the packet generation time and its reception time. This is because stale information is less useful to the central controller and might even mislead it into making harmful decisions. To that end, we aim to develop a scheduling algorithm that achieves this dual objective, which is complicated by the strong coupling between the learning and AoI performance.

This research has been supported in part by NSF grants CNS-1717108, CNS-1815563, and CNS-1942383.

Fig. 1: An intelligent Internet-of-Things (IoT) system. [Diagram: sensing sources, up to Source N, send packets through an access point to the central controller; each sensing source generates a packet in each time slot.]

Without the AoI constraint, the considered problem can be formulated as a combinatorial multi-armed bandit (MAB) problem (e.g., [1], [2], [3], [4]), where each arm corresponds to a sensing source, and the goal is to minimize the cumulative regret over a finite time horizon (i.e., the difference between the optimal cumulative reward and the cumulative reward under an algorithm). Combinatorial MAB algorithms allow multiple arms to be played simultaneously in each time slot, instead of one arm as in classical MAB algorithms such as Upper-Confidence-Bound (UCB [5]), Kullback-Leibler UCB (KL-UCB [6]), and Thompson sampling [7]. An efficient combinatorial MAB algorithm should quickly identify the set of best arms and keep pulling them. This, however, leads to a large AoI for the other, relatively poor arms. In particular, their AoI keeps increasing over time, which implies that the information received from relatively poor arms is outdated. This motivates us to incorporate the AoI metric into the learning algorithm design for combinatorial MAB problems.

While there are some recent works on AoI-efficient wireless scheduling (e.g., [8], [9], [10]; see [11] for an overview), their goal was to minimize AoI while guaranteeing the desired throughput. They did not consider the MAB setting with unknown system statistics, which is typical in intelligent IoT systems. As such, in this paper, we integrate the main ideas of the UCB algorithm (e.g., [5]) and AoI-efficient scheduling (e.g., [8]), and propose a Learning-based Age-Efficient Scheduling (LAES) Algorithm that utilizes both the UCB estimates and AoI metrics.
While there are some recent works on combinatorial bandits with fairness constraints (e.g., [12]), they focused on the long-term fairness constraint, i.e., each arm should be played for at least a fixed fraction of time on average. The main approach is to maintain a virtual queue for each arm that keeps track of its debt and to prioritize arms with large virtual queue lengths, typically referred to as virtual queue techniques (e.g., see [13] for an overview). However, the AoI captures the short-term dynamics of the system, and thus its evolution is fundamentally different from that of the virtual queue length. In particular, it has an unbounded decrement whenever a packet is successfully delivered, which has a significant impact on the performance analysis of the proposed algorithm.

Our contributions in this work are summarized as follows:

• We develop a parameterized maximum-weight type scheduling policy that combines both the AoI metric and the UCB estimate in its weight measure (cf. Section IV). In particular, we use the parameter η to balance the AoI metric and the UCB estimate. The larger the η, the more emphasis is placed on the UCB estimate, which leads to a smaller regret at the cost of a larger AoI.

• We derive an upper bound on the running average total age under our proposed algorithm with any $\eta > 0$ (cf. Proposition 1), which increases linearly with the parameter η. Such an upper bound is tight in some cases, in the sense that the average total age under our proposed algorithm scales linearly with η.

• We show that the cumulative regret over a finite time horizon $T$ can be bounded from above by $O(NT/\eta + \sqrt{NT\log T})$ under our proposed algorithm (cf. Proposition 2). Here, the second term has the same order as that of the UCB algorithm and is attributed to the cost of exploration/exploitation in online learning, while the first term $NT/\eta$ is the cost paid for improving AoI performance.
This, together with the derived upper bound on the running average total age, reveals a tradeoff: when η increases, the regret upper bound decreases, but the upper bound on the running average total age increases.

• We support our analytical results with extensive simulations (cf. Section V), which demonstrate the superior performance of our proposed algorithm over both the UCB algorithm and the age-based algorithm (i.e., our proposed algorithm with η = 0). Simulation results also confirm a tradeoff between the cumulative regret and the running average total age. The desired tradeoff can be achieved by tuning the value of the parameter η.

Note: $f(x) = O(x)$ if there exists a positive real number $M$ such that $f(x) \le Mx$, $\forall x \ge 0$.

The remainder of this paper is organized as follows: Section II reviews related work. Section III introduces the system model and problem statement. Section IV introduces our proposed algorithm and analyzes both its AoI and regret performance. Section V presents simulation results, and Section VI concludes this paper.

II. RELATED WORK AND CONTEXT

In this section, we overview two main areas that are closely related to our work, multi-armed bandits and age of information, and further provide a brief discussion of our design methodology in the context of prior work.

(a) Multi-Armed Bandit:

The MAB problem models an agent that attempts to learn system statistics while optimizing its decisions based on existing learning experience, and has wide applications in recommender systems, healthcare, finance, and computer networks. As such, it has received extensive research attention (e.g., [14], [5], [6], [7]). The seminal work of Lai and Robbins [14] established a fundamental logarithmic lower bound on the cumulative regret (i.e., the difference between the optimal cumulative reward and the cumulative reward under an algorithm) over a finite time horizon under a class of uniformly good policies, and developed a UCB algorithm that asymptotically achieves this fundamental lower bound. Such a logarithmic regret bound has been shown to be achieved by the sample-mean-based UCB algorithm and the ε-greedy policy (see [5]), Kullback-Leibler UCB (KL-UCB [6]), and Thompson sampling [7].

Subsequent works extended the classical MAB problem to various settings that account for different applications. The one closest to ours is combinatorial MAB (e.g., [1], [2], [3], [4]), where a subset of arms can be played simultaneously at each time. More recent works considered combinatorial MAB with fairness constraints (e.g., [12], [15]), where each arm should be played for at least a certain fraction of time on average. The authors introduced the virtual queue length to address the fairness constraint and incorporated it into the algorithm design. However, none of these MAB works addressed the AoI performance, and thus they may yield unbounded AoI over time, as demonstrated in our simulations (cf. Section V).

(b) Age of Information: AoI measures the duration between the time when the information was generated and the time it is received. It directly captures information freshness and thus has received great attention in recent years.
Unlike the traditional queueing delay, which is negligible when the sampling rate (i.e., arrival rate) is low, the AoI is dominated by the inter-arrival time and is thus rather large in the low sampling rate regime. This key difference has spurred AoI research in several directions in recent years, including AoI analysis and optimization (e.g., [16], [17]), AoI in vehicular networks (e.g., [18], [19]), online sampling and remote estimation (e.g., [20], [21]), and AoI and energy harvesting (e.g., [22], [23], [24]), just to name a few.

The line of work closest to our research is AoI-efficient scheduling in wireless networks (e.g., [8], [9], [10]; see [11] for an overview), which aims to develop wireless scheduling algorithms with the goal of minimizing AoI. For example, the authors in [8] developed an age-based scheduler for real-time traffic that achieves not only the desired timely throughput but also guaranteed AoI performance. Our research differs from this line of research in that we explicitly incorporate AoI metrics into the MAB algorithm design, which is desirable in emerging intelligent IoT applications. This key difference poses significant challenges in guaranteeing information freshness in the MAB setting that are unseen in existing AoI research. While a recent work [25] considered the AoI performance in the MAB setting, it focused on the single-user setting and did not consider the case with multiple users and wireless interference constraints.

(c) Our Design Philosophy:

In this paper, we extend a UCB-type algorithm to our setting, which demands the desired AoI performance while minimizing the cumulative regret over time. One extreme is to serve arms with the largest UCB estimates in order to minimize the cumulative regret, but this can result in increasing AoI over time. The other extreme is to serve arms with the largest ages, yet this could lead to a large regret, because it neither learns any system statistics nor exploits the best arms found so far. Therefore, one should trade off the benefits of these two approaches. The natural idea is to integrate both UCB estimates and AoI metrics into the scheduling decisions. However, the AoI metric in our work is fundamentally different from the virtual queue length, since it has an unbounded decrement whenever a packet is successfully delivered. Such abrupt dynamics pose a significant challenge in characterizing AoI performance. The main contribution of this paper is to develop a parameterized learning-based age-efficient algorithm and to show that such an algorithm achieves a tradeoff between the cumulative regret and the average total age, which can be tuned by our algorithmic parameter.

III. SYSTEM MODEL

We consider a wireless network with $N$ links, where each link represents a transmitter-receiver pair that are within the transmission range of each other. We assume that the system operates in slotted time with normalized slots $t \in \{0, 1, 2, \ldots\}$. In each time slot $t$, the transmitter of link $n$ ($n = 1, 2, \ldots, N$) generates a packet with a random value $X_n(t) \in [0, 1]$, which is independently and identically distributed (i.i.d.) with an unknown mean $\mu_n$. Here, $X_n(t)$ represents the reward when a packet is successfully delivered over link $n$ in time slot $t$. Due to the wireless interference constraints, only a subset of links can transmit in each time slot. We set $S_n(t) = 1$ if link $n$ is scheduled for transmission in time slot $t$, and $S_n(t) = 0$ otherwise. We call $S(t) \triangleq (S_n(t))_{n=1}^{N}$ a feasible schedule if it denotes a set of links that can be active simultaneously in time slot $t$. Let $\mathcal{S}$ be the collection of all feasible schedules. We assume that each link $n$ experiences i.i.d. ON-OFF channel fading over time, with $C_n(t) = 1$ denoting that the channel of link $n$ is ON in time slot $t$. Let $p_n \triangleq \Pr\{C_n(t) = 1\}$ be the probability that link $n$ has an available channel in time slot $t$. We assume that each link has a non-zero probability that its channel is ON, i.e., $p_{\min} \triangleq \min_n p_n > 0$. Hence, the received reward $R(t)$ in each time slot $t$ can be expressed as $R(t) \triangleq \sum_{n=1}^{N} X_n(t) C_n(t) S_n(t)$. We consider the case where the channel state is known via channel probing at the beginning of each time slot.

Our goal is to maximize the cumulative reward $\sum_{t=0}^{T-1} R(t)$ until the $T$th time slot while guaranteeing the desired information freshness. If the statistics of rewards (i.e., $\{\mu_n, n = 1, 2, \ldots, N\}$) are known in advance, then the first objective can be achieved by solving the following optimization problem:

$$S^*(t) \triangleq (S_n^*(t))_{n=1}^{N} \in \arg\max_{S \in \mathcal{S}} \sum_{n=1}^{N} \mu_n C_n(t) S_n. \tag{1}$$

That is, it serves a set of non-interfering and available links with the maximum sum of mean rewards in each time slot. Unfortunately, the statistics of rewards are unknown. This requires the algorithm not only to learn these statistics (a.k.a. exploration) but also to select the best schedule so far (a.k.a. exploitation). Our first goal is equivalent to minimizing the cumulative regret over $T$ consecutive time slots, which is the gap between the optimal expected reward and the expected reward accumulated by the algorithm, i.e.,

$$\text{Reg}(T) \triangleq \sum_{t=0}^{T-1} \sum_{n=1}^{N} \left( \mathbb{E}[\mu_n C_n(t) S_n^*(t)] - \mathbb{E}[\mu_n C_n(t) S_n(t)] \right).$$

To address our second goal of the desired information freshness, we introduce $Z_n(t)$ to denote the age of information received from the $n$th link in time slot $t$, which increases by one if a packet is not received by the receiver of link $n$ in time slot $t$ and resets to one otherwise, i.e.,

$$Z_n(t+1) = \begin{cases} Z_n(t) + 1 & \text{if } S_n(t) C_n(t) = 0; \\ 1 & \text{if } S_n(t) C_n(t) = 1. \end{cases} \tag{2}$$

Fig. 2 shows one sample path of the age of link $n$, plotting $Z_n(t)$ against time slots and marking the time slots in which a packet is successfully delivered. We can observe from Fig. 2 that $Z_n(t)$ resets to one whenever there is a successful packet delivery. We note that the dynamics of the age are similar to those of the Time-Since-Last-Service (TSLS) counter in [26], [27].
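The age recursion in (2) is simple enough to state in a few lines of code. The following is a minimal sketch rather than anything taken from the paper: the function name `update_ages`, the list-based representation, and the sample path (one link served in slots 2 and 5) are assumptions made purely for illustration.

```python
def update_ages(ages, scheduled, channel_on):
    """One step of the age recursion (2) for all links.

    ages[n]       -- current age Z_n(t)
    scheduled[n]  -- 1 if link n is scheduled in slot t (S_n(t)), else 0
    channel_on[n] -- 1 if the channel of link n is ON (C_n(t)), else 0
    """
    new_ages = []
    for z, s, c in zip(ages, scheduled, channel_on):
        delivered = (s * c == 1)  # success needs S_n(t) = 1 and C_n(t) = 1
        new_ages.append(1 if delivered else z + 1)
    return new_ages


# Hypothetical sample path for a single non-fading link served in slots 2 and 5.
ages = [0]
path = []
for t in range(7):
    path.append(ages[0])
    ages = update_ages(ages, [1 if t in (2, 5) else 0], [1])
```

In this sketch a scheduled link whose channel is OFF is treated exactly like an unscheduled link, which is what the product $S_n(t) C_n(t)$ in (2) expresses.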

Fig. 2: The evolution of the age of link $n$.

Note: Our algorithm design and its analysis can be easily adapted to the case with unknown channel state.

... $\sum_{n=1}^{N} \mathbb{E}[Z_n(t)]$. We achieve this dual objective by developing a parametric class of wireless schedulers that efficiently utilize a combination of UCB estimates (for minimizing the cumulative regret) and ages in their decisions.

IV. ALGORITHM DESIGN AND PERFORMANCE ANALYSIS

In this section, we develop a learning-based wireless scheduler by integrating the key idea of the well-known UCB algorithms (see [5]) with age metrics. In particular, the UCB estimate is used to deal with the fundamental exploitation-exploration tradeoff in online learning and aims to achieve the minimum cumulative regret. On the other hand, age metrics are employed to guarantee the desired information freshness.

In order to obtain the weight for exploitation and exploration, we introduce the following notation. Let $H_n(t)$ be the number of times link $n$ has successfully delivered a packet until time slot $t$, i.e., $H_n(t) \triangleq \sum_{\tau=0}^{t-1} C_n(\tau) S_n(\tau)$. We set $H_n(0) = 0$ since the system starts at $t = 0$. We use $\bar{\mu}_n(t)$ to denote the sample mean of the received rewards of link $n$ until time slot $t$, i.e., $\bar{\mu}_n(t) \triangleq \left(\sum_{\tau=0}^{t-1} X_n(\tau) C_n(\tau) S_n(\tau)\right) / H_n(t)$. If $H_n(t) = 0$ (i.e., link $n$ has not successfully delivered a packet by time slot $t$), we set $\bar{\mu}_n(t) = 1$. Let $w_n(t)$ denote the UCB estimate of link $n$ in time slot $t$, defined as follows:

$$w_n(t) \triangleq \min\left\{ \bar{\mu}_n(t) + \sqrt{\frac{3\log t}{2 H_n(t)}},\ 1 \right\}, \tag{3}$$

where $\sqrt{3\log t / (2 H_n(t))}$ is the exploration term that measures the uncertainty of the received reward of link $n$ until time slot $t$. Indeed, the smaller the $H_n(t)$, the less often link $n$ has been served and thus the less accurate its sample mean estimate, in which case link $n$ should get a higher priority to be scheduled. Here, we use the truncated version of the UCB estimate, since the actual reward of each link is at most 1. Again, when $H_n(t) = 0$, we set $w_n(t) = 1$.
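The truncated UCB estimate in (3) can be computed from two running counters per link. The sketch below is illustrative only: the function name `ucb_estimate` and its arguments are hypothetical, and the exploration constant is exposed as the parameter `c` (defaulting to 3) because its exact value is an assumption of this sketch.

```python
import math


def ucb_estimate(reward_sum, deliveries, t, c=3.0):
    """Truncated UCB estimate w_n(t) of one link.

    reward_sum -- sum of rewards received from the link before slot t
    deliveries -- number of successful deliveries H_n(t) before slot t
    t          -- current time slot
    c          -- exploration constant (an assumption of this sketch)
    """
    if deliveries == 0:
        return 1.0  # a link that was never served gets the largest estimate
    sample_mean = reward_sum / deliveries
    exploration = math.sqrt(c * math.log(t) / (2.0 * deliveries))
    return min(sample_mean + exploration, 1.0)  # rewards never exceed 1
```

Since `deliveries > 0` implies `t >= 1`, the logarithm is always well defined. For a fixed sample mean, the estimate shrinks toward the sample mean as the link is served more often, which is the behavior the exploration term is meant to capture.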
Thus, a link that has not yet been served by time slot $t$ has the largest UCB estimate and therefore the highest priority to get served.

In order to achieve a low cumulative regret, we prefer to serve links with large UCB estimates in each time slot, i.e., links with high sample mean rewards and links whose rewards are still highly uncertain because they have been explored less often. In order to provide information freshness guarantees, we also need to incorporate age metrics into the scheduling design. In particular, links with large ages should get high priority to be scheduled. This naturally motivates the following algorithm.

Algorithm 1: Learning-based Age-Efficient Scheduling (LAES) Algorithm
In each time slot $t$, given the channel state $(C_n(t))_{n=1}^{N}$, select a schedule $\widehat{S}(t) \triangleq (\widehat{S}_n(t))_{n=1}^{N}$ satisfying

$$\widehat{S}(t) \in \arg\max_{S \in \mathcal{S}} \sum_{n=1}^{N} \left( Z_n(t) + \eta w_n(t) \right) C_n(t) S_n, \tag{4}$$

where $\eta \ge 0$ is some control parameter.

Note that η is a parameter that balances the age metrics and the UCB estimates. In particular, if η = 0, then LAES coincides with the age-based policy (see [11, Ch. 4.5.4]). In fully-connected networks with non-fading channels (i.e., at most one link can be scheduled in each time slot and $C_n(t) = 1$, $\forall n$, $t \ge 0$), the age-based policy is equivalent to the well-known Round-Robin policy that serves links in turn. In such a case, the age-based policy in fact minimizes the average total age (see [26, Proposition 2]). The larger the η, the higher the priority given to the UCB estimates, which yields a smaller cumulative regret.

Next, we characterize the age performance of the proposed LAES Algorithm.

Proposition 1 [Information Freshness Guarantee]: If $Z_n(0) = 0$, $\forall n = 1, 2, \ldots, N$, then, under the LAES Algorithm with any $\eta \ge 0$, the running average total age can be bounded from above as follows:

$$\frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}[Z_n(t)] \le \frac{(\eta + 1) N^2}{p_{\min}},$$

holding for any $T \ge 1$, where $p_{\min} \triangleq \min_n p_n > 0$.

Proof:

We consider the Lyapunov function

$$V(t) \triangleq \sum_{n=1}^{N} Z_n(t) \tag{5}$$

and study its drift. Using telescoping techniques as in the classical Lyapunov drift analysis (e.g., [13]), we obtain an upper bound on the running average total age. Please see Appendix A for the detailed proof.

Remark 1:

From Proposition 1, we can see that the running average total age is bounded under the LAES Algorithm with any $\eta \ge 0$, which is desirable since the central controller always demands a certain degree of information freshness. In addition, the derived upper bound on the running average total age increases linearly with the parameter η. This matches our intuition about the LAES Algorithm: a large η implies a smaller relative weight on the age metric and thus deteriorates the AoI performance.

Remark 2:

The derived upper bound on the running average total age scales linearly with the parameter η, which can be tight in some cases. Indeed, consider two interfering non-fading links, where $C_n(t) = 1$, $\forall n = 1, 2$, $\forall t$, and at most one link can be scheduled in each time slot. Suppose $\mu_1 > \mu_2$, and assume that both links have been scheduled sufficiently many times. In such a case, $w_1(t)$ and $w_2(t)$ are close to $\mu_1$ and $\mu_2$, respectively. As such, under the LAES Algorithm, link 2 is scheduled roughly once every $\lceil \eta(\mu_1 - \mu_2) \rceil$ time slots and link 1 is scheduled in all other time slots. Hence, the average age of link 2 in each time slot is roughly equal to $(1 + 2 + 3 + \ldots + \lceil \eta(\mu_1 - \mu_2) \rceil) / \lceil \eta(\mu_1 - \mu_2) \rceil = (1 + \lceil \eta(\mu_1 - \mu_2) \rceil)/2$. On the other hand, the average age of link 1 in each time slot is equal to $(\lceil \eta(\mu_1 - \mu_2) \rceil - 1 + 2) / \lceil \eta(\mu_1 - \mu_2) \rceil = 1 + 1/\lceil \eta(\mu_1 - \mu_2) \rceil$. Hence, the average total age in each time slot is $O(\eta)$.

Remark 3: In the case that all links have a non-zero probability of the channel being OFF (i.e., $p_n < 1$, $\forall n = 1, 2, \ldots, N$), the upper bound on the mean total age is independent of the parameter η. Indeed, if the event $\mathcal{F}_n(\tau) \triangleq \{C_n(\tau) = 1, C_{n'}(\tau) = 0, \forall n' \ne n\}$ happens for some $\tau \in [t - m + 1, t)$, then under the LAES Algorithm, link $n$ must have been scheduled at least once during the past $m$ time slots, and thus $Z_n(t) < m$. This implies that

$$\Pr\{Z_n(t) \ge m\} \le \Pr\{\mathcal{F}_n(\tau) \text{ does not happen for all } \tau \in [t - m + 1, t)\} \overset{(a)}{=} \nu_n^m \overset{(b)}{\le} \nu^m, \tag{6}$$

where step (a) is true for $\nu_n \triangleq 1 - p_n \prod_{n' \ne n} (1 - p_{n'}) \in (0, 1)$ under the assumption that $p_n < 1$, $\forall n$, and follows from the fact that channel states are independently distributed across links and i.i.d. over time for each link, and (b) holds for $\nu \triangleq \max_n \nu_n$. Hence, we have

$$\mathbb{E}[Z_n(t)] = \sum_{m=1}^{\infty} \Pr\{Z_n(t) \ge m\} \le \sum_{m=1}^{\infty} \nu^m = \frac{\nu}{1 - \nu}. \tag{7}$$

As such, the average total age in each time slot is upper bounded by $N\nu/(1 - \nu)$, which is independent of the parameter η.
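To get a feel for the magnitude of the η-independent bound $N\nu/(1-\nu)$, it can be evaluated directly from the channel ON probabilities. The following is a small sketch under stated assumptions: the function name `eta_free_age_bound` and the example probabilities are hypothetical and are not the values used in the paper.

```python
import math


def eta_free_age_bound(p):
    """Return (nu, N * nu / (1 - nu)) for channel ON probabilities p.

    Each entry of p must lie strictly between 0 and 1.
    """
    nus = []
    for n, p_n in enumerate(p):
        # Probability that every other link's channel is OFF.
        others_off = math.prod(1.0 - q for k, q in enumerate(p) if k != n)
        # nu_n = 1 - Pr{only link n has an ON channel}
        nus.append(1.0 - p_n * others_off)
    nu = max(nus)
    return nu, len(p) * nu / (1.0 - nu)


# Hypothetical example: ten links, each ON with probability 0.5.
nu_10, bound_10 = eta_free_age_bound([0.5] * 10)
```

With ten such links, the probability that exactly one given link is ON is $0.5^{10}$, so $\nu$ is within $10^{-3}$ of one and the bound is in the thousands. The bound grows rapidly with the number of links because the event $\mathcal{F}_n(\tau)$ becomes exponentially unlikely.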
This upper bound is, however, extremely large, as demonstrated in Section V, and thus it says little about the dependence of the average total age on the parameter η when the average age is moderate or small.

Lastly, we provide an upper bound on the cumulative regret under the LAES Algorithm with $\eta > 0$.

Proposition 2 [Upper Bound on Regret]: If $Z_n(0) = 0$, $\forall n = 1, 2, \ldots, N$, then, under the LAES Algorithm with $\eta > 0$, the cumulative regret $\text{Reg}(T)$ until time slot $T$ can be bounded from above as follows:

$$\text{Reg}(T) \le \frac{NT}{\eta} + 2\sqrt{N |S|_{\max} T \log T} + N\left(1 + 5\pi^2/\ldots\right),$$

where $|S|_{\max}$ denotes the maximum number of links that can be scheduled simultaneously in each time slot.

Proof:

We first perform a drift-plus-penalty analysis on

$$\mathbb{E}[V(t+1) - V(t)] + \eta \Delta R(t), \tag{8}$$

where $\Delta R(t) \triangleq \sum_{n=1}^{N} \mathbb{E}\left[ \mu_n C_n(t) S_n^*(t) - \mu_n C_n(t) \widehat{S}_n(t) \right]$ and the cumulative regret is $\text{Reg}(T) \triangleq \sum_{t=0}^{T-1} \Delta R(t)$. Then, we carefully incorporate the regret analysis for the classical UCB algorithm (e.g., [5]) into our analysis. The analysis is similar to the line of regret analysis in [28] and [12], and is available in Appendix B.

Remark 4:

The derived upper bound on the cumulative regret consists of two terms: (i) $2\sqrt{N |S|_{\max} T \log T} + N(1 + 5\pi^2/\ldots)$ has the same order $O(\sqrt{NT\log T})$ as the instance-independent upper bound for the classical UCB algorithm (see [29, Ch. 2.4.3]), and thus this term is attributed to the cost involved in the exploration/exploitation process in online learning; (ii) $NT/\eta$ decreases as the parameter η increases. This also matches our intuition about the LAES Algorithm: the larger the η, the larger the weight put on the UCB estimate, which yields a smaller cumulative regret.

This, together with Proposition 1, reveals a tradeoff between the running average total age and the regret performance in the general network setup under the LAES Algorithm: when η increases, the regret upper bound decreases, but the upper bound on the running average total age increases. That is, the improvement in the cumulative regret comes at the cost of an increased running average total age. Moreover, it can be easily derived that the product of the upper bound on the running average total age and the upper bound on the cumulative regret is on the order of $O(N^3 T)$ for any $\eta \le O(\sqrt{NT/\log T})$. In Table I, we provide three different η values to illustrate the tradeoff between the running average total age and the cumulative regret.

TABLE I: Cumulative Regret vs. Running Average Age.

Parameter η | Regret | Age
$O(\sqrt{NT/\log T})$ | $O(\sqrt{NT\log T})$ | $O(\sqrt{N^5 T/\log T})$
$O(\sqrt{NT})$ | $O(\sqrt{NT})$ | $O(\sqrt{N^5 T})$
$O(1)$ | $O(NT)$ | $O(N^2)$

From Table I, we can see that in order for the cumulative regret to be on the same order as that of the UCB algorithm (i.e., $\text{Reg}(T) = O(\sqrt{NT\log T})$), the running average total age should be on the order of $O(\sqrt{N^5 T/\log T})$ under the LAES Algorithm. Nevertheless, we can use a relatively large η (e.g., η = 200) and achieve both low regret and low average age, as demonstrated in Section V.

V. SIMULATIONS
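As a companion to the experiments reported in this section, the following minimal sketch shows how the LAES rule (4) might be simulated in a fully-connected non-fading network, where at most one link is served per slot and every channel is always ON. It is not the authors' simulation code. The function name `simulate_laes`, the Bernoulli reward model, the mean rewards used in the usage lines, the exploration constant `c`, and the lowest-index tie-breaking are all assumptions of this sketch.

```python
import math
import random


def simulate_laes(mu, eta, horizon, seed=0, c=3.0):
    """Simulate LAES with one link served per slot and channels always ON.

    Returns (cumulative regret, running average total age).
    """
    rng = random.Random(seed)
    n_links = len(mu)
    ages = [0] * n_links            # Z_n(0) = 0
    deliveries = [0] * n_links      # H_n(t)
    reward_sum = [0.0] * n_links
    best_mean = max(mu)
    regret = 0.0
    age_total = 0
    for t in range(horizon):
        age_total += sum(ages)
        weights = []
        for n in range(n_links):
            if deliveries[n] == 0:
                w = 1.0             # unserved links get the largest estimate
            else:
                mean = reward_sum[n] / deliveries[n]
                bonus = math.sqrt(c * math.log(t) / (2.0 * deliveries[n]))
                w = min(mean + bonus, 1.0)
            weights.append(ages[n] + eta * w)   # weight Z_n(t) + eta * w_n(t)
        chosen = max(range(n_links), key=weights.__getitem__)
        # Assumed reward model: Bernoulli with mean mu[chosen].
        reward = 1.0 if rng.random() < mu[chosen] else 0.0
        deliveries[chosen] += 1
        reward_sum[chosen] += reward
        regret += best_mean - mu[chosen]
        # Age recursion: served link resets to 1, all others grow by 1.
        ages = [1 if n == chosen else z + 1 for n, z in enumerate(ages)]
    return regret, age_total / horizon


# Hypothetical three-link example comparing the age-based policy with eta = 200.
mu_demo = [0.9, 0.5, 0.1]
regret_age_based, age_age_based = simulate_laes(mu_demo, 0.0, 5000, seed=1)
regret_laes, age_laes = simulate_laes(mu_demo, 200.0, 5000, seed=1)
```

In this setting the regret per slot is simply the gap between the largest mean and the mean of the served link. Running the two configurations side by side is one way to observe the tradeoff discussed in Remark 4: the larger η lowers the regret and raises the running average total age, while both ages should remain below the bound of Proposition 1 with $p_{\min} = 1$.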

In this section, we perform simulations to evaluate the performance of our proposed LAES Algorithm. We consider the following two network setups: (i) a fully-connected non-fading network with $N = 5$ links (at most one link can be scheduled in each time slot and $C_n(t) = 1$, $\forall n$, $t \ge 0$), and (ii) a 10-link ON-OFF fading network where at most two links can be scheduled in each time slot. For the first setup, the mean reward vector is µ = (0. , . , . , . , . ). For the second network setup, we set the mean reward vector µ = (0. , . , . , . , . , . , . , . , . , . ) and heterogeneous ON-OFF channel fading parameters p = (0. , . , . , . , . , . , . , . , . , . ). We compare the LAES Algorithm with η ∈ {0, 10, 50, 100, 200} with the UCB algorithm that makes decisions based only on the UCB estimates. Note that LAES with η = 0 coincides with the age-based scheduler. We run experiments, each of which has simulated × time slots.

Fig. 3 shows the performance of the UCB algorithm and the LAES Algorithm with different η values in the fully-connected non-fading network. We can observe from Fig. 3a that the UCB algorithm outperforms the LAES Algorithm with all η values in terms of cumulative regret. The larger the η, the smaller the cumulative regret. This is because a larger η puts more weight on the UCB estimates, and thus the LAES Algorithm with an extremely large η value should have similar regret to the UCB algorithm. Indeed, we can see from Fig. 3b that the successful packet delivery ratio of the best link (i.e., link ) increases as η increases.

Fig. 3: Performance of the LAES Algorithm in a fully-connected non-fading network. Panels: (a) Cumulative Regret versus time slots; (b) Successful Delivery Ratio (packet delivery ratio of Links 1 to 5); (c) Total Average Age (running average age versus time slots). Curves: η = 0, 10, 50, 100, 200, and UCB.

Fig. 4: Performance of the LAES Algorithm in a 10-link ON-OFF fading network. Panels: (a) Cumulative Regret versus time slots; (b) Successful Delivery Ratio (packet delivery ratio per link); (c) Total Average Age (running average age versus time slots). Curves: η = 0, 10, 50, 100, 200, and UCB.

However, the improvement in cumulative regret comes at the cost of an increased running average total age, as shown in Fig. 3c, which plots the running average of the total age over time. Indeed, we can observe from Fig. 3c that under the LAES Algorithm, the age becomes larger as η increases. Nevertheless, it is worth pointing out that the age keeps increasing over time under the UCB algorithm, while it is always bounded under the LAES Algorithm with all fixed η values. We can observe similar phenomena in a relatively complicated network with 10 links, as shown in Fig. 4, even though the derived upper bound on the average total age (cf. (7)) in each time slot is independent of the parameter η. This is because such an upper bound is equal to . × (much larger than ).

VI. CONCLUSION

In this paper, we considered the problem of scheduling packets from multiple sensing sources to a central controller over a wireless network with the goal of minimizing the cumulative regret over time while guaranteeing the desired AoI performance. We developed a parameterized maximum-weight type scheduling policy that combines both the AoI metrics and UCB estimates in its weight measure with parameter η. We derived an upper bound on the running average total age, which increases linearly with the parameter η. We also derived an upper bound on the cumulative regret under our proposed algorithm. These derived upper bounds reveal a tradeoff: the improvement in the cumulative regret comes at the cost of an increased running average total age. Simulation results were provided to confirm such a tradeoff and to demonstrate the superior performance of our proposed algorithm over the UCB algorithm and the age-based algorithm.

APPENDIX A
PROOF OF PROPOSITION 1

Consider the Lyapunov function

$$V(t) \triangleq \sum_{n=1}^{N} Z_n(t). \tag{9}$$

Then, under the LAES Algorithm, we have

$$\begin{aligned} V(t+1) &= \sum_{n=1}^{N} Z_n(t+1) \\ &\overset{(a)}{=} \sum_{n=1}^{N} \left( (Z_n(t) + 1)(1 - C_n(t)\widehat{S}_n(t)) + C_n(t)\widehat{S}_n(t) \right) \\ &= \sum_{n=1}^{N} Z_n(t) - \sum_{n=1}^{N} Z_n(t) C_n(t) \widehat{S}_n(t) + N \\ &\overset{(b)}{=} V(t) - \sum_{n=1}^{N} Z_n(t) C_n(t) \widehat{S}_n(t) + N, \end{aligned} \tag{10}$$

where step (a) uses the dynamics of the age (cf. (2)) and (b) follows from the definition of the Lyapunov function $V(t)$. Let $Z(t) \triangleq (Z_n(t))_{n=1}^{N}$. Then, we have

$$\mathbb{E}[V(t+1) - V(t) \mid Z(t)] = -\mathbb{E}\left[ \sum_{n=1}^{N} Z_n(t) C_n(t) \widehat{S}_n(t) \,\Big|\, Z(t) \right] + N. \tag{11}$$

Given the age vector $Z(t)$ and channel state $C(t) \triangleq (C_n(t))_{n=1}^{N}$, according to the definition of the LAES Algorithm, we have

$$\sum_{n=1}^{N} (Z_n(t) + \eta w_n(t)) C_n(t) \widehat{S}_n(t) \ge \left( Z_{n^*(t)}(t) + \eta w_{n^*(t)}(t) \right) C_{n^*(t)}(t) \ge Z_{n^*(t)}(t) C_{n^*(t)}(t), \tag{12}$$

where the first step is true for $n^*(t) \in \arg\max_n Z_n(t)$.
This implies that
$$\begin{aligned}
\mathbb{E}\left[ \sum_{n=1}^{N} Z_n(t) C_n(t) \widehat{S}_n(t) \,\middle|\, Z(t) \right]
&\overset{(a)}{\ge} \mathbb{E}\left[ Z_{n^*(t)}(t)\, C_{n^*(t)}(t) - \eta \sum_{n=1}^{N} w_n(t) C_n(t) \widehat{S}_n(t) \,\middle|\, Z(t) \right] \\
&\overset{(b)}{\ge} p_{\min} Z_{n^*(t)}(t) - \eta N,
\end{aligned} \tag{13}$$
where step $(a)$ uses (12), and $(b)$ uses the fact that $w_n(t) \le 1$, $C_n(t) \le 1$ and $\widehat{S}_n(t) \le 1$, $\forall n, t \ge 0$, and $p_{\min} \triangleq \min_n p_n > 0$.

By substituting (13) into (11), we have
$$\begin{aligned}
\mathbb{E}\left[ V(t+1) - V(t) \mid Z(t) \right]
&\overset{(a)}{\le} -p_{\min} Z_{\max}(t) + (\eta+1) N \\
&\overset{(b)}{\le} -\frac{p_{\min}}{N} \sum_{n=1}^{N} Z_n(t) + (\eta+1) N,
\end{aligned} \tag{14}$$
where step $(a)$ is true for $Z_{\max}(t) \triangleq \max_n Z_n(t) = Z_{n^*(t)}(t)$ and $(b)$ follows from the fact that $Z_{\max}(t) \ge \frac{1}{N} \sum_{n=1}^{N} Z_n(t)$.

Taking the expectation on both sides of (14), we have
$$\mathbb{E}\left[ V(t+1) - V(t) \right] \le -\frac{p_{\min}}{N} \sum_{n=1}^{N} \mathbb{E}[Z_n(t)] + (\eta+1) N.$$
Summing the above inequality over time $t = 0, 1, \ldots, T-1$, we have
$$\mathbb{E}\left[ V(T) - V(0) \right] \le -\frac{p_{\min}}{N} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}[Z_n(t)] + (\eta+1) N T,$$
which implies
$$\frac{1}{T} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}[Z_n(t)] \le \frac{(\eta+1) N^2}{p_{\min}}. \tag{15}$$
Here, we use the fact that $V(0) = 0$ and $V(T) \ge 0$.

APPENDIX B
PROOF OF PROPOSITION

By definition, the cumulative regret is
$$\mathrm{Reg}(T) \triangleq \sum_{t=0}^{T-1} \sum_{n=1}^{N} \Big( \mathbb{E}\left[ \mu_n C_n(t) S_n^*(t) \right] - \mathbb{E}\big[ \mu_n C_n(t) \widehat{S}_n(t) \big] \Big) = \sum_{t=0}^{T-1} \Delta R(t), \tag{16}$$
where $\Delta R(t) \triangleq \sum_{n=1}^{N} \mathbb{E}\big[ \mu_n C_n(t) S_n^*(t) - \mu_n C_n(t) \widehat{S}_n(t) \big]$. We add the term $\eta \Delta R(t)$ to both sides of the drift of the Lyapunov function $V(t)$ (cf.
(11)) and obtain
$$\begin{aligned}
&\mathbb{E}\left[ V(t+1) - V(t) \right] + \eta \Delta R(t) \\
&= -\sum_{n=1}^{N} \mathbb{E}\big[ Z_n(t) C_n(t) \widehat{S}_n(t) \big] + N + \eta \sum_{n=1}^{N} \mathbb{E}\left[ \mu_n C_n(t) S_n^* \right] - \eta \sum_{n=1}^{N} \mathbb{E}\big[ \mu_n C_n(t) \widehat{S}_n(t) \big] \\
&= N + \sum_{n=1}^{N} \mathbb{E}\Big[ (Z_n(t) + \eta \mu_n) C_n(t) \big( S_n^* - \widehat{S}_n(t) \big) \Big] - \sum_{n=1}^{N} \mathbb{E}\left[ Z_n(t) C_n(t) S_n^* \right] \\
&\le N + \sum_{n=1}^{N} \mathbb{E}\Big[ (Z_n(t) + \eta \mu_n) C_n(t) \big( S_n^* - \widehat{S}_n(t) \big) \Big],
\end{aligned} \tag{17}$$
where the last step is true since $\sum_{n=1}^{N} \mathbb{E}\left[ Z_n(t) C_n(t) S_n^* \right] \ge 0$.

Summing (17) over $t = 0, 1, 2, \ldots, T-1$, we have
$$\sum_{t=0}^{T-1} \mathbb{E}\left[ V(t+1) - V(t) \right] + \eta \sum_{t=0}^{T-1} \Delta R(t) \le N T + \sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}\Big[ (Z_n(t) + \eta \mu_n) C_n(t) \big( S_n^* - \widehat{S}_n(t) \big) \Big],$$
which implies
$$\mathrm{Reg}(T) \triangleq \sum_{t=0}^{T-1} \Delta R(t) \le \frac{N T}{\eta} + \frac{1}{\eta} \sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}\Big[ (Z_n(t) + \eta \mu_n) C_n(t) \big( S_n^* - \widehat{S}_n(t) \big) \Big]. \tag{18}$$
Here, we use the fact that $V(0) = 0$, $V(T) \ge 0$, and the definition of $\mathrm{Reg}(T)$.

Next, we focus on the term $\sum_{n=1}^{N} (Z_n(t) + \eta \mu_n) C_n(t) \big( S_n^* - \widehat{S}_n(t) \big)$.
Then, we have
$$\begin{aligned}
&\sum_{n=1}^{N} (Z_n(t) + \eta \mu_n) C_n(t) \big( S_n^* - \widehat{S}_n(t) \big) \\
&\overset{(a)}{\le} \sum_{n=1}^{N} (Z_n(t) + \eta \mu_n) C_n(t) \widetilde{S}_n(t) - \sum_{n=1}^{N} (Z_n(t) + \eta \mu_n) C_n(t) \widehat{S}_n(t) \\
&\overset{(b)}{\le} \sum_{n=1}^{N} (Z_n(t) + \eta \mu_n) C_n(t) \widetilde{S}_n(t) - \sum_{n=1}^{N} (Z_n(t) + \eta \mu_n) C_n(t) \widehat{S}_n(t) \\
&\quad + \sum_{n=1}^{N} (Z_n(t) + \eta w_n(t)) C_n(t) \widehat{S}_n(t) - \sum_{n=1}^{N} (Z_n(t) + \eta w_n(t)) C_n(t) \widetilde{S}_n(t) \\
&= \eta \sum_{n=1}^{N} (w_n(t) - \mu_n) C_n(t) \widehat{S}_n(t) + \eta \sum_{n=1}^{N} (\mu_n - w_n(t)) C_n(t) \widetilde{S}_n(t),
\end{aligned} \tag{19}$$
where step $(a)$ is true for
$$\widetilde{S}(t) \triangleq (\widetilde{S}_n(t))_{n=1}^{N} \in \arg\max_{S \in \mathcal{S}} \sum_{n=1}^{N} (Z_n(t) + \eta \mu_n) C_n(t) S_n,$$
and $(b)$ uses the definition of $\widehat{S}(t)$: since $\widehat{S}(t)$ maximizes $\sum_{n=1}^{N} (Z_n(t) + \eta w_n(t)) C_n(t) S_n$ over $S \in \mathcal{S}$, the difference added in step $(b)$ is nonnegative.

By substituting (19) into (18), we have
$$\mathrm{Reg}(T) \le \frac{N T}{\eta} + \underbrace{\sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}\big[ (w_n(t) - \mu_n) C_n(t) \widehat{S}_n(t) \big]}_{\triangleq\, G_1(T)} + \underbrace{\sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}\big[ (\mu_n - w_n(t)) C_n(t) \widetilde{S}_n(t) \big]}_{\triangleq\, G_2(T)}. \tag{20}$$
Next, we focus on $G_1(T)$ and $G_2(T)$, respectively. Let $t_{n,\tau}$ denote the time slot at which link $n$ successfully delivers a packet for the $\tau$-th time, i.e., $C_n(t_{n,\tau}) \widehat{S}_n(t_{n,\tau}) = 1$, and $C_n(t) \widehat{S}_n(t) = 0$ if $t \ne t_{n,\tau}$ for all $\tau = 1, 2, \ldots, H_n(T)$. Therefore, we have $H_n(t_{n,\tau}) = \tau - 1$.

Let $G_{n,1}(T) \triangleq \sum_{t=0}^{T-1} \mathbb{E}\big[ (w_n(t) - \mu_n) C_n(t) \widehat{S}_n(t) \big]$ and thus $G_1(T) = \sum_{n=1}^{N} G_{n,1}(T)$.
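The bookkeeping behind $t_{n,\tau}$ and $H_n(t)$ can be illustrated with a short Python sketch for a single link. It assumes that $H_n(t)$ counts the successful deliveries in slots $0, \ldots, t-1$, which is the convention suggested by $H_n(t_{n,\tau}) = \tau - 1$; the helper names `delivery_times` and `deliveries_before` and the sample sequence are hypothetical.

```python
def delivery_times(deliveries):
    """Return [t_{n,1}, t_{n,2}, ...]: the slots t in which C_n(t) * S_n(t) == 1."""
    return [t for t, d in enumerate(deliveries) if d == 1]

def deliveries_before(deliveries, t):
    """H_n(t): number of successful deliveries in slots 0, ..., t-1 (assumed convention)."""
    return sum(deliveries[:t])

# Hypothetical delivery indicators C_n(t) * S_n(t) for one link over T = 8 slots.
deliveries = [0, 1, 0, 0, 1, 1, 0, 1]
times = delivery_times(deliveries)

# At the tau-th delivery slot, exactly tau - 1 deliveries have already happened.
for tau, t in enumerate(times, start=1):
    assert deliveries_before(deliveries, t) == tau - 1

print(times)
```

Under this convention, a sum over all slots of a quantity multiplied by $C_n(t) \widehat{S}_n(t)$ can be re-indexed as a sum over $\tau = 1, \ldots, H_n(T)$, which is the re-indexing used in the bound on $G_{n,1}(T)$.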
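Hence, we have
$$\begin{aligned}
G_{n,1}(T) &\overset{(a)}{\le} \sum_{t=0}^{T-1} \mathbb{E}\Big[ (w_n(t) - \mu_n) C_n(t) \widehat{S}_n(t) \mathbf{1}_{\mathcal{F}_n(t)} \Big] \\
&\overset{(b)}{\le} \mathbb{E}\left[ \sum_{\tau=1}^{H_n(T)} (w_n(t_{n,\tau}) - \mu_n) \mathbf{1}_{\mathcal{F}_n(t_{n,\tau})} \right] \\
&\overset{(c)}{\le} 1 + \mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} (w_n(t_{n,\tau}) - \mu_n) \mathbf{1}_{\mathcal{F}_n(t_{n,\tau})} \right] \\
&\overset{(d)}{\le} 1 + \mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} (w_n(t_{n,\tau}) - \mu_n) \mathbf{1}_{\mathcal{F}_n(t_{n,\tau}) \cap \mathcal{G}_n(t_{n,\tau})} \right] + \mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} \mathbf{1}_{\overline{\mathcal{G}}_n(t_{n,\tau})} \right],
\end{aligned} \tag{21}$$
where step $(a)$ is true for $\mathcal{F}_n(t) \triangleq \{ w_n(t) \ge \mu_n \}$ and $\mathbf{1}_{\{\cdot\}}$ being an indicator function; $(b)$ uses the definition of $t_{n,\tau}$, and the fact that $C_n(t) \le 1$ and $\widehat{S}_n(t) \le 1$, $\forall t \ge 0$; $(c)$ follows from the fact that $w_n(t) \le 1$, $\forall t \ge 0$; $(d)$ is true for
$$\mathcal{G}_n(t) \triangleq \left\{ \bar{\mu}_n(t) - \mu_n \le \sqrt{\frac{3 \log t}{2 H_n(t)}} \right\},$$
and $\overline{\mathcal{G}}_n(t)$ being the complement of the event $\mathcal{G}_n(t)$.

Next, we consider the second term on the right hand side (RHS) of (21).
$$\begin{aligned}
\mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} (w_n(t_{n,\tau}) - \mu_n) \mathbf{1}_{\mathcal{F}_n(t_{n,\tau}) \cap \mathcal{G}_n(t_{n,\tau})} \right]
&\overset{(a)}{\le} \mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} \sqrt{\frac{6 \log t_{n,\tau}}{H_n(t_{n,\tau})}} \right] \\
&\overset{(b)}{\le} \sqrt{6 \log T}\; \mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} \frac{1}{\sqrt{\tau - 1}} \right] \\
&\le \sqrt{6 \log T}\; \mathbb{E}\left[ \int_{0}^{H_n(T)} \frac{1}{\sqrt{x}}\, dx \right] \\
&\le 2 \sqrt{6 \log T}\; \mathbb{E}\left[ \sqrt{H_n(T)} \right],
\end{aligned} \tag{22}$$
where step $(a)$ uses the definition of $w_n(t)$ and $\mathcal{G}_n(t)$, and $(b)$ follows from the fact that $t_{n,\tau} \le T$ and the definition of $t_{n,\tau}$.

With regard to the third term on the RHS of (21), we have
$$\begin{aligned}
\mathbb{E}\left[ \mathbf{1}_{\overline{\mathcal{G}}_n(t_{n,\tau})} \right] = \Pr\left\{ \overline{\mathcal{G}}_n(t_{n,\tau}) \right\}
&\overset{(a)}{\le} \Pr\left\{ \bigcup_{m=\tau-1}^{T-1} \left\{ \bar{\mu}_n(m) - \mu_n > \sqrt{\frac{3 \log m}{2 (\tau - 1)}} \right\} \right\} \\
&\le \Pr\left\{ \bigcup_{m=\tau-1}^{T-1} \left\{ \bar{\mu}_n(m) - \mu_n > \sqrt{\frac{3 \log m}{2 m}} \right\} \right\} \\
&\overset{(b)}{\le} \sum_{m=\tau-1}^{T-1} \Pr\left\{ \bar{\mu}_n(m) - \mu_n > \sqrt{\frac{3 \log m}{2 m}} \right\} \\
&\overset{(c)}{\le} \sum_{m=\tau-1}^{T-1} \frac{1}{m^3} \le \frac{1}{(\tau-1)^3} + \int_{\tau-1}^{\infty} \frac{1}{x^3}\, dx \\
&\overset{(d)}{\le} \frac{3}{2 (\tau-1)^2},
\end{aligned}$$
where step $(a)$ follows from the fact that
$$\overline{\mathcal{G}}_n(t_{n,\tau}) \subset \bigcup_{m=\tau-1}^{T-1} \left\{ \bar{\mu}_n(m) - \mu_n > \sqrt{\frac{3 \log m}{2 (\tau - 1)}} \right\};$$
$(b)$ uses the union bound; $(c)$ follows from the Chernoff-Hoeffding Bound (see, e.g., [5, Fact 1]), i.e., if $X_1, X_2, \ldots, X_n$ are i.i.d. random variables with common range $[0, 1]$ and mean $\mu$, then for any $a \ge 0$, we have
$$\Pr\left\{ \frac{1}{n} \sum_{i=1}^{n} X_i \ge \mu + a \right\} \le e^{-2 n a^2}; \tag{23}$$
$(d)$ is true for $\tau \ge 2$.

Hence, the third term on the RHS of (21) can be bounded as follows.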
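$$\mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} \mathbf{1}_{\overline{\mathcal{G}}_n(t_{n,\tau})} \right] \le \mathbb{E}\left[ \sum_{\tau=2}^{H_n(T)} \frac{3}{2 (\tau-1)^2} \right] \le \frac{3}{2} \sum_{\tau=1}^{\infty} \frac{1}{\tau^2} = \frac{\pi^2}{4}, \tag{24}$$
where the last step uses the fact that $\sum_{n=1}^{\infty} 1/n^2 = \pi^2/6$. By substituting (22) and (24) into (21) and using the definition of $G_1(T)$, we have
$$\begin{aligned}
G_1(T) &\le N \left( 1 + \frac{\pi^2}{4} \right) + 2 \sqrt{6 \log T} \sum_{n=1}^{N} \mathbb{E}\left[ \sqrt{H_n(T)} \right] \\
&\overset{(a)}{\le} N \left( 1 + \frac{\pi^2}{4} \right) + 2 N \sqrt{6 \log T}\; \mathbb{E}\left[ \sqrt{\frac{1}{N} \sum_{n=1}^{N} H_n(T)} \right] \\
&\overset{(b)}{\le} N \left( 1 + \frac{\pi^2}{4} \right) + 2 \sqrt{6 N |S|_{\max} T \log T},
\end{aligned} \tag{25}$$
where step $(a)$ uses Jensen's inequality, and $(b)$ is true since $\sum_{n=1}^{N} H_n(T) \le T |S|_{\max}$, where $|S|_{\max}$ is the maximum number of links that can be scheduled in each time slot.

Next, we consider the term $G_2(T)$. First, we note that
$$G_2(T) \le \sum_{t=0}^{T-1} \sum_{n=1}^{N} \mathbb{E}\Big[ (\mu_n - w_n(t)) \widetilde{S}_n(t) \mathbf{1}_{\overline{\mathcal{F}}_n(t)} \Big], \tag{26}$$
where we recall that $\mathcal{F}_n(t) \triangleq \{ w_n(t) \ge \mu_n \}$. Note that for $t \le t_{n,1}$, we have $w_n(t) = 1$ and thus $\mathcal{F}_n(t)$ happens. Therefore, we have
$$\begin{aligned}
G_2(T) &\le \sum_{n=1}^{N} \mathbb{E}\left[ \sum_{t=t_{n,1}+1}^{T-1} (\mu_n - w_n(t)) \widetilde{S}_n(t) \mathbf{1}_{\overline{\mathcal{F}}_n(t)} \right] \\
&\overset{(a)}{\le} \sum_{n=1}^{N} \mathbb{E}\left[ \sum_{t=t_{n,1}+1}^{T-1} \Pr\left\{ \bar{\mu}_n(t) - \mu_n \le -\sqrt{\frac{3 \log t}{2 H_n(t-1)}} \right\} \right] \\
&\le \sum_{n=1}^{N} \sum_{\tau=1}^{T-1} \sum_{m=1}^{\tau} \Pr\left\{ \frac{1}{m} \sum_{i=1}^{m} X(i) - \mu_n \le -\sqrt{\frac{3 \log \tau}{2 m}} \right\} \\
&\overset{(b)}{\le} \sum_{n=1}^{N} \sum_{\tau=1}^{T-1} \sum_{m=1}^{\tau} \frac{1}{\tau^3} = \sum_{n=1}^{N} \sum_{\tau=1}^{T-1} \frac{1}{\tau^2} \\
&\overset{(c)}{\le} \frac{N \pi^2}{6},
\end{aligned} \tag{27}$$
where step $(a)$ follows from the fact that $\mu_n \le 1$ and $\widetilde{S}_n(t) \le 1$ as well as the definition of $\mathcal{F}_n(t)$; $(b)$ again uses the Chernoff-Hoeffding Bound (cf. (23)); $(c)$ is true since $\sum_{\tau=1}^{T-1} 1/\tau^2 \le \sum_{\tau=1}^{\infty} 1/\tau^2 = \pi^2/6$.

Hence, by substituting (25) and (27) into (20), we have the desired result.

REFERENCES

[1] V. Anantharam, P. Varaiya, and J. Walrand, “Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part I: IID rewards,”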

IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968–976, 1987.
[2] Y. Gai, B. Krishnamachari, and R. Jain, “Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations,” IEEE/ACM Transactions on Networking, vol. 20, no. 5, pp. 1466–1478, 2012.
[3] W. Chen, Y. Wang, and Y. Yuan, “Combinatorial multi-armed bandit: General framework and applications,” in International Conference on Machine Learning, 2013, pp. 151–159.
[4] R. Combes, M. S. T. M. Shahi, A. Proutiere et al., “Combinatorial bandits revisited,” in Advances in Neural Information Processing Systems, 2015, pp. 2116–2124.
[5] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.
[6] A. Garivier and O. Cappé, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 359–376.
[7] S. Agrawal and N. Goyal, “Analysis of Thompson sampling for the multi-armed bandit problem,” in Conference on Learning Theory, 2012, pp. 39–1.
[8] N. Lu, B. Ji, and B. Li, “Age-based scheduling: Improving data freshness for wireless real-time traffic,” in Proceedings of the Eighteenth ACM International Symposium on Mobile Ad Hoc Networking and Computing, 2018, pp. 191–200.
[9] I. Kadota, A. Sinha, and E. Modiano, “Optimizing age of information in wireless networks with throughput constraints,” in IEEE INFOCOM 2018-IEEE Conference on Computer Communications. IEEE, 2018, pp. 1844–1852.
[10] I. Kadota and E. Modiano, “Minimizing the age of information in wireless networks with stochastic arrivals,” IEEE Transactions on Mobile Computing, 2019.
[11] Y. Sun, I. Kadota, R. Talak, and E. Modiano, “Age of information: A new metric for information freshness,” Synthesis Lectures on Communication Networks, vol. 12, no. 2, pp. 1–224, 2019.
[12] F. Li, J. Liu, and B. Ji, “Combinatorial sleeping bandits with fairness constraints,” IEEE Transactions on Network Science and Engineering, 2019.
[13] M. J. Neely, “Stochastic network optimization with application to communication and queueing systems,” Synthesis Lectures on Communication Networks, vol. 3, no. 1, pp. 1–211, 2010.
[14] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
[15] V. Patil, G. Ghalme, V. Nair, and Y. Narahari, “Achieving fairness in the stochastic multi-armed bandit problem,” in AAAI, 2020, pp. 5379–5386.
[16] S. Kaul, R. Yates, and M. Gruteser, “Real-time status: How often should one update?” in . IEEE, 2012, pp. 2731–2735.
[17] E. Altman, R. El-Azouzi, D. S. Menasche, and Y. Xu, “Forever young: Aging control for hybrid networks,” in Proceedings of the Twentieth ACM International Symposium on Mobile Ad Hoc Networking and Computing, 2019, pp. 91–100.
[18] S. Kaul, M. Gruteser, V. Rai, and J. Kenney, “Minimizing age of information in vehicular networks,” in . IEEE, 2011, pp. 350–358.
[19] B. Choudhury, V. K. Shah, A. Dayal, and J. H. Reed, “Experimental analysis of safety application reliability in V2V networks,” arXiv preprint arXiv:2005.13031, 2020.
[20] K. Nar and T. Başar, “Sampling multidimensional Wiener processes,” in . IEEE, 2014, pp. 3426–3431.
[21] T. Z. Ornee and Y. Sun, “Sampling for remote estimation through queues: Age of information and beyond,” arXiv preprint arXiv:1902.03552, 2019.
[22] R. D. Yates, “Lazy is timely: Status updates by an energy harvesting source,” in . IEEE, 2015, pp. 3008–3012.
[23] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff, “Update or wait: How to keep your data fresh,” IEEE Transactions on Information Theory, vol. 63, no. 11, pp. 7492–7508, 2017.
[24] Y. Dong, P. Fan, and K. B. Letaief, “Energy harvesting powered sensing in IoT: Timeliness versus distortion,” IEEE Internet of Things Journal, 2020.
[25] S. Fatale, K. Bhandari, U. Narula, S. Moharir, and M. K. Hanawal, “Regret of age-of-information bandits,” arXiv, pp. arXiv–2001, 2020.
[26] R. Li, A. Eryilmaz, and B. Li, “Throughput-optimal wireless scheduling with regulated inter-service times,” in . IEEE, 2013, pp. 2616–2624.
[27] B. Li, R. Li, and A. Eryilmaz, “Throughput-optimal scheduling design with regular service guarantees in wireless networks,” IEEE/ACM Transactions on Networking, vol. 23, no. 5, pp. 1542–1552, 2014.
[28] W.-K. Hsu, J. Xu, X. Lin, and M. R. Bell, “Integrate learning and control in queueing systems with uncertain payoffs,” Purdue University, available at https://engineering.purdue.edu/%7elinx/papers.html, Tech. Rep., 2018.
[29] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” arXiv preprint arXiv:1204.5721, 2012.