Decentralized Age-of-Information Bandits
Archiki Prasad
Department of Electrical Engineering, Indian Institute of Technology Bombay
[email protected]
Vishal Jain
Department of Electrical Engineering, Indian Institute of Technology Bombay
[email protected]
Sharayu Moharir
Department of Electrical Engineering, Indian Institute of Technology Bombay
[email protected]
Abstract—Age-of-Information (AoI) is a performance metric for scheduling systems that measures the freshness of the data available at the intended destination. AoI is formally defined as the time elapsed since the destination received the most recent update from the source. We consider the problem of scheduling to minimize the cumulative AoI in a multi-source multi-channel setting. Our focus is on the setting where channel statistics are unknown, and we model the problem as a distributed multi-armed bandit problem. For an appropriately defined AoI regret metric, we provide analytical performance guarantees for an existing UCB-based policy for the distributed multi-armed bandit problem. In addition, we propose a novel policy based on Thompson Sampling and a hybrid policy that balances the trade-off between the two aforementioned policies. Further, we develop AoI-aware variants of these policies in which each source takes its current AoI into account while making decisions. We compare the performance of the various policies via simulations.
Index Terms—Age-of-Information, Multi-Armed Bandits
I. INTRODUCTION
We consider the problem of scheduling in a multi-source multi-channel system, focusing on the metric of Age-of-Information (AoI), introduced in [1]. AoI is formally defined as the time elapsed since the destination received the most recent update from the source. It follows that AoI is a measure of the freshness of the data available at the intended destination, which makes it a suitable metric for time-sensitive systems like smart homes, smart cars, and other IoT-based systems. Since its introduction, AoI has been used in areas like caching, scheduling, and channel state information estimation. A comprehensive survey of AoI-based works is available in [2].

The work in [3] studies AoI bandits for a single source and multiple channels, where the source acts as the "bandit" which pulls one of the arms in every time-slot, i.e., selects one of the channels for communication. The aim is to find the best arm (channel) while minimizing the AoI (instead of the usual reward maximization) for that source. In this work, we address more practical scenarios in which systems have multiple sources looking to simultaneously communicate through a common, limited pool of channels. Here, we need to ensure that the total sum of the AoIs of all the sources is minimized. Moreover, we consider a decentralized setting where the sources cannot share information with each other, meaning that the chances of collisions (two or more sources attempting to communicate through the same channel in a time-slot) can be very high. Thus, designing policies that minimize the total AoI while ensuring fairness among all sources and avoiding collisions is a challenging problem. Prior works on decentralized multi-player MAB problems primarily discuss reward-based policies; minimizing AoI is more challenging, as the impact of sub-optimal decisions accumulates over time.

The decentralized system comprises multiple sources and multiple channels, where in every time-slot, each of the M decentralized users searches for idle channels to send a periodic update. The probability of an attempted update succeeding is independent across communication channels and independent and identically distributed (i.i.d.) across time-slots for each channel. AoI increases by one on each failed update and resets to one on each successful update. The distributed players can only learn from their local observations and collide (with a reward penalty) when choosing the same arm. The objective is to develop a sequential policy, running at each user, that selects one of the channels, without any information exchange, in order to minimize the cumulative AoI over a finite time-interval of T consecutive time-slots.

A. Our Contribution

Optimality of Round-Robin policy:
We describe an oracle Round-Robin policy and characterize its optimality.
Upper-Bound on AoI regret of DLF [4]:
We characterize a generic expression for the upper bound on the total cumulative AoI of any policy. Further, we show that the AoI regret of the DLF policy scales as O(MN log T).

New AoI-agnostic policies:
We propose a Thompson Sampling [5] based policy. We also present a new hybrid policy that trades off between Thompson Sampling and DLF.
New AoI-aware policies:
We propose AoI-aware variants of all the AoI-agnostic policies. When AoI values are below a certain threshold, the variants mimic the original policy. Otherwise, the variants exploit local past observations. Through simulations, we show that these variants exhibit lower AoI regret than their agnostic counterparts.
B. Related Work
In this section, we discuss the prior work most relevant to our setting. AoI-based scheduling has been explored in [6]–[11], where the channel statistics are assumed to be known and an infinite time-horizon, steady-state performance-based approach is adopted. [3] explores the setting where the channel statistics are unknown, for a single source and multiple channels. We consider a decentralized system with multiple sources, which is a much more complex setting.

The Time Division Fair Sharing algorithm proposed in [12] addresses the fair-access problem in a distributed multi-player setting. This policy was outperformed by the policy proposed in [4]. Recent work on Multi-Player MABs (MPMABs) includes [13]–[15]. The policies in [13] consider an alternate channel allotment in the event of a collision. Although these ideas can be adopted in all our policies, in this work, we did not wish to make any assumptions about the feasibility of these alternate allotments. The settings in [14], [15] are significantly different and cannot be readily adapted to our setting. Further, those policies may not be directly consistent with the AoI metric.

II. SETTING
A. Our System
We consider a system with M sources and N channels (N ≥ M). We use the terms source and user, and arm and channel, interchangeably. The sources track/measure different time-varying quantities and relay their measurements to a monitoring station via the available communication channels. Time is divided into slots. In each time-slot, each of the M decentralized users (sources) selects an arm (channel) based only on its own observation history under a decentralized policy, and then attempts to transmit via that channel. Each attempted communication via channel i is successful with probability $\mu_i$ and unsuccessful otherwise, independent of all other channels and across time-slots. The values of the $\mu_i$'s are unknown to the sources. Without loss of generality, we assume the $\mu_i$'s to be in descending order, i.e., $\mu_1 > \mu_2 > \dots > \mu_N$.

A decentralized setting means that when a particular arm i is selected by user j, the reward (whether the transmission was successful or not) is observed only by user j, and if no other user is playing the same arm, a reward is obtained with probability $\mu_i$. Else, if multiple users are playing the same arm (which is possible in a decentralized setting), we assume that, due to collision, exactly one of the conflicting users can use the channel to transmit, while the other users get zero reward. This "winner" is chosen uniformly at random, since we assume all sources to have equal priority. This is consistent with network protocols like CSMA with perfect sensing.
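As an illustration of this collision model (a minimal sketch, not part of the analysis; all function and variable names below are ours), one time-slot of the system can be simulated as follows: each source picks a channel, one winner per contested channel is drawn uniformly at random, and each AoI either resets to one or increments.

```python
import random

def simulate_slot(choices, mu, aoi):
    """Advance the system by one time-slot.

    choices[m]: channel picked by source m in this slot.
    mu[n]:      success probability of channel n (unknown to the sources).
    aoi[m]:     current AoI of source m; updated in place and returned.
    """
    # Group sources by the channel they contend for.
    contenders = {}
    for m, n in enumerate(choices):
        contenders.setdefault(n, []).append(m)
    # Collision model: exactly one contender, chosen uniformly at
    # random, gets to use each contested channel.
    winners = {n: random.choice(ms) for n, ms in contenders.items()}
    for m, n in enumerate(choices):
        # An update succeeds iff source m wins channel n and the
        # Bernoulli(mu[n]) transmission goes through.
        if winners[n] == m and random.random() < mu[n]:
            aoi[m] = 1      # fresh update received: AoI resets
        else:
            aoi[m] += 1     # failed slot: AoI grows by one
    return aoi
```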
B. Metric: Age-of-Information Regret

The age-of-information is a metric that measures the freshness of the information available at the monitoring station. It is formally defined as follows.
Definition 1 (Age-of-Information (AoI)). Let $a(t)$ denote the AoI at the monitoring station in time-slot t, and let $u(t)$ denote the index of the time-slot in which the monitoring station received the latest update from the source before the beginning of time-slot t. Then, $a(t) = t - u(t)$. By definition,

$$a(t) = \begin{cases} 1 & \text{if the update in slot } t-1 \text{ succeeds,} \\ a(t-1) + 1 & \text{otherwise.} \end{cases}$$
Let $a^P_m(t)$ be the AoI of source m in time-slot t under a given policy P, and let $a^*_m(t)$ be the corresponding AoI under the oracle policy, discussed in detail in Section VI-A. We define the AoI regret at time T as the cumulative difference in expected AoI between the two policies in time-slots 1 to T, summed over all the sources. Hence, the less AoI a policy accumulates, the better it is. All policies operate under Assumption 1, similar to the initial conditions in [3].

Definition 2 (Age-of-Information Regret (AoI Regret)). The AoI regret under policy P is denoted by $R_P(T)$, where

$$R_P(T) = \sum_{m=1}^{M}\sum_{t=1}^{T}\mathbb{E}\big[a^P_m(t) - a^*_m(t)\big]. \qquad (1)$$

Assumption 1 (Initial Conditions). The system starts operating at time-slot $t = -\infty$, but for any policy, decision making begins at $t = 1$. A policy does not use information from observations in time-slots $t \le 0$ to make decisions in time-slots $t \ge 1$.
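Empirically, the expectation in (1) is estimated by averaging over independent runs; a minimal sketch (in our notation) of the per-run computation is:

```python
def aoi_regret(policy_ages, oracle_ages):
    """Single-run AoI regret as in (1): policy_ages[m][t] and
    oracle_ages[m][t] are AoI traces of source m for t = 1..T under
    the candidate policy and the oracle; averaging the returned value
    over independent runs estimates R_P(T)."""
    return sum(
        sum(ap - ao for ap, ao in zip(trace_p, trace_o))
        for trace_p, trace_o in zip(policy_ages, oracle_ages)
    )
```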
III. ANALYTICAL RESULTS
In this section, we state and discuss our main theoretical results. We characterize our Oracle/Genie policy and prove its optimality. We also provide an upper bound on the regret of a UCB [16] based policy for the multi-user, multi-armed setting, called Distributed Learning with Fairness (DLF) [4].
A. Oracle Policy
We will consider two candidates for the Oracle policy, namely the I.I.D. policy and the Symmetric M-Periodic policy, described in Theorem 1 and Definition 3, respectively.

Theorem 1 (I.I.D. Policy). Among all policies P whose scheduling decisions are independently and identically distributed across time, the policy which, in every time-slot, schedules the M sources on a (uniformly random) permutation of the set of arms S(M) is optimal.

Definition 3 (Symmetric M-Periodic Policy). Let $D_m(t)$ be an N-element vector containing the number of times each channel is scheduled by source m up to time-period t, arranged in decreasing order. Any periodic policy P with period M such that, for all $m \in [M]$, $D_m(\tau) = D(\tau)$ for $\tau \in (t-M, t]$, for all $t > M$, is termed a Symmetric M-Periodic Policy.

All the symmetric M-periodic policies under our consideration avoid collisions between any two sources and use only the best-M arms; thus, for all of them, $D(\tau)$ is an M-length vector. These two conditions are necessary for any optimal policy. We refer interested readers to Lemmas 1 and 2 in Section VI. Next, in Definition 4, we provide a special case of the Symmetric M-Periodic policy called the Round-Robin policy.

Definition 4 (Round-Robin Policy). Let $S(k) = \{1, \dots, k\}$ be the set of k channels with the highest success probabilities. For a problem instance (M, N, µ), consider the index set I, a random permutation of the arms in the set S(M). Then, in time-slot t, the round-robin policy schedules source m on the channel $I_{((m+t) \bmod M)+1}$.

A symmetric M-periodic policy can be uniquely characterized by the sequence $D(1), \dots, D(M)$. By definition, the round-robin policy is a symmetric M-periodic policy and satisfies the property that each $D(i)$, for $i \in [M]$, contains only 1s and 0s, with $D(M) = (1, 1, \dots, 1)$. Finally, Theorems 2–4 characterize the optimality of the Round-Robin policy under certain conditions. This leads us to Conjecture 1, where we conjecture the optimality of the Oracle for more than two simultaneous sources.

Theorem 2 (Optimality of Round-Robin Policy for M = 2). For any problem instance (M, N, µ) such that M = 2, N ≥ M, a policy that schedules source $m \in [M]$ on the channel $((m+t) \bmod M) + 1$ is optimal.

Theorem 3 (Round-Robin is the best Symmetric M-Periodic Policy). For any problem instance (M, N, µ), there does not exist a symmetric M-periodic policy with a smaller expected total age-of-information value than the round-robin policy.

Theorem 4 (I.I.D. vs. Round-Robin). The round-robin policy has a smaller expected total age-of-information value than the I.I.D. policy.
Conjecture 1. Generalizing Theorem 2 (for M > 2), over all possible permutations of the arms in the set S(M), the Oracle Policy is optimal for any problem instance (M, N, µ).

Without loss of generality, hereafter we use the set S(M) as our pre-decided index set I, as described in Definition 4, in all the algorithms discussed in Section IV. Further, the regret of all the policies is calculated with respect to the round-robin policy; that is, the round-robin policy is our Oracle. Standard single-source bandit policies try to pick the best-estimated arm. Since we have multiple sources with equal priority, we instead try to pick the "k-th best" arm, where k changes in a round-robin fashion for each source, thus ensuring fairness. However, this can still lead to collisions, as the estimates of the channel means are maintained independently by each source and no information is shared. Thus, we check at every instant whether the channel is acquired by the source (based on our collision model), and update the mean estimates only when the source has access to the channel. The regret of any policy P arises when a source m, for an index k, chooses an arm other than the desired one (the one used by the oracle), or when, due to a collision, the source is unable to acquire the channel. Thus, there is a trade-off between exploration and the number of collisions.
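For concreteness, the round-robin oracle of Definition 4 reduces to a one-line schedule (a sketch in our notation, with `perm` standing in for the index set I and 0-indexed channels):

```python
def round_robin_choice(m, t, perm):
    """Oracle schedule: source m plays perm[(m + t) mod M] in slot t,
    where perm is a fixed random permutation of the best-M channels.
    Distinct sources get distinct offsets, so no two sources ever
    collide, and only arms in S(M) are used."""
    M = len(perm)
    return perm[(m + t) % M]
```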
B. Distributed Learning with Fairness (DLF)

DLF is an equal-priority multi-source, multi-channel policy based on UCB-1. This policy tries to mimic our Oracle policy while avoiding collisions by estimating the channel µ's through a two-step process: estimating the set of best-M channels, and then selecting the channel with the k-th highest estimated value of µ from that set, where the index k is determined by the user m and the current slot t. This is formally described in Algorithm 1.

Algorithm 1: Distributed Learning Algorithm with Fairness for M Sources and N Channels at Source m (DLF)

Initialize: Set $\hat\mu_{mn} = 0$ (the estimated success probability of channel n) and $T_{mn}(1) = 0$ for all $n \in [N]$.
while $1 \le t \le N$ do
    Schedule an update on the channel $n(t) = ((m + t) \bmod N) + 1$.
    Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$; set $\hat\mu_{mn(t)} = X_{mn(t)}(t)$ and $T_{mn(t)} = 1$; $t = t + 1$.
while $t \ge N + 1$ do
    Set index $k = ((m + t) \bmod M) + 1$.
    Let the set $\mathcal{O}_k$ contain the k arms with the k largest values of
        $\hat\mu_{mn} + \sqrt{\frac{2\log t}{T_{mn}(t-1)}}$.    (2)
    Schedule an update on the channel
        $n(t) = \arg\min_{n \in \mathcal{O}_k}\ \hat\mu_{mn} - \sqrt{\frac{2\log t}{T_{mn}(t-1)}}$.
    if the channel is acquired then
        Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$;
        $\hat\mu_{mn(t)}(t) = \frac{\hat\mu_{mn(t)}(t-1)\,T_{mn(t)}(t-1) + X_{mn(t)}(t)}{T_{mn(t)}(t-1) + 1}$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1) + 1$.
    else
        $\hat\mu_{mn(t)}(t) = \hat\mu_{mn(t)}(t-1)$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1)$.
    $t = t + 1$.

Theorem 5 (Performance of DLF). Consider any problem instance (M, N, µ) such that $\mu_{\min} = \min_{i \in [N]} \mu_i > 0$ and $\Delta = \min_{i,j \in [M];\, i>j}(\mu_j - \mu_i)$, and let c be a constant. Then, for any sufficiently large T > N, under Assumption 1,

$$R_{\mathrm{DLF}}(T) \le \frac{M}{\mu_{\min}} + \frac{Mc\log T}{\mu_{\min}}\left[(N-1)\left(\frac{8\log T}{\Delta^2} + 1 + \frac{2\pi^2}{3}\right)\right].$$

From Theorem 5, we conclude that the AoI regret of DLF scales as O(MN log T). The proof first characterizes the source-wise regret as a function of the expected number of time-slots in which a non-desired channel is scheduled or another source acquires the desired channel under the policy. The result then follows using known upper bounds on this quantity for DLF, from [4]. We elaborate on this in Section VI through Lemmas 4 and 5.
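The per-slot decision of Algorithm 1 can be sketched as follows (our paraphrase; we write the standard UCB-1 confidence radius, with [4] defining the exact constants in (2)):

```python
import math

def dlf_choice(m, t, mu_hat, counts, M):
    """One DLF decision at source m in slot t (after the initialization
    phase, so counts[n] >= 1): form the set O_k of the k arms with the
    largest upper confidence bounds, then schedule the arm in O_k with
    the smallest lower confidence bound, where k = ((m + t) mod M) + 1."""
    k = ((m + t) % M) + 1
    radius = [math.sqrt(2 * math.log(t) / counts[n]) for n in range(len(mu_hat))]
    ucb = [mu_hat[n] + radius[n] for n in range(len(mu_hat))]
    lcb = [mu_hat[n] - radius[n] for n in range(len(mu_hat))]
    o_k = sorted(range(len(mu_hat)), key=lambda n: ucb[n], reverse=True)[:k]
    return min(o_k, key=lambda n: lcb[n])
```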
Algorithm 2: Distributed Learning-based Thompson Sampling for M Sources and N Channels at Source m (DL-TS)

Initialize: Set $\hat\mu_{mn} = 0$ (the estimated success probability of channel n) and $T_{mn}(1) = 0$ for all $n \in [N]$.
while $t \ge 1$ do
    $\alpha_{mn}(t) = \hat\mu_{mn}(t)\,T_{mn}(t-1) + 1$, $\;\beta_{mn}(t) = (1 - \hat\mu_{mn}(t))\,T_{mn}(t-1) + 1$.
    For each $n \in [N]$, draw a sample
        $\hat\theta_{mn}(t) \sim \mathrm{Beta}(\alpha_{mn}(t), \beta_{mn}(t))$.    (3)
    Set index $k = ((m + t) \bmod M) + 1$.
    Schedule an update on the channel $n(t)$ with the k-th largest value in (3).
    if the channel is acquired then
        Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$;
        $\hat\mu_{mn(t)}(t) = \frac{\hat\mu_{mn(t)}(t-1)\,T_{mn(t)}(t-1) + X_{mn(t)}(t)}{T_{mn(t)}(t-1) + 1}$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1) + 1$.
    else
        $\hat\mu_{mn(t)}(t) = \hat\mu_{mn(t)}(t-1)$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1)$.
    $t = t + 1$.
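The sampling step of Algorithm 2, in the same sketch form (Beta posteriors built from the locally maintained estimates; names are ours):

```python
import numpy as np

def dlts_choice(m, t, mu_hat, counts, M, rng=None):
    """One DL-TS decision at source m in slot t: draw theta_n from the
    Beta posterior of each channel, as in (3), and schedule the arm
    with the k-th largest sample, where k = ((m + t) mod M) + 1."""
    rng = rng if rng is not None else np.random.default_rng()
    k = ((m + t) % M) + 1
    counts = np.asarray(counts, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    alpha = mu_hat * counts + 1.0          # pseudo-counts of successes
    beta = (1.0 - mu_hat) * counts + 1.0   # pseudo-counts of failures
    theta = rng.beta(alpha, beta)
    return int(np.argsort(-theta)[k - 1])  # index of the k-th largest sample
```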
IV. OTHER POLICIES
In this section, we extend Thompson Sampling [5] to our setting. We also propose a new hybrid policy, as well as AoI-aware variants of each of these policies, and compare their performance via simulations.
A. AoI-Agnostic Policies
Definition 5 (AoI-Agnostic Policies). A policy is AoI-agnostic if, given past scheduling decisions and the number of successful updates sent via each of the N channels in the past, it does not explicitly use the age-of-information of any source in a time-slot to make scheduling decisions.

While age-of-information is a new metric, to devise AoI-agnostic policies one can use the myriad of policies designed for Bernoulli rewards, most commonly the Upper Confidence Bound (UCB) [16] and Thompson Sampling [5]. These policies can be readily applied to the one-source setting, i.e., M = 1 [3]. DLF [4], as described in Algorithm 1, is based on UCB and tries to mimic our Oracle policy while avoiding collisions. Similarly, we extend Thompson Sampling to our setting and term the new policy Distributed Learning-based Thompson Sampling (DL-TS), given in Algorithm 2. Further, we combine these two policies to propose a hybrid policy, the Distributed Learning-based Hybrid policy (DLH), detailed in Algorithm 3.

Empirically, we observe that DLF exhibits fewer collisions while being more exploratory, whereas DL-TS is more exploitative but leads to a higher number of collisions.
Algorithm 3: Distributed Learning-based Hybrid policy for M Sources and N Channels at Source m (DLH)

Initialize: Set $\hat\mu_{mn} = 0$ (the estimated success probability of channel n) and $T_{mn}(1) = 0$ for all $n \in [N]$.
while $t \ge 1$ do
    Draw $E(t) \sim \mathrm{Ber}\big(\min\{1, \frac{MN\log t}{t}\}\big)$.
    if $E(t) = 1$ then
        DLF: schedule an update on the channel chosen by the DLF policy given in Algorithm 1.
    else
        Thompson Sampling: schedule an update on the channel chosen by the DL-TS policy given in Algorithm 2.
    if the channel is acquired then
        Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$;
        $\hat\mu_{mn(t)}(t) = \frac{\hat\mu_{mn(t)}(t-1)\,T_{mn(t)}(t-1) + X_{mn(t)}(t)}{T_{mn(t)}(t-1) + 1}$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1) + 1$.
    else
        $\hat\mu_{mn(t)}(t) = \hat\mu_{mn(t)}(t-1)$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1)$.
    $t = t + 1$.

The Distributed Learning-based Hybrid policy (DLH), shown in Algorithm 3, switches between DLF and DL-TS using a Bernoulli random variable to balance this trade-off. As the chance of a collision is higher in the initial time-slots, the algorithm initially uses DLF with a higher probability than DL-TS. The probability of using DLF then decreases as the number of elapsed time-slots increases.
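The switching rule of Algorithm 3 then amounts to a single biased coin flip per slot; a sketch (building on the `dlf_choice` and `dlts_choice` sketches above, and reading the mixing probability as min{1, MN log t / t}):

```python
import math, random

def dlh_choice(m, t, mu_hat, counts, M, N):
    """One DLH decision: use the exploratory DLF rule with probability
    p(t) = min{1, M*N*log(t)/t}, and the exploitative DL-TS rule
    otherwise. p(t) decays roughly as log(t)/t, so DLF dominates in
    the early slots and DL-TS takes over later."""
    p = min(1.0, M * N * math.log(max(t, 2)) / t)  # max() guards t = 1
    if random.random() < p:
        return dlf_choice(m, t, mu_hat, counts, M)
    return dlts_choice(m, t, mu_hat, counts, M)
```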
B. AoI-Aware Policies
In this section, we propose AoI-aware variants of the three policies discussed in the previous section. In the classical MAB with Bernoulli rewards, the contribution of a time-slot to the overall regret is upper bounded by one, but in AoI bandits it can be more than one. Also, unlike in the classical MAB, for AoI bandits the difference between the AoIs under a candidate policy and the Oracle policy in a time-slot can be unbounded. This motivates taking the current AoI value into account when making scheduling decisions.

Intuitively, it makes sense to explore when AoI is low and exploit when AoI is high, since the cost of making a mistake is much higher when AoI is high. We use this intuition to design AoI-aware policies. The key idea behind these policies is that they mimic the original policies when AoI is below a threshold and exploit when AoI is at or above the threshold, for an appropriately chosen threshold.

In each policy, at a source m, we maintain an estimate of the success probability of the arms, denoted by $\hat\mu_m$. For an index k at time t, we mimic the original policy if the AoI is not more than $1/\hat\mu_{mk}$ (the AoI value one would see if the k-th best arm were used throughout). Otherwise, we exploit the "k-th best" arm according to the source's own past observations. The AoI-aware modifications to DLF, DL-TS, and DLH are given in Algorithms 5, 4, and 6, respectively.

Algorithm 4: AoI-Aware Distributed Learning-based Thompson Sampling for M Sources and N Channels at Source m (DL-TS-AA)
Initialize: Set $\hat\mu_{mn} = 0$ (the estimated success probability of channel n) and $T_{mn}(1) = 0$ for all $n \in [N]$.
while $t \ge 1$ do
    $\alpha_{mn}(t) = \hat\mu_{mn}(t)\,T_{mn}(t-1) + 1$, $\;\beta_{mn}(t) = (1 - \hat\mu_{mn}(t))\,T_{mn}(t-1) + 1$.
    Set index $k = ((m + t) \bmod M) + 1$.
    Let $\mathrm{limit}(t) = k\text{-th}\min_{n \in [N]} \frac{\alpha_{mn}(t) + \beta_{mn}(t)}{\alpha_{mn}(t)}$.
    if $a(t-1) > \mathrm{limit}(t)$ then
        Exploit: select the channel with the k-th highest estimated success probability.
    else
        Explore: for each $n \in [N]$, draw a sample
            $\hat\theta_{mn}(t) \sim \mathrm{Beta}(\alpha_{mn}(t), \beta_{mn}(t))$    (4)
        and schedule an update on the channel with the k-th largest value in (4).
    if the channel is acquired then
        Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$; update $\hat\mu_{mn(t)}$ and $T_{mn(t)}$ as in Algorithm 2.
    else
        $\hat\mu_{mn(t)}(t) = \hat\mu_{mn(t)}(t-1)$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1)$.
    $t = t + 1$.
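The exploit-or-defer test shared by Algorithms 4–6 can be factored out as a wrapper around any of the AoI-agnostic rules (a sketch in our notation):

```python
def aoi_aware_choice(m, t, aoi_prev, mu_hat, counts, M, explore_rule):
    """AoI-aware wrapper: limit(t) is the k-th smallest value of
    (alpha_n + beta_n)/alpha_n, i.e. the inverse posterior-mean success
    probability of the k-th best arm -- roughly the AoI one would see
    if that arm were used throughout. If the current AoI exceeds it,
    exploit the k-th best empirical arm; otherwise defer to the
    underlying AoI-agnostic rule (e.g. dlf_choice or dlts_choice)."""
    k = ((m + t) % M) + 1
    alpha = [mh * c + 1.0 for mh, c in zip(mu_hat, counts)]
    beta = [(1.0 - mh) * c + 1.0 for mh, c in zip(mu_hat, counts)]
    limit = sorted((a + b) / a for a, b in zip(alpha, beta))[k - 1]
    if aoi_prev > limit:
        order = sorted(range(len(mu_hat)), key=lambda n: mu_hat[n], reverse=True)
        return order[k - 1]   # exploit: k-th highest estimated mean
    return explore_rule(m, t, mu_hat, counts, M)
```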
V. SIMULATIONS

In this section, we present the simulation results for all six policies discussed in Section IV. All the simulations hereafter are conducted for a fixed horizon of T time-slots, with each data point averaged across 200 iterations. For the purpose of simulating the policies, we choose µ such that consecutive elements are equidistant, with the common difference denoted by ∆.

Figure 1 shows the regret, $R_P(T)$, for all six policies for different problem instances (M, N, µ). We notice that the AoI-aware policies in all cases (except for DL-TS and DL-TS-AA in Figure 1a) have a smaller regret (and fewer collisions) than their AoI-agnostic counterparts.
In Table I, we compare the distribution of source-wise channels scheduled under two policies, DLF-AA and DL-TS-AA, for the problem instance with M = 2 and N = 4 shown in Table I. We observe that the number of pulls of sub-optimal arms, that is, arms $\notin S(2)$, is higher in DLF-AA as compared to DL-TS-AA. At the same time, the former policy averages about 414 collisions overall, whereas the latter averages about 556 collisions overall.

Algorithm 5: AoI-Aware Distributed Learning Algorithm with Fairness for M Sources and N Channels at Source m (DLF-AA)
Initialize: Set $\hat\mu_{mn} = 0$ (the estimated success probability of channel n) and $T_{mn}(1) = 0$ for all $n \in [N]$.
while $1 \le t \le N$ do
    Schedule an update on the channel $n(t) = ((m + t) \bmod N) + 1$.
    Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$; set $\hat\mu_{mn(t)} = X_{mn(t)}(t)$ and $T_{mn(t)} = 1$; $t = t + 1$.
while $t \ge N + 1$ do
    $\alpha_{mn}(t) = \hat\mu_{mn}(t)\,T_{mn}(t-1) + 1$, $\;\beta_{mn}(t) = (1 - \hat\mu_{mn}(t))\,T_{mn}(t-1) + 1$.
    Set index $k = ((m + t) \bmod M) + 1$.
    Let $\mathrm{limit}(t) = k\text{-th}\min_{n \in [N]} \frac{\alpha_{mn}(t) + \beta_{mn}(t)}{\alpha_{mn}(t)}$.
    if $a(t-1) > \mathrm{limit}(t)$ then
        Exploit: select the channel with the k-th highest estimated success probability.
    else
        Let the set $\mathcal{O}_k$ contain the k arms with the k largest values of
            $\hat\mu_{mn} + \sqrt{\frac{2\log t}{T_{mn}(t-1)}}$    (5)
        and schedule an update on the channel $n(t) = \arg\min_{n \in \mathcal{O}_k}\ \hat\mu_{mn} - \sqrt{\frac{2\log t}{T_{mn}(t-1)}}$.
    if the channel is acquired then
        Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$; update $\hat\mu_{mn(t)}$ and $T_{mn(t)}$ as in Algorithm 1.
    else
        $\hat\mu_{mn(t)}(t) = \hat\mu_{mn(t)}(t-1)$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1)$.
    $t = t + 1$.
Similarly, in Table II, we show the channel distribution for the problem instance with M = 3 and N = 5 shown in Table II. The average number of collisions for DLF-AA is 1071, and that for DL-TS-AA is 1478. The same trend occurs when comparing DLF with DL-TS.

This indicates that there is a trade-off between the two causes of regret: sub-optimal arm pulls and the number of collisions. This served as the primary motivation behind the hybrid policies, DLH and DLH-AA, which lead to an intermediate number of both sub-optimal arm pulls and collisions. As one can see from Figure 1, DLH-AA performs the best in terms of cumulative regret for different values of M and N, especially when ∆ is small. For higher ∆, we observed that the DL-TS based policies performed better, since µ's that are farther apart lead to fewer collisions. DLH always resulted in intermediate regret values, between those of DLF and DL-TS. However, DLH is 'closer' to DL-TS than to DLF, which is justified by the construction of the policy.

Algorithm 6: AoI-Aware Distributed Learning-based Hybrid Policy for M Sources and N Channels at Source m (DLH-AA)
Initialize: Set $\hat\mu_{mn} = 0$ (the estimated success probability of channel n) and $T_{mn}(1) = 0$ for all $n \in [N]$.
while $t \ge 1$ do
    Draw $E(t) \sim \mathrm{Ber}\big(\min\{1, \frac{MN\log t}{t}\}\big)$.
    if $E(t) = 1$ then
        DLF: schedule an update on the channel chosen by the DLF-AA policy given in Algorithm 5.
    else
        Thompson Sampling: schedule an update on the channel chosen by the DL-TS-AA policy given in Algorithm 4.
    if the channel is acquired then
        Receive reward $X_{mn(t)}(t) \sim \mathrm{Ber}(\mu_{n(t)})$; update $\hat\mu_{mn(t)}$ and $T_{mn(t)}$ as in Algorithms 1 and 2.
    else
        $\hat\mu_{mn(t)}(t) = \hat\mu_{mn(t)}(t-1)$, $\;T_{mn(t)}(t) = T_{mn(t)}(t-1)$.
    $t = t + 1$.
Table I: Distribution of channels scheduled across sources under DLF-AA and DL-TS-AA (rows m = 1, 2; columns n = 1, ..., 4), for M = 2, N = 4, and µ with µ₁ = 0.65 and equidistant entries.

Table II: Distribution of channels scheduled across sources under DLF-AA and DL-TS-AA (rows m = 1, 2, 3; columns n = 1, ..., 5), for M = 3, N = 5, and µ with µ₁ = 0.6 and equidistant entries.

In both Tables I and II, there is a sizable difference between the pulls of the M-th arm and those of the other arms in the set S(M). We believe that this is because, across policies, a majority of the sub-optimal pulls arise from scheduling the M-th channel. This trend can be seen in Figure 3(b) of [4]. Note that although the order of magnitude of the difference in the arm pulls is similar to ours, the absolute number is higher, as their simulations are conducted over a much longer horizon than ours. We have verified that on increasing T, the number of arm pulls approaches the numbers presented in [4].

Figure 1: Regret plots for all six policies, averaged over 200 iterations; (a) M = 2, N = 4; (b) M = 3, N = 5; (c) M = 4, N = 7.

Figure 2 shows the variation of the regret values of the policies at the horizon T with different parameters. In Figure 2a, we observe that the regret scales approximately linearly with N, with the DLF based policies having the highest values of both slope and regret, and DL-TS the lowest. This is in accordance with the degree of exploration in the policies. For the variation with M, in Figure 2b, we plot $R_P(T)/M$ instead of the regret values; since these plots are almost linear, the regret scales super-linearly with M. Notice that the order of the magnitudes of the slopes is reversed as compared to the previous plot: this is because, as M increases, DL-TS becomes more susceptible to collisions, thus accumulating higher regret. Figure 2c indicates an inverse relationship between the regret $R_P(T)$ and ∆ for all six policies. We note that, when ∆ is small, the DLH and DLF based policies do better than the DL-TS based ones, DLH-AA being the best policy. As ∆ increases, this trend changes, with the DL-TS based policies having the lowest regret and the DLF based policies the highest.

Figure 2: Variation in regret, $R_P(T)$, at the horizon T, with different values of N, M, and ∆ (or µ), in that order; (a) $R_P(T)$ vs. N for M = 3; (b) $R_P(T)/M$ vs. M for N = 7; (c) $R_P(T)$ vs. ∆ for M = 3, N = 5.

VI. PROOFS
A. Proof of Oracle
An optimal policy minimizes the expected value of the total age-of-information across all sources at any large, finite time-slot T. We first establish that any candidate for an optimal policy must satisfy the two properties given in Lemmas 1 and 2. Then, in Lemma 3, we characterize the AoI of any I.I.D. policy. Finally, using these lemmas and the KKT conditions, we prove Theorem 1.

Lemma 1 (Collision Sub-Optimality). For a problem instance (M, N, µ) with $\mu_{\min} = \min_{i \in [N]}\mu_i > 0$, any policy that leads to a collision in any time-slot is sub-optimal.

Proof of Lemma 1. We argue by contradiction. Assume a policy P is optimal and there is a collision at channel $j \in [N]$ in time-slot t. Recall that in the problem instance (M, N, µ), N ≥ M. For a collision to happen, at least two sources must be scheduled on channel j; let the number of colliding sources be l, with $2 \le l \le M$. Then there exist channels $i_1, \dots, i_{l-1} \in [N]$, with $i_1 \ne i_2 \ne \dots \ne i_{l-1} \ne j$, that are idle. Since $\mu_{\min} > 0$, we have $\mu_{i_1}, \dots, \mu_{i_{l-1}} > 0$.

Construct an alternate policy P′ which, in time-slot t, is identical to policy P for the non-colliding sources, and which schedules all but one of the sources colliding at channel j on the channels $i_1, \dots, i_{l-1}$ randomly, without repetition. For these sources, under policy P the AoI increases by 1 with probability 1, whereas under policy P′ it increases by 1 with probability at most $1 - \mu_{\min} < 1$. Therefore, with strict inequality for the re-scheduled sources,

$$\mathbb{E}[a^P_m] \ge \mathbb{E}[a^{P'}_m] \implies \sum_{m=1}^{M}\mathbb{E}[a^P_m] > \sum_{m=1}^{M}\mathbb{E}[a^{P'}_m],$$

which is a contradiction. No such optimal policy P exists. ∎

Lemma 2 (Best-M Channels). Let $S(k) = \{1, \dots, k\}$ be the set of k channels with the highest success probabilities. Then, any policy that is not sub-optimal as per Lemma 1, but selects an arm not in S(M), is sub-optimal.

Proof of Lemma 2. Again by contradiction. Assume a policy P is optimal and, in time-slot t, schedules a source m on channel $j \in [N]$ with $j > M$. From Lemma 1, in any time-slot, M distinct channels must be in use. By construction, it follows that there exists $i \in [M]$ with $i < j$ such that channel i is idle. Since the µ's are in descending order, $\mu_i \ge \mu_j$.

Construct an alternate policy P′ which, in time-slot t, is identical to policy P for all sources except m, which it schedules on channel i. For source m, under policy P the AoI increases by 1 with probability $1 - \mu_j$, whereas under policy P′ it increases by 1 with probability $1 - \mu_i \le 1 - \mu_j$. Therefore,

$$\mathbb{E}[a^P_m] \ge \mathbb{E}[a^{P'}_m] \implies \sum_{m=1}^{M}\mathbb{E}[a^P_m] \ge \sum_{m=1}^{M}\mathbb{E}[a^{P'}_m],$$

which is a contradiction. No such optimal policy P exists. ∎

Lemma 3 (AoI for I.I.D. Policy). For any policy P that schedules channels in an i.i.d. manner across time, let the effective success probability of source m be $\tilde\mu_m = \sum_{j=1}^{N}\lambda_{m,j}\mu_j$, where $\lambda_{m,j}$ is the probability that source m is scheduled on channel j. Then, the expected AoI of source m in time-slot t under this policy is $\mathbb{E}[a^P_m(t)] = 1/\tilde\mu_m$.

Proof of Lemma 3. By definition,

$$\mathbb{P}\big(a^P_m(t) > \tau\big) = \prod_{i=0}^{\tau-1}\big(1 - \mu_{k_m(t-i)}\big).$$

Note that since $a^P_m(t) \ge 1$ for all t and m,

$$\mathbb{E}[a^P_m(t)] = \sum_{\tau=0}^{\infty}\mathbb{P}\big(a^P_m(t) > \tau\big).$$
It follows that

$$\mathbb{E}[a^P_m(t)] = \mathbb{E}\Bigg[\sum_{\tau=0}^{\infty}\prod_{i=0}^{\tau-1}\big(1 - \mu_{k_m(t-i)}\big)\Bigg] \qquad (6)$$

$$= \sum_{\tau=0}^{\infty}\prod_{i=0}^{\tau-1}\Big(1 - \mathbb{E}\big[\mu_{k_m(t-i)}\big]\Big) = \sum_{\tau=0}^{\infty}\big(1 - \tilde\mu_m\big)^{\tau} = \frac{1}{\tilde\mu_m} \quad \text{(from Assumption 1)}. \qquad \blacksquare$$
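The conclusion of Lemma 3 is easy to sanity-check numerically (our sketch): simulating the AoI recursion of Definition 1 with a fixed effective success probability gives a long-run average AoI of $1/\tilde\mu$.

```python
import random

def mean_aoi(mu_eff, T=200_000, seed=0):
    """Monte Carlo check of Lemma 3: under i.i.d. scheduling with
    effective success probability mu_eff, the time-averaged AoI
    converges to 1/mu_eff."""
    random.seed(seed)
    aoi, total = 1, 0
    for _ in range(T):
        aoi = 1 if random.random() < mu_eff else aoi + 1
        total += aoi
    return total / T

print(mean_aoi(0.4))  # approximately 1/0.4 = 2.5
```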
Proof of Theorem 1. From Lemma 3, the total AoI is

$$\sum_{m=1}^{M}\mathbb{E}[a^P_m(t)] = \sum_{m=1}^{M}\frac{1}{\tilde\mu_m} = \sum_{m=1}^{M}\frac{1}{\sum_{j=1}^{N}\lambda_{m,j}\mu_j}. \qquad (7)$$

We need to minimize (7) under the constraints $\sum_{j=1}^{N}\lambda_{i,j} = 1$ and $\lambda_{i,j} \ge 0$ for all $i \in [M]$. Further, from Lemmas 1 and 2,

$$\sum_{i=1}^{M}\lambda_{i,j} = 1 \;\;\forall j \in [M], \quad \text{and} \quad \lambda_{i,j} = 0 \;\;\forall i \in [M],\; j > M. \qquad (8)$$

Let $\boldsymbol{\lambda} = (\lambda_{1,1}, \lambda_{1,2}, \dots, \lambda_{i,j}, \dots, \lambda_{M,M})$ denote the $M^2$ non-zero entries allowed by the constraints. Formalizing this constrained minimization as per the KKT conditions, we get:

Minimize $f(\boldsymbol{\lambda}) = \sum_{m=1}^{M}\frac{1}{\sum_{j}\lambda_{m,j}\mu_j}$ subject to
$g_{i,j}(\boldsymbol{\lambda}) = -\lambda_{i,j} \le 0 \;\;\forall i,j \in [M]$,
$h^s_i(\boldsymbol{\lambda}) = \sum_{j=1}^{M}\lambda_{i,j} - 1 = 0 \;\;\forall i \in [M]$,
$h^c_j(\boldsymbol{\lambda}) = \sum_{i=1}^{M}\lambda_{i,j} - 1 = 0 \;\;\forall j \in [M]$.

Let $L(\boldsymbol{\lambda}, \alpha, \beta) = f(\boldsymbol{\lambda}) + \alpha^{\top}g(\boldsymbol{\lambda}) + \beta^{\top}h(\boldsymbol{\lambda})$ be the Lagrangian, where $g(\boldsymbol{\lambda}) = (g_{1,1}(\boldsymbol{\lambda}), \dots, g_{M,M}(\boldsymbol{\lambda}))$ and $h(\boldsymbol{\lambda}) = (h^s_1(\boldsymbol{\lambda}), \dots, h^s_M(\boldsymbol{\lambda}), h^c_1(\boldsymbol{\lambda}), \dots, h^c_M(\boldsymbol{\lambda}))$. Considering $\boldsymbol{\lambda}^* = (\frac{1}{M}, \dots, \frac{1}{M})$ ($M^2$ entries) as a candidate optimal point, we verify the necessary conditions.

Stationarity: $\nabla f(\boldsymbol{\lambda}^*) + \alpha^{\top}\nabla g(\boldsymbol{\lambda}^*) + \beta^{\top}\nabla h(\boldsymbol{\lambda}^*) = 0$. Set $\alpha = 0$. At $\boldsymbol{\lambda}^*$,

$$\frac{\partial f(\boldsymbol{\lambda})}{\partial\lambda_{i,j}}\Bigg|_{\boldsymbol{\lambda}=\boldsymbol{\lambda}^*} = -\frac{\mu_j M^2}{\big(\sum_{j'=1}^{M}\mu_{j'}\big)^2}, \qquad \frac{\partial h^s_i(\boldsymbol{\lambda})}{\partial\lambda_{i,j}} = 1, \quad \frac{\partial h^c_j(\boldsymbol{\lambda})}{\partial\lambda_{i,j}} = 1.$$

Choosing $\beta_i = 0$ for $i \in [M]$ and $\beta_{M+j} = \frac{\mu_j M^2}{(\sum_{j'=1}^{M}\mu_{j'})^2}$ satisfies the stationarity condition.

Primal feasibility: $g_{i,j}(\boldsymbol{\lambda}^*) = -\frac{1}{M} \le 0$ (since $M > 0$), $h^s_i(\boldsymbol{\lambda}^*) = \sum_{j=1}^{M}\frac{1}{M} - 1 = 0$, and $h^c_j(\boldsymbol{\lambda}^*) = \sum_{i=1}^{M}\frac{1}{M} - 1 = 0$.

Dual feasibility: $\alpha_{i,j} = 0 \ge 0$ for all $i,j \in [M]$.

Complementary slackness: $\alpha^{\top}g(\boldsymbol{\lambda}^*) = 0$ since $\alpha = 0$.

We know that $f(\boldsymbol{\lambda})$ is convex, $g(\boldsymbol{\lambda})$ comprises linear (hence convex) functions, and $h(\boldsymbol{\lambda})$ comprises affine functions. As a result, the necessary KKT conditions are also sufficient for optimality, giving us the desired result. ∎
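As a quick numerical illustration of Theorem 1 (our example values): among doubly stochastic λ, the uniform assignment $\lambda_{i,j} = 1/M$ minimizes the objective in (7).

```python
mu = [0.9, 0.6]  # success probabilities of the best-M channels, M = 2

def total_aoi(lam):
    # Objective (7): sum over sources of 1 / (sum_j lam[m][j] * mu[j]).
    return sum(1.0 / sum(l * u for l, u in zip(row, mu)) for row in lam)

uniform = [[0.5, 0.5], [0.5, 0.5]]   # the KKT point lambda* = 1/M
skewed = [[0.9, 0.1], [0.1, 0.9]]    # another doubly stochastic point
print(total_aoi(uniform) <= total_aoi(skewed))  # True
```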
We design two coupled schedules, Policy $P_A$ and Policy $P_B$. By enumerating all cases, we show that the AoI accrued under Policy $P_A$ is at most that under Policy $P_B$, and as a result prove Theorem 2.

Proof of Theorem 2. Consider a problem instance (M, N, µ) with M = 2, and a system of two coupled policies $P_A$ and $P_B$ at time $T = k_1 + 2 + k_2$, where $k_1 \ge 0$, $k_2 \ge 0$.

Policy $P_A$: the round-robin policy that schedules source $m \in [M]$ on the channel $((m+t) \bmod M) + 1$ in time-slot t.

Policy $P_B$: a policy that is not sub-optimal as per Lemmas 1 and 2, and makes the same scheduling decisions as Policy $P_A$ in time-slots $t \in [1, k_1] \cup [k_1+3, T]$. In the M-length interval $t \in [k_1+1, k_1+2]$, the sources are scheduled on repeated arms from the set S(2).

We need to show that

$$\sum_{m=1}^{M}\mathbb{E}\big[a^{P_B}_m(T)\big] \ge \sum_{m=1}^{M}\mathbb{E}\big[a^{P_A}_m(T)\big].$$

From (6), splitting the sum over τ at the boundaries of the window in which the two policies differ,

$$\mathbb{E}\big[a^{P_B}_m(T)\big] = 1 + \mathbb{E}\Bigg[\sum_{\tau=0}^{k_2-1}\prod_{i=0}^{\tau}\big(1-\mu_{k^B_m(t-i)}\big) + \sum_{\tau=k_2}^{k_2+M-1}\prod_{i=0}^{\tau}\big(1-\mu_{k^B_m(t-i)}\big) + \sum_{\tau=k_2+M}^{T}\prod_{i=0}^{\tau}\big(1-\mu_{k^B_m(t-i)}\big)\Bigg], \qquad (9)$$

where the factors with indices outside the window $[k_2, k_2+M-1]$ are identical under $P_A$ and $P_B$. Using (9) and M = 2, we compute

$$\sum_{m=1}^{M}\mathbb{E}\big[a^{P_B}_m(T)\big] - \sum_{m=1}^{M}\mathbb{E}\big[a^{P_A}_m(T)\big] = \sum_{m=1}^{2}\Bigg\{\sum_{\tau=k_2}^{k_2+M-1}\prod_{i=0}^{k_2-1}\big(1-\mu_{k^A_m(t-i)}\big)\bigg(\prod_{i=k_2}^{\tau}\big(1-\mu_{k^B_m(t-i)}\big) - \prod_{i=k_2}^{\tau}\big(1-\mu_{k^A_m(t-i)}\big)\bigg)$$
$$+ \sum_{\tau=k_2+M}^{T}\prod_{i=0}^{k_2-1}\big(1-\mu_{k^A_m(t-i)}\big)\bigg(\prod_{i=k_2}^{k_2+M-1}\big(1-\mu_{k^B_m(t-i)}\big) - \prod_{i=k_2}^{k_2+M-1}\big(1-\mu_{k^A_m(t-i)}\big)\bigg)\prod_{i=k_2+M}^{\tau}\big(1-\mu_{k^A_m(t-i)}\big)\Bigg\}. \qquad (10)$$

The first term in (10) is simplified by enumerating the parity of $k_2$.

Case I: if $k_2 \bmod 2 = 0$, the term reduces to a non-negative multiple of

$$(1-\mu_1) + (1-\mu_2) - 2(1-\mu_1)(1-\mu_2) \ge 0. \qquad (11)$$

Case II: if $k_2 \bmod 2 = 1$, the term is lower bounded by a non-negative multiple of

$$(\mu_1 - \mu_2)^2(\mu_1 + \mu_2) \ge 0. \qquad (12)$$

Together, (11) and (12) show that the first term in (10) is non-negative; a similar argument applies to the second term. This gives the desired result for M = 2:

$$\sum_{m=1}^{M}\mathbb{E}\big[a^{P_B}_m\big] - \sum_{m=1}^{M}\mathbb{E}\big[a^{P_A}_m\big] \ge 0. \qquad \blacksquare$$

B. Proof of Theorems 3 and 4
We then generalize the proof of Theorem 2 and use Muirhead's inequality [17] to prove Theorems 3 and 4. The structure of these proofs is similar to the proof above.
Proof of Theorem 3.
Consider a problem instance (M, N, µ) and a system of two coupled policies $P_A$ and $P_B$ at time $T = k_1 + M + k_2$, where $k_1 \ge 0$, $k_2 \ge 0$.
Policy $P_A$: Consider the index set I, a random permutation of the arms in the set S(M). Policy $P_A$ is the round-robin policy that schedules source $m \in [M]$ on the channel $I_{((m+t) \bmod M)+1}$ in time-slot t.

Policy $P_B$: A policy, adhering to Lemmas 1 and 2, that makes the same scheduling decisions as Policy $P_A$ in time-slots $t \in [1, k_1] \cup [k_1+M+1, T]$. In the M-length interval $t \in [k_1+1, k_1+M]$, the channels are scheduled as per a symmetric M-periodic policy, characterized by the sequence $D_B(1), \dots, D_B(M)$.

Both policies are randomized and admit different realizations; the age-of-information accrued is therefore captured by the expectation over all random instances possible within each policy. We need to show that

$$\sum_{m=1}^{M}\mathbb{E}\big[a^{P_B}_m(T)\big] \ge \sum_{m=1}^{M}\mathbb{E}\big[a^{P_A}_m(T)\big].$$

For a source m, using (9) we get

$$\mathbb{E}\big[a^{P_B}_m(T)\big] - \mathbb{E}\big[a^{P_A}_m(T)\big] = \sum_{\tau=k_2}^{k_2+M-1}\mathbb{E}\Bigg[\prod_{i=0}^{k_2-1}\big(1-\mu_{k^A_m(t-i)}\big)\bigg\{\prod_{i=k_2}^{\tau}\big(1-\mu_{k^B_m(t-i)}\big) - \prod_{i=k_2}^{\tau}\big(1-\mu_{k^A_m(t-i)}\big)\bigg\}\Bigg]$$
$$+ \sum_{\tau=k_2+M}^{T}\mathbb{E}\Bigg[\prod_{i=0}^{k_2-1}\big(1-\mu_{k^A_m(t-i)}\big)\prod_{i=k_2+M}^{\tau}\big(1-\mu_{k^A_m(t-i)}\big)\bigg\{\prod_{i=k_2}^{k_2+M-1}\big(1-\mu_{k^B_m(t-i)}\big) - \prod_{i=k_2}^{k_2+M-1}\big(1-\mu_{k^A_m(t-i)}\big)\bigg\}\Bigg]. \qquad (13)$$

Consider each of the terms inside the first summation in (13). By construction, $D_B(\tau)$ majorizes $D_A(\tau)$. The expectation, for each policy, is equivalent to the Muirhead mean (the $[D(\tau)]$-mean) of the positive real numbers $1-\mu_1, 1-\mu_2, \dots, 1-\mu_M$. Thus, by Muirhead's inequality, each of the expectation terms is non-negative. Similarly, the same holds for the expectation terms inside the second summation, resulting in

$$\mathbb{E}\big[a^{P_B}_m(T)\big] - \mathbb{E}\big[a^{P_A}_m(T)\big] \ge 0. \qquad (14)$$

Summing over all sources we get the result:

$$\sum_{m=1}^{M}\mathbb{E}\big[a^{P_B}_m(T)\big] \ge \sum_{m=1}^{M}\mathbb{E}\big[a^{P_A}_m(T)\big]. \qquad \blacksquare$$
Consider a problem instance ( M, N, µ ) ,and a system of two coupled policies P A and P B at time T = k + M + k , where k ≥ , k ≥ and T > M
Policy P A : Consider the index set I which is a randompermutation of the arms in the set S ( M ) . Policy P A is theround-robin policy that schedules source m ∈ [ M ] on thechannel I (( m + t ) mod M )+1 in time-slot t . Policy P B : A policy, adhering to Lemmas 1 and 2, thatmakes the same scheduling decisions as Policy P A in time-slot t ∈ [1 , k ] ∪ [ k + M + 1 , T ] . In the M -length interval t ∈ [ k + 1 , k + M ] , the channels are scheduled uniformlyand randomly in an i.i.d manner. Both the policies are randomized, and constitute differentrealizations. Thus, the age-of-information accrued is capturedby the expectation over all the random instances possiblewithin each policy.In case of policy P B , for an l -length interval, there are M l possible sequences of scheduled arms, and each sequence hasthe same probability. Suppose, X ( l ) represents the set of allpossible sequences of length l ( |X ( l ) | = M l ). Let X s ( l ) represent the set of sequences denoted by D s ( l ) , then, X ( l ) = (cid:91) s X s ( l ) and X si ( l ) ∩ X sj ( l ) = φ for i (cid:54) = j. (15)Here the union extends over all the possible ways of construct-ing the vector D ( l ) . Using (14) and (15), we get, E (cid:2) a P B m ( T ) (cid:3) = (cid:16) (cid:88) s |X s ( T ) | · E (cid:2) a P s m ( T ) (cid:3)(cid:17)(cid:46) |X ( T ) |≥ (cid:16) (cid:88) s |X s ( T ) | · E (cid:2) a P A m ( T ) (cid:3)(cid:17)(cid:46) |X ( T ) | = E (cid:2) a P A m ( T ) (cid:3) · (cid:88) s |X s ( T ) | (cid:46) |X ( T ) | = E (cid:2) a P A m ( T ) (cid:3) . (16)Summing for all sources we get the result, M (cid:88) m =1 E (cid:2) a P B m ( T ) (cid:3) ≥ M (cid:88) m =1 E (cid:2) a P A m ( T ) (cid:3) . (cid:4) C. Proof of Theorem 5
We first upper bound the expected cumulative source-wiseAoI for any schedule by the expected cumulative source-wiseAoI of an alternative schedule in which all uses of sub-optimalchannels in the original schedule are replaced by all sourcesusing (and thereby colliding on) the worst channel. Only oneof these sources, at random, acquires the channel. We furtherupper bound the expected cumulative source-wise AoI withanother schedule where all uses of the worst channel areclustered together starting from T = 1 , followed by the oracleallotments. We can express the resultant upper bound on thesource-wise expected cumulative AoI in terms of the numberof deviations from the oracle allotments and the number oftimes another source acquires the same channel as the source(resulting in a collision). This is effectively done in Lemma 4. Lemma 4.
Let k m ( t ) denote the index of the communicationchannel used in time-slot t by policy P and k ∗ m ( t ) be the indexof the channel used by the Oracle, for source m. Let K m ( T ) = { k m (1) , k m (2) , · · · , k m ( T ) } be the sequence of channels usedin time-slots to T and N ( K m ( T )) = T (cid:88) t =1 k m ( t ) (cid:54) = k ∗ m ( t ) , denote the number of time-slots in which a sub-optimal chan-nel is used by source m and N ( K m ( T )) = T (cid:88) t =1 (cid:88) i (cid:54) = m k i ( t )= k ∗ m ( t ) , enote the number of time-slots the optimal arm (alloted bythe oracle) was used by some other source i (cid:54) = m . Then, underAssumption 1, for a constant c , the regret for source m, R m P ( T ) ≤ Mµ min + M c log Tµ min (1 + E [ N ( K m ( T ))]) , where, N ( K m ( T )) = N ( K m ( T )) + N ( K m ( T )) . The next lemma summarizes the results from Theorem1 in [4] to provide upper bounds on N ( K m ( T )) and N ( K m ( T )) as per the DLF policy. Lemma 5.
Let k m ( t ) denote the index of the communicationchannel used in time-slot t by policy P and k ∗ m ( t ) be theindex of the channel used by the Oracle, for source m. Let E DLF [ N ( K ( T ))] denote the expected number of time-slots inwhich a sub-optimal channel is picked in time-slots 1 to T by DLF. Let E DLF [ N ( K ( T ))] denote the expected number oftime-slots in which the optimal arms was picked by some othersource i (cid:54) = m time-slots 1 to T by DLF. Then, for T > N , E DLF [ N ( K ( T ))] ≤ ( N − (cid:18) T ∆ + 1 + 2 π (cid:19) , E DLF [ N ( K ( T ))] ≤ ( M − (cid:18) T ∆ + 1 + 2 π (cid:19) , where ∆ = min i,j ∈ [ M ]; i>j µ i − µ j . When calculating the overall regret from individual source-wise regrets, (cid:80) Mm =1 R m P , we need to use lemma 4. However,on a closer look, one observes that for a source m and an eventunder N ( K m ( T )) are captured by N ( K l ( T )) for some othersource l . That is, M (cid:88) m =1 N ( K m ( T )) ⊆ M (cid:88) m =1 N ( K m ( T )) . Thus, to avoid over-count, we capture these events only under (cid:80) Mm =1 N ( K m ( T )) . This gives us R P = M (cid:88) m =1 R m P ≤ M (cid:88) m =1 Mµ min + M c log Tµ min (1 + E [ N ( K m ( T ))]) . (17) Remark 1 (Performance of DL-TS) . Lemma 4 provides ageneric result that shows the performance of any policy P .However, in the case of DL-TS, a generic result like Lemma 5is not available for Thompson Sampling. It is not straight-forward to adapt the proof given for Theorem 2 in [18] to ourcase, that is, periodic pulls of the k th best arm. Achieving thesame remains to be an open problem.Proof of Lemma 4. Let µ ∗ = (cid:32) M (cid:88) i =1 µ i (cid:33) (cid:14) M. By definition, P ( a P m ( t ) > τ ) = τ (cid:89) i =0 (cid:0) − µ k m ( t − i ) (cid:1) . Note that since a P m ( t ) ≥ for all t and m, E [ a P m ( t )] = ∞ (cid:88) τ =0 P ( a m ( t ) > τ ) . It follows that, E [ a P m ( t )] = E [ E [ a P m ( t )]] = E (cid:34) ∞ (cid:88) τ =0 P ( a P m ( t ) > τ ) (cid:35) = E (cid:34) ∞ (cid:88) τ =0 τ (cid:89) i =0 (cid:0) − µ k m ( t − i ) (cid:1)(cid:35) . (18)For t ≥ c log T , we define E t as the event that k m ( τ ) = k ∗ m ( τ ) for t − c log T + 1 ≤ τ ≤ t . Then, E (cid:34) ∞ (cid:88) τ =0 τ (cid:89) i =0 (cid:0) − µ k m ( t − i ) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12) E t (cid:35) ≤ c log T (cid:88) i =1 i (cid:89) j =1 (1 − µ ∗ )+ ∞ (cid:88) i = c log T +1 (1 − µ ∗ ) c log T i (cid:89) j = c log T +1 (1 − µ min M ) . For t ≥ M c log T , we define E t as the event that k m ( τ ) = k ∗ m ( τ ) for t − M c log T + 1 ≤ τ ≤ t . Then, E (cid:34) ∞ (cid:88) τ =0 τ (cid:89) i =0 (cid:0) − µ k m ( t − i ) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12) E t (cid:35) ≤ Mc log T (cid:88) i =1 i (cid:89) j =1 (1 − µ k ∗ m ( t ))+ ∞ (cid:88) i = Mc log T +1 (cid:32) M (cid:89) m =1 (1 − µ m ) (cid:33) c log T i (cid:89) j = Mc log T +1 (1 − µ min M ) . Here, the worst case is that all sources collide at the arm withthe lowest transmission probability. Based on our collisionmodel, a source m will acquire this channel with probability /M . Further note that, ∞ (cid:88) i = Mc log T +1 (cid:32) M (cid:89) m =1 (1 − µ m ) (cid:33) c log T i (cid:89) j = c log T +1 (1 − µ min M ) ≤ (cid:32) M (cid:89) m =1 (1 − µ m ) (cid:33) c log T Mµ min = Mµ min T and assign Mc log T (cid:88) i =1 i (cid:89) j =1 (1 − µ k ∗ m ( t )) = φ. t follows that, E (cid:34) ∞ (cid:88) τ =0 τ (cid:89) i =0 (cid:0) − µ k ( t − i ) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12) E t (cid:35) ≤ φ + Mµ min T . 
(19)Moreover, since c µ mk ( t ) ≥ µ min /M , for all t , E (cid:34) ∞ (cid:88) τ =0 τ (cid:89) i =0 (cid:0) − µ k ( t − i ) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12) E ct (cid:35) ≤ Mµ min . (20)Note that E ct = t (cid:91) τ = t − c log T +1 (cid:26) A ( τ ) ∪ B ( τ ) (cid:27) E ct ≤ t (cid:88) τ = t − c log T +1 (cid:18) A ( τ ) + B ( τ ) (cid:19) . (21)From (18), (19), (20), and (21), and set φ ( T − M c log T ) = φ (cid:48) .Notice, constant c scales with M, so define c (cid:48) = − GM ((1 − µ ) , ··· , (1 − µ M ))) , thus, c (cid:48) = M c T (cid:88) t =1 E [ a P m ( t )]= Mc log T (cid:88) t =1 E [ a P m ( t )] + T (cid:88) t = Mc log T +1 E [ a P m ( t )] ≤ M c log Tµ min + φ (cid:48) + M ( T − M c log T ) µ min T + Mµ min E T (cid:88) t = Mc log T +1 E ct (22) ≤ M ( c (cid:48) log T + 1) µ min + φ (cid:48) + Mµ min × T (cid:88) t = c (cid:48) log T E t (cid:88) τ = t − c (cid:48) log T +1 (cid:0) A ( τ ) + B ( τ ) (cid:1) ≤ φ (cid:48) + M ( c (cid:48) log T + 1) µ min + M c (cid:48) log Tµ min E (cid:34) T (cid:88) t =1 (cid:0) A ( t ) + B ( t ) (cid:1)(cid:35) ≤ φ (cid:48) + Mµ min + M ( c (cid:48) log T ) µ min (cid:32) E (cid:34) T (cid:88) t =1 (cid:0) A ( t ) + B ( t ) (cid:1)(cid:35)(cid:33) ≤ φ (cid:48) + Tµ ∗ + Mµ min + M ( c (cid:48) log T ) µ min (cid:18) E (cid:20) T (cid:88) t =1 (cid:0) A ( t ) + (cid:88) i (cid:54) = m B i ( t ) (cid:1)(cid:21)(cid:19) = φ (cid:48) + Tµ ∗ + Mµ min + M ( c (cid:48) log T ) µ min (cid:18) E (cid:2) N ( K m ( T ))+ N ( K m ( T )) (cid:3)(cid:19) = φ (cid:48) + Mµ min + M ( c (cid:48) log T ) µ min (1 + E [ N ( K ( T ))]) . (23) Regret is defined as R m P ( t ) = T (cid:88) t =1 E [ a P m ( t )] − T (cid:88) t =1 E [ a ∗ m ( t )] ≤ T (cid:88) t =1 E [ a P m ( t )] − T (cid:88) t =1+ Mc log T E [ a ∗ m ( t )] ≤ T (cid:88) t =1 E [ a P m ( t )] − T (cid:88) t =1+ Mc log T φ = T (cid:88) t =1 E [ a P m ( t )] − φ (cid:48) = Mµ min + M ( c (cid:48) log T ) µ min (1 + E [ N ( K ( T ))]) . (24) (cid:4) VII. C
ONCLUSION
We model a multi-source multi-channel setting using Age-of-Information bandits with decentralized users. We first characterize the oracle policy, namely the round-robin policy, and demonstrate that the upper bound on the AoI regret of the existing Distributed Learning with Fairness (DLF) policy scales as O(MN log T). However, proving the optimality of the oracle policy in the general setting is still an open problem. We then propose two other AoI-agnostic policies, Distributed Learning-based Thompson Sampling (DL-TS) and the Distributed Learning-based Hybrid policy (DLH), with different trade-offs between exploration and the number of collisions. These policies only utilize past arm-pulls in deciding the next arm. We also present AoI-aware policies that incorporate the current value of AoI into the arm selection. Via simulations, we show that the AoI-aware policies lead to fewer collisions and thus outperform their agnostic counterparts.

ACKNOWLEDGMENT
We thank Shourya Pandey of the Department of Computer Science and Engineering at the Indian Institute of Technology Bombay for mathematical insights in the proofs.

REFERENCES
[1] S. Kaul, M. Gruteser, V. Rai, and J. Kenney, "Minimizing age of information in vehicular networks," in 2011 8th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON). IEEE, 2011, pp. 350–358.
[2] A. Kosta, N. Pappas, and V. Angelakis, "Age of information: A new concept, metric, and tool," Foundations and Trends in Networking, vol. 12, no. 3, pp. 162–259, 2017.
[3] K. Bhandari, S. Fatale, U. Narula, S. Moharir, and M. K. Hanawal, "Age-of-information bandits," in 2020 18th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), 2020, pp. 1–8.
[4] Y. Gai and B. Krishnamachari, "Distributed stochastic online learning policies for opportunistic spectrum access," IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6184–6193, 2014.
[5] W. R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.
[6] B. Sombabu and S. Moharir, "Age-of-information aware scheduling for heterogeneous sources," in Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, 2018, pp. 696–698.
[7] Y.-P. Hsu, E. Modiano, and L. Duan, "Scheduling algorithms for minimizing age of information in wireless broadcast networks with random arrivals: The no-buffer case," arXiv preprint arXiv:1712.07419, 2017.
[8] V. Tripathi and S. Moharir, "Age of information in multi-source systems," in GLOBECOM 2017 - 2017 IEEE Global Communications Conference. IEEE, 2017, pp. 1–6.
[9] V. Tripathi and E. Modiano, "A Whittle index approach to minimizing functions of age of information," in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2019, pp. 1160–1167.
[10] P. R. Jhunjhunwala and S. Moharir, "Age-of-information aware scheduling," in 2018 International Conference on Signal Processing and Communications (SPCOM). IEEE, 2018, pp. 222–226.
[11] I. Kadota, A. Sinha, and E. Modiano, "Optimizing age of information in wireless networks with throughput constraints," in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications. IEEE, 2018, pp. 1844–1852.
[12] K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.
[13] L. Besson and E. Kaufmann, "Multi-player bandits revisited," in Algorithmic Learning Theory, 2018, pp. 56–92.
[14] M. K. Hanawal and S. J. Darak, "Multi-player bandits: A trekking approach," arXiv preprint arXiv:1809.06040, 2018.
[15] E. Boursier and V. Perchet, "SIC-MMAB: Synchronisation involves communication in multiplayer multi-armed bandits," in Advances in Neural Information Processing Systems, 2019, pp. 12071–12080.
[16] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2-3, pp. 235–256, 2002.
[17] P. S. Bullen, Handbook of Means and Their Inequalities. Springer Science & Business Media, 2013, vol. 560.
[18] E. Kaufmann, N. Korda, and R. Munos, "Thompson sampling: An asymptotically optimal finite-time analysis," in Algorithmic Learning Theory. Springer, 2012, pp. 199–213.