Cooperation Speeds Surfing: Use Co-Bandit!

Abstract

In this paper, we explore the benefit of cooperation in adversarial bandit settings. As a motivating example, we consider the problem of wireless network selection. Mobile devices are often required to choose the right network to associate with for optimal performance, which is non-trivial. The excellent theoretical properties of EXP3, a leading multi-armed bandit algorithm, suggest that it should work well for this type of problem. Yet, it performs poorly in practice. A major limitation is its slow rate of stabilization. Bandit-style algorithms perform better when global knowledge is available, i.e., when devices receive feedback about all networks after each selection. But, unfortunately, communicating full information to all devices is expensive. Therefore, we address the question of how much information is adequate to achieve better performance. We propose Co-Bandit, a novel cooperative bandit approach, that allows devices to occasionally share their observations and forward feedback received from neighbors; hence, feedback may be received with a delay. Devices perform network selection based on their own observation and feedback from neighbors. As such, they speed up each other's rate of learning. We prove that Co-Bandit is regret-minimizing and retains the convergence property of multiplicative weight update algorithms with full information. Through simulation, we show that a very small amount of information, even with a delay, is adequate to nudge each other to select the right network and yield significantly faster stabilization at the optimal state (about 630x faster than EXP3).


Anuja Meetoo Appavoo, Seth Gilbert, and Kian-Lee Tan

Department of Computer Science, National University of Singapore
{anuja, seth.gilbert, tankl}@comp.nus.edu.sg

I. INTRODUCTION

Mobile devices often have to select the right network to associate with for optimal performance, which is non-trivial. This is primarily because network availability is transient and the quality of networks changes dynamically due to mobility of devices and environmental factors. The challenge is for devices to make decentralized decisions and yet achieve an optimal allocation, where no device would want to unilaterally change network. Moreover, since the environment is dynamic, devices must be able to seamlessly adapt their decisions to maintain a good network connection. Resource selection problems can be formulated as a repeated congestion game; in each round, a device selects a network and receives some reward, i.e., bandwidth. The multi-armed bandit problem is closely related to repeated multi-player games, where each player independently aims at improving its decision while all players collectively act as an adversary. Multi-armed bandit algorithms have impressive theoretical properties which suggest that they provide an excellent solution to this problem.

EXP3 (Exponential-weight algorithm for Exploration and Exploitation) [6] is one of the leading bandit algorithms. It is fully decentralized, is regret-minimizing, i.e., as time elapses, it performs nearly as well as always selecting the best action in hindsight, and converges to a (weakly stable) Nash equilibrium [23], [38]. However, it performs poorly in practice. One major limitation is that it takes an unacceptable amount of time to stabilize; it took the equivalent of over 14 days in some of our simulations. The availability of global knowledge would yield better performance. But, it requires support from network service providers, which may be infeasible. In this paper, we explore the possibility of cooperation among devices, and consider the tradeoff between the amount of cooperation and performance. We show that a very small amount of information yields massive performance improvement, in terms of rate of stabilization to an optimal state. Several mobile applications to date rely on the cooperation of peers [19], [40], which suggests the feasibility of cooperation for wireless network selection.

We formulate the wireless network selection problem as a repeated congestion game, and model the behaviour of mobile devices using online learning with partial information and delayed feedback in an adversarial setting. We propose Co-Bandit, a novel cooperative bandit algorithm with good theoretical and practical performance. Mobile devices (a) occasionally share their observations (without overloading the network), e.g., they can broadcast bit rates observed from their chosen networks using Bluetooth, (b) forward feedback received over a certain period of time; hence, feedback may be received with a delay, and (c) use feedback received from neighbors to enhance their decisions. We model the underlying communication network as a random directed graph based on the communication pattern in a wireless network setting. Vertices represent mobile devices and a directed edge implies sharing of feedback. Moreover, the topology of the graph changes over time. All source code is available on GitHub (https://github.com/anuja-meetoo/Co-Bandit).

To summarize, the following are our key contributions:
1) We formulate the network selection problem as a repeated congestion game and model the behaviour of devices using online learning with partial information and delayed feedback in an adversarial setting.
2) We propose Co-Bandit, a cooperative bandit approach, that allows devices to share their observations, forward feedback received, and perform network selection based on their observation as well as those received from neighbours, at times with a delay.
3) We prove that Co-Bandit (a) is regret-minimizing and provide an upper bound that highlights the effect of cooperation and delay, and (b) retains the convergence property of multiplicative weight update algorithms with full information.
4) Through simulation, we demonstrate that Co-Bandit (a) stabilizes at the optimal state relatively fast with only a small amount of information, (b) gracefully deals with transient behaviours, and (c) scales with an increase in number of devices and/or networks.

II. WIRELESS NETWORK SELECTION

Here, we define the wireless network selection problem, formulate it as a repeated congestion game, and model the behavior of mobile devices using online learning with partial information and delayed feedback in an adversarial setting.

A. Wireless network selection problem

We consider an environment with multiple wireless devices and heterogeneous wireless networks, such as the one depicted in Figure 1. The latter illustrates three service areas, namely a food court, a study area and a bus stop (shaded regions labelled A, B and C, respectively). It shows five wireless networks, numbered 1 to 5, whose area of coverage is delimited by dotted lines. Mobile devices have access to different sets of wireless networks depending on their location. The aim of each device is to quickly identify and associate with the "optimal" wireless network, which may vary over time.

Fig. 1: Service areas with heterogeneous wireless networks. (Service areas: A, food court; B, study area; C, bus stop. Networks 1 to 5. Legend: cellular network, IEEE 802.11 WLAN, mobile device.)

To perform an optimal network selection, a device must be aware of the bit rate it can achieve from each network at that time. This is affected by a number of factors, namely (a) the total bandwidth of each network, (b) the distance between the device and the access points (APs) and interference in the environment, and (c) the number of devices associated with each network. A device can learn about a network's bit rate by (a) exploring the network, or (b) estimating it based on feedback received from neighbors who have explored the network. Each time a device switches network, it incurs a cost, which we assume is measured in terms of delay. However, we do not particularly focus on minimizing switching cost in this work, although once the algorithm stabilizes, the frequency of switching networks is expected to be low.

B. Formulation of wireless network selection game

Mobile devices operate in a dynamic environment, and hence require continuous exploration and adaptation of strategy. Therefore, we formulate the wireless network selection problem as a repeated resource selection game, a special type of congestion game [34]. We assume that time is slotted and a network selection is made at every time slot.

Since the quality of a network degrades proportionally to its number of associated clients, other mobile devices accessing shared networks can be regarded as adversaries. We consider a setting where devices occasionally share their observations and forward feedback received from others. As such, observations may be received with a delay. We assume that devices are honest and do not lie about their observations. Given the need for sequential decision making, it seems natural to model the behavior of devices using online learning with partial information and delayed feedback in an adversarial setting.

We formally define the wireless network selection game as a tuple

$\Gamma = \langle \mathcal{N}, \mathcal{K}, (S_j)_{j \in \mathcal{N}}, (U_i^t)_{i \in \mathcal{K}} \rangle$, where
1) $\mathcal{N} = \{1, \cdots, n\}$ is the finite set of $n$ active (honest) mobile devices indexed by $j$.
2) $\mathcal{K} = \{1, \cdots, k\}$ denotes the finite set of $k$ wireless networks available in the service area.
3) $S_j \subseteq \mathcal{K}_j$ is the strategy set of mobile device $j$, where $\mathcal{K}_j \subseteq \mathcal{K}$ is the set of networks available to $j$.
4) The gain (payoff or utility) $g_{i,j}(t)$ of device $j$ from network $i$ at time $t$, scaled to [0, 1], is expressed by a function $U_i^t$ of the number of devices $n_i(t)$ associated with $i$ as follows:
$$n_i(t) = |\{ j' \in \mathcal{N} : i_{j'}(t) = i \}|$$
where $i_{j'}(t)$ is the network selected by $j'$ at time $t$, and
$$g_{i,j}(t) = \begin{cases} U_i^t(n_i(t)), & \text{if } i = i_j(t) \\ U_i^t(n_i(t) + 1), & \text{otherwise.} \end{cases}$$
If network $i$ has not been explored by device $j$, $g_{i,j}(t)$ may be estimated from gains shared by other devices. Because a device's gain drives its strategy, the gain ignores switching cost, so that networks with high gain but high switching cost are not penalized.
5) The perceived loss $l_{i,j}(t - \tilde{t}, t)$ of network $i$ for device $j$ at time $t - \tilde{t}$ is the difference between the highest bit rate available for $j$ at $t - \tilde{t}$ and the bit rate network $i$ offered, based on what is known by time $t$:
$$l_{i,j}(t - \tilde{t}, t) = \begin{cases} 0, & \text{if } g_{i,j}(t - \tilde{t}) \text{ is unknown at time } t, \\ \max_{m \in \mathcal{K}} \{ g_{m,j}(t - \tilde{t}) \} - g_{i,j}(t - \tilde{t}), & \text{otherwise.} \end{cases}$$
6) The cumulative download of a device $j$ is given by
$$\sum_{t=1}^{T} U^t_{i_j(t)}(n_{i_j(t)}(t)) \cdot (\textit{slot duration} - \textit{switch delay})$$
where switch delay is the switching cost (switch delay is zero when the device stays in the same network), slot duration (higher than switch delay) is the length of a time slot, and $T$ is the time horizon. For simplicity, we ignore overhead data (packet header) and retransmissions.
7) A strategy profile is given by $S = S_1 \times \cdots \times S_n$. It is at Nash equilibrium [31] if $g_{i_j}(S) \geq g_{i_j}(S_{-j}, S'_j)$ for every $S'_j$ and every $j \in \mathcal{N}$, where $(S_{-j}, S'_j)$ implies that only device $j$ changes its strategy. Hence, no device wants to unilaterally change its strategy.

The goal of each device is to maximize its cumulative download over time. From a global perspective, we want each mobile device to identify (as quickly as possible) and select the right network with sufficiently high probability and spend most of the time in it. In other words, we want the algorithm to spend the maximum amount of time at Nash equilibrium.

III. CO-BANDIT

In this section, we develop Co-Bandit, a novel cooperative bandit algorithm that approximates the EWA (Exponentially Weighted Average) algorithm without communicating full information. Co-Bandit is a distributed algorithm and each device $j \in \mathcal{N}$ runs an instance of it. However, the strategy of a device affects those of other devices that have a common set of available networks by affecting their gains. In addition, devices occasionally share their observations to help their neighbors learn faster; instead of each device exploring every network, they cooperatively explore them and share their observations.

We briefly explain EWA, see, e.g., [9], and EXP3 [6]. EWA assumes the availability of global knowledge (full information), i.e., a device receives feedback about all networks. On the other hand, EXP3 assumes a bandit setting where a device only receives feedback about its chosen network.

EWA.

It maintains a weight for each network, which represents the confidence that the network is a good choice. It starts by assuming uniform weight over all networks. A network's weight is affected by its loss; a lower loss yields a higher weight. Hence, the "best" network will eventually have the highest weight. EWA assumes that time is slotted. At the beginning of every time slot, a device randomly selects a network to associate with during the whole time slot, from a probability distribution which is based on the weights. It observes a bit rate (gain) from the chosen network during the time slot. At the end of the slot, it receives feedback about the gain it could obtain from all other networks. It computes the loss of all networks and updates their weights using a multiplicative update rule. As such, it improves its selection over time.
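The EWA steps just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy gains and the value of `eta` are hypothetical, the environment is static, and weights are divided by their maximum after each update (as Algorithm 1 does for Co-Bandit) purely to keep them in (0, 1].

```python
import math
import random


def ewa_probabilities(weights):
    """Selection probabilities proportional to the current weights."""
    total = sum(weights)
    return [w / total for w in weights]


def losses_from_gains(gains):
    """Loss of each network: best gain in the slot minus its own gain."""
    best = max(gains)
    return [best - g for g in gains]


def ewa_update(weights, losses, eta):
    """Multiplicative update; dividing by the maximum keeps weights in (0, 1]."""
    updated = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    top = max(updated)
    return [w / top for w in updated]


def run_ewa(gains, eta, num_slots, seed=0):
    """Run EWA against fixed gains (a static toy environment, an assumption)."""
    rng = random.Random(seed)
    weights = [1.0] * len(gains)  # uniform initial weights
    for _ in range(num_slots):
        probs = ewa_probabilities(weights)
        # Network used during this slot; under full information the update
        # below does not depend on which network was chosen.
        chosen = rng.choices(range(len(gains)), weights=probs)[0]
        weights = ewa_update(weights, losses_from_gains(gains), eta)
    return ewa_probabilities(weights)


final_probs = run_ewa(gains=[0.2, 0.9, 0.5], eta=0.5, num_slots=50)
```

Because every network's loss is observed in every slot, the probability mass concentrates on the network with the highest gain after a few dozen slots in this toy setting.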

EXP3.

Much like EWA, it assigns a weight to each network, and initially assumes uniform weight over all networks. A network's weight is affected by its gain; a higher gain yields a higher weight. Therefore, the "best" network will eventually have the highest weight. It assumes that time is slotted. In each time slot, it selects a network at random based on a probability distribution that mixes between using the weights and a uniform distribution; the latter ensures that EXP3 keeps exploring occasionally and discovers a better network that was previously "bad". The device observes a gain (bit rate) from its chosen network. At the end of the time slot, it updates the weight of the chosen network using a multiplicative update rule.
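For comparison, the EXP3 round described above can be sketched as follows. This follows the standard textbook form of EXP3 (mixing the weight-based distribution with a uniform one, and an importance-weighted gain estimate for the chosen network); the function names, the fixed `gamma`, and the toy gains are assumptions made for illustration only.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the weight-based distribution with a uniform one (exploration)."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, probs, chosen, gain, gamma):
    """Update only the chosen network, using an importance-weighted gain."""
    k = len(weights)
    estimated_gain = gain / probs[chosen]  # unbiased estimate of the gain
    updated = list(weights)
    updated[chosen] *= math.exp(gamma * estimated_gain / k)
    return updated


# Toy run against fixed gains in [0, 1] (hypothetical values).
rng = random.Random(0)
gains = [0.2, 0.9, 0.5]
gamma = 0.1
weights = [1.0] * len(gains)
for _ in range(500):
    probs = exp3_probabilities(weights, gamma)
    chosen = rng.choices(range(len(gains)), weights=probs)[0]
    # Bandit feedback: only the gain of the chosen network is observed.
    weights = exp3_update(weights, probs, chosen, gains[chosen], gamma)
```

Only one weight changes per slot, which is one way to see why EXP3 needs many more slots than EWA before the distribution settles.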

Difference of Co-Bandit compared to EWA and EXP3.

Co-Bandit differs from EWA and EXP3 by allowing devices to share their observations with their neighbors. Hence, a device may receive feedback about more than one network, but not necessarily all of them. In addition, it handles feedback received with a delay. We consider a spectrum of settings that lie in between the full information and bandit settings. EWA and EXP3 are applied at the two extremes of the spectrum. As the amount of cooperation increases, the performance of Co-Bandit is expected to be close to that of EWA.

We now describe the Co-Bandit algorithm.

Algorithm description.

Algorithm 1 outlines the major steps in Co-Bandit, excluding the part on how to handle a change in the set of available networks. See Table I for notations.

TABLE I: Notations used to describe Co-Bandit (subscript $j$ implies that it is specific to a device $j$; $t$ refers to the current time slot and, when present, indicates that the value is relevant for time slot $t$)

| Notation | Description |
|---|---|
| $\mathcal{K}_j$ | Set of networks available to $j$. |
| $k_j$ | No. of networks available to $j$, i.e., $\lvert \mathcal{K}_j \rvert$. |
| $n$ | No. of active mobile devices, i.e., $\lvert \mathcal{N} \rvert$. |
| $\eta$ | Learning rate. |
| $w_{i,j}(t)$ | Confidence that network $i$ is a good choice. |
| $p_{i,j}(t)$ | Probability for choosing network $i$. |
| $i_j(t)$ | Network chosen for time slot $t$. |
| $g_{i,j}(t) \in [0, 1]$ | Gain from network $i$. |
| $l_{i,j}(t - \tilde{t}, t)$ | Current perceived loss of $j$ from $i$ at $t - \tilde{t}$. |
| $\hat{l}_{i,j}(t)$ | Loss estimate of network $i$. |
| $I_{i,j}(t - \tilde{t}, t)$ | Indicator function that $i$ was chosen at $t - \tilde{t}$. |
| $q_{i,j}(t - \tilde{t}, t)$ | Probability that $I_{i,j}(t - \tilde{t}, t) = 1$. |
| $d$ | Observations up to $d$ slots old are valid. |
| $p_t$ | Probability of sharing. |
| $p_l$ | Probability of listening for messages. |
| $\mathcal{H}_j(t - \tilde{t}, t)$ | Devices whose gain for $t - \tilde{t}$ are known. |
| $x$ | No. of slots a network can be unheard of. |
| $unheard_j$ | Networks unheard of since time slot $t - x$. |

Co-Bandit assumes that time is slotted. Much like EWA and EXP3, it maintains a weight for each network (initially uniform over all networks). A network's weight is affected by its loss, i.e., the difference between the highest gain (bit rate) the device could observe during that particular time slot and the bit rate the network had to offer. At the beginning of every time slot, if a device has not learned about some network for a long time, it explores it with some probability. Otherwise, it randomly selects a network from a probability distribution which is based on the weights. It associates with the chosen network for the whole time slot, from which it observes some gain. It may decide to share its observation with its neighbors and listen to broadcast messages. As such, it may receive feedback about multiple networks. At the end of the time slot, it updates the weights of all networks; weights of those whose quality is unknown remain unchanged. The same multiplicative weight update and probability update rules as for EWA are used, see, e.g., [9], while the loss estimate rule is an adaptation from [8].

We now further explain the novel aspects of Co-Bandit.
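The end-of-slot update described above, with the loss estimate and weight update of Algorithm 1, can be sketched as follows. This is an illustrative sketch only: the function names and the tuple layout `(loss, q, heard)` are hypothetical, and the per-slot losses and selection probabilities are assumed to have already been extracted from received messages.

```python
import math


def hearing_probability(selection_probs):
    """q = 1 - prod(1 - p) over the devices heard from for that slot.

    selection_probs holds, for one network, the probability with which each
    heard device would have selected it in that slot.
    """
    miss = 1.0
    for p in selection_probs:
        miss *= 1.0 - p
    return 1.0 - miss


def loss_estimate(slots):
    """Average of importance-weighted losses over the last d' + 1 slots.

    slots is a list of (loss, q, heard) tuples, one per slot; 'heard' plays
    the role of the indicator I, so slots without feedback contribute zero.
    """
    total = 0.0
    for loss, q, heard in slots:
        if heard and q > 0.0:
            total += loss / q
    return total / len(slots)


def update_weights(weights, loss_estimates, eta):
    """Multiplicative update, normalised by the largest updated weight."""
    updated = [w * math.exp(-eta * est)
               for w, est in zip(weights, loss_estimates)]
    top = max(updated)
    return [w / top for w in updated]


# Two heard devices each selected the network with probability 0.5.
q = hearing_probability([0.5, 0.5])
# Feedback for the current slot only; nothing known for the previous slot.
estimate = loss_estimate([(0.3, q, True), (0.0, 0.5, False)])
new_weights = update_weights([1.0, 1.0], [estimate, 0.0], eta=1.0)
```

A network with no feedback gets a zero loss estimate, so its weight is left unchanged before normalisation, which matches the statement that weights of networks of unknown quality remain unchanged.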

Cooperation.

In contrast to a bandit setting, where devices make decisions based solely on their own observations, we consider a setting in which devices cooperate. In every time slot, a device observes a gain (bit rate) from its network. It broadcasts its observation with probability $p_t$. Otherwise, it listens for broadcasts with a probability $p_l$. Hence, the underlying communication network is a random directed graph, where the set of vertices denotes the mobile devices. A directed edge from device $j$ to device $j'$ implies that $j$ broadcasts its observation and $j'$ listens to and receives the broadcast message. The topology of the graph differs across time slots and depends on the random decisions taken by the devices at different times. Furthermore, it is not known to devices. Cooperation enables devices to leverage feedback received from neighbors to enhance their decisions and speed up their learning rate.

A device's broadcast message includes (a) a timestamp, (b) the device's ID, (c) the network selected, (d) the bit rate observed, (e) an estimate of the number of devices associated with the network selected, (f) the set of networks available, and (g) the device's probability distribution. The timestamp and device ID are used to filter out duplicate messages as they are forwarded (as described next). A device computes its bit rate from a network at time $t$, which it did not explore at that time, based on the bit rate(s) and number of clients of the network reported by neighbors. The probability distribution is used to compute the loss estimate. The set of available networks is useful when devices observe common networks but not necessarily the same set of networks; devices can relate the probability distribution to a set of networks.

Delayed feedback.

Devices not only share their current observation, but everything they have learned during the last $d + 1$ time slots. This includes their own observations as well as feedback received from neighbors. Forwarding messages ensures that they reach more devices; a device might have missed a message earlier as it was not listening, because of packet loss, or due to its physical distance from the sender. It also makes it possible for messages to reach many devices while assuming small transmission and listening probabilities. The order in which messages are received varies and depends on the topology of the random communication graph over the last $d + 1$ time slots. An observation made during time slot $t - d$ can be received any time between $t - d$ and $t$. Hence, the estimate of the loss of a network at time $t - d$ can vary over the last $d + 1$ time slots as new feedback arrives. Observations more than $d$ time slots old are considered stale and are dropped.

Algorithm 1: Co-Bandit
explore_unheard() determines whether a device must explore a network unheard of for more than $x$ time slots; it returns True with probability $|unheard_j| / n$; a device explores a network unheard of with probability $1/n$.
Input: $k_j \in \mathbb{Z}_{>0}$, real $\eta > 0$, $p_t \in [0, 1]$, $p_l \in [0, 1]$, $d \in \mathbb{Z}_{\geq 0}$, $x \in \mathbb{Z}_{\geq 0}$
Initialize: $w_{i,j}(1) \leftarrow 1$ for $i = 1, \cdots, k_j$; $unheard_j \leftarrow \emptyset$
foreach time slot $t = 1, 2, \cdots$ do
    $p_{i,j}(t) \leftarrow \frac{w_{i,j}(t)}{\sum_{m=1}^{k_j} w_{m,j}(t)}$ for $i = 1, \cdots, k_j$
    if $unheard_j \neq \emptyset$ and explore_unheard() then
        $i_j(t) \leftarrow$ random (uniform) from $unheard_j$
    else
        $i_j(t) \leftarrow$ random from distribution $p_j(t)$
    % associate with network $i_j(t)$
    $g_{i_j}(t) \leftarrow$ gain observed, where $g_{i_j}(t) \in [0, 1]$
    with probability $p_t$, broadcast observations and messages received
    % else, listen with probability $p_l$
    % update set $unheard_j$
    foreach network $i = 1, \cdots, k_j$ do
        $\hat{l}_{i,j}(t) \leftarrow \frac{1}{d' + 1} \sum_{\tilde{t}=0}^{d'} \frac{l_{i,j}(t - \tilde{t}, t)}{q_{i,j}(t - \tilde{t}, t)} I_{i,j}(t - \tilde{t}, t)$
            where $d' \leftarrow \min\{d, t - 1\}$;
            $I_{i,j}(t - \tilde{t}, t) \leftarrow \mathbb{I}\{\exists j' \in \mathcal{H}_j(t - \tilde{t}, t) : i_{j'}(t - \tilde{t}) = i\}$;
            $q_{i,j}(t - \tilde{t}, t) \leftarrow 1 - \prod_{j' \in \mathcal{H}_j(t - \tilde{t}, t)} (1 - p_{i,j'}(t - \tilde{t}))$
        $w_{i,j}(t + 1) \leftarrow \frac{w_{i,j}(t) \exp(-\eta \hat{l}_{i,j}(t))}{\max_{m \in \mathcal{K}_j} \{ w_{m,j}(t) \exp(-\eta \hat{l}_{m,j}(t)) \}}$

Explicit exploration.

Given that Co-Bandit starts by assuming uniform weight over all networks, all devices may end up perceiving a network with significantly low bandwidth (relative to other networks) as being "bad". This may result in no one selecting it. In addition, the quality of a network that was initially "bad" may improve over time. However, at that time the probability for a device to select it might be too small. To cater for these cases, devices constantly keep track of whether they are learning about all the networks they have access to. They explore those unheard of for a considerable amount of time with probability $1/n$; we do not want all devices to explore such a network, because if they all do, they will most likely observe a low gain. The time must be long enough such that if a device is associated to a network, devices will learn about it. The device exploring a network unheard of broadcasts its observation with probability 1 so that everyone learns about the network. This feature of the algorithm also implies an additional cost in a setting where a network is actually too "bad" for anyone to select it.

Change in set of networks.

When a device discovers a new network, e.g., while the mobile user moves around, its weight is set to 1, i.e., the maximum. This ensures that the device will most likely explore the network. Furthermore, when a network with sufficiently high probability of being selected is no longer available, the weights of all networks are reset. These rules allow the algorithm to quickly adapt to changes.

IV. THEORETICAL ANALYSIS OF CO-BANDIT

Here, we give an upper bound on the regret of a device using Co-Bandit, and show that Co-Bandit retains the convergence property of multiplicative weight update algorithms with full information.

For the purpose of the analysis, we assume that (a) $\mathcal{K}_j = \mathcal{K}$ for every $j \in \mathcal{N}$, i.e., all devices have the same set of networks available to them, (b) the environment is static, and (c) all devices can hear each other, i.e., all devices listening at a particular time will hear those broadcasting at that time.

Regret bound.

Weak regret refers to the difference between the loss incurred by always selecting the best network in hindsight and that of Co-Bandit. We follow the proof for the upper bound on regret for EXP3 [6] and proofs given in [8], [1], and show that Co-Bandit is regret-minimizing.

Let $L_{\text{Co-Bandit}}(T)$ denote the cumulative loss of Co-Bandit at $T$, $L_{\min}(T)$ be the cumulative loss at $T$ when always choosing the best network in hindsight, $d$ be the maximum delay with which feedback may be received, and $b$ be the probability of a device directly hearing from another one.

Theorem 1:

For any $k > 0$, any maximum delay $d \in \mathbb{Z}_{\geq 0}$ with which feedback can be received, learning rate $\eta = \sqrt{\frac{\max\{1/k,\, b\} \ln k}{e (d+1) T}}$, number of time slots a network can be unheard of $x = d$, probability of directly hearing from a neighbor $b \in [0, 1 - e^{-1}]$, stopping time $T > (d+1) k \ln k$, and any assignment of rewards, the expected weak regret is upper bounded as:

$$\mathbb{E}[L_{\text{Co-Bandit}}(T) - L_{\min}(T)] \leq e \sqrt{\frac{(d+1) \ln k}{T \max\{1/k,\, b\}}} + d$$

Hence, Co-Bandit is regret-minimizing as its weak regret tends to zero. As time elapses, it performs nearly as well as always selecting the best network in hindsight.

Suppose that we assume $d = 0$, i.e., feedback is received without any delay. When $b = 0$, i.e., no one ever shares its observation, we have a weak regret of order $\sqrt{\frac{k \ln k}{T}}$ as in the bandit setting [6]. On the other hand, when $b = 1$, i.e., each device always shares its observation, we have a weak regret of order $\sqrt{\frac{\ln k}{T}}$ as in the full information setting, see, e.g., [9]. Hence, we can interpolate between the full information and bandit settings, which are special cases of the spectrum of settings considered.

Assuming that $d > 0$, the higher the delay, the later in time devices may learn about a network's current quality. Hence, the higher the regret bound. However, although not reflected in the regret bound, the higher the value of $d$, the higher the probability of receiving feedback that is being forwarded. Thus, the better the performance of Co-Bandit in practice.

The formal proof is provided in Appendix A.

Convergence.

Strategies in the support of the mixed strategy $\delta_j$ of player $j$ are those played with a non-zero probability [31]. Weakly stable equilibria [23] are defined as mixed Nash equilibria $(\delta_1, \cdots, \delta_n)$ with the additional property that each player $j$ remains indifferent between the strategies in the support of $\delta_j$ when any other single player $j'$ changes to a pure strategy in the support of $\delta_{j'}$; however, each strategy in the support of $\delta_j$ may not remain a best response and device $j$ may prefer a strategy outside the support of $\delta_j$.

We prove that Co-Bandit retains the convergence property of the multiplicative-weights learning algorithm in a full information setting [23] and of EXP3 [38].

Theorem 2:

When $\eta$ is arbitrarily small and the number of devices $n$ tends to infinity (frequency of exploring network(s) unheard of tends to zero), the strategy profile of all devices using Co-Bandit converges to a weakly stable equilibrium; weakly stable equilibria are pure Nash equilibria with probability 1 when the bit rate of each network is chosen at random independently [23].

Hence, when all devices leverage Co-Bandit, they end up being optimally distributed across networks. No one will observe higher gain by unilaterally switching network. We show that the dynamics of the probability distribution over the set of networks is given by the following replicator equation:

$$\xi_{i,j} = p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} (l_{m,j} - q\, l_{i,j})$$

In the full information setting, the probability $q$ of hearing about network $i$ is equal to 1. In that case, we get a replicator dynamic identical to the one in [23]. With a drop in the value of $q$, the value of $\xi_{i,j}$ will rise, implying a slower convergence. Furthermore, in the bandit setting, $b = 0$ and $q = p_{i,j}$. Given our definition of loss, $\xi_{i,j}$ will be zero all the time. Hence, the algorithm never converges to the right state. As such, we interpolate between the full information and bandit settings.

The formal proof is provided in Appendix B.

V. IMPLEMENTATION DETAILS

We evaluate Co-Bandit and compare its performance against those of EWA, see, e.g., [9], and EXP3 [6], through simulation using synthetic data. All algorithms are implemented in Python, using SimPy [35]. We discuss the implementation of Co-Bandit and specify the parameter values chosen.

Learning rate.

In our implementation of Co-Bandit, learning rate $\eta = 10$. In the theoretical analysis, we assume a small value for $\eta$. However, in practice, we observe that Co-Bandit takes too long to stabilize at the optimal state when $\eta$ is very small; but, assuming a very high learning rate makes Co-Bandit too "aggressive" and it stabilizes at a sub-optimal state.

Cooperation.

Unless specified otherwise, each device shares its knowledge (transmits) with probability $p_t = 1/n$. Otherwise, it listens for broadcast messages with probability $p_l = 1/3$, i.e., a device listens for feedback messages once every three time slots. We assume here that a device can estimate the number of devices associated to its network, e.g., from feedback received over time or by scanning for ARP messages [3] for WiFi.

Delayed message.

We assume $d = 5$, i.e., messages up to 5 time slots old are considered valid. While increasing the value of $d$ raises the likelihood of devices receiving feedback, it also implies more data has to be stored and broadcast. In addition, in a dynamic wireless network setting, old observations may be more misleading than useful. Devices drop duplicate messages received as they are forwarded.

Gain.

Although it is not a prerequisite for Co-Bandit, for simplicity, we assume that a network's bandwidth is equally shared among its clients. While the gain of $i_j(t)$ is the scaled bit rate device $j$ observes from its chosen network at time $t$, the gain of a network $i \in \mathcal{K}_j - \{i_j(t)\}$ is estimated from feedback $j$ receives from its neighbors. We compute the unscaled gain of network $i$ as $\frac{1}{n_i(t) + 1} \sum_{j' \in \mathcal{H}_j : i_{j'}(t) = i} g_{i,j'}(t)$.

Switching cost.

We model delay (switching cost) using Johnson's SU distribution for WiFi and Student's t-distribution for cellular, each identified as a best fit [15] to 500 delay values collected from real-world experiments.

Exploration.

Devices explore networks unheard of for 32 time slots or more (i.e., $x = 32$, or 8 simulated minutes) with probability $1/n$; we assume $n$ can be estimated. While we assume $x = d$ in the theoretical analysis, in practice we assume a higher value for $x$, as we want to limit the frequency of exploring networks, which may cause the algorithm to spend most of the time in a sub-optimal state. In addition, $x \geq k_j$ since we hear from one device (on average) in every time slot. However, if $x$ is too big, the rate of adaptation drops.

Parameter choice for EWA and EXP3.

For EWA, we assume the same learning rate as for Co-Bandit, i.e., $\eta = 10$. We assume that γ = t − for EXP3 [26].

VI. EVALUATION IN STATIC SETTINGS

In this section, we evaluate Co-Bandit in static settings, i.e., where the number of mobile devices in the service area, and the number and quality of wireless networks remain constant. We compare its performance to those of EWA and EXP3. We show that (a) as the amount of cooperation increases, its performance approaches that of EWA in a full information setting, (b) delayed feedback ensures that information reaches more devices, enhancing performance of the algorithm, (c) it stabilizes at Nash equilibrium relatively fast (comparable to that of EWA), (d) it far outperforms EXP3 in terms of rate of stabilization and per device cumulative download, and (e) it scales with an increase in number of devices and number of wireless networks in the service area.

In our prior work, we proposed Smart EXP3 [2], a bandit algorithm with far better practical performance than EXP3. However, we exclude comparison of the performance of Co-Bandit to that of Smart EXP3 as the latter has several features that we can add to Co-Bandit to improve its performance.

Setup.

We observe very good performance with few networks; it gets more challenging when we increase the number of networks. Thus, we consider settings with 20 devices and 5 networks; in general, most places will not have more networks. We assume non-uniform data rates 18, 8, 13, 16 and 10 Mbps, a factor close to the theoretical data rates of IEEE 802.11 standards [16] and cellular networks [18] that yields a unique Nash equilibrium, unless specified otherwise. Although it is not a pre-requirement for the algorithm, we assume that (a) mobile devices are time-synchronized, and (b) all devices can hear each other. Results involve data from 100 runs of 5 (simulated) hours each, i.e., 1200 time slots, unless specified otherwise.
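To make the setup concrete, the sketch below computes an equilibrium allocation of devices to networks under the equal-sharing assumption. It is an illustration only: the greedy construction and the function names are ours, not part of Co-Bandit, and the result is checked with `is_nash` rather than taken for granted. The last line derives the slot length implied by 5 simulated hours over 1200 time slots.

```python
def nash_allocation(rates, num_devices):
    """Greedily place devices one at a time on the network offering the best equal share."""
    counts = [0] * len(rates)
    for _ in range(num_devices):
        best = max(range(len(rates)), key=lambda i: rates[i] / (counts[i] + 1))
        counts[best] += 1
    return counts


def is_nash(rates, counts):
    """Return True if no device can increase its share by unilaterally switching network."""
    for i, n_i in enumerate(counts):
        if n_i == 0:
            continue  # nobody is on this network, so nobody can leave it
        current_share = rates[i] / n_i
        for k in range(len(rates)):
            if k != i and rates[k] / (counts[k] + 1) > current_share:
                return False
    return True


rates_mbps = [18, 8, 13, 16, 10]          # data rates of the default setup
allocation = nash_allocation(rates_mbps, 20)
shares = [r / n for r, n in zip(rates_mbps, allocation)]
slot_seconds = 5 * 3600 / 1200            # length of one time slot in seconds
```

For the default data rates the greedy pass yields the per-network device counts in `allocation`, and `shares` gives the bit rate each device on a network would observe.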

Evaluation criteria.

We evaluate the performance of each algorithm based on (a) the state at which it stabilizes and how “bad” this is compared to Nash equilibrium; we use the notions of stability and distance to Nash equilibrium from [2] (Definitions 2 and 3), (b) the time it takes to stabilize, (c) cumulative download of the devices, and (d) scalability.

An algorithm is said to have reached a stable state when each device selects a particular network with probability at least 0.75 until the end (at least over the last 10 time slots). EWA is not inherently stable (as per our definition of stability). For example, when a device observes the same gain from two networks, it selects them with equal probability.

Distance to Nash equilibrium is the maximum percentage higher gain any device would have observed if the algorithm was at Nash equilibrium, compared to its current gain. If no device can achieve more than $\epsilon$ percent increase in gain by unilaterally deviating from its strategy, then the algorithm is at $\epsilon$-equilibrium [31].

Effect of cooperation.

We study the effect of cooperation by varying the probability of sharing, i.e., the value of $p_t$. Here, we ignore delayed feedback, i.e., $d = 0$, and assume that devices always listen, even while transmitting. Figure 2 shows that the performance of Co-Bandit improves as the value of $p_t$ increases, as expected (shown by a lower distance to Nash equilibrium). The distance also drops when $p_t = 0$ as devices were still sharing their observation when exploring a network unheard of, yielding a better performance than that of EXP3 (discussed later) even with a small amount of information occasionally. When devices never share anything (“No sharing”), they only observe a gain from their chosen network. Hence, given our update rules, their probability distribution remains uniform, and the distance never drops.
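The stability criterion and the $\epsilon$-equilibrium test given under the evaluation criteria can be sketched as follows. This is a hedged illustration under the equal-sharing assumption: the function names and the representation of a state as per-network device counts are ours, and `max_deviation_gain_percent` measures the gain from a unilateral deviation, which is the quantity used in the $\epsilon$-equilibrium definition rather than the distance to Nash equilibrium itself.

```python
def is_stable(prob_history, threshold=0.75, window=10):
    """Check that one network keeps probability >= threshold over the last `window` slots.

    prob_history is a list, one entry per time slot, of a device's probability vector.
    """
    if len(prob_history) < window:
        return False
    recent = prob_history[-window:]
    favourite = max(range(len(recent[0])), key=lambda i: recent[0][i])
    return all(probs[favourite] >= threshold for probs in recent)


def max_deviation_gain_percent(rates, counts):
    """Largest percentage gain any device could obtain by switching network on its own."""
    worst = 0.0
    for i, n_i in enumerate(counts):
        if n_i == 0:
            continue
        current = rates[i] / n_i
        for k in range(len(rates)):
            if k == i:
                continue
            alternative = rates[k] / (counts[k] + 1)
            worst = max(worst, 100.0 * (alternative - current) / current)
    return worst


def is_epsilon_equilibrium(rates, counts, eps_percent):
    """True if no unilateral deviation improves a device's gain by more than eps_percent."""
    return max_deviation_gain_percent(rates, counts) <= eps_percent


rates_mbps = [18, 8, 13, 16, 10]
history = [[0.05, 0.8, 0.05, 0.05, 0.05]] * 12   # a device that has settled on network 2
settled = is_stable(history)
off_by_one = max_deviation_gain_percent(rates_mbps, [5, 3, 4, 5, 3])
```

Here `off_by_one` is the deviation gain in a state where one device would need to move from the second network to the first.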

[Figure 2 plot: y-axis “% higher gain a device can observe”; legend includes No sharing, $p_t = 0$, $p_t = 0.25$, $p_t = 0.5$ and $p_t = 1$.]

Fig. 2: Tradeoff between the amount of communication and average distance to Nash equilibrium of Co-Bandit (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium); shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

All the runs stabilized at Nash equilibrium when $p_t \geq$ . (considering the values of $p_t$ in Figure 2). When $p_t = 0$, 8 runs were stable at Nash equilibrium, 8 runs did not stabilize, and the other runs stabilized at a state that requires a maximum of 2 devices to switch network to reach Nash equilibrium. Table II shows the time Co-Bandit takes to stabilize in terms of median number of time slots, given the amount of cooperation. As expected, the time to stabilize decreases as the amount of cooperation rises (denoted by an increase in the value of $p_t$).

TABLE II: Effect of cooperation on the time Co-Bandit takes to stabilize (whether at Nash equilibrium or some other state). Column headings: amount of cooperation, from $p_t = 0$ to $p_t = 1$.

Effect of delayed feedback.

We want Co-Bandit to quickly stabilize at Nash equilibrium with minimal communication; a device should not spend all the time broadcasting or listening for feedback. We assume $p_t = \frac{1}{n}$, i.e., (on average) one device communicates in every time slot. We ignore delayed feedback, i.e., set $d = 0$, and vary the probability with which a device listens for feedback. If a device is not broadcasting, it listens with probability $p_l$. Figure 3 shows that as $p_l$ decreases, the performance drops (distance to Nash equilibrium increases).

We leverage delayed feedback, where devices communicate everything they have learned since time slot $t - d$, to ensure that feedback reaches more devices, even if they transmit and listen with a small probability. We fix the probability with which devices listen and evaluate the effect of delayed feedback. We observe, from Figures 4 and 5, that Co-Bandit is stable at a better state (on average) as the value of $d$ increases. Figure 4 shows that the distance to Nash equilibrium drops as $d$ increases; it also stabilizes faster.

Figure 5 shows that the percentage of runs that are stable at Nash equilibrium rises as $d$ increases. However, $d$ should not be too high, as the quality of networks may change quickly, rendering observations far back in time irrelevant. When $d >$ , we notice that all runs which are stable at a state other than Nash equilibrium require a single device to switch network to reach Nash equilibrium.

[Figure 3 plot: y-axis “% higher gain a device can observe”; one curve per value of $p_l$, starting from $p_l = 1$.]

Fig. 3: Effect of varying probability of listening on average distance to Nash equilibrium of Co-Bandit (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium); shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

[Figure 4 plot: y-axis “% higher gain a device can observe”; legend: $d = 0$, $d = 2$, $d = 4$, $d = 6$.]

Fig. 4: Tradeoff between delay in feedback and average distance to Nash equilibrium of Co-Bandit (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium); shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

[Figure 5 plot: one bar per delay ($d = 0$, $d = 2$, $d = 4$, $d = 6$); legend: % runs stable at Nash equilibrium, % runs stable at some other state.]

Fig. 5: Tradeoff between how much delay in feedback is acceptable and stability of Co-Bandit (whether stable and type of stable state: Nash equilibrium or some other state).
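One way to see why a larger delay $d$ lets feedback reach more devices is to evaluate the probability $b_{\tilde{t}}$ of learning an observation within a delay of $\tilde{t}$ slots, as stated in Fact 1 of Appendix A. The sketch below assumes the path-counting form $\frac{(n-2)!}{(n-2-t')!} b^{t'+1}$ for a path of length $t' + 1$, capped at 1; the function name and the value chosen for the direct-hearing probability $b$ are hypothetical.

```python
from math import factorial


def prob_learn_within_delay(n, b, delay):
    """Probability of learning an observation made `delay` slots ago (Fact 1, as assumed here).

    n     -- number of devices
    b     -- probability of hearing directly from a given device in one slot
    delay -- maximum delay in time slots (paths of up to delay + 1 hops are counted)
    """
    if n < 2:
        raise ValueError("need at least two devices")
    prob_not_learned = 1.0
    for extra_hops in range(delay + 1):
        remaining = max(n - 2 - extra_hops, 0)
        # number of ordered choices of intermediate devices, without repetition
        num_paths = factorial(n - 2) // factorial(remaining)
        prob_via_paths = min(num_paths * b ** (extra_hops + 1), 1.0)
        prob_not_learned *= 1.0 - prob_via_paths
    return 1.0 - prob_not_learned


n_devices = 20
b_direct = 0.02  # hypothetical probability of hearing directly from one device
reach = {d: prob_learn_within_delay(n_devices, b_direct, d) for d in (0, 2, 4, 6)}
```

Under these assumptions `reach[d]` never decreases as `d` grows, which matches the qualitative trend reported for Figures 4 and 5.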

Performance comparison of Co-Bandit, EWA, and EXP3.

Figure 6 shows that Co-Bandit far outperforms EXP3. It always stabilized at Nash equilibrium. Yet, Co-Bandit maintains a small non-zero distance after stabilizing. This is due to the cost of exploring networks unheard of for a significant amount of time. While all runs of EWA were stable at Nash equilibrium, none of those for EXP3 stabilized within 1200 time slots. In some simulations with 20 devices and 3 wireless networks, EXP3 took over 85,500 time slots (on average) to stabilize at Nash equilibrium [2]. Table III shows that Co-Bandit achieves a median cumulative gain that is comparable to that of EWA, but 45% higher than that of EXP3. Co-Bandit took 2.69x more time (median) than EWA to stabilize.
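The gap between the two baselines comes from how much feedback each update consumes. The sketch below contrasts a full-information exponential-weights step, using the max-rescaled weight update quoted in Appendix A, with the textbook EXP3 step of [6], which only sees the gain of the chosen network. It is a simplified illustration, not the evaluated implementations: function names are ours, and losses and gains are assumed to be scaled to [0, 1].

```python
import math


def probabilities(weights):
    """Normalize weights into a probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]


def full_information_update(weights, losses, eta):
    """Exponential-weights step when the loss of every network is observed."""
    updated = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    top = max(updated)
    return [w / top for w in updated]  # rescale so the largest weight is 1


def exp3_probabilities(weights, gamma):
    """EXP3 mixes the weight-based distribution with uniform exploration."""
    total = sum(weights)
    k = len(weights)
    return [(1.0 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, chosen, gain, gamma):
    """EXP3 step: only the chosen network's weight changes, using gain / probability."""
    probs = exp3_probabilities(weights, gamma)
    estimated_gain = gain / probs[chosen]
    updated = list(weights)
    updated[chosen] *= math.exp(gamma * estimated_gain / len(weights))
    return updated


start = [1.0, 1.0, 1.0]
after_full = full_information_update(start, losses=[0.0, 0.5, 1.0], eta=1.0)
after_exp3 = exp3_update(start, chosen=0, gain=1.0, gamma=0.1)
```

A single full-information step already orders all three networks, whereas the EXP3 step moves only the weight of the network that was selected.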

[Figure 6 plot: y-axis “% higher gain a device can observe”; legend: EWA (full information), Co-Bandit, EXP3.]

Fig. 6: Performance comparison of Co-Bandit, EWA and EXP3 based on distance to Nash equilibrium (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium); shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

TABLE III: Median per device cumulative download (GB) and median time Co-Bandit, EWA and EXP3 take to stabilize (whether at Nash equilibrium or some other state). Column headings: Algorithm; Cumulative download (GB). Rows: Co-Bandit, EXP3.

Scalability.

We evaluate the scalability of Co-Bandit in terms of the rate at which it reaches a stable state. The algorithm was run 50 times, for 20,000 time slots (i.e., 83.33 simulated hours) each, with different numbers of devices and networks. It always stabilized, nearly all the time at either Nash equilibrium or a state that requires a single device to switch network to be at Nash equilibrium. Figure 7 shows that as the number of devices grows, Co-Bandit takes more time to stabilize. The effect is even more significant with 3 networks, as the gain and loss observed are very small when the number of devices is high. With 20 devices, a rise in number of networks does not affect stabilization time much. But, with higher number of devices, a rise in number of networks yields faster stabilization.
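States are described above by how many devices must switch network to reach Nash equilibrium. A minimal way to compute that count, assuming a state is summarized by the number of devices on each network, is shown below; the helper name is hypothetical and the equilibrium counts used in the example are those of the default 20-device setup.

```python
def devices_to_switch(counts, target_counts):
    """Number of devices that must change network to turn `counts` into `target_counts`."""
    if len(counts) != len(target_counts) or sum(counts) != sum(target_counts):
        raise ValueError("both allocations must cover the same networks and devices")
    # every surplus device on an over-subscribed network has to move exactly once
    return sum(max(0, have - want) for have, want in zip(counts, target_counts))


equilibrium = [6, 2, 4, 5, 3]
one_away = devices_to_switch([5, 3, 4, 5, 3], equilibrium)
two_away = devices_to_switch([4, 4, 4, 5, 3], equilibrium)
```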

Setups with other data rates.

We evaluate Co-Bandit in two additional setups with 20 devices and 5 networks, with an aggregate bandwidth of 65 Mbps that yields a unique Nash equilibrium. Figure 8 shows that, in both setups, Co-Bandit stabilized. Table IV gives the cumulative download and time taken to stabilize in each setup. In the first setup, the networks have a uniform data rate of 13 Mbps each. All runs stabilized at Nash equilibrium. Given that the algorithm starts by assuming uniform weight over all networks, it stabilized faster in this setup. The second setup assumes non-uniform data rates 6, 7, 22, 16 and 14 Mbps. Here, the optimal distribution of devices is more skewed. Hence, Co-Bandit takes longer to stabilize. 44% of runs were stable at Nash equilibrium. Most of the other runs were stable at a state with no device in a network or which requires a single device to switch network to observe a slightly higher gain (up to 0.25 Mbps) for Co-Bandit to be at Nash equilibrium.

We now consider a minimal reset to improve the performance of Co-Bandit in setups which require a skewed optimal distribution of devices across networks.

[Figure 7 plot: y-axis “time slots to stabilize”.]

Fig. 7: Scalability of Co-Bandit with an increase in number of devices and/or networks, in terms of rate of stabilization.

[Figure 8 plot: y-axis “% higher gain a device can observe”; legend: 18, 8, 13, 16 and 10 Mbps; 13, 13, 13, 13 and 13 Mbps; 6, 7, 22, 16 and 14 Mbps.]

Fig. 8: Distance to Nash equilibrium of Co-Bandit (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium) in setups with 20 devices and 5 networks and aggregate bandwidth 65 Mbps; shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

TABLE IV: Median per device cumulative download (GB) and median time Co-Bandit takes to stabilize (whether at Nash equilibrium or some other state) in setups with 20 devices and 5 networks with an aggregate bandwidth of 65 Mbps. Column headings: Data rates (Mbps); Cumulative download (GB). Rows: 13, 13, 13, 13, 13; 6, 7, 22, 16, 14.

Effect of minimal reset.

We evaluate the effect of a minimal reset in the setup with data rates 6, 7, 22, 16 and 14 Mbps. First, when a device explores a network unheard of and finds it to be better than the one it is selecting with probability at least 0.75, Co-Bandit resets the weight of the network being explored. The weights of other networks remain unchanged as the device must not unlearn everything. The aim is to allow a device to identify a better network and quickly adapt. Second, when a device constantly learns from its neighbors that another network is better than the one it is selecting with probability at least 0.75 (by more than a percentage; we assume . ), it resets the weight of the other network with probability $\frac{1}{n_{i_j}}$ ($n_{i_j}$ is estimated). This helps when the difference in bit rates of the two networks is small, and allows a device to adapt faster to changes in network quality. When resetting, Co-Bandit drops data previously received for the last $d$ time slots.

Figure 9 shows that a minimal reset improves performance (on average), with a higher percentage of runs (96%) stable at Nash equilibrium. The other runs did not stabilize. However, it increases stabilization time by approximately 3.4x (median).

[Figure 9 plot: y-axis “% higher gain a device can observe”; legend: Co-Bandit, Co-Bandit with reset, EWA (full information), EXP3.]

Fig. 9: Effect of minimal reset on average distance to Nash equilibrium of Co-Bandit (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium); shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

VII. EVALUATION IN DYNAMIC SETTINGS

In this section, we evaluate the adaptability of Co-Bandit to changes in the environment, namely when (a) devices join and leave the service area at different times, and (b) the set of networks available and quality of networks change over time as users of mobile devices move across service areas. We compare its performance to those of EWA and EXP3, and show that it gracefully adapts to changes, with a performance comparable to that of EWA. It far outperforms EXP3.

We consider three dynamic settings involving 20 mobile devices. As evaluation criteria, we consider the states of the algorithms over time and how far these states are from Nash equilibrium. As in Section VI, we assume that mobile devices are time-synchronized (although not a pre-requirement for the algorithm). We present results of Co-Bandit, both with and without the minimal reset. All results involve data from 100 runs of 5 (simulated) hours each, i.e., 1200 time slots.

Devices leaving the service area.

In this setting, all 20 mobile devices have access to the same 5 wireless networks with data rates 18, 8, 13, 16 and 10 Mbps, as in Section VI. We assume that all devices can hear each other. 10 of the devices leave the service area at the end of time slot $t = 600$, freeing resources. Figure 10 shows that Co-Bandit dynamically adapts to the change, and far outperforms EXP3. With the minimal reset, Co-Bandit adapts faster than EWA when resources are freed at $t = 600$. Without a reset, EWA takes time to adapt once it is stable at another state (here, the optimal state when 20 devices were in the service area).

Devices joining and leaving the service area.

We consider the same set of networks as in the previous dynamic setting. All devices have access to the same networks and can hear each other. But, 10 of them join the service area at the beginning of $t = 401$ and leave at the end of $t = 800$. Figure 11 shows that Co-Bandit dynamically adapts to the changes, and far outperforms EXP3. Both Co-Bandit and EWA perform better in this setting compared to the previous dynamic setting. They are already stable at the optimal state when the devices join at $t = 401$. These devices fit into the setting, without causing much disruption. When they leave at $t = 800$, the algorithms quickly revert to the stable state they were at prior to the 10 devices joining.

Mobile users moving around.

We now consider the service areas in Figure 1, where networks 1, 2, 3, 4 and 5 have data rates 16, 14, 22, 7 and 4 Mbps, respectively. Initially there are 10 devices (1 - 10) at the food court, 5 devices (11 - 15) at the study area and 5 devices (16 - 20) at the bus stop. 8 devices (1 - 8) from the food court move to the study area at the beginning of $t = 401$ and eventually reach the bus stop at the start of $t = 801$. We assume that all devices in a service area can hear each other, e.g., all devices at the food court can hear each other but cannot hear devices at the study area.

Figure 12 shows the performance of each algorithm for devices in each area and those moving across areas, separately. We observe similar behaviour as in static settings during the first 400 time slots for all algorithms. However, the distance for Co-Bandit is higher for devices at the study area and bus stop. This is due to the cost of exploring networks unheard of (in particular network 5, common to both areas, to which a single device is associated at optimal state; thus, the probability of hearing about it is low). As devices move across service areas, they disrupt the setting, but eventually Co-Bandit adapts accordingly and outperforms EXP3. Yet, we observe more exploration in this setting because of our assumption that devices in different areas do not hear from each other. For example, devices 9 and 10 have to associate with network 2 as from $t = 401$; they will only be able to hear from each other at that time and will keep exploring networks 1 and 2 with uniform probability every 32 time slots. The minimal reset improves the rate of adaptation to changes.

[Figure 10 plot: y-axis “% higher gain a device can observe”; legend: Co-Bandit, Co-Bandit with reset, EWA (full information), EXP3.]

Fig. 10: Average distance to Nash equilibrium (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium) in a setting where 10 devices leave the service area at the end of $t = 600$; shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

[Figure 11 plot: y-axis “% higher gain a device can observe”; legend: Co-Bandit, Co-Bandit with reset, EWA (full information), EXP3.]

Fig. 11: Average distance to Nash equilibrium (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium) when 10 devices join the service area at $t = 401$ and leave at the end of $t = 800$; shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

VIII. OTHER RELATED WORK

In this section, we discuss state-of-the-art wireless network selection approaches, and relevant work done on cooperative bandits, the use of graphs to model communication networks, and delayed feedback.

A significant amount of work considers the use of multiple wireless networks, such as Multinet [10] and MPTCP [17]. Yet, identifying the optimal network is crucial for good performance [12]. A number of centralized wireless network selection approaches [4], [7], [29], [36] have been proposed. But, they are not scalable and are limited to managed networks. Several distributed solutions have been presented. Some require coordination from APs [22]. Others require global knowledge [32], [5], [30], or availability of some information [43], [11], or assume a stochastic bandit setting [42]. They all have some limitations. In our prior work, we proposed Smart EXP3 [2], a bandit algorithm with good theoretical and practical performance (far better than EXP3). Here, we consider cooperation for an improved rate of stabilization. A cooperative approach to network selection was considered in [13], where devices estimate and share properties of their networks with other devices associated with the same network. We consider sharing across networks, although within a range.

[Figure 12 plot: four panels, one per group of devices; y-axis “% higher gain a device can observe”; legend: Co-Bandit, Co-Bandit with reset, EWA, EXP3.]

Fig. 12: Average distance to Nash equilibrium (% higher gain any device would have observed, compared to its current gain, if the algorithm was at Nash equilibrium) in the setting shown in Figure 1 with 8 users moving from area A to C through B; shaded region represents $\epsilon$-equilibrium, where $\epsilon$ = 7 .

The closest to our work is [8], in which the authors proposed an algorithm that allows agents to share feedback with their neighbors relying on a communication network modeled as a graph. Feedback is used as soon as it is received and feedback older than some threshold is dropped. However, they consider an abstract problem and an abstract graph, and give an average welfare regret bound that relies on combinatorial graph properties. We solve the wireless network selection problem and consider a random graph based on the communication pattern in a dynamic wireless network environment. As such, we provide a stronger per-device regret bound, which is better suited to the wireless network setting and highlights the impact of varying the amount of cooperation and delay. In addition, we consider stabilization of our algorithm in a multi-agent setting.

Graph-structured feedback has been studied in numerous other works. In [1], the relationship between actions is modelled using a time-changing directed graph, and the agent gets instant feedback about its chosen action and related actions. A considerable amount of work has considered networks of cooperative stochastic bandits using a dynamic peer-to-peer random network [37], a fixed communication graph [25], and a social network [24]. Cooperative contextual bandits are studied in [39], where each agent can select an action or request another agent to select it (with a cost). All of this work focuses on minimizing (in most cases, average) regret. A significant amount of work has considered delayed feedback, focusing on its impact on regret. Several works assumed full information [41], [27], [28], [21], [33] or stochastic settings [14], considering both fixed and variable delay. The traditional sub-optimal approaches to deal with delayed feedback are to (a) use multiple instances of a non-delayed algorithm [41], [8], [20], or (b) wait for all feedback to be received before taking the next decision.

IX. CONCLUSION

EXP3, a leading multi-armed bandit algorithm, has excellent theoretical properties but takes an unacceptable amount of time to stabilize in practice. Full information setting requires support from network service providers, which may be infeasible. Hence, we consider a spectrum of settings between full information and bandit settings with cooperation among devices. We have proposed Co-Bandit, a novel cooperative bandit algorithm and evaluated its performance in dynamic wireless network settings, where a mobile device has to select the optimal wireless network for good performance. Empirical results show that it far outperforms EXP3. A little cooperation among devices, even when feedback is received with a delay, can significantly enhance performance and the rate of learning.

As future work, we intend to (a) evaluate Co-Bandit in real-world settings, (b) further enhance its performance through features of Smart EXP3, e.g., cater for stability when more than one network offers the same bit rate, (c) consider settings with dishonest devices, i.e., settings where devices lie about their observations, and (d) apply it to other resource selection problems requiring fast stabilization.

REFERENCES

[1] Noga Alon, Nicolo Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.
[2] Anuja Meetoo Appavoo, Seth Gilbert, and Kian-Lee Tan. Shrewd selection speeds surfing: Use smart exp3! In , pages 188–199. IEEE, 2018.
[3] arp scan. The arp scanner - linux man page. https://linux.die.net/man/1/arp-scan, 2018.
[4] E. Aryafar, A. Keshavarz-Haddad, C. Joe-Wong, and M. Chiang. Max-min fair resource allocation in hetnets: Distributed algorithms and hybrid architecture. In ICDCS, 2017, pages 857–869. IEEE, 2017.
[5] E. Aryafar, A. Keshavarz-Haddad, M.l Wang, and M. Chiang. Rat selection games in hetnets. In INFOCOM, pages 998–1006. IEEE, 2013.
[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[7] Y. Bejerano, S-J. Han, and L. E. Li. Fairness and load balancing in wireless lans using association control. In MobiCom, pages 315–329. ACM, 2004.
[8] Nicolo Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. Delay and cooperation in nonstochastic bandits. Journal of Machine Learning Research, 49:605–622, 2016.
[9] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
[10] R. Chandra and P. Bahl. Multinet: Connecting to multiple ieee 802.11 networks using a single wireless card. In INFOCOM, volume 2, pages 882–893. IEEE, 2004.
[11] M. H. Cheung, F. Hou, J. Huang, and R. Southwell. Congestion-aware distributed network selection for integrated cellular and wi-fi networks. arXiv preprint arXiv:1703.00216, 2017.
[12] S. Deng, R. Netravali, A. Sivaraman, and H. Balakrishnan. Wifi, lte, or both?: Measuring multi-homed wireless internet performance. In IMC, pages 181–194. ACM, 2014.
[13] S. Deng, A. Sivaraman, and H. Balakrishnan. All your network are belong to us: A transport framework for mobile network selection. In HotMobile. ACM, 2014.
[14] Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.
[15] Fitter. A tool to fit data to many distributions and best one(s) for python. https://pypi.python.org/pypi/fitter, 2016.
[16] IEEE Standard for Information technology (2009). Local and metropolitan area networks specific requirements part 11. Technical report, IEEE, 2009.
[17] A. Ford, C. Raiciu, M. Handley, and O. Bonaventure. Tcp extensions for multipath operation with multiple addresses. Technical report, Ford, 2013.
[18] C. Gessner, A. Roessier, and M. Kottkamp. Umts long term evolution (lte)–technology introduction application note., 2012.
[19] Business Insider. Here’s how Google Maps knows when there is traffic., 2017. https://tinyurl.com/ycymy4wp, accessed 2019-01-07.
[20] Pooria Joulani, Andras Gyorgy, and Csaba Szepesvári. Online learning under delayed feedback. In International Conference on Machine Learning, pages 1453–1461, 2013.
[21] Pooria Joulani, András György, and Csaba Szepesvári. Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms. In AAAI, volume 16, pages 1744–1750, 2016.
[22] B. Kauffmann, F. Baccelli, A. Chaintreau, V. Mhatre, K. Papagiannaki, and C. Diot. Measurement-based self organization of interfering 802.11 wireless access networks. In INFOCOM 2007, pages 1451–1459. IEEE, 2007.
[23] R. Kleinberg, G. Piliouras, and E. Tardos. Multiplicative updates outperform generic no-regret learning in congestion games. In ACM STOC, pages 533–542. ACM, 2009.
[24] Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking (TON), 26(4):1782–1795, 2018.
[25] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. On distributed cooperative decision-making in multiarmed bandits. In Control Conference (ECC), 2016 European, pages 243–248. IEEE, 2016.
[26] S. Maghsudi and S. Stanczak. Relay selection with no side information: An adversarial bandit approach. In WCNC, pages 715–720. IEEE, 2013.
[27] Chris Mesterharm. On-line learning with delayed label feedback. In International Conference on Algorithmic Learning Theory, pages 399–413. Springer, 2005.
[28] Chris Mesterharm. Improving on-line learning. PhD thesis, Rutgers University-Graduate School-New Brunswick, 2007.
[29] A. Mishra, V. Brik, S. Banerjee, A. Srinivasan, and W. A. Arbaugh. A client-driven approach for channel management in wireless lans. In Infocom, 2006.
[30] E Monsef, A. Keshavarz-Haddad, E. Aryafar, J. Saniie, and M. Chiang. Convergence properties of general network selection games. In INFOCOM, pages 1445–1453. IEEE, 2015.
[31] Noam Nisan, Tim Roughgarden, Eva Tardos, and Vijay V Vazirani. Algorithmic game theory, volume 1. Cambridge University Press Cambridge, 2007.
[32] D. Niyato and E. Hossain. Dynamics of network selection in heterogeneous wireless networks: An evolutionary game approach. TVT, 58(4):2008–2017, 2009.
[33] Kent Quanrud and Daniel Khashabi. Online learning with adversarial delays. In Advances in Neural Information Processing Systems, pages 1270–1278, 2015.
[34] R. W. Rosenthal. A class of games possessing pure-strategy nash equilibria. International Journal of Game Theory, 2(1):65–67, 1973.
[35] SimPy. SimPy - Event discrete simulation for Python, 2016. https://simpy.readthedocs.io/, accessed 2018-19-12.
[36] K. Sui, M. Zhou, D. Liu, M. Ma, D. Pei, Y. Zhao, Z. Li, and T. Moscibroda. Characterizing and improving wifi latency in large-scale operational networks. In MobiSys, pages 347–360. ACM, 2016.
[37] Balázs Szörényi, Róbert Busa-Fekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 2, pages 1056–1064. International Machine Learning Society, 2013.
[38] C. Tekin and M. Liu. Performance and convergence of multi-user online learning. In GAMENETS, pages 321–336. Springer, 2011.
[39] Cem Tekin and Mihaela van der Schaar. Distributed online learning via cooperative contextual bandits. IEEE Transactions on Signal Processing
Information Theory, 2002. Proceedings. 2002 IEEE International Symposium on, page 148. IEEE, 2002.
[42] Q. Wu, Z. Du, P. Yang, Y.-D. Yao, and J. Wang. Traffic-aware online network selection in heterogeneous wireless networks. TVT, 65(1):381–397, 2016.
[43] K. Zhu, D. Niyato, and P. Wang. Network selection in heterogeneous wireless networks: Evolution with incomplete information. In WCNC, pages 1–6. IEEE, 2010.

APPENDIX A
PROOF OF UPPER BOUND ON WEAK REGRET

We present some facts derived from definitions, and lemmas that are used in the proofs. Lemma 1 upper bounds the expected sum, over all the networks $i \in \mathcal{K}$, of the ratio of any device $j$’s probability of choosing network $i$ at time $t - \tilde{t}$ to the probability that at least one of the devices it heard from (including itself) chooses network $i$ at time $t - \tilde{t}$. Moreover, bounding the extent to which the probability distribution can drift in $d$ time slots (due to the delayed feedback) plays an important role in controlling regret. Lemmas 2 and 3 control its evolution, by lower bounding and upper bounding the drift, respectively.

Fact 1.

Given that feedback received from neighbors is forwarded, as time elapses, the probability of receiving a particular observation increases (more devices have the information and are broadcasting it). Here, we compute $b_{\tilde{t}}$, the probability of learning a device’s observation made at time $t - \tilde{t}$, for $\tilde{t} \geq 0$, i.e., within a delay of $\tilde{t}$ time slots. The length of the path between the two vertices (devices) involved can be up to $\tilde{t} + 1$.

Let $Y$ be a discrete random variable that represents the length of a directed path between two vertices in the communication graph in a particular time slot, and $b$ be the probability of (directly) hearing from a device. Then,

$$P(Y = 1) = b$$
$$P(Y = 2) = \frac{(n-2)!}{(n-2-1)!}\, b^{2}, \text{ considering all permutations of path without repetition of vertices}$$
$$P(Y = 3) = \frac{(n-2)!}{(n-2-2)!}\, b^{3}$$
$$\vdots$$
$$P(Y = \tilde{t} + 1) = \frac{(n-2)!}{(n-2-\tilde{t})!}\, b^{\tilde{t}+1}, \text{ when } n \geq \tilde{t} + 2$$

When the delay $\tilde{t}$ increases beyond $n - 2$, the probability remains the same since all paths the message can take have already been considered. Therefore,

$$b_{\tilde{t}} = 1 - \prod_{t'=0}^{\tilde{t}} \left( 1 - \min\left\{ \frac{(n-2)!}{\max\{(n-2-t'),\, 0\}!}\, b^{t'+1},\ 1 \right\} \right) \tag{1}$$

Fact 2.

$$E_{t-\tilde{t}}\left[ \frac{I_{i,j}(t-\tilde{t}, t)}{q_{i,j}(t-\tilde{t}, t)} \right] = \frac{1}{q_{i,j}(t-\tilde{t}, t)}\, E_{t-\tilde{t}}\left[ I_{i,j}(t-\tilde{t}, t) \right] = \frac{1}{q_{i,j}(t-\tilde{t}, t)} \cdot q_{i,j}(t-\tilde{t}, t), \text{ by definition of } I_{i,j}(t-\tilde{t}, t)$$
$$= 1 \tag{2}$$

Fact 3.

$$E_{t-\tilde{t}}\left[ \hat{l}_{i,j}(t) \right] = E_{t-\tilde{t}}\left[ \frac{1}{d'+1} \sum_{\tilde{t}=0}^{d'} \frac{l_{i,j}(t-\tilde{t}, t)}{q_{i,j}(t-\tilde{t}, t)}\, I_{i,j}(t-\tilde{t}, t) \right], \text{ from the loss estimate rule}$$
$$= \frac{1}{d'+1} \sum_{\tilde{t}=0}^{d'} l_{i,j}(t-\tilde{t}, t)\, E_{t-\tilde{t}}\left[ \frac{I_{i,j}(t-\tilde{t}, t)}{q_{i,j}(t-\tilde{t}, t)} \right]$$
$$= \frac{1}{d'+1} \sum_{\tilde{t}=0}^{d'} l_{i,j}(t-\tilde{t}, t), \text{ using Fact 2}$$
$$= l_{i,j}(t) \tag{3}$$

Lemma 1:

Given $n$ active mobile devices and $k$ wireless networks in the service area, for any device $j \in \mathcal{N}$, $b \in [0, 1 - e^{-1}]$, $x = d$ at time $t$, and any time $t \geq 1$, we can say that

$$E\left[ \sum_{i \in \mathcal{K}} \frac{p_{i,j}(t-\tilde{t})}{q_{i,j}(t-\tilde{t}, t)} \right] \leq \min\left\{ k, \frac{1}{b} \right\}$$

Proof: $q_{i,j}(t-\tilde{t}, t)$ is the probability that the quality of network $i$ at time $t - \tilde{t}$ is known to device $j$ by now (by the current time $t$), whether by exploring it at time $t - \tilde{t}$ or by hearing about it from neighbor(s) over the past $\tilde{t}$ time slots. By definition,

$$q_{i,j}(t-\tilde{t}, t) = 1 - \prod_{j' \in H_j(t-\tilde{t}, t)} \left(1 - p_{i,j'}(t-\tilde{t})\right)$$
$$= 1 - \left(1 - p_{i,j}(t-\tilde{t})\right) \underbrace{\prod_{j' \in H_j(t-\tilde{t}, t) - \{j\}} \left(1 - p_{i,j'}(t-\tilde{t})\right)}_{\text{probability no device shared details about network } i}$$

We now compute the probability that some device has shared details about network $i$. We assume that $t \geq d + 1$ and $x = d$. We consider two cases, namely (a) at least one device $j'$ explored the network as it was unheard of for $x$ time slots, and (b) some device $j'$ was associated to it and shared its observation; this observation may have been propagated by other devices if $\tilde{t} > 0$.

In the first case, a device $j'$ selects network $i$ with probability $\frac{1}{n}$ and shares its observation with probability 1. In this case, the probability that some device explores and shares about network $i$ is given as

$$1 - \left(1 - \frac{1}{n}\right)^{n-1} \approx 1 - e^{-1}$$

In the second case, some device $j'$ has selected it and device $j$ will hear from $j'$ about $i$ with probability $b_{\tilde{t}}$.

Hence, the probability that some device has shared details about network $i$ is at least $\min\{1 - e^{-1}, b_{\tilde{t}}\}$, where $b_{\tilde{t}} \geq b$. Thus, we can say that

$$q_{i,j}(t-\tilde{t}, t) \geq 1 - \left(1 - p_{i,j}(t-\tilde{t})\right)\left(1 - \min\{1 - e^{-1}, b\}\right)$$
$$\geq 1 - \left(1 - p_{i,j}(t-\tilde{t})\right)(1 - b), \text{ as we assume that } b \leq 1 - e^{-1}$$

Then,

$$E\left[ \sum_{i \in \mathcal{K}} \frac{p_{i,j}(t-\tilde{t})}{q_{i,j}(t-\tilde{t}, t)} \right] \leq E\left[ \sum_{i \in \mathcal{K}} \frac{p_{i,j}(t-\tilde{t})}{1 - \left(1 - p_{i,j}(t-\tilde{t})\right)(1 - b)} \right] \leq \sum_{i \in \mathcal{K}} \frac{\frac{1}{k}}{1 - \left(1 - \frac{1}{k}\right)(1 - b)} \leq \frac{1}{1 - \left(1 - \frac{1}{k}\right)(1 - b)}$$

We can say that

$$\frac{1}{1 - \left(1 - \frac{1}{k}\right)(1 - b)} \leq \frac{1}{1 - \left(1 - \frac{1}{k}\right)} \leq k$$

We can also say that

$$\frac{1}{1 - \left(1 - \frac{1}{k}\right)(1 - b)} \leq \frac{1}{1 - (1 - b)} \leq \frac{1}{b}$$

As such,

$$\frac{1}{1 - \left(1 - \frac{1}{k}\right)(1 - b)} \leq \min\left\{ k, \frac{1}{b} \right\}$$

Therefore,

$$E\left[ \sum_{i \in \mathcal{K}} \frac{p_{i,j}(t-\tilde{t})}{q_{i,j}(t-\tilde{t}, t)} \right] \leq \min\left\{ k, \frac{1}{b} \right\}$$

which concludes the proof.

Lemma 2:

For any device $j \in \mathcal{N}$, each network $i \in \mathcal{K}$, any $t \geq 1$ when $d \in \mathbb{Z}_{\geq 1}$, and $0 < \eta \leq \frac{1}{ke(d+1)}$,

\[
p_{i,j}(t+1) \leq \left( 1 + \frac{1}{d} \right) p_{i,j}(t)
\]

Proof: We follow the proofs of Lemmas 1, 2 and 19 in [8].

\[
p_{i,j}(t+1) - p_{i,j}(t) = p_{i,j}(t+1) - \frac{w_{i,j}(t)}{\sum_{m=1}^{k} w_{m,j}(t)}, \quad \text{using the probability update rule} \tag{1}
\]

From the weight update rule, we have

\[
\begin{aligned}
w_{i,j}(t+1) &= \frac{w_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}} \\
w_{i,j}(t) &= \frac{w_{i,j}(t+1) \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}}{\exp(-\eta\, \hat l_{i,j}(t))} \\
&\geq w_{i,j}(t+1) \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}, \quad \text{as } \exp(-\eta\, \hat l_{i,j}(t)) \leq 1
\end{aligned}
\]

Combining this with (1), we get

\[
p_{i,j}(t+1) - p_{i,j}(t) \leq p_{i,j}(t+1) - \frac{w_{i,j}(t+1) \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}}{\sum_{m=1}^{k} w_{m,j}(t)} \tag{2}
\]

From the probability update rule, we have

\[
\begin{aligned}
p_{i,j}(t+1) &= \frac{w_{i,j}(t+1)}{\sum_{m=1}^{k} w_{m,j}(t+1)} \\
w_{i,j}(t+1) &= \sum_{m=1}^{k} w_{m,j}(t+1)\, p_{i,j}(t+1)
\end{aligned} \tag{3}
\]

Combining this with (2), we have

\[
\begin{aligned}
p_{i,j}(t+1) - p_{i,j}(t)
&\leq p_{i,j}(t+1) - \frac{\sum_{m=1}^{k} w_{m,j}(t+1)\, p_{i,j}(t+1) \max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}{\sum_{m=1}^{k} w_{m,j}(t)} \\
&\leq p_{i,j}(t+1) - \frac{\sum_{m=1}^{k} \frac{w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}\, p_{i,j}(t+1) \max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}{\sum_{m=1}^{k} w_{m,j}(t)}, \quad \text{using the weight update rule} \\
&\leq p_{i,j}(t+1) - \frac{\sum_{m=1}^{k} w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))\, p_{i,j}(t+1)}{\sum_{m=1}^{k} w_{m,j}(t)} \\
&\leq p_{i,j}(t+1) \left( 1 - \frac{\sum_{m=1}^{k} w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}{\sum_{m=1}^{k} w_{m,j}(t)} \right)
\end{aligned} \tag{4}
\]

From the probability update rule, we have

\[
\begin{aligned}
p_{i,j}(t) &= \frac{w_{i,j}(t)}{\sum_{y=1}^{k} w_{y,j}(t)} \\
w_{i,j}(t) &= \sum_{y=1}^{k} w_{y,j}(t)\, p_{i,j}(t)
\end{aligned}
\]

Combining this with (4), we get

\[
\begin{aligned}
p_{i,j}(t+1) - p_{i,j}(t)
&\leq p_{i,j}(t+1) \left( 1 - \frac{\sum_{m=1}^{k} \sum_{y=1}^{k} w_{y,j}(t)\, p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}{\sum_{m=1}^{k} w_{m,j}(t)} \right) \\
&\leq p_{i,j}(t+1) \left( 1 - \sum_{m=1}^{k} p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \right) \\
&\leq p_{i,j}(t+1) \left( \sum_{m=1}^{k} p_{m,j}(t) - \sum_{m=1}^{k} p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \right) \\
&\leq p_{i,j}(t+1) \sum_{m=1}^{k} p_{m,j}(t) \left( 1 - \exp(-\eta\, \hat l_{m,j}(t)) \right) \\
&\leq p_{i,j}(t+1) \sum_{m=1}^{k} p_{m,j}(t) \left( \eta\, \hat l_{m,j}(t) \right), \quad \text{as } 1 - e^{-x} \leq x \\
&\leq \eta\, p_{i,j}(t+1) \sum_{m=1}^{k} p_{m,j}(t)\, \hat l_{m,j}(t)
\end{aligned} \tag{5}
\]

We now upper bound $\sum_{m=1}^{k} p_{m,j}(t)\, \hat l_{m,j}(t)$ by following an inductive argument similar to the ones in the proofs of Lemmas 2 and 19 in [8]. For simplicity, we assume that $t \geq d + 1$.

\[
\begin{aligned}
\sum_{m=1}^{k} p_{m,j}(t)\, \hat l_{m,j}(t)
&= \sum_{m=1}^{k} p_{m,j}(t) \cdot \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{l_{m,j}(t-\tilde t, t)}{q_{m,j}(t-\tilde t, t)}\, I_{m,j}(t-\tilde t, t), \quad \text{using the loss estimate rule} \\
&\leq \frac{1}{d+1} \sum_{m=1}^{k} p_{m,j}(t) \sum_{\tilde t=0}^{d} \frac{1}{q_{m,j}(t-\tilde t, t)}, \quad \text{since } l_{m,j}(t-\tilde t, t)\, I_{m,j}(t-\tilde t, t) \leq 1 \\
&\leq \frac{1}{d+1} \sum_{m=1}^{k} \sum_{\tilde t=0}^{d} \left( 1 + \frac{1}{d} \right)^{\tilde t} \frac{p_{m,j}(t-\tilde t)}{q_{m,j}(t-\tilde t, t)}, \quad \text{by inductive hypothesis} \\
&\leq \sum_{m=1}^{k} \sum_{\tilde t=0}^{d} \frac{1}{d+1} \left( 1 + \frac{1}{d} \right)^{\tilde t}, \quad \text{as } q_{m,j}(t-\tilde t, t) \geq p_{m,j}(t-\tilde t) \\
&\leq \sum_{m=1}^{k} e, \quad \text{since } \left( 1 + \frac{1}{d} \right)^{\tilde t} \leq \left( 1 + \frac{1}{d} \right)^{d} \leq e \text{ for } \tilde t \leq d, \text{ and } \sum_{\tilde t=0}^{d} \frac{1}{d+1} = 1 \\
&\leq ke
\end{aligned}
\]

Combining this with (5), when $d \in \mathbb{Z}_{\geq 1}$ and $\eta \leq \frac{1}{ke(d+1)}$, we get

\[
\begin{aligned}
p_{i,j}(t+1) - p_{i,j}(t) &\leq \eta\, k e\, p_{i,j}(t+1) \\
&\leq \frac{1}{ke(d+1)} \cdot ke\, p_{i,j}(t+1) \\
&\leq \frac{1}{d+1}\, p_{i,j}(t+1) \\
p_{i,j}(t+1) \left( 1 - \frac{1}{d+1} \right) &\leq p_{i,j}(t) \\
p_{i,j}(t+1) \left( \frac{d}{d+1} \right) &\leq p_{i,j}(t) \\
p_{i,j}(t+1) &\leq \left( 1 + \frac{1}{d} \right) p_{i,j}(t)
\end{aligned}
\]

which concludes the proof.

Lemma 3:

For any device $j \in \mathcal{N}$, each network $i \in \mathcal{K}$, and any $t \geq d + 1$ when $d \in \mathbb{Z}_{\geq 1}$, $\eta > 0$, we have

\[
p_{i,j}(t+1) - p_{i,j}(t) \geq -\frac{e\eta}{d+1} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}
\]

Proof: We follow the proofs of Lemmas 1, 2 and 20 in [8]. From the probability update rule, we have

\[
p_{i,j}(t+1) = \frac{w_{i,j}(t+1)}{\sum_{m=1}^{k} w_{m,j}(t+1)}
\]

Therefore, we can say that

\[
\begin{aligned}
p_{i,j}(t+1) - p_{i,j}(t)
&= \frac{w_{i,j}(t+1)}{\sum_{m=1}^{k} w_{m,j}(t+1)} - p_{i,j}(t) \\
&= \frac{\dfrac{w_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}}{\sum_{m=1}^{k} \dfrac{w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}} - p_{i,j}(t), \quad \text{using the weight update rule} \\
&= \frac{w_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\sum_{m=1}^{k} w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))} - p_{i,j}(t)
\end{aligned} \tag{1}
\]

From the probability update rule, we get

\[
\begin{aligned}
p_{i,j}(t) &= \frac{w_{i,j}(t)}{\sum_{y=1}^{k} w_{y,j}(t)} \\
w_{i,j}(t) &= \sum_{y=1}^{k} w_{y,j}(t)\, p_{i,j}(t)
\end{aligned}
\]

Combining this with (1), we get

\[
\begin{aligned}
p_{i,j}(t+1) - p_{i,j}(t)
&= \frac{\sum_{y=1}^{k} w_{y,j}(t)\, p_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\sum_{m=1}^{k} \sum_{y=1}^{k} w_{y,j}(t)\, p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))} - p_{i,j}(t) \\
&= \frac{p_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\sum_{m=1}^{k} p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))} - p_{i,j}(t) \\
&\geq p_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t)) - p_{i,j}(t), \quad \text{as } \exp(-\eta\, \hat l_{m,j}(t)) \leq 1 \text{ and } \sum_{m=1}^{k} p_{m,j}(t) = 1 \\
&\geq p_{i,j}(t) \left( \exp(-\eta\, \hat l_{i,j}(t)) - 1 \right) \\
&\geq -\eta\, p_{i,j}(t)\, \hat l_{i,j}(t), \quad \text{since } e^{-x} - 1 \geq -x
\end{aligned} \tag{2}
\]

We now upper bound $p_{i,j}(t)\, \hat l_{i,j}(t)$, as in Lemma 2. For simplicity, we assume that $t \geq d + 1$.

\[
\begin{aligned}
p_{i,j}(t)\, \hat l_{i,j}(t)
&= p_{i,j}(t)\, \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}\, I_{i,j}(t-\tilde t, t), \quad \text{using the loss estimate rule} \\
&\leq \frac{1}{d+1} \sum_{\tilde t=0}^{d} p_{i,j}(t)\, \frac{l_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}\, I_{i,j}(t-\tilde t, t) \\
&\leq \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t)}{q_{i,j}(t-\tilde t, t)}\, I_{i,j}(t-\tilde t, t), \quad \text{as } l_{i,j}(t-\tilde t, t) \leq 1
\end{aligned}
\]

By repeatedly applying Lemma 2, we get

\[
p_{i,j}(t) \leq \left( 1 + \frac{1}{d} \right) p_{i,j}(t-1) \leq \left( 1 + \frac{1}{d} \right)^{2} p_{i,j}(t-2) \leq \left( 1 + \frac{1}{d} \right)^{\tilde t} p_{i,j}(t-\tilde t)
\]

Thus, we can say that

\[
\begin{aligned}
p_{i,j}(t)\, \hat l_{i,j}(t)
&\leq \frac{1}{d+1} \sum_{\tilde t=0}^{d} \left( 1 + \frac{1}{d} \right)^{\tilde t} \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)}\, I_{i,j}(t-\tilde t, t) \\
&\leq \frac{e}{d+1} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}, \quad \text{as } e \text{ approximates } \left( 1 + \frac{1}{d} \right)^{\tilde t}
\end{aligned}
\]

Combining this with (2), we get

\[
p_{i,j}(t+1) - p_{i,j}(t) \geq -\frac{e\eta}{d+1} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}
\]

which concludes the proof.

We now proceed with the proof for regret bound.

Proof of regret bound: We follow the proof for upper bound on regret for EXP3 [6] and proofs given in [8], [1]. Let $W_t = w_{1,j}(t) + \cdots + w_{k,j}(t)$. We try to find a bound on the ratio of weights from one round to the next, i.e., $\frac{W_{t+1}}{W_t}$.

\[
\begin{aligned}
\frac{W_{t+1}}{W_t} &= \sum_{i=1}^{k} \frac{w_{i,j}(t+1)}{W_t}, \quad \text{given that } W_{t+1} = \sum_{i=1}^{k} w_{i,j}(t+1) \\
&= \sum_{i=1}^{k} \frac{1}{W_t} \cdot \frac{w_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}}, \quad \text{using the weight update rule}
\end{aligned}
\]

Thus,

\[
\begin{aligned}
\max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}\, \frac{W_{t+1}}{W_t}
&= \sum_{i=1}^{k} \frac{w_{i,j}(t)}{W_t} \exp(-\eta\, \hat l_{i,j}(t)) \\
&= \sum_{i=1}^{k} p_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t)), \quad \text{as } \frac{w_{i,j}(t)}{W_t} = p_{i,j}(t)
\end{aligned} \tag{1}
\]

From Taylor series, $e^{-x} \leq 1 - x + \frac{x^2}{2}$, for all $x \geq 0$. In our case, $x = \eta\, \hat l_{i,j}(t)$.
Combining this with (1), we get

\[
\begin{aligned}
\max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}\, \frac{W_{t+1}}{W_t}
&\leq \sum_{i=1}^{k} p_{i,j}(t) \left[ 1 - \eta\, \hat l_{i,j}(t) + \frac{\eta^2}{2} \left( \hat l_{i,j}(t) \right)^2 \right] \\
&\leq \sum_{i=1}^{k} p_{i,j}(t) - \eta \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) + \frac{\eta^2}{2} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2 \\
&\leq 1 - \eta \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) + \frac{\eta^2}{2} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2
\end{aligned}
\]

Taking logarithms on both sides,

\[
\ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}\, \frac{W_{t+1}}{W_t} \right) \leq \ln\left( 1 - \eta \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) + \frac{\eta^2}{2} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2 \right) \tag{2}
\]

$\ln(1 - x) \leq -x$ for all $x \geq 0$ (with $x < 1$). In our case,

\[
x = \eta \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) - \frac{\eta^2}{2} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2
\]

Therefore, combining this with (2), we get

\[
\ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) + \ln W_{t+1} - \ln W_t \leq -\eta \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) + \frac{\eta^2}{2} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2
\]

Summing over $t$,

\[
\sum_{t=1}^{T} \left( \ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) + \ln W_{t+1} - \ln W_t \right) \leq -\eta \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) + \frac{\eta^2}{2} \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2 \tag{3}
\]

We have $W_{T+1} \geq w_{i,j}(T+1)$, and

\[
\begin{aligned}
w_{i,j}(T+1) &= \frac{w_{i,j}(T) \exp(-\eta\, \hat l_{i,j}(T))}{\max_{m \in \mathcal{K}} \{ w_{m,j}(T) \exp(-\eta\, \hat l_{m,j}(T)) \}}, \quad \text{from the weight update rule} \\
&= w_{i,j}(T-1) \cdot \frac{\exp(-\eta\, \hat l_{i,j}(T-1))}{\max_{m \in \mathcal{K}} \{ w_{m,j}(T-1) \exp(-\eta\, \hat l_{m,j}(T-1)) \}} \cdot \frac{\exp(-\eta\, \hat l_{i,j}(T))}{\max_{m \in \mathcal{K}} \{ w_{m,j}(T) \exp(-\eta\, \hat l_{m,j}(T)) \}} \\
&= \prod_{t=1}^{T} \frac{\exp(-\eta\, \hat l_{i,j}(t))}{\max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}}, \quad \text{since } w_{i,j}(1) = 1 \\
&= \frac{\exp\left( \sum_{t=1}^{T} -\eta\, \hat l_{i,j}(t) \right)}{\prod_{t=1}^{T} \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}}
\end{aligned}
\]

Thus,

\[
W_{T+1} \geq \frac{\exp\left( -\eta \sum_{t=1}^{T} \hat l_{i,j}(t) \right)}{\prod_{t=1}^{T} \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \}}
\]

Using this to solve the left-hand side of (3), in which $\sum_{t=1}^{T} (\ln W_{t+1} - \ln W_t)$ is a telescoping sum, and noting that $W_1 = k$, we get

\[
\begin{aligned}
&\sum_{t=1}^{T} \left( \ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) + \ln W_{t+1} - \ln W_t \right) \\
&\quad \geq \sum_{t=1}^{T} \ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) + \ln W_{T+1} - \ln W_1 \\
&\quad \geq \sum_{t=1}^{T} \ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) + \sum_{t=1}^{T} -\eta\, \hat l_{i,j}(t) - \ln\left( \prod_{t=1}^{T} \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) - \ln k \\
&\quad \geq \sum_{t=1}^{T} \ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) - \eta \sum_{t=1}^{T} \hat l_{i,j}(t) - \sum_{t=1}^{T} \ln\left( \max_{m \in \mathcal{K}} \{ w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) \} \right) - \ln k \\
&\quad \geq -\eta \sum_{t=1}^{T} \hat l_{i,j}(t) - \ln k
\end{aligned}
\]

Combining this with (3), we get

\[
-\eta \sum_{t=1}^{T} \hat l_{i,j}(t) - \ln k \leq -\eta \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) + \frac{\eta^2}{2} \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2
\]

Multiplying both sides by $\frac{1}{\eta}$ and rearranging, we get

\[
\sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, \hat l_{i,j}(t) \leq \sum_{t=1}^{T} \hat l_{i,j}(t) + \frac{\eta}{2} \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t) \left( \hat l_{i,j}(t) \right)^2 + \frac{1}{\eta} \ln k
\]

Using the loss estimate rule and starting from $t = d + 1$ (for simplicity), we get

\[
\begin{aligned}
\frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} p_{i,j}(t) \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}
&\leq \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \\
&\quad + \frac{\eta}{2} \sum_{t=d+1}^{T} \sum_{i=1}^{k} p_{i,j}(t) \left( \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right)^2 + \frac{1}{\eta} \ln k
\end{aligned} \tag{4}
\]

We bound the term on the left-hand side of (4) and the second term on its right-hand side separately. We start by lower bounding the term on the left-hand side. A repeated application of Lemma 3, for $\tilde t = 0, \cdots, d$, yields

\[
\begin{aligned}
p_{i,j}(t) &\geq p_{i,j}(t-1) - \frac{e\eta}{d+1} \sum_{r=0}^{d} \frac{p_{i,j}(t-1-r)\, I_{i,j}(t-1-r, t)}{q_{i,j}(t-1-r, t)} \\
&\geq p_{i,j}(t-2) - \frac{e\eta}{d+1} \sum_{r=0}^{d} \frac{p_{i,j}(t-2-r)\, I_{i,j}(t-2-r, t)}{q_{i,j}(t-2-r, t)} - \frac{e\eta}{d+1} \sum_{r=0}^{d} \frac{p_{i,j}(t-1-r)\, I_{i,j}(t-1-r, t)}{q_{i,j}(t-1-r, t)} \\
&\geq p_{i,j}(t-\tilde t) - \frac{e\eta}{d+1} \sum_{h=1}^{\tilde t} \sum_{r=0}^{d} \frac{p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)}
\end{aligned}
\]

Therefore,

\[
\begin{aligned}
&\frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} p_{i,j}(t) \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \\
&\quad \geq \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \left( p_{i,j}(t-\tilde t) - \frac{e\eta}{d+1} \sum_{h=1}^{\tilde t} \sum_{r=0}^{d} \frac{p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)} \right) \\
&\quad = \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \\
&\qquad - \frac{e\eta}{(d+1)^2} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \sum_{h=1}^{\tilde t} \sum_{r=0}^{d} \frac{p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)}
\end{aligned} \tag{5}
\]

We now upper bound the second term on the right-hand side of (4).
\[
\begin{aligned}
\left( \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right)^2
&\leq \left( \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right)^2, \quad \text{given that } l_{i,j}(t-\tilde t, t) \leq 1 \\
&\leq \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{I_{i,j}(t-\tilde t, t)}{\left( q_{i,j}(t-\tilde t, t) \right)^2}, \quad \text{using Jensen's inequality}
\end{aligned}
\]

We recall, from the proof of Lemma 3, that a repeated application of Lemma 2 yields

\[
p_{i,j}(t) \leq \left( 1 + \frac{1}{d} \right)^{\tilde t} p_{i,j}(t-\tilde t) \leq e\, p_{i,j}(t-\tilde t), \quad \text{given that } e \text{ approximates } \left( 1 + \frac{1}{d} \right)^{\tilde t}
\]

such that

\[
\frac{\eta}{2} \sum_{t=d+1}^{T} \sum_{i=1}^{k} p_{i,j}(t) \left( \frac{1}{d+1} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right)^2 \leq \frac{e\eta}{2(d+1)} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, I_{i,j}(t-\tilde t, t)}{\left( q_{i,j}(t-\tilde t, t) \right)^2} \tag{6}
\]

Combining (5) and (6) with (4) and rearranging, we get

\[
\begin{aligned}
&\underbrace{\frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}}_{\text{(I)}} \\
&\quad \leq \underbrace{\frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)}}_{\text{(II)}} \\
&\qquad + \underbrace{\frac{e\eta}{(d+1)^2} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \sum_{h=1}^{\tilde t} \sum_{r=0}^{d} \frac{p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)}}_{\text{(III)}} \\
&\qquad + \underbrace{\frac{e\eta}{2(d+1)} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, I_{i,j}(t-\tilde t, t)}{\left( q_{i,j}(t-\tilde t, t) \right)^2}}_{\text{(IV)}} + \frac{1}{\eta} \ln k
\end{aligned} \tag{7}
\]

We take expectation $\mathbb{E}_{t-\tilde t}$ on both sides and solve each term (I to IV) separately.
\[
\begin{aligned}
\mathbb{E}_{t-\tilde t}[\text{(I)}]
&= \mathbb{E}_{t-\tilde t}\left[ \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right] \\
&= \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} p_{i,j}(t-\tilde t)\, l_{i,j}(t-\tilde t, t)\, \mathbb{E}_{t-\tilde t}\left[ \frac{I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right] \\
&= \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} p_{i,j}(t-\tilde t)\, l_{i,j}(t-\tilde t, t), \quad \text{using Fact 2} \\
&= \sum_{t=d+1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, l_{i,j}(t), \quad \text{using Fact 3} \\
&= \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, l_{i,j}(t) - \sum_{t=1}^{d} \sum_{i=1}^{k} p_{i,j}(t)\, l_{i,j}(t) \\
&\geq \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, l_{i,j}(t) - d, \quad \text{as } l_{i,j}(t) \leq 1 \text{ and } \sum_{i=1}^{k} p_{i,j}(t) = 1
\end{aligned} \tag{8}
\]

\[
\begin{aligned}
\mathbb{E}_{t-\tilde t}[\text{(II)}]
&= \mathbb{E}_{t-\tilde t}\left[ \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right] \\
&= \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} l_{i,j}(t-\tilde t, t)\, \mathbb{E}_{t-\tilde t}\left[ \frac{I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right] \\
&= \frac{1}{d+1} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} l_{i,j}(t-\tilde t, t), \quad \text{using Fact 2} \\
&= \sum_{t=d+1}^{T} l_{i,j}(t) \\
&\leq \sum_{t=1}^{T} l_{i,j}(t)
\end{aligned} \tag{9}
\]

\[
\begin{aligned}
\mathbb{E}_{t-\tilde t}[\text{(III)}]
&= \mathbb{E}_{t-\tilde t}\left[ \frac{e\eta}{(d+1)^2} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \sum_{h=1}^{\tilde t} \sum_{r=0}^{d} \frac{p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)} \right] \\
&= \mathbb{E}_{t-\tilde t}\left[ \frac{e\eta}{(d+1)^2} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \sum_{h=1}^{\tilde t} \sum_{r=0}^{d} \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-\tilde t, t)\, q_{i,j}(t-h-r, t)} \right]
\end{aligned}
\]

We consider three cases, depending on the values of the indices $\tilde t$, $h$ and $r$.
Case 1: $t - \tilde t > t - h - r$

\[
\begin{aligned}
&\mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-\tilde t, t)\, q_{i,j}(t-h-r, t)} \right] \\
&\quad = \mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)}\, \mathbb{E}_{t-\tilde t}\left[ \frac{I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right] \right] \\
&\quad = \mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)} \right], \quad \text{using Fact 2} \\
&\quad \leq \mathbb{E}\left[ \frac{p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)} \right], \quad \text{since } l_{i,j}(t-\tilde t, t) \leq 1 \\
&\quad \leq \mathbb{E}\left[ p_{i,j}(t-h-r)\, \mathbb{E}_{t-h-r}\left[ \frac{I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)} \right] \right] \\
&\quad \leq \mathbb{E}\left[ p_{i,j}(t-h-r) \right], \quad \text{using Fact 2}
\end{aligned}
\]

Case 2: $t - \tilde t < t - h - r$

\[
\begin{aligned}
&\mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-\tilde t, t)\, q_{i,j}(t-h-r, t)} \right] \\
&\quad = \mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)}{q_{i,j}(t-\tilde t, t)}\, \mathbb{E}_{t-h-r}\left[ \frac{I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)} \right] \right] \\
&\quad = \mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)}{q_{i,j}(t-\tilde t, t)} \right], \quad \text{using Fact 2} \\
&\quad \leq \mathbb{E}\left[ \frac{I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)}{q_{i,j}(t-\tilde t, t)} \right], \quad \text{as } l_{i,j}(t-\tilde t, t) \leq 1 \\
&\quad \leq \left( 1 + \frac{1}{d} \right)^{\tilde t - h - r} \mathbb{E}\left[ \frac{I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} \right], \quad \text{by repeatedly applying Lemma 2} \\
&\quad \leq e\, \mathbb{E}\left[ p_{i,j}(t-\tilde t)\, \mathbb{E}_{t-\tilde t}\left[ \frac{I_{i,j}(t-\tilde t, t)}{q_{i,j}(t-\tilde t, t)} \right] \right], \quad \text{as } e \text{ approximates } \left( 1 + \frac{1}{d} \right)^{\tilde t - h - r} \\
&\quad \leq e\, \mathbb{E}\left[ p_{i,j}(t-\tilde t) \right], \quad \text{using Fact 2}
\end{aligned}
\]

Case 3: $t - \tilde t = t - h - r$. Thus, $p_{i,j}(t-h-r) = p_{i,j}(t-\tilde t)$ and, given that $h \geq 1$, $\tilde t > r$ and $I_{i,j}(t-h-r, t)\, I_{i,j}(t-\tilde t, t) = I_{i,j}(t-h-r, t)$.
\[
\begin{aligned}
&\mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, I_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-\tilde t, t)\, q_{i,j}(t-h-r, t)} \right] \\
&\quad = \mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)\, I_{i,j}(t-h-r, t)}{q_{i,j}(t-\tilde t, t)\, q_{i,j}(t-h-r, t)} \right] \\
&\quad = \mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)}{q_{i,j}(t-\tilde t, t)}\, \mathbb{E}_{t-h-r}\left[ \frac{I_{i,j}(t-h-r, t)}{q_{i,j}(t-h-r, t)} \right] \right] \\
&\quad = \mathbb{E}\left[ \frac{l_{i,j}(t-\tilde t, t)\, p_{i,j}(t-h-r)}{q_{i,j}(t-\tilde t, t)} \right], \quad \text{using Fact 2} \\
&\quad \leq \mathbb{E}\left[ \frac{p_{i,j}(t-h-r)}{q_{i,j}(t-\tilde t, t)} \right], \quad \text{since } l_{i,j}(t-\tilde t, t) \leq 1 \\
&\quad \leq \mathbb{E}\left[ \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} \right]
\end{aligned}
\]

Considering all the three cases, putting everything together and overapproximating, we get

\[
\begin{aligned}
\mathbb{E}_{t-\tilde t}[\text{(III)}]
&\leq \mathbb{E}\left[ \frac{e^2\eta}{(d+1)^2} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \left( \sum_{\tilde t,h,r:\, \tilde t < h+r} p_{i,j}(t-h-r) + \sum_{\tilde t,h,r:\, \tilde t > h+r} p_{i,j}(t-\tilde t) + \sum_{\tilde t,h,r:\, \tilde t = h+r} \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} \right) \right] \\
&\leq \frac{e^2\eta}{(d+1)^2}\, \mathbb{E}\left[ \sum_{t=d+1}^{T} \left( \sum_{\tilde t,h,r:\, \tilde t \neq h+r} \sum_{i=1}^{k} p_{i,j}(t-\tilde t) + \sum_{\tilde t,h,r:\, \tilde t = h+r} \sum_{i=1}^{k} \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} \right) \right], \quad \text{as } \sum_{i=1}^{k} p_{i,j}(\cdot) = 1 \text{ in every time slot} \\
&\leq \frac{e^2\eta}{(d+1)^2}\, \mathbb{E}\left[ \sum_{t=d+1}^{T} \left( \sum_{\tilde t,h,r:\, \tilde t \neq h+r} \sum_{i=1}^{k} \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} + \sum_{\tilde t,h,r:\, \tilde t = h+r} \sum_{i=1}^{k} \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} \right) \right], \quad \text{as } q_{i,j}(t-\tilde t, t) \leq 1 \\
&\leq \frac{e^2\eta}{(d+1)^2} \sum_{t=d+1}^{T} \sum_{\tilde t,h,r} \mathbb{E}\left[ \sum_{i=1}^{k} \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} \right] \\
&\leq \frac{e^2\eta}{(d+1)^2} \sum_{t=d+1}^{T} \sum_{\tilde t,h,r} \frac{1}{\max\{\frac{1}{k},\, b\}}, \quad \text{using Lemma 1} \\
&\leq e^2\eta \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \frac{1}{d+1} \left( \frac{1}{\max\{\frac{1}{k},\, b\}} \right) \sum_{h=1}^{\tilde t} \sum_{r=0}^{d} \frac{1}{d+1} \\
&\leq e^2\eta \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \frac{\tilde t}{d+1} \left( \frac{1}{\max\{\frac{1}{k},\, b\}} \right), \quad \text{as } \sum_{r=0}^{d} \frac{1}{d+1} = 1 \text{ and } \sum_{h=1}^{\tilde t} 1 = \tilde t \\
&\leq e^2\eta \sum_{t=1}^{T} \sum_{\tilde t=0}^{d} \frac{\tilde t}{d+1} \left( \frac{1}{\max\{\frac{1}{k},\, b\}} \right) \\
&\leq \frac{e^2 \eta\, d\, T}{2 \max\{\frac{1}{k},\, b\}}, \quad \text{as } \sum_{\tilde t=1}^{d} \tilde t = \frac{d(d+1)}{2}
\end{aligned} \tag{10}
\]

\[
\begin{aligned}
\mathbb{E}_{t-\tilde t}[\text{(IV)}]
&= \mathbb{E}_{t-\tilde t}\left[ \frac{e\eta}{2(d+1)} \sum_{t=d+1}^{T} \sum_{i=1}^{k} \sum_{\tilde t=0}^{d} \frac{p_{i,j}(t-\tilde t)\, I_{i,j}(t-\tilde t, t)}{\left( q_{i,j}(t-\tilde t, t) \right)^2} \right] \\
&= \frac{e\eta}{2(d+1)} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \mathbb{E}_{t-\tilde t}\left[ \sum_{i=1}^{k} \frac{p_{i,j}(t-\tilde t)\, I_{i,j}(t-\tilde t, t)}{\left( q_{i,j}(t-\tilde t, t) \right)^2} \right] \\
&= \frac{e\eta}{2(d+1)} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \mathbb{E}_{t-\tilde t}\left[ \sum_{i=1}^{k} \frac{p_{i,j}(t-\tilde t)}{q_{i,j}(t-\tilde t, t)} \right], \quad \text{using Fact 2} \\
&\leq \frac{e\eta}{2(d+1)} \sum_{t=d+1}^{T} \sum_{\tilde t=0}^{d} \left( \frac{1}{\max\{\frac{1}{k},\, b\}} \right), \quad \text{using Lemma 1} \\
&\leq \frac{e\eta}{2(d+1)} \sum_{t=1}^{T} \sum_{\tilde t=0}^{d} \frac{1}{\max\{\frac{1}{k},\, b\}} \\
&\leq \frac{e \eta\, T}{2 \max\{\frac{1}{k},\, b\}}
\end{aligned} \tag{11}
\]

Combining (8), (9), (10) and (11) in (7), we get

\[
\sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, l_{i,j}(t) \leq \sum_{t=1}^{T} l_{i,j}(t) + \frac{e^2 \eta\, d\, T}{2 \max\{\frac{1}{k},\, b\}} + \frac{e \eta\, T}{2 \max\{\frac{1}{k},\, b\}} + \frac{1}{\eta} \ln k + d
\]

Let $L_{\text{Co-Bandit}}(T) = \sum_{t=1}^{T} \sum_{i=1}^{k} p_{i,j}(t)\, l_{i,j}(t)$ denote the expected aggregate loss of Co-Bandit, and $L_{\min}(T) = \sum_{t=1}^{T} l_{i,j}(t)$ be the aggregate loss of the best expert. Therefore, overapproximating and simplifying, we get

\[
\mathbb{E}\left[ L_{\text{Co-Bandit}}(T) - L_{\min}(T) \right] \leq \frac{e^2 (d+1)\, \eta\, T}{2 \max\{\frac{1}{k},\, b\}} + \frac{1}{\eta} \ln k + d
\]

We set the following value for $\eta$.

\[
\eta = \sqrt{\frac{2 \max\{\frac{1}{k},\, b\} \ln k}{e^2 (d+1)\, T}}
\]

In Lemma 2, we assumed that $\eta \leq \frac{1}{ke(d+1)}$. Therefore,

\[
\begin{aligned}
\sqrt{\frac{2 \max\{\frac{1}{k},\, b\} \ln k}{e^2 (d+1)\, T}} &\leq \frac{1}{ke(d+1)} \\
\frac{2 \max\{\frac{1}{k},\, b\} \ln k}{e^2 (d+1)\, T} &\leq \frac{1}{k^2 e^2 (d+1)^2} \\
T &\geq 2 k^2 (d+1) \max\left\{\tfrac{1}{k},\, b\right\} \ln k
\end{aligned}
\]

Since $\max\{\frac{1}{k},\, b\} \leq 1$, it suffices that $T \geq 2 k^2 (d+1) \ln k$.

Hence, we get

\[
\begin{aligned}
\mathbb{E}\left[ L_{\text{Co-Bandit}}(T) - L_{\min}(T) \right]
&\leq e \sqrt{\frac{(d+1)\, T \ln k}{2 \max\{\frac{1}{k},\, b\}}} + e \sqrt{\frac{(d+1)\, T \ln k}{2 \max\{\frac{1}{k},\, b\}}} + d \\
&\leq e \sqrt{\frac{2 (d+1)\, T \ln k}{\max\{\frac{1}{k},\, b\}}} + d
\end{aligned}
\]

which concludes the proof.

APPENDIX B
PROOF OF CONVERGENCE

We follow the proofs given in [23] and [38]. We analyze the evolution of the probability of network $i$. In addition to a random choice from the probability distribution, a device also explores network(s) unheard of for at least $x$ time slots with probability $\frac{1}{n}$. We assume that $n$ tends to $\infty$; hence $\frac{1}{n}$ tends to zero.

We consider two cases, namely (1) when the loss of network $i$ has been observed by device $j$ or learnt from its neighbors, and (2) when the loss of network $i$ is unknown to device $j$. The former case occurs with probability $q$ and the latter with probability $1 - q$, as defined in Algorithm 1.

We start with the first case, i.e. when device $j$ has observed the loss of network $i$ or has learnt about it from its neighbors.

\[
\begin{aligned}
p_{i,j}(t+1) &= \frac{w_{i,j}(t+1)}{\sum_{m=1}^{k} w_{m,j}(t+1)}, \quad \text{from the probability update rule} \\
&= \frac{\dfrac{w_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}}{\sum_{m=1}^{k} \dfrac{w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}}, \quad \text{using the weight update rule} \\
&= \frac{w_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\sum_{m=1}^{k} w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}
\end{aligned} \tag{1}
\]

From the probability update rule, we have

\[
\begin{aligned}
p_{i,j}(t) &= \frac{w_{i,j}(t)}{\sum_{m=1}^{k} w_{m,j}(t)} \\
w_{i,j}(t) &= \sum_{m=1}^{k} w_{m,j}(t)\, p_{i,j}(t)
\end{aligned} \tag{2}
\]

Combining (2) with (1), we get

\[
\begin{aligned}
p_{i,j}(t+1) &= \frac{\sum_{m=1}^{k} w_{m,j}(t)\, p_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\sum_{m=1}^{k} \sum_{y=1}^{k} w_{y,j}(t)\, p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))} \\
&= \frac{p_{i,j}(t) \exp(-\eta\, \hat l_{i,j}(t))}{\sum_{m=1}^{k} p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}
\end{aligned}
\]

We drop the discrete time script $t$ and compute the continuous time process by deriving the limit of the probability update rule as $\eta \to 0$, a first order differential equation known as the replicator dynamic. This is consistent with Lemma 2 (in Appendix A) and Theorem 1 in which we assumed arbitrarily small values of $\eta$.
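As an illustration of the step above, the following minimal Python sketch checks numerically that the max-normalized weight update followed by the probability update gives the same distribution as the closed form $p_{i,j}(t+1) = p_{i,j}(t)\exp(-\eta\,\hat l_{i,j}(t)) / \sum_m p_{m,j}(t)\exp(-\eta\,\hat l_{m,j}(t))$. The function names, the weights, the loss estimates and the value of $\eta$ are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math

def update_weights(weights, loss_estimates, eta):
    """Weight update rule: scale each weight by exp(-eta * loss estimate),
    then divide by the largest scaled weight (the max-normalization)."""
    scaled = [w * math.exp(-eta * l) for w, l in zip(weights, loss_estimates)]
    largest = max(scaled)
    return [s / largest for s in scaled]

def probabilities(weights):
    """Probability update rule: normalize the weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def closed_form_update(p, loss_estimates, eta):
    """Closed form obtained by combining the two rules (the max cancels)."""
    scaled = [p_i * math.exp(-eta * l) for p_i, l in zip(p, loss_estimates)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical values, for illustration only.
weights = [1.0, 0.6, 0.3]
loss_estimates = [0.2, 0.5, 0.9]
eta = 0.1

p_now = probabilities(weights)
p_via_weights = probabilities(update_weights(weights, loss_estimates, eta))
p_closed_form = closed_form_update(p_now, loss_estimates, eta)

# Largest coordinate-wise difference between the two routes.
max_gap = max(abs(a - b) for a, b in zip(p_via_weights, p_closed_form))
print(max_gap)
```

Because the normalizing maximum appears in both numerator and denominator, it has no effect on the resulting probabilities, so it can be ignored when analyzing the dynamics.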
\[
p_{i,j} = \frac{p_{i,j} \exp(-\eta\, \hat l_{i,j})}{\sum_{m=1}^{k} p_{m,j} \exp(-\eta\, \hat l_{m,j})}
\]

\[
\begin{aligned}
\dot p_{i,j} &= \lim_{\eta \to 0} \frac{d p_{i,j}}{d\eta} \\
&= \lim_{\eta \to 0} \frac{d}{d\eta} \left( \frac{p_{i,j} \exp(-\eta\, \hat l_{i,j})}{\sum_{m=1}^{k} p_{m,j} \exp(-\eta\, \hat l_{m,j})} \right) \\
&= \lim_{\eta \to 0} \left( \frac{-\sum_{m=1}^{k} p_{m,j} \exp(-\eta\, \hat l_{m,j})\; p_{i,j} \exp(-\eta\, \hat l_{i,j})\, \hat l_{i,j} + p_{i,j} \exp(-\eta\, \hat l_{i,j}) \sum_{m=1}^{k} p_{m,j} \exp(-\eta\, \hat l_{m,j})\, \hat l_{m,j}}{\left( \sum_{m=1}^{k} p_{m,j} \exp(-\eta\, \hat l_{m,j}) \right)^2} \right) \\
&= \frac{-p_{i,j}\, \hat l_{i,j} \sum_{m=1}^{k} p_{m,j} + p_{i,j} \sum_{m=1}^{k} p_{m,j}\, \hat l_{m,j}}{\left( \sum_{m=1}^{k} p_{m,j} \right)^2} \\
&= p_{i,j} \sum_{m=1}^{k} p_{m,j} \left( \hat l_{m,j} - \hat l_{i,j} \right)
\end{aligned} \tag{3}
\]

We now consider the second case, i.e. when the loss of network $i$ is unknown to device $j$; we assume here that the losses of the other networks are known to device $j$.

\[
\begin{aligned}
p_{i,j}(t+1) &= \frac{w_{i,j}(t+1)}{\sum_{m=1}^{k} w_{m,j}(t+1)}, \quad \text{from the probability update rule} \\
&= \frac{\dfrac{w_{i,j}(t)}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}}{\sum_{m \in \mathcal{K} - \{i\}} \dfrac{w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t))}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}} + \dfrac{w_{i,j}(t)}{\max_{y \in \mathcal{K}} \{ w_{y,j}(t) \exp(-\eta\, \hat l_{y,j}(t)) \}}}, \quad \text{using the weight update rule} \\
&= \frac{w_{i,j}(t)}{\sum_{m \in \mathcal{K} - \{i\}} w_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) + w_{i,j}(t)}
\end{aligned} \tag{4}
\]

Combining (2) with (4), we get

\[
\begin{aligned}
p_{i,j}(t+1) &= \frac{\sum_{m=1}^{k} w_{m,j}(t)\, p_{i,j}(t)}{\sum_{m \in \mathcal{K} - \{i\}} \sum_{y=1}^{k} w_{y,j}(t)\, p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) + \sum_{m=1}^{k} w_{m,j}(t)\, p_{i,j}(t)} \\
&= \frac{p_{i,j}(t)}{\sum_{m \in \mathcal{K} - \{i\}} p_{m,j}(t) \exp(-\eta\, \hat l_{m,j}(t)) + p_{i,j}(t)}
\end{aligned}
\]

As for the first case, we drop the discrete time script $t$ and derive the limit of the probability update rule as $\eta \to 0$.
\[
p_{i,j} = \frac{p_{i,j}}{\sum_{m \in \mathcal{K} - \{i\}} p_{m,j} \exp(-\eta\, \hat l_{m,j}) + p_{i,j}}
\]

\[
\begin{aligned}
\dot p_{i,j} &= \lim_{\eta \to 0} \frac{d p_{i,j}}{d\eta} \\
&= \lim_{\eta \to 0} \frac{d}{d\eta} \left( \frac{p_{i,j}}{\sum_{m \in \mathcal{K} - \{i\}} p_{m,j} \exp(-\eta\, \hat l_{m,j}) + p_{i,j}} \right) \\
&= \lim_{\eta \to 0} \left( \frac{-p_{i,j} \left( -\sum_{m \in \mathcal{K} - \{i\}} p_{m,j} \exp(-\eta\, \hat l_{m,j})\, \hat l_{m,j} \right)}{\left( \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} \exp(-\eta\, \hat l_{m,j}) + p_{i,j} \right)^2} \right) \\
&= \frac{p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, \hat l_{m,j}}{\left( \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} + p_{i,j} \right)^2} \\
&= p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, \hat l_{m,j}
\end{aligned} \tag{5}
\]

Then, from (3) and (5), the expected change in $p_{i,j}$ with respect to the probability distribution of device $j$ over all networks in $\mathcal{K}$ is given as

\[
\begin{aligned}
\bar p_{i,j} &= \mathbb{E}_j[\dot p_{i,j}] \\
&= q\, \mathbb{E}_j\left[ \dot p_{i,j} \mid \hat l_{i,j} \text{ is known} \right] + (1-q)\, \mathbb{E}_j\left[ \dot p_{i,j} \mid \hat l_{i,j} \text{ is unknown} \right] \\
&= q\, \mathbb{E}\left[ p_{i,j} \sum_{m=1}^{k} p_{m,j} \left( \hat l_{m,j} - \hat l_{i,j} \right) \right] + (1-q)\, \mathbb{E}\left[ p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, \hat l_{m,j} \right] \\
&= q\, p_{i,j} \sum_{m=1}^{k} p_{m,j} (l_{m,j} - l_{i,j}) + (1-q)\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{m,j} \\
&= q\, p_{i,j} \left( \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} (l_{m,j} - l_{i,j}) + p_{i,j} (l_{i,j} - l_{i,j}) \right) + (1-q)\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{m,j} \\
&= q\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} (l_{m,j} - l_{i,j}) + (1-q)\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{m,j} \\
&= q\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{m,j} - q\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{i,j} + (1-q)\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{m,j} \\
&= p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{m,j} - q\, p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j}\, l_{i,j} \\
&= p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} (l_{m,j} - q\, l_{i,j})
\end{aligned}
\]

Taking expectation with respect to other devices' actions,

\[
\begin{aligned}
\xi_{i,j} &= \mathbb{E}_{-j}[\bar p_{i,j}] \\
&= p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} \left( \mathbb{E}_{-j}[l_{m,j}] - q\, \mathbb{E}_{-j}[l_{i,j}] \right) \\
&= p_{i,j} \sum_{m \in \mathcal{K} - \{i\}} p_{m,j} (l_{m,j} - q\, l_{i,j})
\end{aligned}
\]

Given that this replicator dynamics is the same as the one in [23], except that we have a factor $q$ of $l_{i,j}$