Networked Stochastic Multi-Armed Bandits with Combinatorial Strategies
Shaojie Tang
University of Texas at Dallas
Yaqin Zhou
Singapore University of Technology and Design
Abstract—In this paper, we investigate a substantially extended version of the classical MAB problem, called the networked combinatorial bandit problem. In particular, we consider a decision maker operating over networked bandits: each time a combinatorial strategy, e.g., a group of arms, is chosen, the decision maker receives a reward resulting from her strategy and also receives a side bonus from each neighbor of the chosen arms. This is motivated by many real applications such as online social networks, where friends can provide feedback on shared content; hence, if we promote a product to a user, we can also collect feedback from her friends on that product. We consider two types of side bonus in this study: side observation and side reward. Depending on the number of arms pulled at each time slot, we study two cases: single-play and combinatorial-play. This leaves us four scenarios to investigate in the presence of side bonus: Single-play with Side Observation, Combinatorial-play with Side Observation, Single-play with Side Reward, and Combinatorial-play with Side Reward. For each case, we present and analyze a zero-regret policy, i.e., a policy whose expected regret per time slot approaches zero as time goes to infinity. Extensive simulations validate the effectiveness of our results.
I. INTRODUCTION
A multi-armed bandit (MAB) problem is a basic sequential decision making problem defined by a set of strategies. At each decision epoch, a decision maker selects a strategy that involves a combination of random bandits or variables, and then obtains an observable reward. The decision maker learns to maximize the total reward obtained over a sequence of decisions through historical observations. MAB problems naturally capture the fundamental tradeoff between exploration and exploitation in sequential experiments: the decision maker must exploit strategies that did well in the past on one hand, and explore strategies that might yield higher gain on the other hand. MAB problems now play an important role in online computation under unknown environments, such as pricing and bidding in electronic commerce [?], [?], ad placement on web pages [?], source routing in dynamic networks [?], and opportunistic channel access in cognitive radio networks [?], [?].

In this paper, we investigate a substantially extended version of the classical MAB problem, called the networked combinatorial bandit problem. In particular, we consider a decision maker operating over networked bandits: each time a combinatorial strategy, e.g., a group of arms, is chosen, the decision maker receives a direct reward resulting from her strategy and also receives a side bonus (either an observation or a reward) from the neighbors of the chosen arms.

In this study, we take as input a relation graph G that represents the correlation among the K arms. In the standard setting, pulling an arm i yields the reward and observation X_{i,t}, while in the networked combinatorial bandit problem with side bonus, one also obtains side observations or even side rewards due to the similarity or potential influence among neighboring arms. We consider two types of side bonus in this work: (1) Side observation: by pulling arm i at time t, one gains the direct reward associated with i and also observes the rewards of its neighboring arms. Such side observation [?] is made possible in settings such as online social networks, where friends can provide feedback on shared content; hence, if we promote a product to a user, we can also collect feedback from her friends on that product. (2) Side reward: in many practical applications, such as recommendation in social networks, pulling an arm i not only yields side observations on its neighbors, but also earns extra rewards; that is, by pulling arm i one directly collects the rewards associated with i together with its neighboring arms. This setting is motivated by the observation that users are usually influenced by their friends when making purchasing decisions [?].

Despite the many existing results on MAB problems against unknown stochastic environments [?], [?], [?], [?], [?], the adopted formulations do not fit applications that involve either side bonus or an exponentially large number of candidate strategies. Several challenges face our new study. First, under the combinatorial setting the number of candidate strategies can be exponentially large; if one simply treats each strategy as an arm, the resulting regret bound is exponential in the number of variables or arms. Traditional MAB analysis assumes that all arms are independent, which is inappropriate in our setting. Moreover, in the presence of side bonus, how to appropriately leverage the additional information in order to gain higher rewards is another challenge.
To this end, we explore a more general formulation for networked combinatorial bandit problems under four scenarios, namely, single/combinatorial play with side observation and single/combinatorial play with side reward. The objective is to minimize the upper bound of the regret (or, equivalently, to maximize the total reward) over time. The contributions of this paper are as follows:

• For the Single-play with Side Observation case, we present the first distribution-free learning (DFL) policy, whose time and space complexity are bounded by O(K). Our policy achieves zero regret that does not depend on ∆_min, the minimum distance between the best static strategy and any other strategy.

• For the Combinatorial-play with Side Observation case, we present a learning policy with zero regret. Compared with the traditional MAB problem without side bonus, we reduce the regret bound significantly.

• For the Single-play with Side Rewards case, we develop a distribution-free zero-regret learning policy. We theoretically show that this scheme converges faster than any existing method.

• For the Combinatorial-play with Side Rewards case, by assuming that the combinatorial problem at each decision point can be solved optimally, we present the first distribution-free zero-regret policy.

We evaluate our proposed learning policies through extensive simulations, and the simulation results validate the effectiveness of our schemes.

The remainder of this paper is organized as follows. We first give a formal description of the networked combinatorial multi-armed bandit problem in Section II. We study the Single-play with Side Observation case in Section III. In Section IV, we study the Combinatorial-play with Side Observation case. The Single-play with Side Rewards case is discussed in Section V. In Section VI, we study the Combinatorial-play with Side Rewards case. We evaluate our policies via extensive simulations in Section VII. We review related work in Section VIII. We conclude this paper, and discuss limitations as well as future work, in Section IX. Most notations used in this paper are summarized in Table I.
II. MODELS AND PROBLEM FORMULATION
In the standard MAB problem, a K-armed bandit problem is defined by K distributions P_1, ..., P_K with respective means µ_1, ..., µ_K. When the decision maker pulls arm i at time t, she receives a reward X_{i,t}. We assume all rewards {X_{i,t}, i ∈ [1, K], t ≥ 1} are independent, and all {P_i} have support in [0, 1]. Let i = 1 denote the optimal arm, and ∆_i = µ_1 − µ_i be the difference between the best arm and arm i.

The relation graph G = (V, E) over the K arms describes the correlations among them, where an undirected link e(i, j) ∈ E indicates the correlation between two neighboring arms i and j. In the standard setting, pulling an arm i yields the reward and observation X_{i,t}, while in the networked combinatorial bandit problem with side bonus, one also obtains side observations or even side rewards from neighboring arms, due to the similarity or potential influence among them. Let N(i) denote the set of neighboring arms of arm i and N_i = {i} ∪ N(i). In this work, we consider two types of side bonus:

• Side observation: by pulling arm i at time t, one gains the reward X_{i,t} associated with i and also observes the reward X_{j,t} of each neighboring arm j ∈ N_i. This is motivated by many real applications; for example, in today's online social networks, friends can provide feedback on shared content, so if we promote a product to one user, we can also collect feedback from her friends on that product.

• Side reward: pulling an arm i not only yields side observations on its neighbors, but also collects rewards from them, i.e., the total reward is Σ_{j∈N_i} X_{j,t}. This setting is motivated by the observation that, in many practical applications such as recommendation in social networks, users are usually influenced by their friends when making purchasing decisions.

Depending on the number of arms pulled at each time slot, we study a single-play case and a combinatorial-play case.

• In the single-play case, the decision maker selects one arm at each time slot; e.g., the traditional MAB problem belongs to this category.

• In the combinatorial-play case, the decision maker must select a combination of M (M ≤ K) arms that satisfies given constraints. One example is online advertising: suppose an advertiser can place at most m advertisements on his website; he repeatedly selects a set of m advertisements and observes the click-through rate, with the goal of maximizing the average click-through rate. This problem can be formulated as a combinatorial MAB problem where each arm represents one advertisement, subject to the constraint that one can play at most m arms at each time slot. In the combinatorial case, at each time slot t, an M-dimensional strategy vector s_x is selected under some policy from the feasible strategy set F. By feasible we mean that each strategy satisfies the underlying constraints imposed on F. We use x = 1, ..., |F| to index the strategies of the feasible set F in decreasing order of average reward λ_x, e.g., s_1 has the largest average reward. Note that a strategy may consist of fewer than M random variables, as long as it satisfies the given constraints; we then set the value to 0 for any empty entry.

In either case, the objective is to minimize the long-term regret after n time slots, defined as the cumulative difference between the received reward and the optimal reward. Consequently, this leaves us four scenarios to investigate: Single-play with Side Observation, Combinatorial-play with Side Observation, Single-play with Side Reward, and Combinatorial-play with Side Reward.
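As a concrete illustration of this model, the following minimal Python sketch (all class and variable names are our own, not from the paper) simulates one round of the networked environment: pulling an arm returns its direct reward, the side observations of its neighbors, and the summed neighborhood reward used in the side-reward setting.

```python
import random

class NetworkedBandit:
    """Minimal sketch of the networked bandit environment (illustrative only)."""

    def __init__(self, means, neighbors):
        # means[i]: mean reward mu_i of arm i; rewards are Bernoulli with support in [0, 1]
        # neighbors[i]: closed neighborhood N_i = {i} ∪ N(i) from the relation graph G
        self.means = means
        self.neighbors = neighbors

    def pull(self, i):
        # Draw this round's rewards X_{j,t} for every arm (only some are revealed).
        x = [1.0 if random.random() < mu else 0.0 for mu in self.means]
        direct_reward = x[i]                                       # reward of the pulled arm
        side_observations = {j: x[j] for j in self.neighbors[i]}   # revealed in the side-observation setting
        side_reward = sum(x[j] for j in self.neighbors[i])         # B_{i,t} in the side-reward setting
        return direct_reward, side_observations, side_reward

# Example: 4 arms on a line graph 1-2-3-4 (0-indexed here).
bandit = NetworkedBandit(
    means=[0.3, 0.8, 0.5, 0.6],
    neighbors=[{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}],
)
print(bandit.pull(1))
```

The same toy environment is reused in the later sketches.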
We then describe the problem formulation for each case. We use I_t to denote the index of the arm (resp. strategy) selected by the decision maker at time slot t, and subscript 1 to denote the optimal arm (resp. strategy) in the four cases. We evaluate policies by the regret R_n, defined as the difference in total expected reward (over n rounds) between always playing the optimal strategy and playing arms according to the policy. We say a policy achieves zero regret if the expected average regret over time approaches zero as time goes to infinity, i.e., R_n/n → 0 as n → ∞.

TABLE I
SUMMARY OF NOTATIONS

Variable    Meaning
K           number of arms
M           number of selected arms
G           relation graph over the arms
X_{i,t}     observation/direct reward on arm i at time t
µ_i         mean of X_{i,t}
N_i         set of neighboring arms of arm i
∆_i         distance between the best strategy and strategy i
B_{i,t}     side reward received by arm i from N_i
O_{i,t}     number of times arm i has been observed by time t
O^b_{i,t}   number of times the side reward of arm i has been updated by time t
X̄_{i,t}     time-averaged observation of arm i by time t
H           vertex-induced subgraph of G composed of arms with ∆_i ≥ δ
C           clique cover of H
F           feasible strategy (arm or com-arm) set
R_{x,t}     direct reward on com-arm x at time t
σ_x         mean of R_{x,t}
Y_x         set of neighboring arms of the component arms of com-arm x
N           maximum of |Y_x| among all com-arms
CB_{x,t}    combinatorial side reward received by com-arm x from Y_x
∆_x         distance between the best strategy and strategy x
∆_min       minimum of ∆_x among all strategies

1) Single-play with Side Observation (SSO). In this case, the decision maker pulls an arm i, observes all X_{j,t}, j ∈ N_i, and receives the reward X_{i,t}. The regret by time slot n is written as

    R_n = Σ_{t=1}^{n} µ_1 − Σ_{t=1}^{n} X_{I_t,t}.    (1)

Here I_t denotes the index of the arm played at time t.

2) Combinatorial-play with Side Observation (CSO). Rather than pulling a single arm, the decision maker pulls a set of arms s_{I_t}, receives the reward R_{I_t,t} = Σ_{i∈s_{I_t}} X_{i,t}, and also observes the reward X_{j,t} of each neighboring arm j ∈ Y_{I_t}, where Y_{I_t} = ∪_{i∈s_{I_t}} N_i is the set of neighboring arms of the selected strategy I_t. Letting λ_1 denote the expected reward of the optimal strategy, the regret is defined as

    R_n = Σ_{t=1}^{n} λ_1 − Σ_{t=1}^{n} R_{I_t,t}.    (2)

3) Single-play with Side Rewards (SSR). Pulling an arm i yields a total reward

    B_{i,t} = Σ_{j∈N_i} X_{j,t}.

Therefore, the best arm is the one with the maximum expected total reward. Let u_i = Σ_{j∈N_i} µ_j denote the mean total reward of arm i, and u_1 the maximum such mean. The regret is

    R_n = Σ_{t=1}^{n} u_1 − Σ_{t=1}^{n} B_{I_t,t}.    (3)

Note that here the optimal arm may differ from the optimal arm under single-play with side observation.

4) Combinatorial-play with Side Rewards (CSR). Different from combinatorial-play with side observation, the decision maker directly obtains the rewards from all neighboring arms. That is, the total received reward includes the direct reward of strategy x and the side reward of its neighbors. Let Y_x = ∪_{i∈s_x} N_i be the set of neighboring arms of strategy x, and σ_x = Σ_{i∈Y_x} µ_i be the expected reward of s_x. The combinatorial reward at time slot t is written as CB_{I_t,t} = Σ_{i∈Y_{I_t}} X_{i,t}. We define the regret as

    R_n = Σ_{t=1}^{n} σ_1 − Σ_{t=1}^{n} CB_{I_t,t}.    (4)
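To make the distinction between the four scenarios concrete, here is a small sketch (hypothetical helper names, not taken from the paper) that computes the reward actually credited to the decision maker in one time slot under each scenario; note that in SSO and CSO the neighbor rewards are only observed, never credited.

```python
def rewards_per_scenario(x, chosen_arms, neighbors):
    """Reward credited in one time slot under the four scenarios (illustrative sketch).

    x            : list of realized rewards X_{j,t} for all arms
    chosen_arms  : the set s_{I_t} of pulled arms (a singleton in the single-play cases)
    neighbors    : neighbors[i] is the closed neighborhood N_i of arm i
    """
    # Y_{I_t}: union of the closed neighborhoods of the pulled arms
    y = set()
    for i in chosen_arms:
        y |= neighbors[i]

    sso = sum(x[i] for i in chosen_arms)   # SSO: direct reward of the single pulled arm
    cso = sum(x[i] for i in chosen_arms)   # CSO: direct reward of the com-arm, R_{I_t,t}
    ssr = sum(x[j] for j in y)             # SSR: B_{I_t,t}, reward of the whole neighborhood
    csr = sum(x[j] for j in y)             # CSR: CB_{I_t,t}, reward of Y_{I_t}
    return sso, cso, ssr, csr

# Example with the 4-arm line graph from before and com-arm {0, 2}.
neighbors = [{0, 1}, {0, 1, 2}, {1, 2, 3}, {2, 3}]
print(rewards_per_scenario([1.0, 0.0, 1.0, 0.0], {0, 2}, neighbors))
```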
III. SINGLE-PLAY WITH SIDE OBSERVATION

We start with the case of Single-play with Side Observation. In this case, the decision maker learns to select the arm with the maximum reward, while also observing the side information of its neighbors as defined by the relation graph. Our proposed policy, the first distribution-free learning policy for SSO, referred to as DFL-SSO, is shown in Algorithm 1. As shown in Lines 2–5, the decision maker updates the side information of all neighbors, i.e., the number of observations up to the current time and the time-averaged reward. The key idea behind the algorithm is that side observation potentially reduces the regret, since the decision maker can explore more without paying for it and thus gathers more history to exploit.

To theoretically analyze the benefit of side observation, we combine the techniques of graph partition and clique cover in a novel way. The standard analysis of the regret bound with side observation in the distribution-dependent case uses a clique cover of the relation graph and lets the arm with the maximum ∆_i inside each clique represent that clique. The standard proof of a distribution-free regret bound, on the other hand, divides the arms into two sets via a threshold ∆_c on ∆_i and then analyzes the bounds of the two sets of arms separately. Therefore, to obtain a distribution-free result, we cannot directly use the arm with maximum ∆_i inside a clique as the clique's representative, because the arms with ∆_i smaller than ∆_c are spread across the cliques. To address this issue, we first partition the relation graph G using the predefined threshold, and then analyze the benefit of side observation in the vertex-induced subgraph H containing the arms with ∆_i above ∆_c. In the subgraph H, it then becomes possible to analyze the distribution-free regret bound using the clique cover technique.

Theorem 1 quantifies the resulting benefit: the more side observation there is (e.g., the smaller the clique number), the smaller the upper bound on the regret.
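The analysis only requires some clique cover C of the subgraph H, not necessarily a minimum one. A simple greedy construction along the following lines (an illustrative sketch with hypothetical function names; the paper does not prescribe a particular cover) is sufficient in practice.

```python
def greedy_clique_cover(vertices, adj):
    """Greedily cover the vertex set with cliques (illustrative sketch).

    vertices : iterable of vertex ids of the subgraph H
    adj      : adj[v] is the set of vertices adjacent to v in H
    Returns a list of cliques (sets of vertices) whose union is `vertices`.
    """
    uncovered = set(vertices)
    cover = []
    while uncovered:
        v = uncovered.pop()
        clique = {v}
        # Grow the clique with uncovered vertices adjacent to every current member.
        for u in list(uncovered):
            if all(u in adj[w] for w in clique):
                clique.add(u)
                uncovered.discard(u)
        cover.append(clique)
    return cover

# Example: H is a triangle {0, 1, 2} plus an isolated vertex 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
print(greedy_clique_cover([0, 1, 2, 3], adj))   # e.g., [{0, 1, 2}, {3}]
```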
Theorem 1: The expected regret of Algorithm 1 after n time slots is bounded by

    R_n ≤ c_1 √(nK) + c_2 |C| √(n/K),    (6)

where c_1 and c_2 are absolute constants and C is a clique cover of the vertex-induced subgraph H of the relation graph G formed by the arms with ∆_i above the threshold δ.
Proof: The proof is based on our novel combination of graph partition and clique cover. We first partition the relation graph to rewrite the regret in terms of cliques, and then tighten the upper bound by analyzing the regret of the cliques.

Fig. 1. Graph partition: G is the relation graph, and H is the vertex-induced subgraph that is covered by cliques.

Algorithm 1 Distribution-Free Learning policy for single-play with side observation (DFL-SSO)
for each time slot t = 0, 1, ..., n do
    Select and pull the arm i that maximizes
        X̄_{i,t} + √( log( t/(K O_{i,t}) ) / O_{i,t} )    (5)
    for k ∈ N_i do
        O_{k,t+1} ← O_{k,t} + 1
        X̄_{k,t+1} ← X_{k,t}/O_{k,t} + (1 − 1/O_{k,t}) X̄_{k,t}
    end for
end for
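As an aid to the pseudocode above, here is a minimal Python sketch of the DFL-SSO loop, assuming the toy NetworkedBandit environment sketched in Section II; the averaging step uses a standard incremental mean, and all names are illustrative rather than taken from the paper.

```python
import math

def dfl_sso(bandit, n_slots):
    """Minimal sketch of DFL-SSO (Algorithm 1): a UCB-style index with side observations."""
    k = len(bandit.means)
    obs_count = [0] * k      # O_{i,t}: number of times arm i has been observed
    avg_reward = [0.0] * k   # time-averaged observation of arm i

    def index(i, t):
        if obs_count[i] == 0:
            return float("inf")   # force at least one observation of each arm
        bonus = math.sqrt(max(math.log(max(t, 1) / (k * obs_count[i])), 0.0) / obs_count[i])
        return avg_reward[i] + bonus

    total_reward = 0.0
    for t in range(1, n_slots + 1):
        i = max(range(k), key=lambda a: index(a, t))
        direct, side_obs, _ = bandit.pull(i)
        total_reward += direct
        # Side observations let us update every neighbor of i, not just i itself.
        for j, x in side_obs.items():
            obs_count[j] += 1
            avg_reward[j] += (x - avg_reward[j]) / obs_count[j]
    return total_reward
```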
1. Partition the relation graph and rewrite the regret of the subgraph H in terms of cliques.

We order the arms in increasing order of ∆_i and use the threshold δ = α√(K/n), with ∆_c ≤ δ ≤ ∆_{c+1}, to split the K arms into two disjoint sets: one set K_1 with ∆_i ≤ ∆_c and the other set K_2 with ∆_i > ∆_c (the value of α is set later in the analysis); here c is the index of the last arm satisfying ∆_i ≤ ∆_c. We remove all arms in K_1 from the relation graph G, together with the edges adjacent to them. In this way we obtain a subgraph H of G over the arms in K_2. The regret satisfies

    R(n) ≤ n ∆_c + R_H(n),    (7)

where R_H(n) is the regret generated by selecting suboptimal arms in K_2. Consider a clique covering C of H, i.e., a set of cliques such that each c ∈ C is a clique and V(H) = ∪_{c∈C} c. For any c ∈ C we define the clique regret R_c(n) as the regret accumulated by plays of arms in clique c.

2. Regret analysis for the subgraph H.

In the rest of the proof we focus on upper bounding R_H(n). Let ∆_c = max_{i∈c} ∆_i, and let T_c(t) = Σ_{i∈c} T_i(t) denote the number of times (any arm in) clique c has been played up to time t, where T_i(t) is the number of times arm i has been selected up to time t. We suppose the cliques are ordered in increasing order of ∆_c. Let v_j = µ_1 − ∆_j for the cliques covering K_2, c ≤ j ≤ K, let v_c = µ_1 − ∆_c, and let ∆_{K+1} = +∞; for ease of description we also use the index c = 0 to denote the boundary case. As every arm in a clique c must be observed the same number of times, for each clique and any l ≥ 1 we have

    R_c = Σ_{i∈c} ∆_i T_i(n) ≤ l max_{i∈c} ∆_i + Σ_{i∈c} Σ_{t≥l} ∆_i 1{I_t = i}.    (10)

Meanwhile,

    R_H(n) = Σ_{c∈C} R_c ≤ Σ_{c∈C} l ∆_c + Σ_{i=1}^{K} ∆_i T'_i(n),    (11)

where T'_i(n) denotes the number of times arm i is played after t = l; we refer to the second term as R'_H(n). Define

    W = min_{1≤t≤n} W_{1,t},    (12)

where W_{i,t} denotes the index of arm i at time t, and

    U_{j,i} = 1{ W ∈ [v_{j+1}, v_j) } ∆_i T'_i(n).    (13)

For R'_H(n) we then have

    R'_H(n) = Σ_{i=c}^{K} ∆_i T'_i(n)    (14)
            = Σ_{j=c}^{K} Σ_{i=1}^{j} U_{j,i} + Σ_{j=c}^{K} Σ_{i=j+1}^{K} U_{j,i}.    (15)

For the first term of (15), we have

    Σ_{j=c}^{K} Σ_{i=1}^{j} U_{j,i} ≤ Σ_{j=c}^{K} 1{ W ∈ [v_{j+1}, v_j) } n ∆_j    (16)
        = n ∆_c + n Σ_{c'=1}^{|C|} 1{ W ≤ v_{c'} } ( ∆_{c'} − ∆_{c'−1} ),    (17)

where the first inequality holds because ∆_j ≥ ∆_i and T_i ≤ n. To bound the second term of (15), we record

    τ_i = min{ t : W_{i,t} < v_i }    (18)

after time l. To pull a suboptimal arm i at time t, one must have W_{i,t} > W_{1,t} ≥ W. By Algorithm 1, we have {W ≥ v_i} ⊂ {T'_i(n) ≤ τ_i}, since once arm i has been pulled τ_i times its index is always lower than the index of arm 1. Therefore,

    R(n) ≤ n ∆_c + Σ_{c∈C} l ∆_c + Σ_{i=1}^{K} ∆_i E[τ_i | t > l] + n Σ_{c'∈C} P( W ≤ v_{c'} )( ∆_{c'} − ∆_{c'−1} ).

The expectation term contributes at most a term of order

    12 log( (α + 1)/(α − 1) ) √(nK),    (39)

and for the probability term we have

    n Σ_{c'=0}^{|C|} P( W ≤ v_{c'} )( ∆_{c'} − ∆_{c'−1} ) ≤ n ( δ − ∆_c ) + ( 8 log(eα²)/α² + 12 log( (α+1)/(α−1) ) ) √(nK).

Finally, the regret is bounded by

    R_n ≤ Σ_{c∈C} e √(n/K) + ( α + 8 log(eα²)/α + 12 log( (α+1)/(α−1) ) + α/( a(1−a) ) ) √(nK).    (40)

Setting α = e and using the value of a chosen in the preceding analysis, we obtain

    R_n ≤ c_1 √(nK) + c_2 |C| √(n/K).    (41)

IV. COMBINATORIAL-PLAY WITH SIDE OBSERVATION

In this section, we consider combinatorial-play with side observation. An intuitive extension is to treat each strategy as an arm (we call it a com-arm) and then apply the algorithm for SSO. The key question, however, is how to utilize the side observations on arms defined in the relation graph to gain more observations on com-arms, that is, how to define neighboring com-arms.
To this end, we introduce the concept of a strategy relation graph to model the correlation among com-arms, by which we convert the CSO problem into an SSO problem.

The construction of the strategy relation graph is as follows. We define the strategy relation graph SG(F, L) for the strategies in F, where F is the vertex set and L is the edge set. Each strategy s_x is denoted by a vertex, and a link l = (s_x, s_y) in L connects two distinct vertices s_x and s_y if the component arms of s_y are contained in Y_x, and vice versa. This neighbor definition for strategies is natural: once a strategy is played, the union of the neighbors of its component arms is observed according to the neighbor definition for arms in G, so the reward of any strategy composed of these observed arms is also observed.

We give an example in Fig. 2. There are four arms in the relation graph G, indexed by i = 1, 2, 3, 4, with N_1 = {1, 2}, N_2 = {1, 2, 3}, N_3 = {2, 3, 4}, and N_4 = {3, 4}. The combinatorial MAB problem is to select a maximum-weight independent set of arms, where the weights are the unknown bandit rewards. As shown in Fig. 2, the feasible strategy set for this problem consists of the independent sets of arms in G:

    s_1 = {1},    ∪_{i∈s_1} N_i = {1, 2}
    s_2 = {2},    ∪_{i∈s_2} N_i = {1, 2, 3}
    s_3 = {3},    ∪_{i∈s_3} N_i = {2, 3, 4}
    s_4 = {4},    ∪_{i∈s_4} N_i = {3, 4}
    s_5 = {1, 3}, ∪_{i∈s_5} N_i = {1, 2, 3, 4}
    s_6 = {1, 4}, ∪_{i∈s_6} N_i = {1, 2, 3, 4}
    s_7 = {2, 4}, ∪_{i∈s_7} N_i = {1, 2, 3, 4}

Fig. 2. Converting combinatorial-play to single-play: constructing the strategy relation graph SG(F, L) based on the arm relation graph G.

Taking s_5 and s_2 for illustration, the component arms of s_2, i.e., {2}, form a subset of ∪_{i∈s_5} N_i = {1, 2, 3, 4}, and the component arms of s_5, i.e., {1, 3}, form a subset of ∪_{i∈s_2} N_i = {1, 2, 3}. Therefore, the two strategies are connected in the relation graph SG.

Consequently, we can convert the combinatorial-play MAB with side observation into a single-play MAB with side observation. More specifically, taking each strategy as an arm, SG(F, L) is exactly a relation graph for the com-arms in F. The problem becomes a single-play MAB problem in which, at each time slot, the decision maker selects one com-arm out of the |F| com-arms to maximize her long-term reward. The algorithm is shown in Algorithm 2, and the regret bound below follows directly.

Theorem 2: The expected regret of Algorithm 2 after n time slots is bounded by

    R_n ≤ c_1 √(n|F|) + c_2 |C| √(n/|F|),    (43)

where C is now the corresponding clique cover in the strategy relation graph SG. In the traditional distribution-free MAB obtained by taking each com-arm as an unknown variable [?], the regret bound would be of order √(n|F|). Our theoretical result significantly reduces the regret and tightens the bound.

Algorithm 2 Distribution-Free Learning policy for combinatorial-play with side observation (DFL-CSO)
for each time slot t = 0, 1, ..., n do
    Select and pull the com-arm s_x that maximizes
        R̄_{x,t} + √( log( t/(K O_{x,t}) ) / O_{x,t} )    (42)
    UPDATE:
    for each neighboring com-arm s_y of s_x in SG (including s_x itself) do
        O_{y,t+1} ← O_{y,t} + 1
        R̄_{y,t+1} ← R_{y,t}/O_{y,t} + (1 − 1/O_{y,t}) R̄_{y,t}
    end for
end for
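The conversion from CSO to SSO only requires the neighbor sets of com-arms in SG. One way to implement the construction defined above is sketched below (hypothetical function names; the both-way containment test follows the illustration with s_2 and s_5).

```python
def strategy_relation_graph(strategies, neighbors):
    """Build the strategy relation graph SG(F, L) from the arm relation graph (sketch).

    strategies : list of com-arms, each a set of arm ids (the feasible set F)
    neighbors  : neighbors[i] is the closed neighborhood N_i of arm i in G
    Returns the edge set L as index pairs (x, y) with x < y.
    """
    def observed(strategy):
        # Y_x: all arms observed when com-arm x is played
        y = set()
        for i in strategy:
            y |= neighbors[i]
        return y

    y_sets = [observed(s) for s in strategies]
    edges = set()
    for x in range(len(strategies)):
        for y in range(x + 1, len(strategies)):
            # s_y's component arms are observable when s_x is played, and vice versa
            if strategies[y] <= y_sets[x] and strategies[x] <= y_sets[y]:
                edges.add((x, y))
    return edges

# Example: the four-arm line graph and the seven feasible independent sets from Fig. 2.
neighbors = [None, {1, 2}, {1, 2, 3}, {2, 3, 4}, {3, 4}]   # arms are 1-indexed
F = [{1}, {2}, {3}, {4}, {1, 3}, {1, 4}, {2, 4}]
print(strategy_relation_graph(F, neighbors))
```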
V. SINGLE-PLAY WITH SIDE REWARDS

Although the single-play MAB with side reward provides the same observations as the single-play MAB with side observation, the difference in the reward function makes the problem different. In the SSR case, the reward is the side reward of the selected arm I_t rather than its direct reward. Here we treat the side reward of each arm as a new unknown random variable; i.e., we must learn B_{i,t}, which is a combination of all direct rewards in N_i. As the direct rewards of the arms in N_i are observed asynchronously, we cannot update the observation count of B_{i,t} in the same way as in SSO, where observation is symmetric between two neighboring nodes. The trick is to update the number of observations of B_{i,t} only when the direct rewards of all arms in N_i have been renewed. We use O^b_{i,t} to denote this quantity, to distinguish it from O_{i,t}, which counts how many times the direct reward has been observed. Therefore, whenever an arm or one of its neighbors is played, the number of observations of the side reward O^b_{i,t} is updated only when the least frequently observed arm in N_i is updated. That is,

    O^b_{i,t} = O^b_{i,t−1} + 1   if min_{j∈N_i} O_{j,t} is updated,
    O^b_{i,t} = O^b_{i,t−1}       otherwise.    (44)

The algorithm for the single-play MAB with side reward is summarized in Algorithm 3, where we directly use the side reward B_{i,t} as the observation and update O^b_{i,t} according to (44). The regret bound of the proposed algorithm is presented in Theorem 3.

Algorithm 3 Distribution-Free Learning policy for single-play with side reward (DFL-SSR)
for each time slot t = 0, 1, ..., n do
    Select and pull the arm i that maximizes
        B̄_{i,t} + √( log( t/(K O^b_{i,t}) ) / O^b_{i,t} )    (45)
    for k ∈ N_i do
        O_{k,t+1} ← O_{k,t} + 1
        if min_{j∈N_k} O_{j,t} is updated then
            O^b_{k,t+1} ← O^b_{k,t} + 1
            B̄_{k,t+1} ← B_{k,t}/O^b_{k,t} + (1 − 1/O^b_{k,t}) B̄_{k,t}
        end if
    end for
end for

Theorem 3: The expected regret of Algorithm 3 after n time slots is bounded by

    R_n ≤ c K √(nK),    (46)

where c is the absolute constant inherited from the MOSS regret bound.

Proof: In this case, B_{i,t} ∈ [0, K], which means that the range of the received reward is scaled by at most K. We normalize B_{i,t} to [0, 1]. Using the same techniques as in the proof of the MOSS algorithm [?], we obtain the normalized regret bound, and then the bound in (46) by scaling the normalized regret bound by K. In Algorithm 3, the number of observations of the side reward is no smaller than in the scenario without side observation. Therefore, Algorithm 3 converges to the optimum faster than the MOSS algorithm without side observation.

VI. COMBINATORIAL-PLAY WITH SIDE REWARDS

We now consider the combinatorial-play case with side reward. Recall that in this scenario we must select the com-arm s_x with the maximum side reward, where the side reward is the sum of the observed rewards of all arms neighboring the arms in s_x. This case is more complicated than the previous three, due to: 1) asymmetric observations of the side reward for neighboring nodes within one clique; and 2) a possibly exponential number of strategies caused by arbitrary constraints. It is therefore difficult to analyze the regret bound with the same technique as for combinatorial-play with side observation. Instead of learning the side reward of strategies directly, we learn the direct rewards of the arms that compose the com-arms.

Algorithm 4 Distribution-Free Learning policy for combinatorial-play with side reward (DFL-CSR)
for each time slot t = 0, 1, ..., n do
    Select and pull the com-arm s_x that maximizes
        Σ_{i∈Y_x} ( X̄_{i,t} + √( ln⁺( t/(K O_{i,t}) ) / O_{i,t} ) )    (47)
    for k ∈ Y_x do
        O_{k,t+1} ← O_{k,t} + 1
        X̄_{k,t+1} ← X_{k,t}/O_{k,t} + (1 − 1/O_{k,t}) X̄_{k,t}
    end for
end for

Theorem 4: The expected regret of Algorithm 4 after n time slots is bounded by

    R(n) ≤ NK + ( √e K + 8(1 + N) N² ) √n + (1 + 4√K N/e) N √(Kn),    (48)

where N ≤ K is the maximum of |Y_x| over x = 1, ..., |F|.

Proof: See Appendix.
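For concreteness, the per-slot selection step of DFL-CSR (Equation (47)) can be sketched as follows. The brute-force enumeration over F is only for illustration; in practice the maximization would be handed to the problem-specific combinatorial solver that the paper assumes is available. All names are hypothetical.

```python
import math

def dfl_csr_select(strategies, y_sets, avg_reward, obs_count, t, k):
    """Pick the com-arm maximizing the optimistic index of Eq. (47) (illustrative sketch).

    strategies : list of com-arms (the feasible set F)
    y_sets     : y_sets[x] is Y_x, the arms observed and rewarded when com-arm x is played
    avg_reward : avg_reward[i] is the empirical mean of arm i's direct reward
    obs_count  : obs_count[i] is O_{i,t}
    t, k       : current time slot and number of arms K
    """
    def arm_index(i):
        if obs_count[i] == 0:
            return float("inf")   # unobserved arms get an infinite exploration bonus
        bonus = math.sqrt(max(math.log(max(t, 1) / (k * obs_count[i])), 0.0) / obs_count[i])
        return avg_reward[i] + bonus

    best_x = max(range(len(strategies)),
                 key=lambda x: sum(arm_index(i) for i in y_sets[x]))
    return best_x
```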
Fig. 3. Comparison of regret, MOSS vs. DFL-SSO: (a) expected regret; (b) accumulated regret.

Fig. 4. Expected regret of DFL-CSO: (a) sparse relation graph; (b) dense relation graph.

VII. SIMULATION

In this section, we evaluate the performance of the proposed algorithms in simulations. We mainly analyze the regret generated by each algorithm after a long horizon of n = 10000 time slots.

We first evaluate the regret generated by DFL-SSO and compare it with the MOSS learning policy. The experiment setting is as follows. We randomly generate a relation graph over the arms, each arm following an i.i.d. random process over time with mean in [0, 1]. We then plot the accumulated regret and the expected regret over time, as shown in Fig. 3. Although the expected regret over time of MOSS converges to a value that coincides with its theoretical bound (Fig. 3(a)), its accumulated regret grows dramatically. The proposed algorithm with side information clearly performs much better than MOSS; both the accumulated regret and the expected regret of DFL-SSO converge quickly.

For the other algorithms, since we are the first to study these variants of the MAB problem, there are no candidate algorithms to compare against, so we show the trend of the expected regret over time for each case. In the evaluation of Algorithm 2, note that the regret bound contains two terms: the number of com-arms and the number of cliques. The upper bound becomes huge if the number of com-arms is voluminous, while a small clique number can significantly reduce the bound. To investigate this impact experimentally, we test the regret under both a sparse relation graph and a dense relation graph. In Fig. 4(a), where the arms are uniformly and randomly connected with a low probability, the expected regret increases only slowly. In Fig. 4(b), where the arms are uniformly and randomly connected with a higher probability, the expected regret gradually converges. This implies that side observation indeed helps to reduce the regret when one can observe more, even in the case where the previous literature shows that learning each individual com-arm of a huge feasible strategy set introduces exponential regret [?]. The simulation results for Algorithms 3 and 4 are shown in Figs. 5 and 6, where the expected regret converges quickly in both cases.

Fig. 5. Expected regret of DFL-SSR.

Fig. 6. Expected regret of DFL-CSR.

VIII. RELATED WORKS

The classical multi-armed bandit problem does not assume the existence of side bonus. More recently, [?] and [?] considered the networked bandit problem in the presence of side observations. They study the single-play case and propose several policies whose regret bounds depend on ∆_min; e.g., an arbitrarily small ∆_min invalidates the zero-regret result. In this work, we present the first distribution-free policy for the single-play with side observation case.

For the variant with combinatorial play but without side bonus, Anantharam et al. [?] first considered the problem in which exactly N arms are selected simultaneously without constraints among the arms. Gai et al. recently extended this version to a more general problem with arbitrary constraints [?]; the model is also relaxed to a linear combination of no more than N arms. However, the results presented in [?] are distribution-dependent. To this end, we are the first to study the combinatorial-play case in the presence of side bonus.
In particular, for the combinatorial-play with side observation case, we develop a distribution-free zero-regret learning policy and theoretically show that this scheme converges faster than the existing method. For the combinatorial-play with side reward case, we propose the first distribution-free learning policy that has zero regret.

IX. CONCLUSION

In this paper, we investigate networked combinatorial bandit problems under four cases, motivated by the potential correlation or influence among neighboring arms. We present and analyze a series of zero-regret policies, one for each case. In the future, we are interested in investigating heuristics to further improve the regret achieved in practice. For example, at each time slot, instead of playing the selected arm/strategy with the maximum index value (Equations (5), (42)), we could play the arm/strategy with the maximum empirical average observation among the neighbors of I_t, thereby ensuring that the received reward is no worse than that of the arm/strategy with the maximum index value.

X. APPENDIX

A. Proof of Theorem 4

To prove the theorem, we use the Chernoff-Hoeffding bound and the maximal inequality of Hoeffding [?].

Lemma 1 (Chernoff-Hoeffding bound [?]): Let ξ_1, ..., ξ_n be random variables with range [0, 1] such that E[ξ_t | ξ_1, ..., ξ_{t−1}] = µ for all 1 ≤ t ≤ n, and let S_n = Σ_t ξ_t. Then for all a > 0,

    P( S_n ≥ nµ + a ) ≤ exp( −2a²/n ),   P( S_n ≤ nµ − a ) ≤ exp( −2a²/n ).    (49)

Lemma 2 (Maximal inequality [?]): Let ξ_1, ..., ξ_n be i.i.d. random variables with mean µ. Then for any y > 0 and n ≥ 1,

    P( ∃ τ ∈ {1, ..., n} : Σ_{t=1}^{τ} ( µ − ξ_t ) > y ) ≤ exp( −2y²/n ).    (50)

Each com-arm s_x together with its neighboring arm set Y_x composes a new com-arm, which we also denote by Y_x since s_x ⊆ Y_x. Each new com-arm Y_x corresponds to an unknown bonus CB_{x,t} with mean σ_x. Recall that we have assumed σ_1 ≥ ... ≥ σ_{|F|}. As com-arm Y_1 is the optimal com-arm, we have ∆_x = σ_1 − σ_x, and we let Z_x = σ_1 − ∆_x. We further define W = min_{1≤t≤n} W_{1,t}, where W_{x,t} denotes the index of com-arm x at time t, and let z = arg min_{1≤t≤n} W_{1,t} denote the time slot at which this minimum is attained.

1. Rewriting the regret in terms of arms.

Separating the strategies into two sets according to the gap ∆_{x_0} of some com-arm s_{x_0} (the index x_0 is fixed later in the proof), we have

    R_n = Σ_{x=1}^{x_0} ∆_x E[T_{x,n}] + Σ_{x=x_0+1}^{|F|} ∆_x E[T_{x,n}] ≤ ∆_{x_0} n + Σ_{x=x_0+1}^{|F|} ∆_x E[T_{x,n}].    (51)

We then analyze the second term of (51). As there may be an exponential number of strategies, counting T_{x,n} for each com-arm by the classic upper-confidence-bound analysis yields a regret growing linearly with the number of strategies. Noting that each com-arm consists of at most N arms, we instead rewrite the regret in terms of arms. We introduce a set of counters {T̃_{k,n} | k = 1, ..., K}. At each time slot, either 1) a com-arm with ∆_x ≤ ∆_{x_0} or 2) a com-arm with ∆_x > ∆_{x_0} is played. In the first case, no T̃_{k,n} is updated. In the second case, we increase T̃_{k,n} by one for the arm k = arg min_{j∈Y_x} O_{j,t}. Thus, whenever a com-arm with ∆_x > ∆_{x_0} is chosen, exactly one counter in {T̃_{k,n}} increases by one. This implies that the total number of times strategies with ∆_x > ∆_{x_0} have been played equals the sum of all counters, i.e., Σ_{x=x_0+1}^{|F|} E[T_{x,n}] = Σ_{k=1}^{K} E[T̃_{k,n}]. Thus, we can rewrite the second term of (51) as

    Σ_{x=x_0+1}^{|F|} ∆_x E[T_{x,n}] ≤ ∆_max Σ_{x=x_0+1}^{|F|} E[T_{x,n}] ≤ ∆_max Σ_{k=1}^{K} E[T̃_{k,n}],    (52)

where ∆_max = max_x ∆_x. Let I_{k,t} be the indicator that equals 1 if T̃_{k,n} is updated at time slot t.
Define the indicator function 1{y} = 1 if the event y happens and 0 otherwise. When I_{k,t} = 1, a com-arm Y_x with x > x_0 has been played for which O_{k,t} = min{ O_{j,t} : j ∈ Y_x }. Then

    T̃_{k,n} = Σ_{t=1}^{n} 1{ I_{k,t} = 1 }    (53)
            ≤ Σ_{t=1}^{n} 1{ W_{1,t} ≤ W_{x,t} }    (54)
            ≤ Σ_{t=1}^{n} 1{ W ≤ W_{x,t} }    (55)
            ≤ Σ_{t=1}^{n} 1{ W ≤ W_{x,t}, W ≥ Z_x }    (56)
              + Σ_{t=1}^{n} 1{ W ≤ W_{x,t}, W < Z_x }    (57)
            = T̃¹_{k,n} + T̃²_{k,n},    (58)

where T̃¹_{k,n} and T̃²_{k,n} denote the sums in (56) and (57), respectively. Next we show that both terms are bounded.

2. Bounding T̃¹_{k,n}.

Note that the events {W ≥ Z_x} and {W_{x,t} > W} together imply the event {W_{x,t} > Z_x}. Let ln⁺(y) = max(ln(y), 0). For any positive integer l, we have

    T̃¹_{k,n} ≤ Σ_{t=1}^{n} 1{ W_{x,t} ≥ Z_x }    (59)
             ≤ l + Σ_{t=l}^{n} 1{ W_{x,t} ≥ Z_x, T̃_{k,t} > l }    (60)
             = l + Σ_{t=l}^{n} P{ W_{x,t} ≥ Z_x, T̃_{k,t} > l }    (61)
             = l + Σ_{t=l}^{n} P{ Σ_{j∈Y_x} ( X̄_{j,t} + √( ln⁺( t/(K O_{j,t}) ) / O_{j,t} ) ) ≥ Σ_{j∈Y_x} µ_j + ∆_x, T̃_{k,t} > l }.    (62)

The event { Σ_{j∈Y_x} ( X̄_{j,t} + √( ln⁺( t/(K O_{j,t}) ) / O_{j,t} ) ) ≥ Σ_{j∈Y_x} µ_j + ∆_x } implies that

    ∃ j ∈ Y_x : X̄_{j,t} + √( ln⁺( t/(K O_{j,t}) ) / O_{j,t} ) ≥ µ_j + ∆_x / N.    (63)

Using a union bound, one directly obtains

    T̃¹_{k,n} ≤ l + Σ_{t=l}^{n} Σ_{j∈Y_x} P{ X̄_{j,t} + √( ln⁺( t/(K O_{j,t}) ) / O_{j,t} ) ≥ µ_j + ∆_x / N }    (64)
             ≤ l + Σ_{t=l}^{n} Σ_{j∈Y_x} P{ X̄_{j,t} − µ_j ≥ ∆_x / N − √( ln⁺( t/(K O_{j,t}) ) / O_{j,t} ) }.    (65)

Now let l = ⌈ 16 N² ln( n ∆_x² / K ) / ∆_x² ⌉, where ⌈y⌉ is the smallest integer larger than y. We further set δ = e √(K/n) and choose x_0 such that ∆_{x_0} ≤ δ < ∆_{x_0+1}. As O_{j,t} ≥ l,

    ln⁺( t/(K O_{j,t}) ) ≤ ln⁺( n/(K O_{j,t}) ) ≤ ln⁺( n/(K l) ) ≤ l ∆_x² / (16 N²) ≤ O_{j,t} ∆_x² / (16 N²).    (66)

Hence,

    ∆_x / N − √( ln⁺( t/(K O_{j,t}) ) / O_{j,t} ) ≥ ∆_x / N − ∆_x / (4N) = c ∆_x,    (67)

with c = 1/N − 1/(4N) = 3/(4N). Therefore, using Hoeffding's inequality in (65) and then plugging in the value of l, we get

    T̃¹_{k,n} ≤ l + Σ_{t=l}^{n} Σ_{j∈Y_x} P{ X̄_{j,t} − µ_j ≥ c ∆_x }
             ≤ l + Σ_{t=l}^{n} Σ_{j∈Y_x} exp( −2 O_{j,t} (c ∆_x)² )
             ≤ l + K n exp( −2 l (c ∆_x)² ).    (68)

Since ∆_x > δ = e √(K/n), substituting the value of l into (68) yields

    T̃¹_{k,n} ≤ 1 + 16 N² ( √n + n^{3/4} ) / (K e²) + 1/(K e² n).    (69)

3. Bounding T̃²_{k,n}.

    T̃²_{k,n} = Σ_{t=1}^{n} 1{ W ≤ W_{x,t}, W < Z_x } ≤ Σ_{t=1}^{n} P{ W < Z_x } ≤ n P{ W < Z_x }.    (70)

Recall that the minimum W = min_t W_{1,t} is attained at time slot z. For the probability P{W < Z_x} at fixed x, we have

    P{ W < σ_1 − ∆_x }    (71)
      = P{ Σ_{j∈Y_1} w_{j,z} < σ_1 − ∆_x }    (72)
      ≤ Σ_{j∈Y_1} P{ w_{j,z} < µ_j − ∆_x / N },    (73)

where w_{j,z} denotes arm j's contribution to the index W_{1,z}. Define the function f(u) = 8 ln( √(n/K) u ) / u² for u ∈ [δ, N]. Then we have

    P{ w_{j,z} < µ_j − ∆_x / N }
      = P{ ∃ 1 ≤ l ≤ n : Σ_{τ=1}^{l} ( X_{j,τ} + √( ln⁺( τ/(K l) ) / l ) ) < l µ_j − l ∆_x / N }
      ≤ P{ ∃ 1 ≤ l ≤ n : Σ_{τ=1}^{l} ( µ_j − X_{j,τ} ) > √( l ln⁺( τ/(K l) ) ) + l ∆_x / N }
      ≤ P{ ∃ 1 ≤ l ≤ f(∆_x) : Σ_{τ=1}^{l} ( µ_j − X_{j,τ} ) > √( l ln⁺( τ/(K l) ) ) }
        + P{ ∃ f(∆_x) < l ≤ n : Σ_{τ=1}^{l} ( µ_j − X_{j,τ} ) > l ∆_x / N }.    (74)
For the first term we use a peeling argument with a geometric grid of the form f(∆_x)/2^{g+1} ≤ l ≤ f(∆_x)/2^{g}:

    P{ ∃ 1 ≤ l ≤ f(∆_x) : Σ_{τ=1}^{l} ( µ_j − X_{j,τ} ) > √( l ln⁺( τ/(K l) ) ) }
      ≤ Σ_{g≥0} P{ ∃ f(∆_x)/2^{g+1} ≤ l ≤ f(∆_x)/2^{g} : Σ_{τ=1}^{l} ( µ_j − X_{j,τ} ) > √( ( f(∆_x)/2^{g+1} ) ln⁺( τ 2^{g}/(K f(∆_x)) ) ) }
      ≤ Σ_{g≥0} ( K f(∆_x) / ( 2^{g} √n ) )
      ≤ 2 K f(∆_x) / √n,    (75)

where the second inequality uses Lemma 2. By the special design of the function f, f(u) attains its maximum over [δ, N] at u = e √(K/n); hence, for ∆_x > e √(K/n),

    K f(∆_x) / √n ≤ √K n^{−1/2}.    (76)

For the second term we also use a peeling argument, but with a geometric grid of the form 2^{g} f(∆_x) ≤ l < 2^{g+1} f(∆_x):

    P{ ∃ f(∆_x) < l ≤ n : Σ_{τ=1}^{l} ( µ_j − X_{j,τ} ) > l ∆_x / N }
      ≤ Σ_{g≥0} P{ ∃ 2^{g} f(∆_x) ≤ l ≤ 2^{g+1} f(∆_x) : Σ_{τ=1}^{l} ( µ_j − X_{j,τ} ) > 2^{g} f(∆_x) ∆_x / N }
      ≤ Σ_{g≥0} exp( − 2^{g} f(∆_x) ∆_x / N )
      ≤ Σ_{g≥0} exp( − (g+1) f(∆_x) ∆_x / (2N) )
      = 1 / ( exp( f(∆_x) ∆_x / (2N) ) − 1 ).    (77)

We note that f(u)·u attains its minimum at u = δ; thus for (77) we further have

    1 / ( exp( f(∆_x) ∆_x / (2N) ) − 1 ) ≤ 2N / ( f(∆_x) ∆_x ) ≤ 2 √K N n^{−1/2} / e.    (78)

Combining (70), (73), (76), and (78), we then have

    T̃²_{k,n} ≤ N √n / √K + 4 √K N √n / e ≤ ( 1 + 4 √K N / e ) N √n.    (79)

4. Results without dependency on ∆_min.

Summing T̃¹_{k,n} and T̃²_{k,n}, we have

    E[ T̃_{k,n} ] ≤ T̃¹_{k,n} + T̃²_{k,n} ≤ 1 + ( 16 N² / (K e²) )( 1 + 8N/15 ) √n + ( 1 + 4 √K N / e ) N √n,

and using ∆_max ≤ N and ∆_x ≤ δ for x ≤ x_0, we obtain

    R(n) ≤ √K e √n + N K [ 1 + ( 16 N² / (K e²) )( 1 + 8N/15 ) √n + ( 1 + 4 √K N / e ) N √n ]
         ≤ NK + ( √e K + 8(1 + N) N² ) √n + ( 1 + 4 √K N / e ) N √(Kn).