Distributed No-Regret Learning in Multi-Agent Systems
Xiao Xu and Qing Zhao
Cornell University, Ithaca, NY. Email: {xx243, qz16}@cornell.edu
(This work was supported by the National Science Foundation under Grant CCF-1815559.)

Game theory is a well-established tool for studying interactions among self-interested players. Under the assumption of complete information on the game composition at each player, the focal point of game-theoretic studies has been on the
Nash equilibrium (NE) in analyzing game outcomes and predicting strategic behaviors of rational players.

The difficulty in obtaining complete information in real-world applications gives rise to the formulation of repeated unknown games, where each player has access to only local information such as his own actions and utilities, but is otherwise unaware of the game composition or even the existence of opponents. In such a setting, a rational player improves his decision-making through real-time interactions with the system and learns from past experiences [1]. The problem can be viewed through the lens of distributed online learning, where the central question is whether learning dynamics of distributed players lead to a system-level equilibrium in some sense. Studies in the past few decades have revealed intriguing connections between various notions of no-regret learning at each player and certain relaxed versions of NE at the system level [1, 2].

While one step closer to real-world systems, repeated unknown games, in their canonical forms, often adopt idealistic assumptions in terms of the stationarity of the player population and their utilities, availability of complete and perfect feedback, full rationality of players with unbounded cognition and computation capacity, and homogeneity among players in
their knowledge of the game. Many emerging multi-agent systems, however, are inherently dynamic and heterogeneous, and inevitably limited in terms of available information and the cognition and computation capacity of the players. We give below two examples.

Example: adversarial machine learning.
Security issues are at the forefront of machine learning and deep learning research, especially in safety-critical and risk-sensitive applications. The interaction between the defender and the attacker can be modeled as a two-player game. While the player population may be small, the game is highly complex in terms of the action space, utilities, feedback models, and the available knowledge each player has about the other. In particular, the attacker is characterized by its knowledge—how much information it has for designing attacks—and power—how often a successful attack can be launched. Both can be dynamically changing and adaptive to the strategies of the defender. A full spectrum of attacker profiles has been considered, ranging from the so-called black-box model to the white-box model (i.e., an omniscient attacker). The attack process is also dynamic, often exhibiting bursty behaviors following a successful intrusion or a system malfunction. The action space of the attacker can be equally diverse, including poisoning attacks and perturbation attacks. The former targets the training phase by injecting corrupted labels and examples for the purpose of embedding wrong decision rules into the machine learning algorithm. The latter targets the blind spots of a fully trained artificial intelligence using strategically perturbed instances that trigger wrong outputs, even when the perturbation is so minute as to be indiscernible to humans. In terms of utilities, the attacker's goal may be to compromise the integrity of the system (i.e., to evade detection by causing false negatives) or the availability of the system (to flood the system with false positives). See a comprehensive taxonomy of attacks against machine learning systems in [3].
Example: transportation systems.
Route selection in urban transportation is a typical example of a non-cooperative game repeated over time. The game is characterized by a large population of players that is both dynamic and heterogeneous, with vehicles leaving and joining the system and utilities varying across players and over time. The envisioned large-scale adoption of autonomous vehicles will further diversify the traffic composition. Autonomous vehicles are significantly different from human drivers in terms of decision-making rationality, access to and usage of system-level knowledge, and memory and computation power. Bounded rationality is more evident in human drivers: they are likely to select a familiar route and inclined to settle for sufficing yet suboptimal options.

Complex multi-agent systems as in the above examples call for new game models, new concepts of regret, new design of distributed learning algorithms, and new techniques for analyzing game outcomes. We present in this article representative results on distributed no-regret learning in multi-agent systems. We start in Sec. 2 with a brief review of background knowledge on classical repeated unknown games. In the subsequent four sections, we explore four game characteristics—dynamicity, incomplete and imperfect feedback, bounded rationality, and heterogeneity—that challenge the classical game models. For each characteristic, we illuminate its implications and ramifications in game modeling, notions of regret, feasible game outcomes, and the design and analysis of distributed learning algorithms. Limited by our understanding of this expansive research field and constrained by the page limit, the coverage is inevitably incomplete. We hope the article nevertheless provides an informative glimpse of the current landscape of this field and stimulates future research interests.
In this section, we review key concepts in game theory and highlight classical results on distributed learning in repeated unknown games.

An $N$-player static game is represented by a tuple $G(\mathcal{N}, \mathcal{A}, u)$, where $\mathcal{N} = \{1, \ldots, N\}$ is the set of players, $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ the Cartesian product of each player's action space $\mathcal{A}_i$, and $u = (u_1, \ldots, u_N)$ the utility functions that capture the interaction among players. Specifically, the utility function $u_i$ of player $i$ encodes his preference towards an action. It is a mapping from the action profile $a = (a_1, \ldots, a_N)$ of all players to player $i$'s reward $u_i(a)$. A Nash equilibrium (NE) is an action profile $a^* = (a_1^*, \ldots, a_N^*)$ under which no player can increase his reward via a unilateral deviation. Specifically, $u_i(a^*) \geq u_i(a_i', a_{-i}^*)$ for all $i$ and all $a_i' \neq a_i^*$, where $a_{-i}^*$ denotes the action profile after excluding player $i$. Due to the focus on
deterministic actions (also called pure strategies), the resulting equilibrium is a pure Nash equilibrium. A player may also adopt a mixed strategy, which is a probability distribution $s_i$ over the action space. Correspondingly, a mixed Nash equilibrium is a product distribution $s^* = s_1^* \times \cdots \times s_N^*$ under which the expected utility $\mathbb{E}_{a^* \sim s^*}[u_i(a^*)]$ for every player $i$ is no smaller than that under a unilateral deviation $s_i' \neq s_i^*$ in player $i$'s strategy. A game with a finite population and a finite action space has at least one mixed NE but may not have any pure NE [4].

NE is defined under the assumption that players adopt independent strategies (note the product form of $s^*$). A more general equilibrium—correlated equilibrium (CE)—allows correlation across players' strategies. We note that for the equilibrium definitions introduced here, we focus on games with a finite action space. Specifically, a CE is a joint probability distribution $s$ (not necessarily in a product form) satisfying $\mathbb{E}_{a \sim s}[u_i(a_i, a_{-i}) \mid a_i] \geq \mathbb{E}_{a \sim s}[u_i(a_i', a_{-i}) \mid a_i]$ for all $i$, $a_i$, and $a_i'$, where the expectation is over the joint strategy $s$ conditioned on the event that the realized action of player $i$ is $a_i$. The concept of CE can be interpreted by introducing a mediator, who draws an outcome $a$ from $s$ and privately recommends action $a_i$ to player $i$. The equilibrium condition states that no player has the incentive to deviate from the outcome of the correlated draw from $s$ after his part is revealed. CE can be further relaxed to the so-called coarse correlated equilibrium (CCE), which is a joint distribution $s$ satisfying $\mathbb{E}_{a \sim s}[u_i(a)] \geq \mathbb{E}_{a \sim s}[u_i(a_i', a_{-i})]$ for all $i$ and all $a_i' \neq a_i$. Different from CE, CCE imposes an equilibrium condition that is realization independent. (A small computational check of these two conditions on a toy game is given below.)

Figure 1: Relations and properties of the four types of equilibria [5] (pure NE: may not exist, hard to compute; mixed NE: always exists, hard to compute; CE and CCE: always exist, easy to compute/learn).

The four types of equilibria exhibit a sequential inclusion relation as illustrated in Fig. 1. The more general set of strategy profiles (i.e., allowing correlated strategies across players) in CE and CCE may lead to higher expected utilities summed over all players. CE and CCE can also be computed via linear programming, while pure NE and mixed NE are hard to compute [4]. More importantly, CE and CCE are learnable through certain learning dynamics of players when a game is played repeatedly, as discussed next. A caveat is that the set of CCE may contain highly non-rational strategies that choose only strictly dominated actions (actions that are suboptimal responses to all action profiles of the other players). See [6] for specific examples.

A repeated game consists of T repetitions of a static game (referred to as the stage game in this context). (In a general definition of a repeated game [7], the stage game is parameterized by a state, which affects the utility function. Two basic settings exist in the literature: (i) the state evolves over time following a Markov transition rule, i.e., the state in the next stage depends on the state and actions in the current stage; (ii) the state is fixed throughout all stages. We focus on the second setting in discussing classical results on repeated games.) In a repeated unknown game, after taking an action $a_i^t$ (potentially randomized according to a mixed strategy) in the $t$-th stage, player $i$ accrues a utility $u_i(a^t)$ and observes the entire utility vector $(u_i(a_i', a_{-i}^t))_{a_i' \in \mathcal{A}_i}$ for all actions $a_i'$ in his action space (we focus on a finite action space here) against the action profile $a_{-i}^t$ of the other players. The actions and utilities of the other players, however, are unknown and unobservable. From a single player's perspective, a repeated unknown game can be viewed as an online learning problem where the player chooses actions sequentially in time by learning from past experiences.
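As a concrete check of the CE and CCE conditions defined above (an added illustration, not part of the original text), the following sketch verifies both conditions for a small two-player game; the payoff matrices (a "chicken"-style game) and the candidate joint distributions are hypothetical examples.

import numpy as np

# Hypothetical 2-player game: rows are player 1's actions, columns player 2's.
u1 = np.array([[6, 2],
               [7, 0]])
u2 = np.array([[6, 7],
               [2, 0]])
s_corr = np.array([[1, 1], [1, 0]]) / 3.0   # candidate correlated distribution
s_unif = np.full((2, 2), 0.25)              # uniform distribution, for contrast

def is_cce(u_list, s, tol=1e-9):
    """Check E_{a~s}[u_i(a)] >= E_{a~s}[u_i(a_i', a_{-i})] for all i and a_i'."""
    for i, u in enumerate(u_list):
        ui = u if i == 0 else u.T           # orient so axis 0 is player i's action
        si = s if i == 0 else s.T
        expected = np.sum(si * ui)          # E_{a~s}[u_i(a)]
        marginal_opp = si.sum(axis=0)       # marginal of the opponent's action
        for dev in ui:                      # each fixed deviation a_i'
            if np.dot(dev, marginal_opp) > expected + tol:
                return False
    return True

def is_ce(u_list, s, tol=1e-9):
    """Check the CE condition: conditioned on each recommended a_i, no profitable swap."""
    for i, u in enumerate(u_list):
        ui = u if i == 0 else u.T
        si = s if i == 0 else s.T
        for a in range(ui.shape[0]):        # recommended action a_i = a
            cond = si[a]                    # (unnormalized) conditional over the opponent's action
            for a_dev in range(ui.shape[0]):
                if np.dot(ui[a_dev] - ui[a], cond) > tol:
                    return False
    return True

print(is_ce([u1, u2], s_corr), is_cce([u1, u2], s_corr))   # True  True
print(is_ce([u1, u2], s_unif), is_cce([u1, u2], s_unif))   # False False

The same inequalities, taken as linear constraints in the entries of $s$, are what make CE and CCE computable by linear programming, as noted above.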
A commonly adopted performance measure in online learning is regret, defined as the cumulative reward loss against a properly defined benchmark policy with hindsight vision and/or certain clairvoyant knowledge about the game. In other words, the benchmark policy defines the learning objective that an online algorithm aims to achieve over time. Different benchmark policies lead to different regret measures. Two classical regret notions are the external regret and the internal regret, as detailed below.

Let $\pi_i$ denote the online learning algorithm adopted by player $i$. For a fixed action sequence $\{a_{-i}^t\}_{t=1}^T$ of the other players, the external regret of $\pi_i$ is defined as

$$\max_{a' \in \mathcal{A}_i} \mathbb{E}_{\pi_i}\left[\sum_{t=1}^T \big(u_i(a', a_{-i}^t) - u_i(a^t)\big)\right], \qquad (1)$$

where $\mathbb{E}_{\pi_i}$ denotes the expectation over the random action process $\{a_i^t\}_{t=1}^T$ induced by $\pi_i$. In other words, the benchmark policy in the external regret chooses the best fixed response to the other players' actions in hindsight. The internal regret of $\pi_i$ is defined as

$$\max_{a, a' \in \mathcal{A}_i} \mathbb{E}_{\pi_i}\left[\sum_{t=1}^T \mathbb{I}\{a_i^t = a\}\, \big(u_i(a', a_{-i}^t) - u_i(a^t)\big)\right], \qquad (2)$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. In this definition, the benchmark policy is the best hindsight modification of $\pi_i$ obtained by swapping a single action with another throughout all stages.

An online learning algorithm $\pi_i$ is said to achieve the no-regret condition if, against all action sequences $\{a_{-i}^t\}_{t=1}^T$ of the other players, the cumulative regret has a sublinear growth rate with the time horizon $T$. In other words, $\pi_i$ offers, asymptotically as $T \to \infty$, the same average reward per stage as the specific benchmark policy adopted in the corresponding regret measure. No-regret learning is also referred to as Hannan consistency due to the original work [8] as well as [9].

It is clear that the significance of no-regret learning depends on the adopted benchmark policy which the learning algorithm is measured against. A benchmark policy with stronger performance leads to a stronger notion of regret. In particular, the internal regret is a stronger notion than the external regret: no-regret learning under the former implies no-regret learning under the latter, but not vice versa [10].
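As an illustrative aside (not from the original text), the sketch below computes the realized external and internal regrets in (1) and (2) from a recorded play history of one player, with the expectation replaced by a single sample path; the toy utilities and all names are hypothetical.

import numpy as np

def external_regret(utility, actions):
    """utility[t, a] = u_i(a, a_{-i}^t): reward player i would have received by playing a at stage t;
    actions[t]: the action actually played at stage t."""
    realized = utility[np.arange(len(actions)), actions].sum()
    best_fixed = utility.sum(axis=0).max()          # best single action in hindsight
    return best_fixed - realized

def internal_regret(utility, actions):
    """Best hindsight gain from swapping one action a with another a' in all stages where a was played."""
    T, K = utility.shape
    best = 0.0
    for a in range(K):
        idx = np.where(actions == a)[0]             # stages where a was played
        if len(idx) == 0:
            continue
        gain = utility[idx].sum(axis=0) - utility[idx, a].sum()   # gain of each swap a -> a'
        best = max(best, gain.max())
    return best

# Toy example with 3 actions and T = 1000 stages (randomly generated utilities).
rng = np.random.default_rng(0)
T, K = 1000, 3
utility = rng.random((T, K))
actions = rng.integers(K, size=T)
print(external_regret(utility, actions), internal_regret(utility, actions))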
A number of no-regret learning algorithms exist in the literature. Representative algorithms achieving no-external-regret learning include Multiplicative Weights (MW) (also known as the Hedge algorithm) and Follow the Perturbed Leader [1]. Both are randomized policies, as randomization is necessary for achieving no-regret learning in an adversarial setting with general reward functions [1]. In particular, under the MW algorithm, each player maintains a weight $W_a(t)$ for each action $a$ at every stage $t$ based on past rewards: $W_a(t) = e^{\epsilon \sum_{\tau=1}^t r_a(\tau)} = W_a(t-1)\, e^{\epsilon r_a(t)}$, where $r_a(\tau)$ is the reward received under $a$ at stage $\tau$ and $\epsilon > 0$ is a learning-rate parameter. The probability of selecting action $a$ in the next stage is proportional to its weight, i.e., given by $W_a(t) / \sum_{a'} W_{a'}(t)$.

For no-internal-regret learning, a representative algorithm is Regret Matching [11]. Let $R_{a \to a'}(t) = \frac{1}{t} \sum_{\tau=1}^t \mathbb{I}\{a_i^\tau = a\}\, \big(u_i(a', a_{-i}^\tau) - u_i(a^\tau)\big)$ denote the average gain per play from switching from action $a$ to an alternative $a'$ in the past $t$ plays. In the $(t+1)$-th stage, the probability of switching from the previous action $a^t$ to an alternative $a'$ is given by $\epsilon R_{a^t \to a'}(t)$, where $\epsilon > 0$ is a normalization constant, and the remaining probability is assigned to staying with $a^t$. Regret Matching also offers no-external-regret learning by setting the probability of selecting an action $a$ at the $(t+1)$-th stage to the normalized average gain per play from playing action $a$ throughout the past $t$ plays, i.e., $R_a(t) / \sum_{a'} R_{a'}(t)$, where $R_a(t) = \frac{1}{t} \sum_{\tau=1}^t \big(u_i(a, a_{-i}^\tau) - u_i(a^\tau)\big)$ [11].
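The following minimal sketch (an illustration added here, not the authors' implementation) instantiates the two update rules just described for a single player with full-information feedback; the reward sequences and parameters are hypothetical, and the positive-part clipping in the Regret Matching part is an implementation detail added to keep the probabilities valid.

import numpy as np

rng = np.random.default_rng(1)
K, T, eps = 4, 5000, 0.05

# --- Multiplicative Weights (Hedge): no-external-regret ---
log_w = np.zeros(K)                            # log-weights for numerical stability
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                               # P(a) proportional to W_a(t-1)
    a = rng.choice(K, p=p)                     # the player's (randomized) play at stage t
    r = rng.random(K)                          # full-information feedback: r_a(t) for every a
    log_w += eps * r                           # W_a(t) = W_a(t-1) * exp(eps * r_a(t))

# --- Regret Matching: no-internal-regret ---
swap_gain = np.zeros((K, K))                   # cumulative gain of swapping a -> a'
prev = rng.integers(K)
for t in range(1, T + 1):
    R = np.maximum(swap_gain[prev] / max(t - 1, 1), 0.0)   # average swap gains from the previous action
    p = eps * R
    p[prev] = 0.0
    p[prev] = max(0.0, 1.0 - p.sum())          # remaining probability: stay with the previous action
    p /= p.sum()
    a = rng.choice(K, p=p)
    u = rng.random(K)                          # u[a'] = u_i(a', a_{-i}^t) for every a'
    swap_gain[a] += u - u[a]                   # update swap gains for the played action
    prev = a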
Regret captures the learning objective of an individual player. At the system level, it is desirable to know whether the dynamical behaviors of distributed players converge to an equilibrium in some sense and whether the self-interested regret minimization promises a certain level of optimality in terms of social welfare.

For the first question, it has been shown that if every player adopts a no-external-regret learning algorithm, the empirical distribution of the sequence of actions taken by all players converges to the set of CCE of the stage game [5]. No-regret learning under the internal regret measure guarantees convergence to the more restrictive set of CE [11]. Such convergence results are, however, in terms of the empirical frequency of the players' actions rather than the actual sequence of plays. The convergence is also only to the set of equilibria, rather than to an equilibrium in the corresponding set. In fact, by treating learning in games as a dynamical system, recent studies have shown that in the continuous-time setting, the actual plays under no-regret learning algorithms (such as Follow the Regularized Leader) may exhibit cycles rather than convergence [12]. In the discrete-time setting, it has been shown that in zero-sum games, the actual play under the MW algorithm (starting from a non-equilibrium initial strategy) diverges from every fully mixed NE [13]. For games with special structures (e.g., potential games [14] with a finite action space and bilinear smooth games [15] with a continuum of actions), however, stronger results on the convergence of the actual plays to the more restrictive set of (mixed) NE have been established.

In addition to the convergence of learning dynamics, the social welfare resulting from the self-interested learning of individual players is of great interest in many applications. In (known) static games, the loss in social welfare $W(s) = \mathbb{E}_{a \sim s}\big[\sum_{i=1}^N u_i(a)\big]$ (i.e., the system-level utility under a strategy profile $s$) due to the self-interested behaviors of players is quantified by the price of anarchy (POA). It is defined as the ratio of the optimal social welfare $\mathrm{OPT} = \max_s W(s)$ among all strategies to the smallest social welfare in the set of mixed NE. For repeated unknown games, a corresponding concept, the price of total anarchy (POTA), is defined as

$$\frac{\mathrm{OPT}}{\min_{s^1, \ldots, s^T} \frac{1}{T} \sum_{t=1}^T W(s^t)}, \qquad (3)$$

where $s^1, \ldots, s^T$ is the sequence of strategy profiles in the no-regret dynamics of all players. It has been shown that in games with special structures (e.g., valid games and congestion games), no-regret learning guarantees a POTA that converges to the POA of the stage game even though the sequence of actual plays may not converge to a (mixed) NE [16]. The convergence of the POTA to the POA of the stage game implies that no-regret learning can fully negate the impact of the unknown nature of the game on social welfare. The result was later extended in [5] to a general class of games referred to as smooth games (which includes valid games and congestion games as special cases). To achieve higher social welfare, cooperation among players is necessary. For example, if every player agrees to follow a learning algorithm designed specifically for optimizing the system-level performance, the optimal action profile will be selected a high percentage of time [17].

In a dynamic repeated game, the stage game is time-varying. The dynamicity may be in any of the three elements of the game composition: the set of players, the action space, and the utility functions. (Note that the general definition of repeated games in [7] includes dynamicity in the utility function, as the state parameter may evolve over time following a Markov transition rule. The dynamic repeated game discussed in this section differs from the general repeated game in two aspects: (i) the set of players and the action space can also be time-varying; (ii) the utility functions are in general independent across stages.)

3.1 Notions of Regret

Dynamic unknown games call for new notions of regret to provide meaningful performance measures for distributed online learning algorithms. Specifically, the benchmark policy of a fixed single best action used in the external regret and that of a fixed single best action modification used in the internal regret can be highly suboptimal in dynamic games. As a result, achieving no-regret learning under thus-defined regret measures can no longer serve as a stamp of good performance.

A rather immediate extension of the external regret is to consider every interval of the learning horizon and measure the cumulative loss against a single best action in hindsight that is specific to each interval. This leads to the notion of adaptive regret, under which no-regret learning requires a sublinear growth of the cumulative reward loss in every interval as the interval length tends to infinity. The adaptive regret is particularly suitable for piecewise-stationary systems where changes can be abrupt but infrequent. Classical learning algorithms such as MW can be extended to achieve no-adaptive-regret [18]. The key issue in algorithm design is a mechanism to discount experiences from the distant past.

Another extension of the external regret is the so-called dynamic regret, in which the benchmark policy can be an arbitrary sequence of actions, as opposed to a fixed action throughout an interval of growing length. Achieving diminishing reward loss against all sequences of actions is, however, unattainable. Constraints on either the benchmark action sequence or the reward functions are necessary for defining a meaningful measure. On the variation of the benchmark action sequence, a commonly adopted constraint in the setting with finite actions is that the benchmark sequence is piecewise-stationary with at most K changes (the thus-defined regret is also referred to as the K-shifting regret; a small numerical sketch of this notion is given at the end of this section).
In this case, the no-adaptive-regret condition directly implies no-dynamic-regret [18]. With a continuum of actions, the constraint is often imposed on the cumulative distance between every two consecutive actions in the sequence, i.e., $V_T(\{a^t\}_{t=1}^T) = \sum_{t=1}^{T-1} \|a^{t+1} - a^t\|$. It has been shown that if the benchmark sequence is slow-varying, i.e., $V_T = o(T)$, no-dynamic-regret is achievable through well-designed restart procedures [19]. The variation constraint can also be imposed on the utility functions, requiring that their total variation be sublinear in $T$, i.e., $\sum_{t=1}^{T-1} \sup_a |u^{t+1}(a) - u^t(a)| = o(T)$. Similar constraints can be imposed on the gradient $\nabla u^t(a)$ of the utility function and with the variation measured by the $L_p$-norm. See [20] and references therein for details and corresponding no-regret learning algorithms.

The external regret and its extensions are measured against an alternative strategy of a single player. A new notion of regret—Nash equilibrium regret—considers a benchmark policy that is jointly determined by the strategies of all players [21]. Consider a repeated game with time-varying utility functions $\{u_i^t\}_{t=1}^T$ for each player $i$. Let $\bar{u}_i = \frac{1}{T} \sum_{t=1}^T u_i^t$ be the average utility function and $s^*$ the mixed NE of the static game defined by the average utility functions $\bar{u} = (\bar{u}_1, \ldots, \bar{u}_N)$. The NE regret of player $i$ following a policy $\pi_i$ is then given by $\mathbb{E}_{\pi}\big[\sum_{t=1}^T u_i^t(a^t)\big] - T\, \mathbb{E}_{a^* \sim s^*}[\bar{u}_i(a^*)]$, where $a^t$ is the action profile selected by the policies $\pi = (\pi_1, \ldots, \pi_N)$ of all players at stage $t$. No-regret learning under the NE regret ensures that each player's average reward asymptotically matches that promised by the mixed NE under the average utility functions. A centralized learning algorithm achieving no-NE-regret was developed in [21] for repeated two-player zero-sum games with arbitrarily varying utility functions. Achieving no-regret learning under the measure of NE regret in a distributed setting, however, remains open.

The two key measures—convergence to equilibria and POTA—for system-level performance also need to be modified to take into account game dynamics. The time-varying sequence $\{G^t\}_{t=1}^T$ of stage games defines a sequence of equilibria and a sequence $\{\mathrm{OPT}^t\}_{t=1}^T$ of optimal social welfare. The desired relation between no-regret learning dynamics at individual players and the system-level equilibria is thus in terms of tracking rather than converging. For the definition of POTA, the optimal social welfare in the numerator in (3) needs to be replaced with the average optimal social welfare $\frac{1}{T} \sum_{t=1}^T \mathrm{OPT}^t$.

An online learning algorithm is said to successfully track the sequence of (mixed) NE in a dynamic game if the average distance between the sequence of (mixed) action profiles resulting from the algorithm and the sequence of (mixed) NE vanishes as $T$ tends to infinity. A representative study in [19] considers a game with a continuum of actions and dynamicity manifesting only in the utility functions. Under the assumptions that the sequence of NE is slow-varying and the utility functions are monotonic, it was shown that learning algorithms with sublinear dynamic regret successfully track the sequence of NE.
The monotonicity of the utility functions plays a key role in the analysis: it translates the closeness between the learning dynamics and the NE in terms of the cumulative reward (as in the regret measure) to the closeness in terms of their distance in the action space (the concern of the tracking outcome).

The performance of no-regret learning in terms of social welfare was studied in [22] for games with a dynamic population of players. Specifically, in each stage, each player may independently exit with a fixed probability and is subsequently replaced with a new player with a potentially different utility function (the population size is therefore fixed and the player set is a stationary process over time). For structural games such as first-price auctions, bandwidth allocation, and congestion games, the relation between no-adaptive-regret learning and the average optimal social welfare was examined.

Game dynamics can be in diverse forms. A holistic understanding of the matching between regret notions and the underlying dynamics of the game is still lacking. Different forms of game dynamics demand different benchmark policies in order to arrive at a meaningful regret measure that lends significance to the stamp of "no-regret learning" yet at the same time is attainable. Viewing from a different angle, one may pose the fundamental question of what kinds of game dynamics are tamable through distributed online learning and make no-regret learning and approximately optimal social welfare feasible.
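Tying back to the K-shifting regret introduced in Sec. 3.1, the following added sketch (not from the original text) computes this regret for a recorded play history against the best piecewise-stationary benchmark with at most K changes, using dynamic programming over segment boundaries; all quantities are hypothetical.

import numpy as np

def k_shifting_regret(utility, actions, K):
    """K-shifting regret: reward of the best action sequence with at most K changes
    (i.e., at most K+1 stationary segments), minus the realized reward.
    utility[t, a] = u^t(a, a_{-i}^t); actions[t] = action actually played."""
    T, _ = utility.shape
    prefix = np.vstack([np.zeros(utility.shape[1]), np.cumsum(utility, axis=0)])

    def best_segment(s, t):                  # best fixed action on stages s..t-1
        return (prefix[t] - prefix[s]).max()

    dp = np.full((K + 2, T + 1), -np.inf)    # dp[j][t]: best reward on first t stages, <= j segments
    dp[:, 0] = 0.0
    for j in range(1, K + 2):
        for t in range(1, T + 1):
            dp[j][t] = max(dp[j - 1][s] + best_segment(s, t) for s in range(t))
            dp[j][t] = max(dp[j][t], dp[j - 1][t])
    realized = utility[np.arange(T), actions].sum()
    return dp[K + 1][T] - realized

# Toy example: 2 actions, the better action switches once halfway through the horizon.
T = 200
utility = np.zeros((T, 2))
utility[:T // 2, 0], utility[:T // 2, 1] = 1.0, 0.2
utility[T // 2:, 0], utility[T // 2:, 1] = 0.2, 1.0
actions = np.zeros(T, dtype=int)             # a player that always plays action 0
print(k_shifting_regret(utility, actions, K=1))   # ~ 0.8 * T/2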
Learning and adaptation rely on feedback. The quality of the feedback in terms of completeness and accuracy thus has significant implications for no-regret learning. We explore this issue in this section.

4.1 Incomplete Feedback

Incomplete feedback stands in contrast to full-information feedback, where the utilities of all actions a player could have taken are observed in each stage. Incompleteness can be spatial, across the action space, or temporal, across decision stages. In the former case, a commonly studied model is the so-called bandit feedback, where only the utility of the chosen action is revealed. In the latter, the feedback model is referred to as lossy feedback, where there are decision stages with no feedback [23]. One can easily envision a more general model compounding bandit feedback with lossy feedback. Studies on this general model are lacking in the literature.

The term "bandit feedback" has its roots in the classical problem of the multi-armed bandit [24]. The name of the problem comes from likening an archetypical single-player online learning problem to playing a multi-armed slot machine (known as a bandit for its ability to empty the player's pocket). Each arm, when pulled, generates rewards according to an unknown stochastic model or in an adversarial fashion. Only the reward of the chosen arm is revealed after each play. Due to the incomplete feedback, the player faces the tradeoff between exploration (to gather information from less explored arms) and exploitation (to maximize immediate reward by favoring arms with a good reward history).

In a multi-player game setting with bandit feedback, no-regret learning from an individual player's perspective can be cast as a single-player non-stochastic/adversarial bandit model where the payoff of each arm/action is adversarially chosen and aggregates the interaction with the other players in the game. The concept of external regret in the game setting corresponds to the weak regret in the adversarial bandit model [25], which adopts the best single-arm policy in hindsight as the benchmark. The MW algorithm was modified in [25] to handle the change of the feedback model from full-information to bandit. Specifically, the weight $W_a(t)$ of action $a$ at time $t$ is updated as $W_a(t) = W_a(t-1)\, e^{\epsilon r_a(t)/p_a(t)}$, where $p_a(t)$ is the probability of selecting action $a$ at time $t$ and $r_a(t) = 0$ if $a$ is unselected. Dividing the observed reward by the corresponding probability of the chosen action ensures the unbiasedness of the observation. Quite intuitively, the price for not observing the rewards of all actions is a degradation of the regret order in the size of the action space, i.e., from $\Theta\big(\sqrt{T \log |\mathcal{A}|}\big)$ in the full-information setting [1] to $\Theta\big(\sqrt{|\mathcal{A}|\, T}\big)$ in the bandit setting [26].
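The sketch below (an added illustration of the stated update rule, not code from [25]) adapts the MW update to bandit feedback by importance-weighting the single observed reward; the adversarial reward sequence here is a hypothetical stand-in for the aggregate effect of the other players.

import numpy as np

rng = np.random.default_rng(2)
K, T, eps = 5, 10000, 0.02

log_w = np.zeros(K)
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    a = rng.choice(K, p=p)                 # only this action's reward is observed
    r_full = rng.random(K)                 # the adversary's (hidden) reward vector at stage t
    r_hat = np.zeros(K)
    r_hat[a] = r_full[a] / p[a]            # importance-weighted estimate: unbiased, since
                                           # E[r_hat[a']] = p[a'] * r_full[a'] / p[a'] = r_full[a']
    log_w += eps * r_hat                   # W_a(t) = W_a(t-1) * exp(eps * r_a(t) / p_a(t))
    # (the algorithm analyzed in [25] additionally mixes in a small amount of uniform exploration)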
The multi-player bandit problem explicitly models the existence of $N$ players competing for $M$ ($M > N$) arms [27]. Originally motivated by applications in wireless communication networks where distributed users compete for access to multiple channels, this specific game model is characterized by a special form of interaction among players: a collision occurs when multiple players select the same arm, which results in utility loss. The objective of this distributed learning problem is to minimize the system-level regret over all players against the optimal centralized (hence collision-free) allocation of the players to the best set of arms [27]. In addition to the exploration-exploitation tradeoff in the single-player setting, this distributed learning problem under a system-level objective also faces the tradeoff between selecting a good arm and avoiding collisions with competing players. A number of distributed learning algorithms have been developed to achieve a sublinear system-level regret with respect to $T$ [27]. Recent extensions of the multi-player bandit problem further consider the setting where each arm offers different payoffs across players [28].

The multi-player bandit problem is a special game model in that the players have identical action spaces and their interaction is only in the form of "collisions" when choosing the same action. In a general game setting, the impact of incomplete feedback on no-regret learning and system-level performance is largely open. One quantitative measure of the impact is the regret order with respect to the size of the action space. As mentioned above, bandit feedback results in an additional $\sqrt{|\mathcal{A}|}$ term in the regret order, which can be significant when the action space is large. Recent work [29, 30] has shown that local communications among neighboring players in a network setting can mitigate the negative impact of bandit feedback on the regret order in $|\mathcal{A}|$. In terms of the impact on the system-level performance, it has been shown under a game model with a continuum of actions that bandit feedback degrades the convergence rate of the learning dynamics to equilibria [31].

Imperfect feedback refers to the inaccuracy of the observed utilities in revealing the quality of the selected actions. Recall that mixed strategies are necessary for achieving no-regret learning in the adversarial setting. The quality of a mixed strategy is characterized by the expected utility, where the expectation is taken over the randomness of the strategies of all players. Referred to as expected feedback, the feedback model assuming observations of the expected utility, however, can be unrealistic. A more commonly adopted feedback model is the realized feedback, where only the utility of the realized action profile is revealed. The realized feedback can be viewed as a noisy unbiased estimate of the expected feedback, where the noise is due to the randomness of the players' strategies.

The so-called noisy feedback assumes a different source of noise: it comes from the external environment and is additive to either the observed utility vectors in the so-called semi-bandit feedback [14] with a finite action space, or the gradient of the utility functions in the first-order feedback [32] with a continuum of actions. Under the assumptions of unbiasedness and bounded variance, the issue of the additive noise can be addressed by rather standard estimation techniques and analysis. A more challenging setting is to consider non-stochastic noise due to adversarial attacks, especially in applications such as adversarial machine learning. This problem was recently studied in the single-player setting [33]. Studies in the multi-agent setting are still lacking.
The concept of bounded rationality was first introduced in economics [34] to provide more realistic models than the often adopted perfect rationality, which assumes that the decision-making of players is the result of a full optimization of their utilities. In reality, players often take reasoning shortcuts that may lead to suboptimal decisions. Such reasoning shortcuts may be a result of the limited cognition of human minds or necessitated by the available computation time and power relative to the complexity of action optimization.

Cognitive limitations include the limited ability to anticipate other decision-makers' strategic responses and certain psychological factors that interfere with the valuation of options. Various models exist for capturing the limitations in the players' valuation of options. For example, a player may be myopic, focusing only on the short-term reward [35]. Even with forward-thinking, a player may settle for suboptimal actions perceived as acceptable by the player [34]. The limitation in a player's ability to anticipate other players' strategies can be modeled through a cognitive hierarchy by grouping players according to their cognitive abilities and characterizing them in an iterative fashion. Specifically, players with the lowest level of cognitive ability are grouped as the level-0 players who make decisions randomly.
Level-$k$ ($k > 0$) players are then defined iteratively as those who assume they are playing against lower-level players and anticipate the opponents' strategies accordingly. Recent work draws an interesting connection between the cognitive hierarchy model and the
Optimistic Mirror Descent (OMD) algorithm for solving the saddle-point problem with applications in generative adversarial networks [36]. The saddle-point problem can be viewed as a two-player zero-sum game with a continuum of actions. The solutions to the problem correspond to the set of NE. It has been shown that the OMD algorithm guarantees a converging system dynamic to an NE in terms of the actual plays, while Gradient Descent (GD) may lead to cycles [36]. In the language of cognitive hierarchy, players adopting GD can be regarded as level-0 thinkers in the sense that they do not anticipate the strategies of their opponents. Players adopting OMD are level-1 thinkers since they take advantage of the fact that their opponents are taking similar gradient methods, which will not lead to abrupt gradient changes between two consecutive stages [36]. Consequently, an extra gradient update is applied in OMD to accelerate learning.

Besides cognitive limitations, players are also constrained in terms of physical resources such as memory and computation power. Acquiring, storing, and processing all relevant information for decision-making may be infeasible, especially in complex systems with a large action space. For example, players may only choose from strategies with bounded complexity [37], or use only recent observations in decision-making due to memory constraints [38].

While models for bounded rationality abound in economics, political science, and other related disciplines, incorporating such models into distributed online learning is still in its infancy. A holistic understanding of the implications of bounded rationality in distributed online learning is yet to be gained. An intriguing aspect of the problem is that bounded rationality may not necessarily imply degraded performance. For example, in dynamic games, bounded memory of past experiences may have little effect since no-regret learning dictates that the distant past be forgotten (see discussions in Sec. 3).
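To illustrate the contrast discussed above (an added sketch, not the experiments in [36]), the following compares plain gradient descent-ascent with an optimistic variant on the bilinear saddle-point problem $\min_x \max_y xy$, whose unique NE is $(0, 0)$; the step size and horizon are hypothetical.

import numpy as np

eta, T = 0.1, 2000
# Bilinear zero-sum game f(x, y) = x * y: player 1 minimizes over x, player 2 maximizes over y.
grad_x = lambda x, y: y        # df/dx
grad_y = lambda x, y: x        # df/dy

# Plain gradient descent-ascent (a level-0 scheme in the cognitive-hierarchy reading):
# each step multiplies the distance to (0, 0) by sqrt(1 + eta^2), so the iterates spiral away.
x, y = 1.0, 1.0
for t in range(T):
    x, y = x - eta * grad_x(x, y), y + eta * grad_y(x, y)
gda_dist = np.hypot(x, y)

# Optimistic gradient descent-ascent (an instance of OMD with a Euclidean regularizer):
# the extra term reuses the previous gradient as a prediction of the next one,
# and the iterates converge towards the NE (0, 0).
x, y = 1.0, 1.0
gx_prev, gy_prev = grad_x(x, y), grad_y(x, y)
for t in range(T):
    gx, gy = grad_x(x, y), grad_y(x, y)
    x, y = x - eta * (2 * gx - gx_prev), y + eta * (2 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy
omd_dist = np.hypot(x, y)

print(f"distance to NE after {T} steps: GDA {gda_dist:.2e}, optimistic {omd_dist:.2e}")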
Heterogeneity
The heterogeneity of complex multi-agent systems characterizes the asymmetry across players in three aspects: the available information and knowledge about the system, the available actions, and the level of adaptivity to opponents' strategies. In the example of mixed traffic in urban transportation, autonomous vehicles, while likely to have greater computation power for solving complex decision problems, may have to obey an additional set of regulations on available actions.

In adversarial machine learning, in addition to the asymmetry in knowledge and power, the attacker and the defender may also have different levels of real-time adaptivity to the other player's strategy. Classical regret notions such as the external regret, which assumes fixed actions of the other players, while applicable to oblivious attackers, are no longer valid under adaptive attacks. A partial solution is to adopt a new notion of policy regret defined against an adaptive adversary who assigns reward vectors based on previous actions of the player [39]. Specifically, let $u_t(\cdot\,; a^{t-1})$ denote the player's reward function determined by the adversary at time $t$, given the sequence of actions $a^{t-1}$ taken by the player in the past. The policy regret with reward functions $\{u_t\}_{t=1}^T$ is defined as

$$\max_{a \in \mathcal{A}} \mathbb{E}\left[\sum_{t=1}^T u_t(a; \{a, \ldots, a\}) - \sum_{t=1}^T u_t(a^t; a^{t-1})\right], \qquad (4)$$

where $u_t(\cdot\,; \{a, \ldots, a\})$ denotes the reward function determined by the adversary if the player took actions $\{a, \ldots, a\}$ in the past. The $m$-memory policy regret is defined by assuming that the reward function depends only on the past $m$ actions of the player.

The difference between the external regret and the policy regret may not be crucial if the adversary and the player have homogeneous objectives (e.g., mixed traffic in transportation systems). It has been shown that there exists a wide class of algorithms that can ensure no-regret learning under both regret definitions, as long as the adversary is also using such an algorithm [40]. In applications such as adversarial machine learning where the adversary may be a malicious opponent, the two notions of regret are incompatible: there exists an $m$-memory adaptive adversary that can make any action sequence of the player with sublinear regret in one notion suffer from linear regret in the other [40]. A general technique for developing no-policy-regret algorithms in the single-player setting was proposed in [39]. In terms of the system-level performance, it was shown in two-player games that no-policy-regret learning guarantees convergence of the system dynamic to a new notion of equilibrium called policy equilibrium [40]. However, the understanding of policy equilibrium is limited. In games with more than two players, even the definition of policy equilibrium is unclear.
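As an added illustration of the definition in (4) (not from the original text), the sketch below computes the policy regret of a play history against an m-memory adaptive adversary by replaying each constant-action counterfactual; the adversary's reward rule is a hypothetical example.

import numpy as np

rng = np.random.default_rng(3)
K, T, m = 3, 500, 2

def adversary_reward(history, action):
    """Hypothetical m-memory adversary: it rewards the player for repeating the action
    played most often in its last m plays (the rule depends only on the recent history)."""
    recent = history[-m:]
    target = max(set(recent), key=recent.count) if recent else 0
    return 1.0 if action == target else 0.2

# The player's realized history (here: uniformly random actions, for illustration).
actions = list(rng.integers(K, size=T))
realized = sum(adversary_reward(actions[:t], actions[t]) for t in range(T))

# Benchmark in (4): play a single fixed action a from the start, so the adversary's
# reward function at time t is evaluated on the counterfactual history {a, ..., a}.
best_fixed = max(
    sum(adversary_reward([a] * t, a) for t in range(T))
    for a in range(K)
)
print("policy regret:", best_fixed - realized)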
References

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[2] H. P. Young, Strategic Learning and its Limits. OUP Oxford, 2004.
[3] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar, "The security of machine learning," Machine Learning, vol. 81, no. 2, pp. 121–148, 2010.
[4] N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Algorithmic Game Theory. Cambridge University Press, 2007.
[5] T. Roughgarden, "Intrinsic robustness of the price of anarchy," Journal of the ACM, vol. 62, no. 5, p. 32, 2015.
[6] Y. Viossat and A. Zapechelnyuk, "No-regret dynamics and fictitious play," Journal of Economic Theory, vol. 148, no. 2, pp. 825–842, 2013.
[7] R. Laraki and S. Sorin, "Advances in zero-sum dynamic games," in Handbook of Game Theory with Economic Applications. Elsevier, 2015, vol. 4, pp. 27–93.
[8] J. Hannan, "Approximation to Bayes risk in repeated play," Contributions to the Theory of Games, vol. 3, pp. 97–139, 1957.
[9] D. Blackwell, "An analog of the minimax theorem for vector payoffs," Pacific Journal of Mathematics, vol. 6, no. 1, pp. 1–8, 1956.
[10] G. Stoltz and G. Lugosi, "Internal regret in on-line portfolio selection," Machine Learning, vol. 59, no. 1-2, pp. 125–159, 2005.
[11] S. Hart and A. Mas-Colell, "A simple adaptive procedure leading to correlated equilibrium," Econometrica, vol. 68, no. 5, pp. 1127–1150, 2000.
[12] P. Mertikopoulos, C. Papadimitriou, and G. Piliouras, "Cycles in adversarial regularized learning," in Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2018, pp. 2703–2717.
[13] J. P. Bailey and G. Piliouras, "Multiplicative weights update in zero-sum games," in Proceedings of the 2018 ACM Conference on Economics and Computation. ACM, 2018, pp. 321–338.
[14] A. Heliou, J. Cohen, and P. Mertikopoulos, "Learning with bandit feedback in potential games," in Advances in Neural Information Processing Systems, 2017, pp. 6369–6378.
[15] G. Gidel, R. A. Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas, "Negative momentum for improved game dynamics," in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1802–1811.
[16] A. Blum, M. Hajiaghayi, K. Ligett, and A. Roth, "Regret minimization and the price of total anarchy," in Proceedings of the 40th Annual ACM Symposium on Theory of Computing. ACM, 2008, pp. 373–382.
[17] J. R. Marden, H. P. Young, and L. Y. Pao, "Achieving Pareto optimality through distributed learning," SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, 2014.
[18] H. Luo and R. E. Schapire, "Achieving all with no parameters: AdaNormalHedge," in Conference on Learning Theory, 2015, pp. 1286–1304.
[19] B. Duvocelle, P. Mertikopoulos, M. Staudigl, and D. Vermeulen, "Learning in time-varying games," arXiv preprint arXiv:1809.03066, 2018.
[20] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, "Online optimization in dynamic environments: Improved regret rates for strongly convex problems," IEEE, 2016, pp. 7195–7201.
[21] A. R. Cardoso, J. Abernethy, H. Wang, and H. Xu, "Competing against Nash equilibria in adversarially changing zero-sum games," in Proceedings of the 36th International Conference on Machine Learning, vol. 97. PMLR, 2019, pp. 921–930.
[22] T. Lykouris, V. Syrgkanis, and É. Tardos, "Learning and efficiency in games with dynamic population," in Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, 2016, pp. 120–129.
[23] Z. Zhou, P. Mertikopoulos, S. Athey, N. Bambos, P. W. Glynn, and Y. Ye, "Learning in games with lossy feedback," in Advances in Neural Information Processing Systems, 2018, pp. 5140–5150.
[24] Q. Zhao, Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.
[25] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multi-armed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[26] J.-Y. Audibert and S. Bubeck, "Minimax policies for adversarial and stochastic bandits," in Proceedings of the 22nd Annual Conference on Learning Theory, 2009, pp. 217–226.
[27] K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.
[28] I. Bistritz and A. Leshem, "Distributed multi-player bandits—a game of thrones approach," in Advances in Neural Information Processing Systems, 2018, pp. 7222–7232.
[29] N. Cesa-Bianchi, C. Gentile, and Y. Mansour, "Delay and cooperation in nonstochastic bandits," The Journal of Machine Learning Research, vol. 20, no. 1, pp. 613–650, 2019.
[30] Y. Bar-On and Y. Mansour, "Individual regret in cooperative nonstochastic multi-armed bandits," in Advances in Neural Information Processing Systems, 2019, pp. 3110–3120.
[31] M. Bravo, D. Leslie, and P. Mertikopoulos, "Bandit learning in concave N-person games," in Advances in Neural Information Processing Systems, 2018, pp. 5661–5671.
[32] P. Mertikopoulos and Z. Zhou, "Learning in games with continuous action sets and unknown payoff functions," Mathematical Programming, vol. 173, no. 1-2, pp. 465–507, 2019.
[33] K.-S. Jun, L. Li, Y. Ma, and J. Zhu, "Adversarial attacks on stochastic bandits," in Advances in Neural Information Processing Systems, 2018, pp. 3640–3649.
[34] H. A. Simon, "A behavioral model of rational choice," The Quarterly Journal of Economics, vol. 69, no. 1, pp. 99–118, 1955.
[35] X. Gabaix and D. Laibson, "Bounded rationality and directed cognition," Harvard University, 2005.
[36] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, "Training GANs with optimism," in International Conference on Learning Representations, 2018.
[37] M. Scarsini and T. Tomala, "Repeated congestion games with bounded rationality," International Journal of Game Theory, vol. 41, no. 3, pp. 651–669, 2012.
[38] L. Chen, F. Lin, P. Tang, K. Wang, R. Wang, and S. Wang, "K-memory strategies in repeated games," in Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 2017, pp. 1493–1498.
[39] R. Arora, O. Dekel, and A. Tewari, "Online bandit learning against an adaptive adversary: from regret to policy regret," in Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 1747–1754.
[40] R. Arora, M. Dinitz, T. V. Marinov, and M. Mohri, "Policy regret in repeated games," in