Distributed No-Regret Learning in Multi-Agent Systems
Xiao Xu and Qing Zhao
Cornell University, Ithaca, NY. Email: {xx243, qz16}@cornell.edu
(This work was supported by the National Science Foundation under Grant CCF-1815559.)

Game theory is a well-established tool for studying interactions among self-interested players. Under the assumption of complete information on the game composition at each player, the focal point of game-theoretic studies has been on the
Nash equilibrium (NE) in analyzing game outcomes and predicting strategic behaviors of rational players.

The difficulty in obtaining complete information in real-world applications gives rise to the formulation of repeated unknown games, where each player has access to only local information such as his own actions and utilities, but is otherwise unaware of the game composition or even the existence of opponents. In such a setting, a rational player improves his decision-making through real-time interactions with the system and learns from past experiences [1]. The problem can be viewed through the lens of distributed online learning, where the central question is whether learning dynamics of distributed players lead to a system-level equilibrium in some sense. Studies in the past few decades have revealed intriguing connections between various notions of no-regret learning at each player and certain relaxed versions of NE at the system level [1, 2].

While one step closer to real-world systems, repeated unknown games, in their canonical forms, often adopt idealistic assumptions in terms of the stationarity of the player population and their utilities, availability of complete and perfect feedback, full rationality of players with unbounded cognition and computation capacity, and homogeneity among players in
their knowledge of the game. Many emerging multi-agent systems, however, are inherently dynamic and heterogeneous, and inevitably limited in terms of available information and the cognition and computation capacity of the players. We give below two examples.

Example: adversarial machine learning.
Security issues are at the forefront of machine learning and deep learning research, especially in safety-critical and risk-sensitive applications. The interaction between the defender and the attacker can be modeled as a two-player game. While the player population may be small, the game is highly complex in terms of the action space, utilities, feedback models, and the available knowledge each player has about the other. In particular, the attacker is characterized by its knowledge—how much information it has for designing attacks—and power—how often a successful attack can be launched. Both can be dynamically changing and adaptive to the strategies of the defender. A full spectrum of attacker profiles has been considered, ranging from the so-called black-box model to the white-box model (i.e., an omniscient attacker). The attack process is also dynamic, often exhibiting bursty behaviors following a successful intrusion or a system malfunction. The action space of the attacker can be equally diverse, including poisoning attacks and perturbation attacks. The former targets the training phase by injecting corrupted labels and examples for the purpose of embedding wrong decision rules into the machine learning algorithm. The latter targets the blind spots of a fully trained artificial intelligence using strategically perturbed instances that trigger wrong outputs, even when the perturbation is so minute as to be indiscernible to humans. In terms of utilities, the attacker's goal may be to compromise the integrity of the system (i.e., to evade detection by causing false negatives) or the availability of the system (to flood the system with false positives). See a comprehensive taxonomy of attacks against machine learning systems in [3].
Example: transportation systems.
Route selection in urban transportation is a typical example of a non-cooperative game repeated over time. The game is characterized by a large population of players that is both dynamic and heterogeneous, with vehicles leaving and joining the system and utilities varying across players and over time. The envisioned large-scale adoption of autonomous vehicles will further diversify the traffic composition. Autonomous vehicles are significantly different from human drivers in terms of decision-making rationality, access to and usage of system-level knowledge, and memory and computation power. Bounded rationality is more evident in human drivers: they are likely to select a familiar route and inclined to settle for sufficing yet suboptimal options.

Complex multi-agent systems as in the above examples call for new game models, new concepts of regret, new design of distributed learning algorithms, and new techniques for analyzing game outcomes. We present in this article representative results on distributed no-regret learning in multi-agent systems. We start in Sec. 2 with a brief review of background knowledge on classical repeated unknown games. In the subsequent four sections, we explore four game characteristics—dynamicity, incomplete and imperfect feedback, bounded rationality, and heterogeneity—that challenge the classical game models. For each characteristic, we illuminate its implications and ramifications in game modeling, notions of regret, feasible game outcomes, and the design and analysis of distributed learning algorithms. Limited by our understanding of this expansive research field and constrained by the page limit, the coverage is inevitably incomplete. We hope the article nevertheless provides an informative glimpse of the current landscape of this field and stimulates future research interests.
In this section, we review key concepts in game theory and highlight classical results on distributed learning in repeated unknown games.

An $N$-player static game is represented by a tuple $G(\mathcal{N}, \mathcal{A}, u)$, where $\mathcal{N} = \{1, \ldots, N\}$ is the set of players, $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_N$ the Cartesian product of each player's action space $\mathcal{A}_i$, and $u = (u_1, \ldots, u_N)$ the utility functions that capture the interaction among players. Specifically, the utility function $u_i$ of player $i$ encodes his preference towards an action. It is a mapping from the action profile $a = (a_1, \ldots, a_N)$ of all players to player $i$'s reward $u_i(a)$. A Nash equilibrium (NE) is an action profile $a^* = (a_1^*, \ldots, a_N^*)$ under which no player can increase his reward via a unilateral deviation. Specifically, $u_i(a^*) \geq u_i(a_i', a_{-i}^*)$ for all $i$ and all $a_i' \neq a_i^*$, where $a_{-i}^*$ denotes the action profile after excluding player $i$. Due to the focus on
deterministic actions (also called pure strategies), the resulting equilibrium is a pure Nash equilibrium. A player may also adopt a mixed strategy, which is a probability distribution $s_i$ over the action space. Correspondingly, a mixed Nash equilibrium is a product distribution $s^* = s_1^* \times \cdots \times s_N^*$ under which the expected utility $\mathbb{E}_{a^* \sim s^*}[u_i(a^*)]$ for every player $i$ is no smaller than that under a unilateral deviation $s_i' \neq s_i^*$ in player $i$'s strategy. A game with a finite population and a finite action space has at least one mixed NE but may not have any pure NE [4].

NE is defined under the assumption that players adopt independent strategies (note the product form of $s^*$). A more general equilibrium—correlated equilibrium (CE)—allows correlation across players' strategies. We note that for the equilibrium definitions introduced here, we focus on games with a finite action space. Specifically, a CE is a joint probability distribution $s$ (not necessarily in a product form) satisfying $\mathbb{E}_{a \sim s}[u_i(a_i, a_{-i}) \mid a_i] \geq \mathbb{E}_{a \sim s}[u_i(a_i', a_{-i}) \mid a_i]$ for all $i$, $a_i$, and $a_i'$, where the expectation is over the joint strategy $s$ conditioned on the event that the realized action of player $i$ is $a_i$. The concept of CE can be interpreted by introducing a mediator, who draws an outcome $a$ from $s$ and privately recommends action $a_i$ to player $i$. The equilibrium condition states that no player has the incentive to deviate from the outcome of the correlated draw from $s$ after his part is revealed. CE can be further relaxed to the so-called coarse correlated equilibrium (CCE), which is a joint distribution $s$ satisfying $\mathbb{E}_{a \sim s}[u_i(a)] \geq \mathbb{E}_{a \sim s}[u_i(a_i', a_{-i})]$ for all $i$ and all $a_i' \neq a_i$. Different from CE, CCE imposes an equilibrium condition that is realization independent. (A small computational check of these two conditions on a toy game is given below.)

Figure 1: Relations and properties of the four types of equilibria [5] (pure NE: may not exist, hard to compute; mixed NE: always exists, hard to compute; CE and CCE: always exist, easy to compute/learn).

The four types of equilibria exhibit a sequential inclusion relation as illustrated in Fig. 1. The more general set of strategy profiles (i.e., allowing correlated strategies across players) in CE and CCE may lead to higher expected utilities summed over all players. CE and CCE can also be computed via linear programming, while pure NE and mixed NE are hard to compute [4]. More importantly, CE and CCE are learnable through certain learning dynamics of players when a game is played repeatedly, as discussed next. A caveat is that the set of CCE may contain highly non-rational strategies that choose only strictly dominated actions (actions that are suboptimal responses to all action profiles of the other players). See [6] for specific examples.

A repeated game consists of T repetitions of a static game (referred to as the stage game in this context). (In a general definition of a repeated game [7], the stage game is parameterized by a state, which affects the utility function. Two basic settings exist in the literature: (i) the state evolves over time following a Markov transition rule, i.e., the state in the next stage depends on the state and actions in the current stage; (ii) the state is fixed throughout all stages. We focus on the second setting in discussing classical results on repeated games.) In a repeated unknown game, after taking an action $a_i^t$ (potentially randomized according to a mixed strategy) in the $t$-th stage, player $i$ accrues a utility $u_i(a^t)$ and observes the entire utility vector $(u_i(a_i', a_{-i}^t))_{a_i' \in \mathcal{A}_i}$ for all actions $a_i'$ in his action space (we focus on a finite action space here) against the action profile $a_{-i}^t$ of the other players. The actions and utilities of the other players, however, are unknown and unobservable. From a single player's perspective, a repeated unknown game can be viewed as an online learning problem where the player chooses actions sequentially in time by learning from past experiences.
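As a concrete check of the CE and CCE conditions defined above (an added illustration, not part of the original text), the following sketch verifies both conditions for a small two-player game; the payoff matrices (a "chicken"-style game) and the candidate joint distributions are hypothetical examples.

import numpy as np

# Hypothetical 2-player game: rows are player 1's actions, columns player 2's.
u1 = np.array([[6, 2],
               [7, 0]])
u2 = np.array([[6, 7],
               [2, 0]])
s_corr = np.array([[1, 1], [1, 0]]) / 3.0   # candidate correlated distribution
s_unif = np.full((2, 2), 0.25)              # uniform distribution, for contrast

def is_cce(u_list, s, tol=1e-9):
    """Check E_{a~s}[u_i(a)] >= E_{a~s}[u_i(a_i', a_{-i})] for all i and a_i'."""
    for i, u in enumerate(u_list):
        ui = u if i == 0 else u.T           # orient so axis 0 is player i's action
        si = s if i == 0 else s.T
        expected = np.sum(si * ui)          # E_{a~s}[u_i(a)]
        marginal_opp = si.sum(axis=0)       # marginal of the opponent's action
        for dev in ui:                      # each fixed deviation a_i'
            if np.dot(dev, marginal_opp) > expected + tol:
                return False
    return True

def is_ce(u_list, s, tol=1e-9):
    """Check the CE condition: conditioned on each recommended a_i, no profitable swap."""
    for i, u in enumerate(u_list):
        ui = u if i == 0 else u.T
        si = s if i == 0 else s.T
        for a in range(ui.shape[0]):        # recommended action a_i = a
            cond = si[a]                    # (unnormalized) conditional over the opponent's action
            for a_dev in range(ui.shape[0]):
                if np.dot(ui[a_dev] - ui[a], cond) > tol:
                    return False
    return True

print(is_ce([u1, u2], s_corr), is_cce([u1, u2], s_corr))   # True  True
print(is_ce([u1, u2], s_unif), is_cce([u1, u2], s_unif))   # False False

The same inequalities, taken as linear constraints in the entries of $s$, are what make CE and CCE computable by linear programming, as noted above.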
A commonly adopted performance measure in online learning is regret, defined as the cumulative reward loss against a properly defined benchmark policy with hindsight vision and/or certain clairvoyant knowledge about the game. In other words, the benchmark policy defines the learning objective that an online algorithm aims to achieve over time. Different benchmark policies lead to different regret measures. Two classical regret notions are the external regret and the internal regret, as detailed below.

Let $\pi_i$ denote the online learning algorithm adopted by player $i$. For a fixed action sequence $\{a_{-i}^t\}_{t=1}^T$ of the other players, the external regret of $\pi_i$ is defined as

$$\max_{a' \in \mathcal{A}_i} \mathbb{E}_{\pi_i}\left[\sum_{t=1}^T \big(u_i(a', a_{-i}^t) - u_i(a^t)\big)\right], \qquad (1)$$

where $\mathbb{E}_{\pi_i}$ denotes the expectation over the random action process $\{a_i^t\}_{t=1}^T$ induced by $\pi_i$. In other words, the benchmark policy in the external regret chooses the best fixed response to the other players' actions in hindsight. The internal regret of $\pi_i$ is defined as

$$\max_{a, a' \in \mathcal{A}_i} \mathbb{E}_{\pi_i}\left[\sum_{t=1}^T \mathbb{I}\{a_i^t = a\}\, \big(u_i(a', a_{-i}^t) - u_i(a^t)\big)\right], \qquad (2)$$

where $\mathbb{I}\{\cdot\}$ is the indicator function. In this definition, the benchmark policy is the best hindsight modification of $\pi_i$ obtained by swapping a single action with another throughout all stages.

An online learning algorithm $\pi_i$ is said to achieve the no-regret condition if, against all action sequences $\{a_{-i}^t\}_{t=1}^T$ of the other players, the cumulative regret has a sublinear growth rate with the time horizon $T$. In other words, $\pi_i$ offers, asymptotically as $T \to \infty$, the same average reward per stage as the specific benchmark policy adopted in the corresponding regret measure. No-regret learning is also referred to as Hannan consistency due to the original work [8] as well as [9].

It is clear that the significance of no-regret learning depends on the adopted benchmark policy which the learning algorithm is measured against. A benchmark policy with stronger performance leads to a stronger notion of regret. In particular, the internal regret is a stronger notion than the external regret: no-regret learning under the former implies no-regret learning under the latter, but not vice versa [10].
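As an illustrative aside (not from the original text), the sketch below computes the realized external and internal regrets in (1) and (2) from a recorded play history of one player, with the expectation replaced by a single sample path; the toy utilities and all names are hypothetical.

import numpy as np

def external_regret(utility, actions):
    """utility[t, a] = u_i(a, a_{-i}^t): reward player i would have received by playing a at stage t;
    actions[t]: the action actually played at stage t."""
    realized = utility[np.arange(len(actions)), actions].sum()
    best_fixed = utility.sum(axis=0).max()          # best single action in hindsight
    return best_fixed - realized

def internal_regret(utility, actions):
    """Best hindsight gain from swapping one action a with another a' in all stages where a was played."""
    T, K = utility.shape
    best = 0.0
    for a in range(K):
        idx = np.where(actions == a)[0]             # stages where a was played
        if len(idx) == 0:
            continue
        gain = utility[idx].sum(axis=0) - utility[idx, a].sum()   # gain of each swap a -> a'
        best = max(best, gain.max())
    return best

# Toy example with 3 actions and T = 1000 stages (randomly generated utilities).
rng = np.random.default_rng(0)
T, K = 1000, 3
utility = rng.random((T, K))
actions = rng.integers(K, size=T)
print(external_regret(utility, actions), internal_regret(utility, actions))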
A number of no-regret learning algorithms exist in the literature. Representative algorithms achieving no-external-regret learning include Multiplicative Weights (MW) (also known as the Hedge algorithm) and Follow the Perturbed Leader [1]. Both are randomized policies, as randomization is necessary for achieving no-regret learning in an adversarial setting with general reward functions [1]. In particular, under the MW algorithm, each player maintains a weight $W_a(t)$ for each action $a$ at every stage $t$ based on past rewards: $W_a(t) = e^{\epsilon \sum_{\tau=1}^t r_a(\tau)} = W_a(t-1)\, e^{\epsilon r_a(t)}$, where $r_a(\tau)$ is the reward received under $a$ at stage $\tau$ and $\epsilon > 0$ is a learning-rate parameter. The probability of selecting action $a$ in the next stage is proportional to its weight, i.e., given by $W_a(t) / \sum_{a'} W_{a'}(t)$.

For no-internal-regret learning, a representative algorithm is Regret Matching [11]. Let $R_{a \to a'}(t) = \frac{1}{t} \sum_{\tau=1}^t \mathbb{I}\{a_i^\tau = a\}\, \big(u_i(a', a_{-i}^\tau) - u_i(a^\tau)\big)$ denote the average gain per play from switching from action $a$ to an alternative $a'$ in the past $t$ plays. In the $(t+1)$-th stage, the probability of switching from the previous action $a^t$ to an alternative $a'$ is given by $\epsilon R_{a^t \to a'}(t)$, where $\epsilon > 0$ is a normalization constant, and the remaining probability is assigned to staying with $a^t$. Regret Matching also offers no-external-regret learning by setting the probability of selecting an action $a$ at the $(t+1)$-th stage to the normalized average gain per play from playing action $a$ throughout the past $t$ plays, i.e., $R_a(t) / \sum_{a'} R_{a'}(t)$, where $R_a(t) = \frac{1}{t} \sum_{\tau=1}^t \big(u_i(a, a_{-i}^\tau) - u_i(a^\tau)\big)$ [11].
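The following minimal sketch (an illustration added here, not the authors' implementation) instantiates the two update rules just described for a single player with full-information feedback; the reward sequences and parameters are hypothetical, and the positive-part clipping in the Regret Matching part is an implementation detail added to keep the probabilities valid.

import numpy as np

rng = np.random.default_rng(1)
K, T, eps = 4, 5000, 0.05

# --- Multiplicative Weights (Hedge): no-external-regret ---
log_w = np.zeros(K)                            # log-weights for numerical stability
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                               # P(a) proportional to W_a(t-1)
    a = rng.choice(K, p=p)                     # the player's (randomized) play at stage t
    r = rng.random(K)                          # full-information feedback: r_a(t) for every a
    log_w += eps * r                           # W_a(t) = W_a(t-1) * exp(eps * r_a(t))

# --- Regret Matching: no-internal-regret ---
swap_gain = np.zeros((K, K))                   # cumulative gain of swapping a -> a'
prev = rng.integers(K)
for t in range(1, T + 1):
    R = np.maximum(swap_gain[prev] / max(t - 1, 1), 0.0)   # average swap gains from the previous action
    p = eps * R
    p[prev] = 0.0
    p[prev] = max(0.0, 1.0 - p.sum())          # remaining probability: stay with the previous action
    p /= p.sum()
    a = rng.choice(K, p=p)
    u = rng.random(K)                          # u[a'] = u_i(a', a_{-i}^t) for every a'
    swap_gain[a] += u - u[a]                   # update swap gains for the played action
    prev = a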
Regret captures the learning objective of an individual player. At the system level, it is desirable to know whether the dynamical behaviors of distributed players converge to an equilibrium in some sense and whether the self-interested regret minimization promises a certain level of optimality in terms of social welfare.

For the first question, it has been shown that if every player adopts a no-external-regret learning algorithm, the empirical distribution of the sequence of actions taken by all players converges to the set of CCE of the stage game [5]. No-regret learning under the internal regret measure guarantees convergence to the more restrictive set of CE [11]. Such convergence results are, however, in terms of the empirical frequency of the players' actions rather than the actual sequence of plays. The convergence is also only to the set of equilibria, rather than to an equilibrium in the corresponding set. In fact, by treating learning in games as a dynamical system, recent studies have shown that in the continuous-time setting, the actual plays under no-regret learning algorithms (such as Follow the Regularized Leader) may exhibit cycles rather than convergence [12]. In the discrete-time setting, it has been shown that in zero-sum games, the actual play under the MW algorithm (starting from a non-equilibrium initial strategy) diverges from every fully mixed NE [13]. For games with special structures (e.g., potential games [14] with a finite action space and bilinear smooth games [15] with a continuum of actions), however, stronger results on the convergence of the actual plays to the more restrictive set of (mixed) NE have been established.

In addition to the convergence of learning dynamics, the social welfare resulting from the self-interested learning of individual players is of great interest in many applications. In (known) static games, the loss in social welfare $W(s) = \mathbb{E}_{a \sim s}\big[\sum_{i=1}^N u_i(a)\big]$ (i.e., the system-level utility under a strategy profile $s$) due to the self-interested behaviors of players is quantified by the price of anarchy (POA). It is defined as the ratio of the optimal social welfare $\mathrm{OPT} = \max_s W(s)$ among all strategies to the smallest social welfare in the set of mixed NE. For repeated unknown games, a corresponding concept, the price of total anarchy (POTA), is defined as

$$\frac{\mathrm{OPT}}{\min_{s^1, \ldots, s^T} \frac{1}{T} \sum_{t=1}^T W(s^t)}, \qquad (3)$$

where $s^1, \ldots, s^T$ is the sequence of strategy profiles in the no-regret dynamics of all players. It has been shown that in games with special structures (e.g., valid games and congestion games), no-regret learning guarantees a POTA that converges to the POA of the stage game even though the sequence of actual plays may not converge to a (mixed) NE [16]. The convergence of the POTA to the POA of the stage game implies that no-regret learning can fully negate the impact of the unknown nature of the game on social welfare. The result was later extended in [5] to a general class of games referred to as smooth games (which includes valid games and congestion games as special cases). To achieve higher social welfare, cooperation among players is necessary. For example, if every player agrees to follow a learning algorithm designed specifically for optimizing the system-level performance, the optimal action profile will be selected a high percentage of time [17].

In a dynamic repeated game, the stage game is time-varying. The dynamicity may be in any of the three elements of the game composition: the set of players, the action space, and the utility functions. (Note that the general definition of repeated games in [7] includes dynamicity in the utility function, as the state parameter may evolve over time following a Markov transition rule. The dynamic repeated game discussed in this section differs from the general repeated game in two aspects: (i) the set of players and the action space can also be time-varying; (ii) the utility functions are in general independent across stages.)

3.1 Notions of Regret

Dynamic unknown games call for new notions of regret to provide meaningful performance measures for distributed online learning algorithms. Specifically, the benchmark policy of a fixed single best action used in the external regret and that of a fixed single best action modification used in the internal regret can be highly suboptimal in dynamic games. As a result, achieving no-regret learning under thus-defined regret measures can no longer serve as a stamp of good performance.

A rather immediate extension of the external regret is to consider every interval of the learning horizon and measure the cumulative loss against a single best action in hindsight that is specific to each interval. This leads to the notion of adaptive regret, under which no-regret learning requires a sublinear growth of the cumulative reward loss in every interval as the interval length tends to infinity. The adaptive regret is particularly suitable for piecewise-stationary systems where changes can be abrupt but infrequent. Classical learning algorithms such as MW can be extended to achieve no-adaptive-regret [18]. The key issue in algorithm design is a mechanism to discount experiences from the distant past.

Another extension of the external regret is the so-called dynamic regret, in which the benchmark policy can be an arbitrary sequence of actions, as opposed to a fixed action throughout an interval of growing length. Achieving diminishing reward loss against all sequences of actions is, however, unattainable. Constraints on either the benchmark action sequence or the reward functions are necessary for defining a meaningful measure. On the variation of the benchmark action sequence, a commonly adopted constraint in the setting with finite actions is that the benchmark sequence is piecewise-stationary with at most K changes (the thus-defined regret is also referred to as the K-shifting regret; a small numerical sketch of this notion is given at the end of this section).
In this case, the no-adaptive-regret condition directly implies no-dynamic-regret [18]. With a continuum of actions, the constraint is often imposed on the cumulative distance between every two consecutive actions in the sequence, i.e., $V_T(\{a^t\}_{t=1}^T) = \sum_{t=1}^{T-1} \|a^{t+1} - a^t\|$. It has been shown that if the benchmark sequence is slow-varying, i.e., $V_T = o(T)$, no-dynamic-regret is achievable through well-designed restart procedures [19]. The variation constraint can also be imposed on the utility functions, requiring that their total variation be sublinear in $T$, i.e., $\sum_{t=1}^{T-1} \sup_a |u^{t+1}(a) - u^t(a)| = o(T)$. Similar constraints can be imposed on the gradient $\nabla u^t(a)$ of the utility function and with the variation measured by the $L_p$-norm. See [20] and references therein for details and corresponding no-regret learning algorithms.

The external regret and its extensions are measured against an alternative strategy of a single player. A new notion of regret—Nash equilibrium regret—considers a benchmark policy that is jointly determined by the strategies of all players [21]. Consider a repeated game with time-varying utility functions $\{u_i^t\}_{t=1}^T$ for each player $i$. Let $\bar{u}_i = \frac{1}{T} \sum_{t=1}^T u_i^t$ be the average utility function and $s^*$ the mixed NE of the static game defined by the average utility functions $\bar{u} = (\bar{u}_1, \ldots, \bar{u}_N)$. The NE regret of player $i$ following a policy $\pi_i$ is then given by $\mathbb{E}_{\pi}\big[\sum_{t=1}^T u_i^t(a^t)\big] - T\, \mathbb{E}_{a^* \sim s^*}[\bar{u}_i(a^*)]$, where $a^t$ is the action profile selected by the policies $\pi = (\pi_1, \ldots, \pi_N)$ of all players at stage $t$. No-regret learning under the NE regret ensures that each player's average reward asymptotically matches that promised by the mixed NE under the average utility functions. A centralized learning algorithm achieving no-NE-regret was developed in [21] for repeated two-player zero-sum games with arbitrarily varying utility functions. Achieving no-regret learning under the measure of NE regret in a distributed setting, however, remains open.

The two key measures—convergence to equilibria and POTA—for system-level performance also need to be modified to take into account game dynamics. The time-varying sequence $\{G^t\}_{t=1}^T$ of stage games defines a sequence of equilibria and a sequence $\{\mathrm{OPT}^t\}_{t=1}^T$ of optimal social welfare. The desired relation between no-regret learning dynamics at individual players and the system-level equilibria is thus in terms of tracking rather than converging. For the definition of POTA, the optimal social welfare in the numerator in (3) needs to be replaced with the average optimal social welfare $\frac{1}{T} \sum_{t=1}^T \mathrm{OPT}^t$.

An online learning algorithm is said to successfully track the sequence of (mixed) NE in a dynamic game if the average distance between the sequence of (mixed) action profiles resulting from the algorithm and the sequence of (mixed) NE vanishes as $T$ tends to infinity. A representative study in [19] considers a game with a continuum of actions and dynamicity manifesting only in the utility functions. Under the assumptions that the sequence of NE is slow-varying and the utility functions are monotonic, it was shown that learning algorithms with sublinear dynamic regret successfully track the sequence of NE.
The monotonicity of the utility functions plays a key role in the analysis: it translates the closeness between the learning dynamics and the NE in terms of the cumulative reward (as in the regret measure) to the closeness in terms of their distance in the action space (the concern of the tracking outcome).

The performance of no-regret learning in terms of social welfare was studied in [22] for games with a dynamic population of players. Specifically, in each stage, each player may independently exit with a fixed probability and is subsequently replaced with a new player with a potentially different utility function (the population size is therefore fixed and the player set is a stationary process over time). For structural games such as first-price auctions, bandwidth allocation, and congestion games, the relation between no-adaptive-regret learning and the average optimal social welfare was examined.

Game dynamics can be in diverse forms. A holistic understanding of the matching between regret notions and the underlying dynamics of the game is still lacking. Different forms of game dynamics demand different benchmark policies in order to arrive at a meaningful regret measure that lends significance to the stamp of "no-regret learning" yet at the same time is attainable. Viewing from a different angle, one may pose the fundamental question of what kinds of game dynamics are tamable through distributed online learning and make no-regret learning and approximately optimal social welfare feasible.
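Tying back to the K-shifting regret introduced in Sec. 3.1, the following added sketch (not from the original text) computes this regret for a recorded play history against the best piecewise-stationary benchmark with at most K changes, using dynamic programming over segment boundaries; all quantities are hypothetical.

import numpy as np

def k_shifting_regret(utility, actions, K):
    """K-shifting regret: reward of the best action sequence with at most K changes
    (i.e., at most K+1 stationary segments), minus the realized reward.
    utility[t, a] = u^t(a, a_{-i}^t); actions[t] = action actually played."""
    T, _ = utility.shape
    prefix = np.vstack([np.zeros(utility.shape[1]), np.cumsum(utility, axis=0)])

    def best_segment(s, t):                  # best fixed action on stages s..t-1
        return (prefix[t] - prefix[s]).max()

    dp = np.full((K + 2, T + 1), -np.inf)    # dp[j][t]: best reward on first t stages, <= j segments
    dp[:, 0] = 0.0
    for j in range(1, K + 2):
        for t in range(1, T + 1):
            dp[j][t] = max(dp[j - 1][s] + best_segment(s, t) for s in range(t))
            dp[j][t] = max(dp[j][t], dp[j - 1][t])
    realized = utility[np.arange(T), actions].sum()
    return dp[K + 1][T] - realized

# Toy example: 2 actions, the better action switches once halfway through the horizon.
T = 200
utility = np.zeros((T, 2))
utility[:T // 2, 0], utility[:T // 2, 1] = 1.0, 0.2
utility[T // 2:, 0], utility[T // 2:, 1] = 0.2, 1.0
actions = np.zeros(T, dtype=int)             # a player that always plays action 0
print(k_shifting_regret(utility, actions, K=1))   # ~ 0.8 * T/2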
Learning and adaptation rely on feedback. The quality of the feedback in terms of completeness and accuracy thus has significant implications for no-regret learning. We explore this issue in this section.

4.1 Incomplete Feedback

Incomplete feedback stands in contrast to full-information feedback, where the utilities of all actions a player could have taken are observed in each stage. Incompleteness can be spatial, across the action space, or temporal, across decision stages. In the former case, a commonly studied model is the so-called bandit feedback, where only the utility of the chosen action is revealed. In the latter, the feedback model is referred to as lossy feedback, where there are decision stages with no feedback [23]. One can easily envision a more general model compounding bandit feedback with lossy feedback. Studies on this general model are lacking in the literature.

The term "bandit feedback" has its roots in the classical problem of the multi-armed bandit [24]. The name of the problem comes from likening an archetypical single-player online learning problem to playing a multi-armed slot machine (known as a bandit for its ability to empty the player's pocket). Each arm, when pulled, generates rewards according to an unknown stochastic model or in an adversarial fashion. Only the reward of the chosen arm is revealed after each play. Due to the incomplete feedback, the player faces the tradeoff between exploration (to gather information from less explored arms) and exploitation (to maximize immediate reward by favoring arms with a good reward history).

In a multi-player game setting with bandit feedback, no-regret learning from an individual player's perspective can be cast as a single-player non-stochastic/adversarial bandit model where the payoff of each arm/action is adversarially chosen and aggregates the interaction with the other players in the game. The concept of external regret in the game setting corresponds to the weak regret in the adversarial bandit model [25], which adopts the best single-arm policy in hindsight as the benchmark. The MW algorithm was modified in [25] to handle the change of the feedback model from full-information to bandit. Specifically, the weight $W_a(t)$ of action $a$ at time $t$ is updated as $W_a(t) = W_a(t-1)\, e^{\epsilon r_a(t)/p_a(t)}$, where $p_a(t)$ is the probability of selecting action $a$ at time $t$ and $r_a(t) = 0$ if $a$ is unselected. Dividing the observed reward by the corresponding probability of the chosen action ensures the unbiasedness of the observation. Quite intuitively, the price for not observing the rewards of all actions is a degradation of the regret order in the size of the action space, i.e., from $\Theta\big(\sqrt{T \log |\mathcal{A}|}\big)$ in the full-information setting [1] to $\Theta\big(\sqrt{|\mathcal{A}|\, T}\big)$ in the bandit setting [26].
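The sketch below (an added illustration of the stated update rule, not code from [25]) adapts the MW update to bandit feedback by importance-weighting the single observed reward; the adversarial reward sequence here is a hypothetical stand-in for the aggregate effect of the other players.

import numpy as np

rng = np.random.default_rng(2)
K, T, eps = 5, 10000, 0.02

log_w = np.zeros(K)
for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    a = rng.choice(K, p=p)                 # only this action's reward is observed
    r_full = rng.random(K)                 # the adversary's (hidden) reward vector at stage t
    r_hat = np.zeros(K)
    r_hat[a] = r_full[a] / p[a]            # importance-weighted estimate: unbiased, since
                                           # E[r_hat[a']] = p[a'] * r_full[a'] / p[a'] = r_full[a']
    log_w += eps * r_hat                   # W_a(t) = W_a(t-1) * exp(eps * r_a(t) / p_a(t))
    # (the algorithm analyzed in [25] additionally mixes in a small amount of uniform exploration)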
The multi-player bandit problem explicitly models the existence of $N$ players competing for $M$ ($M > N$) arms [27]. Originally motivated by applications in wireless communication networks where distributed users compete for access to multiple channels, this specific game model is characterized by a special form of interaction among players: a collision occurs when multiple players select the same arm, which results in utility loss. The objective of this distributed learning problem is to minimize the system-level regret over all players against the optimal centralized (hence collision-free) allocation of the players to the best set of arms [27]. In addition to the exploration-exploitation tradeoff in the single-player setting, this distributed learning problem under a system-level objective also faces the tradeoff between selecting a good arm and avoiding collisions with competing players. A number of distributed learning algorithms have been developed to achieve a sublinear system-level regret with respect to $T$ [27]. Recent extensions of the multi-player bandit problem further consider the setting where each arm offers different payoffs across players [28].

The multi-player bandit problem is a special game model in that the players have identical action spaces and their interaction is only in the form of "collisions" when choosing the same action. In a general game setting, the impact of incomplete feedback on no-regret learning and system-level performance is largely open. One quantitative measure of the impact is the regret order with respect to the size of the action space. As mentioned above, bandit feedback results in an additional $\sqrt{|\mathcal{A}|}$ term in the regret order, which can be significant when the action space is large. Recent work [29, 30] has shown that local communications among neighboring players in a network setting can mitigate the negative impact of bandit feedback on the regret order in $|\mathcal{A}|$. In terms of the impact on the system-level performance, it has been shown under a game model with a continuum of actions that bandit feedback degrades the convergence rate of the learning dynamics to equilibria [31].

Imperfect feedback refers to the inaccuracy of the observed utilities in revealing the quality of the selected actions. Recall that mixed strategies are necessary for achieving no-regret learning in the adversarial setting. The quality of a mixed strategy is characterized by the expected utility, where the expectation is taken over the randomness of the strategies of all players. Referred to as expected feedback, the feedback model assuming observations of the expected utility, however, can be unrealistic. A more commonly adopted feedback model is the realized feedback, where only the utility of the realized action profile is revealed. The realized feedback can be viewed as a noisy unbiased estimate of the expected feedback, where the noise is due to the randomness of the players' strategies.

The so-called noisy feedback assumes a different source of noise: it comes from the external environment and is additive to either the observed utility vectors in the so-called semi-bandit feedback [14] with a finite action space, or the gradient of the utility functions in the first-order feedback [32] with a continuum of actions. Under the assumptions of unbiasedness and bounded variance, the issue of the additive noise can be addressed by rather standard estimation techniques and analysis. A more challenging setting is to consider non-stochastic noise due to adversarial attacks, especially in applications such as adversarial machine learning. This problem was recently studied in the single-player setting [33]. Studies in the multi-agent setting are still lacking.
The concept of bounded rationality was first introduced in economics [34] to provide more realistic models than the often adopted perfect rationality, which assumes that the decision-making of players is the result of a full optimization of their utilities. In reality, players often take reasoning shortcuts that may lead to suboptimal decisions. Such reasoning shortcuts may be a result of the limited cognition of human minds or necessitated by the available computation time and power relative to the complexity of action optimization.

Cognitive limitations include the limited ability to anticipate other decision-makers' strategic responses and certain psychological factors that interfere with the valuation of options. Various models exist for capturing the limitations in the players' valuation of options. For example, a player may be myopic, focusing only on the short-term reward [35]. Even with forward-thinking, a player may settle for suboptimal actions perceived as acceptable by the player [34]. The limitation in a player's ability to anticipate other players' strategies can be modeled through a cognitive hierarchy by grouping players according to their cognitive abilities and characterizing them in an iterative fashion. Specifically, players with the lowest level of cognitive ability are grouped as the level-0 players who make decisions randomly.
Level-$k$ ($k > 0$) players are then defined iteratively as those who assume they are playing against lower-level players and anticipate the opponents' strategies accordingly. Recent work draws an interesting connection between the cognitive hierarchy model and the
Optimistic Mirror Descent (OMD) algorithm for solving the saddle-point problem with applications in generative adversarial networks [36]. The saddle-point problem can be viewed as a two-player zero-sum game with a continuum of actions. The solutions to the problem correspond to the set of NE. It has been shown that the OMD algorithm guarantees a converging system dynamic to an NE in terms of the actual plays, while Gradient Descent (GD) may lead to cycles [36]. In the language of cognitive hierarchy, players adopting GD can be regarded as level-0 thinkers in the sense that they do not anticipate the strategies of their opponents. Players adopting OMD are level-1 thinkers since they take advantage of the fact that their opponents are taking similar gradient methods, which will not lead to abrupt gradient changes between two consecutive stages [36]. Consequently, an extra gradient update is applied in OMD to accelerate learning.

Besides cognitive limitations, players are also constrained in terms of physical resources such as memory and computation power. Acquiring, storing, and processing all relevant information for decision-making may be infeasible, especially in complex systems with a large action space. For example, players may only choose from strategies with bounded complexity [37], or use only recent observations in decision-making due to memory constraints [38].

While models for bounded rationality abound in economics, political science, and other related disciplines, incorporating such models into distributed online learning is still in its infancy. A holistic understanding of the implications of bounded rationality in distributed online learning is yet to be gained. An intriguing aspect of the problem is that bounded rationality may not necessarily imply degraded performance. For example, in dynamic games, bounded memory of past experiences may have little effect since no-regret learning dictates that the distant past be forgotten (see discussions in Sec. 3).
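To illustrate the contrast discussed above (an added sketch, not the experiments in [36]), the following compares plain gradient descent-ascent with an optimistic variant on the bilinear saddle-point problem $\min_x \max_y xy$, whose unique NE is $(0, 0)$; the step size and horizon are hypothetical.

import numpy as np

eta, T = 0.1, 2000
# Bilinear zero-sum game f(x, y) = x * y: player 1 minimizes over x, player 2 maximizes over y.
grad_x = lambda x, y: y        # df/dx
grad_y = lambda x, y: x        # df/dy

# Plain gradient descent-ascent (a level-0 scheme in the cognitive-hierarchy reading):
# each step multiplies the distance to (0, 0) by sqrt(1 + eta^2), so the iterates spiral away.
x, y = 1.0, 1.0
for t in range(T):
    x, y = x - eta * grad_x(x, y), y + eta * grad_y(x, y)
gda_dist = np.hypot(x, y)

# Optimistic gradient descent-ascent (an instance of OMD with a Euclidean regularizer):
# the extra term reuses the previous gradient as a prediction of the next one,
# and the iterates converge towards the NE (0, 0).
x, y = 1.0, 1.0
gx_prev, gy_prev = grad_x(x, y), grad_y(x, y)
for t in range(T):
    gx, gy = grad_x(x, y), grad_y(x, y)
    x, y = x - eta * (2 * gx - gx_prev), y + eta * (2 * gy - gy_prev)
    gx_prev, gy_prev = gx, gy
omd_dist = np.hypot(x, y)

print(f"distance to NE after {T} steps: GDA {gda_dist:.2e}, optimistic {omd_dist:.2e}")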
Heterogeneity
The heterogeneity of complex multi-agent systems characterizes the asymmetry across players in three aspects: the available information and knowledge about the system, the available actions, and the level of adaptivity to opponents' strategies. In the example of mixed traffic in urban transportation, autonomous vehicles, while likely to have greater computation power for solving complex decision problems, may have to obey an additional set of regulations on available actions.

In adversarial machine learning, in addition to the asymmetry in knowledge and power, the attacker and the defender may also have different levels of real-time adaptivity to the other player's strategy. Classical regret notions such as the external regret, which assumes fixed actions of the other players, while applicable to oblivious attackers, are no longer valid under adaptive attacks. A partial solution is to adopt a new notion of policy regret defined against an adaptive adversary who assigns reward vectors based on previous actions of the player [39]. Specifically, let $u_t(\cdot\,; a^{t-1})$ denote the player's reward function determined by the adversary at time $t$, given the sequence of actions $a^{t-1}$ taken by the player in the past. The policy regret with reward functions $\{u_t\}_{t=1}^T$ is defined as

$$\max_{a \in \mathcal{A}} \mathbb{E}\left[\sum_{t=1}^T u_t(a; \{a, \ldots, a\}) - \sum_{t=1}^T u_t(a^t; a^{t-1})\right], \qquad (4)$$

where $u_t(\cdot\,; \{a, \ldots, a\})$ denotes the reward function determined by the adversary if the player took actions $\{a, \ldots, a\}$ in the past. The $m$-memory policy regret is defined by assuming that the reward function depends only on the past $m$ actions of the player.

The difference between the external regret and the policy regret may not be crucial if the adversary and the player have homogeneous objectives (e.g., mixed traffic in transportation systems). It has been shown that there exists a wide class of algorithms that can ensure no-regret learning under both regret definitions, as long as the adversary is also using such an algorithm [40]. In applications such as adversarial machine learning where the adversary may be a malicious opponent, the two notions of regret are incompatible: there exists an $m$-memory adaptive adversary that can make any action sequence of the player with sublinear regret in one notion suffer from linear regret in the other [40]. A general technique for developing no-policy-regret algorithms in the single-player setting was proposed in [39]. In terms of the system-level performance, it was shown in two-player games that no-policy-regret learning guarantees convergence of the system dynamic to a new notion of equilibrium called policy equilibrium [40]. However, the understanding of policy equilibrium is limited. In games with more than two players, even the definition of policy equilibrium is unclear.
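As an added illustration of the definition in (4) (not from the original text), the sketch below computes the policy regret of a play history against an m-memory adaptive adversary by replaying each constant-action counterfactual; the adversary's reward rule is a hypothetical example.

import numpy as np

rng = np.random.default_rng(3)
K, T, m = 3, 500, 2

def adversary_reward(history, action):
    """Hypothetical m-memory adversary: it rewards the player for repeating the action
    played most often in its last m plays (the rule depends only on the recent history)."""
    recent = history[-m:]
    target = max(set(recent), key=recent.count) if recent else 0
    return 1.0 if action == target else 0.2

# The player's realized history (here: uniformly random actions, for illustration).
actions = list(rng.integers(K, size=T))
realized = sum(adversary_reward(actions[:t], actions[t]) for t in range(T))

# Benchmark in (4): play a single fixed action a from the start, so the adversary's
# reward function at time t is evaluated on the counterfactual history {a, ..., a}.
best_fixed = max(
    sum(adversary_reward([a] * t, a) for t in range(T))
    for a in range(K)
)
print("policy regret:", best_fixed - realized)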
References

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[2] H. P. Young, Strategic Learning and its Limits. OUP Oxford, 2004.
[3] M. Barreno, B. Nelson, A. D. Joseph, and J. D. Tygar, "The security of machine learning," Machine Learning, vol. 81, no. 2, pp. 121–148, 2010.
[4] N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Algorithmic Game Theory. Cambridge University Press, 2007.
[5] T. Roughgarden, "Intrinsic robustness of the price of anarchy," Journal of the ACM, vol. 62, no. 5, p. 32, 2015.
[6] Y. Viossat and A. Zapechelnyuk, "No-regret dynamics and fictitious play," Journal of Economic Theory, vol. 148, no. 2, pp. 825–842, 2013.
[7] R. Laraki and S. Sorin, "Advances in zero-sum dynamic games," in Handbook of Game Theory with Economic Applications. Elsevier, 2015, vol. 4, pp. 27–93.
[8] J. Hannan, "Approximation to Bayes risk in repeated play," Contributions to the Theory of Games, vol. 3, pp. 97–139, 1957.
[9] D. Blackwell, "An analog of the minimax theorem for vector payoffs," Pacific Journal of Mathematics, vol. 6, no. 1, pp. 1–8, 1956.
[10] G. Stoltz and G. Lugosi, "Internal regret in on-line portfolio selection," Machine Learning, vol. 59, no. 1-2, pp. 125–159, 2005.
[11] S. Hart and A. Mas-Colell, "A simple adaptive procedure leading to correlated equilibrium," Econometrica, vol. 68, no. 5, pp. 1127–1150, 2000.
[12] P. Mertikopoulos, C. Papadimitriou, and G. Piliouras, "Cycles in adversarial regularized learning," in Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2018, pp. 2703–2717.
[13] J. P. Bailey and G. Piliouras, "Multiplicative weights update in zero-sum games," in Proceedings of the 2018 ACM Conference on Economics and Computation. ACM, 2018, pp. 321–338.
[14] A. Heliou, J. Cohen, and P. Mertikopoulos, "Learning with bandit feedback in potential games," in Advances in Neural Information Processing Systems, 2017, pp. 6369–6378.
[15] G. Gidel, R. A. Hemmat, M. Pezeshki, R. Le Priol, G. Huang, S. Lacoste-Julien, and I. Mitliagkas, "Negative momentum for improved game dynamics," in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1802–1811.
[16] A. Blum, M. Hajiaghayi, K. Ligett, and A. Roth, "Regret minimization and the price of total anarchy," in Proceedings of the 40th Annual ACM Symposium on Theory of Computing. ACM, 2008, pp. 373–382.
[17] J. R. Marden, H. P. Young, and L. Y. Pao, "Achieving Pareto optimality through distributed learning," SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, 2014.
[18] H. Luo and R. E. Schapire, "Achieving all with no parameters: AdaNormalHedge," in Conference on Learning Theory, 2015, pp. 1286–1304.
[19] B. Duvocelle, P. Mertikopoulos, M. Staudigl, and D. Vermeulen, "Learning in time-varying games," arXiv preprint arXiv:1809.03066, 2018.
[20] A. Mokhtari, S. Shahrampour, A. Jadbabaie, and A. Ribeiro, "Online optimization in dynamic environments: Improved regret rates for strongly convex problems," IEEE, 2016, pp. 7195–7201.
[21] A. R. Cardoso, J. Abernethy, H. Wang, and H. Xu, "Competing against Nash equilibria in adversarially changing zero-sum games," in Proceedings of the 36th International Conference on Machine Learning, vol. 97. PMLR, 2019, pp. 921–930.
[22] T. Lykouris, V. Syrgkanis, and É. Tardos, "Learning and efficiency in games with dynamic population," in Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms, 2016, pp. 120–129.
[23] Z. Zhou, P. Mertikopoulos, S. Athey, N. Bambos, P. W. Glynn, and Y. Ye, "Learning in games with lossy feedback," in Advances in Neural Information Processing Systems, 2018, pp. 5140–5150.
[24] Q. Zhao, Multi-Armed Bandits: Theory and Applications to Online Learning in Networks. Morgan & Claypool Publishers, 2019.
[25] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multi-armed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.
[26] J.-Y. Audibert and S. Bubeck, "Minimax policies for adversarial and stochastic bandits," in Proceedings of the 22nd Annual Conference on Learning Theory, 2009, pp. 217–226.
[27] K. Liu and Q. Zhao, "Distributed learning in multi-armed bandit with multiple players," IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.
[28] I. Bistritz and A. Leshem, "Distributed multi-player bandits—a game of thrones approach," in Advances in Neural Information Processing Systems, 2018, pp. 7222–7232.
[29] N. Cesa-Bianchi, C. Gentile, and Y. Mansour, "Delay and cooperation in nonstochastic bandits," The Journal of Machine Learning Research, vol. 20, no. 1, pp. 613–650, 2019.
[30] Y. Bar-On and Y. Mansour, "Individual regret in cooperative nonstochastic multi-armed bandits," in Advances in Neural Information Processing Systems, 2019, pp. 3110–3120.
[31] M. Bravo, D. Leslie, and P. Mertikopoulos, "Bandit learning in concave N-person games," in Advances in Neural Information Processing Systems, 2018, pp. 5661–5671.
[32] P. Mertikopoulos and Z. Zhou, "Learning in games with continuous action sets and unknown payoff functions," Mathematical Programming, vol. 173, no. 1-2, pp. 465–507, 2019.
[33] K.-S. Jun, L. Li, Y. Ma, and J. Zhu, "Adversarial attacks on stochastic bandits," in Advances in Neural Information Processing Systems, 2018, pp. 3640–3649.
[34] H. A. Simon, "A behavioral model of rational choice," The Quarterly Journal of Economics, vol. 69, no. 1, pp. 99–118, 1955.
[35] X. Gabaix and D. Laibson, "Bounded rationality and directed cognition," Harvard University, 2005.
[36] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng, "Training GANs with optimism," in International Conference on Learning Representations, 2018.
[37] M. Scarsini and T. Tomala, "Repeated congestion games with bounded rationality," International Journal of Game Theory, vol. 41, no. 3, pp. 651–669, 2012.
[38] L. Chen, F. Lin, P. Tang, K. Wang, R. Wang, and S. Wang, "K-memory strategies in repeated games," in Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, 2017, pp. 1493–1498.
[39] R. Arora, O. Dekel, and A. Tewari, "Online bandit learning against an adaptive adversary: from regret to policy regret," in Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 1747–1754.
[40] R. Arora, M. Dinitz, T. V. Marinov, and M. Mohri, "Policy regret in repeated games," in