Learning in games with continuous action sets and unknown payoff functions
PANAYOTIS MERTIKOPOULOS AND ZHENGYUAN ZHOU

Abstract. This paper examines the convergence of no-regret learning in games with continuous action sets. For concreteness, we focus on learning via "dual averaging", a widely used class of no-regret learning schemes where players take small steps along their individual payoff gradients and then "mirror" the output back to their action sets. In terms of feedback, we assume that players can only estimate their payoff gradients up to a zero-mean error with bounded variance. To study the convergence of the induced sequence of play, we introduce the notion of variational stability, and we show that stable equilibria are locally attracting with high probability whereas globally stable equilibria are globally attracting with probability 1. We also discuss some applications to mixed-strategy learning in finite games, and we provide explicit estimates of the method's convergence speed.

Univ. Grenoble Alpes, CNRS, Inria, LIG, F-38000 Grenoble, France.
Stanford University, Dept. of Electrical Engineering, Stanford, CA 94305.
E-mail addresses: [email protected], [email protected].
2010 Mathematics Subject Classification. Primary 91A26, 90C15; secondary 90C33, 68Q32.
Key words and phrases. Continuous games; dual averaging; variational stability; Fenchel coupling; Nash equilibrium.
The authors are indebted to the associate editor and two anonymous referees for their detailed suggestions and remarks. The paper has also benefited greatly from thoughtful comments by Jérôme Bolte, Nicolas Gast, Jérôme Malick, Mathias Staudigl, and the audience of the Paris Optimization Seminar.
P. Mertikopoulos was partially supported by the French National Research Agency (ANR) project ORACLESS (ANR–GAGA–13–JS01–0004–01) and the Huawei Innovation Research Program ULTRON.
1. Introduction
The prototypical setting of online optimization can be summarized as follows: at every stage n = 1, 2, ..., of a repeated decision process, an agent selects an action X_n from some set X (assumed here to be convex and compact), and obtains a reward u_n(X_n) determined by an a priori unknown payoff function u_n : X → R. Subsequently, the agent receives some problem-specific feedback (for instance, an estimate of the gradient of u_n at X_n), and selects a new action with the goal of maximizing the obtained reward. Aggregating over the stages of the process, this goal is usually quantified by asking that the agent's regret

    R_n ≡ max_{x ∈ X} ∑_{k=1}^n [u_k(x) − u_k(X_k)]

grow sublinearly in n, a property known as "no regret".

In this general setting, the most widely used class of no-regret policies is the online mirror descent (OMD) method of Shalev-Shwartz (2007) and its variants – such as "Following the Regularized Leader" (Shalev-Shwartz and Singer, 2007), dual averaging (Nesterov, 2009; Xiao, 2010), etc. Specifically, if the problem's payoff functions are concave, mirror descent guarantees an O(√n) regret bound, which is well known to be tight in a "black-box" environment (i.e., without any further assumptions on u_n). Owing to these guarantees, this class of first-order methods has given rise to an extensive literature in online learning and optimization; for a survey, see Shalev-Shwartz (2011), Bubeck and Cesa-Bianchi (2012), Hazan (2012), and references therein.

In this paper, we consider a multi-agent extension of the above framework where the agents' rewards are determined by their individual actions and the actions of all other agents via a fixed mechanism: a non-cooperative game. Even though this mechanism may be unknown and/or opaque to the players, the additional structure it provides means that finer convergence criteria apply, chief among them being convergence to a Nash equilibrium (NE). We are thus led to the following fundamental question: if all players of a repeated game employ a no-regret updating policy, do their actions converge to a Nash equilibrium of the underlying game?
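To fix ideas, the following toy sketch illustrates the no-regret property for a single agent running online gradient ascent on X = [0, 1] against a stream of concave payoffs u_n(x) = −(x − θ_n)². All numerical choices (payoff stream, horizon, step-size) are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5_000
theta = rng.uniform(0.0, 1.0, T)   # payoff parameters, unknown in advance

X = np.empty(T)                    # realized actions X_n
x = 0.5
for n in range(T):
    X[n] = x
    grad = -2.0 * (x - theta[n])                      # gradient of u_n at X_n
    x = np.clip(x + grad / np.sqrt(n + 1), 0.0, 1.0)  # ascent step + projection

best_fixed = theta.mean()          # maximizer of sum_n u_n(x) over [0, 1]
regret = np.sum(-(best_fixed - theta)**2) - np.sum(-(X - theta)**2)
print(regret, regret / T)          # R_T grows sublinearly, so R_T / T -> 0
```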
Summary of contributions.
In general, the answer to this question is a resounding "no". Even in simple, finite games, no-regret learning may cycle (Mertikopoulos et al., 2018) and its limit set may contain highly non-rationalizable strategies that assign positive weight only to strictly dominated strategies (Viossat and Zapechelnyuk, 2013). As such, our aim in this paper is twofold:

i) to provide sufficient conditions under which no-regret learning converges to equilibrium; and
ii) to assess the speed and robustness of this convergence in the presence of uncertainty, feedback noise, and other learning impediments.

Our contributions along these lines are as follows: First, in Section 2, we introduce an equilibrium stability notion which we call variational stability (VS), and which is formally similar to (and inspired by) the seminal notion of evolutionary stability in population games (Maynard Smith and Price, 1973). This stability notion extends the standard notion of operator monotonicity, so it applies in particular to all monotone games (that is, concave games that satisfy Rosen's (1965) diagonal strict concavity condition). In fact, going beyond concave games, variational stability allows us to treat convergence questions in general games with continuous action spaces without having to restrict ourselves to a specific subclass (such as potential or common interest games).

Our second contribution is a detailed analysis of the long-run behavior of no-regret learning under variational stability. Regarding the information available to the players, our only assumption is that they have access to unbiased, bounded-variance estimates of their individual payoff gradients at each step; beyond this, we assume no prior knowledge of their payoff functions and/or the game. Despite this lack of information, variational stability guarantees that (i) the induced sequence of play converges globally to globally stable equilibria with probability 1 (Theorem 4.7); and (ii) it converges locally to locally stable equilibria with high probability (Theorem 4.11). As a corollary, if the game admits a (pseudo-)concave potential or if it is monotone, the players' actions converge to Nash equilibrium no matter the level of uncertainty affecting the players' feedback. In Section 5, we further extend these results to learning with imperfect feedback in finite games.

[Footnote: Heuristically, variational stability is to games with a finite number of players and a continuum of actions what evolutionary stability is to games with a continuum of players and a finite action space. Our choice of terminology reflects precisely this analogy.]
Our third contribution concerns the method's convergence speed. Mirroring a known result of Nesterov (2009) for variational inequalities, we show that the gap from a stable state decays ergodically as O(1/√n) if the method's step-size is chosen appropriately. Dually to this, we also show that the algorithm's expected running length until players reach an ε-neighborhood of a stable state is O(1/ε²). Finally, if the stage game admits a sharp equilibrium (a straightforward extension of the notion of strict equilibrium in finite games), we show that, with probability 1, the process reaches an equilibrium in a finite number of steps.

Our analysis relies on tools and techniques from stochastic approximation, martingale limit theory and convex analysis. In particular, with regard to the latter, we make heavy use of a "primal-dual divergence" measure between action and gradient variables, which we call the Fenchel coupling. This coupling is a hybridization of the Bregman divergence which provides a potent tool for proving convergence thanks to its Lyapunov properties.
Related work.
Originally, mirror descent was introduced by Nemirovski and Yudin (1983) for solving offline convex programs. The dual averaging (DA) variant that we consider here was pioneered by Nesterov (2009) and proceeds as follows: at each stage, the method takes a gradient step in a dual space (where gradients live); the result is then mapped (or "mirrored") back to the problem's feasible region, a new gradient is generated, and the process repeats. The "mirroring" step above is itself determined by a strongly convex regularizer (or "distance generating") function: the squared Euclidean norm gives rise to Zinkevich's (2003) online gradient descent algorithm, while the (negative) Gibbs entropy on the simplex induces the well-known exponential weights (EW) algorithm (Vovk, 1990; Arora et al., 2012).

Nesterov (2009) and Nemirovski et al. (2009) provide several convergence results for dual averaging in (stochastic) convex programs and saddle-point problems, while Xiao (2010) provides a thorough regret analysis for online optimization problems. In addition to treating the interactions of several competing agents at once, the fundamental difference of our paper with these works is that the convergence analysis in the latter is "ergodic", i.e., it concerns the time-averaged sequence X̄_n = ∑_{k=1}^n γ_k X_k / ∑_{k=1}^n γ_k, and not the actual sequence of actions X_n employed by the players.

In online optimization, this averaging comes up naturally because the focus is on the players' regret. In the offline case, the points where an oracle is called during the execution of an algorithm do not carry any particular importance, so averaging provides a convenient way of obtaining convergence. However, in a game-theoretic setting, the figure of merit is the actual sequence of play, which determines the players' payoffs at each stage. The behavior of X_n may differ drastically from that of X̄_n, so our treatment requires a completely different set of tools and techniques (especially in the stochastic regime).

Much of our analysis boils down to solving in an online way a (stochastic) variational inequality (VI) characterizing the game's Nash equilibria. Nesterov (2007) and Juditsky et al. (2011) provide efficient offline methods to do this, relying on an "extra-gradient" step to boost the convergence rate of the ergodic sequence X̄_n. In our limited-feedback setting, we do not assume that players can make an extra oracle call to actions that were not actually employed, so the extrapolation results of Nesterov (2007) and Juditsky et al. (2011) do not apply. The single-call results of Nesterov (2009) are closer in spirit to our paper but, again, they focus exclusively on monotone variational inequalities and the ergodic sequence X̄_n – not the actual sequence of play X_n. All the same, for completeness, we make the link with ergodic convergence in Theorems 4.13 and 6.2.

When applied to mixed-strategy learning in finite games, the class of algorithms studied here has very close ties to the family of perturbed best response maps that arise in models of fictitious play and reinforcement learning (Hofbauer and Sandholm, 2002; Leslie and Collins, 2005; Coucheney et al., 2015). Along these lines, Mertikopoulos and Sandholm (2016) recently showed that a continuous-time version of the dynamics studied in this paper eliminates dominated strategies and converges to strict equilibria from all nearby initial conditions. Our analysis in Section 5 extends these results to a discrete-time, stochastic setting.

In games with continuous action sets, Perkins and Leslie (2012) and Perkins et al. (2017) examined a mixed-strategy actor-critic algorithm which converges to a probability distribution that assigns most weight to equilibrium states. At the pure strategy level, several authors have considered VI-based and Gauss–Seidel methods for solving generalized Nash equilibrium problems (GNEPs); for a survey, see Facchinei and Kanzow (2007) and Scutari et al. (2010). The intersection of these works with the current paper is when the game satisfies a global monotonicity condition similar to the diagonal strict concavity condition of Rosen (1965). However, the literature on GNEPs does not consider the implications for the players' regret, the impact of uncertainty and/or local convergence/stability issues, so there is no overlap with our results.

Finally, during the final preparation stages of this paper (a few days before the actual submission), we were made aware of a preprint by Bervoets et al. (2016) examining the convergence of pure-strategy learning in strictly concave games with one-dimensional action sets. A key feature of the analysis of Bervoets et al. (2016) is that players only observe their realized, in-game payoffs, and they choose actions based on their payoffs' variation from the previous period. The resulting mean dynamics boil down to an instantiation of dual averaging induced by the entropic regularization penalty h(x) = x log x (cf. Section 3), suggesting several interesting links with the current work.

[Footnote: In the online learning literature, dual averaging is sometimes called lazy mirror descent and can be seen as a linearized "Follow the Regularized Leader" (FTRL) scheme – for more details, we refer the reader to Beck and Teboulle (2003), Xiao (2010), and Shalev-Shwartz (2011).]
Notation. Given a finite-dimensional vector space V with norm ‖·‖, we write V* for its dual, ⟨y, x⟩ for the pairing between y ∈ V* and x ∈ V, and ‖y‖_* ≡ sup{⟨y, x⟩ : ‖x‖ ≤ 1} for the dual norm of y in V*. If C ⊆ V is convex, we also write C° ≡ ri(C) for the relative interior of C, ‖C‖ = sup{‖x′ − x‖ : x, x′ ∈ C} for its diameter, and dist(C, x) = inf_{x′ ∈ C} ‖x′ − x‖ for the distance between x ∈ V and C. For a given x ∈ C, the tangent cone TC_C(x) is defined as the closure of the set of all rays emanating from x and intersecting C in at least one other point; dually, the polar cone PC_C(x) to C at x is defined as PC_C(x) = {y ∈ V* : ⟨y, z⟩ ≤ 0 for all z ∈ TC_C(x)}. For concision, when C is clear from the context, we will drop it altogether and write TC(x) and PC(x) instead.
2. Continuous games and variational stability
2.1. Basic definitions and examples.
Throughout this paper, we focus on games played by a finite set of players i ∈ N = {1, ..., N}. During play, each player selects an action x_i from a compact convex subset X_i of a finite-dimensional normed space V_i, and their reward is determined by the profile x = (x_1, ..., x_N) of all players' actions – often denoted as x ≡ (x_i; x_{−i}) when we seek to highlight the action x_i of player i against the ensemble of actions x_{−i} = (x_j)_{j≠i} of all other players.

In more detail, writing X ≡ ∏_i X_i for the game's action space, each player's payoff is determined by an associated payoff function u_i : X → R. In terms of regularity, we assume that u_i is continuously differentiable in x_i, and we write

    v_i(x) ≡ ∇_{x_i} u_i(x_i; x_{−i})    (2.1)

for the individual gradient of u_i at x; we also assume that u_i and v_i are both continuous in x. Putting all this together, a continuous game is a tuple G ≡ G(N, (X_i)_{i∈N}, (u_i)_{i∈N}) with players, actions and payoffs defined as above.

As a special case, we will sometimes consider payoff functions that are individually (pseudo-)concave in the sense that

    u_i(x_i; x_{−i}) is (pseudo-)concave in x_i for all x_{−i} ∈ ∏_{j≠i} X_j, i ∈ N.    (2.2)

When this is the case, we say that the game itself is (pseudo-)concave. Below, we briefly discuss some well-known examples of such games:

Example 2.1. In a finite game Γ ≡ (N, A, u), each player i ∈ N chooses an action α_i from a finite set A_i of "pure strategies" and no assumptions are made on the players' payoff functions u_i : A ≡ ∏_j A_j → R. Players can "mix" these choices by playing mixed strategies, i.e., probability distributions x_i drawn from the simplex X_i ≡ Δ(A_i). In this case (and in a slight abuse of notation), the expected payoff to player i in the mixed profile x = (x_1, ..., x_N) can be written as

    u_i(x) = ∑_{α_1 ∈ A_1} ··· ∑_{α_N ∈ A_N} u_i(α_1, ..., α_N) x_{1,α_1} ··· x_{N,α_N},    (2.3)

so the players' individual gradients are simply their payoff vectors:

    v_i(x) = ∇_{x_i} u_i(x) = (u_i(α_i; x_{−i}))_{α_i ∈ A_i}.    (2.4)

The resulting continuous game is called the mixed extension of Γ. Since X_i = Δ(A_i) is convex and u_i is linear in x_i, G is itself concave in the sense of (2.2).

Example 2.2. Consider the following Cournot oligopoly model: There is a finite set N = {1, ..., N} of firms, each supplying the market with a quantity x_i ∈ [0, C_i] of the same good (or service) up to the firm's production capacity C_i. This good is then priced as a decreasing function P(x) of each firm's production; for concreteness, we focus on the linear model P(x) = a − ∑_i b_i x_i, where a is a positive constant and the coefficients b_i > 0 reflect the price-setting power of each firm. In this model, the utility of firm i is given by

    u_i(x) = x_i P(x) − c_i x_i,    (2.5)

where c_i represents the marginal production cost of firm i. Letting X_i = [0, C_i], the resulting game is easily seen to be concave in the sense of (2.2).

[Footnote: In the above, we tacitly assume that u_i is defined on an open neighborhood of X_i. This allows us to use ordinary derivatives, but none of our results depend on this device. We also note that v_i(x) acts naturally on vectors z_i ∈ V_i via the mapping z_i ↦ ⟨v_i(x), z_i⟩ ≡ u′_i(x; z_i) = d/dτ|_{τ=0} u_i(x_i + τz_i; x_{−i}); in view of this, v_i(x) is treated as an element of V*_i, the dual of V_i.]
[Figure 1. Geometric characterization of Nash equilibria.]
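To make Example 2.2 concrete, here is a minimal Python sketch of the model's payoffs (2.5) and individual gradients (2.1); every numerical value (a, b_i, c_i, C_i) is an illustrative assumption of ours, not a quantity from the paper:

```python
import numpy as np

a = 10.0                          # demand intercept
b = np.array([1.0, 1.2, 0.8])     # price-setting power of each firm
c = np.array([2.0, 1.5, 2.5])     # marginal production costs
C = np.array([4.0, 4.0, 4.0])     # production capacities

def price(x):
    return a - np.dot(b, x)                   # P(x) = a - sum_i b_i x_i

def payoff(i, x):
    return x[i] * price(x) - c[i] * x[i]      # u_i(x), Eq. (2.5)

def v(x):
    # Individual gradients v_i(x) = d u_i / d x_i, Eq. (2.1):
    # v_i = a - sum_j b_j x_j - b_i x_i - c_i.
    return a - np.dot(b, x) - b * x - c

x = np.array([1.0, 2.0, 0.5])     # some feasible action profile
print(price(x), payoff(0, x), v(x))
```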
Example 2.3. Congestion games are game-theoretic models that arise in the study of traffic networks (such as the Internet). To define them, fix a set of players N that share a set of resources r ∈ R, each associated with a nondecreasing convex cost function c_r : R_+ → R (for instance, links in a data network and their corresponding delay functions). Each player i ∈ N has a certain resource load ρ_i > 0 which is split over a collection A_i of subsets α_i of R – e.g., sets of links that form paths in the network. The action space of player i ∈ N is then the scaled simplex X_i = ρ_i Δ(A_i) = {x_i ∈ R_+^{A_i} : ∑_{α_i ∈ A_i} x_{iα_i} = ρ_i} of load distributions over A_i.

Given a load profile x = (x_1, ..., x_N), costs are determined based on the utilization of each resource as follows: First, the demand w_r of the r-th resource is defined as the total load w_r = ∑_{i ∈ N} ∑_{α_i ∋ r} x_{iα_i} on said resource. This demand incurs a cost c_r(w_r) per unit of load to each player utilizing resource r. Accordingly, the total cost to player i ∈ N is

    c_i(x) = ∑_{α_i ∈ A_i} x_{iα_i} c_{iα_i}(x),    (2.6)

where c_{iα_i}(x) = ∑_{r ∈ α_i} c_r(w_r) denotes the cost incurred to player i by the utilization of α_i ⊆ R. The resulting atomic splittable congestion game G ≡ G(N, X, −c) is easily seen to be concave in the sense of (2.2).
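Similarly, the sketch below computes resource demands and the costs (2.6) on a toy instance of Example 2.3 with quadratic resource costs c_r(w) = w²; the two-player network and all loads are illustrative assumptions:

```python
import numpy as np

# Resources r = 0, 1, 2; each player splits a load rho_i over "paths"
# (subsets of resources). Costs per resource: c_r(w) = w**2.
paths = [
    [(0,), (1, 2)],    # player 0: two available paths
    [(0, 1), (2,)],    # player 1: two available paths
]
rho = [1.0, 2.0]       # resource loads rho_i

def resource_demands(x):
    # w_r = total load on resource r, summed over all players and paths.
    w = np.zeros(3)
    for i, xi in enumerate(x):
        for load, path in zip(xi, paths[i]):
            for r in path:
                w[r] += load
    return w

def cost(i, x):
    # c_i(x) = sum over paths of x_{i,alpha} * sum_{r in alpha} c_r(w_r).
    w = resource_demands(x)
    return sum(load * sum(w[r]**2 for r in path)
               for load, path in zip(x[i], paths[i]))

x = [np.array([0.3, 0.7]), np.array([1.5, 0.5])]   # each sums to rho_i
print(resource_demands(x), cost(0, x), cost(1, x))
```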
2.2. Nash equilibrium. Our analysis focuses primarily on Nash equilibria (NE), i.e., strategy profiles that discourage unilateral deviations. Formally, x* ∈ X is a Nash equilibrium if

    u_i(x*_i; x*_{−i}) ≥ u_i(x_i; x*_{−i}) for all x_i ∈ X_i, i ∈ N.    (NE)

Obviously, if x* is a Nash equilibrium, we have the first-order condition

    u′_i(x*; z_i) = ⟨v_i(x*), z_i⟩ ≤ 0 for all z_i ∈ TC_i(x*_i), i ∈ N,    (2.7)

where TC_i(x*_i) denotes the tangent cone to X_i at x*_i. Therefore, if x* is a Nash equilibrium, each player's individual gradient v_i(x*) belongs to the polar cone PC_i(x*_i) to X_i at x*_i (cf. Fig. 1); moreover, the converse also holds if the game is pseudo-concave. We encode this more concisely as follows:

Proposition 2.1. If x* ∈ X is a Nash equilibrium, then v(x*) ∈ PC(x*), i.e.,

    ⟨v(x*), x − x*⟩ ≤ 0 for all x ∈ X.    (2.8)

The converse also holds if the game is (pseudo-)concave in the sense of (2.2).
Remark. In the above (and in what follows), v = (v_i)_{i∈N} denotes the ensemble of the players' individual payoff gradients and ⟨v, z⟩ ≡ ∑_{i∈N} ⟨v_i, z_i⟩ stands for the pairing between v and the vector z = (z_i)_{i∈N} ∈ ∏_{i∈N} V_i. For concision, we also write V ≡ ∏_i V_i for the ambient space of X ≡ ∏_i X_i and V* for its dual.

Proof of Proposition 2.1. If x* is a Nash equilibrium, (2.8) is obtained by setting z_i = x_i − x*_i in (2.7) and summing over all i ∈ N. Conversely, if (2.8) holds and the game is (pseudo-)concave, pick some x_i ∈ X_i and let x = (x_i; x*_{−i}) in (2.8). This gives ⟨v_i(x*), x_i − x*_i⟩ ≤ 0 for all x_i ∈ X_i, so (NE) follows from the basic properties of (pseudo-)concave functions. ∎

Proposition 2.1 shows that the Nash equilibria of concave games are precisely the solutions of the variational inequality (2.8), so existence follows from standard results.
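The inequality (2.8) is easy to probe numerically. In the sketch below, we reuse the linear Cournot model of Example 2.2 with (illustrative) capacities chosen so small that the equilibrium sits at the boundary point x* = (C_1, ..., C_N); the variational characterization then holds even though v(x*) ≠ 0:

```python
import numpy as np

a = 10.0
b = np.array([1.0, 1.2, 0.8])
c = np.array([2.0, 1.5, 2.5])
C = np.array([1.0, 1.0, 1.0])     # tight capacities (illustrative)

def v(x):
    return a - np.dot(b, x) - b * x - c

x_star = C.copy()                 # candidate equilibrium at full capacity
assert np.all(v(x_star) > 0)      # every firm would still like to expand,
                                  # so producing at capacity is optimal

rng = np.random.default_rng(2)
for _ in range(5):
    x = rng.uniform(0.0, C)       # random feasible profile in prod [0, C_i]
    print(np.dot(v(x_star), x - x_star) <= 0)   # (2.8) holds: prints True
```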
Using a similar variational characterization, Rosen (1965) proved the following sufficient condition for equilibrium uniqueness:

Theorem 2.2 (Rosen, 1965). Assume that G satisfies the payoff monotonicity condition

    ⟨v(x′) − v(x), x′ − x⟩ ≤ 0 for all x, x′ ∈ X,    (MC)

with equality if and only if x = x′. Then, G admits a unique Nash equilibrium.

Games satisfying (MC) are called (strictly) monotone and they enjoy properties similar to those of (strictly) convex functions. In particular, letting x′_{−i} = x_{−i}, (MC) gives

    ⟨v_i(x′_i; x_{−i}) − v_i(x_i; x_{−i}), x′_i − x_i⟩ ≤ 0 for all x_i, x′_i ∈ X_i, x_{−i} ∈ X_{−i},    (2.9)

implying in turn that u_i(x) is (strictly) concave in x_i for all i. Therefore, any game satisfying (MC) is also concave.
2.3. Variational stability. Combining Proposition 2.1 and (MC), it follows that the (necessarily unique) Nash equilibrium of a monotone game satisfies the inequality

    ⟨v(x), x − x*⟩ ≤ ⟨v(x*), x − x*⟩ ≤ 0 for all x ∈ X.    (2.10)

In other words, if x* is a Nash equilibrium of a monotone game, the players' individual payoff gradients "point towards" x* in the sense that v(x) forms an acute angle with x* − x. Motivated by this, we introduce below the following relaxation of the monotonicity condition (MC):

Definition 2.3. We say that x* ∈ X is variationally stable (or simply stable) if there exists a neighborhood U of x* such that

    ⟨v(x), x − x*⟩ ≤ 0 for all x ∈ U,

with equality if and only if x = x*. In particular, if U can be taken to be all of X, we say that x* is globally variationally stable (or globally stable for short).

Remark. The terminology "variational stability" alludes to the seminal notion of evolutionary stability introduced by Maynard Smith and Price (1973) for population games (i.e., games with a continuum of players and a common, finite set of actions A). Specifically, if v(x) = (v_α(x))_{α∈A} denotes the payoff field of such a game (with x ∈ Δ(A) denoting the state of the population), Definition 2.6 boils down to the variational characterization of evolutionarily stable states due to Hofbauer et al. (1979). As we show in the next sections, variational stability plays the same role for learning in games with continuous action spaces as evolutionary stability plays for evolution in games with a continuum of players.

[Footnote: Rosen (1965) originally referred to (MC) as diagonal strict concavity; Hofbauer and Sandholm (2009) use the term "stable" for population games that satisfy a formal analogue of (MC), while Sandholm (2015) and Sorin and Wan (2016) call such games "contractive" and "dissipative" respectively. In all cases, the adverb "strictly" refers to the "only if" requirement in (MC).]

Table 1. Monotonicity, stability, and Nash equilibrium: the existence of a concave potential implies monotonicity; monotonicity implies the existence of a globally stable point; and globally stable points are equilibria.

                                 First-order requirement              Second-order test
    Nash equilibrium (NE)        ⟨v(x*), x − x*⟩ ≤ 0                  N/A
    Variational stability (VS)   ⟨v(x), x − x*⟩ ≤ 0                   H_G(x*) ≺ 0
    Monotonicity (MC)            ⟨v(x′) − v(x), x′ − x⟩ ≤ 0           H_G(x) ≺ 0
    Concave potential (PF)       v(x) = ∇f(x)                         ∇²f(x) ≺ 0

By (2.10), a first example of variational stability is provided by the class of monotone games:

Corollary 2.4. If G satisfies (MC), its (unique) Nash equilibrium is globally stable.
The converse to Corollary 2.4 does not hold, even partially. For instance, consider the single-player game with payoffs given by the function

    u(x) = 1 − ∑_{ℓ=1}^d √x_ℓ,  x ∈ [0, 1]^d.    (2.11)

In this simple example, the origin is the unique maximizer (and hence the unique Nash equilibrium) of u. Moreover, we trivially have ⟨v(x), x − 0⟩ = −∑_{ℓ=1}^d x_ℓ/(2√x_ℓ) = −(1/2) ∑_{ℓ=1}^d √x_ℓ ≤ 0 with equality if and only if x = 0, so the origin satisfies the global version of (VS); however, u is not even pseudo-concave if d ≥ 2, so the game cannot be monotone. In words, (MC) is a sufficient condition for the existence of a (globally) stable state, but not a necessary one.
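A quick numerical check of the global version of (VS) for the example (2.11), with d = 3 as an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                  # any d >= 2 exhibits non-concavity

def v(x):
    # v(x) = grad u(x) for u(x) = 1 - sum(sqrt(x_l)): v_l = -1/(2 sqrt(x_l)).
    return -0.5 / np.sqrt(x)

for _ in range(5):
    x = rng.uniform(0.01, 1.0, d)      # keep away from 0 to avoid 1/0
    print(np.dot(v(x), x - 0.0))       # = -(1/2) sum(sqrt(x_l)) < 0
```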
Nonetheless, even in this (non-monotone) example, variational stability characterizes the game's unique Nash equilibrium. We make this link precise below:

Proposition 2.5. Suppose that x* ∈ X is variationally stable. Then:
a) If G is (pseudo-)concave, x* is an isolated Nash equilibrium of G.
b) If x* is globally stable, it is the game's unique Nash equilibrium.

Proposition 2.5 indicates that variationally stable states are isolated (for the proof, see that of Proposition 2.7 below). However, this also means that Nash equilibria of games that admit a concave – but not strictly concave – potential may fail to be stable. To account for such cases, we will also consider the following setwise version of variational stability:
Definition 2.6.
Let X* ⊆ X be closed and nonempty. We say that X* is variationally stable (or simply stable) if there exists a neighborhood U of X* such that

    ⟨v(x), x − x*⟩ ≤ 0 for all x ∈ U, x* ∈ X*,    (VS)

with equality for a given x* ∈ X* if and only if x ∈ X*. In particular, if U can be taken to be all of X, we say that X* is globally variationally stable (or globally stable for short).

Obviously, Definition 2.6 subsumes Definition 2.3: if x* ∈ X is stable in the pointwise sense of Definition 2.3, then it is also stable when viewed as a singleton set. In fact, when this is the case, it is also easy to see that x* cannot belong to some larger variationally stable set, so the notion of variational stability tacitly incorporates a certain degree of maximality. This is made clearer in the following:

[Footnote: In that case, (VS) would give ⟨v(x′), x′ − x*⟩ = 0 for some x′ ≠ x*, a contradiction.]
Suppose that X* ⊆ X is variationally stable. Then:
a) X* is convex.
b) If G is concave, X* is an isolated component of Nash equilibria.
c) If X* is globally stable, it coincides with the game's set of Nash equilibria.

Proof of Proposition 2.7. To show that X* is convex, take x*_0, x*_1 ∈ X* and set x*_λ = (1 − λ)x*_0 + λx*_1 for λ ∈ [0, 1]. Substituting in (VS), we get ⟨v(x*_λ), x*_λ − x*_0⟩ = λ⟨v(x*_λ), x*_1 − x*_0⟩ ≤ 0 and ⟨v(x*_λ), x*_λ − x*_1⟩ = −(1 − λ)⟨v(x*_λ), x*_1 − x*_0⟩ ≤ 0, implying that ⟨v(x*_λ), x*_1 − x*_0⟩ = 0. Writing x*_1 − x*_0 = λ^{−1}(x*_λ − x*_0), we then get ⟨v(x*_λ), x*_λ − x*_0⟩ = 0. By (VS), we must have x*_λ ∈ X* for all λ ∈ [0, 1], implying in turn that X* is convex.

We now proceed to show that X* only consists of Nash equilibria. To that end, assume first that X* is globally stable, pick some x* ∈ X*, and let z_i = x_i − x*_i for some x_i ∈ X_i, i ∈ N. Then, for all τ ∈ (0, 1], we have

    d/dτ u_i(x*_i + τz_i; x*_{−i}) = ⟨v_i(x*_i + τz_i; x*_{−i}), z_i⟩ = (1/τ) ⟨v_i(x*_i + τz_i; x*_{−i}), x*_i + τz_i − x*_i⟩ ≤ 0,    (2.12)

where the last inequality follows from (VS). In turn, this shows that u_i(x*_i; x*_{−i}) ≥ u_i(x*_i + z_i; x*_{−i}) = u_i(x_i; x*_{−i}) for all x_i ∈ X_i, i ∈ N, i.e., x* is a Nash equilibrium. Our claim for locally stable sets then follows by taking τ → 0 above and applying Proposition 2.1.

We are left to show that there are no other Nash equilibria close to X* (locally or globally). To do so, assume first that X* is locally stable and let x′ ∉ X* be a Nash equilibrium lying in a neighborhood U of X* where (VS) holds. By Proposition 2.1, we would have ⟨v(x′), x − x′⟩ ≤ 0 for all x ∈ X. However, since x′ ∉ X*, (VS) implies that ⟨v(x′), x* − x′⟩ > 0 for all x* ∈ X*, a contradiction. We conclude that there are no other equilibria of G in U, i.e., X* is an isolated set of Nash equilibria; the global version of our claim then follows by taking U = X. ∎
2.4. Tests for variational stability. We close this section with a second-derivative criterion that can be used to verify whether (VS) holds. To state it, define the Hessian of a game G as the block matrix H_G(x) = (H_G^{ij}(x))_{i,j ∈ N} with

    H_G^{ij}(x) = ½ ∇_{x_j} ∇_{x_i} u_i(x) + ½ (∇_{x_i} ∇_{x_j} u_j(x))^⊤.    (2.13)

We then have:

Proposition 2.8. If x* is a Nash equilibrium of G and H_G(x*) ≺ 0 on TC(x*), then x* is stable – and hence an isolated Nash equilibrium. In particular, if H_G(x) ≺ 0 on TC(x) for all x ∈ X, x* is globally stable – so it is the unique equilibrium of G.

Remark. The requirement "H_G(x*) ≺ 0 on TC(x*)" above means that z^⊤ H_G(x*) z < 0 for every nonzero tangent vector z ∈ TC(x*).
Proof. Assume first that H_G(x) ≺ 0 on TC(x) for all x ∈ X. By Theorem 6 in Rosen (1965), G satisfies (MC), so our claim follows from Corollary 2.4. For our second claim, if H_G(x*) ≺ 0 on TC(x*) for some Nash equilibrium x* of G, we also have H_G(x) ≺ 0 for all x in a neighborhood U = ∏_{i∈N} U_i of x* in X. By the same theorem in Rosen (1965), we get that (MC) holds locally in U, so the above reasoning shows that x* is the unique equilibrium of the restricted game G|_U ≡ G(N, U, u|_U). Hence, x* is locally stable and isolated in G. ∎

We provide two straightforward applications of Proposition 2.8 below:
Example 2.4. Following Monderer and Shapley (1996), a game G is called a potential game if it admits a potential function f : X → R such that

    u_i(x_i; x_{−i}) − u_i(x′_i; x_{−i}) = f(x_i; x_{−i}) − f(x′_i; x_{−i}) for all x, x′ ∈ X, i ∈ N.    (PF)

Local maximizers of f are Nash equilibria, and the converse also holds if f is concave (Neyman, 1997). By differentiating (PF), it is easy to see that the Hessian of G is just the Hessian of its potential. Hence, if a game admits a concave potential f, the game's Nash set X* = arg max_{x∈X} f(x) is globally stable.

Example 2.5. Consider again the Cournot oligopoly model of Example 2.2. A simple differentiation yields

    H_G^{ij}(x) = ½ ∂²u_i/(∂x_i ∂x_j) + ½ ∂²u_j/(∂x_j ∂x_i) = −b_i δ_ij − ½(b_i + b_j),    (2.14)

where δ_ij = 𝟙{i = j} is the Kronecker delta. This shows that a Cournot oligopoly admits a unique, globally stable equilibrium whenever the RHS of (2.14) is negative-definite. This is always the case if the model is symmetric (b_i = b for all i ∈ N), but not necessarily otherwise. Quantitatively, if the coefficients b_i are independent and identically distributed (i.i.d.) on [0, 1], a Monte Carlo simulation shows that (2.14) is negative-definite with probability between … and … for N ∈ {…, ..., …}.

[Footnote: This is so because, in the symmetric case, the RHS of (2.14) is a circulant matrix with eigenvalues −b and −(N + 1)b.]
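A Monte Carlo experiment along the lines just described is straightforward to set up; the sample size and the range of N below are our own choices:

```python
import numpy as np

# How often is the Cournot Hessian (2.14), H_ij = -b_i*delta_ij - (b_i+b_j)/2,
# negative-definite when b_i ~ U[0, 1] i.i.d.?  (Illustrative sketch.)
rng = np.random.default_rng(0)

def neg_definite_frequency(N, trials=10_000):
    hits = 0
    for _ in range(trials):
        b = rng.uniform(0.0, 1.0, N)
        H = -np.diag(b) - 0.5 * (b[:, None] + b[None, :])  # symmetric
        if np.all(np.linalg.eigvalsh(H) < 0):
            hits += 1
    return hits / trials

for N in range(2, 6):
    print(N, neg_definite_frequency(N))
```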
3. Learning via dual averaging
In this section, we adapt the widely used dual averaging (DA) method of Nesterov (2009) to our game-theoretic setting. Intuitively, the main idea is as follows: At each stage of the process, every player i ∈ N gets an estimate v̂_i of the individual gradient of their payoff function at the current action profile, possibly subject to noise and uncertainty. Subsequently, they take a step along this estimate in the dual space V*_i (where gradients live), and they "mirror" the output back to the primal space X_i in order to choose an action for the next stage and continue playing (for a schematic illustration, see Fig. 2).

[Footnote: In optimization, the roots of the method can be traced back to Nemirovski and Yudin (1983); see also Beck and Teboulle (2003), Nemirovski et al. (2009) and Shalev-Shwartz (2011).]
[Figure 2. Schematic representation of dual averaging.]
Formally, starting with some arbitrary (and possibly uninformed) gradient estimate Y_1 = v̂_1 at n = 1, this scheme can be described via the recursion

    X_{i,n} = Q_i(Y_{i,n}),
    Y_{i,n+1} = Y_{i,n} + γ_n v̂_{i,n+1},    (DA)

where:
1) n denotes the stage of the process.
2) v̂_{i,n+1} ∈ V*_i is an estimate of the individual payoff gradient v_i(X_n) of player i at stage n (more on this below).
3) Y_{i,n} ∈ V*_i is an auxiliary "score" variable that aggregates the i-th player's individual gradient steps.
4) γ_n > 0 is a nonincreasing step-size sequence, typically of the form 1/n^β for some β ∈ (0, 1].
5) Q_i : V*_i → X_i is the choice map that outputs the i-th player's action as a function of their score vector Y_i (see below for a rigorous definition).

In view of the above, the core components of (DA) are a) the players' gradient estimates; and b) the choice maps that determine the players' actions. In the rest of this section, we discuss both in detail.
3.1. Feedback and uncertainty. Regarding the players' individual gradient observations, we assume that each player i ∈ N has access to a "black box" feedback mechanism – an oracle – which returns an estimate of their payoff gradients at their current action profile. Of course, this information may be imperfect for a multitude of reasons: for instance, i) estimates may be susceptible to random measurement errors; ii) the transmission of this information could be subject to noise; and/or iii) the game's payoff functions may be stochastic expectations of the form

    u_i(x) = E[û_i(x; ω)] for some random variable ω,    (3.1)

and players may only be able to observe the realized gradients ∇_{x_i} û_i(x; ω).

With all this in mind, we will focus on the noisy feedback model

    v̂_{i,n+1} = v_i(X_n) + ξ_{i,n+1},    (3.2)

where the noise process ξ_n = (ξ_{i,n})_{i∈N} is an L²-bounded martingale difference sequence adapted to the history (F_n)_{n=1}^∞ of X_n (i.e., ξ_n is F_n-measurable but ξ_{n+1} isn't). More explicitly, this means that ξ_n satisfies the statistical hypotheses:

1. Zero-mean:
    E[ξ_{n+1} | F_n] = 0 for all n = 1, 2, ... (a.s.).    (H1)
2. Finite mean squared error: there exists some σ ≥ 0 such that
    E[‖ξ_{n+1}‖²_* | F_n] ≤ σ² for all n = 1, 2, ... (a.s.).    (H2)

Alternatively, (H1) and (H2) simply posit that the players' individual gradient estimates are conditionally unbiased and bounded in mean square, viz.

    E[v̂_{n+1} | F_n] = v(X_n),    (3.3a)
    E[‖v̂_{n+1}‖²_* | F_n] ≤ V²_* for some finite V_* > 0.    (3.3b)

The above allows for a broad range of error processes, including all compactly supported, (sub-)Gaussian, (sub-)exponential and log-normal distributions. In fact, both hypotheses can be relaxed (for instance, by assuming a small bias or asking for finite moments up to some order q < ∞), but we do not do so to keep things simple.

[Footnote: Indices have been chosen so that all relevant processes are F_n-measurable at stage n.]
[Footnote: In particular, we will not be assuming i.i.d. errors; this point is crucial for applications to distributed control where measurements are typically correlated with the state of the system.]
3.2. Choosing actions. Given that the players' score variables aggregate gradient steps, a reasonable choice for Q_i would be the arg max correspondence y_i ↦ arg max_{x_i ∈ X_i} ⟨y_i, x_i⟩ that outputs those actions which are most closely aligned with y_i. Notwithstanding, there are two problems with this approach: a) this assignment is too aggressive in the presence of uncertainty; and b) generically, the output would be an extreme point of X, so (DA) could never converge to an interior point. Thus, instead of taking a "hard" arg max approach, we will focus on regularized maps of the form

    y_i ↦ arg max_{x_i ∈ X_i} {⟨y_i, x_i⟩ − h_i(x_i)},    (3.4)

where the "regularization" term h_i : X_i → R satisfies the following requirements:

Definition 3.1. Let C be a compact convex subset of a finite-dimensional normed space V. We say that h : C → R is a regularizer (or penalty function) on C if:
(1) h is continuous.
(2) h is strongly convex, i.e., there exists some K > 0 such that

    h(tx + (1 − t)x′) ≤ th(x) + (1 − t)h(x′) − ½ K t(1 − t) ‖x′ − x‖²    (3.5)

for all x, x′ ∈ C and all t ∈ [0, 1].

The choice (or mirror) map Q : V* → C induced by h is then defined as

    Q(y) = arg max{⟨y, x⟩ − h(x) : x ∈ C}.    (3.6)

In what follows, we will be assuming that each player i ∈ N is endowed with an individual penalty function h_i : X_i → R that is K_i-strongly convex. Furthermore, to emphasize the interplay between primal and dual variables (the players' actions x_i and their score vectors y_i respectively), we will write Y_i ≡ V*_i for the dual space of V_i and Q_i : Y_i → X_i for the choice map induced by h_i.
Algorithm 1.
Dual averaging with Euclidean projections (Example 3.1).
Require: step-size sequence γ_n ∝ 1/n^β, β ∈ (0, 1]; initial scores Y_i ∈ Y_i
for n = 1, 2, ... do
    for every player i ∈ N do
        play X_i ← Π_{X_i}(Y_i);        {choose an action}
        observe v̂_i;                    {estimate gradient}
        update Y_i ← Y_i + γ_n v̂_i;     {take gradient step}
    end for
end for

More concisely, this information can be encoded in the aggregate penalty function h(x) = ∑_i h_i(x_i) with associated strong convexity constant K ≡ min_i K_i. The induced choice map is simply Q ≡ (Q_1, ..., Q_N), so we will write x = Q(y) for the action profile induced by the score vector y = (y_1, ..., y_N) ∈ Y ≡ ∏_i Y_i.

Remark. In finite games, McKelvey and Palfrey (1995) referred to Q_i as a "quantal response function" (the notation Q alludes precisely to this terminology). In the same game-theoretic context, the composite map Q_i ∘ v_i is often called a smooth, perturbed, or regularized best response; for a detailed discussion, see Hofbauer and Sandholm (2002) and Mertikopoulos and Sandholm (2016).

We discuss below a few examples of this regularization process:

Example 3.1. Let h(x) = ½‖x‖²₂. Then, h is 1-strongly convex with respect to ‖·‖₂ and the corresponding choice map is the closest point projection

    Π_X(y) ≡ arg max_{x∈X} {⟨y, x⟩ − ½‖x‖²₂} = arg min_{x∈X} ‖y − x‖₂.    (3.7)

The induced learning scheme (cf. Algorithm 1) may thus be viewed as a multi-agent variant of gradient ascent with lazy projections (Zinkevich, 2003). For future reference, note that h is differentiable on X and Π_X is surjective (i.e., im Π_X = X).

Example 3.2. Motivated by mixed-strategy learning in finite games (Example 2.1), let Δ = {x ∈ R^d_+ : ∑_{j=1}^d x_j = 1} denote the unit simplex of R^d. Then, a standard regularizer on Δ is provided by the (negative) Gibbs entropy

    h(x) = ∑_{ℓ=1}^d x_ℓ log x_ℓ.    (3.8)

The entropic regularizer (3.8) is 1-strongly convex with respect to the L¹-norm on R^d. Moreover, a straightforward calculation shows that the induced choice map is

    Λ(y) = (exp(y_1), ..., exp(y_d)) / ∑_{ℓ=1}^d exp(y_ℓ).    (3.9)

This model is known as logit choice and the associated learning scheme has been studied extensively in evolutionary game theory and online learning; for a detailed account, see Vovk (1990), Littlestone and Warmuth (1994), Laraki and Mertikopoulos (2013), and references therein. In contrast to the previous example, h is differentiable only on the relative interior Δ° of Δ and im Λ = Δ° (i.e., Λ is "essentially" surjective).

[Footnote: We assume here that V ≡ ∏_i V_i is endowed with the product norm ‖x‖²_V = ∑_i ‖x_i‖²_{V_i}.]
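The following sketch assembles the pieces so far: the Euclidean choice map of Example 3.1 (on a box, the closest-point projection is coordinate-wise clipping), the logit map (3.9) of Example 3.2, and the recursion (DA) with perfect feedback, run on the illustrative Cournot instance used earlier:

```python
import numpy as np

a, b = 10.0, np.array([1.0, 1.2, 0.8])
c, C = np.array([2.0, 1.5, 2.5]), 4.0

def v(x):                          # individual payoff gradients (Eq. 2.1)
    return a - np.dot(b, x) - b * x - c

def proj(y):                       # Euclidean choice map on X_i = [0, C]
    return np.clip(y, 0.0, C)

def logit(y):                      # entropic choice map (3.9) on a simplex
    z = np.exp(y - y.max())        # max-shift for numerical stability
    return z / z.sum()

y = np.zeros(3)                    # initial scores Y_1
for n in range(1, 2001):
    x = proj(y)                    # X_n = Q(Y_n)
    y = y + v(x) / np.sqrt(n)      # Y_{n+1} = Y_n + gamma_n * v(X_n)
x = proj(y)
print(x)                           # approximate Nash equilibrium
print(logit(np.array([1.0, 0.0, -1.0])))   # logit choice, for comparison
```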
3.3. Surjectivity vs. steepness. We close this section with an important link between the boundary behavior of penalty functions and the surjectivity of the induced choice maps. To describe it, it will be convenient to treat h as an extended-real-valued function h : V → R ∪ {∞} by setting h = ∞ outside X. The subdifferential of h at x ∈ V is then defined as

    ∂h(x) = {y ∈ V* : h(x′) ≥ h(x) + ⟨y, x′ − x⟩ for all x′ ∈ V},    (3.10)

and h is called subdifferentiable at x ∈ X whenever ∂h(x) is nonempty. This is always the case if x ∈ X°, so X° ⊆ dom ∂h ≡ {x ∈ X : ∂h(x) ≠ ∅} ⊆ X (Rockafellar, 1970, Chap. 26).

Intuitively, h fails to be subdifferentiable at a boundary point x ∈ bd(X) only if it becomes "infinitely steep" near x. We thus say that h is steep at x whenever x ∉ dom ∂h; otherwise, h is said to be nonsteep at x. The following proposition shows that regularizers that are everywhere nonsteep (as in Example 3.1) induce choice maps that are surjective; on the other hand, regularizers that are everywhere steep (cf. Example 3.2) induce choice maps that are interior-valued:
Proposition 3.2. Let h be a K-strongly convex regularizer with induced choice map Q : Y → X, and let h* : Y → R be the convex conjugate of h, i.e.,

    h*(y) = max{⟨y, x⟩ − h(x) : x ∈ X},  y ∈ Y.    (3.11)

Then:
a) x = Q(y) if and only if y ∈ ∂h(x); in particular, im Q = dom ∂h.
b) h* is differentiable on Y and ∇h*(y) = Q(y) for all y ∈ Y.
c) Q is (1/K)-Lipschitz continuous.

Proposition 3.2 is essentially folklore in optimization and convex analysis; for a proof, see Rockafellar (1970, Theorem 23.5) and Rockafellar and Wets (1998, Theorem 12.60(b)).
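Part b) of Proposition 3.2 is easy to verify numerically for the entropic regularizer of Example 3.2, whose convex conjugate is h*(y) = log ∑_ℓ exp(y_ℓ): a central-difference gradient of h* should match the logit map (3.9). A minimal sketch:

```python
import numpy as np

def h_star(y):
    # Convex conjugate of the entropic regularizer on the simplex.
    return np.log(np.sum(np.exp(y)))

def logit(y):
    z = np.exp(y - y.max())
    return z / z.sum()

y = np.array([0.2, -1.0, 0.5])
eps = 1e-6
numeric_grad = np.array([
    (h_star(y + eps * e) - h_star(y - eps * e)) / (2 * eps)
    for e in np.eye(len(y))
])
print(numeric_grad)   # central-difference gradient of h*
print(logit(y))       # Q(y); agrees with the line above (Prop. 3.2b)
```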
4. Convergence analysis
A key property of (DA) in concave games is that it leads to no regret, viz.

    max_{x_i ∈ X_i} ∑_{k=1}^n [u_i(x_i; X_{−i,k}) − u_i(X_k)] = o(n) for all i ∈ N,    (4.1)

provided that the algorithm's step-size is chosen appropriately – for a precise statement, see Xiao (2010) and Shalev-Shwartz (2011). As such, under (DA), every player's average payoff matches asymptotically that of the best fixed action in hindsight (though, of course, this does not take into account changes to other players' actions due to a change in a given player's chosen action).

In this section, we expand on this worst-case guarantee and we derive some general convergence results for the actual sequence of play induced by (DA). Specifically, in Section 4.1 we show that if (DA) converges to some action profile, this limit is a Nash equilibrium. Subsequently, to obtain stronger convergence results, we introduce in Section 4.2 the so-called Fenchel coupling, a "primal-dual" divergence measure between the players' (primal) action variables x_i ∈ X_i and their (dual) score vectors y_i ∈ Y_i. Using this coupling as a Lyapunov function, we show in Sections 4.3 and 4.4 that globally (resp. locally) stable states are globally (resp. locally) attracting under (DA). Finally, in Section 4.5, we examine the convergence properties of (DA) in zero-sum concave-convex games.
4.1. Limit states. We first show that if the sequence of play induced by (DA) converges to some x* ∈ X with positive probability, this limit is a Nash equilibrium:

Theorem 4.1.
Suppose that (DA) is run with imperfect gradient information satisfying (H1)–(H2) and a step-size sequence γ_n such that

    ∑_{n=1}^∞ (γ_n/τ_n)² < ∞ and ∑_{n=1}^∞ γ_n = ∞,    (4.2)

where τ_n = ∑_{k=1}^n γ_k. If the game is (pseudo-)concave and X_n converges to x* ∈ X with positive probability, then x* is a Nash equilibrium.

Remark. Note here that the requirement (4.2) holds for every step-size policy of the form γ_n ∝ 1/n^β, β ≤ 1 (i.e., even for increasing γ_n).

Proof of Theorem 4.1.
Let v* = v(x*) and assume ad absurdum that x* is not a Nash equilibrium. By the characterization (2.7) of Nash equilibria, there exists a player i ∈ N and a deviation q_i ∈ X_i such that ⟨v*_i, q_i − x*_i⟩ > 0. Thus, by continuity, there exist some c > 0 and neighborhoods U, V of x* and v* respectively, such that

    ⟨v′_i, q_i − x′_i⟩ ≥ c    (4.3)

whenever x′ ∈ U and v′ ∈ V.

Now, let Ω be the event that X_n converges to x*, so P(Ω) > 0 by assumption. Within Ω, we may assume for simplicity that X_n ∈ U and v(X_n) ∈ V for all n, so (DA) yields

    Y_{n+1} = Y_1 + ∑_{k=1}^n γ_k v̂_{k+1} = Y_1 + ∑_{k=1}^n γ_k [v(X_k) + ξ_{k+1}] = Y_1 + τ_n v̄_{n+1},    (4.4)

where we set v̄_{n+1} = τ_n^{−1} ∑_{k=1}^n γ_k v̂_{k+1} = τ_n^{−1} ∑_{k=1}^n γ_k [v(X_k) + ξ_{k+1}].

We now claim that P(v̄_n → v* | Ω) = 1. Indeed, by (4.2) and (H2), we have

    ∑_{n=1}^∞ τ_n^{−2} E[‖γ_n ξ_{n+1}‖²_* | F_n] ≤ ∑_{n=1}^∞ (γ_n/τ_n)² σ² < ∞.    (4.5)

Therefore, by the law of large numbers for martingale difference sequences (Hall and Heyde, 1980, Theorem 2.18), we obtain τ_n^{−1} ∑_{k=1}^n γ_k ξ_{k+1} → 0 (a.s.). Given that v(X_n) → v* in Ω and P(Ω) > 0, we infer that P(v̄_n → v* | Ω) = 1, as claimed.

Now, with Y_{i,n} ∈ ∂h_i(X_{i,n}) by Proposition 3.2, we also have

    h_i(q_i) − h_i(X_{i,n}) ≥ ⟨Y_{i,n}, q_i − X_{i,n}⟩ = ⟨Y_{i,1}, q_i − X_{i,n}⟩ + τ_{n−1} ⟨v̄_{i,n}, q_i − X_{i,n}⟩.    (4.6)

Since v̄_n → v* almost surely on Ω, (4.3) yields ⟨v̄_{i,n}, q_i − X_{i,n}⟩ ≥ c > 0 for all sufficiently large n. However, given that |⟨Y_{i,1}, q_i − X_{i,n}⟩| ≤ ‖Y_{i,1}‖_* ‖q_i − X_{i,n}‖ ≤ ‖Y_{i,1}‖_* ‖X‖ = O(1), a simple substitution in (4.6) yields h_i(q_i) − h_i(X_{i,n}) ≳ cτ_{n−1} → ∞ with positive probability – a contradiction, since h_i is continuous and hence bounded on the compact set X_i. We conclude that x* is a Nash equilibrium of G, as claimed. ∎
4.2. The Fenchel coupling. A key tool in establishing the convergence properties of (DA) is the so-called Bregman divergence D(p, x) between a given base point p ∈ X and a test state x ∈ X. Following Kiwiel (1997), D(p, x) is defined as the difference between h(p) and the best linear approximation of h(p) from x, viz.

    D(p, x) = h(p) − h(x) − h′(x; p − x),    (4.7)

where h′(x; z) = lim_{t→0⁺} t^{−1} [h(x + tz) − h(x)] denotes the one-sided derivative of h at x along z ∈ TC(x). Owing to the (strict) convexity of h, we have D(p, x) ≥ 0 and X_n → p whenever D(p, X_n) → 0 (Kiwiel, 1997). Accordingly, the convergence of a sequence X_n to a target point p can be checked directly by means of the associated divergence D(p, X_n).

Nevertheless, it is often impossible to glean any useful information on D(p, X_n) from (DA) when X_n = Q(Y_n) is not interior. Instead, given that (DA) mixes primal and dual variables (actions and scores respectively), it will be more convenient to use the following "primal-dual" divergence between dual vectors y ∈ Y and base points p ∈ X:

Definition 4.2.
Let h : X → R be a penalty function on X. Then, the Fenchel coupling induced by h is defined as

    F(p, y) = h(p) + h*(y) − ⟨y, p⟩ for all p ∈ X, y ∈ Y.    (4.8)

The terminology "Fenchel coupling" is due to Mertikopoulos and Sandholm (2016) and refers to the fact that (4.8) collects all terms of Fenchel's inequality. As a result, F(p, y) is nonnegative and strictly convex in both arguments (though not jointly so). Moreover, it enjoys the following key properties:

Proposition 4.3.
Let h be a K-strongly convex penalty function on X. Then, for all p ∈ X and all y, y′ ∈ Y, we have:

a) F(p, y) = D(p, Q(y)) if Q(y) ∈ X° (but not necessarily otherwise).    (4.9a)
b) F(p, y) ≥ ½ K ‖Q(y) − p‖².    (4.9b)
c) F(p, y′) ≤ F(p, y) + ⟨y′ − y, Q(y) − p⟩ + (1/(2K)) ‖y′ − y‖²_*.    (4.9c)

Proposition 4.3 (proven in Appendix A) justifies the terminology "primal-dual divergence" and plays a key role in our analysis. Specifically, given a sequence Y_n in Y, (4.9b) yields Q(Y_n) → p whenever F(p, Y_n) → 0, meaning that F(p, Y_n) can be used to test the convergence of Q(Y_n) to p.

For technical reasons, it is convenient to also assume the converse, namely that

    F(p, Y_n) → 0 whenever Q(Y_n) → p.    (H3)

When h is steep, we have F(p, y) = D(p, Q(y)) for all y ∈ Y, so (H3) boils down to the requirement

    D(p, X_n) → 0 whenever X_n → p.    (4.10)

This so-called "reciprocity condition" is well known in the theory of Bregman functions (Chen and Teboulle, 1993; Kiwiel, 1997; Alvarez et al., 2004): essentially, it means that the sublevel sets of D(p, ·) are neighborhoods of p in X. Hypothesis (H3) posits that the images of the sublevel sets of F(p, ·) under Q are neighborhoods of p in X, so it may be seen as a "primal-dual" variant of Bregman reciprocity. Under this light, it is easy to check that Examples 3.1 and 3.2 both satisfy (H3). Obviously, when (H3) holds, Proposition 4.3 gives:
Corollary 4.4.
Under (H3), F(p, Y_n) → 0 if and only if Q(Y_n) → p.

To extend the above to subsets of X, we further define the setwise coupling

    F(C, y) = inf{F(p, y) : p ∈ C},  C ⊆ X, y ∈ Y.    (4.11)

In analogy to the pointwise case, we then have:

Proposition 4.5.
Let C be a closed subset of X. Then, Q(Y_n) → C whenever F(C, Y_n) → 0; in addition, if (H3) holds, the converse is also true.

The proof of Proposition 4.5 is a straightforward exercise in point-set topology, so we omit it. What's more important is that, thanks to Proposition 4.5, the Fenchel coupling can also be used to test for convergence to a set; in what follows, we employ this property freely.
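For the entropic regularizer of Example 3.2, the Fenchel coupling (4.8) works out to the Kullback–Leibler divergence F(p, y) = D_KL(p ‖ Q(y)), and (4.9b) holds with K = 1 for the L¹-norm (Pinsker's inequality). The sketch below checks the bound on random instances:

```python
import numpy as np

def h(x):
    return np.sum(x * np.log(x))          # negative Gibbs entropy (3.8)

def h_star(y):
    return np.log(np.sum(np.exp(y)))      # its convex conjugate (3.11)

def Q(y):
    z = np.exp(y - y.max())
    return z / z.sum()

def F(p, y):
    return h(p) + h_star(y) - np.dot(y, p)    # Fenchel coupling (4.8)

rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    y = rng.normal(size=4)
    lhs = F(p, y)                                 # = KL(p || Q(y)) here
    rhs = 0.5 * np.linalg.norm(Q(y) - p, 1)**2    # (K/2)||Q(y) - p||^2, K = 1
    print(lhs >= rhs, round(lhs, 4), round(rhs, 4))
```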
4.3. Global convergence.
In this section, we focus on globally stable Nash equilibria (and sets thereof). We begin with the perfect feedback case:
Theorem 4.6.
Suppose that (DA) is run with perfect feedback (σ = 0), choice maps satisfying (H3), and a step-size γ_n such that ∑_{k=1}^n γ_k² / ∑_{k=1}^n γ_k → 0. If the set X* of the game's Nash equilibria is globally stable, X_n converges to X*.

Proof. Let X* be the game's set of Nash equilibria, fix some arbitrary ε > 0, and let U_ε = {x = Q(y) : F(X*, y) < ε}. Then, by Proposition 4.5, it suffices to show that X_n ∈ U_ε for all sufficiently large n.

To that end, for all x* ∈ X*, Proposition 4.3 yields

    F(x*, Y_{n+1}) ≤ F(x*, Y_n) + γ_n ⟨v(X_n), X_n − x*⟩ + (γ_n²/(2K)) ‖v(X_n)‖²_*.    (4.12)

To proceed, assume inductively that X_n ∈ U_ε. By (H3), there exists some δ > 0 such that cl(U_{ε/2}) contains a δ-neighborhood of X*. Consequently, with X* globally stable, there exists some c ≡ c(ε) > 0 such that

    ⟨v(x), x − x*⟩ ≤ −c for all x ∈ U_ε − U_{ε/2}, x* ∈ X*.    (4.13)

If X_n ∈ U_ε − U_{ε/2} and γ_n ≤ cK/V²_*, (4.12) yields F(x*, Y_{n+1}) ≤ F(x*, Y_n). Hence, minimizing over x* ∈ X*, we get F(X*, Y_{n+1}) ≤ F(X*, Y_n) < ε, so X_{n+1} = Q(Y_{n+1}) ∈ U_ε. Otherwise, if X_n ∈ U_{ε/2} and γ_n ≤ √(εK)/V_*, combining (VS) with (4.12) yields F(x*, Y_{n+1}) ≤ F(x*, Y_n) + ε/2 so, again, F(X*, Y_{n+1}) ≤ F(X*, Y_n) + ε/2 ≤ ε, i.e., X_{n+1} ∈ U_ε. We thus conclude that X_{n+1} ∈ U_ε whenever X_n ∈ U_ε and γ_n ≤ min{cK/V²_*, √(εK)/V_*}.

To complete the proof, Lemma A.3 shows that X_n visits U_ε infinitely often under the stated assumptions. Since γ_n → 0, our assertion follows. ∎

We next show that Theorem 4.6 extends to the case of imperfect feedback under the additional regularity requirement:

    The gradient field v(x) is Lipschitz continuous.    (H4)

With this extra assumption, we have:

[Footnote: Indeed, if this were not the case, there would exist a sequence Y′_n in Y such that Q(Y′_n) → X* but F(X*, Y′_n) ≥ ε/2, in contradiction to (H3).]
[Footnote: Since σ = 0, we can take here V_* = max_{x∈X} ‖v(x)‖_*.]

Table 2. Overview of the various regularity hypotheses used in the paper.

            Hypothesis                  Precise statement
    (H1)    Zero-mean errors            E[ξ_{n+1} | F_n] = 0
    (H2)    Finite error variance       E[‖ξ_{n+1}‖²_* | F_n] ≤ σ²
    (H3)    Bregman reciprocity         F(p, y_n) → 0 whenever Q(y_n) → p
    (H4)    Lipschitz gradients         v(x) is Lipschitz continuous
Theorem 4.7.
Suppose that (DA) is run with a step-size sequence γ_n such that ∑_{n=1}^∞ γ_n² < ∞ and ∑_{n=1}^∞ γ_n = ∞. If (H1)–(H4) hold and the set X* of the game's Nash equilibria is globally stable, X_n converges to X* (a.s.).

Corollary 4.8. If G satisfies (MC), X_n converges to the (necessarily unique) Nash equilibrium of G (a.s.).

Corollary 4.9. If G admits a concave potential, X_n converges to the set of Nash equilibria of G (a.s.).

Because of the noise affecting the players' gradient estimates, our proof strategy for Theorem 4.7 is quite different from that of Theorem 4.6. In particular, instead of working directly in discrete time, we start with the continuous-time system

    ẏ = v(x),
    x = Q(y),    (DA-c)

which can be seen as a "mean-field" approximation of the recursive scheme (DA). As we show in Appendix A, the orbits x(t) = Q(y(t)) of (DA-c) converge to X* in a certain, "uniform" way. Moreover, under the assumptions of Theorem 4.7, the sequence Y_n generated by the discrete-time, stochastic process (DA) comprises an asymptotic pseudotrajectory (APT) of the dynamics (DA-c), i.e., Y_n asymptotically tracks the flow of (DA-c) with arbitrary accuracy over windows of arbitrary length (Benaïm, 1999). APTs have the key property that, in the presence of a global attractor, they cannot stray too far from the flow of (DA-c); however, given that Q may fail to be invertible, the trajectories x(t) = Q(y(t)) do not constitute a semiflow, so it is not possible to leverage the general stochastic approximation theory of Benaïm (1999). To overcome this difficulty, we exploit the derived convergence bound for x(t) = Q(y(t)), and we then use an inductive shadowing argument to show that (DA) itself converges to X*.
Fix some ε > , let U ε = { x = Q ( y ) : F ( X ∗ , y ) < ε } , andwrite Φ t : Y → Y for the semiflow induced by (DA-c) on Y – i.e. (Φ t ( y )) t ≥ is thesolution orbit of (DA-c) that starts at y ∈ Y . We first claim there exists some finite τ ≡ τ ( ε ) such that F ( X ∗ , Φ τ ( y )) ≤ max { ε, F ( X ∗ , y ) − ε } for all y ∈ Y . Indeed, since cl( U ε ) is a closed neighborhoodof X ∗ by (H3), (VS) implies that there exists some c ≡ c ( ε ) > such that h v ( x ) , x − x ∗ i ≤ − c for all x ∗ ∈ X ∗ , x / ∈ U ε . (4.14) For a precise definition, see (4.16) below. That such a trajectory exists and is unique is a consequence of (H4).
EARNING IN GAMES WITH CONTINUOUS ACTION SETS 19
Consequently, if τ_y = inf{t > 0 : Q(Φ_t(y)) ∈ U_ε} denotes the first time at which an orbit of (DA-c) reaches U_ε, Lemma A.2 in Appendix A gives:

    F(x*, Φ_t(y)) ≤ F(x*, y) − ct for all x* ∈ X*, t ≤ τ_y.    (4.15)

In view of this, set τ = ε/c and consider the following two cases:
(1) τ_y ≥ τ: then, (4.15) gives F(x*, Φ_τ(y)) ≤ F(x*, y) − ε for all x* ∈ X*, so F(X*, Φ_τ(y)) ≤ F(X*, y) − ε.
(2) τ_y < τ: then, Q(Φ_τ(y)) ∈ U_ε, so F(X*, Φ_τ(y)) ≤ ε.
In both cases we have F(X*, Φ_τ(y)) ≤ max{ε, F(X*, y) − ε}, as claimed.

Now, let (Y(t))_{t≥0} denote the affine interpolation of the sequence Y_n generated by (DA), i.e., Y is the continuous curve which joins the values Y_n at all times τ_n = ∑_{k=1}^n γ_k. Under the stated assumptions, a standard result of Benaïm (1999, Propositions 4.1 and 4.2) shows that Y(t) is an asymptotic pseudotrajectory of Φ, i.e.,

    lim_{t→∞} sup_{0≤h≤T} ‖Y(t + h) − Φ_h(Y(t))‖_* = 0 for all T > 0 (a.s.).    (4.16)

Thus, with some hindsight, let δ ≡ δ(ε) be such that δ‖X‖ + δ²/(2K) ≤ ε and choose t₀ ≡ t₀(ε) so that sup_{0≤h≤τ} ‖Y(t + h) − Φ_h(Y(t))‖_* ≤ δ for all t ≥ t₀. Then, for all t ≥ t₀ and all x* ∈ X*, Proposition 4.3 gives

    F(x*, Y(t + h)) ≤ F(x*, Φ_h(Y(t))) + ⟨Y(t + h) − Φ_h(Y(t)), Q(Φ_h(Y(t))) − x*⟩ + (1/(2K)) ‖Y(t + h) − Φ_h(Y(t))‖²_*
                   ≤ F(x*, Φ_h(Y(t))) + δ‖X‖ + δ²/(2K)
                   ≤ F(x*, Φ_h(Y(t))) + ε.    (4.17)

Hence, minimizing over x* ∈ X*, we get

    F(X*, Y(t + h)) ≤ F(X*, Φ_h(Y(t))) + ε for all t ≥ t₀.    (4.18)

By Lemma A.3, there exists some t₁ ≥ t₀ such that F(X*, Y(t₁)) ≤ 2ε (a.s.). Thus, given that F(X*, Φ_h(Y(t₁))) is nonincreasing in h by Lemma A.2, Eq. (4.18) yields F(X*, Y(t₁ + h)) ≤ 2ε + ε = 3ε for all h ∈ [0, τ]. However, by the definition of τ, we also have F(X*, Φ_τ(Y(t₁))) ≤ max{ε, F(X*, Y(t₁)) − ε} ≤ ε, implying in turn that F(X*, Y(t₁ + τ)) ≤ F(X*, Φ_τ(Y(t₁))) + ε ≤ 2ε. Therefore, by repeating the above argument at t₁ + τ and proceeding inductively, we get F(X*, Y(t₁ + h)) ≤ 3ε for all h ∈ [kτ, (k + 1)τ], k = 1, 2, ... (a.s.). Since ε has been chosen arbitrarily, we conclude that F(X*, Y_n) → 0, so X_n → X* by Proposition 4.5. ∎
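To see Theorem 4.7 at work numerically, the sketch below runs (DA) with zero-mean Gaussian gradient noise on the Cournot instance used earlier – which one can check is strictly monotone, since its Hessian (2.14) is negative-definite – with γ_n = 1/n, so that ∑ γ_n = ∞ and ∑ γ_n² < ∞. All numeric choices are illustrative:

```python
import numpy as np

a, b = 10.0, np.array([1.0, 1.2, 0.8])
c, C = np.array([2.0, 1.5, 2.5]), 8.0
rng = np.random.default_rng(3)

def v(x):
    return a - np.dot(b, x) - b * x - c

# Interior equilibrium: v(x*) = 0, i.e. (diag(b) + 1 b^T) x* = a - c.
x_star = np.linalg.solve(np.diag(b) + np.outer(np.ones(3), b), a - c)

y = np.zeros(3)
for n in range(1, 100_001):
    x = np.clip(y, 0.0, C)                    # X_n = Q(Y_n), lazy projection
    v_hat = v(x) + rng.normal(0.0, 1.0, 3)    # noisy oracle: (H1)-(H2) hold
    y += v_hat / n                            # gamma_n = 1/n
print(np.linalg.norm(x - x_star))             # distance to equilibrium: small
```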
Remark. In the above, the Lipschitz continuity assumption (H4) is used to show that the sequence X_n comprises an APT of the continuous-time dynamics (DA-c). Since any continuous function on a compact set is uniformly continuous, the proof of Proposition 4.1 in Benaïm (1999, p. 14) shows that (H4) can be dropped altogether if (DA-c) is well-posed (which, in turn, holds if v(x) is only locally Lipschitz). Albeit less general, Lipschitz continuity is more straightforward as an assumption, so we do not go into the details of this relaxation.
We should also note that several classic convergence results for dual averagingand mirror descent do not require Lipschitz continuity at all (see e.g. Nesterov,2009, and Nemirovski et al., 2009). The reason for this is that these results focuson the convergence of the averaged sequence ¯ X n = P nk =1 γ k X k (cid:14) P nk =1 γ k , whereasthe figure of merit here is the actual sequence of play X n . The latter sequenceis more sensitive to noise, hence the need for additional regularity; in our ergodicanalysis later in the paper, (H4) is not invoked. Remark . Theorem 4.7 shows that (DA) converges to equilibrium, but thesummability requirement P ∞ n =1 γ n < ∞ suggests that players must be more conser-vative under uncertainty. To make this more precise, note that the step-size assump-tions of Theorem 4.6 are satisfied for all step-size policies of the form γ n ∝ /n β , β ∈ (0 , ; however, in the presence of errors and uncertainty, Theorem 4.7 guaran-tees convergence only when β ∈ (1 / , .The “critical” value β = 1 / is tied to the finite mean squared error hypothesis(H2). If the players’ gradient observations have finite moments up to some order q > , a more refined stochastic approximation argument can be used to show thatTheorem 4.7 still holds under the lighter requirement P ∞ n =1 γ q/ n < ∞ . Thus,even in the presence of noise, it is possible to employ (DA) with any step-sizesequence of the form γ n ∝ /n β , β ∈ (0 , , provided that the noise process ξ n has E [ k ξ n +1 k ∗ q | F n ] < ∞ for some q > /β − . In particular, if the noise affectingthe players’ observations has finite moments of all orders (for instance, if ξ n is sub-exponential or sub-Gaussian), it is possible to recover essentially all the step-sizepolicies covered by Theorem 4.6.4.4. Local convergence.
The results of the previous section show that (DA) con-verges globally to states (or sets) that are globally stable, even under noise anduncertainty. In this section, we show that (DA) remains locally convergent tostates that are only locally stable with probability arbitrarily close to .For simplicity, we begin with the deterministic, perfect feedback case: Theorem 4.10.
Suppose that (DA) is run with perfect feedback ( σ = 0 ) , choice mapssatisfying (H3) , and a sufficiently small step-size with P nk =1 γ k (cid:14) P nk =1 γ k → . If X ∗ is a stable set of Nash equilibria, there exists a neighborhood U of X ∗ such that X n converges to X ∗ whenever X ∈ U .Proof. As in the proof of Theorem 4.6, let U ε = { x = Q ( y ) : F ( X ∗ , y ) < ε } . Since X ∗ is stable, there exists some ε > and some c > satisfying (4.13) and suchthat (VS) holds throughout U ε . If X ∈ U ε and γ ≤ min { cK/V ∗ , √ εK/V ∗ } ,the same induction argument as in the proof of Theorem 4.6 shows that X n ∈ U ε for all n . Since (VS) holds throughout U ε , Lemma A.3 shows that X n visits anyneighborhood of X ∗ infinitely many times. Thus, by the same argument as in theproof of Theorem 4.6, we get X n → X ∗ . (cid:4) The key idea in the proof of Theorem 4.10 is that if the step-size of (DA) issmall enough, X n = Q ( Y n ) always remains within the “basin of attraction” of X ∗ ;hence, local convergence can be obtained in the same way as global convergence fora game with smaller action spaces. However, if the players’ feedback is subject toestimation errors and uncertainty, a single unlucky instance could drive X n awayfrom said basin, possibly never to return. Consequently, any local convergenceresult in the presence of noise is necessarily probabilistic in nature. EARNING IN GAMES WITH CONTINUOUS ACTION SETS 21
Conditioning on the event that X n stays close to X ∗ , local convergence can beobtained as in the proof of Theorem 4.7. Nevertheless, showing that this eventoccurs with controllably high probability requires a completely different analysis.This is the essence of our next result: Theorem 4.11.
Fix a confidence level δ > and suppose that (DA) is run with asufficiently small step-size γ n satisfying P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . If X ∗ is stable and (H1)–(H4) hold, then X ∗ is locally attracting with probability at least − δ ; more precisely, there exists a neighborhood U of X ∗ such that P ( X n → X ∗ | X ∈ U ) ≥ − δ. (4.19) Corollary 4.12.
Let x ∗ be a Nash equilibrium with negative-definite Hessian ma-trix H G ( x ∗ ) ≺ . Then, with assumptions as above, x ∗ is locally attracting withprobability arbitrarily close to .Proof of Theorem 4.11. Let U ε = { x = Q ( y ) : F ( X ∗ , y ) < ε } and pick ε > smallenough so that (VS) holds for all x ∈ U ε . Assume further that X ∈ U ε so thereexists some x ∗ ∈ X ∗ such that F ( x ∗ , Y ) < ε . Then, for all n , Proposition 4.3 yields F ( x ∗ , Y n +1 ) ≤ F ( x ∗ , Y n ) + γ n h v ( X n ) , X n − x ∗ i + γ n ψ n +1 + γ n K k ˆ v n +1 k ∗ , (4.20)where we have set ψ n +1 = h ξ n +1 , X n − x ∗ i .We first claim that sup n P nk =1 γ k ψ k +1 ≤ ε with probability at least − δ/ if γ n is chosen appropriately. Indeed, set S n +1 = P nk =1 γ k ψ k +1 and let E n,ε denote theevent { sup ≤ k ≤ n +1 | S k | ≥ ε } . Since S n is a martingale, Doob’s maximal inequality(Hall and Heyde, 1980, Theorem 2.1) yields P ( E n +1 ,ε ) ≤ E [ | S n +1 | ] ε ≤ σ kX k P nk =1 γ k ε , (4.21)where we used the variance estimate E [ ψ k +1 ] = E [ E [ |h ξ k +1 , X k − x ∗ i| | F k ]] ≤ E [ E [ k ξ k +1 k ∗ k X k − x ∗ k | F k ]] ≤ σ kX k , (4.22)and the fact that E [ ψ k +1 ψ ℓ +1 ] = E [ E [ ψ k +1 ψ ℓ +1 ] | F k ∨ ℓ ] = 0 whenever k = ℓ . Since E n +1 ,ε ⊇ E n,ε ⊇ . . . , the event E ε = S ∞ n =1 E n,ε occurs with probability P ( E ε ) ≤ Γ σ kX k /ε , where Γ ≡ P ∞ n =1 γ n . Thus, if Γ ≤ δε / (2 σ kX k ) , we get P ( E ε ) ≤ δ/ .We now claim that the process R n +1 = (2 K ) − P nk =1 γ k k ˆ v k +1 k ∗ is also boundedfrom above by ε with probability at least − δ/ if γ n is chosen appropriately.Indeed, working as above, let F n,ε denote the event { sup ≤ k ≤ n +1 R k ≥ ε } . Since R n is a nonnegative submartingale, Doob’s maximal inequality again yields P ( F n +1 ,ε ) ≤ E [ R n +1 ] ε ≤ V ∗ P nk =1 γ k Kε . (4.23)Consequently, the event F ε = S ∞ n =1 F n,ε occurs with probability P ( F ε ) ≤ Γ V ∗ /ε ≤ δ/ if γ n is chosen so that Γ ≤ Kδε/V ∗ .Assume therefore that Γ ≤ min { δε / (2 σ kX k ) , Kδε/V ∗ } . The above showsthat P ( ¯ E ε ∩ ¯ F ε ) = 1 − P ( E ε ∪ F ε ) ≥ − δ/ − δ/ − δ , i.e. S n and R n are bothbounded from above by ε for all n and all x ∗ with probability at least − δ . Since F ( x ∗ , Y ) ≤ ε by assumption, we readily get F ( x ∗ , Y ) ≤ ε if ¯ E ε and ¯ F ε both hold.Furthermore, telescoping (4.20) yields F ( x ∗ , Y n +1 ) ≤ F ( x ∗ , Y ) + n X k =1 h v ( X k ) , X k − x ∗ i + S n +1 + R n +1 for all n, (4.24)so if we assume inductively that F ( x ∗ , Y k ) ≤ ε for all k ≤ n (implying that h v ( X k ) , X k − x ∗ i ≤ for all k ≤ n ), we also get F ( x ∗ , Y n +1 ) ≤ ε if neither E ε nor F ε occur. Since P ( E ε ∪ F ε ) ≤ δ , we conclude that X n stays in U ε for all n withprobability at least − δ . In turn, when this is the case, Lemma A.3 shows that X ∗ is recurrent under X n . Hence, by repeating the same steps as in the proof ofTheorem 4.7, we get X n → X ∗ with probability at least − δ , as claimed. (cid:4) Convergence in zero-sum concave games.
We close this section by examiningthe asymptotic behavior of (DA) in -player, concave-convex zero-sum games. Todo so, let N = { A, B } denote the set of players with corresponding payoff functions u A = − u B respectively concave in x A and x B . Letting u ≡ u A = − u B , the value of the game is defined as u ∗ = max x A ∈X A min x B ∈X B u ( x A , x B ) = min x B ∈X B max x A ∈X A u ( x A , x B ) . (4.25)The solutions of the concave-convex saddle-point problem (4.25) are the Nash equi-libria of G and the players’ equilibrium payoffs are u ∗ and − u ∗ respectively.In the “perfect feedback” case ( σ = 0 ), Nesterov (2009) showed that the ergodicaverage ¯ X n = P nk =1 γ k X k P nk =1 γ k (4.26)of the sequence of play generated by (DA) converges to equilibrium. With imperfectfeedback and steep h , Nemirovski et al. (2009) further showed that ¯ X n convergesin expectation to the game’s set of Nash equilibria, provided that (H1) and (H2)hold. Our next result provides an almost sure version of this result which is alsovalid for nonsteep h : Theorem 4.13.
Let G be a concave -player zero-sum game. If (DA) is run withimperfect feedback satisfying (H1)–(H2) and a step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ , the ergodic average ¯ X n of X n converges to the set of Nashequilibria of G ( a.s. ) .Proof of Theorem 4.13. Consider the gap function ǫ ( x ) = u ∗ − min p B ∈X B u ( x A , p B ) + max p A ∈X A u ( p A , x B ) − u ∗ = max p ∈X X i ∈N u i ( p i ; x − i ) . (4.27)Obviously, ǫ ( x ) ≥ with equality if and only if x is a Nash equilibrium, so it sufficesto show that ǫ ( ¯ X n ) → (a.s.).To do so, pick some p ∈ X . Then, as in the proof of Theorem 4.7, we have F ( p, Y n +1 ) ≤ F ( p, Y n ) + γ n h v ( X n ) , X n − p i + γ n ψ n +1 + 12 K γ n k ˆ v n +1 k ∗ . (4.28) When h is steep, the mirror descent algorithm examined by Nemirovski et al. (2009) is aspecial case of the dual averaging method of Nesterov (2009). This is no longer the case if h is notsteep, so the analysis of Nemirovski et al. (2009) does not apply to (DA). In the online learningliterature, this difference is sometimes referred to as “greedy” vs. “lazy” mirror descent. EARNING IN GAMES WITH CONTINUOUS ACTION SETS 23
Hence, after rearranging and telescoping, we get n X k =1 γ k h v ( X k ) , p − X k i ≤ F ( p, Y ) + n X k =1 γ k ψ k +1 + 12 K n X k =1 γ k k ˆ v k +1 k ∗ , (4.29)where ψ n +1 = h ξ n +1 , X n − p i and we used the fact that F ( p, Y n ) ≥ . By concavity,we also have h v ( x ) , p − x i = X i ∈N h v i ( x ) , p i − x i i ≥ X i ∈N [ u i ( p i ; x − i ) − u i ( x )] = X i ∈N u i ( p i ; x − i ) , (4.30)for all x ∈ X . Therefore, letting τ n = P nk =1 γ k , we get τ n n X k =1 γ k h v ( X k ) , p − X k i ≥ τ n n X k =1 γ k X i ∈N u i ( p i ; X − i,k ) ≥ u ( p A , ¯ X B,n ) − u ( ¯ X A,n , p B )= X i ∈N u i ( p i ; ¯ X − i,n ) , (4.31)where we used the fact that u is concave-convex in the second line. Thus, combining(4.29) and (4.31), we finally obtain X i ∈N u i ( p i ; ¯ X − i,n ) ≤ F ( p, Y ) + P nk =1 γ k ψ k +1 + (2 K ) − P nk =1 γ k k ˆ v k +1 k ∗ τ n . (4.32)As before, the law of large numbers (Hall and Heyde, 1980, Theorem 2.18) yields τ − n P nk =1 γ k ψ k +1 → (a.s.). Furthermore, given that E [ k ˆ v n +1 k ∗ | F n ] ≤ V ∗ and P nk =1 γ k < ∞ , we also get τ − n P nk =1 γ k k ˆ v k +1 k ∗ → by Doob’s martingale con-vergence theorem (Hall and Heyde, 1980, Theorem 2.5), implying in turn that P i ∈N u i ( p i ; ¯ X − i,n ) → (a.s.). Since p is arbitrary, we conclude that ǫ ( ¯ X n ) → (a.s.), as claimed. (cid:4)
5. Learning in finite games
As a concrete application of the analysis of the previous section, we turn to theasymptotic behavior of (DA) in finite games. Briefly recalling the setup of Ex-ample 2.1, each player in a finite game Γ ≡ Γ( N , ( A i ) i ∈N , ( u i ) i ∈N ) chooses a purestrategy α i from a finite set A i and receives a payoff of u i ( α , . . . , α N ) . Pure strate-gies are drawn based on the players’ mixed strategies x i ∈ X i ≡ ∆( A i ) , so eachplayer’s expected payoff is given by the multilinear expression (2.3). Accordingly,the individual payoff gradient of player i ∈ N in the mixed profile x = ( x , . . . , x N ) is the (mixed) payoff vector v i ( x ) = ∇ x i u i ( x i ; x − i ) = ( u i ( α i ; x − i )) α i ∈A i of Eq. (2.4).Consider now the following learning scheme: At stage n , every player i ∈ N selects a pure strategy α i,n ∈ A i according to their individual mixed strategy X i,n ∈ X i . Subsequently, each player observes – or otherwise calculates – thepayoffs of their pure strategies α i ∈ A i against the chosen actions α − i,n of all otherplayers (possibly subject to some random estimation error). Specifically, we positthat each player receives as feedback the “noisy” payoff vector ˆ v i,n +1 = ( u i ( α i ; α − i,n )) α i ∈A i + ξ i,n +1 , (5.1) Algorithm 2.
Logit-based learning in finite games (Example 3.2).
Require: step-size sequence γ n ∝ /n β , β ∈ (0 , ; initial scores Y i ∈ R A i for n = 1 , , . . . do for every player i ∈ N do set X i ← Λ i ( Y i ) ; {mixed strategy} play α i ∼ X i ; {choose action} observe ˆ v i ; {estimate payoffs} update Y i ← Y i + γ n ˆ v i ; {update scores} end for end for where the error process ξ n = ( ξ i,n ) i ∈N is assumed to satisfy (H1) and (H2). Then,based on this feedback, players update their mixed strategies and the process re-peats (for a concrete example, see Algorithm 2).In the rest of this section, we study the long-term behavior of this adaptivelearning process. Specifically, we focus on: a ) the elimination of dominated strate-gies; b ) convergence to strict Nash equilibria; and c ) convergence to equilibrium in -player, zero-sum games.5.1. Dominated strategies.
We say that a pure strategy α i ∈ A i of a finite game Γ is dominated by β i ∈ A i (and we write α i ≺ β i ) if u i ( α i ; x − i ) < u i ( β i ; x − i ) for all x − i ∈ X − i ≡ Q j = i X j . (5.2)Put differently, α i ≺ β i if and only if v iα i ( x ) < v iβ i ( x ) for all x ∈ X . In turn,this implies that the payoff gradient of player i points consistently towards the face x iα i = 0 of X i , so it is natural to expect that α i is eliminated under (DA). Indeed,we have: Theorem 5.1.
Suppose that (DA) is run with noisy payoff observations of the form (5.1) and a step-size sequence γ n satisfying (4.2) . If α i ∈ A i is dominated, then X iα i ,n → ( a.s. ) .Proof. Suppose that α i ≺ β i for some β i ∈ A i . Then, suppressing the player index i for simplicity, (DA) gives Y β,n +1 − Y α,n +1 = c βα + n X k =1 γ k [ˆ v β,k +1 − ˆ v α,k +1 ]= c βα + n X k =1 γ k [ v β ( X k ) − v α ( X k )] + n X k =1 γ k ζ k +1 , (5.3)where we set c βα = Y β, − Y α, and ζ k +1 = E [ˆ v β,k +1 − ˆ v α,k +1 | F k ] − [ v β ( X k ) − v α ( X k )] . (5.4)Since α ≺ β , there exists some c > such that v β ( x ) − v α ( x ) ≥ c for all x ∈ X .Then, (5.3) yields Y β,n +1 − Y α,n +1 ≥ c βα + τ n (cid:20) c + P nk =1 γ k ζ k +1 τ n (cid:21) , (5.5) EARNING IN GAMES WITH CONTINUOUS ACTION SETS 25 where τ n = P nk =1 γ k . As in the proof of Theorem 4.1, the law of large numbers formartingale difference sequences (Hall and Heyde, 1980, Theorem 2.18) implies that τ − n P nk =1 γ k ζ k +1 → under the step-size assumption (4.2), so Y β,n − Y α,n → ∞ (a.s.).Suppose now that lim sup n →∞ X α,n = 2 ε for some ε > . By descending toa subsequence if necessary, we may assume that X α,n ≥ ε for all n , so if we let X ′ n = X n + ε ( e β − e α ) , the definition of Q gives h ( X ′ n ) ≥ h ( X n ) + h Y n , X ′ n − X n i = h ( X n ) + ε ( Y β,n − Y α,n ) → ∞ , (5.6)a contradiction. This implies that X α,n → (a.s.), as asserted. (cid:4) Strict equilibria.
A Nash equilibrium x ∗ of a finite game is called strict when(NE) holds as a strict inequality for all x i = x ∗ i , i.e. when no player can deviateunilaterally from x ∗ without reducing their payoff (or, equivalently, when everyplayer has a unique best response to x ∗ ). This implies that strict Nash equilibriaare pure strategy profiles x ∗ = ( α ∗ , . . . , α ∗ N ) such that u i ( α ∗ i ; α ∗− i ) > u i ( α i ; α ∗− i ) for all α i ∈ A i \ { α ∗ i } , i ∈ N . (5.7)Strict Nash equilibria can be characterized further as follows: Proposition 5.2.
Then, the following are equivalent:a ) x ∗ is a strict Nash equilibrium.b ) h v ( x ∗ ) , z i ≤ for all z ∈ TC( x ∗ ) with equality if and only if z = 0 .c ) x ∗ is stable. Thanks to the above characterization of strict equilibria (proven in Appendix A),the convergence analysis of Section 4 yields:
Proposition 5.3.
Let x ∗ be a strict equilibrium of a finite game Γ . Suppose furtherthat (DA) is run with noisy payoff observations of the form (5.1) and a sufficientlysmall step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . If (H1)–(H3) hold, x ∗ is locally attracting with arbitrarily high probability; specifically, for all δ > ,there exists a neighborhood U of x ∗ such that P ( X n → x ∗ | X ∈ U ) ≥ − δ. (5.8) Proof.
We first show that E [ˆ v n +1 | F n ] = v ( X n ) . Indeed, for all i ∈ N , α i ∈ A i , wehave E [ˆ v iα i ,n +1 | F n ] = X α − i ∈A − i u i ( α i ; α − i ) X α − i ,n + E [ ξ iα i ,n +1 | F n ] = u i ( α i ; X − i,n ) , (5.9)where, in a slight abuse of notation, we set X α − i ,n for the joint probability assignedto the pure strategy profile α − i of all players other than i at stage n .By (2.4), it follows that E [ˆ v n +1 | F n ] = v ( X n ) so the estimator (5.1) is unbiasedin the sense of (H1). Hypothesis (H2) can be verified similarly, so the estimator(5.1) satisfies (3.3). Since x ∗ is stable by Proposition 5.2 and v ( x ) is multilinear(so (H4) is satisfied automatically), our assertion follows from Theorem 4.11. (cid:4) In the special case of logit-based learning (Example 3.2), Cohen et al. (2017)showed that Algorithm 2 converges locally to strict Nash equilibria under similarinformation assumptions. Proposition 5.2 essentially extends this result to the en-tire class of regularized learning processes induced by (DA) in finite games, showing that the logit choice map (3.9) has no special properties in this regard. Cohen et al.(2017) further showed that the convergence rate of logit-based learning is exponen-tial in the algorithm’s “running horizon” τ n = P nk =1 γ k . This rate is closely linkedto the logit choice model, and different choice maps yield different convergencespeeds; we discuss this issue in more detail in Section 6.5.3. Convergence in zero-sum games.
We close this section with a brief discussionof the ergodic convergence properties of (DA) in finite two-player zero-sum games.In this case, the analysis of Section 4.5 readily yields:
Corollary 5.4.
Let Γ be a finite -player zero-sum game. If (DA) is run with noisypayoff observations of the form (5.1) and a step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ , the ergodic average ¯ X n = P nk =1 γ k X k (cid:14) P nk =1 γ k of the players’mixed strategies converges to the set of Nash equilibria of Γ ( a.s. ) .Proof. As in the proof of Proposition 5.3, the estimator (5.1) satisfies E [ˆ v n +1 | F n ] = v ( X n ) , so (H1) and (H2) also hold in the sense of (3.3). Our claim then followsfrom Theorem 4.13. (cid:4) Remark . In a very recent paper, Bravo and Mertikopoulos (2017) showed thatthe time average ¯ X ( t ) = t − R t X ( s ) ds of the players’ mixed strategies under (DA-c)with Brownian payoff shocks converges to Nash equilibrium in -player, zero-sumgames. Corollary 5.4 may be seen as a discrete-time version of this result.
6. Speed of convergence
Ergodic convergence rate.
In this section, we focus on the rate of convergenceof (DA) to stable equilibrium states (and/or sets thereof). To that end, we willmeasure the speed of convergence to a globally stable set X ∗ ⊆ X via the equilibriumgap function ǫ ( x ) = inf x ∗ ∈X ∗ h v ( x ) , x ∗ − x i . (6.1)By Definition 2.6, ǫ ( x ) ≥ with equality if and only if x ∈ X ∗ , so ǫ ( x ) can be seenas a (game-dependent) measure of the distance between x and the target set X ∗ .This can be seen more clearly in the case of strongly stable equilibria, defined hereas follows: Definition 6.1.
We say that x ∗ ∈ X is strongly stable if there exists some L > such that h v ( x ) , x − x ∗ i ≤ − L k x − x ∗ k for all x ∈ X . (6.2)More generally, a closed subset X ∗ of X is called strongly stable if h v ( x ) , x − x ∗ i ≤ − L dist( X ∗ , x ) for all x ∈ X , x ∗ ∈ X ∗ . (6.3)Obviously, ǫ ( x ) ≥ L dist( X ∗ , x ) if X ∗ is L -strongly stable, i.e. ǫ ( x ) grows atleast quadratically near strongly stable sets – just like strongly convex functionsgrow quadratically around their minimum points. With this in mind, we providebelow an explicit estimate for the decay rate of the average equilibrium gap ¯ ǫ n = P nk =1 γ k ǫ ( X k ) (cid:14) P nk =1 γ k in the spirit of Nemirovski et al. (2009): EARNING IN GAMES WITH CONTINUOUS ACTION SETS 27
Theorem 6.2.
Suppose that (DA) is run with imperfect gradient information satis-fying (H1)–(H2) . Then E [¯ ǫ n ] ≤ F + V ∗ / (2 K ) P nk =1 γ k P nk =1 γ k , (6.4) where F = F ( X ∗ , Y ) . If, in addition, P ∞ n =1 γ n < ∞ , we have ¯ ǫ n ≤ A P nk =1 γ k for all n ( a.s. ) , (6.5) where A > is a finite random variable such that, with probability at least − δ , A ≤ F + σ kX k κ + κ V ∗ , (6.6) where κ = 2 δ − P ∞ n =1 γ n . Corollary 6.3.
Suppose that (DA) is initialized at Y = 0 and is run for n iterationswith constant step-size γ = V − ∗ p K Ω /n where Ω = max h − min h . Then, E [¯ ǫ n ] ≤ V ∗ p Ω / ( Kn ) . (6.7) In addition, if X ∗ is L -strongly stable, the long-run average distance to equilibrium ¯ r n = P nk =1 dist( X ∗ , X n ) (cid:14) P nk =1 γ k satisfies E [¯ r n ] ≤ p L − V ∗ Ω / ( Kn ) . (6.8) Proof of Theorem 6.2.
Let x ∗ ∈ X ∗ . Rearranging (4.20) and telescoping yields n X k =1 γ k h v ( X k ) , x ∗ − X k i ≤ F ( x ∗ , Y ) + n X k =1 γ k ψ k +1 + 12 K n X k =1 γ k k ˆ v k +1 k ∗ , (6.9)where ψ k +1 = h ξ k +1 , X k − x ∗ i . Thus, taking expectations on both sides, we obtain n X k =1 γ k E [ h v ( X k ) , x ∗ − X k i ] ≤ F ( x ∗ , Y ) + V ∗ K n X k =1 γ k . (6.10)Subsequently, minimizing both sides of (6.10) over x ∗ ∈ X ∗ yields n X k =1 γ k E [ ǫ ( X k )] ≤ F + V ∗ K n X k =1 γ k , (6.11)where we used Jensen’s inequality to interchange the inf and E operations. Theestimate (6.4) then follows immediately.To establish the almost sure bound (6.5), set S n +1 = P nk =1 γ k ψ k +1 and R n +1 =(2 K ) − P nk =1 γ k k ˆ v k +1 k ∗ . Then, (6.9) becomes n X k =1 γ k h v ( X k ) , x ∗ − X k i ≤ F ( x ∗ , Y ) + S n + R n , (6.12)Arguing as in the proof of Theorem 4.11, it follows that sup n E [ | S n | ] and sup n E [ R n ] are both finite, i.e. S n and R n are both bounded in L . By Doob’s (sub)martingaleconvergence theorem (Hall and Heyde, 1980, Theorem 2.5), it also follows that S n and R n both converge to an (a.s.) finite limit S ∞ and R ∞ respectively. Conse-quently, by (6.12), there exists a finite (a.s.) random variable A > such that n X k =1 γ k h v ( X k ) , x ∗ − X k i ≤ A for all n (a.s.) . (6.13) The bound (6.5) follows by taking the minimum of (6.13) over x ∗ ∈ X ∗ and dividingboth sides by P nk =1 γ k . Finally, applying Doob’s maximal inequality to (4.21) and(4.23), we obtain P (cid:0) sup n S n ≥ σ kX k κ (cid:1) ≤ δ/ and P (cid:0) sup n R n ≥ V ∗ κ (cid:1) ≤ δ/ .Combining these bounds with (6.12) shows that A can be taken to satisfy (6.6)with probability at least − δ , as claimed. (cid:4) Proof of Corollary 6.3.
By the definition (4.11) of the setwise Fenchel coupling,we have F ≤ h ( x ∗ ) + h ∗ (0) ≤ max h − min h = Ω . Our claim then follows byinvoking Jensen’s inequality, noting that E [dist( X ∗ , X n )] ≤ E [dist( X ∗ , X n ) ] ≤ L − E [ ǫ ( X n )] , and applying (6.4). (cid:4) Although the mean bound (6.4) is valid for any step-size sequence, the summa-bility condition P ∞ n =1 γ n < ∞ for the almost sure bound (6.5) rules out moreaggressive step-size policies of the form γ n ∝ /n β for β ≤ / . Specifically, the“critical” value β = 1 / is again tied to the finite mean squared error hypothesis(H2): if the players’ gradient measurements have finite moments up to some order q > , a more refined application of Doob’s inequality reveals that (6.5) still holdsunder the lighter summability requirement P ∞ n =1 γ q/ n < ∞ . In this case, theexponent β = 1 / is optimal with respect to the guarantee (6.4) and leads to analmost sure convergence rate of the order of O ( n − / log n ) .Except for this log n factor, the O ( n − / ) convergence rate of (DA) is the exactlower complexity bound for black-box subgradient schemes for convex problems(Nemirovski and Yudin, 1983; Nesterov, 2004). Thus, running (DA) with a step-size policy of the form γ n ∝ n − / leads to a convergence speed that is optimalin the mean, and near-optimal with high probability. It is also worth noting that,when the horizon of play is known in advance (as in Corollary 6.3), the constant Ω = max h − min h that results from the initialization Y = 0 is essentially the sameas the constant that appears in the stochastic mirror descent analysis of Nemirovskiet al. (2009) and Nesterov (2009).6.2. Running length.
Intuitively, the main obstacle to achieving rapid convergenceis that, even with an optimized step-size policy, the sequence of play may end uposcillating around an equilibrium state because of the noise in the players’ obser-vations. To study such phenomena, we focus below on the running length of (DA),defined as ℓ n = n − X k =1 k X k +1 − X k k . (6.14)Obviously, if X n converges to some x ∗ ∈ X , a shorter length signifies less oscillationsof X n around x ∗ . Thus, in a certain way, ℓ n is a more refined convergence criterionthan the induced equilibrium gap ǫ ( X n ) .Our next result shows that the mean running length of (DA) until players reachan ε -neighborhood of a (strongly) stable set is at most O (1 /ε ) : Theorem 6.4.
Suppose that (DA) is run with imperfect feedback satisfying (H1)–(H2) and a step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . Also, given a closedsubset X ∗ of X , consider the stopping time n ε = inf { n ≥ X ∗ , X n ) ≤ ε } andlet ℓ ε ≡ ℓ n ε denote the running length of (DA) until X n reaches an ε -neighborhoodof X ∗ . If X ∗ is L -strongly stable, we have E [ ℓ ε ] ≤ V ∗ KL F + (2 K ) − V ∗ P ∞ k =1 γ k ε . (6.15) EARNING IN GAMES WITH CONTINUOUS ACTION SETS 29
Proof.
For all x ∗ ∈ X ∗ and all n ∈ N , (4.20) yields F ( x ∗ , Y n ε ∧ n +1 ) ≤ F ( x ∗ , Y ) − n ε ∧ n X k =1 γ k h v ( X k ) , X k − x ∗ i + n ε ∧ n X k =1 γ k ψ k +1 + 12 K n ε ∧ n X k =1 γ k k ˆ v k +1 k ∗ . (6.16)Hence, after taking expectations and minimizing over x ∗ ∈ X ∗ , we get ≤ F − Lε E " n ε ∧ n X k =1 γ k + E " n ε ∧ n X k =1 γ k ψ k +1 + V ∗ K ∞ X k =1 γ k , (6.17)where we we used the fact that k X k − x ∗ k ≥ ε for all k ≤ n ε .Consider now the stopped process S n ε ∧ n = P n ε ∧ nk =1 γ k ψ k +1 . Since n ε ∧ n ≤ n < ∞ , S n ε ∧ n is a martingale and E [ S n ε ∧ n ] = 0 . Thus, by rearranging (6.17), we obtain E " n ε ∧ n X k =1 γ k ≤ F + (2 K ) − V ∗ P ∞ k =1 γ k Lε . (6.18)Hence, with n ε ∧ n → n ε as n → ∞ , Lebesgue’s monotone convergence theoremshows that the process τ ε = P n ε k =1 γ k is finite in expectation and E [ τ ε ] ≤ F + (2 K ) − V ∗ P ∞ k =1 γ k Lε . (6.19)Furthermore, by Proposition 3.2 and the definition of ℓ n , we also have ℓ n = n − X k =1 k X k +1 − X k k ≤ K n − X k =1 k Y k − Y k − k ∗ = 1 K n − X k =1 γ k k ˆ v k +1 k ∗ . (6.20)Now, let ζ k +1 = k ˆ v k +1 k ∗ and Ψ n +1 = P nk =1 γ k [ ζ k +1 − E [ ζ k +1 | F k ]] . By construc-tion, Ψ n is a martingale and E [Ψ n +1 ] = E " n X k =1 γ k [ ζ k +1 − E [ ζ k +1 | F k ]] ≤ V ∗ ∞ X k =1 γ k < ∞ for all n. (6.21)Thus, by the optional stopping theorem (Shiryaev, 1995, p. 485), we get E [Ψ n ε ] = E [Ψ ] = 0 , so E " n ε X k =1 γ k ζ k +1 = E " n ε X k =1 γ k E [ ζ k +1 | F k ] ≤ V ∗ E " n ε X k =1 γ k = V ∗ E [ τ ε ] . (6.22)Our claim then follows by combining (6.20) and (6.22) with the bound (6.19). (cid:4) Theorem 6.4 should be contrasted to classic results on the Kurdyka–Łojasiewiczinequality where having a “bounded length” property is crucial in establishing tra-jectory convergence (Bolte et al., 2010). In our stochastic setting, it is not realisticto expect a bounded length (even on average), because, generically, the noise doesnot vanish in the neighborhood of a Nash equilibrium. Instead, Theorem 6.4should be interpreted as a measure of how the fluctuations due to noise and un-certainty affect the trajectories’ average length; the authors are not aware of anysimilar results along these lines. For a notable exception however, see Theorem 6.6 below.
Sharp equilibria and fast convergence.
Because of the random shocks inducedby the noise in the players’ gradient observations, it is difficult to obtain an almostsure (or high probability) estimate for the convergence rate of the last iterate X n of (DA). Specifically, even with a rapidly decreasing step-size policy, a single re-alization of the error process ξ n may lead to an arbitrarily big jump of X n at anytime, thus destroying any almost sure bound on the convergence rate of X n .On the other hand, in finite games, Cohen et al. (2017) recently showed thatlogit-based learning (cf. Algorithm 2) achieves a quasi-linear convergence rate withhigh probability if the equilibrium in question is strict. Specifically, Cohen et al.(2017) showed that if x ∗ is a strict Nash equilibrium and X n does not start too farfrom x ∗ , then, with high probability, k X n − x ∗ k = O ( − c P nk =1 γ k ) for some positiveconstant c > that depends only on the players’ relative payoff differences.Building on the variational characterization of strict Nash equilibria provided byProposition 5.2, we consider below the following analogue for continuous games: Definition 6.5.
We say that x ∗ ∈ X is a sharp equilibrium of G if h v ( x ∗ ) , z i ≤ for all z ∈ TC( x ∗ ) , (6.23)with equality if and only if z = 0 . Remark . The terminology “sharp” follows Polyak (1987, Chapter 5.2), whointroduced a similar notion for (unconstrained) convex programs. In particular,in the single-player case, it is easy to see that (6.23) implies that x ∗ is a sharpmaximum of u ( x ) , i.e. u ( x ∗ ) − u ( x ) ≥ c k x − x ∗ k for some c > .A first consequence of Definition 6.5 is that v ( x ∗ ) lies in the topological interior ofthe polar cone PC( x ∗ ) to X at x ∗ (for a schematic illustration, see Fig. 1); in turn,this implies that sharp equilibria can only occur at corners of X . By continuity, thisfurther implies that sharp equilibria are locally stable (cf. the proof of Theorem 6.6below); hence, by Proposition 2.7, sharp equilibria are also isolated. Our nextresult shows that if players employ (DA) with surjective choice maps, then, withhigh probability, sharp equilibria are attained in a finite number of steps: Theorem 6.6.
Fix a tolerance level δ > and suppose that (DA) is run with surjec-tive choice maps and a sufficiently small step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . If x ∗ is sharp and (DA) is not initialized too far from x ∗ , we have P ( X n reaches x ∗ in a finite number of steps ) ≥ − δ, (6.24) provided that (H1)–(H4) hold. If, in addition, x ∗ is globally stable, X n converges to x ∗ in a finite number of steps from every initial condition ( a.s. ) .Proof. As we noted above, v ( x ∗ ) lies in the interior of the polar cone PC( x ∗ ) to X at x ∗ . Hence, by continuity, there exists a neighborhood U ∗ of x ∗ such that v ( x ) ∈ int(PC( x ∗ )) for all x ∈ U ∗ . In turn, this implies that h v ( x ) , x − x ∗ i < for all x ∈ U ∗ \ { x ∗ } , i.e. x ∗ is stable. Therefore, by Theorem 4.11, there exists aneighborhood U of x ∗ such that X n converges to x ∗ with probability at least − δ .Now, let U ′ ⊆ U ∗ be a sufficiently small neighborhood of x ∗ such that h v ( x ) , z i ≤− c k z k for some c > and for all z ∈ TC( x ∗ ) . Then, with probability at least − δ , Indeed, if this were not the case, we would have h v ( x ∗ ) , z i = 0 for some nonzero z ∈ TC( x ∗ ) . That such a neighborhood exists is a direct consequence of Definition 6.5.
EARNING IN GAMES WITH CONTINUOUS ACTION SETS 31 there exists some (random) n such that X n ∈ U ′ for all n ≥ n , so h v ( X n ) , z i ≤− c k z k for all n ≥ n . Thus, for all z ∈ TC( x ∗ ) with k z k = 1 , we have h Y n +1 , z i = h Y n , z i + n X k = n γ k h v ( X k ) , z i + n X k = n γ k h ξ k +1 , z i≤ k Y n k ∗ − c n X k = n γ k + n X k = n γ k h ξ k +1 , z i . (6.25)By the law of large numbers for martingale difference sequences (Hall and Heyde,1980, Theorem 2.18), we also have P nk = n γ k ξ k +1 / P nk = n γ k → (a.s.), so thereexists some n ∗ such that k P nk = n γ k ξ k +1 k ∗ ≤ ( c/ P nk = n γ k for all n ≥ n ∗ (a.s.).We thus obtain h Y n +1 , z i ≤ k Y n k ∗ − c n X k = n γ k + c k z k n X k = n γ k ≤ k Y n k ∗ − c n X k = n γ k , (6.26)showing that h Y n , z i → −∞ uniformly in z with probability at least − δ .To proceed, Proposition A.1 in Appendix A shows that y ∗ + PC( x ∗ ) ⊆ Q − ( x ∗ ) whenever Q ( y ∗ ) = x ∗ . Since Q is surjective, there exists some y ∗ ∈ Q − ( x ∗ ) , so itsuffices to show that, with probability at least − δ , Y n lies in the pointed cone y ∗ +PC( x ∗ ) for all sufficiently large n . To do so, simply note that Y n − y ∗ ∈ PC( x ∗ ) ifand only if h Y n − y ∗ , z i ≤ for all z ∈ TC( x ∗ ) with k z k = 1 . Since h Y n , z i convergesuniformly to −∞ with probability at least − δ , our assertion is immediate.Finally, for the globally stable case, recall that X n converges to x ∗ with proba-bility from any initial condition (Theorem 4.7). The argument above shows that X n = x ∗ for all large n , so X n converges to x ∗ in a finite number of steps (a.s.). (cid:4) Remark . Theorem 6.6 suggests that dual averaging with surjective choice mapsleads to significantly faster convergence to sharp equilibria. In this way, it is consis-tent with an observation made by Mertikopoulos and Sandholm (2016, Proposition5.2) for the convergence of the continuous-time, deterministic dynamics (DA-c) infinite games.
7. Discussion
An important question in the implementation of dual averaging is the choiceof regularizer, which in turn determines the players’ choice maps Q i : Y i → X i .From a qualitative point of view, this choice would not seem to matter much: theconvergence results of Sections 4 and 5 hold for all choice maps of the form (3.6).Quantitatively however, the specific choice map employed by each player impactsthe algorithm’s convergence speed, and different choice maps could lead to vastlydifferent rates of convergence.As noted above, in the case of sharp equilibria, this choice seems to favor nonsteeppenalty functions (that is, surjective choice maps). Nonetheless, in the general case,the situation is less clear because of the dimensional dependence hidden in the Ω /K factor that appears e.g. in the mean rate guarantee (6.7). This factor dependscrucially on the geometry of the players’ action spaces and the underlying norm,and its optimum value may be attained by steep penalty functions – for instance,the entropic regularizer (3.8) is well known to be asymptotically optimal in the caseof simplex-like feasible regions (Shalev-Shwartz, 2011, p. 140). Another key question in game-theoretic and online learning has to do with theinformation that is available to the players at each stage. If players perform atwo-point sampling step in order to simulate an extra oracle call at an action pro-file different than the one employed, this extra information could be presumablyleveraged in order to increase the speed of convergence to a Nash equilibrium. Inan offline setting, this can be achieved by more sophisticated techniques relying ondual extrapolation (Nesterov, 2007) and/or mirror-prox methods (Juditsky et al.,2011). Extending these extra-gradient approaches to online learning processes asabove would be an interesting extension of the current work.At the other end of the spectrum, if players only have access to their realized, in-game payoffs, they would need to reconstruct their individual payoff gradients via asuitable single-shot estimator (Polyak, 1987; Flaxman et al., 2005). We believe ourconvergence analysis can be extended to this case by properly controlling the “bias-variance” tradeoff of this estimator and using more refined stochastic approximationarguments. The very recent manuscript by Bervoets et al. (2016) provides anencouraging first step in the case of (strictly) concave games with one-dimensionalaction sets; we intend to explore this direction in future work.
Appendix A. Auxiliary results
In this appendix, we collect some auxiliary results that would have otherwisedisrupted the flow of the main text. We begin with the basic properties of theFenchel coupling:
Proof of Proposition 4.3.
For our first claim, let x = Q ( y ) . Then, by definition F ( p, y ) = h ( p ) + h y, Q ( y ) i − h ( Q ( y )) − h y, p i = h ( p ) − h ( x ) − h y, p − x i . (A.1)Since y ∈ ∂h ( x ) by Proposition 3.2, we have h y, p − x i = h ′ ( x ; p − x ) whenever x ∈ X ◦ , thus proving (4.9a). Furthermore, the strong convexity of h also yields h ( x ) + t h y, p − x i ≤ h ( x + t ( p − x )) ≤ th ( p ) + (1 − t ) h ( x ) − Kt (1 − t ) k x − p k , (A.2)leading to the bound K (1 − t ) k x − p k ≤ h ( p ) − h ( x ) − h y, p − x i = F ( p, y ) (A.3)for all t ∈ (0 , . Eq. (4.9b) then follows by letting t → + in (A.3).Finally, for our third claim, we have F ( p, y ′ ) = h ( p ) + h ∗ ( y ′ ) − h y ′ , p i≤ h ( p ) + h ∗ ( y ) + h y ′ − y, ∇ h ∗ ( y ) i + 12 K k y ′ − y k ∗ − h y ′ , p i = F ( p, y ) + h y ′ − y, Q ( y ) − p i + 12 K k y ′ − y k ∗ , (A.4)where the inequality in the second line follows from the fact that h ∗ is (1 /K ) -strongly smooth (Rockafellar and Wets, 1998, Theorem 12.60(e)). (cid:4) Complementing Proposition 4.3, our next result concerns the inverse images ofthe choice map Q : Proposition A.1.
Let h be a penalty function on X , and let x ∗ ∈ X . If x ∗ = Q ( y ∗ ) for some y ∗ ∈ Y , then y ∗ + PC( x ∗ ) ⊆ Q − ( x ∗ ) . EARNING IN GAMES WITH CONTINUOUS ACTION SETS 33
Proof.
By Proposition 3.2, we have x ∗ = Q ( y ) if and only if y ∈ ∂h ( x ∗ ) , so it sufficesto show that y ∗ + v ∈ ∂h ( x ∗ ) for all v ∈ PC( x ∗ ) . Indeed, we have h v, x − x ∗ i ≤ for all x ∈ X , so h ( x ) ≥ h ( x ∗ ) + h y ∗ , x − x ∗ i ≥ h ( x ∗ ) + h y ∗ + v, x − x ∗ i . (A.5)The above shows that y ∗ + v ∈ ∂h ( x ∗ ) , as claimed. (cid:4) Our next result concerns the evolution of the Fenchel coupling under the dynam-ics (DA-c):
Lemma A.2.
Let x ( t ) = Q ( y ( t )) be a solution orbit of (DA-c) . Then, for all p ∈ X ,we have ddt F ( p, y ( t )) = h v ( x ( t )) , x ( t ) − p i . (A.6) Proof.
By definition, we have ddt F ( p, y ( t )) = ddt [ h ( p ) + h ∗ ( y ( t )) − h y ( t ) , p i ]= h ˙ y ( t ) , ∇ h ∗ ( y ( t )) i − h ˙ y ( t ) , p i = h v ( x ( t )) , x ( t ) − p i , (A.7)where, in the last line, we used Proposition 3.2. (cid:4) Our last auxiliary result shows that, if the sequence of play generated by (DA)is contained in the “basin of attraction” of a stable set X ∗ , then it admits anaccumulation point in X ∗ : Lemma A.3.
Suppose that X ∗ ⊆ X is stable and (DA) is run with a step-size suchthat P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . Assume further that ( X n ) ∞ n =1 is containedin a region R of X such that (VS) holds for all x ∈ R . Then, under (H1) and (H2) ,every neighborhood U of X ∗ is recurrent; specifically, there exists a subsequence X n k of X n such that X n k → X ∗ ( a.s. ) . Finally, if (DA) is run with perfect feedback ( σ = 0 ) , the above holds under the lighter assumption P nk =1 γ k (cid:14) P nk =1 γ k → .Proof of Lemma A.3. Let U be a neighborhood of X ∗ and assume to the contrarythat, with positive probability, X n / ∈ U for all sufficiently large n . By starting thesequence at a later index if necessary, we may assume that X n / ∈ U for all n withoutloss of generality. Thus, with X ∗ stable and X n ∈ R for all n by assumption, thereexists some c > such that h v ( X n ) , X n − x ∗ i ≤ − c for all x ∗ ∈ X ∗ and for all n. (A.8)As a result, for all x ∗ ∈ X ∗ , we get F ( x ∗ , Y n +1 ) = F ( x ∗ , Y n + γ n ˆ v n +1 ) ≤ F ( x ∗ , Y n ) + γ n h v ( X n ) + ξ n +1 , X n − x ∗ i + 12 K γ n k ˆ v n +1 k ∗ ≤ F ( x ∗ , Y n ) − cγ n + γ n ψ n +1 + 12 K γ n k ˆ v n +1 k ∗ , (A.9)where we used Proposition 4.3 in the second line and we set ψ n +1 = h ξ n +1 , X n − x ∗ i in the third. Telescoping (A.9) then gives F ( x ∗ , Y n +1 ) ≤ F ( x ∗ , Y ) − τ n " c − P nk =1 γ k ψ k +1 τ n − K P nk =1 γ k k ˆ v k +1 k ∗ τ n , (A.10) where τ n = P nk =1 γ k .Since E [ ψ n +1 | F n ] = h E [ ξ n +1 | F n ] , X n − x ∗ i = 0 by (H1) and E [ | ψ n +1 | | F n ] ≤ E [ k ξ n +1 k ∗ k X n − x ∗ k | F n ] ≤ σ kX k < ∞ by (H2), the law of large numbersfor martingale difference sequences yields τ − n P nk =1 γ k ψ k +1 → (Hall and Heyde,1980, Theorem 2.18). Furthermore, letting R n +1 = P nk =1 γ k k ˆ v k +1 k ∗ , we also get E [ R n +1 ] ≤ n X k =1 γ k E [ˆ v k +1 ] ≤ V ∗ ∞ X k =1 γ k < ∞ for all n, (A.11)so Doob’s martingale convergence theorem shows that R n converges (a.s.) to somerandom, finite value (Hall and Heyde, 1980, Theorem 2.5).Combining the above, (A.10) gives F ( x ∗ , Y n ) ∼ − aτ n → −∞ (a.s.), a contradic-tion. Finally, if σ = 0 , we also have ψ n +1 = 0 and k ˆ v n +1 k ∗ = k v ( X n ) k ∗ ≤ V ∗ for all n , so (A.10) yields F ( x ∗ , Y n ) → −∞ provided that τ − n P nk =1 γ k → , acontradiction. In both cases, we conclude that X n is recurrent, as claimed. (cid:4) Finally, we turn to the characterization of strict equilibria in finite games:
Proof of Proposition 5.2.
We will show that ( a ) = ⇒ ( b ) = ⇒ ( c ) = ⇒ ( a ).( a ) = ⇒ ( b ). Suppose that x ∗ = ( α ∗ , . . . , α ∗ N ) is a strict equilibrium. Then, theweak inequality h v ( x ∗ ) , z i ≤ follows from Proposition 2.1. For the strict part, if z i ∈ TC i ( x ∗ i ) is nonzero for some i ∈ N , we readily get h v i ( x ∗ ) , z i i = X α i = α ∗ i z i,α i (cid:2) u i ( α ∗ i ; α ∗− i ) − u i ( α i ; α ∗− i ) (cid:3) < , (A.12)where we used the fact that z i is tangent to X at x ∗ i , so P α i ∈A i z iα i = 0 and z iα i ≥ for α i = α ∗ i , with at least one of these inequalities being strict when z i = 0 .( b ) = ⇒ ( c ). Property ( b ) implies that v ( x ∗ ) lies in the interior of the polar cone PC( x ∗ ) to X at x ∗ . Since PC( x ∗ ) has nonempty interior, continuity implies that v ( x ) also lies in PC( x ∗ ) for x sufficiently close to x ∗ . We thus get h v ( x ) , x − x ∗ i ≤ for all x in a neighborhood of x ∗ , i.e. x ∗ is stable.( c ) = ⇒ ( a ). Assume that x ∗ is stable but not strict, so u iα i ( x ∗ ) = u iβ i ( x ∗ ) for some i ∈ N , and some α i ∈ supp( x ∗ i ) , β i ∈ A i . Then, if we take x i = x ∗ i + λ ( e iβ i − e iα i ) and x − i = x ∗− i with λ > small enough, we get h v ( x ) , x − x ∗ i = h v i ( x ) , x i − x ∗ i i = λu iβ i ( x ∗ ) − λu iα i ( x ∗ ) = 0 , (A.13)contradicting the assumption that x ∗ is stable. This shows that x ∗ is strict. (cid:4) References
Alvarez, F., Bolte, J., and Brahic, O. (2004). Hessian Riemannian gradient flows in convexprogramming.
SIAM Journal on Control and Optimization , 43(2):477–501.Arora, S., Hazan, E., and Kale, S. (2012). The multiplicative weights update method: A meta-algorithm and applications.
Theory of Computing , 8(1):121–164.Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methodsfor convex optimization.
Operations Research Letters , 31(3):167–175.Benaïm, M. (1999). Dynamics of stochastic approximation algorithms. In Azéma, J., Émery,M., Ledoux, M., and Yor, M., editors,
Séminaire de Probabilités XXXIII , volume 1709 of
Lecture Notes in Mathematics , pages 1–68. Springer Berlin Heidelberg.Bervoets, S., Bravo, M., and Faure, M. (2016). Learning and convergence to Nash in networkgames with continuous action set. Working paper.
EARNING IN GAMES WITH CONTINUOUS ACTION SETS 35
Bolte, J., Daniilidis, A., Ley, O., and Mazet, L. (2010). Characterizations of Łojasiewicz inequal-ities: Subgradient flows, talweg, convexity.
Transactions of the American MathematicalSociety , 362(6):3319–3363.Bravo, M. and Mertikopoulos, P. (2017). On the robustness of learning in games with stochasticallyperturbed payoff observations.
Games and Economic Behavior , 103, John Nash Memorialissue:41–66.Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
Foundations and Trends in Machine Learning , 5(1):1–122.Chen, G. and Teboulle, M. (1993). Convergence analysis of a proximal-like minimization algorithmusing Bregman functions.
SIAM Journal on Optimization , 3(3):538–543.Cohen, J., Héliou, A., and Mertikopoulos, P. (2017). Hedging under uncertainty: Regret min-imization meets exponentially fast convergence. In
SAGT ’17: Proceedings of the 10thInternational Symposium on Algorithmic Game Theory .Coucheney, P., Gaujal, B., and Mertikopoulos, P. (2015). Penalty-regulated dynamics and robustlearning procedures in games.
Mathematics of Operations Research , 40(3):611–633.Facchinei, F. and Kanzow, C. (2007). Generalized Nash equilibrium problems. , 5(3):173–210.Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in thebandit setting: gradient descent without a gradient. In
SODA ’05: Proceedings of the 16thannual ACM-SIAM symposium on discrete algorithms , pages 385–394.Hall, P. and Heyde, C. C. (1980).
Martingale Limit Theory and Its Application . Probability andMathematical Statistics. Academic Press, New York.Hazan, E. (2012). A survey: The convex optimization approach to regret minimization. In Sra, S.,Nowozin, S., and Wright, S. J., editors,
Optimization for Machine Learning , pages 287–304.MIT Press.Hofbauer, J. and Sandholm, W. H. (2002). On the global convergence of stochastic fictitious play.
Econometrica , 70(6):2265–2294.Hofbauer, J. and Sandholm, W. H. (2009). Stable games and their dynamics.
Journal of EconomicTheory , 144(4):1665–1693.Hofbauer, J., Schuster, P., and Sigmund, K. (1979). A note on evolutionarily stable strategies andgame dynamics.
Journal of Theoretical Biology , 81(3):609–612.Juditsky, A., Nemirovski, A. S., and Tauvel, C. (2011). Solving variational inequalities withstochastic mirror-prox algorithm.
Stochastic Systems , 1(1):17–58.Kiwiel, K. C. (1997). Free-steering relaxation methods for problems with strictly convex costs andlinear constraints.
Mathematics of Operations Research , 22(2):326–349.Laraki, R. and Mertikopoulos, P. (2013). Higher order game dynamics.
Journal of EconomicTheory , 148(6):2666–2695.Leslie, D. S. and Collins, E. J. (2005). Individual Q -learning in normal form games. SIAM Journalon Control and Optimization , 44(2):495–514.Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm.
Information andComputation , 108(2):212–261.Maynard Smith, J. and Price, G. R. (1973). The logic of animal conflict.
Nature , 246:15–18.McKelvey, R. D. and Palfrey, T. R. (1995). Quantal response equilibria for normal form games.
Games and Economic Behavior , 10(6):6–38.Mertikopoulos, P., Papadimitriou, C. H., and Piliouras, G. (2018). Cycles in adversarial regularizedlearning. In
SODA ’18: Proceedings of the 29th annual ACM-SIAM Symposium on DiscreteAlgorithms .Mertikopoulos, P. and Sandholm, W. H. (2016). Learning in games via reinforcement and regu-larization.
Mathematics of Operations Research , 41(4):1297–1324.Monderer, D. and Shapley, L. S. (1996). Potential games.
Games and Economic Behavior ,14(1):124 – 143.Nemirovski, A. S., Juditsky, A., Lan, G. G., and Shapiro, A. (2009). Robust stochastic approxi-mation approach to stochastic programming.
SIAM Journal on Optimization , 19(4):1574–1609.
Nemirovski, A. S. and Yudin, D. B. (1983).
Problem Complexity and Method Efficiency in Opti-mization . Wiley, New York, NY.Nesterov, Y. (2004).
Introductory Lectures on Convex Optimization: A Basic Course . Number 87in Applied Optimization. Kluwer Academic Publishers.Nesterov, Y. (2007). Dual extrapolation and its applications to solving variational inequalitiesand related problems.
Mathematical Programming , 109(2):319–344.Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems.
Mathematical Pro-gramming , 120(1):221–259.Neyman, A. (1997). Correlated equilibrium and potential games.
International Journal of GameTheory , 26(2):223–227.Perkins, S. and Leslie, D. S. (2012). Asynchronous stochastic approximation with differentialinclusions.
Stochastic Systems , 2(2):409–446.Perkins, S., Mertikopoulos, P., and Leslie, D. S. (2017). Mixed-strategy learning with continuousaction sets.
IEEE Trans. Autom. Control , 62(1):379–384.Polyak, B. T. (1987).
Introduction to Optimization . Optimization Software, New York, NY, USA.Rockafellar, R. T. (1970).
Convex Analysis . Princeton University Press, Princeton, NJ.Rockafellar, R. T. and Wets, R. J. B. (1998).
Variational Analysis , volume 317 of
A Series ofComprehensive Studies in Mathematics . Springer-Verlag, Berlin.Rosen, J. B. (1965). Existence and uniqueness of equilibrium points for concave N -person games. Econometrica , 33(3):520–534.Sandholm, W. H. (2015). Population games and deterministic evolutionary dynamics. In Young,H. P. and Zamir, S., editors,
Handbook of Game Theory IV , pages 703–778. Elsevier.Scutari, G., Facchinei, F., Palomar, D. P., and Pang, J.-S. (2010). Convex optimization, gametheory, and variational inequality theory in multiuser communication systems.
IEEE SignalProcess. Mag. , 27(3):35–49.Shalev-Shwartz, S. (2007).
Online learning: Theory, algorithms, and applications . PhD thesis,Hebrew University of Jerusalem.Shalev-Shwartz, S. (2011). Online learning and online convex optimization.
Foundations andTrends in Machine Learning , 4(2):107–194.Shalev-Shwartz, S. and Singer, Y. (2007). Convex repeated games and Fenchel duality. In
Advancesin Neural Information Processing Systems 19 , pages 1265–1272. MIT Press.Shiryaev, A. N. (1995).
Probability . Springer, Berlin, 2 edition.Sorin, S. and Wan, C. (2016). Finite composite games: Equilibria and dynamics.
Journal ofDynamics and Games , 3(1):101–120.Viossat, Y. and Zapechelnyuk, A. (2013). No-regret dynamics and fictitious play.
Journal ofEconomic Theory , 148(2):825–842.Vovk, V. G. (1990). Aggregating strategies. In
COLT ’90: Proceedings of the 3rd Workshop onComputational Learning Theory , pages 371–383.Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimiza-tion.
Journal of Machine Learning Research , 11:2543–2596.Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In