Learning in games with continuous action sets and unknown payoff functions
PANAYOTIS MERTIKOPOULOS AND ZHENGYUAN ZHOU

Abstract. This paper examines the convergence of no-regret learning in games with continuous action sets. For concreteness, we focus on learning via "dual averaging", a widely used class of no-regret learning schemes where players take small steps along their individual payoff gradients and then "mirror" the output back to their action sets. In terms of feedback, we assume that players can only estimate their payoff gradients up to a zero-mean error with bounded variance. To study the convergence of the induced sequence of play, we introduce the notion of variational stability, and we show that stable equilibria are locally attracting with high probability whereas globally stable equilibria are globally attracting with probability 1. We also discuss some applications to mixed-strategy learning in finite games, and we provide explicit estimates of the method's convergence speed.

Univ. Grenoble Alpes, CNRS, Inria, LIG, F-38000 Grenoble, France.
Stanford University, Dept. of Electrical Engineering, Stanford, CA 94305.
E-mail addresses: [email protected], [email protected].
2010 Mathematics Subject Classification. Primary 91A26, 90C15; secondary 90C33, 68Q32.
Key words and phrases. Continuous games; dual averaging; variational stability; Fenchel coupling; Nash equilibrium.
The authors are indebted to the associate editor and two anonymous referees for their detailed suggestions and remarks. The paper has also benefited greatly from thoughtful comments by Jérôme Bolte, Nicolas Gast, Jérôme Malick, Mathias Staudigl, and the audience of the Paris Optimization Seminar.
P. Mertikopoulos was partially supported by the French National Research Agency (ANR) project ORACLESS (ANR–GAGA–13–JS01–0004–01) and the Huawei Innovation Research Program ULTRON.
1. Introduction
The prototypical setting of online optimization can be summarized as follows: at every stage n = 1, 2, ..., of a repeated decision process, an agent selects an action X_n from some set X (assumed here to be convex and compact), and obtains a reward u_n(X_n) determined by an a priori unknown payoff function u_n : X → R. Subsequently, the agent receives some problem-specific feedback (for instance, an estimate of the gradient of u_n at X_n), and selects a new action with the goal of maximizing the obtained reward. Aggregating over the stages of the process, this goal is usually quantified by asking that the agent's regret

    R_n ≡ max_{x ∈ X} ∑_{k=1}^n [u_k(x) − u_k(X_k)]

grow sublinearly in n, a property known as "no regret".

In this general setting, the most widely used class of no-regret policies is the online mirror descent (OMD) method of Shalev-Shwartz (2007) and its variants – such as "Following the Regularized Leader" (Shalev-Shwartz and Singer, 2007), dual averaging (Nesterov, 2009; Xiao, 2010), etc. Specifically, if the problem's payoff functions are concave, mirror descent guarantees an O(√n) regret bound, which is well known to be tight in a "black-box" environment (i.e., without any further assumptions on u_n). Owing to these guarantees, this class of first-order methods has given rise to an extensive literature in online learning and optimization; for a survey, see Shalev-Shwartz (2011), Bubeck and Cesa-Bianchi (2012), Hazan (2012), and references therein.

In this paper, we consider a multi-agent extension of the above framework where the agents' rewards are determined by their individual actions and the actions of all other agents via a fixed mechanism: a non-cooperative game. Even though this mechanism may be unknown and/or opaque to the players, the additional structure it provides means that finer convergence criteria apply, chief among them being convergence to a Nash equilibrium (NE). We are thus led to the following fundamental question: if all players of a repeated game employ a no-regret updating policy, do their actions converge to a Nash equilibrium of the underlying game?
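To fix ideas, the following toy sketch illustrates the no-regret property for a single agent running online gradient ascent on X = [0, 1] against a stream of concave payoffs u_n(x) = −(x − θ_n)². All numerical choices (payoff stream, horizon, step-size) are our own illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5_000
theta = rng.uniform(0.0, 1.0, T)   # payoff parameters, unknown in advance

X = np.empty(T)                    # realized actions X_n
x = 0.5
for n in range(T):
    X[n] = x
    grad = -2.0 * (x - theta[n])                      # gradient of u_n at X_n
    x = np.clip(x + grad / np.sqrt(n + 1), 0.0, 1.0)  # ascent step + projection

best_fixed = theta.mean()          # maximizer of sum_n u_n(x) over [0, 1]
regret = np.sum(-(best_fixed - theta)**2) - np.sum(-(X - theta)**2)
print(regret, regret / T)          # R_T grows sublinearly, so R_T / T -> 0
```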
Summary of contributions.
In general, the answer to this question is a resounding "no". Even in simple, finite games, no-regret learning may cycle (Mertikopoulos et al., 2018) and its limit set may contain highly non-rationalizable strategies that assign positive weight only to strictly dominated strategies (Viossat and Zapechelnyuk, 2013). As such, our aim in this paper is twofold:

i) to provide sufficient conditions under which no-regret learning converges to equilibrium; and
ii) to assess the speed and robustness of this convergence in the presence of uncertainty, feedback noise, and other learning impediments.

Our contributions along these lines are as follows: First, in Section 2, we introduce an equilibrium stability notion which we call variational stability (VS), and which is formally similar to (and inspired by) the seminal notion of evolutionary stability in population games (Maynard Smith and Price, 1973). This stability notion extends the standard notion of operator monotonicity, so it applies in particular to all monotone games (that is, concave games that satisfy Rosen's (1965) diagonal strict concavity condition). In fact, going beyond concave games, variational stability allows us to treat convergence questions in general games with continuous action spaces without having to restrict ourselves to a specific subclass (such as potential or common interest games).

Our second contribution is a detailed analysis of the long-run behavior of no-regret learning under variational stability. Regarding the information available to the players, our only assumption is that they have access to unbiased, bounded-variance estimates of their individual payoff gradients at each step; beyond this, we assume no prior knowledge of their payoff functions and/or the game. Despite this lack of information, variational stability guarantees that (i) the induced sequence of play converges globally to globally stable equilibria with probability 1 (Theorem 4.7); and (ii) it converges locally to locally stable equilibria with high probability (Theorem 4.11). As a corollary, if the game admits a (pseudo-)concave potential or if it is monotone, the players' actions converge to Nash equilibrium no matter the level of uncertainty affecting the players' feedback. In Section 5, we further extend these results to learning with imperfect feedback in finite games.

[Footnote: Heuristically, variational stability is to games with a finite number of players and a continuum of actions what evolutionary stability is to games with a continuum of players and a finite action space. Our choice of terminology reflects precisely this analogy.]
Our third contribution concerns the method's convergence speed. Mirroring a known result of Nesterov (2009) for variational inequalities, we show that the gap from a stable state decays ergodically as O(1/√n) if the method's step-size is chosen appropriately. Dually to this, we also show that the algorithm's expected running length until players reach an ε-neighborhood of a stable state is O(1/ε²). Finally, if the stage game admits a sharp equilibrium (a straightforward extension of the notion of strict equilibrium in finite games), we show that, with probability 1, the process reaches an equilibrium in a finite number of steps.

Our analysis relies on tools and techniques from stochastic approximation, martingale limit theory and convex analysis. In particular, with regard to the latter, we make heavy use of a "primal-dual divergence" measure between action and gradient variables, which we call the Fenchel coupling. This coupling is a hybridization of the Bregman divergence which provides a potent tool for proving convergence thanks to its Lyapunov properties.
Related work.
Originally, mirror descent was introduced by Nemirovski and Yudin (1983) for solving offline convex programs. The dual averaging (DA) variant that we consider here was pioneered by Nesterov (2009) and proceeds as follows: at each stage, the method takes a gradient step in a dual space (where gradients live); the result is then mapped (or "mirrored") back to the problem's feasible region, a new gradient is generated, and the process repeats. The "mirroring" step above is itself determined by a strongly convex regularizer (or "distance generating") function: the squared Euclidean norm gives rise to Zinkevich's (2003) online gradient descent algorithm, while the (negative) Gibbs entropy on the simplex induces the well-known exponential weights (EW) algorithm (Vovk, 1990; Arora et al., 2012).

Nesterov (2009) and Nemirovski et al. (2009) provide several convergence results for dual averaging in (stochastic) convex programs and saddle-point problems, while Xiao (2010) provides a thorough regret analysis for online optimization problems. In addition to treating the interactions of several competing agents at once, the fundamental difference of our paper with these works is that the convergence analysis in the latter is "ergodic", i.e., it concerns the time-averaged sequence X̄_n = ∑_{k=1}^n γ_k X_k / ∑_{k=1}^n γ_k, and not the actual sequence of actions X_n employed by the players.

In online optimization, this averaging comes up naturally because the focus is on the players' regret. In the offline case, the points where an oracle is called during the execution of an algorithm do not carry any particular importance, so averaging provides a convenient way of obtaining convergence. However, in a game-theoretic setting, the figure of merit is the actual sequence of play, which determines the players' payoffs at each stage. The behavior of X_n may differ drastically from that of X̄_n, so our treatment requires a completely different set of tools and techniques (especially in the stochastic regime).

Much of our analysis boils down to solving in an online way a (stochastic) variational inequality (VI) characterizing the game's Nash equilibria. Nesterov (2007) and Juditsky et al. (2011) provide efficient offline methods to do this, relying on an "extra-gradient" step to boost the convergence rate of the ergodic sequence X̄_n. In our limited-feedback setting, we do not assume that players can make an extra oracle call to actions that were not actually employed, so the extrapolation results of Nesterov (2007) and Juditsky et al. (2011) do not apply. The single-call results of Nesterov (2009) are closer in spirit to our paper but, again, they focus exclusively on monotone variational inequalities and the ergodic sequence X̄_n – not the actual sequence of play X_n. All the same, for completeness, we make the link with ergodic convergence in Theorems 4.13 and 6.2.

When applied to mixed-strategy learning in finite games, the class of algorithms studied here has very close ties to the family of perturbed best response maps that arise in models of fictitious play and reinforcement learning (Hofbauer and Sandholm, 2002; Leslie and Collins, 2005; Coucheney et al., 2015). Along these lines, Mertikopoulos and Sandholm (2016) recently showed that a continuous-time version of the dynamics studied in this paper eliminates dominated strategies and converges to strict equilibria from all nearby initial conditions. Our analysis in Section 5 extends these results to a discrete-time, stochastic setting.

In games with continuous action sets, Perkins and Leslie (2012) and Perkins et al. (2017) examined a mixed-strategy actor-critic algorithm which converges to a probability distribution that assigns most weight to equilibrium states. At the pure strategy level, several authors have considered VI-based and Gauss–Seidel methods for solving generalized Nash equilibrium problems (GNEPs); for a survey, see Facchinei and Kanzow (2007) and Scutari et al. (2010). The intersection of these works with the current paper is when the game satisfies a global monotonicity condition similar to the diagonal strict concavity condition of Rosen (1965). However, the literature on GNEPs does not consider the implications for the players' regret, the impact of uncertainty and/or local convergence/stability issues, so there is no overlap with our results.

Finally, during the final preparation stages of this paper (a few days before the actual submission), we were made aware of a preprint by Bervoets et al. (2016) examining the convergence of pure-strategy learning in strictly concave games with one-dimensional action sets. A key feature of the analysis of Bervoets et al. (2016) is that players only observe their realized, in-game payoffs, and they choose actions based on their payoffs' variation from the previous period. The resulting mean dynamics boil down to an instantiation of dual averaging induced by the entropic regularization penalty h(x) = x log x (cf. Section 3), suggesting several interesting links with the current work.

[Footnote: In the online learning literature, dual averaging is sometimes called lazy mirror descent and can be seen as a linearized "Follow the Regularized Leader" (FTRL) scheme – for more details, we refer the reader to Beck and Teboulle (2003), Xiao (2010), and Shalev-Shwartz (2011).]
Notation. Given a finite-dimensional vector space V with norm ‖·‖, we write V* for its dual, ⟨y, x⟩ for the pairing between y ∈ V* and x ∈ V, and ‖y‖_* ≡ sup{⟨y, x⟩ : ‖x‖ ≤ 1} for the dual norm of y in V*. If C ⊆ V is convex, we also write C° ≡ ri(C) for the relative interior of C, ‖C‖ = sup{‖x′ − x‖ : x, x′ ∈ C} for its diameter, and dist(C, x) = inf_{x′ ∈ C} ‖x′ − x‖ for the distance between x ∈ V and C. For a given x ∈ C, the tangent cone TC_C(x) is defined as the closure of the set of all rays emanating from x and intersecting C in at least one other point; dually, the polar cone PC_C(x) to C at x is defined as PC_C(x) = {y ∈ V* : ⟨y, z⟩ ≤ 0 for all z ∈ TC_C(x)}. For concision, when C is clear from the context, we will drop it altogether and write TC(x) and PC(x) instead.
2. Continuous games and variational stability
2.1. Basic definitions and examples.
Throughout this paper, we focus on games played by a finite set of players i ∈ N = {1, ..., N}. During play, each player selects an action x_i from a compact convex subset X_i of a finite-dimensional normed space V_i, and their reward is determined by the profile x = (x_1, ..., x_N) of all players' actions – often denoted as x ≡ (x_i; x_{−i}) when we seek to highlight the action x_i of player i against the ensemble of actions x_{−i} = (x_j)_{j≠i} of all other players.

In more detail, writing X ≡ ∏_i X_i for the game's action space, each player's payoff is determined by an associated payoff function u_i : X → R. In terms of regularity, we assume that u_i is continuously differentiable in x_i, and we write

    v_i(x) ≡ ∇_{x_i} u_i(x_i; x_{−i})    (2.1)

for the individual gradient of u_i at x; we also assume that u_i and v_i are both continuous in x. Putting all this together, a continuous game is a tuple G ≡ G(N, (X_i)_{i∈N}, (u_i)_{i∈N}) with players, actions and payoffs defined as above.

As a special case, we will sometimes consider payoff functions that are individually (pseudo-)concave in the sense that

    u_i(x_i; x_{−i}) is (pseudo-)concave in x_i for all x_{−i} ∈ ∏_{j≠i} X_j, i ∈ N.    (2.2)

When this is the case, we say that the game itself is (pseudo-)concave. Below, we briefly discuss some well-known examples of such games:

Example 2.1. In a finite game Γ ≡ (N, A, u), each player i ∈ N chooses an action α_i from a finite set A_i of "pure strategies" and no assumptions are made on the players' payoff functions u_i : A ≡ ∏_j A_j → R. Players can "mix" these choices by playing mixed strategies, i.e., probability distributions x_i drawn from the simplex X_i ≡ Δ(A_i). In this case (and in a slight abuse of notation), the expected payoff to player i in the mixed profile x = (x_1, ..., x_N) can be written as

    u_i(x) = ∑_{α_1 ∈ A_1} ··· ∑_{α_N ∈ A_N} u_i(α_1, ..., α_N) x_{1,α_1} ··· x_{N,α_N},    (2.3)

so the players' individual gradients are simply their payoff vectors:

    v_i(x) = ∇_{x_i} u_i(x) = (u_i(α_i; x_{−i}))_{α_i ∈ A_i}.    (2.4)

The resulting continuous game is called the mixed extension of Γ. Since X_i = Δ(A_i) is convex and u_i is linear in x_i, G is itself concave in the sense of (2.2).

Example 2.2. Consider the following Cournot oligopoly model: There is a finite set N = {1, ..., N} of firms, each supplying the market with a quantity x_i ∈ [0, C_i] of the same good (or service) up to the firm's production capacity C_i. This good is then priced as a decreasing function P(x) of each firm's production; for concreteness, we focus on the linear model P(x) = a − ∑_i b_i x_i, where a is a positive constant and the coefficients b_i > 0 reflect the price-setting power of each firm. In this model, the utility of firm i is given by

    u_i(x) = x_i P(x) − c_i x_i,    (2.5)

where c_i represents the marginal production cost of firm i. Letting X_i = [0, C_i], the resulting game is easily seen to be concave in the sense of (2.2).

[Footnote: In the above, we tacitly assume that u_i is defined on an open neighborhood of X_i. This allows us to use ordinary derivatives, but none of our results depend on this device. We also note that v_i(x) acts naturally on vectors z_i ∈ V_i via the mapping z_i ↦ ⟨v_i(x), z_i⟩ ≡ u′_i(x; z_i) = d/dτ|_{τ=0} u_i(x_i + τz_i; x_{−i}); in view of this, v_i(x) is treated as an element of V*_i, the dual of V_i.]
[Figure 1. Geometric characterization of Nash equilibria.]
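To make Example 2.2 concrete, here is a minimal Python sketch of the model's payoffs (2.5) and individual gradients (2.1); every numerical value (a, b_i, c_i, C_i) is an illustrative assumption of ours, not a quantity from the paper:

```python
import numpy as np

a = 10.0                          # demand intercept
b = np.array([1.0, 1.2, 0.8])     # price-setting power of each firm
c = np.array([2.0, 1.5, 2.5])     # marginal production costs
C = np.array([4.0, 4.0, 4.0])     # production capacities

def price(x):
    return a - np.dot(b, x)                   # P(x) = a - sum_i b_i x_i

def payoff(i, x):
    return x[i] * price(x) - c[i] * x[i]      # u_i(x), Eq. (2.5)

def v(x):
    # Individual gradients v_i(x) = d u_i / d x_i, Eq. (2.1):
    # v_i = a - sum_j b_j x_j - b_i x_i - c_i.
    return a - np.dot(b, x) - b * x - c

x = np.array([1.0, 2.0, 0.5])     # some feasible action profile
print(price(x), payoff(0, x), v(x))
```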
Example 2.3. Congestion games are game-theoretic models that arise in the study of traffic networks (such as the Internet). To define them, fix a set of players N that share a set of resources r ∈ R, each associated with a nondecreasing convex cost function c_r : R_+ → R (for instance, links in a data network and their corresponding delay functions). Each player i ∈ N has a certain resource load ρ_i > 0 which is split over a collection A_i of subsets α_i of R – e.g., sets of links that form paths in the network. The action space of player i ∈ N is then the scaled simplex X_i = ρ_i Δ(A_i) = {x_i ∈ R_+^{A_i} : ∑_{α_i ∈ A_i} x_{iα_i} = ρ_i} of load distributions over A_i.

Given a load profile x = (x_1, ..., x_N), costs are determined based on the utilization of each resource as follows: First, the demand w_r of the r-th resource is defined as the total load w_r = ∑_{i ∈ N} ∑_{α_i ∋ r} x_{iα_i} on said resource. This demand incurs a cost c_r(w_r) per unit of load to each player utilizing resource r. Accordingly, the total cost to player i ∈ N is

    c_i(x) = ∑_{α_i ∈ A_i} x_{iα_i} c_{iα_i}(x),    (2.6)

where c_{iα_i}(x) = ∑_{r ∈ α_i} c_r(w_r) denotes the cost incurred to player i by the utilization of α_i ⊆ R. The resulting atomic splittable congestion game G ≡ G(N, X, −c) is easily seen to be concave in the sense of (2.2).
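Similarly, the sketch below computes resource demands and the costs (2.6) on a toy instance of Example 2.3 with quadratic resource costs c_r(w) = w²; the two-player network and all loads are illustrative assumptions:

```python
import numpy as np

# Resources r = 0, 1, 2; each player splits a load rho_i over "paths"
# (subsets of resources). Costs per resource: c_r(w) = w**2.
paths = [
    [(0,), (1, 2)],    # player 0: two available paths
    [(0, 1), (2,)],    # player 1: two available paths
]
rho = [1.0, 2.0]       # resource loads rho_i

def resource_demands(x):
    # w_r = total load on resource r, summed over all players and paths.
    w = np.zeros(3)
    for i, xi in enumerate(x):
        for load, path in zip(xi, paths[i]):
            for r in path:
                w[r] += load
    return w

def cost(i, x):
    # c_i(x) = sum over paths of x_{i,alpha} * sum_{r in alpha} c_r(w_r).
    w = resource_demands(x)
    return sum(load * sum(w[r]**2 for r in path)
               for load, path in zip(x[i], paths[i]))

x = [np.array([0.3, 0.7]), np.array([1.5, 0.5])]   # each sums to rho_i
print(resource_demands(x), cost(0, x), cost(1, x))
```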
2.2. Nash equilibrium. Our analysis focuses primarily on Nash equilibria (NE), i.e., strategy profiles that discourage unilateral deviations. Formally, x* ∈ X is a Nash equilibrium if

    u_i(x*_i; x*_{−i}) ≥ u_i(x_i; x*_{−i}) for all x_i ∈ X_i, i ∈ N.    (NE)

Obviously, if x* is a Nash equilibrium, we have the first-order condition

    u′_i(x*; z_i) = ⟨v_i(x*), z_i⟩ ≤ 0 for all z_i ∈ TC_i(x*_i), i ∈ N,    (2.7)

where TC_i(x*_i) denotes the tangent cone to X_i at x*_i. Therefore, if x* is a Nash equilibrium, each player's individual gradient v_i(x*) belongs to the polar cone PC_i(x*_i) to X_i at x*_i (cf. Fig. 1); moreover, the converse also holds if the game is pseudo-concave. We encode this more concisely as follows:

Proposition 2.1. If x* ∈ X is a Nash equilibrium, then v(x*) ∈ PC(x*), i.e.,

    ⟨v(x*), x − x*⟩ ≤ 0 for all x ∈ X.    (2.8)

The converse also holds if the game is (pseudo-)concave in the sense of (2.2).
Remark. In the above (and in what follows), v = (v_i)_{i∈N} denotes the ensemble of the players' individual payoff gradients and ⟨v, z⟩ ≡ ∑_{i∈N} ⟨v_i, z_i⟩ stands for the pairing between v and the vector z = (z_i)_{i∈N} ∈ ∏_{i∈N} V_i. For concision, we also write V ≡ ∏_i V_i for the ambient space of X ≡ ∏_i X_i and V* for its dual.

Proof of Proposition 2.1. If x* is a Nash equilibrium, (2.8) is obtained by setting z_i = x_i − x*_i in (2.7) and summing over all i ∈ N. Conversely, if (2.8) holds and the game is (pseudo-)concave, pick some x_i ∈ X_i and let x = (x_i; x*_{−i}) in (2.8). This gives ⟨v_i(x*), x_i − x*_i⟩ ≤ 0 for all x_i ∈ X_i, so (NE) follows from the basic properties of (pseudo-)concave functions. ∎

Proposition 2.1 shows that the Nash equilibria of concave games are precisely the solutions of the variational inequality (2.8), so existence follows from standard results.
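The inequality (2.8) is easy to probe numerically. In the sketch below, we reuse the linear Cournot model of Example 2.2 with (illustrative) capacities chosen so small that the equilibrium sits at the boundary point x* = (C_1, ..., C_N); the variational characterization then holds even though v(x*) ≠ 0:

```python
import numpy as np

a = 10.0
b = np.array([1.0, 1.2, 0.8])
c = np.array([2.0, 1.5, 2.5])
C = np.array([1.0, 1.0, 1.0])     # tight capacities (illustrative)

def v(x):
    return a - np.dot(b, x) - b * x - c

x_star = C.copy()                 # candidate equilibrium at full capacity
assert np.all(v(x_star) > 0)      # every firm would still like to expand,
                                  # so producing at capacity is optimal

rng = np.random.default_rng(2)
for _ in range(5):
    x = rng.uniform(0.0, C)       # random feasible profile in prod [0, C_i]
    print(np.dot(v(x_star), x - x_star) <= 0)   # (2.8) holds: prints True
```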
Using a similar variational characterization, Rosen (1965) proved the following sufficient condition for equilibrium uniqueness:

Theorem 2.2 (Rosen, 1965). Assume that G satisfies the payoff monotonicity condition

    ⟨v(x′) − v(x), x′ − x⟩ ≤ 0 for all x, x′ ∈ X,    (MC)

with equality if and only if x = x′. Then, G admits a unique Nash equilibrium.

Games satisfying (MC) are called (strictly) monotone and they enjoy properties similar to those of (strictly) convex functions. In particular, letting x′_{−i} = x_{−i}, (MC) gives

    ⟨v_i(x′_i; x_{−i}) − v_i(x_i; x_{−i}), x′_i − x_i⟩ ≤ 0 for all x_i, x′_i ∈ X_i, x_{−i} ∈ X_{−i},    (2.9)

implying in turn that u_i(x) is (strictly) concave in x_i for all i. Therefore, any game satisfying (MC) is also concave.
2.3. Variational stability. Combining Proposition 2.1 and (MC), it follows that the (necessarily unique) Nash equilibrium of a monotone game satisfies the inequality

    ⟨v(x), x − x*⟩ ≤ ⟨v(x*), x − x*⟩ ≤ 0 for all x ∈ X.    (2.10)

In other words, if x* is a Nash equilibrium of a monotone game, the players' individual payoff gradients "point towards" x* in the sense that v(x) forms an acute angle with x* − x. Motivated by this, we introduce below the following relaxation of the monotonicity condition (MC):

Definition 2.3. We say that x* ∈ X is variationally stable (or simply stable) if there exists a neighborhood U of x* such that

    ⟨v(x), x − x*⟩ ≤ 0 for all x ∈ U,

with equality if and only if x = x*. In particular, if U can be taken to be all of X, we say that x* is globally variationally stable (or globally stable for short).

Remark. The terminology "variational stability" alludes to the seminal notion of evolutionary stability introduced by Maynard Smith and Price (1973) for population games (i.e., games with a continuum of players and a common, finite set of actions A). Specifically, if v(x) = (v_α(x))_{α∈A} denotes the payoff field of such a game (with x ∈ Δ(A) denoting the state of the population), Definition 2.6 boils down to the variational characterization of evolutionarily stable states due to Hofbauer et al. (1979). As we show in the next sections, variational stability plays the same role for learning in games with continuous action spaces as evolutionary stability plays for evolution in games with a continuum of players.

[Footnote: Rosen (1965) originally referred to (MC) as diagonal strict concavity; Hofbauer and Sandholm (2009) use the term "stable" for population games that satisfy a formal analogue of (MC), while Sandholm (2015) and Sorin and Wan (2016) call such games "contractive" and "dissipative" respectively. In all cases, the adverb "strictly" refers to the "only if" requirement in (MC).]

Table 1. Monotonicity, stability, and Nash equilibrium: the existence of a concave potential implies monotonicity; monotonicity implies the existence of a globally stable point; and globally stable points are equilibria.

                                 First-order requirement              Second-order test
    Nash equilibrium (NE)        ⟨v(x*), x − x*⟩ ≤ 0                  N/A
    Variational stability (VS)   ⟨v(x), x − x*⟩ ≤ 0                   H_G(x*) ≺ 0
    Monotonicity (MC)            ⟨v(x′) − v(x), x′ − x⟩ ≤ 0           H_G(x) ≺ 0
    Concave potential (PF)       v(x) = ∇f(x)                         ∇²f(x) ≺ 0

By (2.10), a first example of variational stability is provided by the class of monotone games:

Corollary 2.4. If G satisfies (MC), its (unique) Nash equilibrium is globally stable.
The converse to Corollary 2.4 does not hold, even partially. For instance, consider the single-player game with payoffs given by the function

    u(x) = 1 − ∑_{ℓ=1}^d √x_ℓ,  x ∈ [0, 1]^d.    (2.11)

In this simple example, the origin is the unique maximizer (and hence the unique Nash equilibrium) of u. Moreover, we trivially have ⟨v(x), x − 0⟩ = −∑_{ℓ=1}^d x_ℓ/(2√x_ℓ) = −(1/2) ∑_{ℓ=1}^d √x_ℓ ≤ 0 with equality if and only if x = 0, so the origin satisfies the global version of (VS); however, u is not even pseudo-concave if d ≥ 2, so the game cannot be monotone. In words, (MC) is a sufficient condition for the existence of a (globally) stable state, but not a necessary one.
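A quick numerical check of the global version of (VS) for the example (2.11), with d = 3 as an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3                                  # any d >= 2 exhibits non-concavity

def v(x):
    # v(x) = grad u(x) for u(x) = 1 - sum(sqrt(x_l)): v_l = -1/(2 sqrt(x_l)).
    return -0.5 / np.sqrt(x)

for _ in range(5):
    x = rng.uniform(0.01, 1.0, d)      # keep away from 0 to avoid 1/0
    print(np.dot(v(x), x - 0.0))       # = -(1/2) sum(sqrt(x_l)) < 0
```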
Nonetheless, even in this (non-monotone) example, variational stability characterizes the game's unique Nash equilibrium. We make this link precise below:

Proposition 2.5. Suppose that x* ∈ X is variationally stable. Then:
a) If G is (pseudo-)concave, x* is an isolated Nash equilibrium of G.
b) If x* is globally stable, it is the game's unique Nash equilibrium.

Proposition 2.5 indicates that variationally stable states are isolated (for the proof, see that of Proposition 2.7 below). However, this also means that Nash equilibria of games that admit a concave – but not strictly concave – potential may fail to be stable. To account for such cases, we will also consider the following setwise version of variational stability:
Definition 2.6.
Let X* ⊆ X be closed and nonempty. We say that X* is variationally stable (or simply stable) if there exists a neighborhood U of X* such that

    ⟨v(x), x − x*⟩ ≤ 0 for all x ∈ U, x* ∈ X*,    (VS)

with equality for a given x* ∈ X* if and only if x ∈ X*. In particular, if U can be taken to be all of X, we say that X* is globally variationally stable (or globally stable for short).

Obviously, Definition 2.6 subsumes Definition 2.3: if x* ∈ X is stable in the pointwise sense of Definition 2.3, then it is also stable when viewed as a singleton set. In fact, when this is the case, it is also easy to see that x* cannot belong to some larger variationally stable set, so the notion of variational stability tacitly incorporates a certain degree of maximality. This is made clearer in the following:

[Footnote: In that case, (VS) would give ⟨v(x′), x′ − x*⟩ = 0 for some x′ ≠ x*, a contradiction.]
Suppose that X* ⊆ X is variationally stable. Then:
a) X* is convex.
b) If G is concave, X* is an isolated component of Nash equilibria.
c) If X* is globally stable, it coincides with the game's set of Nash equilibria.

Proof of Proposition 2.7. To show that X* is convex, take x*_0, x*_1 ∈ X* and set x*_λ = (1 − λ)x*_0 + λx*_1 for λ ∈ [0, 1]. Substituting in (VS), we get ⟨v(x*_λ), x*_λ − x*_0⟩ = λ⟨v(x*_λ), x*_1 − x*_0⟩ ≤ 0 and ⟨v(x*_λ), x*_λ − x*_1⟩ = −(1 − λ)⟨v(x*_λ), x*_1 − x*_0⟩ ≤ 0, implying that ⟨v(x*_λ), x*_1 − x*_0⟩ = 0. Writing x*_1 − x*_0 = λ^{−1}(x*_λ − x*_0), we then get ⟨v(x*_λ), x*_λ − x*_0⟩ = 0. By (VS), we must have x*_λ ∈ X* for all λ ∈ [0, 1], implying in turn that X* is convex.

We now proceed to show that X* only consists of Nash equilibria. To that end, assume first that X* is globally stable, pick some x* ∈ X*, and let z_i = x_i − x*_i for some x_i ∈ X_i, i ∈ N. Then, for all τ ∈ (0, 1], we have

    d/dτ u_i(x*_i + τz_i; x*_{−i}) = ⟨v_i(x*_i + τz_i; x*_{−i}), z_i⟩ = (1/τ) ⟨v_i(x*_i + τz_i; x*_{−i}), x*_i + τz_i − x*_i⟩ ≤ 0,    (2.12)

where the last inequality follows from (VS). In turn, this shows that u_i(x*_i; x*_{−i}) ≥ u_i(x*_i + z_i; x*_{−i}) = u_i(x_i; x*_{−i}) for all x_i ∈ X_i, i ∈ N, i.e., x* is a Nash equilibrium. Our claim for locally stable sets then follows by taking τ → 0 above and applying Proposition 2.1.

We are left to show that there are no other Nash equilibria close to X* (locally or globally). To do so, assume first that X* is locally stable and let x′ ∉ X* be a Nash equilibrium lying in a neighborhood U of X* where (VS) holds. By Proposition 2.1, we would have ⟨v(x′), x − x′⟩ ≤ 0 for all x ∈ X. However, since x′ ∉ X*, (VS) implies that ⟨v(x′), x* − x′⟩ > 0 for all x* ∈ X*, a contradiction. We conclude that there are no other equilibria of G in U, i.e., X* is an isolated set of Nash equilibria; the global version of our claim then follows by taking U = X. ∎
2.4. Tests for variational stability. We close this section with a second-derivative criterion that can be used to verify whether (VS) holds. To state it, define the Hessian of a game G as the block matrix H_G(x) = (H_G^{ij}(x))_{i,j ∈ N} with

    H_G^{ij}(x) = ½ ∇_{x_j} ∇_{x_i} u_i(x) + ½ (∇_{x_i} ∇_{x_j} u_j(x))^⊤.    (2.13)

We then have:

Proposition 2.8. If x* is a Nash equilibrium of G and H_G(x*) ≺ 0 on TC(x*), then x* is stable – and hence an isolated Nash equilibrium. In particular, if H_G(x) ≺ 0 on TC(x) for all x ∈ X, x* is globally stable – so it is the unique equilibrium of G.

Remark. The requirement "H_G(x*) ≺ 0 on TC(x*)" above means that z^⊤ H_G(x*) z < 0 for every nonzero tangent vector z ∈ TC(x*).
Proof. Assume first that H_G(x) ≺ 0 on TC(x) for all x ∈ X. By Theorem 6 in Rosen (1965), G satisfies (MC), so our claim follows from Corollary 2.4. For our second claim, if H_G(x*) ≺ 0 on TC(x*) for some Nash equilibrium x* of G, we also have H_G(x) ≺ 0 for all x in a neighborhood U = ∏_{i∈N} U_i of x* in X. By the same theorem in Rosen (1965), we get that (MC) holds locally in U, so the above reasoning shows that x* is the unique equilibrium of the restricted game G|_U ≡ G(N, U, u|_U). Hence, x* is locally stable and isolated in G. ∎

We provide two straightforward applications of Proposition 2.8 below:
Example 2.4. Following Monderer and Shapley (1996), a game G is called a potential game if it admits a potential function f : X → R such that

    u_i(x_i; x_{−i}) − u_i(x′_i; x_{−i}) = f(x_i; x_{−i}) − f(x′_i; x_{−i}) for all x, x′ ∈ X, i ∈ N.    (PF)

Local maximizers of f are Nash equilibria, and the converse also holds if f is concave (Neyman, 1997). By differentiating (PF), it is easy to see that the Hessian of G is just the Hessian of its potential. Hence, if a game admits a concave potential f, the game's Nash set X* = arg max_{x∈X} f(x) is globally stable.

Example 2.5. Consider again the Cournot oligopoly model of Example 2.2. A simple differentiation yields

    H_G^{ij}(x) = ½ ∂²u_i/(∂x_i ∂x_j) + ½ ∂²u_j/(∂x_j ∂x_i) = −b_i δ_ij − ½(b_i + b_j),    (2.14)

where δ_ij = 𝟙{i = j} is the Kronecker delta. This shows that a Cournot oligopoly admits a unique, globally stable equilibrium whenever the RHS of (2.14) is negative-definite. This is always the case if the model is symmetric (b_i = b for all i ∈ N), but not necessarily otherwise. Quantitatively, if the coefficients b_i are independent and identically distributed (i.i.d.) on [0, 1], a Monte Carlo simulation shows that (2.14) is negative-definite with probability between … and … for N ∈ {…, ..., …}.

[Footnote: This is so because, in the symmetric case, the RHS of (2.14) is a circulant matrix with eigenvalues −b and −(N + 1)b.]
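A Monte Carlo experiment along the lines just described is straightforward to set up; the sample size and the range of N below are our own choices:

```python
import numpy as np

# How often is the Cournot Hessian (2.14), H_ij = -b_i*delta_ij - (b_i+b_j)/2,
# negative-definite when b_i ~ U[0, 1] i.i.d.?  (Illustrative sketch.)
rng = np.random.default_rng(0)

def neg_definite_frequency(N, trials=10_000):
    hits = 0
    for _ in range(trials):
        b = rng.uniform(0.0, 1.0, N)
        H = -np.diag(b) - 0.5 * (b[:, None] + b[None, :])  # symmetric
        if np.all(np.linalg.eigvalsh(H) < 0):
            hits += 1
    return hits / trials

for N in range(2, 6):
    print(N, neg_definite_frequency(N))
```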
3. Learning via dual averaging
In this section, we adapt the widely used dual averaging (DA) method of Nesterov (2009) to our game-theoretic setting. Intuitively, the main idea is as follows: At each stage of the process, every player i ∈ N gets an estimate v̂_i of the individual gradient of their payoff function at the current action profile, possibly subject to noise and uncertainty. Subsequently, they take a step along this estimate in the dual space V*_i (where gradients live), and they "mirror" the output back to the primal space X_i in order to choose an action for the next stage and continue playing (for a schematic illustration, see Fig. 2).

[Footnote: In optimization, the roots of the method can be traced back to Nemirovski and Yudin (1983); see also Beck and Teboulle (2003), Nemirovski et al. (2009) and Shalev-Shwartz (2011).]
[Figure 2. Schematic representation of dual averaging.]
Formally, starting with some arbitrary (and possibly uninformed) gradient estimate Y_1 = v̂_1 at n = 1, this scheme can be described via the recursion

    X_{i,n} = Q_i(Y_{i,n}),
    Y_{i,n+1} = Y_{i,n} + γ_n v̂_{i,n+1},    (DA)

where:
1) n denotes the stage of the process.
2) v̂_{i,n+1} ∈ V*_i is an estimate of the individual payoff gradient v_i(X_n) of player i at stage n (more on this below).
3) Y_{i,n} ∈ V*_i is an auxiliary "score" variable that aggregates the i-th player's individual gradient steps.
4) γ_n > 0 is a nonincreasing step-size sequence, typically of the form 1/n^β for some β ∈ (0, 1].
5) Q_i : V*_i → X_i is the choice map that outputs the i-th player's action as a function of their score vector Y_i (see below for a rigorous definition).

In view of the above, the core components of (DA) are a) the players' gradient estimates; and b) the choice maps that determine the players' actions. In the rest of this section, we discuss both in detail.
3.1. Feedback and uncertainty. Regarding the players' individual gradient observations, we assume that each player i ∈ N has access to a "black box" feedback mechanism – an oracle – which returns an estimate of their payoff gradients at their current action profile. Of course, this information may be imperfect for a multitude of reasons: for instance, i) estimates may be susceptible to random measurement errors; ii) the transmission of this information could be subject to noise; and/or iii) the game's payoff functions may be stochastic expectations of the form

    u_i(x) = E[û_i(x; ω)] for some random variable ω,    (3.1)

and players may only be able to observe the realized gradients ∇_{x_i} û_i(x; ω).

With all this in mind, we will focus on the noisy feedback model

    v̂_{i,n+1} = v_i(X_n) + ξ_{i,n+1},    (3.2)

where the noise process ξ_n = (ξ_{i,n})_{i∈N} is an L²-bounded martingale difference sequence adapted to the history (F_n)_{n=1}^∞ of X_n (i.e., ξ_n is F_n-measurable but ξ_{n+1} isn't). More explicitly, this means that ξ_n satisfies the statistical hypotheses:

1. Zero-mean:
    E[ξ_{n+1} | F_n] = 0 for all n = 1, 2, ... (a.s.).    (H1)
2. Finite mean squared error: there exists some σ ≥ 0 such that
    E[‖ξ_{n+1}‖²_* | F_n] ≤ σ² for all n = 1, 2, ... (a.s.).    (H2)

Alternatively, (H1) and (H2) simply posit that the players' individual gradient estimates are conditionally unbiased and bounded in mean square, viz.

    E[v̂_{n+1} | F_n] = v(X_n),    (3.3a)
    E[‖v̂_{n+1}‖²_* | F_n] ≤ V²_* for some finite V_* > 0.    (3.3b)

The above allows for a broad range of error processes, including all compactly supported, (sub-)Gaussian, (sub-)exponential and log-normal distributions. In fact, both hypotheses can be relaxed (for instance, by assuming a small bias or asking for finite moments up to some order q < ∞), but we do not do so to keep things simple.

[Footnote: Indices have been chosen so that all relevant processes are F_n-measurable at stage n.]
[Footnote: In particular, we will not be assuming i.i.d. errors; this point is crucial for applications to distributed control where measurements are typically correlated with the state of the system.]
3.2. Choosing actions. Given that the players' score variables aggregate gradient steps, a reasonable choice for Q_i would be the arg max correspondence y_i ↦ arg max_{x_i ∈ X_i} ⟨y_i, x_i⟩ that outputs those actions which are most closely aligned with y_i. Notwithstanding, there are two problems with this approach: a) this assignment is too aggressive in the presence of uncertainty; and b) generically, the output would be an extreme point of X, so (DA) could never converge to an interior point. Thus, instead of taking a "hard" arg max approach, we will focus on regularized maps of the form

    y_i ↦ arg max_{x_i ∈ X_i} {⟨y_i, x_i⟩ − h_i(x_i)},    (3.4)

where the "regularization" term h_i : X_i → R satisfies the following requirements:

Definition 3.1. Let C be a compact convex subset of a finite-dimensional normed space V. We say that h : C → R is a regularizer (or penalty function) on C if:
(1) h is continuous.
(2) h is strongly convex, i.e., there exists some K > 0 such that

    h(tx + (1 − t)x′) ≤ th(x) + (1 − t)h(x′) − ½ K t(1 − t) ‖x′ − x‖²    (3.5)

for all x, x′ ∈ C and all t ∈ [0, 1].

The choice (or mirror) map Q : V* → C induced by h is then defined as

    Q(y) = arg max{⟨y, x⟩ − h(x) : x ∈ C}.    (3.6)

In what follows, we will be assuming that each player i ∈ N is endowed with an individual penalty function h_i : X_i → R that is K_i-strongly convex. Furthermore, to emphasize the interplay between primal and dual variables (the players' actions x_i and their score vectors y_i respectively), we will write Y_i ≡ V*_i for the dual space of V_i and Q_i : Y_i → X_i for the choice map induced by h_i.
Algorithm 1.
Dual averaging with Euclidean projections (Example 3.1).
Require: step-size sequence γ_n ∝ 1/n^β, β ∈ (0, 1]; initial scores Y_i ∈ Y_i
for n = 1, 2, ... do
    for every player i ∈ N do
        play X_i ← Π_{X_i}(Y_i);        {choose an action}
        observe v̂_i;                    {estimate gradient}
        update Y_i ← Y_i + γ_n v̂_i;     {take gradient step}
    end for
end for

More concisely, this information can be encoded in the aggregate penalty function h(x) = ∑_i h_i(x_i) with associated strong convexity constant K ≡ min_i K_i. The induced choice map is simply Q ≡ (Q_1, ..., Q_N), so we will write x = Q(y) for the action profile induced by the score vector y = (y_1, ..., y_N) ∈ Y ≡ ∏_i Y_i.

Remark. In finite games, McKelvey and Palfrey (1995) referred to Q_i as a "quantal response function" (the notation Q alludes precisely to this terminology). In the same game-theoretic context, the composite map Q_i ∘ v_i is often called a smooth, perturbed, or regularized best response; for a detailed discussion, see Hofbauer and Sandholm (2002) and Mertikopoulos and Sandholm (2016).

We discuss below a few examples of this regularization process:

Example 3.1. Let h(x) = ½‖x‖²₂. Then, h is 1-strongly convex with respect to ‖·‖₂ and the corresponding choice map is the closest point projection

    Π_X(y) ≡ arg max_{x∈X} {⟨y, x⟩ − ½‖x‖²₂} = arg min_{x∈X} ‖y − x‖₂.    (3.7)

The induced learning scheme (cf. Algorithm 1) may thus be viewed as a multi-agent variant of gradient ascent with lazy projections (Zinkevich, 2003). For future reference, note that h is differentiable on X and Π_X is surjective (i.e., im Π_X = X).

Example 3.2. Motivated by mixed-strategy learning in finite games (Example 2.1), let Δ = {x ∈ R^d_+ : ∑_{j=1}^d x_j = 1} denote the unit simplex of R^d. Then, a standard regularizer on Δ is provided by the (negative) Gibbs entropy

    h(x) = ∑_{ℓ=1}^d x_ℓ log x_ℓ.    (3.8)

The entropic regularizer (3.8) is 1-strongly convex with respect to the L¹-norm on R^d. Moreover, a straightforward calculation shows that the induced choice map is

    Λ(y) = (exp(y_1), ..., exp(y_d)) / ∑_{ℓ=1}^d exp(y_ℓ).    (3.9)

This model is known as logit choice and the associated learning scheme has been studied extensively in evolutionary game theory and online learning; for a detailed account, see Vovk (1990), Littlestone and Warmuth (1994), Laraki and Mertikopoulos (2013), and references therein. In contrast to the previous example, h is differentiable only on the relative interior Δ° of Δ and im Λ = Δ° (i.e., Λ is "essentially" surjective).

[Footnote: We assume here that V ≡ ∏_i V_i is endowed with the product norm ‖x‖²_V = ∑_i ‖x_i‖²_{V_i}.]
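The following sketch assembles the pieces so far: the Euclidean choice map of Example 3.1 (on a box, the closest-point projection is coordinate-wise clipping), the logit map (3.9) of Example 3.2, and the recursion (DA) with perfect feedback, run on the illustrative Cournot instance used earlier:

```python
import numpy as np

a, b = 10.0, np.array([1.0, 1.2, 0.8])
c, C = np.array([2.0, 1.5, 2.5]), 4.0

def v(x):                          # individual payoff gradients (Eq. 2.1)
    return a - np.dot(b, x) - b * x - c

def proj(y):                       # Euclidean choice map on X_i = [0, C]
    return np.clip(y, 0.0, C)

def logit(y):                      # entropic choice map (3.9) on a simplex
    z = np.exp(y - y.max())        # max-shift for numerical stability
    return z / z.sum()

y = np.zeros(3)                    # initial scores Y_1
for n in range(1, 2001):
    x = proj(y)                    # X_n = Q(Y_n)
    y = y + v(x) / np.sqrt(n)      # Y_{n+1} = Y_n + gamma_n * v(X_n)
x = proj(y)
print(x)                           # approximate Nash equilibrium
print(logit(np.array([1.0, 0.0, -1.0])))   # logit choice, for comparison
```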
3.3. Surjectivity vs. steepness. We close this section with an important link between the boundary behavior of penalty functions and the surjectivity of the induced choice maps. To describe it, it will be convenient to treat h as an extended-real-valued function h : V → R ∪ {∞} by setting h = ∞ outside X. The subdifferential of h at x ∈ V is then defined as

    ∂h(x) = {y ∈ V* : h(x′) ≥ h(x) + ⟨y, x′ − x⟩ for all x′ ∈ V},    (3.10)

and h is called subdifferentiable at x ∈ X whenever ∂h(x) is nonempty. This is always the case if x ∈ X°, so X° ⊆ dom ∂h ≡ {x ∈ X : ∂h(x) ≠ ∅} ⊆ X (Rockafellar, 1970, Chap. 26).

Intuitively, h fails to be subdifferentiable at a boundary point x ∈ bd(X) only if it becomes "infinitely steep" near x. We thus say that h is steep at x whenever x ∉ dom ∂h; otherwise, h is said to be nonsteep at x. The following proposition shows that regularizers that are everywhere nonsteep (as in Example 3.1) induce choice maps that are surjective; on the other hand, regularizers that are everywhere steep (cf. Example 3.2) induce choice maps that are interior-valued:
Proposition 3.2. Let h be a K-strongly convex regularizer with induced choice map Q : Y → X, and let h* : Y → R be the convex conjugate of h, i.e.,

    h*(y) = max{⟨y, x⟩ − h(x) : x ∈ X},  y ∈ Y.    (3.11)

Then:
a) x = Q(y) if and only if y ∈ ∂h(x); in particular, im Q = dom ∂h.
b) h* is differentiable on Y and ∇h*(y) = Q(y) for all y ∈ Y.
c) Q is (1/K)-Lipschitz continuous.

Proposition 3.2 is essentially folklore in optimization and convex analysis; for a proof, see Rockafellar (1970, Theorem 23.5) and Rockafellar and Wets (1998, Theorem 12.60(b)).
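Part b) of Proposition 3.2 is easy to verify numerically for the entropic regularizer of Example 3.2, whose convex conjugate is h*(y) = log ∑_ℓ exp(y_ℓ): a central-difference gradient of h* should match the logit map (3.9). A minimal sketch:

```python
import numpy as np

def h_star(y):
    # Convex conjugate of the entropic regularizer on the simplex.
    return np.log(np.sum(np.exp(y)))

def logit(y):
    z = np.exp(y - y.max())
    return z / z.sum()

y = np.array([0.2, -1.0, 0.5])
eps = 1e-6
numeric_grad = np.array([
    (h_star(y + eps * e) - h_star(y - eps * e)) / (2 * eps)
    for e in np.eye(len(y))
])
print(numeric_grad)   # central-difference gradient of h*
print(logit(y))       # Q(y); agrees with the line above (Prop. 3.2b)
```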
4. Convergence analysis
A key property of (DA) in concave games is that it leads to no regret, viz.

    max_{x_i ∈ X_i} ∑_{k=1}^n [u_i(x_i; X_{−i,k}) − u_i(X_k)] = o(n) for all i ∈ N,    (4.1)

provided that the algorithm's step-size is chosen appropriately – for a precise statement, see Xiao (2010) and Shalev-Shwartz (2011). As such, under (DA), every player's average payoff matches asymptotically that of the best fixed action in hindsight (though, of course, this does not take into account changes to other players' actions due to a change in a given player's chosen action).

In this section, we expand on this worst-case guarantee and we derive some general convergence results for the actual sequence of play induced by (DA). Specifically, in Section 4.1 we show that if (DA) converges to some action profile, this limit is a Nash equilibrium. Subsequently, to obtain stronger convergence results, we introduce in Section 4.2 the so-called Fenchel coupling, a "primal-dual" divergence measure between the players' (primal) action variables x_i ∈ X_i and their (dual) score vectors y_i ∈ Y_i. Using this coupling as a Lyapunov function, we show in Sections 4.3 and 4.4 that globally (resp. locally) stable states are globally (resp. locally) attracting under (DA). Finally, in Section 4.5, we examine the convergence properties of (DA) in zero-sum concave-convex games.
4.1. Limit states. We first show that if the sequence of play induced by (DA) converges to some x* ∈ X with positive probability, this limit is a Nash equilibrium:

Theorem 4.1.
Suppose that (DA) is run with imperfect gradient information satisfying (H1)–(H2) and a step-size sequence γ_n such that

    ∑_{n=1}^∞ (γ_n/τ_n)² < ∞ and ∑_{n=1}^∞ γ_n = ∞,    (4.2)

where τ_n = ∑_{k=1}^n γ_k. If the game is (pseudo-)concave and X_n converges to x* ∈ X with positive probability, then x* is a Nash equilibrium.

Remark. Note here that the requirement (4.2) holds for every step-size policy of the form γ_n ∝ 1/n^β, β ≤ 1 (i.e., even for increasing γ_n).

Proof of Theorem 4.1.
Let v* = v(x*) and assume ad absurdum that x* is not a Nash equilibrium. By the characterization (2.7) of Nash equilibria, there exists a player i ∈ N and a deviation q_i ∈ X_i such that ⟨v*_i, q_i − x*_i⟩ > 0. Thus, by continuity, there exist some c > 0 and neighborhoods U, V of x* and v* respectively, such that

    ⟨v′_i, q_i − x′_i⟩ ≥ c    (4.3)

whenever x′ ∈ U and v′ ∈ V.

Now, let Ω be the event that X_n converges to x*, so P(Ω) > 0 by assumption. Within Ω, we may assume for simplicity that X_n ∈ U and v(X_n) ∈ V for all n, so (DA) yields

    Y_{n+1} = Y_1 + ∑_{k=1}^n γ_k v̂_{k+1} = Y_1 + ∑_{k=1}^n γ_k [v(X_k) + ξ_{k+1}] = Y_1 + τ_n v̄_{n+1},    (4.4)

where we set v̄_{n+1} = τ_n^{−1} ∑_{k=1}^n γ_k v̂_{k+1} = τ_n^{−1} ∑_{k=1}^n γ_k [v(X_k) + ξ_{k+1}].

We now claim that P(v̄_n → v* | Ω) = 1. Indeed, by (4.2) and (H2), we have

    ∑_{n=1}^∞ τ_n^{−2} E[‖γ_n ξ_{n+1}‖²_* | F_n] ≤ ∑_{n=1}^∞ (γ_n/τ_n)² σ² < ∞.    (4.5)

Therefore, by the law of large numbers for martingale difference sequences (Hall and Heyde, 1980, Theorem 2.18), we obtain τ_n^{−1} ∑_{k=1}^n γ_k ξ_{k+1} → 0 (a.s.). Given that v(X_n) → v* in Ω and P(Ω) > 0, we infer that P(v̄_n → v* | Ω) = 1, as claimed.

Now, with Y_{i,n} ∈ ∂h_i(X_{i,n}) by Proposition 3.2, we also have

    h_i(q_i) − h_i(X_{i,n}) ≥ ⟨Y_{i,n}, q_i − X_{i,n}⟩ = ⟨Y_{i,1}, q_i − X_{i,n}⟩ + τ_{n−1} ⟨v̄_{i,n}, q_i − X_{i,n}⟩.    (4.6)

Since v̄_n → v* almost surely on Ω, (4.3) yields ⟨v̄_{i,n}, q_i − X_{i,n}⟩ ≥ c > 0 for all sufficiently large n. However, given that |⟨Y_{i,1}, q_i − X_{i,n}⟩| ≤ ‖Y_{i,1}‖_* ‖q_i − X_{i,n}‖ ≤ ‖Y_{i,1}‖_* ‖X‖ = O(1), a simple substitution in (4.6) yields h_i(q_i) − h_i(X_{i,n}) ≳ cτ_{n−1} → ∞ with positive probability – a contradiction, since h_i is continuous and hence bounded on the compact set X_i. We conclude that x* is a Nash equilibrium of G, as claimed. ∎
4.2. The Fenchel coupling. A key tool in establishing the convergence properties of (DA) is the so-called Bregman divergence D(p, x) between a given base point p ∈ X and a test state x ∈ X. Following Kiwiel (1997), D(p, x) is defined as the difference between h(p) and the best linear approximation of h(p) from x, viz.

    D(p, x) = h(p) − h(x) − h′(x; p − x),    (4.7)

where h′(x; z) = lim_{t→0⁺} t^{−1} [h(x + tz) − h(x)] denotes the one-sided derivative of h at x along z ∈ TC(x). Owing to the (strict) convexity of h, we have D(p, x) ≥ 0 and X_n → p whenever D(p, X_n) → 0 (Kiwiel, 1997). Accordingly, the convergence of a sequence X_n to a target point p can be checked directly by means of the associated divergence D(p, X_n).

Nevertheless, it is often impossible to glean any useful information on D(p, X_n) from (DA) when X_n = Q(Y_n) is not interior. Instead, given that (DA) mixes primal and dual variables (actions and scores respectively), it will be more convenient to use the following "primal-dual" divergence between dual vectors y ∈ Y and base points p ∈ X:

Definition 4.2.
Let h : X → R be a penalty function on X. Then, the Fenchel coupling induced by h is defined as

    F(p, y) = h(p) + h*(y) − ⟨y, p⟩ for all p ∈ X, y ∈ Y.    (4.8)

The terminology "Fenchel coupling" is due to Mertikopoulos and Sandholm (2016) and refers to the fact that (4.8) collects all terms of Fenchel's inequality. As a result, F(p, y) is nonnegative and strictly convex in both arguments (though not jointly so). Moreover, it enjoys the following key properties:

Proposition 4.3.
Let h be a K-strongly convex penalty function on X. Then, for all p ∈ X and all y, y′ ∈ Y, we have:

a) F(p, y) = D(p, Q(y)) if Q(y) ∈ X° (but not necessarily otherwise).    (4.9a)
b) F(p, y) ≥ ½ K ‖Q(y) − p‖².    (4.9b)
c) F(p, y′) ≤ F(p, y) + ⟨y′ − y, Q(y) − p⟩ + (1/(2K)) ‖y′ − y‖²_*.    (4.9c)

Proposition 4.3 (proven in Appendix A) justifies the terminology "primal-dual divergence" and plays a key role in our analysis. Specifically, given a sequence Y_n in Y, (4.9b) yields Q(Y_n) → p whenever F(p, Y_n) → 0, meaning that F(p, Y_n) can be used to test the convergence of Q(Y_n) to p.

For technical reasons, it is convenient to also assume the converse, namely that

    F(p, Y_n) → 0 whenever Q(Y_n) → p.    (H3)

When h is steep, we have F(p, y) = D(p, Q(y)) for all y ∈ Y, so (H3) boils down to the requirement

    D(p, X_n) → 0 whenever X_n → p.    (4.10)

This so-called "reciprocity condition" is well known in the theory of Bregman functions (Chen and Teboulle, 1993; Kiwiel, 1997; Alvarez et al., 2004): essentially, it means that the sublevel sets of D(p, ·) are neighborhoods of p in X. Hypothesis (H3) posits that the images of the sublevel sets of F(p, ·) under Q are neighborhoods of p in X, so it may be seen as a "primal-dual" variant of Bregman reciprocity. Under this light, it is easy to check that Examples 3.1 and 3.2 both satisfy (H3). Obviously, when (H3) holds, Proposition 4.3 gives:
Corollary 4.4.
Under (H3), F(p, Y_n) → 0 if and only if Q(Y_n) → p.

To extend the above to subsets of X, we further define the setwise coupling

    F(C, y) = inf{F(p, y) : p ∈ C},  C ⊆ X, y ∈ Y.    (4.11)

In analogy to the pointwise case, we then have:

Proposition 4.5.
Let C be a closed subset of X. Then, Q(Y_n) → C whenever F(C, Y_n) → 0; in addition, if (H3) holds, the converse is also true.

The proof of Proposition 4.5 is a straightforward exercise in point-set topology, so we omit it. What's more important is that, thanks to Proposition 4.5, the Fenchel coupling can also be used to test for convergence to a set; in what follows, we employ this property freely.
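For the entropic regularizer of Example 3.2, the Fenchel coupling (4.8) works out to the Kullback–Leibler divergence F(p, y) = D_KL(p ‖ Q(y)), and (4.9b) holds with K = 1 for the L¹-norm (Pinsker's inequality). The sketch below checks the bound on random instances:

```python
import numpy as np

def h(x):
    return np.sum(x * np.log(x))          # negative Gibbs entropy (3.8)

def h_star(y):
    return np.log(np.sum(np.exp(y)))      # its convex conjugate (3.11)

def Q(y):
    z = np.exp(y - y.max())
    return z / z.sum()

def F(p, y):
    return h(p) + h_star(y) - np.dot(y, p)    # Fenchel coupling (4.8)

rng = np.random.default_rng(1)
for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    y = rng.normal(size=4)
    lhs = F(p, y)                                 # = KL(p || Q(y)) here
    rhs = 0.5 * np.linalg.norm(Q(y) - p, 1)**2    # (K/2)||Q(y) - p||^2, K = 1
    print(lhs >= rhs, round(lhs, 4), round(rhs, 4))
```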
4.3. Global convergence.
In this section, we focus on globally stable Nash equilibria (and sets thereof). We begin with the perfect feedback case:
Theorem 4.6.
Suppose that (DA) is run with perfect feedback (σ = 0), choice maps satisfying (H3), and a step-size γ_n such that ∑_{k=1}^n γ_k² / ∑_{k=1}^n γ_k → 0. If the set X* of the game's Nash equilibria is globally stable, X_n converges to X*.

Proof. Let X* be the game's set of Nash equilibria, fix some arbitrary ε > 0, and let U_ε = {x = Q(y) : F(X*, y) < ε}. Then, by Proposition 4.5, it suffices to show that X_n ∈ U_ε for all sufficiently large n.

To that end, for all x* ∈ X*, Proposition 4.3 yields

    F(x*, Y_{n+1}) ≤ F(x*, Y_n) + γ_n ⟨v(X_n), X_n − x*⟩ + (γ_n²/(2K)) ‖v(X_n)‖²_*.    (4.12)

To proceed, assume inductively that X_n ∈ U_ε. By (H3), there exists some δ > 0 such that cl(U_{ε/2}) contains a δ-neighborhood of X*. Consequently, with X* globally stable, there exists some c ≡ c(ε) > 0 such that

    ⟨v(x), x − x*⟩ ≤ −c for all x ∈ U_ε − U_{ε/2}, x* ∈ X*.    (4.13)

If X_n ∈ U_ε − U_{ε/2} and γ_n ≤ cK/V²_*, (4.12) yields F(x*, Y_{n+1}) ≤ F(x*, Y_n). Hence, minimizing over x* ∈ X*, we get F(X*, Y_{n+1}) ≤ F(X*, Y_n) < ε, so X_{n+1} = Q(Y_{n+1}) ∈ U_ε. Otherwise, if X_n ∈ U_{ε/2} and γ_n ≤ √(εK)/V_*, combining (VS) with (4.12) yields F(x*, Y_{n+1}) ≤ F(x*, Y_n) + ε/2 so, again, F(X*, Y_{n+1}) ≤ F(X*, Y_n) + ε/2 ≤ ε, i.e., X_{n+1} ∈ U_ε. We thus conclude that X_{n+1} ∈ U_ε whenever X_n ∈ U_ε and γ_n ≤ min{cK/V²_*, √(εK)/V_*}.

To complete the proof, Lemma A.3 shows that X_n visits U_ε infinitely often under the stated assumptions. Since γ_n → 0, our assertion follows. ∎

We next show that Theorem 4.6 extends to the case of imperfect feedback under the additional regularity requirement:

    The gradient field v(x) is Lipschitz continuous.    (H4)

With this extra assumption, we have:

[Footnote: Indeed, if this were not the case, there would exist a sequence Y′_n in Y such that Q(Y′_n) → X* but F(X*, Y′_n) ≥ ε/2, in contradiction to (H3).]
[Footnote: Since σ = 0, we can take here V_* = max_{x∈X} ‖v(x)‖_*.]

Table 2. Overview of the various regularity hypotheses used in the paper.

            Hypothesis                  Precise statement
    (H1)    Zero-mean errors            E[ξ_{n+1} | F_n] = 0
    (H2)    Finite error variance       E[‖ξ_{n+1}‖²_* | F_n] ≤ σ²
    (H3)    Bregman reciprocity         F(p, y_n) → 0 whenever Q(y_n) → p
    (H4)    Lipschitz gradients         v(x) is Lipschitz continuous
Theorem 4.7.
Suppose that (DA) is run with a step-size sequence γ_n such that ∑_{n=1}^∞ γ_n² < ∞ and ∑_{n=1}^∞ γ_n = ∞. If (H1)–(H4) hold and the set X* of the game's Nash equilibria is globally stable, X_n converges to X* (a.s.).

Corollary 4.8. If G satisfies (MC), X_n converges to the (necessarily unique) Nash equilibrium of G (a.s.).

Corollary 4.9. If G admits a concave potential, X_n converges to the set of Nash equilibria of G (a.s.).

Because of the noise affecting the players' gradient estimates, our proof strategy for Theorem 4.7 is quite different from that of Theorem 4.6. In particular, instead of working directly in discrete time, we start with the continuous-time system

    ẏ = v(x),
    x = Q(y),    (DA-c)

which can be seen as a "mean-field" approximation of the recursive scheme (DA). As we show in Appendix A, the orbits x(t) = Q(y(t)) of (DA-c) converge to X* in a certain, "uniform" way. Moreover, under the assumptions of Theorem 4.7, the sequence Y_n generated by the discrete-time, stochastic process (DA) comprises an asymptotic pseudotrajectory (APT) of the dynamics (DA-c), i.e., Y_n asymptotically tracks the flow of (DA-c) with arbitrary accuracy over windows of arbitrary length (Benaïm, 1999). APTs have the key property that, in the presence of a global attractor, they cannot stray too far from the flow of (DA-c); however, given that Q may fail to be invertible, the trajectories x(t) = Q(y(t)) do not constitute a semiflow, so it is not possible to leverage the general stochastic approximation theory of Benaïm (1999). To overcome this difficulty, we exploit the derived convergence bound for x(t) = Q(y(t)), and we then use an inductive shadowing argument to show that (DA) itself converges to X*.
Fix some ε > , let U ε = { x = Q ( y ) : F ( X ∗ , y ) < ε } , andwrite Φ t : Y → Y for the semiflow induced by (DA-c) on Y – i.e. (Φ t ( y )) t ≥ is thesolution orbit of (DA-c) that starts at y ∈ Y . We first claim there exists some finite τ ≡ τ ( ε ) such that F ( X ∗ , Φ τ ( y )) ≤ max { ε, F ( X ∗ , y ) − ε } for all y ∈ Y . Indeed, since cl( U ε ) is a closed neighborhoodof X ∗ by (H3), (VS) implies that there exists some c ≡ c ( ε ) > such that h v ( x ) , x − x ∗ i ≤ − c for all x ∗ ∈ X ∗ , x / ∈ U ε . (4.14) For a precise definition, see (4.16) below. That such a trajectory exists and is unique is a consequence of (H4).
EARNING IN GAMES WITH CONTINUOUS ACTION SETS 19
Consequently, if τ_y = inf{t > 0 : Q(Φ_t(y)) ∈ U_ε} denotes the first time at which an orbit of (DA-c) reaches U_ε, Lemma A.2 in Appendix A gives:

    F(x*, Φ_t(y)) ≤ F(x*, y) − ct for all x* ∈ X*, t ≤ τ_y.    (4.15)

In view of this, set τ = ε/c and consider the following two cases:
(1) τ_y ≥ τ: then, (4.15) gives F(x*, Φ_τ(y)) ≤ F(x*, y) − ε for all x* ∈ X*, so F(X*, Φ_τ(y)) ≤ F(X*, y) − ε.
(2) τ_y < τ: then, Q(Φ_τ(y)) ∈ U_ε, so F(X*, Φ_τ(y)) ≤ ε.
In both cases we have F(X*, Φ_τ(y)) ≤ max{ε, F(X*, y) − ε}, as claimed.

Now, let (Y(t))_{t≥0} denote the affine interpolation of the sequence Y_n generated by (DA), i.e., Y is the continuous curve which joins the values Y_n at all times τ_n = ∑_{k=1}^n γ_k. Under the stated assumptions, a standard result of Benaïm (1999, Propositions 4.1 and 4.2) shows that Y(t) is an asymptotic pseudotrajectory of Φ, i.e.,

    lim_{t→∞} sup_{0≤h≤T} ‖Y(t + h) − Φ_h(Y(t))‖_* = 0 for all T > 0 (a.s.).    (4.16)

Thus, with some hindsight, let δ ≡ δ(ε) be such that δ‖X‖ + δ²/(2K) ≤ ε and choose t₀ ≡ t₀(ε) so that sup_{0≤h≤τ} ‖Y(t + h) − Φ_h(Y(t))‖_* ≤ δ for all t ≥ t₀. Then, for all t ≥ t₀ and all x* ∈ X*, Proposition 4.3 gives

    F(x*, Y(t + h)) ≤ F(x*, Φ_h(Y(t))) + ⟨Y(t + h) − Φ_h(Y(t)), Q(Φ_h(Y(t))) − x*⟩ + (1/(2K)) ‖Y(t + h) − Φ_h(Y(t))‖²_*
                   ≤ F(x*, Φ_h(Y(t))) + δ‖X‖ + δ²/(2K)
                   ≤ F(x*, Φ_h(Y(t))) + ε.    (4.17)

Hence, minimizing over x* ∈ X*, we get

    F(X*, Y(t + h)) ≤ F(X*, Φ_h(Y(t))) + ε for all t ≥ t₀.    (4.18)

By Lemma A.3, there exists some t₁ ≥ t₀ such that F(X*, Y(t₁)) ≤ 2ε (a.s.). Thus, given that F(X*, Φ_h(Y(t₁))) is nonincreasing in h by Lemma A.2, Eq. (4.18) yields F(X*, Y(t₁ + h)) ≤ 2ε + ε = 3ε for all h ∈ [0, τ]. However, by the definition of τ, we also have F(X*, Φ_τ(Y(t₁))) ≤ max{ε, F(X*, Y(t₁)) − ε} ≤ ε, implying in turn that F(X*, Y(t₁ + τ)) ≤ F(X*, Φ_τ(Y(t₁))) + ε ≤ 2ε. Therefore, by repeating the above argument at t₁ + τ and proceeding inductively, we get F(X*, Y(t₁ + h)) ≤ 3ε for all h ∈ [kτ, (k + 1)τ], k = 1, 2, ... (a.s.). Since ε has been chosen arbitrarily, we conclude that F(X*, Y_n) → 0, so X_n → X* by Proposition 4.5. ∎
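To see Theorem 4.7 at work numerically, the sketch below runs (DA) with zero-mean Gaussian gradient noise on the Cournot instance used earlier – which one can check is strictly monotone, since its Hessian (2.14) is negative-definite – with γ_n = 1/n, so that ∑ γ_n = ∞ and ∑ γ_n² < ∞. All numeric choices are illustrative:

```python
import numpy as np

a, b = 10.0, np.array([1.0, 1.2, 0.8])
c, C = np.array([2.0, 1.5, 2.5]), 8.0
rng = np.random.default_rng(3)

def v(x):
    return a - np.dot(b, x) - b * x - c

# Interior equilibrium: v(x*) = 0, i.e. (diag(b) + 1 b^T) x* = a - c.
x_star = np.linalg.solve(np.diag(b) + np.outer(np.ones(3), b), a - c)

y = np.zeros(3)
for n in range(1, 100_001):
    x = np.clip(y, 0.0, C)                    # X_n = Q(Y_n), lazy projection
    v_hat = v(x) + rng.normal(0.0, 1.0, 3)    # noisy oracle: (H1)-(H2) hold
    y += v_hat / n                            # gamma_n = 1/n
print(np.linalg.norm(x - x_star))             # distance to equilibrium: small
```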
Remark. In the above, the Lipschitz continuity assumption (H4) is used to show that the sequence X_n comprises an APT of the continuous-time dynamics (DA-c). Since any continuous function on a compact set is uniformly continuous, the proof of Proposition 4.1 in Benaïm (1999, p. 14) shows that (H4) can be dropped altogether if (DA-c) is well-posed (which, in turn, holds if v(x) is only locally Lipschitz). Albeit less general, Lipschitz continuity is more straightforward as an assumption, so we do not go into the details of this relaxation.
We should also note that several classic convergence results for dual averagingand mirror descent do not require Lipschitz continuity at all (see e.g. Nesterov,2009, and Nemirovski et al., 2009). The reason for this is that these results focuson the convergence of the averaged sequence ¯ X n = P nk =1 γ k X k (cid:14) P nk =1 γ k , whereasthe figure of merit here is the actual sequence of play X n . The latter sequenceis more sensitive to noise, hence the need for additional regularity; in our ergodicanalysis later in the paper, (H4) is not invoked. Remark . Theorem 4.7 shows that (DA) converges to equilibrium, but thesummability requirement P ∞ n =1 γ n < ∞ suggests that players must be more conser-vative under uncertainty. To make this more precise, note that the step-size assump-tions of Theorem 4.6 are satisfied for all step-size policies of the form γ n ∝ /n β , β ∈ (0 , ; however, in the presence of errors and uncertainty, Theorem 4.7 guaran-tees convergence only when β ∈ (1 / , .The “critical” value β = 1 / is tied to the finite mean squared error hypothesis(H2). If the players’ gradient observations have finite moments up to some order q > , a more refined stochastic approximation argument can be used to show thatTheorem 4.7 still holds under the lighter requirement P ∞ n =1 γ q/ n < ∞ . Thus,even in the presence of noise, it is possible to employ (DA) with any step-sizesequence of the form γ n ∝ /n β , β ∈ (0 , , provided that the noise process ξ n has E [ k ξ n +1 k ∗ q | F n ] < ∞ for some q > /β − . In particular, if the noise affectingthe players’ observations has finite moments of all orders (for instance, if ξ n is sub-exponential or sub-Gaussian), it is possible to recover essentially all the step-sizepolicies covered by Theorem 4.6.4.4. Local convergence.
The results of the previous section show that (DA) con-verges globally to states (or sets) that are globally stable, even under noise anduncertainty. In this section, we show that (DA) remains locally convergent tostates that are only locally stable with probability arbitrarily close to .For simplicity, we begin with the deterministic, perfect feedback case: Theorem 4.10.
Suppose that (DA) is run with perfect feedback ( σ = 0 ) , choice mapssatisfying (H3) , and a sufficiently small step-size with P nk =1 γ k (cid:14) P nk =1 γ k → . If X ∗ is a stable set of Nash equilibria, there exists a neighborhood U of X ∗ such that X n converges to X ∗ whenever X ∈ U .Proof. As in the proof of Theorem 4.6, let U ε = { x = Q ( y ) : F ( X ∗ , y ) < ε } . Since X ∗ is stable, there exists some ε > and some c > satisfying (4.13) and suchthat (VS) holds throughout U ε . If X ∈ U ε and γ ≤ min { cK/V ∗ , √ εK/V ∗ } ,the same induction argument as in the proof of Theorem 4.6 shows that X n ∈ U ε for all n . Since (VS) holds throughout U ε , Lemma A.3 shows that X n visits anyneighborhood of X ∗ infinitely many times. Thus, by the same argument as in theproof of Theorem 4.6, we get X n → X ∗ . (cid:4) The key idea in the proof of Theorem 4.10 is that if the step-size of (DA) issmall enough, X n = Q ( Y n ) always remains within the “basin of attraction” of X ∗ ;hence, local convergence can be obtained in the same way as global convergence fora game with smaller action spaces. However, if the players’ feedback is subject toestimation errors and uncertainty, a single unlucky instance could drive X n awayfrom said basin, possibly never to return. Consequently, any local convergenceresult in the presence of noise is necessarily probabilistic in nature. EARNING IN GAMES WITH CONTINUOUS ACTION SETS 21
Conditioning on the event that X n stays close to X ∗ , local convergence can beobtained as in the proof of Theorem 4.7. Nevertheless, showing that this eventoccurs with controllably high probability requires a completely different analysis.This is the essence of our next result: Theorem 4.11.
Fix a confidence level δ > and suppose that (DA) is run with asufficiently small step-size γ n satisfying P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . If X ∗ is stable and (H1)–(H4) hold, then X ∗ is locally attracting with probability at least − δ ; more precisely, there exists a neighborhood U of X ∗ such that P ( X n → X ∗ | X ∈ U ) ≥ − δ. (4.19) Corollary 4.12.
Let x ∗ be a Nash equilibrium with negative-definite Hessian ma-trix H G ( x ∗ ) ≺ . Then, with assumptions as above, x ∗ is locally attracting withprobability arbitrarily close to .Proof of Theorem 4.11. Let U ε = { x = Q ( y ) : F ( X ∗ , y ) < ε } and pick ε > smallenough so that (VS) holds for all x ∈ U ε . Assume further that X ∈ U ε so thereexists some x ∗ ∈ X ∗ such that F ( x ∗ , Y ) < ε . Then, for all n , Proposition 4.3 yields F ( x ∗ , Y n +1 ) ≤ F ( x ∗ , Y n ) + γ n h v ( X n ) , X n − x ∗ i + γ n ψ n +1 + γ n K k ˆ v n +1 k ∗ , (4.20)where we have set ψ n +1 = h ξ n +1 , X n − x ∗ i .We first claim that sup n P nk =1 γ k ψ k +1 ≤ ε with probability at least − δ/ if γ n is chosen appropriately. Indeed, set S n +1 = P nk =1 γ k ψ k +1 and let E n,ε denote theevent { sup ≤ k ≤ n +1 | S k | ≥ ε } . Since S n is a martingale, Doob’s maximal inequality(Hall and Heyde, 1980, Theorem 2.1) yields P ( E n +1 ,ε ) ≤ E [ | S n +1 | ] ε ≤ σ kX k P nk =1 γ k ε , (4.21)where we used the variance estimate E [ ψ k +1 ] = E [ E [ |h ξ k +1 , X k − x ∗ i| | F k ]] ≤ E [ E [ k ξ k +1 k ∗ k X k − x ∗ k | F k ]] ≤ σ kX k , (4.22)and the fact that E [ ψ k +1 ψ ℓ +1 ] = E [ E [ ψ k +1 ψ ℓ +1 ] | F k ∨ ℓ ] = 0 whenever k = ℓ . Since E n +1 ,ε ⊇ E n,ε ⊇ . . . , the event E ε = S ∞ n =1 E n,ε occurs with probability P ( E ε ) ≤ Γ σ kX k /ε , where Γ ≡ P ∞ n =1 γ n . Thus, if Γ ≤ δε / (2 σ kX k ) , we get P ( E ε ) ≤ δ/ .We now claim that the process R n +1 = (2 K ) − P nk =1 γ k k ˆ v k +1 k ∗ is also boundedfrom above by ε with probability at least − δ/ if γ n is chosen appropriately.Indeed, working as above, let F n,ε denote the event { sup ≤ k ≤ n +1 R k ≥ ε } . Since R n is a nonnegative submartingale, Doob’s maximal inequality again yields P ( F n +1 ,ε ) ≤ E [ R n +1 ] ε ≤ V ∗ P nk =1 γ k Kε . (4.23)Consequently, the event F ε = S ∞ n =1 F n,ε occurs with probability P ( F ε ) ≤ Γ V ∗ /ε ≤ δ/ if γ n is chosen so that Γ ≤ Kδε/V ∗ .Assume therefore that Γ ≤ min { δε / (2 σ kX k ) , Kδε/V ∗ } . The above showsthat P ( ¯ E ε ∩ ¯ F ε ) = 1 − P ( E ε ∪ F ε ) ≥ − δ/ − δ/ − δ , i.e. S n and R n are bothbounded from above by ε for all n and all x ∗ with probability at least − δ . Since F ( x ∗ , Y ) ≤ ε by assumption, we readily get F ( x ∗ , Y ) ≤ ε if ¯ E ε and ¯ F ε both hold.Furthermore, telescoping (4.20) yields F ( x ∗ , Y n +1 ) ≤ F ( x ∗ , Y ) + n X k =1 h v ( X k ) , X k − x ∗ i + S n +1 + R n +1 for all n, (4.24)so if we assume inductively that F ( x ∗ , Y k ) ≤ ε for all k ≤ n (implying that h v ( X k ) , X k − x ∗ i ≤ for all k ≤ n ), we also get F ( x ∗ , Y n +1 ) ≤ ε if neither E ε nor F ε occur. Since P ( E ε ∪ F ε ) ≤ δ , we conclude that X n stays in U ε for all n withprobability at least − δ . In turn, when this is the case, Lemma A.3 shows that X ∗ is recurrent under X n . Hence, by repeating the same steps as in the proof ofTheorem 4.7, we get X n → X ∗ with probability at least − δ , as claimed. (cid:4) Convergence in zero-sum concave games.
We close this section by examiningthe asymptotic behavior of (DA) in -player, concave-convex zero-sum games. Todo so, let N = { A, B } denote the set of players with corresponding payoff functions u A = − u B respectively concave in x A and x B . Letting u ≡ u A = − u B , the value of the game is defined as u ∗ = max x A ∈X A min x B ∈X B u ( x A , x B ) = min x B ∈X B max x A ∈X A u ( x A , x B ) . (4.25)The solutions of the concave-convex saddle-point problem (4.25) are the Nash equi-libria of G and the players’ equilibrium payoffs are u ∗ and − u ∗ respectively.In the “perfect feedback” case ( σ = 0 ), Nesterov (2009) showed that the ergodicaverage ¯ X n = P nk =1 γ k X k P nk =1 γ k (4.26)of the sequence of play generated by (DA) converges to equilibrium. With imperfectfeedback and steep h , Nemirovski et al. (2009) further showed that ¯ X n convergesin expectation to the game’s set of Nash equilibria, provided that (H1) and (H2)hold. Our next result provides an almost sure version of this result which is alsovalid for nonsteep h : Theorem 4.13.
Let G be a concave -player zero-sum game. If (DA) is run withimperfect feedback satisfying (H1)–(H2) and a step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ , the ergodic average ¯ X n of X n converges to the set of Nashequilibria of G ( a.s. ) .Proof of Theorem 4.13. Consider the gap function ǫ ( x ) = u ∗ − min p B ∈X B u ( x A , p B ) + max p A ∈X A u ( p A , x B ) − u ∗ = max p ∈X X i ∈N u i ( p i ; x − i ) . (4.27)Obviously, ǫ ( x ) ≥ with equality if and only if x is a Nash equilibrium, so it sufficesto show that ǫ ( ¯ X n ) → (a.s.).To do so, pick some p ∈ X . Then, as in the proof of Theorem 4.7, we have F ( p, Y n +1 ) ≤ F ( p, Y n ) + γ n h v ( X n ) , X n − p i + γ n ψ n +1 + 12 K γ n k ˆ v n +1 k ∗ . (4.28) When h is steep, the mirror descent algorithm examined by Nemirovski et al. (2009) is aspecial case of the dual averaging method of Nesterov (2009). This is no longer the case if h is notsteep, so the analysis of Nemirovski et al. (2009) does not apply to (DA). In the online learningliterature, this difference is sometimes referred to as “greedy” vs. “lazy” mirror descent. EARNING IN GAMES WITH CONTINUOUS ACTION SETS 23
Hence, after rearranging and telescoping, we get n X k =1 γ k h v ( X k ) , p − X k i ≤ F ( p, Y ) + n X k =1 γ k ψ k +1 + 12 K n X k =1 γ k k ˆ v k +1 k ∗ , (4.29)where ψ n +1 = h ξ n +1 , X n − p i and we used the fact that F ( p, Y n ) ≥ . By concavity,we also have h v ( x ) , p − x i = X i ∈N h v i ( x ) , p i − x i i ≥ X i ∈N [ u i ( p i ; x − i ) − u i ( x )] = X i ∈N u i ( p i ; x − i ) , (4.30)for all x ∈ X . Therefore, letting τ n = P nk =1 γ k , we get τ n n X k =1 γ k h v ( X k ) , p − X k i ≥ τ n n X k =1 γ k X i ∈N u i ( p i ; X − i,k ) ≥ u ( p A , ¯ X B,n ) − u ( ¯ X A,n , p B )= X i ∈N u i ( p i ; ¯ X − i,n ) , (4.31)where we used the fact that u is concave-convex in the second line. Thus, combining(4.29) and (4.31), we finally obtain X i ∈N u i ( p i ; ¯ X − i,n ) ≤ F ( p, Y ) + P nk =1 γ k ψ k +1 + (2 K ) − P nk =1 γ k k ˆ v k +1 k ∗ τ n . (4.32)As before, the law of large numbers (Hall and Heyde, 1980, Theorem 2.18) yields τ − n P nk =1 γ k ψ k +1 → (a.s.). Furthermore, given that E [ k ˆ v n +1 k ∗ | F n ] ≤ V ∗ and P nk =1 γ k < ∞ , we also get τ − n P nk =1 γ k k ˆ v k +1 k ∗ → by Doob’s martingale con-vergence theorem (Hall and Heyde, 1980, Theorem 2.5), implying in turn that P i ∈N u i ( p i ; ¯ X − i,n ) → (a.s.). Since p is arbitrary, we conclude that ǫ ( ¯ X n ) → (a.s.), as claimed. (cid:4)
5. Learning in finite games
As a concrete application of the analysis of the previous section, we turn to theasymptotic behavior of (DA) in finite games. Briefly recalling the setup of Ex-ample 2.1, each player in a finite game Γ ≡ Γ( N , ( A i ) i ∈N , ( u i ) i ∈N ) chooses a purestrategy α i from a finite set A i and receives a payoff of u i ( α , . . . , α N ) . Pure strate-gies are drawn based on the players’ mixed strategies x i ∈ X i ≡ ∆( A i ) , so eachplayer’s expected payoff is given by the multilinear expression (2.3). Accordingly,the individual payoff gradient of player i ∈ N in the mixed profile x = ( x , . . . , x N ) is the (mixed) payoff vector v i ( x ) = ∇ x i u i ( x i ; x − i ) = ( u i ( α i ; x − i )) α i ∈A i of Eq. (2.4).Consider now the following learning scheme: At stage n , every player i ∈ N selects a pure strategy α i,n ∈ A i according to their individual mixed strategy X i,n ∈ X i . Subsequently, each player observes – or otherwise calculates – thepayoffs of their pure strategies α i ∈ A i against the chosen actions α − i,n of all otherplayers (possibly subject to some random estimation error). Specifically, we positthat each player receives as feedback the “noisy” payoff vector ˆ v i,n +1 = ( u i ( α i ; α − i,n )) α i ∈A i + ξ i,n +1 , (5.1) Algorithm 2.
Logit-based learning in finite games (Example 3.2).
Require: step-size sequence γ n ∝ /n β , β ∈ (0 , ; initial scores Y i ∈ R A i for n = 1 , , . . . do for every player i ∈ N do set X i ← Λ i ( Y i ) ; {mixed strategy} play α i ∼ X i ; {choose action} observe ˆ v i ; {estimate payoffs} update Y i ← Y i + γ n ˆ v i ; {update scores} end for end for where the error process ξ n = ( ξ i,n ) i ∈N is assumed to satisfy (H1) and (H2). Then,based on this feedback, players update their mixed strategies and the process re-peats (for a concrete example, see Algorithm 2).In the rest of this section, we study the long-term behavior of this adaptivelearning process. Specifically, we focus on: a ) the elimination of dominated strate-gies; b ) convergence to strict Nash equilibria; and c ) convergence to equilibrium in -player, zero-sum games.5.1. Dominated strategies.
We say that a pure strategy α i ∈ A i of a finite game Γ is dominated by β i ∈ A i (and we write α i ≺ β i ) if u i ( α i ; x − i ) < u i ( β i ; x − i ) for all x − i ∈ X − i ≡ Q j = i X j . (5.2)Put differently, α i ≺ β i if and only if v iα i ( x ) < v iβ i ( x ) for all x ∈ X . In turn,this implies that the payoff gradient of player i points consistently towards the face x iα i = 0 of X i , so it is natural to expect that α i is eliminated under (DA). Indeed,we have: Theorem 5.1.
Suppose that (DA) is run with noisy payoff observations of the form (5.1) and a step-size sequence γ n satisfying (4.2) . If α i ∈ A i is dominated, then X iα i ,n → ( a.s. ) .Proof. Suppose that α i ≺ β i for some β i ∈ A i . Then, suppressing the player index i for simplicity, (DA) gives Y β,n +1 − Y α,n +1 = c βα + n X k =1 γ k [ˆ v β,k +1 − ˆ v α,k +1 ]= c βα + n X k =1 γ k [ v β ( X k ) − v α ( X k )] + n X k =1 γ k ζ k +1 , (5.3)where we set c βα = Y β, − Y α, and ζ k +1 = E [ˆ v β,k +1 − ˆ v α,k +1 | F k ] − [ v β ( X k ) − v α ( X k )] . (5.4)Since α ≺ β , there exists some c > such that v β ( x ) − v α ( x ) ≥ c for all x ∈ X .Then, (5.3) yields Y β,n +1 − Y α,n +1 ≥ c βα + τ n (cid:20) c + P nk =1 γ k ζ k +1 τ n (cid:21) , (5.5) EARNING IN GAMES WITH CONTINUOUS ACTION SETS 25 where τ n = P nk =1 γ k . As in the proof of Theorem 4.1, the law of large numbers formartingale difference sequences (Hall and Heyde, 1980, Theorem 2.18) implies that τ − n P nk =1 γ k ζ k +1 → under the step-size assumption (4.2), so Y β,n − Y α,n → ∞ (a.s.).Suppose now that lim sup n →∞ X α,n = 2 ε for some ε > . By descending toa subsequence if necessary, we may assume that X α,n ≥ ε for all n , so if we let X ′ n = X n + ε ( e β − e α ) , the definition of Q gives h ( X ′ n ) ≥ h ( X n ) + h Y n , X ′ n − X n i = h ( X n ) + ε ( Y β,n − Y α,n ) → ∞ , (5.6)a contradiction. This implies that X α,n → (a.s.), as asserted. (cid:4) Strict equilibria.
A Nash equilibrium x ∗ of a finite game is called strict when(NE) holds as a strict inequality for all x i = x ∗ i , i.e. when no player can deviateunilaterally from x ∗ without reducing their payoff (or, equivalently, when everyplayer has a unique best response to x ∗ ). This implies that strict Nash equilibriaare pure strategy profiles x ∗ = ( α ∗ , . . . , α ∗ N ) such that u i ( α ∗ i ; α ∗− i ) > u i ( α i ; α ∗− i ) for all α i ∈ A i \ { α ∗ i } , i ∈ N . (5.7)Strict Nash equilibria can be characterized further as follows: Proposition 5.2.
Then, the following are equivalent:a ) x ∗ is a strict Nash equilibrium.b ) h v ( x ∗ ) , z i ≤ for all z ∈ TC( x ∗ ) with equality if and only if z = 0 .c ) x ∗ is stable. Thanks to the above characterization of strict equilibria (proven in Appendix A),the convergence analysis of Section 4 yields:
Proposition 5.3.
Let x ∗ be a strict equilibrium of a finite game Γ . Suppose furtherthat (DA) is run with noisy payoff observations of the form (5.1) and a sufficientlysmall step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . If (H1)–(H3) hold, x ∗ is locally attracting with arbitrarily high probability; specifically, for all δ > ,there exists a neighborhood U of x ∗ such that P ( X n → x ∗ | X ∈ U ) ≥ − δ. (5.8) Proof.
We first show that E [ˆ v n +1 | F n ] = v ( X n ) . Indeed, for all i ∈ N , α i ∈ A i , wehave E [ˆ v iα i ,n +1 | F n ] = X α − i ∈A − i u i ( α i ; α − i ) X α − i ,n + E [ ξ iα i ,n +1 | F n ] = u i ( α i ; X − i,n ) , (5.9)where, in a slight abuse of notation, we set X α − i ,n for the joint probability assignedto the pure strategy profile α − i of all players other than i at stage n .By (2.4), it follows that E [ˆ v n +1 | F n ] = v ( X n ) so the estimator (5.1) is unbiasedin the sense of (H1). Hypothesis (H2) can be verified similarly, so the estimator(5.1) satisfies (3.3). Since x ∗ is stable by Proposition 5.2 and v ( x ) is multilinear(so (H4) is satisfied automatically), our assertion follows from Theorem 4.11. (cid:4) In the special case of logit-based learning (Example 3.2), Cohen et al. (2017)showed that Algorithm 2 converges locally to strict Nash equilibria under similarinformation assumptions. Proposition 5.2 essentially extends this result to the en-tire class of regularized learning processes induced by (DA) in finite games, showing that the logit choice map (3.9) has no special properties in this regard. Cohen et al.(2017) further showed that the convergence rate of logit-based learning is exponen-tial in the algorithm’s “running horizon” τ n = P nk =1 γ k . This rate is closely linkedto the logit choice model, and different choice maps yield different convergencespeeds; we discuss this issue in more detail in Section 6.5.3. Convergence in zero-sum games.
We close this section with a brief discussionof the ergodic convergence properties of (DA) in finite two-player zero-sum games.In this case, the analysis of Section 4.5 readily yields:
Corollary 5.4.
Let Γ be a finite -player zero-sum game. If (DA) is run with noisypayoff observations of the form (5.1) and a step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ , the ergodic average ¯ X n = P nk =1 γ k X k (cid:14) P nk =1 γ k of the players’mixed strategies converges to the set of Nash equilibria of Γ ( a.s. ) .Proof. As in the proof of Proposition 5.3, the estimator (5.1) satisfies E [ˆ v n +1 | F n ] = v ( X n ) , so (H1) and (H2) also hold in the sense of (3.3). Our claim then followsfrom Theorem 4.13. (cid:4) Remark . In a very recent paper, Bravo and Mertikopoulos (2017) showed thatthe time average ¯ X ( t ) = t − R t X ( s ) ds of the players’ mixed strategies under (DA-c)with Brownian payoff shocks converges to Nash equilibrium in -player, zero-sumgames. Corollary 5.4 may be seen as a discrete-time version of this result.
6. Speed of convergence
Ergodic convergence rate.
In this section, we focus on the rate of convergenceof (DA) to stable equilibrium states (and/or sets thereof). To that end, we willmeasure the speed of convergence to a globally stable set X ∗ ⊆ X via the equilibriumgap function ǫ ( x ) = inf x ∗ ∈X ∗ h v ( x ) , x ∗ − x i . (6.1)By Definition 2.6, ǫ ( x ) ≥ with equality if and only if x ∈ X ∗ , so ǫ ( x ) can be seenas a (game-dependent) measure of the distance between x and the target set X ∗ .This can be seen more clearly in the case of strongly stable equilibria, defined hereas follows: Definition 6.1.
We say that x ∗ ∈ X is strongly stable if there exists some L > such that h v ( x ) , x − x ∗ i ≤ − L k x − x ∗ k for all x ∈ X . (6.2)More generally, a closed subset X ∗ of X is called strongly stable if h v ( x ) , x − x ∗ i ≤ − L dist( X ∗ , x ) for all x ∈ X , x ∗ ∈ X ∗ . (6.3)Obviously, ǫ ( x ) ≥ L dist( X ∗ , x ) if X ∗ is L -strongly stable, i.e. ǫ ( x ) grows atleast quadratically near strongly stable sets – just like strongly convex functionsgrow quadratically around their minimum points. With this in mind, we providebelow an explicit estimate for the decay rate of the average equilibrium gap ¯ ǫ n = P nk =1 γ k ǫ ( X k ) (cid:14) P nk =1 γ k in the spirit of Nemirovski et al. (2009): EARNING IN GAMES WITH CONTINUOUS ACTION SETS 27
Theorem 6.2.
Suppose that (DA) is run with imperfect gradient information satis-fying (H1)–(H2) . Then E [¯ ǫ n ] ≤ F + V ∗ / (2 K ) P nk =1 γ k P nk =1 γ k , (6.4) where F = F ( X ∗ , Y ) . If, in addition, P ∞ n =1 γ n < ∞ , we have ¯ ǫ n ≤ A P nk =1 γ k for all n ( a.s. ) , (6.5) where A > is a finite random variable such that, with probability at least − δ , A ≤ F + σ kX k κ + κ V ∗ , (6.6) where κ = 2 δ − P ∞ n =1 γ n . Corollary 6.3.
Suppose that (DA) is initialized at Y = 0 and is run for n iterationswith constant step-size γ = V − ∗ p K Ω /n where Ω = max h − min h . Then, E [¯ ǫ n ] ≤ V ∗ p Ω / ( Kn ) . (6.7) In addition, if X ∗ is L -strongly stable, the long-run average distance to equilibrium ¯ r n = P nk =1 dist( X ∗ , X n ) (cid:14) P nk =1 γ k satisfies E [¯ r n ] ≤ p L − V ∗ Ω / ( Kn ) . (6.8) Proof of Theorem 6.2.
Let x ∗ ∈ X ∗ . Rearranging (4.20) and telescoping yields n X k =1 γ k h v ( X k ) , x ∗ − X k i ≤ F ( x ∗ , Y ) + n X k =1 γ k ψ k +1 + 12 K n X k =1 γ k k ˆ v k +1 k ∗ , (6.9)where ψ k +1 = h ξ k +1 , X k − x ∗ i . Thus, taking expectations on both sides, we obtain n X k =1 γ k E [ h v ( X k ) , x ∗ − X k i ] ≤ F ( x ∗ , Y ) + V ∗ K n X k =1 γ k . (6.10)Subsequently, minimizing both sides of (6.10) over x ∗ ∈ X ∗ yields n X k =1 γ k E [ ǫ ( X k )] ≤ F + V ∗ K n X k =1 γ k , (6.11)where we used Jensen’s inequality to interchange the inf and E operations. Theestimate (6.4) then follows immediately.To establish the almost sure bound (6.5), set S n +1 = P nk =1 γ k ψ k +1 and R n +1 =(2 K ) − P nk =1 γ k k ˆ v k +1 k ∗ . Then, (6.9) becomes n X k =1 γ k h v ( X k ) , x ∗ − X k i ≤ F ( x ∗ , Y ) + S n + R n , (6.12)Arguing as in the proof of Theorem 4.11, it follows that sup n E [ | S n | ] and sup n E [ R n ] are both finite, i.e. S n and R n are both bounded in L . By Doob’s (sub)martingaleconvergence theorem (Hall and Heyde, 1980, Theorem 2.5), it also follows that S n and R n both converge to an (a.s.) finite limit S ∞ and R ∞ respectively. Conse-quently, by (6.12), there exists a finite (a.s.) random variable A > such that n X k =1 γ k h v ( X k ) , x ∗ − X k i ≤ A for all n (a.s.) . (6.13) The bound (6.5) follows by taking the minimum of (6.13) over x ∗ ∈ X ∗ and dividingboth sides by P nk =1 γ k . Finally, applying Doob’s maximal inequality to (4.21) and(4.23), we obtain P (cid:0) sup n S n ≥ σ kX k κ (cid:1) ≤ δ/ and P (cid:0) sup n R n ≥ V ∗ κ (cid:1) ≤ δ/ .Combining these bounds with (6.12) shows that A can be taken to satisfy (6.6)with probability at least − δ , as claimed. (cid:4) Proof of Corollary 6.3.
By the definition (4.11) of the setwise Fenchel coupling,we have F ≤ h ( x ∗ ) + h ∗ (0) ≤ max h − min h = Ω . Our claim then follows byinvoking Jensen’s inequality, noting that E [dist( X ∗ , X n )] ≤ E [dist( X ∗ , X n ) ] ≤ L − E [ ǫ ( X n )] , and applying (6.4). (cid:4) Although the mean bound (6.4) is valid for any step-size sequence, the summa-bility condition P ∞ n =1 γ n < ∞ for the almost sure bound (6.5) rules out moreaggressive step-size policies of the form γ n ∝ /n β for β ≤ / . Specifically, the“critical” value β = 1 / is again tied to the finite mean squared error hypothesis(H2): if the players’ gradient measurements have finite moments up to some order q > , a more refined application of Doob’s inequality reveals that (6.5) still holdsunder the lighter summability requirement P ∞ n =1 γ q/ n < ∞ . In this case, theexponent β = 1 / is optimal with respect to the guarantee (6.4) and leads to analmost sure convergence rate of the order of O ( n − / log n ) .Except for this log n factor, the O ( n − / ) convergence rate of (DA) is the exactlower complexity bound for black-box subgradient schemes for convex problems(Nemirovski and Yudin, 1983; Nesterov, 2004). Thus, running (DA) with a step-size policy of the form γ n ∝ n − / leads to a convergence speed that is optimalin the mean, and near-optimal with high probability. It is also worth noting that,when the horizon of play is known in advance (as in Corollary 6.3), the constant Ω = max h − min h that results from the initialization Y = 0 is essentially the sameas the constant that appears in the stochastic mirror descent analysis of Nemirovskiet al. (2009) and Nesterov (2009).6.2. Running length.
Intuitively, the main obstacle to achieving rapid convergenceis that, even with an optimized step-size policy, the sequence of play may end uposcillating around an equilibrium state because of the noise in the players’ obser-vations. To study such phenomena, we focus below on the running length of (DA),defined as ℓ n = n − X k =1 k X k +1 − X k k . (6.14)Obviously, if X n converges to some x ∗ ∈ X , a shorter length signifies less oscillationsof X n around x ∗ . Thus, in a certain way, ℓ n is a more refined convergence criterionthan the induced equilibrium gap ǫ ( X n ) .Our next result shows that the mean running length of (DA) until players reachan ε -neighborhood of a (strongly) stable set is at most O (1 /ε ) : Theorem 6.4.
Suppose that (DA) is run with imperfect feedback satisfying (H1)–(H2) and a step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . Also, given a closedsubset X ∗ of X , consider the stopping time n ε = inf { n ≥ X ∗ , X n ) ≤ ε } andlet ℓ ε ≡ ℓ n ε denote the running length of (DA) until X n reaches an ε -neighborhoodof X ∗ . If X ∗ is L -strongly stable, we have E [ ℓ ε ] ≤ V ∗ KL F + (2 K ) − V ∗ P ∞ k =1 γ k ε . (6.15) EARNING IN GAMES WITH CONTINUOUS ACTION SETS 29
Proof.
For all x ∗ ∈ X ∗ and all n ∈ N , (4.20) yields F ( x ∗ , Y n ε ∧ n +1 ) ≤ F ( x ∗ , Y ) − n ε ∧ n X k =1 γ k h v ( X k ) , X k − x ∗ i + n ε ∧ n X k =1 γ k ψ k +1 + 12 K n ε ∧ n X k =1 γ k k ˆ v k +1 k ∗ . (6.16)Hence, after taking expectations and minimizing over x ∗ ∈ X ∗ , we get ≤ F − Lε E " n ε ∧ n X k =1 γ k + E " n ε ∧ n X k =1 γ k ψ k +1 + V ∗ K ∞ X k =1 γ k , (6.17)where we we used the fact that k X k − x ∗ k ≥ ε for all k ≤ n ε .Consider now the stopped process S n ε ∧ n = P n ε ∧ nk =1 γ k ψ k +1 . Since n ε ∧ n ≤ n < ∞ , S n ε ∧ n is a martingale and E [ S n ε ∧ n ] = 0 . Thus, by rearranging (6.17), we obtain E " n ε ∧ n X k =1 γ k ≤ F + (2 K ) − V ∗ P ∞ k =1 γ k Lε . (6.18)Hence, with n ε ∧ n → n ε as n → ∞ , Lebesgue’s monotone convergence theoremshows that the process τ ε = P n ε k =1 γ k is finite in expectation and E [ τ ε ] ≤ F + (2 K ) − V ∗ P ∞ k =1 γ k Lε . (6.19)Furthermore, by Proposition 3.2 and the definition of ℓ n , we also have ℓ n = n − X k =1 k X k +1 − X k k ≤ K n − X k =1 k Y k − Y k − k ∗ = 1 K n − X k =1 γ k k ˆ v k +1 k ∗ . (6.20)Now, let ζ k +1 = k ˆ v k +1 k ∗ and Ψ n +1 = P nk =1 γ k [ ζ k +1 − E [ ζ k +1 | F k ]] . By construc-tion, Ψ n is a martingale and E [Ψ n +1 ] = E " n X k =1 γ k [ ζ k +1 − E [ ζ k +1 | F k ]] ≤ V ∗ ∞ X k =1 γ k < ∞ for all n. (6.21)Thus, by the optional stopping theorem (Shiryaev, 1995, p. 485), we get E [Ψ n ε ] = E [Ψ ] = 0 , so E " n ε X k =1 γ k ζ k +1 = E " n ε X k =1 γ k E [ ζ k +1 | F k ] ≤ V ∗ E " n ε X k =1 γ k = V ∗ E [ τ ε ] . (6.22)Our claim then follows by combining (6.20) and (6.22) with the bound (6.19). (cid:4) Theorem 6.4 should be contrasted to classic results on the Kurdyka–Łojasiewiczinequality where having a “bounded length” property is crucial in establishing tra-jectory convergence (Bolte et al., 2010). In our stochastic setting, it is not realisticto expect a bounded length (even on average), because, generically, the noise doesnot vanish in the neighborhood of a Nash equilibrium. Instead, Theorem 6.4should be interpreted as a measure of how the fluctuations due to noise and un-certainty affect the trajectories’ average length; the authors are not aware of anysimilar results along these lines. For a notable exception however, see Theorem 6.6 below.
Sharp equilibria and fast convergence.
Because of the random shocks inducedby the noise in the players’ gradient observations, it is difficult to obtain an almostsure (or high probability) estimate for the convergence rate of the last iterate X n of (DA). Specifically, even with a rapidly decreasing step-size policy, a single re-alization of the error process ξ n may lead to an arbitrarily big jump of X n at anytime, thus destroying any almost sure bound on the convergence rate of X n .On the other hand, in finite games, Cohen et al. (2017) recently showed thatlogit-based learning (cf. Algorithm 2) achieves a quasi-linear convergence rate withhigh probability if the equilibrium in question is strict. Specifically, Cohen et al.(2017) showed that if x ∗ is a strict Nash equilibrium and X n does not start too farfrom x ∗ , then, with high probability, k X n − x ∗ k = O ( − c P nk =1 γ k ) for some positiveconstant c > that depends only on the players’ relative payoff differences.Building on the variational characterization of strict Nash equilibria provided byProposition 5.2, we consider below the following analogue for continuous games: Definition 6.5.
We say that x ∗ ∈ X is a sharp equilibrium of G if h v ( x ∗ ) , z i ≤ for all z ∈ TC( x ∗ ) , (6.23)with equality if and only if z = 0 . Remark . The terminology “sharp” follows Polyak (1987, Chapter 5.2), whointroduced a similar notion for (unconstrained) convex programs. In particular,in the single-player case, it is easy to see that (6.23) implies that x ∗ is a sharpmaximum of u ( x ) , i.e. u ( x ∗ ) − u ( x ) ≥ c k x − x ∗ k for some c > .A first consequence of Definition 6.5 is that v ( x ∗ ) lies in the topological interior ofthe polar cone PC( x ∗ ) to X at x ∗ (for a schematic illustration, see Fig. 1); in turn,this implies that sharp equilibria can only occur at corners of X . By continuity, thisfurther implies that sharp equilibria are locally stable (cf. the proof of Theorem 6.6below); hence, by Proposition 2.7, sharp equilibria are also isolated. Our nextresult shows that if players employ (DA) with surjective choice maps, then, withhigh probability, sharp equilibria are attained in a finite number of steps: Theorem 6.6.
Fix a tolerance level δ > and suppose that (DA) is run with surjec-tive choice maps and a sufficiently small step-size γ n such that P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . If x ∗ is sharp and (DA) is not initialized too far from x ∗ , we have P ( X n reaches x ∗ in a finite number of steps ) ≥ − δ, (6.24) provided that (H1)–(H4) hold. If, in addition, x ∗ is globally stable, X n converges to x ∗ in a finite number of steps from every initial condition ( a.s. ) .Proof. As we noted above, v ( x ∗ ) lies in the interior of the polar cone PC( x ∗ ) to X at x ∗ . Hence, by continuity, there exists a neighborhood U ∗ of x ∗ such that v ( x ) ∈ int(PC( x ∗ )) for all x ∈ U ∗ . In turn, this implies that h v ( x ) , x − x ∗ i < for all x ∈ U ∗ \ { x ∗ } , i.e. x ∗ is stable. Therefore, by Theorem 4.11, there exists aneighborhood U of x ∗ such that X n converges to x ∗ with probability at least − δ .Now, let U ′ ⊆ U ∗ be a sufficiently small neighborhood of x ∗ such that h v ( x ) , z i ≤− c k z k for some c > and for all z ∈ TC( x ∗ ) . Then, with probability at least − δ , Indeed, if this were not the case, we would have h v ( x ∗ ) , z i = 0 for some nonzero z ∈ TC( x ∗ ) . That such a neighborhood exists is a direct consequence of Definition 6.5.
EARNING IN GAMES WITH CONTINUOUS ACTION SETS 31 there exists some (random) n such that X n ∈ U ′ for all n ≥ n , so h v ( X n ) , z i ≤− c k z k for all n ≥ n . Thus, for all z ∈ TC( x ∗ ) with k z k = 1 , we have h Y n +1 , z i = h Y n , z i + n X k = n γ k h v ( X k ) , z i + n X k = n γ k h ξ k +1 , z i≤ k Y n k ∗ − c n X k = n γ k + n X k = n γ k h ξ k +1 , z i . (6.25)By the law of large numbers for martingale difference sequences (Hall and Heyde,1980, Theorem 2.18), we also have P nk = n γ k ξ k +1 / P nk = n γ k → (a.s.), so thereexists some n ∗ such that k P nk = n γ k ξ k +1 k ∗ ≤ ( c/ P nk = n γ k for all n ≥ n ∗ (a.s.).We thus obtain h Y n +1 , z i ≤ k Y n k ∗ − c n X k = n γ k + c k z k n X k = n γ k ≤ k Y n k ∗ − c n X k = n γ k , (6.26)showing that h Y n , z i → −∞ uniformly in z with probability at least − δ .To proceed, Proposition A.1 in Appendix A shows that y ∗ + PC( x ∗ ) ⊆ Q − ( x ∗ ) whenever Q ( y ∗ ) = x ∗ . Since Q is surjective, there exists some y ∗ ∈ Q − ( x ∗ ) , so itsuffices to show that, with probability at least − δ , Y n lies in the pointed cone y ∗ +PC( x ∗ ) for all sufficiently large n . To do so, simply note that Y n − y ∗ ∈ PC( x ∗ ) ifand only if h Y n − y ∗ , z i ≤ for all z ∈ TC( x ∗ ) with k z k = 1 . Since h Y n , z i convergesuniformly to −∞ with probability at least − δ , our assertion is immediate.Finally, for the globally stable case, recall that X n converges to x ∗ with proba-bility from any initial condition (Theorem 4.7). The argument above shows that X n = x ∗ for all large n , so X n converges to x ∗ in a finite number of steps (a.s.). (cid:4) Remark . Theorem 6.6 suggests that dual averaging with surjective choice mapsleads to significantly faster convergence to sharp equilibria. In this way, it is consis-tent with an observation made by Mertikopoulos and Sandholm (2016, Proposition5.2) for the convergence of the continuous-time, deterministic dynamics (DA-c) infinite games.
7. Discussion
An important question in the implementation of dual averaging is the choiceof regularizer, which in turn determines the players’ choice maps Q i : Y i → X i .From a qualitative point of view, this choice would not seem to matter much: theconvergence results of Sections 4 and 5 hold for all choice maps of the form (3.6).Quantitatively however, the specific choice map employed by each player impactsthe algorithm’s convergence speed, and different choice maps could lead to vastlydifferent rates of convergence.As noted above, in the case of sharp equilibria, this choice seems to favor nonsteeppenalty functions (that is, surjective choice maps). Nonetheless, in the general case,the situation is less clear because of the dimensional dependence hidden in the Ω /K factor that appears e.g. in the mean rate guarantee (6.7). This factor dependscrucially on the geometry of the players’ action spaces and the underlying norm,and its optimum value may be attained by steep penalty functions – for instance,the entropic regularizer (3.8) is well known to be asymptotically optimal in the caseof simplex-like feasible regions (Shalev-Shwartz, 2011, p. 140). Another key question in game-theoretic and online learning has to do with theinformation that is available to the players at each stage. If players perform atwo-point sampling step in order to simulate an extra oracle call at an action pro-file different than the one employed, this extra information could be presumablyleveraged in order to increase the speed of convergence to a Nash equilibrium. Inan offline setting, this can be achieved by more sophisticated techniques relying ondual extrapolation (Nesterov, 2007) and/or mirror-prox methods (Juditsky et al.,2011). Extending these extra-gradient approaches to online learning processes asabove would be an interesting extension of the current work.At the other end of the spectrum, if players only have access to their realized, in-game payoffs, they would need to reconstruct their individual payoff gradients via asuitable single-shot estimator (Polyak, 1987; Flaxman et al., 2005). We believe ourconvergence analysis can be extended to this case by properly controlling the “bias-variance” tradeoff of this estimator and using more refined stochastic approximationarguments. The very recent manuscript by Bervoets et al. (2016) provides anencouraging first step in the case of (strictly) concave games with one-dimensionalaction sets; we intend to explore this direction in future work.
Appendix A. Auxiliary results
In this appendix, we collect some auxiliary results that would have otherwisedisrupted the flow of the main text. We begin with the basic properties of theFenchel coupling:
Proof of Proposition 4.3.
For our first claim, let x = Q ( y ) . Then, by definition F ( p, y ) = h ( p ) + h y, Q ( y ) i − h ( Q ( y )) − h y, p i = h ( p ) − h ( x ) − h y, p − x i . (A.1)Since y ∈ ∂h ( x ) by Proposition 3.2, we have h y, p − x i = h ′ ( x ; p − x ) whenever x ∈ X ◦ , thus proving (4.9a). Furthermore, the strong convexity of h also yields h ( x ) + t h y, p − x i ≤ h ( x + t ( p − x )) ≤ th ( p ) + (1 − t ) h ( x ) − Kt (1 − t ) k x − p k , (A.2)leading to the bound K (1 − t ) k x − p k ≤ h ( p ) − h ( x ) − h y, p − x i = F ( p, y ) (A.3)for all t ∈ (0 , . Eq. (4.9b) then follows by letting t → + in (A.3).Finally, for our third claim, we have F ( p, y ′ ) = h ( p ) + h ∗ ( y ′ ) − h y ′ , p i≤ h ( p ) + h ∗ ( y ) + h y ′ − y, ∇ h ∗ ( y ) i + 12 K k y ′ − y k ∗ − h y ′ , p i = F ( p, y ) + h y ′ − y, Q ( y ) − p i + 12 K k y ′ − y k ∗ , (A.4)where the inequality in the second line follows from the fact that h ∗ is (1 /K ) -strongly smooth (Rockafellar and Wets, 1998, Theorem 12.60(e)). (cid:4) Complementing Proposition 4.3, our next result concerns the inverse images ofthe choice map Q : Proposition A.1.
Let h be a penalty function on X , and let x ∗ ∈ X . If x ∗ = Q ( y ∗ ) for some y ∗ ∈ Y , then y ∗ + PC( x ∗ ) ⊆ Q − ( x ∗ ) . EARNING IN GAMES WITH CONTINUOUS ACTION SETS 33
Proof.
By Proposition 3.2, we have x ∗ = Q ( y ) if and only if y ∈ ∂h ( x ∗ ) , so it sufficesto show that y ∗ + v ∈ ∂h ( x ∗ ) for all v ∈ PC( x ∗ ) . Indeed, we have h v, x − x ∗ i ≤ for all x ∈ X , so h ( x ) ≥ h ( x ∗ ) + h y ∗ , x − x ∗ i ≥ h ( x ∗ ) + h y ∗ + v, x − x ∗ i . (A.5)The above shows that y ∗ + v ∈ ∂h ( x ∗ ) , as claimed. (cid:4) Our next result concerns the evolution of the Fenchel coupling under the dynam-ics (DA-c):
Lemma A.2.
Let x ( t ) = Q ( y ( t )) be a solution orbit of (DA-c) . Then, for all p ∈ X ,we have ddt F ( p, y ( t )) = h v ( x ( t )) , x ( t ) − p i . (A.6) Proof.
By definition, we have ddt F ( p, y ( t )) = ddt [ h ( p ) + h ∗ ( y ( t )) − h y ( t ) , p i ]= h ˙ y ( t ) , ∇ h ∗ ( y ( t )) i − h ˙ y ( t ) , p i = h v ( x ( t )) , x ( t ) − p i , (A.7)where, in the last line, we used Proposition 3.2. (cid:4) Our last auxiliary result shows that, if the sequence of play generated by (DA)is contained in the “basin of attraction” of a stable set X ∗ , then it admits anaccumulation point in X ∗ : Lemma A.3.
Suppose that X ∗ ⊆ X is stable and (DA) is run with a step-size suchthat P ∞ n =1 γ n < ∞ and P ∞ n =1 γ n = ∞ . Assume further that ( X n ) ∞ n =1 is containedin a region R of X such that (VS) holds for all x ∈ R . Then, under (H1) and (H2) ,every neighborhood U of X ∗ is recurrent; specifically, there exists a subsequence X n k of X n such that X n k → X ∗ ( a.s. ) . Finally, if (DA) is run with perfect feedback ( σ = 0 ) , the above holds under the lighter assumption P nk =1 γ k (cid:14) P nk =1 γ k → .Proof of Lemma A.3. Let U be a neighborhood of X ∗ and assume to the contrarythat, with positive probability, X n / ∈ U for all sufficiently large n . By starting thesequence at a later index if necessary, we may assume that X n / ∈ U for all n withoutloss of generality. Thus, with X ∗ stable and X n ∈ R for all n by assumption, thereexists some c > such that h v ( X n ) , X n − x ∗ i ≤ − c for all x ∗ ∈ X ∗ and for all n. (A.8)As a result, for all x ∗ ∈ X ∗ , we get F ( x ∗ , Y n +1 ) = F ( x ∗ , Y n + γ n ˆ v n +1 ) ≤ F ( x ∗ , Y n ) + γ n h v ( X n ) + ξ n +1 , X n − x ∗ i + 12 K γ n k ˆ v n +1 k ∗ ≤ F ( x ∗ , Y n ) − cγ n + γ n ψ n +1 + 12 K γ n k ˆ v n +1 k ∗ , (A.9)where we used Proposition 4.3 in the second line and we set ψ n +1 = h ξ n +1 , X n − x ∗ i in the third. Telescoping (A.9) then gives F ( x ∗ , Y n +1 ) ≤ F ( x ∗ , Y ) − τ n " c − P nk =1 γ k ψ k +1 τ n − K P nk =1 γ k k ˆ v k +1 k ∗ τ n , (A.10) where τ n = P nk =1 γ k .Since E [ ψ n +1 | F n ] = h E [ ξ n +1 | F n ] , X n − x ∗ i = 0 by (H1) and E [ | ψ n +1 | | F n ] ≤ E [ k ξ n +1 k ∗ k X n − x ∗ k | F n ] ≤ σ kX k < ∞ by (H2), the law of large numbersfor martingale difference sequences yields τ − n P nk =1 γ k ψ k +1 → (Hall and Heyde,1980, Theorem 2.18). Furthermore, letting R n +1 = P nk =1 γ k k ˆ v k +1 k ∗ , we also get E [ R n +1 ] ≤ n X k =1 γ k E [ˆ v k +1 ] ≤ V ∗ ∞ X k =1 γ k < ∞ for all n, (A.11)so Doob’s martingale convergence theorem shows that R n converges (a.s.) to somerandom, finite value (Hall and Heyde, 1980, Theorem 2.5).Combining the above, (A.10) gives F ( x ∗ , Y n ) ∼ − aτ n → −∞ (a.s.), a contradic-tion. Finally, if σ = 0 , we also have ψ n +1 = 0 and k ˆ v n +1 k ∗ = k v ( X n ) k ∗ ≤ V ∗ for all n , so (A.10) yields F ( x ∗ , Y n ) → −∞ provided that τ − n P nk =1 γ k → , acontradiction. In both cases, we conclude that X n is recurrent, as claimed. (cid:4) Finally, we turn to the characterization of strict equilibria in finite games:
Proof of Proposition 5.2.
We will show that ( a ) = ⇒ ( b ) = ⇒ ( c ) = ⇒ ( a ).( a ) = ⇒ ( b ). Suppose that x ∗ = ( α ∗ , . . . , α ∗ N ) is a strict equilibrium. Then, theweak inequality h v ( x ∗ ) , z i ≤ follows from Proposition 2.1. For the strict part, if z i ∈ TC i ( x ∗ i ) is nonzero for some i ∈ N , we readily get h v i ( x ∗ ) , z i i = X α i = α ∗ i z i,α i (cid:2) u i ( α ∗ i ; α ∗− i ) − u i ( α i ; α ∗− i ) (cid:3) < , (A.12)where we used the fact that z i is tangent to X at x ∗ i , so P α i ∈A i z iα i = 0 and z iα i ≥ for α i = α ∗ i , with at least one of these inequalities being strict when z i = 0 .( b ) = ⇒ ( c ). Property ( b ) implies that v ( x ∗ ) lies in the interior of the polar cone PC( x ∗ ) to X at x ∗ . Since PC( x ∗ ) has nonempty interior, continuity implies that v ( x ) also lies in PC( x ∗ ) for x sufficiently close to x ∗ . We thus get h v ( x ) , x − x ∗ i ≤ for all x in a neighborhood of x ∗ , i.e. x ∗ is stable.( c ) = ⇒ ( a ). Assume that x ∗ is stable but not strict, so u iα i ( x ∗ ) = u iβ i ( x ∗ ) for some i ∈ N , and some α i ∈ supp( x ∗ i ) , β i ∈ A i . Then, if we take x i = x ∗ i + λ ( e iβ i − e iα i ) and x − i = x ∗− i with λ > small enough, we get h v ( x ) , x − x ∗ i = h v i ( x ) , x i − x ∗ i i = λu iβ i ( x ∗ ) − λu iα i ( x ∗ ) = 0 , (A.13)contradicting the assumption that x ∗ is stable. This shows that x ∗ is strict. (cid:4) References
Alvarez, F., Bolte, J., and Brahic, O. (2004). Hessian Riemannian gradient flows in convexprogramming.
SIAM Journal on Control and Optimization , 43(2):477–501.Arora, S., Hazan, E., and Kale, S. (2012). The multiplicative weights update method: A meta-algorithm and applications.
Theory of Computing , 8(1):121–164.Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methodsfor convex optimization.
Operations Research Letters , 31(3):167–175.Benaïm, M. (1999). Dynamics of stochastic approximation algorithms. In Azéma, J., Émery,M., Ledoux, M., and Yor, M., editors,
Séminaire de Probabilités XXXIII , volume 1709 of
Lecture Notes in Mathematics , pages 1–68. Springer Berlin Heidelberg.Bervoets, S., Bravo, M., and Faure, M. (2016). Learning and convergence to Nash in networkgames with continuous action set. Working paper.
EARNING IN GAMES WITH CONTINUOUS ACTION SETS 35
Bolte, J., Daniilidis, A., Ley, O., and Mazet, L. (2010). Characterizations of Łojasiewicz inequal-ities: Subgradient flows, talweg, convexity.
Transactions of the American MathematicalSociety , 362(6):3319–3363.Bravo, M. and Mertikopoulos, P. (2017). On the robustness of learning in games with stochasticallyperturbed payoff observations.
Games and Economic Behavior , 103, John Nash Memorialissue:41–66.Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems.
Foundations and Trends in Machine Learning , 5(1):1–122.Chen, G. and Teboulle, M. (1993). Convergence analysis of a proximal-like minimization algorithmusing Bregman functions.
SIAM Journal on Optimization , 3(3):538–543.Cohen, J., Héliou, A., and Mertikopoulos, P. (2017). Hedging under uncertainty: Regret min-imization meets exponentially fast convergence. In
SAGT ’17: Proceedings of the 10thInternational Symposium on Algorithmic Game Theory .Coucheney, P., Gaujal, B., and Mertikopoulos, P. (2015). Penalty-regulated dynamics and robustlearning procedures in games.
Mathematics of Operations Research , 40(3):611–633.Facchinei, F. and Kanzow, C. (2007). Generalized Nash equilibrium problems. , 5(3):173–210.Flaxman, A. D., Kalai, A. T., and McMahan, H. B. (2005). Online convex optimization in thebandit setting: gradient descent without a gradient. In
SODA ’05: Proceedings of the 16thannual ACM-SIAM symposium on discrete algorithms , pages 385–394.Hall, P. and Heyde, C. C. (1980).
Martingale Limit Theory and Its Application . Probability andMathematical Statistics. Academic Press, New York.Hazan, E. (2012). A survey: The convex optimization approach to regret minimization. In Sra, S.,Nowozin, S., and Wright, S. J., editors,
Optimization for Machine Learning , pages 287–304.MIT Press.Hofbauer, J. and Sandholm, W. H. (2002). On the global convergence of stochastic fictitious play.
Econometrica , 70(6):2265–2294.Hofbauer, J. and Sandholm, W. H. (2009). Stable games and their dynamics.
Journal of EconomicTheory , 144(4):1665–1693.Hofbauer, J., Schuster, P., and Sigmund, K. (1979). A note on evolutionarily stable strategies andgame dynamics.
Journal of Theoretical Biology , 81(3):609–612.Juditsky, A., Nemirovski, A. S., and Tauvel, C. (2011). Solving variational inequalities withstochastic mirror-prox algorithm.
Stochastic Systems , 1(1):17–58.Kiwiel, K. C. (1997). Free-steering relaxation methods for problems with strictly convex costs andlinear constraints.
Mathematics of Operations Research , 22(2):326–349.Laraki, R. and Mertikopoulos, P. (2013). Higher order game dynamics.
Journal of EconomicTheory , 148(6):2666–2695.Leslie, D. S. and Collins, E. J. (2005). Individual Q -learning in normal form games. SIAM Journalon Control and Optimization , 44(2):495–514.Littlestone, N. and Warmuth, M. K. (1994). The weighted majority algorithm.
Information andComputation , 108(2):212–261.Maynard Smith, J. and Price, G. R. (1973). The logic of animal conflict.
Nature , 246:15–18.McKelvey, R. D. and Palfrey, T. R. (1995). Quantal response equilibria for normal form games.
Games and Economic Behavior , 10(6):6–38.Mertikopoulos, P., Papadimitriou, C. H., and Piliouras, G. (2018). Cycles in adversarial regularizedlearning. In
SODA ’18: Proceedings of the 29th annual ACM-SIAM Symposium on DiscreteAlgorithms .Mertikopoulos, P. and Sandholm, W. H. (2016). Learning in games via reinforcement and regu-larization.
Mathematics of Operations Research , 41(4):1297–1324.Monderer, D. and Shapley, L. S. (1996). Potential games.
Games and Economic Behavior ,14(1):124 – 143.Nemirovski, A. S., Juditsky, A., Lan, G. G., and Shapiro, A. (2009). Robust stochastic approxi-mation approach to stochastic programming.
SIAM Journal on Optimization , 19(4):1574–1609.
Nemirovski, A. S. and Yudin, D. B. (1983).
Problem Complexity and Method Efficiency in Opti-mization . Wiley, New York, NY.Nesterov, Y. (2004).
Introductory Lectures on Convex Optimization: A Basic Course . Number 87in Applied Optimization. Kluwer Academic Publishers.Nesterov, Y. (2007). Dual extrapolation and its applications to solving variational inequalitiesand related problems.
Mathematical Programming , 109(2):319–344.Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems.
Mathematical Pro-gramming , 120(1):221–259.Neyman, A. (1997). Correlated equilibrium and potential games.
International Journal of GameTheory , 26(2):223–227.Perkins, S. and Leslie, D. S. (2012). Asynchronous stochastic approximation with differentialinclusions.
Stochastic Systems , 2(2):409–446.Perkins, S., Mertikopoulos, P., and Leslie, D. S. (2017). Mixed-strategy learning with continuousaction sets.
IEEE Trans. Autom. Control , 62(1):379–384.Polyak, B. T. (1987).
Introduction to Optimization . Optimization Software, New York, NY, USA.Rockafellar, R. T. (1970).
Convex Analysis . Princeton University Press, Princeton, NJ.Rockafellar, R. T. and Wets, R. J. B. (1998).
Variational Analysis , volume 317 of
A Series ofComprehensive Studies in Mathematics . Springer-Verlag, Berlin.Rosen, J. B. (1965). Existence and uniqueness of equilibrium points for concave N -person games. Econometrica , 33(3):520–534.Sandholm, W. H. (2015). Population games and deterministic evolutionary dynamics. In Young,H. P. and Zamir, S., editors,
Handbook of Game Theory IV , pages 703–778. Elsevier.Scutari, G., Facchinei, F., Palomar, D. P., and Pang, J.-S. (2010). Convex optimization, gametheory, and variational inequality theory in multiuser communication systems.
IEEE SignalProcess. Mag. , 27(3):35–49.Shalev-Shwartz, S. (2007).
Online learning: Theory, algorithms, and applications . PhD thesis,Hebrew University of Jerusalem.Shalev-Shwartz, S. (2011). Online learning and online convex optimization.
Foundations andTrends in Machine Learning , 4(2):107–194.Shalev-Shwartz, S. and Singer, Y. (2007). Convex repeated games and Fenchel duality. In
Advancesin Neural Information Processing Systems 19 , pages 1265–1272. MIT Press.Shiryaev, A. N. (1995).
Probability . Springer, Berlin, 2 edition.Sorin, S. and Wan, C. (2016). Finite composite games: Equilibria and dynamics.
Journal ofDynamics and Games , 3(1):101–120.Viossat, Y. and Zapechelnyuk, A. (2013). No-regret dynamics and fictitious play.
Journal ofEconomic Theory , 148(2):825–842.Vovk, V. G. (1990). Aggregating strategies. In
COLT ’90: Proceedings of the 3rd Workshop onComputational Learning Theory , pages 371–383.Xiao, L. (2010). Dual averaging methods for regularized stochastic learning and online optimiza-tion.
Journal of Machine Learning Research , 11:2543–2596.Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In