Bandit Online Learning of Nash Equilibria in Monotone Games
Tatiana Tatarenko and Maryam Kamgarpour, IEEE Member

T. Tatarenko ([email protected]) is with the Control Methods and Robotics Lab, Technical University Darmstadt, Darmstadt, Germany 64283. M. Kamgarpour ([email protected]) is with Electrical and Computer Engineering, UBC, Vancouver, Canada. M. Kamgarpour gratefully acknowledges ERC Starting Grant CONENE.
Abstract—We address online bandit learning of Nash equilibria in multi-agent convex games. We propose an algorithm whereby each agent uses only obtained values of her cost function at each joint played action, lacking any information about the functional form of her cost or other agents' costs or strategies. In contrast to past work, where convergent algorithms required strong monotonicity, we prove that the algorithm converges to a Nash equilibrium under a mere monotonicity assumption. The proposed algorithm extends the applicability of bandit learning to several games, including zero-sum convex games with possibly unbounded action spaces, mixed extensions of finite-action zero-sum games, as well as convex games with linear coupling constraints.

Index Terms—learning in games, monotone bandit learning
I. INTRODUCTION
Game theory is a powerful framework to address optimization and learning of multiple interacting agents, referred to as players. Such multi-agent problems arise in several application domains, including traffic networks, the internet, auctions, and adversarial learning. In a multi-agent setting, the notion of Nash equilibrium captures a desirable solution as it exhibits stability: no player has an incentive to unilaterally deviate from this solution. Nash equilibrium is also consistent with a notion referred to as rationality, in that each player aims to optimize her own cost function. Given these theoretical justifications for Nash equilibria, from the viewpoint of learning the question is whether players can learn their Nash equilibrium strategies with limited information about the game. In particular, in several application domains each player might not know the functional form of her objective. For example, the travel times of edges in a traffic network or the market constraints in an auction are unknown a priori and depend in non-trivial ways on the actions and objectives of other players. However, by playing the game, a player can have access to so-called online bandit information feedback; namely, she can receive the payoff value of her objective (zeroth-order oracle) for any feasible joint action taken by all the players. Thus, we focus on how players can learn Nash equilibria given online bandit information.

Bandit learning in games has mainly been explored in finite-action settings. It is known that if each player uses a no-regret algorithm, the time average of the sequence of played actions converges to a coarse-correlated equilibrium, a relaxed notion of equilibrium which encompasses Nash
equilibria as well as possibly non-rationalizable strategies. The convergence of the time-averaged sequence of plays in no-regret algorithms to a mixed-strategy Nash equilibrium can be established for special games, such as two-player zero-sum games. However, convergence of the time-averaged actions does not imply that the actual sequence of plays also converges to a Nash equilibrium. This issue was recently re-examined in light of progress in bandit convex optimization. In particular, [13] showed that if each player uses a fairly general model of no-regret algorithm (FTRL) in continuous time, the actual sequence of plays may not converge to a mixed-strategy Nash equilibrium in a two-player zero-sum game and might exhibit a non-vanishing cycle. (The FTRL algorithms explored there had access to more than bandit feedback, as each player could observe the gradient of its mixed-extension cost function at each time step.) The work of [2] analyzed this non-convergence through the lens of the Hamiltonian and the potential of a game and proposed convergent algorithms assuming access to exact first- and second-order oracles (gradients and Hessians) of the players' objectives. More recently, motivated by the Hamiltonian analogy of zero-sum games, [1] extended the analysis of [13] to the discrete-time setting and showed that sequential gradient descent can overcome the divergence of discretized simultaneous gradient descent in these games.

The mixed extension of a finite-action game falls into the category of convex games. In such games, each player's objective is convex in her own decision variable for any fixed action of the other players (in mixed extensions of finite-action games, this objective is linear in each player's action). Furthermore, the strategy sets are convex (simplexes in mixed-extension games). It is known that the Nash equilibria of convex differentiable games correspond to the solution set of a variational inequality problem. Hence, one can use algorithms for solving variational inequalities to compute Nash equilibria. This connection has been used in two lines of work addressing bandit learning of Nash equilibria in convex games. The works [18, 4] showed that no-regret learning can converge to Nash equilibria in a certain class of convex games. Both works leveraged the idea of one-point estimation of the gradient of the player's cost function using the bandit payoff information. Whereas [4] performed this gradient estimation by perturbing the played actions with a point sampled from the uniform distribution on the unit sphere, motivated by Stokes' theorem [6, 9], the approach in [18] formed these gradient estimates using the Normal distribution and smoothing, motivated by [20, 14]. In both cases, convergence was proven by appropriately choosing the stepsizes to trade off the bias and variance of the resulting estimation terms in the stochastic approximation procedure.
The above works on learning Nash equilibria rely on strict monotonicity of the game mapping. The game mapping is the stacked vector of the gradients of each player's objective with respect to her own actions. In case the game is potential, the game mapping is symmetric and thus corresponds to the gradient of a single function, the so-called potential function of the game. Hence, finding equilibria can be cast as a single-objective optimization, and bandit algorithms can readily be applied to learn Nash equilibria. In general noncooperative games, however, the game mapping need not be symmetric. In such cases, a necessary and sufficient condition for convergence of no-regret algorithms such as those explored in [18, 4, 13] is strict monotonicity of the game mapping.

Strict monotonicity of the game mapping is a stringent condition. The non-convergence issue, and in particular the cyclic behavior discussed in [13, 2, 1], is due to the fact that the game mappings of several games do not satisfy strict monotonicity. Zero-sum games, for example, are only monotone. This class of games has been widely adopted in robust optimization and control, in models of perfect competition, and more recently in adversarial training and deep learning. In addition to zero-sum games, generalized Nash equilibrium problems, that is, Nash equilibria with coupling constraints, can at best have monotone extended game mappings due to the coupling constraints. Generalized Nash equilibrium problems arise in several domains where a resource must be shared between agents and this sharing can only be formulated as a hard rather than a soft constraint [5]. The relevance of games with merely monotone mappings motivates our paper on learning Nash equilibria under bandit feedback in this game class.

Given the connection of Nash equilibria of convex games with the solution set of a variational inequality (VI) problem, a natural starting point for learning Nash equilibria of merely monotone games is to search for algorithms for solving the corresponding VIs. This topic is well explored in [16, Chapter 12], where several approaches, including extra-gradient and Tikhonov regularization, for finding solutions of merely monotone or pseudo-monotone VIs are proposed. All the proposed approaches, though, require at least first-order (gradient) feedback. The challenge with generalizing these approaches to zeroth-order (bandit) feedback is that they require coordination among the players and are not suitable in the bandit feedback setting, because a player cannot sample her objective function at different points while ensuring the actions of other players remain fixed. For example, the extra-gradient algorithm in [12] remedies the cyclic behavior of trajectories in monotone games. However, it is only applicable given an exact first-order feedback oracle. On the other hand, standard Tikhonov regularization requires a double iterative procedure, where the players would solve a regularized VI corresponding to a regularized game mapping in each inner iteration. Here, solving the inner VI itself would require an iterative algorithm, and it is not clear how the players should coordinate the stopping time for the inner algorithm in addition to setting the regularization parameter. Motivated by generalizing the algorithms for learning Nash equilibria, in our recent work [19] we proposed an approach to learn Nash equilibria in merely monotone games. Our approach was inspired by the single time-scale Tikhonov regularization algorithm for solving stochastic variational inequalities [11].
In contrast to the above work, which assumed access to noisy gradients, we considered the single-point estimation of gradients using the bandit feedback. The main contribution was to appropriately control the bias and variance introduced by the single-point estimation, along with the Tikhonov regularization parameter, to ensure convergence. Our proposed algorithm was not applicable to online learning because the played actions were not bound to lie in the feasible set. Our current paper fills this gap by providing a convergent algorithm while ensuring the query points do lie in the feasible set.

Our contributions in this paper are as follows. We develop, to the best of our knowledge, the first bandit approach for online learning of Nash equilibria in convex games with merely monotone game mappings. In doing so, we propose a novel single time-scale double regularization: this double regularization corresponds to the Tikhonov regularization and, at the same time, the projection of the actions onto a shrunk feasible set. By properly managing the interplay between the choice of the regularization sequence, the radius of shrinkage of the feasible set, and the stepsize, we ensure that the bias and variance of the resulting stochastic approximation vanish with appropriate rates. The choice of parameters is stated in Assumption 6, whereas the main convergence result is stated in Theorem 2. In terms of proof techniques, a few main novelties lead to this theorem: first, showing that the doubly regularized Tikhonov sequence, as well as a single time-scale approach to solve for it, remains bounded and converges to the least-norm solution of the variational inequality (see Theorem 3 and Proposition 1, respectively); second, showing that the iterates of our algorithm remain bounded (see Lemma 5). These results enable us to apply well-established results on boundedness and convergence of stochastic processes to our setup. In summary, we prove convergence in probability of the sequence of actions to a Nash equilibrium in monotone games.
Notations.
The set {1, . . . , N} is denoted by [N]. Boldface is used to distinguish between vectors in a multi-dimensional space and scalars. Given N vectors x^i ∈ R^d, i ∈ [N], (x^i)_{i=1}^N := (x^{1⊤}, . . . , x^{N⊤})^⊤ ∈ R^{Nd}; x^{−i} := (x^1, . . . , x^{i−1}, x^{i+1}, . . . , x^N) ∈ R^{(N−1)d}. R^d_+ and Z_+ denote, respectively, the vectors from R^d with non-negative coordinates and the non-negative whole numbers. The standard inner product on R^d is denoted by (·, ·): R^d × R^d → R, with associated norm ‖x‖ := √(x, x). Given a matrix A ∈ R^{d×d}, A ⪰ (≻) 0 if and only if x^⊤Ax ≥ (>) 0 for all x ≠ 0. We use the big-O notation; that is, the function f(x): R → R is O(g(x)) as x → a, written f(x) = O(g(x)) as x → a, if lim_{x→a} |f(x)|/|g(x)| ≤ K for some positive constant K. With a slight abuse of notation, we write f(x) ≤ O(g(x)) as we estimate certain bounds. We say that a function f(x) grows not faster than a function g(x) as x → ∞ if there exists a positive constant Q such that f(x) ≤ g(x) for all x with ‖x‖ ≥ Q. For x ∈ R^n and a convex closed set C ⊂ R^n, Proj_C x denotes the projection of x onto C.

Definition 1:
A mapping M: R^d → R^d is monotone over X ⊆ R^d if (M(x) − M(y), x − y) ≥ 0 for every x, y ∈ X.

II. PROBLEM FORMULATION
Consider a game Γ(N, {A^i}, {J_i}) with N players, the sets of players' actions A^i ⊆ R^d, i ∈ [N], and the cost (objective) functions J_i: A → R, where A = A^1 × · · · × A^N denotes the set of joint actions. We restrict the class of games as follows.

Assumption 1:
The game under consideration is convex. Namely, for all i ∈ [N] the set A^i is convex and closed, and the cost function J_i(a^i, a^{−i}) is defined on R^{Nd}, continuously differentiable in a, and convex in a^i for each fixed a^{−i}.

Assumption 2:
The mapping M: R^{Nd} → R^{Nd}, referred to as the game mapping, defined by

M(a) = (∇_{a^i} J_i(a^i, a^{−i}))_{i=1}^N = (M_1(a), . . . , M_N(a))^⊤,
where M_i(a) = (M_{i,1}(a), . . . , M_{i,d}(a))^⊤ and
M_{i,k}(a) = ∂J_i(a)/∂a^i_k, a ∈ A, i ∈ [N], k ∈ [d], (1)

is monotone on A (see Definition 1).

We consider a Nash equilibrium in the game Γ(N, {A^i}, {J_i}) as a stable solution outcome because it represents a joint action from which no player has any incentive to unilaterally deviate.
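To make the game mapping and Assumption 2 concrete, the following minimal sketch (an illustration, not taken from the paper) builds M for a hypothetical two-player zero-sum game with scalar actions, J_1(a) = a^1 a^2 and J_2(a) = −a^1 a^2, and checks the inequality of Definition 1 on random pairs. Here M is skew-symmetric, so the monotonicity gap is identically zero: the game is monotone but not strictly monotone.

```python
import numpy as np

# Hypothetical zero-sum game with scalar actions (N = 2, d = 1):
# J_1(a) = a^1 * a^2, J_2(a) = -a^1 * a^2.
def game_mapping(a):
    a1, a2 = a
    return np.array([a2, -a1])  # (dJ_1/da^1, dJ_2/da^2), as in (1)

# Empirical check of Definition 1: (M(x) - M(y), x - y) >= 0.
rng = np.random.default_rng(0)
gaps = [np.dot(game_mapping(x) - game_mapping(y), x - y)
        for x, y in (rng.normal(size=(2, 2)) for _ in range(1000))]
print(min(gaps))  # prints 0.0: monotone, but not strictly monotone
```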
Definition 2: A point a* ∈ A is called a Nash equilibrium if for any i ∈ [N] and any a^i ∈ A^i,

J_i(a^{i*}, a^{−i*}) ≤ J_i(a^i, a^{−i*}).

Our goal is to learn such a stable action in a game through designing a payoff-based algorithm. To do so, we first connect the existence of Nash equilibria of Γ(N, {A^i}, {J_i}) with the solution set of a corresponding variational inequality problem.

Definition 3:
Consider a mapping T(·): R^d → R^d and a set Y ⊆ R^d. The solution set SOL(Y, T) of the variational inequality problem VI(Y, T) is the set of vectors y* ∈ Y such that (T(y*), y − y*) ≥ 0 for all y ∈ Y.

Theorem 1: ([16, Proposition 1.4.2]) Given a game Γ(N, {A^i}, {J_i}) with game mapping M, suppose that the action sets {A^i} are closed and convex and the cost functions {J_i} are continuously differentiable in a and convex in a^i for every fixed a^{−i} on the interior of A. Then, a vector a* ∈ A is a Nash equilibrium of Γ if and only if a* ∈ SOL(A, M).

It follows that under Assumptions 1 and 2, for a game with mapping M, any solution of VI(A, M) is also a Nash equilibrium of the game and vice versa. While Γ(N, {A^i}, {J_i}) under Assumptions 1 and 2 might admit a Nash equilibrium, these two assumptions alone do not guarantee uniqueness of a Nash equilibrium. More restrictive assumptions are needed for uniqueness, for example, strong monotonicity of the game mapping or compactness of the action sets [16]. Here, we do not restrict our attention to such cases. However, to have a meaningful discussion, we do assume existence of at least one Nash equilibrium in the game.

Assumption 3:
The set SOL(A, M) is not empty.

Corollary 1:
Let Γ(N, {A^i}, {J_i}) be a game with game mapping M for which Assumptions 1, 2, and 3 hold. Then, there exists at least one Nash equilibrium in Γ. Moreover, any Nash equilibrium in Γ belongs to the set SOL(A, M).

The following additional assumptions are needed for convergence of the proposed payoff-based algorithm to a Nash equilibrium (see the proofs of Lemma 5 and Theorem 2).

Assumption 4:
Each element M_i of the game mapping M: R^{Nd} → R^{Nd} defined in Assumption 2 is Lipschitz continuous on R^{Nd} with a Lipschitz constant L_i.

Assumption 5:
Each cost function J_i(a), i ∈ [N], grows at most polynomially in a as ‖a‖ → ∞. Moreover, in the case of an unbounded joint action set A, each continuous cost function J_i(a), i ∈ [N], grows at most linearly in a as ‖a‖ → ∞.

Remark 1:
Note that if the set A is unbounded, Assumption 5 is equivalent to each cost function J_i(a), i ∈ [N], being Lipschitz continuous on R^{Nd} with some constant l_i. Thus, for both bounded and unbounded A, we denote by l = max_{i∈[N]} l_i the uniform upper bound on the norm of the mapping M over A.

For the development and analysis of our algorithm, we use the following well-established and easy to verify result.

Lemma 1:
Consider a mapping T(·): R^d → R^d and a convex closed set Y ⊆ R^d. Given θ > 0,

y* ∈ SOL(Y, T) ⟺ y* = Proj_Y(y* − θT(y*)). (2)
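Lemma 1 also suggests a practical optimality test: the residual ‖y − Proj_Y(y − θT(y))‖ vanishes exactly at the solutions of VI(Y, T). A minimal sketch (illustrative, not from the paper) for a box-shaped Y, where the projection is a coordinate-wise clip:

```python
import numpy as np

# Fixed-point residual from (2) for a box Y = [lo, hi]^d.
def proj_box(y, lo=-1.0, hi=1.0):
    return np.clip(y, lo, hi)

def vi_residual(T, y, theta=1.0):
    # Zero iff y solves VI(Y, T), by Lemma 1.
    return np.linalg.norm(y - proj_box(y - theta * T(y)))

# Example with the monotone map T(y) = (y_2, -y_1): y* = 0 solves the VI.
T = lambda y: np.array([y[1], -y[0]])
print(vi_residual(T, np.zeros(2)))            # 0.0
print(vi_residual(T, np.array([0.5, 0.5])))   # positive: not a solution
```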
III. PAYOFF-BASED ALGORITHM

Given online payoff-based information, also referred to as online bandit or zeroth-order oracle information, each agent has access to its current action, referred to as its state and denoted by x^i(t) = (x^i_1(t), . . . , x^i_d(t))^⊤ ∈ R^d, and plays the action a^i(t) = Proj_{A^i}(x^i(t)) at iteration t. After that, the cost value Ĵ_i(t) at the joint action a(t) = (a^1(t), . . . , a^N(t)) ∈ A, namely Ĵ_i(t) = J_i(a(t)), is revealed to each agent i. Given these pieces of information, in the proposed algorithm each agent i "mixes" its next state x^i(t+1). Namely, it chooses x^i(t+1) randomly according to the multidimensional normal distribution N(μ^i(t+1) = (μ^i_1(t+1), . . . , μ^i_d(t+1))^⊤, σ²_{t+1}I) with the density function

p_i(x^i; μ^i(t+1), σ_{t+1}) = (1/(√(2π) σ_{t+1})^d) exp{−Σ_{k=1}^d (x^i_k − μ^i_k(t+1))²/(2σ²_{t+1})}.

The initial values of the means μ^i(0), i ∈ [N], can be set to any finite value. The successive means are updated as follows:

μ^i(t+1) = Proj_{(1−r_t)A^i}[μ^i(t) − γ_t σ_t² (Ĵ_i(t) (x^i(t) − μ^i(t))/σ_t² + ε_t μ^i(t))]. (3)

In the above, (1−r_t)A^i = {x ∈ A^i : dist(x, ∂A^i) ≥ r_t}, where 0 < r_t < 1 is a time-dependent shrinkage parameter, γ_t is the stepsize parameter, and ε_t > 0 is a regularization parameter. The convergence of the algorithm depends on the interplay of these parameters and the variance parameter σ_t > 0.

The difference between the proposed approach and that of [18] is the additional term ε_t in (3). In the absence of this term, the algorithm would converge only if the game mapping were strictly monotone (see [18, Theorem 2] and counterexamples in [7, 13]). Moreover, in distinction from [19], in the bandit online feedback considered here players can evaluate their costs only over their feasible action sets A^i and not over the whole R^{Nd}, necessitating the additional projection a^i(t) = Proj_{A^i}(x^i(t)) and the shrinkage radius r_t. As such, the previous convergence analysis does not apply.
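The sketch below implements one step of (3), assuming scalar actions (d = 1) and A^i = [−1, 1] for every player, so that both Proj_{A^i} and Proj_{(1−r_t)A^i} reduce to clips; `costs` is a placeholder returning the bandit feedback (J_1(a), . . . , J_N(a)), and the parameter schedules come from Assumption 6 below.

```python
import numpy as np

rng = np.random.default_rng(1)

def step(mu, costs, gamma_t, sigma_t, eps_t, r_t):
    x = rng.normal(mu, sigma_t)               # states x^i(t) ~ N(mu^i(t), sigma_t^2)
    a = np.clip(x, -1.0, 1.0)                 # played actions a^i(t) = Proj_{A^i} x^i(t)
    J_hat = costs(a)                          # revealed cost values J_i(a(t))
    grad_est = J_hat * (x - mu) / sigma_t**2  # one-point gradient estimates
    mu_next = np.clip(mu - gamma_t * sigma_t**2 * (grad_est + eps_t * mu),
                      -(1.0 - r_t), 1.0 - r_t)  # projection onto (1 - r_t)A^i
    return mu_next, a
```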
Before stating the convergence result, let us provide insight into the procedure defined by Equation (3) by deriving an analogy to a regularized stochastic gradient algorithm. Let p(x; μ, σ) = ∏_{i=1}^N p_i(x^i_1, . . . , x^i_d; μ^i, σ) denote the density function of the joint distribution of the agents' states. Given σ > 0, for any i ∈ [N] define J̃_i: R^{Nd} → R as

J̃_i(μ^1, . . . , μ^N, σ) = ∫_{R^{Nd}} J_i(x) p(x; μ, σ) dx. (4)

Thus, J̃_i, i ∈ [N], is the i-th player's cost function in mixed strategies. Let μ(t) = (μ^1(t), . . . , μ^N(t)) and, for i ∈ [N], define M̃_i(·) = (M̃_{i,1}(·), . . . , M̃_{i,d}(·))^⊤ as the d-dimensional mapping with the elements

M̃_{i,k}(μ, σ) = ∂J̃_i(μ, σ)/∂μ^i_k, for k ∈ [d]. (5)

Our first lemma below shows that the second term inside the projection in (3) is a sample of the gradient of agent i's cost function in mixed strategies.

Lemma 2: Under Assumptions 1 and 5,

M̃_{i,k}(μ(t), σ_t) = E{J_i(x^1(t), . . . , x^N(t)) (x^i_k(t) − μ^i_k(t))/σ_t² | x^j(t) ∼ N(μ^j(t), σ_t²I), j ∈ [N]}, i ∈ [N], k ∈ [d]. (6)

Moreover, M̃_{i,k}(μ, σ) is bounded for any μ ∈ A.

The proof of this lemma is very similar to that of [19] and is provided in Appendix B. The lemma above implies that, had we used the term J_i(x^1(t), . . . , x^N(t))(x^i_k(t) − μ^i_k(t))/σ_t² in (3), we could perform a one-point estimation of the gradient of the cost functions in mixed strategies. In the bandit setting considered here, however, we use the term J_i(a^1(t), . . . , a^N(t))(x^i_k(t) − μ^i_k(t))/σ_t². Despite this difference, in the analysis (see (22), (23)) we show that the difference between these two terms converges to zero due to the shrinkage radius selection. Thus, algorithm (3) can be interpreted as a doubly regularized (due to r_t and ε_t) stochastic projection algorithm.

Following the above interpretation, our main result is Theorem 2 below, where we show that by appropriately choosing the algorithm parameters we can bound the bias and variance terms of the stochastic projection and consequently establish convergence of the iterates μ(t) in (3) to a Nash equilibrium.
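As a quick numerical illustration of Lemma 2 (not from the paper), one can verify (6) by Monte Carlo for a single player with J(x) = x² and d = 1: here J̃(μ, σ) = μ² + σ², so the exact mixed-strategy gradient is 2μ, and the one-point estimator J(x)(x − μ)/σ² matches it in expectation.

```python
import numpy as np

# Monte Carlo check of (6) for J(x) = x^2: dJ~/dmu = 2*mu exactly.
rng = np.random.default_rng(2)
mu, sigma = 0.7, 0.3
x = rng.normal(mu, sigma, size=2_000_000)
one_point = np.mean(x**2 * (x - mu) / sigma**2)
print(one_point, 2 * mu)  # both print approximately 1.4
```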
Assumption 6: Let β_t = γ_tσ_t² and choose γ_t = 1/t^{a₁}, σ_t = 1/t^{a₂}, ε_t = 1/t^{a₃}, and r_t = 1/t^{a₄}, with a₁, a₂, a₃, a₄ > 0, such that
i) Σ_t β_t = ∞, Σ_t β_tε_t = ∞;
ii) Σ_t [(ε_{t−1} − ε_t)²/(β_tε_t³) + (r_{t−1} − r_t)²/(β_tε_t⁶)] < ∞;
iii) Σ_t γ_t² < ∞, Σ_t β_tσ_t < ∞;
iv) lim_{t→∞} r_t/σ_t = ∞, lim_{t→∞} r_t/ε_t = 0.
As an example of parameters satisfying Assumption 6, one can take a₁ = 0.7, a₂ = 0.125, a₃ = 0.02, a₄ = 0.06 (see Lemma 6 in Section V).
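The conditions of Assumption 6 reduce to inequalities on the exponents, made explicit in Lemma 6 of Section V. The snippet below (illustrative) checks those inequalities for the example exponents quoted above.

```python
# Sanity check of Lemma 6's sufficient conditions (Section V) for the
# example exponents (a1, a2, a3, a4) = (0.7, 0.125, 0.02, 0.06).
a1, a2, a3, a4 = 0.7, 0.125, 0.02, 0.06
checks = {
    "i)   a1 + 2*a2 <= 1":              a1 + 2*a2 <= 1,
    "i)   a1 + 2*a2 + a3 <= 1":         a1 + 2*a2 + a3 <= 1,
    "ii)  a1 + 2*a2 + a3 < 1":          a1 + 2*a2 + a3 < 1,
    "ii)  a1 + 2*a2 + 6*a3 - 2*a4 < 1": a1 + 2*a2 + 6*a3 - 2*a4 < 1,
    "iii) a1 > 1/2":                    a1 > 0.5,
    "iii) a1 + 3*a2 > 1":               a1 + 3*a2 > 1,
    "iv)  a3 < a4 < a2":                a3 < a4 < a2,
}
for name, ok in checks.items():
    print(name, ok)  # all True
```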
Theorem 2: Let the players in the game Γ(N, {A^i}, {J_i}) choose the states {x^i(t)} at time t according to the normal distribution N(μ^i(t), σ_t²I), where the mean μ^i(0) is arbitrary and μ^i(t) is updated as in (3). Under Assumptions 1-6, as t → ∞, the mean vector μ(t) converges almost surely to a Nash equilibrium μ* = a* of the game Γ, and the joint action a(t) converges in probability to a*.

IV. ANALYSIS OF THE ALGORITHM
To prove Theorem 2, we first need to establish boundedness of the iterates μ(t) for the case in which the action space is unbounded. Having established boundedness, we can show that the limit of the iterates μ(t) exists and is the minimum-norm Nash equilibrium of the problem. This convergence is proven using existing results on convergence of sequences of random variables ([17, Lemma 10, page 49]). For ease of reference, we provide the supporting statements used for boundedness [15, Theorem 2.5.2] and for convergence [17, Lemma 10, page 49] of the iterates in Appendix A.

A. Characterizing the terms in the algorithm
We first show that algorithm (3) can be interpreted within the framework of the well-studied Robbins-Monro stochastic approximation procedures [3], where the iterates are updated according to stochastic gradient descent. In our case, bias of the game mapping arises due to each player's one-point estimation of its gradient. However, in contrast to a stochastic approximation procedure, the game mapping is in general not the gradient of a single function (as its derivative is not symmetric) unless the game is potential. Furthermore, there are additional terms in the algorithm iterates due to the projection of the query points onto the shrunk feasible set and the regularization required to handle the non-strictly monotone game mapping. Let us specify all these terms below.

Using the notation M_i(·) = (M_{i,1}(·), . . . , M_{i,d}(·)), we can rewrite the algorithm step in (3) in the following form:

μ^i(t+1) = Proj_{(1−r_t)A^i}[μ^i(t) − γ_tσ_t² (M_i(μ(t)) + Q_i(μ(t), σ_t) + R_i(x(t), μ(t), σ_t) + P_i(x(t), μ(t), σ_t) + ε_tμ^i(t))], (7)

for all i ∈ [N], where Q_i, R_i, P_i are

Q_i(μ(t), σ_t) = M̃_i(μ(t), σ_t) − M_i(μ(t)),
R_i(x(t), μ(t), σ_t) = F_i(x(t), μ(t), σ_t) − M̃_i(μ(t), σ_t),
F_i(x(t), μ(t), σ_t) = J_i(x(t)) (x^i(t) − μ^i(t))/σ_t²,
P_i(x(t), μ(t), σ_t) = ((x^i(t) − μ^i(t))/σ_t²) (J_i(a(t)) − J_i(x(t))).

Above, M(μ(t)) = (M_1(μ(t)), . . . , M_N(μ(t))) corresponds to the gradient term in stochastic approximation procedures. The mapping M̃_i evaluated at μ(t) is equivalent to the i-th component of the game mapping in mixed strategies [19]. That is,

M̃_i(μ(t), σ_t) = ∫_{R^{Nd}} M_i(x) p(x; μ(t), σ_t) dx. (8)

Thus, Q(μ(t), σ_t) = (Q_1(μ(t), σ_t), . . . , Q_N(μ(t), σ_t)) can be interpreted as a disturbance of the gradient. Furthermore, according to (6), we have

R_i(x(t), μ(t), σ_t) = F_i(x(t), μ(t), σ_t) − E_{x(t)}{F_i(x(t), μ(t), σ_t)}, i ∈ [N]. (9)

The notation E_{x(t)}{·} is used to emphasize that the expectation is taken with respect to x(t), which has the normal distribution with mean μ(t) and covariance matrix σ²(t)I, where I is the identity matrix. Thus, R(x(t), μ(t), σ_t) = (R_1(x(t), μ(t), σ_t), . . . , R_N(x(t), μ(t), σ_t)) below is a martingale difference. Finally, due to the projection of the query points, the term P(x(t), μ(t), σ_t) = (P_1(x(t), μ(t), σ_t), . . . , P_N(x(t), μ(t), σ_t)) is the vector of the differences between the gradient estimation based on the state x(t) ∈ R^{Nd} and the one based on the played action a(t) ∈ A. Our goal is to bound each of the terms above to ensure convergence and boundedness of the iterates. First, however, we need to account for the regularization terms ε_t, r_t.

B. Analyzing the modified Tikhonov sequence
In contrast to stochastic approximation algorithms and the proof in [18], we have an additional term ε_tμ(t) in order to address merely monotone game mappings. As such, to bound μ(t) we also relate the variations of the sequence μ(t) to those of the modified Tikhonov sequence defined below. Let y(t) = (y^1(t), . . . , y^N(t)) denote the solution of the variational inequality VI((1−r_t)A, M(y) + ε_ty), namely

y(t) ∈ SOL((1−r_t)A, M(y) + ε_ty). (10)

The Tikhonov sequence corresponds to the solution of the variational inequality above with r_t = 0. Thus, the sequence {y(t)} can be considered a modified Tikhonov sequence. Similar to the Tikhonov sequence, y(t) enjoys the following important property.

Theorem 3:
Under Assumptions 2, 3, and 4, y(t) defined in (10) exists and is unique for each t. Moreover, for ε_t ↓ 0 and r_t → 0 with lim_{t→∞} r_t/ε_t = 0, y(t) is uniformly bounded and converges to the least-norm solution of VI(A, M).

The significance of the above theorem is that if we can establish that μ(t) converges to y(t), then from Theorem 1 we can establish convergence to a Nash equilibrium of the game.
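The following sketch (an illustrative example, not from the paper) visualizes Theorem 3 for the monotone but not strictly monotone map M(y) = (y₁ + y₂ − 1)(1, 1)^⊤ on A = [−1, 1]², whose VI solutions form the line y₁ + y₂ = 1 with least-norm solution (0.5, 0.5). The regularized solutions, computed here by an auxiliary fixed-ε projection iteration, approach this point as ε and r/ε vanish.

```python
import numpy as np

# Monotone (singular) map whose VI solution set is the line y1 + y2 = 1.
def M(y):
    s = y[0] + y[1] - 1.0
    return np.array([s, s])

def regularized_vi_solution(eps, r, iters=20000, step=0.05):
    # Solves VI((1-r)A, M + eps*I) on A = [-1, 1]^2 by projection iterations.
    y = np.zeros(2)
    for _ in range(iters):
        y = np.clip(y - step * (M(y) + eps * y), -(1 - r), 1 - r)
    return y

for eps, r in [(0.5, 0.25), (0.1, 0.01), (0.01, 1e-4)]:
    print(eps, r, regularized_vi_solution(eps, r))
# The solutions (1/(2+eps), 1/(2+eps)) approach (0.5, 0.5) as eps, r/eps -> 0.
```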
To prove Theorem 3, we first establish a useful property of the projections onto the sets (1−r_t)A, t = 1, 2, . . . .

Lemma 3: For any x ∈ R^{Nd} the following holds:

‖Proj_{(1−r_{t−1})A} x − Proj_{(1−r_t)A} x‖ = O(|r_{t−1} − r_t|).

Please see Appendix C for the proof of the above lemma.
Proof: (of Theorem 3) Let a° be the least-norm solution of VI(A, M), and let a°_p be the projection of a° onto the set (1−r_t)A. Next, let y(t) be the unique solution of the doubly regularized variational inequality, namely y(t) ∈ SOL((1−r_t)A, M + ε_tI). Thus, we conclude that

(M(a°), y(t) − a°) ≥ 0, (M(y(t)) + ε_ty(t), a°_p − y(t)) ≥ 0.

Hence, taking into account monotonicity of M, we obtain

0 ≤ (M(a°), y(t) − a°) + (M(y(t)) + ε_ty(t), a° − y(t)) + (M(y(t)) + ε_ty(t), a°_p − a°)
= −(M(a°) − M(y(t)), a° − y(t)) + ε_t(y(t), a° − y(t)) + (M(y(t)) + ε_ty(t), a°_p − a°)
≤ ε_t(y(t), a°) − ε_t‖y(t)‖² + (M(y(t)), a°_p − a°) + ε_t(y(t), a°_p − a°).

Hence,

ε_t‖y(t)‖² ≤ ε_t(y(t), a°) + (M(y(t)), a°_p − a°) + ε_t(y(t), a°_p − a°)
≤ ε_t‖y(t)‖‖a°‖ + l‖a°_p − a°‖ + ε_t‖y(t)‖‖a°_p − a°‖
= ε_t‖y(t)‖‖a°‖ + lO(r_t) + ε_t‖y(t)‖O(r_t),

where in the first inequality we used Remark 1 and in the second one we applied Lemma 3. Hence,

‖y(t)‖² ≤ ‖y(t)‖‖a°‖ + lO(r_t/ε_t) + ‖y(t)‖O(r_t).

By taking the upper limit as t → ∞ in the inequality above, and due to the settings of ε_t and r_t, we obtain lim sup_{t→∞}‖y(t)‖² ≤ ‖a°‖ lim sup_{t→∞}‖y(t)‖ + lim sup_{t→∞}‖y(t)‖ lim_{t→∞}O(r_t). This implies that lim sup_{t→∞}‖y(t)‖ ≤ ‖a°‖ and, thus, the sequence ‖y(t)‖ is upper bounded. Moreover, any accumulation point of this sequence is bounded above by the Euclidean norm of the least-norm solution of VI(A, M). Hence, according to the fact that the function Proj_{(1−r)A}(x) is continuous in both r and x, y(t) converges to the least-norm solution of VI(A, M). ∎

Since our goal now is to relate μ(t) to y(t), aligned with procedure (3), we now design a one-time-scale approach to solving (10), as per (11) below:

z(t+1) = Proj_{(1−r_t)A}[z(t) − β_t(M(z(t)) + ε_tz(t))], (11)

where β_t is defined in Assumption 6. We show that the procedure above is a one-time-scale algorithm and, similarly to y(t), it converges to the least-norm solution of VI(A, M).
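A minimal sketch of the single time-scale procedure (11) on the same illustrative map as above; the decay exponents and the constant in r_t are chosen only to make lim r_t/ε_t = 0 visible in a short deterministic run and are not the tuned rates of Assumption 6.

```python
import numpy as np

def M(z):
    s = z[0] + z[1] - 1.0
    return np.array([s, s])

z = np.array([1.0, -0.5])
for t in range(1, 200_001):
    beta_t = 1.0 / t**0.7
    eps_t  = 1.0 / t**0.2
    r_t    = 0.5 / t**0.3          # r_t / eps_t -> 0, shrunk set non-empty
    z = np.clip(z - beta_t * (M(z) + eps_t * z), -(1 - r_t), 1 - r_t)
print(z)  # roughly (0.48, 0.48): z tracks y(t), which tends to (0.5, 0.5)
```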
Proposition 1: The sequence z(t) defined by (11) converges to the least-norm solution of VI(A, M).

To prove the result above, we bound ‖z(t+1) − y(t)‖² in terms of the previous terms in the sequence, namely ‖z(t) − y(t−1)‖², and show that [17, Lemma 10, page 49] on convergence of a random sequence applies. To do so, though, we first need to bound the variations of y(t) as below.

Lemma 4:
Under Assumptions 2, 4, and 6, the modified Tikhonov sequence y(t) defined in (10) satisfies

‖y(t) − y(t−1)‖ = O((ε_{t−1} − ε_t)/ε_t + (r_{t−1} − r_t)/ε_t^{5/2}).

Please see Appendix D for the proof.

In summary, the results of this section enable us to prove Proposition 1. This proposition serves as the main new result, in comparison to non-regularized stochastic gradient procedures, in order to show almost-sure boundedness of ‖μ(t)‖ and the convergence of the algorithm to a Nash equilibrium.

C. Boundedness of the iterates

Lemma 5:
Let Assumptions 2-6 hold in Γ(N, {A^i}, {J_i}) and let μ(t) be the vector updated in the run of the payoff-based algorithm (3). Then, Pr{sup_{t≥0} ‖μ(t)‖ < ∞} = 1.

Proof:
If the set A is compact, then, according to the update in (7), the norm of the vector μ(t) is bounded for all t. So, let us consider the case of unbounded A.

Define V(t, μ) = ‖μ − z(t)‖², where z(t) is given in (11). Consider the generating operator of the Markov process μ(t):

LV(t, μ) = E[V(t+1, μ(t+1)) | μ(t) = μ] − V(t, μ).

We aim to show that LV(t, μ) satisfies the following decay:

LV(t, μ) ≤ −α(t+1)ψ(μ) + φ(t)(1 + V(t, μ)), (12)

where ψ, φ, and α are terms arising from Q, R, and P in (7). Our goal is to show that ψ ≥ 0 on R^{Nd}, φ(t) > 0 for all t, Σ_{t=0}^∞ φ(t) < ∞, α(t) > 0, Σ_{t=0}^∞ α(t) = ∞. This, combined with the boundedness of the iterates z(t) stated in Proposition 1, enables us to apply Theorem 2.5.2 in [15] to conclude almost sure boundedness of μ(t).

From now on, for simplicity of notation, we omit the argument σ_t in the terms M̃, Q, and R. In certain derivations, for the same reason, we omit the time parameter t as well. Let us analyze each term i = 1, . . . , N in

V(t+1, μ(t+1)) = Σ_{i=1}^N ‖μ^i(t+1) − z^i(t+1)‖².

From the procedures for the updates of μ(t) and z(t) and the non-expansion property of the projection operator, we obtain

‖μ^i(t+1) − z^i(t+1)‖² ≤ ‖μ^i(t) − z^i(t) − β_t[ε_t(μ^i(t) − z^i(t)) + (M_i(μ(t)) − M_i(z(t)) + Q_i(μ(t)) + R_i(x(t), μ(t)) + P_i(x(t), μ(t)))]‖²
= ‖μ^i(t) − z^i(t)‖² − 2β_t(M_i(μ(t)) − M_i(z(t)), μ^i(t) − z^i(t)) − 2β_tε_t(μ^i(t) − z^i(t), μ^i(t) − z^i(t)) − 2β_t(Q_i(μ(t)) + R_i(x(t), μ(t)), μ^i(t) − z^i(t)) − 2β_t(P_i(x(t), μ(t)), μ^i(t) − z^i(t)) + β_t²‖G_i(x(t), μ(t))‖², (13)

where, for ease of notation, we have defined

G_i(x(t), μ(t)) = ε_t(μ^i(t) − z^i(t)) + M_i(μ(t)) − M_i(z(t)) + Q_i(μ(t)) + R_i(x(t), μ(t)) + P_i(x(t), μ(t)). (14)
Note that the terms in ‖G_i(x(t), μ(t))‖² are given as

‖G_i(x(t), μ(t))‖² = ε_t²‖μ^i(t) − z^i(t)‖² + ‖M_i(μ(t)) − M_i(z(t))‖² + ‖Q_i(μ(t))‖² + ‖R_i(x(t), μ(t))‖² + ‖P_i(x(t), μ(t))‖² + 2(Q_i(μ(t)), R_i(x(t), μ(t))) + 2(P_i(x(t), μ(t)), R_i(x(t), μ(t))) + 2(Q_i(μ(t)), P_i(x(t), μ(t))) + 2ε_t(M_i(μ(t)) − M_i(z(t)), μ^i(t) − z^i(t)) + 2(ε_t(μ^i(t) − z^i(t)) + M_i(μ(t)) − M_i(z(t)), Q_i(μ(t)) + R_i(x(t), μ(t)) + P_i(x(t), μ(t))). (15)

Thus, accounting for the above, for (9), which implies E{R_i(x(t), μ(t)) | μ(t) = μ} = 0 for any μ, and for the Cauchy-Schwarz inequality, we get from (13):

E{‖μ^i(t+1) − z^i(t+1)‖² | μ(t) = μ}
≤ (1 − 2β_tε_t)‖μ^i − z^i(t)‖² − 2β_t(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) − 2β_t(Q_i(μ), μ^i − z^i(t)) + 2β_tE{‖P_i(x(t), μ(t))‖ | μ(t) = μ}‖μ^i − z^i(t)‖ + β_t²E{‖G_i(x(t), μ(t))‖² | μ(t) = μ}
≤ (1 − 2β_tε_t)‖μ^i − z^i(t)‖² − 2β_t(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) + 2β_t‖Q_i(μ)‖‖μ^i − z^i(t)‖ + 2β_tE{‖P_i(x(t), μ(t))‖ | μ(t) = μ}‖μ^i − z^i(t)‖ + β_t²ε_t²‖μ^i − z^i(t)‖² + β_t²[‖M_i(μ) − M_i(z(t))‖² + ‖Q_i(μ)‖² + E{‖R_i(x(t), μ(t))‖² + ‖P_i(x(t), μ(t))‖² | μ(t) = μ} + 2E{(P_i(x(t), μ(t)), R_i(x(t), μ(t))) | μ(t) = μ} + 2‖Q_i(μ)‖E{‖P_i(x(t), μ(t))‖ | μ(t) = μ} + 2(ε_t‖μ^i − z^i(t)‖ + ‖M_i(μ) − M_i(z(t))‖)(‖Q_i(μ)‖ + E{‖P_i(x(t), μ(t))‖ | μ(t) = μ})]. (16)

We proceed by estimating the terms in the inequality above. Due to Assumption 4, we conclude that

‖M_i(μ) − M_i(z(t))‖² ≤ L_i²‖μ − z(t)‖² = O(V(t, μ)),
(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) ≤ ‖M_i(μ) − M_i(z(t))‖‖μ^i − z^i(t)‖ ≤ L_i‖μ − z(t)‖‖μ^i − z^i(t)‖ = O(V(t, μ)). (17)

Let us analyze the terms containing the disturbance of the gradient, namely Q_i, in Equation (15). Since Q_i(μ(t)) = M̃_i(μ(t)) − M_i(μ(t)), due to Assumption 2 and Equation (8) we obtain

‖Q_i(μ)‖ = ‖∫_{R^{Nd}} [M_i(x) − M_i(μ)] p(x; μ, σ_t) dx‖ ≤ ∫_{R^{Nd}} ‖M_i(x) − M_i(μ)‖ p(x; μ, σ_t) dx ≤ ∫_{R^{Nd}} L_i‖x − μ‖ p(x; μ, σ_t) dx ≤ ∫_{R^{Nd}} L_i (Σ_{i=1}^N Σ_{k=1}^d |x^i_k − μ^i_k|) p(x; μ, σ_t) dx = O(σ_t), (18)

where the last equality is due to the fact that the first central absolute moment of a random variable with normal distribution N(μ, σ²) is O(σ). The estimation above implies, in particular, that for any μ ∈ A,

‖Q_i(μ)‖‖μ^i − z^i(t)‖ = O(σ_t)(1 + V(t, μ)), (19)
‖Q_i(μ)‖‖M_i(μ) − M_i(z(t))‖ ≤ L_i‖Q_i(μ)‖‖μ − z(t)‖ = O(σ_t)(1 + V(t, μ)). (20)

We next bound the martingale term ‖R_i(x(t), μ(t))‖:

E{‖R_i(x(t), μ(t))‖² | μ(t) = μ} ≤ Σ_{k=1}^d ∫_{R^{Nd}} J_i²(x) (x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx ≤ f_i(μ, σ_t)/σ_t² = O(1 + V(t, μ))/σ_t², (21)

where the first inequality is due to the fact that E(ξ − Eξ)² ≤ Eξ², taking into account (9), and the second inequality is due to Assumption 5, with f_i(μ, σ_t) being a quadratic function of μ and σ_t, i ∈ [N] (see Appendix E for more details).

We proceed by estimating the terms containing P_i(x(t), μ(t)).
For any μ ∈ A we have

E{‖P_i(x(t), μ(t))‖ | μ(t) = μ} = E{‖x^i(t) − μ^i‖ |J_i(x(t)) − J_i(a(t))|/σ_t²}
= Pr{x(t) ∈ R^{Nd}\A} E{‖x^i(t) − μ^i‖ |J_i(x(t)) − J_i(a(t))|/σ_t² | x(t) ∈ R^{Nd}\A}
≤ Pr{x(t) ∈ R^{Nd}\A} E{l‖x^i(t) − μ^i‖‖x(t) − μ‖/σ_t² | x(t) ∈ R^{Nd}\A}
= k₁ Pr{x(t) ∈ R^{Nd}\A}, for some k₁ > 0, (22)

where the inequality is due to Assumption 5, implying |J_i(x(t)) − J_i(a(t))| ≤ l‖x(t) − a(t)‖, and furthermore because ‖x(t) − a(t)‖ ≤ ‖x(t) − μ‖ for a(t) = Proj_A x(t).

Next, let us estimate Pr{x(t) ∈ R^{Nd}\A}. The idea is that, since x(t) is sampled from a Gaussian distribution with mean μ(t), x(t) concentrates around its mean μ(t) with high probability. Since the mean is projected onto a shrunk version of the set A, namely (1−r_t)A, by appropriately tuning r_t and σ_t we can ensure that x(t) stays within the original feasible set with high probability. Let O_{r_t}(μ) = {y ∈ R^{Nd} : ‖y − μ‖ < r_t} denote the r_t-neighborhood of the point μ. Hence, sup_{y∉O_{r_t}(μ)} −‖y − μ‖² = −r_t². Then, taking into account the fact that O_{r_t}(μ) is contained in A and r_t < 1, we obtain that for any t and any bounded σ̄ > σ_t:

Pr{x(t) ∈ R^{Nd}\A} ≤ Pr{x(t) ∈ R^{Nd}\O_{r_t}(μ)}
= ∫_{y∉O_{r_t}(μ)} (1/((2π)^{Nd/2}σ_t^{Nd})) exp{−‖y − μ‖²/(2σ_t²)} dy
= ∫_{y∉O_{r_t}(μ)} exp{−(‖y − μ‖²/2)(1/σ_t² − 1/σ̄²)} (σ̄^{Nd}/σ_t^{Nd}) (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy
≤ exp{−(r_t²/2)(1/σ_t² − 1/σ̄²)} (σ̄^{Nd}/σ_t^{Nd}) ∫_{y∉O_{r_t}(μ)} (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy
≤ k₂ e^{−r_t²/(2σ_t²)}/σ_t^{Nd} (23)

for some finite k₂ > 0. The last inequality holds because

∫_{y∉O_{r_t}(μ)} (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy ≤ 1

and, thus, there exists 0 < k₂ < ∞ such that

∫_{y∉O_{r_t}(μ)} e^{r_t²/(2σ̄²)} σ̄^{Nd} (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy ≤ k₂.

From (16), it now remains to bound the term E{(R_i(x(t), μ(t)), P_i(x(t), μ(t))) | μ(t) = μ}. According to the definitions of P_i and R_i, Remark 1, and the Cauchy-Schwarz inequality,

E{(R_i(x(t), μ(t)), P_i(x(t), μ(t))) | μ(t) = μ}
= E{(J_i(x(t))(x^i(t) − μ^i)/σ_t² − M̃_i(μ, σ_t), ((x^i(t) − μ^i)/σ_t²)(J_i(a(t)) − J_i(x(t))))}
≤ ‖M̃_i(μ, σ_t)‖E{‖P_i(x(t), μ)‖} − E{J_i(x(t))(J_i(a(t)) − J_i(x(t)))‖x^i(t) − μ^i‖²/σ_t⁴}
≤ ‖M̃_i(μ, σ_t)‖E{‖P_i(x(t), μ)‖} + l Pr{x(t) ∈ R^{Nd}\A} E{|J_i(x(t))| ‖x(t) − μ‖ ‖x^i(t) − μ^i‖²/σ_t⁴}. (24)

Note that ‖M̃_i(μ, σ_t)‖ is bounded (Lemma 2), and that

E{|J_i(x(t))| ‖x(t) − μ‖ ‖x^i(t) − μ^i‖²/σ_t⁴} = h_i(μ, σ_t)/σ_t, (25)

where h_i(μ, σ_t) is a quadratic function of μ and σ_t, i ∈ [N] (see Appendix E for more details). Hence, due to the choice of the parameters r_t and σ_t (in particular, Assumption 6 iv)) and the estimations in (22)-(24), we conclude that the terms containing P_i are dominated by the other terms in the inequality (16).
Thus, by substituting (17)-(21) into (16), we obtain

E{‖μ^i(t+1) − z^i(t+1)‖² | μ(t) = μ} ≤ (1 − 2β_tε_t)‖μ^i − z^i(t)‖² − 2β_t(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) + 2β_tO(σ_t)(1 + V(t, μ)) + β_t²ε_t²V(t, μ) + O(γ_t²)(1 + V(t, μ)), (26)

where in the last inequality we used that ε_t → 0, γ_t → 0, and σ_t → 0 as t → ∞ (see Assumption 6). Thus, taking into account Assumption 6 iii), iv) and (26), we obtain

E[‖μ(t+1) − z(t+1)‖² | μ(t) = μ] = Σ_{i=1}^N E[‖μ^i(t+1) − z^i(t+1)‖² | μ(t) = μ]
≤ (1 − 2ε_tβ_t)‖μ − z(t)‖² − 2β_t(M(μ) − M(z(t)), μ − z(t)) + O(β_tσ_t + γ_t²)(1 + V(t, μ)). (27)

Thus,

LV(t, μ) ≤ −2β_t(M(μ) − M(z(t)), μ − z(t)) + O(β_tσ_t + γ_t²)(1 + V(t, μ)). (28)

According to Assumption 6 iii), Σ_{t=0}^∞ (β_tσ_t + γ_t²) < ∞. Furthermore, from Assumption 6 i), Σ_{t=0}^∞ β_t = ∞. Taking this into account, together with (28) and monotonicity of M, which implies

(M(μ) − M(z(t)), μ − z(t)) ≥ 0 for all t and all μ ∈ A, (29)

we conclude that LV(t, μ) satisfies the decay needed for the application of Theorem 2.5.2 in [15] and, consequently, μ(t) is finite almost surely for any t ∈ Z_+ irrespective of μ(0). ∎

D. Convergence to Nash equilibrium
We use the bound estimations of the previous section to prove convergence of the algorithm. In particular, we use Inequality (27), which bounds the decay of the sequence E[‖μ(t+1) − z(t+1)‖² | μ(t)] in terms of ‖μ − z(t)‖². We will show that this decay satisfies the conditions for applying Lemma 10 in [17]. From this, it can readily be inferred that the random variables ‖μ(t) − z(t)‖ converge to zero. First, however, let us verify that Inequalities (27), (28) hold in the compact-action case as well.

Remark 2:
If the set A is compact, then due to Assumption 5 the inequality (21) can be replaced by E{‖R_i(x(t), μ(t))‖² | μ(t) = μ} = O(1/σ_t²). Moreover, the inequalities (22) and (24) hold for the case of the bounded set A. Indeed, due to the polynomial behavior of J_i(x(t)) for large x(t), the terms E{‖x^i(t) − μ^i‖|J_i(x(t)) − J_i(a(t))|/σ_t²} and E{|J_i(x(t))(J_i(a(t)) − J_i(x(t)))|‖x^i(t) − μ^i‖²/σ_t⁴} are upper bounded by some constants. Thus, for the bounded set A, the inequality (27) can be rewritten as

E[‖μ(t+1) − z(t+1)‖² | μ(t) = μ] ≤ (1 − 2ε_tβ_t)‖μ − z(t)‖² − 2β_t(M(μ) − M(z(t)), μ − z(t)) + O(β_tσ_t + γ_t²).

Proof: (of Theorem 2) Note that we can rewrite (27) as

E[‖μ(t+1) − z(t+1)‖² | F_t] ≤ (1 − ε_tβ_t)‖μ(t) − z(t)‖² + O(γ_t² + β_tσ_t), (30)

where F_t is the σ-algebra generated by the random variables {x(k), μ(k)}_{k=0}^t. In (30) we used (29) and Lemma 5. From Assumption 6 and the choices of γ_t, σ_t, ε_t, we get O(γ_t² + β_tσ_t) = O(1/t^n) and ε_tβ_t = 1/t^m, with n > 1, m ≤ 1. Thus,

lim_{t→∞} O(γ_t² + β_tσ_t)/(ε_tβ_t) = 0.

This limit, the fact that Σ_{t=0}^∞ (γ_t² + β_tσ_t) < ∞, and the decay (30) imply that we can apply Lemma 10 in [17] to the sequence ‖μ(t+1) − z(t+1)‖² to conclude its almost sure convergence to 0 as t → ∞. Next, by taking into account Theorem 3 and Theorem 1, we obtain that

Pr{lim_{t→∞} μ(t) = a*} = 1,

where a* is the least-norm Nash equilibrium of the game Γ(N, {A^i}, {J_i}). Finally, Assumption 6 implies that lim_{t→∞} σ_t = 0. Taking into account that x(t) ∼ N(μ(t), σ_t²I) and lim_{t→∞} ‖a(t) − x(t)‖ = 0, we conclude that a(t) converges weakly to the Nash equilibrium a* = μ*. Moreover, according to the Portmanteau Lemma [10], this convergence is also in probability. ∎

V. DISCUSSION
In the proposed algorithm, convergence is established under mild conditions, as strict monotonicity of the game mapping is not required. This significantly extends the applicability of bandit online learning. For example, the zero-sum game considered in [13] with an interior Nash equilibrium satisfies the assumptions of our theorem. Whereas follow-the-regularized-leader (FTRL) learning approaches fail to converge in simple zero-sum games (such as matching pennies), our doubly regularized approach resolves this problem (see the sketch below). In general, examples of games that satisfy the assumptions above include mixed extensions of zero-sum games, Cournot competition, continuous-action congestion games, and convex potential games. On the other hand, mixed extensions of non-zero-sum games do not satisfy the monotonicity assumption in general.

In accordance with the payoff-based information structure, the parameters γ_t, σ_t, ε_t, r_t are independent of the problem data, including the Lipschitz constant of the game mapping or the constraint sets. Below, we further specify feasible choices of the parameters that ensure convergence.
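As an end-to-end illustration (a sketch under the stated assumptions, not the paper's experiments), the following code runs the payoff-based update (3) on the mixed extension of matching pennies, whose game mapping is skew-symmetric and hence merely monotone. It uses the example exponents given after Assumption 6, with the constant in r_t scaled so the shrunk set is non-empty from t = 1. Since ε_t decays slowly, at finite t the iterate tracks the regularized solution y(t) rather than the equilibrium itself.

```python
import numpy as np

# Mixed extension of matching pennies: player 1 plays heads with probability
# p, player 2 with probability q; J_1(p, q) = -(2p-1)(2q-1) = -J_2(p, q).
# The game mapping is skew-symmetric; the unique equilibrium is (1/2, 1/2).
rng = np.random.default_rng(3)

def costs(a):
    g = (2.0 * a[0] - 1.0) * (2.0 * a[1] - 1.0)
    return np.array([-g, g])                 # bandit feedback (J_1, J_2)

mu = np.array([0.9, 0.2])                    # initial means in A = [0, 1]^2
for t in range(1, 200_001):
    gamma_t, sigma_t = t**-0.7, t**-0.125    # exponents a1, a2
    eps_t, r_t = t**-0.02, 0.25 * t**-0.06   # a3, a4 (constant keeps r_t < 1/2)
    x = rng.normal(mu, sigma_t)              # states
    a = np.clip(x, 0.0, 1.0)                 # feasible played actions
    grad_est = costs(a) * (x - mu) / sigma_t**2
    mu = np.clip(mu - gamma_t * sigma_t**2 * (grad_est + eps_t * mu),
                 r_t, 1.0 - r_t)             # shrunk set [r_t, 1 - r_t]^2
print(mu)  # roughly (0.4, 0.6): tracks y(t), tending to (0.5, 0.5) as eps_t -> 0
```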
Lemma 6: A sufficient condition on 0 < a₁, a₂, a₃, a₄ < 1 for satisfying Assumption 6 is as follows:
i) a₁ + 2a₂ ≤ 1, a₁ + 2a₂ + a₃ ≤ 1;
ii) a₁ + 2a₂ + a₃ < 1, a₁ + 2a₂ + 6a₃ − 2a₄ < 1;
iii) a₁ > 1/2, a₁ + 3a₂ > 1;
iv) a₃ < a₄ < a₂.

Proof:
The series Σ_{t=0}^∞ 1/t^m converges for m > 1 and diverges otherwise. Thus, statements i), iii), and iv) above follow. To show statement ii), let us consider the term (ε_{t−1} − ε_t)² in the first summand of Assumption 6 ii), namely Σ_{t=0}^∞ (ε_{t−1} − ε_t)²/(β_tε_t³). We have

ε_{t−1} − ε_t = (t−1)^{−a₃} − t^{−a₃} (multiply by t^{a₃}/t^{a₃})
= ((1 − 1/t)^{−a₃} − 1)/t^{a₃} (apply a Taylor approximation)
= (1 + a₃/t + O(t^{−2}) − 1)/t^{a₃} = O(t^{−1−a₃}).

Combining the above with the denominator β_tε_t³, we obtain that Σ_t (ε_{t−1} − ε_t)²/(β_tε_t³) converges if a₁ + 2a₂ + a₃ < 1. Repeating the same analysis for (r_{t−1} − r_t)²/(β_tε_t⁶), we obtain that Σ_t (r_{t−1} − r_t)²/(β_tε_t⁶) converges if a₁ + 2a₂ + 6a₃ − 2a₄ < 1, and ii) is verified. ∎
VI. CONCLUSIONS

We designed an algorithm for learning Nash equilibria in convex games with monotone game mappings using online bandit feedback information. Our algorithm relies on a suitable double regularization to handle non-strictly monotone game maps as well as feasibility of the queried actions (online setting). The implication of our result is that players can learn Nash equilibria in several monotone games, such as finite-action zero-sum games, infinite-action zero-sum convex games, and convex games with linear coupling constraints. Several points remain open and are the topic of our current study. These include showing that our algorithm is no-regret, unifying different sampling approaches to perform one-point estimation of the game mapping for bandit learning in games, and analyzing the convergence rate of the algorithm.
REFERENCES

[1] J. P. Bailey, G. Gidel, and G. Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, pages 391-407, 2020.
[2] D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In ICML, volume 80, pages 363-372. JMLR.org, 2018.
[3] B. Bharath and V. S. Borkar. Stochastic approximation algorithms: Overview and recent trends. Sadhana, 24(4):425-452, 1999.
[4] M. Bravo, D. Leslie, and P. Mertikopoulos. Bandit learning in concave n-person games. In Advances in Neural Information Processing Systems, pages 5661-5671, 2018.
[5] F. Facchinei and C. Kanzow. Generalized Nash equilibrium problems. 4OR, 5(3):173-210, 2007.
[6] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385-394. Society for Industrial and Applied Mathematics, 2005.
[7] S. Grammatico. Comments on "Distributed robust adaptive equilibrium computation for generalized convex games" [Automatica 63 (2016) 82-91]. Automatica, 97:186-188, 2018.
[8] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Grundlehren Text Editions. Springer, 2001.
[9] R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697-704, 2005.
[10] A. Klenke. Probability Theory: A Comprehensive Course. Springer, London, 2008.
[11] J. Koshal, A. Nedić, and U. Shanbhag. Single timescale regularized stochastic approximation schemes for monotone Nash games under uncertainty. In IEEE Conference on Decision and Control, pages 231-236, 2010.
[12] P. Mertikopoulos, B. Lecouat, H. Zenati, Ch.-Sh. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.
[13] P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703-2717, 2018.
[14] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527-566, April 2017.
[15] M. B. Nevelson and R. Z. Khasminskii. Stochastic Approximation and Recursive Estimation [translated from the Russian by the Israel Program for Scientific Translations; translation edited by B. Silver]. American Mathematical Society, 1973.
[16] J.-S. Pang and F. Facchinei. Finite-Dimensional Variational Inequalities and Complementarity Problems: Vol. 1. Springer Series in Operations Research. Springer, New York, Berlin, Heidelberg, 2003.
[17] B. T. Poljak. Introduction to Optimization. Optimization Software, 1987.
[18] T. Tatarenko and M. Kamgarpour. Learning generalized Nash equilibria in a class of convex games. IEEE Transactions on Automatic Control, 2018. To appear. URL: https://arxiv.org/abs/1703.04113.
[19] T. Tatarenko and M. Kamgarpour. Learning Nash equilibria in monotone games. In IEEE Conference on Decision and Control, pages 3104-3109, 2019.
[20] A. L. Thathachar and P. S. Sastry. Networks of Learning Automata: Techniques for Online Stochastic Optimization. Springer US, 2003.
[21] V. A. Zorich and R. Cooke. Mathematical Analysis II. Mathematical Analysis. Springer, 2004.

APPENDIX
The appendix provides supporting theorems and proofs of certain lemmas and statements.
A. Supporting Theorems
Let {X(t)}_t, t ∈ Z_+, be a discrete-time Markov process on some state space E ⊆ R^d, namely X(t) = X(t, ω): Z_+ × Ω → E, where Ω is the sample space of the probability space on which the process X(t) is defined. The transition function of this chain, namely Pr{X(t+1) ∈ Γ | X(t) = X}, is denoted by P(t, X, t+1, Γ), Γ ⊆ E.

Definition 4:
The operator L, defined on the set of measurable functions V: Z_+ × E → R, X ∈ E, by

LV(t, X) = ∫ P(t, X, t+1, dy)[V(t+1, y) − V(t, X)] = E[V(t+1, X(t+1)) | X(t) = X] − V(t, X),

is called the generating operator of the Markov process {X(t)}_t.

Next, we formulate the following theorem for discrete-time Markov processes, which is proven in [15, Theorem 2.5.2].

Theorem 4:
Consider a Markov process {X(t)}_t and suppose that there exists a function V(t, X) ≥ 0 such that inf_{t≥0} V(t, X) → ∞ as ‖X‖ → ∞ and

LV(t, X) ≤ −α(t+1)ψ(t, X) + f(t)(1 + V(t, X)),

where ψ(t, X) ≥ 0, f(t) > 0, Σ_{t=0}^∞ f(t) < ∞, and α(t) is such that α(t) > 0, Σ_{t=0}^∞ α(t) = ∞. Then, almost surely, sup_{t≥0} ‖X(t, ω)‖ = R(ω) < ∞.

The following result related to the convergence of stochastic processes is proven in [17, Lemma 10, page 49].

Theorem 5:
Let v_0, . . . , v_k, . . . be a sequence of random variables, v_k ≥ 0, E v_0 < ∞, and let

E{v_{k+1} | F_k} ≤ (1 − α_k)v_k + β_k,

where F_k is the σ-algebra generated by the random variables {v_0, . . . , v_k}, 0 < α_k < 1, Σ_{k=0}^∞ α_k = ∞, β_k ≥ 0, Σ_{k=0}^∞ β_k < ∞, and lim_{k→∞} β_k/α_k = 0. Then v_k → 0 almost surely and E v_k → 0 as k → ∞.

B. Proof of Lemma 2

Proof:
We verify that differentiation under the integral sign in (4) is justified. It can then readily be verified that (6) holds by taking the differentiation inside the integral. A sufficient condition for differentiation under the integral sign is that the integral of the formally differentiated function with respect to μ^i_k converges uniformly, while the differentiated function is continuous (see [21], Chapter 17). By formally differentiating the function under the integral sign and omitting the argument t, we obtain

(1/σ²) ∫_{R^{Nd}} J_i(x)(x^i_k − μ^i_k) p(μ, x, σ) dx. (31)

Given Assumption 1, J_i(x)(x^i_k − μ^i_k)p(μ, x, σ) is continuous. Thus, it remains to check that the integral of this function converges uniformly with respect to any μ ∈ A. If A is bounded, then the conclusion follows from the polynomial behavior of the function J_i at infinity.

We now move to the case of the unbounded set A. To this end, we write the Taylor expansion of the function J_i around the point μ(i, k) ∈ R^{Nd} with the coordinates μ(i, k)^i_k = μ^i_k and μ(i, k)^j_m = x^j_m for any (j, m) ≠ (i, k), in the integral (31):

∫_{R^{Nd}} J_i(x)(x^i_k − μ^i_k) p(μ, x, σ) dx
= ∫_{R^{Nd}} [J_i(μ(i, k)) + (∂J_i(η(x, μ))/∂x^i_k)(x^i_k − μ^i_k)](x^i_k − μ^i_k) p(μ, x, σ) dx
= ∫_{R^{Nd}} (∂J_i(η(x, μ))/∂x^i_k)(x^i_k − μ^i_k)² p(μ, x, σ) dx
= ∫_{R^{Nd}} (∂J_i(η(y, μ))/∂x^i_k)(y^i_k)² p(0, y, σ) dy,

where η(x, μ) = μ(i, k) + θ(x − μ(i, k)), θ ∈ (0, 1), y = x − μ(i, k), and η(y, μ) = μ(i, k) + θy. The uniform convergence of the integral above and, in particular, its boundedness follow from the fact that under Assumption 5 (see the basic sufficient condition using a majorant in [21], Chapter 17.2.3),

|∂J_i(η(y, μ))/∂x^i_k| ≤ l^i_k

for some positive constant l^i_k and for all i ∈ [N], k ∈ [d]. Thus,

|(∂J_i(η(y, μ))/∂x^i_k)(y^i_k)² p(0, y, σ)| ≤ h(y) = l(y^i_k)² p(0, y, σ),

where ∫_{R^{Nd}} h(y) dy < ∞. ∎

C. Proof of Lemma 3

Proof:
Without loss of generality, assume x ∉ (1−r_{t−1})A (otherwise ‖Proj_{(1−r_{t−1})A} x − Proj_{(1−r_t)A} x‖ = 0). Due to convexity of the set A ⊂ R^{Nd}, there exists a convex function g: R^{Nd} → R such that A = {x : g(x) ≤ 0}, whereas (1−r_t)A = {x : g(x) ≤ −r_t} for any t. Moreover, since Proj_{(1−r_{t−1})A} x = Proj_{(1−r_{t−1})A}{Proj_{(1−r_t)A} x}, we have

‖Proj_{(1−r_{t−1})A} x − Proj_{(1−r_t)A} x‖ = d, where
d = min_y ‖y − x′‖, x′ = Proj_{(1−r_t)A} x, s.t. g(y) = −r_{t−1}.

The optimization problem above has a solution y* for which the gradient of the corresponding Lagrangian is zero, namely

(y* − x′)/‖y* − x′‖ + λ∇g(y*) = 0,

where λ > 0 is the dual multiplier of the problem under consideration. Notice that due to Assumption 1 and the choice of r_t, Slater's condition for the constraint g(x) ≤ −r_t holds for all t. Hence, for any x ∈ R^{Nd} there exists a constant Λ > 0 such that λ < Λ (see [8]). Thus, we conclude that

∇g(y*) = −(y* − x′)/(λ‖y* − x′‖).

Next, due to convexity of the function g,

g(x′) ≥ g(y*) + (∇g(y*), x′ − y*) = −r_{t−1} + ‖y* − x′‖/λ ≥ −r_{t−1} + ‖y* − x′‖/Λ.

Thus, taking into account that g(x′) ≤ −r_t, we obtain

d = ‖y* − x′‖ ≤ Λ(r_{t−1} − r_t) = O(|r_{t−1} − r_t|). ∎

D. Proof of Lemma 4

Proof:
Let us use θ = ε_t^{3/2} in Lemma 1 to express y(t) as y(t) = Proj_{(1−r_t)A}[y(t) − ε_t^{3/2}(M(y(t)) + ε_ty(t))]. Using this equivalence, the triangle inequality, and non-expansion of the projection operator, we obtain

‖y(t) − y(t−1)‖ ≤ ‖Proj_{(1−r_t)A}[y(t) − ε_t^{3/2}(M(y(t)) + ε_ty(t))] − Proj_{(1−r_t)A}[y(t−1) − ε_t^{3/2}(M(y(t−1)) + ε_{t−1}y(t−1))]‖ + ‖Proj_{(1−r_t)A}[ỹ(t−1)] − Proj_{(1−r_{t−1})A}[ỹ(t−1)]‖
≤ ‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1))) + y(t−1)ε_t^{3/2}(ε_{t−1} − ε_t)‖ + ‖Proj_{(1−r_t)A}[ỹ(t−1)] − Proj_{(1−r_{t−1})A}[ỹ(t−1)]‖,

where ỹ(t−1) = y(t−1) − ε_t^{3/2}(M(y(t−1)) + ε_{t−1}y(t−1)). Next, due to Lemma 3, we have for any θ_t > 0 and κ_t > 0:

‖y(t) − y(t−1)‖² ≤ (1 + θ_t)‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1))) + y(t−1)ε_t^{3/2}(ε_{t−1} − ε_t)‖² + (1 + 1/θ_t)O((r_t − r_{t−1})²)
≤ (1 + θ_t)(1 + κ_t)‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1)))‖² + (1 + θ_t)(1 + 1/κ_t)‖y(t−1)‖²ε_t³(ε_{t−1} − ε_t)² + (1 + 1/θ_t)O((r_t − r_{t−1})²).

Furthermore, there exists T such that for all t > T,

‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1)))‖²
≤ (1 − ε_t^{5/2})²‖y(t) − y(t−1)‖² + ε_t³‖M(y(t)) − M(y(t−1))‖²
≤ (1 − 2ε_t^{5/2} + ε_t⁵ + ε_t³L²)‖y(t) − y(t−1)‖²
≤ (1 − ε_t^{5/2})‖y(t) − y(t−1)‖²,

where the first inequality is due to monotonicity of the mapping M, the second one holds since M is Lipschitz continuous, and the third one is due to the fact that ε_t³(ε_t² + L²) ≤ ε_t^{5/2} for sufficiently large t (since ε_t → 0, see Assumption 6). Thus,

‖y(t) − y(t−1)‖² ≤ (1 + θ_t)(1 + κ_t)(1 − ε_t^{5/2})‖y(t) − y(t−1)‖² + (1 + θ_t)(1 + 1/κ_t)M_y²ε_t³(ε_{t−1} − ε_t)² + (1 + 1/θ_t)O((r_t − r_{t−1})²),

where M_y is the uniform upper bound on the norm of the sequence y(t). By rearranging the terms above, we obtain

(1 − (1 + θ_t)(1 + κ_t)(1 − ε_t^{5/2}))‖y(t) − y(t−1)‖² ≤ (1 + θ_t)(1 + 1/κ_t)M_y²ε_t³(ε_{t−1} − ε_t)² + (1 + 1/θ_t)O((r_t − r_{t−1})²).

We conclude the proof by noticing that, with the choice κ_t = θ_t = ε_t^{5/2}/4, we obtain

1 − (1 + θ_t)(1 + κ_t)(1 − ε_t^{5/2}) ≥ ε_t^{5/2} − θ_t − κ_t − θ_tκ_t ≥ 0.25ε_t^{5/2},

which yields ‖y(t) − y(t−1)‖ = O((ε_{t−1} − ε_t)/ε_t + (r_{t−1} − r_t)/ε_t^{5/2}). ∎

E. Verification of Equations (21) and (25)

Due to Assumption 5, there exists a compact set S ⊂ R^{Nd} such that for any x ∉ S, |J_i(x)| ≤ (c, x) + b for some c = (c^1_1, . . . , c^1_d, . . . , c^N_1, . . . , c^N_d) ∈ R^{Nd} and b ∈ R. Thus, for some positive S̄, d₁, and d₂ we get

∫_{R^{Nd}} J_i²(x)(x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx
≤ ∫_S J_i²(x)(x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx + ∫_{R^{Nd}\S} [d₁‖x‖² + d₂](x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx
≤ S̄/σ_t² + ∫_{R^{Nd}} [d₁‖x‖² + d₂](x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx = S̄/σ_t² + f_i(μ, σ_t)/σ_t²,

with f_i(μ, σ_t) being a quadratic function of μ and σ_t. The last equality is due to the facts that

∫_{R^{Nd}} (x^i_k − μ^i_k)² p(μ, x) dx = σ_t²,
∫_{R^{Nd}} (x^i_k)²(x^i_k − μ^i_k)² p(μ, x) dx ≤ ∫_{R^{Nd}} [2(x^i_k − μ^i_k)² + 2(μ^i_k)²](x^i_k − μ^i_k)² p(μ, x) dx = 6σ_t⁴ + 2(μ^i_k)²σ_t²,
∫_{R^{Nd}} (x^j_m)²(x^i_k − μ^i_k)² p(μ, x) dx = (σ_t² + (μ^j_m)²)σ_t² for (j, m) ≠ (i, k).

The bound in (25) follows analogously.