Bandit Online Learning of Nash Equilibria in Monotone Games
Tatiana Tatarenko and Maryam Kamgarpour, IEEE Member

T. Tatarenko ([email protected]) is with the Control Methods and Robotics Lab, Technical University Darmstadt, Darmstadt, Germany 64283. M. Kamgarpour ([email protected]) is with Electrical and Computer Engineering, UBC, Vancouver, Canada. M. Kamgarpour gratefully acknowledges ERC Starting Grant CONENE.
Abstract—We address online bandit learning of Nash equilibria in multi-agent convex games. We propose an algorithm whereby each agent uses only obtained values of her cost function at each joint played action, lacking any information about the functional form of her cost or other agents' costs or strategies. In contrast to past work, where convergent algorithms required strong monotonicity, we prove that the algorithm converges to a Nash equilibrium under a mere monotonicity assumption. The proposed algorithm extends the applicability of bandit learning to several games, including zero-sum convex games with possibly unbounded action spaces, mixed extensions of finite-action zero-sum games, as well as convex games with linear coupling constraints.

Index Terms—learning in games, monotone bandit learning
I. INTRODUCTION
Game theory is a powerful framework to address optimization and learning of multiple interacting agents, referred to as players. Such multi-agent problems arise in several application domains, including traffic networks, the internet, auctions, and adversarial learning. In a multi-agent setting, the notion of Nash equilibrium captures a desirable solution as it exhibits stability: no player has an incentive to unilaterally deviate from this solution. Nash equilibrium is also consistent with a notion referred to as rationality, in that each player aims to optimize her own cost function. Given these theoretical justifications for Nash equilibria, from the viewpoint of learning the question is whether players can learn their Nash equilibrium strategies with limited information about the game. In particular, in several application domains each player might not know the functional form of her objective. For example, the travel times of edges in a traffic network or the market constraints in an auction are unknown a priori and depend in non-trivial ways on the actions and objectives of other players. However, by playing the game, a player can have access to so-called online bandit information feedback; namely, she can receive the payoff value of her objective (zeroth-order oracle) for any feasible joint action taken by all the players. Thus, we focus on how players can learn Nash equilibria given online bandit information.

Bandit learning in games has mainly been explored in finite-action settings. It is known that if each player uses a no-regret algorithm, the time average of the sequence of played actions converges to a coarse-correlated equilibrium, a relaxed notion of equilibrium which encompasses Nash
equilibria as well as possibly non-rationalizable strategies. The convergence of the time-averaged sequence of plays in no-regret algorithms to a mixed-strategy Nash equilibrium can be established for special games, such as two-player zero-sum games. However, convergence of the time-averaged actions does not imply that the actual sequence of plays also converges to a Nash equilibrium. This issue was recently re-examined in light of progress in bandit convex optimization. In particular, [13] showed that if each player uses a fairly general model of no-regret algorithm (FTRL) in continuous time, the actual sequence of plays may not converge to a mixed-strategy Nash equilibrium in a two-player zero-sum game and might exhibit a non-vanishing cycle. (The FTRL algorithms explored there had access to more than bandit feedback, as each player could observe the gradient of its mixed-extension cost function at each time step.) The work of [2] analyzed this non-convergence through the lens of the Hamiltonian and the potential of a game and proposed convergent algorithms assuming access to exact first- and second-order oracles (gradients and Hessians) of the players' objectives. More recently, motivated by the Hamiltonian analogy of zero-sum games, [1] extended the analysis of [13] to the discrete-time setting and showed that sequential gradient descent can overcome the divergence of discretized simultaneous gradient descent in these games.

The mixed extension of a finite-action game falls into the category of convex games. In such games, each player's objective is convex in her own decision variable for any fixed action of the other players (in mixed extensions of finite-action games, this objective is linear in each player's action). Furthermore, the strategy sets are convex (simplexes in mixed-extension games). It is known that the Nash equilibria of convex differentiable games correspond to the solution set of a variational inequality problem. Hence, one can use algorithms for solving variational inequalities to compute Nash equilibria. This connection has been used in two lines of work addressing bandit learning of Nash equilibria in convex games. The works [18, 4] showed that no-regret learning can converge to Nash equilibria in a certain class of convex games. Both works leveraged the idea of one-point estimation of the gradient of the player's cost function using the bandit payoff information. Whereas [4] performed this gradient estimation by perturbing the played actions with a point sampled from the uniform distribution on the unit sphere, motivated by Stokes' theorem [6, 9], the approach in [18] formed these gradient estimates using the Normal distribution and smoothing, motivated by [20, 14]. In both cases, convergence was proven by appropriately choosing the stepsizes to trade off the bias and variance of the resulting estimation terms in the stochastic approximation procedure.
The above works on learning Nash equilibria rely on strict monotonicity of the game mapping. The game mapping is the stacked vector of the gradients of each player's objective with respect to her own actions. In case the game is potential, the game mapping is symmetric and thus corresponds to the gradient of a single function, the so-called potential function of the game. Hence, finding equilibria can be cast as a single-objective optimization, and bandit algorithms can readily be applied to learn Nash equilibria. In general noncooperative games, however, the game mapping need not be symmetric. In such cases, a necessary and sufficient condition for convergence of no-regret algorithms such as those explored in [18, 4, 13] is strict monotonicity of the game mapping.

Strict monotonicity of the game mapping is a stringent condition. The non-convergence issue, and in particular the cyclic behavior discussed in [13, 2, 1], is due to the fact that the game mappings of several games do not satisfy strict monotonicity. Zero-sum games, for example, are only monotone. This class of games has been widely adopted in robust optimization and control, in models of perfect competition, and more recently in adversarial training and deep learning. In addition to zero-sum games, generalized Nash equilibrium problems, that is, Nash equilibria with coupling constraints, can at best have monotone extended game mappings due to the coupling constraints. Generalized Nash equilibrium problems arise in several domains where a resource must be shared between agents and this sharing can only be formulated as a hard rather than a soft constraint [5]. The relevance of games with merely monotone mappings motivates our paper on learning Nash equilibria under bandit feedback in this game class.

Given the connection of Nash equilibria of convex games with the solution set of a variational inequality (VI) problem, a natural starting point for learning Nash equilibria of merely monotone games is to search for algorithms for solving the corresponding VIs. This topic is well explored in [16, Chapter 12], where several approaches, including extra-gradient and Tikhonov regularization, for finding solutions of merely monotone or pseudo-monotone VIs are proposed. All the proposed approaches, though, require at least first-order (gradient) feedback. The challenge with generalizing these approaches to zeroth-order (bandit) feedback is that they require coordination among the players and are not suitable in the bandit feedback setting, because a player cannot sample her objective function at different points while ensuring the actions of other players remain fixed. For example, the extra-gradient algorithm in [12] remedies the cyclic behavior of trajectories in monotone games. However, it is only applicable given an exact first-order feedback oracle. On the other hand, standard Tikhonov regularization requires a double iterative procedure, where the players would solve a regularized VI corresponding to a regularized game mapping in each inner iteration. Here, solving the inner VI itself would require an iterative algorithm, and it is not clear how the players should coordinate the stopping time for the inner algorithm in addition to setting the regularization parameter. Motivated by generalizing the algorithms for learning Nash equilibria, in our recent work [19] we proposed an approach to learn Nash equilibria in merely monotone games. Our approach was inspired by the single time-scale Tikhonov regularization algorithm for solving stochastic variational inequalities [11].
In contrast to the above work, which assumed access to noisy gradients, we considered the single-point estimation of gradients using the bandit feedback. The main contribution was to appropriately control the bias and variance introduced by the single-point estimation, along with the Tikhonov regularization parameter, to ensure convergence. Our proposed algorithm was not applicable to online learning because the played actions were not bound to lie in the feasible set. Our current paper fills this gap by providing a convergent algorithm while ensuring the query points do lie in the feasible set.

Our contributions in this paper are as follows. We develop, to the best of our knowledge, the first bandit approach for online learning of Nash equilibria in convex games with merely monotone game mappings. In doing so, we propose a novel single time-scale double regularization: this double regularization corresponds to the Tikhonov regularization and, at the same time, the projection of the actions onto a shrunk feasible set. By properly managing the interplay between the choice of the regularization sequence, the radius of shrinkage of the feasible set, and the stepsize, we ensure that the bias and variance of the resulting stochastic approximation vanish with appropriate rates. The choice of parameters is stated in Assumption 6, whereas the main convergence result is stated in Theorem 2. In terms of proof techniques, a few main novelties lead to this theorem: first, showing that the doubly regularized Tikhonov sequence, as well as a single time-scale approach to solve for it, remains bounded and converges to the least-norm solution of the variational inequality (see Theorem 3 and Proposition 1, respectively); second, showing that the iterates of our algorithm remain bounded (see Lemma 5). These results enable us to apply well-established results on boundedness and convergence of stochastic processes to our setup. In summary, we prove convergence in probability of the sequence of actions to a Nash equilibrium in monotone games.
Notations.
The set {1, . . . , N} is denoted by [N]. Boldface is used to distinguish between vectors in a multi-dimensional space and scalars. Given N vectors x^i ∈ R^d, i ∈ [N], (x^i)_{i=1}^N := (x^{1⊤}, . . . , x^{N⊤})^⊤ ∈ R^{Nd}; x^{−i} := (x^1, . . . , x^{i−1}, x^{i+1}, . . . , x^N) ∈ R^{(N−1)d}. R^d_+ and Z_+ denote, respectively, the vectors from R^d with non-negative coordinates and the non-negative whole numbers. The standard inner product on R^d is denoted by (·, ·): R^d × R^d → R, with associated norm ‖x‖ := √(x, x). Given a matrix A ∈ R^{d×d}, A ⪰ (≻) 0 if and only if x^⊤Ax ≥ (>) 0 for all x ≠ 0. We use the big-O notation; that is, the function f(x): R → R is O(g(x)) as x → a, written f(x) = O(g(x)) as x → a, if lim_{x→a} |f(x)|/|g(x)| ≤ K for some positive constant K. With a slight abuse of notation, we write f(x) ≤ O(g(x)) as we estimate certain bounds. We say that a function f(x) grows not faster than a function g(x) as x → ∞ if there exists a positive constant Q such that f(x) ≤ g(x) for all x with ‖x‖ ≥ Q. For x ∈ R^n and a convex closed set C ⊂ R^n, Proj_C x denotes the projection of x onto C.

Definition 1:
A mapping M: R^d → R^d is monotone over X ⊆ R^d if (M(x) − M(y), x − y) ≥ 0 for every x, y ∈ X.

II. PROBLEM FORMULATION
Consider a game Γ(N, {A^i}, {J_i}) with N players, the sets of players' actions A^i ⊆ R^d, i ∈ [N], and the cost (objective) functions J_i: A → R, where A = A^1 × · · · × A^N denotes the set of joint actions. We restrict the class of games as follows.

Assumption 1:
The game under consideration is convex. Namely, for all i ∈ [N] the set A^i is convex and closed, and the cost function J_i(a^i, a^{−i}) is defined on R^{Nd}, continuously differentiable in a, and convex in a^i for each fixed a^{−i}.

Assumption 2:
The mapping M: R^{Nd} → R^{Nd}, referred to as the game mapping, defined by

M(a) = (∇_{a^i} J_i(a^i, a^{−i}))_{i=1}^N = (M_1(a), . . . , M_N(a))^⊤,
where M_i(a) = (M_{i,1}(a), . . . , M_{i,d}(a))^⊤ and
M_{i,k}(a) = ∂J_i(a)/∂a^i_k, a ∈ A, i ∈ [N], k ∈ [d], (1)

is monotone on A (see Definition 1).

We consider a Nash equilibrium in the game Γ(N, {A^i}, {J_i}) as a stable solution outcome because it represents a joint action from which no player has any incentive to unilaterally deviate.
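To make the game mapping and Assumption 2 concrete, the following minimal sketch (an illustration, not taken from the paper) builds M for a hypothetical two-player zero-sum game with scalar actions, J_1(a) = a^1 a^2 and J_2(a) = −a^1 a^2, and checks the inequality of Definition 1 on random pairs. Here M is skew-symmetric, so the monotonicity gap is identically zero: the game is monotone but not strictly monotone.

```python
import numpy as np

# Hypothetical zero-sum game with scalar actions (N = 2, d = 1):
# J_1(a) = a^1 * a^2, J_2(a) = -a^1 * a^2.
def game_mapping(a):
    a1, a2 = a
    return np.array([a2, -a1])  # (dJ_1/da^1, dJ_2/da^2), as in (1)

# Empirical check of Definition 1: (M(x) - M(y), x - y) >= 0.
rng = np.random.default_rng(0)
gaps = [np.dot(game_mapping(x) - game_mapping(y), x - y)
        for x, y in (rng.normal(size=(2, 2)) for _ in range(1000))]
print(min(gaps))  # prints 0.0: monotone, but not strictly monotone
```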
Definition 2: A point a* ∈ A is called a Nash equilibrium if for any i ∈ [N] and any a^i ∈ A^i,

J_i(a^{i*}, a^{−i*}) ≤ J_i(a^i, a^{−i*}).

Our goal is to learn such a stable action in a game through designing a payoff-based algorithm. To do so, we first connect the existence of Nash equilibria of Γ(N, {A^i}, {J_i}) with the solution set of a corresponding variational inequality problem.

Definition 3:
Consider a mapping T(·): R^d → R^d and a set Y ⊆ R^d. The solution set SOL(Y, T) of the variational inequality problem VI(Y, T) is the set of vectors y* ∈ Y such that (T(y*), y − y*) ≥ 0 for all y ∈ Y.

Theorem 1: ([16, Proposition 1.4.2]) Given a game Γ(N, {A^i}, {J_i}) with game mapping M, suppose that the action sets {A^i} are closed and convex and the cost functions {J_i} are continuously differentiable in a and convex in a^i for every fixed a^{−i} on the interior of A. Then, a vector a* ∈ A is a Nash equilibrium of Γ if and only if a* ∈ SOL(A, M).

It follows that under Assumptions 1 and 2, for a game with mapping M, any solution of VI(A, M) is also a Nash equilibrium of the game and vice versa. While Γ(N, {A^i}, {J_i}) under Assumptions 1 and 2 might admit a Nash equilibrium, these two assumptions alone do not guarantee uniqueness of a Nash equilibrium. More restrictive assumptions are needed for uniqueness, for example, strong monotonicity of the game mapping or compactness of the action sets [16]. Here, we do not restrict our attention to such cases. However, to have a meaningful discussion, we do assume existence of at least one Nash equilibrium in the game.

Assumption 3:
The set SOL(A, M) is not empty.

Corollary 1:
Let Γ(N, {A^i}, {J_i}) be a game with game mapping M for which Assumptions 1, 2, and 3 hold. Then, there exists at least one Nash equilibrium in Γ. Moreover, any Nash equilibrium in Γ belongs to the set SOL(A, M).

The following additional assumptions are needed for convergence of the proposed payoff-based algorithm to a Nash equilibrium (see the proofs of Lemma 5 and Theorem 2).

Assumption 4:
Each element M_i of the game mapping M: R^{Nd} → R^{Nd} defined in Assumption 2 is Lipschitz continuous on R^{Nd} with a Lipschitz constant L_i.

Assumption 5:
Each cost function J_i(a), i ∈ [N], grows at most polynomially in a as ‖a‖ → ∞. Moreover, in the case of an unbounded joint action set A, each continuous cost function J_i(a), i ∈ [N], grows at most linearly in a as ‖a‖ → ∞.

Remark 1:
Note that if the set A is unbounded, Assumption 5 is equivalent to each cost function J_i(a), i ∈ [N], being Lipschitz continuous on R^{Nd} with some constant l_i. Thus, for both bounded and unbounded A, we denote by l = max_{i∈[N]} l_i the uniform upper bound on the norm of the mapping M over A.

For the development and analysis of our algorithm, we use the following well-established and easy to verify result.

Lemma 1:
Consider a mapping T(·): R^d → R^d and a convex closed set Y ⊆ R^d. Given θ > 0,

y* ∈ SOL(Y, T) ⟺ y* = Proj_Y(y* − θT(y*)). (2)
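Lemma 1 also suggests a practical optimality test: the residual ‖y − Proj_Y(y − θT(y))‖ vanishes exactly at the solutions of VI(Y, T). A minimal sketch (illustrative, not from the paper) for a box-shaped Y, where the projection is a coordinate-wise clip:

```python
import numpy as np

# Fixed-point residual from (2) for a box Y = [lo, hi]^d.
def proj_box(y, lo=-1.0, hi=1.0):
    return np.clip(y, lo, hi)

def vi_residual(T, y, theta=1.0):
    # Zero iff y solves VI(Y, T), by Lemma 1.
    return np.linalg.norm(y - proj_box(y - theta * T(y)))

# Example with the monotone map T(y) = (y_2, -y_1): y* = 0 solves the VI.
T = lambda y: np.array([y[1], -y[0]])
print(vi_residual(T, np.zeros(2)))            # 0.0
print(vi_residual(T, np.array([0.5, 0.5])))   # positive: not a solution
```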
III. PAYOFF-BASED ALGORITHM

Given online payoff-based information, also referred to as online bandit or zeroth-order oracle information, each agent has access to its current action, referred to as its state and denoted by x^i(t) = (x^i_1(t), . . . , x^i_d(t))^⊤ ∈ R^d, and plays the action a^i(t) = Proj_{A^i}(x^i(t)) at iteration t. After that, the cost value Ĵ_i(t) at the joint action a(t) = (a^1(t), . . . , a^N(t)) ∈ A, namely Ĵ_i(t) = J_i(a(t)), is revealed to each agent i. Given these pieces of information, in the proposed algorithm each agent i "mixes" its next state x^i(t+1). Namely, it chooses x^i(t+1) randomly according to the multidimensional normal distribution N(μ^i(t+1) = (μ^i_1(t+1), . . . , μ^i_d(t+1))^⊤, σ²_{t+1}I) with the density function

p_i(x^i; μ^i(t+1), σ_{t+1}) = (1/(√(2π) σ_{t+1})^d) exp{−Σ_{k=1}^d (x^i_k − μ^i_k(t+1))²/(2σ²_{t+1})}.

The initial values of the means μ^i(0), i ∈ [N], can be set to any finite value. The successive means are updated as follows:

μ^i(t+1) = Proj_{(1−r_t)A^i}[μ^i(t) − γ_t σ_t² (Ĵ_i(t) (x^i(t) − μ^i(t))/σ_t² + ε_t μ^i(t))]. (3)

In the above, (1−r_t)A^i = {x ∈ A^i : dist(x, ∂A^i) ≥ r_t}, where 0 < r_t < 1 is a time-dependent shrinkage parameter, γ_t is the stepsize parameter, and ε_t > 0 is a regularization parameter. The convergence of the algorithm depends on the interplay of these parameters and the variance parameter σ_t > 0.

The difference between the proposed approach and that of [18] is the additional term ε_t in (3). In the absence of this term, the algorithm would converge only if the game mapping were strictly monotone (see [18, Theorem 2] and counterexamples in [7, 13]). Moreover, in distinction from [19], in the bandit online feedback considered here players can evaluate their costs only over their feasible action sets A^i and not over the whole R^{Nd}, necessitating the additional projection a^i(t) = Proj_{A^i}(x^i(t)) and the shrinkage radius r_t. As such, the previous convergence analysis does not apply.
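The sketch below implements one step of (3), assuming scalar actions (d = 1) and A^i = [−1, 1] for every player, so that both Proj_{A^i} and Proj_{(1−r_t)A^i} reduce to clips; `costs` is a placeholder returning the bandit feedback (J_1(a), . . . , J_N(a)), and the parameter schedules come from Assumption 6 below.

```python
import numpy as np

rng = np.random.default_rng(1)

def step(mu, costs, gamma_t, sigma_t, eps_t, r_t):
    x = rng.normal(mu, sigma_t)               # states x^i(t) ~ N(mu^i(t), sigma_t^2)
    a = np.clip(x, -1.0, 1.0)                 # played actions a^i(t) = Proj_{A^i} x^i(t)
    J_hat = costs(a)                          # revealed cost values J_i(a(t))
    grad_est = J_hat * (x - mu) / sigma_t**2  # one-point gradient estimates
    mu_next = np.clip(mu - gamma_t * sigma_t**2 * (grad_est + eps_t * mu),
                      -(1.0 - r_t), 1.0 - r_t)  # projection onto (1 - r_t)A^i
    return mu_next, a
```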
Before stating the convergence result, let us provide insight into the procedure defined by Equation (3) by deriving an analogy to a regularized stochastic gradient algorithm. Let p(x; μ, σ) = ∏_{i=1}^N p_i(x^i_1, . . . , x^i_d; μ^i, σ) denote the density function of the joint distribution of the agents' states. Given σ > 0, for any i ∈ [N] define J̃_i: R^{Nd} → R as

J̃_i(μ^1, . . . , μ^N, σ) = ∫_{R^{Nd}} J_i(x) p(x; μ, σ) dx. (4)

Thus, J̃_i, i ∈ [N], is the i-th player's cost function in mixed strategies. Let μ(t) = (μ^1(t), . . . , μ^N(t)) and, for i ∈ [N], define M̃_i(·) = (M̃_{i,1}(·), . . . , M̃_{i,d}(·))^⊤ as the d-dimensional mapping with the elements

M̃_{i,k}(μ, σ) = ∂J̃_i(μ, σ)/∂μ^i_k, for k ∈ [d]. (5)

Our first lemma below shows that the second term inside the projection in (3) is a sample of the gradient of agent i's cost function in mixed strategies.

Lemma 2: Under Assumptions 1 and 5,

M̃_{i,k}(μ(t), σ_t) = E{J_i(x^1(t), . . . , x^N(t)) (x^i_k(t) − μ^i_k(t))/σ_t² | x^j(t) ∼ N(μ^j(t), σ_t²I), j ∈ [N]}, i ∈ [N], k ∈ [d]. (6)

Moreover, M̃_{i,k}(μ, σ) is bounded for any μ ∈ A.

The proof of this lemma is very similar to that of [19] and is provided in Appendix B. The lemma above implies that, had we used the term J_i(x^1(t), . . . , x^N(t))(x^i_k(t) − μ^i_k(t))/σ_t² in (3), we could perform a one-point estimation of the gradient of the cost functions in mixed strategies. In the bandit setting considered here, however, we use the term J_i(a^1(t), . . . , a^N(t))(x^i_k(t) − μ^i_k(t))/σ_t². Despite this difference, in the analysis (see (22), (23)) we show that the difference between these two terms converges to zero due to the shrinkage radius selection. Thus, algorithm (3) can be interpreted as a doubly regularized (due to r_t and ε_t) stochastic projection algorithm.

Following the above interpretation, our main result is Theorem 2 below, where we show that by appropriately choosing the algorithm parameters we can bound the bias and variance terms of the stochastic projection and consequently establish convergence of the iterates μ(t) in (3) to a Nash equilibrium.
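As a quick numerical illustration of Lemma 2 (not from the paper), one can verify (6) by Monte Carlo for a single player with J(x) = x² and d = 1: here J̃(μ, σ) = μ² + σ², so the exact mixed-strategy gradient is 2μ, and the one-point estimator J(x)(x − μ)/σ² matches it in expectation.

```python
import numpy as np

# Monte Carlo check of (6) for J(x) = x^2: dJ~/dmu = 2*mu exactly.
rng = np.random.default_rng(2)
mu, sigma = 0.7, 0.3
x = rng.normal(mu, sigma, size=2_000_000)
one_point = np.mean(x**2 * (x - mu) / sigma**2)
print(one_point, 2 * mu)  # both print approximately 1.4
```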
Assumption 6: Let β_t = γ_tσ_t² and choose γ_t = 1/t^{a₁}, σ_t = 1/t^{a₂}, ε_t = 1/t^{a₃}, and r_t = 1/t^{a₄}, with a₁, a₂, a₃, a₄ > 0, such that
i) Σ_t β_t = ∞, Σ_t β_tε_t = ∞;
ii) Σ_t [(ε_{t−1} − ε_t)²/(β_tε_t³) + (r_{t−1} − r_t)²/(β_tε_t⁶)] < ∞;
iii) Σ_t γ_t² < ∞, Σ_t β_tσ_t < ∞;
iv) lim_{t→∞} r_t/σ_t = ∞, lim_{t→∞} r_t/ε_t = 0.
As an example of parameters satisfying Assumption 6, one can take a₁ = 0.7, a₂ = 0.125, a₃ = 0.02, a₄ = 0.06 (see Lemma 6 in Section V).
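The conditions of Assumption 6 reduce to inequalities on the exponents, made explicit in Lemma 6 of Section V. The snippet below (illustrative) checks those inequalities for the example exponents quoted above.

```python
# Sanity check of Lemma 6's sufficient conditions (Section V) for the
# example exponents (a1, a2, a3, a4) = (0.7, 0.125, 0.02, 0.06).
a1, a2, a3, a4 = 0.7, 0.125, 0.02, 0.06
checks = {
    "i)   a1 + 2*a2 <= 1":              a1 + 2*a2 <= 1,
    "i)   a1 + 2*a2 + a3 <= 1":         a1 + 2*a2 + a3 <= 1,
    "ii)  a1 + 2*a2 + a3 < 1":          a1 + 2*a2 + a3 < 1,
    "ii)  a1 + 2*a2 + 6*a3 - 2*a4 < 1": a1 + 2*a2 + 6*a3 - 2*a4 < 1,
    "iii) a1 > 1/2":                    a1 > 0.5,
    "iii) a1 + 3*a2 > 1":               a1 + 3*a2 > 1,
    "iv)  a3 < a4 < a2":                a3 < a4 < a2,
}
for name, ok in checks.items():
    print(name, ok)  # all True
```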
Theorem 2: Let the players in the game Γ(N, {A^i}, {J_i}) choose the states {x^i(t)} at time t according to the normal distribution N(μ^i(t), σ_t²I), where the mean μ^i(0) is arbitrary and μ^i(t) is updated as in (3). Under Assumptions 1-6, as t → ∞, the mean vector μ(t) converges almost surely to a Nash equilibrium μ* = a* of the game Γ, and the joint action a(t) converges in probability to a*.

IV. ANALYSIS OF THE ALGORITHM
To prove Theorem 2, we first need to establish boundedness of the iterates μ(t) for the case in which the action space is unbounded. Having established boundedness, we can show that the limit of the iterates μ(t) exists and is the minimum-norm Nash equilibrium of the problem. This convergence is proven using existing results on convergence of sequences of random variables ([17, Lemma 10, page 49]). For ease of reference, we provide the supporting statements used for boundedness [15, Theorem 2.5.2] and for convergence [17, Lemma 10, page 49] of the iterates in Appendix A.

A. Characterizing the terms in the algorithm
We first show that algorithm (3) can be interpreted within the framework of the well-studied Robbins-Monro stochastic approximation procedures [3], where the iterates are updated according to stochastic gradient descent. In our case, bias of the game mapping arises due to each player's one-point estimation of its gradient. However, in contrast to a stochastic approximation procedure, the game mapping is in general not the gradient of a single function (as its derivative is not symmetric) unless the game is potential. Furthermore, there are additional terms in the algorithm iterates due to the projection of the query points onto the shrunk feasible set and the regularization required to handle the non-strictly monotone game mapping. Let us specify all these terms below.

Using the notation M_i(·) = (M_{i,1}(·), . . . , M_{i,d}(·)), we can rewrite the algorithm step in (3) in the following form:

μ^i(t+1) = Proj_{(1−r_t)A^i}[μ^i(t) − γ_tσ_t² (M_i(μ(t)) + Q_i(μ(t), σ_t) + R_i(x(t), μ(t), σ_t) + P_i(x(t), μ(t), σ_t) + ε_tμ^i(t))], (7)

for all i ∈ [N], where Q_i, R_i, P_i are

Q_i(μ(t), σ_t) = M̃_i(μ(t), σ_t) − M_i(μ(t)),
R_i(x(t), μ(t), σ_t) = F_i(x(t), μ(t), σ_t) − M̃_i(μ(t), σ_t),
F_i(x(t), μ(t), σ_t) = J_i(x(t)) (x^i(t) − μ^i(t))/σ_t²,
P_i(x(t), μ(t), σ_t) = ((x^i(t) − μ^i(t))/σ_t²) (J_i(a(t)) − J_i(x(t))).

Above, M(μ(t)) = (M_1(μ(t)), . . . , M_N(μ(t))) corresponds to the gradient term in stochastic approximation procedures. The mapping M̃_i evaluated at μ(t) is equivalent to the i-th component of the game mapping in mixed strategies [19]. That is,

M̃_i(μ(t), σ_t) = ∫_{R^{Nd}} M_i(x) p(x; μ(t), σ_t) dx. (8)

Thus, Q(μ(t), σ_t) = (Q_1(μ(t), σ_t), . . . , Q_N(μ(t), σ_t)) can be interpreted as a disturbance of the gradient. Furthermore, according to (6), we have

R_i(x(t), μ(t), σ_t) = F_i(x(t), μ(t), σ_t) − E_{x(t)}{F_i(x(t), μ(t), σ_t)}, i ∈ [N]. (9)

The notation E_{x(t)}{·} is used to emphasize that the expectation is taken with respect to x(t), which has the normal distribution with mean μ(t) and covariance matrix σ²(t)I, where I is the identity matrix. Thus, R(x(t), μ(t), σ_t) = (R_1(x(t), μ(t), σ_t), . . . , R_N(x(t), μ(t), σ_t)) below is a martingale difference. Finally, due to the projection of the query points, the term P(x(t), μ(t), σ_t) = (P_1(x(t), μ(t), σ_t), . . . , P_N(x(t), μ(t), σ_t)) is the vector of the differences between the gradient estimation based on the state x(t) ∈ R^{Nd} and the one based on the played action a(t) ∈ A. Our goal is to bound each of the terms above to ensure convergence and boundedness of the iterates. First, however, we need to account for the regularization terms ε_t, r_t.

B. Analyzing the modified Tikhonov sequence
In contrast to stochastic approximation algorithms and the proof in [18], we have an additional term ε_tμ(t) in order to address merely monotone game mappings. As such, to bound μ(t) we also relate the variations of the sequence μ(t) to those of the modified Tikhonov sequence defined below. Let y(t) = (y^1(t), . . . , y^N(t)) denote the solution of the variational inequality VI((1−r_t)A, M(y) + ε_ty), namely

y(t) ∈ SOL((1−r_t)A, M(y) + ε_ty). (10)

The Tikhonov sequence corresponds to the solution of the variational inequality above with r_t = 0. Thus, the sequence {y(t)} can be considered a modified Tikhonov sequence. Similar to the Tikhonov sequence, y(t) enjoys the following important property.

Theorem 3:
Under Assumptions 2, 3, and 4, y(t) defined in (10) exists and is unique for each t. Moreover, for ε_t ↓ 0 and r_t → 0 with lim_{t→∞} r_t/ε_t = 0, y(t) is uniformly bounded and converges to the least-norm solution of VI(A, M).

The significance of the above theorem is that if we can establish that μ(t) converges to y(t), then from Theorem 1 we can establish convergence to a Nash equilibrium of the game.
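The following sketch (an illustrative example, not from the paper) visualizes Theorem 3 for the monotone but not strictly monotone map M(y) = (y₁ + y₂ − 1)(1, 1)^⊤ on A = [−1, 1]², whose VI solutions form the line y₁ + y₂ = 1 with least-norm solution (0.5, 0.5). The regularized solutions, computed here by an auxiliary fixed-ε projection iteration, approach this point as ε and r/ε vanish.

```python
import numpy as np

# Monotone (singular) map whose VI solution set is the line y1 + y2 = 1.
def M(y):
    s = y[0] + y[1] - 1.0
    return np.array([s, s])

def regularized_vi_solution(eps, r, iters=20000, step=0.05):
    # Solves VI((1-r)A, M + eps*I) on A = [-1, 1]^2 by projection iterations.
    y = np.zeros(2)
    for _ in range(iters):
        y = np.clip(y - step * (M(y) + eps * y), -(1 - r), 1 - r)
    return y

for eps, r in [(0.5, 0.25), (0.1, 0.01), (0.01, 1e-4)]:
    print(eps, r, regularized_vi_solution(eps, r))
# The solutions (1/(2+eps), 1/(2+eps)) approach (0.5, 0.5) as eps, r/eps -> 0.
```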
To prove Theorem 3, we first establish a useful property of the projections onto the sets (1−r_t)A, t = 1, 2, . . . .

Lemma 3: For any x ∈ R^{Nd} the following holds:

‖Proj_{(1−r_{t−1})A} x − Proj_{(1−r_t)A} x‖ = O(|r_{t−1} − r_t|).

Please see Appendix C for the proof of the above lemma.
Proof: (of Theorem 3) Let a° be the least-norm solution of VI(A, M), and let a°_p be the projection of a° onto the set (1−r_t)A. Next, let y(t) be the unique solution of the doubly regularized variational inequality, namely y(t) ∈ SOL((1−r_t)A, M + ε_tI). Thus, we conclude that

(M(a°), y(t) − a°) ≥ 0, (M(y(t)) + ε_ty(t), a°_p − y(t)) ≥ 0.

Hence, taking into account monotonicity of M, we obtain

0 ≤ (M(a°), y(t) − a°) + (M(y(t)) + ε_ty(t), a° − y(t)) + (M(y(t)) + ε_ty(t), a°_p − a°)
= −(M(a°) − M(y(t)), a° − y(t)) + ε_t(y(t), a° − y(t)) + (M(y(t)) + ε_ty(t), a°_p − a°)
≤ ε_t(y(t), a°) − ε_t‖y(t)‖² + (M(y(t)), a°_p − a°) + ε_t(y(t), a°_p − a°).

Hence,

ε_t‖y(t)‖² ≤ ε_t(y(t), a°) + (M(y(t)), a°_p − a°) + ε_t(y(t), a°_p − a°)
≤ ε_t‖y(t)‖‖a°‖ + l‖a°_p − a°‖ + ε_t‖y(t)‖‖a°_p − a°‖
= ε_t‖y(t)‖‖a°‖ + lO(r_t) + ε_t‖y(t)‖O(r_t),

where in the first inequality we used Remark 1 and in the second one we applied Lemma 3. Hence,

‖y(t)‖² ≤ ‖y(t)‖‖a°‖ + lO(r_t/ε_t) + ‖y(t)‖O(r_t).

By taking the upper limit as t → ∞ in the inequality above, and due to the settings of ε_t and r_t, we obtain lim sup_{t→∞}‖y(t)‖² ≤ ‖a°‖ lim sup_{t→∞}‖y(t)‖ + lim sup_{t→∞}‖y(t)‖ lim_{t→∞}O(r_t). This implies that lim sup_{t→∞}‖y(t)‖ ≤ ‖a°‖ and, thus, the sequence ‖y(t)‖ is upper bounded. Moreover, any accumulation point of this sequence is bounded above by the Euclidean norm of the least-norm solution of VI(A, M). Hence, according to the fact that the function Proj_{(1−r)A}(x) is continuous in both r and x, y(t) converges to the least-norm solution of VI(A, M). ∎

Since our goal now is to relate μ(t) to y(t), aligned with procedure (3), we now design a one-time-scale approach to solving (10), as per (11) below:

z(t+1) = Proj_{(1−r_t)A}[z(t) − β_t(M(z(t)) + ε_tz(t))], (11)

where β_t is defined in Assumption 6. We show that the procedure above is a one-time-scale algorithm and, similarly to y(t), it converges to the least-norm solution of VI(A, M).
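A minimal sketch of the single time-scale procedure (11) on the same illustrative map as above; the decay exponents and the constant in r_t are chosen only to make lim r_t/ε_t = 0 visible in a short deterministic run and are not the tuned rates of Assumption 6.

```python
import numpy as np

def M(z):
    s = z[0] + z[1] - 1.0
    return np.array([s, s])

z = np.array([1.0, -0.5])
for t in range(1, 200_001):
    beta_t = 1.0 / t**0.7
    eps_t  = 1.0 / t**0.2
    r_t    = 0.5 / t**0.3          # r_t / eps_t -> 0, shrunk set non-empty
    z = np.clip(z - beta_t * (M(z) + eps_t * z), -(1 - r_t), 1 - r_t)
print(z)  # roughly (0.48, 0.48): z tracks y(t), which tends to (0.5, 0.5)
```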
Proposition 1: The sequence z(t) defined by (11) converges to the least-norm solution of VI(A, M).

To prove the result above, we bound ‖z(t+1) − y(t)‖² in terms of the previous terms in the sequence, namely ‖z(t) − y(t−1)‖², and show that [17, Lemma 10, page 49] on convergence of a random sequence applies. To do so, though, we first need to bound the variations of y(t) as below.

Lemma 4:
Under Assumptions 2, 4, and 6, the modified Tikhonov sequence y(t) defined in (10) satisfies

‖y(t) − y(t−1)‖ = O((ε_{t−1} − ε_t)/ε_t + (r_{t−1} − r_t)/ε_t^{5/2}).

Please see Appendix D for the proof.

In summary, the results of this section enable us to prove Proposition 1. This proposition serves as the main new result, in comparison to non-regularized stochastic gradient procedures, in order to show almost-sure boundedness of ‖μ(t)‖ and the convergence of the algorithm to a Nash equilibrium.

C. Boundedness of the iterates

Lemma 5:
Let Assumptions 2-6 hold in Γ(N, {A^i}, {J_i}) and let μ(t) be the vector updated in the run of the payoff-based algorithm (3). Then, Pr{sup_{t≥0} ‖μ(t)‖ < ∞} = 1.

Proof:
If the set A is compact, then, according to the update in (7), the norm of the vector μ(t) is bounded for all t. So, let us consider the case of unbounded A.

Define V(t, μ) = ‖μ − z(t)‖², where z(t) is given in (11). Consider the generating operator of the Markov process μ(t):

LV(t, μ) = E[V(t+1, μ(t+1)) | μ(t) = μ] − V(t, μ).

We aim to show that LV(t, μ) satisfies the following decay:

LV(t, μ) ≤ −α(t+1)ψ(μ) + φ(t)(1 + V(t, μ)), (12)

where ψ, φ, and α are terms arising from Q, R, and P in (7). Our goal is to show that ψ ≥ 0 on R^{Nd}, φ(t) > 0 for all t, Σ_{t=0}^∞ φ(t) < ∞, α(t) > 0, Σ_{t=0}^∞ α(t) = ∞. This, combined with the boundedness of the iterates z(t) stated in Proposition 1, enables us to apply Theorem 2.5.2 in [15] to conclude almost sure boundedness of μ(t).

From now on, for simplicity of notation, we omit the argument σ_t in the terms M̃, Q, and R. In certain derivations, for the same reason, we omit the time parameter t as well. Let us analyze each term i = 1, . . . , N in

V(t+1, μ(t+1)) = Σ_{i=1}^N ‖μ^i(t+1) − z^i(t+1)‖².

From the procedures for the updates of μ(t) and z(t) and the non-expansion property of the projection operator, we obtain

‖μ^i(t+1) − z^i(t+1)‖² ≤ ‖μ^i(t) − z^i(t) − β_t[ε_t(μ^i(t) − z^i(t)) + (M_i(μ(t)) − M_i(z(t)) + Q_i(μ(t)) + R_i(x(t), μ(t)) + P_i(x(t), μ(t)))]‖²
= ‖μ^i(t) − z^i(t)‖² − 2β_t(M_i(μ(t)) − M_i(z(t)), μ^i(t) − z^i(t)) − 2β_tε_t(μ^i(t) − z^i(t), μ^i(t) − z^i(t)) − 2β_t(Q_i(μ(t)) + R_i(x(t), μ(t)), μ^i(t) − z^i(t)) − 2β_t(P_i(x(t), μ(t)), μ^i(t) − z^i(t)) + β_t²‖G_i(x(t), μ(t))‖², (13)

where, for ease of notation, we have defined

G_i(x(t), μ(t)) = ε_t(μ^i(t) − z^i(t)) + M_i(μ(t)) − M_i(z(t)) + Q_i(μ(t)) + R_i(x(t), μ(t)) + P_i(x(t), μ(t)). (14)
Note that the terms in ‖G_i(x(t), μ(t))‖² are given as

‖G_i(x(t), μ(t))‖² = ε_t²‖μ^i(t) − z^i(t)‖² + ‖M_i(μ(t)) − M_i(z(t))‖² + ‖Q_i(μ(t))‖² + ‖R_i(x(t), μ(t))‖² + ‖P_i(x(t), μ(t))‖² + 2(Q_i(μ(t)), R_i(x(t), μ(t))) + 2(P_i(x(t), μ(t)), R_i(x(t), μ(t))) + 2(Q_i(μ(t)), P_i(x(t), μ(t))) + 2ε_t(M_i(μ(t)) − M_i(z(t)), μ^i(t) − z^i(t)) + 2(ε_t(μ^i(t) − z^i(t)) + M_i(μ(t)) − M_i(z(t)), Q_i(μ(t)) + R_i(x(t), μ(t)) + P_i(x(t), μ(t))). (15)

Thus, accounting for the above, for (9), which implies E{R_i(x(t), μ(t)) | μ(t) = μ} = 0 for any μ, and for the Cauchy-Schwarz inequality, we get from (13):

E{‖μ^i(t+1) − z^i(t+1)‖² | μ(t) = μ}
≤ (1 − 2β_tε_t)‖μ^i − z^i(t)‖² − 2β_t(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) − 2β_t(Q_i(μ), μ^i − z^i(t)) + 2β_tE{‖P_i(x(t), μ(t))‖ | μ(t) = μ}‖μ^i − z^i(t)‖ + β_t²E{‖G_i(x(t), μ(t))‖² | μ(t) = μ}
≤ (1 − 2β_tε_t)‖μ^i − z^i(t)‖² − 2β_t(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) + 2β_t‖Q_i(μ)‖‖μ^i − z^i(t)‖ + 2β_tE{‖P_i(x(t), μ(t))‖ | μ(t) = μ}‖μ^i − z^i(t)‖ + β_t²ε_t²‖μ^i − z^i(t)‖² + β_t²[‖M_i(μ) − M_i(z(t))‖² + ‖Q_i(μ)‖² + E{‖R_i(x(t), μ(t))‖² + ‖P_i(x(t), μ(t))‖² | μ(t) = μ} + 2E{(P_i(x(t), μ(t)), R_i(x(t), μ(t))) | μ(t) = μ} + 2‖Q_i(μ)‖E{‖P_i(x(t), μ(t))‖ | μ(t) = μ} + 2(ε_t‖μ^i − z^i(t)‖ + ‖M_i(μ) − M_i(z(t))‖)(‖Q_i(μ)‖ + E{‖P_i(x(t), μ(t))‖ | μ(t) = μ})]. (16)

We proceed by estimating the terms in the inequality above. Due to Assumption 4, we conclude that

‖M_i(μ) − M_i(z(t))‖² ≤ L_i²‖μ − z(t)‖² = O(V(t, μ)),
(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) ≤ ‖M_i(μ) − M_i(z(t))‖‖μ^i − z^i(t)‖ ≤ L_i‖μ − z(t)‖‖μ^i − z^i(t)‖ = O(V(t, μ)). (17)

Let us analyze the terms containing the disturbance of the gradient, namely Q_i, in Equation (15). Since Q_i(μ(t)) = M̃_i(μ(t)) − M_i(μ(t)), due to Assumption 2 and Equation (8) we obtain

‖Q_i(μ)‖ = ‖∫_{R^{Nd}} [M_i(x) − M_i(μ)] p(x; μ, σ_t) dx‖ ≤ ∫_{R^{Nd}} ‖M_i(x) − M_i(μ)‖ p(x; μ, σ_t) dx ≤ ∫_{R^{Nd}} L_i‖x − μ‖ p(x; μ, σ_t) dx ≤ ∫_{R^{Nd}} L_i (Σ_{i=1}^N Σ_{k=1}^d |x^i_k − μ^i_k|) p(x; μ, σ_t) dx = O(σ_t), (18)

where the last equality is due to the fact that the first central absolute moment of a random variable with normal distribution N(μ, σ²) is O(σ). The estimation above implies, in particular, that for any μ ∈ A,

‖Q_i(μ)‖‖μ^i − z^i(t)‖ = O(σ_t)(1 + V(t, μ)), (19)
‖Q_i(μ)‖‖M_i(μ) − M_i(z(t))‖ ≤ L_i‖Q_i(μ)‖‖μ − z(t)‖ = O(σ_t)(1 + V(t, μ)). (20)

We next bound the martingale term ‖R_i(x(t), μ(t))‖:

E{‖R_i(x(t), μ(t))‖² | μ(t) = μ} ≤ Σ_{k=1}^d ∫_{R^{Nd}} J_i²(x) (x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx ≤ f_i(μ, σ_t)/σ_t² = O(1 + V(t, μ))/σ_t², (21)

where the first inequality is due to the fact that E(ξ − Eξ)² ≤ Eξ², taking into account (9), and the second inequality is due to Assumption 5, with f_i(μ, σ_t) being a quadratic function of μ and σ_t, i ∈ [N] (see Appendix E for more details).

We proceed by estimating the terms containing P_i(x(t), μ(t)).
For any μ ∈ A we have

E{‖P_i(x(t), μ(t))‖ | μ(t) = μ} = E{‖x^i(t) − μ^i‖ |J_i(x(t)) − J_i(a(t))|/σ_t²}
= Pr{x(t) ∈ R^{Nd}\A} E{‖x^i(t) − μ^i‖ |J_i(x(t)) − J_i(a(t))|/σ_t² | x(t) ∈ R^{Nd}\A}
≤ Pr{x(t) ∈ R^{Nd}\A} E{l‖x^i(t) − μ^i‖‖x(t) − μ‖/σ_t² | x(t) ∈ R^{Nd}\A}
= k₁ Pr{x(t) ∈ R^{Nd}\A}, for some k₁ > 0, (22)

where the inequality is due to Assumption 5, implying |J_i(x(t)) − J_i(a(t))| ≤ l‖x(t) − a(t)‖, and furthermore because ‖x(t) − a(t)‖ ≤ ‖x(t) − μ‖ for a(t) = Proj_A x(t).

Next, let us estimate Pr{x(t) ∈ R^{Nd}\A}. The idea is that, since x(t) is sampled from a Gaussian distribution with mean μ(t), x(t) concentrates around its mean μ(t) with high probability. Since the mean is projected onto a shrunk version of the set A, namely (1−r_t)A, by appropriately tuning r_t and σ_t we can ensure that x(t) stays within the original feasible set with high probability. Let O_{r_t}(μ) = {y ∈ R^{Nd} : ‖y − μ‖ < r_t} denote the r_t-neighborhood of the point μ. Hence, sup_{y∉O_{r_t}(μ)} −‖y − μ‖² = −r_t². Then, taking into account the fact that O_{r_t}(μ) is contained in A and r_t < 1, we obtain that for any t and any bounded σ̄ > σ_t:

Pr{x(t) ∈ R^{Nd}\A} ≤ Pr{x(t) ∈ R^{Nd}\O_{r_t}(μ)}
= ∫_{y∉O_{r_t}(μ)} (1/((2π)^{Nd/2}σ_t^{Nd})) exp{−‖y − μ‖²/(2σ_t²)} dy
= ∫_{y∉O_{r_t}(μ)} exp{−(‖y − μ‖²/2)(1/σ_t² − 1/σ̄²)} (σ̄^{Nd}/σ_t^{Nd}) (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy
≤ exp{−(r_t²/2)(1/σ_t² − 1/σ̄²)} (σ̄^{Nd}/σ_t^{Nd}) ∫_{y∉O_{r_t}(μ)} (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy
≤ k₂ e^{−r_t²/(2σ_t²)}/σ_t^{Nd} (23)

for some finite k₂ > 0. The last inequality holds because

∫_{y∉O_{r_t}(μ)} (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy ≤ 1

and, thus, there exists 0 < k₂ < ∞ such that

∫_{y∉O_{r_t}(μ)} e^{r_t²/(2σ̄²)} σ̄^{Nd} (1/((2π)^{Nd/2}σ̄^{Nd})) exp{−‖y − μ‖²/(2σ̄²)} dy ≤ k₂.

From (16), it now remains to bound the term E{(R_i(x(t), μ(t)), P_i(x(t), μ(t))) | μ(t) = μ}. According to the definitions of P_i and R_i, Remark 1, and the Cauchy-Schwarz inequality,

E{(R_i(x(t), μ(t)), P_i(x(t), μ(t))) | μ(t) = μ}
= E{(J_i(x(t))(x^i(t) − μ^i)/σ_t² − M̃_i(μ, σ_t), ((x^i(t) − μ^i)/σ_t²)(J_i(a(t)) − J_i(x(t))))}
≤ ‖M̃_i(μ, σ_t)‖E{‖P_i(x(t), μ)‖} − E{J_i(x(t))(J_i(a(t)) − J_i(x(t)))‖x^i(t) − μ^i‖²/σ_t⁴}
≤ ‖M̃_i(μ, σ_t)‖E{‖P_i(x(t), μ)‖} + l Pr{x(t) ∈ R^{Nd}\A} E{|J_i(x(t))| ‖x(t) − μ‖ ‖x^i(t) − μ^i‖²/σ_t⁴}. (24)

Note that ‖M̃_i(μ, σ_t)‖ is bounded (Lemma 2), and that

E{|J_i(x(t))| ‖x(t) − μ‖ ‖x^i(t) − μ^i‖²/σ_t⁴} = h_i(μ, σ_t)/σ_t, (25)

where h_i(μ, σ_t) is a quadratic function of μ and σ_t, i ∈ [N] (see Appendix E for more details). Hence, due to the choice of the parameters r_t and σ_t (in particular, Assumption 6 iv)) and the estimations in (22)-(24), we conclude that the terms containing P_i are dominated by the other terms in the inequality (16).
Thus, by substituting (17)-(21) into (16), we obtain

E{‖μ^i(t+1) − z^i(t+1)‖² | μ(t) = μ} ≤ (1 − 2β_tε_t)‖μ^i − z^i(t)‖² − 2β_t(M_i(μ) − M_i(z(t)), μ^i − z^i(t)) + 2β_tO(σ_t)(1 + V(t, μ)) + β_t²ε_t²V(t, μ) + O(γ_t²)(1 + V(t, μ)), (26)

where in the last inequality we used that ε_t → 0, γ_t → 0, and σ_t → 0 as t → ∞ (see Assumption 6). Thus, taking into account Assumption 6 iii), iv) and (26), we obtain

E[‖μ(t+1) − z(t+1)‖² | μ(t) = μ] = Σ_{i=1}^N E[‖μ^i(t+1) − z^i(t+1)‖² | μ(t) = μ]
≤ (1 − 2ε_tβ_t)‖μ − z(t)‖² − 2β_t(M(μ) − M(z(t)), μ − z(t)) + O(β_tσ_t + γ_t²)(1 + V(t, μ)). (27)

Thus,

LV(t, μ) ≤ −2β_t(M(μ) − M(z(t)), μ − z(t)) + O(β_tσ_t + γ_t²)(1 + V(t, μ)). (28)

According to Assumption 6 iii), Σ_{t=0}^∞ (β_tσ_t + γ_t²) < ∞. Furthermore, from Assumption 6 i), Σ_{t=0}^∞ β_t = ∞. Taking this into account, together with (28) and monotonicity of M, which implies

(M(μ) − M(z(t)), μ − z(t)) ≥ 0 for all t and all μ ∈ A, (29)

we conclude that LV(t, μ) satisfies the decay needed for the application of Theorem 2.5.2 in [15] and, consequently, μ(t) is finite almost surely for any t ∈ Z_+ irrespective of μ(0). ∎

D. Convergence to Nash equilibrium
We use the bound estimations of the previous section to prove convergence of the algorithm. In particular, we use Inequality (27), which bounds the decay of the sequence E[‖μ(t+1) − z(t+1)‖² | μ(t)] in terms of ‖μ − z(t)‖². We will show that this decay satisfies the conditions for applying Lemma 10 in [17]. From this, it can readily be inferred that the random variables ‖μ(t) − z(t)‖ converge to zero. First, however, let us verify that Inequalities (27), (28) hold in the compact-action case as well.

Remark 2:
If the set A is compact, then due to Assumption 5 the inequality (21) can be replaced by E{‖R_i(x(t), μ(t))‖² | μ(t) = μ} = O(1/σ_t²). Moreover, the inequalities (22) and (24) hold for the case of the bounded set A. Indeed, due to the polynomial behavior of J_i(x(t)) for large x(t), the terms E{‖x^i(t) − μ^i‖|J_i(x(t)) − J_i(a(t))|/σ_t²} and E{|J_i(x(t))(J_i(a(t)) − J_i(x(t)))|‖x^i(t) − μ^i‖²/σ_t⁴} are upper bounded by some constants. Thus, for the bounded set A, the inequality (27) can be rewritten as

E[‖μ(t+1) − z(t+1)‖² | μ(t) = μ] ≤ (1 − 2ε_tβ_t)‖μ − z(t)‖² − 2β_t(M(μ) − M(z(t)), μ − z(t)) + O(β_tσ_t + γ_t²).

Proof: (of Theorem 2) Note that we can rewrite (27) as

E[‖μ(t+1) − z(t+1)‖² | F_t] ≤ (1 − ε_tβ_t)‖μ(t) − z(t)‖² + O(γ_t² + β_tσ_t), (30)

where F_t is the σ-algebra generated by the random variables {x(k), μ(k)}_{k=0}^t. In (30) we used (29) and Lemma 5. From Assumption 6 and the choices of γ_t, σ_t, ε_t, we get O(γ_t² + β_tσ_t) = O(1/t^n) and ε_tβ_t = 1/t^m, with n > 1, m ≤ 1. Thus,

lim_{t→∞} O(γ_t² + β_tσ_t)/(ε_tβ_t) = 0.

This limit, the fact that Σ_{t=0}^∞ (γ_t² + β_tσ_t) < ∞, and the decay (30) imply that we can apply Lemma 10 in [17] to the sequence ‖μ(t+1) − z(t+1)‖² to conclude its almost sure convergence to 0 as t → ∞. Next, by taking into account Theorem 3 and Theorem 1, we obtain that

Pr{lim_{t→∞} μ(t) = a*} = 1,

where a* is the least-norm Nash equilibrium of the game Γ(N, {A^i}, {J_i}). Finally, Assumption 6 implies that lim_{t→∞} σ_t = 0. Taking into account that x(t) ∼ N(μ(t), σ_t²I) and lim_{t→∞} ‖a(t) − x(t)‖ = 0, we conclude that a(t) converges weakly to the Nash equilibrium a* = μ*. Moreover, according to the Portmanteau Lemma [10], this convergence is also in probability. ∎

V. DISCUSSION
In the proposed algorithm, convergence is established under mild conditions, as strict monotonicity of the game mapping is not required. This significantly extends the applicability of bandit online learning. For example, the zero-sum game considered in [13] with an interior Nash equilibrium satisfies the assumptions of our theorem. Whereas follow-the-regularized-leader (FTRL) learning approaches fail to converge in simple zero-sum games (such as matching pennies), our doubly regularized approach resolves this problem (see the sketch below). In general, examples of games that satisfy the assumptions above include mixed extensions of zero-sum games, Cournot competition, continuous-action congestion games, and convex potential games. On the other hand, mixed extensions of non-zero-sum games do not satisfy the monotonicity assumption in general.

In accordance with the payoff-based information structure, the parameters γ_t, σ_t, ε_t, r_t are independent of the problem data, including the Lipschitz constant of the game mapping or the constraint sets. Below, we further specify feasible choices of the parameters that ensure convergence.
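As an end-to-end illustration (a sketch under the stated assumptions, not the paper's experiments), the following code runs the payoff-based update (3) on the mixed extension of matching pennies, whose game mapping is skew-symmetric and hence merely monotone. It uses the example exponents given after Assumption 6, with the constant in r_t scaled so the shrunk set is non-empty from t = 1. Since ε_t decays slowly, at finite t the iterate tracks the regularized solution y(t) rather than the equilibrium itself.

```python
import numpy as np

# Mixed extension of matching pennies: player 1 plays heads with probability
# p, player 2 with probability q; J_1(p, q) = -(2p-1)(2q-1) = -J_2(p, q).
# The game mapping is skew-symmetric; the unique equilibrium is (1/2, 1/2).
rng = np.random.default_rng(3)

def costs(a):
    g = (2.0 * a[0] - 1.0) * (2.0 * a[1] - 1.0)
    return np.array([-g, g])                 # bandit feedback (J_1, J_2)

mu = np.array([0.9, 0.2])                    # initial means in A = [0, 1]^2
for t in range(1, 200_001):
    gamma_t, sigma_t = t**-0.7, t**-0.125    # exponents a1, a2
    eps_t, r_t = t**-0.02, 0.25 * t**-0.06   # a3, a4 (constant keeps r_t < 1/2)
    x = rng.normal(mu, sigma_t)              # states
    a = np.clip(x, 0.0, 1.0)                 # feasible played actions
    grad_est = costs(a) * (x - mu) / sigma_t**2
    mu = np.clip(mu - gamma_t * sigma_t**2 * (grad_est + eps_t * mu),
                 r_t, 1.0 - r_t)             # shrunk set [r_t, 1 - r_t]^2
print(mu)  # roughly (0.4, 0.6): tracks y(t), tending to (0.5, 0.5) as eps_t -> 0
```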
Lemma 6: A sufficient condition on 0 < a₁, a₂, a₃, a₄ < 1 for satisfying Assumption 6 is as follows:
i) a₁ + 2a₂ ≤ 1, a₁ + 2a₂ + a₃ ≤ 1;
ii) a₁ + 2a₂ + a₃ < 1, a₁ + 2a₂ + 6a₃ − 2a₄ < 1;
iii) a₁ > 1/2, a₁ + 3a₂ > 1;
iv) a₃ < a₄ < a₂.

Proof:
The series Σ_{t=0}^∞ 1/t^m converges for m > 1 and diverges otherwise. Thus, statements i), iii), and iv) above follow. To show statement ii), let us consider the term (ε_{t−1} − ε_t)² in the first summand of Assumption 6 ii), namely Σ_{t=0}^∞ (ε_{t−1} − ε_t)²/(β_tε_t³). We have

ε_{t−1} − ε_t = (t−1)^{−a₃} − t^{−a₃} (multiply by t^{a₃}/t^{a₃})
= ((1 − 1/t)^{−a₃} − 1)/t^{a₃} (apply a Taylor approximation)
= (1 + a₃/t + O(t^{−2}) − 1)/t^{a₃} = O(t^{−1−a₃}).

Combining the above with the denominator β_tε_t³, we obtain that Σ_t (ε_{t−1} − ε_t)²/(β_tε_t³) converges if a₁ + 2a₂ + a₃ < 1. Repeating the same analysis for (r_{t−1} − r_t)²/(β_tε_t⁶), we obtain that Σ_t (r_{t−1} − r_t)²/(β_tε_t⁶) converges if a₁ + 2a₂ + 6a₃ − 2a₄ < 1, and ii) is verified. ∎
VI. CONCLUSIONS

We designed an algorithm for learning Nash equilibria in convex games with monotone game mappings using online bandit feedback information. Our algorithm relies on a suitable double regularization to handle non-strictly monotone game maps as well as feasibility of the queried actions (online setting). The implication of our result is that players can learn Nash equilibria in several monotone games, such as finite-action zero-sum games, infinite-action zero-sum convex games, and convex games with linear coupling constraints. Several points remain open and are the topic of our current study. These include showing that our algorithm is no-regret, unifying different sampling approaches to perform one-point estimation of the game mapping for bandit learning in games, and analyzing the convergence rate of the algorithm.
REFERENCES

[1] J. P. Bailey, G. Gidel, and G. Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, pages 391-407, 2020.
[2] D. Balduzzi, S. Racaniere, J. Martens, J. Foerster, K. Tuyls, and T. Graepel. The mechanics of n-player differentiable games. In ICML, volume 80, pages 363-372. JMLR.org, 2018.
[3] B. Bharath and V. S. Borkar. Stochastic approximation algorithms: Overview and recent trends. Sadhana, 24(4):425-452, 1999.
[4] M. Bravo, D. Leslie, and P. Mertikopoulos. Bandit learning in concave n-person games. In Advances in Neural Information Processing Systems, pages 5661-5671, 2018.
[5] F. Facchinei and C. Kanzow. Generalized Nash equilibrium problems. 4OR, 5(3):173-210, 2007.
[6] A. D. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385-394. Society for Industrial and Applied Mathematics, 2005.
[7] S. Grammatico. Comments on "Distributed robust adaptive equilibrium computation for generalized convex games" [Automatica 63 (2016) 82-91]. Automatica, 97:186-188, 2018.
[8] J.-B. Hiriart-Urruty and C. Lemaréchal. Fundamentals of Convex Analysis. Grundlehren Text Editions. Springer, 2001.
[9] R. D. Kleinberg. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pages 697-704, 2005.
[10] A. Klenke. Probability Theory: A Comprehensive Course. Springer, London, 2008.
[11] J. Koshal, A. Nedić, and U. Shanbhag. Single timescale regularized stochastic approximation schemes for monotone Nash games under uncertainty. In IEEE Conference on Decision and Control, pages 231-236, 2010.
[12] P. Mertikopoulos, B. Lecouat, H. Zenati, Ch.-Sh. Foo, V. Chandrasekhar, and G. Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.
[13] P. Mertikopoulos, C. Papadimitriou, and G. Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703-2717, 2018.
[14] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Found. Comput. Math., 17(2):527-566, April 2017.
[15] M. B. Nevelson and R. Z. Khasminskii. Stochastic Approximation and Recursive Estimation [translated from the Russian by the Israel Program for Scientific Translations; translation edited by B. Silver]. American Mathematical Society, 1973.
[16] J.-S. Pang and F. Facchinei. Finite-Dimensional Variational Inequalities and Complementarity Problems: Vol. 1. Springer Series in Operations Research. Springer, New York, Berlin, Heidelberg, 2003.
[17] B. T. Poljak. Introduction to Optimization. Optimization Software, 1987.
[18] T. Tatarenko and M. Kamgarpour. Learning generalized Nash equilibria in a class of convex games. IEEE Transactions on Automatic Control, 2018. To appear. URL: https://arxiv.org/abs/1703.04113.
[19] T. Tatarenko and M. Kamgarpour. Learning Nash equilibria in monotone games. In IEEE Conference on Decision and Control, pages 3104-3109, 2019.
[20] A. L. Thathachar and P. S. Sastry. Networks of Learning Automata: Techniques for Online Stochastic Optimization. Springer US, 2003.
[21] V. A. Zorich and R. Cooke. Mathematical Analysis II. Mathematical Analysis. Springer, 2004.

APPENDIX
The appendix provides supporting theorems and proofs of certain lemmas and statements.
A. Supporting Theorems
Let {X(t)}_t, t ∈ Z_+, be a discrete-time Markov process on some state space E ⊆ R^d, namely X(t) = X(t, ω): Z_+ × Ω → E, where Ω is the sample space of the probability space on which the process X(t) is defined. The transition function of this chain, namely Pr{X(t+1) ∈ Γ | X(t) = X}, is denoted by P(t, X, t+1, Γ), Γ ⊆ E.

Definition 4:
The operator L, defined on the set of measurable functions V: Z_+ × E → R, X ∈ E, by

LV(t, X) = ∫ P(t, X, t+1, dy)[V(t+1, y) − V(t, X)] = E[V(t+1, X(t+1)) | X(t) = X] − V(t, X),

is called the generating operator of the Markov process {X(t)}_t.

Next, we formulate the following theorem for discrete-time Markov processes, which is proven in [15, Theorem 2.5.2].

Theorem 4:
Consider a Markov process {X(t)}_t and suppose that there exists a function V(t, X) ≥ 0 such that inf_{t≥0} V(t, X) → ∞ as ‖X‖ → ∞ and

LV(t, X) ≤ −α(t+1)ψ(t, X) + f(t)(1 + V(t, X)),

where ψ(t, X) ≥ 0, f(t) > 0, Σ_{t=0}^∞ f(t) < ∞, and α(t) is such that α(t) > 0, Σ_{t=0}^∞ α(t) = ∞. Then, almost surely, sup_{t≥0} ‖X(t, ω)‖ = R(ω) < ∞.

The following result related to the convergence of stochastic processes is proven in [17, Lemma 10, page 49].

Theorem 5:
Let v_0, . . . , v_k, . . . be a sequence of random variables, v_k ≥ 0, E v_0 < ∞, and let

E{v_{k+1} | F_k} ≤ (1 − α_k)v_k + β_k,

where F_k is the σ-algebra generated by the random variables {v_0, . . . , v_k}, 0 < α_k < 1, Σ_{k=0}^∞ α_k = ∞, β_k ≥ 0, Σ_{k=0}^∞ β_k < ∞, and lim_{k→∞} β_k/α_k = 0. Then v_k → 0 almost surely and E v_k → 0 as k → ∞.

B. Proof of Lemma 2

Proof:
We verify that differentiation under the integral sign in (4) is justified. It can then readily be verified that (6) holds by taking the differentiation inside the integral. A sufficient condition for differentiation under the integral sign is that the integral of the formally differentiated function with respect to μ^i_k converges uniformly, while the differentiated function is continuous (see [21], Chapter 17). By formally differentiating the function under the integral sign and omitting the argument t, we obtain

(1/σ²) ∫_{R^{Nd}} J_i(x)(x^i_k − μ^i_k) p(μ, x, σ) dx. (31)

Given Assumption 1, J_i(x)(x^i_k − μ^i_k)p(μ, x, σ) is continuous. Thus, it remains to check that the integral of this function converges uniformly with respect to any μ ∈ A. If A is bounded, then the conclusion follows from the polynomial behavior of the function J_i at infinity.

We now move to the case of the unbounded set A. To this end, we write the Taylor expansion of the function J_i around the point μ(i, k) ∈ R^{Nd} with the coordinates μ(i, k)^i_k = μ^i_k and μ(i, k)^j_m = x^j_m for any (j, m) ≠ (i, k), in the integral (31):

∫_{R^{Nd}} J_i(x)(x^i_k − μ^i_k) p(μ, x, σ) dx
= ∫_{R^{Nd}} [J_i(μ(i, k)) + (∂J_i(η(x, μ))/∂x^i_k)(x^i_k − μ^i_k)](x^i_k − μ^i_k) p(μ, x, σ) dx
= ∫_{R^{Nd}} (∂J_i(η(x, μ))/∂x^i_k)(x^i_k − μ^i_k)² p(μ, x, σ) dx
= ∫_{R^{Nd}} (∂J_i(η(y, μ))/∂x^i_k)(y^i_k)² p(0, y, σ) dy,

where η(x, μ) = μ(i, k) + θ(x − μ(i, k)), θ ∈ (0, 1), y = x − μ(i, k), and η(y, μ) = μ(i, k) + θy. The uniform convergence of the integral above and, in particular, its boundedness follow from the fact that under Assumption 5 (see the basic sufficient condition using a majorant in [21], Chapter 17.2.3),

|∂J_i(η(y, μ))/∂x^i_k| ≤ l^i_k

for some positive constant l^i_k and for all i ∈ [N], k ∈ [d]. Thus,

|(∂J_i(η(y, μ))/∂x^i_k)(y^i_k)² p(0, y, σ)| ≤ h(y) = l(y^i_k)² p(0, y, σ),

where ∫_{R^{Nd}} h(y) dy < ∞. ∎

C. Proof of Lemma 3

Proof:
Without loss of generality, assume x ∉ (1−r_{t−1})A (otherwise ‖Proj_{(1−r_{t−1})A} x − Proj_{(1−r_t)A} x‖ = 0). Due to convexity of the set A ⊂ R^{Nd}, there exists a convex function g: R^{Nd} → R such that A = {x : g(x) ≤ 0}, whereas (1−r_t)A = {x : g(x) ≤ −r_t} for any t. Moreover, since Proj_{(1−r_{t−1})A} x = Proj_{(1−r_{t−1})A}{Proj_{(1−r_t)A} x}, we have

‖Proj_{(1−r_{t−1})A} x − Proj_{(1−r_t)A} x‖ = d, where
d = min_y ‖y − x′‖, x′ = Proj_{(1−r_t)A} x, s.t. g(y) = −r_{t−1}.

The optimization problem above has a solution y* for which the gradient of the corresponding Lagrangian is zero, namely

(y* − x′)/‖y* − x′‖ + λ∇g(y*) = 0,

where λ > 0 is the dual multiplier of the problem under consideration. Notice that due to Assumption 1 and the choice of r_t, Slater's condition for the constraint g(x) ≤ −r_t holds for all t. Hence, for any x ∈ R^{Nd} there exists a constant Λ > 0 such that λ < Λ (see [8]). Thus, we conclude that

∇g(y*) = −(y* − x′)/(λ‖y* − x′‖).

Next, due to convexity of the function g,

g(x′) ≥ g(y*) + (∇g(y*), x′ − y*) = −r_{t−1} + ‖y* − x′‖/λ ≥ −r_{t−1} + ‖y* − x′‖/Λ.

Thus, taking into account that g(x′) ≤ −r_t, we obtain

d = ‖y* − x′‖ ≤ Λ(r_{t−1} − r_t) = O(|r_{t−1} − r_t|). ∎

D. Proof of Lemma 4

Proof:
Let us use θ = ε_t^{3/2} in Lemma 1 to express y(t) as y(t) = Proj_{(1−r_t)A}[y(t) − ε_t^{3/2}(M(y(t)) + ε_ty(t))]. Using this equivalence, the triangle inequality, and non-expansion of the projection operator, we obtain

‖y(t) − y(t−1)‖ ≤ ‖Proj_{(1−r_t)A}[y(t) − ε_t^{3/2}(M(y(t)) + ε_ty(t))] − Proj_{(1−r_t)A}[y(t−1) − ε_t^{3/2}(M(y(t−1)) + ε_{t−1}y(t−1))]‖ + ‖Proj_{(1−r_t)A}[ỹ(t−1)] − Proj_{(1−r_{t−1})A}[ỹ(t−1)]‖
≤ ‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1))) + y(t−1)ε_t^{3/2}(ε_{t−1} − ε_t)‖ + ‖Proj_{(1−r_t)A}[ỹ(t−1)] − Proj_{(1−r_{t−1})A}[ỹ(t−1)]‖,

where ỹ(t−1) = y(t−1) − ε_t^{3/2}(M(y(t−1)) + ε_{t−1}y(t−1)). Next, due to Lemma 3, we have for any θ_t > 0 and κ_t > 0:

‖y(t) − y(t−1)‖² ≤ (1 + θ_t)‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1))) + y(t−1)ε_t^{3/2}(ε_{t−1} − ε_t)‖² + (1 + 1/θ_t)O((r_t − r_{t−1})²)
≤ (1 + θ_t)(1 + κ_t)‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1)))‖² + (1 + θ_t)(1 + 1/κ_t)‖y(t−1)‖²ε_t³(ε_{t−1} − ε_t)² + (1 + 1/θ_t)O((r_t − r_{t−1})²).

Furthermore, there exists T such that for all t > T,

‖(1 − ε_t^{5/2})(y(t) − y(t−1)) − ε_t^{3/2}(M(y(t)) − M(y(t−1)))‖²
≤ (1 − ε_t^{5/2})²‖y(t) − y(t−1)‖² + ε_t³‖M(y(t)) − M(y(t−1))‖²
≤ (1 − 2ε_t^{5/2} + ε_t⁵ + ε_t³L²)‖y(t) − y(t−1)‖²
≤ (1 − ε_t^{5/2})‖y(t) − y(t−1)‖²,

where the first inequality is due to monotonicity of the mapping M, the second one holds since M is Lipschitz continuous, and the third one is due to the fact that ε_t³(ε_t² + L²) ≤ ε_t^{5/2} for sufficiently large t (since ε_t → 0, see Assumption 6). Thus,

‖y(t) − y(t−1)‖² ≤ (1 + θ_t)(1 + κ_t)(1 − ε_t^{5/2})‖y(t) − y(t−1)‖² + (1 + θ_t)(1 + 1/κ_t)M_y²ε_t³(ε_{t−1} − ε_t)² + (1 + 1/θ_t)O((r_t − r_{t−1})²),

where M_y is the uniform upper bound on the norm of the sequence y(t). By rearranging the terms above, we obtain

(1 − (1 + θ_t)(1 + κ_t)(1 − ε_t^{5/2}))‖y(t) − y(t−1)‖² ≤ (1 + θ_t)(1 + 1/κ_t)M_y²ε_t³(ε_{t−1} − ε_t)² + (1 + 1/θ_t)O((r_t − r_{t−1})²).

We conclude the proof by noticing that, with the choice κ_t = θ_t = ε_t^{5/2}/4, we obtain

1 − (1 + θ_t)(1 + κ_t)(1 − ε_t^{5/2}) ≥ ε_t^{5/2} − θ_t − κ_t − θ_tκ_t ≥ 0.25ε_t^{5/2},

which yields ‖y(t) − y(t−1)‖ = O((ε_{t−1} − ε_t)/ε_t + (r_{t−1} − r_t)/ε_t^{5/2}). ∎

E. Verification of Equations (21) and (25)

Due to Assumption 5, there exists a compact set S ⊂ R^{Nd} such that for any x ∉ S, |J_i(x)| ≤ (c, x) + b for some c = (c^1_1, . . . , c^1_d, . . . , c^N_1, . . . , c^N_d) ∈ R^{Nd} and b ∈ R. Thus, for some positive S̄, d₁, and d₂ we get

∫_{R^{Nd}} J_i²(x)(x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx
≤ ∫_S J_i²(x)(x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx + ∫_{R^{Nd}\S} [d₁‖x‖² + d₂](x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx
≤ S̄/σ_t² + ∫_{R^{Nd}} [d₁‖x‖² + d₂](x^i_k − μ^i_k)²/σ_t⁴ p(μ, x) dx = S̄/σ_t² + f_i(μ, σ_t)/σ_t²,

with f_i(μ, σ_t) being a quadratic function of μ and σ_t. The last equality is due to the facts that

∫_{R^{Nd}} (x^i_k − μ^i_k)² p(μ, x) dx = σ_t²,
∫_{R^{Nd}} (x^i_k)²(x^i_k − μ^i_k)² p(μ, x) dx ≤ ∫_{R^{Nd}} [2(x^i_k − μ^i_k)² + 2(μ^i_k)²](x^i_k − μ^i_k)² p(μ, x) dx = 6σ_t⁴ + 2(μ^i_k)²σ_t²,
∫_{R^{Nd}} (x^j_m)²(x^i_k − μ^i_k)² p(μ, x) dx = (σ_t² + (μ^j_m)²)σ_t² for (j, m) ≠ (i, k).

The bound in (25) follows analogously.