Poincaré Recurrence, Cycles and Spurious Equilibria in Gradient-Descent-Ascent for Non-Convex Non-Concave Zero-Sum Games

Abstract

We study a wide class of non-convex non-concave min-max games that generalizes over standard bilinear zero-sum games. In this class, players control the inputs of a smooth function whose output is being applied to a bilinear zero-sum game. This class of games is motivated by the indirect nature of the competition in Generative Adversarial Networks, where players control the parameters of a neural network while the actual competition happens between the distributions that the generator and discriminator capture. We establish theoretically, that depending on the specific instance of the problem gradient-descent-ascent dynamics can exhibit a variety of behaviors antithetical to convergence to the game theoretically meaningful min-max solution. Specifically, different forms of recurrent behavior (including periodicity and Poincaré recurrence) are possible as well as convergence to spurious (non-min-max) equilibria for a positive measure of initial conditions. At the technical level, our analysis combines tools from optimization theory, game theory and dynamical systems.


Lampros Flokas ∗, Department of Computer Science, Columbia University, New York, NY 10025

Emmanouil V. Vlatakis-Gkaragkounis ∗, Department of Computer Science, Columbia University, New York, NY 10025

Georgios Piliouras, Engineering Systems and Design, Singapore University of Technology and Design, Singapore


∗ Equal contribution. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Min-max optimization is a problem of interest in several communities including Optimization, Game Theory and Machine Learning. In its most general form, given an objective function $r : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$, we would like to solve the following problem:

$$(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) = \arg\min_{\boldsymbol{\theta} \in \mathbb{R}^n} \arg\max_{\boldsymbol{\phi} \in \mathbb{R}^m} r(\boldsymbol{\theta}, \boldsymbol{\phi}). \tag{1}$$

This problem is much more complicated than classical minimization problems, as even understanding under which conditions such a solution is meaningful is far from trivial Daskalakis and Panageas [2018], Mai et al. [2017], Oliehoek et al. [2018], Jin et al. [2019]. What is even more demanding is understanding what kind of algorithms/dynamics are able to solve this problem when a solution is well defined.

Recently this problem has attracted renewed interest motivated by the advent of Generative Adversarial Networks (GANs) and their numerous applications Goodfellow et al. [2014], Radford et al. [2016], Isola et al. [2017], Zhang et al. [2017], Arjovsky et al. [2017], Ledig et al. [2017], Salimans et al. [2016]. A classical GAN architecture mainly revolves around the competition between two players, the generator and the discriminator. On the one hand, the generator aims to train a neural network based generative model that can generate high fidelity samples from a target distribution. On the other hand, the discriminator's goal is to train a neural network classifier that can distinguish between the samples of the target distribution and artificially generated samples. While one could consider each of the tasks in isolation, it is the competitive interaction between the generator and the discriminator that has led to the resounding success of GANs. It is the "criticism" from a powerful discriminator that pushes the generator to capture the target distribution more accurately, and it is the access to high fidelity artificial samples from a good generator that gives rise to better discriminators. Machine Learning researchers and practitioners have tried to formalize this competition using the min-max optimization framework mentioned above with great success Arora et al. [2017], Ma [2018], Ge et al. [2018], Yazıcı et al. [2019].

One of the main limitations of this framework, however, is that to this day efficiently training GANs can be a notoriously difficult task Salimans et al. [2016], Metz et al. [2017], Mertikopoulos et al. [2018], Kodali et al. [2017]. Addressing this limitation has been the object of interest for a long line of work in recent years Mescheder et al. [2018], Metz et al. [2017], Pfau and Vinyals [2016], Radford et al. [2016], Tolstikhin et al. [2017], Berthelot et al. [2017], Gulrajani et al. [2017]. Despite the intensified study, very little is known about efficiently solving general min-max optimization problems. Even for the relatively simple case of bilinear games, the few results that are known usually have a negative flavour. For example, the continuous time analogues of standard game dynamics such as gradient-descent-ascent or multiplicative weights lead to cyclic or recurrent behavior Piliouras and Shamma [2014], Mertikopoulos et al. [2018], whereas when they are actually run in discrete time they lead to divergence and chaos Bailey and Piliouras [2018], Cheung and Piliouras [2019], Bailey and Piliouras [2019b]. While positive results for the case of bilinear games exist, like extra-gradient (optimistic) training (Daskalakis et al. [2018], Mertikopoulos et al. [2019a], Daskalakis and Panageas [2019]) and other techniques Balduzzi et al. [2018], Gidel et al. [2019b,a], Abernethy et al. [2019], these results fail to generalize to complex non-convex non-concave settings Oliehoek et al. [2018], Lin et al. [2018], Sanjabi et al. [2018]. In fact, for the case of non-convex-concave optimization, game theoretic interpretations of equilibria might not even be meaningful Mazumdar and Ratliff [2018], Jin et al. [2019], Adolphs et al. [2019].

In order to shed some light on this intellectually challenging problem, we propose a quite general class of min-max optimization problems that includes bilinear games as well as a wide range of non-convex non-concave games. In this class of problems, each player submits its own decision vector just like in general min-max optimization problems. Then each decision vector is processed separately by a (potentially different) smooth function. Each player finally gets rewarded by plugging the processed decision vectors into a simple bilinear game. More concretely, there are functions $\boldsymbol{F} : \mathbb{R}^n \to \mathbb{R}^N$ and $\boldsymbol{G} : \mathbb{R}^m \to \mathbb{R}^M$ and a matrix $U_{N \times M}$ such that

$$r(\boldsymbol{\theta}, \boldsymbol{\phi}) = \boldsymbol{F}(\boldsymbol{\theta})^\top U \boldsymbol{G}(\boldsymbol{\phi}). \tag{2}$$

We call the resulting class of problems Hidden Bilinear Games.

The motivation behind the proposed class of games is actually the setting of training GANs itself. During the training process of GANs, the discriminator and the generator "submit" the parameters of their corresponding neural network architectures, denoted as $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ in our problem formulation. However, deep networks introduce nonlinearities in mapping their parameters to their output space, which we capture through the non-convex functions $\boldsymbol{F}, \boldsymbol{G}$. Thus, even though hidden bilinear games do not demonstrate the full complexity of modern GAN architectures and training, they manage to capture two of its most pervasive properties: i) the indirect competition of the generator and the discriminator and ii) the non-convex non-concave nature of training GANs. Both features are markedly missing from simple bilinear games.
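To make the payoff of Equation 2 concrete, the following is a minimal sketch of how $r(\boldsymbol{\theta}, \boldsymbol{\phi}) = \boldsymbol{F}(\boldsymbol{\theta})^\top U \boldsymbol{G}(\boldsymbol{\phi})$ could be evaluated. The helper names (`hidden_bilinear_payoff`, `sigmoid_map`) and the chosen instance (Matching Pennies payoffs with one-parameter sigmoid maps) are illustrative assumptions, not part of the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_bilinear_payoff(F, G, U, theta, phi):
    """Evaluate r(theta, phi) = F(theta)^T U G(phi), as in Equation 2."""
    Fv = F(theta)  # processed decision vector of the first player, length N
    Gv = G(phi)    # processed decision vector of the second player, length M
    return sum(Fv[i] * U[i][j] * Gv[j]
               for i in range(len(Fv))
               for j in range(len(Gv)))

# Hypothetical instance: Matching Pennies hidden behind sigmoid maps.
U_MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]

def sigmoid_map(params):
    """Map a one-dimensional parameter vector to a mixed strategy (f, 1 - f)."""
    f = sigmoid(params[0])
    return [f, 1.0 - f]

r_center = hidden_bilinear_payoff(sigmoid_map, sigmoid_map,
                                  U_MATCHING_PENNIES, [0.0], [0.0])
r_corner = hidden_bilinear_payoff(sigmoid_map, sigmoid_map,
                                  U_MATCHING_PENNIES, [10.0], [-10.0])
```

For this particular instance the payoff simplifies to $(2f - 1)(2g - 1)$, so it vanishes when either hidden strategy is uniform and approaches $\pm 1$ as the parameters saturate the sigmoids.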

Our results.

We provide, to the best of our knowledge, the first global analysis of gradient-descent-ascent for a class of non-convex non-concave zero-sum games that by design includes both features of […]. If U is 2x2 (e.g. Matching Pennies) with an interior Nash equilibrium, then the behavior is typically periodic (Theorem 3). If it is a higher dimensional game (e.g. akin to Rock-Paper-Scissors) then even more complex behavior is possible. Specifically, the system is formally analogous to Poincaré recurrent systems (e.g. the many body problem in physics) (Theorems 6, 7). Due to the non-convexity of the operators $\boldsymbol{F}, \boldsymbol{G}$, the system can actually sometimes get stuck at equilibria; however, these fixed points may be merely artifacts of the nonlinearities of $\boldsymbol{F}, \boldsymbol{G}$ instead of meaningful solutions to the underlying min-max problem U (Theorem 8).

Footnote: Interestingly, running alternating gradient-descent-ascent in discrete time results once again in recurrent behavior Bailey et al. [2019].

In Section 7, we show that moving from continuous to discrete time only enhances the disequilibrium properties of the dynamics. Specifically, instead of energy conservation, energy now increases over time, leading away from equilibrium (Theorem 9), whilst spurious (non-min-max) equilibria are still an issue (Theorem 10). Despite these negative results, there is some positive news: at least in some cases we can show that time-averaging over these non-equilibrium trajectories (or equivalently choosing a distribution of parameters instead of a single set of parameters) can recover the min-max equilibrium (Theorem 4). Technically, our results combine tools from dynamical systems (e.g. Poincaré recurrence theorem, Poincaré-Bendixson theorem, Liouville's theorem) along with tools from game theory and non-convex optimization.

Understanding the intricacies of GAN training requires broadening our vocabulary and horizons in terms of what type of long term behaviors are possible and developing new techniques that can hopefully counter them.

The structure of the rest of the paper is as follows. In Section 2 we will present key results from prior work on the problem of min-max optimization. In Section 3 we will present the main mathematical tools for our analysis. Sections 4 through 6 will be devoted to studying interesting special cases of hidden bilinear games. Section 8 will be the conclusion of our work.

Non-equilibrating dynamics in game theory.

Kleinberg et al. [2011] established non-convergence for a continuous-time variant of Multiplicative Weights Update (MWU), known as the replicator dynamic, for a 2x2x2 game and showed that as a result the system converges to states whose social welfare dominates that of all Nash equilibria. Palaiopanos et al. [2017] proved the existence of Li-Yorke chaos in MWU dynamics of 2x2 potential games. From the perspective of evolutionary game theory, which typically studies continuous time dynamics, numerous nonconvergence results are known but again typically for small games, e.g., Sandholm [2010]. Piliouras and Shamma [2014] show that replicator dynamics exhibit a specific type of near periodic behavior in bilinear (network) zero-sum games, which is known as Poincaré recurrence. Recently, Mertikopoulos et al. [2018] generalized these results to more general continuous time variants of FTRL dynamics (e.g. gradient-descent-ascent). Cycles arise also in evolutionary team competition Piliouras and Schulman [2018] as well as in network competition Nagarajan et al. [2018]. Technically, Piliouras and Schulman [2018] is the closest paper to our own as it studies evolutionary competition between Boolean functions; however, the dynamics in the two models are different and that paper is strictly focused on periodic systems. The papers in the category of cyclic/recurrent dynamics combine delicate arguments such as volume preservation and the existence of constants of motion ("energy preservation"). In this paper we provide a wide generalization of these types of results by establishing cycles and recurrence type of behavior for a large class of non-convex non-concave games. In the case of discrete time dynamics, such as standard gradient-descent-ascent, the system trajectories are first order approximations of the above motion and these conservation arguments do not hold exactly. Instead, even in bilinear games, the "energy" slowly increases over time Bailey and Piliouras [2018], implying chaotic divergence away from equilibrium Cheung and Piliouras [2019]. We extend such energy increase results to non-linear settings.

Figure 1: Trajectories of a single player using gradient-descent-ascent dynamics for a hidden Rock-Paper-Scissors game with sigmoid activations. The different colors correspond to different initializations of the dynamics. The trajectories exhibit Poincaré recurrence as expected by Theorem 7.
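The energy increase in discrete time mentioned above can be checked by hand in the smallest bilinear game. The sketch below assumes the scalar game $r(\theta, \phi) = \theta\phi$ and uses the squared distance from the equilibrium $(0, 0)$ as a stand-in for the "energy"; both choices, and the function name `dgda_bilinear`, are ours. One simultaneous gradient-descent-ascent step with learning rate $\alpha$ multiplies this quantity by exactly $1 + \alpha^2$.

```python
def dgda_bilinear(theta, phi, alpha, steps):
    """Simultaneous gradient-descent-ascent on r(theta, phi) = theta * phi.

    Returns the final iterate and the squared distance from the
    equilibrium (0, 0) after every step.
    """
    energies = [theta * theta + phi * phi]
    for _ in range(steps):
        # grad_theta r = phi and grad_phi r = theta, both taken at the current iterate
        theta, phi = theta - alpha * phi, phi + alpha * theta
        energies.append(theta * theta + phi * phi)
    return theta, phi, energies

_, _, energies = dgda_bilinear(1.0, 0.0, alpha=0.1, steps=100)
growth = energies[-1] / energies[0]  # equals (1 + alpha**2) ** steps up to rounding
```

In the continuous-time limit the same game conserves $\theta^2 + \phi^2$, so the outward spiral is purely an effect of the discretization.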

Learning in zero-sum games and connections to GANs.

Several recent papers have shown positive results about convergence to equilibria in (mostly bilinear) zero-sum games for suitably adapted variants of first-order methods and then applied these techniques to Generative Adversarial Networks (GANs), showing improved performance (e.g. Daskalakis et al. [2018], Daskalakis and Panageas [2019]). Balduzzi et al. [2018] made use of conservation laws of learning dynamics in zero-sum games (e.g. Bailey and Piliouras [2019a]) to develop new algorithms for training GANs that add a new component to the vector field that aims at minimizing this energy function. Different energy shrinking techniques for convergence in GANs (non-convex saddle point problems) exploit connections to variational inequalities and employ mirror descent techniques with an extra gradient step Gidel et al. [2018], Mertikopoulos et al. [2019a]. Moreover, adding negative momentum can help with stability in zero-sum games Gidel et al. [2019c]. Game theoretic inspired methods such as time-averaging work well in practice for a wide range of architectures Yazıcı et al. [2019].

Vectors are denoted in boldface $\boldsymbol{x}, \boldsymbol{y}$ and, unless otherwise indicated, are considered as column vectors. We use $\|\cdot\|$ to denote the $\ell_2$-norm. For a function $f : \mathbb{R}^d \to \mathbb{R}$ we use $\nabla f$ to denote its gradient. For functions of two vector arguments, $f(\boldsymbol{x}, \boldsymbol{y}) : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$, we use $\nabla_{\boldsymbol{x}} f, \nabla_{\boldsymbol{y}} f$ to denote its partial gradients. For the time derivative we will use the dot accent abbreviation, i.e., $\dot{\boldsymbol{x}} = \frac{d}{dt}[\boldsymbol{x}(t)]$. A function $f$ will belong to $C^r$ if it is $r$ times continuously differentiable. The term "sigmoid" function refers to $\sigma : \mathbb{R} \to \mathbb{R}$ such that $\sigma(x) = (1 + e^{-x})^{-1}$. Finally, we use $P(\cdot)$, operating over a set, to denote its (Lebesgue) measure.

Definitions

Definition 1 (Hidden Bilinear Zero-Sum Game). In a hidden bilinear zero-sum game there are two players, each one equipped with a smooth function $\boldsymbol{F} : \mathbb{R}^n \to \mathbb{R}^N$ and $\boldsymbol{G} : \mathbb{R}^m \to \mathbb{R}^M$, and a payoff matrix $U_{N \times M}$ such that each player inputs its own decision vector $\boldsymbol{\theta} \in \mathbb{R}^n$ and $\boldsymbol{\phi} \in \mathbb{R}^m$ and is trying to minimize or maximize $r(\boldsymbol{\theta}, \boldsymbol{\phi}) = \boldsymbol{F}(\boldsymbol{\theta})^\top U \boldsymbol{G}(\boldsymbol{\phi})$ respectively.

In this work we will mostly study continuous time dynamics of solutions for the problem of Equation 1 for hidden bilinear zero-sum games, but we will also make some important connections to discrete time dynamics that are also prevalent in practice. In order to make this distinction clear, let us define the following terms.

Definition 2 (Continuous Time Dynamical System). A system of ordinary differential equations $\dot{\boldsymbol{x}} = f(\boldsymbol{x})$ where $f : \mathbb{R}^d \to \mathbb{R}^d$ will be called a continuous time dynamical system. Solutions of the equation $f(\boldsymbol{x}) = 0$ are called the fixed points of the dynamical system.

We will call $f$ the vector field of the dynamical system. In order to understand the properties of continuous time dynamical systems, we will often need to study their behaviour given different initial conditions. This behaviour is captured by the flow of the dynamical system. More precisely,

Definition 3. If $f$ is Lipschitz-continuous, there exists a continuous map $\Phi(\boldsymbol{x}_0, t) : \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}^d$ called the flow of the dynamical system such that for all $\boldsymbol{x}_0 \in \mathbb{R}^d$ we have that $\Phi(\boldsymbol{x}_0, t)$ is the unique solution of the problem $\{\dot{\boldsymbol{x}} = f(\boldsymbol{x}),\ \boldsymbol{x}(0) = \boldsymbol{x}_0\}$. We will refer to $\Phi(\boldsymbol{x}_0, t)$ as a trajectory or orbit of the dynamical system.

In this work we will mainly study the gradient-descent-ascent dynamics for the problem of Equation 1. The continuous (discrete) time version of the dynamics (with learning rate $\alpha$) is based on the following equations:

$$(\text{CGDA}):\ \begin{cases} \dot{\boldsymbol{\theta}} = -\nabla_{\boldsymbol{\theta}} r(\boldsymbol{\theta}, \boldsymbol{\phi}) \\ \dot{\boldsymbol{\phi}} = \nabla_{\boldsymbol{\phi}} r(\boldsymbol{\theta}, \boldsymbol{\phi}) \end{cases} \qquad (\text{DGDA}):\ \begin{cases} \boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \alpha \nabla_{\boldsymbol{\theta}} r(\boldsymbol{\theta}_k, \boldsymbol{\phi}_k) \\ \boldsymbol{\phi}_{k+1} = \boldsymbol{\phi}_k + \alpha \nabla_{\boldsymbol{\phi}} r(\boldsymbol{\theta}_k, \boldsymbol{\phi}_k) \end{cases}$$

A key notion in our analysis is that of (Poincaré) recurrence. Intuitively, a dynamical system is recurrent if, after a sufficiently long (but finite) time, almost every state returns arbitrarily close to the system's initial state.

Deﬁnition 4.

A point $\boldsymbol{x} \in \mathbb{R}^d$ is said to be recurrent under the flow $\Phi$ if for every neighborhood $U \subseteq \mathbb{R}^d$ of $\boldsymbol{x}$ there exists an increasing sequence of times $t_n$ such that $\lim_{n \to \infty} t_n = \infty$ and $\Phi(\boldsymbol{x}, t_n) \in U$ for all $n$. Moreover, the flow $\Phi$ is called Poincaré recurrent in a non-zero measure set $A \subseteq \mathbb{R}^d$ if the set of the non-recurrent points in $A$ has zero measure.

In this section we will focus on a particular case of hidden bilinear games where both the generator and the discriminator play only two strategies. Let $U$ be our zero-sum game; without loss of generality we can assume that there are functions $f : \mathbb{R}^n \to [0, 1]$ and $g : \mathbb{R}^m \to [0, 1]$ such that

$$\boldsymbol{F}(\boldsymbol{\theta}) = \begin{pmatrix} f(\boldsymbol{\theta}) \\ 1 - f(\boldsymbol{\theta}) \end{pmatrix}, \qquad U = \begin{pmatrix} u_{1,1} & u_{1,2} \\ u_{2,1} & u_{2,2} \end{pmatrix}, \qquad \boldsymbol{G}(\boldsymbol{\phi}) = \begin{pmatrix} g(\boldsymbol{\phi}) \\ 1 - g(\boldsymbol{\phi}) \end{pmatrix}$$

Let us assume that the hidden bilinear game has a unique mixed Nash equilibrium $(p, q)$:

$$v = u_{1,1} - u_{1,2} - u_{2,1} + u_{2,2} \neq 0, \qquad q = -\frac{u_{1,2} - u_{2,2}}{v} \in (0, 1), \qquad p = -\frac{u_{2,1} - u_{2,2}}{v} \in (0, 1)$$

Then we can write down the equations of gradient-descent-ascent:

$$\begin{cases} \dot{\boldsymbol{\theta}} = -v \nabla f(\boldsymbol{\theta}) (g(\boldsymbol{\phi}) - q) \\ \dot{\boldsymbol{\phi}} = v \nabla g(\boldsymbol{\phi}) (f(\boldsymbol{\theta}) - p) \end{cases} \tag{3}$$

In order to analyze the behavior of this system, we would like to understand the topology of the trajectories of $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, at least individually. The following lemma makes a connection between the trajectories of each variable in the min-max optimization system of Equation 3 and simple gradient ascent dynamics.

Lemma 1.

Let $k : \mathbb{R}^d \to \mathbb{R}$ be a $C^2$ function. Let $h : \mathbb{R} \to \mathbb{R}$ be a $C^1$ function and $\boldsymbol{x}(t) = \rho(t)$ be the unique solution of the dynamical system $\Sigma_1$. Then for the dynamical system $\Sigma_2$ the unique solution is $\boldsymbol{z}(t) = \rho\left(\int_0^t h(s)\,\mathrm{d}s\right)$, where

$$\Sigma_1 :\ \begin{cases} \dot{\boldsymbol{x}} = \nabla k(\boldsymbol{x}) \\ \boldsymbol{x}(0) = \boldsymbol{x}_0 \end{cases} \qquad \Sigma_2 :\ \begin{cases} \dot{\boldsymbol{z}} = h(t) \nabla k(\boldsymbol{z}) \\ \boldsymbol{z}(0) = \boldsymbol{x}_0 \end{cases}$$

By applying the previous result for $\boldsymbol{\theta}$ with $k = f$ and $h(t) = -v(g(\boldsymbol{\phi}(t)) - q)$, we get that even under the dynamics of Equation 3, $\boldsymbol{\theta}$ remains on a trajectory of the simple gradient ascent dynamics with initial condition $\boldsymbol{\theta}(0)$. This necessarily affects the possible values of $f$ and $g$ given the initial conditions. Let us define the sets of values attainable for each initialization.

Definition 5.

For each $\boldsymbol{\theta}(0)$, $f_{\boldsymbol{\theta}(0)}$ is the set of possible values that $f(\boldsymbol{\theta}(t))$ can attain under gradient ascent dynamics. Similarly, we define $g_{\boldsymbol{\phi}(0)}$ as the corresponding set for $g$.

What is special about the trajectories of gradient ascent is that along this curve $f$ is strictly increasing (for a detailed explanation, see the proof of Theorem 1 in the Appendix) and therefore each point $\boldsymbol{\theta}(t)$ in the trajectory has a unique value for $f$. Therefore even in the system of Equation 3, $f(\boldsymbol{\theta}(t))$ uniquely identifies $\boldsymbol{\theta}(t)$. This can be formalized in the next theorem.

Theorem 1.

For each $\boldsymbol{\theta}(0), \boldsymbol{\phi}(0)$, under the dynamics of Equation 3, there are $C^1$ functions $(X_{\boldsymbol{\theta}(0)}, X_{\boldsymbol{\phi}(0)})$ such that $X_{\boldsymbol{\theta}(0)} : f_{\boldsymbol{\theta}(0)} \to \mathbb{R}^n$, $X_{\boldsymbol{\phi}(0)} : g_{\boldsymbol{\phi}(0)} \to \mathbb{R}^m$ and $\boldsymbol{\theta}(t) = X_{\boldsymbol{\theta}(0)}(f(t))$, $\boldsymbol{\phi}(t) = X_{\boldsymbol{\phi}(0)}(g(t))$.

Equipped with these results, we are able to reduce this complicated dynamical system of $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ to a planar dynamical system involving $f$ and $g$ alone.

Lemma 2. If $\boldsymbol{\theta}(t)$ and $\boldsymbol{\phi}(t)$ are solutions to Equation 3 with initial conditions $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$, then we have that $f(t) = f(\boldsymbol{\theta}(t))$ and $g(t) = g(\boldsymbol{\phi}(t))$ satisfy the following equations

$$\begin{aligned} \dot{f} &= -v \|\nabla f(X_{\boldsymbol{\theta}(0)}(f))\|^2 (g - q) \\ \dot{g} &= v \|\nabla g(X_{\boldsymbol{\phi}(0)}(g))\|^2 (f - p) \end{aligned} \tag{4}$$

As one can observe from both Equation 3 and Equation 4, fixed points of the gradient-descent-ascent dynamics correspond to either solutions of $f(\boldsymbol{\theta}) = p$ and $g(\boldsymbol{\phi}) = q$, or stationary points of $f$ and $g$, or even some combinations of the aforementioned conditions. Although all of them are fixed points of the dynamical system, only the former equilibria are game theoretically meaningful. We will therefore define a subset of initial conditions for Equation 3 such that convergence to game theoretically meaningful fixed points may actually be feasible:

Definition 6.

We will call the initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ safe for Equation 3 if $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ are not stationary points of $f$ and $g$ respectively and $p \in f_{\boldsymbol{\theta}(0)}$ and $q \in g_{\boldsymbol{\phi}(0)}$.

For safe initial conditions we can show that gradient-descent-ascent dynamics applied in the class of hidden bilinear zero-sum games mimic properties and behaviors of conservative/Hamiltonian physical systems Bailey and Piliouras [2019a], like an ideal pendulum or an ideal spring-mass system. In such systems, there is a notion of energy that remains constant over time and hence the system trajectories lie on level sets of these functions. To further motivate this intuition, it is easy to check that for the simplified case where $\|\nabla f\| = \|\nabla g\| = 1$ the level sets correspond to cycles centered at the Nash equilibrium and the system as a whole captures gradient-descent-ascent for a bilinear 2×2 zero-sum game (e.g. Matching Pennies).

Theorem 2.

Let $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ be safe initial conditions. Then for the system of Equation 3, the following quantity is time-invariant:

$$H(f, g) = \int_p^f \frac{z - p}{\|\nabla f(X_{\boldsymbol{\theta}(0)}(z))\|^2}\,\mathrm{d}z + \int_q^g \frac{z - q}{\|\nabla g(X_{\boldsymbol{\phi}(0)}(z))\|^2}\,\mathrm{d}z$$

The existence of this invariant immediately guarantees that the Nash equilibrium $(p, q)$ cannot be reached if the dynamical system is not initialized there. Taking advantage of the planarity of the induced system, a necessary condition of the Poincaré-Bendixson Theorem, we can prove that:

Theorem 3.

Let $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ be safe initial conditions. Then for the system of Equation 3, the orbit $(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))$ is periodic.

On a positive note, we can prove that the time averages of $f$ and $g$ as well as the time averages of expected utilities of both players converge to their Nash equilibrium values.

Theorem 4.

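Let $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ be safe initial conditions and $(\boldsymbol{P}, \boldsymbol{Q}) = \left( \begin{pmatrix} p \\ 1 - p \end{pmatrix}, \begin{pmatrix} q \\ 1 - q \end{pmatrix} \right)$. Then for the system of Equation 3

$$\lim_{T \to \infty} \frac{\int_0^T f(\boldsymbol{\theta}(t))\,\mathrm{d}t}{T} = p, \qquad \lim_{T \to \infty} \frac{\int_0^T r(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))\,\mathrm{d}t}{T} = \boldsymbol{P}^\top U \boldsymbol{Q}, \qquad \lim_{T \to \infty} \frac{\int_0^T g(\boldsymbol{\phi}(t))\,\mathrm{d}t}{T} = q$$

In this section we will extend our results by allowing both the generator and the discriminator to play hidden bilinear games with more than two strategies. We will specifically study the case of hidden bilinear games where each coordinate of the vector valued functions $\boldsymbol{F}$ and $\boldsymbol{G}$ is controlled by disjoint subsets of the variables $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, i.e.

$$\boldsymbol{\theta} = \begin{pmatrix} \boldsymbol{\theta}_1 \\ \boldsymbol{\theta}_2 \\ \vdots \\ \boldsymbol{\theta}_N \end{pmatrix}, \quad \boldsymbol{F}(\boldsymbol{\theta}) = \begin{pmatrix} f_1(\boldsymbol{\theta}_1) \\ f_2(\boldsymbol{\theta}_2) \\ \vdots \\ f_N(\boldsymbol{\theta}_N) \end{pmatrix}, \quad \boldsymbol{\phi} = \begin{pmatrix} \boldsymbol{\phi}_1 \\ \boldsymbol{\phi}_2 \\ \vdots \\ \boldsymbol{\phi}_M \end{pmatrix}, \quad \boldsymbol{G}(\boldsymbol{\phi}) = \begin{pmatrix} g_1(\boldsymbol{\phi}_1) \\ g_2(\boldsymbol{\phi}_2) \\ \vdots \\ g_M(\boldsymbol{\phi}_M) \end{pmatrix} \tag{5}$$

where each function $f_i$ and $g_j$ takes an appropriately sized vector and returns a non-negative number. To account for possible constraints (e.g. that probabilities of each distribution must sum to one), we will incorporate this restriction using Lagrange multipliers. The resulting problem becomes

$$\min_{\boldsymbol{\theta} \in \mathbb{R}^n,\, \mu \in \mathbb{R}}\ \max_{\boldsymbol{\phi} \in \mathbb{R}^m,\, \lambda \in \mathbb{R}}\ \boldsymbol{F}(\boldsymbol{\theta})^\top U \boldsymbol{G}(\boldsymbol{\phi}) + \lambda \left( \sum_{i=1}^N f_i(\boldsymbol{\theta}_i) - 1 \right) + \mu \left( \sum_{j=1}^M g_j(\boldsymbol{\phi}_j) - 1 \right) \tag{6}$$

Writing down the equations of gradient-ascent-descent we get

$$\begin{aligned} \dot{\boldsymbol{\theta}}_i &= -\nabla f_i(\boldsymbol{\theta}_i) \left( \sum_{j=1}^M u_{i,j}\, g_j(\boldsymbol{\phi}_j) + \lambda \right), & \dot{\boldsymbol{\phi}}_j &= \nabla g_j(\boldsymbol{\phi}_j) \left( \sum_{i=1}^N u_{i,j}\, f_i(\boldsymbol{\theta}_i) + \mu \right), \\ \dot{\mu} &= -\left( \sum_{j=1}^M g_j(\boldsymbol{\phi}_j) - 1 \right), & \dot{\lambda} &= \left( \sum_{i=1}^N f_i(\boldsymbol{\theta}_i) - 1 \right) \end{aligned} \tag{7}$$

Once again we can show that along the trajectories of the system of Equation 7, $\boldsymbol{\theta}_i$ can be uniquely identified by $f_i(\boldsymbol{\theta}_i)$ given $\boldsymbol{\theta}_i(0)$, and the same holds for the discriminator. This allows us to construct functions $X_{\boldsymbol{\theta}_i(0)}$ and $X_{\boldsymbol{\phi}_j(0)}$ just like in Theorem 1. We can now write down a dynamical system involving only $f_i$ and $g_j$.

Lemma 3. If $\boldsymbol{\theta}(t)$ and $\boldsymbol{\phi}(t)$ are solutions to Equation 7 with initial conditions $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0), \lambda(0), \mu(0))$, then we have that $f_i(t) = f_i(\boldsymbol{\theta}_i(t))$ and $g_j(t) = g_j(\boldsymbol{\phi}_j(t))$ satisfy the following equations

$$\begin{aligned} \dot{f}_i &= -\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\|^2 \left( \sum_{j=1}^M u_{i,j}\, g_j + \lambda \right) \\ \dot{g}_j &= \|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\|^2 \left( \sum_{i=1}^N u_{i,j}\, f_i + \mu \right) \end{aligned} \tag{8}$$

Similarly to the previous section, we can define a notion of safety for Equation 7. Let us assume that the hidden game has a fully mixed Nash equilibrium $(\boldsymbol{p}, \boldsymbol{q})$. Then we can define

Definition 7.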

We will call the initialization $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0), \lambda(0), \mu(0))$ safe for Equation 7 if $\boldsymbol{\theta}_i(0)$ and $\boldsymbol{\phi}_j(0)$ are not stationary points of $f_i$ and $g_j$ respectively and $p_i \in f_{i_{\boldsymbol{\theta}_i(0)}}$ and $q_j \in g_{j_{\boldsymbol{\phi}_j(0)}}$.

Theorem 5. Assume that $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0), \lambda(0), \mu(0))$ is a safe initialization. Then there exist $\lambda^*$ and $\mu^*$ such that the following quantity is time invariant:

$$H(\boldsymbol{F}, \boldsymbol{G}, \lambda, \mu) = \sum_{i=1}^N \int_{p_i}^{f_i} \frac{z - p_i}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\|^2}\,\mathrm{d}z + \sum_{j=1}^M \int_{q_j}^{g_j} \frac{z - q_j}{\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(z))\|^2}\,\mathrm{d}z + \int_{\lambda^*}^{\lambda} (z - \lambda^*)\,\mathrm{d}z + \int_{\mu^*}^{\mu} (z - \mu^*)\,\mathrm{d}z$$

Given that even our reduced dynamical system has more than two state variables, we cannot apply the Poincaré-Bendixson Theorem. Instead we can prove that there exists a one to one differentiable transformation of our dynamical system so that the resulting system becomes divergence free. Applying Liouville's formula, the flow of the transformed system is volume preserving. Combined with the invariant of Theorem 5, we can prove that the variables of the transformed system remain bounded. This gives us the following guarantees:

Theorem 6.

Assume that $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0), \lambda(0), \mu(0))$ is a safe initialization. Then the trajectory under the dynamics of Equation 7 is diffeomorphic to one trajectory of a Poincaré recurrent flow.

This result implies that if the corresponding trajectory of the Poincaré recurrent flow is itself recurrent, which almost all of them are, then the trajectory of the dynamics of Equation 7 is also recurrent. This is however not enough to reason about how often any of the trajectories of the dynamics of Equation 7 is recurrent. In order to prove that the flow of Equation 7 is Poincaré recurrent we will make some additional assumptions:

Theorem 7.

Let $f_i$ and $g_j$ be sigmoid functions. Then the flow of Equation 7 is Poincaré recurrent. The same holds for all functions $f_i$ and $g_j$ that are one to one functions and for which all initializations are safe.

It is worth noting that for the unconstrained version of the previous min-max problem we arrive at the same conclusions/theorems by repeating the above analysis without using the Lagrange multipliers.
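For a numerical feel of the sigmoid case, its smallest instance is the 2x2 system of Equation 3. The sketch below is a toy setup of our own, not the paper's experiment: it assumes Matching Pennies payoffs $U = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ (so $v = 4$, $p = q = 1/2$), scalar parameters, and $f = g = \sigma$. Under these assumptions, our own evaluation of the invariant of Theorem 2 gives the closed form $H = \cosh\theta + \cosh\phi - 2$. A classical fourth-order Runge-Kutta integrator stands in for the continuous-time flow.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Assumed instance: Matching Pennies, U = [[1, -1], [-1, 1]] => v = 4, p = q = 1/2.
V, P, Q = 4.0, 0.5, 0.5

def field(theta, phi):
    """Right-hand side of Equation 3 with f = g = sigmoid and scalar parameters."""
    f, g = sigmoid(theta), sigmoid(phi)
    return (-V * f * (1.0 - f) * (g - Q), V * g * (1.0 - g) * (f - P))

def rk4_step(theta, phi, dt):
    """One classical Runge-Kutta step approximating the continuous-time flow."""
    k1 = field(theta, phi)
    k2 = field(theta + 0.5 * dt * k1[0], phi + 0.5 * dt * k1[1])
    k3 = field(theta + 0.5 * dt * k2[0], phi + 0.5 * dt * k2[1])
    k4 = field(theta + dt * k3[0], phi + dt * k3[1])
    theta += dt * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
    phi += dt * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
    return theta, phi

def invariant(theta, phi):
    """Closed form of H from Theorem 2 for this instance (our own calculation)."""
    return math.cosh(theta) + math.cosh(phi) - 2.0

def simulate(theta, phi, dt=0.01, steps=20000):
    """Return the largest drift of H and the time average of f along the orbit."""
    h0 = invariant(theta, phi)
    max_drift, f_integral = 0.0, 0.0
    for _ in range(steps):
        f_integral += sigmoid(theta) * dt
        theta, phi = rk4_step(theta, phi, dt)
        max_drift = max(max_drift, abs(invariant(theta, phi) - h0))
    return max_drift, f_integral / (steps * dt)

max_drift, f_average = simulate(1.0, 0.5)
```

In this setup `max_drift` should stay at the level of integration error, in line with Theorem 2, while `f_average` should approach $p = 1/2$ as the horizon grows, in line with Theorem 4, even though the orbit itself never settles.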

In the previous sections we have analyzed the behavior of safe initializations and we have proved that they lead to either periodic or recurrent trajectories. For initializations that are not safe for some equilibrium of the hidden game, game theoretically interesting fixed points are not even realizable solutions. In fact we can prove something stronger:

Theorem 8.

One can construct functions $f$ and $g$ for the system of Equation 3 so that for a positive measure set of initial conditions the trajectories converge to fixed points that do not correspond to equilibria of the hidden game.

The main idea behind our theorem is that we can construct functions $f$ and $g$ that have local optima that break the safety assumption. For a careful choice of the value of the local optima we can make these fixed points stable, and then the Stable Manifold Theorem guarantees that a non-zero measure set of points in the vicinity of the fixed point converges to it. Of course the idea of these constructions can be extended to our analysis of hidden games with more strategies.

In this section we will discuss the implications of our analysis of continuous time gradient-ascent-descent dynamics on the properties of their discrete time counterparts. In general, the behavior of discrete time dynamical systems can be significantly different Li and Yorke [1975], Bailey and Piliouras [2018], Palaiopanos et al. [2017], so it is critical to perform this non-trivial analysis. We are able to show that the picture of non-equilibration persists for an interesting class of hidden bilinear games.
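The spurious fixed points of Theorem 8 can be illustrated with a toy construction run in discrete time. This is a hypothetical example of our own in the spirit of the argument above, not the construction used in the paper's proof. It assumes Matching Pennies payoffs ($v = 4$, $p = q = 1/2$) and scalar maps $f(\theta) = 0.6 + 0.3\,\theta^2/(1+\theta^2)$, which has a local minimum at $0$ with value $0.6$, and $g(\phi) = 0.4 + 0.5/(1+\phi^2)$, which has a local maximum at $0$ with value $0.9$. Since $f$ never attains $p$, no initialization is safe.

```python
# Assumed instance: Matching Pennies, so v = 4 and p = q = 1/2.
V, P, Q = 4.0, 0.5, 0.5

def f(theta):
    """Hypothetical generator map with range [0.6, 0.9); it never attains p."""
    return 0.6 + 0.3 * theta ** 2 / (1.0 + theta ** 2)

def df(theta):
    return 0.6 * theta / (1.0 + theta ** 2) ** 2

def g(phi):
    """Hypothetical discriminator map with a local maximum g(0) = 0.9."""
    return 0.4 + 0.5 / (1.0 + phi ** 2)

def dg(phi):
    return -phi / (1.0 + phi ** 2) ** 2

def run_dgda(theta, phi, alpha=0.1, steps=2000):
    """Discrete gradient-descent-ascent for Equation 3 with the maps above."""
    for _ in range(steps):
        theta, phi = (theta - alpha * V * df(theta) * (g(phi) - Q),
                      phi + alpha * V * dg(phi) * (f(theta) - P))
    return theta, phi

theta_end, phi_end = run_dgda(0.3, -0.2)
hidden_strategy = (f(theta_end), g(phi_end))  # settles near (0.6, 0.9), not (p, q)
```

With these choices each update shrinks $\theta$ and $\phi$ towards $0$ whenever $|\phi| < 2$, so a whole open region of initializations ends at the hidden strategies $(0.6, 0.9)$, which is a stationary point of $f$ and $g$ rather than the equilibrium $(1/2, 1/2)$.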

Theorem 9.

Let $f_i$ and $g_j$ be sigmoid functions. Then for the discretized version of the system of Equation 7 and for safe initializations, the function $H$ of Theorem 5 is non-decreasing.
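The same monotonicity can be observed in the simplest unconstrained analogue, namely the 2x2 system of Equation 3 rather than Equation 7. The sketch below again assumes Matching Pennies payoffs ($v = 4$, $p = q = 1/2$), scalar parameters and $f = g = \sigma$, for which our own evaluation of the invariant of Theorem 2 gives $H = \cosh\theta + \cosh\phi - 2$. The function names and the learning rate are illustrative choices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dgda_step(theta, phi, alpha, v=4.0, p=0.5, q=0.5):
    """One discrete gradient-descent-ascent step for Equation 3 with f = g = sigmoid."""
    f, g = sigmoid(theta), sigmoid(phi)
    new_theta = theta - alpha * v * f * (1.0 - f) * (g - q)
    new_phi = phi + alpha * v * g * (1.0 - g) * (f - p)
    return new_theta, new_phi

def energy(theta, phi):
    """H of Theorem 2 for this instance, in closed form (our own calculation)."""
    return math.cosh(theta) + math.cosh(phi) - 2.0

def energy_trace(theta, phi, alpha, steps):
    """Record H along the discrete-time trajectory."""
    trace = [energy(theta, phi)]
    for _ in range(steps):
        theta, phi = dgda_step(theta, phi, alpha)
        trace.append(energy(theta, phi))
    return trace

trace = energy_trace(1.0, 0.5, alpha=0.1, steps=500)
```

In this instance $H$ is a convex function of $(\theta, \phi)$ and each update is orthogonal to its gradient, so every step can only keep $H$ fixed or raise it. The recorded `trace` therefore never decreases, up to floating-point rounding.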

An immediate consequence of the above theorem is that the discretized system cannot converge to the equilibrium $(\boldsymbol{p}, \boldsymbol{q})$ if it is not initialized there. For the case of non-safe initializations, the conclusions of Theorem 8 persist in this case as well.

Theorem 10.

One can choose a learning rate $\alpha$ and functions $f$ and $g$ for the discretized version of the system of Equation 3 so that for a positive measure set of initial conditions the trajectories converge to fixed points that do not correspond to equilibria of the hidden game.

In this work, inspired broadly by the structure of the complex competition between generators and discriminators in GANs, we defined a broad class of non-convex non-concave min-max optimization games, which we call hidden bilinear zero-sum games. In this setting, we showed that gradient-descent-ascent behavior is considerably more complex than a straightforward convergence to the min-max solution that one might at first suspect. We showed that the trajectories, even for the simplest but evocative 2x2 game, exhibit cycles. In higher dimensional games, the induced dynamical system could exhibit even more complex behavior like Poincaré recurrence. On the other hand, we explored safety conditions whose violation may result in convergence to spurious game-theoretically meaningless equilibria. Finally, we showed that even for a simple but widespread family of functions like sigmoids, discretizing gradient-descent-ascent can further intensify the disequilibrium phenomena, resulting in divergence away from equilibrium.

As a consequence of this work numerous open problems emerge. Firstly, extending such recurrence results to more general families of functions, as well as examining possible generalizations to multi-player network zero-sum games, are fascinating questions. Recently, there has been some progress in resolving cyclic behavior in simpler settings by employing different training algorithms/dynamics (e.g., Daskalakis et al. [2018], Mertikopoulos et al. [2019b], Gidel et al. [2019c]). It would be interesting to examine if these algorithms could enhance equilibration in our setting as well. Additionally, the proposed safety conditions show that a major source of spurious equilibria in GANs could be the bad local optima of the individual neural networks of the discriminator and the generator. Lessons learned from overparametrized neural network architectures that converge to global optima Du et al. [2018] could lead to improved efficiency in training GANs. Finally, analyzing different simplifications/models of GANs where provable convergence is possible could lead to interesting comparisons as well as to the emergence of theoretically tractable hybrid models that capture both the hardness of GAN training (e.g. non-convergence, cycling, spurious equilibria, mode collapse, etc.) as well as their power.

Acknowledgements

Georgios Piliouras acknowledges MOE AcRF Tier 2 Grant 2016-T2-1-170, grant PIE-SGP-AI-2018-01 and NRF 2018 Fellowship NRF-NRFF2018-07. Emmanouil-Vasileios Vlatakis-Gkaragkounis was supported by NSF CCF-1563155, NSF CCF-1814873, NSF CCF-1703925, NSF CCF-1763970. Finally, this work was supported by the Onassis Foundation - Scholarship ID: F ZN 010-1/2017-2018.

References

Jacob Abernethy, Kevin A. Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. CoRR, abs/1906.02027, 2019.

Leonard Adolphs, Hadi Daneshmand, Aurélien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 486–495, 2019. URL http://proceedings.mlr.press/v89/adolphs19a.html.

Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017. URL http://arxiv.org/abs/1701.07875.

Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 224–232, 2017. URL http://proceedings.mlr.press/v70/arora17a.html.

James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, pages 321–338, 2018. doi: 10.1145/3219166.3219235. URL https://doi.org/10.1145/3219166.3219235.

James P. Bailey and Georgios Piliouras. Multi-agent learning in network zero-sum games is a hamiltonian system. In , 2019a.

James P Bailey and Georgios Piliouras. Fast and furious learning in zero-sum games: Vanishing regret with non-vanishing step sizes. In NeurIPS, 2019b.

James P. Bailey and Georgios Piliouras. Multi-agent learning in network zero-sum games is a hamiltonian system. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS '19, Montreal, QC, Canada, May 13-17, 2019, pages 233–241, 2019c. URL http://dl.acm.org/citation.cfm?id=3331698.

James P. Bailey, Gauthier Gidel, and Georgios Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. CoRR, abs/1907.04392, 2019.

David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pages 363–372, 2018.

Ivar Bendixson. Sur les courbes définies par des équations différentielles. Acta Math., 24:1–88, 1901. doi: 10.1007/BF02403068. URL https://doi.org/10.1007/BF02403068.

David Berthelot, Tom Schumm, and Luke Metz. BEGAN: boundary equilibrium generative adversarial networks. CoRR, abs/1703.10717, 2017. URL http://arxiv.org/abs/1703.10717.

Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In COLT, 2019.

Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 9256–9266, 2018. URL http://papers.nips.cc/paper/8136-the-limit-points-of-optimistic-gradient-descent-in-min-max-optimization.

Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In , pages 27:1–27:18, 2019. doi: 10.4230/LIPIcs.ITCS.2019.27. URL https://doi.org/10.4230/LIPIcs.ITCS.2019.27.

Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism. In , 2018. URL https://openreview.net/forum?id=SJJySbbAZ.

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. CoRR, abs/1811.03804, 2018. URL http://arxiv.org/abs/1811.03804.

Hao Ge, Yin Xia, Xu Chen, Randall Berry, and Ying Wu. Fictitious GAN: training gans with historical models. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 122–137, 2018. doi: 10.1007/978-3-030-01246-5_8. URL https://doi.org/10.1007/978-3-030-01246-5_8.

Gauthier Gidel, Hugo Berard, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial nets. CoRR, abs/1802.10551, 2018. URL http://arxiv.org/abs/1802.10551.

Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In ICLR, 2019a. URL https://openreview.net/forum?id=r1laEnA5Ym.

Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Gabriel Huang, Rémi Lepriol, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In AISTATS, 2019b.

Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 1802–1811, 2019c. URL http://proceedings.mlr.press/v89/gidel19a.html.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.

Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5767–5777, 2017. URL http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In , pages 5967–5976, 2017. doi: 10.1109/CVPR.2017.632. URL https://doi.org/10.1109/CVPR.2017.632.

Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. CoRR, abs/1902.00618, 2019. URL http://arxiv.org/abs/1902.00618.

Robert D. Kleinberg, Katrina Ligett, Georgios Piliouras, and Éva Tardos. Beyond the nash equilibrium barrier. In Innovations in Computer Science - ICS 2010, Tsinghua University, Beijing, China, January 7-9, 2011. Proceedings, pages 125–140, 2011. URL http://conference.iiis.tsinghua.edu.cn/ICS2011/content/papers/15.html.

Naveen Kodali, Jacob D. Abernethy, James Hays, and Zsolt Kira. On convergence and stability of gans. CoRR, abs/1705.07215, 2017. URL http://arxiv.org/abs/1705.07215.

Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. Photo-realistic single image super-resolution using a generative adversarial network. In , pages 105–114, 2017. doi: 10.1109/CVPR.2017.19. URL https://doi.org/10.1109/CVPR.2017.19.

Tien-Yien Li and James A. Yorke. Period three implies chaos. The American Mathematical Monthly, 82(10):985–992, 1975.

Qihang Lin, Mingrui Liu, Hassan Rafique, and Tianbao Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. CoRR, abs/1810.10207, 2018. URL http://arxiv.org/abs/1810.10207.

Tengyu Ma. Generalization and equilibrium in generative adversarial nets (gans) (invited talk). In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, page 2, 2018. doi: 10.1145/3188745.3232194. URL https://doi.org/10.1145/3188745.3232194.

Tung Mai, Ioannis Panageas, Will Ratcliff, Vijay V. Vazirani, and Peter Yunker. Rock-paper-scissors, differential games and biological diversity. CoRR, abs/1710.11249, 2017. URL http://arxiv.org/abs/1710.11249.

Eric Mazumdar and Lillian J. Ratliff. On the convergence of gradient-based learning in continuous games. CoRR, abs/1804.05464, 2018. URL http://arxiv.org/abs/1804.05464.

Panayotis Mertikopoulos, Christos H. Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 2703–2717, 2018. doi: 10.1137/1.9781611975031.172. URL https://doi.org/10.1137/1.9781611975031.172.

Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In , 2019a. URL https://openreview.net/forum?id=Bkg8jjC9KQ.

Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra(-gradient) mile. In ICLR, 2019b. URL https://openreview.net/forum?id=Bkg8jjC9KQ.

Lars M. Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actually converge? In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 3478–3487, 2018. URL http://proceedings.mlr.press/v80/mescheder18a.html.

Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. In , 2017. URL https://openreview.net/forum?id=BydrOIcle.

Sai Ganesh Nagarajan, Sameh Mohamed, and Georgios Piliouras. Three body problems in evolutionary game dynamics: Convergence, periodicity and limit cycles. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2018, Stockholm, Sweden, July 10-15, 2018, pages 685–693, 2018. URL http://dl.acm.org/citation.cfm?id=3237485.

Frans A. Oliehoek, Rahul Savani, Jose Gallego-Posada, Elise van der Pol, and Roderich Groß. Beyond local nash equilibria for adversarial networks. CoRR, abs/1806.07268, 2018. URL http://arxiv.org/abs/1806.07268.

Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5872–5882, 2017. URL http://papers.nips.cc/paper/7169-multiplicative-weights-update-with-constant-step-size-in-congestion-games-convergence-limit-cycles-and-chaos.

Lawrence Perko. Differential Equations and Dynamical Systems. Springer, 3rd edition, 1991.

David Pfau and Oriol Vinyals. Connecting generative adversarial networks and actor-critic methods. CoRR, abs/1610.01945, 2016. URL http://arxiv.org/abs/1610.01945.

Georgios Piliouras and Leonard J. Schulman. Learning dynamics and the co-evolution of competing sexual species. In , pages 59:1–59:3, 2018. doi: 10.4230/LIPIcs.ITCS.2018.59. URL https://doi.org/10.4230/LIPIcs.ITCS.2018.59.

Georgios Piliouras and Jeff S. Shamma. Optimization despite chaos: Convex relaxations to complex limit sets via poincaré recurrence. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2014, Portland, Oregon, USA, January 5-7, 2014, pages 861–873, 2014. doi: 10.1137/1.9781611973402.64. URL https://doi.org/10.1137/1.9781611973402.64.

H. Poincaré. Sur le problème des trois corps et les équations de la dynamique. Acta Math, 13:1–270, 1890.

H Poincaré. Sur le problème des trois corps et les équations de la dynamique. Acta Math., 13:1, 01 1890.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In , 2016. URL http://arxiv.org/abs/1511.06434.

Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2226–2234, 2016. URL http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.

William H. Sandholm. Population Games and Evolutionary Dynamics. MIT Press, 2010.

Maziar Sanjabi, Meisam Razaviyayn, and Jason D. Lee. Solving non-convex non-concave min-max games under polyak-łojasiewicz condition. CoRR, abs/1812.02878, 2018. URL http://arxiv.org/abs/1812.02878.

Michael Shub. Global Stability of Dynamical Systems. Springer-Verlag, 1987.

Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. Adagan: Boosting generative models. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5424–5433, 2017. URL http://papers.nips.cc/paper/7126-adagan-boosting-generative-models.

Yasin Yazıcı, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar. The unusual effectiveness of averaging in gan training. In ICLR, 2019.

Han Zhang, Tao Xu, and Hongsheng Li. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5908–5916, 2017. doi: 10.1109/ICCV.2017.629. URL https://doi.org/10.1109/ICCV.2017.629.

Poincaré Recurrence, Cycles and Spurious Equilibria in Gradient-Descent-Ascent for Non-Convex Non-Concave Zero-Sum Games

Supplementary Material

A Background in dynamical systems

A.1 Poincaré-Bendixson Theorem

The Poincaré-Bendixson theorem is a powerful theorem that implies that two-dimensional systems cannot exhibit chaos. Effectively, the limit behavior is either going to be an equilibrium, a periodic orbit, or a closed loop punctuated by one (or more) fixed points. Formally, we have:

Theorem 11 (Poincaré-Bendixson Theorem, Bendixson [1901]). Given a differentiable real dynamical system defined on an open subset of the plane, every non-empty compact $\omega$-limit set of an orbit which contains only finitely many fixed points is either a fixed point, a periodic orbit, or a connected set composed of a finite number of fixed points together with homoclinic and heteroclinic orbits connecting these.

A.2 Liouville's formula and Poincaré recurrence

In order to study the flows of dynamical systems in higher dimensions, one needs to understand more about the behaviour of the flow $\Phi$ both in time and space. An important property is the evolution of the volume of $\Phi$ over time:

Theorem 12 (Liouville's formula). Let $\Phi$ be the flow of a dynamical system with vector field $f$. Given any measurable set $A$, let $A(t) = \Phi(A, t)$ and let its volume be $\mathrm{vol}[A(t)] = \int_{A(t)} \mathrm{d}\mathbf{x}$. Then we have that
\[ \frac{\mathrm{d}\,\mathrm{vol}[A(t)]}{\mathrm{d}t} = \int_{A(t)} \mathrm{div}[f(\mathbf{x})]\,\mathrm{d}\mathbf{x}. \]

An interesting class of dynamical systems are those whose vector fields have zero divergence everywhere. Liouville's formula trivially implies that the volume of the flow is preserved in such systems. This is an important tool for proving that a flow of a dynamical system is Poincaré recurrent.
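Liouville's formula can be sanity-checked numerically. The sketch below is only an illustration under assumed choices that are not taken from the paper: two linear planar fields (a rotation with zero divergence and a contraction with divergence $-2$), a unit square as the set $A$, and a hand-written RK4 integrator. Because both fields are linear, the image of the square is again a polygon, so the shoelace formula gives its area. For the rotation the area should stay at 1; for the contraction Liouville's formula predicts $\mathrm{vol}[A(t)] = e^{-2t}$.

```python
import math

def rk4_flow(field, point, t_end, steps=1000):
    """Push one point through the flow of an autonomous planar field using RK4."""
    x, y = point
    dt = t_end / steps
    for _ in range(steps):
        k1x, k1y = field(x, y)
        k2x, k2y = field(x + 0.5 * dt * k1x, y + 0.5 * dt * k1y)
        k3x, k3y = field(x + 0.5 * dt * k2x, y + 0.5 * dt * k2y)
        k4x, k4y = field(x + dt * k3x, y + dt * k3y)
        x += dt * (k1x + 2 * k2x + 2 * k3x + k4x) / 6
        y += dt * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
    return x, y

def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2

def rotation(x, y):
    """Illustrative field (-y, x): divergence 0."""
    return (-y, x)

def contraction(x, y):
    """Illustrative field (-x, -y): divergence -2."""
    return (-x, -y)

def area_after_flow(field, t_end):
    """Area of the image of the square [1,2] x [0,1] under the time-t_end flow."""
    square = [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)]
    return polygon_area([rk4_flow(field, p, t_end) for p in square])

if __name__ == "__main__":
    print("rotation:   ", area_after_flow(rotation, 1.0))
    print("contraction:", area_after_flow(contraction, 1.0), "vs", math.exp(-2.0))
```

The polygon shortcut only works because the example fields are linear; for a nonlinear field one would have to track a finely sampled boundary instead.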

Theorem 13 (Poincaré Recurrence Theorem (version 1) [Poincaré, 1890]). Let $(X, \Sigma, \mu)$ be a finite measure space and let $f : X \to X$ be a measure-preserving transformation. Then, for any $E \in \Sigma$, the set of those points $x$ of $E$ such that $f^n(x) \notin E$ for all $n > 0$ has zero measure. That is, almost every point of $E$ returns to $E$. In fact, almost every point returns infinitely often. Namely,
\[ \mu\left(\{x \in E : \exists N \text{ such that } f^n(x) \notin E \text{ for all } n > N\}\right) = 0. \]

Poincaré [1890] proved that in certain systems almost all trajectories return arbitrarily close to their initial position infinitely often. Indeed, let $f : X \to X$ be a measure-preserving transformation, let $\{U_n : n \in \mathbb{N}\}$ be a basis of open sets for the bounded subset $X \subset \mathbb{R}^d$, and for each $n$ define $U_n' = \{x \in U_n : \forall k \geq 1, f^k(x) \notin U_n\}$. Notice that such a basis exists since $\mathbb{R}^d$ is a second-countable Hausdorff space. From the initial theorem we know that $\mu(U_n') = 0$. Let $\mathcal{N} = \cup_{n \in \mathbb{N}} U_n'$. Then $\mu(\mathcal{N}) = 0$. We assert that if $x \in X \setminus \mathcal{N}$ then $x$ is recurrent. In fact, given a neighborhood $U$ of $x$, there is a basic neighborhood $U_n$ such that $\{x\} \subset U_n \subset U$, and since $x \notin \mathcal{N}$ we have that $x \in U_n \setminus U_n'$, which by the definition of $U_n'$ means that there exists $k \geq 1$ such that $f^k(x) \in U_n \subset U$. Thus $x$ is recurrent. Therefore, for the rest of the paper, we will use the following version, which is common in dynamical systems nomenclature.

Theorem 14 (Poincaré Recurrence Theorem (dynamical system version), Poincaré [1890]). If a flow $\Phi : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$ preserves volume and has only orbits on a bounded subset $D$ of $\mathbb{R}^n$, then almost every point $\mathbf{x}$ in $D$ is recurrent, i.e., for every open neighborhood $U$ of $\mathbf{x}$ there exists an increasing sequence of times $t_n$ such that $\lim_{n \to \infty} t_n = \infty$ and $\Phi(\mathbf{x}, t_n) \in U$ for all $n$.

A.3 Additional Definitions

Definition 8 (Diffeomorphism, Perko [1991]). Let $U, V$ be manifolds. A map $f : U \to V$ is called a diffeomorphism if $f$ carries $U$ onto $V$ and both $f$ and $f^{-1}$ are smooth.

Definition 9 (Topological conjugacy, Perko [1991]). Two flows $\Phi_t : A \to A$ and $\Psi_t : B \to B$ are conjugate if there exists a homeomorphism $g : A \to B$ such that
\[ \forall \mathbf{x} \in A, t \in \mathbb{R} : \quad g(\Phi_t(\mathbf{x})) = \Psi_t(g(\mathbf{x})). \]
Furthermore, two flows $\Phi_t : A \to A$ and $\Psi_t : B \to B$ are diffeomorphic if there exists a diffeomorphism $g : A \to B$ such that
\[ \forall \mathbf{x} \in A, t \in \mathbb{R} : \quad g(\Phi_t(\mathbf{x})) = \Psi_t(g(\mathbf{x})). \]
If two flows are diffeomorphic, then their vector fields are related by the derivative of the conjugacy. That is, we get precisely the same result that we would have obtained if we had simply transformed the coordinates in their differential equations.
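As a small worked illustration of the last remark (the specific system and change of coordinates below are chosen here for illustration and do not appear in the paper), suppose $\Phi$ is the flow of $\dot{\mathbf{x}} = F(\mathbf{x})$ and $\mathbf{y} = g(\mathbf{x})$ for a diffeomorphism $g$. Differentiating $\mathbf{y}(t) = g(\mathbf{x}(t))$ gives the vector field of the conjugate flow:
\[ \dot{\mathbf{y}} = Dg(\mathbf{x})\,F(\mathbf{x}) = Dg\big(g^{-1}(\mathbf{y})\big)\,F\big(g^{-1}(\mathbf{y})\big). \]
For instance, take the scalar system $\dot{x} = -x$ on $\mathbb{R}$ and $g(x) = e^{x}$, a diffeomorphism from $\mathbb{R}$ onto $(0, \infty)$. Then
\[ \dot{y} = e^{x}\,\dot{x} = -x\,e^{x} = -y \ln y, \]
so the fixed point $x = 0$ of the first system corresponds to the fixed point $y = 1$ of the second, and every trajectory converging to $0$ is mapped to a trajectory converging to $1$.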

Definition 10 ($(\alpha, \omega)$-limit set, Perko [1991]). Let $\Phi(\mathbf{x}, \cdot)$ be the flow of an autonomous dynamical system $\dot{\mathbf{x}} = f(\mathbf{x})$. Then
\[ \omega(\mathbf{x}_0) = \{\mathbf{x} : \text{for all } T \text{ and all } \epsilon > 0 \text{ there exists } t > T \text{ such that } |\Phi(\mathbf{x}_0, t) - \mathbf{x}| < \epsilon\}, \]
\[ \alpha(\mathbf{x}_0) = \{\mathbf{x} : \text{for all } T \text{ and all } \epsilon > 0 \text{ there exists } t < T \text{ such that } |\Phi(\mathbf{x}_0, t) - \mathbf{x}| < \epsilon\}. \]
Equivalently,
\[ \omega(\mathbf{x}_0) = \{\mathbf{x} : \text{there exists an unbounded, increasing sequence } \{t_k\} \text{ such that } \lim_{k \to \infty} \Phi(\mathbf{x}_0, t_k) = \mathbf{x}\}, \]
\[ \alpha(\mathbf{x}_0) = \{\mathbf{x} : \text{there exists an unbounded, decreasing sequence } \{t_k\} \text{ such that } \lim_{k \to \infty} \Phi(\mathbf{x}_0, t_k) = \mathbf{x}\}. \]

Lemma 4 (Recurrence and Conjugacy, Mertikopoulos et al. [2018]). Let $\Phi_t : A \to A$ and $\Psi_t : B \to B$ be conjugate flows and let $\gamma$ be the diffeomorphism which connects them. Then a point $\mathbf{x} \in A$ is recurrent for $\Phi$ if and only if $\gamma(\mathbf{x}) \in \gamma(A)$ is recurrent for $\Psi$.

Proof. We will first prove the if direction. Take any open neighborhood $U \subseteq A$ around $\mathbf{x}$. Using the diffeomorphism, there is a unique $\gamma(U) \subseteq \gamma(A)$, and since $U$ is open, $\gamma(U)$ is also open. Obviously, $\gamma(\mathbf{x}) \in \gamma(U)$. Thus, if $\gamma(\mathbf{x})$ is recurrent, there is an unbounded increasing sequence of moments $t_n$ such that
\[ \Psi(\gamma(\mathbf{x}), t_n) \in \gamma(U). \]
This is equivalent to the existence of an unbounded increasing sequence of moments $t_n$ such that
\[ \gamma^{-1}(\Psi(\gamma(\mathbf{x}), t_n)) \in \gamma^{-1}(\gamma(U)). \]
Using the basic property of topological conjugacy, we have that
\[ \Phi(\mathbf{x}, t_n) = \gamma^{-1}(\Psi(\gamma(\mathbf{x}), t_n)). \]
Thus, for each $t_n$ we have that $\Phi(\mathbf{x}, t_n) \in U$. It follows that $\mathbf{x}$ is also recurrent for $\Phi$. The result for the opposite direction follows immediately by using the inverse map.

A.4 Stable Manifold Theorems

Theorem 15 (Stable Manifold Theorem for Continuous Time Dynamical Systems, p. 120 of Perko [1991]). Let $E$ be an open subset of $\mathbb{R}^n$ containing the origin, let $f \in C^1(E)$, and let $\phi_t$ be the flow of the nonlinear system $\dot{\mathbf{x}} = f(\mathbf{x})$. Suppose that $f(\mathbf{0}) = \mathbf{0}$ and that $Df(\mathbf{0})$ has $k$ eigenvalues with negative real part and $n - k$ eigenvalues with positive real part. Then there exists a $k$-dimensional differentiable manifold $S$ tangent to the stable subspace $E^s$ of the linear system $\dot{\mathbf{x}} = Df(\mathbf{0})\mathbf{x}$ at $\mathbf{0}$ such that for all $t \geq 0$, $\phi_t(S) \subseteq S$ and for all $\mathbf{x} \in S$:
\[ \lim_{t \to \infty} \phi_t(\mathbf{x}) = \mathbf{0}, \]
and there exists an $(n - k)$-dimensional differentiable manifold $U$ tangent to the unstable subspace $E^u$ of the linear system $\dot{\mathbf{x}} = Df(\mathbf{0})\mathbf{x}$ at $\mathbf{0}$ such that for all $t \leq 0$, $\phi_t(U) \subseteq U$ and for all $\mathbf{x} \in U$:
\[ \lim_{t \to -\infty} \phi_t(\mathbf{x}) = \mathbf{0}. \]

Theorem 16 (Center and Stable Manifolds, p. 65 of Shub [1987]). Let $\mathbf{p}$ be a fixed point for the $C^r$ local diffeomorphism $h : U \to \mathbb{R}^n$, where $U \subset \mathbb{R}^n$ is an open neighborhood of $\mathbf{p}$ in $\mathbb{R}^n$ and $r \geq 1$. Let $E^s \oplus E^c \oplus E^u$ be the invariant splitting of $\mathbb{R}^n$ into generalized eigenspaces of $Dh(\mathbf{p})$ (the Jacobian of $h$ evaluated at $\mathbf{p}$) corresponding to eigenvalues of absolute value less than one, equal to one, and greater than one. To the $Dh(\mathbf{p})$-invariant subspace $E^s \oplus E^c$ there is an associated local $h$-invariant $C^r$ embedded disc $W^{sc}_{loc}$ of dimension $\dim(E^s \oplus E^c)$, and a ball $B$ around $\mathbf{p}$, such that
\[ h(W^{sc}_{loc}) \cap B \subset W^{sc}_{loc}. \]
If $h^n(\mathbf{x}) \in B$ for all $n \geq 0$, then $\mathbf{x} \in W^{sc}_{loc}$.

A.5 Regular Value Theorem

Definition 11. Let $f : U \to V$ be a smooth map between manifolds of the same dimension. We say that $x \in U$ is a regular point if the derivative of $f$ at $x$ is nonsingular. If the derivative is singular, then $x$ is called a critical point. A point $y \in V$ is called a regular value if $f^{-1}(y)$ contains only regular points, and a critical value if it is not a regular value.

Theorem 17 (Regular Value Theorem). If $y \in Y$ is a regular value of $f : X \to Y$, then $f^{-1}(y)$ is a manifold of dimension $n - m$, where $\dim(X) = n$ and $\dim(Y) = m$.

Omitted Proofs of Section 4

Warm up: Cycles in hidden bilinear games with two strategies

In this first section, we show a key technical lemma which will be used in many different parts of our proof. More specifically, it shows how one can derive the solution of a non-autonomous system via a conjugate autonomous dynamical system. The main intuition is that if the non-autonomous term is multiplicative and common across all terms of a vector field, then it dictates the magnitude of the vector field (the speed of the motion) but does not affect directionality other than moving backwards or forwards along the same trajectory.
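The time-rescaling intuition described above can be checked numerically on a toy case. The following is a minimal sketch under assumptions chosen only for illustration (none of them come from the paper): a scalar potential $k(x) = -x^2/2$, so the autonomous system is $\dot{x} = -x$ with solution $\rho(t) = x_0 e^{-t}$, and a multiplicative term $h(t) = \cos t$, whose integral is $\sin t$. The rescaled solution should then be $z(t) = \rho(\sin t) = x_0 e^{-\sin t}$, and the code compares this prediction with a direct RK4 integration of $\dot{z} = h(t)\,k'(z)$.

```python
import math

def rho(x0, s):
    """Solution at time s of the autonomous system x' = k'(x) = -x, x(0) = x0."""
    return x0 * math.exp(-s)

def predicted(x0, t):
    """Rescaled solution rho(int_0^t h(s) ds) for h(s) = cos(s), i.e. rho(sin t)."""
    return rho(x0, math.sin(t))

def integrate_rescaled(x0, t_end, steps=3000):
    """Directly integrate z' = cos(t) * (-z), z(0) = x0, with classical RK4."""
    def rhs(t, z):
        return math.cos(t) * (-z)

    z, t = x0, 0.0
    dt = t_end / steps
    for _ in range(steps):
        k1 = rhs(t, z)
        k2 = rhs(t + 0.5 * dt, z + 0.5 * dt * k1)
        k3 = rhs(t + 0.5 * dt, z + 0.5 * dt * k2)
        k4 = rhs(t + dt, z + dt * k3)
        z += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
    return z

if __name__ == "__main__":
    for t_end in (1.0, 3.0, 7.0):
        print(t_end, integrate_rescaled(2.0, t_end), predicted(2.0, t_end))
```

Since $\int_0^t \cos s\,\mathrm{d}s$ changes sign, the rescaled trajectory moves both forwards and backwards along the orbit of the autonomous system, which is exactly the behavior the paragraph describes.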

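Lemma 5 (Restated Lemma 1). Let $k : \mathbb{R}^d \to \mathbb{R}$ be a $C^2$ function. Let $h : \mathbb{R} \to \mathbb{R}$ be a $C^1$ function and let $\mathbf{x}(t) = \rho(t)$ be the unique solution of the dynamical system $\Sigma_1$. Then for the dynamical system $\Sigma_2$ the unique solution is $\mathbf{z}(t) = \rho\left(\int_0^t h(s)\,\mathrm{d}s\right)$, where
\[ \Sigma_1 : \left\{ \begin{array}{l} \dot{\mathbf{x}} = \nabla k(\mathbf{x}) \\ \mathbf{x}(0) = \mathbf{x}_0 \end{array} \right\}, \qquad \Sigma_2 : \left\{ \begin{array}{l} \dot{\mathbf{z}} = h(t)\nabla k(\mathbf{z}) \\ \mathbf{z}(0) = \mathbf{x}_0 \end{array} \right\}. \]

Proof. Firstly, notice that $\rho(0) = \mathbf{x}_0$ and $\dot{\rho} = \nabla k(\rho)$, since $\rho$ is the unique solution of $\Sigma_1$. It is easy to check that
\[ \mathbf{z}(0) = \rho\left(\int_0^0 h(s)\,\mathrm{d}s\right) = \rho(0) = \mathbf{x}_0, \]
\[ \dot{\mathbf{z}} = \dot{\rho}\left(\int_0^t h(s)\,\mathrm{d}s\right) \times \frac{\mathrm{d}}{\mathrm{d}t}\int_0^t h(s)\,\mathrm{d}s = \dot{\rho}\left(\int_0^t h(s)\,\mathrm{d}s\right) h(t) = h(t)\nabla k(\mathbf{z}). \]

The next proposition states that the initial condition $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ together with $\{f(t), g(t)\}_{t=0}^{\infty}$ is sufficient to derive the complete system state $(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))$ of Continuous GDA. The importance of the theorem below becomes clear when one takes into consideration periodicity and recurrence phenomena. Since, given an initial condition $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$, each $(f(t), g(t))$ is mapped to a unique $(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))$, any periodic or recurrent behavior of $(f(t), g(t))$ extends to the system trajectories.

Theorem 18 (Restated Theorem 1). For each $\boldsymbol{\theta}(0), \boldsymbol{\phi}(0)$, under the dynamics of Equation 3, there are $C^1$ functions $(X_{\boldsymbol{\theta}(0)}, X_{\boldsymbol{\phi}(0)})$ such that $X_{\boldsymbol{\theta}(0)} : f_{\boldsymbol{\theta}(0)} \to \mathbb{R}^n$, $X_{\boldsymbol{\phi}(0)} : g_{\boldsymbol{\phi}(0)} \to \mathbb{R}^n$ and
\[ \boldsymbol{\theta}(t) = X_{\boldsymbol{\theta}(0)}(f(t)), \qquad \boldsymbol{\phi}(t) = X_{\boldsymbol{\phi}(0)}(g(t)). \]

Proof.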

Let us first study a simpler dynamical system $(\Sigma^*)$ with unique solution $\gamma_{\boldsymbol{\theta}(0)}(t)$:
\[ (\Sigma^*) \equiv \left\{ \begin{array}{l} \dot{\boldsymbol{\theta}} = \nabla f(\boldsymbol{\theta}) \\ \boldsymbol{\theta}(0) = \boldsymbol{\theta}_0 \end{array} \right\}. \]
It is easy to observe that
\[ \dot{f} = \nabla f(\boldsymbol{\theta})^\top \dot{\boldsymbol{\theta}} = \|\nabla f(\boldsymbol{\theta})\|^2. \]
If $\boldsymbol{\theta}(0)$ is a stationary point of $f$, then the trajectory is a single point and the theorem holds trivially. If $\boldsymbol{\theta}(0)$ is not a stationary point of $f$, then $f$ continuously increases along the trajectory of the dynamical system. Therefore $A_{\boldsymbol{\theta}(0)}(t) = f(\gamma_{\boldsymbol{\theta}(0)}(t))$ is an increasing function and hence invertible. Let us call $A^{-1}_{\boldsymbol{\theta}(0)}(f)$ its inverse.

Let us now recall the dynamical system of interest (Equation 3):
\[ \text{CGDA} : \left\{ \begin{array}{l} \dot{\boldsymbol{\theta}} = -v\nabla f(\boldsymbol{\theta})(g(\boldsymbol{\phi}) - q) \\ \dot{\boldsymbol{\phi}} = v\nabla g(\boldsymbol{\phi})(f(\boldsymbol{\theta}) - p) \end{array} \right\}. \]
Consider the $\boldsymbol{\theta}$-part of the system, i.e.,
\[ (\Sigma) \equiv \left\{ \begin{array}{l} \dot{\boldsymbol{\theta}} = -v\nabla f(\boldsymbol{\theta})(g(\boldsymbol{\phi}) - q) \\ \boldsymbol{\theta}(0) = \boldsymbol{\theta}_0 \end{array} \right\}. \]
Applying Lemma 5 to the first equation with $h(t) = -v(g(\boldsymbol{\phi}(t)) - q)$, we have that the solution of the dynamical system $(\Sigma)$ is
\[ \psi_{\boldsymbol{\theta}(0)}(t) = \gamma_{\boldsymbol{\theta}(0)}\Big(\underbrace{\int_0^t h(s)\,\mathrm{d}s}_{H(t)}\Big) = \gamma_{\boldsymbol{\theta}(0)}(H(t)). \]
Thus it holds that $f(\psi_{\boldsymbol{\theta}(0)}(t)) = f(\gamma_{\boldsymbol{\theta}(0)}(H(t))) = A_{\boldsymbol{\theta}(0)}(H(t))$, or equivalently
\[ H(t) = A^{-1}_{\boldsymbol{\theta}(0)}(f(\psi_{\boldsymbol{\theta}(0)}(t))). \]
Plugging this back into the definition of the solution, we have that
\[ \psi_{\boldsymbol{\theta}(0)}(t) = \gamma_{\boldsymbol{\theta}(0)}\big(A^{-1}_{\boldsymbol{\theta}(0)}(f(\psi_{\boldsymbol{\theta}(0)}(t)))\big). \]
Therefore the theorem holds for $X_{\boldsymbol{\theta}(0)}(f) = \gamma_{\boldsymbol{\theta}(0)} \circ A^{-1}_{\boldsymbol{\theta}(0)}(f)$, which is $C^1$ as a composition of $C^1$ functions.

We can perform the equivalent analysis for $\boldsymbol{\phi}(0)$ and $g$ and prove that for each $\boldsymbol{\phi}(0)$, under the dynamics of Continuous GDA (Equation 3), there is a $C^1$ function $X_{\boldsymbol{\phi}(0)} : g_{\boldsymbol{\phi}(0)} \to \mathbb{R}^n$ such that $\boldsymbol{\phi}(t) = X_{\boldsymbol{\phi}(0)}(g(t))$.

Notice that the domains of the aforementioned functions are in fact either singleton points or open intervals. This will be important when we study the safety of initial conditions.

Lemma 6 (Properties of $f_{\boldsymbol{\theta}(0)}$). If $\boldsymbol{\theta}(0)$ is a stationary point of $f$, then $f_{\boldsymbol{\theta}(0)}$ consists only of a single number. Otherwise, $f_{\boldsymbol{\theta}(0)}$ is an open interval.

Proof. If $\boldsymbol{\theta}(0)$ is a fixed point, then for the gradient ascent dynamics $\boldsymbol{\theta}(t) = \boldsymbol{\theta}(0)$ and therefore the lemma holds trivially. On the other hand, in Theorem 1 we argued that $f(\boldsymbol{\theta}(t))$ is a continuous and strictly increasing function of $t$, so it maps $(-\infty, \infty)$ to an open interval and thus the lemma holds. Obviously, we can prove an equivalent lemma for $g$.

Having established the informational equivalence between the parameter space and the functional space, we are ready to derive the induced dynamics of the distributions with which the two players participate in the game.

Lemma 7 (Restated Lemma 2). If $\boldsymbol{\theta}(t)$ and $\boldsymbol{\phi}(t)$ are solutions to Equation 3 with initial conditions $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$, then $f(t) = f(\boldsymbol{\theta}(t))$ and $g(t) = g(\boldsymbol{\phi}(t))$ satisfy the following equations:
\[ \dot{f} = -v\|\nabla f(X_{\boldsymbol{\theta}(0)}(f))\|^2 (g - q), \qquad \dot{g} = v\|\nabla g(X_{\boldsymbol{\phi}(0)}(g))\|^2 (f - p). \]

Proof.

Applying the chain rule and the definition of Continuous GDA (Equation 3), we can see that
\[ \left\{ \begin{array}{l} \dot{f} = \nabla f(\boldsymbol{\theta}(t))^\top \dot{\boldsymbol{\theta}}(t) \\ \dot{g} = \nabla g(\boldsymbol{\phi}(t))^\top \dot{\boldsymbol{\phi}}(t) \end{array} \right\} \Leftrightarrow \left\{ \begin{array}{l} \dot{f} = -v\|\nabla f(\boldsymbol{\theta}(t))\|^2 (g(\boldsymbol{\phi}(t)) - q) \\ \dot{g} = v\|\nabla g(\boldsymbol{\phi}(t))\|^2 (f(\boldsymbol{\theta}(t)) - p) \end{array} \right\}. \]
Finally, using Theorem 1 we get
\[ \left\{ \begin{array}{l} \dot{f} = -v\|\nabla f(X_{\boldsymbol{\theta}(0)}(f(t)))\|^2 (g(\boldsymbol{\phi}(t)) - q) \\ \dot{g} = v\|\nabla g(X_{\boldsymbol{\phi}(0)}(g(t)))\|^2 (f(\boldsymbol{\theta}(t)) - p) \end{array} \right\}. \]

Finally, we establish that the above 2-dimensional system that couples $f$ and $g$ together is akin to a conservative system that preserves an energy-like function. Under the safety conditions, the proposed invariant is both well-defined and equipped with interesting properties. It is easy to check that it can play the role of a pseudometric around the Nash equilibrium of the hidden bilinear game.
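The conservation claim can be explored numerically in a toy instance. The sketch below rests on assumptions made only for illustration: scalar parameters with sigmoid hidden functions, so that $X(f)$ is the logit and $\|\nabla f\|^2 = (f(1-f))^2$, together with hypothetical values $p = q = 0.5$ and $v = 1$. It integrates the planar $(f, g)$ system with RK4 and evaluates the energy-like quantity $H(f, g) = \int_p^f \frac{z - p}{(z(1-z))^2}\,\mathrm{d}z + \int_q^g \frac{z - q}{(z(1-z))^2}\,\mathrm{d}z$ (the form stated in Theorem 2, specialized to sigmoids) by Simpson's rule, reporting how far it drifts along the trajectory.

```python
def hidden_field(f, g, v=1.0, p=0.5, q=0.5):
    """Planar (f, g) dynamics for scalar sigmoid hidden functions (illustrative)."""
    return (-v * (f * (1 - f)) ** 2 * (g - q),
            v * (g * (1 - g)) ** 2 * (f - p))

def rk4_step(f, g, dt):
    """One classical RK4 step of the planar system."""
    k1 = hidden_field(f, g)
    k2 = hidden_field(f + 0.5 * dt * k1[0], g + 0.5 * dt * k1[1])
    k3 = hidden_field(f + 0.5 * dt * k2[0], g + 0.5 * dt * k2[1])
    k4 = hidden_field(f + dt * k3[0], g + dt * k3[1])
    return (f + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6,
            g + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6)

def half_invariant(x, center, n=2000):
    """Simpson's rule for the integral of (z - center) / (z(1-z))^2 from center to x."""
    if x == center:
        return 0.0
    h = (x - center) / n  # n must be even for Simpson's rule

    def w(z):
        return (z - center) / (z * (1 - z)) ** 2

    total = w(center) + w(x)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * w(center + i * h)
    return total * h / 3

def invariant(f, g, p=0.5, q=0.5):
    """Energy-like quantity H(f, g) for the sigmoid toy instance."""
    return half_invariant(f, p) + half_invariant(g, q)

def max_drift(f0=0.3, g0=0.6, dt=0.01, steps=20000, check_every=500):
    """Return H at the start and the largest deviation of H seen along the run."""
    f, g = f0, g0
    h0 = invariant(f, g)
    drift = 0.0
    for i in range(1, steps + 1):
        f, g = rk4_step(f, g, dt)
        if i % check_every == 0:
            drift = max(drift, abs(invariant(f, g) - h0))
    return h0, drift

if __name__ == "__main__":
    print(max_drift())
```

Any residual drift in this experiment comes from the integrator and the quadrature, not from the continuous-time dynamics; a non-symplectic scheme with a much larger step would show a visibly growing drift.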

Theorem 19 (Restated Theorem 2). Let $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ be safe initial conditions. Then for the system of Equation 3, the following quantity is time-invariant:
\[ H(f, g) = \int_p^f \frac{z - p}{\|\nabla f(X_{\boldsymbol{\theta}(0)}(z))\|^2}\,\mathrm{d}z + \int_q^g \frac{z - q}{\|\nabla g(X_{\boldsymbol{\phi}(0)}(z))\|^2}\,\mathrm{d}z. \]

Proof. Firstly, one should notice that since $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ are safe initial conditions, $H(f, g)$ is well defined when $f, g$ follow the dynamics of Continuous GDA. We will examine the time derivative of the proposed invariant of motion:
\[ \frac{\mathrm{d}}{\mathrm{d}t} H(f(t), g(t)) = \dot{f}(t) \times \frac{f(t) - p}{\|\nabla f(X_{\boldsymbol{\theta}(0)}(f(t)))\|^2} + \dot{g}(t) \times \frac{g(t) - q}{\|\nabla g(X_{\boldsymbol{\phi}(0)}(g(t)))\|^2}. \]
Using Lemma 7, we get
\[ \frac{\mathrm{d}}{\mathrm{d}t} H(f(t), g(t)) = -v\|\nabla f(X_{\boldsymbol{\theta}(0)}(f(t)))\|^2 (g(t) - q) \times \frac{f(t) - p}{\|\nabla f(X_{\boldsymbol{\theta}(0)}(f(t)))\|^2} + v\|\nabla g(X_{\boldsymbol{\phi}(0)}(g(t)))\|^2 (f(t) - p) \times \frac{g(t) - q}{\|\nabla g(X_{\boldsymbol{\phi}(0)}(g(t)))\|^2} \]
\[ = -v(f(t) - p)(g(t) - q) + v(f(t) - p)(g(t) - q) = 0. \]

[…] $H$ are one-dimensional manifolds. To get convergence to a periodic orbit, one would require two orbits (the initial trajectory and the periodic orbit) to merge into the same one-dimensional manifold, but this is not possible (it requires that no transient part exists).

Theorem 20 (Restated Theorem 3). Let $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ be safe initial conditions. Then for the system of Equation 3, the orbit $(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))$ is periodic.

Proof. If $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is a fixed point, then it is trivially a periodic point. Suppose $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ is not a fixed point; then either $f \neq p$ or $g \neq q$ (or both). Given that $H$ is invariant, the trajectory of the planar system stays bounded away from all equilibria. We will examine each case separately.

Equilibria with $f = p$ and $g = q$. The trajectory is bounded away from these since $H(p, q) = 0$ and $H(f(\boldsymbol{\theta}(0)), g(\boldsymbol{\phi}(0))) > 0$.

Equilibria with $f = p$ and $\nabla f = \mathbf{0}$. These equilibria are not achievable since they are not allowed by the safety conditions.
Having $\nabla f = \mathbf{0}$ when $f = p$ means that $p$ is one of the endpoints of $f_{\boldsymbol{\theta}(0)}$. But by Lemma 6, $f_{\boldsymbol{\theta}(0)}$ is an open set and $p \in f_{\boldsymbol{\theta}(0)}$, which leads to a contradiction.

Equilibria with $g = q$ and $\nabla g = \mathbf{0}$. They are also not feasible due to the safety assumption.

Equilibria with $\nabla f = \mathbf{0}$ and $\nabla g = \mathbf{0}$. Observe that such points lie in the corners of $f_{\boldsymbol{\theta}(0)} \times g_{\boldsymbol{\phi}(0)}$. These points correspond to local maxima of the invariant function. We will prove this for one of the corners; the same proof works for all the others. Let $(p^*, q^*)$ be one such corner with both $p^* > p$ and $q^* > q$. Take any other point $(r, z)$ with $p^* \geq r > p$ and $q^* \geq z > q$ but different from $(p^*, q^*)$. Without loss of generality, assume $p^* > r$. In this region $H$ is increasing in both $f$ and $g$. Thus
\[ H(r, z) < H(p^*, z) \leq H(p^*, q^*). \]
So this corner (and each of the other three corners) is a local maximum. A continuous trajectory cannot reach these isolated local maxima while keeping $H$ invariant.

Thus we can create a trapping/invariant region $C$ so that $f$ and $g$ always stay in $C$ and $C$ does not contain any fixed points. By the Poincaré-Bendixson theorem, the $\alpha$- and $\omega$-limit sets of the trajectory are periodic orbits. Thus they are isomorphic to $S^1$.

The gradient of $H$,
\[ \nabla H = \left( \frac{f - p}{\|\nabla f(X_{\boldsymbol{\theta}(0)}(f))\|^2}, \frac{g - q}{\|\nabla g(X_{\boldsymbol{\phi}(0)}(g))\|^2} \right), \]
is equal to $\mathbf{0}$ only at $(p, q)$. Therefore $H(f(\boldsymbol{\theta}(0)), g(\boldsymbol{\phi}(0))) > H(p, q)$ is a regular value of $H$. By the regular value theorem, the following set is a one-dimensional manifold:
\[ \{(f, g) \in f_{\boldsymbol{\theta}(0)} \times g_{\boldsymbol{\phi}(0)} : H(f, g) = H(f(\boldsymbol{\theta}(0)), g(\boldsymbol{\phi}(0)))\}. \]
Notice that by the invariance of $H$ and the definition of the $\alpha$- and $\omega$-limit sets of $(f(\boldsymbol{\theta}(0)), g(\boldsymbol{\phi}(0)))$, both the trajectory starting at $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0))$ and its $\alpha$- and $\omega$-limit sets belong to the above manifold. Thus, their union is a closed, connected 1-manifold and hence it is isomorphic to $S^1$. Assume that the trajectory was merely converging to the $\alpha$- and $\omega$-limit sets. Then our one-dimensional manifold would contain two connected one-dimensional manifolds: the trajectory of the system as well as the $\alpha$- and $\omega$-limit sets.
But one can easily show that this would not be a one-dimensional manifold, leading to a contradiction.

Up to now we have analyzed the trajectories of the planar dynamical system of $f$ and $g$. But since we have proved that there is a one-to-one correspondence between $\boldsymbol{\theta}$ and $f$ and between $\boldsymbol{\phi}$ and $g$, the periodicity claims transfer to $\boldsymbol{\theta}(t)$ and $\boldsymbol{\phi}(t)$.

Figure 2: By the Poincaré-Bendixson theorem we know that both the $\alpha$ and the $\omega$ limit-sets are isomorphic to $S^1$. The trajectory connecting them means that the union of all three parts is not a one-dimensional manifold. But by the regular value theorem on $H$, the union of all three parts is also a one-dimensional manifold.

On a positive note, one can prove that the time averages of $f$ and $g$ do converge, as well as the utilities of the generator and discriminator.

Theorem 21 (Restated Theorem 4). Let $\boldsymbol{\theta}(0)$ and $\boldsymbol{\phi}(0)$ be safe initial conditions and $(\mathbf{P}, \mathbf{Q}) = \left( \begin{pmatrix} p \\ 1 - p \end{pmatrix}, \begin{pmatrix} q \\ 1 - q \end{pmatrix} \right)$. Then for the system of Equation 3

$$\lim_{T \to \infty} \frac{\int_0^T f(\boldsymbol{\theta}(t))\,dt}{T} = p, \qquad \lim_{T \to \infty} \frac{\int_0^T r(\boldsymbol{\theta}(t), \boldsymbol{\phi}(t))\,dt}{T} = \mathbf{P}^\top U \mathbf{Q}, \qquad \lim_{T \to \infty} \frac{\int_0^T g(\boldsymbol{\phi}(t))\,dt}{T} = q.$$

Proof.

In Theorem 3 we discussed that the safety of the initial conditions guarantees that stationary points of $f$ and $g$ are avoided. So, using Lemma 2, we can integrate the following quantities over a time interval $[0, T]$ and divide by $T$:

$$\frac{1}{T} \int_0^T \frac{1}{v \|\nabla f(X_{\boldsymbol{\theta}(0)}(f(t)))\|^2}\, \dot{f}\, dt = -\frac{1}{T} \int_0^T (g(\boldsymbol{\phi}(t)) - q)\, dt$$

$$\frac{1}{T} \int_0^T \frac{1}{v \|\nabla g(X_{\boldsymbol{\phi}(0)}(g(t)))\|^2}\, \dot{g}\, dt = \frac{1}{T} \int_0^T (f(\boldsymbol{\theta}(t)) - p)\, dt$$

Let us define the following functions of $f$ and $g$:

$$F(f(t)) = \frac{1}{v \|\nabla f(X_{\boldsymbol{\theta}(0)}(f(t)))\|^2}, \qquad G(g(t)) = \frac{1}{v \|\nabla g(X_{\boldsymbol{\phi}(0)}(g(t)))\|^2}$$

Thus the above dynamical system is equivalent to:

$$\frac{1}{T} \int_0^T F(f(t))\, \dot{f}\, dt = -\frac{1}{T} \int_0^T (g(\boldsymbol{\phi}(t)) - q)\, dt$$

$$\frac{1}{T} \int_0^T G(g(t))\, \dot{g}\, dt = \frac{1}{T} \int_0^T (f(\boldsymbol{\theta}(t)) - p)\, dt$$

By a simple change of variables we have that:

$$\int_0^T F(f) \frac{df}{dt}\, dt = \int_{f(0)}^{f(T)} F(f)\, df, \qquad \int_0^T G(g) \frac{dg}{dt}\, dt = \int_{g(0)}^{g(T)} G(g)\, dg$$

Moreover, we know that $f(t)$, $g(t)$ for our dynamical system are periodic and bounded away from the singularities of $F(f)$ and $G(g)$, that is, from the points where $\nabla f$ or $\nabla g$ vanishes. So their integrals over a single period of $f$ and $g$ are bounded and we have that

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T F(f)\, \dot{f}\, dt = \lim_{T \to \infty} \frac{1}{T} \int_{f(0)}^{f(T)} F(f)\, df = 0$$

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T G(g)\, \dot{g}\, dt = \lim_{T \to \infty} \frac{1}{T} \int_{g(0)}^{g(T)} G(g)\, dg = 0$$

Therefore,

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T (g(\boldsymbol{\phi}(t)) - q)\, dt = 0, \qquad \lim_{T \to \infty} \frac{1}{T} \int_0^T (f(\boldsymbol{\theta}(t)) - p)\, dt = 0,$$

that is,

$$\lim_{T \to \infty} \frac{\int_0^T g(\boldsymbol{\phi}(t))\, dt}{T} = q, \qquad \lim_{T \to \infty} \frac{\int_0^T f(\boldsymbol{\theta}(t))\, dt}{T} = p.$$

Next, we will proceed with the argument about the time average of the objective function.
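The limits stated in the theorem can be sanity-checked numerically. The following is a minimal sketch, not the authors' experiment: it assumes scalar parameters, sigmoid choices for $f$ and $g$, a hypothetical payoff matrix `U`, and takes $v = u_{1,1} - u_{1,2} - u_{2,1} + u_{2,2}$; the continuous-time dynamics are integrated with a classical Runge-Kutta scheme.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical 2x2 payoff matrix with a fully mixed equilibrium (assumption).
U = [[2.0, -1.0], [-1.0, 1.0]]
v = U[0][0] - U[0][1] - U[1][0] + U[1][1]   # assumed definition of v
p = (U[1][1] - U[1][0]) / v                 # equilibrium weight, min player
q = (U[1][1] - U[0][1]) / v                 # equilibrium weight, max player
value = sum(Pi * U[i][j] * Qj
            for i, Pi in enumerate((p, 1 - p))
            for j, Qj in enumerate((q, 1 - q)))

def field(theta, phi):
    # Continuous GDA with sigmoid f, g: f' = f(1 - f), g' = g(1 - g).
    f, g = sigmoid(theta), sigmoid(phi)
    return (-v * f * (1 - f) * (g - q), v * g * (1 - g) * (f - p))

def utility(theta, phi):
    f, g = sigmoid(theta), sigmoid(phi)
    F, G = (f, 1 - f), (g, 1 - g)
    return sum(F[i] * U[i][j] * G[j] for i in range(2) for j in range(2))

def rk4_step(theta, phi, dt):
    k1 = field(theta, phi)
    k2 = field(theta + 0.5 * dt * k1[0], phi + 0.5 * dt * k1[1])
    k3 = field(theta + 0.5 * dt * k2[0], phi + 0.5 * dt * k2[1])
    k4 = field(theta + dt * k3[0], phi + dt * k3[1])
    theta += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
    phi += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return theta, phi

def time_averages(theta, phi, T=500.0, dt=0.01):
    steps = int(T / dt)
    sum_f = sum_g = sum_r = 0.0
    for _ in range(steps):
        sum_f += sigmoid(theta)
        sum_g += sigmoid(phi)
        sum_r += utility(theta, phi)
        theta, phi = rk4_step(theta, phi, dt)
    return sum_f / steps, sum_g / steps, sum_r / steps

avg_f, avg_g, avg_r = time_averages(0.3, -0.2)
print(avg_f, p, avg_g, q, avg_r, value)
```

The printed averages should sit close to $p$, $q$ and $\mathbf{P}^\top U \mathbf{Q}$ respectively, with a gap that shrinks roughly like $1/T$ as the horizon grows.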

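Fact 1. If $(\mathbf{P}, \mathbf{Q})$ is a fully mixed Nash equilibrium, then it holds that

$$\mathbf{P}^\top U \mathbf{G}(\boldsymbol{\phi}(t)) = \mathbf{F}(\boldsymbol{\theta}(t))^\top U \mathbf{Q} = \mathbf{P}^\top U \mathbf{Q}$$

$$(\mathbf{F}(\boldsymbol{\theta}(t)) - \mathbf{P})^\top U (\mathbf{G}(\boldsymbol{\phi}(t)) - \mathbf{Q}) = \mathbf{F}(\boldsymbol{\theta}(t))^\top U \mathbf{G}(\boldsymbol{\phi}(t)) - \mathbf{P}^\top U \mathbf{Q}$$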

Proof.

It suffices to prove the first part of the claim, since the second part is its immediate consequence. Since we have assumed that $(\mathbf{P}, \mathbf{Q})$ is a fully mixed Nash equilibrium, it holds that

$$\mathbf{P}^\top U \mathbf{Q} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}^\top U \mathbf{Q} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}^\top U \mathbf{Q}.$$

Therefore:

$$\mathbf{F}(\boldsymbol{\theta})^\top U \mathbf{Q} = f(\boldsymbol{\theta}) \begin{pmatrix} 1 \\ 0 \end{pmatrix}^\top U \mathbf{Q} + (1 - f(\boldsymbol{\theta})) \begin{pmatrix} 0 \\ 1 \end{pmatrix}^\top U \mathbf{Q} = \mathbf{P}^\top U \mathbf{Q}.$$

Symmetrically, it holds that

$$\mathbf{P}^\top U \mathbf{Q} = \mathbf{P}^\top U \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \mathbf{P}^\top U \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$

Therefore

$$\mathbf{P}^\top U \mathbf{Q} = \mathbf{P}^\top U \begin{pmatrix} 1 \\ 0 \end{pmatrix} g(\boldsymbol{\phi}(t)) + \mathbf{P}^\top U \begin{pmatrix} 0 \\ 1 \end{pmatrix} (1 - g(\boldsymbol{\phi}(t))) = \mathbf{P}^\top U \mathbf{G}(\boldsymbol{\phi}(t)).$$

Observe the following fact:

$$\frac{1}{T} \int_0^T \mathbf{F}(\boldsymbol{\theta}(t))^\top U \mathbf{G}(\boldsymbol{\phi}(t))\, dt - \mathbf{P}^\top U \mathbf{Q} = \frac{1}{T} \int_0^T \mathbf{F}(\boldsymbol{\theta}(t))^\top U \mathbf{G}(\boldsymbol{\phi}(t))\, dt - \frac{1}{T} \int_0^T \mathbf{P}^\top U \mathbf{Q}\, dt = \frac{1}{T} \int_0^T (\mathbf{F}(\boldsymbol{\theta}(t)) - \mathbf{P})^\top U (\mathbf{G}(\boldsymbol{\phi}(t)) - \mathbf{Q})\, dt$$

Therefore it suffices to show that

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T (\mathbf{F}(\boldsymbol{\theta}(t)) - \mathbf{P})^\top U (\mathbf{G}(\boldsymbol{\phi}(t)) - \mathbf{Q})\, dt = 0.$$

The payoff matrix $U$ is as follows:

$$U = \begin{pmatrix} u_{1,1} & u_{1,2} \\ u_{2,1} & u_{2,2} \end{pmatrix}$$

We have that

$$(\mathbf{F}(\boldsymbol{\theta}(t)) - \mathbf{P})^\top U (\mathbf{G}(\boldsymbol{\phi}(t)) - \mathbf{Q}) = (u_{1,1} - u_{1,2} - u_{2,1} + u_{2,2})\,(f(\boldsymbol{\theta}(t)) - p)\,(g(\boldsymbol{\phi}(t)) - q).$$

Therefore it suffices to show that

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T (f(\boldsymbol{\theta}(t)) - p)(g(\boldsymbol{\phi}(t)) - q)\, dt = 0.$$

By our previous analysis in this theorem, we have already argued that

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T (g(\boldsymbol{\phi}(t)) - q)\, dt = 0,$$

so, expanding $(f - p)(g - q) = f\,(g - q) - p\,(g - q)$, it remains to show that

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T f(\boldsymbol{\theta}(t))(g(\boldsymbol{\phi}(t)) - q)\, dt = 0.$$

Revisiting the equations of Lemma 2:

$$f F(f) \frac{df}{dt} = -f(\boldsymbol{\theta}(t))(g(\boldsymbol{\phi}(t)) - q) \ \Rightarrow\ \frac{1}{T} \int_0^T f F(f) \frac{df}{dt}\, dt = -\frac{1}{T} \int_0^T f(\boldsymbol{\theta}(t))(g(\boldsymbol{\phi}(t)) - q)\, dt$$

Using similar arguments as before we can prove that

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T f F(f) \frac{df}{dt}\, dt = \lim_{T \to \infty} \frac{1}{T} \int_{f(0)}^{f(T)} f F(f)\, df = 0,$$

implying that

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T f(\boldsymbol{\theta}(t))(g(\boldsymbol{\phi}(t)) - q)\, dt = 0,$$

which completes the proof.

C Omitted Proofs of Section 5

Poincaré recurrence in hidden bilinear games with more strategies

Lemma 8 (Restated Lemma 3). If $\boldsymbol{\theta}(t)$ and $\boldsymbol{\phi}(t)$ are solutions to Equation 7 with initial conditions $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0), \lambda(0), \mu(0))$, then $f_i(t) = f_i(\boldsymbol{\theta}_i(t))$ and $g_j(t) = g_j(\boldsymbol{\phi}_j(t))$ satisfy the following equations:

$$\dot{f}_i = -\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\|^2 \left( \sum_{j=1}^{M} u_{i,j}\, g_j + \lambda \right)$$

$$\dot{g}_j = \|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\|^2 \left( \sum_{i=1}^{N} u_{i,j}\, f_i + \mu \right)$$

Proof. Applying the chain rule we can see that

$$\forall i \in [N]: \ \dot{f}_i = \nabla f_i(\boldsymbol{\theta}_i(t))^\top \dot{\boldsymbol{\theta}}_i(t), \qquad \forall j \in [M]: \ \dot{g}_j = \nabla g_j(\boldsymbol{\phi}_j(t))^\top \dot{\boldsymbol{\phi}}_j(t).$$

Then by the dynamics of Continuous GDA (Equation 7)

$$\forall i \in [N]: \ \dot{f}_i = \nabla f_i(\boldsymbol{\theta}_i(t))^\top \left( -\nabla f_i(\boldsymbol{\theta}_i) \left( \sum_{j=1}^{M} u_{i,j}\, g_j(\boldsymbol{\phi}_j) + \lambda \right) \right)$$

$$\forall j \in [M]: \ \dot{g}_j = \nabla g_j(\boldsymbol{\phi}_j(t))^\top \left( \nabla g_j(\boldsymbol{\phi}_j) \left( \sum_{i=1}^{N} u_{i,j}\, f_i(\boldsymbol{\theta}_i) + \mu \right) \right)$$

Clearly

$$\forall i \in [N]: \ \dot{f}_i = -\|\nabla f_i(\boldsymbol{\theta}_i(t))\|^2 \left( \sum_{j=1}^{M} u_{i,j}\, g_j(\boldsymbol{\phi}_j) + \lambda \right)$$

$$\forall j \in [M]: \ \dot{g}_j = \|\nabla g_j(\boldsymbol{\phi}_j(t))\|^2 \left( \sum_{i=1}^{N} u_{i,j}\, f_i(\boldsymbol{\theta}_i) + \mu \right)$$

Finally, using Theorem 1 we know that there exist $N + M$ functions such that

$$\boldsymbol{\theta}_i(t) = X_{\boldsymbol{\theta}_i(0)}(f_i(t)), \qquad \boldsymbol{\phi}_j(t) = X_{\boldsymbol{\phi}_j(0)}(g_j(t)).$$

Combining the last two expressions we get the desired claim.
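As a concrete special case (an illustration under an added assumption, not part of the lemma), suppose every $f_i$ and $g_j$ is the scalar sigmoid $\sigma$. Then $\sigma' = \sigma(1 - \sigma)$ and $X_{\theta_i(0)} = \sigma^{-1}$ for every initialization, so the squared gradient norms become explicit functions of $f_i$ and $g_j$, and the equations of the lemma reduce to:

```latex
\dot{f}_i = -\bigl(f_i (1 - f_i)\bigr)^2 \Bigl( \sum_{j=1}^{M} u_{i,j}\, g_j + \lambda \Bigr),
\qquad
\dot{g}_j = \bigl(g_j (1 - g_j)\bigr)^2 \Bigl( \sum_{i=1}^{N} u_{i,j}\, f_i + \mu \Bigr)
```

In this form the right-hand sides vanish only when the bracketed payoff terms vanish or when some $f_i$ or $g_j$ reaches $0$ or $1$, which a sigmoid never attains at finite parameters.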

Theorem 22 (Restated Theorem 5). Assume that $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0), \lambda(0), \mu(0))$ is a safe initialization. Then there exist $\lambda^*$ and $\mu^*$ such that the following quantity is time invariant:

$$H(\mathbf{F}, \mathbf{G}, \lambda, \mu) = \sum_{i=1}^{N} \int_{p_i}^{f_i} \frac{z - p_i}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\|^2}\, dz + \sum_{j=1}^{M} \int_{q_j}^{g_j} \frac{z - q_j}{\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(z))\|^2}\, dz + \int_{\lambda^*}^{\lambda} (z - \lambda^*)\, dz + \int_{\mu^*}^{\mu} (z - \mu^*)\, dz$$

Proof.

We know that $(\boldsymbol{p}, \boldsymbol{q})$ is an equilibrium of the hidden bilinear game

$$\min_{\boldsymbol{x} \in \Delta_N} \max_{\boldsymbol{y} \in \Delta_M} \boldsymbol{x}^\top U \boldsymbol{y} \tag{9}$$

Let us make the same Lagrangian transformation we did in Section 5:

$$\min_{\boldsymbol{x} \ge 0,\ \mu \in \mathbb{R}}\ \max_{\boldsymbol{y} \ge 0,\ \lambda \in \mathbb{R}}\ \boldsymbol{x}^\top U \boldsymbol{y} + \mu \left( \sum_{i=1}^{M} y_i - 1 \right) + \lambda \left( \sum_{j=1}^{N} x_j - 1 \right) \tag{10}$$

Since $(\boldsymbol{p}, \boldsymbol{q})$ is an equilibrium of the problem of Equation 9, the KKT conditions on the problem of Equation 10 imply that there are (unique) $\lambda^*, \mu^*$ such that

$$\forall j \in [M]: \ \sum_{i \in [N]} u_{i,j}\, p_i + \mu^* = 0, \qquad \forall i \in [N]: \ \sum_{j \in [M]} u_{i,j}\, q_j + \lambda^* = 0.$$

We will analyze the time derivative of $H(\mathbf{F}(t), \mathbf{G}(t), \lambda(t), \mu(t))$ over the trajectory of CGDA (Equation 7). Differentiating the definition of $H$ gives

$$\dot{H}(\mathbf{F}(t), \mathbf{G}(t), \lambda(t), \mu(t)) = \sum_{i=1}^{N} \dot{f}_i\, \frac{f_i - p_i}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\|^2} + \sum_{j=1}^{M} \dot{g}_j\, \frac{g_j - q_j}{\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\|^2} + \dot{\lambda}(\lambda - \lambda^*) + (\mu - \mu^*)\dot{\mu}$$

$$\dot{H}(\mathbf{F}(t), \mathbf{G}(t), \lambda(t), \mu(t)) = \sum_{i=1}^{N} \left( \sum_{j=1}^{M} u_{i,j}\, g_j + \lambda \right)(p_i - f_i) + \sum_{j=1}^{M} \left( \sum_{i=1}^{N} u_{i,j}\, f_i + \mu \right)(g_j - q_j) + (\lambda - \lambda^*)\dot{\lambda} + (\mu - \mu^*)\dot{\mu}$$

Applying the KKT conditions we have

$$\sum_{j=1}^{M} u_{i,j}\, g_j + \lambda = \sum_{j=1}^{M} u_{i,j}(g_j - q_j) + \lambda - \lambda^*, \qquad \sum_{i=1}^{N} u_{i,j}\, f_i + \mu = \sum_{i=1}^{N} u_{i,j}(f_i - p_i) + \mu - \mu^*.$$

We can now write down:

$$\sum_{i=1}^{N} \left( \sum_{j=1}^{M} u_{i,j}\, g_j + \lambda \right)(p_i - f_i) = \sum_{i=1}^{N} \sum_{j=1}^{M} u_{i,j}(g_j - q_j)(p_i - f_i) + (\lambda - \lambda^*) \sum_{i=1}^{N} (p_i - f_i)$$

$$\sum_{j=1}^{M} \left( \sum_{i=1}^{N} u_{i,j}\, f_i + \mu \right)(g_j - q_j) = \sum_{j=1}^{M} \sum_{i=1}^{N} u_{i,j}(f_i - p_i)(g_j - q_j) + (\mu - \mu^*) \sum_{j=1}^{M} (g_j - q_j)$$

The $u_{i,j}$ terms cancel out. Thus we can write

$$\dot{H}(\mathbf{F}(t), \mathbf{G}(t), \lambda(t), \mu(t)) = (\lambda - \lambda^*) \sum_{i=1}^{N} (p_i - f_i) + (\mu - \mu^*) \sum_{j=1}^{M} (g_j - q_j) + (\lambda - \lambda^*)\dot{\lambda} + (\mu - \mu^*)\dot{\mu}$$

Additionally, $\boldsymbol{p}$ and $\boldsymbol{q}$ are probability vectors, so

$$\dot{\lambda} = \sum_{i=1}^{N} f_i - 1 = \sum_{i=1}^{N} (f_i - p_i), \qquad \dot{\mu} = -\left( \sum_{j=1}^{M} g_j - 1 \right) = -\sum_{j=1}^{M} (g_j - q_j).$$

Thus

$$\dot{H}(\mathbf{F}(t), \mathbf{G}(t), \lambda(t), \mu(t)) = 0.$$

Since the proof of the following theorem is fairly complicated, we first outline the basic steps below:

1. We first show that there is a topologically conjugate dynamical system whose dynamics are incompressible, i.e. the volume of a set of initial conditions remains invariant as the dynamics evolve over time. By Theorem 14, if every solution remains in a bounded space for all $t \ge 0$, incompressibility implies recurrence.
2. To establish boundedness in these dynamics, we exploit the aforementioned invariant function.

Theorem 23 (Restated Theorem 6). Assume that $(\boldsymbol{\theta}(0), \boldsymbol{\phi}(0), \lambda(0), \mu(0))$ is a safe initialization. Then the trajectory under the dynamics of Equation 7 is diffeomorphic to one trajectory of a Poincaré recurrent flow.

Proof. Let us start with the dynamics of Equation 7. We call its flow $\Phi_{\text{original}}$:

$$\Sigma_{\text{original}}: \begin{cases} \dot{\boldsymbol{\theta}}_i = -\nabla f_i(\boldsymbol{\theta}_i) \left( \sum_{j=1}^{M} u_{i,j}\, g_j(\boldsymbol{\phi}_j) + \lambda \right) \\ \dot{\boldsymbol{\phi}}_j = \nabla g_j(\boldsymbol{\phi}_j) \left( \sum_{i=1}^{N} u_{i,j}\, f_i(\boldsymbol{\theta}_i) + \mu \right) \\ \dot{\mu} = -\left( \sum_{j=1}^{M} g_j(\boldsymbol{\phi}_j) - 1 \right) \\ \dot{\lambda} = \sum_{i=1}^{N} f_i(\boldsymbol{\theta}_i) - 1 \end{cases}$$

In the previous theorems we have proved that $(X_{\boldsymbol{\theta}_i(0)}, X_{\boldsymbol{\phi}_j(0)})$ are diffeomorphisms.
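We also know that by definition

$$(X_{\boldsymbol{\theta}_i(0)})^{-1}(\boldsymbol{\theta}_i) = f_i(\boldsymbol{\theta}_i) \ \ \forall i \in [N], \qquad (X_{\boldsymbol{\phi}_j(0)})^{-1}(\boldsymbol{\phi}_j) = g_j(\boldsymbol{\phi}_j) \ \ \forall j \in [M].$$

We can thus define the following diffeomorphism:

$$\nu: \begin{cases} f_i = (X_{\boldsymbol{\theta}_i(0)})^{-1}(\boldsymbol{\theta}_i) & \forall i \in [N] \\ g_j = (X_{\boldsymbol{\phi}_j(0)})^{-1}(\boldsymbol{\phi}_j) & \forall j \in [M] \\ \mu = \mu \\ \lambda = \lambda \end{cases}$$

The resulting system, whose flow we call $\Phi_{\text{distributional}}$, is

$$\Sigma_{\text{distributional}}: \begin{cases} \dot{f}_i = -\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\|^2 \left( \sum_{j=1}^{M} u_{i,j}\, g_j + \lambda \right) \\ \dot{g}_j = \|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\|^2 \left( \sum_{i=1}^{N} u_{i,j}\, f_i + \mu \right) \\ \dot{\mu} = -\left( \sum_{j=1}^{M} g_j - 1 \right) \\ \dot{\lambda} = \sum_{i=1}^{N} f_i - 1 \end{cases}$$

Although $\Phi_{\text{distributional}}$ could be well defined for a wider set of points, we will focus our attention on the following set of points:

$$V = {f_1}_{\boldsymbol{\theta}_1(0)} \times \cdots \times {f_N}_{\boldsymbol{\theta}_N(0)} \times {g_1}_{\boldsymbol{\phi}_1(0)} \times \cdots \times {g_M}_{\boldsymbol{\phi}_M(0)} \times (-\infty, \infty) \times (-\infty, \infty)$$

Observe that this choice is not problematic since: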

Claim 1. $V$ is an invariant set of $\Phi_{\text{distributional}}$.

Proof. Let $\boldsymbol{D}(t) = (f_1(t), \cdots, f_N(t), g_1(t), \cdots, g_M(t))$ be the profile of all mixed strategies of all agents. Assume that there is a $t_{\text{critical}} \in \mathbb{R}$ such that, starting from $\boldsymbol{D}_0$, for some $i \in [N]$ it holds that $f_i$ crosses the boundary of $V$ at time $t_{\text{critical}}$. Let us call the crossing point $\boldsymbol{D}_{\text{critical}}$. Since $f_i(t_{\text{critical}})$ is an end-point of ${f_i}_{\boldsymbol{\theta}_i(0)}$ we have that $\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(f_i(t_{\text{critical}}))) = 0$ and thus, by the equations of $\dot{f}_i$, we have $\dot{f}_i = 0$. On the one hand, observe that for $\Phi_{\text{distributional}}(\boldsymbol{D}_{\text{critical}}, \cdot)$ we have that $f_i$ should be constant. On the other hand, for $\Phi_{\text{distributional}}(\boldsymbol{D}_0, \cdot)$ this is not the case, since $\boldsymbol{D}_0 \in V$ and $\boldsymbol{D}_{\text{critical}}$ has an $f_i$ that is on the edge of ${f_i}_{\boldsymbol{\theta}_i(0)}$. Thus $\Phi_{\text{distributional}}(\boldsymbol{D}_0, \cdot)$ and $\Phi_{\text{distributional}}(\boldsymbol{D}_{\text{critical}}, \cdot)$ are different. This is a contradiction since $\boldsymbol{D}_{\text{critical}}$ and $\boldsymbol{D}_0$ belong to the same trajectory of the flow. The same argument applies for $g_j$.

Clearly $\Phi_{\text{original}}(\{\boldsymbol{\theta}_i(0), \boldsymbol{\phi}_j(0), \mu(0), \lambda(0)\}, \cdot)$ and $\Phi(\{f_i(\boldsymbol{\theta}_i(0)), g_j(\boldsymbol{\phi}_j(0)), \mu(0), \lambda(0)\}, \cdot)$ are diffeomorphic. It thus remains to prove that $\Phi$ is Poincaré recurrent.

Divergence Free Topological Conjugate Dynamical System

We will transform the above dynamical system to a divergence-free system on a different space via the following map:

$$\gamma: \begin{cases} a_i = A_i(f_i) = \displaystyle\int_{p_i}^{f_i} \frac{1}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\|^2}\, dz & \forall i \in [N] \\ b_j = B_j(g_j) = \displaystyle\int_{q_j}^{g_j} \frac{1}{\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(z))\|^2}\, dz & \forall j \in [M] \\ \mu = \mu \\ \lambda = \lambda \end{cases}$$

Claim 2. $\gamma$ is a diffeomorphism.

Proof. Indeed, since the integrands

$$F_i(f_i) = \frac{1}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\|^2}, \qquad G_j(g_j) = \frac{1}{\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\|^2}$$

are positive, $A_i(f_i)$ and $B_j(g_j)$ are monotone functions, and consequently bijections, and they are continuously differentiable. Again because of the monotonicity, using the Inverse Function Theorem we can easily show that $A_i(f_i)$, $B_j(g_j)$ also have continuously differentiable inverses.

As a first step let us apply $\gamma$ to the equations of our dynamical system:

$$\dot{a}_i = \frac{dA_i(f_i)}{df_i}\, \dot{f}_i = \frac{\dot{f}_i}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(f_i))\|^2} = -\left( \sum_{j=1}^{M} u_{i,j}\, g_j + \lambda \right)$$

$$\dot{b}_j = \frac{dB_j(g_j)}{dg_j}\, \dot{g}_j = \frac{\dot{g}_j}{\|\nabla g_j(X_{\boldsymbol{\phi}_j(0)}(g_j))\|^2} = \sum_{i=1}^{N} u_{i,j}\, f_i + \mu$$

Observe that on the right-hand side of our equations, $f_i$ can be written as $A_i^{-1}(a_i)$ and $g_j$ can be written as $B_j^{-1}(b_j)$, so this is an autonomous dynamical system, whose flow we will call $\Psi$ and whose vector field we will call $\boldsymbol{Y}$:

$$\Sigma_{\text{Preserving}}: \begin{cases} \dot{a}_i = -\left( \sum_{j=1}^{M} u_{i,j}\, B_j^{-1}(b_j) + \lambda \right) \\ \dot{b}_j = \sum_{i=1}^{N} u_{i,j}\, A_i^{-1}(a_i) + \mu \\ \dot{\mu} = -\left( \sum_{j=1}^{M} B_j^{-1}(b_j) - 1 \right) \\ \dot{\lambda} = \sum_{i=1}^{N} A_i^{-1}(a_i) - 1 \end{cases} \ \Leftrightarrow\ \Sigma_{\text{Preserving}}: \begin{pmatrix} \dot{a}_i \\ \dot{b}_j \\ \dot{\mu} \\ \dot{\lambda} \end{pmatrix} = \boldsymbol{Y}(a_i, b_j, \mu, \lambda)$$

Taking the Jacobian of $\boldsymbol{Y}$, all elements on the diagonal are zero: the coordinate of $\dot{a}_i$ does not depend on $a_i$, and the same goes for all state variables. Given that the divergence of the vector field is equal to the trace of the Jacobian, we are certain that this new dynamical system is divergence free:

$$\operatorname{div}[\boldsymbol{Y}] = 0$$
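The zero-diagonal structure of the Jacobian can be checked numerically. This is a rough sketch under stated assumptions: sigmoid $f_i$, $g_j$ (so that $1/\|\nabla f_i\|^2 = 1/(z(1-z))^2$ has a closed-form antiderivative), a hypothetical matching-pennies payoff matrix with $N = M = 2$, and a bisection-based inverse `A_inv` that is our own helper, not something from the paper.

```python
import math

U = [[1.0, -1.0], [-1.0, 1.0]]   # hypothetical payoff matrix (matching pennies)
P = [0.5, 0.5]                   # equilibrium of the min player
Q = [0.5, 0.5]                   # equilibrium of the max player

def primitive(z):
    # Antiderivative of 1 / (z (1 - z))^2, the sigmoid case of 1 / ||grad f||^2.
    return -1.0 / z + 2.0 * math.log(z / (1.0 - z)) + 1.0 / (1.0 - z)

def A(f, c):
    # A(f) = integral from c to f of 1 / (z (1 - z))^2 dz.
    return primitive(f) - primitive(c)

def A_inv(a, c):
    # A is strictly increasing on (0, 1), so bisection recovers f from a.
    lo, hi = 1e-12, 1.0 - 1e-12
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if A(mid, c) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def Y(state):
    # State layout: (a_1, a_2, b_1, b_2, mu, lam).
    a, b, mu, lam = state[0:2], state[2:4], state[4], state[5]
    f = [A_inv(a[i], P[i]) for i in range(2)]
    g = [A_inv(b[j], Q[j]) for j in range(2)]
    da = [-(sum(U[i][j] * g[j] for j in range(2)) + lam) for i in range(2)]
    db = [sum(U[i][j] * f[i] for i in range(2)) + mu for j in range(2)]
    return da + db + [-(sum(g) - 1.0), sum(f) - 1.0]

def jacobian(state, h=1e-5):
    # Central finite differences; column k holds dY/d(state[k]).
    n = len(state)
    J = [[0.0] * n for _ in range(n)]
    for k in range(n):
        up, dn = list(state), list(state)
        up[k] += h
        dn[k] -= h
        y_up, y_dn = Y(up), Y(dn)
        for r in range(n):
            J[r][k] = (y_up[r] - y_dn[r]) / (2 * h)
    return J

state = [0.3, -0.2, 0.1, 0.4, 0.2, -0.1]
J = jacobian(state)
trace = sum(J[k][k] for k in range(len(state)))
print("divergence estimate:", trace)
```

The off-diagonal entries are generally non-zero (for instance, $\dot{a}_1$ depends on $b_1$), so the vanishing trace reflects the coupling structure rather than a trivial vector field.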

Once again we focus our attention on $\gamma(V)$, which is invariant for $\Psi$. To prove this invariance, assume that one trajectory of $\Psi$ starting from inside $\gamma(V)$ escaped it. Then, given that $\gamma$ is a diffeomorphism, the corresponding trajectory of $\Phi$ would start from $V$ and also escape it, which is not possible since $V$ is invariant for $\Phi$.

Boundedness of Trajectories

In the next section of the proof, we will show that the trajectories of $\Psi$ are also bounded. Our analysis will be based on the invariant function of Theorem 5. Note that, based on the way we proved Theorem 5, the invariant supplied there applies to all initializations in $V$ and not just to the trajectory of $\Phi(\{f_i(\boldsymbol{\theta}_i(0)), g_j(\boldsymbol{\phi}_j(0)), \mu(0), \lambda(0)\}, \cdot)$. We will split our proof in two cases.

Claim 3.

For all initializations in $\gamma(V)$, it holds that $\lambda(t)$, $\mu(t)$ are bounded.

Proof. Observe the following fact:

$$\lambda(t) \to \pm\infty \ \Rightarrow\ \int_{\lambda^*}^{\lambda(t)} (z - \lambda^*)\, dz \to \infty \ \Rightarrow\ H \to \infty$$

The last step of this analysis comes from the fact that $H$ is a sum of non-negative terms, so if one of them goes to infinity the whole sum becomes unbounded. Since initializations in $V$ start with finite values of $H$, it is necessary that $\lambda$ remains bounded. Obviously, the same proof strategy applies to the case of $\mu(t)$.

Now let us analyze the rest of the variables.

Claim 4.

For all initializations in $\gamma(V)$, it holds that $a_i(t)$, $b_j(t)$ are bounded.

Proof. By definition

$$a_i(t) \to \pm\infty \ \Rightarrow\ \int_{p_i}^{f_i(t)} \frac{1}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\|^2}\, dz \to \pm\infty$$

Observe also that

$$\int_{p_i}^{f_i(t)} \frac{1}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\|^2}\, dz \to \pm\infty \ \Rightarrow\ \int_{p_i}^{f_i(t)} \frac{z - p_i}{\|\nabla f_i(X_{\boldsymbol{\theta}_i(0)}(z))\|^2}\, dz \to \infty$$

This is true because $z - p_i$ is bounded away from zero when $f_i$ is converging to the edges of ${f_i}_{\boldsymbol{\theta}_i(0)}$, as $p_i$ is in the interior of the set for safe initializations. Therefore we can once again conclude that

$$a_i(t) \to \pm\infty \ \Rightarrow\ H \to \infty$$

Once again, for initializations in $V$, $H$ remains constant and finite. Therefore $a_i$ should be bounded. The same analysis works for $b_j$.

Application of Poincaré Recurrence Theorem

To summarize the properties that we have established until now, we have shown that the system of $\Psi$ is divergence free and has only bounded orbits. Liouville's formula also yields that $\Psi$ is a volume-preserving flow. By applying the Poincaré Recurrence Theorem (Theorem 14), almost all initial conditions in $\gamma(V)$ of $\Psi$ are recurrent. Thus the set $W$ of all non-recurrent points in $\Psi$ has measure zero.

Using the properties of diffeomorphisms, we can propagate the recurrence behavior of $\Psi$ back to $\Phi_{\text{distributional}}$ using Lemma 4. Thus the set of non-recurrent points of $\Phi$ is $\gamma^{-1}(W)$. Since diffeomorphisms preserve measure zero sets and $W$ has measure zero, the set of non-recurrent points of $\Phi$ has measure zero, indicating that $\Phi$ is indeed recurrent.

Theorem 24 (Restated Theorem 7). Let $f_i$ and $g_j$ be sigmoid functions. Then the flow of Equation 7 is Poincaré recurrent. The same holds for all functions $f_i$ and $g_j$ that are one-to-one and for which all initializations are safe.

Proof. One can notice that, since $f_i$ and $g_j$ are invertible functions, $X_{\theta_i(0)}(\cdot)$ is totally independent of the choice of $\theta_i(0)$. In other words we can substitute

$$X_{\theta_i(0)}(\cdot) = f_i^{-1}(\cdot), \qquad X_{\phi_j(0)}(\cdot) = g_j^{-1}(\cdot).$$

Thus, in contrast to the previous theorem (Theorem 6), the construction of $\Phi_{\text{distributional}}$ does not depend on the initialization. There is a unique $\Phi_{\text{distributional}}$ for all initializations. In fact, using the same map $\nu$ as in the previous theorem, we can prove that $\Phi_{\text{original}}$ is diffeomorphic to $\Phi_{\text{distributional}}$. Moreover, by the previous theorem the flow $\Phi_{\text{distributional}}$ is Poincaré recurrent. Repeating the topological conjugacy argument of the previous theorem, we can transfer the Poincaré recurrence property from the dynamical system of $\Phi_{\text{distributional}}$ to the dynamical system of $\Phi_{\text{original}}$.

D Omitted Proofs of Section 6

Spurious equilibria

Theorem 25 (Restated Theorem 8). One can construct functions $f$ and $g$ for the system of Equation 3 so that for a positive measure set of initial conditions the trajectories converge to fixed points that do not correspond to equilibria of the hidden game.

Proof. Our strategy is to analyze the structure of the Jacobian of the vector field of Equation 3 at stationary points of $f$ and $g$. Let us call $\boldsymbol{Y}(\boldsymbol{\theta}, \boldsymbol{\phi})$ the vector field of Equation 3. Now we can write down its Jacobian:

$$D\boldsymbol{Y}(\boldsymbol{\theta}, \boldsymbol{\phi}) = \begin{pmatrix} -v\,(g(\boldsymbol{\phi}) - q)\,\nabla^2 f(\boldsymbol{\theta}) & -v\,\nabla f(\boldsymbol{\theta}) \otimes \nabla g(\boldsymbol{\phi}) \\ v\,\nabla g(\boldsymbol{\phi}) \otimes \nabla f(\boldsymbol{\theta}) & v\,(f(\boldsymbol{\theta}) - p)\,\nabla^2 g(\boldsymbol{\phi}) \end{pmatrix}$$

We evaluate it at stationary points of $f$ and $g$. Let us call them $\boldsymbol{\theta}^*$ and $\boldsymbol{\phi}^*$:

$$D\boldsymbol{Y}(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*) = v \begin{pmatrix} -(g(\boldsymbol{\phi}^*) - q)\,\nabla^2 f(\boldsymbol{\theta}^*) & 0_{n \times m} \\ 0_{m \times n} & (f(\boldsymbol{\theta}^*) - p)\,\nabla^2 g(\boldsymbol{\phi}^*) \end{pmatrix}$$

We want to study the cases where all eigenvalues of this matrix are negative (i.e. the fixed point is stable). Let $\lambda_i(\nabla^2 f(\boldsymbol{\theta}^*))$ be the eigenvalues of $\nabla^2 f(\boldsymbol{\theta}^*)$ and $\lambda_i(\nabla^2 g(\boldsymbol{\phi}^*))$ the corresponding eigenvalues of $\nabla^2 g(\boldsymbol{\phi}^*)$. Then we know that the eigenvalues of $D\boldsymbol{Y}(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*)$ are

$$-v\,(g(\boldsymbol{\phi}^*) - q)\,\lambda_i(\nabla^2 f(\boldsymbol{\theta}^*)) \qquad \text{and} \qquad v\,(f(\boldsymbol{\theta}^*) - p)\,\lambda_i(\nabla^2 g(\boldsymbol{\phi}^*)).$$

Here we will analyze the case of $v > 0$ (the case of $v < 0$ is completely similar). To get that all eigenvalues are negative we can simply require:

• $\nabla^2 f(\boldsymbol{\theta}^*)$ and $\nabla^2 g(\boldsymbol{\phi}^*)$ are invertible.
• $\boldsymbol{\phi}^*$ is a local minimum with $g(\boldsymbol{\phi}^*) > q$. Combined with the first condition we get that $\nabla^2 g(\boldsymbol{\phi}^*)$ is positive definite.
• $\boldsymbol{\theta}^*$ is a local minimum with $f(\boldsymbol{\theta}^*) < p$. Combined with the first condition we get that $\nabla^2 f(\boldsymbol{\theta}^*)$ is positive definite.

One can observe that the second condition allows the existence of unsafe initializations if $\boldsymbol{\phi}(0)$ is in the vicinity of $\boldsymbol{\phi}^*$.

Clearly, based on Theorem 15, there is a full-dimensional manifold of points that eventually converge to this fixed point. Given that the manifold has full dimension, this set of points has positive measure. Additionally, $g(\boldsymbol{\phi}^*)$ and $f(\boldsymbol{\theta}^*)$ do not take the values of the unique equilibrium of the hidden game.

E Omitted Proofs of Section 7

Discrete Time Gradient-Descent-Ascent

The outline of this section is the following:

1. We first review an existing result showing that invariants of continuous-time systems that have convex level sets, even though they may not be invariants for the discrete-time counterparts, are at least non-decreasing in the discrete case.
2. We show that the invariant of Theorem 5 is convex for the case of sigmoid functions. Therefore it has convex level sets.
3. We extend the construction of Theorem 8 to discrete-time systems.
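The first two items of the outline can be illustrated on the two-strategy system with sigmoid $f$ and $g$. The sketch below is an illustration under our own assumptions, not the authors' experiment: it uses scalar parameters, hypothetical values for $p$, $q$, $v$ and the step size, and the closed form $K_c(x) = (1 - 2c)\,x + (1 - c)\,e^{x} + c\,e^{-x}$, which (up to an additive constant) equals $\int_c^{\sigma(x)} \frac{z - c}{(z(1-z))^2}\,dz$ and is convex in $x$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def K(x, c):
    # Closed form, up to a constant, of the invariant's integral term
    # for a sigmoid hidden function, written in the parameter x.
    return (1 - 2 * c) * x + (1 - c) * math.exp(x) + c * math.exp(-x)

def H(theta, phi, p, q):
    # Invariant of the continuous-time two-strategy system (sigmoid case).
    return K(theta, p) + K(phi, q)

def euler_gda(theta, phi, p, q, v=1.0, eta=0.1, steps=2000):
    # Discrete-time GDA, i.e. the explicit Euler discretization.
    energies = [H(theta, phi, p, q)]
    for _ in range(steps):
        f, g = sigmoid(theta), sigmoid(phi)
        d_theta = -v * f * (1 - f) * (g - q)
        d_phi = v * g * (1 - g) * (f - p)
        theta, phi = theta + eta * d_theta, phi + eta * d_phi
        energies.append(H(theta, phi, p, q))
    return energies

# Hypothetical equilibrium values and initialization.
energies = euler_gda(theta=0.5, phi=-0.5, p=0.3, q=0.6)
print(energies[0], energies[-1])
```

Because $H$ is convex and its gradient is orthogonal to the vector field, each Euler step can only keep $H$ fixed or increase it, so the recorded energies form a non-decreasing sequence.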

Theorem 26 (Theorem 5.3 of Bailey and Piliouras [2019c]). Suppose a continuous dynamic $\boldsymbol{y}(t)$ has an invariant energy $H(\boldsymbol{y})$. If $H$ is continuous with convex sublevel sets, then the energy in the corresponding discrete-time dynamic obtained via Euler's method/integration is non-decreasing.

Proof. Let us consider a continuous-time dynamical system:

$$\dot{\boldsymbol{y}}(t) = \boldsymbol{F}(\boldsymbol{y}(t))$$

Let $t$ denote the current time instant of a trajectory with initial conditions $\boldsymbol{y}_0$. Doing discrete-time gradient-descent-ascent with step-size $\eta$ yields an approximation of $\boldsymbol{y}^{\boldsymbol{y}_0}(t + \eta)$:

$$\hat{\boldsymbol{y}}^{\boldsymbol{y}_0}_{t+\eta} = \boldsymbol{y}^{\boldsymbol{y}_0}(t) + \eta\, \dot{\boldsymbol{y}}^{\boldsymbol{y}_0}(t) \tag{11}$$

To prove our theorem it suffices to show that $H(\hat{\boldsymbol{y}}^{\boldsymbol{y}_0}_{t+\eta}) \ge H(\boldsymbol{y}^{\boldsymbol{y}_0}(t))$. Suppose $H(\boldsymbol{y}^{\boldsymbol{y}_0}(t)) = c$ and, without loss of generality, assume $\{\boldsymbol{y} : H(\boldsymbol{y}) \le c\}$ is full-dimensional. Since $\{\boldsymbol{y} : H(\boldsymbol{y}) \le c\}$ is convex, there exists a supporting hyperplane $\{\boldsymbol{y} : a^\top \boldsymbol{y} = a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t)\}$ such that $a^\top \boldsymbol{y} \le a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t)$ for all $\boldsymbol{y} \in \{\boldsymbol{y} : H(\boldsymbol{y}) \le c\}$.

Because of the invariance property of $H$ over the trajectory, it holds that

$$H(\boldsymbol{y}^{\boldsymbol{y}_0}(t)) = c \quad \forall t \in \mathbb{R}.$$

Therefore,

$$a^\top \left( \frac{d}{dt} \boldsymbol{y}^{\boldsymbol{y}_0}(t) \right) = a^\top \left( \lim_{s \to 0^+} \frac{\boldsymbol{y}^{\boldsymbol{y}_0}(t) - \boldsymbol{y}^{\boldsymbol{y}_0}(t - s)}{s} \right) = \lim_{s \to 0^+} \frac{a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t) - a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t - s)}{s} \ge \lim_{s \to 0^+} \frac{a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t) - a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t)}{s} = 0,$$

implying

$$a^\top \hat{\boldsymbol{y}}^{\boldsymbol{y}_0}_{t+\eta} = a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t) + a^\top \left( \eta\, \frac{d}{dt} \boldsymbol{y}^{\boldsymbol{y}_0}(t) \right) \ge a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t).$$

For contradiction, suppose $H(\hat{\boldsymbol{y}}^{\boldsymbol{y}_0}_{t+\eta}) < c$. By continuity of $H$, for sufficiently small $\epsilon > 0$, $\hat{\boldsymbol{y}}^{\boldsymbol{y}_0}_{t+\eta} + \epsilon a \in \{\boldsymbol{y} : H(\boldsymbol{y}) \le c\}$. However,

$$a^\top (\hat{\boldsymbol{y}}^{\boldsymbol{y}_0}_{t+\eta} + \epsilon a) \ge a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t) + \epsilon \|a\|^2 > a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t) \tag{12}$$

contradicting that $\{\boldsymbol{y} : a^\top \boldsymbol{y} = a^\top \boldsymbol{y}^{\boldsymbol{y}_0}(t)\}$ is a supporting hyperplane. Thus, the statement of the theorem holds.

Lemma 9.

The invariant of Theorem 5 is jointly convex in $\boldsymbol{\theta}$, $\boldsymbol{\phi}$, $\lambda$ and $\mu$ when $f_i$ and $g_j$ are sigmoid functions of one variable.

Proof. Since $H$ is a sum of terms each involving disjoint variables, it suffices to prove that each term is convex with respect to its own variables. This follows immediately for $\lambda$ and $\mu$. Let us take one term involving $f_i$ (the same analysis works for the $g_j$ terms as well). In fact we want to prove that the following function is convex:

$$\int_{p_i}^{f(\theta_i)} \frac{z - p_i}{\|\nabla f(X_{\theta_i(0)}(z))\|^2}\, dz$$

where $f$ is the sigmoid function. Taking the first derivative, knowing that $f' = (1 - f) f$ for the sigmoid, we have

$$\frac{(f(\theta_i) - p_i)\,(1 - f(\theta_i))\, f(\theta_i)}{\|\nabla f(X_{\theta_i(0)}(f(\theta_i)))\|^2}$$

$X_{\theta_i(0)}(f(\theta_i))$ is equal to $\theta_i$ since $f$ is one-to-one. Thus we can simplify:

$$\frac{(f(\theta_i) - p_i)\,(1 - f(\theta_i))\, f(\theta_i)}{\|\nabla f(\theta_i)\|^2}$$

Once again we can use the formula for the derivative of $f$:

$$\frac{(f(\theta_i) - p_i)\,(1 - f(\theta_i))\, f(\theta_i)}{\bigl((1 - f(\theta_i))\, f(\theta_i)\bigr)^2} = \frac{f(\theta_i) - p_i}{(1 - f(\theta_i))\, f(\theta_i)}$$

In order to complete the convexity analysis we apply the second derivative test:

$$\frac{d}{d\theta_i}\, \frac{f(\theta_i) - p_i}{(1 - f(\theta_i))\, f(\theta_i)} = \frac{f(\theta_i)^2 - 2 p_i f(\theta_i) + p_i}{(1 - f(\theta_i))^2 f(\theta_i)^2}\, (1 - f(\theta_i))\, f(\theta_i) = \frac{f(\theta_i)^2 - 2 p_i f(\theta_i) + p_i}{(1 - f(\theta_i))\, f(\theta_i)}$$

The only roots of the numerator are

$$f(\theta_i) = p_i \pm \sqrt{p_i^2 - p_i}$$

Of course, for $p_i \in (0, 1)$ these roots are not real. So for all $\theta_i$, $f(\theta_i) \in (0, 1)$ and the second derivative is positive. This concludes our convexity proof.

Figure 3: Hidden bilinear game with two strategies having p = 0 . and q = 0 . . The functions $f$ and $g$ are sigmoids for each player. We observe the evolution of $f$ and $g$ as well as the invariant of Theorem 2. The trajectories are close to being periodic but $H$ has begun to increase even with relatively few iterations, confirming the findings of Theorem 9.

Theorem 27 (Restated Theorem 9).
Let $f_i$ and $g_j$ be sigmoid functions. Then for the discretized version of the system of Equation 7 and for safe initializations, the function $H$ of Theorem 5 is non-decreasing.

Proof. First observe that, since sigmoids are invertible functions, $X_{\theta_i(0)}(f_i)$ and $X_{\phi_j(0)}(g_j)$ are independent of the initial conditions, similarly to the proof of Theorem 7. Thus the invariant $H$ of Theorem 5, which is preserved by all the trajectories of the continuous-time dynamical system, is common across all initializations. Using Lemma 9, $H$ is convex and therefore has convex sublevel sets. Of course it is also continuous. Using Theorem 26 we get the requested result.

Theorem 28 (Restated Theorem 10). One can choose a learning rate $\alpha$ and functions $f$ and $g$ for the discretized version of the system of Equation 3 so that for a positive measure set of initial conditions the trajectories converge to fixed points that do not correspond to equilibria of the hidden game.

Proof. The proof follows the same construction as in the continuous case of Theorem 8. In fact, the Jacobian of the discrete-time map is

$$I_{(N+M) \times (N+M)} + \alpha\, D\boldsymbol{Y}(\boldsymbol{\theta}, \boldsymbol{\phi})$$

where $\boldsymbol{Y}$ is the vector field of the continuous-time system. We can do the same construction as in Theorem 8 to get a fixed point $(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*)$ such that $D\boldsymbol{Y}(\boldsymbol{\theta}^*, \boldsymbol{\phi}^*)$ has only negative eigenvalues and $(f(\boldsymbol{\theta}^*), g(\boldsymbol{\phi}^*)) \ne (p, q)$. Let $\lambda_{\min}$ be the smallest eigenvalue of this matrix. Choose

$$\alpha < -\frac{1}{\lambda_{\min}}$$

Then the Jacobian of the discrete-time map has positive eigenvalues that are less than one. Therefore the discrete-time map is locally a diffeomorphism and, by the Stable Manifold Theorem for discrete-time maps (Theorem 16), the stable manifold is again full dimensional and therefore has positive measure.
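A small numerical sketch of such a construction follows. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the hidden functions `f` and `g`, the values $p = q = 0.5$, $v = 1$, the step size and the starting point. Here $f$ has a local minimum at $\theta^* = 0$ with $f(\theta^*) < p$ and $g$ has a local minimum at $\phi^* = 1$ with $g(\phi^*) > q$, while both $f = p$ and $g = q$ remain attainable elsewhere.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)

# Hypothetical hidden functions (assumptions for illustration only).
def f(theta):
    return sigmoid(theta ** 2 - 1)

def df(theta):
    return dsigmoid(theta ** 2 - 1) * 2 * theta

def g(phi):
    return sigmoid(phi ** 3 - 3 * phi + 3)

def dg(phi):
    return dsigmoid(phi ** 3 - 3 * phi + 3) * (3 * phi ** 2 - 3)

p, q, v = 0.5, 0.5, 1.0

# Eigenvalues of the continuous-time Jacobian at (theta*, phi*) = (0, 1):
# f''(0) = 2 * sigmoid'(-1) and g''(1) = 6 * sigmoid'(1) for these choices.
lam_f = -v * (g(1.0) - q) * 2 * dsigmoid(-1.0)
lam_g = v * (f(0.0) - p) * 6 * dsigmoid(1.0)
lam_min = min(lam_f, lam_g)

alpha = 0.5  # step size, chosen below -1 / lam_min

def discrete_gda(theta, phi, steps=5000):
    # Simultaneous discrete-time gradient-descent-ascent updates.
    for _ in range(steps):
        d_theta = -v * df(theta) * (g(phi) - q)
        d_phi = v * dg(phi) * (f(theta) - p)
        theta, phi = theta + alpha * d_theta, phi + alpha * d_phi
    return theta, phi

theta_end, phi_end = discrete_gda(0.5, 1.5)
print(theta_end, phi_end, f(theta_end), g(phi_end))
```

Starting from this particular point, the iterates settle at the stationary point $(0, 1)$, where $(f, g)$ differs from $(p, q)$; nearby starting points behave the same way, which is the positive-measure phenomenon the theorem describes.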

Figure 4: Hidden bilinear game with two strategies having p = 0 . and q = 0 . . The functionsare f ( x ) = 0 . . · σ ( x ) and g ( y ) = σ ( y ) . There is no solution of f ( x ) = pp