Gradient Descent-Ascent Provably Converges to Strict Local Minmax Equilibria with a Finite Timescale Separation
Tanner Fiez
University of Washington
[email protected]
Lillian J. Ratliff
University of Washington
[email protected]
Abstract
This paper concerns the local stability and convergence rate of gradient descent-ascent in two-player non-convex, non-concave zero-sum games. We study the role that a finite timescale separation parameter τ has on the learning dynamics, where the learning rate of player 1 is denoted by γ_1 and the learning rate of player 2 is defined to be γ_2 = τγ_1. Existing work analyzing the role of timescale separation in gradient descent-ascent has primarily focused on the edge cases of players sharing a learning rate (τ = 1) and the maximizing player approximately converging between each update of the minimizing player (τ → ∞). For the parameter choice of τ = 1, it is known that the learning dynamics are not guaranteed to converge to a game-theoretically meaningful equilibrium in general, as shown by Mazumdar et al. (2020) and Daskalakis and Panageas (2018). In contrast, Jin et al. (2020) showed that the stable critical points of gradient descent-ascent coincide with the set of strict local minmax equilibria as τ → ∞. In this work, we bridge the gap between past work by showing there exists a finite timescale separation parameter τ^* such that x^* is a stable critical point of gradient descent-ascent for all τ ∈ (τ^*, ∞) if and only if it is a strict local minmax equilibrium. Moreover, we provide an explicit construction for computing τ^* along with corresponding convergence rates and results under deterministic and stochastic gradient feedback. The convergence results we present are complemented by a non-convergence result: given a critical point x^* that is not a strict local minmax equilibrium, there exists a finite timescale separation τ_0 such that x^* is unstable for all τ ∈ (τ_0, ∞). Finally, we extend the stability and convergence results regarding gradient descent-ascent to gradient penalty regularization methods for generative adversarial networks (Mescheder et al., 2018) and empirically demonstrate on the CIFAR-10 and CelebA datasets the significant impact timescale separation has on training performance.

1 Introduction

In this paper we study learning in zero-sum games of the form

min_{x_1 ∈ X_1} max_{x_2 ∈ X_2} f(x_1, x_2)

where the objective function of the game f is assumed to be sufficiently smooth and potentially non-convex and non-concave in the strategy spaces X_1 and X_2, respectively, with each X_i a precompact subset of R^{n_i}. This general problem formulation has long been fundamental in game theory (Başar and Olsder, 1998) and recently it has become central to machine learning with applications in generative adversarial networks (Goodfellow et al., 2014), robust supervised learning (Madry et al., 2018; Sinha et al., 2018), reinforcement and multi-agent reinforcement learning (Rajeswaran et al., 2020; Zhang et al., 2019), imitation learning (Ho and Ermon, 2016), constrained optimization (Cherukuri et al., 2017), and hyperparameter optimization (Lorraine et al., 2020; MacKay et al., 2019), among several others. From a game-theoretic viewpoint, the work on learning in games can, in some sense, be viewed as explaining how equilibria arise through an iterative competition for optimality (Fudenberg et al., 1998). However, in machine learning, the primary purpose of learning dynamics is to compute equilibria efficiently for the sake of providing meaningful solutions to problems of interest. As a result of this perspective, there has been significant interest in the study of gradient descent-ascent owing to the fact that the learning rule is computationally efficient and a natural analogue to gradient descent
from function optimization. Formally, the learning dynamics are given by each player myopically updating a strategy with an individual gradient as follows:

x_{1,k+1} = x_{1,k} − γ_1 D_1 f(x_{1,k}, x_{2,k}),
x_{2,k+1} = x_{2,k} + γ_2 D_2 f(x_{1,k}, x_{2,k}).

The analysis of gradient descent-ascent is complicated by the intricate optimization landscape in non-convex, non-concave zero-sum games. To begin, there is the fundamental question of what type of solution concept is desired. Given the class of games under consideration, local solution concepts have been proposed and are often taken to be the goal of a learning algorithm. The primary notions of equilibrium that have been adopted are the local Nash and local minmax/Stackelberg concepts, with a focus on the set of strict local equilibria that can be characterized by gradient-based sufficient conditions. Following several past works, from here on we refer to strict local Nash equilibria and strict local minmax/Stackelberg equilibria as differential Nash equilibria and differential Stackelberg equilibria, respectively.

Regardless of the equilibrium notion under consideration, a number of past works highlight failures of standard gradient descent-ascent in non-convex, non-concave zero-sum games. Indeed, it has been shown that gradient descent-ascent with a shared learning rate (γ_1 = γ_2) is prone to reaching critical points that are neither differential Nash equilibria nor differential Stackelberg equilibria (Daskalakis and Panageas, 2018; Jin et al., 2020; Mazumdar et al., 2020). While an important negative result, it does not rule out the prospect that gradient descent-ascent may be able to guarantee equilibrium convergence, as it fails to account for a key structural parameter of the learning dynamics, namely the ratio of learning rates between the players.

Motivated by the observation that the order of play between players is fundamental to the definition of the game, the role of timescale separation in gradient descent-ascent has been explored theoretically in recent years (Chasnov et al., 2019; Heusel et al., 2017; Jin et al., 2020). On the empirical side of past work, it has been widely demonstrated and prescribed that timescale separation in gradient descent-ascent between the generator and discriminator, either by heterogeneous learning rates or unrolled updates, is crucial to improving the solution quality when training generative adversarial networks (Arjovsky et al., 2017; Goodfellow et al., 2014; Heusel et al., 2017). Denoting γ_1 as the learning rate of player 1, the learning rate of player 2 can be redefined as γ_2 = τγ_1 where τ = γ_2/γ_1 > 0 is the timescale separation parameter. The work of Jin et al. (2020) took a meaningful step toward understanding the effect of timescale separation in gradient descent-ascent by showing that as τ → ∞ the stable critical points of the learning dynamics coincide with the set of differential Stackelberg equilibria. In simple terms, the aforementioned result implies that all 'bad critical points' (that is, critical points lacking game-theoretic meaning) become unstable as the timescale separation approaches infinity and that all 'good critical points' (that is, game-theoretically meaningful equilibria) remain or become stable as the timescale separation approaches infinity.
While a promising theoretical development on the local stability of theunderlying dynamics, it does not lead to a practical, implementable learning rule or necessarily provide anexplanation for the satisfying performance in applications of gradient descent-ascent with a finite timescaleseparation. It remains an open question to fully understand gradient descent-ascent as a function of thetimescale separation and to determine whether the desirable behavior with an infinite timescale separationis achievable for a range of finite learning rate ratios.This paper continues the theoretical study of gradient descent-ascent with timescale separation in non-convex, non-concave zero-sum games. We focus our attention on answering the remaining open questionsregarding the behavior of the learning dynamics with finite learning rate ratios and provide a number ofconclusive results. Notably, we develop necessary and sufficient conditions for a critical point to be stable fora range of finite learning rate ratios. The results imply that differential Stackelberg equilibria are stable fora range of finite learning rate ratios and that non-equilibria critical points are unstable for a range of finitelearning rate ratios. Together, this means gradient descent-ascent only converges to differential Stackelbergequilibrium for a range of finite learning rate ratios. To our knowledge, this is the first provable guarantee ofits kind for an implementable first-order method. Moreover, the technical results in this work rely on toolsthat have not appeared in the machine learning and optimization communities analyzing games and exposea number interesting directions of future research. Explicitly, the notion of a guard map, which is arguablyeven an obscure tool in modern control and dynamical systems theory, is ‘rediscovered’ in this work as atechnique for analyzing the stability of game dynamics.2 .1 Contributions To motivate our primary theoretical results, we present a self-contained description of what is known aboutthe local stability of gradient descent-ascent around critical points in Section 3.1. The existing resultsprimarily concern gradient descent-ascent without timescale separation and with a ratio of learning ratesapproaching infinity (see Figure 1 for a graphical depiction of known results in each regime). In contrast, thispaper is focused on characterizing the stability and convergence of gradient descent-ascent across a range offinite learning rate ratios. To hint at what is achievable in this realm, we present simple examples for whichgradient descent-ascent converges to non-equilibrium critical points and games with differential Stackelbergequilibrium that are unstable with respect to gradient descent-ascent without timescale separation (seeExamples 1 and 2, Section 3). While the existence of such examples is known (Daskalakis and Panageas, 2018;Jin et al., 2020; Mazumdar et al., 2020), we demonstrate in them that a finite timescale separation is sufficientto remedy the undesirable stability properties of gradient descent-ascent without timescale separation.Toward characterizing this phenomenon in its full generality, we provide intermediate results which areknown, but we prove using technical tools not yet broadly seen and exploited by this community. 
Tobegin, it is known that the set of differential Nash equilibrium are stable with respect to gradient descent-ascent (Daskalakis and Panageas, 2018; Mazumdar and Ratliff, 2019), and that they remain stable for anytimescale separation parameter τ ∈ (0 , ∞ ) (Jin et al., 2020). We provide a proof for this result (Proposition 3)using the concept of quadratic numerical range (Tretter, 2008). Furthermore, Jin et al. (2020) recently showedthat as the timescale separation τ → ∞ , the stable critical points of gradient descent-ascent coincide withthe set of differential Stackelberg equilibrium. We reveal that this result has long existed in the literatureon singularly perturbed systems (Kokotovic et al., 1986, Chapter 2 and the citations within) and provide aproof (see Proposition 4) using analysis methods from the aforementioned line of work that are novel to theliterature on learning in games from the machine learning and optimization communities in recent years.A relevant line of study on singularly perturbed systems is that of characterizing the range of perturbationparameters for which a system is stable (Kokotovic et al., 1986; Saydy, 1996; Saydy et al., 1990). Debatablyintroduced by Saydy et al. (1990), guardian or guard maps act as a certificate that the roots of a polynomiallie in a particular guarded domain for a range of parameter values. Historically, guard maps serve as a toolfor studying the stability of parameterized families of dynamical systems. We bring this tool to learning ingames and construct a map that guards a class of Hurwitz stable matrices parameterized by the timescaleseparation parameter τ in order to analyze the range of learning rate ratios for which a critical point is stablewith respect to gradient descent-ascent. This technique leads to the following result. Informal Statement of Theorem 3.
Consider a sufficiently regular critical point x ∗ of gradient descent-ascent. There exists a τ ∗ ∈ (0 , ∞ ) such that x ∗ is stable for all τ ∈ ( τ ∗ , ∞ ) if and only if x ∗ is a differentialStackelberg equilibrium. Theorem 3 confirms that there does indeed exist a range of finite learning ratios such that a differentialStackelberg equilibrium is stable with respect to gradient descent-ascent. Moreover, such a range of learningrate ratios only exists if a critical point is a differential Stackelberg equilibrium. As we show in Corollary 2,the former implication of Theorem 3 nearly immediately implies there exists a τ ∗ ∈ (0 , ∞ ) such that gradientdescent-ascent converges locally asymptotically for all τ ∈ ( τ ∗ , ∞ ) if and only if x ∗ is a differential Stackelbergequilibrium given a suitably chosen learning rate and deterministic gradient feedback. We give an explicitasymptotic rate of convergence in Theorem 5 and characterize the iteration complexity in Corollary 3.Moreover, we extend the convergence guarantees to stochastic gradient feedback in Theorem 6.The latter implication of Theorem 3 says that there exists a finite learning rate ratio such that a non-equilibrium critical point of gradient descent-ascent is unstable. Building off of this, we complement thestability result of Theorem 3 with the following analagous instability result. Informal Statement of Theorem 4.
Consider any stable critical point x^* of gradient descent-ascent which is not a differential Stackelberg equilibrium. There exists a finite learning rate ratio τ_0 ∈ (0, ∞) such that x^* is unstable for all τ ∈ (τ_0, ∞).

Theorem 4 establishes that there exists a range of finite learning rate ratios for which non-equilibrium critical points are unstable with respect to gradient descent-ascent. This implies that for a suitably chosen finite timescale separation, gradient descent-ascent avoids critical points lacking game-theoretic meaning. Together, Theorem 3 and Theorem 4 answer affirmatively that gradient descent-ascent with timescale separation can guarantee equilibrium convergence, which answers a standing open question. Moreover, we provide explicit constructions for computing τ^* and τ_0 given a critical point. In fact, our construction of τ^* in Theorem 3 is tight, and this is confirmed by our numerical experiments.

We finish the theoretical analysis of gradient descent-ascent in this paper by connecting to the literature on generative adversarial networks. We show under common assumptions on generative adversarial networks (Mescheder et al., 2018; Nagarajan and Kolter, 2017) that the introduction of gradient penalty based regularization to the discriminator does not change the set of critical points for the dynamics and, further, there exists a finite learning rate ratio τ^* such that for any learning rate ratio τ ∈ (τ^*, ∞) and any non-negative, finite regularization parameter µ, the continuous time limiting regularized learning dynamics remain stable, and hence, there is a range of learning rates γ_1 for which the discrete time update locally converges asymptotically. Informal Statement of Theorem 7.
Consider training a generative adversarial network with a gradientpenalty (for any fixed regularization parameter µ ∈ (0 , ∞ ) ) via a zero-sum game with generator network G θ , discriminator network D ω , and loss f ( θ, ω ) such that relaxed realizable assumptions are satisfied fora critical point ( θ ∗ , ω ∗ ) . Then, ( θ ∗ , ω ∗ ) is a differential Stackelberg equilibrium, and for any τ ∈ (0 , ∞ ) ,gradient descent-ascent converges locally asymptotically. Moreover, an asymptotic rate of convergence isgiven in Corollary 5. The theoretical results we provide are complemented by extensive experiments. In simulation, we explorea number of interesting behaviors of gradient descent-ascent with timescale separation analyzed theoreticallyincluding differential Stackelberg equilibria shifting from being unstable to stable and non-equilibrium criticalpoints moving from being stable to unstable. Furthermore, we examine how the vector field and the spectrumof the game Jacobian evolve as a function of the timescale separation and explore the relationship with therate of convergence. We experiment with gradient descent-ascent on the Dirac-GAN proposed by Meschederet al. (2018) and illustrate the interplay between timescale separation, regularization, and rate of convergence.Building on this, we train generative adversarial networks on the CIFAR-10 and CelebA datasets withregularization and demonstrate that timescale separation can benefit performance and stability. In theexperiments we observe that regularization and timescale separation are intimately connected and thereis an inherent tradeoff between them. This indicates that insights made on simple generative adversarialnetwork formulations may carry over to the complex problems where players are parameterized by neuralnetworks.Collectively, the primary contribution of this paper is the near-complete characterization of the behaviorof gradient descent-ascent with finite timescale separation. Moreover, by introducing a novel set of analysistools to this literature, our work opens a number of future research questions. As an aside, we believe thesetechnical tools open up novel avenues for not only proving results about learning dynamics in games, butalso for synthesizing algorithms.
The organization of this paper is as follows. Preliminaries on game theoretic notions of equilibria, gradient-based learning algorithms, and dynamical systems theory are reviewed in Section 2.Convergence analysis proceeds in two phases. In Section 3, we study the stability properties of the contin-uous time limiting dynamical system given a timescale separation between the minimizing and maximizingplayers. Specifically, we show the first result on necessary and sufficient conditions for convergence of thecontinuous time limiting system corresponding to gradient descent-ascent with time scale separation to gametheoretically meaningful equilibria (i.e., local minmax equilibria in zero-sum games). Following this, in Sec-tion 4, we provide convergence guarantees for the original discrete time dynamical system of interest (namely,gradient descent ascent). Using the results in the proceeding section, we show that gradient descent-ascentconverges to a critical point if and only if it is a differential Stackelberg equilibrium (i.e., a sufficiently regular local minmax). In addition, we characterize the iteration complexity of gradient descent-ascent dynamicsand provide finite-time bounds on local convergence to approximate local Stackelberg equilibria.We apply the main results in the preceding sections to generative adversarial networks in Section 5,and in Section 6 we present several illustrative examples including generative adversarial networks where4e show that tuning the learning rate ratio along with regularization and the exponential moving averagehyperparameter significantly improves the Fr´echet Inception Distance (FID) metric for generative adversarialnetworks.Given its length, prior to concluding in Section 7, we review related work drawing connections to solutionconcepts, gradient descent-ascent learning dynamics, applications to adversarial learning where the successof heuristics provide strong motivation for the theoretical work in this paper, and historical connections todynamical systems theory. Throughout the sections proceeding Section 7, we draw connections to relatedworks and results in an effort to place our results in the context of the literature. We conclude in Section 8with a discussion on the significance of the results and open questions. The appendix includes the majorityof the detailed proofs as well as additional experiments and commentary.
In this section, we review game theoretic and dynamical systems preliminaries. Additionally, we formulatethe class of learning rules analyzed in this paper.
A two-player zero-sum continuous game is defined by a collection of costs (f_1, f_2) where f_1 ≡ f and f_2 ≡ −f with f ∈ C^r(X, R) for some r ≥ 2, and X = X_1 × X_2 with each X_i a precompact subset of R^{n_i} for i = 1,
2. Let n = n + n be the dimension of the joint strategy space X = X × X . Player i ∈ I seeks tominimize their cost function f i ( x i , x − i ) with respect to their choice variable x i where x − i is the vector of allother actions x j with j (cid:54) = i .There are two natural equilibrium concepts for such games depending on the order of play—i.e., the Nashequilibrium concept in the case of simultaneous play and the Stackelberg equilibrium concept in the case ofhierarchical play. Each notion of equilibria can be characterized as the intersection points of the reactioncurves of the players (Ba¸sar and Olsder, 1998). Definition 1 (Local Nash Equilibrium) . The joint strategy x ∈ X is a local Nash equilibrium on (cid:81) i ∈I U i ⊂ X , where U i ⊆ X i , if f ( x , x ) ≤ f ( x (cid:48) , x ) , for all x (cid:48) ∈ U ⊂ X and f ( x , x ) ≥ f ( x , x (cid:48) ) for all x (cid:48) ∈ U ⊂ X . Furthermore, if the inequalities are strict, we say x is a strict local Nash equilibrium. Definition 2 (Local Stackelberg Equilibrium) . Consider U i ⊂ X i for i = 1 , where, without loss of general-ity, player 1 is the leader (minimizing player) and player 2 is the follower (maximizing player). The strategy x ∗ ∈ U is a local Stackelberg solution for the leader if, ∀ x ∈ U , sup x ∈ r U ( x ∗ ) f ( x ∗ , x ) ≤ sup x ∈ r U ( x ) f ( x , x ) , where r U ( x ) = { y ∈ U | f ( x , y ) ≥ f ( x , x ) , ∀ x ∈ U } is the reaction curve. Moreover, for any x ∗ ∈ r U ( x ∗ ) , the joint strategy profile ( x ∗ , x ∗ ) ∈ U × U is a local Stackelberg equilibrium on U × U . Predicated on existence, equilibria can be characterized in terms of sufficient conditions on player costs.Indeed, in continuous games, first and second order conditions on player cost functions leads to a differentialcharacterization (i.e., necessary and sufficient conditions) of local Nash equilibria reminiscent of optimalityconditions in nonlinear programming (Ratliff et al., 2016). We denote D i f i as the derivative of f i with respect to x i , D ij f i as the partial derivative of D i f i withrespect to x j , D i f i as the partial derivative of D i f i with respect to x i , and D ( · ) as the total derivative. Characterizing existence of equilibria is outside the scope of this work. However, we remark that Nash equilibria exist forconvex costs on compact and convex strategy spaces and Stackelberg equilibria exist on compact strategy spaces (Ba¸sar andOlsder, 1998, Thm. 4.3, Thm. 4.8, & Sec. 4.9). The differential characterization of local Nash equilibria in continuous games was first reported in (Ratliff et al., 2013).Genericity and structural stability we studied in general-sum settings in (Ratliff et al., 2014) and in zero-sum settings in(Mazumdar and Ratliff, 2019). Example: given f ( x, h ( x )), Df = D f + D f ◦ Dh . roposition 1 (Necessary and Sufficient Conditions for Local Nash (Ratliff et al., 2016, Thm. 1 & 2)) . If x is a local Nash equilibrium of the zero-sum game (( f, − f ) , then D f ( x ) = 0 , − D f ( x ) = 0 , D f ( x ) ≥ and D f ( x ) ≤ . On the other hand, if D f ( x ) = 0 , D f ( x ) = 0 , and D f ( x ) > and D f ( x ) < , then x ∈ X is a local Nash equilibrium. The following definition, characterized by sufficient conditions for a local Nash equilibrium as defined inDefinition 1, was first introduced in (Ratliff et al., 2013).
Definition 3 (Differential Nash Equilibrium (Ratliff et al., 2013)) . The joint strategy x ∈ X is a differentialNash equilibrium if D f ( x ) = 0 , − D f ( x ) = 0 , D f ( x ) > and D f ( x ) < . Analogous sufficient conditions can be stated to characterize a local Stackelberg equilibrium as definedin Definition 2. Suppose that f ∈ C r +1 ( X, R ) for some r ≥
1. Given a point x ∗ at which det( D f ( x ∗ )) (cid:54) = 0,the implicit function theorem (Abraham et al., 2012, Thm. 2.5.7) implies there is a neighborhood U of x ∗ and a unique C r map h : U → R n such that for all x ∈ U , D f ( x , h ( x )) = 0. The map h : x (cid:55)→ x isreferred to as the implicit map . Further, Dh ≡ − ( D f ) − ◦ D f . Note that det( D f ( x )) (cid:54) = 0 is a genericcondition (cf. (Fiez et al., 2020, Lem. C.1)). Let Df ( x , h ( x )) be the total derivative of f and analogously,let D f be the second-order total derivative. Definition 4 (Differential Stackelberg Equilibrium (Fiez et al., 2020)) . The joint strategy x = ( x , x ) ∈ X is a differential Stackelberg equilibrium if D f ( x , x ) = 0 , − D f ( x , x ) = 0 , D f ( x , x ) > , and D f ( x , x ) < . Observe that in a general sum setting the first order conditions for player 1 are equivalent the totalderivative of f being zero at the candidate critical point where x is implicitly defined as a function of x viathe implicit mapping theorem applied to D f ( x , x ) = 0. Since in this paper and in Definition 4, the class ofgames is zero sum, D f ( x , x ) = 0 and D f ( x , x ) = 0 (along with the condition that det( D f ( x , x )) (cid:54) = 0which is implied by the second order conditions) are sufficient to imply that the total derivative Df ( x , x )is zero.The Jacobian of the first order necessary and sufficient condition—i.e., conditions that define potentialcandidate differential Nash and/or Stackelberg equilibria—is a useful mathematical object for understandingconvergence properties of gradient based learning rules as we will see in subsequent sections. Consider thevector of individual gradients g ( x ) = ( D f ( x ) , − D f ( x )) which define first order conditions for a differentialNash equilibrium. Let J ( x ) denote the Jacobian of g ( x ) which is defined by J ( x ) = (cid:20) D f ( x ) D f ( x ) D f ( x ) D f ( x ) (cid:21) . (1)We recall from Fiez et al. (2020) an alternative (to Definition 4, but equivalent set of sufficient conditionsfor a differential Stackelberg in terms of J ( x ). Let S ( · ) denote the Schur complement of ( · ) with respect tothe n × n block-row matrix in ( · ). Proposition 2 (Proposition 1 Fiez et al. (2020)) . Consider a zero-sum game ( f, − f ) defined by f ∈ C r ( X, R ) with r ≥ and player 1 (without loss of generality) taken to be the leader (minimizing player).Let x ∗ satisfy − D f ( x ∗ ) = 0 and − D f ( x ∗ ) > . Then Df ( x ∗ ) = 0 and S ( J ( x ∗ )) > if and only if x ∗ isa differential Stackelberg equilibria. Remark 1 (A comment on the genericity of differential Stackelberg/Nash equilibria.) . Due to Fiez et al.(2020, Theorem 1), differential Stackelberg are generic amongst local Stackelberg equilibria and, similarly,due to Mazumdar and Ratliff (2019, Theorem 2), differential Nash equilibria are generic amongst local Nashequilibria. This means that the property of being a differential Stackelberg (respectively, differential Nash)equilibrium in a zero-sum game is generic in the class of zero-sum games defined by C ( X, R ) functions—that is, for almost all (in some formall sense) zero-sum games, all the local Stackelberg/Nash are differentialStackelberg/Nash. The differential characterization of Stackelberg equilibria was first introduced in (Fiez et al., 2019). The genericity andstructural structural stability were shown in (Fiez et al., 2020). 
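To make the second-order conditions in Definitions 3 and 4 and Proposition 2 concrete, the following minimal sketch (our own illustration, not code from the paper) checks them numerically from the Hessian blocks of a quadratic game at a critical point; the helper names and the two sets of blocks are hypothetical choices.

```python
import numpy as np

def schur_complement(A, B, D):
    """S(J) = D_1^2 f - D_12 f (D_2^2 f)^{-1} D_21 f, with D_21 f = (D_12 f)^T."""
    return A - B @ np.linalg.solve(D, B.T)

def is_differential_nash(A, D):
    """Second-order part of Definition 3: D_1^2 f > 0 and -D_2^2 f > 0."""
    return bool(np.all(np.linalg.eigvalsh(A) > 0) and np.all(np.linalg.eigvalsh(-D) > 0))

def is_differential_stackelberg(A, B, D):
    """Second-order part of Proposition 2: -D_2^2 f > 0 and S(J) > 0."""
    S = schur_complement(A, B, D)
    return bool(np.all(np.linalg.eigvalsh(-D) > 0) and np.all(np.linalg.eigvalsh(S) > 0))

# Hypothetical quadratic games with one-dimensional strategy spaces and critical point x* = (0, 0).
# Hessian blocks: A = D_1^2 f(x*), B = D_12 f(x*), D = D_2^2 f(x*).
nash_like = (np.array([[2.0]]), np.array([[1.0]]), np.array([[-3.0]]))
stackelberg_only = (np.array([[-3.0]]), np.array([[2.0]]), np.array([[-1.0]]))

for name, (A, B, D) in [("nash_like", nash_like), ("stackelberg_only", stackelberg_only)]:
    print(name,
          "DNE:", is_differential_nash(A, D),
          "DSE:", is_differential_stackelberg(A, B, D))
# Expected: nash_like is both a DNE and a DSE, while stackelberg_only is a DSE but not a DNE,
# which mirrors the subset relationship between the two equilibrium notions discussed above.
```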
2.2 Gradient-based learning algorithms

As noted above, in this paper we focus on settings in which agents or players in this game are seeking equilibria of the game via a learning algorithm. We study arguably the most natural learning rule in zero-sum continuous games: gradient descent-ascent (GDA).
This gradient-based learning rule is a simultaneous gradient play algorithm in that agents update their actions at each iteration simultaneously. Gradient descent-ascent is defined as follows. At iteration k, each agent i ∈ I updates their choice variable x_{i,k} ∈ X_i by the process

x_{i,k+1} = x_{i,k} − γ_i g_i(x_{i,k}, x_{−i,k})   (2)

where γ_i is agent i's learning rate, and g_i(x) is agent i's gradient-based update mechanism. For simultaneous gradient play,

g(x) = (D_1 f_1(x), D_2 f_2(x))   (3)

is the vector of individual gradients and, in a zero-sum setting, GDA is defined using g(x) = (D_1 f(x), −D_2 f(x)) where the first player is the minimizing player and the second player is the maximizing player.

One of the key contributions of this paper is that we provide convergence rates for settings in which there is a timescale separation between the learning processes of the minimizing and maximizing players—i.e., settings in which the agents' learning rates γ_i are not homogeneous. Define Γ = blkdiag(γ_1 I_{n_1}, γ_2 I_{n_2}) where I_{n_i} denotes the n_i × n_i identity matrix. Let τ = γ_2/γ_1 be the learning rate ratio and define Λ_τ = blkdiag(I_{n_1}, τ I_{n_2}). The τ-GDA dynamics are given by

x_{k+1} = x_k − γ_1 Λ_τ g(x_k).   (4)

Tools for Convergence Analysis.
We analyze the iteration complexity or local asymptotic rate of convergence of learning rules of the form (2) in the neighborhood of an equilibrium. Given two real valued functions F(k) and G(k), we write F(k) = O(G(k)) if there exists a positive constant c > 0 such that |F(k)| ≤ c|G(k)|. For example, consider iterates generated by (2) with initial condition x_0 and critical point x^*. Suppose that we show ‖x_{k+1} − x^*‖ ≤ M^k ‖x_0 − x^*‖. Then, we write F(k) = O(M^k) where c = ‖x_0 − x^*‖.

In this paper, we study learning rules employed by agents seeking game-theoretically meaningful equilibria in continuous games. Dynamical systems tools for both continuous and discrete time play a crucial role in this analysis.
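As an illustration of the τ-GDA update (4) and the kind of distance-to-critical-point measurement behind the rates described above, the following sketch runs τ-GDA on a hypothetical quadratic zero-sum game; the objective, the step size γ_1, the values of τ, and the function names are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def tau_gda(grad, x0, n1, gamma1, tau, iters=2000):
    """tau-GDA update (4): x_{k+1} = x_k - gamma_1 * Lambda_tau * g(x_k), where Lambda_tau
    scales the maximizing player's block of the vector of individual gradients by tau."""
    scale = np.concatenate([np.ones(n1), tau * np.ones(len(x0) - n1)])
    x = x0.astype(float)
    dist = [np.linalg.norm(x)]          # distance to the critical point x* = 0
    for _ in range(iters):
        x = x - gamma1 * scale * grad(x)
        dist.append(np.linalg.norm(x))
    return np.array(dist)

# Hypothetical quadratic game f(x_1, x_2) = -1.5*x_1**2 + 2*x_1*x_2 - 0.5*x_2**2, whose only
# critical point (0, 0) is a differential Stackelberg (but not Nash) equilibrium.
def g(x):
    x1, x2 = x
    return np.array([-3.0 * x1 + 2.0 * x2,      # D_1 f(x)
                     -(2.0 * x1 - 1.0 * x2)])   # -D_2 f(x)

x0 = np.array([1.0, 1.0])
for tau in [1.0, 5.0]:
    dist = tau_gda(g, x0, n1=1, gamma1=0.01, tau=tau)
    rate = (dist[-1] / dist[0]) ** (1.0 / (len(dist) - 1))   # empirical per-iteration rate
    print(f"tau = {tau}: ||x_k - x*|| after {len(dist) - 1} steps = {dist[-1]:.2e}, rate ~ {rate:.4f}")
# With these hypothetical choices, tau = 1 drifts away from the equilibrium while tau = 5
# contracts toward it geometrically, matching the O(M^k)-style rate notation above.
```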
Stability.
Before we proceed, we recall and remark on some facts from dynamical systems theory concern-ing stability of equilibria in the continuous-time dynamics˙ x = − g ( x ) (5)relevant to convergence analysis for the discrete-time learning dynamics in (2). Observe that equilibria areshared between (2) and (5). Our focus is on the subset of equilibria that satisfy Definition 4, and thesubset thereof defined in Definition 3. Recall the following equivalent characterizations of stability for anequilibrium of (5) in terms of the Jacobian matrix J ( x ) = Dg ( x ). Theorem 1 ((Khalil, 2002, Thm. 4.15)) . Consider a critical point x ∗ of g ( x ) . The following are equivalent:(a) x ∗ is a locally exponentially stable equilibrium of ˙ x = − g ( x ) ; (b) spec( − J ( x ∗ )) ⊂ C ◦− ; (c) there exists asymmetric positive-definite matrix P = P (cid:62) > such that P J ( x ∗ ) + J ( x ∗ ) (cid:62) P > . It was shown in (Ratliff et al., 2016, Prop. 2) that if the spectrum of − J ( x ) at a differential Nash equi-librium x is in the open left-half complex plane—i.e., spec( − J ( x )) ⊂ C ◦− —then x is a locally exponentiallystable equilibrium of (5). Indeed, if all agents learn at the same rate so Γ = γI n in (2), then a straightfor-ward application of the Spectral Mapping Theorem (Callier and Desoer, 2012, Thm. 4.7) ensures that anexponentially stable equilibrium x ∗ for (5) is locally exponentially stable in (2) so long as γ > γ or the resulting iteration complexity in an asymptotic or finite-time sense; furthermore, this line of reasoningdoes not apply when agents learn at different rates (Γ (cid:54) = γI in (2)).7 imiting dynamical systems. The continuous time dynamical system takes the form ˙ x = − Λ τ g ( x ) dueto the timescale separation τ . Such a system is known as a singularly perturbed system or a multi-timescalesystem in the dynamical systems theory literature (Kokotovic et al., 1986), particularly where τ − is small.Singularly perturbed systems are classically expressed as˙ x = − D f ( x, z ) (cid:15) ˙ z = − D f ( x, z ) (6)where (cid:15) = τ − is most often a physically meaningful quantity inherent to some dynamical system thatdescribes the evolution of some physical phenomena; e.g., in circuits it may be a constant related to devicematerial properties, and in communication networks, it is often the speed at which data flows through aphysical medium such as cable.In the classical asymptotic analysis of a system of the form (6)—which we write more generally as˙ x = F ( t, x, (cid:15) ) for the purpose of the following observations—the goal is to obtain an approximate solution,say ˜ x ( t, (cid:15) ), such that the approximation error x ( t, (cid:15) ) − ˜ x ( t, (cid:15) ) is small in some norm for small | (cid:15) | and, further,the approximate solution is expressed in terms of a reduced order system. Such results have significance interms of revealing underlying structural properties of the original system ˙ x = F ( t, x, (cid:15) ) and its correspondingstate x ( t, (cid:15) ) for small | (cid:15) | . One of the contributions of this work is that we take a similar analysis approach inorder to reveal underlying structural properties of the optimization landscape of zero-sum games/minimaxoptimization problems. Indeed, asymptotic methods can reveal multiple timescale structures that are in-herent in many machine learning problems, as we observe in this paper for zero-sum games. 
One keypoint of separation in applying dynamical systems theory to the study of algorithms versus physical systemdynamics—in particular, learning in games—this parameter no longer necessarily is a physical quantity but ismost often a hyper-parameter subject to design. In this paper, treating the inverse of the learning rate ratioas a timescale separation parameter, we combine the asymptotic analysis tools from singular perturbationtheory with tools from algebra to obtain convergence guarantees. Leveraging Linearization to Infer Qualitative Properties.
The Hartman-Grobman theorem assertsthat it is possible to continuously deform all trajectories of a nonlinear system onto trajectories of thelinearization at a fixed point of the nonlinear system. Informally, the theorem states that if the linearizationof the nonlinear dynamical system ˙ x = F ( x ) around a fixed point ¯ x —i.e., F (¯ x ) = 0—has no zero or purelyimaginary eigenvalues, then there exists a neighborhood U of ¯ x and a homeomorphism h : U → R n —i.e., h, h − ∈ C ( U, R n )—taking trajectories of ˙ x = F ( x ) and mapping them onto those of ˙ z = DF (¯ x ) z . Inparticular, h (¯ x ) = 0.Given a dynamical system ˙ x = F ( x ), the state or solution of the system at time t starting from x at time t is called the flow and is denoted φ t ( x ). Theorem 2 (Hartman-Grobman (Sastry, 1999, Thm. 7.3); (Teschl, 2000, Thm. 9.9)) . Consider the n -dimensional dynamical system ˙ x = F ( x ) with equilibrium point ¯ x . If DF (¯ x ) has no zero or purely imaginaryeigenvalues, there is a homeomorphism h defined on a neighborhood U of ¯ x taking orbits of the flow φ t to thoseof the linear flow e tDF (¯ x ) of ˙ x = F ( x ) —that is, the flows are topologically conjugate. The homeomorphismpreserves the sense of the orbits and is chosen to preserve parameterization by time. The above theorem says that the qualitative properties of the nonlinear system ˙ x = F ( x ) in the vicinity(which is determined by the neighborhood U ) of an isolated equilibrium ¯ x are determined by its linearizationif the linearization has no eigenvalues on the imaginary axes in the complex plane. We also remark thatHartman-Grobman can also be applied to discrete time maps (cf. Sastry (1999, Thm. 2.18)) with the samequalitative outcome. Internally Chain Transitivity.
In proving results for stochastic gradient descent-ascent, we leveragewhat is known as the ordinary differential equation method in which the flow of the limiting continuoustime system starting at sample points from the stochastic updates of the players actions is compared to asymptotic psuedo-trajectories —i.e., linear interpolations between sample points. To understand stabilityin the stochastic case, we need the notion of internally chain transitive sets. For more detail, the reader isreferred to (Alongi and Nelson, 2007, Chap. 2–3). 8 closed set U ⊂ R m is an invariant set for a differential equation ˙ x = F ( x ) if any trajectory x ( t )with x (0) ∈ U satisfies x ( t ) ∈ U for all t ∈ R . Let φ t be a flow on a metric space ( X, d ). Given ε > T > x, y ∈ X , an ( ε, T )-chain from x to y with respect to φ t and d is a pair of finite sequences x = x , x , . . . , x k − , x k = y in X and t , . . . , t k − in [ T, ∞ ), denoted together by ( x , x , . . . , x k − , x k ; t , . . . , t k − ),such that d ( φ t i ( x i ) , x i +1 ) < ε for i = 0 , , , . . . , k −
1. A set U ⊆ X is (internally) chain transitive withrespect to φ t if U is a non-empty closed invariant set with respect to φ t such that for each x, y ∈ U , (cid:15) > T > ε, T )-chain from x to y . A compact invariant set U is invariantly connected if itcannot be decomposed into two disjoint closed nonempty invariant sets. It is easy to see that every internallychain transitive set is invariantly connected. To characterize the convergence of τ - GDA , we begin by studying its continuous time limiting system˙ x = − Λ τ g ( x ) , (7)where we recall that Λ τ = blockdiag( I n , τ I n ). Throughout this section, the class of zero-sum games weconsider are sufficiently smooth, meaning that f ∈ C r ( X, R ) for some r ≥
2. The Jacobian of the system from (7) in zero-sum games of the form (f_1, f_2) = (f, −f) is given as

J_τ(x) = [ D_1^2 f(x)          D_{12} f(x)
           −τ D_{12}^⊤ f(x)    −τ D_2^2 f(x) ].   (8)

By analyzing the stability of the continuous time system as a function of the timescale separation τ using the Jacobian from (8) in this section, we can then draw conclusions about the stability and convergence of the discrete time system τ-GDA in Section 4.

The organization of this section is as follows. To begin, we present a collection of preliminary observations in Section 3.1 regarding the stability of continuous time gradient descent-ascent with timescale separation to motivate the results in the subsequent subsections by establishing known results and introducing alternative analysis methods that the technical results in this paper build on. Then, in Sections 3.2 and 3.3 respectively, we present necessary and sufficient conditions for stability of the continuous time system around critical points in terms of the learning rate ratio along with sufficient conditions to guarantee the instability of the continuous time system around non-equilibrium critical points in terms of the timescale separation.
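The following sketch (an illustration under hypothetical Hessian blocks, not code from the paper) assembles the Jacobian in (8) and evaluates the stability criterion used throughout this section, namely that spec(J_τ(x^*)) lies in the open right-half plane, equivalently that −J_τ(x^*) is Hurwitz.

```python
import numpy as np

def J_tau(A, B, D, tau):
    """Assemble the Jacobian (8) at a critical point from the Hessian blocks
    A = D_1^2 f(x*), B = D_12 f(x*), D = D_2^2 f(x*) of the zero-sum objective f."""
    return np.block([[A, B],
                     [-tau * B.T, -tau * D]])

def is_stable(A, B, D, tau):
    """True when all eigenvalues of J_tau(x*) have positive real part."""
    return bool(np.all(np.linalg.eigvals(J_tau(A, B, D, tau)).real > 0))

# Hypothetical Hessian blocks at a critical point with n_1 = 2 and n_2 = 1. Since A > 0 and
# -D > 0, the point is a differential Nash equilibrium of the associated quadratic game.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
B = np.array([[1.0],
              [-1.0]])
D = np.array([[-3.0]])

for tau in [0.5, 1.0, 4.0, 16.0]:
    eigs = np.linalg.eigvals(J_tau(A, B, D, tau))
    print(f"tau = {tau:5.1f}: Re(spec(J_tau)) = {np.sort(eigs.real)}, stable = {is_stable(A, B, D, tau)}")
# Consistent with Proposition 3 below, for a differential Nash equilibrium the real parts
# stay positive for every tau > 0.
```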
In Figure 1 we present a graphical representation of known results on the stability of gradient descent-ascent with timescale separation in continuous time, where we remark that such results nearly directly imply equivalent conclusions regarding the discrete time system τ-GDA with a suitable choice of learning rate γ_1. The primary focus of past work has been on the edge cases of τ = 1 and τ → ∞. For τ = 1, the set of differential Nash equilibria is stable, but differential Stackelberg equilibria may be stable or unstable, and non-equilibrium critical points can be stable. As τ → ∞, the set of differential Nash equilibria remains stable, each differential Stackelberg equilibrium is guaranteed to become stable, and each non-equilibrium critical point must be unstable. We fill the gap between the known results by providing results as a function of finite τ. With an eye toward this goal, we now provide examples and preliminary results that illustrate the type of guarantees that may be achievable for a range of finite learning rate ratios.

To start off, we consider the set of differential Nash equilibria. It is nearly immediate from the structure of the Jacobian that each differential Nash equilibrium is stable for τ = 1 (Daskalakis and Panageas, 2018; Mazumdar et al., 2020). Moreover, Jin et al. (2020) showed that regardless of the value of τ ∈ (0, ∞), the set of differential Nash equilibria remains stable. In other words, the desirable stability characteristics of differential Nash equilibria are retained for any choice of timescale separation. We state this result as a proposition for later reference and since our proof technique relies on the concept of quadratic numerical range (Tretter, 2008), which has not appeared previously in this context. The proof of Proposition 3 is provided in Appendix B.

Figure 1: Graphical representation of the known stability results on τ-GDA in relationship to local equilibrium concepts with τ = 1 (panel (a)) and τ → ∞ (panel (b)). The acronyms in the figure are differential Nash equilibria (DNE), differential Stackelberg equilibria (DSE), and locally asymptotically stable equilibria (LASE). Note that the terminology of locally asymptotically stable equilibria refers to the set of stable critical points with respect to the system ẋ = −Λ_τ g(x) for the given τ. Fiez et al. (2020) reported the subset relationship between differential Nash equilibria and differential Stackelberg equilibria and Jin et al. (2020) gave a similar characterization in terms of local Nash and local minmax. For the regime of τ = 1, Daskalakis and Panageas (2018); Mazumdar et al. (2020) presented the relationship between the set of differential Nash equilibria and the set of locally asymptotically stable equilibria, and Jin et al. (2020) provided the relationship between the set of differential Stackelberg equilibria and the set of locally asymptotically stable equilibria. Finally, Jin et al. (2020) reported the characterization of the locally asymptotically stable equilibria as τ → ∞. The missing pieces in the literature are results as a function of finite τ, which we answer in this work definitively.

Proposition 3.
Consider a zero-sum game ( f , f ) = ( f, − f ) defined by f ∈ C r ( X, R ) for some r ≥ .Suppose that x ∗ is a differential Nash equilibrium. Then, spec( − J τ ( x ∗ )) ⊂ C ◦− for all τ ∈ (0 , ∞ ) . Fiez et al. (2020) show that the set of differential Nash equilibrium is a subset of the set of differentialStackelberg equilibrium. In other words, any differential Nash equilibrium is a differential Stackelberg equi-librium, but a differential Stackelberg equilibrium need not be a differential Nash equilibrium. Moreover, Jinet al. (2020) show that the result of Proposition 3 fails to extend from differential Nash equilibria to thebroader class of differential Stackelberg equilibrium. Indeed, not all differential Stackelberg equilibrium arestable with respect to the continuous time limiting dynamics of gradient descent-ascent without timescaleseparation. However, as the following example demonstrates, differential Stackelberg equilibrium that areunstable without timescale separation can become stable for a range of finite timescale learning rate ratios.
Example 1.
Within the class of zero-sum games, there exists differential Stackelberg equilibrium that areunstable with respect to ˙ x = − g ( x ) and stable with respect to ˙ x = − Λ τ g ( x ) for all τ ∈ ( τ ∗ , ∞ ) where τ ∗ isfinite. Indeed, consider the quadratic zero-sum game defined by the cost f ( x , x ) = 12 (cid:20) x x (cid:21) (cid:62) − v − v v v − v − v v − v (cid:20) x x (cid:21) where x , x ∈ R and v > . The unique critical point of the game given by x ∗ = (0 , is a differentialStackelberg equilibrium since g ( x ∗ ) = 0 , S ( J ( x ∗ )) = diag( v, v/ > and − D f ( x ∗ ) = diag( v/ , v ) > .The spectrum of the Jacobian of Λ τ g ( x ) is given by spec( J τ ( x ∗ )) = (cid:110) v (2 τ + 1 ± √ τ − τ + 1)4 , v ( τ − ± √ τ − τ + 4)4 (cid:111) . Observe that for τ = 1 , spec( J τ ( x ∗ )) = { (3 ± i √ v, ( − ± i √ v } (cid:54)⊂ C ◦ + for any v > so that the differen-tial Stackelberg equilibrium x ∗ is never stable for the choice of τ . However, for any v > , spec( J τ ( x ∗ )) ⊂ C ◦ + for all τ ∈ (2 , ∞ ) , meaning that the differential Stackelberg equilibrium x ∗ is indeed stable with respect to thedynamics ˙ x = − Λ τ g ( x ) for a range of finite learning rate ratios.
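The same phenomenon can be reproduced numerically on a small quadratic game of our own construction (a hypothetical game in the spirit of Example 1, not the example above): sweeping τ and inspecting the spectrum of J_τ(x^*) locates the finite threshold beyond which the differential Stackelberg equilibrium becomes stable.

```python
import numpy as np

def J_tau(A, B, D, tau):
    """Jacobian (8) at a critical point, from Hessian blocks A = D_1^2 f, B = D_12 f, D = D_2^2 f."""
    return np.block([[A, B],
                     [-tau * B.T, -tau * D]])

# Hypothetical quadratic game: f(x_1, x_2) = -x_1**2 + 3*x_1*x_2 - 0.5*x_2**2, so A = -2, B = 3,
# D = -1 at x* = (0, 0). The origin is a differential Stackelberg equilibrium (-D = 1 > 0 and
# S = A - B D^{-1} B^T = 7 > 0) but not a differential Nash equilibrium (A < 0).
A = np.array([[-2.0]]); B = np.array([[3.0]]); D = np.array([[-1.0]])

def min_real_part(tau):
    return np.linalg.eigvals(J_tau(A, B, D, tau)).real.min()

taus = np.linspace(0.5, 5.0, 46)                       # grid with spacing 0.1
stable = np.array([min_real_part(t) > 0 for t in taus])
print(f"min Re(spec(J_1)) = {min_real_part(1.0):.3f}  ->  unstable at tau = 1")
print(f"first stable tau on the grid: {taus[np.argmax(stable)]:.1f}")
# For this 2x2 Jacobian, det(J_tau) = 7*tau > 0 and tr(J_tau) = tau - 2, so the exact stability
# threshold is tau = 2: the equilibrium is stable precisely for tau in (2, infinity).
```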
We explore Example 1 further via simulations in Section 6.1. The key takeaway from Example 1 is that it is clearly not always necessary for the timescale separation τ to approach infinity in order to guarantee the stability of a differential Stackelberg equilibrium and instead there exists a sufficient finite learning rate ratio. Put simply, the undesirable property of differential Stackelberg equilibria not being stable with respect to gradient descent-ascent without timescale separation can potentially be remedied with only a finite timescale separation.

It is well-documented that some stable critical points of the continuous time gradient descent-ascent limiting dynamics without timescale separation can lack game-theoretic meaning, as they may be neither differential Nash equilibria nor differential Stackelberg equilibria (Daskalakis and Panageas, 2018; Jin et al., 2020; Mazumdar et al., 2020). The following example demonstrates that such undesirable critical points that are stable without timescale separation can become unstable for a range of finite learning rate ratios. Example 2.
Within the class of zero-sum games, there exists non-equilibrium critical points that are stablewith respect to ˙ x = − g ( x ) and unstable with respect to ˙ x = − Λ τ g ( x ) for all τ ∈ ( τ , ∞ ) where τ is finite.Indeed, consider a zero sum game defined by the cost f ( x , x ) = 12 (cid:20) x x (cid:21) (cid:62) v v − v v v v v − v (cid:20) x x (cid:21) (9) where x , x ∈ R and v > . The unique critical point of the game given by x ∗ = (0 , is neither adifferential Nash equilibrium nor a differential Stackelberg equilibrium since D f ( x ∗ ) = diag( v/ , − v/ ≯ and − D f ( x ∗ ) = diag( − v/ , v/ ≯ . The spectrum of the Jacobian of Λ τ g ( x ) is given by spec( J τ ( x ∗ )) = (cid:110) v (2 τ − ± √ τ − τ + 1)8 , v (2 − τ ± √ τ − τ + 4)8 (cid:111) . Given τ = 1 , spec( J τ ( x ∗ )) = { (1 ± i √ v, (1 ± i √ v } ⊂ C ◦ + for any v > so that the non-equilibriumcritical point x ∗ is in fact stable for the choice of timescale separation τ . However, for any v > , spec( J τ ( x ∗ )) (cid:54)⊂ C ◦ + for all τ ∈ (2 , ∞ ) , meaning that the non-equilibrium critical point x ∗ is unstable withrespect to the dynamics ˙ x = − Λ τ g ( x ) for a range of finite learning rate ratios.The game construction from (9) is quadratic and as a result has a unique critical point. Games can beconstructed in which critical points lacking game-theoretic meaning that are stable without timescale separa-tion become unstable for all τ > τ even in the presence of multiple equilibria. Indeed, consider a zero-sumgame defined by the cost f ( x , x ) = (cid:0) x + 2 x x + x − x + 2 x x − x (cid:1) ( x − + x (cid:0) (cid:80) i =1 ( x i − − ( x i − (cid:1) . (10) This game has critical points at (0 , , , , (1 , , , , and ( − . , . , − . , . . The critical points (1 , , , and ( − . , . , − . , . are differential Nash equilibria and are consequently stable for anychoice of τ > . The critical point x ∗ = (0 , , , is neither a differential Nash equilibrium nor a differentialStackelberg equilibrium. Moreover, the Jacobian of Λ τ g ( x ∗ ) for the game defined by (9) with v = 5 is identicalto that for the game defined by (10) . As a result, we know that x ∗ is stable without timescale separation,but spec( J τ ( x ∗ )) (cid:54)⊂ C ◦ + for all τ ∈ (2 , ∞ ) so that the non-equilibrium critical point x ∗ is again unstable withrespect to the dynamics ˙ x = − Λ τ g ( x ) for a range of finite learning rate ratios. We investigate the game defined in (10) from Example 2 with simulations in Section 6.2. In an analogousmanner to Example 1, Example 2 demonstrates that it is not always necessary for the timescale separation τ to approach infinity in order to guarantee non-equilibrium critical points become unstable as there canexist a sufficient finite learning rate ratio. This is to say that the unwanted property of non-equilibriumcritical points being stable without timescale separation can also potentially be remedied with only a finitetimescale separation.The examples of this section have provided evidence that there exists a range of finite learning rateratios for which differential Stackelberg equilibrium are stable and a range of learning rate ratios for which11on-equilibrium critical points are unstable. Yet, no result has appeared in the literature on gradientdescent-ascent with timescale separation confirming this behavior in general. We focus on doing preciselythat in the subsection that follows. Before doing so, we remark on the closest existing result. As mentionedpreviously Jin et al. 
(2020) show that as τ → ∞ , the set of stable critical points with respect to the dynamics˙ x = − Λ τ g ( x ) coincide with the set of differential Stackelberg equilibrium. However, an equivalent result inthe context of general singularly perturbed systems has been known in the literature (cf. Kokotovic et al.1986, Chap. 2). We give a proof based on this type of analysis because it reveals a new set of analysis toolsto the study of game-theoretic formulations of machine learning and optimization problems; a proof sketchis given below while the full proof is given in Appendix F. Proposition 4.
Consider a zero-sum game ( f , f ) = ( f, − f ) defined by f ∈ C r ( X, R ) for some r ≥ .Suppose that x ∗ is such that g ( x ∗ ) = 0 and det( D f ( x ∗ )) (cid:54) = 0 . Then, as τ → ∞ , spec( J τ ( x ∗ )) ⊂ C ◦ + if andonly if x ∗ is a differential Stackelberg equilibrium.Proof Sketch. The basic idea in showing this result is that there is a (local) transformation of coordinatesfrom the linearized dynamics of ˙ x = − Λ τ g ( x ), which we write as˙ x = (cid:20) A A − τ A (cid:62) τ A (cid:21) x, in a neighborhood of a critical point to an upper triangular system that depends parametrically on τ andhence, the asymptotic behavior is readily obtainable from the block diagonal components of the system inthe new coordinates. Indeed, consider the change of variables z = x + L ( τ − ) x for the second player sothat (cid:20) ˙ x ˙ z (cid:21) = (cid:20) A − A L ( τ − ) A R ( L, τ ) A + τ − L ( τ − ) A (cid:21) (cid:20) x z (cid:21) (11)where R ( L, τ ) = − A (cid:62) − A L ( τ − ) + τ − L ( τ − ) A − τ − L ( τ − ) A L ( τ − ) = 0A transformation of coordinates L ( τ ) such that R ( L, τ ) = 0 always exists (cf. Lemma 7, Appendix F). Hence,the characteristic equation of (11) can be expressed as χ ( s, τ ) = τ n χ s ( s, τ ) χ f ( p, τ ) = 0where χ s ( s, τ ) = det( sI − ( A − A L ( τ − ))) and χ f ( p, τ ) = det( pI − ( A + τ − A L ( τ − ))) with p = sτ − .As τ → ∞ , L ( τ − ) → L (0) = − A − A (cid:62) . Consequently, n of the eigenvalues of ˙ x = − Λ τ g ( x ), denoted by { λ , . . . , λ n } , are the roots of the slow characteristic equation χ s ( s, τ ) = 0 and the rest of the eigenvalues { λ n +1 , . . . , λ n + n } are denoted by λ i = ν j /ε for i = n + j and j ∈ { , . . . , n } where { ν , . . . , ν n } are theroots of the fast characteristic equation χ f ( p, τ ) = 0. The roots of χ s ( s, τ ) are precisely those of the (first)Schur complement of − J τ ( x ∗ ) while the roots of χ f ( p, τ ) are precisely those of D f ( x ∗ ).This simple transformation of coordinates to an upper triangular dynamical system shown in (11) leadsimmediately to the asymptotic result in Proposition 4. It also shows that if the eigenavlues of S ( J τ ( x ∗ ))are distinct and similarly, so are those of D f ( x ∗ ) (although, S ( J τ ( x ∗ )) and D f ( x ∗ ) are allowed to haveeigenvalues in common), then the asymptotic results from Proposition 4 imply the following approximationsfor the elements of spec( J τ ( x ∗ )): λ i = λ i ( S ( J τ ( x ∗ )) + O ( τ − ) , i = 1 , . . . , n ,λ j + n = τ ( λ j ( − D f ( x ∗ )) + O ( τ − )) , j = 1 , . . . , n . This follows simply by observing that when the eigenvalues are distinct, the derivatives ds/dτ and dp/dτ arewell-defined by the implicit mapping theorem and the total derivative of χ s ( s, τ ) and χ f ( p, τ ), respectively. Distinct eigenvalues is a generic property in the space of n × n real matrices. .2 Necessary and Sufficient Conditions for Stability The proof of Proposition 4 provides some intuition for the next result, which is one of our main contri-butions. Indeed, as shown in Kokotovic et al. (1986, Chap. 2), as τ → ∞ the first n eigenvalues of˙ x = − Λ τ g ( x ) tend to fixed positions in the complex plane defined by the eigenvalues of − S = − ( D f ( x ∗ ) − D f ( x ∗ )( D f ( x ∗ )) − D (cid:62) f ( x ∗ )), while the remaining n eigenvalues tend to infinity, with the linear rate τ , along as asymptotes defined by the eigenvalues of D f ( x ∗ ). The asymptotic splitting of the spectrumprovides some intuition for the following result. Theorem 3.
Consider a zero-sum game ( f , f ) = ( f, − f ) defined by f ∈ C r ( X, R ) for some r ≥ . Supposethat x ∗ is such that g ( x ∗ ) = 0 and S ( J ( x ∗ )) and D f ( x ∗ ) are non-singular. There exists a τ ∗ ∈ (0 , ∞ ) such that spec( − J τ ( x ∗ )) ⊂ C ◦− for all τ ∈ ( τ ∗ , ∞ ) if and only if x ∗ is a differential Stackelberg equilibrium. Before getting into the proof sketch, we provide some intuition for the construction of τ ∗ and along theway revive an old analysis tool from dynamical systems theory which turns out to be quite powerful inanalyzing stability properties of parameterized systems. Construction of τ ∗ . There is still the question of how to construct such a τ ∗ and do so in a way thatis as tight as possible. Recall Theorem 1 which states that a matrix is exponentially stable if and only ifthere exists a symmetric positive definite P = P (cid:62) > P J τ ( x ∗ ) + J (cid:62) τ ( x ∗ ) P >
0. The operator L ( P ) = J (cid:62) τ ( x ∗ ) P + P J τ ( x ∗ ) is known as the Lyapunov operator. Given a positive definite Q = Q (cid:62) > − J τ ( x ∗ ) is stable if and only if there exists a unique solution P = P (cid:62) to(( J (cid:62) τ ( x ∗ ) ⊗ I ) + ( I ⊗ J (cid:62) τ ( x ∗ )))vec( P ) = ( J (cid:62) τ ( x ∗ ) ⊕ J (cid:62) τ ( x ∗ ))vec( P ) = vec( Q ) (12)where ⊗ and ⊕ denote the Kronecker product and Kronecker sum, respectively. The existence of a uniquesolution P occurs if and only if J (cid:62) τ and − J (cid:62) τ have no eigenvalues in common. Hence, using the factthat eigenvalues vary continuously, if we imagine varying τ and examining the eigenvalues of the map( J (cid:62) τ ( x ∗ ) ⊕ J (cid:62) τ ( x ∗ )), this will tell us the range of τ for which spec( − J τ ( x ∗ )) remains in C ◦− .This method of varying parameters and determining when the roots of a polynomial (or correspondingly,the eigenvalues of a map) cross the boundary of a domain uses what is known as a guardian or guard map (cf. Saydy et al. (1990)). In particular, the guard map provides a certificate that the roots of a polynomiallie in a particular guarded domain for a range of parameter values. Formally, let X be the set of all n × n real matrices or the set of all polynomials of degree n with real coefficients. Consider S an open subset of X with closure ¯ S and boundary ∂ S . The map ν : X → C is said to be a guardian map for S if for all x ∈ ¯ S , ν ( x ) = 0 ⇐⇒ x ∈ ∂ S . Consider an open subset Ω of the complex plane that is symmetric with respect to the real axis (e.g., theopen left-half complex plane C ◦− ). Then, elements of S (Ω) = { A ∈ R n × n : spec( A ) ⊂ Ω } are said to bestable relative to Ω. Given a pathwise connected subset U of R , a domain S (Ω) and a guard map ν , it isknown that the family { A ( τ ) : τ ∈ U } is stable relative to Ω if and only if ( i ) it is nominally stable—i.e., A ( τ ) ∈ S (Ω) for some τ ∈ U —and ( ii ) ν ( A ( τ )) (cid:54) = 0 for all τ ∈ U (Saydy et al., 1990, Prop. 1). To buildintuition for the guard map, consider the scalar game on R so that the Jacobian at a critical point has thestructure J τ ( x ) = (cid:20) a b − τ b τ d (cid:21) . (13)It is known that a critical point x is stable if det( J τ ( x )) > J τ ( x )) >
0. Thus, it is fairly easy tosee that ν : A (cid:55)→ det( A ) tr( A ) is a guard map for the 2 × S ( C ◦− ). Now, the traceoperator can be generalized using a bialternate product , which is denoted A (cid:12) B for matrices A and B anddefined by A (cid:12) B = ( A ⊗ B + B ⊗ A ) so that 2( A (cid:12) I ) = A ⊕ A (cf. Govaerts (2000, Sec. 4.4.4)). For2 × J τ ( x ) in (13), 2( A (cid:12) I )) = a + τ d = tr( A ). Hence, ν : A (cid:55)→ det( A ) det(2( A (cid:12) I ))generalizes the map ν : A (cid:55)→ det( A ) tr( A ) to an n × n matrix. Replacing A with the parameterized family See Lancaster and Tismenetsky (1985); Magnus (1988) for more detail on the definition and properties of these mathematicaloperators, and Appendix E for more detail directly related to their use in this paper. () R e ( s pe c ( ( JJ ) )) C n × n S ( C ◦− ) − J τ ( x ∗ ) ∂ S ( C ◦− ) ν ( τ ) ν ( τ ∗ ) = 0 Figure 2: Guard map ν ( τ ) and real parts of the eigenvalues of the vectorized Lyapunov operator − ( J τ ( x ∗ ) (cid:1) J τ ( x ∗ )) using the reduction via the duplication matrix for the quadratic example given in Example 1. Thelargest real positive root of ν ( τ ) is τ ∗ = 2 and in the right plot, we see that all the real parts of the eigenvaluesof the Lyapunov operator are negative indicating stability. The right most graphic is a cartoon visualizationof the guard map method: the outer grey region represents C n × n , the blue region represents the Hurwitzstable n × n matrices S ( C ◦− ), the green region represents the parameterized class of matrices { J τ ( x ∗ ) } τ , andthe curve cutting through the regions is the guard map ν ( τ ). The goal is to fine the subset of { J τ ( x ∗ ) } τ that lie within S ( C ◦− ), which can be done by reducing the problem to finding the roots of ν ( τ ).of matrices − J τ ( x ∗ ), we have the guard map ν ( τ ) = det( − J τ ( x ∗ )) det(2( − J τ ( x ∗ ) (cid:12) I )). It is fairly easy tosee that this polynomial in τ also guards the open left-half complex plane C ◦− ; details are given in the fullproof in Appendix E. In fact, using the Schur complement formula,det( − J τ ( x ∗ )) = τ n det( D f ( x ∗ )) det( − S )so that if x ∗ is a non-degenerate (a condition implied by the hyperbolicity of x ∗ for S and D f ( x ∗ )),det( − J τ ( x ∗ )) does not change the properties of the guard map. In particular, the values of τ ∈ (0 , ∞ ) where ν ( τ ) = 0 does not depend on det( − J τ ( x ∗ )). Hence, we can use the reduced guard map ν ( τ ) = det(2( − J τ ( x ∗ ) (cid:12) I )) = det( − J τ ( x ∗ ) ⊕ ( − J τ ( x ∗ )) . Reflecting back to (12), we see that this guard map in τ is closely related to the vectorization of the Lyapunovoperator and of course, this is not a coincidence. For any symmetric positive definite Q = Q (cid:62) >
there will be a symmetric positive definite solution P = P^⊤ > 0 to
[−(J_τ^⊤(x*) ⊕ J_τ^⊤(x*))]vec(P) = vec(−Q)
if and only if the operator −(J_τ(x*) ⊕ J_τ(x*)) is non-singular. In turn, this is equivalent to det(−(J_τ(x*) ⊕ J_τ(x*))) ≠ 0. Hence, to find the range of τ for which, given any Q = Q^⊤ > 0,
the solution switches from a positive definite P = P^⊤ > 0 to a P = P^⊤ < 0, we need to find the values of τ such that ν(τ) = det(−(J_τ(x*) ⊕ J_τ(x*))) = 0—i.e., where it hits the boundary ∂S(C◦−). Proof Sketch for Theorem 3.
The ‘necessary’ direction follows directly from the above observation, while the ‘sufficiency’ direction follows by construction. We leverage the guard map as described above to construct τ*. Define ⊡ as an operator that generates an n(n+1)/2 × n(n+1)/2 matrix from a matrix A ∈ R^{n×n} such that A ⊡ A = H_n^+ (A ⊕ A) H_n where H_n^+ = (H_n^⊤ H_n)^{-1} H_n^⊤ is the (left) pseudo-inverse of H_n, a full column rank duplication matrix (cf. Appendix C) which maps an n(n+1)/2-dimensional vector to the n²-dimensional vector generated by applying vec(·) to a symmetric matrix; it is designed to respect the vectorization map vec(·). It is fairly straightforward to see that the Kronecker sum A ⊕ A = A ⊗ I + I ⊗ A has spectrum {λ_j + λ_i} where λ_i, λ_j ∈ spec(A). The operator A ⊡ A is simply a more computationally efficient expression of A ⊕ A, and as such the eigenvalues of A ⊡ A are those of A ⊕ A with redundancies removed. We use A ⊡ A specifically because of its computational advantages in computing τ*. (Intuition can also be gained by examining the map ν(τ) = det(−(J_τ(x*) ⊗ J_τ(x*))); however, this does not produce a tight estimate of τ* and requires more computation due to the redundancies from symmetries in the subcomponents of −J_τ(x*).)
We show in Lemma 6 (Appendix C) that ν : A ↦ det(A ⊡ A) guards the set of n × n Hurwitz stable matrices S(C◦−). We then extend this guard map to the parametric guard map ν(τ) = det(−(J_τ(x*) ⊡ J_τ(x*))). Indeed, if we consider the subset of the family of matrices parameterized by τ that lies in S(C◦−), then for any τ such that −J_τ(x*) is in this subset, we have that ν(τ) = 0 if and only if −(J_τ(x*) ⊡ J_τ(x*)) is singular if and only if −J_τ(x*) ∈ ∂S(C◦−). This shows that ν(τ) guards the space of n × n Hurwitz stable matrices S(C◦−). The map ν(τ) defines a polynomial in τ and, to determine the range of τ such that spec(−J_τ(x*)) ⊂ C◦−, we need to find the values of τ such that ν(τ) = 0. Towards this end, the guard map ν(τ) can be further decomposed by applying the Schur determinant formula to −(J_τ(x*) ⊡ J_τ(x*)). This gives rise to a polynomial of the form
ν(τ) = τ^{n₂(n₂+1)/2} det(D²₂₂f(x*) ⊡ D²₂₂f(x*)) det(S(J(x*)) ⊡ S(J(x*))) det(τI − B)
for some square matrix B. Hence, the problem of determining the values of τ such that ν(τ) = 0 (i.e., where the polynomial meets the boundary of S(C◦−)) is reduced to an eigenvalue problem in τ for the matrix B. This value of τ is precisely the value τ* and, since it is derived from an eigenvalue problem, it is precisely τ* = λ_max^+(B) where λ_max^+(·) is the largest positive real eigenvalue of its argument if one exists and is otherwise zero (meaning that the matrix −J_τ(x*) is stable for all τ ∈ (0, ∞)). The expression for the matrix B is given in the full proof contained in Appendix E. Corollary 1.
Consider a zero-sum game G = (f, −f) with f ∈ C^r(X, R) for some r ≥ 2. Suppose that the assumptions of Theorem 3 hold and that the set of differential Stackelberg equilibria, denoted DSE(G), is finite. Let τ* = max_{x* ∈ DSE(G)} τ(x*) where τ(x*) is the value of τ obtained via Theorem 3 for each individual critical point x* ∈ DSE(G). Then, for all τ ∈ (τ*, ∞) and x* ∈ DSE(G), spec(−J_τ(x*)) ⊂ C◦−.
In short, selecting the maximum value of τ* over the finite set of equilibria guarantees that the local linearization of ẋ = −Λ_τ g(x) around any differential Stackelberg equilibrium is stable, and hence, the nonlinear system is locally stable around each of these critical points. New algebraic tools for analysis at the intersection of game theory and machine learning.
Before moving on, we remark on the utility of the algebraic tools we use in the proof for Theorem 3.Indeed, the guard map concept is extremely powerful for understanding stability of parameterized familiesof dynamical systems, and it is not limited to single parameter families. Hence, there is potential to extendthe above results to games with more than two players or additional parameters. In fact, we do exactlythis in Section 5 where we present results for GANs trained with gradient-penalty type regularizers for thediscriminator. Moreover, it is fairly easy to construct analogous guard maps for non-zero sum games. Manyof the tools and constructions readily extend. We leave these results to a different paper so as to not createtoo much clutter in the present work.
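To illustrate how the construction above can be carried out numerically, the following is a minimal sketch that computes τ* for the 2 × 2 Jacobian structure in (13) by forming the reduced guard map ν(τ) = det(−(J_τ ⊡ J_τ)) with the duplication matrix and locating its largest positive real root. The curvature values a, b, d are hypothetical and chosen so that the critical point is a differential Stackelberg equilibrium that is not a differential Nash equilibrium; for this structure the root can also be checked by hand against the 2 × 2 stability condition det(J_τ) > 0 and tr(J_τ) > 0.

    import numpy as np

    # Hypothetical curvatures for the scalar game Jacobian in (13):
    # a = D^2_11 f(x*), b = D^2_12 f(x*), d = -D^2_22 f(x*).
    a, b, d = -1.0, 2.0, 1.0   # d > 0 and a + b**2/d > 0: Stackelberg but not Nash

    def J(tau):
        return np.array([[a, b], [-tau * b, tau * d]])

    def duplication(n):
        # Duplication matrix H_n with vec(P) = H_n vech(P) for symmetric P.
        cols = []
        for j in range(n):
            for i in range(j, n):
                E = np.zeros((n, n))
                E[i, j] = E[j, i] = 1.0
                cols.append(E.reshape(-1, order="F"))
        return np.array(cols).T

    H = duplication(2)
    H_plus = np.linalg.pinv(H)   # equals (H^T H)^{-1} H^T since H has full column rank

    def guard(tau):
        # Reduced guard map nu(tau) = det(-(J_tau ⊡ J_tau)) via the duplication matrix.
        Jt = J(tau)
        ksum = np.kron(Jt, np.eye(2)) + np.kron(np.eye(2), Jt)   # J_tau ⊕ J_tau
        return np.linalg.det(-(H_plus @ ksum @ H))

    taus = np.linspace(1e-3, 20.0, 20001)
    vals = np.array([guard(t) for t in taus])
    roots = taus[1:][np.sign(vals[1:]) != np.sign(vals[:-1])]
    tau_star = roots.max() if roots.size else 0.0
    print("tau* ~", tau_star)   # for these values tau* = -a/d = 1

    # Cross-check: -J_tau is Hurwitz for tau > tau*, not for tau < tau*.
    for t in [0.5 * tau_star, 2.0 * tau_star]:
        print(t, np.all(np.real(np.linalg.eigvals(-J(t))) < 0))

The sign change of ν(τ) at the largest positive real root is exactly the crossing of ∂S(C◦−) described above; using the reduced operator rather than the raw Kronecker sum is what makes the sign change visible, since the duplicate eigenvalues of A ⊕ A would otherwise produce a root of even multiplicity.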
Note that Theorem 3 also implies that for any stable spurious critical point—meaning a critical point that is neither a differential Nash nor a differential Stackelberg equilibrium—there is no finite τ* such that spec(−J_τ(x*)) ⊂ C◦− for all τ ∈ (τ*, ∞). In particular, there exists at least one finite, positive value of τ such that spec(−J_τ(x*)) ⊄ C◦− since the only critical point attractors are differential Stackelberg equilibria for large enough finite τ. We can extend this result to address the question of whether or not there exists a finite learning rate ratio such that for all larger learning rate ratios −J_τ(x*) has at least one eigenvalue with strictly positive real part, thereby implying that x* is unstable.
Theorem 4 (Instability of spurious critical points). Consider a zero-sum game (f₁, f₂) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is any stable critical point of ẋ = −g(x) which is not a differential Stackelberg equilibrium. There exists a finite learning rate ratio τ₀ ∈ (0, ∞) such that spec(−J_τ(x*)) ⊄ C◦− for all τ ∈ (τ₀, ∞).
Proof Sketch. The full proof is provided in Appendix G. The proof leverages the fact that a nonlinear system is unstable if its linearization is unstable, meaning that the linearization has at least one eigenvalue with strictly positive real part. In our setting, this can be shown by leveraging the Lyapunov equation and Lemma 5, which states that if S(−J(x*)) has no eigenvalues with zero real part, then there exist matrices P₁ = P₁^⊤ and Q₁ = Q₁^⊤ > 0 such that P₁S(−J(x*)) + S(−J(x*))^⊤P₁ = Q₁ where P₁ and S(−J(x*)) have the same inertia—i.e., the number of eigenvalues with positive, negative, and zero real parts, respectively, are the same. An analogous statement applies to −D²₂₂f(x*) with matrices P₂ = P₂^⊤ and Q₂ = Q₂^⊤ > 0. From here, we construct a matrix P that is congruent to blockdiag(P₁, P₂) and a matrix Q_τ such that −PJ_τ(x*) − J_τ^⊤(x*)P = Q_τ. Since P and blockdiag(P₁, P₂) are congruent, Sylvester’s law of inertia implies that they have the same number of eigenvalues with positive, negative, and zero real parts, respectively, so that in turn P has at least one eigenvalue with strictly negative real part. We then construct τ₀ via an eigenvalue problem such that for all τ > τ₀, Q_τ > 0. Applying Lemma 5 again, we get that J_τ(x*) has at least one eigenvalue with strictly negative real part so that spec(−J_τ(x*)) ⊄ C◦− for all τ > τ₀.
Unlike τ* in Theorem 3, τ₀ in Theorem 4 is not tight in the sense that −J_τ(x*) may become unstable for τ < τ₀. The reason for this is that there are potentially many matrices P₁ and Q₁ that satisfy P₁S(−J(x*)) + S(−J(x*))^⊤P₁ = Q₁ such that S(−J(x*)) and P₁ have the same inertia; an analogous statement holds for P₂, Q₂, and −D²₂₂f(x*). The choice of these matrices impacts the value of τ₀. Hence, the question of finding the exact value of τ beyond which a spurious stable critical point for 1-GDA is unstable remains open.
In this section, we derive convergence guarantees for τ-GDA to differential Stackelberg equilibria in both the deterministic (i.e., where agents have oracle access to their individual gradients) and the stochastic (i.e., where agents have an unbiased estimator of their individual gradient) settings.
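Before stating the results, a minimal sketch of the deterministic τ-GDA update may help fix ideas. The objective below is hypothetical: f(x₁, x₂) = −(1/2)x₁² + 2x₁x₂ − (1/2)x₂² has a unique critical point at the origin which is a differential Stackelberg equilibrium but not a differential Nash equilibrium, and (matching the 2 × 2 structure in (13) with a = −1, b = 2, d = 1) the critical point is stable for the continuous-time dynamics only when τ > 1; the learning rate is likewise illustrative.

    import numpy as np

    def grads(x1, x2):
        # Partial gradients of the hypothetical objective
        # f(x1, x2) = -0.5*x1**2 + 2*x1*x2 - 0.5*x2**2.
        return -x1 + 2 * x2, 2 * x1 - x2

    def tau_gda(tau, gamma=0.01, steps=20000, x0=(1.0, 1.0)):
        # tau-GDA: player 1 descends f with stepsize gamma,
        # player 2 ascends f with stepsize tau * gamma.
        x1, x2 = x0
        for _ in range(steps):
            d1, d2 = grads(x1, x2)
            x1, x2 = x1 - gamma * d1, x2 + tau * gamma * d2
        return np.hypot(x1, x2)

    for tau in [0.5, 1.0, 2.0, 8.0]:
        print(f"tau = {tau:4.1f}: ||x_k - x*|| after 20000 steps = {tau_gda(tau):.2e}")
    # The distance shrinks for tau in {2, 8} (tau > 1) and does not for tau in {0.5, 1}.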
As a corollary to Theorem 3, we first show that the discrete time τ - GDA update is locally asymptoticallystable for a range of learning rates γ .We need the following lemma to prove asymptotic convergence as well as the subsequent results onconvergence rates. Lemma 1.
Consider a zero-sum game (f₁, f₂) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is a differential Stackelberg equilibrium and that, given τ > 0, spec(−J_τ(x*)) ⊂ C◦−. Let γ̄ = min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|². For any γ ∈ (0, γ̄), τ-GDA converges locally asymptotically.
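Numerically, the learning rate bound γ̄ in Lemma 1 is immediate to evaluate once the Jacobian at the equilibrium is known. The sketch below uses an illustrative Jacobian with the 2 × 2 structure of (13) (hypothetical entries, τ = 2) and confirms that the local linearization I − γJ_τ(x*) is a contraction for γ below the bound and not above it.

    import numpy as np

    tau = 2.0
    J_tau = np.array([[-1.0, 2.0], [-tau * 2.0, tau * 1.0]])  # spec(J_tau) lies in the open right-half plane

    eigs = np.linalg.eigvals(J_tau)
    gamma_bar = np.min(2 * np.real(eigs) / np.abs(eigs) ** 2)  # bound from Lemma 1

    def spectral_radius(gamma):
        return np.max(np.abs(np.linalg.eigvals(np.eye(2) - gamma * J_tau)))

    print("gamma_bar =", gamma_bar)
    print("rho(I - 0.5*gamma_bar*J_tau) =", spectral_radius(0.5 * gamma_bar))  # < 1: contraction
    print("rho(I - 1.5*gamma_bar*J_tau) =", spectral_radius(1.5 * gamma_bar))  # > 1: no contraction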
Corollary 2 (Asymptotic convergence of τ-GDA). Suppose the assumptions of Theorem 3 hold so that x* is a critical point of g and S(J(x*)) and D²₂₂f(x*) are non-singular. There exists a τ* ∈ (0, ∞) such that τ-GDA with γ ∈ (0, γ̄(τ)), where γ̄(τ) = min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|², converges locally asymptotically for all τ ∈ (τ*, ∞) if and only if x* is a differential Stackelberg equilibrium.
In addition to showing asymptotic convergence, we also provide an asymptotic convergence rate. To prove the main theorems on convergence rates for both differential Stackelberg and differential Nash equilibria, we use a common argument which is summarized in the lemma below.
Lemma 2.
Consider a zero-sum game (f₁, f₂) = (f, −f) defined by f ∈ C^r(X, R) for some r ≥ 2. Suppose that x* is a differential Stackelberg equilibrium and that, given τ, spec(−J_τ(x*)) ⊂ C◦−. Let γ̄ = min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|², and λ_m = arg min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|². For any α ∈ (0, γ̄), τ-GDA with learning rate γ = γ̄ − α converges locally asymptotically at a rate of O((1 − α/β)^{k/2}) where β = (2Re(λ_m) − α|λ_m|²)^{-1}.
The full proof of the above lemma is provided in Appendix C, and a short proof sketch with the main ideas is summarized below.
Proof Sketch.
Consider a zero-sum game (f, −f) as articulated in the lemma statement where x* is either a differential Stackelberg or differential Nash equilibrium and fix any τ such that spec(J_τ(x*)) ⊂ C◦+.
Figure 3: The inner maximization problem in (14) is over a finite set spec(J_τ(x*)) = {λ₁, . . . , λ_n} where J_τ(x*) ∈ R^{n×n}. As γ increases away from zero, each |1 − γλ_i| shrinks in magnitude. The last λ_i such that 1 − γλ_i hits the boundary of the unit circle in the complex plane—i.e., |1 − γλ_i| = 1—gives us the optimal value of γ̄ = 2Re(λ_m)/|λ_m|² = 2cos(θ)/|λ_m| and the element of spec(J_τ(x*)) that achieves it (see blue arrows).
For the discrete time dynamical system x_{k+1} = x_k − γΛ_τ g(x_k), it is well known that if γ is chosen such that ρ(I − γJ_τ(x*)) < 1, then x_k locally asymptotically converges to x* (cf. Proposition 6, Appendix A). With this in mind, we formulate an optimization problem to find the upper bound γ̄ on the learning rate γ such that for all γ ∈ (0, γ̄), the local linearization of the discrete time map is a contraction, which is precisely ρ(I − γJ_τ(x*)) < 1. The optimization problem is given by
γ̄ = min_{γ > 0} { γ : max_{λ ∈ spec(J_τ(x*))} |1 − γλ| = 1 }.   (14)
The intuition is as follows. The inner maximization problem is over a finite set spec(J_τ(x*)) = {λ₁, . . . , λ_n} where J_τ(x*) ∈ R^{n×n}. As γ increases away from zero, each |1 − γλ_i| shrinks in magnitude. The last λ_i such that 1 − γλ_i hits the boundary of the unit circle in the complex plane (i.e., |1 − γλ_i| = 1) gives us the optimal value of γ̄ and the element of spec(J_τ(x*)) that achieves it. Examining the constraint, we have that for each λ_i, γ(γ|λ_i|² − 2Re(λ_i)) ≤ 0 since γ > 0. As noted, this constraint will be tight for one of the λ_i, in which case γ = 2Re(λ_i)/|λ_i|² since γ > 0. Hence, by selecting γ̄ = min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|², we have that |1 − γλ| < 1 for all λ ∈ spec(J_τ(x*)) and any γ ∈ (0, γ̄).
From here, we let λ_m = arg min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|² so that, by fixing any α ∈ (0, γ̄) and defining β = (2Re(λ_m) − α|λ_m|²)^{-1} and γ = γ̄ − α, we obtain the claimed convergence rate by standard arguments in numerical analysis. In particular, we apply an argument along the lines of Proposition 6 in Appendix A. Indeed, given the τ-GDA update, ρ(I − γJ_τ(x*)) = 1 − η < 1 for η ∈ (0, 1) implies there exists a neighborhood U of x* such that if τ-GDA is initialized in U, then ‖x_k − x*‖ ≤ (1 − η/2)^{k/2}‖x_0 − x*‖, thereby giving the desired rate. The full details on this part of the argument can be found in Appendix C.
The above lemma provides a convergence rate given a differential Stackelberg equilibrium x* and a learning rate ratio τ such that x* is stable with respect to the dynamics ẋ = −Λ_τ g(x). The following theorem—which uses Lemma 2 in its proof—characterizes the iteration complexity for τ-GDA. Specifically, the result leverages Theorem 3 to construct a finite τ* ∈ (0, ∞) such that −J_τ(x*) is (Hurwitz) stable, and then for any τ ∈ (τ*, ∞), Lemma 2 implies a local asymptotic convergence rate. Theorem 5.
Consider a zero-sum game (f₁, f₂) = (f, −f) defined by f ∈ C^r(X, R) for r ≥ 2 and let x* be a differential Stackelberg equilibrium of the game. There exists a τ* ∈ (0, ∞) such that for any τ ∈ (τ*, ∞) and α ∈ (0, γ̄), τ-GDA with learning rate γ = γ̄ − α converges locally asymptotically at a rate of O((1 − α/β)^{k/2}) where γ̄ = min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|², λ_m = arg min_{λ ∈ spec(J_τ(x*))} 2Re(λ)/|λ|², and β = (2Re(λ_m) − α|λ_m|²)^{-1}. Moreover, if x* is a differential Nash equilibrium, τ* = 0 so that for any τ ∈ (0, ∞) and α ∈ (0, γ̄), τ-GDA with γ = γ̄ − α converges with a rate O((1 − α/β)^{k/2}).
Proof. To prove this result, we apply Theorem 3 to construct τ* via the guard map ν(τ) = det(−J_τ(x*) ⊡ (−J_τ(x*))) such that for all τ ∈ (τ*, ∞), spec(J_τ(x*)) ⊂ C◦+. This guarantees that spec(−J_τ(x*)) ⊂ C◦− for any τ ∈ (τ*, ∞) and hence the nonlinear dynamical system ẋ = −Λ_τ g(x) is locally asymptotically (in fact, exponentially) stable by the Hartman-Grobman theorem (cf. Theorem 2). Therefore, for any τ ∈ (τ*, ∞), by Lemma 2, τ-GDA converges with a rate of O((1 − α/β)^{k/2}). Remark 2.
We note that as τ → ∞, the eigenvalues of J_τ(x*) become real. In fact, there exists a finite value of τ ∈ (τ*, ∞) after which spec(J_τ(x*)) ⊂ R. Hence, there is an opportunity to take advantage of momentum-based approaches when timescale separation via τ is introduced. This may provide some explanation for why the heuristics of timescale separation or unrolled updates in combination with momentum work in practice when training generative adversarial networks. We believe there may be potential to precisely characterize when the eigenvalues of the Jacobian become purely real as a function of τ by constructing a suitable guard map and this is a direction of future work.
Theorem 5 directly implies a finite time convergence guarantee for obtaining an ε-differential Stackelberg equilibrium—i.e., a point within an ε-ball around a differential Stackelberg equilibrium x*. Corollary 3.
Given ε > 0, under the assumptions of Theorem 5, τ-GDA obtains an ε-differential Stackelberg equilibrium in ⌈(β/α) log(‖x_0 − x*‖/ε)⌉ iterations for any x_0 ∈ B_δ(x*) with δ = α/(4Lβ) where L is the local Lipschitz constant of I − γJ_τ(x*).
Comments on computing the neighborhood B_δ(x*). We note that we have essentially given a proof that there exists a neighborhood on which τ-GDA converges. Of course, due to the non-convexity of the problem in general, this neighborhood could be arbitrarily small. We provide an estimate of the neighborhood size using the local Lipschitz constant of the local linearization I − γJ_τ(x*). One way to better understand the size of this neighborhood is to use Lyapunov analysis, a tool which is well explored in singular perturbation theory (Kokotovic et al., 1986). In particular, Lyapunov methods can be applied directly to the nonlinear system if one can construct Lyapunov functions for the fast and slow subsystems individually—also known as the boundary layer model and reduced order model. With these Lyapunov functions in hand, one can “stitch” the two together (via convex combination) and show under some reasonable assumptions that this combined function is a Lyapunov function for the overall singularly perturbed system. The benefit of this analysis is that the Lyapunov function gives one an estimate of the region of attraction (via, e.g., the level sets); however, it is not easy to construct a Lyapunov function for a nonlinear system in general. We leave expanding to such methods to future work. Comments on avoiding saddle points.
Before turning to the stochastic setting, we comment on saddle point avoidance in the deterministic setting. It was shown by Mazumdar et al. (2020) that gradient-based learning in continuous games with heterogeneous learning rates avoids saddles on all but a set of measure zero initializations. Hence, τ-GDA avoids saddles for almost every initialization. We also know that all differential Nash equilibria are locally asymptotically stable for zero-sum settings. Hence, there are no differential Nash equilibria that are saddle points of the dynamics ẋ = −Λ_τ g(x). On the other hand, as Example 1 shows, there are differential Stackelberg equilibria which correspond to saddle points of the dynamics for some choices of τ—in particular, τ = 1 in that example. Theorem 3 and Corollary 1, however, imply that for a given zero-sum game (or minmax problem), there exists a finite τ* such that all locally asymptotically stable equilibria are differential Stackelberg equilibria. Hence, an ‘almost sure’ saddle point avoidance result together with the local convergence guarantee provided by Theorem 5 provides a strong characterization of long-run learning behavior.
Neither avoidance of saddles nor the if and only if convergence guarantee of Theorem 3 is, however, enough to ensure avoidance of limit cycles. In fact, it is known that limit cycles can exist in zero-sum games (Daskalakis et al., 2018; Mazumdar et al., 2020). Understanding when such complex phenomena exist in games and determining how to ascribe meaning to the behavior is an active area of study (see, e.g., the work of Papadimitriou and Piliouras (2019)).
In this section, we analyze convergence when players do not have oracle access to their gradients but instead have an unbiased estimator in the presence of zero mean, finite variance noise. Specifically, we show that the agents will converge locally asymptotically almost surely to a differential Stackelberg equilibrium. The stochastic form of the update is given by
x_{k+1} = x_k − γ_k(Λ_τ g(x_k) + w_{k+1})   (15)
where w_{k+1} is a zero mean, finite variance random variable and {γ_k} is the learning rate sequence. Assumption 1.
The stochastic process {w_k} is a martingale difference sequence with respect to the increasing family of σ-fields defined by
F_k = σ(x_ℓ, w_ℓ, ℓ ≤ k), ∀ k ≥ 0,
so that E[w_{k+1} | F_k] = 0 almost surely (a.s.) for all k ≥ 0. Moreover, w_k is square-integrable so that, for some constant C > 0,
E[‖w_{k+1}‖² | F_k] ≤ C(1 + ‖x_k‖²) a.s., ∀ k ≥ 0.
We note that this assumption has been relaxed in the literature (cf. Thoppe and Borkar (2019)); however, for simplicity, we state the theorem with the most accessible criteria. We remark below, in the paragraph on extensions to concentration bounds, on the nature of the relaxed assumptions.
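For concreteness, the following is a minimal simulation sketch of the stochastic update (15): the objective is the same hypothetical quadratic used in the earlier sketches, the additive Gaussian noise is a martingale difference sequence with bounded second moments (so Assumption 1 holds), the stepsize sequence γ_k = 1/(k + 1) is square summable but not summable, and the timescale separation τ = 4 is an illustrative value above the stability threshold for this toy problem.

    import numpy as np

    rng = np.random.default_rng(0)
    tau, sigma = 4.0, 0.5                       # hypothetical timescale separation and noise level
    Lam = np.diag([1.0, tau])                   # Lambda_tau

    def g(x):
        # g(x) = (D_1 f(x), -D_2 f(x)) for f(x1, x2) = -0.5*x1**2 + 2*x1*x2 - 0.5*x2**2.
        x1, x2 = x
        return np.array([-x1 + 2 * x2, -(2 * x1 - x2)])

    x = np.array([2.0, -1.0])
    for k in range(100000):
        gamma_k = 1.0 / (k + 1)                 # square summable but not summable
        noise = sigma * rng.standard_normal(2)  # zero-mean, finite-variance gradient noise
        x = x - gamma_k * (Lam @ g(x) + noise)
    print("iterate after 100000 steps:", x)     # close to the equilibrium x* = (0, 0)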
Theorem 6.
Consider a zero-sum game (f, −f) such that f ∈ C^r(X, R) for some r ≥ 2. Suppose that Assumption 1 holds and that {γ_k} is square summable but not summable—i.e., Σ_k γ_k² < ∞, yet Σ_k γ_k = ∞. For any τ ∈ (0, ∞), the sequence {x_k} generated by (15) converges to a, possibly sample path dependent, internally chain transitive invariant set of ẋ = −Λ_τ g(x). Moreover, if x* is a differential Stackelberg equilibrium, then there exists a finite τ* ∈ (0, ∞) such that {x_k} almost surely converges locally asymptotically to x* for every τ ∈ (τ*, ∞).
Proof. The convergence of {x_k} to a, possibly sample path dependent, compact connected internally chain transitive invariant set of ẋ = −Λ_τ g(x) follows from classical results in stochastic approximation theory (cf. Borkar (2008, Chap. 2); Benaim (1996)). Suppose that x* is a differential Stackelberg equilibrium. By Theorem 3, there exists a finite τ* ∈ (0, ∞) such that for all τ ∈ (τ*, ∞), x* is a locally exponentially stable equilibrium of the continuous time dynamics ẋ = −Λ_τ g(x)—that is, spec(−J_τ(x*)) ⊂ C◦− for all τ ∈ (τ*, ∞). Fix an arbitrary τ ∈ (τ*, ∞). Since spec(−J_τ(x*)) ⊂ C◦−, det(−J_τ(x*)) ≠ 0 so that x* is an isolated critical point. Furthermore, exponential stability of x* implies that there exists a (local) Lyapunov function defined on a neighborhood of x* by the converse Lyapunov theorem (cf. Sastry (1999, Thm. 5.17) or Krasovskii (1963, Thm. 4.3)). Let U be the neighborhood of x* on which the local Lyapunov function is defined, such that U contains no other critical points (which is possible since x* is isolated). That is, let Φ : U → [0, ∞) be the local Lyapunov function defined on U where x* ∈ U, Φ is positive definite on U, and for all x ∈ U, (d/dt)Φ(x) ≤ 0 with (d/dt)Φ(z) = 0 for z ∈ U if and only if Φ(z) = 0. By Corollary 3 (Borkar, 2008, Chap. 2), {x_k} converges to an internally chain transitive invariant set contained in U almost surely. The only internally chain transitive invariant set in U is x*. Corollary 4.
Consider a zero-sum game ( f, − f ) such that f ∈ C ( X, R ) . Suppose that Assumption 1 holdsand that { γ k } is square summable but not summable—i.e., (cid:80) k γ k < ∞ , yet (cid:80) k γ k = ∞ . If there exists afinite τ ∗ ∈ (0 , ∞ ) such that spec( − J τ ( x ∗ )) ⊂ C ◦− for all τ ∈ ( τ ∗ , ∞ ) , then x ∗ is a differential Stackelbergequilibrium and { x k } almost surely converges locally asymptotically to x ∗ . in the stochastic setting, the result requires time varying learning rates with a sufficient separationin timescale. Specifically, the players need to be using learning rate sequences { γ i,k } for each i ∈ { , } suchthat (without loss of generality) not only is it assumed that γ ,k = o ( γ ,k ), but also (cid:80) k γ ,k + γ ,k < ∞ and (cid:80) k γ i,k = ∞ for each i ∈ { , } . The challenge with these assumptions on the learning rate sequences is thatempirically the sequences that satisfy them result in poor behavior along the learning path such as gettingstuck at saddle points or making no progress. This is, in essence, due to the fact that the faster player—i.e.,player 2 if γ ,k = o ( γ ,k )—equilibriates too quickly causing progress to stall. This can result in undesirablebehavior such as vanishing gradients (so that the discriminator does not provide enough information for thegenerator to make progress), mode collapse, or failure to converge in practical applications such as generativeadversarial networks.On the other hand, our convergence result gives a similar guarantee with less restrictive requirements onthe stepsize sequence. In particular, only a single stepsize sequence is required (so that the algorithm canbe viewed as a single timescale stochastic approximation update) as long as the fast player (who, withoutloss of generality, is player 2 in this paper) scales their estimated gradient by τ ∈ ( τ ∗ , ∞ ) where τ ∗ is as inTheorem 3. Extensions to concentration bounds and relaxed assumptions on stepsizes.
It is possible toobtain concentration bounds and even finite time, high probability guarantees on convergence leveragingrecent advances in stochastic approximation (Borkar, 2008; Kamal, 2010; Thoppe and Borkar, 2019). Toour knowledge, the concentration bounds in (Thoppe and Borkar, 2019) require the weakest assumptionson learning rates—e.g., the stepsize sequence { γ k } needs only to satisfy (cid:80) k γ k = ∞ , lim k →∞ γ k = 0, and (cid:80) k γ k ≤
1. Specifically, since it is assumed, for the zero sum game ( f, − f ), that f ∈ C ( X, R ) and x ∗ is adifferential Stackelberg equilibrium, Theorem 3 implies that x ∗ is a locally asymptotically stable attractorof ˙ x = − Λ τ g ( x ) for arbitrary fixed τ ∈ ( τ ∗ , ∞ ), and hence, the concentration bounds in Theorem 1.1 and1.2 of (Thoppe and Borkar, 2019) directly apply.Furthermore, we note that in applications such as generative adversarial networks, while it has beenobserved that timescale separation heuristics such as unrolling or annealing the stepsize of the discriminatorwork well, in the stochastic case, summmable/square-summable assumptions on stepsizes are generally toorestrictive in practice since they lead to a rapid decay in the stepsize which, in turn, can stall progress.On the other hand, stepsize sequences such as γ k = 1 / ( k + 1) β for β ∈ (0 , β , while also maintaining the guarantees of the theoretical results. We state aconvergence guarantee under these relaxed assumptions in Proposition 8 which is contained in Appendix J. In this section, we focus on generative adversarial networks with regularization. Specifically, using the theorydeveloped so far, we extend the results in Mescheder et al. (2018) to provide a convergence guarantee for arange of regularization parameters and learning rate ratios.As has been repeatedly observed in recent theoretical works on generative adversarial networks, thetraining dynamics of generative adversarial networks are not well understood even though we have seenimpressive practical advances over the last few years. In an attempt to address this, recent works—e.g.,(Berard et al., 2020; Fiez et al., 2020; Mescheder et al., 2017; Nagarajan and Kolter, 2017) amongst others—study the optimization landscape of generative adversarial networks through the lens of dynamical systemstheory which provides analysis tools for convergence based on the eigen-structure of the local linearization ofthe learning dynamics. Nagarajan and Kolter (2017) show, under suitable assumptions, that gradient-basedmethods for training generative adversarial networks are locally convergent assuming the data distributionsare absolutely continuous. However, as observed by Mescheder et al. (2018), such assumptions not only maynot be satisfied by many practical generative adversarial network training scenarios such as natural images,but it can often be the case that the data distribution is concentrated on a lower dimensional manifold. Thelatter characteristic leads to nearly purely imaginary eigenvalues and highly ill-condition problems. To date it has not been shown that for a sufficient separation in timescale the only critical point attractors are local minmax. f ( θ, ω ) = E p ( z ) [ (cid:96) (D(G( z ; θ ); ω ))] + E p D ( x ) [ (cid:96) ( − D( x ; ω ))]where D ω ( x ) and G θ ( z ) are discriminator and generator networks, respectively, and p D ( x ) is the data distri-bution while p ( z ) is the latent distribution. The goal of the generator is to minimize the above loss and thediscriminator to maximize it. The mapping (cid:96) ∈ C ( R ) is some real-value function; e.g., a common choice is (cid:96) ( x ) = − log(1 + exp( − x )) as in the original generative adversarial network paper (Goodfellow et al., 2014).To motivate both regularization and time-scale separation, consider the following prototypical example of aDirac-GAN. Example: Dirac-GAN.
The Dirac-GAN is arguably as simple an example as one can construct thatposses interesting and non-trivial degeneracies . The parameter space for each the generator and discriminatoris just R —that is, the generator and discriminator have scalar “network”. Definition 5.
The Dirac-GAN consists of a univariate generator distribution p_θ = δ_θ and a linear discriminator D(x; ω) = ω^⊤x. The true data distribution p_D is given by a Dirac-distribution concentrated at zero. The objective of the Dirac-GAN is f(θ, ω) = ℓ(ωθ) + ℓ(0).
Several choices exist for the mapping ℓ, including ℓ(t) = −log(1 + exp(−t)) which leads to the Jensen-Shannon divergence between p_θ and p_D. As described in (Mescheder et al., 2018), when training a Dirac-GAN with vanilla GDA, the dynamics are oscillatory. This can immediately be verified by examining the eigenvalues of the Jacobian at the (unique) equilibrium (θ*, ω*) = (0, 0), which are purely imaginary, taking on values ±iℓ'(0) since
J(0) = [ 0  ℓ'(0) ; −ℓ'(0)  0 ].
It is observed in Lemma 3.1 of Mescheder et al. (2018) that GDA will not generally converge even with unrolling of the discriminator, a proxy for timescale separation. We observe, however, that this is because the equilibrium is not hyperbolic, and hence it is not structurally stable (Broer and Takens, 2010). In fact, introducing regularization remedies these issues. Under reasonable generative adversarial network assumptions, we show that the introduction of a gradient penalty based regularization to the discriminator does not change the set of critical points for the dynamics and, further, for any learning rate ratio τ ∈ (0, ∞) and any positive, finite regularization parameter µ, the continuous time limiting regularized learning dynamics remain stable, and hence, there is a range of learning rates γ for which the discrete time update locally converges asymptotically. Introducing Regularization.
Gradient penalties ensure that the discriminator cannot create a non-zero gradient which is orthogonal to the data manifold without suffering a loss. Introduced by Roth et al. (2017) and refined in Mescheder et al. (2018), we consider training generative adversarial networks with one of two fairly natural gradient penalties used to regularize the discriminator:
R₁(θ, ω) = (µ/2) E_{p_D(x)}[‖∇_x D(x; ω)‖²],   (16)
R₂(θ, ω) = (µ/2) E_{p_θ(x)}[‖∇_x D(x; ω)‖²],   (17)
where, by a slight abuse of notation, ∇_x(·) denotes the partial gradient with respect to x of the argument (·) when the argument is the discriminator D(·; ω), in order to prevent any confusion with the notation D(·) which we use elsewhere for derivatives.
Also following Mescheder et al. (2018), we use relaxed assumptions—as compared to the work by Nagarajan and Kolter (2017)—which allow us to consider generative adversarial networks with data distributions that do not (locally) have the same support and hence are concentrated on lower dimensional manifolds, a commonly observed phenomenon in practice (Arjovsky et al., 2017). To state these assumptions, we need some additional notation. Let h₁(θ) = E_{p_θ(x)}[∇_ω D(x; ω)|_{ω=ω*}] and h₂(ω) = E_{p_D(x)}[|D(x; ω)|² + ‖∇_x D(x; ω)‖²]. Define the reparameterization manifolds M_G = {θ : p_θ = p_D} and M_D = {ω : h₂(ω) = 0} and let T_{θ*}M_G and T_{ω*}M_D denote their respective tangent spaces at θ* and ω*. Assumption 2.
Consider training a generative adversarial network via a zero-sum game defined by f ∈ C²(R^{m₁} × R^{m₂}, R) where G(·; θ) and D(·; ω) are the generator and discriminator networks, respectively, and x = (θ, ω) ∈ R^{m₁} × R^{m₂}. Suppose that x* = (θ*, ω*) is an equilibrium.
a. At (θ*, ω*), p_{θ*} = p_D and D(x; ω*) = 0 in some neighborhood of supp(p_D).
b. The function ℓ ∈ C²(R) satisfies ℓ'(0) ≠ 0 and ℓ''(0) < 0.
c. There are ε-balls B_ε(θ*) and B_ε(ω*) centered around θ* and ω*, respectively, so that M_G ∩ B_ε(θ*) and M_D ∩ B_ε(ω*) define C¹-manifolds. Moreover, (i) if w ∉ T_{θ*}M_G, then w^⊤ ∇_θ h₁(θ*) w ≠ 0, and (ii) if v ∉ T_{ω*}M_D, then v^⊤ ∇²_ω h₂(ω*) v ≠ 0.
Recalling the observations in Mescheder et al. (2018) used for explaining the last of the three assumptions, Assumption 2.c(i) implies that the discriminator is capable of detecting deviations from the generator distribution in equilibrium, and Assumption 2.c(ii) implies that the manifold M_D is sufficiently regular and, in particular, its (local) geometry is captured by the second (directional) derivative of h₂.
The Jacobian of the regularized dynamics, for either j = 1 or 2, is of the form
J_(τ,µ)(θ, ω) = [ D²₁₁f(θ, ω)  D²₁₂f(θ, ω) ; −τ D²₁₂f(θ, ω)^⊤  τ(−D²₂₂f(θ, ω) + D²_ω R_j(θ, ω)) ]   (18)
where we use the subscript (τ, µ) to denote the parameter dependence on the learning rate ratio τ and regularization parameter µ. For shorthand, we often replace (θ, ω) simply with the variable x. It is straightforward to compute the block components of the Jacobian. Observe that Assumption 2.a implies that D²₁₁f(θ*, ω*) is identically zero, and hence x* = (θ*, ω*) is never a differential Nash equilibrium. However, we show that x* is not only a differential Stackelberg equilibrium, but also characterize the learning rate ratio and regularization parameter range for which x* is (locally) stable with respect to τ-GDA and give a convergence rate.
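Before stating the result, the following minimal sketch instantiates the Jacobian structure (18) for the Dirac-GAN of Definition 5 with ℓ(t) = −log(1 + exp(−t)), for which both penalties (16) and (17) reduce to R_j(θ, ω) = (µ/2)ω². The values of τ and µ swept below are hypothetical; the point of the sketch is that the unregularized Jacobian (µ = 0) has purely imaginary eigenvalues, while for any µ > 0 the matrix −J_(τ,µ) at the equilibrium is Hurwitz stable for every τ > 0.

    import numpy as np

    ell_prime_0 = 0.5   # derivative at zero of ell(t) = -log(1 + exp(-t))

    def J(tau, mu):
        # Regularized Dirac-GAN Jacobian at (theta*, omega*) = (0, 0):
        # D^2_11 f = 0, D^2_12 f = ell'(0), and the discriminator block is tau*mu.
        return np.array([[0.0, ell_prime_0],
                         [-tau * ell_prime_0, tau * mu]])

    for mu in [0.0, 0.1, 1.0]:
        for tau in [1.0, 4.0]:
            eigs = np.linalg.eigvals(-J(tau, mu))
            hurwitz = bool(np.all(np.real(eigs) < -1e-9))
            print(f"mu = {mu}, tau = {tau}: spec(-J) = {np.round(eigs, 3)}, Hurwitz: {hurwitz}")
    # mu = 0: purely imaginary eigenvalues, so no convergence for any tau;
    # mu > 0: strictly negative real parts for every tau > 0.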
Theorem 7.
Consider training a generative adversarial network via a zero-sum game with generator network G θ , discriminator network D ω , and loss f ( θ, ω ) with regularization R j ( θ, ω ) (for either j = 1 or j = 2 ) andany regularization parameter µ ∈ (0 , ∞ ) such that Assumption 2 is satisfied for an equilibrium x ∗ = ( θ ∗ , ω ∗ ) of the regularized dynamics. Then, x ∗ = ( θ ∗ , ω ∗ ) is a differential Stackelberg equilibrium. Furthermore, forany τ ∈ (0 , ∞ ) , spec( − J ( τ,µ ) ( x ∗ )) ⊂ C ◦− —i.e., − J ( τ,µ ) ( x ∗ ) is Hurwitz stable.Proof Sketch. To proof that x ∗ is a differential Stackelberg equilibrium follows analogous arguments to thosein the proof of Theorem 4.1 in Mescheder et al. (2018). Given any positive regularization parameter µ , toprove the stability of x ∗ for any fixed τ ∈ (0 , ∞ ), we leverage the concept of the quadratic numerical rangeof a block operator which is a superset of the spectrum of the operator (cf. Appendix A). The key for botharguments is that Assumption 2 implies that the Jacobian of the regularized game has a specific structure.Indeed, observe that the structural form of J ( τ,µ ) ( x ∗ ) is J ( τ,µ ) ( x ∗ ) = (cid:20) B − τ B (cid:62) τ ( C + µR ) (cid:21) where B = D f ( x ∗ ), C = − D f ( x ∗ ) and R = D R j ( x ∗ ). This follows from Assumption 2-a., which impliesthat D( x ; ω ∗ ) = 0 in some neighborhood of supp( p D ) and hence, ∇ x D( x ; ω ∗ ) = 0 and ∇ x D( x ; ω ∗ ) = 0 for x ∈ supp( p D ). In turn, we have that D f ( x ∗ ) = 0.From here, it is straightforward to argue that B is full rank and C + µR > T θ ∗ M G (byAssumption 2-c). Together these observations imply that x ∗ is a differential Stackelberg equilibrium. Then,arguments analogous to those in the proof of Proposition 3 (via the quadratic numerical range) implystability. 22 orollary 5. Under the assumptions of Theorem 7, consider any fixed µ ∈ (0 , µ ] and τ ∈ (0 , ∞ ) , andlet γ = min λ ∈ spec( J ( τ,µ ) ( x ∗ )) λ ) / | λ | , and λ m = arg min λ ∈ spec( J ( τ,µ ) ( x ∗ )) λ ) / | λ | . For any α ∈ (0 , γ ) , τ - GDA with learning rate γ = γ − α converges locally asymptotically at a rate of O ((1 − α β ) k/ ) where β = (2Re( λ m ) − α | λ m | ) − . The proof of the above corollary follows a similar line of reasoning as the proof of Corollary 2.Theorem A.7 of Mescheder et al. (2018) shows that matrices of the form − J = (cid:20) − BB (cid:62) − C (cid:21) (19)are stable if B is full rank and C >
0. The following proposition provides necessary conditions on the sizesof the network architectures for the discriminator and generator network for stability.
Proposition 5.
Consider training a generative adversarial network via a zero-sum game with generatornetwork G θ , discriminator network D ω , and loss f ( θ, ω ) with regularization R j ( θ, ω ) (for some j ∈ { , } )such that Assumption 2 is satisfied for an equilibrium x ∗ = ( θ ∗ , ω ∗ ) . Independent of the learning rate ratioand the regularization parameter µ , for x ∗ to be stable it is necessary that the dimension of the discriminatornetwork parameter vector is at least half as large as the corresponding generator network parameter vector—i.e., n ≥ n / where θ ∈ R n and ω ∈ R n . The intuition for the why this proposition should hold follows immediately from observing the structureof the Jacobian: for any matrix of the form (19), at least one eigenvalue will be purely imaginary if n < n / B ∈ R n × n and C ∈ R n × n . The full proof is contained in Appendix I. We now present extensive numerical experiments examining gradient descent-ascent with timescale sepa-ration. As we explored theoretically so far, the stability of gradient descent-ascent critical points has anintricate relationship with timescale separation. We begin to investigate this behavior empirically by simu-lating the gradient descent-ascent dynamics for the games from Examples 1 and 2 and examining how thespectrum of the game Jacobian evolves as a function of the timescale separation. Then, on a polynomialgame, we demonstrate how timescale separation warps the vector field of gradient descent-ascent and con-sequently shapes the region of attraction around critical points in the optimization landscape. There are anumber of both qualitative and quantitative theoretical questions that remain open related to characterizingthe region of attraction and how it depends parameterically on τ .After exploring the optimization landscape, we focus in on gradient descent-ascent in the Dirac-GANgame and illustrate the interplay between timescale separation, regularization, and rate of convergence.Finally, we train generative adversarial networks on the CIFAR-10 and CelebA datasets with regulariza-tion and show timescale separation can significantly improve stability and performance. Moreover, wefind that several of the insights we draw from the Dirac-GAN game carry over to this complex setting.Appendix L contains several more experimental results including a generative adversarial network for-mulation to learn a covariance matrix and a torus game. The code for our experiments is available at github.com/fiezt/Finite-Learning-Ratio . We now revisit the game from Example 1 that demonstrated there exists differential Stackelberg equilibriumthat are unstable for choices of the timescale separation τ . To be clear, we repeat the game constructionand some characteristics of the game. Let us consider the quadratic zero-sum game defined by the cost f ( x , x ) = 12 (cid:20) x x (cid:21) (cid:62) − v − v v v − v − v v − v (cid:20) x x (cid:21) (20)23 a) (b) (c)(e) (f) (g) Figure 4: Experimental results for the quadratic game defined in (20) of Section 6.1 and presented inExample 1. Figures 4a and 4b show trajectories of the players coordinate pairs ( x , x ) and ( x , x )for a range of learning rate ratios, respectively. Figures 4c shows the distance from the equilibrium alongthe learning paths. Figures 4e, 4f, and 4g show the trajectories of the eigenvalues, the real parts of theeigenvalues, and the imaginary parts of the eigenvalues for the J τ ( x ∗ ) as a function of the τ , respectively.where x , x ∈ R and v >
0. The unique critical point of the game given by x ∗ = ( x ∗ , x ∗ ) = (0 ,
0) is adifferential Stackelberg equilibrium. The spectrum of the Jacobian evaluated at the equilibrium is given byspec( J τ ( x ∗ )) = (cid:110) v (2 τ + 1 ± √ τ − τ + 1)4 , v ( τ − ± √ τ − τ + 4)4 (cid:111) . As mentioned in Example 1, it turns out that spec( J τ ( x ∗ ) ⊂ C ◦ + only when τ ∈ (2 , ∞ ). We remark that wecomputed τ ∗ using the theoretical construction from Theorem 3 and found that it recovered the precise valueof τ ∗ = 2 such that the equilibrium is stable for all τ ∈ ( τ ∗ , ∞ ) with respect to the dynamics ˙ x = − Λ τ g ( x ).In the experiments that follow, we consistently observe that the construction of τ ∗ from the theory is tight.For this experiment, we select v = 4 and simulate τ - GDA from the initial condition ( x , x ) = (5 , , , γ = 0 . τ ∈ { , . , , , } . In Figures 4a and 4b, we show the trajectories of the playerscoordinate pairs ( x , x ) and ( x , x ), respectively. We observe that τ - GDA cycles around the equilibriumwith τ = 2 since it is marginally stable with respect to the dynamics. For τ ∈ (2 , ∞ ), the equilibrium isstable and τ - GDA ends up converging to it at a rate that depends on the choice of τ . We demonstrate howthe convergence rate depends on the choice of τ in Figure 4c by showing the distance from the equilibriumalong the learning path for each of the trajectories. The primary observation is that the cyclic behavior of τ - GDA dissipates as τ grows and as a result the dynamics then rapidly converge to the equilibrium.The behavior of the learning dynamics as a function of the timescale separation τ can be further explainedby evaluating the eigenvalues of the game Jacobian at the equilibrium. We show the eigenvalues of theJacobian at the equilibrium in several forms in Figures 4e, 4f, and 4g. Analyzing the spectrum, we areable to verify that for all τ ∈ (2 , ∞ ) the equilibrium is indeed stable. Moreover, we see that the imaginaryparts of the conjugate pairs of eigenvalues decay after τ = 1 and τ = 6, and then the eigenvalues of the24 a) (b)(c)(e) (f) Figure 5: Experimental results for the polynomial game defined in (21) of Section 6.2 and presented inExample 2. Figures 5a and 5b show trajectories of the players coordinate pairs ( x , x ) and ( x , x ) fora range of learning rate ratios, respectively. Figures 5d, 5e, and 5f show the trajectories of the eigenvalues,the real parts of the eigenvalues, and the imaginary parts of the eigenvalues for J τ ( x ∗ ) as a function of the τ , respectively where x ∗ is the non-equilibrium critical point.conjugate pairs eventually become purely real at τ = 1 .
87 and τ = 11 .
66, respectively. After the eigenvaluesof a conjugate pair become purely real, they split so that one of the eigenvalues asymptotically convergesto an eigenvalue of S ( J ( x ∗ )) by moving back along the real line, while the other eigenvalue tends towardan eigenvalue of − τ D f ( x ∗ ). This occurrence is exactly what was described in Section 3 as an immediateimplication of Proposition 4 when the eigenvalues of S ( J ( x ∗ )) and τ D f ( x ∗ ) are distinct. The convergencerate is in fact limited by the eigenvalues splitting since as τ grows, the spectrum of the Jacobian is limitedby the eigenvalues of the Schur complement which remain constant. A related open question centers onfinding the worst case convergence rate as a function of the spectral properties of S ( J ( x ∗ )) and D f ( x ∗ ).Finally, the evolution of the eigenvalues as a function of the timescale separation τ demonstrates that therotational dynamics in τ - GDA vanish as the ratio between the magnitude of the real and imaginary parts ofthe eigenvalues grows.
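The splitting behavior described above can be reproduced directly from the 2 × 2 structure in (13) without simulating the dynamics. The sketch below uses hypothetical curvatures and prints spec(J_τ(x*)) over increasing τ: the complex-conjugate pair eventually becomes real and then splits, with one eigenvalue approaching the Schur complement a + b²/d (which is independent of τ) and the other growing on the order of τd.

    import numpy as np

    a, b, d = -1.0, 2.0, 1.0    # hypothetical curvatures; Schur complement a + b**2/d = 3

    def J(tau):
        return np.array([[a, b], [-tau * b, tau * d]])

    for tau in [1.0, 5.0, 20.0, 100.0]:
        eigs = np.sort_complex(np.linalg.eigvals(J(tau)))
        print(f"tau = {tau:6.1f}: spec(J_tau) = {np.round(eigs, 3)}")
    # Once the pair becomes real it splits: one eigenvalue tends to a + b**2/d = 3
    # while the other grows roughly like tau * d.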
We now return to the game from Example 2 that showed a non-equilibrium critical point which is stablewithout timescale separation and becomes unstable for a range of finite learning ratios with multiple equilibriain the vicinity. Again, we repeat the game construction along with some of the key characteristics that were25reviously presented in Example 2. Consider a zero-sum game defined by the cost f ( x , x ) = (cid:0) x + 2 x x + x − x + 2 x x − x (cid:1) ( x − + x (cid:0) (cid:80) i =1 ( x i − − ( x i − (cid:1) . (21)This game has critical points at (0 , , , , , , − . , . , − . , . , , ,
1) and ( − . , . , − . , .
53) are game-theoretically meaningful equilibrium. In fact,they are each differential Nash equilibrium and are locally stable for any choice of τ ∈ (0 , ∞ ) as a result ofProposition 3. On the other hand, the critical point x ∗ = (0 , , ,
0) is neither a differential Nash equilibriumnor a differential Stackelberg equilibrium. However, x ∗ is stable for τ ∈ (0 ,
2) and it is marginally stablefor τ = 2. In general, convergence to the non-equilibrium critical point x ∗ in the presence of multiplegame-theoretically meaningful equilibrium would be viewed as undesirable. In fact, this is precisely thetype of critical point that sophisticated schemes for converging to only differential Nash equilibria or onlydifferential Stackelberg equilibria seek to avoid (Adolphs et al., 2019; Fiez et al., 2020; Mazumdar et al.,2019; Wang et al., 2020). We show in this example that the simple inclusion of timescale separation ingradient descent-ascent is sufficient to avoid x ∗ and instead converge to a differential Nash equilibrium.Indeed, for all τ ∈ (2 , ∞ ) the non-equilibrium critical point x ∗ is unstable with respect to ˙ x = − Λ τ g ( x ).We simulate τ - GDA from the initial condition ( x , x ) = ( − . , . , . ,
3) with γ = 0 . τ ∈{ . , , , } , where we use the superscript to denote the time index so as not to be confused with themultiple indexes for player choice variables. In Figures 5a and 5b, we show the trajectories of the players co-ordinate pairs ( x , x ) and ( x , x ), respectively. We observe that τ - GDA converges to the non-equilibriumcritical point x ∗ with τ = 0 .
75 as expected and the dynamics move near it and then cycle around it with τ = 2 since the critical point becomes marginally stable. However, for τ = 5 and τ = 12, τ - GDA avoids thenon-equilibrium critical point since it becomes unstable and instead the dynamics converge to the nearbydifferential Nash equilibrium. We show the eigenvalues of the Jacobian at the non-equilibrium critical point x ∗ = (0 , , ,
0) in several forms in Figures 5d–5f. Again, we observe that the eigenvalues quickly becomepurely real as τ grows and then they split, and asymptotically converge toward the eigenvalues of S ( J ( x ∗ ))and − τ D f ( x ∗ ). Together, this example demonstrates that often there is a reasonable finite learning rateratio such that non-meaningful critical points become unstable for τ - GDA . Consider a zero-sum game defined by the cost f ( x , x ) = − e − ( . x +0 . x ) (cid:0) (0 . x + x ) + (0 . x + x ) (cid:1) . (22)The cost structure of this game is visualized in Figure 6a, where we present a three dimensional view of − f ( x , x ) along with the cost contours and the locations of critical points. This game has eleven criticalpoints including one differential Nash equilibrium and two differential Stackelberg equilibria that are nota differential Nash equilibrium. The critical points that are neither a differential Nash equilibrium nor adifferential Stackelberg equilibrium are unstable for any choice of timescale separation τ . The differentialNash equilibrium is at ( x , x ) = (10 . , − .
95) and it is stable for all τ ∈ (0 , ∞ ) by Proposition 3. Thedifferential Stackelberg equilibria are at ( x , x ) = ( − . , − . x ∗ , x ∗ ) = ( − . , − . τ ∈ (1 , ∞ ). We computed τ ∗ for the pair of differential Stackelberg equilibrium using thetheoretical construction from Theorem 3 and observed that it properly recovered τ ∗ = 1 for each equilibriumas the timescale separation such that the continuous time system is stable for all τ ∈ ( τ ∗ , ∞ ). Finally, wenote that while the set of equilibrium follow a linear translation, this game is generic and the equilibria arein fact isolated.In Figure 6b, we show the trajectories of τ - GDA with γ = 0 . τ ∈ { , , , } given the ini-tialization ( x , x ) = ( − , −
9) near the differential Stackelberg equilibrium at ( x ∗ , x ∗ ) = ( − . , − . τ = 1 results in a trajectory that cycles around theequilibrium in a closed curve since it is marginally stable and J τ ( x ∗ ) has purely imaginary eigenvalues. No-tably, as τ grows, the cyclic behavior dissipates as the timescale separation reshapes the vector field untilthe trajectory moves near directly to the zero derivative line of the maximizing player and then follows a26 a) (b)(c) (d) Figure 6: Experimental results for the polynomial game defined in (22) of Section 6.3. Figures 6a provides a3d view of the cost function − f ( x , x ) along with the cost contours and critical point locations. Figure 6bshows trajectories of τ - GDA for a range of learning rate ratios given an initialization around the differentialStackelberg equilibrium ( x ∗ , x ∗ ) = ( − . , − . J τ ( x ∗ ) as a function of τ where x ∗ is the differential Stackelberg equilibrium ( x ∗ , x ∗ ) = ( − . , − . J τ ( x ∗ ) as a functionof τ are presented in Figures 6c and 6d. As was the case for the previous experiments, we observe thatafter the eigenvalues become purely real as τ grows, they then split and asymptotically converge toward theeigenvalues of S ( J ( x ∗ )) and − τ D f ( x ∗ ). It is worth noting that much of the rotational behavior in thedynamics and vector field disappears as a result of timescale separation well before the eigenvalues becomepurely real; this seems to occur after the timescale separation is such that the magnitude of the real part ofthe eigenvalues is greater than that of the imaginary part.Finally, in Figure 7b, we demonstrate how the choice of timescale separation τ not only warps the vectorfield but also shapes the regions of attraction around critical points. The vector field is again shown foreach τ ∈ { , , , } , but now zoomed out to include each of the equilibria. The colors overlayed on thevector field indicate the equilibria that the dynamics converge to given an initialization at that position.Positions in the strategy space without color did not converge to an equilibrium in the fixed horizon of75000 iterations with γ = 0 . a)(b) Figure 7: Experimental results for the polynomial game defined in (22) of Section 6.3. In Figure 7a, weoverlay the trajectories from Figure 6b produced by τ - GDA onto the vector field generated by the choice oftimescale separation selection τ . The shading of the vector field is dictated by its magnitude so that lightershading corresponds to a higher magnitude and darker shading corresponds to a lower magnitude. Figure 7bdemonstrates the effect of timescale separation on the region of attractions around critical points by coloringpoints in the strategy space according to the equilibrium τ - GDA converges. We remark that areas withoutcoloring indicate where τ - GDA did not converge in the time horizon.be globally convergent and may get stuck in limit cycles or may simply move slowly for a long time inflat regions of the optimization landscape. We produced this experiment by running τ - GDA for a dense setof initial conditions chosen uniformly over the space of interest. It is clear from the experiment that thechoice of timescale separation determines not only the stability of equilibria, but also has a fundamentalimpact on the equilibria the dynamics converge to from a given initial condition as a result of the warpingof the vector field. 
As a concrete example, given an initialization of ( x , x ) = ( − , − τ = 1 converge to the differential Nash equilibria at ( x , x ) = (10 . , − . τ >
1, thedynamics instead converge to the differential Stackelberg equilibrium at ( x , x ) = ( − . , − .
03) that issignificantly closer to the initial condition. This example motivates future work on methods for obtainingaccurate estimates of the regions of attraction around critical points and techniques to design τ in order toexplicitly shape the region of attraction around an equilibrium of interest. We refer to the end of Section 4.1for further discussion on potentially relevant analysis methods in this direction. In Section 5, we studied gradient descent-ascent with regularization in generative adversarial networks andshowed that the general theory we provide can be extended to such a formulation. Recall that the training28 a) (b)(d)(e) µ = 0 . µ = 1 Figure 8: Experimental results for the Dirac-GAN game defined in (24) of Section 6.4. Figure 8a showstrajectories of τ - GDA for τ ∈ { , , , } with regularization µ = 0 . τ = 1 with regularization µ = 1.Figure 8b shows the distance from the equilibrium along the learning paths. Figure 8f shows the trajectoriesof τ - GDA overlayed on the vector field generated by the respective timescale separation and regularizationparameters. The shading of the vector field is dictated by its magnitude so that lighter shading correspondsto a higher magnitude and darker shading corresponds to a lower magnitude. Figures 8e and 8f show thetrajectories of the eigenvalues for J τ ( θ ∗ , ω ∗ ) as a function of τ with regularization set to µ = 0 . µ = 1,respectively where ( θ ∗ , ω ∗ ) is the unique critical point of the game.objective for generative adversarial networks is of the form f ( θ, ω ) = E p ( z ) [ (cid:96) (D(G( z ; θ ); ω ))] + E p D ( x ) [ (cid:96) ( − D( x ; ω ))] (23)29here D ω ( x ) and G θ ( z ) are the discriminator and generator networks respectively, p D ( x ) is the data distri-bution while p ( z ) is the latent distribution, and (cid:96) ∈ C ( R ) is some real-value function such that (cid:96) (cid:48) (0) (cid:54) = 0 and (cid:96) (cid:48)(cid:48) (0) <
0. The goal of the generator is to minimize (23) while the discriminator seeks to maximize (23). Asa motivating example, we mentioned the Dirac-GAN proposed by Mescheder et al. (2018), which constitutesan extremely simple, yet compelling generative adversarial network. Formally described in Definition 5, theDirac-GAN consists of a univariate generator distribution p θ = δ θ and a linear discriminator D( x ; ω ) = ωx ,where the real data distribution p D is given by a Dirac-distribution concentrated at zero. The resultingzero-sum game is defined by the cost f ( θ, ω ) = (cid:96) ( θω ) + (cid:96) (0) . The unique critical point of gradient descent-ascent is a local Nash equilibrium given by ( θ ∗ , ω ∗ ) = (0 , J τ ( θ ∗ , ω ∗ ) = (cid:20) (cid:96) (cid:48) (0) − τ (cid:96) (cid:48) (0) 0 (cid:21) . Consequently, spec( J τ ( θ ∗ , ω ∗ )) = {± i √ τ (cid:96) (cid:48) (0) } so that spec( J τ ( θ ∗ , ω ∗ )) (cid:54)⊂ C ◦ + and regardless of the choice oftimescale separation, τ - GDA oscillates and fails to converge to the equilibrium. This behavior is expected sincethe equilibrium is not hyperbolic and corresponds to neither a differential Nash equilibrium nor a differentialStackelberg equilibrium since D f ( θ ∗ , ω ∗ ) = 0 and − D f ( θ ∗ , ω ∗ ) = 0, but it is undesirable nonetheless since( θ ∗ , ω ∗ ) is the unique critical point and a local Nash equilibrium.Mescheder et al. (2018) proposed to remedy the degeneracy issues of generative adversarial networks byusing the following gradient penalties to regularize the discriminator with µ > R ( θ, ω ) = µ E p D ( x ) [ (cid:107)∇ x D( x ; ω ) (cid:107) ] and R ( θ, ω ) = µ E p θ ( x ) [ (cid:107)∇ x D( x ; ω ) (cid:107) ] . For the Dirac-GAN, R ( θ, ω ) = R ( θ, ω ) = µ ω . The zero-sum game corresponding to the Dirac-GAN with regularization can be defined by the cost f ( θ, ω ) = (cid:96) ( θω ) + (cid:96) (0) − µ ω . (24)The unique critical point of the game remains at ( θ ∗ , ω ∗ ) = (0 , J τ ( θ ∗ , ω ∗ ) = (cid:20) (cid:96) (cid:48) (0) − τ (cid:96) (cid:48) (0) τ µ (cid:21) (25)and spec( J τ ( θ ∗ , ω ∗ )) = { ( τ µ ± (cid:112) τ µ − τ ( (cid:96) (cid:48) (0)) ) / } . Observe that for all τ ∈ (0 , ∞ ) and µ ∈ (0 , ∞ ),we get that spec( J τ ( θ ∗ , ω ∗ )) ⊂ C ◦ + . This implies that for all timescale separation parameters τ > µ >
0, the local Nash equilibrium of the unregularized game is stable withrespect to the dynamics ˙ x = − Λ τ g ( x ). As a result, for a suitably chosen learning rate γ , the discrete timeupdate τ - GDA converges to the equilibrium. It is worth pointing out that the critical point ( θ ∗ , ω ∗ ) = (0 , − D f ( θ ∗ , ω ∗ ) = µ > S ( J ( θ ∗ , ω ∗ )) = ( (cid:96) (cid:48) (0)) /µ > τ - GDA for the regularized Dirac-GAN game defined in (24) focusedon exploring the interplay between timescale separation, regularization, and convergence rate since theequilibrium is always stable for a positive regularization parameter. We let (cid:96) ( t ) = − log(1 + exp( − t )),which corresponds to the choice made in the original generative adversarial network formulation proposedby Goodfellow et al. (2014). Figure 8a shows trajectories of τ - GDA from the initial condition ( θ , ω ) = (1 , γ = 0 .
Figure 8a shows trajectories of τ-GDA from a fixed initial condition with θ₀ = 1 and learning rate γ = 0.01 for several values of τ with the smaller regularization parameter µ < 1, and for τ = 1 with µ = 1. Moreover, Figures 8c and 8d show the trajectories of τ-GDA overlayed on the vector field generated by the respective timescale separation and regularization parameters, and Figure 8b shows the distance from the equilibrium along the learning paths. The two regularization values are chosen relative to the critical regularization parameter µ∗ = 1, which is such that spec(J₁(θ∗, ω∗)) ⊂ R₊ for all µ > µ∗, meaning that the Jacobian without timescale separation has purely real eigenvalues. Finally, Figures 8e and 8f show the trajectories of the eigenvalues of J_τ(θ∗, ω∗) as a function of τ with regularization set to the smaller value and to µ = 1, respectively, where (θ∗, ω∗) is the unique critical point of the game.

From Figures 8a and 8d, we observe that the impact of timescale separation with the smaller regularization is that the dynamics quickly reach the curve on which D₂f(θ, ω) = 0 and then follow along that curve until reaching the equilibrium. We further see from Figure 8b that with the smaller regularization, τ-GDA with τ = 8 converges faster to the equilibrium than τ-GDA with τ = 16, despite the fact that the former exhibits some cyclic behavior in the dynamics while the latter does not. The eigenvalues of the Jacobian with the smaller regularization have nonzero imaginary parts at τ = 8 and are purely real at τ = 16, but the eigenvalue with the minimum real part is greater at τ = 8 than at τ = 16. This example highlights that a degree of oscillatory behavior in the dynamics is not always harmful for convergence and it can even speed up the rate of convergence. Building off of this, for regularization µ = 1 and timescale separation τ = 1, Figures 8a and 8b show that even though τ-GDA follows a direct path toward the equilibrium and does not cycle since the eigenvalues of the Jacobian are purely real, the trajectory converges slowly to the equilibrium. While not presented, we ran experiments with larger timescale separations with µ = 1 as well, and timescale separation only made the convergence rate worse. The eigenvalues of the Jacobian for each regularization parameter presented in Figures 8e and 8f are able to explain this phenomenon. Indeed, for each regularization parameter, the eigenvalues split after becoming purely real and then converge toward the eigenvalues of S₁(J(θ∗, ω∗)) and −τD₂²f(θ∗, ω∗). Since S₁(J(θ∗, ω∗)) ∝ 1/µ and −τD₂²f(θ∗, ω∗) ∝ τµ, there is a trade-off between the choice of regularization µ and the timescale separation τ on the conditioning of the Jacobian matrix. As shown in Figures 8e and 8f, the minimum real part of the eigenvalues with the smaller regularization exceeds that with µ = 1 after sufficient timescale separation, which makes the convergence rate faster. Together, this example demonstrates that there may often be a delicate relationship between timescale separation, regularization, and convergence rate, where after a certain threshold each parameter choice may inhibit the rate of convergence.

In Appendix L.2, we provide simulation results on the Dirac-GAN game using the non-saturating generative adversarial network objective proposed by Goodfellow et al. (2014). In this formulation, the game is defined by the costs (f₁, f₂) = (−ℓ(−ωθ) + ℓ(0), −(ℓ(ωθ) + ℓ(0))).
While the non-saturating objective was motivated by global considerations (avoiding vanishing gradients) rather than local considerations, it turns out that it is locally equivalent, in terms of the game Jacobian, to the standard formulation for the Dirac-GAN. As a result, the stability characteristics are identical and we draw equivalent conclusions from the experiment regarding the behavior of gradient descent-ascent in the game. Finally, we note that in Appendix L.3 we explore another simple generative adversarial network formulation using the Wasserstein cost function with a linear generator and quadratic discriminator (each of arbitrary dimension) for the problem of learning a covariance matrix. In that example, we draw analogous conclusions about the interplay between timescale separation, regularization, and the rate of convergence.

We now investigate the role timescale separation has on training generative adversarial networks parameterized by deep neural networks. The empirical benefits of training with a timescale separation have been documented previously. For example, Heusel et al. (2017) showed on a number of image datasets that a timescale separation between the generator and discriminator improves generation performance as measured by the Frechet Inception Distance (FID). Since then, a significant number of papers have presented results training generative adversarial networks with timescale separation. Moreover, it is common in the literature for the discriminator to be updated multiple times between each update of the generator (Arjovsky et al., 2017). Indeed, it has been widely demonstrated that this heuristic improves the stability and convergence of the training process, and locally it has a similar effect as including a timescale separation between the generator and discriminator. The disadvantage of this approach is that the number of gradient calls per generator update increases and consequently the convergence is then slower in terms of wall clock time when a similar effect could potentially be achieved by a learning rate separation between the generator and discriminator. We remark that it appears to be reasonably common for practitioners to fix a shared learning rate for the generator and discriminator along with a pre-selected number of discriminator updates per generator update and not thoroughly investigate the impact timescale separation has on the training process.

Figure 9: CIFAR-10 FID scores with regularization parameter µ = 10 in Figure 9a and µ = 1 in Figure 9b.

The goal of our generative adversarial network experiments is to reinforce the importance of the timescale separation between the generator and the discriminator as a hyperparameter in the training process, demonstrate how it changes the behavior along the learning path, and show that it is compatible with a number of common training heuristics. This is to say that our goal is not necessarily to show state-of-the-art performance, but rather to perform experiments that allow us to make insights relevant to the theory in this paper. We remark that our empirical work on training generative adversarial networks is distinct from and complementary to that of Heusel et al. (2017) in several ways. The theory given by Heusel et al. (2017) only applies to stochastic stepsizes; however, in the experiments they implemented constant step sizes. We train with mini-batches and decaying stepsizes, which does satisfy the theory we provide.
Moreover, by and large, the experiments by Heusel et al. (2017) compare a fixed learning rate ratio between the generator and discriminator to multiple fixed shared learning rates for the generator and discriminator. In contrast, we fix a learning rate for the generator and explore the behavior of the training process as the timescale parameter τ is swept over a given range.

We build our experiments based on the methods and implementations of Mescheder et al. (2018) and explore both the CIFAR-10 and CelebA image datasets. We train the generative adversarial networks with the non-saturating objective function and the R₁ gradient penalty proposed by Mescheder et al. (2018) with regularization parameters µ ∈ {1, 10}. We note that the non-saturating objective results in a game that is not zero-sum; however, it is commonly used in practice and under the realizable assumptions it is locally equivalent to the zero-sum objective as discussed at the end of Section 6.4. The network architectures for the generator and discriminator are both based on the ResNet architecture. The initial learning rate γ₁ for the generator is fixed in all of our experiments. At each update k the learning rate of the generator is given by γ_{1,k} = γ₁/(1 + ν)^k where ν = 0.005, and the learning rate of the discriminator is γ_{2,k} = τγ_{1,k}. For each experiment the batch size is 64, the latent data is drawn from a standard normal distribution of dimension 256, and the resolution of the images is 32 × 32 × 3. Finally, as an optimizer, we run RMSprop with parameter α = 0.99.
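The sketch below (ours) shows how these pieces fit together in a single training step; the tiny network modules, the base learning rate, the random "data", and the averaging weight β are illustrative placeholders rather than the paper's architectures or values, and the exponential moving average anticipates the heuristic described in the next paragraph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-ins for the ResNet generator and discriminator; only the
# two-timescale optimizer setup is the point of this sketch.
latent_dim, img_dim = 256, 32 * 32 * 3
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, img_dim))
D = nn.Sequential(nn.Linear(img_dim, 128), nn.ReLU(), nn.Linear(128, 1))

gamma1, tau, nu, mu, beta = 1e-4, 4.0, 0.005, 1.0, 0.999  # gamma1 and beta are illustrative
opt_G = torch.optim.RMSprop(G.parameters(), lr=gamma1, alpha=0.99)
opt_D = torch.optim.RMSprop(D.parameters(), lr=tau * gamma1, alpha=0.99)
ema = {name: p.detach().clone() for name, p in G.named_parameters()}

def real_batch():
    # placeholder for a CIFAR-10 / CelebA mini-batch of flattened images
    return torch.rand(64, img_dim)

for k in range(3):  # a few illustrative updates
    # decay both learning rates at the same rate so the ratio tau stays fixed
    decay = 1.0 / (1.0 + nu) ** k
    opt_G.param_groups[0]["lr"] = gamma1 * decay
    opt_D.param_groups[0]["lr"] = tau * gamma1 * decay

    # discriminator (ascent) step with the R1 gradient penalty
    x_real = real_batch().requires_grad_(True)
    z = torch.randn(64, latent_dim)
    d_real, d_fake = D(x_real), D(G(z).detach())
    grad_real = torch.autograd.grad(d_real.sum(), x_real, create_graph=True)[0]
    loss_D = (F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
              + 0.5 * mu * grad_real.pow(2).sum(dim=1).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # generator (descent) step with the non-saturating objective
    loss_G = F.softplus(-D(G(torch.randn(64, latent_dim)))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()

    # exponential moving average of the generator parameters for evaluation,
    # following the averaging convention stated in the text below
    with torch.no_grad():
        for name, p in G.named_parameters():
            ema[name].mul_(1.0 - beta).add_(p, alpha=beta)
```

The key design point is that both optimizers decay their learning rates at the same rate ν so that the ratio τ between them is held fixed throughout training.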
Again, the theory we provide does not strictly apply to using RMSprop, but it is ubiquitous in practice for training generative adversarial networks, and if the timescale separation is sufficiently large so that the eigenvalues of the Jacobian are purely real then the theory we provide is applicable as remarked previously. We provide further details on the network and hyperparameters in Appendix L.4.

Figure 10: CelebA FID scores with regularization parameter µ = 10 in Figure 10a and µ = 1 in Figure 10b.

A final heuristic and hyperparameter that we explore in conjunction with the timescale separation τ is that of using an exponential moving average to produce the model that is evaluated. This means that at each update k, given that the parameters of the generator are given by x_{1,k}, the moving average x̄_k = x_{1,k}β + x̄_{k−1}(1 − β) is kept, where β ∈ (0, 1). We perform experiments with the timescale separation τ belonging to the set {1, 2, 4, 8} and the regularization parameter µ belonging to the set {1, 10}. For each choice of τ and µ, we retain exponential moving averages of the generator parameters for three values of β. The training process is repeated 3 times for each hyperparameter configuration to rule out noise from random seeds and the performance is evaluated along the learning path at every 10,000 updates in terms of the FID score. We report the mean scores and the standard error of the mean over the repeated experiments. We run the experiments with µ = 10 for 150k mini-batch updates and the experiments with µ = 1 for 300k mini-batch updates.

The results for each dataset across the hyperparameter configurations are presented in numeric form in Figure 12. Figure 11 shows some generated samples selected at random for each dataset with the hyperparameter configuration that performed best in terms of the FID score at the end of the training process. Figure 17 in Appendix L.4 has several more generated samples for each dataset selected at random. We now describe the key observations from the experiments for each dataset.
CIFAR-10. The FID scores along the learning path for CIFAR-10 with µ = 10 and µ = 1 are presented in Figures 9a and 9b, respectively. The corresponding scores in numeric form are given in Figures 12a, 12c, and 12e for µ = 10 at 150k iterations and µ = 1 at 150k and 300k iterations, respectively.

Figure 11: CIFAR-10 and CelebA samples from the generator at 300k iterations with exponential moving averaging and τ = 4.

Figure 12: FID scores (mean ± standard error over repeated runs) on CIFAR-10 and CelebA, with rows indexed by τ and columns by β: (a) CIFAR-10 FID at 150k updates with µ = 10; (b) CelebA FID at 150k updates with µ = 10; (c) CIFAR-10 FID at 150k updates with µ = 1; (d) CelebA FID at 150k updates with µ = 1; (e) CIFAR-10 FID at 300k updates with µ = 1; (f) CelebA FID at 300k updates with µ = 1.

To begin, we observe that the exponential moving average significantly improves performance for a suitable choice of β among the parameters considered. The oscillations of τ-GDA are dampened by timescale separation in this generative adversarial network experiment, similarly to what was observed in the simpler experiments presented previously. Moreover, we see that timescale separation also has a significant impact on the FID score of the training process. Indeed, even selecting τ = 2 versus τ = 1 can yield an impressive performance gain. In this experiment, for each regularization parameter, τ = 4 converges fastest, followed by τ = 8, then τ = 2, and finally τ = 1. Finally, observe that the performance with regularization µ = 1 is far superior to that with regularization µ = 10. Interestingly, the last pair of conclusions is in line with the insights drawn from the simple Dirac-GAN experiment in Section 6.4. In particular, timescale separation only speeds up convergence until hitting a limiting value, and there is a fundamental interplay between timescale separation, regularization, and convergence rate. This indicates that it may be possible to transfer some of the insights made on simple generative adversarial network formulations to the much more complex problem where players are parameterized by neural networks.

CelebA. The FID scores along the learning path for CelebA with µ = 10 and µ = 1 are presented in Figures 10a and 10b, respectively. The corresponding scores in numeric form are given in Figures 12b, 12d, and 12f for µ = 10 at 150k iterations and µ = 1 at 150k and 300k iterations, respectively. In this experiment we see that while the exponential moving average helps performance, the gain is not as drastic as it was for CIFAR-10. However, timescale separation in combination with the regularization does have a major effect on the FID score of the training process in this experiment. For regularization µ = 10, the timescale parameters of τ = 2 and τ = 4 outperform τ = 1 and τ = 8 by a wide margin, again highlighting that timescale separation can speed up convergence until a certain point where it can potentially slow it down owing to the effect on the conditioning of the problem locally. A similar trend can be observed with regularization µ = 1, but with τ = 8 performing closer to τ = 2 and τ = 4. We again observe in this experiment that for all timescale separation parameters, the performance is significantly improved with regularization µ = 1 as compared with µ = 10.
This once again highlights the importance of considering how the hyperparameters of regularization and timescale interact and dictate the local convergence rates.

In summary, we took a well-performing method and implementation for training generative adversarial networks and demonstrated that timescale separation is an extremely important, and easy-to-implement, hyperparameter that is worth careful consideration since it can have a major impact on the convergence speed and final performance of the training process.

In this section, we provide a review of related work at the intersection of machine learning and game theory, as well as connections to dynamical systems theory and control.
Given the extensive work on the topic of learning in games in machine learning over the last several years, we cannot cover all of it and instead focus our attention on the work most relevant to this paper. We begin by reviewing solution concepts developed for the class of games under consideration and then discuss some learning dynamics studied in the literature beyond gradient descent-ascent. Following this, we delineate the related work studying gradient descent-ascent in non-convex, non-concave zero-sum games and finish by making note of the literature on nonconvex-concave optimization.
Solution Concepts.
Owing to the numerous applications in machine learning, a significant portion of the modern work on learning in games has focused on the zero-sum formulation with non-convex, non-concave cost functions. Most recently, Daskalakis et al. (2020) tout the importance and significance of this class of games in a paper on the complexity of finding equilibria (in particular, in the constrained setting) in such games. Consequently, local solution concepts have been broadly adopted. Compared to the standard game-theoretic notions of equilibrium that characterize players' incentives to deviate given the game and information structure, local equilibrium concepts restrict the deviation search space to a suitable local neighborhood. Following the standard game-theoretic viewpoint, a vast number of works in machine learning study the local Nash equilibrium concept and critical points satisfying gradient-based sufficient conditions for the equilibrium, which are often referred to as differential Nash equilibria (Ratliff et al., 2013, 2014; Ratliff et al., 2016). Based on the observation that in non-convex, non-concave zero-sum games the order of play is fundamental in the definition of the game, there has been a push toward considering local notions of the Stackelberg equilibrium concept, which is the usual game-theoretic equilibrium when there is an explicit order of play between players. In the zero-sum formulation, Stackelberg equilibria are often referred to as minmax equilibria. As with the Nash equilibrium, gradient-based sufficient conditions for local minmax/Stackelberg equilibria have been given (Daskalakis and Panageas, 2018; Fiez et al., 2020; Jin et al., 2020) and such critical points have been referred to as differential Stackelberg equilibria (Fiez et al., 2020). We remark that it has been shown that local/differential Nash equilibria are a subset of local/differential Stackelberg equilibria (Fiez et al., 2020; Jin et al., 2020). Following past works, we adopt the terminology of differential Nash equilibrium and differential Stackelberg equilibrium in this paper to mean strict local Nash equilibrium and strict local minmax/Stackelberg equilibrium, respectively. Finally, we mention the proximal equilibria proposed by Farnia and Ozdaglar (2020), which we do not consider in this work and which, depending on a regularization parameter, can interpolate between the local Nash and local Stackelberg equilibrium notions.
Learning Dynamics.
Given that the focus of this work is on gradient descent-ascent, we center our coverage of related work on papers analyzing its behavior. Nonetheless, we mention that a significant number of learning dynamics for zero-sum games have been developed in the past few years, in some cases motivated by the shortcomings of gradient descent-ascent without timescale separation. The methods include optimistic and extra-gradient algorithms (Daskalakis et al., 2018; Gidel et al., 2019a; Mertikopoulos et al., 2019), negative momentum (Gidel et al., 2019b), gradient adjustments (Balduzzi et al., 2018; Letcher et al., 2019a; Mescheder et al., 2017), and opponent modeling methods (Foerster et al., 2018; Letcher et al., 2019b; Metz et al., 2017; Schäfer and Anandkumar, 2019; Zhang and Lesser, 2010), among others. While the aforementioned learning dynamics possess some desirable characteristics, they cannot guarantee that the set of stable critical points coincides with a set of local equilibria for the class of games under consideration. However, there have been a select few learning dynamics proposed that can guarantee the stable critical points coincide with either the set of differential Nash equilibria (Adolphs et al., 2019; Mazumdar et al., 2019) or the set of differential Stackelberg equilibria (Fiez et al., 2020; Wang et al., 2020), effectively solving the problem of guaranteeing local convergence to only a class of local equilibria. However, since each of the algorithms achieving the equilibrium stability guarantee requires solving a linear equation in each update step, they are not efficient and can potentially suffer from degeneracies along the learning path in applications such as generative adversarial networks. These practical shortcomings motivate either proving that existing learning dynamics using only first-order gradient feedback achieve analogous theoretical guarantees or developing novel computationally efficient learning dynamics that can match the theoretical guarantee of interest.
Gradient Descent-Ascent.
Gradient descent-ascent has been studied extensively in non-convex, non-concave zero-sum games since it is a natural analogue to gradient descent from optimization, is computationally efficient, and has been shown to be effective in practice for applications of interest when combined with common heuristics. A prevailing approach toward gaining understanding of the convergence characteristics of gradient descent-ascent has been to analyze the local stability around critical points of the continuous time limiting dynamical system. The majority of this work has not considered the impact of timescale separation. Numerous papers have pointed out that the stable critical points of gradient descent-ascent without timescale separation may not be game-theoretically meaningful. In particular, it has been shown that there can exist stable critical points that are not differential Nash equilibria (Daskalakis and Panageas, 2018; Mazumdar et al., 2020). Furthermore, it is known that there can exist stable critical points that are not differential Stackelberg equilibria (Jin et al., 2020). The aforementioned results rule out the possibility that gradient descent-ascent without timescale separation can guarantee equilibrium convergence. In terms of the stability of equilibria, it is known that differential Nash equilibria are stable for gradient descent-ascent without timescale separation (Daskalakis and Panageas, 2018; Mazumdar et al., 2020), but that there can exist differential Stackelberg equilibria which are not stable with respect to gradient descent-ascent without timescale separation.

The work of Jin et al. (2020) is the most relevant exploring how the aforementioned stability properties of gradient descent-ascent change with timescale separation. In particular, Jin et al. (2020) investigate whether the desirable stability characteristics (stability of differential Nash equilibria) and undesirable stability characteristics (stability of non-equilibrium critical points and instability of differential Stackelberg equilibria) of gradient descent-ascent without timescale separation are maintained and remedied, respectively, with timescale separation. In terms of the former query, extending the examples shown in Mazumdar et al. (2020) and Daskalakis and Panageas (2018), Jin et al. (2020) show that differential Nash equilibria are stable for gradient descent-ascent with any amount of timescale separation.

On the other hand, for the latter query, Jin et al. (2020) show (in Proposition 27) two interesting examples: (a) for an a priori fixed τ, there exists a game with a differential Stackelberg equilibrium that is not stable and (b) for an a priori fixed τ, there exists a game with a stable critical point that is not a differential Stackelberg equilibrium. However, (a) does not imply that for the constructed game, there does not exist another (finite) τ, independent of the game parameters, such that the differential Stackelberg equilibrium is stable for all larger τ. In simple language, the result summarized in (a) says the following: if a bad timescale separation is chosen, then convergence may not be guaranteed. Similarly, (b) does not imply that there is no τ such that for all larger τ for the constructed game instance, the critical point becomes unstable. Again, in simple language, the result summarized in (b) says the following: if a bad timescale separation is chosen, then non-game-theoretically meaningful equilibria may persist.
While at first glance this set of results may appear to indicate that the undesirable stability characteristics of gradient descent-ascent without timescale separation cannot be averted by any finite timescale separation, it is important to emphasize that these results do not answer the questions of whether there (a) exists a game with a critical point that is not a differential Stackelberg equilibrium which is stable with respect to gradient descent-ascent without timescale separation and remains stable for all finite timescale separation ratios or (b) exists a game with a differential Stackelberg equilibrium that is not stable for all finite timescale separation ratios. The preceding questions are left open from previous work and are exactly the focus of this paper. In Appendix K, we go into greater detail on the comparison with Proposition 27 of Jin et al. (2020), as we believe this to be an important point of distinction for Theorems 3 and 4 in this paper.

Finally, Jin et al. (2020) study an infinite timescale separation ratio and show that the stable critical points of gradient descent-ascent coincide with the set of differential Stackelberg equilibria in this regime. This result effectively shows that gradient descent-ascent can guarantee only equilibrium convergence with timescale separation, albeit infinite. We remark that an equivalent result in the context of general singularly perturbed systems has been known in the literature (Kokotovic et al., 1986, Chap. 2), as we discuss further in Section 3.1. Finally, we point out that since an infinite timescale separation does not result in an implementable algorithm, fully understanding the behavior with a finite timescale separation is of fundamental importance and the motivation for our work.

Beyond the work of Jin et al. (2020) considering timescale separation in gradient descent-ascent, it is worth mentioning the work of Chasnov et al. (2019) and Heusel et al. (2017). Chasnov et al. (2019) study the impact of timescale separation on gradient descent-ascent, but focus on the convergence rate as a function of it given an initialization around a differential Nash equilibrium and do not consider the stability questions examined in this paper. Heusel et al. (2017) study stochastic gradient descent-ascent with timescale separation and invoke the results of Borkar (2008) for analysis. The stochastic approximation results the claims rely on guarantee the convergence of the system locally to a stable critical point. Consequently, the claim of convergence to differential Nash equilibria of stochastic gradient descent-ascent given by Heusel et al. (2017) only holds given an initialization in a local neighborhood around a differential Nash equilibrium. In this regard, the issue of the local stability of the types of critical points is effectively assumed away and not considered. In contrast, we are able to combine our stability results for gradient descent-ascent with timescale separation together with the stochastic approximation theory of Borkar (2008) to guarantee local convergence to a differential Stackelberg equilibrium in Section 4.2. We remark that Heusel et al. (2017) empirically demonstrate that timescale separation can significantly improve the performance of gradient descent-ascent when training generative adversarial networks.

The final relevant line of work studying gradient descent-ascent is specific to generative adversarial networks.
The results from this literature develop assumptions relevant to generative adversarial networks and then analyze the stability and convergence properties of gradient descent-ascent under them (see, e.g., works by Daskalakis et al. (2018); Goodfellow et al. (2014); Mescheder et al. (2018); Metz et al. (2017); Nagarajan and Kolter (2017)). Within this body of work, there has been a significant amount of effort focusing on how the stability (and, hence, convergence properties) of gradient descent-ascent in generative adversarial networks can be enhanced with regularization methods. Nagarajan and Kolter (2017) show, under suitable assumptions, that gradient-based methods for training generative adversarial networks are locally convergent assuming the data distributions are absolutely continuous. However, as observed by Mescheder et al. (2018), not only may such assumptions fail to be satisfied in many practical generative adversarial network training scenarios such as natural images, but it can often be the case that the data distribution is concentrated on a lower dimensional manifold. The latter characteristic leads to nearly purely imaginary eigenvalues and highly ill-conditioned problems. Mescheder et al. (2018) provide an explanation for observed instabilities, which are a consequence of the true data distribution being concentrated on a lower dimensional manifold, using discriminator gradients orthogonal to the tangent space of the data manifold. Further, the authors introduce regularization via gradient penalties that leads to convergence guarantees under less restrictive assumptions than were previously known. In this paper, we further extend these results to show that convergence to differential Stackelberg equilibria is guaranteed under a wide array of hyperparameter configurations (i.e., learning rate ratio and regularization).

Nonconvex-Concave Optimization.
A final related line of work is on nonconvex-concave optimization (Lin et al., 2020a,b; Lu et al., 2020; Nouiehed et al., 2019; Ostrovskii et al., 2020; Rafique et al., 2018). The focus in this set of works (among many others on the topic) is on characterizing the iteration complexity to stationary points, rather than stability and asymptotic convergence as in the non-convex, non-concave zero-sum game setting. The focus on stationary points in this body of work is reasonable since, to our knowledge, the results obtained are for ℓ-weakly convex-concave games, a subclass of nonconvex-concave games in which the minimizing player faces an ℓ-weakly convex optimization problem for each fixed choice of the maximizing player. The primary relevance of work on this problem is that a number of the algorithms rely on timescale separation and variations of gradient descent-ascent. Moreover, the methods for obtaining fast convergence rates may be relevant to future work attempting to characterize fast rates in the non-convex, non-concave setting after there is a more fundamental understanding of the stability and asymptotic convergence.
The study of gradient descent-ascent dynamics with timescale separation between the minimizing and maximizing players is closely related to that of singularly perturbed dynamical systems (Kokotovic et al., 1986). Such systems arise in classical control and dynamical systems in the context of physical systems that either have multiple states which evolve on different timescales due to some underlying immutable physical process or property, or a single dynamical system which evolves on a sub-manifold of the larger state-space. For example, robot manipulators or end effectors often have slower mechanical dynamics than electrical dynamics. On the other hand, in electrical circuits or mechanical systems, certain resistor-capacitor circuits or spring-mass systems have a state which evolves subject to a constraint equation (Lagerstrom and Casten, 1972; Sastry and Desoer, 1981). Due to their prevalence, singularly perturbed systems have been studied extensively, with one of the outcomes being a number of works on determining the range of perturbation parameters for which the overall system is stable (Kokotovic et al., 1986; Saydy, 1996; Saydy et al., 1990). We exploit these results and analysis techniques to develop novel results for learning in games. One of the contributions of this work is the introduction of these algebraic analysis techniques to the machine learning and game theory communities. These tools open up new avenues for algorithm synthesis; we comment on potential directions in the concluding discussion section. This being said, there are a couple of key differences between the present setting and that of the classical literature, including the following:

1.
The perturbation parameter is no longer an immutable characteristic of the physicalsystem, but rather a hyperparameter subject to design.
Indeed, in singular perturbation theory, the typical dynamical system studied takes the form

ẋ = g₁(x, y)
εẏ = g₂(x, y)    (26)

where ε is a small parameter that abstracts some physical characteristics of the state variables. On the other hand, in learning in games, the continuous time limiting dynamical system of gradient descent-ascent for a zero-sum game defined by a sufficiently smooth cost f : X × Y → R takes the form

ẋ = −D₁f(x, y)
ẏ = τD₂f(x, y)    (27)

where the x–player seeks to minimize f with respect to x and the y–player seeks to maximize f with respect to y, and τ is the ratio of learning rates (without loss of generality) of the maximizing to the minimizing player. These learning rates (and hence the value of τ) are hyperparameters subject to design in most machine learning and optimization applications. Another feature of (27), as compared to (26), is that the dynamics D_i f are partial derivatives of a function f, which leads to the second key difference.

2. There is structure in the dynamical system that arises from gradient-play which reflects the underlying game theoretic interactions between players.
This structure can be exploited in obtaining convergence guarantees in machine learning and optimization applications of game theory. For instance, minmax optimization is analogous to a zero-sum game for which the local linearization of gradient descent-ascent dynamics has the structure

J = \begin{bmatrix} A & B \\ -\tau B^\top & -\tau C \end{bmatrix}

where A = A⊤ and C = C⊤ and τ is the learning rate ratio or timescale separation parameter. Such block matrices have very interesting properties. In particular, second order optimality conditions for a minmax equilibrium correspond to positive definiteness of the first Schur complement S₁(J) = A − BC⁻¹B⊤ > 0 and of −C > 0.
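To make this concrete, the following sketch (ours, with an arbitrarily chosen one-dimensional instance rather than any example from the paper) checks these second-order conditions for such a block matrix and sweeps τ to locate the finite threshold beyond which the spectrum enters the open right half-plane.

```python
import numpy as np

# One-dimensional blocks chosen for illustration: A = D1^2 f(x*), B = D12 f(x*),
# C = D2^2 f(x*). Here A < 0, so x* is not a differential Nash equilibrium, but
# -C > 0 and S1(J) = A - B C^{-1} B^T > 0, so it satisfies the differential
# Stackelberg (strict local minmax) conditions.
A, B, C = np.array([[-2.0]]), np.array([[2.0]]), np.array([[-1.0]])
S1 = A - B @ np.linalg.inv(C) @ B.T
print("-C =", -C.item(), " S1(J) =", S1.item())   # both positive

def J(tau):
    # local linearization of the tau-GDA dynamics at x*
    return np.block([[A, B], [-tau * B.T, -tau * C]])

for tau in [1.0, 1.5, 2.0, 2.5, 4.0]:
    eigs = np.linalg.eigvals(J(tau))
    stable = bool(np.all(eigs.real > 0))   # spec(J_tau) in the open right half-plane
    print(f"tau = {tau:3.1f}  eigenvalues = {np.round(eigs, 3)}  stable = {stable}")
# For this instance the spectrum crosses into the right half-plane at tau = 2:
# the equilibrium is unstable for tau-GDA when tau < 2 and locally attracting
# once tau > 2.
```

In this instance the threshold is finite, illustrating the central point that a strict local minmax equilibrium that is unstable for small learning rate ratios can become stable once the ratio exceeds a finite value.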
Given this block structure of J, tools from the theory of block operators (see, e.g., works by Lancaster and Tismenetsky (1985); Magnus (1988); Tretter (2008)) such as the quadratic numerical range can be exploited (and combined with singular perturbation theory) to understand the effects of hyperparameters such as τ (the learning rate ratio) and regularization (which is common in applications such as generative adversarial networks) on convergence.

In this paper, we prove a necessary and sufficient condition for the convergence of gradient descent-ascent with timescale separation to differential Stackelberg equilibria in zero-sum games. This answers a long-standing open question about provable convergence of first order methods for zero-sum games to local minmax equilibria. Specifically, we provide necessary and sufficient conditions for the convergence of τ-GDA to differential Stackelberg equilibria. A key component of the proof is the construction of a (tight) finite lower bound on the learning rate ratio τ for which stability of the game Jacobian is guaranteed, and hence local asymptotic convergence of τ-GDA. In addition, we provide results on iteration complexity and convergence rate and apply the results to generative adversarial networks under mild assumptions on the data distribution. For both differential Nash equilibria and the superset of differential Stackelberg equilibria, we provide estimates on the neighborhood on which convergence is guaranteed.

This being said, the question of the size of the region of attraction remains open. As commented on earlier in the paper, an alternative but related technique tackles the nonlinear system directly. The downside of this technique is that one needs to have in hand (or be able to construct) Lyapunov functions for both the boundary layer model (i.e., the system that arises from treating the choice variable of the slow player as being 'static') and the reduced order model (i.e., the system that arises from plugging the implicit mapping from the slow player's action to the fast player's action into the slow player's dynamics). A convex combination of these functions provides a Lyapunov function for the original system ẋ = −Λ_τ g(x). The level sets of this combined Lyapunov function then give a better sense of the region of attraction and, in fact, one can optimize over the weighting in the convex combination in order to obtain better estimates of the region of attraction. This is an interesting avenue to explore in the context of learning in games with lots of intrinsic structure that can potentially be exploited to improve both the rate of convergence and the region on which convergence is guaranteed.

Another significant contribution of this work is the fact that we introduce tools that are arguably new to the machine learning and optimization communities and expose interesting new directions of research. In particular, the notion of a guard map, which is arguably an obscure tool even in modern control and dynamical systems theory, is 'rediscovered' in this paper. There is potential to leverage this concept not only in providing certificates for performance (e.g., beyond stability to robustness) but also in synthesizing algorithms with performance guarantees. For instance, one observation from our empirical analysis is that the convergence rate is not only limited by the eigenvalues of the Schur complement of the Jacobian, but the fastest convergence appears to occur when there are complex components of the eigenvalues. In short, some cycling is beneficial.
Better understanding this fact from a theoretical perspective is an open question, as is optimizing the rate of convergence by exploiting these observations.

Finally, another set of related open questions centers on practical considerations for the efficient use of first order methods. For instance, with respect to generative adversarial networks, the exponential moving average is known to empirically reduce the negative effects of cycling. Additionally, increasing the learning rate ratio does lead to predominantly real eigenvalues, which in turn reduces cycling. Understanding the trade-offs between not only these two hyperparameters but also regularization is very important for practical implementations. Empirically, we study the trade-offs between the learning rate ratio, the regularization parameter, and the parameter controlling the degree of "smoothness" in the exponential moving average, another common heuristic that performs well in practice. There is an open line of research related to analytically characterizing the trade-offs between these three hyperparameters. However, in the absence of theoretical tools for exploring these issues, what are reasonable and principled heuristics?

To conclude, while we arguably definitively address a standing open question for first order methods for learning in zero-sum games/minmax optimization problems, there are many open directions exposed by the tools introduced and empirical observations discovered in this work.

Acknowledgements
This work is funded by the Office of Naval Research (YIP Award) and National Science Foundation CAREERAward (CNS-1844729). Tanner Fiez is also funded by a National Defense Science and Engineering GraduateFellowship. We thank Daniel Calderone for the helpful discussions, in particular on linear algebra resultsas they pertain to the results in this paper. Finally, we thank Mescheder et al. (2018) for providing a highquality open source implementation of the generative adversarial network experiments they performed, whichfacilitated and expedited the experiments we performed.
References
EH Abed, L Saydy, and AL Tits. Generalized stability of linear singularly perturbed systems includingcalculation of maximal parameter range. In
Robust Control of Linear Systems and Nonlinear Control ,pages 197–203. Springer, 1990.Ralph Abraham, Jerrold E Marsden, and Tudor Ratiu.
Manifolds, tensor analysis, and applications , vol-ume 75. Springer Science & Business Media, 2012.Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Local saddle point opti-mization: A curvature exploitation approach. In
International Conference on Artificial Intelligence andStatistics (AISTATS) , pages 486–495, 2019.John M Alongi and Gail Susan Nelson.
Recurrence and topology , volume 85. American Mathematical Soc.,2007.I. K. Argyros. A generalization of ostrowski’s theorem on fixed points.
Applied Mathematics Letters , 12:77–79, 1999.Martin Arjovsky, Soumith Chintala, and L´eon Bottou. Wasserstein generative adversarial networks. In
International Conference on Machine Learning (ICML) , pages 214–223, 2017.D Balduzzi, S Racaniere, J Martens, J Foerster, K Tuyls, and T Graepel. The mechanics of n-playerdifferentiable games. In
International Conference on Machine Learning (ICML) , volume 80, pages 363–372, 2018.Tamer Ba¸sar and Geert Jan Olsder.
Dynamic noncooperative game theory . SIAM, 1998.Michel Benaim. A Dynamical System Approach to Stochastic Approximations.
SIAM Journal on Controland Optimization , 34(2):437–472, 1996.Hugo Berard, Gauthier Gidel, Amjad Almahairi, Pascal Vincent, and Simon Lacoste-Julien. A Closer Lookat the Optimization Landscapes of Generative Adversarial Networks. In
International Conference onLearning Representations (ICLR) , 2020.Vivek Borkar.
Stochastic Approximation: A Dynamical Systems Approach . Springer, 2008.40W Broer and F Takens. Preliminaries of dynamical systems theory.
Handbook of Dynamical Systems , 3:1–42, 2010.Frank M Callier and Charles A Desoer.
Linear system theory . Springer Science & Business Media, 2012.Benjamin Chasnov, Lillian J. Ratliff, Eric Mazumdar, and Samuel A. Burden. Convergence analysis ofgradient-based learning in continuous games.
Conference on Uncertainty in Artificial Intelligence (UAI) ,2019.Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: conditions for asymptoticstability of saddle points.
SIAM Journal on Control and Optimization , 55(1):486–511, 2017.Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-maxoptimization. In
Advances in Neural Information Processing Systems , pages 9236–9246, 2018.Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans with optimism.
International Conference on Learning Representations (ICLR) , 2018.Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. The complexity of constrained min-max optimization. arXiv preprint arXiv:2009.09623 , 2020.Farzan Farnia and Asuman Ozdaglar. Do gans always have nash equilibria?
International Conference onMachine Learning (ICML) , 2020.Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Convergence of Learning Dynamics in StackelbergGames. arXiv:1906.01217 (a version of this work appeared in ICML 2020) , 2019.Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Implicit learning dynamics in stackelberg games: Equi-libria characterization, convergence analysis, and empirical study.
International Conference on MachineLearning (ICML) , 2020.Jakob Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch.Learning with opponent-learning awareness. In
International Conference on Autonomous Agents andMultiAgent Systems (AAMAS) , pages 122–130, 2018.Drew Fudenberg, Fudenberg Drew, David K Levine, and David K Levine.
The theory of learning in games ,volume 2. MIT press, 1998.Gauthier Gidel, Hugo Berard, Ga¨etan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variationalinequality perspective on generative adversarial networks. In
International Conference on Learning Rep-resentations (ICLR) , 2019a.Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, R´emi Le Priol, Gabriel Huang, SimonLacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In
InternationalConference on Artificial Intelligence and Statistics (AISTATS) , pages 1802–1811, 2019b.Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, AaronCourville, and Yoshua Bengio. Generative adversarial nets. In
Advances in Neural Information ProcessingSystems (NeurIPS) , pages 2672–2680, 2014.Willy J. F. Govaerts.
Numerical Methods for Bifurcations of Dynamical Equilibria . Society for Industrialand Applied Mathematics, 2000.Joao P Hespanha.
Linear Systems Theory . Princeton University Press, 2nd edition, 2018.Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Ganstrained by a two time-scale update rule converge to a local nash equilibrium. In
Advances in NeuralInformation Processing Systems (NeurIPS) , pages 6626–6637, 2017.Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In
Advances in neural infor-mation processing systems , pages 4565–4573, 2016.41oger A Horn and Charles R Johnson.
Matrix Analysis . Cambridge University Press, 1985.Chi Jin, Praneeth Netrapalli, and Michael I Jordan. What is local optimality in nonconvex-nonconcaveminimax optimization?
International Conference on Machine Learning (ICML , 2020.Sameer Kamal. On the convergence, lock-in probability, and sample complexity of stochastic approximation.
SIAM Journal on Control and Optimization , 48(8):5178–5192, 2010.Hassan Khalil.
Nonlinear Systems . Prentice Hall, 3rd edition, 2002.Peter V Kokotovic, John O’Reilly, and Hassan K Khalil.
Singular Perturbation Methods in Control: Analysisand Design . Academic Press, Inc., 1986.NN Krasovskii. Stability of Motion: Application of Lyapunov’s Second Method to Differential Systems andEquations with Time-Delay, 1963.PA Lagerstrom and RG Casten. Basic concepts underlying singular perturbation techniques.
Siam Review ,14(1):63–120, 1972.Peter Lancaster and Miron Tismenetsky.
The theory of matrices: with applications . Elsevier, 1985.Alistair Letcher, David Balduzzi, S´ebastien Racaniere, James Martens, Jakob N Foerster, Karl Tuyls, andThore Graepel. Differentiable game mechanics.
Journal of Machine Learning Research (JMLR) , 20:84–1,2019a.Alistair Letcher, Jakob Foerster, David Balduzzi, Tim Rockt¨aschel, and Shimon Whiteson. Stable opponentshaping in differentiable games.
International Conference on Learning Representations (ICLR) , 2019b.Tianyi Lin, Chi Jin, Michael Jordan, et al. Near-optimal algorithms for minimax optimization.
Conferenceon Learning Theory (COLT) , 2020a.Tianyi Lin, Chi Jin, and Michael I Jordan. On gradient descent ascent for nonconvex-concave minimaxproblems.
International Conference on Machine Learning (ICML) , 2020b.Jonathan Lorraine, Paul Vicol, and David Duvenaud. Optimizing millions of hyperparameters by implicitdifferentiation. In
International Conference on Artificial Intelligence and Statistics (AISTATS) , pages1540–1552, 2020.Songtao Lu, Ioannis Tsaknakis, Mingyi Hong, and Yongxin Chen. Hybrid block successive approximationfor one-sided non-convex min-max problems: algorithms and applications.
IEEE Transactions on SignalProcessing , 2020.Matthew MacKay, Paul Vicol, Jon Lorraine, David Duvenaud, and Roger Grosse. Self-tuning networks:Bilevel optimization of hyperparameters using structured best-response functions.
International Confer-ence on Learning Representations (ICLR) , 2019.Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deeplearning models resistant to adversarial attacks. In
International Conference on Learning Representations ,2018.Jan Magnus.
Linear Structures . Griffin, 1988.E. Mazumdar and L. J. Ratliff. Local Nash Equilibria are Isolated, Strict Local Nash Equilibria in ‘AlmostAll’ Zero-Sum Continuous Games. In
Proceedings of the 58th IEEE Conference on Decision and Control ,pages 6899–6904, 2019.Eric Mazumdar, Lillian J Ratliff, and S Shankar Sastry. On Gradient-Based Learning in Continuous Games.
SIAM Journal on Mathematics of Data Science , 2(1):103–131, 2020.Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local nash equilibria (and only localnash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838 , 2019.42anayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, andGeorgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile.In
International Conference on Learning Representations (ICLR) , 2019.Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In
Advances in NeuralInformation Processing Systems (NeurIPS) , pages 1825–1835, 2017.Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for gans do actuallyconverge? In
International Conference on Machine learning (ICML) , pages 3481–3490, 2018.Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks.
International Conference on Learning Representations (ICLR) , 2017.D Mustafa and TN Davidson. Generalized integral controllability. In
Proceedings of 1994 33rd IEEE Con-ference on Decision and Control , volume 1, pages 898–903. IEEE, 1994.Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In
Advancesin Neural Information Processing Systems (NeurIPS) , pages 5585–5595, 2017.Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a classof non-convex min-max games using iterative first order methods. In
Advances in Neural InformationProcessing Systems , pages 14934–14942, 2019.J. M. Ortega and W. C. Rheinboldt.
Iterative Solutions to Nonlinear Equations in Several Variables . Aca-demic Press, 1970.Dmitrii M Ostrovskii, Andrew Lowy, and Meisam Razaviyayn. Efficient search of first-order nash equilibriain nonconvex-concave smooth min-max problems. arXiv preprint arXiv:2002.07919 , 2020.Christos Papadimitriou and Georgios Piliouras. Game dynamics as the meaning of a game.
SIGecom Exch. ,16(2):53–63, May 2019. doi: 10.1145/3331041.3331048.Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provablealgorithms and applications in machine learning. arXiv preprint arXiv:1810.02060 , 2018.Aravind Rajeswaran, Igor Mordatch, and Vikash Kumar. A game theoretic framework for model basedreinforcement learning.
International Conference on Machine Learning (ICML) , 2020.L. J. Ratliff, S. A. Burden, and S. S. Sastry. Characterization and computation of local Nash equilibria incontinuous games. In
Allerton Conference on Communication, Control, and Computing , pages 917–924,2013.L. J. Ratliff, S. A. Burden, and S. S. Sastry. Genericity and structural stability of non-degenerate differentialNash equilibria. In
American Control Conference (ACC) , pages 3990–3995, 2014.Lillian J Ratliff, Samuel A Burden, and S Shankar Sastry. On the characterization of local Nash equilibriain continuous games.
IEEE Transactions on Automatic Control , 61(8):2301–2307, 2016.Kevin Roth, Aurelien Lucchi, Sebastian Nowozin, and Thomas Hofmann. Stabilizing training of genera-tive adversarial networks through regularization. In
Advances in Neural Information Processing Systems(NeurIPS) , pages 2018–2028, 2017.S. Shankar Sastry.
Nonlinear Systems Theory . Springer-Verlag, 1999.Shankar Sastry and C Desoer. Jump behavior of circuits and systems.
IEEE Transactions on Circuits andSystems , 28(12):1109–1124, 1981.Lahcen Saydy. New stability/performance results for singularly perturbed systems.
Automatica , 32(6):807– 818, 1996. 43ahcen Saydy, Andr´e L Tits, and Eyad H Abed. Guardian maps and the generalized stability of parametrizedfamilies of matrices and polynomials.
Mathematics of Control, Signals and Systems , 3(4):345–371, 1990.Florian Sch¨afer and Anima Anandkumar. Competitive gradient descent. In
Advances in Neural InformationProcessing Systems (NeurIPS) , pages 7625–7635, 2019.Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principledadversarial training.
International Conference on Learning Representations (ICLR) , 2018.Gerald Teschl. Ordinary differential equations and dynamical systems.
Graduate Studies in Mathematics ,140:08854–8019, 2000.Gugan Thoppe and Vivek Borkar. A concentration bound for stochastic approximation via alekseev’s formula.
Stochastic Systems , 9(1):1–26, 2019.Christiane Tretter.
Spectral theory of block operator matrices and applications . World Scientific, 2008.Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach.
International Conference on Learning Representations (ICLR , 2020.Yasin Yazici, Chuan-Sheng Foo, Stefan Winkler, Kim-Hui Yap, Georgios Piliouras, and Vijay Chandrasekhar.The unusual effectiveness of averaging in gan training.
International Conference on Learning Representa-tions (ICLR) , 2019.Chongjie Zhang and Victor R Lesser. Multi-agent learning with policy prediction. In
AAAI , volume 3,page 8, 2010.Kaiqing Zhang, Zhuoran Yang, and Tamer Ba¸sar. Multi-agent reinforcement learning: A selective overviewof theories and algorithms. arXiv preprint arXiv:1911.10635 , 2019.
Appendices
Table of Contents
A Helper Lemmas and Additional Mathematical Preliminaries
B Proof of Proposition 3
C Proof of Lemma 1 and Lemma 2
D Proof of Corollary 3
E Proof of Theorem 3 and Corollary 2
F Proof of Proposition 4
G Proof of Theorem 4
H Proof of Theorem 7
I Proof of Proposition 5
J Extensions in the Stochastic Setting
K Further Details on Related Work
L Experiments Supplement
A Helper Lemmas and Additional Mathematical Preliminaries
In this appendix, we present a handful of technical lemmas and review some additional mathematical preliminaries excluded from the main body but which are important in proving the results in the paper. The following technical lemma is used in proving an upper bound on the spectral radius of the linearization of the discrete time update τ-GDA, a requirement for obtaining the convergence rate results.
Lemma 3.
The function c(z) = (1 − z)^{1/2} + z − (1 + z)^{1/2} satisfies c(z) ≤ 0 for all z ∈ [0, 1].

Proof. Since c(0) = 0 and c(1) = 1 − √2 ≤ 0, we simply need to show that c′(z) ≤ 0 on (0, 1) to get that c(z) is a decreasing function on [0, 1]. Indeed,

c′(z) = 1 − (1/2)(1 − z)^{−1/2} − (1/2)(1 + z)^{−1/2} ≤ 0 ⟺ (4 − 4z)^{−1/2} + (4 + 4z)^{−1/2} ≥ 1,

and the latter inequality holds for all z ∈ (0, 1) since it holds with equality at z = 0 and the left-hand side is increasing in z.

Proposition 6 (Ostrowski's Theorem (Argyros, 1999); Theorem 10.1.2 (Ortega and Rheinboldt, 1970)). Let x∗ be a fixed point for the discrete dynamical system x_{k+1} = F(x_k). If the spectral radius of the Jacobian satisfies ρ(DF(x∗)) < 1, then F is a contraction at x∗ and hence, x∗ is asymptotically stable.

The following technical lemma, due to Mustafa and Davidson (1994), is used in constructing the finite learning rate ratio.
Lemma 4 ((Mustafa and Davidson, 1994, Lem. 15)). Let V, Z ∈ R^{p×p}, W ∈ R^{p×q}, X ∈ R^{q×p}, and Y ∈ R^{q×q}. If V and Y − XV⁻¹W are non-singular, then

\det\begin{bmatrix} V + Z & W \\ X & Y \end{bmatrix} = \det(V)\,\det(Y - XV^{-1}W)\,\det\big(I + V^{-1}(I + W(Y - XV^{-1}W)^{-1}XV^{-1})Z\big).

For completeness (and because there is a typo in the original manuscript), we provide the proof here.

Proof.
Suppose that V and Y − XV⁻¹W are non-singular so that the partial Schur decomposition

\begin{bmatrix} V & W \\ X & Y \end{bmatrix} = \begin{bmatrix} V & 0 \\ X & Y - XV^{-1}W \end{bmatrix} \begin{bmatrix} I & V^{-1}W \\ 0 & I \end{bmatrix}

holds, and

\det\begin{bmatrix} V & W \\ X & Y \end{bmatrix} = \det(V)\,\det(Y - XV^{-1}W).    (28)

Further,

\begin{bmatrix} V & W \\ X & Y \end{bmatrix}^{-1} = \begin{bmatrix} I & -V^{-1}W \\ 0 & I \end{bmatrix} \begin{bmatrix} V^{-1} & 0 \\ -(Y - XV^{-1}W)^{-1}XV^{-1} & (Y - XV^{-1}W)^{-1} \end{bmatrix}.

Applying the determinant operator, we have that

\det\begin{bmatrix} V + Z & W \\ X & Y \end{bmatrix} = \det\begin{bmatrix} V & W \\ X & Y \end{bmatrix} \det\left( \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} + \begin{bmatrix} V & W \\ X & Y \end{bmatrix}^{-1} \begin{bmatrix} Z & 0 \\ 0 & 0 \end{bmatrix} \right)    (29)

so that

\det\left( \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} + \begin{bmatrix} V & W \\ X & Y \end{bmatrix}^{-1} \begin{bmatrix} Z & 0 \\ 0 & 0 \end{bmatrix} \right) = \det\begin{bmatrix} V^{-1}(I + W(Y - XV^{-1}W)^{-1}XV^{-1})Z + I & 0 \\ -(Y - XV^{-1}W)^{-1}XV^{-1}Z & I \end{bmatrix}    (30)

= \det\big(V^{-1}(I + W(Y - XV^{-1}W)^{-1}XV^{-1})Z + I\big).    (31)

Combining (28) with (31) in (29) gives exactly the claimed result.
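As a quick numerical sanity check on this identity (our own snippet, with generic random matrices rather than anything from the paper), the following verifies both sides agree up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2
V, Z = rng.standard_normal((p, p)), rng.standard_normal((p, p))
W, X, Y = rng.standard_normal((p, q)), rng.standard_normal((q, p)), rng.standard_normal((q, q))

S = Y - X @ np.linalg.inv(V) @ W   # Schur complement of V
lhs = np.linalg.det(np.block([[V + Z, W], [X, Y]]))
rhs = (np.linalg.det(V) * np.linalg.det(S)
       * np.linalg.det(np.eye(p) + np.linalg.inv(V)
                        @ (np.eye(p) + W @ np.linalg.inv(S) @ X @ np.linalg.inv(V)) @ Z))
print(np.isclose(lhs, rhs))   # True for generic random matrices
```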
The following lemma is Theorem 2 of Lancaster and Tismenetsky (1985, Chap. 13.1). We use this lemma several times in the proofs of Theorems 3 and 4, so we include it here for ease of reference. For a given matrix A, υ₊(A), υ₋(A), and ζ(A) are the number of eigenvalues of the argument that have positive, negative, and zero real parts, respectively.

Lemma 5. Consider a matrix A ∈ R^{n×n}.
(a) If P is a symmetric matrix such that AP + PA⊤ = Q where Q = Q⊤ > 0, then P is nonsingular and P and A have the same inertia, i.e.,
υ₊(A) = υ₊(P), υ₋(A) = υ₋(P), ζ(A) = ζ(P).    (32)
(b) On the other hand, if ζ(A) = 0, then there exists a matrix P = P⊤ and a matrix Q = Q⊤ > 0 such that AP + PA⊤ = Q and P and A have the same inertia (i.e., (32) holds).

Numerical and Quadratic Numerical Range.
The numerical range and quadratic numerical range of a block operator matrix are particularly useful for proving results about the spectrum of a block operator matrix as they are supersets of the spectrum (Tretter, 2008). Given a matrix A ∈ R^{n×n}, the numerical range is defined by

W(A) = {⟨Az, z⟩ : z ∈ C^n, ‖z‖ = 1},

and is a convex subset of C. Define spaces W_i = {z ∈ C^{n_i} : ‖z‖ = 1} for each i ∈ {1, 2}. Consider a block operator

A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},

where A_{ii} ∈ R^{n_i×n_i} and A_{ij} ∈ R^{n_i×n_j} for each i, j ∈ {1, 2}. Given v ∈ W₁ and w ∈ W₂, let A_{v,w} ∈ C^{2×2} be defined by

A_{v,w} = \begin{bmatrix} ⟨A_{11}v, v⟩ & ⟨A_{12}w, v⟩ \\ ⟨A_{21}v, w⟩ & ⟨A_{22}w, w⟩ \end{bmatrix}.

The quadratic numerical range of A is defined by

W²(A) = ∪_{v ∈ W₁, w ∈ W₂} spec(A_{v,w})

where spec(·) denotes the spectrum of its argument. The quadratic numerical range can be described as the set of solutions of the characteristic polynomial

λ² − λ(⟨A_{11}v, v⟩ + ⟨A_{22}w, w⟩) + ⟨A_{11}v, v⟩⟨A_{22}w, w⟩ − ⟨A_{12}w, v⟩⟨A_{21}v, w⟩ = 0    (33)

for v ∈ W₁ and w ∈ W₂. We use the notation ⟨Av, w⟩ = v̄⊤Aw to denote the inner product. Note that W²(A) is a (potentially non-convex) subset of W(A) and contains spec(A).
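The following Monte-Carlo sketch (ours, with an arbitrary test matrix whose blocks mimic the structure of the game Jacobian at a differential Nash equilibrium) makes the definition concrete: it samples unit vectors v and w, collects the eigenvalues of the 2×2 compressions A_{v,w}, and illustrates the spectral inclusion exploited in the proof that follows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Block matrix with the structure of J_tau(x*) at a differential Nash
# equilibrium: A11 > 0, A22 = tau * (positive definite), A12 = B, A21 = -tau * B^T.
n1, n2, tau = 3, 2, 2.0
P1 = rng.standard_normal((n1, n1)); A11 = P1 @ P1.T + np.eye(n1)
P2 = rng.standard_normal((n2, n2)); A22 = tau * (P2 @ P2.T + np.eye(n2))
B = rng.standard_normal((n1, n2))
A12, A21 = B, -tau * B.T
A = np.block([[A11, A12], [A21, A22]])

def sample_qnr(num=5000):
    pts = []
    for _ in range(num):
        v = rng.standard_normal(n1) + 1j * rng.standard_normal(n1)
        w = rng.standard_normal(n2) + 1j * rng.standard_normal(n2)
        v, w = v / np.linalg.norm(v), w / np.linalg.norm(w)
        M = np.array([[v.conj() @ A11 @ v, v.conj() @ A12 @ w],
                      [w.conj() @ A21 @ v, w.conj() @ A22 @ w]])
        pts.extend(np.linalg.eigvals(M))
    return np.array(pts)

pts = sample_qnr()
print("min Re over sampled points of W^2(A):", pts.real.min())        # positive
print("min Re over spec(A):                 ", np.linalg.eigvals(A).real.min())  # positive
# spec(A) is contained in W^2(A), so showing that W^2(A) lies in the open right
# half-plane certifies that the spectrum does as well.
```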
Before proving this result, we note that the result has already been shown in the literature by Jin et al.(2020). We included the result primarily because the proof approach is different and the tools we use (inparticular, the quadratic numerical range) have not been utilized before in this type of analysis. Hence, weview the proof technique itself as a contribution. 46 roof of Proposition 3
We leverage the quadratic numerical range to show that spec( J τ ( x ∗ )) ⊂ C ◦ + for any τ ∈ (0 , ∞ ). Indeed, the quadratic numerical range of a block operator matrix contains its spec-trum (Tretter, 2008).Recall that W ( J τ ( x ∗ )) = (cid:91) v ∈ W ,w ∈ W spec( J v,wτ ( x ∗ ))where J v,wτ ( x ∗ ) = (cid:20) (cid:104) D f ( x ∗ ) v, v (cid:105) (cid:104) D f ( x ∗ ) w, v (cid:105)(cid:104)− τ D (cid:62) f ( x ∗ ) v, w (cid:105) (cid:104)− τ D f ( x ∗ ) w, w (cid:105) (cid:21) and W i = { z ∈ C n i : (cid:107) z (cid:107) = 1 } for each i = 1 ,
2. Fix v ∈ W and w ∈ W and consider J v,wτ ( x ∗ ) = (cid:20) a b − τ ¯ b τ d (cid:21) Then, the elements of W ( J τ ( x ∗ )) are of the form λ τ = ( a + τ d ) ± (cid:112) ( a − τ d ) − τ | b | where a = (cid:104) D f ( x ∗ ) v, v (cid:105) , b = (cid:104) D f ( x ∗ ) w, v (cid:105) and d = (cid:104)− D f ( x ∗ ) w, w (cid:105) for vectors v ∈ W and w ∈ W .We claim that for any τ ∈ (0 , ∞ ), Re( λ τ ) > a , b , and d where a > d > x ∗ is adifferential Nash equilibrium.Indeed, we argue this by considering the two possible cases: (1) ( a − τ d ) ≤ | b | τ or (2) ( a − τ d ) > τ | b | . • Case 1 : Suppose τ ∈ (0 , ∞ ) is such that ( a − τ d ) ≤ | b | τ . Then, Re( λ τ ) = ( a + τ d ) > a + d > • Case 2 : Suppose τ ∈ (0 , ∞ ) is such that ( a − τ d ) > τ | b | . In this case, we want to ensure thatRe( λ τ ) > ( a + τ d ) − (cid:112) ( a − τ d ) − τ | b | > . The last inequality is equivalent to − ad < | b | . Indeed,( a + τ d ) > ( a − τ d ) − τ | b | ⇐⇒ τ ad > − τ | b | ⇐⇒ − ad < | b | . Moreover, − ad < | b | holds for any pair of vectors ( v, w ) such that v ∈ W and w ∈ W since a > d > τ ∈ (0 , ∞ ), spec( J τ ( x ∗ )) ⊂ C ◦ + since the spectrum of an operator is contained in its quadraticnumerical range and the above argument shows that W ( J τ ( x ∗ )) ⊂ C ◦ + . C Proof of Lemma 1 and Lemma 2
In this appendix section, we prove Lemma 1 and Lemma 2 from Section 4. We note that the proof ofLemma 2 starts where the proof of Lemma 1 leaves off.
C.1 Proof of Lemma 1
Suppose that x ∗ is a differential Stackelberg or Nash equilibrium and that 0 < τ < ∞ is such thatspec( − J τ ( x ∗ )) ⊂ C ◦− . For the discrete time dynamical system x k +1 = x k − γ Λ τ g ( x k ), it is well knownthat if γ is chosen such that ρ ( I − γ J τ ( x ∗ )) <
1, then x k locally (exponentially) converges to x ∗ (Ortegaand Rheinboldt, 1970). With this in mind, we formulate an optimization problem to find the upper bound γ on the learning rate γ such that for all γ ∈ (0 , γ ), the spectral radius of the local linearization of thediscrete time map is a contraction which is precisely ρ ( I − γ J τ ( x ∗ )) <
1. The optimization problem is givenby γ = min γ> (cid:26) γ : max λ ∈ spec( J τ ( x ∗ )) | − γλ | ≤ (cid:27) . (34)47he intuition is as follows. The inner maximization problem is over a finite set spec( J τ ( x ∗ )) = { λ , . . . , λ n } where J τ ( x ∗ ) ∈ R n × n . As γ increases away from zero, each | − γλ i | shrinks in magnitude. The last λ i such that 1 − γλ i hits the boundary of the unit circle in the complex plane (i.e., | − γλ i | = 1) gives us theoptimal value of γ and the element of spec( J τ ( x ∗ )) that achieves it. Examining the constraint, we have thatfor each λ i , γ ( γ | λ i | − λ i )) ≤ γ >
0. As noted this constraint will be tight for one of the λ ,in which case γ = 2Re( λ ) / | λ | since γ >
0. Hence, by selecting γ = min λ ∈ spec( J τ ( x ∗ )) λ ) / | λ | , we havethat | − γ λ | < λ ∈ spec( J τ ( x ∗ )) and any γ ∈ (0 , γ ).To see this is the case, let γ = min λ ∈ spec( J τ ( x ∗ )) λ ) / | λ | and λ m = arg min λ ∈ spec( J τ ) λ ) / | λ | .Using the expression for γ , we have that1 − γ Re( λ ) + γ (Re( λ ) + Im( λ ) ) = 1 − λ m ) | λ m | Re( λ ) + (cid:18) λ m ) | λ m | (cid:19) | λ | . Now, using the fact that Re( λ ) / | λ | > Re( λ m ) / | λ m | , we have1 − λ m ) | λ m | Re( λ ) + (cid:18) λ m ) | λ m | (cid:19) | λ | ≤ − λ m ) | λ m | Re( λ ) + (cid:18) λ m ) | λ m | (cid:19) | λ m | Re( λ )Re( λ m )= 1 − λ m ) | λ m | Re( λ ) + 4 Re( λ m ) | λ m | Re( λ )= 1as claimed. From this argument, it is clear that for any γ ∈ (0 , γ ), | − γ λ | < λ ∈ spec( J τ ( x ∗ )).Now, consider any α ∈ (0 , γ ) and let β = (2Re( λ m ) − α | λ m | ) − . Observe that γ = γ − α so that γ ∈ (0 , γ ). Hence, | − ( γ − α ) λ m | = (cid:18) − (cid:18) λ m ) | λ m | − α (cid:19) Re( λ m ) (cid:19) + (cid:18) λ m ) | λ m | − α (cid:19) Im( λ m ) = 1 − λ m ) | λ m | + 2 α Re( λ m ) + 4 Re( λ m ) | λ m | − α Re( λ m ) + α | λ m | = 1 − α Re( λ m ) + α | λ m | = 1 − αβ so that ρ ( I − γ J τ ( x ∗ )) < (cid:18) − αβ (cid:19) / . Hence, the ρ ( I − γ J τ ( x ∗ )) < C.2 Proof of Lemma 2
To prove this lemma, we build directly on the conclusion of the proof of Lemma 1. Indeed, since ρ ( I − γ J τ ( x ∗ )) < (cid:18) − αβ (cid:19) / , given ε = α β > (cid:107) · (cid:107) (cf. Lemma 5.6.10 in Horn and Johnson (1985)) such that (cid:107) I − γ J τ ( x ∗ ) (cid:107) ≤ (cid:18) − αβ (cid:19) / + α β ≤ (cid:18) − α β (cid:19) / where the last inequality holds by Lemma 3. Taking the Taylor expansion of I − γ g τ ( x ) around x ∗ , we have I − γ g τ ( x ) = ( I − γ g τ ( x ∗ )) + ( I − γ J τ ( x ∗ ))( x − x ∗ ) + R ( x − x ∗ ) The norm that exists can easily be constructed as essentially a weighted induced 1-norm. Note that the norm constructionis not unique. The proof in Horn and Johnson (1985) is by construction and the construction of this norm can be found there. R ( x − x ∗ ) is the remainder term satisfying R ( x − x ∗ ) = o ( (cid:107) x − x ∗ (cid:107) ) as x → x ∗ . This implies thatthere is a δ > (cid:107) R ( x − x ∗ ) (cid:107) ≤ α β (cid:107) x − x ∗ (cid:107) whenever (cid:107) x − x ∗ (cid:107) < δ . Hence, (cid:107) I − γ g τ ( x ) − ( I − γ g τ ( x ∗ )) (cid:107) ≤ (cid:18) (cid:107) I − γ J τ ( x ∗ ) (cid:107) + α β (cid:19) (cid:107) x − x ∗ (cid:107)≤ (cid:32)(cid:18) − α β (cid:19) / + α β (cid:33) (cid:107) x − x ∗ (cid:107)≤ (cid:18) − α β (cid:19) / (cid:107) x − x ∗ (cid:107) where the last inequality holds again by Lemma 3. Hence, (cid:107) x k − x ∗ (cid:107) ≤ (cid:18) − α β (cid:19) k/ (cid:107) x − x ∗ (cid:107) (35)whenever (cid:107) x − x ∗ (cid:107) < δ which verifies the claimed convergence rate. D Proof of Corollary 3
Let (cid:107) · (cid:107) be the norm that exists (via construction a la Horn and Johnson (1985, Lem. 5.6.10)) in the proofof Lemma 2 which is given in Appendix C. Following standard arguments, (35) in the proof of Lemma 2implies a finite time convergence guarantee. Indeed, let ε > < α β < − α/ (4 β )) k < exp( − kα/ (4 β )). Hence, (cid:107) x k − x ∗ (cid:107) ≤ exp( − kα/ (4 β )) (cid:107) x − x ∗ (cid:107) . In turn, this implies that x k ∈ B ε ( x ∗ ), meaning that x k is a ε -differential Stackelberg equilibrium for all k ≥ (cid:100) βα log( (cid:107) x − x ∗ (cid:107) /ε ) (cid:101) whenever (cid:107) x − x ∗ (cid:107) < δ .Now, given that f i ∈ C r ( X, R ) for r ≥ I − γ J τ ( x ) is locally Lipschitz with constant L so that we canfind an explicit expression for δ in terms of L . Indeed, recall that R ( x − x ∗ ) = o ( (cid:107) x − x ∗ (cid:107) ) as x → x ∗ whichmeans lim x → x ∗ (cid:107) R ( x − x ∗ ) (cid:107) / (cid:107) x − x ∗ (cid:107) = 0 so that (cid:107) R ( x − x ∗ ) (cid:107) ≤ (cid:90) (cid:107) I − γ J τ ( x ∗ + η ( x − x ∗ )) − ( I − γ J τ ( x ∗ )) (cid:107)(cid:107) x − x ∗ (cid:107) dη ≤ L (cid:107) x − x ∗ (cid:107) Observing that (cid:107) R ( x − x ∗ ) (cid:107) ≤ L (cid:107) x − x ∗ (cid:107) = L (cid:107) x − x ∗ (cid:107)(cid:107) x − x ∗ (cid:107) , we have that the δ > (cid:107) R ( x − x ∗ ) (cid:107) ≤ α/ (8 β ) (cid:107) x − x ∗ (cid:107) is δ = α/ (4 Lβ ). E Proof of Theorem 3 and Corollary 2
To prove Theorem 3 and Corollary 2, we introduce some techniques that are arguably new to the machinelearning and artificial intelligence communities. The first is the notion of a guard map. A guard map canbe used to provide a certificate of a particular behavior for a dynamical system as a parameter(s) varies. Acritical point of a dynamical systems is known to be stable if the spectrum of the Jacobian at the criticalpoint lies in the open left-half complex plane, denoted C ◦− . Hence, we construct a guard map as a function of τ and show that it guards C ◦− . Specifically we show that the existence of a τ ∗ ∈ (0 , ∞ ) such that ν ( τ ∗ ) = 0and ν ( τ ) (cid:54) = 0 for all τ ∈ ( τ ∗ , ∞ ) is equivalent to S ( J ( x ∗ )) > − D f ( x ∗ ) > S ( J ( x ∗ )) = S ( J τ ( x ∗ )) = D f ( x ∗ ) − D f ( x ∗ )( D f ( x ∗ )) − D f ( x ∗ ) . Towards this end, we need to introduced some notation as well as formal definitions for important conceptssuch as the guard map. The notation R ( x − x ∗ ) = o ( (cid:107) x − x ∗ (cid:107) ) as x → x ∗ means lim x → x ∗ (cid:107) R ( x − x ∗ ) (cid:107) / (cid:107) x − x ∗ (cid:107) = 0. .1 Notation and Preliminaries Given a matrix A ∈ R n × n , let vec( A ) ∈ R n n be the vectorization of A . We use the convention that rowsare transposed and stacked in order. That is,vec : — a —...— a n — (cid:55)→ a (cid:62) ... a (cid:62) n Let ⊗ and ⊕ denote the Kronecker product and Kronecker sum respectively. Recall that A ⊕ B = A ⊗ B + B ⊗ A . A less common operator, we define (cid:1) as an operator that generates an n ( n + 1) × n ( n + 1) matrixfrom a matrix A ∈ R n × n such that A (cid:1) A = H + n ( A ⊕ A ) H n where H + n = ( H (cid:62) n H n ) − H (cid:62) n is the (left) pseudo-inverse of H n , a full column rank dupplication matrix. Aduplication matrix H n ∈ R n × n ( n +1) / is a clever linear algebra tool for mapping a n ( n + 1) vector to a n vector generated by applying vec( · ) to a symmetric matrix and it is designed to respect the vectorizationmap vec( · ). In particular, if vech( X ) is the half-vectorization map of any symmetric matrix X ∈ R n × n , thenvec( X ) = H n vech( X ) and vech( X ) = H + n vec( X ).Given a square matrix A , let λ +max ( A ) be the largest positive real eigenvalue of A and if A does not havea positive real eigenvalue then it is zero. Guardian maps.
The use of guardian maps for studying stability of parameterized families of dynamicalsystems was arguably introduced by Saydy et al. (1990). Guardian or guard maps act as a certificate for aperformance criteria such as stability.Formally, let X be the set of all n × n real matrices or the set of all polynomials of degree n with realcoefficients. Consider S an open subset of X with closure ¯ S and boundary ∂ S . Definition 6.
The map ν : X → C is said to be a guardian map for S if for all x ∈ ¯ S , ν ( x ) = 0 ⇐⇒ x ∈ ∂ S . Consider an open subset Ω of the complex plane that is symmetric with respect to the real axis. Then,elements of S (Ω) = { A ∈ R n × n : spec( A ) ⊂ Ω } are said to be stable relative to Ω.The following result gives a necessary and sufficient condition for stability of parameterized families ofmatrices relative to some open subset of the complex plane. Proposition 7 (Proposition 1 (Saydy et al., 1990); Theorem 2 (Abed et al., 1990)) . Consider U to be apathwise connected subset of R and A ( τ ) ∈ R n × n a matrix which depends continuously on τ . Let S (Ω) beguarded by the map ν . The family { A ( τ ) : τ ∈ U } is stable relative to Ω if and only if ( i ) it is nominallystable—i.e., A ( τ ) ∈ S (Ω) for some τ ∈ U —and ( ii ) ν ( A ( τ )) (cid:54) = 0 for all τ ∈ U . In proving Theorem 3, we define a guard map for the space of n × n Hurwitz stable matrices which isdenoted by S ( C ◦− ). Lemma 6.
The map ν : A (cid:55)→ det( A (cid:1) A ) guards the set of n × n Hurwitz stable matrices S ( C ◦− ) .Proof. This follows from the following observation: for A ∈ R n × n ,vech( AX + XA (cid:62) ) = H + n vec( AX + XA (cid:62) ) = H + n ( A ⊕ A )vec( X ) = H + n ( A ⊕ A ) H n vech( X )from which it can be shown that the eigenvalues of A (cid:1) A are λ i + λ j for 1 ≤ j ≤ i ≤ n + n where λ i for i = 1 , . . . , n are the eigenvalues of A .Indeed, let S be a non-singular matrix such that S − AS = M where M is upper triangular with λ , . . . , λ n on its diagonal. Observe that for any n × n matrix P , H n H + n ( P ⊗ P ) H n = ( P ⊗ P ) H n and H + n ( P ⊗ P ) H n H + n = H + n ( P ⊗ P ). Hence, using properties of the Kronecker product (namely, that ( A ⊗ A )( B ⊗ B ) = ( A B ⊗ A B )), we have that H + n ( S − ⊗ S − ) H n H + n ( I ⊗ A + A ⊗ I ) H n H + n ( S ⊗ S ) H n = H + n ( I ⊗ M + M ⊗ I ) H n
50o that the spectrum of H + n ( I ⊗ A + A ⊗ I ) H n and H + n ( I ⊗ M + M ⊗ I ) H n coincide. Now, since M is uppertriangular, H + n ( I ⊗ M + M ⊗ I ) H n is upper triangular with diagonal elements λ i + λ j (1 ≤ j ≤ i ≤ n )which can be verified by direct computation and using the definition of H n . This implies that λ i + λ j (1 ≤ j ≤ i ≤ n ) are exactly the eigenvalues of H + n ( I ⊗ A + A ⊗ I ) H n .We note that there are several other guard maps for the space of Hurwtiz stable matrices including ν : A (cid:55)→ det( A ⊕ A ). To give some intuition for this map, it is fairly straightforward to see that theKronecker sum A ⊕ A = A ⊗ I + I ⊗ A has spectrum { λ j + λ i } where λ i , λ j ∈ spec( A ). The operator A (cid:1) A is simply a more computationally efficient expression of A ⊕ A , and as such the eigenvalues of A (cid:1) A arethose of A ⊕ A removing redundancies. We use A (cid:1) A specifically because of its computational advantagesin computing τ ∗ . E.2 Proof of Theorem 3
We first prove that if x ∗ is a differential Stackelberg equilibrium (i.e., S ( J τ ( x ∗ )) > − D f ( x ∗ ) > τ ∗ ∈ (0 , ∞ ) such that for all τ ∈ ( τ ∗ , ∞ ), x ∗ is exponentially stable for ˙ x = − Λ τ g ( x )(i.e., spec( − J τ ( x ∗ )) ⊂ C ◦− ). Towards this end, we construct a guard map for the space of n × n Hurwtizstable matrices and explicitly construct the τ ∗ using it.Then we prove the other direction. That is, if there exists a finite τ ∗ ∈ (0 , ∞ ) such that for all τ ∈ ( τ ∗ , ∞ ), x ∗ is exponentially stable for ˙ x = − Λ τ g ( x ), then x ∗ is a differential Stackelberg equilibrium. We prove thisby contradiction. E.2.1 Proof that if x ∗ is a differential Stackelberg then finite τ ∗ exists Towards this end, for a critical point x ∗ , let − J τ ( x ∗ ) = (cid:20) − D f ( x ∗ ) − D f ( x ∗ ) τ D (cid:62) f ( x ∗ ) τ D f ( x ∗ ) (cid:21) = (cid:20) A A − τ A (cid:62) τ A (cid:21) and define S = S ( − J τ ( x ∗ )) = A − A A − A (cid:62) . Note that this is equivalent to the first Schur complement of − J ( x ∗ ) (i.e., when τ = 1) since the τ and τ − cancel, and by assumption the first Schur complement of − J ( x ∗ ) is positive definite. Suppose that x ∗ is adifferential Stackelberg equilibrium so that − S > − A > Polynomial guard map with family of matrices parameterized by τ . By Lemma 6, ν : A (cid:55)→ det( A (cid:1) A ) is a guard map for S ( C ◦− ). Indeed, using the fact that the determinant is the product of theeigenvalues of a matrix and the fact that spec( A (cid:1) A ) = { λ i + λ j , ≤ i ≤ j ≤ n, λ i , λ j ∈ spec( A ) } , we havethat det( A (cid:1) A ) = (cid:89) ≤ j ≤ i ≤ n ( λ i + λ j ) = (cid:89) ≤ i ≤ n λ i )(4Re ( λ i ) + 4Im ( λ i )) (cid:89)
51 function of τ which guards the set of Hurwitz stable matrices via the reasoning describe above. Indeed,by slightly overloading the notation for ν , ν ( τ ) := ν + ν τ + · · · + ν p − τ p − + ν p τ p = ν ( − J τ ( x ∗ ))Hence, for intuition, observe that as τ decreases (towards zero) stability is first lost when at least oneeigenvalue of − J τ ( x ∗ ) reaches the imaginary axis, at which point ν ( τ ) = 0.There are two cases to consider: Case 1: ν ( τ ) is an identically zero polynomial. In this case, − J τ ( x ∗ ) is in the interior of the complementof the set of Hurwitz stable matrices for all values of τ > − J τ ( x ∗ ) ∈ int( S c ( C ◦− )) for all τ ∈ R + = (0 , ∞ ). Case 2: ν ( τ ) is not an identically zero polynomial. In this case, ν ( τ ) has finitely many zeros. If ν ( τ ) hasno positive real roots, then as τ varies in R + , − J τ ( x ∗ ) does not cross ∂ S ( C ◦− —i.e., the boundary ofthe space of n × n Hurwitz stable matrices. Hence, {− J τ ( x ∗ ) : τ ∈ R + } ⊂ S c ( C ◦− ) or {− J τ ( x ∗ ) : τ ∈ R + } ⊂ int( S c ( C ◦− )). It suffices to check − J τ ( x ∗ ) ∈ S c ( C ◦− ) or − J τ ( x ∗ ) ∈ int( S c ( C ◦− )) for an arbitrary τ ∈ R + .On the other hand, if ν ( τ ) has (cid:96) ≥ < τ < · · · < τ (cid:96) = τ ∗ , then byProposition 7, − J τ ( x ∗ ) ∈ S ( C ◦− ) for all τ > τ ∗ if and only if − J τ ( x ∗ ) ∈ S ( C ◦− ) for arbitrarily chosen τ > τ ∗ . We choose the largest positive root τ (cid:96) because we are guaranteed that ν ( τ ) stops changingsign for τ > τ ∗ . Further, the largest neighborhood in R + for which − J τ ( x ∗ ) ∈ S ( C ◦− ) is ( τ (cid:96) , ∞ ).Recall that we have assumed that x ∗ is a differential Stackelberg equilibrium (i.e., S > − A > τ ∗ ) that we are always in case 2. Construction of τ ∗ . We note that there are more elegant, simpler constructions, but to our knowledgethis construction gives the tightest bound on the range of τ for which − J τ ( x ∗ ) is guarnateed to be Hurwitzstable. Recall that − J τ ( x ∗ ) = (cid:20) − D f ( x ∗ ) − D f ( x ∗ ) τ D (cid:62) f ( x ∗ ) τ D f ( x ∗ ) (cid:21) = (cid:20) A A − τ A (cid:62) τ A (cid:21) and S = A − A A − A (cid:62) . Let I m denote the m × m identity matrix. Claim 1.
The finite learning rate ratio is τ ∗ = λ +max ( Q ) where Q = − ( A ⊗ A − ) + 2 (cid:2) ( A ⊗ A − ) H n ( I n ⊗ A − A (cid:62) ) H n (cid:3) (cid:20) ¯ A − − ¯ S − (cid:21) (cid:20) H + n ( A (cid:62) ⊗ I n ) H + n ( S ⊗ A A − ) (cid:21) (36) with ¯ A = A (cid:1) A and ¯ S = S (cid:1) S .Proof. Recall that ν ( τ ) = det( − J τ ( x ∗ ) (cid:1) ( − J τ ( x ∗ ))) is a guard map for S ( C ◦− ).We apply basic properties of the Kronecker product and sum as well as Schur’s determinant formula toobtain a reduced form of the guard map. To this end, we have that − J τ ( x ∗ ) (cid:1) ( − J τ ( x ∗ )) = A (cid:1) A H + n ( I n ⊗ A ) 0 τ ( I n ⊗ ( − A (cid:62) )) H n A ⊕ τ A ( A ⊗ I n ) H n τ H + n ( − A (cid:62) ⊗ I n ) τ ( A (cid:1) A ) Now, we apply Schur’s determinant formula to get that ν ( τ ) = τ n ( n +1) / det( A (cid:1) A ) det (cid:18)(cid:20) A (cid:1) A H + n ( I n ⊗ A ) τ ( I n ⊗ ( − A (cid:62) )) H n A ⊕ τ A + M (cid:21)(cid:19) (37)where M = − H + n ( − A (cid:62) ⊗ I n )( A (cid:1) A ) − ( A ⊗ I n ) H n A ⊕ τ A = A ⊗ I n + I n ⊗ τ A . Let V = I n ⊗ τ A , Z = A ⊗ I n + M , Y = A (cid:1) A , W = − τ ( I n ⊗ A (cid:62) ) H n , and X = 2 H + n ( I n ⊗ A ).Using the two properties of the Kronecker product ( B ⊗ B )( B ⊗ B ) = ( B B ⊗ B B ) and ( B ⊗ B ) − =( B − ⊗ B − ), we have that Y − XV − W = A (cid:1) A + 2 H + n ( I n ⊗ A )( I n ⊗ A ) − ( I n ⊗ A (cid:62) ) H n (38)= A (cid:1) A + 2 H + n ( I n ⊗ A A − A (cid:62) ) H n (39)= A (cid:1) A + H + n (( I n ⊗ A A − A (cid:62) ) + ( A A − A (cid:62) ⊗ I n )) H n (40)= S (cid:1) S (41)where (40) holds since H + n ( I n ⊗ A A − A (cid:62) ) H n = H + n ( A A − A (cid:62) ⊗ I n ) H n . Now, define V − + V − W ( Y − XV − W ) − XV − = τ − M where M = I n ⊗ A − − I n ⊗ A − A (cid:62) ) H n ( S (cid:1) S ) − H + n ( I n ⊗ A A − )so that applying Lemma 4 we have ν ( τ ) = τ n ( n +1) / det( A (cid:1) A ) det( S (cid:1) S ) det( I n ⊗ A ) det( τ I n n + M ( A ⊗ I n + M )) (42)The assumptions that S > − A > S (cid:1) S ) (cid:54) = 0 and det( I n ⊗ A ) (cid:54) = 0.Hence, ν ( τ ) = 0 if and only if det( τ I n n + M ( A ⊗ I n + M )) = 0 since 0 < τ < ∞ . The determinantexpression is exactly an eigenvalue problem.Since by assumption the Schur complement of J ( x ∗ ) and the individual Hessian − D f ( x ∗ ) are positivedefinite (i.e., x ∗ is a differential Stackelberg equilibrium), Thus, the largest positive real root of ν ( τ ) = 0 is τ ∗ = λ +max ( − M ( A ⊗ I n + M ))where λ +max ( · ) is the largest positive real eigenvalue of its argument if one exists and otherwise its zero. Usingproperties of the Kronecker product and duplication matrices, it can easily be seen that Q = − M ( A ⊗ I n + M ).The result of this claim concludes the proof that if x ∗ is a differential Stackelberg, then there exists afinite τ ∗ ∈ [0 , ∞ ) such that for all τ ∈ ( τ ∗ , ∞ ), spec( − J τ ( x ∗ )) ⊂ C ◦− . E.2.2 Proof that existence of finite τ ∗ implies that x ∗ is a differential Stackelberg The proof of this direction is argued by contradiction. Consider a critical point x ∗ (i.e., where g ( x ∗ ) = 0such that − C ≡ − D f ( x ∗ ) and S ≡ S ( J ( x ∗ )) = D f ( x ∗ ) − D f ( x ∗ )( D f ( x ∗ )) − D (cid:62) f ( x ∗ ) have no zeroeigenvalues—that is, det( S ) (cid:54) = 0 and det( C ) (cid:54) = 0.Suppose that there exists a τ ∗ ∈ (0 , ∞ ) such that for all τ ∈ ( τ ∗ , ∞ ), spec( − J τ ( x ∗ )) ⊂ C ◦− , yet x ∗ isnot a differential Stackelberg equilibrium. 
That is, either − S or C have at least one positive eigenvalue.Without loss of generality, let − S have at least one positive eigenvalue.Since det( S ) (cid:54) = 0 and det( C ) (cid:54) = 0, by Lemma 5.b, there exists non-singular Hermitian matrices P , P and positive definite Hermitian matrices Q , Q such that − S P − P S = Q and CP + P C = Q .Further, − S and P have the same inertia, meaning υ + ( − S ) = υ + ( P ) , υ − ( − S ) = υ − ( P ) , ζ ( − S ) = ζ ( P )where for a given matrix A , υ + ( A ), υ − ( A ), and ζ ( A ) are the number of eigenvalues of the argument thathave positive, negative and zero real parts, respectively. Similarly, C and P have the same inertia: υ + ( C ) = υ + ( P ) , υ − ( C ) = υ − ( P ) , ζ ( C ) = ζ ( P ) . Since − S has at least one strictly positive eigenvalue, υ + ( P ) = υ + ( − S ) ≥ P = (cid:20) I L (cid:62) I (cid:21) (cid:20) P P (cid:21) (cid:20) I L I (cid:21) (43)where L = ( D f ( x ∗ )) − D (cid:62) f ( x ∗ ) = CD (cid:62) f ( x ∗ ). Since P is congruent to blockdiag( P , P ), by Sylvester’slaw of inertia (Horn and Johnson, 1985, Thm. 4.5.8), P and blockdiag( P , P ) have the same inertia, meaningthat υ + ( P ) = υ + (blockdiag( P , P )), υ − ( P ) = υ − (blockdiag( P , P )), and ζ ( P ) = ζ (blockdiag( P , P )).Consider the matrix equation − P J τ ( x ∗ ) − J (cid:62) τ ( x ∗ ) P = Q τ for − J τ ( x ∗ ) where Q τ = (cid:20) I L (cid:62) I (cid:21) (cid:20) Q P D f ( x ∗ ) − S L (cid:62) P ( P D f ( x ∗ ) − S L (cid:62) P ) (cid:62) P L D f ( x ∗ ) + ( P L D f ( x ∗ )) (cid:62) + τ Q (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) B τ (cid:20) I L I (cid:21) which can be verified by straightforward calculations.Observe that Q τ > B τ > B τ > Q > S ( B τ ) > S ( B τ ) = P L D f ( x ∗ ) + ( P L D f ( x ∗ )) (cid:62) + τ Q − ( P D f ( x ∗ ) − S L (cid:62) P ) (cid:62) Q − ( P D f ( x ∗ ) − S L (cid:62) P ) . Now, S ( B τ ) is also a real symmetric matrix, and hence, it is positive definite if and only if all its eigenvaluesare positive. To determine the range of τ such that S ( B τ ) is positive definite, we can formulate an eigenvalueproblem to determine the value of τ such that the matrix S ( B τ ) becomes singular. This is analogous to theguard map approach used in the proof in the previous subsection for the other direction of the proof, andin this case, we are varying τ from zero to infinity and finding the point such that for all larger τ , S ( B τ ) ispositive definite. Intuitively, such an argument works since τ scales the positive definite matrix Q . Towardsthis end, consider the eigenvalue problem in τ given by0 = det (cid:16) τ I − Q − (cid:0) ( P D f ( x ∗ ) − S L (cid:62) P ) (cid:62) Q − ( P D f ( x ∗ ) − S L (cid:62) P ) − P L D f ( x ∗ ) − ( P L D f ( x ∗ )) (cid:62) (cid:1)(cid:17) . Let τ be the maximum positive eigenvalue, and zero otherwise. Then, since eigenvalues vary continuously,for all τ ∈ ( τ , ∞ ), Q τ > P and − J τ ( x ∗ ) have the same inertia,but this contradicts the stability of − J τ ( x ∗ ) for all τ ∈ ( τ ∗ , ∞ ) since υ + ( P ) ≥ E.3 Proof of Corollary 2
Suppose that x ∗ is a differential Stackelberg equilibrium so that by Theorem 3, there exists a τ ∗ ∈ (0 , ∞ )such that spec( − J τ ( x ∗ )) ⊂ C ◦− for all τ ∈ ( τ ∗ , ∞ ). Now that we have a guarantee that − J τ ( x ∗ ) is Hurwitzstable for any τ ∈ ( τ ∗ , ∞ ), we apply Hartman-Grobman to get that the nonlinear system ˙ x = − Λ τ g ( x ) isstable in a neighborhood of x ∗ . Fix any τ ∈ ( τ ∗ , ∞ ) and let γ = arg min λ ∈ spec( J τ ( x ∗ )) λ ) / | λ | . Then,applying Lemma 1, for any γ ∈ (0 , γ ), τ - GDA converges locally asymptotically to x ∗ .On the other hand, suppose that there exists a τ ∗ ∈ (0 , ∞ ) such that spec( − J τ ( x ∗ )) ⊂ C ◦− for all τ ∈ ( τ ∗ , ∞ ). Then by Theorem 3, if x ∗ is a differential Stackelberg equilibrium. Furthermore, sincespec( − J τ ( x ∗ )) ⊂ C ◦− for all τ ∈ ( τ ∗ , ∞ ), if we let γ = arg min λ ∈ spec( J τ ( x ∗ )) λ ) / | λ | , then by Lemma 1 τ - GDA converges locally asymptotically to x ∗ for any choice of γ ∈ (0 , γ ). F Proof of Proposition 4
The structure of this proof is as follows. We begin by introducing general background for analyzing generalsingularly perturbed systems. Following this, we consider the linearization of the singularly perturbed systemthat approximates the simultaneous gradient dynamics and describe how insights made about this systemtranslate to the corresponding nonlinear system. Finally, we analyze the stability of the linear system arounda critical point to arrive at the stated result. The analysis is primarily from Kokotovic et al. (1986).54 nalysis of General Singularly Perturbed Systems.
Let us begin by considering a general singularlyperturbed system for x ∈ R n , z ∈ R m , and a sufficiently small parameter ε > x = f ( x, z, ε, t ) , x ( t , ε ) = x , x ∈ R n ε ˙ z = g ( x, z, ε, t ) , z ( t , ε ) = z , z ∈ R m (44)where f and g are assumed to be sufficiently many times continuously differential functions of the arguments x , z , ε , and t . Observe that when ε = 0, the dimension of the system given in (44) drops from n + m to n since ˙ z degenerates into the equation 0 = g (¯ x, ¯ z, , t ) (45)where the notation of ¯ x, ¯ z indicates that the variables belong to the system with ε = 0. We further requirethe assumption that (45) has k ≥ i ∈ { , . . . , k } are given by¯ z = ¯ φ i (¯ x, t ) . We now define an n -dimensional manifold M ε for any ε > z ( t, ε ) = φ ( x ( t, ε ) , ε ) , (46)where φ is sufficiently many times continuously differentiable function of x and ε . For M ε to be an invariantmanifold of the system given in (44), the expression in (46) must hold for all t > t ∗ if it holds for t = t ∗ .Formally, if z ( t ∗ , ε ) = φ ( x ( t ∗ , ε ) , ε ) → z ( t, ε ) = φ ( x ( t, ε ) , ε ) ∀ t ≥ t ∗ , (47)then M ε is an invariant manifold for (44). Differentiating the expression in (47) with respect to t , we obtain˙ z = ddt φ ( x ( t, ε ) , ε ) = dφ∂x ˙ x. (48)Now, multiplying the expression in (48) by ε and substituting in the forms of ˙ x , ˙ z , and z from (44) and (46),the manifold condition becomes g ( x, φ ( x, ε ) , ε, t ) = ε ∂φ∂x f ( x, φ ( x, ε ) , ε, t ) , (49)which φ ( x, ε ) must satisfy for all x of interest and all ε ∈ [0 , ε ∗ ], where ε ∗ is a positive constant.We now define η = z − φ ( x, ε ) . Then, in terms of x and η , the system becomes˙ x = f ( x, φ ( x, ε ) + η, ε, t ) ε ˙ η = g ( x, φ ( x, ε ) + η, ε, t ) − ε ∂φ∂x f ( x, φ ( x, ε ) + η, ε, t ) . Remark 3.
One interesting observation is that the above system is exactly the continuous time limiting sys-tem for the τ -Stackelberg learning update in Fiez et al. (2020) under a simple transformation of coordinates. Observe that the invariant manifold M ε is characterized by the fact that η = 0 implies ˙ η = 0 for all x forwhich the manifold condition from (49) holds. This implies that if η ( t , ε ) = 0, it is sufficient to solve thesystem ˙ x = f ( x, φ ( x, ε ) , ε, t ) , x ( t , ε ) = x . This system is often referred to as the exact slow model and is valid for all x, z ∈ M ε and M ε known as theslow manifold of (52). 55 inearization of Simultaneous Gradient Descent Singularly Perturbed System. We now considerthe singularly perturbed system for simulataneous gradient descent given by˙ x = − D f ( x, z ) ε ˙ z = − D f ( x, z ) . (50)Let us linearize the system around a point ( x ∗ , z ∗ ). Then, D f ( x, z ) ≈ D f ( x ∗ , z ∗ ) + D f ( x ∗ , z ∗ )( x − x ∗ ) + D f ( x ∗ , z ∗ )( z − z ∗ ) D f ( x, z ) ≈ D f ( x ∗ , z ∗ ) + D f ( x ∗ , z ∗ )( x − x ∗ ) + D f ( x ∗ , z ∗ )( z − z ∗ ) . (51)Defining u = ( x − x ∗ ) and v = ( z − z ∗ ) and considering a point ( x ∗ , z ∗ ) such that D f ( x ∗ , z ∗ ) = 0 and D f ( x ∗ , z ∗ ) = 0, then linearized singularly perturbed system is given by˙ u = − D f ( x ∗ , z ∗ ) u − D f ( x ∗ , z ∗ ) vε ˙ v = − D f ( x ∗ , z ∗ ) u − D f ( x ∗ , z ∗ ) v. (52)To simplify notation, let us define J τ as follows J τ = (cid:20) D f ( x ∗ , z ∗ ) D f ( x ∗ , z ∗ ) ε − D f ( x ∗ , z ∗ ) ε − D f ( x ∗ , z ∗ ) (cid:21) = (cid:20) A A ε − A ε − A (cid:21) along with ˙ w = (cid:20) ˙ u ˙ v (cid:21) and w = (cid:20) uv (cid:21) . Then, an equivalent form of (52) is given by ˙ w = − J τ w. (53)In what follows, we make insights about the behavior of the nonlinear system given in (50) around a criticalpoint ( x ∗ , z ∗ ) by analyzing the linear system given in (53). Recall that if ( x ∗ , z ∗ ) is asymptotically stable withrespect to the linear system in (53), then it is also asymptotically stable with respect to the nonlinear systemfrom (50). Moreover, to determine asymptotic stability, it is sufficient to prove that spec( J τ (( x ∗ , z ∗ )) ⊂ C ◦ + .In what follows, we specialize the general analysis of singularly perturbed systems to the singularly perturbedlinear system given in (53). Stability of Critical Points of Simutaneous Gradient Descent.
The manifold condition from (49)for the system in (53) is given by A u + A φ ( u, ε ) = ε ∂φ∂u ( A u + A φ ( u, ε )) . (54)We claim that (54) can be satisfied by a function φ that is linear in u . Indeed, defining v = φ ( u, ε ) = − L ( ε ) u and then substituting back into (49), we get the simplified manifold condition of A − A L ( ε ) = − εL ( ε ) A + εL ( ε ) A L ( ε ) . (55)Before we prove that an L ( ε ) always exists to satisfy (55), consider the change of variables η = v + L ( ε ) u. Here, the ≈ means, e.g., D f ( x, z ) = D f ( x ∗ , z ∗ )+ D f ( x ∗ , z ∗ )( x − x ∗ )+ D f ( x ∗ , z ∗ )( z − z ∗ )+ O ( (cid:107) x − x ∗ (cid:107) + (cid:107) z − z ∗ (cid:107) ),and similarly for D f ( x, z ). (cid:20) ˙ u ˙ η (cid:21) = (cid:20) A − A L ( ε ) A R ( L, ε ) A + εL ( ε ) A (cid:21) (cid:20) uη (cid:21) (56)where R ( L, ε ) = A − A L ( ε ) + εL ( ε ) A − εL ( ε ) A L ( ε ) . (57)Consider that R ( L, ε ) = 0. Then, the system from (56) has the upper block-triangular form (cid:20) ˙ x ˙ η (cid:21) = (cid:20) A − A L ( ε ) A A + εL ( ε ) A (cid:21) (cid:20) xη (cid:21) , (58)which has the effect of generating a replacement fast subsystem given by ε ˙ η = ( A + εLA ) η. We now proceed to show that an L ( ε ) such that R ( L, ε ) = 0 always exists.
Lemma 7. If A is such that det( A ) (cid:54) = 0 , there is an ε ∗ such that for all ε ∈ [0 , ε ∗ ] , there exists a solution L ( ε ) to the matrix quadratic equation R ( L, ε ) = A − A L ( ε ) + εL ( ε ) A − εL ( ε ) A L = 0 (59) which is approximated according to L ( ε ) = A − A + εA − A A + O ( ε ) , (60) where A = A − A A − A . (61) Proof.
To begin, observe that for ε = 0, the unique solution to (59) is given by L (0) = A − A . Now,differentiating R ( L, ε ) from (59) with respect to ε , we find A + εL ( ε ) A dLdε − ε dLdε ( A − A L ( ε )) = L ( ε ) A − L ( ε ) A L ( ε ) . The unique solution of this equation at ε is dLdε (cid:12)(cid:12)(cid:12) ε =0 = A − L (0)( A − A L (0)) = A − A A . Accordingly, (60) represents the first two terms of the MacLaurin series for L ( ε ).We remark that L ( ε ) as defined in (60) is unique in the sense that even though R ( L, (cid:15) ) as given in (59)may have several real solutions, only one is approximated by (60).The characteristic equation of (58) is equivalent to that for the system from (53) owing to the similaritytransform between the systems. The block-triangular form of (53) admits a characteristic equation given by ψ ( s, ε ) = 1 ε m ψ s ( s, ε ) ψ f ( p, ε ) = 0 , (62)where ψ s ( s, ε ) = det( sI − ( A − A L ( ε ))) (63)is the characteristic polynomial of the slow subsystem, and ψ f ( p, ε ) = det( pI − ( A + εA L ( ε ))) (64)is the characteristic polynomial of the fast subsystem in the timescale p = sε . Consequently, n of theeigenvalues of (53) denoted by { λ , . . . , λ n } are the roots of the slow characteristic equation ψ s ( s, ε ) = 057nd the rest of the eigenvalues { λ n +1 , . . . , λ n + m } are denoted by λ i = ν j /ε for i = n + j and j ∈ { , . . . , m } where { ν , . . . , ν m } are the roots of the fast characteristic equation ψ f ( p, ε ) = 0.The roots of ψ s ( s, ε ) at ε = 0, given by the solution to ψ s ( s,
0) = det( sI − ( A − A L (0))) = 0 , (65)are the eigenvalues of the matrix A defined in (61) since L (0) = A − A as shown in Lemma 7. The rootsof the fast characteristic equation at ε = 0, given by the solution to ψ f ( p,
0) = det( pI − A ) = 0 (66)are the eigenvalues of the matrix A . We now proceed by characterizing how closely the eigenvalues of thesystem at ε = 0 approximate the eigenvalues of the system from (53) as ε → A ) (cid:54) = 0, then as ε → n eigenvalues of the system given in (53) tend toward the eigenvaluesof the matrix A while the remaining m eigenvalues of the system from (53) tend to infinity with the rate1 /ε along asymptotes defined by the eigenvalues of A given as spec( A ) /ε as a result of the continuity ofcoefficients of the polynomials from (63) and (64) with respect to ε .Now, consider the special (but generic) case in which the eigenvalues of A are distinct and the eigenvaluesof A are distinct, but A and A may have common eigenvalues. Then, taking the total derivative of (62)with respect to ε we have that ∂ψ s ∂s dsdε + ∂ψ s ∂ε = 0Now, observe that ∂ψ s /∂s (cid:54) = 0 since the eigenvalues of A = A − A A − A are distinct. For each i = 1 , . . . , n , this gives us a well-defined derivative ds/dε (by the implicit mapping theorem) and hence, with s (0) = λ i ( A ), the O ( ε ) approximation of s ( ε ) follows directly. That is, λ i = λ i ( A ) + O ( ε ) , i = 1 , . . . , n Similarly, taking the total derivative of ψ f ( p, ε ) = 0 and again applying the implicit function theorem, wehave λ i + n = ε − ( λ j ( A + O ( ε )) , i = 1 , . . . , n where we have used the fact that p = sε . G Proof of Theorem 4
Let x ∗ be a stable critical point of 1- GDA which is not a differential Stackelberg equilibrium. Without loss ofgenerality, suppose that S ( − J ( x ∗ )) has at least one eigenvalue with strictly positive real part.Since both S ( − J ( x ∗ )) and D f ( x ∗ ) have no zero valued eigenvalues, by Lemma 5.b, there exists non-singular Hermitian matrices P , P and positive definite Hermitian matrices Q , Q such that S ( − J ( x ∗ )) P + P S ( − J ( x ∗ )) = Q and D f ( x ∗ ) P + P D f ( x ∗ ) = Q . Further, S ( − J ( x ∗ )) and P have the same inertia,meaning υ + ( S ( − J ( x ∗ ))) = υ + ( P ) , υ − ( S ( − J ( x ∗ ))) = υ − ( P ) , ζ ( S ( − J ( x ∗ ))) = ζ ( P )where for a given matrix A , υ + ( A ), υ − ( A ), and ζ ( A ) are the number of eigenvalues of the argument thathave positive, negative and zero real parts, respectively. Similarly, D f ( x ∗ ) and P have the same inertia: υ + ( D f ( x ∗ )) = υ + ( P ) , υ − ( D f ( x ∗ )) = υ − ( P ) , ζ ( D f ( x ∗ )) = ζ ( P ) . Recall that we assumed S ( − J ( x ∗ )) has at least one eigenvalue with strictly positive real part. Hence, υ + ( P ) = υ + ( S ( − J ( x ∗ ))) ≥ P = (cid:20) I L (cid:62) I (cid:21) (cid:20) P P (cid:21) (cid:20) I L I (cid:21) Recall that having distinct eigenvalues is a generic condition for a matrix an n × n matrix, though not explicitly requiredfor the asymptotic results; its only a condition for the big-O approximation λ i = λ i ( A ) + O ( ε ) for i = 1 , . . . , n and λ i = ε − ( λ j ( A ) + O ( ε )) where i = n + j for j = 1 , . . . , n . L = ( D f ( x ∗ )) − D (cid:62) f ( x ∗ ). Since P is congruent to blockdiag( P , P ), by Sylvester’s law of iner-tia (Horn and Johnson, 1985, Thm. 4.5.8), P and blockdiag( P , P ) have the same inertia, meaning that υ + ( P ) = υ + (blockdiag( P , P )), υ − ( P ) = υ − (blockdiag( P , P )), and ζ ( P ) = ζ (blockdiag( P , P )). Considernow the Lyapunov equation − P J τ ( x ∗ ) − J (cid:62) τ ( x ∗ ) P = Q τ for − J τ ( x ∗ ) where Q τ = (cid:20) I L (cid:62) I (cid:21) (cid:20) Q P D f ( x ∗ ) + S ( − J ( x ∗ )) L (cid:62) P ( P D f ( x ∗ ) + S ( − J ( x ∗ )) L (cid:62) P ) (cid:62) P L D f ( x ∗ ) + ( P L D f ( x ∗ )) (cid:62) + τ Q (cid:21)(cid:124) (cid:123)(cid:122) (cid:125) B τ (cid:20) I L I (cid:21) which can be verified by straightforward calculations.Since υ + ( P ) ≥
1, we have that υ + ( P ) ≥
1. Now, we find the value of τ such that for all τ > τ , Q τ > − J τ ( x ∗ )) (cid:54)⊂ C ◦− . Indeed, observe that Q τ > B τ > B τ > Q > S ( B τ ) > S ( B τ ) = P L D f ( x ∗ ) + ( P L D f ( x ∗ )) (cid:62) + τ Q − ( P D f ( x ∗ ) + S ( − J ( x ∗ )) L (cid:62) P ) (cid:62) Q − ( P D f ( x ∗ ) + S ( − J ( x ∗ )) L (cid:62) P ) . Now, S ( B τ ) is also a real symmetric matrix, and hence, it is positive definite if and only if all its eigenvaluesare positive. To determine the range of τ for which Q τ >
0, we simply need to solve the eigenvalue problem0 = det( τ I − Q − (( P D f ( x ∗ ) + S ( − J ( x ∗ )) L (cid:62) P ) (cid:62) Q − ( P D f ( x ∗ ) + S ( − J ( x ∗ )) L (cid:62) P ) − P L D f ( x ∗ ) − ( P L D f ( x ∗ )) (cid:62) )) . and extract the maximum eigenvalue, namely, τ = λ max ( Q − (( P D f ( x ∗ ) + S ( − J ( x ∗ )) L (cid:62) P ) (cid:62) Q − ( P D f ( x ∗ )+ S ( − J ( x ∗ )) L (cid:62) P ) − P L D f ( x ∗ ) − ( P L D f ( x ∗ )) (cid:62) )) . Hence, as noted previously, by Lemma 5.a, we conclude that for all τ ∈ ( τ , ∞ ), spec( − J τ ( x ∗ )) (cid:54)⊂ C ◦− .To provide some context for the proof approach, it follows the same idea as the proof of Theorem 3 inAppendix E.2.2. Indeed, to determine the range of τ such that S ( B τ ) is positive definite, we can formulatean eigenvalue problem to determine the value of τ such that the matrix S ( B τ ) becomes singular. We vary τ from zero to infinity in order to find the point such that for all larger τ , S ( B τ ) is positive definite. Intuitively,such an argument works since τ scales the positive definite matrix Q . H Proof of Theorem 7
As in Mescheder et al. (2018), we only apply the regularization to the discriminator. In the following proof,we use ∇ x ( · ) to denote the partial gradient with respect to x of the argument ( · ) when the argument is thediscriminator D( · ; ω ) in order prevent any confusion between the notation D ( · ) which we use elsewhere forderivatives.To prove the first part of this result, we following similar arguments to Theorem 4.1 of (Meschederet al., 2018). To prove the second part, we leverage the concept of the quadratic numerical range. For bothcomponents of the proof, we will use the following form of the Jacobian of the regularized game. Indeed,first observe that the structural form of J ( τ,µ ) ( x ∗ ) is J ( τ,µ ) ( x ∗ ) = (cid:20) B − τ B (cid:62) τ ( C + µR ) (cid:21) (67)where B = D f ( x ∗ ), C = − D f ( x ∗ ) and R = D R i ( x ∗ ). This follows from Assumption 2-a., which impliesthat D( x ; ω ∗ ) = 0 in some neighborhood of supp( p D ) and hence, ∇ x D( x ; ω ∗ ) = 0 and ∇ x D( x ; ω ∗ ) = 0 for x ∈ supp( p D ). In turn, we have that D f ( x ∗ ) = 0. 59 roof that x ∗ = ( θ ∗ , ω ∗ ) is a differential Stackelberg equilibrium. For any fixed µ ∈ [0 , ∞ ), thenwe first observe that x ∗ is also a critical point of the unregularized dynamics. Indeed, by Assumption 2-a., D( x ; ω ∗ ) = 0 in some neighborhood of supp( p D ) and hence, ∇ x D( x ; ω ∗ ) = 0 and ∇ x D( x ; ω ∗ ) = 0 for x ∈ supp( p D ). Further, D R i ( θ, ω ) = µ E p i ( x ) [ D ( ∇ x D( x ; ω )) ∇ x D( x, ω )] for i = 1 , p ( x ) = p D ( x )and p ( x ) = p θ ( x ). Thus, using the above observation that ∇ x D( x ; ω ∗ ) = 0, we have that D R i ( θ ∗ , ω ∗ ) = 0for i = 1 , ω is zero at x ∗ = ( θ ∗ , ω ∗ ) whichin turn implies that D f ( x ∗ ) = 0 and − D f ( x ∗ ) = 0. Hence, x ∗ is a critical point of the unregularizeddynamics as claimed. Further, C + µR > v (cid:54) = 0 and v (cid:54)∈ T θ ∗ M G , then Bv (cid:54) = 0which implies that B can only be rank deficient on T θ ∗ M G . Using this fact along with the structure of theJacobian as in (67), we have that the Schur complement of J ( τ,µ ) ( x ∗ ) is equal to B (cid:62) ( C + µR ) − B > C + µR >
0. Hence, x ∗ = ( θ ∗ , ω ∗ ) is a differential Stackelberg equilibrium. Proof of stability.
Examining (67), it is straightforward to see that the quadratic numerical range W ( J ( τ,µ ) ) has eigenvalues of the form λ τ,µ = ( τ ( c + µr )) ± (cid:112) ( − τ ( c + µr )) − τ | b | where b = (cid:104) D f ( x ∗ ) v, w (cid:105) , c = (cid:104)− D f ( x ∗ ) w, w (cid:105) and r = (cid:104) D R i ( x ∗ ) w, w (cid:105) for vectors v ∈ W ∩ ( T θ ∗ M G ) ⊥ and w ∈ W ∩ ( T ω ∗ M D ) ⊥ where U ⊥ denotes the orthogonal complement of U . We claim that for any valueof µ ∈ [0 , µ ] and any τ ∈ (0 , ∞ ), Re( λ τ,µ ) >
0. Indeed, we argue this by considering the two possible cases:(1) ( τ ( c + µr )) ≤ | b | τ or (2) ( τ ( c + µr )) > τ | b | . • Case 1 : Suppose that ( τ ( c + µr )) ≤ | b | τ . Then, Re( λ τ,µ ) = ( τ ( c + µr )) > c + µr > • Case 2 : Suppose that ( τ ( c + µr )) > τ | b | . In this case, we want to ensure thatRe( λ τ ) > ( τ ( c + µr )) − (cid:112) ( − τ ( c + µr )) − τ | b | > . which is true since ( τ ( c + µr )) > ( − τ ( c + µr )) − τ | b | ⇐⇒ > − τ | b | This concludes the proof.
I Proof of Proposition 5
This proposition follows immediately from observing the structure of the Jacobian: for any matrix of theform − J = (cid:20) − BB (cid:62) − C (cid:21) at least one eigenvalue will be purely imaginary if n < n / B ∈ R n × n and C ∈ R n × n . Indeed,by Lyapunov’s stability theorem for linear systems (Hespanha, 2018, Theorem 8.2), a matrix A is Hurwitzstable if and only if for every symmetric positive definite Q = Q (cid:62) >
0, there exists a unique symmetricpositive definite P = P (cid:62) >
0, such that A (cid:62) P + P A = − Q . Hence, − J is Hurwitz stable if and only if thereexists a P = P (cid:62) > < Q = (cid:20) − BB (cid:62) C (cid:21) (cid:20) P P P (cid:62) P (cid:21) + (cid:20) P P P (cid:62) P (cid:21) (cid:20) B − B (cid:62) C (cid:21) = (cid:20) − BP (cid:62) − P B (cid:62) − BP + P B + P CB (cid:62) P + CP (cid:62) − P B (cid:62) B (cid:62) P + CP + P (cid:62) B + P C (cid:21) Since this is a symmetric positive definite matrix, the block diagonal components must also be symmetricpositive definite so that − BP − P B (cid:62) > Recall that B ∈ R n × n and P ∈ R n × n . Hence, a necessary If a block matrix Q with block entries Q ij for i, j ∈ { , } is positive definite symmetric, then Q ii > i = 1 , n ≥ n / − BP − P B (cid:62) to have full rank; ofcourse this is not sufficient, but it is necessary. It is easy to see this argument is independent of whether alearning rate ratio τ (cid:54) = 0 or regularization is incorporated. J Extensions in the Stochastic Setting
Let ˜ x ( t ) be the asymptotic pseudo-trajectories of the stochastic approximation process { x k } . That is, ˜ x ( t )are linear interpolates between the sample points x k generated by the stochastic τ - GDA process, and aredefined by ˜ x ( t ) = ˜ x ( t k ) + ( t − t k ) γ k (˜ x ( t k +1 ) − ˜ x ( t k ))where t k = t k + γ k and t = 0. Assumption 3.
The stochastic process { w k } is a martingale difference sequence with respect to the increasingfamily of σ -fields defined by F k = σ ( x (cid:96) , w (cid:96) , (cid:96) ≤ k ) , ∀ k ≥ , so that E [ w k +1 | F k ] = 0 almost surely for all k ≥ . Furthermore, there exists c , c ∈ C ( R d , R > ) such that Pr {(cid:107) w k +1 (cid:107) > v | F k } ≤ c ( x k ) exp( − c ( x k ) v ) , n ≥ for all v ≥ ˜ v where ˜ v is some sufficiently large, fixed number. Proposition 8.
Suppose that Assumption 3 holds and that x ∗ is a differential Stackelberg equilibrium. Let γ k = 1 / ( k + 1) β where β ∈ (0 , . There exists a τ ∗ ∈ (0 , ∞ ) and an (cid:15) ∈ (0 , ∞ ) such that for any fixed (cid:15) ∈ (0 , (cid:15) ] , there exists functions h ( (cid:15) ) = O (log(1 /(cid:15) )) and h ( (cid:15) ) = O (1 /(cid:15) ) so that when T ≥ h ( (cid:15) ) and k ≥ K τ where K τ is such that /γ k ≥ h ( (cid:15) ) for all k ≥ K τ , the stochastic iterates of τ - GDA with stepsizesequence γ k and timescale separation τ ∈ ( τ ∗ , ∞ ) satisfy Pr {(cid:107) ˜ x ( t ) − x ∗ (cid:107) ≤ (cid:15) ∀ t ≥ t k + T + 1 | ˜ x ( t k ) ∈ B (cid:15) ( x ∗ ) } = 1 − O ( k − β/ exp( − C τ k β/ )) for some constant C τ > . The proof largely follows from the proofs of Theorem 1.1 and 1.2 in (Thoppe and Borkar, 2019), combinedwith the existence of a finite timescale separation parameter obtained via Theorem 3. Indeed, since x ∗ isa differential Stackelberg equilibrium, by Theorem 3 there exists a range of τ —namely, ( τ ∗ , ∞ )—such thatfor any τ ∈ ( τ ∗ , ∞ ), x ∗ is a locally asymptotically stable equillibrium for ˙ x = − Λ τ g ( x ). Hence, fixing any τ ∈ ( τ ∗ , ∞ ), a converse Lyapunov theorem can be applied to construct a local Lyapunov function. Let V : R n → R be this Lyapunov function so that there exists r, r , (cid:15) > r > r , and B (cid:15) ( x ∗ ) ⊆ V r ⊂ N (cid:15) ( V r ) ⊆ V r for any (cid:15) ∈ (0 , (cid:15) ] where, for a given q > V q = { x ∈ dom( V ) : V ( x ) ≤ q } and N (cid:15) ( V r ) is an (cid:15) –neighborhood of V r —i.e., N (cid:15) ( V r ) = { x ∈ R n | ∃ y ∈ V r , (cid:107) x − y (cid:107) ≤ (cid:15) } . From here, the result followsfrom an application of the results in the work by Thoppe and Borkar (2019).The utility of this result is that it provides a guarantee in the stochastic setting for a more reasonableand practically useful stepsize sequence. However, constructing the constants such as K τ , C τ and (cid:15) ishighly non-trivial as can be seen in the work of Thoppe and Borkar (2019) and similar works in the area ofstochastic approximation (Borkar, 2008). One direction of future work is examining the Lyapunov approachfor directly analyzing the nonlinear singularly perturbed system; it is known, however, that the stochasticsingularly perturbed systems have much weaker guarantees in terms of stability (Kokotovic et al., 1986,Chap. 4). 61 Further Details on Related Work
In this section, we provide further details on the discussion from Section 7 regarding the results presentedby Jin et al. (2020) on the local stability of gradient descent-ascent with a finite timescale separation. Thepurpose of this discussion is to make clear that Proposition 27 from the work of Jin et al. (2020) does notdisagree with the results we provide in Theorem 3 and Theorem 4 and is instead complementary. In whatfollows, we recall Proposition 27 of Jin et al. (2020) in separate pieces in the terminology of this paperand delineate its meaning from our results on the stability of gradient descent-ascent with a finite timescaleseparation.To begin, we consider the component of Proposition 27 from Jin et al. (2020) which says that givenany fixed and finite timescale separation τ >
0, a zero-sum game can be constructed with a differentialStackelberg equilibrium that is not stable with respect to the continuous time limiting system of τ - GDA givenby the dynamics ˙ x = − Λ τ g ( x ). Proposition 9 (Rephrasing of Jin et al. 2020, Proposition 27(a)) . For any fixed τ > , there exists azero-sum game G = ( f, − f ) such that spec( J τ ( x ∗ )) (cid:54)⊂ C ◦ + for a differential Stackelberg equilibrium x ∗ . We now explain the proof. Let us consider any (cid:15) > f ( x, y ) = − x + 2 √ (cid:15)xy − ( (cid:15)/ y . (68)At the unique critical point ( x ∗ , y ∗ ) = (0 , J τ ( x ∗ , y ∗ ) = (cid:20) − √ (cid:15) − τ √ (cid:15) τ (cid:15) (cid:21) . Moreover, observe that ( x ∗ , y ∗ ) is a differential Stackelberg equilibrium and not a differential Nash equilib-rium since D f ( x ∗ , y ∗ ) = − ≯ − D f ( x ∗ , y ∗ ) = (cid:15) > S ( J ( x ∗ , y ∗ )) = 2 >
0. Finally, the spectrumof the Jacobian is spec( J τ ( x ∗ , y ∗ )) = (cid:110) − τ (cid:15) ± √ τ (cid:15) − τ (cid:15) + 42 (cid:111) . Let us now fix τ as any arbitrary positive value. Then, consider the game construction from (68) with (cid:15) = 1 /τ . For the fixed choice of τ and subsequent game construction, we get thatspec( J τ ( x ∗ , y ∗ )) = { (cid:0) − ± i √ (cid:1) / } (cid:54)⊂ C ◦ + . This in turn means the differential Stackelberg equilibrium is not stable with respect to the dynamics˙ x = − Λ τ g ( x ) for the given choice of τ . Since the choice of τ was arbitrary, this is a valid procedure togenerate a game with a differential Stackelberg equilibrium that is not stable with respect to ˙ x = − Λ τ g ( x )given a choice of τ beforehand.This result contrasts with that of Theorem 3 in the following fundamental way. In the proof of Proposi-tion 9, τ is fixed and then the game is constructed, whereas in Theorem 3 the game is fixed and then theconditions on τ given. To illustrate this point, consider the game construction from (68) with (cid:15) fixed to bean arbitrary positive value. It can be verified that spec( J τ ( x ∗ , y ∗ )) ⊂ C ◦ + for all τ > /(cid:15) . This means thatgiven the differential Stackelberg equilibria in this game construction, there is indeed a finite τ ∗ such thatthe equilibrium is stable with respect to ˙ x = − Λ τ g ( x ) for all τ ∈ ( τ ∗ , ∞ ). Put concisely, Proposition 9 isshowing that there is exists a continuum of games for which a differential Stackelberg equilibrium is unstablewith an improper choice of finite learning rate ratio τ . On the other hand, Theorem 3 is proving that givena game with a differential Stackelberg equilibrium, there exists a range of suitable finite learning rate ratiossuch that the differential Stackelberg equilibrium is guaranteed to be stable.We now move on to examining the portion of Proposition 27 from Jin et al. (2020) which says that givenany fixed and finite timescale separation τ >
0, a zero-sum game can be constructed with a critical pointthat is not a differential Stackelberg equilibrium which is stable with respect to the continuous time limitingsystem of τ - GDA given by ˙ x = − Λ τ g ( x ). Proposition 10 (Rephrasing of Jin et al. 2020, Proposition 27(b)) . For any fixed τ , there exists a zero-sumgame G = ( f, − f ) such that spec( J τ ( x ∗ )) ⊂ C ◦ + for a critical point x ∗ satisfying g ( x ∗ ) = 0 that is not adifferential Stackelberg equilibrium.
62n a similar manner as following Proposition 9, we now explain the proof of Proposition 10 and thencontrast the result with Theorem 4. Again, consider any (cid:15) >
0, along with the game construction f ( x, y ) = x + 2 √ (cid:15)x y + ( (cid:15)/ y − x / √ (cid:15)x y − (cid:15)y . (69)At the unique critical point ( x ∗ , y ∗ ) = (0 , J τ ( x ∗ , y ∗ ) = √ (cid:15) − √ (cid:15) − τ √ (cid:15) − τ (cid:15) − τ √ (cid:15) τ (cid:15) Observe that ( x ∗ , y ∗ ) is neither a differential Nash equilibrium nor a differential Stackelberg equilibriumsince D f ( x ∗ , y ∗ ) = diag(2 , −
1) and − D f ( x ∗ , y ∗ ) = diag( (cid:15), (cid:15) ) are both indefinite. The spectrum of theJacobian is spec( J τ ( x ∗ , y ∗ )) = (cid:110) − τ (cid:15) ± √ τ (cid:15) − τ (cid:15) + 42 , − τ (cid:15) ± √ τ (cid:15) − τ (cid:15) + 12 (cid:111) . Now, fix τ as any arbitrary positive value, then consider the game construction from (69) with (cid:15) = 1 /τ . Forthe fixed choice of τ and resulting game construction given the choice of (cid:15) , we have thatspec( J τ ( x ∗ , y ∗ )) = { ± i √ , ± i √ } ⊂ C ◦ + . This indicates that the non-equilibrium critical point is stable with respect to the dynamics ˙ z = − Λ τ g ( z )where z = ( x, y ) for the given choice of τ . Similar to the proof of Proposition 9, since the choice of τ wasarbitrary, the procedure to generate a game with a non-equilibrium critical point that is stable with respectto ˙ z = − Λ τ g ( z ) is valid given a choice of τ beforehand.The key distinction between Proposition 10 and Theorem 4 is analogous to that between Proposition 9and Theorem 3. Indeed, the proof and result of Proposition 10 rely on τ being fixed followed by the gamebeing constructed. On the other hand, in Theorem 4 the game is fixed and then the conditions on τ given.To make this clear, consider the game construction from (69) with (cid:15) fixed to be an arbitrary positive value.It turns out that spec( J τ ( x ∗ , y ∗ )) (cid:54)⊂ C ◦ + for all τ > /(cid:15) sinceRe (cid:16) − τ (cid:15) ± √ τ (cid:15) − τ (cid:15) + 42 (cid:17) < . As a result, given the unique critical point of the game there is a finite τ such that the non-equilibriumcritical point is not stable with respect to ˙ x = − Λ τ g ( x ) for all τ ∈ ( τ , ∞ ). In summary, Proposition 10 isshowing that there is exists a continuum of games for which a non-equilibrium critical point is stable givenan unsuitable choice of finite learning rate ratio τ . In contrast, Theorem 4 is showing that given a game witha non-equilibrium critical point, there exists a range of finite learning rate ratios such that it is not stable.To recap, the discussion in this section is meant to explicitly contrast Proposition 27 from the work of Jinet al. (2020) with Theorem 3 and Theorem 4 since they may potentially appear contradictory to each otherwithout close inspection. The result of Jin et al. (2020) shows that (i) given a fixed finite learning ratio,there exists a game for with a differential Stackelberg equilibria that is not stable and (ii) given a fixedfinite learning ratio, there exists a game with a non-equilibrium critical point that is stable. From a differentperspective, we show that (i) given a fixed game and differential Stackelberg equilibrium, there exists a rangeof finite learning rate ratios for which the equilibrium is stable (Theorem 3) and (ii) given a fixed game anda non-equilibrium critical point, there exists a range of finite learning rate ratios for which the critical pointis not stable (Theorem 4). L Experiments Supplement
In this section we present several experiments not included in the body of the paper along with supplemen-tal simulation results and details for the experiments presented in Section 6. We study a torus game in63ection L.1 and examine the connection between timescale separation and the region of attraction. Then,in Section L.2, we return to the Dirac-GAN game and consider the non-saturating objective function. InSection L.3, we explore a generative adversarial network formulation using the Wasserstein cost function witha linear generator and quadratic discriminator for the problem of learning a covariance matrix. We finishin Section L.4 by presenting further results and details on our experiments training generative adversarialnetworks on image datasets.
L.1 Location Game on the Torus
We use the example in this section to further study the role of timescale separation on the regions of attractionaround critical points. Consider the zero-sum game defined by the cost f ( x , x ) = − .
15 cos( x ) + cos( x − x ) + 0 .
15 cos( x ) . (70)This game can be interpreted as a location game on the torus. Specifically, the first player seeks to be farfrom the second player but near zero, while the second player seeks to be near the first player. This is anon-convex game on a non-convex strategy space. The critical points are given by the set : { x : g ( x ) = 0 } = { (0 , , ( π, π ) , ( π, , (0 , π ) , ( − . , − . , (1 . , . } . The critical points (0 ,
0) and ( π, π ) are the only differential Stackelberg equilibrium and neither is a differentialNash equilibrium. The differential Stackelberg equilibrium at (0 ,
0) is stable for all τ ∈ ( τ ∗ , ∞ ) where τ ∗ = 0 .
74 and the differential Stackelberg equilibrium ( π, π ) is stable for all τ ∈ ( τ ∗ , ∞ ) where τ = 1 . τ . We remark that we computed τ ∗ for eachdifferential Stackelberg equilibrium using the construction from Theorem 3 in Section 3 and it again gavethe exact value of τ ∗ such that the system is stable for all τ > τ ∗ .In Figure 13a, we show the trajectories of τ - GDA with γ = 0 .
In Figure 13a, we show the trajectories of τ-GDA with γ = 0.001 and τ ∈ {1, 2, 4, 10} given the initialization (x_1, x_2) = (2, −1) and a second initial condition shown in the figure, overlayed on the vector field generated by the respective timescale separation parameters. We observe that as the timescale separation τ grows, the rotational dynamics in the vector field dissipate and the directions of movement become sharp. As we mentioned in previous examples, τ-GDA moves directly to the zero line of −D_2 f(x_1, x_2) and then along that line to an equilibrium given sufficient timescale separation. The warping of the vector field that occurs as a result of timescale separation impacts the equilibrium that the dynamics converge to from a fixed initial condition and the neighborhood on which τ-GDA converges to an equilibrium. In other words, the region of attraction around critical points depends heavily on the timescale separation τ.

To illustrate this fact, in Figure 13b we show the regions of attraction for each choice of timescale separation. The vector fields are again shown for each τ ∈ {1, 2, 4, 10}, but now with colors overlayed indicating the equilibrium that the dynamics converge to given an initialization at that position. This experiment was generated by running τ-GDA from a dense set of initial conditions chosen uniformly over the strategy space. Positions in the strategy space without color did not converge to an equilibrium within the fixed horizon of 20000 iterations with γ = 0.04. This happens when τ-GDA is not initialized in the local neighborhood of attraction around a stable equilibrium. For the choice of τ = 1, (0, 0) is the only stable equilibrium. However, as demonstrated in Figure 13a, τ-GDA fails to converge to the equilibrium from either of the two initial conditions shown there. In contrast, τ-GDA converges to an equilibrium from any initial condition for τ ∈ {2, 4, 10}, as can be seen in Figure 13b. Notably, the equilibrium to which the learning dynamics converge depends on the timescale separation and the initial condition. To give a concrete example, consider the two initial conditions shown in Figure 13a. For the initial condition (x_1, x_2) = (2, −1), τ-GDA converges to the equilibrium at (0, 0) for each τ ∈ {2, 4, 10}. Yet, for the second initial condition, τ-GDA converges to the equilibria (0, 0), (π, π), and (π, π) for the respective choices of τ ∈ {2, 4, 10}. In other words, the regions of attraction around the critical points change so that from a fixed initial condition τ-GDA may converge to distinct equilibria depending on the timescale separation. (Note that because the joint strategy space is a torus, (±π, ±π) = (∓π, ±π), (π, 0) = (−π, 0), and (0, −π) = (0, π).) From Figure 13b, we see that the region of attraction containing the second initial condition grows from τ = 1 to τ = 2 and τ = 4, but then shrinks at τ = 10. This example highlights that timescale separation has a fundamental impact on the region of attraction around critical points, and as τ grows it is possible for the region of attraction around an equilibrium to shrink. Collectively, this motivates explicit methods for shaping the region of attraction around desirable equilibria.

Figure 13: Experimental results for the torus game defined in (70) of Appendix L.1. In Figure 13a, we overlay multiple trajectories produced by τ-GDA onto the vector field generated by the choice of timescale separation τ. The shading of the vector field is dictated by its magnitude so that lighter shading corresponds to a higher magnitude and darker shading corresponds to a lower magnitude. Figure 13b demonstrates the effect of timescale separation on the regions of attraction around critical points by coloring points in the strategy space according to the equilibrium to which τ-GDA converges. Areas without coloring indicate where τ-GDA did not converge within the time horizon.
L.2 Dirac-GAN and Regularization: Non-Saturating Formulation

In Section 6.4, we presented experiments for the Dirac-GAN game studied by Mescheder et al. (2017) using the original generative adversarial network formulation of Goodfellow et al. (2014). In this section, we revisit the Dirac-GAN game using the non-saturating generative adversarial network formulation also proposed by Goodfellow et al. (2014). While we refer the reader back to Section 6.4 for complete details on the Dirac-GAN, we recall some key components of the formulation. Recall that the zero-sum game which arises from the original objective with regularization µ > 0 is defined by the cost

f(θ, ω) = ℓ(θω) + ℓ(0) − (µ/2) ω^2.

As discussed in Section 6.4, the unique critical point of the game is (θ*, ω*) = (0, 0) and it corresponds to the local Nash equilibrium of the unregularized game and a differential Stackelberg equilibrium of the regularized game. Moreover, the equilibrium is stable with respect to the continuous-time dynamics for all τ > 0 and µ > 0, and τ-GDA converges with a suitable learning rate γ.

The non-saturating generative adversarial network formulation proposed by Goodfellow et al. (2014), in the context of the Dirac-GAN game, corresponds to player 1 maximizing ℓ(−θω) instead of minimizing ℓ(θω). This results in the general-sum game defined by the costs

(f_1(θ, ω), f_2(θ, ω)) = (−ℓ(−θω) + ℓ(0) − (µ/2) ω^2, −ℓ(θω) − ℓ(0) + (µ/2) ω^2).    (71)

As shown by Mescheder et al. (2018), the unique critical point of the game remains at (θ*, ω*) = (0, 0). Moreover, the game Jacobian J_τ(θ*, ω*) in this formulation is identical to that from (25), so this game is locally equivalent to the zero-sum game that arises from the original objective proposed by Goodfellow et al. (2014). This is despite the fact that the non-saturating objective was motivated by global concerns (vanishing gradients early in the training process) rather than local considerations. In Figure 14 we present experiments with τ-GDA for the regularized Dirac-GAN game with the non-saturating objective and ℓ(t) = −log(1 + exp(−t)). We observe similar behavior as in the experiments with the standard objective and refer back to Section 6.4 for the insights we draw from the simulation. This experiment is primarily included for completeness and to motivate our use of the non-saturating objective in the generative adversarial network experiments we perform on image datasets in Section 6.5.

Figure 14: Experimental results for the Dirac-GAN game defined in (71) of Appendix L.2. Figure 14a shows trajectories of τ-GDA for several values of τ with a small regularization parameter µ, together with the trajectory for τ = 1 with regularization µ = 1. Figure 14b shows the distance from the equilibrium along the learning paths. Figure 14d shows the trajectories of τ-GDA overlayed on the vector field generated by the respective timescale separation and regularization parameters. The shading of the vector field is dictated by its magnitude so that lighter shading corresponds to a higher magnitude and darker shading corresponds to a lower magnitude.
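The qualitative behavior in Figure 14 can be reproduced with a few lines of code. The sketch below simulates τ-GDA on (71) with ℓ(t) = −log(1 + exp(−t)) and the regularizer taken to be (µ/2)ω^2 as above; the values of µ, γ, the initialization, and the horizon are illustrative rather than the settings used for the figure.

import numpy as np

# Sketch of tau-GDA on the regularized non-saturating Dirac-GAN game (71).
# Assumptions: l(t) = -log(1 + exp(-t)), regularizer (mu/2)*omega^2, and
# illustrative choices of mu, gamma, initialization, and horizon.

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def tau_gda(theta, omega, tau, mu=0.5, gamma=0.01, iters=5000):
    for _ in range(iters):
        d_theta = omega * sigmoid(theta * omega)                 # d f1 / d theta
        d_omega = -theta * sigmoid(-theta * omega) + mu * omega  # d f2 / d omega
        theta = theta - gamma * d_theta                          # player 1 update
        omega = omega - tau * gamma * d_omega                    # player 2 update (rate tau*gamma)
    return theta, omega

for tau in [1, 2, 4, 8]:
    theta, omega = tau_gda(1.0, 1.0, tau)
    print(tau, np.hypot(theta, omega))   # distance to the equilibrium (0, 0)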
L.3 Generative Adversarial Network: Learning a Covariance Matrix

We now consider a generative adversarial network formulation presented by Daskalakis et al. (2018) for learning a covariance matrix. This is a simple example with degeneracies, much like the Dirac-GAN game, but it can be generalized to arbitrary dimensional strategy spaces and has served as a benchmark for comparing convergence rates in a number of recent papers on learning in games. Often, the example is used to show that gradient descent-ascent cycles and converges slowly. However, by and large, timescale separation is not considered. We show that gradient descent-ascent converges quickly in this game with suitable timescale separation and further explore the interplay between timescale separation, regularization, and rate of convergence. We primarily follow the notation of Daskalakis et al. (2018) when describing the problem.

The objective of this problem is to learn a covariance matrix using the Wasserstein GAN formulation. The real data x is drawn from a mean-zero multivariate normal distribution with an unknown covariance matrix Σ. The generator is restricted to be a linear function of the random input noise z ∼ N(0, I) and is of the form G_V(z) = Vz. The discriminator is restricted to the set of all quadratic functions, which we represent by D_W(x) = x^⊤ W x. The parameters of the generator and the discriminator are given by V ∈ R^{d×d} and W ∈ R^{d×d}, respectively. For the given generator and discriminator classes, the Wasserstein GAN game is defined by the cost

f(V, W) = E_{x ∼ N(0, Σ)}[x^⊤ W x] − E_{z ∼ N(0, I)}[z^⊤ V^⊤ W V z].

As shown by Daskalakis et al. (2018), the cost function can be simplified to be expressed as

f(V, W) = ∑_{i=1}^d ∑_{j=1}^d W_ij (Σ_ij − ∑_{k=1}^d V_ik V_jk).

With this cost, the individual gradients for gradient descent-ascent are given by

g(V, W) = (−(W + W^⊤)V, −(Σ − V V^⊤)).

From the individual gradients, it is clear that the critical points of the game are given by (V, W) such that V V^⊤ = Σ and W + W^⊤ = 0. Moreover, given the form of g(V, W), the game Jacobian at any critical point (V*, W*) is of the form

J_τ(V*, W*) = [[0, D_12 f(V*, W*)], [−τ D_12 f(V*, W*)^⊤, 0]].

Consequently, the eigenvalues of the game Jacobian are purely imaginary and the critical points are not stable. To fix this problem, Daskalakis et al. (2018) regularized both the generator and the discriminator. We only regularize the discriminator in this example. The cost function of the zero-sum game with regularization µ > 0 is

f(V, W) = ∑_{i=1}^d ∑_{j=1}^d W_ij (Σ_ij − ∑_{k=1}^d V_ik V_jk) − (µ/2) tr(W^⊤ W).    (72)

The individual gradients for gradient descent-ascent in this regularized game are then

g(V, W) = (−(W + W^⊤)V, −(Σ − V V^⊤) + µW).
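Before specializing to d = 1, the following sketch runs τ-GDA with these regularized gradients; Σ, the dimension, the step size, the timescale separation, and the horizon are illustrative choices rather than the settings used for Figure 15, and other settings may require a smaller γ or a longer horizon.

import numpy as np

# Sketch of tau-GDA on the regularized covariance-learning game (72).
# Assumptions: Sigma, d, gamma, tau, mu, and the horizon are illustrative.

rng = np.random.default_rng(0)
d, tau, mu, gamma = 5, 4.0, 1.0, 1e-3

A = rng.standard_normal((d, d))
Sigma = A @ A.T / d + np.eye(d)       # ground-truth covariance (positive definite)

V = rng.standard_normal((d, d))       # generator parameters
W = rng.standard_normal((d, d))       # discriminator parameters

for _ in range(100000):
    dV = -(W + W.T) @ V               # D_V f: generator descends f
    dW = (Sigma - V @ V.T) - mu * W   # D_W f of the regularized cost: discriminator ascends
    V = V - gamma * dV
    W = W + tau * gamma * dW          # discriminator learning rate is tau * gamma

print(np.linalg.norm(V @ V.T - Sigma))  # small near an equilibrium (V V^T = Sigma)
print(np.linalg.norm(W))                # small near an equilibrium (W = 0)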
We begin by considering the simplest form of this problem, which is that of d = 1. The critical points with this restriction are (V*, W*) = (σ, 0) and (V*, W*) = (−σ, 0), where σ^2 = Σ, and the game Jacobian evaluated at them is

J_τ(V*, W*) = [[0, −2σ], [2τσ, τµ]].

Each critical point is a local Nash equilibrium of the unregularized game and a differential Stackelberg equilibrium of the regularized game since −D_2^2 f(V*, W*) = µ > 0 and S(J(V*, W*)) = 4σ^2/µ > 0. Moreover, spec(J_τ(V*, W*)) = {(τµ ± √(τ^2 µ^2 − 16τσ^2))/2}, so that each critical point is stable for all τ ∈ (0, ∞) and µ ∈ (0, ∞) since spec(J_τ(V*, W*)) ⊂ C°_+. Thus, given a suitably chosen learning rate γ, the discrete-time update τ-GDA locally converges to an equilibrium. For this reason, we focus on studying the rate of convergence for the problem as a function of timescale separation and regularization.

Figure 15: Experimental results for the generative adversarial network formulation for learning a covariance matrix defined by the cost from (72) of Section L.3. Panels (a)-(c) correspond to d = 1 with increasing regularization (panel (b): µ = 0.75, panel (c): µ = 1), and panels (g)-(i) correspond to d = 5, 10, and 20 with µ = 1. Figures 15a, 15b, and 15c show the distance from the equilibrium along the learning paths of τ-GDA with d = 1. Figures 15d, 15e, and 15f show the trajectories of the eigenvalues of J_τ(x*) as a function of τ, respectively. Figures 15g, 15h, and 15i show the distance from the equilibrium along the learning paths of τ-GDA with d = 5, 10, and 20. Panel (j) gives the legend for Figures 15a, 15b, 15c, 15g, 15h, and 15i.
Figures 15a, 15b, and 15c show the distance from an equilibrium along the learning path of τ-GDA for several choices of τ given a fixed initial condition with learning rate γ = 0.001 and three levels of regularization µ (including µ = 0.75 and µ = 1), respectively. Moreover, Figures 15d, 15e, and 15f show the trajectories of the eigenvalues of J_τ(V*, W*) as a function of τ for the same regularization parameters. Finally, Figures 16a, 16b, and 16c show the trajectories of τ-GDA overlayed on the vector field generated by the respective timescale separation and regularization parameters.

Figure 16: Experimental results for learning a covariance matrix defined by the cost from (72) of Section L.3 with d = 1. We overlay the trajectories produced by τ-GDA onto the vector field generated by the choices of τ and µ. The shading of the vector field is dictated by its magnitude so that lighter shading corresponds to a higher magnitude and darker shading corresponds to a lower magnitude.

From the eigenvalue trajectories, we see that as µ grows, the eigenvalues become purely real at a smaller value of τ. Moreover, as µ increases, the magnitude of the real and imaginary parts of the eigenvalues decreases. We observe the effect of this on the convergence, where the dynamics do not cycle as much for larger µ. Again, we see the trade-off between timescale separation, regularization, and convergence. For example, despite the eigenvalues being purely real with µ = 1 and τ = 25, so that there are no rotational dynamics, the convergence is slower than for µ = 0.75, where the eigenvalues retain a non-zero imaginary part.
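The transition from complex to purely real eigenvalues can be read off from the d = 1 Jacobian directly. The short sketch below evaluates spec(J_τ(V*, W*)) with σ = 1 and illustrative values of µ and τ; the discriminant τ^2 µ^2 − 16τσ^2 is negative for τ < 16σ^2/µ^2 and non-negative beyond that threshold, so larger µ removes the rotational component at a smaller timescale separation.

import numpy as np

# Sketch: eigenvalues of the d = 1 Jacobian J_tau(V*, W*) = [[0, -2*sigma], [2*tau*sigma, tau*mu]]
# as tau varies, with sigma = 1 and illustrative values of mu and tau.
sigma = 1.0
for mu in [0.5, 0.75, 1.0]:
    print(f"mu = {mu}: eigenvalues become purely real for tau >= {16 * sigma**2 / mu**2:.1f}")
    for tau in [1, 10, 25, 50]:
        J = np.array([[0.0, -2.0 * sigma], [2.0 * tau * sigma, tau * mu]])
        print(f"  tau = {tau:3d}: {np.round(np.linalg.eigvals(J), 3)}")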
Figures 15g, 15h, and 15i show the distance from a critical point along the learning path of τ-GDA for several choices of τ given a fixed initial condition and learning rate, with regularization µ = 1 and the dimension of the problem d among the set {5, 10, 20}, respectively. The primary purpose of showing this set of results is simply to make clear that the behavior for d = 1, which is easier to explain and visualize, transfers over to higher-dimensional formulations of this problem. This is to be expected since the problem dimension is not necessarily fundamental to the convergence rate; rather, the rate depends on the conditioning of Σ, and each Σ was chosen so that the behavior was comparable for each choice of dimension.

L.4 Generative Adversarial Networks: Image Data

In this section we provide additional results and details from the experiments we ran training generative adversarial networks on the CIFAR-10 and CelebA datasets. In Figure 17 we show more generated samples on each of the datasets. We ran our simulations based on the work of Mescheder et al. (2018) and used the publicly available code from the link https://github.com/LMescheder/GAN_stability. We refer the reader to (Mescheder et al., 2018) for details on the implementation and architectures, as we primarily only changed the learning rates used to run the experiments. For the networks, we ran the experiments using the architecture provided in the gan_training/models/resnet.py file of the repository. In Figure 18 we include the hyperparameters we used for the experiments. To be clear, we used the exact same setup for training on both the CIFAR-10 and CelebA datasets. We computed the Frechet Inception Distance (FID) using 10k samples from the real and generated data. For both experiments, and across the set of hyperparameters, we did the evaluation using a fixed random noise vector to make for an equal comparison and a fixed set of real images which were randomly selected. The evaluation was done using the training data. We used the FID score implementation in PyTorch available at https://github.com/mseitzer/pytorch-fid.

Figure 17: Generated sample images with τ = 4. (a) CIFAR-10. (b) CelebA.
Figure 18: Hyperparameters used for the generative adversarial network experiments on CIFAR-10 and CelebA.

Hyperparameter                        Value(s)
Objective                             NSGAN
Batch Size                            64
Latent Distribution                   z ∈ R^…
Generator Learning Rate               0.0001
Timescale Separation τ                {1, 2, 4, 8}
Learning Rate Decay                   (1 + x)^−…
Optimizer                             RMSprop
RMSprop Smoothing Constant α          …
RMSprop ε                             10^−…
Regularization µ                      {1, 10}
EMA Parameter β                       {…}
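The timescale separation values in Figure 18 are implemented simply by scaling the discriminator's learning rate. The following sketch shows the pattern with two RMSprop optimizers in PyTorch; the module definitions are placeholders rather than the ResNet architectures from the GAN_stability repository, and the smoothing constant shown is the PyTorch default rather than a value taken from Figure 18.

import torch

# Minimal sketch: timescale separation tau is realized by giving the discriminator
# a learning rate tau times larger than the generator's (placeholder modules below).
gamma, tau = 1e-4, 4

generator = torch.nn.Linear(256, 32 * 32 * 3)      # placeholder for the ResNet generator
discriminator = torch.nn.Linear(32 * 32 * 3, 1)    # placeholder for the ResNet discriminator

g_optimizer = torch.optim.RMSprop(generator.parameters(), lr=gamma, alpha=0.99)
d_optimizer = torch.optim.RMSprop(discriminator.parameters(), lr=tau * gamma, alpha=0.99)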