Follow the Perturbed Leader: Optimism and Fast Parallel Algorithms for Smooth Minimax Games
Arun Sai Suggala
Carnegie Mellon University [email protected]
Praneeth Netrapalli
Microsoft Research, India [email protected]
Abstract
We consider the problem of online learning and its application to solving minimax games. For the online learning problem, Follow the Perturbed Leader (FTPL) is a widely studied algorithm which enjoys the optimal $O(T^{1/2})$ worst-case regret guarantee for both convex and nonconvex losses. In this work, we show that when the sequence of loss functions is predictable, a simple modification of FTPL which incorporates optimism can achieve better regret guarantees, while retaining the optimal worst-case regret guarantee for unpredictable sequences. A key challenge in obtaining these tighter regret bounds is the stochasticity and optimism in the algorithm, which requires different analysis techniques than those commonly used in the analysis of FTPL. The key ingredient we utilize in our analysis is the dual view of perturbation as regularization. While our algorithm has several applications, we consider the specific application of minimax games. For solving smooth convex-concave games, our algorithm only requires access to a linear optimization oracle. For Lipschitz and smooth nonconvex-nonconcave games, our algorithm requires access to an optimization oracle which computes the perturbed best response. In both these settings, our algorithm solves the game up to an accuracy of $O(T^{-1/2})$ using $T$ calls to the optimization oracle. An important feature of our algorithm is that it is highly parallelizable and requires only $O(T^{1/2})$ iterations, with each iteration making $O(T^{1/2})$ parallel calls to the optimization oracle.

In this work, we consider the problem of online learning, where in each iteration, the learner chooses an action and observes a loss function. The goal of the learner is to choose a sequence of actions which minimizes the cumulative loss suffered over the course of learning. The paradigm of online learning has many theoretical and practical applications and has been widely studied in a number of fields, including game theory and machine learning.
One of the popular applications of online learning is in solving minimax games arising in various contexts such as boosting [1], robust optimization [2], and Generative Adversarial Networks [3].

In recent years, a number of efficient algorithms have been developed for regret minimization. These algorithms fall into two broad categories, namely, Follow the Regularized Leader (FTRL) [4] and FTPL [5] style algorithms. When the loss functions encountered by the learner are convex, both these algorithms are known to achieve the optimal $O(T^{1/2})$ worst-case regret [6, 7]. While these algorithms have similar regret guarantees, they differ in computational aspects. Each iteration of FTRL involves an expensive projection step. In contrast, each step of FTPL involves solving a linear optimization problem, which can be implemented efficiently for many problems of interest [8, 9, 10]. This crucial difference between FTRL and FTPL makes the latter algorithm more attractive in practice. Even in the more general nonconvex setting, where the loss functions encountered by the learner can potentially be nonconvex, FTPL algorithms are attractive. In this setting, FTPL requires access to an offline optimization oracle which computes the perturbed best response, and achieves $O(T^{1/2})$ worst-case regret [11]. Furthermore, these optimization oracles can be efficiently implemented for many problems by leveraging the rich body of work on global optimization [12].

Despite its importance and popularity, FTPL has been mostly studied for the worst-case setting, where the loss functions are assumed to be adversarially chosen. In a number of applications of online learning, the loss functions are actually benign and predictable [13]. In such scenarios, FTPL cannot utilize the predictability of losses to achieve tighter regret bounds.
While [11, 13] study variants of FTPL which can make use of predictability, these works either consider restricted settings or provide sub-optimal regret guarantees (see Section 2 for more details). This is unlike FTRL, where optimistic variants that can utilize the predictability of loss functions have been well understood [13, 14] and have been shown to provide faster convergence rates in applications such as minimax games. In this work, we aim to bridge this gap and study a variant of FTPL called Optimistic FTPL (OFTPL), which can achieve better regret bounds, while retaining the optimal worst-case regret guarantee for unpredictable sequences. The main challenge in obtaining these tighter regret bounds is handling the stochasticity and optimism in the algorithm, which requires different analysis techniques to those commonly used in the analysis of FTPL. In this work, we rely on the dual view of perturbation as regularization to derive regret bounds of OFTPL.

To demonstrate the usefulness of OFTPL, we consider the problem of solving minimax games. A widely used approach for solving such games relies on online learning algorithms [6]. In this approach, both the minimization and the maximization players play a repeated game against each other and rely on online learning algorithms to choose their actions in each round of the game. In our algorithm for solving games, we let both the players use OFTPL to choose their actions. For solving smooth convex-concave games, our algorithm only requires access to a linear optimization oracle. For Lipschitz and smooth nonconvex-nonconcave games, our algorithm requires access to an optimization oracle which computes the perturbed best response. In both these settings, our algorithm solves the game up to an accuracy of $O(T^{-1/2})$ using $T$ calls to the optimization oracle.
While there are prior algorithms that achieve these convergence rates [11, 15], an important feature of our algorithm is that it is highly parallelizable and requires only $O(T^{1/2})$ iterations, with each iteration making $O(T^{1/2})$ parallel calls to the optimization oracle. We note that such parallelizable algorithms are especially useful in large-scale machine learning applications such as training of GANs and adversarial training, which often involve huge datasets such as ImageNet [16].

Online Learning.
The online learning framework can be seen as a repeated game between a learner and an adversary. In this framework, in each round $t$, the learner makes a prediction $x_t \in \mathcal{X} \subseteq \mathbb{R}^d$ for some compact set $\mathcal{X}$, the adversary simultaneously chooses a loss function $f_t : \mathcal{X} \to \mathbb{R}$, and the two then observe each other's actions. The goal of the learner is to choose a sequence of actions $\{x_t\}_{t=1}^T$ so that the following notion of regret is minimized: $\sum_{t=1}^T f_t(x_t) - \inf_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x)$.

When the domain $\mathcal{X}$ and loss functions $f_t$ are convex, a number of efficient algorithms for regret minimization have been studied. Some of these include deterministic algorithms such as Online Mirror Descent, Follow the Regularized Leader (FTRL) [4, 7], and stochastic algorithms such as Follow the Perturbed Leader (FTPL) [5]. In FTRL, one predicts $x_t$ as $\operatorname{argmin}_{x \in \mathcal{X}} \sum_{i=1}^{t-1} \langle \nabla_i, x \rangle + R(x)$, for some strongly convex regularizer $R$, where $\nabla_i = \nabla f_i(x_i)$. FTRL is known to achieve the optimal $O(T^{1/2})$ worst-case regret in the convex setting [4]. In FTPL, one predicts $x_t$ as $m^{-1} \sum_{j=1}^m x_{t,j}$, where $x_{t,j}$ is a minimizer of the following linear optimization problem: $\operatorname{argmin}_{x \in \mathcal{X}} \left\langle \sum_{i=1}^{t-1} \nabla_i - \sigma_{t,j}, x \right\rangle$. Here, $\{\sigma_{t,j}\}_{j=1}^m$ are independent random perturbations drawn from some appropriate probability distribution such as the exponential distribution or the uniform distribution in a hypercube. Different choices of perturbation distribution give rise to different FTPL algorithms. When the loss functions are linear, Kalai and Vempala [5] show that FTPL achieves $O(T^{1/2})$ expected regret, irrespective of the choice of $m$. When the loss functions are convex, Hazan [7] showed that the deterministic version of FTPL (i.e., as $m \to \infty$) achieves $O(T^{1/2})$ regret. While projection-free methods for online convex learning have been studied since the early work of [17], surprisingly, regret bounds of FTPL for finite $m$ have only been recently studied [10].
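The finite-$m$ FTPL update described above can be made concrete with a small sketch. The following is only an illustration under assumptions that are not part of the paper: the domain is taken to be the probability simplex (so the linear optimization oracle returns a vertex), the losses are linear, $f_t(x) = \langle \theta_t, x \rangle$, and the perturbations have i.i.d. $\mathrm{Exp}(\eta)$ entries. The names `simplex_oracle` and `ftpl_regret` are hypothetical.

```python
import numpy as np


def simplex_oracle(costs):
    """Linear optimization oracle over the probability simplex (assumed domain).

    For each row c of `costs`, returns a vertex e_i that minimizes <c, x>.
    """
    m, d = costs.shape
    vertices = np.zeros((m, d))
    vertices[np.arange(m), costs.argmin(axis=1)] = 1.0
    return vertices


def ftpl_regret(thetas, eta, m, seed=0):
    """Run FTPL on linear losses f_t(x) = <thetas[t], x> and return the regret."""
    rng = np.random.default_rng(seed)
    T, d = thetas.shape
    cum_grad = np.zeros(d)  # nabla_1 + ... + nabla_{t-1}
    learner_loss = 0.0
    for t in range(T):
        # m independent perturbations, each with i.i.d. Exp(eta) entries
        sigma = rng.exponential(scale=eta, size=(m, d))
        # x_t is the average of the m perturbed leaders
        x_t = simplex_oracle(cum_grad - sigma).mean(axis=0)
        learner_loss += float(thetas[t] @ x_t)
        cum_grad += thetas[t]  # the gradient of a linear loss is theta_t
    # A linear function attains its infimum over the simplex at a vertex.
    best_fixed = float(thetas.sum(axis=0).min())
    return learner_loss - best_fixed
```

Calling `ftpl_regret(thetas, eta=np.sqrt(T), m=20)` on a `(T, d)` array of losses returns the realized regret against the best fixed vertex; the choice $\eta \approx \sqrt{T}$ is the usual tuning for $O(T^{1/2})$ regret and is used here only as a reasonable default.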
Hazan and Minasyan [10] show that for Lipschitz and convex functions, FTPL achieves $O(T^{1/2} + m^{-1/2} T)$ expected regret, and for smooth convex functions, the algorithm achieves $O(T^{1/2} + m^{-1} T)$ expected regret.

When either the domain $\mathcal{X}$ or the loss functions $f_t$ are non-convex, no deterministic algorithm can achieve $o(T)$ regret [6, 11]. In such cases, one has to rely on randomized algorithms to achieve sub-linear regret. In randomized algorithms, in each round $t$, the learner samples the prediction $x_t$ from a distribution $P_t \in \mathcal{P}$, where $\mathcal{P}$ is the set of all probability distributions supported on $\mathcal{X}$. The goal of the learner is to choose a sequence of distributions $\{P_t\}_{t=1}^T$ to minimize the expected regret $\sum_{t=1}^T \mathbb{E}_{x \sim P_t}[f_t(x)] - \inf_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x)$. A popular technique to minimize the expected regret is to consider a linearized problem in the space of probability distributions with losses $\tilde{f}_t(P) = \mathbb{E}_{x \sim P}[f_t(x)]$ and perform FTRL in this space. In such a technique, $P_t$ is computed as $\operatorname{argmin}_{P \in \mathcal{P}} \sum_{i=1}^{t-1} \tilde{f}_i(P) + R(P)$, for some strongly convex regularizer $R(P)$. When $R(P)$ is the negative entropy of $P$, the algorithm is called entropic mirror descent or continuous exponential weights. This algorithm achieves $O(T^{1/2})$ expected regret for bounded loss functions $f_t$. Another technique to minimize expected regret is to rely on FTPL [11, 18]. Here, the learner generates the random prediction $x_t$ by first sampling a random perturbation $\sigma$ and then computing the perturbed best response, which is defined as $\operatorname{argmin}_{x \in \mathcal{X}} \sum_{i=1}^{t-1} f_i(x) - \langle \sigma, x \rangle$. In a recent work, Suggala and Netrapalli [11] show that this algorithm achieves $O(T^{1/2})$ expected regret whenever the loss functions are Lipschitz. We now briefly discuss the computational aspects of FTRL and FTPL. Each iteration of FTRL (with entropic regularizer) requires sampling from a non-logconcave distribution.
In contrast, FTPL requires solving a nonconvex optimization problem to compute the perturbed best response. Of these, computing the perturbed best response seems significantly easier since standard algorithms such as gradient descent seem to be able to find approximate global optima reasonably fast, even for complicated tasks such as training deep neural networks.

Online Learning with Optimism.
When the sequence of loss functions is convex and predictable, Rakhlin and Sridharan [13, 14] study optimistic variants of FTRL which can exploit the predictability to obtain better regret bounds. Let $g_t$ be our guess of $\nabla_t$ at the beginning of round $t$. Given $g_t$, we predict $x_t$ in Optimistic FTRL (OFTRL) as $\operatorname{argmin}_{x \in \mathcal{X}} \left\langle \sum_{i=1}^{t-1} \nabla_i + g_t, x \right\rangle + R(x)$. Note that when $g_t = 0$, OFTRL is equivalent to FTRL. [13, 14] show that the regret bounds of OFTRL only depend on $(g_t - \nabla_t)$. Moreover, these works show that OFTRL provides faster convergence rates for solving smooth convex-concave games. In contrast to FTRL, the optimistic variants of FTPL are less well understood. [13] studies OFTPL for linear loss functions, but considers restrictive settings, and the resulting algorithms require knowledge of the sizes of the deviations $(g_t - \nabla_t)$. [11] studies OFTPL for the more general nonconvex setting. The algorithm predicts $x_t$ as $\operatorname{argmin}_{x \in \mathcal{X}} \sum_{i=1}^{t-1} f_i(x) + g_t(x) - \langle \sigma, x \rangle$, where $g_t$ is our guess of $f_t$. However, the regret bounds of [11] are sub-optimal and weaker than the bounds we obtain in our work (see Theorem 4.2). Moreover, [11] does not provide any consequences of their results for minimax games. We note that their sub-optimal regret bounds translate to sub-optimal rates of convergence for solving smooth minimax games.

Minimax Games.
Consider the following problem, which we refer to as a minimax game: $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y)$. In these games, we are often interested in finding a Nash Equilibrium (NE). A pair $(P, Q)$, where $P$ is a probability distribution over $\mathcal{X}$ and $Q$ is a probability distribution over $\mathcal{Y}$, is called a NE if: $\sup_{y \in \mathcal{Y}} \mathbb{E}_{x \sim P}[f(x, y)] \le \mathbb{E}_{x \sim P, y \sim Q}[f(x, y)] \le \inf_{x \in \mathcal{X}} \mathbb{E}_{y \sim Q}[f(x, y)]$. A standard technique for finding a NE of the game is to rely on no-regret algorithms [6, 7]. Here, both $x$ and $y$ players play a repeated game against each other and use online learning algorithms to choose their actions. The average of the iterates generated via this repeated game can be shown to converge to a NE.

Projection Free Learning.
Projection free learning algorithms are attractive as they only involve solving linear optimization problems. Two broad classes of projection free techniques have been considered for online convex learning and minimax games, namely, Frank-Wolfe (FW) methods and FTPL based methods. Garber and Hazan [8] consider the problem of online learning when the action space $\mathcal{X}$ is a polytope. They provide a FW method which achieves $O(T^{1/2})$ regret using $T$ calls to the linear optimization oracle. Hazan and Kale [17] provide a FW technique which achieves $O(T^{3/4})$ regret for general online convex learning with Lipschitz losses and uses $T$ calls to the linear optimization oracle. In a recent work, Hazan and Minasyan [10] show that FTPL achieves $O(T^{2/3})$ regret for online convex learning with smooth losses, using $T$ calls to the linear optimization oracle. This translates to an $O(T^{-1/3})$ rate of convergence for solving smooth convex-concave games. Note that, in contrast, our algorithm achieves an $O(T^{-1/2})$ convergence rate in the same setting. Gidel et al. [9] study FW methods for solving convex-concave games. When the constraint sets $\mathcal{X}, \mathcal{Y}$ are strongly convex, the authors show geometric convergence of their algorithms. In a recent work, He and Harchaoui [15] propose a FW technique for solving smooth convex-concave games which converges at a rate of $O(T^{-1/2})$ using $T$ calls to the linear optimization oracle. We note that our simple OFTPL based algorithm achieves these rates, with the added advantage of parallelizability. That being said, He and Harchaoui [15] achieve dimension-free convergence rates in the Euclidean setting, where the smoothness is measured w.r.t. the $\|\cdot\|_2$ norm. In contrast, the rates of convergence of our algorithm depend on the dimension.

Notation. $\|\cdot\|$ is a norm on some vector space, which is typically $\mathbb{R}^d$ in our work. $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, which is defined as $\|x\|_* = \sup\{\langle u, x \rangle : u \in \mathbb{R}^d, \|u\| \le 1\}$. We use $\Psi_1, \Psi_2$ to denote norm compatibility constants of $\|\cdot\|$, which are defined as $\Psi_1 = \sup_{x \ne 0} \|x\| / \|x\|_2$, $\Psi_2 = \sup_{x \ne 0} \|x\|_2 / \|x\|$.
We use the notation $f_{1:t}$ to denote $\sum_{i=1}^t f_i$. In some cases, when clear from context, we overload the notation $f_{1:t}$ and use it to denote the set $\{f_1, f_2 \dots f_t\}$. For any convex function $f$, $\partial f(x)$ is the set of all subgradients of $f$ at $x$. For any function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, $f(\cdot, y)$, $f(x, \cdot)$ denote the functions $x \mapsto f(x, y)$, $y \mapsto f(x, y)$. For any function $f : \mathcal{X} \to \mathbb{R}$ and any probability distribution $P$, we let $f(P)$ denote $\mathbb{E}_{x \sim P}[f(x)]$. Similarly, for any function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ and any two distributions $P, Q$, we let $f(P, Q)$ denote $\mathbb{E}_{x \sim P, y \sim Q}[f(x, y)]$. For any set of distributions $\{P_j\}_{j=1}^m$, $\frac{1}{m} \sum_{j=1}^m P_j$ is the mixture distribution which gives equal weights to its components. We use $\mathrm{Exp}(\eta)$ to denote the exponential distribution, whose CDF is given by $P(Z \le s) = 1 - \exp(-s/\eta)$.

In this section, we present a key result which shows that when the loss functions are convex, every FTPL algorithm is an FTRL algorithm. Our analysis of OFTPL relies on this dual view to obtain tight regret bounds. This duality between FTPL and FTRL was originally studied by Hofbauer and Sandholm [19], where the authors show that any FTPL algorithm, with perturbation distribution admitting a strictly positive density on $\mathbb{R}^d$, is an FTRL algorithm w.r.t. some convex regularizer. However, many popular perturbation distributions such as the exponential and uniform distributions do not have a strictly positive density. In a recent work, Abernethy et al. [20] point out that the duality between FTPL and FTRL holds for very general perturbation distributions. However, the authors do not provide a formal theorem showing this result. Here, we provide a proposition formalizing the claim of [20].

Proposition 3.1.
Consider the problem of online convex learning, where the loss functions $\{f_t\}_{t=1}^T$ encountered by the learner are convex. Consider the deterministic version of the FTPL algorithm, where the learner predicts $x_t$ as $\mathbb{E}_\sigma\left[\operatorname{argmin}_{x \in \mathcal{X}} \langle \nabla_{1:t-1} - \sigma, x \rangle\right]$. Suppose the perturbation distribution is absolutely continuous w.r.t. the Lebesgue measure. Then there exists a convex regularizer $R : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$, with domain $\mathrm{dom}(R) \subseteq \mathcal{X}$, such that $x_t = \operatorname{argmin}_{x \in \mathcal{X}} \langle \nabla_{1:t-1}, x \rangle + R(x)$. Moreover, $-\nabla_{1:t-1} \in \partial R(x_t)$, and $x_t = \partial R^{-1}(-\nabla_{1:t-1})$, where $\partial R^{-1}$ is the inverse of $\partial R$ in the sense of multivalued mappings.

In this section, we present the OFTPL algorithm for online convex learning and derive an upper bound on its regret. The algorithm we consider is similar to the OFTRL algorithm (see Algorithm 1). Let $g_t[f_1 \dots f_{t-1}]$ be our guess for $\nabla_t$ at the beginning of round $t$, with $g_1 = 0$. To simplify the notation, in the sequel, we suppress the dependence of $g_t$ on $\{f_i\}_{i=1}^{t-1}$. Given $g_t$, we predict $x_t$ in OFTPL as follows. We sample independent perturbations $\{\sigma_{t,j}\}_{j=1}^m$ from the perturbation distribution $P_{\mathrm{PRTB}}$ and compute $x_t$ as $m^{-1} \sum_{j=1}^m x_{t,j}$, where $x_{t,j}$ is a minimizer of the following linear optimization problem: $x_{t,j} \in \operatorname{argmin}_{x \in \mathcal{X}} \langle \nabla_{1:t-1} + g_t - \sigma_{t,j}, x \rangle$.

Algorithm 1 Convex OFTPL
  Input: Perturbation distribution $P_{\mathrm{PRTB}}$, number of samples $m$, number of iterations $T$
  Denote $\nabla_0 = 0$
  for $t = 1 \dots T$ do
    Let $g_t$ be the guess for $\nabla_t$
    for $j = 1 \dots m$ do
      Sample $\sigma_{t,j} \sim P_{\mathrm{PRTB}}$
      $x_{t,j} \in \operatorname{argmin}_{x \in \mathcal{X}} \langle \nabla_{0:t-1} + g_t - \sigma_{t,j}, x \rangle$
    end for
    Play $x_t = \frac{1}{m} \sum_{j=1}^m x_{t,j}$
    Observe loss function $f_t$
  end for

We now present our main theorem which bounds the regret of OFTPL. A key quantity the regret depends on is the stability of predictions of the deterministic version of OFTPL. Intuitively, an algorithm is stable if its predictions in two consecutive iterations differ by a small quantity. To capture this notion, we first define the function $\nabla\Phi : \mathbb{R}^d \to \mathbb{R}^d$ as $\nabla\Phi(g) = \mathbb{E}_\sigma\left[\operatorname{argmin}_{x \in \mathcal{X}} \langle g - \sigma, x \rangle\right]$. Observe that $\nabla\Phi(\nabla_{1:t-1} + g_t)$ is the prediction of the deterministic version of OFTPL. We say the predictions of OFTPL are stable if $\nabla\Phi$ is a Lipschitz function.

Definition 4.1 (Stability). The predictions of OFTPL are said to be $\beta$-stable w.r.t. some norm $\|\cdot\|$ if, for all $g_1, g_2 \in \mathbb{R}^d$, $\|\nabla\Phi(g_1) - \nabla\Phi(g_2)\|_* \le \beta \|g_1 - g_2\|$.

Theorem 4.1.
Suppose the perturbation distribution $P_{\mathrm{PRTB}}$ is absolutely continuous w.r.t. the Lebesgue measure. Let $D$ be the diameter of $\mathcal{X}$ w.r.t. $\|\cdot\|$, which is defined as $D = \sup_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|$. Let $\eta = \mathbb{E}_\sigma[\|\sigma\|_*]$, and suppose the predictions of OFTPL are $C\eta^{-1}$-stable w.r.t. $\|\cdot\|_*$, where $C$ is a constant that depends on the set $\mathcal{X}$. Finally, suppose the loss functions $\{f_t\}_{t=1}^T$ are Holder smooth and satisfy, for all $x_1, x_2 \in \mathcal{X}$, $\|\nabla f_t(x_1) - \nabla f_t(x_2)\|_* \le L \|x_1 - x_2\|^\alpha$, for some constant $\alpha \in [0, 1]$. Then the expected regret of Algorithm 1 satisfies

$$\sup_{x \in \mathcal{X}} \mathbb{E}\left[\sum_{t=1}^T f_t(x_t) - f_t(x)\right] \le \eta D + \sum_{t=1}^T \frac{C}{2\eta} \mathbb{E}\left[\|\nabla_t - g_t\|_*^2\right] - \sum_{t=1}^T \frac{\eta}{2C} \mathbb{E}\left[\|x_t^\infty - \tilde{x}_{t-1}^\infty\|^2\right] + LT \left(\frac{\Psi_1 \Psi_2 D}{\sqrt{m}}\right)^{1+\alpha},$$

where $x_t^\infty = \mathbb{E}[x_t \mid g_t, f_{1:t-1}, x_{1:t-1}]$, $\tilde{x}_{t-1}^\infty = \mathbb{E}[\tilde{x}_{t-1} \mid f_{1:t-1}, x_{1:t-1}]$, and $\tilde{x}_{t-1}$ denotes the prediction in the $t$-th iteration of Algorithm 1 if guess $g_t = 0$ was used. Here, $\Psi_1, \Psi_2$ denote the norm compatibility constants of $\|\cdot\|$.

Regret bounds that hold with high probability can be found in Appendix G. The above theorem shows that the regret of OFTPL only depends on $\|\nabla_t - g_t\|_*$, which quantifies the accuracy of our guess $g_t$. In contrast, the regret of FTPL depends on $\|\nabla_t\|_*$ [7]. This shows that for predictable sequences, with an appropriate choice of $g_t$, OFTPL can achieve better regret guarantees than FTPL. As we demonstrate in Section 5, this helps us design faster algorithms for solving minimax games.

Note that the above result is very general and holds for any absolutely continuous perturbation distribution. The key challenge in instantiating this result for any particular perturbation distribution is in showing the stability of predictions. Several past works have studied the stability of FTPL for various perturbation distributions such as uniform, exponential, and Gumbel distributions [5, 7, 10]. Consequently, the above result can be used to derive tight regret bounds for all these perturbation distributions.

Algorithm 2 Nonconvex OFTPL
  Input: Perturbation distribution $P_{\mathrm{PRTB}}$, number of samples $m$, number of iterations $T$
  Denote $f_0 = 0$
  for $t = 1 \dots T$ do
    Let $g_t$ be the guess for $f_t$
    for $j = 1 \dots m$ do
      Sample $\sigma_{t,j} \sim P_{\mathrm{PRTB}}$
      $x_{t,j} \in \operatorname{argmin}_{x \in \mathcal{X}} f_{0:t-1}(x) + g_t(x) - \sigma_{t,j}(x)$
    end for
    Let $P_t$ be the empirical distribution over $\{x_{t,1}, x_{t,2} \dots x_{t,m}\}$
    Play $x_t$, a random sample generated from $P_t$
    Observe loss function $f_t$
  end for

As one particular instantiation of Theorem 4.1, we consider the special case of $g_t = 0$ and derive regret bounds for FTPL when the perturbation distribution is the uniform distribution over a ball centered at the origin.

Corollary 4.1 (FTPL). Suppose the perturbation distribution is equal to the uniform distribution over $\{x : \|x\|_2 \le (1 + d^{-1})\eta\}$. Let $D$ be the diameter of $\mathcal{X}$ w.r.t. $\|\cdot\|_2$. Then $\mathbb{E}_\sigma[\|\sigma\|_2] = \eta$, and the predictions of OFTPL are $dD\eta^{-1}$-stable w.r.t. $\|\cdot\|_2$. Suppose the loss functions $\{f_t\}_{t=1}^T$ are $G$-Lipschitz and satisfy $\sup_{x \in \mathcal{X}} \|\nabla f_t(x)\|_2 \le G$. Moreover, suppose $f_t$ satisfies the Holder smooth condition in Theorem 4.1 w.r.t. the $\|\cdot\|_2$ norm. Then the expected regret of Algorithm 1, with guess $g_t = 0$, satisfies

$$\sup_{x \in \mathcal{X}} \mathbb{E}\left[\sum_{t=1}^T f_t(x_t) - f_t(x)\right] \le \eta D + \frac{dDG^2 T}{2\eta} + LT \left(\frac{D}{\sqrt{m}}\right)^{1+\alpha}.$$

This recovers the regret bounds of FTPL for general convex loss functions, derived by [10].
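To make the convex OFTPL procedure (Algorithm 1) and the role of the guess $g_t$ concrete, here is a minimal sketch. Several choices are assumptions made purely for illustration and do not come from the paper: the domain is the unit Euclidean ball (whose linear oracle is $-c/\|c\|_2$), the perturbation is uniform over the ball of radius $(1 + d^{-1})\eta$ as in Corollary 4.1, and the helper names (`sample_ball`, `ball_oracle`, `convex_oftpl`, `linear_regret`) are hypothetical.

```python
import numpy as np


def sample_ball(rng, m, d, radius):
    """Draw m points uniformly from the Euclidean ball of the given radius in R^d."""
    z = rng.standard_normal((m, d))
    z /= np.linalg.norm(z, axis=1, keepdims=True)      # uniform directions
    r = radius * rng.random((m, 1)) ** (1.0 / d)       # radii with density ~ r^(d-1)
    return r * z


def ball_oracle(costs):
    """Row-wise argmin of <c, x> over the unit Euclidean ball: x = -c / ||c||_2."""
    return -costs / np.linalg.norm(costs, axis=1, keepdims=True)


def convex_oftpl(grad_fns, guess_fn, d, eta, m, seed=0):
    """Sketch of Algorithm 1.

    grad_fns[t](x) returns the gradient of f_t at x;
    guess_fn(past_grads) returns the guess g_t from the gradients seen so far.
    """
    rng = np.random.default_rng(seed)
    cum_grad = np.zeros(d)
    past_grads, plays = [], []
    for grad_f in grad_fns:
        g_t = np.asarray(guess_fn(past_grads), dtype=float)
        sigma = sample_ball(rng, m, d, (1.0 + 1.0 / d) * eta)
        x_t = ball_oracle(cum_grad + g_t - sigma).mean(axis=0)  # average of m oracle calls
        grad_t = np.asarray(grad_f(x_t), dtype=float)
        cum_grad += grad_t
        past_grads.append(grad_t)
        plays.append(x_t)
    return np.array(plays), np.array(past_grads)


def linear_regret(plays, grads):
    """Regret for linear losses <grad_t, x> over the unit ball (the best fixed
    point in hindsight attains -||sum_t grad_t||_2)."""
    return float(np.sum(plays * grads) + np.linalg.norm(grads.sum(axis=0)))
```

As a usage example, consider linear losses whose gradients alternate between $\theta$ and $-\theta$. With `guess_fn = lambda past: np.zeros(d)` the sketch reduces to FTPL, while the guess `lambda past: past[-2] if len(past) >= 2 else np.zeros(d)` predicts every gradient exactly from round 3 onward, which is the predictable regime where the $\|\nabla_t - g_t\|_*$ term of Theorem 4.1 vanishes.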
We now study OFTPL in the nonconvex setting. In this setting, we assume the loss functions belong to some function class $\mathcal{F}$ containing real-valued measurable functions on $\mathcal{X}$. Some popular choices for $\mathcal{F}$ include the set of Lipschitz functions and the set of bounded functions. The OFTPL algorithm in this setting is described in Algorithm 2. Similar to the convex case, we first sample random perturbation functions $\{\sigma_{t,j}\}_{j=1}^m$ from some distribution $P_{\mathrm{PRTB}}$. Some examples of perturbation functions that have been considered in the past include $\sigma_{t,j}(x) = \langle \bar{\sigma}_{t,j}, x \rangle$, for some random vector $\bar{\sigma}_{t,j}$ sampled from exponential or uniform distributions [11, 18]. Another popular choice for $\sigma_{t,j}$ is the Gumbel process, which results in the continuous exponential weights algorithm [21]. Letting $g_t$ be our guess of the loss function $f_t$ at the beginning of round $t$, the learner first computes $x_{t,j}$ as $\operatorname{argmin}_{x \in \mathcal{X}} \sum_{i=1}^{t-1} f_i(x) + g_t(x) - \sigma_{t,j}(x)$. We assume access to an optimization oracle which computes a minimizer of this problem. We often refer to this oracle as the perturbed best response oracle. Let $P_t$ denote the empirical distribution of $\{x_{t,j}\}_{j=1}^m$. The learner then plays an $x_t$ which is sampled from $P_t$. Algorithm 2 describes this procedure. We note that for the online learning problem, $m = 1$ suffices, as the expected loss suffered by the learner in each round is independent of $m$; that is, $\mathbb{E}[f_t(x_t)] = \mathbb{E}[f_t(x_{t,1})]$. However, the choice of $m$ affects the rate of convergence when Algorithm 2 is used for solving nonconvex-nonconcave minimax games.

Before we present the regret bounds, we introduce the dual space associated with $\mathcal{F}$. Let $\|\cdot\|_{\mathcal{F}}$ be a seminorm associated with $\mathcal{F}$. For example, when $\mathcal{F}$ is the set of Lipschitz functions, $\|\cdot\|_{\mathcal{F}}$ is the Lipschitz seminorm. Different choices of $(\mathcal{F}, \|\cdot\|_{\mathcal{F}})$ induce different distance metrics on $\mathcal{P}$, the set of all probability distributions on $\mathcal{X}$.
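We let $\gamma_{\mathcal{F}}$ denote the Integral Probability Metric (IPM) induced by $(\mathcal{F}, \|\cdot\|_{\mathcal{F}})$, which is defined as

$$\gamma_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}, \|f\|_{\mathcal{F}} \le 1} \left| \mathbb{E}_{x \sim P}[f(x)] - \mathbb{E}_{x \sim Q}[f(x)] \right|.$$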
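As a small numerical illustration of the IPM definition, take $\mathcal{F}$ to be the 1-Lipschitz functions on the real line, for which $\gamma_{\mathcal{F}}$ is the Wasserstein-1 distance (Kantorovich-Rubinstein duality). The sketch below is an assumption-laden toy, not part of the paper: it handles only one-dimensional empirical distributions with equally many samples, where the distance has a closed form via sorting, and it compares that value with the lower bound obtained from a finite list of hand-picked 1-Lipschitz test functions. The names `wasserstein1_1d` and `ipm_lower_bound` are hypothetical.

```python
import numpy as np


def wasserstein1_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size empirical distributions on R."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    if xs.shape != ys.shape:
        raise ValueError("this sketch assumes equally many samples in both distributions")
    # In 1D the optimal coupling matches sorted samples in order.
    return float(np.mean(np.abs(xs - ys)))


def ipm_lower_bound(xs, ys, test_fns):
    """Largest |E_P f - E_Q f| over a finite list of candidate test functions.

    If every function in test_fns is 1-Lipschitz, this never exceeds the
    Wasserstein-1 distance, since the IPM takes a supremum over all such f.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    return max(abs(float(np.mean(f(xs)) - np.mean(f(ys)))) for f in test_fns)
```

For example, with `xs = [0, 0]` and `ys = [-1, 1]` the identity function sees no difference in means, but `np.abs` does, and the lower bound then matches the sorted-sample value of the distance.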
We often refer to $(\mathcal{P}, \gamma_{\mathcal{F}})$ as the dual space of $(\mathcal{F}, \|\cdot\|_{\mathcal{F}})$. When $\mathcal{F}$ is the set of Lipschitz functions and $\|\cdot\|_{\mathcal{F}}$ is the Lipschitz seminorm, $\gamma_{\mathcal{F}}$ is the Wasserstein distance. Table 1 in Appendix E.1 presents examples of $\gamma_{\mathcal{F}}$ induced by some popular function spaces. Similar to the convex case, the regret bounds in the nonconvex setting depend on the stability of predictions of OFTPL.

Definition 4.2 (Stability). Suppose the perturbation function $\sigma(x)$ is sampled from $P_{\mathrm{PRTB}}$. For any $f \in \mathcal{F}$, define the random variable $x_f(\sigma)$ as $\operatorname{argmin}_{x \in \mathcal{X}} f(x) - \sigma(x)$. Let $\nabla\Phi(f)$ denote the distribution of $x_f(\sigma)$. The predictions of OFTPL are said to be $\beta$-stable w.r.t. $\|\cdot\|_{\mathcal{F}}$ if, for all $f, g \in \mathcal{F}$, $\gamma_{\mathcal{F}}(\nabla\Phi(f), \nabla\Phi(g)) \le \beta \|f - g\|_{\mathcal{F}}$.

Theorem 4.2.
Suppose the loss functions $\{f_t\}_{t=1}^T$ belong to $(\mathcal{F}, \|\cdot\|_{\mathcal{F}})$. Suppose the perturbation distribution $P_{\mathrm{PRTB}}$ is such that $\operatorname{argmin}_{x \in \mathcal{X}} f(x) - \sigma(x)$ has a unique minimizer with probability one, for any $f \in \mathcal{F}$. Let $\mathcal{P}$ be the set of probability distributions over $\mathcal{X}$. Define the diameter of $\mathcal{P}$ as $D = \sup_{P_1, P_2 \in \mathcal{P}} \gamma_{\mathcal{F}}(P_1, P_2)$. Let $\eta = \mathbb{E}[\|\sigma\|_{\mathcal{F}}]$. Suppose the predictions of OFTPL are $C\eta^{-1}$-stable w.r.t. $\|\cdot\|_{\mathcal{F}}$, for some constant $C$ that depends on $\mathcal{X}$. Then the expected regret of Algorithm 2 satisfies

$$\sup_{x \in \mathcal{X}} \mathbb{E}\left[\sum_{t=1}^T f_t(x_t) - f_t(x)\right] \le \eta D + \sum_{t=1}^T \frac{C}{2\eta} \mathbb{E}\left[\|f_t - g_t\|_{\mathcal{F}}^2\right] - \sum_{t=1}^T \frac{\eta}{2C} \mathbb{E}\left[\gamma_{\mathcal{F}}(P_t^\infty, \tilde{P}_{t-1}^\infty)^2\right],$$

where $P_t^\infty = \mathbb{E}[P_t \mid g_t, f_{1:t-1}, P_{1:t-1}]$, $\tilde{P}_{t-1}^\infty = \mathbb{E}[\tilde{P}_{t-1} \mid f_{1:t-1}, P_{1:t-1}]$, and $\tilde{P}_{t-1}$ is the empirical distribution computed in the $t$-th iteration of Algorithm 2 if guess $g_t = 0$ was used.

We note that, unlike the convex case, there are no known analogs of Fenchel duality for infinite dimensional function spaces. As a result, more careful analysis is needed to obtain the above regret bounds. Our analysis mimics the arguments made in the convex case, albeit without explicitly relying on duality theory. As in the convex case, the key challenge in instantiating the above result for any particular perturbation distribution is in showing the stability of predictions. In a recent work, [11] consider linear perturbation functions $\sigma(x) = \langle \bar{\sigma}, x \rangle$, for $\bar{\sigma}$ sampled from the exponential distribution, and show stability of FTPL. We now instantiate the above theorem for this setting.

Corollary 4.2.
Consider the setting of Theorem 4.2. Let $\mathcal{F}$ be the set of Lipschitz functions and $\|\cdot\|_{\mathcal{F}}$ be the Lipschitz seminorm, which is defined as $\|f\|_{\mathcal{F}} = \sup_{x \ne y \in \mathcal{X}} |f(x) - f(y)| / \|x - y\|_1$. Suppose the perturbation function is such that $\sigma(x) = \langle \bar{\sigma}, x \rangle$, where $\bar{\sigma} \in \mathbb{R}^d$ is a random vector whose entries are sampled independently from $\mathrm{Exp}(\eta)$. Then $\mathbb{E}_\sigma[\|\sigma\|_{\mathcal{F}}] = \eta \log d$, and the predictions of OFTPL are $O(d^2 D \eta^{-1})$-stable w.r.t. $\|\cdot\|_{\mathcal{F}}$. Moreover, the expected regret of Algorithm 2 is upper bounded by

$$O\left(\eta D \log d + \sum_{t=1}^T \frac{d^2 D}{\eta} \mathbb{E}\left[\|f_t - g_t\|_{\mathcal{F}}^2\right] - \sum_{t=1}^T \frac{\eta}{d^2 D} \mathbb{E}\left[\gamma_{\mathcal{F}}(P_t^\infty, \tilde{P}_{t-1}^\infty)^2\right]\right).$$

We note that the above regret bounds are tighter than the regret bounds of [11], where the authors show that the regret of OFTPL is bounded by $O\left(\eta D \log d + \sum_{t=1}^T \frac{d^2 D}{\eta} \mathbb{E}\left[\|f_t - g_t\|_{\mathcal{F}}^2\right]\right)$. These tighter bounds help us design faster algorithms for solving minimax games in the nonconvex setting.

We now consider the problem of solving minimax games of the following form:

$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y). \quad (1)$$

Nash equilibria of such games can be computed by playing two online learning algorithms against each other [6, 7]. In this work, we study the algorithm where both the players employ OFTPL to decide their actions in each round. For convex-concave games, both the players use the OFTPL algorithm described in Algorithm 1 (see Algorithm 3 in Appendix D). The following theorem derives the rate of convergence of this algorithm to a Nash equilibrium (NE).

Theorem 5.1.
Consider the minimax game in Equation (1). Suppose both the domains $\mathcal{X}, \mathcal{Y}$ are compact subsets of $\mathbb{R}^d$, with diameter $D = \max\{\sup_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|_2, \sup_{y_1, y_2 \in \mathcal{Y}} \|y_1 - y_2\|_2\}$. Suppose $f$ is convex in $x$, concave in $y$, and is smooth w.r.t. $\|\cdot\|_2$:

$$\|\nabla_x f(x, y) - \nabla_x f(x', y')\|_2 + \|\nabla_y f(x, y) - \nabla_y f(x', y')\|_2 \le L \|x - x'\|_2 + L \|y - y'\|_2.$$

Suppose Algorithm 3 is used to solve the minimax game. Suppose the perturbation distributions used by both the players are the same and equal to the uniform distribution over $\{x : \|x\|_2 \le (1 + d^{-1})\eta\}$. Suppose the guesses used by the $x$, $y$ players in the $t$-th iteration are $\nabla_x f(\tilde{x}_{t-1}, \tilde{y}_{t-1})$, $\nabla_y f(\tilde{x}_{t-1}, \tilde{y}_{t-1})$, where $\tilde{x}_{t-1}, \tilde{y}_{t-1}$ denote the predictions of the $x$, $y$ players in the $t$-th iteration if guess $g_t = 0$ was used. If Algorithm 3 is run with $\eta = 6dD(L + 1)$, $m = T$, then the iterates $\{(x_t, y_t)\}_{t=1}^T$ satisfy

$$\sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \mathbb{E}\left[f\left(\frac{1}{T} \sum_{t=1}^T x_t, y\right) - f\left(x, \frac{1}{T} \sum_{t=1}^T y_t\right)\right] = O\left(\frac{dD^2(L + 1)}{T}\right).$$

Rates of convergence which hold with high probability can be found in Appendix G. We note that Theorem 5.1 can be extended to more general noise distributions and settings where gradients of $f$ are Holder smooth w.r.t. non-Euclidean norms, and $\mathcal{X}, \mathcal{Y}$ lie in spaces of different dimensions (see Theorem D.1 in the Appendix). The above result shows that for smooth convex-concave games, Algorithm 3 converges to a NE at an $O(T^{-1})$ rate using $T^2$ calls to the linear optimization oracle. Moreover, the algorithm runs in $O(T)$ iterations, with each iteration making $O(T)$ parallel calls to the optimization oracle. We believe the dimension dependence in the rates can be removed by appropriately choosing the perturbation distributions based on the domains $\mathcal{X}, \mathcal{Y}$ (see Appendix F).

We now consider the more general nonconvex-nonconcave games. In this case, both the players use the nonconvex OFTPL algorithm described in Algorithm 2 to choose their actions.
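For the convex-concave case covered by Theorem 5.1, the two-player scheme can be sketched in a few lines. The following is one reading of the theorem statement and should not be taken as the authors' Algorithm 3, which appears only in the appendix. The assumptions, all chosen for illustration: the game is the bilinear toy problem $f(x, y) = \langle x - a, y - b \rangle$ on the box $[-1, 1]^d$, whose linear oracle is a coordinatewise sign; the $y$ player runs OFTPL on the loss $-f$, so its gradients and guesses carry a minus sign; and the no-guess predictions $\tilde{x}, \tilde{y}$ are recomputed with fresh perturbations, which costs $4m$ oracle calls per iteration in this sketch. All function names are hypothetical.

```python
import numpy as np


def sample_ball(rng, m, d, radius):
    """Draw m points uniformly from the Euclidean ball of the given radius in R^d."""
    z = rng.standard_normal((m, d))
    z /= np.linalg.norm(z, axis=1, keepdims=True)
    return radius * rng.random((m, 1)) ** (1.0 / d) * z


def box_oracle(costs):
    """Row-wise argmin of <c, x> over the box [-1, 1]^d."""
    return -np.sign(costs)


def oftpl_point(rng, cum_grad, guess, eta, m):
    """Average of m perturbed leaders for the cost vector cum_grad + guess."""
    d = cum_grad.shape[0]
    sigma = sample_ball(rng, m, d, (1.0 + 1.0 / d) * eta)
    return box_oracle(cum_grad + guess - sigma).mean(axis=0)


def solve_bilinear_game(a, b, T, m, eta, seed=0):
    """Both players run OFTPL on f(x, y) = <x - a, y - b>; returns averaged iterates."""
    rng = np.random.default_rng(seed)
    d = a.shape[0]
    zero = np.zeros(d)
    cum_x, cum_y = np.zeros(d), np.zeros(d)  # cumulative loss gradients of each player
    x_sum, y_sum = np.zeros(d), np.zeros(d)
    for _ in range(T):
        # Predictions with guess 0; they define the optimistic guesses.
        x_tilde = oftpl_point(rng, cum_x, zero, eta, m)
        y_tilde = oftpl_point(rng, cum_y, zero, eta, m)
        guess_x = y_tilde - b      # grad_x f at (x_tilde, y_tilde)
        guess_y = -(x_tilde - a)   # grad_y of the y player's loss -f
        x_t = oftpl_point(rng, cum_x, guess_x, eta, m)
        y_t = oftpl_point(rng, cum_y, guess_y, eta, m)
        cum_x += y_t - b
        cum_y += -(x_t - a)
        x_sum += x_t
        y_sum += y_t
    return x_sum / T, y_sum / T


def duality_gap(x_bar, y_bar, a, b):
    """max_y f(x_bar, y) - min_x f(x, y_bar) over the box, in closed form."""
    u, v = x_bar - a, y_bar - b
    best_response_y = np.abs(u).sum() - u @ b
    best_response_x = -np.abs(v).sum() - a @ v
    return float(best_response_y - best_response_x)
```

With $a = b = (0.5, 0.5)$ the unique equilibrium is $(a, b)$, and the duality gap of the averaged iterates can be compared against its value of $2$ at the origin. The four calls to `oftpl_point` inside each iteration are independent of one another given the cumulative gradients, apart from the guesses, and the $m$ oracle calls inside each are fully independent, which is the source of the parallelism discussed above.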
Instead of generating a single sample from the empirical distribution $P_t$ computed in the $t$-th iteration of Algorithm 2, the players now play the entire distribution $P_t$ (see Algorithm 4 in Appendix E). Letting $\{P_t\}_{t=1}^T$, $\{Q_t\}_{t=1}^T$ be the sequences of iterates generated by the $x$ and $y$ players, the following theorem shows that $\left(\frac{1}{T} \sum_{t=1}^T P_t, \frac{1}{T} \sum_{t=1}^T Q_t\right)$ converges to a NE.

Theorem 5.2.
Consider the minimax game in Equation (1). Suppose the domains $\mathcal{X}, \mathcal{Y}$ are compact subsets of $\mathbb{R}^d$ with diameter $D = \max\{\sup_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|_1, \sup_{y_1, y_2 \in \mathcal{Y}} \|y_1 - y_2\|_1\}$. Suppose $f$ is Lipschitz w.r.t. $\|\cdot\|_1$ and satisfies

$$\max\left\{\sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \|\nabla_x f(x, y)\|_\infty, \sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \|\nabla_y f(x, y)\|_\infty\right\} \le G.$$

Moreover, suppose $f$ satisfies the following smoothness property:

$$\|\nabla_x f(x, y) - \nabla_x f(x', y')\|_\infty + \|\nabla_y f(x, y) - \nabla_y f(x', y')\|_\infty \le L \|x - x'\|_1 + L \|y - y'\|_1.$$

Suppose both $x$ and $y$ players use Algorithm 4 to solve the game with linear perturbation functions $\sigma(z) = \langle \bar{\sigma}, z \rangle$, where $\bar{\sigma} \in \mathbb{R}^d$ is such that each of its entries is sampled independently from $\mathrm{Exp}(\eta)$. Suppose the guesses used by the $x$ and $y$ players in the $t$-th iteration are $f(\cdot, \tilde{Q}_{t-1})$, $f(\tilde{P}_{t-1}, \cdot)$, where $\tilde{P}_{t-1}, \tilde{Q}_{t-1}$ denote the predictions of the $x$, $y$ players in the $t$-th iteration if guess $g_t = 0$ was used. If Algorithm 4 is run with $\eta = 10 d^2 D(L + 1)$, $m = T$, then the iterates $\{(P_t, Q_t)\}_{t=1}^T$ satisfy

$$\sup_{x \in \mathcal{X}, y \in \mathcal{Y}} \mathbb{E}\left[f\left(\frac{1}{T} \sum_{t=1}^T P_t, y\right) - f\left(x, \frac{1}{T} \sum_{t=1}^T Q_t\right)\right] = O\left(\frac{d^2 D^2 (L + 1) \log d}{T}\right) + O\left(\min\left\{D^2 L, \frac{d^2 G^2 \log T}{LT}\right\}\right).$$

More general versions of the theorem, which consider other function classes and general perturbation distributions, can be found in Appendix E. The above result shows that Algorithm 4 converges to a NE at an $\tilde{O}(T^{-1})$ rate using $T^2$ calls to the perturbed best response oracle. This matches the rates of convergence of FTPL [11]. However, the key advantage of our algorithm is that it is highly parallelizable and runs in $O(T)$ iterations, in contrast to FTPL, which runs in $O(T^2)$ iterations.

We studied an optimistic variant of FTPL which achieves better regret guarantees when the sequence of loss functions is predictable. As one specific application of our algorithm, we considered the problem of solving minimax games.
For solving convex-concave games, our algorithm requires access to a linear optimization oracle, and for nonconvex-nonconcave games our algorithm requires access to a more powerful perturbed best response oracle. In both these settings, our algorithm achieves $O(T^{-1/2})$ convergence rates using $T$ calls to the oracles. Moreover, our algorithm runs in $O(T^{1/2})$ iterations, with each iteration making $O(T^{1/2})$ parallel calls to the optimization oracle. We believe our improved algorithms for solving minimax games are useful in a number of modern machine learning applications, such as training of GANs and adversarial training, which involve solving nonconvex-nonconcave minimax games and often deal with huge datasets.

References

[1] Yoav Freund and Robert E Schapire. Game theory, on-line prediction and boosting. In COLT, volume 96, pages 325–332. Citeseer, 1996.
[2] Robert S Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems, pages 4705–4714, 2017.
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[4] H Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. The Journal of Machine Learning Research, 18(1):3117–3166, 2017.
[5] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.
[6] Nicolo Cesa-Bianchi and Gabor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
[7] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
[8] Dan Garber and Elad Hazan. Playing non-linear games with linear oracles. In , pages 420–428. IEEE, 2013.
[9] Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. arXiv preprint arXiv:1610.07797, 2016.
[10] Elad Hazan and Edgar Minasyan. Faster projection-free online learning. CoRR, abs/2001.11568, 2020. URL https://arxiv.org/abs/2001.11568.
[11] Arun Sai Suggala and Praneeth Netrapalli. Online non-convex learning: Following the perturbed leader is optimal. In Aryeh Kontorovich and Gergely Neu, editors, Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 of Proceedings of Machine Learning Research, pages 845–861, San Diego, California, USA, 08 Feb–11 Feb 2020. PMLR. URL http://proceedings.mlr.press/v117/suggala20a.html.
[12] Reiner Horst and Panos M Pardalos. Handbook of global optimization, volume 2. Springer Science & Business Media, 2013.
[13] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. arXiv preprint arXiv:1208.3728, 2012.
[14] Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pages 3066–3074, 2013.
[15] Niao He and Zaid Harchaoui. Semi-proximal mirror-prox for nonsmooth composite minimization. In Advances in Neural Information Processing Systems, pages 3411–3419, 2015.
[16] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[17] Elad Hazan and Satyen Kale. Projection-free online learning. arXiv preprint arXiv:1206.4657, 2012.
[18] Naman Agarwal, Alon Gonen, and Elad Hazan. Learning in non-convex games with an optimization oracle. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 18–29, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/agarwal19a.html.
[19] Josef Hofbauer and William H Sandholm. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265–2294, 2002.
[20] Jacob Abernethy, Chansoo Lee, and Ambuj Tewari. Perturbation techniques in online learning and optimization. Perturbations, Optimization, and Statistics, page 233, 2016.
[21] Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In Advances in Neural Information Processing Systems, pages 3086–3094, 2014.
[22] R Tyrrell Rockafellar. Convex analysis. Number 28. Princeton University Press, 1970.
[23] Dimitri P Bertsekas. Stochastic optimization problems with nondifferentiable cost functionals. Journal of Optimization Theory and Applications, 12(2):218–231, 1973.
[24] Shai Shalev-Shwartz. Thesis submitted for the degree of "doctor of philosophy". 2007.
[25] Eduardo H Zarantonello. Dense single-valuedness of monotone operators. Israel Journal of Mathematics, 15(2):158–166, 1973.
[26] Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
[27] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
[28] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
[29] Sham Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. On the duality of strong convexity and strong smoothness: Learning applications and matrix regularization. Unpublished Manuscript, http://ttic.uchicago.edu/shai/papers/KakadeShalevTewari09.pdf, 2(1), 2009.
A Dual view of Perturbations as Regularization
A.1 Proof of Theorem 3.1
We first define a convex function $\Psi : \mathbb{R}^d \to \mathbb{R}$ as
$$\Psi(f) = \mathbb{E}_{\sigma}\left[\sup_{x \in \mathcal{X}} \langle f + \sigma, x \rangle\right],$$
where the perturbation $\sigma$ follows a probability distribution $P_{\mathrm{PRTB}}$ which is absolutely continuous w.r.t. the Lebesgue measure. For our choice of $P_{\mathrm{PRTB}}$, we now show that $\Psi$ is differentiable. Consider the function $\psi(g) = \sup_{x \in \mathcal{X}} \langle g, x \rangle$. Since $\psi(g)$ is a proper convex function, we know that it is differentiable almost everywhere, except on a set of Lebesgue measure $0$ [see Theorem 25.5 of 22]. Moreover, it is easy to verify that $\operatorname{argmax}_{x \in \mathcal{X}} \langle g, x \rangle \in \partial \psi(g)$. These two observations, together with the fact that $P_{\mathrm{PRTB}}$ is absolutely continuous, show that the sup expression inside the expectation of $\Psi$ has a unique maximizer with probability one.

Since the sup expression inside the expectation has a unique maximizer with probability $1$, we can swap the expectation and gradient to obtain [see Proposition 2.2 of 23]
$$\nabla \Psi(f) = \mathbb{E}_{\sigma}\left[\operatorname{argmax}_{x \in \mathcal{X}} \langle f + \sigma, x \rangle\right]. \tag{2}$$
Note that $\nabla \Psi$ is related to the prediction of the deterministic version of FTPL. Specifically, $\nabla \Psi(-\nabla_{1:t-1})$ is the prediction of deterministic FTPL in the $t$-th iteration, where $\nabla_{1:t-1}$ denotes the sum of the gradients from the first $t-1$ iterations. We now show that $\nabla \Psi(f) = \operatorname{argmin}_{x \in \mathcal{X}} \langle -f, x \rangle + R(x)$, for some convex function $R$.

Since all differentiable functions are closed, $\Psi(f)$ is a proper, closed and differentiable convex function over $\mathbb{R}^d$. Let $R(x)$ denote the Fenchel conjugate of $\Psi(f)$
$$R(x) = \sup_{f \in \operatorname{dom}(\Psi)} \langle x, f \rangle - \Psi(f),$$
where $\operatorname{dom}(\Psi)$ denotes the domain of $\Psi$. Following Theorem H.1 (see Appendix H), $\Psi(f)$ is the Fenchel conjugate of $R(x)$
$$\Psi(f) = \sup_{x \in \operatorname{dom}(R)} \langle f, x \rangle - R(x).$$
Furthermore, from Theorem H.2 we have
$$\nabla \Psi(f) = \operatorname{argmax}_{x \in \operatorname{dom}(R)} \langle f, x \rangle - R(x).$$
We now show that the domain of $R$ is a subset of $\mathcal{X}$. This, together with the previous two equations, would then immediately imply
$$\Psi(f) = \sup_{x \in \mathcal{X}} \langle f, x \rangle - R(x), \tag{3}$$
$$\nabla \Psi(f) = \operatorname{argmax}_{x \in \mathcal{X}} \langle f, x \rangle - R(x). \tag{4}$$
From Theorem H.4, we know that the domain of $R$ satisfies
$$\operatorname{ri}(\operatorname{dom}(R)) \subseteq \operatorname{range} \nabla \Psi \subseteq \operatorname{dom}(R),$$
where $\operatorname{ri}(A)$ denotes the relative interior of a set $A$. Moreover, from the definition of $\nabla \Psi(f)$ in Equation (2), we have $\operatorname{range} \nabla \Psi \subseteq \mathcal{X}$. Combining these two properties, we can show that one of the following statements is true
$$\operatorname{ri}(\operatorname{dom}(R)) \subseteq \operatorname{range} \nabla \Psi \subseteq \mathcal{X} \subseteq \operatorname{dom}(R),$$
$$\operatorname{ri}(\operatorname{dom}(R)) \subseteq \operatorname{range} \nabla \Psi \subseteq \operatorname{dom}(R) \subseteq \mathcal{X}.$$
Suppose the first statement is true. Since $\mathcal{X}$ is a compact set, it is easy to see that $\mathcal{X} = \operatorname{dom}(R)$. If the second statement is true, then $\operatorname{dom}(R) \subseteq \mathcal{X}$. Together, these two statements imply $\operatorname{dom}(R) \subseteq \mathcal{X}$.

Connecting back to FTPL. We now connect the above results to FTPL. From Equation (2), we know that the prediction at iteration $t$ of deterministic FTPL is equal to $\nabla \Psi(-\nabla_{1:t-1})$. From Equation (4), $\nabla \Psi(-\nabla_{1:t-1})$ is defined as
$$x_t = \nabla \Psi(-\nabla_{1:t-1}) = \operatorname{argmax}_{x \in \mathcal{X}} \langle -\nabla_{1:t-1}, x \rangle - R(x).$$
This shows that
$$x_t = \operatorname{argmin}_{x \in \mathcal{X}} \langle \nabla_{1:t-1}, x \rangle + R(x).$$
So the prediction of FTPL can also be obtained using FTRL for some convex regularizer $R(x)$. Finally, to show that $-\nabla_{1:t-1} \in \partial R(x_t)$, $x_t = \partial R^{-1}(-\nabla_{1:t-1})$, we rely on Theorem H.3. Since $x_t = \nabla \Psi(-\nabla_{1:t-1})$, from Theorem H.3, we have
$$-\nabla_{1:t-1} \in \partial R(x_t), \qquad x_t = \nabla \Psi(-\nabla_{1:t-1}) = \partial R^{-1}(-\nabla_{1:t-1}),$$
where $\partial R^{-1}$ is the inverse of $\partial R$ in the sense of multivalued mappings. Note that, even though $\partial R$ can be a multivalued mapping, its inverse $\partial R^{-1} = \nabla \Psi$ is a single-valued mapping (this follows from differentiability of $\Psi$). This finishes the proof of the Theorem.

B Online Convex Learning
B.1 Proof of Theorem 4.1
Before presenting the proof of the Theorem, we introduce some notation.
B.1.1 Notation
We define functions $\Phi : \mathbb{R}^d \to \mathbb{R}$, $R : \mathbb{R}^d \to \mathbb{R}$ as follows
$$\Phi(f) = \mathbb{E}_{\sigma}\left[\inf_{x \in \mathcal{X}} \langle f - \sigma, x \rangle\right], \qquad R(x) = \sup_{f \in \mathbb{R}^d} \langle f, x \rangle + \Phi(-f).$$
Note that $\Phi$ is related to the function $\Psi$ defined in the proof of Proposition 3.1. To be precise, $\Psi(f) = -\Phi(-f)$. Moreover, $R(x)$ is the Fenchel conjugate of $\Psi$. For our choice of perturbation distribution, $\Psi$ is differentiable (see proof of Proposition 3.1). This implies $\Phi$ is also differentiable with gradient $\nabla \Phi$ defined as
$$\nabla \Phi(f) = \mathbb{E}_{\sigma}\left[\operatorname{argmin}_{x \in \mathcal{X}} \langle f - \sigma, x \rangle\right].$$
Note that $\nabla \Phi$ is the prediction of the deterministic version of FTPL. In Proposition 3.1 we showed that
$$\nabla \Phi(f) = \operatorname{argmin}_{x \in \mathcal{X}} \langle f, x \rangle + R(x).$$

B.1.2 Main Argument
Since x t is the prediction of deterministic version of FTPL, following FTPL-FTRL duality provedin Proposition 3.1, x t can equivalently be written as x t “ ∇ Φ p ∇ t ´ ` g t q “ argmin x P X x ∇ t ´ ` g t , x y ` R p x q . Similarly, ˜ x t can be written as ˜ x t “ ∇ Φ p ∇ t q “ argmin x P X x ∇ t , x y ` R p x q . We use the notation ∇ “ . So ˜ x , x are equal to argmin x P X R p x q . From the first orderoptimality conditions, we have ´ ∇ t ´ ´ g t P B R p x t q , ´ ∇ t P B R p ˜ x t q . Define functions B p¨ , x t q , B p¨ , ˜ x t q for any t P r T s as B p x , x t q “ R p x q ´ R p x t q ` x ∇ t ´ ` g t , x ´ x t y ,B p x , ˜ x t q “ R p x q ´ R p ˜ x t q ` x ∇ t , x ´ ˜ x t y . } ∇ Φ p g q´ ∇ Φ p g q } ď Cη ´ } g ´ g } ˚ . Following our connection between Ψ , Φ , this implies } ∇ Ψ p g q´ ∇ Ψ p g q} ď Cη ´ } g ´ g } ˚ . Thisimplies the following smoothness condition on Ψ [see Lemma 15 of 24] Ψ p g q ď Ψ p g q ` x ∇ Ψ p g q , g ´ g y ` Cη ´ } g ´ g } ˚ . Since Ψ is Cη ´ -smooth w.r.t } ¨ } ˚ , following duality between strong convexity and strong smooth-ness properties (see Theorem H.5), we can infer that R is C ´ η - strongly convex w.r.t } ¨ } norm andsatisfies B p x , x t q ě η C } x ´ x t } , B p x , ˜ x t q ě η C } x ´ ˜ x t } . We now go ahead and bound the regret of the learner. For any x P X , we have f t p x t q ´ f t p x q p a q ď x x t ´ x , ∇ t y “ x x t ´ x t , ∇ t y ` x x t ´ x , ∇ t y“ x x t ´ x t , ∇ t y ` x x t ´ ˜ x t , ∇ t ´ g t y ` x x t ´ ˜ x t , g t y` x ˜ x t ´ x , ∇ t yď x x t ´ x t , ∇ t y ` } x t ´ ˜ x t }} ∇ t ´ g t } ˚ ` x x t ´ ˜ x t , g t y` x ˜ x t ´ x , ∇ t y , where p a q follows from convexity of f . Next, a simple calculation shows that x x t ´ ˜ x t , g t y “ B p ˜ x t , ˜ x t ´ q ´ B p ˜ x t , x t q ´ B p x t , ˜ x t ´ qx ˜ x t ´ x , ∇ t y “ B p x , ˜ x t ´ q ´ B p x , ˜ x t q ´ B p ˜ x t , ˜ x t ´ q . 
Substituting this in the previous inequality gives us f t p x t q ´ f t p x q ď x x t ´ x t , ∇ t y ` } x t ´ ˜ x t }} ∇ t ´ g t } ˚ ` B p ˜ x t , ˜ x t ´ q ´ B p ˜ x t , x t q ´ B p x t , ˜ x t ´ q` B p x , ˜ x t ´ q ´ B p x , ˜ x t q ´ B p ˜ x t , ˜ x t ´ q“ x x t ´ x t , ∇ t y ` } x t ´ ˜ x t }} ∇ t ´ g t } ˚ ` B p x , ˜ x t ´ q ´ B p x , ˜ x t q ´ B p ˜ x t , x t q ´ B p x t , ˜ x t ´ q p a q ď x x t ´ x t , ∇ t y ` } x t ´ ˜ x t }} ∇ t ´ g t } ˚ ` B p x , ˜ x t ´ q ´ B p x , ˜ x t q ´ η } ˜ x t ´ x t } C ´ η } x t ´ ˜ x t ´ } C , where p a q follows from strongly convexity of R . Summing over t “ , . . . T , gives us T ÿ t “ f t p x t q ´ f t p x q ď T ÿ t “ x x t ´ x t , ∇ t y ` B p x , ˜ x q ´ B p x , ˜ x T q loooooooooooomoooooooooooon S ` T ÿ t “ } x t ´ ˜ x t }} ∇ t ´ g t } ˚ ´ η C T ÿ t “ ` } ˜ x t ´ x t } ` } x t ´ ˜ x t ´ } ˘ . Bounding S . We now bound B p x , ˜ x q ´ B p x , ˜ x T q . From the definition of B , we have B p x , ˜ x q ´ B p x , ˜ x T q “ R p ˜ x T q ´ x ∇ T , x ´ ˜ x T y ´ R p ˜ x q ` x ∇ , x ´ ˜ x T y . Note that ∇ “ . This gives us B p x , ˜ x q ´ B p x , ˜ x T q “ R p ˜ x T q ´ x ∇ T , x ´ ˜ x T y ´ R p ˜ x q . We now use duality to convert the RHS of the above equation, which is currently in terms of R , intoa quantity which depends on Φ . From Proposition 3.1 we have Φ p g q “ ´ Ψ p´ g q “ inf x P X x g, x y ` R p x q . ˜ x T is the minimizer of x ∇ T , x y ` R p x q , we have Φ p ∇ T q “ x ∇ T , ˜ x T y ` R p ˜ x T q . Simi-larly, Φ p q “ R p ˜ x q . Substituting these in the previous equation gives us B p x , ˜ x q ´ B p x , ˜ x T q “ Φ p ∇ T q ´ x ∇ T , x y ´ Φ p q“ E σ „ inf x P X @ ∇ T ´ σ, x D ´ x ∇ T , x y ´ E σ „ inf x P X @ ´ σ, x D ď E σ rx ∇ T ´ σ, x ys ´ x ∇ T , x y ´ E σ „ inf x P X @ ´ σ, x D “ E σ „ inf x P X @ σ, x D ´ E σ rx σ, x ysď D E σ r} σ } ˚ s “ ηD Bounding Regret.
Substituting this in our regret bound and taking expectation on both sides givesus E « T ÿ t “ f t p x t q ´ f t p x q ff ď T ÿ t “ E rx x t ´ x t , ∇ t ys ` ηD ` T ÿ t “ E r} x t ´ ˜ x t }} ∇ t ´ g t } ˚ s´ η C T ÿ t “ ` E “ } ˜ x t ´ x t } ‰ ` E “ } x t ´ ˜ x t ´ } ‰˘ ď T ÿ t “ E rx x t ´ x t , ∇ t ys ` ηD ` T ÿ t “ C η E “ } ∇ t ´ g t } ˚ ‰ ´ η C T ÿ t “ E “ } x t ´ ˜ x t ´ } ‰ To finish the proof, we make use of the Holder’s smoothness assumption on f t to bound the firstterm in the RHS above. From Holder’s smoothness assumption, we have x x t ´ x t , ∇ t ´ ∇ f t p x t qy ď L } x t ´ x t } ` α . Using this, we get E rx x t ´ x t , ∇ t y | g t , x t ´ , f t s ď E “ x x t ´ x t , ∇ f t p x t qy ` L } x t ´ x t } ` α | g t , x t ´ , f t ‰ p a q “ L E “ } x t ´ x t } ` α | g t , x t ´ , f t ‰ p b q ď Ψ ` α L E “ } x t ´ x t } ` α | g t , x t ´ , f t ‰ p c q ď Ψ ` α L E “ } x t ´ x t } | g t , x t ´ , f t ‰ p ` α q{ p d q ď L ˆ Ψ Ψ D ? m ˙ ` α , where p a q follows from the fact that E rx x t ´ x t , ∇ f t p x t qy | g t , x t ´ , f t s “ , p b q follows fromthe definition of norm compatibility constant Ψ , p c q follows from Holders inequality and p d q usesthe fact that conditioned on t g t , x t ´ , f t u , x t ´ x t is the average of m i.i.d bounded mean random variables, the variance of which scales as O p D { m q . Substituting this in the above regretbound gives us the required result. B.2 Proof of Corollary 4.1
We first bound $\mathbb{E}_{\sigma}[\|\sigma\|_2]$. Relying on spherical symmetry of the perturbation distribution and the fact that the density of $P_{\mathrm{PRTB}}$ on the spherical shell of radius $r$ is proportional to $r^{d-1}$, we get
$$\mathbb{E}_{\sigma}[\|\sigma\|_2] = \frac{\int_{r=0}^{(1+d^{-1})\eta} r \times r^{d-1}\, dr}{\int_{r=0}^{(1+d^{-1})\eta} r^{d-1}\, dr} = \eta.$$
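The radial integral above can be checked numerically. The following is a minimal sketch, assuming (as the integration limits suggest) that the perturbation is uniform on the Euclidean ball of radius $(1 + d^{-1})\eta$; the function names and the sample size are illustrative choices, not part of the paper.

```python
import random


def expected_norm_exact(d, eta):
    """Closed form of E||sigma||_2 for sigma uniform on a ball of radius (1 + 1/d) * eta.

    The radial density is proportional to r^(d-1), so E[r] = radius * d / (d + 1).
    """
    radius = (1.0 + 1.0 / d) * eta
    return radius * d / (d + 1.0)


def expected_norm_monte_carlo(d, eta, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same quantity via inverse-CDF sampling of the radius."""
    rng = random.Random(seed)
    radius = (1.0 + 1.0 / d) * eta
    total = 0.0
    for _ in range(n_samples):
        # P(r <= s) = (s / radius)^d, hence r = radius * U^(1/d) with U uniform on [0, 1).
        total += radius * rng.random() ** (1.0 / d)
    return total / n_samples
```

Both functions should agree with $\eta$ up to floating-point and sampling error, which is the identity used in the rest of the proof.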
We now bound the stability of predictions of OFTPL. Our technique for bounding the stability uses similar arguments as Hazan and Minasyan [10] (see Lemma 4.2 of [10]). Recall, to bound stability, we need to show that $\Phi(g) = \mathbb{E}_{\sigma}\left[\inf_{x \in \mathcal{X}} \langle g - \sigma, x \rangle\right]$ is smooth. Let $\phi(g) = \inf_{x \in \mathcal{X}} \langle g, x - x_0 \rangle$, where $x_0$ is an arbitrary point in $\mathcal{X}$. We can rewrite $\Phi(g)$ as
$$\Phi(g) = \mathbb{E}_{\sigma}[\phi(g - \sigma)] + \langle g, x_0 \rangle.$$
Since the second term in the RHS above is linear in $g$, any upper bound on the smoothness of $\mathbb{E}_{\sigma}[\phi(g - \sigma)]$ is also a bound on the smoothness of $\Phi(g)$. So we focus on bounding the smoothness of $\mathbb{E}_{\sigma}[\phi(g - \sigma)]$.

First note that $\phi(g)$ is $D$-Lipschitz and satisfies the following for any $g_1, g_2 \in \mathbb{R}^d$
$$\phi(g_1) - \phi(g_2) = \inf_{x \in \mathcal{X}} \langle g_1, x - x_0 \rangle - \inf_{x \in \mathcal{X}} \langle g_2, x - x_0 \rangle \le \sup_{x \in \mathcal{X}} \langle g_1 - g_2, x - x_0 \rangle \le D \|g_1 - g_2\|_2.$$
Lemma 4.2 of Hazan and Minasyan [10] shows that $\mathbb{E}_{\sigma}[\phi(g - \sigma)]$ is smooth. Since $\Phi$ differs from it only by a linear term, $\Phi$ satisfies
$$\|\nabla \Phi(g_1) - \nabla \Phi(g_2)\|_2 \le d D \eta^{-1} \|g_1 - g_2\|_2.$$
This shows that the predictions of OFTPL are $dD\eta^{-1}$-stable. The rest of the proof involves substituting $C = dD$ in the regret bound of Theorem 4.1, setting $g_t = 0$, and using the fact that $\|\nabla_t\|_2 \le G$.

C Online Nonconvex Learning
C.1 Proof of Theorem 4.2
Before we present the proof of the Theorem, we introduce some notation and present some useful intermediate results. We note that, unlike the convex case, there are no known Fenchel duality theorems for the infinite dimensional setting. So more careful arguments are needed to obtain tight regret bounds. Our proof mimics the proof of Theorem 4.1.
C.1.1 Notation
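Let $\mathcal{P}$ be the set of all probability measures on $\mathcal{X}$. We define functions $\Phi : \mathcal{F} \to \mathbb{R}$, $R : \mathcal{P} \to \mathbb{R}$ as follows
$$\Phi(f) = \mathbb{E}_{\sigma}\left[\inf_{P \in \mathcal{P}} \mathbb{E}_{x \sim P}[f(x) - \sigma(x)]\right], \qquad R(P) = \sup_{f \in \mathcal{F}} -\mathbb{E}_{x \sim P}[f(x)] + \Phi(f).$$
Also, note that the function $\nabla \Phi : \mathcal{F} \to \mathcal{P}$ defined in Section 4.2 can be written as
$$\nabla \Phi(f) = \mathbb{E}_{\sigma}\left[\operatorname{argmin}_{P \in \mathcal{P}} \mathbb{E}_{x \sim P}[f(x) - \sigma(x)]\right].$$
Note that $\nabla \Phi(f)$ is well defined because, from our assumption on the perturbation distribution, the minimization problem inside the expectation has a unique minimizer with probability one. To simplify the notation, in the sequel, we use the shorthand notation $\langle P, f \rangle$ to denote $\mathbb{E}_{x \sim P}[f(x)]$, for any $P \in \mathcal{P}$ and $f \in \mathcal{F}$. Similarly, for any $P_1, P_2 \in \mathcal{P}$ and $f \in \mathcal{F}$, we use the notation $\langle P_1 - P_2, f \rangle$ to denote $\mathbb{E}_{x \sim P_1}[f(x)] - \mathbb{E}_{x \sim P_2}[f(x)]$.

C.1.2 Intermediate Results

Lemma C.1. For any $g \in \mathcal{F}$, $R(\nabla \Phi(g)) = -\langle \nabla \Phi(g), g \rangle + \Phi(g)$.

Proof. Define $P_{g,\sigma}$ as
$$P_{g,\sigma} = \operatorname{argmin}_{P \in \mathcal{P}} \mathbb{E}_{x \sim P}[g(x) - \sigma(x)].$$
Note that $\nabla \Phi(g) = \mathbb{E}_{\sigma}[P_{g,\sigma}]$. For any $g, h \in \mathcal{F}$, we have
$$\begin{aligned}
\Phi(h) = \mathbb{E}_{\sigma}\left[\inf_{P \in \mathcal{P}} \langle P, h - \sigma \rangle\right] &\le \mathbb{E}_{\sigma}[\langle P_{g,\sigma}, h - \sigma \rangle] \\
&= \mathbb{E}_{\sigma}[\langle P_{g,\sigma}, g - \sigma \rangle] + \mathbb{E}_{\sigma}[\langle P_{g,\sigma}, h - g \rangle] \\
&= \Phi(g) + \langle \nabla \Phi(g), h - g \rangle.
\end{aligned}$$
This shows that for any $g, h \in \mathcal{F}$
$$\Phi(h) - \langle \nabla \Phi(g), h \rangle \le \Phi(g) - \langle \nabla \Phi(g), g \rangle. \tag{5}$$
Taking supremum over $h$ of the LHS quantity gives us
$$R(\nabla \Phi(g)) = \sup_{h \in \mathcal{F}} \Phi(h) - \langle \nabla \Phi(g), h \rangle = \Phi(g) - \langle \nabla \Phi(g), g \rangle.$$

Lemma C.2 (Strong Smoothness). The function $-\Phi$ is convex and strongly smooth and satisfies the following inequality for any $g_1, g_2 \in \mathcal{F}$
$$-\Phi(g_2) \le -\Phi(g_1) - \langle \nabla \Phi(g_1), g_2 - g_1 \rangle + \frac{C}{2\eta} \|g_2 - g_1\|_{\mathcal{F}}^2.$$

Proof.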
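Let $g_1, g_2 \in \mathcal{F}$ and $\alpha \in [0, 1]$. Then
$$\begin{aligned}
\Phi(\alpha g_1 + (1 - \alpha) g_2) &= \mathbb{E}_{\sigma}\left[\inf_{P \in \mathcal{P}} \langle P, \alpha g_1 + (1 - \alpha) g_2 - \sigma \rangle\right] \\
&\ge \alpha \mathbb{E}_{\sigma}\left[\inf_{P \in \mathcal{P}} \langle P, g_1 - \sigma \rangle\right] + (1 - \alpha) \mathbb{E}_{\sigma}\left[\inf_{P \in \mathcal{P}} \langle P, g_2 - \sigma \rangle\right] \\
&= \alpha \Phi(g_1) + (1 - \alpha) \Phi(g_2).
\end{aligned}$$
This shows that $-\Phi$ is convex. To show smoothness, we rely on the following stability property
$$\forall g_1, g_2 \in \mathcal{F}: \quad \gamma_{\mathcal{F}}(\nabla \Phi(g_1), \nabla \Phi(g_2)) \le \frac{C}{\eta} \|g_1 - g_2\|_{\mathcal{F}}.$$
Let $T$ be an arbitrary positive integer and for $t \in \{0, 1, \dots, T\}$, define $\alpha_t = t/T$. Let $h = g_2 - g_1$. We have
$$\Phi(g_1) - \Phi(g_2) = \Phi(g_1 + \alpha_0 h) - \Phi(g_1 + \alpha_T h) = \sum_{t=0}^{T-1} \left(\Phi(g_1 + \alpha_t h) - \Phi(g_1 + \alpha_{t+1} h)\right).$$
Since $-\Phi$ is convex and satisfies Equation (5), we have
$$\begin{aligned}
\Phi(g_1) - \Phi(g_2) &= \sum_{t=0}^{T-1} \left(\Phi(g_1 + \alpha_t h) - \Phi(g_1 + \alpha_{t+1} h)\right) \\
&\le -\sum_{t=0}^{T-1} \frac{1}{T} \langle \nabla \Phi(g_1 + \alpha_{t+1} h), h \rangle \\
&= \sum_{t=0}^{T-1} \frac{1}{T} \left(\langle \nabla \Phi(g_1) - \nabla \Phi(g_1 + \alpha_{t+1} h), h \rangle - \langle \nabla \Phi(g_1), h \rangle\right) \\
&\overset{(a)}{\le} -\langle \nabla \Phi(g_1), h \rangle + \sum_{t=0}^{T-1} \frac{1}{T} \gamma_{\mathcal{F}}(\nabla \Phi(g_1), \nabla \Phi(g_1 + \alpha_{t+1} h)) \|h\|_{\mathcal{F}} \\
&\overset{(b)}{\le} -\langle \nabla \Phi(g_1), h \rangle + \sum_{t=0}^{T-1} \frac{C}{T\eta} \|\alpha_{t+1} h\|_{\mathcal{F}} \|h\|_{\mathcal{F}} \\
&= -\langle \nabla \Phi(g_1), h \rangle + \sum_{t=0}^{T-1} \frac{C \alpha_{t+1}}{T\eta} \|h\|_{\mathcal{F}}^2 \\
&= -\langle \nabla \Phi(g_1), h \rangle + \frac{C}{\eta} \frac{T+1}{2T} \|h\|_{\mathcal{F}}^2,
\end{aligned}$$
where $(a)$ follows from the definition of $\gamma_{\mathcal{F}}$ and $(b)$ follows from the stability assumption. Taking $T \to \infty$, we get
$$-\Phi(g_2) \le -\Phi(g_1) - \langle \nabla \Phi(g_1), g_2 - g_1 \rangle + \frac{C}{2\eta} \|g_2 - g_1\|_{\mathcal{F}}^2.$$

Lemma C.3 (Strong Convexity). For any $P \in \mathcal{P}$ and $g \in \mathcal{F}$, $R$ satisfies the following inequality
$$R(P) \ge R(\nabla \Phi(g)) + \langle \nabla \Phi(g) - P, g \rangle + \frac{\eta}{2C} \gamma_{\mathcal{F}}(P, \nabla \Phi(g))^2.$$

Proof.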
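From Lemma C.2 we know that the following holds for any $g, h \in \mathcal{F}$
$$\Phi(g) \ge \underbrace{\Phi(h) + \langle \nabla \Phi(h), g - h \rangle - \frac{C}{2\eta} \|g - h\|_{\mathcal{F}}^2}_{\Phi_{\mathrm{lb},h}(g)}.$$
Define $R_{\mathrm{lb},h}(P)$ as
$$R_{\mathrm{lb},h}(P) = \sup_{g \in \mathcal{F}} -\langle P, g \rangle + \Phi_{\mathrm{lb},h}(g).$$
Since $\Phi(g) \ge \Phi_{\mathrm{lb},h}(g)$ for all $g \in \mathcal{F}$, $R(P) \ge R_{\mathrm{lb},h}(P)$ for all $P$. We now derive an expression for $R_{\mathrm{lb},h}(P)$. Note that from Lemma C.1 we have $R(\nabla \Phi(h)) = -\langle \nabla \Phi(h), h \rangle + \Phi(h)$. Using this, we get
$$\begin{aligned}
R_{\mathrm{lb},h}(P) &= \sup_{g \in \mathcal{F}} -\langle P, g \rangle + \Phi_{\mathrm{lb},h}(g) \\
&\overset{(a)}{=} \sup_{g \in \mathcal{F}} \left(-\langle P, g \rangle + \Phi(h) + \langle \nabla \Phi(h), g - h \rangle - \frac{C}{2\eta} \|g - h\|_{\mathcal{F}}^2\right) \\
&\overset{(b)}{=} R(\nabla \Phi(h)) + \sup_{g \in \mathcal{F}} \left(\langle \nabla \Phi(h) - P, g \rangle - \frac{C}{2\eta} \|g - h\|_{\mathcal{F}}^2\right),
\end{aligned}$$
where $(a)$ follows from the definition of $\Phi_{\mathrm{lb},h}(g)$ and $(b)$ follows from Lemma C.1. We now do a change of variables in the supremum of the above expression. Substituting $g' = g - h$, we get
$$R_{\mathrm{lb},h}(P) = R(\nabla \Phi(h)) + \langle \nabla \Phi(h) - P, h \rangle + \sup_{g' \in \mathcal{F}} \left(\langle \nabla \Phi(h) - P, g' \rangle - \frac{C}{2\eta} \|g'\|_{\mathcal{F}}^2\right).$$
We now show that
$$\sup_{g' \in \mathcal{F}} \left(\langle \nabla \Phi(h) - P, g' \rangle - \frac{C}{2\eta} \|g'\|_{\mathcal{F}}^2\right) \ge \frac{\eta}{2C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))^2.$$
To this end, we choose a $g' \in \mathcal{F}$ such that
$$\|g'\|_{\mathcal{F}} = \frac{\eta}{C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h)), \qquad \langle \nabla \Phi(h) - P, g' \rangle = \frac{\eta}{C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))^2. \tag{6}$$
If such a $g'$ can be found, we have
$$\sup_{g' \in \mathcal{F}} \left(\langle \nabla \Phi(h) - P, g' \rangle - \frac{C}{2\eta} \|g'\|_{\mathcal{F}}^2\right) \ge \langle \nabla \Phi(h) - P, g' \rangle - \frac{C}{2\eta} \|g'\|_{\mathcal{F}}^2 = \frac{\eta}{2C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))^2.$$
This would then imply the main claim of the Lemma
$$R(P) \ge R_{\mathrm{lb},h}(P) \ge R(\nabla \Phi(h)) + \langle \nabla \Phi(h) - P, h \rangle + \frac{\eta}{2C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))^2.$$

Finding $g'$. We now construct a $g'$ which satisfies Equation (6). From the definition of $\gamma_{\mathcal{F}}$ we know that
$$\gamma_{\mathcal{F}}(P, \nabla \Phi(h)) = \sup_{\|g\|_{\mathcal{F}} \le 1} \left|\langle \nabla \Phi(h) - P, g \rangle\right|.$$
Suppose the supremum is achieved at $g^*$. Define $g'$ as $\frac{\eta s}{C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))\, g^*$, where $s = \operatorname{sign}(\langle \nabla \Phi(h) - P, g^* \rangle)$. It can be easily verified that $g'$ satisfies Equation (6). If the supremum is never achieved, the same argument as above can still be made using a sequence of functions $\{g_n\}_{n=1}^{\infty}$ such that
$$\|g_n\|_{\mathcal{F}} \le 1, \qquad \lim_{n \to \infty} \left|\langle \nabla \Phi(h) - P, g_n \rangle\right| = \gamma_{\mathcal{F}}(P, \nabla \Phi(h)).$$
Define $g'_n$ as $\frac{\eta s_n}{C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))\, g_n$, where $s_n = \operatorname{sign}(\langle \nabla \Phi(h) - P, g_n \rangle)$. Since $\lim_{n \to \infty} \|g_n\|_{\mathcal{F}} = 1$, we have $\lim_{n \to \infty} \|g'_n\|_{\mathcal{F}} = \frac{\eta}{C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))$. Moreover,
$$\lim_{n \to \infty} \langle \nabla \Phi(h) - P, g'_n \rangle = \lim_{n \to \infty} \frac{\eta}{C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h)) \left|\langle \nabla \Phi(h) - P, g_n \rangle\right| = \frac{\eta}{C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))^2.$$
This shows that
$$\sup_{g' \in \mathcal{F}} \left(\langle \nabla \Phi(h) - P, g' \rangle - \frac{C}{2\eta} \|g'\|_{\mathcal{F}}^2\right) \ge \lim_{n \to \infty} \langle \nabla \Phi(h) - P, g'_n \rangle - \frac{C}{2\eta} \|g'_n\|_{\mathcal{F}}^2 = \frac{\eta}{2C} \gamma_{\mathcal{F}}(P, \nabla \Phi(h))^2.$$
This finishes the proof of the Lemma.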
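The choice of $g'$ in Equation (6) amounts to maximizing a linear-minus-quadratic expression along the direction that attains $\gamma_{\mathcal{F}}$. The following small sketch checks the resulting value $\frac{\eta}{2C}\gamma^2$ numerically; it reduces the problem to a single scalar (the norm of $g'$ along that direction), which is an illustrative simplification, and the function names are hypothetical.

```python
def objective_along_direction(scale, gamma, C, eta):
    """Value of <grad Phi(h) - P, g'> - C/(2*eta) * ||g'||^2 when g' = scale * g*,

    where ||g*|| = 1 and <grad Phi(h) - P, g*> = gamma (the sign is absorbed into g*).
    """
    return scale * gamma - C / (2.0 * eta) * scale ** 2


def value_at_equation_6(gamma, C, eta):
    """Objective value at the scale prescribed by Equation (6): ||g'|| = eta * gamma / C."""
    return objective_along_direction(eta * gamma / C, gamma, C, eta)
```

For any positive `gamma`, `C`, `eta`, `value_at_equation_6` equals `eta * gamma**2 / (2 * C)`, and no other scale along the same direction gives a larger value, since the objective is a concave quadratic in `scale`.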
C.1.3 Main Argument
We are now ready to prove Theorem 4.2. Our proof relies on Lemma C.3 and uses similar argumentsas used in the proof of Theorem 4.1. We first rewrite P t , ˜ P t as P t “ m m ÿ j “ argmin P P P E x „ P « t ´ ÿ i “ f i p x q ` g t p x q ´ σ t,j p x q ff , ˜ P t “ m m ÿ j “ argmin P P P E x „ P « t ÿ i “ f i p x q ´ σ t,j p x q ff . Note that P t “ E r P t | g t , f t ´ , P t ´ s “ ∇ Φ p f t ´ ` g t q , ˜ P t “ E ” ˜ P t | f t ´ , P t ´ ı “ ∇ Φ p f t q , with P “ ˜ P “ ∇ Φ p q . Define functions B p¨ , P t q , B p¨ , ˜ P t q as B p P, P t q “ R p P q ´ R p P t q ` x P ´ P t , f t ´ ` g t y ,B p P, ˜ P t q “ R p P q ´ R p ˜ P t q ` A P ´ ˜ P t , f t E . B p P, P t q ě η C γ F p P, P t q , B p P, ˜ P t q ě η C γ F p P, ˜ P t q . For any P P P , we have E r f t p x t q ´ f t p P qs “ E r f t p P t q ´ f t p P qs“ E rx P t ´ P, f t ys“ E rx P t ´ P t , f t ys ` E rx P t ´ P, f t ys“ E rx P t ´ P t , f t ys ` E ”A P t ´ ˜ P t , f t ´ g t Eı ` E ”A P t ´ ˜ P t , g t Eı ` E ”A ˜ P t ´ P, f t Eı p a q ď E ” γ F p P t , ˜ P t q} f t ´ g t } F ı ` E ”A P t ´ ˜ P t , g t Eı ` E ”A ˜ P t ´ P, f t Eı , where p a q follows from the fact that E rx P t ´ P t , f t y | g t , f t ´ , P t ´ s “ and as a result E rx P t ´ P t , f t ys “ . Next, a simple calculation shows that A P t ´ ˜ P t , g t E “ B p ˜ P t , ˜ P t ´ q ´ B p ˜ P t , P t q ´ B p P t , ˜ P t ´ q A ˜ P t ´ P, f t E “ B p P, ˜ P t ´ q ´ B p P, ˜ P t q ´ B p ˜ P t , ˜ P t ´ q . 
Substituting this in the previous regret bound gives us E r f t p x t q ´ f t p P qs ď E ” γ F p P t , ˜ P t q} f t ´ g t } F ı ` E ” B p ˜ P t , ˜ P t ´ q ´ B p ˜ P t , P t q ´ B p P t , ˜ P t ´ q ı ` E ” B p P, ˜ P t ´ q ´ B p P, ˜ P t q ´ B p ˜ P t , ˜ P t ´ q ı “ E ” γ F p P t , ˜ P t q} f t ´ g t } F ı ` E ” B p P, ˜ P t ´ q ´ B p P, ˜ P t q ´ B p ˜ P t , P t q ´ B p P t , ˜ P t ´ q ı p a q ď E ” γ F p P t , ˜ P t q} f t ´ g t } F ı ` E ” B p P, ˜ P t ´ q ´ B p P, ˜ P t q ı ´ E ” η C γ F p ˜ P t , P t q ` η C γ F p P t , ˜ P t ´ q ı p b q ď C η E “ } f t ´ g t } F ‰ ` E ” B p P, ˜ P t ´ q ´ B p P, ˜ P t q ı ´ E ” η C γ F p P t , ˜ P t ´ q ı where p a q follows from Lemma C.3, and p b q uses the fact that | xy | ď c | x | ` c | y | , for any x, y , c ą . Summing over t “ , . . . T gives us T ÿ t “ E r f t p x t q ´ f t p P qs ď E ” B p P, ˜ P q ´ B p P, ˜ P T q ıloooooooooooooooomoooooooooooooooon S ` T ÿ t “ C η E “ } f t ´ g t } F ‰ ´ T ÿ t “ η C E ” γ F p P t , ˜ P t ´ q ı To finish the proof of the Theorem, we need to bound S . Bounding S . From the definition of B , we have B p P, ˜ P q ´ B p P, ˜ P T q “ R p ˜ P T q ´ A P ´ ˜ P T , f T E ´ R p ˜ x q , where we used the fact that f “ . We now rely on Lemma C.1 to convert the above equation,which is currently in terms of R , into a quantity which depends on Φ . Using Lemma C.1, we get B p P, ˜ P q ´ B p P, ˜ P T q “ Φ p f T q ´ x P, f T y ´ Φ p q . Φ we have B p P, ˜ P q ´ B p P, ˜ P T q “ Φ p f T q ´ x P, f T y ´ Φ p q“ E σ „ inf P P P @ P , f T ´ σ D ´ x P, f T y ´ E σ „ inf P P P @ P , ´ σ D ď E σ rx P, f T ´ σ ys ´ x P, f T y ´ E σ „ inf P P P @ P , ´ σ D “ E σ „ sup P P P @ P , σ D ´ E σ rx P, σ ysď D E σ r} σ } F s “ ηD, where the last inequality follows from our bound on the diameter of P . Substituting this in the aboveregret bound gives us the required result. C.2 Proof of Corollary 4.2
To prove the corollary we first show that for our choice of perturbation distribution, $\operatorname{argmin}_{x \in \mathcal{X}} f(x) - \sigma(x)$ has a unique minimizer with probability one, for any $f \in \mathcal{F}$. Next, we show that the predictions of OFTPL are stable.

C.2.1 Intermediate Results

Lemma C.4 (Unique Minimizer). Suppose the perturbation function is such that $\sigma(x) = \langle \bar{\sigma}, x \rangle$, where $\bar{\sigma} \in \mathbb{R}^d$ is a random vector whose entries are sampled independently from $\mathrm{Exp}(\eta)$. Then, for any $f \in \mathcal{F}$, $\operatorname{argmin}_{x \in \mathcal{X}} f(x) - \sigma(x)$ has a unique minimizer with probability one.

Proof. Define $x_f(\bar{\sigma})$ as
$$x_f(\bar{\sigma}) \in \operatorname{argmin}_{x \in \mathcal{X}} f(x) - \langle \bar{\sigma}, x \rangle.$$
For any $\bar{\sigma}_1, \bar{\sigma}_2$ we now show that $x_f(\bar{\sigma})$ satisfies the following monotonicity property
$$\langle x_f(\bar{\sigma}_1) - x_f(\bar{\sigma}_2), \bar{\sigma}_1 - \bar{\sigma}_2 \rangle \ge 0.$$
From the optimality of $x_f(\bar{\sigma}_1)$, $x_f(\bar{\sigma}_2)$ we have
$$\begin{aligned}
f(x_f(\bar{\sigma}_1)) - \langle \bar{\sigma}_1, x_f(\bar{\sigma}_1) \rangle &\le f(x_f(\bar{\sigma}_2)) - \langle \bar{\sigma}_1, x_f(\bar{\sigma}_2) \rangle \\
&= f(x_f(\bar{\sigma}_2)) - \langle \bar{\sigma}_2, x_f(\bar{\sigma}_2) \rangle + \langle \bar{\sigma}_2 - \bar{\sigma}_1, x_f(\bar{\sigma}_2) \rangle \\
&\le f(x_f(\bar{\sigma}_1)) - \langle \bar{\sigma}_2, x_f(\bar{\sigma}_1) \rangle + \langle \bar{\sigma}_2 - \bar{\sigma}_1, x_f(\bar{\sigma}_2) \rangle.
\end{aligned}$$
This shows that $\langle \bar{\sigma}_1 - \bar{\sigma}_2, x_f(\bar{\sigma}_1) - x_f(\bar{\sigma}_2) \rangle \ge 0$. To finish the proof of the Lemma, we rely on Theorem 1 of Zarantonello [25], which shows that the set of points for which a monotone operator is not single-valued has Lebesgue measure zero. Since the distribution of $\bar{\sigma}$ is absolutely continuous w.r.t. the Lebesgue measure, this shows that $\operatorname{argmin}_{x \in \mathcal{X}} f(x) - \sigma(x)$ has a unique minimizer with probability one.

C.2.2 Main Argument
For our choice of perturbation distribution, E σ r} σ } F s “ E ¯ σ r} ¯ σ } s “ η log d . We now bound thestability of predictions of OFTPL. First note that for our choice of primal space p F , } ¨ } F q , γ F isthe Wasserstein-1 metric, which is defined as γ F p P , P q “ sup f P F , } f } F ď ˇˇˇ E x „ P r f p x qs ´ E x „ P r f p x qs ˇˇˇ “ inf Q P Γ p P ,P q E p x , x q„ Q r} x ´ x } s , where Γ p P , P q is the set of all probability measures on X ˆ X with marginals P , P on the firstand second factors respectively. Define x f p ¯ σ q as x f p ¯ σ q P argmin x P X f p x q ´ x ¯ σ, x y . lgorithm 3 OFTPL for convex-concave games Input:
Perturbation Distributions P PRTB , P PRTB of x , y players, number of samples m, iterations T for t “ . . . T do if t “ then
4: Sample t σ ,j u mj “ , t σ ,j u mj “ from P PRTB , P PRTB x “ m ř mj “ “ argmin x P X @ ´ σ ,j , x D‰ , y “ m ”ř mj “ argmax y P Y @ σ ,j , y Dı continue end if //Compute guesses for j “ . . . m do
10: Sample σ t,j „ P PRTB , σ t,j „ P PRTB ˜ x t ´ ,j “ argmin x P X @ř t ´ i “ ∇ x f p x i , y i q ´ σ t,j , x D ˜ y t ´ ,j “ argmax y P Y @ř t ´ i “ ∇ y f p x i , y i q ` σ t,j , y D end for ˜ x t ´ “ m ř mj “ ˜ x t ´ ,j , ˜ y t ´ “ m ř mj “ ˜ y t ´ ,j //Use the guesses to compute the next action for j “ . . . m do
17: Sample σ t,j „ P PRTB , σ t,j „ P PRTB x t,j “ argmin x P X @ř t ´ i “ ∇ x f p x i , y i q ` ∇ x f p ˜ x t ´ , ˜ y t ´ q ´ σ t,j , x D y t,j “ argmax y P Y @ř t ´ i “ ∇ y f p x i , y i q ` ∇ y f p ˜ x t ´ , ˜ y t ´ q ` σ t,j , y D end for x t “ m ř mj “ x t,j , y t “ m ř mj “ y t,j end for return tp x t , y t qu Tt “ Note that ∇ Φ p f q is the distribution of random variable x f p ¯ σ q . Suggala and Netrapalli [11] showthat for any f, g P F E ¯ σ r} x f p ¯ σ q ´ x g p ¯ σ q} s ď d Dη } f ´ g } F . Since γ F p ∇ Φ p f q , ∇ Φ p g qq ď E ¯ σ r} x f p ¯ σ q ´ x g p ¯ σ q} s , this shows that OFTPL is O ` d Dη ´ ˘ stable w.r.t } ¨ } F . Substituting the stability bound in the regret bound of Theorem 4.2 shows that sup P P P E « T ÿ t “ f t p x t q ´ f t p P q ff “ ηD log d ` O ˜ T ÿ t “ d Dη E “ } f t ´ g t } F ‰ ´ T ÿ t “ ηd D E ” γ F p P t , ˜ P t ´ q ı¸ . D Convex-Concave Games
Our algorithm for convex-concave games is presented in Algorithm 3. Before presenting the proof of Theorem 5.1, we first present a more general result in Section D.1. Theorem 5.1 immediately follows from our general result by instantiating it for the uniform noise distribution.
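To make the structure of Algorithm 3 concrete, here is a minimal Python sketch of OFTPL on a toy problem. Everything specific to the example is an assumption made for illustration and is not taken from the paper: the bilinear objective $f(x, y) = x^\top A y$ with a square matrix $A$, the box domains $[-1, 1]^d$ (for which the linear optimization oracle is a coordinate-wise sign), the helper names, and the parameter values. Perturbations are drawn uniformly from a Euclidean ball of radius $(1 + d^{-1})\eta$, matching the radial integral in the proof of Corollary 4.1.

```python
import math
import random


def sample_ball(d, radius, rng):
    """Draw a point uniformly from the Euclidean ball of the given radius."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g)) or 1.0
    r = radius * rng.random() ** (1.0 / d)
    return [r * v / norm for v in g]


def lin_min_box(c):
    """Linear optimization oracle: a minimizer of <c, x> over the box [-1, 1]^d."""
    return [-1.0 if v > 0 else 1.0 for v in c]


def mat_vec(A, v):      # A v, the gradient of x^T A y with respect to x at y = v
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def mat_t_vec(A, v):    # A^T v, the gradient of x^T A y with respect to y at x = v
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def perturbed_response(lin, sign, d, eta, m, rng):
    """Average of m perturbed oracle calls (the m calls are independent, hence parallelizable).

    sign=+1 (x player): argmin_x <lin - sigma, x>.
    sign=-1 (y player): argmax_y <lin + sigma, y>, i.e. argmin_y <-(lin + sigma), y>.
    """
    radius = (1.0 + 1.0 / d) * eta
    total = [0.0] * d
    for _ in range(m):
        sigma = sample_ball(d, radius, rng)
        if sign > 0:
            c = [g - s for g, s in zip(lin, sigma)]
        else:
            c = [-(g + s) for g, s in zip(lin, sigma)]
        total = add(total, lin_min_box(c))
    return [v / m for v in total]


def oftpl_bilinear(A, T, m, eta, seed=0):
    """Run T rounds of OFTPL for both players on f(x, y) = x^T A y."""
    rng = random.Random(seed)
    d = len(A)
    gx, gy = [0.0] * d, [0.0] * d          # running sums of grad_x f and grad_y f
    iterates = []
    for t in range(1, T + 1):
        if t == 1:
            hint_x, hint_y = [0.0] * d, [0.0] * d      # no guess in the first round
        else:
            # Guesses: gradients at the non-optimistic FTPL predictions.
            x_tilde = perturbed_response(gx, +1, d, eta, m, rng)
            y_tilde = perturbed_response(gy, -1, d, eta, m, rng)
            hint_x, hint_y = mat_vec(A, y_tilde), mat_t_vec(A, x_tilde)
        x = perturbed_response(add(gx, hint_x), +1, d, eta, m, rng)
        y = perturbed_response(add(gy, hint_y), -1, d, eta, m, rng)
        iterates.append((x, y))
        gx, gy = add(gx, mat_vec(A, y)), add(gy, mat_t_vec(A, x))
    return iterates


def duality_gap(A, iterates):
    """max_y f(x_bar, y) - min_x f(x, y_bar) over the box, for the averaged iterates."""
    T, d = len(iterates), len(A)
    x_bar = [sum(x[k] for x, _ in iterates) / T for k in range(d)]
    y_bar = [sum(y[k] for _, y in iterates) / T for k in range(d)]
    return sum(abs(v) for v in mat_t_vec(A, x_bar)) + sum(abs(v) for v in mat_vec(A, y_bar))
```

A call such as `duality_gap(A, oftpl_bilinear(A, T=20, m=5, eta=2.0))` with `A = [[1.0, 0.5], [-0.5, 1.0]]` returns the duality gap of the averaged iterates, which is the quantity bounded in the theorems of this section. The sketch makes no claim about matching the constants or step sizes of the paper.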
D.1 General Result

Theorem D.1.
Consider the minimax game in Equation (1). Suppose f is convex in x , concave in y and is Holder smooth w.r.t some norm } ¨ }} ∇ x f p x , y q ´ ∇ x f p x , y q} ˚ ď L } x ´ x } α ` L } y ´ y } α , } ∇ y f p x , y q ´ ∇ y f p x , y q} ˚ ď L } x ´ x } α ` L } y ´ y } α . Define diameter of sets X , Y as D “ max t sup x , x P X } x ´ x } , sup y , y P Y } y ´ y }u . Let L “t L , L u . Suppose both x and y players use Algorithm 1 to solve the minimax game. Suppose he perturbation distributions P PRTB , P PRTB , used by x , y players are absolutely continuous andsatisfy E σ „ P PRTB r} σ } ˚ s “ E σ „ P PRTB r} σ } ˚ s “ η . Suppose the predictions of both the playersare Cη ´ -stable w.r.t } ¨ } ˚ . Suppose the guesses used by x , y players in the t th iteration are ∇ x f p ˜ x t ´ , ˜ y t ´ q , ∇ y f p ˜ x t ´ , ˜ y t ´ q , where ˜ x t ´ , ˜ y t ´ denote the predictions of x , y players in the t th iteration, if guess g t “ was used in that iteration. Then the iterates tp x t , y t qu Tt “ generatedby the OFTPL based algorithm satisfy sup x P X , y P Y E « f ˜ T T ÿ t “ x t , y ¸ ´ f ˜ x , T T ÿ t “ y t ¸ff ď L ˆ Ψ Ψ D ? m ˙ ` α ` ηDT ` CL η ˆ Ψ Ψ D ? m ˙ α ` L ˆ CLη ˙ ` α ´ α Proof.
Since both the players are responding to each others actions using OFTPL, using Theo-rem 4.1, we get the following regret bounds for the players sup x P X E « T ÿ t “ f p x t , y t q ´ f p x , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` C η T ÿ t “ E “ } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ´ η C T ÿ t “ E “ } x t ´ ˜ x t ´ } ‰ . sup y P Y E « T ÿ t “ f p x t , y q ´ f p x t , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` C η T ÿ t “ E “ } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ´ η C T ÿ t “ E “ } y t ´ ˜ y t ´ } ‰ . First, consider the regret of the x player. Since } a ` ¨ ¨ ¨ ` a } ď p} a } ¨ ¨ ¨ ` } a } q , we have } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ď } ∇ x f p x t , y t q ´ ∇ x f p x t , y t q} ˚ ` } ∇ x f p x t , y t q ´ ∇ x f p x t , y t q} ˚ ` } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ` } ∇ x f p ˜ x t ´ , ˜ y t ´ q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ` } ∇ x f p ˜ x t ´ , ˜ y t ´ q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚p a q ď L } x t ´ x t } α ` L } ˜ x t ´ ´ ˜ x t ´ } α ` L } y t ´ y t } α ` L } ˜ y t ´ ´ ˜ y t ´ } α ` } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ . where p a q follows from the Holder’s smoothness of f . Using a similar technique as in the proof ofTheorem 4.1, relying on Holders inequality, we get E “ } x t ´ x t } α | ˜ x t ´ , ˜ y t ´ , x t ´ , y t ´ ‰ ď E “ } x t ´ x t } | ˜ x t ´ , ˜ y t ´ , x t ´ , y t ´ ‰ α ď Ψ α E “ } x t ´ x t } | ˜ x t ´ , ˜ y t ´ , x t ´ , y t ´ ‰ α p a q ď ˆ Ψ Ψ D ? m ˙ α , p a q follows from the fact that conditioned on past randomness, x t ´ x t is the average of m i.i.d bounded mean random variables, the variance of which scales as O p D { m q . A similar boundholds for the expectation of other quantities appearing in the RHS of the above equation. Using this,the regret of x player can be upper bounded as sup x P X E « T ÿ t “ f p x t , y t q ´ f p x , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` CL Tη ˆ Ψ Ψ D ? 
m ˙ α ` C η T ÿ t “ E “ } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ´ η C T ÿ t “ E “ } x t ´ ˜ x t ´ } ‰ . Similarly, the regret of y player can be bounded as sup y P Y E « T ÿ t “ f p x t , y q ´ f p x t , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` CL Tη ˆ Ψ Ψ D ? m ˙ α ` C η T ÿ t “ E “ } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ´ η C T ÿ t “ E “ } y t ´ ˜ y t ´ } ‰ . Summing the above two inequalities, we get sup x P X y P Y E « T ÿ t “ f p x t , y q ´ f p x , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` CL Tη ˆ Ψ Ψ D ? m ˙ α ` C η T ÿ t “ E “ } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ` C η T ÿ t “ E “ } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ´ η C T ÿ t “ ` E “ } y t ´ ˜ y t ´ } ‰ ` E “ } x t ´ ˜ x t ´ } ‰˘ . From Holder’s smoothness assumption on f , we have E “ } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ď E “ } ∇ x f p x t , y t q ´ ∇ x f p x t , ˜ y t ´ q} ˚ ‰ ` E “ } ∇ x f p x t , ˜ y t ´ q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ p a q ď L E “ } x t ´ ˜ x t ´ } α ‰ ` L E “ } y t ´ ˜ y t ´ } α ‰ , Using a similar argument, we get E “ } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ˚ ‰ ď L E “ } x t ´ ˜ x t ´ } α ‰ ` L E “ } y t ´ ˜ y t ´ } α ‰ . Plugging this in the previous bound, we get sup x P X y P Y E « T ÿ t “ f p x t , y q ´ f p x , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` CL Tη ˆ Ψ Ψ D ? m ˙ α ` CL η T ÿ t “ ` E “ } x t ´ ˜ x t ´ } α ‰ ` E “ } y t ´ ˜ y t ´ } α ‰˘ ´ η C T ÿ t “ ` E “ } y t ´ ˜ y t ´ } ‰ ` E “ } x t ´ ˜ x t ´ } ‰˘ . ase α “ . We first consider the case of α “ . In this case, choosing η ą ? CL , we get sup x P X y P Y E « T ÿ t “ f p x t , y q ´ f p x , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` CL Tη ˆ Ψ Ψ D ? m ˙ α . General α . The more general case relies on AM-GM inequality. 
Consider the following CL η } x t ´ ˜ x t ´ } α “ ´ p αC q α ´ α η ´ ` α ´ α p CL q ´ α ¯ ´ α ˆ } x t ´ ˜ x t ´ } αCη ´ ˙ α p a q ď p ´ α q ´ p αC q α ´ α η ´ ` α ´ α p CL q ´ α ¯ ` η C } x t ´ ˜ x t ´ } “ ? L ˆ ? CLη ˙ ` α ´ α ` η C } x t ´ ˜ x t ´ } where p a q follows from AM-GM inequality. Plugging this in the previous bound, we get sup x P X y P Y E « T ÿ t “ f p x t , y q ´ f p x , y t q ff ď L T ˆ Ψ Ψ D ? m ˙ ` α ` ηD ` CL Tη ˆ Ψ Ψ D ? m ˙ α ` ? LT ˆ ? CLη ˙ ` α ´ α . The claim of the theorem then follows from the observation that E « f ˜ T T ÿ t “ x t , y ¸ ´ f ˜ x , T T ÿ t “ y t ¸ff ď T E « T ÿ t “ f p x t , y q ´ f p x , y t q ff . D.2 Proof of Theorem 5.1
To prove the Theorem, we instantiate Theorem D.1 for the uniform noise distribution. As shown inCorollary 4.1, the predictions of OFTPL are dDη ´ -stable in this case. Plugging this in the boundof Theorem D.1 and using the fact that Ψ “ Ψ “ and α “ gives us sup x P X , y P Y E « f ˜ T T ÿ t “ x t , y ¸ ´ f ˜ x , T T ÿ t “ y t ¸ff ď L ˆ D ? m ˙ ` ηDT ` dDL η ˆ D ? m ˙ ` L ˆ dDLη ˙ . Plugging in η “ dD p L ` q , m “ T in the above bound gives us sup x P X , y P Y E « f ˜ T T ÿ t “ x t , y ¸ ´ f ˜ x , T T ÿ t “ y t ¸ff ď O ˆ dD p L ` q T ˙ . E Nonconvex-Nonconcave Games
Our algorithm for nonconvex-nonconcave games is presented in Algorithm 4. Note that in each iteration of this game, both the players play empirical distributions $(P_t, Q_t)$. Before presenting the proof of Theorem 5.2, we first present a more general result in Section E.2. Theorem 5.2 immediately follows from our general result by instantiating it for exponential noise distribution.

E.1 Primal Dual Spaces

In this section, we present some integral probability metrics induced by popular choices of function spaces $(\mathcal{F}, \|\cdot\|_{\mathcal{F}})$.

Algorithm 4 OFTPL for nonconvex-nonconcave games
Input: Perturbation distributions $P^1_{\mathrm{PRTB}}, P^2_{\mathrm{PRTB}}$ of $x, y$ players, number of samples $m$, iterations $T$
for $t = 1 \dots T$ do
    if $t = 1$ then
        for $j = 1 \dots m$ do
            Sample $\sigma^1_{t,j} \sim P^1_{\mathrm{PRTB}}$, $\sigma^2_{t,j} \sim P^2_{\mathrm{PRTB}}$
            $x_{1,j} = \mathrm{argmin}_{x \in \mathcal{X}} \; -\sigma^1_{1,j}(x)$
            $y_{1,j} = \mathrm{argmax}_{y \in \mathcal{Y}} \; \sigma^2_{1,j}(y)$
        end for
        Let $P_1, Q_1$ be the empirical distributions over $\{x_{1,j}\}_{j=1}^m$, $\{y_{1,j}\}_{j=1}^m$
        continue
    end if
    // Compute guesses
    for $j = 1 \dots m$ do
        Sample $\sigma^1_{t,j} \sim P^1_{\mathrm{PRTB}}$, $\sigma^2_{t,j} \sim P^2_{\mathrm{PRTB}}$
        $\tilde{x}_{t-1,j} = \mathrm{argmin}_{x \in \mathcal{X}} \; \sum_{i=1}^{t-1} f(x, Q_i) - \sigma^1_{t,j}(x)$
        $\tilde{y}_{t-1,j} = \mathrm{argmax}_{y \in \mathcal{Y}} \; \sum_{i=1}^{t-1} f(P_i, y) + \sigma^2_{t,j}(y)$
    end for
    Let $\tilde{P}_{t-1}, \tilde{Q}_{t-1}$ be the empirical distributions over $\{\tilde{x}_{t-1,j}\}_{j=1}^m$, $\{\tilde{y}_{t-1,j}\}_{j=1}^m$
    // Use the guesses to compute the next action
    for $j = 1 \dots m$ do
        Sample $\sigma^1_{t,j} \sim P^1_{\mathrm{PRTB}}$, $\sigma^2_{t,j} \sim P^2_{\mathrm{PRTB}}$
        $x_{t,j} = \mathrm{argmin}_{x \in \mathcal{X}} \; \sum_{i=1}^{t-1} f(x, Q_i) + f(x, \tilde{Q}_{t-1}) - \sigma^1_{t,j}(x)$
        $y_{t,j} = \mathrm{argmax}_{y \in \mathcal{Y}} \; \sum_{i=1}^{t-1} f(P_i, y) + f(\tilde{P}_{t-1}, y) + \sigma^2_{t,j}(y)$
    end for
    Let $P_t, Q_t$ be the empirical distributions over $\{x_{t,j}\}_{j=1}^m$, $\{y_{t,j}\}_{j=1}^m$
end for
return $\{(P_t, Q_t)\}_{t=1}^T$

| $\gamma_{\mathcal{F}}(P, Q)$ | $\lVert f \rVert_{\mathcal{F}}$ | $\mathcal{F}$ |
|---|---|---|
| Dudley Metric | $\mathrm{Lip}(f) + \lVert f \rVert_\infty$ | $\{f : \mathrm{Lip}(f) + \lVert f \rVert_\infty < \infty\}$ |
| Kantorovich Metric (or) Wasserstein-1 Metric | $\mathrm{Lip}(f)$ | $\{f : \mathrm{Lip}(f) < \infty\}$ |
| Total Variation (TV) Distance | $\lVert f \rVert_\infty$ | $\{f : \lVert f \rVert_\infty < \infty\}$ |
| Maximum Mean Discrepancy (MMD) for RKHS $\mathcal{H}$ | $\lVert f \rVert_{\mathcal{H}}$ | $\{f : \lVert f \rVert_{\mathcal{H}} < \infty\}$ |

Table 1: Table showing some popular Integral Probability Metrics. Here $\mathrm{Lip}(f)$ is the Lipschitz constant of $f$, which is defined as $\sup_{x, y \in \mathcal{X}} |f(x) - f(y)| / \|x - y\|$, and $\|f\|_\infty$ is the supremum norm of $f$.

E.2 General Result

Theorem E.1.
Consider the minimax game in Equation (1). Suppose the domains X , Y are compactsubsets of R d . Let F , F be the set of Lipschitz functions over X , Y , and } g } F , } g } F be theLipschitz constants of functions g : X Ñ R , g : Y Ñ R w.r.t some norm } ¨ } . Suppose f issuch that max t sup x P X } f p¨ , y q} F , sup y P Y } f p x , ¨q} F u ď G and satisfies the following smoothnessproperty } ∇ x f p x , y q ´ ∇ x f p x , y q} ˚ ď L } x ´ x } ` L } y ´ y } , } ∇ y f p x , y q ´ ∇ y f p x , y q} ˚ ď L } x ´ x } ` L } y ´ y } . Let P , Q be the set of probability distributions over X , Y . Define diameter of P , Q as D “ max t sup P ,P P P γ F p P , P q , sup Q ,Q P Q γ F p Q , Q qu . Suppose both x , y players use Algo-rithm 2 to solve the game. Suppose the perturbation distributions P PRTB , P PRTB , used by x , y players are such that argmin x P X f p x q ´ σ p x q , argmax y P Y f p y q ` σ p y q have unique optimiz-ers with probability one, for any f in F , F respectively. Moreover, suppose E σ „ P PRTB r} σ } F s “ E σ „ P PRTB r} σ } F s “ η and predictions of both the players are Cη ´ -stable w.r.t norms } ¨ } F , } ¨ } F . uppose the guesses used by x , y players in the t th iteration are f p¨ , ˜ Q t ´ q , f p ˜ P t ´ , ¨q , where ˜ P t ´ , ˜ Q t ´ denote the predictions of x , y players in the t th iteration, if guess g t “ was used.Then the iterates tp P t , Q t qu Tt “ generated by the Algorithm 3 satisfy the following, for η ą ? CL sup x P X , y P Y E « f ˜ T T ÿ t “ P t , y ¸ ´ f ˜ x , T T ÿ t “ Q t ¸ff “ O ˆ ηDT ` CD L ηm ˙ ` O ˆ min " dC Ψ Ψ G log p m q ηm , CD L η *˙ . Proof.
The proof of this Theorem uses similar arguments as Theorem D.1. Since both the playersare responding to each others actions using OFTPL, using Theorem 4.2, we get the following regretbounds for the players sup x P X E « T ÿ t “ f p P t , Q t q ´ f p x , Q t q ff ď ηD ` T ÿ t “ C η E ” } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ı ´ η C T ÿ t “ E ” γ F p P t , ˜ P t ´ q ı , sup y P Y E « T ÿ t “ f p P t , y q ´ f p P t , Q t q ff ď ηD ` T ÿ t “ C η E ” } f p P t , ¨q ´ f p ˜ P t ´ , ¨q} F ı ´ η C T ÿ t “ E ” γ F p Q t , ˜ Q t ´ q ı , where P t , ˜ P t ´ , Q t , ˜ Q t ´ are as defined in Theorem 4.2. First, consider the regret of the x player.We upper bound } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F as } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ď } f p¨ , Q t q ´ f p¨ , Q t q} F ` } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ` } f p¨ , ˜ Q t ´ q ´ f p¨ , ˜ Q t ´ q} F . We now show that E ” } f p¨ , Q t q ´ f p¨ , Q t q} F | ˜ P t ´ , ˜ Q t ´ , P t ´ , Q t ´ ı is O p { m q . To simplifythe notation, we let ζ t “ t ˜ P t ´ , ˜ Q t ´ , P t ´ , Q t ´ u . Let N ǫ be the ǫ -net of X w.r.t } ¨ } . Then } f p¨ , Q t q ´ f p¨ , Q t q} F p a q “ sup x P X } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ˚p b q ď sup x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ˚ ` Lǫ, where p a q follows from the definition of Lipschitz constant and p b q follows from our smoothnessassumption on f . Using this, we get E “ } f p¨ , Q t q ´ f p¨ , Q t q} F | ζ t ‰ ď E „ sup x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ˚ ˇˇˇ ζ t ` L ǫ , Since f is Lipschitz, } ∇ x f p x , y q} ˚ is bounded by G . So } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ˚ isbounded by G and } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} is bounded by G . Moreover, conditionedon past randomness ( ζ t ), ∇ x f p x , Q t q ´ ∇ x f p x , Q t q is a sub-Gaussian random vector and satisfiesthe following bound E rx u , ∇ x f p x , Q t q ´ ∇ x f p x , Q t qy | ζ t s ď exp ` G } u } { m ˘ . 
From tail bounds of sub-Gaussian random vectors [26], we have P ˆ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ą G m p d ` ? ds ` s q ˇˇˇ ζ t ˙ ď e ´ s , s ą . Using union bound, and the fact that log | N ǫ | is upper bounded by d log p ` D { ǫ q ,we get P ˆ sup x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ą G m p d ` ? ds ` s q ˇˇˇ ζ t ˙ ď e ´ s ` d log p ` D { ǫ q . Let Z “ sup x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} . The expectation of Z can be bounded as follows E r Z | ζ t s “ P p Z ď a | ζ t q E r Z | ζ t , Z ď a s ` P p Z ą a | ζ t q E r Z | ζ t , Z ą a sď a ` G P p Z ą a | ζ t q . Choosing ǫ “ Dm ´ { , s “ d log p ` m { q , and a “ d Ψ G log p ` m { q m , we get E r Z | ζ t s ď d Ψ G log p ` m { q m . This shows that E “ } f p¨ , Q t q ´ f p¨ , Q t q} F | ζ t ‰ ď d Ψ Ψ G log p ` m { q m ` D L m . Note that an-other trivial upper bound for } f p¨ , Q t q ´ f p¨ , Q t q} F is DL , which can obtained as follows } f p¨ , Q t q ´ f p¨ , Q t q} F “ sup x P X } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ˚ “ } E y „ Q t , y „ Q t r ∇ x f p x , y q ´ ∇ x f p x , y qs } ˚p a q ď LD, where p a q follows from the smoothness assumption on f and the fact that the diameter of X is D .When L is close to , this bound can be much better than the above bound. So we have E “ } f p¨ , Q t q ´ f p¨ , Q t q} F | ζ t ‰ ď min ˆ d Ψ Ψ G log p ` m { q m ` D L m , L D ˙ . Using this, the regret of the x player can be bounded as follows sup x P X E « T ÿ t “ f p P t , Q t q ´ f p x , Q t q ff ď ηD ` CD L Tηm ` min ˆ dC Ψ Ψ G T log p ` m { q ηm , CD L Tη ˙ ` T ÿ t “ C η E ” } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ı ´ η C T ÿ t “ E ” γ F p P t , ˜ P t ´ q ı . 
A similar analysis shows that the regret of y player can be bounded as sup y P Y E « T ÿ t “ f p P t , y q ´ f p P t , Q t q ff ď ηD ` CD L Tηm ` min ˆ dC Ψ Ψ G T log p ` m { q ηm , CD L Tη ˙ ` T ÿ t “ C η E ” } f p P t , ¨q ´ f p ˜ P t ´ , ¨q} F ı ´ η C T ÿ t “ E ” γ F p Q t , ˜ Q t ´ q ı , sup x P X , y P Y E « T ÿ t “ f p P t , y q ´ f p P, Q t q ff ď ηD ` CD L Tηm ` min ˆ dC Ψ Ψ G T log p ` m { q ηm , CD L Tη ˙ ` T ÿ t “ C η E ” } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ı ` T ÿ t “ C η E ” } f p P t , ¨q ´ f p ˜ P t ´ , ¨q} F ı ´ η C T ÿ t “ ´ E ” γ F p P t , ˜ P t ´ q ı ` E ” γ F p Q t , ˜ Q t ´ q ı¯ . From our assumption on smoothness of f , we have } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ď Lγ F p Q t , ˜ Q t ´ q , } f p P t , ¨q ´ f p ˜ P t ´ , ¨q} F ď Lγ F p P t , ˜ P t ´ q . To see this, consider the following } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F “ sup x P X } ∇ x f p x , Q t q ´ ∇ x f p x , ˜ Q t ´ q} ˚ “ sup x P X , } u }ď A u , ∇ x f p x , Q t q ´ ∇ x f p x , ˜ Q t ´ q E “ sup x P X , } u }ď E y „ Q t rx u , ∇ x f p x , y qys ´ E y „ ˜ Q t ´ rx u , ∇ x f p x , y qysď γ F p Q t , ˜ Q t ´ q sup x P X , } u }ď } x u , ∇ x f p x , ¨qy } F “ γ F p Q t , ˜ Q t ´ q sup x P X , } u }ď ˆ sup y ‰ y P Y | x u , ∇ x f p x , y qy ´ x u , ∇ x f p x , y qy |} y ´ y } ˙ ď γ F p Q t , ˜ Q t ´ q sup x P X ˆ sup y ‰ y P Y } ∇ x f p x , y q ´ ∇ x f p x , y q} ˚ } y ´ y } ˙ p a q ď Lγ F p Q t , ˜ Q t ´ q , where p a q follows from smoothness of f . Substituting this in the previous equation, and choosing η ą ? CL , we get sup x P X , y P Y E « T ÿ t “ f p P t , y q ´ f p P, Q t q ff ď ηD ` CD L Tηm ` min ˆ dC Ψ Ψ G T log p ` m { q ηm , CD L Tη ˙ This finishes the proof of the Theorem.
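To make the procedure analysed above more concrete, the following is a minimal sketch of the update structure of Algorithm 4, under simplifying assumptions that are not part of the paper: the action sets are finite one-dimensional grids (so that the perturbed best response can be found by enumeration), the perturbations are linear, $\sigma(z) = \bar{\sigma} z$ with $\bar{\sigma}$ exponentially distributed with mean $\eta$, and the names `oftpl_nonconvex` and `demo_payoff` are hypothetical. It is meant as an illustration of the three sampling phases, not as the authors' implementation.

```python
import math
import random


def oftpl_nonconvex(f, xs, ys, T, m, eta, seed=0):
    """Illustrative sketch of Algorithm 4 on finite 1-D grids xs, ys."""
    rng = random.Random(seed)
    zeros_x, zeros_y = [0.0] * len(xs), [0.0] * len(ys)
    # cum_x[i] = sum_{s<t} f(xs[i], Q_s); cum_y[k] = sum_{s<t} f(P_s, ys[k]).
    cum_x, cum_y = list(zeros_x), list(zeros_y)

    def draw():
        # Exponential noise with P(sigma > u) = exp(-u / eta), i.e. mean eta.
        return rng.expovariate(1.0 / eta)

    def respond_x(guess):
        # Perturbed best response of the x player (a minimization).
        s = draw()
        i = min(range(len(xs)), key=lambda i: cum_x[i] + guess[i] - s * xs[i])
        return xs[i]

    def respond_y(guess):
        # Perturbed best response of the y player (a maximization).
        s = draw()
        k = max(range(len(ys)), key=lambda k: cum_y[k] + guess[k] + s * ys[k])
        return ys[k]

    def payoff_x(Q):
        # The function x -> f(x, Q) for an empirical distribution Q.
        return [sum(f(x, y) for y in Q) / len(Q) for x in xs]

    def payoff_y(P):
        # The function y -> f(P, y) for an empirical distribution P.
        return [sum(f(x, y) for x in P) / len(P) for y in ys]

    iterates = []
    for t in range(1, T + 1):
        if t == 1:
            guess_x, guess_y = zeros_x, zeros_y  # no history and no guess yet
        else:
            # Predictions that would be played if the guess were zero.
            P_tilde = [respond_x(zeros_x) for _ in range(m)]
            Q_tilde = [respond_y(zeros_y) for _ in range(m)]
            guess_x, guess_y = payoff_x(Q_tilde), payoff_y(P_tilde)
        # Fresh perturbations, now including the optimistic guess.
        P_t = [respond_x(guess_x) for _ in range(m)]
        Q_t = [respond_y(guess_y) for _ in range(m)]
        iterates.append((P_t, Q_t))
        cum_x = [a + b for a, b in zip(cum_x, payoff_x(Q_t))]
        cum_y = [a + b for a, b in zip(cum_y, payoff_y(P_t))]
    return iterates


def demo_payoff(x, y):
    # A hypothetical smooth nonconvex-nonconcave payoff, used only for illustration.
    return math.sin(3.0 * x) * y + 0.1 * (x * x - y * y)


grid = [k / 4 for k in range(-4, 5)]
history = oftpl_nonconvex(demo_payoff, grid, grid, T=5, m=4, eta=1.0)
```

Each entry of `history` is a pair of sample lists representing the empirical distributions $(P_t, Q_t)$. The $m$ best responses within a phase are independent of each other, which is what makes the inner loops parallelizable.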
Remark E.1.
We note that a similar result can be obtained for other choices of function classes, such as the set of all bounded and Lipschitz functions. The only difference between proving such a result and proving Theorem E.1 is in bounding $\|f(\cdot, Q_t) - f(\cdot, Q_t^\infty)\|_{\mathcal{F}}$.

E.3 Proof of Theorem 5.2
To prove the Theorem, we instantiate Theorem E.1 for exponential noise distribution. Recall, inCorollary 4.2, we showed that E σ r} σ } F s “ η log d and OFTPL is O ` d Dη ´ ˘ stable w.r.t } ¨ } F ,28or this choice of perturbation distribution (similar results hold for p F , } ¨ } F q ). Substituting this inthe bounds of Theorem E.1 and using the fact that Ψ “ ? d, Ψ “ , we get sup x P X , y P Y E « f ˜ T T ÿ t “ P t , y ¸ ´ f ˜ x , T T ÿ t “ Q t ¸ff “ O ˆ ηD log dT ` d D L ηm ˙ ` O ˆ min " d DG log p m q ηm , d D L η *˙ . Choosing η “ d D p L ` q , m “ T , we get sup x P X , y P Y E « f ˜ T T ÿ t “ P t , y ¸ ´ f ˜ x , T T ÿ t “ Q t ¸ff “ O ˆ d D p L ` q log dT ˙ ` O ˆ min " d G log p T q LT , D L *˙ . F Choice of Perturbation Distributions
Regularization of some Perturbation Distributions.
We first study the regularization effect of various perturbation distributions. Table 2 presents the regularizer $R$ corresponding to some commonly used perturbation distributions, when the action space $\mathcal{X}$ is the $\ell_\infty$ ball of radius $1$ centered at origin.

| Perturbation Distribution $P_{\mathrm{PRTB}}$ | Regularizer |
|---|---|
| Uniform over $[0, \eta]^d$ | $\frac{\eta}{4} \lVert x - \mathbf{1} \rVert_2^2$ |
| Exponential, $P(\sigma > t) = \exp(-t/\eta)$ | $\sum_i \eta (x_i + 1) \left[ \log(x_i + 1) - (1 + \log 2) \right]$ |
| Gaussian, $P(\sigma = t) \propto e^{-t^2 / 2\eta^2}$ | $\sum_i \sup_{u \in \mathbb{R}} u \left[ x_i - 1 + 2F(-u/\eta) \right]$ |

Table 2: Regularizers corresponding to various perturbation distributions used in FTPL when the action space $\mathcal{X}$ is the $\ell_\infty$ ball of radius $1$ centered at origin. Here, $F$ is the CDF of a standard normal random variable.

Dimension independent rates.
Recall, the OFTPL algorithm described in Algorithm 3 converges at $O(d/T)$ rate to a Nash equilibrium of smooth convex-concave games (see Theorem 5.1). We now show that for certain constraint sets $\mathcal{X}, \mathcal{Y}$, by choosing the perturbation distributions appropriately, the dimension dependence in the rates can potentially be removed.

Suppose the action set is $\mathcal{X} = \{x : \|x\|_2 \le 1\}$. Suppose the perturbation distribution $P_{\mathrm{PRTB}}$ is the multivariate Gaussian distribution with mean $0$ and covariance $\eta^2 I_{d \times d}$, where $I_{d \times d}$ is the identity matrix. We now try to explicitly compute the regularizer corresponding to this perturbation distribution and action set. Define the function $\Psi$ as
$$\Psi(f) = E_\sigma\left[\max_{x \in \mathcal{X}} \langle f + \sigma, x \rangle\right] = E_\sigma\left[\|f + \sigma\|_2\right].$$
As shown in Proposition 3.1, the regularizer $R$ corresponding to any perturbation distribution is given by the Fenchel conjugate of $\Psi$:
$$R(x) = \sup_f \; \langle f, x \rangle - \Psi(f).$$
Since getting an exact expression for $R$ is a non-trivial task, we only compute an approximate expression for $R$. Consider the high dimensional setting (i.e., very large $d$). In this setting, $\|f + \sigma\|_2$, for $\sigma$ drawn from $\mathcal{N}(0, \eta^2 I_{d \times d})$, can be approximated as follows:
$$\|f + \sigma\|_2 = \sqrt{\|f\|_2^2 + \|\sigma\|_2^2 + 2\langle f, \sigma \rangle} \overset{(a)}{\approx} \sqrt{\|f\|_2^2 + \eta^2 d + 2\langle f, \sigma \rangle} \overset{(b)}{\approx} \sqrt{\|f\|_2^2 + \eta^2 d},$$
where $(a)$ follows from the fact that $\|\sigma\|_2^2$ is highly concentrated around $\eta^2 d$ [26]. To be precise,
$$P\left(\|\sigma\|_2^2 \ge \eta^2 (d + 2\sqrt{dt} + 2t)\right) \le e^{-t}.$$
A similar bound holds for the lower tail. Approximation $(b)$ follows from the fact that $\langle f, \sigma \rangle$ is a Gaussian random variable with mean $0$ and variance $\eta^2 \|f\|_2^2$, and with high probability its magnitude is upper bounded by $\tilde{O}(\eta \|f\|_2)$. Since $\eta \|f\|_2 \ll \sqrt{d}\, \eta \|f\|_2 \le \|f\|_2^2 + \eta^2 d$, approximation $(b)$ holds. This shows that $\Psi(f)$ can be approximated as
$$\Psi(f) \approx \sqrt{\|f\|_2^2 + \eta^2 d}.$$
Using this approximation, we now compute the regularizer corresponding to the perturbation distribution:
$$R(x) = \sup_f \; \langle f, x \rangle - \Psi(f) \approx \sup_f \; \langle f, x \rangle - \sqrt{\|f\|_2^2 + \eta^2 d} = -\eta \sqrt{d} \sqrt{1 - \|x\|_2^2}.$$
This shows that R is η ? d -strongly convex w.r.t } ¨ } norm. Following duality between strongconvexity and strong smoothness, Ψ p f q is p η d q ´ { strongly smooth w.r.t } ¨ } norm and satisfies } ∇ Ψ p f q ´ ∇ Ψ p f q} ď p η d q ´ { } f ´ f } . This shows that the predictions of OFTPL are p η d q ´ { stable w.r.t } ¨ } norm. We now instantiateTheorem D.1 for this perturbation distribution and for constraint sets which are unit balls centeredat origin, and use the above stability bound, together with the fact that E σ r} σ } s « η ? d . Suppose f is smooth w.r.t } ¨ } norm and satisfies } ∇ x f p x , y q ´ ∇ x f p x , y q} ` } ∇ y f p x , y q ´ ∇ y f p x , y q} ď L } x ´ x } ` L } y ´ y } . Then Theorem D.1 gives us the following rates of convergence to a NE sup x P X , y P Y E « f ˜ T T ÿ t “ x t , y ¸ ´ f ˜ x , T T ÿ t “ y t ¸ff ď L m ` η ? dT ` L η ? d ˆ m ˙ ` L ˆ Lη ? d ˙ Choosing η “ L {? d, m “ T , we get O ` LT ˘ rate of convergence. Although, these rates aredimension independent, we note that our stability bound is only approximate. More accurate analysisis needed to actually claim that Algorithm 3 achieves dimension independent rates in this setting.That being said, for general constraints sets, we believe one can get dimension independent rates bychoosing the perturbation distribution appropriately. G High Probability Bounds
In this section, we provide high probability bounds for Theorems 4.1, 5.1. Our results rely on the following concentration inequalities.
Proposition G.1 (Jin et al. [27]). Let $X_1, \dots, X_K$ be $K$ independent mean $0$ vector-valued random variables such that $\|X_i\|_2 \le B_i$. Then
$$P\left( \left\| \sum_{i=1}^K X_i \right\|_2 \ge t \right) \le 2 \exp\left( -c \frac{t^2}{\sum_{i=1}^K B_i^2} \right),$$
where $c > 0$ is a universal constant.
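In the proofs of this section, the proposition is applied to the average of $m$ independent bounded mean-zero vectors, for which $\sum_i B_i^2$ scales as $1/m$, so deviations are of order $\sqrt{\log(1/\delta)/m}$. The following seeded simulation is a small sanity check of this $1/m$ scaling of the squared norm. The construction (scaled Rademacher vectors with unit Euclidean norm) and the name `mean_sq_norm_of_average` are assumptions made for illustration only; for this construction the expected squared norm of the average is exactly $1/m$.

```python
import math
import random


def mean_sq_norm_of_average(m, d, trials, seed=0):
    """Monte Carlo estimate of E||(1/m) sum_i X_i||_2^2 for Rademacher X_i."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d)  # entries are +-1/sqrt(d), so ||X_i||_2 = 1
    total = 0.0
    for _ in range(trials):
        sums = [0.0] * d
        for _ in range(m):
            for k in range(d):
                sums[k] += scale if rng.random() < 0.5 else -scale
        # Squared Euclidean norm of the average of the m vectors.
        total += sum((s / m) ** 2 for s in sums)
    return total / trials


est_16 = mean_sq_norm_of_average(m=16, d=3, trials=2000)
est_64 = mean_sq_norm_of_average(m=64, d=3, trials=2000)
```

Up to Monte Carlo error, `16 * est_16` and `64 * est_64` should both be close to one, reflecting that quadrupling $m$ reduces the squared deviation by a factor of about four.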
We also need the following concentration inequality for martingales.
Proposition G.2 (Wainwright [28]). Let $X_1, \dots, X_K \in \mathbb{R}$ be a martingale difference sequence, where $E[X_i \mid \mathcal{F}_{i-1}] = 0$. Assume that $X_i$ satisfy the following tail condition, for some scalar $B_i > 0$:
$$P\left( \left| \frac{X_i}{B_i} \right| \ge z \,\Big|\, \mathcal{F}_{i-1} \right) \le 2 \exp(-z^2).$$
Then
$$P\left( \left| \sum_{i=1}^K X_i \right| \ge z \right) \le 2 \exp\left( -c \frac{z^2}{\sum_{i=1}^K B_i^2} \right),$$
where $c > 0$ is a universal constant.

G.1 Online Convex Learning
In this section, we present a high probability version of Theorem 4.1.
Theorem G.1.
Suppose the perturbation distribution $P_{\mathrm{PRTB}}$ is absolutely continuous w.r.t. the Lebesgue measure. Let $D$ be the diameter of $\mathcal{X}$ w.r.t. $\|\cdot\|$, which is defined as $D = \sup_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|$. Let $\eta = E_\sigma[\|\sigma\|_*]$, and suppose the predictions of OFTPL are $C\eta^{-1}$-stable w.r.t. $\|\cdot\|_*$, where $C$ is a constant that depends on the set $\mathcal{X}$. Suppose the sequence of loss functions $\{f_t\}_{t=1}^T$ are $G$-Lipschitz w.r.t. $\|\cdot\|$ and satisfy $\sup_{x \in \mathcal{X}} \|\nabla f_t(x)\|_* \le G$. Moreover, suppose $\{f_t\}_{t=1}^T$ are Holder smooth and satisfy
$$\forall x_1, x_2 \in \mathcal{X} \quad \|\nabla f_t(x_1) - \nabla f_t(x_2)\|_* \le L \|x_1 - x_2\|^\alpha,$$
for some constant $\alpha \in [0, 1]$. Then the regret of Algorithm 1 satisfies the following with probability at least $1 - \delta$:
$$\sup_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x_t) - f_t(x) \le \eta D + \sum_{t=1}^T \frac{C}{2\eta} \|\nabla_t - g_t\|_*^2 - \sum_{t=1}^T \frac{\eta}{2C} \|x_t^\infty - \tilde{x}_{t-1}^\infty\|^2 + cGD \sqrt{\frac{T \log(2/\delta)}{m}} + cLT \left( \frac{\Psi_1^2 \Psi_2^2 D^2 \log(4T/\delta)}{m} \right)^{\frac{1+\alpha}{2}},$$
where $c$ is a universal constant, $x_t^\infty = E[x_t \mid g_t, f_{1:t-1}, x_{1:t-1}]$ and $\tilde{x}_{t-1}^\infty = E[\tilde{x}_{t-1} \mid f_{1:t-1}, x_{1:t-1}]$, and $\tilde{x}_{t-1}$ denotes the prediction in the $t$-th iteration of Algorithm 1, if guess $g_t = 0$ was used. Here, $\Psi_1, \Psi_2$ denote the norm compatibility constants of $\|\cdot\|$.

Proof.
Our proof uses the same notation and similar arguments as in the proof Theorem 4.1. Recall,in Theorem 4.1 we showed that the regret of OFTPL is upper bounded by T ÿ t “ f t p x t q ´ f t p x q ď T ÿ t “ x x t ´ x t , ∇ t y ` ηD ` T ÿ t “ } x t ´ ˜ x t }} ∇ t ´ g t } ˚ ´ η C T ÿ t “ ` } ˜ x t ´ x t } ` } x t ´ ˜ x t ´ } ˘ ď T ÿ t “ x x t ´ x t , ∇ t y ` ηD ` T ÿ t “ C η } ∇ t ´ g t } ˚ ´ T ÿ t “ η C } x t ´ ˜ x t ´ } . From Holder’s smoothness assumption, we have x x t ´ x t , ∇ t ´ ∇ f t p x t qy ď L } x t ´ x t } ` α . Substituting this in the previous bound gives us T ÿ t “ f t p x t q ´ f t p x q ď T ÿ t “ x x t ´ x t , ∇ f t p x t qy loooooooooooooomoooooooooooooon S ` T ÿ t “ L } x t ´ x t } ` α looooooomooooooon S ` ηD ` T ÿ t “ C η } ∇ t ´ g t } ˚ ´ T ÿ t “ η C } x t ´ ˜ x t ´ } . We now provide high probability bounds for S and S .31 ounding S . Let ξ i “ t g i ` , f i ` , x i u and let ξ t denote the union of sets ξ , ξ , . . . , ξ t . Let ζ t “ x x t ´ x t , ∇ f t p x t qy with ζ “ . Note that t ζ t u Tt “ is a martingale difference sequencew.r.t ξ T . This is because E r x t | ξ t ´ s “ x t and ∇ f t p x t q is a deterministic quantity conditionedon ξ t ´ . As a result E r ζ t | ξ t ´ s “ . Moreover, conditioned on ξ t ´ , ζ t is the average of m independent mean random variables, each of which is bounded by GD . Using Proposition G.1,we get P ´ | ζ t | ě s ˇˇˇ ξ t ´ ¯ ď ˆ ´ ms G D ˙ . Using Proposition G.2 on the martingale difference sequence t ζ t u Tt “ , we get P ˜ˇˇˇ T ÿ t “ ζ t ˇˇˇ ě s ¸ ď ˆ ´ c ms G D T ˙ , where c ą is a universal constant. This shows that with probability at least ´ δ { , S is upperbounded by O ˆb G D T log δ m ˙ . Bounding S . Conditioned on t g t , f t ´ , x t ´ u , x t ´ x t is the average of m independent mean random variables which are bounded by D in } ¨ } norm. From our definition of norm compatibilityconstant Ψ , this implies the random variables are bounded by Ψ D in } ¨ } . 
Using Proposition G.1,we get P ˜ } x t ´ x t } ě Ψ D c c log 4 T { δm ˇˇˇˇˇ g t , f t ´ , x t ´ ¸ ď δ T .
Since the above bound holds for any set of t g t , f t , x t ´ u , the same tail bound also holds withoutthe conditioning. This shows that P ˜ } x t ´ x t } ` α ě ˆ c Ψ Ψ D log 4 T { δm ˙ ` α ¸ ď δ T , where we converted back to } ¨ } by introducing the norm compatibility constant Ψ . Bounding the regret.
Plugging the above high probability bounds for S , S in the previous regretbound and using union bound, we get the following regret bound which holds with probability atleast ´ δ T ÿ t “ f t p x t q ´ f t p x q ď cGD c T log 2 { δm ` cLT ˆ Ψ Ψ D log 4 T { δm ˙ ` α ` ηD ` T ÿ t “ C η } ∇ t ´ g t } ˚ ´ T ÿ t “ η C } x t ´ ˜ x t ´ } , where c ą is a universal constant. G.2 Convex-Concave Games
In this section, we present a high probability version of Theorem 5.1.
Theorem G.2.
Consider the minimax game in Equation (1). Suppose both the domains X , Y arecompact subsets of R d , with diameter D “ max t sup x , x P X } x ´ x } , sup y , y P Y } y ´ y } u .Suppose f is convex in x , concave in y and is Lipschitz w.r.t } ¨ } and satisfies max " sup x P X , y P Y } ∇ x f p x , y q} , sup x P X , y P Y } ∇ y f p x , y q} * ď G. Moreover, suppose f is smooth w.r.t } ¨ } } ∇ x f p x , y q ´ ∇ x f p x , y q} ` } ∇ y f p x , y q ´ ∇ y f p x , y q} ď L } x ´ x } ` L } y ´ y } . uppose Algorithm 3 is used to solve the minimax game. Suppose the perturbation distri-butions used by both the players are the same and equal to the uniform distribution over t x : } x } ď p ` d ´ q η u . Suppose the guesses used by x , y players in the t th iteration are ∇ x f p ˜ x t ´ , ˜ y t ´ q , ∇ y f p ˜ x t ´ , ˜ y t ´ q , where ˜ x t ´ , ˜ y t ´ denote the predictions of x , y players inthe t th iteration, if guess g t “ was used. If Algorithm 3 is run with η “ dD p L ` q , m “ T , thenthe iterates tp x t , y t qu Tt “ satisfy the following bound with probability at least ´ δ sup x P X , y P Y « f ˜ T T ÿ t “ x t , y ¸ ´ f ˜ x , T T ÿ t “ y t ¸ff “ O ¨˝ GD b log δ T ` D p L ` q ` d ` log Tδ ˘ T ˛‚ . Proof.
We use the same notation and proof technique as Theorems D.1, 5.1. From Theorem 4.1 weknow that the predictions of OFTPL are dDη ´ stable w.r.t } ¨ } , for the particular perturbationdistribution we consider here. We use this stability bound in our proof. From Theorem G.1, we havethe following regret bound for both the players, which holds with probability at least ´ δ { x P X « T ÿ t “ f p x t , y t q ´ f p x , y t q ff ď cGD c T log 8 { δm ` cLT ˆ D log 16 T { δm ˙ ` ηD ` dD η T ÿ t “ “ } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ‰ ´ η dD T ÿ t “ “ } x t ´ ˜ x t ´ } ‰ . sup y P Y « T ÿ t “ f p x t , y q ´ f p x t , y t q ff ď cGD c T log 8 { δm ` cLT ˆ D log 16 T { δm ˙ ` ηD ` dD η T ÿ t “ “ } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ‰ ´ η dD T ÿ t “ “ } y t ´ ˜ y t ´ } ‰ . First, consider the regret of the x player. From the proof of Theorem D.1, we have } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ď L } x t ´ x t } ` L } ˜ x t ´ ´ ˜ x t ´ } ` L } y t ´ y t } ` L } ˜ y t ´ ´ ˜ y t ´ } ` } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} . Moreover, from the proof of Theorem G.1, we know that } x t ´ x t } satisfies the following tailbound P ˆ } x t ´ x t } ě cD log 16 T { δm ˙ ď δ T .
Similar bounds hold for the quantities appearing in the regret bound of y player. Plugging this inthe previous regret bounds, we get the following which hold with probability at least ´ δ sup x P X « T ÿ t “ f p x t , y t q ´ f p x , y t q ff ď cGD c T log 8 { δm ` ˆ L ` dDL η ˙ ˆ cD log 16 T { δm ˙ T ` ηD ` dD η T ÿ t “ “ } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ‰ ´ η dD T ÿ t “ “ } x t ´ ˜ x t ´ } ‰ . up y P Y « T ÿ t “ f p x t , y q ´ f p x t , y t q ff ď cGD c T log 8 { δm ` ˆ L ` dDL η ˙ ˆ cD log 16 T { δm ˙ T ` ηD ` dD η T ÿ t “ “ } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ‰ ´ η dD T ÿ t “ “ } y t ´ ˜ y t ´ } ‰ . Summing these two regret bounds, we get sup x P X , y P Y « T ÿ t “ f p x t , y q ´ f p x , y t q ff ď cGD c T log 8 { δm ` ˆ L ` dDL η ˙ ˆ cD log 16 T { δm ˙ T ` ηD ` dD η T ÿ t “ “ } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ‰ ` dD η T ÿ t “ “ } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ‰ ´ η dD T ÿ t “ “ } x t ´ ˜ x t ´ } ` } y t ´ ˜ y t ´ } ‰ . From Holder’s smoothness assumption on f , we have } ∇ x f p x t , y t q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ď } ∇ x f p x t , y t q ´ ∇ x f p x t , ˜ y t ´ q} ` } ∇ x f p x t , ˜ y t ´ q ´ ∇ x f p ˜ x t ´ , ˜ y t ´ q} ď L } x t ´ ˜ x t ´ } ` L } y t ´ ˜ y t ´ } , Using a similar argument, we get } ∇ y f p x t , y t q ´ ∇ y f p ˜ x t ´ , ˜ y t ´ q} ď L } x t ´ ˜ x t ´ } ` L } y t ´ ˜ y t ´ } . Plugging this in the previous bound, and setting η “ dD p L ` q , m “ T , we get the followingbound which holds with probability at least ´ δ sup x P X , y P Y « T ÿ t “ f p x t , y q ´ f p x , y t q ff ď O ˜ GD c log 8 δ ` D p L ` q ˆ d ` log 16 Tδ ˙¸ . G.3 Nonconvex-Nonconcave Games
In this section, we present a high probability version of Theorem 5.2.
Theorem G.3.
Consider the minimax game in Equation (1). Suppose the domains X , Y are compactsubsets of R d with diameter D “ max t sup x , x P X } x ´ x } , sup y , y P Y } y ´ y } u . Suppose f is Lipschitz w.r.t } ¨ } and satisfies max " sup x P X , y P Y } ∇ x f p x , y q} , sup x P X , y P Y } ∇ y f p x , y q} * ď G. Moreover, suppose f satisfies the following smoothness property } ∇ x f p x , y q ´ ∇ x f p x , y q} ` } ∇ y f p x , y q ´ ∇ y f p x , y q} ď L } x ´ x } ` L } y ´ y } . Suppose both x and y players use Algorithm 4 to solve the game with linear perturbation functions σ p z q “ x ¯ σ, z y , where ¯ σ P R d is such that each of its entries is sampled independently from Exp p η q .Suppose the guesses used by x and y players in the t th iteration are f p¨ , ˜ Q t ´ q , f p ˜ P t ´ , ¨q , where ˜ P t ´ , ˜ Q t ´ denote the predictions of x , y players in the t th iteration, if guess g t “ was used. If lgorithm 4 is run with η “ d D p L ` q , m “ T , then the iterates tp P t , Q t qu Tt “ satisfy thefollowing with probability at least ´ δ sup x P X , y P Y T ÿ t “ f p P t , y q ´ f p x , Q t q “ O ˜ d D p L ` q log dT ` GDT c log 8 δ ¸ ` O ˆ min " D L, d G log T ` dG log δ LT *˙ . Proof.
We use the same notation used in the proofs of Theorems 4.2, E.1. Let F , F be the set of Lip-schitz functions over X , Y , and } g } F , } g } F be the Lipschitz constants of functions g : X Ñ R , g : Y Ñ R w.r.t } ¨ } . Recall, in Corollary 4.2 we showed that for our choice of perturbationdistribution, E σ r} σ } F s “ η log d and OFTPL is O ` d Dη ´ ˘ stable. We use this in our proof.From Theorem 4.2, we know that the regret of x , y players satisfy T ÿ t “ f p P t , Q t q ´ f p x , Q t q ď ηD log d ` T ÿ t “ x P t ´ P t , f p¨ , Q t qy looooooooooooomooooooooooooon S ` T ÿ t “ cd D η } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F loooooooooooooomoooooooooooooon S ´ T ÿ t “ η cd D γ F p P t , ˜ P t ´ q T ÿ t “ f p P t , y q ´ f p P t , Q t q ď ηD log d ` T ÿ t “ x Q t ´ Q t , f p P t , ¨qy` T ÿ t “ cd D η } f p P t , ¨q ´ f p ˜ P t ´ , ¨q} F ´ T ÿ t “ η cd D γ F p Q t , ˜ Q t ´ q , where c ą is a positive constant. We now provide high probability bounds for S , S . Bounding S . Let ξ i “ t ˜ P i , ˜ Q i , P i , Q i ` u with ξ “ t Q u and let ξ t denote the union of sets ξ , . . . , ξ t . Let ζ t “ x P t ´ P t , f p¨ , Q t qy with ζ “ . Note that t ζ t u Tt “ is a martingale differencesequence w.r.t ξ T . This is because E r P t | ξ t ´ s “ P t and f p¨ , Q t q is a deterministic quantityconditioned on ξ t ´ . As a result E r ζ t | ξ t ´ s “ . Moreover, conditioned on ξ t ´ , ζ t is theaverage of m independent mean random variables, each of which is bounded by GD . UsingProposition G.1, we get P ´ | ζ t | ě s ˇˇˇ ξ t ´ ¯ ď ˆ ´ ms G D ˙ . Using Proposition G.2 on the martingale difference sequence t ζ t u Tt “ , we get P ˜ˇˇˇ T ÿ t “ ζ t ˇˇˇ ě s ¸ ď ˆ ´ c ms G D T ˙ , where c ą is a universal constant. This shows that with probability at least ´ δ { , S is upperbounded by O ˆb G D T log δ m ˙ . ounding S . 
We upper bound S as } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ď } f p¨ , Q t q ´ f p¨ , Q t q} F ` } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ` } f p¨ , ˜ Q t ´ q ´ f p¨ , ˜ Q t ´ q} F . We first provide a high probability bound for } f p¨ , Q t q ´ f p¨ , Q t q} F . A trivial bound for thisquantity is L D , which can be obtained as follows } f p¨ , Q t q ´ f p¨ , Q t q} F “ sup x P X } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} “ } E y „ Q t , y „ Q t r ∇ x f p x , y q ´ ∇ x f p x , y qs } a q ď LD, where p a q follows from the smoothness assumption on f and the fact that the diameter of X is D . Abetter bound for this quantity can be obtained as follows. From proof of Theorem E.1, we have } f p¨ , Q t q ´ f p¨ , Q t q} F ď x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ` L ǫ . where N ǫ be the ǫ -net of X w.r.t } ¨ } . Recall, in the proof of Theorem E.1, we showed the followinghigh probability bound for the RHS quantity P ˆ sup x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ą dG m p d ` ? ds ` s q ˙ ď e ´ s ` d log p ` D { ǫ q . Choosing ǫ “ Dm ´ { , s “ log δ ` d log p ` m { q , we get the following bound for sup x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} which holds with probability at least ´ δ { x P N ǫ } ∇ x f p x , Q t q ´ ∇ x f p x , Q t q} ď dG m ˆ log 8 δ ` d log p ` m { q ˙ . Together with our trivial bound of D L , this gives us the following bound for } f p¨ , Q t q ´ f p¨ , Q t q} F , which holds with probability at least ´ δ { } f p¨ , Q t q ´ f p¨ , Q t q} F ď min ˆ dG m ˆ log 8 δ ` d log p ` m { q ˙ , D L ˙ ` D L m . Next, we bound } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F . From our smoothness assumption on f , we have } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ď Lγ F p Q t , ˜ Q t ´ q . Combining the previous two results, we get the following upper bound for S which holds withprobability at least ´ δ { } f p¨ , Q t q ´ f p¨ , ˜ Q t ´ q} F ď L γ F p Q t , ˜ Q t ´ q ` D L m ` min ˆ dG m ˆ log 8 δ ` d log p ` m { q ˙ , D L ˙ . Regret bound.
Substituting the above bounds for S , S in the regret bound for x player gives usthe following bound, which holds with probability at least ´ δ { T ÿ t “ f p P t , Q t q ´ f p x , Q t q ď ηD log d ` O ¨˝ GD d T log δ m ` d D L Tηm ˛‚ ` O ˆ min ˆ d DG Tηm ˆ log 8 δ ` d log p m q ˙ , d D L Tη ˙˙ ` T ÿ t “ cd DL η γ F p Q t , ˜ Q t ´ q ´ T ÿ t “ η cd D γ F p P t , ˜ P t ´ q y player T ÿ t “ f p P t , Q t q ´ f p x , Q t q ď ηD log d ` O ¨˝ GD d T log δ m ` d D L Tηm ˛‚ ` O ˆ min ˆ d DG Tηm ˆ log 8 δ ` d log p m q ˙ , d D L Tη ˙˙ ` T ÿ t “ cd DL η γ F p P t , ˜ P t ´ q ´ T ÿ t “ η cd D γ F p Q t , ˜ Q t ´ q Choosing, η “ d D p L ` q , m “ T , and adding the above two regret bounds, we get sup x P X , y P Y T ÿ t “ f p P t , y q ´ f p x , Q t q “ O ˜ d D p L ` q log d ` GD c log 8 δ ¸ ` O ˆ min " D LT, d G log TL ` dG log δ L *˙ . H Background on Convex Analysis
Fenchel Conjugate.
The Fenchel conjugate of a function $f$ is defined as
\[
f^*(x^*) = \sup_x\ \langle x, x^*\rangle - f(x).
\]
We now state some useful properties of Fenchel conjugates. These properties can be found in Rockafellar [22].
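The definition above can be checked numerically on a simple case. The following is a minimal sketch, not part of the cited results: it approximates the supremum by a maximum over a finite grid, for the one-dimensional function $f(x) = x^2/2$, whose conjugate is known to be $f^*(x^*) = (x^*)^2/2$. The helper names `conjugate_on_grid` and `half_square`, the grid range, and the grid spacing are all illustrative assumptions.

```python
def conjugate_on_grid(f, xs, x_star):
    """Approximate f*(x_star) = sup_x <x, x_star> - f(x) by a max over the grid xs.

    This is only a lower bound on the true supremum; it is accurate when the
    maximizer lies inside (or close to a point of) the grid.
    """
    return max(x * x_star - f(x) for x in xs)


def half_square(x):
    """f(x) = x^2 / 2, which is its own Fenchel conjugate."""
    return 0.5 * x * x


# Assumed grid: [-5, 5] with spacing 0.001 (illustrative choice).
xs = [i / 1000.0 for i in range(-5000, 5001)]

for x_star in (-2.0, 0.0, 0.5, 3.0):
    approx = conjugate_on_grid(half_square, xs, x_star)
    exact = 0.5 * x_star ** 2  # closed form of the conjugate
    print(f"x*={x_star:+.1f}  grid value={approx:.6f}  closed form={exact:.6f}")
```

For values of $x^*$ whose maximizer $x = x^*$ falls outside the grid, the grid maximum underestimates the conjugate, so the grid range has to be chosen with the function in mind.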
Theorem H.1.
Let $f$ be a proper convex function. The conjugate function $f^*$ is then a closed and proper convex function. Moreover, if $f$ is lower semi-continuous then $f^{**} = f$.
Theorem H.2.
For any proper convex function $f$ and any vector $x$, the following conditions on a vector $x^*$ are equivalent to each other:
• $x^* \in \partial f(x)$
• $\langle z, x^*\rangle - f(z)$ achieves its supremum in $z$ at $z = x$
• $f(x) + f^*(x^*) = \langle x, x^*\rangle$
If $(\mathrm{cl}\, f)(x) = f(x)$, the following condition can be added to the list:
• $x \in \partial f^*(x^*)$
Theorem H.3. If $f$ is a closed proper convex function, $\partial f^*$ is the inverse of $\partial f$ in the sense of multivalued mappings, i.e., $x \in \partial f^*(x^*)$ iff $x^* \in \partial f(x)$.
Theorem H.4.
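Let $f$ be a closed proper convex function. Let $\partial f$ be the subdifferential mapping. The effective domain of $\partial f$, which is the set $\mathrm{dom}(\partial f) = \{x \mid \partial f(x) \neq \emptyset\}$, satisfies
\[
\mathrm{ri}(\mathrm{dom}(f)) \subseteq \mathrm{dom}(\partial f) \subseteq \mathrm{dom}(f).
\]
The range of $\partial f$ is defined as $\mathrm{range}\,\partial f = \cup\{\partial f(x) \mid x \in \mathbb{R}^d\}$. The range of $\partial f$ is the effective domain of $\partial f^*$, so
\[
\mathrm{ri}(\mathrm{dom}(f^*)) \subseteq \mathrm{range}\,\partial f \subseteq \mathrm{dom}(f^*).
\]
Strong Convexity and Smoothness.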
We now define strong convexity and strong smoothness and show that these two properties are duals of each other.
Definition H.1 (Strong Convexity). A function $f: \mathcal{X} \to \mathbb{R}\cup\{\infty\}$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ if for all $x, y \in \mathrm{ri}(\mathrm{dom}(f))$ and $\alpha \in (0,1)$ we have
\[
f(\alpha x + (1-\alpha)y) \le \alpha f(x) + (1-\alpha)f(y) - \frac{\beta}{2}\alpha(1-\alpha)\|x-y\|^2 .
\]
This definition of strong convexity is equivalent to the following condition on $f$ [see Lemma 13 of 24]:
\[
f(y) \ge f(x) + \langle g, y-x\rangle + \frac{\beta}{2}\|y-x\|^2, \quad \text{for any } x, y \in \mathrm{ri}(\mathrm{dom}(f)),\ g \in \partial f(x).
\]
Definition H.2 (Strong Smoothness). A function $f: \mathcal{X} \to \mathbb{R}\cup\{\infty\}$ is $\beta$-strongly smooth w.r.t. a norm $\|\cdot\|$ if $f$ is everywhere differentiable and if for all $x, y$ we have
\[
f(y) \le f(x) + \langle \nabla f(x), y-x\rangle + \frac{\beta}{2}\|y-x\|^2 .
\]
Theorem H.5 (Kakade et al. [29]). Assume that $f$ is a proper closed and convex function. Suppose $f$ is $\beta$-strongly smooth w.r.t. a norm $\|\cdot\|$. Then its conjugate $f^*$ satisfies the following for all $a, x$ with $u = \nabla f(x)$:
\[
f^*(a+u) \ge f^*(u) + \langle x, a\rangle + \frac{1}{2\beta}\|a\|_*^2 .
\]
Theorem H.6 (Kakade et al. [29]). Assume that $f$ is a closed and convex function. Then $f$ is $\beta$-strongly convex w.r.t. a norm $\|\cdot\|$ iff $f^*$ is $\frac{1}{\beta}$-strongly smooth w.r.t. the dual norm $\|\cdot\|_*$.
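As a sanity check of the duality between strong convexity and strong smoothness, consider the following worked example. It is our own illustration and is not taken from Kakade et al. [29]: we assume the one-dimensional quadratic $f(x) = \frac{\beta}{2}x^2$ on $\mathbb{R}$ with $\beta > 0$, and take the norm and its dual to be the absolute value.

\[
f(x) = \frac{\beta}{2}x^2, \qquad f^*(u) = \sup_x\left\{xu - \frac{\beta}{2}x^2\right\} = \frac{u^2}{2\beta} \quad \text{(supremum attained at } x = u/\beta\text{)}.
\]
The strong convexity inequality for $f$ holds with equality, since
\[
\alpha f(x) + (1-\alpha)f(y) - f(\alpha x + (1-\alpha)y) = \frac{\beta}{2}\alpha(1-\alpha)(x-y)^2 ,
\]
so $f$ is $\beta$-strongly convex. For the conjugate, with $\nabla f^*(u) = u/\beta$,
\[
f^*(v) = f^*(u) + \frac{u}{\beta}(v-u) + \frac{1}{2\beta}(v-u)^2 ,
\]
so $f^*$ is strongly smooth with constant $1/\beta$. Finally, $f$ is itself $\beta$-strongly smooth, and with $u = \nabla f(x) = \beta x$ we get
\[
f^*(a+u) = \frac{(a+\beta x)^2}{2\beta} = f^*(u) + xa + \frac{1}{2\beta}a^2 ,
\]
so the inequality relating a strongly smooth function to its conjugate holds with equality in this example.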