Two-Player Games for Efficient Non-Convex Constrained Optimization
Andrew Cotter*, Heinrich Jiang†, and Karthik Sridharan‡
Google AI; Cornell University
October 2, 2018
Abstract
In recent years, constrained optimization has become increasingly relevant to the machine learning community, with applications including Neyman-Pearson classification, robust optimization, and fair machine learning. A natural approach to constrained optimization is to optimize the Lagrangian, but this is not guaranteed to work in the non-convex setting, and, if using a first-order method, cannot cope with non-differentiable constraints (e.g. constraints on rates or proportions).

The Lagrangian can be interpreted as a two-player game played between a player who seeks to optimize over the model parameters, and a player who wishes to maximize over the Lagrange multipliers. We propose a non-zero-sum variant of the Lagrangian formulation that can cope with non-differentiable—even discontinuous—constraints, which we call the "proxy-Lagrangian". The first player minimizes external regret in terms of easy-to-optimize "proxy constraints", while the second player enforces the original constraints by minimizing swap regret.

For this new formulation, as for the Lagrangian in the non-convex setting, the result is a stochastic classifier. For both the proxy-Lagrangian and Lagrangian formulations, however, we prove that this classifier, instead of having unbounded size, can be taken to be a distribution over no more than m + 1 models (where m is the number of constraints). This is a significant improvement in practical terms.

We consider the general problem of inequality-constrained optimization, in which we wish to find a set of parameters θ ∈ Θ minimizing an objective function subject to m functional constraints:

$$\min_{\theta \in \Theta} g(\theta) \quad \text{s.t.} \quad \forall i \in [m].\; g_i(\theta) \le 0 \tag{1}$$

To highlight some of the challenges that arise in non-convex constrained optimization, consider the specific example of constraining a fairness metric.
*[email protected]  †[email protected]  ‡[email protected]

We cast the fairness problem as that of minimizing some empirical loss subject to a rate constraint:

$$\min_{\theta \in \Theta} \frac{1}{|S|} \sum_{(x,y) \in S} \ell(f(x;\theta), y) \quad \text{s.t.} \quad \frac{1}{|S|} \sum_{x \in S_{\mathrm{min}}} \mathbf{1}\{f(x;\theta) > 0\} \ge \frac{0.8}{|S|} \sum_{x \in S} \mathbf{1}\{f(x;\theta) > 0\} \tag{2}$$

Here, f(·;θ) is a classification function with parameters θ, S is the training dataset, and S_min ⊆ S represents a minority population. The constraint represents a version of the so-called "80% rule" [e.g. Biddle, 2005, Vuolo and Levy, 2013], and forces the resulting classifier to make at least 80% of its positive predictions on the minority population—Goh et al. [2016] and Narasimhan [2018] discuss a number of useful constraints that are formulated similarly, both on fairness and non-fairness metrics. Unfortunately, several serious challenges arise when we attempt to optimize this problem:

1. The constraint is data-dependent, and could therefore be very expensive to check.
2. The classification function f may be a badly-behaved function of θ (e.g. a deep neural network), resulting in non-convex objective and constraint functions.
3. Worse, the constraint is a linear combination of indicators, hence is not even subdifferentiable w.r.t. θ.

Perhaps the most "familiar" technique for constrained optimization is to formulate the Lagrangian:

Definition 1.
The Lagrangian L : Θ × Λ → ℝ of Equation 1 is:

$$L(\theta, \lambda) := g(\theta) + \sum_{i=1}^{m} \lambda_i g_i(\theta)$$

where Λ ⊆ ℝ₊ᵐ. We jointly minimize over θ ∈ Θ and maximize over λ ∈ Λ ⊆ ℝ₊ᵐ. By itself, using this formulation doesn't address the challenges we identified above, but we will see that, compared to the alternatives (Section 2.1), it's a good starting point for an approach that does.

Optimizing the Lagrangian can be interpreted as playing a two-player zero-sum game: the first player chooses θ to minimize L(θ, λ), and the second player chooses λ to maximize it. The essential difficulty is that, without strong duality—equivalently, unless the minimax theorem holds, giving that $\min_{\theta \in \Theta} \max_{\lambda \in \Lambda} L(\theta, \lambda) = \max_{\lambda \in \Lambda} \min_{\theta \in \Theta} L(\theta, \lambda)$—the θ-player, who is working on the primal (minimax) problem, and the λ-player, who is working on the dual (maximin) problem, might fail to converge to a solution satisfying both players simultaneously (i.e. a pure Nash equilibrium).

If Equation 1 is a convex optimization problem and the action spaces Θ and Λ are compact and convex, then the minimax theorem holds [von Neumann, 1928], and optimizing the Lagrangian will work. Otherwise it might not, and in fact it's quite easy to construct a counterexample: Figure 1 shows a case in which a pure Nash equilibrium of the Lagrangian game does not exist. For this reason, the standard approach for handling non-convex machine learning problems, i.e. pretending that the problem is convex and using a stochastic first-order algorithm anyway, should not be expected to reliably converge to a pure Nash equilibrium—even on a problem as trivial as that in Figure 1—since there may be none for it to converge to.

Under general conditions, however, even when there is no pure Nash equilibrium, a mixed equilibrium (i.e. a pair of distributions over θ and λ) does exist.
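As a toy illustration of these game dynamics (a minimal sketch, not an algorithm from this paper), consider simultaneous projected gradient descent-ascent on the Lagrangian of the convex problem min θ² s.t. 1 − θ ≤ 0, whose pure equilibrium is (θ*, λ*) = (1, 2):

```python
# Toy sketch: simultaneous projected gradient descent-ascent on the
# Lagrangian L(theta, lam) = theta^2 + lam * (1 - theta), i.e. the
# problem  min theta^2  s.t.  1 - theta <= 0  (optimum theta* = 1, lam* = 2).

def gda(eta=0.05, steps=2000):
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        d_theta = 2.0 * theta - lam        # dL/dtheta
        d_lam = 1.0 - theta                # dL/dlam = g1(theta)
        theta = theta - eta * d_theta      # descent step for the theta-player
        lam = max(0.0, lam + eta * d_lam)  # projected ascent for the lambda-player
    return theta, lam

theta, lam = gda()
```

In this convex case the iterates approach the pure equilibrium; the counterexample of Figure 1 is precisely a setting where no pure equilibrium exists, and only a mixed one can be sought.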
Such an equilibrium defines a stochastic classifier: upon receiving an example x to classify, we would sample θ from its equilibrium distribution, and then evaluate the classification function f(x;θ). Furthermore, and this is our first main contribution, this equilibrium can be taken to consist of a discrete distribution over at most m + 1 distinct θs (m being the number of constraints), and a single non-random λ. This is a crucial improvement in practical terms, since a machine learning model consisting of e.g. a distribution over thousands (or more) of deep neural networks—or worse, a continuous distribution—would likely be so unwieldy as to be unusable.

Figure 1: The plotted rectangular region is the domain Θ, the contours are those of the strictly concave minimization objective function g, and the shaded triangle is the feasible region determined by the three linear inequality constraints g_1, ..., g_3. The red dot is the optimal feasible point. The Lagrangian L(θ, λ) is strictly concave in θ for any choice of λ, so the optimal choice(s) for the θ-player will always lie on the four corners of the plotted rectangle. However, these points are infeasible, and therefore suboptimal for the λ-player.

Most real-world machine learning implementations use first-order methods (even on non-convex problems, e.g. DNNs). To use such a method, however, one must have gradients, and gradients are unavailable for non-differentiable constraints like that in the fairness example of Equation 2, or in the myriad of other situations in which one wishes to constrain counts or proportions instead of smooth losses (e.g. recall, coverage or churn as in Goh et al. [2016]). In all of these cases, the constraint functions are piecewise-constant, so their gradients are zero almost everywhere, and a gradient-based method cannot be expected to succeed.

The obvious solution is to use a surrogate.
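To make the vanishing-gradient issue concrete, here is a minimal sketch (toy data; the hinge surrogate and the scalar `shift` parameter standing in for a change to θ are illustrative choices, not prescribed by the paper) contrasting an indicator-based positive-prediction rate with a smooth upper bound:

```python
import numpy as np

# Toy illustration: a rate constraint built from indicators is
# piecewise-constant in the parameters, so its "gradient" is zero almost
# everywhere, while a hinge upper bound on the indicator is subdifferentiable.

scores = np.array([-0.5, 0.2, 1.3, -2.0])   # f(x; theta) on four examples

def positive_rate(shift):
    # fraction of positive predictions after shifting the scores by `shift`
    return np.mean(scores + shift > 0.0)

def hinge_rate(shift):
    # max(0, 1 + z) >= indicator(z > 0), so this upper-bounds positive_rate
    return np.mean(np.maximum(0.0, 1.0 + scores + shift))

# Perturbing the parameters slightly does not change the indicator-based
# rate at all, so a gradient-based method sees a zero gradient:
assert positive_rate(0.0) == positive_rate(1e-6)
```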
For example, we might consider replacing the indicators of Equation 2 with sigmoids, and then optimizing the Lagrangian. This solves the differentiability problem, but introduces a new one: a (mixed) Nash equilibrium would correspond to a solution satisfying the sigmoid-relaxed constraints, instead of the actual constraints. Interestingly, it turns out that we can seek to satisfy the original un-relaxed constraints, even while using a surrogate. Our proposal is motivated by the observation that, while differentiating the Lagrangian (Definition 1) w.r.t. θ requires differentiating the constraint functions g_i(θ), to differentiate it w.r.t. λ we only need to evaluate them. Hence, a surrogate is only necessary for the θ-player; the λ-player can continue to use the original constraint functions.

We refer to a surrogate that is used by only one of the two players as a "proxy", and introduce the notion of "proxy constraints" by taking g̃_i(θ) to be a sufficiently-smooth upper bound on g_i(θ) for i ∈ [m], and formulating two functions that we call "proxy-Lagrangians":

Definition 2.
Given proxy constraint functions g̃_i(θ) ≥ g_i(θ) for i ∈ [m], the proxy-Lagrangians L_θ, L_λ : Θ × Λ → ℝ of Equation 1 are:

$$L_\theta(\theta, \lambda) := \lambda_1 g(\theta) + \sum_{i=1}^{m} \lambda_{i+1} \tilde{g}_i(\theta)$$

$$L_\lambda(\theta, \lambda) := \sum_{i=1}^{m} \lambda_{i+1} g_i(\theta)$$

where Λ := Δ^{m+1} ∋ λ is the (m+1)-dimensional simplex.

As one might expect, the θ-player wishes to minimize L_θ(θ, λ), while the λ-player wishes to maximize L_λ(θ, λ). Notice that the g̃_i s are only used by the θ-player. Intuitively, the λ-player chooses how much to weigh the proxy constraint functions, but—and this is the key to our proposal—does so in such a way as to satisfy the original constraints. Unfortunately, because the two players are optimizing different functions, this is a non-zero-sum game, and finding a (mixed) Nash equilibrium of such games is known to be PPAD-complete even in the finite setting [Chen and Deng, 2006]. We prove, however, that a weaker type of equilibrium (a Φ-correlated equilibrium [Rakhlin et al., 2011], i.e. a joint distribution over θ and λ w.r.t. which neither player can improve)—one that we can find efficiently—suffices to guarantee a nearly-optimal and nearly-feasible solution to Equation 1 in expectation.

We first focus on the standard Lagrangian formulation, in the non-convex setting. In Section 3, we provide an algorithm that, given access to an approximate Bayesian optimization oracle, finds a stochastic classifier that, in expectation, is provably approximately feasible and optimal. Many previous authors have approached constrained optimization using similar techniques (see Section 2)—our main contribution is to show how such a classifier can be efficiently "shrunk" to one that is at least as good, but is supported on only m + 1 solutions.

Our next major contribution is the introduction of the proxy-Lagrangian formulation, which allows us to optimize constrained problems with extremely general (even non-differentiable) constraints.
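In code, the two objectives of Definition 2 might be evaluated as follows (a toy sketch with made-up values; the hypothetical g̃_i values upper-bound the g_i, and λ lives on the simplex):

```python
import numpy as np

# Toy sketch of Definition 2: lambda lies on the (m+1)-dimensional simplex;
# the theta-player's objective uses the smooth proxy constraints g_tilde,
# while the lambda-player's uses the original constraints g.

def L_theta(g0, g_tilde, lam):
    # lam[0] weighs the objective; lam[1:] weigh the proxy constraints
    return lam[0] * g0 + np.dot(lam[1:], g_tilde)

def L_lambda(g, lam):
    # only the original (possibly non-differentiable) constraints appear
    return np.dot(lam[1:], g)

g0 = 0.7                         # objective value g(theta), made up
g = np.array([-0.1, 0.3])        # original constraint values g_i(theta)
g_tilde = np.array([0.2, 0.5])   # proxy upper bounds, g_tilde_i >= g_i
lam = np.array([0.5, 0.2, 0.3])  # a point on the simplex (m = 2)
```

Note that L_λ is linear in λ with coefficients g_i(θ), so the λ-player only ever needs to evaluate the original constraints, never to differentiate them.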
In Section 4, we prove that a particular type of Φ-correlated equilibrium results in a stochastic classifier that is feasible and optimal, and go on to provide a novel algorithm that converges to such an equilibrium. Interestingly, to get the "right" sort of equilibrium, the θ-player needs only minimize the usual external regret, but the λ-player must minimize the swap regret. While the resulting distribution is supported on a large number of (θ, λ) pairs, applying the same "shrinking" procedure as before yields a distribution over only m + 1 θs that is at least as good as the original.

Finally, in Section 5, we tie everything together by describing an end-to-end recipe for provably solving a non-convex constrained optimization problem with potentially non-differentiable constraints, yielding a stochastic model that is supported on at most m + 1 solutions. In practice, one would use SGD instead of an oracle, which results in an efficient procedure that can be easily plugged in to existing workflows, as is experimentally verified in Section 6.

The interpretation of constrained optimization as a two-player game has a long history: Arora et al. [2012] survey some such work, and there are several more recent examples [e.g. Kearns et al., 2017, Narasimhan, 2018, Agarwal et al., 2018]. In particular, Agarwal et al. [2018] propose an algorithm for fair classification that is very similar to the Lagrangian-based approach that we outline in Section 3—the main differences are our introduction of "shrinking", and that our setting (Equation 1) is more general. The recent work of Chen et al. [2017] addresses non-convex robust optimization, i.e. problems of the form:

$$\min_{\theta \in \Theta} \max_{i \in [m]} g_i(\theta)$$

Like both us and Agarwal et al.
[2018], they: (i) model such a problem as a two-player game where one player chooses a mixture of objective functions, and the other player minimizes the loss of the mixture, and (ii) find a distribution over solutions rather than a pure equilibrium. These similarities are unsurprising in light of the fact that robust optimization can be reformulated as constrained optimization via the introduction of a slack variable:

$$\min_{\theta \in \Theta, \xi \in \Xi} \xi \quad \text{s.t.} \quad \forall i \in [m].\; \xi \ge g_i(\theta) \tag{3}$$

Correspondingly, one can transform a robust problem to a constrained one at the cost of an extra bisection search [e.g. Christiano et al., 2011, Rakhlin and Sridharan, 2013]. As this relationship suggests, our main contributions can be adapted to the robust optimization setting. In particular: (i) our proposed shrinking procedure can be applied to Equation 3 to yield a distribution over only m + 1 solutions, and (ii) one could perform robust optimization over non-differentiable (even discontinuous) losses using "proxy objectives", just as we use proxy constraints.

Algorithm 1: Optimizes the Lagrangian formulation (Definition 1) in the non-convex setting via the use of an approximate Bayesian optimization oracle O_ρ (Definition 3) for the θ-player. The parameter R is the radius of the Lagrange multiplier space Λ := {λ ∈ ℝ₊ᵐ : ‖λ‖₁ ≤ R}, and the function Π_Λ projects its argument onto Λ w.r.t. the Euclidean norm.

OracleLagrangian(R ∈ ℝ₊, L : Θ × Λ → ℝ, O_ρ : (Θ → ℝ) → Θ, T ∈ ℕ, η_λ ∈ ℝ₊):
  Initialize λ^(1) = 0
  For t ∈ [T]:
    Let θ^(t) = O_ρ(L(·, λ^(t)))                       // Oracle optimization
    Let Δ_λ^(t) be a gradient of L(θ^(t), λ^(t)) w.r.t. λ
    Update λ^(t+1) = Π_Λ(λ^(t) + η_λ Δ_λ^(t))          // Projected gradient update
  Return θ^(1), ..., θ^(T) and λ^(1), ..., λ^(T)

Given the difficulties involved in using a Lagrangian-like formulation for non-convex problems, it's natural to ask whether one should instead favor a procedure based on entirely different principles. Unfortunately, the potential alternatives each present their own challenges.

The potential complexity of the constraints all but rules out approaches based on projections (e.g. projected SGD) or optimization of constrained subproblems (e.g. Frank-Wolfe, as in Hazan and Kale [2012], Jaggi [2013], Garber and Hazan [2013]). Similarly, attempting to penalize violations [e.g. Arora et al., 2012, Rakhlin and Sridharan, 2013, Mahdavi et al., 2012, Cotter et al., 2016], for example by adding γ max_{i∈[m]} max{0, g_i(θ)} to the objective, where γ ∈ ℝ₊ is a hyperparameter, and optimizing the resulting problem using a first-order method, fails if the constraint functions are non-differentiable. Even if they are differentiable, they may still be data-dependent, so evaluating g_i, or even determining whether it is positive (as is necessary for such techniques, due to the max with 0), requires enumerating over the entire dataset. Hence, unlike the Lagrangian and proxy-Lagrangian formulations, such "penalized" formulations are incompatible with the use of a computationally-cheap stochastic optimizer.

In response to the idea of proxy constraints, it's natural to ask "why not just relax the constraints for both players, instead of just the θ-player?". This is indeed a popular approach, having been proposed e.g. for Neyman-Pearson classification [Davenport et al., 2010, Gasso et al., 2011], more general rate metrics [Goh et al., 2016], and AUC [Eban et al., 2017]. The answer is that in many cases, particularly when constraints are data-dependent, they represent real-world restrictions on how the learned model is permitted to behave.
For example, the "80% rule" of Equation 2 can be found in the HOPA Act of 1995 [Wikipedia, 2018], and it requires an 80% threshold in terms of the number of positive predictions—not a relaxation—which is precisely the target that the proxy-Lagrangian approach will attempt to hit. This point, in turn, raises the question of generalization: satisfying the correct un-relaxed constraints on training data does not necessarily mean that they will be satisfied at evaluation time. This issue is outside the scope of this paper, but is vital. For certain specific applications, the post-training correction approach of Woodworth et al. [2017] can improve generalization performance, and Cotter et al. [2018]'s more recent proposal (which is based on our proxy-Lagrangian formulation) can be applied more generally, but there is still room for future work.

Our ultimate interest is in constrained optimization, so before we present our proposed algorithm for optimizing the Lagrangian (Definition 1) in the non-convex setting, we will characterize the relationship between an approximate Nash equilibrium of the Lagrangian game, and a nearly-optimal, nearly-feasible solution to the original constrained problem (Equation 1):

Theorem 1.
Define Λ := {λ ∈ ℝ₊ᵐ : ‖λ‖₁ ≤ R}, and let θ^(1), ..., θ^(T) ∈ Θ and λ^(1), ..., λ^(T) ∈ Λ be sequences of parameter vectors and Lagrange multipliers that comprise an approximate mixed Nash equilibrium, i.e.:

$$\max_{\lambda^* \in \Lambda} \frac{1}{T} \sum_{t=1}^{T} L(\theta^{(t)}, \lambda^*) - \inf_{\theta^* \in \Theta} \frac{1}{T} \sum_{t=1}^{T} L(\theta^*, \lambda^{(t)}) \le \epsilon$$

Define θ̄ as a random variable for which θ̄ = θ^(t) with probability 1/T, and let λ̄ := (Σ_{t=1}^T λ^(t))/T. Then θ̄ is nearly-optimal in expectation:

$$\mathbb{E}_{\bar\theta}\left[g(\bar\theta)\right] \le \inf_{\theta^* \in \Theta : \forall i.\, g_i(\theta^*) \le 0} g(\theta^*) + \epsilon$$

and nearly-feasible:

$$\max_{i \in [m]} \mathbb{E}_{\bar\theta}\left[g_i(\bar\theta)\right] \le \frac{\epsilon}{R - \|\bar\lambda\|_1} \tag{4}$$

Additionally, if there exists a θ′ ∈ Θ that satisfies all of the constraints with margin γ (i.e. g_i(θ′) ≤ −γ for all i ∈ [m]), then:

$$\|\bar\lambda\|_1 \le \frac{\epsilon + B_g}{\gamma}$$

where B_g ≥ sup_{θ∈Θ} g(θ) − inf_{θ∈Θ} g(θ) is a bound on the range of the objective function g.

Proof. This is a special case of Theorem 3 and Lemma 6 in Appendix A.

This theorem has a few differences from the more typically-encountered equivalence between Nash equilibria and optimal feasible solutions in the convex setting. First, it characterizes mixed equilibria, in that uniformly sampling from the sequences θ^(t) and λ^(t) can be interpreted as defining distributions over Θ and Λ. A convexity assumption would enable us to eliminate this added complexity by appealing to Jensen's inequality to replace these sequences with their averages. Second, for the technical reason that we require compact domains in order to prove convergence rates (below), Λ is taken to consist only of sets of Lagrange multipliers with bounded 1-norm.

Finally, as a consequence of this second point, the feasibility guarantee of Equation 4 only holds if the Lagrange multipliers are, on average, smaller than the maximum 1-norm radius R.
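To get a feel for these bounds, one can plug illustrative numbers (chosen for this sketch, not taken from the paper) into the guarantees of Theorem 1:

```python
# Hypothetical values for illustration only.
eps, R = 0.1, 10.0     # equilibrium gap and multiplier radius
lam_bar_norm = 4.0     # ||lam_bar||_1, assumed smaller than R

# Equation 4: max_i E[g_i(theta_bar)] <= eps / (R - ||lam_bar||_1)
feasibility_bound = eps / (R - lam_bar_norm)   # = 0.1 / 6

# Multiplier bound when a margin-gamma feasible point exists:
# ||lam_bar||_1 <= (eps + B_g) / gamma
B_g, gamma = 1.0, 0.5
multiplier_bound = (eps + B_g) / gamma         # = 2.2
```

The larger R is relative to ‖λ̄‖₁, the tighter the feasibility guarantee, which is why the final statement of Theorem 1 matters: it caps ‖λ̄‖₁ whenever a margin-feasible point exists.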
Thankfully, as is shown by the final result of Theorem 1, if there exists a point satisfying the constraints with some margin γ, then there will exist values of R that are large enough to guarantee feasibility to within O(ε).

Our proposed algorithm (Algorithm 1) requires an oracle that performs approximate non-convex minimization, similarly to Chen et al. [2017]'s algorithm for robust optimization and Agarwal et al. [2018]'s for fair classification (the latter reference uses the terminology "best response"):

Definition 3. A ρ-approximate Bayesian optimization oracle is a function O_ρ : (Θ → ℝ) → Θ for which:

$$f(O_\rho(f)) \le \inf_{\theta^* \in \Theta} f(\theta^*) + \rho$$

for any f : Θ → ℝ that can be written as a nonnegative linear combination of the objective and constraint functions g, g_1, ..., g_m.

The θ-player uses this oracle, and the λ-player uses projected gradient ascent. Notice that, unlike the oracle of Chen et al. [2017], which provides a multiplicative approximation, O_ρ provides an additive approximation. Algorithm 1's convergence rate is:

Lemma 1.
Suppose that Λ and R are as in Theorem 1 (in Appendix A, this is generalized to p-norms), and define the upper bound B_Δ ≥ max_{t∈[T]} ‖Δ_λ^(t)‖. If we run Algorithm 1 with the step size η_λ := R/(B_Δ √T), then the result satisfies the conditions of Theorem 1 for:

$$\epsilon = \rho + \frac{R B_\Delta}{\sqrt{T}}$$

where ρ is the error associated with the oracle O_ρ.

Proof. In Appendix C.3.

Combined with Theorem 1, we therefore have that if R is sufficiently large, then Algorithm 1 will converge to a distribution over Θ that is, in expectation, O(ρ)-far from being optimal and feasible at a O(1/√T) rate, where ρ is as in Definition 3.

Aside from the unrealistic oracle assumption (which will be partially addressed in Section 4), the main disadvantage of Algorithm 1 is that it results in a mixture of T models, which presumably would be far too many to use in practice. However, much smaller Nash equilibria exist:

Lemma 2. If Θ is a compact Hausdorff space, Λ is compact, and the objective and constraint functions g, g_1, ..., g_m are continuous, then the Lagrangian game (Definition 1) has a mixed Nash equilibrium pair (θ, λ) where θ is a random variable supported on at most m + 1 elements of Θ, and λ is non-random.

Proof. Follows from Theorem 5 in Appendix B.

Of course, the mere existence of such an equilibrium is insufficient—we need to be able to find it, and Algorithm 1 manifestly does not. Thankfully, we can re-formulate the problem of finding the optimal ε-feasible mixture of the θ^(t)s as a linear program (LP) that can be solved to "shrink" the support set. We must first evaluate the objective and constraint functions for every θ^(t), yielding a T-dimensional vector of objective function values, and m such vectors of constraint function evaluations, which are then used to specify the LP:

Lemma 3.
Let θ^(1), θ^(2), ..., θ^(T) ∈ Θ be a sequence of T "candidate solutions" of Equation 1. Define g⃗, g⃗_i ∈ ℝ^T such that (g⃗)_t = g(θ^(t)) and (g⃗_i)_t = g_i(θ^(t)) for i ∈ [m], and consider the linear program:

$$\min_{p \in \Delta^T} \langle p, \vec{g} \rangle \quad \text{s.t.} \quad \forall i \in [m].\; \langle p, \vec{g}_i \rangle \le \epsilon$$

where Δ^T is the T-dimensional simplex. Then every vertex p* of the feasible region—in particular an optimal one—has at most m* + 1 ≤ m + 1 nonzero elements, where m* is the number of active ⟨p*, g⃗_i⟩ ≤ ε constraints.

Proof. In Appendix B.

This result suggests a two-phase approach to optimization. In the first phase, we apply Algorithm 1, yielding a sequence of iterates for which the uniform distribution over the θ^(t)s is approximately feasible and optimal. We then apply the procedure of Lemma 3 to find the best distribution over these iterates, which in particular is guaranteed to be no worse than the uniform distribution, and is supported on at most m + 1 iterates. We'll expand upon this further in Section 5.

Algorithm 2: Optimizes the proxy-Lagrangian formulation (Definition 2) in the convex setting, with the θ-player minimizing external regret, and the λ-player minimizing swap regret. The fix(M) operation results in a stationary distribution of M (i.e. a λ ∈ Λ such that Mλ = λ, which can be derived from the top eigenvector). The function Π_Θ projects its argument onto Θ w.r.t. the Euclidean norm.

StochasticProxyLagrangian(L_θ, L_λ : Θ × Δ^{m+1} → ℝ, T ∈ ℕ, η_θ, η_λ ∈ ℝ₊):
  Initialize θ^(1) = 0, and M^(1) ∈ ℝ^{(m+1)×(m+1)} with M_{i,j} = 1/(m+1)    // Assumes 0 ∈ Θ
  For t ∈ [T]:
    Let λ^(t) = fix(M^(t))                                                    // Stationary distribution of M^(t)
    Let Δ̌_θ^(t) be a stochastic subgradient of L_θ(θ^(t), λ^(t)) w.r.t. θ
    Let Δ_λ^(t) be a stochastic gradient of L_λ(θ^(t), λ^(t)) w.r.t. λ
    Update θ^(t+1) = Π_Θ(θ^(t) − η_θ Δ̌_θ^(t))                                 // Projected SGD update
    Update M̃^(t+1) = M^(t) ⊙ .exp(η_λ Δ_λ^(t) (λ^(t))^T)                      // ⊙ and .exp are element-wise
    Project M^(t+1)_{:,i} = M̃^(t+1)_{:,i} / ‖M̃^(t+1)_{:,i}‖₁ for i ∈ [m+1]    // Column-wise projection w.r.t. KL divergence
  Return θ^(1), ..., θ^(T) and λ^(1), ..., λ^(T)

While the Lagrangian formulation can be used to solve constrained problems in the form of Equation 1, Algorithm 1 isn't actually implementable, due to its reliance on an oracle. If one wished to apply it in practice, one would need to replace the oracle with something else, and for large-scale machine learning problems, "something else" is overwhelmingly likely to be SGD [Robbins and Monro, 1951, Zinkevich, 2003] or another first-order stochastic algorithm (e.g. AdaGrad [Duchi et al., 2011] or ADAM [Kingma and Ba, 2014]).

This leads to the issue we raised in Section 1.2: for non-differentiable constraints like those in the fairness example of Equation 2, we cannot compute gradients, and therefore cannot use a first-order algorithm. "Fixing" this issue by replacing the constraints with differentiable surrogates introduces a new difficulty: solutions to the resulting problem will satisfy the surrogate constraints, rather than the actual constraints.

The proxy-Lagrangian formulation of Definition 2 sidesteps this issue by using a non-zero-sum two-player game. The λ-player chooses how much the θ-player should penalize the (differentiable) proxy constraints, but does so in such a way as to satisfy the original constraints. Unfortunately, since the proxy-Lagrangian game is non-zero-sum, we cannot expect to find a Nash equilibrium, at least not efficiently. However, the analogous result to Theorem 1 requires a weaker type of equilibrium: a joint distribution over Θ and Λ w.r.t.
which the θ-player can only make a negligible improvement compared to the best constant strategy, and the λ-player compared to the best action-swapping strategy (this is a particular type of Φ-correlated equilibrium [Rakhlin et al., 2011]):

Theorem 2.
Define M as the set of all left-stochastic (m+1) × (m+1) matrices, Λ := Δ^{m+1} as the (m+1)-dimensional simplex, and assume that each g̃_i upper bounds the corresponding g_i. Let θ^(1), ..., θ^(T) ∈ Θ and λ^(1), ..., λ^(T) ∈ Λ be sequences satisfying:

$$\frac{1}{T} \sum_{t=1}^{T} L_\theta(\theta^{(t)}, \lambda^{(t)}) - \inf_{\theta^* \in \Theta} \frac{1}{T} \sum_{t=1}^{T} L_\theta(\theta^*, \lambda^{(t)}) \le \epsilon_\theta$$

$$\max_{M^* \in \mathcal{M}} \frac{1}{T} \sum_{t=1}^{T} L_\lambda(\theta^{(t)}, M^* \lambda^{(t)}) - \frac{1}{T} \sum_{t=1}^{T} L_\lambda(\theta^{(t)}, \lambda^{(t)}) \le \epsilon_\lambda$$

Define θ̄ as a random variable for which θ̄ = θ^(t) with probability λ_1^(t) / Σ_{s=1}^T λ_1^(s), and let λ̄ := (Σ_{t=1}^T λ^(t))/T. Then θ̄ is nearly-optimal in expectation:

$$\mathbb{E}_{\bar\theta}\left[g(\bar\theta)\right] \le \inf_{\theta^* \in \Theta : \forall i.\, \tilde{g}_i(\theta^*) \le 0} g(\theta^*) + \frac{\epsilon_\theta + \epsilon_\lambda}{\bar\lambda_1} \tag{5}$$

and nearly-feasible:

$$\max_{i \in [m]} \mathbb{E}_{\bar\theta}\left[g_i(\bar\theta)\right] \le \frac{\epsilon_\lambda}{\bar\lambda_1} \tag{6}$$

Additionally, if there exists a θ′ ∈ Θ that satisfies all of the proxy constraints with margin γ (i.e. g̃_i(θ′) ≤ −γ for all i ∈ [m]), then:

$$\bar\lambda_1 \ge \frac{\gamma - \epsilon_\theta - \epsilon_\lambda}{\gamma + B_g}$$

where B_g ≥ sup_{θ∈Θ} g(θ) − inf_{θ∈Θ} g(θ) is a bound on the range of the objective function g.

Proof. This is a special case of Theorem 4 and Lemma 7 in Appendix A.

Notice that while Equation 6 guarantees feasibility w.r.t. the original constraints, the comparator in Equation 5 is feasible w.r.t. the proxy constraints. Hence, the overall guarantee is no better than what we would achieve if we took g_i := g̃_i for all i ∈ [m], and optimized the Lagrangian as in Section 3. However, as will be demonstrated experimentally in Section 6.2, because the feasible region w.r.t. the original constraints is larger (perhaps significantly so) than that w.r.t. the proxy constraints, the proxy-Lagrangian approach has more "room" to find a better solution in practice.

One key difference between this result and Theorem 1 is that the R parameter is absent. Instead, its role, and that of ‖λ̄‖₁, is played by the first coordinate of λ̄. Inspection of Definition 2 reveals that, if one or more of the constraints are violated, then the λ-player would prefer λ_1 to be zero, whereas if they are satisfied (with some margin), then it would prefer λ_1 to be one. In other words, the first coordinate of λ^(t) encodes the λ-player's belief about the feasibility of θ^(t), for which reason θ^(t) is weighted by λ_1^(t) in the density defining θ̄.

Algorithm 2 is motivated by the observation that, while Theorem 2 only requires that the θ^(t) sequence suffer low external regret w.r.t. L_θ(·, λ^(t)), the condition on the λ^(t) sequence is stronger, requiring it to suffer low swap regret [Blum and Mansour, 2007]. Hence, the θ-player uses SGD to minimize external regret, while the λ-player uses a swap-regret minimization algorithm of the type proposed by Gordon et al. [2008], yielding the convergence guarantee:

Lemma 4.
Suppose that Θ is a compact convex set, M and Λ are as in Theorem 2, and that the objective and proxy constraint functions g, g̃_1, ..., g̃_m are convex (but not necessarily g_1, ..., g_m). Define the three upper bounds B_Θ ≥ max_{θ∈Θ} ‖θ‖, B_Δ̌ ≥ max_{t∈[T]} ‖Δ̌_θ^(t)‖, and B_Δ ≥ max_{t∈[T]} ‖Δ_λ^(t)‖_∞.

If we run Algorithm 2 with the step sizes η_θ := B_Θ/(B_Δ̌ √T) and η_λ := √((m+1) ln(m+1)/T)/B_Δ, then the result satisfies the conditions of Theorem 2 for:

$$\epsilon_\theta = 2 B_\Theta B_{\check\Delta} \sqrt{\frac{\ln(2/\delta)}{T}}$$

$$\epsilon_\lambda = 2 B_\Delta \sqrt{\frac{(m+1) \ln(m+1) \ln(2/\delta)}{T}}$$

with probability 1 − δ over the draws of the stochastic (sub)gradients.

Proof. In Appendix C.3.

Algorithm 2 is designed for the convex setting (except for the g_i s), for which reason it uses SGD for the θ-updates. However, this convexity requirement is not innate to our approach: it's straightforward to design an oracle-based algorithm that, like Algorithm 1, doesn't require convexity. Our reason for presenting the SGD-based algorithm, instead of the oracle-based one, is that the purpose of proxy constraints is to substitute optimizable constraints for unoptimizable ones, and there is no need to do so if you have an oracle.

It turns out that the same existence result that we provided for the Lagrangian game (Lemma 2)—of a
Nash equilibrium—holds for the proxy-Lagrangian:
Lemma 5. If Θ is a compact Hausdorff space and the objective, constraint and proxy constraint functions g, g_1, ..., g_m, g̃_1, ..., g̃_m are continuous, then the proxy-Lagrangian game (Definition 2) has a mixed Nash equilibrium pair (θ, λ) where θ is a random variable supported on at most m + 1 elements of Θ, and λ is non-random.

Proof. In Appendix B.

Furthermore, the exact same linear programming procedure of Lemma 3 can be applied (with the g⃗_i s being defined in terms of the original—not proxy—constraints) to yield a solution with support size m + 1, and it works equally well. This is easy to verify: since θ̄, as defined in Theorem 2, is a distribution over the θ^(t)s, and is therefore feasible for the LP, the best distribution over the iterates will be at least as good.

The pieces are now in place to propose a complete two-phase optimization procedure, for both convex and non-convex problems, with or without proxy constraints. In the first phase, we apply the appropriate algorithm to yield a distribution over the T "candidates" θ^(1), ..., θ^(T) that is approximately feasible and optimal, according to either Theorem 1 or Theorem 2. Then, in the second phase, we construct g⃗, g⃗_1, ..., g⃗_m ∈ ℝ^T by evaluating the objective and constraint functions for each θ^(t), and then optimize the LP of Lemma 3 to find the best distribution over θ^(1), ..., θ^(T) (which will have support size ≤ m + 1). If we take the ε parameter of this LP to be either the RHS of Equation 4 in Theorem 1 (for the Lagrangian case), or of Equation 6 in Theorem 2 (for the proxy-Lagrangian case), then the resulting size-(m+1) distribution will have the same guarantees as the original.

Practical Procedure:
The approach outlined above provably works, but is still somewhat idealized. In practice, we'll dispense with the oracle O_ρ—even on non-convex problems—in favor of the "typical" approach: pretending that the problem is convex, and using SGD (or another cheap stochastic algorithm) for the θ-updates. On a non-convex problem, this has no guarantees, but one would still hope that it would result in a "candidate set" of θ^(t) s that contains enough good solutions to pass on to the LP of Lemma 3. If necessary, this candidate set can first be subsampled to make it a reasonable size. To choose the ǫ parameter of the LP, we propose using a bisection search to find the smallest ǫ ≥ 0 for which there exists a feasible solution.

Evaluation:
The ultimate result of either of these procedures is a distribution over at most m + 1 distinct θs. If the underlying problem is one of classification, with f(·; θ) being the scoring function, then this distribution defines a stochastic classifier: at evaluation time, upon receiving an example x, we would sample θ, and then return f(x; θ). If a stochastic classifier is not acceptable (as is often the case in real-world applications), then one could heuristically convert it into a deterministic one, e.g. by weighted averaging or voting, which is made significantly easier by its small size.

This is Algorithm 4, with Lemma 11 being its convergence guarantee, both in Appendix C.3. In the Lagrangian case, this is Algorithm 3, with Lemma 10 being its convergence guarantee in the convex setting, both in Appendix C.3. In the proxy-Lagrangian case, this is Algorithm 2.

Uniform Distribution Baseline (θ̄)   2.58  2.66  2.01  2.52  1.06  1.45  0.43  1.10
Baseline (θ^(T))                     1.77  1.92  1.77  1.75  0.04  0.40  0.01  0.08
Lagrangian (θ̄)                      2.04  2.15  1.96  2.04  0.42  0.70  0.30  0.49
Lagrangian (LP)                      1.66  1.67  1.63  1.62  0.00  0.01  0.00  0.00

Table 2: Support sizes, test error rates, and "equal opportunity" values for the experiments of Section 6.2. For the constraints, each reported number is the ratio of the positive prediction rate on positively-labeled members of the protected class, to the positive prediction rate on the set of all positively-labeled data. The constraints attempt to force this ratio to be at least 95%—quantities lower than this threshold violate the constraint, and are marked in bold.

                   Support  Error  | Testing: Female  Male   Black  White | Training: Female  Male   Black  White
Baseline (θ^(T))   1        14.2%  | …
Lagrangian (θ̄)    100      16.3%  | 114%   97.5%  126%  99.8%  | 113%   97.8%  121%  99.7%
Lagrangian (LP)    3        15.5%  | 106%   99.0%  111%  101%   | 104%   99.4%  105%  101%
Proxy (θ̄)         100      14.4%  | …

We present two experiments: the first, on the robust MNIST problem of Chen et al.
[2017], tests the performance of the "practical procedure" of Section 5 using the Lagrangian formulation (with the norms of the Lagrange multipliers being unbounded, i.e. R = ∞), while the second, a fairness problem on the UCI Adult dataset [Dheeru and Karra Taniskidou, 2017], uses the proxy-Lagrangian formulation. Both were implemented in TensorFlow (source code: https://github.com/tensorflow/tensorflow/tree/r1.10/tensorflow/contrib/constrained_optimization).

In both cases, the θ- and λ-updates used AdaGrad with the same initial learning rates. In the proxy-Lagrangian case, however, the λ-update of Algorithm 2 was performed in the log domain so that it would be multiplicative. To choose the initial AdaGrad learning rate, we performed a grid search over powers-of-two, and chose the best model on a validation set. In all experiments, the optimum was in the interior of the grid.

Our constrained optimization algorithms result in stochastic classifiers, and we report results for both the θ̄ of Theorem 1 or 2 (in the Lagrangian or proxy-Lagrangian cases, respectively), and the optimal distribution found by the LP of Lemma 3, optimized on the training dataset.

In robust optimization, there are multiple objective functions g_1, ..., g_m : Θ → R, and the goal is to find a θ ∈ Θ minimizing max_{i∈[m]} g_i(θ). As was discussed in Section 2, this can be re-written as a constrained problem by introducing a slack variable, as in Equation 3.

The task is the modified MNIST problem created by Chen et al. [2017], which is based on four datasets, each of which is a version of MNIST that has been corrupted in a different way. One would therefore hope that choosing g_i to be a loss evaluated on the i-th such dataset, and optimizing the corresponding robust problem, will result in a classifier that is "robust" to all four types of corruption.

We used a neural network with a single hidden layer and ReLU activations.
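The "practical procedure" used in this experiment can be sketched end-to-end: simultaneous SGD on θ and projected gradient ascent on the Lagrange multiplier, snapshotting a candidate θ^(t) periodically. The following is a minimal numpy illustration on a toy convex problem; the problem, step sizes, and snapshot interval are our own placeholders, not the paper's experimental settings:

```python
import numpy as np

# Toy problem (our own, for illustration): minimize ||theta||^2
# subject to g1(theta) = 1 - theta[0] <= 0. Optimum: theta = (1, 0), objective 1.
def g0(theta): return float(theta @ theta)
def g1(theta): return 1.0 - theta[0]

theta = np.zeros(2)
lam = 0.0                       # Lagrange multiplier (unbounded above, i.e. R = infinity)
eta_theta, eta_lam = 0.05, 0.1  # placeholder step sizes
candidates = []

for t in range(2000):
    # theta-player: SGD step on the Lagrangian g0 + lam * g1.
    grad_theta = 2.0 * theta + lam * np.array([-1.0, 0.0])
    theta = theta - eta_theta * grad_theta
    # lambda-player: projected gradient *ascent* on the same Lagrangian.
    lam = max(0.0, lam + eta_lam * g1(theta))
    if t % 20 == 0:             # snapshot a candidate theta^(t)
        candidates.append(theta.copy())

# Late candidates should be nearly feasible and nearly optimal; a shrinking
# LP (Lemma 3) would then pick the best distribution over them.
best = min((c for c in candidates if g1(c) <= 1e-2), key=g0)
assert abs(g0(best) - 1.0) < 0.1
```

The two updates here are a plain Lagrangian game; the proxy-Lagrangian variant would instead use g̃_i in the θ-player's gradient while updating the multipliers with the original g_i.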
The four objective functions were the cross-entropy losses on the corrupted datasets. All models were trained for 50,000 iterations using a minibatch size of …, and a θ^(t) was extracted every 500 iterations, yielding a sequence of length T = 100.

Baselines:
For our baselines, we trained the neural network over the union of the four datasets. We report two variants: (i) the "Uniform Distribution Baseline" of Chen et al. [2017], a stochastic classifier sampled uniformly over the θ^(t) s (like our θ̄ classifier), and (ii) a non-stochastic classifier taking its parameters from the last iterate θ^(T).

Results:
Table 1 lists, for each of the corrupted datasets, the error rates of the compared models on both the training and testing datasets. Interestingly, although our proposed shrinking procedure is only guaranteed to give a distribution over m + 1 solutions, in these experiments it chose only one. Hence, the "Lagrangian (LP)" model of Table 1 is, like "Baseline (θ^(T))", non-stochastic.

While we did not quite match the raw performance reported by Chen et al. [2017]'s algorithm, our results, and theirs, tell similar stories. In particular, we can see that both of our algorithms outperformed their natural baseline equivalents. Moreover, the use of shrinking not only greatly simplified the model, but also significantly improved performance.

These experiments were performed on the UCI Adult dataset, which consists of census data including features such as age, gender, race, occupation, and education. The goal was to predict whether income exceeds 50k/year. The dataset contains 32,561 training examples, from which we split off 20% to form a validation set, and 16,281 testing examples.

We dropped the "fnlwgt" weighting feature, and processed the remaining features as in Platt [1998], yielding binary features, on which we trained linear models. The objective was to minimize the average hinge loss, subject to one 95% equal opportunity [Hardt et al., 2016] constraint in the style of Goh et al. [2016] for each "protected class": g_i was defined such that g_i(θ) ≤ 0 iff the positive prediction rate on the set of positively-labeled examples for the associated class was at least 95% of the positive prediction rate on the set of all positively-labeled examples.

When using proxy constraints, g̃_i was taken to be a version of g_i with the indicator functions defining the positive prediction rates replaced with hinge upper bounds. When not using proxy constraints, the indicator-based constraints were dropped entirely, with these upper bounds being used throughout.

All models were trained for … iterations with a minibatch size of …, with a θ^(t) being extracted every … iterations, yielding a sequence of length T = 100.

Baseline:
The baseline classifier was optimized to simply minimize the training hinge loss. Since this problem is unconstrained, we took the last iterate θ^(T).

"Best-model" Heuristic: For hyperparameter tuning using a grid search, we needed to choose the "best" model on the validation set. Due to the presence of constraints, however, the "best" model was not necessarily that with the lowest validation error. Instead, we used the following heuristic: the models were each ranked in terms of their objective function value, as well as the magnitude of the i-th constraint violation (i.e. max{0, g_i(θ)}). The "score" of each model was then taken to be the maximal such rank, and the model with the lowest score was chosen, with the objective function serving as a tiebreaker.

Results:
Table 2 lists the test error rates, (indicator-based) constraint function values on both the training and testing datasets, and support sizes of the stochastic classifiers, for each of the compared algorithms. The "LP" versions of our models, which were found using the shrinking procedure of Lemma 3, uniformly outperformed their θ̄-analogues. We can see, however, that the generalization issue discussed in Section 2.1 caused the proxy-Lagrangian LP model to slightly violate the constraints on the testing dataset, despite satisfying them on the training dataset. The non-proxy algorithms satisfied all constraints, on both the training and testing datasets, because there was sufficient "room" between the hinge upper bound that they actually constrained, and the true constraint, to absorb the generalization error. Inspection of the error rates, however, reveals that the relaxed constraints were so overly-conservative that satisfying them significantly damaged classification performance. In contrast, our proxy-Lagrangian approach matched the classification performance of the unconstrained baseline.

Acknowledgments

We thank Seungil You for initially posing the question of whether constraint functions could be relaxed for only the θ-player, as well as Maya Gupta, Taman Narayan and Serena Wang for helping to develop the heuristic used to choose the "best" model on the validation dataset in Section 6.2.

References
Alekh Agarwal, Alina Beygelzimer, Miroslav Dudík, John Langford, and Hanna M. Wallach. A reductions approach to fair classification. In ICML’18, pages 60–69, 2018.
Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(6):121–164, 2012.
Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett., 31(3):167–175, May 2003.
Dan Biddle. Adverse Impact and Test Validation: A Practitioner’s Guide to Valid and Defensible Employment Testing. Gower, 2005.
Avrim Blum and Yishay Mansour. From external to internal regret. JMLR, 8:1307–1324, 2007.
H. F. Bohnenblust, Samuel Karlin, and L. S. Shapley. Games with continuous, convex pay-off. Contributions to the Theory of Games, 1(24):181–192, 1950.
Robert S. Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In NIPS’17, 2017.
Xi Chen and Xiaotie Deng. Settling the complexity of two-player Nash equilibrium. In FOCS’06, pages 261–272. IEEE, 2006.
Paul Christiano, Jonathan A. Kelner, Aleksander Madry, Daniel A. Spielman, and Shang-Hua Teng. Electrical flows, Laplacian systems, and faster approximation of maximum flow in undirected graphs. In STOC’11, pages 273–282, 2011.
Andrew Cotter, Maya Gupta, and Jan Pfeifer. A Light Touch for heavily constrained SGD. In COLT’16, pages 729–771, 2016.
Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, and Seungil You. Training well-generalizing classifiers for fairness metrics and other data-dependent constraints, 2018. URL https://arxiv.org/abs/1807.00028.
Mark Davenport, Richard G. Baraniuk, and Clayton D. Scott. Tuning support vector machines for minimax and Neyman-Pearson classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12(Jul):2121–2159, 2011.
Elad Eban, Mariano Schain, Alan Mackey, Ariel Gordon, Rif A. Saurous, and Gal Elidan. Scalable learning of non-decomposable objectives. AIStats’17, 2017.
Dan Garber and Elad Hazan. Playing non-linear games with linear oracles. In FOCS, pages 420–428. IEEE Computer Society, 2013.
Gilles Gasso, Aristidis Pappaionannou, Marina Spivak, and Léon Bottou. Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology, 2011.
I. L. Glicksberg. A further generalization of the Kakutani fixed point theorem with application to Nash equilibrium points. Proc. Amer. Math. Soc., 3:170–174, 1952.
Gabriel Goh, Andrew Cotter, Maya Gupta, and Michael P. Friedlander. Satisfying real-world goals with dataset constraints. In NIPS, pages 2415–2423, 2016.
Geoffrey J. Gordon, Amy Greenwald, and Casey Marks. No-regret learning in convex games. In ICML’08, pages 360–367, 2008.
Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In NIPS, 2016.
Elad Hazan and Satyen Kale. Projection-free online learning. In ICML’12, 2012.
Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML’13, volume 28, pages 427–435, 2013.
Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness, 2017. URL https://arxiv.org/abs/1711.05144.
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR’14, 2014.
Mehrdad Mahdavi, Tianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi. Stochastic gradient descent with only one projection. In NIPS’12, pages 494–502, 2012.
Harikrishna Narasimhan. Learning with complex loss functions and constraints. In AIStats, 2018.
Arkadi Nemirovski and D. B. Yudin. Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd, 1983.
T. Parthasarathy and T. E. S. Raghavan. Equilibria of continuous two-person games. Pacific Journal of Mathematics, 57(1):265–270, 1975.
John C. Platt. Fast training of support vector machines using Sequential Minimal Optimization. In Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
Alexander Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In NIPS’13, pages 3066–3074, 2013.
Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Beyond regret. In COLT’11, pages 559–594, 2011.
Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In NIPS’11, 2011.
John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
Matthew S. Vuolo and Norma B. Levy. Disparate impact doctrine in fair housing. New York Law Journal, 2013.
Wikipedia. Housing for Older Persons Act — Wikipedia, the free encyclopedia, 2018. URL https://en.wikipedia.org/w/index.php?title=Housing_for_Older_Persons_Act&oldid=809132145. [Online; accessed 6-February-2018].
Blake E. Woodworth, Suriya Gunasekar, Mesrob I. Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. In COLT’17, pages 1920–1953, 2017.
Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML’03, 2003.
A Proofs of Sub{optimality,feasibility} Guarantees
Theorem 3. (Lagrangian Sub{optimality,feasibility})
Define
Λ = {λ ∈ R^m_+ : ‖λ‖_p ≤ R}, and consider the Lagrangian of Equation 1 (Definition 1). Suppose that θ ∈ Θ and λ ∈ Λ are random variables such that:

    max_{λ*∈Λ} E_θ[L(θ, λ*)] − inf_{θ*∈Θ} E_λ[L(θ*, λ)] ≤ ǫ    (7)

i.e. (θ, λ) is an ǫ-approximate Nash equilibrium. Then θ is ǫ-suboptimal:

    E_θ[g_0(θ)] ≤ inf_{θ*∈Θ : ∀i∈[m]. g_i(θ*)≤0} g_0(θ*) + ǫ

Furthermore, if λ is in the interior of Λ, in the sense that ‖λ̄‖_p < R where λ̄ := E_λ[λ], then θ is ǫ/(R − ‖λ̄‖_p)-feasible:

    ‖(E_θ[g_:(θ)])_+‖_q ≤ ǫ / (R − ‖λ̄‖_p)

where g_:(θ) is the m-dimensional vector of constraint evaluations, and (·)_+ takes the positive part of its argument, so that ‖(E_θ[g_:(θ)])_+‖_q is the q-norm of the vector of expected constraint violations.

Proof. First notice that L is linear in λ, so:

    max_{λ*∈Λ} E_θ[L(θ, λ*)] − inf_{θ*∈Θ} L(θ*, λ̄) ≤ ǫ    (8)

Optimality:
Choose θ* to be the optimal feasible solution in Equation 8, so that g_i(θ*) ≤ 0 for all i ∈ [m], and also choose λ* = 0, which combined with the definition of L (Definition 1) gives that:

    E_θ[g_0(θ)] − g_0(θ*) ≤ ǫ

which is the optimality claim.

Feasibility:
Choose θ* = θ in Equation 8. By the definition of L (Definition 1):

    max_{λ*∈Λ} Σ_{i=1}^m λ*_i E_θ[g_i(θ)] − Σ_{i=1}^m λ̄_i E_θ[g_i(θ)] ≤ ǫ

Then by the definition of a dual norm, Hölder's inequality, and the assumption that ‖λ̄‖_p < R:

    R ‖(E_θ[g_:(θ)])_+‖_q − ‖λ̄‖_p ‖(E_θ[g_:(θ)])_+‖_q ≤ ǫ

Rearranging terms gives the feasibility claim.
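The LP of Lemma 3, which consumes an ǫ like the one bounded here, is small enough to sketch concretely. A minimal illustration using scipy.optimize.linprog on synthetic candidate evaluations follows; the toy data, sizes, and tolerance are our own assumptions, and we rely on the solver returning a vertex (basic) solution:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
T, m = 100, 2                 # T candidates, m constraints (toy sizes)
g0 = rng.normal(size=T)       # objective value g_0(theta^(t)) per candidate
G = rng.normal(size=(m, T))   # constraint values g_i(theta^(t))
eps = 0.1                     # would come from Theorem 1/2's RHS in practice

# LP of Lemma 3: min <p, g0>  s.t.  <p, g_i> <= eps,  p in the T-simplex.
res = linprog(
    c=g0,
    A_ub=G, b_ub=np.full(m, eps),      # linearized functional constraints
    A_eq=np.ones((1, T)), b_eq=[1.0],  # sum-to-one constraint
    bounds=[(0, None)] * T,            # nonnegativity constraints
    method="highs",
)
assert res.status == 0
p = res.x
support = np.flatnonzero(p > 1e-9)
# A vertex of the feasible region has at most m + 1 nonzero coordinates.
assert len(support) <= m + 1
```

In practice one would wrap this in a bisection search over ǫ, as described in Section 5's practical procedure.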
Lemma 6.
In the context of Theorem 3, suppose that there exists a θ′ ∈ Θ that satisfies all of the constraints, and does so with q-norm margin γ, i.e. g_i(θ′) ≤ 0 for all i ∈ [m] and ‖g_:(θ′)‖_q ≥ γ. Then:

    ‖λ̄‖_p ≤ (ǫ + B_{g_0}) / γ

where B_{g_0} ≥ sup_{θ∈Θ} g_0(θ) − inf_{θ∈Θ} g_0(θ) is a bound on the range of the objective function g_0.

Proof. Starting from Equation 7 (in Theorem 3), and choosing θ* = θ′ and λ* = 0:

    ǫ ≥ E_θ[g_0(θ)] − E_λ[g_0(θ′) + Σ_{i=1}^m λ_i g_i(θ′)]
    ǫ ≥ E_θ[g_0(θ) − inf_{θ′∈Θ} g_0(θ′)] − (g_0(θ′) − inf_{θ′∈Θ} g_0(θ′)) + γ ‖λ̄‖_p
    ǫ ≥ −B_{g_0} + γ ‖λ̄‖_p

Solving for ‖λ̄‖_p yields the claim.

Theorem 4. (Proxy-Lagrangian Sub{optimality,feasibility})
Let M be the set of all left-stochastic (m+1)×(m+1) matrices (i.e. M := {M ∈ R^{(m+1)×(m+1)} : ∀i ∈ [m+1]. M_{:,i} ∈ Δ^{m+1}}), and consider the "proxy-Lagrangians" of Equation 1 (Definition 2). Suppose that θ ∈ Θ and λ ∈ Λ are jointly distributed random variables such that:

    E_{θ,λ}[L_θ(θ, λ)] − inf_{θ*∈Θ} E_λ[L_θ(θ*, λ)] ≤ ǫ_θ    (9)
    max_{M*∈M} E_{θ,λ}[L_λ(θ, M*λ)] − E_{θ,λ}[L_λ(θ, λ)] ≤ ǫ_λ

Define λ̄_1 := E_λ[λ_1], let (Ω, F, P) be the probability space, and define a random variable θ̄ such that:

    Pr{θ̄ ∈ S} = (∫_{θ^{−1}(S)} λ_1(x) dP(x)) / (∫_Ω λ_1(x) dP(x))

In words, θ̄ is a version of θ that has been resampled with λ_1 being treated as an importance weight. In particular, E_θ̄[f(θ̄)] = E_{θ,λ}[λ_1 f(θ)] / λ̄_1 for any f : Θ → R. Then θ̄ is nearly-optimal:

    E_θ̄[g_0(θ̄)] ≤ inf_{θ*∈Θ : ∀i∈[m]. g̃_i(θ*)≤0} g_0(θ*) + (ǫ_θ + ǫ_λ) / λ̄_1

and nearly-feasible:

    ‖(E_θ̄[g_:(θ̄)])_+‖_∞ ≤ ǫ_λ / λ̄_1

Notice the optimality inequality is weaker than it may appear, since the comparator in this equation is not the optimal solution w.r.t. the constraints g_i, but rather w.r.t. the proxy constraints g̃_i.

Proof. Optimality:
If we choose M* to be the matrix with its first row being all-one, and all other rows being all-zero, then L_λ(θ, M*λ) = 0, which shows that the first term in the LHS of the second line of Equation 9 is nonnegative. Hence, −E_{θ,λ}[L_λ(θ, λ)] ≤ ǫ_λ, so by the definition of L_λ (Definition 2), and the fact that g̃_i ≥ g_i:

    E_{θ,λ}[Σ_{i=1}^m λ_{i+1} g̃_i(θ)] ≥ −ǫ_λ

Notice that L_θ is linear in λ, so the first line of Equation 9, combined with the above result and the definition of L_θ (Definition 2), becomes:

    E_{θ,λ}[λ_1 g_0(θ)] − inf_{θ*∈Θ} (λ̄_1 g_0(θ*) + Σ_{i=1}^m λ̄_{i+1} g̃_i(θ*)) ≤ ǫ_θ + ǫ_λ    (10)

Choose θ* to be the optimal solution that satisfies the proxy constraints g̃, so that g̃_i(θ*) ≤ 0 for all i ∈ [m]. Then:

    E_{θ,λ}[λ_1 g_0(θ)] − λ̄_1 g_0(θ*) ≤ ǫ_θ + ǫ_λ

Feasibility:
We'll simplify our notation by defining ℓ_1(θ) := 0 and ℓ_{i+1}(θ) := g_i(θ) for i ∈ [m], so that L_λ(θ, λ) = ⟨λ, ℓ_:(θ)⟩. Consider the first term in the LHS of the second line of Equation 9:

    max_{M*∈M} E_{θ,λ}[L_λ(θ, M*λ)] = max_{M*∈M} E_{θ,λ}[⟨M*λ, ℓ_:(θ)⟩]
        = max_{M*∈M} E_{θ,λ}[Σ_{i=1}^{m+1} Σ_{j=1}^{m+1} M*_{j,i} λ_i ℓ_j(θ)]
        = Σ_{i=1}^{m+1} max_{M*_{:,i}∈Δ^{m+1}} Σ_{j=1}^{m+1} E_{θ,λ}[M*_{j,i} λ_i ℓ_j(θ)]
        = Σ_{i=1}^{m+1} max_{j∈[m+1]} E_{θ,λ}[λ_i ℓ_j(θ)]

where we used the fact that, since M* is left-stochastic, each of its columns is an (m+1)-dimensional multinoulli distribution. For the second term in the LHS of the second line of Equation 9, we can use the fact that ℓ_1(θ) = 0:

    E_{θ,λ}[Σ_{i=2}^{m+1} λ_i ℓ_i(θ)] ≤ Σ_{i=2}^{m+1} max_{j∈[m+1]} E_{θ,λ}[λ_i ℓ_j(θ)]

Plugging these two results into the second line of Equation 9, the two sums collapse, leaving:

    max_{i∈[m+1]} E_{θ,λ}[λ_1 ℓ_i(θ)] ≤ ǫ_λ

The definition of ℓ_i then yields the feasibility claim.

Lemma 7.
In the context of Theorem 4, suppose that there exists a θ′ ∈ Θ that satisfies all of the proxy constraints with margin γ, i.e. g̃_i(θ′) ≤ −γ for all i ∈ [m]. Then:

    λ̄_1 ≥ (γ − ǫ_θ − ǫ_λ) / (γ + B_{g_0})

where B_{g_0} ≥ sup_{θ∈Θ} g_0(θ) − inf_{θ∈Θ} g_0(θ) is a bound on the range of the objective function g_0.

Proof. Starting from Equation 10 (in the proof of Theorem 4), and choosing θ* = θ′:

    E_{θ,λ}[λ_1 g_0(θ)] − (λ̄_1 g_0(θ′) + Σ_{i=1}^m λ̄_{i+1} g̃_i(θ′)) ≤ ǫ_θ + ǫ_λ

Since g̃_i(θ′) ≤ −γ for all i ∈ [m]:

    ǫ_θ + ǫ_λ ≥ E_{θ,λ}[λ_1 g_0(θ)] − λ̄_1 g_0(θ′) + (1 − λ̄_1) γ
        ≥ E_{θ,λ}[λ_1 (g_0(θ) − inf_{θ′∈Θ} g_0(θ′))] − λ̄_1 (g_0(θ′) − inf_{θ′∈Θ} g_0(θ′)) + (1 − λ̄_1) γ
        ≥ −λ̄_1 B_{g_0} + (1 − λ̄_1) γ

Solving for λ̄_1 yields the claim.

B Proofs of Existence of Sparse Equilibria
Theorem 5.
Consider a two-player game, played on the compact Hausdorff spaces Θ and Λ ⊆ R^m. Imagine that the θ-player wishes to minimize L_θ : Θ × Λ → R, and the λ-player wishes to maximize L_λ : Θ × Λ → R, with both of these functions being continuous in θ and linear in λ. Then there exists a Nash equilibrium θ, λ:

    E_θ[L_θ(θ, λ)] = min_{θ*∈Θ} L_θ(θ*, λ)
    E_θ[L_λ(θ, λ)] = max_{λ*∈Λ} E_θ[L_λ(θ, λ*)]

where θ is a random variable placing nonzero probability mass on at most m + 1 elements of Θ, and λ ∈ Λ is non-random.

Proof. There are some results extremely similar to (and in some ways more general than) this one in the game theory literature [e.g. Bohnenblust et al., 1950, Parthasarathy and Raghavan, 1975], but for our particular (Lagrangian and proxy-Lagrangian) setting it's possible to provide a fairly straightforward proof.

To begin with, Glicksberg [1952] gives that there exists a mixed strategy in the form of two random variables θ̃ and λ̃:

    E_{θ̃,λ̃}[L_θ(θ̃, λ̃)] = min_{θ*∈Θ} E_λ̃[L_θ(θ*, λ̃)]
    E_{θ̃,λ̃}[L_λ(θ̃, λ̃)] = max_{λ*∈Λ} E_θ̃[L_λ(θ̃, λ*)]

Since both functions are linear in λ̃, we can define λ := E_λ̃[λ̃], and these conditions become:

    E_θ̃[L_θ(θ̃, λ)] = min_{θ*∈Θ} L_θ(θ*, λ) =: ℓ_min
    E_θ̃[L_λ(θ̃, λ)] = max_{λ*∈Λ} E_θ̃[L_λ(θ̃, λ*)]

Let's focus on the first condition. Let p_ǫ := Pr{L_θ(θ̃, λ) ≥ ℓ_min + ǫ}, and notice that p_{1/n} must equal zero for any n ∈ {1, 2, ...} (otherwise we would contradict the above), implying by the countable additivity of measures that Pr{L_θ(θ̃, λ) = ℓ_min} = 1. We therefore assume henceforth, without loss of generality, that the support of θ̃ consists entirely of minimizers of L_θ(·, λ).
Let S ⊆ Θ be this support set. Define G := {∇_λ̃ L_λ(θ′, λ) : θ′ ∈ S}, and take Ḡ to be the closure of the convex hull of G. Since E_θ̃[∇_λ̃ L_λ(θ̃, λ)] ∈ Ḡ ⊆ R^m, we can write it as a convex combination of at most m + 1 extreme points of Ḡ, or equivalently of m + 1 elements of G. Hence, we can take θ to be a discrete random variable that places nonzero mass on at most m + 1 elements of S, and:

    E_θ[∇_λ̃ L_λ(θ, λ)] = E_θ̃[∇_λ̃ L_λ(θ̃, λ)]

Linearity in λ then implies that E_θ[L_λ(θ, ·)] and E_θ̃[L_λ(θ̃, ·)] are the same function (up to a constant), and therefore have the same maximizer(s). Correspondingly, θ is supported on S, which contains only minimizers of L_θ(·, λ) by construction.

Lemma 5. If Θ is a compact Hausdorff space and the objective, constraint and proxy constraint functions g_0, g_1, ..., g_m, g̃_1, ..., g̃_m are continuous, then the proxy-Lagrangian game (Definition 2) has a mixed Nash equilibrium pair (θ, λ) where θ is a random variable supported on at most m + 1 elements of Θ, and λ is non-random.

Proof. Applying Theorem 5 directly would result in a support size of m + 2, rather than the desired m + 1, since Λ is (m + 1)-dimensional. Instead, we define Λ̃ := {λ̃ ∈ R^m_+ : ‖λ̃‖_1 ≤ 1} as the space containing the last m coordinates of Λ. Then we can rewrite the proxy-Lagrangian functions L̃_θ, L̃_λ : Θ × Λ̃ → R as:

    L̃_θ(θ, λ̃) = (1 − ‖λ̃‖_1) g_0(θ) + Σ_{i=1}^m λ̃_i g̃_i(θ)
    L̃_λ(θ, λ̃) = Σ_{i=1}^m λ̃_i g_i(θ)

These functions are linear in λ̃, which lies in an m-dimensional space, so the conditions of Theorem 5 apply, yielding the claimed result.

Lemma 3.
Let θ^(1), θ^(2), ..., θ^(T) ∈ Θ be a sequence of T "candidate solutions" of Equation 1. Define g⃗_0, g⃗_i ∈ R^T such that (g⃗_0)_t = g_0(θ^(t)) and (g⃗_i)_t = g_i(θ^(t)) for i ∈ [m], and consider the linear program:

    min_{p∈Δ^T} ⟨p, g⃗_0⟩  s.t.  ∀i ∈ [m]. ⟨p, g⃗_i⟩ ≤ ǫ

where Δ^T is the T-dimensional simplex. Then every vertex p* of the feasible region—in particular an optimal one—has at most m* + 1 ≤ m + 1 nonzero elements, where m* is the number of active ⟨p*, g⃗_i⟩ ≤ ǫ constraints.

Proof. The linear program contains not only the m explicit linearized functional constraints, but also, since p ∈ Δ^T, the T nonnegativity constraints p_t ≥ 0, and the sum-to-one constraint Σ_{t=1}^T p_t = 1.

Since p is T-dimensional, every vertex p* of the feasible region must include T active constraints. Letting m* ≤ m be the number of active linearized functional constraints, and accounting for the sum-to-one constraint, it follows that at least T − m* − 1 nonnegativity constraints are active, implying that p* contains at most m* + 1 nonzero elements.

C Proofs of Convergence Rates
C.1 Non-Stochastic One-Player Convergence Rates
Theorem 6. (Mirror Descent)
Let f_1, f_2, ... : Θ → R be a sequence of convex functions that we wish to minimize on a compact convex set Θ. Suppose that the "distance generating function" Ψ : Θ → R_+ is nonnegative and 1-strongly convex w.r.t. a norm ‖·‖ with dual norm ‖·‖_*.

Define the step size η = √(B_Ψ / T) / B_∇̌, where B_Ψ ≥ max_{θ∈Θ} Ψ(θ) is a uniform upper bound on Ψ, and B_∇̌ ≥ ‖∇̌f_t(θ^(t))‖_* is a uniform upper bound on the norms of the subgradients. Suppose that we perform T iterations of the following update, starting from θ^(1) = argmin_{θ∈Θ} Ψ(θ):

    θ̃^(t+1) = ∇Ψ*(∇Ψ(θ^(t)) − η ∇̌f_t(θ^(t)))
    θ^(t+1) = argmin_{θ∈Θ} D_Ψ(θ ‖ θ̃^(t+1))

where ∇̌f_t(θ^(t)) ∈ ∂f_t(θ^(t)) is a subgradient of f_t at θ^(t), and D_Ψ(θ ‖ θ′) := Ψ(θ) − Ψ(θ′) − ⟨∇Ψ(θ′), θ − θ′⟩ is the Bregman divergence associated with Ψ. Then:

    (1/T) Σ_{t=1}^T f_t(θ^(t)) − (1/T) Σ_{t=1}^T f_t(θ*) ≤ 2 B_∇̌ √(B_Ψ / T)

where θ* ∈ Θ is an arbitrary reference vector.

Proof. Mirror descent [Nemirovski and Yudin, 1983, Beck and Teboulle, 2003] dates back to 1983, but this particular statement is taken from Lemma 2 of Srebro et al. [2011].
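As a concrete instance of the update in Theorem 6, taking Ψ to be the (shifted) negative Shannon entropy on the simplex makes the mirror step multiplicative and the Bregman projection a simple normalization; this is the same mechanism used column-wise in Corollary 2. A minimal numpy sketch, where the toy linear losses and the gradient bound are our own assumptions:

```python
import numpy as np

def entropic_mirror_descent(grads, T):
    """Mirror descent on the probability simplex with the entropy DGF:
    for Psi(theta) = log(d) + sum_i theta_i log theta_i, the two-step update
    of Theorem 6 reduces to theta <- theta * exp(-eta * g), renormalized."""
    d = len(grads(0))
    theta = np.full(d, 1.0 / d)           # argmin of Psi on the simplex
    B_psi = np.log(d)                     # uniform upper bound on Psi
    B_grad = 1.0                          # assumed bound on ||grad||_inf
    eta = np.sqrt(B_psi / T) / B_grad     # step size from Theorem 6
    iterates = []
    for t in range(T):
        g = grads(t)                      # (sub)gradient of f_t at theta^(t)
        iterates.append(theta)
        theta = theta * np.exp(-eta * g)  # unnormalized mirror step
        theta = theta / theta.sum()       # Bregman (KL) projection onto the simplex
    return iterates

# Toy check: linear losses f_t(theta) = <c, theta> with c = (0, 1);
# mass should shift toward the cheaper first coordinate.
c = np.array([0.0, 1.0])
iters = entropic_mirror_descent(lambda t: c, T=200)
assert iters[-1][0] > iters[0][0]
```

The λ-player's log-domain AdaGrad update described in Section 6 has the same multiplicative form.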
Corollary 1. (Gradient Descent)
Let f_1, f_2, ... : Θ → R be a sequence of convex functions that we wish to minimize on a compact convex set Θ.

Define the step size η = B_Θ / (B_∇̌ √T), where B_Θ ≥ max_{θ∈Θ} ‖θ‖_2, and B_∇̌ ≥ ‖∇̌f_t(θ^(t))‖_2 is a uniform upper bound on the norms of the subgradients. Suppose that we perform T iterations of the following update, starting from θ^(1) = argmin_{θ∈Θ} ‖θ‖_2:

    θ^(t+1) = Π_Θ(θ^(t) − η ∇̌f_t(θ^(t)))

where ∇̌f_t(θ^(t)) ∈ ∂f_t(θ^(t)) is a subgradient of f_t at θ^(t), and Π_Θ projects its argument onto Θ w.r.t. the Euclidean norm. Then:

    (1/T) Σ_{t=1}^T f_t(θ^(t)) − (1/T) Σ_{t=1}^T f_t(θ*) ≤ B_Θ B_∇̌ √(2/T)

where θ* ∈ Θ is an arbitrary reference vector.

Proof. Follows from taking
Ψ(θ) = ‖θ‖²_2 / 2 in Theorem 6.

Corollary 2.
Let M := {M ∈ R^{m̃×m̃} : ∀i ∈ [m̃]. M_{:,i} ∈ Δ^{m̃}} be the set of all left-stochastic m̃ × m̃ matrices, and let f_1, f_2, ... : M → R be a sequence of concave functions that we wish to maximize.

Define the step size η = √(m̃ ln m̃ / T) / B_∇̂, where B_∇̂ ≥ ‖∇̂f_t(M^(t))‖_{∞,2} is a uniform upper bound on the norms of the supergradients, and ‖M‖_{∞,2} := √(Σ_{i=1}^{m̃} ‖M_{:,i}‖²_∞) is the L_{∞,2} matrix norm. Suppose that we perform T iterations of the following update, starting from the matrix M^(1) with all elements equal to 1/m̃:

    M̃^(t+1) = M^(t) ⊙ exp(η ∇̂f_t(M^(t)))
    M^(t+1)_{:,i} = M̃^(t+1)_{:,i} / ‖M̃^(t+1)_{:,i}‖_1

where −∇̂f_t(M^(t)) ∈ ∂(−f_t(M^(t))), i.e. ∇̂f_t(M^(t)) is a supergradient of f_t at M^(t), and the multiplication and exponentiation in the first step are performed element-wise. Then:

    (1/T) Σ_{t=1}^T f_t(M*) − (1/T) Σ_{t=1}^T f_t(M^(t)) ≤ 2 B_∇̂ √(m̃ ln m̃ / T)

where M* ∈ M is an arbitrary reference matrix.

Proof. Define
Ψ : M → R as Ψ(M) := m̃ ln m̃ + Σ_{i,j∈[m̃]} M_{i,j} ln M_{i,j}, i.e. m̃ ln m̃ plus the negative Shannon entropy, applied to its (matrix) argument element-wise (the m̃ ln m̃ is added to make Ψ nonnegative on M). As in the vector setting, the resulting mirror descent update will be (element-wise) multiplicative.

The Bregman divergence satisfies:

    D_Ψ(M ‖ M′) = Ψ(M) − Ψ(M′) − ⟨∇Ψ(M′), M − M′⟩
        = ‖M′‖_{1,1} − ‖M‖_{1,1} + Σ_{i=1}^{m̃} D_KL(M_{:,i} ‖ M′_{:,i})    (11)

where ‖M‖_{1,1} = Σ_{i=1}^{m̃} ‖M_{:,i}‖_1 is the L_{1,1} matrix norm. This incidentally shows that one projects onto M w.r.t. D_Ψ by projecting each column w.r.t. the KL divergence, i.e. by normalizing the columns.

By Pinsker's inequality (applied to each column of an M ∈ M):

    (1/2) ‖M − M′‖²_{1,2} ≤ Σ_{i=1}^{m̃} D_KL(M_{:,i} ‖ M′_{:,i})

where ‖M‖_{1,2} = √(Σ_{i=1}^{m̃} ‖M_{:,i}‖²_1) is the L_{1,2} matrix norm. Substituting this into Equation 11, and using the fact that ‖M‖_{1,1} = m̃ for all M ∈ M, we have that for all M, M′ ∈ M:

    D_Ψ(M ‖ M′) ≥ (1/2) ‖M − M′‖²_{1,2}

which shows that Ψ is 1-strongly convex w.r.t. the L_{1,2} matrix norm. The dual norm of the L_{1,2} matrix norm is the L_{∞,2} norm, which is the last piece needed to apply Theorem 6, yielding the claimed result.
Let
Λ := Δ^{m̃} be the m̃-dimensional simplex, define M := {M ∈ R^{m̃×m̃} : ∀i ∈ [m̃]. M_{:,i} ∈ Δ^{m̃}} as the set of all left-stochastic m̃ × m̃ matrices, and take f_1, f_2, ... : Λ → R to be a sequence of concave functions that we wish to maximize.

Define the step size η = √(m̃ ln m̃ / T) / B_∇̂, where B_∇̂ ≥ ‖∇̂f_t(λ^(t))‖_∞ is a uniform upper bound on the ∞-norms of the supergradients. Suppose that we perform T iterations of the following update, starting from the matrix M^(1) with all elements equal to 1/m̃:

    λ^(t) = fix(M^(t))
    A^(t) = (∇̂f_t(λ^(t))) (λ^(t))^T
    M̃^(t+1) = M^(t) ⊙ exp(η A^(t))
    M^(t+1)_{:,i} = M̃^(t+1)_{:,i} / ‖M̃^(t+1)_{:,i}‖_1

where fix(M) is a stationary distribution of M (i.e. a λ ∈ Λ such that Mλ = λ—such a λ always exists, since M is left-stochastic), −∇̂f_t(λ^(t)) ∈ ∂(−f_t(λ^(t))), i.e. ∇̂f_t(λ^(t)) is a supergradient of f_t at λ^(t), and the multiplication and exponentiation of the third step are performed element-wise. Then:

    (1/T) Σ_{t=1}^T f_t(M* λ^(t)) − (1/T) Σ_{t=1}^T f_t(λ^(t)) ≤ 2 B_∇̂ √(m̃ ln m̃ / T)

where M* ∈ M is an arbitrary left-stochastic reference matrix.

Proof. This algorithm is an instance of that contained in Figure 1 of Gordon et al. [2008].

Define f̃_t(M) := f_t(M λ^(t)).
Observe that since $\hat{\nabla} f_t(\lambda^{(t)})$ is a supergradient of $f_t$ at $\lambda^{(t)}$, and $M^{(t)} \lambda^{(t)} = \lambda^{(t)}$:

$$f_t\left(\tilde{M} \lambda^{(t)}\right) \le f_t\left(M^{(t)} \lambda^{(t)}\right) + \left\langle \hat{\nabla} f_t\left(\lambda^{(t)}\right), \tilde{M} \lambda^{(t)} - M^{(t)} \lambda^{(t)} \right\rangle = f_t\left(M^{(t)} \lambda^{(t)}\right) + \left\langle A^{(t)}, \tilde{M} - M^{(t)} \right\rangle$$

so $A^{(t)}$ is a supergradient of $\tilde{f}_t$ at $M^{(t)}$, from which we conclude that the final two steps of the update are performing the algorithm of Corollary 2, so:

$$\frac{1}{T} \sum_{t=1}^T \tilde{f}_t(M^*) - \frac{1}{T} \sum_{t=1}^T \tilde{f}_t\left(M^{(t)}\right) \le 2 B_{\hat{\nabla}} \sqrt{\frac{\tilde{m} \ln \tilde{m}}{T}}$$

where the $B_{\hat{\nabla}}$ of Corollary 2 is a uniform upper bound on the $L^{\infty,2}$ matrix norms of the $A^{(t)}$s. However, by the definition of $A^{(t)}$ and the fact that $\lambda^{(t)} \in \Delta^{\tilde{m}}$ (so that $\|\lambda^{(t)}\|_2 \le 1$), we can instead take $B_{\hat{\nabla}}$ to be a uniform upper bound on $\|\hat{\nabla} f_t(\lambda^{(t)})\|_\infty$. Substituting the definition of $\tilde{f}_t$, and again using the fact that $M^{(t)} \lambda^{(t)} = \lambda^{(t)}$, then yields the claimed result.

C.2 Stochastic One-Player Convergence Rates
Theorem 7. (Stochastic Mirror Descent)
Let $\Psi$, $\|\cdot\|$, $D_\Psi$ and $B_\Psi$ be as in Theorem 6, and let $f_1, f_2, \ldots : \Theta \to \mathbb{R}$ be a sequence of convex functions that we wish to minimize on a compact convex set $\Theta$.

Define the step size $\eta = \sqrt{B_\Psi / T B_{\check{\Delta}}^2}$, where $B_{\check{\Delta}} \ge \|\check{\Delta}^{(t)}\|_*$ is a uniform upper bound on the dual norms of the stochastic subgradients. Suppose that we perform $T$ iterations of the following stochastic update, starting from $\theta^{(1)} = \operatorname{argmin}_{\theta \in \Theta} \Psi(\theta)$:

$$\tilde{\theta}^{(t+1)} = \nabla \Psi^*\left(\nabla \Psi\left(\theta^{(t)}\right) - \eta \check{\Delta}^{(t)}\right)$$
$$\theta^{(t+1)} = \operatorname{argmin}_{\theta \in \Theta} D_\Psi\left(\theta \,\middle\|\, \tilde{\theta}^{(t+1)}\right)$$

where $\mathbb{E}[\check{\Delta}^{(t)} \mid \theta^{(t)}] \in \partial f_t(\theta^{(t)})$, i.e. $\check{\Delta}^{(t)}$ is a stochastic subgradient of $f_t$ at $\theta^{(t)}$. Then, with probability $1 - \delta$ over the draws of the stochastic subgradients:

$$\frac{1}{T} \sum_{t=1}^T f_t\left(\theta^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T f_t(\theta^*) \le 2 B_{\check{\Delta}} \sqrt{\frac{2 B_\Psi \left(1 + 16 \ln \frac{1}{\delta}\right)}{T}}$$

where $\theta^* \in \Theta$ is an arbitrary reference vector.

Proof. This is nothing more than the usual transformation of a uniform regret guarantee into a stochastic one via the Hoeffding-Azuma inequality—we include a proof for completeness. Define the sequence:

$$\tilde{f}_t(\theta) = f_t\left(\theta^{(t)}\right) + \left\langle \check{\Delta}^{(t)}, \theta - \theta^{(t)} \right\rangle$$

Then applying non-stochastic mirror descent to the sequence $\tilde{f}_t$ will result in exactly the same sequence of iterates $\theta^{(t)}$ as applying stochastic mirror descent (above) to $f_t$. Hence, by Theorem 6 and the definition of $\tilde{f}_t$ (notice that we can take $B_{\check{\nabla}} = B_{\check{\Delta}}$):

$$\frac{1}{T} \sum_{t=1}^T \tilde{f}_t\left(\theta^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T \tilde{f}_t(\theta^*) \le 2 B_{\check{\Delta}} \sqrt{\frac{B_\Psi}{T}}$$

$$\frac{1}{T} \sum_{t=1}^T f_t\left(\theta^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T f_t(\theta^*) \le 2 B_{\check{\Delta}} \sqrt{\frac{B_\Psi}{T}} + \frac{1}{T} \sum_{t=1}^T \left(\tilde{f}_t(\theta^*) - f_t(\theta^*)\right) \le 2 B_{\check{\Delta}} \sqrt{\frac{B_\Psi}{T}} + \frac{1}{T} \sum_{t=1}^T \left\langle \check{\Delta}^{(t)} - \check{\nabla} f_t\left(\theta^{(t)}\right), \theta^* - \theta^{(t)} \right\rangle \tag{12}$$

where the last step follows from the convexity of the $f_t$s (here $\check{\nabla} f_t(\theta^{(t)}) := \mathbb{E}[\check{\Delta}^{(t)} \mid \theta^{(t)}]$ is a subgradient of $f_t$ at $\theta^{(t)}$). Consider the second term on the RHS.
Observe that, since the $\check{\Delta}^{(t)}$s are stochastic subgradients, each of the terms in the sum is zero in expectation (conditioned on the past), and the partial sums therefore form a martingale. Furthermore, by Hölder's inequality:

$$\left\langle \check{\Delta}^{(t)} - \check{\nabla} f_t\left(\theta^{(t)}\right), \theta^* - \theta^{(t)} \right\rangle \le \left\|\check{\Delta}^{(t)} - \check{\nabla} f_t\left(\theta^{(t)}\right)\right\|_* \left\|\theta^* - \theta^{(t)}\right\| \le 4 B_{\check{\Delta}} \sqrt{2 B_\Psi}$$

the last step because $\|\theta^* - \theta^{(t)}\| \le \|\theta^* - \theta^{(1)}\| + \|\theta^{(t)} - \theta^{(1)}\| \le 2 \max_{\theta \in \Theta} \sqrt{2 D_\Psi(\theta \,\|\, \theta^{(1)})} \le 2\sqrt{2 B_\Psi}$, using the fact that $\Psi$ is $1$-strongly convex w.r.t. $\|\cdot\|$, and the definition of $\theta^{(1)}$. Hence, by the Hoeffding-Azuma inequality:

$$\Pr\left\{ \frac{1}{T} \sum_{t=1}^T \left\langle \check{\Delta}^{(t)} - \check{\nabla} f_t\left(\theta^{(t)}\right), \theta^* - \theta^{(t)} \right\rangle \ge \epsilon \right\} \le \exp\left( -\frac{T \epsilon^2}{64 B_\Psi B_{\check{\Delta}}^2} \right)$$

equivalently:

$$\Pr\left\{ \frac{1}{T} \sum_{t=1}^T \left\langle \check{\Delta}^{(t)} - \check{\nabla} f_t\left(\theta^{(t)}\right), \theta^* - \theta^{(t)} \right\rangle \ge 8 B_{\check{\Delta}} \sqrt{\frac{B_\Psi \ln \frac{1}{\delta}}{T}} \right\} \le \delta$$

Substituting this into Equation 12, and applying the inequality $\sqrt{a} + \sqrt{b} \le \sqrt{2a + 2b}$, yields the claimed result.

Corollary 3. (Stochastic Gradient Descent)
Let $f_1, f_2, \ldots : \Theta \to \mathbb{R}$ be a sequence of convex functions that we wish to minimize on a compact convex set $\Theta$.

Define the step size $\eta = B_\Theta / B_{\check{\Delta}} \sqrt{T}$, where $B_\Theta \ge \max_{\theta \in \Theta} \|\theta\|_2$, and $B_{\check{\Delta}} \ge \|\check{\Delta}^{(t)}\|_2$ is a uniform upper bound on the norms of the stochastic subgradients. Suppose that we perform $T$ iterations of the following stochastic update, starting from $\theta^{(1)} = \operatorname{argmin}_{\theta \in \Theta} \|\theta\|_2$:

$$\theta^{(t+1)} = \Pi_\Theta\left(\theta^{(t)} - \eta \check{\Delta}^{(t)}\right)$$

where $\mathbb{E}[\check{\Delta}^{(t)} \mid \theta^{(t)}] \in \partial f_t(\theta^{(t)})$, i.e. $\check{\Delta}^{(t)}$ is a stochastic subgradient of $f_t$ at $\theta^{(t)}$, and $\Pi_\Theta$ projects its argument onto $\Theta$ w.r.t. the Euclidean norm. Then, with probability $1 - \delta$ over the draws of the stochastic subgradients:

$$\frac{1}{T} \sum_{t=1}^T f_t\left(\theta^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T f_t(\theta^*) \le 2 B_\Theta B_{\check{\Delta}} \sqrt{\frac{1 + 16 \ln \frac{1}{\delta}}{T}}$$

where $\theta^* \in \Theta$ is an arbitrary reference vector.

Proof. Follows from taking
$\Psi(\theta) = \|\theta\|_2^2 / 2$ in Theorem 7.

Corollary 4.
Let $\mathcal{M} := \{M \in \mathbb{R}^{\tilde{m} \times \tilde{m}} : \forall i \in [\tilde{m}].\, M_{:,i} \in \Delta^{\tilde{m}}\}$ be the set of all left-stochastic $\tilde{m} \times \tilde{m}$ matrices, and let $f_1, f_2, \ldots : \mathcal{M} \to \mathbb{R}$ be a sequence of concave functions that we wish to maximize.

Define the step size $\eta = \sqrt{\tilde{m} \ln \tilde{m} / T B_{\hat{\Delta}}^2}$, where $B_{\hat{\Delta}} \ge \|\hat{\Delta}^{(t)}\|_{\infty,2}$ is a uniform upper bound on the norms of the stochastic supergradients, and $\|M\|_{\infty,2} := \sqrt{\sum_{i=1}^{\tilde{m}} \|M_{:,i}\|_\infty^2}$ is the $L^{\infty,2}$ matrix norm. Suppose that we perform $T$ iterations of the following stochastic update, starting from the matrix $M^{(1)}$ with all elements equal to $1/\tilde{m}$:

$$\tilde{M}^{(t+1)} = M^{(t)} \odot \operatorname{.exp}\left(\eta \hat{\Delta}^{(t)}\right)$$
$$M^{(t+1)}_{:,i} = \tilde{M}^{(t+1)}_{:,i} \Big/ \left\|\tilde{M}^{(t+1)}_{:,i}\right\|_1$$

where $\mathbb{E}[-\hat{\Delta}^{(t)} \mid M^{(t)}] \in \partial(-f_t(M^{(t)}))$, i.e. $\hat{\Delta}^{(t)}$ is a stochastic supergradient of $f_t$ at $M^{(t)}$, and the multiplication and exponentiation in the first step are performed element-wise. Then, with probability $1 - \delta$ over the draws of the stochastic supergradients:

$$\frac{1}{T} \sum_{t=1}^T f_t(M^*) - \frac{1}{T} \sum_{t=1}^T f_t\left(M^{(t)}\right) \le 2 B_{\hat{\Delta}} \sqrt{\frac{2 \left(\tilde{m} \ln \tilde{m}\right) \left(1 + 16 \ln \frac{1}{\delta}\right)}{T}}$$

where $M^* \in \mathcal{M}$ is an arbitrary reference matrix.

Proof. The same reasoning as was used to prove Corollary 2 from Theorem 6 applies here (but starting from Theorem 7).
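The update of Corollary 4 is straightforward to implement: exponentiate the stochastic supergradient element-wise, multiply into the current matrix, and re-normalize each column (the KL projection onto the left-stochastic matrices). A minimal NumPy sketch (the function name and toy dimensions are our own, not from the text):

```python
import numpy as np

def multiplicative_matrix_update(M, grad, eta):
    """One step of the Corollary 4 update (a sketch): an element-wise
    exponentiated-gradient step on a left-stochastic matrix M, where `grad`
    plays the role of the stochastic supergradient, followed by column-wise
    normalization (projection w.r.t. the KL divergence)."""
    M_tilde = M * np.exp(eta * grad)      # element-wise multiply and .exp
    return M_tilde / M_tilde.sum(axis=0)  # each column back onto the simplex

# Usage: starting from M^(1) with all elements 1/m_tilde, one noisy update
# leaves every column a valid probability distribution.
m_tilde = 3
M = np.full((m_tilde, m_tilde), 1.0 / m_tilde)
rng = np.random.default_rng(0)
M = multiplicative_matrix_update(M, rng.standard_normal((m_tilde, m_tilde)), eta=0.1)
```

Since the update is multiplicative and the normalizer is positive, every iterate stays strictly inside $\mathcal{M}$.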
Lemma 9.
Let $\Lambda := \Delta^{\tilde{m}}$ be the $\tilde{m}$-dimensional simplex, define $\mathcal{M} := \{M \in \mathbb{R}^{\tilde{m} \times \tilde{m}} : \forall i \in [\tilde{m}].\, M_{:,i} \in \Delta^{\tilde{m}}\}$ as the set of all left-stochastic $\tilde{m} \times \tilde{m}$ matrices, and take $f_1, f_2, \ldots : \Lambda \to \mathbb{R}$ to be a sequence of concave functions that we wish to maximize.

Define the step size $\eta = \sqrt{\tilde{m} \ln \tilde{m} / T B_{\hat{\Delta}}^2}$, where $B_{\hat{\Delta}} \ge \|\hat{\Delta}^{(t)}\|_\infty$ is a uniform upper bound on the $\infty$-norms of the stochastic supergradients. Suppose that we perform $T$ iterations of the following update, starting from the matrix $M^{(1)}$ with all elements equal to $1/\tilde{m}$:

$$\lambda^{(t)} = \operatorname{fix} M^{(t)}$$
$$A^{(t)} = \hat{\Delta}^{(t)} \left(\lambda^{(t)}\right)^T$$
$$\tilde{M}^{(t+1)} = M^{(t)} \odot \operatorname{.exp}\left(\eta A^{(t)}\right)$$
$$M^{(t+1)}_{:,i} = \tilde{M}^{(t+1)}_{:,i} \Big/ \left\|\tilde{M}^{(t+1)}_{:,i}\right\|_1$$

where $\operatorname{fix} M$ is a stationary distribution of $M$ (i.e. a $\lambda \in \Lambda$ such that $M\lambda = \lambda$, which always exists, since $M$ is left-stochastic), $\mathbb{E}[-\hat{\Delta}^{(t)} \mid \lambda^{(t)}] \in \partial(-f_t(\lambda^{(t)}))$, i.e. $\hat{\Delta}^{(t)}$ is a stochastic supergradient of $f_t$ at $\lambda^{(t)}$, and the multiplication and exponentiation of the third step are performed element-wise. Then, with probability $1 - \delta$ over the draws of the stochastic supergradients:

$$\frac{1}{T} \sum_{t=1}^T f_t\left(M^* \lambda^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T f_t\left(\lambda^{(t)}\right) \le 2 B_{\hat{\Delta}} \sqrt{\frac{2 \left(\tilde{m} \ln \tilde{m}\right) \left(1 + 16 \ln \frac{1}{\delta}\right)}{T}}$$

where $M^* \in \mathcal{M}$ is an arbitrary left-stochastic reference matrix.

Proof. The same reasoning as was used to prove Lemma 8 from Corollary 2 applies here (but starting from Corollary 4).
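Lemmas 8 and 9 both rely on two small computational pieces: the $\operatorname{fix} M$ operation (a stationary distribution of a left-stochastic matrix) and the rank-one matrix supergradient $A^{(t)} = \hat{\Delta}^{(t)} (\lambda^{(t)})^T$. A hedged NumPy sketch, using power iteration as one of several ways to obtain the top (unit-eigenvalue) eigenvector (the names and the $2 \times 2$ example are our own):

```python
import numpy as np

def fix_stationary(M, iters=1000):
    """Sketch of `fix M`: for a left-stochastic M, repeatedly applying M to a
    distribution converges to a stationary distribution lam with M @ lam == lam
    (power iteration toward the eigenvector with eigenvalue 1)."""
    lam = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        lam = M @ lam
        lam /= lam.sum()  # guard against round-off drifting off the simplex
    return lam

# Usage: the swap-regret updates lift a supergradient w.r.t. lambda to a
# matrix supergradient via the outer product A^(t) = Delta^(t) (lambda^(t))^T.
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])          # columns sum to 1, so M is left-stochastic
lam = fix_stationary(M)             # exact stationary distribution: (2/3, 1/3)
A = np.outer(np.array([0.5, -0.5]), lam)
```

Because every column of $A$ is a multiple of the same supergradient vector, the $L^{\infty,2}$ norm of $A$ is at most the $\infty$-norm of that vector, which is exactly how the statement replaces the matrix bound with the vector bound $B_{\hat{\Delta}}$.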
C.3 Two-Player Convergence Rates
Lemma 1. (Algorithm 1)
Suppose that $\Lambda$ and $R$ are as in Theorem 1, and define the upper bound $B_\Delta \ge \max_{t \in [T]} \|\Delta_\lambda^{(t)}\|_2$.

If we run Algorithm 1 with the step size $\eta_\lambda := R / B_\Delta \sqrt{T}$, then the result satisfies the conditions of Theorem 1 for:

$$\epsilon = \rho + 2 R B_\Delta \sqrt{\frac{1}{T}}$$

where $\rho$ is the error associated with the oracle $\mathcal{O}_\rho$.

Algorithm 3: Optimizes the Lagrangian formulation (Definition 1) in the convex setting. The parameter $R$ is the radius of the Lagrange multiplier space $\Lambda := \{\lambda \in \mathbb{R}_+^m : \|\lambda\| \le R\}$, and the functions $\Pi_\Theta$ and $\Pi_\Lambda$ project their arguments onto $\Theta$ and $\Lambda$ (respectively) w.r.t. the Euclidean norm.

StochasticLagrangian($R \in \mathbb{R}_+$, $\mathcal{L} : \Theta \times \Lambda \to \mathbb{R}$, $T \in \mathbb{N}$, $\eta_\theta, \eta_\lambda \in \mathbb{R}_+$):
1. Initialize $\theta^{(1)} = 0$, $\lambda^{(1)} = 0$ // Assumes $0 \in \Theta$
2. For $t \in [T]$:
3. &nbsp;&nbsp; Let $\check{\Delta}_\theta^{(t)}$ be a stochastic subgradient of $\mathcal{L}(\theta^{(t)}, \lambda^{(t)})$ w.r.t. $\theta$
4. &nbsp;&nbsp; Let $\Delta_\lambda^{(t)}$ be a stochastic gradient of $\mathcal{L}(\theta^{(t)}, \lambda^{(t)})$ w.r.t. $\lambda$
5. &nbsp;&nbsp; Update $\theta^{(t+1)} = \Pi_\Theta(\theta^{(t)} - \eta_\theta \check{\Delta}_\theta^{(t)})$ // Projected SGD updates . . .
6. &nbsp;&nbsp; Update $\lambda^{(t+1)} = \Pi_\Lambda(\lambda^{(t)} + \eta_\lambda \Delta_\lambda^{(t)})$ // . . .
7. Return $\theta^{(1)}, \ldots, \theta^{(T)}$ and $\lambda^{(1)}, \ldots, \lambda^{(T)}$

Proof.
Applying Corollary 1 to the optimization over $\lambda$ gives:

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^{(t)}, \lambda^*\right) - \frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^{(t)}, \lambda^{(t)}\right) \le 2 B_\Lambda B_\Delta \sqrt{\frac{1}{T}}$$

By the definition of $\mathcal{O}_\rho$ (Definition 3):

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^{(t)}, \lambda^*\right) - \inf_{\theta^* \in \Theta} \frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^*, \lambda^{(t)}\right) \le \rho + 2 B_\Lambda B_\Delta \sqrt{\frac{1}{T}}$$

Using the linearity of $\mathcal{L}$ in $\lambda$, the fact that $B_\Lambda = R$, and the definitions of $\bar{\theta}$ and $\bar{\lambda}$, yields the claimed result.

Lemma 10. (Algorithm 3)
Suppose that $\Theta$ is a compact convex set, $\Lambda$ and $R$ are as in Theorem 1, and that the objective and constraint functions $g_0, g_1, \ldots, g_m$ are convex. Define the three upper bounds $B_\Theta \ge \max_{\theta \in \Theta} \|\theta\|_2$, $B_{\check{\Delta}} \ge \max_{t \in [T]} \|\check{\Delta}_\theta^{(t)}\|_2$, and $B_\Delta \ge \max_{t \in [T]} \|\Delta_\lambda^{(t)}\|_2$.

If we run Algorithm 3 with the step sizes $\eta_\theta := B_\Theta / B_{\check{\Delta}} \sqrt{T}$ and $\eta_\lambda := R / B_\Delta \sqrt{T}$, then the result satisfies the conditions of Theorem 1 for:

$$\epsilon = 2 \left(B_\Theta B_{\check{\Delta}} + R B_\Delta\right) \sqrt{\frac{1 + 16 \ln \frac{2}{\delta}}{T}}$$

with probability $1 - \delta$ over the draws of the stochastic (sub)gradients.

Proof. Applying Corollary 3 to the two optimizations (over $\theta$ and $\lambda$) gives that with probability $1 - 2\delta'$ over the draws of the stochastic (sub)gradients:

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^{(t)}, \lambda^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^*, \lambda^{(t)}\right) \le 2 B_\Theta B_{\check{\Delta}} \sqrt{\frac{1 + 16 \ln \frac{1}{\delta'}}{T}}$$

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^{(t)}, \lambda^*\right) - \frac{1}{T} \sum_{t=1}^T \mathcal{L}\left(\theta^{(t)}, \lambda^{(t)}\right) \le 2 B_\Lambda B_\Delta \sqrt{\frac{1 + 16 \ln \frac{1}{\delta'}}{T}}$$

Adding these inequalities, taking $\delta = 2\delta'$, using the linearity of $\mathcal{L}$ in $\lambda$, the fact that $B_\Lambda = R$, and the definitions of $\bar{\theta}$ and $\bar{\lambda}$, yields the claimed result.

Algorithm 4: Optimizes the proxy-Lagrangian formulation (Definition 2) in the non-convex setting via the use of an approximate Bayesian optimization oracle $\mathcal{O}_\rho$ (Definition 3, but with $\tilde{g}_i$s instead of $g_i$s in the linear combination defining $f$) for the $\theta$-player, with the $\lambda$-player minimizing swap regret. The $\operatorname{fix} M$ operation on line 3 results in a stationary distribution of $M$ (i.e.
a $\lambda \in \Lambda$ such that $M\lambda = \lambda$, which can be derived from the top eigenvector).

OracleProxyLagrangian($\mathcal{L}_\theta, \mathcal{L}_\lambda : \Theta \times \Delta^{m+1} \to \mathbb{R}$, $\mathcal{O}_\rho : (\Theta \to \mathbb{R}) \to \Theta$, $T \in \mathbb{N}$, $\eta_\lambda \in \mathbb{R}_+$):
1. Initialize $M^{(1)} \in \mathbb{R}^{(m+1) \times (m+1)}$ with $M_{i,j} = 1/(m+1)$
2. For $t \in [T]$:
3. &nbsp;&nbsp; Let $\lambda^{(t)} = \operatorname{fix} M^{(t)}$ // Stationary distribution of $M^{(t)}$
4. &nbsp;&nbsp; Let $\theta^{(t)} = \mathcal{O}_\rho(\mathcal{L}_\theta(\cdot, \lambda^{(t)}))$ // Oracle optimization
5. &nbsp;&nbsp; Let $\Delta_\lambda^{(t)}$ be a gradient of $\mathcal{L}_\lambda(\theta^{(t)}, \lambda^{(t)})$ w.r.t. $\lambda$
6. &nbsp;&nbsp; Update $\tilde{M}^{(t+1)} = M^{(t)} \odot \operatorname{.exp}(\eta_\lambda \Delta_\lambda^{(t)} (\lambda^{(t)})^T)$ // $\odot$ and $\operatorname{.exp}$ are element-wise
7. &nbsp;&nbsp; Project $M^{(t+1)}_{:,i} = \tilde{M}^{(t+1)}_{:,i} / \|\tilde{M}^{(t+1)}_{:,i}\|_1$ for $i \in [m+1]$ // Column-wise projection w.r.t. the KL divergence
8. Return $\theta^{(1)}, \ldots, \theta^{(T)}$ and $\lambda^{(1)}, \ldots, \lambda^{(T)}$

Lemma 11. (Algorithm 4)
Suppose that $\mathcal{M}$ and $\Lambda$ are as in Theorem 2, and define the upper bound $B_\Delta \ge \max_{t \in [T]} \|\Delta_\lambda^{(t)}\|_\infty$.

If we run Algorithm 4 with the step size $\eta_\lambda := \sqrt{(m+1) \ln(m+1) / T B_\Delta^2}$, then the result satisfies the conditions of Theorem 2 for:

$$\epsilon_\theta = \rho$$
$$\epsilon_\lambda = 2 B_\Delta \sqrt{\frac{(m+1) \ln(m+1)}{T}}$$

where $\rho$ is the error associated with the oracle $\mathcal{O}_\rho$.

Proof. Applying Lemma 8 to the optimization over $\lambda$ (with $\tilde{m} := m+1$) gives:

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}_\lambda\left(\theta^{(t)}, M^* \lambda^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T \mathcal{L}_\lambda\left(\theta^{(t)}, \lambda^{(t)}\right) \le 2 B_\Delta \sqrt{\frac{(m+1) \ln(m+1)}{T}}$$

By the definition of $\mathcal{O}_\rho$ (Definition 3):

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}_\theta\left(\theta^{(t)}, \lambda^{(t)}\right) - \inf_{\theta^* \in \Theta} \frac{1}{T} \sum_{t=1}^T \mathcal{L}_\theta\left(\theta^*, \lambda^{(t)}\right) \le \rho$$

Using the definitions of $\bar{\theta}$ and $\bar{\lambda}$ yields the claimed result.

Lemma 4. (Algorithm 2)
Suppose that $\Theta$ is a compact convex set, $\mathcal{M}$ and $\Lambda$ are as in Theorem 2, and that the objective and proxy constraint functions $g_0, \tilde{g}_1, \ldots, \tilde{g}_m$ are convex (but not necessarily $g_1, \ldots, g_m$). Define the three upper bounds $B_\Theta \ge \max_{\theta \in \Theta} \|\theta\|_2$, $B_{\check{\Delta}} \ge \max_{t \in [T]} \|\check{\Delta}_\theta^{(t)}\|_2$, and $B_\Delta \ge \max_{t \in [T]} \|\Delta_\lambda^{(t)}\|_\infty$.

If we run Algorithm 2 with the step sizes $\eta_\theta := B_\Theta / B_{\check{\Delta}} \sqrt{T}$ and $\eta_\lambda := \sqrt{(m+1) \ln(m+1) / T B_\Delta^2}$, then the result satisfies the conditions of Theorem 2 for:

$$\epsilon_\theta = 2 B_\Theta B_{\check{\Delta}} \sqrt{\frac{1 + 16 \ln \frac{2}{\delta}}{T}}$$
$$\epsilon_\lambda = 2 B_\Delta \sqrt{\frac{2 \left((m+1) \ln(m+1)\right) \left(1 + 16 \ln \frac{2}{\delta}\right)}{T}}$$

with probability $1 - \delta$ over the draws of the stochastic (sub)gradients.

Proof. Applying Corollary 3 to the optimization over $\theta$, and Lemma 9 to that over $\lambda$ (with $\tilde{m} := m+1$), gives that with probability $1 - 2\delta'$ over the draws of the stochastic (sub)gradients:

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}_\theta\left(\theta^{(t)}, \lambda^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T \mathcal{L}_\theta\left(\theta^*, \lambda^{(t)}\right) \le 2 B_\Theta B_{\check{\Delta}} \sqrt{\frac{1 + 16 \ln \frac{1}{\delta'}}{T}}$$

$$\frac{1}{T} \sum_{t=1}^T \mathcal{L}_\lambda\left(\theta^{(t)}, M^* \lambda^{(t)}\right) - \frac{1}{T} \sum_{t=1}^T \mathcal{L}_\lambda\left(\theta^{(t)}, \lambda^{(t)}\right) \le 2 B_\Delta \sqrt{\frac{2 \left((m+1) \ln(m+1)\right) \left(1 + 16 \ln \frac{1}{\delta'}\right)}{T}}$$

Taking $\delta = 2\delta'$, and using the definitions of $\bar{\theta}$ and $\bar{\lambda}$, yields the claimed result.
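Since Algorithm 2 itself is not reproduced in this appendix, the loop below is only a sketch of the stochastic proxy-Lagrangian procedure that Lemma 4 analyzes, assembled from Algorithm 3 (projected stochastic subgradient steps for the $\theta$-player) and Algorithm 4 (multiplicative swap-regret updates for the $\lambda$-player). All function names, the toy problem, and the power-iteration implementation of $\operatorname{fix} M$ are our own assumptions:

```python
import numpy as np

def fix_m(M, iters=100):
    # Stationary distribution of a left-stochastic matrix via power iteration.
    lam = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iters):
        lam = M @ lam
        lam /= lam.sum()
    return lam

def stochastic_proxy_lagrangian(grad_theta, grad_lam, theta0, m, T,
                                eta_theta, eta_lam, project_theta, rng):
    # theta-player: projected stochastic subgradient descent on L_theta.
    # lambda-player: element-wise multiplicative update on the left-stochastic
    # matrix M followed by column normalization (the KL projection), with
    # lambda^(t) a stationary distribution of M^(t).
    theta = theta0
    M = np.full((m + 1, m + 1), 1.0 / (m + 1))
    thetas, lams = [], []
    for _ in range(T):
        lam = fix_m(M)
        thetas.append(theta)
        lams.append(lam)
        d_theta = grad_theta(theta, lam, rng)  # stochastic subgrad of L_theta
        d_lam = grad_lam(theta, lam, rng)      # gradient of L_lambda w.r.t. lambda
        theta = project_theta(theta - eta_theta * d_theta)
        M = M * np.exp(eta_lam * np.outer(d_lam, lam))
        M /= M.sum(axis=0, keepdims=True)
    return thetas, lams

# Toy instance: minimize theta^2 over Theta = [-2, 2] subject to theta >= 1,
# i.e. g_0(theta) = theta^2 with (proxy) constraint g_1(theta) = 1 - theta <= 0.
rng = np.random.default_rng(0)
grad_theta = lambda th, lam, rng: lam[0] * 2.0 * th - lam[1] + 0.01 * rng.standard_normal()
grad_lam = lambda th, lam, rng: np.array([0.0, 1.0 - th])
thetas, lams = stochastic_proxy_lagrangian(
    grad_theta, grad_lam, theta0=0.0, m=1, T=200, eta_theta=0.05,
    eta_lam=0.1, project_theta=lambda th: float(np.clip(th, -2.0, 2.0)), rng=rng)
```

Per the lemmas above, the returned iterate sequences would then be combined into the stochastic classifier $(\bar{\theta}, \bar{\lambda})$; every $\lambda^{(t)}$ stays on the simplex and every $\theta^{(t)}$ stays in $\Theta$ by construction.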