Dual Prices for Frank–Wolfe Algorithms∗
—a note—

Gábor Braun and Sebastian Pokutta
Technische Universität Berlin, Germany
Zuse Institute Berlin, Germany
{braun,pokutta}@zib.de
January 6, 2021
Abstract
In this note we observe that for constrained convex minimization problems $\min_{x \in P} f(x)$ over a polytope $P$, dual prices for the linear program $\min_{z \in P} \nabla f(x) z$ obtained from linearization at approximately optimal solutions $x$ have a similar interpretation of rate of change in optimal value as for linear programming, providing a convex form of sensitivity analysis. This is of particular interest for Frank–Wolfe algorithms (also called conditional gradients), forming an important class of first-order methods, where a basic building block is linear minimization of gradients of $f$ over $P$; in most implementations this already computes the dual prices as a by-product.

We consider the constrained convex minimization problem
\[ \min_{x \in P} f(x), \tag{minProb} \]
where $f$ is a smooth convex function and $P$ is a compact convex feasible region. Our primary interest is first-order algorithms, where access to $f$ is provided by computing gradients $\nabla f(x)$ and function values $f(x)$ at any feasible point $x$. An important class of first-order methods is formed by conditional gradient algorithms (also known as Frank–Wolfe algorithms), which access the feasible region $P$ solely through a linear minimization oracle, i.e., presented with a linear objective $c$ the oracle returns $\operatorname{argmin}_{x \in P} c \cdot x$. This class of algorithms has several advantages, two of which are (1) projection-freeness: no projection onto the domain $P$ is needed, and (2) sparsity: iterates are represented as a convex combination of a small number of vertices, usually at most one vertex per iteration. As an example, the simplest algorithm, namely the (vanilla) Frank–Wolfe algorithm (Frank and Wolfe, 1956; Levitin and Polyak, 1966), is recalled in Algorithm 1, which, however, will not be used in the rest of the paper.

Many implementations of a linear minimization oracle already compute dual prices for an optimal solution in order to verify optimality. Therefore it is of interest to make use of this extra information. The role of conditional gradient algorithms in this paper is only as a practical example: they make dual prices available at no extra cost in most implementations.

We will provide an interpretation for dual prices similar to sensitivity analysis for linear optimization: dual prices at an optimal solution $x$ are the rates of change of the optimal value under small changes to the right-hand side $b$ of a domain $\{z : Az \le b\}$ defined by linear inequalities. We shall see that this interpretation holds even for approximately optimal solutions $x$, while retaining the additive error of the accuracy of $x$. Thus dual prices can then be used as customary, e.g., in sensitivity analysis, to compute risk-free state probabilities, (economic) shadow prices in, e.g., energy systems, etc.

∗ Research in this paper was partially supported by Research Campus MODAL funded by the German Federal Ministry of Education and Research under grant 05M14ZAM.

Algorithm 1: Frank–Wolfe Algorithm (FW)
    $x_0 \in P$ arbitrary
    for $t = 0, 1, \dots$ do
        $v_t \gets \operatorname{argmin}_{z \in P} \nabla f(x_t) z$
        $\gamma_t \gets \operatorname{argmin}_{0 \le \gamma \le 1} f(x_t + \gamma (v_t - x_t))$
        $x_{t+1} \gets x_t + \gamma_t (v_t - x_t)$
    end for
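For concreteness, a minimal Python sketch of Algorithm 1 follows. The callables `grad_f` and `lmo` are hypothetical user-supplied stand-ins for the gradient oracle and the linear minimization oracle, and the exact line search for $\gamma_t$ is replaced by the standard open-loop rule $\gamma_t = 2/(t+2)$; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def frank_wolfe(grad_f, lmo, x0, max_iters=1000, eps=1e-6):
    """Vanilla Frank-Wolfe (Algorithm 1), with the exact line search for
    gamma_t replaced by the standard open-loop rule gamma_t = 2/(t+2)."""
    x = np.array(x0, dtype=float)
    fw_gap = np.inf
    for t in range(max_iters):
        g = grad_f(x)
        v = lmo(g)                 # v_t = argmin_{z in P} grad f(x_t) . z
        fw_gap = g @ (x - v)       # Frank-Wolfe gap, cf. (1.3) below
        if fw_gap <= eps:          # usual stopping criterion
            break
        x += (2.0 / (t + 2)) * (v - x)   # x_{t+1} = x_t + gamma_t (v_t - x_t)
    return x, fw_gap
```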
1 Preliminaries

We briefly recall basic definitions. Recall that a differentiable function $f \colon P \to \mathbb{R}$ is convex if for all $x, y \in P$
\[ f(y) - f(x) \ge \nabla f(x) (y - x). \tag{1.1} \]
We denote by $\nabla f(x)$ the gradient of $f$ as a row vector, while $P$ is considered to be contained in a vector space of column vectors. This will be convenient for dual prices, but we note that the formalism is inherently free of coordinate choices in the space of $P$ (treating $\nabla f(x)$ as a linear function).

Further, the function $f$ is $L$-smooth if for all $x, y \in P$
\[ f(y) - f(x) \le \nabla f(x) (y - x) + \frac{L}{2} \|y - x\|^2. \tag{1.2} \]
Here $\|\cdot\|$ is any norm on the vector space containing $P$, and the value of $L$ depends on the norm $\|\cdot\|$.

Conditional gradient algorithms often use the Frank–Wolfe gap, defined as
\[ \max_{z \in P} \nabla f(x) (x - z) = \nabla f(x) (x - v), \tag{1.3} \]
with $v$ a point of $P$ attaining $\min_{z \in P} \nabla f(x) z$. In the context of Frank–Wolfe algorithms, these minimizers are called Frank–Wolfe vertices at $x$ (even though not all of them are vertices; only vertex minimizers are used in practice). They are usually used to define the next iterate, as in the vanilla variant in Algorithm 1. Recall that by convexity
\[ 0 \le f(x) - f(x^*) \le \nabla f(x) (x - x^*) \le \nabla f(x) (x - v). \tag{1.4} \]
Here and below $x^*$ is an optimal solution to $\min_{x \in P} f(x)$. Thus the Frank–Wolfe gap is an upper bound on the primal gap $f(x) - f(x^*)$, and it is $0$ at optimal solutions to $\min_{z \in P} f(z)$. As such the Frank–Wolfe gap is useful as a proxy for the primal gap, while it also provides a lower bound on the optimal value:
\[ f(x) - \nabla f(x) (x - v) \le f(x) - \nabla f(x) (x - x^*) \le f(x^*). \]
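As a small usage example for the sketch above (with made-up data): minimizing $f(x) = \frac{1}{2}\|x - y_0\|^2$ over the probability simplex, whose linear minimization oracle simply returns a standard basis vector. The printed inequality is the sandwich $f(x) - \text{gap} \le f(x^*) \le f(x)$ from (1.4) and the lower bound above.

```python
import numpy as np

# Minimize f(x) = 0.5 * ||x - y0||^2 over the probability simplex; its
# linear minimization oracle returns the vertex e_i with i the index of
# the smallest gradient coordinate.
y0 = np.array([0.1, 0.7, 0.2, 0.0])
f      = lambda x: 0.5 * np.sum((x - y0) ** 2)
grad_f = lambda x: x - y0

def lmo_simplex(c):
    v = np.zeros_like(c)
    v[np.argmin(c)] = 1.0
    return v

x, fw_gap = frank_wolfe(grad_f, lmo_simplex, x0=np.eye(4)[0])
# Primal sandwich from (1.4) and the lower bound on the optimal value:
print(f(x) - fw_gap, "<= f(x*) <=", f(x))
```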
2 Dual Prices

We recall dual prices for linear optimization applied to our context here. Let $P = \{z : Az \le b\}$ with $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ be a polytope, let $f$ be a convex function, and let $x, v \in P$ be arbitrary. Recall that strong duality states that $v \in P$ is a minimizer for the linear program $\min_{z : Az \le b} \nabla f(x) z$, i.e., $v = \operatorname{argmin}_{z : Az \le b} \nabla f(x) z$, if and only if there is a nonnegative combination of constraints certifying optimality, i.e., a vector $0 \le \lambda \in \mathbb{R}^m$ whose entries are multipliers called dual prices satisfying
\[ \nabla f(x) = -\lambda A, \qquad \lambda \ge 0, \tag{2.1a} \]
\[ \nabla f(x) v = \min_{z : Az \le b} \nabla f(x) z = -\lambda b. \tag{2.1b} \]
The second equality can be replaced with complementary slackness, namely, that $v$ satisfies with equality the constraints of $Az \le b$ whose multiplier in $\lambda$ is positive (i.e., $a_i v = b_i$ for $\lambda_i > 0$, where $a_i$ is row $i$ of $A$, and $\lambda_i$ and $b_i$ are entry $i$ of $\lambda$ and $b$, respectively), or in short $\lambda (b - Av) = 0$.
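When the linear minimization oracle is an LP solver, dual prices come for free. A sketch assuming SciPy ≥ 1.7, whose HiGHS backend exposes the constraint marginals; the sign flip below translates SciPy's convention (marginals = derivative of the optimal value with respect to $b$) into the $\lambda$ of (2.1), since the optimal value is $-\lambda b$ by (2.1b).

```python
import numpy as np
from scipy.optimize import linprog

def fw_vertex_with_duals(grad, A, b):
    """Solve min_z grad . z subject to A z <= b, returning the Frank-Wolfe
    vertex v and dual prices lambda >= 0 as in (2.1). Assumes that
    P = {z : A z <= b} is a polytope (bounded and nonempty)."""
    res = linprog(c=grad, A_ub=A, b_ub=b, bounds=(None, None), method="highs")
    # HiGHS marginals are d(optimal value)/d(b); since the optimal value
    # is -lambda . b by (2.1b), we have marginals = -lambda.
    return res.x, -res.ineqlin.marginals
```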
A primal-dual pair is a pair $(v, \lambda)$ satisfying the strong duality conditions stated in Equation system (2.1). Obviously $v$ is a Frank–Wolfe vertex at $\nabla f(x)$, and the Frank–Wolfe gap at $x$ equals the complementarity gap, i.e.,
\[ \nabla f(x) (x - v) = -\lambda A (x - v) = \lambda (b - Ax). \tag{2.2} \]
Common implementations of a linear optimization oracle naturally compute dual prices for a Frank–Wolfe vertex $v$, which is indeed a vertex of $P$. For example, the widely used simplex algorithm internally operates with data providing a candidate $(v, \lambda)$ for a primal-dual pair, where $v$ is a vertex of $P$ but $\lambda$ may violate the nonnegativity condition, and this candidate is then incrementally improved to a primal-dual pair.
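A quick numerical sanity check of (2.2) with made-up data, using the helper above: on the square $[0,1]^2$, the Frank–Wolfe gap at a feasible point $x$ coincides with the complementarity gap $\lambda(b - Ax)$.

```python
import numpy as np

# P = [0,1]^2 written as A z <= b; x an interior point, grad arbitrary.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])
x = np.array([0.25, 0.75])
grad = np.array([1.0, -2.0])

v, lam = fw_vertex_with_duals(grad, A, b)
# Frank-Wolfe gap == complementarity gap, Equation (2.2):
assert np.isclose(grad @ (x - v), lam @ (b - A @ x))
```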
Recall that the celebrated Slater condition of optimality (a special case of the Karush–Kuhn–Tucker condition for convex functions) is the strong duality form of the optimality condition for $\min_{z : Az \le b} f(z)$: a point $x$ is an optimal solution to $\min_{z : Az \le b} f(z)$ if and only if $x$ is an optimal solution to $\min_{z : Az \le b} \nabla f(x) z$, i.e., $(x, \lambda)$ is a primal-dual pair for the linear program $\min_{z : Az \le b} \nabla f(x) z$ for some $\lambda$; equivalently, there are dual prices $\lambda$ for $x$ under the linear objective $\nabla f(x)$.

In practical implementations, e.g., of Frank–Wolfe algorithms, this means that dual prices $\lambda$ for the optimal solution $x^*$ can be obtained as dual prices for the Frank–Wolfe vertex $v$ associated with $\nabla f(x^*)$.

In practice we rarely have exact optimal solutions to convex minimization problems (even within the limit of numerical accuracy), and we are usually satisfied with a good approximate solution with, e.g., an additive error in function value of at most $\varepsilon$. For Frank–Wolfe algorithms the usual stopping criterion is an upper bound on the Frank–Wolfe gap (sometimes also called the dual gap), $\max_{z : Az \le b} \nabla f(x) (x - z) \le \varepsilon$, as due to the linear minimizations the Frank–Wolfe gap is essentially computed anyway. As such we will now consider the case of approximately optimal solutions. To this end let $v = \operatorname{argmin}_{z : Az \le b} \nabla f(x) z$ be the Frank–Wolfe vertex at $x$ and let $0 \le \lambda \in \mathbb{R}^m$ be associated dual prices as in Equation system (2.1) above.

A common interpretation of $\lambda$ in the context of linear programs is as the rate of change in optimal value as a function of change to the constant term $b$, i.e., $\min_{z : Az \le b'} \nabla f(x) z = -\lambda b'$ for $b'$ close to $b$; if $\lambda$ is not a unique dual solution, it needs to be chosen depending on $b'$. Morally, the optimal value changes by $\lambda (b - b')$, while the dual solution $\lambda$ does not change.
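Continuing the toy example above, this rate-of-change interpretation can be checked numerically under a small (hypothetical) perturbation of one right-hand side: the linearized optimal value moves by $\lambda(b - b')$ while $\lambda$ itself stays fixed.

```python
# Perturb one right-hand side slightly: b' = b + delta.
b_prime = b + np.array([0.0, -0.05, 0.0, 0.0])
v_prime, lam_prime = fw_vertex_with_duals(grad, A, b_prime)

# New optimal value = old value + lambda . (b - b'), with unchanged duals.
assert np.isclose(grad @ v_prime, grad @ v + lam @ (b - b_prime))
assert np.allclose(lam, lam_prime)
```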
The next observation carries this sensitivity analysis over to smooth convex functions and approximately optimal solutions: in this case we incur additional error terms due to (1) non-linearity of the objective function, and (2) approximate optimality (with no error for optimal solutions). There are many common assumptions on the objective convex function $f$ to bound its non-linearity. For the sake of exposition, we assume the most common one, namely smoothness, which is only needed for the last inequality in Equations (2.4) and (2.5). For other assumptions the error term $L \|v' - v\|^2 / 2$ should be replaced accordingly.

Observation 2.1. Let $f$ be an $L$-smooth convex function over a convex domain containing the polytopes $P = \{z : Az \le b\}$ and $P' = \{z : Az \le b'\}$. Let $x \in P$ and $v = \operatorname{argmin}_{z \in P} \nabla f(x) z$. Similarly, let $v' = \operatorname{argmin}_{z \in P'} \nabla f(x) z$. Assume that $x' \coloneqq x - v + v' \in P'$. Then
\[ f(x) - \nabla f(x)(x - v) \le \min_{z \in P} f(z) \le f(x), \tag{2.3} \]
\[ f(x) - \nabla f(x)(x - v) + \nabla f(x)(v' - v) \le \min_{z \in P'} f(z) \le f(x') \le f(x) + \nabla f(x)(v' - v) + \frac{L}{2}\|v' - v\|^2. \tag{2.4} \]
When $\lambda$ is a common dual solution for both $v$ in $P$ and $v'$ in $P'$,
\[ f(x) - \nabla f(x)(x - v) + \lambda(b - b') \le \min_{z \in P'} f(z) \le f(x') \le f(x) + \lambda(b - b') + \frac{L}{2}\|v' - v\|^2. \tag{2.5} \]

To justify the assumption $x' \in P'$, we note that it holds when $b'$ is sufficiently close to $b$ and $x$ is sufficiently close to the optimal solution $x^*$ (with $v$ and $v'$ appropriately chosen depending on $b'$). Intuitively, a neighborhood of $v'$ in $P'$ is just a translation of a neighborhood of $v$ in $P$, and $x$ is well inside the neighborhood, so it is preserved by the translation. Let us split the defining linear inequalities $Az \le b$ of $P$ into two subsystems: let $A_{=} z \le b_{=}$ be the subsystem whose inequalities $x$ satisfies with equality (describing the boundary of the neighborhood at $v$), and $A_{<} z \le b_{<}$ be the subsystem whose inequalities $x$ satisfies with strict inequality (describing faraway parts of $P$), i.e., $A_{=} x = b_{=}$ and $A_{<} x < b_{<}$. We claim that when $x$ is close enough to $x^*$ then $A_{=} v = b_{=}$ (regardless of the choice of $v$). In geometric terms the claim means that $v$ is contained in the minimal face containing $x$. To verify it, let $F$ be the face of $P$ containing $x^*$ in its relative interior (allowing $F = P$), which consists of minimizers of $\min_{z \in P} \nabla f(x^*) z$ by optimality of $x^*$. When $x$ is close to $x^*$ then (1) all minimizers of $\min_{z \in P} \nabla f(x) z$ lie in $F$, too, and (2) every facet of $P$ containing $x$ also contains $x^*$, and hence $F$ is contained in the minimal face containing $x$.

As for linear programs, if $b'$ is sufficiently close to $b$, then for some choice of optimal solutions $v$ and $v'$ (recall they need not be unique) they have a common dual solution $\lambda$, and $v'$ is sufficiently close to $v$.

After these preliminaries, we verify $x' \in P'$. First we deal with the inequalities $x$ satisfies with equality for $P$, which turn out to be the easy case: $A_{=} x' = A_{=} x - A_{=} v + A_{=} v' \le b_{=} - b_{=} + b'_{=} = b'_{=}$. For the other inequalities, note that $b'_{<} - A_{<} x' = (b_{<} - A_{<} x) + (b'_{<} - b_{<}) - A_{<} (v' - v)$. As $b_{<} - A_{<} x$ is strictly positive, with $b'$ close enough to $b$ (and hence $v'$ close enough to $v$), the other terms on the right-hand side are small enough for the right-hand side to remain positive, i.e., $A_{<} x' < b'_{<}$.
Proof. The bounds for the polytope $P$ in (2.3) are well known and presented for comparison only; they easily follow from convexity and the minimality of $v$.

Equation (2.4) provides the same bounds for the polytope $P'$, even though the points at which we compute the lower and the upper bound might differ. The left-hand side of the first inequality is $f(x) - \nabla f(x)(x - v')$ (written in a form to ease comparison with Equation (2.3)). While $x$ might not be contained in $P'$, this does not affect the validity of the inequality. The second inequality explicitly uses the assumption $x' \in P'$, and the last inequality is just the smoothness inequality for $f$, using $x' - x = v' - v$.

Finally, observe that $\lambda(b - b') = \nabla f(x)(v' - v)$ by the definition of dual prices, leading to Equation (2.5). ∎
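To illustrate Observation 2.1 numerically, we continue the square example with a smooth convex objective: $f(z) = \frac{1}{2}\|z - y\|^2$ is $1$-smooth in the Euclidean norm, so the chain of bounds in (2.5) can be evaluated directly. The particular numbers are made up; the common dual $\lambda$ works for both $P$ and $P'$ here because the small perturbation leaves the pattern of tight constraints unchanged.

```python
import numpy as np

# f(z) = 0.5 * ||z - y||^2 is convex and 1-smooth (Euclidean norm).
y = np.array([-0.3, 1.4])             # chosen so that constraints bind
f = lambda z: 0.5 * np.sum((z - y) ** 2)

x = np.array([0.02, 0.98])            # approximately optimal in P
g = x - y                             # gradient of f at x

v,  lam = fw_vertex_with_duals(g, A, b)        # FW vertex + duals in P
v2, _   = fw_vertex_with_duals(g, A, b_prime)  # FW vertex in P'
x2 = x - v + v2                       # shifted point x'; here x' lies in P'

lower = f(x) - g @ (x - v) + lam @ (b - b_prime)
upper = f(x) + lam @ (b - b_prime) + 0.5 * np.sum((v2 - v) ** 2)
# Chain of bounds (2.5): lower <= min_{z in P'} f(z) <= f(x') <= upper
print(lower, "<=", f(x2), "<=", upper)
```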
References

M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.