Blended Conditional Gradients: the unconditioning of conditional gradients
aa r X i v : . [ m a t h . O C ] M a y Blended Conditional Gradients:the unconditioning of conditional gradients
Gábor Braun , Sebastian Pokutta , Dan Tu , and Stephen Wright ISyE, Georgia Institute of Technology, Atlanta, GA, {gabor.braun,sebastian.pokutta}@isye.gatech.edu, [email protected] Computer Sciences Department, University of Wisconsin, Madison, WI, [email protected]
June 3, 2019
Abstract
We present a blended conditional gradient approach for minimizing a smooth convex function over apolytope P , combining the Frank–Wolfe algorithm (also called conditional gradient) with gradient-basedsteps, different from away steps and pairwise steps, but still achieving linear convergence for stronglyconvex functions, along with good practical performance. Our approach retains all favorable properties ofconditional gradient algorithms, notably avoidance of projections onto P and maintenance of iterates assparse convex combinations of a limited number of extreme points of P . The algorithm is lazy , making useof inexpensive inexact solutions of the linear programming subproblem that characterizes the conditionalgradient approach. It decreases measures of optimality rapidly, both in the number of iterations and inwall-clock time, outperforming even the lazy conditional gradient algorithms of Braun et al. [2017]. Wealso present a streamlined version of the algorithm for the probability simplex. A common paradigm in convex optimization is minimization of a smooth convex function f over a poly-tope P . The conditional gradient (CG) algorithm, also known as “Frank–Wolfe” Frank and Wolfe [1956],Levitin and Polyak [1966] is enjoying renewed popularity because it can be implemented efficiently to solveimportant problems in data analysis. It is a first-order method, requiring access to gradients ∇ f ( x ) andfunction values f ( x ) . In its original form, CG employs a linear programming (LP) oracle to minimize alinear function over the polytope P at each iteration. The cost of this operation depends on the complexityof P . The base method has many extensions with the aim of improving performance, like reusing previouslyfound points of P to complement or even sometimes omit LP oracle calls Lacoste-Julien and Jaggi [2015],or using oracles weaker than an LP oracle to reduce cost of oracle calls Braun et al. [2017].In this work, we describe a blended conditional gradient (BCG) approach, which is a novel combinationof several previously used ideas into a single algorithm with similar theoretical convergence rates as severalother variants of CG that have been studied recently, including pairwise-step and away-step variants and thelazy variants in Braun et al. [2017], however, with very fast performance and in several cases, empirically1igher convergence rates compared to other variants. In particular, while the lazy variants of Braun et al.[2017] have an advantage over baseline CG when the LP oracle is expensive, our BCG approach consistentlyoutperforms the other variants in more general circumstances, both in per-iteration progress and in wall-clocktime.In a nutshell, BCG is a first-order algorithm that chooses among several types of steps based on the gradient ∇ f at the current point. It also maintains an “active vertex set” of solutions from previous iterations, likee.g., the Away-step Frank–Wolfe algorithm. Building on Braun et al. [2017], BCG uses a “weak-separationoracle” to find a vertex of P for which the linear objective attains some fraction of the reduction in f the LPoracle would achieve, typically by first searching among the current set of active vertices and if no suitablevertex is found, the LP oracle used in the original FW algorithm may be deployed. 
On other iterations, BCGemploys a “simplex descent oracle,” which takes a step within the convex hull of the active vertices, yieldingprogress either via reduction in function value (a “descent step”) or via culling of the active vertex set (a“drop step”). For example, the oracle can make a single (vanilla) gradient descent step. The size of the activevertex set typically remains small, which benefits both the efficiency of the method and the “sparsity” ofthe final solution (i.e., its representation as a convex combination of a relatively small number of vertices).Compared to the Away-step and Pairwise-step Frank–Wolfe algorithms, the simplex descent oracle realizesan improved use of the active set by (partially) optimizing over its convex hull via descent steps, similar tothe Fully-corrective Frank–Wolfe algorithm and also the approach in Rao et al. [2015] but with a better stepselection criterion: BCG alternates between the various steps using estimated progress from dual gaps. Wehasten to stress that BCG remains projection-free. Related work
There has been an extensive body of work on conditional gradient algorithms; see the excellent overview ofJaggi [2013]. Here we review only those papers most closely related to our work.Our main inspiration comes from Braun et al. [2017], Lan et al. [2017], which introduce the weak-separation oracle as a lazy alternative to calling the LP oracle in every iteration. It is influenced too by themethod of Rao et al. [2015], which maintains an active vertex set, using projected descent steps to improvethe objective over the convex hull of this set, and culling the set on some steps to keep its size under control.While the latter method is a heuristic with no proven convergence bounds beyond those inherited from thestandard Frank–Wolfe method, our BCG algorithm employs a criterion for optimal trade-off between thevarious steps , with a proven convergence rate equal to state-of-the-art Frank–Wolfe variants up to a constantfactor.Our main result shows linear convergence of BCG for strongly convex functions. Linearly convergentvariants of CG were studied as early as Guélat and Marcotte [1986] for special cases and Garber and Hazan[2013] for the general case (though the latter work involves very large constants). More recently, linear conver-gence has been established for various pairwise-step and away-step variants of CG in Lacoste-Julien and Jaggi[2015]. Other memory-efficient decomposition-invariant variants were described in Garber and Meshi [2016]and Bashiri and Zhang [2017]. Modification of descent directions and step sizes, reminiscent of the dropsteps used in BCG, have been considered by Freund and Grigas [2016], Freund et al. [2017]. The use of aninexpensive oracle based on a subset of the vertices of P , as an alternative to the full LP oracle, has beenconsidered in Kerdreux et al. [2018b]. Garber et al. [2018] proposes a fast variant of conditional gradientsfor matrix recovery problems.BCG is quite distinct from the Fully-corrective Frank–Wolfe algorithm (FCFW) (see, for example,Holloway [1974], Lacoste-Julien and Jaggi [2015]). Both approaches maintain active vertex sets, generateiterates that lie in the convex hulls of these sets, and alternate between Frank–Wolfe steps generating new2ertices and correction steps optimizing within the current active vertex set. However, convergence analysesof the FCFW algorithm assume that the correction steps have unit cost, though they can be quite expensivein practice. For BCG, by contrast, we assume only a single step of gradient descent type having unit cost(disregarding cost of line search). For computational results comparing BCG and FCFW, and illustrating thisissue, see Figure 11 and the discussion in Section 6. Contribution
Our contribution is summarized as follows:
Blended Conditional Gradients (BCG).
The BCG approach blends different types of descent steps: Frank–Wolfesteps from Frank and Wolfe [1956], optionally lazified as in Braun et al. [2017], and gradient descentsteps over the convex hull of the current active vertex set. It avoids projections and does not useaway steps and pairwise steps, which are elements of other popular variants of CG. It achieves linearconvergence for strongly convex functions (see Theorem 3.1), and O ( / t ) convergence after t iterationsfor general smooth functions. While the linear convergence proof of the Away-step Frank–WolfeAlgorithm [Lacoste-Julien and Jaggi, 2015, Theorem 1, Footnote 4] requires the objective function f to be defined on the Minkowski sum P − P + P , BCG does not need f to be defined outside thepolytope P . The algorithm has complexity comparable to pairwise-step or away-step variants of con-ditional gradients, both in time measured as number of iterations and in space (size of active set). Itis affine-invariant and parameter-free; estimates of such parameters as smoothness, strong convexity,or the diameter of P are not required. It maintains iterates as (often sparse) convex combinations ofvertices, typically much sparser than the baseline CG methods, a property that is important for someapplications. Such sparsity is due to the aggressive reuse of active vertices, and the fact that newvertices are added only as a kind of last resort. In wall-clock time as well as per-iteration progress, ourcomputational results show that BCG can be orders of magnitude faster than competimg CG methodson some problems. Simplex Gradient Descent (SiGD).
In Section 4, we describe a new projection-free gradient descent proce-dure for minimizing a smooth function over the probability simplex, which can be used to implementthe “simplex descent oracle” required by BCG, which is the module doing gradient descent steps.
Computational Experiments.
We demonstrate the excellent computational behavior of BCG comparedto other CG algorithms on standard problems, including video co-localization, sparse regression,structured SVM training, and structured regression. We observe significant computational speedupsand in several cases empirically better convergence rates.
Outline
We summarize preliminary material in Section 2, including the two oracles that are the foundation of ourBCG procedure. BCG is described and analyzed in Section 3, establishing linear convergence rates. Thesimplex gradient descent routine, which implements the simplex descent oracle, is described in Section 4.We mention in particular a variant of BCG that applies when P is the probability simplex, a special case thatadmits several simplifications and improvements to the analysis. Some possible enhancements to BCG arediscussed in Section 5. Our computational experiments appear in Section 6.3 Preliminaries
We use the following notation: e i is the i -th coordinate vector, ≔ ( , . . . , ) = e + e + · · · is the all-onesvector, k·k denotes the Euclidean norm ( ℓ -norm), D = diam ( P ) = sup u , v ∈ P k u − v k is the ℓ -diameter of P ,and conv S denotes the convex hull of a set S of points. The probability simplex ∆ k ≔ conv { e , . . . , e k } isthe convex hull of the coordinate vectors in dimension k .Let f be a differentiable convex function. Recall that f is L -smooth if f ( y ) − f ( x ) − ∇ f ( x )( y − x ) ≤ L k y − x k / x , y ∈ P .The function f has curvature C if f ( γ y + ( − γ ) x ) ≤ f ( x ) + γ ∇ f ( x )( y − x ) + C γ / x , y ∈ P and 0 ≤ γ ≤ L -smooth function always has curvature C ≤ LD .) Finally, f is strongly convex if for some α > f ( y ) − f ( x ) − ∇ f ( x )( y − x ) ≥ α k y − x k / x , y ∈ P .We will use the following fact about strongly convex function when optimizing over P . Fact 2.1 (Geometric strong convexity guarantee) . [Lacoste-Julien and Jaggi, 2015, Theorem 6 and Eq. (28)]Given a strongly convex function f , there is a value µ > called the geometric strong convexity such that f ( x ) − min y ∈ P f ( y ) ≤ (cid:0) max y ∈ S , z ∈ P ∇ f ( x )( y − z ) (cid:1) µ for any x ∈ P and for any subset S of the vertices of P for which x lies in the convex hull of S . The value of µ depends both on f and the geometry of P . For example, a possible choice is decomposing µ as a product of the form µ = α f W P , where α f is the strong convexity constant of f and W P is the pyramidalwidth of P , a constant only depending on the polytope P ; see Lacoste-Julien and Jaggi [2015]. Given a convex objective function f and an ordered finite set S = { v , . . . , v k } of points, we define f S : ∆ k → R as follows: f S ( λ ) ≔ f k Õ i = λ i v i ! . (1)When f S is L f S -smooth, Oracle 1 returns an improving point x ′ in conv S together with a vertex set S ′ ⊆ S such that x ′ ∈ conv S ′ .In Section 4 we provide an implementation (Algorithm 2) of this oracle via a single descent step , whichavoids projection and does not require knowledge of the smoothness parameter L f S . The weak-separation oracle Oracle 2 was introduced in Braun et al. [2017] to replace the LP oracletraditionally used in the CG method. Provided with a point x ∈ P , a linear objective c , a target reduction4 racle 1 Simplex Descent Oracle SiDO ( x , S , f ) Input: finite set S ⊆ R n , point x ∈ conv S , convex smooth function f : conv S → R n ; Output: finite set S ′ ⊆ S , point x ′ ∈ conv S ′ satisfying either drop step: f ( x ′ ) ≤ f ( x ) and S ′ , S descent step: f ( x ) − f ( x ′ ) ≥ [ max u , v ∈ S ∇ f ( x )( u − v )] /( L f S ) Oracle 2
Weak-Separation Oracle LPsep P ( c , x , Φ , K ) Input: linear objective c ∈ R n , point x ∈ P , accuracy K ≥
1, gap estimate Φ > Output:
Either (1) vertex y ∈ P with c ( x − y ) ≥ Φ / K , or (2) false : c ( x − z ) ≤ Φ for all z ∈ P .value Φ >
0, and an inexactness factor K ≥
1, it decides whether there exists y ∈ P with cx − c y ≥ Φ / K , orelse certifies that cx − cz ≤ Φ for all z ∈ P . In our applications, c = ∇ f ( x ) is the gradient of the objectiveat the current iterate x . Oracle 2 could be implemented simply by the standard LP oracle of minimizing cz over z ∈ P . However, it allows more efficient implementations, including the following. (1) Caching : testingpreviously obtained vertices y ∈ P (specifically, vertices in the current active vertex set) to see if one of themsatisfies cx − c y ≥ Φ / K . If not, the traditional LP oracle could be called to either find a new vertex of P satisfying this bound, or else to certify that cx − cz ≤ Φ for all z ∈ P , and (2) Early Termination : Terminatingthe LP procedure as soon as a vertex of P has been discovered that satisfies cx − c y ≥ Φ / K . (This techniquerequires an LP implementation that generates vertices as iterates.) If the LP procedure runs to terminationwithout finding such a point, it has certified that cx − cz ≤ Φ for all z ∈ P . In Braun et al. [2017] thesetechniques resulted in orders-of-magnitude speedups in wall-clock time in the computational tests, as wellas sparse convex combinations of vertices for the iterates x t , a desirable property in many contexts. Our BCG approach is specified as Algorithm 1. We discuss the algorithm in this section and establish itsconvergence rate. The algorithm expresses each iterate x t , t = , , , . . . as a convex combination of theelements of the active vertex set, denoted by S t , as in the Pairwise and Away-step variants of CG. At eachiteration, the algorithm calls either Oracle 1 or Oracle 2 in search of the next iterate, whichever promises thesmaller function value, using a test in Line 6 based on an estimate of the dual gap. The same greedy principleis used in the Away-step CG approach, and its lazy variants. A critical role in the algorithm (and particularlyin the test of Line 6) is played by the value Φ t , which is a current estimate of the primal gap — the differencebetween the current function value f ( x t ) and the optimal function value over P . When Oracle 2 returns false ,the curent value of Φ t is discovered to be an overestimate of the dual gap, so it is halved (Line 13) and weproceed to the next iteration. In subsequent discussion, we refer to Φ t as the “gap estimate.”In Line 17, the active set S t + is required to be minimal. By Caratheodory’s theorem, this requirementensures that | S t + | ≤ dim P +
1. In practice, the S t are invariably small and no explicit reduction in size isnecessary. The key requirement, in theory and practice, is that if after a call to Oracle SiDO the new iterate x t + lies on a face of the convex hull of the vertices in S t , then at least one element of S t is dropped to form S t + . This requirement ensures that the local pairwise gap in Line 6 is not too large due to stale vertices in5 lgorithm 1 Blended Conditional Gradients (BCG)
Input: smooth convex function f , start vertex x ∈ P , weak-separation oracle LPsep P , accuracy K ≥ Output: points x t in P for t = , . . . , T Φ ← max v ∈ P ∇ f ( x )( x − v )/ S ← { x } for t = to T − do v At ← argmax v ∈ S t ∇ f ( x t ) v v FW − St ← argmin v ∈ S t ∇ f ( x t ) v if ∇ f ( x t )( v At − v FW − St ) ≥ Φ t then x t + , S t + ← SiDO ( x t , S t ) {either a drop step or a descent step} Φ t + ← Φ t else v t ← LPsep P (∇ f ( x t ) , x t , Φ t , K ) if v t = false then x t + ← x t Φ t + ← Φ t / S t + ← S t else x t + ← argmin x ∈[ x t , v t ] f ( x ) {FW step, with line search} Choose S t + ⊆ S t ∪ { v t } minimal such that x t + ∈ conv S t + . Φ t + ← Φ t end if end if end for t , which can block progress. Small size of the sets S t is crucial to the efficiency of the algorithm, in rapidlydetermining the maximizer and minimizer of ∇ f ( x t ) over the active set S t in Lines 4 and 5.The constants in the convergence rate described in our main theorem (Theorem 3.1 below) depend on amodified curvature-like parameter of the function f . Given a vertex set S of P , recall from Section 2.1 thesmoothness parameter L f S of the function f S : ∆ k → R defined by (1). Define the simplicial curvature C ∆ to be C ∆ ≔ max S : | S |≤ P L f S (2)to be the maximum of the L f S over all possible active sets. This affine-invariant parameter depends both onthe shape of P and the function f . This is the relative smoothness constant L f , A from the predecessor ofGutman and Peña [2019], namely [Gutman and Peña, 2018, Definiton 2a], with an additional restriction: thesimplex is restricted to faces of dimension at most 2 dim P , which appears as a bound on the size of S in ourformulation. This restriction improves the constant by removing dependence on the number of vertices of thepolytope, and can probably replace the original constant in convergence bounds. We can immediately see theeffect in the common case of L -smooth functions, that the simplicial curvature is of reasonable magnitude,specifically, C ∆ ≤ LD ( dim P ) , where D is the diameter of P . This result follows from (2) and the bound on L f S from Lemma A.1 in theappendix. This bound is not directly comparable with the upper bound L f , A ≤ LD / P is explainedby the n -dimensional probability simplex having constant minimum width 2 in 1-norm, but having minimumwidth dependent on the dimension n (specifically, Θ ( /√ n ) ) in the 2-norm. Recall that the minimum widthof a convex body P ⊆ R n in norm k·k is min φ max u , v ∈ P φ ( u − v ) , where φ runs over all linear maps R n → R having dual norm k φ k ∗ =
1. For the 2-norm, this is just the minimum distance between parallel hyperplanessuch that P lies between the two hyperplanes.For another comparison, recall the curvature bound C ≤ LD . Note, however, that the algorithm andconvergence rate below are affine invariant, and the only restriction on the function f is that it has finitesimplicial curvature. This restriction readily provides the curvature bound C ≤ C ∆ , (3)where the factor 2 arises as the square of the diameter of the probability simplex ∆ k . (See Lemma A.2in the appendix for details.) Note that S is allowed to be large enough so that every two points of P liesimultaneously in the convex hull of some vertex subset S , by Caratheodory’s theorem, which is needed for(3). We describe the convergence of BCG (Algorithm 1) in the following theorem. Theorem 3.1.
Let f be a strongly convex, smooth function over the polytope P with simplicial curvature C ∆ and geometric strong convexity µ . Then Algorithm 1 ensures f ( x T ) − f ( x ∗ ) ≤ ε , where x ∗ is an optimalsolution to f in P for some iteration index T that satisfies T ≤ (cid:24) log 2 Φ ε (cid:25) + K (cid:24) log Φ KC ∆ (cid:25) + K C ∆ µ (cid:24) log 4 KC ∆ ε (cid:25) = O (cid:18) C ∆ µ log Φ ε (cid:19) , (4) where log denotes logarithms to the base . f , the algorithm ensures f ( x T ) − f ( x ∗ ) ≤ ε after T = O ( max { C ∆ , Φ }/ ε ) iterations by a similar argument, which is omitted. Proof.
The proof tracks that of Braun et al. [2017]. We divide the iteration sequence into epochs that aredemarcated by the gap steps , that is, the iterations for which the weak-separation oracle (Oracle 2) returnsthe value false , which results in Φ t being halved for the next iteration. We then bound the number of iterateswithin each epoch. The result is obtained by aggregating across epochs.We start by a well-known bound on the function value using the Frank–Wolfe point v FWt ≔ argmin v ∈ P ∇ f ( x t ) v at iteration t , which follows from convexity: f ( x t ) − f ( x ∗ ) ≤ ∇ f ( x t )( x t − x ∗ ) ≤ ∇ f ( x t )( x t − v FWt ) . If iteration t − x t = x t − and Φ t = Φ t − / f ( x t ) − f ( x ∗ ) ≤ ∇ f ( x t )( x t − v FWt ) ≤ Φ t . (5)This bound also holds at t =
0, by definition of Φ . Thus Algorithm 1 is guaranteed to satisfy f ( x T )− f ( x ∗ ) ≤ ε at some iterate T such that T − Φ T ≤ ε . Therefore, the total number of gap steps N Φ required to reach this point satisfies N Φ ≤ (cid:24) log 2 Φ ε (cid:25) , (6)which is also a bound on the total number of epochs. The next stage of the proof finds bounds on the numberof iterations of each type within an individual epoch.If iteration t − x t = x t − and Φ t = Φ t − /
2, and because the condition is false atLine 6 of Algorithm 1, we have ∇ f ( x t )( v At − x t ) ≤ ∇ f ( x t )( v At − v FW − St ) ≤ Φ t . (7)This condition also holds trivially at t =
0, since v A = v FW − S = x . By summing (5) and (7), we obtain ∇ f ( x t )( v At − v FWt ) ≤ Φ t , so it follows from Fact 2.1 that f ( x t ) − f ( x ∗ ) ≤ [∇ f ( x t )( v At − v FWt )] µ ≤ Φ t µ . By combining this inequality with (5), we obtain f ( x t ) − f ( x ∗ ) ≤ min (cid:8) Φ t / µ, Φ t (cid:9) (8)for all t such that either t = t − all t , because (1) the sequenceof function values { f ( x s )} s is non-increasing; and (2) Φ s = Φ t for all s in the epoch that starts at iteration t .We now consider the epoch that starts at iteration t , and use s to index the iterations within this epoch.Note that Φ s = Φ t for all s in this epoch.We distinguish three types of iterations besides gap step. The first type is a Frank–Wolfe step, taken whenthe weak-separation oracle returns an improving vertex v s ∈ P such that ∇ f ( x s )( x s − v s ) ≥ Φ s / K = Φ t / K C , we have by standard Frank–Wolfe arguments that (c.f.,Braun et al. [2017]). f ( x s ) − f ( x s + ) ≥ Φ s K min (cid:26) , Φ s KC (cid:27) ≥ Φ t K min (cid:26) , Φ t KC ∆ (cid:27) , (9)where we used Φ s = Φ t and C ≤ C ∆ (from (3)). We denote by N t FW the number of Frank–Wolfe iterationsin the epoch starting at iteration t .The second type of iteration is a descent step , in which Oracle Oracle SiDO (Line 7) returns a point x s + that lies in the relative interior of conv S s and with strictly smaller function value. We thus have S s + = S s and, by the definition of Oracle Oracle SiDO, together with (2), it follows that f ( x s ) − f ( x s + ) ≥ [∇ f ( x s )( v As − v FW − Ss )] C ∆ ≥ Φ s C ∆ = Φ t C ∆ . (10)We denote by N t desc the number of descent steps that take place in the epoch that starts at iteration t .The third type of iteration is one in which Oracle 1 returns a point x s + lying on a face of the convexhull of S s , so that S s + is strictly smaller than S s . Similarly to the Away-step Frank–Wolfe algorithm ofLacoste-Julien and Jaggi [2015], we call these steps drop steps , and denote by N t drop the number of such stepsthat take place in the epoch that starts at iteration t . Note that since S s is expanded only at Frank–Wolfe steps,and then only by at most one element, the total number of drop steps across the whole algorithm cannotexceed the total number of Frank–Wolfe steps. We use this fact and (6) in bounding the total number ofiterations T required for f ( x T ) − f ( x ∗ ) ≤ ε : T ≤ N Φ + N desc + N FW + N drop ≤ (cid:24) log 2 Φ ε (cid:25) + N desc + N FW = (cid:24) log 2 Φ ε (cid:25) + Õ t :epoch start ( N t desc + N t FW ) . (11)Here N desc denotes the total number of descent steps, N FW the total number of Frank–Wolfe steps, and N drop the total number of drop steps, which is bounded by N FW , as just discussed.Next, we seek bounds on the iteration counts N t desc and N t FW within the epoch starting with iteration t . Forthe total decrease in function value during the epoch, Equations (9) and (10) provide a lower bound, while f ( x t ) − f ( x ∗ ) is an obvious upper bound, leading to the following estimate using (8).• If Φ t ≥ KC ∆ then2 Φ t ≥ f ( x t ) − f ( x ∗ ) ≥ N t desc Φ t C ∆ + N t FW Φ t K ≥ N t desc Φ t K + N t FW Φ t K ≥ ( N t desc + N t FW ) Φ t K , hence N t desc + N t FW ≤ K . (12)• If Φ t < KC ∆ , a similar argument provides8 Φ t µ ≥ f ( x t ) − f ( x ∗ ) ≥ N t desc Φ t C ∆ + N t FW Φ t K C ∆ ≥ ( N t desc + N t FW ) Φ t K C ∆ , leading to N t desc + N t FW ≤ K C ∆ µ . 
(13)9here are at most (cid:24) log Φ KC ∆ (cid:25) epochs in the regime with Φ t ≥ KC ∆ , (cid:24) log 2 KC ∆ ε / (cid:25) epochs in the regime with Φ t < KC ∆ . Combining (11) with the bounds (12) and (13) on N t FW and N t desc , we obtain (4). (cid:3) Here we describe the Simplex Gradient Descent approach (Algorithm 2), an implementation of Oracle SiDO(Oracle 1). Algorithm 2 requires only O (| S |) operations beyond the evaluation of ∇ f ( x ) and the cost of linesearch. (It is assumed that x is represented as a convex combination of vertices of P , which is updated duringOracle 1.) Apart from the (trivial) computation of the projection of ∇ f ( x ) onto the linear space spanned by ∆ k , no projections are computed. Thus, Algorithm 2 is typically faster even than a Frank–Wolfe step (LPoracle call), for typical small sets S .Alternative implementations of Oracle 1 are described in Section 4.1. Section 4.2 describes the specialcase in which P itself is a probability simplex, combining BCG and its oracles into a single, simple methodwith better constants in the convergence bounds.In the algorithm, the form c denotes the scalar product of c and , i.e., the sum of entries of c . Algorithm 2
Simplex Gradient Descent Step (SiGD)
Input: polyhedron P , smooth convex function f : P → R , subset S = { v , v , . . . , v k } of vertices of P , point x ∈ conv S Output: set S ′ ⊆ S , point x ′ ∈ conv S ′ Decompose x as a convex combination x = Í ki = λ i v i , with Í ki = λ i = λ i ≥ i = , , . . . , k c ← [∇ f ( x ) v , . . . , ∇ f ( x ) v k ] { c = ∇ f S ( λ ) ; see (1)} d ← c − ( c ) / k {Projection onto the hyperplane of ∆ k } if d = then return x ′ = v , S ′ = { v } {Arbitrary vertex} end if η ← max { η ≥ λ − η d ≥ } y ← x − η Í i d i v i if f ( x ) ≥ f ( y ) then x ′ ← y Choose S ′ ⊆ S , S ′ , S with x ′ ∈ conv S ′ . else x ′ ← argmin z ∈[ x , y ] f ( z ) S ′ ← S end if return x ′ , S ′ To verify the validity of Algorithm 2 as an implementation of Oracle 1, note first that since y lies on aface of conv S by definition, it is always possible to choose a proper subset S ′ ⊆ S in Line 11, for example,10 ′ ≔ { v i : λ i > η d i } . The following lemma shows that with the choice h ≔ f S , Algorithm 2 correctlyimplements Oracle 1. Lemma 4.1.
Let ∆ k be the probability simplex in k dimensions and h : ∆ k → R be an L h -smooth function.Given some λ ∈ ∆ k , define d ≔ ∇ h ( λ ) − (∇ h ( λ ) / k ) and let η ≥ be the largest value for which τ ≔ λ − η d ≥ . Let λ ′ ≔ argmin z ∈[ λ,τ ] h ( z ) . Then either h ( λ ) ≥ h ( τ ) or h ( λ ) − h ( λ ′ ) ≥ [ max ≤ i , j ≤ k ∇ h ( λ )( e i − e j )] L h . Proof.
Let π denote the orthogonal projection onto the lineality space of ∆ k , i.e., π ( ζ ) ≔ ζ − ( ζ ) / k . Let g ( ζ ) ≔ h ( π ( ζ )) , then ∇ g ( ζ ) = π (∇ h ( π ( ζ ))) , and g is clearly L h -smooth, too. In particular, ∇ g ( λ ) = d .From standard gradient descent bounds, not repeated here, we have the following inequalities, for γ ≤ min { η, / L h } : h ( λ ) − h ( λ − γ d ) = g ( λ ) − g ( λ − γ ∇ g ( λ )) ≥ γ k∇ g ( λ )k ≥ γ [ max ≤ i , j ≤ k ∇ g ( λ )( e i − e j )] = γ [ max ≤ i , j ≤ k ∇ h ( λ )( e i − e j )] , (14)where the second inequality uses that the ℓ -diameter of the ∆ k is 2, and the last equality follows from ∇ g ( λ )( e i − e j ) = ∇ h ( λ )( e i − e j ) .When η ≥ / L h , we conclude that h ( λ ′ ) ≤ h ( λ − ( / L h ) d ) ≤ h ( λ ) , hence h ( λ ) − h ( λ ′ ) ≥ [ max i , j ∈{ , ,..., k } ∇ h ( λ )( e i − e j )] L h , which is the second case of the lemma. When η < / L h , then setting γ = η in (14) clearly provides h ( λ ) − h ( τ ) ≥
0, which is the first case of the lemma. (cid:3)
Algorithm 2 is probably the least expensive possible implementation of Oracle 1, in general. We mayconsider other implementations, based on projected gradient descent, that aim to decrease f by a greateramount in each step and possibly make more extensive reductions to the set S . Projected gradient descent would seek to minimize f S along the piecewise-linear path { proj ∆ k ( λ − γ ∇ f S ( λ )) | γ ≥ } , where proj ∆ k denotes projection onto ∆ k . Such a search is more expensive, but may result in a new active set S ′ that issignificantly smaller than the current set S and, since the reduction in f S is at least as great as the reductionon the interval γ ∈ [ , η ] alone, it also satisfies the requirements of Oracle 1.More advanced methods for optimizing over the simplex could also be considered, for example, mirrordescent (see Nemirovski and Yudin [1983]) and accelerated versions of mirror descent and projected gradientdescent; see Lan [2017] for a good overview. The effects of these alternatives on the overall convergencerate of Algorithm 1 has not been studied; the analysis is complicated significantly by the lack of guaranteedimprovement in each (inner) iteration.The accelerated versions are considered in the computational tests in Section 6, but on the examples wetried, the inexpensive implementation of Algorithm 2 usually gave the fastest overall performance. We havenot tested mirror descent versions. 11 .2 Simplex Gradient Descent as a stand-alone algorithm We describe a variant of Algorithm 1 for the special case in which P is the probability simplex ∆ k . Sinceoptimization of a linear function over ∆ k is trivial, we use the standard LP oracle in place of the weak-separation oracle (Oracle 2), resulting in the non-lazy variant Algorithm 3. Observe that the per-iterationcost is only O ( k ) . In cases of k very large, we could also formulate a version of Algorithm 3 that uses aweak-separation oracle (Oracle 2) to evaluate only a subset of the coordinates of the gradient, as in coordinatedescent. The resulting algorithm would be an interpolation of Algorithm 3 below and Algorithm 1; detailsare left to the reader. Algorithm 3
Stand-Alone Simplex Gradient Descent
Input: convex function f Output: points x t in ∆ k for t = , . . . , T x = e for t = to T − do S t ← { i : x t , i > } a t ← argmax i ∈ S t ∇ f ( x t ) i s t ← argmin i ∈ S t ∇ f ( x t ) i w t ← argmin ≤ i ≤ k ∇ f ( x t ) i if ∇ f ( x t ) a t − ∇ f ( x t ) s t > ∇ f ( x t ) x t − ∇ f ( x t ) w t then d i = ( ∇ f ( x t ) i − Í j ∈ S ∇ f ( x t ) j /| S t | i ∈ S t i < S t for i = , , . . . , k η = max { γ : x t − γ d ≥ } {ratio test} y = x t − η d if f ( x t ) ≥ f ( y ) then x t + ← y {drop step} else x t + ← argmin x ∈[ x t , y ] f ( x ) {descent step} end if else x t + ← argmin x ∈[ x , e wt ] f ( x ) {FW step} end if end for When line search is too expensive, one might replace Line 14 by x t + = ( − / L f ) x t + y / L f , and Line 17by x t + = ( − /( t + )) x t + ( /( t + )) e w . These employ the standard step sizes for the Frank–Wolfealgorithm and (projected) gradient descent, respectively, and yield the required descent guarantees.We now describe convergence rates for Algorithm 3, noting that better constants are available in theconvergence rate expression than those obtained from a direct application of Theorem 3.1. Corollary 4.2.
Let f be an α -strongly convex and L f -smooth function over the probability simplex ∆ k with k ≥ . Let x ∗ be a minimum point of f in ∆ k . Then Algorithm 3 converges with rate f ( x T ) − f ( x ∗ ) ≤ (cid:18) − α L f k (cid:19) T · ( f ( x ) − f ( x ∗ )) , T = , , . . . . f f is not strongly convex (that is, α = ), we have f ( x T ) − f ( x ∗ ) ≤ L f T , T = , , . . . . Proof.
The structure of the proof is similar to that of [Lacoste-Julien and Jaggi, 2015, Theorem 8]. Recallfrom [Lacoste-Julien and Jaggi, 2015, §B.1] that the pyramidal width of the probability simplex is W ≥ /√ k ,so that the geometric strong convexity of f is µ ≥ α / k . The diameter of ∆ k is D = √
2, and it is easily seenthat C ∆ = L f and C ≤ L f D / = L f .To maintain the same notation as in the proof of Theorem 3.1, we define v At = e a t , v FW − St = e s t and v FWt = e w t . In particular, we have ∇ f ( x t ) w t = ∇ f ( x t ) v FWt , ∇ f ( x t ) s t = ∇ f ( x t ) v FW − St , and ∇ f ( x t ) a t = ∇ f ( x t ) v At . Let h t ≔ f ( x t ) − f ( x ∗ ) .In the proof, we use several elementary estimates. First, by convexity of f and the definition of theFrank–Wolfe step, we have h t = f ( x t ) − f ( x ∗ ) ≤ ∇ f ( x t )( x t − v FWt ) . (15)Second, by Fact 2.1 and the estimate µ ≥ α / k for geometric strong convexity, we obtain h t ≤ [∇ f ( x t )( v At − v FWt )] α / k . (16)Let us consider a fixed iteration t . Suppose first that we take a descent step (Line 14), in particular, ∇ f ( x t )( v At − v FW − St ) ≥ ∇ f ( x t )( x t − v FWt ) from Line 7 which, together with ∇ f ( x t ) x t ≥ ∇ f ( x t ) v FW − S ,yields 2 ∇ f ( x t )( v At − v FW − St ) ≥ ∇ f ( x t )( v At − v FWt ) . (17)By Lemma 4.1, we have f ( x t ) − f ( x t + ) ≥ (cid:2) ∇ f ( x t )( v A − v FW − S ) (cid:3) L f ≥ (cid:2) ∇ f ( x t )( v A − v FW ) (cid:3) L f ≥ α L f k · h t , where the second inequality follows from (17) and the third inequality follows from (16).If a Frank–Wolfe step is taken (Line 17), we have similarly to (9) f ( x t ) − f ( x t + ) ≥ ∇ f ( x t )( x t − v FW ) (cid:26) , ∇ f ( x t )( x t − v FW ) L f (cid:27) . Combining with (15), we have either f ( x t ) − f ( x t + ) ≥ h t / f ( x t ) − f ( x t + ) ≥ [∇ f ( x t )( x t − v FW )] L f ≥ (cid:2) ∇ f ( x t )( v A − v FW ) (cid:3) L f ≥ α L f k · h t . Since α ≤ L f , the latter is always smaller than the former, and hence is a lower bound that holds for allFrank–Wolfe steps.Since f ( x t ) − f ( x t + ) = h t − h t + , we have h t + ≤ ( − α /( L f k )) h t for descent steps and Frank–Wolfesteps, while obviously h t + ≤ h t for drop steps (Line 12). For any given iteration counter T , let T desc bethe number of descent steps taken before iteration T , T FW be the number of Frank–Wolfe steps taken beforeiteration T , and T drop be the number of drop steps taken before iteration T . We have T drop ≤ T FW , so thatsimilarly to (11) T = T desc + T FW + T drop ≤ T desc + T FW . (18)13y compounding the decrease at each iteration, and using (18) together with the identity ( − ǫ / ) ≥ ( − ǫ ) for any ǫ ∈ ( , ) , we have h T ≤ (cid:18) − α L f k (cid:19) T desc + T FW h ≤ (cid:18) − α L f k (cid:19) T / h ≤ (cid:18) − α L f k (cid:19) T · h . The case for the smooth but not strongly convex functions is similar: we obtain for descent steps h t − h t + = f ( x t ) − f ( x t + ) ≥ (cid:2) ∇ f ( x t )( v A − v FW − S ) (cid:3) L f ≥ (cid:2) ∇ f ( x t )( x − v FW ) (cid:3) L f ≥ h t L f , (19)where the second inequality follows from (15).For Frank–Wolfe steps, we have by standard estimations h t + ≤ ( h t − h t /( L f ) if h t ≤ L f , L f ≤ h t / T , we define T drop , T FW and T desc as above, and show by induction that h T ≤ L f T desc + T FW for T ≥
1. (21)Equation (21), i.e., h T ≤ L f / T easily follows from this via T drop ≤ T FW . Note that the first step is necessarilya Frank–Wolfe step, hence the denominator is never 0.If iteration T is a drop step, then T >
1, and the claim is obvious by induction from h T ≥ h T − . Hencewe assume that iteration T is either a descent step or a Frank–Wolfe step. If T desc + T FW ≤ h T ≤ L f < L f or h T ≤ h T − − h T − /( L f ) ≤ L f , without using any upper bound on h T − , proving (21) in this case. Note that this includes the case T =
1, the start of the induction.Finally, if T desc + T FW ≥
3, then h T − ≤ L f /( T desc + T FW − ) ≤ L f by induction, therefore a familiarargument using (19) or (20) provides h T ≤ L f T desc + T FW − − L f ( T desc + T FW − ) ≤ L f T desc + T FW , proving (21) in this case, too, finishing the proof. (cid:3) We describe various enhancements that can be made to the BCG algorithm, to improve its practical perfor-mance while staying broadly within the framework above. Computational testing with these enhancementsis reported in Section 6.
Sparse solutions (which in the current context means “solutions that are a convex combination of a smallnumber of vertices of P ”) are desirable for many applications. Techniques for promoting sparse solutions inconditional gradients were considered in Rao et al. [2015]. In many situations, a sparse approximate solutioncan be identified at the cost of some increase in the value of the objective function.We explored two sparsification approaches, which can be applied separately or together, and performedpreliminary computational tests for a few of our experiments in Section 6.14anilla (i) (i), (ii) ∆ f ( x ) PCG 112 62 60 2 . . .
0% vanilla (i), (ii) ∆ f ( x ) ACG 300 298 7 . . . netgen_08a . Since we use LPCG and PCG asbenchmarks, we report (i) separately as well. Right: Matrix Completion over movielens100k instance.BCG without sparsification provides sparser solutions than the baseline methods with sparsification. In thelast column, we report the percentage increase in objective function value due to sparsification. (Becausethis quantity is not affine invariant, this value should serve only to rank the quality of solutions.)(i) Promoting drop steps.
Here we relax Line 9 in Algorithm 2 from testing f ( y ) ≥ f ( x ) to f ( y ) ≥ f ( x )− ε ,where ε ≔ min { max { p , } , ε } with ε ∈ R some upper bound on the accepted potential increase inobjective function value and p being the amount of reduction in f achieved on the latest iteration. Thistechnique allows a controlled increase of the objective function value in return for additional sparsity.The same convergence analysis will apply, with an additional factor of 2 in the estimates of the totalnumber of iterations.(ii) Post-optimization.
Once the considered algorithm has stopped with active set S , solution x , anddual gap d , we re-run the algorithm with the same objective function f over conv S , i.e., we solvemin x ∈ conv S f ( x ) terminating when the dual gap reaches d .These approaches can sparsify the solutions of the baseline algorithms Away-step Frank–Wolfe, PairwiseFrank–Wolfe, and lazy Pairwise Frank–Wolfe; see Rao et al. [2015]. We observed, however, that the iteratesgenerated by BCG are often quite sparse. In fact, the solutions produced by BCG are sparser than thoseproduced by the baseline algorithms even when sparsification is used in the benchmarks but not in BCG!This effect is not surprising, as BCG adds new vertices to the active vertex set only when really necessaryfor ensuring further progress in the optimization.Two representative examples are shown in Table 1, where we report the effect of sparsification in the sizeof the active set as well as the increase in objective function value.We also compared evolution of the function value and size of the active set. BCG decreases functionvalue much more for the same number of vertices because, by design, it performs more descent on a givenactive set; see Figure 12. Algorithm 1 mixes descent steps with Frank–Wolfe steps. One might be tempted to replace the Frank–Wolfesteps with (seemingly stronger) pairwise steps, as the information needed for the latter steps is computed inany case. In our tests, however, this variant did not substantially differ in practical performance from theone that uses the standard Frank–Wolfe step (see Figure 8). The explanation is that BCG uses descent stepsthat typically provide better directions than either Frank–Wolfe steps or pairwise steps. When the pairwisegap over the active set is small, the Frank–Wolfe and pairwise directions typically offer a similar amount ofreduction in f . 15 Computational experiments
To compare our experiments to previous work,we used problems and instances similar to those in Lacoste-Julien and Jaggi[2015], Garber and Meshi [2016], Rao et al. [2015], Braun et al. [2017], Lan et al. [2017]. These problemsinclude structured regression, sparse regression, video co-localization, sparse signal recovery, matrix com-pletion, and Lasso. In particular, we compared our algorithm to the Pairwise Frank–Wolfe algorithm fromLacoste-Julien and Jaggi [2015], Garber and Meshi [2016] and the lazified Pairwise Frank–Wolfe algorithmfrom Braun et al. [2017]. Figure 1 summarizes our results on four test problems.We also benchmarked against the lazified versions of the vanilla Frank–Wolfe and the Away-stepFrank–Wolfe as presented in Braun et al. [2017] for completeness. We implemented our code in
Python 3.6 using
Gurobi (see Gurobi Optimization [2016]) as the LP solver for complex feasible regions; as well as obvi-ous direct implementations for the probability simplex, the cube and the ℓ -ball. As feasible regions, we usedinstances from MIPLIB2010 (see Koch et al. [2011]), as done before in Braun et al. [2017], along with someof the examples in Bashiri and Zhang [2017]. Code is available at https://github.com/pokutta/bcg .We used quadratic objective functions for the tests with random coefficients, making sure that the globalminimum lies outside the feasible region, to make the optimization problem non-trivial; see below in therespective sections for more details.Every plot contains four diagrams depicting results of a single instance. The upper row measures progressin the logarithm of the function value, while the lower row does so in the logarithm of the gap estimate. Theleft column measures performance in the number of iterations, while the right column does so in wall-clocktime. In the graphs we will compare various algorithms denoted by the following abbreviations: PairwiseFrank–Wolfe (PCG), Away-step Frank–Wolfe (ACG), (vanilla) Frank–Wolfe (CG), blended conditionalgradients (BCG); we indicate the lazified versions of Braun et al. [2017] by prefixing with an ‘L’. All testswere conducted with an instance-dependent, fixed time limit, which can be easily read off the plots.The value Φ t provided by the algorithm is an estimate of the primal gap f ( x t ) − f ( x ∗ ) . The lazifiedversions (including BCG) use it to estimate the required stepwise progress, halving it occasionally, whichprovides a stair-like appearance in the graphs for the dual progress. Note that if the certification in the weak-separation oracle that c ( z − x ) ≥ Φ for all z ∈ P is obtained from the original LP oracle (which computesthe actual optimum of c y over y ∈ P ), then we update the gap estimate Φ t + with that value; otherwise theoracle would continue to return false anyway until Φ drops below that value. For the non-lazified algorithms,we plot the dual gap max v ∈ P ∇ f ( x t )( x t − v ) . Performance comparison
We implemented Algorithm 1 as outlined above and used SiGD (Algorithm 2) for the descent steps asdescribed in Section 4. For line search in Line 13 of Algorithm 2, we perform standard backtracking, andfor Line 16 of Algorithm 1, we do ternary search. In Figure 1, each of the four plots itself contains foursubplots depicting results of four variants of CG on a single instance. The two subplots in each upper rowmeasure progress in the logarithm (to base 2) of the function value, while the two subplots in each lower rowreport the logarithm of the gap estimate Φ t from Algorithm 1. The two subplots in the left column of eachplot report performance in terms of number of iterations, while the two subplots in the right column reportwall-clock time. Lasso.
We tested BCG on lasso instances and compared them to vanilla Frank–Wolfe, Away-step Frank–Wolfe,and Pairwise Frank–Wolfe. We generated Lasso instances similar to Lacoste-Julien and Jaggi [2015], whichhas also also been used by several follow-up papers as benchmark. Here we solve min x ∈ P k Ax − b k with16 being the (scaled) ℓ -ball. We considered instances of varying sizes and the results (as well as detailsabout the instance) can be found in Figure 2. Note that we did not benchmark any of the lazified versions ofBraun et al. [2017] here, because the linear programming oracle is so simple that lazification is not beneficialand we used the LP oracle directly. Video co-localization instances.
We also tested BCG on video co-localization instances as done inLacoste-Julien and Jaggi [2015]. It was shown in Joulin et al. [2014] that video co-localization can be natu-rally reformulated as optimizing a quadratic function over a flow (or path) polytope. To this end, we run tests onthe same flow polytope instances as used in Lan et al. [2017] (obtained from http://lime.cs.elte.hu/~kpeter/data/mcf/road/ ).We depict the results in Figure 3.
Structured regression.
We also compared BCG against PCG and LPCG on structured regression prob-lems, where we minimize a quadratic objective function over polytopes corresponding to hard optimizationproblems used as benchmarks in e.g., Braun et al. [2017], Lan et al. [2017], Bashiri and Zhang [2017]. Thepolytopes were taken from MIPLIB2010 (see Koch et al. [2011]). Additionally, we compare ACG, PCG, andvanilla CG over the Birkhoff polytope for which linear optimization is fast, so that there is little gain to beexpected from lazification. See Figures 4 and 5 for results.
Matrix completion.
Clearly, our algorithm also works directly over compact convex sets, even though witha weaker theoretical bound of O ( / ε ) as convex sets need not have a pyramidal width bounded away from 0,and linear optimization might dominate the cost, and hence the advantage of lazification and BCG might beeven greater empirically.To this end, we also considered Matrix Completion instances over the spectrahedron S = { X (cid:23) [ X ] = } ⊆ R n × n , where we solve the problem:min X ∈ S Õ ( i , j )∈ L ( X i , j − T i , j ) , where D = { T i , j | ( i , j ) ∈ L } ⊆ R is a data set. In our tests we used the data sets Movie Lens 100k andMovie Lens 1m from https://grouplens.org/datasets/movielens/ We subsampled in the 1m caseto generate 3 different instances.As in the case of the Lasso benchmarks, we benchmark against ACG, PCG, and CG, as the linearprogramming oracle is simple and there is no gain to be expected from lazification. In the case of matrixcompletion, the performance of BCG is quite comparable to ACG, PCG, and CG in iterations, which makessense over the spectrahedron, because the gradient approximations computed by the linear optimizationoracle are essentially identical to the actual gradient, so that there is no gain from the blending with descentsteps. In wall-clock time, vanilla CG performs best as the algorithm has the lowest implementation overheadbeyond the oracle calls compared to BCG, ACG, and PCG (see Figure 6) and in particular does not have tomaintain the (large) active set.
Sparse signal recovery.
We also performed computational experiments on the sparse signal recoveryinstances from Rao et al. [2015], which have the following form:ˆ x = argmin x ∈ R n : k x k ≤ τ k y − Φ x k .
17e chose a variety of parameters in our tests, including one test that matches the setup in Rao et al. [2015]. Asin the case of the Lasso benchmarks, we benchmark against ACG, PCG, and CG, as the linear programmingoracle is simple and there is no gain to be expected from lazification. The results are shown in Figure 7.
PGD vs. SiGD as subroutine
To demonstrate the superiority of SiGD over PGD we also tested two implementations of BCG, once withstandard PGD as subroutine and once with SiGD as subroutine. The results can be found in Figure 8 (right):while PGD and SiGD compare essentially identical in per-iteration progress, in terms of wall clock time theSiGD variant is much faster. For comparison, we also plotted LPCG on the same instance.
Pairwise steps vs. Frank–Wolfe steps
As pointed out in Section 5.2, a natural extension is to replace the Frank–Wolfe steps in Line 16 ofAlgorithm 1 with pairwise steps, since the information required is readily available. In Figure 8 (left) wedepict representative behavior: Little to no advantage when taking the more complex pairwise step. This isexpected as the Frank–Wolfe steps are only needed to add new vertices as the drop steps are subsumed thesteps from Oracle SiDO. Note that BCG with Frank–Wolfe steps is slightly faster per iteration, allowing formore steps within the time limit.
Comparison between lazified variants and BCG
For completeness, we also ran tests for BCG against various other lazified variants of conditional gradientdescent. The results are consistent with our observations from before which we depict in Figure 9.
Standard vs. accelerated version
Another natural variant of our algorithm is to replace Oracle SiDO with its accelerated variant (both possiblefor PGD and SiGD). As expected, due to the small size of the subproblem, we did not observe any significantspeedup from acceleration; see Figure 10.
Comparison to Fully-Corrective Frank–Wolfe
As mentioned in the introduction, BCG is quite different from FCFW. BCG is much faster and, in fact, FCFWis usually already outpeformed by the much more efficient Pairwise-step CG (PCG), except in some specialcases. In Figure 11, the left column compares FCFW and BCG only across those iterations where FW stepswere taken ; for completeness, we also implemented a variant
FCFW (fixed steps) where only a fixed number ofdescent steps in the correction subroutine are performed. As expected FCFW has a better “per-FW-iterationperformance,” because it performs full correction. The excessive cost of FCFW’s correction routine shows upin the wall-clock time (right column), where FCFW is outperformed even by vanilla pairwise-step CG. Thisbecomes even more apparent when the iterations in the correction subroutine are broken out and reportedas well (see middle column). For purposes of comparison, BCG and FCFW used both SiGD steps in thesubroutine. (This actually gives an advantage to FCFW, as SiGD was not known until the current paper.)The per-iteration progress of FCFW is poor, due to spending many iterations to optimize over active sets thatare irrelevant for the optimal solution. Our tests highlight the fact that correction steps do not have constantcost in practice. 18
Final remarks
In Lan et al. [2017], an accelerated method based on weak separation and conditional gradient sliding wasdescribed. This method provided optimal tradeoffs between (stochastic) first-order oracle calls and weak-separation oracle calls. An open question is whether the same tradeoffs and acceleration could be realizedby replacing SiGD (Algorithm 2) by an accelerated method.After an earlier version of our work appeared online, Kerdreux et al. [2018a] introduced the
HölderError Bound condition (also known as sharpness or the
Łojasiewicz growth condition ). This is a familyof conditions parameterized by 0 < p ≤
1, interpolating between strongly convex ( p =
0) and convexfunctions ( p = O ( / ε p ) has been shown for Away-step Frank–Wolfealgorithms, among others. Our analysis can be similarly extended to objective functions satisfying thiscondition, leading to similar convergence rates. Acknowledgements
We are indebted to Swati Gupta for the helpful discussions. Research reported in this paper was partially sup-ported by NSF CAREER Award CMMI-1452463, and also NSF Awards 1628384, 1634597, and 1740707;Subcontract 8F-30039 from Argonne National Laboratory; Award N660011824020 from the DARPA La-grange Program; and Award W911NF-18-1-0223 from the Army Research Office.
References
M. A. Bashiri and X. Zhang. Decomposition-invariant conditional gradient for general polytopes with linesearch. In
Advances in Neural Information Processing Systems , pages 2687–2697, 2017.G. Braun, S. Pokutta, and D. Zink. Lazifying conditional gradient algorithms.
Proceedings of ICML , 2017.M. Frank and P. Wolfe. An algorithm for quadratic programming.
Naval research logistics quarterly , 3(1-2):95–110, 1956.R. M. Freund and P. Grigas. New analysis and results for the Frank–Wolfe method.
MathematicalProgramming , 155(1):199–230, 2016. ISSN 1436-4646. doi: 10.1007/s10107-014-0841-6. URL http://dx.doi.org/10.1007/s10107-014-0841-6 .R. M. Freund, P. Grigas, and R. Mazumder. An extended Frank–Wolfe method with “in-face” directions, andits application to low-rank matrix completion.
SIAM Journal on Optimization , 27(1):319–346, 2017.D. Garber and E. Hazan. A linearly convergent conditional gradient algorithm with applications to onlineand stochastic optimization. arXiv preprint arXiv:1301.4666 , 2013.D. Garber and O. Meshi. Linear-memory and decomposition-invariant linearly convergent conditionalgradient algorithm for structured polytopes. arXiv preprint, arXiv:1605.06492v1 , May 2016.D. Garber, S. Sabach, and A. Kaplan. Fast generalized conditional gradient method with applications tomatrix recovery problems. arXiv preprint arXiv:1802.05581 , 2018.J. Guélat and P. Marcotte. Some comments on wolfe’s ‘away step’.
Mathematical Programming , 35(1):110–119, 1986. 19urobi Optimization. Gurobi optimizer reference manual version 6.5, 2016. URL .D. H. Gutman and J. F. Peña. The condition of a function relative to a polytope. arXiv preprintarXiv:1802.00271 , Feb. 2018.D. H. Gutman and J. F. Peña. The condition of a function relative to a set. arXiv preprint arXiv:1901.08359 ,Jan. 2019.C. A. Holloway. An extension of the Frank and Wolfe method of feasible directions.
MathematicalProgramming , 6(1):14–27, 1974.M. Jaggi. Revisiting Frank–Wolfe: Projection-free sparse convex optimization. In
Proceedings of the 30thInternational Conference on Machine Learning (ICML-13) , pages 427–435, 2013.A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with frank-wolfe algorithm. In
European Conference on Computer Vision , pages 253–268. Springer, 2014.T. Kerdreux, A. d’Aspremont, and S. Pokutta. Restarting Frank–Wolfe. arXiv preprint arXiv:1810.02429 ,2018a.T. Kerdreux, F. Pedregosa, and A. d’Aspremont. Frank–Wolfe with subsampling oracle. arXiv preprintarXiv:1803.07348 , 2018b.T. Koch, T. Achterberg, E. Andersen, O. Bastert, T. Berthold, R. E. Bixby, E. Danna, G. Gamrath, A. M.Gleixner, S. Heinz, A. Lodi, H. Mittelmann, T. Ralphs, D. Salvagnin, D. E. Steffy, and K. Wolter. MIPLIB2010.
Mathematical Programming Computation , 3(2):103–163, 2011. doi: 10.1007/s12532-011-0025-9.URL http://mpc.zib.de/index.php/MPC/article/view/56/28 .S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank–Wolfe optimizationvariants. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors,
Advancesin Neural Information Processing Systems , volume 28, pages 496–504. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5925-on-the-global-linear-convergence-of-frank-wolfe-optimization-variants.pdf .G. Lan, S. Pokutta, Y. Zhou, and D. Zink. Conditional accelerated lazy stochastic gradient descent.
Proceed-ings of ICML , 2017.G. G. Lan.
Lectures on Optimization for Machine Learning . ISyE, April 2017.E. S. Levitin and B. T. Polyak. Constrained minimization methods.
USSR Computational mathematics andmathematical physics , 6(5):1–50, 1966.A. Nemirovski and D. Yudin.
Problem complexity and method efficiency in optimization . Wiley, 1983. ISBN0-471-10345-4.N. Rao, P. Shah, and S. Wright. Forward–backward greedy algorithms for atomic norm regularization.
IEEETransactions on Signal Processing , 63(21):5798–5811, 2015.20
Upper bound on simplicial curvature
Lemma A.1.
Let $f \colon P \to \mathbb{R}$ be an $L$-smooth function over a polytope $P$ with diameter $D$ in some norm $\|\cdot\|$. Let $S$ be a set of vertices of $P$. Then the function $f_S$ from Section 2.1 is smooth with smoothness parameter at most $L_{f_S} \le L D^2 |S| / 4$.

Proof. Let $S = \{v_1, \dots, v_k\}$. Recall that $f_S \colon \Delta_k \to \mathbb{R}$ is defined on the probability simplex via $f_S(\alpha) \coloneqq f(A\alpha)$, where $A$ is the linear operator defined via $A\alpha \coloneqq \sum_{i=1}^{k} \alpha_i v_i$. We need to show
\[
f_S(\alpha) - f_S(\beta) - \nabla f_S(\beta)(\alpha - \beta) \;\le\; \frac{L D^2 |S|}{4} \cdot \frac{\|\alpha - \beta\|^2}{2}, \qquad \alpha, \beta \in \Delta_k. \qquad (22)
\]
We start by expressing the left-hand side in terms of $f$ and applying the smoothness of $f$:
\[
f_S(\alpha) - f_S(\beta) - \nabla f_S(\beta)(\alpha - \beta) = f(A\alpha) - f(A\beta) - \nabla f(A\beta) \cdot (A\alpha - A\beta) \;\le\; \frac{L}{2}\,\|A\alpha - A\beta\|^2. \qquad (23)
\]
Let $\gamma^+ \coloneqq \max\{\alpha - \beta, 0\}$ and $\gamma^- \coloneqq \max\{\beta - \alpha, 0\}$, with the maximum taken coordinatewise. Then $\alpha - \beta = \gamma^+ - \gamma^-$, where $\gamma^+$ and $\gamma^-$ are nonnegative vectors with disjoint support. In particular,
\[
\|\alpha - \beta\|^2 = \|\gamma^+ - \gamma^-\|^2 = \|\gamma^+\|^2 + \|\gamma^-\|^2. \qquad (24)
\]
Let $\mathbb{1}$ denote the vector of length $k$ with all its coordinates $1$. Since $\mathbb{1}\alpha = \mathbb{1}\beta = 1$, we have $\mathbb{1}\gamma^+ = \mathbb{1}\gamma^-$. Let $t$ denote this last quantity, which is clearly nonnegative. If $t = 0$, then $\gamma^+ = \gamma^- = 0$, so $\alpha = \beta$, hence the claimed (22) is obvious. If $t > 0$, then $\gamma^+/t$ and $\gamma^-/t$ are points of the simplex $\Delta_k$, therefore
\[
D \;\ge\; \|A(\gamma^+/t) - A(\gamma^-/t)\| = \frac{\|A\alpha - A\beta\|}{t}. \qquad (25)
\]
Using (24), with $k^+$ and $k^-$ denoting the number of non-zero coordinates of $\gamma^+$ and $\gamma^-$, respectively, we obtain
\[
\|\alpha - \beta\|^2 = \|\gamma^+\|^2 + \|\gamma^-\|^2 \;\ge\; t^2 \left( \frac{1}{k^+} + \frac{1}{k^-} \right) \;\ge\; t^2 \cdot \frac{4}{k^+ + k^-} \;\ge\; \frac{4 t^2}{k}. \qquad (26)
\]
By (25) and (26) we conclude that $\|A\alpha - A\beta\|^2 \le k D^2 \|\alpha - \beta\|^2 / 4$, which together with (23) proves the claim (22). □
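For a quick sanity check of Lemma A.1 (this example is ours, not from the paper): take $P = \Delta_n$ with the Euclidean norm and $S$ the full vertex set $\{e_1, \dots, e_n\}$, so that $A$ is the identity and $f_S = f$. Then $D = \sqrt{2}$, $|S| = n$, and the lemma gives
\[
L_{f_S} \;\le\; \frac{L D^2 |S|}{4} \;=\; \frac{L \cdot 2 \cdot n}{4} \;=\; \frac{L n}{2},
\]
which is consistent with the exact value $L_{f_S} = L$ whenever $n \ge 2$.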
Lemma A.2.
Let $f \colon P \to \mathbb{R}$ be a convex function over a polytope $P$ with finite simplicial curvature $C^{\Delta}$. Then $f$ has curvature at most $C \le C^{\Delta}$.

Proof. Let $x, y \in P$ be two distinct points of $P$. The line through $x$ and $y$ intersects $P$ in a segment $[w, z]$, where $w$ and $z$ are points on the boundary of $P$, i.e., contained in facets of $P$, which have dimension $\dim P - 1$. Hence, by Carathéodory's theorem, there are vertex sets $S_w, S_z$ of $P$ of size at most $\dim P$ with $w \in \operatorname{conv} S_w$ and $z \in \operatorname{conv} S_z$. As such, $x, y \in \operatorname{conv} S$ with $S \coloneqq S_w \cup S_z$ and $|S| \le 2 \dim P$.

Reusing the notation from the proof of Lemma A.1, let $k \coloneqq |S|$ and let $A$ be a linear transformation with $S = \{A e_1, \dots, A e_k\}$ and $f_S(\zeta) = f(A\zeta)$ for all $\zeta \in \Delta_k$. Since $x, y \in \operatorname{conv} S$, there are $\alpha, \beta \in \Delta_k$ with $x = A\alpha$ and $y = A\beta$. Therefore, by smoothness of $f_S$ together with $L_{f_S} \le C^{\Delta}$ and $\|\beta - \alpha\| \le \sqrt{2}$,
\[
f(\gamma y + (1-\gamma) x) - f(x) - \gamma \nabla f(x)(y - x)
= f(\gamma A\beta + (1-\gamma) A\alpha) - f(A\alpha) - \gamma \nabla f(A\alpha) \cdot (A\beta - A\alpha)
\]
\[
= f_S(\gamma \beta + (1-\gamma)\alpha) - f_S(\alpha) - \gamma \nabla f_S(\alpha)(\beta - \alpha)
\;\le\; \frac{L_{f_S}}{2} \|\gamma(\beta - \alpha)\|^2
= \frac{L_{f_S}}{2} \|\beta - \alpha\|^2 \cdot \gamma^2
\;\le\; C^{\Delta} \gamma^2,
\]
showing that $C \le C^{\Delta}$ as claimed. □
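Both Lemma A.1 and the proof above rest on the smoothness bound (22). The following sketch (ours, not part of the paper; the instance and the choice $f(x) = \tfrac{1}{2}\|x\|^2$, which is 1-smooth in the Euclidean norm, are assumptions for illustration) samples pairs $\alpha, \beta \in \Delta_k$ and checks that the left-hand side of (22) never exceeds the claimed bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 6                        # ambient dimension and |S|
V = rng.normal(size=(k, n))        # rows are the vertices v_1, ..., v_k of S
D = max(np.linalg.norm(V[i] - V[j]) for i in range(k) for j in range(k))
L = 1.0                            # f(x) = 0.5 * ||x||^2 is 1-smooth

f = lambda x: 0.5 * x @ x
f_S = lambda a: f(V.T @ a)              # f_S(alpha) = f(A alpha), A alpha = sum_i alpha_i v_i
grad_f_S = lambda a: V @ (V.T @ a)      # chain rule: A^T grad f(A alpha)

bound = L * D**2 * k / 4.0              # claimed smoothness parameter of f_S
worst_ratio = 0.0
for _ in range(10_000):
    a, b = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
    lhs = f_S(a) - f_S(b) - grad_f_S(b) @ (a - b)
    rhs = 0.5 * bound * np.linalg.norm(a - b) ** 2
    worst_ratio = max(worst_ratio, lhs / rhs)
print("worst lhs/rhs over samples:", worst_ratio)   # stays <= 1 if the lemma holds
```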
Figure 1: Four representative examples. (Upper-left) Sparse signal recovery: $\min_{x \in \mathbb{R}^n \colon \|x\|_1 \le \tau} \|y - \Phi x\|^2$, where $\Phi$ is of size 1000 × …. BCG made 1402 iterations with 155 calls to the weak-separation oracle LPsep_P. The final solution is a convex combination of 152 vertices. (Upper-right) Lasso. We solve $\min_{x \in P} \|Ax - b\|^2$ with $P$ being the (scaled) $\ell_1$-ball. $A$ is a 400 × … matrix; LPsep_P was called 477 times, with the final solution being a convex combination of 462 vertices. (Lower-left) Structured regression over the Birkhoff polytope of dimension 50. BCG made 2057 iterations with 524 calls to LPsep_P. The final solution is a convex combination of 524 vertices. (Lower-right) Video co-localization over the netgen_12b polytope with an underlying 5000-vertex graph. BCG made 140 iterations with 36 calls to LPsep_P. The final solution is a convex combination of 35 vertices.
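Both panels in the top row use the (scaled) $\ell_1$-ball as the feasible region, for which the linear minimization oracle underlying FW-type methods is available in closed form. A minimal sketch (ours; the function name and signature are illustrative and not the paper's implementation):

```python
import numpy as np

def lmo_l1_ball(gradient, tau):
    """Linear minimization oracle for {x : ||x||_1 <= tau}: the minimizer of
    <gradient, v> over the ball is a signed, scaled coordinate vector."""
    i = int(np.argmax(np.abs(gradient)))
    v = np.zeros_like(gradient, dtype=float)
    v[i] = -tau * np.sign(gradient[i])
    return v

# Example: one Frank-Wolfe vertex for the Lasso gradient A^T (A x - b).
A, b = np.random.randn(40, 100), np.random.randn(40)
x = np.zeros(100)
vertex = lmo_l1_ball(A.T @ (A @ x - b), tau=1.0)
```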
Figure 2: Comparison of BCG, ACG, PCG and CG on Lasso instances. Upper-left: $A$ is a 400 × … matrix. … $A$ is a 200 × 200 matrix with 100 non-zeros; BCG made 13952 iterations, calling the LP oracle 258 times, with the final solution being a convex combination of 197 vertices, giving the sparsity. Lower-left: $A$ is a 500 × … matrix. … $A$ is a 1000 × … matrix.
Figure 3: Comparison of PCG, Lazy PCG, and BCG on video co-localization instances. Upper-Left: netgen_12b for a 3000-vertex graph. BCG made 202 iterations, called LPsep_P 56 times, and the final solution is a convex combination of 56 vertices. Upper-Right: netgen_12b over a 5000-vertex graph. BCG made 212 iterations, LPsep_P was called 58 times, and the final solution is a convex combination of 57 vertices. Lower-Left: road_paths_01_DC_a over a 2000-vertex graph. Even on instances where lazy PCG gains little advantage over PCG, BCG performs significantly better with an empirically higher rate of convergence. BCG made 43 iterations, LPsep_P was called 25 times, and the final convex combination has 25 vertices. Lower-Right: netgen_08a over an 800-vertex graph. BCG made 2794 iterations, LPsep_P was called 222 times, and the final convex combination has 106 vertices.
Figure 4: Comparison of BCG, LPCG and PCG on structured regression instances. Upper-Left: Over the disctom polytope. BCG made 3526 iterations with 1410 LPsep_P calls and the final solution is a convex combination of 85 vertices. Upper-Right: Over a maxcut polytope over a graph with 28 vertices. BCG made 76 LPsep_P calls and the final solution is a convex combination of 13 vertices. Lower-Left: Over the m100n500k4r1 polytope. BCG made 2137 iterations with 944 LPsep_P calls and the final solution is a convex combination of 442 vertices. Lower-Right: Over the spanning tree polytope over the complete graph with 10 nodes. BCG made 1983 iterations with 262 LPsep_P calls and the final solution is a convex combination of 247 vertices. BCG outperforms LPCG and PCG, even in the cases where LPCG is much faster than PCG.
Figure 5: Comparison of BCG, ACG, PCG and CG over the Birkhoff polytope. Upper-Left: Dimension 50. BCG made 2057 iterations with 524 LPsep_P calls and the final solution is a convex combination of 524 vertices. Upper-Right: Dimension 100. BCG made 151 iterations with 134 LPsep_P calls and the final solution is a convex combination of 134 vertices. Lower-Left: Dimension 50. BCG made 1040 iterations with 377 LPsep_P calls and the final solution is a convex combination of 377 vertices. Lower-Right: Dimension 80. BCG made 429 iterations with 239 LPsep_P calls and the final solution is a convex combination of 239 vertices. BCG outperforms ACG, PCG and CG in all cases.
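For the Birkhoff polytope, the exact LP oracle (to which LPsep_P falls back when no suitable cached vertex exists) amounts to a linear assignment problem: by the Birkhoff–von Neumann theorem, a linear function over doubly stochastic matrices is minimized at a permutation matrix. A minimal sketch (ours, using SciPy's assignment solver; not the paper's code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lmo_birkhoff(gradient):
    """Minimize <gradient, X> over doubly stochastic n x n matrices:
    solve the assignment problem with `gradient` as the cost matrix and
    return the corresponding permutation matrix (a vertex of the polytope)."""
    rows, cols = linear_sum_assignment(gradient)
    X = np.zeros_like(gradient, dtype=float)
    X[rows, cols] = 1.0
    return X
```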
Figure 6: Comparison of BCG, ACG, PCG and CG on matrix completion instances over the spectrahedron. Upper-Left: Over the MovieLens 100k data set. BCG made 519 iterations with 346 LPsep_P calls and the final solution is a convex combination of 333 vertices. Upper-Right: Over a subset of the MovieLens 1m data set. BCG made 78 iterations with 17 LPsep_P calls and the final solution is a convex combination of 14 vertices. BCG performs very similarly to ACG, PCG, and vanilla CG, as discussed.
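For these matrix completion instances the feasible region is the spectrahedron, i.e., positive semidefinite matrices with fixed trace, whose LP oracle reduces to an extreme eigenvector computation: the minimizer of $\langle G, X\rangle$ over $\{X \succeq 0,\ \operatorname{tr} X = \tau\}$ is $\tau u u^{\top}$ for a unit eigenvector $u$ of the smallest eigenvalue of $G$. A minimal sketch (ours, assuming a symmetric gradient matrix; not the paper's implementation):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def lmo_spectrahedron(gradient, tau=1.0):
    """Minimize <gradient, X> over {X psd, trace(X) = tau}: the optimum is the
    rank-one matrix tau * u u^T built from an eigenvector of the smallest
    (algebraic) eigenvalue of the symmetric matrix `gradient`."""
    _, U = eigsh(gradient, k=1, which='SA')   # Lanczos for the smallest eigenvalue
    u = U[:, 0]
    return tau * np.outer(u, u)
```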
Figure 7: Comparison of BCG, ACG, PCG and CG on a sparse signal recovery problem. Upper-Left: Dimension is 5000 × …. BCG made 547 iterations with 102 LPsep_P calls and the final solution is a convex combination of 102 vertices. Upper-Right: Dimension is 1000 × …; … LPsep_P calls and the final solution is a convex combination of 152 vertices. Lower-Left: Dimension is 10000 × …. BCG made 997 iterations with 87 LPsep_P calls and the final solution is a convex combination of 52 vertices. Lower-Right: Dimension is 5000 × …. BCG made 1569 iterations with 124 LPsep_P calls and the final solution is a convex combination of 103 vertices. BCG significantly outperforms all other algorithms in all examples.
Figure 8: Comparison of BCG variants on a small video co-localization instance (netgen_10a). Left: BCG with vanilla Frank–Wolfe steps (red) and with pairwise steps (purple). Performance is essentially equivalent here, which matches our observations on other instances. Right: Comparison of the oracle implementations PGD and SiGD. SiGD is significantly faster in wall-clock time.
Figure 9: Comparison of BCG, LCG, ACG, and PCG. Left: Structured regression instance over the spanning tree polytope over the complete graph with 11 nodes, demonstrating a significant performance difference in improving the function value and closing the dual gap; BCG made 3031 iterations, LPsep_P was called 1501 times (almost always terminated early), and the final solution is a convex combination of only 232 vertices. Right: Structured regression over the disctom polytope; BCG made 346 iterations, LPsep_P was called 71 times, and the final solution is a convex combination of only 39 vertices. Observe that not only does the function value decrease faster, but so does the gap estimate.

Figure 10: Comparison of BCG, accelerated BCG and LPCG. Left: On a medium-size video co-localization instance (netgen_12b). Right: On a larger video co-localization instance (road_paths_01_DC_a). Here the accelerated version is (slightly) better in iterations but not in wall-clock time. These findings are representative of all our other tests.
Figure 11: Comparison to FCFW across FW iterations, (all) iterations, and wall-clock time on a Lasso instance. The test was run with a 40-second time limit. In this test we explicitly computed the dual gap of BCG, rather than using the estimate $\Phi_t$.
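For reference, the quantity computed here is the standard Frank–Wolfe dual gap (a well-known bound, not specific to this paper),
\[
g(x_t) \;=\; \max_{v \in P} \langle \nabla f(x_t),\, x_t - v \rangle \;\ge\; f(x_t) - f(x^*),
\]
whose exact evaluation requires one call to the (strong) LP oracle per iteration, in contrast to the estimate $\Phi_t$ maintained by BCG.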