Complexity of Linear Minimization and Projection on Some Sets
Cyrille W. Combettes [email protected]
Sebastian Pokutta [email protected]
School of Industrial and Systems Engineering, Georgia Institute of Technology, USA
Institute of Mathematics, Technische Universität Berlin, Germany
Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany
Abstract
The Frank-Wolfe algorithm is a method for constrained optimization that relies on linear minimizations, as opposed to projections. Therefore, a motivation put forward in a large body of work on the Frank-Wolfe algorithm is the computational advantage of solving linear minimizations instead of projections. However, the discussions supporting this advantage are often too succinct or incomplete. In this paper, we review the complexity bounds for both tasks on several sets commonly used in optimization. Projection methods onto the $\ell_p$-ball, $p \in \left]1,2\right[ \cup \left]2,+\infty\right[$, and the Birkhoff polytope are also proposed.

1 Introduction

We consider the constrained optimization problem
$$\min_{x \in \mathcal{C}} f(x), \qquad (1)$$
where $\mathcal{C} \subset \mathbb{R}^n$ is a compact convex set and $f \colon \mathbb{R}^n \to \mathbb{R}$ is a smooth function. Among all general purpose methods addressing problem (1), the Frank-Wolfe algorithm [17], a.k.a. conditional gradient algorithm [41], has the particularity of never requiring projections onto $\mathcal{C}$. It uses linear minimizations over $\mathcal{C}$ instead and is therefore often referred to as a projection-free algorithm in the literature, in the sense that it does not call for solutions to quadratic optimization subproblems.

Thus, a motivation put forward in a large body of work on the Frank-Wolfe algorithm is the computational advantage of solving linear minimizations instead of projections. However, only a few works actually provide examples. On the other hand, the complexities of linear minimizations over several sets are available in [27, 29, 19], but they do not always (accurately) discuss the complexities of the respective projections. Therefore, while it is intuitive that a linear minimization is simpler to solve than a projection in general, a complete quantitative assessment is necessary to properly motivate the projection-free property of the Frank-Wolfe algorithm.

Contributions.
We review the complexity bounds of linear minimizations and projections on several sets commonly used in optimization: the standard simplex, the $\ell_p$-balls for $p \in [1, +\infty]$, the nuclear norm-ball, the flow polytope, the Birkhoff polytope, and the permutahedron. These sets are selected because linear minimizations or projections can be solved very efficiently, rather than by resorting to a general purpose method, in which case the analysis is less interesting. We also propose two methods for projecting onto the $\ell_p$-ball and the Birkhoff polytope respectively, and we analyze their complexity. Computational experiments for the $\ell_1$-ball and the nuclear norm-ball are presented.

We would like to stress that, while it is possible that a projection-based algorithm requires fewer iterations than the Frank-Wolfe algorithm to find a solution to problem (1), our goal here is to demonstrate its advantage in terms of per-iteration complexity. We discuss the Frank-Wolfe algorithm and some successful applications in Section 2.2.

2 Preliminaries

2.1 Notation

We work in the Euclidean space $\mathbb{R}^n$ or $\mathbb{R}^{m \times n}$ equipped with the standard scalar product $\langle x, y \rangle = x^\top y$ or $\langle X, Y \rangle = \operatorname{tr}(X^\top Y)$. We denote by $\|\cdot\|$ the norm induced by the scalar product, i.e., the $\ell_2$-norm $\|\cdot\|_2$ or the Frobenius norm $\|\cdot\|_F$ respectively. For any closed convex set $\mathcal{C} \subset \mathbb{R}^n$, the projection operator onto $\mathcal{C}$, the distance function to $\mathcal{C}$, and the diameter of $\mathcal{C}$, all with respect to $\|\cdot\|$, are denoted by $\operatorname{proj}(\cdot, \mathcal{C})$, $\operatorname{dist}(\cdot, \mathcal{C})$, and $\operatorname{diam}(\mathcal{C})$ respectively.

For every $i, j \in \mathbb{N}$ such that $i \leq j$, the brackets $\llbracket i, j \rrbracket$ denote the set of integers between (and including) $i$ and $j$. For all $x \in \mathbb{R}^n$ and $i, j \in \llbracket 1, n \rrbracket$ such that $i \leq j$, $[x]_i$ denotes the $i$-th entry of $x$ and $[x]_{i:j} = ([x]_i, \ldots, [x]_j)^\top \in \mathbb{R}^{j-i+1}$. The signum function is $\operatorname{sign} \colon \lambda \in \mathbb{R} \mapsto 1$ if $\lambda > 0$, $-1$ if $\lambda < 0$, and $0$ if $\lambda = 0$. The characteristic function of an event $E$ is $\mathbb{1}_E = 1$ if $E$ is true, else $0$. The indicator function of a set $\mathcal{C} \subset \mathbb{R}^n$ is $\iota_{\mathcal{C}} \colon x \in \mathbb{R}^n \mapsto 0$ if $x \in \mathcal{C}$, else $+\infty$. Operations on vectors in $\mathbb{R}^n$, such as $\operatorname{sign}(x)$, $|x|$, $x^p$, $\max\{x, y\}$, $xy$, that are conventionally applied to scalars, are carried out entrywise and return a vector in $\mathbb{R}^n$. The shape of $0$ and $1$ will be clear from context, i.e., a scalar or a vector. The identity matrix in $\mathbb{R}^{n \times n}$ is denoted by $I_n$. The matrix with all ones in $\mathbb{R}^{m \times n}$ is denoted by $J_{m,n}$, and by $J_n$ if $m = n$.

We follow the real-number infinite-precision model of computation. The complexity of a computational task is the number of arithmetic operations necessary to execute it. We ran the experiments on a laptop under Linux Ubuntu 20.04 with Intel Core i7-10750H. The code is available at https://github.com/cyrillewcombettes/complexity.

2.2 The Frank-Wolfe algorithm

The Frank-Wolfe algorithm (FW) [17], a.k.a. conditional gradient algorithm [41], is a first-order projection-free algorithm for solving constrained optimization problems (1). It is presented in Algorithm 1.
Algorithm 1
Frank-Wolfe (FW)
Input:
Start point $x_0 \in \mathcal{C}$, step-size strategy $(\gamma_t)_{t \in \mathbb{N}} \subset [0, 1]$.
1: for $t = 0$ to $T - 1$ do
2:   $v_t \leftarrow \arg\min_{v \in \mathcal{C}} \langle v, \nabla f(x_t) \rangle$
3:   $x_{t+1} \leftarrow x_t + \gamma_t (v_t - x_t)$
4: end for

At each iteration, FW minimizes the linear approximation of $f$ at $x_t$ over $\mathcal{C}$ (Line 2), i.e., $\min_{v \in \mathcal{C}} f(x_t) + \langle v - x_t, \nabla f(x_t) \rangle$, and takes a step toward the solution $v_t \in \mathcal{C}$ with a step-size $\gamma_t \in [0, 1]$ (Line 3). This ensures that the new iterate $x_{t+1} = (1 - \gamma_t) x_t + \gamma_t v_t \in \mathcal{C}$ is feasible by convexity, and there is no need for a projection back onto $\mathcal{C}$. For this projection-free property, FW has encountered numerous applications, including solving traffic assignment problems [39], performing video co-localization [30], and developing adversarial attacks [6].

When $f$ is convex, FW converges at a rate $f(x_t) - \min_{\mathcal{C}} f = \mathcal{O}(1/t)$ for different step-size strategies [17, 15, 29], which is optimal in general [5, 29]. Faster rates can be established under additional assumptions on the properties of $f$ or the geometry of $\mathcal{C}$ [41, 23, 20, 31]. Recently, several variants have also been developed to improve its performance [36, 38, 21, 18, 4, 9]. When $f$ is nonconvex, [35] showed that FW converges to a stationary point at a rate $\mathcal{O}(1/\sqrt{t})$ in the gap $\max_{v \in \mathcal{C}} \langle x_t - v, \nabla f(x_t) \rangle$ [28], which has inspired a line of work in stochastic optimization [48, 52, 51, 10, 53].

Lastly, note that FW is also popular for the natural sparsity of its iterates with respect to the vertices of $\mathcal{C}$, as $x_t \in \operatorname{conv}\{x_0, v_0, \ldots, v_{t-1}\}$ [7, 26, 37, 43, 8].

3 Complexity of linear minimizations and projections

The Frank-Wolfe algorithm avoids projections by computing linear minimizations instead. In Table 1, we summarize the complexities of a linear minimization and a (Euclidean) projection on several sets commonly used in optimization. That is, we compare the complexities of solving
$$\min_{x \in \mathcal{C}} \langle x, y \rangle \quad \text{and} \quad \min_{x \in \mathcal{C}} \|x - y\|^2. \qquad (2)$$
When an exact solution cannot be computed directly, we compare to the complexity of finding an $\varepsilon$-approximate solution; note that the two objectives in (2) are homogeneous. In that case, we solve $\min_{x \in \mathcal{C}} \|x - y\|^2$ using any method but the Frank-Wolfe algorithm, since it would go against the purpose of this paper. Note however that the Frank-Wolfe algorithm can generate a solution with complexity $\mathcal{O}(\operatorname{iter}(\mathcal{C}) \operatorname{diam}(\mathcal{C})^2 / \varepsilon)$, where $\operatorname{iter}(\mathcal{C})$ denotes the complexity of an iteration, which amounts to that of a linear minimization over $\mathcal{C}$. When addressing problem (1), solving projection subproblems via the Frank-Wolfe algorithm is known as conditional gradient sliding [38].

Table 1: Complexities of linear minimizations and (Euclidean) projections on some sets commonly used in optimization. The constants $\rho$, $d_y$, and $d_z$ are defined in (10), (11), and (14) respectively, $\tilde{\mathcal{O}}$ hides polylogarithmic factors, and we denote by $\nu$ and $\sigma_1$ the number of nonzero entries and the top singular value of $-Y$ respectively, and by $\varepsilon > 0$ the additive error in the objective of (2) when an approximate solution is computed.

| Set $\mathcal{C}$ | Linear minimization | Projection |
| $\ell_p$-ball, $p \in \{1, 2, +\infty\}$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| $\ell_p$-ball, $p \in \left]1,2\right[ \cup \left]2,+\infty\right[$ | $\mathcal{O}(n)$ | $\mathcal{O}(n \rho^2 d_y^2 / \varepsilon)$ |
| Nuclear norm-ball | $\mathcal{O}(\nu \ln(m+n) \sqrt{\sigma_1} / \sqrt{\varepsilon})$ | $\mathcal{O}(mn \min\{m,n\} + \min\{m,n\}^3)$ |
| Flow polytope | $\mathcal{O}(m + n)$ | $\tilde{\mathcal{O}}(mn + n)$ |
| Birkhoff polytope | $\mathcal{O}(n^3)$ | $\mathcal{O}(n^2 d_z^2 / \varepsilon)$ |
| Permutahedron | $\mathcal{O}(n \ln(n))$ | $\mathcal{O}(n \ln(n) + n)$ |

We now provide details for the complexities reported in Table 1.
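To fix ideas on how these oracles are consumed, here is a minimal Python sketch of Algorithm 1, instantiated with the $\ell_1$-ball linear minimization oracle derived in Section 3.1 below. The function names, the open-loop step-size $\gamma_t = 2/(t+2)$, and the fixed iteration budget are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def lmo_l1(grad):
    """Linear minimization over the l1-ball: argmin_{||v||_1 <= 1} <v, grad>."""
    i = np.argmax(np.abs(grad))
    v = np.zeros_like(grad)
    v[i] = -np.sign(grad[i])
    return v

def frank_wolfe(grad_f, x0, lmo, num_steps=1000):
    """Algorithm 1 with the open-loop step-size gamma_t = 2 / (t + 2)."""
    x = x0.copy()
    for t in range(num_steps):
        v = lmo(grad_f(x))          # linear minimization over C (Line 2)
        gamma = 2.0 / (t + 2.0)
        x = x + gamma * (v - x)     # convex combination stays in C (Line 3)
    return x

# Example: minimize f(x) = ||x - b||^2 / 2 over the l1-ball.
b = np.array([0.3, -1.2, 0.5])
x = frank_wolfe(lambda x: x - b, np.zeros(3), lmo_l1)
```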
Slightly abusing notation, although we may have $\operatorname{card}(\arg\min_{x \in \mathcal{C}} \langle x, y \rangle) > 1$, we write $\arg\min_{x \in \mathcal{C}} \langle x, y \rangle = x^*$ instead of $\arg\min_{x \in \mathcal{C}} \langle x, y \rangle \ni x^*$.

3.1 The $\ell_1$-ball and the standard simplex

Let $\{e_1, \ldots, e_n\}$ denote the standard basis in $\mathbb{R}^n$. The $\ell_1$-ball is
$$\{x \in \mathbb{R}^n \mid \|x\|_1 \leq 1\} = \Big\{x \in \mathbb{R}^n \;\Big|\; \sum_{i=1}^n |[x]_i| \leq 1\Big\} = \operatorname{conv}\{\pm e_1, \ldots, \pm e_n\},$$
and the standard simplex is
$$\Delta_n = \{x \in \mathbb{R}^n \mid \langle x, 1 \rangle = 1,\; x \geq 0\} = \operatorname{conv}\{e_1, \ldots, e_n\}.$$
A projection onto the $\ell_1$-ball amounts to computing a projection onto the standard simplex, for which the most efficient algorithms have a complexity $\mathcal{O}(n)$; see [12] for a review. On the other hand, linear minimizations are available in closed form: for all $y \in \mathbb{R}^n$,
$$\arg\min_{x \in \Delta_n} \langle x, y \rangle = e_{i_{\min}} \quad \text{and} \quad \arg\min_{\|x\|_1 \leq 1} \langle x, y \rangle = -\operatorname{sign}([y]_{i_{\max}})\, e_{i_{\max}},$$
where $i_{\min} \in \arg\min_{i \in \llbracket 1, n \rrbracket} [y]_i$ and $i_{\max} \in \arg\max_{i \in \llbracket 1, n \rrbracket} |[y]_i|$. Thus, while their complexities can both be written $\mathcal{O}(n)$, in practice linear minimizations are much simpler to solve than projections. Figure 1 presents a computational comparison.

Figure 1: Solving a linear minimization (linear program, LP) and a projection (quadratic program, QP) on the $\ell_1$-ball. The input vector $y \in \mathbb{R}^n$ is generated by sampling entries from the standard normal distribution and the projection method is [12, Fig. 2], which is state-of-the-art in practice [12, Tab. 3].

3.2 The $\ell_2$-ball and the $\ell_\infty$-ball

For all $x \in \mathbb{R}^n$, $\|x\|_2 = \sqrt{\sum_{i=1}^n [x]_i^2}$ and $\|x\|_\infty = \max_{i \in \llbracket 1, n \rrbracket} |[x]_i|$. Here, linear minimizations have no significant advantage over projections as they are all available in closed form. For all $y \in \mathbb{R}^n$,
$$\arg\min_{\|x\|_2 \leq 1} \langle x, y \rangle = -\frac{y}{\|y\|_2} \quad \text{and} \quad \arg\min_{\|x\|_2 \leq 1} \|x - y\|^2 = \frac{y}{\max\{\|y\|_2, 1\}},$$
and
$$\arg\min_{\|x\|_\infty \leq 1} \langle x, y \rangle = -\operatorname{sign}(y) \quad \text{and} \quad \arg\min_{\|x\|_\infty \leq 1} \|x - y\|^2 = \operatorname{sign}(y) \min\{|y|, 1\}.$$
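These closed forms translate directly into a few lines of code. A minimal sketch follows; the function names are illustrative, and the $\mathcal{O}(n)$ simplex projection of [12] is not reproduced here.

```python
import numpy as np

def lmo_simplex(y):
    """argmin over the standard simplex: the vertex e_{i_min}."""
    v = np.zeros_like(y)
    v[np.argmin(y)] = 1.0
    return v

def lmo_l2(y):
    """argmin over the l2-ball (assumes y != 0)."""
    return -y / np.linalg.norm(y)

def proj_l2(y):
    """Projection onto the l2-ball."""
    return y / max(np.linalg.norm(y), 1.0)

def lmo_linf(y):
    """argmin over the l-infinity ball."""
    return -np.sign(y)

def proj_linf(y):
    """Projection onto the l-infinity ball (entrywise clipping)."""
    return np.sign(y) * np.minimum(np.abs(y), 1.0)
```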
3.3 The $\ell_p$-balls for $p \in \left]1, +\infty\right[$

Let $p \in \left]1, +\infty\right[$. The $\ell_p$-ball is
$$\{x \in \mathbb{R}^n \mid \|x\|_p \leq 1\} = \Big\{x \in \mathbb{R}^n \;\Big|\; \Big(\sum_{i=1}^n |[x]_i|^p\Big)^{1/p} \leq 1\Big\}.$$
Linear minimizations are available in closed form: by duality, for all $y \in \mathbb{R}^n$,
$$\arg\min_{\|x\|_p \leq 1} \langle x, y \rangle = -\nabla \|\cdot\|_q(y) = -\frac{\operatorname{sign}(y)\, |y|^{q-1}}{\|y\|_q^{q-1}},$$
where $q = p/(p-1) \in \left]1, +\infty\right[$. To the best of our knowledge, there is no projection method specific to the $\ell_p$-ball when $p \in \left]1,2\right[ \cup \left]2,+\infty\right[$. We use [11, Alg. 6.5], a Haugazeau-like algorithm [25] for projecting onto the intersection of lower level sets of convex functions. For a single lower level set, the problem reads
$$\min_{x \in \mathbb{R}^n} \|x - y\|^2 \quad \text{s.t.} \quad g(x) \leq 0, \qquad (3)$$
and we assume that $g \colon \mathbb{R}^n \to \mathbb{R}$ is convex and differentiable for ease of exposition. In our case, $g = \|\cdot\|_p^p - 1$. Alternatively, one could use a Lagrange multiplier to formulate (3) as a strongly convex unconstrained problem, but finding the corresponding multiplier may require a considerable tuning effort; also note that information is usually given in the form $g(x) \leq 0$ rather than in the form of a Lagrange multiplier. The method is presented in Algorithm 2, where for all $a, b \in \mathbb{R}^n$,
$$H(a, b) = \{x \in \mathbb{R}^n \mid \langle x - b, a - b \rangle \leq 0\},$$
and $G_t$ is the projection of $x_t$ onto $\{x \in \mathbb{R}^n \mid g(x_t) + \langle x - x_t, \nabla g(x_t) \rangle \leq 0\} = H(x_t, G_t)$. If $g(x_t) \leq 0$, then $x_{t+s} = x^*$ for all $s \in \mathbb{N}$ [11, Prop. 3.1], where $x^*$ is the solution to problem (3).

Algorithm 2
Haugazeau-like for problem (3)
Input:
Point to project $y \in \mathbb{R}^n$.
1: $x_0 \leftarrow y$
2: for $t = 0$ to $T - 1$ do
3:   $G_t \leftarrow x_t - \dfrac{g(x_t)}{\|\nabla g(x_t)\|^2} \nabla g(x_t)\, \mathbb{1}_{g(x_t) > 0}$
4:   $x_{t+1} \leftarrow \operatorname{proj}(x_0, H(x_0, x_t) \cap H(x_t, G_t))$
5: end for

The projection in Line 4 is available in closed form [25, Thm. 3-1]; see also [2, Cor. 29.25]. The complexity of an iteration of Algorithm 2 is $\mathcal{O}(n)$. We propose in Theorem 3.1 the convergence rate of Algorithm 2, based on a key result from [1, Thm. 7.12]. A convergence rate is also proposed in [46], however it uses a stronger assumption and has a minor error in the exponent of the constant.
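For concreteness, here is a minimal Python sketch of Algorithm 2 specialized to $g = \|\cdot\|_p^p - 1$. The closed-form projection of Line 4 is written out following the construction recalled in [2, Cor. 29.25]; the function names and the fixed iteration budget are assumptions made for the example, not part of the method's specification.

```python
import numpy as np

def haugazeau_projection(x0, x, z):
    """proj(x0, H(x0, x) ∩ H(x, z)), the closed form used in Line 4."""
    pi = np.dot(x0 - x, x - z)
    mu = np.dot(x0 - x, x0 - x)
    nu = np.dot(x - z, x - z)
    rho = mu * nu - pi ** 2
    if rho <= 0.0 and pi >= 0.0:
        return z
    if rho > 0.0 and pi * nu >= rho:
        return x0 + (1.0 + pi / nu) * (z - x)
    if rho > 0.0 and pi * nu < rho:
        return x + (nu / rho) * (pi * (x0 - x) + mu * (z - x))
    raise ValueError("the two halfspaces do not intersect")

def project_lp_ball(y, p, num_iters=200):
    """Algorithm 2 with g = ||.||_p^p - 1 (Section 3.3); O(n) per iteration."""
    x0 = np.asarray(y, dtype=float)
    x = x0.copy()
    for _ in range(num_iters):
        g = np.sum(np.abs(x) ** p) - 1.0
        if g > 0.0:
            grad = p * np.sign(x) * np.abs(x) ** (p - 1)
            G = x - (g / np.dot(grad, grad)) * grad   # Line 3 with 1_{g>0} = 1
        else:
            G = x                                     # Line 3 with 1_{g>0} = 0
        x = haugazeau_projection(x0, x, G)            # Line 4
    return x
```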
Theorem 3.1. Let $g \colon \mathbb{R}^n \to \mathbb{R}$ be a differentiable convex function and $\mathcal{C} = \{x \in \mathbb{R}^n \mid g(x) \leq 0\}$, and suppose that there exists $\hat{x} \in \mathcal{C}$ such that $g(\hat{x}) < 0$. Consider Algorithm 2 and let $x^* = \operatorname{proj}(x_0, \mathcal{C})$. Then, for all $t \in \mathbb{N}$,
$$\|x^* - x_0\|^2 - \|x_t - x_0\|^2 \leq \frac{4\max\{\rho^2, 1\}\, \|x_0 - x^*\|^2}{t + 2}, \qquad (4)$$
and
$$\|x_t - x^*\| \leq \frac{2\max\{\rho, 1\}\, \|x_0 - x^*\|}{\sqrt{t + 2}}, \qquad (5)$$
where $\rho = (-1/g(\hat{x})) \sup_{t \in \mathbb{N}} \|\nabla g(x_t)\|\, \|x_t - \hat{x}\| < +\infty$.

Proof. First, note that by [11, Prop. 3.1], for all $t \in \mathbb{N}$,
$$\|x_t - x_0\| \leq \|x_{t+1} - x_0\| \leq \|x^* - x_0\|. \qquad (6)$$
We prove by induction that (4) holds for all $t \in \mathbb{N}$. The base case $t = 0$ is trivial. Suppose that (4) holds at iteration $t \in \mathbb{N}$. Since $x_{t+1} \in H(x_0, x_t)$ and $x_{t+1} \in H(x_t, G_t)$, we have
$$\|x_{t+1} - x_0\|^2 - \|x_t - x_0\|^2 = \|x_{t+1} - x_t\|^2 + 2\langle x_{t+1} - x_t, x_t - x_0 \rangle \geq \|x_{t+1} - x_t\|^2 \geq \operatorname{dist}(x_t, H(x_t, G_t))^2. \qquad (7)$$
By [1, Thm. 7.12],
$$\operatorname{dist}(x_t, \mathcal{C}) \leq \rho \operatorname{dist}(x_t, H(x_t, G_t)), \qquad (8)$$
where $\rho = (-1/g(\hat{x})) \sup_{t \in \mathbb{N}} \|\nabla g(x_t)\|\, \|x_t - \hat{x}\|$, and $\rho < +\infty$ because $(x_t)_{t \in \mathbb{N}}$ converges [2, Cor. 30.9]. Now, $x^* = \operatorname{proj}(x_0, \mathcal{C})$ so $\mathcal{C} \subset H(x_0, x^*)$. We can assume that $x_t \neq x^*$, so $x_t \notin H(x_0, x^*)$ by (6). By [2, Ex. 29.20],
$$\operatorname{dist}(x_t, H(x_0, x^*)) = \frac{\langle x_t - x^*, x_0 - x^* \rangle}{\|x_0 - x^*\|}.$$
Thus,
$$\operatorname{dist}(x_t, \mathcal{C}) \geq \operatorname{dist}(x_t, H(x_0, x^*)) = \frac{\langle x_t - x^*, x_0 - x^* \rangle}{\|x_0 - x^*\|} = \frac{\langle x_t - x_0, x_0 - x^* \rangle + \|x_0 - x^*\|^2}{\|x_0 - x^*\|} \geq \|x_0 - x^*\| - \|x_t - x_0\| = \|x_0 - x^*\| - \sqrt{\|x^* - x_0\|^2 - \big(\|x^* - x_0\|^2 - \|x_t - x_0\|^2\big)} \geq 0, \qquad (9)$$
where we used the Cauchy-Schwarz inequality in the second inequality and (6) in the last inequality. Let $\varepsilon_s = \|x^* - x_0\|^2 - \|x_s - x_0\|^2$ for $s \in \{t, t+1\}$. Combining (7)–(9),
$$\varepsilon_{t+1} \leq \varepsilon_t - \frac{1}{\rho^2} \Big(\|x_0 - x^*\| - \sqrt{\|x^* - x_0\|^2 - \varepsilon_t}\Big)^2.$$
Since $\sqrt{\alpha - \beta} \leq \sqrt{\alpha} - \beta/(2\sqrt{\alpha})$ for all $\alpha \geq \beta > 0$, we obtain
$$\varepsilon_{t+1} \leq \varepsilon_t - \frac{1}{\rho^2} \Big(\|x_0 - x^*\| - \Big(\|x^* - x_0\| - \frac{\varepsilon_t}{2\|x^* - x_0\|}\Big)\Big)^2 = \Big(1 - \frac{\varepsilon_t}{4\rho^2 \|x^* - x_0\|^2}\Big) \varepsilon_t.$$
Set $\kappa = 4\max\{\rho^2, 1\}\, \|x_0 - x^*\|^2$, so that $\varepsilon_{t+1} \leq (1 - \varepsilon_t/\kappa)\varepsilon_t$. We have $\varepsilon_t \leq \kappa/(t+2)$ by the induction hypothesis, so
$$\varepsilon_{t+1} \leq \Big(1 - \frac{\varepsilon_t}{\kappa}\Big)\varepsilon_t \leq \begin{cases} \dfrac{\kappa}{t+3} & \text{if } \varepsilon_t \leq \dfrac{\kappa}{t+3}, \\[4pt] \Big(1 - \dfrac{1}{t+3}\Big)\dfrac{\kappa}{t+2} & \text{if } \varepsilon_t \geq \dfrac{\kappa}{t+3}, \end{cases} \quad\leq\quad \frac{\kappa}{t+3}.$$
We conclude that (4) holds for all $t \in \mathbb{N}$. Then, for all $t \in \mathbb{N}$,
$$\|x_t - x^*\|^2 = \|x^* - x_0\|^2 - \|x_t - x_0\|^2 + 2\langle x^* - x_t, x_0 - x_t \rangle \leq \|x^* - x_0\|^2 - \|x_t - x_0\|^2,$$
because $x^* \in H(x_0, x_t)$, since $\mathcal{C} \subset H(x_0, x_t)$ [11, Prop. 5.2]. This proves (5). ∎

In our case, $g = \|\cdot\|_p^p - 1$ and $g(0) = -1 < 0$, so Theorem 3.1 holds with
$$\rho = \sup_{t \in \mathbb{N}} \|\nabla g(x_t)\|\, \|x_t\| = p \sup_{t \in \mathbb{N}} \|x_t\|_{2(p-1)}^{p-1}\, \|x_t\|_2 < +\infty. \qquad (10)$$
Therefore, the complexity of an $\varepsilon$-approximate projection onto the $\ell_p$-ball is $\mathcal{O}(n \rho^2 d_y^2 / \varepsilon)$, where
$$d_y = \|y - x^*\|. \qquad (11)$$

Remark 3.2. If $p \in [1, +\infty[ \cap \mathbb{Q}$, another option, although probably less practical, is to formulate the projection problem as a conic quadratic program and to obtain an $\varepsilon$-approximate solution using an interior-point algorithm, with complexity $\mathcal{O}(\operatorname{poly}(n) \ln(1/\varepsilon))$ [3].

3.4 The nuclear norm-ball

This is probably the most popular example of the computational advantage of linear minimizations over projections in the literature. The nuclear norm, a.k.a. trace norm, of a matrix is the sum of its singular values and serves as a convex surrogate for the rank constraint [16]. The nuclear norm-ball is the convex hull of rank-one matrices:
$$\{X \in \mathbb{R}^{m \times n} \mid \|X\|_{\mathrm{nuc}} \leq 1\} = \operatorname{conv}\{uv^\top \mid u \in \mathbb{R}^m, v \in \mathbb{R}^n, \|u\| = \|v\| = 1\}.$$
For all $Y \in \mathbb{R}^{m \times n}$,
$$\arg\min_{\|X\|_{\mathrm{nuc}} \leq 1} \|X - Y\|_F^2 = U \operatorname{diag}(\hat{\sigma}) V^\top,$$
where $Y = U \operatorname{diag}(\sigma) V^\top$ is the singular value decomposition (SVD) of $Y$, $(U, \sigma, V) \in \mathbb{R}^{m \times k} \times \mathbb{R}^k \times \mathbb{R}^{n \times k}$, $k = \min\{m, n\}$, and $\hat{\sigma}$ is the projection of $\sigma$ onto the standard simplex $\Delta_k$. The SVD can be computed with complexity $\mathcal{O}(mn \min\{m, n\} + \min\{m, n\}^3)$ using the Golub-Reinsch algorithm or the R-SVD algorithm [22, Fig. 8.6.1]. On the other hand, a linear minimization requires only a truncated SVD:
$$\arg\min_{\|X\|_{\mathrm{nuc}} \leq 1} \langle X, Y \rangle = \arg\max_{\|u\| = \|v\| = 1} \operatorname{tr}((uv^\top)^\top (-Y)) = \arg\max_{\|u\| = \|v\| = 1} u^\top (-Y) v = uv^\top,$$
where $u$ and $v$ are the top left and right singular vectors of $-Y$. A pair of unit vectors $(u, v) \in \mathbb{R}^m \times \mathbb{R}^n$ satisfying $\sigma_1 - u^\top (-Y) v \leq \varepsilon$ with high probability can be obtained using the Lanczos algorithm with complexity $\mathcal{O}(\nu \ln(m + n) \sqrt{\sigma_1} / \sqrt{\varepsilon})$, where $\sigma_1$ and $\nu$ denote the top singular value and the number of nonzero entries of $-Y$ respectively [29, 33]. Note that $\nu \leq mn$ and that in many applications of interest, e.g., in recommender systems, $\nu \ll mn$. In practice, the package ARPACK [40] is often used to compute the top pair of singular vectors. Furthermore, if the input matrix $Y$ is symmetric, then the package LOBPCG [32] can be particularly efficient. Figure 2 illustrates both cases.

Figure 2: Solving a linear minimization (linear program, LP) and a projection (quadratic program, QP) on the nuclear norm-ball. A matrix $Y \in \mathbb{R}^{n \times n}$ is generated by sampling entries from the standard normal distribution. The full and truncated SVDs are computed using the functions svd and svds from the Python packages numpy.linalg [24] and scipy.sparse.linalg [50] respectively. Left: the input is $Y$ and the function svds is used with solver='arpack'. Right: the input is the symmetric matrix $(Y + Y^\top)/2$ and the function svds is used with solver='lobpcg'.
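The contrast between the two computations can be sketched in a few lines of Python. The helper project_simplex below is the simple sorting-based method rather than the fastest algorithm of [12], the early return for matrices already inside the ball is an added guard, and all function names are illustrative.

```python
import numpy as np
from scipy.sparse.linalg import svds

def project_simplex(s):
    """Projection onto the standard simplex (simple O(k log k) variant)."""
    u = np.sort(s)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, s.size + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1] + 1
    tau = (css[rho - 1] - 1.0) / rho
    return np.maximum(s - tau, 0.0)

def lmo_nuclear(Y):
    """Linear minimization over the nuclear norm-ball: the rank-one matrix
    -u v^T built from the top singular pair of Y (truncated SVD)."""
    u, _, vt = svds(Y, k=1)
    return -np.outer(u[:, 0], vt[0])

def proj_nuclear(Y):
    """Projection onto the nuclear norm-ball via a full SVD."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    if s.sum() <= 1.0:
        return Y                    # Y is already inside the ball
    return (U * project_simplex(s)) @ Vt
```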
3.5 The flow polytope

Let $G$ be a single-source single-sink directed acyclic graph (DAG) with $m$ vertices and $n$ edges. Index by $\llbracket 1, m \rrbracket$ the set of vertices such that the edges are directed from a smaller to a larger vertex index; this can be achieved with complexity $\mathcal{O}(m + n)$ via topological sort [13, Sec. 22.4]. Let $A_G \in \mathbb{R}^{m \times n}$ be the incidence matrix of $G$. The flow polytope induced by $G$ is
$$\mathcal{F}_G = \{x \in \mathbb{R}^n \mid A_G x = (-1, 0, \ldots, 0, 1)^\top,\; x \geq 0\},$$
i.e., $\mathcal{F}_G$ is the set of unit flows $x \in \mathbb{R}^n$ on $G$, where $[x]_i \geq 0$ denotes the flow going through edge $i$. Thus, for all $y \in \mathbb{R}^n$, $\arg\min_{x \in \mathcal{F}_G} \langle x, y \rangle$ is a flow $x \in \{0, 1\}^n$ identifying a shortest path on $G$ weighted by $y$. Its computation has complexity $\mathcal{O}(m + n)$ [13, Sec. 24.2]. This is significantly cheaper than the complexity $\tilde{\mathcal{O}}(mn + n)$ of a projection [49, Thm. 20], where $\tilde{\mathcal{O}}$ hides polylogarithmic factors.
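A linear minimization over $\mathcal{F}_G$ is a single-pass dynamic program once the vertices are in topological order. The sketch below assumes vertex $1$ is the source, vertex $m$ is the sink, edges are given as (tail, head) pairs with tail smaller than head, and the sink is reachable; these conventions and the function name are assumptions made for the example.

```python
import numpy as np

def lmo_flow_polytope(m, edges, y):
    """Shortest source-to-sink path in a topologically ordered DAG, returned
    as a unit flow x in {0, 1}^n; runs in O(m + n)."""
    INF = float("inf")
    dist = [INF] * (m + 1)
    pred = [-1] * (m + 1)                 # incoming edge used on a shortest path
    dist[1] = 0.0
    out = [[] for _ in range(m + 1)]      # edge indices grouped by tail vertex
    for k, (a, b) in enumerate(edges):
        out[a].append(k)
    for a in range(1, m + 1):             # relax edges in topological order
        if dist[a] == INF:
            continue
        for k in out[a]:
            b = edges[k][1]
            if dist[a] + y[k] < dist[b]:
                dist[b] = dist[a] + y[k]
                pred[b] = k
    x = np.zeros(len(edges))
    b = m
    while b != 1:                         # backtrack from the sink to the source
        k = pred[b]
        x[k] = 1.0
        b = edges[k][0]
    return x
```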
3.6 The Birkhoff polytope

The Birkhoff polytope, a.k.a. assignment polytope, is the set of doubly stochastic matrices
$$\mathcal{B}_n = \{X \in \mathbb{R}^{n \times n} \mid X 1 = 1,\; X^\top 1 = 1,\; X \geq 0\}.$$
It is the convex hull of the permutation matrices and arises in matching, ranking, and seriation problems. Linear minimizations can be solved with complexity $\mathcal{O}(n^3)$ using the Hungarian algorithm [34]. To the best of our knowledge, there is no projection method specific to the Birkhoff polytope so we propose one here. Let $Y \in \mathbb{R}^{n \times n}$. By reshaping it into a vector $y \in \mathbb{R}^{n^2}$, projecting $Y$ onto the Birkhoff polytope is equivalent to solving
$$\min_{x \in \mathbb{R}^{n^2}} \|x - y\|^2 \quad \text{s.t.} \quad Ax = 1,\; x \geq 0,$$
where
$$A = \begin{pmatrix} 1_n^\top & 0^\top & \cdots & 0^\top \\ 0^\top & 1_n^\top & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0^\top \\ 0^\top & \cdots & 0^\top & 1_n^\top \\ I_n & I_n & \cdots & I_n \end{pmatrix} \in \mathbb{R}^{2n \times n^2}. \qquad (12)$$
This can again be reformulated as
$$\min_{x \in \mathbb{R}^{n^2}} \Big(\iota_K(x) + \frac{1}{2}\|x - y\|^2\Big) + \Big(\iota_{\mathcal{A}}(x) + \frac{1}{2}\|x - y\|^2\Big), \qquad (13)$$
where $K = \{x \in \mathbb{R}^{n^2} \mid x \geq 0\}$ and $\mathcal{A} = \{x \in \mathbb{R}^{n^2} \mid Ax = 1\}$. That is, we split the constraints into two sets enjoying efficient projections. We can now apply the Douglas-Rachford algorithm [42] to problem (13). The method is presented in Algorithm 3; see Appendix A for details. Line 3 computes the projection of $u_t$ onto the affine subspace $\mathcal{A}$ and Line 4 computes a projection onto the nonnegative orthant $K$. We can set $u = 1/n \in \mathcal{A}$ and we denote by $A^\dagger \in \mathbb{R}^{n^2 \times 2n}$ the Moore-Penrose inverse of $A$.

Algorithm 3
Douglas-Rachford for problem (13)
Input:
Point to project $y \in \mathbb{R}^{n^2}$, start point $z_0 \in \mathbb{R}^{n^2}$, offset point $u \in \mathcal{A}$.
1: for $t = 0$ to $T - 1$ do
2:   $u_t \leftarrow (z_t + y)/2$
3:   $x_t \leftarrow u_t - A^\dagger A (u_t - u)$
4:   $z_{t+1} \leftarrow \max\{(2x_t - z_t + y)/2,\; 0\} + z_t - x_t$
5: end for

The complexity of an iteration of Algorithm 3 is dominated by the matrix-vector multiplication in Line 3. We can assume that $A^\dagger A \in \mathbb{R}^{n^2 \times n^2}$ is precomputed. In fact,
$$A^\dagger A = \frac{1}{n^2} \begin{pmatrix} B_1 & B_2 & \cdots & B_2 \\ B_2 & B_1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & B_2 \\ B_2 & \cdots & B_2 & B_1 \end{pmatrix} \in \mathbb{R}^{n^2 \times n^2}, \quad \text{where} \quad \begin{cases} B_1 = n I_n + (n - 1) J_n \in \mathbb{R}^{n \times n}, \\ B_2 = n I_n - J_n \in \mathbb{R}^{n \times n}, \end{cases}$$
so $A^\dagger A$ is block circulant with circulant blocks (BCCB) and, up to the factor $1/n^2$, has only three distinct entries: $2n - 1$, $n - 1$, and $-1$. The expression of $A^\dagger A$ can be shown by checking that $A^\dagger = A^\top / n - J_{n^2, 2n} / (2n^2) \in \mathbb{R}^{n^2 \times 2n}$ using the necessary and sufficient Moore-Penrose conditions [47, Thm. 1]. Thus, the multiplication of $A^\dagger A$ and any vector $x \in \mathbb{R}^{n^2}$ can be performed with complexity $\mathcal{O}(n^2)$. Indeed, it amounts to computing $(n I_n + (n - 1) J_n)[x]_{i:j}$ and $(n I_n - J_n)[x]_{i:j}$ for every $(i, j) \in \{(kn + 1, (k + 1)n) \mid k \in \llbracket 0, n - 1 \rrbracket\}$, each of which has complexity $\mathcal{O}(n)$.

It remains to bound the number of iterations required to achieve $\varepsilon$-convergence. Let $x^* = \operatorname{proj}(y, \mathcal{A} \cap K)$ be the solution to problem (13), i.e., the projection of $Y$ onto the Birkhoff polytope after reshaping, and let $\bar{x}_t = (\sum_{s=0}^t x_s)/(t + 1) \in \mathcal{A}$ for all $t \in \mathbb{N}$. By [14, Thm. 1],
$$\|\bar{x}_t - x^*\| \leq \frac{\|z_0 - z^*\|}{2\sqrt{t + 1}},$$
where $z^*$ is a fixed point of $\operatorname{rprox}_{\iota_K + (1/2)\|\cdot - y\|^2} \circ \operatorname{rprox}_{\iota_{\mathcal{A}} + (1/2)\|\cdot - y\|^2}$, $\operatorname{rprox}_\varphi = 2\operatorname{prox}_\varphi - \operatorname{id}$, and $\operatorname{prox}_\varphi$ is the proximity operator of $\varphi$ [44]. Therefore, the complexity of an $\varepsilon$-approximate projection onto the Birkhoff polytope is $\mathcal{O}(n^2 d_z^2 / \varepsilon)$, where
$$d_z = \|z_0 - z^*\|. \qquad (14)$$

3.7 The permutahedron

Let $S_n$ be the set of permutations on $\llbracket 1, n \rrbracket$ and $w \in \mathbb{R}^n$. The permutahedron induced by $w$ is the convex hull of all permutations of the entries in $w$, i.e.,
$$\mathcal{P}_w = \operatorname{conv}\{w_\sigma \in \mathbb{R}^n \mid w_\sigma = ([w]_{\sigma_1}, \ldots, [w]_{\sigma_n})^\top,\; \sigma \in S_n\}.$$
It is related to the Birkhoff polytope via $\mathcal{P}_w = \{Xw \mid X \in \mathcal{B}_n\}$. With no loss of generality, we can assume that the weights are already sorted in ascending order: $[w]_1 \leq \cdots \leq [w]_n$. Thus, for all $y \in \mathbb{R}^n$,
$$\arg\min_{x \in \mathcal{P}_w} \langle x, y \rangle = w_{\sigma^{-1}},$$
where $\sigma$ satisfies $[y]_{\sigma_1} \geq \cdots \geq [y]_{\sigma_n}$. Sorting the entries of $y$ has complexity $\mathcal{O}(n \ln(n))$ [13]. A projection can be obtained with a slightly higher complexity $\mathcal{O}(n \ln(n) + n)$, by sorting the entries of $y$ and solving an isotonic regression problem [45].
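The linear minimization over $\mathcal{P}_w$ is a single sort. A minimal sketch follows; the function name is illustrative, and $w$ is assumed to be sorted in ascending order as above.

```python
import numpy as np

def lmo_permutahedron(w, y):
    """argmin_{x in P_w} <x, y>: pair the largest entries of y with the
    smallest entries of w (rearrangement inequality), in O(n log n)."""
    order = np.argsort(-y)            # indices of y in decreasing order
    x = np.empty_like(w, dtype=float)
    x[order] = w                      # x = w_{sigma^{-1}}
    return x

# Example: w = (1, 2, 3) and y = (0.5, -1.0, 0.2) give x = (1, 3, 2).
print(lmo_permutahedron(np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 0.2])))
```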
Acknowledgment

Research reported in this paper was partially supported by the Research Campus MODAL funded by the German Federal Ministry of Education and Research under grant 05M14ZAM.
A An application of the Douglas-Rachford algorithm
Let $\mathcal{H}$ be a Euclidean space with norm $\|\cdot\|$ and denote by $\Gamma_0(\mathcal{H})$ the set of proper lower semicontinuous convex functions $\mathcal{H} \to \mathbb{R} \cup \{+\infty\}$. The Douglas-Rachford algorithm [42] can be used to solve
$$\min_{x \in \mathcal{H}} f(x) + g(x),$$
when $f, g \in \Gamma_0(\mathcal{H})$ satisfy $\arg\min_{\mathcal{H}}(f + g) \neq \emptyset$ and $(\operatorname{ri} \operatorname{dom} f) \cap (\operatorname{ri} \operatorname{dom} g) \neq \emptyset$ [2], where $\operatorname{ri}$ and $\operatorname{dom}$ denote the relative interior of a set and the domain of a function respectively. It is presented in Algorithm 4. For every function $\varphi \in \Gamma_0(\mathcal{H})$, the proximity operator is $\operatorname{prox}_\varphi = \arg\min_{x \in \mathcal{H}} \varphi(x) + (1/2)\|\cdot - x\|^2$ [44].

We are interested in an application to problem (13), where $\mathcal{H} = \mathbb{R}^{n^2}$, $f = \iota_K + (1/2)\|\cdot - y\|^2$, $g = \iota_{\mathcal{A}} + (1/2)\|\cdot - y\|^2$, $y \in \mathcal{H}$, $K = \{x \in \mathbb{R}^{n^2} \mid x \geq 0\}$, $\mathcal{A} = \{x \in \mathbb{R}^{n^2} \mid Ax = 1\}$, and $A \in \mathbb{R}^{2n \times n^2}$ is defined in (12). Problem (13) admits a (unique) solution since it is a projection problem onto the intersection of the closed convex sets $K$ and $\mathcal{A}$.

Algorithm 4 Douglas-Rachford
Input:
Start point $z_0 \in \mathcal{H}$.
1: for $t = 0$ to $T - 1$ do
2:   $x_t \leftarrow \operatorname{prox}_g(z_t)$
3:   $z_{t+1} \leftarrow \operatorname{prox}_f(2x_t - z_t) + z_t - x_t$
4: end for

Moreover, $1/n \in (\operatorname{ri} \operatorname{dom} f) \cap (\operatorname{ri} \operatorname{dom} g) = (\operatorname{ri} K) \cap \mathcal{A}$, so the Douglas-Rachford algorithm is well defined here. We now show that it reduces to Algorithm 3. For all $t \in \mathbb{N}$,
$$x_t = \operatorname{prox}_g(z_t) = \arg\min_{x \in \mathcal{H}}\; g(x) + \frac{1}{2}\|z_t - x\|^2 = \arg\min_{x \in \mathcal{A}}\; \frac{1}{2}\|x - y\|^2 + \frac{1}{2}\|x - z_t\|^2 = \arg\min_{x \in \mathcal{A}}\; \Big\|x - \frac{z_t + y}{2}\Big\|^2 = \operatorname{proj}\Big(\frac{z_t + y}{2}, \mathcal{A}\Big) = \frac{z_t + y}{2} - A^\dagger A\Big(\frac{z_t + y}{2} - u\Big),$$
since $\operatorname{proj}(\cdot, \mathcal{A}) = \cdot - A^\dagger A(\cdot - u)$, given any $u \in \mathcal{A}$ [2, Ex. 29.17]. Thus, Lines 2–3 in Algorithm 3 are equivalent to Line 2 in Algorithm 4. Similarly, Line 4 in Algorithm 3 is equivalent to Line 3 in Algorithm 4:
$$z_{t+1} = \operatorname{prox}_f(2x_t - z_t) + z_t - x_t = \arg\min_{z \in \mathcal{H}}\Big(f(z) + \frac{1}{2}\|2x_t - z_t - z\|^2\Big) + z_t - x_t = \arg\min_{z \in K}\Big(\frac{1}{2}\|z - y\|^2 + \frac{1}{2}\|z - (2x_t - z_t)\|^2\Big) + z_t - x_t = \arg\min_{z \in K}\Big\|z - \frac{2x_t - z_t + y}{2}\Big\|^2 + z_t - x_t = \operatorname{proj}\Big(\frac{2x_t - z_t + y}{2}, K\Big) + z_t - x_t = \max\Big\{\frac{2x_t - z_t + y}{2}, 0\Big\} + z_t - x_t,$$
since $\operatorname{proj}(\cdot, K) = \max\{\cdot, 0\}$.

References
[1] H. H. Bauschke and J. M. Borwein. On projection algorithms for solving convex feasibility problems. SIAM Review, 38(3):367–426, 1996.
[2] H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2nd edition, 2017.
[3] A. Ben-Tal and A. S. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Society for Industrial and Applied Mathematics, 2001.
[4] G. Braun, S. Pokutta, and D. Zink. Lazifying conditional gradient algorithms. Journal of Machine Learning Research, 20(71):1–42, 2019.
[5] M. D. Canon and C. D. Cullum. A tight upper bound on the rate of convergence of Frank-Wolfe algorithm. SIAM Journal on Control, 6(4):509–516, 1968.
[6] J. Chen, D. Zhou, J. Yi, and Q. Gu. A Frank-Wolfe framework for efficient and effective adversarial attacks. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 3486–3494, 2020.
[7] K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4):1–30, 2010.
[8] C. W. Combettes and S. Pokutta. Revisiting the approximate Carathéodory problem via the Frank-Wolfe algorithm. arXiv preprint arXiv:1911.04415, 2019.
[9] C. W. Combettes and S. Pokutta. Boosting Frank-Wolfe by chasing gradients. In Proceedings of the 37th International Conference on Machine Learning, pages 2111–2121, 2020.
[10] C. W. Combettes, C. Spiegel, and S. Pokutta. Projection-free adaptive gradients for large-scale optimization. arXiv preprint arXiv:2009.14114, 2020.
[11] P. L. Combettes. Strong convergence of block-iterative outer approximation methods for convex optimization. SIAM Journal on Control and Optimization, 38(2):538–565, 2000.
[12] L. Condat. Fast projection onto the simplex and the ℓ1 ball. Mathematical Programming, 158(1):575–585, 2016.
[13] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
[14] D. Davis and W. Yin. Faster convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. Mathematics of Operations Research, 42(3):783–805, 2017.
[15] J. C. Dunn and S. Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
[16] M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the 2001 American Control Conference, pages 4734–4739, 2001.
[17] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1–2):95–110, 1956.
[18] R. M. Freund, P. Grigas, and R. Mazumder. An extended Frank-Wolfe method with "in-face" directions, and its application to low-rank matrix completion. SIAM Journal on Optimization, 27(1):319–346, 2017.
[19] D. Garber. Projection-free Algorithms for Convex Optimization and Online Learning. Ph.D. thesis, Technion, 2016.
[20] D. Garber and E. Hazan. Faster rates for the Frank-Wolfe method over strongly-convex sets. In Proceedings of the 32nd International Conference on Machine Learning, pages 541–549, 2015.
[21] D. Garber and O. Meshi. Linear-memory and decomposition-invariant linearly convergent conditional gradient algorithm for structured polytopes. In Advances in Neural Information Processing Systems, volume 29, pages 1001–1009, 2016.
[22] G. H. Golub and C. F. van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.
[23] J. Guélat and P. Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1):110–119, 1986.
[24] C. R. Harris et al. Array programming with NumPy. Nature, 585(7825):357–362, 2020.
[25] Y. Haugazeau. Sur les Inéquations Variationnelles et la Minimisation de Fonctionnelles Convexes. Thèse de doctorat, Université de Paris, 1968.
[26] E. Hazan. Sparse approximate solutions to semidefinite programs. In Proceedings of the 8th Latin American Symposium on Theoretical Informatics, pages 306–316, 2008.
[27] E. Hazan and S. Kale. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[28] D. W. Hearn. The gap function of a convex program. Operations Research Letters, 1(2):67–71, 1982.
[29] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435, 2013.
[30] A. Joulin, K. Tang, and L. Fei-Fei. Efficient image and video co-localization with Frank-Wolfe algorithm. In European Conference on Computer Vision, pages 253–268, 2014.
[31] T. Kerdreux, A. d'Aspremont, and S. Pokutta. Projection-free optimization on uniformly convex sets. arXiv preprint arXiv:2004.11053, 2020.
[32] A. V. Knyazev, M. E. Argentati, I. Lashuk, and E. E. Ovtchinnikov. Block locally optimal preconditioned eigenvalue xolvers (BLOPEX) in hypre and PETSc. SIAM Journal on Scientific Computing, 29(5):2224–2239, 2007.
[33] J. Kuczyński and H. Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
[34] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.
[35] S. Lacoste-Julien. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345, 2016.
[36] S. Lacoste-Julien and M. Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, volume 28, pages 496–504, 2015.
[37] S. Lacoste-Julien, M. Jaggi, M. Schmidt, and P. Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proceedings of the 30th International Conference on Machine Learning, pages 53–61, 2013.
[38] G. Lan and Y. Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409, 2016.
[39] L. J. LeBlanc, E. K. Morlok, and W. P. Pierskalla. An efficient approach to solving the road network equilibrium traffic assignment problem. Transportation Research, 9(5):309–318, 1975.
[40] R. B. Lehoucq, D. C. Sorensen, and C. Yang. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. Society for Industrial and Applied Mathematics, 1998.
[41] E. S. Levitin and B. T. Polyak. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1–50, 1966.
[42] P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
[43] G. Luise, S. Salzo, M. Pontil, and C. Ciliberto. Sinkhorn barycenters with free support via Frank-Wolfe algorithm. In Advances in Neural Information Processing Systems, volume 32, pages 9322–9333, 2019.
[44] J. J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. Comptes Rendus Hebdomadaires des Séances de l'Académie des Sciences, 255:2897–2899, 1962.
[45] R. Negrinho and A. F. T. Martins. Orbit regularization. In Advances in Neural Information Processing Systems, volume 27, pages 3221–3229, 2014.
[46] C. H. J. Pang. First order constrained optimization algorithms with feasibility updates. arXiv preprint arXiv:1506.08247, 2015.
[47] R. Penrose. A generalized inverse for matrices. Mathematical Proceedings of the Cambridge Philosophical Society, 51(3):406–413, 1955.
[48] S. J. Reddi, S. Sra, B. Póczos, and A. Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In Proceedings of the 54th Annual Allerton Conference on Communication, Control, and Computing, pages 1244–1251, 2016.
[49] L. A. Végh. A strongly polynomial algorithm for a class of minimum-cost flow problems with separable convex objectives. SIAM Journal on Computing, 45(5):1729–1761, 2016.
[50] P. Virtanen et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3):261–272, 2020.
[51] J. Xie, Z. Shen, C. Zhang, H. Qian, and B. Wang. Efficient projection-free online methods with stochastic recursive gradient. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 6446–6453, 2020.
[52] A. Yurtsever, S. Sra, and V. Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In Proceedings of the 36th International Conference on Machine Learning, pages 7282–7291, 2019.
[53] M. Zhang, Z. Shen, A. Mokhtari, H. Hassani, and A. Karbasi. One sample stochastic Frank-Wolfe. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, pages 4012–4023, 2020.