Efficient Optimization of Dominant Set Clustering with Frank-Wolfe Algorithms
Carl Johnell, Morteza Haghir Chehreghani
Chalmers University of Technology
[email protected], [email protected]
Abstract
We study Frank-Wolfe algorithms – standard, pairwise, and away-steps – for efficient optimization of Dominant Set Clustering. We present a unified and computationally efficient framework to employ the different variants of Frank-Wolfe methods, and we investigate its effectiveness via several experimental studies. In addition, we provide explicit convergence rates for the algorithms in terms of the so-called Frank-Wolfe gap. The theoretical analysis has been specialized to the problem of Dominant Set Clustering and is thus more easily accessible compared to prior work.
Introduction
Data clustering plays an important role in unsupervised learning and exploratory data analytics (Jain 2010). It is used in many applications and areas such as network analysis, image segmentation, document and text processing, community detection and bioinformatics. Given a set of n objects with indices V = {1, ..., n} and the nonnegative pairwise similarities A = (a_ij), i.e., the graph G(V, A) with vertices V and edge weights A, our goal is to partition the data into coherent groups that look dissimilar from each other. We assume zero self-similarities, i.e., a_ii = 0 for all i. Several clustering methods compute the clusters via minimizing a cost function. Examples are Ratio Cut (Chan, Schlag, and Zien 1994), Normalized Cut (Shi and Malik 2000), Correlation Clustering (Bansal, Blum, and Chawla 2004), and shifted Min Cut (Haghir Chehreghani 2017a). For some of them, for example Normalized Cut, approximate solutions have been developed in the context of spectral analysis (Shi and Malik 2000; Ng, Jordan, and Weiss 2001), the Power Iteration method (Lin and Cohen 2010) and P-Spectral Clustering (Bühler and Hein 2009; Hein and Bühler 2010). It is notable that clustering can be applied to more complicated data such as trees (Haghir Chehreghani et al. 2007).

Another prominent clustering approach has been developed in the context of Dominant Set Clustering (DSC) and its connection to discrete-time dynamical systems and replicator dynamics (Pavan and Pelillo 2007; Bulò and Pelillo 2017). Unlike the methods based on cost function minimization, DSC does not define a global cost function for the clusters. Instead, it applies the generic principles of clustering where each cluster should be coherent and well separated from the other clusters. These principles are formulated via the concepts of dominant sets (Pavan and Pelillo 2007). Several variants of the method have since been proposed.
The method in (Liu, Latecki, and Yan 2013) proposes an iterative clustering algorithm with two steps, Shrink and Expand. These steps are suitable for sparse data and reduce the runtime of performing replicator dynamics. (Bulò, Torsello, and Pelillo 2009) develops an enumeration technique for different clusters via unstabilizing the underlying equilibrium of replicator dynamics. (Pavan and Pelillo 2003a) proposes a hierarchical variant of DSC via regularization and shifting the off-diagonal elements of the similarity matrix. (Chehreghani 2016) adaptively analyzes the trajectories of replicator dynamics in order to discover suitable phase transitions that correspond to different clusters. Several studies demonstrate the effectiveness of DSC variants compared to other clustering methods, such as spectral methods (Pavan and Pelillo 2007; Liu, Latecki, and Yan 2013; Chehreghani 2016; Bulò and Pelillo 2017).

In this paper, we investigate efficient optimization for DSC based on Frank-Wolfe algorithms (Frank and Wolfe 1956; Lacoste-Julien and Jaggi 2015; Reddi et al. 2016) instead of replicator dynamics. Frank-Wolfe optimization has been successfully applied to several constrained optimization problems. We develop a unified and computationally efficient framework to employ the different variants of Frank-Wolfe algorithms for DSC, and we investigate its effectiveness via several experimental studies. Our theoretical analysis is specialized to DSC, and we provide explicit convergence rates for the algorithms in terms of the so-called Frank-Wolfe gap – including pairwise Frank-Wolfe with a nonconvex/nonconcave objective function, which we have not seen in prior work.

Dominant Set Clustering
DSC follows an iterative procedure to compute the clusters: i) compute a dominant set using the similarity matrix A of the available data, ii) peel off (remove) the clustered objects from the data, and iii) repeat until a predefined number of clusters has been obtained. With some abuse of notation, n, V, A and x sometimes refer to the available (i.e., still unclustered) objects and the similarities between them; this is obvious from the context.

Dominant sets correspond to local optima of the following quadratic problem (Pavan and Pelillo 2007), called the standard quadratic problem (StQP):

maximize f(x) = x^T A x    (1)
subject to x ∈ ∆ = { x ∈ R^n : x ≥ 0 and Σ_{i=1}^n x_i = 1 }.

The constraint set ∆ is called the standard simplex. We note that A is generally not negative definite, and the objective function f(x) is thus not concave.

Every unclustered object i corresponds to a component of the n-dimensional characteristic vector x. The support of a local optimum x* specifies the objects that belong to the dominant set (cluster), i.e., i is in the cluster if the component x*_i > 0. In practice we use x*_i > δ, where δ is a small number called the cutoff parameter. Previous work employs replicator dynamics to solve the StQP, where x is updated according to the following dynamics:

x_i^(t+1) = x_i^(t) (A x_t)_i / (x_t^T A x_t),    i = 1, ..., n,    (2)

where x_t indicates the solution at iterate t, and x_i^(t) is the i-th component of x_t. Note the O(n^2) per-iteration time complexity due to the matrix-vector multiplication. In this paper we investigate an alternative optimization framework based on Frank-Wolfe methods.

Unified Frank-Wolfe Optimization Methods
Let
P ⊂ R^n be a finite set of points and D = convex(P) its convex hull (a convex polytope). The Frank-Wolfe algorithm, first introduced in (Frank and Wolfe 1956), aims at solving the following constrained optimization:

max_{x ∈ D} f(x),    (3)

where f is nonlinear and differentiable. The formulation in (Lacoste-Julien 2016) has extended the concavity assumption to arbitrary functions with L-Lipschitz ('well-behaved') gradients. Algorithm 1 outlines the steps of a Frank-Wolfe method to solve the optimization in (3).

Algorithm 1
Frank-Wolfe pseudocode

procedure PSEUDO-FW(f, D, T)    ▷ Function f, convex polytope D, and iterations T
  Select x_0 ∈ D
  for t = 0, ..., T − 1 do
    if x_t is stationary then break
    Compute a feasible ascent direction d_t at x_t
    Compute a step size γ_t ∈ [0, 1] such that f(x_t + γ_t d_t) > f(x_t)
    x_{t+1} := x_t + γ_t d_t
  return x_t

In this work, in addition to the standard FW (called FW), we also consider two other variants of FW: pairwise FW (PFW) and away-steps FW (AFW), adapted from (Lacoste-Julien and Jaggi 2015). They differ in the way the ascent direction d_t is computed.

From the definition of D, any point x_t ∈ D can be written as a convex combination of the points in P, i.e.,

x_t = Σ_{v ∈ P} λ_v^(t) v,    (4)

where the coefficients λ_v^(t) ∈ [0, 1] and Σ_{v ∈ P} λ_v^(t) = 1. Define

S_t = { v ∈ P : λ_v^(t) > 0 }    (5)

as the set of points with nonzero coefficients at iterate t. Moreover, let

s_t ∈ arg max_{s ∈ D} ∇f(x_t)^T s,    (6)
v_t ∈ arg min_{v ∈ S_t} ∇f(x_t)^T v.    (7)

Since D is a convex polytope, s_t is the point that maximizes the linearization and v_t is the point with nonzero coefficient that minimizes it over S_t. Let x_t be the estimated solution of (3) at iterate t and define

d_t^A = x_t − v_t,
d_t^FW = s_t − x_t,
d_t^AFW = d_t^FW if ∇f(x_t)^T d_t^FW ≥ ∇f(x_t)^T d_t^A, and (λ_{v_t}^(t) / (1 − λ_{v_t}^(t))) d_t^A otherwise,
d_t^PFW = s_t − v_t,    (8)

respectively as the away, FW, away/FW, and pairwise directions. The FW direction moves towards a 'good' point, and the away direction moves away from a 'bad' point. The pairwise direction shifts from a 'bad' point to a 'good' point (Lacoste-Julien and Jaggi 2015). The coefficient multiplying d_t^A in d_t^AFW ensures the next iterate remains feasible.

An issue with standard FW, which PFW and AFW aim to fix, is the zig-zagging phenomenon.
This occurs when the optimal solution of (3) lies on the boundary of the domain. Then the iterates start to zig-zag between the points, which negatively affects the convergence. By adding the possibility of an away step in AFW, or alternatively using the pairwise direction, the zig-zagging can be attenuated.

The step size γ_t can be computed by line-search, i.e.,

γ_t ∈ arg max_{γ ∈ [0, 1]} f(x_t + γ d_t).    (9)

Finally, the Frank-Wolfe gap is used to check if an iterate is (close enough to) a stationary solution.

Definition 1.
The Frank-Wolfe gap g_t of f : D → R at iterate x_t is defined as

g_t = max_{s ∈ D} ∇f(x_t)^T (s − x_t) ⟺ g_t = ∇f(x_t)^T d_t^FW.    (10)

A point x_t is stationary if and only if g_t = 0, meaning there are no ascent directions. The Frank-Wolfe gap is thus a reasonable measure of nonstationarity and is frequently used as a stopping criterion (Lacoste-Julien 2016). Specifically, a threshold ε is defined, and if g_t ≤ ε, then we conclude the iterate is sufficiently close to a stationary point and stop the algorithm.

Frank-Wolfe for Dominant Set Clustering
Here we apply the Frank-Wolfe methods from the previous section to the optimization problem (1) defined by DSC.
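For concreteness, the outer peel-off loop of DSC that any local solver for (1) plugs into can be sketched as follows. This is our own minimal NumPy illustration (function names such as dsc_peel are ours, not the authors'), using the replicator-dynamics update (2) as the baseline solver that the Frank-Wolfe variants developed below replace:

```python
import numpy as np

def replicator_dynamics(A, x, iters=1000):
    """Baseline local solver for (1): the update (2); O(n^2) per iteration."""
    for _ in range(iters):
        Ax = A @ x
        denom = x @ Ax
        if denom == 0:
            break
        x = x * Ax / denom          # component-wise update (2); stays on the simplex
    return x

def dsc_peel(A, K, solver=replicator_dynamics, delta=1e-6):
    """DSC outer loop: solve (1) on the unclustered objects, peel off the
    support of the local optimum, and repeat K times. Label 0 = unassigned."""
    n = A.shape[0]
    labels = np.zeros(n, dtype=int)
    active = np.arange(n)                      # still-unclustered objects
    for k in range(1, K + 1):
        if active.size == 0:
            break
        sub = A[np.ix_(active, active)]        # similarities of the remaining objects
        x = solver(sub, np.full(active.size, 1.0 / active.size))
        members = active[x > delta]            # support of the local optimum (cutoff)
        if members.size == 0:
            break
        labels[members] = k
        active = active[x <= delta]            # peel off the clustered objects
    return labels
```

Swapping `solver` for one of the Frank-Wolfe routines below changes the per-iteration cost from O(n^2) to O(n) while leaving this outer loop unchanged.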
Simplex Domain
Because of the simplex form – the constraints in (1) – the convex combination in (4) for x ∈ ∆ can be written as

x = Σ_{i=1}^n λ_{e_i} e_i,    (11)

where the e_i are the standard basis vectors. That is, the i-th coefficient corresponds to the i-th component of x, λ_{e_i} = x_i. The set of components with nonzero coefficients at iterate x_t gives the support, i.e.,

σ_t = { i ∈ V : x_i^(t) > 0 }.    (12)

Due to the structure of the simplex ∆, the solution of the optimization (6) is s_t ∈ ∆ with

s_i^(t) = 1, where i ∈ arg max_i ∇_i f(x_t),
s_j^(t) = 0, for j ≠ i,    (13)

and the solution of the optimization (7) is v_t ∈ ∆ with

v_i^(t) = 1, where i ∈ arg min_{i ∈ σ_t} ∇_i f(x_t),
v_j^(t) = 0, for j ≠ i.    (14)

The maximum and minimum values of the linearization are the largest and smallest components of the gradient, respectively (subject to i ∈ σ_t in the latter case). Note that the gradient is ∇f(x_t) = 2 A x_t.

Step Sizes
We compute the optimal step sizes for FW, PFW, and AFW. Iterate subscripts t are omitted for clarity. We define the step size function as

ψ(γ) = f(x + γd)
     = (x + γd)^T A (x + γd)
     = x^T A x + 2γ d^T A x + γ^2 d^T A d
     = f(x) + γ ∇f(x)^T d + γ^2 d^T A d,    (15)

for some ascent direction d. This expression is a single-variable second-degree polynomial in γ. The function is concave if the coefficient d^T A d ≤ 0 (second derivative test) and admits a global maximum in that case.

In the following it is assumed that s and v satisfy (13) and (14), and that their nonzero components are i and j, respectively. The i-th row of A is a_{i∗} and the j-th column of A is a_{∗j}.

FW direction: Substitute d^FW = s − x into d^T A d:

d^T A d = (s − x)^T A (s − x)
        = s^T A s − 2 s^T A x + x^T A x
        = x^T A x − 2 a_{i∗}^T x.    (16)

Pairwise direction: Substitute d^PFW = s − v into d^T A d:

d^T A d = (s − v)^T A (s − v)
        = s^T A s − 2 v^T A s + v^T A v
        = −2 a_{ij}.    (17)

Away direction: Substitute d^A = x − v into d^T A d:

d^T A d = (x − v)^T A (x − v)
        = x^T A x − 2 v^T A x + v^T A v
        = x^T A x − 2 a_{j∗}^T x.    (18)

Recall that A has nonnegative entries and zeros on the main diagonal. Therefore s^T A s = 0 and v^T A v = 0. It is immediate that (17) is nonpositive. From x^T A x ≤ s^T A x we conclude that (16) is also nonpositive. The corresponding step size functions are therefore always concave. We cannot draw any conclusion for (18), and the sign of d^T A d is therefore dependent on the iterate.

The derivative of ψ(γ) is

dψ/dγ (γ) = ∇f(x)^T d + 2γ d^T A d.    (19)

By solving dψ/dγ (γ) = 0 we obtain

∇f(x)^T d + 2γ d^T A d = 0  ⟺  γ* = − ∇f(x)^T d / (2 d^T A d) = − x^T A d / (d^T A d).    (20)

Since ∇f(x)^T d ≥ 0, we also conclude here that d^T A d < 0 has to hold in order for the step size to make sense. By substituting the directions and the corresponding d^T A d into (20) we obtain the different step sizes.

FW direction and (16):

γ^FW = − x^T A d / (d^T A d) = (a_{i∗}^T x − x^T A x) / (2 a_{i∗}^T x − x^T A x).    (21)

Pairwise direction and (17):

γ^PFW = − x^T A d / (d^T A d) = (a_{i∗}^T x − a_{j∗}^T x) / (2 a_{ij}).    (22)

Away direction and (18):

γ^A = − x^T A d / (d^T A d) = (x^T A x − a_{j∗}^T x) / (2 a_{j∗}^T x − x^T A x).    (23)

Algorithms  Here, we describe in detail standard FW (Algorithm 2), pairwise FW (Algorithm 3), and away-steps FW (Algorithm 4) for problem (1), following the high-level structure of Algorithm 1. All variants have O(n) per-iteration time complexity, where the linear operations are arg max, arg min, and vector addition. The key to this time complexity is that we are able to update the gradient ∇f(x) = 2Ax in linear time. Lemmas 1, 2, and 3 show why this is the case. Recall that the updates in replicator dynamics are quadratic w.r.t. n.

Algorithm 2
FW for DSC

1:  procedure FW(A, ε, T)
2:    Select x_0 ∈ ∆
3:    r_0 := A x_0
4:    f_0 := r_0^T x_0
5:    for t = 0, ..., T − 1 do
6:      s_t := e_i, where i ∈ arg max_ℓ r_ℓ^(t)
7:      g_t := 2 (r_i^(t) − f_t)
8:      if g_t ≤ ε then break
9:      γ_t := (r_i^(t) − f_t) / (2 r_i^(t) − f_t)
10:     x_{t+1} := (1 − γ_t) x_t + γ_t s_t
11:     r_{t+1} := (1 − γ_t) r_t + γ_t a_{∗i}
12:     f_{t+1} := (1 − γ_t)^2 f_t + 2 γ_t (1 − γ_t) r_i^(t)
13:   return x_t

Lemma 1.
For x_{t+1} = (1 − γ_t) x_t + γ_t s_t, lines 11 and 12 in Algorithm 2 satisfy

r_{t+1} = A x_{t+1},
f_{t+1} = x_{t+1}^T A x_{t+1}.

Algorithm 3
Pairwise FW for DSC

1:  procedure PFW(A, ε, T)
2:    Select x_0 ∈ ∆
3:    r_0 := A x_0
4:    f_0 := r_0^T x_0
5:    for t = 0, ..., T − 1 do
6:      σ_t := { i ∈ V : x_i^(t) > 0 }
7:      s_t := e_i, where i ∈ arg max_ℓ r_ℓ^(t)
8:      v_t := e_j, where j ∈ arg min_{ℓ ∈ σ_t} r_ℓ^(t)
9:      g_t := 2 (r_i^(t) − f_t)
10:     if g_t ≤ ε then break
11:     γ_t := min( x_j^(t), (r_i^(t) − r_j^(t)) / (2 a_{ij}) )
12:     x_{t+1} := x_t + γ_t (s_t − v_t)
13:     r_{t+1} := r_t + γ_t (a_{∗i} − a_{∗j})
14:     f_{t+1} := f_t + 2 γ_t (r_i^(t) − r_j^(t)) − 2 γ_t^2 a_{ij}
15:   return x_t

Lemma 2.
For x_{t+1} = x_t + γ_t (s_t − v_t), lines 13 and 14 in Algorithm 3 satisfy

r_{t+1} = A x_{t+1},
f_{t+1} = x_{t+1}^T A x_{t+1}.

Algorithm 4
Away-steps FW for DSC

1:  procedure AFW(A, ε, T)
2:    Select x_0 ∈ ∆
3:    r_0 := A x_0
4:    f_0 := r_0^T x_0
5:    for t = 0, ..., T − 1 do
6:      σ_t := { i ∈ V : x_i^(t) > 0 }
7:      s_t := e_i, where i ∈ arg max_ℓ r_ℓ^(t)
8:      v_t := e_j, where j ∈ arg min_{ℓ ∈ σ_t} r_ℓ^(t)
9:      g_t := 2 (r_i^(t) − f_t)
10:     if g_t ≤ ε then break
11:     if (r_i^(t) − f_t) ≥ (f_t − r_j^(t)) then    ▷ FW direction
12:       γ_t := (r_i^(t) − f_t) / (2 r_i^(t) − f_t)
13:       x_{t+1} := (1 − γ_t) x_t + γ_t s_t
14:       r_{t+1} := (1 − γ_t) r_t + γ_t a_{∗i}
15:       f_{t+1} := (1 − γ_t)^2 f_t + 2 γ_t (1 − γ_t) r_i^(t)
16:     else    ▷ Away direction
17:       γ_t := x_j^(t) / (1 − x_j^(t))
18:       if (2 r_j^(t) − f_t) > 0 then
19:         γ_t ← min( γ_t, (f_t − r_j^(t)) / (2 r_j^(t) − f_t) )
20:       end if
21:       x_{t+1} := (1 + γ_t) x_t − γ_t v_t
22:       r_{t+1} := (1 + γ_t) r_t − γ_t a_{∗j}
23:       f_{t+1} := (1 + γ_t)^2 f_t − 2 γ_t (1 + γ_t) r_j^(t)
24:   return x_t

Lines 12–15 are identical to the updates in Algorithm 2 and are included in Lemma 1. We therefore only show the away direction.
Lemma 3.
For x_{t+1} = (1 + γ_t) x_t − γ_t v_t, lines 22 and 23 in Algorithm 4 satisfy

r_{t+1} = A x_{t+1},
f_{t+1} = x_{t+1}^T A x_{t+1}.

Algorithm 4 (AFW) is actually equivalent to the infection and immunization dynamics (InImDyn) with the pure strategy selection function, introduced in (Bulò, Pelillo, and Bomze 2011) as an alternative to replicator dynamics. However, InImDyn is derived from the perspective of evolutionary game theory as opposed to Frank-Wolfe. Thus, our framework provides a different way to analyze this method and also study its convergence rate.
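As a concrete sketch of Algorithm 2 with the O(n) bookkeeping of Lemma 1 (this is our own NumPy code, not the authors' implementation; PFW and AFW would maintain r and f analogously via Lemmas 2 and 3):

```python
import numpy as np

def fw_dsc(A, eps=1e-10, T=10000):
    """Standard Frank-Wolfe (Algorithm 2) for the StQP (1).

    r = A x and f = x^T A x are maintained incrementally (Lemma 1), so each
    iteration costs O(n) instead of the O(n^2) of replicator dynamics.
    """
    n = A.shape[0]
    x = np.zeros(n)
    x[np.argmax(A.sum(axis=1))] = 1.0       # vertex start: largest row sum
    r = A @ x                               # r_t = A x_t
    f = r @ x                               # f_t = x_t^T A x_t
    for _ in range(T):
        i = int(np.argmax(r))               # s_t = e_i
        g = 2 * (r[i] - f)                  # Frank-Wolfe gap g_t
        if g <= eps:
            break
        gamma = (r[i] - f) / (2 * r[i] - f)           # optimal step (21)
        f = (1 - gamma) ** 2 * f + 2 * gamma * (1 - gamma) * r[i]
        r = (1 - gamma) * r + gamma * A[:, i]
        x *= (1 - gamma)
        x[i] += gamma                       # x_{t+1} = (1-gamma) x_t + gamma e_i
    return x, r, f
```

On exit, r and f still agree with A x and x^T A x up to rounding, which is exactly what Lemma 1 guarantees for the incremental updates.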
Proposition 4.
Algorithm 4 (AFW) is equivalent to Algorithm 1 in (Bulò, Pelillo, and Bomze 2011).

Analysis of Convergence Rates

(Lacoste-Julien 2016) shows that the Frank-Wolfe gap for standard FW decreases at rate O(1/√t) for nonconvex/nonconcave objective functions, where t is the number of iterations. A similar convergence rate is shown in (Bomze, Rinaldi, and Zeffiro 2019) for nonconvex AFW over the simplex. When the objective function is convex/concave, linear convergence rates for PFW and AFW are shown in (Lacoste-Julien and Jaggi 2015). The analysis in (Thiel, Haghir Chehreghani, and Dubhashi 2019) shows a linear convergence rate of standard FW for nonconvex but multi-linear functions. We are not aware of any work analyzing the convergence rate in terms of the Frank-Wolfe gap for nonconvex/nonconcave PFW.

Following the terminology and techniques in (Lacoste-Julien 2016; Lacoste-Julien and Jaggi 2015; Bomze, Rinaldi, and Zeffiro 2019), we present a unified framework to analyze convergence rates for Algorithms 2, 3, and 4. The analysis is split into a number of different cases, where each case handles a unique ascent direction and step size combination. For the step sizes, we consider one case when the optimal step size is used (γ_t < γ_max), and a second case when it has been truncated (γ_t = γ_max). The former case is referred to as a good step, since in this case we can provide a lower bound on the progress f(x_{t+1}) − f(x_t) in terms of the Frank-Wolfe gap. The latter case is referred to as a drop step or a swap step. It is called a drop step when the cardinality of the support reduces by one, i.e., |σ_{t+1}| = |σ_t| − 1, and a swap step when it remains unchanged, i.e., |σ_{t+1}| = |σ_t|. When γ_t = γ_max we cannot provide a bound on the progress in terms of the Frank-Wolfe gap, and instead we bound the number of drop/swap steps. Furthermore, this case can only happen for PFW and AFW, as the step size for FW always satisfies γ_t < γ_max.
Swap steps can only happen for PFW. Let

g̃_t = min_{0 ≤ ℓ ≤ t} g_ℓ,  M_1 = min_{i,j : i≠j} a_ij,  M_2 = max_{i,j : i≠j} a_ij,

be the smallest Frank-Wolfe gap after t iterations, and the smallest and largest off-diagonal elements of A, respectively. Let I be the set of indexes that take a good step; that is, for t ∈ I we have γ_t < γ_max. Then, we show the following results (the details are in the supplemental material).

Lemma 5.
The smallest Frank-Wolfe gap for Algorithms 2, 3, and 4 satisfies

g̃_t ≤ √( β (f(x_t) − f(x_0)) / |I| ),    (24)

where β = 2M_2 − M_1 for FW and AFW, and β = 2M_2 for PFW.

Theorem 6.
The smallest Frank-Wolfe gap for Algorithm 2 (FW) satisfies

g̃_t^FW ≤ √( (2M_2 − M_1) (f(x_t) − f(x_0)) / t ).    (25)

Theorem 7.
The smallest Frank-Wolfe gap for Algorithm 3 (PFW) satisfies

g̃_t^PFW ≤ √( 2 n! M_2 (f(x_t) − f(x_0)) / t ).    (26)

Theorem 8.
The smallest Frank-Wolfe gap for Algorithm 4 (AFW) satisfies

g̃_t^AFW ≤ √( 2 (2M_2 − M_1) (f(x_t) − f(x_0)) / (t + 1 − |σ_0|) ).    (27)

From Theorems 6, 7 and 8 we conclude Corollary 9.

Corollary 9.
The smallest Frank-Wolfe gaps for Algorithms 2, 3, and 4 decrease at rate O(1/√t).

Initialization
The way the algorithms are initialized – the value of x_0 – affects the local optima the algorithms converge to. Let x̄_B = (1/n) e be the barycenter of the simplex ∆, where e^T = (1, 1, ..., 1). We also define x̄_V as x̄_V ∈ ∆ with

(x̄_V)_i = 1, where i ∈ arg max_i ∇_i f(x̄_B),
(x̄_V)_j = 0, for j ≠ i.    (28)

Initializing x_0 with x̄_B avoids an initial bias to a particular solution, as it considers a uniform distribution over the available objects. Since ∇f(x̄_B) = 2A x̄_B, the nonzero component of x̄_V corresponds to the row of A with the largest total sum. Therefore, it is biased to an object that is highly similar to many other objects.

The starting point for replicator dynamics is x̄_RD = x̄_B, as used for example in (Pavan and Pelillo 2003b; Pavan and Pelillo 2007). Note that if a component of x̄_RD starts at zero it will remain at zero for the entire duration of the dynamics according to (2). Furthermore, (x̄_V)^T A x̄_V = 0 since A has zeros on the main diagonal, and the denominator in replicator dynamics is then zero for this point. Thus, x̄_V is not a viable starting point for replicator dynamics.

The starting point for standard FW is x̄_FW = x̄_V, which was found experimentally to work well. As explained in the convergence rate analysis, FW never performs any drop steps, since the step size always satisfies γ_t < γ_max. Hence, using x̄_B as the starting point for FW will lead to a solution that has full support – this was found experimentally to hold true as well. Therefore, with FW, we use only initialization with x̄_V. With PFW and AFW, we can use both x̄_B and x̄_V as starting points. We denote the variants by PFW-B, PFW-V, AFW-B, and AFW-V, respectively, to specify the starting point.

Experiments
In this section, we describe the experimental results of the different optimization methods.

Experimental Setup
Settings.
The Frank-Wolfe gap (Definition 1) and the distance between two consecutive iterates are used as the stopping criteria for the FW variants and replicator dynamics, respectively. Specifically, let ε be the threshold; then an algorithm stops if g_t ≤ ε or if ||x_{t+1} − x_t|| ≤ ε. In the experiments we set ε to Python's machine epsilon, ε ≈ 2.2 · 10^−16, and the cutoff parameter δ to a small fixed value.

We denote the number of clusters in the dataset by k and the maximum number of clusters to extract by K. For a dataset with n objects, the clustering assignment is represented by a discrete n-dimensional vector c, i.e., c_i ∈ {0, 1, ..., K − 1, K} for i = 1, ..., n. If c_i = c_j, then objects i and j are in the same cluster. The discrete values 1, ..., K − 1, K are called labels and represent the different clusters. Label 0 is designated to represent 'no cluster' – if c_i = 0, then object i is unassigned. We may regularize the pairwise similarities by a shift parameter, as described in detail in (Johnell 2020).

Clustering metrics.
To evaluate the clustering quality, we compare the predicted solution and the ground truth solution w.r.t. the Adjusted Rand Index (ARI) (Hubert and Arabie 1985) and V-Measure (Rosenberg and Hirschberg 2007). The Rand index is the ratio of the object pairs that are either in the same cluster or in different clusters, in both the predicted and ground truth solutions. V-Measure is the harmonic mean of homogeneity and completeness. We may also report the Assignment Rate (AR), representing the rate of the objects assigned to a valid cluster.

Method  t     time    AR      ARI     V-Meas.
FW      1000  0.36s   0.6325  0.4695  0.5388
        4000  1.35s   0.6885  0.4593  0.5224
        8000  2.41s   0.6969  0.4673  0.5325
PFW-B   1000  0.43s   0.7429  0.1944  0.4289
        4000  1.86s   0.6605  0.467   0.5327
        8000  2.62s   0.642   0.471   0.5335
PFW-V   1000  0.52s   0.6471  0.5178  0.5745
        4000  1.6s    0.6487  0.4565  0.5237
        8000  2.47s   0.642   0.471   0.5335
AFW-B   1000  0.35s   0.8527  0.076   0.2854
        4000  1.69s   0.6258  0.3887  0.5316
        8000  2.93s   0.6599  0.4676  0.5328
AFW-V   1000  0.46s   0.6415  0.5184  0.5736
        4000  1.38s   0.6482  0.518   0.5754
        8000  2.75s   0.6476  0.4618  0.5257
RD      1000  1.06s   1.0     0.0     0.0
        4000  4.56s   0.9081  0.1852  0.3003
        8000  11.4s   0.6997  0.4121  0.5384

Table 1: Dataset newsgroups1 results.

Experiments on 20 Newsgroups Data
We first study the clustering of different subsets of the 20 Newsgroups data collection. The collection consists of documents in 20 categories, split into training and test subsets. We use four datasets with documents from randomly selected categories of the test subset: (i) newsgroups1: soc.religion.christian, comp.os.ms-windows.misc, talk.politics.guns, alt.atheism, talk.politics.misc. (ii) newsgroups2: comp.windows.x, sci.med, rec.autos, talk.religion.misc, sci.crypt. (iii) newsgroups3: misc.forsale, comp.sys.mac.hardware, talk.politics.mideast, sci.electronics, rec.motorcycles. (iv) newsgroups4: comp.sys.ibm.pc.hardware, comp.graphics, rec.sport.hockey, rec.sport.baseball, sci.space. Each dataset has k = 5 true clusters, and we use K = 5 for peeling off the computed clusters. We obtain the tf-idf (term-frequency times inverse document-frequency) vector for each document and then apply PCA to reduce the dimensionality to 20. We obtain the similarity matrix A using the cosine similarity between the PCA vectors and then shift the off-diagonal elements to ensure nonnegative entries.

Method  t     time    AR      ARI     V-Meas.
FW      1000  0.37s   0.6587  0.5594  0.5929
        4000  1.38s   0.6674  0.5479  0.5866
        8000  2.6s    0.6679  0.5473  0.5864
PFW-B   1000  0.45s   0.7508  0.135   0.3555
        4000  1.57s   0.6172  0.6257  0.6364
        8000  2.06s   0.6172  0.6257  0.6364
PFW-V   1000  0.59s   0.6281  0.6095  0.6241
        4000  1.85s   0.6172  0.6257  0.6364
        8000  3.1s    0.6172  0.6257  0.6364
AFW-B   1000  0.41s   0.8653  0.0979  0.316
        4000  1.9s    0.6172  0.6257  0.6364
        8000  3.39s   0.6172  0.6257  0.6364
AFW-V   1000  0.48s   0.663   0.5548  0.5907
        4000  1.75s   0.6172  0.6257  0.6364
        8000  3.38s   0.6172  0.6257  0.6364
RD      1000  0.76s   1.0     0.0     0.0
        4000  4.67s   1.0     0.1795  0.333
        8000  13.52s  0.7585  0.4391  0.5161

Table 2: Dataset newsgroups2 results.

Method  t     time    AR      ARI     V-Meas.
FW      1000  0.41s   0.6756  0.5206  0.5879
        4000  1.35s   0.6468  0.5309  0.5975
        8000  2.63s   0.6473  0.5314  0.5978
PFW-B   1000  0.49s   0.758   0.217   0.4617
        4000  1.79s   0.6468  0.5317  0.6004
        8000  2.88s   0.6468  0.5317  0.6004
PFW-V   1000  0.56s   0.6468  0.5317  0.6004
        4000  1.96s   0.6468  0.5317  0.6004
        8000  3.71s   0.6468  0.5317  0.6004
AFW-B   1000  0.37s   0.8373  0.1381  0.3594
        4000  1.83s   0.6462  0.5316  0.6003
        8000  3.19s   0.6468  0.5317  0.6004
AFW-V   1000  0.49s   0.6468  0.5322  0.5993
        4000  1.63s   0.6468  0.5317  0.6004
        8000  2.99s   0.6468  0.5317  0.6004
RD      1000  0.86s   1.0     0.0     0.0
        4000  4.69s   0.9089  0.2212  0.3465
        8000  12.9s   0.8012  0.3526  0.4556

Table 3: Dataset newsgroups3 results.

Method  t     time    AR      ARI     V-Meas.
FW      1000  0.42s   0.653   0.5097  0.5706
        4000  1.38s   0.6169  0.4672  0.5437
        8000  2.6s    0.7002  0.5014  0.5514
PFW-B   1000  0.43s   0.8092  0.2247  0.4483
        4000  1.82s   0.6697  0.6211  0.6484
        8000  3.04s   0.6697  0.6211  0.6484
PFW-V   1000  0.58s   0.6591  0.6446  0.6717
        4000  2.02s   0.6565  0.6462  0.675
        8000  2.74s   0.6565  0.6462  0.675
AFW-B   1000  0.35s   0.9041  0.1109  0.3361
        4000  1.87s   0.6687  0.6191  0.6463
        8000  3.55s   0.6697  0.6211  0.6484
AFW-V   1000  0.5s    0.6525  0.5071  0.5651
        4000  1.84s   0.6565  0.6462  0.675
        8000  3.6s    0.6565  0.6462  0.675
RD      1000  0.93s   1.0     0.0     0.0
        4000  5.46s   1.0     0.3197  0.4112
        8000  14.52s  0.8559  0.4528  0.5328

Table 4: Dataset newsgroups4 results.

Tables 1, 2, 3, and 4 show the results for the different datasets. We observe that the different variants of FW yield significantly better results compared to replicator dynamics (RD), w.r.t. both ARI and V-Measure. In particular, PFW-V and AFW-V are computationally efficient and perform very well even with t = 1000. Moreover, these methods are more robust w.r.t. different parameter settings. Since all the objects in the ground truth solutions are assigned to a cluster, the assignment rate (AR) indicates the ratio of the objects assigned (correctly or incorrectly) to a cluster during the clustering. High AR and low ARI/V-Measure means assignment of many objects to wrong clusters. This is what happens for RD with t = 1000.
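The AR, ARI and V-Measure columns of the tables can be computed from a predicted labeling as sketched below. We use scikit-learn for the two scores and restrict them to the assigned objects; both choices are our own assumptions, since the paper does not specify its evaluation tooling:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, v_measure_score

def evaluate(c_pred, c_true):
    """AR, ARI and V-Measure for a labeling c_pred in {0, 1, ..., K},
    where label 0 marks an unassigned object.

    ARI and V-Measure are computed on the assigned objects only (one
    plausible convention; an assumption on our part)."""
    assigned = c_pred > 0
    ar = assigned.mean()                                   # Assignment Rate
    ari = adjusted_rand_score(c_true[assigned], c_pred[assigned])
    vm = v_measure_score(c_true[assigned], c_pred[assigned])
    return ar, ari, vm
```

Both scores are invariant to a permutation of the cluster labels, so the arbitrary numbering produced by the peel-off order does not matter.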
We note that these results are consistent with the results on synthetic datasets reported in the supplementary material.

As discussed in (Pavan and Pelillo 2007), it is common for DSC to perform a post-processing step that assigns each unassigned object to the cluster with which it has the highest average similarity. Specifically, let C ⊆ V contain the unassigned objects and C_i ⊆ V, 1 ≤ i ≤ K, be the predicted clusters. Object j ∈ C is then assigned to the cluster C_i that satisfies

i ∈ arg max_{ℓ ≥ 1} (1/|C_ℓ|) Σ_{p ∈ C_ℓ} A_jp.

Table 5 shows the performance of the different methods after assigning all the documents to valid clusters, i.e., when AR is always 1. We observe that ARI and V-Measure are usually similar for the pre- and post-assignment settings. In both cases the FW variants (especially PFW-V and AFW-V) yield the best and computationally most efficient results. Consistent with the previous results, PFW-V and AFW-V yield high scores already with t = 1000.

Image Segmentation
Next, we study the segmentation of colored images in HSV space. We define the feature vector f(i) = [v, v·s·sin(h), v·s·cos(h)]^T as in (Pavan and Pelillo 2007), where h, s, and v are the HSV values of pixel i. The similarity matrix A is then defined as follows. (i) Compute ||f(i) − f(j)|| for every pair of pixels i and j to obtain D_L. (ii) Compute the minimax (path-based) distances (Fischer and Buhmann 2003; Chehreghani 2017; Haghir Chehreghani 2017b) from D_L to obtain D_P. (iii) Finally, A = max(D_P) − D_P, where max is taken over the elements in D_P, as used in (Chehreghani 2016).

Figure 1 shows the segmentation results for the airplane image in Figure 1(a). The image has dimensions 120 × 80, which leads to a clustering problem with n = 120 · 80 = 9600. We run the FW variants for t = 10000 iterations and RD for t = 250 iterations. Due to the linear versus quadratic per-iteration time complexity of the FW variants and RD, respectively, we are able to run FW for many more iterations. This gives us more flexibility in tuning the parameters and thus yields more robust results. See (Johnell 2020) for additional details.

Conclusion
We presented a unified and computationally efficient framework to employ the different variants of Frank-Wolfe for Dominant Set Clustering. In particular, replicator dynamics was replaced with standard, pairwise, and away-steps Frank-Wolfe when optimizing the quadratic problem defined by DSC. We provided a specialized analysis of the algorithms' convergence rates, and demonstrated the effectiveness of the framework via several experimental studies.
Acknowledgment
The work of Morteza Haghir Chehreghani was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
Figure 1: Original image and segmentation results. (a) Original image, (b) RD, (c) FW, PFW-B, AFW-V, and AFW-B, (d) PFW-V.

References

[Bansal, Blum, and Chawla 2004] Bansal, N.; Blum, A.; and Chawla, S. 2004. Correlation clustering. Machine Learning 56:89–113.
[Bomze, Rinaldi, and Zeffiro 2019] Bomze, I. M.; Rinaldi, F.; and Zeffiro, D. 2019. Active set complexity of the away-step Frank-Wolfe algorithm. arXiv preprint arXiv:1912.11492.
[Bühler and Hein 2009] Bühler, T., and Hein, M. 2009. Spectral clustering based on the graph p-Laplacian. In International Conference on Machine Learning (ICML), 81–88.
[Bulò and Pelillo 2017] Bulò, S. R., and Pelillo, M. 2017. Dominant-set clustering: A review. European Journal of Operational Research.
[Bulò, Pelillo, and Bomze 2011] Bulò, S. R.; Pelillo, M.; and Bomze, I. M. 2011. Computer Vision and Image Understanding.
[Bulò, Torsello, and Pelillo 2009] Bulò, S. R.; Torsello, A.; and Pelillo, M. 2009. A game-theoretic approach to partial clique enumeration. Image Vision Comput.
[Chan, Schlag, and Zien 1994] Chan, P. K.; Schlag, M. D. F.; and Zien, J. Y. 1994. IEEE Trans. on CAD of Integrated Circuits and Systems.
[Chehreghani 2016] Chehreghani, M. H. 2016. Machine Learning.
[Chehreghani 2017] Chehreghani, M. H. 2017. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 1784–1790. AAAI Press.
[Fischer and Buhmann 2003] Fischer, B., and Buhmann, J. M. 2003. Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[Frank and Wolfe 1956] Frank, M., and Wolfe, P. 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly.
[Haghir Chehreghani et al. 2007] Haghir Chehreghani, M., et al. 2007. Intell. Data Anal.
[Haghir Chehreghani 2017a] Haghir Chehreghani, M. 2017a. In IEEE International Conference on Data Mining (ICDM), 793–798.
[Haghir Chehreghani 2017b] Haghir Chehreghani, M. 2017b. Efficient computation of pairwise minimax distance measures. In IEEE International Conference on Data Mining (ICDM), 799–804.
[Hein and Bühler 2010] Hein, M., and Bühler, T. 2010. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In Advances in Neural Information Processing Systems (NIPS), 847–855.
[Hubert and Arabie 1985] Hubert, L., and Arabie, P. 1985. Comparing partitions. Journal of Classification.
[Jain 2010] Jain, A. K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters.
[Johnell 2020] Johnell, C. 2020. Chalmers University of Technology.
[Lacoste-Julien and Jaggi 2015] Lacoste-Julien, S., and Jaggi, M. 2015. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems (NIPS), 496–504.
[Lacoste-Julien 2016] Lacoste-Julien, S. 2016. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345.
[Lin and Cohen 2010] Lin, F., and Cohen, W. W. 2010. Power iteration clustering. In International Conference on Machine Learning (ICML), 655–662.
[Liu, Latecki, and Yan 2013] Liu, H.; Latecki, L. J.; and Yan, S. 2013. Fast detection of dense subgraphs with iterative shrinking and expansion. IEEE Trans. Pattern Anal. Mach. Intell.
[Ng, Jordan, and Weiss 2001] Ng, A. Y.; Jordan, M. I.; and Weiss, Y. 2001. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, 849–856. MIT Press.
[Pavan and Pelillo 2003a] Pavan, M., and Pelillo, M. 2003a. Dominant sets and hierarchical clustering. In Proceedings of the Ninth IEEE International Conference on Computer Vision, volume 2. IEEE.
[Pavan and Pelillo 2003b] Pavan, M., and Pelillo, M. 2003b. A new graph-theoretic approach to clustering and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1. IEEE.
[Pavan and Pelillo 2007] Pavan, M., and Pelillo, M. 2007. Dominant sets and pairwise clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1244–1251.
[Reddi et al. 2016] Reddi, S. J.; Sra, S.; Póczos, B.; and Smola, A. J. 2016. Stochastic Frank-Wolfe methods for nonconvex optimization.
[Rosenberg and Hirschberg 2007] Rosenberg, A., and Hirschberg, J. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 410–420.
[Shi and Malik 2000] Shi, J., and Malik, J. 2000. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell.
[Thiel, Haghir Chehreghani, and Dubhashi 2019] Thiel, E.; Haghir Chehreghani, M.; and Dubhashi, D. 2019. In The Thirty-Third AAAI Conference on Artificial Intelligence, 5159–5166.
Appendix A - Additional Experiments
In this section, we further investigate the proposed optimization framework for Dominant Set Clustering by performing additional experiments.
Experiments on Synthetic Data
For the synthetic experiments, we fix n = 200 and K = k = 5, and assign the objects uniformly to one of the k clusters. Let µ ~ U(0, 1) be uniformly distributed, and let

    z = 0 with probability p,    z = 1 with probability 1 − p,

where p is the noise ratio. The similarity matrix A = (a_ij) is then constructed as follows:

    a_ij = a_ji = zµ, if i and j are in the same cluster;    a_ij = 0, otherwise.

For each parameter configuration, we generate a similarity matrix, perform the clustering five times, and report the average results in Figure 2. We observe that the different FW methods are considerably more robust w.r.t. the noise in the pairwise measurements and yield higher-quality results. Moreover, the performance of all FW variants is consistent across parameter configurations, whereas RD is more sensitive to the number of iterations t and the cutoff δ.
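A minimal sketch of this generative model (the function and parameter names are mine, not from the paper; I assume a fresh µ and z are drawn independently for each within-cluster pair):

```python
import numpy as np

def synthetic_similarity(n=200, k=5, p=0.2, seed=None):
    """Noisy block-structured similarity matrix with zero diagonal.

    Objects are assigned uniformly to one of k clusters. Each
    within-cluster pair (i, j) gets a_ij = a_ji = z * mu, where
    mu ~ U(0, 1) and z = 0 with probability p (the noise ratio),
    z = 1 otherwise; all other entries, including the diagonal,
    stay zero.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=n)   # uniform cluster assignment
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                z = 0.0 if rng.random() < p else 1.0
                A[i, j] = A[j, i] = z * rng.random()  # z * mu
    return A, labels
```

Larger p removes more within-cluster similarities, which is the noise regime studied in Figure 2.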
Multi-Start Dominant Set Clustering

Finally, as a side study, we investigate a combination of multi-start dominant set clustering with the peeling-off strategy. For this, we perform the following procedure:
1. Sample a subset of objects, and use them to construct a number of starting points for the same similarity matrix.
2. Run an optimization method for each starting point.
3. Identify the nonoverlapping clusters from the solutions and remove (peel off) the corresponding objects from the similarity matrix.
4. Repeat until no objects are left or a sufficient number of clusters have been found.
This scenario can be potentially useful, in particular when multiple processors can perform clustering in parallel. However, if all the different starting points converge to the same cluster, then there would be no computational benefit. Thus, here we investigate such a possibility for our optimization framework. For this, we consider the number of passes through the entire data, where a pass is defined as one complete run of the aforementioned steps. After the solutions from a pass are computed, they are sorted based on the function value f(x). The sorted solutions are then processed in decreasing order, and if the support of the current solution overlaps by more than a given threshold with the supports of the previous solutions, it is discarded. Each pass will therefore yield at least one new cluster. With K the maximum number of clusters to extract, there will be at most K passes. Thus, in order for the method to be useful and effective, fewer than K passes should be performed.

Figure 2: Results on the synthetic dataset for RD, FW, PFW, and AFW. PFW-B and AFW-B are marked with squares; PFW-V and AFW-V with crosses. (a)-(i) n = 200.
(a,d,g) and (b,e,h): t = 400 with two different cutoffs δ; (c,f,i): t = 4000.

Figure 3 shows the form of the datasets used in this study. Each cluster corresponds to a two-dimensional Gaussian distribution with a fixed mean and an identity covariance matrix (see Figure 3(a)). We fix n = 1000 and K = k = 4, and use the parameter p to control the noise ratio. Set n_1 = pn and n_2 = n − n_1. A dataset is then generated by sampling n_1 objects from a uniform distribution (the background noise in Figure 3(c)) and the remaining n_2 objects from the respective Gaussians in fixed proportions.

Let D be the matrix of pairwise Euclidean distances between all objects in the dataset. The similarity matrix is then defined as A = max(D) − D, similar to the image segmentation study but with a different base distance measure.

To determine the starting points, we sample 4 components from {1, ..., n}, denoted by i_1, i_2, i_3, i_4. The number 4 matches the number of CPUs in our system. For a given component i ∈ {i_1, i_2, i_3, i_4}, we define the starting points x̄V, with x̄V_i = 1 and x̄V_j = 0 for j ≠ i, and x̄B, with x̄B_i = c and x̄B_j = (1 − c)/(n − 1) for j ≠ i, for a fixed constant c ∈ (0, 1). FW uses only x̄V, while PFW and AFW use both x̄V and x̄B. To sample the components, we consider uniform sampling and Determinantal Point Processes (Kulesza, Taskar, and others 2012), denoted UNI and DPP, respectively.
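A sketch of these starting points on the probability simplex (the constant c is illegible in this copy of the paper; c = 0.5 below is an assumption, and the function name is mine):

```python
import numpy as np

def starting_points(i, n, c=0.5):
    """Vertex start x_v (used by FW, PFW, and AFW) and interior start
    x_b (used by PFW and AFW) for component i. The constant c is not
    recoverable from this copy of the paper; c = 0.5 is an assumption.
    """
    x_v = np.zeros(n)
    x_v[i] = 1.0                          # one-hot vertex of the simplex
    x_b = np.full(n, (1.0 - c) / (n - 1))  # remaining mass spread evenly
    x_b[i] = c
    return x_v, x_b
```

Both vectors sum to one, so either can serve as a feasible iterate for the simplex-constrained objective.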
Uniform sampling. Let ℓ be the number of components to sample and a_{i*} = Σ_{j=1}^n a_ij the sum of the elements in row i of A. We sort the a_{i*} in decreasing order, divide them into blocks of size n/ℓ, and sample one component uniformly from each block.
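The block-wise sampling scheme above can be sketched as follows (a sketch under the stated scheme; the function name is mine):

```python
import numpy as np

def block_uniform_sample(A, num_blocks, seed=None):
    """Sort components by row sum a_{i*} in decreasing order, split the
    sorted indices into `num_blocks` blocks, and draw one component
    uniformly at random from each block."""
    rng = np.random.default_rng(seed)
    order = np.argsort(-A.sum(axis=1))      # indices by decreasing a_{i*}
    blocks = np.array_split(order, num_blocks)
    return [int(rng.choice(block)) for block in blocks]
```

Using `np.array_split` keeps the sketch valid when n is not divisible by ℓ; since the blocks partition the index set, the sampled components are always distinct.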
Figure 3: Two example datasets used for the multi-start study. (a) No noise. (b) AR 0.94, ARI 1.0, and V-measure 1.0. (c) With background noise. (d) AR 0.74, ARI 1.0, and V-measure 1.0. (b) and (d) show clustering results with the FW optimization; PFW and AFW produce similar results.

Figure 4: Results of the multi-start paradigm with FW, PFW, and AFW, where PFW-B and AFW-B are marked with squares and PFW-V and AFW-V with crosses. The first row corresponds to UNI and the second row to DPP sampling. All optimization and sampling methods require only about two passes to compute the clusters.

Determinantal Point Processes (DPP). DPP is a common sampling method that provides both relevance and diversity. Thus, we study this method for sampling the starting objects too. A discrete DPP is a probability measure on subsets of a discrete set V, i.e., on the power set 2^V. We consider a variant of DPP called an L-ensemble. If P_L is an L-ensemble and Y ⊆ V, then the probability of sampling Y according to P_L is

    P_L(Y) ∝ det(L_Y),

where L is a real, symmetric, and positive semidefinite matrix called the likelihood matrix. The submatrix L_Y is constructed from the rows and columns of L indexed by Y (Kulesza, Taskar, and others 2012). If L is a similarity matrix, then subsets with diverse objects (low similarities between them) are more likely to be sampled. For example, if Y = {i, j}, then

    P_L(Y) ∝ det(L_Y) = ℓ_ii ℓ_jj − ℓ_ij ℓ_ji.
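A tiny numerical illustration of this two-element determinant (the matrices below are made-up examples, not from the paper):

```python
import numpy as np

# For Y = {i, j}, P_L(Y) is proportional to l_ii * l_jj - l_ij * l_ji:
# a large similarity l_ij shrinks the determinant, so similar pairs are
# less likely to be sampled together.
L_similar = np.array([[1.0, 0.9],
                      [0.9, 1.0]])   # two very similar items
L_diverse = np.array([[1.0, 0.1],
                      [0.1, 1.0]])   # two dissimilar items

# The diverse pair has the (proportionally) larger sampling probability.
assert np.linalg.det(L_similar) < np.linalg.det(L_diverse)
```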
Using the similarity matrix A as the DPP likelihood matrix, we are more likely to sample objects that are diverse, and therefore less likely to belong to the same cluster. The similarity matrix A – or submatrices thereof – is real and symmetric, but not positive semidefinite, since the main diagonal elements are zero. However, any symmetric matrix can be transformed into a positive semidefinite one by making it diagonally dominant: every diagonal element is larger than the sum of the absolute values of the off-diagonal elements in the corresponding row.

In order to sample from a DPP with the given likelihood matrix, we need to compute its eigendecomposition, whose computational complexity can be O(n^3). Thus, we perform the sampling in two steps. The first step is similar to the uniform sampling method: we sort the components based on the a_{i*}, split them into blocks, and uniformly sample a fixed number of objects from each block. In the second step we sample with DPP, where for the likelihood matrix we use the submatrix of A indexed by the objects from the first step. Note that we ensure this submatrix of A is diagonally dominant.
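The diagonal-dominance transformation can be sketched as follows (a sketch; the paper does not spell out the exact shift it uses, so the `eps` margin is my choice):

```python
import numpy as np

def make_diagonally_dominant(A, eps=1e-9):
    """Return a copy of A whose diagonal strictly exceeds the sum of
    the absolute off-diagonal entries in each row. By Gershgorin's
    circle theorem, a symmetric matrix with this property and a
    nonnegative diagonal is positive semidefinite, hence a valid DPP
    likelihood matrix."""
    L = np.array(A, dtype=float)
    off_diag = np.abs(L).sum(axis=1) - np.abs(np.diag(L))
    np.fill_diagonal(L, off_diag + eps)
    return L
```

Only the diagonal changes, so the pairwise similarities that drive the diversity of the sampled subsets are preserved.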
Results. Figure 4 illustrates the results for the different sampling methods and starting objects. For a given dataset, sampling method, and optimization method, we generate starting objects, run the experiments 10 times, and report the average results. Each optimization method is run for t = 1000 iterations. For this type of dataset, we do not observe any significant difference between FW, PFW, and AFW when using either DPP or UNI. AFW with x̄B as the starting object seems to perform slightly worse. However, we observe that all the sampling and optimization methods require only two passes, whereas K = 4. This observation implies that the multi-start paradigm is potentially useful for computing the clusters in parallel with the different FW variants.

Appendix B - Proofs
Proof of Lemma 1
Proof.
By definition (lines 3 and 4 in Algorithm 2), r_0 = Ax_0 and f_0 = x_0^T A x_0. Let x = x_t, s = s_t, and γ = γ_t, and assume r_t = Ax_t and f_t = x_t^T A x_t hold. Expand the definition of Ax_{t+1} and proceed by induction:

    Ax_{t+1} = A((1 − γ)x + γs) = (1 − γ)Ax + γAs = (1 − γ)r_t + γ a_{*i} = r_{t+1},

    x_{t+1}^T A x_{t+1} = ((1 − γ)x + γs)^T A ((1 − γ)x + γs)
                        = (1 − γ)^2 x^T A x + 2γ(1 − γ) s^T A x + γ^2 s^T A s
                        = (1 − γ)^2 f_t + 2γ(1 − γ) r_i^{(t)} = f_{t+1}.

Note that s^T A s = 0 from the definition of s and A.
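The incremental updates can be checked numerically; a small illustrative sketch (the matrix, iterate, and step below are arbitrary choices, not from the paper):

```python
import numpy as np

A = np.array([[0., 1., 2.],
              [1., 0., 3.],
              [2., 3., 0.]])   # symmetric with zero diagonal
x = np.array([0.5, 0.3, 0.2])  # current iterate on the simplex
r, f = A @ x, x @ A @ x        # maintained quantities r_t and f_t
i, gamma = 1, 0.25             # FW vertex s = e_i and a step size

# Incremental updates from Lemma 1:
r_new = (1 - gamma) * r + gamma * A[:, i]
f_new = (1 - gamma) ** 2 * f + 2 * gamma * (1 - gamma) * r[i]

# They match the quantities recomputed from scratch at x_{t+1}:
s = np.zeros(3)
s[i] = 1.0
x_new = (1 - gamma) * x + gamma * s
assert np.allclose(r_new, A @ x_new)
assert np.isclose(f_new, x_new @ A @ x_new)
```

The updates cost O(n) per iteration instead of the O(n^2) of recomputing Ax from scratch, which is the point of maintaining r_t and f_t.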
Proof.
Proceed as in the proof of Lemma 1. Let x = x_t, s = s_t, v = v_t, and γ = γ_t:

    Ax_{t+1} = A(x + γ(s − v)) = Ax + γ(As − Av) = r_t + γ(a_{*i} − a_{*j}) = r_{t+1},

    x_{t+1}^T A x_{t+1} = (x + γ(s − v))^T A (x + γ(s − v))
                        = x^T A x + 2γ(s − v)^T A x + γ^2 (s − v)^T A (s − v)
                        = x^T A x + 2γ(s^T A x − v^T A x) − 2γ^2 a_ij
                        = f_t + 2γ(r_i^{(t)} − r_j^{(t)}) − 2γ^2 a_ij = f_{t+1},

where (s − v)^T A (s − v) = −2a_ij, since s^T A s = v^T A v = 0 and s^T A v = v^T A s = a_ij.
Proof.
Proceed as in the proof of Lemma 1. Let x = x_t, v = v_t, and γ = γ_t:

    Ax_{t+1} = A((1 + γ)x − γv) = (1 + γ)Ax − γAv = (1 + γ)r_t − γ a_{*j} = r_{t+1},

    x_{t+1}^T A x_{t+1} = ((1 + γ)x − γv)^T A ((1 + γ)x − γv)
                        = (1 + γ)^2 x^T A x − 2γ(1 + γ) v^T A x + γ^2 v^T A v
                        = (1 + γ)^2 f_t − 2γ(1 + γ) r_j^{(t)} = f_{t+1},

since v^T A v = 0 from the definition of v and A.
Proof.
Let y = x_t + γ_t d_t for some ascent direction d_t, and write r(x) = Ax and f(x) = x^T A x. From (15) we have

    f(y) = f(x_t) + 2γ_t r(x_t)^T d_t + γ_t^2 d_t^T A d_t.

Using γ_t = −r(x_t)^T d_t / (d_t^T A d_t) from (20), we get

    f(y) = f(x_t) − 2(r(x_t)^T d_t)^2 / (d_t^T A d_t) + (r(x_t)^T d_t)^2 / (d_t^T A d_t)
         = f(x_t) − (r(x_t)^T d_t)^2 / (d_t^T A d_t)                                    (30)
    ⟺ (r(x_t)^T d_t)^2 = −d_t^T A d_t (f(y) − f(x_t)).

Let s_t satisfy (13) and v_t satisfy (14), and denote their nonzero components by i and j, respectively. Let h_t = f(x_{t+1}) − f(x_t) and g_t = 2(r(x_t)_i − f(x_t)). We consider the FW, away, and pairwise directions d_t and corresponding step sizes satisfying γ_t < γ_max; note that f(y) = f(x_{t+1}) holds in (30) for such directions and step sizes.

FW direction: Substitute d_t = s_t − x_t and (16) into (30):

    (r(x_t)_i − f(x_t))^2 = (2r(x_t)_i − f(x_t)) h_t  ⟹  g_t^2 ≤ 4(2M_1 − M_2) h_t.

Away direction: For this direction with γ_t < γ_max we have r(x_t)_i − f(x_t) < f(x_t) − r(x_t)_j, from line 11 in Algorithm 4. Substitute d_t = x_t − v_t and (18) into (30):

    (f(x_t) − r(x_t)_j)^2 = (2r(x_t)_j − f(x_t)) h_t  ⟹  g_t^2 ≤ 4(2M_1 − M_2) h_t.

Pairwise direction: Substitute d_t = s_t − v_t and (17) into (30). Since r(x_t)_j ≤ f(x_t) ≤ r(x_t)_i, we have g_t ≤ 2(r(x_t)_i − r(x_t)_j), and thus

    (r(x_t)_i − r(x_t)_j)^2 = 2a_ij h_t  ⟹  g_t^2 ≤ 8M_1 h_t.

Using the previously defined I and β from the section Analysis of Convergence Rates, we get

    4β(f(x_t) − f(x_0)) = 4β Σ_{ℓ=0}^{t−1} h_ℓ ≥ 4β Σ_{ℓ∈I} h_ℓ ≥ Σ_{ℓ∈I} g_ℓ^2 ≥ |I| g̃_t^2
    ⟹ g̃_t^2 ≤ 4β(f(x_t) − f(x_0)) / |I|
    ⟺ g̃_t ≤ 2 √(β(f(x_t) − f(x_0)) / |I|),

for either direction d_t.
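The key identity (30) can be verified numerically; a small illustrative check (the matrix and point below are arbitrary, not from the paper):

```python
import numpy as np

# Check (r(x)^T d)^2 = -d^T A d * (f(y) - f(x)) for the optimal step
# gamma = -r(x)^T d / (d^T A d), with f(x) = x^T A x and r(x) = A x.
A = np.array([[0., 1., 2.],
              [1., 0., 3.],
              [2., 3., 0.]])   # symmetric with zero diagonal
x = np.array([0.5, 0.3, 0.2])  # a point on the simplex
r = A @ x
s = np.zeros(3)
s[np.argmax(r)] = 1.0          # FW vertex
d = s - x                      # FW direction

gamma = -(r @ d) / (d @ A @ d)  # optimal (unconstrained) step from (20)
y = x + gamma * d
lhs = (r @ d) ** 2
rhs = -(d @ A @ d) * (y @ A @ y - x @ A @ x)
assert np.isclose(lhs, rhs)
```

The identity holds for any direction with d^T A d ≠ 0, which is what lets the proof turn per-step progress h_t into a bound on the Frank-Wolfe gap.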
Proof.
Since standard FW only takes good steps we have | I | = t . The result follows from Lemma 5. Proof of Theorem 7
Proof.
When γ_t = γ_max we have either |σ_{t+1}| = |σ_t| − 1 or |σ_{t+1}| = |σ_t|, called a drop and a swap step, respectively. We need to upper bound the number of these steps in order to get a lower bound on |I|. The following reasoning is from the analysis of PFW with a convex objective function in (Lacoste-Julien and Jaggi 2015).

Let n be the dimension of x_t, m = |σ_t|, and d_t = s_t − v_t. Since we are performing line search, we always have f(x_ℓ) < f(x_t) for all ℓ < t that are nonstationary. This means the sequence x_0, ..., x_t contains no duplicates. The set of component values does not change when we perform a swap step:

    {x_ℓ^{(t)} : ℓ = 1, ..., n} = {x_ℓ^{(t+1)} : ℓ = 1, ..., n}.

That is, the components are simply permuted after a swap step. The number of possible unique permutations is κ = n!/(n − m)!. After κ swap steps have been performed, a drop step can be taken, which changes the component values. Thus, in the worst case, κ swap steps followed by a drop step are performed until m = 1 before a good step is taken. The number of swap/drop steps between two good steps is then bounded by

    Σ_{ℓ=1}^m n!/(n − ℓ)! ≤ n! Σ_{ℓ=0}^∞ 1/ℓ! = n!e ≤ 3·n!.

Result (26) follows from Lemma 5 and |I| ≥ t/(3·n!).
Proof.
When γ_t = γ_max, d_t must be the away direction. In this case the support is reduced by one, i.e., |σ_{t+1}| = |σ_t| − 1. Denote these indexes by D. Let I_A ⊆ I be the indexes that add to the support, i.e., |σ_{t+1}| > |σ_t| for t ∈ I_A. Similarly as before, we need to upper bound |D| in order to get a lower bound on |I|.

We have |I_A| + |D| ≤ t and |σ_t| = |σ_0| + |I_A| − |D|. Combining the inequalities, we get

    1 ≤ |σ_t| ≤ |σ_0| + t − 2|D|  ⟹  |D| ≤ (|σ_0| + t − 1)/2.

Result (27) then follows from Lemma 5 and

    |I| = t − |D| ≥ t − (|σ_0| + t − 1)/2 = (t + 1 − |σ_0|)/2.