Frank-Wolfe algorithm for learning SVM-type multi-category classifiers
Kenya Tajima∗   Yoshihiro Hirohashi†   Esmeraldo Ronnie Rey Zara∗   Tsuyoshi Kato∗
∗ Gunma University   † Individual

Abstract
Multi-category support vector machine (MC-SVM) is one of the most popular machine learning algorithms. There are many variants of MC-SVM, but different optimization algorithms have been developed for different learning machines. In this study, we developed a new optimization algorithm that can be applied to many MC-SVM variants. The algorithm is based on the Frank-Wolfe framework, which requires two subproblems, direction finding and line search, in each iteration. The contribution of this study is the discovery that both subproblems have a closed-form solution if the Frank-Wolfe framework is applied to the dual problem. Moreover, closed-form solutions for both the direction finding step and the line search step exist even for the Moreau envelopes of the loss functions. We use several large datasets to demonstrate that the proposed optimization algorithm converges rapidly and thereby improves the pattern recognition performance.
Multi-category classification is the task of assigning an input object to one of pre-defined categories. Many supervised learning problems reduce to multi-category classification, although in the field of pattern recognition the focus of much research and theoretical analysis has been a simpler task, binary classification, which yielded the most successful machine learning algorithm, the support vector machine (SVM). In the 90's, the so-called one-versus-rest approach was employed to apply SVM to multi-category classification tasks. In the one-versus-rest approach, the learning task is divided into many independent optimization problems, and SVM is applied to each of them. A drawback of the one-versus-rest approach is its inability to learn correlations among the categories. Crammer and Singer [3] proposed an alternative method, which formulates the learning problem as a single optimization problem. This method is called the multi-category SVM (MC-SVM). Since the emergence of Crammer and Singer's MC-SVM, many variants such as the structured SVM [17], SVM-multi [9], and the top-k SVM [12] have been developed. The structured SVM expanded the applicability of machine learning to a wide range of problems, including natural language parsing [6], the deformable part model for image analysis [4], and biological sequence alignment [17]. SVM-multi provides a framework that directly learns performance measures such as the F1-score, the precision/recall breakeven point, precision at k, and the ROC score [9]. The top-k SVM is trained by minimizing the empirical risk based on the top-k error [12].

A learning machine cannot be practical without an efficient and stable optimization algorithm. The MC-SVM variants mentioned above are learned with different optimization algorithms, each of which is specialized to the corresponding learning machine. For example, cutting plane methods [10] were developed for learning the structured SVM and SVM-multi. Optimization algorithms for learning the top-k SVM were proposed by two research groups [12, 2], and both algorithms were based on the stochastic dual coordinate ascent (SDCA) method [16]. However, those algorithms were derived from an incorrect theory, which makes both of them fail to attain an optimum [11]. Kato and Hirohashi [11] considered applying the Frank-Wolfe method [5] to the dual problem of the top-k SVM. The Frank-Wolfe method is an iterative framework for convex optimization over a polyhedron, and each iteration consists of a direction finding step and a line search step. Sublinear convergence to the optimum is guaranteed if both steps are performed exactly [8]. Kato and Hirohashi [11] found that both the direction finding step and the line search step can be given in a closed form, and the computational time is within O(mn).

One of the main contributions of this study is the finding that both the direction finding step and the line search step of the Frank-Wolfe method are expressed in a closed form not only for the top-k SVM but also for a wide range of MC-SVM variants. In this paper, a condition for expressing the two steps in a closed form is clarified. Compared to gradient methods that are often employed for machine learning, the proposed Frank-Wolfe algorithm possesses no hyper-parameter such as a step size, which often requires manual tuning, and it guarantees the accuracy of the resulting solution. Owing to this discovery, an optimization algorithm that does not require a step
size and can be terminated with a pre-defined accuracy becomes available for learning a variety of MC-SVM variants.

In addition, we extended our analysis to the Moreau envelope [1] of the loss function. The Moreau envelope is a trick that is widely used in the machine learning field. Taking the Moreau envelope makes the loss functions smooth and thereby accelerates optimization in general [14, 19, 13]. In this study, we found that each step of the Frank-Wolfe method can be expressed in a closed form even when taking the Moreau envelope of the loss function.
Notation:
We shall use the notation π(j; s) ∈ [m] for the index of the j-th largest component of a vector s ∈ R^m. When using this notation, the vector s is omitted if there is no danger of confusion. Namely, for a vector s ∈ R^m, we can write s_{π(1)} ≥ s_{π(2)} ≥ ··· ≥ s_{π(m)}. Let us define π(s) := [π(1; s), ..., π(m; s)]^⊤ and introduce a notation for a vector with permuted components, s_{π(s)} := [s_{π(1)}, ..., s_{π(m)}]^⊤. We use e_i to denote the unit vector whose i-th entry is one. The n-dimensional vector all of whose entries are one is denoted by 1_n. We use the operator ‖·‖_F to denote the Frobenius norm.

In this section, we review MC-SVM and several of its variants. Let us denote the discrete output space by Y := {1, ..., m}, where m is the number of categories. In the scenario of multi-category classification, prediction is the assignment of an input x ∈ X to one of the elements in Y, where X is the input space. Feature vectors are extracted not only from an input but also from a candidate category. Let ψ: X × Y → R^d be the feature extractor. A typical implementation of the feature extractor is ψ(x, j) := e_j ⊗ x, where ⊗ is the Kronecker product operator and e_j is here an m-dimensional unit vector. Using the model parameter w ∈ R^d, the prediction score for category j is given by the inner product between the feature vector and the parameter vector, i.e. ⟨ψ(x, j), w⟩. Prediction for an input x ∈ X is done by computing the prediction score ⟨ψ(x, j), w⟩ for each category j and finding the maximal score among the m prediction scores; the corresponding category is the prediction result.

We use n training examples (x_1, y_1), ..., (x_n, y_n) ∈ X × Y to determine the value of the model parameter w ∈ R^d. MC-SVM tries to find the minimizer of the regularized empirical risk defined as

(2.1) P(w) := (λ/2)‖w‖² + (1/n) Σ_{i=1}^n Φ(Ψ(x_i)^⊤ w; y_i),

where Ψ(x_i) is the horizontal concatenation of the m feature vectors (i.e. Ψ(x) := [ψ(x, 1), ..., ψ(x, m)] ∈ R^{d×m}); λ is a positive constant called the regularization parameter; Φ(·; y): R^m → R is a loss function. For MC-SVM, the max hinge loss Φ_mh(·; y) is adopted for Φ(·; y). Using the Kronecker delta δ_{·,·}, the max hinge loss is defined as

(2.2) Φ_mh(s; y) := max_{j∈[m]} (s_j − s_y + 1 − δ_{j,y}).
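As an illustration of this setup, the following NumPy sketch (ours, not part of the paper; all function names are illustrative) evaluates the prediction scores for the feature extractor ψ(x, j) = e_j ⊗ x and the max hinge loss (2.2). With that feature map, the parameter w ∈ R^d can be reshaped into an m-by-(input dimension) matrix W, so the whole score vector is a single matrix-vector product.

```python
import numpy as np

def scores(W, x):
    # Prediction scores <psi(x, j), w> for j = 1..m when psi(x, j) = e_j kron x:
    # reshaping w into the m-by-(input dim) matrix W gives the score vector as W @ x.
    return W @ x

def max_hinge_loss(s, y):
    # Max hinge loss (2.2): max_j (s_j - s_y + 1 - delta_{jy}).
    c = s - s[y] + 1.0
    c[y] -= 1.0          # the j = y term becomes s_y - s_y + 1 - 1 = 0
    return c.max()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))          # m = 3 categories, 4-dimensional inputs
x = rng.normal(size=4)
s = scores(W, x)
print("scores:", s)
print("predicted category:", int(np.argmax(s)))
print("max hinge loss for y = 1:", max_hinge_loss(s, y=1))
```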
Fenchel dual: The function D: R^{m×n} → R defined as

(2.3) D(A) := −(λ/2)‖w(A)‖² − (1/n) Σ_{i=1}^n Φ*(−α_i; y_i)

is a Fenchel dual to the regularized empirical risk P: R^d → R, where A := [α_1, ..., α_n] ∈ R^{m×n}; w(A) := (1/(λn)) Σ_{i=1}^n Ψ(x_i) α_i; and Φ*(·; y) is the convex conjugate of Φ(·; y). The optimal solution of the primal variable, w⋆, is obtained by w⋆ := w(A⋆), where A⋆ is a maximizer of the dual objective D(A). The gap P(w(A)) − D(A) is non-negative for any A ∈ R^{m×n} and vanishes at the optimum A = A⋆. From this fact, we can terminate the iterations for optimization when P(w(A)) − D(A) ≤ ε for a pre-defined small positive constant ε. The primal error P(w(A)) − P(w⋆) is then guaranteed not to exceed ε.
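The duality-gap stopping rule can be sketched as follows (again an illustrative NumPy fragment, not the authors' code). It assumes the max hinge loss and the feature map ψ(x, j) = e_j ⊗ x, and it evaluates D(A) in the rearranged form (B.8) from Appendix B, which is valid for A ∈ dom(−D).

```python
import numpy as np

def max_hinge_loss(s, y):
    c = s - s[y] + 1.0
    c[y] -= 1.0
    return c.max()

def w_of_A(A, X, lam):
    # w(A) = (1/(lam*n)) sum_i Psi(x_i) alpha_i, stored as an m x (input dim) matrix:
    # for psi(x, j) = e_j kron x this is sum_i outer(alpha_i, x_i) / (lam*n).
    return (A @ X) / (lam * X.shape[0])

def primal(W, X, ys, lam):
    n = X.shape[0]
    return 0.5 * lam * np.sum(W ** 2) + sum(max_hinge_loss(W @ x, y) for x, y in zip(X, ys)) / n

def dual(A, X, ys, lam):
    # D(A) in the rearranged form (B.8): -(lam/2)||w(A)||^2 + (1/n) sum_i <e_{y_i}, alpha_i>.
    n = X.shape[0]
    W = w_of_A(A, X, lam)
    return -0.5 * lam * np.sum(W ** 2) + sum(A[y, i] for i, y in enumerate(ys)) / n

rng = np.random.default_rng(0)
X, ys, lam, m = rng.normal(size=(20, 5)), rng.integers(0, 3, size=20), 0.1, 3
A = np.zeros((m, 20))                      # A = 0 belongs to dom(-D)
gap = primal(w_of_A(A, X, lam), X, ys, lam) - dual(A, X, ys, lam)
print("duality gap at A = 0:", gap)        # stop the solver once this gap drops below eps
```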
Structured SVM: In the structured SVM, a non-negative loss Δ^{(i)}_ŷ for the i-th training example is designed arbitrarily for the case that the i-th training example is predicted as ŷ ∈ Y (i.e. argmax_{j∈[m]} ⟨ψ(x_i, j), w⟩ = ŷ). This contrasts with Crammer and Singer's MC-SVM, which adopts a convex surrogate of the 0/1 loss and therefore always suffers a unit loss for a mistake. The structured SVM employs the convex surrogate of Δ^{(i)}_ŷ defined as

(2.4) Φ_{i,sh}(s; y) := max_{j∈[m]} (s_j − s_y + Δ^{(i)}_j).
Unweighted Top-k SVM: The unweighted top-k SVM is a variant of MC-SVM. While MC-SVM assumes that a single category is assigned to an input, the unweighted top-k SVM assigns k categories to an input. The prediction results are interpreted so that one of the predicted k categories will be the category of the input. Given an input x ∈ X, the set of k categories is chosen as {π(1; s), ..., π(k; s)}, where s is the prediction score vector (i.e. s = [s_1, ..., s_m]^⊤ := Ψ(x)^⊤ w). For training such a classifier, the loss function is designed as

(2.5) Φ_utk(s; y) := max{0, (1/k) Σ_{j=1}^k (1_m − e_y + s − s_y 1_m)_{π(j)}}.

This is called the unweighted top-k hinge loss.
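For concreteness, here is a small NumPy sketch (ours, not from the paper) that evaluates the unweighted top-k hinge loss (2.5) by sorting the vector 1_m − e_y + s − s_y 1_m and averaging its k largest entries.

```python
import numpy as np

def topk_hinge_loss(s, y, k):
    # Unweighted top-k hinge loss (2.5):
    #   max{0, (1/k) * sum of the k largest entries of a}, a = 1_m - e_y + s - s_y 1_m.
    a = s - s[y] + 1.0
    a[y] -= 1.0
    top_k = np.sort(a)[::-1][:k]
    return max(0.0, float(top_k.sum()) / k)

s = np.array([1.2, 0.3, 0.9, -0.5])
print(topk_hinge_loss(s, y=2, k=1))   # k = 1 recovers the max hinge loss
print(topk_hinge_loss(s, y=2, k=2))
```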
Unweighted Usunier SVM: Similar to the unweighted top-k SVM, the unweighted Usunier SVM trains a classifier that performs top-k prediction, but the loss function is slightly different. The empirical risk for learning the unweighted Usunier SVM is built on the following loss function:

(2.6) Φ_uu(s; y) := (1/k) Σ_{j=1}^k max{0, (1_m − e_y + s − s_y 1_m)_{π(j; s − e_y)}}.

This loss function is called the Usunier loss. The original loss function developed by Usunier et al. [18] was devised for ranking prediction; Lapin et al. [12] redesigned it for top-k prediction.
Weighted Top-k SVM: Using a constant weight vector ρ = [ρ_1, ..., ρ_m]^⊤ ∈ R^m such that ρ_1 ≥ ··· ≥ ρ_{m−1} ≥ ρ_m = 0, Kato and Hirohashi extended the unweighted top-k hinge loss to the weighted version:

(2.7) Φ_wtk(s; y) := max{0, Σ_{j=1}^m (1_m − e_y + s − s_y 1_m)_{π(j)} ρ_j}.

They called this function (2.7) the weighted top-k hinge loss.
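A direct way to evaluate the weighted top-k hinge loss (2.7) is to sort the vector 1_m − e_y + s − s_y 1_m and take its weighted sum against ρ. The sketch below (illustrative NumPy code, not the authors' implementation) does exactly that; choosing ρ_j = 1/k for j ≤ k and ρ_j = 0 otherwise recovers the unweighted top-k hinge loss (2.5).

```python
import numpy as np

def weighted_topk_hinge_loss(s, y, rho):
    # Weighted top-k hinge loss (2.7): max{0, sum_j (1 - e_y + s - s_y 1)_{pi(j)} * rho_j}
    # with rho_1 >= ... >= rho_m = 0.
    a = s - s[y] + 1.0
    a[y] -= 1.0
    a_sorted = np.sort(a)[::-1]            # a_{pi(1)}, ..., a_{pi(m)}
    return max(0.0, float(a_sorted @ rho))

m, k = 5, 2
rho_topk = np.where(np.arange(1, m + 1) <= k, 1.0 / k, 0.0)   # recovers the unweighted loss
s = np.array([0.4, 1.1, -0.2, 0.7, 0.0])
print(weighted_topk_hinge_loss(s, y=1, rho=rho_topk))
```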
Weighted Usunier SVM: The weighted version of the Usunier loss can also be considered. The weighted Usunier loss function is defined as

(2.8) Φ_wu(s; y) := Σ_{j=1}^m max{0, (1_m − e_y + s − s_y 1_m)_{π(j)}} ρ_j,

where ρ = [ρ_1, ..., ρ_m]^⊤ ∈ R^m is a constant vector such that ρ_1 ≥ ··· ≥ ρ_{m−1} ≥ ρ_m = 0.

In this section, the learning machines targeted by our learning algorithm are formulated. The learning machine trains a classifier by minimizing the regularized empirical risk given in (2.1). In the learning algorithm presented in the next section, the loss function appearing in the expression of the regularized empirical risk is assumed to be of the max-dot-over-simplex type (mdos-type) defined below.
Definition 1.
A function Φ: R^m → R is said to be mdos-type if there exists a simplex B such that, for all y ∈ [m] and all s ∈ R^m,

(2.9) Φ(s; y) = max_{β∈B} ⟨β, 1_m − e_y + s − s_y 1_m⟩.

In the previous section, six loss functions were described: the max hinge loss, the structured hinge loss, the unweighted top-k hinge loss, the unweighted Usunier loss, the weighted top-k hinge loss, and the weighted Usunier loss. It can be shown that all six loss functions are mdos-type. The corresponding simplexes are presented in what follows.

• The simplex for the max hinge loss and the structured hinge loss is B_sh := Δ(1), where
(2.10) Δ(r) := {β ∈ R^m_+ | ‖β‖_1 ≤ r}.

• The simplex for the unweighted top-k hinge loss is B_utk := Δ_tk(k, 1), where
(2.11) Δ_tk(k, r) := {β ∈ Δ(r) | β ≤ (‖β‖_1 / k) 1_m}.

• The simplex for the unweighted Usunier loss (2.6) is B_uu := Δ_u(k, 1), where
(2.12) Δ_u(k, r) := {β ∈ Δ(r) | β ≤ (1/k) 1_m}.

• The simplex for the weighted top-k hinge loss (2.7) is
(2.13) B_wtk := {β ∈ R^m | ∃ζ ∈ R, ∀ℓ ∈ [L], ∃λ_ℓ ∈ Δ_tk(k_ℓ, ρ′_ℓ k_ℓ): ζ = ⟨1_m, λ_ℓ⟩ / (k_ℓ ρ′_ℓ), β = Σ_{ℓ=1}^L λ_ℓ}.

• The simplex for the weighted Usunier loss (2.8) is
(2.14) B_wu := {β ∈ R^m_+ | ∀ℓ ∈ [L], ∃λ_ℓ ∈ Δ_u(1/ρ′_ℓ, k_ℓ ρ′_ℓ): β ≤ Σ_{ℓ=1}^L λ_ℓ}.

Therein, the variables L and ρ′_1, ..., ρ′_L used in (2.13) and (2.14) are defined as follows. The variable L is the cardinality of the set

(2.15) K := {k ∈ [m] | ρ_k > ρ_{k+1}},

with the convention ρ_{m+1} := 0. Denote by k_1, ..., k_L the entries of K sorted as 1 ≤ k_1 < ··· < k_L < m. The remaining variables are the weight increments ρ′_ℓ := ρ_{k_ℓ} − ρ_{k_ℓ+1} for ℓ = 1, ..., L. The above results are summarized in the following theorem.

Theorem 2.1. Each of the four loss functions Φ_utk(·; y), Φ_uu(·; y), Φ_wtk(·; y), and Φ_wu(·; y) is mdos-type. The loss function Φ_{i,sh}(·; y) is mdos-type when Δ^{(i)}_j = 1 − δ_{y,j}, where δ_{·,·} is the Kronecker delta.

The details of the proof of Theorem 2.1 are given in Section A.
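The mdos representation (2.9) can be checked numerically in the simplest case. Since B_sh = Δ(1) = {β ≥ 0, ‖β‖_1 ≤ 1}, the maximum of the linear function ⟨β, c⟩ over B_sh is max(0, max_j c_j), which coincides with the max hinge loss when c = 1_m − e_y + s − s_y 1_m. The following NumPy fragment (ours, for illustration only) verifies this on random inputs.

```python
import numpy as np

def max_hinge_loss(s, y):
    c = s - s[y] + 1.0
    c[y] -= 1.0
    return c.max()

def mdos_max_hinge(s, y):
    # max over Delta(1) = {beta >= 0, ||beta||_1 <= 1} of <beta, c> is attained at a vertex,
    # namely beta = 0 or beta = e_j for the best j, giving max(0, max_j c_j).
    c = s - s[y] + 1.0
    c[y] -= 1.0
    return max(0.0, c.max())

rng = np.random.default_rng(1)
for _ in range(1000):
    s = rng.normal(size=6)
    y = int(rng.integers(6))
    assert np.isclose(max_hinge_loss(s, y), mdos_max_hinge(s, y))
print("the max hinge loss matches its mdos representation on random inputs")
```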
In this section, an optimization algorithm for learning MC-SVM is presented. Here, the loss function Φ appearing in the regularized empirical risk is assumed to be mdos-type. The optimization algorithm presented here is the Frank-Wolfe method maximizing the dual objective D(A). Each iteration of the Frank-Wolfe method consists of a direction finding step and a line search step. Denote by A^{(t)} = [α_1^{(t)}, ..., α_n^{(t)}] the dual variable at the t-th iteration. The direction finding step solves the following linear program:

(2.16) U^{(t−1)} ∈ argmax_{U ∈ dom(−D)} ⟨∇D(A^{(t−1)}), U⟩.

At the line search step, a solution maximizing D(A) over the line segment between the two points A^{(t−1)} and U^{(t−1)} is found:

(2.17) γ^{(t−1)} := argmax_{γ ∈ [0,1]} D((1 − γ) A^{(t−1)} + γ U^{(t−1)}).

Using the solutions to the two subproblems, U^{(t−1)} and γ^{(t−1)}, the dual variable is updated as

(2.18) A^{(t)} := (1 − γ^{(t−1)}) A^{(t−1)} + γ^{(t−1)} U^{(t−1)}.

So long as the two subproblems are solved exactly at each iteration, sublinear convergence is guaranteed. However, the algorithm would be impractical if each step could not be solved efficiently.

We first discuss how the direction finding step can be performed. Under the assumption that the loss function is mdos-type, the effective domain dom(−D) is a polyhedron. Hence, a general-purpose solver for linear programs could be used for the direction finding step, but resorting to a general-purpose solver at every iteration incurs a prohibitive computational cost. In this study, we establish the following theorem.

Theorem 2.2.
Consider applying the Frank-Wolfe algorithm to the problem of maximizing D(A) with respect to A. Assume that the loss function Φ(·; y) is mdos-type. Then, both the direction finding step and the line search step can be expressed in a closed form.

See Section B for the proof of Theorem 2.2. The concrete solution to the subproblem (2.16) for the direction finding step is given as follows. The i-th column of the matrix U^{(t−1)}, denoted u_i^{(t−1)}, is set to

(2.19) u_i^{(t−1)} ∈ −∂Φ(Ψ(x_i)^⊤ w(A^{(t−1)}); y_i),

where ∂Φ(·; y) is the subdifferential of Φ(·; y). An arbitrary subgradient can be taken even if the set ∂Φ(·; y) has multiple elements. The subproblem for the line search step (2.17) also has a closed-form solution, γ^{(t−1)} = max(0, min(1, γ̂^{(t−1)})), where

(2.20) γ̂^{(t−1)} := Σ_{i=1}^n ⟨Δα_i^{(t−1)}, e_{y_i} − Ψ(x_i)^⊤ w(A^{(t−1)})⟩ / (λn ‖w(ΔA^{(t−1)})‖²),

where Δα_i^{(t−1)} is the i-th column of the m × n matrix ΔA^{(t−1)} := U^{(t−1)} − A^{(t−1)}.

The Moreau envelope [1] is a trick often used for transforming a non-differentiable convex function into a smoothed function. For example, the Huber loss [7] and the smoothed hinge loss [16], widely used in machine learning, are, respectively, the Moreau envelopes of the absolute error and the hinge loss. Since it tends to take a shorter time to minimize a smooth objective function, the Moreau envelope is a useful technique for making machine learning efficient. The Moreau envelope of a convex loss function Φ(·; y): R^m → R is defined as

(2.21) Φ_m(s; y) := (Φ*(·; y) + (γ_sm/2)‖·‖²)*(s),

where γ_sm is a non-negative constant called the smoothing parameter. As long as Φ(·; y) is a convex function, its Moreau envelope Φ_m(·; y) is ensured to be (1/γ_sm)-smooth. To minimize the regularized empirical risk

(2.22) P_m(w) := (λ/2)‖w‖² + (1/n) Σ_{i=1}^n Φ_m(Ψ(x_i)^⊤ w; y_i),

we now consider applying the Frank-Wolfe method again to the maximization of the Fenchel dual

(2.23) D_m(A) := −(λ/2)‖w(A)‖² − (1/n) Σ_{i=1}^n Φ_m*(−α_i; y_i).

Notice that the Moreau envelope of the loss function Φ(·; y) is no longer mdos-type even if Φ(·; y) is mdos-type, implying that the optimization algorithm presented in the previous section cannot be applied directly to the Moreau envelope.

Figure 1: Convergence behaviors for minimizing the empirical risk without the Moreau envelope, for (a) Caltech101, (b) CUB200, (c) Flower102, (d) Indoor67, and (e) News20. The regularization parameter is set to λ = 10/n.

We obtained the following result:

Theorem 2.3.
Consider applying the Frank-Wolfe algorithm to the problem of maximizing D_m(A) with respect to A. Assume that the loss function Φ(·; y) is mdos-type. Then, both the direction finding step and the line search step can be expressed in a closed form even if γ_sm > 0.

See Section C for the proof of Theorem 2.3. The update rules of the direction finding step and the line search step are described below. Let

(2.24) s̃_i^{(t−1)} := Ψ(x_i)^⊤ w(A^{(t−1)}) + γ_sm α_i^{(t−1)}.

The update rule of the direction finding step (2.19) is replaced by

(2.25) u_i^{(t−1)} ∈ −∂Φ(s̃_i^{(t−1)}; y_i).

Note that Φ(·; y) in (2.25) is the mdos-type loss function, not its Moreau envelope. The expression of γ̂^{(t−1)} for the line search step γ^{(t−1)} = max(0, min(1, γ̂^{(t−1)})) is replaced by

(2.26) γ̂^{(t−1)} := Σ_{i=1}^n ⟨Δα_i^{(t−1)}, e_{y_i} − s̃_i^{(t−1)}⟩ / (λn ‖w(ΔA^{(t−1)})‖² + γ_sm ‖ΔA^{(t−1)}‖²_F).

Algorithm 1
Frank-Wolfe algorithm for minimizing a risk based on the Moreau envelope.

 1: A^{(0)} ∈ dom(−D);
 2: for t := 1 to T do
 3:   w^{(t−1)} := (1/(λn)) Σ_{i=1}^n Ψ(x_i) α_i^{(t−1)};
 4:   for i ∈ [n] do
 5:     s̃_i^{(t−1)} := Ψ(x_i)^⊤ w^{(t−1)} + γ_sm α_i^{(t−1)};
 6:     u_i^{(t−1)} ∈ −∂Φ(s̃_i^{(t−1)}; y_i);
 7:     Δα_i^{(t−1)} := u_i^{(t−1)} − α_i^{(t−1)};
 8:   end for
 9:   Δw^{(t−1)} := (1/(λn)) Σ_{i=1}^n Ψ(x_i) Δα_i^{(t−1)};
10:   γ̂^{(t−1)} := Σ_{i=1}^n ⟨Δα_i^{(t−1)}, e_{y_i} − s̃_i^{(t−1)}⟩ / (λn ‖Δw^{(t−1)}‖² + γ_sm ‖ΔA^{(t−1)}‖²_F);
11:   γ^{(t−1)} := max(0, min(1, γ̂^{(t−1)}));
12:   A^{(t)} := A^{(t−1)} + γ^{(t−1)} ΔA^{(t−1)};
13: end for

The detailed derivations are described in the proof of Theorem 2.3 in the Appendix. The procedure is summarized in Algorithm 1.
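To make Algorithm 1 concrete, the following NumPy sketch specializes it to the max hinge loss and the feature map ψ(x, j) = e_j ⊗ x, with the primal variable stored as an m-by-(input dimension) matrix W. This is an illustrative re-implementation under those assumptions, not the authors' code; the subgradient routine returns one element of −∂Φ_mh as required by (2.25).

```python
import numpy as np

def neg_subgrad_max_hinge(s, y):
    # One element of -dPhi_mh(s; y): a subgradient of the max hinge loss at s is
    # e_{j*} - e_y with j* = argmax_j (s_j - s_y + 1 - delta_{jy}); if the maximum
    # is attained at j = y (value 0), the zero vector is a valid subgradient.
    c = s - s[y] + 1.0
    c[y] -= 1.0
    j_star = int(np.argmax(c))
    u = np.zeros(len(s))
    if j_star != y:
        u[y], u[j_star] = 1.0, -1.0
    return u

def frank_wolfe_moreau(X, ys, m, lam, gamma_sm, T):
    n, _ = X.shape
    A = np.zeros((m, n))                          # A^{(0)} = 0 lies in dom(-D)
    E = np.eye(m)[ys].T                           # column i is e_{y_i}
    for _ in range(T):
        W = (A @ X) / (lam * n)                   # line 3: w(A) as an m x d-bar matrix
        S = W @ X.T + gamma_sm * A                # line 5: tilde s_i, see (2.24)
        U = np.column_stack([neg_subgrad_max_hinge(S[:, i], ys[i]) for i in range(n)])  # line 6
        dA = U - A                                # line 7
        dW = (dA @ X) / (lam * n)                 # line 9: w(Delta A)
        num = np.sum(dA * (E - S))                # numerator of (2.26)
        den = lam * n * np.sum(dW ** 2) + gamma_sm * np.sum(dA ** 2)
        gamma = 0.0 if den <= 0.0 else min(1.0, max(0.0, num / den))   # lines 10-11
        A = A + gamma * dA                        # line 12
    return (A @ X) / (lam * n), A

rng = np.random.default_rng(0)
X, ys = rng.normal(size=(60, 8)), rng.integers(0, 4, size=60)
W, A = frank_wolfe_moreau(X, ys, m=4, lam=0.1, gamma_sm=0.01, T=200)
print("trained score matrix shape:", W.shape)
```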
Figure 2: Convergence behaviors for minimizing the empirical risk with the Moreau envelope, for (a) Caltech101, (b) CUB200, (c) Flower102, (d) Indoor67, and (e) News20. The regularization parameter is set to λ = 10/n.

In summary, the two steps in each iteration of the Frank-Wolfe method are expressed in a closed form not only for an mdos-type loss function but also for its Moreau envelope.

We now analyze the time complexity of the Frank-Wolfe algorithm presented in Algorithm 1. The middle column of Table 3 shows the time complexity consumed in each line. Lines 5 and 7, respectively, require O(md) and O(m) operations for each i ∈ [n], so computing the n vectors s̃_i^{(t−1)} and Δα_i^{(t−1)} takes O(md) × n and O(m) × n time. Line 6 contains the computation of a subgradient of the loss function, so its time complexity depends on the definition of the loss function: the max hinge loss requires O(m) computation for Line 6, whereas the top-k hinge loss, the Usunier loss, and their weighted generalizations consume O(m log m) computation for this line.

Table 3: Time complexity of each line of Algorithm 1. The middle column is for a general feature extractor; the right column is for the case ψ(x, j) = e_j ⊗ x.

Line 3:   O(mnd)        O(nd)
Line 5:   O(md) × n     O(d) × n
Line 6:   depends on Φ  depends on Φ
Line 7:   O(m) × n      O(m) × n
Line 9:   O(mnd)        O(nd)
Line 10:  O(mn + d)     O(mn + d)
Line 12:  O(mn)         O(mn)

In the case that the feature extractor is ψ(x, j) = e_j ⊗ x with X := R^{d̄}, so that d = m d̄, the time complexity is improved compared to the general case. The update rule of the primal variable in Line 3 of Algorithm 1 is rewritten as

(2.27) w^{(t−1)} = (1/(λn)) Σ_{i=1}^n α_i^{(t−1)} ⊗ x_i.

Now we discuss how to update s̃_i^{(t−1)} in Line 5. To compute the j-th entry of the m-dimensional vector Ψ(x_i)^⊤ w^{(t−1)}, we extract the sub-vector (e_j^⊤ ⊗ I_{d̄}) w^{(t−1)} and take the inner product between it and the input vector x_i. Extracting the m sub-vectors and computing the m inner products together take O(d) computational cost. This discussion is summarized in the third column of Table 3.
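The saving for ψ(x, j) = e_j ⊗ x comes from never forming Ψ(x_i) explicitly: reshaping w into an m × d̄ matrix turns the score computation into a single matrix-vector product. The snippet below (illustrative NumPy code with our own function names) contrasts the naive O(md) construction of Ψ(x) with the reshaped O(d) computation and checks that they agree.

```python
import numpy as np

def scores_explicit(w, x, m):
    # Build Psi(x) = [e_1 kron x, ..., e_m kron x] (a d x m matrix, d = m * d_bar)
    # and multiply: O(m*d) time and memory.
    d_bar = len(x)
    Psi = np.zeros((m * d_bar, m))
    for j in range(m):
        Psi[j * d_bar:(j + 1) * d_bar, j] = x
    return Psi.T @ w

def scores_reshaped(w, x, m):
    # Reshape w into the m x d_bar matrix W and compute W @ x: O(d) time, as in Table 3.
    W = w.reshape(m, len(x))
    return W @ x

rng = np.random.default_rng(0)
m, d_bar = 5, 7
w, x = rng.normal(size=m * d_bar), rng.normal(size=d_bar)
assert np.allclose(scores_explicit(w, x, m), scores_reshaped(w, x, m))
print("explicit and reshaped score computations agree")
```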
In this section, we demonstrate the power of the proposed Frank-Wolfe algorithm in terms of the convergence speed and the pattern recognition performance.

To illustrate how rapidly the proposed optimization algorithm for empirical risk minimization converges, we used four image datasets, Caltech101 Silhouettes, CUB200, Flower102, and Indoor67, and a text dataset, News20, containing 15,935 texts. The images or texts in each dataset are classified into m = 102, 200, 102, 67, and 20 categories, respectively. The deep neural network VGG16 was used to extract an input vector x from each image for CUB200, Flower102, and Indoor67; 4,096 features were extracted from the deep network for each of these three image datasets. Features for Caltech101 were the vectorization of the pixel intensities of the silhouette images, giving 1,024 features. For the loss function, the weighted Usunier loss (2.8) was chosen, with weights ρ_j decreasing linearly in j and truncated at zero. The regularization parameter was set to λ = 1/n.

The proposed optimization algorithm was compared with two methods: PG and StdFW. The method PG is the projected gradient algorithm [15]. In each iteration, PG updates the primal variable in the descent direction and projects it onto a ball to bound the norm of the primal variable; sublinear convergence is ensured for Lipschitz continuous loss functions. The method StdFW is an alternative to the proposed Frank-Wolfe algorithm. In every iteration of the proposed algorithm, the direction finding step is followed by the line search step, which finds the optimal ratio γ_t for mixing the previous point with the new point computed at the direction finding step. Theoretically, sublinear convergence is guaranteed even if the ratio is pre-scheduled as γ_t = 2/(t + 1). StdFW denotes the Frank-Wolfe method using the pre-scheduled γ_t, while the proposed Frank-Wolfe method is referred to as the line search Frank-Wolfe, abbreviated LSFW.

Table 1: Pattern recognition performance.

(a) Caltech101
         Top-1          Top-3          Top-5          Top-10
PG       0.469 (0.008)  0.307 (0.005)  0.257 (0.005)  0.183 (0.006)
StdFW    0.481 (0.022)  0.317 (0.032)  0.266 (0.032)  0.192 (0.035)
LSFW     (0.048)        (0.010)        (0.005)        (0.005)

(b) CUB200
         Top-1          Top-3          Top-5          Top-10
PG       0.431 (0.008)  0.247 (0.007)  0.184 (0.004)  0.114 (0.003)
StdFW    0.443 (0.022)  0.257 (0.026)  0.196 (0.026)  0.126 (0.026)
LSFW     (0.006)        (0.007)        (0.006)        (0.004)

(c) Flower102
         Top-1          Top-3          Top-5          Top-10
PG       0.219 (0.009)  0.107 (0.010)  0.075 (0.008)  0.041 (0.004)
StdFW    0.220 (0.014)  0.108 (0.013)  0.076 (0.011)  0.041 (0.009)
LSFW     (0.010)        (0.011)        (0.010)        (0.004)

(d) Indoor67
         Top-1          Top-3          Top-5          Top-10
PG       0.319 (0.005)  0.139 (0.003)  0.092 (0.003)  0.051 (0.002)
StdFW    0.323 (0.016)  0.141 (0.009)  0.098 (0.007)  0.058 (0.008)
LSFW     (0.004)        (0.004)        (0.003)        (0.002)

(e) News20
         Top-1          Top-3          Top-5          Top-10
PG       0.348 (0.003)  0.142 (0.003)  0.083 (0.003)  0.031 (0.002)
StdFW    0.348 (0.003)  0.143 (0.004)  0.084 (0.005)  0.052 (0.027)
LSFW     (0.004)        (0.002)        (0.005)        (0.002)

Table 2: Computational times for one iteration.

             PG         StdFW      LSFW
Caltech101   0.453 sec  0.589 sec  0.814 sec
CUB200       2.570 sec  2.458 sec  3.324 sec
Flower102    0.661 sec  0.649 sec  0.903 sec
Indoor67     3.040 sec  2.855 sec  3.919 sec
News20       0.548 sec  0.436 sec  0.531 sec

Figure 1 has five panels, one for each of the five datasets. Each panel contains two sub-panels: the upper and lower sub-panels show, respectively, the objective error and the duality gap against the number of iterations, where the objective error and the duality gap at the t-th iteration are defined as P(w^{(t)}) − P(w⋆) and P(w^{(t)}) − D(A^{(t)}), respectively, and w^{(t)} and A^{(t)} are the values of the primal and dual variables at the t-th iteration. For StdFW and
LSFW, the value of the primal variable is recovered by w^{(t)} := w(A^{(t)}) for t ∈ N. Since it is impossible to know the exact value of the optimal solution w⋆, the value of w⋆ was approximated by w⋆ ≈ w(A^{(t′)}) in these experiments, where A^{(t′)} was obtained by iterating the Frank-Wolfe method until the duality gap P(w^{(t)}) − D(A^{(t)}) fell below a small threshold. From Figure 1, it can be observed that LSFW converges much faster than PG. For CUB200, Flower102, and Indoor67, the convergence of LSFW was much faster than those of the other two methods, whereas the convergence speeds of the three methods were similar for Caltech101 and News20. A property of the three datasets CUB200, Flower102, and Indoor67 differs from that of the two datasets Caltech101 and News20: the dimensionality of the feature vectors. Feature vectors in CUB200, Flower102, and Indoor67 have a higher dimension than those in Caltech101 and News20, and high-dimensional features tend to make the dual objective function more strongly concave. The authors conjecture that this difference in dimensionality yields the differences in the convergence behaviors.

We then applied the Moreau envelope to the loss function with γ_sm = 0.01.
The convergence behaviors changed as shown in Figure 2. It can be shown that the negative dual function is strongly convex with coefficient γ_sm/n whatever training data are given. Indeed, the Moreau envelope made the convergence of LSFW faster for Caltech101 and News20, although the convergence for the other three datasets was not accelerated. An explanation of this phenomenon may be that the negative dual objectives for CUB200, Flower102, and Indoor67 are already strongly convex even without the Moreau envelope.

Table 2 shows the computational time for one iteration of each of the three optimization algorithms. The running times of PG and StdFW are similar. Compared to StdFW, LSFW takes more computation to perform the line search, but the per-iteration time of LSFW does not exceed 1.5 times that of StdFW. By combining the running time per iteration with the objective errors against the number of iterations, it can be concluded that LSFW achieves accurate solutions within much smaller computational time.

We also examined the pattern recognition performance on the five datasets used for the convergence experiments. For each dataset, 50% of the data were randomly picked, and each of the three optimization algorithms was applied to the picked data to train a multi-category classifier. The rest of the data were used for testing the generalization performance. Each optimization algorithm was run for 1,000 iterations. Cross-validation was performed to determine the value of the regularization constant λ. Top-1, top-3, top-5, and top-10 error ratios were used for assessing the generalization performance; a lower value indicates better performance. The above procedure was performed 20 times, and the averages and standard deviations of the performance measures across the 20 trials are reported in Table 1. The bold-faced figures indicate the best performance, and the underlined figures have no significant difference from the best performance, where significance is based on a one-sample t-test. For all datasets, LSFW achieved the smallest top-k error, and most of the error ratios of LSFW were significantly smaller than those of StdFW and PG. This might be because LSFW produces sufficiently accurate solutions within 1,000 training iterations whereas the other two methods do not.

In this paper, we presented a new Frank-Wolfe algorithm that can be applied to mdos-type learning machines. The mdos type is a class newly introduced in this study to analyze, in a unified fashion, a wide variety of loss functions originating from the max hinge loss. The sublinear convergence of the Frank-Wolfe algorithm is ensured if both the direction finding step and the line search step are implemented exactly. We discovered that, if the Frank-Wolfe method is applied to the Fenchel dual of the regularized empirical risk, closed-form solutions exist for both steps. Since low-dimensional feature vectors often slow down minimization algorithms, including the Frank-Wolfe method, the loss function is often replaced with its Moreau envelope. However, the replaced loss function is no longer mdos-type, meaning that the proposed Frank-Wolfe algorithm cannot be applied directly. Nevertheless, we found a technique for reusing the proposed Frank-Wolfe algorithm for the Moreau envelope of the loss function.
We carried out experiments to empirically show that our algorithm converges faster and achieves better pattern recognition performance than the existing methods.
References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[2] Dejun Chu, Rui Lu, Jin Li, Xintong Yu, Changshui Zhang, and Qing Tao. Optimizing top-k multiclass SVM via semismooth Newton algorithm. IEEE Transactions on Neural Networks and Learning Systems, 29(12):6264-6275, December 2018.
[3] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res., 2:265-292, March 2002.
[4] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, September 2010.
[5] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, March 1956. doi:10.1002/nav.3800030109.
[6] Sam Hare, Stuart Golodetz, Amir Saffari, Vibhav Vineet, Ming-Ming Cheng, Stephen L. Hicks, and Philip H. S. Torr. Struck: Structured output tracking with kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2096-2109, October 2016.
[7] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73-101, March 1964.
[8] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 427-435, Atlanta, Georgia, USA, 17-19 Jun 2013. PMLR.
[9] Thorsten Joachims. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005). ACM Press, 2005.
[10] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, May 2009.
[11] Tsuyoshi Kato and Yoshihiro Hirohashi. Learning weighted top-k support vector machine. In Wee Sun Lee and Taiji Suzuki, editors, Proceedings of The Eleventh Asian Conference on Machine Learning, volume 101 of Proceedings of Machine Learning Research, pages 774-789, Nagoya, Japan, 17-19 Nov 2019. PMLR.
[12] Maksim Lapin, Matthias Hein, and Bernt Schiele. Top-k multiclass SVM. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 325-333, Cambridge, MA, USA, 2015. MIT Press.
[13] Maksim Lapin, Matthias Hein, and Bernt Schiele. Loss functions for top-k error: Analysis and insights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, June 2016.
[14] Jason Rennie and Nathan Srebro. Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI Multidisciplinary Workshop on Advances in Preference Handling, 2005.
[15] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127(1):3-30, 2011.
[16] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14(1):567-599, February 2013.
[17] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
[18] Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009). ACM Press, 2009.
[19] Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML 2004). ACM Press, 2004.
A Proof for Theorem 2.1
Proposition 2 and Proposition 5 in [12], respectively, show that the two loss functions Φ_utk(·; y) and Φ_uu(·; y) are mdos-type. The proof for Φ_sh(·; y) is straightforward because the expression of Φ_sh(·; y) is similar to those of Φ_uu(·; y) and Φ_utk(·; y) with k = 1. Kato and Hirohashi have already shown in (15) of [11] that Φ_wtk(·; y) is mdos-type. In what follows, we show that Φ_wu(·; y) is mdos-type, for which it suffices to prove that, for every x ∈ R^m,

(A.1) Σ_{ℓ=1}^{|K|} ρ′_ℓ Σ_{j=1}^{k_ℓ} max{0, x_{π(j)}} = max_{β∈B_wu} ⟨β, x⟩.

Indeed, since ρ_j = Σ_{ℓ: k_ℓ ≥ j} ρ′_ℓ, the left-hand side of (A.1) equals Σ_{j=1}^m max{0, x_{π(j)}} ρ_j, which is Φ_wu(s; y) when x := 1_m − e_y + s − s_y 1_m. The left-hand side can be rearranged as

(A.2) LHS of (A.1)
= min{ Σ_{j=1}^m h_{π(j)} ρ_j | h ≥ 0_m, h ≥ x }
= min{ Σ_{ℓ=1}^L ⟨1_{k_ℓ}, h_{π(1:k_ℓ)}⟩ ρ′_ℓ | h ≥ 0_m, h ≥ x }
= min{ Σ_{ℓ=1}^L (k_ℓ t_ℓ + ⟨1_m, max(0_m, h − t_ℓ 1_m)⟩) ρ′_ℓ | h ≥ 0_m, h ≥ x, t ≥ 0_L }
= min{ Σ_{ℓ=1}^L (k_ℓ t_ℓ + ⟨1_m, q_ℓ⟩) ρ′_ℓ | h ≥ 0_m, h ≥ x, t ≥ 0_L, ∀ℓ ∈ [L]: q_ℓ ≥ 0_m, q_ℓ ≥ h − t_ℓ 1_m }.

To find an analytical solution to the above minimization problem, a non-negative Lagrangian multiplier vector is introduced for each constraint: η ∈ R^m_+ for the constraint h ≥ 0_m; β ∈ R^m_+ for the constraint h ≥ x; for each ℓ ∈ [L], μ_ℓ ∈ R^m_+ for the constraint q_ℓ ≥ 0_m and λ_ℓ ∈ R^m_+ for the constraint q_ℓ ≥ h − t_ℓ 1_m; and τ ∈ R^L_+ for the constraint t ≥ 0_L. Let

(A.3) Q := [q_1, ..., q_L], M := [μ_1, ..., μ_L], and Λ := [λ_1, ..., λ_L].

The Lagrangian function is expressed as

(A.4) L_wu(h, t, Q, η, β, M, Λ, τ)
:= ⟨k ⊙ t + Q^⊤ 1_m, ρ′⟩ − ⟨η, h⟩ + ⟨β, x − h⟩ − ⟨M, Q⟩ + ⟨Λ, h 1_L^⊤ − 1_m t^⊤ − Q⟩ − ⟨τ, t⟩
= ⟨t, ρ′ ⊙ k − Λ^⊤ 1_m − τ⟩ + ⟨Q, 1_m ρ′^⊤ − Λ − M⟩ + ⟨h, Λ 1_L − β − η⟩ + ⟨x, β⟩.

The KKT conditions lead to

(A.5) ρ′ ⊙ k − Λ^⊤ 1_m − τ = 0_L, 1_m ρ′^⊤ − Λ − M = O, and Λ 1_L − β − η = 0_m.

Eliminating τ, M, and η, the above conditions can be rewritten as

(A.6) ρ′ ⊙ k ≥ Λ^⊤ 1_m, 1_m ρ′^⊤ ≥ Λ, and Λ 1_L ≥ β.

Hence, we conclude that

(A.7) LHS of (A.1) = min_{h, t, Q} max_{η, β, M, Λ, τ} L_wu(h, t, Q, η, β, M, Λ, τ)
= max{ ⟨x, β⟩ | ρ′ ⊙ k ≥ Λ^⊤ 1_m, 1_m ρ′^⊤ ≥ Λ, Λ 1_L ≥ β }
= RHS of (A.1). q.e.d.

B Proof for Theorem 2.2
We first show that the direction finding step can be expressed in the closed form (2.19), and then show that the solution to the line search step is γ^{(t−1)} = max(0, min(1, γ̂^{(t−1)})).

B.1 Proof for Direction Finding Step
We use Lemma 1 of [11] to rearrange the dual objective function as

(B.8) D(A) = −(λ/2)‖w(A)‖² + (1/n) Σ_{i=1}^n ⟨e_{y_i}, α_i⟩.

The derivative with respect to α_i is

(B.9) ∂D(A)/∂α_i = (1/n)(e_{y_i} − Ψ(x_i)^⊤ w(A)).

The effective domain of −D is the product of the feasible regions of the columns:

(B.10) dom(−D) = −Π_{i=1}^n dom Φ*(·; y_i).

This fact allows us to decompose the mn-variable linear program into n smaller problems:

(B.11) u_i^{(t−1)} ∈ argmin_{u_i ∈ −dom Φ*(·; y_i)} ⟨Ψ(x_i)^⊤ w(A^{(t−1)}) − e_{y_i}, u_i⟩.

Applying Lemma 3 of [11], an optimal solution to each of the n linear programs can be expressed as (2.19). q.e.d.

B.2 Proof for Line Search Step
Let w^{(t−1)} := w(A^{(t−1)}). We shall use

(B.12) ‖w(γΔA)‖² = ‖w(ΔA)‖² γ²

and

(B.13) ⟨w^{(t−1)}, w(ΔA)⟩ = (1/(λn)) ⟨w^{(t−1)}, Σ_{j=1}^n Ψ(x_j) Δα_j⟩ = (1/(λn)) Σ_{j=1}^n ⟨Ψ(x_j)^⊤ w^{(t−1)}, Δα_j⟩

to obtain

(B.14) (1/λ)(D(A + γΔA) − D(A))
= −(1/2)‖w^{(t−1)} + w(γΔA)‖² + (1/2)‖w^{(t−1)}‖² + (1/(λn)) Σ_{j=1}^n ⟨e_{y_j}, α_j + γΔα_j⟩ − (1/(λn)) Σ_{j=1}^n ⟨e_{y_j}, α_j⟩
= −(1/2)‖w(γΔA)‖² − ⟨w^{(t−1)}, w(ΔA)⟩ γ + (1/(λn)) Σ_{j=1}^n ⟨e_{y_j}, Δα_j⟩ γ
= −(1/2)‖w(ΔA)‖² γ² + (1/(λn)) Σ_{j=1}^n ⟨e_{y_j} − Ψ(x_j)^⊤ w^{(t−1)}, Δα_j⟩ γ.

This concludes that the optimal solution for the line search over γ ∈ [0, 1] is γ^{(t−1)} = max(0, min(1, γ̂^{(t−1)})).

B.3 Derivation of (B.9)

The derivative of the second term in (B.8) with respect to α_i is

(B.15) ∂/∂α_i Σ_{j=1}^n ⟨e_{y_j}, α_j⟩ = e_{y_i}.

We now calculate the derivative of the first term in (B.8). Let

(B.16) w̄ := (1/(λn)) Σ_{j≠i} Ψ(x_j) α_j

to obtain

(B.17) ‖w(A)‖² = ‖w̄ + (1/(λn)) Ψ(x_i) α_i‖² = ‖w̄‖² + (2/(λn)) ⟨Ψ(x_i)^⊤ w̄, α_i⟩ + (1/(λn))² ‖Ψ(x_i) α_i‖².

The derivative is

(B.18) ∂/∂α_i ‖w(A)‖² = (2/(λn)) Ψ(x_i)^⊤ w̄ + (2/(λn)²) Ψ(x_i)^⊤ Ψ(x_i) α_i = (2/(λn)) Ψ(x_i)^⊤ (w̄ + (1/(λn)) Ψ(x_i) α_i) = (2/(λn)) Ψ(x_i)^⊤ w(A).

Combining (B.15) with (B.18), we obtain (B.9).
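The closed form (2.20) derived above can also be sanity-checked numerically: for the feature map ψ(x, j) = e_j ⊗ x, the quantity in (B.14) is an explicit concave quadratic in γ, and its unconstrained maximizer should coincide with γ̂^{(t−1)}. The following NumPy fragment is an illustrative check we added (not part of the proof); it compares the closed-form value against a fine grid for random A and ΔA.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_bar, n, lam = 4, 6, 30, 0.5
X = rng.normal(size=(n, d_bar))
ys = rng.integers(0, m, size=n)
A, dA = rng.normal(size=(m, n)), rng.normal(size=(m, n))
E = np.eye(m)[ys].T                               # column i is e_{y_i}

def w_mat(B):                                     # w(B) as the m x d_bar matrix (1/(lam*n)) B X
    return (B @ X) / (lam * n)

def D_quadratic(gamma):                           # D(A + gamma*dA) in the rearranged form (B.8)
    B = A + gamma * dA
    W = w_mat(B)
    return -0.5 * lam * np.sum(W ** 2) + np.sum(E * B) / n

S = w_mat(A) @ X.T                                # column i is Psi(x_i)^T w(A)
gamma_hat = np.sum(dA * (E - S)) / (lam * n * np.sum(w_mat(dA) ** 2))   # closed form (2.20)

grid = np.linspace(-1.0, 2.0, 601)
assert D_quadratic(gamma_hat) >= max(D_quadratic(g) for g in grid) - 1e-9
print("closed-form line search matches the grid maximizer:", round(gamma_hat, 4))
```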
C Proof for Theorem 2.3
Let us consider the (d + mn) × m feature matrix defined for each training example (x_i, y_i) as

(C.19) Ψ̃_i := [ Ψ(x_i) ; √(λnγ_sm) (e_i ⊗ I_m) ],

where e_i is here an n-dimensional unit vector whose i-th entry is one. From the n feature matrices, we pose the regularized empirical risk P̃: R^{d+mn} → R defined as

(C.20) P̃(w̃) := (λ/2)‖w̃‖² + (1/n) Σ_{i=1}^n Φ(Ψ̃_i^⊤ w̃; y_i),

where Φ(·; y) is the mdos-type loss function. To show Theorem 2.3, we shall use the following lemma:

Lemma C.1.
The function D_m is a Fenchel dual to P̃.

The proof of Lemma C.1 is given in Subsection C.3. The proof of Theorem 2.3 is completed by deriving (2.25) and (2.26); the derivations are given in Subsection C.1 and Subsection C.2, respectively.
C.1 Derivation of (2.25)

Define w̃: R^{m×n} → R^{d+mn} as

(C.21) w̃(A) := (1/(λn)) Σ_{i=1}^n Ψ̃_i α_i.

Observe that

(C.22) w̃(A) = [ (1/(λn)) Σ_{i=1}^n Ψ(x_i) α_i ; √(λnγ_sm) (1/(λn)) Σ_{i=1}^n (e_i ⊗ I_m) α_i ] = [ w(A) ; √(γ_sm/(λn)) vec(A) ],

where vec(A) is the vectorization of the m × n matrix A. From Lemma C.1, the direction finding step of the Frank-Wolfe algorithm for maximizing D_m can be written as

(C.23) u_i^{(t−1)} ∈ −∂Φ(Ψ̃_i^⊤ w̃(A^{(t−1)}); y_i).

The prediction score can be rearranged as

(C.24) Ψ̃_i^⊤ w̃(A^{(t−1)}) = Ψ(x_i)^⊤ w(A^{(t−1)}) + √(λnγ_sm) √(γ_sm/(λn)) (e_i^⊤ ⊗ I_m) vec(A^{(t−1)}) = Ψ(x_i)^⊤ w(A^{(t−1)}) + γ_sm α_i^{(t−1)} = s̃_i^{(t−1)}.

Combining (C.23) with (C.24), the direction finding step is obtained as (2.25).
C.2 Derivation of (2.26)

From the discussion in Section 2.3, the line search step computes γ^{(t−1)} by clipping γ̂^{(t−1)} to the interval [0, 1], where

(C.25) γ̂^{(t−1)} = Σ_{i=1}^n ⟨Δα_i^{(t−1)}, e_{y_i} − Ψ̃_i^⊤ w̃(A^{(t−1)})⟩ / (λn ‖w̃(ΔA^{(t−1)})‖²).

The squared norm of w̃(ΔA^{(t−1)}) is given by

(C.26) ‖w̃(ΔA^{(t−1)})‖² = ‖w(ΔA^{(t−1)})‖² + (γ_sm/(λn)) ‖ΔA^{(t−1)}‖²_F.

Equation (2.26) is established by substituting (C.24) and (C.26) into (C.25).
C.3 Proof for Lemma C.1
Apparently, the function

(C.27) D̃(A) := −(λ/2)‖w̃(A)‖² − (1/n) Σ_{i=1}^n Φ*(−α_i; y_i)

is a Fenchel dual to P̃. The first term on the right-hand side of the above equation can be rewritten using

(C.28) ‖w̃(A)‖² = ‖w(A)‖² + (γ_sm/(λn)) ‖A‖²_F,

which allows us to rearrange D̃(A) as

(C.29) D̃(A) = −(λ/2)‖w(A)‖² − (γ_sm/(2n))‖A‖²_F − (1/n) Σ_{i=1}^n Φ*(−α_i; y_i)
= −(λ/2)‖w(A)‖² − (1/n) Σ_{i=1}^n (Φ*(−α_i; y_i) + (γ_sm/2)‖α_i‖²)
= −(λ/2)‖w(A)‖² − (1/n) Σ_{i=1}^n Φ_m*(−α_i; y_i) = D_m(A).

Hence, D_m has been proved to be a Fenchel dual to P̃. q.e.d.