Fast Training of Effective Multi-class Boosting Using Coordinate Descent Optimization
Guosheng Lin, Chunhua Shen, Anton van den Hengel, David Suter
The University of Adelaide, Australia
Abstract.
We present a novel column generation based boosting method for multi-class classification. Our multi-class boosting is formulated in a single optimization problem as in [1, 2]. Different from most existing multi-class boosting methods, which use the same set of weak learners for all the classes, we train class-specified weak learners (i.e., each class has a different set of weak learners). We show that using separate weak learner sets for each class leads to fast convergence, without introducing additional computational overhead in the training procedure. To further make the training more efficient and scalable, we also propose a fast coordinate descent method for solving the optimization problem at each boosting iteration. The proposed coordinate descent method is conceptually simple and easy to implement in that it uses a closed-form solution for each coordinate update. Experimental results on a variety of datasets show that, compared to a range of existing multi-class boosting methods, the proposed method has a much faster convergence rate and better generalization performance in most cases. We also empirically show that the proposed fast coordinate descent algorithm needs less training time than the MultiBoost algorithm of Shen and Hao [1].
Boosting methods combine a set of weak classifiers (weak learners) to form a strong classifier. Boosting has been extensively studied [3, 4] and applied to a wide range of applications due to its robustness and efficiency (e.g., real-time object detection [5-7]). Despite the fact that most classification tasks are inherently multi-class problems, the majority of boosting algorithms are designed for binary classification. A popular approach to multi-class boosting is to split the multi-class problem into a set of binary classification problems. A simple example is the one-vs-all approach. The well-known error correcting output coding (ECOC) methods [8] belong to this category. AdaBoost.ECC [9], AdaBoost.MH [10] and AdaBoost.MO [10] can all be viewed as examples of the ECOC approach. The second approach is to directly formulate multi-class classification as a single learning task, based on pairwise model comparisons between different classes.
Code can be downloaded at http://goo.gl/WluhrQ. This paper was published in Proc. 11th Asian Conference on Computer Vision, Korea, 2012.
Corresponding author: [email protected]

Shen and Hao's direct formulation for multi-class boosting (referred to as MultiBoost) is such an example [1]. From the perspective of optimization, MultiBoost can be seen as an extension of the binary column generation boosting framework [11, 4] to the multi-class case. Our work here builds upon MultiBoost. As in most existing multi-class boosting methods, in MultiBoost [1] different classes share the same set of weak learners, which leads to a sparse solution of the model parameters and hence slow convergence. To solve this problem, in this work we propose a novel formulation (referred to as MultiBoost$^{cw}$) for multi-class boosting that uses separate weak learner sets; namely, each class uses its own weak learner set. Compared to MultiBoost, MultiBoost$^{cw}$ converges much faster, generally has better generalization performance and does not introduce additional training cost. Note that AdaBoost.MO proposed in [10] also uses a different set of weak classifiers for each class. AdaBoost.MO is based on ECOC and its code matrix is specified before learning; therefore, unlike AdaBoost.ECC, the dependence between the fixed code matrix and the generated binary classifiers is not explicitly taken into consideration. In contrast, our MultiBoost$^{cw}$ is based on the direct formulation of multi-class boosting, which leads to fundamentally different optimization strategies. More importantly, as shown in our experiments, our MultiBoost$^{cw}$ is much more scalable than AdaBoost.MO, although both enjoy faster convergence than most other multi-class boosting methods.

In MultiBoost [1], sophisticated optimization tools such as Mosek or LBFGS-B [12] are needed to solve the resulting optimization problem at each boosting iteration, which is not very scalable. Here we propose a fast coordinate descent algorithm (FCD) for optimizing the resulting problem at each boosting iteration of MultiBoost$^{cw}$. Coordinate descent (CD) methods choose one variable at a time and efficiently solve the resulting single-variable sub-problem. CD has been applied to many large-scale optimization problems. For example, Yuan et al. [13] made comprehensive empirical comparisons of $\ell_1$-regularized classification algorithms and concluded that CD methods are very competitive for solving large-scale problems. In the formulation of MultiBoost (and also in our MultiBoost$^{cw}$), the number of variables is the product of the number of classes and the number of weak learners, which can be very large (especially when the number of classes is large). Therefore CD methods may be a better choice for fast optimization of multi-class boosting. Our method FCD is specially tailored to the optimization of MultiBoost$^{cw}$: we obtain a closed-form solution for each variable update, so the optimization can be extremely fast. The proposed FCD is easy to implement and no sophisticated optimization toolbox is required.

Main Contributions
1) We propose a novel multi-class boosting method (MultiBoost$^{cw}$) that uses class-specified weak learners. Unlike MultiBoost, which shares a single set of weak learners across different classes, our method uses a separate set of weak learners for each class. We generate $K$ (the number of classes) weak learners in each boosting iteration, one weak learner for each class. With this mechanism, we are able to achieve much faster convergence. 2) Similar to MultiBoost [1], we employ column generation to implement the boosting training. We derive the Lagrange dual problem of the new multi-class boosting formulation, which enables us to design fully corrective multi-class algorithms using the primal-dual optimization technique. 3) We propose a fast coordinate descent (FCD) method for fast training of MultiBoost$^{cw}$. We obtain an analytical solution for each variable update in coordinate descent. We use the Karush-Kuhn-Tucker (KKT) conditions to derive effective stopping criteria and to construct working sets of violated variables for faster optimization. We show that FCD covers both fully corrective optimization (updating all variables) in multi-class boosting and fast stage-wise optimization as in standard AdaBoost (updating the newly added variables only).

Notation
Let us assume that we have $K$ classes. A weak learner is a function that maps an example $x$ to $\{-1, +1\}$. We denote each weak learner by $\hbar$: $\hbar_{y,j}(\cdot) \in \mathcal{F}$ ($y = 1, \dots, K$ and $j = 1, \dots, n$). $\mathcal{F}$ is the space of all the weak learners; $n$ is the number of weak learners. We define the column vector $h_y(x) = [\hbar_{y,1}(x), \dots, \hbar_{y,n}(x)]^\top$ as the outputs of the weak learners associated with the $y$-th class on example $x$. We denote by $w_y$ the weak learners' coefficients for class $y$. Then the strong classifier for class $y$ is $F_y(x) = w_y^\top h_y(x)$. We need to learn $K$ strong classifiers, one for each class. Given a test example $x$, the classification rule is $y^\star = \operatorname{argmax}_y F_y(x)$. $\mathbf{1}$ is a vector with all elements being one; its dimension should be clear from the context.

We show how to formulate the multi-class boosting problem in the large-margin learning framework. Analogous to MultiBoost, we define the multi-class margins associated with a training example $(x_i, y_i)$ as
\[
\gamma_{(i,y)} = w_{y_i}^\top h_{y_i}(x_i) - w_y^\top h_y(x_i), \tag{1}
\]
for $y \neq y_i$. Intuitively, $\gamma_{(i,y)}$ is the difference between the classification score of the correct model and that of a "wrong" model. We want to make this margin as large as possible. MultiBoost$^{cw}$ with the exponential loss can be formulated as
\[
\min_{w \geq 0,\, \gamma} \;\; \|w\|_1 + \frac{C}{p} \sum_i \sum_{y \neq y_i} \exp(-\gamma_{(i,y)}), \quad \forall i = 1, \dots, m; \;\; \forall y \in \{1, \dots, K\} \setminus \{y_i\}. \tag{2}
\]
Here $\gamma$ is defined in (1). We have also introduced the shorthand symbol $p = m \times (K - 1)$. $C$ controls the complexity of the learned model. The model parameter is $w = [w_1; w_2; \dots; w_K] \in \mathbb{R}^{Kn \times 1}$.

Minimizing (2) encourages the confidence score of the correct label $y_i$ of a training example $x_i$ to be larger than the confidence scores of the other labels. We define $\mathcal{Y}$ as the set of $K$ labels: $\mathcal{Y} = \{1, 2, \dots, K\}$. The discriminant function $F: \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$ that we need to learn is $F(x, y; w) = w_y^\top h_y(x) = \sum_j w_{(y,j)} \hbar_{(y,j)}(x)$. The class label prediction $y^\star$ for an unknown example $x$ is obtained by maximizing $F(x, y; w)$ over $y$, that is, by finding the class label with the largest confidence: $y^\star = \operatorname{argmax}_y F(x, y; w) = \operatorname{argmax}_y w_y^\top h_y(x)$.
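As a concrete illustration of the formulation above, the following minimal Python sketch evaluates the discriminant scores, the margins in (1), the regularized exponential-loss objective in (2), and the prediction rule. The array shapes, the function names and the dense matrix of weak-learner outputs are our own illustrative assumptions, not part of the original method.

import numpy as np

def multiclass_objective(W, H, labels, C):
    # Illustrative sketch, not the paper's implementation.
    # W: (K, n) non-negative coefficient rows w_y; H: (K, n, m) weak-learner
    # outputs hbar_{y,j}(x_i) in {-1, +1}; labels: (m,) true classes in {0..K-1}.
    K, n, m = H.shape
    p = m * (K - 1)
    scores = np.einsum('kn,knm->km', W, H)            # F_y(x_i) = w_y^T h_y(x_i)
    loss = 0.0
    for i in range(m):
        yi = labels[i]
        for y in range(K):
            if y == yi:
                continue
            margin = scores[yi, i] - scores[y, i]     # gamma_{(i,y)} as in (1)
            loss += np.exp(-margin)
    return W.sum() + (C / p) * loss                   # ||w||_1 + (C/p) sum exp(-gamma), eq. (2)

def predict(W, H):
    # classification rule: y* = argmax_y w_y^T h_y(x)
    return np.argmax(np.einsum('kn,knm->km', W, H), axis=0)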
Algorithm 1
CG: Column generation for MultiBoost$^{cw}$
1: Input: training examples $(x_1; y_1), (x_2; y_2), \dots$; regularization parameter $C$; termination threshold and the maximum iteration number.
2: Initialize: working weak learner sets $\mathcal{H}_c = \emptyset$ ($c = 1, \dots, K$); initialize $\forall (i, y \neq y_i)$: $\lambda_{(i,y)} = 1$ ($i = 1, \dots, m$, $y = 1, \dots, K$).
3: Repeat
4: $-$ Solve (4) to find $K$ weak learners $\hbar^\star_c(\cdot)$, $c = 1, \dots, K$, and add them to the working weak learner sets $\mathcal{H}_c$.
5: $-$ Solve the primal problem (2) on the current working weak learner sets, $\hbar_c \in \mathcal{H}_c$, $c = 1, \dots, K$, to obtain $w$ (we use the coordinate descent method of Algorithm 2).
6: $-$ Update the dual variables $\lambda$ in (5) using the primal solution $w$ and the KKT conditions (5).
7: Until the relative change of the primal objective function value is smaller than the prescribed tolerance, or the maximum number of iterations is reached.
8: Output: $K$ discriminant functions $F(x, y; w) = w_y^\top h_y(x)$, $y = 1, \dots, K$.

MultiBoost$^{cw}$ is an extension of MultiBoost [1] for multi-class classification. The only difference is that in MultiBoost different classes share the same set of weak learners $h$, whereas in MultiBoost$^{cw}$ each class is associated with a separate set of weak learners. We show that MultiBoost$^{cw}$ learns a more compact model than MultiBoost.

Column generation for MultiBoost$^{cw}$  To implement boosting, we need to derive the dual problem of (2). Similar to [1], the dual problem of (2) can be written as (3), in which $c$ is the index of class labels and $\lambda_{(i,y)}$ is the dual variable associated with one constraint in (2):
\[
\max_{\lambda} \;\; \sum_i \sum_{y \neq y_i} \lambda_{(i,y)} \big[ 1 - \log \tfrac{p}{C} - \log \lambda_{(i,y)} \big] \tag{3a}
\]
\[
\text{s.t.} \;\; \forall c = 1, \dots, K: \;\; \sum_{i\, (y_i = c)} \sum_{y \neq y_i} \lambda_{(i,y)}\, h_{y_i}(x_i) - \sum_i \sum_{y \neq y_i,\, y = c} \lambda_{(i,y)}\, h_y(x_i) \leq \mathbf{1}, \tag{3b}
\]
\[
\forall i = 1, \dots, m: \;\; 0 \leq \sum_{y \neq y_i} \lambda_{(i,y)} \leq \tfrac{C}{p}. \tag{3c}
\]
Following the idea of column generation [4], we divide the original problem (2) into a master problem and a sub-problem, and solve them alternately. The master problem is a restricted version of (2) that only considers the generated weak learners. The sub-problem generates $K$ weak learners (corresponding to the $K$ classes) by finding the most violated constraint of each class in the dual form (3), and adds them to the master problem at each iteration. The sub-problem for finding the most violated constraints can be written as:
\[
\forall c = 1, \dots, K: \;\; \hbar^\star_c(\cdot) = \operatorname{argmax}_{\hbar_c(\cdot)} \;\; \sum_{i\, (y_i = c)} \sum_{y \neq y_i} \lambda_{(i,y)}\, \hbar_c(x_i) - \sum_i \sum_{y \neq y_i,\, y = c} \lambda_{(i,y)}\, \hbar_c(x_i). \tag{4}
\]
The column generation procedure for MultiBoost$^{cw}$ is described in Algorithm 1. Essentially, we repeat the following two steps until convergence. 1) We solve the master problem (2) with $\hbar_c \in \mathcal{H}_c$, $c = 1, \dots, K$, to obtain the primal solution $w$; $\mathcal{H}_c$ is the working set of generated weak learners associated with the $c$-th class. We obtain the dual solution $\lambda^\star$ from the primal solution $w^\star$ using the KKT conditions:
\[
\lambda^\star_{(i,y)} = \frac{C}{p} \exp\big[ w^{\star\top}_y h_y(x_i) - w^{\star\top}_{y_i} h_{y_i}(x_i) \big]. \tag{5}
\]
2) With the dual solution $\lambda^\star_{(i,y)}$, we solve the sub-problem (4) to generate $K$ weak learners $\hbar^\star_c$, $c = 1, 2, \dots, K$, and add them to the working weak learner sets $\mathcal{H}_c$.

In MultiBoost$^{cw}$, $K$ weak learners (one per class) are generated in each iteration, while in MultiBoost only one weak learner is generated at each column generation step and shared by all classes. As shown in [1] for MultiBoost, the sub-problem for finding the most violated constraint in the dual form is:
\[
[\hbar^\star(\cdot), c^\star] = \operatorname{argmax}_{\hbar(\cdot),\, c} \;\; \sum_{i\, (y_i = c)} \sum_{y \neq y_i} \lambda_{(i,y)}\, \hbar(x_i) - \sum_i \sum_{y \neq y_i,\, y = c} \lambda_{(i,y)}\, \hbar(x_i). \tag{6}
\]
At each column generation step of MultiBoost, (6) is solved to generate one weak learner. Note that solving (6) requires searching over all $K$ classes to find the best weak learner $\hbar^\star$, so its computational cost is the same as that of solving (4) in MultiBoost$^{cw}$. This is the reason why MultiBoost$^{cw}$ does not introduce additional training cost compared to MultiBoost.
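To make the weak-learner generation step concrete, the sketch below selects, for each class $c$, the candidate weak learner that maximizes the weighted edge in (4) from a finite candidate pool. Representing the weak-learner space as a precomputed matrix of $\pm 1$ outputs, and the function and variable names, are simplifying assumptions for illustration only.

import numpy as np

def most_violated_weak_learners(cand_outputs, lam, labels, K):
    # Illustrative sketch, not the paper's implementation.
    # cand_outputs: (num_candidates, m) matrix of candidate weak-learner outputs in {-1, +1}.
    # lam: (m, K) dual variables, lam[i, y] = lambda_{(i,y)}, with lam[i, labels[i]] set to 0.
    # Returns, for each class c, the index of the candidate maximizing the edge in (4).
    best = []
    for c in range(K):
        # weight of example i: +sum_{y != y_i} lam[i, y] if y_i == c (first term of (4)),
        #                      -lam[i, c]                otherwise    (second term of (4)).
        u = np.where(labels == c, lam.sum(axis=1), -lam[:, c])
        scores = cand_outputs @ u        # edge of every candidate weak learner for class c
        best.append(int(np.argmax(scores)))
    return best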
In general, the solution $[w_1; \dots; w_K]$ of MultiBoost is highly sparse [1], which can be observed in our empirical study. The weak learner generated by solving (6) is actually targeted at one class, so using this weak learner across all classes in MultiBoost leads to a very sparse solution. The sparsity of $[w_1, \dots, w_K]$ indicates that one weak learner is usually only useful for the prediction of a small number of classes (typically only one), but useless for most other classes. In this sense, forcing different classes to use the same set of weak learners may not be necessary, and it usually leads to slow convergence. In contrast, by using separate weak learner sets for each class, MultiBoost$^{cw}$ tends to have a dense solution of $w$. With $K$ weak learners generated at each iteration, MultiBoost$^{cw}$ converges much faster.

Fast coordinate descent
To further speed up the training, we propose a fast coordinate descent method (FCD) for solving the primal MultiBoost$^{cw}$ problem at each column generation iteration. The details of FCD are presented in Algorithm 2. The high-level idea is simple. FCD works iteratively: at each iteration (working set iteration), we compute the violation value of the KKT conditions for each variable in $w$, construct a working set of violated variables (denoted $S$), and then pick variables from $S$ for update (one variable at a time). We also use the violation values to define stopping criteria. Our FCD is a mix of sequential and stochastic coordinate descent: in the first working set iteration, variables are picked sequentially for update (cyclic CD); in later working set iterations, variables are picked randomly (stochastic CD). In the sequel, we present the details of FCD.

First, we describe how to update one variable of $w$ by solving a single-variable sub-problem. For notational simplicity, we define $\delta h_i(y) = h_{y_i}(x_i) \otimes \Gamma(y_i) - h_y(x_i) \otimes \Gamma(y)$. $\Gamma(y)$ is the orthogonal label coding vector $\Gamma(y) = [\delta(y,1), \delta(y,2), \dots, \delta(y,K)]^\top \in \{0,1\}^K$, where $\delta(y,k)$ is the indicator function that returns 1 if $y = k$ and 0 otherwise, and $\otimes$ denotes the tensor product. MultiBoost$^{cw}$ in (2) can be equivalently written as
\[
\min_{w \geq 0} \; \|w\|_1 + \frac{C}{p} \sum_i \sum_{y \neq y_i} \exp\big[ -w^\top \delta h_i(y) \big]. \tag{7}
\]
We assume that binary weak learners are used here: $\hbar(x) \in \{+1, -1\}$. $\delta h_{i,j}(y)$ denotes the $j$-th dimension of $\delta h_i(y)$, and $\delta \hat{h}_{i,j}(y)$ denotes the remaining dimensions of $\delta h_i(y)$ excluding the $j$-th. The output of $\delta h_{i,j}(y)$ takes only three possible values: $\delta h_{i,j}(y) \in \{-1, 0, +1\}$. For the $j$-th dimension, we define $D^j_v = \{ (i,y) \mid \delta h_{i,j}(y) = v,\; i \in \{1, \dots, m\},\; y \in \mathcal{Y} \setminus \{y_i\} \}$, $v \in \{-1, 0, +1\}$; so $D^j_v$ is the set of constraint indices $(i,y)$ for which the output of $\delta h_{i,j}(y)$ is $v$. $w_j$ denotes the $j$-th variable of $w$, and $\hat{w}_j$ denotes the remaining variables of $w$ excluding the $j$-th. Let $g(w)$ be the objective function of the optimization (7). $g(w)$ can be decomposed as
\[
g(w) = \|w\|_1 + \frac{C}{p} \sum_i \sum_{y \neq y_i} \exp\big[ -w^\top \delta h_i(y) \big]
= \|\hat{w}_j\|_1 + w_j + \frac{C}{p} \sum_{i,\, y \neq y_i} \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) - w_j\, \delta h_{i,j}(y) \big]
\]
\[
= \|\hat{w}_j\|_1 + w_j + \frac{C}{p} \Big\{ \exp(w_j) \!\!\sum_{(i,y) \in D^j_{-1}}\!\! \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) \big] + \exp(-w_j) \!\!\sum_{(i,y) \in D^j_{+1}}\!\! \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) \big] + \!\!\sum_{(i,y) \in D^j_{0}}\!\! \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) \big] \Big\}
\]
\[
= \|\hat{w}_j\|_1 + w_j + \frac{C}{p} \big[ \exp(w_j)\, V_- + \exp(-w_j)\, V_+ + V_0 \big]. \tag{8}
\]
Here we have defined
\[
V_- = \sum_{(i,y) \in D^j_{-1}} \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) \big], \quad V_0 = \sum_{(i,y) \in D^j_{0}} \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) \big], \tag{9a}
\]
\[
V_+ = \sum_{(i,y) \in D^j_{+1}} \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) \big]. \tag{9b}
\]
In the variable update step, one variable $w_j$ is picked at a time for updating and the other variables $\hat{w}_j$ are fixed; thus we need to minimize $g$ in (8) with respect to $w_j$, which is a single-variable minimization.
It can be written as
\[
\min_{w_j \geq 0} \;\; w_j + \frac{C}{p} \big[ V_- \exp(w_j) + V_+ \exp(-w_j) \big]. \tag{10}
\]
Setting the derivative of the objective function in (10) with respect to $w_j$ (for $w_j > 0$) to zero gives
\[
\frac{\partial g}{\partial w_j} = 0 \;\Longrightarrow\; 1 + \frac{C}{p} \big[ V_- \exp(w_j) - V_+ \exp(-w_j) \big] = 0. \tag{11}
\]
By solving (11) together with the bound constraint $w_j \geq 0$, we obtain the analytical solution of the optimization in (10) (since $V_- > 0$):
\[
w_j^\star = \max\Big\{ 0, \; \log\Big( \sqrt{ V_+ V_- + \tfrac{p^2}{4C^2} } - \tfrac{p}{2C} \Big) - \log V_- \Big\}. \tag{12}
\]
When $C$ is large, (12) can be approximately simplified as
\[
w_j^\star = \max\Big\{ 0, \; \tfrac{1}{2} \log \tfrac{V_+}{V_-} \Big\}. \tag{13}
\]
With the analytical solution in (12), the update of each dimension of $w$ can be performed extremely efficiently. The main requirement for obtaining the closed-form solution is the use of discrete weak learners.

We use the KKT conditions to construct a set of violated variables and to derive meaningful stopping criteria. For the optimization of MultiBoost$^{cw}$ (7), the KKT conditions are necessary and also sufficient for optimality. The Lagrangian of (7) is $L = \|w\|_1 + \frac{C}{p} \sum_i \sum_{y \neq y_i} \exp\big[ -w^\top \delta h_i(y) \big] - \alpha^\top w$. According to the KKT conditions, $w^\star$ is optimal for (7) if and only if $w^\star$ satisfies $w^\star \geq 0$, $\alpha^\star \geq 0$, $\forall j: \alpha^\star_j w^\star_j = 0$ and $\forall j: \nabla_j L(w^\star) = 0$. For $w_j > 0$,
\[
\frac{\partial L}{\partial w_j} = 0 \;\Longrightarrow\; 1 - \frac{C}{p} \sum_i \sum_{y \neq y_i} \exp\big[ -w^\top \delta h_i(y) \big] \delta h_{i,j}(y) - \alpha_j = 0.
\]
Considering the complementary slackness $\alpha^\star_j w^\star_j = 0$: if $w^\star_j > 0$, we have $\alpha^\star_j = 0$; if $w^\star_j = 0$, we have $\alpha^\star_j \geq 0$. The optimality conditions can therefore be written as
\[
\forall j: \;
\begin{cases}
1 - \frac{C}{p} \sum_i \sum_{y \neq y_i} \exp\big[ -w^{\star\top} \delta h_i(y) \big] \delta h_{i,j}(y) = 0, & \text{if } w^\star_j > 0, \\
1 - \frac{C}{p} \sum_i \sum_{y \neq y_i} \exp\big[ -w^{\star\top} \delta h_i(y) \big] \delta h_{i,j}(y) \geq 0, & \text{if } w^\star_j = 0.
\end{cases} \tag{14}
\]
For notational simplicity, we define a column vector $\mu$ as in (15). With the optimality conditions (14), we define $\theta_j$ in (16) as the violation value of the $j$-th variable of the solution $w^\star$:
\[
\mu_{(i,y)} = \exp\big[ -w^\top \delta h_i(y) \big], \tag{15}
\]
\[
\theta_j =
\begin{cases}
\big| 1 - \frac{C}{p} \sum_i \sum_{y \neq y_i} \mu_{(i,y)}\, \delta h_{i,j}(y) \big|, & \text{if } w^\star_j > 0, \\
\max\big\{ 0, \; \frac{C}{p} \sum_i \sum_{y \neq y_i} \mu_{(i,y)}\, \delta h_{i,j}(y) - 1 \big\}, & \text{if } w^\star_j = 0.
\end{cases} \tag{16}
\]
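The closed-form update (12) and the violation value (16) are simple enough to state directly in code. The sketch below is a minimal illustration under the assumption that $V_-$, $V_+$ and the aggregated gradient term are already available as scalars; the function and variable names are ours, not the paper's.

import numpy as np

def coordinate_update(V_minus, V_plus, C, p):
    # Closed-form minimizer of (10), as in (12); assumes V_minus > 0 and V_plus > 0.
    # (Illustrative sketch, not the paper's implementation.)
    r = p / (2.0 * C)
    return max(0.0, np.log(np.sqrt(V_plus * V_minus + r * r) - r) - np.log(V_minus))

def coordinate_update_large_C(V_minus, V_plus):
    # Large-C simplification (13): w_j = max{0, 0.5 * log(V_+ / V_-)}.
    return max(0.0, 0.5 * np.log(V_plus / V_minus))

def violation(w_j, grad_term):
    # grad_term = (C/p) * sum_{i, y != y_i} mu_{(i,y)} * dh_{i,j}(y); theta_j as in (16).
    if w_j > 0:
        return abs(1.0 - grad_term)
    return max(0.0, grad_term - 1.0)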
At each working set iteration of FCD, we compute the violation values $\theta$ and construct a working set $S$ of violated variables; then we (randomly, except in the first iteration) pick one variable from $S$ for update. We repeat this picking $|S|$ times, where $|S|$ is the number of elements of $S$. $S$ is defined as
\[
S = \{ j \mid \theta_j > \epsilon \}, \tag{17}
\]
where $\epsilon$ is a tolerance parameter. Analogous to [14] and [13], with the definition of the variable violation values $\theta$ in (16), we can define the stopping criterion as
\[
\max_j \theta_j \leq \epsilon, \tag{18}
\]
where $\epsilon$ can be the same tolerance parameter as in the working set definition (17). The stopping condition (18) says that FCD terminates once the largest violation value is smaller than the threshold. We can see that using the KKT conditions amounts to using gradient information. An inexact solution for $w$ is acceptable at each column generation iteration, so we impose a maximum number of iterations ($\tau_{\max}$ in Algorithm 2) on FCD to prevent unnecessary computation. We need to compute $\mu$ before obtaining $\theta$, but computing $\mu$ in (15) is expensive. Fortunately, we are able to update $\mu$ efficiently after the update of one variable $w_j$, avoiding the re-computation of (15). $\mu$ in (15) can be equally written as
\[
\mu_{(i,y)} = \exp\big[ -\hat{w}_j^\top \delta\hat{h}_{i,j}(y) - w_j\, \delta h_{i,j}(y) \big]. \tag{19}
\]
The update of $\mu$ is then
\[
\mu_{(i,y)} = \mu^{\text{old}}_{(i,y)} \exp\big[ \delta h_{i,j}(y) \big( w^{\text{old}}_j - w_j \big) \big]. \tag{20}
\]
With the definition of $\mu$ in (19), the values $V_-$ and $V_+$ for one variable update can be computed efficiently using $\mu$, avoiding the expensive computation in (9a) and (9b); $V_-$ and $V_+$ can be equally defined as
\[
V_- = \sum_{(i,y) \in D^j_{-1}} \mu_{(i,y)} \exp(-w_j), \qquad V_+ = \sum_{(i,y) \in D^j_{+1}} \mu_{(i,y)} \exp(w_j). \tag{21}
\]
Some discussion on FCD (Algorithm 2) is as follows. 1) Stage-wise optimization is a special case of FCD. Compared to totally corrective optimization, which considers all variables of $w$ for update, stage-wise optimization only considers the newly added variables. We initialize the working set with the newly added variables, and in the first working set iteration we update the newly added variables sequentially. If the maximum number of working set iterations is set to 1 ($\tau_{\max} = 1$ in Algorithm 2), FCD becomes a stage-wise algorithm. Thus FCD is a generalized algorithm with the totally corrective update and the stage-wise update as special cases. In the stage-wise setting, a large $C$ (regularization parameter) is usually implicitly enforced, so we can use the analytical solution in (13) for the variable update. 2) Randomly picking one variable for update without any guidance leads to slow local convergence: when the solution gets close to optimality, usually only very few variables need updating, and most picks do not "hit". In column generation (CG), the initial value of $w$ is set to the solution of the last CG iteration, which is already fairly close to optimality; therefore the slow local convergence of stochastic coordinate descent is more serious in column generation based boosting. Here we use the KKT conditions to iteratively construct a working set of violated variables, and only the variables in the working set need updating. This strategy leads to faster CD convergence.
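The sketch below strings these pieces together into one pass of the working-set loop (listed as Algorithm 2 next): it forms $V_-$ and $V_+$ via (21), applies the simplified large-$C$ update (13), and maintains $\mu$ incrementally via (20). The dense matrix representation of $\delta h_{i,j}(y)$ and all names are illustrative assumptions, not the paper's implementation.

import numpy as np

def fcd_pass(dh, w, mu, work_set):
    # Illustrative sketch, not the paper's implementation.
    # dh: (num_constraints, d) matrix with entries delta_h_{i,j}(y) in {-1, 0, +1},
    #     one row per constraint (i, y != y_i); w: (d,) current non-negative solution;
    # mu: (num_constraints,) cached values mu_{(i,y)} = exp(-w^T delta_h_i(y));
    # work_set: indices of currently violated variables.
    for j in np.random.permutation(work_set):
        col = dh[:, j]
        V_minus = np.sum(mu[col == -1]) * np.exp(-w[j])     # eq. (21)
        V_plus = np.sum(mu[col == +1]) * np.exp(w[j])       # eq. (21)
        if V_minus <= 0 or V_plus <= 0:
            continue                                        # guard against degenerate sums in this sketch
        w_new = max(0.0, 0.5 * np.log(V_plus / V_minus))    # simplified update (13)
        mu *= np.exp(col * (w[j] - w_new))                  # incremental update of mu, eq. (20)
        w[j] = w_new
    return w, mu

In a full implementation, the working set would then be rebuilt from the violation values (16) after each pass, as in steps 11 and 12 of Algorithm 2.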
Algorithm 2
FCD: Fast coordinate descent for MultiBoost$^{cw}$
1: Input: training examples $(x_1; y_1), \dots, (x_m; y_m)$; parameter $C$; tolerance $\epsilon$; weak learner sets $\mathcal{H}_c$, $c = 1, \dots, K$; initial value of $w$; maximum number of working set iterations $\tau_{\max}$.
2: Initialize: initialize the variable working set $S$ with the variable indices in $w$ that correspond to newly added weak learners; initialize $\mu$ in (15); working set iteration index $\tau = 0$.
3: Repeat (working set iteration)
4: $\tau = \tau + 1$; reset the inner loop index: $q = 0$.
5: While $q < |S|$ ($|S|$ is the size of $S$)
6: $q = q + 1$; pick one variable index $j$ from $S$: if $\tau = 1$, pick sequentially; else pick randomly.
7: Compute $V_-$ and $V_+$ in (21) using $\mu$.
8: Update the variable $w_j$ by (12) using $V_-$ and $V_+$.
9: Update $\mu$ by (20) using the updated $w_j$.
10: End While
11: Compute the violation values $\theta$ in (16) for all variables.
12: Re-construct the variable working set $S$ in (17) using $\theta$.
13: Until the stop condition in (18) is satisfied or the maximum number of working set iterations is reached: $\tau \geq \tau_{\max}$.
14: Output: $w$.

We evaluate our method MultiBoost$^{cw}$ on several UCI datasets and a variety of multi-class image classification applications, including digit recognition, scene recognition, and traffic sign recognition. We compare MultiBoost$^{cw}$ against MultiBoost [1] with the exponential loss, and against three other popular multi-class boosting algorithms: AdaBoost.ECC [9], AdaBoost.MH [10] and AdaBoost.MO [10]. We use FCD as the solver for MultiBoost$^{cw}$, and LBFGS-B [12] for MultiBoost. We also perform further experiments to evaluate FCD in detail. For all experiments, the best regularization parameter $C$ for MultiBoost$^{cw}$ and MultiBoost is selected over a range of powers of 10; the tolerance parameter $\epsilon$ in FCD is set to a fixed small value. We use MultiBoost$^{cw}$-1 to denote MultiBoost$^{cw}$ with the stage-wise setting of FCD, which uses only one working set iteration ($\tau_{\max} = 1$ in Algorithm 2). In MultiBoost$^{cw}$-1, we fix $C$ to be a large value.

All experiments are run 5 times. We compare the testing error, the total training time and the solver time on all datasets. The results show that our MultiBoost$^{cw}$ and MultiBoost$^{cw}$-1 converge much faster than the other methods, use less training time than MultiBoost, and achieve the best testing error on most datasets.
[Figure 1 plots: test error, total training time and solver time versus boosting iterations on VOWEL (top row) and ISOLET (bottom row).]
Fig. 1:
Results on 2 UCI datasets: VOWEL and ISOLET. CW and CW-1 are our methods; CW-1 uses the stage-wise setting. The number after each method name is the mean value (with standard deviation) at the last iteration. Our methods converge much faster and achieve competitive test accuracy. Both the total training time and the solver time of our methods are less than those of MultiBoost [1].
AdaBoost.MO [10] (Ada.MO) has a similar convergence rate to our method, but it is much slower than our method and becomes intractable for large-scale datasets. We run Ada.MO on some UCI datasets and MNIST; results are shown in Fig. 1 and Fig. 2. We set a maximum training time (1000 seconds) for Ada.MO; all other methods are below this maximum time on those datasets. If the maximum time is reached, we report the results of the finished iterations.
UCI datasets: we use 2 UCI multi-class datasets: VOWEL and ISOLET. For each dataset, we randomly select 75% of the data for training and use the rest for testing. Results are shown in Fig. 1.
Handwritten digit recognition: we use 3 handwritten digit datasets: MNIST, USPS and PENDIGITS. For MNIST, we randomly sample 1000 examples from each class and use the original test set of 10,000 examples. For USPS and PENDIGITS, we randomly select 75% of the data for training and use the rest for testing. Results are shown in Fig. 2.

For PASCAL07, we use the 5 types of features provided in [15]. For LabelMe, we use the subset LabelMe-12-50k and generate GIST features. For these two datasets, we use only those images that have a single class label; we use 70% of the data for training and the rest for testing. For CIFAR10, we construct 2 datasets: one uses GIST features and the other uses the raw pixel values. We use the provided test set and the 5 provided training sets, one for each of the 5 runs. Results are shown in Fig. 3.
[Figure 2 plots: test error, total training time and solver time versus boosting iterations on USPS (top), PENDIGITS (middle) and MNIST (bottom).]
Fig. 2:
Experiments on 3 handwritten digit recognition datasets: USPS, PENDIGITS and MNIST. CW and CW-1 are our methods; CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time. Ada.MO has a similar convergence rate to ours, but requires much more training time. With a maximum training time of 1000 seconds, Ada.MO failed to finish 500 iterations on all 3 datasets.
Scene recognition: we use 2 scene image datasets: Scene15 [16] and SUN [17]. For Scene15, we randomly select 100 images per class for training and use the rest for testing. We generate histograms of code words as features, with a codebook size of 200. An image is divided into 31 sub-windows in a spatial hierarchy; we generate histograms in each sub-window, so the histogram feature dimension is 6200. For the SUN dataset, we construct a subset of the original dataset containing 25 categories. For each category, we use the top 200 images, and randomly select 80% of the data for training and the rest for testing. We use the HOG features described in [17]. Results are shown in Fig. 4.
[Figure 3 plots: test error versus boosting iterations on PASCAL07, LABELME-SUB, CIFAR10-GIST and CIFAR10-RAW, and solver time versus boosting iterations on PASCAL07, CIFAR10-RAW and CIFAR10-GIST.]
Fig. 3:
Experiments on 3 image datasets: PASCAL07, LabelMe and CIFAR10. CW and CW-1 are our methods; CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time.
Traffic sign recognition: we use the GTSRB traffic sign dataset (http://benchmark.ini.rub.de/). There are 43 classes and more than 50,000 images. We use the provided 3 types of HOG features, so there are 6052 features in total. We randomly select 100 examples per class for training and use the original test set. Results are shown in Fig. 5.
[Figure 4 plots: test error, total training time and solver time versus boosting iterations on SCENE15 (top) and SUN-25 (bottom).]
Fig. 4:
Experiments on 2 scene recognition datasets: SCENE15 and a subset of SUN. CW and CW-1 are our methods; CW-1 uses the stage-wise setting. Our methods converge much faster, achieve the best test error and use less training time.
We perform further experiments to evaluate FCD with different parameter settings, and compare it to the LBFGS-B [12] solver. We use 3 datasets in this section: VOWEL, USPS and SCENE15. We run FCD with different settings of the maximum number of working set iterations ($\tau_{\max}$ in Algorithm 2) to evaluate how this setting affects the performance of FCD. We also run the LBFGS-B [12] solver on the same optimization problem (2) as FCD. We set $C = 10$ for all cases. Results are shown in Fig. 6. For LBFGS-B, we use the default convergence setting to obtain a moderately accurate solution. The number after "FCD" in the figure is the setting of $\tau_{\max}$ in Algorithm 2. The results show that the stage-wise case ($\tau_{\max} = 1$) of FCD is the fastest, as expected. When we set $\tau_{\max} \geq 2$, the objective value of the optimization (2) obtained by our method converges much faster than that of LBFGS-B. Thus setting $\tau_{\max} = 2$ is sufficient to achieve a very accurate solution, while at the same time giving faster convergence and less running time than LBFGS-B.

In this work, we have presented a novel multi-class boosting method.
[Figure 5 plots: test error and solver time versus boosting iterations on GTSRB.]
Fig. 5:
Results on the traffic sign dataset GTSRB. CW and CW-1 (stage-wise setting) are our methods. Our methods converge much faster, achieve the best test error and use less training time.

Based on the dual problem, boosting is implemented using the column generation technique. Different from most existing multi-class boosting methods, we train a weak learner set for each class, which results in much faster convergence. A wide range of experiments on several different datasets demonstrates that the proposed multi-class boosting achieves competitive test accuracy compared with other existing multi-class boosting methods, yet converges much faster; moreover, due to the proposed efficient coordinate descent method, the training of our method is much faster than that of MultiBoost [1].
Acknowledgement. This work was supported by ARC grants LP120200485 and FT120100969.
References
1. Shen, C., Hao, Z.: A direct formulation for totally-corrective multi-class boosting. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2011)
2. Paisitkriangkrai, S., Shen, C., van den Hengel, A.: Sharing features in multi-class boosting via group sparsity. In: Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2012)
3. Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S.: Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics (1998) 1651-1686
4. Shen, C., Li, H.: On the dual formulation of boosting algorithms. IEEE Trans. Pattern Anal. Mach. Intell. (2010) 2216-2231
5. Viola, P., Jones, M.J.: Robust real-time face detection. Int. J. Comput. Vision (2004) 137-154
6. Wang, P., Shen, C., Barnes, N., Zheng, H.: Fast and robust object detection using asymmetric totally-corrective boosting. IEEE Trans. Neural Networks & Learn. Syst. (2012) 33-46
7. Paisitkriangkrai, S., Shen, C., Zhang, J.: Fast pedestrian detection using a cascade of boosted covariance features. IEEE Trans. Circuits & Syst. for Video Tech. (2008) 1140-1151
8. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Int. Res. (1995) 263-286
[Figure 6 plots: objective function value (top row) and solver time (bottom row) versus boosting iterations, one column per dataset, for FCD with different $\tau_{\max}$ settings and for LBFGS-B.]
Fig. 6:
Solver comparison between FCD with different parameter settings and LBFGS-B [12]. One column per dataset. The number after "FCD" is the setting of the maximum number of working set iterations ($\tau_{\max}$) of FCD. The stage-wise setting of FCD is the fastest. See the text for details.

9. Guruswami, V., Sahai, A.: Multiclass learning, boosting, and error-correcting codes. In: Proc. Annual Conf. Computational Learning Theory, New York, NY, USA, ACM (1999) 145-155
10. Schapire, R.E., Singer, Y.: Improved boosting algorithms using confidence-rated predictions. Machine Learn. (1999) 80-91
11. Demiriz, A., Bennett, K.P., Shawe-Taylor, J.: Linear programming boosting via column generation. Mach. Learn. (2002) 225-254
12. Zhu, C., Byrd, R.H., Lu, P., Nocedal, J.: Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. ACM Trans. Math. Software (1994)
13. Yuan, G.X., Chang, K.W., Hsieh, C.J., Lin, C.J.: A comparison of optimization methods and software for large-scale l1-regularized linear classification. J. Mach. Learn. Res. (2010) 3183-3234
14. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res. 9