Online Multiclass Boosting
Young Hun Jung, Jack Goetz, Ambuj Tewari
Department of Statistics, University of Michigan, Ann Arbor, MI 48109
{yhjung, jrgoetz, tewaria}@umich.edu
February 27, 2018
Abstract
Recent work has extended the theoretical analysis of boosting algorithms to multiclass problems and to online settings. However, the multiclass extension is in the batch setting and the online extensions only consider binary classification. We fill this gap in the literature by defining, and justifying, a weak learning condition for online multiclass boosting. This condition leads to an optimal boosting algorithm that requires the minimal number of weak learners to achieve a certain accuracy. Additionally, we propose an adaptive algorithm which is near optimal and enjoys excellent performance on real data due to its adaptive property.
1 Introduction

Boosting methods are ensemble learning methods that aggregate several (not necessarily weak) learners to build a stronger learner. When used to aggregate reasonably strong learners, boosting has been shown to produce results competitive with other state-of-the-art methods (e.g., Korytkowski et al. [1], Zhang and Wang [2]). Until recently, theoretical development in this area focused on the batch binary setting, where the learner can observe the entire training set at once and the labels are restricted to be binary (cf. Schapire and Freund [3]). In the past few years, progress has been made to extend the theory and algorithms to more general settings.

Dealing with multiclass classification turned out to be more subtle than initially expected. Mukherjee and Schapire [4] unify several different proposals made earlier in the literature and provide a general framework for multiclass boosting. They state their weak learning conditions in terms of cost matrices that have to satisfy certain restrictions: for example, labeling with the ground truth should cost less than labeling with some other label. A weak learning condition, just like the binary condition, states that the performance of a learner, now judged using a cost matrix, should be better than a random guessing baseline. One particular condition, which they call the edge-over-random condition, proves to be sufficient for boostability; it will also figure prominently in this paper. They also consider a condition that is both necessary and sufficient for boostability, but it turns out to be too computationally intractable to use in practice.

A recent trend in modern machine learning is to train learners in an online setting, where instances arrive sequentially and the learner has to make predictions instantly. Oza [5] initially proposed an online boosting algorithm with accuracy comparable to the batch version, but it took several years to design an algorithm with theoretical justification (Chen et al. [6]). Beygelzimer et al. [7] achieved a breakthrough by proposing an optimal algorithm in the online binary setting and an adaptive algorithm that works quite well in practice. These results in online binary boosting have led to several extensions. For example, Chen et al. [8] combine the one-vs-all method with binary boosting algorithms to tackle online multiclass problems with bandit feedback, and Hu et al. [9] build a theory of boosting in the regression setting.

In this paper, we combine the insights and techniques of Mukherjee and Schapire [4] and Beygelzimer et al. [7] to provide a framework for online multiclass boosting. The cost matrix framework from the former work is adopted to propose an online weak learning condition that defines how well a learner can perform compared to a random guess (Definition 1). We show that this condition is naturally derived from its batch setting counterpart. From this weak learning condition, a boosting algorithm (Algorithm 1) is proposed which is theoretically optimal in that it requires the minimal number of learners and the minimal sample complexity to attain a specified level of accuracy. We also develop an adaptive algorithm (Algorithm 2) which allows learners to have variable strengths. This algorithm is theoretically less efficient than the optimal one, but the experimental results show that it is quite comparable and sometimes even better due to its adaptive property. Both algorithms not only possess theoretical proofs of mistake bounds, but also demonstrate superior performance over preexisting methods.
2 Preliminaries

We first describe the basic setup for online boosting. While in the batch setting an additional weak learner is trained at every iteration, in the online setting the algorithm starts with a fixed number N of weak learners and a booster which manages the weak learners. There are k possible labels [k] := {1, ..., k}, and k is known to the learners. At each iteration t = 1, ..., T, an adversary picks a labeled example (x_t, y_t) ∈ X × [k], where X is some domain, and reveals x_t to the booster. Once the booster observes the unlabeled data x_t, it gathers the weak learners' predictions and makes a final prediction. Throughout this paper, the index i takes values from 1 to N; t from 1 to T; and l from 1 to k.

We utilize the cost matrix framework, first proposed by Mukherjee and Schapire [4], to develop multiclass boosting algorithms. This is a key ingredient in the multiclass extension as it enables different penalization for each pair of correct label and prediction, and we further develop this framework to suit the online setting. The booster sequentially computes cost matrices {C^i_t ∈ R^{k×k} | i = 1, ..., N}, sends (x_t, C^i_t) to the i-th weak learner WL^i, and gets its prediction l^i_t ∈ [k]. Here the cost matrix C^i_t plays the role of a loss function in that WL^i tries to minimize the cumulative cost ∑_t C^i_t[y_t, l^i_t]. As the booster wants each learner to predict the correct label, it wants to set the diagonal entries of C^i_t to be minimal in their rows. At this stage, the true label y_t is not revealed yet, but the previous weak learners' predictions can affect the computation of the cost matrix for the next learner. Given a matrix C, the (i, j) entry will be denoted by C[i, j], and the i-th row vector by C[i].

Once all the learners make predictions, the booster makes the final prediction ŷ_t by majority votes. The booster can either take simple majority votes or weighted ones. In fact, for the adaptive algorithm, we will allow weighted votes so that the booster can assign more weight to well-performing learners. The weight for WL^i at iteration t will be denoted by α^i_t. After observing the booster's final decision, the adversary reveals the true label y_t, and the booster suffers the 0-1 loss 1(ŷ_t ≠ y_t). The booster also shares the true label with the weak learners so that they can train on this data point.

Two main issues have to be resolved to design a good boosting algorithm. First, we need to design the booster's strategy for producing cost matrices. Second, we need to quantify a weak learner's ability to reduce the cumulative cost ∑_{t=1}^T C^i_t[y_t, l^i_t]. The first issue will be resolved by introducing potential functions, which are thoroughly discussed in Section 3.1. For the second issue, we introduce our online weak learning condition, a generalization of the weak learning assumption in Beygelzimer et al. [7], stating that for any adaptively given sequence of cost matrices, weak learners can produce predictions whose cumulative cost is less than that incurred by random guessing. The online weak learning condition is discussed in the following section. For the analysis of the adaptive algorithm, we use empirical edges instead of the online weak learning condition. A minimal sketch of one round of this interaction protocol, under illustrative assumptions, appears below.
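To make the protocol concrete, the following sketch (not the authors' implementation; the WeakLearner interface, the callables make_cost_matrix and make_weight, and the simple unweighted vote are illustrative assumptions) shows how a booster could run one round with N weak learners.

    import numpy as np

    class WeakLearner:
        """Illustrative interface: any online multiclass learner that accepts
        a cost matrix before predicting and a weighted example afterwards."""
        def predict(self, x, cost_matrix):
            raise NotImplementedError
        def update(self, x, y, weight):
            raise NotImplementedError

    def boosting_round(x_t, y_t, learners, k, make_cost_matrix, make_weight):
        """One round of the online boosting protocol with simple (unweighted) votes.
        make_cost_matrix(i, s) returns the k x k cost matrix for learner i given the
        cumulative votes s of the first i-1 learners; make_weight(i, s, y) returns
        the weight sent back to learner i after the true label y is revealed."""
        s = np.zeros(k)                        # cumulative votes
        preds = []
        for i, wl in enumerate(learners):
            C = make_cost_matrix(i, s)         # booster -> learner: cost matrix
            l = wl.predict(x_t, C)             # learner -> booster: label in {0,...,k-1}
            preds.append(l)
            s[l] += 1.0
        y_hat = int(np.argmax(s))              # final decision by plurality vote
        s = np.zeros(k)                        # replay votes to compute each weight
        for i, (wl, l) in enumerate(zip(learners, preds)):
            wl.update(x_t, y_t, make_weight(i, s, y_t))
            s[l] += 1.0
        return y_hat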
2.1 Online weak learning condition

In this section, we propose an online weak learning condition stating that the weak learners are better than a random guess. We first define a baseline that is slightly better than a random guess. Let Δ[k] denote the family of distributions over [k] and let u^l_γ ∈ Δ[k] be the distribution that puts γ more weight on the label l than the uniform distribution. For example, u^1_γ = ((1−γ)/k + γ, (1−γ)/k, ..., (1−γ)/k). For a given sequence of examples {(x_t, y_t) | t = 1, ..., T}, U_γ ∈ R^{T×k} consists of rows u^{y_t}_γ. Then we restrict the booster's choice of cost matrices to

C^eor_1 := {C ∈ R^{k×k} | ∀ l, r ∈ [k]: C[l, l] = 0, C[l, r] ≥ 0, and ‖C[l]‖_1 = 1}.

Note that the diagonal entries are minimal in their rows, and C^eor_1 also has a normalization constraint. A broader choice of cost matrices is allowed if one can assign importance weights to observations, which is possible for various learners. Even if a learner does not take an importance weight as an input, we can achieve a similar effect by sending the learner an instance with probability proportional to its weight; interested readers can refer to Beygelzimer et al. [7, Lemma 1]. From now on, we assume that our weak learners can take a weight w_t as an input.

We are now ready to present our online weak learning condition. This condition is in fact naturally derived from the batch setting counterpart that is well studied by Mukherjee and Schapire [4]; the link is thoroughly discussed in Appendix A. To handle the scaling issue, we assume the weights w_t lie in [0, 1].
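As a small illustration of these definitions (the function names are ours, not the paper's), the following builds the biased-uniform distribution u^l_γ and checks membership in C^eor_1.

    import numpy as np

    def u(l, gamma, k):
        """The distribution u^l_gamma: uniform plus gamma extra weight on label l."""
        p = np.full(k, (1.0 - gamma) / k)
        p[l] += gamma
        return p

    def in_C_eor_1(C, tol=1e-9):
        """Membership in C^eor_1: zero diagonal, non-negative entries,
        and every row with l1 norm equal to one."""
        C = np.asarray(C, dtype=float)
        return (np.all(np.abs(np.diag(C)) < tol)
                and np.all(C >= -tol)
                and np.allclose(np.abs(C).sum(axis=1), 1.0, atol=tol))

    k, gamma = 4, 0.1
    print(u(0, gamma, k))                 # (1-gamma)/k + gamma on label 0, (1-gamma)/k elsewhere
    C = np.ones((k, k)) / (k - 1)
    np.fill_diagonal(C, 0.0)              # every wrong label equally costly
    print(in_C_eor_1(C))                  # True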
Definition 1. (Online multiclass weak learning condition) For parameters γ, δ ∈ (0, 1) and S > 0, a pair of an online learner and an adversary is said to satisfy the online weak learning condition with parameters δ, γ, and S if, for any sample length T, any adaptive sequence of labeled examples, and any adaptively chosen series of pairs of weight and cost matrix {(w_t, C_t) ∈ [0, 1] × C^eor_1 | t = 1, ..., T}, the learner can generate predictions ŷ_t such that with probability at least 1 − δ,

∑_{t=1}^T w_t C_t[y_t, ŷ_t] ≤ C • U'_γ + S = (1−γ)/k ‖w‖_1 + S,    (1)

where C ∈ R^{T×k} consists of rows w_t C_t[y_t], A • B' denotes the Frobenius inner product Tr(AB'), and w = (w_1, ..., w_T). The last equality holds due to the normalization constraint on C^eor_1. We call γ an edge and S an excess loss.
Remark. Notice that this condition is imposed on a pair of a learner and an adversary rather than on a learner alone. This is because no learner can satisfy the condition if the adversary draws samples in a completely adaptive manner. The probabilistic statement is necessary because many online algorithms' predictions are not deterministic. The excess loss requirement is needed because an online learner cannot produce meaningful predictions before observing a sufficient number of examples.
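As a concrete reading of condition (1) (an illustrative helper, not part of the paper's code), the following computes the slack between a learner's cumulative weighted cost and the edge-over-random baseline; a learner with edge γ should keep this slack below the excess loss S with high probability.

    def weak_learning_slack(ws, Cs, ys, preds, gamma):
        """Left-hand side of (1) minus the baseline (1-gamma)/k * ||w||_1.
        ws: weights w_t in [0,1]; Cs: k x k cost matrices in C^eor_1;
        ys: true labels; preds: the learner's predictions."""
        k = Cs[0].shape[0]
        lhs = sum(w * C[y, p] for w, C, y, p in zip(ws, Cs, ys, preds))
        baseline = (1.0 - gamma) / k * sum(ws)
        return lhs - baseline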
3 An optimal algorithm

In this section, we describe the booster's optimal strategy for designing cost matrices. We first introduce a general theory without specifying the loss, and later investigate the asymptotic behavior of the cumulative loss suffered by our algorithm under the specific 0-1 loss. We adopt the potential function framework from Mukherjee and Schapire [4] and extend it to the online setting. Potential functions help both in designing cost matrices and in proving the mistake bound of the algorithm.
3.1 A general online multiclass boost-by-majority (OnlineMBBM) algorithm

We keep track of the weighted cumulative votes of the first i weak learners on the sample x_t by s^i_t := ∑_{j=1}^i α^j_t e_{l^j_t}, where α^i_t is the weight of WL^i, l^i_t is its prediction, and e_j is the j-th standard basis vector. For the optimal algorithm, we assume that α^i_t = 1 for all i and t; in other words, the booster makes the final decision by simple majority votes. Given a cumulative vote s ∈ R^k, suppose we have a loss function L_r(s), where r denotes the correct label. We call a loss function proper if it is a decreasing function of s[r] and an increasing function of the other coordinates (we alert the reader that "proper loss" has at least one other meaning in the literature). From now on, we assume that our loss function is proper. A good example of a proper loss is the multiclass 0-1 loss:

L_r(s) := 1(max_{l ≠ r} s[l] ≥ s[r]).    (2)

The purpose of the potential function φ^r_i(s) is to estimate the booster's loss when there remain i learners until the final decision and the current cumulative vote is s. More precisely, we want potential functions to satisfy the following conditions:

φ^r_0(s) = L_r(s),
φ^r_{i+1}(s) = E_{l ∼ u^r_γ} φ^r_i(s + e_l).    (3)

Readers should note that φ^r_i(s) also inherits the proper property of the loss function, which can be shown by induction. Condition (3) can be loosened by replacing both equalities by inequalities "≥", but in practice we usually use equalities.

Now we describe the booster's strategy for designing cost matrices. After observing x_t, the booster sequentially sets a cost matrix C^i_t for WL^i, gets the weak learner's prediction l^i_t, and uses this in the computation of the next cost matrix C^{i+1}_t. Ultimately, the booster wants to set

C^i_t[r, l] = φ^r_{N−i}(s^{i−1}_t + e_l).    (4)

However, this cost matrix does not satisfy the conditions of C^eor_1 and thus should be modified in order to utilize the weak learning condition. First, to make the cost of the true label equal to 0, we subtract C^i_t[r, r] from every element of C^i_t[r]. Since the potential function is proper, the new cost matrix still has non-negative elements after the subtraction. We then normalize each row so that it has ℓ_1 norm equal to 1. In other words, we get the new normalized cost matrix

D^i_t[r, l] = (φ^r_{N−i}(s^{i−1}_t + e_l) − φ^r_{N−i}(s^{i−1}_t + e_r)) / w^i[t],    (5)

where w^i[t] := ∑_{l=1}^k [φ^r_{N−i}(s^{i−1}_t + e_l) − φ^r_{N−i}(s^{i−1}_t + e_r)] plays the role of a weight. It is still possible that a row vector C^i_t[r] is a zero vector, in which case normalization is impossible; we then simply leave it as a zero vector. Our weak learning condition (1) still works with cost matrices some of whose row vectors are zeros because, however the learner predicts, it incurs no cost. A small sketch of the recursion (3) and the normalized cost matrix (5) appears below; the efficient computation of potentials for the 0-1 loss is deferred to Appendix B.2.
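The following sketch (ours, intended only for small k and N) implements the recursion (3) for the 0-1 loss by brute force and forms one row of the normalized cost matrix (5).

    import numpy as np
    from functools import lru_cache

    def make_potential(k, gamma, loss):
        """Potential functions from the recursion (3): phi_0 = loss and
        phi_{i+1}(s) = E_{l ~ u^r_gamma} phi_i(s + e_l).  Brute-force recursion."""
        def u(r):
            p = np.full(k, (1.0 - gamma) / k)
            p[r] += gamma
            return p

        @lru_cache(maxsize=None)
        def phi(i, r, s):                      # s is a tuple so it can be cached
            if i == 0:
                return loss(r, np.array(s))
            p = u(r)
            e = np.eye(k, dtype=int)
            return sum(p[l] * phi(i - 1, r, tuple(np.array(s) + e[l]))
                       for l in range(k))
        return phi

    def zero_one_loss(r, s):
        """Multiclass 0-1 loss (2): 1 if some wrong label ties or beats label r."""
        return float(np.max(np.delete(s, r)) >= s[r])

    def normalized_cost_row(phi, N, i, r, s, k):
        """Row D^i_t[r] of the normalized cost matrix (5) and its weight w^i[t]."""
        e = np.eye(k, dtype=int)
        raw = np.array([phi(N - i, r, tuple(np.array(s) + e[l])) for l in range(k)])
        raw = raw - raw[r]                     # zero cost for the true label
        w = raw.sum()
        return (raw / w if w > 0 else raw), w

    phi = make_potential(k=3, gamma=0.2, loss=zero_one_loss)
    row, w = normalized_cost_row(phi, N=5, i=1, r=0, s=(0, 0, 0), k=3)
    print(row, w)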
After defining cost matrices, the rest of the algorithm is straightforward, except that we have to estimate ‖w^i‖_∞ to normalize the weight. This is necessary because the weak learning condition assumes the weights lie in [0, 1]. We cannot compute the exact value of ‖w^i‖_∞ until the last instance is revealed, which is fine as we need this value only in proving the mistake bound. The estimate w^{i*} of ‖w^i‖_∞ requires specifying the loss, and we postpone the technical details to Appendix B.2; interested readers may directly refer to Lemma 10 before proceeding. Once the learners generate predictions after observing the cost matrices, the final decision is made by simple majority votes. After the true label is revealed, the booster updates the weights and sends the labeled instance with its weight to the weak learners. The pseudocode for the entire algorithm is given in Algorithm 1. The algorithm is named after Beygelzimer et al. [7, OnlineBBM], which is in fact OnlineMBBM with binary labels.

Algorithm 1 Online Multiclass Boost-by-Majority (OnlineMBBM)
  for t = 1, ..., T do
    Receive example x_t
    Set s^0_t = 0 ∈ R^k
    for i = 1, ..., N do
      Set the normalized cost matrix D^i_t according to (5) and pass it to WL^i
      Get the weak prediction l^i_t = WL^i(x_t) and update s^i_t = s^{i−1}_t + e_{l^i_t}
    end for
    Predict ŷ_t := argmax_l s^N_t[l] and receive the true label y_t
    for i = 1, ..., N do
      Set w^i[t] = ∑_{l=1}^k [φ^{y_t}_{N−i}(s^{i−1}_t + e_l) − φ^{y_t}_{N−i}(s^{i−1}_t + e_{y_t})]
      Pass the training example with weight (x_t, y_t, w^i[t]/w^{i*}) to WL^i
    end for
  end for

We now present our first main result regarding the mistake bound of general OnlineMBBM. The proof appears in Appendix B.1; the main idea is adapted from Beygelzimer et al. [7, Lemma 3].

Theorem 2. (Cumulative loss bound for OnlineMBBM)
Suppose that the weak learners and an adversary satisfy the online weak learning condition (1) with parameters δ, γ, and S. For any T and N satisfying δ ≪ 1/N, and for any adaptive sequence of labeled examples generated by the adversary, the final loss suffered by OnlineMBBM satisfies the following inequality with probability at least 1 − Nδ:

∑_{t=1}^T L^{y_t}(s^N_t) ≤ φ^1_N(0) T + S ∑_{i=1}^N w^{i*}.    (6)

Here φ^1_N(0) plays the role of an asymptotic error rate and the second term determines the sample complexity. We will investigate the behavior of these terms under the 0-1 loss in the following section.

3.2 Mistake bound under 0-1 loss and its optimality

From now on, we specify the loss to be the multiclass 0-1 loss defined in (2), which is arguably the most relevant measure in multiclass problems. To present a specific mistake bound, the two terms on the right-hand side of (6) have to be bounded. This requires an approximation of potentials, which is technical and postponed to Appendix B.2; Lemmas 9 and 10 provide the bounds for those terms. We also mention another bound for the weight in the remark after Lemma 10 so that one can use whichever is tighter. Combining these lemmas with Theorem 2 gives the following corollary. The additional constraint on γ comes from Lemma 10.

Corollary 3. (0-1 loss bound of OnlineMBBM)
Suppose that the weak learners and an adversary satisfy the online weak learning condition (1) with parameters δ, γ, and S, where γ < 1/2. For any T and N satisfying δ ≪ 1/N and any adaptive sequence of labeled examples generated by the adversary, OnlineMBBM can generate predictions ŷ_t that satisfy the following inequality with probability at least 1 − Nδ:

∑_{t=1}^T 1(y_t ≠ ŷ_t) ≤ (k − 1) e^{−γ²N/2} T + Õ(k^{5/2} √N S).    (7)

Therefore, in order to achieve error rate ε, it suffices to use N = Θ(1/γ² · ln(k/ε)) weak learners, which gives an excess loss bound of Θ̃(k^{5/2}/γ · S).
Remark. Note that the above excess loss bound gives a sample complexity bound of Θ̃(k^{5/2}/(εγ) · S). If we use the alternative weight bound to get kNS as an upper bound for the second term in (6), we end up having Õ(kNS). This gives an excess loss bound of Θ̃(k/γ² · S).

We now provide lower bounds on the number of learners and the sample complexity for arbitrary online boosting algorithms in order to evaluate the optimality of OnlineMBBM under 0-1 loss. In particular, we construct weak learners that satisfy the online weak learning condition (1) and have almost matching asymptotic error rate and excess loss compared to those of OnlineMBBM in (7). Indeed, we can prove that the number of learners and the sample complexity of OnlineMBBM are optimal up to logarithmic factors, ignoring the influence of the number of classes k. Our bounds are possibly suboptimal up to polynomial factors in k, and the problem of closing this gap remains open. The detailed proof and a discussion of the gap can be found in Appendix B.3. Our lower bound is a multiclass version of Beygelzimer et al. [7, Theorem 3].

Theorem 4. (Lower bounds for N and T) For any γ ∈ (0, 1/4), δ, ε ∈ (0, 1), and S ≥ k ln(1/δ)/γ, there exists an adversary with a family of learners satisfying the online weak learning condition (1) with parameters δ, γ, and S, such that to achieve asymptotic error rate ε, an online boosting algorithm requires at least Ω(1/(k²γ²) · ln(1/ε)) learners and a sample complexity of Ω(k/(εγ) · S).

4 An adaptive algorithm

The online weak learning condition imposes minimal assumptions on the asymptotic accuracy of learners, and obviously it leads to a solid theory of online boosting. However, it has two main practical limitations. The first is the difficulty of estimating the edge γ: given a learner and an adversary, it is by no means a simple task to find the maximum edge that satisfies (1). The second issue is that different learners may have different edges. Some learners may in fact be quite strong with significant edges, while others are just slightly better than a random guess. In this case, OnlineMBBM has to pick the minimum edge, as it assumes a common γ for all weak learners. This is obviously inefficient in that the booster underestimates the strong learners' accuracy. Our adaptive algorithm therefore discards the online weak learning condition to provide a more practical method. Empirical edges γ_1, ..., γ_N (see Section 4.2 for the definition) are measured for the weak learners and are used to bound the number of mistakes made by the boosting algorithm.

4.1 Choice of loss function

Adaboost, proposed by Freund et al. [10], is arguably the most popular boosting algorithm in practice. It aims to minimize the exponential loss, and it has many variants that use other surrogate losses. The main reason for using a surrogate loss is ease of optimization: while 0-1 loss is not even continuous, most surrogate losses are convex. We adopt a surrogate loss for the same reason, and throughout this section we discuss our choice of surrogate loss for the adaptive algorithm.

Exponential loss is a very strong candidate in that it provides a closed form for computing potential functions, which are used to design cost matrices (cf. Mukherjee and Schapire [4, Theorem 13]). One property of the online setting, however, makes it unfavorable. As in OnlineMBBM, each data point will have a different weight depending on the weak learners' performance, and if the algorithm uses exponential loss, this weight will be an exponential function of the difference in weighted cumulative votes.
With such exponentially varying weights across samples, the algorithm might end up depending on a very small portion of the observed samples. This is undesirable because it makes it easier for the adversary to manipulate the sample sequence and perturb the learner.

To avoid exponentially varying weights, Beygelzimer et al. [7] use logistic loss in their adaptive algorithm. Logistic loss is more desirable in that its derivative is bounded and thus the weights are relatively smooth. For this reason, we also use a multiclass version of logistic loss:

L_r(s) := ∑_{l ≠ r} log(1 + exp(s[l] − s[r])).    (8)

We still need to compute potential functions from the logistic loss in order to calculate cost matrices. Unfortunately, Mukherjee and Schapire [4] use a unique property of exponential loss to get a closed form for potential functions, which cannot be adapted to logistic loss. However, the optimal cost matrix induced from exponential loss has a very close connection with the gradient of the loss (cf. Mukherjee and Schapire [4, Lemma 22]). From this, we design our cost matrices as follows:

C^i_t[r, l] := 1 / (1 + exp(s^{i−1}_t[r] − s^{i−1}_t[l])),  if l ≠ r,
C^i_t[r, r] := −∑_{j ≠ r} 1 / (1 + exp(s^{i−1}_t[r] − s^{i−1}_t[j])).    (9)

Readers should note that the row vector C^i_t[r] is simply the gradient of L_r(s^{i−1}_t). Also note that this matrix does not belong to C^eor_1, but it does guarantee that the correct prediction gets the minimal cost.

The choice of logistic loss over exponential loss is somewhat subjective. The undesirable property of exponential loss does not necessarily mean that we cannot build an adaptive algorithm using this loss. In fact, we can slightly modify Algorithm 2 to develop algorithms using different surrogates (exponential loss and squared hinge loss). However, their theoretical bounds are inferior to the one with logistic loss. Interested readers can refer to Appendix D, which assumes an understanding of Algorithm 2.
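A minimal sketch of the loss (8) and the cost matrix (9) (helper names are ours, not the paper's):

    import numpy as np

    def logistic_loss(r, s):
        """Multiclass logistic loss (8) for true label r and cumulative votes s."""
        return np.sum(np.log1p(np.exp(np.delete(s, r) - s[r])))

    def cost_matrix(s):
        """Cost matrix (9): row r is the gradient of L_r at the current votes s."""
        k = len(s)
        C = np.zeros((k, k))
        for r in range(k):
            for l in range(k):
                if l != r:
                    C[r, l] = 1.0 / (1.0 + np.exp(s[r] - s[l]))
            C[r, r] = -np.sum(np.delete(C[r], r))
        return C

    s = np.array([0.5, -0.2, 0.1])
    C = cost_matrix(s)
    # each row sums to zero, and the true label always receives the smallest cost
    print(C.sum(axis=1), np.argmin(C, axis=1))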
Therefore we will useclassical Hedge algorithm (Littlestone and Warmuth [11] and Freund and Schapire [12]) to randomly choosean expert at each iteration with adaptive probability weight depending on each expert’s prediction history.Finally we need to address how to set the weight α it for each weak learner. As our algorithm tries tominimize the cumulative logistic loss, we want to set α it to minimize (cid:80) t L y t ( s i − t + α it e l it ) . This is again aclassical topic in online learning, and we will use online gradient descent , proposed by Zinkevich [13]. Byletting, f it ( α ) := L y t ( s i − t + α e l it ) , we need an online algorithm ensuring (cid:80) t f it ( α it ) ≤ min α ∈ F (cid:80) t f it ( α )+ R i ( T ) where F is a feasible set to be specified later, and R i ( T ) is a regret that is sublinear in T . To apply Zinkevich[13, Theorem 1], we need f it to be convex and F to be compact. The first assumption is met by our choice oflogistic loss, and for the second assumption, we will set F = [ − , . There is no harm to restrict the choiceof α it by F because we can always scale the weights without affecting the result of weighted majority votes.By taking derivatives, we get f it (cid:48) ( α ) = s i − t [ y t ] − s i − t [ l it ] − α ) , if l it (cid:54) = y t − (cid:80) j (cid:54) = y t s i − t [ j ]+ α − s i − t [ y t ]) , if l it = y t . (10)This provides | f it (cid:48) ( α ) | ≤ k − . Now let Π( · ) represent a projection onto F : Π( · ) := max {− , min { , ·}} . Bysetting α it +1 = Π( α it − η t f it (cid:48) ( α it )) where η t = √ k − √ t , we get R i ( T ) ≤ √ k − √ T . Readers should notethat any learning rate of the form η t = c √ t would work, but our choice is optimized to ensure the minimalregret.The pseudocode for Adaboost.OLM is presented in Algorithm 2. In fact, if we put k = 2 , Adaboost.OLMhas the same structure with Adaboost.OL. As in OnlineMBBM, the booster also needs to pass the weightalong with labeled instance. According to (9), it can be inferred that the weight is proportional to − C it [ y t , y t ] . Now we present our second main result that provides a mistake bound of Adaboost.OLM. The main structureof the proof is adopted from Beygelzimer et al. [7, Theorem 4] but in a generalized cost matrix framework.7 lgorithm 2
The pseudocode for Adaboost.OLM is presented in Algorithm 2. In fact, if we put k = 2, Adaboost.OLM has the same structure as Adaboost.OL. As in OnlineMBBM, the booster also needs to pass a weight along with the labeled instance. According to (9), it can be inferred that this weight is proportional to −C^i_t[y_t, y_t].

Algorithm 2 Adaboost.OLM
  Initialize: ∀ i, v^i_1 = 1, α^i_1 = 0
  for t = 1, ..., T do
    Receive example x_t
    Set s^0_t = 0 ∈ R^k
    for i = 1, ..., N do
      Compute C^i_t according to (9) and pass it to WL^i
      Set l^i_t = WL^i(x_t) and s^i_t = s^{i−1}_t + α^i_t e_{l^i_t}
      Set ŷ^i_t = argmax_l s^i_t[l], the prediction of expert i
    end for
    Randomly draw i_t with P(i_t = i) ∝ v^i_t
    Predict ŷ_t = ŷ^{i_t}_t and receive the true label y_t
    for i = 1, ..., N do
      Set α^i_{t+1} = Π(α^i_t − η_t f^i_t'(α^i_t)) using (10) and η_t = 2√2 / ((k−1)√t)
      Set w^i[t] = −C^i_t[y_t, y_t] / (k−1) and pass (x_t, y_t, w^i[t]) to WL^i
      Set v^i_{t+1} = v^i_t · exp(−1(y_t ≠ ŷ^i_t))
    end for
  end for

We now present our second main result, which provides a mistake bound for Adaboost.OLM. The main structure of the proof is adopted from Beygelzimer et al. [7, Theorem 4], but in a generalized cost matrix framework. The proof appears in Appendix C.
Theorem 5. (Mistake bound of Adaboost.OLM)
For any T and N, with probability at least 1 − δ, the number of mistakes made by Adaboost.OLM satisfies the following inequality:

∑_{t=1}^T 1(y_t ≠ ŷ_t) ≤ 8(k−1) / (∑_{i=1}^N γ_i²) · T + Õ(kN² / ∑_{i=1}^N γ_i²),

where the Õ notation suppresses dependence on log(1/δ).
Remark. Note that this theorem naturally implies Beygelzimer et al. [7, Theorem 4]. The difference in the coefficients is due to the different scaling of γ_i; in fact, their γ_i ranges over [−1/2, 1/2].

Now that we have established a mistake bound, it is worthwhile to compare it with the bound for the optimal boosting algorithm. Suppose the weak learners satisfy the weak learning condition (1) with edge γ; for simplicity, we ignore the excess loss S. As we have γ_i = ∑_t C^i_t[y_t, l^i_t] / ∑_t C^i_t[y_t, y_t] ≥ γ with high probability, the mistake bound becomes 8(k−1)/(γ²N) · T + Õ(kN/γ²). In order to achieve error rate ε, Adaboost.OLM requires N ≥ 8(k−1)/(εγ²) learners and a sample size of T = Ω̃(k²/(ε²γ⁴)). Note that OnlineMBBM requires N = Ω(1/γ² · ln(k/ε)) and T = min{Ω̃(k^{5/2}/(εγ)), Ω̃(k/(εγ²))}. Adaboost.OLM is thus suboptimal, but due to its adaptive feature, its performance on real data is quite comparable to that of OnlineMBBM.

5 Experiments

We compare the new algorithms to existing ones for online boosting on several UCI data sets, each with k classes. Table 1 contains some highlights, with additional results and experimental details in Appendix E. Here we show both the average accuracy on the final 20% of each data set and the average run time for each algorithm. Best decision tree gives the performance of the best of 100 online decision trees fit using the VFDT algorithm of Domingos and Hulten [14]; these trees were used as the weak learners in all other algorithms. Best MBBM gives the performance of OnlineMBBM run with the best choice of the edge parameter γ. (Code is available at https://github.com/yhjung88/OnlineBoostingWithVFDT.)

Despite being theoretically weaker, Adaboost.OLM often demonstrates similar accuracy and sometimes outperforms Best MBBM, which exemplifies the power of adaptivity in practice. This power comes from the ability to use diverse learners efficiently, instead of being limited by the strength of the weakest learner. OnlineMBBM suffers from high computational cost, as well as the difficulty of choosing the correct value of γ, which in general is unknown; but when the correct value of γ is used, it performs very well. Finally, in all cases the Adaboost.OLM and OnlineMBBM algorithms outperform both the best tree and the preexisting Online Boosting algorithm, while also enjoying theoretical accuracy bounds.

Table 1: Comparison of algorithm accuracy on the final 20% of each data set and run time in seconds. Best accuracy on a data set is reported in bold.

Data set    k    Best decision tree    Online Boosting    Adaboost.OLM    Best MBBM
Balance     3    0.768 (8)             0.772 (19)         0.754 (20)      0.914 (56)
Mushroom    2    0.999 (241)
Acknowledgments
We acknowledge the support of NSF under grants CAREER IIS-1452099 and CIF-1422157.
References

[1] Marcin Korytkowski, Leszek Rutkowski, and Rafał Scherer. Fast image classification by boosting fuzzy classifiers. Information Sciences, 327:175–182, 2016.
[2] Xiao-Lei Zhang and DeLiang Wang. Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection. In INTERSPEECH, pages 1534–1538, 2014.
[3] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[4] Indraneel Mukherjee and Robert E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 14(Feb):437–497, 2013.
[5] Nikunj C. Oza. Online bagging and boosting. In IEEE International Conference on Systems, Man and Cybernetics, volume 3, pages 2340–2345. IEEE, 2005.
[6] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. An online boosting algorithm with theoretical justifications. In ICML, 2012.
[7] Alina Beygelzimer, Satyen Kale, and Haipeng Luo. Optimal and adaptive algorithms for online boosting. In ICML, 2015.
[8] Shang-Tse Chen, Hsuan-Tien Lin, and Chi-Jen Lu. Boosting with online binary learners for the multiclass bandit problem. In Proceedings of the 31st ICML, pages 342–350, 2014.
[9] Hanzhang Hu, Wen Sun, Arun Venkatraman, Martial Hebert, and Andrew Bagnell. Gradient boosting on stochastic data streams. In Artificial Intelligence and Statistics, pages 595–603, 2017.
[10] Yoav Freund, Robert Schapire, and N. Abe. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence, 14(771-780):1612, 1999.
[11] Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. In 30th Annual Symposium on Foundations of Computer Science, pages 256–261. IEEE, 1989.
[12] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37. Springer, 1995.
[13] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th ICML, 2003.
[14] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 71–80. ACM, 2000.
[15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. In COLT, pages 207–232, 2011.
[16] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988.
[17] Volodimir G. Vovk. Aggregating strategies. In Proceedings of the Third Workshop on Computational Learning Theory, pages 371–383. Morgan Kaufmann, 1990.
[18] Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[19] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
[20] Robert E. Schapire. Drifting games. Machine Learning, 43(3):265–291, 2001.
[21] Eric V. Slud. Distribution inequalities for the binomial law. The Annals of Probability, pages 404–412, 1977.
[22] C. L. Blake and C. J. Merz. UCI machine learning repository, 1998. URL http://archive.ics.uci.edu/ml.
[23] Clara Higuera, Katheleen J. Gardiner, and Krzysztof J. Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome, 2015. URL https://doi.org/10.1371/journal.pone.0129126.
[24] Wallace Ugulino, Débora Cardador, Katia Vega, Eduardo Velloso, Ruy Milidiú, and Hugo Fuks. Wearable computing: Accelerometers' data classification of body postures and movements. In Advances in Artificial Intelligence - SBIA 2012, pages 52–61. Springer, 2012.
Appendix A Link between batch and online weak learning conditions

Let us begin the section by introducing the weak learning condition in the batch setting. Mukherjee and Schapire [4] have identified a necessary and sufficient condition for boostability. We will focus on a sufficient condition for reasons of computational tractability. In the batch setting, the entire training set is revealed. Let D := {(x_t, y_t) | t = 1, ..., T} be the training set and define a family of cost matrices:

C^eor := {C ∈ R^{T×k} | ∀ t, C[t, y_t] = min_{l ∈ [k]} C[t, l]}.

The superscript "eor" stands for "edge-over-random." We warn the readers not to confuse C^eor with C^eor_1. They both impose similar row constraints, but the matrices in these sets have different dimensions: T × k and k × k, respectively, and C^eor_1 also has an additional normalization constraint. Note that C^eor provides one cost vector per instance, whereas C^eor_1 provides a matrix. This is necessary because if an adversary passed only a vector to an online learner, then the learner could simply make the prediction that minimizes the cost. Furthermore, in the online boosting setting, the booster does not know the true label when it computes a cost matrix.

The authors prove that if a weak learning space H satisfies the condition described in Definition 6, then it is boostable, which means there exists a convex linear combination of hypotheses in H that perfectly classifies D.

Definition 6. (Batch setting weak learning condition, Mukherjee and Schapire [4])
Suppose D is fixed and C^eor is defined as above. A weak learning space H is said to satisfy the weak learning condition (C^eor, U_γ) if for every C ∈ C^eor, one can find a weak hypothesis h ∈ H such that

∑_{t=1}^T C[t, h(x_t)] ≤ C • U'_γ.    (11)

Now we present how our online weak learning condition (Definition 1) is naturally derived from its batch setting counterpart (Definition 6). We extend the arguments of Beygelzimer et al. [7]. The batch setting condition (11) can be interpreted as making the following two implicit assumptions:

1. (Richness condition) For any C ∈ C^eor, there is some hypothesis h ∈ H such that

∑_{t=1}^T C[t, h(x_t)] ≤ C • U'_γ.
2. (Agnostic learnability) For any C ∈ C^eor and ε ∈ (0, 1), there is an algorithm which can compute a nearly optimal hypothesis h ∈ H, i.e.,

∑_{t=1}^T C[t, h(x_t)] ≤ inf_{h' ∈ H} ∑_{t=1}^T C[t, h'(x_t)] + εT.

For the online setting, we keep the richness assumption with C being the matrix consisting of rows w_t C_t[y_t], and the data being drawn by a fixed adversary. That is to say, it is the online richness condition that imposes a restriction on the adversary, because the condition cannot be met by any H against a fully adaptive adversary. For example, suppose an adversary draws samples uniformly at random from the set {(x, 1), ..., (x, k)} for some fixed x ∈ X. There is no weak learning space H that satisfies the online richness condition against this adversary. The agnostic learnability assumption is also replaced by an online agnostic learnability assumption. We present online versions of the above two assumptions:

1'. (Online richness condition) For any sample length T, any sequence of labeled examples {(x_t, y_t) | t = 1, ..., T} generated by a fixed adversary, and any series of pairs of weight and cost matrix {(w_t, C_t) ∈ [0, 1] × C^eor_1 | t = 1, ..., T}, there is some hypothesis h ∈ H such that

∑_{t=1}^T w_t C_t[y_t, h(x_t)] ≤ C • U'_γ,    (12)

where C ∈ R^{T×k} consists of rows w_t C_t[y_t].

2'. (Online agnostic learnability) For any sample length T, any δ ∈ (0, 1), and any adaptively chosen series of pairs of weight and cost matrix {(w_t, C_t) ∈ [0, 1] × C^eor_1 | t = 1, ..., T}, there is an online algorithm which can generate predictions ŷ_t such that with probability at least 1 − δ,

∑_{t=1}^T w_t C_t[y_t, ŷ_t] ≤ inf_{h ∈ H} ∑_{t=1}^T w_t C_t[y_t, h(x_t)] + R_δ(T),    (13)

where R_δ : N → R is a sublinear regret.

Daniely et al. [15] extensively investigate agnostic learnability in online multiclass problems by introducing the following generalized Littlestone dimension (Littlestone [16]) of a hypothesis family H. Consider a binary rooted tree whose internal nodes are labeled by elements of X and whose edges are labeled by elements of [k] such that two edges from the same parent have different labels. The tree is shattered by H if, for every path from the root to a leaf that traverses the internal nodes x_1, ..., x_d, there is a hypothesis h ∈ H such that h(x_i) corresponds to the label of the edge from x_i to x_{i+1}. The Littlestone dimension of H is the maximal depth of a complete binary tree that is shattered by H (or ∞ if one can build arbitrarily deep shattered trees). The authors prove that an optimal online algorithm has sublinear regret under the expected (with respect to the randomness of the algorithm) 0-1 loss if the Littlestone dimension of H is finite.

Similarly, we prove in Lemma 7 that condition (13) is satisfied if H has a finite Littlestone dimension. We need to slightly modify their result in two ways: one is to replace the expectation by a probabilistic argument, and the other is to replace the 0-1 loss by our cost matrix framework. Both issues can be resolved by replacing an auxiliary lemma used by Daniely et al. [15] without changing the main structure.

Lemma 7.
Suppose a weak learning space H has a finite Littlestone dimension d and an adversary chooses examples in a fully adaptive manner. For any sample length T and for any adaptively chosen series of pairs of weight and cost matrix {(w_t, C_t) ∈ [0, 1] × C^eor_1 | t = 1, ..., T}, with probability 1 − δ, the online agnostic learnability condition (13) is satisfied with the following sublinear regret:

R_δ(T) = √(T d ln(Tk) / 2) + √(T ln(1/δ) / 2).
Proof. We first introduce an online algorithm with experts. Suppose we have a fixed pool of experts of size N, and keep our cost matrix framework. Each expert f^i suffers cumulative cost C^i_T := ∑_{t=1}^T w_t C_t[y_t, f^i(x_t)]. At each iteration, an online algorithm chooses to follow one expert and incurs the cost w_t C_t[y_t, ŷ_t]; its goal is to perform as well as the best expert. That is to say, the algorithm wants to keep its cumulative cost ∑_{t=1}^T w_t C_t[y_t, ŷ_t] not much larger than min_{i ∈ [N]} C^i_T. This learning framework, known as prediction with expert advice or the weighted majority framework, has been thoroughly investigated (e.g., Littlestone and Warmuth [11] and Vovk [17]). We will specifically use Algorithm 3 (LEA), which achieves a sublinear regret of √(T ln N / 2) + √(T ln(1/δ) / 2) with probability 1 − δ (cf. Cesa-Bianchi and Lugosi [18, Corollary 4.2]). The analysis requires the loss to be bounded, which is satisfied in our cost matrix framework. Readers might raise the question that our loss function changes at each iteration, but the proof still works as long as the loss is bounded; interested readers might refer to Hazan et al. [19, Section 1.3.3].

Algorithm 3 Learning with Expert Advice (LEA)
  Input: T, the time horizon; N, the number of experts
  Set η = √((8 ln N) / T)
  Set C^i_0 = 0 for all i
  for t = 1, ..., T do
    Receive example x_t
    Receive expert advice (f^1_t, ..., f^N_t) ∈ [k]^N
    Predict ŷ_t = f^i_t with probability proportional to exp(−η C^i_{t−1})
    Receive the true label y_t
    Update C^i_t = C^i_{t−1} + w_t C_t[y_t, f^i_t] for all i
  end for
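The exponential-weights scheme of Algorithm 3 can be sketched as follows (an illustrative implementation under assumed interfaces, not the paper's code).

    import numpy as np

    def lea(T, experts, stream, rng=np.random.default_rng(0)):
        """Sketch of Algorithm 3 (LEA).  experts: list of functions x -> label;
        stream: iterable of (x_t, y_t, w_t, C_t) with C_t a k x k cost matrix."""
        N = len(experts)
        eta = np.sqrt(8.0 * np.log(N) / T)
        cum_cost = np.zeros(N)                        # C^i_{t-1} for each expert
        total = 0.0
        for _, (x, y, w, C) in zip(range(T), stream):
            advice = [f(x) for f in experts]
            p = np.exp(-eta * (cum_cost - cum_cost.min()))   # stabilized weights
            p /= p.sum()
            i = rng.choice(N, p=p)                    # follow one expert at random
            total += w * C[y, advice[i]]
            cum_cost += np.array([w * C[y, a] for a in advice])
        return total, cum_cost.min()                  # algorithm's cost vs. best expert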
To apply this result in our case, we need to construct a finite set of experts whose best performance is as good as that of the hypotheses in H. In fact, in the proof of Daniely et al. [15, Theorem 25], the authors construct a set E of size N ≤ (Tk)^d such that for every hypothesis h ∈ H there is an expert f ∈ E which coincides with h on the given examples x_1, ..., x_T. Applying the LEA result to E shows that with probability 1 − δ, the regret is bounded above by √(T d ln(Tk) / 2) + √(T ln(1/δ) / 2), which concludes the proof.

One remark is that the proof of Lemma 7 only uses the boundedness condition of C^eor_1.

Now we are ready to demonstrate that our online weak learning condition is indeed naturally derived from its batch setting counterpart. The following theorem shows that the two conditions (12) and (13) directly imply the online weak learning condition (1). In other words, if the weak learning space H accompanied by an adversary is rich enough to contain a hypothesis that slightly outperforms a random guess and has a reasonably small dimension, then we can find an excess loss S that satisfies (1). This is a generalization of Beygelzimer et al. [7, Lemma 2]. Note that we impose the additional assumption that w_t ≥ m > 0 for all t. In case the learner encounters a zero weight, it can simply ignore the instance, so the assumption is not too artificial.

Theorem 8. (Link between batch and online weak learning conditions)
Suppose a pair of a weak learning space H and an adversary satisfies the online richness assumption (12) with edge 2γ and the online agnostic learnability assumption (13) with mistake probability δ and sublinear regret R_δ(·). Additionally, assume there exists a positive constant m such that w_t ≥ m for all t. Then the online learning algorithm satisfies the online weak learning condition (1) with mistake probability δ, edge γ, and excess loss S = max_T (R_δ(T) − γmT/k).

Proof. Fix δ ∈ (0, 1) and a series of pairs of weight and cost matrix {(w_t, C_t) ∈ [0, 1] × C^eor_1 | t = 1, ..., T}, and let C ∈ R^{T×k} consist of rows w_t C_t[y_t]. First note that by sublinearity of R_δ(·), S is finite. According to (13) and (12), the online learning algorithm can generate predictions ŷ_t such that, with probability 1 − δ,

∑_{t=1}^T w_t C_t[y_t, ŷ_t] ≤ C • U'_{2γ} + R_δ(T).

Thus it suffices to show that

C • U'_{2γ} + R_δ(T) ≤ C • U'_γ + S.    (14)

Since the correct label gets zero cost and the row C[t] has ℓ_1 norm w_t, we have

C • U'_γ = (1−γ)/k ‖w‖_1 = (1−γ)/k ∑_{t=1}^T w_t.

Plugging this into (14), we get

C • U'_{2γ} − C • U'_γ + R_δ(T) = −γ/k ∑_{t=1}^T w_t + R_δ(T) ≤ −γmT/k + R_δ(T) ≤ S.

The first inequality holds since w_t ≥ m, and the second inequality holds by the definition of S, which completes the proof.

Lemma 7 and Theorem 8 suggest an implicit relation between δ and S in (1). If we want a probabilistically stronger weak learning condition, R_δ(T) in Lemma 7 gets bigger, which results in a larger S = max_T (R_δ(T) − γmT/k).
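As a quick numerical illustration of this relation (all parameter values below are assumptions chosen only for the example), the excess loss of Theorem 8 can be evaluated by scanning over T with the Lemma 7 regret.

    import numpy as np

    def excess_loss(gamma, m, k, d, delta, T_max=10**6):
        """Evaluate S = max_T ( R_delta(T) - gamma*m*T/k ) with the Lemma 7 regret
        R_delta(T) = sqrt(T*d*ln(T*k)/2) + sqrt(T*ln(1/delta)/2).  Brute-force scan."""
        T = np.arange(1, T_max + 1, dtype=float)
        R = np.sqrt(T * d * np.log(T * k) / 2) + np.sqrt(T * np.log(1 / delta) / 2)
        return np.max(R - gamma * m * T / k)

    print(excess_loss(gamma=0.1, m=0.5, k=5, d=10, delta=0.01))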
B.1 Proof of Theorem 2
Proof.
Proof. For ease of notation, we assume that the edge equals γ and the true label is r unless otherwise specified; that is, u stands for u^r_γ and φ_i for φ^r_i. Rewriting (3),

φ_{N−i+1}(s^{i−1}_t) = E_{l∼u} φ_{N−i}(s^{i−1}_t + e_l) = C^i_t[r] • u = C^i_t[r] • (u − e_{l^i_t}) + φ_{N−i}(s^i_t),

where C^i_t is defined in (4). The last equation holds due to the relation s^i_t = s^{i−1}_t + e_{l^i_t}. Also note that ‖u‖_1 = ‖e_{l^i_t}‖_1 = 1, and thus subtracting a common number from every component of C^i_t[r] does not affect the dot product with (u − e_{l^i_t}). Therefore, by introducing the normalized cost matrix D^i_t as in (5) and w^i[t] as in Algorithm 1, we may write

φ^{y_t}_{N−i+1}(s^{i−1}_t) = w^i[t] D^i_t[y_t] • (u^{y_t}_γ − e_{l^i_t}) + φ^{y_t}_{N−i}(s^i_t)
 = w^i[t] D^i_t[y_t] • u^{y_t}_γ − w^i[t] D^i_t[y_t, l^i_t] + φ^{y_t}_{N−i}(s^i_t)
 = w^i[t] (1−γ)/k − w^i[t] D^i_t[y_t, l^i_t] + φ^{y_t}_{N−i}(s^i_t).    (15)

The last equality holds because D^i_t is normalized and D^i_t[y_t, y_t] = 0. If D^i_t[y_t] is a zero vector, then by definition w^i[t] = 0 and the equality still holds. Summing (15) over t, we get

∑_{t=1}^T φ^{y_t}_{N−i+1}(s^{i−1}_t) = (1−γ)/k ‖w^i‖_1 − ∑_{t=1}^T w^i[t] D^i_t[y_t, l^i_t] + ∑_{t=1}^T φ^{y_t}_{N−i}(s^i_t).

By the online weak learning condition we have, with probability 1 − δ (recall that w^{i*} estimates ‖w^i‖_∞),

∑_{t=1}^T (w^i[t]/w^{i*}) D^i_t[y_t, l^i_t] ≤ (1−γ)/k · ‖w^i‖_1/w^{i*} + S.

From this, we can argue that

∑_{t=1}^T φ^{y_t}_{N−i+1}(s^{i−1}_t) + S w^{i*} ≥ ∑_{t=1}^T φ^{y_t}_{N−i}(s^i_t).

Since the above inequality holds for every i, summing over i gives

∑_{t=1}^T φ^{y_t}_N(0) + S ∑_{i=1}^N w^{i*} ≥ ∑_{t=1}^T φ^{y_t}_0(s^N_t),

which holds with probability 1 − Nδ by a union bound. By symmetry, φ^{y_t}_N(0) = φ^1_N(0) regardless of the true label y_t, and by the definition of the potential function (3), φ^{y_t}_0(s^N_t) = L^{y_t}(s^N_t), which completes the proof.

B.2 Bounding the terms in the general bound under 0-1 loss

Even though OnlineMBBM has a promising theoretical justification, it would be infeasible if the computation of potential functions took too long or if the behavior of the asymptotic error rate φ^1_N(0) were too complicated to approximate. Fortunately, for the 0-1 loss we get a computationally tractable algorithm with a vanishing error rate. The use of potential functions in the binary boosting setup is thoroughly discussed by Schapire [20]. In the binary setting under 0-1 loss, the potential function has a closed form which dramatically reduces the computational complexity. Unfortunately, the multiclass version does not have a closed form, but Mukherjee and Schapire [4] introduce a heuristic to compute it in reasonable time:

φ^r_i(s) = 1 − ∑_{(x_1, ..., x_k) ∈ A} i!/(x_1! ··· x_k!) ∏_{l=1}^k u_l^{x_l},    (16)

where A := {(x_1, ..., x_k) ∈ Z^k | x_1 + ··· + x_k = i, ∀ l: x_l ≥ 0, and ∀ l ≠ r: x_l + s[l] < x_r + s[r]}, and u^r_γ = (u_1, ..., u_k). By using dynamic programming, the right-hand side of (16) can be computed in polynomial time in i, k, and ‖s‖_1. In our setting, where the number of learners is fixed to be N, the computation can be done in polynomial time in k and N because ‖s‖_1 is bounded by N. A sketch of this computation appears below.
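The following sketch (ours; written for clarity rather than speed) computes φ^r_i(s) for the 0-1 loss via (16): it conditions on the number of draws equal to r and convolves the remaining labels' counts under the constraint defining A.

    from math import factorial

    def potential_01(i, r, s, gamma):
        """phi^r_i(s) for the 0-1 loss via (16): the probability that label r does
        not strictly win after i draws from u^r_gamma are added to the votes s."""
        k = len(s)
        u = [(1.0 - gamma) / k] * k
        u[r] += gamma
        others = [l for l in range(k) if l != r]
        p_win = 0.0
        for j in range(i + 1):                     # j = number of draws equal to r
            caps = [j + s[r] - s[l] - 1 for l in others]
            if min(caps) < 0:                      # some label already ties or beats r
                continue
            m_total = i - j
            # h[m] = sum over valid counts (x_l, l != r) summing to m of prod u_l^x_l / x_l!
            h = [1.0] + [0.0] * m_total
            for l, cap in zip(others, caps):
                new = [0.0] * (m_total + 1)
                for m in range(m_total + 1):
                    for x in range(min(cap, m) + 1):
                        new[m] += h[m - x] * u[l] ** x / factorial(x)
                h = new
            p_win += factorial(i) / factorial(j) * u[r] ** j * h[m_total]
        return 1.0 - p_win

    print(potential_01(3, 0, [0, 0, 0], 0.2))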
To the best of our knowledge, there is no way to compute the potential function in polynomial time if we start from the necessary and sufficient weak learning condition (the algorithm given by Mukherjee and Schapire [4] takes exponential time in the number of learners), and this is the main reason that we use the sufficient condition. Recall from (6) that φ^1_N(0) plays the role of an asymptotic error rate and the second term determines the sample complexity. The following two lemmas provide bounds for both terms.

By applying Hoeffding's inequality, we prove in Lemma 9 that φ^1_N(0) vanishes exponentially fast as N grows. That is to say, to get a satisfactory accuracy, we do not need too many learners. We also note that we can decide N before the learning process begins, which is logically plausible.

Lemma 9.
Under the same setting as in Theorem 2, but with the particular choice of 0-1 loss, we may bound φ^1_N(0) as follows:

φ^1_N(0) ≤ (k − 1) exp(−γ²N/2).    (17)
Proof. We reinterpret φ^1_N(0) in (16). Imagine that we draw numbers N times from [k], where the probability that a number i is drawn is u^1_γ[i]. That is to say, 1 has the highest probability, (1−γ)/k + γ, and the other numbers have equal probability (1−γ)/k. Then φ^1_N(0) can be interpreted as the probability that the number drawn most often out of the N draws is not 1. Let A_i denote the event that the number i gets at least as many votes as the number 1. Then by a union bound,

φ^1_N(0) = P(A_2 ∪ ··· ∪ A_k) ≤ ∑_{l=2}^k P(A_l) = (k − 1) P(A_2).    (18)

The last equality holds by symmetry. To compute P(A_2), imagine that we draw 1 with probability (1−γ)/k + γ, −1 with probability (1−γ)/k, and 0 otherwise. P(A_2) is equal to the probability that after N independent draws, the sum of the N i.i.d. random numbers is non-positive. Thus by Hoeffding's inequality, we get

P(A_2) ≤ exp(−γ²N/2).    (19)

Combining (18) and (19) completes the proof.
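As an illustrative sanity check of (17) (a simulation of ours, not from the paper; labels are 0-indexed in the code), one can estimate φ^1_N(0) by Monte Carlo and compare it with the bound.

    import numpy as np

    def estimate_phi_N0(N, k, gamma, trials=200000, rng=np.random.default_rng(0)):
        """Monte Carlo estimate of phi^1_N(0): the probability that label 0 fails
        to strictly win a plurality of N draws from u^0_gamma."""
        p = np.full(k, (1.0 - gamma) / k)
        p[0] += gamma
        counts = rng.multinomial(N, p, size=trials)
        return np.mean(counts[:, 1:].max(axis=1) >= counts[:, 0])

    N, k, gamma = 100, 5, 0.1
    print(estimate_phi_N0(N, k, gamma), (k - 1) * np.exp(-gamma**2 * N / 2))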
Now suppose we have fixed N based on the desired asymptotic accuracy. Since the 0-1 loss is bounded in [0, 1], so are the potential functions. Then by the definition of the weights (cf. Algorithm 1), ‖w^i‖_∞ is trivially bounded above by k, which means we can use w^{i*} = k for all i. Thus the second term of (6) is bounded above by kNS, which is a valid bound. However, Lemma 10 allows a tighter one.

Lemma 10. Under the same setting as in Theorem 2, but with the particular choice of 0-1 loss and the additional constraint γ < 1/2, we may bound ‖w^i‖_∞ by

‖w^i‖_∞ ≤ c k^{5/2} / √(N − i),    (20)

where c is a universal constant that can be determined before the algorithm begins.

Proof. We start by providing a bound on φ^r_m(s + e_l) − φ^r_m(s + e_r). First note that it is non-negative, as potential functions are proper. Again using the random draw framework as in the proof of Lemma 9 (now r has the largest probability of being drawn), this value corresponds to the probability that after m draws, the number r wins the majority votes if the count starts from s + e_r but loses if the count starts from s + e_l. Let X_1, ..., X_k denote the number of draws of each number out of the m draws and define the events

A_l := {(X_r + s[r]) − (X_l + s[l]) ∈ {0, 1}}.

Then it can be checked that

φ^r_m(s + e_l) − φ^r_m(s + e_r)
 = P(∃ l' s.t. X_{l'} + s[l'] + e_l[l'] ≥ X_r + s[r]) − P(∃ l' s.t. X_{l'} + s[l'] ≥ X_r + s[r] + 1)
 ≤ P(∃ l' s.t. X_{l'} + s[l'] + e_l[l'] ≥ X_r + s[r] and ∀ l', X_r + s[r] ≥ X_{l'} + s[l'])
 ≤ P(∃ l' s.t. X_{l'} + s[l'] + e_l[l'] ≥ X_r + s[r] ≥ X_{l'} + s[l'])
 = P(∪_{l ≠ r} A_l) ≤ ∑_{l ≠ r} P(A_l).    (21)

The first inequality holds by P(A) − P(B) ≤ P(A \ B). The individual probabilities can be written as

P(A_l) = P(X_r − X_l = s[l] − s[r]) + P(X_r − X_l = s[l] − s[r] + 1) ≤ 2 sup_n P(X_r − X_l = n).    (22)

We can prove by applying the Berry–Esseen theorem that the last probability is O(√(k/m)). Let Y_1, ..., Y_m be a sequence of i.i.d. random variables such that Y_j ∈ {−1, 0, 1} and

P(Y_j = 1) = (1−γ)/k + γ,   P(Y_j = −1) = (1−γ)/k.
Note that E Y_j = γ and Var(Y_j) = 2(1−γ)/k + γ(1−γ) =: σ². It can easily be checked that Y := ∑_{j=1}^m Y_j has the same distribution as X_r − X_l. Now we approximate Y by a Gaussian random variable W ∼ N(mγ, mσ²). Let F_W and F_Y denote the CDFs of W and Y, respectively, and let f denote the density of W. First note that

|P(Y = n) − ∫_{n−1}^n f(w) dw| = |(F_Y(n) − F_Y(n−1)) − (F_W(n) − F_W(n−1))| ≤ |F_Y(n) − F_W(n)| + |F_Y(n−1) − F_W(n−1)|.

We can apply the Berry–Esseen theorem to the last CDF differences, which provides

|P(Y = n) − ∫_{n−1}^n f(w) dw| ≤ 2Cρ / (σ³√m),    (23)

where C is the universal constant that appears in the Berry–Esseen theorem and ρ := E|Y_j − γ|³. As Y_j is a bounded random variable, we have ρ = E|Y_j − γ|³ ≤ (1 + γ) E|Y_j − γ|² = (1 + γ)σ² ≤ 2σ². Hence

|P(Y = n) − ∫_{n−1}^n f(w) dw| ≤ 4C / (σ√m).

By simple algebra, we can deduce

P(Y = n) ≤ ∫_{n−1}^n f(w) dw + 4C / (σ√m) ≤ sup_{w ∈ R} f(w) + 4C / (σ√m) = 1/(√(2πm) σ) + 4C / (σ√m).    (24)

Using the fact that γ < 1/2, we can show

σ² = 2(1−γ)/k + γ(1−γ) ≥ 1/k.

Plugging this into (24) gives

P(Y = n) ≤ (1/(σ√m)) (1/√(2π) + 4C) ≤ C' √(k/m),    (25)

where C' = 1/√(2π) + 4C. Combining (21), (22), (25), and the fact that Y and X_r − X_l have the same distribution, we prove

φ^r_m(s + e_l) − φ^r_m(s + e_r) ≤ C' k √(k/m).    (26)

The proof is complete by observing that w^i[t] = ∑_{l=1}^k [φ^{y_t}_{N−i}(s^{i−1}_t + e_l) − φ^{y_t}_{N−i}(s^{i−1}_t + e_{y_t})].
Remark. By summing (20) over i, we can bound the second term of (6) by O(k^{5/2}√N) S. Compared to the aforementioned bound kNS, Lemma 10 reduces the dependence on N, but as a tradeoff the dependence on k increases. The optimal bound for this term remains open, but in the case that the number of classes k is fixed to be moderate, Lemma 10 provides the better bound.
B.3 Proof of lower bounds and discussion of gap
We begin by proving Theorem 4.
Proof.
Proof. At time t, an adversary draws a label y_t uniformly at random from [k], and the weak learners independently make predictions with respect to a probability distribution p_t ∈ Δ[k]. This can be achieved if the adversary draws x_t ∈ R^N where x_t[1], ..., x_t[N] | y_t are conditionally independent with conditional distribution p_t, and WL^i predicts x_t[i]. The booster can only make a final decision by weighted majority votes of the N weak learners. We manipulate p_t in such a way that the weak learners satisfy (1), but the booster's performance is close to that of OnlineMBBM.

First we note that since the C_t[y_t, ŷ_t] used in (1) is bounded in [0, 1], the Azuma–Hoeffding inequality implies that if a weak learner makes its prediction ŷ_t according to the probability distribution p_t at time t, then with probability 1 − δ we have

∑_{t=1}^T w_t C_t[y_t, ŷ_t] ≤ ∑_{t=1}^T w_t C_t[y_t] • p_t + √(2‖w‖_2² ln(1/δ))
 ≤ ∑_{t=1}^T w_t C_t[y_t] • p_t + γ‖w‖_2²/k + k ln(1/δ)/(2γ)
 ≤ ∑_{t=1}^T w_t C_t[y_t] • p_t + γ‖w‖_1/k + k ln(1/δ)/(2γ),    (27)

where the second inequality holds by the AM–GM inequality and the last inequality holds because w_t ∈ [0, 1].

We start by providing a lower bound on the number of weak learners. Let p_t = u^{y_t}_{2γ} for all t, which is possible due to the constraint γ < 1/4. Then the last line of (27) becomes

∑_{t=1}^T w_t C_t[y_t] • u^{y_t}_{2γ} + γ‖w‖_1/k + k ln(1/δ)/(2γ) = (1−2γ)/k ‖w‖_1 + γ‖w‖_1/k + k ln(1/δ)/(2γ) ≤ (1−γ)/k ‖w‖_1 + S,

where the equality follows from the facts that C_t[y_t, y_t] = 0 and ‖C_t[y_t]‖_1 = 1, and the last inequality uses S ≥ k ln(1/δ)/γ. Thus the weak learners indeed satisfy the online weak learning condition with edge γ and excess loss S. Now suppose a booster assigns weights α_i to the weak learners. Without loss of generality, we may assume the weights are normalized so that ∑_{i=1}^N α_i = 1. Adopting the argument of Schapire and Freund [3, Section 13.2.6], we show that the optimal choice of weights is (1/N, ..., 1/N). Fix t and let l_i denote the prediction made by WL^i. Noting that P(y_t = y) = 1/k is constant, we can deduce

P(y_t = y | l_1, ..., l_N) = P(l_1, ..., l_N | y_t = y) P(y_t = y) / P(l_1, ..., l_N) ∝ P(l_1, ..., l_N | y_t = y) = ∏_{i=1}^N p^{1(l_i = y)} q^{1(l_i ≠ y)},

where f ∝ g means f(y)/g(y) does not depend on y, p = u^{y_t}_{2γ}[y_t] = (1−2γ)/k + 2γ, and q = u^{y_t}_{2γ}[l] = (1−2γ)/k for l ≠ y_t. Taking logarithms, we get

log P(y_t = y | l_1, ..., l_N) = C + log p ∑_{i=1}^N 1(l_i = y) + log q ∑_{i=1}^N 1(l_i ≠ y) = C + N log q + log(p/q) ∑_{i=1}^N 1(l_i = y).

Therefore the optimal decision after observing l_1, ..., l_N is to choose the label y that maximizes ∑_{i=1}^N 1(l_i = y), or equivalently, to take simple majority votes.

To lower bound the error rate, we again use the random draw framework from the proof of Lemma 9. Without loss of generality, we may assume that the true label is 1. Let A_i denote the event that the number i gets at least as many votes as the number 1 in the majority vote. Then we have

P(booster makes an error) ≥ P(A_2).    (28)

Now we need a lower bound for P(A_2). To this end, let {Y_j} be a series of i.i.d. random variables such that Y_j ∈ {−1, 0, 1} with
1) = 1 − γk =: p − . Then P ( A ) = P ( Y < where Y := (cid:80) Ni =1 Y i .Now let M be the number of j such that Y j (cid:54) = 0 . By conditioning on M , we can write P ( Y < | M = m ) = P ( B ≤ m , where B ∼ binom ( m, p p + p − ) . By Slud’s inequality [21, Theorem 2.1], we have P ( B ≤ m ≥ P ( Z ≥ √ m p − (cid:112) p (1 − p ) ) , where Z follows a standard normal distribution and p = p p + p − . Now using tail bound on normal distribution,we get P ( B ≤ m ≥ Ω(exp( − m ( p − / p (1 − p ) ))= Ω(exp( − m ( p − p − ) p p − ))= Ω(exp( − mγ p p − )) ≥ Ω(exp( − mk γ )) ≥ Ω(exp( − N k γ )) . (29)We intentionally drop from the power, which makes the bound smaller. The second inequality holds because p p − ≥ (1 − γ ) k ≥ k . Integrating w.r.t. m gives P ( booster makes error ) ≥ P ( Y < ≥ Ω(exp( − N k γ )) . By setting this value equal to (cid:15) , we have N ≥ Ω( k γ ln (cid:15) ) , which proves the first part of the theorem.Now we turn our attention to the optimality of sample complexity. Let T := kS γ and define p t = u y t for t ≤ T and p t = u y t γ for t > T . Then for T ≤ T , (27) implies T (cid:88) t =1 w t C t [ y t , ˆ y t ] ≤ γk || w || + k ln( δ )2 γ ≤ − γk || w || + S, (30)where the last inequality holds because || w || ≤ T = kS γ . For T > T , again (27) implies T (cid:88) t =1 w t C t [ y t , ˆ y t ] ≤ k T (cid:88) t =1 w t + 1 − γk T (cid:88) t = T +1 w t + γ || w || k + k ln( δ )2 γ ≤ γk T + 1 − γk || w || + k ln( δ )2 γ ≤ − γk || w || + S. (31)19igure 1: Plot of φ N ( ) computed with distribution u γ versus the number of labels k . N is fixed to be 20,and the edge γ is set to be 0.01 (left) and 0.1 (right). The graph is not monotonic for larger edge. Thishinders the approximation of potential functions with respect to k .(30) and (31) prove that the weak learners indeed satisfy (1). Now note that combining weak learners doesnot provide meaningful information for t ≤ T , and thus any online boosting algorithm has errors at least Ω( T ) . Therefore to get the desired asymptotic error rate, the number of observations T should be at least Ω( T (cid:15) ) = Ω( k(cid:15)γ S ) , which proves the second part of the theorem.Even though the gap for the number of weak learners between Corollary 3 and Theorem 4 is merelypolynomial in k , readers might think it is counter-intuitive that N is increasing in k in the upper bound whiledecreasing in the lower bound. This phenomenon occurs due to the difficulty in approximating potentialfunctions. Recall that Lemma 9 and Theorem 4 utilize upper and lower bound of φ N ( ) .At first glance, considering that φ N ( ) implies the error rate of majority votes out of N independentrandom draws with distribution u γ , the potential function seems to be increasing in k as the task gets harderwith bigger set of options. This is the case of left panel of Figure 1. However, as it is shown in the rightpanel, it can also start decreasing in k when γ is larger. This can happen because the probability that awrong label is drawn vanishes as k grows while the probability that the correct label is drawn remains biggerthan γ . In this regard, even though the number of wrong labels gets larger, the error rate actually decreasesas u γ [1] dominates other probabilities.After acknowledging that φ N ( ) might not be a monotonic function of k , the linear upper bound (17)turns out to be quite naive, and this is the main reason for the conflicting dependence on k in upper boundand lower bound for N . 
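The non-monotonic behavior of $\phi_N(0)$ in $k$ is easy to reproduce numerically. The short Python sketch below estimates $\phi_N(0)$ by Monte Carlo simulation as the probability that the true label loses a plurality vote among $N$ independent draws from $u_\gamma$; the function name, the tie-breaking convention (ties count as errors), and the trial count are our own choices rather than details from the paper.

```python
import numpy as np

def phi_N_zero(k, N, gamma, trials=20000, seed=0):
    """Monte Carlo estimate of phi_N(0): probability that the true label
    (index 0) fails to win a plurality vote among N i.i.d. draws from the
    edge-over-random distribution u_gamma.  Ties are counted as errors."""
    rng = np.random.default_rng(seed)
    probs = np.full(k, (1.0 - gamma) / k)  # (1 - gamma)/k on every label ...
    probs[0] += gamma                      # ... plus an extra gamma on the true one
    errors = 0
    for _ in range(trials):
        counts = np.bincount(rng.choice(k, size=N, p=probs), minlength=k)
        if counts[1:].max() >= counts[0]:  # some wrong label ties or beats label 0
            errors += 1
    return errors / trials

if __name__ == "__main__":
    N = 20
    for gamma in (0.01, 0.1):
        print(gamma, [round(phi_N_zero(k, N, gamma), 3) for k in range(2, 21)])
```

Plotting these estimates against $k$ for the two edges should qualitatively recover the two panels of Figure 1.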
As the relation among $k$, $N$, and $\gamma$ in $\phi_N(0)$ is quite intricate, deriving better approximations of the potential functions remains an open problem.

Appendix C Proof of Theorem 5
We first introduce a lemma that will be used in the proof.
Lemma 11.
Suppose $A, B \ge 0$, $B - A = \gamma \in [-1, 1]$, and $A + B \le 1$. Then we have
$$\min_{\alpha \in [-2, 2]} A(e^{\alpha} - 1) + B(e^{-\alpha} - 1) \le -\frac{\gamma^2}{2}.$$

Proof.
We consider three cases according to the range of $\frac{B}{A}$.

First suppose $e^{-4} \le \frac{B}{A} \le e^{4}$. In this case the minimum is attained at $\alpha = \frac{1}{2}\log\frac{B}{A}$, and the minimum becomes
$$-(A + B) + 2\sqrt{AB} = -(\sqrt{A} - \sqrt{B})^2 = -\Big(\frac{A - B}{\sqrt{A} + \sqrt{B}}\Big)^2 = -\frac{\gamma^2}{(\sqrt{A} + \sqrt{B})^2} \le -\frac{\gamma^2}{2(A + B)} \le -\frac{\gamma^2}{2}.$$

Now suppose $\frac{B}{A} > e^{4} > 51$. From $B - A = \gamma$, we have $\frac{\gamma}{50} > A \ge 0$. Choosing $\alpha = \log 6$, the minimum is bounded above by
$$5A - \frac{5}{6}B = \frac{25}{6}A - \frac{5}{6}\gamma < \frac{\gamma}{12} - \frac{5\gamma}{6} = -\frac{3\gamma}{4} \le -\frac{\gamma^2}{2}.$$
The last inequality holds because $\gamma \le 1$.

Finally suppose $\frac{A}{B} > e^{4} > 51$. From $B - A = \gamma$, we have $-\frac{\gamma}{50} > B \ge 0$. Choosing $\alpha = -\log 6$, the minimum is bounded above by
$$-\frac{5}{6}A + 5B = \frac{25}{6}B + \frac{5}{6}\gamma < -\frac{\gamma}{12} + \frac{5\gamma}{6} = \frac{3\gamma}{4} \le -\frac{\gamma^2}{2}.$$
The last inequality holds because $\gamma \ge -1$. This completes the proof.
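Lemma 11 is one-dimensional, so it is straightforward to check numerically. The following sketch (our own, not from the paper) evaluates the objective on a fine grid of $\alpha \in [-2, 2]$ for randomly sampled feasible pairs $(A, B)$ and verifies the $-\gamma^2/2$ bound.

```python
import numpy as np

# Numerical sanity check of Lemma 11: the minimum of A(e^a - 1) + B(e^-a - 1)
# over a in [-2, 2] is at most -gamma^2 / 2 whenever A, B >= 0 and A + B <= 1.
rng = np.random.default_rng(0)
alphas = np.linspace(-2.0, 2.0, 4001)
for _ in range(10000):
    A, B = rng.uniform(0.0, 1.0, size=2)
    if A + B > 1.0:          # rescale to enforce A + B <= 1
        A, B = A / 2.0, B / 2.0
    gamma = B - A
    vals = A * (np.exp(alphas) - 1.0) + B * (np.exp(-alphas) - 1.0)
    assert vals.min() <= -gamma**2 / 2.0 + 1e-6   # small tolerance for the grid
print("Lemma 11 bound held for all sampled (A, B) pairs.")
```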
Now we provide the proof of Theorem 5.

Proof.

Let $M_i$ denote the number of mistakes made by expert $i$: $M_i = \sum_t \mathbb{1}(y_t \ne \hat{y}^i_t)$. We also let $M_0 = T$ for ease of presentation. As Adaboost.OLM runs the Hedge algorithm over the $N$ experts, the Azuma–Hoeffding inequality and a standard analysis (cf. Cesa-Bianchi and Lugosi [18, Corollary 2.3]) provide, with probability $1 - \delta$,
$$\sum_t \mathbb{1}(y_t \ne \hat{y}_t) \le 2\min_i M_i + 2\log N + \tilde{O}(\sqrt{T}), \tag{32}$$
where the $\tilde{O}$ notation suppresses dependence on $\log\frac{1}{\delta}$.

Now suppose the expert $i-1$ makes a mistake at iteration $t$; that is, conservatively, $s^{i-1}_t[y_t] \le s^{i-1}_t[l]$ for some $l \ne y_t$. This implies that among the $k-1$ terms in the summation of $-C^i_t[y_t, y_t]$ in (9), at least one term is at least $\frac{1}{2}$. Thus $-C^i_t[y_t, y_t] \ge \frac{1}{2}$ whenever the expert $i-1$ makes a mistake at $x_t$. This leads to the inequality
$$-\sum_t C^i_t[y_t, y_t] \ge \frac{M_{i-1}}{2}. \tag{33}$$
Note that by the definition of $M_0$ and $C^1_t$, the above inequality holds for $i = 1$ as well. For ease of notation, let us write $w_i := -\sum_t C^i_t[y_t, y_t]$.

Now let $\Delta_i$ denote the difference in cumulative logistic loss between two consecutive experts:
$$\Delta_i = \sum_t L_{y_t}(s^i_t) - L_{y_t}(s^{i-1}_t) = \sum_t L_{y_t}(s^{i-1}_t + \alpha^i_t e_{l^i_t}) - L_{y_t}(s^{i-1}_t).$$
Then the online gradient descent algorithm provides
$$\Delta_i \le \min_{\alpha \in [-2, 2]} \sum_t \big[L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t)\big] + 4\sqrt{k-1}\sqrt{T}. \tag{34}$$
By simple algebra, we can check
$$\log(1 + e^{s+\alpha}) - \log(1 + e^{s}) = \log\Big(1 + \frac{e^{\alpha} - 1}{1 + e^{-s}}\Big) \le \frac{e^{\alpha} - 1}{1 + e^{-s}}.$$
From this, we can deduce that
$$L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t) \le \begin{cases} C^i_t[y_t, l^i_t](e^{\alpha} - 1), & \text{if } l^i_t \ne y_t \\ C^i_t[y_t, l^i_t](-e^{-\alpha} + 1), & \text{if } l^i_t = y_t. \end{cases}$$
Summing over $t$, we have
$$\sum_t L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t) \le w_i\big(A(e^{\alpha} - 1) + B(e^{-\alpha} - 1)\big), \quad \text{where } A = \sum_{l^i_t \ne y_t} C^i_t[y_t, l^i_t]/w_i, \;\; B = -\sum_{l^i_t = y_t} C^i_t[y_t, l^i_t]/w_i.$$
Note that $A$ and $B$ are non-negative, $B - A = \gamma_i \in [-1, 1]$, and $A + B \le 1$. Lemma 11 provides
$$\min_{\alpha \in [-2, 2]} \sum_t \big[L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t)\big] \le -\frac{\gamma_i^2}{2} w_i. \tag{35}$$
Combining (33), (34), and (35), we have
$$\Delta_i \le -\frac{\gamma_i^2 M_{i-1}}{4} + 4\sqrt{k-1}\sqrt{T}.$$
Summing over $i$, we get by telescoping
$$\sum_t L_{y_t}(s^N_t) - \sum_t L_{y_t}(0) \le -\sum_i \frac{\gamma_i^2 M_{i-1}}{4} + 4\sqrt{k-1}\,N\sqrt{T} \le -\frac{\sum_i \gamma_i^2}{4}\min_i M_i + 4\sqrt{k-1}\,N\sqrt{T}.$$
Note that $L_{y_t}(0) = (k-1)\log 2$ and $L_{y_t}(s^N_t) \ge 0$. Therefore we have
$$\min_i M_i \le \frac{4(k-1)\log 2}{\sum_i \gamma_i^2} T + \frac{16\sqrt{k-1}\,N}{\sum_i \gamma_i^2}\sqrt{T}.$$
Plugging this into (32), we get with probability $1 - \delta$,
$$\sum_t \mathbb{1}(y_t \ne \hat{y}_t) \le \frac{8(k-1)\log 2}{\sum_i \gamma_i^2} T + \tilde{O}\Big(\frac{kN\sqrt{T}}{\sum_i \gamma_i^2} + \log N\Big) \le \frac{8(k-1)}{\sum_i \gamma_i^2} T + \tilde{O}\Big(\frac{kN^2}{\sum_i \gamma_i^2}\Big),$$
where the last inequality follows from the AM–GM inequality: $N\sqrt{T} \le \frac{N^2 + T}{2}$.

Appendix D Adaptive algorithms with different surrogate losses

In this section, we present adaptive boosting algorithms similar to Adaboost.OLM but with two different surrogate losses: the exponential loss and the square hinge loss. We keep the main structure of the algorithm, but the unique properties of each loss lead to small differences in the details.
D.1 Exponential loss
As discussed in Section 4.1, the exponential loss is useful in the batch setting because it provides a closed form for the potential function. We will use the following multiclass version of the exponential loss:
$$L_r(s) := \sum_{l \ne r} \exp(s[l] - s[r]). \tag{36}$$
From this, we can compute the cost matrix and $f^i_t{}'$ for online gradient descent as below:
$$C^i_t[r, l] = \begin{cases} \exp(s^{i-1}_t[l] - s^{i-1}_t[r]), & \text{if } l \ne r \\ -\sum_{j \ne r} \exp(s^{i-1}_t[j] - s^{i-1}_t[r]), & \text{if } l = r \end{cases} \tag{37}$$
$$f^i_t{}'(\alpha) = \begin{cases} \exp(s^{i-1}_t[l^i_t] + \alpha - s^{i-1}_t[y_t]), & \text{if } l^i_t \ne y_t \\ -\sum_{j \ne y_t} \exp(s^{i-1}_t[j] - \alpha - s^{i-1}_t[y_t]), & \text{if } l^i_t = y_t. \end{cases} \tag{38}$$
With this gradient, if we set the learning rate $\eta^i_t = \frac{\sqrt{2}}{\sqrt{k-1}\,e^{2i}\sqrt{t}}$, a standard analysis provides $R_i(T) \le 4\sqrt{k-1}\,e^{2i}\sqrt{T}$. Note that with the exponential loss we have a different learning rate for each weak learner. We keep the algorithm the same as Algorithm 2, but with a different cost matrix and learning rate.
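Both the cost matrix (37) and the gradient (38) are simple functions of the cumulative score vector $s^{i-1}_t$, so they are cheap to compute online. The Python sketch below shows one possible implementation; the function names and array layout are our own, and `s` is assumed to hold the current cumulative scores.

```python
import numpy as np

def exp_cost_matrix(s):
    """Cost matrix (37) for the exponential loss: C[r, l] = exp(s[l] - s[r])
    for l != r, and the diagonal entry makes each row sum to zero."""
    C = np.exp(s[None, :] - s[:, None])
    np.fill_diagonal(C, 0.0)
    np.fill_diagonal(C, -C.sum(axis=1))
    return C

def exp_loss_gradient(s, y, l, alpha):
    """Gradient (38) of alpha -> L_y(s + alpha * e_l) for the exponential loss."""
    if l != y:
        return np.exp(s[l] + alpha - s[y])
    others = np.delete(np.arange(len(s)), y)
    return -np.sum(np.exp(s[others] - alpha - s[y]))
```

A single online gradient descent step for weak learner $i$ would then move $\alpha^i_t$ against this gradient with step size $\eta^i_t$ and project back onto the feasible set used by Algorithm 2.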
Now we state the theorem for the mistake bound.

Theorem 12. (Mistake bound with exponential loss) For any $T$ and $N$, the number of mistakes made by Algorithm 2 with the above cost matrix and learning rate satisfies the following inequality with high probability:
$$\sum_t \mathbb{1}(y_t \ne \hat{y}_t) \le \frac{4k}{\sum_i \gamma_i^2} T + \tilde{O}\Big(\frac{k e^{2N}}{\sum_i \gamma_i^2}\Big).$$

Proof.
The proof is almost identical to that of Theorem 5, so we only state the steps that differ. With the cost matrix defined in (37), we can show
$$-\sum_t C^i_t[y_t, y_t] \ge M_{i-1}.$$
Furthermore, we have the following identity (which was an inequality in the original proof):
$$L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t) = \begin{cases} C^i_t[y_t, l^i_t](e^{\alpha} - 1), & \text{if } l^i_t \ne y_t \\ C^i_t[y_t, l^i_t](-e^{-\alpha} + 1), & \text{if } l^i_t = y_t. \end{cases}$$
This leads to
$$\Delta_i \le -\frac{\gamma_i^2 M_{i-1}}{2} + 4\sqrt{k-1}\,e^{2i}\sqrt{T}.$$
Summing over $i$, we get
$$\frac{\sum_i \gamma_i^2}{2}\min_i M_i \le (k-1)T + 4\sqrt{k-1}\,\frac{e^2(e^{2N} - 1)}{e^2 - 1}\sqrt{T} \le (k-1)T + 9k e^{2N}\sqrt{T}.$$
Plugging this into (32), we get
$$\sum_t \mathbb{1}(y_t \ne \hat{y}_t) \le \frac{4(k-1)}{\sum_i \gamma_i^2} T + \tilde{O}\Big(\frac{k e^{2N}\sqrt{T}}{\sum_i \gamma_i^2} + \log N\Big) \le \frac{4k}{\sum_i \gamma_i^2} T + \tilde{O}\Big(\frac{k e^{2N}}{\sum_i \gamma_i^2}\Big),$$
which completes the proof. We again used the AM–GM inequality in the last step.

Comparing to Theorem 5, we get a better coefficient for the first term, which governs the asymptotic error rate, but the exponential function in the second term makes the bound significantly looser. The exponential term comes from the larger variability of $f^i_t$ associated with the exponential loss. It should also be noted that the empirical edge $\gamma_i$ is measured with a different cost matrix, and thus a direct comparison is not entirely fair. In fact, as discussed in Section 4.1, $\gamma_i$ is closer to $0$ with the exponential loss than with the logistic loss due to the larger variation in weights, which is another big advantage of the logistic loss.
D.2 Square hinge loss

Another popular surrogate is the square hinge loss. We begin the section by introducing a multiclass version of it:
$$L_r(s) := \frac{1}{2}\sum_{l \ne r} (s[l] - s[r] + 1)_+^2, \tag{39}$$
where $(f)_+ := \max\{0, f\}$. From this, we can compute the cost matrix and $f^i_t{}'$ for online gradient descent as below:
$$C^i_t[r, l] = \begin{cases} (s^{i-1}_t[l] - s^{i-1}_t[r] + 1)_+, & \text{if } l \ne r \\ -\sum_{j \ne r} (s^{i-1}_t[j] - s^{i-1}_t[r] + 1)_+, & \text{if } l = r \end{cases} \tag{40}$$
$$f^i_t{}'(\alpha) = \begin{cases} (s^{i-1}_t[l^i_t] + \alpha - s^{i-1}_t[y_t] + 1)_+, & \text{if } l^i_t \ne y_t \\ -\sum_{j \ne y_t} (s^{i-1}_t[j] - \alpha - s^{i-1}_t[y_t] + 1)_+, & \text{if } l^i_t = y_t. \end{cases} \tag{41}$$
With the square hinge loss, we do not use Lemma 11 in the proof of the mistake bound, and thus the feasible set $F$ can be narrower. In fact, we will set $F = [-c, c]$, where the parameter $c$ will be optimized later. With this $F$, we have $|f^i_t{}'(\alpha)| \le (k-1) + ci \le (k-1) + cN$, and the standard analysis of online gradient descent with learning rate $\eta_t = \frac{\sqrt{2}\,c}{((k-1) + cN)\sqrt{t}}$ provides $R_i(T) \le 2\sqrt{2}\,c\,((k-1) + cN)\sqrt{T}$.
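As with the exponential loss, (40) and (41) only involve the cumulative scores; a minimal NumPy sketch (again with our own naming) is:

```python
import numpy as np

def hinge_cost_matrix(s):
    """Cost matrix (40) for the square hinge loss: C[r, l] = (s[l] - s[r] + 1)_+
    for l != r, with the diagonal chosen so that each row sums to zero."""
    C = np.maximum(s[None, :] - s[:, None] + 1.0, 0.0)
    np.fill_diagonal(C, 0.0)
    np.fill_diagonal(C, -C.sum(axis=1))
    return C

def hinge_loss_gradient(s, y, l, alpha):
    """Gradient (41) of alpha -> L_y(s + alpha * e_l) for the square hinge loss."""
    if l != y:
        return max(s[l] + alpha - s[y] + 1.0, 0.0)
    others = np.delete(np.arange(len(s)), y)
    return -np.sum(np.maximum(s[others] - alpha - s[y] + 1.0, 0.0))
```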
Now we are ready to prove the mistake bound.

Theorem 13. (Mistake bound with square hinge loss) For any $T$ and $N$, with the choice of $c = \frac{1}{\sqrt{2N}}$, the number of mistakes made by Algorithm 2 with the above cost matrix and learning rate satisfies the following inequality with high probability:
$$\sum_t \mathbb{1}(y_t \ne \hat{y}_t) \le \frac{2\sqrt{2}\,k\sqrt{N}}{\sum_i |\gamma_i|} T + \tilde{O}\Big(\frac{(k + \sqrt{N})^2 N^2}{\sum_i |\gamma_i|}\Big).$$

Proof.
With the cost matrix defined in (40), we can show
$$-\sum_t C^i_t[y_t, y_t] \ge M_{i-1}.$$
We can also check that
$$\frac{1}{2}\big[(s + \alpha)_+^2 - s_+^2\big] \le s_+ \alpha + \alpha^2,$$
by splitting into cases according to the sign of each term. Using this, we can deduce that
$$L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t) \le C^i_t[y_t, l^i_t]\,\alpha + (k-1)\alpha^2.$$
Summing over $t$ gives
$$\sum_t L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t) \le \gamma_i \alpha \sum_t C^i_t[y_t, y_t] + (k-1)\alpha^2 T.$$
The RHS is a quadratic in $\alpha$, and its minimizer is $\alpha^* = -\frac{\gamma_i \sum_t C^i_t[y_t, y_t]}{2(k-1)T}$. Since the magnitude of $C^i_t[y_t, y_t]$ grows as a function of $c$, there is no guarantee that this minimizer lies in the feasible set $F = [-c, c]$. Instead, we bound the minimum by plugging in $\alpha = \pm c$:
$$\min_{\alpha \in [-c, c]} \sum_t L_{y_t}(s^{i-1}_t + \alpha e_{l^i_t}) - L_{y_t}(s^{i-1}_t) \le (k-1)c^2 T + c|\gamma_i| \sum_t C^i_t[y_t, y_t] \le (k-1)c^2 T - c|\gamma_i| M_{i-1}.$$
From this, we get
$$\Delta_i \le -c|\gamma_i| M_{i-1} + (k-1)c^2 T + 2\sqrt{2}\,c\,((k-1) + cN)\sqrt{T}.$$
Summing over $i$, we get
$$c\sum_i |\gamma_i| \cdot \min_i M_i \le \frac{k-1}{2} T + (k-1)c^2 N T + 2\sqrt{2}\,c\,((k-1) + cN)N\sqrt{T}.$$
By rearranging terms, we conclude
$$\min_i M_i \le \frac{k-1}{\sum_i |\gamma_i|}\Big(\frac{1}{2c} + cN\Big) T + \frac{2\sqrt{2}\,((k-1) + cN)N}{\sum_i |\gamma_i|}\sqrt{T}.$$
It is the first term on the RHS that suggests the optimal choice $c = \frac{1}{\sqrt{2N}}$, and this value gives
$$\min_i M_i \le \frac{(k-1)\sqrt{2N}}{\sum_i |\gamma_i|} T + \frac{2\sqrt{2}\,\big((k-1) + \sqrt{N/2}\big)N}{\sum_i |\gamma_i|}\sqrt{T}.$$
Plugging this into (32), we get with high probability,
$$\sum_t \mathbb{1}(y_t \ne \hat{y}_t) \le \frac{2\sqrt{2}\,(k-1)\sqrt{N}}{\sum_i |\gamma_i|} T + \tilde{O}\Big(\frac{(k + \sqrt{N})N\sqrt{T}}{\sum_i |\gamma_i|} + \log N\Big) \le \frac{2\sqrt{2}\,k\sqrt{N}}{\sum_i |\gamma_i|} T + \tilde{O}\Big(\frac{(k + \sqrt{N})^2 N^2}{\sum_i |\gamma_i|}\Big),$$
which completes the proof. We again used the AM–GM inequality in the last step.

By the Cauchy–Schwarz inequality, we have $N\sum_i \gamma_i^2 \ge (\sum_i |\gamma_i|)^2$. From this, we can deduce
$$\Big(\frac{\sqrt{N}}{\sum_i |\gamma_i|}\Big)^2 \ge \frac{1}{\sum_i \gamma_i^2}.$$
If the LHS is greater than $1$, then the bound in Theorem 13 is vacuous. Otherwise, we have
$$\frac{\sqrt{N}}{\sum_i |\gamma_i|} \ge \Big(\frac{\sqrt{N}}{\sum_i |\gamma_i|}\Big)^2 \ge \frac{1}{\sum_i \gamma_i^2},$$
which shows that the bound with the logistic loss is tighter. Furthermore, the square hinge loss also produces more variable weights over instances, which results in worse empirical edges.

Appendix E Detailed description of experiments

Testing was performed on a variety of data sets described in Table 2. All are from the UCI data repository (Blake and Merz [22], Higuera et al. [23], Ugulino et al. [24]), with a few adjustments made to deal with missing data and high dimensionality. These changes are noted in the table below. Many of the data sets are the same as those used by Oza [5], with the addition of a few sets with larger numbers of data points and predictors. We report the average performance on both the entire data set and on the final 20% of the data set. The two accuracy measures help to understand both the "burn-in period", i.e., how quickly the algorithm improves as observations are recorded, and the "accuracy plateau", i.e., how well the algorithm can perform given sufficient data. Different applications may emphasize each of these two algorithmic characteristics, so we provide both to the reader. We also report average run times. All computations were carried out on Nehalem-architecture 10-core 2.27 GHz Intel Xeon E7-4860 processors with 25 GB RAM per core. For all but the last two data sets, results are averaged over 27 reorderings of the data. Due to computational constraints, Movement was run just nine times and ISOLET just once.

Table 2: Data set details

Data sets   Number of data points   Number of predictors   Number of classes
Balance     625                     4                       3
Mice        1080*                   ...                     ...
...         ...                     ...                     ...

* Missing data was replaced with 0.
** The original 617 predictors were projected onto their first 50 principal components, which contained 80% of the variation.
*** User information was removed, leaving only sensor position predictors. A single data point with a missing value was removed.

In all the experiments we used Very Fast Decision Trees (VFDT) from Domingos and Hulten [14] as weak learners. VFDT has several tuning parameters which relate to the frequency with which the tree splits. In all methods we assigned these randomly for each tree. Specifically, for our implementation the tuning parameter grace_period was chosen randomly between 5 and 20, and the tuning parameters split_confidence and hoeffding_tie_threshold randomly between 0.01 and 0.9. It is likely that this procedure produces some trees which do not perform well on specific data sets. In practice, for Adaboost.OLM it is possible to restart poorly performing trees using parameters similar to those of better performing trees in an automated and online (although ad hoc) fashion using the $\alpha^i_t$, and this tends to produce superior performance (as well as allow adaptivity to changes in the data distribution). However, for these experiments we did not take advantage of this, in order to better examine the benefits of the cost matrix framework alone.

Several algorithms were tested using the above specifications, but under slightly different conditions. The first three are directly comparable since they all use the same weak learners and do not require knowledge of the edge of the weak learners. DT is the best result from running 100 VFDT independently. The best was chosen after seeing the performance on the entire data set and on the final 20%, respectively. However, the time reported is the average time for running all 100 VFDT. This was done to better see the additional cost of running the boosting framework on top of training the raw weak learners. OLB is an implementation of the Online Boosting algorithm in Oza [5, Figure 2] with 100 VFDT. AdaOLM stands for Adaboost.OLM, again with 100 VFDT.

The next five algorithms (MB) tested were all variants of OnlineMBBM, but with different edge values $\gamma$. In practice this value is never known ahead of time, but we want to explore how different edges affect the performance of the algorithm.

Table 3: Comparison of algorithms on the final 20% of the data set

                        multiclass trees                                        k binary trees
Data sets   DT     OLB    AdaOLM  MB .3   MB .1   MB .05  MB .01  MB .001   OvA    AdaOVA
Balance     0.768  0.772  0.754   0.788   0.821   0.819   0.805   0.752     0.786  0.795
Mice        0.608  0.399  0.561   0.572   0.695   0.663   0.502   0.467     0.742  0.667
Cars        0.924  0.914  0.930   0.914   0.885   0.870   0.836   0.830     0.946  0.919
Mushroom    0.999  1.000  1.000   0.997   1.000   1.000   0.999   0.998     1.000  1.000
Nursery     0.953  0.941  0.966   0.965   0.969   0.964   0.948   0.940     0.974  0.965
ISOLET      0.515  0.149  0.521   0.453   0.626   0.635   0.226   0.165     0.579  0.570
Movement    0.915  0.870  0.962   0.975   0.987   0.988   0.984   0.981     0.947  0.970

Table 4: Comparison of algorithms on full data set

                        multiclass trees                                        k binary trees
Data sets   DT     OLB    AdaOLM  MB .3   MB .1   MB .05  MB .01  MB .001   OvA    AdaOVA
Balance     0.734  0.747  0.698   0.751   0.769   0.759   0.736   0.677     0.724  0.730
Mice        0.499  0.315  0.454   0.457   0.507   0.449   0.356   0.343     0.586  0.530
Cars        0.848  0.839  0.865   0.842   0.829   0.814   0.767   0.762     0.881  0.853
Mushroom    0.996  0.997  0.995   0.991   0.995   0.994   0.993   0.992     0.996  0.995
Nursery     0.921  0.909  0.928   0.932   0.936   0.932   0.918   0.912     0.939  0.932
ISOLET      0.395  0.104  0.456   0.333   0.486   0.461   0.152   0.111     0.507  0.472
Movement    0.898  0.864  0.942   0.954   0.972   0.973   0.959   0.957     0.927  0.952
For ease of computation, instead of exactly finding the value of (16), we estimated the potential functions by Monte Carlo (MC) simulations.

The final two algorithms are slightly different implementations of the One-vs-All (OvA) ensemble method. In this framework, multiple binary classifiers are used to solve a multiclass problem by viewing different classes as the positive class and all others as the negative class. They then predict whether a data point belongs to their positive class or not, and the results are used together to make a final classification. Both use VFDT as their weak learners, but with 100 × $k$ binary trees. The first method (OvA) uses $k$ versions of Adaboost.OL, each viewing one of the classes as the positive class. Recall that Adaboost.OLM in the binary setting is just Adaboost.OL of Beygelzimer et al. [7]. The second (AdaOVA) produces 100 weak multiclass classifiers by grouping $k$ binary classifiers, one for each class, and then uses Adaboost.OLM to get the final learner, treating the 100 single-tree OvA classifiers as its weak learners. In the tables we have partitioned the methods in terms of the number of weak learners since, while they all tackle the same problem, algorithms within each partition are more directly comparable, as they use the same weak learners.
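The OvA reduction itself is easy to express in code. The sketch below is our own illustration of the wrapper logic; `BinaryOnlineBooster` is a hypothetical stand-in for any online binary booster (such as Adaboost.OL) that exposes a real-valued score and an update routine.

```python
class OneVsAllBooster:
    """One-vs-All wrapper: one binary online booster per class."""

    def __init__(self, k, make_binary_booster):
        # e.g. make_binary_booster = lambda: BinaryOnlineBooster(n_trees=100)
        self.boosters = [make_binary_booster() for _ in range(k)]

    def predict(self, x):
        # each booster scores "is this my class?"; predict the most confident class
        scores = [b.predict_score(x) for b in self.boosters]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, x, y):
        # booster j treats the example as positive iff its class j equals y
        for j, b in enumerate(self.boosters):
            b.update(x, 1 if y == j else 0)
```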
E.1 Analysis

It is worth beginning by noting the strength of the VFDT without any boosting framework. While the results above are for the best performing tree in hindsight, which is not a valid strategy in practice, in many applications it would be possible to collect some data before activating the system and use it to pick tuning parameters. It is also worth noting that many of the weaknesses of the above methods, such as their poor scaling with the number of predictors, are also inherited from the VFDT. Nonetheless, in almost all cases the Adaboost.OLM algorithm outperforms both the best tree and the preexisting Online Boosting algorithm (and is often comparable to the OnlineMBBM algorithms), while also providing theoretical guarantees. In particular, these performance gains seem to be greater on the final 20% of the data and on data sets with a larger number of data points $n$, leading us to believe that Adaboost.OLM has a longer burn-in period but a higher accuracy plateau. This performance does come at additional computational cost, but the cost is relatively mild, especially compared to the costs of OnlineMBBM and the OvA methods.

Table 5: Comparison of algorithms, total run time in seconds

                        multiclass trees                                        k binary trees
Data sets   DT     OLB    AdaOLM  MB .3   MB .1   MB .05  MB .01  MB .001   OvA    AdaOVA
Balance     8      19     20      26      42      47      50      51        66     43
Mice        105    263    416     783     2173    3539    3579    3310      3092   3013
Cars        39     27     59      56      105     146     165     152       195    143
Mushroom    241    169    355     318     325     326     324     321       718    519
Nursery     526    302    735     840     1510    2028    2181    1984      2995   1732
ISOLET      470    1497   2422    18732   38907   64707   62492   50700     37300  33328
Movement    1960   3437   5072    13018   17608   18676   16739   16023     30080  21389

The OnlineMBBM methods use additional assumptions about the power of their weak learners, and are able to leverage that additional information to produce more accurate predictions, with one of these algorithms often achieving the highest accuracy on each data set. However, they can be sensitive to the choice of $\gamma$, with the worst choice of $\gamma$ often underperforming both pure trees and Adaboost.OLM, and with no single $\gamma$ value always producing the best result. These methods are also much slower than Adaboost.OLM, likely due to the computational burden of estimating the potential functions.

Finally, our two OvA algorithms tend to perform very well, often beating the other adaptive methods. However, this performance is likely due to the use of many times more weak learners than the other adaptive methods, which results in a high computational cost. Again we see that as $n$