Adaptively Pruning Features for Boosted Decision Trees
Maryam Aziz
Northeastern University, Boston, Massachusetts, USA
[email protected]

Jesse Anderton
Northeastern University, Boston, Massachusetts, USA
[email protected]

Javed Aslam
Northeastern University, Boston, Massachusetts, USA
[email protected]
Abstract
Boosted decision trees enjoy popularity in a variety of applications; however, for large-scale datasets, the cost of training a decision tree in each round can be prohibitively expensive. Inspired by ideas from the multi-armed bandit literature, we develop a highly efficient algorithm for computing exact greedy-optimal decision trees, outperforming the state-of-the-art Quick Boost method. We further develop a framework for deriving lower bounds on the problem that applies to a wide family of conceivable algorithms for the task (including our algorithm and Quick Boost), and we demonstrate empirically on a wide variety of data sets that our algorithm is near-optimal within this family of algorithms. We also derive a lower bound applicable to any algorithm solving the task, and we demonstrate that our algorithm empirically achieves performance close to this best-achievable lower bound.
Introduction

Boosting algorithms are among the most popular classification algorithms in use today, e.g. in computer vision, learning-to-rank, and text classification. Boosting, originally introduced by Schapire [1990], Freund [1995], Freund and Schapire [1996], is a family of machine learning algorithms in which an accurate classification strategy is learned by combining many "weak" hypotheses, each trained with respect to a different weighted distribution over the training data. These hypotheses are learned sequentially, and at each iteration of boosting the learner is biased towards correctly classifying the examples which were most difficult to classify by the preceding weak hypotheses.

Decision trees [Quinlan, 1993], due to their simplicity and representation power, are among the most popular weak learners used in Boosting algorithms [Freund and Schapire, 1996, Quinlan, 1996]. However, for large-scale data sets, training decision trees across potentially hundreds of rounds of boosting can be prohibitively expensive. Two approaches to ameliorate this cost include (1) approximate decision tree training, which aims to identify a subset of the features and/or a subset of the training examples such that exact training on this subset yields a high-quality decision tree, and (2) efficient exact decision tree training, which aims to compute the greedy-optimal decision tree over the entire data set and feature space as efficiently as possible. These two approaches complement each other: approximate training often devolves to exact training on a subset of the data.

As such, we consider the task of efficient exact decision tree learning in the context of boosting, where our primary objective is to minimize the number of examples that must be examined for any feature in order to perform greedy-optimal decision tree training. Our method is simple to implement, and gains in feature-example efficiency directly correspond to improvements in computation time.

The main contributions of the paper are as follows:

• We develop a highly efficient algorithm for computing exact greedy-optimal decision trees, Adaptive-Pruning Boost, and we demonstrate through extensive experiments that our method outperforms the state-of-the-art Quick Boost method.
• We develop a constrained-oracle framework for deriving feature-example lower bounds on the problem that applies to a wide family of conceivable algorithms for the task, including our algorithm and Quick Boost, and we demonstrate through extensive experiments that our algorithm is near-optimal within this family of algorithms.
• Within the constrained-oracle framework, we also derive a feature-example lower bound applicable to any algorithm solving the task, and we demonstrate that our algorithm empirically achieves performance close to this lower bound as well.

We will next expand on the ideas that underlie our three main results above and discuss related work.
The Multi-Armed Bandit (MAB) Inspiration.
Our approach to efficiently splitting decision tree nodes is based on identifying intervals which contain the score (e.g. the classifier's training accuracy) of each possible split, and tightening those intervals by observing training examples incrementally. We can eventually exclude entire features from further consideration because their intervals do not overlap the intervals of the best splits. Under this paradigm, the optimal strategy would be to assess all examples for the best feature, reducing its interval to an exact value, and only then to assess examples for the remaining features to rule them out. Of course, we do not know in advance which feature is best. Instead, we wish to spend our assessments optimally to identify the best feature with the fewest assessments spent on the other features. This corresponds well to the best arm identification problem studied in the MAB literature. This insight inspired our training algorithm.

A "Pure Exploration" MAB algorithm in the "Fixed-Confidence" setting [Kalyanakrishnan et al., 2012, Gabillon et al., 2012, Kaufmann and Kalyanakrishnan, 2013] is given a set of arms (probability distributions over rewards) and returns the arm with highest expected reward with high probability (subsequently, WHP) while minimizing the number of samples drawn from each arm. Such confidence interval algorithms are generally categorized as LUCB (Lower Upper Confidence Bounds) algorithms, because at each round they "prune" sub-optimal arms whose confidence intervals do not overlap with the most promising arm's interval, until it is confident that WHP it has found the best arm.

In contrast to the MAB setting, where one estimates the expected reward of an arm WHP, in the Boosting setting one can calculate the exact (training) accuracy of a feature (the expected reward of an arm) if one is willing to assess that feature on all training examples. When only a subset of examples is assessed, one can also calculate a non-probabilistic "uncertainty interval" which is guaranteed to contain the feature's true accuracy. This interval shrinks in proportion to the boosting weight of the assessed examples. We specialize the generic LUCB-style MAB algorithm for best arm identification to assess examples in decreasing order of boosting weights, and to use uncertainty intervals in place of the more typical probabilistic confidence intervals.
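To make the LUCB template concrete, the following is a minimal sketch of fixed-confidence best arm identification with Hoeffding-style confidence radii. It is illustrative only: the function names and the particular confidence radius are our own choices, not those of the cited papers.

```python
import math
import random

def lucb_best_arm(pulls, delta=0.05, budget=100000):
    """Generic LUCB-style best-arm identification sketch.

    pulls: list of zero-argument functions; pulls[i]() samples a
    reward in [0, 1] from arm i.  Stops when the empirically best
    arm's lower confidence bound clears every rival's upper bound.
    """
    k = len(pulls)
    n = [0] * k    # samples drawn per arm
    s = [0.0] * k  # reward sums per arm

    def mean(i):
        return s[i] / n[i]

    def radius(i):  # Hoeffding-style radius; one common choice
        return math.sqrt(math.log(4 * k * n[i] ** 2 / delta) / (2 * n[i]))

    for i in range(k):  # one initial sample per arm
        s[i] += pulls[i](); n[i] += 1
    for _ in range(budget):
        best = max(range(k), key=mean)
        # strongest challenger: highest upper bound among the rest
        chall = max((i for i in range(k) if i != best),
                    key=lambda i: mean(i) + radius(i))
        if mean(best) - radius(best) >= mean(chall) + radius(chall):
            return best  # intervals separated: best arm found (WHP)
        s[best] += pulls[best](); n[best] += 1
        s[chall] += pulls[chall](); n[chall] += 1
    return max(range(k), key=mean)

# e.g. arms = [lambda p=p: float(random.random() < p) for p in (0.4, 0.5, 0.6)]
```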
Our Lower Bounds.
We introduce two empirical lower bounds on the total number of examples needed to be assessed in order to identify the exact greedy-optimal node for a given set of boosting weights. Our first lower bound is for the class of algorithms which assess feature accuracy by testing the feature on examples in order of decreasing Boosting weights (we call this the assessment complexity of the problem). We show empirically that our algorithm's performance is consistently nearly identical to this lower bound. Our second lower bound permits examples to be assessed in any order. It requires a feature to be assessed with the minimal set of examples necessary to prove that its training accuracy is not optimal. This minimal set depends on the boosting weights in a given round, from which the best possible (weighted) accuracy across all weak hypotheses is calculated. For non-optimal features, the minimal set is then identified using Integer Linear Programming.
Related Work

Much effort has gone into reducing the overall computational complexity of training Boosting models. In the spirit of Appel et al. [2013], which introduced the state-of-the-art exact greedy-optimal boosted decision tree training algorithm Quick Boost (our main competitor), we divide these attempts into three categories and provide examples of the literature from each category: reducing 1) the set of features to focus on; 2) the set of examples to focus on; and/or 3) the training time of decision trees. Note that these categories are independent of and parallel to each other. For instance, 3), the focus of this work, can build a decision tree from any subset of features or examples. We show improvements compared to the state-of-the-art algorithm both on subsets of the training data and on the full training matrix. Popular approximate algorithms such as XGBoost [Chen and Guestrin, 2016] typically focus on 1) and 2) and could benefit from using our algorithm for their training step.

Various works [Dollar et al., 2007, Paul et al., 2009] focus on reducing the set of features. Busa-Fekete and Kégl [2010] divide features into subsets and at each round of boosting use adversarial bandit models to find the most promising subset for boosting. LazyBoost [Escudero et al., 2001] samples a subset of features uniformly at random to focus on at a given boosting round.

Other attempts at computational complexity reduction involve sampling a set of examples. Given a fixed budget of examples, Laminating [Dubout and Fleuret, 2014] attempts to find the best among a set of hypotheses by testing each surviving hypothesis on an increasingly larger set of sampled examples, pruning the worst-performing half and doubling the number of examples, until it is left with one hypothesis. It returns this hypothesis to boosting as the best one with probability 1 − δ. The hypothesis identification part of Laminating is essentially identical to the best arm identification algorithm Sequential Halving [Karnin et al., 2013]. Stochastic Gradient Boost [Friedman, 2002] and the weight trimming approach of Friedman et al. [1998] are a few other instances of reducing the set of examples. FilterBoost [Bradley and Schapire, 2008] uses an oracle to sample a set of examples from a very large dataset and uses this set to train a weak learner.

Another line of research focuses on reducing the training time of decision trees [Sharp, 2008, Wu et al., 2008]. More recently, Appel et al. [2013] proposed Quick Boost, which trains decision trees as weak learners while pruning underperforming features earlier than a classic Boosting algorithm would. They build their algorithm on the insight that the (weighted) error rate of a feature when trained on a subset of examples can be used to bound its error rate on all examples. This is because the error rate is simply the normalized sum of the weights of the misclassified examples; if one supposes that all unseen examples may be correctly classified, that yields a lower bound on the error rate. If this lower bound is above the best observed error rate of a feature trained on all examples, the underperforming feature may be pruned and no more effort spent on it.

Our Adaptive-Pruning Boost algorithm carries forward the ideas introduced by Quick Boost. In contrast to Quick Boost, our algorithm is parameter-free and adaptive. Our algorithm uses fewer training examples and thus less training CPU time than Quick Boost. It works by gradually adding weight to the "winning" feature with the smallest upper bound on, e.g., its error rate and the "challenger" feature with the smallest lower bound, until all challengers are pruned. We demonstrate consistent improvement over Quick Boost on a variety of datasets, and show that when speed improvements are more modest this is due to Quick Boost approaching the lower bound more tightly rather than due to our algorithm using more examples than are necessary. Our algorithm is consistently nearly optimal in terms of the lower bound for algorithms which assess examples in weight order, and this lower bound in turn is close to the global lower bound. Experimentally, we show that the reduction in total assessed examples also reduces the CPU time.
We adopt the setup, description and notation of Appel et al. [2013] for ease of comparison.
A Generic Boosting Algorithm.
Boosting algorithms train a linear combination of classifiers $H_T(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$ such that an error function $E$ is minimized by optimizing the scalar $\alpha_t$ and the weak learner $h_t(x)$ at round $t$. Examples $x_i$ misclassified by $h_t(x)$ are assigned "heavy" weights $w_i$, so that the algorithm focuses on these heavy-weight examples when training the weak learner $h_{t+1}(x)$ in round $t+1$. Decision trees, defined formally below, are often used as weak learners.
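As an illustration of this loop, here is a minimal AdaBoost-style sketch in Python. The exponential weight update and the choice of $\alpha_t$ are the standard AdaBoost ones; train_stump is an assumed helper returning a weak learner that minimizes weighted error.

```python
import math

def boost(X, y, train_stump, T):
    """Minimal AdaBoost-style loop: H_T(x) = sum_t alpha_t * h_t(x).

    X: list of feature vectors; y: labels in {-1, +1};
    train_stump(X, y, w) -> h, a callable returning +1 or -1.
    """
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # list of (alpha_t, h_t)
    for t in range(T):
        h = train_stump(X, y, w)
        # weighted error of the new weak learner
        eps = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, h))
        # upweight mistakes, downweight correct examples, renormalize
        w = [wi * math.exp(-alpha * yi * h(xi))
             for wi, xi, yi in zip(w, X, y)]
        Z = sum(w)
        w = [wi / Z for wi in w]

    def H(x):  # final strong classifier
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return H
```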
Decision Tree. A binary decision tree $h_{\mathrm{Tree}}(x)$ is a tree-based classifier in which every non-leaf node is a decision stump $h(x)$. A decision stump can be viewed as a tuple $(p, k, \tau)$ of a polarity (either $+1$ or $-1$), a feature column index, and a threshold, respectively, which predicts a binary label from the set $\{+1, -1\}$ for any input $x \in \mathbb{R}^K$ using the function $h(x) \equiv p \cdot \mathrm{sign}(x[k] - \tau)$.

A decision tree $h_{\mathrm{Tree}}(x)$ is trained, top to bottom, by "splitting" a node, i.e. selecting a stump $h(x)$ that optimizes some function such as error rate, information gain, or Gini impurity. While this paper focuses on selecting stumps based on error rate, we provide bounds for information gain in the supplementary material which can be used to split nodes on information gain. Our algorithm Adaptive-Pruning Stump (Algorithm 1), a subroutine of Adaptive-Pruning Boost, trains a stump $h(x)$ with fewer total example assessments than its analog, the subroutine of the state-of-the-art algorithm Quick Boost, does. Note that Adaptive-Pruning Stump used iteratively can train a decision tree, but for simplicity we assume our weak learners are binary decision stumps. While we describe Adaptive-Pruning Stump for binary classification, the reasoning also applies to multi-class data.
To describe how Adaptive-Pruning Stump trains a stump we need a few definitions. Let $n$ be the total number of examples, and $m \le n$ some number of examples on which a stump has been trained so far. We will assume that Boosting provides the examples in decreasing weight order. This order can be maintained in $O(n)$ time in the presence of Boosting weight updates: examples which are correctly classified do not change their relative weight order, and examples which are incorrectly classified do not change their relative weight order, so a simple merge of these two groups suffices. We can therefore number our examples from 1 to $n$ in decreasing weight order. Furthermore,

• let $Z_m := \sum_{i=1}^{m} w_i$ be the sum of the weights of the first $m$ (heaviest) examples, and
• let $\epsilon_m := \sum_{i=1}^{m} w_i \mathbf{1}\{h(x_i) \ne y_i\}$ be the sum of the weights of the examples from the first $m$ which are misclassified by the stump $h(x)$.

The weighted error rate for stump $j$ on the first $m$ examples is then $E^j_m := \epsilon^j_m / Z_m$.

Adaptive-Pruning Stump prunes features based on exact intervals (which we call uncertainty intervals) and returns the best feature deterministically. To do this we need lower and upper bounds on the stump's training error rate. Our lower bound assumes that all unseen examples are classified correctly, and our upper bound assumes that all unseen examples are classified incorrectly. We define $L^j_m$ as the lower bound on the error rate for stump $j$ on all $n$ examples, when computed on the first $m$ examples, and $U^j_m$ as the corresponding upper bound. For any $0 \le m \le n$, using $c^j_i := \mathbf{1}\{h_j(x_i) \ne y_i\}$ to indicate whether stump $j$ incorrectly classifies example $i$, we define

$$L^j_m := \frac{1}{Z_n} \sum_{i=1}^{m} w_i c^j_i \;\le\; \underbrace{\frac{1}{Z_n} \sum_{i=1}^{n} w_i c^j_i}_{E^j_n} \;\le\; \frac{1}{Z_n} \left( \epsilon^j_m + \sum_{i=m+1}^{n} w_i \right) = \frac{1}{Z_n} \left( \epsilon^j_m + (Z_n - Z_m) \right) =: U^j_m.$$

For any two stumps $i$ and $j$, when numbers $m$ and $m'$ exist such that $L^i_m > U^j_{m'}$, we can safely discard stump $i$, as it cannot have the lowest error rate. This extension of the pruning rule used by Appel et al. [2013] permits each feature to have its own interval of possible error rates, and permits us to compare features for pruning without first needing to assess all $n$ examples for any feature (Quick Boost's subroutine requires the current-best feature to be tested on all $n$ examples).

Now we describe our algorithm in detail; see the listing in Algorithm 1. We use $f_k$ to denote an object which stores all decision stumps $h(x)$ for feature $x[k]$. Recall that $x \in \mathbb{R}^K$ and that $x[k]$ is the $k$th feature of $x$, for $k \in \{1, \ldots, K\}$. $f_k$ has a method assess(batch) which, when given a "batch" of examples, updates $L_m$, $E_m$, $U_m$ (defined above) for all decision stumps of feature $x[k]$ based on the examples in the batch. It also has methods LB() and UB(), which report the $L_m$ and $U_m$ for the single hypothesis with smallest error $E_m$ on the $m$ examples seen so far, and bestStump(), which returns the hypothesis with smallest error $E_m$.

Adaptive-Pruning Stump proceeds until there is some feature $k^*$ whose upper bound is below the lower bounds of all other features. We then know that the best hypothesis uses feature $k^*$. We assess any remaining unseen examples for feature $k^*$ in order to identify the best threshold and polarity and to calculate $E^{k^*}_n$. Thus, our algorithm always finds the exact greedy-optimal hypothesis.
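These interval definitions translate directly into code. The following sketch (the function names are ours) computes the uncertainty interval for a stump from its misclassified weight on the $m$ heaviest examples, and applies the pruning rule:

```python
def bounds(eps_m, Z_m, Z_n):
    """Uncertainty interval [L, U] for a stump's error rate on all n
    examples, given the misclassified weight eps_m among the heaviest
    m examples (total weight Z_m) and the total weight Z_n.

    L assumes every unseen example is classified correctly;
    U assumes every unseen example is misclassified."""
    L = eps_m / Z_n
    U = (eps_m + (Z_n - Z_m)) / Z_n
    return L, U

def can_prune(eps_i, Zm_i, eps_j, Zm_j, Z_n):
    """Stump i may be discarded once its lower bound exceeds
    stump j's upper bound: L^i_m > U^j_{m'}."""
    L_i, _ = bounds(eps_i, Zm_i, Z_n)
    _, U_j = bounds(eps_j, Zm_j, Z_n)
    return L_i > U_j
```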
In order to efficiently compare two features $i$ and $j$ to decide whether to prune feature $i$, we want to "add" the minimum weight to these arms that could possibly yield $L^i_m > U^j_{m'}$. The most efficient way to do this is to test each feature against a batch of the heaviest unseen examples whose weight is at least the gap $U^j_{m'} - L^i_m$. This permits us to choose batch sizes adaptively, based on the minimum weight needed to prune a feature given the current boosting weights and the current uncertainty intervals for each arm. We note that our "weight order" lower bound on the sample complexity of the problem in the next section is also calculated based on this insight. This is in contrast to Quick Boost, which accepts parameters to specify the total number of batches and the weight to use for initial estimates; the remaining weight is divided evenly among the batches. When the number of batches chosen is too large, the run time of a training round approaches $O(n)$; when it is too small, the run time approaches that of assessing all $n$ examples.
Algorithm 1  Adaptive-Pruning Stump
Input: Examples $\{x_1, \ldots, x_n\}$, Labels $\{y_1, \ldots, y_n\}$, Weights $\{w_1, \ldots, w_n\}$
Output: $h(x)$
  $m \leftarrow$ min. index s.t. $Z_m \geq$ [initial weight]
  for $k = 1$ to $K$ do
    $f_k$.assess($[x_1, \ldots, x_m]$); $m_k \leftarrow m$
  end for
  $a \leftarrow k$ with min $f_k$.UB()
  $b \leftarrow k \ne a$ with min $f_k$.LB()
  while $f_a$.UB() $> f_b$.LB() do
    gap $\leftarrow f_a$.UB() $- f_b$.LB()
    $m \leftarrow$ min index s.t. $Z_m \geq Z_{m_a}$ + gap
    $f_a$.assess($[x_{m_a+1}, \ldots, x_m]$); $m_a \leftarrow m$
    gap $\leftarrow f_a$.UB() $- f_b$.LB()
    if gap $> 0$ then
      $m \leftarrow$ min index s.t. $Z_m \geq Z_{m_b}$ + gap
      $f_b$.assess($[x_{m_b+1}, \ldots, x_m]$); $m_b \leftarrow m$
    end if
    if $f_b$.UB() $< f_a$.UB() then $a \leftarrow b$ end if
    $b \leftarrow k \ne a$ with min $f_k$.LB()
  end while
  return $h(x) := f_a$.bestStump()

Algorithm 2  Adaptive-Pruning Boost
Input: Instances $\{x_1, \ldots, x_n\}$, Labels $\{y_1, \ldots, y_n\}$
Output: $H_T(x)$
  Initialize weights $\{w_1, \ldots, w_n\}$
  for $t = 1$ to $T$ do
    Train decision tree $h_{\mathrm{Tree}}(x)$ one node at a time by calling Adaptive-Pruning Stump
    Choose $\alpha_t$ and update $H_t(x)$
    Update and sort (in descending order) $w$
  end for

At each round, Adaptive-Pruning Boost (Algorithm 2) trains a decision tree by calling the subroutine Adaptive-Pruning Stump (Algorithm 1).
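The "Update and sort" step in Algorithm 2 exploits the observation made earlier: the boosting update rescales the correctly and incorrectly classified groups by constant factors, so each group stays internally sorted and a single linear merge restores global order. A sketch of this O(n) re-sort (our own illustration) follows.

```python
def resort_weights(order, weights, mistakes):
    """Restore descending weight order in O(n) after a boosting update.

    order: example indexes currently sorted by descending weight;
    weights: the updated per-example weights; mistakes: set of indexes
    the last weak learner misclassified.  The update multiplies each
    group by a constant factor, so each group remains internally
    sorted; one merge of the two sorted groups suffices.
    """
    wrong = [i for i in order if i in mistakes]      # still sorted
    right = [i for i in order if i not in mistakes]  # still sorted
    merged, a, b = [], 0, 0
    while a < len(wrong) and b < len(right):
        if weights[wrong[a]] >= weights[right[b]]:
            merged.append(wrong[a]); a += 1
        else:
            merged.append(right[b]); b += 1
    return merged + wrong[a:] + right[b:]
```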
Implementation Details.
The $f_k$.assess() implementation is shared across all algorithms. For $b$ batches of exactly $m$ examples each on a feature $k$ with $v$ distinct values, our implementation of $f_k$.assess takes $O(bm \log(m + v))$ operations. We maintain an ordered list of intervals of thresholds for each feature, with the feature values for the examples assessed so far lying on the interval boundaries. Any threshold in an interval will thus have the same performance on all examples assessed so far. To assess a batch of examples, we sort the examples in the batch by feature value and then split intervals as needed and calculate scores for the thresholds on each interval, in time linear in the batch size and number of intervals.

Note also that maintaining the variables $a$ and $b$ requires a single heap, and that in many iterations of the while loop we can update these variables from the heap in constant time (e.g. when $b$ has not changed, when $a$ and $b$ are simply swapped, or when $b$ can be pruned).
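For intuition about the per-batch work, the following simplified sketch scores every candidate threshold of a single feature with one sort and a linear sweep. It is our own illustration and ignores the incremental interval bookkeeping described above; it treats sign(0) as −1, i.e. under polarity +1 an example with $x[k] \le \tau$ is predicted −1.

```python
def best_stump_for_feature(values, y, w):
    """Score all (polarity, threshold) stumps for one feature in
    O(n log n).  values: feature column; y: labels in {-1, +1};
    w: weights, assumed normalized to sum to 1.
    Returns (error, polarity, threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    # error of the polarity +1 stump with threshold below all values
    # (everything predicted +1): the weight of the negative examples
    err = sum(wi for wi, yi in zip(w, y) if yi != 1)
    best = (min(err, 1 - err), 1 if err <= 1 - err else -1,
            values[order[0]] - 1.0)
    for j, i in enumerate(order):
        # moving the threshold past values[i] flips i's prediction to -1
        err += w[i] if y[i] == 1 else -w[i]
        if j + 1 < len(order) and values[order[j + 1]] == values[i]:
            continue  # only cut between distinct feature values
        # polarity -1 inverts all predictions, so its error is 1 - err
        cand_err, pol = (err, 1) if err <= 1 - err else (1 - err, -1)
        if cand_err < best[0]:
            best = (cand_err, pol, values[i])
    return best
```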
Lower Bounds

We compare Adaptive-Pruning Boost against two lower bounds, defined empirically based on the boosting weights in a given round. In our weight order lower bound, we consider the minimum number of examples required to determine that a given feature is underperforming, under the assumption that examples will be assessed in order of decreasing boosting weight. Our exact lower bound permits examples to be assessed in any order, and so bounds any possible algorithm which finds the best-performing feature.
Weight Order Lower Bound.
For this bound, we first require that Adaptive-Pruning Stump select the feature with minimal error. In the case of ties, an optimal feature may be chosen arbitrarily. Adaptive-Pruning Stump needs to assess every example for the returned feature in order for Adaptive-Pruning Boost to calculate $\alpha$ and update the weights $w$, so the lower bound for the returned feature is simply the total number of examples $n$.

Let $k^*$ be the returned feature, and $E^*$ its error rate when assessed on all $n$ examples. For any feature $k \ne k^*$ which is not returned, we need to prove that it is underperforming (or tied with the best feature). Let $J_k$ be the set of decision stumps which use feature $k$; then we need to find the smallest value $m$ such that for all stumps $j \in J_k$ we have $L^j_m \ge E^*$. Our lower bound is simply

$$LB_{wo} := n + \sum_{k \ne k^*} \min \{ m : \forall j \in J_k,\; L^j_m \ge E^* \}.$$

We present results in Figure 2 showing that Adaptive-Pruning Boost achieves this bound on a variety of datasets. Quick Boost, in contrast, sometimes approaches this bound but often uses more examples than necessary.
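Computed directly from its definition, $LB_{wo}$ needs only the per-stump lower-bound curves. A sketch follows; the input representation is a hypothetical one we chose for illustration.

```python
def weight_order_lower_bound(n, E_star, L):
    """LB_wo = n + sum over non-optimal features k of the smallest m
    such that every stump j of feature k has L^j_m >= E_star.

    L: dict mapping each feature k != k* to a list of bound curves,
    one per stump, where curve[m-1] is L^j_m after the m heaviest
    examples.  Since L^j_n = E^j_n >= E_star for every stump of a
    non-optimal (or tied) feature, m = n always qualifies.
    """
    total = n  # the returned feature k* must be assessed on all n examples
    for k, curves in L.items():
        m = next(m for m in range(1, n + 1)
                 if all(curve[m - 1] >= E_star for curve in curves))
        total += m
    return total
```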
[Figure 1: round assessments versus boosting round for W4A (top) and A6A (bottom); series: AP Boost, Exact LB, Quick Boost, Weight Order LB.]

Figure 1: Lower Bounds versus Upper Bounds. Datasets W4A (top) and A6A (bottom) were used with trees of depth 1. The y-axis is the fraction of the gap between the exact lower bound (at zero) and the full corpus size (at one) which an algorithm used in a given round. Non-cumulative example assessments are plotted every 10 rounds.
Exact Lower Bound.
In order to test the idea that adding examples in weight order is nearly optimal, and to provide a lower bound on any algorithm which finds the optimal stump, we also present an exact lower bound on the problem. Like the weight order lower bound, this bound is defined in terms of the boosting weights in a given round; unlike it, examples may be assessed in any order. It is not clear how one might achieve the exact lower bound without incurring an additional cost in time; we leave such a solution to future work. However, we show in Figure 1 that this bound is, in fact, very close to the weight order lower bound.

For the exact lower bound, we still require the selected feature $k^*$ to be assessed against all examples; this is imposed by the boosting algorithm. For any other feature $k \ne k^*$, we simply need the size of the smallest set of examples which would prune the feature (or prove it is tied with $k^*$). We will use $M \subseteq \{1, \ldots, n\}$ to denote a set of indexes of examples assessed for a given feature, and $L^j_M$ to denote the lower bound of stump $j$ when assessed on the examples in subset $M$. This bound, then, is

$$LB_{exact} := n + \sum_{k \ne k^*} \min_{M :\; \forall j \in J_k,\, L^j_M \ge E^*} |M|.$$

We identify the examples included in the smallest subset $M$ for a given feature $k \ne k^*$ using integer linear programming. We define binary variables $c_1, \ldots, c_n$, where $c_i$ indicates whether example $i$ is included in the set $M$. We then create a constraint for each stump $j \in J_k$ defined for feature $k$ which requires that the stump be proven underperforming. Our program, then, is:

$$\text{Minimize } \sum_{i=1}^{n} c_i \quad \text{s.t.} \quad c_i \in \{0, 1\} \;\forall i, \quad \text{and} \quad \frac{1}{Z_n} \sum_{i=1}^{n} c_i w_i \mathbf{1}\{h_j(x_i) \ne y_i\} \ge E^* \;\forall j \in J_k.$$
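This program can be handed to any MILP solver. As a sketch, here is one way to pose it with SciPy's milp interface; the input representation (a 0/1 misclassification matrix per feature) is our own assumption.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def min_pruning_set(miscls, w, Z_n, E_star):
    """Smallest set of example indexes proving every stump of one
    feature is no better than the best error rate E_star.

    miscls: (J, n) 0/1 matrix; miscls[j, i] = 1 iff stump j
    misclassifies example i.  w: length-n weight vector.
    """
    J, n = miscls.shape
    c_cost = np.ones(n)  # minimize the number of selected examples
    # per-stump constraint: (1/Z_n) * sum_i c_i w_i miscls[j, i] >= E_star
    A = miscls * w  # row j holds w_i * miscls[j, i]
    cons = LinearConstraint(A, lb=Z_n * E_star * np.ones(J), ub=np.inf)
    res = milp(c=c_cost, constraints=cons,
               integrality=np.ones(n), bounds=Bounds(0, 1))
    return res  # res.x is the indicator vector; res.fun = |M|
```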
Discussion. Figure 1 shows a non-cumulative comparison of our weight order lower bound to the global lower bound. Minimizing the global lower bound function mentioned above is computationally expensive. For this reason we used binary-class datasets of moderate size and trees of depth 1 as weak learners, but we have no reason to believe that the technique would not work for deeper trees and multi-class datasets. Refer to Table 1 for details of the datasets. The weight order lower bound and Adaptive-Pruning Boost are within 10-20% of the exact lower bound, but Quick Boost often uses half to all of the unnecessary training examples in a given round.
Experiments

We experimented with shallow trees on various binary and multi-class datasets. We report both assessment complexity and CPU time complexity for each dataset. Though Adaptive-Pruning Boost is a general Boosting algorithm, we experimented with the following classes of algorithms: (1) Boosting exact greedy-optimal decision trees and (2) Boosting approximate decision trees. Each algorithm was run with either the state-of-the-art method (Quick Boost) or our decision tree training method (Adaptive-Pruning Boost), apart from the case of Figure 2, which also uses the brute-force decision tree search method (Classic AdaBoost). The details of our datasets are in Table 1. For datasets SATIMAGE, W4A, A6A, and RCV1 a tree depth of three was used, and for MNIST Digits a tree depth of four was used (as in Appel et al. [2013]). Train and test error results are provided as supplementary material.

Table 1: The datasets used in our experiments.

Dataset       | Source              | Train/Test Size | Total Features | Classes
A6A           | Platt [1999]        | 11220 / 21341   | 123            | 2
MNIST Digits  | LeCun et al. [1998] | 60000 / 10000   | 780            | 10
RCV1 (Binary) | Lewis et al. [2004] | 20242 / 677399  | 47236          | 2
SATIMAGE      | Hsu and Lin [2002]  | 4435 / 2000     | 36             | 6
W4A           | Platt [1999]        | 7366 / 42383    | 300            | 2
Boosting Exact Greedy-Optimal Decision Trees.
We used AdaBoost for exact decision tree training. Figure 2 shows the total number of example assessments used by AdaBoost with the three decision tree building methods described above. In all of these experiments, our algorithm, Adaptive-Pruning Boost, not only consistently beats Quick Boost but also almost matches the weight order lower bound. Classic AdaBoost can be seen as an upper bound on the total number of example assessments.

Table 2 shows that CPU time improvements correspond to example-assessment improvements for Adaptive-Pruning Boost on all our datasets except RCV1. This could be explained by Figure 2, wherein Quick Boost is seen approaching the lower bound for this particular dataset. While Adaptive-Pruning Boost is closer to the lower bound, its example-assessment improvements are not enough to translate to CPU time improvements.

[Figure 2: total assessments versus boosting round, one panel per dataset.]

Figure 2: We report the total number of assessments at various boosting rounds used by the algorithms, as well as the weight order lower bound. In all of these experiments, our algorithm, AP Boost, not only consistently beats Quick Boost but also almost matches the lower bound.

Table 2: Computational Complexity for AdaBoost. All results are for 500 rounds of boosting except MNIST (300 rounds) and RCV1 (400 rounds).
Dataset  | Boosting | CPU AP-B (s) | CPU QB (s) | Improv. | Assess. AP-B | Assess. QB | Improv.
A6A      | AdaBoost | –            | 4.46E+02   | 5.3%    | 1.69E+09     | 1.83E+09   | 7.8%
MNIST    | AdaBoost | –            | 6.60E+05   | 4.2%    | 3.52E+11     | 3.96E+11   | 11.1%
RCV1     | AdaBoost | –            | 1.58E+05   | -0.5%   | 6.15E+11     | 6.58E+11   | 6.5%
SATIMAGE | AdaBoost | –            | 1.19E+03   | 18.9%   | 8.64E+08     | 1.11E+09   | 22.5%
W4A      | AdaBoost | –            | 3.96E+02   | 27.1%   | 1.69E+09     | 2.41E+09   | 29.8%
Mean     |          |              |            | 11%     |              |            | 15.54%
Boosting Approximate Decision Trees. We used two approximate boosting algorithms. We experimented with Boosting with Weight-Trimming 90% and 99% [Friedman et al., 1998], wherein the weak hypothesis is trained on only 90% or 99% of the weight, and LazyBoost 90% and 50% [Escudero et al., 2001], wherein the weak hypothesis is trained on only 90% or 50% of randomly selected features. Table 3 shows that the CPU time improvements correspond to assessment improvements. Note that approximate algorithms like XGBoost [Chen and Guestrin, 2016] are not competitors to Adaptive-Pruning Boost but rather potential "clients", because such algorithms train on a subset of the data. Therefore, they are not appropriate baselines for our method.

Table 3: Computational Complexity for LazyBoost and Boosting with Weight Trimming. All results are for 500 rounds of boosting except MNIST (300 rounds) and RCV1 (400 rounds).
Dataset  | Boosting        | CPU AP-B (s) | CPU QB (s) | Improv. | Assess. AP-B | Assess. QB | Improv.
A6A      | LazyBoost (0.5) | 1.86E+02     | 1.95E+02   | 4.8%    | 8.48E+08     | 9.22E+08   | 8.1%
MNIST    | LazyBoost (0.5) | 3.46E+05     | 3.52E+05   | 1.8%    | 1.87E+11     | 2.07E+11   | 9.7%
RCV1     | LazyBoost (0.5) | 7.86E+04     | 7.54E+04   | -4.2%   | 3.18E+11     | 3.29E+11   | 3.4%
SATIMAGE | LazyBoost (0.5) | 4.70E+02     | 5.48E+02   | 14.2%   | 5.17E+08     | 6.11E+08   | 15.4%
W4A      | LazyBoost (0.5) | 1.15E+02     | 1.58E+02   | 26.8%   | 8.61E+08     | 1.22E+09   | 29.3%
Mean     |                 |              |            | –       |              |            | –
A6A      | LazyBoost (0.9) | 3.28E+02     | 3.48E+02   | 5.6%    | 1.51E+09     | 1.64E+09   | 7.7%
MNIST    | LazyBoost (0.9) | 5.89E+05     | 6.09E+05   | 3.3%    | 3.20E+11     | 3.59E+11   | 10.9%
RCV1     | LazyBoost (0.9) | 1.38E+05     | 1.37E+05   | -1.0%   | 5.60E+11     | 5.93E+11   | 5.6%
SATIMAGE | LazyBoost (0.9) | 7.37E+02     | 8.89E+02   | 17.1%   | 8.05E+08     | 1.01E+09   | 20%
W4A      | LazyBoost (0.9) | 2.04E+02     | 2.82E+02   | 27.7%   | 1.52E+09     | 2.19E+09   | 30.5%
Mean     |                 |              |            | –       |              |            | –
A6A      | Wt. Trim (0.9)  | 2.69E+02     | 2.69E+02   | 0%      | 1.23E+09     | 1.24E+09   | 1.4%
MNIST    | Wt. Trim (0.9)  | 6.42E+05     | 8.02E+05   | 19.9%   | 4.61E+11     | 4.61E+11   | 0%
RCV1     | Wt. Trim (0.9)  | 8.87E+04     | 8.95E+04   | 0.9%    | 3.65E+11     | 3.79E+11   | 3.6%
SATIMAGE | Wt. Trim (0.9)  | 9.87E+02     | 9.76E+02   | -1.2%   | 1.26E+09     | 1.26E+09   | 0.1%
W4A      | Wt. Trim (0.9)  | 1.88E+02     | 1.96E+02   | 4.1%    | 1.40E+09     | 1.43E+09   | 2.5%
Mean     |                 |              |            | –       |              |            | –
A6A      | Wt. Trim (0.99) | 3.34E+02     | 3.38E+02   | 1.3%    | 1.54E+09     | 1.58E+09   | 2.6%
MNIST    | Wt. Trim (0.99) | 5.80E+05     | 5.69E+05   | -1.8%   | 3.16E+11     | 3.37E+11   | 6.1%
RCV1     | Wt. Trim (0.99) | 1.38E+05     | 1.37E+05   | -1.0%   | 5.61E+11     | 5.86E+11   | 4.4%
SATIMAGE | Wt. Trim (0.99) | 6.49E+02     | 6.68E+02   | 2.9%    | 7.01E+08     | 7.39E+08   | 5.1%
W4A      | Wt. Trim (0.99) | 1.91E+02     | 2.03E+02   | 6.0%    | 1.44E+09     | 1.52E+09   | 5.3%
Mean     |                 |              |            | –       |              |            | –
Conclusion

In this paper, we introduced an efficient exact greedy-optimal algorithm, Adaptive-Pruning Boost, for boosted decision trees. Our experiments on various datasets show that our algorithm uses fewer total example assessments than the state-of-the-art algorithm Quick Boost. We further showed that Adaptive-Pruning Boost almost matches the lower bound for its class of algorithms and the global lower bound for any algorithm.

References
Ron Appel, Thomas Fuchs, Piotr Dollar, and Pietro Perona. Quickly boosting decision trees - pruning underachieving features early. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

Joseph K. Bradley and Robert E. Schapire. FilterBoost: Regression and classification on large datasets. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 185–192. Curran Associates, Inc., 2008.

R. Busa-Fekete and B. Kégl. Fast boosting using adversarial bandits. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785.

P. Dollar, Zhuowen Tu, H. Tao, and S. Belongie. Feature mining for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1–8, June 2007. doi: 10.1109/CVPR.2007.383046.

Charles Dubout and François Fleuret. Adaptive sampling for large scale boosting. Journal of Machine Learning Research, 15(1):1431–1453, January 2014. ISSN 1532-4435.

G. Escudero, L. Màrquez, and G. Rigau. Using LazyBoosting for word sense disambiguation. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, 2001.

Yoav Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, September 1995. ISSN 0890-5401. doi: 10.1006/inco.1995.1136.

Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, ICML '96, pages 148–156, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. ISBN 1-55860-419-7.

J. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 2002.

Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 1998.

Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems (NIPS), 2012.

Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002. ISSN 1045-9227. doi: 10.1109/72.991427.

Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.

Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.

E. Kaufmann and S. Kalyanakrishnan. Information complexity in bandit subset selection. In Proceedings of the 26th Conference On Learning Theory (COLT), 2013.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, December 2004. ISSN 1532-4435.

Biswajit Paul, G. Athithan, and M. Murty. Speeding up AdaBoost classifier with random projection, 2009.

John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.

J. R. Quinlan. Bagging, boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence - Volume 1, AAAI '96, pages 725–730. AAAI Press, 1996. ISBN 0-262-51091-X.

J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993. ISBN 1-55860-238-0.

Robert E. Schapire. The strength of weak learnability. Machine Learning, 1990.

Toby Sharp. Implementing decision trees and forests on a GPU. In ECCV (4), volume 5305, pages 595–608. Springer, January 2008. ISBN 978-3-540-88692-1.

Jianxin Wu, S. Charles Brubaker, Matthew D. Mullin, and James M. Rehg. Fast asymmetric learning for cascade face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):369–382, 2008.

Appendix A: Additional Results
A.1 Train and Test Error for AdaBoost
Table 4 reports test and train errors at various Boosting rounds. Our algorithm achieves the same test and train error with a smaller total number of example assessments, compared to Quick Boost. Note that both algorithms, except in the case of RCV1, have the same test and train error at a given round, as they should, because both train identical decision trees. The discrepancy on RCV1 is due to the algorithms picking a weak learner arbitrarily in case of ties, which does not change the overall results significantly.

Table 4: AdaBoost results, reported at rounds 100, 300 and 500 (400 for RCV1).
Alg: Data       | Assess. (100) | Train | Test  | Assess. (300) | Train | Test  | Assess. (400/500) | Train | Test
AP-B: A6A       | –             | 0.142 | 0.155 | 1.02E+09      | 0.131 | 0.157 | 1.69E+09          | 0.128 | 0.160
QB: A6A         | –             | 0.142 | 0.155 | 1.09E+09      | 0.131 | 0.157 | 1.83E+09          | 0.128 | 0.160
AP-B: MNIST     | –             | 0.106 | 0.111 | 3.52E+11      | 0.057 | 0.064 | –                 | –     | –
QB: MNIST       | –             | 0.106 | 0.111 | 3.96E+11      | 0.057 | 0.064 | –                 | –     | –
AP-B: RCV1      | –             | 0.027 | 0.059 | 4.83E+11      | 0.005 | 0.047 | 6.15E+11          | 0.001 | 0.044
QB: RCV1        | –             | 0.029 | 0.061 | 5.13E+11      | 0.004 | 0.047 | 6.58E+11          | 0.001 | 0.046
AP-B: SATIMAGE  | –             | 0.113 | 0.150 | 5.46E+08      | 0.070 | 0.121 | 8.64E+08          | 0.049 | 0.109
QB: SATIMAGE    | –             | 0.113 | 0.150 | 6.61E+08      | 0.070 | 0.121 | 1.11E+09          | 0.049 | 0.109
AP-B: W4A       | –             | 0.011 | 0.019 | 1.07E+09      | 0.006 | 0.018 | 1.69E+09          | 0.006 | 0.018
QB: W4A         | –             | 0.011 | 0.020 | 1.45E+09      | 0.006 | 0.018 | 2.41E+09          | 0.006 | 0.018

A.2 Train and Test Error for LazyBoost and Weight Trimming
Table 5: Performance for A6A
Algorithm           | Assess. (100) | Train | Test  | Assess. (300) | Train | Test  | Assess. (500) | Train | Test
AP LazyBoost (0.5)  | 1.69E+08      | 0.145 | 0.156 | 5.11E+08      | 0.134 | 0.159 | 8.48E+08      | 0.129 | 0.160
QB LazyBoost (0.5)  | 1.80E+08      | 0.145 | 0.157 | 5.50E+08      | 0.137 | 0.158 | 9.22E+08      | 0.132 | 0.160
AP LazyBoost (0.9)  | 2.99E+08      | 0.141 | 0.156 | 9.07E+08      | 0.133 | 0.157 | 1.51E+09      | 0.130 | 0.159
QB LazyBoost (0.9)  | 3.18E+08      | 0.141 | 0.156 | 9.75E+08      | 0.133 | 0.157 | 1.64E+09      | 0.130 | 0.159
AP Wt. Trim (0.9)   | 2.45E+08      | 0.151 | 0.157 | 7.35E+08      | 0.151 | 0.157 | 1.23E+09      | 0.151 | 0.157
QB Wt. Trim (0.9)   | 2.49E+08      | 0.151 | 0.157 | 7.46E+08      | 0.151 | 0.157 | 1.24E+09      | 0.151 | 0.157
AP Wt. Trim (0.99)  | 3.16E+08      | 0.141 | 0.156 | 9.34E+08      | 0.132 | 0.157 | 1.54E+09      | 0.126 | 0.158
QB Wt. Trim (0.99)  | 3.28E+08      | 0.141 | 0.156 | 9.62E+08      | 0.132 | 0.157 | 1.58E+09      | 0.126 | 0.160

Table 6: Performance for MNIST Digits
Algorithm           | Assess. (100) | Train | Test  | Assess. (200) | Train | Test  | Assess. (300) | Train | Test
AP LazyBoost (0.5)  | 6.65E+10      | 0.150 | 0.145 | 1.28E+11      | 0.098 | 0.098 | 1.87E+11      | 0.076 | 0.079
QB LazyBoost (0.5)  | 7.07E+10      | 0.150 | 0.145 | 1.39E+11      | 0.098 | 0.098 | 2.07E+11      | 0.076 | 0.079
AP LazyBoost (0.9)  | 1.17E+11      | 0.117 | 0.118 | 2.22E+11      | 0.079 | 0.085 | 3.20E+11      | 0.061 | 0.069
QB LazyBoost (0.9)  | 1.25E+11      | 0.117 | 0.118 | 2.43E+11      | 0.079 | 0.085 | 3.59E+11      | 0.061 | 0.069
AP Wt. Trim (0.9)   | 1.53E+11      | 0.901 | 0.901 | 3.07E+11      | 0.901 | 0.901 | 4.61E+11      | 0.901 | 0.901
QB Wt. Trim (0.9)   | 1.53E+11      | 0.900 | 0.901 | 3.07E+11      | 0.900 | 0.901 | 4.61E+11      | 0.900 | 0.901
AP Wt. Trim (0.99)  | 1.19E+11      | 0.117 | 0.124 | 2.21E+11      | 0.076 | 0.080 | 3.16E+11      | 0.062 | 0.068
QB Wt. Trim (0.99)  | 1.29E+11      | 0.115 | 0.117 | 2.37E+11      | 0.074 | 0.078 | 3.37E+11      | 0.056 | 0.061
Table 7: Performance for RCV1

Algorithm           | Assess. (100) | Train | Test  | Assess. (300) | Train | Test  | Assess. (400) | Train | Test
AP LazyBoost (0.5)  | 8.93E+10      | 0.029 | 0.061 | 2.48E+11      | 0.006 | 0.047 | 3.18E+11      | 0.002 | 0.046
QB LazyBoost (0.5)  | 9.06E+10      | 0.028 | 0.060 | 2.55E+11      | 0.005 | 0.048 | 3.29E+11      | 0.002 | 0.046
AP LazyBoost (0.9)  | 1.59E+11      | 0.027 | 0.058 | 4.35E+11      | 0.005 | 0.047 | 5.60E+11      | 0.002 | 0.045
QB LazyBoost (0.9)  | 1.64E+11      | 0.027 | 0.058 | 4.62E+11      | 0.004 | 0.047 | 5.93E+11      | 0.001 | 0.045
AP Wt. Trim (0.9)   | 1.19E+11      | 0.022 | 0.059 | 2.92E+11      | 0.003 | 0.047 | 3.65E+11      | 0.001 | 0.046
QB Wt. Trim (0.9)   | 1.22E+11      | 0.025 | 0.058 | 3.03E+11      | 0.003 | 0.047 | 3.79E+11      | 0.001 | 0.046
AP Wt. Trim (0.99)  | 1.62E+11      | 0.027 | 0.059 | 4.40E+11      | 0.004 | 0.047 | 5.61E+11      | 0.001 | 0.045
QB Wt. Trim (0.99)  | 1.70E+11      | 0.027 | 0.059 | 4.60E+11      | 0.004 | 0.048 | 5.86E+11      | 0.001 | 0.046

Table 8: Performance for SATIMAGE
Algorithm           | Assess. (100) | Train | Test  | Assess. (300) | Train | Test  | Assess. (500) | Train | Test
AP LazyBoost (0.5)  | 1.11E+08      | 0.133 | 0.152 | 3.22E+08      | 0.094 | 0.123 | 5.17E+08      | 0.073 | 0.115
QB LazyBoost (0.5)  | 1.23E+08      | 0.130 | 0.150 | 3.68E+08      | 0.090 | 0.129 | 6.11E+08      | 0.067 | 0.113
AP LazyBoost (0.9)  | 1.88E+08      | 0.114 | 0.128 | 5.13E+08      | 0.071 | 0.119 | 8.05E+08      | 0.050 | 0.110
QB LazyBoost (0.9)  | 2.06E+08      | 0.114 | 0.128 | 6.07E+08      | 0.071 | 0.119 | 1.01E+09      | 0.050 | 0.110
AP Wt. Trim (0.9)   | 2.51E+08      | 0.756 | 0.766 | 7.56E+08      | 0.756 | 0.766 | 1.26E+09      | 0.756 | 0.766
QB Wt. Trim (0.9)   | 2.51E+08      | 0.755 | 0.765 | 7.57E+08      | 0.755 | 0.765 | 1.26E+09      | 0.755 | 0.765
AP Wt. Trim (0.99)  | 1.80E+08      | 0.109 | 0.141 | 4.66E+08      | 0.066 | 0.121 | 7.01E+08      | 0.045 | 0.113
QB Wt. Trim (0.99)  | 1.89E+08      | 0.109 | 0.141 | 4.91E+08      | 0.066 | 0.121 | 7.39E+08      | 0.045 | 0.113

Table 9: Performance for W4A
Algorithm           | Assess. (100) | Train | Test  | Assess. (300) | Train | Test  | Assess. (500) | Train | Test
AP LazyBoost (0.5)  | 2.00E+08      | 0.012 | 0.019 | 5.46E+08      | 0.008 | 0.018 | 8.61E+08      | 0.006 | 0.018
QB LazyBoost (0.5)  | 2.35E+08      | 0.012 | 0.019 | 7.35E+08      | 0.008 | 0.018 | 1.22E+09      | 0.006 | 0.018
AP LazyBoost (0.9)  | 3.48E+08      | 0.012 | 0.020 | 9.66E+08      | 0.007 | 0.018 | 1.52E+09      | 0.006 | 0.018
QB LazyBoost (0.9)  | 4.27E+08      | 0.012 | 0.020 | 1.32E+09      | 0.007 | 0.018 | 2.19E+09      | 0.006 | 0.018
AP Wt. Trim (0.9)   | 2.87E+08      | 0.016 | 0.021 | 8.41E+08      | 0.016 | 0.021 | 1.40E+09      | 0.016 | 0.021
QB Wt. Trim (0.9)   | 2.97E+08      | 0.016 | 0.021 | 8.63E+08      | 0.016 | 0.021 | 1.43E+09      | 0.016 | 0.021
AP Wt. Trim (0.99)  | 3.63E+08      | 0.012 | 0.020 | 9.44E+08      | 0.007 | 0.017 | 1.44E+09      | 0.006 | 0.018
QB Wt. Trim (0.99)  | 3.96E+08      | 0.012 | 0.020 | 1.01E+09      | 0.007 | 0.018 | 1.52E+09      | 0.006 | 0.018

A.3 Different Tree Depths

Table 10: Different Tree Depths: Number of Assessments after 500 rounds
Dataset | Algorithm   | Depth 1 | Depth 2  | Depth 3  | Depth 4  | Depth 5
A6A     | AP Boost    | –       | 1.23E+09 | 1.69E+09 | 2.08E+09 | 2.44E+09
A6A     | Quick Boost | –       | 1.29E+09 | 1.83E+09 | 2.34E+09 | 2.89E+09
W4A     | AP Boost    | –       | 1.38E+09 | 1.69E+09 | 1.90E+09 | 2.12E+09
W4A     | Quick Boost | –       | 1.72E+09 | 2.41E+09 | 3.07E+09 | 3.60E+09

We also experimented with different tree depths, and found that
Adaptive-Pruning Boost shows more dramatic gains in terms of the total number of assessments when it uses deeper trees as weak learners. We believe this is because of accumulated gains from training more nodes in each tree. We have included an example of this in Table 10, where for two datasets (W4A and A6A) we show experiments at depths 1 through 5. We report the total number of assessments used by AdaBoost (exact greedy-optimal decision trees) after 500 rounds.

Appendix B: Information Gain
Notation reference:

• $Z_n$ is the total weight of all $n$ training examples.
• $Z_\rho$ is the weight of examples which reached some leaf $\rho$.
• $Z_u$ and $Z_{\bar{u}}$ are the seen and unseen weight for leaf $\rho$ (where $\rho$ should be clear from context), so $Z_u + Z_{\bar{u}} = Z_\rho$.
• $Z^y_\rho$ is the total weight for leaf $\rho$ with label $y$.
• $Z^y_u$ and $Z^y_{\bar{u}}$ are the seen and unseen weight for leaf $\rho$ with label $y$, so $Z^y_u + Z^y_{\bar{u}} = Z^y_\rho$.
• $Z^{\bar{y}}_\rho$ is the total weight for leaf $\rho$ with some label other than $y$, so $Z^{\bar{y}}_\rho = Z_\rho - Z^y_\rho$.
• $Z^{\bar{y}}_u$ and $Z^{\bar{y}}_{\bar{u}}$ are the seen and unseen weight for leaf $\rho$ with some label other than $y$, so $Z^{\bar{y}}_u + Z^{\bar{y}}_{\bar{u}} = Z^{\bar{y}}_\rho$.
• $w$ is the total unseen weight for all leaves, so $w = \sum_\rho Z_{\bar{u}}$.
• $w^y$ and $w^{\bar{y}}$ are the fractions of total unseen weight with and without label $y$, so $w^y + w^{\bar{y}} = w$.

The "error" term for Information Gain is the conditional entropy of the leaves, written as follows:

$$\epsilon_n := \sum_\rho \frac{Z_\rho}{Z_n} \left( -\sum_y \frac{Z^y_\rho}{Z_\rho} \lg \frac{Z^y_\rho}{Z_\rho} \right) \quad\Rightarrow\quad Z_n \epsilon_n = \sum_\rho \underbrace{\left( -\sum_y Z^y_\rho \lg \frac{Z^y_\rho}{Z_\rho} \right)}_{Z_\rho \epsilon_\rho}.$$

Expanding a single leaf's term,

$$Z_\rho \epsilon_\rho = -\sum_y Z^y_\rho \lg \frac{Z^y_\rho}{Z_\rho} = -\sum_y (Z^y_u + Z^y_{\bar{u}}) \lg \frac{Z^y_u + Z^y_{\bar{u}}}{Z_u + Z_{\bar{u}}}$$
$$= \underbrace{\left( -\sum_y Z^y_u \lg \frac{Z^y_u}{Z_u} \right)}_{Z_u \epsilon_u} + \underbrace{\left( -\sum_y Z^y_{\bar{u}} \lg \frac{Z^y_{\bar{u}}}{Z_{\bar{u}}} \right)}_{Z_{\bar{u}} \epsilon_{\bar{u}}} + \sum_y Z^y_\rho \, \mathrm{KL}\!\left( \mathcal{B}\!\left( \frac{Z^y_u}{Z^y_\rho} \right) \middle\| \mathcal{B}\!\left( \frac{Z_u}{Z_\rho} \right) \right),$$

where the final equality follows by Lemma 2, proved below. The bounds on information gain thus ultimately depend on $Z_{\bar{u}} \epsilon_{\bar{u}}$ and on the KL divergence term,

$$\sum_y Z^y_\rho \, \mathrm{KL}\!\left( \mathcal{B}\!\left( \frac{Z^y_u}{Z^y_\rho} \right) \middle\| \mathcal{B}\!\left( \frac{Z_u}{Z_\rho} \right) \right), \qquad (1)$$

where $\mathrm{KL}(\cdot \| \cdot)$ is the Kullback-Leibler divergence and $\mathcal{B}(\cdot)$ is a Bernoulli probability distribution:

$$\mathrm{KL}(\mathcal{B}(p) \| \mathcal{B}(q)) = p \lg \frac{p}{q} + (1 - p) \lg \frac{1 - p}{1 - q}.$$

Since $Z_{\bar{u}} \epsilon_{\bar{u}} \ge 0$ and the KL divergence is non-negative, a trivial lower bound is

$$Z_\rho \epsilon_\rho \ge Z_u \epsilon_u = -\sum_y Z^y_u \lg \frac{Z^y_u}{Z_u}. \qquad (2)$$

It remains to prove an upper bound. We upper bound the weight $Z^y_\rho$ of the KL divergence as $Z^y_\rho \le Z^y_u + w^y$. Below, we prove the following upper bound on the KL divergence in Eq. 1.

Lemma 1 (KL Upper Bound). For any individual leaf $\rho$ and label $y$, we have

$$\mathrm{KL}\!\left( \mathcal{B}\!\left( \frac{Z^y_u}{Z^y_\rho} \right) \middle\| \mathcal{B}\!\left( \frac{Z_u}{Z_\rho} \right) \right) \le \lg \frac{Z_u + w}{Z^y_u}.$$
In order to complete our upper bound, we note that $Z_{\bar{u}} \epsilon_{\bar{u}}$ is simply the unassessed weight $Z_{\bar{u}}$ times the label entropy for the unassessed weight, and with $|Y|$ total labels the label entropy is upper bounded by $\lg |Y|$. This yields the following bounds on the conditional entropy term for Information Gain:

$$\sum_\rho Z_u \epsilon_u \;\le\; Z_n \epsilon_n \;\le\; \sum_\rho \left[ Z_u \epsilon_u + w \lg |Y| + \sum_y (Z^y_u + w^y) \lg \frac{Z_u + w}{Z^y_u} \right]. \qquad (3)$$

Our proofs follow.

Proof of Lemma 1.
We bound the KL divergence using the Rényi divergence and by bounding the two Bernoulli probability ratios. Our probabilities are

$$\left( \frac{Z^y_u}{Z^y_\rho},\; 1 - \frac{Z^y_u}{Z^y_\rho} \right) = \left( \frac{Z^y_u}{Z^y_u + Z^y_{\bar{u}}},\; \frac{Z^y_{\bar{u}}}{Z^y_u + Z^y_{\bar{u}}} \right) \qquad (4)$$

and

$$\left( \frac{Z_u}{Z_\rho},\; 1 - \frac{Z_u}{Z_\rho} \right) = \left( \frac{Z_u}{Z_u + Z_{\bar{u}}},\; \frac{Z_{\bar{u}}}{Z_u + Z_{\bar{u}}} \right). \qquad (5)$$

Our two ratios are upper bounded as follows:

$$\frac{Z^y_u / (Z^y_u + Z^y_{\bar{u}})}{Z_u / (Z_u + Z_{\bar{u}})} = \frac{Z^y_u}{Z^y_u + Z^y_{\bar{u}}} \times \frac{Z_u + Z_{\bar{u}}}{Z_u} \le \frac{Z^y_u}{Z^y_u} \times \frac{Z_u + w}{Z_u} \le \frac{Z_u + w}{Z_u} \qquad (6)$$

and

$$\frac{Z^y_{\bar{u}} / (Z^y_u + Z^y_{\bar{u}})}{Z_{\bar{u}} / (Z_u + Z_{\bar{u}})} = \frac{Z^y_{\bar{u}}}{Z^y_u + Z^y_{\bar{u}}} \times \frac{Z_u + Z_{\bar{u}}}{Z_{\bar{u}}} = \frac{Z^y_{\bar{u}}}{Z^y_u + Z^y_{\bar{u}}} \times \frac{Z_u + Z_{\bar{u}}}{Z^y_{\bar{u}} + Z^{\bar{y}}_{\bar{u}}} \le \frac{Z^y_{\bar{u}}}{Z^y_u + Z^y_{\bar{u}}} \times \frac{Z_u + Z_{\bar{u}}}{Z^y_{\bar{u}}} \qquad (7)$$
$$\le \frac{Z_u + w}{Z^y_u}. \qquad (8)$$

Since $\frac{Z_u + w}{Z_u} \le \frac{Z_u + w}{Z^y_u}$, by the Rényi divergence of order $\infty$,

$$D_\infty(\mathcal{B}(p) \| \mathcal{B}(q)) = \lg \sup_i \frac{p_i}{q_i}$$

(i.e. the log of the maximum ratio of probabilities), we conclude that

$$\mathrm{KL}\!\left( \mathcal{B}\!\left( \frac{Z^y_u}{Z^y_\rho} \right) \middle\| \mathcal{B}\!\left( \frac{Z_u}{Z_\rho} \right) \right) \le \lg \frac{Z_u + w}{Z^y_u}. \qquad \square$$
Lemma 2. For $a, b \ge 0$ and $\alpha, \beta > 0$,

$$(a + b) \lg \frac{a + b}{\alpha + \beta} = a \lg \frac{a}{\alpha} + b \lg \frac{b}{\beta} - (a + b)\, \mathrm{KL}\!\left( \mathcal{B}\!\left( \frac{a}{a + b} \right) \middle\| \mathcal{B}\!\left( \frac{\alpha}{\alpha + \beta} \right) \right). \qquad (9)$$

Proof.

$$(a + b) \lg \frac{a + b}{\alpha + \beta} = a \lg \frac{a + b}{\alpha + \beta} + b \lg \frac{a + b}{\alpha + \beta}$$
$$= a \lg \left( \frac{a}{\alpha} \cdot \frac{\alpha (a + b)}{a (\alpha + \beta)} \right) + b \lg \left( \frac{b}{\beta} \cdot \frac{\beta (a + b)}{b (\alpha + \beta)} \right)$$
$$= a \lg \frac{a}{\alpha} + b \lg \frac{b}{\beta} + (a + b) \left[ \frac{a}{a + b} \lg \frac{\alpha / (\alpha + \beta)}{a / (a + b)} + \frac{b}{a + b} \lg \frac{\beta / (\alpha + \beta)}{b / (a + b)} \right]$$
$$= a \lg \frac{a}{\alpha} + b \lg \frac{b}{\beta} - (a + b)\, \mathrm{KL}\!\left( \mathcal{B}\!\left( \frac{a}{a + b} \right) \middle\| \mathcal{B}\!\left( \frac{\alpha}{\alpha + \beta} \right) \right). \qquad \square$$
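As a quick numeric sanity check of the identity in Eq. 9 (our own addition, not part of the original appendix), the following snippet verifies it for random positive inputs.

```python
import math
import random

def kl_bernoulli(p, q):
    """KL(B(p) || B(q)) in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def lemma2_gap(a, b, alpha, beta):
    """LHS minus RHS of Eq. (9); should be ~0 for positive inputs."""
    lhs = (a + b) * math.log2((a + b) / (alpha + beta))
    rhs = (a * math.log2(a / alpha) + b * math.log2(b / beta)
           - (a + b) * kl_bernoulli(a / (a + b), alpha / (alpha + beta)))
    return lhs - rhs

for _ in range(5):
    a, b, al, be = (random.uniform(0.1, 10) for _ in range(4))
    assert abs(lemma2_gap(a, b, al, be)) < 1e-8
```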