Proceedings of Machine Learning Research 101:1–17, 2019 ACML 2019
Stochastic Gradient Trees
Henry Gouk [email protected]
School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
Bernhard Pfahringer [email protected]
Eibe Frank [email protected]
Department of Computer Science, University of Waikato, Hamilton, New Zealand
Editors: Wee Sun Lee and Taiji Suzuki
Abstract
We present an algorithm for learning decision trees using stochastic gradient information as the source of supervision. In contrast to previous approaches to gradient-based tree learning, our method operates in the incremental learning setting rather than the batch learning setting, and does not make use of soft splits or require the construction of a new tree for every update. We demonstrate how one can apply these decision trees to different problems by changing only the loss function, using classification, regression, and multi-instance learning as example applications. In the experimental evaluation, our method performs similarly to standard incremental classification trees, outperforms state of the art incremental regression trees, and achieves comparable performance with batch multi-instance learning methods.
Keywords: Decision tree induction, gradient-based optimisation, data stream mining, multi-instance learning.
1. Introduction
Stochastic gradient descent is the workhorse of contemporary machine learning. Methods for scalable gradient-based optimisation have allowed deep neural networks to tackle a broad range of problems, from binary classification to playing video games (LeCun et al., 2015). The scalable nature of incremental methods like stochastic gradient descent enables training on very large datasets that cannot fit in memory, and combining it with automatic differentiation allows one to solve new tasks by simply changing the loss function. In contrast, applying other model classes, such as decision trees, to new tasks requires the design of a new optimisation algorithm that can search for models that perform well on the new problem. Moreover, if one intends to train such a model on a large dataset, this optimisation algorithm must scale well. As such, designing a general purpose algorithm for incrementally constructing decision trees that minimise arbitrary differentiable loss functions would be of great interest to the machine learning community. Such an algorithm would enable decision trees to be applied to a broad range of problems with minimal effort.

Hoeffding trees (Domingos and Hulten, 2000) are one approach for incrementally constructing decision trees, but they lack the generality of the method proposed in this paper. In order to adapt the Hoeffding tree induction algorithm to tasks other than classification, one must select a new heuristic for measuring the quality of splits, and also prove an upper bound on this measure in order to apply the Hoeffding inequality. Conversely, the gradient boosting literature has provided many examples of how one can construct an ensemble of decision trees using arbitrary differentiable loss functions (Friedman, 2001; Chen and Guestrin, 2016; Ke et al., 2017). However, the resulting models are typically very large, often containing hundreds, or sometimes thousands, of trees. Constructing these ensembles generally requires significant computing resources and highly optimised implementations, even for modestly sized datasets (Mitchell and Frank, 2017). In this paper we propose stochastic gradient trees (SGT), which are both general and scalable. The generality comes from the ability to optimise for arbitrary differentiable loss functions, and the scalability is due to the ability of this algorithm to incrementally build a single tree using gradient information, rather than constructing a large ensemble.

Several tasks are used to demonstrate the broad applicability of SGTs. Firstly, it is demonstrated that SGTs perform similarly to Hoeffding trees when applied in the streaming classification setting, where the learner may only see each instance once during training. Following this, we compare SGTs trained with the squared error loss function to several variants of Hoeffding trees that are specialised for regression. These experiments show that SGTs achieve state of the art performance on challenging streaming regression problems. Lastly, multi-instance learning (MIL) is considered. In this problem, one is supplied with "bags" during training. Each bag contains several feature vectors, but only a single label. If a bag is labelled positive, then at least one instance inside the bag is a positive example. Otherwise, all training examples in the bag are negative.
In this setting, SGTs exhibit competitive performance to specialised batch multi-instance learning methods.

To summarise, our contributions are three-fold: (i) we show how incremental decision trees can be adapted to use stochastic estimates of the gradient as the source of supervision; (ii) to remove the requirement of deriving bounds on the gradients and Hessians of each new loss function, we demonstrate how t-tests can be used in place of the Hoeffding inequality when splitting a node; (iii) we demonstrate how our novel incremental decision tree can be applied to streaming classification, regression, and multi-instance learning.
2. Related Work
Hoeffding trees (Domingos and Hulten, 2000) are a commonly used technique for incremental learning. In each leaf node, they maintain a co-occurrence histogram between feature values and classes in order to determine the quality of potential splits through the use of a standard split quality measure, such as the information gain. The Hoeffding concentration inequality is applied to determine whether there is enough evidence to identify the best split, or if more training examples must be seen before a split can be performed. Various modifications of Hoeffding trees exist that enable them to solve problems other than classification. The FIMT-DD (Ikonomovska et al., 2011b) and ORTO (Ikonomovska et al., 2011a) methods are Hoeffding tree variants designed for streaming regression problems. FIMT-DD makes use of linear models in the leaf nodes of each tree to increase the precision of predictions, while ORTO utilises option nodes to enable each instance to travel down multiple paths in the tree. Read et al. (2012) and Mastelini et al. (2019) extend Hoeffding trees to address streaming multi-label classification and multi-target regression problems, respectively. For each of these modifications, a new measure of split quality must be chosen, and an upper bound derived so that the Hoeffding inequality can be applied. Our method automates both of these steps by leveraging gradient information from arbitrary loss functions to measure split quality, and using a different hypothesis test to determine if a node is ready to be split.

Previous work on learning decision trees with differentiable loss functions has focused on the use of soft splits. Suarez and Lutsko (1999) first use a conventional decision tree induction method (the CART method of Breiman (1984)) to find the structure of the tree. Following this, each hard split is converted into a soft split by replacing the hard threshold with a logistic regression model. These soft splits are subsequently fine-tuned using a process similar to backpropagation. While this continuous approximation of discrete models improves performance on regression tasks, it does not enable one to train models on new problems, as the structure of the trees is still learnt using a traditional tree induction technique. More recently, Yang et al. (2018) focus on learning interpretable models by training quasi-soft decision trees using backpropagation, and subsequently converting soft splits into hard splits. The depth of the tree is determined by the number of features in the dataset, and the differentiable method for learning tree structure requires storing a proper n-ary tree. This, coupled with a learned discretization method, causes the algorithm to scale very poorly with the number of features in the dataset. In contrast to these methods, our approach is able to efficiently learn the structure of tree models, as well as select the appropriate features and thresholds at each split.

Gradient boosting (Friedman, 2001) is a technique that can be used to construct an ensemble of classifiers trained to minimise an arbitrary differentiable loss function. Each addition of a new model to the ensemble can be thought of as performing a Newton step in function space. There are two main differences between the method presented in this paper and methods such as XGBoost (Chen and Guestrin, 2016) and LightGBM (Ke et al., 2017). Firstly, our method learns incrementally, whereas these popular gradient boosting approaches are inherently batch learning algorithms.
Secondly, SGTs are not an ensemble technique: when constructing a stochastic gradient tree, a Newton step is performed with every update to a single tree, rather than with each new tree added to an ensemble. Updates to an SGT take the form of splits to leaf nodes or updates to the prediction value of a leaf node.
3. Stochastic Gradient Trees
In supervised incremental learning, data is of the form $(x_t, y_t) \in \mathcal{X} \times \mathcal{Y}$, a new pair arrives at every time step, $t$, and the aim is to predict the value of $y_t$ given $x_t$. Algorithms for this setting must be incremental and enable prediction at any time step: they cannot wait until all instances have arrived and then train a model. In this section, we describe our method for incrementally constructing decision trees that can be trained to optimise arbitrary twice-differentiable loss functions. The first key ingredient is a technique for evaluating splits and computing leaf node predictions using only gradient information. Secondly, to enable loss functions that have unbounded gradients, we employ standard one-sample $t$-tests, rather than hypothesis tests based on the Hoeffding inequality, to determine whether enough evidence has been observed to justify splitting a node.

We assume a loss function, $\ell(y, \hat{y})$, that measures how well our predictions, $\hat{y}$, match the true values, $y$. Predictions are generated using an SGT, optionally composed with an activation function, $\sigma$,

$$\hat{y} = \sigma(f(x)). \tag{1}$$

Training should minimise the expected value of the loss, as estimated from the data observed between the current time step, $t$, and the time step, $r$, at which the tree was previously updated. Assuming i.i.d. data, the expectation can be stochastically approximated using the most recent observations,

$$\mathbb{E}[\ell(y, \hat{y})] \approx \frac{1}{t - r} \sum_{i=r+1}^{t} \ell(y_i, \hat{y}_i). \tag{2}$$

The predictions, $\hat{y}_i$, are obtained from the SGT, $f_t$. At each time step, we aim to find a modification, $u : \mathcal{X} \to \mathbb{R}$, to the tree that takes a step towards minimising the expected loss. Because $f_t$ is a decision tree, $u$ will be a function that represents a possible split to one of its leaf nodes, or an update to the prediction made by a leaf: the addition of $f_t$ and $u$ is the act of splitting a node in $f_t$, or changing the value predicted by an existing leaf node. Formally, the process for considering updates to the tree at each time step is given by

$$f_{t+1} = f_t + \arg\min_u \left[ L_t(u) + \Omega(u) \right], \tag{3}$$

where

$$L_t(u) = \sum_{i=r+1}^{t} \ell(y_i, f_t(x_i) + u(x_i)), \tag{4}$$

and

$$\Omega(u) = \gamma |Q_u| + \frac{\lambda}{2} \sum_{j \in Q_u} v_u(j)^2. \tag{5}$$

The $\Omega$ term is a regulariser, $Q_u \subset \mathbb{N}$ is the set of unique identifiers for the new leaf nodes associated with $u$, and $v_u : \mathbb{N} \to \mathbb{R}$ maps these new leaf node identifiers to the difference between their predictions and the prediction made by their parent. The first term in $\Omega$ imposes a cost for each new node added to the tree, and the second term can be interpreted as a prior that encourages the leaf prediction values to be small. In our experiments, we set $\lambda$ to 0 and $\gamma$ to 1. In the case of Hoeffding trees, and also SGTs, only the leaf that contains $x_t$ will be considered for splitting at time $t$, and information from all previous instances that have arrived in that leaf will be used to determine the quality of potential splits. The algorithm also has the option to leave the tree unmodified if there is insufficient evidence to determine the best split.
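To make the shape of this update rule concrete, the following is a minimal sketch of the per-instance training loop, assuming hypothetical `Leaf` and `Loss` helper types; these names are illustrative and are not taken from the paper's implementation. Only the leaf reached by $x_t$ accumulates gradient statistics, and candidate splits are evaluated lazily.

```java
// Minimal sketch of the SGT update loop (Equation 3). The Leaf and Loss
// abstractions here are hypothetical, not the authors' actual API.
public class SGTLoopSketch {
    interface Loss {
        double gradient(double y, double yHat); // dl/df
        double hessian(double y, double yHat);  // d^2 l/df^2
    }

    static class Leaf {
        double prediction; // value currently predicted by this leaf
        long count;        // instances seen by this leaf since the last update
        // ... per-candidate-split gradient/Hessian sums would live here ...
        void accumulate(double g, double h, double[] x) { count++; }
        boolean trySplitOrUpdate(double gamma, double lambda) { return false; }
    }

    Leaf root = new Leaf();
    Loss loss;
    int checkPeriod = 200; // split checks performed periodically (text default: 200)

    Leaf route(double[] x) { return root; } // a real tree traverses internal nodes

    // One incremental step: route the instance, accumulate first- and
    // second-order statistics, and occasionally test candidate splits.
    void update(double[] x, double y) {
        Leaf leaf = route(x);
        double g = loss.gradient(y, leaf.prediction);
        double h = loss.hessian(y, leaf.prediction);
        leaf.accumulate(g, h, x);
        if (leaf.count % checkPeriod == 0) {
            leaf.trySplitOrUpdate(/* gamma = */ 1.0, /* lambda = */ 0.0);
        }
    }
}
```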
There are two obstacles to incrementally training a tree using an arbitrary loss function. Firstly, the splitting criterion must be designed to be consistent with the loss to be minimised. Secondly, the prediction values of the leaf nodes must be chosen in a manner that is consistent with the loss. Both problems can be overcome by adapting a trick used in gradient boosting techniques (Friedman, 2001; Chen and Guestrin, 2016; Ke et al., 2017) that expands an ensemble of trees by applying a Taylor expansion of the loss function around the current state of the ensemble. We only consider modification of a single tree, therefore the empirical expectation of the loss function can be approximated using a Taylor expansion around the unmodified tree at time $t$:

$$L_t(u) \approx \sum_{i=r+1}^{t} \left[ \ell(y_i, f_t(x_i)) + g_i u(x_i) + \frac{1}{2} h_i u(x_i)^2 \right], \tag{6}$$

where $g_i$ and $h_i$ are the first and second derivatives, respectively, of $\ell$ with respect to $f_t(x_i)$. Optimisation can be further simplified by eliminating the constant first term inside the summation, resulting in

$$\Delta L_t(u) = \sum_{i=r+1}^{t} \left[ g_i u(x_i) + \frac{1}{2} h_i u(x_i)^2 \right] = \sum_{i=r+1}^{t} \Delta \ell_i(u), \tag{7}$$

which now describes the change in loss due to the split, $u$. This function is evaluated for each possible split to find the one that yields the maximum reduction in loss. As in the Hoeffding tree algorithm, at time $t$, we only attempt to split the leaf node into which $x_t$ falls, and we consider splitting on each attribute. For each potential split, we need to decide what values should be assigned to any newly created leaf nodes. Note that we also consider the option of not performing a split at all, and only updating the prediction made by the existing leaf node.

We introduce some notation to explain our procedure. Firstly, we define what a potential split looks like:

$$u(x) = \begin{cases} v_u(q_u(x)), & \text{if } x \in \mathrm{Domain}(q_u) \\ 0, & \text{otherwise,} \end{cases} \tag{8}$$

where $q_u$ maps an instance in the current leaf node to an identifier for a leaf node that would be created if the split were performed. We denote the codomain of $q_u$, the set of identifiers for leaf nodes that would be created as a result of performing this split, as $Q_u$. We define $I_u^j$ as the set of indices of the instances that would reach the new leaf node identified by $j$. The objective can then be rewritten as

$$\Delta L_t(u) = \sum_{j \in Q_u} \sum_{i \in I_u^j} \left[ g_i v_u(j) + \frac{1}{2} h_i v_u(j)^2 \right], \tag{9}$$

which can be rearranged to

$$\Delta L_t(u) = \sum_{j \in Q_u} \left[ \Big( \sum_{i \in I_u^j} g_i \Big) v_u(j) + \frac{1}{2} \Big( \sum_{i \in I_u^j} h_i \Big) v_u(j)^2 \right], \tag{10}$$

which uses the sums of the gradient and Hessian values that have been seen thus far. The optimal $v_u(j)$ for each candidate leaf can be found by taking the relevant term in Equation 10 and adding the corresponding term from $\Omega$,

$$\Big( \sum_{i \in I_u^j} g_i \Big) v_u(j) + \frac{1}{2} \Big( \sum_{i \in I_u^j} h_i \Big) v_u(j)^2 + \frac{\lambda}{2} v_u(j)^2, \tag{11}$$

then setting the derivative with respect to $v_u(j)$ to zero,

$$0 = \sum_{i \in I_u^j} g_i + \Big( \lambda + \sum_{i \in I_u^j} h_i \Big) v_u(j), \tag{12}$$

and solving for $v_u(j)$, yielding

$$v_u^*(j) = - \frac{\sum_{i \in I_u^j} g_i}{\lambda + \sum_{i \in I_u^j} h_i}. \tag{13}$$

Viewing the expected loss as a functional, this induction procedure can be thought of as performing Newton's method in function space. In gradient boosting, the addition of each new tree to the ensemble performs a Newton step in function space. The difference in our approach is that each Newton step consists of modifying a prediction value or performing a single split, rather than constructing an entire tree. For loss functionals that cannot be perfectly represented with the quadratic approximation in Newton-type methods, an SGT can potentially take advantage of gradient information more effectively than trees trained using conventional gradient tree boosting (Chen and Guestrin, 2016; Ke et al., 2017).
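Split evaluation therefore reduces to bookkeeping of per-branch gradient and Hessian sums. The sketch below illustrates Equations 10 to 13 under that view; the method names are illustrative and do not come from the actual implementation.

```java
// Sketch of Equations 10-13: scoring one candidate split from the
// per-branch gradient/Hessian sums. Names are illustrative only.
public class SplitScoreSketch {
    // Optimal prediction delta for one candidate leaf (Equation 13).
    static double optimalValue(double sumG, double sumH, double lambda) {
        return -sumG / (lambda + sumH);
    }

    // Change in loss contributed by one candidate leaf evaluated at v
    // (the relevant term of Equation 10 plus the lambda term of Equation 11).
    static double deltaLoss(double sumG, double sumH, double v, double lambda) {
        return sumG * v + 0.5 * sumH * v * v + 0.5 * lambda * v * v;
    }

    // Total objective for a candidate split over its branches, including the
    // per-node cost gamma from the regulariser (Equation 5).
    static double splitObjective(double[] sumG, double[] sumH,
                                 double gamma, double lambda) {
        double total = gamma * sumG.length; // cost of the new leaf nodes
        for (int j = 0; j < sumG.length; j++) {
            double v = optimalValue(sumG[j], sumH[j], lambda);
            total += deltaLoss(sumG[j], sumH[j], v, lambda);
        }
        return total; // more negative = larger expected reduction in loss
    }

    public static void main(String[] args) {
        // Toy example: a binary split whose branches accumulated these sums.
        double[] sumG = {-4.0, 3.0};
        double[] sumH = {5.0, 6.0};
        System.out.println(splitObjective(sumG, sumH, 1.0, 0.0));
    }
}
```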
When splitting on nominal attributes, we create a branch for each value of the attribute, yielding a multi-way split. We deal with numeric attributes by discretizing them using simple equal width binning. A sample of instances from the incoming data is used to estimate the minimum and maximum values of each numeric attribute, if these are not already known in advance. Any future values that do not lie in the estimated range are clipped. The number of bins and the number of instances used to estimate the range of attribute values are user-provided hyperparameters. In our experiments, we set them to 64 and 1,000, respectively. Given a discretized attribute, we consider all possible binary splits that can be made based on the bin boundaries, thus treating it as ordinal (Frank and Witten, 1999).
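A minimal sketch of this discretisation scheme is given below, using the defaults of 64 bins and a 1,000-instance range-estimation sample stated above; the class itself is illustrative rather than a reproduction of the actual implementation.

```java
// Sketch of the equal-width discretisation described above: estimate the
// range of a numeric attribute from an initial sample, then clip and bin.
public class EqualWidthBinner {
    private final int numBins;    // paper default: 64
    private final int sampleSize; // paper default: 1,000
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private int seen = 0;

    EqualWidthBinner(int numBins, int sampleSize) {
        this.numBins = numBins;
        this.sampleSize = sampleSize;
    }

    // Returns the bin index for v, updating the range estimate while the
    // initial sample is still being collected.
    int bin(double v) {
        if (seen < sampleSize) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            seen++;
        }
        double clipped = Math.max(min, Math.min(max, v)); // clip out-of-range values
        if (max == min) return 0;
        int b = (int) ((clipped - min) / (max - min) * numBins);
        return Math.min(b, numBins - 1); // v == max falls into the top bin
    }

    public static void main(String[] args) {
        EqualWidthBinner binner = new EqualWidthBinner(64, 1000);
        for (double v : new double[] {0.0, 0.5, 1.0}) binner.bin(v);
        System.out.println(binner.bin(0.73));
    }
}
```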
Equation 10 estimates the quality of a split but does not indicate whether a split should be made. Hoeffding trees use the Hoeffding concentration inequality to make this decision. It states that, with some probability $1 - \delta$,

$$\mathbb{E}[X] > \bar{X} - \epsilon, \tag{14}$$

with

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}, \tag{15}$$

where $\bar{X}$ is the sample mean of a sequence of random variables $X_i$, $R$ is the range of values each $X_i$ can take, and $n$ is the sample size used to calculate $\bar{X}$. Suppose the best split considered at time $t$ is $u_a$. Let $\bar{L} = \frac{1}{n} \Delta \hat{L}_t(u_a)$ be the mean change in loss if the split were applied, as measured on a sample of $n \leq t$ instances. Thus, if $-\bar{L} > \epsilon$, we know, with $1 - \delta$ confidence, that applying this split will result in a reduction of loss on future instances.

In order to apply the Hoeffding bound, we must know the range, $R$, of values that can be taken by the $n$ terms, $\Delta \hat{\ell}_i$, in $\Delta \hat{L}$. In our application, this would require proving upper and lower bounds on the first and second derivatives of the loss function, and constraining the output of the tree to lie within some prespecified range, thus preventing rapid experimentation with different loss functions for novel tasks, one of the properties of deep learning that enables such a diverse set of tasks to be solved. To circumvent this problem, we instead use Student's $t$-test to determine whether a split should be made. The $t$ statistic is computed by

$$t = \frac{\bar{L} - \mathbb{E}[\bar{L}]}{s / \sqrt{n}}, \tag{16}$$

where $s$ is the sample standard deviation of the $L_i$ and, under the null hypothesis, $\mathbb{E}[\bar{L}]$ is assumed to be zero, i.e., it is assumed that the split does not result in a change in loss. A $p$ value can be computed using the inverse cumulative distribution function of the $t$ distribution and, if $p$ is less than $\delta$, the split can be applied.

This test assumes that $\bar{L}$ follows a normal distribution. Although it cannot be assumed that each $L_i$ will be normally distributed, due to the central limit theorem, we are justified in assuming $\bar{L}$ will be normally distributed for sufficiently large $n$. Computing $s$ requires estimating the sample variance of the $L_i$, which is made easier by initially considering each of the new leaf nodes, $j \in Q_u$, in isolation:

$$\mathrm{Var}(L_i) = \mathrm{Var}\left( G_i v_u(j) + \frac{1}{2} H_i v_u(j)^2 \right), \tag{17}$$

where $G_i$ and $H_i$ are the random variables representing the gradient and Hessian values, respectively. We intentionally treat $v_u(j)$ as a constant, even though this ignores the correlation between the prediction update values and the gradient and Hessian values. Empirically, this does not appear to matter, and it eliminates the need to compute the variance of a quotient of random variables, an expression for which there is no distribution-free solution.

Equation 17 cannot be computed incrementally because the $v_u(j)$ are not known until all the data has been seen. It is also infeasible to store all of the gradient and Hessian pairs because this could lead to unbounded memory usage. Instead, the equation can be rearranged using some fundamental properties of variances to yield

$$\mathrm{Var}(L_i) = v_u^2 \mathrm{Var}(G_i) + \frac{1}{4} v_u^4 \mathrm{Var}(H_i) + v_u^3 \mathrm{Cov}(G_i, H_i), \tag{18}$$

where we have dropped the "$(j)$" for compactness.
The variances and covariances associated with each feature value can be incrementally estimated using Welford's method (Welford, 1962; Bennett et al., 2009). The process used to determine whether enough evidence has been collected to justify a split would be prohibitively expensive to carry out every time a new instance arrives. In practice, we follow the common trend in online decision tree induction and only check whether enough evidence exists to perform a split when the number of instances that have fallen into a leaf node is a multiple of some user-specified parameter. As with many incremental decision tree induction implementations, this value is set to 200 by default.
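The sketch below illustrates how these pieces might fit together: a Welford-style single-pass estimator for the gradient/Hessian variances and covariance feeding Equation 18, followed by the one-sample t-test. It assumes Apache Commons Math is on the classpath for the t distribution CDF; the class and field names are illustrative, not taken from the authors' code.

```java
// Sketch of the split decision: Welford-style incremental moments for the
// gradient and Hessian, and a one-sample t-test of the mean loss change.
import org.apache.commons.math3.distribution.TDistribution;

public class SplitTestSketch {
    private long n = 0;
    private double meanG = 0, meanH = 0; // running means of gradient/Hessian
    private double m2G = 0, m2H = 0;     // sums of squared deviations
    private double cGH = 0;              // co-moment of gradient and Hessian

    // Welford/Bennett-style single-pass update for variances and covariance.
    void add(double g, double h) {
        n++;
        double dG = g - meanG, dH = h - meanH;
        meanG += dG / n;
        meanH += dH / n;
        m2G += dG * (g - meanG);
        m2H += dH * (h - meanH);
        cGH += dG * (h - meanH);
    }

    double varG() { return m2G / (n - 1); }
    double varH() { return m2H / (n - 1); }
    double covGH() { return cGH / (n - 1); }

    // Variance of the loss delta for a candidate leaf value v (Equation 18),
    // treating v as a constant as described in the text.
    double varDelta(double v) {
        return v * v * varG() + 0.25 * v * v * v * v * varH()
                + v * v * v * covGH();
    }

    // True if the estimated mean change in loss is significantly negative,
    // i.e. the one-sided p value falls below delta.
    boolean shouldSplit(double meanDelta, double variance, double delta) {
        double t = meanDelta / Math.sqrt(variance / n);
        double p = new TDistribution(n - 1).cumulativeProbability(t);
        return p < delta; // loss decreases with confidence 1 - delta
    }
}
```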
4. Example Tasks
This section outlines the tasks and associated loss functions used to demonstrate the generality of stochastic gradient trees in the experimental evaluation. For completeness, the first and second derivatives of each loss function are given, though we note that it would be very easy to incorporate an automatic differentiation system into an SGT implementation to remove the need to derive these manually.
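As an illustration of the interface a loss must expose to the tree learner, the following sketch encodes the squared error loss whose derivatives are given below; the interface itself is hypothetical, not the actual implementation's API.

```java
// Illustrative encoding of a twice-differentiable loss for the tree learner.
// The interface is hypothetical; the derivatives match the squared error
// loss given in the text (Equations 23-25).
public interface TwiceDifferentiableLoss {
    double loss(double y, double fx);
    double gradient(double y, double fx); // dl/df(x)
    double hessian(double y, double fx);  // d^2 l/df(x)^2

    TwiceDifferentiableLoss SQUARED_ERROR = new TwiceDifferentiableLoss() {
        public double loss(double y, double fx)     { return 0.5 * (fx - y) * (fx - y); }
        public double gradient(double y, double fx) { return fx - y; }
        public double hessian(double y, double fx)  { return 1.0; }
    };
}
```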
Streaming classification is a variant of the typical classification problem encountered in machine learning where a learning algorithm is presented with a continuous stream of examples, and must efficiently update the model with the new knowledge obtained at each time step. In this setting, the length of the stream is unknown, and predictions can be requested from the model at any time. This means the model must be trained incrementally. In this paper, the multiclass streaming classification problem is addressed using a committee of SGTs, where one tree is trained for each class. This committee is composed with a softmax function, so the probability that an instance, $x_i$, belongs to class $j$ is estimated by

$$\hat{y}_{i,j} = \frac{\exp\{f_j(x_i)\}}{\sum_{c=1}^{k} \exp\{f_c(x_i)\}}, \tag{19}$$

where $f_c$ is the SGT trained to predict a real-valued score for class $c$, and $k$ is the number of classes. In practice, we hard-wire $f_k(x) = 0$ in order to reduce the number of trees being trained. The categorical cross entropy loss function is used to train this model,

$$\ell_{CE}(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{c=1}^{k} y_c \log(\hat{y}_c), \tag{20}$$

where $\mathbf{y}$ is the ground truth label encoded as a one-hot vector. The first derivative is

$$\frac{\partial \ell_{CE}}{\partial f_c(x)} = \hat{y}_c - y_c, \tag{21}$$

and the second derivative is given by

$$\frac{\partial^2 \ell_{CE}}{\partial f_c(x)^2} = \hat{y}_c (1 - \hat{y}_c). \tag{22}$$

Streaming regression is very similar to streaming classification, but a numeric value must be predicted instead of a nominal value. A single SGT, $f$, is used to generate predictions, $\hat{y}$, and trained using the squared error loss function,

$$\ell_{SE}(y, \hat{y}) = \frac{1}{2} (\hat{y} - y)^2, \tag{23}$$

which has first derivative

$$\frac{\partial \ell_{SE}}{\partial f(x)} = \hat{y} - y, \tag{24}$$

and second derivative

$$\frac{\partial^2 \ell_{SE}}{\partial f(x)^2} = 1. \tag{25}$$

Multi-instance learning is a specific instantiation of weakly-supervised learning where a bag of training instances is assigned a single binary annotation. Under the standard MIL assumption (see Foulds and Frank (2010a) for more details), a positive label indicates that the bag contains at least one instance belonging to the positive class, while a negative label means that the bag does not contain any instances from the positive class. The ability to train instance-level classifiers from bag-level supervision can significantly reduce the resources required to annotate a training set, and this particular setting maps well to several tasks in computer vision and computational chemistry. For example, learning to classify whether an image subwindow contains an object category of interest using only image-level labels, rather than bounding boxes, is an active area of research in the computer vision community. In this case, the image is the bag, and each subwindow in the image is an instance.

Suppose $X_i$ is a bag of instances and $f$ is an SGT; the probability that $X_i$ contains a positive instance is estimated by

$$\hat{y}_i = \frac{1}{1 + \exp(-\max_{x \in X_i} f(x))}. \tag{26}$$

The binary cross entropy loss function is used to optimise the model,

$$\ell_{BCE}(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}), \tag{27}$$

where $y$ is the ground truth label for the bag. The first and second derivatives of this loss with respect to $f(x)$ for each $x \in X_i$ are

$$\frac{\partial \ell_{BCE}}{\partial f(x)} = \begin{cases} \hat{y}_i - y_i, & \text{if } x = \arg\max_{z \in X_i} f(z) \\ 0, & \text{otherwise,} \end{cases} \tag{28}$$

and

$$\frac{\partial^2 \ell_{BCE}}{\partial f(x)^2} = \begin{cases} \hat{y}_i (1 - \hat{y}_i), & \text{if } x = \arg\max_{z \in X_i} f(z) \\ 0, & \text{otherwise.} \end{cases} \tag{29}$$
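A minimal sketch of how these bag-level gradients might be computed is given below: only the highest-scoring instance in the bag receives non-zero gradient and Hessian values, as in Equations 28 and 29. The helper names are illustrative.

```java
// Sketch of the multi-instance gradients (Equations 26, 28, 29): the bag
// probability is driven by the highest-scoring instance, and only that
// instance receives a non-zero gradient/Hessian.
import java.util.List;
import java.util.function.ToDoubleFunction;

public class MILGradientSketch {
    // Returns {gradient, hessian} for the argmax instance of the bag;
    // all other instances in the bag contribute zeros.
    static double[] bagGradients(List<double[]> bag, double bagLabel,
                                 ToDoubleFunction<double[]> tree) {
        double maxScore = Double.NEGATIVE_INFINITY;
        for (double[] x : bag) {
            maxScore = Math.max(maxScore, tree.applyAsDouble(x));
        }
        double yHat = 1.0 / (1.0 + Math.exp(-maxScore)); // Equation 26
        double g = yHat - bagLabel;                       // Equation 28
        double h = yHat * (1.0 - yHat);                   // Equation 29
        return new double[] {g, h};
    }

    public static void main(String[] args) {
        List<double[]> bag = List.of(new double[]{0.2}, new double[]{1.5});
        // Hypothetical "tree": scores an instance by its first feature.
        double[] gh = bagGradients(bag, 1.0, x -> x[0]);
        System.out.printf("g=%.3f h=%.3f%n", gh[0], gh[1]);
    }
}
```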
5. Experiments
This section demonstrates the efficacy of SGTs in the streaming classification, streaming regression, and batch multi-instance learning settings. We implemented the algorithm in Java, making use of the MOA framework (Bifet et al., 2010) for the experiments with incremental learning, and WEKA (Hall et al., 2009) for the multi-instance learning experiments. The implementation is available online at https://github.com/henrygouk/stochastic-gradient-trees. Information about the streaming classification and regression datasets used in the experiments can be found in Table 1.

Table 1: Details of the classification and regression datasets used for evaluating incremental decision tree learners.
For each dataset we report the mean classification error rate, model size, and runtime across 10 runs, where the data is randomly shuffled for each run. The standard Hoeffding tree algorithm (VFDT) and the more sample efficient extension of Manapragada et al. (2018) (EFDT) are used as baselines. Learning curves for these experiments are given in Figure 1, and the numeric performance measurements are provided in Table 2. SGTs perform similarly to state of the art methods on classification problems, uniformly outperforming VFDT and exhibiting two wins and two losses compared to EFDT. They also result in comparatively compact models, and are faster to train in the case where there is not a large number of classes.

Table 2: Mean classification error, model size (number of nodes), and runtime (seconds) of the trees produced by the classification methods on 10 random shuffles of each dataset.

                              Higgs     HEPMASS     KDD'99    Covertype
Classification Error
  SGT                         30.10       14.66       0.27        26.94
  VFDT                        30.25       14.81       0.73        32.54
  EFDT                        31.46       15.22       0.07        22.05
Model size
  SGT                       1,933.2     1,289.0      446.3        620.8
  VFDT                      8,081.4     7,256.0      160.0         91.8
  EFDT                     38,535.1    24,760.2      913.8      3,261.4
Runtime
  SGT                        384.13      290.46     466.26        32.56
  VFDT                       115.68      108.11      33.15         3.04
  EFDT                       512.32      425.49      97.61        47.23
For the streaming regression experiments we report the mean absolute error, as well as the model size and runtime. As with the classification experiments, all measurements are averaged over 10 runs of each algorithm, and on each run the data is randomly shuffled. The FIMT-DD (Ikonomovska et al., 2011b) and ORTO (Ikonomovska et al., 2011a) methods are used as points of reference for how state of the art streaming regression algorithms perform on the datasets considered. The learning curves for these experiments are given in Figure 2, and the performance measurements are in Table 3. These results show that, from a predictive performance point of view, SGTs generally outperform both FIMT-DD and ORTO. With the exception of the airline dataset, the final model sizes are drastically smaller than those of the baselines, and training time is comparable to FIMT-DD. Qualitatively, Figure 2 suggests that SGTs exhibit superior convergence properties, with the loss of the other methods plateauing very early on in training.
The evaluation metric used for multi-instance learning is the 10-fold cross-validation accuracy. SGTs are compared with two batch techniques specifically designed for MIL: the Quick Diverse Density Iterative (QDDI) approach of Foulds and Frank (2010b), and an extension proposed by Bjerring and Frank (2011) to the multi-instance tree inducer (MITI) technique originally developed by Blockeel et al. (2005). The results are reported in Table 4, along with the average rank of each method. The multi-instance learning instantiation of our general purpose tree induction algorithm performs comparably with MITI, a state of the art tree induction technique for multi-instance learning, as shown by both the results of the hypothesis tests and the very similar average ranks. Both tree-based methods perform better than QDDI on most of the datasets.
Figure 1: Learning curves (classification error against number of instances) for the incremental classification problems: Higgs, HEPMASS, KDD99, and Covertype.

Table 3: Mean absolute error, model size (number of nodes), and runtime (seconds) of the trees produced by the regression methods on 10 random shuffles of each dataset.

                            Airline    AWS Prices      Zurich    MSD Year
Mean Abs. Error
  SGT                         20.86          0.27       61.59        7.20
  FIMT-DD                     21.14          0.58       65.80       13.52
  ORTO                        20.41          0.58       65.78       21.76
Model size
  SGT                      39,287.0       3,316.3       763.1       235.8
  FIMT-DD                  30,060.4     165,660.6    21,756.2     2,264.4
  ORTO                     33,584.8     177,929.0    23,769.0     2,466.8
Runtime
  SGT                         22.76         79.96       33.18       58.23
  FIMT-DD                     22.43         80.89       29.71       70.75
  ORTO                        50.16         42.66       22.16       37.17

Table 4: Accuracy of SGT, QDDI, and MITI on a collection of multi-instance classification datasets measured using 10-fold cross-validation. Statistically significant improvements or degradations in performance relative to SGT are denoted by ◦ and •, respectively. The rank of each method on each dataset is given in parentheses.

Dataset       SGT          QDDI           MITI
atoms         73.36 (2)    69.18 (3)      84.06 (1) ◦
bonds         75.58 (2)    72.43 (3)      81.93 (1)
chains        81.99 (2)    78.22 (3)      88.42 (1)
component     91.92 (1)    88.08 (3) •    •
elephant      77.00 (2)    81.00 (1)      76.50 (3)
fox           55.00 (3)    58.50 (2)      60.00 (1)
function      95.75 (1)    92.35 (3) •    •
Figure 2: Learning curves (mean absolute error against number of instances) for the incremental regression problems: AWS Prices, Airline, Zurich, and MSD Year.
6. Conclusion
This paper presents the stochastic gradient tree algorithm for incrementally constructing a decision tree using stochastic gradient information as the source of supervision. In addition to showing how gradients can be used for building a single decision tree, we show how the Hoeffding inequality-based splitting heuristic found in many incremental tree learning algorithms can be replaced with a procedure based on the t-test, removing the requirement that one can bound the range of the metric used to measure the quality of candidate splits. Our experimental results on several different tasks demonstrate the generality of our approach, while also maintaining scalability and state of the art predictive performance. We anticipate that the algorithm presented in this paper will enable decision trees to tackle a diverse range of problems in future.

Acknowledgments
This research was supported by the Marsden Fund Council from Government funding, administered by the Royal Society of New Zealand.
References
J. Bennett, R. Grout, P. Pebay, D. Roe, and D. Thompson. Numerically stable, single-pass, parallel statistics algorithms. In Proceedings of the IEEE International Conference on Cluster Computing and Workshops, pages 1–8, August 2009.

Albert Bifet, Geoff Holmes, Richard Kirkby, and Bernhard Pfahringer. MOA: Massive Online Analysis. Journal of Machine Learning Research, 11(May):1601–1604, 2010.

Luke Bjerring and Eibe Frank. Beyond Trees: Adopting MITI to Learn Rules and Ensemble Classifiers for Multi-Instance Data. In Dianhui Wang and Mark Reynolds, editors, AI 2011: Advances in Artificial Intelligence, Lecture Notes in Computer Science, pages 41–50. Springer Berlin Heidelberg, 2011.

Hendrik Blockeel, David Page, and Ashwin Srinivasan. Multi-instance Tree Learning. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 57–64, New York, NY, USA, 2005. ACM.

Leo Breiman. Classification and Regression Trees. Routledge, 1984.

Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 785–794, New York, NY, USA, 2016. ACM.

Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '00, pages 71–80, Boston, Massachusetts, United States, 2000. ACM Press.

James Foulds and Eibe Frank. A review of multi-instance learning assumptions. The Knowledge Engineering Review, 25(1):1–25, March 2010a.

James R. Foulds and Eibe Frank. Speeding Up and Boosting Diverse Density Learning. In Bernhard Pfahringer, Geoff Holmes, and Achim Hoffmann, editors, Discovery Science, Lecture Notes in Computer Science, pages 102–116. Springer Berlin Heidelberg, 2010b.

Eibe Frank and Ian H. Witten. Making Better Use of Global Discretization. In Proceedings of the Sixteenth International Conference on Machine Learning, 1999.

Jerome H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5):1189–1232, 2001.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA Data Mining Software: An Update. SIGKDD Explorations Newsletter, 11(1):10–18, November 2009.

Elena Ikonomovska, João Gama, Bernard Ženko, and Sašo Džeroski. Speeding Up Hoeffding-Based Regression Trees with Options. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011a.

Elena Ikonomovska, João Gama, and Sašo Džeroski. Learning model trees from evolving data streams. Data Mining and Knowledge Discovery, 23(1):128–168, July 2011b.

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3146–3154, 2017.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

Chaitanya Manapragada, Geoff Webb, and Mahsa Salehi. Extremely Fast Decision Tree. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, United Kingdom, August 2018. ACM.

Saulo Martiello Mastelini, Sylvio Barbon Jr., and André Carlos Ponce de Leon Ferreira de Carvalho. Online Multi-target regression trees with stacked leaf models. arXiv:1903.12483 [cs, stat], March 2019.

Rory Mitchell and Eibe Frank. Accelerating the XGBoost algorithm using GPU computing. PeerJ Computer Science, 3:e127, July 2017.

Jesse Read, Albert Bifet, Geoff Holmes, and Bernhard Pfahringer. Scalable and efficient multi-label classification for evolving data streams. Machine Learning, 88(1):243–272, July 2012.

A. Suarez and J.F. Lutsko. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1297–1311, 1999.

B. P. Welford. Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 4(3):419–420, 1962.

Yongxin Yang, Irene Garcia Morillo, and Timothy M. Hospedales. Deep Neural Decision Trees. In ICML Workshop on Human Interpretability in Machine Learning, June 2018.