Gradient Boosting Neural Networks: GrowNet
Sarkhan Badirli ∗
Department of Computer Science, Purdue University, Indiana, USA [email protected]
Xuanqing Liu
Department of Computer Science, UCLA, California, USA [email protected]
Zhengming Xing
Linkedin, California, USA [email protected]
Avradeep Bhowmik
Amazon, California, USA [email protected]
Khoa Doan
Department of Computer Science, Virginia Tech, Virginia, USA [email protected]
Sathiya S. Keerthi
Linkedin, California, USA [email protected]

∗ This work was started when the authors were at Criteo AI Lab.
Preprint. Under review.
Abstract
A novel gradient boosting framework is proposed where shallow neural networks are employed as "weak learners". General loss functions are considered under this unified framework, with specific examples presented for classification, regression and learning to rank. A fully corrective step is incorporated to remedy the pitfall of the greedy function approximation of classic gradient boosting decision trees. The proposed model renders results that outperform state-of-the-art boosting methods in all three tasks on multiple datasets. An ablation study is performed to shed light on the effect of each model component and model hyperparameter.
AI and machine learning pervade every aspect of modern life, from email spam filtering and e-commerce to financial security and medical diagnostics [18, 25]. Deep learning in particular has been one of the key innovations that has truly pushed the boundary of science beyond what was considered feasible [14, 12].

However, in spite of its seemingly limitless possibilities, both in theory and in demonstrated practice, developing tailor-made deep neural networks for new application areas remains notoriously difficult because of its inherent complexity. Designing architectures for any given application requires immense dexterity and often a lot of luck. The lack of an established paradigm for creating an application-specific DNN presents significant challenges to practitioners, and often results in resorting to heuristics or even hacks.

In this paper, we attempt to rectify this situation by introducing a novel paradigm that builds neural networks from the ground up, layer by layer. Specifically, we use the idea of gradient boosting [11], which has a formidable reputation in machine learning for its capacity to incrementally build sophisticated models out of simpler components that can successfully be applied to the most complex learning tasks. Popular GBDT frameworks like XGBoost [6], LightGBM [17] and CatBoost [22] use decision trees as weak learners and combine them using a gradient boosting framework to build complex models that are widely used in both academia and industry as a reliable workhorse for common tasks in a wide variety of domains.

However, while useful in their own right, decision trees are not universally applicable, and there are many domains, especially those involving structured data, where deep neural networks perform much better [32, 29, 1]. In this paper, we combine the power of gradient boosting with the flexibility and versatility of neural networks and introduce a new modelling paradigm called GrowNet that can build up a DNN layer by layer. Instead of decision trees, we use shallow neural networks as our weak learners in a general gradient boosting framework that can be applied to a wide variety of tasks spanning classification, regression and ranking. We introduce further innovations, such as adding second order statistics to the training process, and a global corrective step that has been shown, both in theory [31] and in empirical evaluation, to provide a performance lift and precise fine-tuning to the specific task at hand.

Our specific contributions are summarised below:

• We propose a novel approach that uses gradient boosting to incrementally build complex deep neural networks out of shallow components. We introduce a versatile framework that can readily be adapted for a diverse range of machine learning tasks in a wide variety of domains.

• We develop an off-the-shelf optimization algorithm that is faster and easier to train than traditional deep neural networks. We introduce training innovations, including second order statistics and global corrective steps, that improve stability and allow finer-grained tuning of our models for specific tasks.

• We demonstrate the efficacy of our techniques using experimental evaluation, and show superior results on multiple real datasets in three different ML tasks: classification, regression and learning to rank.
In this section, we briefly summarize gradient boosting algorithms with decision trees and general boosting/ensemble methods for training neural nets.
Gradient Boosting Algorithms.
Gradient Boosting Machine [11] is a function estimation method using numerical optimization in function space. Unlike parameter estimation, function approximation cannot be solved by traditional optimization methods in Euclidean space. Decision trees are the most common functions (predictive learners) used in the gradient boosting framework. In his seminal paper, [11] proposed Gradient Boosting Decision Trees (GBDT), where decision trees are trained in sequence and each tree is modeled by fitting negative gradients. In recent years, there have been many implementations of GBDT in the machine learning literature. Among these, [27] used GBDT to perform learning to rank, [10] did classification, and [6, 17] generalized GBDT for multi-tasking purposes. In particular, the scalable framework of [6] made it possible for data scientists to achieve state-of-the-art results on various industry-related machine learning problems. For that reason, we take XGBoost [6] as our baseline. Unlike these GBDT methods, we propose a gradient boosting neural network where we train gradient boosting with shallow neural nets. Using neural nets as base learners also gives our method an edge over GBDT models: we can correct each previous model after adding a new one, referred to as the "corrective step", in addition to being able to propagate information from the previous predictors to the next ones.
Boosted Neural Nets.
Although weak learners like decision trees are popular in boosting and ensemble methods, there has been substantial work on combining neural nets with boosting/ensemble methods for better performance over single large/deep neural networks. The idea of considering shallow neural nets as weak learners and constructively combining them started with [8]. In their pioneering work, fully connected, multi-layer perceptrons are trained in a layer-by-layer fashion and added to obtain a cascade-structured neural net. Their model is not exactly a boosting model, as the final model is a single, multi-layer neural network.

In the 1990s, ensembles of neural networks became popular, as ensemble methods helped to significantly improve the generalization ability of neural nets. Nevertheless, these methods were simply either majority voting [13] for classification tasks, or simple averaging [20] or weighted averaging [21] for regression tasks. After the introduction of the adaptive boosting (Adaboost) algorithm [9], [23] investigated boosting with multi-layer neural networks for a character recognition task and achieved a remarkable performance improvement. They extended the work to traditional machine learning tasks with variations of Adaboost methods in which different weighting schemes are explored [24]. Adaptive boosting can be seen as a specific version of the gradient boosting algorithm where a simple exponential loss function is used [10].

In the early 2000s, [15] introduced greedy layer-wise unsupervised training for Deep Belief Nets (DBN). A DBN is built up one layer at a time by utilizing Gibbs sampling to obtain the estimator of the gradient on the log-likelihood of the Restricted Boltzmann Machine (RBM) in each layer. The authors of [3] extended this work to continuous inputs and explained its success in attaining high quality features from image data. They concluded that unsupervised training helped model training by initializing RBM weights in a region close to a good local minimum.

Most recently, AdaNet [7] was proposed to adaptively build a neural network (NN) layer by layer, starting from a single-layer NN, to perform image classification. Besides learning network weights, AdaNet adjusts the network structure, and its growth procedure is backed by a theoretical justification. AdaNet optimizes over a generalization bound that consists of the empirical risk and the complexity of the architecture. A coordinate descent approach is applied to the objective function, and a heuristic search (weak learning algorithm) is performed to obtain δ-optimal coordinates. Although the learning process is boosting-style, the final model is a single NN whose final output layer is connected to all lower layers. Unlike AdaNet, we train each weak learner in a gradient boosting style, resulting in less entangled training. The final prediction is the weighted sum of all weak learners' outputs. Our method also provides a unified platform for performing various ML tasks.

In recent years, a few works have sought to explain the success of deep residual neural networks [14] with hundreds of layers by showing that they can be decomposed into a collection of many subnetworks. The work in [16] extends AdaNet to specifically focus on the ResNet architecture [14] and provides a new training algorithm for ResNet. The authors of [28], meanwhile, argue that these deeper layers might serve as a bagging mechanism in a similar spirit to a random forest classifier. These studies challenge the common belief that neural networks are too strong to serve as weak learners for boosting methods.

Figure 1: GrowNet architecture. After the first weak learner, each predictor is trained on the combined features from the original input and the penultimate layer features of the previous weak learner. The final output is the weighted sum of the outputs from all predictors, $\sum_{k=1}^{K} \alpha_k f_k(x)$. Here Model K denotes weak learner K.

In this section, we first describe the basic framework of GrowNet for general loss functions, and then we show how the corrective step is incorporated. The key idea in gradient boosting is to take simple, lower-order models as weak learners and use them as fundamental building blocks to build a powerful, higher-order model by sequential boosting using first or second order gradient statistics. We use shallow neural networks (e.g., with one or two hidden layers) as weak learners in this paper. At each boosting step, we augment the original input features with the output from the penultimate layer of the current iteration (see Figure 1).
This augmented feature set is then fed as input to train the next weak learner via a boosting mechanism using the current residuals. The final output of the model is a weighted combination of the scores from all these sequentially trained models.

3.1 Gradient Boosting Neural Network: GrowNet
Let us assume a dataset with $n$ samples in a $d$-dimensional feature space, $D = \{(x_i, y_i)_{i=1}^{n} \mid x_i \in \mathbb{R}^d, y_i \in \mathbb{R}\}$. GrowNet uses $K$ additive functions to predict the output,

$$\hat{y}_i = E(x_i) = \sum_{k=0}^{K} \alpha_k f_k(x_i), \qquad f_k \in \mathcal{F} \qquad (1)$$

where $\mathcal{F}$ is the space of multilayer perceptrons and $\alpha_k$ is the step size (boost rate). Each function $f_k$ represents an independent, shallow neural network with a linear layer as its output layer. For a given sample $x$, the model calculates the prediction as a weighted sum of the $f_k$'s in GrowNet.

Let $l$ be any differentiable convex loss function. Our objective is to learn a set of functions (shallow neural networks) that minimize $L(E) = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$. We may further add regularization terms to penalize the model complexity, but this is omitted for simplicity in this work. As the objective is optimized over functions and not over parameters, traditional optimization techniques will not work here. Analogous to GBDT [11], the model is trained in an additive manner.

Let $\hat{y}_i^{(t-1)} = \sum_{k=0}^{t-1} \alpha_k f_k(x_i)$ be the output of GrowNet at stage $t-1$ for the sample $x_i$. We greedily seek the next weak learner $f_t(x)$ that minimizes the loss at stage $t$, which can be summarized as

$$L^{(t)} = \sum_{i=1}^{n} l\big(y_i, \hat{y}_i^{(t-1)} + \alpha_t f_t(x_i)\big) \qquad (2)$$

In addition, a second-order Taylor expansion of the loss function $l$ is adopted to ease the computational complexity. As second-order optimization techniques are superior to first-order ones and require fewer steps to converge, we train the models with Newton-Raphson steps. Consequently, regardless of the ML task, individual model parameters are optimized by running regression on the second order gradient statistics of GrowNet's outputs. The objective function for the weak learner $f_t$ can be simplified as follows,

$$L^{(t)} = \sum_{i=1}^{n} h_i \big(\tilde{y}_i - \alpha_t f_t(x_i)\big)^2 \qquad (3)$$

where $\tilde{y}_i = -g_i/h_i$, and $g_i$ and $h_i$ are the first and second order gradients of the objective function $l$ at $x_i$ with respect to $\hat{y}_i^{(t-1)}$. (See the pseudo-code in part 1 of Algorithm 1 in the supplementary material.)

In a traditional boosting framework, each weak learner is greedily learned: only the parameters of the $t$-th weak learner are updated at boosting step $t$, while all the parameters of the previous $t-1$ weak learners remain unchanged. This myopic learning procedure may cause the model to get stuck in a local minimum, and a fixed boosting rate $\alpha_k$ aggravates the issue [11]. Therefore, we implement a corrective step to address this problem. In the corrective step, instead of fixing the previous $t-1$ weak learners, we allow their parameters to be updated through back-propagation. Moreover, we incorporate the boosting rate $\alpha_k$ into the parameters of the model, and it is automatically updated through the corrective step. Beyond yielding better performance, this move saves us from tuning a delicate parameter. The corrective step (C/S) can also be interpreted as a regularizer that mitigates correlation among weak learners, since during the corrective step the main training objective becomes the task-specific loss function on just the original inputs. The usefulness of this step is empirically and theoretically investigated in [31] for gradient boosting decision tree models. Our experiments in the ablation study (Section 6.2) further validate the necessity of the C/S in our model. The corrective step is summarized in the second part of Algorithm 1 in the supplementary material.
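The weak learner fit in Eq. (3) is a least squares regression on the Newton targets $-g_i/h_i$, weighted by $h_i$. The sketch below illustrates one such boosting stage in PyTorch; `ensemble_predict`, `grad_hess` and `make_weak_learner` are hypothetical placeholders supplied by the caller (not names from the released GrowNet code), and the feature augmentation with the previous learner's penultimate features (Figure 1) is omitted for brevity.

```python
# Minimal sketch of one GrowNet boosting stage (Eq. 3), not the released implementation.
import torch

def fit_next_weak_learner(x, y, ensemble_predict, grad_hess, make_weak_learner,
                          boost_rate=1.0, lr=1e-3, epochs=1):
    """Fit f_t by weighted least squares on the Newton targets -g_i / h_i."""
    with torch.no_grad():
        y_prev = ensemble_predict(x)       # \hat{y}^{(t-1)} for every sample
        g, h = grad_hess(y, y_prev)        # 1st/2nd order grads of the task loss
        target = -g / h                    # \tilde{y}_i = -g_i / h_i

    f_t = make_weak_learner(in_dim=x.shape[1])   # assumed to return one score per sample
    opt = torch.optim.Adam(f_t.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        pred = f_t(x).squeeze(-1)
        # Eq. (3): sum_i h_i (tilde{y}_i - alpha_t * f_t(x_i))^2
        loss = (h * (target - boost_rate * pred) ** 2).mean()
        loss.backward()
        opt.step()
    return f_t
```

After this fit, the new learner would be appended to the ensemble with its boost rate, and the corrective step described above would fine-tune all learners jointly.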
In this section, we show how GrowNet can be adapted for regression, classification and learning to rank problems.

GrowNet for Regression.
We employ the mean squared error (MSE) loss function for the regression task. Let us assume $l$ is the MSE loss; then we can easily obtain $\tilde{y}_i$ and the first and second order statistics at stage $t$ as follows:

$$g_i = 2\big(\hat{y}_i^{(t-1)} - y_i\big), \qquad h_i = 2 \;\Longrightarrow\; \tilde{y}_i = y_i - \hat{y}_i^{(t-1)}$$

We train the next weak learner $f_t$ by least squares regression on $\{x_i, \tilde{y}_i\}$ for $i = 1, 2, \ldots, n$. In the corrective step, all model parameters in GrowNet are updated again using the MSE loss.
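As a quick worked instance of these statistics (the function name is illustrative, not from the released code):

```python
import torch

def mse_grad_hess(y, y_prev):
    """Gradients of l = (y - yhat)^2 w.r.t. yhat, and the Newton target -g/h."""
    g = 2.0 * (y_prev - y)             # g_i = 2 (yhat_i^{(t-1)} - y_i)
    h = 2.0 * torch.ones_like(y)       # h_i = 2, constant for the MSE loss
    return g, h, -g / h                # -g_i / h_i = y_i - yhat_i^{(t-1)}
```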
GrowNet for Classification.
For illustration purposes, let us consider the binary cross entropy loss function; note, however, that any differentiable loss function can be used. Choosing the labels $y_i \in \{-1, +1\}$ (this notation has the advantage that $y_i^2 = 1$, which is used in the derivation), the first and second order gradients, $g_i$ and $h_i$, at stage $t$ can be written as follows,

$$g_i = \frac{-2 y_i}{1 + e^{2 y_i \hat{y}_i^{(t-1)}}}, \qquad h_i = \frac{4 y_i^2\, e^{2 y_i \hat{y}_i^{(t-1)}}}{\big(1 + e^{2 y_i \hat{y}_i^{(t-1)}}\big)^2} \;\Longrightarrow\; \tilde{y}_i = -g_i/h_i = \frac{y_i\big(1 + e^{-2 y_i \hat{y}_i^{(t-1)}}\big)}{2}$$

The next weak learner $f_t$ is fitted by least squares regression using the second order gradient statistics on $\{x_i, \tilde{y}_i\}$. In the corrective step, the parameters of all the added predictive functions are updated by retraining the whole model with the binary cross entropy loss. This step slightly corrects the weights according to the main objective function of the task at hand, i.e. classification in this case.
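These expressions are consistent with the margin form of the logistic loss, $l = \log(1 + e^{-2 y_i \hat{y}_i})$ (an assumption about the exact loss form). A small sketch, with an illustrative function name:

```python
import torch

def logistic_grad_hess(y, y_prev):
    """g, h and Newton target for l = log(1 + exp(-2*y*yhat)), labels y in {-1, +1}."""
    e = torch.exp(2.0 * y * y_prev)                  # may need clamping for large margins
    g = -2.0 * y / (1.0 + e)                         # first order gradient
    h = 4.0 * (y ** 2) * e / (1.0 + e) ** 2          # second order gradient
    target = y * (1.0 + torch.exp(-2.0 * y * y_prev)) / 2.0   # equals -g/h
    return g, h, target
```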
GrowNet for Learning to Rank.
In this part, we demonstrate how the model is adapted to learning to rank (L2R) with a pairwise loss. In the L2R framework, there are queries and documents associated with each query; a document can be associated with many different queries. For each query, the associated documents have relevance scores. Assume that for a given query a pair of documents $U_i$ and $U_j$ is chosen, and that we have feature vectors $x_i$ and $x_j$ for these documents. Let $\hat{y}_i$ and $\hat{y}_j$ denote the outputs of the model for the samples $x_i$ and $x_j$, respectively. According to [4], a common pairwise loss for a given query can be formulated as follows,

$$l(\hat{y}_i, \hat{y}_j) = \tfrac{1}{2}(1 - S_{ij})\,\sigma(\hat{y}_i - \hat{y}_j) + \log\big(1 + e^{-\sigma(\hat{y}_i - \hat{y}_j)}\big)$$

where $S_{ij} \in \{0, -1, +1\}$ denotes the documents' relevance difference: it is $+1$ if $U_i$ has a relevance score greater than $U_j$, $-1$ vice versa, and $0$ if both documents have been labeled with the same relevance score. $\sigma$ is a scaling constant (the shape parameter of the sigmoid in [4]). Note that the cost function $l$ is symmetric and its gradients can easily be computed as follows (for details, readers can refer to [4]),

$$\frac{\partial l(\hat{y}_i, \hat{y}_j)}{\partial \hat{y}_i} = \sigma\Big(\tfrac{1}{2}(1 - S_{ij}) - \frac{1}{1 + e^{\sigma(\hat{y}_i - \hat{y}_j)}}\Big) = -\frac{\partial l(\hat{y}_i, \hat{y}_j)}{\partial \hat{y}_j}$$

$$\frac{\partial^2 l(\hat{y}_i, \hat{y}_j)}{\partial \hat{y}_i^2} = \sigma^2\, \frac{1}{1 + e^{\sigma(\hat{y}_i - \hat{y}_j)}}\Big(1 - \frac{1}{1 + e^{\sigma(\hat{y}_i - \hat{y}_j)}}\Big)$$

Let $I$ denote the set of pairs of indices $\{i, j\}$ for which $U_i$ is desired to be ranked differently from $U_j$ for a given query. Then, for a particular document $U_i$, the loss function and its first and second order statistics can be derived as follows,

$$l = \sum_{j:\{i,j\}\in I} l(\hat{y}_i, \hat{y}_j) + \sum_{j:\{j,i\}\in I} l(\hat{y}_i, \hat{y}_j)$$

$$g_i = \sum_{j:\{i,j\}\in I} \frac{\partial l(\hat{y}_i, \hat{y}_j)}{\partial \hat{y}_i} - \sum_{j:\{j,i\}\in I} \frac{\partial l(\hat{y}_i, \hat{y}_j)}{\partial \hat{y}_i}, \qquad h_i = \sum_{j:\{i,j\}\in I} \frac{\partial^2 l(\hat{y}_i, \hat{y}_j)}{\partial \hat{y}_i^2} - \sum_{j:\{j,i\}\in I} \frac{\partial^2 l(\hat{y}_i, \hat{y}_j)}{\partial \hat{y}_i^2}$$
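A sketch of how these per-document statistics could be accumulated for a single query (names are illustrative). One deviation from the formulas above: the second-order contribution is added with a positive sign for both documents of a pair, since $\partial^2 l/\partial \hat{y}_i^2 = \partial^2 l/\partial \hat{y}_j^2$ for this loss.

```python
import torch

def pairwise_grad_hess(scores, pairs, S, sigma=1.0):
    """Per-document g_i, h_i from the pairwise loss for one query.

    scores: tensor [n_docs] of current GrowNet outputs.
    pairs:  list of (i, j) index pairs in I.
    S:      sequence of S_ij values in {-1, 0, +1}, aligned with `pairs`.
    """
    g = torch.zeros_like(scores)
    h = torch.zeros_like(scores)
    for (i, j), s_ij in zip(pairs, S):
        diff = scores[i] - scores[j]
        p = 1.0 / (1.0 + torch.exp(sigma * diff))    # 1 / (1 + e^{sigma (yhat_i - yhat_j)})
        dl = sigma * (0.5 * (1.0 - s_ij) - p)        # dl/dyhat_i (= -dl/dyhat_j)
        d2l = sigma ** 2 * p * (1.0 - p)             # d^2 l / dyhat_i^2
        g[i] += dl
        g[j] -= dl
        h[i] += d2l
        h[j] += d2l
    return g, h
```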
Experiment Setup.
All predictive functions added to the model are multilayer perceptrons with two hidden layers. We generally set the number of hidden layer units to roughly half of, or equal to, the input feature dimension; more hidden layers degraded the performance, as the model starts overfitting. The same budget of additive functions was employed in the experiments for all three tasks, and the number of weak learners used at test time is chosen by the validation results. The boosting rate is initially set to 1 and automatically adjusted during the corrective step. We trained each predictive function for just 1 epoch, and the entire model is also trained for 1 epoch during the corrective step by stochastic gradient descent with the Adam optimizer; the number of epochs is increased to 2 for the ranking task. We also employed 1D batch normalization on the hidden layers. We compared the model performance with XGBoost, since similar results are obtained with LightGBM or CatBoost, and with AdaNet. Tuning and model details of all 3 methods are provided in the supplementary material.
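For concreteness, here is a sketch of a weak learner matching this description: two hidden layers with 1D batch normalization, Leaky ReLU on the first hidden layer and ReLU on the penultimate layer (as detailed in the supplementary material), and a linear output. The class name and exact layer ordering are illustrative rather than the released code.

```python
import torch
import torch.nn as nn

class WeakMLP(nn.Module):
    """Two-hidden-layer weak learner that also exposes its penultimate features."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),                       # penultimate-layer activation
        )
        self.out = nn.Linear(hidden_dim, 1)  # linear output layer

    def forward(self, x):
        feat = self.body(x)                  # penultimate features, reused by the next learner
        return feat, self.out(feat).squeeze(-1)
```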
Datasets.
We evaluate our model on 5 datasets from 3 different tasks. The Higgs boson dataset is used for classification; the Higgs data is created using Monte Carlo simulations of high energy physics events. To perform regression, 2 datasets from the UCI machine learning repository are selected. The first one is the Computed Tomography (CT) slice localization data, where the aim is to retrieve the location of CT slices on the axial axis. The second regression dataset is YearPredictionMSD, a subset of the Million Song dataset; the goal is to predict the release year of a song from its audio features. We choose the Yahoo LTR dataset [5] for the learning to rank task, as it is a well-known benchmark dataset that is also used in the XGBoost paper. Its queries are each associated with a list of scored documents, and the train-test split from the original paper is preserved. The second benchmark ranking dataset we used is MSLR-WEB10K, in which there are 10K queries, each corresponding to a list of candidate documents. Detailed statistics of each dataset can be found in the supplementary material.

Table 2: Regression results in root mean square error (RMSE) on the Music Year Prediction and Slice Localization datasets for XGBoost, AdaNet and GrowNet. GrowNet results are the average of 5 runs; values in parentheses represent the standard deviation.

Table 3: Classification results, in AUC, on the Higgs boson dataset for XGBoost and GrowNet. For our model, we present 3 different results: using all the data, 10% of the data, and 1% of the data for training each weak learner.

Regression.
Table 2 reports regression performance on the two UCI datasets. GrowNet outperforms both methods on the Music dataset, where AdaNet delivers the worst result. On the CT slice localization dataset, our model obtains on-par results with AdaNet and displays a decrease in RMSE compared to XGBoost.
Classification.
To make a fair comparison with XGBoost, we tested our model on the Higgs boson dataset, as it is used in the XGBoost paper [6]. Classification results are presented in Table 3. GrowNet clearly outperforms XGBoost using all the data. Subsampling the data for training each weak learner also renders better performance. We used 30 weak learners (multilayer perceptrons with two hidden layers), and the number of weak learners to be used at test time is chosen by validation results; in all 3 experiments, this number was 30.
Learning to Rank.
Ranking experiment results on the Yahoo and MSLR datasets are presented in Table 1. We evaluated GrowNet with 2 different loss functions, namely the pairwise loss and the generalized I-divergence loss. In both scenarios, GrowNet outperforms XGBoost on both datasets; in particular, it delivers an increase on the Microsoft data in both NDCG@5 and NDCG@10. For our model to achieve these results, 30 weak learners were enough.

Table 1: L2R results in Normalized Discounted Cumulative Gain at ranks 5 and 10 (NDCG@5 & NDCG@10) on Microsoft Learning to Rank with 10K queries (MSLR-WEB10K) and Yahoo LTR, comparing XGBoost, GrowNet with the pairwise loss, and GrowNet with the generalized I-divergence loss. GrowNet results are the average of 5 runs; values in parentheses represent the standard deviation.

Table 4: Ablation experiments on the Higgs 1M and Microsoft (Fold 1) datasets. All models have two-layer shallow networks as weak learners; the hidden layer dimension is 16 for classification and 64 for the ranking task. The third column is the final GrowNet model against which all other variants are compared; the remaining columns are GrowNet with first order gradients only, with a constant boost rate, the simple (non-stacked) version, without C/S, and with C/S in every 5 stages. Reported results are AUC for classification (Higgs 1M) and NDCG@5/NDCG@10 for ranking (MSLR Fold 1).

Figure 2: Classification training losses.

We investigated the different components of GrowNet. We picked 2 datasets for these experiments: Higgs and Microsoft. For the Higgs dataset, we randomly selected 1M points for training and a portion of the remaining data as the validation set; the original test data was used as the test set. For the Microsoft dataset, we used Fold 1 and the original split was preserved. In all upcoming experiments, only the component that is being analyzed is altered while the rest of the parameters remain unchanged. All ablation experiments are reported in Table 4, and the third column (GrowNet) represents the results from the final version of our model on these datasets.

As seen in Figure 1, every weak learner except the first one is trained on the combined features of the original input and the penultimate layer's features from the previous predictive function. It is worth noting that the input dimension does not grow with the iteration; indeed, it is always the dimension of the hidden layer plus the dimension of the original input. This idea of stacked features bears a weak resemblance to auto-context [26], where the authors utilized the direct output of the classifier, along with the original inputs, to boost image segmentation performance. The work in [2] extended this idea to use not only the output probabilities of the classifier but also the raw prediction image itself. Our model is significantly different from these methods, as we do not simply use the previous model's output but a more expressive representation at the penultimate layer. These features leverage our model by propagating more complex information from the previous model to the new one. To test the advantage of this stacked model, we compared the proposed model against a simpler version in which only the original input features are used for all learners. The sixth column in Table 4 presents the results from the simpler version. In both tasks, the stacked model outperforms the simpler version; the difference is especially noteworthy in the ranking task. The training loss in Figure 2 also supports the information gain when the stacked version is utilized. Unlike tree boosting methods, our model makes this architecture possible through its flexible weak learners.
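The stacking mechanism itself is a simple concatenation; a minimal helper (hypothetical name) that builds the input for each weak learner:

```python
import torch

def stacked_input(x, prev_penultimate=None):
    """Input for the next weak learner: the original features plus the previous
    learner's penultimate-layer features (the first learner sees x alone)."""
    if prev_penultimate is None:
        return x
    return torch.cat([x, prev_penultimate], dim=1)   # dimension stays d + hidden_dim
```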
Among all components of the model, the corrective step is presumably the most vital one. In this step, the parameters of all weak learners that have been added to the model are updated by training the whole model on the original inputs, without the penultimate layer features, and the loss function used is a task-specific one. This procedure allows the model to rectify its parameters to better accommodate the task at hand rather than fitting negative gradients. The C/S also alleviates the potential correlation among weak learners. Moreover, within this step we incorporate the boosting rate $\alpha_t$, and it is automatically adjusted without requiring any tuning. The last two columns of Table 4 present the classification and learning to rank results from GrowNet without any corrective step and with a corrective step in every 5 stages, respectively. The performance severely degrades in the former case, and the model hardly learns any information after a couple of predictive functions are added; the flat training loss in Figure 2 confirms this phenomenon as well. Running the corrective step every 5 stages renders much better performance, yet not as good as GrowNet's results. The stair-like loss curve in Figure 2 clearly displays the influence of the corrective step on model training.
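A sketch of one corrective step under this description: every added weak learner and every boost rate $\alpha_k$ (registered as a learnable parameter) is updated with the task-specific loss. The forward pass below keeps the stacked features of Figure 1 so that input dimensions match; the function name is illustrative, and weak learners are assumed to return (penultimate features, score) as in the earlier sketch.

```python
import torch

def corrective_step(weak_learners, boost_rates, x, y, task_loss, lr=1e-3, epochs=1):
    """Jointly fine-tune all weak learners and boost rates with the task loss."""
    params = [p for f in weak_learners for p in f.parameters()]
    params += list(boost_rates)                      # each alpha_k is an nn.Parameter
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        out = torch.zeros(x.shape[0], device=x.device)
        feat = None
        for f, alpha in zip(weak_learners, boost_rates):
            inp = x if feat is None else torch.cat([x, feat], dim=1)
            feat, score = f(inp)                     # penultimate features and score
            out = out + alpha * score
        loss = task_loss(out, y)                     # task-specific objective
        loss.backward()
        opt.step()
```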
Figure 3: Boosting rate evolution.

Dynamic boost rate.
Within the corrective step, we are able to dynamically update the boost rate $\alpha_t$ (at stage $t$). Taking this measure saved us from tuning one more parameter and also yielded a mild performance increase in all tasks. Moreover, the model obtained better training loss convergence compared to the fixed boost rate version (see Fig. 2). In our setup, starting with $\alpha_0 = 1$, the boost rate is automatically updated each time the corrective step is executed (see Fig. 3). Results for the best constant boost rate, coarsely tuned over a small set of candidate values, are reported in the fifth column of Table 4.

In this experiment, we explored the impact of first and second order statistics on model performance as well as on the convergence of the training loss. As the fourth column of Table 4 displays, using the second order statistics (third column in Table 4) renders a slight performance boost over the first order version in classification, and a larger increase in the learning to rank task. Figure 2 displays the effects of first and second order statistics on the training loss. The final model (with second order statistics) again shows slightly better convergence on classification, yet the difference is more apparent on ranking. As the learning rate is decreased by a constant factor after a fixed number of weak learners, sudden drops are observed in the classification loss curve at the corresponding stages in Figure 2.

As the literature suggests, boosting algorithms work best with weak learners, thus we utilized a shallow neural network with two hidden layers as the weak predictor for our model. While adding more hidden layers yields stronger predictors, they are no longer weak learners. To explore this limit on the number of hidden layers, we assayed weak learners with 1, 2, 3, and 4 hidden layers, each hidden layer having 16 units. Although weak learners with more hidden layers render better training loss convergence, as expected, the overall model starts saturating in performance and overfitting. Weak learners with 1 and 2 hidden layers attain the best scores, with the latter outperforming the former. The worst test AUC score is from the model with 4 hidden layers (see Fig. 6 in the supplementary material).

Figure 4: Effect of the number of hidden units. The best result is achieved with 128 units, yet the performance suffers when the number is increased to 256.

One might ask what would happen if we simply combined all these shallow networks into one deep neural network. There are a couple of issues with this approach: (1) it is very time-consuming to tune the parameters of the DNN, such as the number of hidden layers, the number of units in each hidden layer, the overall architecture, batch normalization, the dropout level, etc.; (2) DNNs require huge computational power and in general run slower. We compared our model (with 30 weak learners) against DNNs with 5, 10, 20, and 30 hidden-layer configurations. The best DNN (with 10 hidden layers) reached its best AUC on the Higgs 1M data at epoch 900 of 1000 training epochs, with each epoch taking approximately 12 seconds. GrowNet rendered a higher AUC on the same configuration with 30 weak learners, and the average stage training time, including the corrective step, took 50 seconds. Both models were run on the same machine with an NVIDIA Tesla V100 (16GB) GPU. We find that GrowNet has a clear advantage over stacked DNNs on all these aspects. Further details and illustrations from the ablation study, and the code, are provided in the supplementary material.
In this work, we propose
GrowNet, a novel approach to leverage shallow neural networks as "weak learners" in a gradient boosting framework. This flexible network structure allows us to perform multiple machine learning tasks under a unified framework while incorporating second order statistics, a corrective step and a dynamic boost rate to remedy the pitfalls of traditional gradient boosting decision trees. An ablation study is conducted to explore the limits of neural networks as weak learners in the boosting paradigm and to analyze the effects of each GrowNet component on model performance and convergence. We show that the proposed model achieves better performance in regression, classification and learning to rank on multiple datasets, compared to state-of-the-art boosting methods. We further demonstrate that GrowNet is a better alternative to DNNs in these tasks, as it yields better performance, requires less training time and is much easier to tune.
Broader Impact
GrowNet describes a novel boosting framework which could be applied to a wide range of tasks and application domains, not limited to the classification, regression and learning to rank problems discussed in this paper. Our research could help improve the performance of these tasks and applications in practice.

While there are general impacts of our work, similar to those of the popular boosting methods [6, 7], we focus on the impact of the ease of employing boosted neural networks in practice. The potential benefit is improved predictions in a shorter amount of time (instead of a longer time spent devising application-specific algorithms) across a wide range of applications. For example, in healthcare, better predictions result in better diagnoses, which could save lives; in social domains, better predictions improve the quality of our lives; the list of tasks and applications to which our method could be adapted goes on. While there are undoubtedly benefits, there are also many concerns and risks of irresponsible use. This is even more important in today's environment, where issues such as bias, discrimination and privacy increasingly become more serious. Better predictions should not result in biased or discriminatory decisions, or in the invasion of privacy.

To mitigate these risks and concerns, we encourage the research community and policy makers to understand and evaluate the specific impacts of increasingly powerful, ready-made artificial intelligence algorithms. Here, we need to understand the risks and derive the appropriate policies, but we should not hinder research activities when there are clearly beneficial implications.
References

[1] Bao, W., Lai, W.-S., Ma, C., Zhang, X., Gao, Z., and Yang, M.-H. Depth-aware video frame interpolation. In CVPR, pp. 3698–3707, 2019.
[2] Becker, C. J., Rigamonti, R., Lepetit, V., and Fua, P. Kernelboost: Supervised learning of image features for classification. Technical Report, 2013.
[3] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 2007.
[4] Burges, C. J. C. From ranknet to lambdarank to lambdamart: An overview. Microsoft Research Technical Report, 2010.
[5] Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. Journal of Machine Learning Research - W & CP, 14:1–24, 2011.
[6] Chen, T. and Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. ACM, 2016.
[7] Cortes, C., Gonzalvo, X., Kuznetsov, V., Mohri, M., and Yang, S. AdaNet: Adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.
[8] Fahlman, S. E. and Lebiere, C. The cascade-correlation learning architecture. In NIPS, 1990.
[9] Freund, Y. Boosting a weak learning algorithm by majority. Information and Computation, pp. 256–285, 1995.
[10] Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28:337–407, 2000.
[11] Friedman, J. H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232, 2001.
[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[13] Hansen, L. and Salamon, P. Neural network ensembles. TPAMI, 12:993–1001, 1990.
[14] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.
[15] Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
[16] Huang, F., Ash, J., Langford, J., and Schapire, R. Learning deep ResNet blocks sequentially using boosting theory. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[17] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T. Y. Lightgbm: A highly efficient gradient boosting decision tree. In NIPS, 2017.
[18] McKinney, S., Sieniek, M., et al. International evaluation of an AI system for breast cancer screening. Nature, 577:89–94, 2020.
[19] Moghimi, M., Belongie, S. J., Saberian, M. J., Yang, J., Vasconcelos, N., and Li, L.-J. Boosted convolutional neural networks. In BMVC, 2016.
[20] Opitz, D. and Shavlik, J. Actively searching for an effective neural network ensemble. Connection Science, 8:337–353, 1996.
[21] Perrone, M. P. and Cooper, L. N. When networks disagree: Ensemble methods for hybrid neural networks. pp. 126–142. Chapman and Hall, 1993.
[22] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A., and Gulin, A. Catboost: unbiased boosting with categorical features. In NeurIPS, 2018.
[23] Schwenk, H. and Bengio, Y. Training methods for adaptive boosting of neural networks for character recognition. In NIPS, 1997.
[24] Schwenk, H. and Bengio, Y. Boosting neural networks. Neural Computation, 12:1869–1887, 2000.
[25] Simeon, S., David, M., Marco, D., et al. A multimodality test to guide the management of patients with a pancreatic cyst. Science Translational Medicine, 11(501), 2019. doi:10.1126/scitranslmed.aav4772.
[26] Tu, Z. and Bai, X. Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10):1744–1757, 2010.
[27] Tyree, S., Weinberger, K., Agrawal, K., and Paykin, J. Parallel boosted regression trees for web search ranking. In WWW, 2011.
[28] Veit, A., Wilber, M., and Belongie, S. Residual networks behave like ensembles of relatively shallow networks. In NIPS, 2016.
[29] Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdinov, R., and Le, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
[30] Zhang, F., Du, B., and Zhang, L. Scene classification via a gradient boosting random convolutional network framework. IEEE Transactions on Geoscience and Remote Sensing, 54, 2016.
[31] Zhang, T. and Johnson, R. Learning nonlinear functions using regularized greedy forest. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[32] Zoph, B., Cubuk, E. D., Ghiasi, G., Lin, T.-Y., Shlens, J., and Le, Q. V. Learning data augmentation strategies for object detection. ArXiv, abs/1906.11172, 2019.
A Additional Related Work
A few works [19, 30] have also been proposed to directly combine gradient boosting with convolutional neural networks (CNNs). The authors of [30] propose to train a gradient boosting machine with a CNN as the base learner by introducing a custom multi-class softmax loss function for a specific scene classification task in the remote sensing domain. The work in [19], on the other hand, focuses on training each CNN sequentially on the mistakes of the previous networks, similar to Adaboost, to perform solely the image classification task. Our method is different from [30, 19] as it is a unified framework for performing various machine learning tasks, such as classification, regression and even learning to rank. Moreover, unlike those two methods, we leverage a corrective step to update the previously added predictor parameters and achieve a significant performance boost.
B Algorithm Pseudocode
The pseudo-code for training the weak learners is given in the individual model training part (1) of Algorithm 1. The second part of the algorithm describes the corrective step. The code is available at the GitHub page: https://github.com/sbadirli/GrowNet .

Algorithm 1 Full GrowNet training
Input: $f_0(x) = \log(n_+/n_-)$, $\alpha_0$, training data $D_{tr}$
Output: GrowNet $E$
for k = 1 to M do
    // Part 1: individual model training
    Calculate 1st order gradients: $g_i = \partial_{\hat{y}_i^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)})$, $\forall x_i \in D_{tr}$
    Calculate 2nd order gradients: $h_i = \partial^2_{\hat{y}_i^{(k-1)}} l(y_i, \hat{y}_i^{(k-1)})$, $\forall x_i \in D_{tr}$
    Train $f_k(\cdot)$ by least squares regression on $\{x_i, -g_i/h_i\}$
    Add the model $f_k$ to the GrowNet $E$
    // Part 2: corrective step
    for epoch = 1 to T do
        Calculate GrowNet output: $\hat{y}_i^{(k)} = \sum_{m=0}^{k} \alpha_m f_m(x_i)$, $\forall x_i \in D_{tr}$
        Calculate the loss: $L = \frac{1}{n}\sum_{i} l(y_i, \hat{y}_i^{(k)})$
        Update the parameters of $f_m$ through back-propagation, $\forall m \in \{0, \ldots, k\}$
        Update the step size $\alpha_k$ through back-propagation
    end for
end for

C Additional Dataset Statistics
We evaluate our model on 5 datasets from 3 different tasks. A brief description of these datasets is presented in Table 5.

We used the Higgs boson dataset for classification (https://archive.ics.uci.edu/ml/datasets/HIGGS). The Higgs data is created using Monte Carlo simulations of high energy physics events. It is a binary event classification dataset with 28 attributes.

For the regression task, 2 datasets from the UCI machine learning repository are selected. The first one is the Computed Tomography (CT) slice localization data (https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis), where the aim is to retrieve the location of CT slices on the axial axis. The data was constructed from a set of CT images taken from 74 different patients (43 male, 31 female). The second regression dataset is the YearPredictionMSD data (https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD), a subset of the Million Song dataset, from the UCI repository. The goal is to predict the release year of a song from its audio features. The songs are mostly western, commercial tracks ranging from 1922 to 2011, with a peak in the 2000s.

We choose the Yahoo LTRC dataset [5] (https://webscope.sandbox.yahoo.com/catalog.php?datatype=c) for the learning to rank task, as it is a well-known benchmark dataset that is also used in the XGBoost paper. Its queries are each associated with a list of scored documents, and the train-test split from the original paper is preserved. The second benchmark ranking dataset we used is MSLR-WEB10K (http://research.microsoft.com/en-us/projects/mslr/). The dataset contains 10K queries, each of which corresponds to a list of candidate documents.

Table 5: Dataset statistics: Higgs (binary classification), Slice localization (regression), Year prediction (regression), Yahoo LTRC (learning to rank) and MSLR-WEB10K (learning to rank), with the number of samples in each dataset.

D Hyperparameters of GrowNet
Experiment Setup.
All predictive functions added to the model are multilayer perceptrons with two hidden layers; more hidden layers degraded the performance, as the model starts overfitting. We generally set the number of hidden layer units to roughly a third of, a half of, or equal to the input feature dimension. The same budget of additive functions was employed in the experiments for all three tasks, and the number of weak predictors used at test time is chosen by the validation results. From all the experiments, we observe that 30 weak learners are more than enough to get the best results before the model performance saturates. Early stopping or other heuristics can also be incorporated into the model to terminate training before the model begins to overfit.

The boosting rate is initially set to 1 and automatically adjusted during the corrective step. Depending on the dataset and the task at hand, it may be initially set to a lower number. In our experiments, we did not tune or alter the boost rate.

We trained each predictive function for just 1 epoch, and the entire model is also trained for 1 epoch during the corrective step using stochastic gradient descent with the Adam optimizer. The Adam optimizer is run with $\ell_2$ regularization. The number of epochs is increased to 2 for the ranking task, as we used larger batch sizes; increasing the epoch number further does not contribute to the performance, and higher numbers cause overfitting. We also performed 1D batch normalization on the hidden layers. The batch size for classification was set to 2048. ReLU was used as the activation function for the penultimate layer, whereas Leaky ReLU was used for the other hidden layers; for the ranking task, we replaced ReLU with ReLU6. The source code is uploaded in a separate file.

E XGBoost and AdaNet Tuning
E.1 XGBoost Tuning
For XGBoost, we tuned the main parameters over small grids of candidate values: the number of trees, the learning rate, the maximum number of leaves, and the $\ell_2$ regularization term (lambda). We did not tune XGBoost on the Yahoo LTR (ranking task) and Higgs (classification task) datasets, as we used the results reported in the original XGBoost paper [6] as is.
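An illustrative sketch of such a tuning loop with xgboost's scikit-learn wrapper; the grid values and the synthetic stand-in data below are placeholders, not the values or data used in the paper.

```python
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Stand-in data; in the paper this would be e.g. the Higgs features and labels.
X, y = make_classification(n_samples=5000, n_features=28, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

grid = {
    "n_estimators": [128, 256, 512],   # number of trees (placeholder values)
    "learning_rate": [0.05, 0.1],      # placeholder values
    "max_leaves": [32, 64, 128],       # maximum number of leaves
    "reg_lambda": [0.0, 0.1],          # l2 regularization (lambda)
}

best_auc, best_cfg = -1.0, None
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    model = XGBClassifier(tree_method="hist", grow_policy="lossguide",
                          eval_metric="auc", **cfg)
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    if auc > best_auc:
        best_auc, best_cfg = auc, cfg
print(best_cfg, best_auc)
```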
E.2 AdaNet Tuning
We tuned 3 main parameters for AdaNet, each over a small grid of candidate values: the learning rate, the number of sub-networks (AdaNet iterations) and the complexity regularization parameter λ. The model was first tuned with default mixture weights and λ = 0, as suggested on the authors' GitHub page (https://github.com/tensorflow/adanet/blob/master/adanet/examples/tutorials/adanet_objective.ipynb). From this experiment, we obtained the optimal learning rate and number of sub-networks. Then, using the learned parameters from the previous setting, AdaNet was tuned again to learn the mixture weights and the complexity regularization parameter λ. The model is trained for a fixed budget of epochs, and the number of neurons per layer is set to 512, following the results from the AdaNet paper [7]; we also observed that the model with 512 neurons generally renders better performance.

E.3 Classification on Higgs-1M
Following the same data split of the Higgs data as in the XGBoost paper [6], we created the Higgs-1M data. Table 6 reports the AUC scores on the Higgs-1M data for GrowNet, AdaNet and XGBoost. GrowNet renders favorable results compared to XGBoost and an increase over the AdaNet result.

Table 6: Classification results on Higgs-1M data for GrowNet, AdaNet and XGBoost. The scores are in AUC-ROC.

F Additional Illustrations for Ablation Study
Analogous to Figure 2 from the main text, Figure 5 presents the pairwise training losses on the Microsoft dataset for the ranking task.
Effect of hidden layers.
Table 7: Results from the hidden layer experiment, comparing GrowNet with weak learners of 1, 2, 3 and 4 hidden layers (AUC on Higgs-1M).

Table 7 reports the results from the hidden-layer experiment. GrowNet final, employing weak learners with 2 hidden layers, obtained the best performance. The model with a shallow network of 1 hidden layer as the weak learner obtains better performance once the number of hidden units is increased from 16 to 32. The inverse adjustment for the models with weak learners of 3 or 4 hidden layers did not work as expected; that is, decreasing the number of neurons in the hidden layers of these predictive functions did not improve the classification performance much.

Details on DNN versus GrowNet.
Both the Deep Neural Network (DNN) models and GrowNet are run on the same machine with an NVIDIA Tesla V100 (16GB) GPU. Unlike GrowNet, the DNNs performed better with SELU activation functions. We also applied batch normalization to the hidden layers of the DNNs. Each of the DNN models was run for 1000 epochs. The results are reported in Table 8. The best performing DNN model has 10 hidden layers, and each epoch took approximately 12 seconds; the model reaches its best performance after epoch 900. GrowNet shows a clear advantage in both classification performance and training time. Neither method, DNN nor GrowNet, is fully optimized, thus their training times can be slightly improved. Figure 7 displays the training time of GrowNet while adding new weak learners. The DNN with 30 hidden layers is implemented with Dropout(0.3), as without Dropout the model started to overfit immediately after a few epochs; that also explains the very close training times of the DNNs with 20 and 30 layers.

Table 8: Training time (seconds) and AUC for DNNs with 5, 10, 20 and 30 hidden layers versus GrowNet on the Higgs-1M data; GrowNet reaches an AUC of 0.8401.

Figure 5: Training loss visualization for the learning to rank task on the MSLR dataset, using the pairwise loss, for GrowNet final, the first-order-statistics variant, the variant without the corrective step, GrowNet simple, and the constant boost rate variant.

Figure 6: Effect of hidden layers on model training and classification performance (AUC): (a) classification training loss, (b) classification test loss.