Unified Robust Boosting
Zhu Wang
UT Health San Antonio
Abstract
Boosting is a popular machine learning algorithm in regression and classification problems. Boosting can combine a sequence of regression trees to obtain accurate predictions. In the presence of outliers, traditional boosting, based on optimizing convex loss functions, may show inferior results. In this article, a unified robust boosting method is proposed for more resistant estimation. The method utilizes a recently developed concave-convex family for robust estimation, composite optimization by conjugation operator, and functional descent boosting. As a result, an iteratively reweighted boosting algorithm can be conveniently constructed from existing software. Applications in robust regression, classification and Poisson regression are demonstrated in the R package ccboost.

Keywords: Robust method, CC-family, CC-estimation, CC-boosting, boosting, COCO.
1. Introduction
Boosting is a powerful supervised machine learning algorithm. As an ensemble method, boosting combines many weak learners to generate a strong prediction. As a functional descent method, boosting has wide applications in regression and classification problems. Friedman (2001); Friedman, Hastie, and Tibshirani (2000); Bühlmann and Hothorn (2007) discussed boosting for a variety of convex loss functions. To deal with outliers, robust estimation has been brought into boosting in order to provide more accurate estimation. Wang (2018a,b) proposed robust functional gradient boosting for nonconvex loss functions. These methods applied the majorization-minimization (MM) scheme, an extension of the popular expectation-maximization (EM) algorithm in statistics. Recently, Wang (2020) proposed a unified robust loss function family, the concave convex (CC) family, and introduced the composite optimization by conjugation operator (COCO) to obtain the CC-estimation. The CC-family includes traditional robust loss functions such as the Huber loss, the robust hinge loss for support vector machines, and the robust exponential family for generalized linear models. The COCO algorithm is an iteratively reweighted estimation and can be conveniently implemented with existing methods and software. In this article, we integrate COCO and boosting to obtain CC-estimation in the context of function estimation, which is broader than the linear predictor function in Wang (2020). For instance, the CC-boosting algorithm permits function spaces derived from regression trees. We illustrate the proposed algorithm through the ccboost R package with applications to the robust exponential family, including regression, binary classification and Poisson regression.
Table 1: Concave components g(z) of the CC-family, each indexed by a robustness parameter σ > 0 (ecave and gcave additionally involve a constant δ determined by σ). For example, hcave is the Huber-type function g(z) = z if z ≤ σ²/2 and g(z) = σ(2z)^(1/2) − σ²/2 if z > σ²/2, and ccave is g(z) = σ²(1 − exp(−z/σ²)). The family also contains acave, bcave, dcave, ecave, gcave and the truncation function tcave.
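For concreteness, two of the entries in Table 1 can be written directly in R. The snippet below is our own illustration, not part of the ccboost package; it also computes the derivatives g'(z), which are nonnegative, nonincreasing quantities that later play the role of observation weights in Algorithm 1.

## Illustrative R definitions of two concave components (our sketch).
## hcave: Huber-type; ccave: exponential-type.
g_hcave <- function(z, sigma) ifelse(z <= sigma^2 / 2, z, sigma * sqrt(2 * z) - sigma^2 / 2)
g_ccave <- function(z, sigma) sigma^2 * (1 - exp(-z / sigma^2))

## Derivatives g'(z): they decay as the convex loss z grows, downweighting outliers.
gprime_hcave <- function(z, sigma) ifelse(z <= sigma^2 / 2, 1, sigma / sqrt(2 * z))
gprime_ccave <- function(z, sigma) exp(-z / sigma^2)

z <- seq(0, 20, by = 0.5)
plot(z, gprime_ccave(z, sigma = 2), type = "l", ylab = "g'(z)")
lines(z, gprime_hcave(z, sigma = 2), lty = 2)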
2. Robust CC-boosting
To unify robust estimation, Wang (2020) proposed the concave convex (CC) family with functions Γ satisfying the following conditions:

i. Γ = g ∘ s;
ii. g is a nondecreasing closed concave function on the range of s;
iii. ∂(−g(z)) is nonempty and bounded for all z in the range of s;
iv. s is convex on R.

Table 1 lists some concave components. The convex component includes common loss functions in regression and classification such as least squares and the logistic function. More broadly, the convex component contains the negative log-likelihood functions of the exponential family adopted by generalized linear models. Since the convex component can be non-differentiable, the subgradient and subdifferential are useful tools. For instance, tcave is not differentiable at z = σ, but is quite useful to truncate loss functions. Given a set of observations (x_i, y_i), i = 1, ..., n, where y_i ∈ R and x_i = (x_{i1}, ..., x_{ip})ᵀ ∈ R^p, denote by Ω the linear span of a set H of base learners, including regression trees and linear predictor functions. Denote f = (f(x_1), ..., f(x_n))ᵀ, f ∈ Ω. We aim to minimize the empirical loss function

    ∑_{i=1}^n ℓ(y_i, f(x_i)).    (1)

Here ℓ is a member of the CC-family, ℓ = g ∘ s = g(s(u)) = g(s(y, f)). To simplify notation, f and f(x) are sometimes interchanged. For i = 1, ..., n, u_i is defined by

    u_i = y_i − f_i    for regression,
    u_i = y_i f_i      for classification with y_i ∈ {−1, 1},
    u_i = f_i          for the exponential family.    (2)

Robust function estimation can be accomplished by the following algorithm.
Algorithm 1 CC-Function Estimation Algorithm

1: Input: training samples {(x_1, y_1), ..., (x_n, y_n)}, concave component g with parameter σ, convex component s, starting point f^(0) and iteration count K.
2: for k = 1 to K do
3:     Compute z_i = s(y_i, f_i^(k−1)), i = 1, ..., n.
4:     Compute a subgradient v_i^(k) via v_i^(k) ∈ ∂(−g(z_i)) or z_i ∈ ∂ϕ(v_i^(k)), i = 1, ..., n.
5:     Compute f^(k) = argmin_{f ∈ Ω} ∑_{i=1}^n s(y_i, f_i)(−v_i^(k)).
6: end for
7: Output: v_i^(K) and f^(K).
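The inner minimization in step 5 is a weighted version of an ordinary boosting fit, so the whole algorithm can be sketched with standard tools. The following minimal R sketch is our own illustration, not the ccboost implementation; all function and variable names are ours. It runs Algorithm 1 for regression with the ccave concave component, using xgboost as the weighted boosting solver.

library(xgboost)

## ccave concave component: g(z) = sigma^2 * (1 - exp(-z / sigma^2));
## its derivative g'(z) = exp(-z / sigma^2) supplies the weight -v in step 4.
gprime_ccave <- function(z, sigma) exp(-z / sigma^2)

coco_boost_sketch <- function(x, y, sigma = 1, K = 5, nrounds = 50) {
  f <- rep(mean(y), length(y))              # starting point f^(0)
  for (k in seq_len(K)) {
    z <- (y - f)^2 / 2                      # convex component s(y, f) for regression
    w <- gprime_ccave(z, sigma)             # observation weights -v^(k)
    dtrain <- xgb.DMatrix(as.matrix(x), label = y, weight = w)
    fit <- xgb.train(params = list(objective = "reg:squarederror", max_depth = 2),
                     data = dtrain, nrounds = nrounds, verbose = 0)
    f <- predict(fit, as.matrix(x))         # f^(k), the weighted boosting solution (step 5)
  }
  list(fit = fit, weights = w)
}

In ccboost the analogous reweighting is carried out internally, and the final weights are returned in the component weight_update used in the examples below.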
We have the following convergence result for the COCO algorithm.

Theorem 1. Suppose that g is a concave component in the CC-family, and g is bounded below. The loss function values ρ(f^(k)) := ∑_{i=1}^n ℓ(y_i, f_i^(k)) generated by Algorithm 1 are nonincreasing and converge.

This result is a generalization of Theorem 4 in Wang (2020), in which the function is restricted to a linear predictor function. Here we study more broadly defined function spaces. On the other hand, if H is a space of linear models, Theorem 1 indeed coincides with the results in Wang (2020). The proof follows the same argument as Theorem 4 in Wang (2020), hence only a sketch is outlined. Define the surrogate loss function

    Q(f | f^(k)) = ∑_{i=1}^n [ s(y_i, f_i)(−v_i^(k+1)) + ϕ(v_i^(k+1)) ].

Applying well-known results on the conjugation operator, we then have

    ρ(f^(k+1)) ≤ Q(f^(k+1) | f^(k)) ≤ Q(f^(k) | f^(k)) = ρ(f^(k)).    (3)

Step 5 in the algorithm is equivalent to minimizing Q(f | f^(k)) since ϕ(v_i^(k+1)) is a constant with respect to f. The conclusion of the theorem follows.

An important question is how to compute step 5 in Algorithm 1. Here we adopt a weighted boosting algorithm. Boosting as a method for function estimation has been well studied by
Friedman et al. (2000); Friedman (2001). Boosting can be utilized to fit a variety of models with different base learners, including linear least squares, smoothing splines and regression trees (Bühlmann and Hothorn 2007; Wang 2018b). For ease of notation, we first consider unweighted estimation:

    argmin_{f ∈ Ω} ∑_{i=1}^n s(y_i, f_i).    (4)

In a boosting algorithm, the solution is an additive model given by

    f̂_i = F_M(x_i) = ∑_{m=1}^M t_m(x_i),    (5)

where F_M(x_i) is constructed stagewise by sequentially adding an update t_m(x_i) to the current estimate F_{m−1}(x_i):

    F_m(x_i) = F_{m−1}(x_i) + t_m(x_i),  m = 1, ..., M.    (6)

There are different ways to compute t_m(x) = (t_m(x_1), ..., t_m(x_n))ᵀ: gradient and Newton-type updates are the most popular (Sigrist 2020). When the second derivative of the loss function exists, the Newton-type update is preferred over the gradient update to achieve fast convergence:

    t_m(x) = argmin_{f ∈ H} ∑_{i=1}^n h_{m,i} ( −d_{m,i}/h_{m,i} − f(x_i) )²,    (7)

where the first and second derivatives of the loss function s for observation i are given by

    d_{m,i} = ∂ s(y_i, f)/∂f |_{f = F_{m−1}(x_i)},    (8)
    h_{m,i} = ∂² s(y_i, f)/∂f² |_{f = F_{m−1}(x_i)}.    (9)

For the quadratic loss s(y_i, f) = (y_i − f)²/2, we obtain h_{m,i} = 1. In this case, the Newton-type update becomes the gradient update. Furthermore, the weighted minimization problem in step 5 of Algorithm 1 can be solved with a weighted boosting algorithm.

To avoid overfitting, we can add a regularization term to the objective function:

    ∑_{i=1}^n ℓ(y_i, f̂_i) + ∑_{m=1}^M Λ(t_m),    (10)

where Λ penalizes the model complexity. If H is the space of linear regression with a p-dimensional predictor, i.e., t_m(x_i) = x_iᵀ β_m, β_m = (β_{1m}, ..., β_{pm})ᵀ, denote

    Λ(t_m) = (1/2) λ ∑_{j=1}^p β_{jm}² + α ∑_{j=1}^p |β_{jm}|,    (11)

with λ ≥ 0, α ≥ 0.
Note that Λ(t_m) provides shrinkage estimators and can conduct variable selection. Suppose that H is the space of regression trees. Each regression tree splits the whole predictor space into disjoint hyper-rectangles with sides parallel to the coordinate axes (Wang 2018b). Specifically, denote the hyper-rectangles in the m-th boosting iteration by R_{jm}, j = 1, ..., J. Let t_m(x_i) = β_{jm} for x_i ∈ R_{jm}, i = 1, ..., n, j = 1, ..., J.
With γ ≥ 0, the penalty can be defined as in Chen and Guestrin (2016):

    Λ(t_m) = γJ + (1/2) λ ∑_{j=1}^J β_{jm}² + α ∑_{j=1}^J |β_{jm}|.    (12)

A different penalized estimation is to implement a shrinkage parameter 0 < ν ≤ 1:

    F_m(x_i) = F_{m−1}(x_i) + ν t_m(x_i),  m = 1, ..., M.    (13)

In summary, we use Algorithm 1 coupled with the boosting algorithm to minimize the following objective function:

    ∑_{i=1}^n ℓ(y_i, f̂_i),    (14)

where f̂_i is given by (5). There are two layers of iterations: the outer layer is the CC iteration and the inner layer is the boosting iterations. An early stop of the boosting iterations does not guarantee convergence. On the other hand, the output f^(K) may overfit the data. In this case, we may consider a two-stage process: in the first stage, apply Algorithm 1 to obtain optimal weights of the observations; in the second stage, use a data-driven method such as cross-validation to select the optimal number of boosting iterations M and the penalty parameters γ (for trees), λ and α. The same strategy can also be applied to the robust parameter σ. However, since this parameter is typically considered a hyperparameter, a more computationally convenient approach in the literature is to conduct estimation for different values of σ and compare the results. One can begin with a large value of σ, giving less robust estimation, and move towards smaller values of σ for more robust results.
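Since ccboost builds on xgboost (the fitted objects below are passed to xgboost::xgb.importance and xgboost::xgb.plot.tree), the quantities in (12) and (13) have direct counterparts among the standard xgboost tuning parameters: γ corresponds to gamma, λ to lambda, α to alpha, and the shrinkage ν to eta. The snippet below is only a reminder of that correspondence in plain xgboost; whether ccboost passes these arguments through is an assumption to verify in its documentation.

## Correspondence between (12)-(13) and xgboost parameters (plain xgboost shown;
## pass-through of these arguments in ccboost is an assumption).
params <- list(
  gamma = 1,      # per-leaf penalty gamma * J in (12)
  lambda = 1,     # L2 penalty (1/2) * lambda * sum(beta^2)
  alpha = 0,      # L1 penalty alpha * sum(|beta|)
  eta = 0.1,      # shrinkage nu in (13)
  max_depth = 2
)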
The source version of the ccboost package is freely available from the Comprehensive R Archive Network (http://CRAN.R-project.org). The reader can install the package directly from the R prompt via

R> install.packages("ccboost")

All analyses presented below are contained in a package vignette. The rendered output of the analyses is available by the R command

R> library("ccboost")
R> vignette("ccbst", package = "ccboost")
To reproduce the analyses, one can invoke the R code

R> edit(vignette("ccbst", package = "ccboost"))
3. Data examples
In this example, we predict the median value of owner-occupied homes in suburbs of Boston, with data publicly available from the UCI machine learning data repository. There are 506 observations and 13 predictors. A different robust estimation of these data can be found in Wang (2020).
R> urlname <- "https://archive.ics.uci.edu/ml/"
R> filename <- "machine-learning-databases/housing/housing.data"
R> dat <- read.table(paste0(urlname, filename), sep = "",
+     header = FALSE)
R> dat <- as.matrix(dat)
R> colnames(dat) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
+     "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B",
+     "LSTAT", "MEDV")
R> p <- dim(dat)[2]
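As a quick sanity check (our addition, not part of the original analysis), the data should have 506 rows and 14 columns, the 13 predictors plus the response MEDV:

R> dim(dat)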
We fit a CC-boosting model with concave component bcave and convex component least squares. The observation weights are plotted, and the 4 observations with the smallest weights are labeled; these observations are considered outliers.
R> library("ccboost")R> fit.ls <- ccboost(dat[, -p], dat[, p], cfun = "bcave",+ s = 10, dfun = "gaussian", verbose = 0, max.depth = 2,+ nrounds = 50)R> plot(fit.ls$weight_update)R> id <- sort.list(fit.ls$weight_update)[1:4]R> text(id, fit.ls$weight_update[id] - 0.02, id, col = "red") . . . . . . Index f i t. l s $ w e i gh t _upda t e median housing price vs the predicted values. Not surprisingly, those 4 observations with thesmallest weights have poor predictions. We can view feature importance/influence from thelearned model. The figure shows that the top two factors to predict median housing price hu Wang R> plot(dat[, p], predict(fit.ls, newdata = dat[, -p]))R> text(dat[id, p], predict(fit.ls, newdata = dat[id, -p]) -+ 1, id, col = "red")
[Figure: predicted values against the observed median housing price dat[, p], with the 4 flagged observations labeled.]

We can view feature importance/influence from the learned model. The figure shows that the top two factors to predict the median housing price are the average number of rooms per dwelling (RM) and the percentage of lower status of the population (LSTAT).

R> importance_matrix <- xgboost::xgb.importance(model = fit.ls)
R> xgboost::xgb.plot.importance(importance_matrix = importance_matrix)
[Figure: variable importance for the Boston housing model; RM and LSTAT rank highest.]
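To inspect the ranking numerically rather than graphically (our addition), one can print the importance table returned by xgb.importance:

R> head(importance_matrix)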
R> gr <- xgboost::xgb.plot.tree(model = fit.ls, trees = 0:1)
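If a tabular view of the fitted trees is preferred (our suggestion, not shown in the original analysis), the tree structure can also be extracted as a data table with xgboost:

R> xgboost::xgb.model.dt.tree(model = fit.ls)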
A binary classification problem was proposed by Long and Servedio (2010). The response variable y is randomly chosen to be −1 or +1 with equal probability. We randomly generate symbols A, B and C with probability 0.25, 0.25 and 0.5, respectively. The predictor vector x with 21 elements is generated as follows. If A is obtained, x_j = y, j = 1, ..., 21. If B is generated, x_j = y for j = 1, ..., 11 and x_j = −y for j = 12, ..., 21. If C is generated, x_j = y for j randomly chosen as 5 indices from 1-11 and 6 indices from 12-21; for the remaining j, x_j = −y. We generate training data with n = 400 and test data with n = 200.

We fit a robust logistic boosting model with concave component acave, where the maximum depth of a tree is 5. Other concave components in Table 1 can be applied similarly.

R> set.seed(1947)
R> dat <- dataLS(ntr = 400, nte = 200, percon = 0)
R> fit1 <- ccboost(dat$xtr, dat$ytr, cfun = "acave", s = 3,
+     dfun = "binomial", verbose = 0, max.depth = 5, nrounds = 50)
R> plot(fit1$weight_update)

[Figure: observation weights fit1$weight_update by index.]

We can compute the prediction error of the test data at each boosting iteration. Furthermore, we simulate data with 10% contamination of the response variables, and compute CC-boosting again. In the third robust logistic boosting, we reduce the parameter value σ (s in the ccboost function) for more robust estimation. As a result, some observations would have decreased weights in the model.
R> err1 <- rep(NA, 100)
R> for (i in 1:100) {
+     pred1 <- predict(fit1, newdata = dat$xte, ntreelimit = i)
+     err1[i] <- mean(sign(pred1) != dat$yte)
+ }
R> plot(err1, type = "l")

[Figure: test error err1 by boosting iteration for fit1.]

R> dat2 <- dataLS(ntr = 400, nte = 200, percon = 0.1)
R> fit2 <- ccboost(dat2$xtr, dat2$ytr, cfun = "acave", s = 3,
+     dfun = "binomial", verbose = 0, max.depth = 5, nrounds = 50)
R> plot(fit2$weight_update)

[Figure: observation weights fit2$weight_update by index.]

R> err2 <- rep(NA, 100)
R> for (i in 1:100) {
+     pred2 <- predict(fit2, newdata = dat2$xte, ntreelimit = i)
+     err2[i] <- mean(sign(pred2) != dat2$yte)
+ }
R> plot(err2, type = "l")

[Figure: test error err2 by boosting iteration for fit2.]

R> fit3 <- ccboost(dat2$xtr, dat2$ytr, cfun = "acave", s = 1,
+     dfun = "binomial", verbose = 0, max.depth = 5, nrounds = 50)
R> plot(fit3$weight_update)

[Figure: observation weights fit3$weight_update by index.]
R> err3 <- rep(NA, 100)
R> for (i in 1:100) {
+     pred3 <- predict(fit3, newdata = dat2$xte, ntreelimit = i)
+     err3[i] <- mean(sign(pred3) != dat2$yte)
+ }
R> plot(err3, type = "l")

[Figure: test error err3 by boosting iteration for fit3.]

A survey of 3066 Americans was studied for health care utilization in terms of doctor office visits (Heritier, Cantoni, Copt, and Victoria-Feser 2009). The data contain 24 risk factors. Robust Poisson regression was conducted in Wang (2020). Here a robust Poisson boosting model is fitted with concave component ccave. The observation weights are estimated, and the doctor office visits in two years are highlighted for the 8 smallest weights, ranging from 200 to 750.
R> data(docvisits, package = "mpath")
R> x <- model.matrix(~age + factor(gender) + factor(race) +
+     factor(hispan) + factor(marital) + factor(arthri) +
+     factor(cancer) + factor(hipress) + factor(diabet) +
+     factor(lung) + factor(hearth) + factor(stroke) +
+     factor(psych) + factor(iadla) + factor(adlwa) + edyears +
+     feduc + meduc + log(income + 1) + factor(insur) +
+     0, data = docvisits)
R> fit.pos <- ccboost(x, docvisits$visits, cfun = "ccave",
+     s = 20, dfun = "poisson", verbose = 0, max.depth = 1,
+     nrounds = 50)
R> plot(fit.pos$weight_update)
R> id <- sort.list(fit.pos$weight_update)[1:8]
R> text(id, fit.pos$weight_update[id] - 0.02, docvisits$visits[id],
+     col = "red")

[Figure: observation weights fit.pos$weight_update by index, with the number of doctor office visits labeled for the 8 smallest weights.]

We can view feature importance/influence from the learned model. The figure shows that the top two reasons for doctor office visits are heart disease and psychiatric problems.

R> importance_matrix <- xgboost::xgb.importance(model = fit.pos)
R> xgboost::xgb.plot.importance(importance_matrix = importance_matrix)

[Figure: variable importance for the Poisson boosting model; factor(hearth)1 and factor(psych)1 rank highest.]
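As a small follow-up (our addition), the amount of downweighting can be summarized directly from the returned weights; relatively small values flag the heavily downweighted records:

R> summary(fit.pos$weight_update)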
4. Conclusion
In this article we propose CC-boosting as a unified robust boosting algorithm, and illustrate its applications in regression, classification and Poisson regression. The method can be used for outlier detection and can reduce the impact of outliers. Based on existing weighted boosting software, we can determine variable importance and explore the trees from the boosting algorithm. The R package ccboost is a useful tool in machine learning applications.
References
Bühlmann P, Hothorn T (2007). "Boosting algorithms: Regularization, prediction and model fitting (with discussion)." Statistical Science, 22(4), 477–505.

Chen T, Guestrin C (2016). "XGBoost: A scalable tree boosting system." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.

Friedman J (2001). "Greedy function approximation: A gradient boosting machine." Annals of Statistics, 29(5), 1189–1232.

Friedman J, Hastie T, Tibshirani R (2000). "Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors)." Annals of Statistics, 28(2), 337–407.

Heritier S, Cantoni E, Copt S, Victoria-Feser MP (2009). Robust Methods in Biostatistics, volume 825. John Wiley & Sons.

Long PM, Servedio RA (2010). "Random classification noise defeats all convex potential boosters." Machine Learning, 78(3), 287–304.

Sigrist F (2020). "Gradient and Newton boosting for classification and regression." arXiv e-prints. https://arxiv.org/abs/1808.03064.

Wang Z (2018a). "Quadratic majorization for nonconvex loss with applications to the boosting algorithm." Journal of Computational and Graphical Statistics, 27(3), 491–502.

Wang Z (2018b). "Robust boosting with truncated loss functions." Electronic Journal of Statistics, 12(1), 599–650. doi:10.1214/18-EJS1404.

Wang Z (2020). "Unified Robust Estimation via the COCO." arXiv e-prints, arXiv:2010.02848. https://arxiv.org/abs/2010.02848.

Affiliation: