Unified Robust Boosting
Zhu Wang
UT Health San Antonio
Abstract
Boosting is a popular machine learning algorithm in regression and classification problems. Boosting can combine a sequence of regression trees to obtain accurate predictions. In the presence of outliers, traditional boosting, based on optimizing convex loss functions, may show inferior results. In this article, a unified robust boosting method is proposed for more resistant estimation. The method utilizes a recently developed concave-convex family for robust estimation, composite optimization by conjugation operator, and functional descent boosting. As a result, an iteratively reweighted boosting algorithm can be conveniently constructed from existing software. Applications in robust regression, classification and Poisson regression are demonstrated in the R package ccboost.

Keywords: Robust method, CC-family, CC-estimation, CC-boosting, boosting, COCO.
1. Introduction
Boosting is a powerful supervised machine learning algorithm. As an ensemble method, boosting combines many weak learners to generate a strong prediction. As a functional descent method, boosting has wide applications in regression and classification problems. Friedman (2001); Friedman, Hastie, and Tibshirani (2000); Bühlmann and Hothorn (2007) discussed boosting for a variety of convex loss functions. To deal with outliers, robust estimation has been brought into boosting in order to provide more accurate estimation. Wang (2018a,b) proposed robust functional gradient boosting for nonconvex loss functions. These methods applied the majorization-minimization (MM) scheme, an extension of the popular expectation-maximization (EM) algorithm in statistics. Recently, Wang (2020) proposed a unified robust loss function family, the concave convex (CC) family, and introduced the composite optimization by conjugation operator (COCO) to obtain the CC-estimation. The CC-family includes traditional robust loss functions such as the Huber loss, the robust hinge loss for support vector machines, and the robust exponential family for generalized linear models. The COCO algorithm is an iteratively reweighted estimation and can be conveniently implemented with existing methods and software. In this article, we integrate COCO and boosting to obtain CC-estimation in the context of function estimation, which is broader than the linear predictor function in Wang (2020). For instance, the CC-boosting algorithm permits function spaces derived from regression trees. We illustrate the proposed algorithm through the ccboost R package with applications to the robust exponential family, including regression, binary classification and Poisson regression.
Table 1: Concave components g(z) of the CC-family, each indexed by a robustness parameter σ > 0 (ecave and gcave additionally involve a constant δ determined by σ). For example, hcave is the Huber-type function g(z) = z if z ≤ σ²/2 and g(z) = σ(2z)^(1/2) − σ²/2 if z > σ²/2, and ccave is g(z) = σ²(1 − exp(−z/σ²)). The family also contains acave, bcave, dcave, ecave, gcave and the truncation function tcave.
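For concreteness, two of the entries in Table 1 can be written directly in R. The snippet below is our own illustration, not part of the ccboost package; it also computes the derivatives g'(z), which are nonnegative, nonincreasing quantities that later play the role of observation weights in Algorithm 1.

## Illustrative R definitions of two concave components (our sketch).
## hcave: Huber-type; ccave: exponential-type.
g_hcave <- function(z, sigma) ifelse(z <= sigma^2 / 2, z, sigma * sqrt(2 * z) - sigma^2 / 2)
g_ccave <- function(z, sigma) sigma^2 * (1 - exp(-z / sigma^2))

## Derivatives g'(z): they decay as the convex loss z grows, downweighting outliers.
gprime_hcave <- function(z, sigma) ifelse(z <= sigma^2 / 2, 1, sigma / sqrt(2 * z))
gprime_ccave <- function(z, sigma) exp(-z / sigma^2)

z <- seq(0, 20, by = 0.5)
plot(z, gprime_ccave(z, sigma = 2), type = "l", ylab = "g'(z)")
lines(z, gprime_hcave(z, sigma = 2), lty = 2)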
2. Robust CC-boosting
To unify robust estimation, Wang (2020) proposed the concave convex (CC) family with functions Γ satisfying the following conditions:

i. Γ = g ∘ s;
ii. g is a nondecreasing closed concave function on the range of s;
iii. ∂(−g(z)) is nonempty and bounded for all z in the range of s;
iv. s is convex on R.

Table 1 lists some concave components. The convex component includes common loss functions in regression and classification such as least squares and the logistic function. More broadly, the convex component contains the negative log-likelihood functions of the exponential family adopted by generalized linear models. Since the convex component can be non-differentiable, the subgradient and subdifferential are useful tools. For instance, tcave is not differentiable at z = σ, but is quite useful to truncate loss functions. Given a set of observations (x_i, y_i), i = 1, ..., n, where y_i ∈ R and x_i = (x_{i1}, ..., x_{ip})ᵀ ∈ R^p, denote by Ω the linear span of a set H of base learners, including regression trees and linear predictor functions. Denote f = (f(x_1), ..., f(x_n))ᵀ, f ∈ Ω. We aim to minimize the empirical loss function

    ∑_{i=1}^n ℓ(y_i, f(x_i)).    (1)

Here ℓ is a member of the CC-family, ℓ = g ∘ s = g(s(u)) = g(s(y, f)). To simplify notation, f and f(x) are sometimes interchanged. For i = 1, ..., n, u_i is defined by

    u_i = y_i − f_i    for regression,
    u_i = y_i f_i      for classification with y_i ∈ {−1, 1},
    u_i = f_i          for the exponential family.    (2)

Robust function estimation can be accomplished by the following algorithm.
Algorithm 1 CC-Function Estimation Algorithm

1: Input: training samples {(x_1, y_1), ..., (x_n, y_n)}, concave component g with parameter σ, convex component s, starting point f^(0) and iteration count K.
2: for k = 1 to K do
3:     Compute z_i = s(y_i, f_i^(k−1)), i = 1, ..., n.
4:     Compute a subgradient v_i^(k) via v_i^(k) ∈ ∂(−g(z_i)) or z_i ∈ ∂ϕ(v_i^(k)), i = 1, ..., n.
5:     Compute f^(k) = argmin_{f ∈ Ω} ∑_{i=1}^n s(y_i, f_i)(−v_i^(k)).
6: end for
7: Output: v_i^(K) and f^(K).
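The inner minimization in step 5 is a weighted version of an ordinary boosting fit, so the whole algorithm can be sketched with standard tools. The following minimal R sketch is our own illustration, not the ccboost implementation; all function and variable names are ours. It runs Algorithm 1 for regression with the ccave concave component, using xgboost as the weighted boosting solver.

library(xgboost)

## ccave concave component: g(z) = sigma^2 * (1 - exp(-z / sigma^2));
## its derivative g'(z) = exp(-z / sigma^2) supplies the weight -v in step 4.
gprime_ccave <- function(z, sigma) exp(-z / sigma^2)

coco_boost_sketch <- function(x, y, sigma = 1, K = 5, nrounds = 50) {
  f <- rep(mean(y), length(y))              # starting point f^(0)
  for (k in seq_len(K)) {
    z <- (y - f)^2 / 2                      # convex component s(y, f) for regression
    w <- gprime_ccave(z, sigma)             # observation weights -v^(k)
    dtrain <- xgb.DMatrix(as.matrix(x), label = y, weight = w)
    fit <- xgb.train(params = list(objective = "reg:squarederror", max_depth = 2),
                     data = dtrain, nrounds = nrounds, verbose = 0)
    f <- predict(fit, as.matrix(x))         # f^(k), the weighted boosting solution (step 5)
  }
  list(fit = fit, weights = w)
}

In ccboost the analogous reweighting is carried out internally, and the final weights are returned in the component weight_update used in the examples below.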
We have the following convergence result for the COCO algorithm.

Theorem 1. Suppose that g is a concave component in the CC-family, and g is bounded below. The loss function values ρ(f^(k)) := ∑_{i=1}^n ℓ(y_i, f_i^(k)) generated by Algorithm 1 are nonincreasing and converge.

This result is a generalization of Theorem 4 in Wang (2020), in which the function is restricted to a linear predictor function. Here we study more broadly defined function spaces. On the other hand, if H is a space of linear models, Theorem 1 indeed coincides with the results in Wang (2020). The proof follows the same argument as Theorem 4 in Wang (2020), hence only a sketch is outlined. Define the surrogate loss function

    Q(f | f^(k)) = ∑_{i=1}^n [ s(y_i, f_i)(−v_i^(k+1)) + ϕ(v_i^(k+1)) ].

Applying well-known results on the conjugation operator, we then have

    ρ(f^(k+1)) ≤ Q(f^(k+1) | f^(k)) ≤ Q(f^(k) | f^(k)) = ρ(f^(k)).    (3)

Step 5 in the algorithm is equivalent to minimizing Q(f | f^(k)) since ϕ(v_i^(k+1)) is a constant with respect to f. The conclusion of the theorem follows.

An important question is how to compute step 5 in Algorithm 1. Here we adopt a weighted boosting algorithm. Boosting as a method for function estimation has been well studied by
Friedman et al. (2000); Friedman (2001). Boosting can be utilized to fit a variety of models with different base learners, including linear least squares, smoothing splines and regression trees (Bühlmann and Hothorn 2007; Wang 2018b). For ease of notation, we first consider unweighted estimation:

    argmin_{f ∈ Ω} ∑_{i=1}^n s(y_i, f_i).    (4)

In a boosting algorithm, the solution is an additive model given by

    f̂_i = F_M(x_i) = ∑_{m=1}^M t_m(x_i),    (5)

where F_M(x_i) is constructed stagewise by sequentially adding an update t_m(x_i) to the current estimate F_{m−1}(x_i):

    F_m(x_i) = F_{m−1}(x_i) + t_m(x_i),  m = 1, ..., M.    (6)

There are different ways to compute t_m(x) = (t_m(x_1), ..., t_m(x_n))ᵀ: gradient and Newton-type updates are the most popular (Sigrist 2020). When the second derivative of the loss function exists, the Newton-type update is preferred over the gradient update to achieve fast convergence:

    t_m(x) = argmin_{f ∈ H} ∑_{i=1}^n h_{m,i} ( −d_{m,i}/h_{m,i} − f(x_i) )²,    (7)

where the first and second derivatives of the loss function s for observation i are given by

    d_{m,i} = ∂ s(y_i, f)/∂f |_{f = F_{m−1}(x_i)},    (8)
    h_{m,i} = ∂² s(y_i, f)/∂f² |_{f = F_{m−1}(x_i)}.    (9)

For the quadratic loss s(y_i, f) = (y_i − f)²/2, we obtain h_{m,i} = 1. In this case, the Newton-type update becomes the gradient update. Furthermore, the weighted minimization problem in step 5 of Algorithm 1 can be solved with a weighted boosting algorithm.

To avoid overfitting, we can add a regularization term to the objective function:

    ∑_{i=1}^n ℓ(y_i, f̂_i) + ∑_{m=1}^M Λ(t_m),    (10)

where Λ penalizes the model complexity. If H is the space of linear regression with a p-dimensional predictor, i.e., t_m(x_i) = x_iᵀ β_m, β_m = (β_{1m}, ..., β_{pm})ᵀ, denote

    Λ(t_m) = (1/2) λ ∑_{j=1}^p β_{jm}² + α ∑_{j=1}^p |β_{jm}|,    (11)

with λ ≥ 0, α ≥ 0.
Note that Λ(t_m) provides shrinkage estimators and can conduct variable selection. Suppose that H is the space of regression trees. Each regression tree splits the whole predictor space into disjoint hyper-rectangles with sides parallel to the coordinate axes (Wang 2018b). Specifically, denote the hyper-rectangles in the m-th boosting iteration by R_{jm}, j = 1, ..., J. Let t_m(x_i) = β_{jm} for x_i ∈ R_{jm}, i = 1, ..., n, j = 1, ..., J.
With γ ≥ 0, the penalty can be defined as in Chen and Guestrin (2016):

    Λ(t_m) = γJ + (1/2) λ ∑_{j=1}^J β_{jm}² + α ∑_{j=1}^J |β_{jm}|.    (12)

A different penalized estimation is to implement a shrinkage parameter 0 < ν ≤ 1:

    F_m(x_i) = F_{m−1}(x_i) + ν t_m(x_i),  m = 1, ..., M.    (13)

In summary, we use Algorithm 1 coupled with the boosting algorithm to minimize the following objective function:

    ∑_{i=1}^n ℓ(y_i, f̂_i),    (14)

where f̂_i is given by (5). There are two layers of iterations: the outer layer is the CC iteration and the inner layer is the boosting iterations. An early stop of the boosting iterations does not guarantee convergence. On the other hand, the output f^(K) may overfit the data. In this case, we may consider a two-stage process: in the first stage, apply Algorithm 1 to obtain optimal weights of the observations; in the second stage, use a data-driven method such as cross-validation to select the optimal number of boosting iterations M and the penalty parameters γ (for trees), λ and α. The same strategy can also be applied to the robust parameter σ. However, since this parameter is typically considered a hyperparameter, a more computationally convenient approach in the literature is to conduct estimation for different values of σ and compare the results. One can begin with a large value of σ, giving less robust estimation, and move towards smaller values of σ for more robust results.
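Since ccboost builds on xgboost (the fitted objects below are passed to xgboost::xgb.importance and xgboost::xgb.plot.tree), the quantities in (12) and (13) have direct counterparts among the standard xgboost tuning parameters: γ corresponds to gamma, λ to lambda, α to alpha, and the shrinkage ν to eta. The snippet below is only a reminder of that correspondence in plain xgboost; whether ccboost passes these arguments through is an assumption to verify in its documentation.

## Correspondence between (12)-(13) and xgboost parameters (plain xgboost shown;
## pass-through of these arguments in ccboost is an assumption).
params <- list(
  gamma = 1,      # per-leaf penalty gamma * J in (12)
  lambda = 1,     # L2 penalty (1/2) * lambda * sum(beta^2)
  alpha = 0,      # L1 penalty alpha * sum(|beta|)
  eta = 0.1,      # shrinkage nu in (13)
  max_depth = 2
)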
The source version of the ccboost package is freely available from the Comprehensive R Archive Network (http://CRAN.R-project.org). The reader can install the package directly from the R prompt via

R> install.packages("ccboost")

All analyses presented below are contained in a package vignette. The rendered output of the analyses is available by the R command

R> library("ccboost")
R> vignette("ccbst", package = "ccboost")
To reproduce the analyses, one can invoke the R code

R> edit(vignette("ccbst", package = "ccboost"))
3. Data examples
In this example, we predict the median value of owner-occupied homes in suburbs of Boston, with data publicly available from the UCI machine learning data repository. There are 506 observations and 13 predictors. A different robust estimation of these data can be found in Wang (2020).
R> urlname <- "https://archive.ics.uci.edu/ml/"
R> filename <- "machine-learning-databases/housing/housing.data"
R> dat <- read.table(paste0(urlname, filename), sep = "",
+     header = FALSE)
R> dat <- as.matrix(dat)
R> colnames(dat) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX",
+     "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B",
+     "LSTAT", "MEDV")
R> p <- dim(dat)[2]
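As a quick sanity check (our addition, not part of the original analysis), the data should have 506 rows and 14 columns, the 13 predictors plus the response MEDV:

R> dim(dat)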
We fit a CC-boosting model with concave component bcave and convex component least squares. The observation weights are plotted, and the 4 observations with the smallest weights are labeled; these observations are considered outliers.
R> library("ccboost")R> fit.ls <- ccboost(dat[, -p], dat[, p], cfun = "bcave",+ s = 10, dfun = "gaussian", verbose = 0, max.depth = 2,+ nrounds = 50)R> plot(fit.ls$weight_update)R> id <- sort.list(fit.ls$weight_update)[1:4]R> text(id, fit.ls$weight_update[id] - 0.02, id, col = "red") . . . . . . Index f i t. l s $ w e i gh t _upda t e median housing price vs the predicted values. Not surprisingly, those 4 observations with thesmallest weights have poor predictions. We can view feature importance/influence from thelearned model. The figure shows that the top two factors to predict median housing price hu Wang R> plot(dat[, p], predict(fit.ls, newdata = dat[, -p]))R> text(dat[id, p], predict(fit.ls, newdata = dat[id, -p]) -+ 1, id, col = "red")
[Figure: predicted values against the observed median housing price dat[, p], with the 4 flagged observations labeled.]

We can view feature importance/influence from the learned model. The figure shows that the top two factors to predict the median housing price are the average number of rooms per dwelling (RM) and the percentage of lower status of the population (LSTAT).

R> importance_matrix <- xgboost::xgb.importance(model = fit.ls)
R> xgboost::xgb.plot.importance(importance_matrix = importance_matrix)
[Figure: variable importance for the Boston housing model; RM and LSTAT rank highest.]
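To inspect the ranking numerically rather than graphically (our addition), one can print the importance table returned by xgb.importance:

R> head(importance_matrix)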
R> gr <- xgboost::xgb.plot.tree(model = fit.ls, trees = 0:1)
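If a tabular view of the fitted trees is preferred (our suggestion, not shown in the original analysis), the tree structure can also be extracted as a data table with xgboost:

R> xgboost::xgb.model.dt.tree(model = fit.ls)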
A binary classification problem was proposed by Long and Servedio (2010). The response variable y is randomly chosen to be −1 or +1 with equal probability. We randomly generate symbols A, B and C with probability 0.25, 0.25 and 0.5, respectively. The predictor vector x with 21 elements is generated as follows. If A is obtained, x_j = y, j = 1, ..., 21. If B is generated, x_j = y for j = 1, ..., 11 and x_j = −y for j = 12, ..., 21. If C is generated, x_j = y for j randomly chosen as 5 indices from 1-11 and 6 indices from 12-21; for the remaining j, x_j = −y. We generate training data with n = 400 and test data with n = 200.

We fit a robust logistic boosting model with concave component acave, where the maximum depth of a tree is 5. Other concave components in Table 1 can be applied similarly.

R> set.seed(1947)
R> dat <- dataLS(ntr = 400, nte = 200, percon = 0)
R> fit1 <- ccboost(dat$xtr, dat$ytr, cfun = "acave", s = 3,
+     dfun = "binomial", verbose = 0, max.depth = 5, nrounds = 50)
R> plot(fit1$weight_update)

[Figure: observation weights fit1$weight_update by index.]

We can compute the prediction error of the test data at each boosting iteration. Furthermore, we simulate data with 10% contamination of the response variables, and compute CC-boosting again. In the third robust logistic boosting, we reduce the parameter value σ (s in the ccboost function) for more robust estimation. As a result, some observations would have decreased weights in the model.
R> err1 <- rep(NA, 100)
R> for (i in 1:100) {
+     pred1 <- predict(fit1, newdata = dat$xte, ntreelimit = i)
+     err1[i] <- mean(sign(pred1) != dat$yte)
+ }
R> plot(err1, type = "l")

[Figure: test error err1 by boosting iteration for fit1.]

R> dat2 <- dataLS(ntr = 400, nte = 200, percon = 0.1)
R> fit2 <- ccboost(dat2$xtr, dat2$ytr, cfun = "acave", s = 3,
+     dfun = "binomial", verbose = 0, max.depth = 5, nrounds = 50)
R> plot(fit2$weight_update)

[Figure: observation weights fit2$weight_update by index.]

R> err2 <- rep(NA, 100)
R> for (i in 1:100) {
+     pred2 <- predict(fit2, newdata = dat2$xte, ntreelimit = i)
+     err2[i] <- mean(sign(pred2) != dat2$yte)
+ }
R> plot(err2, type = "l")

[Figure: test error err2 by boosting iteration for fit2.]

R> fit3 <- ccboost(dat2$xtr, dat2$ytr, cfun = "acave", s = 1,
+     dfun = "binomial", verbose = 0, max.depth = 5, nrounds = 50)
R> plot(fit3$weight_update)

[Figure: observation weights fit3$weight_update by index.]
R> err3 <- rep(NA, 100)
R> for (i in 1:100) {
+     pred3 <- predict(fit3, newdata = dat2$xte, ntreelimit = i)
+     err3[i] <- mean(sign(pred3) != dat2$yte)
+ }
R> plot(err3, type = "l")

[Figure: test error err3 by boosting iteration for fit3.]

A survey of 3066 Americans was studied for health care utilization in terms of doctor office visits (Heritier, Cantoni, Copt, and Victoria-Feser 2009). The data contain 24 risk factors. Robust Poisson regression was conducted in Wang (2020). Here a robust Poisson boosting model is fitted with concave component ccave. The observation weights are estimated, and the doctor office visits in two years are highlighted for the 8 smallest weights, ranging from 200 to 750.
R> data(docvisits, package = "mpath")
R> x <- model.matrix(~age + factor(gender) + factor(race) +
+     factor(hispan) + factor(marital) + factor(arthri) +
+     factor(cancer) + factor(hipress) + factor(diabet) +
+     factor(lung) + factor(hearth) + factor(stroke) +
+     factor(psych) + factor(iadla) + factor(adlwa) + edyears +
+     feduc + meduc + log(income + 1) + factor(insur) +
+     0, data = docvisits)
R> fit.pos <- ccboost(x, docvisits$visits, cfun = "ccave",
+     s = 20, dfun = "poisson", verbose = 0, max.depth = 1,
+     nrounds = 50)
R> plot(fit.pos$weight_update)
R> id <- sort.list(fit.pos$weight_update)[1:8]
R> text(id, fit.pos$weight_update[id] - 0.02, docvisits$visits[id],
+     col = "red")

[Figure: observation weights fit.pos$weight_update by index, with the number of doctor office visits labeled for the 8 smallest weights.]

We can view feature importance/influence from the learned model. The figure shows that the top two reasons for doctor office visits are heart disease and psychiatric problems.

R> importance_matrix <- xgboost::xgb.importance(model = fit.pos)
R> xgboost::xgb.plot.importance(importance_matrix = importance_matrix)

[Figure: variable importance for the Poisson boosting model; factor(hearth)1 and factor(psych)1 rank highest.]
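As a small follow-up (our addition), the amount of downweighting can be summarized directly from the returned weights; relatively small values flag the heavily downweighted records:

R> summary(fit.pos$weight_update)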
4. Conclusion
In this article we propose CC-boosting as a unified robust boosting algorithm, and illustrate its applications in regression, classification and Poisson regression. The method can be used for outlier detection and can reduce the impact of outliers. Based on existing weighted boosting software, we can determine variable importance and explore the trees from the boosting algorithm. The R package ccboost is a useful tool in machine learning applications.
References
Bühlmann P, Hothorn T (2007). "Boosting algorithms: Regularization, prediction and model fitting (with discussion)." Statistical Science, 22(4), 477–505.

Chen T, Guestrin C (2016). "XGBoost: A scalable tree boosting system." In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.

Friedman J (2001). "Greedy function approximation: A gradient boosting machine." Annals of Statistics, 29(5), 1189–1232.

Friedman J, Hastie T, Tibshirani R (2000). "Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors)." Annals of Statistics, 28(2), 337–407.

Heritier S, Cantoni E, Copt S, Victoria-Feser MP (2009). Robust Methods in Biostatistics, volume 825. John Wiley & Sons.

Long PM, Servedio RA (2010). "Random classification noise defeats all convex potential boosters." Machine Learning, 78(3), 287–304.

Sigrist F (2020). "Gradient and Newton boosting for classification and regression." arXiv e-prints. https://arxiv.org/abs/1808.03064.

Wang Z (2018a). "Quadratic majorization for nonconvex loss with applications to the boosting algorithm." Journal of Computational and Graphical Statistics, 27(3), 491–502.

Wang Z (2018b). "Robust boosting with truncated loss functions." Electronic Journal of Statistics, 12(1), 599–650. doi:10.1214/18-EJS1404.

Wang Z (2020). "Unified Robust Estimation via the COCO." arXiv e-prints, arXiv:2010.02848. https://arxiv.org/abs/2010.02848.

Affiliation: