A User-Friendly Computational Framework for Robust Structured Regression Using the $L_2$ Criterion
Jocelyn T. Chi and Eric C. Chi∗

Abstract
We introduce a user-friendly computational framework for implementing robust versions of a wide variety of structured regression methods using the $L_2$ criterion. In addition to introducing a scalable algorithm for performing $L_2$E regression, our framework also enables robust regression using the $L_2$ criterion under additional structural constraints, works without requiring complex tuning procedures, can be used to automatically identify heterogeneous subpopulations, and can incorporate readily available non-robust structured regression solvers. We provide convergence guarantees for the framework and demonstrate its flexibility with some examples.

Keywords: block-relaxation, convex optimization, majorization-minimization, regularization

∗The work was supported in part by NSF grants DMS-1760374 and DMS-1752692.

1 Introduction
Linear regression is a classic method that is ubiquitous across numerous domains. Its ability to accurately quantify a linear relationship between a response vector $y \in \mathbb{R}^n$ and a set of predictor variables $X \in \mathbb{R}^{n \times p}$, however, is diminished in the presence of outliers. The $L_2$E method introduced in [27, 28] presents an approach to robust linear regression that optimizes the well-known $L_2$ criterion from nonparametric density estimation in lieu of the maximum likelihood. The usage of the $L_2$E method, however, has been limited by the lack of a simple computational framework. In fact, its previous implementations utilized black-box optimization solvers, and consequently its use has been limited to problems with only a handful of variables. In this paper, we introduce a user-friendly computational framework for robust structured regression using the $L_2$ criterion. Our work offers the following novel contributions.

1) Our framework enables practical implementation of the $L_2$E method for robust multivariate regression from [27, 28] to handle more than a handful of covariates.

2) Our framework extends the $L_2$E method from [27, 28] to a wide variety of robust structured regression methods using the $L_2$ criterion.

3) Our framework can be employed tuning-free since it estimates the regression coefficients and precision parameter simultaneously, as demonstrated in Section 3.

4) Our framework can be employed to instantly obtain robust versions of existing implementations of non-robust structured regression methods in a "plug-and-play" manner, as demonstrated in Section 4.

5) Our framework comes with convergence guarantees (Proposition 2). Our proof (supplement) involves a nonstandard combination of Meyer's convergence theorem, which is traditionally used in the statistics literature to prove the convergence of EM algorithms, and a classic result about the Krasnosel'skiĭ-Mann iterative method for computing fixed points of non-expansive maps.

We describe the motivation for using the $L_2$ criterion to perform robust linear regression in Section 2. We introduce our computational framework with convergence guarantees for performing robust structured regression using the $L_2$ criterion in Section 3. We demonstrate the simplicity and flexibility of our framework by incorporating readily available structured regression solvers to implement robust versions of several MLE-based methods in Section 4. Finally, we provide a brief discussion in Section 5.

2 Robust linear regression with the $L_2$ criterion

The $L_2$ minimization criterion has been used for histogram bandwidth selection as well as for obtaining kernel density estimators [26]. Applying this well-known criterion from nonparametric density estimation to parametric estimation for regression problems enables a trade-off between efficiency and robustness in the estimation procedure. In fact, [6] introduced a family of divergences that includes the $L_2$E as a special case and the MLE as a limiting case. The members of this family of divergences are indexed by a parameter that explicitly trades off efficiency for robustness. While the MLE is the most efficient, it is also the least robust. Meanwhile, the $L_2$E represents a reasonable trade-off between efficiency and robustness.

Some of the example methods we use to demonstrate our framework in Section 4 have robust implementations. These include the well-known robust multiple linear regression [3, 4, 11, 16, 22], robust convex regression [8], robust isotonic regression [2, 18], and robust lasso [1, 10, 34]. The purpose of our experiments is not to compare the $L_2$E to each of these robust methods.
Rather, it is to demonstrate the flexibility of this computational framework and show how it can be used in a plug-and-play manner to obtain robust versions, with convergence guarantees, of existing non-robust implementations.

2.1 The $L_2$ criterion

Let $f$ be the true but unknown density generating the observed data $Y_1, \dots, Y_n \in \mathbb{R}$, and let $\hat f_\theta$ be a probability density function indexed by a parameter $\theta \in \Theta \subset \mathbb{R}^q$ that approximates $f$. If we were to estimate $f$ using the $\hat f_\theta$ that is closest to it, we could minimize the $L_2$ distance between $f$ and $\hat f_\theta$ in lieu of the negative log-likelihood with

$$\min_{\theta \in \Theta} \int \left[ \hat f_\theta(y) - f(y) \right]^2 dy. \tag{1}$$

In practice, however, we do not know $f$, and so identifying $\hat\theta$ in this way is impossible. While we typically cannot minimize the $L_2$ distance between $f$ and its estimate $\hat f_\theta$ directly, we can minimize an unbiased estimate of this distance. To observe this, we first expand the quadratic integrand in (1), rewriting it as

$$\int \hat f_\theta(y)^2 \, dy - 2\int \hat f_\theta(y) f(y) \, dy + \int f(y)^2 \, dy.$$

Notice that the second integral involves the expectation $\mathrm{E}_Y[\hat f_\theta(Y)]$, where $Y$ is a random variable drawn from $f$. Therefore, the sample mean provides an unbiased estimate of this quantity. Meanwhile, the third integral does not depend on $\theta$, so we can exclude it from the minimization. In this way, we arrive at the following fully data-based loss function $H(\theta)$ that provides an unbiased estimate of (1) up to an irrelevant additive constant:

$$H(\theta) = \int \hat f_\theta(y)^2 \, dy - \frac{2}{n} \sum_{i=1}^n \hat f_\theta(y_i). \tag{2}$$

Minimizing this fully observed loss function yields our estimator $\hat\theta$, also called an $L_2$E [27]. We discuss how our computational framework provides intuition for how the $L_2$E imparts robustness in Section 3.2.
2.2 $L_2$E regression

Let $y \in \mathbb{R}^n$ denote a vector of $n$ observed responses and let $X \in \mathbb{R}^{n \times p}$ denote the corresponding observed design matrix of $p$-dimensional covariates. The standard linear model assumes the response and covariates are related via the model $y = X\beta + \tau^{-1}\varepsilon$, where $\beta \in \mathbb{R}^p$ is an unobserved vector of regression coefficients, $\tau \in \mathbb{R}_+$ is an unobserved precision parameter, and the unobserved noise terms $\varepsilon_i \in \mathbb{R}$ for $1 \le i \le n$ are independently and identically distributed standard Gaussian random variables. We phrase the regression model in terms of the precision rather than the variance to obtain a more straightforward optimization problem later.

Let $\theta \equiv (\beta^T, \tau)^T$ denote the vector of unknown parameters. Additionally, let $r$ denote the residual vector obtained from the current estimate of $\beta$, so that its $i$th component is $r_i = y_i - x_i^T\beta$, where $x_i^T \in \mathbb{R}^p$ denotes the $i$th row of $X$. Given any suitable pair of $\beta$ and $\tau$, the conditional density of $y_i$ for $1 \le i \le n$ is

$$\hat f_\theta^{(i)}(y_i) = \frac{\tau}{\sqrt{2\pi}} \exp\left(-\frac{\tau^2 r_i^2}{2}\right).$$

The corresponding $L_2$E loss function is then

$$H^{(i)}(\theta) = \int_{-\infty}^{\infty} \left[\hat f_\theta^{(i)}(y_i)\right]^2 dy_i - 2\hat f_\theta^{(i)}(y_i) = \frac{\tau}{2\sqrt{\pi}} - \tau\sqrt{\frac{2}{\pi}} \exp\left(-\frac{\tau^2 r_i^2}{2}\right).$$

As recommended in [27], when utilizing the $L_2$E loss function for linear regression, we average the $L_2$ distance over the observed data and minimize

$$h(\theta) = \frac{1}{n}\sum_{i=1}^n H^{(i)}(\theta) = \frac{\tau}{2\sqrt{\pi}} - \frac{\tau}{n}\sqrt{\frac{2}{\pi}} \sum_{i=1}^n \exp\left(-\frac{\tau^2 r_i^2}{2}\right). \tag{3}$$

The minimizer $\hat\theta = (\hat\beta^T, \hat\tau)^T$ of (3) contains the $L_2$E regression estimates. This minimization is similar to the exponential squared loss introduced in [33]. While they propose a three-step data-driven procedure for estimating the variance, our framework simultaneously estimates the regression coefficients and precision, removing the need for additional tuning procedures.
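For concreteness, the loss in (3) is simple to evaluate in code. The following minimal R sketch (our own hypothetical naming, not the interface of the L2E package mentioned in Section 4) computes $h(\beta, \tau)$ for a given data set.

```r
# Evaluate the L2E loss h(beta, tau) of Eq. (3) for the Gaussian linear model.
# A minimal sketch; the function name and argument order are our own.
l2e_loss <- function(beta, tau, y, X) {
  r <- drop(y - X %*% beta)                # residuals r_i = y_i - x_i' beta
  tau / (2 * sqrt(pi)) -
    (tau / length(y)) * sqrt(2 / pi) * sum(exp(-tau^2 * r^2 / 2))
}
```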
3 A computational framework for robust structured regression

We present a computational framework for performing robust structured regression using the $L_2$ criterion described in Section 2. We do this by introducing a general algorithm for combining the $L_2$E method [27, 28] with a general structural constraint term. Concretely, we seek a minimizer of the objective function

$$L(\beta, \tau) = h(\beta, \tau) + J(\beta) \tag{4}$$

subject to $\beta \in \mathbb{R}^p$ and $\tau \in [\tau_{\min}, \tau_{\max}]$, where $\tau_{\min} \in \mathbb{R}$ and $\tau_{\max} \in \mathbb{R}$ are minimum and maximum values for $\tau$, respectively.

There are two computational challenges in minimizing (4). The first is that $L$ is non-convex in $\theta$ since $h(\theta)$ is non-convex. The second is that commonly used constraint terms $J(\beta)$ are often non-smooth or non-differentiable. We focus on the case where $J$ is a lower semicontinuous convex function. This condition on $J$ ensures that its proximal mapping is well defined, namely that it always exists and is unique.

In minimizing (4), we utilize the key property that the block derivatives of $h$ with respect to $\beta$ and $\tau$, that is, $\nabla_\beta h(\beta, \tau)$ and $\frac{\partial}{\partial\tau} h(\beta, \tau)$, respectively, are Lipschitz continuous.
Proposition 1. Given the assumptions in Section 2.2, let $\sigma_1(X)$ denote the largest singular value of the design matrix $X$ and let $g(r) = (1 - \tau^2 r^2)\exp(-\tau^2 r^2/2)$. Then the $L_2$E loss function $h(\beta, \tau)$ is block Lipschitz differentiable with respect to $\beta$ and $\tau$: there exist constants $L_\beta$ and $L_\tau$, depending on $\tau$, $n$, $\sigma_1(X)$, and the extreme values of $g$ (explicit expressions appear in the supplement), such that

$$\|\nabla_\beta h(\beta, \tau) - \nabla_\beta h(\tilde\beta, \tau)\|_2 \le L_\beta \|\beta - \tilde\beta\|_2$$

for all $\beta$ and $\tilde\beta$, and

$$\left|\frac{\partial}{\partial\tau} h(\beta, \tau) - \frac{\partial}{\partial\tau} h(\beta, \tilde\tau)\right| \le L_\tau |\tau - \tilde\tau|$$

for all $\tau$ and $\tilde\tau$.

The proof is given in the supplement. The block Lipschitz differentiability of the $L_2$E loss function $h$ and the regularity conditions on $J$ lead us to employ an inexact block coordinate descent algorithm to minimize (4). At a high level, we alternate between inexactly minimizing with respect to $\beta$ holding $\tau$ fixed, and then inexactly minimizing with respect to $\tau$ holding $\beta$ fixed. Therefore, at the $k$th update, we inexactly solve the following two subproblems.

Subproblem 1 (update $\beta$):
$$\beta^{(k)} = \underset{\beta \in \mathbb{R}^p}{\arg\min}\; h(\beta, \tau^{(k-1)}) + J(\beta). \tag{5}$$

Subproblem 2 (update $\tau$):
$$\tau^{(k)} = \underset{\tau \in [\tau_{\min}, \tau_{\max}]}{\arg\min}\; h(\beta^{(k)}, \tau). \tag{6}$$

Our algorithm takes a few proximal gradient descent steps to inexactly solve (5) and (6). Recall that proximal gradient descent is a first order iterative method for solving optimization problems of the form

$$\underset{\theta}{\text{minimize}}\;\; f(\theta) + g(\theta), \tag{7}$$

where $f$ is a Lipschitz differentiable function and $g$ is a convex and lower semicontinuous function. Further recall that the proximal map of $g$ is given by

$$\operatorname{prox}_g(\theta) = \underset{\tilde\theta}{\arg\min}\; \frac{1}{2}\|\tilde\theta - \theta\|_2^2 + g(\tilde\theta).$$

The proximal map exists and is unique whenever $g(\theta)$ is convex and lower semicontinuous. Many regularizers $J(\beta)$ that are useful for recovering models with structure satisfy these conditions and also admit proximal maps that can be evaluated using either an explicit formula or an efficient algorithm. For example, the proximal map of the scaled 1-norm $\lambda\|\cdot\|_1$ is the celebrated element-wise soft-thresholding operator, namely

$$\left[\operatorname{prox}_{\lambda\|\cdot\|_1}(\theta)\right]_i = \operatorname{sign}(\theta_i)\max(|\theta_i| - \lambda, 0).$$

As its name suggests, the proximal gradient descent method for solving problems of the form described in (7) combines a gradient descent step with a proximal step. Given a current iterate $\theta$, the next iterate $\theta^+$ is computed as

$$\theta^+ = \operatorname{prox}_{tg}\left(\theta - t\nabla f(\theta)\right), \tag{8}$$

where $t$ is a positive step-size parameter.

We emphasize that our framework does not require computing the global minimizers in (5) and (6). Nonetheless, in spite of inexactly solving (5) and (6), we will see that the algorithm still comes with convergence guarantees.

3.1 An inexact block coordinate descent algorithm for the $L_2$ criterion

Algorithm 1 presents pseudocode for minimizing (4) using inexact block coordinate descent. When updating $\beta$ in (5) and $\tau$ in (6), we take a fixed number of proximal gradient steps, $N_\beta$ and $N_\tau$ respectively, as in (8). The gradients for updating $\beta$ and $\tau$ are given by

$$\nabla_\beta h(\beta, \tau) = -\frac{\tau^3}{n}\sqrt{\frac{2}{\pi}}\, X^T W r,$$

where $W \in \mathbb{R}^{n \times n}$ is a diagonal matrix that depends on $\beta$ with $i$th diagonal entry

$$w_{ii} = \exp\left(-\frac{\tau^2 r_i^2}{2}\right), \tag{9}$$

and

$$\frac{\partial}{\partial\tau} h(\beta, \tau) = \frac{1}{2\sqrt{\pi}} - \frac{1}{n}\sqrt{\frac{2}{\pi}}\left[\sum_{i=1}^n w_{ii}\left(1 - \tau^2 r_i^2\right)\right],$$

respectively.
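Each of these ingredients is a few lines of R. The sketch below (hypothetical helper names, not the L2E package API) implements the weights in (9), the two block gradients above, and the soft-thresholding proximal map.

```r
# Diagonal entries of the weight matrix W in Eq. (9).
l2e_weights <- function(beta, tau, y, X) {
  r <- drop(y - X %*% beta)
  exp(-tau^2 * r^2 / 2)
}

# Block gradient of h with respect to beta (Section 3.1).
grad_beta <- function(beta, tau, y, X) {
  r <- drop(y - X %*% beta)
  w <- l2e_weights(beta, tau, y, X)
  -(tau^3 / length(y)) * sqrt(2 / pi) * drop(crossprod(X, w * r))
}

# Block partial derivative of h with respect to tau (Section 3.1).
grad_tau <- function(beta, tau, y, X) {
  r <- drop(y - X %*% beta)
  w <- l2e_weights(beta, tau, y, X)
  1 / (2 * sqrt(pi)) -
    (1 / length(y)) * sqrt(2 / pi) * sum(w * (1 - tau^2 * r^2))
}

# Proximal map of lambda * ||.||_1: element-wise soft-thresholding.
soft_threshold <- function(theta, lambda) {
  sign(theta) * pmax(abs(theta) - lambda, 0)
}
```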
Proposition 2 states the convergence property of Algorithm 1 rigorously and also imparts guidance on the choice of the step sizes $t_\beta$ and $t_\tau$. Under appropriate step-sizes $t_\beta$ and $t_\tau$ in (5) and (6), Algorithm 1 produces a monotonically decreasing sequence of objective function values, and its iterates converge to a first order stationary point of the objective function in (4). Recall that a point $\theta = (\beta^T, \tau)^T$ is a first order stationary point of a function $f(\theta)$ if for all directions $v$ the directional derivative $f'(\theta; v)$ of $f$ is nonnegative, namely $f'(\theta; v) \ge 0$. In the update step for $\tau$, the operator $P_{[\tau_{\min}, \tau_{\max}]}$ denotes the projection onto the interval $[\tau_{\min}, \tau_{\max}]$.

Proposition 2 (Convergence of Algorithm 1). Under modest regularity conditions on (4) and step sizes $t_\beta \le L_\beta^{-1}$ and $t_\tau \le L_\tau^{-1}$, where $L_\beta$ and $L_\tau$ are given by Proposition 1, the iterate sequence generated by Algorithm 1 converges to a first order stationary point of (4) for any choice of $N_\beta$ and $N_\tau$.

The proof is given in the supplement.

Algorithm 1: Alternating block coordinate descent for minimizing (4)

    Initialize $\beta^{(0)}$, $\tau^{(0)}$ and fix $N_\beta$, $N_\tau$, $t_\beta$, $t_\tau$; set $k \leftarrow 0$
    repeat
        $\beta \leftarrow \beta^{(k)}$    // Update $\beta$ via (5)
        for $i = 1, \dots, N_\beta$ do
            $\beta \leftarrow \operatorname{prox}_{t_\beta J}\left[\beta - t_\beta \nabla_\beta h(\beta, \tau^{(k)})\right]$
        end for
        $\beta^{(k+1)} \leftarrow \beta$
        $\tau \leftarrow \tau^{(k)}$    // Update $\tau$ via (6)
        for $i = 1, \dots, N_\tau$ do
            $\tau \leftarrow P_{[\tau_{\min}, \tau_{\max}]}\left[\tau - t_\tau \tfrac{\partial}{\partial\tau} h(\beta^{(k+1)}, \tau)\right]$
        end for
        $\tau^{(k+1)} \leftarrow \tau$
        $k \leftarrow k + 1$
    until convergence
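Putting the pieces together, a minimal R sketch of Algorithm 1 might look as follows. It assumes the helper functions sketched above and accepts the proximal operator of $J$ as an argument; the step sizes are fixed inputs that, per Proposition 2, should not exceed $L_\beta^{-1}$ and $L_\tau^{-1}$.

```r
# A minimal sketch of Algorithm 1: inexact block coordinate descent for (4).
# prox(beta, step) evaluates the proximal map of step * J; names are our own.
l2e_bcd <- function(y, X, prox, t_beta, t_tau, N_beta = 5, N_tau = 5,
                    max_iter = 100, tau_min = 1e-4, tau_max = 1e4) {
  beta <- rep(0, ncol(X))
  tau  <- 1 / sd(y)                            # crude initial precision
  for (k in seq_len(max_iter)) {               # 'repeat until convergence'
    for (i in seq_len(N_beta)) {               # inexact beta-update (5)
      beta <- prox(beta - t_beta * grad_beta(beta, tau, y, X), t_beta)
    }
    for (i in seq_len(N_tau)) {                # inexact tau-update (6)
      tau <- tau - t_tau * grad_tau(beta, tau, y, X)
      tau <- min(max(tau, tau_min), tau_max)   # project onto [tau_min, tau_max]
    }
  }
  list(beta = beta, tau = tau)
}
```

A fixed number of outer iterations stands in for a proper convergence check, which would compare successive objective values of (4).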
3.2 How the $L_2$E imparts robustness

We present a simple example illustrating intuition for Algorithm 1. Let the design matrix $X$ be the $n \times n$ identity matrix $I_n$, and let the structural constraint $J(\beta)$ be the indicator function of a closed and convex set $C$. Therefore, $J(\beta) = \iota_C(\beta)$ is zero when $\beta \in C$ and $\infty$ otherwise. This scenario results in simplifications to (5). In particular, we can choose the step-size $t_\beta = \frac{n}{\tau^3}\sqrt{\pi/2}$ so that the update rule for $\beta$ becomes

$$\beta^+ = P_C(z),$$

where $P_C(z)$ is the Euclidean projection of $z = Wy + (I - W)\beta$ onto $C$, and $W$ is the diagonal matrix with diagonal elements defined in (9).

We observe how the $L_2$E imparts robustness through the action of $W$. Consider $z \in \mathbb{R}^n$ as a vector of pseudo-observations, where each element $z_i$ is a convex combination of $y_i$ and the current prediction $\beta_i$. If the current residual is large, $w_{ii}$ is small and the corresponding pseudo-observation $z_i$ resembles the current predicted value $\beta_i$. Meanwhile, if the current residual is small, the corresponding pseudo-observation resembles the observed response $y_i$. Therefore, the pseudo-observations impart the following algorithmic intuition. Given an estimate $\tilde\beta$ of the regression coefficients, the algorithm performs constrained least squares regression using a pseudo-response $z$ whose entries are a convex combination of the entries of the observed response $y$ and the prediction $\tilde\beta$. Observations with large current residuals are essentially replaced by their predicted values.

Thus, the algorithm can fit a fraction of the observations very well while also accounting for outlying observations by replacing them with pseudo-response values that are more consistent with a model that fits the data. Notice that the algorithm is oblivious to whether large residuals come from outliers in the response or in the predictor variables. Consequently, it can handle outliers arising from either source.

4 Robust structured regression using the $L_2$ criterion

We demonstrate how the computational framework presented in Section 3 supports a wide variety of robust structured regression methods using the $L_2$ criterion. Our examples highlight how the framework can incorporate existing non-robust structured regression solvers to automatically "robustify" existing implementations. Additionally, we emphasize that our framework is tuning-free, as it enables simultaneous estimation of the regression coefficients $\beta$ and precision $\tau$.

We refer to the estimates obtained from optimizing the maximum likelihood and the $L_2$ criterion as the MLE and the $L_2$E, respectively. Software for implementing robust structured regression using the $L_2$ criterion is available in the L2E package for R and will be available on the Comprehensive R Archive Network (CRAN).

4.1 Robust multivariate linear regression using the $L_2$ criterion

We first demonstrate the scalability of the framework for multivariate $L_2$E regression [27, 28], where $J(\beta) = 0$. Let $X \in \mathbb{R}^{n \times p}$ with $\operatorname{rank}(X) = p$. Surprisingly, there have been no proposed computational procedures for handling cases beyond
$p > 3$. To illustrate this example, we utilize data from an Italian bank [25]. The response $y \in \mathbb{R}^n$ is the annual investment earnings for each of $n = 1{,}949$
banking customers. Additionally, the design matrix $X$ contains quantitative measurements for each customer on each of $p = 13$ bank services.

Since $J(\beta) = 0$, $\operatorname{prox}_{t_\beta J}$ is simply the identity operation. Therefore, Subproblem 1 for updating $\beta$ in (5) reduces to iteratively performing the following steps: 1) computing the current residuals, 2) updating the weights $w_{ii}$ in (9), and 3) updating $\beta$ using the current residuals and the gradient described in Section 3.1. We sketch this computation in code after Figure 1.

Figures 1a and 1b depict scatter plots of the fitted values against the residuals obtained using the MLE and $L_2$E, respectively. A good fit is evidenced by normally distributed noise in the residuals, namely a symmetric scatter of points about the zero residual level (depicted by the horizontal orange dashed line). Figure 1a, however, shows a discernible pattern in the MLE residuals with an asymmetric scatter of points about the zero residual level. This indicates that additional trends in the data not captured by the Gaussian linear model remain in the residuals and are not captured by the MLE fit.

Meanwhile, Figure 1b shows that after removing the outlying points identified by the automatic tuning of $\tau$ in our computational framework (depicted by the blue triangles), the residuals obtained using the $L_2$E fit are normally distributed about the zero residual level. Thus, the $L_2$E adequately captures the linear relationship between investment earnings and bank services for the non-outlying customers. Notice that $L_2$E regression can be recursively repeated on the outlying customers to identify an appropriate linear relationship between investment earnings and bank services for subgroups among the customers.
Figure 1: Fitted values versus residuals for (a) the MLE and (b) the $L_2$E. In (a), the asymmetric spread of points about the zero residual line suggests additional variation in the data not captured by the MLE linear fit. In (b), blue triangles denote outlying observations identified by the $L_2$E; the remaining observations are well fit by the $L_2$E linear fit, as seen in the normal distribution of points about the zero residual line.
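Since $J(\beta) = 0$ here, the proximal operator is the identity map, and the sketch of Algorithm 1 from Section 3 applies directly. The toy run below uses simulated contaminated data (we do not reproduce the bank data) and illustrative fixed step sizes.

```r
# Robust multivariate L2E regression: J = 0, so the prox is the identity.
set.seed(1)
n <- 200; p <- 13
X <- matrix(rnorm(n * p), n, p)
beta_true <- rnorm(p)
y <- drop(X %*% beta_true) + rnorm(n, sd = 0.5)
y[1:20] <- y[1:20] + 10                    # contaminate 10% of the responses

identity_prox <- function(beta, step) beta
fit <- l2e_bcd(y, X, prox = identity_prox, t_beta = 0.05, t_tau = 0.5)
```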
In addition to illustrating how the $L_2$E presents a more robust linear fit in the presence of outliers, this example also highlights another benefit of our computational framework. Our framework enables the joint estimation of the regression coefficient vector $\beta$ and the precision $\tau$, resulting in automatic identification of outlying observations in the data. This is practically useful since the $L_2$E can simultaneously identify subpopulations within the data and appropriate fits for each of those groups when applied recursively to the subgroups.

4.2 Robust isotonic regression using the $L_2$ criterion

We demonstrate how the computational framework proposed in Section 3 can perform robust isotonic regression using the $L_2$ criterion. Let an observed response $y \in \mathbb{R}^n$ consist of $n$ samples drawn from a monotonic function $f$ sampled at discrete time points $t_1, \dots, t_n$ with additive independent Gaussian noise. We can express the $i$th entry of $y$ as

$$y_i = f(t_i) + \epsilon_i \quad \text{for } 1 \le i \le n,$$

where $f$ is monotonic, $\epsilon_i \overset{\text{iid}}{\sim} N(0, \tau^{-2})$, and $\tau \in \mathbb{R}_+$. The goal of isotonic regression [5, 9, 12, 17, 20] is to estimate $f$ by solving

$$\min_{\beta(t_1), \dots, \beta(t_n)} \sum_{i=1}^n \left[y_i - \beta(t_i)\right]^2 \quad \text{subject to} \quad \beta(t_1) \le \beta(t_2) \le \cdots \le \beta(t_n).$$

We then construct a piece-wise constant estimate for $f$ using the elements of the estimator $\hat\beta = (\hat\beta(t_1), \hat\beta(t_2), \cdots, \hat\beta(t_n))^T$.

For the corresponding $L_2$E problem, the design matrix $X$ is the $n \times n$ identity matrix $I_n$ and $J(\beta) = \iota_M(\beta)$ is the indicator function over the set $M$ of vectors satisfying element-wise monotonicity, so that $\beta_1 \le \beta_2 \le \cdots \le \beta_n$ for $\beta \in M$. Subproblem 1 for updating $\beta$ in (5) reduces to iteratively performing the following steps: 1) computing the current residuals, 2) updating the weights $w_{ii}$ in (9), and 3) updating $\beta$ using the current residuals and the gradient described in Section 3.1 and then projecting onto the set $M$. The gpava function implementing the generalized pool-adjacent-violators algorithm (generalized PAVA) in the isotone package [20] for R performs this last step. Therefore, we harness a readily available non-robust isotonic regression solver inside our computational framework to perform robust isotonic regression using the $L_2$ criterion; a code sketch follows Figure 2 below.

We illustrate with a univariate cubic function. Figure 2a shows how the MLE and $L_2$E produce similar estimates in the absence of outliers. The true underlying cubic fit $f$ is shown in black, and the gray points depict the observations generated from $f$ with additive Gaussian noise. The dashed orange line depicts the MLE obtained using generalized PAVA, while the solid blue line depicts the $L_2$E. Meanwhile, Figure 2b shows how the MLE is skewed towards the outliers while the $L_2$E remains less sensitive to them.
Figure 2: Isotonic regression (a) without and (b) with outliers. The black, orange, and blue lines depict the true function, the MLE, and the $L_2$E fits, respectively. The MLE and $L_2$E produce similar results in the absence of outliers; with outliers, the MLE is skewed towards them while the $L_2$E provides a more robust estimate.
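As noted above, here is a minimal sketch of the isotonic example. For a dependency-free illustration we use stats::isoreg as the projection onto the monotone cone in place of the gpava function; substituting gpava is a one-line change, assuming the isotone package's interface.

```r
# Robust isotonic regression: X = I_n and the prox is Euclidean projection
# onto the monotone cone, i.e., an isotonic regression of the gradient step.
isotonic_prox <- function(beta, step) isoreg(beta)$yf

set.seed(2)
n <- 300
t_grid <- seq(-2, 2, length.out = n)
y <- t_grid^3 + rnorm(n, sd = 1)           # cubic trend plus Gaussian noise
idx <- sample(n, 15)
y[idx] <- y[idx] + 20                      # inject a few gross outliers
fit <- l2e_bcd(y, diag(n), prox = isotonic_prox, t_beta = 0.5, t_tau = 0.5)
```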
Figure 3 depicts results of Monte Carlo simulations comparing the MLE and the $L_2$E while varying the number of outliers. We simulate four datasets, each with $n = 1{,}000$ observations drawn from a univariate cubic function with additive Gaussian noise and an increasing number of outliers, and compute the MLE with the gpava function in the isotone package [20] for R. We repeat the experiments in each scenario 100 times on a 2.67 GHz Intel Core i5 computer with 12 GB of RAM and present boxplots of the mean squared error (MSE).
Figure 3: Monte Carlo experiments for isotonic regression with $n = 1{,}000$ observations drawn from a univariate cubic function with additive Gaussian noise. Boxplots of the mean squared error (MSE) over increasing numbers of outliers from 100 trials are shown, with the MLE in orange and the $L_2$E in blue. The MSE obtained using the $L_2$E grows much more slowly than the MSE obtained with the MLE in the presence of outliers.
The MLE produces increasingly larger MSE as the number of outliers increases. Meanwhile, the $L_2$E produces a much smaller increase in MSE for the same number of outliers. Thus, the $L_2$E can produce an isotonic regression fit that is much less sensitive to outliers than the MLE. This example highlights how our framework can utilize a readily available non-robust isotonic regression solver to automatically perform robust isotonic regression using the $L_2$ criterion.

4.3 Robust convex regression using the $L_2$ criterion

We demonstrate how the computational framework proposed in Section 3 can perform robust convex regression using the $L_2$ criterion. For illustrative purposes, we consider the univariate case [13, 32]; however, applying our framework to multivariate convex regression [7, 14, 15, 19, 21, 23, 29] can be done in a similar manner. Let an observed response $y \in \mathbb{R}^n$ consist of $n$ samples drawn from a convex function $f$ sampled at discrete time points $t_1, \dots, t_n$ with additive independent Gaussian noise. We can express the $i$th entry of $y$ as

$$y_i = f(t_i) + \epsilon_i \quad \text{for } 1 \le i \le n,$$

where $f$ is convex, $\epsilon_i \overset{\text{iid}}{\sim} N(0, \tau^{-2})$, and $\tau \in \mathbb{R}_+$. The goal of shape-restricted convex regression is to estimate $f$ by solving

$$\min_{\beta(t_1), \dots, \beta(t_n)} \sum_{i=1}^n \left[y_i - \beta(t_i)\right]^2 \quad \text{subject to} \quad \beta(t_i) \le \frac{1}{2}\left[\beta(t_{i-1}) + \beta(t_{i+1})\right] \text{ for } 2 \le i \le n-1.$$

Let $\beta \equiv (\beta(t_1), \beta(t_2), \cdots, \beta(t_n))^T$. We can recast this constraint in terms of the second-order differencing matrix $D \in \mathbb{R}^{(n-2) \times n}$ with $D\beta \ge 0$, so that all the elements of $D\beta$ are non-negative. We then construct a piece-wise constant estimate for $f$ using the elements of the estimator $\hat\beta = (\hat\beta(t_1), \hat\beta(t_2), \cdots, \hat\beta(t_n))^T$.

For the corresponding $L_2$E problem, the design matrix $X$ is the $n \times n$ identity matrix $I_n$ and $J(\beta) = \iota_C(\beta)$ is the indicator function over the set of vectors in $C \equiv \{\beta : D\beta \ge 0\}$. Subproblem 1 for updating $\beta$ in (5) reduces to iteratively performing the following steps: 1) computing the current residuals, 2) updating the weights $w_{ii}$ in (9), and 3) updating $\beta$ using the current residuals and the gradient described in Section 3.1 and then projecting onto the convex cone $C$. The conreg function in the cobs package [24] for R can be used to perform this last step. Therefore, we can utilize a readily available non-robust convex regression solver inside our computational framework to perform robust convex regression using the $L_2$ criterion. The short sketch below illustrates the constraint matrix and the delegated projection.
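The second-order differencing matrix is one line of base R, and cone membership $D\beta \ge 0$ is easy to check; the projection step itself can be delegated to a convex regression solver, sketched in the commented call under the assumption that cobs::conreg accepts a convexity flag.

```r
# Second-order differencing matrix D for equally spaced time points:
# row i of D encodes the constraint beta_i - 2 * beta_{i+1} + beta_{i+2} >= 0.
n <- 10
D <- diff(diag(n), differences = 2)        # dimension (n - 2) x n
beta_cvx <- (seq_len(n) - 5)^2             # a convex sequence of values
all(D %*% beta_cvx >= 0)                   # TRUE: beta_cvx lies in the cone C

# Hypothetical projection onto C via a non-robust convex regression solver:
# convex_prox <- function(beta, step)
#   fitted(cobs::conreg(seq_along(beta), beta, convex = TRUE))
```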
Figure 4a shows how the MLE and $L_2$E produce similar fits in the absence of outliers. The true underlying convex fit $f$ is shown in black, and the gray points depict the observations generated from the true fit with additive Gaussian noise. The dashed orange line depicts the MLE obtained using the cobs package in R, while the solid blue line depicts the $L_2$E. Meanwhile, Figure 4b shows how the MLE is substantially skewed towards the outliers while the $L_2$E is less distorted. This example again highlights how the $L_2$E is less sensitive to outliers than the MLE.

Figure 5 depicts the results of Monte Carlo simulations comparing the MLE and $L_2$E on shape-restricted convex regression while varying the number of outliers. We simulate four datasets, each with $n = 1{,}000$ observations drawn from a fourth-order polynomial with additive Gaussian noise and an increasing number of outliers, and compute the MLE with the conreg function in the cobs package for R [24]. We repeat the experiments in each scenario 100 times on a 2.67 GHz Intel Core i5 computer with 12 GB of RAM and present boxplots of the MSE.

Figure 5 highlights how the MLE produces increasingly larger MSE values as the number of outliers increases. Meanwhile, the MSE obtained using the $L_2$E also grows slightly as the number of outliers increases, but it is much less sensitive to outliers.
Figure 4: Convex regression (a) without and (b) with outliers. The black, orange, and blue lines depict the true function, the MLE, and the $L_2$E fits, respectively. The MLE and $L_2$E produce similar results in the absence of outliers; the $L_2$E is much less sensitive to outliers than the MLE.

This example again underscores how the computational framework presented in Section 3 can perform a robust version of a structured regression problem utilizing a readily available non-robust implementation.

4.4 Robust sparse regression using the $L_2$ criterion

We demonstrate how the computational framework proposed in Section 3 can perform robust sparse regression using the $L_2$ criterion.
Figure 5: Monte Carlo experiments for convex regression with $n = 1{,}000$ observations drawn from a convex function with additive Gaussian noise. Boxplots of the mean squared error (MSE) over increasing numbers of outliers from 100 trials are shown, with the MLE in orange and the $L_2$E in blue. Similar to the MSE comparisons for isotonic regression, the MSE for convex regression obtained using the $L_2$E grows much more slowly than the MSE obtained with the MLE in the presence of outliers.

We compare $L_2$E sparse regression with the lasso [31], which solves

$$\min_\beta \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_1 \le t$$

for $t \ge 0$. For the corresponding $L_2$E problem, let $X \in \mathbb{R}^{n \times p}$ with $\operatorname{rank}(X) = p$ and let $J(\beta) = \iota_S(\beta)$ be the indicator function over the set of vectors in $S \equiv \{\beta : \|\beta\|_1 \le t\}$ for
$t \ge 0$. Subproblem 1 for updating $\beta$ in (5) reduces to iteratively performing the following steps: 1) computing the current residuals, 2) updating the weights $w_{ii}$ in (9), and 3) updating $\beta$ using the current residuals and the gradient described in Section 3.1 and then projecting onto the $\ell_1$ ball with radius $t$; a code sketch of this projection appears at the end of this subsection.

We illustrate with data from a prostate cancer dataset [30]. The response $y \in \mathbb{R}^n$ is the percent of Gleason score (a measure of a prostate-specific antigen) for each of $n = 97$ patients who were to receive a radical prostatectomy. The design matrix $X$ contains quantitative measurements for the patients on each of $p = 10$ clinical variables. Figure 6 shows how the lasso and $L_2$E produce similar solution paths in the absence of outliers. After the introduction of outliers in the design matrix, however, the lasso solution paths can be substantially distorted. In this example, the $L_2$E solution paths are less sensitive to outliers.

While there are several robust implementations of the lasso and its variations [1, 10, 34], the purpose of this example is not to compare $L_2$E sparse regression with those implementations. Rather, it is to illustrate how the framework we present can wrap around existing maximum penalized estimators to automatically compute robust versions of those methods.
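The only new ingredient relative to the earlier examples is the proximal map of $\iota_S$, namely Euclidean projection onto the $\ell_1$ ball. A standard sort-and-threshold construction (our own implementation, in the style of Duchi et al.'s projection algorithm) fits in a few lines.

```r
# Euclidean projection onto the l1 ball {beta : ||beta||_1 <= t}.
proj_l1 <- function(v, t) {
  if (sum(abs(v)) <= t) return(v)          # already feasible
  u <- sort(abs(v), decreasing = TRUE)
  cs <- cumsum(u)
  rho <- max(which(u > (cs - t) / seq_along(u)))
  theta <- (cs[rho] - t) / rho
  sign(v) * pmax(abs(v) - theta, 0)        # soft-threshold at level theta
}

# Used as the prox in the beta-update, for example:
# sparse_prox <- function(beta, step) proj_l1(beta, t = 1)
# fit <- l2e_bcd(y, X, prox = sparse_prox, t_beta = 0.01, t_tau = 0.5)
```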
5 Discussion

We introduced a user-friendly computational framework for performing a wide variety of robust structured regression methods using the $L_2$ criterion. We highlight that our framework can "robustify" existing structured regression solvers by utilizing the non-robust solvers in the $\beta$-update step in a plug-and-play manner. Thus, our framework can readily incorporate newer and improved technologies for existing structured regression methods; as faster and better algorithms for these non-robust structured regression solvers appear, users may simply replace the previous solver with the new one in the $\beta$-update step.

We also highlight the significance of the convergence properties of our computational framework. As long as the structural constraints satisfy convexity and lower semicontinuity, the iterates generated by our framework are guaranteed to converge to a first order stationary point. Since many commonly-used structural constraints satisfy these conditions, our framework provides convergence guarantees for robust versions of many non-robust methods with readily available software.

References

[1] Alfons, A., Croux, C., Gelper, S. et al. (2013), "Sparse least trimmed squares regression for analyzing high-dimensional large data sets,"
The Annals of Applied Statistics, 7, 226–248.

[2] Álvarez, E. E., and Yohai, V. J. (2012), "M-estimators for isotonic regression," Journal of Statistical Planning and Inference, 142, 2351–2368.

[3] Andrews, D. F. (1974), "A robust method for multiple linear regression," Technometrics, 16, 523–531.

[4] Audibert, J.-Y., Catoni, O. et al. (2011), "Robust linear least squares regression," The Annals of Statistics, 39, 2766–2794.

[5] Barlow, R. E., and Brunk, H. D. (1972), "The isotonic regression problem and its dual," Journal of the American Statistical Association, 67, 140–147.

[6] Basu, A., Harris, I. R., Hjort, N. L., and Jones, M. C. (1998), "Robust and efficient estimation by minimising a density power divergence," Biometrika, 85, 549–559.

[7] Birke, M., and Dette, H. (2007), "Estimating a convex function in nonparametric regression," Scandinavian Journal of Statistics, 34, 384–404.

[8] Blanchet, J., Glynn, P. W., Yan, J., and Zhou, Z. (2019), "Multivariate distributionally robust convex regression under absolute error loss," in Advances in Neural Information Processing Systems, pp. 11817–11826.

[9] Brunk, H., Barlow, R. E., Bartholomew, D. J., and Bremner, J. M. (1972), "Statistical inference under order restrictions: The theory and application of isotonic regression," Tech. rep., University of Missouri, Columbia, Department of Statistics.

[10] Chang, L., Roberts, S., and Welsh, A. (2018), "Robust lasso regression using Tukey's biweight criterion," Technometrics, 60, 36–47.

[11] Davies, P. L. (1993), "Aspects of robust linear regression," The Annals of Statistics, 1843–1899.

[12] Dykstra, R. L., Robertson, T. et al. (1982), "An algorithm for isotonic regression for two or more independent variables," The Annals of Statistics, 10, 708–716.

[13] Ghosal, P., and Sen, B. (2017), "On univariate convex regression," Sankhya A, 79, 215–253.

[14] Guntuboyina, A., and Sen, B. (2015), "Global risk bounds and adaptation in univariate convex regression," Probability Theory and Related Fields, 163, 379–411.

[15] Hannah, L. A., and Dunson, D. B. (2013), "Multivariate convex regression with adaptive partitioning," The Journal of Machine Learning Research, 14, 3261–3294.

[16] Holland, P. W., and Welsch, R. E. (1977), "Robust regression using iteratively reweighted least-squares," Communications in Statistics - Theory and Methods, 6, 813–827.

[17] Lee, C.-I. C. et al. (1981), "The quadratic loss of isotonic regression under normality," The Annals of Statistics, 9, 686–688.

[18] Lim, C. H. (2018), "An efficient pruning algorithm for robust isotonic regression," in Advances in Neural Information Processing Systems, pp. 219–229.

[19] Lim, E., and Glynn, P. W. (2012), "Consistency of multidimensional convex regression," Operations Research, 60, 196–208.

[20] Mair, P., Hornik, K., and de Leeuw, J. (2009), "Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods," Journal of Statistical Software, 32, 1–24.

[21] Mazumder, R., Choudhury, A., Iyengar, G., and Sen, B. (2019), "A computational framework for multivariate convex regression and its variants," Journal of the American Statistical Association, 114, 318–331.

[22] Meng, X., and Mahoney, M. W. (2013), "Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression," in Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pp. 91–100.

[23] Meyer, M. C. (2003), "A test for linear versus convex regression function using shape-restricted regression," Biometrika, 90, 223–232.

[24] Ng, P., and Maechler, M. (2007), "A fast and efficient implementation of qualitatively constrained quantile smoothing splines," Statistical Modelling, 7, 315–328.

[25] Riani, M., Cerioli, A., Atkinson, A. C., and Perrotta, D. (2014), "Monitoring robust regression," Electronic Journal of Statistics, 8, 646–677.

[26] Scott, D. W. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, Inc.

[27] Scott, D. W. (2001), "Parametric statistical modeling by minimum integrated square error," Technometrics, 43, 274–285.

[28] Scott, D. W. (2009), "The L2E method," Wiley Interdisciplinary Reviews: Computational Statistics, 1, 45–51.

[29] Seijo, E., and Sen, B. (2011), "Nonparametric least squares estimation of a multivariate convex regression function," The Annals of Statistics, 39, 1633–1657.

[30] Stamey, T., Kabalin, J., McNeal, J., Johnstone, I., Freiha, F., Redwine, E., and Yang, N. (1989), "Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate II. Radical prostatectomy treated patients," Journal of Urology, 141, 1076–1083.

[31] Tibshirani, R. (1996), "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society: Series B (Methodological), 58, 267–288.

[32] Wang, J., and Ghosh, S. K. (2012), "Shape restricted nonparametric regression with Bernstein polynomials," Computational Statistics & Data Analysis, 56, 2729–2741.

[33] Wang, X., Jiang, Y., Huang, M., and Zhang, H. (2013), "Robust variable selection with exponential squared loss," Journal of the American Statistical Association, 108, 632–643.

[34] Yang, E., Lozano, A. C., Aravkin, A. et al. (2018), "A general family of trimmed estimators for robust high-dimensional data analysis,"
Electronic Journal of Statistics, 12, 3519–3553.

Figure 6: Lasso and $L_2$E coefficient solution paths against the shrinkage factor $s$ for the prostate cancer data, without and with outliers. Paths are shown for the clinical variables age, gleason, lbph, lcavol, lcp, lweight, pgg45, and svi.