The Backbone Method for Ultra-High Dimensional Sparse Machine Learning
Dimitris Bertsimas · Vassilis Digalakis Jr.
Submitted to Machine Learning: 06/04/2020
Abstract
We present the backbone method, a generic framework that enables sparse and interpretable supervised machine learning methods to scale to ultra-high dimensional problems. We solve, in minutes, sparse regression problems with $p \sim 10^7$ features and decision tree induction problems with $p \sim 10^5$ features. The proposed method operates in two phases; we first determine the backbone set, which consists of potentially relevant features, by solving a number of tractable subproblems; then, we solve a reduced problem, considering only the backbone features. Numerical experiments demonstrate that our method competes with optimal solutions, when exact methods apply, and substantially outperforms baseline heuristics, when exact methods do not scale, both in terms of recovering the true relevant features and in its out-of-sample predictive performance.

Keywords
Ultra-high dimensional machine learning · Sparse machine learning · Mixed integer optimization · Sparse regression · Decision trees
In the Big Data era, the scalability of machine learning models is constantly challenged.
Ultra-high dimensional datasets are present in a variety of applications, ranging from physical sciences and engineering to social sciences and medicine. We consider a dataset to be ultra-high dimensional when the number of features $p$ exceeds the number of data points $n$ and is at least two orders of magnitude greater than the known scalability limit of the learning task at hand; as we discuss in the sequel, this limit is $p \sim 10^5$ for sparse regression and $p \sim 10^3$ for decision trees. A commonly held assumption to address high dimensional regimes is that the underlying model is sparse, that is, among all input features, only $k \ll p$ are relevant for the task at hand. For instance, in the context of regression, the sparsity assumption implies that only a few regressors are set to a nonzero level. In the context of decision trees, it means that most features do not appear in any split node. The sparsity requirement can either be explicitly imposed via a cardinality constraint or penalty, as in sparse regression (also known as best subset selection), or can implicitly be imposed by the structure of the model, as in depth-constrained decision trees, where the number of features that we split on is upper bounded as a function of the tree depth. In general, sparsity is a desirable property for machine learning models, as it makes them more interpretable and often improves their generalization ability (Ng, 1998).

Learning a sparse machine learning model can naturally be formulated as a mixed integer optimization (MIO) problem by associating each feature with an auxiliary binary decision variable that is set to 1 if and only if the feature is identified as relevant (Bertsimas et al., 2016; Bertsimas and Dunn, 2017). Despite the theoretical computational hardness of solving MIO problems, the remarkable progress in the field has motivated their use in a variety of machine learning problems, ranging from sparse regression and decision trees to principal component analysis and matrix completion (Bertsimas and Dunn, 2019). The rich modeling framework of MIO enables us to directly model the problem at hand and often compute provably optimal or near-optimal solutions using either exact methods (e.g., branch and bound) or tailored heuristics (e.g., local search). Moreover, the computational efficiency and scalability of MIO-based methods is far beyond what one would imagine a decade ago. For example, we are able to solve sparse regression problems with $p \sim 10^5$ features (Bertsimas and Van Parys, 2020b) and induce decision trees for problems with $p \sim 10^3$ features. Nevertheless, MIO-based formulations are still challenged in ultra-high dimensional regimes such as the ones we examine.

The standard way of addressing ultra-high dimensional problems is to perform either a screening (i.e., filtering) step (Fan and Lv, 2008), which eliminates a large portion of the features, or more sophisticated dimensionality reduction methods: (a) feature selection methods (Guyon and Elisseeff, 2003), which aim to select a subset of relevant features, or (b) feature extraction methods, which project the original features to a new, lower-dimensional feature space; see Li et al. (2017) for a detailed review.
After the dimensionality reduction step, we solve the MIO formulation for the reduced problem, provided that sufficiently many features have been discarded. However, commonly used screening (or, more generally, feature selection) approaches are heuristic and often lead to suboptimal solutions, whereas feature extraction approaches are uninterpretable in terms of the original problem's features.

In this paper, we introduce the backbone method for sparse supervised learning, a two-phase framework that, as we empirically show, scales to ultra-high dimensions (as defined earlier) and provides significantly higher quality solutions than screening approaches, without sacrificing interpretability (as opposed to feature extraction methods). In the first phase of our proposed method, we aim to identify the "backbone" of the problem, which consists of features that are likely to be part of an optimal solution. We do so by collecting the solutions to a series of subproblems that are more tractable than the original problem and whose solutions are likely to contain relevant features. These subproblems also need to be "selective," in the sense that only the fittest features survive after solving each of them. In the second phase, we solve the MIO formulation considering only the features that have been included in the backbone. Although our framework is generic and can be applied to a large variety of problems, we focus on two particular problems, namely sparse regression and decision tree induction, and illustrate how the backbone method can be applied to them. We accurately solve, in minutes, problems with $p \sim 10^7$ for sparse regression and $p \sim 10^5$ for decision trees, that is, two orders of magnitude larger than the known scalability limit of methods that solve to optimality or near-optimality the actual MIO formulation for each problem.

Our proposed backbone method was inspired by a large-scale vehicle routing application; using a backbone algorithm, Bertsimas et al. (2019b) solve, in seconds, problems with thousands of taxis serving tens of thousands of customers per hour. In this work, we develop the backbone method for sparse supervised learning in full generality. Moreover, we remark that the method can be applied to a variety of machine learning problems that exhibit sparse structure, even beyond supervised learning. For example, in the clustering problem, a big portion of pairs of data points will never be clustered together in any near optimal cluster assignment and hence we need not consider assigning them to the same cluster. Finally, we note that, within the sparse supervised learning framework that we examine in this paper, the backbone method can also be used as a feature selection technique in combination with any sparsity-imposing but not necessarily MIO-based method.

In this section, we briefly review the landscape of the sparse regression problem, which is one of the two problems that we tackle using the backbone method.
Exact Sparse Regression Formulation.
The sparse regression problem with an explicit sparsity constraint, also known as the best subset selection problem, is the most natural way of performing simultaneous feature selection and model fitting for regression. The sparsity constraint is imposed by requiring that the $\ell_0$ (pseudo)norm of the vector of regressors $w \in \mathbb{R}^p$ is less than a predetermined degree of sparsity $k \ll p$, namely, $\|w\|_0 \le k$, where $\|w\|_0 = \sum_{j=1}^p \mathbb{1}(w_j \neq 0)$ and $\mathbb{1}(\cdot)$ denotes the indicator function. Nevertheless, there are some important obstacles related to this approach. Firstly, the $\ell_0$ constraint is very far from being convex. Secondly, the resulting combinatorial optimization problem is NP-hard (Natarajan, 1995).

Motivated by the advances in MIO, Bertsimas et al. (2016) cast best subset selection as a mixed integer quadratic program and solve it exactly to optimality or near-optimality (i.e., within a user-specified optimality gap that can be set to an arbitrarily small value up to machine precision) with a commercial solver, for problems with $p \sim 10^3$. Subsequently, Bertsimas and Van Parys (2020b) reformulate the sparse regression problem as a pure binary optimization problem and use a cutting planes-type algorithm that, based on the outer approximation method of Duran and Grossmann (1986), iteratively tightens a piecewise linear lower approximation of the objective function. By doing so, they solve to optimality sparse regression problems with $p \sim 10^5$. More recently, Hazimeh et al. (2020) develop a specialized, nonlinear branch and bound framework that exploits the structure of the sparse regression problem (e.g., they use tailored heuristics to solve node relaxations) and are able to solve problems with $p \sim 8 \cdot 10^6$. In our work, we use the solution method by Bertsimas and Van Parys (2020b), but we remark that the method by Hazimeh et al. (2020) can also be used in combination with our proposed framework.

From a practical point of view, one argument commonly used against best subset selection is that, in real-world problems, the actual support size $k$ is not known and needs to be thoroughly cross-validated, hence resulting in a dramatic increase in the required computational effort. Although, in many cases, $k$ is determined by the application, Kenney et al. (2018) address such concerns by proposing efficient cross validation strategies. In this paper, we assume that $k$ is known and given; the combination of efficient cross validation procedures with our proposed method is straightforward.

The success of MIO-based methods in sparse regression motivated the development of similar methods in a large variety of sparse machine learning problems, including classification (Bertsimas et al., 2017), polynomial regression (Bertsimas and Van Parys, 2020a), and matrix completion (Bertsimas and Li, 2018). In fact, the cutting planes method used in all the aforementioned approaches was eventually cast into a generic, unified framework that addresses a large variety of MIO problems (Bertsimas et al., 2019a).

Heuristic Solution Methods.
Traditionally, the exact sparse regression formulation has been addressed via heuristic and greedy procedures (dating back to Beale et al. (1967); Efroymson (1966); Hocking and Leslie (1967)). For instance, in forward stepwise selection, we start with the empty model and iteratively add the variable that best improves the fit. Recently, numerous more sophisticated and high quality solution methods have been proposed. Such examples include the first-order method of Bertsimas et al. (2016), the subgradient method of Bertsimas and Van Parys (2020b) and Bertsimas et al. (2020), and the method of Hazimeh and Mazumder (2018) that combines coordinate descent with local search. Another interesting approach to solving the sparse regression formulation is via the alternating direction method of multipliers (ADMM) (Boyd et al., 2011), i.e., an iterative, two-step procedure that, in the first step ($x$-update), solves a regularized least squares problem and, in the second step ($z$-update), simply keeps the $k$ largest magnitude elements of a vector (see Chapter 9 of Boyd et al. (2011) for more details).
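To make the $z$-update above concrete, the following numpy sketch (ours, not taken from the cited works) implements the projection onto the set $\{w : \|w\|_0 \le k\}$ and embeds it in a simple iterative hard thresholding loop for the squared loss; the function names and step-size choice are illustrative assumptions.

    import numpy as np

    def hard_threshold(v, k):
        # Keep the k largest-magnitude entries of v and zero out the rest;
        # this is the projection onto {w : ||w||_0 <= k} (the ADMM z-update).
        z = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        z[idx] = v[idx]
        return z

    def iht(X, y, k, iters=200):
        # Iterative hard thresholding: alternate a gradient step on
        # ||y - Xw||^2 with the projection above.
        n, p = X.shape
        step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L for the quadratic loss
        w = np.zeros(p)
        for _ in range(iters):
            w = hard_threshold(w + step * X.T @ (y - X @ w), k)
        return w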
Surrogate Formulations.

As a result of the diagnosed hardness of the sparse regression problem, a lot of effort has been dedicated to developing and solving surrogate problems. Probably the most studied of them all is the lasso formulation (Tibshirani, 1996), whereby the non-convex $\ell_0$ norm is replaced by the convex $\ell_1$ norm. Namely, denoting $\|w\|_1 = \sum_{j=1}^p |w_j|$, we require that $\|w\|_1 \le t$ or add a penalty term $\lambda \|w\|_1$ to the objective. Due to convexity, the constrained and the penalized versions can be shown to be equivalent for properly selected values of $t$ and $\lambda$; this is not the case if the $\ell_0$ norm is used. Due to the geometry of the unit $\ell_1$ ball, the resulting formulations indeed shrink the coefficients toward zero and produce sparse solutions by setting many coefficients to be exactly zero. There has been a substantial amount of algorithmic work on the lasso formulation (Efron et al., 2004; Beck and Teboulle, 2009; Friedman et al., 2010) and scalable implementations are now available (Friedman et al., 2020). Moreover, we have a decent understanding of the (admittedly strong) assumptions required for the lasso to recover the true sparse support. Reviewing this work is beyond the scope of this paper; we refer the interested reader to, e.g., Hastie et al. (2009) and the references therein.

Several variants of the original lasso formulation have been developed, one of the most notable being the elastic net formulation (Zou and Hastie, 2005), which aims to address the fact that the lasso objective is not always strictly convex and hence, by standard convexity theory, need not have a unique solution. Another line of research has focused on replacing the $\ell_1$ norm in the lasso formulation with other sparsity-inducing penalties that are less sensitive to noise or correlation between features. In particular, non-convex penalties such as the smoothly clipped absolute deviation (Fan and Li, 2001) and the minimax concave penalty (Zhang, 2010) have been proposed; the resulting formulations can effectively be solved via coordinate descent-based algorithms.

Comparisons.
Extensive comparisons between the aforementioned methods can be found in Hastie et al. (2017) and, more recently, in Bertsimas et al. (2020). A summary of the conclusions of Bertsimas et al. (2020) is as follows. In terms of accuracy in recovering the true support, non-convex methods should be preferred over $\ell_1$-regularization, since they provide better feature selection. In terms of false detection rate, cardinality-constrained formulations improve substantially upon the alternatives, even when the true support size $k$ is unknown. These empirical observations are aligned with recent theoretical work in statistics (Gamarnik and Zadik, 2017), which has identified regimes where the lasso fails to recover the true support even though support recovery is possible from an information theoretic point of view. Finally, as far as the effect of noise is concerned, while the lasso performs poorly in low noise settings, it competes with and sometimes dominates other methods as noise increases; this observation supports the view that $\ell_1$-regularization is, first and foremost, a robustness story (Bertsimas and Copenhaver, 2018).
Sparse Regression in Ultra-High Dimensions: Screening.

A regression problem is considered high dimensional when the rank of the design matrix $X$ is smaller than the number of features $p$, i.e., $\mathrm{rank}(X) < p$; this is, for example, the case when $n$, the number of data points in the dataset, satisfies $n < p$. In such regimes, the regression model is challenged, as the underlying linear system is underdetermined, and further assumptions are required. The application of sparse or sparsity-inducing methods, like the ones described in the previous section, provides a way around this limitation and, in addition, hopefully leads to more interpretable models, where only a few features actually affect the prediction. We consider a problem to be ultra-high dimensional when $n \ll p$ and, additionally, $p$ is at least two orders of magnitude greater than the known scalability limit of sparse regression, namely, $p \sim 10^5$.

Sure independence screening (SIS) is a two-phase learning framework, introduced by Fan and Lv (2008), that provides a computationally efficient and theoretically rigorous method to reduce the dimensionality of regression problems from a very large to a relatively large or even moderate scale. More concretely, Fan and Lv (2008) first rank the features based on their marginal utilities; e.g.,
for linear regression, the proposed marginal utility is the correlation between each feature and the response. By keeping only the highest-ranked features, they guarantee that, under certain conditions, all relevant features are selected with probability tending to one. Then, they conduct learning and inference in the reduced feature space consisting only of the selected features.

A major limitation of SIS stems from the fact that it only examines marginal correlations between the features and the response; to overcome this limitation, an iterative SIS procedure has been proposed (Fan and Lv, 2008), whereby the estimated set of relevant features is iteratively updated using SIS, conditional on the estimated set of features from the previous step. Among the various other extensions of SIS that have been considered in the literature, we emphasize the conditional SIS method (Barut et al., 2016), which enables the incorporation of additional knowledge about the importance of a certain set of features, as well as the works of Fan et al. (2009) and Fan and Song (2010) that extend SIS beyond the linear model. Moreover, we remark that SIS was originally designed to perform dimensionality reduction for heuristic feature selection methods, such as the lasso, the Dantzig selector, and SCAD. For a detailed review of SIS-based methods, we refer the interested reader to Fan and Lv (2018).
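As a minimal illustration of the screening phase for linear regression, the numpy sketch below ranks features by their absolute marginal correlation with the response and keeps the top $d$; the function name is ours.

    import numpy as np

    def sis_screen(X, y, d):
        # Rank features by |corr(X_j, y)| and return the indices of the top d.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        scores = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
        return np.argsort(scores)[-d:]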
Sparse Regression in Ultra-High Dimensions: Distributed Approaches.
An alternative path to addressing large scale sparse regression problems is using distributed methods (Bertsekas and Tsitsiklis, 1989). The majority of the distributed statistical estimation literature focuses on subsampling data points (horizontally partitioning the data), i.e., assigning a subset of data points to each computing node and combining the results. Examples of such approaches include the lasso (Mateos et al., 2010; Lee et al., 2017) and empirical risk minimization (Zhang et al., 2013). Nevertheless, in the ultra-high dimensional regime that we consider, subsampling features (vertically partitioning the data) seems more suitable.

The ADMM framework (Boyd et al., 2011) can be used to fit, in a distributed fashion, a regression model on vertically partitioned data, provided that the regularizer is separable at the level of the blocks of features; this is, e.g., the case in the lasso. The DECO framework of Wang et al. (2016), after arbitrarily partitioning the features, performs a decorrelation step via the singular value decomposition of the design matrix; as a result, each subset of data after feature partitioning can produce consistent estimates (despite having missing features within each subset). Each computing node locally fits a lasso model, before the computed coefficients are centrally combined (and possibly refined).

Several hybrid methods that combine screening with ideas from the distributed literature have also been developed. Yang et al. (2016) propose a two-stage framework whereby, in the first stage, each computing node locally screens the features, and, in the second stage, a sparse linear model is centrally fit to a sketched problem (sketching is used to reduce communication cost). Inspired by group testing theory, Zhou et al. (2014) propose a parallel feature selection scheme consisting of a test design component that determines the collection of subsets of features to be tested, a scoring function that assigns a score to each test to measure the relevance of the features in a classification task, and a feature identification algorithm that identifies the final selected feature set from the test scores. Song and Liang (2015) develop a Bayesian feature partitioning scheme.
In this section, we present recent advances in the decision tree induction problem. Decision trees are one of the most popular and interpretable methods in machine learning. At a high level, decision trees recursively partition the feature space into disjoint regions and assign to each resulting partition either a label, in the context of classification, or a constant or linear prediction, in the context of regression. The leading work for decision tree methods in classification is the classification and regression trees framework (CART), proposed by Breiman et al. (1984), which takes a top-down approach to determining the partitions. Briefly, the CART method operates in two phases:
- A top-down induction phase, where, starting from a root node, a split is determined by minimizing an "impurity measure" (e.g., entropy), and the process is then recursively applied to each of the resulting child nodes until no more splits are possible.
- A pruning phase, where the learned tree is pruned to penalize more complex structures that might not generalize well.

The tree induction process of CART has several limitations. First, the process is one-step optimal and not overall optimal. Second, it does not directly optimize over the actual objective (e.g., misclassification error). Third, it only considers univariate splits, i.e., splits that involve a single feature. Despite the aforementioned limitations, CART trees have been widely used as building blocks in ensemble learning methods based on bagging and boosting; examples include the random subspace method (Ho, 1998), random forests (Breiman, 2001), and, more recently, xgboost (Chen and Guestrin, 2016). Indeed, such methods enhance the performance of the resulting classifier, sacrificing, however, the interpretability of the model.

In recent work, Bertsimas and Dunn (2017) and Bertsimas and Dunn (2019) formulate the decision tree induction problem using MIO and propose a tailored coordinate descent-based solution method. The resulting optimal trees framework (OT) overcomes many of the limitations of CART (optimization is over the actual objective, multivariate splits are possible), while still leading to highly interpretable models. The key observation that led to OT is that, when learning a decision tree, there is a number of discrete decisions and outcomes we need to consider:
- Whether to split at any node and which feature to split on.
- Which leaf node a point falls into and (for classification problems) whether this point is correctly classified based on its label.

At each branch node, a split of the form $a^T x < b$ is applied. Points that satisfy this constraint follow the left branch of the tree, whereas those that violate the constraint follow the right branch. Each leaf node is assigned a label, and each point is assigned the label of the leaf node into which the point falls. The resulting MIO formulation aims to make the aforementioned decisions in such a way that the linear combination of an error metric (e.g., misclassification error) and a tree complexity measure is minimized. The formulation also includes a number of constraints ensuring that the resulting tree is indeed valid and consistent.

The resulting MIO formulation is solved using a tailored coordinate descent approach, which starts with an initial tree solution and repeatedly visits the nodes of the tree, one at a time and in a random order. For each branch node, the algorithm considers deleting the split at that node and finding the optimal split to use at that node.
For each leaf node, it considers creating a new split at that node. If any of these changes improves the overall objective value of the tree, then the modification is accepted. The algorithm continues until no possible improvements are found, meaning that the tree is a local minimum. The problem is nonconvex, and therefore this coordinate descent process is repeated from a variety of starting trees that are generated randomly. From this set of trees, the one with the lowest overall objective value is selected as the final solution.
Decision Trees in Ultra-High Dimensions.
Decision trees are known to suffer from the curse of dimensionality and to not perform well in the presence of many irrelevant features (Almuallim and Dietterich, 1994). Among the most notable attempts to scale decision trees to ultra-high dimensional problems is the recent work by Liu and Tsang (2017), who develop a sparse version of the perceptron decision trees framework (Bennett et al., 2000) and solve problems with $p \sim 10^6$ features.

Schneider et al. (1996), Walsh and Slaney (2001), and the many references therein use the term backbone of a discrete optimization problem to refer to a set of frozen decision variables that must be set to one in any optimal or near-optimal solution. Thus, the backbone can be obtained as the intersection of all optimal solutions to the problem. For example, in the satisfiability decision problem, the backbone of a formula is the set of literals which are true in every model. This definition is fundamentally different from our notion of a backbone and, from a practical perspective, identifying such a backbone can be as hard as solving the actual problem. In our approach, the term backbone refers to all variables that are set to one in at least one near-optimal solution. Thus, the backbone can be obtained as the union of all near-optimal solutions. To the best of our knowledge, the notion of backbone that we examine in this paper first appeared in Bertsimas et al. (2019b) in the context of online vehicle routing.
Our key contributions can be summarized as follows:
- We develop the backbone method, a novel, two-phase framework that, as we empirically show, performs highly effective feature selection and enables sparse machine learning models to scale to ultra-high dimensions. As mixed integer optimization offers a natural way to model sparsity, we focus on MIO-based machine learning methods. Nonetheless, our framework is developed in full generality and can be applied to any sparsity-inducing machine learning method.
- We apply the backbone method to the sparse regression problem. Our computational results indicate that the backbone method accurately scales to problems with $p \sim 10^7$ in minutes, and the solutions obtained are of significantly higher quality than those of screening-based methods. In problems with $p \sim 10^5$, where exact methods apply and hence we can compare with near-optimal solutions, the backbone method, by drastically reducing the problem size, achieves, faster and with fewer data points, the same levels of accuracy as exact methods.
- We apply the backbone method to the decision tree induction problem. Our computational results indicate that the backbone method scales to problems with $p \sim 10^5$ in minutes, and outperforms both CART and random forests, while still outputting a single, interpretable tree. In problems with $p \sim 10^3$, the backbone method can accurately filter the feature set and compute, substantially faster, decision trees with better out-of-sample performance than those obtained by applying the optimal trees framework to the entire problem.

The structure of the paper is as follows. In Section 2, we develop the generic framework for the backbone method in a general supervised learning setting and describe a subproblem feature set selection method that could be of independent interest. In Section 3, we apply the backbone method to the sparse regression problem, discuss several implementation aspects and challenges, and empirically evaluate its performance. In Section 4, we apply the backbone method to the decision tree induction problem, again discuss several implementation aspects and challenges, and evaluate the method through computational experiments. Finally, Section 5 concludes the paper.

In this section, we describe the backbone method in a general supervised learning context. The main idea is that many real-world ultra-high dimensional machine learning problems are indeed sparse, which means that most features are uninformative for performing regression or classification and hence need not be considered when solving the actual problem's formulation. Therefore, the core of the two-phase backbone method is the construction of the backbone set, which consists of features identified as potentially relevant. The two phases of the backbone method are:
1. Backbone Construction Phase: Form a series of tractable subproblems that can be solved efficiently while still admitting sparse solutions, by reducing the dimension and/or relaxing the original problem. Construct the backbone set by collecting all decision variables that participate in the solution to at least one subproblem.
2. Reduced Problem Solution Phase: Solve the reduced problem to near-optimality, considering only the decision variables that were included in the backbone set.

The backbone method differs from existing feature selection methods in that we propose to filter uninformative features via a procedure closely related to the actual problem. For example, to perform feature selection for the decision tree induction problem, we train more tractable decision trees in the subproblems and select those features that are identified as relevant within the subproblems' trees. A major difference between the backbone method and heuristic solution methods is that we aim to reduce the problem dimension as much as possible so that the resulting reduced problem can actually be solved to near-optimality. From an optimization perspective, the backbone method can loosely be viewed as a heuristic branch and price approach.
We proceed by applying the backbone method in a general sparse supervised learning setting (Algorithm 1). Let $X \in \mathbb{R}^{n \times p}$ be the input data ($n$ and $p$ denote the number of data points and features, respectively) and $y$ the vector of responses (for regression problems) or class labels (for classification problems). The first two input parameters of the backbone method are the number of subproblems $M$ to solve in each iteration (or hierarchy) and the maximum allowable backbone size $B_{\max}$. We also need to specify the application-specific functions that will be used within our framework. In particular, these include:
- Function construct_subproblems: The function used to construct more tractable subproblems. Since $p$ is assumed to be very large, the construct_subproblems function acts by selecting a subset of features $\mathcal{P}$ for each subproblem. Thus, the output of this function is a collection of features for every subproblem.
- Function fit_subproblem: The model fitting function to be used within each subproblem. Fits a sparse model of choice (regression, trees, etc.) to the data that are given as input to the subproblem. This function has to be highly efficient for the backbone construction to be fast, so we propose that a relaxed/surrogate formulation or heuristic solution method be used. For simplicity of notation, we assume that all input hyperparameters (e.g., for the lasso, the regularization weight/sequence of weights) are "hidden" within the definition of the function. The output of this function is the learned model.
- Function extract_relevant: The function that takes a sparse model as input and extracts all features that the model identifies as relevant.
- Function fit: The target fitting function that fits a sparse model of choice to the data given as input. As this function is to be applied to the (small) backbone set, it need not be extremely efficient, so we propose that a method that comes with optimality guarantees be used. For simplicity of notation, we assume that all input hyperparameters are "hidden" within the definition of the function. The output of this function is the learned model.

Some additional notation used in Algorithm 1 is as follows. The set $\mathcal{U}_t$ consists of all features that are candidates to enter the backbone during iteration (hierarchy) $t$. Initially, all features are considered, so $\mathcal{U}_1 = [p] = \{1, \dots, p\}$. The backbone set in each iteration (hierarchy) is denoted by $\mathcal{B}$. When the data matrix $X$ is indexed by a set $\mathcal{C} \subseteq [p]$, denoted by $X_{\mathcal{C}}$, it is implied that only the features contained in $\mathcal{C}$ have been selected.

Importantly, at the end of the backbone construction phase, the size of the backbone set must be small enough so that we can actually solve the resulting reduced problem. Therefore, we implement the backbone construction phase in Algorithm 1 hierarchically, in a bottom-up divide and conquer fashion, until the size of the backbone set is sufficiently small. Alternatively, we could control the backbone set's size by hard-constraining the number of relevant features that are extracted from each subproblem. More concretely, let $k_{\max}$ be an upper bound on the number of features that might participate in a solution to a subproblem; then, provided that $M k_{\max}$ is sufficiently small, the size of $\mathcal{U}$ reduces by a factor of $1 - \frac{M k_{\max}}{|\mathcal{U}_{t-1}|}$ after each iteration (hierarchy).

Algorithm 1
Generic Hierarchical Backbone Algorithm
Input:
Data $(X, y)$, number of subproblems $M$, subproblem size $p_s$, maximum backbone size $B_{\max}$, subproblem construction function construct_subproblems, subproblem model fitting function fit_subproblem, relevant feature extraction function extract_relevant, model fitting function fit.
Output:
Learned model.

    t ← 1
    U_1 ← [p]
    repeat
        B ← ∅
        (P_m)_{m ∈ [M]} ← construct_subproblems(X_{U_t}, y, M, p_s)
        for m ∈ [M] do
            model_m ← fit_subproblem(X_{P_m}, y)
            B ← B ∪ extract_relevant(model_m)
        end for
        t ← t + 1
        U_t ← B
    until |B| ≤ B_max (or another termination criterion is met)
    model ← fit(X_B, y)
    return model

As Algorithm 1 shows, the backbone method provides a generic framework that can be used to address a large number of different problems. As illustrated by the applications which we investigate in the following sections, depending on the problem at hand, there are several design choices that affect the performance of the method. For example:
- How to form the subproblems, so that they are tractable and, at the same time, useful for the original problem? We address this question in Section 2.2 by developing a generic framework for regression and classification problems.
- How to solve the subproblems and, in particular, what is the impact of applying heuristics within each subproblem on the quality of the backbone set? We examine these questions for the sparse regression and the decision tree induction problems in Sections 3 and 4, respectively.
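For concreteness, a minimal Python rendering of Algorithm 1 might look as follows; the four plug-in functions correspond to the components described above, and all names and conventions (e.g., feature sets represented as integer index arrays) are our own assumptions rather than part of the paper.

    import numpy as np

    def backbone(X, y, M, p_s, B_max,
                 construct_subproblems, fit_subproblem, extract_relevant, fit,
                 max_hierarchies=10):
        # Hierarchical backbone construction, then a final fit on the backbone.
        U = np.arange(X.shape[1])                 # U_1 = [p]: all features are candidates
        for _ in range(max_hierarchies):
            B = set()
            for P in construct_subproblems(X[:, U], y, M, p_s):
                P = np.asarray(P)                 # positions within U
                model = fit_subproblem(X[:, U[P]], y)
                rel = np.asarray(extract_relevant(model), dtype=int)  # positions within P
                B |= set(U[P[rel]].tolist())      # map back to original feature indices
            U = np.array(sorted(B))
            if len(U) <= B_max:                   # backbone small enough: stop early
                break
        return fit(X[:, U], y)                    # solve the reduced problem on the backbone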
Parallel & Distributed Implementation.
An important feature of the backbone method is that it can naturally be executed in parallel by assigning different subproblems to different computing nodes and then centrally collecting all solutions and solving the reduced problem. This is particularly important in distributed applications, where a massive dataset is "vertically partitioned" among different nodes, i.e., each node has access to a subset of the features in the data. This regime, despite being common in practice, has not been considered much in the literature.
In this section, we describe the construct_subproblems component of the backbone method. We first briefly present the sure independence screening (SIS) framework, which plays an important role in our approach, and then fully develop our proposed methodology to construct subproblems.
Sure Independence Screening at a Glance.
SIS (Fan and Lv, 2008) is a two-phase learning framework for regression problems. In the first phase, we rank the features based on their marginal utilities $s_j$ and retain only the top $d$ features, i.e.,
$$\widehat{\mathcal{M}} = \{1 \le j \le p : s_j \text{ is among the } d \text{ largest scores}\}.$$
In practice, we usually choose $d$ using cross validation. In the second phase, we conduct learning and inference in the reduced feature space consisting only of features in $\widehat{\mathcal{M}}$. Under certain conditions, SIS possesses the sure screening property, which requires that all relevant features are contained in the set $\widehat{\mathcal{M}}$ with probability tending to one (Fan and Lv, 2008; Fan et al., 2009; Fan and Song, 2010).

Following Fan and Song (2010), we measure the marginal utility $s_j$ of each feature using the empirical maximum marginal likelihood estimator. More specifically, given a convex loss function $\ell$, we compute, $\forall j \in [p]$,
$$(\hat{w}_{0,j}, \hat{w}_j) = \operatorname*{argmin}_{w_{0,j},\, w_j} \; \frac{1}{n} \sum_{i=1}^n \ell(y_i, w_{0,j} + w_j X_{i,j}). \quad (1)$$
We then estimate the marginal utility of feature $j$ as either the magnitude of the maximum marginal likelihood estimator, $s_j := |\hat{w}_j|$, or the minimum of the loss function, $s_j := \frac{1}{n} \sum_{i=1}^n \ell(y_i, \hat{w}_{0,j} + \hat{w}_j X_{i,j})$. For example, for linear regression problems, by using the squared loss function and the marginal utility measure given by $s_j := |\hat{w}_j|$, $s_j$ simplifies to the absolute marginal empirical correlation between feature $j$ and the response, namely, $s_j := |\widehat{\mathrm{cor}}(X_j, y)|$, which is the standard marginal utility measure for linear regression as per Fan and Lv (2008) (note that $\widehat{\mathrm{cor}}$ represents the empirical correlation). This approach provides a unified scoring framework for both regression and classification problems.
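As an illustration of the marginal utility computation in Equation (1) for binary classification, one can run a univariate, unpenalized logistic fit per feature, as sketched below; the helper name is ours, and an unpenalized scikit-learn fit (available in recent versions via penalty=None) stands in for the maximum marginal likelihood estimator.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def marginal_utilities_logistic(X, y):
        # s_j = |w_j| from a fit of y on (intercept, X_j) alone, per Equation (1)
        # with the logistic loss; normalized so that max_j s_j = 1.
        p = X.shape[1]
        s = np.empty(p)
        for j in range(p):
            clf = LogisticRegression(penalty=None, solver="lbfgs").fit(X[:, [j]], y)
            s[j] = abs(clf.coef_[0, 0])
        return s / s.max()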
The construct_subproblems Function.
For every feature $j$, we compute the SIS-score $s_j := |\hat{w}_j| / \max_i |\hat{w}_i|$ (according to the previous paragraph). We use the squared loss function for regression problems, the logistic loss function for binary classification problems, and the multi-category SVM loss function for multi-class classification problems (as per Fan et al. (2009)). We form the first subproblem's feature set $\mathcal{P}_1$ by selecting the features that correspond to the highest $p_s$ scores $s_{(1)}, \dots, s_{(p_s)}$ (recall that $p_s \ll p$ denotes the cardinality of the subproblem feature sets). We define the frequency of selection $f_j^{(m)}$ of feature $j$ at step $m$ as the fraction of the first $m$ subproblems in which feature $j$ participates. In other words, after the first step, $f_j^{(1)} = 1$ for $j \in \mathcal{P}_1$ and $f_j^{(1)} = 0$ for $j \notin \mathcal{P}_1$.
After the $m$-th step, we rank features using the score $s_j - \lambda_{\text{explore}} f_j^{(m)}$, that is, we penalize features that have been selected earlier. The parameter $\lambda_{\text{explore}} \in [0, 1]$ controls the trade-off between exploration and exploitation when constructing new subproblem feature sets. The method is described in Algorithm 2; we make several remarks, and give a small code sketch, after presenting it.

Algorithm 2
Function construct_subproblems
Input:
Data $(X, y)$, loss function $\ell$, number of subproblems $M$, subproblem size $p_s$, parameter $\lambda_{\text{explore}}$.
Output:
Subproblem feature sets $(\mathcal{P}_m)_{m \in [M]}$.

    for j ∈ [p] do
        (ŵ_{0,j}, ŵ_j) ← argmin_{w_{0,j}, w_j} (1/n) Σ_{i=1}^n ℓ(y_i, w_{0,j} + w_j X_{i,j})
        s_j ← |ŵ_j|
        f_j^{(1)} ← 0
    end for
    for m ∈ [M] do
        P_m ← {1 ≤ j ≤ p : s_j / max_i s_i − λ_explore f_j^{(m)} is among the top p_s largest ones}
        if m < M then
            f_j^{(m+1)} ← (1/m) Σ_{ℓ=1}^m 1(j ∈ P_ℓ)
        end if
    end for
    return (P_m)_{m ∈ [M]}

We make the following remarks concerning Algorithm 2:
- Exploration parameter $\lambda_{\text{explore}}$: Depending on the value of the parameter $\lambda_{\text{explore}}$, we either get to examine more diverse feature sets ($\lambda_{\text{explore}} \to 1$), or focus on highly-ranked features according to the SIS-score ($\lambda_{\text{explore}} \to 0$). The rationale behind our approach is to combine the benefits of SIS, i.e., include within each subproblem features that are more likely to be relevant, with the benefits of ensemble learning, i.e., investigate more features within more diverse subsets.
- SIS-scores: Higher scores correspond to more important features, and we also require that $s_j \in [0, 1]$ and $\max_j s_j = 1$ so that the scores are on the same scale as the frequencies. As the name suggests, we focus on scores that satisfy the sure screening property.
- Tractability: Assuming that we aim to select $k$ relevant features, the subproblems clearly become more tractable, since, even with an exhaustive search procedure, we end up having to check at most $\binom{p_s}{k}$ subsets of features, instead of the initial $\binom{p}{k}$.
- Deterministic vs Random Subproblem Feature Sets: Our deterministic subproblem feature set selection approach, inspired by the deterministic subspace method of Koziarski et al. (2017), aims to overcome the limitations of the popular random subspace method (Ho, 1998) and its variants, whereby, within each subproblem, a feature subset is selected uniformly at random. These limitations include:
  1. First, random feature sampling schemes require solving a prohibitive number of subproblems until every feature is sampled once with high probability. This is shown empirically in Section 3.2; for a discussion on the expected number of subproblem feature sets (groups) that need to be drawn until all features (coupons) are obtained, see Ferrante and Saltalamacchia (2014). One way around this limitation is the use of a random partitioning scheme, whereby the feature space is partitioned into $M = \lceil p / p_s \rceil$ subsets with $p_s$ features each; however, such an approach no longer enjoys the benefits of ensemble learning, as each feature will only be seen once.
  2. Second, the absence of relevant features from a subproblem's feature set makes the subproblem harder to solve. In particular, any relevant feature that is not sampled, and hence not contained within the subproblem's feature set, is essentially viewed as noise within this subproblem. As a result, the subproblem signal-to-noise ratio decreases (compared to the original problem's signal-to-noise ratio) and hence the sample complexity increases. This motivates the use of a more informed subproblem feature set selection scheme.
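A direct numpy transcription of Algorithm 2, assuming the SIS scores s have already been computed and normalized to $[0, 1]$ (our own naming; a sketch, not a reference implementation):

    import numpy as np

    def construct_subproblems(s, M, p_s, lam_explore=0.5):
        # Returns M subproblem feature sets of size p_s each; lam_explore trades
        # off SIS score (exploitation) against diversity (exploration).
        p = len(s)
        counts = np.zeros(p)                 # how many subproblems each feature joined
        subproblems = []
        for m in range(M):
            f = counts / m if m > 0 else counts      # selection frequencies f_j^(m)
            ranking = s - lam_explore * f            # penalize already-selected features
            P = np.argsort(ranking)[-p_s:]           # top p_s features
            subproblems.append(P)
            counts[P] += 1
        return subproblems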
Given data $\{(x_i, y_i)\}_{i \in [n]}$, where, $\forall i \in [n]$, $x_i \in \mathbb{R}^p$ (with $n \ll p$) and $y_i \in \mathbb{R}$ (for regression) or $y_i \in \{-1, 1\}$ (for binary classification), and a convex loss function $\ell$, consider the sparse empirical risk minimization problem with Tikhonov regularization and an explicit sparsity constraint, outlined as
$$\min_{w \in \mathbb{R}^p} \; \sum_{i=1}^n \ell(y_i, w^T x_i) + \frac{1}{2\gamma} \|w\|_2^2 \quad \text{s.t.} \quad \|w\|_0 \le k. \quad (2)$$
Problem (2) can be reformulated as a MIO problem by introducing a binary vector encoding the support of the regressor $w$. The optimal cost for the equivalent formulation is then given by
$$\min_{z \in \{0,1\}^p : \sum_{j=1}^p z_j \le k} \;\; \min_{w \in \mathbb{R}^p} \; \sum_{i=1}^n \ell(y_i, w^T x_i) + \frac{1}{2\gamma} \|w\|_2^2 \quad \text{s.t.} \quad z_j = 0 \Rightarrow w_j = 0 \;\; \forall j \in [p]. \quad (3)$$
As we already noted in the introduction, exact MIO methods that solve Problem (3) scale up to $p \sim 10^5$. We next apply the backbone method to the sparse regression Problem (2).

During the backbone construction phase, we form the $M$ subproblems, whose feature sets are denoted by $\mathcal{P}_m$, $m \in [M]$, using the construct_subproblems function (Algorithm 2). Within the $m$-th subproblem, $m \in [M]$, we may use a subset of data points $\mathcal{N}_m \subseteq [n]$ and different hyperparameter values $k_m$ and $\gamma_m$. We form the backbone set as follows:
$$\mathcal{B} = \bigcup_{m=1}^{M} \bigg\{ j : w_j^{(m)} \neq 0, \;\; w^{(m)} = \operatorname*{argmin}_{\substack{w \in \mathbb{R}^p : \, \|w\|_0 \le k_m, \\ w_j = 0 \;\forall j \notin \mathcal{P}_m}} \; \sum_{i \in \mathcal{N}_m} \ell(y_i, w^T x_i) + \frac{1}{2\gamma_m} \|w\|_2^2 \bigg\}. \quad (4)$$
We note that, instead of solving the sparse regression formulation in each subproblem, as shown in Equation (4), there is the possibility of solving a relaxation or a closely-related surrogate problem, such as the elastic net formulation (Zou and Hastie, 2005). We discuss this possibility in more detail in Section 3.1. Finally, once the backbone construction phase is completed, we solve the original problem considering only the features that are contained in the backbone set (i.e., we compute an exact solution over the subset of backbone features), namely,
$$\min_{w \in \mathbb{R}^p} \; \sum_{i=1}^n \ell(y_i, w^T x_i) + \frac{1}{2\gamma} \|w\|_2^2 \quad \text{s.t.} \quad \|w\|_0 \le k, \; w_j = 0 \;\; \forall j \notin \mathcal{B}. \quad (5)$$

In this section, we discuss some additional implementation aspects of the backbone method for the sparse regression problem.
Solving Subproblems via the Sparse Regression Formulation.
Our first proposed approach to solve the subproblems relies on the actual sparse regression Formulation (2). In our implementation, we use the subgradient method of Bertsimas et al. (2020), which, besides being fast, is especially strong at recovering the true support; it should be clear that any of the exact or heuristic solution methods discussed in Section 1.1 can be used. We empirically found this approach to be the most effective in terms of feature selection, namely, it achieves the lowest false detection rate within the backbone set.

A critical design choice in this approach concerns the number of relevant features that should be extracted from each subproblem. Unless our sampling procedure is perfectly accurate, it is possible that some subproblems will end up containing $k_s < k$ relevant features. If this is the case, we empirically observed that solution methods for Problem (2) quickly recover the $k_s$ truly relevant features but struggle to select the remaining $k - k_s$, i.e., they spend a lot of time trying to decide which irrelevant features to pick. It is therefore crucial to either use cross validation within each subproblem (like the one we describe shortly), or apply an incremental selection procedure (e.g., forward stepwise selection). In our implementation, we use the following incremental cross validation scheme for the subproblem support size, which relies on progressively fitting less sparse models. We start at a small number of relevant features $k_1$, and increment to $k_2 = k_1 + k_{\text{step}}$, $k_3 = k_1 + 2 k_{\text{step}}$, and so forth. We stop after $i$ steps if the improvement in terms of validation error of a model with $k_i + k_{\text{step}}$ and $k_i + 2 k_{\text{step}}$ features is negligible compared to the error of a model with $k_i$ features. Crucially, when training a model with $k_i > k_1$ features during cross validation, we use the best model with $k_i - k_{\text{step}}$ features as a warm start.

Concerning the regularization parameter $\gamma$ in Formulation (2), we apply a simple grid search (with grid length $l$) cross-validation scheme, between a small value $\gamma_1$ (chosen as a function of $p$, $k$, $n$, and $\max_i \|x_i\|$) and $\gamma_l = 1/\sqrt{n}$, which is considered a good choice for most regression instances (Chen et al., 2012).
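The incremental support-size cross validation described above can be sketched as follows; fit_at_k and val_error are stand-ins for the subproblem solver (with warm starts) and a held-out validation error, and the tolerance is our assumption.

    def incremental_k_cv(fit_at_k, val_error, k1, k_step, k_max, tol=1e-3):
        # Grow the support size until two further increments no longer help.
        best, k = fit_at_k(k1, None), k1
        while k + 2 * k_step <= k_max:
            m1 = fit_at_k(k + k_step, best)      # warm-started at the previous model
            m2 = fit_at_k(k + 2 * k_step, m1)
            if min(val_error(m1), val_error(m2)) > val_error(best) - tol:
                break                            # negligible improvement: stop
            best, k = m1, k + k_step
        return best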
Solving Subproblems via Randomized Rounding.

Instead of solving the MIO sparse regression Formulation (3), one can consider its boolean relaxation, whereby the integrality constraints $z_i \in \{0, 1\}$ are relaxed to $z_i \in [0, 1]$. The resulting solution $\hat{z}^{\text{rel}}$ will, in general, not be integral, so we propose to randomly round it by drawing $\hat{z}_i^{\text{bin}} \sim \text{Bernoulli}(\hat{z}_i^{\text{rel}})$. We form the backbone set by repeating the rounding procedure multiple times for each subproblem and collecting all features that had their associated binary decision variable set to 1 in at least one realization of the Bernoulli random vector. By doing so, we reduce the number of subproblems we solve and, instead, perform multiple random roundings per subproblem. As a result, the overall method is sped up, sacrificing, however, its effectiveness in terms of feature selection. For this reason, we did not use this approach in the computational results we present.
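A sketch of the rounding step in numpy (the subproblem's relaxed solution z_rel is assumed given):

    import numpy as np

    def rounded_supports(z_rel, n_rounds=20, seed=0):
        # Union of supports over repeated Bernoulli roundings of z_rel in [0, 1]^p.
        rng = np.random.default_rng(seed)
        selected = set()
        for _ in range(n_rounds):
            z_bin = rng.random(len(z_rel)) < z_rel   # z_j ~ Bernoulli(z_rel_j)
            selected |= set(np.flatnonzero(z_bin).tolist())
        return selected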
Solving Subproblems via Surrogate Formulations.

In our second proposed approach to solve the subproblems, we formulate each subproblem using surrogate formulations, such as the lasso estimator (Tibshirani, 1996) and its extensions. In particular, we utilize the elastic net formulation (Zou and Hastie, 2005) to solve the $m$-th subproblem, $m \in [M]$, namely,
$$w^{(m)} = \operatorname*{argmin}_{w \in \mathbb{R}^p : \, w_j = 0 \;\forall j \notin \mathcal{P}_m} \; \sum_{i \in \mathcal{N}_m \subseteq [n]} \ell(y_i, w^T x_i) + \lambda_m \Big[ \alpha_m \|w\|_1 + (1 - \alpha_m) \|w\|_2^2 \Big]. \quad (6)$$
The popularity of $\ell_1$-regularization is justified by the fact that it enjoys a number of properties that are particularly useful in practical applications. First and foremost, its primary mission is to robustify solutions against noise in the data and, specifically, against feature-wise perturbations. Second, the convexity of the $\ell_1$-norm makes the task of optimizing over it significantly easier. Third, $\ell_1$-regularization provides sparser solutions than $\ell_2$-regularization. We refer the interested reader to Bertsimas and Copenhaver (2018) and Xu et al. (2009) for a detailed discussion on the equivalence of regularization and robustification.

We pick the hyperparameters of the elastic net formulation via cross validation. We select $\alpha$ using grid search in the $[0, 1]$ interval. For each fixed $\alpha$, we perform grid search for $\lambda \in [\lambda_{\min}, \lambda_{\max}]$, where $\lambda_{\max}$ is the value of $\lambda$ that leads to an empty model and $\lambda_{\min}$ is the value of $\lambda$ that leads to a model consisting of $k_{\max}$ nonzero coefficients (we set $k_{\max} = ck$ for a few small values of $c$ in our simulations). Once the best elastic net hyperparameters are found, we refit and add to the backbone set those features whose associated coefficients are above a user-specified threshold.
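In scikit-learn's parameterization (where alpha is the overall weight $\lambda$ and l1_ratio plays the role of $\alpha_m$), the subproblem solve and support extraction could be sketched as below; the grids, the threshold, and the in-sample error are illustrative simplifications of the cross validation scheme just described.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    def enet_subproblem_support(X_sub, y, k_max, threshold=1e-6):
        # Sweep (l1_ratio, lambda) grids, keep fits with at most k_max nonzeros,
        # and return the support of the best-scoring one.
        best_err, best_support = np.inf, np.array([], dtype=int)
        for l1_ratio in (0.25, 0.5, 0.75, 1.0):
            for lam in np.geomspace(1.0, 1e-4, 25):   # strong -> weak regularization
                model = ElasticNet(alpha=lam, l1_ratio=l1_ratio).fit(X_sub, y)
                support = np.flatnonzero(np.abs(model.coef_) > threshold)
                if 0 < len(support) <= k_max:
                    err = np.mean((y - model.predict(X_sub)) ** 2)  # use held-out data in practice
                    if err < best_err:
                        best_err, best_support = err, support
        return best_support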
Controlling the Backbone Size.

Our ultimate goal is to form a backbone set that is small enough for the original problem to be solved to near-optimality extremely fast. Therefore, we impose a limit $B_{\max}$ on the maximum allowable backbone size, which is typically a small multiple of the target support size $k$. The hierarchical backbone construction process guarantees that the backbone size is reduced after each hierarchy, provided that the number of relevant features extracted from each subproblem is sufficiently small. In backbone construction approaches that solve Formulation (2) (either exactly or approximately), the reduction factor can be explicitly controlled. However, when the backbone set is constructed via randomized rounding or via surrogate formulations, we cannot directly control the number of relevant features extracted from each subproblem. To ensure fast (and finite) termination, we also impose a limit $H$ on the number of hierarchical passes the algorithm is allowed to perform. If, after $H$ hierarchies, $|\mathcal{B}| > B_{\max}$, then we use SIS and keep the highest-ranked $B_{\max}$ features in the backbone.

Solving the Reduced Problem.
After having formed the backbone set, we solve Problem (5) using the cutting planes method by Bertsimas and Van Parys (2020b). As Bertsimas et al. (2020) observed, the computational time required to solve the MIO Formulation (3) to provable optimality is highly dependent on $\gamma$; for smaller values of $\gamma$, problems with $p \sim 10^5$ can be solved in minutes, whereas, for larger values, it might take a huge amount of time to prove optimality (although the optimal solution is usually attained fast). To address this issue, we impose a time limit on the cutting planes method for each value of $\gamma$ during the cross validation process. In practice, we observe that the cutting planes method applied to the backbone set recovers the correct support in seconds (provided that the backbone set is sufficiently small).

In this section, we empirically evaluate the backbone method on synthetic linear and logistic regression data where the ground truth is known to be sparse.
Data Generation Methodology.
In all our synthetic experiments throughout this section, we generate data according to the following methodology. We assume that the input data $X = (x_1, \dots, x_n)$ are i.i.d. realizations from a $p$-dimensional zero-mean normal distribution with covariance matrix $\Sigma$, i.e., $x_i \sim \mathcal{N}(0_p, \Sigma)$, $i \in [n]$. The covariance matrix $\Sigma$ is parameterized by the correlation coefficient $\rho \in [0, 1)$ as $\Sigma_{ij} = \rho^{|i-j|}$, $\forall i, j \in [p]$. As $\rho \to 1$, the columns of the data matrix $X$, i.e., the features, become more alike, which should impede the discovery of nonzero components of the true regressor $w_{\text{true}}$ by obfuscating them with highly correlated look-alikes. In our experiments, we focus on high correlation regimes ($\rho = 0.9$).

We next generate the response vector $y$. For linear regression, $y$ satisfies the linear relationship $y = X w_{\text{true}} + \varepsilon$, where $\varepsilon_i$, $i \in [n]$, are i.i.d. noise components from a normal distribution, scaled according to a chosen signal-to-noise ratio $\sqrt{\mathrm{SNR}} = \|X w_{\text{true}}\|_2 / \|\varepsilon\|_2$. Evidently, as the SNR increases, recovery of the unobserved true regressor $w_{\text{true}}$ from the noisy observations can be done with higher precision. In our experiments, we use $\mathrm{SNR} = 2$. The unobserved true regressor $w_{\text{true}}$ is constructed at the beginning of the process and has exactly $k$ nonzero components at indices selected uniformly without replacement from $[p]$. As discussed in Section 1, in our experiments, we use $k \sim 10^2$. Likewise, the nonzero coefficients of $w_{\text{true}}$ are drawn uniformly at random from the set $\{-1, +1\}$. For logistic regression, the signal $y$ is computed according to $y = \mathrm{sign}(X w_{\text{true}} + \varepsilon)$.
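A numpy sketch of this generator (function name ours); the AR(1) recursion below produces exactly the covariance $\Sigma_{ij} = \rho^{|i-j|}$ without ever forming the $p \times p$ matrix.

    import numpy as np

    def generate_data(n, p, k, rho=0.9, snr=2.0, classification=False, seed=0):
        rng = np.random.default_rng(seed)
        # AR(1) features: x_j = rho * x_{j-1} + sqrt(1 - rho^2) * noise, so that
        # each column is standard normal and corr(x_j, x_j') = rho^{|j - j'|}.
        X = np.empty((n, p))
        X[:, 0] = rng.standard_normal(n)
        for j in range(1, p):
            X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(n)
        w_true = np.zeros(p)
        support = rng.choice(p, size=k, replace=False)
        w_true[support] = rng.choice([-1.0, 1.0], size=k)
        signal = X @ w_true
        eps = rng.standard_normal(n)
        eps *= np.linalg.norm(signal) / (np.sqrt(snr) * np.linalg.norm(eps))  # sqrt(SNR) = ||Xw|| / ||eps||
        y = np.sign(signal + eps) if classification else signal + eps
        return X, y, w_true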
Evaluation Metrics & Methods.

We evaluate the quality of each method based on the following metrics:
- Support recovery accuracy: Measures the proportion of the $k$ relevant features that were actually selected by the estimator, expressed as a percentage. Formally, $\frac{|\{j : \hat{w}_j \neq 0, \, w_{\text{true},j} \neq 0\}|}{k}$. With the exception of one experiment, we do not report the false alarm rate (i.e., the ratio of the number of irrelevant features selected over the total number of features selected), since we assume that the number of relevant features $k$ is known and hence the two metrics are complementary.
- Prediction accuracy: Evaluates the out-of-sample performance of the estimator. We use the $R^2$ statistic for linear regression and the area under the curve (AUC) for logistic regression.
- Optimality gap: Measures the gap between the lower and upper objective bounds during the solution process of a MIO problem and, specifically, of Problem (3).
- Computational time: Total amount of time used (in seconds).

Additionally, to assess the quality of each component of the backbone method, we also record the following statistics:
- Backbone accuracy: Measures the proportion of the $k$ relevant features that were included in the backbone set. Formally, $\frac{|\{j : j \in \mathcal{B}, \, w_{\text{true},j} \neq 0\}|}{k}$.
- Backbone size: Number of features included in the backbone set.
- Backbone construction time: Time required for the first phase of the backbone method (in seconds).
- Subproblem feature sets accuracy: Measures the proportion of the $k$ relevant features that were included in at least one subproblem's feature set. Formally, $\frac{|\{j : j \in \bigcup_{m=1}^M \mathcal{P}_m, \, w_{\text{true},j} \neq 0\}|}{k}$.
- Subproblem feature sets size: Number of distinct features that were included in at least one subproblem.

Before proceeding to the results, we make the following remarks concerning the methods we investigate in our experiments:
- SR-MIO: Implementation of the cutting planes method by Bertsimas and Van Parys (2020b) using the commercial MIO solver Gurobi (Gurobi Optimization Inc., 2016). Uses the subgradient method of Bertsimas et al. (2020) to compute a warm start. Solves the sparse regression formulation to optimality or near-optimality.
- SR-REL: Implementation of the subgradient method by Bertsimas et al. (2020) using the commercial Interpretable AI software package (Interpretable AI, 2020). Solves the sparse regression problem heuristically by considering its boolean relaxation and iteratively alternating between a subgradient ascent step and a projection step.
- ENET: glmnet Fortran implementation (Friedman et al., 2020). Solves the elastic net formulation using cyclic coordinate descent.
- SIS: Implementation of the sure independence screening (Fan and Lv, 2008) feature selection heuristic, followed by any sparse regression solution method on the reduced feature set.
- BB: Implementation of the backbone method. The components and parameterization of the method are discussed in each experiment separately.

All experimental results were obtained over $r$ independently generated datasets ($r = 5$ for experiments in which we investigate the scalability of the method as a function of $p$ and hence deal with very large problems, and $r = 10$ for all other experiments). In each experiment, we report both the mean and standard deviation of each metric.
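For reference, the two support-based metrics defined earlier can be computed with a few lines of numpy (names ours):

    import numpy as np

    def support_recovery_accuracy(w_hat, w_true):
        # Fraction of the k true features recovered by the estimator.
        true_support = np.flatnonzero(w_true)
        return np.isin(true_support, np.flatnonzero(w_hat)).mean()

    def backbone_accuracy(backbone, w_true):
        # Fraction of the k true features contained in the backbone set.
        true_support = np.flatnonzero(w_true)
        return np.isin(true_support, np.asarray(list(backbone))).mean()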
Scalability of the Backbone Method for Sparse Linear Regression.

In our first experiment, we show that BB scales to ultra-high dimensional problems with $p \in \{10^6, 3 \cdot 10^6, 10^7, 3 \cdot 10^7\}$ in minutes, and substantially outperforms the SIS heuristic. For this experiment, we generate data with $(n, k, \mathrm{SNR}, \rho) = (5000, 100, 2.0, 0.9)$ and vary the data dimension $p$. We set the parameters of BB as follows. We construct $M = 5$ subproblems, each of size $p_s$ features, and use a fixed value of $\lambda_{\text{explore}}$ within the construct_subproblems function. We solve the subproblems with SR-REL and use cross validation (over a grid of four values up to $k$) to extract $k_s$ (potentially) relevant features from each subproblem. After solving the subproblems, the backbone set consists of at most $5k = 500$ features, so we choose not to constrain the maximum backbone size and the number of hierarchies. For the reduced problem solution phase, we use SR-MIO, select $\gamma$ via cross validation (from a grid of 5 values chosen as per Section 3.1), and impose a time limit for each validation run. As a baseline, we select via SIS the top $|\mathcal{B}|$ features (recall that $\mathcal{B}$ denotes the backbone set) and then run SR-MIO (tuned in the exact same way) on the selected feature set.

As can be observed in Figure 1, BB achieves near-perfect accuracy for problems with up to 30 million features and substantially outperforms SIS, in terms of both support recovery accuracy and out-of-sample predictive performance. As far as the computational time is concerned, we observe that the overhead added due to the backbone construction phase is by no means prohibitive; running
In Figure 2, we illustrate how the size and accuracy of the subproblem feature sets and the backbone set change with p. The combination of the construct_subproblems function, which enables us to filter more than 99.8% of the features without missing any relevant ones, with the selectivity of the fit_subproblem component of BB, prevents the backbone size from exploding.

Comparison with Sparse Linear Regression Applied to the Entire Feature Set.
In our second experiment, we verify that BB indeed performs effective feature selection and enables us to solve the reduced problem's sparse regression MIO formulation to near-optimality substantially faster. We compare BB with the exact cutting planes method SR-MIO applied to the entire feature set. As a baseline, we select with
SIS the top |B| features and then run
SR-MIO on the selected feature set. We generate data with (p, k, SNR, ρ) = (10^5, 50, 2.0, 0.9), so that the problem is within the reach of SR-MIO. We set the algorithms' parameters as follows. We impose a common time limit on all methods. For SR-MIO, this includes the time to compute the warm start via
SR-REL and the time dedicated to the MIO solver. For
SIS, this additionally includes the time to perform feature selection. For BB, this includes both phases of the backbone method. We fix γ = √n so that all methods solve the same problem; as discussed in Bertsimas et al. (2020), the regularization hyperparameter drastically affects the solution time and hence using different values in different methods would lead to an unfair comparison. Concerning the parameters of BB, we construct M = 3 subproblems, each of size p_s = 5000 features, using a fixed value of λ_explore within the construct_subproblems function. We solve the subproblems using SR-REL and extract k (potentially) relevant features from each. Notice that, after solving the subproblems, the backbone set consists of at most Mk = 150 features, so we need not constrain the maximum backbone size and the number of hierarchies.
Fig. 1: Scalability of the backbone method for sparse linear regression. Setting: (ρ, k, SNR, n) = (0.9, 100, 2.0, 5000). Panels: (a) support recovery accuracy; (b) out-of-sample R²; (c) computational time. Methods: SIS followed by sparse regression; backbone.

Figure 3 reports the support recovery accuracy, out-of-sample R², and MIO optimality gap of each method as a function of the sample size n. BB accurately recovers the true support with fewer samples and achieves superior out-of-sample predictive performance compared to SR-MIO, while
SIS fails to recover about one third of the true support. Additionally, while the solution returned by
SR-MIO comes with no optimality guarantee, the optimality gap for BB quickly drops as the number of samples increases, albeit for the reduced problem.

Comparison with Surrogate Formulations Applied to the Entire Feature Set.
In our third experiment, we demonstrate that BB computes, in comparable times, sparser and higher quality solutions than ENET. We compare BB with ENET applied to both the entire feature set and the subset of the top n − 1 ranked features (as per Fan and Lv (2008)) based on their SIS score. We generate data with (p, k, SNR, ρ) = (5 · 10^5, 50, 2.0, 0.9), so that the problem is within the reach of ENET. We set the algorithms' parameters as follows. For BB, we construct M = 5 subproblems, each of size p_s = 5000 features, using a fixed value of λ_explore within the construct_subproblems function. We solve the subproblems using SR-REL, fix γ = √n within each subproblem, and extract k (potentially) relevant features from each. Notice that, after solving the subproblems, the backbone set consists of at most Mk = 250 features, so we need not constrain the maximum backbone size and the number of hierarchies. We solve the reduced problem again using SR-REL,
Fig. 2: Scalability of the backbone method for sparse linear regression: a closer look. Setting: (ρ, k, SNR, n) = (0.9, 100, 2.0, 5000). Panels: (a) subproblem feature sets size; (b) subproblem feature sets accuracy; (c) backbone size; (d) backbone accuracy.

but now use cross validation to select both γ (from a grid of 3 values chosen as per Section 3.1) and k (using the incremental cross validation strategy of Section 3.1). For ENET, following the cross validation strategy described in Section 3.1, we select α from a grid of 5 values but do not impose any restriction on how many nonzero components the resulting model should have. For SIS, we select the top n − 1 features (as per Fan and Lv (2008)) and apply ENET on this set of features (tuned in the exact same way as previously).
Figure 4 reports the support recovery accuracy, support recovery false alarm rate, out-of-sample R², computational time, and backbone size/number of nonzero components in elastic net's solution, as functions of the sample size n. Notice that this is the only experiment where we cross validate k for sparse regression, instead of assuming it is known, and hence the support recovery false alarm rate is not complementary to the support recovery accuracy. All methods successfully recover the true support, provided that enough samples are available. However, the false alarm rate for BB quickly drops towards zero, contrary to ENET and
SIS, whose solutions include a large number of irrelevant features in addition to the relevant ones. As a result of the above observation, the out-of-sample predictive performance of BB is superior. Moreover, the size of the backbone set is notably smaller than the number of nonzero components in elastic net's solution, which indicates that the feature selection effectiveness of the first phase of BB is superior. Finally, in terms of computational time, provided that the required number of samples is available, BB solves the problem very fast, contrary to ENET applied to the entire problem, in which the computational time increases linearly with the number of samples.
Fig. 3: Effectiveness of feature selection and impact on MIO solver for sparse linear regression. Setting: (ρ, k, SNR, p) = (0.9, 50, 2.0, 100000). Panels: (a) support recovery accuracy; (b) out-of-sample R²; (c) MIO optimality gap; (d) computational time (sec). Methods: SIS followed by sparse regression; backbone; sparse regression (on entire feature set).
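The linear scaling of the elastic net's cost in n can be checked with a small timing script; the sketch below is our own, with dimensions scaled down for illustration, and is not part of the paper's experimental setup.

```python
import time
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
p, k = 2000, 20
w = np.zeros(p)
w[:k] = 1.0
for n in (1000, 2000, 4000, 8000):
    X = rng.standard_normal((n, p))
    y = X @ w + rng.standard_normal(n)
    t0 = time.perf_counter()
    ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=2000).fit(X, y)
    print(n, "samples:", round(time.perf_counter() - t0, 3), "sec")
```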
Constructing Subproblems: Proposed vs Random Approach.
In our next experiment, we compare the proposed construct_subproblems function with the more commonly used random subspace method for constructing subproblems. We examine two variants of the random subspace method. In the first variant, we sample the feature set of each subproblem uniformly at random (without replacement within each subproblem, but with replacement across subproblems). In the second variant, we sample each feature with probability proportional to its SIS score (see Section 2.2); a sketch of both sampling schemes follows below. For the proposed construct_subproblems function, we use a fixed value of λ_explore. We generate data with (n, p, k, SNR, ρ) = (5000, 10^6, 100, 2.0, 0.9) and set the parameters of BB as follows. We construct M subproblems, varying M over a grid of six values, each of a fixed size p_s. We solve the subproblems using SR-REL and use cross validation to extract k_s (potentially) relevant features from each. As a baseline, we now select via SIS the top d features, where d is the rank of the relevant feature with the lowest SIS score (thus SIS achieves by design perfect backbone accuracy).
In Figure 5, we show the size and accuracy of the subproblem feature sets and the backbone set as functions of the number of subproblems M. The construct_subproblems function (a) filters the feature set much more aggressively when constructing the subproblems, (b) without missing any relevant feature. As a result, most relevant features are included in the feature sets of multiple subproblems each, and hence their chances of being selected by the fit_subproblem function are increased.
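For reference, both random-subspace variants can be sketched as follows. This is our own illustration: the marginal-correlation score shown is one standard way to compute SIS scores, and all names are hypothetical.

```python
import numpy as np

def sis_scores(X, y):
    """Absolute marginal correlation of each standardized feature with y."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    yc = (y - y.mean()) / (y.std() + 1e-12)
    return np.abs(Xc.T @ yc) / len(y)

def sample_subspace(X, y, p_s, proportional=True, seed=None):
    """Draw one subproblem feature set, uniformly or SIS-proportionally."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    if proportional:
        scores = sis_scores(X, y)
        return rng.choice(p, size=p_s, replace=False, p=scores / scores.sum())
    return rng.choice(p, size=p_s, replace=False)
```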
Fig. 4: Comparison with surrogate formulations applied to the entire feature set. Setting: (ρ, k, SNR, p) = (0.9, 50, 2.0, 500000). Panels: (a) support recovery accuracy; (b) support recovery false alarm rate; (c) out-of-sample R²; (d) backbone size / number of nonzero components in elastic net's solution; (e) computational time. Methods: SIS followed by elastic net; backbone; elastic net (on entire feature set).

The non-uniformly random subspace method requires forming several times more subproblems to include all relevant features in at least one feature set, whereas the number of subproblems required by the uniformly random subspace method is larger by orders of magnitude. Nevertheless, we remark that, since the number of relevant features contained (and hence selected via cross validation) within each feature set of the uniformly random subspace method is much smaller, the fit_subproblem method runs notably faster. This indicates that the random subspace method is not entirely impractical. Finally, we note that all variants of BB clearly select features more effectively than SIS.
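A back-of-envelope calculation (ours, not from the paper) explains why uniform sampling needs so many more subproblems: treating features as sampled independently across subproblems, a given relevant feature appears in at least one of M subproblems of size p_s with probability 1 − (1 − p_s/p)^M, so covering all k relevant features requires M to grow roughly like (p/p_s) log k.

```python
def min_subproblems_for_coverage(p, p_s, k, target=0.95):
    """Smallest M such that P(every relevant feature is sampled at least
    once) >= target, treating features independently (an approximation)."""
    M = 1
    while (1 - (1 - p_s / p) ** M) ** k < target:
        M += 1
    return M

# With p = 10**6, p_s = 10**4 and k = 100, this returns M = 754.
print(min_subproblems_for_coverage(p=10**6, p_s=10**4, k=100))
```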
Fig. 5: Constructing subproblems: proposed vs random approach. Setting: (p, SNR, n, k, ρ) = (1000000, 2.0, 5000, 100, 0.9). Panels: (a) subproblem feature sets size; (b) subproblem feature sets accuracy. Methods: backbone; backbone (non-uniformly random subspaces); backbone (random subspaces).
Solving Subproblems: Sparse Regression vs Elastic Net.
In this experiment, we compare the subproblem solution methods that we discussed. We generate data with (n, p, k, SNR, ρ) = (5000, 10^6, 100, 2.0, 0.9) and set the parameters of BB as follows. We construct M subproblems, varying M over a range of values, each of a fixed size p_s, and use a fixed value of λ_explore for the construct_subproblems function. In our first approach, presented in Section 3.1, we solve the sparse regression Formulation (3) within each subproblem using the SR-REL method; we fix the regularization hyperparameter γ = √n and use cross validation to select the number k_s of (potentially) relevant features extracted from each subproblem, from a grid of four values ranging from a fraction of k up to k. In our second approach, presented in Section 3.1, we solve the elastic net Formulation (6) within each subproblem using the ENET method; following the cross validation strategy described in Section 3.1, we select α from a grid of 4 values and only examine values of λ that lead to models with at most k_max nonzero components. As a baseline, we again select via SIS the top d features, where d is the rank of the relevant feature with the lowest SIS score (thus SIS achieves by design perfect backbone accuracy).
As Figure 6 reports, both approaches perform substantially more effective feature selection than
SIS, choosing, in some instances, one order of magnitude fewer features to identify almost all relevant ones. The
SR-REL-based approach can be seen to require fewer subproblems to recover the full true support, whereas the
ENET-based approach is faster, more robust, and consistent with respect to the features selected in each subproblem. This should not come as a surprise; any relevant features that are not sampled in a subproblem's feature set are essentially viewed as noise and hence the subproblem's SNR decreases.
ENET is known to be more robust to noise and hence is indeed expected to perform better in low-SNR regimes. Another important observation concerns the impact of artificially limiting the number of relevant features to be extracted from each subproblem in the
ENET-based approach; when the k_max parameter was set free, we empirically found that the approach led to denser models within each subproblem and hence resulted in larger backbone sets. The aforementioned observations agree with our discussion in Section 3.1 and our expectation that the ENET-based approach is the less selective but more robust one.
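A minimal sketch of this capped elastic net selection, assuming scikit-learn's ElasticNet as the solver (its `alpha` plays the role of λ here) and a hypothetical `k_max` cap; this is our own illustration, not the paper's tuned procedure.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def enet_select(X, y, k_max, alphas=np.logspace(0, -3, 20), l1_ratio=0.5):
    """Support of the densest model on the path with at most k_max nonzeros."""
    support = np.array([], dtype=int)
    for a in alphas:  # walk from strong to weak regularization
        coef = ElasticNet(alpha=a, l1_ratio=l1_ratio, max_iter=5000).fit(X, y).coef_
        nnz = np.flatnonzero(coef)
        if len(nnz) > k_max:
            break  # the path became denser than the cap; stop here
        support = nnz
    return support
```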
The Backbone Method for Sparse Logistic Regression.
In this set of experiments, we evaluate the backbone method for logistic regression.
Fig. 6: Solving subproblems: sparse regression vs elastic net. Setting: (p, SNR, n, k, ρ) = (1000000, 2.0, 5000, 100, 0.9). Panels: (a) backbone size; (b) backbone accuracy; (c) backbone construction time. Methods: SIS followed by sparse regression; backbone (elastic net in subproblems); backbone (sparse regression in subproblems).

First, we study the scalability of BB for sparse logistic regression problems. We generate data with (n, k, SNR, ρ) = (5000, 100, 2.0, 0.9) and vary the data dimension p ∈ {10^6, 3 · 10^6, 10^7, 3 · 10^7}. We set the parameters of BB as follows. We construct M = 10 subproblems, each of size p_s = 5000 features, using a fixed value of λ_explore within the construct_subproblems function. We solve the subproblems with SR-REL and use cross validation both to select γ_s from a grid of 5 values and to select the number k_s of (potentially) relevant features extracted from each subproblem, from a grid of four values ranging from a fraction of k up to k. After solving the subproblems, the backbone set consists of at most Mk = 1000 features, so we do not constrain the maximum backbone size and the number of hierarchies. For the reduced problem solution phase, we again use SR-REL and select γ via cross validation from a grid of 5 values. As a baseline, we select via SIS the top |B| features (recall that B denotes the backbone set) and then run SR-REL (tuned in the exact same way) on the reduced feature set.
As can be observed in Figure 7, BB scales to ultra-high dimensional problems with up to 30 million features in less than half an hour and substantially outperforms the SIS heuristic, in terms of both support recovery accuracy and out-of-sample predictive performance. In Figure 8, we illustrate how the size and accuracy of the subproblem feature sets and the backbone set change with p. The results are indicative of the effectiveness of both steps of the backbone construction phase of BB.
In our second experiment on logistic regression, we verify that BB indeed performs effective feature selection, making the job of the MIO solver easier. Similarly to what we did for linear regression, we again compare BB with the exact cutting planes method SR-MIO applied to the entire feature set.
Fig. 7: Scalability of the backbone method for sparse logistic regression. Setting: (ρ, k, SNR, n) = (0.9, 100, 2.0, 5000). Panels: (a) support recovery accuracy; (b) out-of-sample AUC; (c) computational time. Methods: SIS followed by sparse regression; backbone.

As a baseline, we select with SIS the top |B| features and then run SR-MIO on the reduced feature set. We generate data with (p, k, SNR, ρ) = (10^5, 50, 2.0, 0.9), so that the problem is within the reach of SR-MIO for logistic regression. We set the algorithms' parameters as follows. We impose a common time limit on all methods. For SR-MIO, this includes the time to compute the warm start via
SR-REL and the time dedicated to the MIO solver. For
SIS, this additionally includes the time to perform feature selection. For BB, this includes both phases of the backbone method. As logistic regression is more sensitive to the regularization hyperparameter, we now select γ via cross validation from a grid of 5 values. Concerning the parameters of BB, we now construct M = 5 subproblems, each of size p_s = 5000 features, using a fixed value of λ_explore within the construct_subproblems function. We solve the subproblems using SR-REL, select γ_s via cross validation from a grid of 5 values, and extract k (potentially) relevant features from each. Notice that, after solving the subproblems, the backbone set consists of at most Mk = 250 features, so we again need not constrain the maximum backbone size and the number of hierarchies.
Figure 9 reports the support recovery accuracy, out-of-sample AUC, and total computational time of each method as a function of the sample size n. Notice that, since each method may now use a different value for the regularization hyperparameter, it is meaningless to look into the optimality gap; we instead examine the total computational time as a proxy for the impact of the feature selection process on the MIO solver. Even though none of the methods fully recovers the true support, BB and SR-MIO significantly outperform
SIS in terms of both the fraction of relevant features found and their out-of-sample predictive performance.
Fig. 8: Scalability of the backbone method for sparse logistic regression: a closer look. Setting: (ρ, k, SNR, n) = (0.9, 100, 2.0, 5000). Panels: (a) subproblem feature sets size; (b) subproblem feature sets accuracy; (c) backbone size; (d) backbone accuracy.

Among BB and SR-MIO, BB dominates provided that the number of samples n is sufficiently large. Most importantly, the computational time required for BB is drastically reduced, which indicates that the problem has become notably easier post feature selection. Notice that this is not the case with SIS; the inaccuracy of the feature selection process makes the work of the MIO solver harder.
Sensitivity to Problem Parameters.
In this paragraph, we briefly discuss the impact of the various problem parameters on BB and how the method's parameters need to be modified. For brevity, we omit the computational results.
- Number of relevant features k: As the number of relevant features increases, we need to explore more features via the construct_subproblems function and hence λ_explore needs to be increased as well. Moreover, as k increases, we need to extract more relevant features from each subproblem. As observed in Bertsimas et al. (2020), SR-REL is not extremely sensitive to k. However, above a certain value of k, it does become computationally demanding to solve the cardinality constrained formulation, as the number of subsets of features that need to be tested explodes. If this is the case, we propose switching to the ENET formulation for the subproblems, which is less sensitive to k since it does not directly control the degree of sparsity.
- Signal-to-noise ratio SNR: As we already discussed, the robustness of ENET justifies its use as the fit_subproblem component of BB in low-SNR regimes.

Fig. 9: Effectiveness of feature selection for sparse logistic regression. Setting: (ρ, k, SNR, p) = (0.9, 50, 2.0, 100000). Panels: (a) support recovery accuracy; (b) out-of-sample AUC; (c) computational time. Methods: SIS followed by sparse regression; backbone; sparse regression (on entire feature set).
- Correlation ρ: The use of optimization during the backbone construction phase of BB makes the method robust to very high levels of correlation among the features; in contrast, correlation seems to be the main limiting factor for SIS. In highly correlated regimes, we propose using the SR-REL method to solve the subproblems, as it performs effective feature selection and hence highly correlated features are likely to be filtered out from the backbone set.
Conclusions.
We briefly outline the main conclusions of our experimental study:
- BB accurately scales to sparse linear and logistic regression problems with p ∼ 10^7 features in minutes, and the solutions obtained are of significantly higher quality than those of screening-based methods. In particular, in the regimes we investigated, the backbone set typically consisted of several times fewer features than those we would have to select using SIS to identify all relevant ones.
- In problems with p ∼ 10^5, where exact methods apply and hence we can compare with near-optimal solutions, the backbone method, by drastically reducing the problem size, achieves, faster and with fewer data points, the same levels of accuracy as exact methods. Moreover, contrary to SR-MIO, BB computes provably optimal solutions, albeit for the reduced problem. Crucially, the problem involving only the backbone features is of significantly smaller dimension and hence the MIO solver is able to explore many more nodes in the branch and bound tree.
- Concerning the construct_subproblems component of BB, our proposed function has a clear edge over random subspace methods in terms of the number of subproblems required to include all relevant features in the backbone set. Concerning the fit_subproblem component of BB, using SR-REL leads to the most selective and usually most accurate version of BB, provided that the number of samples used within each subproblem is sufficiently large. On the other hand, using ENET as the fit_subproblem function leads to a slightly faster and more noise-robust version of BB.

The Backbone Method for Decision Trees.
Given data {(x_i, y_i)}_{i ∈ [n]}, where, ∀i ∈ [n], x_i ∈ R^p (with n ≪ p) and y_i ∈ [K], decision trees recursively partition the feature space and assign a class label from [K] to each partition. Formally, let T be a decision tree, T_B the set of branch nodes, and T_L the set of leaf nodes. At each branch node t ∈ T_B, a split of the form a_t^T x < b_t is performed.
Intuitively, each split in a decision tree is associated with a feature j ∈ [p], so there are at most 2^D − 1 relevant features in a tree of maximum depth D. We say that feature j is relevant if at least one split is performed on it. During the backbone construction phase, we again form M distinct and tractable subproblems.

Constructing Subproblems.
Similarly to regression, in the m-th subproblem, m ∈ [M], we only consider the features included in the subproblem's feature set P_m (constructed using the construct_subproblems function of Section 2.2) and, possibly, randomly sample a subset of data points N_m ⊆ [n]. Note that, in formulating each subproblem, we may use different hyperparameter values α_m, N_min,m, and D_m.

Forming the Backbone Set.
Let us denote by T(X, y, D, N_min) the set of all feasible trees on input data (X, y) and by a_t the split performed at branch node t. Then, the backbone set can be written as the union of the solutions to all subproblems, namely,
\[
\mathcal{B} \;=\; \bigcup_{m=1}^{M} \left\{ j \;:\; \sum_{t \in \mathcal{T}_B^{(m)}} a^{(m)}_{tj} \geq 1, \;\; T^{(m)} = \operatorname*{argmin}_{T \,\in\, \mathcal{T}(X_{N_m, P_m},\, y_{N_m},\, D_m,\, N_{\min,m})} g\left(T;\, X_{N_m, P_m},\, y_{N_m},\, \alpha_m\right) \right\}. \tag{8}
\]
Importantly, in solving each of the subproblems in (8), we need not solve the OCT formulation, shown in (7); instead, we propose solving each subproblem using scalable heuristic methods, such as CART. In fact, we empirically found that applying the OCT framework to subproblems does not significantly improve the backbone accuracy (i.e., the fraction of relevant features included in the backbone set). We use cross validation within each subproblem m ∈ [M] to tune the hyperparameters D_m, N_min,m, α_m. A minimal sketch of this construction in code is given below.
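In code, Equation (8) with CART as the subproblem solver reduces to a union of split features; the sketch below is ours, using scikit-learn's DecisionTreeClassifier in place of the paper's cross-validated CART.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_backbone(X, y, feature_sets, max_depth=5, min_samples_leaf=7):
    """Union of the features split on by one CART tree per subproblem."""
    backbone = set()
    for P_m in feature_sets:
        tree = DecisionTreeClassifier(
            max_depth=max_depth, min_samples_leaf=min_samples_leaf
        ).fit(X[:, P_m], y)
        used = tree.tree_.feature            # split feature per node; -2 at leaves
        backbone |= {int(P_m[j]) for j in used if j >= 0}
    return sorted(backbone)
```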
Solving the Reduced Problem.
Once the backbone set B is constructed, we solve the OCT Formulation (7), considering only the features in B, i.e.,
\[
T^{\star} \;=\; \operatorname*{argmin}_{T \,:\, \operatorname{depth}(T) \leq D} \; g\left(T;\, X_{\mathcal{B}},\, y,\, \alpha\right) \quad \text{s.t.} \quad N(l;\, X_{\mathcal{B}}) \geq N_{\min} \;\; \forall\, l \in \mathcal{T}_L. \tag{9}
\]

Extension to Optimal Classification Trees with Hyperplane Splits.
Instead of limiting the tree learning method to univariate splits, where a_t ∈ {0, 1}^p for any branch node t, multivariate splits can also be used. If this is the case, it is important that the number of features that participate in each split is artificially constrained (which translates to a sparsity constraint on a_t); otherwise the backbone set will generally not be small enough to substantially reduce the problem dimension.

Connections with the Random Subspace Method and Random Forest.
Feature bagging methods have had significant success when applied to ensembles of decision trees. The random subspace method (Ho, 1998) relies on training each decision tree in the ensemble on a random subset of the features instead of the entire feature set. In random forest (Breiman, 2001), a subset of the features is considered in each split of each decision tree. Our backbone method for OCTs relies on the construct_subproblems function, so each tree induced during the backbone construction phase is also trained on a subset of features. Thus, our approach enjoys many of the benefits of feature bagging (e.g., parallelizability), while, at the same time, its output is a single, interpretable tree.
In this section, we empirically evaluate the backbone method on synthetic binary classification data generated according to a ground truth decision tree.
Data Generation Methodology.
First, we create a full binary tree T_true (ground truth tree) of given depth D (we use D = 5 in our experiments). The structure of T_true is determined as follows.
- Relevant features:
We randomly pick k relevant features among the entire feature set. At each split node, we select a relevant feature to split on. To ensure that all relevant features are actually informative, we require that each of them appears in at least r split nodes in the tree. Thus, the number of relevant features k must satisfy k ≤ (2^D − 1)/r. In our experiments, we use r = 3 and, therefore, k = 10.
- Split thresholds:
Within each split node, we randomly pick a split threshold, taking care to maintain consistency of feature ranges across paths in the tree. Moreover, let x ∈ [x_min, x_max] be the feature on which we split at node t. To ensure that splits are reasonably balanced, we require that the split threshold b_t ∈ [x_min + (x_max − x_min) · f, x_max − (x_max − x_min) · f], where f ∈ [0, 0.5] is the parameter that determines how balanced the splits are. In our experiments, we use f = 0.5.
- Labels:
We assign labels to leaf nodes in such a way that no sibling leaves correspond to the same class. Let c_l denote the class label of leaf l. Given a global "noise" parameter ρ ∈ [0, 1], we associate with each leaf l a local "noise" parameter ρ_l drawn randomly from [0, ρ]; ρ_l corresponds to the probability of flipping the label of data points in leaf l. In our experiments, we use ρ = 0.3. Furthermore, we denote by C the set of all class labels and by K = |C| the total number of classes. In our experiments, we use K = 2 and hence examine binary classification problems.
Next, we generate data from T_true as follows.
- Data matrix:
We draw n i.i.d. realizations from a p-dimensional multivariate uniform distribution, i.e., we pick x_i ∈ [0, 1]^p, i ∈ [n], uniformly at random, and create the data matrix X = (x_i)_{i ∈ [n]}.
- Correlated features:
For each relevant feature x, we generate k_cor correlated features as x_j = x + ε_j, where ε_j ∼ N(0, σ²), ∀j ∈ [k_cor]. We randomly select k_cor irrelevant features and replace them with the correlated ones. In our experiments, we use k_cor = 5.
- Labels:
Let l be the leaf where data point x_i, i ∈ [n], falls after traversing the tree. With probability 1 − ρ_l, we set y_i = c_l, whereas, with probability ρ_l, we draw y_i uniformly at random from C \ {c_l}. A compact sketch of this data generation process is given below.
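The following is our own simplified sketch of the generator: it omits the per-feature range bookkeeping across paths, the r-occurrence constraint on relevant features, and the correlated-feature step, and it assumes f ≤ 0.5.

```python
import numpy as np

def generate_tree_data(n=1000, p=50, k=10, D=5, f=0.5, rho=0.3, seed=0):
    rng = np.random.default_rng(seed)
    relevant = rng.choice(p, size=k, replace=False)
    X = rng.uniform(size=(n, p))
    n_branch = 2 ** D - 1                          # branch nodes of a full tree
    feats = rng.choice(relevant, size=n_branch)    # one split feature per node
    thresh = rng.uniform(f, 1 - f, size=n_branch)  # thresholds in [f, 1 - f]
    rho_leaf = rng.uniform(0, rho, size=2 ** D)    # per-leaf flip probability
    y = np.empty(n, dtype=int)
    for i in range(n):
        node = 0
        while node < n_branch:                     # traverse down to a leaf
            node = 2 * node + (1 if X[i, feats[node]] < thresh[node] else 2)
        leaf = node - n_branch
        c = leaf % 2                               # sibling leaves differ in class
        y[i] = (1 - c) if rng.uniform() < rho_leaf[leaf] else c
    return X, y, relevant
```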
Evaluation Metrics & Methods.
We evaluate the quality of each method based on its out-of-sample AUC (since we deal with binary classification problems), as well as its overall computational time. Moreover, when applicable, we report the support recovery accuracy and false alarm rate (i.e., does the method use truly relevant features in the split nodes of the learned tree?), the learned tree's depth, and the total number of features that are used in split nodes in the tree; a sketch of how these tree-structure metrics can be computed is given after the method list below.
We make the following remarks concerning the methods we investigate in our experiments:
- OCT: Implementation of the OCT-learning method from the OT framework (Bertsimas and Dunn, 2017) using the commercial Interpretable AI software package (Interpretable AI, 2020). Solves the decision tree induction formulation via a tailored local search procedure.
- CART: Implementation of the CART heuristic (Breiman et al., 1984) using the scikit-learn package (Pedregosa et al., 2011).
- RF: Implementation of the random forest classifier (Breiman, 2001) using the scikit-learn package (Pedregosa et al., 2011).
- BB: Implementation of the backbone method.
All results presented are obtained over independent repetitions of each experiment.
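For completeness, the tree-structure metrics mentioned above can be computed from a fitted scikit-learn tree as follows; this is our own sketch, and the false-alarm normalization shown is one common choice.

```python
from sklearn.tree import DecisionTreeClassifier

def tree_structure_metrics(tree: DecisionTreeClassifier, relevant):
    """Depth, split-feature count, support recovery accuracy and false alarms."""
    used = {int(j) for j in tree.tree_.feature if j >= 0}   # -2 marks leaves
    relevant = {int(j) for j in relevant}
    return {
        "depth": tree.get_depth(),
        "n_features_used": len(used),
        "accuracy": len(used & relevant) / len(relevant),
        "false_alarm": len(used - relevant) / max(len(used), 1),
    }
```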
Scalability of the Backbone Method for Decision Trees.
In our first experiment, we compare BB with CART in ultra-high dimensional problems with p ∈ {10^4, 3 · 10^4, 10^5, 3 · 10^5} features; our goal is to examine how BB behaves as p increases. We generate n = 5000 data points according to the process described above and vary the data dimension p. We set the algorithms' parameters as follows. For BB, we construct M = 15 subproblems, each of size p_s = 3000 features, using a fixed value of λ_explore within the construct_subproblems function, and set B_max = 200. We solve the subproblems using CART; we select via cross validation the maximum depth from a range of values up to D + 2 and the minimum impurity decrease parameter from a small grid of values, and fix the minbucket parameter to N_min = 7. The impurity decrease parameter allows us to split at a node only if this split induces a decrease of the impurity greater than or equal to the parameter's value. We solve the reduced problem using OCT; we pick the maximum depth parameter from the same range, fix the minbucket parameter to N_min = 7, and use a tailored cross validation procedure (see Interpretable AI (2020)) to select the complexity parameter α in Equation 7. For CART, we use the gini impurity measure and select the minimum impurity decrease parameter via cross validation from the same small grid. We also select via cross validation the maximum depth parameter; we examine the same range of values, as well as imposing no limit on the tree's depth. We set the minbucket parameter to its default (N_min = 7).
As can be observed in Figure 10, BB outperforms CART in terms of out-of-sample predictive performance for problems with up to 100 thousand features. However, as the number of features increases further, the performance of BB drops, and we need to increase either the subproblem size p_s or the number of subproblems M. Moreover, this particular configuration of BB solves in approximately half an hour, and its computational time is almost constant as a function of p, contrary to CART, whose computational time scales linearly with p. In the largest problems we consider, CART achieves a small increase in AUC over BB, at a multiple of the computational cost. We remark, though, that the computational time of BB scales linearly with p_s or M; this is shown in subsequent experiments. In Figure 11, we examine the structure of the learned trees. We observe that CART achieves somewhat higher support recovery accuracy (recovering a larger fraction of the true relevant features), at the cost of a significantly higher false alarm rate and deeper/more complex trees.
Fig. 10: Scalability of the backbone method for decision trees: out-of-sample performance and computational time. Setting: (max_prob_flip, balanced_split_factor, n, relevant_occur, tree_depth, n_classes) = (0.3, 0.5, 5000, 3, 5, 2). Panels: (a) out-of-sample AUC; (b) computational time (sec). Methods: CART (on entire feature set); backbone.
Effectiveness of Feature Selection and Impact on OCT Formulation.
In our second experiment, we verify that BB indeed performs effective feature selection and boosts the performance of OCT, compared to when it is applied to the entire feature set. We compare BB with OCT, CART, and RF as the sample size n increases. We generate data with p = 2,000 features, so that the problem is within the reach of OCT, according to the data generating process described above. We set the algorithms' parameters as follows. For BB, we construct M = 10 subproblems, each of size p_s = 200 features, using a fixed value of λ_explore within the construct_subproblems function, and set B_max = 200. We solve the subproblems using CART; we select via cross validation the maximum depth from a range of values up to D + 2 and the minimum impurity decrease parameter from a small grid of values, and fix the minbucket parameter to N_min = 7. Both for the reduced problem of BB and for OCT applied to the entire feature set, we pick the maximum depth from the same range, fix the minbucket parameter to N_min = 7, and use a tailored cross validation procedure (see Interpretable AI (2020)) to select the complexity parameter α in Equation 7. For CART, we use the same tuning process as in the previous experiment. For RF, we select via cross validation the number of features considered in each split from a three-value grid ranging from p_s to p. We do not impose any maximum depth or minbucket constraint on RF and train each tree in the forest using a randomly drawn subset of the data points, for a fixed total number of trees.
In Figure 12, we report the out-of-sample AUC and computational time of all methods as functions of the sample size n. BB achieves the highest AUC in low sample size regimes and is eventually matched by OCT and RF as the sample size increases. However, contrary to BB, the computational time of OCT explodes as the sample size increases, requiring several hours to train.
Fig. 11: Scalability of the backbone method for decision trees: effectiveness of feature selection. Setting: (max_prob_flip, balanced_split_factor, n, relevant_occur, tree_depth, n_classes) = (0.3, 0.5, 5000, 3, 5, 2). Panels: (a) support recovery accuracy; (b) support recovery false alarm rate; (c) learned tree depth; (d) number of features in tree. Methods: CART (on entire feature set); backbone.

As far as RF is concerned, its computational time is approximately twice the computational time of BB, while its output is uninterpretable (as opposed to BB). As expected, the greedy CART method is the fastest in this regime, but its predictive power is substantially lower. Therefore, we conclude that BB is particularly useful as a feature selection method, since it filters out irrelevant features and makes the task of OCT substantially easier. Figure 13 investigates the structure of the learned trees. Along the lines of our observations in Figure 11, we see that, using either BB or OCT, we induce simpler trees compared to the ones learned via
CART. Moreover, the induced trees perform almost all their splits on relevant features.
Impact of the Number of Subproblems/Trees.
In our third experiment, we compare the performance of BB and RF as the number of subproblems (using backbone's terminology) or trees (using random forest's terminology) increases. We generate n = 5,000 data points with p = 20,000 features, following the data generating process described above. We vary the number of subproblems/trees M over a grid of values. We set the remaining algorithms' parameters as follows. For BB, we construct M subproblems, each of size p_s = 2,000 features, using a fixed value of λ_explore within the construct_subproblems function, and set B_max = 200. We solve the subproblems and the reduced problem exactly as in the previous experiments. For RF, we select via cross validation the number of features considered in each split from a three-value grid ranging from p_s to p, and set the remaining parameters exactly as in the previous experiment.
Fig. 12: Impact on OCT formulation: out-of-sample performance and computational time. Setting: (max_prob_flip, p, balanced_split_factor, relevant_occur, tree_depth, n_classes) = (0.3, 2000, 0.5, 3, 5, 2). Panels: (a) out-of-sample AUC; (b) computational time (sec). Methods: CART (on entire feature set); OCT (on entire feature set); backbone; random forest.
Fig. 13: Impact on OCT formulation: effectiveness of feature selection. Setting: (max_prob_flip, p, balanced_split_factor, relevant_occur, tree_depth, n_classes) = (0.3, 2000, 0.5, 3, 5, 2). Panels: (a) support recovery accuracy; (b) support recovery false alarm rate; (c) learned tree depth; (d) number of features in tree. Methods: CART (on entire feature set); OCT (on entire feature set); backbone.

Figure 14 shows that the behavior of BB and RF is very different as the number of subproblems/trees increases. In particular, contrary to RF, whose performance improves as the number of trees increases, the performance of BB slightly drops as more subproblems are solved (beyond a certain number of subproblems that are required to recover most relevant features). We attribute this observation to the fact that, as the number of subproblems increases, the number of irrelevant features included in the backbone set also increases; with a bigger backbone set, the reduced problem becomes harder to solve via the OCT method and hence the performance of BB slightly deteriorates. Finally, we note that the type of problems that we examine, where the ground truth is indeed sparse (i.e., only a few features actually play a role in the tree), challenges tree-based ensemble learning methods based on feature bagging. Specifically, it has been empirically observed (Breiman, 2001) that, in classification problems, a good choice for the number of features to be considered in each split is √p; however, such a choice is problematic in our setting, as it will most likely result in most relevant features not being considered for the split.
Fig. 14: Impact of the number of subproblems/trees. Setting: (max_prob_flip, p, balanced_split_factor, n, relevant_occur, tree_depth, n_classes) = (0.3, 20000, 0.5, 5000, 3, 5, 2). Panels: (a) out-of-sample AUC; (b) computational time (sec). Methods: backbone; random forest.
Sensitivity to Problem Parameters.
In this paragraph, we briefly discuss the impact of the various problem parameters on BB. For brevity, we omit the computational results.
- Ground truth tree's depth D and number of relevant features k: We found that BB maintains its edge until the problem stops being sparse enough. When k ∼ p, all methods converge to the same level of accuracy. As the ground truth tree's depth increases, the performance of all methods deteriorates uniformly.
- Noise - maximum probability of flipping a label ρ: We observed that, by filtering out many noise features, BB results in a slightly more robust version of OCT. Nonetheless, all tree-based methods were confirmed to be sensitive to noise.
Conclusions.
We briefly outline the main conclusions of our experimental study:
- BB accurately scales to decision tree induction problems with p ∼ 10^5 features in minutes. In terms of out-of-sample predictive performance, BB substantially outperforms CART and competes favorably with RF, while still outputting a single, interpretable tree.
- In problems with p ∼ 10^3, the backbone method can accurately filter the feature set and compute decision trees with better out-of-sample performance than those obtained by applying the optimal trees framework to the entire problem.
- In terms of computational time, BB is faster than RF, as it requires solving fewer subproblems during the backbone construction phase than the number of trees in RF's ensemble. Moreover, contrary to CART, the final tree outputted by BB is usually shallower and performs most of its splits on truly relevant features.

In this paper, we developed the backbone method, a novel framework that can be used to train a variety of sparse machine learning models. As we showed, the backbone method can accurately and effectively sparsify the set of possible solutions and, as a result, the MIO formulation that exactly models the learning problem can be solved fast for ultra-high dimensional problems. We gave concrete examples of problems where the backbone method can be applied and discussed in detail its implementation for the sparse regression problem and the decision tree induction problem.
As far as the sparse regression problem is concerned, our computational study illustrates that the backbone method scales to problems with p ∼ 10^7 in minutes and significantly outperforms commonly used heuristics, such as sure independence screening. In problems with p ∼ 10^5, the backbone method can be used as a feature selection method that drastically reduces the problem size and hence makes the work of exact methods much easier. Regarding the decision tree induction problem, the backbone method scales to problems with p ∼ 10^5 in minutes and, assuming that the underlying problem is indeed sparse (in that only a few features are involved in splits in the decision tree), outperforms both CART and random forest while still outputting a single, interpretable tree. In problems with p ∼ 10^3, the backbone method can accurately filter the feature set and compute decision trees with better out-of-sample performance than those obtained by applying the optimal trees framework to the entire problem.
Finally, as we discussed throughout the paper, the backbone method is generic and can be directly applied to any sparse supervised learning model; examples include sparse support vector machines and sparse principal component analysis. In addition, our proposed framework can be extended to unsupervised sparse machine learning problems, such as clustering. Furthermore, the backbone construction phase can naturally be implemented in a parallel/distributed fashion.

Acknowledgements
We would like to thank Theodore Papalexopoulos, Ryan Cory-Wright, and Jean Pauphilet for their fruitful comments.
References
Almuallim H., Dietterich T. (1994). Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2), 279–305.
Barut E., Fan J., Verhasselt A. (2016). Conditional sure independence screening. Journal of the American Statistical Association, 111(515), 1266–1277.
Beale E., Kendall M., Mann D. (1967). The discarding of variables in multivariate analysis. Biometrika, 54(3-4), 357–366.
Beck A., Teboulle M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
Bennett K., Cristianini N., Shawe-Taylor J., Wu D. (2000). Enlarging the margins in perceptron decision trees. Machine Learning, 41(3), 295–313.
Bertsekas D., Tsitsiklis J. (1989). Parallel and distributed computation: numerical methods, vol 23. Prentice Hall, Englewood Cliffs, NJ.
Bertsimas D., Copenhaver M. (2018). Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3), 931–942.
Bertsimas D., Dunn J. (2017). Optimal classification trees. Machine Learning, 106(7), 1039–1082.
Bertsimas D., Dunn J. (2019). Machine learning under a modern optimization lens. Dynamic Ideas LLC.
Bertsimas D., Li M. L. (2018). Interpretable matrix completion: A discrete optimization approach. arXiv preprint arXiv:1812.06647.
Bertsimas D., Van Parys B. (2020a). Sparse hierarchical regression with polynomials. Machine Learning, 109(5), 973–997.
Bertsimas D., Van Parys B. (2020b). Sparse high-dimensional regression: Exact scalable algorithms and phase transitions. The Annals of Statistics, 48(1), 300–323.
Bertsimas D., King A., Mazumder R. (2016). Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2), 813–852.
Bertsimas D., Pauphilet J., Van Parys B. (2017). Sparse classification and phase transitions: A discrete optimization perspective. arXiv preprint arXiv:1710.01352.
Bertsimas D., Cory-Wright R., Pauphilet J. (2019a). A unified approach to mixed-integer optimization: Nonlinear formulations and scalable algorithms. arXiv preprint arXiv:1907.02109.
Bertsimas D., Jaillet P., Martin S. (2019b). Online vehicle routing: The edge of optimization in large-scale applications. Operations Research, 67(1), 143–162.
Bertsimas D., Pauphilet J., Van Parys B. (2020). Sparse regression: Scalable algorithms and empirical performance. Statistical Science, to appear.
Boyd S., Parikh N., Chu E., Peleato B., Eckstein J. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122.
Breiman L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman L., Friedman J., Olshen R., Stone C. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks.
Chen P., Tsai C., Chen Y., Chou K., et al. (2012). A linear ensemble of individual and blended models for music rating prediction. In: Proceedings of KDD-Cup 2011, pp. 21–60.
Chen T., Guestrin C. (2016). XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
Duran M., Grossmann I. (1986). An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Mathematical Programming, 36(3), 307–339.
Efron B., Hastie T., Johnstone I., Tibshirani R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407–499.
Efroymson M. (1966). Stepwise regression–a backward and forward look. Eastern Regional Meetings of the Institute of Mathematical Statistics.
Fan J., Li R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456), 1348–1360.
Fan J., Lv J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.
Fan J., Lv J. (2018). Sure independence screening. Wiley StatsRef: Statistics Reference Online.
Fan J., Song R. (2010). Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics, 38(6), 3567–3604.
Fan J., Samworth R., Wu Y. (2009). Ultrahigh dimensional feature selection: beyond the linear model. Journal of Machine Learning Research, 10, 2013–2038.
Ferrante M., Saltalamacchia M. (2014). The coupon collector's problem. Materials Matemàtics, pp. 1–35.
Friedman J., Hastie T., Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Friedman J., Hastie T., Tibshirani R. (2020). glmnet: Lasso and elastic-net regularized generalized linear models. R package version 4.
Gamarnik D., Zadik I. (2017). High-dimensional regression with binary coefficients: Estimating squared error and a phase transition. arXiv preprint arXiv:1701.04455.
Gurobi Optimization Inc. (2016). Gurobi Optimizer Reference Manual.
Guyon I., Elisseeff A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Hastie T., Tibshirani R., Friedman J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
Hastie T., Tibshirani R., Tibshirani R. J. (2017). Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692.
Hazimeh H., Mazumder R. (2018). Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. arXiv preprint arXiv:1803.01454.
Hazimeh H., Mazumder R., Saab A. (2020). Sparse regression at scale: Branch-and-bound rooted in first-order optimization. arXiv preprint arXiv:2004.06152.
Ho T. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832–844.
Hocking R., Leslie R. (1967). Selection of the best subset in regression analysis. Technometrics, 9(4), 531–540.
Interpretable AI (2020). Interpretable AI Documentation.
Kenney A., Chiaromonte F., Felici G. (2018). Efficient and effective L0 feature selection. arXiv preprint arXiv:1808.02526.
Koziarski M., Krawczyk B., Woźniak M. (2017). The deterministic subspace method for constructing classifier ensembles. Pattern Analysis and Applications, 20(4), 981–990.
Lee J., Liu Q., Sun Y., Taylor J. (2017). Communication-efficient sparse regression. The Journal of Machine Learning Research, 18(5), 1–30.
Li J., Cheng K., Wang S., Morstatter F., et al. (2017). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6), 1–45.
Liu W., Tsang I. (2017). Making decision trees feasible in ultrahigh feature and label dimensions. The Journal of Machine Learning Research, 18(81), 1–36.
Mateos G., Bazerque J., Giannakis G. (2010). Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10), 5262–5276.
Natarajan B. (1995). Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2), 227–234.
Ng A. (1998). On feature selection: Learning with exponentially many irrelevant features as training examples. In: Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., pp. 404–412.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Schneider J., Froschhammer C., Morgenstern I., Husslein T., Singer J. (1996). Searching for backbones—an efficient parallel algorithm for the traveling salesman problem. Computer Physics Communications, 96(2-3), 173–188.
Song Q., Liang F. (2015). A split-and-merge Bayesian variable selection approach for ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(5), 947–972.
Tibshirani R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.
Walsh T., Slaney J. (2001). Backbones in optimization and approximation. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, pp. 254–259.
Wang X., Dunson D., Leng C. (2016). Decorrelated feature space partitioning for distributed sparse regression. In: Advances in Neural Information Processing Systems, pp. 802–810.
Xu H., Caramanis C., Mannor S. (2009). Robust regression and lasso. In: Advances in Neural Information Processing Systems, pp. 1801–1808.
Yang J., Mahoney M., Saunders M., Sun Y. (2016). Feature-distributed sparse regression: a screen-and-clean approach. In: Advances in Neural Information Processing Systems, pp. 2712–2720.
Zhang C. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
Zhang Y., Duchi J., Wainwright M. (2013). Communication-efficient algorithms for statistical optimization. The Journal of Machine Learning Research, 14(68), 3321–3363.
Zhou Y., Porwal U., Zhang C., Ngo H., et al. (2014). Parallel feature selection inspired by group testing. In: Advances in Neural Information Processing Systems, pp. 3554–3562.
Zou H., Hastie T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.