Optimal Decision Trees for the Algorithm Selection Problem: Integer Programming Based Approaches
Matheus Guedes Vilas Boas, Haroldo Gambini Santos, Luiz Henrique de Campos Merschmann, Greet Vanden Berghe
Matheus Guedes Vilas Boas and Haroldo Gambini Santos
Department of Computing, Federal University of Ouro Preto, St. Diogo de Vasconcelos 328, 35400-000, Ouro Preto, MG, Brazil

Greet Vanden Berghe
Department of Computer Science, KU Leuven, Technology Campus Ghent, 9000 Ghent, Belgium

Luiz Henrique de Campos Merschmann
Department of Computer Science, Federal University of Lavras, P.O. Box 3037, 37200-000, Lavras, MG, Brazil

September 25, 2019
Abstract
Even though it is well known that for most relevant computational problems different algorithms may perform better on different classes of problem instances, most researchers still focus on determining a single best algorithmic configuration based on aggregate results such as the average. In this paper, we propose Integer Programming based approaches to build decision trees for the Algorithm Selection Problem. These techniques automate three crucial decisions: (i) discerning the most important problem features to determine problem classes; (ii) grouping the problems into classes; and (iii) selecting the best algorithm configuration for each class. To evaluate this new approach, extensive computational experiments were executed using the linear programming algorithms implemented in the COIN-OR Branch & Cut solver across a comprehensive set of instances, including all MIPLIB benchmark instances. The results exceeded our expectations. While selecting the single best parameter setting across all instances decreased the total running time by 22%, our approach decreased the total running time by 40% on average across 10-fold cross validation experiments. These results indicate that our method generalizes quite well and does not overfit.

1 Introduction

Given that different algorithms may perform better on different classes of problems, Rice (1976) proposed a formal definition of the Algorithm Selection Problem (ASP). The main components of this problem are depicted in Fig. 1.

Figure 1: The Algorithm Selection Problem (Rice, 1976). [Diagram showing: Problem Space, Feature Extraction, Feature Space, Criteria Space, Algorithm Space, Selection Mapping, Performance Measure Space.]

Formally, the ASP has the following input data:

P: the problem space, a probably very large and diverse set of different problem instances; these instances have a number of characteristics, i.e. for linear programming, each possible constraint matrix defines a different problem instance;

A: the algorithm space, the set of available algorithms to solve instances of problem space P; since many algorithms have parameters that significantly change their behavior, differently from Rice (1976), we consider that each element of A is an algorithm with a specific parameter setting; thus, selecting the best algorithm also involves selecting the best parameter tuning for this algorithm;

F: the feature space; ideally elements of F have a significantly lower dimension than elements of P, since not every characteristic of a problem instance influences the selection of the best algorithm; these features are also important to cluster problems having a common best algorithm, e.g., for linear programming some algorithms are known for performing well on problems with a dense constraint matrix;
W: the criteria space; since algorithms can be evaluated according to different criteria, such as processing time, memory consumption and simplicity, the evaluation of the execution results r ∈ R^n produced using an algorithm a to solve a problem instance p may be computed using a weight vector w ∈ W = [0, 1]^n, which describes the relative importance of each criterion.

The objective is to define a function S that, considering problem features, maps problem instances to the best performing algorithms. Let B be the ideal mapping function, i.e., the one that always selects the best algorithm for every instance. The objective is then to define S minimizing:

    Σ_{p ∈ P} | w^T r(B(f(p), w)) − w^T r(S(f(p), w)) |        (1)

The main motivation for solving the ASP is that usually there is no best algorithm in the general sense: even though some algorithms may perform better on average, usually some algorithms perform much better than others for some groups of instances. A "winner-take-all" approach will probably discard algorithms that perform poorly on average, even if they produce excellent results for a small, but still relevant, group of instances.
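For a single criterion such as running time, so that the weighting in (1) is trivial, the objective reduces to summing, over all instances, how much worse the selected algorithm performs than the per-instance best. The following minimal sketch illustrates this computation; the names results, instances and select are illustrative assumptions, not code from the paper:

    # Minimal sketch of objective (1) for a single criterion (running time).
    # 'results' maps (instance, algorithm) -> cost; 'select' maps an
    # instance's feature vector to an algorithm.

    def degradation(results, instances, algorithms, features, select):
        total = 0.0
        for p in instances:
            best = min(results[p, a] for a in algorithms)  # ideal mapping B
            chosen = select(features[p])                   # mapping S under test
            total += results[p, chosen] - best             # always >= 0
        return total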
This paper investigates the construction of S using decision trees. The use of decision trees to compute S was one of the suggestions included in the seminal paper of Rice (1976). To the best of our knowledge, however, the use of decision trees for algorithm selection was mostly ignored in the literature. One recent exception is the work of Polyakovskiy et al. (2014), who evaluated many heuristics for the traveling thief problem and built a decision tree for algorithm recommendation. Polyakovskiy et al. (2014) did not report which algorithm was used to build this tree, but did note that the MATLAB® Statistics Toolbox was used to produce an initial tree that was subsequently pruned to produce a compact tree. This is an important consideration: even though deep decision trees can achieve 100% accuracy on the training dataset, they usually overfit, achieving low accuracy when predicting the class of new instances. The production of compact and accurate decision trees is an NP-Hard problem (Hyafil and Rivest, 1976). Thus, many greedy heuristics have been proposed, such as ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993) and CART (Breiman et al., 1984). These heuristics analyze each split in isolation and proceed recursively. Recently, Bertsimas and Dunn (2017) proposed Integer Programming for producing optimal decision trees for classification: the entire decision tree is evaluated to reach global optimality. Their results showed that much better classification trees were produced for an extensive test set. This result was somewhat unexpected, since there is a popular belief that optimal decision trees could overfit, sacrificing generalization. Trees are not only the organizational basis of many machine learning methods, but also an important form of structural information (Zhang et al., 2018). The main advantage of methods that produce a tree as result is the interpretability of the produced model, an important feature in some applications such as healthcare.

At this point it is important to clarify the relationship between decision trees for the ASP and decision trees for classification and regression, their most common applications. Although the ASP can be seen as the classification problem of selecting the best algorithm for each instance, this modeling does not capture some important problem aspects. Firstly, it is often the case that many algorithms may produce equivalent results for a given instance, a complication which can be remedied by using multi-label classification algorithms (Tsoumakas and Katakis, 2007). Secondly, the evaluation of the individual decisions of a classification algorithm always returns zero or one for incorrect and correct predictions, respectively. In the ASP, each decision is evaluated according to a real number which indicates how far the performance of the suggested algorithm is from the performance of the best algorithm, as stated in the objective function (1). Thus, the construction of optimal decision trees for the ASP can be seen as a generalization of the multi-label classification problem and is at least as hard. Another approach is to model the ASP as a regression problem, in which one attempts to predict the performance of each algorithm for a given instance and select the best one (Xu et al., 2008; Leyton-Brown et al., 2003b; Battistutta et al., 2017). In this approach, the cost of determining the recommended algorithm for a new instance grows proportionally to the number of available algorithms and to the cost of evaluating the regression model. By contrast, a decision tree with limited depth can recommend an algorithm in constant time. Another shortcoming of regression-based approaches is related to the loss of precision in solution evaluation: consider the case when the results produced by a regression algorithm are always the correct result plus some additional large constant value. Even though the ranking of the algorithms for a given instance would remain correct and the right algorithm would always be selected, this large constant would imply an (invalid) estimated error for the regression algorithm. The present paper therefore investigates the applicability of Integer Programming to build a mapping S with decision trees using the precise objective function (1) of the ASP.

It is also important to distinguish the ASP from the problem of discovering improved parameter settings. Popular software packages such as irace (López-Ibáñez et al., 2016) and ParamILS (Hutter et al., 2014a) embed several heuristics to guide the search for a parameter setting that, ideally, performs well across a large set of instances. During this search, parameter settings with a poor performance on an initial sampling of instances may be discarded. In the ASP, even parameter settings with poor results for many instances may be worth investigating, since they may well be the best choice for a small group of instances with similar characteristics. Thus, exploring parameter settings for the ASP may be significantly more computationally expensive than finding the best parameter setting on average, requiring more exhaustive computational experiments. An intermediate approach was proposed by Kadioglu et al. (2010): initially the instance set is divided into clusters and then the parameter settings search begins. One shortcoming of this approach is the requirement of an a priori distance metric to cluster instances. It can be hard to decide which instance features may be more influential for the parameter setting phase before the results of an initial batch of experiments are available.
Optimized decision trees for the ASP provide important information regarding which instance features are more influential to parameter settings, since these features will appear in the first levels of the decision tree. Also, instances are automatically clustered in the leaves. It is important to observe that an iterative approach is possible: after instances are clustered using a decision tree for the ASP, a parallel search for better parameters for each instance group may be executed, generating a new input for the ASP.

Another fundamental consideration is that the ASP is a static tuning technique: no runtime information is considered to dynamically change some of the suggested algorithm/parameter settings, as in the so-called reactive methods (Mascia et al., 2014). The static approach has the advantage that usually no considerable additional computational effort is required to retrieve a recommended setting, but its success obviously depends on the construction of a sufficiently diverse set of problem instances for the training dataset to cover all relevant cases. After the assembly of this dataset, a possibly large set of experiments must be performed to collect the results of many different algorithmic approaches for each instance. Finally, a clever recommendation algorithm must be trained to determine relevant features for recommending the best parameters for new problem instances. Misir and Sebag (2017) tackle the problem of recommending algorithms with incomplete data, i.e., when the experimental results matrix is sparse and only a few algorithms were executed for each problem instance. In this paper we consider the more computationally expensive case, where for the training dataset all problem instances were evaluated on all algorithms.

This paper proposes the construction of optimal decision trees for the ASP using Integer Programming techniques. To accelerate the production of high quality feasible solutions, a variable neighborhood descent based mathematical programming heuristic was also developed. To validate our proposal, we set ourselves the challenging task of improving the performance of the COIN-OR Linear Programming Solver (CLP), which is the Linear Programming (LP) solver employed within the COIN-OR Branch & Cut (CBC) solver (Lougee-Heimer, 2003). CLP is currently considered the fastest open source LP solver (Mittelmann, 2018; Gearhart et al., 2013). The LP solver is the main component of Integer Programming solvers (Atamtürk and Savelsbergh, 2005) and it is executed at every node of the search tree. Mixed-Integer Programming is the most successful technique to optimally solve NP-Hard problems and has been applied to a large number of problems, from production planning (Pochet and Wolsey, 2006) to the prediction of protein structures (Zhu, 2007).

To the best of our knowledge, this is the first time that mathematical programming based methods have been proposed and computationally evaluated for the ASP. As our results demonstrate, not only does our algorithm produce more accurate predictions of the best algorithm for unknown instances, considering a 10-fold validation process (Section 4.4), but it also has the distinct advantages of recommending algorithms in constant time and producing easily interpretable results.
The remainder of the paper is organized as follows: Section 2 presents the Integer Programming formulation for the construction of optimal decision trees for the ASP. Section 3 presents a variable neighborhood descent based mathematical programming heuristic. Our extensive computational experiments and their results are presented in Section 4 and, finally, Section 5 presents the conclusions and provides some future research directions.

2 Integer Programming Formulation

This section presents our integer programming model for the construction of optimal decision trees for the ASP. The sets and parameters are described in Section 2.1. The corresponding decision variables are described in Section 2.2. The objective function associated with the problem and the constraints are described in Section 2.3.

2.1 Sets and parameters

P: set of problem instances = {1, . . . , p};
A: set of available algorithms with parameter settings = {1, . . . , a};
F: set of instance features = {1, . . . , f};
C_f: set of valid branching values for feature f, C_f = {1, . . . , c_f}; c_f is at most p, when all instances have different values for feature f;
d: maximum tree depth;
τ: threshold indicating a minimum number of instances per leaf node;
β: penalty incorporated into the objective function when a leaf node contains a number of problem instances smaller than threshold τ;
v_{p,f}: value of feature f for problem instance p;
g_{l,n}: parameter indicating the parent of node n at level l, considering a child node n whose parent lies to its left;
h_{l,n}: parameter indicating the parent of node n at level l, considering a child node n whose parent lies to its right.

The maximum allowed tree depth is defined by d. To prevent overfitting, an additional cost (parameter β) is included in the objective function to penalize the occurrence of leaf nodes containing a number of problem instances smaller than threshold τ. Parameters v_{p,f} indicate the value of each feature f for each problem instance p. Parameters g_{l,n} and h_{l,n} indicate the parent node of a given node located to the left or to the right, respectively. Thus, if the parent of node n is at its left (n mod 2 = 0), then g_{l,n} = ⌊(n + 1)/2⌋, otherwise g_{l,n} = −1. Similarly, if the parent of node n is at its right (n mod 2 = 1), then h_{l,n} = ⌊(n + 1)/2⌋, otherwise h_{l,n} = −1.
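These two parameters can be computed directly from the node index. A minimal sketch, assuming nodes at each level are numbered 1, . . . , 2^l as in the model below, and following the convention just stated (g applies when n mod 2 = 0, h when n mod 2 = 1, with −1 marking "not applicable"):

    # Parent-index parameters of Section 2.1; n is the node index at its level.

    def g(n):
        return (n + 1) // 2 if n % 2 == 0 else -1  # parent to the left of n

    def h(n):
        return (n + 1) // 2 if n % 2 == 1 else -1  # parent to the right of n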
2.2 Decision variables

The main decision variables x_{l,n,f,c} are related to the feature and branching values at each branching node of the decision tree. From the choices defined by the model, the problem instances are grouped (variables y_{l,n,p}) according to the features and the cut-off points that were imposed on the branches. The model will determine the best algorithm for each group of problem instances placed on each leaf node (variables z_{n,a}). Variables u_n are used to check whether there are problem instances allocated to a given leaf node; these are linked to the variables m_n, explained later in this section.

x_{l,n,f,c} = 1, if feature f ∈ F and cut-off point c ∈ C_f are used at node n ∈ {1, . . . , 2^l} of level l ∈ {0, . . . , d − 1}; 0, otherwise.

y_{l,n,p} = 1, if problem instance p ∈ P is included at node n ∈ {1, . . . , 2^l} of level l ∈ {1, . . . , d}; 0, otherwise.

z_{n,a} = 1, if algorithm a ∈ A is used at leaf node n ∈ {1, . . . , 2^d}; 0, otherwise.

u_n = 1, if leaf node n ∈ {1, . . . , 2^d} has problem instances; 0, otherwise.

The next two sets of decision variables are used in the objective function. With the exception of m_n (m_n ∈ Z_+), all other variables are binary. To penalize leaf nodes with few instances, which could result in overfitting, variables m_n compute the number of problem instances that are missing for leaf node n to reach a pre-established threshold of problem instances per leaf node, determined by the parameter τ. The decision variables w_{p,n,a} are responsible for connecting the variables y_{l,n,p} and z_{n,a}, i.e., for ensuring that all problem instances allocated to a particular leaf node have the same recommended algorithm and that this algorithm is exactly the one corresponding to z_{n,a}. In addition, the connection between w_{p,n,a} and y_{l,n,p} ensures that the problem instances allocated to leaf nodes respect branching decisions at parent nodes.

m_n = number of problem instances missing from leaf node n to reach a pre-established threshold of problem instances per leaf node.

w_{p,n,a} = 1, if problem instance p ∈ P is selected for leaf node n ∈ {1, . . . , 2^d} with algorithm a ∈ A; 0, otherwise.

2.3 Objective function and constraints

The objective of our model is to construct a tree of given maximum depth that minimizes the distance between the performance obtained using the recommended algorithm and the ideal performance for each problem p. Here we consider that this non-negative value is already computed in r. There is an additional cost in the objective function to penalize the occurrence of leaf nodes with only a few supporting instances. The objective function (2) and the constraints (3)-(18) of our model follow:

    min  Σ_{n=1}^{2^d} Σ_{p∈P} Σ_{a∈A} r_{p,a} · w_{p,n,a} + Σ_{n=1}^{2^d} β · m_n    (2)

subject to

    Σ_{f∈F} Σ_{c∈C_f} x_{l,n,f,c} = 1        ∀ l ∈ {0, . . . , d−1}, n ∈ {1, . . . , 2^l}    (3)
    Σ_{n=1}^{2^d} Σ_{a∈A} w_{p,n,a} = 1      ∀ p ∈ P    (4)
    Σ_{a∈A} z_{n,a} = 1                      ∀ n ∈ {1, . . . , 2^d}    (5)
    w_{p,n,a} ≤ z_{n,a}                      ∀ p ∈ P, n ∈ {1, . . . , 2^d}, a ∈ A    (6)
    w_{p,n,a} ≤ y_{d,n,p}                    ∀ p ∈ P, n ∈ {1, . . . , 2^d}, a ∈ A    (7)
    u_n ≥ y_{d,n,p}                          ∀ n ∈ {1, . . . , 2^d}, p ∈ P    (8)
    Σ_{p∈P} y_{d,n,p} + m_n ≥ τ · u_n        ∀ n ∈ {1, . . . , 2^d}    (9)
    y_{l,n,p} ≤ y_{l−1, max(g_{l,n}, h_{l,n}), p}    ∀ l ∈ {2, . . . , d}, n ∈ {1, . . . , 2^l}, p ∈ P    (10)
    y_{l,n,p} ≤ 1 − x_{l−1, g_{l,n}, f, c}   ∀ l ∈ {1, . . . , d}, n ∈ {1, . . . , 2^l},    (11)
                                             p ∈ P, f ∈ F, c ∈ C_f : g_{l,n} ≠ −1 ∧ v_{p,f} ≤ c
    y_{l,n,p} ≤ 1 − x_{l−1, h_{l,n}, f, c}   ∀ l ∈ {1, . . . , d}, n ∈ {1, . . . , 2^l},    (12)
                                             p ∈ P, f ∈ F, c ∈ C_f : h_{l,n} ≠ −1 ∧ v_{p,f} > c
    x_{l,n,f,c} ∈ {0, 1}                     ∀ l ∈ {0, . . . , d−1}, n ∈ {1, . . . , 2^l}, f ∈ F, c ∈ C_f    (13)
    y_{l,n,p} ∈ {0, 1}                       ∀ l ∈ {1, . . . , d}, n ∈ {1, . . . , 2^l}, p ∈ P    (14)
    z_{n,a} ∈ {0, 1}                         ∀ n ∈ {1, . . . , 2^d}, a ∈ A    (15)
    u_n ∈ {0, 1}                             ∀ n ∈ {1, . . . , 2^d}    (16)
    w_{p,n,a} ∈ {0, 1}                       ∀ p ∈ P, n ∈ {1, . . . , 2^d}, a ∈ A    (17)
    m_n ∈ Z_+                                ∀ n ∈ {1, . . . , 2^d}    (18)

Equations (3) ensure that each internal node of the tree has exactly one feature and branching value selected. Each problem instance must be allocated to exactly one leaf node and one algorithm (Equations (4)) and each leaf node must have exactly one associated algorithm (Equations (5)). Inequalities (6) guarantee that the recommended algorithm for a leaf node is the same as the algorithm of the problem instances allocated to this node. Inequalities (7) guarantee that allocations of algorithms to problem instances respect the leaf node selection of each problem instance. Constraint set (8) ensures that variables u_n are 1 if and only if there is at least one problem instance associated with leaf node n. Constraints (9) ensure that variable m_n is set to the number of problem instances missing from leaf node n to reach the threshold τ; if m_n = 0, then leaf node n contains at least τ problem instances. Constraints (10) ensure that any problem instance allocated to a particular node must also belong to the associated parent node. Finally, constraints (11) and (12) ensure that problem instances allocated to a particular node respect the feature and branching values selected at the parent node. Constraints (11) are generated when g_{l,n} ≠ −1 and v_{p,f} ≤ c and ensure that problem instance p cannot be allocated to node n of level l (y_{l,n,p} = 0) when feature f and branching value c are chosen for its parent node (x_{l−1, g_{l,n}, f, c} = 1). Similarly, constraints (12) are generated when h_{l,n} ≠ −1 and v_{p,f} > c and ensure that problem instance p cannot be allocated to node n of level l (y_{l,n,p} = 0) when feature f and branching value c are chosen for its parent node (x_{l−1, h_{l,n}, f, c} = 1). Constraints (13)-(18) define the domains of the decision variables.
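For concreteness, the model can be rendered compactly with an open-source modeling layer. The sketch below uses the python-mip package; the data structures (P, A, F as index collections, C[f] the cut-off values, v and r the feature and performance matrices) and the helper names are assumptions for illustration, not the authors' implementation:

    from mip import Model, xsum, minimize, BINARY, INTEGER

    def build_model(P, A, F, C, v, r, d, tau, beta):
        m = Model()
        nodes = lambda l: range(1, 2 ** l + 1)        # nodes 1..2^l at level l
        leaves = list(nodes(d))
        x = {(l, n, f, c): m.add_var(var_type=BINARY)  # split choices
             for l in range(d) for n in nodes(l) for f in F for c in C[f]}
        y = {(l, n, p): m.add_var(var_type=BINARY)     # instance allocations
             for l in range(1, d + 1) for n in nodes(l) for p in P}
        z = {(n, a): m.add_var(var_type=BINARY) for n in leaves for a in A}
        u = {n: m.add_var(var_type=BINARY) for n in leaves}
        w = {(p, n, a): m.add_var(var_type=BINARY)
             for p in P for n in leaves for a in A}
        mn = {n: m.add_var(var_type=INTEGER, lb=0) for n in leaves}
        m.objective = minimize(                        # objective (2)
            xsum(r[p][a] * w[p, n, a] for p in P for n in leaves for a in A)
            + beta * xsum(mn[n] for n in leaves))
        for l in range(d):                             # (3): one split per node
            for n in nodes(l):
                m += xsum(x[l, n, f, c] for f in F for c in C[f]) == 1
        for p in P:                                    # (4): one leaf, one algorithm
            m += xsum(w[p, n, a] for n in leaves for a in A) == 1
        for n in leaves:
            m += xsum(z[n, a] for a in A) == 1         # (5)
            m += xsum(y[d, n, p] for p in P) + mn[n] >= tau * u[n]   # (9)
            for p in P:
                m += u[n] >= y[d, n, p]                # (8)
                for a in A:
                    m += w[p, n, a] <= z[n, a]         # (6)
                    m += w[p, n, a] <= y[d, n, p]      # (7)
        parent = lambda n: (n + 1) // 2
        for l in range(1, d + 1):
            for n in nodes(l):
                for p in P:
                    if l >= 2:                         # (10): follow the parent
                        m += y[l, n, p] <= y[l - 1, parent(n), p]
                    for f in F:
                        for c in C[f]:
                            # (11): a right child (n even) excludes v[p][f] <= c
                            if n % 2 == 0 and v[p][f] <= c:
                                m += y[l, n, p] <= 1 - x[l - 1, parent(n), f, c]
                            # (12): a left child (n odd) excludes v[p][f] > c
                            if n % 2 == 1 and v[p][f] > c:
                                m += y[l, n, p] <= 1 - x[l - 1, parent(n), f, c]
        return m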
3 A VND-Based Mathematical Programming Heuristic

The model proposed in Section 2 can be optimized by standalone Mixed-Integer Programming (MIP) solvers and, in finite time, the optimal solution for the ASP will be produced. Despite the continuous evolution of MIP solvers (Johnson et al., 2000; Gamrath et al., 2015), the optimization of large MIP models in restricted times is, in the general case, still challenging. Thus, we performed some scalability tests (Section 4.2) to check how practical the use of our complete model is for creating optimal decision trees for the ASP on datasets of different sizes within limited times. Since our objective is to produce a method that can tackle large datasets of experiment results, we also propose a mathematical programming heuristic (Fischetti and Fischetti, 2016) based on Variable Neighborhood Descent (VND) (Mladenović and Hansen, 1997) to speed up the production of feasible solutions.
VND is a local search method that consists of exploring the solution space through systematic changes of neighborhood structures. Its success is based on the fact that different neighborhood structures do not usually share the same local minimum.

Algorithm 1 shows the pseudo-code of our approach, called VND-ASP. The overall method operates as follows: a greedy randomized constructive algorithm (GRC-ASP) is employed to generate initial feasible solutions. Multiple runs of this constructive algorithm are used to build an elite set E of solutions. The best solution of this set is the starting point of our VND local search (MIPSearch), where parts of this solution are fixed and the remaining parts are optimized with the MIP model presented previously. Also, a subset Q including at most q algorithms is built: it includes all algorithms that appear in the elite set E plus additional algorithms selected with a covering-like MIP model, in which q algorithms are selected so that each instance is covered by at least q′ < q of them, minimizing the cost of covering each problem instance with the selected algorithms.
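A minimal sketch of this covering-like model follows, as a plain set-covering reading of the description above, again assuming python-mip; the details of the authors' exact model may differ:

    from mip import Model, xsum, minimize, BINARY

    def algorithm_subset(r, P, A, q, q_min):
        m = Model()
        s = {a: m.add_var(var_type=BINARY) for a in A}      # algorithm selected
        cov = {(p, a): m.add_var(var_type=BINARY) for p in P for a in A}
        # minimize the cost of covering each instance with selected algorithms
        m.objective = minimize(xsum(r[p][a] * cov[p, a] for p in P for a in A))
        m += xsum(s[a] for a in A) <= q                     # at most q algorithms
        for p in P:
            m += xsum(cov[p, a] for a in A) >= q_min        # at least q' covers
            for a in A:
                m += cov[p, a] <= s[a]                      # only selected ones
        m.optimize(max_seconds=60)
        return [a for a in A if s[a].x is not None and s[a].x >= 0.99]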
Algorithm 1: VND-ASP(r, h, l, D, d, α, m, n, Q, A, P, q, q′)

Input. r: algorithm performance matrix; D: all different branching values for all features (F × C_f); A: set of algorithms; P: set of problem instances.
Parameters. h: matheuristic execution timeout; l: MIP search execution timeout; d: maximum tree depth; α: a continuous value between 0.1 and 1.0 that controls the greedy or random construction of the solution; m: maximum number of trees; n: maximum number of iterations without improvement; q: number of algorithms selected from A; q′: minimum number of algorithms to cover each problem instance.

    E ← {}; T ← {}; st ← time()
    Q ← algsubset(r, A, P, E, q, q′)
    GRC-ASP(P, A, r, T, 0, D, d, 1.0, m, E, Q)
    i ← 0
    while i < n do
        T ← {}
        GRC-ASP(P, A, r, T, i, D, d, rand(0.1, 1.0), m, E, Q)
        if PerformanceDegradation(T) < HigherPerformanceDegradation(E) then
            if PerformanceDegradation(T) < LowerPerformanceDegradation(E) then
                i ← −1
            end if
            i ← i + 1
            if T ∉ E then
                if |E| < m then
                    E ← E ∪ {T}
                else
                    s ← SimilarityTrees(E, T); E_s ← T
                end if
                Q ← Q ∪ AlgorithmsLeafNodes(T)
            end if
        else
            i ← i + 1
        end if
    end while
    Let s be the best solution of the set E
    Let J be the list of all subproblems in all neighborhoods
    J ← Shuffle(J); k ← 1
    ft ← time(); et ← ft − st
    while k ≤ |J| and et ≤ h do
        s′ ← MIPSearch(s, k, l, Q, J)
        if f(s′) < f(s) then
            s ← s′; k ← 1
            J ← Shuffle(J)
        else
            k ← k + 1
        end if
        ft ← time(); et ← ft − st
    end while
    return s
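In compact Python form, the VND phase of Algorithm 1 reads as follows. This is a sketch only: mip_search (solving one subproblem with the rest of the solution fixed) and cost (objective (2)) are assumed helpers, not the authors' code:

    import random, time

    def vnd(s, J, h, mip_timeout, Q, cost, mip_search):
        random.shuffle(J)                  # no a-priori neighborhood order
        k, start = 0, time.time()
        while k < len(J) and time.time() - start <= h:
            s2 = mip_search(s, J[k], mip_timeout, Q)
            if cost(s2) < cost(s):         # improvement: restart the sweep
                s = s2
                random.shuffle(J)
                k = 0
            else:
                k += 1
        return s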
Our matheuristic has the following parameters: h indicates the time limit for running the entire algorithm; l indicates the time limit for a single exploration of a sub-problem in the MIP based neighborhood search; d is the maximum tree depth; parameter α ∈ [0.1, 1.0] controls the randomization of the greedy algorithm; parameter m controls the maximum size of the set of trees E; executions of the GRC-ASP algorithm are restricted to at most n iterations without updates of the elite set. Set D represents the different features and cut-off points of the problem instances.

The list J contains all sub-problems that can be obtained by fixing solution components in all neighborhoods. We shuffle these sub-problems so that no neighborhood is given priority over another. This strategy is inspired by Souza et al. (2010), where several neighborhood orders were tested in a VND algorithm and the randomized order obtained better results. Whenever the incumbent solution is updated, the list J is shuffled again and the algorithm starts to explore it from the beginning.

In the following sections we describe in more detail the algorithm used to generate initial feasible solutions (Section 3.1) and the MIP based neighborhoods employed in our algorithm (Section 3.2).

3.1 Generating initial feasible solutions

The initial solution s is obtained from the best solution of the set of trees E. This set is obtained using a hybrid approach that combines a decision tree generator inspired by Quinlan (1993)'s C4.5 algorithm with the Greedy Randomized Constructive (GRC) search (Resende and Ribeiro, 2014). GRC adds a certain randomness to the greedy criterion adopted by the C4.5 algorithm. Algorithm 2 shows the hybrid approach; the construction of the restricted candidate list (RCL) is the adaptation made to the C4.5 algorithm for the GRC search.

Another adaptation concerns the metric used to split the nodes. Algorithm C4.5 uses the information gain metric, which chooses the attribute that minimizes the impurity of the data, i.e., a measure of the lack of homogeneity of the input data with respect to its classification. In our case, we used the performance degradation metric to split the nodes: it searches for the attribute that minimizes the degradation of the performance obtained using the recommended algorithm relative to the ideal performance for each problem p.
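The text does not spell out the split metric in pseudo-code; the sketch below gives one plausible reading, under the assumption that the degradation of a candidate split (f, c) is the extra cost incurred when each side of the split is served by its single best algorithm. All names are illustrative:

    def split_degradation(instances, algorithms, r, v, f, c):
        left = [p for p in instances if v[p, f] <= c]
        right = [p for p in instances if v[p, f] > c]
        def cost(group):
            if not group:
                return 0.0
            # cost of the best single algorithm for the whole group
            group_cost = min(sum(r[p, a] for p in group) for a in algorithms)
            ideal = sum(min(r[p, a] for a in algorithms) for p in group)
            return group_cost - ideal
        return cost(left) + cost(right)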
Algorithm 2: GRC-ASP(P, A, r, T, i, D, d, α, m, E, Q)

    if current depth = d then
        terminate
    end if
    for all (f, c) in D do
        κ_{f,c} ← PerformanceDegradationTree(T, r, f, c)
    end for
    RCL ← ∅
    a_max ← max(κ_{f,c}); a_min ← min(κ_{f,c})
    pdt ← a_min + α × (a_max − a_min)
    for all (f, c) in D do
        if κ_{f,c} ≤ pdt then
            RCL ← RCL ∪ {(f, c)}
        end if
    end for
    a_rcl ← a pair (f, c) selected randomly from the RCL list
    t ← decision node created in T that tests a_rcl at the root
    D_l ← induced sub-dataset of r whose values of feature f are less than or equal to the cut-off point of a_rcl
    D_r ← induced sub-dataset of r whose values of feature f are greater than the cut-off point of a_rcl
    T_l ← GRC-ASP(P, A, r, T, i, D_l, d, α, m, E, Q); attach T_l to the corresponding branch of t
    T_r ← GRC-ASP(P, A, r, T, i, D_r, d, α, m, E, Q); attach T_r to the corresponding branch of t

3.2 MIP based neighborhoods

Since optimizing the complete MIP model may be too expensive, five neighborhoods have been designed to be employed in a fix-and-optimize context. Each neighborhood defines a set of subproblems to be optimized. These neighborhoods are explained in what follows, together with examples shown in Figs. 2-6. The examples consider a decision tree with 4 levels: variables in gray are fixed and variables highlighted in black will be optimized. These neighborhoods are explored in a Variable Neighborhood Descent using the MIP solver in a fix-and-optimize strategy (a generic fix-and-optimize step is sketched after the figure captions below). Not only the neighborhoods, but also the sub-problems of all neighborhoods, are explored in a random order. We now explain how the decision variables x_{l,n,f,c}, y_{l,n,p} and z_{n,a} are optimized in each of the neighborhoods. The following neighborhoods were developed:

• Neighborhood N1: optimizes the selection of the feature and cut-off point of an internal node and, consequently, the allocation of problems to its child nodes as well as the choice of the recommended algorithm in the respective leaf nodes. In the example of Figure 2, we consider optimizing the feature and cut-off point of internal node 2 at level 1 of the tree (binary variables x_{1,2,f,c} for all f ∈ F, c ∈ C_f). Since these variables determine the problems that will be allocated to the left and right child nodes, the binary variables y_{2,3,p}, y_{2,4,p} and y_{3,5,p}, . . . , y_{3,8,p} will also be optimized. Moreover, the recommended algorithm of every leaf node descending from the chosen node (internal node 2 of level 1) must also be optimized: in our example, the binary variables z_{5,a}, . . . , z_{8,a} of leaf nodes 5, . . . , 8 at level 3 of the tree.
• Neighborhood N2: optimizes the selection of the feature and cut-off point of an internal node (which cannot be the root node) together with the feature and cut-off point of its parent node; consequently, it optimizes the allocation of problems in the child nodes of the parent node as well as the choice of the recommended algorithm in the respective leaf nodes. In the example of Figure 3, we consider optimizing the feature and cut-off point of internal node 2 at level 2 of the tree (binary variables x_{2,2,f,c}) and of its parent node (binary variables x_{1,1,f,c}). Since these variables determine the problems that will be allocated to the left and right child nodes, the binary variables y_{2,1,p}, y_{2,2,p} and y_{3,1,p}, . . . , y_{3,4,p} will also be optimized. Moreover, the recommended algorithm of every leaf node descending from the parent node (internal node 1 of level 1) must also be optimized: in our example, the binary variables z_{1,a}, . . . , z_{4,a} of leaf nodes 1, . . . , 4 at level 3 of the tree.

• Neighborhood N3: optimizes the selection of the feature and cut-off point of all nodes at one level of the decision tree (the level of the root node cannot be chosen); consequently, it optimizes the allocation of the problems at the levels following the chosen level, as well as the choice of the recommended algorithm in the respective leaf nodes. In the example of Figure 4, we consider optimizing the feature and cut-off point of all nodes of level 2 of the tree (binary variables x_{2,1,f,c}, . . . , x_{2,4,f,c}). Since these variables determine the problems that will be allocated at subsequent levels, the binary variables y_{3,1,p}, . . . , y_{3,8,p} will also be optimized. In addition, the recommended algorithms of all leaf nodes must be optimized: in our example, the binary variables z_{1,a}, . . . , z_{8,a} of leaf nodes 1, . . . , 8 at level 3 of the tree.

• Neighborhood N4: optimizes the selection of the feature and cut-off point of the root node together with those of an internal node located at least at the third level of the tree (l = 2); consequently, it optimizes the allocation of problems in all nodes of the tree as well as the choice of the recommended algorithm in the respective leaf nodes. In the example of Figure 5, we consider optimizing the feature and cut-off point of the root node (binary variables x_{0,1,f,c}) and of internal node 2 at level 2 of the tree (binary variables x_{2,2,f,c}). Since these variables determine the problems that will be allocated to all other nodes of the tree, all allocation variables y_{l,n,p} will also be optimized.
In addition, the recommended algorithms of all leaf nodes must be optimized: in our example, the binary variables z_{1,a}, . . . , z_{8,a} of leaf nodes 1, . . . , 8 at level 3 of the tree.

• Neighborhood N5: optimizes the selection of the feature and cut-off point of a particular path from the root node to one of the tree's leaf nodes; consequently, it optimizes both the allocation of problems in all nodes of the tree and the choice of the recommended algorithm in the respective leaf nodes. In the example of Figure 6, we consider the path from the root node to leaf node 8: the feature and cut-off point of the root node (binary variables x_{0,1,f,c}), of internal node 2 at level 1 (binary variables x_{1,2,f,c}) and of internal node 4 at level 2 (binary variables x_{2,4,f,c}) are optimized. Since these variables determine the problems that will be allocated to all other nodes of the tree, all allocation variables y_{l,n,p} will also be optimized, together with the recommended algorithms of all leaf nodes: in our example, the binary variables z_{1,a}, . . . , z_{8,a} of leaf nodes 1, . . . , 8 at level 3 of the tree.

Figure 2: Example of neighborhood N1 variables: variables highlighted in gray are fixed and variables highlighted in black will be optimized. [Diagram omitted.]

Figure 3: Example of neighborhood N2 variables: variables highlighted in gray are fixed and variables highlighted in black will be optimized. [Diagram omitted.]

Figure 4: Example of neighborhood N3 variables: variables highlighted in gray are fixed and variables highlighted in black will be optimized. [Diagram omitted.]

Figure 5: Example of neighborhood N4 variables: variables highlighted in gray are fixed and variables highlighted in black will be optimized. [Diagram omitted.]

Figure 6: Example of neighborhood N5 variables: variables highlighted in gray are fixed and variables highlighted in black will be optimized. [Diagram omitted.]
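As referenced above, all five neighborhoods share the same fix-and-optimize mechanics: the variables of the chosen subproblem are released while every other variable is fixed at its incumbent value, and the reduced model is re-optimized under a short time limit. A minimal sketch, assuming the python-mip model of Section 2 with binary incumbent values in sol (a dict mapping variable to value); this is an illustration, not the authors' code:

    def fix_and_optimize(model, sol, free, seconds):
        for var, val in sol.items():
            if var not in free:
                var.lb = var.ub = round(val)   # fix at the incumbent value
        status = model.optimize(max_seconds=seconds)
        for var in sol:                        # release bounds for the next move
            var.lb, var.ub = 0.0, 1.0
        return status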
4 Computational Experiments

The computational experiments are divided into two groups: the first group (Section 4.2) is dedicated to checking the scalability of our integer programming model using a standalone MIP solver on datasets with different numbers of problem instances and algorithms. The second group of experiments (Section 4.4) uses cross-validation and divides the complete base into 10 partitions, from which training and test partitions are built. Section 4.1 presents the initial configurations for these two experiments.
4.1 Building the experiments database

This section describes how the experiments database was created. Details of the sets of problem instances and algorithms are presented in Sections 4.1.1 and 4.1.2, respectively.
4.1.1 Problem instances

Computational experiments were performed on a diverse set of 1004 problem instances, including the MIPLIB 3, 2003, 2010 and 2017 (Koch et al., 2011) benchmark sets. Additional instances from Nurse Rostering (Santos et al., 2016), School Timetabling (Fonseca et al., 2017) and Graph Drawing (Silva and Santos, 2017) were also included. We extracted 37 features (|F| = 37) associated with the variables, the constraints and the coefficients of the constraint matrix of each problem instance. These features are similar to the ones used in (Hutter et al., 2014b), with the notable exception that features that are computationally expensive to extract were discarded, to ensure that our approach would incur no overhead when incorporated into an algorithm. When building the problem instances dataset, special care was taken to ensure that no application was over-represented. Table 1 shows the minimum (min), maximum (max), average (avg) and standard deviation (sd) of each feature over the complete set of problem instances. The density feature was computed as (nz / (rows × cols)) × 100.

Table 1: Minimum, maximum, average and standard deviation of each feature over the complete set of problem instances.

    feature       min          max         avg           sd
    cols          3            2277736     35368.04      120402.72
    bin           0            2277736     25402.59      103781.75
    int           0            440899      2301.02       16289.21
    cont          0            799416      7664.43       47456.96
    objMin        -1.72E+11    5084550     -1.72E+08     5.431E+09
    objMax        -400071      1.13E+09    4398763.75    5.9E+07
    objAv         -2.54E+10    5.00E+07    -2.52E+07     8E+08
    objMed        -1.55E+08    40212300    -192578.19    7E+06
    objAllInt     0            1           0.70          0.46
    objRatioLSA   -1           2.24E+12    2.448E+09     7.081E+10
    rows          1            2897380     38440.14      146476.08
    rpart         0            18431       255.05        1084.42
    rpack         0            773664      4598.49       44125.82
    rcov          0            88452       381.60        3481.73
    rcard         0            430         9.95          33.88
    rknp          0            103041      189.01        3356.63
    riknp         0            547200      2062.42       29593.52
    rflowbin      0            381806      2210.27       19281.02
    rflowint      0            120201      428.23        4052.12
    rflowmx       0            410733      1813.33       17306.95
    rvbound       0            0           0             0
    rother        0            2365080     26491.79      110176.92
    rhsMin        -1E+100      367127      -2.98E+97     5.45E+98
    rhsMax        -6.47        1E+100      1.39E+98      1.17E+99
    rhsAv         -5.24E+95    5.12E+98    5.13E+95      1.61E+97
    rhsMed        -29961800    999949      -33807.20     979700.89
    rhsAllInt     0            1           0.82          0.39
    rhsRatioLSA   -1           1.01E+103   2.02E+100     4.50E+101
    equalities    0            416449      5745.37       27584.05
    nz            3            27329856    390757.39     1567477.92
    aMin          -4E+09       1           -4523186.29   1E+08
    aMax          -1           370795000   1891549.90    2E+07
    aAv           -107447000   1148410     -101191.22    3391461.70
    aMed          -41014       10000       -25.70        1353.80
    aAllInt       0            1           0.69          0.46
    aRatioLSA     1            5.787E+12   6.117E+09     1.824E+11
    density       0.0001819    100         5.44          17.50

Fig. 7 summarizes the 37 features for the ASP, grouped into features related to the variables, the constraints and the coefficients of the constraint matrix.
Variable features:
1. Number of variables: cols.
2-4. Number of variables of each type: bin, int and cont.
5-10. Variation of the objective function coefficients: objMin, objMax, objAv, objMed, objAllInt and objRatioLSA.

Constraint features:
11-12. Number of non-zeros and density: nz and density.
13. Number of constraints: rows.
14-25. Number of constraints of each type: rpart, rpack, rcov, rcard, rknp, riknp, rflowbin, rflowint, rflowmx, rvbound, rother and equalities.
26-31. Right-hand side features: rhsMin, rhsMax, rhsAv, rhsMed, rhsAllInt and rhsRatioLSA.

Constraint matrix coefficient features:
32-37. Variation of the coefficients: aMin, aMax, aAv, aMed, aAllInt and aRatioLSA.

Figure 7: Features of problem instances of the Algorithm Selection Problem: variables, constraints and coefficients of the constraint matrix.
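All of these features can be computed cheaply from the constraint matrix alone. The sketch below illustrates a few of them from a sparse matrix, assuming scipy, including the density formula stated above and the aRatioLSA feature discussed later (ratio between the largest and the smallest absolute non-zero coefficient); it is an illustration, not the authors' extraction code:

    import numpy as np
    from scipy.sparse import coo_matrix

    def matrix_features(A: coo_matrix):
        rows, cols, nz = A.shape[0], A.shape[1], A.nnz
        absdata = np.abs(A.data[A.data != 0])     # absolute non-zero coefficients
        return {
            "rows": rows, "cols": cols, "nz": nz,
            "density": nz / (rows * cols) * 100.0,  # percentage of non-zeros
            "aMin": A.data.min(), "aMax": A.data.max(),
            "aRatioLSA": absdata.max() / absdata.min(),
        }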
4.1.2 Algorithms

The definition of the solution method for the LP solver in CBC involves selecting the algorithm, such as the dual simplex or the barrier method, and defining several parameters, such as the perturbation value and the pre-solve effort. Overall, 532 different algorithm configurations were evaluated for each one of the 1004 problem instances. A timeout T = 4000 was set for each execution. The computational results matrix r was filled with the execution time for regular executions, i.e., executions that finished before the time limit and provided correct results. Executions for a given problem instance p and algorithm a that crashed, exceeded the time limit or produced wrong results were penalized by setting r_{p,a} = 8000. This large batch of experiments was executed on computers with 32 GB of RAM and 10 Intel® i9-7900X processing cores. Tasks were scheduled in parallel (7 threads simultaneously) with the GNU Parallel package (Tange, 2011). Table 2 shows the algorithms and parameter values evaluated.

Table 2: Algorithms and some parameter values evaluated. [Contents omitted: the extracted layout was too damaged to reconstruct reliably; the table lists, for the primal simplex, dual simplex and barrier algorithms, the tested values of parameters such as idiot, crash, dualize, pertv, sprint, primalp/dualp, psi, scal, passp, subs, perturb, presolve, spars, cholesky and gamma, together with the CLP default setting of each parameter.]

This experiment to generate the experimental results dataset produced an interesting result by itself: a new best single parameter setting was discovered which decreases the solution time by 22% on average, a remarkable improvement considering that CLP is already the fastest open source linear programming solver.
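The assembly of the penalized results matrix r described above is straightforward; a minimal sketch, in which runs is an assumed mapping (instance, config) -> (status, time):

    TIMEOUT, PENALTY = 4000, 8000   # penalty = 2x the timeout, as stated above

    def build_results(runs, instances, configs):
        r = {}
        for p in instances:
            for a in configs:
                status, t = runs[p, a]
                ok = status == "finished" and t < TIMEOUT
                # crashes, timeouts and wrong results get the fixed penalty
                r[p, a] = t if ok else PENALTY
        return r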
4.2 Scalability tests

To evaluate the performance and scalability of the proposed formulation in a standalone MIP solver, models for generating trees with different depths, built from datasets of different sizes obtained by randomly selecting subsets of the complete experimental results of the COIN-OR CBC solver, were solved with the state-of-the-art CPLEX 12.9 MIP solver on a computer with 32 GB of RAM and 6 Intel® i7-4960X cores. In this experiment, we measured the final gap reported by the solver between the best lower and upper bounds at the end of an execution with a one hour time limit. These experiments considered generating trees with a minimum number of 10 instances per leaf node (τ = 10) and a penalty of β = 50 for leaf nodes violating this constraint. Fig. 8 shows the performance of our integer programming model on bases with 50 algorithms and problem instance sets ranging from 50 to 500.

Figure 8: Performance of the integer programming model (lower and upper bounds on the cost) over sets of problem instances of different sizes and 50 algorithms. [Plot omitted.]

As can be seen, optimal or near optimal decision trees were generated for models with up to 200 problem instances. For larger datasets the execution terminated with increasingly large gaps between the produced bounds, to the point that for models with more than 300 instances a feasible solution was not found within the time limit. For the model with 500 problem instances not even the LP relaxation of the MIP model was computed within the time limit and no lower bound was available. Thus, for the complete dataset, the experiments in the next subsections were performed only with the proposed VND-ASP heuristic.
4.3 Decision tree for the complete dataset

Creating optimized decision trees considering the entire experimental results dataset is quite challenging: there are more than half a million observations, far beyond the limits indicated in the previous section. Thus, only experiments with our mathematical programming heuristic were conducted for this dataset. Fig. 9 presents the decision tree constructed with VND-ASP using the following parameters: maximum tree depth d = 3, total time limit h = 72000, MIP search timeout l = 4000, elite set size m = 20, initial algorithm subset size q = 100, q′ = 20, minimum number of instances per leaf node τ = 50 and penalty cost β = 500. The estimated performance improvement with this decision tree is 61%, a remarkable improvement. Please note, however, that this improvement does not reflect the expected performance improvement of this tree on unknown instances, which is the really important estimate. The estimated performance of the decision trees produced with our method on unknown instances is computed in the next section in 10-fold cross validation experiments.

An inspection of the contents of our decision tree shows that the range of the coefficients of the constraint matrix plays an important role in determining the best algorithm. The feature selected for the root node, aRatioLSA, is computed as the ratio between the largest and the smallest absolute non-zero values in the constraint matrix. Each leaf node has a set of instances allocated to it, depending on the decisions at all parent nodes, and a recommended algorithm, which is the algorithm with the best results on these instances. As an example, for the left-most branch of the tree, the best algorithm configuration selected uses the primal simplex algorithm with the "idiot" parameter set to value 7, considering the 127 LP problems allocated to this node.
Figure 9: Decision tree with maximum depth 3. [Diagram omitted: the root branches on aRatioLSA; deeper nodes branch on objMed, objRatioLSA, rhsMax and int; the leaves recommend primal simplex configurations (idiot set to 7, 60, 80 or 100) or dual simplex configurations (dualp pesteep variants, some with psi, scal, pertv and related parameters), each with the number of LP problems allocated to it.]
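A tree such as the one in Fig. 9 recommends an algorithm with at most d feature comparisons, independently of the number of available algorithms. A minimal sketch of this constant-time recommendation, with an assumed node representation:

    class Node:
        def __init__(self, feature=None, cutoff=None, left=None, right=None,
                     algorithm=None):
            self.feature, self.cutoff = feature, cutoff
            self.left, self.right = left, right
            self.algorithm = algorithm        # set only at leaf nodes

    def recommend(node, features):
        while node.algorithm is None:         # descend at most d levels
            branch_left = features[node.feature] <= node.cutoff
            node = node.left if branch_left else node.right
        return node.algorithm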
4.4 Cross validation

To evaluate the predictive power of our method, i.e. the expected performance on unknown instances, a 10-fold cross validation experiment was performed: a randomly shuffled complete dataset was divided into 10 partitions and, at each iteration, 9 of the partitions were used to create the decision tree (training dataset) and the remaining partition was used for evaluating the decision tree (test dataset). Each partition had 480928 examples (904 problem instances × 532 available algorithms), with the exception of the last four partitions, which contained 480396 examples (903 problem instances × 532 available algorithms). The results of the cross-validation are given in Fig. 10. This figure shows the average performance degradation with respect to the ideal performance for solving the LP relaxation of all problem instances (the lower bound). Results of VND-ASP with maximum tree depths 4 (VND-ASP(D=4)) and 5 (VND-ASP(D=5)) are included; the remaining parameters of VND-ASP are the same as described in the previous subsection. We also compare with the Random Forest (RF) algorithm trained with different numbers of trees (. . . , T=200, where T is the number of trees). Default CBC settings (Default) and the results obtained by selecting a single best algorithm (SBA) are also included.

Figure 10: Cross-validation results for all partitions. [Plot omitted: total time of SBA, VND-ASP(D=4), VND-ASP(D=5), Default and RF with different numbers of trees, relative to the lower bound.]

As can be seen, our results indicate performance gains of 40% compared with default settings. Moreover, they are noticeably better than those obtained when selecting only the single best algorithm (22%). VND-ASP results are also mostly superior to those obtained with RF, with the exception of RF(T=50), where equivalent results were obtained. We believe that this is a very positive result, since the model produced by our algorithm (a single tree) is easier to interpret than those produced by RF. More importantly, algorithms can be recommended much faster (in constant time) with our approach, since the processing cost does not depend on the number of available algorithms as it does in RF, where a series of regression problems must be solved in order to recommend an algorithm for each new instance.
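The evaluation protocol itself is simple to restate in code. A minimal sketch of the 10-fold loop, assuming build_tree trains a tree on the given instances and recommend is the traversal sketched earlier; fold construction details (here, striding) are illustrative:

    import random

    def cross_validate(instances, r, features, build_tree, recommend, folds=10):
        idx = list(instances)
        random.shuffle(idx)
        total = 0.0
        for k in range(folds):
            test = idx[k::folds]               # held-out partition
            test_set = set(test)
            train = [p for p in idx if p not in test_set]
            tree = build_tree(train, r, features)
            # accumulate the penalized running time of the recommendations
            total += sum(r[p, recommend(tree, features[p])] for p in test)
        return total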
5 Conclusions

This paper introduced a new mathematical programming formulation to solve the Algorithm Selection Problem (ASP). This formulation produces globally optimal decision trees with limited depth. The main advantage of this approach is that, despite the construction of the tree itself potentially being computationally expensive, once the tree has been constructed, algorithm recommendations can be made in constant time. A dataset containing the experimental results of many configurations of the COIN-OR linear programming solver (CLP), the LP solver of the COIN-OR Branch & Cut solver, was built by solving a comprehensive set of instances from various applications. This initial batch of experiments itself already revealed improved parameter settings for the LP solver, including the discovery of a new algorithm configuration which was 22% faster than the default CLP settings.

Scalability tests were performed to check how large datasets could become before it was no longer possible to generate provably optimal decision trees with a state-of-the-art standalone MIP solver. Given that, at a certain point, the resulting MIP model becomes too difficult to optimize exactly, a mathematical programming-based VND local search heuristic was also proposed to handle larger datasets.

To evaluate the predictive power of our method, a 10-fold cross validation experiment was conducted. The results were very promising: executions with the recommended parameter settings were 40% faster than with CLP default settings, almost doubling the improvement that could be obtained using a single best parameter setting. Our results are comparable with those obtained after tuning the Random Forest algorithm, with the added advantages that the predictive model produced by our method (a single tree) is easily interpretable and, more importantly, that the cost of recommending an algorithm does not depend on the number of available algorithms.

Future directions include evaluating stronger alternative integer programming formulations for this problem given that, as the scalability tests showed, there is still a significant gap between the lower and upper bounds produced for the larger datasets. The positive results for the ASP are also a good indicator that the application of our methodology to classification and regression problems represents a promising future research path.
Acknowledgements
The authors would like to thank the Brazilian agencies CNPq and FAPEMIG for their financial support. The authors acknowledge the National Laboratory for Scientific Computing (LNCC/MCTI, Brazil) for providing HPC resources of the SDumont supercomputer, which have contributed to the research results reported within this paper (URL: http://sdumont.lncc.br). The research was partially supported by Data-driven Logistics (FWO-S007318N). Editorial consultation provided by Luke Connolly (KU Leuven).
References
Achard, P., De Schutter, E., 2006. Complex parameter landscape for a complex neuron model. PLOS Computational Biology 2, 7, 1–11.

Applegate, D.L., Bixby, R.E., Chvatal, V., Cook, W.J., 2006. The Traveling Salesman Problem: A Computational Study. Princeton University Press.

Atamtürk, A., Savelsbergh, M.W., 2005. Integer-programming software systems. Annals of Operations Research 140, 67–124.

Atkeson, C.G., Moore, A.W., Schaal, S., 1997. Locally weighted learning. Artificial Intelligence Review 11, 1-5, 11–73.

Audet, C., Orban, D., 2006. Finding optimal algorithmic parameters using derivative-free optimization. SIAM Journal on Optimization 17, 3, 642–664.

Battistutta, M., Schaerf, A., Urli, T., 2017. Feature-based tuning of single-stage simulated annealing for examination timetabling. Annals of Operations Research.

[Incomplete entry; authors and title lost in extraction.] Computers & Operations Research 65, 83–92.

Bertsimas, D., Dunn, J., 2017. Optimal classification trees. Machine Learning 106, 7, 1039–1082.

Bixby, R.E., Fenelon, M., Gu, Z., Rothberg, E., Wunderling, R., 2004. Mixed-integer programming: a progress report. In The Sharpest Cut: The Impact of Manfred Padberg and His Work. SIAM, chapter 18, pp. 309–324.

Bolme, D.S., Beveridge, J.R., Draper, B.A., Phillips, P.J., Lui, Y.M., 2011. Automatically searching for optimal parameter settings using a genetic algorithm. In Crowley, J.L., Draper, B.A. and Thonnat, M. (eds), Computer Vision Systems, Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 213–222.

Breiman, L., 2001. Random forests. Machine Learning 45, 1, 5–32.

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Statistics/Probability Series. Wadsworth Publishing Company, Belmont, California, U.S.A.

Eiben, A.E., Hinterding, R., Michalewicz, Z., 1999. Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation 3, 2, 124–141.

Fischetti, M., Fischetti, M., 2016. Matheuristics. Handbook of Heuristics, pp. 1–33.

Fonseca, G.H., Santos, H.G., Carrano, E.G., Stidsen, T.J., 2017. Integer programming techniques for educational timetabling. European Journal of Operational Research 262, 1, 28–39.

[Incomplete entry; authors and title lost in extraction.] Emerging Theory, Methods, and Applications. INFORMS, pp. 257–277.

Gamrath, G., Koch, T., Martin, A., Miltenberger, M., Weninger, D., 2015. Progress in presolving for mixed integer programming. Mathematical Programming Computation 7, 367–398.

Garey, M.R., Johnson, D.S., 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA.

Gearhart, J.L., Adair, K.L., Detry, R.J., Durfee, J.D., Jones, K.A., Martin, N., 2013. Comparison of open-source linear programming solvers. Technical report, Sandia National Laboratories.

Haas, J., Peysakhov, M., Mancoridis, S., 2005. GA-based parameter tuning for multi-agent systems. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, ACM, New York, NY, USA, pp. 1085–1086.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11, 1, 10–18.

Hutter, F., Stützle, T., Leyton-Brown, K., Hoos, H.H., 2014a. ParamILS: an automatic algorithm configuration framework. CoRR abs/1401.3492.

Hutter, F., Xu, L., Hoos, H.H., Leyton-Brown, K., 2014b. Algorithm runtime prediction: methods & evaluation. Artificial Intelligence 206, 79–111.

Hyafil, L., Rivest, R.L., 1976. Constructing optimal binary decision trees is NP-complete. Information Processing Letters 5, 1, 15–17.

Johnson, E., Nemhauser, G., Savelsbergh, W., 2000. Progress in linear programming-based algorithms for integer programming: an exposition. INFORMS Journal on Computing 12, 1, 2–23.

Kadioglu, S., Malitsky, Y., Sellmann, M., Tierney, K., 2010. ISAC - instance-specific algorithm configuration. In Proceedings of the 2010 Conference on ECAI 2010: 19th European Conference on Artificial Intelligence, IOS Press, Amsterdam, The Netherlands, pp. 751–756.

Kass, G.V., 1980. An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society, Series C (Applied Statistics) 29, 2, 119–127.

Koch, T., Achterberg, T., Andersen, E., Bastert, O., Berthold, T., Bixby, R.E., Danna, E., Gamrath, G., Gleixner, A.M., Heinz, S., et al., 2011. MIPLIB 2010. Mathematical Programming Computation 3, 2, 103–163.

Kohavi, R., John, G.H., 1995. Automatic parameter selection by minimizing estimated error. In Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, pp. 304–312.

Land, A.H., Doig, A.G., 1960. An automatic method for solving discrete programming problems. Econometrica 28, 3, 497–520.

Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., Shoham, Y., 2003a. Boosting as a metaphor for algorithm design. In Rossi, F. (ed.),
Principles and Practice of Constraint Programming – CP 2003 ,Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 899–903.Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., Shoham, Y., 2003b. A portfolio approach toalgorithm selection. In
IJCAI , Vol. 3, pp. 1542–1543.Loh, W.Y., Shih, Y.S., 1997. Split selection methods for classification trees.
Statistica Sinica
7, 4, 815–840.L´opez-Ib´a˜nez, M., Dubois-Lacoste, J., P´erez C´aceres, L., St¨utzle, T., Birattari, M., 2016. The irace Package:Iterated Racing for Automatic Algorithm Configuration.
Operations Research Perspectives
3, 43–58.L´opez-Ib´a˜nez, M., St¨utzle, T., 2014. Automatically Improving the Anytime Behaviour of OptimisationAlgorithms.
European Journal of Operational Research
IBM Journal of Research and Development
International Transactions in Operational Research
21, 1, 127–152.Menickelly, M., G¨unl¨uk, O., Kalagnanam, J., Scheinberg, K., 2016. Optimal Generalized Decision Trees viaInteger Programming.
CoRR abs/1612.03225.Misir, M., Sebag, M., 2017. Alors: An algorithm recommender system.
Artificial Intelligence
Computers and Operations Research
Production Planning by Mixed Integer Programming (Springer Series inOperations Research and Financial Engineering) . Springer-Verlag New York, Inc., Secaucus, NJ, USA.Polyakovskiy, S., Bonyadi, M.R., Wagner, M., Michalewicz, Z., Neumann, F., 2014. A comprehensivebenchmark set and heuristics for the traveling thief problem. In
Proceedings of the 2014 Annual Conferenceon Genetic and Evolutionary Computation , ACM, pp. 477–484.Quinlan, J.R., 1986. Induction of Decision Trees.
Machine Learning
1, 1, 81–106.Quinlan, J.R., 1993.
C4.5: Programs for Machine Learning . Morgan Kaufmann Publishers Inc., SanFrancisco, CA, USA.Resende, M., Ribeiro, C., 2014. GRASP: Greedy randomized adaptive search procedures, Springer US. pp.287–312.Rice, J.R., 1976. The algorithm selection problem.
Advances in Computers . Vol. 15. Elsevier, pp. 65 – 118.Santos, H.G., Toffolo, T.A., Gomes, R.A., Ribas, S., 2016. Integer programming techniques for the nurserostering problem.
Annals of Operations Research
Electronic Notes in Discrete Mathematics
58, 207 – 214.Song, Y.Y., Lu, Y., 2015. Decision tree methods: applications for classification and prediction, 27, 2,130–135.Souza, M., Coelho, I., Ribas, S., Santos, H., Merschmann, L., 2010. A hybrid heuristic algorithm for theopen-pit-mining operational planning problem.
European Journal of Operational Research
The USENIX Magazine
36, 1, 42–47.Tsoumakas, G., Katakis, I., 2007. Multi-label classification: An overview.
International Journal of DataWarehousing and Mining (IJDWM)
3, 3, 1–13.Vapnik, V.N., 1995.
The Nature of Statistical Learning Theory . Springer-Verlag New York, Inc., New York,NY, USA.Vilas Boas, M.G., Santos, H.G., Martins, R.S.O., Merschmann, L.H.C., 2017. Data Mining Approach forFeature Based Parameter Tunning for Mixed-Integer Programming Solvers.
Procedia Computer Science
Data Mining: Practical Machine Learning Tools and Techniques (3rd edn.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.Wolsey, L.A., 2007. Mixed Integer Programming, John Wiley & Sons, Inc.Xu, L., Hoos, H., Leyton-Brown, K., 2010. Hydra: Automatically configuring algorithms for portfolio-basedselection.Xu, L., Hutter, F., Hoos, H.H., Leyton-Brown, K., 2008. Satzilla: portfolio-based algorithm selection forsat.
Journal of artificial intelligence research
32, 565–606.18hang, H., Wang, S., Xu, X., Chow, T.W., Wu, Q.J., 2018. Tree2vector: learning a vectorial representationfor tree-structured data.
IEEE transactions on neural networks and learning systems , 99, 1–15.Zhu, Y., 2007. Mixed-Integer Linear Programming Algorithm for a Computational Protein Design Problem.