Zero Training Overhead Portfolios for Learning to Solve Combinatorial Problems
Yiwei Bai, Wenting Zhao, Carla P. Gomes

Abstract
There has been an increasing interest in harnessing deep learning to tackle combinatorial optimization (CO) problems in recent years. Typical CO deep learning approaches leverage the problem structure in the model architecture. Nevertheless, model selection is still mainly based on the conventional machine learning setting. Due to the discrete nature of CO problems, a single model is unlikely to learn the problem entirely. We introduce
ZTop, which stands for Zero Training Overhead Portfolio, a simple yet effective model selection and ensemble mechanism for learning to solve combinatorial problems.
ZTop is inspired by algorithm portfolios, a popular CO ensembling strategy, particularly restart portfolios, which periodically restart a randomized CO algorithm, de facto exploring the search space with different heuristics. We have observed that well-trained models acquired in the same training trajectory, with similar top validation performance, perform well on very different validation instances. Following this observation,
ZTop ensembles a set of well-trained models, each providing a unique heuristic with zero training overhead, and applies them, sequentially or in parallel, to solve the test instances. We show how
ZTopping, i.e., using a ZTop ensemble strategy with a given deep learning approach, can significantly improve the performance of the current state-of-the-art deep learning approaches on three prototypical CO domains, the hardest unique-solution Sudoku instances, challenging routing problems, and the graph maximum cut problem, as well as on multi-label classification, a machine learning task with a large combinatorial label space.

Department of Computer Science, Cornell University, Ithaca, USA. Correspondence to: Yiwei Bai <[email protected]>.
1. Introduction
Deep learning has achieved tremendous success in many areas, including visual object recognition, neural machine translation, and autonomous driving. In contrast, combinatorial optimization problems are challenging for deep learning given that, in general, they are unsupervised or weakly supervised and have a large combinatorial solution space (Bengio et al., 2020; Kool et al., 2018). A classical approach to solving CO problems is to manually design useful heuristics to guide the algorithm in exploring the large combinatorial search space. Nevertheless, a heuristic is typically specialized for a specific problem, and it is not trivial to adapt it to other CO problems. Deep reinforcement learning (DRL) is becoming a go-to approach for learning heuristics for CO problems (Bello et al., 2016; Mazyavkina et al., 2020). The hope is that DRL generalizes well across CO problems by learning heuristics from scratch. The architecture design of a deep learning framework leverages the specific problem structure, e.g., the attention mechanism for routing problems (Kool et al., 2018). However, the model selection method still follows the conventional machine learning style, i.e., it selects only the model with the best validation performance from the same training trajectory. Due to the problems' combinatorial nature, we conjecture that a single model is typically not enough to capture the entire problem structure well and learn a heuristic useful for solving the potentially very different problem instances. We propose a new model selection and ensemble method to tackle this challenge.

The runtime and performance of combinatorial optimization algorithms for a given problem can vary significantly from instance to instance, depending on the heuristics used. Even when solving the same instance, the behavior of a randomized heuristic can vary dramatically. In fact, the runtime distributions of combinatorial optimization algorithms often exhibit heavy tails (Gomes et al., 2000).
To remedy and even exploit the heavy-tailed phenomena and the large performance variance of CO search methods, algorithm portfolios, and in particular restart portfolios, are widely used in the CO community (Gomes & Selman, 2001; Xu et al., 2008; Gomes et al., 1998; Biere & Fröhlich, 2018). Essentially, in restart portfolios, the CO algorithms are periodically restarted to explore different parts of the search space, using randomized or pseudo-randomized heuristics and caching useful run information. De facto, restarts correspond to using thousands of heuristic variants, with low heuristic-switching costs. We have also observed a large variance in the performance of deep learning models on CO problems. In a deep learning setting, we acquire many models during the same training trajectory, and each model can be viewed as a heuristic. However, using thousands of models in the test phase would be computationally intractable. Our goal is to select a few models covering very diverse heuristics.

Figure 1. High-level concept of the ZTop ensemble method. Assume we use a deep learning model to solve Sudoku. The top rectangle represents the Sudoku validation instances; a model (heuristic) can solve the validation instances with the same color. All the well-trained models can solve the pink instances, no model can solve the black instances, and specific models (other colors) can solve the other instances. (a) The model selection mechanism of ZTop: we observe that well-trained models (i.e., with similar validation performance) acquired in the same training trajectory can solve significantly diverse instances. (b) The traditional model selection method, i.e., selecting the model with the best validation performance: due to the combinatorial nature of CO problems, a single model typically cannot learn the problem entirely, so it can only solve few instances. (c) The ZTop method selects several additional models with zero training overhead; these models have similar validation performance, but the test instances they can solve are quite different. Thus, applying these models sequentially or in parallel and taking the best output per instance works well for most instances.
Our Contributions: (1) ZTop, a simple yet effective mechanism to select a diverse set of models covering different heuristics, with zero training overhead (see Fig. 1). The basic idea is to find a few models from the same training trajectory with similar average performance (near the best model's performance) but with high variance with respect to the instances they can solve. Near-best performance ensures these models are well trained, while the variance indicates they potentially capture uniquely different heuristics. (2) ZTop also provides an ensembling technique to leverage the different models. Conventional ensemble methods (averaging the decisions of different models) do not work well, as we show in the experimental section, since one heuristic typically works only for a group of instances, and averaging all the heuristics may cancel out each heuristic's contribution. We therefore propose a new ensemble method that applies each model, sequentially or in parallel, to each test instance and takes the best output per instance. ZTop's training time is identical to that of a single model, since ZTop selects the models from the same training trajectory. At test time, the computational overhead is similar to that of a traditional ensemble: for each test instance, we compute every model's output and take the best one instead of averaging the outputs. (3) ZTop substantially improves the performance of very different learning frameworks on three prototypical CO domains, the hardest unique-solution Sudokus (Chen et al., 2020), routing problems (Kool et al., 2018), and the graph maximum cut problem (Barrett et al., 2020), as well as on multi-label classification (MLC), a machine learning task with a large combinatorial label space, with zero training overhead.
These frameworks cover various architectures and learning methods: for Sudoku, a modified Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) with weakly supervised learning; for routing problems, a pointer network with an attention encoder, trained with reinforcement learning (RL); for Maximum Cut, structure2vector (s2v) with RL; and for MLC, Label Message Passing (LaMP) (Lanchantin et al., 2019) and Higher Order correlation VAE (HotVAE) (Zhao et al., 2021).
2. Related Work
Deep Learning for Combinatorial Optimization (CO):
Several CO deep learning model architectures leverage the problem structure for learning better heuristics. Structure2Vector (s2v) (Dai et al., 2016; Khalil et al., 2017b) was proposed for graph instances: for each node of the graph, a feature vector is learned, capturing the properties of the node and its neighbors. A Pointer Network (PN) (Vinyals et al., 2015) computes a permutation of variable-length sequence data; this model can naturally capture many CO constraints. Graph Neural Networks are also employed for SAT instances (Selsam et al., 2018; Kurin et al., 2019). A modified LSTM, in which each LSTM step's input is multiplied by a constraint graph capturing the Sudoku constraints, performs very well on weakly-supervised learning to solve challenging Sudoku problems (Chen et al., 2020). Creating a large labeled dataset is computationally intractable for CO problems, so deep reinforcement learning or weakly
supervised learning is used to learn CO problems, ranging from routing problems (Bello et al., 2016; Kool et al., 2018) and the maximum cut problem (Khalil et al., 2017a; Barrett et al., 2020) to the minimum vertex cover problem (Khalil et al., 2017a; Song et al., 2020). Several approaches combine deep models with search algorithms (Kool et al., 2018; Deudon et al., 2018; Abe et al., 2019) to further improve the performance.
ZTop can further significantly improve performance on top of these learning approaches.
Ensemble Methods in Deep Learning:
Neural network ensembles (Zhou et al., 2002; Polikar, 2006; Rodriguez et al., 2006) have been widely used to boost model performance in the machine learning community. However, they mainly focus on (weighted) averaging of the models' outputs, and training a set of heterogeneous models requires substantial computational resources. Due to the considerable overhead of traditional ensemble methods, the deep learning community has proposed several remedies. "Implicit" ensembles (Srivastava et al., 2014; Wan et al., 2013; Huang et al., 2016; Singh et al., 2016; Krueger et al., 2016; Han et al., 2017) are an alternative to traditional ensemble methods given their minor overhead in both the training and test phases. One example of an "implicit" ensemble is Dropout (Srivastava et al., 2014), which randomly zeros some hidden neurons during each training step; at test time, every neuron is kept and scaled by the keep probability of the training phase. One explanation of dropout is that a considerable number of models are created by dropping neurons, and these models are implicitly ensembled at test time. Similar to the dropout mechanism, stochastic depth (Huang et al., 2016) randomly drops some layers during training to create different models of various depths. These implicit ensemble methods create many shared-weight models and ensemble them implicitly in the test phase. Several works focus on efficiently acquiring many good models (Huang et al., 2017; Zhang et al., 2020) to decrease the training cost. Snapshot ensembles (Huang et al., 2017) leverage cyclic learning rates (Loshchilov & Hutter, 2016) to force the model to jump from one local minimum to another, and ensemble the models from the different local minima to reduce the training-phase overhead. However, cyclic learning rates can potentially damage the model's performance, and the method is designed only for convolutional neural networks.
Another line of research focuses on reducing the test-time overhead (Buciluǎ et al., 2006; Hinton et al., 2015; Shen et al., 2019; Lan et al., 2018). Distillation (Hinton et al., 2015) employs the ensemble of many models as the training label of a single model (of similar or smaller size). Learning to improve (L2I) (Lu et al., 2019) proposes a new way to leverage models: L2I employs the models to guide a local search, and finds that taking the best result of many models with small rollout steps is better than one model with much larger rollout steps. However, L2I requires training these models separately, and the training method is designed specifically for their learning framework. Ensemble methods for MLC fall into three groups: the first ensembles binary relevance classifiers (Read et al., 2011; Wang et al., 2016); the second ensembles label powerset classifiers (Read et al., 2008; Tsoumakas et al., 2010); and the third uses random forests of predictive clustering trees (Kocev et al., 2007). All the ensemble methods above, except for L2I, rely on (weighted) averaging of the outputs of many models.
Algorithm Portfolios, used by the CO community (Gomes & Selman, 2001; Leyton-Brown et al., 2003; Xu et al., 2008), ensemble different algorithms and solvers to solve a CO problem. In particular, restart portfolios (restarts) are widely used in the SAT community (Gomes et al., 1998; Biere & Fröhlich, 2018), since they are an effective way of combating long- and heavy-tailed runtime distributions (Gomes et al., 2000). They are also used in stochastic optimization to solve non-convex problems (Dick et al., 2014; Gagliolo & Schmidhuber, 2007).
Restarts periodically restart an algorithm, de facto trying different (pseudo-)randomized heuristics, with a low heuristic-switching cost.
Restarts in Deep Learning (DL):
As an example of restarts in DL, the learning rate can be reset based on manually designed metrics to speed up training or improve performance (Gotmare et al., 2018; Loshchilov & Hutter, 2016). DRNets (Chen et al., 2020) also employ restarts with different random seeds in the optimization phase to improve performance. Nevertheless, restarts are not often employed in the deep learning community.
A key feature that differentiates ZTop is that it leverages different models from the same training trajectory in the test phase to improve performance, with zero training overhead.
3. ZTop, a Novel Ensemble Method
ZTop is a novel ensemble method, inspired by restart portfolios, that selects a set of diverse models for learning to solve CO problems while incurring zero training overhead. The selected models are applied sequentially or in parallel to solve the test instances, taking the best output per instance.
ZTop can be used on top of any weakly supervised or reinforcement CO learning framework, or even, in some cases, on top of a supervised framework, to significantly improve performance, with zero training overhead and the same test overhead as traditional ensemble methods. Due to the discrete nature of many CO problems, an oracle heuristic, i.e., a heuristic that works for almost every instance, is unlikely to exist. In the CO community, restarts are widely employed to remedy this issue by periodically switching to a new heuristic. We postulate that it is also difficult for a single deep learning model to perfectly solve all kinds of instances. Surprisingly, we have observed that several models acquired through the same training trajectory contain quite diverse heuristics. These models have similar validation performance, but the subsets of the validation set on which they perform well vary significantly. It is then promising to select several models from the same training trajectory instead of a single one.

Algorithm 1 ZTop method workflow
Input: split dataset D_train, D_val, D_test; performance metric φ; number of ensemble models n
for epoch = 1 to max_epoch do
    Update the model's parameters using D_train
    Test the model on D_val; save the test result along with the model's parameters in val_res
end for
Compute the performance metric φ based on val_res
Select the top n models w.r.t. φ, denoted M_1, ..., M_n
for i = 1 to n do
    Test model M_i on D_test and save the result as res_i
end for
for all x ∈ D_test do
    Take the best output for x among res_i, i = 1 ... n
end for

Selecting diverse models from the same training trajectory. The goal is to select several models with similar validation performance that perform well on different subsets of the validation set. Often, in CO problems, we naturally have a metric φ to validate the model's performance, e.g., the average route length in TSP. Surprisingly, selecting the top-k models with respect to the metric φ on the validation set achieves this goal: the resulting set of models has performance very comparable to the optimal set of models found by enumerating all possible combinations, as we show in the experimental section. In terms of training-time overhead, the only additional operation of ZTop in Algorithm 1 is that we store each model's validation performance along with its parameters.
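The training-phase bookkeeping of Algorithm 1 can be sketched as follows. This is a minimal sketch, not the paper's implementation: `train_one_epoch`, `validate` (returning the metric φ, higher assumed better), and `snapshot` (copying the current parameters) are placeholders for the framework's own routines.

```python
def ztop_select(n_epochs, n_ensemble, train_one_epoch, validate, snapshot):
    """Record each epoch's validation score phi together with a parameter
    snapshot, then keep the top-n checkpoints of the trajectory."""
    val_res = []
    for _ in range(n_epochs):
        train_one_epoch()
        # Storing (phi, parameters) adds no training cost: traditional model
        # selection already has to validate every checkpoint anyway.
        val_res.append((validate(), snapshot()))
    val_res.sort(key=lambda r: r[0], reverse=True)
    return [params for _, params in val_res[:n_ensemble]]
```

The only change relative to conventional model selection is that the top n checkpoints are kept instead of the single best one.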
This additional operation incurs zero training cost, since traditional model selection also needs to validate these models. How to leverage these models is another issue: each model can be viewed as a heuristic, and different heuristics typically work for different groups of instances, so averaging the models' outputs may cancel out each heuristic's contribution.
Ensembling models at test time. ZTop applies its models sequentially or in parallel to solve each test instance and selects the best output per instance (see Alg. 1). For weakly supervised or reinforcement learning (RL) settings, it is natural to have a metric, e.g., the reward used in RL, to select the best output. Consider using a deep model to solve Sudoku: we apply ZTop's models to the test instances sequentially, and if any model solves an instance, there is no need to try the other models. Note that each test instance is fed only once into each model, so our test-time overhead is identical to that of traditional ensemble methods.
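The per-instance best-output ensembling just described can be sketched as follows. Again a minimal sketch with placeholder names: `solve(model, x)` produces a candidate output, `score(x, out)` is the problem's metric (higher assumed better), and the optional `solved(x, out)` predicate enables the sequential early exit used in the Sudoku example.

```python
def ztop_ensemble(models, instances, solve, score, solved=None):
    """Run every selected model on every instance; keep the best output
    per instance (Algorithm 1, test phase)."""
    best = []
    for x in instances:
        best_out, best_val = None, float("-inf")
        for m in models:  # sequential here; the models could also run in parallel
            out = solve(m, x)
            if solved is not None and solved(x, out):
                best_out = out
                break  # instance solved: no need to try the other models
            if score(x, out) > best_val:
                best_out, best_val = out, score(x, out)
        best.append(best_out)
    return best
```

Taking the per-instance best, rather than averaging outputs, is what keeps each model's heuristic from being canceled out.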
4. Experiments
We show the performance of the
ZTop approach on various CO learning frameworks and two MLC learning frameworks. ZTop has zero training overhead; thus, we compare it with methods that share a similar training overhead with us. The first baseline is the conventional machine learning model selection scheme (denoted as
Single), i.e., it selects a single model with the best validation performance. Note that the learning frameworks' training strategies already include proper implicit ensemble methods, so we are also comparing
ZTop with these implicit ensemble methods. In terms of leveraging the models, we also compare
ZTop with the conventional ensemble method, i.e., averaging the output of all models. We denote this baseline as
Average: picking the top models w.r.t. the metric φ and averaging their outputs. The last method is our method, ZTop: selecting the top models w.r.t. the metric φ and taking the best output per instance.

Settings for the learning frameworks. ZTopping a given learning approach involves adapting the released code and following the original paper's settings (for training/test instances, training the models, and testing the models' performance).
Figure 2. Sudoku: (a) a 9 × 9 Sudoku with 17 hints. (b) The solution to (a); the three rectangles represent the three types of constraints (no repetition of digits per row, column, or block).

Sudoku is an NP-hard combinatorial number-placement puzzle on an n × n board. We focus on Sudoku problems on 9 × 9 boards, as they are the most widely studied. The object is to complete a 9 × 9 board, pre-filled with several hints (numbers pre-assigned to cells), with the numbers 1 to 9. In a Sudoku, each row, column, and pre-defined 3 × 3 box cannot have a repeated digit (see Fig. 2). It has been shown that 17 is the minimum number of hints for which a Sudoku has a unique solution (McGuire et al., 2012), so we focus on solving 9 × 9 Sudokus with 17 hints.

Table 1. ZTopping DRNets, on 1,000 17-hint Sudokus. DRNets is the state of the art for unsupervised Sudoku (Chen et al., 2020). Sudoku accuracy of 17-hint Sudokus for single and ensemble models, under two test modes: 0 and 10 optimization steps (the number in parentheses next to each method is the number of optimization steps used in the test mode). We mark ZTop's results in blue and bold the best results. ZTop significantly improves DRNets' performance and outperforms the traditional averaging ensemble method in both test modes.
  Single (0): 0.289    Single (10): 0.387
  ZTop (10), m = 3 / 10 / 20 models: 0.364 / 0.469 / 0.475

We select DRNets (Chen et al., 2020) as the learning framework, since it is the state-of-the-art weakly-supervised Sudoku framework, supervised only by the Sudoku rules. It employs continuous relaxation to convert the discrete constraints into differentiable loss functions; training and testing DRNets only require Sudokus without labels. DRNets assign each digit a learnable embedding, and a tensor represents a Sudoku instance. A modified LSTM computes the missing digits given the Sudoku tensor as input. We create Sudokus with varying numbers of hints for training and validating the models and 1,000 Sudokus with 17 hints for testing them. The other training/validation/test settings also strictly follow the original paper.

To ensure the model candidates are well trained, we save the top models from the training phase in terms of validation performance. We then select models based on the performance metric φ, which in this case is the Sudoku accuracy, i.e., the number of Sudokus a model solves correctly; we do not assign partial credit. We evaluate the ensemble performance for several ensemble sizes.

Sudoku Accuracy.
The results are summarized in Table 1. We consider two test modes of the DRNets framework: the 0-step optimization mode and the 10-step optimization mode. Since the loss function of DRNets is derived only from the Sudoku rules, it can still be optimized on the test instances: the 0-step mode fixes the models' parameters, while the 10-step mode optimizes the loss function for 10 steps in the test phase. In all cases, ZTop substantially outperforms the single model and the standard averaging ensemble. We only require 10 models to achieve 0.469 Sudoku accuracy, considerably higher than the single model's 0.387, reaching 0.475 with 20 models and 10 optimization steps. These results show that a set of diverse models is more effective than a single model at capturing the varied structure of 17-hint Sudoku instances.

Routing problems are intensively studied in the CO community and have many real-world applications (Ross et al., 2019). We consider four routing problems and their variants.
Traveling Salesperson Problems (TSP):
Given a set of cities and the distance between any two cities, the goal is to find the shortest route that traverses every city and returns to the start city.
Vehicle Routing Problems (VRP):
This problem (Toth & Vigo, 2014) is a generalization of the TSP. Given a set of cities and a depot, where each city has a demand, i.e., how many people intend to visit the city, the goal is to find an optimal set of routes that meets all the demands. We focus on two variants of this problem.
Capacitated VRP (CVRP): The total demand of the cities in one route cannot exceed a pre-set threshold.
Split Delivery VRP (SDVRP): This variant has the same constraint as CVRP, but here the demand of a city can be split across multiple routes.
Orienteering Problem (OP) (Golden et al., 1987):
We have a set of cities and know the distance between any two cities. Each city contains a prize. The object is to find a route whose length cannot surpass a threshold while maximizing the sum of the prizes of the cities in the route.
Prize Collecting TSP (PCTSP) (Balas, 1989):
This is a more challenging variant of the TSP. Each city is assigned a prize and a penalty. We need to find a route collecting at least a minimum total prize that minimizes the route length plus the sum of the missed cities' penalties.
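The PCTSP objective just defined can be computed directly from a candidate route. A minimal sketch, not the evaluation code of the AM framework: 2-D Euclidean coordinates, a closed tour, and 0-indexed cities are assumed, and the minimum-total-prize constraint is checked elsewhere.

```python
import math

def tour_length(coords, route):
    # Length of a closed tour over 2-D city coordinates (the TSP objective).
    total = 0.0
    for i in range(len(route)):
        x1, y1 = coords[route[i]]
        x2, y2 = coords[route[(i + 1) % len(route)]]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

def pctsp_objective(coords, route, penalties):
    # PCTSP objective: tour length plus the penalties of the skipped cities.
    missed = set(range(len(coords))) - set(route)
    return tour_length(coords, route) + sum(penalties[i] for i in missed)
```

The same `tour_length` helper also serves as the objective for TSP, and, applied per route, for the VRP variants.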
Stochastic PCTSP (SPCTSP):
This problem is quite similar to PCTSP. The only difference is that the prize of each city is sampled from a fixed distribution: the salesperson only knows the expected prize of each city, and the true prize is revealed upon visiting the city.

We select Kool et al.'s learning framework (Kool et al., 2018) (denoted as AM) for these routing problems, since it is one of the state-of-the-art methods. This framework employs reinforcement learning (the REINFORCE algorithm (Williams, 1992) with baselines) to learn efficient heuristics without supervision; thus, training, validation, and test only require problem instances. The model architecture of AM is the pointer network (Vinyals et al., 2015) with an attention encoder. The graph of the problem is fed into the attention encoder to compute the graph embedding and the node embeddings, and the decoder of the pointer network then generates the policy for the next action. The solution generation process of AM is incremental, i.e., the model selects one node (city) per step, conditioned on all previous actions. The data generation and training processes strictly follow the original paper's settings (Kool et al., 2018). We save the best 100 models from the training phase and evaluate the ensemble performance of 3, 5, 10, and all 100 models. The metric used for each problem is described below.

Figure 3. Performance of ZTop and the state-of-the-art AM baselines. ZTop's results are in blue and the best result for each problem is in bold. The variable n is the number of cities/nodes of the problem; m is the number of models used for the ensemble. A down arrow next to a problem means smaller is better; an up arrow means larger is better. The top table compares all direct methods, i.e., no search included, while the bottom table compares ZTop using all the models against the sampling method. Due to the stochastic nature of SPCTSP, AM (sampling) does not work there; we instead compare with REOPT (half), a C++ ILS-based algorithm proposed in the original paper. ZTopping AM substantially outperforms AM and the conventional ensemble method using only 3 models for the ensemble, and ZTopping AM with all 100 models surprisingly exceeds AM (sampling) with a considerably shorter runtime in most cases.

Top table (Obj. / Time for n = 20, n = 50, n = 100):
TSP ↓
  Single AM               3.85 / 1s    5.81 / 2s    8.11 / 6s
  Average AM (m=3/5/10)   3.85/3.85/3.85, 3s/5s/10s    5.81/5.81/5.81, 6s/10s/20s    8.09/8.09/8.09, 18s/30s/60s
  ZTopping AM (m=3/5/10)  –
CVRP ↓
  Single AM               6.42 / 1s    11.02 / 3s    16.68 / 8s
  Average AM (m=3/5/10)   6.41/6.41/6.41, 3s/5s/10s    11.01/11.01/11.01, 9s/15s/30s    16.66/16.66/16.66, 24s/40s/80s
  ZTopping AM (m=3/5/10)  6.33/6.30/–
SDVRP ↓
  Single AM               6.41 / 1s    10.96 / 4s    16.70 / 11s
  Average AM (m=3/5/10)   6.40/6.40/6.40, 3s/5s/10s    10.95/10.95/10.95, 12s/20s/40s    16.68/16.68/16.68, 33s/55s/110s
  ZTopping AM (m=3/5/10)  6.31/6.29/–
OP ↑
  Single AM               5.18 / 1s    15.42 / 1s    31.21 / 5s
  Average AM (m=3/5/10)   5.19/5.19/5.19, 3s/5s/10s    15.46/15.46/15.46, 3s/5s/10s    31.21/31.20/31.20, 15s/25s/50s
  ZTopping AM (m=3/5/10)  5.27/5.30/–
PCTSP ↓
  Single AM               3.19 / 1s    4.60 / 2s    6.24 / 5s
  Average AM (m=3/5/10)   3.19/3.19/3.19, 3s/5s/10s    4.60/4.60/4.60, 6s/10s/20s    6.24/6.24/6.24, 15s/25s/50s
  ZTopping AM (m=3/5/10)  3.16/–
SPCTSP ↓
  Single AM               3.27 / 1s    4.66 / 2s    6.30 / 5s
  Average AM (m=3/5/10)   3.26/3.26/3.26, 3s/5s/10s    4.66/4.66/4.66, 6s/10s/20s    6.30/6.30/6.30, 15s/25s/50s
  ZTopping AM (m=3/5/10)  3.22/3.20/–

Bottom table (Obj. / Time for n = 20, n = 50, n = 100):
  TSP     ZTopping AM (m=100)  –
  SPCTSP  ZTopping AM (m=100)  –

Results.
We introduce one more baseline for the AM method, since it proposes a sampling method based on the model's output. This method (denoted as AM (sampling)) can generate better solutions at a considerable time cost: it samples many solutions and reports the best result for each problem. We now introduce the metrics (the Obj. column of the table) used to evaluate the performance on each problem; all reported metrics are averaged across the test instances. TSP: length of the route; CVRP, SDVRP: length of the route, subject to the capacity constraint; OP: the sum of the prizes collected in the route, subject to the length constraint; PCTSP, SPCTSP: length of the route plus the penalties of the missed cities. The results are summarized in Fig. 3. The top table compares all the direct methods, i.e., methods that do not use any search to generate solutions. We observe that ZTop significantly outperforms the AM learning framework even with only 3 models. We also compare the AM (sampling) method with ZTop using all 100 models; the results are shown in the bottom table. In most cases, our method also outperforms AM (sampling) with a considerably shorter runtime: in the CVRP (n = 100) problem, ZTop reaches its mean route length in minutes, while AM (sampling) requires hours for a worse result, and in the SPCTSP (n = 100) problem, ZTop likewise finishes in minutes, while the REOPT (half) algorithm requires a substantially longer time of hours and achieves poorer performance. We also observe that the averaging ensemble sometimes even decreases performance, e.g., in the OP (n = 100) problem, and in most cases it achieves only minor improvements.

Figure 4. Approximation ratios of ZTop and the ECO-DQN baselines. ZTop's results are in blue and the best results are in bold. ER and BA refer to the two types of random graphs; m is the number of models used for the ensemble; the variable n on the left and top is the number of vertices of the training-set and test-set graphs, respectively. ZTopping ECO-DQN substantially improves ECO-DQN and outperforms the conventional averaging ensemble method in all cases.

Max-Cut approximation ratio (test graphs with n = 20 / 40 / 60 / 100 / 200 vertices):
ER (n = 20)
  Single ECO-DQN               0.973   0.982   0.969   0.954   0.930
  Average ECO-DQN (m=3/5/10)   0.974/0.973/0.972   0.988/0.988/0.988   0.985/0.984/0.984   0.980/0.980/0.980   0.975/–/–
  ZTopping ECO-DQN (m=3/5/10)  0.979/0.980/–
BA
  Single ECO-DQN               0.958   0.941   0.935   0.929   0.909
  Average ECO-DQN (m=3/5/10)   0.943/0.931/0.927   0.904/0.957/0.880   0.892/0.880/0.867   0.883/0.867/0.848   0.872/0.831/0.808
  ZTopping ECO-DQN (m=3/5/10)  0.976/0.979/–

The graph maximum cut problem is a classical problem in graph theory. A cut of a graph is a partition of the vertex set into two complementary sets, and the number of edges (or the sum of the edge weights) between these two sets is the cut's capacity. The maximum cut is the cut with maximum capacity. We employ the state-of-the-art ECO-DQN (Barrett et al., 2020) as the learning framework for the maximum cut problem. ECO-DQN leverages reinforcement learning (deep Q-learning (Mnih et al., 2015)) to learn useful heuristics without supervision, so, like the AM learning framework, we do not require any solutions to train the model.
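The cut capacity and the approximation ratio reported for this problem can be computed directly. A minimal sketch under assumed conventions (a weighted edge list and a 0/1 side assignment per vertex), not the ECO-DQN evaluation code:

```python
def cut_capacity(weighted_edges, side):
    # Capacity of the cut induced by `side`: total weight of the edges
    # whose endpoints lie on opposite sides of the partition.
    return sum(w for u, v, w in weighted_edges if side[u] != side[v])

def approximation_ratio(weighted_edges, side, best_known):
    # Ratio of the achieved cut capacity to the best known cut value,
    # the metric reported in Fig. 4.
    return cut_capacity(weighted_edges, side) / best_known
```

For a maximization objective like Max-Cut, this ratio is at most 1, and values closer to 1 indicate better solutions.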
The model archi-tecture of ECO-DQN is Message Passing Neural Networks(MPNN) (Gilmer et al., 2017). It represents each vertexas an embedding. Several rounds of message passing up-date the vertices’ embeddings. The probability distributionover all the actions is computed through a readout function.The ECO-DQN proposes the test-time exploration conceptwhere their policy is to improve a given solution instead ofconstructing a solution from scratch. Their original eval-uation protocol is to generate random solutions on thefly, employ the model to improve them and pick the best one. We generate random solutions in advance and en-sure every method is improving the same solutions. Weselect models uniformly from the last quarter of trainingepochs. We evaluate the ensemble performance of , , and models. We report the approximation ratio averaged overall the test instances. Approximation Ratio.
We consider the two types of random graphs used in the original paper (Barrett et al., 2020): Erdős–Rényi graphs (denoted ER) (Erdos et al., 1960) and Barabási–Albert graphs (denoted BA) (Albert & Barabási, 2002) with edge weights belonging to { , , − }. We select the two worst-performing cases of the original paper to show that ZTopping can improve the ECO-DQN learning framework. One case trains the model on ER (n = 20) graphs and tests on ER (n = 20, , , , ) graphs, where n is the number of vertices; the other trains the model on BA graphs and tests on BA (n = 20, , , , ) graphs. The results are summarized in Fig. 4: ZTop substantially improves the performance of ECO-DQN in all scenarios. Furthermore, we observe the same phenomenon as in the routing problems: the averaging ensemble method decreases performance, and the more models used for the averaging ensemble, the lower the performance it achieves. This supports our assumption that averaging may cancel out each heuristic's contribution.

We have shown that ZTop can substantially improve the performance of the three learning approaches above. Moreover, these approaches cover two of the most popular architectures, the pointer network and s2v, and they cover both weakly supervised and reinforcement learning algorithms. The variety of these learning frameworks illustrates the universality of our ZTop method.

Multi-label classification (MLC) is the problem of assigning a set of labels to a given object. MLC can be viewed as a combinatorial classification problem, since the space of potential MLC label sets is the powerset of the set of single labels.
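The combinatorial size of this label space is easy to see: with k single labels there are 2^k candidate label sets. A one-line illustration (the k values are taken from the Figure 5 caption; the powerset arithmetic itself is ours):

```python
# Size of the combinatorial label space for the MLC datasets used here
# (k values from the Figure 5 caption; the arithmetic is illustrative).
datasets = {"Yeast": 14, "Reuters": 90, "Bibtext": 159, "Delicious": 983}
for name, k in datasets.items():
    print(f"{name}: 2^{k} = {2**k:.3e} possible label sets")
```

Even for Yeast (k = 14) there are 16,384 candidate label sets, and for Delicious (k = 983) the space is astronomically large, which is why MLC is treated here as a combinatorial problem.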
Dataset            Metric   Single LaMP   ZTopping LaMP (m=5/10/15)   Single HotVAE   ZTopping HotVAE (m=5/10/15)
Bibtext (k=159)    ebF1     0.4375        0.4393 / 0.4474 / –         –               –
                   miF1     0.4633        0.4686 / 0.4751 / –         –               –
                   maF1     0.3789        0.3763 / 0.3867 / –         –               – / 0.4008 / 0.3979
Reuters (k=90)     ebF1     0.9008        0.9098 / 0.9128 / –         –               –
                   miF1     0.8855        0.8934 / 0.8954 / –         –               –
                   maF1     0.5589        0.5629 / 0.5777 / –         –               – / 0.6181 / 0.5932
Delicious (k=983)  ebF1     0.3469        0.3597 / 0.3616 / –         N/A             N/A
                   miF1     0.3582        0.3707 / 0.3732 / –         N/A             N/A
                   maF1     0.2012        0.2044 / 0.2054 / –         N/A             N/A
Yeast (k=14)       ebF1     0.638         0.6372 / 0.6394 / –         –               –
                   miF1     0.6473        0.6474 / 0.6489 / –         –               – / – / 0.6615
                   maF1     0.4713        0.4774 / – / –              –               –
(– marks values not recoverable from the source.)

Figure 5. Multi-label classification: the performance of ZTop and LaMP/HotVAE on different datasets. ZTop's results are in blue and the best result for each dataset is in bold. m is the number of models used for the ensemble; k is the number of labels. We report the ebF1, miF1, and maF1 scores of these approaches. ZTop significantly improves single LaMP or HotVAE on all the datasets except Yeast; ZTop improves only slightly on Yeast since the dataset is relatively simple, with only 14 labels.
MLC is typically a supervised problem, with no obvious unsupervised metric to identify the best model at test time. The MLC models from the same training trajectory are also not as heterogeneous as those learned for more standard CO problems, such as TSP or Sudoku. Different MLC models may instead focus on different subsets of an object's labels, so an averaging ensemble is a good strategy. Thus, in contrast to the CO problems with well-defined unsupervised metrics, ZTop here selects the top validation-performance models and averages their outputs.

We tested ZTop on two top state-of-the-art learning approaches for MLC, Label Message Passing (LaMP) (Lanchantin et al., 2019) and Higher Order correlation VAE (HotVAE) (Zhao et al., 2021), and on four datasets: Bibtext (Katakis et al., 2008), Reuters (Lewis et al., 2004), Delicious (Bi & Kwok, 2013), and Yeast (Nakai & Kanehisa, 1992). ZTop trains a single MLC architecture and selects the n models with the lowest validation losses; it then runs inference with these n models and averages their sigmoid scores to obtain the final prediction probabilities. We consider ZTop ensembles with 5, 10, and 15 models. To our knowledge, there are no other deep learning ensemble methods for MLC.

F1 scores. We report three kinds of F1 scores: ebF1, miF1, and maF1. The main results are summarized in Fig. 5. ZTopping LaMP/HotVAE significantly improves LaMP/HotVAE on all the datasets except Yeast. Yeast is a simple dataset with only 14 potential labels, so ZTop improves it only slightly.
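The averaging step ZTop uses for MLC can be sketched as follows; the function and the stand-in models are our own illustration (hypothetical names), not the LaMP/HotVAE code:

```python
def ztop_mlc_predict(models, x, threshold=0.5):
    """Average the per-label sigmoid scores of the top-validation models,
    then threshold to obtain the final label set (illustrative sketch)."""
    score_lists = [m(x) for m in models]           # one score vector per model
    num_labels = len(score_lists[0])
    avg = [sum(s[j] for s in score_lists) / len(score_lists)
           for j in range(num_labels)]             # per-label mean probability
    return [int(p >= threshold) for p in avg]

# Three stand-in "models": each returns per-label sigmoid probabilities for x.
models = [lambda x: [0.9, 0.4, 0.2],
          lambda x: [0.8, 0.7, 0.1],
          lambda x: [0.7, 0.6, 0.3]]
print(ztop_mlc_predict(models, None))  # [1, 1, 0]
```

Note that, unlike the sequential best-of-m portfolio used for the CO problems, the models' outputs are blended before thresholding, since here there is no unsupervised metric to pick a single winner per instance.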
Herein we show how close ZTop (k models) is to the optimal k-model ensemble, computed by enumerating all possible k-model combinations out of the top 100 validation models.

Problem          n = 20   n = 50   n = 100   (Approximation Ratio)
TSP              99.99    99.96    99.99
CVRP             99.93    99.95    99.99
SDVRP            99.98    99.99    99.99
OP (distance)    99.87    99.93    99.99
PCTSP            99.96    99.96    99.96
SPCTSP           99.96    98.97    99.99
Figure 6.
The approximation ratio between ZTop (3 best validation models w.r.t. φ) and the optimal 3 models (out of the top 100 validation models). n = number of graph vertices.

We only consider ensembles of size 3, since enumerating all the combinations requires considerable time. We compute the optimal 3-model ensemble (out of the top 100 validation models) for the different routing problems. The approximation ratios (Fig. 6) are close to 100, with a worst case of 98.97, which shows that ZTop's strategy is near-optimal given the top 100 validation models, implicitly selecting good and diverse models (heuristics) for these combinatorial problems.
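The oracle comparison above can be sketched as an exhaustive search; all names and the toy objective matrix below are our own illustration. Each candidate portfolio is scored by taking, per instance, the best (minimum) objective that any of its models achieves:

```python
from itertools import combinations

def portfolio_cost(model_costs, subset):
    """Portfolio objective: for each instance, take the best (minimum) cost
    achieved by any model in the subset, then average over instances."""
    n_instances = len(model_costs[0])
    return sum(min(model_costs[m][i] for m in subset)
               for i in range(n_instances)) / n_instances

def optimal_ensemble(model_costs, k):
    """Enumerate all C(n_models, k) subsets and return the best-scoring one."""
    return min(combinations(range(len(model_costs)), k),
               key=lambda s: portfolio_cost(model_costs, s))

# Toy objective matrix: model_costs[m][i] = tour length of model m on instance i.
model_costs = [
    [10, 12, 11],   # model 0
    [11, 10, 13],   # model 1
    [12, 11, 10],   # model 2
    [10, 11, 12],   # model 3
]
best = optimal_ensemble(model_costs, 2)
print(best, portfolio_cost(model_costs, best))
```

For k = 3 out of 100 models this enumerates C(100, 3) = 161,700 subsets, each evaluated on every test instance, which is why the analysis stops at ensembles of size 3.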
5. Conclusion
We introduce the Zero Training Overhead Portfolio (ZTop) method, a simple yet effective model selection and ensemble mechanism that selects a set of diverse models for learning to solve combinatorial problems. We demonstrate how ZTopping substantially improves the current state-of-the-art deep learning frameworks on the hardest unique-solution Sudokus, a series of routing problems, the graph maximum cut problem, and multi-label classification.
6. Acknowledgements
This research was supported by NSF awards CCF-1522054 (Expeditions in Computing) and CNS-1059284 (Infrastructure). We thank Di Chen, Johan Bjorck, and Ruihan Wu for their valuable feedback.
References
Abe, K., Xu, Z., Sato, I., and Sugiyama, M. Solving NP-hard problems on graphs with extended AlphaGo Zero. arXiv preprint arXiv:1905.11623, 2019.

Albert, R. and Barabási, A.-L. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.

Balas, E. The prize collecting traveling salesman problem. Networks, 19(6):621–636, 1989.

Barrett, T., Clements, W., Foerster, J., and Lvovsky, A. Exploratory combinatorial optimization with reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3243–3250, 2020.

Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

Bengio, Y., Lodi, A., and Prouvost, A. Machine learning for combinatorial optimization: a methodological tour d'horizon. European Journal of Operational Research, 2020.

Bi, W. and Kwok, J. Efficient multi-label classification with many labels. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 405–413, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/bi13.html.

Biere, A. and Fröhlich, A. Evaluating CDCL restart schemes. In Berre, D. L. and Järvisalo, M. (eds.), Proceedings of Pragmatics of SAT 2015, Austin, Texas, USA, September 23, 2015 / Pragmatics of SAT 2018, Oxford, UK, July 7, 2018, volume 59 of EPiC Series in Computing, pp. 1–17. EasyChair, 2018.

Buciluǎ, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541, 2006.

Chen, D., Bai, Y., Zhao, W., Ament, S., Gregoire, J., and Gomes, C. Deep reasoning networks for unsupervised pattern de-mixing with constraint reasoning. In International Conference on Machine Learning, pp. 1500–1509. PMLR, 2020.

Dai, H., Dai, B., and Song, L. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pp. 2702–2711, 2016.

Deudon, M., Cournut, P., Lacoste, A., Adulyasak, Y., and Rousseau, L.-M. Learning heuristics for the TSP by policy gradient. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pp. 170–181. Springer, 2018.

Dick, T., Wong, E., and Dann, C. How many random restarts are enough. Technical report, 2014.

Erdős, P., Rényi, A., et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5(1):17–60, 1960.

Gagliolo, M. and Schmidhuber, J. Learning restart strategies. In IJCAI, pp. 792–797, 2007.

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning, pp. 1263–1272. PMLR, 2017.

Golden, B. L., Levy, L., and Vohra, R. The orienteering problem. Naval Research Logistics (NRL), 34(3):307–318, 1987.

Gomes, C. P. and Selman, B. Algorithm portfolios. Artificial Intelligence, 126(1-2):43–62, 2001.

Gomes, C. P., Selman, B., Kautz, H., et al. Boosting combinatorial search through randomization. AAAI/IAAI, 98:431–437, 1998.

Gomes, C. P., Selman, B., Crato, N., and Kautz, H. Heavy-tailed phenomena in satisfiability and constraint satisfaction problems. Journal of Automated Reasoning, 24(1-2):67–100, 2000.

Gotmare, A., Keskar, N. S., Xiong, C., and Socher, R. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. arXiv preprint arXiv:1810.13243, 2018.

Han, B., Sim, J., and Adam, H. BranchOut: Regularization for online ensemble tracking with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. Deep networks with stochastic depth. In European Conference on Computer Vision, pp. 646–661. Springer, 2016.

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.

Katakis, I., Tsoumakas, G., and Vlahavas, I. Multilabel text classification for automated tag suggestion. ECML PKDD Discovery Challenge 2008, pp. 75, 2008.

Khalil, E., Dai, H., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358, 2017a.

Khalil, E., Dai, H., Zhang, Y., Dilkina, B., and Song, L. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pp. 6348–6358, 2017b.

Kocev, D., Vens, C., Struyf, J., and Džeroski, S. Ensembles of multi-objective decision trees. In European Conference on Machine Learning, pp. 624–631. Springer, 2007.

Kool, W., Van Hoof, H., and Welling, M. Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475, 2018.

Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., Goyal, A., Bengio, Y., Courville, A., and Pal, C. Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305, 2016.

Kurin, V., Godil, S., Whiteson, S., and Catanzaro, B. Improving SAT solver heuristics with graph networks and reinforcement learning. arXiv preprint arXiv:1909.11830, 2019.

Lan, X., Zhu, X., and Gong, S. Knowledge distillation by on-the-fly native ensemble. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31, pp. 7517–7527. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/94ef7214c4a90790186e255304f8fd1f-Paper.pdf.

Lanchantin, J., Sekhon, A., and Qi, Y. Neural message passing for multi-label classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 138–163. Springer, 2019.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., and Shoham, Y. A portfolio approach to algorithm selection. In IJCAI, volume 3, pp. 1542–1543, 2003.

Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.

Lu, H., Zhang, X., and Yang, S. A learning-based iterative method for solving vehicle routing problems. In International Conference on Learning Representations, 2019.

Mazyavkina, N., Sviridov, S., Ivanov, S., and Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. arXiv preprint arXiv:2003.03600, 2020.

McGuire, G., Tugemann, B., and Civario, G. There is no 16-clue Sudoku: Solving the Sudoku minimum number of clues problem. CoRR, abs/1201.0749, 2012. URL http://arxiv.org/abs/1201.0749.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Nakai, K. and Kanehisa, M. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14(4):897–911, 1992.

Polikar, R. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45, 2006.

Read, J., Pfahringer, B., and Holmes, G. Multi-label classification using ensembles of pruned sets. In , pp. 995–1000. IEEE, 2008.

Read, J., Pfahringer, B., Holmes, G., and Frank, E. Classifier chains for multi-label classification. Machine Learning, 85(3):333, 2011.

Rodriguez, J. J., Kuncheva, L. I., and Alonso, C. J. Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10):1619–1630, 2006.

Ross, I. M., Proulx, R. J., and Karpenko, M. Autonomous UAV sensor planning, scheduling and maneuvering: An obstacle engagement technique. In , pp. 65–70. IEEE, 2019.

Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., and Dill, D. L. Learning a SAT solver from single-bit supervision. arXiv preprint arXiv:1802.03685, 2018.

Shen, Z., He, Z., and Xue, X. MEAL: Multi-model ensemble via adversarial learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4886–4893, 2019.

Singh, S., Hoiem, D., and Forsyth, D. Swapout: Learning an ensemble of deep architectures. In Advances in Neural Information Processing Systems, pp. 28–36, 2016.

Song, J., Lanka, R., Yue, Y., and Ono, M. Co-training for policy learning. In Uncertainty in Artificial Intelligence, pp. 1191–1201. PMLR, 2020.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Toth, P. and Vigo, D. Vehicle Routing: Problems, Methods, and Applications. SIAM, 2014.

Tsoumakas, G., Katakis, I., and Vlahavas, I. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, 2010.

Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. Advances in Neural Information Processing Systems, 28:2692–2700, 2015.

Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pp. 1058–1066, 2013.

Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., and Xu, W. CNN-RNN: A unified framework for multi-label image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2285–2294, 2016.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Xu, L., Hutter, F., Hoos, H. H., and Leyton-Brown, K. SATzilla: portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research, 32:565–606, 2008.

Zhang, W., Jiang, J., Shao, Y., and Cui, B. Efficient diversity-driven ensemble for deep neural networks. In , pp. 73–84. IEEE, 2020.

Zhao, W., Kong, S., Bai, J., Fink, D., and Gomes, C. HOT-VAE: Learning high-order label correlation for multi-label classification via attention-based variational autoencoders. In Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.

Zhou, Z.-H., Wu, J., and Tang, W. Ensembling neural networks: many could be better than all. Artificial Intelligence, 137(1-2):239–263, 2002.
7. Appendix
[Figure 7 panels: Graph-9661, Graph-9494, Graph-1837, Graph-7016, Graph-3843, Graph-6022]
Figure 7.
We randomly sample graph instances from the validation set of the TSP-100 problem, then test the top-10 validation models on them and report their results. One bar corresponds to one model; the red bar marks the model that performs best.

We randomly sample graphs from the TSP-100 validation instances to show how differently these models perform on the validation set. The top-10 validation models are tested on these graphs, and we report their results in Fig. 7. The models are sorted on the x-axis by validation performance, with the best validation model leftmost. We can see that the best model differs across graph instances, and for each graph instance the variance of the models' performance is also large. This supports our observations.

Traditional model selection workflow
Input: split dataset D_train, D_val, D_test; performance metric φ
for epoch = 1 to max_epoch do
    Train the model on D_train.
    Test the model on D_val and record the best model M_best.
end for
Test the model M_best on D_test.

We provide the traditional model selection workflow above. It can be compared with our ZTop workflow to show that we incur zero training overhead.
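For contrast, both selection rules can be sketched in a few lines; the function names and validation scores below are our own illustration. The traditional workflow keeps a single argmax-validation checkpoint, while ZTop keeps the top-m checkpoints from the same trajectory without running any additional training:

```python
def traditional_select(val_scores):
    """Keep only the single checkpoint (epoch index) with the best
    validation score, as in the traditional workflow above."""
    return max(range(len(val_scores)), key=lambda e: val_scores[e])

def ztop_select(val_scores, m):
    """Keep the m best-validation checkpoints from the same training
    trajectory; no extra training is run, hence zero training overhead."""
    return sorted(range(len(val_scores)), key=lambda e: val_scores[e],
                  reverse=True)[:m]

# Hypothetical per-epoch validation scores (higher is better).
val_scores = [0.71, 0.85, 0.83, 0.87, 0.86]
print(traditional_select(val_scores))   # 3
print(ztop_select(val_scores, m=3))     # [3, 4, 1]
```

The only change relative to the traditional workflow is what is retained from the checkpoints already produced during training, which is exactly the sense in which the portfolio costs zero additional training.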
Sudoku
The -hint Sudoku data is publicly available. The original site is maintained by the author Gordon Royle:
https://sites.google.com/site/dobrichev/sudoku-puzzle-collections
http://units.maths.uwa.edu.au/~gordon/sudokumin.php