Reinforcement Learning for Combinatorial Optimization: A Survey
Nina Mazyavkina (a,*), Sergey Sviridov (b), Sergei Ivanov (c,a), Evgeny Burnaev (a)
(a) Skolkovo Institute of Science and Technology, Russia
(b) Zyfra, Russia
(c) Criteo AI Lab, France
* Corresponding author: [email protected] (N. Mazyavkina)
ARTICLE INFO

Keywords: reinforcement learning, operations research, combinatorial optimization, value-based methods, policy-based methods

ABSTRACT

Many traditional algorithms for solving combinatorial optimization problems involve using hand-crafted heuristics that sequentially construct a solution. Such heuristics are designed by domain experts and may often be suboptimal due to the hard nature of the problems. Reinforcement learning (RL) proposes a good alternative to automate the search of these heuristics by training an agent in a supervised or self-supervised manner. In this survey, we explore the recent advancements of applying RL frameworks to hard combinatorial problems. Our survey provides the necessary background for the operations research and machine learning communities and showcases the works that are moving the field forward. We juxtapose recently proposed RL methods, laying out the timeline of the improvements for each problem, and compare them with traditional algorithms, indicating that RL models can become a promising direction for solving combinatorial problems.
1. Introduction
Optimization problems are concerned with finding an optimal configuration or "value" among different possibilities, and they naturally fall into one of two buckets: configurations with continuous and with discrete variables. For example, finding a solution to a convex programming problem is a continuous optimization problem, while finding the shortest path among all paths in a graph is a discrete optimization problem. Sometimes the line between the two cannot be drawn that easily. For example, the linear programming task in the continuous space can be regarded as a discrete combinatorial problem because its solution lies in a finite set of vertices of the convex polytope, as has been demonstrated by Dantzig's algorithm [Dantzig and Thapa, 1997]. Conventionally, optimization problems in the discrete space are called combinatorial optimization (CO) problems and typically have different types of solutions compared to the ones in the continuous space. One can formulate a CO problem as follows:
Definition 1.
Let 𝑉 be a set of elements and 𝑓 : 𝑉 → ℝ be a cost function. A combinatorial optimization problem aims to find an optimal value of the function 𝑓 and any corresponding optimal element that achieves that optimal value on the domain 𝑉.

Typically the set 𝑉 is finite, in which case a global optimum exists and, hence, a trivial solution exists for any CO problem by comparing the values of all elements 𝑣 ∈ 𝑉. Note that Definition 1 also includes the case of decision problems, when the solution is binary (or, more generally, multi-class), by associating a higher cost with the wrong answer than with the right one. One common example of a combinatorial problem is the Travelling Salesman Problem (TSP). The goal is to provide the shortest route that visits each vertex and returns to the initial endpoint, or, in other words, to find a Hamiltonian circuit 𝐻 with minimal length in a fully-connected weighted graph. In this case, the set of elements is defined by all Hamiltonian circuits, i.e. 𝑉 = {all Hamiltonian circuits}, and the cost associated with each Hamiltonian circuit is the sum of the weights 𝑤(𝑒) of the edges 𝑒 on the circuit, i.e. 𝑓(𝐻) = Σ_{𝑒 ∈ 𝐻} 𝑤(𝑒). Another example of a CO problem is the Mixed-Integer Linear Program (MILP), for which the objective is to minimize 𝑐⊤𝑥 for a given vector 𝑐 ∈ ℝ^𝑑 such that the vector 𝑥 ∈ ℤ^𝑑 satisfies the constraints 𝐴𝑥 ≤ 𝑏 for the parameters 𝐴 and 𝑏.

Many CO problems are NP-hard and do not have an efficient polynomial-time solution. As a result, many algorithms that solve these problems either approximately or heuristically have been designed. One of the emerging trends of the last years is to solve CO problems by training a machine learning (ML) algorithm. For example, we can
train an ML algorithm on a dataset of already solved TSP instances to decide which node to move to next for new TSP instances. The particular branch of ML that we consider in this survey is reinforcement learning (RL), which, for a given CO problem, defines an environment and an agent that acts in the environment to construct a solution. In order to apply RL to CO, the problem is modeled as a sequential decision-making process, where the agent interacts with the environment by performing a sequence of actions in order to find a solution. The Markov Decision Process (MDP) provides a widely used mathematical framework for modeling this type of problem [Bellman, 1957].
Definition 2.
MDP can be defined as a tuple 𝑀 = ⟨𝑆, 𝐴, 𝑅, 𝑇, 𝛾, 𝐻⟩, where:
• 𝑆 is the state space, 𝑠_𝑡 ∈ 𝑆. The state space for combinatorial optimization problems in this survey is typically defined in one of two ways. One group of approaches constructs solutions incrementally and defines it as a set of partial solutions to the problem (e.g. a partially constructed path for TSP). The other group of methods starts with a suboptimal solution to a problem and iteratively improves it (e.g. a suboptimal tour for TSP).
• 𝐴 is the action space, 𝑎_𝑡 ∈ 𝐴. Actions represent an addition to a partial solution or a change to a complete solution (e.g. changing the order of nodes in a tour for TSP).
• 𝑅 is the reward function, a mapping from states and actions into real numbers, 𝑅: 𝑆 × 𝐴 → ℝ. Rewards indicate how the action chosen in a particular state improves or worsens a solution to the problem (e.g. the tour length for TSP).
• 𝑇 is the transition function 𝑇(𝑠_{𝑡+1} | 𝑠_𝑡, 𝑎_𝑡) that governs the transition dynamics from one state to another in response to an action. In the combinatorial optimization setting the transition dynamics are usually deterministic and known in advance.
• 𝛾 is a scalar discount factor, 0 < 𝛾 ≤ 1. The discount factor encourages the agent to account more for short-term rewards.
• 𝐻 is the horizon, which defines the length of an episode, where an episode is defined as a sequence {𝑠_𝑡, 𝑎_𝑡, 𝑠_{𝑡+1}, 𝑎_{𝑡+1}, 𝑠_{𝑡+2}, ...}_{𝑡=0}^{𝐻}. For methods that construct solutions incrementally, the episode length is defined naturally by the number of actions performed until a solution is found. For iterative methods, some artificial stopping criteria are introduced.

The goal of an agent acting in a Markov Decision Process is to find a policy function 𝜋(𝑠) that maps states into actions. Solving an MDP means finding the optimal policy that maximizes the expected cumulative discounted sum of rewards:

𝜋* = argmax_𝜋 𝔼[ Σ_{𝑡=0}^{𝐻} 𝛾^𝑡 𝑅(𝑠_𝑡, 𝑎_𝑡) ].   (1)

Once an MDP has been defined for a CO problem, we need to decide how the agent searches for the optimal policy 𝜋*. Broadly, there are two types of RL algorithms:
• Value-based methods first compute the action-value function 𝑄^𝜋(𝑠, 𝑎) as the expected reward of a policy 𝜋 given a state 𝑠 and taking an action 𝑎. The agent's policy then corresponds to picking the action that maximizes 𝑄^𝜋(𝑠, 𝑎) for a given state. The main difference between value-based approaches lies in how to estimate 𝑄^𝜋(𝑠, 𝑎) accurately and efficiently.
• Policy-based methods directly model the agent's policy as a parametric function 𝜋_𝜃(𝑠). By collecting previous decisions that the agent made in the environment, also known as experience, we can optimize the parameters 𝜃 by maximizing the final reward (1). The main difference between policy-based methods lies in the optimization approaches for finding the function 𝜋_𝜃(𝑠) that maximizes the expected sum of rewards.

As can be seen, RL algorithms depend on functions that take as input the states of the MDP and output the actions' values or the actions themselves. States represent some information about the problem, such as the given graph or the current tour of TSP, while Q-values or actions are numbers. Therefore an RL algorithm has to include an encoder, i.e., a function that encodes a state into a numerical vector. Many encoders have been proposed for CO problems, including recurrent neural networks, graph neural networks, attention-based networks, and multi-layer perceptrons.
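To make the MDP above concrete, the following minimal sketch models the constructive MDP for TSP: a state is the partial tour, an action picks the next unvisited node, and the reward is the negative length of the added edge (closing the cycle at the end). This is only an illustration; the class name TSPEnv and its interface are hypothetical and not taken from any surveyed paper.

```python
import numpy as np

class TSPEnv:
    """Minimal constructive-TSP MDP sketch: state = partial tour, action = next node."""

    def __init__(self, coords):
        self.coords = np.asarray(coords)   # (n, 2) city coordinates
        self.n = len(self.coords)
        self.reset()

    def reset(self):
        self.tour = [0]                    # start the tour at node 0
        return tuple(self.tour)            # state: the partial tour built so far

    def actions(self):
        return [v for v in range(self.n) if v not in self.tour]  # unvisited nodes

    def step(self, action):
        prev = self.tour[-1]
        self.tour.append(action)
        reward = -np.linalg.norm(self.coords[prev] - self.coords[action])
        done = len(self.tour) == self.n
        if done:                           # close the cycle back to the start node
            reward -= np.linalg.norm(self.coords[action] - self.coords[self.tour[0]])
        return tuple(self.tour), reward, done
```

The total (undiscounted) reward of an episode is then exactly the negative tour length, which is the reward signal used by most constructive approaches surveyed below.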
Figure 1:
Solving a CO problem with the RL approach requires formulating an MDP. The environment is defined by a particular instance of the CO problem (e.g. the Max-Cut problem). States are encoded with a neural network model (e.g. every node has a vector representation encoded by a graph neural network). The agent is driven by an RL algorithm (e.g. Monte-Carlo Tree Search) and makes decisions that move the environment to the next state (e.g. removing a vertex from a solution set).
To sum up, a pipeline for solving a CO problem with RL is presented in Figure 1. A CO problem is first reformulated in terms of an MDP, i.e., we define the states, actions, and rewards for a given problem. We then define an encoder of the states, i.e. a parametric function that encodes the input states and outputs a numerical vector (Q-values or probabilities of each action). The next step is the actual RL algorithm that determines how the agent learns the parameters of the encoder and makes the decisions for a given MDP. After the agent has selected an action, the environment moves to a new state and the agent receives a reward for the action it has made. The process then repeats from the new state within the allocated time budget. Once the parameters of the model have been trained, the agent is capable of searching for solutions to unseen instances of the problem.

Our work is motivated by the recent success in the application of the techniques and methods of the RL field to solving CO problems. Although many practical combinatorial optimization problems can, in principle, be solved by reinforcement learning algorithms, with relevant literature existing in the operations research community, we will focus on RL approaches for CO problems. This survey covers the most recent papers that show how reinforcement learning algorithms can be applied to reformulate and solve some of the canonical optimization problems, such as the Travelling Salesman Problem (TSP), the Maximum Cut (Max-Cut) problem, Maximum Independent Set (MIS), Minimum Vertex Cover (MVC), and the Bin Packing Problem (BPP).
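Putting the pieces of Figure 1 together, the interaction loop can be sketched as follows. This reuses the hypothetical TSPEnv from the sketch above; the random policy is only a stand-in for a trained encoder plus RL agent.

```python
import random

def random_policy(state, actions):
    """Placeholder for an encoder + RL agent: pick any feasible action."""
    return random.choice(actions)

def rollout(env, policy):
    """Run one episode of the MDP and return the accumulated reward (negative tour length)."""
    state, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = policy(state, env.actions())
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

# env = TSPEnv(coords=[(0, 0), (0, 1), (1, 1), (1, 0)])
# print(rollout(env, random_policy))   # -reward is the tour length
```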
Related work.
Some of the recent surveys also describe the intersection of machine learning and combinatorial optimization. For instance, a comprehensive survey by [Bengio et al., 2020] has summarized the approaches that solve CO problems from the perspective of general ML, and the authors have discussed possible ways of combining ML heuristics with existing off-the-shelf solvers. Moreover, the work by [Zhou et al., 2018], which is devoted to the description and possible applications of GNNs, has described the progress on formulating CO problems from the GNN perspective in one of its sections. Finally, the more recent surveys by [Vesselinova et al., 2020] and [Guo et al., 2019] describe the latest ML approaches to solving CO tasks, in addition to possible applications of such methods. We note that our survey is complementary to the existing ones, as we focus on RL approaches, provide the necessary background and a classification of the RL models, and make a comparison between different RL methods and existing solutions.
Paper organization.
The remainder of this survey is organized as follows. In section 2, we provide the necessary background, including the formulation of CO problems, different encoders, and RL algorithms that are used for solving CO with RL. In section 3 we provide a classification of the existing RL-CO methods based on popular design choices such as the type of RL algorithm. In section 4 we describe the recent RL approaches for the specific CO problems, providing the details about the formulated MDPs as well as their influence on other works. In section 5 we make a comparison between the RL-CO works and the existing traditional approaches. We conclude and provide future directions in section 6.
2. Background
In this section, we provide definitions of combinatorial problems and of the state-of-the-art algorithms and heuristics that solve these problems. We also describe machine learning models that encode states of CO problems for an RL agent. Finally, we categorize popular RL algorithms that have been employed recently for solving CO problems.
We start by considering mixed-integer linear programs (MILP) – a constrained optimization problem to which many practical applications can be reduced. Several industrial optimizers (e.g. [CPLEX, 1987; Gleixner et al., 2017; Gurobi Optimization, 2020; The Sage Developers, 2020; Makhorin; Schrage, 1986]) exist that use a branch-and-bound technique to solve MILP instances.
Definition 3 (Mixed-Integer Linear Program (MILP) [Wolsey, 1998]). A mixed-integer linear program is an optimization problem of the form

arg min_𝐱 { 𝐜⊤𝐱 | 𝐀𝐱 ≤ 𝐛, 0 ≤ 𝐱, 𝐱 ∈ ℤ^𝑝 × ℝ^(𝑛−𝑝) },

where 𝐜 ∈ ℝ^𝑛 is the objective coefficient vector, 𝐀 ∈ ℝ^(𝑚×𝑛) is the constraint coefficient matrix, 𝐛 ∈ ℝ^𝑚 is the constraint vector, and 𝑝 ≤ 𝑛 is the number of integer variables.

Next, we provide formulations of the combinatorial optimization problems, their time complexity, and the state-of-the-art algorithms for solving them.

Definition 4 (Traveling Salesman Problem (TSP)). Given a complete weighted graph 𝐺 = (𝑉, 𝐸), find a tour of minimum total weight, i.e. a cycle of minimum length that visits each node of the graph exactly once.

TSP is a canonical example of a combinatorial optimization problem, which has found applications in planning, data clustering, genome sequencing, etc. [Applegate et al., 2006]. TSP is NP-hard [Papadimitriou and Steiglitz, 1998], and many exact, heuristic, and approximation algorithms have been developed in order to solve it. The best known exact algorithm is the Held–Karp algorithm [Held and Karp, 1962]. Published in 1962, it solves the problem in time 𝑂(𝑛²2^𝑛), which has not been improved in the general setting since then. TSP can be formulated as a MILP instance [Dantzig et al., 1954; Miller et al., 1960], which allows one to apply MILP solvers, such as Gurobi [Gurobi Optimization, 2020], in order to find exact or approximate solutions to TSP. Among them, Concorde [Applegate et al., 2006] is a specialized TSP solver that uses a combination of cutting-plane algorithms with a branch-and-bound approach. Similarly, an extension of the Lin-Kernighan-Helsgaun TSP solver (LKH3) [Helsgaun, 2017], which improves the Lin-Kernighan algorithm [Lin and Kernighan, 1973], is a tour improvement method that iteratively decides which edges to rewire to decrease the tour length. More generic solvers that avoid local optima exist, such as OR-Tools [Perron and Furnon, 2019], which tackles vehicle routing problems through local search algorithms and metaheuristics. In addition to solvers, many heuristic algorithms have been developed, such as the Christofides-Serdyukov algorithm [Christofides, 1976; van Bevern and Slugina, 2020], the Lin-Kernighan-Helsgaun heuristic [Helsgaun, 2000], and 2-OPT local search [Mersmann et al., 2012]. [Applegate et al., 2006] provide an extensive overview of various approaches to TSP.

Definition 5 (Maximum Cut Problem (Max-Cut)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of vertices 𝑆 ⊂ 𝑉 that maximizes the cut 𝐶(𝑆, 𝐺) = Σ_{𝑖 ∈ 𝑆, 𝑗 ∈ 𝑉∖𝑆} 𝑤_𝑖𝑗, where 𝑤_𝑖𝑗 ∈ 𝑊 is the weight of the edge connecting vertices 𝑖 and 𝑗.

Max-Cut solutions have found numerous applications in real-life problems, including protein folding [Perdomo-Ortiz et al., 2012], financial portfolio management [Elsokkary et al., 2017], and finding the ground state of the Ising
Hamiltonian in physics [Barahona, 1982]. Max-Cut is an NP-complete problem [Karp, 1972] and, hence, does not have a known polynomial-time algorithm. Approximation algorithms exist for Max-Cut, including a deterministic 0.5-approximation [Mitzenmacher and Upfal, 2005; Gonzalez, 2007] and a randomized 0.878-approximation [Goemans and Williamson, 1995]. Industrial solvers can be used to find a solution by applying branch-and-bound routines. In particular, the Max-Cut problem can be transformed into a quadratic unconstrained binary optimization problem and solved by CPLEX [CPLEX, 1987], which can take up to an hour for graph instances with hundreds of vertices [Barrett et al., 2020]. For larger instances, several heuristics using the simulated annealing technique have been proposed that can scale to graphs with thousands of vertices [Yamamoto et al., 2017; Tiunov et al., 2019; Leleu et al., 2019].
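As a concrete illustration of the simplest baseline mentioned above, the randomized 0.5-approximation assigns every vertex to one side of the cut with probability 1/2. The sketch below assumes an edge-weight dictionary as input; the function names are illustrative.

```python
import random

def cut_value(weights, side):
    """Sum of weights of edges crossing the cut; `weights` maps (i, j) -> w_ij."""
    return sum(w for (i, j), w in weights.items() if side[i] != side[j])

def random_cut(vertices, weights):
    """Randomized 0.5-approximation: each vertex joins S with probability 1/2.

    Every edge crosses the cut with probability 1/2, so the expected cut value
    is at least half of the optimum."""
    side = {v: random.random() < 0.5 for v in vertices}
    return side, cut_value(weights, side)

# Example: a weighted triangle.
# side, value = random_cut([0, 1, 2], {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 3.0})
```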
Definition 6 (Bin Packing Problem (BPP)). Given a set 𝐼 of items, a size 𝑠(𝑖) ∈ ℤ⁺ for each 𝑖 ∈ 𝐼, and a positive integer bin capacity 𝐵, find a partition of 𝐼 into disjoint sets 𝐼_1, …, 𝐼_𝐾 such that the sum of the sizes of the items in each 𝐼_𝑗 is less than or equal to 𝐵 and 𝐾 has the smallest possible value.

There are other variants of BPP, such as 2D and 3D packing, packing with various surface areas, packing by weights, and others [Wu et al., 2010]. This CO problem has found its applications in many domains such as resource optimization, logistics, and circuit design [Kellerer et al., 2004]. BPP is an NP-complete problem with many approximation algorithms proposed in the literature. First-fit decreasing (FFD) and best-fit decreasing (BFD) are two simple approximation algorithms that first sort the items in decreasing order of their sizes and then assign each item to the first (for FFD) or the fullest (for BFD) bin that it fits into. Both FFD and BFD run in 𝑂(𝑛 log 𝑛) time and have an 11/9 asymptotic performance guarantee [Korte et al., 2012]. Among exact approaches, one of the first attempts has been the Martello-Toth algorithm, which works under the branch-and-bound paradigm [Martello and Toth, 1990a,b]. In addition, several recent improvements have been proposed [Schreiber and Korf, 2013; Korf, 2003] which can run on instances with hundreds of items. Alternatively, BPP can be formulated as a MILP instance [Wu et al., 2010; Chen et al., 1995] and solved using standard MILP solvers such as Gurobi [Gurobi Optimization, 2020] or CPLEX [CPLEX, 1987].

Definition 7 (Minimum Vertex Cover (MVC)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of nodes 𝑆 ⊂ 𝑉 such that every edge is covered, i.e. (𝑢, 𝑣) ∈ 𝐸 ⟹ 𝑢 ∈ 𝑆 or 𝑣 ∈ 𝑆, and |𝑆| is minimized.

Vertex cover optimization is a fundamental problem with applications to computational biochemistry [Lancia et al., 2001] and computer network security [Filiol et al., 2007]. There is a naïve approximation algorithm with a factor of 2, which works by adding both endpoints of an arbitrary edge to the solution and then removing these endpoints from the graph [Papadimitriou and Steiglitz, 1998]. A better approximation algorithm with a factor of 2 − Θ(1/√(log |𝑉|)) is known [Karakostas, 2009], although it has been shown that MVC cannot be approximated within a factor of √2 − 𝜀 for any 𝜀 > 0 [Dinur and Safra, 2005; Subhash et al., 2018]. The problem can be formulated as an integer linear program (ILP) by minimizing Σ_{𝑣 ∈ 𝑉} 𝑐(𝑣) 𝑥_𝑣, where 𝑥_𝑣 ∈ {0, 1} denotes whether a node 𝑣 with a weight 𝑐(𝑣) is in the solution set, subject to 𝑥_𝑢 + 𝑥_𝑣 ≥ 1 for every edge (𝑢, 𝑣) ∈ 𝐸. Solvers such as CPLEX [CPLEX, 1987] or Gurobi [Gurobi Optimization, 2020] can be used to solve ILP formulations with hundreds of thousands of nodes [Akiba and Iwata, 2016].

Definition 8 (Maximum Independent Set (MIS)). Given a graph 𝐺 = (𝑉, 𝐸), find a subset of vertices 𝑆 ⊂ 𝑉 such that no two vertices in 𝑆 are connected by an edge of 𝐸 and |𝑆| is maximized.

MIS is a popular CO problem with applications in classification theory, molecular docking, recommendations, and more [Feo et al., 1994; Gardiner et al., 2000; Agrawal et al., 1996]. As such, the approaches to finding solutions for this problem have received a lot of attention from the academic community. It is easy to see that the complement of an independent set in a graph 𝐺 is a vertex cover in 𝐺, and that an independent set in 𝐺 is a clique in the complement graph Ḡ; hence, algorithms for the minimum vertex cover in 𝐺 or the maximum clique in Ḡ can be applied to solve the MIS problem.
The running time of the brute-force algorithm is 𝑂(𝑛²2^𝑛), which has been improved by [Tarjan and Trojanowski, 1977] to 𝑂(2^(𝑛/3)), and recently to the best known bound 𝑂(1.1996^𝑛) with polynomial space [Xiao and Nagamochi, 2017]. To cope with medium and large instances of MIS, several local search and evolutionary algorithms have been proposed. The local search algorithms maintain a solution set, which is iteratively updated by adding and removing nodes that improve the current objective value [Andrade et al., 2008; Katayama et al., 2005; Hansen et al., 2004; Pullan and Hoos, 2006]. In contrast, the evolutionary algorithms maintain several independent sets at the current iteration, which are then merged or pruned based on some fitness criteria [Lamm et al., 2015; Borisovsky and Zavolovskaya, 2003; Back and Khuri, 1994]. Hybrid approaches exist that combine the evolutionary algorithms with the local search and are capable of solving instances with hundreds of thousands of vertices [Lamm et al., 2016].
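The add/remove local-search idea described above can be illustrated in its simplest form as follows. This is a toy sketch, not the algorithm of any cited paper: it greedily adds vertices that keep the set independent and perturbs the set by random removals when no insertion is possible.

```python
import random

def local_search_mis(adj, iterations=1000, seed=0):
    """Toy local search for MIS: greedy insertions plus random removals (perturbation).

    adj maps each vertex to the set of its neighbours."""
    rng = random.Random(seed)
    current, best = set(), set()
    for _ in range(iterations):
        candidates = [v for v in adj if v not in current and not (adj[v] & current)]
        if candidates:
            current.add(rng.choice(candidates))           # greedy insertion keeps independence
        elif current:
            current.discard(rng.choice(sorted(current)))  # perturbation: drop a vertex
        if len(current) > len(best):
            best = set(current)
    return best

# best = local_search_mis({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}})   # e.g. {0, 2} or {1, 3}
```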
In order to approach the outlined problems with reinforcement learning, we must represent the graphs involved in the problems as vectors that can be provided as input to a machine learning algorithm. Next, we discuss different approaches for learning the representations of these problems.
In order to process the input structure 𝑆 (e.g. graphs) of CO problems, we must define a mapping from 𝑆 to a 𝑑-dimensional space ℝ^𝑑. We call such a mapping an encoder, as it encodes the original input space. The encoders vary depending on the particular type of the space 𝑆, but there are some common architectures that researchers have developed over the last years to solve CO problems.

Figure 2: The scheme for a recurrent neural network (RNN). Each box represents an encoding function. Each element in the sequence is encoded using its initial representation and the output of the model at the previous step. RNN parameters are shared across all elements of the sequence.

The first frequently used architecture is a recurrent neural network (RNN). RNNs can operate on sequential data, encoding each element of the sequence into a vector. In particular, the RNN is composed of a block of parameters that takes as input the current element of the sequence and the previous output of the RNN block, and outputs a vector that is passed to the next element of the sequence. For example, in the case of TSP, one can encode a tour of TSP by applying the RNN to the current node (e.g. initially represented by a constant 𝑑-dimensional vector) and the output of the RNN on the previous node of the tour. One can stack multiple blocks of RNNs together, making the neural network deep. Popular choices of RNN blocks are the Long Short-Term Memory (LSTM) unit [Hochreiter and Schmidhuber, 1997] and the Gated Recurrent Unit (GRU) [Cho et al., 2014], which tackle the vanishing gradient problem [Goodfellow et al., 2016].

One of the fundamental limitations of RNN models is related to the modeling of long-range dependencies: as the model takes the output of the last time step, it may "forget" the information from the previous elements of the sequence. Attention models fix this by forming a connection not just to the last input element, but to all input elements. Hence, the output of the attention model depends on the current element of the sequence and all previous elements of the sequence. In particular, similarity scores (e.g. dot products) are computed between the input element and each of the previous elements, and these scores are used to determine the weights of the importance of each of the previous elements to the current element. Attention models have recently achieved superior performance on language modeling tasks (e.g. language translation) [Vaswani et al., 2017] and have been applied to solving CO problems (e.g. for incrementally building a tour for TSP).

Figure 3: The scheme for a pointer network. Element "B" in the sequence first computes similarity scores to all other elements. Next we encode the representation of "B" using the element with the maximum value ("A" in this case, dashed). This process is then repeated for other elements in the sequence.

Note that the attention model relies on modeling dependencies between each pair of elements in the input structure, which can be inefficient if there are only a few relevant dependencies. One simple extension of the attention model is the pointer network (PN) [Vinyals et al., 2015]. Instead of using the weights among all pairs for the computation of the influence of each input element, pointer networks use the weights to select a single input element that will be used for encoding.
For example, in Figure 3 the element "A" has the highest similarity to the element "B" and, therefore, it is used for the computation of the representation of element "B" (unlike the attention model, where the elements "C" and "D" are also used).

Although these models are general enough to be applied to various spaces 𝑆 (e.g. points for TSP), many CO problems studied in this paper are associated with graphs.
Figure 4:
A classification of reinforcement learning methods.

A natural continuation of the attention models to the graph domain is the graph neural network (GNN). Initially, the nodes are represented by some vectors (e.g. constant unit vectors). Then, each node's representation is updated depending on the local neighborhood structure of this node. In the most common message-passing paradigms, adjacent nodes exchange their current representations in order to update them at the next iteration. One can see this framework as a generalization of the attention model, where the elements do not attend to all of the other elements (forming a fully-connected graph), but only to the elements that are linked in the graph. Popular choices of GNN models include the Graph Convolutional Network (GCN) [Kipf and Welling, 2017], the Graph Attention Network (GAT) [Veličković et al., 2018], the Graph Isomorphism Network (GIN) [Xu et al., 2018], and the Structure-to-Vector Network (S2V) [Dai et al., 2016].

While there are many intrinsic details about all of these models, at a high level it is important to understand that all of them are differentiable functions optimized by gradient descent that return encoded vector representations, which can then be used by the RL agent.
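A minimal sketch of one message-passing update of the kind described above is given below, in plain NumPy with mean aggregation over neighbours. The function name and the choice of aggregation are illustrative and not tied to any specific GNN cited here.

```python
import numpy as np

def message_passing_step(adjacency, node_features, weight_self, weight_neigh):
    """One GNN layer: combine each node's features with the mean of its neighbours' features.

    adjacency: (n, n) 0/1 matrix; node_features: (n, d); weights: (d, d_out)."""
    degrees = adjacency.sum(axis=1, keepdims=True).clip(min=1)
    neighbour_mean = adjacency @ node_features / degrees
    return np.maximum(node_features @ weight_self + neighbour_mean @ weight_neigh, 0.0)  # ReLU

# Example: 3-node path graph, 4-dimensional features, 8-dimensional output.
# A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
# H = message_passing_step(A, np.ones((3, 4)), np.random.randn(4, 8), np.random.randn(4, 8))
```

Stacking several such layers lets information propagate over longer paths in the graph, after which the node vectors can be pooled into a state representation for the RL agent.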
In the introduction (section 1) we gave the definition of an MDP, which includes the states, actions, rewards, and transition functions. We also explained what the policy of an agent is and what the optimal policy is. Here we take a deeper look at the RL algorithms that search for the optimal policy of an MDP.

Broadly, the RL algorithms can be split into the model-based and model-free categories (Figure 4).
• Model-based methods focus on the environments where the transition function is known or can be learned, and can be utilized by the algorithm when making decisions. This group includes Monte-Carlo Tree Search (MCTS) algorithms such as AlphaZero [Silver et al., 2016] and MuZero [Schrittwieser et al., 2019].
• Model-free methods do not rely on the availability of the transition function of the environment and utilize solely the experience collected by the agent.
Furthermore, model-free methods can be split into two big families of RL algorithms: policy-based and value-based methods. This partition is motivated by the way a solution of an MDP is derived. In the case of policy-based methods, a policy is approximated directly, while value-based methods focus on approximating a value function, which is a measure of the quality of the policy for some state-action pair in the given environment. Additionally, there are RL algorithms that combine policy-based with value-based methods; such methods are called actor-critic methods [Sutton et al., 2000; Mnih et al., 2016]. The basic principle behind these algorithms is for the critic model to approximate the value function and for the actor model to approximate the policy. To do this, the actor and the critic usually rely on the policy-based and value-based RL methods mentioned above. This way, the critic provides a measure of how good the action taken by the actor has been, which allows the algorithm to appropriately adjust the learnable parameters at the next training step.

Next, we formally describe the value-based, policy-based, and MCTS approaches and the corresponding RL algorithms that have been used to solve CO problems.
As mentioned earlier, the main goal of all reinforcement learning methods is to find a policy that consistently allows the agent to gain a large reward. Value-based reinforcement learning methods search for such a policy through the approximation of a value function 𝑉(𝑠) and an action-value function 𝑄(𝑠, 𝑎). In this section, we define both of these functions, explain which value and action-value functions are called optimal, and show how the optimal policy can be derived once the optimal value functions are known.

Definition 9.
Value function of a state 𝑠 is the expectation of the future discounted rewards when starting from the state 𝑠 and following some policy 𝜋:

𝑉^𝜋(𝑠) = 𝔼[ Σ_{𝑡=0}^{∞} 𝛾^𝑡 𝑟(𝑠_𝑡) | 𝜋, 𝑠_0 = 𝑠 ].   (2)

The notation 𝑉^𝜋 here and in the following sections means that the value function 𝑉 is defined with respect to the policy 𝜋. It is also important to note that the value of a terminal state in the case of a finite MDP equals 0.

At the same time, it can be more convenient to think of the value function as a function depending not only on the state but also on the action.

Definition 10.
Action-value function 𝑄(𝑠, 𝑎) is the expectation of the future discounted rewards when starting from the state 𝑠, taking the action 𝑎, and then following some policy 𝜋:

𝑄^𝜋(𝑠, 𝑎) = 𝔼[ Σ_{𝑡=0}^{∞} 𝛾^𝑡 𝑟(𝑠_𝑡, 𝑎_𝑡) | 𝜋, 𝑠_0 = 𝑠, 𝑎_0 = 𝑎 ].   (3)

It is also clear that 𝑉^𝜋(𝑠) can be interpreted in terms of 𝑄^𝜋(𝑠, 𝑎) as 𝑉^𝜋(𝑠) = max_𝑎 𝑄^𝜋(𝑠, 𝑎).

From the definition of a value function comes a very important recursive property, representing the relationship between the value of the state 𝑉^𝜋(𝑠) and the values of the possible following states 𝑉^𝜋(𝑠′), which lies at the foundation of many value-based RL methods. This property can be expressed as an equation, called the Bellman equation [Bellman, 1952]:

𝑉^𝜋(𝑠) = 𝑟(𝑠) + 𝛾 Σ_{𝑠′} 𝑇(𝑠, 𝜋(𝑠), 𝑠′) 𝑉^𝜋(𝑠′).   (4)

The Bellman equation can also be rewritten in terms of the action-value function 𝑄^𝜋(𝑠, 𝑎) in the following way:

𝑄^𝜋(𝑠, 𝑎) = 𝑟(𝑠, 𝑎) + 𝛾 Σ_{𝑠′} 𝑇(𝑠, 𝑎, 𝑠′) max_{𝑎′} 𝑄^𝜋(𝑠′, 𝑎′).   (5)

At the beginning of this section, we stated that the goal of all RL tasks is to find a policy that can accumulate a large reward. This means that one policy can be better than (or equal to) another if the expected return
of this policy is greater than the one achieved by the other policy: 𝜋′ ≥ 𝜋. Moreover, by the definition of a value function, we can claim that 𝜋′ ≥ 𝜋 if and only if 𝑉^{𝜋′}(𝑠) ≥ 𝑉^𝜋(𝑠) in all states 𝑠 ∈ 𝑆.

Knowing this relationship between policies, we can state that there is a policy that is better than or equal to all other possible policies. This policy is called an optimal policy 𝜋*. Evidently, the optimality of the action-value and value functions is closely connected to the optimality of the policy they follow. This way, the value function of an MDP is called optimal if it is the maximum of the value functions across all policies:

𝑉*(𝑠) = max_𝜋 𝑉^𝜋(𝑠), ∀ 𝑠 ∈ 𝑆.

Similarly, we can define the optimal action-value function 𝑄*(𝑠, 𝑎):

𝑄*(𝑠, 𝑎) = max_𝜋 𝑄^𝜋(𝑠, 𝑎), ∀ 𝑠 ∈ 𝑆, ∀ 𝑎 ∈ 𝐴.

Given the Bellman equations (4) and (5), one can derive the optimal policy if the action-value or value functions are known. In the case of a value function 𝑉*(𝑠), one can find optimal actions by doing a greedy one-step search: picking the actions that correspond to the maximum value 𝑉*(𝑠) in the state 𝑠 computed by the Bellman equation (4). On the other hand, in the case of the action-value function, the one-step search is not needed: for each state 𝑠 we can easily find the action 𝑎 that maximizes the action-value function, as in order to do that we just need to compute 𝑄*(𝑠, 𝑎). This way, we do not need any information about the rewards and values in the following states 𝑠′, in contrast with the value function.

Therefore, in the case of value-based methods, in order to find the optimal policy, we need to find the optimal value functions. Notably, it is possible to explicitly solve the Bellman equation, i.e. find the optimal value function, but only when the transition function is known. In practice, this is rarely the case, so we need methods to approximate the solution of the Bellman equation.
• Q-learning. One of the popular representatives of the approximate value-based methods is Q-learning [Watkins and Dayan, 1992] and its deep variant, Deep Q-learning [Mnih et al., 2015]. In Q-learning, the action-value function 𝑄(𝑠, 𝑎) is iteratively updated by learning from the collected experiences of the current policy. It has been shown in [Sutton, 1988, Theorem 3] that the function updated by such a rule converges to the optimal value function.
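The tabular update rule behind Q-learning can be sketched in a few lines. This is a generic illustration with a hypothetical learning rate alpha, not code from any surveyed paper.

```python
from collections import defaultdict

def q_learning_update(q_table, state, action, reward, next_state, next_actions,
                      alpha=0.1, gamma=0.99):
    """One Q-learning step: move Q(s, a) towards the target r + gamma * max_a' Q(s', a')."""
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])

# q = defaultdict(float)   # Q-values default to 0 for unseen state-action pairs
# q_learning_update(q, s, a, r, s_next, available_actions)
```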
• DQN. With the rise of deep learning, neural networks (NNs) have proven to achieve state-of-the-art results on various datasets by learning useful function approximations from high-dimensional inputs. This led researchers to explore the potential of NN approximations of the Q-functions. Deep Q-networks (DQN) [Mnih et al., 2015] can learn the policies directly using end-to-end reinforcement learning. The network approximates the Q-values for each action depending on the current input state. In order to stabilize the training process, the authors have used the following formulation of the loss function:

𝐿(𝜃_𝑖) = 𝔼_{(𝑠,𝑎,𝑟,𝑠′)∼𝐷} [ ( 𝑟 + 𝛾 max_{𝑎′} 𝑄_{𝜃_𝑖^−}(𝑠′, 𝑎′) − 𝑄_{𝜃_𝑖}(𝑠, 𝑎) )² ],   (6)

where 𝐷 is a replay memory buffer used to store collected (𝑠, 𝑎, 𝑟, 𝑠′) transitions. Equation (6) is the mean-squared error between the current approximation of the Q-function and the maximized target value 𝑟 + 𝛾 max_{𝑎′} 𝑄_{𝜃_𝑖^−}(𝑠′, 𝑎′). The training of DQN has been shown to be more stable and, consequently, DQN has been effective for many RL problems, including RL-CO problems.

In contrast to the value-based methods, which aim to find the optimal state-action value function 𝑄*(𝑠, 𝑎) and act greedily with respect to it to obtain the optimal policy 𝜋*, policy-based methods attempt to directly find the optimal policy, represented by some parametric function 𝜋*_𝜃, by optimizing (1) with respect to the policy parameters 𝜃: the method collects experiences in the environment using the current policy and optimizes the policy utilizing these collected experiences. Many methods have been proposed to optimize policy functions, and we discuss the ones most commonly used for solving CO problems.
• Policy gradient. In order to optimize (1) with respect to the policy parameters 𝜃, the policy gradient theorem [Sutton et al., 2000] can be applied to estimate the gradient of the policy objective in the following form:

∇_𝜃 𝐽(𝜋_𝜃) = 𝔼_{𝜋_𝜃} [ Σ_{𝑡=0}^{𝐻} ∇_𝜃 log 𝜋_𝜃(𝑎_𝑡 | 𝑠_𝑡) Â(𝑠_𝑡, 𝑎_𝑡) ],   (7)

where the return estimate is Â(𝑠_𝑡, 𝑎_𝑡) = Σ_{𝑡′=𝑡}^{𝐻} 𝛾^{𝑡′−𝑡} 𝑟(𝑠_{𝑡′}, 𝑎_{𝑡′}) − 𝑏(𝑠_𝑡), 𝐻 is the agent's horizon, and 𝑏(𝑠) is the baseline function. The gradient of the policy is then used by the gradient descent algorithm to optimize the parameters 𝜃.
• REINFORCE. The role of the baseline 𝑏(𝑠) is to reduce the variance of the return estimate Â(𝑠_𝑡, 𝑎_𝑡): as the estimate is computed by running the current policy 𝜋_𝜃, the initial parameters can lead to poor performance at the beginning of training, and the baseline 𝑏(𝑠) mitigates this by reducing the variance. When the baseline 𝑏(𝑠_𝑡) is excluded from the return estimate calculation, we obtain the REINFORCE algorithm proposed by [Williams, 1992]. Alternatively, one can compute the baseline value 𝑏(𝑠_𝑡) by calculating an average reward over the sampled trajectories, or by using a parametric value function estimator 𝑉_𝜙(𝑠_𝑡).
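The quantities that multiply ∇_𝜃 log 𝜋_𝜃(𝑎_𝑡 | 𝑠_𝑡) in Eq. (7) can be computed as sketched below for one episode, with a simple mean baseline for variance reduction. This is an illustrative sketch; the names and the choice of baseline are not taken from any surveyed paper.

```python
import numpy as np

def reinforce_weights(rewards, gamma=0.99, use_baseline=True):
    """Compute the per-step terms multiplying grad log pi(a_t | s_t) in Eq. (7).

    rewards: list of r(s_t, a_t) from one episode. Returns the discounted
    returns-to-go, optionally centred by their mean (a simple baseline)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running   # discounted return from step t onwards
        returns[t] = running
    if use_baseline:
        returns -= returns.mean()                # simple baseline for variance reduction
    return returns

# The surrogate loss whose gradient equals the policy gradient estimate would then be
# loss = -(log_probs * reinforce_weights(rewards)).sum(), with log_probs produced by the encoder.
```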
• Actor-critic algorithms. The family of actor-critic (A2C, A3C) [Mnih et al., 2016] algorithms further extends REINFORCE with a baseline by using bootstrapping, i.e. updating the state-value estimates from the values of the subsequent states. For example, a common approach is to compute the return estimate for each step using the parametric value function:

Â(𝑠_𝑡, 𝑎_𝑡) = 𝑟(𝑠_𝑡, 𝑎_𝑡) + 𝛾 𝑉_𝜙(𝑠′_𝑡) − 𝑉_𝜙(𝑠_𝑡).   (8)

Although this approach introduces bias into the gradient estimates, it often reduces the variance even further. Moreover, the actor-critic methods can be applied to online and continual learning, as they no longer rely on Monte-Carlo rollouts, i.e. unrolling the trajectory to a terminal state.
• PPO/DDPG. Further development of this group of reinforcement learning algorithms has resulted in several more advanced methods, such as Proximal Policy Optimization (PPO) [Schulman et al., 2017], which performs policy updates with constraints in the policy space, or Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2016], an actor-critic algorithm that attempts to learn a parametric state-action value function 𝑄_𝜙(𝑠, 𝑎) corresponding to the current policy and uses it to compute the bootstrapped return estimate.

Both value-based and policy-based approaches do not use a model of the environment (model-free approaches), i.e. the transition probabilities of the model, and, hence, such approaches do not plan ahead by unrolling the environment for the next steps. However, it is possible to define an MDP for CO problems in such a way that we can use the knowledge of the environment in order to improve the predictions by planning several steps ahead. Some notable examples are AlphaZero [Silver et al., 2016] and Expert Iteration [Anthony et al., 2017], which have achieved superhuman performance in games like chess, shogi, go, and hex, learning exclusively through self-play. Moreover, the most recent algorithm, MuZero [Schrittwieser et al., 2019], has been able to achieve superhuman performance in challenging and visually complex domains, such as Atari games, go and shogi, by extending the previous approaches with a learned dynamics model and without knowledge of the game rules.
• MCTS.
The algorithm follows the general procedure of Monte Carlo Tree Search (MCTS) [Browne et al., 2012], consisting of selection, expansion, roll-out and backup steps (Figure 5). However, instead of evaluating leaf nodes in the tree by making a rollout step, a neural network 𝑓_𝜃 is used to provide a policy 𝑃(𝑠, ∗) and a state-value estimate 𝑉(𝑠) for the new node in the tree. The nodes in the tree refer to states 𝑠, and edges refer to actions 𝑎. During the selection phase we start at a root state 𝑠_0 and keep selecting the next states that maximize the upper confidence bound:

UCB = 𝑄(𝑠, 𝑎) + 𝑐 ⋅ 𝑃(𝑠, 𝑎) ⋅ √(Σ_{𝑎′} 𝑁(𝑠, 𝑎′)) / (1 + 𝑁(𝑠, 𝑎)),   (9)
Figure 5:
Three steps of the Monte Carlo Tree Search (MCTS). Starting the simulation from the root node, the select step picks the node that maximizes the upper confidence bound. When a previously unseen node is expanded, the policy 𝑃(𝑠, ∗) and the state-value function 𝑉(𝑠) are evaluated at this node, and the action-value 𝑄(𝑠, 𝑎) and the counter 𝑁(𝑠, 𝑎) are initialized to 0. Then the 𝑉(𝑠) estimates are propagated back along the path of the current simulation to update 𝑄(𝑠, 𝑎) and 𝑁(𝑠, 𝑎).

When a previously unseen node in the search tree is encountered, the policy 𝑃(𝑠, ∗) and the state-value estimate 𝑉(𝑠) are computed for this node. After that, the 𝑉(𝑠) estimate is propagated back along the search tree, updating the 𝑄(𝑠, 𝑎) and 𝑁(𝑠, 𝑎) values. After a number of search iterations we select the next action from the root state according to the improved policy:

𝜋(𝑎 | 𝑠_0) = 𝑁(𝑠_0, 𝑎) / Σ_{𝑎′} 𝑁(𝑠_0, 𝑎′).   (10)
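The selection rule (9) and the improved policy (10) can be sketched compactly over per-node statistics. The dictionaries Q, N, P and the exploration constant c are illustrative placeholders for whatever bookkeeping a concrete implementation uses.

```python
import math

def select_action(Q, N, P, state, actions, c=1.0):
    """Pick the action maximizing the UCB score of Eq. (9) at the given state.

    Q, N, P are dictionaries keyed by (state, action); missing entries default to 0."""
    total_visits = sum(N.get((state, a), 0) for a in actions)
    def ucb(a):
        return (Q.get((state, a), 0.0)
                + c * P.get((state, a), 0.0) * math.sqrt(total_visits) / (1 + N.get((state, a), 0)))
    return max(actions, key=ucb)

def improved_policy(N, root, actions):
    """Visit-count policy of Eq. (10) used to act at the root after the search."""
    counts = [N.get((root, a), 0) for a in actions]
    total = sum(counts)
    return [n / total for n in counts] if total else [1.0 / len(actions)] * len(actions)
```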
3. Taxonomy of RL for CO
The full taxonomy of RL methods for CO can be challenging to build because of the orthogonality of the ways we can classify the surveyed works. In this section we list all the taxonomy groups that are used in this survey.

One straightforward way of dividing the RL approaches in the CO field is by the family of the RL methods used to find the solution of the given problem. As shown in Figure 4, it is possible to split the RL methods either by the first level of Figure 4 (i.e. into model-based and model-free methods) or by the second level (i.e. policy-based and value-based methods, and the methods using the Monte-Carlo Tree Search approach) (section 2.3). In addition, another division is possible by the type of encoders used for representing the states of the MDP (section 2.2). This division is much more granular than the other ones discussed in this section, as can be seen from the works surveyed in the next section.

Another way to aggregate the existing RL approaches is based on the integration of RL into the given CO problem, i.e. whether an RL agent is searching for a solution to the CO problem on its own or is facilitating the inner workings of an existing off-the-shelf solver.
• In principal learning, the agent makes the direct decisions that constitute a part of the solution or the complete solution of the problem and does not require feedback from an off-the-shelf solver. For example, in TSP the agent can be parameterized by a neural network that incrementally builds a path from a set of vertices and then receives the reward in the form of the length of the constructed path, which is used to update the policy of the agent.
• Alternatively, one can learn the RL agent's policy in joint training with already existing solvers so that it can improve some of the metrics for a particular problem. For example, in MILP a commonly used approach is the Branch & Bound method, which at every step selects a branching rule at a node of the search tree. This choice can have a significant impact on the overall size of the tree and, hence, on the running time of the algorithm. A branching rule is a heuristic that typically requires either some domain expertise or a hyperparameter tuning procedure. However, a parameterized RL agent can learn to imitate the policy of the node selection by receiving rewards proportional to the running time.
Table 1
Summary of approaches for the Travelling Salesman Problem (Searching: Joint; Solution: Constructive; Training: Encoder, RL).

Approach                 | Joint | Constructive | Encoder                                     | RL
[Bello et al., 2017]     | No    | Yes          | Pointer Network                             | REINFORCE with baseline
[Khalil et al., 2017]    | No    | Yes          | S2V                                         | DQN
[Nazari et al., 2018]    | No    | Yes          | Pointer Network with Convolutional Encoder  | REINFORCE (TSP) and A3C (VRP)
[Deudon et al., 2018]    | No    | Yes          | Pointer Network with Attention Encoder      | REINFORCE with baseline
[Kool et al., 2019]      | No    | Yes          | Pointer Network with Attention Encoder      | REINFORCE with baseline
[Emami and Ranka, 2018]  | No    | No           | FF NN with Sinkhorn layer                   | Sinkhorn Policy Gradient
[Cappart et al., 2020]   | Yes   | Yes          | GAT / Set Transformer                       | DQN / PPO
[Drori et al., 2020]     | Yes   | Yes          | GIN with an Attention Decoder               | MCTS
[Lu et al., 2020]        | Yes   | No           | GAT                                         | REINFORCE
[Chen and Tian, 2019]    | Yes   | No           | LSTM encoder + classifier                   | Q-Actor-Critic
Another dimension along which the RL approaches can be divided is the way the solution is searched for by the learned heuristics. In this regard, methods can be divided into those learning construction heuristics and those learning improvement heuristics.
• Methods that learn construction heuristics build the solutions incrementally, using the learned policy to choose each element to add to a partial solution.
• The second group of methods starts from some arbitrary solution and learns a policy that improves it iteratively. This approach tries to address a problem commonly encountered when learning construction heuristics, namely the need to use extra procedures, such as beam search or sampling, to find a good solution.
4. RL for CO
In this section we survey existing RL approaches for solving CO problems, including the Traveling Salesman Problem (Definition 4), the Maximum Cut Problem (Definition 5), the Bin Packing Problem (Definition 6), the Minimum Vertex Cover Problem (Definition 7), and the Maximum Independent Set (Definition 8). These problems have received the most attention from the research community, and we juxtapose the approaches for all considered problems.
One of the first attempts to apply policy gradient algorithms to combinatorial optimization problems has been made in [Bello et al., 2017]. In the case of solving the Traveling Salesman Problem, the MDP representation takes the following form: a state is a 𝑝-dimensional graph embedding vector representing the current tour of the nodes at the time step 𝑡, while the action is picking another node that has not been used at the current state. This way, the initial state 𝑠_0 is the embedding of the starting node. A transition function 𝑇(𝑠, 𝑎, 𝑠′), in this case, returns the next node of the constructed tour until all the nodes have been visited. Finally, the reward in [Bello et al., 2017] is intuitive: it is the negative tour length. The pointer network architecture proposed in [Vinyals et al., 2015] is used to encode the input sequence, while the solution is constructed sequentially from a distribution over the input using the pointer mechanism of the decoder, trained in parallel and asynchronously, similarly to [Mnih et al., 2016]. Moreover, several inference strategies are proposed to construct a solution: along with greedy decoding and sampling, an Active Search approach is suggested. Active Search allows learning the solution for a single test problem instance, starting either from a trained or an untrained model. To update the parameters of the controller so as to maximize the expected rewards, the REINFORCE algorithm with a learned baseline is used.

Later works, such as the one by [Khalil et al., 2017], have improved on the work of [Bello et al., 2017]. In the case of [Khalil et al., 2017], the MDP constructed for solving the Traveling Salesman Problem is similar to
the one used by [Bello et al., 2017], except for the reward function 𝑟(𝑠, 𝑎). The reward, in this case, is defined as the difference in the cost functions after transitioning from the state 𝑠 to the state 𝑠′ when taking some action 𝑎: 𝑟(𝑠, 𝑎) = 𝑐(ℎ(𝑠′), 𝐺) − 𝑐(ℎ(𝑠), 𝐺), where ℎ is the graph embedding function of the partial solutions 𝑠 and 𝑠′, 𝐺 is the whole graph, and 𝑐 is the cost function. Because the weighted variant of TSP is solved, the authors define the cost function 𝑐(ℎ(𝑠), 𝐺) as the negative weighted sum of the tour length. The work also implements S2V [Dai et al., 2016] for encoding the partial solutions and DQN as the RL algorithm of choice for updating the network's parameters.

Another work, by [Nazari et al., 2018], motivated by [Bello et al., 2017], concentrates on solving the Vehicle Routing Problem (VRP), which is a generalization of TSP. However, the approach suggested in [Bello et al., 2017] cannot be applied directly to solve VRP due to its dynamic nature, i.e. the demand at a node becoming zero once the node has been visited, since it embeds the sequential and static nature of the input. The authors of [Nazari et al., 2018] extend the previous methods used for solving TSP to circumvent this problem and find solutions to VRP and its stochastic variant. Specifically, similarly to [Bello et al., 2017], in the approach of [Nazari et al., 2018] the state 𝑠 represents the embedding of the current solution as a vector of tuples, one value of which is the coordinates of the customer's location and the other is the customer's demand at the current time step. An action 𝑎 is picking the node which the vehicle will visit next on its route. The reward is also similar to the one used for TSP: it is the negative total route length, which is given to the agent only after all customers' demands are satisfied, which corresponds to the terminal state of the MDP. The authors of [Nazari et al., 2018] also suggest improving the Pointer Network used by [Bello et al., 2017]. To do that, the encoder is simplified by replacing the LSTM unit with 1-d convolutional embedding layers, so that the model is invariant to the input sequence order and, consequently, able to handle the dynamic state change. The policy learning is then performed using the REINFORCE algorithm for TSP and VRP, and A3C for stochastic VRP.

Similarly to [Nazari et al., 2018], the work by [Deudon et al., 2018] uses the same approach as the one by [Bello et al., 2017], while changing the encoder-decoder network architecture. This way, while the MDP is the same as in [Bello et al., 2017], instead of including the LSTM units, the GNN encoder architecture is based solely on attention mechanisms, so that the input is encoded as a set and not as a sequence. The decoder, however, stays the same as in the Pointer Network case. The REINFORCE algorithm with a critic baseline is used to update the parameters of the described encoder-decoder network. Additionally, the authors have looked into combining the solution provided by the reinforcement learning agent with the 2-Opt heuristic [Croes, 1958], in order to further improve the inference results.
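Since 2-Opt appears repeatedly as a post-processing step, here is a minimal sketch of one 2-Opt improvement pass. This is the generic textbook version, not the exact procedure of [Deudon et al., 2018]: it removes two edges of the tour and reconnects the endpoints by reversing the intermediate segment whenever that shortens the tour.

```python
import numpy as np

def tour_length(coords, tour):
    """Total length of a closed tour over the given city coordinates."""
    return sum(np.linalg.norm(coords[tour[i]] - coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt_pass(coords, tour):
    """One pass of 2-Opt: reverse the segment tour[i:j] whenever it shortens the tour."""
    best = list(tour)
    for i in range(1, len(best) - 1):
        for j in range(i + 1, len(best)):
            candidate = best[:i] + best[i:j][::-1] + best[j:]
            if tour_length(coords, candidate) < tour_length(coords, best):
                best = candidate
    return best
```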
In parallel to [Deudon et al., 2018], and inspired by the transformer architecture of [Vaswani et al., 2017], a construction heuristic learning approach by [Kool et al., 2019] has been proposed in order to solve TSP, two variants of VRP (Capacitated VRP and Split Delivery VRP), the Orienteering Problem (OP), the Prize Collecting TSP (PCTSP) and the Stochastic PCTSP (SPCTSP). In this work, the authors have implemented an encoder-decoder architecture similar to that of [Deudon et al., 2018], i.e. a Transformer-like attention-based encoder, while the decoder is similar to the one of the Pointer Network. However, the authors found that slightly changing the training procedure and using a simple rollout baseline instead of one learned by a critic yields better performance. The MDP formulation, in this case, is also similar to the one used by [Deudon et al., 2018] and, consequently, to the one by [Bello et al., 2017].

One specific construction heuristic approach has been proposed in [Emami and Ranka, 2018]. The authors have designed a novel policy gradient method, Sinkhorn Policy Gradient (SPG), specifically for the class of combinatorial optimization problems that involve permutations. This approach yields a different MDP formulation. Here, in contrast with the case when the solution is constructed sequentially, the state space consists of instances of combinatorial problems of a particular size. The action, in this case, is outputting a permutation matrix which, applied to the original graph, produces the solution tour. The reward function is the negated sum of the Euclidean distances between each stop along the tour. Finally, using a special Sinkhorn layer on the output of a feed-forward neural network with GRUs to produce continuous and differentiable relaxations of permutation matrices, the authors have been able to train actor-critic algorithms similar to Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al., 2016].

The work by [Cappart et al., 2020] combines two approaches to solving the traveling salesman problem with time windows, namely the RL approach and the constraint programming (CP) one, in order to learn branching strategies. To encode the CO problems, the authors introduce a dynamic programming formulation that acts as a bridge between both techniques and can be expressed both as an MDP and as a CP problem. A state 𝑠 is a vector consisting of three values: the set of remaining cities that still have to be visited, the last city that has been visited, and the current time. An action 𝑎 corresponds to choosing a city. The reward 𝑟(𝑠, 𝑎) corresponds to the negative travel time between two cities. This MDP can then be transformed into a dynamic programming model. DQN and PPO algorithms have been trained on the MDP formulation to select efficient branching policies for different CP search strategies (branch-and-bound, iterative limited discrepancy search, and restart-based search), which have been used to solve challenging CO problems.
Table 2
Summary of approaches for the Maximum Cut Problem (Searching: Joint; Solution: Constructive; Training: Encoder, RL).

Approach                | Joint | Constructive | Encoder           | RL
[Khalil et al., 2017]   | No    | Yes          | S2V               | DQN
[Barrett et al., 2020]  | No    | Yes          | S2V               | DQN
[Cappart et al., 2019]  | Yes   | Yes          | S2V               | DQN
[Tang et al., 2020]     | Yes   | No           | LSTM + Attention  | Policy Gradient + ES
[Abe et al., 2019]      | No    | Yes          | GNN               | Neural MCTS
[Gu and Yang, 2020]     | No    | Yes          | Pointer Network   | A3C
The work by [Drori et al., 2020] differs from the previous works, which tailor their approaches to individual problems. In contrast, this work provides a general framework for model-free reinforcement learning using a GNN representation that adapts to different problem classes by changing the reward. The framework models problems using the edge-to-vertex line graph and formulates them as a single-player game. The MDPs for TSP and VRP are the same as in [Bello et al., 2017]. Instead of using a full-featured Neural MCTS, [Drori et al., 2020] represent the policy as a GIN encoder with an attention-based decoder, learning it during the tree-search procedure.

[Lu et al., 2020] suggest learning improvement heuristics in a hierarchical manner for capacitated VRP as a part of the joint approach. The authors have designed an intrinsic MDP which incorporates not only the features of the current solutions but also the running history. A state 𝑠_𝑖 includes the free capacity of the route containing a customer 𝑖, its location, the location of the node 𝑖⁻ visited before 𝑖, the location of the node 𝑖⁺ visited after 𝑖, the distance from 𝑖⁻ to 𝑖, the distance from 𝑖 to 𝑖⁺, the distance from 𝑖⁻ to 𝑖⁺, the action taken ℎ steps before, and the effect of 𝑎_{𝑡−ℎ}. The action consists of choosing between two groups of operators that change the current solution, for example by applying the 2-Opt heuristic, which removes two edges and reconnects their endpoints. Concretely, these two operator groups are improvement operators, which are chosen according to a learned policy, and perturbation operators, applied in the case of reaching a local minimum. The authors have experimented with the reward functions and have chosen the two most successful ones: a +1/−1 reward for each time the solution improves/does not give any gains, and an advantage reward, which takes the initial solution's total distance as the baseline and constitutes the difference between this baseline and the distance of the subsequent solutions as the reward at each time step. The policy is parameterized by a Graph Attention Network and is trained with the REINFORCE algorithm.

The final work we are going to cover for this group of problems is by [Chen and Tian, 2019], who propose solving VRP and online job scheduling problems by learning improvement heuristics. The algorithm rewrites different parts of the solution until convergence instead of constructing the solution in a sequential order. The state space is represented as the set of all solutions to the problem, while the action set consists of regions, i.e. nodes in the graph, and their corresponding rewriting rules. The reward, in this case, is the difference between the costs of the current and previous solutions. The authors use an LSTM encoder, specific to each of the covered problems, and train the region-picking and rule-picking policies jointly by applying the Q-Actor-Critic algorithm.

The first work to address solving the Maximum Cut Problem with reinforcement learning was [Khalil et al., 2017], which proposed a principled approach to learning a construction heuristic by combining graph embeddings with Q-learning (S2V-DQN). They formulated the problem as an MDP, where the state space 𝑆 is defined as a partial solution to the problem, i.e. the subset of the graph's nodes added so far to the set that is being constructed to maximize the cut.
The final work we cover for this group of problems is [Chen and Tian, 2019], who propose solving VRP and online job scheduling problems by learning improvement heuristics. The algorithm rewrites different parts of the solution until convergence instead of constructing the solution in a sequential order. The state space is represented as the set of all solutions to the problem, while the action set consists of regions, i.e. nodes in the graph, and their corresponding rewriting rules. The reward, in this case, is the difference in the costs of the current and previous solutions. The authors use an LSTM encoder, specific to each of the covered problems, and train the region-picking and rule-picking policies jointly by applying the Q-Actor-Critic algorithm.

Table 2
Summary of approaches for Maximum Cut Problem.

The first work to address solving the Maximum Cut Problem with reinforcement learning was [Khalil et al., 2017], which proposed a principled approach to learning the construction heuristic by combining graph embeddings with Q-learning — S2V-DQN. They formulated the problem as an MDP, where the state space, 𝑆, is defined as a partial solution to the problem, i.e. the subset of nodes that have so far been added to the cut set. The action space, 𝐴, is the set of nodes that are not in the current state. The transition function, 𝑇(𝑠𝑡+1|𝑠𝑡, 𝑎𝑡), is deterministic and corresponds to tagging the last selected node with a feature 𝑥𝑣 = 1. The reward is calculated as the immediate change in the cut weight, and the episode terminates when the cut weight can no longer be improved with further actions. The graph embedding network proposed in [Dai et al., 2016] was used as the state encoder, and a variant of the Q-learning algorithm, trained on randomly generated graph instances, was used to learn to construct the solution. This approach achieves better approximation ratios than commonly used heuristic solutions and also generalizes well: models trained on graphs of 50-100 nodes were tested on graphs with up to 1000-1200 nodes, achieving very good approximation ratios relative to exact solutions.

[Barrett et al., 2020] improved on the work of [Khalil et al., 2017] in terms of both the approximation ratio and generalization by proposing the ECO-DQN algorithm. The algorithm kept the general framework of S2V-DQN but introduced several modifications. The agent is allowed to remove vertices from the partially constructed solution to better explore the solution space. The reward function was modified to provide a normalized incremental reward for finding a solution better than any seen so far in the episode, as well as to give small rewards for finding a locally optimal solution that had not yet been seen during the episode. In addition, there are no penalties for decreasing the cut value. The input of the state encoder was modified to account for the changes in the reward structure. Since in this setting the agent is able to explore indefinitely, the episode length was set to |𝑉|. Moreover, the authors allow the algorithm to start from an arbitrary state, which can be useful when combining this approach with other methods, e.g. heuristics. This method showed better approximation ratios than S2V-DQN, as well as better generalization ability.
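As an illustration of the incremental reward used by S2V-DQN and ECO-DQN, the sketch below (our own simplification; the graph representation and function names are assumptions, not the authors' code) computes the change in cut weight when a vertex is added to, or removed from, the current cut set.

```python
import networkx as nx

def cut_weight(G, cut_set):
    """Total weight of edges with exactly one endpoint in `cut_set`."""
    return sum(d.get("weight", 1.0) for u, v, d in G.edges(data=True)
               if (u in cut_set) != (v in cut_set))

def toggle_reward(G, cut_set, v):
    """Immediate reward for flipping vertex v: the change in cut weight.
    Adding v (as in S2V-DQN) or removing it again (allowed in ECO-DQN) are both toggles."""
    before = cut_weight(G, cut_set)
    new_set = cut_set ^ {v}   # symmetric difference: add v if absent, remove if present
    return cut_weight(G, new_set) - before, new_set

# toy usage on a weighted triangle
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0)])
reward, state = toggle_reward(G, set(), 0)   # adding vertex 0 cuts edges (0,1) and (0,2)
print(reward, state)                         # 5.0 {0}
```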
[Cappart et al., 2019] devised a joint approach to the Max-Cut problem by incorporating reinforcement learning into the Decision Diagrams (DD) framework [Bergman et al., 2016] to learn a constructive heuristic. Integrating reinforcement learning allows tighter objective function bounds for the DD solution by learning heuristics for variable ordering. They formulated the problem as an MDP, where the state space, 𝑆, is represented as a set of ordered sequences of selected variables along with partially constructed DDs. The action space, 𝐴, consists of the variables that are not yet selected. The transition function, 𝑇, adds variables to the selected-variables set and to the DD. The reward function is designed to tighten the bounds of the DD and is encoded as the relative upper- and lower-bound improvements after the addition of a variable to the set. Training was performed on generated random graphs with the algorithm and state encoding of [Khalil et al., 2017] described above. The authors showed that their approach outperforms several ordering heuristics and generalizes well to larger graph instances, but did not report any comparison to other reinforcement learning-based methods.

Another joint method, proposed by [Tang et al., 2020], combines a reinforcement learning framework with the cutting plane method. Specifically, in order to learn an improvement heuristic for choosing Gomory cutting planes, which are frequently used in Branch-and-Cut solvers, an efficient MDP formulation was developed. The state space, 𝑆, includes the original linear constraints and the cuts added so far. Solving the linear relaxation produces the action space, 𝐴, of Gomory cuts that can be added to the problem. The transition function, 𝑇, adds the chosen cuts to the problem, which results in a new state. The reward function is defined as the difference between the objective values of two consecutive linear problem solutions. A policy gradient algorithm was used to select new Gomory cuts, and the state was encoded with an LSTM network (to account for a variable number of variables) along with an attention-based mechanism (to account for a variable number of constraints). The algorithm was trained on generated instances using evolution strategies and was shown to improve the efficiency of the cuts, the integrality gaps, and the generalization, compared to the heuristics usually used to choose Gomory cuts. Also, the approach was shown to be beneficial in combination with the branching strategy in experiments with the Branch-and-Cut algorithm.
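A minimal sketch of this cutting-plane MDP step is given below. It uses scipy's LP solver purely for illustration, and the cut itself is taken as given rather than derived from the simplex tableau as a true Gomory cut would be; the reward is the change in the LP relaxation objective after the cut is appended, as described above.

```python
import numpy as np
from scipy.optimize import linprog

def lp_bound(c, A_ub, b_ub):
    """Objective value of the LP relaxation: min c^T x s.t. A_ub x <= b_ub, x >= 0."""
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.fun

def add_cut_and_reward(c, A_ub, b_ub, cut_row, cut_rhs):
    """Append one cutting plane; return the new constraint set and the reward
    (the difference between consecutive LP objective values)."""
    before = lp_bound(c, A_ub, b_ub)
    A_new = np.vstack([A_ub, cut_row])
    b_new = np.append(b_ub, cut_rhs)
    after = lp_bound(c, A_new, b_new)
    return (A_new, b_new), after - before

# toy usage: min -x1 - x2 with one constraint, then one extra cut x2 <= 1
c = np.array([-1.0, -1.0])
A = np.array([[2.0, 1.0]])
b = np.array([3.0])
state, reward = add_cut_and_reward(c, A, b, np.array([[0.0, 1.0]]), 1.0)
print(reward)   # the cut tightens the relaxation, so the minimization objective increases
```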
[Abe et al., 2019] proposed to use a graph neural network along with a Neural MCTS approach to learn construction heuristics. The MDP formulation defines the state space, 𝑆, as a set of partial graphs from which nodes can be removed and colored in one of two colors representing the two subsets. The action space, 𝐴, represents the sets of nodes still left in the graph and their available colors. The transition function, 𝑇, colors the selected node of the graph and removes it along with its adjacent edges. The remaining neighboring nodes keep a counter of how many nodes of the adjacent color have been removed. When a new node is removed, the number of earlier removed neighboring nodes of the opposite color is provided as the incremental reward signal, 𝑅 (the number of edges that were included in the cut set). Several GNNs were compared as graph encoders, with GIN [Xu et al., 2018] shown to perform best. A training procedure similar to AlphaGo Zero was employed, with a modification to accommodate a numeric rather than a win/lose outcome. The experiments were performed on a vast variety of generated and real-world graphs. An extensive comparison of the method with several heuristics and with the previously described S2V-DQN [Khalil et al., 2017] showed superior performance as well as better generalization to larger graphs, yet the authors did not report any comparison with exact methods.

[Gu and Yang, 2020] applied the Pointer Network [Vinyals et al., 2015] along with an Actor-Critic algorithm similar to [Bello et al., 2017] to iteratively construct a solution. The MDP formulation defines the state, 𝑆, as a symmetric matrix, 𝑄, whose values are the edge weights between nodes (0 for disconnected nodes). The columns of this matrix are fed to the Pointer Network, which sequentially outputs the actions, 𝐴, in the form of pointers to input vectors along with a special end-of-sequence symbol "EOS". The resulting sequence of nodes separated by the "EOS" symbol represents a solution to the problem, from which the reward is calculated. The authors conducted experiments with simulated graphs with up to 300 nodes and reported fairly good approximation ratios, but, unfortunately, did not compare with previous works or known heuristics.
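To show how such an output sequence maps to a cut, here is a small sketch (our own illustration; the "EOS" handling and the weight-matrix layout are assumptions) that splits the pointer sequence into the two subsets and scores the partition against the weight matrix 𝑄.

```python
import numpy as np

EOS = -1  # special end-of-sequence marker emitted by the decoder

def decode_cut(pointer_sequence, Q):
    """Nodes before the EOS token form one subset, the rest form the other;
    the returned score is the total weight of edges crossing the partition."""
    split = pointer_sequence.index(EOS)
    side_a = pointer_sequence[:split]
    side_b = pointer_sequence[split + 1:]
    return side_a, side_b, sum(Q[i, j] for i in side_a for j in side_b)

# toy usage with a 4-node weighted graph given as a symmetric matrix
Q = np.array([[0, 2, 0, 3],
              [2, 0, 1, 0],
              [0, 1, 0, 4],
              [3, 0, 4, 0]], dtype=float)
print(decode_cut([0, 2, EOS, 1, 3], Q))   # ([0, 2], [1, 3], 10.0)
```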
Table 3
Summary of approaches for Bin Packing Problem.

Approach | Searching: Joint | Solution: Constructive | Training: Encoder | Training: RL
[Hu et al., 2017] | No | Yes | Pointer Network | REINFORCE with baseline
[Duan et al., 2019] | No | Yes | Pointer Network + Classifier | PPO
[Laterre et al., 2018] | No | Yes | FF NN | Neural MCTS
[Li et al., 2020] | No | No | Attention | Actor-Critic
[Cai et al., 2019] | Yes | No | N/A | PPO
To our knowledge, one of the first attempts to solve a variant of the Bin Packing Problem with modern reinforcement learning was [Hu et al., 2017]. The authors proposed a new, more realistic formulation of the problem, in which the bin with the least surface area that can pack all 3D items is determined. This principled approach is only concerned with learning the construction heuristic that chooses a better sequence in which to pack the items, while regular heuristics determine the placement and orientation. The state space, 𝑆, is the set of sizes (height, width, and length) of the items that need to be packed. The approach proposed by [Bello et al., 2017], which utilizes the Pointer Network, is used to output the sequence of actions, 𝐴, i.e. the sequence of items to pack. The reward, 𝑅, is calculated as the surface area of the packed items. REINFORCE with a baseline is used as the reinforcement learning algorithm, with the baseline provided by a known heuristic. An improvement over the heuristic and over random item selection was shown with greedy decoding as well as with sampling and beam search.

Further work by [Duan et al., 2019] extends the approach of [Hu et al., 2017] to learning the orientations along with the sequence order of the items by combining reinforcement and supervised learning in a multi-task fashion. In this work a Pointer Network, trained with the PPO algorithm, was enhanced with a classifier that determines the orientation of the current item in the output sequence, given the representation from the encoder and the embedded partial item sequence. The classifier is trained in a supervised setting, using the orientations in the best solution found so far as labels. The experiments were conducted on a real-world dataset and showed that the proposed method performs better than several widely used heuristics and the previous approach by [Hu et al., 2017].

[Laterre et al., 2018] applied a principled Neural MCTS approach to solve the already mentioned variants of the 2D and 3D bin packing problems by learning the construction heuristic. The MDP formulation includes the state space, 𝑆, represented by the set of items that need to be packed, with their heights, widths, and depths. The action space, 𝐴, is represented by the set of item ids, the coordinates of the bottom-left corner of the position of the items, and their orientations. To solve the 2D and 3D Bin Packing Problems, formulated as a single-player game, Neural MCTS constructs the solution with the addition of a ranked reward mechanism that reshapes the rewards according to the relative performance in recent games. This mechanism aims to provide a natural curriculum for a single agent, similar to the natural adversary present in two-player games. The experimental results were compared with a heuristic as well as with the Gurobi solver and showed better performance in several cases on a dataset created by randomly cutting the original bin into items.

[Li et al., 2020] tries to address the limitations of the three previously described works, namely using heuristics for the rotation or the position coordinates ([Hu et al., 2017], [Duan et al., 2019]) or obtaining items by cutting the original bin ([Laterre et al., 2018]). Concretely, the authors propose an end-to-end pipeline that chooses an item, its orientation, and its position coordinates by using an attention mechanism. The MDP's state space, 𝑆, includes a binary indicator of whether the item is packed or not, its dimensions, and its coordinates relative to the bin.
The action space, 𝐴, is defined by the selection of the item, its rotation, and its position in the bin. The reward function is incremental and is calculated as the volume gap in the bin, i.e. the current bin's volume minus the volume of the packed items. An actor-critic algorithm is used for learning. The comparison with a genetic algorithm and previous reinforcement learning approaches, namely [Duan et al., 2019] and [Kool et al., 2019], showed that the proposed method achieves a smaller bin gap ratio for problems of size up to 30 items.
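One way to read this incremental volume-gap reward is sketched below (our own simplification; the item representation and names are assumptions, not the authors' code): packing a feasible item shrinks the gap by exactly its own volume.

```python
def volume(dims):
    """Volume of a box given as (width, height, depth)."""
    w, h, d = dims
    return w * h * d

def volume_gap(bin_dims, packed_items):
    """Gap = bin volume minus the total volume of the items packed so far."""
    return volume(bin_dims) - sum(volume(item) for item in packed_items)

def pack_reward(bin_dims, packed_items, new_item):
    """Incremental reward: the reduction in the volume gap after packing one more item."""
    before = volume_gap(bin_dims, packed_items)
    after = volume_gap(bin_dims, packed_items + [new_item])
    return before - after   # equals volume(new_item) for a feasible placement

# toy usage: a 10x10x10 bin with two items already placed
print(pack_reward((10, 10, 10), [(2, 3, 4), (5, 5, 2)], (1, 2, 3)))   # 6
```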
[Cai et al., 2019] took a joint approach to solving the 1D bin packing problem by combining proximal policy optimization (PPO) with the simulated annealing (SA) heuristic algorithm. PPO is used to learn an improvement heuristic that builds an initial starting solution for SA, which in turn, after finding a good solution in a limited number of iterations, calculates the reward function, 𝑅, as the difference in costs between the initial and final solutions and passes it to the PPO agent. The action space, 𝐴, is represented by a set of changes of the bins between two items, i.e. a perturbation of the current solution. The state space, 𝑆, is described by a set of assignments of items to bins. The work showed that the combination of RL and the heuristic can find solutions better than either of these algorithms in isolation, but did not provide any comparison with known heuristics or other algorithms.
Table 4
Summary of approaches for Minimum Vertex Cover problem.

Approach | Searching: Joint | Solution: Constructive | Training: Encoder | Training: RL
[Khalil et al., 2017] | No | Yes | S2V | DQN
[Song et al., 2020] | No | Yes | S2V | DQN + Imitation Learning
[Manchanda et al., 2019] | No | Yes | GNN | DQN

A principled approach to solving the Minimum Vertex Cover (MVC) problem with reinforcement learning was developed by [Khalil et al., 2017]. To learn the construction heuristic, the problem was put into the MDP framework described in detail in 4.2 along with the experimental results. To apply the algorithm to the MVC problem, the reward function, 𝑅, was modified to produce −1 for assigning a node to the cover set. The episode terminates when all edges are covered.
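The MVC-specific reward and termination rule are simple enough to state in a few lines; the sketch below (illustrative only, with an assumed networkx representation and function names of our own) mirrors the −1-per-selected-node reward and the all-edges-covered stopping condition.

```python
import networkx as nx

def mvc_step(G, cover, v):
    """Add node v to the partial vertex cover; the reward is -1 per selected node."""
    new_cover = cover | {v}
    reward = -1.0
    done = all(u in new_cover or w in new_cover for u, w in G.edges())  # every edge covered
    return new_cover, reward, done

# toy usage: a path graph 0-1-2-3; covering nodes 1 and 2 covers all edges
G = nx.path_graph(4)
cover, episode_return = set(), 0.0
for v in (1, 2):
    cover, r, done = mvc_step(G, cover, v)
    episode_return += r
print(cover, episode_return, done)   # {1, 2} -2.0 True
```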
[Song et al., 2020] proposed a joint co-training method, a technique that has gained popularity in the classification domain, to construct sequential policies. The paper describes two policy-learning strategies for the MVC problem: the first strategy copies the one described in [Khalil et al., 2017], i.e. S2V-DQN; the second is an integer linear programming approach solved by the branch-and-bound method. The authors create the 𝐶𝑜𝑃𝑖𝐸𝑟 algorithm, which is intuitively similar to Imitation Learning [Hester et al., 2018]: the two strategies induce two policies, the policies are evaluated to determine which one is better, information is exchanged between them, and, finally, the update is made. The performed experiments resulted in an extensive ablation study, listing comparisons with S2V-DQN, Imitation Learning, and the Gurobi solver, and showed a smaller optimality gap for problems of up to 500 nodes.

Finally, it is worth including in this section the work by [Manchanda et al., 2019], which combines supervised and reinforcement learning in a joint method that learns a construction heuristic for the budget-constrained Maximum Vertex Cover problem. The algorithm consists of two phases. In the first phase, a GCN is used to determine "good" candidate nodes by learning a scoring function, using the scores provided by a probabilistic greedy approach as labels. The candidate nodes are then used in an algorithm similar to [Khalil et al., 2017] to sequentially construct a solution. Since the degree of nodes in large graphs can be very high, importance sampling according to the computed score is used to choose the neighboring nodes for the embedding calculation, which helps to reduce the computational complexity. Extensive experiments on random and real-world graphs showed that the proposed method performs marginally better than S2V-DQN, scales to much larger graph instances of up to a hundred thousand nodes, and is significantly more efficient computationally due to a lower number of learned parameters.
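To show what score-based neighbor sampling might look like in this setting, here is a minimal sketch (our own illustration, not the authors' implementation; the scores are assumed to come from the supervised first phase):

```python
import numpy as np

def sample_neighbors(neighbors, scores, k, rng=np.random.default_rng(0)):
    """Importance-sample at most k neighbors with probability proportional to their scores,
    so that embedding aggregation touches only a subset of a high-degree node's neighbors."""
    neighbors = np.asarray(neighbors)
    p = np.asarray([scores[u] for u in neighbors], dtype=float)
    p = p / p.sum()
    k = min(k, len(neighbors))
    return rng.choice(neighbors, size=k, replace=False, p=p)

# toy usage: a node with five neighbors and scores from the first (supervised) phase
scores = {1: 0.9, 2: 0.1, 3: 0.5, 4: 0.05, 5: 0.7}
print(sample_neighbors([1, 2, 3, 4, 5], scores, k=3))
```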
One of the first RL-for-CO works to cover the MIS problem is [Cappart et al., 2019]. It focuses on a particular approach to solving combinatorial optimization problems, where an RL algorithm is used to find the optimal ordering of the variables in a Decision Diagram (DD) in order to tighten the relaxation bounds for the MIS problem. The MDP formulation, as well as the encoder and the RL algorithm, are described in detail in Section 4.4.

Another early article covering MIS is [Abe et al., 2019]. In it, the authors proposed the following MDP formulation: a state 𝑠 ∈ 𝑆 is the graph obtained at each step of constructing a solution, with the initial state 𝑠 being the initial graph 𝐺; an action 𝑎 ∈ 𝐴 is the selection of one node of the graph in the current state; the transition function 𝑇(𝑠, 𝑎, 𝑠′) returns the next state, corresponding to the graph in which the selected node, its adjacent nodes, and the edges covered by the action 𝑎 are deleted; and, finally, the reward function 𝑟(𝑠, 𝑎) is a constant for every action (each action adds one node to the independent set). For the encoder, the authors proposed to apply a GIN to account for the variable size of the state representation in a search tree. [Abe et al., 2019] uses a model-based algorithm, namely AlphaGo Zero, to update the parameters of the GIN network.

The latest work, [Ahn et al., 2020], modifies the MDP formulation by applying a label to each node: each node can be either included into the solution, excluded from the solution, or the determination of its label can be deferred. This way, a state 𝑠 ∈ 𝑆 becomes a vector whose size is equal to the number of nodes in the graph and which consists of the labels that have been given to each node at the current time step. The initial state 𝑠 is a vector with all labels set to deferred. An action 𝑎 ∈ 𝐴 is a vector with new label assignments for the currently deferred nodes only. To maintain the independence of the solution set, the transition function 𝑇(𝑠, 𝑎) consists of two phases: an update phase and a clean-up phase. The first phase is the naive assignment of the labels by applying the action 𝑎, which leads to an intermediate state ̂𝑠. In the clean-up phase, the authors modify the intermediate state ̂𝑠 in such a way that the included nodes are only adjacent to excluded ones. Finally, the reward function 𝑟(𝑠, 𝑎) is equal to the increase in the cardinality of the included vertices between the resulting state 𝑠′ and the previous state 𝑠. The authors propose to use a Graph Convolutional Network encoder and the PPO method with a rollback procedure to learn the optimal Deep Auto-Deferring Policy (ADP), which outputs the improvement heuristic to solve the MIS problem.
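One simple way to realize the two-phase transition described above is sketched below; this is our own reading of the mechanism (the label encoding and the exact conflict-resolution rule are assumptions, not the authors' code):

```python
import networkx as nx

DEFERRED, INCLUDED, EXCLUDED = 0, 1, 2

def transition(G, state, action):
    """Update phase: deferred nodes take the labels proposed by the action.
    Clean-up phase: enforce that included nodes are adjacent only to excluded nodes."""
    s_hat = {v: (action[v] if state[v] == DEFERRED else state[v]) for v in G}
    # if two adjacent nodes were both marked included, defer them again (one possible rule)
    for u, v in G.edges():
        if s_hat[u] == INCLUDED and s_hat[v] == INCLUDED:
            s_hat[u] = s_hat[v] = DEFERRED
    # deferred neighbors of an included node can no longer be included -> exclude them
    for v in G:
        if s_hat[v] == INCLUDED:
            for u in G.neighbors(v):
                if s_hat[u] == DEFERRED:
                    s_hat[u] = EXCLUDED
    reward = sum(s_hat[v] == INCLUDED for v in G) - sum(state[v] == INCLUDED for v in G)
    return s_hat, reward   # reward: increase in the number of included vertices

# toy usage on a triangle: trying to include all three nodes gets cleaned up
G = nx.complete_graph(3)
state = {v: DEFERRED for v in G}
print(transition(G, state, {v: INCLUDED for v in G}))
```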
Table 5
Summary of approaches for Maximum Independent Set problem.

Approach | Searching: Joint | Solution: Constructive | Training: Encoder | Training: RL
[Cappart et al., 2019] | Yes | No | S2V | DQN
[Abe et al., 2019] | No | No | GIN | MCTS
[Ahn et al., 2020] | Yes | Yes | GCN | PPO
5. Comparison
In this section, we partially compare the results achieved by the works presented in this survey. Concretely, we have singled out the two most frequently addressed problems, namely, the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP). The average tour lengths for these problems, reported in the works [Lu et al., 2020; Kool et al., 2019; Lodi et al., 2002; Bello et al., 2017; Emami and Ranka, 2018; Ma et al., 2020; Nazari et al., 2018; Chen and Tian, 2019], are shown in Tables 6 and 7. The presented results have been achieved on Erdős–Rényi (ER) graphs of various sizes, namely, with 20, 50, and 100 nodes for TSP, and 10, 20, 50, and 100 nodes for CVRP. In the case of CVRP, we also specify the capacity of the vehicle (Cap.), which varies from 10 to 50. In addition, we include the results achieved by the OR-Tools solver [Perron and Furnon, 2019] and the LK-H heuristic algorithm [Helsgaun, 2017] as baseline solutions.
Best performing methods.
It is clear from the presented tables that the best performing methods for TSP are [Kool et al., 2019] and [Bello et al., 2017], and for VRP — [Lu et al., 2020]. These algorithms perform on par with the baselines, and in some cases demonstrate better results. Moreover, in the case of [Lu et al., 2020], the algorithm shows the best performance across all the other methods, even for tasks with smaller vehicle capacities.
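As a small worked example of how such comparisons can be read, the snippet below computes the relative gap to the LK-H baseline from the N=100 TSP column of Table 6 (illustrative arithmetic only; the values are copied from the table).

```python
# Relative gap to the LK-H baseline for the N=100 TSP column of Table 6.
lkh = 7.8
methods = {"Lu et al., 2020": 8.4, "Kool et al., 2019": 7.9,
           "Bello et al., 2017": 7.9, "OR-Tools": 8.0}
for name, length in methods.items():
    gap = 100.0 * (length - lkh) / lkh
    print(f"{name}: tour length {length}, gap to LK-H {gap:.1f}%")
```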
Focus on smaller graphs.
Throughout our analysis, we have found that most of the articles focus on testing the CO-RL methods on graphs with 20, 50, or 100 nodes. At the same time, [Ma et al., 2020] presents results for bigger graphs with 250, 500, 750, and 1000 nodes for the TSP problem. This may be connected to the fact that with the increasing size of the graphs the process of finding the optimal solution also becomes much more computationally difficult, even for the commercial solvers. The comparison of the reported results and the baselines further supports this fact: for TSP it can be seen that for smaller graphs almost all of the methods outperform OR-Tools, while for bigger graphs this is no longer the case. Consequently, this can be a promising direction for further research.
Table 6
The average tour lengths (the smaller, the better) for TSP on ER graphs with the number of nodes 𝑁 equal to 20, 50, 100.

Algo | Article | Method | N=20 | N=50 | N=100
RL | [Lu et al., 2020] | REINFORCE | 4.0 | 6.0 | 8.4
RL | [Kool et al., 2019] | | 3.8 | 5.7 | 7.9
RL | [Deudon et al., 2018] | | 3.8 | 5.8 | 8.9
RL | [Deudon et al., 2018] | REINFORCE+2opt | 3.8 | 5.8 | 8.2
RL | [Bello et al., 2017] | A3C | 3.8 | 5.7 | 7.9
RL | [Emami and Ranka, 2018] | Sinkhorn Policy Gradient | 4.6 | – | –
Baseline | [Helsgaun, 2017] | LK-H | 3.8 | 5.7 | 7.8
Baseline | [Perron and Furnon, 2019] | OR-Tools | 3.9 | 5.8 | 8.0

Non-overlapping problems.
We can see that although many works focused on creating well-performing RL-based solvers have emerged, the CO problems covered in these articles rarely coincide, which makes a fair comparison a much harder task. We are convinced that further analysis should focus on unifying the results from different sources and, hence, on identifying the more promising directions for research.
Running times.
One of the main advantages of using machine learning and reinforcement learning algorithms to solve CO problems is the considerable reduction in running times compared to those of metaheuristic algorithms and solvers. However, it is still hard to compare the running times of different works, as they can vary significantly depending on the implementations and the hardware used for experimentation. For these reasons, we do not attempt to compare the exact times achieved by different RL-CO works. Still, we can note that some of the works, such as [Nazari et al., 2018], [Chen and Tian, 2019], and [Lu et al., 2020], claim to have outperformed classic heuristic algorithms. Concretely, the authors of [Nazari et al., 2018] show that for larger problems their framework is faster than randomized heuristics, and its running time grows more slowly with the complexity of the CO problem than that of the Clarke-Wright [Clarke and Wright, 1964] and Sweep [Wren and Holliday, 1972] heuristics. [Chen and Tian, 2019] claim that their approach outperforms the expression simplification component of the Z3 solver [De Moura and Bjørner, 2008] in terms of both the objective metrics and time efficiency. Finally, although the exact training times are not given in the article, the authors of [Lu et al., 2020] note that the running time of their algorithm is much smaller than that of LK-H. In addition, while also acknowledging the complexity of comparing times across works, [Kool et al., 2019] claim that the running time of their algorithm is ten times smaller than that of [Bello et al., 2017].
Table 7
The average tour lengths for Capacitated VRP on ER graphs with the number of nodes 𝑁 equal to 10, 20, 50, 100. Cap. denotes the capacity of the vehicle.

Algo | Article | Method | N=10, Cap. 10 | N=20, Cap. 20 | N=20, Cap. 30 | N=50, Cap. 30 | N=50, Cap. 40 | N=100, Cap. 40 | N=100, Cap. 50
RL | [Nazari et al., 2018] | REINFORCE | 4.7 | – | 6.4 | – | 11.15 | – | 17.0
RL | [Kool et al., 2019] | | – | – | 6.3 | – | 10.6 | – | 16.2
RL | [Lu et al., 2020] | | – | 6.1 | – | 10.4 | – | 15.6 | –
RL | [Chen and Tian, 2019] | A2C | – | – | 6.2 | – | 10.5 | – | 16.1
Baseline | [Helsgaun, 2017] | LK-H | – | 6.1 | 6.1 | 10.4 | 10.4 | 15.6 | 15.6
Baseline | [Perron and Furnon, 2019] | OR-Tools | 4.7 | 6.4 | 6.4 | 11.3 | 11.3 | 17.2 | 17.2

6. Conclusion and future directions

The previous sections have covered several approaches to solving canonical combinatorial optimization problems by utilizing reinforcement learning algorithms. As this field has been demonstrated to perform on par with state-of-the-art heuristic methods and solvers, we expect new algorithms and approaches to emerge in the following directions, which we have found promising:
Generalization to other problems.
In Section 5, we formulated one of the main problems of the current state of the RL-CO field, which is the limited number of experimental comparisons. Indeed, the group of CO problems is vast, and the current approaches often have to be implemented for a concrete set of problems. The RL field, however, has already made some steps towards the generalization of learned policies to unseen problems (for example, [Groshev et al., 2018]). In the case of CO, these unseen problems can be smaller instances of the same problem, problem instances with different distributions, or even problems from another group of CO problems. We believe that, although this direction is challenging, it is extremely promising for the future development of the RL-CO field.
Improving the solution quality.
Many of the works reviewed in this survey have demonstrated superior performance compared to commercial solvers. Moreover, some of them have also achieved solution quality equal to the optimal solutions or to those obtained by heuristic algorithms. However, these results hold only for the less complex versions of the CO problems, for example, the ones with smaller numbers of nodes. This leaves room for further improvement of the current algorithms in terms of objective quality. One possible way forward is a further integration of classical CO algorithms with RL approaches, for example, by using imitation learning as in [Hester et al., 2018].
Filling the gaps.
One of the ways to classify RL-CO approaches, which we have mentioned previously, is to group them into joint and constructive methods. Tables 1, 2, 3, 4, 5 contain this information for each of the reviewed articles, and from them we can identify some unexplored approaches for each of the CO problems. For example, from Table 3 it can be seen that no algorithm that is both joint and constructive has been published for the Bin Packing problem. The same logic can be applied to the Minimum Vertex Cover problem, Table 4, where there are no approaches of the joint-constructive or joint-nonconstructive type. Exploring these algorithmic possibilities can provide us not only with new methods but also with useful insights into the effectiveness of these approaches.

In conclusion, we see the field of RL for CO problems as a very promising direction for CO research, because of its effectiveness in terms of solution quality, its capacity to outperform existing algorithms, and the large running-time gains compared to classical heuristic approaches.
References
K. Abe, Z. Xu, I. Sato, and M. Sugiyama. Solving np-hard problems on graphs with extended alphago zero, 2019.
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo, et al. Fast discovery of association rules.
Advances in knowledge discovery anddata mining , 12(1):307–328, 1996.S. Ahn, Y. Seo, and J. Shin. Deep auto-deferring policy for combinatorial optimization, 2020. URL https://openreview.net/forum?id=Hkexw1BtDr .T. Akiba and Y. Iwata. Branch-and-reduce exponential/fpt algorithms in practice: A case study of vertex cover.
Theoretical Computer Science ,609:211–225, 2016.D. V. Andrade, M. G. Resende, and R. F. Werneck. Fast local search for the maximum independent set problem. In
International Workshop onExperimental and Efficient Algorithms , pages 220–234. Springer, 2008.T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. In
Proceedings of the 31st International Con-ference on Neural Information Processing Systems , NIPS’17, page 5366–5376, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN9781510860964.D. L. Applegate, R. E. Bixby, V. Chvatal, and W. J. Cook.
The traveling salesman problem: a computational study . Princeton university press,2006.T. Back and S. Khuri. An evolutionary heuristic for the maximum independent set problem. In
Proceedings of the First IEEE Conference onEvolutionary Computation. IEEE World Congress on Computational Intelligence , pages 531–535. IEEE, 1994.F. Barahona. On the computational complexity of ising spin glass models.
Journal of Physics A: Mathematical and General , 15(10):3241, 1982.ISSN 13616447. doi: 10.1088/0305-4470/15/10/028.T. D. Barrett, W. R. Clements, J. N. Foerster, and A. Lvovsky. Exploratory combinatorial optimization with reinforcement learning. In
Proceedingsof the the 34th National Conference on Artificial Intelligence , AAAI, pages 3243–3250, 2020. doi: 10.1609/aaai.v34i04.5723.R. Bellman. On the theory of dynamic programming.
Proceedings of the National Academy of Sciences of the United States of America , 38(8),1952. ISSN 0027-8424. doi: 10.1073/pnas.38.8.716.R. Bellman. A markovian decision process.
Indiana Univ. Math. J. , 6:679–684, 1957. ISSN 0022-2518.I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio. Neural combinatorial optimization with reinforcement learning. In
Workshop Proceedingsof the 5th International Conference on Learning Representations , ICLR, 2017.Y. Bengio, A. Lodi, and A. Prouvost. Machine learning for combinatorial optimization: A methodological tour d’horizon.
European Journal ofOperational Research , Aug. 2020. ISSN 03772217. doi: 10.1016/j.ejor.2020.07.063.D. Bergman, A. A. Cire, W.-J. v. Hoeve, and J. Hooker.
Decision Diagrams for Optimization . Springer Publishing Company, Incorporated, 1stedition, 2016. ISBN 3319428470.P. A. Borisovsky and M. S. Zavolovskaya. Experimental comparison of two evolutionary algorithms for the independent set problem. In
Workshopson Applications of Evolutionary Computation , pages 154–164. Springer, 2003.C. Browne, E. Powley, D. Whitehouse, S. Lucas, P. Cowling, P. Rohlfshagen, S. Tavener, D. Perez Liebana, S. Samothrakis, and S. Colton. Asurvey of monte carlo tree search methods.
IEEE Transactions on Computational Intelligence and AI in Games , 2012. ISSN 1943068X. doi:10.1109/TCIAIG.2012.2186810.Q. Cai, W. Hang, A. Mirhoseini, G. Tucker, J. Wang, and W. Wei. Reinforcement learning driven heuristic optimization. In
Proceedings ofWorkshop on Deep Reinforcement Learning for Knowledge Discovery , DRL4KDD, 2019.Q. Cappart, E. Goutierre, D. Bergman, and L.-M. Rousseau. Improving optimization bounds using machine learning: Decision diagrams meetdeep reinforcement learning. In
Proceedings of the 33rd AAAI Conference on Artificial Intelligence , AAAI, 2019. ISBN 9781577358091. doi:10.1609/aaai.v33i01.33011443.Q. Cappart, T. Moisan, L.-M. Rousseau, I. Prémont-Schwarz, and A. Cire. Combining reinforcement learning and constraint programming forcombinatorial optimization. arXiv preprint arXiv:2006.01610 , 2020.C. Chen, S.-M. Lee, and Q. Shen. An analytical model for the container loading problem.
European Journal of Operational Research , 80(1):68–76,1995.X. Chen and Y. Tian. Learning to perform local rewriting for combinatorial optimization. In
Proceedings of the 33rd Conference on Advances inNeural Information Processing Systems , NeurIPS, pages 6281–6292, 2019.K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnnencoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 , 2014.N. Christofides. Worst-case analysis of a new heuristic for the travelling salesman problem. Technical report, Carnegie-Mellon Univ Pittsburgh PaManagement Sciences Research Group, 1976.G. Clarke and J. W. Wright. Scheduling of Vehicles from a Central Depot to a Number of Delivery Points.
Operations Research , 12(4):568–581,1964. ISSN 0030-364X. doi: 10.1287/opre.12.4.568.CPLEX. IBM ILOG CPLEX optimization studio.
Version , 12:1987–2018, 1987.A. Croes. A method for solving traveling salesman problems.
Operations Research , 5:791—-812, 1958. ISSN 0030-364X. doi: 10.1287/opre.6.6.791.H. Dai, B. Dai, and L. Song. Discriminative embeddings of latent variable models for structured data. In
Proceedings of the 33rd InternationalConference on Machine Learning , ICML, 2016. ISBN 9781510829008.G. Dantzig, R. Fulkerson, and S. Johnson. Solution of a large-scale traveling-salesman problem.
Journal of the operations research society ofAmerica , 2(4):393–410, 1954.G. B. Dantzig and M. N. Thapa.
Linear programming 1: Introduction . Springer International Publishing, New York, NY, 1997. doi: https://doi.org/10.1007/b97672.L. De Moura and N. Bjørner. Z3: An efficient smt solver. In
International conference on Tools and Algorithms for the Construction and Analysisof Systems , pages 337–340. Springer, 2008. ISBN 3540787992. doi: 10.1007/978-3-540-78800-3_24.M. Deudon, P. Cournut, A. Lacoste, Y. Adulyasak, and L.-M. Rousseau. Learning heuristics for the TSP by policy gradient. In
Lecture Notes in Com-puter Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) , 2018. ISBN 9783319930305.
doi: 10.1007/978-3-319-93031-2_12. I. Dinur and S. Safra. On the hardness of approximating minimum vertex cover.
Annals of Mathematics , pages 439–485, 2005. ISSN 0003486X.doi: 10.4007/annals.2005.162.439.I. Drori, A. Kharkar, W. R. Sickinger, B. Kates, Q. Ma, S. Ge, E. Dolev, B. Dietrich, D. P. Williamson, and M. Udell. Learning to solve combinatorialoptimization problems on real-world graphs in linear time, 2020.L. Duan, H. Hu, Y. Qian, Y. Gong, X. Zhang, J. Wei, and Y. Xu. A multi-task selected learning approach for solving 3d flexible bin packingproblem. In
Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems , AAMAS, page 1386–1394,2019. ISBN 9781510892002.N. Elsokkary, F. S. Khan, D. La Torre, T. S. Humble, and J. Gottlieb. Financial portfolio management using d-wave quantum optimizer: The caseof abu dhabi securities exchange. Technical report, Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States), 2017.P. Emami and S. Ranka. Learning permutations with sinkhorn policy gradient, 2018.T. A. Feo, M. G. Resende, and S. H. Smith. A greedy randomized adaptive search procedure for maximum independent set.
Operations Research ,42(5):860–878, 1994.E. Filiol, E. Franc, A. Gubbioli, B. Moquet, and G. Roblot. Combinatorial optimisation of worm propagation on an unknown network.
InternationalJournal of Computer Science , 2(2):124–130, 2007.E. J. Gardiner, P. Willett, and P. J. Artymiuk. Graph-theoretic techniques for macromolecular docking.
Journal of Chemical Information andComputer Sciences , 40(2):273–279, 2000.A. Gleixner, L. Eifler, T. Gally, G. Gamrath, P. Gemander, R. L. Gottwald, G. Hendel, C. Hojny, T. Koch, M. Miltenberger, B. Müller, M. E. Pfetsch,C. Puchert, D. Rehfeldt, F. Schlösser, F. Serrano, Y. Shinano, J. M. Viernickel, S. Vigerske, D. Weninger, J. T. Witt, and J. Witzig. The SCIPOptimization Suite 5.0. Technical report, Optimization Online, December 2017. URL .M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefiniteprogramming.
Journal of the ACM (JACM) , 42(6):1115–1145, 1995. ISSN 1557735X. doi: 10.1145/227683.227684.T. F. Gonzalez.
Handbook of approximation algorithms and metaheuristics . CRC Press, 2007. ISBN 9781420010749. doi: 10.1201/9781420010749.I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio.
Deep learning , volume 1. MIT press Cambridge, 2016.E. Groshev, M. Goldstein, A. Tamar, S. Srivastava, and P. Abbeel. Learning generalized reactive policies using deep neural networks. In
Proceedingsof the 28th International Conference on Automated Planning and Scheduling , ICAPS, pages 408–416, 2018.S. Gu and Y. Yang. A deep learning algorithm for the max-cut problem based on pointer network structure with supervised learning and reinforce-ment learning strategies.
Mathematics , 8(2):298, Feb 2020. ISSN 2227-7390. doi: 10.3390/math8020298. URL http://dx.doi.org/10.3390/math8020298 .T. Guo, C. Han, S. Tang, and M. Ding. Solving combinatorial problems with machine learning methods. In
Nonlinear Combinatorial Optimization ,pages 207–229. Springer International Publishing, Cham, 2019. ISBN 978-3-030-16194-1. doi: 10.1007/978-3-030-16194-1_9.L. Gurobi Optimization. Gurobi optimizer reference manual, 2020. URL .P. Hansen, N. Mladenovi´c, and D. Uroševi´c. Variable neighborhood search for the maximum clique.
Discrete Applied Mathematics , 145(1):117–125, 2004.M. Held and R. M. Karp. A dynamic programming approach to sequencing problems.
Journal of the Society for Industrial and Applied mathematics ,10(1):196–210, 1962.K. Helsgaun. An effective implementation of the lin–kernighan traveling salesman heuristic.
European Journal of Operational Research , 126(1):106–130, 2000.K. Helsgaun. An extension of the lin-kernighan-helsgaun tsp solver for constrained traveling salesman and vehicle routing problems. Technicalreport, Roskilde University, 2017.T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, et al. Deep q-learning fromdemonstrations. In
Proceedings of the 32nd Conference on Artificial Intelligence , AAAI, 2018. ISBN 9781577358008.S. Hochreiter and J. Schmidhuber. Long short-term memory.
Neural computation , 9(8):1735–1780, 1997.H. Hu, X. Zhang, X. Yan, L. Wang, and Y. Xu. Solving a new 3d bin packing problem with deep reinforcement learning method, 2017.G. Karakostas. A better approximation ratio for the vertex cover problem.
ACM Transactions on Algorithms (TALG) , 5(4):1–8, 2009.R. M. Karp. Reducibility among combinatorial problems. In
Complexity of computer computations , pages 85–103. Springer, 1972. doi: 10.1007/978-1-4684-2001-2_9.K. Katayama, A. Hamamoto, and H. Narihisa. An effective local search for the maximum clique problem.
Information Processing Letters , 95(5):503–511, 2005.H. Kellerer, U. Pferschy, and D. Pisinger. Multidimensional knapsack problems. In
Knapsack problems , pages 235–283. Springer, 2004.E. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In
Proceedings of the 31stConference on Advances in Neural Information Processing Systems , NeurIPS, 2017.T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In
International Conference on Learning Represen-tations (ICLR) , 2017.W. Kool, H. van Hoof, and M. Welling. Attention, learn to solve routing problems! In
Proceedings of the 7th International Conference on LearningRepresentations , ICLR, 2019.R. E. Korf. An improved algorithm for optimal bin packing. In
IJCAI , volume 3, pages 1252–1258. Citeseer, 2003.B. Korte, J. Vygen, B. Korte, and J. Vygen.
Combinatorial optimization , volume 2. Springer, 2012.S. Lamm, P. Sanders, and C. Schulz. Graph partitioning for independent sets. In
International Symposium on Experimental Algorithms , pages68–81. Springer, 2015.S. Lamm, P. Sanders, C. Schulz, D. Strash, and R. F. Werneck. Finding near-optimal independent sets at scale. In
Eighteenth Workshop on Algorithm Engineering and Experiments (ALENEX) , pages 138–150. SIAM, 2016.G. Lancia, V. Bafna, S. Istrail, R. Lippert, and R. Schwartz. Snps problems, complexity, and algorithms. In
European symposium on algorithms ,pages 182–193. Springer, 2001.A. Laterre, Y. Fu, M. K. Jabri, A.-S. Cohen, D. Kas, K. Hajjar, T. S. Dahl, A. Kerkeni, and K. Beguir. Ranked reward: Enabling self-playreinforcement learning for combinatorial optimization, 2018.T. Leleu, Y. Yamamoto, P. L. McMahon, and K. Aihara. Destabilization of local minima in analog spin systems by correction of amplitudeheterogeneity.
Physical review letters , 122(4), 2019. ISSN 10797114. doi: 10.1103/PhysRevLett.122.040607.D. Li, C. Ren, Z. Gu, Y. Wang, and F. Lau. Solving packing problems by conditional query learning, 2020. URL https://openreview.net/forum?id=BkgTwRNtPB .T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning.In
Proceedings of the 4th International Conference on Learning Representations , ICLR, 2016.S. Lin and B. W. Kernighan. An effective heuristic algorithm for the traveling-salesman problem.
Operations research , 21(2):498–516, 1973.A. Lodi, S. Martello, and D. Vigo. Heuristic algorithms for the three-dimensional bin packing problem.
European Journal of Operational Research ,141(2):410–420, 2002.H. Lu, X. Zhang, and S. Yang. A learning-based iterative method for solving vehicle routing problems. In
International Conference on LearningRepresentations , 2020. URL https://openreview.net/forum?id=BJe1334YDH .Q. Ma, S. Ge, D. He, D. Thaker, and I. Drori. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. In
AAAI Workshop on Deep Learning on Graphs: Methodologies and Applications , AAAI, 2020.A. Makhorin. Glpk (gnu linear programming kit), 2012. URL .S. Manchanda, A. Mittal, A. Dhawan, S. Medya, S. Ranu, and A. K. Singh. Learning heuristics over large graphs via deep reinforcement learning.
CoRR , abs/1903.03332, 2019. URL http://arxiv.org/abs/1903.03332 .S. Martello and P. Toth. Bin-packing problem.
Knapsack problems: Algorithms and computer implementations , pages 221–245, 1990a.S. Martello and P. Toth. Lower bounds and reduction procedures for the bin packing problem.
Discrete applied mathematics , 28(1):59–70, 1990b.O. Mersmann, B. Bischl, J. Bossek, H. Trautmann, M. Wagner, and F. Neumann. Local search and the traveling salesman problem: A feature-basedcharacterization of problem hardness. In
International Conference on Learning and Intelligent Optimization , pages 115–129. Springer, 2012.C. E. Miller, A. W. Tucker, and R. A. Zemlin. Integer programming formulation of traveling salesman problems.
Journal of the ACM (JACM) , 7(4):326–329, 1960.M. Mitzenmacher and E. Upfal.
Probability and Computing: Randomized Algorithms and Probabilistic Analysis . Cambridge University Press,USA, 2005. ISBN 9780521835402.V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al.Human-level control through deep reinforcement learning.
Nature , 2015. ISSN 14764687. doi: 10.1038/nature14236.V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Harley, T. P. Lillicrap, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcementlearning. In
Proceedings of the 33rd International Conference on International Conference on Machine Learning , volume 48 of
ICML , page1928–1937, 2016. ISBN 9781510829008.M. Nazari, A. Oroojlooy, L. Snyder, and M. Takác. Reinforcement learning for solving the vehicle routing problem. In
Proceedings of the 32ndConference on Advances in Neural Information Processing Systems , NeurIPS, pages 9839–9849, 2018.C. H. Papadimitriou and K. Steiglitz.
Combinatorial optimization: algorithms and complexity . Courier Corporation, 1998. ISBN 9780486402581.A. Perdomo-Ortiz, N. Dickson, M. Drew-Brook, G. Rose, and A. Aspuru-Guzik. Finding low-energy conformations of lattice protein models byquantum annealing.
Scientific reports , 2:571, 2012. ISSN 20452322. doi: 10.1038/srep00571.L. Perron and V. Furnon. Or-tools, 2019. URL https://developers.google.com/optimization/ .W. Pullan and H. H. Hoos. Dynamic local search for the maximum clique problem.
Journal of Artificial Intelligence Research , 25:159–185, 2006.L. Schrage.
Linear, Integer, and Quadratic Programming with LINDO: User’s Manual , 1986.E. L. Schreiber and R. E. Korf. Improved bin completion for optimal bin packing and number partitioning. In
Twenty-Third International JointConference on Artificial Intelligence , 2013.J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, andD. Silver. Mastering atari, go, chess and shogi by planning with a learned model, 2019.J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot,et al. Mastering the game of go with deep neural networks and tree search.
Nature , 529(7587):484–489, 2016. ISSN 14764687. doi:10.1038/nature16961.J. Song, R. Lanka, Y. Yue, and M. Ono. Co-training for policy learning. In
Proceedings of the 35th Conference on Uncertainty in ArtificialIntelligence, UAI 2019 , volume 115 of
Proceedings of Machine Learning Research , pages 1191–1201, Tel Aviv, Israel, 22–25 Jul 2020. PMLR.K. Subhash, D. Minzer, and M. Safra. Pseudorandom sets in grassmann graph have near-perfect expansion. In , pages 592–601. IEEE, 2018.R. S. Sutton. Learning to predict by the methods of temporal differences.
Machine learning , 1988. ISSN 15730565. doi: 10.1023/A:1022633531479.R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In
Advances in neural information processing systems , pages 1057–1063, 2000. ISBN 0262194503.Y. Tang, S. Agrawal, and Y. Faenza. Reinforcement learning for integer programming: Learning to cut. In
Proceedings of the InternationalConference on Machine Learning , ICML, pages 1483–1492, 2020.R. E. Tarjan and A. E. Trojanowski. Finding a maximum independent set.
SIAM Journal on Computing , 6(3):537–546, 1977.The Sage Developers.
SageMath, the Sage Mathematics Software System (Version 9.0.0) , 2020. URL .E. S. Tiunov, A. E. Ulanov, and A. Lvovsky. Annealing by simulating the coherent ising machine.
Optics express , 27(7):10288–10295, 2019. ISSN
Historia Mathematica, 2020. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In
Advances inneural information processing systems , pages 5998–6008, 2017.P. Veliˇckovi´c, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In
Proceedings of the 6th InternationalConference on Learning Representations , ICLR, 2018.N. Vesselinova, R. Steinert, D. F. Perez-Ramirez, and M. Boman. Learning combinatorial optimization on graphs: A survey with applications tonetworking.
IEEE Access , 8:120388–120416, 2020. ISSN 2169-3536. doi: 10.1109/access.2020.3004964.O. Vinyals, M. Fortunato, and N. Jaitly. Pointer networks. In
Proceedings of the 28th International Conference on Neural Information ProcessingSystems , volume 2 of
NeurIPS , page 2692–2700, 2015.C. J. Watkins and P. Dayan. Q-learning.
Machine learning , 8(3-4):279–292, 1992. ISSN 0885-6125. doi: 10.1007/bf00992698.R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning , 1992. ISSN 15730565.doi: 10.1023/A:1022672621406.L. A. Wolsey.
Integer programming , volume 52. John Wiley & Sons, 1998. ISBN 9780471283669.A. Wren and A. Holliday. Computer scheduling of vehicles from one or more depots to a number of delivery points.
Journal of the OperationalResearch Society , 23(3):333–344, 1972. ISSN 0160-5682. doi: 10.1057/jors.1972.53.Y. Wu, W. Li, M. Goh, and R. de Souza. Three-dimensional bin packing problem with variable bin height.
European journal of operationalresearch , 202(2):347–355, 2010.M. Xiao and H. Nagamochi. Exact algorithms for maximum independent set.
Information and Computation , 255:126–146, 2017.K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks?, 2018.Y. Yamamoto, K. Aihara, T. Leleu, K.-i. Kawarabayashi, S. Kako, M. Fejer, K. Inoue, and H. Takesue. Coherent ising machines—optical neuralnetworks operating at the quantum limit. npj Quantum Information , 3(1):1–15, 2017. ISSN 20566387. doi: 10.1038/s41534-017-0048-9.J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun. Graph neural networks: A review of methods and applications. arXivpreprint, arXiv:1812.08434 , 2018.