A State Aggregation Approach for Solving Knapsack Problem with Deep Reinforcement Learning
Reza Refaei Afshar, Yingqian Zhang, Murat Firat, Uzay Kaymak
Eindhoven University of Technology, Eindhoven, Netherlands
Abstract.
This paper proposes a Deep Reinforcement Learning (DRL) approach for solving the knapsack problem. The proposed method consists of a state aggregation step based on tabular reinforcement learning to extract features and construct states. The state aggregation policy is applied to each problem instance of the knapsack problem, and the resulting states are used with the Advantage Actor Critic (A2C) algorithm to train a policy through which the items are sequentially selected at each time step. The method is a constructive solution approach and the process of selecting items is repeated until the final solution is obtained. The experiments show that our approach provides close to optimal solutions for all tested instances, outperforms the greedy algorithm, and is able to handle larger instances and is more flexible than an existing DRL approach. In addition, the results demonstrate that the proposed model with the state aggregation strategy not only gives better solutions but also learns in fewer timesteps than the one without state aggregation.
Keywords:
Knapsack Problem · Deep Reinforcement Learning · State Aggregation.
Heuristic algorithms for solving Combinatorial Optimization Problems (COPs) achieve acceptable solutions in polynomial time. These algorithms rely on handcrafted heuristics that conduct the process of finding the solutions. Although these heuristics work well in many COPs, they mostly rely on the nature of the problems and need to be modified for different classes of problems. In this paper, we aim to learn and improve handcrafted heuristics to improve the quality of the solutions. We study the knapsack problem (KP), which is one of the well-known benchmark problems in COPs. A KP is defined by a set of items, each with a value and a weight. The objective is to select a subset of items with maximum total value to fill a knapsack such that the cumulative item weight does not exceed its capacity. This problem has many applications such as cargo loading, cutting stock and capital budgeting [28].

Recently, there has been great progress in the Artificial Intelligence (AI) community in developing machine learning (ML) methods to solve COPs [3], where a popular ML-based method is Deep Reinforcement Learning (DRL). DRL is the integration of Reinforcement Learning (RL) and Deep Neural Networks (DNN) [1,18,13]. Several DRL-based approaches have been proposed to solve the Traveling Salesman Problem (TSP), e.g. [2,14,17], where a discrete representation of TSP is used as the states and the solution is a sequence of the inputs. These approaches work well for TSP; however, in the knapsack problem the values and weights of items are continuous, which entails an extremely large state space when the number of items increases. Hence, the existing DRL-based approaches for solving problems with a discrete nature, such as TSP, might not work well for KP, as shown in [2]. The authors of [2] solve a knapsack problem using the policy gradient algorithm with pointer networks. Although they show optimal solutions can be obtained for instances of up to 200 items, the following limitations are identified: (1) intractability for large instances: the state space grows rapidly with an increasing number of items, and (2) generality to other instance sizes: the trained model is applicable only to problems that have exactly the same knapsack capacity and the same number of items. In this paper, we introduce a DRL approach with state aggregation that boosts the capability of the typical greedy algorithm and improves the heuristic to overcome these two limitations.

In our approach, we propose a state aggregation method to discretize the feature values of items. We construct a feature table by assigning a row to each problem instance and a column to each item's information. Tabular reinforcement learning is used to learn the best operation strategy for each item. The resulting discretization of features not only provides a discrete representation of the problem instances, but also reduces the state space by reducing the number of unique values. However, the state space is still large despite state aggregation. Therefore, we exploit DRL as a powerful function approximation approach. We use the
Advantage Actor Critic (A2C) algorithm to learn the policy of selecting items. A2C makes use of two DNNs for learning the policy and value functions [19]. The policy DNN has N outputs, which is equal to the number of items. By following a greedy or softmax algorithm on the output of the policy DNN, a sequence of items is selected until the knapsack is full.

The experimental results show that the proposed approach finds optimal solutions to two decimal places for the problem instances of the same size as used in [2]. Moreover, we show the method obtains close to optimal solutions for three different types of instances with at most 50, 300 and 500 items. We also demonstrate that the proposed DRL with state aggregation performs better than the DRL without aggregation in terms of both learning rate and solution quality. We summarize our contributions as follows.
– We develop a state aggregation strategy to derive a state embedding that reduces the state space size. We show that this general strategy effectively speeds up learning on solving KP.
– Our DRL-based approach to solving KP improves the heuristic greedy algorithm for 0-1 KP and shows better performance than existing DRL approaches. The developed method can be trained once for N items and can then be used for any KP instance with size up to N.

It has been proven by reduction that most COPs are NP-Hard problems. Their optimal solutions cannot be found in polynomial time and exact algorithms take exponential time to find optimal solutions [15,4]. The Knapsack Problem (KP) has gained remarkable attention in the literature. Despite the fact that the fractional KP is optimally solvable by the heuristic greedy algorithm, the 0-1 knapsack problem is NP-Hard [5], and a large variety of KPs remain hard to solve [21]. Moreover, it has been shown by empirical evidence that solving instances near the phase transition is challenging for humans [30]. The phase transition emerges around critical values of items and capacity, where the probability of having a solution for an instance changes from zero to one. Many algorithms, ranging from dynamic programming (e.g. [7]) to meta-heuristics (e.g. [9]), have been proposed to solve KP.

Clever search and branch and bound methods can prune the search tree and reduce the time complexity of COPs [29]. However, these methods are still prohibitive for large instances. Polynomial time approximation schemes and integer linear programming (ILP) based approaches are other helpful methods [25,8]. Although approximation algorithms may run in reasonable time, they rely on handcrafted heuristics and need to be revised when the problem settings change. Furthermore, they suffer from weak optimality for some problems. In order to cope with these limitations, Machine Learning (ML) based and data-driven methods have been developed.

In recent years, it has been shown that DRL can be used for learning good heuristics for solving COPs. In [26], the Pointer Network architecture is introduced, where the output layer of the deep neural network is a function of the input. In [2], the pointer network is used with RL to solve the Traveling Salesman Problem (TSP). They use policy gradient and a variant of the Asynchronous Advantage Actor-Critic (A3C) algorithm of [19] to train a DNN, and show that close to optimal solutions are found for up to 100 cities.
In [16], another neural network framework is introduced for graph-based COPs, where structure2vec [6] is used to derive an embedding for the vertices of the graph. The structure2vec computes a p-dimensional feature embedding for each node and a parametric Q function is trained using the Q-learning algorithm. In [17], the pointer network is incorporated with attention layers. With the REINFORCE algorithm, they obtained close to optimal solutions for TSP instances of up to 100 nodes.

Most ML-based research on solving COPs focuses on TSP. COPs like TSP and the Vehicle Routing Problem, which have gained much attention in the past few years, require a sequence of the input as the solution, and sequence-to-sequence neural architectures might be proper approaches for solving them [23]. However, the solutions of COPs like KP and Weighted Vertex Cover are a subset of the input. This issue makes the original sequence-to-sequence approaches inapplicable for solving KP. Recently, a pointer network deep learning approach was presented for solving 0-1 KP [10]. This method is based on supervised learning and optimal solutions, which are not available in most cases. In this paper we propose a DRL framework for subset selection problems.
We consider the following instance P: We are given a set I_P containing n_P items and a knapsack of capacity W_P. Each item i has value v_i and weight w_i. The goal is to fill the knapsack with a selected subset of items such that the total weight of the selected items does not exceed W_P and the total value is maximized. Since P is a 0-1 KP, selecting a fraction of an item is not possible.

Our method for solving this variant of KP is based on deep reinforcement learning. We assume that the number of items is variable and a constructive solution can solve the problem only by considering the capacity constraint. Therefore, the process of selecting a subset of items I'_P ⊆ I_P is modeled as a sequential decision process. The policy DNN is trained with A2C, introduced in [19], on a set of problem instances with at most N items. The information of each problem instance consists of |I_P| = n_P ≤ N items with value v_i and weight w_i for each i ∈ I_P; together with W_P, they are the inputs of the DNN. The DNN has N outputs, each being associated with the value of selecting a specific item i ∈ I_P. The policy is to select an item in each step. After selecting item i, it is removed from the original problem instance P, and a new problem instance P' with a reduced item set I_{P'} = I_P \ {i} and capacity W_{P'} = W_P − w_i is generated. For the cases where i cannot be added to the knapsack because of the capacity constraint, the new instance P' is generated by simply removing i from the item set, without altering W_P. In this way, when the policies are trained with problem instances of at most N items, they can be used to find solutions for new instances as long as their item sizes are no greater than N. Such KP problems can be found in different applications. For example, an online ad publisher faces a set of advertisements. Assuming a fixed upper bound on the number of ads, the problem is to select a subset of them to show to the users. In this example, the values are relevance scores and the weights are the sizes of the ad banners. The goal is to fill a slot of a certain size with the ads.

Figure 1 shows the overview of our proposed method. It consists of two components. The first component is a formulation of KP as an MDP, which is solved using a DRL approach (Algorithm 1). The second component is a state aggregation method (Algorithm 2), which learns an aggregation policy to discretize the states that serve as inputs to DRL. We first discuss how to formulate the KP problem as an MDP.
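To make the instance reduction concrete, the following is a minimal sketch of the transition from P to P' described above; it is not the authors' implementation, and the names KPInstance and select_item are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class KPInstance:
    values: List[float]   # v_i of the remaining items
    weights: List[float]  # w_i of the remaining items
    capacity: float       # remaining capacity W_P

def select_item(inst: KPInstance, i: int) -> KPInstance:
    """Reduce P to P': if item i fits, it is packed and the capacity shrinks
    by w_i; otherwise the item is simply discarded without altering W_P."""
    fits = inst.weights[i] <= inst.capacity
    new_capacity = inst.capacity - inst.weights[i] if fits else inst.capacity
    return KPInstance(inst.values[:i] + inst.values[i + 1:],
                      inst.weights[:i] + inst.weights[i + 1:],
                      new_capacity)
```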
Fig. 1.
The overview of the KP solver method. 1) A set of problem instances is used for deriving an aggregation policy for the item information. 2) The same set of problems is used in the second step, which is DRL. 3) A problem instance is selected for training. 4) Items are selected sequentially until a solution is found. At each step the updated P is aggregated to obtain the state. The parameters of the value and policy DNNs are updated using A2C. 5) The best solution is stored. 6) Another problem instance is selected for training. The process continues for a certain number of timesteps.

In order to solve the 0-1 KP, DRL is used to derive a policy through which the items are sequentially added to the solution. We define the states, actions and rewards of the DRL modeling of KP for an instance P', which is a representation of P after selecting some items, as follows.

States s(P'): The complete set of information of instance P', containing n_{P'}, the v_i and w_i of the n_{P'} items, the capacity W_{P'}, the total value of the items (Sv = Σ_{i∈I_{P'}} v_i) and the total weight of the items (Sw = Σ_{i∈I_{P'}} w_i), makes a feature vector of 2n_{P'} + 4 features. Since n_{P'} ≤ N for all P', the feature vector of instances that have n_{P'} < N items consists of 2N + 4 features, such that the first 2n_{P'} + 4 entries carry the information of the problem instance and the remaining ones are zero. Section 4.2 will reduce this feature vector by a state aggregation strategy.
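As an illustration, the raw state can be assembled as in the sketch below; the exact ordering of the entries is our assumption, only the 2N + 4 layout with zero padding is taken from the text.

```python
import numpy as np

def raw_state(values, weights, capacity, N):
    """2N + 4 feature vector: n_P', W_P', Sv, Sw followed by the (v_i, w_i)
    pairs; instances with fewer than N items are zero-padded on the right."""
    n = len(values)
    s = np.zeros(2 * N + 4, dtype=np.float32)
    s[:4] = n, capacity, sum(values), sum(weights)
    s[4:4 + 2 * n:2] = values   # item values
    s[5:5 + 2 * n:2] = weights  # item weights
    return s
```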
Actions: There are N actions A_1, A_2, ..., A_N, each corresponding to selecting one item. At each decision moment, a state is fed to the policy DNN and an action is selected according to the output of the DNN.
Reward Function: The reward function is defined based on three criteria. First, if item i can be added to the knapsack without exceeding the capacity limit, v_i is used as a positive reward. Second, if w_i is greater than W_{P'}, i.e. i cannot be added to the knapsack, −w_i is set as a negative reward. Third, for each instance P' where n_{P'} < N, the first n_{P'} outputs of the DNN correspond to the items of P' and the next N − n_{P'} outputs are undefined actions, because the corresponding items do not exist. Therefore, a large penalty, i.e. −W_{P'}, is used as the reward for choosing undefined actions. We separate the reward of undefined actions and heavy items because an action with i > n_{P'} is always undefined, whereas items with w_i > W_{P'} could have been added to the knapsack if they had been selected in earlier steps; therefore, their penalty is lower. Equation (1) shows the reward of state s(P') and action A_i.

r(s(P'), A_i) = \begin{cases} -W_{P'} & \text{if } i > n_{P'} \\ v_i & \text{if } w_i \le W_{P'} \\ -w_i & \text{if } w_i > W_{P'} \end{cases}    (1)

Employing these definitions of states, actions and rewards, the A2C algorithm is used for training the policy and value DNNs [19], where two DNNs are used for the policy (π) and value (V) functions. The advantage value is obtained by subtracting the state value (V) from the state-action value (Q), which is defined by r + γV(s_{t+1}). This value is used in the gradient function to update the parameters of the DNNs using Equations (2) and (3) [19,12].

\theta^{t+1} \leftarrow \theta^{t} + \nabla_{\theta^{t}} \log \pi(A_i \mid s(P), \theta^{t}) \left[ r_t + \gamma V(s(P'), \theta_v^{t}) - V(s(P), \theta_v^{t}) \right]    (2)

\theta_v^{t+1} \leftarrow \theta_v^{t} + \frac{\partial \left( r_t + \gamma V(s(P'), \theta_v^{t}) - V(s(P), \theta_v^{t}) \right)}{\partial \theta_v^{t}}    (3)

where θ^t and θ_v^t are the parameters of the policy and value DNNs at decision moment t, respectively. The corresponding state of a problem instance P is fed to the policy DNN, and items can be selected by following a policy according to the output of the policy DNN. Upon selecting an item, P' is obtained from P and it is again fed to the policy DNN to select the next item. This process is continued until the knapsack is full or the weight constraint is exceeded. Algorithm 1 shows the DRL-based knapsack solver method.

As the number of items increases, the state space grows exponentially, and this affects the performance of function approximation with a DNN. In order to shrink the state space and enable the method to solve large problem instances, a new state embedding is derived by state aggregation. The feature values of the states are divided into subsets and the values of each subset are converted to a certain value. Instead of manually testing different numbers of subsets to find the one with the best performance, the process of finding the appropriate number of subsets is considered as a sequential decision making problem and reinforcement learning is used for solving it. Before developing the RL framework, we first pre-process the problem instances.
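Equation (1) above can be transcribed directly; a small sketch follows (0-based item index, variable names ours).

```python
def reward(values, weights, capacity, i):
    """Eq. (1): -W_P' for an undefined action, +v_i when item i fits,
    and -w_i when it is too heavy for the remaining capacity."""
    if i >= len(values):          # action beyond the current item set
        return -capacity
    if weights[i] <= capacity:    # item fits
        return values[i]
    return -weights[i]            # item too heavy
```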
Preparing data.
A set of problem instances is used for deriving the state embedding. Each problem instance is identified by a set of feature values, which are the item information and the capacity.
Algorithm 1
DRL-based Knapsack Solver
Input: M problem instances, each having at most N items
Output: Values of the solutions of the M instances
1: Initialize a policy DNN with 2N + 4 inputs, N outputs and parameters θ as policy π(A_i | s, θ)
2: Initialize a value DNN with parameters θ_v as V(s, θ_v)
3: t_max = 3N × …, t = 0
4: Initialize Val: a list of length M, all 0
5: while t < t_max do
6:   Select a problem instance P with capacity W_P
7:   ow = 0  {total weight of the selected items}
8:   ov = 0  {total value of the selected items}
9:   P' ← P, n_{P'} ← n_P, W_{P'} ← W_P
10:  while ow < W_{P'} and n_{P'} > 0 do
11:   Find s(P') using the state aggregation strategy (Eqn. (8))
12:   Perform action i according to policy π(A_i | s(P'), θ^t) and observe r(s(P'), A_i)
13:   if i ≤ n_{P'} and w_i + ow ≤ W_{P'} then
14:     ow ← ow + w_i
15:     ov ← ov + v_i
16:     W_{P'} ← W_{P'} − w_i
17:   end if
18:   P' ← P' \ {i}, n_{P'} ← n_{P'} − 1
19:   Update θ and θ_v using Eqns. (2) and (3)
20:   t ← t + 1
21:  end while
22:  if ov > Val[P] then
23:    Val[P] ← ov
24:  end if
25: end while
26: return Val

The first step in aggregating the states is to generate random solutions for each problem instance. As mentioned before, an episode is a sequence of states and actions in which each action selects an item, and the solution is the set of selected items. These instances can be shown in a table in which each row corresponds to a problem instance and the columns are the item information.

One issue with selecting the feature vector of the original item information as states is that different KP instances are not comparable, because the values and weights of items might be very different. As an example, assume that the values and weights of one instance are integers between 1 and 10, while these values and weights lie between 100 and 110 for another instance. Generalization based on these different values is difficult, although their ratios are similar. In order to solve this issue, for each item of instance P, v_i is normalized by dividing it by the product of w_i and W_P, as shown in Eqn. (4). Furthermore, the ratio between w_i and W_P is calculated according to Eqn. (5). The v_i and w_i of each item are replaced with these two ratios in the feature vector of P. This modification makes the items of different problems comparable. The policy network learned in this way would boost the capability of the well-known heuristic greedy algorithm, which is optimal for fractional KP.

vr_i(v_i, w_i, W_P) = \frac{v_i}{w_i \times W_P}    (4)

wr_i(v_i, w_i, W_P) = \frac{w_i}{W_P}    (5)

where vr_i and wr_i are the normalized value and the normalized weight, respectively. For a problem instance P, the vr_i, wr_i, W_P, Sv and Sw construct a feature vector F(P).

F(P) = (F_1, \dots, F_{2N+4}) = (n_P, W_P, Sv, Sw, vr_1, wr_1, \dots, vr_{n_P}, wr_{n_P})    (6)

where F(P) is the feature vector of P, and Sv and Sw are the sums of the remaining values and weights, respectively.

After obtaining a table of problem instances with comparable items, we sort, for each row (i.e. each problem instance), the columns (i.e. the vr_i and wr_i) in descending order with respect to vr_i. In other words, the first two item columns of each row, i.e. vr_1 and wr_1, correspond to the item with the highest normalized value. The second two columns, vr_2 and wr_2, correspond to the item with the second highest normalized value, and so on. For a problem P with n_P < N, the item information occupies the columns from vr_1 and wr_1 to vr_{n_P} and wr_{n_P}. The values of vr_{n_P+1} to vr_N and of wr_{n_P+1} to wr_N are zero. This ordering helps to aggregate the highest vr_i of all problem instances with a single aggregation strategy, because the problem instances are comparable and the highest vr_i is always in the same column. The same holds for the second, third and further highest vr_i. Each column is called a feature, and the next step is to derive an aggregation strategy for the values of each feature.
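A sketch of this pre-processing for a single instance, combining Eqns. (4)-(6) with the descending sort by normalized value; the function name and array layout are illustrative.

```python
import numpy as np

def feature_row(values, weights, capacity, N):
    """One row of the feature table: n_P, W_P, Sv, Sw followed by the
    (vr_i, wr_i) pairs sorted by decreasing vr_i and zero-padded to 2N + 4."""
    v, w = np.asarray(values, float), np.asarray(weights, float)
    vr = v / (w * capacity)          # Eq. (4)
    wr = w / capacity                # Eq. (5)
    order = np.argsort(-vr)          # item with highest vr first
    row = np.zeros(2 * N + 4)
    row[:4] = len(v), capacity, v.sum(), w.sum()
    row[4:4 + 2 * len(v):2] = vr[order]
    row[5:5 + 2 * len(v):2] = wr[order]
    return row
```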
State aggregation through Q-Learning.

The idea of the aggregation is to reduce the number of unique values of all features. We do such a reduction by splitting the values of one feature into several groups and then mapping each group's values to a particular integer. The proper number of splits for each feature is learned by reinforcement learning. For each feature F_k with k ∈ {1, ..., 2N + 4}, let action d_{F_k} be the number of splits on the values of feature F_k, and let F_{k,P} be the value of feature F_k for problem instance P. Among all features, we perform state aggregation on the vr_i and wr_i of each item i.

For aggregating the values of vr_i over all M problem instances, the action d_{vr_i} ∈ {1, 2, ..., x} is the number of splits, and its optimal value d*_{vr_i} is obtained by Algorithm 2. Using d*_{vr_i} splits, the values of vr_i are divided into d*_{vr_i} + 1 subsets, and all the subsets except the last one have ⌈M/(d*_{vr_i}+1)⌉ values. The last subset has M − ⌈M/(d*_{vr_i}+1)⌉ · d*_{vr_i} values. Then, all values of each subset are converted to an integer starting from 0. This process transforms the values of feature vr_i into the set of integers {0, 1, ..., d*_{vr_i}}.
As an example, assume there are M = 7 problem instances, the values of vr_1 are (1, …, 5) and d*_{vr_1} is 2. These values need to be divided into d*_{vr_1} + 1 = 3 subsets. First they are sorted; then three subsets are obtained, each with ⌈7/3⌉ = 3 values, except the last one, which has one value. Finally, the values of vr_1 are aggregated, reducing vr_1 to 3 unique values.

For all wr_i, d*_{wr_i} is 2 and the split points are 0.5 and 1. The motivation of this hard-coded setting is to separate illegal, light and heavy weights: illegal weights are those with wr_i > 1, light weights are those with wr_i ≤ 0.5, and heavy weights are those with 0.5 < wr_i ≤ 1. We define a mapping function map(F_{k,P}, d*_{F_k}) that takes F_{k,P} and returns an integer corresponding to a subset based on the d*_{F_k} splits. We used heuristics to define the reward function R(F_k, d_{F_k}), which is shown in Eqn. (7).

R(F_k, d_{F_k}) = \frac{\prod_{j=1}^{d_{F_k}+1} l_{F_k,j}}{(d_{F_k} + 1) \times c_{F_k, d_{F_k}}}    (7)

where l_{F_k,j} is the size of the j-th subset and c_{F_k,d_{F_k}} is the number of all common values between the subsets. The three main motivations of this reward design are as follows.
– We aim to define the reward function such that it reduces the size of the state space. The number of unique states for each feature is d_{F_k} + 1 after applying d_{F_k} splits, and this value is inversely related to the reward of each action.
– For feature F_k and d_{F_k} splits, let j ∈ {1, ..., d_{F_k}+1} index a subset based on the d_{F_k} splits and let l_{F_k,j} be the difference between the maximum and minimum values of the j-th subset. As larger l_{F_k,j} entail aggregating more values, their rewards are higher than those for smaller l_{F_k,j}. However, unequal subsets contain unequal numbers of values. For example, if the feature values are uniformly dispersed between 0 and 10, creating two subsets with lengths 5 and 5 is better than two subsets with lengths 1 and 9. Therefore, the product of the l_{F_k,j} over all j is in the numerator of the reward function.
– Distinct states help an agent to derive a deterministic policy, because the states have dissimilar features. Likewise, two subsets with fewer overlapping values represent different sets of states and the policy can better distinguish them. When a value is common between two subsets, it can be assigned to either of them, and assigning it to different subsets entails different policies that may have different performance. In order to reduce the number of common values between groups, we define c_{F_k,d_{F_k}} as the total number of values shared between different subsets.

A Q-table is constructed for the states and actions, and it is filled by the Q-learning algorithm [24], as shown in Algorithm 2. Each vr_i is a state and the next state is vr_{i'}, where i' is arbitrary. Finally, an optimal decision is found by using the Q-table for each feature. The algorithm is used for aggregating the vr_i, and we denote d*_{vr_i} as the optimal aggregation action for each vr_i. The state embedding derived by this strategy is a feature vector consisting of the aggregated features, and this state embedding is used in line 11 of Algorithm 1. Equation (8) shows s(P), the state embedding of P.

s(P) = \{\, map(F_{k,P}, d^*_{F_k}) : \forall F_k \in F(P) \,\}    (8)
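The following sketch shows one way to implement the aggregation map and the reward of Eq. (7) that Algorithm 2 below maximizes; the treatment of values shared between subsets and the guard against division by zero are our assumptions.

```python
import numpy as np

def aggregate(feature_values, d):
    """Split the M values of a feature into d+1 groups of (almost) equal size
    after sorting, and map each group to an integer label in {0, ..., d}."""
    vals = np.asarray(feature_values, float)
    group = int(np.ceil(len(vals) / (d + 1)))
    labels = np.empty(len(vals), dtype=int)
    labels[np.argsort(vals)] = np.arange(len(vals)) // group
    return labels

def split_reward(feature_values, d):
    """Eq. (7): product of the subset ranges l_j divided by (d+1) times the
    number of values shared between subsets (c, clipped to 1 in this sketch)."""
    vals = np.asarray(feature_values, float)
    labels = aggregate(vals, d)
    lengths = [np.ptp(vals[labels == j]) if np.any(labels == j) else 0.0
               for j in range(d + 1)]
    c = sum(len(set(labels[vals == v])) - 1 for v in np.unique(vals))
    return np.prod(lengths) / ((d + 1) * max(c, 1))
```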
Algorithm 2
Q-Learning for State Aggregation
Input: Feature table of problem instances P_1, ..., P_M
Output: The optimal number of split points for each vr_i
1: Initialize a Q-table with N rows and x columns; the states are the features and the actions are the numbers of split points
2: Select item i randomly
3: repeat
4:   Select i' randomly as the next item
5:   Select d_{vr_i} ∈ {1, ..., x} according to an ε-greedy policy
6:   Find R(vr_i, d_{vr_i}) using Eqn. (7)
7:   Update Q(vr_i, d_{vr_i}) ← Q(vr_i, d_{vr_i}) + α[R(vr_i, d_{vr_i}) + γ max_{d'} Q(vr_{i'}, d') − Q(vr_i, d_{vr_i})]
8:   i ← i'
9: until convergence
10: return d*_{vr_i} = argmax_d Q(vr_i, d) ∀i

The DRL-based knapsack solver is applied to three different types of problem instances. We tested different algorithms for training the policy DNN, such as Deep Q Network (DQN) [20], Advantage Actor Critic (A2C) [19], Proximal Policy Optimization [22] and Sample Efficient Actor-Critic with Experience Replay [27], and the A2C algorithm was selected because it provided better solutions. We used the stable-baselines tools for implementing the A2C algorithm [12]. The policy network consists of two layers of 64 nodes, and the method is trained on 10 episodes which are selected from the M instances. The DRL with aggregation algorithm is compared with (1) the greedy algorithm and (2) DRL without aggregation. The problem instances and code used for the experiments are available at URL. The link is not shown due to the blind review.
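Below is a self-contained sketch of this training setup. The paper uses the original stable-baselines A2C [12]; this sketch uses the Gymnasium API and the successor package stable-baselines3 instead, feeds the raw (non-aggregated) state rather than the embedding of Eq. (8), and the toy instances, termination details and timestep budget are purely illustrative.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import A2C

class KnapsackEnv(gym.Env):
    """One episode constructs a solution for one randomly drawn instance,
    using the raw 2N+4 state and the reward of Eq. (1)."""

    def __init__(self, instances, N):
        self.instances, self.N = instances, N
        self.observation_space = spaces.Box(-np.inf, np.inf, (2 * N + 4,), np.float32)
        self.action_space = spaces.Discrete(N)

    def _obs(self):
        s = np.zeros(2 * self.N + 4, dtype=np.float32)
        n = len(self.v)
        s[:4] = n, self.W, sum(self.v), sum(self.w)
        s[4:4 + 2 * n:2], s[5:5 + 2 * n:2] = self.v, self.w
        return s

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        v, w, W = self.instances[self.np_random.integers(len(self.instances))]
        self.v, self.w, self.W = list(v), list(w), float(W)
        return self._obs(), {}

    def step(self, a):
        if a >= len(self.v):                  # undefined action: large penalty
            return self._obs(), -self.W, True, False, {}
        if self.w[a] <= self.W:               # item fits: pack it
            r, self.W = self.v[a], self.W - self.w[a]
        else:                                 # too heavy: discard without packing
            r = -self.w[a]
        self.v.pop(a); self.w.pop(a)
        return self._obs(), float(r), len(self.v) == 0, False, {}

# Two hidden layers of 64 nodes, as reported above; toy data and budget only.
instances = [([10, 7, 3], [4, 5, 2], 6)] * 8   # (values, weights, W_P) triples
model = A2C("MlpPolicy", KnapsackEnv(instances, N=50),
            policy_kwargs=dict(net_arch=[64, 64]))
model.learn(total_timesteps=10_000)
```

Items can then be selected greedily from the trained policy with model.predict(obs, deterministic=True), or stochastically (softmax-like) with deterministic=False.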
The three different types of instances are called Random Instances, Fixed W_P Instances and Hard Instances. A set of M problem instances forms a dataset, and the maximum number of items over all instances in the dataset is N. Each dataset contains instances of one of the following types.

Random instances (RI):
A dataset of random instances has M problem instances, where each instance P has n_P ∈ {1, 2, ..., N} items. For an item i, v_i and w_i are randomly generated integers from one to R, which is a fixed upper bound on v_i and w_i. W_P is a random integer between R/10 and 3R. Three datasets of random instances are generated with M = 1000. For these three datasets, N is 50, 300 and 500, and R is 100, 600 and 1800, respectively.

Fixed W_P Instances (FI):
In [2], a set of KP instances with fixed capacity and fixed item set size is used for evaluation. We generated three datasets of the same type of instances with M = 1000. The N for these three datasets is 50, 300 and 500, respectively. The values and weights of all items in the three datasets are random real numbers between zero and one. W_P is fixed for all instances: 12.5 for N = 50, 37.5 for N = 300 and 37.5 for N = 500.

Hard instances (HI):
In [21], a group of hard-to-solve problem instances was introduced in which, for each item i, v_i is strongly correlated with w_i. Specifically, w_i is a random integer in [1, R], v_i = w_i + R/10 and W_P = \frac{p}{M+1} \sum_{i=1}^{n_P} w_i, where p is the id of P. Three datasets of M = 1000 hard instances are generated. For the first dataset, N is 50 and R is 100. Likewise, N is 300 and 500, and R is 600 and 1000, for the second and third datasets respectively.
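A sketch of how such datasets could be generated from these descriptions; details the text leaves open (e.g. integer rounding of R/10 for RI capacities, and the number of items per FI and HI instance) are our assumptions.

```python
import random

def random_instance(N, R):
    """RI: n_P in {1,...,N}; integer v_i, w_i in [1, R]; W_P in [R/10, 3R]."""
    n = random.randint(1, N)
    v = [random.randint(1, R) for _ in range(n)]
    w = [random.randint(1, R) for _ in range(n)]
    return v, w, random.randint(R // 10, 3 * R)

def fixed_wp_instance(N, W_P):
    """FI: N items with real values and weights in (0, 1), fixed capacity W_P."""
    return ([random.random() for _ in range(N)],
            [random.random() for _ in range(N)], W_P)

def hard_instance(N, R, p, M):
    """HI [21]: w_i random in [1, R], v_i = w_i + R/10, W_P = p/(M+1) * sum(w_i)."""
    w = [random.randint(1, R) for _ in range(N)]
    v = [w_i + R / 10 for w_i in w]
    return v, w, p / (M + 1) * sum(w)
```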
The following metrics are considered to evaluate the performance of the DRL-based KP solver on the instances introduced in 5.1.

Average value of the solutions (Val). For each dataset of M problem instances, Val is the average of all solution values (total values of the selected items). Likewise, Val_opt is the average value of the optimal solutions, which are obtained using the optimization solver Gurobi [11].

Learning rate. In order to assess the learning rate, the rate of increase in Val per timestep is calculated, and the result is shown for each instance type when N = 300.

Number of optimally solved instances (opt). In order to evaluate the performance of the method on individual problem instances, the number of instances for which the method finds the optimal solution is computed for each dataset.
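Val_opt can be reproduced with a standard 0-1 knapsack model; a sketch for a single instance using the Gurobi Python API (function name ours).

```python
import gurobipy as gp
from gurobipy import GRB

def optimal_value(values, weights, capacity):
    """Exact 0-1 KP: maximize sum(v_i x_i) subject to sum(w_i x_i) <= W_P."""
    m = gp.Model("kp")
    m.Params.OutputFlag = 0                     # silence the solver log
    x = m.addVars(len(values), vtype=GRB.BINARY)
    m.setObjective(gp.quicksum(values[i] * x[i] for i in range(len(values))),
                   GRB.MAXIMIZE)
    m.addConstr(gp.quicksum(weights[i] * x[i] for i in range(len(weights))) <= capacity)
    m.optimize()
    return m.ObjVal
```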
Table 1. Results of different algorithms on datasets of M = 1000 problem instances. The method of [2] is not applicable to RI and HI; for small N it is optimal, as is the DRL w/ aggregation method. Two approaches may both find the optimal solution for a given instance, hence the total number of optimally solved instances is not necessarily 1000. (Entries marked … were lost in extraction.)

| Dataset | Method | N | Val | opt | highest | Val_opt | Val/Val_opt |
| … | Greedy | 50 | 429.10 | 596 | 0 | 434.78 | 98.694% |
| … | DRL w/o aggregation | 50 | 434.09 | 893 | 7 | 434.78 | 99.843% |
| … | DRL w/ aggregation | 50 | … | … | … | … | … |
| … | … | … | … | … | 161 | 27965.76 | 99.933% |
| … | DRL w/ aggregation | 300 | … | … | … | … | … |
| … | … | … | … | … | 168 | 81103.99 | 99.899% |
| … | DRL w/ aggregation | 500 | … | … | … | … | … |
Number of instances with the highest solution value (highest). This metric compares the solution values of DRL with aggregation and DRL without aggregation and counts the number of times each one is higher. This value is calculated for the last M/2 instances of the hard instance datasets, because W_P = \frac{p}{M+1} \sum_{i=1}^{n_P} w_i is increasing with respect to p. Therefore, with larger W_P, the set of feasible solutions is bigger and hence finding the optimal solution is more difficult.

We ran the algorithm with 1000 problem instances. Table 1 shows the quality of the solutions for the different types of KP instances, RI, FI and HI, that are obtained
by DRL algorithms with (i.e. w/) and without (i.e. w/o) aggregation, and the greedy algorithm (Greedy).

Fig. 2. Random Instances: sum of the values of the solutions versus timesteps, for DRL w/ and w/o aggregation.

Table 1 contains the ratio of Val to Val_opt. These values show that the ratio between the solutions provided by our proposed method (DRL w/ aggregation) and the optimal solutions is most of the time above 99%.

Comparison with [2].
The pointer network based DRL method [2] is also able to find close to optimal solutions for problem sizes up to N = 200. However, the method of [2] can only be applied to instances with exactly the same number of items N and, in addition, with exactly the same capacity value W_P. In comparison, our DRL formulation allows solving instances of any size up to and including N = 500, and of any capacity value W_P.

Comparison with Greedy and DRL without aggregation.
The results show that the proposed DRL-based methods, with or without aggregation, always perform better than the greedy algorithm in terms of the average solution quality (Val), the number of optimally solved instances (opt), and the number of instances with the highest solution value (highest).

When we evaluate the advantage of having state aggregation, we notice that the state aggregation strategy improves the solutions especially for large instances, which is clearly observable in the solutions of the RI and FI instances. Regarding the hard instances HI, the DRL with aggregation method is better than
the one without aggregation strategy in terms of solution quality (Val).

Fig. 3. Fixed W_P instances: sum of the values of the solutions versus timesteps, for DRL w/ and w/o aggregation.

Fig. 4. Hard instances: sum of the values of the solutions versus timesteps, for DRL w/ and w/o aggregation.

For the large instances of sizes 300 and 500, the DRL without aggregation finds more optimal solutions than the one with aggregation, with 370 vs 336 for N = 300 and 217 vs 166 for N = 500. However, when looking at their performance in terms of how many times they obtain the highest solution values for the 500 more difficult instances in HI, DRL with aggregation performs better than without aggregation, with 72 more wins for N = 300 and 133 more wins for N = 500. We investigate these HI instances further. We have mentioned that W_P is increasing with respect to p, the identifier of each problem instance P; for M problem instances, this identifier ranges from one to M. When p is small, W_P is also small. Since vr_i is inversely related to W_P, it is large when W_P is small. For small W_P, the number of feasible solutions is low because fewer items can fit into the knapsack. These problem instances are hence not actually "hard". In this case, state aggregation is not beneficial, as aggregating the vr_i of the items of the instances with small p leads to sub-optimal solutions. However, problem instances with large p have a larger feasible solution space, and hence aggregation is beneficial as it reduces the state size and enhances generalization.

The other benefit of the DRL with aggregation method is that it is able to find high quality solutions in fewer timesteps. As can be observed from Figures 2, 3 and 4, the learning rate of the DRL with aggregation method is higher than that of the DRL without aggregation. Hence, in general, it not only provides better solutions, but it also finds them in around 10,000 fewer timesteps.
In this paper we developed a DRL-based method for boosting the heuristic greedy algorithm and solving KP. In the DRL-based KP solver, a policy DNN and a value DNN are trained using the A2C algorithm, and the policy DNN is used for sequentially selecting items to find a solution. The states in the DRL modeling of KP contain the information of the instances, which is aggregated to reduce the state space. The state aggregation policy is derived by solving a tabular RL problem. Using this aggregation policy, a state embedding is obtained, and this state embedding is used within another RL framework to train the parameters of the policy network.

We applied this method to three types of problem instances, named random instances, fixed W_P instances and hard instances. Three datasets with at most 50, 300 and 500 items were generated for each type. The DRL with aggregation method found close to optimal solutions for these instances. It also found optimal solutions for the fixed capacity instances with a small number of items, as does the method developed in [2].

The proposed method can be generalized to other COPs. For instance, the TSP consists of a set of cities and the goal is to find the minimum length tour that visits every city exactly once. The cities might be the items, and an aggregation strategy could reduce the state space by aggregating the coordinates. As another example, in the minimum vertex cover problem, the items may be the vertices and aggregation can be performed by grouping the weights of the vertices.

In this paper, we use RL to automate the reduction of the state space, as a pre-processing step of the DRL-based approach for KP. It is also interesting to investigate in the future how better reward functions can be derived through learning. This might be very helpful for problems where much tuning is needed to derive a strong reward function. In general, automating the derivation of states, rewards and actions for RL problems is an interesting topic for future research.

References
1. Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A.: A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866 (2017)
2. Bello, I., Pham, H., Le, Q.V., Norouzi, M., Bengio, S.: Neural combinatorial optimization with reinforcement learning. In: ICLR (Workshop) (2017), https://academic.microsoft.com/paper/2560592986
3. Bengio, Y., Lodi, A., Prouvost, A.: Machine learning for combinatorial optimization: a methodological tour d'horizon. arXiv preprint arXiv:1811.06128 (2018)
4. Cook, S.: The P versus NP problem. The Millennium Prize Problems, pp. 87–104 (2006)
5. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press (2009)
6. Dai, H., Dai, B., Song, L.: Discriminative embeddings of latent variable models for structured data. In: International Conference on Machine Learning. pp. 2702–2711 (2016)
7. Dasgupta, S., Papadimitriou, C.H., Vazirani, U.V.: Algorithms. McGraw-Hill Higher Education (2008)
8. Du, D.Z., Pardalos, P.M.: Handbook of Combinatorial Optimization: Supplement, vol. 1. Springer Science & Business Media (2013)
9. Feng, Y., Yang, J., Wu, C., Lu, M., Zhao, X.J.: Solving 0–1 knapsack problems by chaotic monarch butterfly optimization algorithm with Gaussian mutation. Memetic Computing (2), 135–150 (2018)
10. Gu, S., Hao, T.: A pointer network based deep learning algorithm for 0–1 knapsack problem. In: 2018 Tenth International Conference on Advanced Computational Intelligence (ICACI). pp. 473–477. IEEE (2018)
11. Gurobi Optimization, LLC: Gurobi optimizer reference manual (2020)
12. Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y.: Stable Baselines. https://github.com/hill-a/stable-baselines (2018)
13. Huang, T., Ma, Y., Zhou, Y., Huang, H., Chen, D., Gong, Z., Liu, Y.: A review of combinatorial optimization with graph neural networks. In: 2019 5th International Conference on Big Data and Information Analytics (BigDIA). pp. 72–77. IEEE (2019)
14. Joshi, C.K., Laurent, T., Bresson, X.: An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227 (2019)
15. Karp, R.M.: Reducibility among combinatorial problems. In: Complexity of Computer Computations, pp. 85–103. Springer (1972)
16. Khalil, E., Dai, H., Zhang, Y., Dilkina, B., Song, L.: Learning combinatorial optimization algorithms over graphs. In: Advances in Neural Information Processing Systems. pp. 6348–6358 (2017)
17. Kool, W., van Hoof, H., Welling, M.: Attention, learn to solve routing problems! arXiv preprint arXiv:1803.08475 (2018)
18. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (11), 2278–2324 (1998)
19. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning. pp. 1928–1937 (2016)
20. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
21. Pisinger, D.: Where are the hard knapsack problems? Computers & Operations Research (9), 2271–2284 (2005)
22. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
23. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. pp. 3104–3112 (2014)
24. Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning, vol. 2. MIT Press, Cambridge (1998)
25. Vazirani, V.V.: Approximation Algorithms. Springer Science & Business Media (2013)
26. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks. In: Advances in Neural Information Processing Systems. pp. 2692–2700 (2015)
27. Wang, Z., Bapst, V., Heess, N., Mnih, V., Munos, R., Kavukcuoglu, K., de Freitas, N.: Sample efficient actor-critic with experience replay. arXiv preprint arXiv:1611.01224 (2016)
28. Wilbaut, C., Hanafi, S., Salhi, S.: A survey of effective heuristics and their application to a variety of knapsack problems. IMA Journal of Management Mathematics 19