Investment vs. reward in a competitive knapsack problem
Oren Neumann
Institute for Theoretical Physics, Goethe University Frankfurt, Frankfurt am Main, Germany
[email protected]
Claudius Gros
Institute for Theoretical Physics, Goethe University Frankfurt, Frankfurt am Main, Germany
[email protected]
Abstract
Natural selection drives species to develop brains, with sizes that increase with the complexity of the tasks to be tackled. Our goal is to investigate the balance between the metabolic costs of larger brains and the advantage they provide in solving general combinatorial problems. Defining advantage as the performance relative to competitors, a two-player game based on the knapsack problem is used. Within this framework, two opponents compete over shared resources, with the goal of collecting more resources than the opponent. Neural nets of varying sizes are trained using a variant of the AlphaGo Zero algorithm [1]. A surprisingly simple relation, $N_A/(N_A+N_B)$, is found for the relative win rate of a net with $N_A$ neurons against one with $N_B$. Success increases linearly with investments in additional resources when the network sizes are very different, i.e. when $N_A \ll N_B$, with returns diminishing when both networks become comparable in size.

Optimal resource allocation often leads to combinatorial problems, e.g. time can be minimized by solving the travelling salesman problem. In nature, resource allocation is in many cases a competitive process, with animals competing with each other over shared resources. A key question in this context regards the balance between success rate and investments in computational capabilities, e.g. in terms of metabolic cost. This question is relevant also for artificial neural networks, which are notoriously bad at solving combinatorial problems [2], in particular when compared with the performance of traditional, problem-specific algorithms. Lately, there has been rising interest in applying deep learning architectures to combinatorial problems [3, 4, 5]. In this context we are interested in the evolutionary factors determining network sizes, when assuming that larger nets come with correspondingly larger metabolic costs. Presently it is, however, unclear how costs scale with reward; that is, the precise functional dependence of success on network size is not known. Here we investigate this question within the 0-1 knapsack problem, a basic resource allocation task that can be generalized to a range of real-world situations [6].

The problem is defined as follows: given $n$ items with values $v_i$ and weights $w_i$, maximize
$$\sum_{i=1}^{n} v_i x_i \,, \qquad x_i \in \{0, 1\} \,, \qquad (1)$$
subject to a capacity constraint
$$\sum_{i=1}^{n} w_i x_i \leq W \,. \qquad (2)$$
Here $\{x_i\}$ denotes the chosen subset of items and $W$ is a constraint on the total weight. In other words, one should select the group of items with maximal total value which does not exceed the total weight limit. There are several algorithms developed specifically to solve the 0-1 knapsack problem and special cases of it, such as dynamic programming and branch-and-bound algorithms [7].
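To make that baseline concrete, a minimal sketch of the textbook dynamic-programming approach is given below; the function name is ours, and integer weights are assumed, so the real-valued instances used later would instead call for branch-and-bound or a prior discretization of the weights.

```python
def knapsack_01(values, weights, W):
    """Classic O(n*W) dynamic program for the 0-1 knapsack problem.

    Assumes integer weights and capacity W; real-valued instances
    would first need to be discretized (or solved by branch-and-bound).
    """
    # best[c] = maximal total value achievable with capacity c
    best = [0.0] * (W + 1)
    for value, weight in zip(values, weights):
        # iterate capacities downwards so each item is used at most once
        for c in range(W, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[W]

# toy instance: values, integer weights, capacity 5 -> optimal value 22
print(knapsack_01([6, 10, 12], [1, 2, 3], 5))
```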
Survival of the fittest pushes species to become better than their competition when too many individuals are fighting for the same resources. It is therefore logical to measure the success of an agent only in comparison to other agents, rather than looking for an absolute scale. For this reason we decided to focus on a two-player extension of the knapsack problem. In this setup, two agents compete over the same pool of resources, each trying to collect more value than the other. This framework allowed us to use a powerful tool for training deep learning models to play turn-based games, the AlphaGo Zero (αGZ) algorithm [1]. Suitably adapted, we applied αGZ to train neural network models of varying sizes $N$ to play the two-player knapsack game, with the goal of comparing their performances when playing against each other. This framework allows us to determine how increasing the number of available neurons affects the performance of the competing networks.

In order to simulate direct interactions between agents, we created a zero-sum game for two players which is an extension of the knapsack problem. During the course of a game, players pick items in turns from a common item pool. Each item can be selected by at most one player, and each player has a capacity limit for the total weight of items they can personally collect. The goal of the game is not to maximize the total value of items collected, but rather to surpass the total value gained by the opponent. A game progresses in the following manner:

• Each turn, a player may choose any item from the pool of free items and add it to their collection, provided this would not cause the total weight of items collected to exceed the weight limit. The item chosen is then removed from the common pool, and the other player plays their turn. If no suitable item is available, the current player passes.

• The two players take their turns one after the other, until no item is picked for two consecutive turns, meaning there are no more valid items. At this point the player with the larger total value is declared the winner.

For our simulations we focused on games where items are unique, $v_i \neq v_j$ for $i \neq j$, with both players having identical capacity limits. Since generating $v_i, w_i$ from a uniform distribution creates mostly easily solvable knapsack problem instances [8], we generated weakly correlated instances [9], which are generally harder to solve. These are defined as instances where $w_i$ is uniformly distributed in $[0, 1]$ and $v_i$ is uniformly distributed in a range of $w_i \pm 0.1$, confined to $[0, 1]$. The capacity limits were set proportional to the number of items, $W \propto n$, as this avoids generating easy instances in the normal knapsack problem setting [10]. The complexity of the two-player framework is somewhat increased, as compared to the original knapsack problem, as players need to deprive the opponent of items, in addition to optimizing their own collection.
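A possible sketch of such an instance generator follows, assuming the $\pm 0.1$ band of Pisinger's standard weakly correlated parametrization [9] and interpreting "confined to $[0,1]$" as clipping; the capacity is left as a free parameter.

```python
import numpy as np

def weakly_correlated_instance(n, rng=None):
    """One weakly correlated knapsack instance [9]: weights uniform
    in [0, 1], values uniform in w_i +/- 0.1, clipped to [0, 1]."""
    rng = rng or np.random.default_rng()
    w = rng.uniform(0.0, 1.0, size=n)
    v = np.clip(w + rng.uniform(-0.1, 0.1, size=n), 0.0, 1.0)
    return v, w

v, w = weakly_correlated_instance(16)
order = np.argsort(-v / w)    # sort by descending value/weight ratio,
v, w = v[order], w[order]     # as done later for the network inputs
```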
Training was done using an adaptation of the αGZ algorithm. The neural network agents are trained by reinforcement learning on self-generated data, without using any a priori knowledge of the game. The networks receive as inputs the current state of the game, i.e. the weights and values of all items and a list of all items acquired so far by each player. The output consists of a policy vector and a value prediction:
$$f(s) = (\vec{p}, v) \,, \qquad (3)$$
where $v$ is a prediction of the value of the current game position $s$, which is equivalent to the expected final outcome of the game from this position. We set the possible game outcomes to be $\{0, 1\}$ for losing/winning the game respectively, therefore implying that $v \in [0, 1]$. $\vec{p}$ is a probability distribution over all available moves, which should prioritize better moves and is used to guide the Monte Carlo tree search (MCTS). The neural net is trained on randomly picked game states that were visited in previous games played, using the loss function:
$$l = (z - v)^2 - \vec{\pi} \cdot \log \vec{p} \,. \qquad (4)$$
Here $z$ is the recorded game outcome and $\vec{\pi}$ is an improved policy vector generated by the MCTS. Before each training phase, several games are played using the current version of the agent, pitched against a copy of itself. During a game, each player makes use of a search tree in order to choose every move. The tree contains a parent node holding information about the current game state, with nodes descending from it for each possible future game state that has been explored before. Every turn, the agent is given a certain number of iterations to expand their search tree before deciding which move to make.

MCTS makes use of the information stored in each node in order to expand new leaf nodes. A node contains four parameters: the number of times the node has been visited in previous searches, $N$; the sum of all scores predicted for this node and its children, $W$; the mean score $Q = W/N$; and the policy vector $\vec{p}$, calculated once by the agent for this game state. Each iteration of expanding the tree traverses it by starting at the first node, i.e. the current game state, and repeatedly moving to the child node which maximizes a quantity $Q + U$, until a previously unvisited node is reached. The term $U$ is defined according to:
$$U = c P_a \frac{\sqrt{\sum_i N_i}}{N_a} \,, \qquad (5)$$
where $c$ is a constant, $P_a$ is the policy vector element corresponding to the action $a$ leading to the node, $N_a$ is the node's visit count, and $\sum_i N_i$ is the sum of the visit counts of all available actions. When an unexplored node is reached, it is added to the tree and the parameters of it and of all other nodes traversed are updated. After the expansion iterations are done, a move is chosen according to an improved policy $\vec{\pi}$, defined by using the visit counts of all possible moves:
$$\pi_i \propto N_i^{1/\tau} \,, \qquad (6)$$
where $\tau$ is a temperature parameter controlling the degree of exploration, set to $1$ in our case. The turn then ends, and both players trim their search trees to remove all nodes that are now unreachable, making the new game state the parent node of the whole tree.
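A minimal sketch of this selection rule, under our reading of Eq. (5); the class layout, the choice $c = 1$, and returning unvisited children immediately are our own illustrative choices.

```python
import math

class Node:
    """One game state in the search tree."""
    def __init__(self, prior):
        self.P = prior        # policy prior for the action leading here
        self.N = 0            # visit count
        self.W = 0.0          # sum of predicted scores
        self.children = {}    # action -> Node

    @property
    def Q(self):              # mean score, 0 for unvisited nodes
        return self.W / self.N if self.N > 0 else 0.0

def select_child(node, c=1.0):
    """Pick the child maximizing Q + U, Eq. (5); an unvisited
    child is selected directly, ending the descent there."""
    total = sum(ch.N for ch in node.children.values())
    best, best_score = None, -math.inf
    for action, ch in node.children.items():
        if ch.N == 0:                     # unexplored leaf: expand it
            return action, ch
        u = c * ch.P * math.sqrt(total) / ch.N
        if ch.Q + u > best_score:
            best, best_score = (action, ch), ch.Q + u
    return best
```

Once the search budget is exhausted, the improved policy of Eq. (6) follows directly from the visit counts; with $\tau = 1$ it is simply the normalized visit-count vector.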
We trained fully connected feed-forward neural network models to play two-player knapsack games with 16 items. Each net had 2 hidden layers with ReLU activations, both with the same number of neurons, which changed from one net to the other. The inputs consisted of 4 vectors of length 16: a vector of the items' weights, a vector of the values, and two binary vectors encoding the items taken by each player. The items were ordered by descending ratio $v_i/w_i$, such that a minimal-sized model could easily implement a greedy strategy of picking the items with the best ratios. The network outputs were a policy vector with a softmax activation and a value head with a sigmoid activation. Every net was trained on a total of 40,000 self-generated games, where optimization steps were taken every 40 games and performance was evaluated every 4,000 games, saving the best performing version. In order to save time, each evaluation step consisted of 200 games of the net against a greedy algorithm. The agents used 40 Monte Carlo steps per turn in all games. All model instances were given the same number of optimization steps regardless of their sizes, without early stopping if performance saturated. The final results presented in the next section were generated by training four separate copies of each net and taking the average of their game outcomes.

In order to interpret our results we made use of the Elo rating system [11], a rating system for zero-sum games that was invented for chess and has gained popularity in numerous games. The expected score, or winning probability, of player A over player B is given by the formula:
$$P_A = \frac{1}{1 + 10^{(R_B - R_A)/400}} \,, \qquad (7)$$
where $R_A, R_B$ are the Elo ratings of the two players. These ratings are found by repeatedly matching players together and adjusting their ratings to fit the game outcomes.
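In code, Eq. (7) and one common incremental update rule read as follows; the K-factor below is an illustrative default, not a parameter of our training setup.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B, Eq. (7)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=16.0):
    """Shift both ratings toward the observed outcome
    (score_a is 1 for a win of A, 0 for a loss)."""
    delta = k * (score_a - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

# a 400-point gap corresponds to roughly 10:1 odds
print(elo_expected(1600, 1200))  # -> 0.909...
```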
Combinatorial problems manifest themselves in many ways in nature, making it beneficial for organisms to evolve ways to solve them. But unlike raw mathematical problems, which in most cases have goals and solving conditions that are defined in absolute terms, in the natural world evolution pushes agents to overcome their competition, such that merit is measured predominantly in relation to one's competitors. It is therefore biologically sensible to look at adversarial problems when analysing the performance of neural networks at solving combinatorial problems. This is why we chose to work on the two-player competitive version of the knapsack problem, where the merit of an agent in relation to other agents is clearly defined by the probability of winning or losing a game against them.

Figure 1: Training neural networks to play competitive knapsack problem games with 16 items. (a) Probability of net A to win against net B, when both have differing numbers of neurons. The black lines mark increments of 0.1 in winning probability; the red lines mark the same for the prediction given by Eq. (8). (b) Elo rating of networks with different numbers of neurons on a log scale, with the logarithmic fit $y = 400\log_{10}(N) + \mathrm{const}$ in red.

Figure 1 presents the outcome of training neural nets of varying sizes to play a competitive knapsack problem game with 16 weakly correlated items [9]. All neural nets contained two hidden layers, both with the same number of neurons, which was changed from one net to the other.

The problem displays diminishing returns, requiring increasingly larger differences between the sizes of competing nets in order to maintain the same win-lose statistics between them as they increase in size. This is supported by plotting the Elo rating of the nets, which scales logarithmically with network size. In fact, the Elo rating is fitted well by $y = 400\log_{10}(N) + \mathrm{const}$, where $N$ is the number of hidden neurons. Since 400 is the Elo scaling factor, plugging this expression into Eq. (7) reveals that the probability of net A to win a game against net B reduces to
$$P_A = \frac{N_A}{N_A + N_B} \,, \qquad (8)$$
where $N_A, N_B$ are the numbers of hidden neurons of nets A and B, respectively. In other words, the probability of player A to win is $N_A/N_B$ times greater than that of player B, meaning the performance of neural nets against each other in this task is determined directly by the ratio of their sizes.

A peculiar property of the Elo rating system is that it assumes the existence of an absolute rating scale which determines game outcomes solely according to the difference in ratings between the two players. It could be possible that other factors also affect game outcomes, which would require a more complex model. To rule out this possibility, we plotted the full map of game outcomes for every combination of players in Figure 1a. Lines of equal winning probability predicted by Eq. (8) are marked in red. It is clear that the theoretical prediction matches the corresponding lines obtained through simulations (in black) to an astonishing degree, confirming that Eq. (8) holds.

This result implies a linear increase of the probability $P_A$ with network size in the regime where $N_A \ll N_B$, and diminishing returns when both opponents are relatively equally matched. Therefore a large difference between competitors can be quickly closed when they evolve to maximize their own utility. Note that when $N_A \approx N_B$ the benefit of increasing the number of neurons by a multiplicative factor, $N_A \to \gamma N_A$, is independent of $N_A$, always yielding the same winning probability $\gamma/(\gamma + 1)$.
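As a short consistency check (a sketch with arbitrary example sizes): with the fitted ratings $R = 400\log_{10} N$, Eq. (7) reproduces Eq. (8) exactly.

```python
import math

def p_win_elo(n_a, n_b):
    """Win probability from Eq. (7) with fitted ratings R = 400*log10(N)."""
    r_a, r_b = 400 * math.log10(n_a), 400 * math.log10(n_b)
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def p_win_ratio(n_a, n_b):
    """Win probability from Eq. (8)."""
    return n_a / (n_a + n_b)

for n_a, n_b in [(16, 64), (100, 100), (300, 100)]:
    assert abs(p_win_elo(n_a, n_b) - p_win_ratio(n_a, n_b)) < 1e-12

# doubling your size against an equal opponent (gamma = 2)
print(p_win_ratio(200, 100))  # -> 2/3, i.e. gamma/(gamma + 1)
```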
References

[1] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[2] Kate A. Smith. Neural networks for combinatorial optimization: a review of more than a decade of research. INFORMS Journal on Computing, 11(1):15–34, 1999.
[3] Kenshin Abe, Zijian Xu, Issei Sato, and Masashi Sugiyama. Solving NP-hard problems on graphs with extended AlphaGo Zero. arXiv preprint arXiv:1905.11623, 2019.
[4] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6348–6358, 2017.
[5] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
[6] Carsten Murawski and Peter Bossaerts. How humans solve complex problems: The case of the knapsack problem. Scientific Reports, 6:34851, 2016.
[7] Silvano Martello, David Pisinger, and Paolo Toth. New trends in exact algorithms for the 0–1 knapsack problem. European Journal of Operational Research, 123(2):325–332, 2000.
[8] Kate Smith-Miles and Leo Lopes. Measuring instance difficulty for combinatorial optimization problems. Computers & Operations Research, 39(5):875–889, 2012.
[9] David Pisinger. Where are the hard knapsack problems? Computers & Operations Research, 32(9):2271–2284, 2005.
[10] Mattias Ohlsson, Carsten Peterson, and Bo Söderberg. Neural networks for optimization problems with inequality constraints: the knapsack problem. Neural Computation, 5(2):331–339, 1993.
[11] Mark E. Glickman and Albyn C. Jones. Rating the chess rating system. Chance, 12(2):21–28, 1999.