A Deep Q-Network for the Beer Game: A Deep Reinforcement Learning Algorithm to Solve Inventory Optimization Problems
Afshin Oroojlooyjadid, MohammadReza Nazari, Lawrence Snyder, Martin Takáč
Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA 18015, {oroojlooy, mon314, larry.snyder}@lehigh.edu, [email protected]

Problem definition:
The beer game is a widely used game that is played in supply chain management classes to demonstrate the bullwhip effect and the importance of supply chain coordination. The game is a decentralized, multi-agent, cooperative problem that can be modeled as a serial supply chain network in which agents choose order quantities while cooperatively attempting to minimize the network's total cost, even though each agent only observes its own local information.
Academic/practical relevance:
Under some conditions, a base-stock replenishment policy is optimal. However, in a decentralized supply chain in which some agents act irrationally, there is no known optimal policy for an agent wishing to act optimally.
Methodology:
We propose a reinforcement learning (RL) algorithm, based on deep Q-networks, to play the beer game. Our algorithm does not place restrictions on the costs or other beer game settings. Like any deep RL algorithm, training can be computationally intensive, but it can be performed ahead of time; the algorithm executes in real time when the game is played. Moreover, we propose a transfer-learning approach so that the training performed for one agent can be adapted quickly for other agents and settings.
Results:
When playing with teammates who follow a base-stock policy, our algorithm obtains near-optimal order quantities. More importantly, it performs significantly better than a base-stock policy when the other agents use a more realistic model of human ordering behavior. Finally, applying transfer learning reduces the training time by one order of magnitude.
Managerial implications:
This paper shows how artificial intelligence can be applied to inventory optimization. Our approach can be extended to other supply chain optimization problems, especially those in which supply chain partners act in irrational or unpredictable ways.
Key words : Inventory Optimization, Reinforcement Learning, Beer Game
1. Introduction
The beer game consists of a serial supply chain network with four agents (a retailer, a warehouse, a distributor, and a manufacturer) who must make independent replenishment decisions with limited information. The game is widely used in classroom settings to demonstrate the bullwhip effect, a phenomenon in which order variability increases as one moves upstream in the supply chain, as well as the importance of communication and coordination in the supply chain. The bullwhip effect occurs for a number of reasons, some rational (Lee et al. 1997) and some behavioral (Sterman 1989). It is an inadvertent outcome that emerges when the players try to achieve the stated purpose of the game, which is to minimize costs. In this paper, we are interested not in the bullwhip effect but in the stated purpose, i.e., the minimization of supply chain costs, which underlies the decision making in every real-world supply chain. For general discussions of the bullwhip effect, see, e.g., Lee et al. (2004), Geary et al. (2006), and Snyder and Shen (2019).

The agents in the beer game are arranged sequentially and numbered from 1 (retailer) to 4 (manufacturer). (See Figure 1.) The retailer node faces a stochastic demand from its customer, and the manufacturer node has an unlimited source of supply. There are deterministic transportation lead times ($l^{tr}$) imposed on the flow of product from upstream to downstream, though the actual lead time is stochastic due to stockouts upstream; there are also deterministic information lead times ($l^{in}$) on the flow of information from downstream to upstream (replenishment orders). Each agent may have nonzero shortage and holding costs.

Figure 1 Generic view of the beer game network (Supplier, Manufacturer, Distributor, Warehouse, Retailer, Customer).

In each period of the game, each agent chooses an order quantity $q$ to submit to its predecessor (supplier) in an attempt to minimize the long-run system-wide costs,

$$\sum_{t=1}^{T} \sum_{i=1}^{4} c^i_h (IL^i_t)^+ + c^i_p (IL^i_t)^-, \qquad (1)$$

where $i$ is the index of the agents; $t = 1, \ldots, T$ is the index of the time periods; $T$ is the time horizon of the game (which is often unknown to the players); $c^i_h$ and $c^i_p$ are the holding and shortage cost coefficients, respectively, of agent $i$; and $IL^i_t$ is the inventory level of agent $i$ in period $t$. If $IL^i_t > 0$, then the agent has inventory on hand, and if $IL^i_t < 0$, then it has backorders. The notation $x^+$ and $x^-$ denotes $\max\{0, x\}$ and $\max\{0, -x\}$, respectively.
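To make the objective in (1) concrete, the following sketch (our own illustration, not code from the paper) computes the total cost from a matrix of simulated inventory levels; the names `inventory_level`, `holding_cost`, and `shortage_cost`, and the example numbers, are assumptions introduced here.

```python
import numpy as np

def total_cost(inventory_level, holding_cost, shortage_cost):
    """Total cost in equation (1).

    inventory_level: array of shape (T, 4), IL^i_t for each period t and agent i.
    holding_cost, shortage_cost: length-4 arrays of c^i_h and c^i_p.
    """
    on_hand = np.maximum(inventory_level, 0.0)    # (IL^i_t)^+
    backorder = np.maximum(-inventory_level, 0.0)  # (IL^i_t)^-
    per_period = on_hand * holding_cost + backorder * shortage_cost
    return per_period.sum()

# Example with 3 periods and 4 agents (illustrative values only).
IL = np.array([[2.0, -1.0, 0.0, 3.0],
               [1.0,  0.0, -2.0, 2.0],
               [0.0,  1.0, 1.0, -1.0]])
c_h = np.array([2.0, 2.0, 2.0, 2.0])
c_p = np.array([2.0, 0.0, 0.0, 0.0])
print(total_cost(IL, c_h, c_p))
```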
The standard rules of the beer game dictate that the agents may not communicate in any way, and that they do not share any local inventory statistics or cost information with other agents until the end of the game, at which time all agents are made aware of the system-wide cost. In other words, each agent makes decisions with only partial information about the environment while also cooperating with the other agents to minimize the total cost of the system. According to the categorization by Claus and Boutilier (1998), the beer game is a decentralized, independent-learners (ILs), multi-agent, cooperative problem.

The beer game assumes the agents incur holding and stockout costs but not fixed ordering costs, and therefore the optimal inventory policy is a base-stock policy, in which each stage orders a sufficient quantity to bring its inventory position (on-hand plus on-order inventory minus backorders) equal to a fixed number, called its base-stock level (Clark and Scarf 1960). When there are no stockout costs at the non-retailer stages, i.e., $c^i_p = 0$ for $i \in \{2, 3, 4\}$, the well-known algorithm by Clark and Scarf (1960) provides the optimal base-stock levels. To the best of our knowledge, there is no algorithm to find the optimal base-stock levels for general stockout-cost structures. More significantly, when some agents do not follow a base-stock or other rational policy, the form and parameters of the optimal policy that a given agent should follow are unknown.

In this paper, we propose an extension of deep Q-networks (DQN) to solve this problem. Our algorithm is customized for the beer game, but we view it also as a proof-of-concept that DQN can be used to solve messier, more complicated supply chain problems than those typically analyzed in the literature. The remainder of this paper is as follows. Section 2 provides a brief summary of the relevant literature and our contributions to it. The details of the algorithm are introduced in Section 3. Section 4 provides numerical experiments, and Section 5 concludes the paper.
2. Literature Review
The beer game consists of a serial supply chain network. Under the conditions dictated by the game (zero fixed ordering costs, no ordering capacities, linear holding and backorder costs, etc.), a base-stock policy is optimal at each stage (Lee et al. 1997). If the demand process and costs are stationary, then so are the optimal base-stock levels, which implies that in each period (except the first), each stage simply orders from its supplier exactly the amount that was demanded from it. If the customer demands are i.i.d. random and if backorder costs are incurred only at stage 1, then the optimal base-stock levels can be found using the exact algorithm by Clark and Scarf (1960).

There is a substantial literature on the beer game and the bullwhip effect. We review some of that literature here, considering both independent learners (ILs) and joint action learners (JALs) (Claus and Boutilier 1998). (ILs have no information about the other agents' current states, whereas JALs may share such information.) For a more comprehensive review, see Devika et al. (2016). See Martinez-Moyano et al. (2014) for a thorough history of the beer game.

In the category of ILs, Mosekilde and Larsen (1988) develop a simulation and test different ordering policies, which are expressed using a formula that involves state variables such as the number of anticipated shipments and unfilled orders. They assume the customer demand is 4 in each of the first four periods, and then 7 per period for the remainder of the horizon. Sterman (1989) uses a similar version of the game in which the demand is 8 after the first four periods. (Hereinafter, we refer to this demand process as C(4, 8), or the classic demand process.) Also, he does not allow the players to be aware of the demand process. He proposes a formula (which we call the Sterman formula) to determine the order quantity based on the current backlog of orders, on-hand inventory, incoming and outgoing shipments, incoming orders, and expected demand. His formula is based on the anchoring and adjustment method of Tversky and Kahneman (1979). In a nutshell, the Sterman formula attempts to model the way human players over- or under-react to situations they observe in the supply chain, such as shortages or excess inventory. Note that Sterman's formula is not an attempt to optimize the order quantities in the beer game; rather, it is intended to model typical human behavior. There are multiple extensions of Sterman's work. For example, Strozzi et al. (2007) consider the beer game when the customer demand increases constantly after four periods and propose a genetic algorithm (GA) to obtain the coefficients of the Sterman model. Subsequent behavioral beer game studies include Croson and Donohue (2003) and Croson and Donohue (2006a).

Most of the optimization methods described in the first paragraph of this section assume that every agent follows a base-stock policy. The hallmark of the beer game, however, is that players do not tend to follow such a policy, or any policy. Often their behavior is quite irrational. There is comparatively little literature on how a given agent should optimize its inventory decisions when the other agents do not play rationally (Sterman 1989, Strozzi et al. 2007), that is, how an individual player can best play the beer game when her teammates may not be making optimal decisions.

Some of the beer game literature assumes the agents are JALs, i.e., information about inventory positions is shared among all agents, a significant difference compared to classical IL models. For example, Kimbrough et al. (2002) propose a GA that receives a current snapshot of each agent and decides how much to order according to the d + x rule. In the d + x rule, agent $i$ observes $d^i_t$, the received demand/order in period $t$, chooses $x^i_t$, and then places an order of size $a^i_t = d^i_t + x^i_t$. In other words, $x^i_t$ is the (positive or negative) amount by which the agent's order quantity differs from its observed demand. Giannoccaro and Pontrandolfo (2002) consider a beer game with three agents with stochastic shipment lead times and stochastic demand. They propose an RL algorithm to make decisions, in which the state variable is defined as the three inventory positions, each of which is discretized into 10 intervals. The agents use the d + x rule to determine the order quantity, with x restricted to a small set of integers. Note that these RL algorithms assume that real-time information is shared among agents, whereas ours adheres to the typical beer-game assumption that each agent only has local information.

Figure 2 A generic procedure for RL.
Reinforcement learning (Sutton and Barto 1998) is an area of machine learning that has been successfully applied to solve complex sequential decision problems. RL is concerned with the question of how a software agent should choose an action to maximize a cumulative reward. RL is a popular tool in telecommunications, robot control, and game playing, to name a few applications (see Li (2017)).

Consider an agent that interacts with an environment. In each time step $t$, the agent observes the current state of the system, $s_t \in S$ (where $S$ is the set of possible states), chooses an action $a_t \in A(s_t)$ (where $A(s_t)$ is the set of possible actions when the system is in state $s_t$), and receives reward $r_t \in \mathbb{R}$; the system then transitions randomly into state $s_{t+1} \in S$. This procedure is known as a Markov decision process (MDP) (see Figure 2), and RL algorithms can be applied to solve this type of problem.

The matrix $P_a(s, s')$, which is called the transition probability matrix, gives the probability of transitioning to state $s'$ when taking action $a$ in state $s$, i.e., $P_a(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$. Similarly, $R_a(s, s')$ defines the corresponding reward matrix. In each period $t$, the decision maker takes action $a_t = \pi_t(s_t)$ according to a given policy, denoted by $\pi_t$. The goal of RL is to maximize the expected discounted sum of the rewards $r_t$ when the system runs for an infinite horizon. In other words, the aim is to determine a policy $\pi : S \to A$ to maximize $\sum_{t=0}^{\infty} \gamma^t E[R_{a_t}(s_t, s_{t+1})]$, where $a_t = \pi_t(s_t)$ and $0 \le \gamma < 1$ is the discount factor. When $P_a(s, s')$ and $R_a(s, s')$ are known, the optimal policy can be obtained through dynamic programming or linear programming (Sutton and Barto 1998).
Another approach for solving this problem is Q-learning, a type of RL algorithm that obtains the Q-value for any $s \in S$ and $a = \pi(s)$, i.e.,

$$Q(s, a) = E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a; \pi \right].$$

The Q-learning approach starts with an initial guess of $Q(s, a)$ for all $s$ and $a$ and then proceeds to update them based on the iterative formula

$$Q(s_t, a_t) = (1 - \alpha_t) Q(s_t, a_t) + \alpha_t \left( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) \right), \quad \forall t = 1, 2, \ldots, \qquad (2)$$

where $\alpha_t$ is the learning rate at time step $t$. In each observed state, the agent chooses an action through an $\epsilon$-greedy algorithm: with probability $\epsilon_t$ in time $t$, the algorithm chooses an action randomly, and with probability $1 - \epsilon_t$, it chooses the action with the highest cumulative action value, i.e., $a_{t+1} = \mathrm{argmax}_a Q(s_{t+1}, a)$. The random selection of actions, called exploration, allows the algorithm to explore the solution space and gives an optimality guarantee to the algorithm if $\epsilon_t \to 0$ as $t \to \infty$ (Sutton and Barto 1998). After finding the optimal $Q^*$, one can recover the optimal policy as $\pi^*(s) = \mathrm{argmax}_a Q^*(s, a)$.
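For readers unfamiliar with the update in (2), the following is a minimal tabular Q-learning sketch with $\epsilon$-greedy exploration. It is our own illustration, not part of the paper's algorithm, and it assumes a generic environment object `env` with `reset()` and `step(a)` methods returning `(next_state, reward, done)`.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=1.0, eps_min=0.05, eps_decay=0.995):
    """Tabular Q-learning with epsilon-greedy exploration (cf. equation (2))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Q-learning update, equation (2)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
        eps = max(eps_min, eps * eps_decay)   # decay exploration over time
    return Q
```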
Both of the algorithms discussed so far (dynamic programming and Q-learning) guarantee that they will obtain the optimal policy. However, due to the curse of dimensionality, these approaches are not able to solve MDPs with large state or action spaces in reasonable amounts of time. Many problems of interest (including the beer game) have large state and/or action spaces. Moreover, in some settings (again, including the beer game), the decision maker cannot observe the full state variable. This case, which is known as a partially observed MDP (POMDP), makes the problem much harder to solve than an MDP.

In order to solve large POMDPs and avoid the curse of dimensionality, it is common to approximate the Q-values in the Q-learning algorithm (Sutton and Barto 1998). Linear regression is often used for this purpose (Melo and Ribeiro 2007); however, it is not powerful or accurate enough for our application. Non-linear functions and neural network approximators are able to provide more accurate approximations; on the other hand, they are known to produce unstable or even diverging Q-values due to issues related to non-stationarity and correlations in the sequence of observations (Mnih et al. 2013). The seminal work of Mnih et al. (2015) solved these issues by proposing target networks and utilizing experience replay memory (Lin 1992). They proposed a deep Q-network (DQN) algorithm, which uses a deep neural network to obtain an approximation of the Q-function and trains it through the iterations of the Q-learning algorithm while updating a separate target network. This algorithm has been applied to many competitive games, which are reviewed by Li (2017). Our algorithm for the beer game is based on this approach.

The beer game exhibits one characteristic that differentiates it from most settings in which DQN is commonly applied, namely, that there are multiple agents that cooperate in a decentralized manner to achieve a common goal. Such a problem is called a decentralized POMDP, or Dec-POMDP. Due to the partial observability and the non-stationarity of the agents' local observations, Dec-POMDPs are hard to solve and are categorized as NEXP-complete problems (Bernstein et al. 2002).

The beer game exhibits all of the complicating characteristics described above: large state and action spaces, partial state observations, and decentralized cooperation. In the next section, we discuss the drawbacks of current approaches for solving the beer game, which our algorithm aims to overcome.

In Section 2.1, we reviewed different approaches to solving the beer game. Although the model of Clark and Scarf (1960) can solve some types of serial systems, for more general serial systems neither the form nor the parameters of the optimal policy are known. Moreover, even in systems for which a base-stock policy is optimal, such a policy may no longer be optimal for a given agent if the other agents do not follow it. The formula-based beer-game models by Mosekilde and Larsen (1988), Sterman (1989), and Strozzi et al. (2007) attempt to model human decision-making; they do not attempt to model or determine optimal decisions.

A handful of models have attempted to optimize the inventory actions in serial supply chains with more general cost or demand structures than those used by Clark and Scarf (1960); these are essentially beer-game settings. However, these papers all assume full observation or a centralized decision maker, rather than the local observations and decentralized approach taken in the beer game. For example, Kimbrough et al. (2002) use a genetic algorithm (GA), while Chaharsooghi et al. (2008), Giannoccaro and Pontrandolfo (2002), and Jiang and Sheng (2009) use RL. However, classical RL algorithms can handle only a small or reduced-size state space. Accordingly, these applications of RL to the beer game, or to even simpler supply chain networks, also assume (implicitly or explicitly) that the size of the state space is small. This is unrealistic in the beer game, since the state variable representing a given agent's inventory level can be any number in $(-\infty, +\infty)$. Solving such an RL problem would be nearly impossible, as the model would be extremely expensive to train. Moreover, Chaharsooghi et al. (2008) and Giannoccaro and Pontrandolfo (2002), which model beer-game-like settings, assume sharing of information. Also, to handle the curse of dimensionality, they propose mapping the state variable onto a small number of tiles, which leads to the loss of valuable state information and therefore of accuracy. Thus, although these papers are related to our work, their assumption of full observability differentiates their work from the classical beer game and from our paper.

Another possible approach to tackle this problem might be classical supervised machine learning algorithms. However, these algorithms also cannot be readily applied to the beer game, since there is no historical data in the form of "correct" input/output pairs. Thus, we cannot use a stand-alone support vector machine or deep neural network with a training data set and train it to learn the best action (like the approach used by Oroojlooyjadid et al. (2017a,b) to solve some simpler supply chain problems). Based on our understanding of the literature, there is a large gap between solving the beer game problem effectively and what the current algorithms can handle. In order to fill this gap, we propose a variant of the DQN algorithm to choose the order quantities in the beer game.

We propose a Q-learning algorithm for the beer game in which a DNN approximates the Q-function. Indeed, the general structure of our algorithm is based on the DQN algorithm (Mnih et al. 2015), although we modify it substantially, since DQN is designed for single-agent, competitive, zero-sum games and the beer game is a multi-agent, decentralized, cooperative, non-zero-sum game. In other words, DQN provides actions for one agent that interacts with an environment in a competitive setting, whereas the beer game is a cooperative game in the sense that all of the players aim to minimize the total cost of the system over a random number of periods. Also, beer game agents play independently and do not have any information from the other agents until the game ends and the total cost is revealed, whereas DQN usually assumes the agent fully observes the state of the environment at every time step $t$ of the game. For example, DQN has been successfully applied to Atari games (Mnih et al. 2015), but in those games the agent is attempting to defeat an opponent and observes full information about the state of the system at each time step $t$.

One naive approach to extending the DQN algorithm to the beer game is to use multiple DQNs, one for each agent. However, using DQN as the decision maker of each agent results in a competitive game in which each DQN agent plays independently to minimize its own cost. For example, consider a beer game in which players 2, 3, and 4 each have a stand-alone, well-trained DQN and the retailer (stage 1) uses a base-stock policy to make decisions. If the holding costs are positive for all players and the stockout cost is positive only for the retailer (as is common in the beer game), then the DQNs at agents 2, 3, and 4 will return an optimal order quantity of 0 in every period, since on-hand inventory hurts the objective function for these players but stockouts do not. This is a byproduct of the independent DQN agents minimizing their own costs without considering the total cost, which is obviously not an optimal solution for the system as a whole. Instead, we propose a unified framework in which the agents still play independently from one another, but in the training phase we use a feedback scheme so that the DQN agent learns the total cost for the whole network and can, over time, learn to minimize it. Thus, the agents in our model play smartly in all periods of the game to obtain a near-optimal cumulative cost for any random horizon length.

In principle, our framework can be applied to multiple DQN agents playing the beer game simultaneously on a team. However, to date we have designed and tested our approach only for a single DQN agent whose teammates are not DQNs, e.g., they are controlled by simple formulas or by human players. Enhancing the algorithm so that multiple DQNs can play simultaneously and cooperatively is a topic of ongoing research.

Another advantage of our approach is that it does not require knowledge of the demand distribution, unlike classical inventory management approaches (e.g., Clark and Scarf 1960). In practice, one can approximate the demand distribution based on historical data, but doing so is prone to error, and basing decisions on approximate distributions may result in a loss of accuracy in the beer game. In contrast, our algorithm chooses actions directly based on the training data and does not need to know, or estimate, the probability distribution directly.

The proposed approach works very well when we tune and train the DQN for a given agent and a given set of game parameters (e.g., costs, lead times, action spaces, etc.). Once any of these parameters changes, or the agent changes, in principle we need to tune and train a new network. Although this approach works, it is time consuming, since we need to tune the hyper-parameters for each new set of game parameters.
To avoid this, we propose using a transfer learning approach (Pan and Yang 2010) in which we transfer the acquired knowledge of one agent under one set of game parameters to another agent with another set of game parameters. In this way, we decrease the time required to train a new agent by roughly one order of magnitude.

To summarize, our algorithm is a variant of the DQN algorithm for choosing actions in the beer game. In order to attain near-optimal cooperative solutions, we develop a feedback scheme as a communication framework. Finally, to simplify training agents with new settings, we use transfer learning to efficiently make use of the learned knowledge of trained agents. In addition to playing the beer game well, we believe our algorithm serves as a proof-of-concept that DQN and other machine learning approaches can be used for real-time decision making in complex supply chain settings. Finally, we note that we have integrated our algorithm into a new online beer game developed by Opex Analytics (http://beergame.opexanalytics.com/); see Figure 3. The Opex beer game allows human players to compete with, or play on a team with, our DQN agent.
3. The DQN Algorithm
In this section, we first present the details of our DQN algorithm to solve the beer game, and then describe the transfer learning mechanism.
Figure 3 Screenshot of the Opex Analytics online beer game integrated with our DQN agent.
In our algorithm, a DQN agent runs a Q-learning algorithm, with a DNN as the Q-function approximator, to learn a semi-optimal policy with the aim of minimizing the total cost of the game. Each agent has access to its local information and considers the other agents as part of its environment. That is, the DQN agent does not know any information about the other agents, including both static parameters such as costs and lead times and dynamic state variables such as inventory levels. We propose a feedback scheme to teach the DQN agent to work toward minimizing the total system-wide cost, rather than its own local cost. The details of the scheme, the Q-learning procedure, the state and action spaces, the reward function, the DNN approximator, and the DQN algorithm are discussed below.
State variables:
Consider agent $i$ in time step $t$. Let $OO^i_t$ denote the on-order items at agent $i$, i.e., the items that have been ordered from agent $i+1$ but not yet received; let $AO^i_t$ denote the size of the arriving order (i.e., the demand) received from agent $i-1$; let $AS^i_t$ denote the size of the arriving shipment from agent $i+1$; let $a^i_t$ denote the action agent $i$ takes; and let $IL^i_t$ denote the inventory level as defined in Section 1. We interpret $AO^1_t$ to represent the end-customer demand and $AS^4_t$ to represent the shipment received by agent 4 from the external supplier. In each period $t$ of the game, agent $i$ observes $IL^i_t$, $OO^i_t$, $AO^i_t$, and $AS^i_t$. In other words, in period $t$ agent $i$ has the historical observations

$$o^i_t = \left[ \left( (IL^i_1)^+, (IL^i_1)^-, OO^i_1, AO^i_1, AS^i_1 \right), \ldots, \left( (IL^i_t)^+, (IL^i_t)^-, OO^i_t, AO^i_t, AS^i_t \right) \right].$$

In addition, any beer game finishes in a finite time horizon, so the problem can be modeled as a POMDP in which each historical sequence $o^i_t$ is a distinct state; however, the size of the vector $o^i_t$ grows over time, which is difficult for any RL or DNN algorithm to handle. To address this issue, we capture only the last $m$ periods (e.g., $m = 3$) and use them as the state variable; thus the state variable of agent $i$ in time $t$ is

$$s^i_t = \left[ \left( (IL^i_j)^+, (IL^i_j)^-, OO^i_j, AO^i_j, AS^i_j \right) \right]_{j = t-m+1}^{t}.$$
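A minimal sketch of how the truncated state $s^i_t$ could be assembled from an agent's last $m$ local observations; the deque-based buffer, zero padding at the start of the game, and feature ordering are our own illustrative choices, not taken from the paper.

```python
from collections import deque
import numpy as np

class StateBuilder:
    """Keeps the last m local observations (IL+, IL-, OO, AO, AS) of one agent."""

    def __init__(self, m):
        self.m = m
        self.history = deque(maxlen=m)

    def observe(self, IL, OO, AO, AS):
        self.history.append([max(IL, 0), max(-IL, 0), OO, AO, AS])

    def state(self):
        # Pad with zeros when fewer than m observations are available, so the
        # state vector always has length 5*m.
        pad = [[0.0] * 5] * (self.m - len(self.history))
        return np.array(pad + list(self.history), dtype=np.float32).flatten()
```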
DNN architecture:
In our algorithm, the DNN plays the role of the Q-function approximator, providing the Q-value as output for any pair of state $s$ and action $a$. There are various possible approaches to building the DNN structure. One approach is to provide the state $s$ and action $a$ as the input of the DNN and obtain the corresponding $Q(s, a)$ as the output. Instead, we provide as input only the state, i.e., the observations of the $m$ most recent periods, and obtain as output $Q(s, a)$ for every possible action $a \in A$ (since in the beer game $A(s)$ is fixed for any $s$, we write $A$ hereinafter).
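A sketch of a Q-network of this form in PyTorch; the class name and the hidden-layer widths (128) are placeholders introduced here, since the paper specifies the exact shape only in the numerical experiments.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a 5*m state vector to one Q-value per action in A = {a_l, ..., a_u}."""

    def __init__(self, m, n_actions, hidden=(128, 128, 128)):
        super().__init__()
        layers, in_dim = [], 5 * m
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, n_actions))  # one output node per action
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)
```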
Action space:
In each period of the game, each agent can order any amount in $[0, \infty)$. Since our DNN architecture provides the Q-values of all possible actions as output, having an infinite action space is not practical. Therefore, to limit the cardinality of the action space, we use the d + x rule for selecting the order quantity: the agent determines how much more or less to order than its received order; that is, the order quantity is $d + x$, where $x$ is in some bounded set. Thus, the output of the DNN is $x \in [a_l, a_u]$ ($a_l, a_u \in \mathbb{Z}$), so that the action space is of size $a_u - a_l + 1$.
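A small sketch of how a network output index could be translated into an order via the d + x rule; the helper name is ours, and truncating negative orders to zero is an assumption (the paper only states that orders lie in $[0, \infty)$).

```python
def order_quantity(demand_d, action_index, a_l=-2):
    """Translate the DNN's action index into an order via the d + x rule.

    Action index 0, 1, ..., a_u - a_l corresponds to x = a_l, ..., a_u.
    The order placed is max(d + x, 0), assuming negative orders are truncated.
    """
    x = a_l + action_index
    return max(demand_d + x, 0)

# Example: observed demand 3, action index 4 with a_l = -2  ->  x = 2, order 5.
print(order_quantity(3, 4))
```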
Experience replay:
The DNN algorithm requires a mini-batch of inputs and a corresponding set of output values to learn the Q-values. Since we use the DQN algorithm as our RL engine, in each period $t$ we have the new state $s_{t+1}$, the current state $s_t$, the action $a_t$ taken, and the observed reward $r_t$. This information can provide the required set of inputs and outputs for the DNN; however, the resulting sequence of observations from the RL procedure forms a non-stationary data set in which there is a strong correlation among consecutive records. This makes the DNN, and as a result the RL, prone to over-fitting the previously observed records and may even result in a diverging approximator (Mnih et al. 2015). To avoid this problem, we follow the suggestion of Mnih et al. (2015) and use experience replay (Lin 1992). Agent $i$ has an experience memory $E^i$ to which, in iteration $t$ of the algorithm, agent $i$'s observation $e^i_t = (s^i_t, a^i_t, r^i_t, s^i_{t+1})$ is added, so that $E^i$ includes $\{e^i_1, e^i_2, \ldots, e^i_t\}$ in period $t$. Then, in order to avoid having correlated observations, we select a random mini-batch from the agent's experience replay to train the corresponding DNN (if applicable). This approach breaks the correlations among the training data and reduces the variance of the output (Mnih et al. 2013). Moreover, as a byproduct of experience replay, we also get a tool to keep every piece of valuable information, which allows greater efficiency in a setting in which the state and action spaces are huge and any observed experience is valuable.
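A minimal experience-replay buffer in the spirit of Lin (1992); the capacity, the stored tuple layout, and uniform sampling are illustrative choices on our part.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s, a, r, s_next, done) and samples random mini-batches."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the correlation between consecutive observations.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```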
Reward function:
In iteration $t$ of the game, agent $i$ observes state variable $s^i_t$ and takes action $a^i_t$; we need to know the corresponding reward value $r^i_t$ to measure the quality of action $a^i_t$. The state variable $s^i_{t+1}$ allows us to calculate $IL^i_{t+1}$ and thus the corresponding shortage or holding costs, and we take the sum of these costs as $r^i_t$. However, since there are information and transportation lead times, there is a delay between taking action $a^i_t$ and observing its effect on the reward. Moreover, the reward $r^i_t$ reflects not only the action taken in period $t$, but also those taken in previous periods, and it is not possible to decompose $r^i_t$ to isolate the effects of each of these actions. However, defining the state variable to include information from the last $m$ periods resolves this issue to some degree; the reward $r^i_t$ represents the reward of state $s^i_t$, which includes the observations of the previous $m$ steps.

On the other hand, the reward values $r^i_t$ are the intermediate rewards of each agent, and the objective of the beer game is to minimize the total reward of the game, $\sum_{i=1}^{4} \sum_{t=1}^{T} r^i_t$, which the agents only learn after finishing the game. In order to add this information into the agents' experience, we use reward shaping through a feedback scheme.
Feedback scheme:
When any episode of the beer game is finished, all agents are made aware of the total reward. In order to share this information among the agents, we propose a penalization procedure in the training phase to provide feedback to the DQN agent about the way it has played. Let $\omega = \frac{1}{T}\sum_{i=1}^{4}\sum_{t=1}^{T} r^i_t$ and $\tau_i = \frac{1}{T}\sum_{t=1}^{T} r^i_t$, i.e., the average reward per period and the average reward of agent $i$ per period, respectively. After the end of each episode of the game (i.e., after period $T$), for each DQN agent $i$ we update its observed reward in all $T$ time steps in the experience replay memory using $r^i_t = r^i_t + \beta_i(\omega - \tau_i)$, $\forall t \in \{1, \ldots, T\}$, where $\beta_i$ is a regularization coefficient for agent $i$. With this procedure, agent $i$ gets appropriate feedback about its actions and learns to take actions that result in minimum total cost, rather than locally optimal solutions.
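A sketch of the end-of-episode feedback adjustment $r^i_t \leftarrow r^i_t + \beta_i(\omega - \tau_i)$ applied to the stored rewards; the array layout (agents by periods) is an assumption for illustration.

```python
import numpy as np

def apply_feedback(rewards, beta):
    """Reward shaping after an episode of length T.

    rewards: array of shape (4, T), the per-period reward of each agent.
    beta:    length-4 array of regularization coefficients beta_i.
    Returns the shaped rewards r^i_t + beta_i * (omega - tau_i).
    """
    T = rewards.shape[1]
    omega = rewards.sum() / T        # average total reward per period
    tau = rewards.sum(axis=1) / T    # average reward of each agent per period
    return rewards + beta[:, None] * (omega - tau)[:, None]
```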
Determining the value of m:
As noted above, the DNN maintains information from the most recent $m$ periods in order to keep the size of the state variable fixed and to address the issue of the delayed observation of the reward. In order to select an appropriate value for $m$, one has to consider the values of the lead times throughout the game. First, when agent $i$ takes action $a^i_t$ at time $t$, it does not observe its effect until at least $l^{tr}_i + l^{in}_i$ periods later, when the order may be received. Moreover, node $i+1$ may not have enough stock to satisfy the order immediately, in which case the shipment is delayed, and in the worst case agent $i$ will not observe the corresponding reward $r^i_t$ until $\sum_{j=i}^{4}(l^{tr}_j + l^{in}_j)$ periods later. However, one needs the reward $r^i_t$ to evaluate the action $a^i_t$ taken. Thus, ideally $m$ should be chosen at least as large as $\sum_{j=1}^{4}(l^{tr}_j + l^{in}_j)$. On the other hand, this value can be large, and selecting a large value for $m$ results in a large input size for the DNN, which increases the training time. Therefore, selecting $m$ is a trade-off between accuracy and computation time, and $m$ should be selected according to the required level of accuracy and the available computation power. In our numerical experiments, $\sum_{j=1}^{4}(l^{tr}_j + l^{in}_j) = 15$ or $16$, and we test two values of $m$.

The algorithm:
Our algorithm to obtain the policy $\pi$ for playing the beer game is given in Algorithm 1. The algorithm, which is based on that of Mnih et al. (2015), finds the weights $\theta$ of the DNN that minimize the Euclidean distance between $Q(s, a; \theta)$ and $y_j$, where $y_j$ is the prediction of the Q-value obtained from the target network $Q^-$ with weights $\theta^-$. Every $C$ iterations, the weights $\theta^-$ are updated to $\theta$. Moreover, the actions in each training step of the algorithm are obtained by an $\epsilon$-greedy algorithm, which is explained in Section 2.2.

In the algorithm, in period $t$ agent $i$ takes action $a^i_t$, satisfies the arriving demand/order $AO^i_{t-1}$, observes the new demand $AO^i_t$, and then receives the shipment $AS^i_t$. This sequence of events, which is explained in detail in online supplement E, results in the new state $s_{t+1}$. Feeding $s_{t+1}$ into the DNN with weights $\theta$ provides the corresponding Q-values for state $s_{t+1}$ and all possible actions. The action with the smallest Q-value is our choice. Finally, at the end of each episode, the feedback scheme runs and distributes the total cost among all agents.
Algorithm 1 DQN for the Beer Game

procedure DQN
  Initialize experience replay memory $E^i = [\,]$ for all $i$
  for episode = 1 : n do
    Reset $IL$, $OO$, $d$, $AO$, and $AS$ for each agent
    for t = 1 : T do
      for i = 1 : 4 do
        With probability $\epsilon$ take a random action $a_t$; otherwise set $a_t = \mathrm{argmin}_a Q(s_t, a; \theta)$
        Execute action $a_t$; observe reward $r_t$ and state $s_{t+1}$
        Add $(s^i_t, a^i_t, r^i_t, s^i_{t+1})$ to $E^i$
        Sample a mini-batch of experiences $(s_j, a_j, r_j, s_{j+1})$ from $E^i$
        Set $y_j = r_j$ if $s_{j+1}$ is terminal, and $y_j = r_j + \gamma \min_a Q(s_{j+1}, a; \theta^-)$ otherwise
        Run a forward and backward step on the DNN with loss $\left(y_j - Q(s_j, a_j; \theta)\right)^2$
        Every $C$ iterations, set $\theta^- = \theta$
      end for
    end for
    Run the feedback scheme and update the experience replay memory of each agent
  end for
end procedure
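A compact PyTorch sketch of the training loop in Algorithm 1 for a single DQN agent. It reuses the illustrative QNetwork, ReplayMemory, and StateBuilder sketches above and assumes a beer-game simulator `env` with `reset()` and `step(x)` methods for the DQN agent's position; all of these interfaces, and the hyper-parameter defaults, are our own simplifications rather than the authors' code.

```python
import copy
import random

import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, memory, n_actions, a_l,
              episodes=60_000, gamma=0.99, batch_size=64,
              eps=1.0, eps_min=0.05, eps_decay=0.9995, C=10_000):
    """Simplified single-agent DQN training loop for the beer game (Q approximates cost)."""
    target_net = copy.deepcopy(q_net)                       # Q^- with weights theta^-
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
    step = 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy over the d + x offsets; argmin because we minimize cost.
            if random.random() < eps:
                action = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmin(dim=1).item())
            next_state, reward, done = env.step(a_l + action)   # pass x; env applies d + x
            memory.add(state, action, reward, next_state, done)
            state = next_state

            # One gradient step on a random mini-batch from the replay memory.
            batch = memory.sample(batch_size)
            s, a, r, s2, d = zip(*batch)
            s = torch.as_tensor(np.array(s), dtype=torch.float32)
            s2 = torch.as_tensor(np.array(s2), dtype=torch.float32)
            a = torch.as_tensor(a, dtype=torch.int64)
            r = torch.as_tensor(r, dtype=torch.float32)
            d = torch.as_tensor(d, dtype=torch.float32)
            q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                y = r + gamma * (1.0 - d) * target_net(s2).min(dim=1).values
            loss = F.mse_loss(q_sa, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            step += 1
            if step % C == 0:                               # theta^- <- theta
                target_net.load_state_dict(q_net.state_dict())
            eps = max(eps_min, eps * eps_decay)
        # The end-of-episode feedback scheme would adjust the stored rewards here.
```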
Evaluation procedure:
In order to validate our algorithm, we compare its results to those obtained using the optimal base-stock levels (when possible) in serial systems by Clark and Scarf (1960), as well as to the model of human beer-game behavior by Sterman (1989). (Note that neither of these methods attempts to do exactly the same thing as our method. The method of Clark and Scarf (1960) optimizes the base-stock levels assuming all players follow a base-stock policy, which beer game players do not tend to do, and the formula by Sterman (1989) models human beer-game play but does not attempt to optimize.) The details of the training procedure and benchmarks are described in Section 4.
Transfer learning (Pan and Yang 2010) has been an active and successful field of research in machine learning, and especially in image processing. In transfer learning, there is a source dataset $S$ and a neural network trained to perform a given task on it, e.g., classification, regression, or decision making through RL. Training such networks may take days or even weeks. So, for similar or even slightly different target datasets $T$, one can avoid training a new network from scratch and instead reuse the trained network with a few customizations. The idea is that most of the knowledge learned on dataset $S$ can be used for the target dataset with only a small amount of additional training. This idea works well in image processing (e.g., Rajpurkar et al. (2017)) and considerably reduces the training time.

In order to use transfer learning in the beer game, assume there exists a source agent $i \in \{1, 2, 3, 4\}$ with trained network $S_i$ (with the same architecture for all agents), parameters $P_i = \{|A^i|, c^i_p, c^i_h\}$, observed demand distribution $D$, and co-player policy $\pi$. The weight matrix $W_i$ contains the learned weights, such that $W^q_i$ denotes the weights between layers $q$ and $q+1$ of the neural network, where $q$ ranges over the layers and $nh$ denotes the number of hidden layers. The aim is to train a neural network $S_j$ for a target agent $j \in \{1, 2, 3, 4\}$, $j \ne i$. We set the structure of network $S_j$ to be the same as that of $S_i$, initialize $W_j$ with $W_i$, and make the first $k$ layers not trainable. Then, we train neural network $S_j$ with a small learning rate. Note that, as we get closer to the final layer, which provides the Q-values, the weights become less similar to agent $i$'s and more specific to each agent. Thus, the knowledge acquired in the first $k$ hidden layer(s) of agent $i$'s neural network is transferred to agent $j$, where $k$ is a tunable parameter. Following this procedure, in Section 4.3 we test the use of transfer learning in six cases, transferring the learned knowledge of source agent $i$ to:

1. A target agent $j \ne i$ in the same game.
2. A target agent $j$ with the same action space but different cost coefficients.
3. A target agent $j$ with the same cost coefficients but a different action space.
4. A target agent $j$ with a different action space and different cost coefficients.
5. A target agent $j$ with a different action space and cost coefficients, as well as a different demand distribution $D$.
6. A target agent $j$ with a different action space and cost coefficients, as well as a different demand distribution $D$ and co-player policy $\pi$.

Unless stated otherwise, the demand distribution and co-player policy are the same for the source and target agents. Transfer learning could also be used when other aspects of the problem change, e.g., lead times, the state representation, and so on. This avoids having to tune the hyper-parameters of the neural network for each new problem, which considerably reduces the training time. However, we still need to decide how many layers should be trainable, as well as to determine which agent should serve as the base agent for transferring the learned knowledge. Still, this is computationally much cheaper than finding each network and its hyper-parameters from scratch.
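A sketch of the transfer step: copy the source agent's weights, freeze the first k Linear layers, and fine-tune the rest with a small learning rate. It reuses the illustrative QNetwork sketch above; the layer-indexing convention and the learning rate are assumptions, not values from the paper.

```python
import copy
import torch

def transfer(source_net, k, lr=1e-5):
    """Initialize a target network from a trained source network and freeze
    the first k Linear layers; the remaining layers are fine-tuned."""
    target_net = copy.deepcopy(source_net)        # W_j initialized with W_i
    frozen = 0
    for module in target_net.net:                 # nn.Sequential inside QNetwork
        if isinstance(module, torch.nn.Linear) and frozen < k:
            for p in module.parameters():
                p.requires_grad = False           # first k layers are not trainable
            frozen += 1
    trainable = [p for p in target_net.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)  # small learning rate for fine-tuning
    return target_net, optimizer
```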
4. Numerical Experiments
In Section 4.1, we discuss a set of numerical experiments that uses a simple demand distribution and a relatively small action space:

• $d_t \in U[0, 2]$, $A = \{-2, -1, 0, 1, 2\}$.

After exploring the behavior of our algorithm under different co-player policies, in Section 4.2 we test the algorithm using three well-known cases from the literature, which have larger possible demand values and correspondingly larger action spaces:

• $d_t \in U[0, 8]$ (Croson and Donohue 2006b),
• $d_t \sim N(10, \sigma^2)$ (adapted from Chen and Samroengraja 2000, who assume normal demand with mean 50),
• $d_t \in C(4, 8)$ (Sterman 1989).

As noted above, we only consider cases in which a single DQN plays with non-DQN agents, e.g., simulated human co-players. In each of the cases listed above, we consider three types of policies that the non-DQN co-players follow: (i) a base-stock policy, (ii) the Sterman formula, and (iii) a random policy. In the random policy, agent $i$ also follows a d + x rule, in which $x^i_t \in A$ is selected randomly and with equal probability for each $t$. After analyzing these cases, in Section 4.3 we provide the results obtained using transfer learning for each of the six proposed cases.

We test two values each of $m$ and $C$. Our DNN is a fully connected network in which each node has a ReLU activation function. The input is of size $5m$, and there are three hidden layers. There is one output node for each possible value of the action, and each of these nodes takes a value in $\mathbb{R}$ indicating the Q-value for that action. Thus, there are $a_u - a_l + 1$ output nodes, and the neural network has shape $[5m, h_1, h_2, h_3, a_u - a_l + 1]$, where $h_1$, $h_2$, and $h_3$ denote the hidden-layer widths.

In order to optimize the network, we used the Adam optimizer (Kingma and Ba 2014) with a batch size of 64. Although the Adam optimizer has its own weight-decay procedure, we also used staircase exponential decay, multiplying the learning rate by 0.98 every 10000 iterations, to decay the learning rate further.
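A sketch of this training configuration (fully connected network of shape $[5m, h_1, h_2, h_3, a_u - a_l + 1]$, Adam, and a staircase exponential learning-rate decay of 0.98 every 10,000 iterations); the hidden-layer widths and the initial learning rate are placeholders of ours, not values reported in the paper.

```python
import torch
import torch.nn as nn

m, a_l, a_u = 10, -2, 2
n_actions = a_u - a_l + 1                   # one output node per action
h1, h2, h3 = 128, 128, 128                  # placeholder hidden-layer widths

# Fully connected network of shape [5m, h1, h2, h3, a_u - a_l + 1] with ReLU activations.
q_net = nn.Sequential(
    nn.Linear(5 * m, h1), nn.ReLU(),
    nn.Linear(h1, h2), nn.ReLU(),
    nn.Linear(h2, h3), nn.ReLU(),
    nn.Linear(h3, n_actions),
)

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)   # mini-batches of 64 in the paper
# Staircase exponential decay: multiply the learning rate by 0.98 every 10,000 iterations;
# scheduler.step() would be called once per training iteration.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.98)
```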
This helps to stabilize the training trajectory. We trained each agent for at most 60000 episodes and used a replay memory $E$ containing the one million most recently observed experiences. The training of the DNN starts after observing at least 500 episodes of the game. The $\epsilon$-greedy algorithm starts with a relatively large $\epsilon$, which is decayed over the course of training. The value of $\beta_i$ depends heavily on $\tau_j$, the average reward of agent $j$, for each $j \ne i$. For example, when $\tau_i$ is one order of magnitude larger than $\tau_j$ for all $j \ne i$, agent $i$ needs a large coefficient to get more feedback from the other agents. Indeed, the feedback coefficient plays a role similar to that of the regularization parameter $\lambda$ in the lasso loss function; the value of that parameter depends on the norm of the variables, but there is no universal rule to determine the best value for $\lambda$. Similarly, proposing a simple rule or value for each $\beta_i$ is not possible, as it depends on $\tau_i$ for all $i$. For example, we found that a very large $\beta_i$ does not work well, since the agent tries to decrease the other agents' costs rather than its own. Similarly, with a very small $\beta_i$, the agent learns how to minimize its own cost instead of the total cost. Therefore, we used a cross-validation approach to find good values for each $\beta_i$.

In this section, we test our approach using a beer game setup with the following characteristics. Information and shipment lead times, $l^{tr}_j$ and $l^{in}_j$, equal 2 periods at every agent. Holding and stockout costs are given by $c_h = [2, 2, 2, 2]$ and $c_p = [2, 0, 0, 0]$, respectively, where the elements correspond to agents $1, \ldots, 4$. The demand is an integer drawn uniformly from $\{0, 1, 2\}$. Additionally, we assume that agent $i$ observes the arriving shipment $AS^i_t$ when it chooses its action for period $t$; we relax this assumption later. We use $a_l = -2$ and $a_u = 2$, so that there are 5 outputs in the neural network; i.e., each agent chooses an order quantity that is at most 2 units greater or less than the observed demand. (Later, we expand these to larger action spaces.)

We consider two types of simulated human players. In Section 4.1.1, we discuss results for the case in which one DQN agent plays on a team in which the other three players use a base-stock policy to choose their actions, i.e., the non-DQN agents behave rationally. See https://youtu.be/gQa6iWGcGWY for a video animation of the policy that the DQN learns in this case. Then, in Section 4.1.2, we assume that the other three agents use the Sterman formula (i.e., the formula by Sterman (1989)), which models irrational play.

For the cost coefficients and other settings specified for this beer game, it is optimal for all players to follow a base-stock policy, and we use this policy as a benchmark and a lower bound on the base-stock cost. The vector of optimal base-stock levels is $[8, \ldots]$. We consider four cases, with the DQN playing the role of each of the four players and the co-players using a base-stock policy. We then compare the results of our algorithm with the results of the case in which all players follow a base-stock policy, which we call BS hereinafter.

The results of all four cases are shown in Figure 4. Each plot shows the training curve, i.e., the evolution of the average cost per game as the training progresses. In particular, the horizontal axis indicates the number of training episodes, while the vertical axis indicates the total cost per game. After every 100 episodes of the game and the corresponding training, the costs of 50 validation points (i.e., 50 new games), each with 100 periods, are obtained, and their average plus a 95% confidence interval are plotted. (The confidence intervals, which are light blue in the figure, are quite narrow, so they are difficult to see.) The red line indicates the cost of the case in which all players follow a base-stock policy. In each of the sub-figures, there are two plots; the upper plot shows the cost, while the lower plot shows the normalized cost, in which each cost is divided by the corresponding BS cost; essentially this is a "zoomed-in" version of the upper plot.

Figure 4 Total cost (upper figure) and normalized cost (lower figure) with one DQN agent and three agents that follow a base-stock policy: (a) DQN plays retailer; (b) DQN plays warehouse; (c) DQN plays distributor; (d) DQN plays manufacturer.

We trained the network using six values of $\beta_i$, each for at most 60000 episodes. Figure 4 plots the results from the best $\beta_i$ value for each agent; we present the full results using different $\beta_i$, $m$, and $C$ values in Section C of the online supplement. The figure indicates that DQN performs well in all cases and finds policies whose costs are close to those of the BS policy.
After the network is fully trained (i.e., after 60000 training episodes), the average gap between the DQN cost and the BS cost, over all four agents, is 2.31%.

Figure 5 shows the trajectories of the retailer's inventory level ($IL$), on-order quantity ($OO$), order quantity ($a$), reward ($r$), and order-up-to level (OUTL) for a single game, when the retailer is played by the DQN with $\beta = 50$, as well as when it is played by a base-stock policy (BS) and by the Sterman formula (Strm).

Figure 5 $IL_t$, $OO_t$, $a_t$, $r_t$, and OUTL when the DQN plays the retailer and the other agents follow a base-stock policy.

The base-stock policy and DQN have similar IL and OO trends, and as a result their rewards are also very close: the BS costs sum to 1.49 in total, and the DQN costs sum to 1.54 (3.4% larger). (Note that BS has a slightly different cost here than reported earlier because those costs are the average costs over 50 samples, while this cost is from a single sample.) Similar trends are observed when the DQN plays the other three roles; see Section B of the online supplement. This suggests that the DQN can successfully learn to achieve costs close to BS when the other agents also play BS. (The OUTL plot shows that the DQN does not quite follow a base-stock policy, even though its costs are similar.)

Figure 6 shows the results of the case in which the three non-DQN agents use the formula proposed by Sterman (1989) instead of a base-stock policy. (See Section A of the online supplement for the formula and its parameters.) For comparison, the red line represents the case in which the single agent is played using a base-stock policy and the other three agents continue to use the Sterman formula, a case we call Strm-BS.

Figure 6 Total cost (upper figure) and normalized cost (lower figure) with one DQN agent and three agents that follow the Sterman formula: (a) DQN plays retailer; (b) DQN plays warehouse; (c) DQN plays distributor; (d) DQN plays manufacturer.

From the figure, it is evident that the DQN plays much better than Strm-BS. This is because, if the other three agents do not follow a base-stock policy, it is no longer optimal for the fourth agent to follow a base-stock policy, or to use the same base-stock level. In general, the optimal inventory policy when other agents do not follow a base-stock policy is an open question. This figure suggests that our DQN is able to learn to play effectively in this setting.

Table 1 gives the cost of all four agents when a given agent plays using either DQN or a base-stock policy and the other agents play using the Sterman formula.

Table 1 Average cost under different choices of which agent uses DQN or Strm-BS. Each cell reports the pair (DQN cost, Strm-BS cost).

DQN Agent      Retailer        Warehouse       Distributor     Manufacturer    Total
Retailer       (0.89, 1.89)    (10.87, 10.83)  (10.96, 10.98)  (12.42, 12.82)  (35.14, 36.52)
Warehouse      (1.74, 9.99)    (0.00, 0.13)    (11.12, 10.80)  (12.86, 12.34)  (25.72, 33.27)
Distributor    (5.60, 10.72)   (0.11, 9.84)    (0.00, 0.14)    (12.53, 12.35)  (18.25, 33.04)
Manufacturer   (4.68, 10.72)   (1.72, 10.60)   (0.24, 10.13)   (0.00, 0.07)    (6.64, 31.52)

From the table, we can see that DQN learns how to play so as to decrease the costs of the other agents, and not just its own costs. For example, the retailer's and warehouse's costs are significantly lower when the distributor uses DQN than they are when the distributor uses a base-stock policy. Similar conclusions can be drawn from Figure 6. This shows the power of DQN when it plays with co-players that do not play rationally, i.e., do not follow a base-stock policy, which is common in real-world supply chains. Also, we note that when all agents follow the Sterman formula, the average costs of the agents are [10.81, 10.76, 10.96, 12.6], for a total of 45.13, much higher than when any one agent uses DQN. Finally, for details on IL, OO, a, r, and OUTL for this case, see Section B of the online supplement.
We next test our approach on beer game settings from the literature. These have larger demand-distribution domains, and therefore larger plausible action spaces, and thus represent harder instances for training the DQN. In all instances in this section, the information lead times are $l^{in} = [2, \ldots, 2]$ and the transportation lead times are $l^{tr} = [2, \ldots]$. The $U[0, 8]$ instance has non-zero costs at every stage, so the algorithm of Clark and Scarf (1960) does not apply; therefore, we used a heuristic approach based on a two-moment approximation, similar to that proposed by Graves (1985), to choose the base-stock levels; see Snyder (2018). In addition, the $C(4, 8)$ demand process is non-stationary (4, then 8), but we allow only stationary base-stock levels; therefore, we chose to set the base-stock levels equal to the values that would be optimal if the demand were 8 in every period. Finally, in the experiments in this section, we assume that agent $i$ observes $AS^i_t$ after choosing $a^i_t$, whereas in Section 4.1 we assumed the opposite. Therefore, the agents in these experiments have one fewer piece of information when choosing actions and are therefore more difficult to train.
Table 2 Cost parameters and base-stock levels for instances with uniform, normal, and classic demand distributions.

demand       c_p                    c_h                       BS level
U[0, 8]      [1.0, 1.0, 1.0, 1.0]   [0.50, 0.50, 0.50, 0.50]  [19, 20, 20, 14]
N(10, σ²)    [10.0, 0.0, 0.0, 0.0]  [1.00, 0.75, 0.50, 0.25]  [48, 43, 41, 30]
C(4, 8)      [1.0, 1.0, 1.0, 1.0]   [0.50, 0.50, 0.50, 0.50]  [32, 32, 32, 24]

Tables 3, 4, and 5 show the results of the cases in which the DQN agent plays with co-players who follow base-stock, Sterman, and random policies, respectively. In each group of columns, the first column ("DQN") gives the average cost (over 50 instances) when one agent (indicated by the first column of the table) is played by the DQN and the co-players are played by base-stock (Table 3), Sterman (Table 4), or random (Table 5) agents. The second column in each group ("BS", "Strm-BS", "Rand-BS") gives the corresponding cost when the DQN agent is replaced by a base-stock agent (using the base-stock levels given in Table 2) and the co-players remain as in the previous column. The third column ("Gap") gives the percentage difference between these two costs.

Table 3 Results of DQN playing with co-players who follow a base-stock policy.

          Uniform                     Normal                      Classic
          DQN      BS      Gap (%)    DQN      BS      Gap (%)    DQN    BS     Gap (%)
R         904.88   799.20  13.22      881.66   838.14  5.19       0.50   0.34   45.86
W         960.44   799.20  20.18      932.65   838.14  11.28      0.47   0.34   36.92
D         903.49   799.20  13.05      880.40   838.14  5.04       0.67   0.34   96.36
M         830.16   799.20  3.87       852.33   838.14  1.69       0.30   0.34   -13.13
Average                    12.58                       5.80                     41.50

As Table 3 shows, when the DQN plays with base-stock co-players under uniform or normal demand distributions, it obtains costs that are reasonably close to the case in which all players use a base-stock policy, with average gaps of 12.58% and 5.80%, respectively. These gaps are not quite as small as those in Section 4.1, due to the larger action spaces in the instances in this section. Since a base-stock policy is optimal at every stage, the small gaps demonstrate that the DQN can learn to play the game well for these demand distributions. For the classic demand process, the percentage gaps are larger. To see why, note that if the demand were to equal 8 in every period, the base-stock levels for the classic demand process would result in ending inventory levels of 0 at every stage. The four initial periods of demand equal to 4 disrupt this effect slightly, but the cost of the optimal base-stock policy for the classic demand process is asymptotically 0 as the time horizon goes to infinity. The absolute gap attained by the DQN is quite small, an average of 0.49 vs. 0.34 for the base-stock cost, but the percentage difference is large simply because the optimal cost is close to 0. Indeed, if we allow the game to run longer, the cost of both algorithms decreases, and so does the absolute gap. For example, when the DQN plays the retailer, after 500 periods the discounted costs are 0.0090 and 0.0062 for DQN and BS, respectively, and after 1000 periods, the costs are 0.0001 and 0.0000 (to four-digit precision).

Similar to the results of Section 4.1.2, when the DQN plays with co-players who follow the Sterman formula, it performs far better than Strm-BS. As Table 4 shows, DQN performs 34% better than Strm-BS on average.
Table 4 Results of DQN playing with co-players who follow the Sterman policy.

          Uniform                          Normal                           Classic
          DQN     Strm-BS   Gap (%)        DQN     Strm-BS   Gap (%)        DQN     Strm-BS   Gap (%)
R         6.88    8.99      -23.45         9.98    10.67     -6.44          3.80    13.28     -71.41
W         5.90    9.53      -38.10         7.11    10.03     -29.06         2.85    8.17      -65.08
D         8.35    10.99     -23.98         8.49    13.83     -38.65         3.82    20.07     -80.96
M         12.36   13.90     -11.07         13.86   15.37     -9.82          15.80   19.96     -20.82
Average                     -24.15                           -20.99                           -59.57
Table 5 Results of DQN playing with co-players who follow the random policy.

          Uniform                          Normal                           Classic
          DQN     Rand-BS   Gap (%)        DQN     Rand-BS   Gap (%)        DQN     Rand-BS   Gap (%)
R         31.39   28.24     11.12          13.03   28.39     -54.10         19.99   25.88     -22.77
W         29.62   28.62     3.49           27.87   35.80     -22.15         23.05   23.44     -1.65
D         30.72   28.64     7.25           34.85   38.79     -10.15         22.81   23.53     -3.04
M         29.03   28.13     3.18           37.68   40.53     -7.02          22.36   22.45     -0.42
Average                     6.26                             -23.36                           -6.97

Finally, when DQN plays with co-players who use the random policy, for all demand distributions the DQN learns very well to play so as to minimize the total cost of the system, and on average it obtains 8% better solutions than Rand-BS.
To summarize, DQN does well regardless of the way the other agents play, and regardless of the demand distribution. The DQN agent learns to attain near-BS costs when its co-players follow a BS policy, and when playing with irrational co-players, it achieves a much smaller cost than a base-stock policy would. Thus, when the other agents play irrationally, DQN should be used.

We trained a DQN network with shape [50, , , ], m = 10, β = 20, and C = 10000 for each agent, with the same holding and stockout costs and action spaces as in Section 4.1, using 60000 training episodes, and used these as the base networks for our transfer learning experiment. (In transfer learning, all agents should have the same network structure so that the learned network can be shared among different agents.) The remaining agents use a BS policy.

Table 6 shows a summary of the results of the six cases discussed in Section 3.2. The first set of columns indicates the holding and shortage cost coefficients, the size of the action space, the demand distribution, and the co-players' policy for the base agent (first row) and the target agent (remaining rows). The "Gap" column indicates the average gap between the cost of the resulting DQN and the cost of a BS policy; in the first row, it is analogous to the 2.31% average gap reported in Section 4.1.1.

Table 6  Results of transfer learning.
             (Holding, Shortage) Cost Coefficients        |A|   D          π      Gap (%)   CPU Time (sec)
             R        W         D        M
Base agent   (2,2)    (2,0)     (2,0)    (2,0)              5   U[0, ·]    BS
Case 1                                                          U[0, ·]    BS
Case 2                                                          U[0, ·]    BS
Case 3                                                          U[0, ·]    BS
Case 4                                                          U[0, ·]    BS
Case 5                                                          N(10, ·)   Strm   -38.20    1,153,571
Case 6       (1,10)   (0.75,0)  (0.5,0)  (0.25,0)          11   N(10, ·)   Rand   -0.25     1,292,295

The average gap is relatively small in all cases, which shows the effectiveness of the transfer learning approach. Moreover, the approach is efficient, as demonstrated in the last column, which reports the average CPU times for all agents. To obtain the base agents, we performed hyper-parameter tuning and trained 140 instances to find the best possible set of hyper-parameters, which resulted in a total of 28,390,987 seconds of training. With the transfer learning approach, however, we do not need any hyper-parameter tuning; we only need to check which source agent and which k provides the best results. This requires only 12 instances to train and resulted in an average training time (across cases 1–4) of 1,613,711 seconds, i.e., 17.6 times faster than training the base agent. Additionally, in case 5, in which a normal distribution is used, full hyper-parameter tuning took 20,396,459 seconds, with an average gap of 4.76%, which means transfer learning was 16.6 times faster on average. We did not run the full hyper-parameter tuning for the instances of case 6, but it is similar to that of case 5 and should take a similar training time, and as a result yield a similar improvement from transfer learning. Thus, once we have a trained agent i with a given set of parameters, demand distribution, and co-players' policy, we can efficiently train a new agent j with a different set of parameters, demand distribution, and co-players' policy.
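For concreteness, the 17.6-fold speedup for cases 1–4 quoted above is simply the ratio of the two reported training-time totals:

\[
\text{speedup} \;=\; \frac{28{,}390{,}987\ \text{sec (full hyper-parameter tuning)}}{1{,}613{,}711\ \text{sec (transfer learning, average over cases 1--4)}} \;\approx\; 17.6 .
\]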
In order to get more insight into the transfer learning process, Figure 7 shows the results of case 4, a quite complex transfer learning case for the beer game. The target agents have holding and shortage costs (10,1), (10,0), (10,0), and (10,0) for agents 1 to 4, respectively, and each agent can select any action in {− , . . . , }. Each panel caption reports the base agent (denoted b) and the value of k used. Compared to the original procedure (see Figure 4), i.e., k = 0, the training is less noisy, and after a few thousand non-fluctuating training episodes it converges to the final solution. The resulting agents obtain costs that are close to those of BS, with a 12.58% average gap compared to the BS cost. (The details of the other cases are provided in Sections D.1–D.5 of the online supplement.)

Figure 7  Results of transfer learning for case 4 (different agent, cost coefficients, and action space). (a) Target agent = retailer (b = 3, k = 1); (b) Target agent = wholesaler (b = 1, k = 1); (c) Target agent = distributor (b = 3, k = 2); (d) Target agent = manufacturer (b = 4, k = 2)

Finally, Table 7 explores the effect of k on the tradeoff between training speed and solution accuracy. As k increases, the number of trainable variables decreases and, not surprisingly, the CPU times are smaller but the costs are larger. For example, when k = 3, the training time is 46.89% smaller than when k = 0, but the solution cost is 17.66% and 0.34% greater than that of the BS policy (for cases 1–4 and 1–6, respectively), compared to 4.22% and -11.65% for k = 2.

Table 7  Savings in computation time due to transfer learning. The first row gives the average training time among all instances. The third row gives the average of the best obtained gaps in cases for which an optimal solution exists. The fourth row gives the average gap among all transfer learning instances, i.e., cases 1–6.
                                        k = 0      k = 1      k = 2      k = 3
Training time (sec)                     185,679    126,524    118,308    107,711
Decrease in time compared to k = 0      —          37.61%     41.66%     46.89%
Average gap in cases 1–4                2.31%      4.39%      4.22%      17.66%
Average gap in cases 1–6                —          -15.95%    -11.65%    0.34%

To summarize, transferring the acquired knowledge between the agents is very efficient. The target agents achieve costs that are close to those of the BS policy (when co-players follow BS), and they achieve smaller costs than Strm-BS and Rand-BS, regardless of the dissimilarities between the source and the target agents. The training of the target agents starts from relatively small cost values, the training trajectories are stable and fairly non-noisy, and they quickly converge to a cost value close to that of the BS policy or smaller than Strm-BS and Rand-BS. Even when the action space and costs of the source and target agents differ, transfer learning is still quite effective, resulting in a 12.58% gap compared to the BS policy. This is an important result, since it means that if the settings change (either within the beer game or in real supply chain settings) we can train new DQN agents much more quickly than we could if we had to begin each training from scratch.
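As an illustration of the mechanism (not the authors' implementation), the sketch below shows one way to copy the first k layers of a trained source Q-network into a target network of the same structure and freeze them before the remaining layers are retrained. It assumes a simple fully connected PyTorch network; the layer sizes and the use of Adam for the trainable parameters are illustrative choices.

```python
import torch
import torch.nn as nn

def build_q_network(layer_sizes):
    """Fully connected Q-network mapping a state vector to one Q-value per action."""
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        layers += [nn.Linear(n_in, n_out), nn.ReLU()]
    return nn.Sequential(*layers[:-1])      # drop the activation after the output layer

def transfer_first_k_layers(source_net, target_net, k):
    """Copy the first k Linear layers from source to target and freeze them.

    The remaining layers of the target keep their own initialization and stay
    trainable, so only they are updated when the target agent is retrained.
    """
    src_linears = [m for m in source_net if isinstance(m, nn.Linear)]
    tgt_linears = [m for m in target_net if isinstance(m, nn.Linear)]
    for src, tgt in zip(src_linears[:k], tgt_linears[:k]):
        tgt.load_state_dict(src.state_dict())
        for p in tgt.parameters():
            p.requires_grad = False         # frozen: non-trainable during retraining

# Example with illustrative layer sizes (not the paper's exact architecture):
source = build_q_network([50, 64, 64, 5])   # trained base agent's network
target = build_q_network([50, 64, 64, 5])   # new agent with the same structure
transfer_first_k_layers(source, target, k=2)
optimizer = torch.optim.Adam(p for p in target.parameters() if p.requires_grad)
```

Because the frozen layers contribute no trainable variables, larger k reduces training time, at the price of less flexibility for the target agent, which is the tradeoff summarized in Table 7.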
5. Conclusion and Future Work
In this paper, we consider the beer game, a decentralized, multi-agent, cooperative supply chain problem. A base-stock inventory policy is known to be optimal for special cases, but once some of the agents do not follow a base-stock policy (as is common in real-world supply chains), the optimal policy of the remaining players is unknown. To address this issue, we propose an algorithm based on deep Q-networks. It obtains near-optimal solutions when playing alongside agents who follow a base-stock policy and performs much better than a base-stock policy when the other agents use a more realistic model of ordering behavior. Furthermore, the algorithm does not require knowledge of the demand probability distribution and uses only historical data.

To reduce the computation time required to train new agents with different cost coefficients or action spaces, we propose a transfer learning method. Training new agents with this approach takes less time since it avoids the need to tune hyper-parameters and has a smaller number of trainable variables. Moreover, it is quite powerful, resulting in beer game costs that are close to those of fully-trained agents while reducing the training time by an order of magnitude.

A natural extension of this paper is to apply our algorithm to supply chain networks with other topologies, e.g., distribution networks. Another important extension is having multiple learnable agents. Finally, developing algorithms capable of handling continuous action spaces will improve the accuracy of our algorithm.

References
D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819–840, 2002.
S. K. Chaharsooghi, J. Heydari, and S. H. Zegordi. A reinforcement learning model for supply chain ordering management: An application to the beer game. Decision Support Systems, 45(4):949–959, 2008.
F. Chen and R. Samroengraja. The stationary beer game. Production and Operations Management, 9(1):19, 2000.
A. J. Clark and H. Scarf. Optimal policies for a multi-echelon inventory problem. Management Science, 6(4):475–490, 1960.
C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. AAAI/IAAI, 1998:746–752, 1998.
R. Croson and K. Donohue. Impact of POS data sharing on supply chain management: An experimental study. Production and Operations Management, 12(1):1–11, 2003.
R. Croson and K. Donohue. Behavioral causes of the bullwhip effect and the observed value of inventory information. Management Science, 52(3):323–336, 2006a.
R. Croson and K. Donohue. Behavioral causes of the bullwhip effect and the observed value of inventory information. Management Science, 52(3):323–336, 2006b.
K. Devika, A. Jafarian, A. Hassanzadeh, and R. Khodaverdi. Optimizing of bullwhip effect and net stock amplification in three-echelon supply chains using evolutionary multi-objective metaheuristics. Annals of Operations Research, 242(2):457–487, 2016.
S. Geary, S. M. Disney, and D. R. Towill. On bullwhip in supply chains: Historical review, present practice and expected future impact. International Journal of Production Economics, 101(1):2–18, 2006.
I. Giannoccaro and P. Pontrandolfo. Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78(2):153–161, 2002.
S. C. Graves. A multi-echelon inventory model for a repairable item with one-for-one replenishment. Management Science, 31(10):1247–1256, 1985.
C. Jiang and Z. Sheng. Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Systems with Applications, 36(3):6520–6526, 2009.
S. O. Kimbrough, D.-J. Wu, and F. Zhong. Computers play the beer game: Can artificial agents manage supply chains? Decision Support Systems, 33(3):323–333, 2002.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
H. L. Lee, V. Padmanabhan, and S. Whang. Information distortion in a supply chain: The bullwhip effect. Management Science, 43(4):546–558, 1997.
H. L. Lee, V. Padmanabhan, and S. Whang. Comments on "Information distortion in a supply chain: The bullwhip effect". Management Science, 50(12S):1887–1893, 2004.
Y. Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.
I. J. Martinez-Moyano, J. Rahn, and R. Spencer. The Beer Game: Its History and Rule Changes. Technical report, University at Albany, 2014.
F. S. Melo and M. I. Ribeiro. Q-learning with linear function approximation. In International Conference on Computational Learning Theory, pages 308–322. Springer, 2007.
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
E. Mosekilde and E. R. Larsen. Deterministic chaos in the beer production-distribution model. System Dynamics Review, 4(1-2):131–147, 1988.
A. Oroojlooyjadid, L. Snyder, and M. Takáč. Applying deep learning to the newsvendor problem. http://arxiv.org/abs/1607.02177, 2017a.
A. Oroojlooyjadid, L. Snyder, and M. Takáč. Stock-out prediction in multi-echelon networks. arXiv preprint arXiv:1709.06922, 2017b.
S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, C. Langlotz, K. Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
L. V. Snyder. Multi-echelon base-stock optimization with upstream stockout costs. Technical report, Lehigh University, 2018.
L. V. Snyder and Z.-J. M. Shen. Fundamentals of Supply Chain Theory. John Wiley & Sons, 2nd edition, 2019.
J. D. Sterman. Modeling managerial behavior: Misperceptions of feedback in a dynamic decision making experiment. Management Science, 35(3):321–339, 1989.
F. Strozzi, J. Bosch, and J. Zaldivar. Beer game order policy optimization under changing customer demand. Decision Support Systems, 42(4):2153–2163, 2007.
R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.
A. Tversky and D. Kahneman. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131, 1974.
Online Supplements for A Deep Q-Network for the Beer Game: Reinforcement Learning for Inventory Optimization
Appendix A: Sterman Formula Parameters
The computational experiments that use Strm agents calculate the order quantity using formula (3), adapted from Sterman (1989):

q^i_t = max{ 0, AO^i_{t+1} + α^i (IL^i_t − a^i) + β^i (OO^i_t − b^i) },     (3)

where α^i, a^i, β^i, and b^i are the parameters corresponding to the inventory level and the on-order quantity. The idea is that the agent sets the order quantity equal to its demand forecast (the arriving order) plus two adjustment terms based on the deviations between its current inventory level (resp., on-order quantity) and a target value a^i (resp., b^i). We set a^i = µ_d, where µ_d is the average demand; b^i = µ_d (l^in_i + l^tr_i); α^i = −0.5; and β^i equal to a negative constant, for i = 1, . . . , 4. The negative α^i and β^i mean that the player over-orders when the inventory level or on-order quantity falls below the target value a^i or b^i.
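As a concrete illustration, the following is a minimal Python sketch of this ordering rule. The default value of beta below is an assumption for illustration only, since the text specifies only that it is negative; alpha = −0.5 and the targets a and b follow the description above.

```python
def sterman_order(arriving_order, inventory_level, on_order,
                  mu_d, lead_time_info, lead_time_transport,
                  alpha=-0.5, beta=-0.2):
    """Sterman-style order quantity, following formula (3).

    arriving_order : the demand forecast (order arriving from downstream)
    inventory_level: the agent's current inventory level IL
    on_order       : the agent's current on-order quantity OO
    mu_d           : average demand, used as the inventory target a
    alpha, beta    : adjustment coefficients; alpha = -0.5 as in the text,
                     beta = -0.2 is an illustrative (assumed) negative value
    """
    a = mu_d                                            # inventory-level target
    b = mu_d * (lead_time_info + lead_time_transport)   # on-order target
    q = arriving_order + alpha * (inventory_level - a) + beta * (on_order - b)
    return max(0.0, q)

# A low inventory level and low pipeline stock push the order above the forecast,
# because both adjustment coefficients are negative.
print(sterman_order(arriving_order=10, inventory_level=-5, on_order=4,
                    mu_d=10, lead_time_info=2, lead_time_transport=2))  # 24.7
```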
Appendix B: Extended Numerical Results
This appendix presents additional details on the play of each agent. Figure 8 provides the details of IL, OO, a, r, and OUTL for each agent when the DQN retailer plays with co-players who use the BS policy. Clearly, the DQN attains IL, OO, action, and reward values similar to those of BS. Figure 9 provides analogous results for the case in which the DQN manufacturer plays with three Strm agents. The DQN agent learns that the shortage costs of the non-retailer agents are zero and exploits that fact to reduce the total cost. In each figure, the top set of charts gives the results of the retailer, followed by the warehouse, distributor, and manufacturer.
Figure 8  IL_t, OO_t, a_t, and r_t of all agents when the DQN retailer plays with three BS co-players

Figure 9  IL_t, OO_t, a_t, and r_t of all agents when the DQN manufacturer plays with three Strm-BS co-players

Appendix C: The Effect of β on the Performance of Each Agent

Figure 10 plots the training trajectories for DQN agents playing with three BS agents using various values of C, m, and β. In each sub-figure, the blue line denotes the result when all players use a BS policy, while the remaining curves each represent the agent using DQN with different values of C, β, and m, trained for 60000 episodes with a fixed learning rate. For the retailer, larger values of β work well, and β = 40 provides the best results. As we move upstream in the supply chain (warehouse, then distributor, then manufacturer), smaller β values become more effective (see Figures 10b–10d). Recall that the retailer bears the largest share of the optimal expected cost per period, and as a result it needs a larger β than the other agents. Not surprisingly, larger m values attain better costs, since the DQN has more knowledge of the environment. Finally, larger C works better and provides a more stable DQN model. However, there are some combinations for which smaller C and m also work well; see, e.g., trajectory 5000-20-5 in Figure 10d.

Figure 10  Total cost (upper figure) and normalized cost (lower figure) with one DQN agent and three agents that follow a base-stock policy. (a) DQN plays retailer; (b) DQN plays warehouse; (c) DQN plays distributor; (d) DQN plays manufacturer

Figure 11  Results of transfer learning between agents with the same cost coefficients and action space. (a) Case 1-4-1; (b) Case 2-4-1; (c) Case 3-1-1; (d) Case 4-2-1
Appendix D: Extended Results on Transfer Learning
D.1. Transfer Knowledge Between Agents
In this appendix, we present the results of the transfer learning method when the trained agent i ∈ {1, 2, 3, 4} transfers its first k ∈ {1, 2, 3} layer(s) to co-player agent j ∈ {1, 2, 3, 4}, j ≠ i. For each target agent j, Figure 11 shows the results for the best source agent i and number of shared layers k, out of the 9 possible choices for i and k. In the sub-figure captions, the notation j-i-k indicates that source agent i shares the weights of its first k layers with target agent j, so that those k layers remain non-trainable.

Except for agent 2, all agents obtain costs that are very close to those of the BS policy, with a 6.06% gap on average. (In Section 4.1.1, the average gap is 2.31% of the BS cost.) In order to get more insight, consider Figure 4, which presents the best results obtained through hyper-parameter tuning for each agent. In that figure, all agents start the training with a large cost value and, after 25000 fluctuating iterations, each converges to a stable solution. In contrast, in Figure 11, each agent starts from a relatively small cost value and converges to the final solution after a few thousand training episodes. Moreover, for agent 3, the final cost of the transfer learning solution is smaller than that obtained by training the network from scratch. And the transfer learning method used one order of magnitude less CPU time than the approach in Section 4.1.1 to obtain very similar results.

We also observe that agent j can obtain good results when k = 1 and i is either j − 1 or j + 1. This shows that the learned weights of the first DQN network layer are general enough to transfer knowledge to the other agents, and also that the learned knowledge of neighboring agents is similar. Also, for any agent j, source agent i = 1 provides results similar to those of agents i = j − 1 and i = j + 1, and in some cases it provides slightly smaller costs, which shows that agent 1 captures general feature values better than the others.

Figure 12  Results of transfer learning between agents with different cost coefficients and the same action space. (a) Case 1-4-1; (b) Case 2-3-3; (c) Case 3-1-1; (d) Case 4-4-2

D.2. Transfer Knowledge for Different Cost Coefficients
Figure 12 shows the best results achieved for all agents when target agent j has cost coefficients that differ from those of the source agent, (c_p^j, c_h^j) ≠ (c_p^i, c_h^i). We test target agents j ∈ {1, 2, 3, 4}, such that the holding and shortage costs are (5,1), (5,0), (5,0), and (5,0) for agents 1 to 4, respectively. In all of these tests, the source and target agents have the same action spaces. All agents attain cost values close to the BS cost; in fact, the overall average cost is 6.16% higher than the BS cost.

In addition, similar to the results of Section D.1, base agent i = 1 provides good results for all target agents. We also performed the same tests with shortage and holding costs (10,1), (1,0), (1,0), and (1,0) for agents 1 to 4, respectively, and obtained very similar results.

D.3. Transfer Knowledge for Different Size of Action Space
Increasing the size of the action space should increase the accuracy of the d + x approach. However, it makes the training process harder. It can therefore be effective to train an agent with a small action space and then transfer the knowledge to an agent with a larger action space. To test this, we consider target agents j ∈ {1, 2, 3, 4} with action space {− , . . . , }, assuming that the source and target agents have the same cost coefficients. Figure 13 shows the best results achieved for all agents. All agents attain costs that are close to the BS cost, with an average gap of approximately 10.66%.

Figure 13  Results of transfer learning for agents with |A^j| ≠ |A^i| and (c_p^j, c_h^j) = (c_p^i, c_h^i). (a) Case 1-3-1; (b) Case 2-3-2; (c) Case 3-4-2; (d) Case 4-2-1

Figure 14  Results of transfer learning for agents with |A^j| ≠ |A^i|, (c_p^j, c_h^j) ≠ (c_p^i, c_h^i), D^j ≠ D^i, and π^j ≠ π^i. (a) Case 1-3-1; (b) Case 2-3-3; (c) Case 3-2-1; (d) Case 4-3-2

D.4. Transfer Knowledge for Different Action Space, Cost Coefficients, and Demand Distribution
This case includes all of the difficulties of the cases in Sections D.1, D.2, D.3, and 4.3, in addition to the demand distributions being different. As a result, the range of demand, IL, OO, AS, and AO that each agent observes is different from that of the base agent. Therefore, this is a hard case to train; the average optimality gap is 17.41%. However, as Figure 14 depicts, the cost values decrease quickly and the training noise is quite small.

D.5. Transfer Knowledge for Different Action Space, Cost Coefficients, Demand Distribution, and π

Figures 15 and 16 show the results of the most complex transfer learning cases that we tested. Although the DQN plays with non-rational co-players and the observations in each state might be quite noisy, the training exhibits relatively small fluctuations, and all agents converge after around 40,000 iterations.
Figure 15  Results of transfer learning for agents with |A^j| ≠ |A^i|, (c_p^j, c_h^j) ≠ (c_p^i, c_h^i), D^j ≠ D^i, and π^j ≠ π^i. (a) Case 1-1-1; (b) Case 2-1-3; (c) Case 3-1-1; (d) Case 4-1-1

Figure 16  Results of transfer learning for agents with |A^j| ≠ |A^i|, (c_p^j, c_h^j) ≠ (c_p^i, c_h^i), D^j ≠ D^i, and π^j ≠ π^i. (a) Case 1-2-1; (b) Case 2-1-2; (c) Case 3-3-3; (d) Case 4-1-1

Appendix E: Pseudocode of the Beer Game Simulator
The DQN algorithm needs to interact with the environment: for each state and action, the environment must return the reward and the next state. We simulate the beer game environment using Algorithm 2. In addition to the notation defined earlier, the algorithm uses the following notation:

d_t: the demand of the customer in period t.
OS^t_i: the outbound shipment from agent i (to agent i − 1) in period t.

Algorithm 2  Beer Game Simulator Pseudocode
procedure playGame
    Set T randomly, and t = 0; initialize IL_i for all agents; AO^t_i = 0, AS^t_i = 0, ∀ i, t
    while t ≤ T do
        AO^{t + l^in_1}_1 += d_t                          ▷ set the retailer's arriving order to the external demand
        for i = 1 : 4 do                                  ▷ loop through stages downstream to upstream
            get action a^t_i                              ▷ choose order quantity
            OO^{t+1}_i = OO^t_i + a^t_i                   ▷ update OO_i
            AO^{t + l^in_i}_{i+1} += a^t_i                ▷ propagate order upstream
        end for
        AS^{t + l^tr_4}_4 += a^t_4                        ▷ set manufacturer's arriving shipment to its order quantity
        for i = 4 : 1 do                                  ▷ loop through stages upstream to downstream
            IL^{t+1}_i = IL^t_i + AS^t_i                  ▷ receive inbound shipment
            OO^{t+1}_i −= AS^t_i                          ▷ update OO_i
            currentInv = max{0, IL^{t+1}_i}               ▷ determine outbound shipment
            currentBackOrder = max{0, −IL^t_i}
            OS^t_i = min{currentInv, currentBackOrder + AO^t_i}
            AS^{t + l^tr_i}_{i−1} += OS^t_i               ▷ propagate shipment downstream
            IL^{t+1}_i −= AO^t_i                          ▷ update IL_i
            c^t_i = c^i_p max{−IL^{t+1}_i, 0} + c^i_h max{IL^{t+1}_i, 0}    ▷ calculate cost
        end for
        t += 1
    end while
end procedure