A Cooperative Multi-Agent Reinforcement Learning Framework for Resource Balancing in Complex Logistics Network
Xihan Li
Key Laboratory of Machine Perception, Peking University, Beijing, [email protected]
Jia Zhang
Microsoft Research Asia, Beijing, [email protected]
Jiang Bian
Microsoft Research Asia, Beijing, [email protected]
Yunhai Tong
Key Laboratory of Machine Perception, Peking University, Beijing, [email protected]
Tie-Yan Liu
Microsoft Research Asia, Beijing, [email protected]
ABSTRACT
Resource balancing within complex transportation networks is one of the most important problems in the real logistics domain. Traditional solutions to these problems leverage combinatorial optimization with demand and supply forecasting. However, the high complexity of transportation routes, the severe uncertainty of future demand and supply, and non-convex business constraints make it extremely challenging for the traditional resource management field. In this paper, we propose a novel multi-agent reinforcement learning approach to address these challenges. In particular, inspired by externalities, especially the interactions among resource agents, we introduce an innovative cooperative mechanism for state and reward design, resulting in more effective and efficient transportation. Extensive experiments on a simulated ocean transportation service demonstrate that our new approach can stimulate cooperation among agents and lead to much better performance. Compared with traditional solutions based on combinatorial optimization, our approach can give rise to a significant improvement in terms of both performance and stability.
KEYWORDS
multi-agent; reinforcement learning; resource balancing; logistics network
ACM Reference Format:
Xihan Li, Jia Zhang, Jiang Bian, Yunhai Tong, and Tie-Yan Liu. 2019. A Cooperative Multi-Agent Reinforcement Learning Framework for Resource Balancing in Complex Logistics Network. In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS, 14 pages.
INTRODUCTION
With the rapid growth of the logistics industry, the imbalance between resource supply and demand (SnD) has become one of the most important problems in many real logistics scenarios. For example, in the domain of ocean transportation, the SnD of empty
containers is very unequal due to the world trade imbalance [20]; in the domain of express delivery, there exists severe emerging unevenness in the SnD of carriers within local areas; in the fast-growing car-sharing and bike-sharing areas, the unbalanced SnD of shared taxis and bikes is also explicit due to various temporal and spatial factors [10, 17]. Hence, efficient resource balancing has risen to be the critical approach to solving resource imbalance in the logistics industry. Failure to do so causes large amounts of unfulfilled resource demand, further resulting in reduced customer satisfaction, increasing resource shortage cost, and declining revenue. Persistently unsolved SnD imbalance can give rise to accumulated resource shortage and, even worse, a stalemate of SnD [17] with unexpectedly amplified prices.

Traditional solutions for resource balancing leverage operational research (OR) based methods [20], which are typically multistage: they first use forecasting techniques to estimate the future SnD of each resource agent; then, a combinatorial optimization approach is employed to find each resource agent's optimal action to minimize a pre-defined objective, which is usually formed as the total cost caused by resource shortage; finally, the feasible execution plan is generated by tailoring the raw solution obtained by OR-based models.
Nevertheless, the drastic uncertainty of future SnD, complex business constraints in non-convex form, as well as the high complexity of transportation networks make it extremely challenging to generate satisfying action plans using traditional OR solutions.

More concretely, the first crucial challenge, i.e., the uncertainty of future SnD, is mainly caused by multiple external, highly dynamic factors, either temporal or spatial, such as special days/events, emerging market changes, unstable policies [20], etc. Moreover, such uncertainty can be further aggravated by the inherent mutual dependency between the OR-based model and future SnD. In particular, future SnD can be dramatically deviated by action plans generated by the OR model, which in turn heavily relies on the future SnD. Hence, the uncertainty of future SnD, drastically increasing the difficulty of accurate SnD forecasting, tends to undermine the effectiveness of the traditional multistage OR-based method.

The second major challenge is reflected by many important but complex business rules in real logistics services. On the one hand, they are hard to formulate as constraints in linear or convex form, which therefore makes it quite hard to model and solve the problem precisely using traditional OR-based methods such as linear programming and convex optimization. On the other hand, ignoring these necessary constraints is unacceptable since it will cause a big gap between the model and the real world, leading to significant performance drops and even infeasible solutions.

Furthermore, since the transportation networks in real logistics services are usually very complex, consisting of various types of terminals and complex connecting routes, the consequent complicated dependencies among terminals raise another vital challenge when building an effective OR-based model.
Specifically, those complicated dependencies make it quite difficult to create an acceptable number of constraints and variables to balance between the individual and the collective objectives in the OR-based model.

To address these challenges, in this paper, we formally formulate the resource balancing problem in complex logistics networks as a stochastic game and then propose a novel cooperative multi-agent reinforcement learning (MARL) framework. With the dedicated design of the agent set, joint action space, state set, reward functions, transition probability functions, and discount factor, our multi-agent reinforcement learning framework provides an end-to-end and high-capability solution, which can not only compensate for imperfect forecasting results to avoid further error propagation in multistage OR methods, but also optimize the obtained action plans towards complicated constraints based on real business rules. Moreover, in contrast to applying MARL in some easier logistics scenarios, a blind employment of the reinforcement learning approach may not produce satisfactory results in complex logistics networks, because of its incapability of enhancing cooperation among highly dependent resource agents. To tackle this challenge, we further introduce three levels of cooperative metrics and, accordingly, improve the state and reward design to better promote cooperation in complex logistics networks.

To demonstrate the superiority of the MARL framework, we implement our approach on an empty container repositioning (ECR) task in a complex ocean transportation network. In fact, such maritime transportation is essential to the world's economy, as 80% of global trade is carried by sea [21]. By far, maritime transportation is the most cost-effective way to move bulk commodity and raw materials around the world.
Extensive experiments show that our method can achieve nearly optimal resource balancing results, which yields a significant improvement over the traditional OR baseline. Our major contributions can be summarized as follows:
• Formulating the resource balancing problem in a complex transportation network as a stochastic game.
• Introducing a cooperative multi-agent reinforcement learning framework as an end-to-end and high-capability solution to the resource balancing problem, as it is not only more robust to imperfect SnD forecasting but also yields higher capability and flexibility compared with traditional multistage OR-based methods.
• Proposing three levels of cooperative metrics to provide guidance for state and reward design, in order to better promote cooperation in complex logistics networks.
• Conducting extensive experiments on the empty container repositioning task in the scenario of the real-world ocean logistics industry.
RELATED WORK
Resource balancing in transportation networks, which can be regarded as a branch of the scheduling problem, has been comprehensively studied in the field of OR [2, 5, 9, 18]. Among them, Epstein et al. [5] studied the ECR problem and developed a logistics optimization system to manage the imbalance with a multicommodity network flow model based on demand forecasting and safety stock control. For more works about ECR, Song and Dong [20] provide an in-depth review of the OR-based literature.

With the prosperity of deep learning, deep reinforcement learning (RL) methods like DQN [15] have achieved great success in modeling and solving many intellectually challenging problems, such as video games [15] and Go [19]. However, they are not widely applied to complicated real-world applications, especially those that have high-dimensional action spaces and need cooperation among many agents.

In recent years, motivated by the great success of deep RL, some RL-based methods have been proposed to address the resource balancing problem, especially rebalancing homogeneous, flexible vehicles. Pan et al. [17] proposed a deep reinforcement learning algorithm to tackle the rebalancing problem for shared bikes, which learns a pricing strategy to incentivize users to rebalance the system. Lin et al. [10] proposed a contextual multi-agent reinforcement learning framework to tackle the rebalancing problem for online ride-sharing platforms, in which every taxi is treated as an agent that learns its action to move to its neighboring grids. Xu et al. [23] proposed a learning and planning approach for on-demand ride-hailing platforms, which combines RL for learning and a combinatorial optimization algorithm for planning. These works have successfully modeled and handled large-scale, real-world traffic scenarios. However, compared with resource balancing in complicated logistics networks, the environments in their scenarios are much looser, and the dependency among agents is simple and straightforward.
Thus their methods can hardly be applied to solve the resource balancing problem.

To apply MARL to resource balancing, one of the main obstacles is dealing with the collaboration of agents with complicated dependency. This dependency is mainly caused by complicated logistics network structures. In the area of traditional multi-agent systems, fruitful works have dealt with the collaboration of multiple agents. Among them, FF-Q [11], Nash-Q [7] and Correlated-Q [6] are famous methods achieving convergence and optimality. However, all of them adopt the joint-action approach, which is hardly applicable in real-world multi-agent systems with many agents, due to the extremely large joint action space. Similar limitations occur in other joint-action or best-response based methods [8, 22]. Some other works [3, 4, 13] managed to apply potential-based reward shaping in MARL to stimulate cooperation, and achieve performance improvements in their own scenarios. However, in resource balancing scenarios, where agents' actions have a long-term and hard-to-measure effect on the ultimate results, more effort should be put into understanding the problem and designing rewards.
PROBLEM STATEMENT
In this section, we formally define the resource balancing problem in a complex logistics network.

A typical logistics network can be defined as $G = (P, R, V)$, in which $P$, $R$ and $V$ stand for the set of terminals, routes, and vehicles, respectively. More specifically,
• Each terminal $P_i \in P$ represents a place that can store resources and generate corresponding SnD. We denote the initial resources in stock at $P_i$ as $C_i^0$, and we use $C_i^t$, $D_i^t$, and $S_i^t$ ($t = 1, \cdots, T$) to represent the numbers of stocks, resource demands, and resource supplies at different times, respectively.
• Each route $R_i \in R$ is a cycle in the logistics network, consisting of a sequence of consecutive terminals $\{P_{i_1}, P_{i_2}, \cdots, P_{i_{|R_i|}}\}$, where $|R_i|$ is the number of stops on $R_i$ and the next destination of $P_{i_{|R_i|}}$ is $P_{i_1}$. Each route can intersect with others in the network.
• On each route $R_i$, there is a fixed set of vehicles $V_{R_i} \subseteq V$, each of which, $V_j \in V_{R_i}$, has an initial position, a duration function $d_j(P_u, P_v): P \times P \to \mathbb{N}^+$ (mapping an origin terminal $P_u$ and a destination terminal $P_v$ to the transit time), and a capacity $Cap_j^t$ (the maximum number of resources it can convey). When a vehicle arrives at a terminal, it can either load resources from or discharge its resources to the terminal.

The objective of resource balancing is to minimize the resource shortage among all terminals. At a specific time $t$, a terminal can only use the stock of the last day, i.e., $C_i^{t-1}$, to fulfill the current demand $D_i^t$. Once the stock is not enough, a shortage happens. Thus, we denote the number of shortages as $L_i^t = \max\left(D_i^t - C_i^{t-1},\, 0\right)$.
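As a minimal illustration (names are ours, and the last argument simply aggregates the net amount loaded onto arriving vehicles), the per-terminal shortage and stock dynamics of this section can be sketched as:

```python
def step_terminal(stock_prev, demand, supply, net_loaded):
    """One time step of a terminal's dynamics.

    stock_prev : C_i^{t-1}, stock carried over from the previous step
    demand     : D_i^t      supply : S_i^t
    net_loaded : net amount moved onto arriving vehicles at this terminal
                 (positive = loaded onto vehicles, negative = discharged)
    Returns (shortage L_i^t, new stock C_i^t).
    """
    # L_i^t = max(D_i^t - C_i^{t-1}, 0): unmet demand becomes shortage
    shortage = max(demand - stock_prev, 0)
    # leftover stock, plus new supply, minus what vehicles took away
    stock = max(stock_prev - demand, 0) + supply - net_loaded
    return shortage, stock
```

For example, a terminal holding 10 units that faces demand 15, receives 3 new supplies, and has 2 units discharged onto it by a vehicle ends the step with a shortage of 5 and a stock of 5.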
Accordingly, the objective of resource balancing is to minimize the total resource shortage: $L = \sum_{P_i \in P,\, t \in T} L_i^t$.

After the current demand is processed, new resource supplies and those discharged from vehicles will be added to the stock; thus we can compute the new stock amount as $C_i^t = \max\left(C_i^{t-1} - D_i^t,\, 0\right) + S_i^t - \sum_{j=1}^{|V|} I(i, j, t)\, x_j^t$, where $x_j^t \in \mathbb{N}$ denotes the number of resources loaded onto vehicle $V_j$ at time $t$. $x_j^t$ can be negative to denote the amount of resources discharged from the vehicle, and $I(i, j, t)$ is an indicator variable defined as $I(i, j, t) = 1$ if $V_j$ arrives at $P_i$ at time slot $t$, and $0$ otherwise. We further define $C_{V,j}^t$ as the amount of resources on vehicle $V_j$ at time slot $t$, and clearly, $C_{V,j}^t = C_{V,j}^{t-1} + x_j^t$.

Footnote: This is because new supplies and discharged resources at time $t$ are usually unavailable temporarily for realistic reasons, such as inner-terminal transportation and maintenance. This logic can change with specific application scenarios, and will not affect our framework.

As aforementioned, traditional solutions for resource balancing employ combinatorial optimization with SnD forecasting. However, this approach suffers in the face of SnD uncertainty, complex business constraints, and the high complexity of transportation networks. To address these challenges, in this section, we first model resource balancing in a complex logistics network as a stochastic game and then propose a novel cooperative multi-agent reinforcement learning (MARL) framework to solve it.

The resource balancing problem can be formally modeled as a stochastic game $G = (N, A, S, R, P, \gamma)$, where $N$ is the agent set, $A$ is the joint action space, $S$ is the state set, $R$ is the reward function, $P$ is the transition probability function, and $\gamma$ is the discount factor. More formal definitions are given below.

Agent set $N$.
We define each vehicle as an agent, which yields two major advantages: (1) as each vehicle agent continuously sails circularly along a certain route, it can be aware of a larger scope of information within the whole route, so that optimizing towards maximizing its own reward, i.e., minimizing the shortage, can benefit the total reward of the entire route; (2) since multiple vehicle agents navigating along the same route usually share a similar environment, it is natural for them to share the same policy, which significantly reduces the model complexity in MARL and boosts the learning process.

Joint action space $A$. We define the action of a vehicle agent $V_j$ as loading or discharging resources when it arrives at a terminal $P_i$. Similar to Menda et al. [14], we apply the idea of event-driven reinforcement learning. To be more concrete, we treat each arrival of an agent at a terminal as a trigger event, and an agent only needs to take an action once a trigger event happens. Under this event-driven setting, we use $a_j^t$ to denote the action taken by agent $N_j \in N$ at its $t$-th arrival event. For agent $N_j$, we define its action space as $A_j = [-1, 1]$, where $a_j^t \in [-1, 0)$ means discharging a portion of $|a_j^t|$ resources from the vehicle, $a_j^t \in (0, 1]$ means loading a portion of $a_j^t$ resources onto the vehicle, and $a_j^t = 0$ means doing nothing. The joint action space is $A = A_1 \times A_2 \times \cdots \times A_{|N|}$, where $|N|$ is the number of agents. The total amount of resources that can be discharged or loaded at time $t$ is usually restrictively determined by the dynamic values of $C_i^t$, $Cap_j^t$, $C_{V,j}^t$ as well as some other external factors, which are controlled by domain-specific business logics.

State set $S$. The state set $S$ is a finite set that stands for all possible situations of the whole logistics network.
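The event-driven action semantics above can be sketched as follows. This is a simplified assumption standing in for the domain-specific business logics: we interpret the loaded or discharged "portion" as a fraction of the currently feasible amount, and the function name is ours.

```python
def action_to_quantity(a, terminal_stock, vehicle_load, vehicle_capacity):
    """Map a continuous action a in [-1, 1] to a container move x_j^t.

    a > 0: load a fraction `a` of what can feasibly be loaded;
    a < 0: discharge a fraction `|a|` of the containers on board;
    a = 0: do nothing.  Real deployments add further business limits.
    """
    assert -1.0 <= a <= 1.0
    if a > 0:
        # loading is limited by terminal stock and the vehicle's free space
        return int(a * min(terminal_stock, vehicle_capacity - vehicle_load))
    if a < 0:
        # discharging is limited by the containers currently on board
        return -int(-a * vehicle_load)
    return 0
```

For instance, with 100 units in stock, 20 on board, and capacity 200, the action 0.5 loads 50 units, while −0.5 with 40 on board discharges 20.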
Note that, from a practical point of view, it is not necessary for the agents to take actions based on the whole state information, due to the extremely large state space and the potential noise introduced by unrelated information. We will elaborate more on the practical state design later in this section.

Reward function $R$. The objective of the resource balancing problem is to minimize the accumulated shortage over all terminals. With respect to each individual action, i.e., loading or discharging some resources at a terminal, the impact can spread to its follow-up periods. To model such delayed reward, one usually leverages reward shaping to guide the learning process [16], a typical specification of which is to measure the difference in the ultimate accumulated shortage with and without this action. However, this reward is very hard to compute in practice. Thus, we find
other more realistic reward-shaping methods, which will also be discussed later in this section.

Figure 1: Illustration of the three levels of cooperative metrics (legend: terminal, vehicle, resource stock, route, current event). (a) A self-aware agent only considers information of the current terminal and vehicle to make decisions. (b) A territorially aware agent makes decisions based on information within its territory: it could load more resources at the arrival port with the awareness that another port on its route has low stock. (c) An agent with diplomatic awareness can look far beyond its route: it could load more resources at the current port and discharge them at a transshipment port later, with the awareness that a port on a neighboring route needs support.

Transition probability function $P$. It is defined as a mapping $S \times A \times S \to [0, 1]$, which can be specified by the definitions of $S$, $R$, $V$ and the distribution behind SnD within particular logistics networks.

After formulating the resource balancing problem as a stochastic game, applying a MARL approach to the real world, however, requires a dedicated design of the game state and the action's reward to promote cooperation and improve performance. Based on the scope of agents' awareness of cooperation, we identify three levels of cooperative metrics: self awareness, territorial awareness, and diplomatic awareness. In general, agents with self awareness are fully selfish and shortsighted and only consider immediate information and interests; agents with territorial awareness have a broader vision and make decisions based on information belonging to their territories, i.e., routes in this problem. At last, agents with diplomatic awareness even look beyond their own routes and conduct resource balancing, in a diplomatic way, by cooperating with intersecting routes so that resources can flow from fertile routes to barren routes.

Self awareness. When agent $V_j$ arrives at terminal $P_i$, it is natural that $V_j$ makes decisions just based on the information of itself and $P_i$.
Regarding the reward of this action, a straightforward metric is to consider whether shortages will happen before the next vehicle's arrival at $P_i$. Obviously, this is a very shortsighted agent. Suppose the time of the $k$-th arrival event of a vehicle agent $V_j$ is $t_k$ and the arrival terminal is $P_i$. The state $s_{P,i}^{t_k}$ for terminal $P_i$ can be formed from:
• Current available resources $C_i^{t_k}$.
• Historical information of available resources $\phi\left(C_i^0, \cdots, C_i^{t_k - 1}\right)$ and shortages $\psi\left(L_i^1, \cdots, L_i^{t_k - 1}\right)$.
• Other domain-specific information, such as terminal ID, berth length, etc.
where $\phi(\cdot)$ and $\psi(\cdot)$ denote some statistical functions (mean, median, etc.) or more advanced sequential data processing models (CNN, RNN, etc.). The specific implementation should depend on the application scenario.

The state $s_{V,j}^{t_k}$ for vehicle $V_j$ can be comprised of:
• Current available resources onboard $C_{V,j}^{t_k}$.
• Available space
$Cap_j^{t_k} - C_{V,j}^{t_k}$.
• Other domain-specific information, such as vehicle ID, vehicle type, etc.
Concatenating the above information, we get the state $s_I = \left[s_{P,i}^{t_k}, s_{V,j}^{t_k}\right]$ for self-awareness agents. A self-awareness agent only concerns itself with whether a shortage happens between $t_k$ and $t'_k$, where $t'_k \geq t_k$ stands for the time of the next vehicle's arrival at $P_i$. Besides, inspired by the idea of safety stock in traditional methods, we add a small positive reward if no shortage happens. This reward is calculated according to a function $f: \mathbb{N} \to \mathbb{R}$ that has diminishing marginal gain. The purpose is to encourage the agents to keep some safety stock, with an upper limit, at terminals. In summary, the reward can be written as follows:

$r_I = f\left(C_i^{t'_k}\right) - g\left(\sum_{t = t_k}^{t'_k} L_i^t\right)$,  (1)

where $g: \mathbb{N} \to \mathbb{R}$ is the loss defined on the total shortage.

Territorial awareness. According to the problem definition, a vehicle needs to navigate along a certain route and is obliged to balance the SnD within its own territory, i.e., the terminals on its route. Apparently, an agent with self awareness, having no consideration for other terminals and vehicles on its route, cannot balance the resource SnD within its route. Thus, we introduce territorial-awareness agents to minimize the total shortage of all terminals on the route. Specifically, for an agent $V_j$ on route $R_q$, we hope the agent gets accurate information about the neighboring environment on the route, which is more likely to influence the current decision. We add extra successive information as follows:
• Information about $n$ successive terminals $\left\{s_{P,i'}^{t_k} \mid P_{i'} \in Sc_{i,j}(n)\right\}$, where $Sc_{i,j}(n)$ is the set of $n$ terminals to which vehicle $V_j$ will travel after terminal $P_i$.
• Information about $m$ future vehicles $\left\{s_{V,j'}^{t_k} \mid V_{j'} \in Fu_{i,j}(m)\right\}$, where $Fu_{i,j}(m)$ stands for the set of $m$ vehicles that will arrive at $P_i$ just after $V_j$'s arrival.
As we can see, the larger $n$ and $m$ are, the more information can be used for decisions. However, in practice, we usually set small values for $n$ and $m$ to control the model complexity and the noise introduced by unimportant information. To compensate for the potential information loss, we introduce the overall statistical territory information $s_{R,q}^{t_k}$ for route $R_q$:
• Information on available resources in all the terminals on the route: $\Phi\left(\left\{C_i^{t_k} \mid P_i \in R_q\right\}\right)$.
• Information on shortages in all the terminals on the route: $\Psi\left(\left\{\psi\left(L_i^1, \cdots, L_i^{t_k - 1}\right) \mid P_i \in R_q\right\}\right)$.
Similar to $\phi(\cdot)$ and $\psi(\cdot)$, $\Phi(\cdot)$ and $\Psi(\cdot)$ are statistical functions or models based on series data. We concatenate all the information above with $s_I$ to get the territorial state $s_T$. Territorial-awareness agents make decisions based on the state $s_T$.

Footnote: For example, $f(x) = \sum_{i=1}^{x} \beta^i$ for $0 < \beta < 1$.

Diplomatic awareness. In real logistics networks, imbalance can also happen among different routes: there may be a large amount of supply but very little demand on some routes, while other routes may be the opposite, with a large amount of demand that cannot be satisfied with limited supply. In this case, it is fruitless to attempt to balance SnD within the territory of a single route. To solve this problem substantially, agents should learn diplomacy: solving imbalance collaboratively with agents on intersecting routes. To this end, more information about neighboring routes should be considered. Assume an event $(P_i, V_j, R_q)$, and denote $Cr_q$ as the crossing routes having common terminal(s) with route $R_q$.
First, statistical information for all neighboring routes, $\Phi_r\left(\left\{s_{R,p}^{t_k} \mid R_p \in Cr_q\right\}\right)$, should be involved to represent the general status of the crossing routes. Moreover, we add additional information when agents arrive at transfer terminals, that is, $\Phi_n\left(\left\{s_{R,p}^{t_k} \mid R_p \in Rt_i\right\}\right)$, where $Rt_i$ is the set of routes that pass through terminal $P_i$. We concatenate all the information above with $s_T$ as the diplomatic state $s_D$.

To encourage cooperation, we extend the reward by considering cross-route shortage. For an agent $V_j$ on a route $R_q$, its action not only influences the reward on its own route, but also influences the reward of agents on the neighboring routes in $Cr_q$, especially at the transfer terminals where routes intersect. To take neighboring routes into consideration, we use $r_D = \alpha r_I + (1 - \alpha) r_C$, where $\alpha$ is a soft hyper-parameter and

$r_C = f\left(\xi_1\left(\left\{C_i^{t'_k} \mid P_i \in R_p, R_p \in Cr_q\right\}\right)\right) - g\left(\xi_2\left(\left\{L_i^t \mid t_k \leq t \leq t'_k, P_i \in R_p, R_p \in Cr_q\right\}\right)\right)$,

for statistical functions or advanced models $\xi_1(\cdot)$ and $\xi_2(\cdot)$.

The three levels of cooperative metrics are illustrated in Figure 1. The whole cooperative MARL framework for resource balancing is shown in Algorithm 1. From Line 4 to 13, the agents interact with the environment through function calls and collect transition experiences. It should be emphasized that $GetState(S_{j,k}, P_i, V_j)$ refers to the process of constructing the state based on the current event $(P_i, V_j)$ and the global environment snapshot $S_{j,k}$. This snapshot contains complete information about the environment when the event is triggered. $GetDelayedReward(S_{j,k-1}, S_{j,k})$ refers to the process of calculating the delayed reward based on the shortages that happen between these two snapshots.
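As a concrete illustration, a self-awareness delayed reward in the shape of Eq. (1) can be sketched in Python. Here the bonus f follows the geometric-sum example from the footnote, while the loss g is unspecified in the text, so we assume (for illustration only) the identity:

```python
def f(x, beta=0.99):
    """Safety-stock bonus with diminishing marginal gain:
    f(x) = sum_{i=1}^{x} beta^i, computed via the geometric closed form."""
    return beta * (1.0 - beta ** x) / (1.0 - beta)

def delayed_reward_self(stock_at_next_arrival, shortages_between, g=float):
    """Self-awareness reward in the shape of Eq. (1):
    r_I = f(C_i^{t'_k}) - g(sum of shortages in [t_k, t'_k]).
    `g` defaults to the identity loss, an assumption of this sketch."""
    return f(stock_at_next_arrival) - g(sum(shortages_between))
```

With beta = 0.5, f(1) = 0.5 and f(2) = 0.75, so each extra unit of safety stock adds less bonus, which is the diminishing-marginal-gain property the text asks for.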
The detailed implementations of $GetState(\cdot)$ and $GetDelayedReward(\cdot)$ are determined by the adopted level of cooperative metric.

Algorithm 1
Cooperative MARL Framework
1: Initialize replay memory $D_j$ to capacity $M$ for each agent $V_j$
2: Initialize action-value function $Q_j$ with random weights $\theta_j$ for each agent $V_j$
3: for episode ← 1, 2, . . . do
4:   ResetEnvironment()
5:   while environment is not terminated do
6:     // $k$ means the $k$-th event of agent $V_j$
7:     $(P_i, V_j, k)$ ← WaitingEvent()
8:     $S_{j,k}$ ← GetEnvironmentSnapshot()
9:     $s_k$ ← GetState$(S_{j,k}, P_i, V_j)$
10:    $r_{k-1}$ ← GetDelayedReward$(S_{j,k-1}, S_{j,k})$
11:    StoreExperience$(D_j, (s_{k-1}, a_{k-1}, r_{k-1}, s_k))$
12:    $a_k$ ← $\epsilon$-Greedy$(\arg\max_a Q_j(s_k, a))$
13:    Execute$(P_i, V_j, a_k)$
14:   end while
15:   for $l$ ← 1, 2, . . . do
16:     for each $V_j$ in $V$ do
17:       Sample a batch of data $(s, a, r, s')$ from $D_j$
18:       Compute target $y \leftarrow r + \gamma \max_{a'} Q_j(s', a'; \theta_j)$
19:       Update the Q-network for agent $V_j$ as $\theta_j \leftarrow \theta_j - \eta \nabla_{\theta_j} \left(y - Q_j(s, a; \theta_j)\right)^2$
20:     end for
21:   end for
22: end for

EXPERIMENTS
To evaluate the effectiveness of our proposed approach, we conduct experiments on resource balancing in the scenario of ocean container transportation. In this task, resource balancing mainly corresponds to Empty Container Repositioning (ECR). In the rest of this section, we will first introduce the background of ECR, and then show the experimental results on part of a real ocean logistics network.
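The event-driven loop of Algorithm 1 can be sketched compactly in Python. This is a tabular toy with a small discrete action set (the paper trains Q-networks over a continuous action space), and the environment interface (`wait_event`, `delayed_reward`, `execute`) is our stand-in for WaitingEvent, the snapshot-based GetDelayedReward, and Execute:

```python
import random
from collections import defaultdict, deque

def train(env, n_agents, episodes=10, eps=0.1, gamma=0.99, lr=0.1,
          actions=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Tabular, discretized sketch of Algorithm 1.

    `env` is assumed to provide: reset(); a `done` flag; wait_event() -> (j, s),
    the next trigger event for agent j with its observed state s;
    delayed_reward(j), the reward accrued since agent j's previous event;
    and execute(j, a), which applies action a.  All names are ours.
    """
    Q = [defaultdict(float) for _ in range(n_agents)]        # Q_j(s, a)
    memory = [deque(maxlen=1000) for _ in range(n_agents)]   # replay memory D_j
    for _ in range(episodes):
        last = [None] * n_agents                             # (s_{k-1}, a_{k-1}) per agent
        env.reset()
        while not env.done:                                  # event-driven interaction
            j, s = env.wait_event()
            r = env.delayed_reward(j)                        # r_{k-1} between two events
            if last[j] is not None:
                memory[j].append((*last[j], r, s))           # (s_{k-1}, a_{k-1}, r_{k-1}, s_k)
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a_: Q[j][(s, a_)]))
            env.execute(j, a)
            last[j] = (s, a)
        for j in range(n_agents):                            # learning phase
            for s0, a0, r, s1 in random.sample(list(memory[j]), min(32, len(memory[j]))):
                target = r + gamma * max(Q[j][(s1, a_)] for a_ in actions)
                Q[j][(s0, a0)] += lr * (target - Q[j][(s0, a0)])
    return Q, memory
```

Agents on the same route would share one Q table (or network), matching the policy-sharing argument made for the agent-set design.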
As containers are the most important asset in the ocean logistics industry, resource balancing in this scenario corresponds to ECR, which is quite necessary since the SnD of empty containers is very unequal due to the world trade imbalance [20]. In particular, the goal of ECR is to reposition empty containers by container vessels sailing on pre-determined routes within ocean logistics networks to fulfill the dynamic transportation demand of ports. According to Asariotis et al. [1], the estimated cost of seaborne empty container repositioning was about 20 billion dollars in 2009, with 50 million empty container movements, which demonstrates the necessity of optimizing ECR in the ocean logistics industry. More formally, ports, container vessels, and predetermined routes for vessels correspond to terminals $P$, vehicles $V$, and routes $R$, respectively. External demands and supplies of empty containers for port $P_i$ at time slot $t$ correspond to $D_i^t$ and $S_i^t$, respectively. Nonetheless, there are several domain-specific features of the ECR problem.
Figure 2: The container transportation chain in the ECR problem. Blue lines indicate laden container flows and green lines indicate empty container flows. All flows are under the control of specific business logics in real logistics scenarios.
In the ECR problem, the external demands and supplies $D_i^t$ and $S_i^t$ are determined by transportation orders $O$, which are also external and dynamic. An order $o \in O$ is a tuple $(P_u, P_v, n, t_o)$, which denotes the departure port, destination port, amount of needed containers, and order time. The container transportation chain for orders can be described as follows, as also illustrated in Figure 2: when an order $(P_u, P_v, n, t_o)$ is placed at time slot $t_o$, the external demand of the departure port, $D_u^{t_o}$, is increased by $n$, which means $P_u$ needs to provide $n$ empty containers to fulfill the order at time slot $t_o$. If the order is fulfilled, cargoes are loaded into these empty containers, and they are transformed into laden containers waiting for vessels to transport them to the destination port $P_v$. Laden containers will be lifted onto the arriving vessel $V_j$ on route $R_k$ if $P_v \in R_k$. When the laden containers are discharged at the destination port $P_v$ at time slot $t'_o$, the cargoes in the laden containers are unloaded and these containers are returned to $P_v$ as empty containers at time slot $t'_o + t_{ret}$, in which $t_{ret}$ is a constant. Therefore, the external supply of the destination port, $S_v^{t'_o + t_{ret}}$, is increased by $n$. To summarize, the specification of the ECR problem is concluded as follows:
• Empty containers are reusable and circulate between ports as receptacles for cargoes;
• Laden containers and empty containers share the same vessel, i.e., the space for empty containers
$Cap_j^t$ for vessel $V_j$ changes dynamically depending on the amount of laden containers on the vessel;
• The whole order fails if not enough empty containers can be served from the departure port when the order is placed. The resource shortage $L_o$ for a single order $o$ is defined as $L_o = n$ when $n > C_u^{t_o}$, and $L_o = 0$ otherwise.

The first difficulty of OR-based methods for ECR is brought by the uncertainty of future SnD forecasts. As aforementioned, such uncertainty is mainly caused by multiple external dynamic factors, such as market changes, and is even aggravated by the inherent mutual dependency between the OR-based model and the future SnD forecast. Since typical OR-based methods generate action plans based on future SnD forecasts over a long time span, the severe uncertainty of long-term forecasts and ignorance of this inherent mutual dependency lead to poor performance of OR-based methods.

The second major difficulty is caused by certain business logic in the container transportation chain. A typical and important piece of business logic that is hard to model by OR-based methods corresponds to the state change of containers, i.e., from empty to laden and vice versa. To build constraints fully representing the SnD balance, an OR-based method must consider the state changes of containers. However, this is quite difficult in the real world, because these state changes are completely controlled by business operators and follow quite different rules according to different customers and regions. As a black box in the ECR problem, such business logic thus cannot be exactly modeled by traditional OR-based methods.

Footnote: Without loss of generality, we only deal with non-transshipment orders, that is, we suppose $P_u$ and $P_v$ are always within one route. A transshipment order can be viewed as multiple separated non-transshipment orders.
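The atomic per-order shortage rule above can be sketched as follows (a minimal illustration; the function and argument names are ours):

```python
def order_shortage(order_amount, empty_stock_at_departure):
    """Atomic order rule: if the departure port cannot serve all n empty
    containers when the order is placed, the whole order fails and the
    shortage L_o equals n; otherwise L_o = 0.  Partial fulfilment is
    not allowed, even when the gap is a single container."""
    if order_amount > empty_stock_at_departure:
        return order_amount
    return 0
```

This atomicity is exactly the property the OR baseline cannot preserve once orders are decomposed into aggregate SnD quantities.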
In the real world of container transportation, there are more pieces of business logic, e.g., regional policy regulation, which are likewise hard to model with OR-based methods.

To leverage OR-based methods, we have to relax the corresponding constraints and take an approximation approach in the baseline algorithms, including:
- The transportation of empty and laden containers is decoupled. The state changes of containers are pre-determined rather than dynamically decided by business logic (nonlinear and even black-boxed in real scenes), leading to simplified SnD prediction in the OR model by decomposing future order information.
- The atomicity of an order is not preserved. In our running example, the whole order fails if the amount of remaining empty containers is insufficient, even if the gap is very small. In OR models, this property cannot be guaranteed since orders are decomposed into SnD.
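To make the atomic order-handling rule concrete, the fulfillment and shortage computation can be sketched as follows; `Port`, `Order`, and `fulfill_order` are illustrative names, not part of the authors' simulator.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """Minimal port state: stock of empty containers C_i^t."""
    name: str
    empty: int = 0


@dataclass
class Order:
    """An order (P_u, P_v, n, t_o): departure, destination, size, time."""
    src: Port
    dst: Port
    n: int
    t: int


def fulfill_order(order: Order) -> int:
    """Try to serve an order atomically.

    Returns the shortage L_o: the whole order fails (L_o = n) when the
    departure port cannot provide all n empty containers; otherwise the
    empties become laden containers and L_o = 0.
    """
    if order.src.empty < order.n:      # n > C_u^{t_o}: atomic failure
        return order.n
    order.src.empty -= order.n         # empties turn into laden containers
    return 0


# usage: a port with 50 empties cannot serve a 60-container order at all
shanghai, tokyo = Port("Shanghai", empty=50), Port("Tokyo")
assert fulfill_order(Order(shanghai, tokyo, 60, t=0)) == 60
assert shanghai.empty == 50            # nothing was deducted on failure
assert fulfill_order(Order(shanghai, tokyo, 30, t=1)) == 0
assert shanghai.empty == 20
```

This is exactly the atomicity that the relaxed OR baselines give up when orders are decomposed into aggregate SnD quantities.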
In the following experiments, we extract a main ocean transportation network among Asia, North America, and Europe based on the real-world service loops of a commercial company. This network consists of 4 routes, 17 ports, and 31 vessels. The routes are listed as follows and illustrated in Figure 3:
- R1: Pacific Atlantic route, 94 days with 14 vessels.
- R2: Central Asia to Southeast Asia route, 60 days with 9 vessels.
- R3: Japan to America route, 33 days with 5 vessels.
- R4: Japan-China-Singapore route, 19 days with 3 vessels.

The vessels are uniformly distributed within their routes, with an interval of around one week. Initially, there are 3000 empty containers distributed among the 17 ports based on historical statistics from a commercial ocean logistics company, and all vessels are empty, carrying neither laden nor empty containers. The distribution of SnD of all 17 ports in the simulated environment, shown in Figure 4, is based on information provided by the same company. Every vessel has a capacity of 200 containers, i.e., the total amount of laden and empty containers cannot exceed 200 for any vessel. To assist the training of our cooperative MARL approach, we build a simulated
Figure 3: The extracted ocean transportation network among Asia, North America and Europe.
Figure 4: The distribution of demand and supply of all 17 ports in the environment.
ECR environment based on real historical data from the commercial ocean logistics company.

To measure the performance of our approach, we use the metric of fulfillment ratio, defined as the ratio of successfully fulfilled containers to all containers requested in one episode (400 time steps, where one time step corresponds to one day). In the real world, there are many other types of cost for container repositioning, including loading/discharging cost, storage cost, etc. Among all of them, however, the cost of shortage, measured by the fulfillment ratio, is the dominant one, since it directly affects booking acceptance and consequently the transportation company's reputation. Therefore, we focus on minimizing the cost of shortage in this work. Other types of cost can also be naturally captured by MARL through reward shaping and specific action-space design, which will be one of our future targets.
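As a minimal sketch (assuming each order in an episode is summarized by its requested amount and its shortage $L_o$), the fulfillment ratio can be computed as:

```python
def fulfillment_ratio(orders):
    """Fulfillment ratio over one episode.

    orders: iterable of (requested, shortage) pairs, one per order.
    Returns fulfilled containers / requested containers.
    """
    orders = list(orders)
    requested = sum(n for n, _ in orders)
    fulfilled = sum(n - shortage for n, shortage in orders)
    return fulfilled / requested if requested else 1.0


# e.g. two orders of 100 containers; one fails atomically (shortage 100)
assert fulfillment_ratio([(100, 0), (100, 100)]) == 0.5
```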
In the following experiments, we compare the following methods on the ECR problem:
- No Reposition: Empty containers are never repositioned. The flow of containers depends only on laden container transportation.
- Rule-Based Inventory Control (IC): Following inventory management theory, this method sets two inventory thresholds, a safety threshold $F_i^s$ and an excess threshold $F_i^e$ ($F_i^s \le F_i^e$), for each port $P_i$ based on the historical information of SnD. When a vessel $V_j$ arrives at $P_i$ at time slot $t$, it tries to keep the stock $C_i^t$ within the range $[F_i^s, F_i^e]$ by loading or discharging containers. Formally, let $x_{i,j}^t$ be the number of containers loaded from $P_i$ (a negative value means discharging to this port); then
$$x_{i,j}^t = \begin{cases} \min(C_i^t - F_i^e,\; Cap_j^t - C_{V,j}^t,\; C_i^t), & C_i^t > F_i^e, \\ -\min(F_i^s - C_i^t,\; C_{V,j}^t), & C_i^t < F_i^s, \\ 0, & \text{otherwise.} \end{cases}$$
- Online Linear Programming (LP): With the approximation approaches mentioned above, the ECR problem can be modeled as a linear program (LP) by adopting the mathematical definitions in the problem statement section. However, it is hard to apply the solution directly due to the gap caused by the simplified model. Here, we apply the rolling horizon policy described in Long et al. [12]: an empty repositioning plan is generated for a long period on the planning horizon based on the LP model with forecasting information for this period, but only the initial part of the plan is executed; this procedure is repeated until termination. This is the so-called online LP method.
Note that our proposed end-to-end MARL method directly interacts with the simulator with no explicit forecasting stage. Therefore, for an appropriate comparison, we use exact future order information in place of the forecasted future demand in the LP model, so as to eliminate the effects of external factors leading to uncertain forecasts; this can be seen as a relatively ideal condition. More details about the online LP can be found in the appendix.
- Online LP with Inventory Control: In this baseline, we adopt the idea from Epstein et al. [5], which combines the LP model with inventory control. This method sets a safety threshold $F_i^s$ for each port $P_i$ based on the historical information of SnD, and then constrains $L_i^t = \max\left(D_i^t - (C_i^{t-} - F_i^s),\; 0\right)$.
- Self Awareness MARL (SA-MARL): This is the MARL model described in the previous section with self-awareness agents. For the terminal (port) state $s_{P,i}^{t_k}$, $\phi(\cdot)$ is an average function while $\psi(\cdot)$ is a sum function. For the vehicle (vessel) state $s_{V,i}^{t_k}$, we add the amount of laden containers on board as additional domain-specific information. As for the reward, we set $f(x)$ to be a linear function of $x$ with a negative coefficient and $g(y) = y$, where $x$ and $y$ are calculated as in Equation (1).
- Territorial Awareness MARL (TA-MARL): This is the MARL model with territorial-awareness agents. For successive terminal information, both $m$ and $n$ are set to 1. $\Phi(\cdot)$ and $\Psi(\cdot)$ in $s_{R,q}^{t_k}$ are set to be average functions.
- Diplomatic Awareness MARL (DA-MARL): This is the MARL model described in the previous section with diplomatic-awareness agents. $\Phi_r(\cdot)$ and $\Phi_n(\cdot)$ are set to be average functions with $\alpha = 0.5$. Both $\xi_r(\cdot)$ and $\xi_n(\cdot)$ are 2-layer average functions $\mathrm{Avg}\{\mathrm{Avg}\{\sum_{t=t_k}^{t'_k} L_i^t \mid P_i \in R_p\} \mid R_p \in Cr_q\}$.
- Offline Optimal LP (Upper Bound): In this case, the shortage is directly calculated as the objective by the LP model mentioned above, which has knowledge of all orders in advance, without implementation in the simulated environment. This can be seen as an upper bound for the problem.

Table 1: Performance comparison with different baselines.
Method | 80% Container | 100% Container | 150% Container
No Reposition | 26.58 ± … | … | …
Rule-Based Inventory Control (IC) | … | … | …
Online LP | … | … | …
Online LP with IC | … | … | …
SA-MARL | … | … | …
TA-MARL | … | … | …
DA-MARL | … | … | …
Offline LP (Upper Bound) | 98.32 ± … | … | …
(Fulfillment ratio in %, mean ± standard deviation; most numeric entries of this table are illegible in this copy.)

Table 2: Performance comparison with different delay parameter k in DA-MARL

k | Fulfillment Ratio (%)
1 | 95.87 ± …
… | …
(Only the k = 1 entry is legible in this copy.)

All MARL models are trained with ϵ-greedy exploration. ϵ is annealed linearly from 0.5 to 0.01 over the first 8000 episodes and is fixed at 0.01 in the remaining episodes. We use the Adam optimizer, and the batch size is fixed to 32. All agents on the same route share the same Q-network, and each Q-network is parameterized by a 2-layer MLP with 16 nodes per layer, activated by ReLU. Since DQN works on a discrete action space, we discretize the continuous action space $A_i = [-1, 1]$ uniformly into 21 actions, i.e., $A'_i = \{-1, -0.9, \cdots, 0.9, 1\}$.

Figure 5: (a) Convergence comparison of MARL methods; the X-axis is the number of episodes. (b) Performance comparison with different α in Diplomatic Awareness MARL; the X-axis is α. The Y-axis is the fulfillment ratio in both figures.

To compare all the methods aforementioned, we run our trained models and baseline methods in 100 randomly initialized environments. For the baseline methods, we run a grid search to find suitable parameters. To test the robustness of the learned policy in our framework, we also evaluate the model trained under the 100% (3000) empty container setting after changing the total amount of containers to 80% (2400 containers) and 150% (4500 containers). The results are summarized in Table 1, in which we report the means and standard deviations of the fulfillment ratios. As we can see, the DA-MARL method achieves the best performance in all initial container settings. Even the TA-MARL method is comparable with the traditional online LP method. SA-MARL achieves the poorest performance among our MARL methods, though it is still better than rule-based inventory control. The robustness test shows that the agents have learned policies efficient enough to deal with dramatic environment changes: the trained DA-MARL model still performs better than the online LP and its IC version, even though the latter are in fact built on the changed environments. The convergence comparison of the MARL methods is shown in Figure 5a.
Each MARL method is trained 10 times, and we report the mean and standard deviation of performance during training. As we can see, all MARL methods converge very quickly in the first 1000 episodes. After that, DA-MARL achieves a much larger improvement than the others.

In Diplomatic Awareness MARL, α is an important parameter that controls the proportion between the territorial reward and the diplomatic reward. We train the model with different values of α, and the results are shown in Figure 5b. Every model is trained 5 times due to the time limitation, and every trained model is tested 100 times. The results show that neither $r_I$ alone ($\alpha = 1$) nor $r_C$ alone ($\alpha = 0$) achieves the best performance, which is instead obtained at an intermediate value such as $\alpha = 0.5$.

The information aggregated by $\Phi_r(\cdot)$ and $\Phi_n(\cdot)$ about neighboring routes and transshipment routes is required to achieve high performance. However, it is possible that this information cannot be transferred in real time in realistic scenarios, i.e., agents may only have access to an outdated version of it. Table 2 shows the fulfillment ratio when all agents can only access this information from k days ago. The results show that our proposed method performs robustly, without significant loss, as long as the delay k stays within a reasonable range.

The major objective of ECR is to balance the SnD so that the shortage costs of deficit ports are minimized. Figure 6a shows the amount of imported empty containers at Shekou and Thailand, two major ports that are deficient in empty containers, under different methods. From Figure 3, Thailand is the next port after the surplus port Singapore on route R2, which means it is not hard for it to obtain empty containers without a complicated cooperative mechanism. For Shekou, the situation is much more severe, as it needs more containers than Thailand (shown in Figure 4) while the only supply port on route R4, Singapore, does not have enough containers to supply it. The only way the demand of Shekou can be sufficiently fulfilled is to use Tokyo and Kobe as transshipment ports to transport empty containers from the America region, which requires a strong ability to cooperate between regions. Figure 6a shows that all three MARL methods perform well on Thailand, while Diplomatic Awareness MARL outperforms all other methods on Shekou, indicating that our design is capable of fulfilling demand that requires inter-route cooperation.

For inter-route cooperation, the amount of exported empty containers at transshipment ports is essential, since they are the source from which deficient ports such as Shekou obtain empty containers. Figure 6b shows the amount of exported empty containers at Singapore, Tokyo and Kobe, the three major transshipment ports between different routes in our setting. It shows that the amount of exported empty containers at transshipment ports increases significantly with more cooperative awareness of the MARL agents, which indicates that our cooperative design is effective. The online LP method and its IC version can also perform well at transshipment ports since they are globally optimized; however, the gap between the LP models and the environment confines their overall performance.
Figure 6: (a) Imported empty containers of Shekou and Thailand, two major ports that are deficient in empty containers, by different methods. (b) Exported empty containers of Singapore, Tokyo and Kobe, three major transshipment ports between different routes, by different methods. The "No Reposition" method is omitted since it neither imports nor exports any empty containers.
In this paper, we first formulate the resource balancing problem in logistics networks as a stochastic game. Given this setting, we propose a cooperative multi-agent reinforcement learning framework in which three levels of cooperative metrics, identified based on the scope of agents' awareness of cooperation, promote efficient and cost-effective transportation. Extensive experiments on a simulated ocean transportation service demonstrate that our new approach can stimulate cooperation among agents and give rise to a significant improvement in terms of both performance and stability. In the future, we will integrate more types of cost in real logistics scenarios, such as transport cost and inventory cost, into a unified objective to optimize. Moreover, we will investigate more advanced RL techniques to achieve a more precise control of actions.
ACKNOWLEDGEMENT
We sincerely appreciate Ryan Ho, Johnson Lui, Karab Sze, Jeffrey Ko, Simon Choi, Tony Y Li, Apple Ng, Terry Tam and Wyatt Lei from Orient Overseas Container Line for their great support of this work.
A APPENDIX

A.1 Linear Programming Model for the ECR Problem
The linear programming model is given by:
$$\min \sum_{P_i \in P,\; t \in \mathrm{Event}(P_i)} L_i^t \quad (2)$$
subject to
$$C_{P,i}^t = C_{P,i}^{\mathrm{prev}(P,i,t)} - D_i^t + S_i^t - \sum_{j=1}^{|V|} I(i,j,t)\, x_j^t, \quad (3)$$
$$L_i^t \ge D_i^t - C_{P,i}^{\mathrm{prev}(P,i,t)}, \quad (4)$$
$$L_i^t \ge 0, \quad (5)$$
for $P_i \in P$, $t \in \mathrm{Event}(P_i)$;
$$C_{V,j}^t = C_{V,j}^{\mathrm{prev}(V,j,t)} + x_j^t, \quad (6)$$
$$0 \le C_{V,j}^t \le Cap_j^t, \quad (7)$$
for $V_j \in V$, $t \in \mathrm{Event}(V_j)$.

$\mathrm{Event}(\cdot)$ denotes the set of time slots at which an event is predicted to be triggered for the argument, which can be inferred from $V$, $R$ and the duration function $d_j(\cdot, \cdot)$. The indicator function $I(i,j,t)$ can be inferred in a similar manner. $\mathrm{prev}(P,i,t)$ ($\mathrm{prev}(V,j,t)$) denotes the previous time slot at which an event was triggered at port $P_i$ (on vessel $V_j$). The external demand $D_i^t$ and supply $S_i^t$ for each port $P_i \in P$ are provided by an external forecast model. In the ECR problem, $Cap_j^t$ changes dynamically according to the amount of laden containers on $V_j$ at time slot $t$. For an order-based forecast model, i.e., a model that forecasts the future order set $O$ first and calculates the predicted $D_i^t$ and $S_i^t$ from $O$, $Cap_j^t$ can also be computed from $O$ under the assumption that all external demand $D_i^t$ is fulfilled (so that the amount of laden containers on each vessel at each time slot can be estimated). The LP model is solved by the GNU Linear Programming Kit (GLPK) as an integer program.

A.2 Details of Simulated ECR Environment
A.2.1 Route Schedule.
The schedule of each route is shown in Tables 3, 4, 5 and 6 respectively, based on information provided by the same commercial company mentioned in the experiment section. All the routes are cyclic. To achieve a uniform distribution of vessels on each route, vessels are not required to berth at a certain port when the environment is initialized.
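A hypothetical sketch of how such a cyclic transit-day table can seed uniformly spaced vessels on a route; the `R4` dict and `initial_offsets` are illustrative names, not the authors' code.

```python
# Route R4 as a cyclic schedule: (port code, transit day from the first call),
# cycling every 19 days, as in its schedule table.
R4 = {
    "cycle_days": 19,
    "calls": [("TKY", 0), ("KOY", 2), ("KHH", 5), ("HKG", 6), ("SKZ", 7),
              ("SIN", 11), ("SKZ", 14), ("HKG", 15), ("KHH", 16)],
}


def initial_offsets(route, n_vessels):
    """Spread n_vessels uniformly over the route's cycle (in days)."""
    return [round(i * route["cycle_days"] / n_vessels) for i in range(n_vessels)]


offsets = initial_offsets(R4, 3)   # 3 vessels on the 19-day route R4
assert offsets == [0, 6, 13]       # roughly one-week headway, as in the text
```

Because vessels need not berth at a fixed port at initialization, an offset here is simply a position along the cycle, in days.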
A.2.2 Business Logic of Port-Vessel Interaction.
When an event $(P_i, V_j)$ is triggered, i.e., a vessel $V_j$ (on route $R_k$) arrives at a port $P_i$, our simulated environment follows a 4-stage business logic to execute an action $a \in [-1, 1]$:
(1) Laden container discharge: all laden containers on $V_j$ with destination port $P_i$ are discharged from the vessel. Notice that $Cap_j^t$ increases to $Cap'^{t}_j$ due to the decrease of laden containers on the vessel;
(2) (if $a < 0$) Empty container discharge: $[-a \cdot C_{V,j}^t]$ empty containers on $V_j$ are discharged from the vessel;

Table 3: Route schedule of R1

Port | Region/City | Transit day
STN | Europe Union | -
NYC | New York | 15
SAV | Savannah | 18
LAS | Los Angeles | 31
OAK | Oakland | 32
YOK | Yokohama | 44
SHA | Shanghai | 47
KOY | Kobe | 51
TKY | Tokyo | 52
OAK | Oakland | 67
LAS | Los Angeles | 68
SAV | Savannah | 82
NYC | New York | 85
STN | Europe Union | 94
Table 4: Route schedule of R2
Port | Region/City | Transit day
JEB | Arab | -
SIN | Singapore | 3
LCB | Thailand | 6
YAT | Yantian | 9
LAS | Los Angeles | 26
OAK | Oakland | 28
SHA | Shanghai | 43
NIN | Ningbo | 44
YAT | Yantian | 46
SIN | Singapore | 51
JEB | Arab | 60
Table 5: Route schedule of R3
Port | Region/City | Transit day
KOY | Kobe | -
TKY | Tokyo | 3
LAS | Los Angeles | 17
OAK | Oakland | 18
TKY | Tokyo | 31
KOY | Kobe | 33

(3)
Laden container loading: laden containers at $P_i$ with a destination port on $R_k$ are loaded onto the vessel as much as possible, in order of received date. Laden containers in the same order can be transported separately. Similarly, $Cap'^{t}_j$ decreases to $Cap''^{t}_j$ due to the increase of laden containers on the vessel;
(4) (if $a > 0$) Empty container loading: $[a \cdot \min(Cap''^{t}_j - C_{V,j}^t,\; C_{P,i}^t)]$ empty containers are loaded onto the vessel.

Table 6: Route schedule of R4
Port | Region/City | Transit day
TKY | Tokyo | -
KOY | Kobe | 2
KHH | Taiwan | 5
HKG | Hong Kong | 6
SKZ | Shekou | 7
SIN | Singapore | 11
SKZ | Shekou | 14
HKG | Hong Kong | 15
KHH | Taiwan | 16
TKY | Tokyo | 19

Here $[\cdot]$ denotes the nearest-integer function. In this business logic, laden container transportation has priority over empty container repositioning, which conforms to the real-world scenario in ocean container transport logistics.
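The empty-container stages of this 4-stage logic, (2) and (4), can be sketched as follows, assuming the laden stages (1) and (3) have already updated the vessel's remaining free capacity; the function and parameter names are illustrative, not the authors' simulator code.

```python
def execute_action(a, port_empty, vessel_empty, free_slots):
    """One port-vessel arrival event, stages (2) and (4) only.

    a in [-1, 1]:  fraction of empties to discharge (a < 0) or load (a > 0).
    port_empty:    empty containers stocked at the port (C_P).
    vessel_empty:  empty containers on board (C_V).
    free_slots:    capacity left after laden handling (Cap'' - C_V).
    Returns updated (port_empty, vessel_empty); [.] is nearest-integer.
    """
    if a < 0:                                   # stage (2): discharge empties
        moved = round(-a * vessel_empty)        # [-a * C_V]
        return port_empty + moved, vessel_empty - moved
    if a > 0:                                   # stage (4): load empties
        moved = round(a * min(free_slots, port_empty))  # [a * min(Cap''-C_V, C_P)]
        return port_empty - moved, vessel_empty + moved
    return port_empty, vessel_empty


# e.g. a = 0.5 loads half of min(free slots, port stock) onto the vessel
assert execute_action(0.5, 120, 10, 60) == (90, 40)
```

Capping the loaded amount by both the free vessel slots and the port stock is what enforces laden priority: empties can only occupy capacity left over after stages (1) and (3).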
A.3 Regional Statistics
The regional statistics of the seven methods in the experiment section are listed in Tables 7, 8, 9, 10, 11, 12 and 13 respectively. All methods are tested 100 times, and we report the averages in the tables.
REFERENCES
[1] Regina Asariotis, Hassiba Benamara, Hannes Finkenbrink, Jan Hoffmann, Jennifer Lavelle, Maria Misovicova, Vincent Valentine, and Frida Youssef. 2011. Review of Maritime Transport, 2011. Technical Report.
[2] Teodor Gabriel Crainic and Gilbert Laporte. 1997. Planning models for freight transportation. European Journal of Operational Research 97, 3 (1997), 409–438.
[3] Sam Devlin and Daniel Kudenko. 2011. Theoretical Considerations of Potential-based Reward Shaping for Multi-agent Systems. In The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 1 (AAMAS '11). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 225–232. http://dl.acm.org/citation.cfm?id=2030470.2030503
[4] Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. 2014. Potential-based Difference Rewards for Multiagent Reinforcement Learning. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems (AAMAS '14). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 165–172. http://dl.acm.org/citation.cfm?id=2615731.2615761
[5] Rafael Epstein, Andres Neely, Andres Weintraub, Fernando Valenzuela, Sergio Hurtado, Guillermo Gonzalez, Alex Beiza, Mauricio Naveas, Florencio Infante, Fernando Alarcon, Gustavo Angulo, Cristian Berner, Jaime Catalan, Cristian Gonzalez, and Daniel Yung. 2012. A Strategic Empty Container Logistics Optimization in a Major Shipping Company. Interfaces 42, 1 (Feb. 2012), 5–16. https://doi.org/10.1287/inte.1110.0611
[6] Amy Greenwald and Keith Hall. 2003. Correlated-Q Learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML '03). AAAI Press, 242–249. http://dl.acm.org/citation.cfm?id=3041838.3041869
[7] Junling Hu and Michael P. Wellman. 2003. Nash Q-learning for General-sum Stochastic Games. J. Mach. Learn. Res.
[8] […]. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS '17). Curran Associates Inc., USA, 4193–4206. http://dl.acm.org/citation.cfm?id=3294996.3295174
[9] Jing-An Li, Stephen CH Leung, Yue Wu, and Ke Liu. 2007. Allocation of empty containers between multi-ports. European Journal of Operational Research.
[…]. In Proceedings of the Eighteenth International Conference on Machine Learning

Table 7: Regional Statistic of No Reposition Method
Region/City | Total Containers | Failed Containers | Imported Laden Containers | Imported Empty Containers | Exported Laden Containers | Exported Empty Containers | Fulfillment Ratio
Shanghai | 400.26 | 0 | 343.61 | 0 | 400.26 | 0 | 1
Ningbo | 0 | 0 | 136.56 | 0 | 0 | 0 | /
Yantian | 601.08 | 405.24 | 134.25 | 0 | 195.84 | 0 | 0.325814
Shekou | 8010.75 | 7124.58 | 785.03 | 0 | 886.17 | 0 | 0.110623
Thailand | 5008.44 | 4719.10 | … | 0 | 289.34 | 0 | 0.05777
Singapore | 797.06 | 0 | 860.58 | 0 | 797.06 | 0 | 1
Arab | 0 | 0 | 0 | 0 | 0 | 0 | /
Hong Kong | 3991.25 | 2501.22 | 1129.67 | 0 | 1490.03 | 0 | 0.373324
Taiwan | 1403.97 | 944.97 | 296.11 | 0 | 459 | 0 | 0.32693
Tokyo | 2200.88 | 1038.99 | 866.… | 0 | 1161.89 | 0 | 0.527921
Kobe | 3010.10 | 1881.84 | 980.57 | 0 | 1128.26 | 0 | 0.374825
Yokohama | 0 | 0 | 289.72 | 0 | 0 | 0 | /
Oakland | 199.28 | 6.12 | 805.74 | 0 | 193.16 | 0 | 0.969289
Los Angeles | 1403.94 | 630.41 | 762.26 | 0 | 773.53 | 0 | 0.550971
Savannah | 1002.21 | 534.10 | … | 0 | 468.11 | 0 | 0.467078
New York | 200.86 | 0 | 445.… | 0 | 200.86 | 0 | 1
EU | 1003.48 | 505.85 | 279.56 | 0 | 497.63 | 0 | 0.495904
Total | 29233.56 | 20292.42 | 8681.21 | 0 | 8941.14 | 0 | 0.299626
Table 8: Regional Statistic of Inventory Control Method
Region/City | Total Containers | Failed Containers | Imported Laden Containers | Imported Empty Containers | Exported Laden Containers | Exported Empty Containers | Fulfillment Ratio
Shanghai | 397.12 | 6.81 | 600.44 | 52.13 | 390.31 | 589.51 | 0.982852
Ningbo | 0 | 0 | 208.77 | 0 | 0 | 352.35 | /
Yantian | 597.89 | 9.01 | 946.32 | 109.82 | 588.88 | 507.… | 0.98493
Shekou | 8007.37 | 6068.76 | 1321.94 | 536.18 | 1938.61 | 0 | 0.242103
Thailand | 4994.01 | 2303.97 | 108.97 | 2470.64 | 2690.04 | 28.… | 0.538653
Singapore | 801.94 | 1.28 | 1602.52 | 108.31 | 800.66 | 1314.41 | 0.998404
Arab | 0 | 0 | 0 | 0 | 0 | 269 | /
Hong Kong | 3998.32 | 1817.13 | 2051.84 | 6.09 | 2181.19 | 231.21 | 0.545527
Taiwan | 1402.76 | 671.10 | ….05 | 47.26 | 731.66 | 111.99 | 0.521586
Tokyo | 2177.95 | 4.20 | ….37 | 1006.98 | 2173.75 | 431.89 | 0.998072
Kobe | 2969.29 | 6.64 | 1526.76 | 1526.89 | 2962.65 | 152.97 | 0.997764
Yokohama | 0 | 0 | 463.89 | 0 | 0 | 535.82 | /
Oakland | 199.29 | 5.74 | 2225.95 | 42.91 | 193.55 | 2067.45 | 0.971198
Los Angeles | 1397.14 | 41.29 | 2071.77 | 127.65 | 1355.85 | 804.18 | 0.970447
Savannah | 998.67 | 7.90 | ….59 | 275.88 | 990.77 | 131.33 | 0.992089
New York | 200.87 | 4.01 | 862.88 | 43.47 | 196.86 | 745.36 | 0.980037
EU | 994.69 | 48.39 | 459.47 | 481.44 | 946.30 | … | 0.951352
Total | 29137.31 | 10996.23 | 17354.53 | 6835.65 | 18141.08 | 8451.07 | 0.61214

(ICML '01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 322–328. http://dl.acm.org/citation.cfm?id=645530.655661
[12] Yin Long, Loo Hay Lee, and Ek Peng Chew. 2012. The sample average approximation method for empty container repositioning with uncertainties.
European Journal of Operational Research.
[13] […]. In Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS 2016) (2016).
[14] Kunal Menda, Yi-Chun Chen, Justin Grana, James W. Bono, Brendan D. Tracey, Mykel J. Kochenderfer, and David Wolpert. 2017. Deep Reinforcement Learning for Event-Driven Multi-Agent Decision Processes. arXiv:1709.06656 [cs] (Sept. 2017). http://arxiv.org/abs/1709.06656
[15] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. 2015. Human-level control through deep reinforcement learning.
Nature
[16] […]. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML '99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 278–287. http://dl.acm.org/citation.cfm?id=645528.657613
[17] Ling Pan, Qingpeng Cai, Zhixuan Fang, Pingzhong Tang, and Longbo Huang. 2018. Rebalancing Dockless Bike Sharing Systems. arXiv:1802.04592 [cs] (Feb. 2018). http://arxiv.org/abs/1802.04592
[18] Warren B Powell. 1996. Toward a Unified Modeling Framework for Real-Time Logistics Control.
Military Operations Research (1996), 69–79.

Table 9: Regional Statistic of Online LP Method
Region/City | Total Containers | Failed Containers | Imported Laden Containers | Imported Empty Containers | Exported Laden Containers | Exported Empty Containers | Fulfillment Ratio
(Per-region rows of this table are illegible in this copy; only the totals survive in part.)
Total | 29060.95 | 3950.34 | 24160.… | … | 25110.61 | 22062.55 | 0.859485
Table 10: Regional Statistic of Online LP with IC Method
Region/City | Total Containers | Failed Containers | Imported Laden Containers | Imported Empty Containers | Exported Laden Containers | Exported Empty Containers | Fulfillment Ratio
(Per-region rows of this table are illegible in this copy; only the totals survive in part.)
Total | 29041.50 | 3086.79 | 24953.17 | 21852.89 | 25954.71 | 22469.85 | 0.889923

[19] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search.
Nature
[20] […]. In Handbook of Ocean Container Transport Logistics. Springer, Cham, 163–208. https://doi.org/10.1007/978-3-319-11891-8_6
[21] UNCTAD. 2017.
Review of Maritime Transport 2017. OCLC: 1022725798.
[22] Xiaofeng Wang and Tuomas Sandholm. 2002. Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games. In
Proceedings of the 15th International Conference on Neural Information Processing Systems (NIPS '02). MIT Press, Cambridge, MA, USA, 1603–1610. http://dl.acm.org/citation.cfm?id=2968618.2968817
[23] Zhe Xu, Zhixin Li, Qingwen Guan, Dingshui Zhang, Qiang Li, Junxiao Nan, Chunyang Liu, Wei Bian, and Jieping Ye. 2018. Large-Scale Order Dispatch in On-Demand Ride-Hailing Platforms: A Learning and Planning Approach. In
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 905–913. https://doi.org/10.1145/3219819.3219824

Table 11: Regional Statistic of Self Awareness MARL Method
Region/City | Total Containers | Failed Containers | Imported Laden Containers | Imported Empty Containers | Exported Laden Containers | Exported Empty Containers | Fulfillment Ratio
(Per-region rows of this table are illegible in this copy; only the totals survive in part.)
Total | 29100.93 | 8393.48 | 19798.79 | 14629.58 | 20707.45 | 15942.07 | 0.702295
Table 12: Regional Statistic of Territorial Awareness MARL Method
Region/City | Total Containers | Failed Containers | Imported Laden Containers | Imported Empty Containers | Exported Laden Containers | Exported Empty Containers | Fulfillment Ratio
(Per-region rows of this table are illegible in this copy; only the totals survive in part.)
Total | 29045.42 | 4648.30 | ….73 | 17221.… | 24397.12 | 18422.29 | 0.834133

Table 13: Regional Statistic of Diplomatic Awareness MARL Method
Region/City | Total Containers | Failed Containers | Imported Laden Containers | Imported Empty Containers | Exported Laden Containers | Exported Empty Containers | Fulfillment Ratio
(Per-region rows of this table are illegible in this copy; only the totals survive in part.)
Total | 29050.69 | 1133.93 | 26822.25 | 20402.89 | 27916.76 | 20921.26 | 0.…