Fast Approximate Solutions using Reinforcement Learning for Dynamic Capacitated Vehicle Routing with Time Windows
Nazneen N Sultana, Vinita Baniwal, Ansuma Basumatary, Piyush Mittal, Supratim Ghosh, Harshad Khadilkar
TCS Research, Mumbai, India; IIT Bombay, Mumbai, India

February 25, 2021
Abstract
This paper develops an inherently parallelised, fast, approximate learning-based solution to the generic class of Capacitated Vehicle Routing with Time Windows and Dynamic Routing (CVRP-TWDR). Considering vehicles in a fleet as decentralised agents, we postulate that reinforcement learning (RL) based adaptation is a key enabler for real-time route formation in a dynamic environment. The methodology allows each agent (vehicle) to independently evaluate the value of serving each customer, and uses a centralised allocation heuristic to finalise the allocations based on the generated values. We show that the solutions produced by this method on standard datasets are significantly faster than exact formulations and state-of-the-art meta-heuristics, while being reasonably close to optimal in terms of solution quality. We describe experiments in both the static case (when all customer demands and time windows are known in advance) as well as the dynamic case (where customers can 'pop up' at any time during execution). The results with a single trained model on large, out-of-distribution test data demonstrate the scalability and flexibility of the proposed approach.
1 Introduction

The Vehicle Routing Problem (VRP) is a well-known NP-hard problem in combinatorial optimisation (Lenstra and Kan, 1981). The most basic version involves computation of the optimal route for a single vehicle or multiple identical vehicles, given the nodes to be visited. If there is a single vehicle with no constraints in terms of capacity or fuel/endurance, and optimality is defined by the length of the route (optimality could equally be defined by minimum distance, minimum time, or other metrics), the problem is equivalent to the travelling salesman problem (TSP) (Bellman, 1962).

However, there are numerous versions of the vehicle routing problem that make it more realistic for practical use. The obvious constraints include a limit on the volume/weight of load carried (capacitated VRP), and on the distance travelled in a single trip before the vehicle must return to its depot. Further complexities include versions with multiple vehicles, time windows for visiting each node, dynamically generated demand, and several other options. In particular, there appear to be two basic variants of the problem (Psaraftis, 1988; Li et al., 2009): static routing, where the demands do not change during computation of the solution or its execution, and dynamic routing, where such demands can change at any time. We are interested in the capacitated vehicle routing problem with time windows and dynamic routing (CVRP-TWDR), due to its relevance to real-world problems, as well as the time-constrained nature of the problem.

Learning-based techniques provide a good solution for the twin problems of scale and dynamicity. Reinforcement learning (RL) has been applied to many hard problems, such as Atari (Mnih et al., 2013), Go (Silver et al., 2017), recommender systems, and robotics. Recently, studies have begun to develop solutions for combinatorial optimisation (CO) problems using machine learning (Bengio et al., 2020). A few studies have considered simpler versions of VRP using reinforcement learning, which we survey in Section 2. This is our motivation for using RL in much more complex and realistic versions of VRP. The hypothesis is that RL will provide solutions significantly faster than exact methods and meta-heuristics, enabling its use in real-time applications, while the solution quality remains reasonably close to optimal.

The precise version of the problem considered in this paper is described in Section 3. We compare performance on standard data sets both when customer demands and locations are known in advance (capacitated vehicle routing problem with time windows, or CVRP-TW), and when customer demands can arrive at arbitrary times during execution (CVRP-TW with dynamic routing, or CVRP-TWDR). For illustration of the method, the description in Section 4 focusses on CVRP-TW, but it is easy to see that the same procedure works for CVRP-TWDR. In essence, we consider the problem as a Markov Decision Process (MDP) where each vehicle independently evaluates the value of serving each available customer at each time step. A centralised allocation heuristic collects the generated values, and maps vehicles to customers. This process is repeated until all customers are served, or no further service can be rendered. In Section 5, we compare our method with existing state-of-the-art solutions not only in terms of solution quality but also in terms of runtime.
The variants that we include in this paper are:
• Vehicle capacity in terms of load and range
• Arbitrary number of customers and locations, without retraining for each instance
• Arbitrary number of vehicles
• Customer service time windows
• Dynamic arrival of demand at arbitrary locations

Finally, we note that our method is amenable to parallelised computation of values, making it scalable to a vast number of vehicles and nodes. In addition, we are able to solve the problem incrementally, which allows for initial decisions to be implemented while the rest of the route is being planned, and also for dynamic updates to be made to customer demands, travel times, and vehicle availability.
2 Related Work

Given the fundamental nature of the vehicle routing problem and its formulation in exact mathematical programming terms (Laporte, 1992; Kara et al., 2007), a vast amount of prior literature exists in this area. In this section, we focus on recent studies that show promise in terms of scalability and the ability to handle dynamic variations during execution.

The capacitated VRP with time windows has traditionally been solved using linear programming, heuristics, and meta-heuristic algorithms such as neighbourhood search (Bräysy, 2003), genetic algorithms (Prins, 2004; Ombuki et al., 2006), tabu search (Lau et al., 2003), and ant colony optimization (Gambardella et al., 1999). Meta-heuristics are a popular way of solving combinatorial problems. A genetic algorithm based solution to a dynamic vehicle routing problem with time-dependent travel times is discussed in (Haghani and Jung, 2005). The problem is a pick-up or delivery vehicle routing problem with soft time windows, and considers multiple vehicles with different capacities, real-time service requests, and real-time variations in travel times between demand nodes. However, while meta-heuristic approaches can handle dynamic rerouting from the point of formulation, their response time characteristics are unclear. As a representative of this class, we use genetic algorithms as a baseline in Section 5.

Heuristic approaches have excellent response times for the dynamic problem. A Lagrangian relaxation based heuristic obtains a feasible solution to the real-time vehicle rerouting problem with time windows, applicable to delivery and/or pickup services that undergo service disruptions due to vehicle breakdowns (Li et al., 2009). Another study describes a dynamic routing system that dispatches a fleet of vehicles according to customer orders arriving at random during the planning period (Fleischmann et al., 2004). The system has online communication with all drivers and customers and, in addition, receives online information on travel times from a traffic management centre, updated at random incidents. While heuristics are known to produce near-optimal results on standard versions of VRP in the literature, they come with the overhead of designing a fixed set of rules for each specific problem instance. Learning based approaches do not face this difficulty, since they can adapt to new situations.

For handling real-time constraints, one study has proposed to combine reinforcement learning (RL) with linear programming (Delarue et al., 2020), where the action is an ordering of nodes, and the remaining path is planned by linear programming. However, the computational time of the linear program may outweigh the benefits of RL. Among learning based algorithms for VRP, we find an overwhelming majority of studies using techniques related to graph convolutional networks, pointer networks, and recurrent encoders and decoders. Because they are invariant to the length of the encoder sequence, pointer networks (Vinyals et al., 2015) enable the model to solve combinatorial optimization problems where the output sequence length is determined by the source sequence. They have been used for solving the TSP (Bello et al., 2016) with actor-critic training, scaling up to 100 nodes. The dependence of pointer networks on input order can be resolved by element-wise projections (Nazari et al., 2018), such that the updated embeddings after state-changes can be effectively computed, irrespective of the order of input nodes.
The most recent development is the application of pointer and attention networks (Deudon et al., 2018; Kool et al., 2018). However, most prior studies report results on single vehicle and single depot instances without time windows. It is unclear how well pointer networks will scale to large constrained problems.

A common alternative to pointer networks is to use graph embeddings (Dai et al., 2017), which can be trained to output the order in which nodes are inserted into a partial tour. A graph attention network with an attention-based decoder (Kool et al., 2018) can be trained with reinforcement learning to autoregressively build TSP solutions, but no results appear to be available so far for VRP. Pointer networks and graphical approaches have been combined into a graph pointer network (GPN) (Ma et al., 2019) to tackle the vanilla TSP. GPN extends the pointer network with graph embedding layers and achieves faster convergence. The model has been shown to tackle larger-scale TSP instances with up to 1000 nodes, using a model trained on a much smaller TSP instance with 50 nodes. More recently, a graph convolutional network has been trained in a supervised manner to output a tour as an adjacency matrix (Joshi et al., 2019), which is converted into a feasible solution using a beam search decoder.

The key drawback of most graph-based approaches appears to be the necessity of fixing the graph topology. Some approaches decouple the graph embedding from the reinforcement learning problem (Figueiredo et al., 2017), which is one inspiration behind the present work. Further, the theme of combining a heuristic with a learning based algorithm has been used before, for instance to solve the TSP using attention (Deudon et al., 2018), where the model is trained with policy gradient and invalid tours are filtered out using 2-OPT local search, a well-known heuristic, or using simulated annealing with the value function approximated by machine learning (Joe and Lau, 2020). In this paper, we use the idea of defining input features that use graphical information, and we combine the independent value estimates using a centralised mapping heuristic. However, to the best of our knowledge, we describe the first learning-based solution for CVRP-TW that is usable for dynamic rerouting, multiple vehicles, depots, and request types.
3 Problem Formulation

For simplicity, we describe the formulation for the static version (CVRP-TW) of the problem. The same objectives and constraints apply to the dynamic case, with the difference that not all the information is available in advance. The static version of the capacitated vehicle routing problem with service time windows (CVRP-TW) assumes that a set of customers C is known, with their locations (x_i, y_i), demanded load m_i, and service time windows [T_{i,min}, T_{i,max}], where i ∈ C, x_i, y_i ∈ R, T_{i,min}, T_{i,max} ∈ R^+, and T_{i,min} < T_{i,max}. All vehicles start at the depot o, and are able to travel between any two locations (fully connected graph), with the distance d_{i,j} between any two customers i, j ∈ C given by the usual Euclidean metric. We also have a fixed set of vehicles V, the speed v of the vehicles (assumed constant on all edges and for all vehicles in the present work), and the maximum load M that any vehicle can carry in one trip. The objective of the problem is then to minimise the total distance J:

J = \min_{a_{i,j,k},\, f_{i,k},\, l_{i,k}} \sum_{i,j,k} d_{i,j}\, a_{i,j,k} + \sum_{i,k} d_{o,i}\, f_{i,k} + \sum_{i,k} d_{o,i}\, l_{i,k},    (1)

where d_{i,j} is the distance from customer i to customer j, d_{o,i} is the distance from the origin (depot) to customer i, a_{i,j,k} is an indicator variable which is 1 if vehicle k goes directly from customer i to customer j, f_{i,k} is an indicator variable which is 1 if customer i is the first customer served by vehicle k, and l_{i,k} is a similar indicator variable which is 1 if customer i is the last customer visited by vehicle k. Apart from constraints requiring a_{i,j,k}, f_{i,k}, and l_{i,k} to take values in {0, 1}, the other constraints are defined as follows.

Every customer must be served by exactly one vehicle, within its specified time window. If t_{i,k} is the time at which vehicle k visits i, then we write these constraints as,

\sum_k f_{i,k} + \sum_{j,k} a_{j,i,k} = 1 \quad \forall i    (2)

T_{i,\min} \leq t_{i,k} \leq T_{i,\max}, \quad \text{if } f_{i,k} = 1 \text{ or } \exists j \text{ s.t. } a_{j,i,k} = 1    (3)

If a vehicle k serves customer i as defined above, it must also leave from the customer location and travel to another customer j, or back to the depot. Conversely, a vehicle that has not served i cannot leave from there. This is formalised as,

l_{i,k} + \sum_j a_{i,j,k} = \begin{cases} 1 & \text{if } f_{i,k} = 1 \text{ or } \exists j \text{ s.t. } a_{j,i,k} = 1 \\ 0 & \text{otherwise} \end{cases}    (4)

A vehicle that starts a journey must also end at the depot, and its total load carried must be at most M. This is formalised as,

\sum_i l_{i,k} = \sum_i f_{i,k} \quad \forall k    (5)

\sum_{i,j} m_j\, a_{i,j,k} + \sum_i m_i\, f_{i,k} \leq M \quad \forall k    (6)

Finally, we impose travel time constraints between any two locations, based on the distance between them and the speed v at which vehicles can travel:

t_{i,k} \geq \frac{d_{o,i}}{v} \quad \text{if } f_{i,k} = 1    (7)

t_{i,k} \geq t_{j,k} + \frac{d_{j,i}}{v} \quad \text{if } a_{j,i,k} = 1    (8)

Clearly, this is a simplified version of a real-world situation, where the number of vehicles also needs to be minimised, the vehicles can have different distance and velocity constraints, travel times can vary based on traffic conditions, and vehicles can do multiple trips, in addition to other variations. The dynamic version of the problem will also allow customers to 'pop up' at arbitrary times, while a plan is being executed. However, the formulation described above is itself of interest because of its NP-hard nature, and the fact that exact solutions become intractable very quickly (as we shall see in Section 5).
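To make constraints (1)-(8) concrete, the following is a minimal sketch of how such a model could be assembled with the open-source PuLP library (whose bundled CBC backend corresponds to the Coin-BC solver mentioned in Section 5). The conditional constraints (3), (7), and (8) are linearised with a big-M term, in the spirit of the baseline MILP of Section 5. The toy instance data and the constant BIG_M are illustrative assumptions, not values from the paper.

```python
import math
import pulp

# Illustrative toy instance (not from the paper): 3 customers, 2 vehicles.
coords = {0: (0, 0), 1: (3, 4), 2: (6, 1), 3: (2, 7)}   # node 0 is the depot o
demand = {1: 40, 2: 70, 3: 30}
window = {1: (0, 50), 2: (10, 60), 3: (5, 55)}           # (T_min, T_max)
C = [1, 2, 3]                                            # customers
V = [0, 1]                                               # vehicles
M_cap, v, BIG_M = 100, 1.0, 1e4

def d(i, j):
    return math.dist(coords[i], coords[j])

prob = pulp.LpProblem("CVRP_TW", pulp.LpMinimize)
a = pulp.LpVariable.dicts("a", (C, C, V), cat="Binary")  # i -> j by vehicle k
f = pulp.LpVariable.dicts("f", (C, V), cat="Binary")     # first customer of k
l = pulp.LpVariable.dicts("l", (C, V), cat="Binary")     # last customer of k
t = pulp.LpVariable.dicts("t", (C, V), lowBound=0)       # visit times

# Objective (1): inter-customer legs plus first/last legs to the depot.
prob += (pulp.lpSum(d(i, j) * a[i][j][k] for i in C for j in C for k in V if i != j)
         + pulp.lpSum(d(0, i) * (f[i][k] + l[i][k]) for i in C for k in V))

for i in C:
    # (2): every customer is served exactly once.
    prob += pulp.lpSum(f[i][k] for k in V) + \
            pulp.lpSum(a[j][i][k] for j in C for k in V if j != i) == 1
    for k in V:
        served = f[i][k] + pulp.lpSum(a[j][i][k] for j in C if j != i)
        # (4): leave i (to another customer or the depot) iff i was served by k.
        prob += l[i][k] + pulp.lpSum(a[i][j][k] for j in C if j != i) == served
        # (3): time window, active only when served (big-M linearisation).
        prob += t[i][k] >= window[i][0] - BIG_M * (1 - served)
        prob += t[i][k] <= window[i][1] + BIG_M * (1 - served)
        # (7): travel time from the depot for the first customer.
        prob += t[i][k] >= d(0, i) / v - BIG_M * (1 - f[i][k])
        # (8): travel time between consecutive customers.
        for j in C:
            if j != i:
                prob += t[i][k] >= t[j][k] + d(j, i) / v - BIG_M * (1 - a[j][i][k])

for k in V:
    # (5): a vehicle that starts a tour also ends it (and starts at most once).
    prob += pulp.lpSum(l[i][k] for i in C) == pulp.lpSum(f[i][k] for i in C)
    prob += pulp.lpSum(f[i][k] for i in C) <= 1
    # (6): capacity.
    prob += (pulp.lpSum(demand[j] * a[i][j][k] for i in C for j in C if i != j)
             + pulp.lpSum(demand[i] * f[i][k] for i in C)) <= M_cap

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```

Note that the positive travel times in (8) already rule out subtours among served customers, so no separate subtour-elimination constraints are needed in this sketch.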
4 Methodology

We describe a value-based reinforcement learning algorithm for solving CVRP-TW and CVRP-TWDR. The algorithm builds the solution by mapping vehicles to customers one at a time, eventually leading to complete routes and arrival times. We implement parallelised computation of values associated with each active customer and vehicle pair, followed by a centralised heuristic to compute the final assignment for each vehicle. Parallelism is achieved by noting that the state inputs and value outputs for a given customer-vehicle pair can be computed independently of all other pairs, given the status of all vehicles and customers at a specific time. Furthermore, cloning the network parameters of the RL agent allows us to (i) keep the value estimates consistent, (ii) accelerate training through pooling of experience, and (iii) scale to large problem instances without retraining. Details of the approach are given below.
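Because each pair is evaluated independently, all pairwise values can be computed in a single batched forward pass. The sketch below illustrates this, assuming a hypothetical pair_features(k, i) helper that assembles the per-pair inputs of Table 1 (described next) and a value_net of the form described in Section 4.4.

```python
import torch

def evaluate_pairs(value_net, vehicles, customers, pair_features):
    """Compute q(k, i) for every vehicle-customer pair in one batched pass."""
    feats = torch.stack([pair_features(k, i) for k in vehicles for i in customers])
    with torch.no_grad():
        q = value_net(feats).view(len(vehicles), len(customers))
    return q  # q[m, n]: value of assigning customer n to vehicle m
```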
4.1 State and Action Spaces

The most common methods for value-based reinforcement learning (such as Q-learning or its neural counterpart DQN (Mnih et al., 2013)) use the concepts of states and actions to differentiate between the information provided by the environment, and the decisions taken by the RL agent. The obvious analogy in the present scenario would be to provide information about customers and vehicles as the state for a vehicle, and to output the values for all customers as actions. However, such an architecture would only allow us to work with a constant number of customers (equal to the size of the output). In order to handle an arbitrary number of vehicles and customers, we chose to provide information about individual customer-vehicle pairs as inputs to the RL agent, and to output a scalar value for each pair. Table 1 lists the features used for computation of pairwise value estimates, for a generic vehicle k, at time t, considering a proposed customer i. The vehicle could presently be at the depot o, or it could have just completed a service for customer j.

Each input is normalised by a constant, though Table 1 omits these for notational clarity. All distance-related inputs are normalised by D, the diagonal length of the x-y map on which customers are located. The time-related inputs are normalised by τ = max_i T_{i,max}, the latest among all customer time windows. The load-related inputs are normalised by M, the load carrying capacity of each vehicle. Several of the inputs are self-explanatory. The non-trivial inputs emulate information that would be available from a graphical analysis, and are described below.

Two of the inputs are binary flags. The first, I_{in/out}, is set to 1 if the current time t is greater than τ/2. It is meant to encourage the agent to start moving vehicles back towards the depot. The second flag, I_{serve,k'≠k}, is set to 1 if there is no other vehicle k' ≠ k which can also serve the proposed customer i within the time window, considering its current commitments. This flag indicates to the agent whether a customer is likely to get dropped if it is not served by k. An extension to this signal is min_k(d_{i,loc(k)}), which is the distance to the closest vehicle from the proposed customer i. A forward look-ahead is introduced by min_{rchbl. j}(d_{i,j}), which is the distance from the proposed customer i to the nearest customer j which (i) is still active (not yet picked by any vehicle), and (ii) can be reached within its time window after serving i. Finally, we also indicate the time for which vehicle k will have to wait at customer i, in case it arrives before the minimum time T_{i,min}.

Table 1: State space inputs for each customer-vehicle pair, assuming that the proposed customer is i, the vehicle is k, and the current time is t. Distance inputs are based on the current vehicle location (at depot o or customer j). Note that all inputs are normalised by relevant constants (D for distances, M for loads, and τ for times) as explained in the text.

Input | Explanation
d_{o,i} or d_{j,i} | Distance to proposed customer from current vehicle location (o or j)
m_i | Customer demand quantity
d_{o,i} | Distance to depot from proposed customer
I_{in/out} | Flag: is vehicle outbound or inbound
0 or d_{o,j} | Distance to depot from current vehicle location (0 if at o, otherwise d_{o,j})
I_{serve,k'≠k} | Flag: whether customer can be served by another vehicle k'
t | Current time-step (normalised by scheduling time-out τ)
M_k(t) | Remaining capacity of vehicle k at current time t
T_{i,max} | Maximum window of the proposed customer
min_{rchbl. j}(d_{i,j}) | Distance to next-nearest reachable customer after the proposed customer
min_k(d_{i,loc(k)}) | Distance from proposed customer i to the closest vehicle
max(T_{i,min} − t − d_{j,i}/v, 0) | Time for which vehicle will have to wait to start service

4.2 Value Computation and Assignment

The inputs as described above are used by the RL agent (described in Section 4.4) to produce a scalar output for each customer-vehicle pair. The trigger for these computations is either the start of an episode (t = 0), or a vehicle becoming free (completing its assigned service) at a time step t > 0. As shown in Figure 1, the agent then computes pairwise values q(k, i) for each vehicle k and customer i. This computation is done regardless of the current state of each vehicle (busy or free), but the state inputs (Table 1) reflect any additional time required for finishing the current service. Likewise, all active customers (ones not yet allocated to a vehicle) are included. After all q(k, i) values are generated with this logic, one pair is chosen (based on the subsequent description).

If the chosen mapping assigns a customer to the vehicle that triggered the computation, the decision is communicated to the environment. However, if the chosen pair belongs to a vehicle that is currently busy, a 'soft' update is made to the state information, emulating the assignment. Note that the soft update includes deactivation of chosen customers and consequent changes to estimated times and distances, but these updates are local to the agent. The process continues until the trigger vehicle gets an assignment. While this procedure involves additional computation as compared to directly assigning a customer to the free vehicle, it is more reflective of the context in which the decision is taken. Consider the simple example shown in Figure 2. Vehicle k becomes free at time t, after serving customer j. Since customer i is the only available customer at this time, it generates a single value q(k, i). However, vehicle k' is scheduled to become free one time step later, and is much closer to i. Therefore, it would be more optimal to assign k' to i. This can only be achieved with the proposed procedure.

Note that if a vehicle is assigned a customer whose window is not yet open (waiting is expected), then the vehicle does not leave its current location until the correct time arrives. This gives us more flexibility to reroute the vehicle in the dynamic scenario.

4.3 Rewards

The overall objective of the problem is to minimise the total distance travelled, as defined in (1), while satisfying constraints (2)-(8). Since this is a global objective (a summation over the journeys of all vehicles) while the decisions are taken locally (with limited global context), we define a global terminal reward and a set of localised step rewards. The terminal reward is given by the fulfilment ratio F, which is the ratio of customers served to the total number of customers in the dataset (computed at the end of the episode).

The step reward for vehicle k at stage n of its journey is computed based on several terms, as defined below and explained in Table 2. We assume that the vehicle is currently at the location of customer i at time t, and is going to customer j.

R_\text{step}(k, n) = -a_1\, d(k, n) - a_2\, (T_{j,\max} - t(k, j)) - a_3\, t_\text{wait}(k, j, n) - a_4\, \Delta(k, k^*, j) - a_5 \left( \frac{d_{j,j^*}}{v} + t_\text{wait}(k, j^*, n+1) \right) + a_6\, \alpha(i, j, t) + a_7\, I_{\text{serve}, k' \neq k}    (9)

The reward definition (9) is necessarily complex, because it must provide feedback about movements of other vehicles. Most terms can be correlated by the agent to various combinations of inputs, which were defined in Table 1. Apart from basic distance and time terms, the step reward includes a comparison ∆(k, k*, j) of the distance of the current vehicle k from j, versus the closest vehicle k* from j. We also include a term for the expected cost incurred in the next, (n+1)-th, stage of the journey. Finally, we correlate the indicator variables in the input with the last two terms of the step reward. The binary reward α(i, j, t) is equal to 1 if k is moving outward from the depot (j is farther from the depot than i) in the first half of the horizon (I_{in/out} = 0), and vice versa; the final term rewards k for serving j if k is the only vehicle that can do so.

Table 2: Terms included in the step reward, for vehicle k at stage n of its journey, assuming it is currently at customer i and heading towards customer j.

Wgt. | Term | Explanation
a_1 | d(k, n) | Distance travelled by k in stage n of journey; d_{o,j} or d_{i,j} as applicable
a_2 | T_{j,max} − t(k, j) | Time remaining for servicing j, when k arrives at the location
a_3 | t_wait(k, j, n) | Wait time for k in stage n, given by max(0, T_{j,min} − t(k, j))
a_4 | ∆(k, k*, j) | How much farther k has to travel, compared to the closest vehicle k*
a_5 | d_{j,j*}/v + t_wait(k, j*, n+1) | Expected time to service the next closest customer j* after j
a_6 | α(i, j, t) | Whether k is moving away from the depot when I_{in/out} = 0, and vice versa
a_7 | I_{serve,k'≠k} | Whether k is the only vehicle that can serve j

After experimentation, we found that setting [a_1, ..., a_7] to [0.2, 0.5, 1.0, 0.25, 0.5, 0.1, 0.25] resulted in good solutions on the training data. If k serves N_k customers in its journey, the total reward is,

R(k, n) = R_\text{step}(k, n) + \gamma^{N_k - n} F.
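As an illustration, the following is a minimal sketch of how the step reward (9) could be computed from the quantities in Table 2, using the weights reported above. The helper values (distances, wait times, Δ) are assumed to be precomputed by the environment, and the function signature is hypothetical.

```python
# Weights [a1, ..., a7] as reported in the text.
A = [0.2, 0.5, 1.0, 0.25, 0.5, 0.1, 0.25]

def step_reward(d_step, t_arrive, T_j_max, T_j_min, delta_k, d_next, v,
                t_wait_next, alpha, only_server):
    """Step reward of Eq. (9) for one (vehicle, customer) assignment.

    d_step:      distance travelled in this stage (d_{o,j} or d_{i,j})
    t_arrive:    arrival time t(k, j) at the next customer j
    delta_k:     extra distance of k compared to the closest vehicle k*
    d_next:      distance from j to its nearest reachable customer j*
    t_wait_next: expected wait at j* in stage n+1
    alpha:       1 if moving in the 'correct' direction w.r.t. I_in/out
    only_server: 1 if k is the only vehicle able to serve j
    """
    t_wait = max(0.0, T_j_min - t_arrive)            # wait time at j
    return (-A[0] * d_step
            - A[1] * (T_j_max - t_arrive)
            - A[2] * t_wait
            - A[3] * delta_k
            - A[4] * (d_next / v + t_wait_next)
            + A[5] * alpha
            + A[6] * only_server)

def total_reward(r_step, gamma, N_k, n, F):
    """Total reward R(k, n) = R_step(k, n) + gamma^(N_k - n) * F."""
    return r_step + (gamma ** (N_k - n)) * F
```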
Figure 1: Workflow of decision-making at each time step t. The decision loop terminates either when all free vehicles are allocated a customer, or when all feasible customers are exhausted.

Figure 2: Example showing the need for the proposed procedure. It is more optimal to send k' to i, even though k becomes free earlier. In this scenario, k will be sent to the depot at time t.

4.4 Value Network and Training

We use a neural network with experience replay for approximating the value function. The network receives the inputs defined in Table 1, and attempts to predict the total reward as defined in Section 4.3. We consider each vehicle as a decentralised agent with parameter sharing. This property helps us to (i) accelerate learning by pooling experience from all vehicles, (ii) improve scalability, as the number of vehicles and customers can be arbitrary, and (iii) improve consistency of decision-making, as the value outputs can differ only due to inputs and not due to network parameters.

We use a network with fully connected layers of dimension (12, 6, 3, 1) neurons, with the first and last layers being input and output respectively. All the hidden layers have tanh activation and the output has linear activation. The network is trained using the torch library in Python 3.6, using the Adam optimizer with a learning rate of 0.001, a batch size of 32 samples, and mean squared error loss. The replay memory buffer consists of the latest 50,000 samples, from which training batches are drawn at random.

Algorithm 1 describes the detailed training procedure. Before prediction of the value estimates, a pre-processing step retains only the feasible combinations of vehicles and customers for further processing. Feasibility is decided based on the overall system constraints, including vehicle remaining capacity and customer time windows. We predict values not only for free vehicles, but also for busy vehicles (currently en route); the reasoning was explained in Section 4.2. The execution of one full scheduling exercise is carried out as per Stage 1. Training is carried out after the episode is complete, as described in Stage 2. The training and testing data sets are described in the next section.
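A minimal sketch of this value network and replay update in PyTorch follows (the 12-dimensional input corresponds to the features of Table 1); the buffer handling and training-loop scaffolding are illustrative rather than the authors' exact code.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Fully connected (12, 6, 3, 1) network: tanh hidden layers, linear output."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(12, 6), nn.Tanh(),
            nn.Linear(6, 3), nn.Tanh(),
            nn.Linear(3, 1),            # linear output: scalar value q(k, i)
        )

    def forward(self, x):
        return self.net(x)

net = ValueNet()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
loss_fn = nn.MSELoss()
replay = deque(maxlen=50_000)           # keeps only the latest 50,000 samples

def train_step(batch_size=32):
    """One gradient step on a random mini-batch drawn from the replay buffer."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    x = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _ in batch])
    y = torch.tensor([[r] for _, r in batch], dtype=torch.float32)
    optimizer.zero_grad()
    loss_fn(net(x), y).backward()
    optimizer.step()
```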
Algorithm 1: RL Training Framework

Initialize replay buffer B
Initialize value network parameters θ, batch size β
for ep = 1 to N_episode do
    Randomly choose an instance from the training set
    Reset environment and get initial state
    Stage 1: Prediction and sample collection
    while t < T_max do
        Create a copy of the environment for soft updates
        while a free vehicle is unassigned do
            Find feasible combinations of vehicles k and customers i, ∀k (including free and busy vehicles)
            if no feasible customer for any free vehicle then break
            Pre-process the input state
            Predict value estimates for all feasible (k, i)
            Greedily assign the pair with the highest value
            Perform soft update
        end while
        Collect the first assignment for each free vehicle
        Execute new assignments in the environment
        Add [input(k, n), R_step(k, n), q_pred(k, n)] to B for all assignments
    end while
    Stage 2: Learning phase
    Compute fulfilment ratio F
    Update terminal reward term γ^{N_k − n} F for all decisions taken in the current episode
    Delete the oldest entries in B if its size exceeds the buffer capacity
    Draw β random samples from B
    Update network parameters θ by minimising the MSE loss between q_pred(k, n) and R_step(k, n) + γ^{N_k − n} F over the batch
end for

5 Experiments and Results

5.1 Training

We train the proposed RL agent using 20 randomly generated data sets with 20 customers (|C| = 20) and 4 vehicles (|V| = 4). The customer locations (x_i, y_i) are distributed uniformly at random over the map, and the demand m_i of each customer is drawn from an exponential distribution. The vehicle capacity is M = 200, and the speed of each vehicle is v = 10 distance units in each time step. The minimum time window T_{i,min} is placed uniformly at random between 0 and 200, and the width of the window (T_{i,max} − T_{i,min}) is Gaussian with a mean of 35 and a standard deviation of 5 time units (clipped to a minimum of 1). We generate a set of 20 episodes using these parameters, and these form the entirety of the training data set.

The resulting training performance is shown in Figure 3 for a run of 700 episodes, where each episode consists of one of the 20 pre-generated days (picked randomly) and is run with static assumptions (CVRP-TW). We implement an ε-greedy exploration policy, with ε decaying linearly from 1 to 0 over the first 300 episodes (changing only at the end of each episode). Note that the training stabilises at a fulfilment ratio (proportion of customers served) close to 1.0. For all subsequent tests, we use the same trained model parameters without further refinement.
This restriction is imposed for handling real-world scenarios where demands may arise from completely different distributions and with different numbers of vehicles and customers, but must nevertheless be solved quickly.
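To make Stage 1 of Algorithm 1 concrete, here is a condensed sketch of the inner assignment loop with soft updates; env, its methods, and the helper names are hypothetical stand-ins for the environment interface, not the authors' implementation.

```python
import copy

def stage1_assignments(env, value_net, pair_features):
    """Greedy pairwise assignment with 'soft' updates (Stage 1 of Algorithm 1).

    Repeatedly scores all feasible (vehicle, customer) pairs, for busy as
    well as free vehicles, commits the best pair to a *copy* of the
    environment, and stops once every free vehicle has an assignment.
    Hypothetical environment API: feasible_pairs(), free_vehicles(),
    soft_assign(k, i).
    """
    shadow = copy.deepcopy(env)          # soft updates stay local to the agent
    final = {}                           # vehicle -> first committed customer
    while any(k not in final for k in shadow.free_vehicles()):
        pairs = shadow.feasible_pairs()  # respects capacity and time windows
        if not pairs:
            break                        # no feasible customer for any vehicle
        # Score every feasible pair and pick the highest-value one.
        best = max(pairs, key=lambda p: float(value_net(pair_features(*p))))
        k, i = best
        shadow.soft_assign(k, i)         # deactivate i, update times/distances
        if k in env.free_vehicles() and k not in final:
            final[k] = i                 # first assignment for a free vehicle
    return final                         # to be executed in the real environment
```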
5.2 Test Data and Baseline Algorithms

We base all our experiments (CVRP-TW as well as CVRP-TWDR) on the benchmark data sets proposed by Solomon (1987), which include different problem scales and also come with optimal solutions (or best-known solutions (SINTEF, 2008) where optimal ones are not available). The static version (CVRP-TW) contains two 'types' of vehicle capacities (type 2 has much larger vehicle capacities than type 1), and three customer location distributions: C for clustered, R for uniformly random, and RC for a mix of random and clustered locations. Each of the six resulting combinations of characteristics contains on average 9 problem instances. The distance metric is Euclidean, with a vehicle speed of v = 1.

Among the Solomon data sets, we experiment with sets containing 25, 50, and 100 customers (the 25 and 50 customer instances are subsets of the 100 customer data sets). We also run two benchmark algorithms for comparison: a meta-heuristic approach with genetic algorithms, and an exact linear programming approach. The latter method is computationally expensive (even the best-known solutions in the literature are not always solved to optimality), and so we report our LP outputs where available, and use the best-known numbers otherwise. The methods are explained later in this section.

To experiment with dynamic arrival of customers (CVRP-TWDR), we use a simulated dataset (van Veen et al., 2013) that is a modified version of Solomon CVRP-TW (available at https://liacs.leidenuniv.nl/ csnaco/index.php?page=code). The arrival times of new requests are generated using the method of (Gendreau et al., 1999), with a specified level of dynamicity. An instance with X% dynamicity implies that X% of customers are not revealed at the start, and appear during execution.

We compare the results of the RL approach with two alternative methods: one using a mixed integer linear program formulation (MILP); and the other, a meta-heuristic based on genetic algorithms.
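As a simplified illustration of how a static instance can be turned into a dynamic one (the benchmark above uses the generator of Gendreau et al. (1999); the uniform reveal times below are an assumption for illustration only):

```python
import random

def make_dynamic(customers, dynamicity, horizon):
    """Assign reveal times: a `dynamicity` fraction of customers appear
    during execution; the rest are known at t = 0. Reveal times here are
    drawn uniformly over the horizon, purely for illustration."""
    hidden = random.sample(customers, k=round(dynamicity * len(customers)))
    return {c: (random.uniform(0, horizon) if c in hidden else 0.0)
            for c in customers}

# Example: 50% dynamicity over a horizon of 200 time units.
reveal_time = make_dynamic(list(range(100)), dynamicity=0.5, horizon=200)
```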
Table 3: Performance comparison of RL, GA, and best-known solutions (BS) for the static CVRP-TW, averaged over the instances of each combination of capacity type (1 or 2), customer count (25, 50, 100), and location class (C, R, RC). For each method, the table reports the average number of vehicles, total distance, and computation time in seconds (columns: RL veh, RL dist, RL time; GA veh, GA dist, GA time; BS veh, BS dist, BS time). BS computation times are marked -1 where best-known literature values are used instead of our MILP runs.
Table 4: Performance comparison of RL and GA for CVRP-TWDR, with two different dynamicity levels: 10% and 50%. For each dynamicity level, capacity type, and location class, the table reports the average number of vehicles, total distance, and computation time for RL and GA.

The MILP formulation of the problem is based on the description given in Section 3, but with a few modifications to speed up the solution process and to linearise the constraints (using the big-M method (Bazaraa et al., 2008)). The details are omitted here due to space constraints. Standard solvers such as Coin-BC (Branch and Cut) and GeCode are used to solve the resulting mixed integer programs. Due to excessive computation times, we only report the results of the MILP formulation for the 25-customer cases in Table 3. These values are in agreement with best-known results (SINTEF, 2008) in terms of quality; however, the computation times are reported from our runs.

Our second baseline algorithm is a meta-heuristic based on genetic algorithms (GA). The entire tour, which describes individual routes for all vehicles, is defined as a single chromosome (Vaia, 2014). Each chromosome is assigned a fitness score, which is a weighted sum of the total distance of the tour and the total number of vehicles used. The algorithm proceeds according to the following steps; a condensed sketch is given after the list.
Initial population generation: This is done using a nearest neighbour search, starting from random nodes and adding nodes as long as they satisfy time window and capacity constraints. Random initial nodes result in diversity of the population.

Improvement of initial population: Once the initial solutions are generated, an insertion heuristic is used to improve them in terms of the number of vehicles used. This is accomplished by breaking up the route(s) with the least volume utilisation and inserting their nodes (called the unreserved pool) into routes with higher volume utilisation, to reduce the number of routes/vehicles.

Selection of parents: A binary tournament selection procedure is applied to select possible parents. Essentially, two chromosomes are selected at random, and the one with the lower fitness score is chosen to be a parent. A pool of parents is thus created.

Crossover and Mutation: Two parents are crossed over to create an offspring. Though there are several techniques of crossover in the literature, we use common nodes crossover and common arcs crossover with equal probability (Vaia, 2014). The offspring thus generated then undergo mutation with a probability of 10% (a configurable parameter known as the mutation probability).

Progression: If the best solution among the members of a population does not improve over 50 generations (iterations), the algorithm is terminated, and the chromosome with the best fitness value in the last generation is declared the solution.
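The following is a heavily simplified sketch of this GA loop under the stated parameters (binary tournament, 10% mutation, termination after 50 stagnant generations); the crossover and mutate functions are placeholders, since the common-nodes/common-arcs operators of (Vaia, 2014) are not reproduced here.

```python
import random

def genetic_algorithm(population, fitness, crossover, mutate,
                      mutation_prob=0.1, patience=50):
    """Simplified GA baseline skeleton; `crossover` and `mutate` stand in
    for the common-nodes/common-arcs and route-level operators."""
    best = min(population, key=fitness)
    stagnant = 0
    while stagnant < patience:           # stop after 50 generations w/o improvement
        # Binary tournament: pick two chromosomes at random, keep the fitter one.
        def tournament():
            a, b = random.sample(population, 2)
            return a if fitness(a) < fitness(b) else b
        offspring = []
        for _ in range(len(population)):
            child = crossover(tournament(), tournament())
            if random.random() < mutation_prob:
                child = mutate(child)
            offspring.append(child)
        population = offspring
        generation_best = min(population, key=fitness)
        if fitness(generation_best) < fitness(best):
            best, stagnant = generation_best, 0
        else:
            stagnant += 1
    return best
```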
5.3 Results on Static Data (CVRP-TW)

A comparison of the performance of all algorithms on the static version of the problem (all demands known in advance) is given in Table 3. Each row in the table is for a specific type of data (small or large vehicle capacities, denoted by Type 1 and Type 2 respectively), number of customers, and type of location distribution (Clustered, Random, or both Random & Clustered). Note that each row is averaged over approximately 9 data sets. The final three columns (BS) include best-known results from the literature (SINTEF, 2008), except for the 25-customer data sets, where we show results from the baseline MILP formulation. All computations are carried out on an Intel i5 laptop with 12 GB RAM, Ubuntu 20.04, and no dedicated GPU.

The key takeaways from the table are that RL performs close to the best-known solutions in terms of the number of vehicles, and is approximately 40% above optimal in terms of distance travelled. However, RL has a significant advantage over GA in terms of computation times, as shown in Figure 4. Both algorithms are orders of magnitude faster than MILP. Notably, both RL and GA produce solutions that use fewer vehicles than best-known solutions, although these are accompanied by larger distances. A deeper look at this result is shown for RL in Figure 5, where we plot the relative distance travelled on the y-axis, and the relative number of vehicles required on the x-axis. RL requires at least as many vehicles as the best-known solution for type-1 data (where vehicle capacity is small), but it produces some solutions with fewer vehicles (ratio < 1) on type-2 data sets (where vehicle capacity is large).

Figure 4: Comparison of computation times for different sized data sets, averaged over all types and classes. Computation times for MILP are not shown because of insufficient data (the average for 25 customers is 10975 sec).
Figure 5: Scatter plots of relative distance and vehicle counts for RL in comparison with best-known solutions, for type-1 data (top) and type-2 data (bottom) in the static case. Bubble sizes are proportional to the number of customers.

5.4 Results on Dynamic Data (CVRP-TWDR)

Table 4 summarises the performance of RL and GA on dynamic test data (generated as per the description in Section 5.2), for two levels of dynamicity and for data sets with 100 customers. As explained before, a dynamicity of 10% implies that 90% of customer demands are known in advance, while 10% are produced while the simulation is running. The computation times are summed over the initial planning phase as well as the dynamic updates, for both algorithms. In this case, we note a much closer comparison between RL and GA.

Figure 6: Comparison of RL and GA for CVRP-TWDR (dynamic version), showing average computation time (sec), average number of vehicles, and average distance across dynamicity levels (10%, 50%) and data types (1, 2). Results are for the 100-customer data sets.

Figure 6 shows the results visually. We see that RL computation times are nearly constant for both levels of dynamicity and both data types, and approximately equal to those in Table 3 for 100 customers. This is because of the one-step distributed decisions in RL, which do not differentiate between previously known and newly generated customers. On the other hand, since GA computes full vehicle paths, it requires much longer to adjust to new information. This is especially true when dynamicity is higher (50%). Furthermore, RL outperforms GA on Type-2 data (both in vehicle counts and distance travelled) even with lower computation time. This is because Type-2 data assumes much higher vehicle capacity, giving RL more options for routing vehicles.
6 Conclusion

In this paper, we proposed an RL approach with distributed value estimation and centralised vehicle mapping, which is inherently suited to dynamic vehicle routing with capacity constraints. We verified this hypothesis using a popular public data source. The primary attraction of our RL approach is its capability of producing competitive results with completely out-of-distribution training, and its parallelisability. However, the latter aspect has not yet been fully explored; the RL results reported in this work use serialised computation in our code. In future work, we wish to further explore the options for computational scalability, and the flexibility to handle more constraints and stochasticity.
References
Mokhtar S Bazaraa, John J Jarvis, and Hanif D Sherali. 2008. Linear Programming and Network Flows. John Wiley & Sons, Chapter 4.3.

Richard Bellman. 1962. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM (JACM) 9, 1 (1962), 61–63.

Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural Combinatorial Optimization with Reinforcement Learning. arXiv:1611.09940 [cs.AI]

Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. 2020. Machine learning for combinatorial optimization: a methodological tour d'horizon. European Journal of Operational Research (2020).

Olli Bräysy. 2003. A reactive variable neighborhood search for the vehicle-routing problem with time windows. INFORMS Journal on Computing 15, 4 (2003), 347–368.

Hanjun Dai, Elias B. Khalil, Yuyu Zhang, Bistra Dilkina, and Le Song. 2017. Learning Combinatorial Optimization Algorithms over Graphs. arXiv:1704.01665 [cs.LG]

Arthur Delarue, Ross Anderson, and Christian Tjandraatmadja. 2020. Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing. arXiv preprint arXiv:2010.12001 (2020).

Michel Deudon, Pierre Cournut, Alexandre Lacoste, Yossiri Adulyasak, and Louis-Martin Rousseau. 2018. Learning heuristics for the TSP by policy gradient. In International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research. Springer, 170–181.

Daniel R. Figueiredo, Leonardo Filipe Rodrigues Ribeiro, and Pedro H. P. Saverese. 2017. struc2vec: Learning Node Representations from Structural Identity. CoRR abs/1704.03165 (2017). http://arxiv.org/abs/1704.03165

Bernhard Fleischmann, Martin Gietz, and Stefan Gnutzmann. 2004. Time-varying travel times in vehicle routing. Transportation Science 38, 2 (2004), 160–173.

Luca Maria Gambardella, Éric Taillard, and Giovanni Agazzi. 1999. MACS-VRPTW: A multiple ant colony system for vehicle routing problems with time windows. In New Ideas in Optimization.

Michel Gendreau, Francois Guertin, Jean-Yves Potvin, and Eric Taillard. 1999. Parallel tabu search for real-time vehicle routing and dispatching. Transportation Science 33, 4 (1999), 381–390.

Ali Haghani and Soojung Jung. 2005. A dynamic vehicle routing problem with time-dependent travel times. Computers & Operations Research 32, 11 (2005), 2959–2986.

Waldy Joe and Hoong Chuin Lau. 2020. Deep Reinforcement Learning Approach to Solve Dynamic Vehicle Routing Problem with Stochastic Customers. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 30. 394–402.

Chaitanya K Joshi, Thomas Laurent, and Xavier Bresson. 2019. An efficient graph convolutional network technique for the travelling salesman problem. arXiv preprint arXiv:1906.01227 (2019).

Imdat Kara, Bahar Y Kara, and M Kadri Yetis. 2007. Energy minimizing vehicle routing problem. In International Conference on Combinatorial Optimization and Applications. Springer, 62–71.

Wouter Kool, Herke van Hoof, and Max Welling. 2018. Attention, Learn to Solve Routing Problems! arXiv:1803.08475 [stat.ML]

Gilbert Laporte. 1992. The traveling salesman problem: An overview of exact and approximate algorithms. European Journal of Operational Research 59, 2 (1992), 231–247.

Hoong Chuin Lau, Melvyn Sim, and Kwong Meng Teo. 2003. Vehicle routing problem with time windows and a limited number of vehicles. European Journal of Operational Research (2003).

Jan Karel Lenstra and AHG Rinnooy Kan. 1981. Complexity of vehicle routing and scheduling problems. Networks 11, 2 (1981), 221–227.

Jing-Quan Li, Pitu B. Mirchandani, and Denis Borenstein. 2009. Real-time vehicle rerouting problems with time windows. European Journal of Operational Research (2009).

Qiang Ma, Suwen Ge, Danyang He, Darshan Thaker, and Iddo Drori. 2019. Combinatorial Optimization by Graph Pointer Networks and Hierarchical Reinforcement Learning. arXiv preprint arXiv:1911.04936 (2019).

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).

Mohammadreza Nazari, Afshin Oroojlooy, Lawrence Snyder, and Martin Takác. 2018. Reinforcement learning for solving the vehicle routing problem. In Advances in Neural Information Processing Systems. 9839–9849.

Beatrice Ombuki, Brian J Ross, and Franklin Hanshar. 2006. Multi-objective genetic algorithms for vehicle routing problem with time windows. Applied Intelligence 24, 1 (2006), 17–30.

Christian Prins. 2004. A simple and effective evolutionary algorithm for the vehicle routing problem. Computers & Operations Research 31, 12 (2004), 1985–2002.

Harilaos N Psaraftis. 1988. Dynamic vehicle routing problems. In Vehicle Routing: Methods and Studies.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, et al. 2017. Mastering the game of Go without human knowledge. Nature 550, 7676 (2017), 354–359.

Marius M Solomon. 1987. Algorithms for the vehicle routing and scheduling problems with time window constraints. Operations Research 35, 2 (1987), 254–265.

Gintaras Vaia. 2014. Genetic Algorithm for Vehicle Routing Problem. PhD Thesis (2014).

Barry van Veen, Michael Emmerich, Zhiwei Yang, Thomas Bäck, and Joost Kok. 2013. Ant colony algorithms for the dynamic vehicle routing problem with time windows. In International Work-Conference on the Interplay Between Natural and Artificial Computation. Springer, 1–10.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems. 2692–2700.