Can Machine Learning Help in Solving Cargo Capacity Management Booking Control Problems?
Justin Dumouchelle, Emma Frejinger, and Andrea Lodi
CERC, Polytechnique Montreal, Montreal, Canada, [email protected], [email protected]
DIRO, University of Montreal, Montreal, Canada, [email protected]
IASI, CNR, Italy
Abstract
Revenue management is important for carriers (e.g., airlines and railroads). In this paper, we focus on cargo capacity management, which has received less attention in the literature than its passenger counterpart. More precisely, we focus on the problem of controlling booking accept/reject decisions: Given a limited capacity, accept a booking request or reject it to reserve capacity for future bookings with potentially higher revenue. We formulate the problem as a finite-horizon stochastic dynamic program. The cost of fulfilling the accepted bookings, incurred at the end of the horizon, depends on the packing and routing of the cargo. This is a computationally challenging aspect, as the latter are solutions to an operational decision-making problem, in our application a vehicle routing problem (VRP). Seeking a balance between online and offline computation, we propose to train a predictor of the solution costs to the VRPs using supervised learning. In turn, we use the predictions online in approximate dynamic programming and reinforcement learning algorithms to solve the booking control problem. We compare the results to an existing approach in the literature and show that we are able to obtain control policies that provide increased profit at a reduced evaluation time. This is achieved thanks to accurate approximation of the operational costs and negligible computing time in comparison to solving the VRPs.
Keywords: cargo capacity management; dynamic programming; supervised learning; reinforcement learning; vehicle routing problem.

1 Introduction
Revenue management (RM) is of importance in many commercial applications such as airline cargo, hotels, and attended home delivery (see, e.g., the survey [10]). In general, RM focuses on decisions regarding the distribution of products or services with the goal of maximizing the total profit or revenue. The focus of this work is on quantity decisions in the context of a booking control problem, where a set of requests is observed across a time horizon. Each request can either be accepted or rejected. At the end of the booking period, the set of accepted requests needs to be fulfilled; this corresponds to an operational decision-making problem. The booking control problem can be formulated as a Markov Decision Process (MDP), which describes the relationship between the revenue from accepting a request, the decrease in capacity for future requests, and the cost associated with the fulfillment of the accepted requests. Although the MDP captures the problem structure, solving it is often intractable due to the curse of dimensionality. As such, approximate methods have been widely adopted for a range of RM problems. Bid-price and booking-limit control policies, as described in [22], are among the most used methods for capacity control problems.

Outside of the context of RM, reinforcement learning (RL) has seen success in a variety of challenging control problems, such as Atari games [16] and Go [20]. For a detailed overview of RL, we refer the reader to [4, 21]. Despite the success of RL in applications with intractable state spaces, the applications within RM have mainly been limited to airline seat allocation problems [5, 8, 12]. A major limitation for the direct application of RL to capacity control problems lies in the time required for simulating the system. Indeed, the computational cost associated with solving the end-of-horizon operational decision-making problem is non-negligible, which leads to a prohibitively expensive computational cost for a simulation-based approach. This limitation can be observed in applications such as vector packing in airline cargo management [3, 13, 14] or the vehicle routing problem (VRP) in distribution logistics [7]. To overcome this, we propose an approximation of the operational cost via supervised learning. We then leverage the resulting prediction function within approximate dynamic programming (DP) and RL algorithms and evaluate our approach on an application in distribution logistics.

In this paper, we make the following contributions: (1) We propose a methodology that (i) formulates an approximate MDP by replacing the formulation of the operational decision-making problem with a prediction function defined by offline supervised learning and (ii) uses approximate DP and RL techniques to obtain approximate control policies. (2) We apply the proposed methodology to a distribution logistics problem from the literature [7] and we show computationally that our policies provide increased profit at a reduced evaluation time.

The remainder of the paper is organized as follows. In Section 2, we introduce our methodology. In Section 3, we introduce an application to distribution logistics as well as the considerations made in order to formulate the supervised learning task. In Section 4, we provide results of various control policies and compare them to baselines. Finally, Section 5 concludes.
2 Methodology

In this section, we first introduce a general MDP formulation of our booking control problem, following closely the notation in the literature, e.g., [22]. Second, we describe a formulation based on an approximation of the operational costs.
2.1 MDP Formulation

Let N denote the set of requests with cardinality |N| = n. The decision-making problem is defined over T periods indexed by t ∈ {1, 2, ..., T}. The probability that request j ∈ N is made at time t is given by λ_{tj}, and the probability that no request occurs at period t is given by λ_{t0}. The time intervals are defined to be small enough that at most one request is received in each period. The revenue associated with accepting a request j is p_j.

To formulate this as an MDP, we let w_{jt} denote the number of requests j accepted before time t and w_t ∈ R^n, t = 1, ..., T + 1, denote the state vector whose j-th component is w_{jt}. The decision variables are denoted by u_{tj} ∈ {0, 1}, where u_{tj} = 1 if an offer for request j is accepted at time t. The deterministic transition is given by w_{t+1} = w_t + e_j u_{tj}, where e_j ∈ R^n has its j-th component equal to 1 and every other component equal to 0.

An operational decision-making problem occurs at the end of the booking period and is related to the fulfillment of the accepted requests. This problem depends only on the state at the end of the time horizon, w_{T+1}. We denote the operational cost by Γ : R^n → R_−.

With the above definitions, we now define the dynamic program that represents the maximal expected profit from period t onward. The value function is denoted by V_t : R^n → R and is given by

$$V_t(w_t) = \lambda_{t0} V_{t+1}(w_t) + \sum_{j \in \mathcal{N}} \lambda_{tj} \max_{u_{tj} \in \{0,1\}} \left\{ p_j u_{tj} + V_{t+1}(w_t + u_{tj} e_j) \right\}, \quad t = 1, \ldots, T, \tag{1}$$

$$V_{T+1}(w_{T+1}) = \Gamma(w_{T+1}), \tag{2}$$

with w_1 = 0.

2.2 Approximating the Operational Cost

Solving the MDP (1)–(2) can be intractable even for small instances. For this reason, approximate DP or RL appears to be a natural choice. However, these algorithms typically rely on some form of policy evaluation that consists in simulating trajectories of the system under a given policy. In our context, such policy evaluation is computationally costly due to (2). To overcome this, we propose the use of an approximation of Γ. We first introduce a mapping from the state at the end of the time horizon to an m-dimensional representation with the function g : R^n → R^m. Then, we define an approximation of Γ as φ : R^m → R_−, and an approximate MDP formulation is given by

$$\tilde{V}_t(w_t) = \lambda_{t0} \tilde{V}_{t+1}(w_t) + \sum_{j \in \mathcal{N}} \lambda_{tj} \max_{u_{tj} \in \{0,1\}} \left\{ p_j u_{tj} + \tilde{V}_{t+1}(w_t + u_{tj} e_j) \right\}, \quad t = 1, \ldots, T, \tag{3}$$

$$\tilde{V}_{T+1}(w_{T+1}) = \phi(g(w_{T+1})). \tag{4}$$

The approximation φ(·) in (4) can be defined in various ways, such as a problem-specific heuristic, a mixed integer program (MIP) solved to a time limit or optimality gap, or a prediction from machine learning (ML). Here, we focus on ML in combination with a heuristic. Specifically, we propose to train a supervised learning model offline to separate the problem of accurately predicting Γ from solving (3)–(4).

2.3 Data Generation for Supervised Learning

To train a supervised learning model, we require a feature mapping from the state to the input of the model and a set of labeled states at the end of the time horizon. The feature mapping is application dependent, but in general it is viewed as the function g(·) in (4). In Section 3, we describe the specific set of features we use for the application in distribution logistics.

To obtain labeled data, we simulate trajectories of the system using a stationary random policy that accepts a request with a given probability p. At the end of the time horizon, we obtain a state w_{T+1} and compute Γ(w_{T+1}). We then repeat this process for different values of p. The idea is to have a representation of feasible final states in the data (optimal and sub-optimal ones). We denote the set of N labeled data points by D = {(w^1_{T+1}, Γ(w^1_{T+1})), ..., (w^N_{T+1}, Γ(w^N_{T+1}))}.
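A minimal sketch of this labeling procedure is given below. The request sampler and the operational-cost oracle are hypothetical interfaces: in our application, the latter would call a VRP solver.

```python
import numpy as np

def generate_labels(sample_request, operational_cost, n_requests, T,
                    accept_probs, trajectories_per_p, seed=0):
    """Label final states reached by stationary random accept/reject policies.

    sample_request(t, rng): returns the index j of the request arriving at
        period t, or None if no request arrives (hypothetical interface).
    operational_cost(w): returns Gamma(w) for a final state w, e.g., by
        solving the operational problem (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    data = []
    for p in accept_probs:
        for _ in range(trajectories_per_p):
            w = np.zeros(n_requests)                 # w_1 = 0
            for t in range(1, T + 1):
                j = sample_request(t, rng)
                if j is not None and rng.random() < p:
                    w[j] += 1                        # accept request j
            data.append((w, operational_cost(w)))    # (w_{T+1}, Gamma(w_{T+1}))
    return data
```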
3 Application to Distribution Logistics

We consider the distribution logistics booking control problem described in [7]. In this context, booking requests correspond to pickup activities and the operational decision-making problem is a VRP. Each pickup request has an associated location and revenue. The cost incurred at the end of the booking period is the cost of the VRP solution and hence depends on all the accepted requests.

The problem can be formulated as (1)–(2). We now detail how we solve the VRP to obtain solutions that are comparable to those in [7]. We assume that there is a fixed number K̄ of vehicles, each with capacity Q ∈ R_+. We also assume that the depot is at location 0. The set of all nodes V is given by N ∪ {0}, i.e., the union of the request locations and the depot. The set of arcs is given by V × V and denoted by A. The cost of an arc is given by c_{ij} ∈ R_+, for (i, j) ∈ A. The optimal objective value with K vehicles is denoted by z*(w, K). If more than K̄ vehicles are required, then we allow additional outsourcing vehicles to be used at an additional fixed cost C ∈ R_+ per vehicle. We choose C large enough that the cost of adding an additional vehicle is larger than any potential revenue from the requests it can fulfill. Finally, the operational cost is given by

$$\Gamma(w) = -\min_{K \geq \bar{K},\, K \in \mathbb{Z}} \left\{ z^*(w, K) + C\,(K - \bar{K}) \right\}. \tag{5}$$

The VRP formulation is provided in Appendix A.1.
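To make (5) concrete, here is a small sketch of its evaluation; the routing oracle vrp_cost and the upper bound K_max on the number of vehicles worth trying are assumptions (in our experiments, the heuristic solver FILO provides the routing costs).

```python
def operational_cost(w, vrp_cost, K_bar, C, K_max):
    """Evaluate Gamma(w) as in Eq. (5).

    vrp_cost(w, K): routing cost z*(w, K) with at most K vehicles; assumed
        to return float('inf') if w cannot be served with K vehicles.
    """
    best = float("inf")
    for K in range(K_bar, K_max + 1):
        total = vrp_cost(w, K) + C * (K - K_bar)  # routing + outsourcing cost
        best = min(best, total)
    return -best  # Gamma maps into the nonpositive reals
```

Since C exceeds any revenue an extra vehicle could collect, the search can stop at the smallest K for which the requests fit; K_max merely caps the loop.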
Sets of Instances.
We generate 4 sets of instances with 4, 10, 15, and 50 locations. The locations are determined uniformly at random and are split into groups, where the revenue in each group differs. The request probabilities are defined such that locations with a higher revenue have a greater probability of occurring later in the booking period; a sketch of this construction follows. For a detailed description of the parameters that define each set of instances, see Appendix A.2.
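As an illustration of this construction (the exact parameter values are listed in Appendix A.2), an instance could be sampled as follows; the initial probabilities and drift values are placeholders.

```python
import numpy as np

def sample_instance(n_locations, T, init_probs, drift, seed=0):
    """Sample location coordinates and time-varying request probabilities.

    init_probs[j]: initial probability lambda_{1j} of a request at location j.
    drift[j]: per-period change; a positive drift makes (higher-revenue)
        locations more likely later in the booking period.
    """
    rng = np.random.default_rng(seed)
    coords = rng.uniform(0.0, 1.0, size=(n_locations + 1, 2))  # row 0 = depot
    lam = np.empty((T, n_locations))
    lam[0] = init_probs
    for t in range(1, T):
        lam[t] = lam[t - 1] + drift        # linear drift over the horizon
    assert (lam >= 0).all() and (lam.sum(axis=1) <= 1).all()
    return coords, lam                     # lam[t - 1, j] plays lambda_{tj}
```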
Features.
The features are computed from the state at the end of the time horizon, g(w_{T+1}). For the sake of simplicity, we consider a fixed-size input structure for the ML model. The number of locations can vary depending on the set of accepted requests. We therefore derive features from the capacity, the depot location, the total number of accepted requests per location, and aggregate statistics of the locations. For the latter, we use the distances between the locations and the depot, and the relative distances between locations. For each of these, we compute the min, max, mean, median, standard deviation, and 1st/3rd quartiles.
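A sketch of this feature mapping, assuming w holds the per-location request counts, coords the location coordinates, and that at least two locations are accepted:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def aggregate_stats(d):
    """min, max, mean, median, std, and 1st/3rd quartiles of distances d."""
    return [d.min(), d.max(), d.mean(), np.median(d), d.std(),
            np.quantile(d, 0.25), np.quantile(d, 0.75)]

def features(w, coords, depot, capacity):
    """Fixed-size input g(w_{T+1}) for the ML model (illustrative layout)."""
    accepted = coords[w > 0]                            # accepted locations
    to_depot = cdist(accepted, depot[None, :]).ravel()  # location-depot
    pairwise = pdist(accepted)                          # relative distances
    return np.concatenate([[capacity], depot, w,
                           aggregate_stats(to_depot),
                           aggregate_stats(pairwise)])
```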
Prediction task.
We seek an accurate approximation of (5) that is fast to compute. For this purpose, we use ML to predict z*(w_{T+1}, K) (in this work, we train random forest models [6]) and we compute the outsourcing cost, C(K − K̄), with a bin-packing solver (MTP, [15]).
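Combining the two components, the online approximation of (5) can be sketched as follows; the trained regressor rf (scikit-learn interface) and the bin-packing routine min_vehicles (the role played by MTP) are assumed given.

```python
def predicted_operational_cost(w, rf, feature_fn, min_vehicles, K_bar, C):
    """Approximate Gamma(w): predicted routing cost plus outsourcing cost.

    rf: regressor predicting z*(w_{T+1}, K) from features (scikit-learn API).
    min_vehicles(w): smallest number of vehicles that can pack the accepted
        requests, computed with a bin-packing solver.
    """
    z_hat = rf.predict(feature_fn(w).reshape(1, -1))[0]
    K = max(min_vehicles(w), K_bar)           # vehicles actually needed
    return -(z_hat + C * (K - K_bar))         # cf. Eq. (5)
```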
Data for Supervised Learning.
We generate one data set for each of the four sets of instances (4, 10, 15, and 50 locations) using the algorithm described in Section 2.3, with ten different values of p between 0 and 1. To compute a label z*(w_{T+1}, K) for each instance in a reasonable time, we use FILO [1], a heuristic solver for VRPs. To ensure that the VRP solutions make use of a minimal number of vehicles, we offset the depot location. We note that the data generation is fast: it takes less than five minutes for the 4-location data set and 40 minutes for the 50-location data set.

4 Computational Results

In this section, we start by presenting supervised learning performance metrics, followed by the experimental setup and results on the booking control problem. Experiments were run on an Intel Core i7-10700 2.90GHz with 32GB RAM.
We partition each of the data sets into training/validation (D_TV) and test (D_T) sets. It takes between 0.19 and 1.88 seconds to train the random forest models. We assess the prediction performance using two metrics: mean squared error (MSE) and mean absolute error (MAE). The results reported in Table 1 show that we achieve relatively good performance, but it deteriorates with the size of the instances. In particular, we note that the MSE (a metric that penalizes large errors more severely than the MAE) is quite large for the 50-location test data. Although labels are of larger magnitude when the number of locations increases, so an increase in both MSE and MAE is expected, these results can potentially be improved by generating larger data sets and using more flexible ML models.

Table 1: Supervised learning performance metrics

Locations  |D_TV|  |D_T|  Training MSE  Test MSE  Training MAE  Test MAE
4          1,000   250    1.82          5.30      0.70          1.24
10         2,000   500    3.12          13.80     1.26          2.79
15         2,000   500    1.45          11.17     0.90          2.53
50         2,000   500    23.22         139.41    3.68          9.24
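For completeness, a minimal sketch of how such models and metrics are obtained with scikit-learn; the feature matrix and labels below are random stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1250, 20))   # stand-in for feature vectors g(w_{T+1})
y = rng.normal(size=1250)         # stand-in for labels z*(w_{T+1}, K)

X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2,
                                              random_state=0)
rf = RandomForestRegressor(random_state=0).fit(X_tv, y_tv)  # default params
print("Test MSE:", mean_squared_error(y_test, rf.predict(X_test)))
print("Test MAE:", mean_absolute_error(y_test, rf.predict(X_test)))
```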
We benchmark our results against three baseline policies: the booking limit policy (BLP) and the booking limit policy with reoptimization (BLPR), as described in [7] and implemented using SCIP [2] as the MIP solver (without row generation), and the stationary random policy (rand-p) with the acceptance probability p yielding the highest mean profit (we use the same values of p as for the data generation).

It is possible to solve the smallest instances with exact DP. In this case, we report results for the exact algorithm combined with solving each operational problem with FILO (DP-Exact) and with the predicted costs (DP-ML). For the problems with more than 10 locations, our implementation in SCIP did not find incumbent solutions within a reasonable time, so the BLP and BLPR baselines are omitted for the 15- and 50-location instances.

We consider a standard set of RL and approximate DP algorithms to obtain approximate control policies: SARSA [19] as well as Monte Carlo tree search (MCTS) with Upper Confidence Bounds applied to Trees (UCT) [11]. These algorithms all rely on a set of simulated trajectories to evaluate the expected profit Ṽ_{t+1}(·) in (3). Specifically, we consider SARSA with neural state approximation. We define two variants of the MCTS algorithm distinguished by their base policy: one uses a random policy (MCTS-rand-X) and the other SARSA (MCTS-SARSA-X). Here, "X" denotes the number of iterations each algorithm uses for simulation; in our experiments we use 30 and 100 simulations per state. Each of the above algorithms uses our approximation of (5) when computing a policy (i.e., the sum of the predicted operational cost and the outsourcing cost computed with MTP). However, we use FILO in the last step to compute the final operational cost.
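For reference, a sketch of the UCT selection rule at the heart of MCTS [11]; the exploration constant c and the node bookkeeping below are illustrative rather than the exact implementation used in our experiments.

```python
import math

def uct_select(node, c=1.4):
    """Select the child maximizing mean value plus an exploration bonus.

    Assumes each child carries its visit count and the sum of simulated
    profits obtained through it (hypothetical node structure).
    """
    def score(child):
        if child.visits == 0:
            return float("inf")  # expand unvisited actions first
        exploit = child.total_value / child.visits
        explore = c * math.sqrt(math.log(node.visits) / child.visits)
        return exploit + explore

    return max(node.children, key=score)
```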
To compare performance, we evaluate each method over the same 50 realizations of requests for each of the sets of instances. We start by analyzing solution quality, followed by computing times.

In Figure 1, we provide box plots showing the distribution of the percentage gaps to the best known solution (i.e., the highest-profit solution for each instance) for each algorithm. Figure 1a shows that, as expected, DP-Exact has the lowest gaps, with a median at zero. The same algorithm with approximate operational costs (DP-ML) results in gaps close to those of DP-Exact, demonstrating the effectiveness of predicting the operational cost. Comparing our approaches to the baselines BLP and BLPR in Figures 1a and 1b, we can see that the policies obtained via any of the proposed control algorithms achieve smaller gaps than those of the BLP, BLPR, and rand-p baselines. Moreover, the BLP and BLPR policies report larger gaps than those of rand-p for 10 locations. For all sets of instances, the MCTS algorithms consistently achieve the smallest gaps, with some variations depending on the number of simulations and the base policy. As expected, a larger number of simulations leads to smaller variance.

Figure 1: Box plots of optimality gaps to the best known solutions. (a) 4 locations; (b) 10 locations; (c) 15 locations; (d) 50 locations.

We now turn our attention to the analysis of computing times (reported in detail in Appendix A.3). For this purpose, we distinguish between offline and online computing time and focus the analysis on the latter. For DP-Exact, DP-ML, SARSA, and MCTS-SARSA-X, we compute an exact or approximate value function offline. Similarly, the initial booking limit policy (BLP and BLPR) is computed offline, while any reoptimization of the booking limits and the solution of the VRP at the end of the time horizon contribute to the online computing time. For MCTS-rand-X and rand-p, the policies are computed entirely online. The time associated with generating data and training the ML models adds to the offline computing time of the corresponding algorithms.

As expected, the online computing times are the shortest for SARSA and rand-p, with an average of less than 2 seconds for all sets of instances. On the 10-location instances, the booking limit policies (BLP and BLPR) have an average online computing time comparable to MCTS-rand-30 and MCTS-SARSA-30 (10.93 and 12.63 seconds compared to 9.06 and 11.81 seconds, respectively). The computing time of the MCTS algorithms depends on the instance size. On average, it approximately doubles for 15 locations compared to 10 locations, and the 50-location instances are on average 8 times more time-consuming to solve than the 10-location ones in the case of MCTS-rand-X (11 times in the case of MCTS-SARSA-X). This leads to average online computing times between 1.2 minutes (MCTS-rand-30) and 7.5 minutes (MCTS-SARSA-100) for the largest instances.
We note that the latter is the algorithm achieving the highest quality solutions. However, using 100 simulations instead of 30 increases the online computing time by approximately a factor of 4 for all sets of instances.

The trade-off between solution quality and online computing time becomes clearly visible for the larger instances (15 and 50 locations). As highlighted in Figure 1, this trade-off can be partly controlled by the simulation budget. Noteworthy is the performance of SARSA on the largest instances: on average, the gap to the best known solution is 4.7% and the online computing time is only 1.31 seconds. In Appendix A.3, Table 2 reports the data generation and training times, and Table 3 reports the mean profit and the offline and online computing times.

Finally, we comment on the number of evaluations of the operational costs. For the sake of illustration, we compare DP-Exact and DP-ML. The former calls FILO 10,000 times to compute the value functions (offline), while DP-ML only solves 1,000 VRPs when generating data for supervised learning. The gain is even more important in the context of SARSA, which requires 25,000 cost evaluations during offline training. The MCTS-based algorithms require a number of cost evaluations (online) proportional to the number of simulations times T.

5 Conclusion
In the context of complex RM problems where the quality of the booking policies hinges on accurate evaluation of operational costs, we have proposed a methodology that formulates an approximate MDP by replacing the formulation of the operational decision-making problem with a prediction function defined by offline supervised learning. We used approximate DP and RL to obtain approximate control policies. We have applied the methodology to the distribution logistics problem in [7] and we have shown computationally that our policies provide increased profit at a reduced evaluation time. This is because our algorithm strikes a balance between online and offline computation. Accurate predictions from the ML model, combined with a bin-packing heuristic, were used to evaluate approximate operational costs online in computing times that are negligible in comparison to solving the VRPs.
Acknowledgments
We are grateful to Luca Accorsi and Daniele Vigo, who provided us access to their VRP code FILO [1], which has been very useful for our research. We would also like to thank Francesca Guerriero for sharing details on the implementation of the approach in [7].
References

[1] Luca Accorsi and Daniele Vigo. A fast and scalable heuristic for the solution of large-scale capacitated vehicle routing problems. Technical report, University of Bologna, 2020.
[2] Tobias Achterberg. SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1:1–41, 2009.
[3] Christiane Barz and Daniel Gartner. Air cargo network revenue management. Transportation Science, 50(4):1206–1222, 2016.
[4] Dimitri P. Bertsekas. Reinforcement Learning and Optimal Control. Athena Scientific, Belmont, MA, 2019.
[5] Nicolas Bondoux, Anh Quan Nguyen, Thomas Fiig, and Rodrigo Acuna-Agost. Reinforcement learning applied to airline revenue management. Journal of Revenue and Pricing Management, 19(5):332–348, 2020.
[6] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[7] Giovanni Giallombardo, Francesca Guerriero, and Giovanna Miglionico. Profit maximization via capacity control for distribution logistics problems. arXiv preprint arXiv:2008.03216, 2020.
[8] Abhijit Gosavi, Naveen Bandla, and Tapas K. Das. A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions, 34(9):729–742, 2002.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[10] Robert Klein, Sebastian Koch, Claudius Steinhardt, and Arne K. Strauss. A review of revenue management: Recent generalizations and advances in industry applications. European Journal of Operational Research, 284(2):397–412, 2020.
[11] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Lecture Notes in Computer Science, pages 282–293. Springer Berlin Heidelberg, 2006.
[12] Ryan J. Lawhead and Abhijit Gosavi. A bounded actor-critic reinforcement learning algorithm applied to airline revenue management. Engineering Applications of Artificial Intelligence, 82:252–262, 2019.
[13] Yuri Levin, Mikhail Nediak, and Huseyin Topaloglu. Cargo capacity management with allotments and spot market demand. Operations Research, 60(2):351–365, 2012.
[14] Tatsiana Levina, Yuri Levin, Jeff McGill, and Mikhail Nediak. Network cargo capacity management. Operations Research, 59(4):1008–1023, 2011.
[15] Silvano Martello. Knapsack Problems: Algorithms and Computer Implementations. Wiley-Interscience Series in Discrete Mathematics and Optimization, 1990.
[16] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[17] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, pages 8026–8037. Curran Associates, Inc., 2019.
[18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[19] Gavin A. Rummery and Mahesan Niranjan. On-line Q-learning Using Connectionist Systems, volume 37. University of Cambridge, Department of Engineering, Cambridge, UK, 1994.
[20] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[21] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.
[22] Kalyan T. Talluri and Garrett J. van Ryzin. The Theory and Practice of Revenue Management. Springer US, 2004.
A Details on the Experimental Study and Results
In this appendix, we report detailed information on the experimental study and the results discussed in Section 4.
A.1 VRP Formulation
In this section, we provide a MIP formulation for the VRP, which is similar to that in [7], with the exception that the problem is constrained by the number of vehicles K.

$$z^*(w, K) = \min \sum_{k=1}^{K} \sum_{(i,j) \in \mathcal{A}} c_{ij}\, \alpha^k_{ij} \tag{6}$$

subject to

$$\sum_{j : (i,j) \in \mathcal{A}} \alpha^k_{ij} = \beta^k_i \quad \forall i \in \mathcal{V},\ \forall k \in \{1, \ldots, K\}, \tag{7}$$
$$\sum_{j : (j,i) \in \mathcal{A}} \alpha^k_{ji} = \beta^k_i \quad \forall i \in \mathcal{V},\ \forall k \in \{1, \ldots, K\}, \tag{8}$$
$$\sum_{k=1}^{K} \beta^k_j \leq 1 \quad \forall j \in \mathcal{N}, \tag{9}$$
$$\sum_{k=1}^{K} \beta^k_0 \leq K, \tag{10}$$
$$\sum_{i \in \mathcal{S}} \sum_{j \in \mathcal{V} \setminus \mathcal{S}} \alpha^k_{ij} \geq \beta^k_h \quad \forall \mathcal{S} \subset \mathcal{V} : 0 \in \mathcal{S},\ \forall h \in \mathcal{V} \setminus \mathcal{S},\ \forall k \in \{1, \ldots, K\}, \tag{11}$$
$$q^k_j \leq Q\, \beta^k_j \quad \forall j \in \mathcal{N},\ \forall k \in \{1, \ldots, K\}, \tag{12}$$
$$\sum_{j \in \mathcal{N}} q^k_j \leq Q \quad \forall k \in \{1, \ldots, K\}, \tag{13}$$
$$\sum_{k=1}^{K} q^k_j = w_j \quad \forall j \in \mathcal{N}, \tag{14}$$
$$\alpha^k_{ij} \in \{0, 1\} \quad \forall (i,j) \in \mathcal{A},\ \forall k \in \{1, \ldots, K\}, \tag{15}$$
$$\beta^k_i \in \{0, 1\} \quad \forall i \in \mathcal{V},\ \forall k \in \{1, \ldots, K\}, \tag{16}$$
$$q^k_j \geq 0 \quad \forall j \in \mathcal{N},\ \forall k \in \{1, \ldots, K\}. \tag{17}$$

The decision variables are α^k_{ij}, β^k_i, and q^k_j: the binary variable α^k_{ij} equals 1 if vehicle k uses arc (i, j); the binary variable β^k_i equals 1 if vehicle k visits node i; and the quantity of accepted requests collected by vehicle k at location j is given by the non-negative variable q^k_j. The objective function (6) minimizes the routing cost. Constraints (7) and (8) ensure that each vehicle enters and leaves every node it visits. Constraints (9) assert that each location is visited by at most one vehicle, and (10) limits the number of vehicles to at most K. Constraints (11) ensure that every tour is connected, and (12) restrict vehicle k from collecting more than the capacity at location j. Finally, constraints (13) guarantee that the vehicle capacity is not exceeded, and (14) ensure that the accepted requests at each location are fulfilled.
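For very small instances, formulation (6)–(17) can be transcribed almost literally; below is a sketch in PySCIPOpt (the paper uses SCIP as the MIP solver, though this exact encoding is an illustration). The connectivity constraints (11) are enumerated explicitly over all subsets containing the depot, which is only viable for a handful of nodes.

```python
from itertools import combinations
from pyscipopt import Model, quicksum

def solve_vrp(w, K, Q, c, depot=0):
    """z*(w, K): minimum routing cost to collect w with at most K vehicles.

    w: accepted-request quantities per node, with w[depot] = 0.
    c: cost matrix, c[i][j] for arc (i, j).
    """
    V = range(len(w))                          # nodes; index 0 is the depot
    N = [i for i in V if i != depot]           # request locations
    A = [(i, j) for i in V for j in V if i != j]
    m = Model("vrp")
    m.hideOutput()
    a = {(k, i, j): m.addVar(vtype="B") for k in range(K) for (i, j) in A}
    b = {(k, i): m.addVar(vtype="B") for k in range(K) for i in V}
    q = {(k, j): m.addVar(lb=0.0) for k in range(K) for j in N}
    for k in range(K):
        for i in V:  # degree constraints (7)-(8)
            m.addCons(quicksum(a[k, i, j] for j in V if j != i) == b[k, i])
            m.addCons(quicksum(a[k, j, i] for j in V if j != i) == b[k, i])
        m.addCons(quicksum(q[k, j] for j in N) <= Q)              # (13)
        for j in N:
            m.addCons(q[k, j] <= Q * b[k, j])                     # (12)
        for r in range(1, len(w)):                                # (11)
            for S in combinations(V, r):
                if depot not in S:
                    continue
                cut = quicksum(a[k, i, j] for i in S
                               for j in V if j not in S)
                for h in set(V) - set(S):
                    m.addCons(cut >= b[k, h])
    for j in N:
        m.addCons(quicksum(b[k, j] for k in range(K)) <= 1)       # (9)
        m.addCons(quicksum(q[k, j] for k in range(K)) == w[j])    # (14)
    m.addCons(quicksum(b[k, depot] for k in range(K)) <= K)       # (10)
    m.setObjective(quicksum(c[i][j] * a[k, i, j]
                            for k in range(K) for (i, j) in A), "minimize")
    m.optimize()
    return m.getObjVal()
```

In practice, we rely on the heuristic FILO [1] for labeling, precisely because exact models of this kind do not scale.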
A.2 Instances

In this section, we detail the parameters that define the 4 sets of instances. In each of the instances, we determine the capacity to be proportional to the inverse demand of the locations. Specifically, we use a load factor LF and determine the capacity by

$$Q = \left\lfloor \frac{\sum_{j \in \mathcal{N} \cup \{0\}} \sum_{t=1}^{T} \lambda_{tj}}{\bar{K} \cdot LF} \right\rfloor. \tag{18}$$
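Equation (18) translates directly into code; lam is assumed to hold the request probabilities λ_{tj}, including the no-request column, as in the sum above.

```python
import math

def capacity(lam, K_bar, LF):
    """Vehicle capacity Q from Eq. (18)."""
    expected = sum(sum(row) for row in lam)   # sum over t and j
    return math.floor(expected / (K_bar * LF))
```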
4 Locations:
The locations are given by N = {1, ..., 4} and the number of periods is T = 20. The revenues for accepting a request from each location are p_1 = 4, p_2 = 8, p_3 = 12, and p_4 = 16. The probability of no request is λ_{t0} = 0.10 for t ∈ {1, ..., T}. The initial request probabilities for each location are λ_{11} = 0. , λ_{12} = 0. , λ_{13} = 0. , and λ_{14} = 0.05. The remainder of the request probabilities are then given by λ_{t+1,j} = λ_{tj} − 0.01 for j ∈ {1, 2} and λ_{t+1,j} = λ_{tj} + 0.01 for j ∈ {3, 4}. The coordinates for each location are sampled uniformly at random in the interval [0, ]. The number of vehicles is K̄ = 2 and the cost for each additional vehicle over K̄ is C = 100. The capacity is determined using a load factor LF = 1. .
10 Locations:
The locations are given by N = {1, ..., 10} and the number of periods is T = 30. The revenues for accepting a request from each location are p_j = 10 for j ∈ {1, ..., 4}, p_j = 12 for j ∈ {5, ..., 8}, and p_j = 20 for j ∈ {9, 10}. The probability of no request is λ_{t0} = 0.10 for t ∈ {1, ..., T}. The initial request probabilities for each location are λ_{1j} = 0.125 for j ∈ {1, ..., 4}, λ_{1j} = 0.075 for j ∈ {5, ..., 8}, and λ_{1j} = 0.05 for j ∈ {9, 10}. The remainder of the request probabilities are then given by λ_{t+1,j} = λ_{tj} − 0.001 for j ∈ {1, ..., 4}, λ_{t+1,j} = λ_{tj} for j ∈ {5, ..., 8}, and λ_{t+1,j} = λ_{tj} + 0.002 for j ∈ {9, 10}. The coordinates for each location are sampled uniformly at random in the interval [0, ]. The number of vehicles is K̄ = 4 and the cost for each additional vehicle over K̄ is C = 100. The capacity is determined using a load factor LF = 1.2.
15 Locations:
The locations are given by N = {1, ..., 15} and the number of periods is T = 50. The revenues for accepting a request from each location are p_j = 10 for j ∈ { , ..., }, p_j = 12 for j ∈ { , ..., }, and p_j = 20 for j ∈ { , ..., }. The probability of no request is λ_{t0} = 0.10 for t ∈ {1, ..., T}. The initial request probabilities for each location are λ_{1j} = 0.10 for j ∈ { , ..., }, λ_{1j} = 0.06 for j ∈ { , ..., }, and λ_{1j} = 0.02 for j ∈ { , ..., }. The remainder of the request probabilities are then given by λ_{t+1,j} = λ_{tj} − 0.001 for j ∈ { , ..., }, λ_{t+1,j} = λ_{tj} for j ∈ { , ..., }, and λ_{t+1,j} = λ_{tj} + 0.001 for j ∈ { , ..., }. The coordinates for each location are sampled uniformly at random in the interval [0, ]. The number of vehicles is K̄ = 4 and the cost for each additional vehicle over K̄ is C = 250. The capacity is determined using a load factor LF = 1. .
50 Locations:
The locations are given by N = {1, ..., 50} and the number of periods is T = 100. The revenues for accepting a request from each location are p_j = 15 for j ∈ { , ..., }, p_j = 22 for j ∈ { , ..., }, and p_j = 30 for j ∈ { , ..., }. The probability of no request is λ_{t0} = 0.10 for t ∈ {1, ..., T}. The initial request probabilities for each location are λ_{1j} = 0. for j ∈ { , ..., }, λ_{1j} = 0.03 for j ∈ { , ..., }, and λ_{1j} = 0.01 for j ∈ { , ..., }. The remainder of the request probabilities are then given by λ_{t+1,j} = λ_{tj} − 0. for j ∈ { , ..., }, λ_{t+1,j} = λ_{tj} for j ∈ { , ..., }, and λ_{t+1,j} = λ_{tj} + 0. for j ∈ { , ..., }. The coordinates for each location are sampled uniformly at random in the interval [0, ]. The number of vehicles is K̄ = 4 and the cost for each additional vehicle over K̄ is C = 600. The capacity is determined using a load factor LF = 1. .
A.3 Mean Profit and Computing Times

Table 2 reports the time required to generate the data and train the supervised learning models. These times are required for all algorithms but BLP, BLPR, and rand-p. Table 3 reports the mean profit obtained by each approach as well as the offline and online computing times described in Section 4.

Table 2: Data generation and supervised learning times. All times in seconds.

Locations  Data Generation Time  Training Time
4          285.77                0.19
10         1577.34               0.76
15         2698.57               0.90
50         2379.68               1.88

Table 3: Mean profits and computing times. All times in seconds.

Locations  Method           Mean Profit  Online Time  Offline Time
4          DP-Exact                                    –
50         SARSA            1098.78      1.31          7045.92
50         MCTS-rand-30     1095.02      74.92         –
50         MCTS-rand-100    1118.60      296.10        –
50         MCTS-SARSA-30    1112.45      123.47        7045.92
50         MCTS-SARSA-100                              –

A.4 Model Parameters
For supervised learning, we use the implementation of random forests from scikit-learn [18]. We note that our results were obtained using the default parameters provided in scikit-learn; changing the model parameters did not have a significant impact on the results.

For SARSA, we implement exploration by taking a random action with probability ε = 0.10. We use PyTorch [17] to implement the neural value function approximation. For all sets of instances, we use the Adam optimizer [9] with the MSE loss. We use a neural network with one hidden layer and a learning rate that vary depending on the problem setting; the parameters for each setting are reported in Table 4. In each setting, we train SARSA for 25,000 iterations.
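A minimal sketch of such a value-function approximator; the hidden-layer widths follow Table 4, while the input dimension, the ReLU activation, and the learning rate shown are assumptions.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """One-hidden-layer network approximating the state value."""
    def __init__(self, state_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),                      # activation is an assumption
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, w):
        return self.net(w)

model = ValueNet(state_dim=4, hidden_dim=128)   # 4-location setting
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr: placeholder
loss_fn = nn.MSELoss()
```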
Table 4: Neural network hyperparameters

Locations  Hidden layer dimension  Learning rate
4          128                     1e−
10         256                     1e−
15         256                     1e−
50         1024                    1e−

Table 5: MCTS UCT hyperparameters