[PDF] Learn-n-Route: Learning implicit preferences for vehicle routing

Abstract

We investigate a learning decision support system for vehicle routing, where the routing engine learns implicit preferences that human planners have when manually creating route plans (or routings). The goal is to use these learned subjective preferences on top of the distance-based objective criterion in vehicle routing systems. This is an alternative to the practice of distinctively formulating a custom VRP for every company with its own routing requirements. Instead, we assume the presence of past vehicle routing solutions over similar sets of customers, and learn to make similar choices. The learning approach is based on the concept of learning a Markov model, which corresponds to a probabilistic transition matrix, rather than a deterministic distance matrix. This nevertheless allows us to use existing arc routing VRP software in creating the actual routings, and to optimize over both distances and preferences at the same time. For the learning, we explore different schemes to construct the probabilistic transition matrix that can co-evolve with changing preferences over time. Our results on a use-case with a small transportation company show that our method is able to generate results that are close to the manually created solutions, without needing to characterize all constraints and sub-objectives explicitly. Even in the case of changes in the customer sets, our method is able to find solutions that are closer to the actual routings than when using only distances, and hence, solutions that require fewer manual changes when transformed into practical routings.

Full PDF

LLearn-n-Route: Learning implicit preferences for vehicle routing

Rocsildes Canoy , Víctor Bucarey , Jayanta Mandi , and Tias Guns Vrije Universiteit Brussel, Brussels, Belgium. {ﬁrstname.lastname}@vub.be

January 12, 2021

Abstract

We investigate a learning decision support system for vehicle routing, where the routing enginelearns implicit preferences that human planners have when manually creating route plans (or routings ).The goal is to use these learned subjective preferences on top of the distance-based objective criterionin vehicle routing systems. This is an alternative to the practice of distinctively formulating a customVRP for every company with its own routing requirements. Instead, we assume the presence ofpast vehicle routing solutions over similar sets of customers, and learn to make similar choices. Thelearning approach is based on the concept of learning a Markov model, which corresponds to aprobabilistic transition matrix, rather than a deterministic distance matrix. This nevertheless allowsus to use existing arc routing VRP software in creating the actual routings, and to optimize overboth distances and preferences at the same time. For the learning, we explore diﬀerent schemes toconstruct the probabilistic transition matrix that can co-evolve with changing preferences over time.Our results on a use-case with a small transportation company show that our method is able togenerate results that are close to the manually created solutions, without needing to characterizeall constraints and sub-objectives explicitly. Even in the case of changes in the customer sets, ourmethod is able to ﬁnd solutions that are closer to the actual routings than when using only distances,and hence, solutions that require fewer manual changes when transformed into practical routings.

Keywords: vehicle routing, preference learning, Markov models

Vehicle routing problems (VRP) at small or medium-sized enterprises (SME) is constrained by the limitednumber of vehicles, the capacity of each delivery vehicle, and the scheduling horizon within which alldeliveries have to be made. The objective, often implicitly, can include a wide range of company goalsincluding reducing operational costs, minimizing fuel consumption and carbon emissions, as well asoptimizing driver familiarity with the routes and maximizing fairness by assigning tours of similar lengthand/or time duration to the drivers. Daily plans are often created in a route optimization software that iscapable of producing plans that are optimal in terms of route length and travel time. We have observed,however, that in practice, route planners usually modify the result given by the software, or simply pullout, modify, and reuse an old plan that has been used and known to work in the past. The planners, byperforming these modiﬁcations, are essentially optimizing with their own set of objectives and personalpreferences.These preferences are often subjective and sometimes delicate to formalize in constraints. Someexamples are: where the best places for lunch breaks are, which stops are best served earlier or later,and which drivers (tours) are more ﬂexible in receiving more stops. Failure to capture such human aspects is often a source of frustration for both drivers and planners, and a cause for reluctance to usethe optimisation software. Furthermore, planners may ﬁnd it easier or more eﬀective to manually changea previous solution than to provide or update the detailed information in the system. Being able toautomatically capture such preferences without the need to formalize them can hence lead to a wideracceptance and better use of optimisation systems. 1 a r X i v : . [ c s . A I] J a n he goal of this research is to learn the preferences of the planners (and implicitly the drivers) whenchoosing one option over another. Hence, the goal is to build an ‘intelligent’ system that can eﬀectivelyreuse all of the knowledge and eﬀort that have been put into creating previous routings; much like howhuman planners use prior experience. We focus on techniques from the domain of artiﬁcial intelligencethat can learn from historical data, and that can be used to manage and recommend similar routes asused in the past. This is in contrast to the current practice of optimizing a separate VRP instance eachday.To learn from historical data, we take inspiration from various machine learning papers on routeprediction for a single vehicle: Markov models developed from historical data have been applied to driverturn prediction [Krumm 2008], prediction of the remainder of the route by looking at the previous roadsegments taken by the driver [Ashbrook and Starner 2003], and predicting individual road choices giventhe origin and destination [Simmons et al. 2006]. These studies have produced positive and encouragingresults for their respective tasks. Hence, in this work, we investigate the use of Markov models forpredicting the route choices for an entire ﬂeet, and how to use these choices to create preference-aware solutions to the VRP.With a Markov model, route optimization can be done by maximizing the product of the probabilitiesof the sequential actions taken by the vehicles, which corresponds to maximizing the sum of log likelihoodsover the arcs. In the case of a ﬁrst order Markov model, a key property of our approach is that it canreuse any existing VRP solution method to ﬁnd the maximum likelihood solution, as it corresponds tothe sum of log likelihoods over the individual arcs. This is a promising novel approach to the vehiclerouting problem.Our proposed ﬁrst order Markov model approach has been published in a conference proceeding[Canoy and Guns 2019] and in this article, we extend this methodology by developing the following newcontributions:• We provide a more general formalization of the methods for learning and optimizing the preferences,and detail both a ﬁrst order and a second order Markov model.• We introduce a new and superior weighing technique called the exponential weighing scheme andexamine how it handles data drift, i.e., when there is either a large reduction or a sudden increasein the number of stops.• We present an extended study on distance-based, preference-based, and combined optimisation.This manuscript is organized as follows: in Section 2, we provide the relevant work related to thisarticle. We discuss in Section 3 routing models that maximize the probability of the learned preferences.Later in the section, we introduce the algorithms to learn the transition matrix from historical data. Thecomparison of the diﬀerent construction schemes and the experimental results on actual company dataare shown in Section 4. Finally, our conclusions and future research directions are presented in Section 5.

The ﬁrst mathematical formulation and algorithmic approach to solving the Vehicle Routing Problem(VRP) appeared in the paper by Dantzig and Ramser 1959 which aimed to ﬁnd an optimal routing for aﬂeet of gasoline delivery trucks. Since its introduction, the VRP has become one of the most studiedcombinatorial optimization problems. Faced on a daily basis by distributors and logistics companiesworldwide, the problem has attracted a lot of attention due to its signiﬁcant economic importance.A large part of the research eﬀort concerning the VRP has focused on its classical and basic version—the

Capacitated

Vehicle Routing Problem (CVRP). The presumption is that the algorithms developed forCVRP could be extended and applied to more complicated real-world cases [Laporte 2007]. Due to therecent development of new and more eﬃcient optimisation methods, research interest has shifted towardsrealistic VRP variants known as Rich VRP [Caceres-Cruz et al. 2015; Drexl 2012]. These problems dealwith realistic, and sometimes multi-objective, optimisation functions, uncertainty, and a wide variety ofreal-life constraints related to time and distance factors, inventory and scheduling, environmental andenergy concerns, driver experience, etc. [Mor and Speranza 2020].The VRP becomes increasingly complex as additional sub-objectives and constraints are introduced.The inclusion of preferences, for example, necessitates the diﬃcult, if not impossible, task of formalizing2he route planners’ knowledge and choice preferences explicitly in terms of constraints and weights. Inmost cases, it is much easier to get examples and historical solutions rather than to extract explicit decisionrules from the planners, as observed by Potvin et al. in the case of vehicle dispatching [Potvin et al. 1993].One approach is to use learning techniques, particularly learning by examples, to reproduce the planners’decision behavior. To this end, we develop a new method that learns from previous solutions by using aMarkov model, which can simplify the problem formulation phase of vehicle routing by eliminating theneed to characterize all preference constraints and sub-objectives explicitly.Learning from historical solutions has been investigated before within the context of constraintprogramming. For example, in the paper of Beldiceanu and Simonis on constraint seeker [Beldiceanu andSimonis 2011] and model seeker [Beldiceanu and Simonis 2012], and Picard-Cantin et al. on learningconstraint parameters from data [Picard-Cantin et al. 2016], where a Markov chain is used, but forindividual constraints. In this respect, our goal is not to learn constraint instantiations, but rather tolearn choice preferences, e.g., as part of the objective. Related to the latter is the work on ConstructivePreference Elicitation [Dragone et al. 2018], although this elicitation technique actively queries the user,as does constraint acquisition [Bessiere et al. 2017].Our motivation for Markov models is that they have been previously used in route prediction ofindividual vehicles. Ashbrook and Starner 2003 presented a Markov model to predict the next stop of adriving vehicle. They estimated transition probabilities as frequencies and considered Markov models thatdepend not only on the actual stop but also on past stops; that is, high order Markov chains. Personalizedroute prediction has been used in transportation systems that provide drivers with real-time traﬃcinformation and in intelligent vehicle systems for optimizing energy eﬃciency in hybrid vehicles [Deguchiet al. 2004]. Ye et al. 2015 introduced a route prediction method that can accurately predict an entireroute early in the trip. The method is based on Hidden Markov Models (HMM) trained from the driver’spast history. A route prediction algorithm that predicts a driving route for a given pair of origin anddestination was presented by Wang et al. 2015. Also based on the ﬁrst order Markov model, the algorithmuses a probability transition matrix that was constructed to represent the knowledge of the driver’spreferred links and routes. An algorithm for driver turn prediction using a Markov model is developedin Krumm 2008. Trained from the driver’s historical data, the model makes a probabilistic predictionbased on a short sequence of just-driven road segments. Experimental results showed that by lookingat the most recent 10 segments into the past, the model can eﬀectively predict the next segment withabout 90% accuracy. Another learning approach for driving preferences is to learn a linear cost functiondepending on several attributes, e.g., time, distance, fuel consumption, number of edges in the route, etc.,as presented in Potvin et al. 1993; Funke et al. 2016, where linear programs to compute the weights wereproposed. While related to our current work, we do not assume that a partial route is already given.Moreover, in our case, we are interested in optimizing the routing across multiple vehicles rather thanmaking predictions for a single vehicle.Several algorithms were deployed to direct users through paths between origin and destination. Thisdeployment is done either through an enumeration of paths [Yang et al. 2015], or by adapting weights onedges and computing shortest paths [Letchner et al. 2006; Delling et al. 2015; Guo et al. 2020]. Althoughsimilar to this second category, where weights represent probabilities, our work diﬀers in that instead ofplanning single paths (i.e., for an origin-destination pair), our methods use preferences to create a routingplan for several vehicles. Also, we are adapting the classical CVRP mixed integer formulation in order togenerate high probability routes.The importance of preference-based learning can be realized from the study of Ceikute and Jensen2013, which ﬁnds that local drivers often prefer paths that are not optimal in terms of travel time orcost. This is because local drivers favor routes based on their personal preferences. Chang et al. 2011 andLetchner et al. 2006 proposed a framework to recommend personalized routes to the driver by taking intoaccount which roads he or she is familiar with. Yang et al. estimated a driver’s preferences by comparingthe paths used by the driver to the skyline paths [Yang et al. 2014]. Yang and Yang 2019 introduced

PathRank , a machine learning model trained with historical data, to rank candidate paths based on thedriver’s preferences. Once again, the above-mentioned are mostly concerned with point-to-point ‘shortestpath’ travel where the source and destination pairs are known. Our work, on the other hand, focuseson VRP: recommending a multi-vehicle solution over shared stops. Finally, a recent article showed anapplication of routing with preferences using patient preferences over the drivers/vehicles to create routevisits of caregivers, where the problem is formulated as a bi-criteria optimization problem [Ait Haddadene3t al. 2019]. In contrast, our work is concerned in creating routings while taking into account the implicit planner and driver preferences over the set of customers.

In this section we introduce the probabilistic model underlying our methodology to learn the drivers’preferences. Furthermore, we adapt the classical formulation of the CVRP to compute the route with thehighest probability.

We begin with a set of stops V = { , , . . . , n } , where 0 represents the depot, and the other stops representthe locations of the customers, and a fully-connected directed graph G = ( V, A ). We call a routing x of V with m homogeneous vehicles, a set of at most m tours in G with each tour starting from and endingat the depot 0 and such that each node in V \{ } is visited exactly once.We denote the set of all possible routings of m vehicles over V as X mV , or simply X when it is clearfrom the context. Our goal is to determine a probabilistic model Pr such that Pr ( x ) is the probability ofthe vehicles following the routing x ∈ X .A Markov model is deﬁned over a single sequence of events, and so to be able to use Markov modelsfor Pr ( x ) , we ﬁrst daisy-chain the set of routes x . This is done by connecting the routes into one longsequence: 0 −→ s −→ s −→ . . . −→ s p − −→ s p −→ −→ −→ s −→ s −→ . . . −→ s q − −→ s q −→ −→ ...0 −→ s m −→ s m −→ . . . −→ s mr − −→ s mr −→ , which we then simplify by replacing the (0 →

0) connections by 0 . Given this sequence interpretation of x , we can decompose the joint probability Pr ( x ) as the probability of each element conditional on theelements before it in the sequence: Pr ( x ) = Pr (0 , s , s , s , . . . , , s , . . . )= Pr ( X = 0) Pr ( X = s | X = 0) Pr ( X = s | X = 0 , X = s ) Pr ( X = s | X = 0 , X = s , X = s ) · · · . Estimating all conditional probabilities Pr ( s ij | , . . . , s ij − , s ij − ) is diﬃcult because of the large numberof such possible probabilities and their sparsity in the data. A common approach is to use a Markovianapproximation of the probabilistic model [Ames 1989], which states that the probability of an eventdepends only on the state of the previous event. The main advantage is that, with this approach, themodel can be described with much fewer parameters to learn.We consider a ﬁrst order Markov chain { X t } t ≥ over V with transition probabilities between statesas: Pr ( X t +1 = j | X t = i, X t − = i t − , . . . , X = i ) ≈ Pr ( X t +1 = j | X t = i ) . The joint probability therefore becomes: Pr ( x ) ≈ Pr ( X = 0) Pr ( X = s | X = 0) Pr ( X = s | X = s ) Pr ( X = s | X = s ) · · · . On one hand, the interpretation of this model’s assumption is straightforward: a driver’s decision to gofrom a stop i to another stop j depends only on the current position (current stop), and does not considerthe stops visited before that. With this assumption, the Markov model can be seen as a probabilitydistribution over the set of arcs A . On the other hand, this model is in fact an approximation: in practice,drivers and planners also take the stops preceding the current stop, and potentially even the stops afterit, into account when making their decision. In either case, the current stop X t is the key deciding factor.4 ﬁner-grained approximation is to consider a higher order, in particular a second order , Markov chain.In this case, the state of an event contains both the current stop and the stop before it. By extending thehistory of previously visited arcs, we get a better approximation of the probabilistic model over X . Forthe second order model, the conditional probabilities are approximated as follows: Pr ( X t +1 = k | X t = j, X t − = i, . . . , X = i ) ≈ Pr ( X t +1 = k | X t − = i, X t = j ) . The trade-oﬀ is that the number of parameters to estimate increases from O ( n ) to O ( n ). Later inthis section, we will investigate the diﬀerent ways of estimating the conditional probabilities from thehistorical routings. Given a (learned) probability distribution Pr over a set of stops V , the goal is to ﬁnd the maximumlikelihood routing, that is, the routing with the highest joint probability. Let X be the set of all possibleroutings for a given number m of vehicles and a set of stops V . The goal of ﬁnding the maximumlikelihood routing then becomes: max x ∈X Pr ( x ) . (1)Naturally, the set of all possible routings X is not given explicitly. However, we can deﬁne it implicitly asa set of constraints over decision variables, as common in optimisation approaches to vehicle routing. Weﬁrst look into formulating the constraints, and then the objective function corresponding to Eq. (1). Constraints.

Note that while the probability distribution Pr is deﬁned over all possible routings, we are free to imposeadditional constraints over X when searching for the maximum likelihood routing. In eﬀect, this will be aconstrained maximum likelihood inference problem.In this paper, we will use a standard Capacitated Vehicle Routing Problem (CVRP) formulation bymeans of Mixed Integer Programming (MIP) [Toth and Vigo 2014]. To do so, we represent the routing X by the binary vector x ∈ { , } | A | where A is the set of all possible edges between V , that is, the arcset of the complete graph G = ( V, A ). Each component of a vector x , namely x ij , takes the value 1 ifthere exists a transition from i to j in sequence x , and 0 otherwise. The CVRP routing problem with m homogeneous vehicles, each with capacity Q , and a given demand q i for every stop i ∈ V , can then berepresented by the following standard constraints [Toth and Vigo 2014]: X j ∈ V, j = i x ij = 1 i ∈ V (2) X i ∈ V, i = j x ij = 1 j ∈ V (3) n X j =1 x j ≤ m (4)if x ij = 1 ⇒ u i + q j = u j ( i, j ) ∈ A : j = 0 , i = 0 (5) q i ≤ u i ≤ Q i ∈ V \{ } (6) x ij ∈ { , } ( i, j ) ∈ A, (7)where constraints (2) and (3) impose that every stop other than the depot must be visited by exactlyone vehicle and that exactly one vehicle must leave from each node. Constraint (4) limits the numberof routes to the size of the ﬂeet, m . In constraint (5), u j denotes the cumulative vehicle load at node j . The constraint plays a dual role—it prevents the formation of subtours, i.e., cycling routes that donot pass through the depot, and together with constraint (6), it ensures that the vehicle capacity is notexceeded. While the model does not make explicit which stop belongs to which route, this informationcan be reconstructed from the active arcs x ij in the solution.5 bjective function. For a ﬁrst order Markov chain approximation, the joint probability over a daisy-chained sequence of stops x = { , s , s , s , . . . s n , } decomposes into the following: Pr ( x ) ≈ Pr ( X = 0) Pr ( X = s | X = 0) · · · Pr ( X n +1 = 0 | X n = s n )= Pr ( X = 0) Y ( i → j ) ∈ x Pr ( X t +1 = j | X t = i ) . (8)In our routing setting, by construction, the ﬁrst stop is always the depot. Hence, we know that Pr ( X = 0) = 1. In order to transform the above into an objective function over decision variables x ij we remark that x ij = 1 ⇔ ( i → j ) ∈ x . We can now derive the following:arg max x Pr ( x ) ≈ arg max x Pr ( X = 0) Y ( i → j ) ∈ x Pr ( X t +1 = j | X t = i )= arg max x Y ( i → j ) ∈ x Pr ( X t +1 = j | X t = i )= arg max x Y ( i,j ) | x ij =1 Pr ( X t +1 = j | X t = i ) x ij = arg max x X ( i,j ) ∈ A log Pr ( X t +1 = j | X t = i ) x ij = arg min x X ( i,j ) ∈ A − log Pr ( X t +1 = j | X t = i ) x ij . (9)If we deﬁne ˆ p ij = − log Pr ( X t +1 = j | X t = i ), then we obtain the following standard CVRP formula-tion : min x X ( i,j ) ∈ A ˆ p ij x ij (10)s.t. Constraints (2) - (7) . Hence, using ˆ p as the cost vector in the traditional VRP setting enables us to use any existing VRPsolver to ﬁnd the maximum likelihood routing of a given ﬁrst order Markov model.We now consider the case of a second order Markov chain : Pr ( x ) ≈ Pr ( X = 0) Pr ( X = s | X = 0) Pr ( X = s | X = 0 , X = s ) · · · Pr ( X n +1 = 0 | X n − = s n − , X n = s n )= Pr ( X = 0) Pr ( X = s | X = 0) Y ( i → j → k ) ∈ x Pr ( X t +1 = k | X t − = i, X t = j ) . (11)To construct a corresponding objective function over decision variables x ij , we have to be more carefulthan in the ﬁrst order case. More speciﬁcally, the second order Markov model includes the probabilitiesof transitions over the diﬀerent vehicles, such as ( s p → → s ). However, vehicles (and hence routes)are assumed to be homogeneous and therefore interchangeable. Hence, the ordering of the routes whenconstructing the daisy-chain is arbitrary. The last stop of one route should therefore not have any inﬂuenceon the ﬁrst stop of another route. We will hence ignore all transitions of the kind ( s p → → s ) andinstead use the ﬁrst order transition probability from the depot: (0 → s ). These need to be estimated6or the ﬁrst transition Pr ( X = s | X = 0) anyway. This leads to the following derivation:arg max x Pr ( X = 0) Pr ( X = s | X = 0) Y ( i → j → k ) ∈ x Pr ( X t +1 = k | X t − = i, X t = j ) ≈ arg max x Pr ( X = 0) Y (0 → i ) ∈ x Pr ( X t +1 = i | X t = 0) Y ( i → j → k ) ∈ x j =0 Pr ( X t +1 = k | X t − = i, X t = j )= arg min x X (0 ,i ) ∈ A − log Pr ( X t +1 = i | X t = 0) x i + X ( i,j ) ∈ A, ( j,k ) ∈ Aj =0 − log Pr ( X t +1 = k | X t − = i, X t = j ) x ij x jk . (12)If we now deﬁne ˆ p ijk = − log Pr ( X t +1 = k | X t − = i, X t = j ), and ˆ p ij = − log Pr ( X t +1 = j | X t = i ) asbefore, then we obtain the following CVRP formulation:min x X (0 ,i ) ∈ A ˆ p i x i + X ( i,j ) ∈ Aj =0 X ( j,k ) ∈ Aj =0 ˆ p ijk x ij x jk (13)s.t. Constraints (2) - (7) . Note how we are still using the same x ij decision variables. The objective function, however, is nowa quadratic function and hence the problem becomes a Mixed Integer Quadratic Problem (MIQP).There are well-known techniques to linearize the quadratic term and obtain an MIP by introducingadditional constraints and decision variables, e.g., by the McCormick inequalities [McCormick 1976].Many mathematical programming solvers for MIQP also exist. However, it is reasonable to expect thatoptimizing with this objective function of the second order Markov chain will be computationally harderthan optimizing with the linear objective function of the ﬁrst order model.We now explain how to learn the transition probability matrix from historical solutions (Section 3.3),followed by diﬀerent ways of weighing the instances (Section 3.4). Finally, in Section 3.5 we discuss howto combine a learned probability matrix with a distance-based probability matrix. We now outline the main approach that we propose for estimating the transition probability matrix fromhistorical solutions. We assume given a dataset H = { ( V t , m t , z t , x t ) } of instances with each instance atuple where t is a timestamp (e.g., a speciﬁc day in case of daily vehicle routing), V t is the set of stopsserved by the solution at timestamp t , m t is the number of vehicles used, x t is the subjectively optimalsolution, that is, a solution created by an expert or the actual driven routes extracted from an on-boardsystem, and z t are additional problem parameters such as the demand of every stop, or some other knownconstraint parameters.Note that V t , as well as m t and z t , can change from instance to instance. The approach we proposeassumes that while V t indeed changes from day to day, there will always be some overlap with otherdays, for example, because the set of stops is composed of regular and occasional or ad hoc stops. Probability estimation.

The basic idea of our approach is to estimate all the conditional probabilities Pr ( X n +1 = j | X n = i ) overthe set of all stops in the data: V all = S t V t .The basic properties of conditional probability theory state that the conditional probability of movingfrom stop i to stop j is: Pr ( X n +1 = j | X n = i ) = Pr ( X n +1 = j, X n = i ) Pr ( X n = i ) , with Pr ( X n = i ) = P k Pr ( X n +1 = k, X n = i ). 7 lgorithm 1 Estimating a ﬁrst order transition matrix from historical instances

Input:

A dataset H = { ( V t , m t , z t , x t ) } where we will assume that the timestamps t have been replacedby integer values 1 . . . | H | in a way that respects the ordering of the timestamps; a weight w t per datainstance, where the default value is w t = 1; and the Laplace smoothing parameter λ ≥ V all = S t V t and let µ = | V all | .2. For each ( V t , m t , z t , x t ) ∈ H , construct an adjacency matrix A tµ × µ = [ a tij ], where a tij = (cid:74) ( s i , s j ) ∈ x t (cid:75) . (17)3. Build the arc transition frequency matrix F µ × µ with the weights w t and the adjacency matricesconstructed in Step 2: F = X t w t A t . (18)4. Apply the Laplace smoothing technique to F µ × µ = [ f ij ] to get the ﬁnal probability estimates ˆP µ × µ = [ˆ p ij ]: ˆ p ij = f ij + λ P k ( f ik + λ ) . (19) Output:

Transition matrix ˆP µ × µ = [ˆ p ij ], where ˆ p ij = Pr ( X n +1 = s j | X n = s i ).To compute this, we will count how often two stops follow each other in the data. We denote by (cid:74) · (cid:75) the Iverson bracket which returns the value 1 if the statement inside the bracket evaluates to true , and 0otherwise. We deﬁne the frequency of a transition ( i → j ) in dataset H as: f ij = X t (cid:74) ( i → j ) ∈ x t (cid:75) . (14)Now we can empirically estimate the conditional probabilities from the data as follows: Pr ( X n +1 = s j | X n = s i ) = f ij P k f ik . (15) Laplace smoothing.

To account for the fact that the number of samples may be small, and some f ij may be zero, we cansmooth the probabilities using the Laplace smoothing technique [Johnson 1932; Chen and Goodman1999; Ye et al. 2015]. Laplace smoothing reduces the impact of data sparseness arising in the processof building the transition matrix. As a result of smoothing, these arcs are given a small, non-negativeprobability, thereby eliminating the zeros in the resulting transition matrix. Conceptually, with λ as thesmoothing parameter ( λ = 0 corresponds to no smoothing), we add λ observations to each event. Theconditional probability estimation therefore becomes: Pr ( X n +1 = s j | X n = s i ) = f ij + λ P k ( f ik + λ ) . (16) First-order estimation algorithm.Algorithm 1 shows the algorithm for estimating the ﬁrst order probability transition matrix. Thedimension of the matrix, that is, the size of the total set of unique stops, is determined in Step 1. In Step2, an adjacency matrix is constructed for each historical instance. A frequency matrix is constructed inStep 3 by computing the (weighted) sum of all the adjacency matrices (Eq. (18)). By default, w t = 1 forall instances; more advanced schemes will be discussed in Section 3.4. Finally, during normalisation inStep 4, Laplace smoothing is applied if λ >

0. 8able 1: An overview of the proposed weighing schemes

Name Weights Squared Weights Exponential Weights (EXP)Uniform (UNIF) w t = 1 —Time-based (TIME) w t = t/T w t = ( t/T ) w t = α (1 − α ) T − t Similarity-based (SIMI) w t = J ( V t , V T ) w t = J ( V t , V T ) Second-order estimation algorithm.

The second order transition matrix can be analogously constructed by building a three-dimensional matrix A tµ × µ × µ = [ a tijk ] in Step 2, where a tijk = (cid:74) ( s i , s j , s k ) ∈ x t (cid:75) . Laplace smoothing is then applied bydividing each element of the frequency matrix F µ × µ × µ = [ f ijk := P t (cid:74) ( i → j ) , ( j → k ) ∈ x t (cid:75) ] with therow sum to obtain the transition matrix ˆP µ × µ × µ = [ ˆ p ijk ] , whereˆ p ijk = Pr ( X n +1 = s k | X n − = s i , X n = s j ) = f ijk + λ P l ( f ijl + λ ) . We also estimate the conditional probabilities of leaving from the depot:ˆ p i = f j + λ P k ( f k + λ ) . With these estimates, we can use the above-deﬁned optimisation formulation to ﬁnd the maximumlikelihood VRP solution.

The estimation method described above assumes that each instance is equally important, and that theVRP solution of each instance is independently drawn from the same distribution. However, we know thatthis is not entirely true: human planners learn from day to day, with more recent experiences typicallycloser to mind. Furthermore, because the set of stops changes each day, the similarity of the currentinstance to previous instances may also change how choices will be made.We identify two types of context that change the importance of previous instances—the temporalcontext and the similarity context. The temporal context is known in machine learning as conceptdrift [Gama et al. 2014]. The concept of similarity is central in many machine learning and data miningapproaches. We now discuss two modiﬁcations of the above transition probability estimation algorithmthat will take these contexts into account.To this end, we assume that we are currently at timestamp T. Also, we are given a set of stops V T ,number of vehicles m T , and other parameters z T . That is, we have a new unseen tuple ( V T , m T , z T )for which we are to determine the corresponding x T . We also have a set of historical instances untiltimestamp T that we denote with H = { ( V t , m t , z t , x t ) } T − t =1 . To take the temporal and similarity aspects into account, we will deﬁne a prior on each historicaltraining instance, based on the current tuple ( V T , m T , z T ). More speciﬁcally, we use a weighing schemewhere we deﬁne a weight w t for each historical instance in H based on the properties of the current tuple.Table 1 provides an overview of the three types of prior: uniform, time-based, and similarity-based,with other possible variations within each type. Time-based weighing.

In machine learning, it is well known that temporally ordered data can have concept drift [Gama et al.2014], that is, the underlying distribution can change over time. To account for this, we can use atime-based weighing scheme where older instances are given smaller weights, and newer instances largerones. Assuming the timestamps are (replaced by) integer values in a way that respects the same ordering,we can weigh the instances as: w t = tT . (20)9his assumes a linearly increasing importance of instances, with the oldest (ﬁrst) instance having weight1 /T and the latest instance weight ( T − /T . We can increase the importance of the newer instances bysquaring the weights: w t = ( t/T ) or more generally, w t = ( t/T ) a for some value a . Exponential smoothing.

A further ampliﬁcation of the importance of more recent instances can be done by considering anexponential weighing scheme, a popular approach in time series analysis and forecasting [Cox 1961;Harrison 1967].In principle, exponential smoothing uses a weighted average of the most recent observations and theforecasted data. In general, let ˆ f T be the estimated data up to, but not including time period T . Then,using the data f T of the current timestamp, the following gives the forecast for the next time period:ˆ f T +1 = αf T + (1 − α ) ˆ f T , (21)where the smoothing parameter α ∈ (0 ,

1) is a weight assigned to the most recent data. Furthermore, theexpansion of Eq. (21) yields ˆ f T +1 = T X t =1 α (1 − α ) T − t f T − t . (22)This weighing scheme is called exponential smoothing because the weights in Eq. (22) decline exponentiallywith time.We can apply the same exponential smoothing on the frequency matrices, resulting in F = P t α (1 − α ) T − t A t and hence: w t = α (1 − α ) T − t . (23) Similarity-based weighing.

The stops in each instance typically vary, and the presence or absence of diﬀerent stops can lead todiﬀerent decision behaviors. To account for this, we can use the similarity between the set of stops ofthe current instance and the set of stops of each historical instance as prior. The goal is to assign largerweights to training instances that are more similar to the test instance, and smaller weights if they areless similar.The similarity of two stop sets can be measured using the Jaccard similarity coeﬃcient. The Jaccardsimilarity of two sets is deﬁned as the size of the intersection divided by the size of the union of the twosets: J ( A, B ) = | A ∩ B || A ∪ B | (24)for two non-empty sets A and B . The Jaccard similarity coeﬃcient is always a value between 0 (nooverlapping elements) and 1 (exactly the same elements).Hence, we can use the following distance-based Jaccard similarity weighing: w t = J ( V t , V T ) . (25)To further amplify the diﬀerences in the weights, we can also use the squared Jaccard coeﬃcient w t = J ( V t , V T ) or in general, w t = J ( V t , V T ) a . We have discussed learning preferences from historical solutions and how to optimize over them, leadingto the most likely VRP solution in expectation. However, this uses only the historical preferences butnever reasons at the level of distances traveled (kilometers). Using only the probability matrix can hencelead to purely mimicking the human behaviour, instead of reasoning and optimizing over both preferencesand the impact on the driven kilometers. 10o face this issue, we deﬁne a distance-based probability distribution based on the softmax function onthe distances between stops. The larger the distance between stops i and j , the shorter is the probabilityof that transition. This probability is deﬁned as:ˆ d ij = Pr dist ( X n +1 = s j | X n = s i ) = e − d ij P k e − d ik . (26)The main goal is to have a comparable measure between the driver’s preferences and the cost of drivinglong distances. In Appendix C, we show that this deﬁnition of distance-based probabilities would produce,under some mild conditions, the same set of solutions as the standard CVRP formulation with theobjective of minimizing the total distance. Combining transition probability matrices.

Given transition probability matrices ˆP = [ˆ p ij ] and ˆD = [ ˆ d ij ], we can take the convex combination asfollows: ˆ c ij = β ˆ p ij + (1 − β ) ˆ d ij . (27)Taking β = 1 corresponds to using purely the history-based transition probabilities, while β = 0 willuse only distance-based probabilities, with values in between resulting in a combination of the twoprobabilities.Note that this approach places no conditions on how the history-based transition probability matrix ˆP = [ˆ p ij ] is computed, and hence is compatible with Laplace smoothing and weighing during theconstruction of ˆP . Description of the Data.

The historical data used in the experiments consist of daily routings collected within a span of ninemonths. The routings were generated by the route planners and used by the company in their actualoperations. Each data is a numbered instance and the entire data is ordered by time.Data instances are grouped by day-of-week including Saturday and Sunday. This grouping mimics theoperational characteristic of the company. The entire data set is composed of 201 instances, equivalent toan average of 29 instances per weekday. Capacity demand estimates for each stop were provided by thecompany.

Data Visualization.Figure 1 shows the number of stops served per weekday during the entire experimental time period. A concept drift is clearly discernible starting Week 53, where a decrease in stop set size occurs. An averageof 9 vehicles servicing 35 stops are used per instance in the data before drift, and 6 vehicles (25 stops) forthe 73 instances after drift. This observation has prompted us to make explicit whenever we are using allthe data from the entire period, or only data from the period before the drift.

The number of times each of the customer stops has appeared in the historical data is shown in

Figure 2 , which exhibits a mix of regular and ad hoc (occasional) stops. At the extremes, 14 out of the73 unique stops (19.2%) have been serviced more than 195 (out of 201) times, while 30 (41.1%) haveappeared in only ten or less instances.

Evaluation Methodology.

In a traditional machine learning setup, the dataset is split into a training set and a test set. The trainingset is used for training, and the test set for evaluation. This is a batch evaluation as all test instances areevaluated in one batch. The best resulting model is then deployed.Our data, however, has a temporal aspect to it, namely the routing is performed everyday. Hence,each day one additional training instance becomes available, allowing us to incrementally grow the data11

Week Number N u m b e r o f S t o p s WD 1WD 2WD 3WD 4WD 5WD 6WD 7

Figure 1: No. of stops by weekday (WD)

Stops0255075100125150175200 F r e q u e n c y Figure 2: Frequency of Stops Across Dataand learn from it. Indeed, human planners also learn from prior experience and expand their knowledgeday by day.As this is the most sensible way in which such a system would be used, the system also has tobe evaluated in this way, i.e., by incremental evaluation . The incremental evaluation procedure isdepicted in

Algorithm 2 and the breakdown of the entire data set after the initial train-test split isshown in

Table 2 . Table 2: Initial training and test set sizes after 75% −

25% split

Before drift Entire dataWD Train Test Train Test1 14 5 23 72 12 5 21 73 11 5 19 74 13 5 22 75 14 5 22 76 14 5 22 77 15 5 23 7Total 93 35 152 49

Algorithm 2

Training and testing with an incrementally increasing training set

Input:

A dataset H = { ( V t , m t , z t , x t ) } with timestamps t as in Algorithm 1.1. Start from an initial η training instances, e.g., η = b . |H| c for a 75% −

25% initial split.2. For σ = η + 1 , . . . , |H|− ˆP σ on H σ = { ( V t , m t , z t , x t ) } t<σ using Algorithm 1.2.2. Add to ˆP σ all stops in V σ that are not in H σ , with uniform probability.2.3. Solve CVRP (10) using ˆP σ as in Section 3.2.2.4. Evaluate the CVRP solution against x σ .When making a comparison of the prediction accuracy of the proposed schemes, we evaluate theperformance using two evaluation measures based on two properties of a VRP solution, namely thegrouping of stops into routes ( Route Diﬀerence ) and the resulting chosen arcs (

Arc Diﬀerence ). Route Diﬀerence (RD) indicates the percentage of stops that were incorrectly assigned to a diﬀerentroute. Intuitively, RD may be interpreted as an estimate of how many moves between routes are necessarywhen modifying the predicted solution to match the actual grouping of stops into routes. To computeroute diﬀerence, a pairwise comparison of the routes contained in the predicted and test solution ismade. The pair of routes with the smallest diﬀerence in stops is greedily selected without replacement.12ncorrectly assigned stops are counted and the total number is taken as RD. The percentage is taken bydividing RD by the total number of stops in the whole routing.

Arc Diﬀerence (AD) measures the percentage of arcs traveled in the actual solution but not in thepredicted solution. AD is calculated by taking the set diﬀerence of the arc sets of the test and predictedsolutions, then dividing the value by the total number of arcs in the routing. Correspondingly, AD givesan estimate of the total number of modiﬁcations needed to correct the predicted solution.

Experimental Setup.

The numerical experiments were performed using Python 3.7.4 and the CPLEX 12.9 solver with thedefault setting, on a Lenovo ThinkPad X1 Carbon with an Intel Core i7 processor running at 1.8GHz with16GB RAM. The Laplace smoothing parameter ( λ ), convex combination parameter ( β ) and exponentialsmoothing parameter ( α ) take default values λ = 1, β = 1, and α = 0 .

7. Following the weighing schemeoverview in

Table 1 , we denote uniform weighing by UNIF, TIME for time-based, EXP for exponentialtime-based weighing and SIMI for similarity-based weights. TIME2 and SIMI2 indicate squared weights.Furthermore, whenever necessary, we use superscripts ( ) and ( ) to distinguish when a weighing schemeis applied using the ﬁrst or the second order model, respectively. For instance, UNIF indicates uniformweighing using the ﬁrst order model. UNIF UNIF EXP EXP SIMI2 SIMI2 R o u t e D i ff e r e n c e ( % ) UNIF UNIF EXP EXP SIMI2 SIMI2 A r c D i ff e r e n c e ( % ) Weighing Scheme UNIF UNIF EXP EXP SIMI2 SIMI2 Avg Computation Time (s) 0.1016 4355.75 0.1455 2055.33 0.0922 9101.41

Figure 3: First order ( ) versus second order ( ) model (data from entire period) We tested the performance of our models on a subset, composed of three weekdays, of all data from theentire period using three schemes (UNIF, EXP, SIMI2)—one from each type of prior: uniform, time-basedand similarity-based. From the results shown in (

Fig. 3 ), we see that the second order model consistentlyproduced better results in terms of route diﬀerence. However, this is not the case for arc diﬀerence, whereaside from there being almost no change in prediction accuracy, the variance also increased. We assertthat more data points are necessary in order to better evaluate the second order model.Looking at the computation times, we see a considerable disparity between the two models. Whileschemes using the ﬁrst order formulation resulted to subsecond computation times, using the fastestscheme in the second order model, EXP , took more than 2000 seconds for each test instance to be solvedto optimality. We can hence remark that the slight improvement in route diﬀerence is, for practicalpurposes, reduced by the substantially longer running time of the second order model.For the succeeding experiments, we therefore focus our attention on further investigating the eﬀects ofthe ﬁrst order formulation. 13 IST UNIF TIME TIME2 EXP SIMI SIMI2020406080100 R o u t e D i ff e r e n c e ( % ) DIST UNIF TIME TIME2 EXP SIMI SIMI2020406080100 A r c D i ff e r e n c e ( % ) Figure 4: Route and arc diﬀerence (period before drift)

DIST UNIF TIME TIME2 EXP SIMI SIMI2020406080100 R o u t e D i ff e r e n c e ( % ) DIST UNIF TIME TIME2 EXP SIMI SIMI2020406080100 A r c D i ff e r e n c e ( % ) Figure 5: Route and arc diﬀerence (entire period)

In the next two experiments (

Fig. 4 - ), we do a wider comparison by investigating the performanceof all our proposed weighing schemes (see Table 1 )—UNIF, TIME, SIMI, and EXP, and TIME2 andSIMI2—in the ﬁrst order model.

Fig. 4 is on data before the drift (up to week 53 in

Fig. 1 ). It shows that all the proposed schemesproduced better estimates than DIST. There appears to be no signiﬁcant diﬀerence in the results interms of route diﬀerence. For arc diﬀerence, however, using time-based weighing (TIME, TIME2, EXP)considerably improved the solutions given by UNIF. Among all schemes, EXP gave the most accuratepredictions. Hence, it can be deduced that more recent routings are more relevant when making choicesin this case.Results on data from the entire period (

Fig. 5 ) exhibit a slightly diﬀerent behavior. As before, all theschemes outperformed DIST. Notably in terms of arc diﬀerence, both the time-based and similarity-basedschemes signiﬁcantly outperformed UNIF. Also, in most cases, the schemes with the squared weights(TIME2, SIMI2) performed better than their counterparts (TIME, SIMI). As before, EXP gave the mostaccurate predictions among all schemes.

The two succeeding experiments were conducted to observe the performances of the proposed schemesduring a concept drift. Our motivation is that we want to determine how the prediction quality of eachscheme evolves when there is a sudden change in data structure.We consider two scenarios. The ﬁrst case is when there is an abrupt drop in the number of stops.Observe that this is the case immediately after week 53 in

Fig. 1 . For this scenario, we trim our data setsuch that our 13 test instances, i.e., instances in the tail end of the data set, consist of 3 instances beforethe drift and 10 instances after the drift. As before, the probability matrix is trained on all data older14 R o u t e D i ff e r e n c e ( % ) UNIFEXPTIMETIME2SIMISIMI2 -3 -2 -1 0 1 2 3 4 5 6 7 8 9Timestamp010203040506070 A r c D i ff e r e n c e ( % ) UNIFEXPTIMETIME2SIMISIMI2

Figure 6: Route and arc diﬀerence during concept drift (drop in number of stops) -3 -2 -1 0 1 2 3 4 5 6 7 8 9Timestamp010203040506070 R o u t e D i ff e r e n c e ( % ) UNIFEXPTIMETIME2SIMISIMI2 -3 -2 -1 0 1 2 3 4 5 6 7 8 9Timestamp010203040506070 A r c D i ff e r e n c e ( % ) UNIFEXPTIMETIME2SIMISIMI2

Figure 7: Route and arc diﬀerence during concept drift (rise in number of stops)than the ones contained in the test set. We plot the prediction accuracy of each scheme on each of thetest instances.The results of the ﬁrst scenario are shown in

Fig. 6 , where instance 0 denotes the ﬁrst instance afterthe drift. Naturally, in both route and arc diﬀerence, we observe a sharp rise in error percentage atinstance 0. Especially with route diﬀerence, the error remained high for all the schemes (except EXP)until about instance 2 or 3, after which we see some improvement particularly for TIME2 and SIMI2.EXP, on the other hand, readjusted immediately after instance 0 and clearly outperformed the otherschemes in terms of prediction accuracy and its ability to adjust to changes in data structure.For the second scenario, we consider the case where there is a sudden rise in the number of stops. Inorder to use the same company data that we have, we simulate this case by reversing the time ranking ofall the data instances. With the data in reverse order, we are able to simulate a drift where there is anincrease in the number of stops, e.g., new stops that have not been seen before. As with the ﬁrst case, wetrim the data and select the 13 test instances after reordering the data. Compared to the ﬁrst case, hereall the schemes seemed to adjust more rapidly (

Fig. 7 ) after instance 0 . The increase in route and arcdiﬀerence at instance 0 is unsurprisingly greater than in the previous scenario, as the schemes adjustwith the new stops that were not seen before.The above experiments show that generally, all schemes manage to stabilize after the initial risein error due to structural change. In both scenarios, TIME2 and SIMI2 restabilize faster than theircounterparts, TIME and SIMI. Among all the schemes, UNIF clearly performed the worst, while EXPwas the fastest to adjust.

Up to this point, we have investigated the performance of our model using transition matrices made ofonly probabilities learned from the historical solutions. We expect, however, to gain further improvement15

IST 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 parameter0102030405060708090100 R o u t e D i ff e r e n c e ( % ) DIST 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 parameter0102030405060708090100 A r c D i ff e r e n c e ( % ) β parameter(DIST) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 (Actual Sol)RD 43.70 25.56 19.68 18.09 16.78 16.90 16.57 16.73 16.85 16.54 17.02 0AD 73.81 29.09 22.46 20.89 20.66 20.15 19.77 19.53 19.75 19.65 19.51 0Avg Total Dist (km) 395.42 401.78 412.86 415.06 415.69 416.34 418.39 419.15 418.94 419.27 421.01 412.13Avg Time (s) 5958.66 6.36 3.48 2.57 1.56 0.6004 0.5061 0.4265 0.5093 0.6751 0.7432 - Figure 8: EXP results for varying values of β (entire period)when the learned transition probability matrix is combined with a distance-based probability matrix, theconstruction of which is described in Section 3.5. For the next experiment, we investigate how the ﬁnalsolution is aﬀected by the diﬀerent ways of mixing, i.e., varying values of the β parameter in Eq. (27). Fig. 8 shows the result for diﬀerent β values, using EXP with the default scheme value on data fromthe entire period. When combined with the learned probability matrix, we see that even with small valuesof β , we already get better results than using distances alone.While the boxplots alone seem to indicate that larger β values lead to small but consistent improvementsin RD and AD, we must warn that the arc and route diﬀerences with respect to the actual solution tellonly one side of the story. In the accompanying table below the ﬁgures, we show the average RD/ADas well as the average total distance driven. Looking closely at the kilometers, we see that while highervalues of β generally lead to lower percent diﬀerences, they also increase the number of kilometers anddo so beyond the number of kilometers of the actual solution. Using a β value of 0.2 shows that onaverage, this leads to a good RD/AD accuracy as well as an average number of kilometers roughly equalto that of the actual solutions. We hence recommend a modest β value, such as β = 0 .

2, i.e., a mix of20%-preference and 80%-distance probabilities.

A visual illustration of the way the routes are predicted by the transition matrix can be observed in

Figs. 9 - . Fig. 9 shows a map (ﬁgure not drawn to scale) of 24 stops to visit using three deliveryvehicles. The depot is denoted by 0. Also shown in

Fig. 9 is the solution obtained by using the standarddistance-based approach. The actual routing used by the company is shown in

Fig. 10 . Note that exceptfor the slight similarity in the way the stops are clustered into routes, the two routings appear to bealmost completely diﬀerent.From the historical data of the same weekday as the actual solution, we construct a transitionprobability matrix. The ﬁrst order matrix learned with the simplest scheme (UNIF) is visualized in

Fig.11 , where darker arcs indicate higher probabilities. It can be observed that the image shows a clearstructure, with distinct connections, e.g., to the furthest stops, but also a higher variability in the denserregions and near the depot.

Fig. 12 shows the ﬁrst order solution constructed with the probabilitymatrix of

Fig. 11 . We can see that the solution was able to capture key structural parts while makingtrade-oﬀs elsewhere to come up with a global solution, e.g., making a connection from stop 11 to 16.The actual human-made solution, on the other hand, was made with a number of distinct choices which16

Figure 9: Distance-based solu-tion

01 23 4 5 678910 111213141516 171819202122 23 24

Figure 10: Human-made solu-tion

01 23 4 5 678910 111213141516 171819202122 23 24 00.10.20.30.40.50.60.70.80.9

Figure 11: Learned ﬁrst orderprobabilities

01 23 4 5 678910 111213141516 171819202122 23 24

Figure 12: First ordersolution

01 23 4 5 678910 111213141516 171819202122 23 24

Figure 13: First order solutionwith β = 0 .

01 23 4 5 678910 111213141516 171819202122 23 24

Figure 14: Second ordersolution

DIST First Order First Order( β = 0.2) Second Order Actual(Human-made)RD 36.00 32.00 28.00 28.00 0AD 81.48 70.37 25.93 72.63 0Total Dist (km) 337.12 352.71 346.02 359.17 353.32Runtime (s) 9938.91 0.2642 0.2580 21606.17 - Figure 15: Detailed results for example routing17annot be easily predicted just by looking at the probability matrix map, e.g., the direct connectionsfrom 0 to 17, and 23 to 24.The solution obtained by adding distance-based probabilities to the ﬁrst order model is displayed in

Fig. 13 . It can be observed that the solution is able to keep the general structure of the human-madesolution, while also eliminating the longer connections (e.g., 11 to 16, and 15 to 22) made by the purelypreference-based ﬁrst order model.

Fig. 14 shows the solution when optimising with the second order model, which improved the routediﬀerence of the ﬁrst order solution (RD in

Fig. 15 ). Also from the detailed table of results, we cansee that in this particular example, the ﬁrst order with β model was able to most closely capture thegeneral structure of the human-made solution. Obviously, this is not always the case (see Fig. 8 inSection 4.3). Nevertheless, looking at the total distances displayed in

Fig. 15 , here we observe thatadding distance-based probabilities not only preserved the structure but also oﬀered a solution with fewerkilometers. Hence, it does not merely mimic the human planners but is also able to take preferences intoaccount while still optimizing the total distance.

One of the crucial ﬁrst steps in solving vehicle routing problems is explicitly formulating the problemobjectives and constraints. Oftentimes in practice, the optimized routing takes into account not onlytime and distance-related factors, but also a multitude of other concerns, which remain a tacit knowledgeof the route planners. Specifying each sub-objective and constraint may be an intractable task. Moreover,as we have observed in practice, computed solutions seldom fully satisfy the wishes of the route plannersand all involved stakeholders.This paper studied the potential of learning the preferences of the route planners from historicalsolutions in solving VRP. Speciﬁcally, we presented an approach to solving the VRP which does notrequire a full explicit problem characterization. Inspired by existing research on exploiting Markov modelsto predict individual route choices, we developed an approach that learns a probabilistic preference modelbased on historical solutions. This probabilistic model can subsequently be optimized to obtain themost likely routing for an entire ﬂeet. We have shown how this approach is capable of learning implicitpreferences from real data, resulting in more accurate solutions than using distances alone. The algorithmperforms well even without capacity demands, conﬁrming its ability to learn patterns in the underlyingsolution structure.Our work extends beyond a ﬁrst order Markov model and we have provided a more general formalization,which includes a second order Markov model, for learning and optimizing preferences. We have seen thatthe second order model improves the accuracy in terms route diﬀerence. This improvement, however, doesnot translate to arc diﬀerence. Furthermore, the second order model is expensive in terms of computationalburden. Here, we assert that a richer database is necessary to better evaluate the performance of higherorder models.We also presented an approach which combines distance-based and preference-based probabilities.This approach has the advantage of being robust to changing client sets and computationally fast due tosparsity of the resulting cost matrix. By mixing the two probabilities, we obtained more robust resultswithout sacriﬁcing computational times.Results on real data have been encouraging. Validation on other real-life data, however, shouldbe considered for generalization, as other data may have more (or less) structural assumptions. Ourproposed ﬁrst order approach could be plugged into existing VRP software today, but as with all predictivetechniques there should be human oversight to avoid unwanted bias.Future work on the routing side will involve applications to richer VRP model, e.g., problems involvingtime windows, multiple deliveries, etc. On the learning side, the use of richer predictive models suchas neural networks or higher order Markov models can be considered. Also, using separate learned orcontextual models per vehicle or per driver is worth investigating. Finally, extending the technique sothat the user can be actively queried and learned from during construction is an interesting direction,e.g., to further reduce the amount of user modiﬁcations needed on the predicted solutions.18 eferences

Ait Haddadene, Syrine Roufaida, Nacima Labadie, and Caroline Prodhon (2019). “Bicriteria VehicleRouting Problem with Preferences and Timing Constraints in Home Health Care Services.” In:

Algorithms

Leonardo , pp. 175–187.Ashbrook, Daniel and Thad Starner (2003). “Using GPS to learn signiﬁcant locations and predictmovement across multiple users.” In:

Personal and Ubiquitous computing

International Conference on Principles and Practice of ConstraintProgramming . Springer, pp. 12–26.— (2012). “A model seeker: Extracting global constraint models from positive examples.” In:

InternationalConference on Principles and Practice of Constraint Programming . Springer, pp. 141–157.Bessiere, Christian, Frédéric Koriche, Nadjib Lazaar, and Barry O’Sullivan (2017). “Constraint acquisition.”In:

Artiﬁcial Intelligence

ACM Computing Surveys (CSUR)

International Conference on Principles and Practice of Constraint Programming . Springer, pp. 54–70.Ceikute, Vaida and Christian S Jensen (2013). “Routing service quality–local driver behavior versusrouting services.” In: . Vol. 1.IEEE, pp. 97–106.Chang, Kai-Ping, Ling-Yin Wei, Mi-Yeh Yeh, and Wen-Chih Peng (2011). “Discovering personalizedroutes from trajectories.” In:

Proceedings of the 3rd ACM SIGSPATIAL International Workshop onLocation-Based Social Networks , pp. 33–40.Chen, Stanley F and Joshua Goodman (1999). “An empirical study of smoothing techniques for languagemodeling.” In:

Computer Speech & Language

Journal of the Royal Statistical Society: Series B (Methodological)

Management science

HEV charge/dischargecontrol system based on navigation information . Tech. rep. SAE Technical Paper.Delling, Daniel, Andrew V Goldberg, Moises Goldszmidt, John Krumm, Kunal Talwar, and RenatoF Werneck (2015). “Navigation made personal: Inferring driving preferences from gps traces.” In:

Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic InformationSystems , pp. 1–9.Dragone, Paolo, Stefano Teso, and Andrea Passerini (2018). “Constructive preference elicitation.” In:

Frontiers in Robotics and AI

4, p. 71.Drexl, Michael (2012). “Rich vehicle routing in theory and practice.” In:

Logistics Research

Proceedings of the 24th ACM SIGSPATIAL International Conference onAdvances in Geographic Information Systems , pp. 1–9.Gama, João, Indr˙e Žliobait˙e, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia (2014). “Asurvey on concept drift adaptation.” In:

ACM computing surveys (CSUR)

The VLDB Journal , pp. 1–22.Harrison, PJ (1967). “Exponential smoothing and short-term sales forecasting.” In:

Management Science

Mind

SAE 2008 World Congress .Lloyd L. Withrow Distinguished Speaker Award.Laporte, Gilbert (2007). “What you should know about the vehicle routing problem.” In:

Naval ResearchLogistics (NRL)

AAAI , pp. 1795–1800.McCormick, Garth P (1976). “Computability of global solutions to factorable nonconvex programs: PartI—Convex underestimating problems.” In:

Mathematical programming ,pp. 1–21.Picard-Cantin, Émilie, Mathieu Bouchard, Claude-Guy Quimper, and Jason Sweeney (2016). “Learningparameters for the sequence constraint from solutions.” In:

International Conference on Principlesand Practice of Constraint Programming . Springer, pp. 405–420.Potvin, Jean-Yves, Gina Dufour, and Jean-Marc Rousseau (1993). “Learning vehicle dispatching withlinear programming models.” In:

Computers & operations research . IEEE,pp. 127–132.Toth, P and D Vigo (2014). “The family of vehicle routing problem.” In:

Vehicle Routing: Problems,Methods, and Applications , pp. 1–23.Wang, Xipeng, Yuan Ma, Junru Di, Yi L Murphey, Shiqi Qiu, Johannes Kristinsson, Jason Meyer,Finn Tseng, and Timothy Feldkamp (2015). “Building eﬃcient probability transition matrix usingmachine learning from big data for personalized route prediction.” In:

Procedia Computer Science . IEEE, pp. 136–147.Yang, Bin, Chenjuan Guo, Yu Ma, and Christian S Jensen (2015). “Toward personalized, context-awarerouting.” In:

The VLDB Journal arXiv preprint arXiv:1907.04028 .Ye, Ning, Zhong-qin Wang, Reza Malekian, Qiaomin Lin, and Ru-chuan Wang (2015). “A method fordriving route predictions based on hidden Markov model.” In:

Mathematical Problems in Engineering

A UNIF without and with capacity constraints

This experiment was done to compare the prediction accuracy of UNIF without and with the capacity(Cap) demand estimates (

Fig. 16 ). The motivation is to investigate how UNIF will perform even withoutthe capacity constraints. When evaluating without the capacity estimates, in order to keep the subtourelimination constraint (5), each q i is assigned a value of 1 while using the number of stops as a ﬁctivebound on the vehicle capacities, e.g., Q = n . As a baseline, we included the solution (DIST) obtained bysolving the standard distance-based CVRP to near-optimality (5% optimality gap). Computation wasdone on data from the entire period.Results show that DIST is consistently outperformed by UNIF. Moreover, in both route and arcdiﬀerence, we always notice some improvement when capacity estimates are taken into account. Remark-ably, when using the transition probability matrices, we see that we can solve the VRP even without thecapacity constraints and still get meaningful results. This shows the ability of the method to learn thestructure underlying the problem just from the solutions.As for the computation time, DIST needed an average of almost 6000 seconds to solve each instancedespite the imposed near-optimality relaxation. In contrast, both UNIF and UNIF(Cap) only took lessthan a second on average to obtain the optimal solutions, with UNIF(Cap) taking slightly more time dueto the additional capacity constraints. Additionally, we observed that the learned probability matricesare much more sparse (containing more 0 or near-0 values) than the distance matrices. B Parameter Sensitivity

Laplace parameter λ . To understand the eﬀects of varying the Laplace parameter λ ( Fig. 17 ), weperformed an experiment on data from the entire period. Capacity demands were taken into account and20

IST(Cap) UNIF UNIF(Cap)020406080100 R o u t e D i ff e r e n c e ( % ) DIST(Cap) UNIF UNIF(Cap)020406080100 A r c D i ff e r e n c e ( % ) Weighing Scheme DIST(Cap) UNIF UNIF(Cap)Avg Computation Time (s) 5958.66 0.1135 0.9730

Figure 16: UNIF without and with capacity (Cap) constraints (data from entire period)EXP was selected on the basis of the results from the previous experiments. It is interesting to note thatEXP worked well even with no smoothing ( λ = 0). It can also be observed that EXP produced slightlyimproved results when smoothing is applied. In general, however, we see that on our data, Laplacesmoothing has very little eﬀect. EXP parameter α . We examined the behavior of the model for diﬀerent values between 0 and 1 ofthe EXP parameter α. Instead of looking at the average percent diﬀerences as we did for the precedingexperiment on the Laplace parameter, here we opted to inspect more closely the evolution of the predictionacross diﬀerent time periods. In

Fig. 18 , each point on the graph represents the average route or arcdiﬀerence when the model is tested on the test instances from the corresponding time period 0 , , . . . , or6, with 0 being the oldest, and 6 the newest time period. The results of the experiment show that as α increases, prediction accuracy also increases. Accuracy appears to stabilize at α = 0 . , which incidentallycoincides with our choice of the parameter’s default value. C Distance-based Probabilities

Here, we prove that minimizing the absolute distances in the standard CVRP is equivalent to minimizingthe distance-based probabilities of Eq. (26) under some mild conditions. We write the solution using thedistance-based probabilities as:arg min x X ( i,j ) ∈ A ˆ d ij x ij = arg min x X ( i,j ) ∈ A − log (cid:16) e − d ij P k e − d ik (cid:17) x ij = arg min x X ( i,j ) ∈ A − log e − d ij − log X k e − d ik ! x ij =arg min x X ( i,j ) ∈ A d ij x ij + X i ∈ V X j :( i,j ) ∈ A log (cid:18) X k e − d ik (cid:19) x ij =arg min x X ( i,j ) ∈ A d ij x ij + X i ∈ V \{ } log (cid:18) X k e − d ik (cid:19) X j :( i,j ) ∈ A x ij + log (cid:18) X k e − d k (cid:19) X j :(0 ,j ) ∈ A x j . (28)The ﬁrst term in the ﬁnal expression corresponds to the classical VRP where the total distance isminimized. Given that P j :( i,j ) ∈ A x ij = 1 (Eq. (2)), the second term is always a constant. Hence, it doesnot play any role in the optimization above. The third term depends on the number of vehicles used inthe routing x .Whenever the optimal solution utilizes all the vehicles, we have P j :(0 ,j ) ∈ A x j = m (where m is the number of vehicles). As the third term also becomes a constant, Eq. (28) simply becomesarg min x P ( i,j ) ∈ A d ij x ij , which is the same as the original CVRP objective function.From the operational point of view, in the company setting in which this work is applied, the numberof available drivers each day is ﬁxed. Allocating a route to each of the drivers does not entail anyadditional cost. Therefore, we can assume that the equality constraint is always active.21 R o u t e D i ff e r e n c e ( % ) A r c D i ff e r e n c e ( % ) Figure 17: Route and arc diﬀerence for values of Laplace parameter λ (entire period) R o u t e D i ff e r e n c e ( % ) A r c D i ff e r e n c e ( % ) Figure 18: Route and arc diﬀerence for varying values of EXP parameter (entire period)Otherwise, we can always scale up the softmax function with a parameter θ > g ( θ ) = P k e − θd k = 1. In fact, g ( · ) is a continuous function such that g (0) = | V | and lim θ → + ∞ g ( θ ) = 0.Hence, there exists a value θ ∗ > θ ∗∗