A Distributed Model-Free Ride-Sharing Approach for Joint Matching, Pricing, and Dispatching using Deep Reinforcement Learning
Marina Haliem, Ganapathy Mani, Vaneet Aggarwal, Bharat Bhargava
Abstract—Significant development of ride-sharing services presents a plethora of opportunities to transform urban mobility by providing personalized and convenient transportation while ensuring efficiency of large-scale ride pooling. However, a core problem for such services is route planning for each driver to fulfill the dynamically arriving requests while satisfying given constraints. Current models are mostly limited to static routes with only two rides per vehicle (optimally) or three (with heuristics). In this paper, we present a dynamic, demand-aware, and pricing-based vehicle-passenger matching and route planning framework that (1) dynamically generates optimal routes for each vehicle based on online demand, pricing associated with each ride, and vehicle capacities and locations; this matching algorithm starts greedily and optimizes over time using an insertion operation; (2) involves drivers in the decision-making process by allowing them to propose a different price based on the expected reward for a particular ride as well as the destination locations for future rides, which is influenced by the supply and demand computed by the Deep Q-network; (3) allows customers to accept or reject rides based on their preferences with respect to pricing, delay windows, vehicle type, and carpooling; and (4) based on demand prediction, re-balances idle vehicles by dispatching them to areas of anticipated high demand using deep Reinforcement Learning (RL). Our framework is validated using the New York City Taxi public dataset; in addition, we consider different vehicle types and design customer utility functions to validate the setup and study different settings. Experimental results show the effectiveness of our approach in real-time and large-scale settings.
Index Terms—Ride-sharing, Deep Reinforcement Learning, Vehicle Routing, Pricing, Deep Q-Network, Pooling, Shared Mobility, Route Planning
I. INTRODUCTION
Advanced user-centric ride-hailing services such as Uber and Lyft are thriving in urban environments by transforming urban mobility through convenience in travel to anywhere, by anyone, at any time. Given tens of millions of users per month [1], these mobility-on-demand (MoD) services introduce a new paradigm in urban mobility: ride-sharing, or ride-splitting. These ride-hailing services, when adopting shared mobility, can provide an efficient and sustainable means of transportation [2]. With higher usage per vehicle, such a service can reduce traffic congestion, environmental pollution, and energy consumption, thereby enhancing living conditions in urban environments [3]. An essential issue in realizing shared mobility in online ride-sharing settings is route planning, which has been proven to be an NP-hard problem [4]–[6]. Route planning, given a set of vehicles V and requests R, designs a route that consists of a sequence of pickup and drop-off locations of the requests assigned to each vehicle. In ride-sharing environments, vehicles and requests arrive dynamically and are arranged such that they meet different objectives (e.g., maximizing the fleet's profits [3], [7] and the number of served requests [8], [9], and minimizing the average delay among customers and the total travelling distance [10]–[12]).

Utilizing an operation called insertion to solve such a highly dynamic problem has been proven, in the literature, to be both effective and efficient [3], [8]–[15]. The insertion operation aims to accommodate a newly arriving request by inserting its pickup (i.e., origin) and drop-off (i.e., destination) locations into a vehicle's route. The main issue with the state-of-the-art route planning methods [3], [7], [13] is that they operate in a greedy fashion to optimize their objectives: each insertion operation only optimizes the objective function with respect to the currently arranged request, leading to low performance over the long run. Moreover, most of the approximation algorithms that provide solutions to the route planning problem are limited to only two requests per vehicle, such as the settings in [4]–[6].

M. Haliem, G. Mani, and B. Bhargava are with the Department of Computer Science, Purdue University, West Lafayette, IN 47907 USA (email: {mwadea, manig, bbshail}@purdue.edu). V. Aggarwal is with the Department of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907 USA (email: vaneet@purdue.edu).
In [5], the authors improve the effectiveness of route planning for shared mobility by taking into consideration the demand in the near future (e.g., the next 90 minutes), thus overcoming the short-sightedness problem of the insertion operator. However, the Demand-Aware Route Planning (DARP) in [5] is still limited to only two rides per vehicle in the initial assignment phase. In addition, it does not enforce constraints on which rides can be grouped together, which may result in grouping rides whose destinations are in opposite directions, thus causing large detours (boosting the travel distance) as well as extra fuel and energy consumption per vehicle. Therefore, a robust framework is essential to achieve a dynamic and demand-aware matching framework that optimizes the overall fleet's profits and its total travelling distance. Finally, the overall problem for ride-sharing requires the interconnected decisions of matching, pricing, and dispatching, which are the focus of this paper.

In this paper, we present a dynamic, demand-aware, and pricing-based vehicle-passenger matching and route-planning framework that scales up to accommodate more than two rides per vehicle (up to the maximum capacity of the vehicle) in the initial assignment phase, followed by an optimization phase where insertion operations are applied in order to achieve a matching that satisfies our platform-related set of constraints (as described in Section III). At the same time, we incorporate the pricing aspect into our matching algorithm, which prioritizes grouping together rides with significant intersection in their paths.
This provides an efficient dynamic matching algorithm for the ride-sharing environment by managing the prices for the passengers based on the distance travelled after the insertion of each passenger into the route, given the previous passengers.

Another problem addressed in this paper is that, even though pooling services provide customized personal service to customers, both the drivers and the customers are largely left out of deciding what is best for them in terms of their conveniences and preferences. It is challenging to introduce customer and driver conveniences into the framework. For example, a customer may have a limit on the money he/she could spend for a particular ride, as well as time constraints on reaching the destination. On the other hand, a driver may not be willing to accept the customer's convenient fare, as it may negatively affect his/her profits if the final destination is in a low-demand area. Thus, a reliable framework is needed to identify trade-offs between the drivers' and the customers' needs and make a compromise decision that is favorable to both.

In this paper, we approach the pricing-based ride-sharing (with pooling) problem through a model-free technique. In contrast to the model-based approaches in the literature [16]–[19], our proposed approach can adapt to dynamic distributions of customer and driver preferences. The authors of [20] provided the first model-free approach for ride-sharing with pooling, DeepPool, based on deep reinforcement learning (RL). However, DeepPool neither incorporates a dynamic demand-aware matching with a pricing strategy, nor accommodates customers' and drivers' conveniences. It primarily focuses on dispatching idle vehicles and customer-vehicle matching using a greedy matching policy.
To the best of our knowledge, ours is the first work that introduces a model-free approach for distributed joint matching, pricing, and dispatching for ride-sharing where customers and drivers can weigh in their ride preferences, influencing the decision making of ride-sharing platforms.

This paper utilizes the dispatch of idle vehicles using a Deep Q-Network (DQN) framework as in [20]. Depending on the DQN, we decide on the pricing by vehicles as well as the acceptance or rejection of rides by customers. The goal of our approach is to utilize the optimal dispatching provided by the pre-defined model, influence the customer and vehicle utility functions to achieve convenience, maximize the profits for both, and reduce the customers' waiting time, travel time, and idle driving. The dispatching affects the pricing for the passengers: an additional charge is added if the destination location of a passenger is not a predicted good destination. Further, the DQN takes into account the profits of the vehicles and thus accounts for the pricing and matching decisions. Thus, the three sets of algorithms are inter-connected with each other.

The key contributions in this paper can be summarized as follows.

• We present a novel dynamic, Demand-Aware and pricing-based Matching and route planning (DARM) framework that is scalable up to the maximum capacity per vehicle in the initial assignment phase. In the optimization phase, this algorithm takes into account the near-future demand as well as the pricing associated with each ride in order to improve the route planning by eliminating rides heading in opposite directions and applying insertion operations to vehicles' current routes.

• In addition to our matching and route-planning (DARM) framework, we integrate a novel Distributed Pricing approach for Ride-Sharing with pooling (DPRS) framework where, based on their convenience, customers and drivers get to weigh in on the decision-making for a particular ride.
The key idea is that passengers are offered a price based on the additional distance given the previously matched passengers, thus prioritizing passengers whose routes intersect. The DPRS approach is built on top of a distributed model-free approach, DeepPool [20], and utilizes its dispatching algorithm for idle vehicles using Deep Q-learning (DQN).

• In the DPRS framework, drivers are allowed to propose a price based on the location of the ride (source and destination) that accounts for the reward of the DQN based on the destination location. Similarly, customers can either accept or reject rides based on their pricing threshold, timing preference, type of vehicle, and number of people to share a ride with.

• Our joint (DARM + DPRS) framework increases the profit margins of both customers and drivers, and the profits are also fed back to the reinforcement learning utility functions that influence the Q-values learnt using DQN for making the vehicles' dispatch decisions. The optimization problem is formulated such that our framework tries to minimize the rejection rate, customers' waiting time, vehicles' idle time, and the total number of vehicles (to reduce traffic congestion and fuel consumption), and to maximize the vehicles' profit.

• Through experiments using a real-world dataset of New York City's taxi trip records [21] (15 million trips), we simulate the ride-sharing system with dynamic matching, demand-aware route planning, and a distributed pricing strategy. We show that our novel joint (DARM + DPRS) framework provides up to times more profits for drivers, while maintaining minimal waiting times for customers when compared to various baselines. In addition, our framework shows a significant improvement in fleet utilization, as it uses only 50% of the vehicles allowed by the system to fulfill up to 90% of the demand, while keeping most of the vehicles occupied more than 80% of the time, which brings their idle time to a minimum.
We note that the three components of pricing, matching, and dispatching influence each other and are tightly coupled.

The rest of this paper is organized as follows: Section II describes the overall architecture of our framework as well as the model parameters. In Section III, we explain our dynamic, demand-aware, and pricing-based approach for matching and route planning. Section IV provides details of our pricing strategy, including the customers' and drivers' utility functions and decision-making processes. In Section V, we describe the DQN-based approach we utilize for dispatching vehicles. The simulation setup as well as experimental results are presented in Section VI. Finally, Section VII concludes the paper.

II. DISTRIBUTED JOINT MATCHING, PRICING AND DISPATCHING FRAMEWORK
We propose a novel distributed framework for matching, pricing, and dispatching in ride-sharing environments using a Deep Q-Network (DQN), where initial matchings (decided in a greedy fashion) are then optimized in a distributed manner (per vehicle) in order to meet each vehicle's capacity constraints as well as minimize customers' extra waiting time and drivers' additional travel distance. This framework involves customers and drivers (referred to as agents henceforth) in the decision-making process. They learn the best pricing actions based on their utility functions, which dynamically change based on each agent's set of preferences and environmental variables. Moreover, vehicles learn the best future dispatch action to take at time step t, taking into consideration the locations of all other nearby vehicles, but without anticipating their future decisions. Note that vehicles get dispatched to areas of anticipated high demand either when they first enter the market or when they spend a long time being idle (searching for a ride). Vehicles' dispatch decisions are made in parallel. However, if multiple vehicles select the same passenger at the exact same time, the passenger chooses one of the vehicles. Therefore, our algorithm learns the optimal policy for each agent independently, as opposed to centralized approaches such as [22].
[Figure 1 components: a control unit containing the DQN engine and the price estimation, demand prediction, OSRM, and ETA models; a matching agent and matching optimizer (initial and final matchings); a dispatching agent; and the environment of vehicles and customers exchanging requests, vehicle states, prices, and accept/reject decisions.]
Fig. 1: Overall architecture of the proposed framework
A. Model Architecture
Figure 1 shows the basic components of our joint framework and the interactions between them. We assume that the control unit is responsible for: (1) making the initial matching decisions based on the proximity of vehicles to ride requests; (2) maintaining the states, such as current locations, current capacities, and destinations, for all vehicles; these states are updated in every time step based on the dispatching and matching decisions; and (3) hosting internal components that help manage the ride-sharing environment, namely: (a) the estimated time of arrival (ETA) model, used to calculate and continuously update the estimated arrival time; (b) the Open Source Routing Machine (OSRM) model, used to generate the vehicle's optimal trajectory to reach a destination; and (c) the demand prediction model, used to calculate the anticipated future demand in all zones. We adopt these three models from [22]; their details and relevant calculations are provided in Appendix A.

First, the ride requests are input to the system along with the heat map for supply and demand (which involves demand prediction in the near future). Then, the control unit performs greedy vehicle-passenger(s) matching, where one request (or more) gets assigned to the nearest vehicle based on its maximum passenger capacity. Next, communicating with the price estimation model, the control unit obtains the corresponding initial pricing associated with each request and notifies the driver. Afterwards, the matching optimizer operates per vehicle: each vehicle, having received the list of initial matchings (requests) assigned to it along with their corresponding initial prices, performs an insertion operation. In this step, vehicles reach their final matchings list by processing their initial matchings list in order of proximity, performing an insertion operation on their current route plan (as long as the insertion satisfies the capacity, extra waiting time, and additional travel distance constraints that guarantee serving the request would yield a profit). The vehicles adopt a dispatching policy using DQN, where they get dispatched to zones with anticipated high demand when they experience a large idle duration or when they newly enter the market. Using the expected discounted reward learnt by the DQN and the ride's destination, vehicles weigh their utility based on the potential hotspot locations and propose a new price to the customer. This takes place on a customer-by-customer basis: a vehicle, upon inserting a customer into its current route plan, proposes the new price to him/her. Then, the customer can accept or reject based on his/her own independent utility function. Finally, the vehicle, upon receiving the customer's decision, either confirms the addition of this customer to its route plan or removes him/her. A vehicle communicates with the control unit as needed, to request new information about other vehicles (prior to making a dispatch or price decision) or to update its own status (after any decision). After defining the model parameters and notations in Section II-B, we present the overall flow of our framework in Algorithm 1.
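The per-timestep flow described above (greedy matching, initial pricing, driver adjustment, customer decision) can be condensed into a minimal sketch. This is an illustrative toy, not the authors' implementation: locations are 1-D coordinates, the per-mile rate and the DQN-derived markup table are assumed constants, and all names are hypothetical.

```python
# Hypothetical sketch of one control-unit timestep (illustrative names/values).
PRICE_PER_MILE = 2.0                       # assumed base rate
Q_MARKUP = {"hotspot": 0.0, "cold": 0.3}   # stand-in for DQN-derived destination insight

def greedy_match(vehicles, request):
    """Greedy step: assign the request to the nearest vehicle with a free seat."""
    free = [v for v in vehicles if v["seats"] > 0]
    return min(free, key=lambda v: abs(v["loc"] - request["origin"]))

def initial_price(request):
    """Initial price proportional to trip distance."""
    return PRICE_PER_MILE * abs(request["dest"] - request["origin"])

def driver_price(p_init, dest_zone):
    """Driver marks up rides ending in low-demand ('cold') zones."""
    return p_init * (1.0 + Q_MARKUP[dest_zone])

def step(vehicles, requests):
    """One timestep: match, price, and collect customer accept/reject decisions."""
    served, rejected = [], []
    for r in requests:
        v = greedy_match(vehicles, r)
        p = driver_price(initial_price(r), r["dest_zone"])
        if p <= r["budget"]:               # customer utility reduced to a budget check
            v["seats"] -= 1
            served.append((r["id"], v["id"], round(p, 2)))
        else:
            rejected.append(r["id"])       # re-queued for the next timestep
    return served, rejected
```

In the full framework the budget check is replaced by the customer utility of Section IV-C, and the markup by the Q-value-weighted driver utility of Section IV-B.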
B. Model Parameters and Notations
We built a ride-sharing simulator to train and evaluate our framework. We simulate New York City as our area of operation, divided into multiple non-overlapping zones (or regions), each of which is 1 square mile. This allows us to discretize the area of operation and thus makes the action space (where to dispatch the vehicles) tractable. We use m ∈ {1, 2, ..., M} to denote the city's zones, and n to denote the number of vehicles. We optimize our algorithm over T time steps, each of duration ∆t. In Appendix B, we present the details of the model parameters, where we have defined a three-tuple that captures the environment updates at time t as s_t = (X_t, V_{t:t+T}, D_{t:t+T}), where X_t is the vehicles' status vector of N vectors, each of which represents vehicle n's state variables at time step t; D_{t:t+T} is the predicted future demand from time t to time t+T at each zone; and V_{t:t+T} is the predicted supply (i.e., the number of vehicles in each zone for T time slots ahead). Note that a vehicle is considered available for serving demand if and only if it has at least one empty seat. Our framework keeps track of the rapid changes of all these variables and seeks to make the demand d_t, ∀t, and supply v_t, ∀t, close enough (so that the mismatch between them is zero). Note that by ride-sharing we mean ride-sharing with pooling in our model. For the DQN dispatch, we use the framework as in [20], and add the profit to the reward function, which makes the output Q-values impacted by the pricing decisions. This is explained in Section V, and further detailed in Appendix C.

Algorithm 1 Joint Ride-Sharing Framework
1: Initialize vehicles' states X_t at t = 0.
2: for t ∈ T do
3:   Fetch all vehicles that entered the market in time slot t, V_new.
4:   Dispatch V_new to zones with anticipated high demand using Algorithm 4.
5:   Fetch all ride requests at time slot t, D_t.
6:   Fetch all available vehicles at time slot t, V_t.
7:   for each vehicle V_j ∈ V_t do
8:     Obtain initial matching A_j using Algorithm 2 in III-A.
9:     for each ride request r_i ∈ A_j do
10:      Obtain initial price P_init(r_i) using (3).
11:      Perform route planning using Algorithm 3 in III-B.
12:      Obtain S'_{V_j}[r_i] based on cost(V_j, S'_{V_j}[r_i]).
13:      Update trip time T_i based on S'_{V_j}[r_i] using the ETA model.
14:      Calculate final price P(r_i) based on S'_{V_j}[r_i] using (4).
15:      Get customer i's decision C_id on P(r_i) using (5) and (6).
16:      if C_id == 1 then
17:        Update S_{V_j} ← S'_{V_j}[r_i].
18:      else
19:        Insert r_i into D_{t+1}.
20:      end if
21:      Update the state vector s_t.
22:    end for
23:    Retrieve the next stop from S_{V_j}.
24:    Head to the next stop (whether a pickup or a drop-off).
25:  end for
26:  Fetch all idle vehicles whose idle duration exceeds the threshold (in minutes), V_idle.
27:  Dispatch V_idle to zones with anticipated high demand using Algorithm 4.
28:  Update the state vector s_t.
29: end for

III. DARM FRAMEWORK FOR MATCHING AND ROUTE PLANNING
NP-Hardness:
The ride-sharing assignment problem is proven to be NP-hard in [19] via a reduction from the 3-dimensional perfect matching problem (3DM). In 3DM, given a number of requests with source and destination locations and a number of available vehicle locations, the task is to assign vehicles to requests. However, in [19], the authors limit this allocation to only two requests sharing the same vehicle at a time, using an approximation algorithm that is 2.5 times the optimal cost. They approach this problem by first pairing requests greedily, and then using bipartite graphs to match each vehicle to one pair of requests, while assigning the maximum number of requests with the minimum total cost. In our approach, we do not limit matching to only two requests; instead, we go as far as the maximum capacity of a vehicle allows (satisfying the capacity constraint), which significantly boosts the acceptance rate of passengers.

Moreover, our demand-aware route planning problem is a variation of the basic route planning problem for shareable mobility services [11], [3] obtained by setting α = 1 and β = 0. Since the basic route planning problem is NP-hard [11], [3], our DARM problem is also NP-hard. Further, the existing literature proved that there is no optimal method to maximize the total revenue for the basic route planning problem (which is reducible to our DARM problem) using either deterministic or randomized algorithms [7], [3]. Thus, the same applies to our DARM problem.

In this section, we explain our approach to solving the DARM problem using a dynamic, demand-aware, and pricing-based matching and route planning framework. The framework goes through two phases, as follows:

A. Initial Vehicle-Passenger(s) Assignment Phase:
The initial assignment is represented in Algorithm 2. In this phase, the control unit, having knowledge of the future demand D_{t:t+T} at each zone, the vehicles' status vectors X_t (including their current locations and destinations), and the origin o_i and destination d_i locations for each request r_i, performs a greedy matching operation, where each request r_i gets assigned to the nearest available vehicle, satisfying the capacity constraints. In other words, we define the capacity constraint to be: the summation of the capacities of all requests assigned to vehicle V_j is less than its maximum capacity C_max^{V_j} at any time. At the end of this phase, each vehicle V_j has a list of initial matchings A_j = [r_1, r_2, ..., r_k], where k ≤ C_max^{V_j}, assuming that each request has only one passenger. Assume the passenger count per request is |r_i|, and that vehicle V_j arrives at location z. Then, to check the capacity constraint in O(1) time (which will be further discussed in the complexity analysis), we define vehicle V_j's current capacity V_C^j[z], which refers to the total capacity of the requests that are still on the route of V_j when it arrives at location z, as follows:

    V_C^j[z] = V_C^j[z−1] + |r_i|   if z == o_i
    V_C^j[z] = V_C^j[z−1] − |r_i|   if z == d_i     (1)

B. Distributed Optimization Phase:
Algorithm 2 Greedy Assignment
1: Input: Available vehicles V_t with their locations loc(V_j) such that V_j ∈ V_t; ride requests D_t with origin o_i and destination d_i associated with each r_i ∈ D_t.
2: Output: Matching decisions A_j for each V_j ∈ V_t.
3: Initialize A_j = [] and current capacity V_C^j for each V_j ∈ V_t.
4: for each r_i ∈ D_t do
5:   Obtain the locations of candidate vehicles V_cand such that loc(V_j) is within the pickup-distance threshold (in km) of o_i AND (V_C^j + |r_i|) ≤ C_max^{V_j}.
6:   Calculate the trip times T_{j,i} ∈ T_{cand,i} from each loc(V_j) ∈ V_cand to o_i using the ETA model.
7:   Pick the V_j with T_{j,i} = argmin(T_{cand,i}) to serve ride r_i.
8:   Push r_i to A_j.
9:   Update loc(V_j) ← o_i.
10:  Increment V_C^j ← V_C^j + |r_i|.
11: end for
12: Return A_t = [A_j, A_{j+1}, ..., A_n], where n = |V_t|.

As shown earlier, there is no polynomial-time algorithm that achieves an optimal solution for our demand-aware route planning problem. However, several studies show that insertion is an effective approach to greedily deal with the shared mobility problem. We propose an insertion-based framework that can achieve better results over a relatively long time period by using the near-future predicted demand to overcome the short-sightedness problem of basic insertion algorithms.

In DARM, we follow the idea of searching each route and locally optimally inserting a new vertex (or vertices) into it. In our problem, there are two vertices (i.e., the origin o_i and destination d_i) to be inserted for each request r_i. We define the insertion operation as: given a vehicle V_j with current route S_{V_j} and a new request r_i, the insertion operation aims to find a new feasible route S'_{V_j} by inserting o_i and d_i into S_{V_j} with the minimum increased cost, i.e., the minimum extra travel distance, while keeping the order of vertices in S_{V_j} unchanged in S'_{V_j}. Specifically, for a new request r_i, the basic insertion algorithm checks every possible position at which to insert the origin and destination locations, and returns the new route such that the incremental cost is minimized. To present our cost function, we first define our distance metric: given a graph G, we use our OSRM engine to pre-calculate all possible routes over our simulated city.
Then, we derive the distances of the trajectories (i.e., paths) from a location a to a location b to define our graph weights. Thus, we obtain a weighted graph G with realistic distance measures serving as its weights. We extend the weight notation to paths as follows: w(a_1, a_2, ..., a_n) = Σ_{i=1}^{n−1} w(a_i, a_{i+1}).

Thus, we define the cost associated with each new potential route/path S'_{V_j} = [r_i, r_{i+1}, ..., r_k] to be cost(V_j, S'_{V_j}) = w(r_i, r_{i+1}, ..., r_k), resulting from this specific ordering of vertices (the origin and destination locations of the k requests assigned to vehicle V_j). Besides, we derive the cost of the original route to calculate the increased cost for every new route S'_{V_j}[r_i]. To illustrate, assume A_j for vehicle V_j has only two requests r_x and r_y, its location is loc(V_j), and its current route already has r_x inserted as [loc(V_j), o_x, d_x]. Then, V_j picks the S'_{V_j} resulting from inserting r_y into its current route such that:

    cost(V_j, S'_{V_j}[r_y]) = min[ w(loc(V_j), o_x, o_y, d_x, d_y),
                                    w(loc(V_j), o_y, o_x, d_x, d_y),
                                    w(loc(V_j), o_y, o_x, d_y, d_x),
                                    w(loc(V_j), o_x, o_y, d_y, d_x),
                                    w(loc(V_j), o_x, d_x, o_y, d_y),
                                    w(loc(V_j), o_y, d_y, o_x, d_x) ]

Note that the last two optional routes complete one request before serving the other, hence they do not fit into the ride-sharing category. However, we still take them into consideration, as we optimize for the fleet's overall profits and total travel distance. Also, note that these two routes still serve both requests and thus do not affect the overall acceptance rate of our algorithm.
They may increase the customers' waiting time a little; however, we show in our results that the customers' waiting time remains very reasonable. Note that if this is the first allocation made to a vehicle, the first request is simply added to its currently empty route. Otherwise, the first request is handled (like all other requests in the list) by following the insertion operation described above. The cost of serving all k requests in matching A_j is in turn defined as:

    cost(V_j, A_j) = Σ_{r_i ∈ A_j} cost(V_j, S'_{V_j}[r_i])     (2)

Finally, this phase works in a distributed fashion, where each vehicle minimizes its travel cost following Algorithm 3. This distributed procedure goes on a customer-by-customer basis as follows:

• Each vehicle V_j receives its initial matchings list A_j and an initial price P_init (as explained in Section IV-A) associated with each request r_i in that list. This initial matchings list is sorted ascendingly based on proximity to vehicle V_j.

• Then, each vehicle V_j considers the requests in this list in order, inserting them one by one into its current route. For each request r_i, the vehicle arrives at the minimum cost cost(V_j, S'_{V_j}[r_i]) associated with inserting it into its current route (as described above).

• Now, given the initial price P_init associated with this request and the new route S'_{V_j}[r_i] (which may involve detours to serve this request), the vehicle can re-calculate the pricing, taking into consideration any extra distance, using equation (3) in Section IV-A.

• Afterwards, drivers modify the pricing based on the Q-values of the driver's dispatch-to location, having gained insight from the DQN dispatch policy about which destinations can yield a higher profit. The drivers weigh in on their utility function and propose a new price to the customer using equation (4). This procedure is explained in Section IV-B.
• Finally, the customer(s) can accept or reject based on his/her utility function, as explained in Section IV-C. If a customer accepts, the vehicle updates its route S_{V_j} to be S'_{V_j}[r_i]; otherwise, S_{V_j} remains unchanged. The vehicle then proceeds to the next customer and repeats the process. A rejected request is fed back into the system to be considered in the matching process initiated in the next timestep for other (or the same) vehicles.

Algorithm 3
Insertion-based Route Planning
1: Input: Vehicle V_j, its current route S_{V_j}, a request r_i = (o_i, d_i), and a weighted graph G with pre-calculated trajectories using the OSRM model.
2: Output: Route S'_{V_j} after insertion, with minimum cost(V_j, S'_{V_j}).
3: if S_{V_j} is empty then
4:   S'_{V_j} ← [loc(V_j), o_i, d_i].
5:   cost(V_j, S'_{V_j}) = w(S'_{V_j}).
6:   Return S'_{V_j}, cost(V_j, S'_{V_j}).
7: end if
8: Initialize S''_{V_j} = S_{V_j}, Pos[o_i] = NULL, cost_min = +∞.
9: for each x in 1 to |S_{V_j}| do
10:  S^x_{V_j} := insert o_i at the x-th position in S_{V_j}.
11:  Calculate cost(V_j, S^x_{V_j}) = w(S^x_{V_j}).
12:  if cost(V_j, S^x_{V_j}) < cost_min then
13:    cost_min ← cost(V_j, S^x_{V_j}).
14:    Pos[o_i] ← x, S''_{V_j} ← S^x_{V_j}.
15:  end if
16: end for
17: S'_{V_j} = S''_{V_j}, cost_min = +∞.
18: for each y in Pos[o_i] + 1 to |S''_{V_j}| do
19:  S^y_{V_j} := insert d_i at the y-th position in S''_{V_j}.
20:  Calculate cost(V_j, S^y_{V_j}) = w(S^y_{V_j}).
21:  if cost(V_j, S^y_{V_j}) < cost_min then
22:    cost_min ← cost(V_j, S^y_{V_j}).
23:    S'_{V_j} ← S^y_{V_j}, cost(V_j, S'_{V_j}) ← cost_min.
24:  end if
25: end for
26: Return S'_{V_j}, cost(V_j, S'_{V_j}).

The key idea in the proposed matching algorithm is that the pricing for a passenger depends on the route distance given the previously matched passengers. Thus, if the new passenger is going in the opposite direction, the distance will be larger, leading to a higher price, and we expect that the passenger would likely not accept the high price compared to that of another vehicle heading in that direction. Of course, if the customer is willing to pay the high price, he/she will be matched. Thus, the increased pricing leads to passengers being matched when their routes have intersections, as opposed to when they are going in opposing directions.
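The insertion operation of Algorithm 3 can be sketched compactly in Python. This is a minimal sketch under simplifying assumptions: plain Euclidean distance on 2-D points stands in for the OSRM-derived graph weights w(·), the route is a list of stops starting with the vehicle's current location, and the function name is illustrative.

```python
import math

def w(route):
    """Path weight: sum of leg distances (Euclidean stand-in for OSRM weights)."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))

def insert_request(route, origin, dest):
    """Insertion operation of Algorithm 3: place the origin at its best position,
    then the destination at its best position after the origin, preserving the
    order of existing stops. `route[0]` is the vehicle's current location."""
    if len(route) == 1:                      # empty plan: just the vehicle location
        best = route + [origin, dest]
        return best, w(best)
    # Pass 1: best position for the origin (positions 1..|route|).
    best_o, pos_o, cost_o = None, None, float("inf")
    for x in range(1, len(route) + 1):
        cand = route[:x] + [origin] + route[x:]
        if w(cand) < cost_o:
            best_o, pos_o, cost_o = cand, x, w(cand)
    # Pass 2: best position for the destination, only after the origin.
    best, cost = None, float("inf")
    for y in range(pos_o + 1, len(best_o) + 1):
        cand = best_o[:y] + [dest] + best_o[y:]
        if w(cand) < cost:
            best, cost = cand, w(cand)
    return best, cost
```

For a vehicle at (0, 0) with existing stops (2, 0) and (4, 0), inserting a request from (1, 0) to (3, 0) interleaves the new stops, yielding the route (0,0) → (1,0) → (2,0) → (3,0) → (4,0) with total weight 4.0. Note that, as in Algorithm 3, fixing the origin position before searching destination positions is a greedy heuristic over the O(n) candidate positions rather than an exhaustive search over all position pairs.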
Complexity Analysis: The basic insertion algorithm needs to check every possible insertion position, which is O(n), and for every insertion position it recalculates the cost of the new route, which is also O(n), resulting in an overall complexity of O(n²). To realize constant-time cost calculation in DARM, we propose a list that pre-derives all the needed cases, so that any situation can be looked up in the list to find the increased cost in constant time O(1). Based on the granularity of time values (e.g., at least 30 seconds as the unit time for drop-off time), we can dynamically derive all situations according to the finite cases of time delay after insertion. Therefore, the route (and associated cost) pre-calculation step performed by our OSRM engine provides fast routing and constant-time O(1) cost computation, reducing the complexity of our algorithm from O(n²) to O(n).

In addition, for a route to be feasible, for each request r_i in the route, o_i has to come before d_i. Therefore, to further reduce the computation needed, we first find the optimal position Pos[o_i] at which to insert o_i; then, to find the optimal position Pos[d_i] at which to insert d_i, we only consider positions starting from Pos[o_i] + 1. Therefore, we never have to check all permutations of positions; we check only n options in the ride-sharing environment, since route feasibility is verified in O(1) time. This is further reflected in the capacity constraint defined in phase 1, where we borrow from [19] the idea of defining the smallest position P_s[l] at which the origin can be inserted without violating the capacity constraint: the capacity constraint of vehicle V_j is satisfied if and only if P_s[l] ≤ Pos[o_i]. Here, we need to guarantee that there does not exist any position l ∈ (Pos[o_i], Pos[d_i]) such that V^j_C[l] ≥ (|r_i^-| − |r_i|), other than P_s[l] satisfying P_s[l] ≤ Pos[o_i], so that the capacity constraint is not violated.
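As an illustration of the constant-time feasibility checks described above, a capacity check over a candidate insertion can be sketched as follows; the occupancy bookkeeping is an assumed representation, not the paper's exact data structure:

```python
def capacity_feasible(occupancy_after, pos_o, pos_d, passengers, max_capacity):
    """Check that inserting `passengers` riders, picked up at stop index pos_o
    and dropped off at stop index pos_d, never exceeds the vehicle capacity.

    occupancy_after[l] -- riders on board after serving stop l of the
                          current route (before the insertion).
    """
    # Between the new pickup and drop-off, every intermediate stop must
    # leave room for the extra riders.
    for l in range(pos_o, pos_d):
        if occupancy_after[l] + passengers > max_capacity:
            return False
    return True
```

Because each stop already stores the occupancy after serving it, the per-position test is a single comparison, which is what makes the overall feasibility scan linear rather than quadratic.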
IV. DISTRIBUTED PRICING-BASED RIDE-SHARING (DPRS)

In this section, we elaborate on the pricing-based strategy components built on top of the distributed ride-sharing environment detailed in Section II. In our simulator setup, we consider various vehicle types: hatchback, sedan, luxury, and van. Each has a different capacity, mileage, and base price per trip for the driver, denoted by B_j; B_j serves as the minimum earning that the driver gains per trip. Also, based on the vehicle type, the price per mile of distance varies, as does the price per waiting minute.

A. Initial Pricing
Initially, the system suggests a price for each vehicle-customer match, taking into consideration several factors:
• The total trip distance, i.e., the distance until pickup plus the distance from pickup to drop-off. Note that this distance is composed of the weights of the n edges that constitute the vehicle's optimal route from its current location to the origin o_i and then to the destination d_i. It is obtained, as explained in Algorithm 3, as the cost associated with the route S_{V_j} = [loc(V_j), o_i, d_i] after inserting request r_i into vehicle V_j's current route, assuming initially that r_i is the only request matched to vehicle V_j. Thus, cost(V_j, S_{V_j}[r_i]) is used after subtracting the cost of the original route before insertion. Then, in the next step, when the driver already knows the updated route S'_{V_j}, they plug cost(V_j, S'_{V_j}[r_i]) into the equation.
• The number of customers who share the trip distance (whether all or part of it, which can be determined from the vehicle's path). For simplicity, we denote it by vehicle j's capacity V^j_C[o_i] when it reaches the origin location o_i of this request r_i, subtracted from its capacity V^j_C[d_i] when it reaches the destination d_i. Thus, we define V^j_C[r_i] = |V^j_C[d_i] − V^j_C[o_i]|. Initially, this is just an estimate, using the number of requests in the initial matching A_j assigned to vehicle V_j at timestep t. However, at the destination, where the customer is expected to make the payment, all these numbers are already known (i.e., deterministic, exact values).
• The cost of fuel consumption associated with this trip, denoted by D_i · (P_gas / M^j_V), where P_gas represents the average gas price and M^j_V denotes the mileage of vehicle j assigned to trip i.
• The waiting time experienced by the customer (or customers) associated with trip i until pickup, denoted T_i.

The overall price initialization for request r_i is represented as:

P_init[r_i] = B_j + [ω_1 · (cost(V_j, S_{V_j}[r_i]) / V^j_C[r_i])] + [ω_2 · (cost(V_j, S_{V_j}[r_i]) / V^j_C[r_i]) · (P_gas / M^j_V)] − [ω_3 · T_i]    (3)

where ω_1, ω_2, and ω_3 are the weights associated with each of the factors affecting the price calculation. ω_1 is the price per mile of distance according to the vehicle type. ω_2 is set to 1, as it does not change across vehicles; what changes in this factor is the mileage. Finally, ω_3 is the price per waiting minute, which is influenced by the vehicle type; its term is negative because we want to minimize the waiting time for the customer. Our proposed algorithm first uses the initial price and notifies the vehicle (or driver), who then modifies the pricing based on the Q-values of the driver's dispatch-to location.

B. Vehicles' Proposed Pricing
Each driver follows a dispatch policy once he/she enters the market. This dispatch policy provides the best next dispatch action to make, predicted by weighing the expected discounted rewards (Q-values) associated with each possible move on the map using DQN (described in Section V and Appendix C). As a result of running such a policy every dispatch cycle (which is set to 5 minutes), the driver gains insight into how supply and demand are distributed over the city, and can thus make informed decisions on which destinations can yield a higher profit. The vehicle's (or driver's) decision-making process is formulated as follows:
• Based on knowledge of the expected discounted reward (i.e., Q-values) associated with the action of going to each location on the map (learned using DQN), the driver can order the destinations in descending order and assign each a rank α, representing where it falls in that ordered list. The driver dynamically maintains this ranking and continuously updates it whenever the dispatch policy runs. For flexibility, the driver also has a tolerance rate λ (a percentage) that he/she uses to decide on the size of the desired-zones list.
• After the route planning optimization step, the driver re-calculates the initial price P_init(r_i) using the updated route S'_{V_j}, by plugging cost(V_j, S'_{V_j}[r_i]) (after subtracting the cost of the original route before insertion) into Equation 3. This is done to account for any detours required to serve the new request r_i, and thus encourages requests going in the same direction to be matched together.
• Knowing the request pickup location, the driver retrieves the ranking α associated with the various destinations. He then adds the highest-ranking λ locations to a set of desired zones, denoted L.
Then, if the request's location is among that desired set L, the driver uses the initial price suggested for this trip, denoted P_init(r_i).
• Otherwise, it would indicate that he/she might end up in a region with low demand, and thus receive no more requests, or at least drive idle for a long distance. Instead of simply rejecting the request, he/she can propose a price to the customer that is slightly higher, by a factor influenced by both the rank of the destination and his/her own base price per trip B_j.

Let P(r_i) represent the final price proposed by driver j for the customer associated with request/trip i. The driver's pricing decision is as follows:

P(r_i) = P_init(r_i)                                         if loc(r_i) ∈ L
P(r_i) = P_init(r_i) + [P_init(r_i) · α_{loc(r_i)} · B_j]    otherwise        (4)

C. Customers' Decision Function
After the driver makes a decision regarding the price associated with the trip, it becomes the customer's turn to make his/her own decision according to his/her set of preferences. In our algorithm, we consider various preferences for each customer:
• Whether the customer is in a hurry, and how much delay he/she can tolerate. This is taken into account in the customer's utility and denoted by the delay/waiting time of trip i: T_i.
• Whether the customer prefers car-pooling or would rather take the ride alone, even if it means a higher price. This is captured in the utility equation through the current capacity of vehicle j assigned to trip i, denoted by V^j_C.
• Whether he/she prefers a certain type of vehicle for the trip, and whether he/she is willing to pay more in exchange for a more luxurious vehicle. The type of vehicle j assigned to trip i is denoted by V^j_T.

Based on the aforementioned factors, the customer's utility for request/trip i is formulated as:

U_i = [ω_4 · V^j_C] + [ω_5 · T_i] + [ω_6 · V^j_T]    (5)

where ω_4, ω_5, and ω_6 are the weights associated with each of the factors affecting the customer's overall utility. To add more flexibility, we introduce a customer's compromise threshold δ_i to represent how much customer i is willing to compromise in the decision-making process. Finally, the decision of customer i to accept or reject, denoted by C^i_d, after receiving the final price P(r_i) for trip i, is as follows:

C^i_d = 1 (accept) if U_i > P(r_i) − δ_i, and C^i_d = 0 (reject) otherwise.    (6)

Upon a customer's acceptance, no further action is required and the process continues. In case of rejection, a new matching process is initiated to match this request to another vehicle that better meets the customer's preferences.
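The pricing and decision rules above can be sketched as follows; the weight and parameter names are illustrative stand-ins for the paper's tuned values, and the utility is the simple weighted sum of Equation (5):

```python
def initial_price(base_price, route_cost, shared_riders, gas_price,
                  mileage, wait_time, w_mile, w_wait, w_fuel=1.0):
    """Eq. (3): base price + per-rider distance charge + fuel share - waiting penalty."""
    per_rider_cost = route_cost / max(shared_riders, 1)
    return (base_price
            + w_mile * per_rider_cost
            + w_fuel * per_rider_cost * (gas_price / mileage)
            - w_wait * wait_time)

def driver_price(p_init, dest_rank, base_price, in_desired_zones):
    """Eq. (4): keep the initial price for desired zones, mark it up otherwise."""
    if in_desired_zones:
        return p_init
    return p_init + p_init * dest_rank * base_price

def customer_accepts(capacity, wait_time, veh_type, price, threshold,
                     w_cap, w_time, w_type):
    """Eqs. (5)-(6): accept iff utility exceeds price minus the compromise threshold."""
    utility = w_cap * capacity + w_time * wait_time + w_type * veh_type
    return utility > price - threshold
```

For example, a driver whose desired-zones set excludes the request's destination would return `driver_price(p, rank, base, False)`, and the customer then weighs that marked-up price against his/her utility and compromise threshold.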
V. DISTRIBUTED DQN DISPATCHING APPROACH
We utilize this framework in order to re-balance vehicles over the city to better serve the demand. At the beginning of every time step t, vehicles that newly enter the system at time t are dispatched to areas of anticipated high demand following this approach. Moreover, at the end of every time step, we check for vehicles whose idle duration exceeds 10 minutes, and we apply the same technique to dispatch them to high-demand areas to better utilize our resources. The overall flow of this framework is explained in Algorithm 4.

Algorithm 4
Dispatching using DQN
1: Input: X_t, V_{t:t+T}, D_{t:t+T}
2: Output: Dispatch decisions
3: Construct a state vector s_{t,n} = (X_t, V_{t:t+T}, D_{t:t+T})
4: Get the best dispatch action a_{t,n} = argmax_a [Q(s_{t,n}, a, θ)] for all vehicles V_n using the Q-network
5: Get the destination zone Z_{t,j} for each vehicle j ∈ V_n based on action a_{t,j} ∈ a_{t,n}
6: Update dispatch decisions by adding (j, Z_{t,j})
7: Return (n, Z_{t,n})

At every time step t, the DQN agent obtains a representation of the environment, s_{t,n}, and calculates a reward r_t associated with each dispatch-to location in the action space a_{t,n}. Based on this information, the agent takes an action that directs the vehicle to the dispatch zone where the expected discounted future reward is maximized, as in Equation (7). In our algorithm, we define the reward r_k as a weighted sum of different performance components that reflect the objectives of our DQN agent. The reward is learnt from the environment for individual vehicles and then leveraged by the agent/optimizer to optimize its decisions. The decision variables are: (i) the dispatching of an available vehicle in zone m, V_j ∈ v_{t,m}, to another zone at time slot t; and (ii) if a vehicle V_j is not full, deciding γ_{j,t}, its availability for serving new customers at time slot t. If the vehicle is full, then γ_{j,t} = 0. If it is empty, it will serve new passengers whose requests are generated within vehicle V_j's current region at time t.

We define the overall objectives of the dispatcher, where our dispatch policy aims to (1) minimize the supply-demand mismatch, diff_t; (2) minimize the dispatch time, T^D_t (i.e., the expected travel time of vehicle V_j to go to zone m at time step t); (3) minimize the extra travel time a vehicle takes for car-pooling compared to serving one customer, ∆_t; (4) maximize the fleet profits, P_t; and (5) minimize the number of utilized vehicles, e_t. Each of these objectives is calculated and represented with a corresponding term, as described in Appendix C-B.
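The per-vehicle decision step of Algorithm 4 can be sketched as follows, with `q_values` standing in for the trained Q-network (an assumed interface, not the paper's actual network code):

```python
from typing import Callable, Dict, Sequence, Tuple

def dispatch(vehicle_ids: Sequence[int],
             state: Tuple,
             q_values: Callable[[Tuple, int], Sequence[float]]) -> Dict[int, int]:
    """Sketch of Algorithm 4: pick the argmax-Q zone for each vehicle.

    state    -- s_t = (X_t, V_{t:t+T}, D_{t:t+T}), treated as opaque here
    q_values -- returns one Q-value per candidate dispatch zone for a vehicle
    Returns a mapping: vehicle id -> destination zone index Z_{t,j}.
    """
    decisions = {}
    for j in vehicle_ids:
        q = q_values(state, j)
        # argmax over the action space (dispatch-to zones)
        decisions[j] = max(range(len(q)), key=q.__getitem__)
    return decisions
```

Because each vehicle evaluates its own Q-values and picks its own zone, the loop mirrors the distributed, per-agent nature of the dispatcher rather than a centralized assignment.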
Although we are minimizing the number of active vehicles in time step t, if the total distance or the total trip time of the passengers increases, it would be beneficial to use an unoccupied vehicle instead of having existing passengers encounter a large, undesired delay. The DQN overall reward function is represented as a weighted sum of these terms, as in Equations (17) and (18) for individual agents/vehicles. The details of learning the Q-values associated with the action space are provided in Appendix C-C, and the architecture of our Deep Q-Network is presented in Appendix C-A.

Note that the profits term P_t added to the reward function makes the output expected discounted rewards (Q-values) associated with each possible move on the map a good reflection of the expected earnings gained when heading to these locations. This gives drivers insight into how supply and demand are distributed over the city, and assists them in making knowledgeable and informed decisions when ranking their desired go-to locations that can yield them higher profits (potential hotspots), and thus in making the corresponding pricing decisions, as explained in Section IV-B.

VI. EXPERIMENTAL RESULTS
Our simulator is created based on a real public dataset of taxi trips in Manhattan, New York City [21]. We start by populating vehicles over the city, randomly assigning each vehicle a type and an initial location. According to the type assigned to each vehicle, we set the accompanying features accordingly, such as maximum capacity, mileage, and price rates (per mile of travel distance, ω_1, and per waiting minute, ω_3). We initialize the fleet of vehicles, allowing a portion of them to enter the market at every time step t. We consider the data of June 2016 for training, and one week from July 2016 for evaluation. We trained our DQN neural networks using the data from June 2016 for 10,000 epochs and used the most recent 5,000 experiences as replay memory. For each trip, we obtain the pick-up time, passenger count, origin location, and drop-off location. We use this trip information to construct the travel-request demand prediction model. We use Python and TensorFlow to implement our DPRS framework. To initialize the environment, we run the simulation for 20 minutes without dispatching the vehicles. In addition, we run the simulator for 8 days, and thus T = 8 × 24 × 60 steps, where ∆t = 1 minute. Finally, we set β_1 = 10, β_2 = 1, β_3 = 5, β_4 = 12, β_5 = 8, λ = 10%, ω_4 = 15, ω_5 = 1, and ω_6 = 4. Each vehicle has a maximum working time per day, after which it exits the market.

We break down the reward and the drivers' and customers' utilities, and investigate the performance against various baselines. Recall that we want to minimize the components of our reward function: supply-demand mismatch, average travel distance per vehicle, and number of used vehicles, captured by the utilization rate. We note that the supply-demand mismatch is reflected in our simulation through the acceptance/rejection rate metric. Recall that a request is rejected if there is no available vehicle within the serviceable range, or if it was rejected by a customer or driver. The cruising (idle) time metric represents the time during which a vehicle is neither occupied nor gaining profit but is still incurring gasoline cost. Further, the drivers' average profits and the customers' waiting time are also evaluated.
Note that the Q-values depend on the pricing, since the decisions made by customers and drivers impact the reward function through the profit term. We compare our proposed joint framework (with dispatching, ride-sharing, our novel DPRS pricing strategy, and the DARM approach for matching and route planning) against the following baselines to emphasize the impact of each component of our joint framework:

Fig. 2: Performance metrics of the proposed algorithm and the baselines (number of accepted requests per hour, occupancy rate of vehicles, and number of rejected requests per hour).

• No Dispatch, No Ride-sharing, No Pricing Strategy, Greedy Matching (!D, !RS, !PS, GM): In this setting, vehicles do not get dispatched to areas with anticipated high demand, no matter how long they stay idle. Ride-sharing (pooling) is not allowed; every vehicle serves only one request at a time. Also, the initial pricing is used and is accepted by both drivers and customers by default. For matching, only the greedy phase one of our DARM algorithm is applied; no optimization takes place.
• No Dispatch with Ride-sharing but No Pricing Strategy, with Greedy Matching (!D, RS, !PS, GM): similar to (!D, !RS, !PS, GM), except that ride-sharing (pooling) is allowed, so vehicles can serve more than one request at a time.
• Dispatch with No Ride-sharing and No Pricing Strategy, with Greedy Matching (D, !RS, !PS, GM): Here, vehicles are dispatched when idle, but ride-sharing is not allowed, similar to the setting in [22].
• Dispatch with Ride-sharing but No Pricing Strategy, with Greedy Matching (D, RS, !PS, GM): similar to (D, !RS, !PS, GM), but ride-sharing is allowed, similar to DeepPool in [20].
• Dispatch with Ride-sharing and Pricing Strategy, but with Greedy Matching (D, RS, PS, GM): similar to (D, RS, !PS, GM), but applying our DPRS pricing strategy, where customers and drivers are involved in the decision-making process. However, as in all the previous baselines, only greedy matching is adopted, without the insertion-based optimization phase of our DARM algorithm.

Involving drivers and customers in the decision-making process is expected to increase the rejection rate as well as the number of vehicles utilized to serve the demand. However, even after adding the distributed pricing based on the Q-network, which gives more control to the customers and drivers, our joint ride-sharing framework has a significantly low rejection rate.
In Figure 2, we investigate the overall performance of our proposed framework in comparison to all other baselines. We observe that our joint DARM + DPRS framework ranks highest in the acceptance rate per hour, followed by all the ridesharing-based baselines, while the non-ridesharing as well as the non-dispatching baselines come at the bottom of the list. Clearly, our matching and route planning approach boosts the acceptance rate, with the minimal rejection rate possible, when compared to the baselines that adopt greedy matching. We observe that the rejection rate is close to zero throughout most of the simulation time. To further analyze the rejection rate in our framework, we also observe that the rejection rate due to customers (i.e., when a customer weighs in and rejects a ride) is fairly close to the naturally encountered rejection rate that occurs due to the unavailability of vehicles within the request's vicinity. Besides, this comes at a utilization/occupancy rate of around two thirds of that of the rest of the baselines, saving one third of the vehicles for serving new incoming requests, which would in turn further increase the acceptance rate. Also, while we set the maximum number of vehicles in our simulator to 8,000, our approach utilizes only half of them (4,000) to serve the demand with a high acceptance rate.

This result supports our hypothesis, showing that our framework utilizes around the same number of vehicles as DeepPool and DPRS while involving both customers and drivers in the decision-making process. We will further show that our framework not only enhances the overall fleet utilization, but also the utilization of each individual vehicle (i.e., the percentage of time the vehicle is occupied).
Clearly, non-dispatching and non-ride-sharing algorithms are shown to have poor utilization of resources, as they use a higher number of vehicles to serve the same amount of demand. Since involving customers and drivers in the decision-making process is expected to increase both the rejection rate and the number of vehicles utilized to serve the demand, while DARM + DPRS performs equally well as DeepPool on both metrics, this makes DARM + DPRS superior to both DeepPool and DPRS-only.

Further, we take a closer look at all the ride-sharing-based baselines that adopt dispatching, as we can safely exclude the non-ridesharing and non-dispatching baselines from the competition just by looking at their poor performance in Figure 2. Table I shows that the average profits for the drivers have significantly increased over time as compared to all the other three protocols. Thus, quantifying the individual drivers' preferred zones based on the reward learnt using DQN could guarantee them a significant improvement in earnings that, in turn, helped them make up for any extra encountered cost and boost their profits.
This implies that both the drivers and customers are achieving a compromise that is profitable and convenient to them.

TABLE I: Histograms of performance metrics (profit, cruising time, occupancy rate, waiting time, and travel distance) for the proposed algorithm and the baselines: (a) our joint framework "DARM + DPRS"; (b) the (D, RS, PS, GM) baseline "DPRS with greedy matching"; (c) the (D, RS, !PS, GM) baseline "DeepPool in [20]"; (d) the (D, !RS, !PS, GM) baseline "Dynamic taxi fleet management in [22]".

Specifically, the profits for our DARM + DPRS framework are almost double those of DPRS-only, and markedly higher than those of DeepPool. Without ride-sharing, profits are far lower, ranging between $60 and $100 per hour, which is 10 times less than the average profit in our framework.

Moreover, we emphasize that non-dispatching protocols yield higher idle time for the vehicles, as vehicles might spend a large amount of time idle and never get dispatched to higher-demand areas. In contrast, non-ride-sharing protocols yield lower idle time, but they are still inefficient, as vehicles spend more time on duty while serving a lower number of customers than the ride-sharing protocols. Table I(a) shows that our joint framework minimizes the cruising time during which vehicles are idle, and thus minimizes the extra travel distance as well as the extra gasoline cost. On average, vehicles' idle time between requests falls within a minimal range for the DARM + DPRS framework. Given the vehicles' daily working time, we observe that most of the vehicles the system allows per day experience an idle time of less than 2 hours per day, a small fraction of their total working time. This metric is almost doubled for all the other three baselines. This proves that dynamic demand-aware matching provides better overall utilization of vehicles, which is reflected in the occupancy rate metric. Occupancy rate is defined as the percentage of time during which vehicles are occupied, out of their total working time.
Table I(a) also shows that most vehicles are occupied for the large majority of their working time, which once again proves that our framework significantly improves the utilization of each individual vehicle as well as the whole fleet.

However, Table I shows that the average travel distance of DARM + DPRS is a little higher than that of the other frameworks. Since DARM + DPRS provides significantly higher profits for drivers, the slight increase in travel distance is a worthwhile trade-off, and becomes an advantage of DARM + DPRS. Note that the two policies with the lowest travel distance in Table I do not involve drivers or customers in the decision-making process, which explains why their vehicles have a lower travel distance on average. However, we can observe that the non-ride-sharing protocols are not efficient, as they result in lower profit margins and higher customer waiting times.

Compared to both the DeepPool and DPRS-only frameworks, the waiting time per request is lower for the DARM + DPRS approach. As shown in Table I, the waiting time for customers reduces over time to less than a minute. On average, the response time (time until the customer is picked up) of our framework is less than 200 seconds.

VII. CONCLUSION
In this paper, we detailed two novel approaches, the Demand-Aware and pricing-based Matching and route planning (DARM) framework and the Distributed Pricing approach for Ride-Sharing with pooling (DPRS), which generate routes on the fly and involve both customers and drivers in the decision-making processes of ride acceptance/rejection and pricing. The agents' decision-making process is informed by utility functions that aim to achieve the maximum profit for both drivers and customers. The utility functions also account for fuel costs, waiting time, and the passenger's spending power in computing the reward. These novel DARM and DPRS methodologies are integrated via a Deep Q-network (DQN)-based dispatch algorithm, where the profits influence the dispatch and the Q-values impact the pricing and thus the matching. The integrated approach was evaluated on real-world New York City public taxi data, where supply and demand are predicted by the Deep Q-networks and dispatching is driven by deep Reinforcement Learning. Contrary to the high rejection rate expected when agents are given a choice to reject rides, experimental results show that the rejection rate is significantly low for the DARM and DPRS frameworks. Given the maximum number of vehicles (8,000) populated in the simulation, our framework uses only 50% of the vehicles to accept and serve up to 90% of the requests. When compared with the no-ride-sharing framework, our framework provides 10 times more profit. Experiments also show that vehicle idle time (cruising without passengers) is reduced to just two hours per day, and 80%-100% of the vehicles are occupied nearly all the time. Even though the travel distance for the DARM + DPRS framework is higher than for the other frameworks, it reduces the impact of this trade-off by providing significantly higher profits for drivers in exchange for a slightly longer travel distance.
Our model-free DARM + DPRS framework can be extended to large-scale ride-sharing protocols, since the distributed decision making across vehicles reduces the decision space significantly.

REFERENCES
[1] M. Iqbal, "Uber revenue and usage statistics (2020)," BusinessOfApps.
[3] Proceedings of the VLDB Endowment, vol. 11, no. 11, p. 1633, 2018.
[4] X. Bei and S. Zhang, "Algorithms for trip-vehicle assignment in ride-sharing," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[5] J. Wang, P. Cheng, L. Zheng, C. Feng, L. Chen, X. Lin, and Z. Wang, "Demand-aware route planning for shared mobility services," Proc. VLDB Endow., vol. 13, no. 7, pp. 979–991, Mar. 2020. [Online]. Available: https://doi.org/10.14778/3384345.3384348
[6] L. Zheng, L. Chen, and J. Ye, "Order dispatch in price-aware ridesharing," Proceedings of the VLDB Endowment, vol. 11, no. 8, pp. 853–865, 2018.
[7] M. Asghari, D. Deng, C. Shahabi, U. Demiryurek, and Y. Li, "Price-aware real-time ride-sharing at scale: an auction-based approach," in Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2016, pp. 1–10.
[8] B. Cici, A. Markopoulou, and N. Laoutaris, "Designing an on-line ride-sharing system," in Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2015, pp. 1–4.
[9] S. Yeung, E. Miller, and S. Madria, "A flexible real-time ridesharing system considering current road conditions," vol. 1, IEEE, 2016, pp. 186–191.
[10] Y. Huang, R. Jin, F. Bastani, and X. S. Wang, "Large scale real-time ridesharing with service guarantee on road networks," arXiv preprint arXiv:1302.6666, 2013.
[11] S. Ma, Y. Zheng, and O. Wolfson, "T-share: A large-scale dynamic taxi ridesharing service," IEEE, 2013, pp. 410–421.
[12] M. Ota, H. Vo, C. Silva, and J. Freire, "Stars: Simulating taxi ride sharing at scale," IEEE Transactions on Big Data, vol. 3, no. 3, pp. 349–361, 2016.
[13] P. Cheng, H. Xin, and L. Chen, "Utility-aware ridesharing on road networks," in Proceedings of the 2017 ACM International Conference on Management of Data, 2017, pp. 1197–1210.
[14] D. O. Santos and E. C. Xavier, "Dynamic taxi and ridesharing: A framework and heuristics for the optimization problem," in Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[15] R. S. Thangaraj, K. Mukherjee, G. Raravi, A. Metrewar, N. Annamaneni, and K. Chattopadhyay, "Xhare-a-ride: A search optimized dynamic ride sharing system with approximation guarantee," IEEE, 2017, pp. 1117–1128.
[16] R. Zhang and M. Pavone, "Control of robotic mobility-on-demand systems: a queueing-theoretical perspective," The International Journal of Robotics Research, vol. 35, no. 1-3, pp. 186–203, 2016.
[17] S. Ma, Y. Zheng, and O. Wolfson, "Real-time city-scale taxi ridesharing," IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 1782–1795, 2014.
[18] A. Kleiner, B. Nebel, and V. A. Ziparo, "A mechanism for dynamic ride sharing based on parallel auctions," in Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
[19] X. Bei and S. Zhang, "Algorithms for trip-vehicle assignment in ride-sharing," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
IEEE INFOCOM 2018 - IEEE Conference on Computer Communications.
Marketing Intelligence & Planning Journal, vol. 23, pp. 382–394, 2005.

APPENDIX A: CONTROL UNIT COMPONENTS

A. OSRM and ETA Model
We construct a region graph relying on the New York City map, obtained from OpenStreetMap [23]. We also construct a directed graph as the road network by partitioning the city's service area into 212 × 219 bin locations of size 150m × 150m. We find the closest edge-nodes to the source and destination and then search for the shortest path between them. To estimate the minimal travel time for every pair of nodes, we need the travel time between every two nodes/locations on the graph. To learn this, we build a fully connected neural network using historical trip data as input and the travel time as output. The fully connected multi-layer perceptron consists of two hidden layers with a width of 64 units and rectifier nonlinearities. The output of this neural network gives the expected travel time between zones. While this model is relatively simple (it contains only two hidden layers), our goal is to achieve a reasonably accurate estimation of dispatch times with a short running time. Finally, if there is no vehicle within the serviceable range, the request is considered rejected.

B. Demand Prediction Model
We use a convolutional neural network to predict future demand. The output of the network is a 212 x 219 image in which each pixel represents the expected number of ride requests in a given zone 30 minutes ahead. The network input consists of two planes holding the actual demand of the last two steps; the size of each plane is 212 x 219. The first hidden layer convolves filters of size 5 x 5 with a rectifier nonlinearity, the second layer convolves 32 filters of size 3 x 3 with a rectifier nonlinearity, and the output layer convolves 1 filter of size 1 x 1 followed by a rectifier nonlinearity.

APPENDIX B: MODEL PARAMETERS AND NOTATIONS

We build a ride-sharing simulator to train and evaluate our framework. We simulate New York City as our area of operation, with the area divided into multiple non-overlapping zones (or regions), each of 1 square mile. This discretizes the area of operation and thus makes the action space (where to dispatch the vehicles) tractable. We use m ∈ {1, 2, ..., M} to denote the city's zones. We optimize our algorithm over T time steps, each of duration ∆t. Here, we present the model parameters and notations:

1) Demand:
We denote the number of requests in zone m at time t as d_{t,m}. d_{t,t̃,m} is the number of vehicles that are currently unavailable at time t but will become available at time t̃ as they drop off customer(s) in region m; it can be estimated using the estimated time of arrival (ETA) model in [20]. We denote the predicted future demand from time t to time t + T in each zone as D_{t:t+T} = { d̄_t, ..., d̄_{t+T} }. This information can be obtained by maintaining the state vectors described next in step 2.

2) State Vector:
The state variables reflect the status of the environment. We use X_t = { x_{t,1}, x_{t,2}, ..., x_{t,N} } to denote the status of the N vehicles at time t. x_{t,n} is a vector of vehicle n's state variables at time step t, such as: its current location V_loc, its current capacity V_C, its type V_T, its maximum capacity C_Vmax, the time at which each passenger was picked up, and the destination of each passenger. A vehicle is considered available if at least one of its seats is vacant, that is, if and only if V_C < C_Vmax. A vehicle becomes unavailable when all its seats are occupied or when it will not consider taking an extra passenger. Let γ_{j,t} be a binary decision variable that is 1 if vehicle V_j decides to serve customers and 0 otherwise. Only available vehicles can be dispatched in our algorithm. Let the number of available vehicles in region m at time slot t be v_{t,m} = Σ_{n=1}^{N} γ_{n,t,m} {∀ vehicle n ∈ zone m}. Using a vehicle's state information, we can predict the time slot at which that vehicle will become available (if it is currently unavailable). Thus, for a set of dispatch actions at time t, we can predict the number of vehicles in each zone for T time slots ahead, from time t to time t + T, denoted by V_{t:t+T}, which serves as our predicted supply in each zone for T time slots ahead. Our dispatching policies can be further improved by anticipating the demand in every zone through the historical weekly/daily distribution of trips across zones [24]. Combining all this data, we define a three-tuple that captures the environment updates at time t as s_t = ( X_t, V_{t:t+T}, D_{t:t+T} ). When a set of new ride requests arrives at the system, we retrieve from the environment all the state elements, combined in one vector s_t. Also, when a passenger's request is accepted, we append the customer's expected pickup time, source, and destination to s_t as well.
These variables change in real time according to environment variations and demand/supply dynamics. Our framework keeps track of all these rapid changes and seeks to bring the demand d_t, ∀t, and the supply v_t, ∀t, close enough that the mismatch between them is zero.

3) Action: a_{t,n} denotes the action taken by vehicle n at time step t. This action has two parts: a) first, if the vehicle still has vacant seats, it decides whether to accept new passengers or to serve only the on-board customers, and b) if it decides to accept new customers, or if it is initially empty, it needs to decide which zone to head to at time step t. Naturally, a fully occupied vehicle cannot serve any additional customers. Finally, if a vehicle decides to serve only its existing on-board passengers, it takes the shortest optimal route to its customers' destinations.

4) Reward: at every time step t, the DQN agent obtains a representation of the environment, s_t, and a reward r_t that is explained in Appendix C-B. Based on this information, the agent takes an action that directs the vehicle (either idle or recently entered the market) to the dispatch zone where the expected discounted future reward is maximized, i.e.,

Σ_{k=t}^{∞} η^{k−t} r_k( s_k, a_k )   (7)

where η < 1 is a time-discounting factor. In our algorithm, we define the reward r_k as a weighted sum of different performance components that reflect the objectives of our DQN agent. The reward is learnt from the environment for individual vehicles and then leveraged by the DPRS optimizer to optimize its decisions. Note that by ride-sharing we mean ride-sharing with pooling in our model.

Fig. 3: The architecture of the Q-network. The output represents the Q-value for each possible movement/dispatch.

APPENDIX C: DQN DISPATCHING ALGORITHM
A. DQN Architecture
Figure 3 shows the architecture of the Q-network; the output represents the Q-value for each possible movement/dispatch. In our simulator, the service area is divided into 43 x 44 cells, each of size 800m x 800m. A vehicle can move at most 7 cells vertically and at most 7 cells horizontally, or remain in its own cell, so the action space is limited to the surrounding 15 x 15 map of cells for each vehicle. The input to the neural network consists of the state representation, demand, and supply, while the output is the Q-value for each possible action/move. The input consists of a stack of four feature planes of demand and supply heat-map images, each of size 51 x 51. In particular, the first plane holds the predicted number of ride requests in each region over the next 30 minutes, while the three other planes provide the expected number of available vehicles in each region in 0, 15, and 30 minutes. Before the demand and supply images are passed into the network, average pooling operations of different sizes with stride (1, 1) are applied to the heat maps, resulting in 23 x 23 x 4 feature maps. The first hidden layer convolves 16 filters of size 5 x 5 followed by a rectifier nonlinearity. The second and third hidden layers convolve 32 and 64 filters of size 3 x 3, each followed by a rectifier nonlinearity. The output of the third layer is passed to another convolutional layer of size 15 x 15 x 128, and the output layer, of size 15 x 15 x 1, convolves one filter of size 1 x 1. Since reinforcement learning is unstable for nonlinear approximators such as neural networks, due to correlations in the action-value estimates, we use experience replay to overcome this issue. Since every vehicle runs its own DQN policy, the environment during training changes over time from the perspective of each individual vehicle.
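The experience replay mentioned above can be implemented as a fixed-size buffer from which training minibatches are sampled uniformly at random, breaking the correlation between consecutive transitions. A minimal sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=10000, seed=0):
        self.buf = deque(maxlen=capacity)  # oldest transitions evicted when full
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buf.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling decorrelates consecutive transitions.
        return self.rng.sample(list(self.buf), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):                 # the first 50 transitions get evicted
    buf.push(t, t % 15, -1.0, t + 1)
batch = buf.sample(32)
print(len(buf.buf), len(batch))  # 100 32
```

In the per-vehicle setting described here, each DQN agent would draw its minibatches from such a buffer rather than from the most recent trajectory.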
B. DQN Dispatch Agent
We defined the overall objectives of the dispatcher as well as the decision variables in Section V. Below, we present how each of these objectives is calculated and represented with a corresponding term:

1) Minimize the supply-demand mismatch. Recall that v_{t,m} and d̄_{t,m} denote the number of available vehicles and the anticipated demand, respectively, at time step t in zone m. We want to minimize their difference over all M zones; therefore, we get:

diff_t = Σ_{m=1}^{M} ( d̄_{t,m} − v_{t,m} )   (8)

The reward is learnt from the environment for individual vehicles, so we map this term to individual vehicles. When a vehicle serves more requests, the difference between supply and demand shrinks, helping satisfy the demand of the zone it is located in. Therefore, the total number of customers served by vehicle n at time step t is:

C_{t,n} = Σ_{m=1}^{M} v^n_{t,m}  (where v^n_{t,m} = 1 when v_{t,m} < d̄_{t,m}), with Σ_{m=1}^{M} v^n_{t,m} = 1  (γ_{n,t,m} ∈ {0, 1}, where n ∈ v_{t,m})   (9)

2) Minimize the dispatch time, which refers to the expected travel time h^n_{t,m} for vehicle n to go to zone m at time step t. We calculate this time from the location of vehicle n at time t, which is already included in the state variable x_{t,n}. Idle vehicles get dispatched to zones (where anticipated demand is high) other than their current ones, even if they do not have any new requests yet, in order to pick up new customers in the future.
Since we want to minimize over all N available vehicles and all M zones within time t, we get the total dispatch time T^D_t as follows:

T^D_t = Σ_{n=1}^{N} Σ_{m=1}^{M} h^n_{t,m}  {∀ n ∈ v_{t,m}}   (10)

For individual vehicles, taking the neighboring vehicles' locations into account while making the decision, we get for vehicle n at time step t:

T^D_{t,n} = Σ_{m=1}^{M} h^n_{t,m}  {where n ∈ v_{t,m}}   (11)

3) Minimize the difference between the time the vehicle would have taken to serve one customer alone and the time it takes with car-pooling. For vehicles that participate in ride-sharing, extra travel time may be incurred because (1) the vehicle takes a detour to pick up an extra customer, or (2) after picking up a new customer, the new optimal route based on all destinations incurs extra travel time to accommodate the new customers. This also implies that customers already on board encounter extra delay. Therefore, this difference in time needs to be minimized; otherwise both customers and drivers would be disincentivized to car-pool. Let t′ be the total time elapsed since passenger l requested the ride, t_{n,l} be the travel time vehicle n would have taken had it served rider l alone, and t̃_{n,l} be the updated time vehicle n will now take to drop off passenger l because of the detour and/or picking up a new customer at time t. Note that t̃_{n,l} is updated every time a new customer is added. Therefore, for vehicle n and rider l at time step t, we want to minimize ξ_{t,n,l} = t′ + t̃_{n,l} − t_{n,l}. For vehicle n, we minimize over all of its passengers: Σ_{l=1}^{∪_n} ξ_{t,n,l}, where ∪_n is the total number of users chosen for pooling by vehicle n up to time t. Note that ∪_n is not known a priori but is adapted dynamically in the DQN policy.
It also varies as passengers are picked up or dropped off by vehicle n. Optimizing over all N vehicles, the total extra travel time can be represented as:

∆_t = Σ_{n=1}^{N} Σ_{l=1}^{∪_n} ξ_{t,n,l}   (12)

For individual vehicles, the extra travel time for vehicle n at time step t becomes:

T^E_{t,n} = Σ_{l=1}^{∪_n} ξ_{t,n,l}   (13)

4) Maximize the fleet profits. This is calculated as the earnings E_t minus the cost over all vehicles. The cost of vehicle n is its total travel distance D_{t,n} divided by its mileage M^n_V and multiplied by the average gas price P_G. Therefore, the profits for the whole fleet can be represented as:

P_t = Σ_{n=1}^{N} ( E_{t,n} − [ D_{t,n} / M^n_V · P_G ] )   (14)

Since we estimate the reward for individual vehicles, the profit for vehicle n at time step t becomes:

P_{t,n} = E_{t,n} − [ D_{t,n} / M^n_V · P_G ]   (15)

5) Minimize the number of utilized vehicles/resources. We capture this by minimizing the number of vehicles that become active after being inactive at time step t. Let e_{t,n} indicate whether vehicle n is non-empty at time step t. The total number of vehicles that recently became active at time t is given by:

e_t = Σ_{n=1}^{N} [ max( e_{t,n} − e_{t−1,n}, 0 ) ]   (16)

Although we minimize the number of active vehicles at time step t, if the total distance or the total trip time of the passengers increases, it is beneficial to use an unoccupied vehicle rather than subject existing passengers to a large undesired delay.

Having defined all our objective terms, we represent the DQN reward function as a weighted sum of these terms as follows:

r_t = − [ β1 diff_t + β2 T^D_t + β3 ∆_t ] + β4 P_t − β5 e_t   (17)

Note, from equation (7), that we maximize the discounted reward over a time frame.
The negative sign indicates that we want to minimize the terms within the bracket; the weights β1, β2, β3, β4, and β5 reflect the relative importance of each objective. Finally, note that the reward for vehicle n is 0 if it decides to serve only the passengers on board (if any). Therefore, we focus on the scenario where vehicle n decides to serve a new user and is willing to take a detour at time t. In this case, the objectives above map to: (1) C_{t,n}, the number of customers served by vehicle n at time t; (2), (3) the dispatch time and extra travel time, denoted by T^D_{t,n} and T^E_{t,n}; and (4) the profit of vehicle n at time t, P_{t,n}. The reward r_{t,n} for vehicle n at time t is then represented by:

r_{t,n} = r( s_{t,n}, a_{t,n} ) = β1 C_{t,n} + β2 T^D_{t,n} + β3 T^E_{t,n} + β4 P_{t,n} + β5 [ max( e_{t,n} − e_{t−1,n}, 0 ) ]   (18)

In equation (18), the last term captures the status of vehicle n: e_{t,n} is set to 1 if vehicle n was empty and then becomes occupied at time t (even if by one passenger); however, if it was already occupied and just takes a new customer, e_{t,n} is 0. The intuition is that if an already occupied vehicle serves a new user, the congestion and fuel costs are lower than when an empty vehicle serves that user. Note that if we make β3 very large, it disincentivizes passengers and drivers from making detours to serve other passengers; the setting then becomes similar to the one in [22], where there is no carpooling. In our algorithm, we use reinforcement learning to learn the reward function stated in (18) using DQN.
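As a concrete sketch, the per-vehicle reward of equation (18) is a weighted sum of the objective terms. The β values and inputs below are illustrative placeholders; the time and activation terms are written with explicit negative signs, matching the minimization intent of equation (17) (in equation (18) the signs can equivalently be absorbed into the weights).

```python
def vehicle_reward(c_served, dispatch_time, extra_time, profit,
                   active_now, active_prev, betas):
    """Per-vehicle reward in the weighted-sum form of Eq. (18).

    Served customers and profit are rewarded; dispatch time, extra
    travel time, and newly activating an idle vehicle are penalized.
    """
    b1, b2, b3, b4, b5 = betas
    activation = max(active_now - active_prev, 0)  # 1 only if the vehicle just became active
    return (b1 * c_served
            - b2 * dispatch_time
            - b3 * extra_time
            + b4 * profit
            - b5 * activation)

# Example: vehicle serves 2 riders, 5 min dispatch, 3 min detour, $12 profit,
# and was already active (no activation penalty). Weights are placeholders.
r = vehicle_reward(2, 5.0, 3.0, 12.0, 1, 1, betas=(1.0, 0.1, 0.2, 0.5, 1.0))
print(round(r, 6))  # 6.9
```

Tuning the β weights trades off the objectives; for instance, a very large β3 (extra-travel-time weight) effectively disables detours and hence pooling, as noted in the text.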
By learning the probabilistic dependence between the action and the reward function, we learn the Q-values associated with the probabilities P( r_t | a_t, s_t ) over time by feeding in the current states of the system. Instead of assuming any specific structure, our model-free approach learns the Q-values dynamically using convolutional neural networks whose architecture is described in Appendix C-A. The Q-values are then used to decide on the best dispatching action for each individual vehicle. Since the state space is large, we do not use the full representation of s_t; instead, a map-based input is used to alleviate this massive computation.

C. Learning the Expected Discounted Reward Function: Q-values
In our algorithm, deep Q-networks are utilized to dynamically generate optimized Q-values. This learning technique is characterized by its high adaptability to dynamic features of the system, which is why it is widely adopted in modern decision-making tasks. The optimal action-value function for vehicle n is defined as the maximum expected achievable reward. Thus, for any policy π_t, we have:

Q*( s, a ) = max_π E[ Σ_{k=t}^{∞} η^{k−t} r_{k,n} | s_{t,n} = s, a_{t,n} = a, π_t ]   (19)

where 0 < η < 1 is the discount factor for the future. If η is small (resp. large), the dispatcher is more likely to maximize the immediate (resp. future) reward. At any time slot t, the dispatcher observes the current state s_t and feeds it to the neural network (NN) to generate an action. In our algorithm, we utilize a neural network to approximate the Q function in order to estimate this expectation. For each vehicle n, an action is taken such that the output of the neural network is maximized. Learning starts with no prior knowledge, and actions are chosen by the epsilon-greedy method: the agent chooses the action with the highest Q-value with probability 1 − ε, and otherwise selects a random action. ε is reduced linearly from 1 to 0.1 over T_n steps. For the n-th vehicle, after choosing the action and observing the reward r_{t,n}, the Q-value is updated with a learning rate σ as follows:

Q′( s_{t,n}, a_{t,n} ) ← (1 − σ) Q( s_{t,n}, a_{t,n} ) + σ [ r_{t,n} + η max_a Q( s_{t+1,n}, a ) ]   (20)

Similar to ε, the learning rate σ is reduced linearly from 0.1 to 0.001 over 10000 steps.
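The epsilon-greedy selection and the update of equation (20) can be sketched in tabular form, with a toy Q-table standing in for the neural network; states, actions, and hyperparameters below are illustrative only.

```python
import random

def epsilon_greedy(q_row, eps, rng):
    """Pick the argmax action with probability 1 - eps, else a random action."""
    if rng.random() < eps:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(Q, s, a, r, s_next, sigma, eta):
    """Tabular form of Eq. (20): blend the old value with the bootstrapped target."""
    target = r + eta * max(Q[s_next])
    Q[s][a] = (1 - sigma) * Q[s][a] + sigma * target

# Toy example: 2 states, 3 actions.
rng = random.Random(0)
Q = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = epsilon_greedy(Q[1], eps=0.0, rng=rng)          # purely greedy -> action 1
q_update(Q, s=1, a=a, r=2.0, s_next=0, sigma=0.1, eta=0.9)
print(a, round(Q[1][1], 6))  # 1 1.1
```

In the paper's setting the table lookup is replaced by the convolutional Q-network of Appendix C-A, but the blending of old value and bootstrapped target is the same.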
We note that an artificial neural network is needed to handle the large state space. When updating these values, a loss function L_i( θ_i ) is used to compute the difference between the predicted Q-values and the target Q-values, i.e.,

L_i( θ_i ) = E[ ( r_t + η max_a Q( s′, a; θ̄_i ) − Q( s, a; θ_i ) )² ]   (21)

where θ_i and θ̄_i are the weights of the online and target neural networks, respectively. This expression represents the mean-squared error in the Bellman equation, where the optimal values are approximated with the target value r_t + η max_a Q( s′, a; θ̄_i ), computed using the weights θ̄_i.
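A minimal numpy sketch of the loss in equation (21), computed over a small batch with hand-picked values; the arrays are placeholders, not real network outputs.

```python
import numpy as np

def dqn_loss(q_online, q_target, rewards, eta):
    """Mean-squared Bellman error of Eq. (21).

    q_online: (B,) predicted Q(s, a; theta_i) for the actions actually taken.
    q_target: (B, A) target-network values Q(s', a; theta_bar_i) for all actions.
    """
    y = rewards + eta * q_target.max(axis=1)   # bootstrapped targets (frozen weights)
    return float(np.mean((y - q_online) ** 2))

q_online = np.array([1.0, 0.5])
q_target = np.array([[0.0, 2.0],
                     [1.0, 1.0]])
loss = dqn_loss(q_online, q_target, rewards=np.array([1.0, 0.0]), eta=0.5)
print(loss)  # mean of (2.0 - 1.0)^2 and (0.5 - 0.5)^2 = 0.5
```

Keeping θ̄_i frozen between periodic syncs is what stabilizes the regression: the targets y do not shift with every gradient step on θ_i.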