Road Network Metric Learning for Estimated Time of Arrival
Yiwen Sun∗, Kun Fu†, Zheng Wang†, Changshui Zhang∗ and Jieping Ye†
∗Department of Automation, Tsinghua University, State Key Lab of Intelligent Technologies and Systems, Institute for Artificial Intelligence, Tsinghua University (THUAI), Beijing National Research Center for Information Science and Technology (BNRist), Beijing, China
Email: [email protected], [email protected]
†DiDi AI Labs, Beijing, China
Email: {fukunkunfu, wangzhengzwang, yejieping}@didiglobal.com

Abstract—Recently, deep learning has achieved promising results in Estimated Time of Arrival (ETA), which is the task of predicting the travel time from the origin to the destination along a given path. One of the key techniques is to use embedding vectors to represent the elements of the road network, such as the links (road segments). However, the embedding suffers from the data sparsity problem: many links in the road network are traversed by too few floating cars, even on large ride-hailing platforms such as Uber and DiDi. Insufficient data leaves the embedding vectors in an under-fitting state, which undermines the accuracy of ETA prediction. To address the data sparsity problem, we propose the Road Network Metric Learning framework for ETA (RNML-ETA). It consists of two components: (1) a main regression task to predict the travel time, and (2) an auxiliary metric learning task to improve the quality of link embedding vectors. We further propose the triangle loss, a novel loss function to improve the efficiency of metric learning. We validated the effectiveness of RNML-ETA on large-scale real-world datasets, showing that our method outperforms the state-of-the-art model and that the improvement concentrates on the cold links with few data.
I. INTRODUCTION
Intelligent Transportation System (ITS) aims to explore better transportation options for human beings and better relationships among users, vehicles and transportation infrastructures [1], [2]. Nowadays, with massive spatio-temporal data, artificial intelligence plays an increasingly important role in ITS by leveraging data-driven methods to analyze traffic patterns, and has obtained promising results in many ITS tasks [3], [4], [5].

Estimated Time of Arrival (ETA) is one of the most fundamental and challenging problems in ITS. It is the task of predicting the travel time from an origin location to a destination location along a given route. An ETA model enables the transportation system to efficiently schedule vehicles to control the increasing urban traffic congestion [6]. Due to the rapid growth of ride-hailing apps such as Uber and DiDi, ETA has attracted more and more attention in recent years. An accurate ETA system can significantly improve the operating efficiency of ride-hailing platforms by influencing route planning, navigation, carpooling, vehicle
dispatching and scheduling. The left part of Fig. 1 shows a real case of ETA.

Fig. 1. Conceptual demonstration of RNML-ETA. The left part shows a real case in which the ETA system predicts the travel time along the route starting from the green pin to the red pin. The route consists of a sequence of links. To alleviate the data sparsity problem, we propose to transfer the knowledge of hot links to cold links by metric learning. The links' similarity is measured using their speed distributions.

Existing ETA methods can be divided into two categories. The first one is the additive methods, which explicitly predict the travel time for each road segment and give the total travel time of a route by assembling the segments' travel times. These methods have intuitive interpretability, but the prediction may be inaccurate when local errors accumulate. The other one is the overall methods, which directly predict the overall travel time of the route by formulating ETA as a regression problem. For example, the Wide-Deep-Recurrent model (WDR) [4] uses a neural network to predict the travel time based on a rich set of input features. This kind of method avoids local error accumulation but has relatively weak interpretability because of the black-box model.

We refer to the road segments as links in the remainder of this paper. The technique of embedding [7], [8], [9] is widely used, especially in deep learning ETA models, to capture the spatio-temporal patterns of links, as the link is one of the most fundamental elements in the road network. Each link is represented by an embedding vector which encodes the link's semantic information through sufficient iterations during the training process. Though ride-hailing platforms collect millions of trajectories per day, the embedding vectors still suffer from the data sparsity problem of the road network: many links are traversed by too few floating cars.
For cold links, which are covered by few trajectories, the training of their embedding vectors may end in an under-fitting state. Thus, the travel time estimation may have a large error if a route goes through cold links.

To alleviate the data sparsity problem, we propose a novel ETA model named RNML-ETA. The model leverages multi-task learning [10] and consists of a main task predicting the travel time and an auxiliary task performing metric learning, in which the similarity between links is measured by their speed distributions. Via metric learning, similar links get close and dissimilar links get far away in the embedded space. Thus, the embedding vectors of cold links get sufficient training, which significantly improves the ETA accuracy. Moreover, we propose a novel loss function, the triangle loss, for metric learning to take more interaction into consideration in one update. To achieve this, we switch the roles of links among the anchor, positive and negative samples. A conceptual demonstration of RNML-ETA is given in Fig. 1.

The main contributions of this paper are three-fold:
• To the best of our knowledge, RNML-ETA is the first deep learning method that effectively addresses the data sparsity problem of the road network.
• We propose a novel metric learning framework to improve the quality of link embedding vectors. The similarity of links is measured using their speed distributions, which can be computed from existing ETA data, requiring no extra information. We also propose the novel triangle loss to improve the learning efficiency of metric learning.
• We conducted a comprehensive evaluation of our method on large-scale real-world datasets containing over 100 million trajectories. The experimental results validated that RNML-ETA significantly improves the performance compared to a state-of-the-art deep learning method.

The rest of this paper is organized as follows. Section II reviews the related works. Section III introduces our method RNML-ETA in detail.
Section IV gives the experimental results on the large-scale real-world datasets. Section V concludes this paper.

II. RELATED WORK
Estimated Time of Arrival.
As one of the fundamental problems in intelligent transportation systems, ETA has attracted extensive study in both the academic and industrial communities. ETA models can be divided into two categories. The first category is the additive methods that explicitly estimate the travel time for each link and give the prediction of a route by assembling the links' travel times. Rule-based methods can be used in the estimation of link travel time. For example, a simple rule dividing the link length by the link travel speed is widely used in industry. Learning-based methods, such as the dynamic Bayesian network [11], gradient boosted regression tree [12], least-square minimization [13] and pattern matching [14], are also used to mine the traffic patterns and predict the link's travel time. The data sparsity problem of the road network is discussed in [15]: a part of the links are traversed by too few trajectories. To alleviate the data sparseness, the authors of [15] propose to represent the trips as a tensor and utilize tensor decomposition to complete the missing values. However, dealing with data sparsity is still a challenging problem for ETA.

The second category is the overall methods that directly predict the overall travel time of the given route. Early methods such as TEMP [16] and the time-dependent landmark graph [17] use traditional machine learning techniques to predict the travel time. Recently, due to the bloom of deep learning [18], [19], [20], neural network models for ETA are developing rapidly. MURAT [21] uses feed-forward neural networks to predict the travel time from the origin to the destination without a given path. Multi-task learning and graph embedding are used in MURAT to narrow the accuracy gap to the path-based methods. DeepTTE [22] proposes a geo-convolution operation to encode the coordinate information and uses a recurrent neural network to learn the travel time along a GPS sequence.
Since the GPS sequence cannot be acquired until the trip is finished, DeepTTE resamples the GPS points by uniform distance at the training stage and generates pseudo points according to a planned route at the inference stage. The WDR model [4] uses a wide linear part and a deep neural network to learn the trip-level information, and a recurrent neural network to learn the fine-grained sequential information in the route. The authors of [23], [24] transform the map information into image sequences and adopt convolutional neural networks to mine spatial correlations for ETA. In these deep learning methods, the embedding of geographical elements, such as the link embedding in [21], [4] and the grid embedding in [25], plays an important role. The embedding technique suffers from the data sparsity problem as well, because insufficient data leaves the embedding vectors in an under-fitting state.
Metric learning.
The goal of metric learning is to learn a representation function that maps objects into an embedded space. The distance in the embedded space should preserve the objects' similarity: similar objects get close and dissimilar objects get far away. Various loss functions have been developed for metric learning. For example, the contrastive loss [26] guides objects from the same class to be mapped to the same point and those from different classes to be mapped to different points whose distances are larger than a margin. The triplet loss [27] is also popular; it requires the distance between the anchor sample and the positive sample to be smaller than the distance between the anchor sample and the negative sample. The case with one positive sample and multiple negative samples is extended in [28]. Metric learning often suffers from slow convergence, partially because the loss only captures limited interaction in one update.

III. METHODOLOGY

We describe the road network as a set of links {l = 1, 2, · · · , M}, where M is the total number of links in the map and l is the link ID ranging from 1 to M. We then give the definition of the ETA learning problem, which is essentially a regression task:

Definition 3.1:
ETA Learning. Suppose we have a collection of historical trips {s_i, e_i, d_i, p_i}, i = 1, · · · , N, where N stands for the total number of trips, s_i is the departure time, e_i is the arrival time, d_i is the driver ID and p_i is the travel path of the i-th trip. Our goal is to fit a model that can predict the travel time estimation y'_i given the departure time, the driver ID and the travel path. The ground-truth travel time y_i can be computed as y_i = e_i − s_i. The travel path p_i is represented as a sequence of links p_i = {l_{i1}, l_{i2}, · · · , l_{iT_i}}, where l_{ij} is the ID of the j-th link in the i-th sequence and T_i is the sequence length of p_i.

We introduce the overall framework of the proposed method in Section III-A, define the measurement of link similarity in Section III-B and introduce the details of our metric learning loss in Section III-C.

A. Overall Framework
We first construct a rich feature set from the raw information of trips. For example, according to the departure time, we can obtain the time slice in a day (every 5 minutes) and the day of the week. The features can be categorized into two types: (1) the sequential features, which are extracted from the travel path p_i. For a link l_{ij}, we denote its feature vector as x_{ij}, and get a feature matrix X_i = [x_{i1}, · · · , x_{iT_i}] for the i-th trip. Note that the sequential feature has variable size; in other words, the column number of X_i is decided by the path length; and (2) the non-sequential features, which are irrelevant to the travel path, e.g. the day of the week. They are represented as a feature vector z_i with fixed size.

The link embedding vector is an important component of the link feature vector x_{ij}. For the link with ID = l_{ij}, we look up an embedding table E_L ∈ R^{d×M} (d is the embedding dimension), and use its l_{ij}-th column E_L(:, l_{ij}) as a distributional representation of the link [7]. E_L is randomly initialized and is updated during training by gradient descent to encode the semantic information of links. The link feature vector is a concatenation of E_L(:, l_{ij}), the link length len(l_{ij}) and the link's travel speed v_{ij}:

x_{ij} = [E_L(:, l_{ij}); len(l_{ij}); v_{ij}]. (1)

The link's length is obtained by geographical survey, and the travel speed is the average speed of the floating cars that traversed the link within the latest time window (e.g. 10 minutes).

The amount of data significantly affects the quality of embedding vectors. For example, in the natural language processing field, Word2vec [9] cannot generate meaningful embedding vectors for rare words that occur in very limited sentences. In ride-hailing platforms, the data coverage of the road network is still not satisfactory even though there are already millions of floating cars. A part of the links are traversed by only a few or even zero trajectories.

Fig. 2. The overall architecture of RNML-ETA. The loss function consists of two parts: (1) the main task uses a Wide-Deep-Recurrent model to learn the travel time prediction, and (2) the auxiliary task uses metric learning to improve the quality of link embedding vectors.

We refer to the links traversed by plenty of trips as hot links, and those traversed by only a few or even zero trips as cold links. The hot links' embedding vectors can be well trained with sufficient iterations. However, the training of cold links' embedding vectors often ends in an under-fitting state, which undermines the accuracy of ETA prediction.

To improve the embedding quality of cold links, we propose Road Network Metric Learning ETA (RNML-ETA), whose training process consists of two tasks. The main task is to predict the travel time, while the auxiliary task is to regularize the link embedding vectors by transferring the knowledge of road network patterns from hot links to cold links. The metric learning in the auxiliary task helps to place the embedding vector of a cold link in a proper position in the embedded space, by reducing its distance to similar hot links. The loss function of RNML-ETA is:

L = (1 − β) · L_main + β · L_aux, (2)

where β is a hyper-parameter to balance the trade-off between the main task and the auxiliary task.

We choose the Wide-Deep-Recurrent (WDR) model [4], a state-of-the-art ETA model, to accomplish the main task.
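As a concrete illustration of the feature construction in Eq. (1), the lookup-and-concatenate step can be sketched as below. The embedding dimension, the toy table size and the helper name `build_link_feature` are illustrative assumptions, not details taken from the paper's implementation.

```python
import numpy as np

def build_link_feature(E_L, link_id, link_len, speed):
    """Assemble x_ij = [E_L(:, l_ij); len(l_ij); v_ij] (Eq. 1):
    the link's embedding column, its length, and its recent travel speed."""
    emb = E_L[:, link_id]                      # distributional representation E_L(:, l_ij)
    return np.concatenate([emb, [link_len, speed]])

# Toy setting: M = 5 links, embedding dimension d = 4 (both assumed).
rng = np.random.default_rng(0)
E_L = rng.normal(size=(4, 5))                  # randomly initialized, trained later
x = build_link_feature(E_L, link_id=2, link_len=120.0, speed=8.3)
assert x.shape == (6,)                         # 4 embedding dims + length + speed
```

In the full model, E_L would be a trainable parameter updated by gradient descent together with the rest of the network.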
The three components of the WDR model are: (1) a wide module memorizing the historical patterns in the data by constructing a second-order cross product and an affine transformation of the non-sequential feature z_i; (2) a deep module improving the generalization ability by feeding z_i into a Multi-Layer Perceptron (MLP), which is a stack of fully-connected layers with ReLU [19] activation functions; and (3) a recurrent module providing fine-grained modeling of the sequential feature X_i via a Long Short-Term Memory network (LSTM) [29], which can capture the spatial and temporal dependencies between links.

We denote the output of the wide module as h^(w)_i, the output of the deep module as h^(d)_i, and the last hidden state of the LSTM as h^(T_i)_i. The travel time prediction is given by a regressor, which is also an MLP, based on the concatenation of the outputs:

y'_i = MLP(h^(w)_i, h^(d)_i, h^(T_i)_i). (3)

The hidden state sizes in the deep module, the LSTM and the regressor MLP are all set to 128. The hidden state and memory cell of the LSTM are initialized as zeros. We choose Mean Absolute Percentage Error (MAPE) as the loss function of the main task:

L_main = (1/N) Σ_{i=1}^{N} |y_i − y'_i| / y_i, (4)

where y_i is the ground-truth travel time. The overall architecture of RNML-ETA and the main task workflow are visualized in Fig. 2. The details of the auxiliary task will be introduced in the following sections.

B. Link Similarity
To apply metric learning to the link embedding vectors, a similarity measurement of links should be defined. Since the link's travel speed essentially reflects how long a car is expected to take to pass through the link, the speed distribution across different times can be used to depict the traffic characteristics of the link. We construct a series of time bins {τ_1, τ_2, · · · , τ_K} for a day. These time bins are ensured to be non-overlapping: τ_i ∩ τ_j = ∅, ∀ i ≠ j; and their union covers the whole day: τ_1 ∪ τ_2 ∪ · · · ∪ τ_K = 24h. We then compute the average travel speed for link l and time bin τ_k:

v̄_k(l) = (1/Z) Σ_{i=1}^{N} Σ_{j=1}^{T_i} v_{ij} · I[s_i ∈ τ_k] · I[l_{ij} = l],
Z = Σ_{i=1}^{N} Σ_{j=1}^{T_i} I[s_i ∈ τ_k] · I[l_{ij} = l], (5)

where v_{ij} is the travel speed feature of the j-th link in the i-th trip, and I[cond] is an indicator with I[cond] = 1 if cond is satisfied and I[cond] = 0 otherwise. Intuitively, we find a subset of link l's travel speed features by selecting those whose departure time belongs to the time bin τ_k, and then compute the average over the subset. In practice, we use a configuration of K = 3 time bins, with τ_1 from 5 a.m. to 11 a.m. representing the morning peak, τ_2 from 4 p.m. to 10 p.m. representing the evening peak, and τ_3 taking the remaining hours, representing the off-peak time.

We further scale the speeds to be within [0, 1] by applying ṽ_k(l) = (v̄_k(l) − a) / (b − a), where a and b are the minimum and maximum of {v̄_k(l), k = 1 · · · K, l = 1 · · · M}. We finally get a normalized speed histogram of link l:

ṽ(l) = [ṽ_1(l), ṽ_2(l), ṽ_3(l)]^T. (6)

A difference matrix Q ∈ R^{M×M} can then be computed as follows:

Q_{ij} = Q_{ji} = ||ṽ(i) − ṽ(j)||, (7)

where Q_{ij} is the element of Q measuring the difference between the links with ID = i and ID = j. A smaller difference means a larger similarity. The similarity based on the speed histogram shows advantages in two aspects.
Firstly, the ETA is mostly determined by the traffic condition and only partially influenced by personalized factors such as driving habits. The latest average speed is a good reflection of the traffic condition. If two links have similar speed distributions, they should also have similar impact on the ETA prediction. Secondly, the speed histogram does not rely on any extra information and can be computed directly from the data used in the main task, which facilitates the implementation of the method.

Fig. 3. The distances form a triangle, and the order of their edge lengths should satisfy the relation in Eq. 8.
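A minimal numpy sketch of Eqs. (5)-(7) follows: binning per-link speeds by departure time, min-max normalizing, and building the difference matrix Q. The data layout (`trips` as a list of (departure_hour, [(link_id, speed), ...]) pairs) and the function names are assumptions for illustration; unobserved (link, bin) cells are simply left at zero here.

```python
import numpy as np

def speed_histograms(trips, n_links, bins):
    """Per-link average speed in each time bin (Eq. 5),
    min-max normalized over all entries (Eq. 6)."""
    total = np.zeros((n_links, len(bins)))
    count = np.zeros((n_links, len(bins)))
    for hour, links in trips:
        k = next(i for i, b in enumerate(bins) if hour in b)  # bin of departure time
        for l, v in links:
            total[l, k] += v
            count[l, k] += 1
    hist = np.divide(total, count, out=np.zeros_like(total), where=count > 0)
    a, b = hist.min(), hist.max()
    return (hist - a) / (b - a)

def difference_matrix(hist):
    """Q_ij = ||v~(i) - v~(j)|| over the normalized histograms (Eq. 7)."""
    return np.linalg.norm(hist[:, None, :] - hist[None, :, :], axis=-1)

# K = 3 bins as in the paper: morning peak, evening peak, off-peak.
peaks = [range(5, 11), range(16, 22)]
bins = peaks + [[h for h in range(24) if not any(h in p for p in peaks)]]
trips = [(7, [(0, 5.0), (1, 9.0)]), (18, [(0, 6.0)]), (2, [(1, 4.0)])]
Q = difference_matrix(speed_histograms(trips, n_links=2, bins=bins))
assert Q.shape == (2, 2) and np.allclose(Q, Q.T) and np.allclose(np.diag(Q), 0)
```

Because Q is symmetric with a zero diagonal, in practice only the upper triangle needs to be stored for sampling link triples.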
C. Triangle Loss
Links with similar characteristics are expected to be closer in the embedded space, and those with dissimilar characteristics are expected to be farther apart. To this end, we propose a novel metric learning loss function, named the triangle loss. Suppose we have three links with IDs l_i, l_j, l_k and the corresponding differences Q_{l_i l_j}, Q_{l_j l_k} and Q_{l_i l_k}; without loss of generality, we assume:

Q_{l_i l_j} < Q_{l_j l_k} < Q_{l_i l_k}. (8)

We then compute the Euclidean distances between the embedding vectors of links l_i, l_j and l_k. For example:

D_{l_i l_j} = ||Ẽ_L(:, l_i) − Ẽ_L(:, l_j)||, (9)

where Ẽ_L(:, l_i) = E_L(:, l_i) / ||E_L(:, l_i)|| is the L2-normalized embedding vector. The three distances D_{l_i l_j}, D_{l_j l_k} and D_{l_i l_k} form a triangle. We aim to restrict the lengths of the triangle edges to be in the same order as in Eq. 8, which derives three inequalities:

D_{l_i l_j} + α_1 < D_{l_j l_k},
D_{l_i l_j} + α_2 < D_{l_i l_k},
D_{l_j l_k} + α_3 < D_{l_i l_k}, (10)

where α_1, α_2 and α_3 are required margins. Unlike the triplet loss [27], which has only one restriction, namely that the distance between the anchor and the positive sample should be smaller than the distance between the anchor and the negative sample, the links in our method take turns acting as the anchor. This enables more efficient metric learning in one update and thus accelerates the convergence. Fig. 3 gives a visualized demonstration. The triangle loss is of the form:

L_aux = (1/U) Σ_{l_i, l_j, l_k} ( γ_1 [D_{l_i l_j} − D_{l_j l_k} + α_1]_+ + γ_2 [D_{l_i l_j} − D_{l_i l_k} + α_2]_+ + γ_3 [D_{l_j l_k} − D_{l_i l_k} + α_3]_+ ), (11)

where the operator [x]_+ = max(x, 0), U is the number of possible triangles in the training set, and γ_1, γ_2 and γ_3 are hyper-parameters to adjust the weights of the three distances. The auxiliary task and the main task are simultaneously optimized via gradient descent.
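The triangle loss of Eqs. (9)-(11) for a single ordered triple can be sketched as follows. The margin and weight values here are placeholders rather than the tuned values from the experiments, and the triple is assumed to already be labeled so that the ordering of Eq. (8) holds.

```python
import numpy as np

def triangle_loss(E_L, li, lj, lk, alphas=(0.1, 0.3, 0.1), gammas=(0.3, 0.4, 0.3)):
    """Hinge terms of Eq. (11) for one triple, each link acting as anchor in turn.
    Assumes the labeling satisfies Q[li,lj] < Q[lj,lk] < Q[li,lk] (Eq. 8)."""
    e = E_L / np.linalg.norm(E_L, axis=0, keepdims=True)  # L2-normalize columns (Eq. 9)
    dist = lambda a, b: np.linalg.norm(e[:, a] - e[:, b])
    d_ij, d_jk, d_ik = dist(li, lj), dist(lj, lk), dist(li, lk)
    relu = lambda x: max(x, 0.0)                          # the [x]_+ operator
    a1, a2, a3 = alphas
    g1, g2, g3 = gammas
    return (g1 * relu(d_ij - d_jk + a1)    # wants d_ij + a1 < d_jk
            + g2 * relu(d_ij - d_ik + a2)  # wants d_ij + a2 < d_ik
            + g3 * relu(d_jk - d_ik + a3)) # wants d_jk + a3 < d_ik

# With orthonormal embeddings every pairwise distance is sqrt(2),
# so each hinge reduces to its margin times its weight.
E_L = np.eye(3)
loss = triangle_loss(E_L, 0, 1, 2)
assert np.isclose(loss, 0.3 * 0.1 + 0.4 * 0.3 + 0.3 * 0.1)
```

The overall objective then combines the two tasks as L = (1 − β) · L_main + β · L_aux, as in Eq. (2).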
For a mini-batch of trips, we first compute the loss of the main task, and then compute the auxiliary loss by randomly combining triangles from all the links in the trips.

IV. EXPERIMENT

The evaluation is conducted on large-scale real-world datasets collected on the DiDi platform. We introduce the datasets, the competing methods, the implementation details and the experimental results in sequence.

A. Dataset
We collected massive floating car trajectories of Beijing in 2018 on the DiDi platform. The trajectories are split into pickup and trip datasets according to the driver's working status. A pickup trajectory starts when a driver responds to a passenger's request and ends when he/she picks up the passenger. A trip trajectory starts when the passenger gets on board and ends upon arriving at the destination. For each dataset, we use 25 weeks of data as the training set and the following 2 weeks as the validation set and test set, respectively. We remove outliers with extremely short travel times.

TABLE I
STATISTICS OF DATASETS

                    size       pickup    trip
  training set      25 weeks   111.0M    105.5M
  validation set    1 week     4.0M      4.5M
  test set          1 week     4.1M      3.9M
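The link coverage frequency underlying the hot/cold distinction can be tallied with a sketch like the following; the function name and the trajectory layout (each trajectory as a list of link IDs) are assumptions for illustration.

```python
from collections import Counter

def coverage_frequency(paths):
    """Number of trajectories traversing each link."""
    freq = Counter()
    for path in paths:
        for link in set(path):   # a trajectory counts once per link
            freq[link] += 1
    return freq

paths = [[1, 2, 3], [2, 3], [3]]
freq = coverage_frequency(paths)
assert freq[3] == 3 and freq[1] == 1
```

The median of these per-link counts (42 on pickup and 69 on trip in our data) summarizes how sparse the coverage is.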
The links come from a wide range of roads, such as private community roads, local streets and urban freeways. As shown in Table I, the trip dataset covers more links than the pickup dataset. However, both datasets suffer from the road network sparsity problem: most of the links are short of data. To demonstrate this, we plot the histogram of link coverage frequency in Fig. 4. Even with over 0.1 billion trajectories, there is a significant number of cold links that are traversed only a few times in about half a year (25 weeks). The median link coverage frequencies are 42 on pickup and 69 on trip.

B. Competing Methods
We compare the proposed RNML-ETA with the following competitors.

(1) Route-ETA: a representative method in industrial applications. In this solution, the travel time estimate for each link is made by dividing the link length by the link travel speed. The waiting time at each intersection is mined from the historical data. Given a route, the total travel time is predicted as the sum of each link's travel time and each intersection's waiting time. Route-ETA has very fast inference speed, but its accuracy is often far from satisfactory compared to deep learning methods.

Fig. 4. Statistics of link coverage frequency. For both pickup and trip datasets, the links concentrate in the bands with a small number of traversing trajectories.

(2) WDR [4]: a deep learning method achieving state-of-the-art performance on the ETA problem. Since it is the model used in our main task, the comparison between WDR and RNML-ETA evaluates the benefit of the auxiliary task.

(3) WDR-no-link-emb: a variant of WDR that removes the link embedding technique. The main purpose of using this model is to quantify the contribution of the link embedding vectors, whose quality RNML-ETA aims to improve.

Besides the Mean Absolute Percentage Error (MAPE), which is used as the objective function in the main task, we also take the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) as evaluation metrics. The computations are:

MAE = (1/N) Σ_{i=1}^{N} |y_i − y'_i|,  RMSE = [ (1/N) Σ_{i=1}^{N} (y_i − y'_i)^2 ]^{1/2}. (12)

C. Implementation Details
The neural networks in WDR, WDR-no-link-emb and RNML-ETA are implemented in PyTorch [30], and the training is accelerated on a single NVIDIA P40 GPU. We use a mini-batch size of 256 and set the maximal iteration number to 7 million. The hyper-parameters of RNML-ETA are selected by the results on the validation set. We use margins α_1, α_2, α_3 (with α_1 = α_3) and weights γ_1, γ_2, γ_3 (with γ_1 = γ_3) in the triangle loss for both the pickup and trip datasets. The task weight β is 0.52 for pickup and 0.35 for trip. All the parameters, such as the MLP weights and the embedding vectors, are jointly trained using the Adam [31] optimizer, which is a stochastic gradient descent method. Adam can adaptively adjust the step size according to the historical gradients and thus accelerate the convergence. The learning rate is set to 0.0002.

D. Experimental Results
We list the results for the pickup data in Table II and the trip data in Table III, marking the best scores in bold font. The proposed method RNML-ETA outperforms all the competitors on both datasets. The metric learning component significantly improves the main task model's accuracy in predicting the travel time; for example, RNML-ETA reduces RMSE on the pickup data and MAPE on the trip data compared to WDR. The importance of the link embedding technique is also validated by the MAPE reduction it brings on both the pickup and trip data (WDR-no-link-emb vs. WDR). Moreover, it can be observed that there is a large performance gap between the simple rule-based model Route-ETA and the deep learning models.

TABLE II
RESULTS OF THE PICKUP DATASET

              MAPE (%)   MAE (sec)   RMSE (sec)
  Route-ETA      .          .           .

TABLE III
RESULTS OF THE TRIP DATASET

              MAPE (%)   MAE (sec)   RMSE (sec)
  Route-ETA      .          .           .

The results in Table II and Table III show the overall accuracy on all the links. Since RNML-ETA mainly aims to improve the embedding quality of cold links, its contribution needs a finer evaluation which reports the metrics at different link coverage levels. Thus, we select a series of subsets from the dataset by restricting the link coverage frequency in the trajectories. Specifically, we keep a trajectory if at least a given fraction of the contained links have coverage frequencies less than a threshold δ, and drop the trajectory otherwise. By varying δ from 50 to 500 on the pickup data and from 300 to 750 on the trip data in steps of 50, we obtain 10 subsets for each dataset. In subsets with lower δ, the trajectories contain more cold links. We then compute the metrics on these subsets and plot the curves in Fig. 5.

We take Fig. 5 (a) as an example (the trends in the other subfigures are similar). As the threshold δ increases, the subset includes more hot links and the MAPE of WDR gradually decreases, which is a large improvement for the ETA problem. This phenomenon shows that links covered by more trajectories do have better prediction accuracy, and supports the existence of the road network data sparsity problem. On the subset with δ = 50, our method RNML-ETA outperforms WDR by more than 2 percentage points in terms of MAPE. However, the gain on the overall MAPE (Table II) is less than 0.2 percentage points. This comparison validates the effectiveness of RNML-ETA in that it mainly improves the performance on cold links. As δ increases, RNML-ETA still achieves MAPE improvements on both the pickup and trip data.

Fig. 5. Results of the finer evaluation on subsets with different link coverage levels. For a threshold δ, we keep the trajectories in which at least a given fraction of the contained links have coverage frequencies less than δ. The 6 subfigures stand for (a) MAPE on pickup data, (b) MAPE on trip data, (c) MAE on pickup data, (d) MAE on trip data, (e) RMSE on pickup data and (f) RMSE on trip data.

E. Influence of Hyper-parameters
To explore the influence of the hyper-parameters, we plot the performance curves on the pickup data in Fig. 6 by varying the margin α_2 and the task weight β, which are two representative hyper-parameters. The basic configuration is the same as in Section IV-C.

The hyper-parameter α_2 is a bit more special than α_1 and α_3, because it controls the gap between the longest edge and the shortest edge in the triangle loss. If this restriction is broken, it means that the model is far from our expected status and needs a stronger gradient to update the parameters. Usually, we set α_2 > α_1 + α_3, and the curve in Fig. 6 (a) identifies the best-performing value. Moreover, RNML-ETA achieves better performance than WDR over a wide range of α_2, which demonstrates that the superiority of RNML-ETA is not sensitive to the margin hyper-parameter.

The task weight β balances the trade-off between the main task and the auxiliary task. In extreme cases, RNML-ETA degenerates to WDR if β = 0 and degenerates to a pure metric learning model if β = 1. Fig. 6 (b) shows that the advantage of RNML-ETA over WDR is robust over a wide range of β, and that the best performance is achieved at β = 0.52.

Fig. 6. The influence of hyper-parameters: (a) the margin α_2 in the triangle loss, and (b) the weight balancing the main task and the auxiliary task. Though MAPE varies under different hyper-parameters, RNML-ETA generally outperforms the competitor WDR, which demonstrates the robustness of our method.

V. CONCLUSION
In this paper, we propose a novel metric learning framework for ETA, named RNML-ETA, to address the data sparsity problem of the road network. In the main task, we use the WDR model to predict the travel time. In the auxiliary task, we first construct a difference matrix by computing the Euclidean distances between the links' speed distributions, and then use metric learning to pull similar links close and push dissimilar links far apart in the embedded space. The auxiliary task aims to improve the quality of the links' embedding vectors. We conduct experiments on two large-scale real-world datasets collected on the DiDi platform. The results validated the effectiveness of RNML-ETA by showing that it outperforms the state-of-the-art WDR model on all the evaluation metrics. A further experiment finely examines the gains for different types of links and finds that RNML-ETA significantly improves the accuracy for routes containing cold links.

REFERENCES

[1] G. Dimitrakopoulos and P. Demestichas, "Intelligent transportation systems," IEEE Vehicular Technology Magazine, vol. 5, no. 1, pp. 77–84, 2010.
[2] L. Figueiredo, I. Jesus, J. T. Machado, J. R. Ferreira, and J. M. De Carvalho, "Towards the development of intelligent transportation systems," in ITSC (Cat. No. 01TH8585). IEEE, 2001, pp. 1206–1211.
[3] J. Zhang, F.-Y. Wang, K. Wang, W.-H. Lin, X. Xu, and C. Chen, "Data-driven intelligent transportation systems: A survey," IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 4, pp. 1624–1639, 2011.
[4] Z. Wang, K. Fu, and J. Ye, "Learning to estimate the travel time," in SIGKDD. ACM, 2018, pp. 858–866.
[5] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, "Attention based spatial-temporal graph convolutional networks for traffic flow forecasting," in AAAI, vol. 33, 2019, pp. 922–929.
[6] S. Çolak, A. Lima, and M. C. González, "Understanding congested travel in urban areas," Nature Communications, vol. 7, no. 1, pp. 1–8, 2016.
[7] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, "A neural probabilistic language model," Journal of Machine Learning Research, vol. 3, no. Feb, pp. 1137–1155, 2003.
[8] G. Mesnil, X. He, L. Deng, and Y. Bengio, "Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding," in Interspeech, 2013, pp. 3771–3775.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in NeurIPS, 2013, pp. 3111–3119.
[10] R. Caruana, "Multitask learning," Machine Learning, vol. 28, no. 1, pp. 41–75, 1997.
[11] A. Hofleitner, R. Herring, P. Abbeel, and A. Bayen, "Learning the dynamics of arterial traffic from probe data using a dynamic Bayesian network," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1679–1693, 2012.
[12] F. Zhang, X. Zhu, T. Hu, W. Guo, C. Chen, and L. Liu, "Urban link travel time prediction based on a gradient boosting method considering spatiotemporal correlations," ISPRS International Journal of Geo-Information, vol. 5, no. 11, p. 201, 2016.
[13] X. Zhan, S. Hasan, S. V. Ukkusuri, and C. Kamga, "Urban link travel time estimation using large-scale taxi data with partial information," Transportation Research Part C: Emerging Technologies, vol. 33, pp. 37–49, 2013.
[14] H. Chen, H. A. Rakha, and C. C. McGhee, "Dynamic travel time prediction using pattern recognition," TU Delft, 2013.
[15] Y. Wang, Y. Zheng, and Y. Xue, "Travel time estimation of a path using sparse trajectories," in SIGKDD. ACM, 2014, pp. 25–34.
[16] H. Wang, Y. H. Kuo, D. Kifer, and Z. Li, "A simple baseline for travel time estimation using large-scale trip data," in SIGSPATIAL GIS. Association for Computing Machinery, 2016, p. 61.
[17] J. Yuan, Y. Zheng, X. Xie, and G. Sun, "T-drive: Enhancing driving directions with taxi drivers' intelligence," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 220–232, 2011.
[18] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NeurIPS, 2012, pp. 1097–1105.
NeurIPS , 2012, pp. 1097–1105.[20] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploringstrategies for training deep neural networks,”
Journal of machine learn-ing research , vol. 10, no. Jan, pp. 1–40, 2009.[21] Y. Li, K. Fu, Z. Wang, C. Shahabi, J. Ye, and Y. Liu, “Multi-taskrepresentation learning for travel time estimation,” in
SIGKDD . ACM,2018, pp. 1695–1704.[22] D. Wang, J. Zhang, W. Cao, J. Li, and Y. Zheng, “When will you arrive?estimating travel time based on deep neural networks,” in
AAAI , 2018.[23] T.-y. Fu and W.-C. Lee, “Deepist: Deep image-based spatio-temporalnetwork for travel time estimation,” in
ACM CIKM , 2019, pp. 69–78.[24] W. Lan, Y. Xu, and B. Zhao, “Travel time estimation without roadnetworks: an urban morphological layout representation approach,” in
IJCAI . AAAI Press, 2019, pp. 1772–1778.[25] H. Zhang, H. Wu, W. Sun, and B. Zheng, “Deeptravel: a neural networkbased travel time estimation model with auxiliary supervision,” in
IJCAI ,2018, pp. 3655–3661.[26] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metricdiscriminatively, with application to face verification,” in
CVPR , vol. 1.IEEE, 2005, pp. 539–546.[27] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embed-ding for face recognition and clustering,” in
CVPR , 2015, pp. 815–823.[28] K. Sohn, “Improved deep metric learning with multi-class n-pair lossobjective,” in
NeurIPS , 2016, pp. 1857–1865.[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neuralcomputation , vol. 9, no. 8, pp. 1735–1780, 1997.[30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al. , “Pytorch: Animperative style, high-performance deep learning library,” in
NeurIPS ,2019, pp. 8024–8035.[31] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,”