Spatio-Temporal Top-k Similarity Search for Trajectories in Graphs

Abstract

We study the problem of finding the k most similar trajectories to a given query trajectory. Our work is inspired by the work of Grossi et al. [6] that considers trajectories as walks in a graph. Each visited vertex is accompanied by a time-interval. Grossi et al. define a similarity function that captures temporal and spatial aspects. We improve this similarity function to derive a new spatio-temporal distance function for which we can show that a specific type of triangle inequality is satisfied. This distance function is the basis for our index structures, which can be constructed efficiently, need only linear memory, and can quickly answer queries for the top- k most similar trajectories. Our evaluation on real-world and synthetic data sets shows that our algorithms outperform the baselines with respect to indexing time by several orders of magnitude while achieving similar or better query time and quality of results.


Lutz Oettershagen ∗, Anne Driemel †, Petra Mutzel ∗

Keywords: Trajectories, Indexing, Top-k Query

∗ Institute of Computer Science, University of Bonn, Germany, {lutz.oettershagen,petra.mutzel}@cs.uni-bonn.de
† Hausdorff Center for Mathematics, University of Bonn, Germany, [email protected]

More and more trajectory data is collected due to the ubiquitous availability of sensors and personal mobile devices that allow tracking of movement over time. Therefore, trajectory data mining is attracting increasing attention in the scientific literature [1, 2, 3, 4, 7, 11, 13, 14, 15]. For any fundamental task in trajectory data mining, the choice of similarity measure is a crucial step in the design process. Often there are spatial restrictions to the movement, and the trajectories of interest are related to a graph or are mapped to a spatial network. We are interested in similarity that takes such spatial as well as temporal aspects into account. We consider two trajectories as similar if they visit the same or proximate vertices during the same periods of time.

Our work is inspired by Grossi et al. [6], who define a similarity function for two trajectories in a graph. The trajectories can be of different length, and the similarity function takes spatial and temporal similarity into account. It can be computed in linear time with respect to the length of the trajectories. We consider trajectories as sequences of vertices in a graph, and for each visited vertex there is a discrete time interval for the time the trajectory stays at the vertex. See Figure 1 for an example.

Figure 1: Example for trajectories with time intervals in a network, e.g., an online social network. The trajectories reveal the user behavior, i.e., the times a user visits and leaves a website. Interval labels in the figure: [10:32,10:38], [10:38,10:47], [10:47,11:11], [11:11,12:24]; [10:32,11:01], [11:01,12:30], [12:30,14:04]; [12:39,12:47], [12:47,12:58], [12:58,13:03], [13:03,15:47].

Based on Grossi et al. [6], we introduce a new, improved spatio-temporal similarity and a corresponding distance function for trajectories in graphs. We show that a specific kind of triangle inequality holds for the distance function under reasonable assumptions. This distance function provides the basis for new index data structures that allow efficient top-$k$ similarity queries. A top-$k$ trajectory query $(Q, s)$ specifies a trajectory $Q$ and a time interval $s$. Given a set of trajectories $\mathcal{T}$, the result of a top-$k$ query consists of the subset of $\mathcal{T}$ containing all trajectories that have one of the $k$ highest similarities to $Q$ with respect to the time interval $s$. These queries have important real-life applications:

• Web analytics: Users of an online social network or web community follow links and visit user pages. The goal is to find similar browsing behavior. Figure 1 shows an example.
• Travel recommendation: Tourism is one of the largest industries, and the emergence of travel-focused social networks enables users to share their tours. The locations are points of interest (POIs), and the intervals are the durations a person stays at a POI. A query is a request for a recommendation.
• Animal behavior: Consider wildlife that is tracked using GPS. The living space of the animal is divided into zones. The goal is to identify similarities in animal behavior. Vertices represent either specific locations, like a waterhole or a feeding place, or territories of animals.
• Traffic and crowd analysis: The goal is to identify person or vehicle flows at specific times through predefined areas. Vertices represent these areas. This also includes the application in contact tracing, where we need to determine the contacts of, e.g., an infected individual to other persons.

These applications have in common that one is interested in finding a set of the most similar trajectories to a given one. This is a fundamental problem in trajectory mining tasks such as clustering, outlier detection, classification, or prediction. It is necessary to select or define an adequate similarity measure or distance function, respectively, that fits the requirements of the application.

Contributions:

1. We introduce a spatio-temporal similarity function and show that the triangle inequality holds under certain conditions for the corresponding distance function. The similarity computation for two trajectories only needs linear time with respect to the length of the longer trajectory.
2. We design indices that can be constructed very efficiently and use linear memory with respect to the number of trajectories. The indices are based on spatial as well as on temporal filtering and allow heuristic top-$k$ similarity queries with short running times and high quality of the results. Additionally, we apply upper bounding, which allows a direct, highly efficient query even without the need for a preprocessed index data structure. In the latter case, the output is exact.
3. We evaluate our new algorithms on real-world and synthetic data sets. Our new solutions outperform the baselines (including [6]) with respect to indexing time by several orders of magnitude. Moreover, our query times are substantially faster, and the quality of the results is better than or on par with the baseline algorithms.

Since trajectory similarity is of high interest for many data analytics tasks, many different similarity measures have been used, e.g., based on dynamic time warping, Euclidean distances, or edit distances. Su et al. [12] provide a good overview. For trajectory analysis in networks, many approaches have concentrated on the spatial similarity only, and a few consider spatio-temporal similarity. Hwang et al. [7] have suggested a similarity measure based on the network distance measuring spatial and temporal similarity. However, a set of nodes needs to be selected in advance, and spatial similarity then means passing through the same nodes simultaneously. Xia et al. [15] use a similarity measure for network-constrained trajectories based on an extension of the Jaccard similarity. As a similarity measure, they use the product of spatial and temporal similarity. Tiakas et al. [13, 14] suggest a weighted sum of spatial and temporal similarity. Their similarity function works for two trajectories with the same length and can be computed in linear time with respect to the length of the given trajectories. Shang et al. [10] also use a weighted sum of spatial and temporal similarity for similarity joins of trajectories.

Another way to approach the problem is to use a distance measure based on the discrete Fréchet distance or dynamic time warping, which optimize over all vertex mappings between the two trajectories that respect the time ordering, where the underlying metric would be derived from the shortest-path metric given by the graph. Near-neighbor data structures have been studied theoretically with specific conditions on the underlying graph and the length of the queries, see [5, 8].

Our work is inspired by Grossi et al. [6]. They suggest a spatio-temporal similarity measure for two trajectories in a graph. The trajectories can be of different length, and if the pairwise distances are given, then the measure can be computed in linear time with respect to the length of the trajectories. The authors also suggest an algorithm for answering the top-$k$ trajectory query problem. For speeding up the computations, they suggest an indexing method based on interval trees and a method to approximate their similarity measure. We provide a more detailed description of their work and a comparison to our approach in Section 5.

An undirected and weighted graph $G = (V, E, c)$ consists of a finite set of vertices $V$, a finite set $E \subseteq \{\{u, v\} \subseteq V \mid u \neq v\}$ of undirected edges, and a cost function $c : E \to \mathbb{R}_{>0}$ that assigns a positive cost to each edge $e \in E$. A walk in $G$ is an alternating sequence of vertices and edges connecting consecutive vertices. A path is a walk that visits each vertex at most once. The cost of a walk or path is the sum of its edge costs. Let $d(u, v)$ denote the shortest-path distance between $u, v \in V$.

Definition 1. (Trajectory) Let $G = (V, E, c)$ be an undirected, weighted and connected graph. A trajectory $T$ is a sequence of pairs $((v_1, t_1), \ldots, (v_\ell, t_\ell))$ such that for $1 \le i \le \ell$ the pair $(v_i, t_i)$ consists of $v_i \in V$ and a discrete time interval $t_i = [a_i, b_i]$ with $a_i, b_i \in \mathbb{Z}$, $a_i < b_i$, and $a_{i+1} = b_i$ for $1 \le i < \ell$.

The starting time of $T$ is $T.\mathit{start} = a_1$ and the ending time is $T.\mathit{end} = b_\ell$. We denote with $I(T)$ the total interval in which trajectory $T$ exists, i.e., from $T.\mathit{start}$ to $T.\mathit{end}$. For a trajectory $T$ and a time interval $t = [a, b]$ we define $T[t]$ as the time-restricted trajectory that is intersected with $t$, i.e., $T[t] = ((v_i, t'_i), \ldots, (v_j, t'_j))$ with $v_i$ ($v_j$) being the first (last) vertex of $T$ such that for $t_i = [a_i, b_i]$ it holds that $b_i > a$ (and for $t_j = [a_j, b_j]$ that $a_j < b$, resp.), $t'_i = \max\{t_i, a\}$ and $t'_j = \min\{t_j, b\}$; that is, the first interval is clipped to start no earlier than $a$ and the last interval to end no later than $b$. We assume for $T = ((v_1, t_1), \ldots, (v_\ell, t_\ell))$ that $v_i \neq v_{i+1}$ for all $1 \le i < \ell$. We say trajectory $T$ intersects a time interval $t$ if there is a $(v_i, t_i) \in T$ with $t_i \cap t \neq \emptyset$.

We define our new similarity function for trajectories on networks. The goal is to capture both temporal and spatial aspects, such that if two trajectories are often in close proximity, i.e., visiting vertices that are close to each other during the same period of time, then the similarity should be high.
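The trajectory model of Definition 1 and the time restriction $T[t]$ can be illustrated with a minimal Python sketch. The representation (a list of `(vertex, (a, b))` pairs) and the helper names `is_valid`, `total_interval` and `restrict` are assumptions made for this illustration and are not taken from the authors' implementation.

```python
from typing import Hashable, List, Tuple

Interval = Tuple[int, int]              # discrete interval [a, b] with a < b
Visit = Tuple[Hashable, Interval]       # (vertex, time interval)
Trajectory = List[Visit]


def is_valid(traj: Trajectory) -> bool:
    """Check Definition 1 plus the assumption v_i != v_{i+1}."""
    for _, (a, b) in traj:
        if a >= b:                      # every interval needs a < b
            return False
    for (v, (_, b)), (w, (a_next, _)) in zip(traj, traj[1:]):
        if a_next != b or v == w:       # intervals must be contiguous
            return False
    return True


def total_interval(traj: Trajectory) -> Interval:
    """I(T): from T.start to T.end."""
    return (traj[0][1][0], traj[-1][1][1])


def restrict(traj: Trajectory, t: Interval) -> Trajectory:
    """T[t]: keep the visits that overlap t and clip them to t."""
    a, b = t
    return [(v, (max(ai, a), min(bi, b)))
            for v, (ai, bi) in traj
            if bi > a and ai < b]


# Hypothetical example: three visits, restricted to the interval [2, 6].
example = [("x", (0, 3)), ("y", (3, 5)), ("z", (5, 9))]
print(restrict(example, (2, 6)))
```

Because the intervals of a trajectory are contiguous, only the first and the last retained interval can actually be shortened by the clipping step.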

Definition 2.

Let $T = ((v_1, t_1), \ldots, (v_\ell, t_\ell))$ and $Q = ((u_1, s_1), \ldots, (u_k, s_k))$ be two trajectories, and $s$ a time interval. We define the similarity of $Q$ and $T$ in the time interval $s$ as

\[ \mathrm{Sim}(Q, T, s) = \frac{1}{|s|} \cdot \sum_{\substack{(v_i, t_i) \in T \\ (u_j, s_j) \in Q}} |s \cap t_i \cap s_j| \cdot e^{-d(v_i, u_j)}. \]

Notice that for two trajectories $T$ and $Q$, and a time interval $s$, it holds that $0 \le \mathrm{Sim}(Q, T, s) \le 1$. $\mathrm{Sim}(Q, T, s)$ is minimal if the common intersection of the time intervals is empty. In this case $\mathrm{Sim}(Q, T, s) = 0$.
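Definition 2 can be evaluated directly by summing over all pairs of visits. The following Python sketch does exactly that in quadratic time. It assumes that the length of an interval $[a, b]$ is $b - a$ and that shortest-path distances are supplied by a caller-provided function `d`; both the function names and the toy graph (integer vertices on a path with unit edge costs, so that $d(u, v) = |u - v|$) are illustrative assumptions.

```python
import math


def overlap(*intervals):
    """Length of the common intersection of the given intervals (0 if empty)."""
    lo = max(a for a, _ in intervals)
    hi = min(b for _, b in intervals)
    return max(0, hi - lo)


def sim(Q, T, s, d):
    """Direct evaluation of Sim(Q, T, s) from Definition 2.

    Q, T: lists of (vertex, (a, b)) pairs; s: query interval (a, b);
    d: function returning the shortest-path distance of two vertices.
    """
    total = 0.0
    for v, t_i in T:
        for u, s_j in Q:
            w = overlap(s, t_i, s_j)
            if w > 0:
                total += w * math.exp(-d(v, u))
    return total / (s[1] - s[0])


# Toy setting (assumption): vertices are integers on a path with unit costs.
def line_distance(u, v):
    return abs(u - v)


Q = [(0, (0, 4)), (1, (4, 8))]
T = [(0, (0, 2)), (2, (2, 8))]
print(sim(Q, T, (0, 8), line_distance))
```

In this toy setting the two trajectories share vertex 0 during $[0, 2]$ and are at distance 2 and then distance 1 afterwards, so the sum is $2 + 2e^{-2} + 4e^{-1}$ before normalization by $|s| = 8$.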

Lemma 3.1.

Let $Q$ and $T$ be trajectories and $s$ a time interval with $s \subseteq I(Q)$. It holds that
1. $\mathrm{Sim}(Q, T, s) = \mathrm{Sim}(T, Q, s)$, and
2. $Q[s] = T[s]$ if and only if $\mathrm{Sim}(Q, T, s) = 1$.

Proof. (1.) The shortest-path metric is symmetric, i.e., $d(u, v) = d(v, u)$ for all $u, v \in V$. The summation is over the same pairs of $(v_i, t_i) \in T$ and $(u_j, s_j) \in Q$, and the intersection of the intervals is commutative. Therefore, the result follows.

(2.) $\Rightarrow$: Notice that if $Q[s] = T[s]$, then in each step of the summation $e^{-d(u,v)} = 1$. Because $s \subseteq I(Q)$, the result of the summation is $|s|$, and after normalization it is 1.

$\Leftarrow$: Assume that $\mathrm{Sim}(Q, T, s) = 1$ but $Q[s] \neq T[s]$, i.e., $Q[s]$ and $T[s]$ differ in the vertices they visit or the times when they visit them. In the first case, due to the strictly positive edge weights, there is a vertex pair such that $e^{-d(u,v)} < 1$, whereas for all other vertex pairs the value $e^{-d(u',v')}$ is at most 1. Because the intervals are intersected with the interval $s$, the total sum is less than $|s|$, which contradicts the assumption. Analogously, in the case that $|I(T[s])| < |s|$ a contradiction follows. Now consider the case that $Q[s]$ and $T[s]$ differ in the times when they visit the vertices. Because of the assumption that a trajectory does not stay at the same vertex in two consecutive time intervals, there is an intersection of time intervals in which $Q[s]$ and $T[s]$ visit different vertices $u$ and $v$. Due to the strictly positive edge weights, $e^{-d(u,v)} < 1$. This again leads to a contradiction.

For the computation of the similarity, the shortest-path distances between the vertices of the graph are needed. These distances can be precomputed for all vertices or computed on the fly for vertices $u$ that are visited by the query trajectory $Q$.

Theorem 3.1.

Let $Q$ and $T$ be trajectories, and $s$ a time interval. The computation of the similarity $\mathrm{Sim}(Q, T, s)$ takes $O(|Q| + |T|)$ time if the shortest-path distance $d(u, v)$ between $u, v \in V$ can be obtained in constant time.

Proof. Consider the query trajectory $Q = ((u_1, s_1), \ldots, (u_i, [a_i, b_i]), \ldots, (u_k, s_k))$ and the trajectory $T = ((v_1, t_1), \ldots, (v_j, [c_j, d_j]), \ldots, (v_\ell, t_\ell))$. We start the computation with $i = 1$ and $j = 1$, and $|s \cap t_j \cap s_i|$ is either zero or larger than zero. In the first case we can increase both $i$ and $j$. In the second case, we increase $i$ if $b_i < d_j$ or $j = \ell$, and we increase $j$ if $b_i > d_j$ or $i = k$. We repeat this at most $|Q| + |T|$ times and find all pairs $(u_i, s_i)$ and $(v_j, t_j)$ that have a non-empty intersection.

Definition 2 is similar to the similarity function defined in [6]; however, our improvements allow us to prove useful properties for the corresponding distance function. We now define the distance function based on the similarity and show a specific type of triangle inequality.

Definition 3. Let $Q$ and $T$ be trajectories, and $s$ a time interval. We define the distance $\mathrm{Dist}(Q, T, s) = 1 - \mathrm{Sim}(Q, T, s)$.

Lemma 3.2.

Let $Q$, $T$ and $R$ be trajectories, and $s$ a time interval. If $s \subseteq I(Q)$, then $\mathrm{Dist}(Q, T, s) \le \mathrm{Dist}(Q, R, s) + \mathrm{Dist}(R, T, s)$.

Proof. Let $t = I(T)$ and $r = I(R)$. We can assume without loss of generality that $s = I(Q)$. We show that $1 - \mathrm{Sim}(Q, T, s) \le 1 - \mathrm{Sim}(Q, R, s) + 1 - \mathrm{Sim}(R, T, s)$. This is equivalent to $1 + \mathrm{Sim}(Q, T, s) \ge \mathrm{Sim}(Q, R, s) + \mathrm{Sim}(R, T, s)$. Let $A$ be the multiset of pairs $(u_j, v_i)$ that are summed up on the left-hand side of the inequality for which $|s \cap s_j \cap t_i| > 0$, where each $(u_j, v_i)$ is in $A$ exactly $|s \cap s_j \cap t_i|$ times. Similarly, with $(w_k, r_k) \in R$, let $B$ contain the pairs $(u_j, w_k)$ that are summed up during the first summation on the right-hand side of the inequality for which $|s \cap s_j \cap r_k| > 0$, where each $(u_j, w_k)$ is in $B$ exactly $|s \cap s_j \cap r_k|$ times. And finally, $C$ contains the pairs $(w_k, v_i)$ that are summed up during the second summation on the right-hand side of the inequality for which $|s \cap r_k \cap t_i| > 0$, where each $(w_k, v_i)$ is in $C$ exactly $|s \cap r_k \cap t_i|$ times. Then, we show

\[ |s| - |s \cap t| + \sum_{(u_j, v_i) \in A} \left(1 + e^{-d(u_j, v_i)}\right) \ge \sum_{(u_j, w_k) \in B} e^{-d(u_j, w_k)} + \sum_{(w_k, v_i) \in C} e^{-d(w_k, v_i)}. \tag{3.1} \]

We show that the multisets $A$, $B$ and $C$ contain vertex pairs such that the inequality holds. Let $p \subseteq s$ be an interval of length one. For each possible $p$ we may have some vertex pairs in the multisets. We need to consider the following cases:
1. $p \cap t \neq \emptyset$ and $p \cap r = \emptyset$: $A$ contains vertex pairs $(u, v)$, but neither $B$ nor $C$ contains corresponding pairs. This favors the left side of eq. (3.1).
2. $p \cap t \neq \emptyset$ and $p \cap r \neq \emptyset$: There are $(u_j, v_i) \in A$, $(u_j, w_k) \in B$ and $(w_k, v_i) \in C$. In this case it holds that $1 + e^{-d(u_j, v_i)} \ge e^{-d(u_j, w_k)} + e^{-d(w_k, v_i)}$: with $x = e^{-d(u_j, w_k)}$ and $y = e^{-d(w_k, v_i)}$, the triangle inequality of $d$ gives $e^{-d(u_j, v_i)} \ge xy$, and $1 + xy - x - y = (1 - x)(1 - y) \ge 0$ because $x, y \in (0, 1]$.
3. $p \cap t = \emptyset$ and $p \cap r \neq \emptyset$: There are no corresponding vertex pairs in $A$ and $C$, but there are in $B$. However, this can only be the case for $|s| - |s \cap t|$ pairs, and each contributes at most 1 to the right-hand side.
4. $p \cap t = \emptyset$ and $p \cap r = \emptyset$: There are no corresponding vertex pairs in $A$, $B$ or $C$.

Now, we show a strong relationship between the similarities, or distances, of two trajectories with respect to two different time intervals.

Lemma 3.3. Let $Q$ and $T$ be trajectories, and $s$ and $t$ time intervals with $I(Q) = s \subseteq t$. It holds that

\[ \mathrm{Dist}(Q, T, t) = 1 - \frac{|s|}{|t|} + \frac{|s|}{|t|} \mathrm{Dist}(Q, T, s). \]

Proof. Assuming $s \subseteq t$ and using Definition 2 it holds that

\[ \mathrm{Sim}(Q, T, t) = \frac{1}{|t|} \sum_{\substack{(u_j, s_j) \in Q \\ (v_i, t_i) \in T}} |t \cap s_j \cap t_i| \cdot e^{-d(u_j, v_i)} = \frac{1}{|t|} \sum_{\substack{(u_j, s_j) \in Q \\ (v_i, t_i) \in T}} |s \cap s_j \cap t_i| \cdot e^{-d(u_j, v_i)}, \]

since $s_j \subseteq s \subseteq t$ for all $s_j$. Now we can apply Definition 2 again and obtain

\[ \mathrm{Sim}(Q, T, t) = \frac{|s|}{|t|} \mathrm{Sim}(Q, T, s). \]

Finally, applying Definition 3 leads to the result: $\mathrm{Dist}(Q, T, t) = 1 - \frac{|s|}{|t|}\left(1 - \mathrm{Dist}(Q, T, s)\right) = 1 - \frac{|s|}{|t|} + \frac{|s|}{|t|} \mathrm{Dist}(Q, T, s)$.

Lemma 3.2 and Lemma 3.3 are the basis for our indices that we present in the following section.
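The linear-time computation behind Theorem 3.1 and the identity of Lemma 3.3 can be checked numerically with a short Python sketch. It uses a standard two-pointer merge that always advances the trajectory whose current interval ends first, which is a variant of the advancing rule in the proof rather than a transcription of it. The names `sim_sweep` and `dist`, the interval length $b - a$, and the toy distance $d(u, v) = |u - v|$ on a path graph are assumptions of this sketch.

```python
import math


def sim_sweep(Q, T, s, d):
    """Sim(Q, T, s) in O(|Q| + |T|) time via a two-pointer merge.

    Q, T: lists of (vertex, (a, b)) with contiguous, increasing intervals.
    s: interval (a, b); d: constant-time shortest-path distance lookup.
    """
    i = j = 0
    total = 0.0
    while i < len(Q) and j < len(T):
        u, (a, b) = Q[i]
        v, (c, e) = T[j]
        # Length of s ∩ s_i ∩ t_j (non-positive if the intersection is empty).
        w = min(b, e, s[1]) - max(a, c, s[0])
        if w > 0:
            total += w * math.exp(-d(u, v))
        # Advance whichever interval ends first (both on a tie).
        if b <= e:
            i += 1
        if e <= b:
            j += 1
    return total / (s[1] - s[0])


def dist(Q, T, s, d):
    """Dist(Q, T, s) = 1 - Sim(Q, T, s) as in Definition 3."""
    return 1.0 - sim_sweep(Q, T, s, d)


# Toy check of Lemma 3.3 with I(Q) = s = [0, 8] and s ⊆ t = [-4, 12].
def line_distance(u, v):
    return abs(u - v)


Q = [(0, (0, 4)), (1, (4, 8))]
T = [(0, (-2, 2)), (2, (2, 10))]
s, t = (0, 8), (-4, 12)
ratio = (s[1] - s[0]) / (t[1] - t[0])
lhs = dist(Q, T, t, line_distance)
rhs = 1 - ratio + ratio * dist(Q, T, s, line_distance)
print(math.isclose(lhs, rhs))
```

Each loop iteration advances at least one of the two pointers, so the loop body runs at most $|Q| + |T|$ times.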

We introduce efficient indexing methods for the trajectories by applying temporal and spatial filters. First, we give a high-level view of our approach, which consists of two phases:
1. An offline phase for preparing the index. Given a set of trajectories $\mathcal{T}$, a preprocessing phase constructs the index $D$ that allows efficient queries.
2. The query phase. Given a query trajectory $Q$ and a time interval $s$, the index $D$ first determines a candidate set $\mathcal{C} \subseteq \mathcal{T}$. For each $T \in \mathcal{C}$ the query algorithm computes the similarity $\mathrm{Sim}(Q, T, s)$ and keeps all trajectories with a top-$k$ similarity in a heap data structure. The query result is the set of all trajectories with a top-$k$ similarity to $Q$ w.r.t. $s$.

Notice that the candidate set may contain all trajectories stored in $D$, e.g., if $k \ge |\mathcal{T}|$ or if all trajectories have the same similarity to the query. In the following, we describe the techniques that achieve small candidate sets wherever possible. Our techniques are based on filters for the temporal and the spatial domain. Our indexing algorithms, as well as our query algorithms, are embarrassingly parallel.

4.1 Pivot-Based Spatial Filters

We choose $h \in \mathbb{N}$ vertices $p_1, \ldots, p_h$ from which we construct $h$ pivot trajectories $P_1, \ldots, P_h$. The $h$ vertices are the ones that are most frequently visited by trajectories, where we also count multiple visits from a trajectory $T$ at a vertex. Each pivot trajectory $P_i$ stays stationary at vertex $p_i$ during the time interval $t = [a, b]$, where $a$ is the earliest starting and $b$ the latest ending time over all trajectories $T \in \mathcal{T}$. Next we compute the pairwise distances $\mathrm{Dist}(T, P_i, t)$ between all $T \in \mathcal{T}$ and $P_i$ for $1 \le i \le h$ and store these distances together with the pivot trajectories. Given a query $(Q, s)$ we compute $\mathrm{Dist}(Q, P_i, t)$ for $1 \le i \le h$. Using Lemma 3.2 and Lemma 3.3, it follows that

\[ |\mathrm{Dist}(Q, P_i, t) - \mathrm{Dist}(P_i, T, t)| \le \mathrm{Dist}(Q, T, t) = 1 - \frac{|s|}{|t|} + \frac{|s|}{|t|} \mathrm{Dist}(Q, T, s), \]

where we use that $t \subseteq I(P_i)$, which holds by construction of $P_i$. We can use the above bound to filter out many trajectories from the candidate set that are too far away from the query to be in the top-$k$ result set. To this end, we use a threshold radius $r$ such that we only keep trajectories $T$ for which

\[ |\mathrm{Dist}(Q, P_i, t) - \mathrm{Dist}(T, P_i, t)| \le r \]

for all pivots $P_i$ with $1 \le i \le h$.

The running time needed for filtering the trajectories during a query is in $O(|\mathcal{T}| + h \cdot |Q|)$. The construction of the index utilizing pivot-based spatial filtering is efficient: we only need to compute the distance between the $h$ pivot trajectories and all $T \in \mathcal{T}$, each in $O(|T| + |P_i|)$.

Theorem 4.1. The index based on pivot-based spatial filters can be computed in $O(|\mathcal{T}| \cdot hm)$ time, where $m$ is the maximal length of a trajectory over $\mathcal{T}$. The memory needed for storage is in $O(|\mathcal{T}| \cdot h)$.

Notice that trajectories that have empty intersection with the query interval $s$ do not have to be considered in the candidate set $\mathcal{C} \subseteq \mathcal{T}$. To filter out such trajectories we construct a binary interval tree using the following procedure. For each node $h$ in the tree, we have a set of trajectories $\mathcal{T}_h$. We compute the median $m$ of the end-points in $\mathcal{T}_h$ and assign all trajectories that end before $m$ to the left child and all trajectories that start after $m$ to the right child of $h$. All other trajectories are stored at $h$. We proceed recursively until we reach a minimum size for the trajectory set $\mathcal{T}_l$, where $l$ is a leaf of the tree. We combine the temporal filter with the pivot-based spatial filter by using a pivot-based spatial filter at each node $h$ for the trajectories stored at node $h$.

The running time needed for temporal filtering during a query is in $O(|\mathcal{C}| + h \cdot |Q|)$, with $|\mathcal{C}|$ being the size of the candidate set returned by the index. During the construction, we have to do the pivot-based filter construction at each node of the tree.

Theorem 4.2. The tree index can be computed in $O(\log(|\mathcal{T}|) \cdot |\mathcal{T}| \cdot hm)$ time, where $m$ is the maximal length of a trajectory in $\mathcal{T}$. The memory needed for storage is in $O(|\mathcal{T}| \cdot h)$.

During the computations of the similarities between $Q$ and a trajectory $T$ in a set of trajectories $\mathcal{T}$ we can apply the following upper bounding technique. Let $T_1, \ldots, T_{|\mathcal{C}|}$ be the trajectories of the candidate set in order of processing. After computing the similarity of the first $k$ trajectories, we can stop the similarity computation between $Q$ and $T_h$ for $h > k$ early if we can assure that $\mathrm{Sim}(Q, T_h, s)$ is smaller than each of the top-$k$ similarities between $Q$ and the trajectories processed so far. To this end, we iteratively update an upper bound $\bar{s}$ for the value of $\mathrm{Sim}(Q, T_h, s)$. Consider the computation of the similarity $\mathrm{Sim}(Q, T_h, s)$ described in the proof of Theorem 3.1. At each step, before increasing $i$ or $j$, we obtain the upper bound $\bar{s}$ for $\mathrm{Sim}(Q, T_h, s)$ by assuming that in each remaining time step the trajectories are at the same vertices. If $\bar{s}$ is smaller than the lowest of the top-$k$ similarities found so far, we stop the computation of $\mathrm{Sim}(Q, T_h, s)$ and proceed with $\mathrm{Sim}(Q, T_{h+1}, s)$.

Grossi et al. [6] introduce three algorithms for answering top-$k$ similarity queries in a spatio-temporal setting. The idea of their baseline algorithm is to have a preprocessing phase that constructs an interval tree at each vertex $v$ of the graph. The interval tree at $v \in V$ contains all pairs of $(T.\mathit{id}, t)$ if trajectory $T \in \mathcal{T}$ visits $v$ or any of its adjacent vertices during time interval $t$. Here, $T.\mathit{id}$ is the identifier of the trajectory $T$. Then, using the constructed index, a query $(Q, s)$ is answered by visiting all vertices $v$ with $(v, t) \in Q$ and collecting all ids of trajectories that visit vertex $v$ or any of its neighbors during $t \cap s$. With the collected set of ids, the candidate set of trajectories can be evaluated, and the top-$k$ similar trajectories are found.
Therefore, the running time and memory requirements depend on the number of trajectories, the lengths of the trajectories, and the vertex degrees. Moreover, the algorithm solves a special case, in which only trajectories are considered that have at least one vertex in hop-distance less than or equal to 1 to a vertex of the query trajectory. A simple example for which the algorithm fails to find a similar trajectory can be constructed in a graph consisting of a chain of four vertices, with a trajectory $T$ and a query trajectory $Q$ that each stay at a single vertex during an interval starting at time 0, where the two vertices are more than one hop apart. Only the interval trees at the vertex of $T$ and at its neighbor contain the id of $T$. For a query with $Q$, the algorithm only visits the vertex of $Q$ and cannot find $T$. Grossi et al. [6] also introduce two heuristic algorithms for the top-$k$ query problem. Their idea is to reduce the graph size and then shrink the length of the query trajectory or all trajectories to save running time by reducing the number of distance computations between vertices. However, this can also lead to larger candidate sets, and hence more evaluations of the similarity function are necessary.

Our algorithms differ from the ones suggested by Grossi et al. [6] as follows. First, we introduce an alternative and improved similarity function for which we show certain metric-like properties. Our indices use these properties to reduce the size of the trajectory candidate set and, hence, reduce the number of similarity computations. The memory requirements of our indices are independent of the size of the graph and only linear in the number of trajectories (see Theorems 4.1 and 4.2). By using the upper bounding technique without preprocessing and constructing an index, we obtain an exact algorithm that is competitive in terms of running time.

In this section, we evaluate our new algorithms and compare them to the approaches suggested in [6]. We are interested in answering the following questions:

Q1: How fast are the indexing times of our algorithms compared to the algorithms in [6]?
Q2: How fast are queries of our algorithms compared to the baseline and to the heuristics in [6]? Do our index solutions improve the query times?
Q3: How good is the quality of the approximated top-$k$ queries?
Q4: How do the choices of the radius $r$ and the number of pivots $h$ impact running time and accuracy?
Q5: How much does the upper bounding improve the running time?

We implemented the following new algorithms:
• Exact is the linear scan over the complete data set that does not use indexing.
• Tree is the index that uses an interval tree with additional pivot-based spatial filtering at each node of the interval tree (see Section 4.2).
• Pivot is the index that applies the pivot-based spatial filtering globally (see Section 4.1).
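To make the global pivot-based filtering behind Pivot more concrete, the following Python sketch builds the table of pivot distances and filters candidates with a threshold radius $r$. It is a simplified illustration under stated assumptions, not the authors' C++ implementation: the function names are hypothetical, interval lengths are taken as $b - a$, the distance `d` is a caller-supplied function, and the distance to a stationary pivot trajectory is computed directly from Definition 2.

```python
import math
from collections import Counter


def choose_pivots(trajs, h):
    """The h most frequently visited vertices (repeated visits count)."""
    counts = Counter(v for T in trajs for v, _ in T)
    return [v for v, _ in counts.most_common(h)]


def global_interval(trajs):
    """t = [a, b]: earliest start and latest end over all trajectories."""
    return (min(T[0][1][0] for T in trajs), max(T[-1][1][1] for T in trajs))


def dist_to_pivot(T, p, t, d):
    """Dist(T, P, t) for the pivot trajectory P that stays at p during t."""
    total = 0.0
    for v, (a, b) in T:
        w = min(b, t[1]) - max(a, t[0])
        if w > 0:
            total += w * math.exp(-d(v, p))
    return 1.0 - total / (t[1] - t[0])


def build_index(trajs, h, d):
    """Offline phase: one row of h pivot distances per trajectory."""
    pivots = choose_pivots(trajs, h)
    t = global_interval(trajs)
    table = [[dist_to_pivot(T, p, t, d) for p in pivots] for T in trajs]
    return pivots, t, table


def candidates(index, Q, r, d):
    """Query phase: keep trajectories within radius r for every pivot."""
    pivots, t, table = index
    q = [dist_to_pivot(Q, p, t, d) for p in pivots]
    return [k for k, row in enumerate(table)
            if all(abs(qi - ti) <= r for qi, ti in zip(q, row))]


# Toy data (assumption): integer vertices on a path, d(u, v) = |u - v|.
def line_distance(u, v):
    return abs(u - v)


trajs = [[(0, (0, 10))],
         [(0, (0, 5)), (1, (5, 10))],
         [(9, (0, 10))]]
index = build_index(trajs, h=1, d=line_distance)
print(candidates(index, trajs[0], r=0.4, d=line_distance))
```

The table needs $h$ numbers per trajectory, which matches the $O(|\mathcal{T}| \cdot h)$ memory bound of Theorem 4.1. A smaller radius $r$ yields fewer candidates but may discard trajectories that belong to the true top-$k$ result, which is why the indexed queries are heuristic.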

Exact, Tree and Pivot use the upper bounding technique (Section 4.3). Furthermore, we implemented the following algorithms from [6]:
• Gbase, the baseline algorithm in [6] (see Section 5).
• Gshq and Gshqt denote their heuristic algorithms based on shrinking the graph and the trajectories and taking advantage of the smaller graph size and reduced trajectory lengths (see [6]).

All of our implementations use the similarity measure defined in Definition 2. We implemented all algorithms in C++ using GNU CC Compiler 9.3.0 with the flag -O2. All experiments were conducted on a workstation with an AMD EPYC 7402P 24-Core Processor with 2.80 GHz and 256 GB of RAM running Ubuntu 18.04.3 LTS. The source code and data sets are available online at https://gitlab.com/tgpublic/topktraj.

For the evaluation of the algorithms, we used the following data sets:
• Facebook 1&2: The network consists of Facebook friendship relations [9] and is provided by the Stanford Network Analysis Project (https://snap.stanford.edu/data/ego-Facebook.html). We have generated synthetic trajectories.
• Milan: The Milan data set is based on GPS trajectories of private cars in the city of Milan (https://sobigdata.d4science.org/catalogue-sobigdata?path=/dataset/gps_track_milan_italy).
• T-Drive: The data set contains GPS data of taxi trajectories in Beijing [16, 17].

For the Milan and T-Drive data sets, we generated a graph by first interpreting each GPS location point as a vertex and then clustering these vertices using the $k$-means algorithm. The resulting clusters are the final vertices. Two clusters are connected by an edge if at least one trajectory visits a vertex in each of both clusters in consecutive time intervals. We assign the distance between the centers of the clusters as the cost of the edge. Table 1 shows some statistics for the data sets.

Table 1: Statistics and properties of the synthetic and real-world data sets. Columns: data set, $|V|$, $|E|$, ∅ trajectory length; rows: Facebook1, Facebook2, Milan, T-Drive.

6.3 Results

We answer questions Q1 to Q5.

Q1:

Table 2 shows the running times for indexing the data sets. For Tree and Pivot, we choose $h = 8$ pivots for both Facebook data sets and set $h = 64$ for the T-Drive data set. In case of the Milan data set, we choose $h = 32$ for Tree and $h = 16$ for Pivot. The construction of our index structures is several orders of magnitude faster than that of the algorithms suggested in [6]. The largest speed-up is achieved for the T-Drive data set, for which Pivot is over 22 000 times faster. For Facebook2, Pivot is over 6 000 times faster. Out of all indexing approaches, as expected, Pivot is the fastest method for all data sets. Tree is the second fastest with very large speed-ups compared to Gbase, Gshq and Gshqt. The low indexing times allow us to learn the parameter $h$, i.e., to find a suitable number of pivots.

Table 2: Indexing times in seconds.

Data set     Tree    Pivot    Gbase        Gshq      Gshqt
Facebook1    .14     0.06     188.23       32.77     25.
Facebook2    .80     0.47     3 132.22     485.26    324.
Milan        .26     0.07     232.13       43.66     34.
T-Drive      .09     0.49     11 823.81    475.06    403.

Table 3: Threshold values $r$ used for the pivot-based filters during query time. Rows: Tree, Pivot; columns: Facebook1, Facebook2, Milan, T-Drive.

Table 4: Query times in seconds for top-$k$ similarity queries. The running times are the average and standard deviations over 100 queries. The fastest running time in each row is highlighted. Columns: data set, $k$, Exact, Tree, Pivot, Gbase, Gshq, Gshqt.

Q2:

We selected 100 trajectories randomly from the data sets as queries. The query interval is set to $I(Q)$. We ran the algorithms for four values of $k$, among them $k = 1$, $k = 16$ and $k = 64$. Table 3 shows the threshold radii that we used for the pivot-based filtering, and Table 4 shows the average running times for querying a trajectory from the data set. First, note that the query times of our exact approach (Exact) are lower than the query times of Gbase for all data sets. For the Facebook2 and the T-Drive instances, Gbase is up to 0.2 seconds slower. We will see later that Gbase, in contrast to our exact approach, does not always find the optimal solution set. Both of our index structures lead to accelerated query times compared to the exact approach for all data sets but the Milan data set. The largest speed-up, by a factor of about three to four, is achieved for Pivot on the Facebook instances. Note that almost always, the two heuristics Gshq and Gshqt are much slower in answering the queries. For the Milan and T-Drive instances, they are even slower than our exact approach. The reason is that they often have large candidate sets, see Table 5. Tree is on par with Pivot and has, in most cases, only slightly higher running times despite the more complex data structure. The candidate set sizes of Tree and Pivot are similar for the Facebook data sets, see Table 5. For the Milan data set, Tree returns a smaller candidate set and has a slightly better running time for $k = 1$ and $k = 64$ compared to Pivot. The full potential of the Tree index does not come into play for the other data sets due to the temporal distribution of the trajectories and/or the limited size. We suspect that the high running time of Gbase for Facebook2 and $k = 1$ is an outlier and the result of the very high memory usage of the algorithm. Figure 2 shows the average running times for $k = 64$.

Table 5: Average candidate set sizes and standard deviation over 100 queries. Columns: data set, Tree, Pivot, Gbase, Gshq, Gshqt.

Q3:

In order to evaluate the quality of our query results, we use the similarity score ratio (SSR) defined in [6]. The SSR of two sets $\mathcal{T}_1$ and $\mathcal{T}_2$ of trajectories with respect to a query is defined as

$$SSR(\mathcal{T}_1, \mathcal{T}_2, (Q, s)) = \frac{\sum_{T \in \mathcal{T}_1} Sim(Q, T, s)}{\sum_{T \in \mathcal{T}_2} Sim(Q, T, s)}.$$

We compare the results of the indices to the results of the exact algorithm Exact. Table 6 shows the average SSR values and the standard deviations over 100 queries.

First, we observe that, as expected (see Section 5), the baseline Gbase [6] has not always found the optimal solution set. The SSR score takes values below one for the Milan and the T-Drive data sets. However, for the optimal solution, the SSR value should be one. With increasing value of k, the SSR value for our Tree algorithm decreases from 0.99 for k = 1 to 0.70 for k = 64. However, for our Pivot approach the decrease is less strong; the SSR score is always above 0.91 for the instances Facebook1, Facebook2, and T-Drive. For the Milan instance, our heuristics do not behave very well for large k. Here, the SSR score for Tree takes a low value for k = 64. The reason for this low value is the small value of r chosen in our experiments. However, a larger value of r would lead to an even higher running time compared to the exact computation, which is already faster. This is because the Milan trajectories are relatively short (see Table 1). For the Facebook2 instances, the SSR score of Pivot is always 0.99. The values of the Gshq and Gshqt heuristics for Facebook1, Facebook2, and T-Drive are always above 0.93 due to the usage of the large candidate sets (see Table 5). However, remember that their query times are much longer than those of Tree, Pivot, and even our exact computations.

Figure 2: The average running times for k = 64 over 100 queries on a logarithmic scale.
[Plot: y-axis running time (s); series Exact, Tree, Pivot, Gbase, Gshq, Gshqt.]

Table 6: The SSR results are the averages and standard deviations over 100 queries.
[Columns: Data set; k; Algorithm: Tree, Pivot, Gbase, Gshq, Gshqt. Rows: Facebook1, Facebook2, Milan, T-Drive for each value of k. The numeric entries are not recoverable.]

Table 7: Candidate set sizes |C|, running times in s, and SSR for varying number of pivot elements h and radii r for the T-Drive data set. We report the average and the standard deviation over 100 queries.
[Columns: h; k; then size, time, and SSR for each of the two radii r. Rows: Tree and Pivot, each with (h, k) = (64, 1), (128, 1), (64, 64), (128, 64). Legible fragments: the candidate set sizes for the first radius begin with 354 (Tree, h = 64), 247 (Tree, h = 128), 337 (Pivot, h = 64), and 230 (Pivot, h = 128). The remaining numeric entries are not recoverable.]

Figure 3: Average speed-up by using upper bounding during the calculation of the similarity over 100 queries.
[Plot: y-axis speed-up factor; data sets Facebook 1, Facebook 2, Milan, T-Drive.]
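The similarity score ratio used in Q3 is a ratio of two sums of similarities, which a short Python sketch can make concrete. The function name `ssr`, the callable `sim`, and the toy similarity values are assumptions made for illustration only; they do not come from the paper or from [6].

```python
from typing import Callable, Hashable, Iterable


def ssr(result: Iterable[Hashable],
        reference: Iterable[Hashable],
        sim: Callable[[Hashable], float]) -> float:
    """Ratio of summed similarities of two result sets for one fixed query."""
    numerator = sum(sim(t) for t in result)
    denominator = sum(sim(t) for t in reference)
    if denominator == 0.0:
        raise ValueError("reference set has zero total similarity")
    return numerator / denominator


# Toy similarities of four trajectories to one fixed query (made-up numbers).
toy_sim = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}
exact_top2 = ["a", "b"]       # what an exact top-2 search would return
heuristic_top2 = ["a", "c"]   # a heuristic result that missed "b"

print(ssr(exact_top2, exact_top2, toy_sim.get))      # 1.0
print(ssr(heuristic_top2, exact_top2, toy_sim.get))  # roughly 0.82
```

With the exact result as the reference set, a value of one indicates that the compared result set is as good as the optimum, and a value below one indicates that better trajectories were missed.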

Q4:

By increasing the number of pivot elements h, a larger number of trajectories may be excluded from the candidate set, since every pivot adds an additional filter. However, each additional pivot might lead to a higher number of false negatives, i.e., trajectories that are not part of the candidate set but are part of the optimal top-k set. For the T-Drive data set, we ran Tree and Pivot with h ∈ {64, 128} and two values of the radius r. For building the index, Tree took 1.41 seconds and Pivot took 0.85 seconds. Table 7 shows the effect on query times and quality. We compare these results to Table 6 and Table 4 (there, the values of r were those given in Table 3).

Lowering h and increasing r each increase the size of the candidate set. A larger candidate set may lead to better SSR values; however, it also increases the running time. Notice that for k = 1 we can achieve faster running times with a high SSR value by choosing a small radius r. For k = 64, Pivot improves its SSR value compared to Table 6 by choosing h = 128 pivots and radius r = 0.4, while being faster than Exact.

Figure 4: Distribution of similarities. The x-axis ranges from 0 to 1. The y-axis shows the fraction of input-query pairs that have this similarity on a logarithmic scale. The red line highlights the 10th percentile for Facebook1 and the 1st percentile for the other data sets. All the similarities of trajectories found by our queries lie to the right of the displayed red lines.
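The trade-off between the number of pivots h and the radius r discussed in Q4 can be illustrated with a generic metric-space pivot filter. This is only a sketch under strong assumptions: it uses the Euclidean distance on random 2-D points as a stand-in metric and the standard triangle-inequality test, not the paper's spatio-temporal distance or its specific form of the triangle inequality. The names `dist`, `build_index`, and `candidates`, and the choice of the first h items as pivots, are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    """Stand-in metric (Euclidean); NOT the paper's distance function."""
    return math.dist(a, b)


def build_index(data: Sequence[Point], pivots: Sequence[Point]) -> List[List[float]]:
    """Precompute the distance from every pivot to every data item."""
    return [[dist(p, x) for x in data] for p in pivots]


def candidates(query: Point, pivots: Sequence[Point],
               index: List[List[float]], r: float) -> List[int]:
    """Keep item i only if no pivot rules out dist(query, item i) <= r.

    By the triangle inequality, |d(q, p) - d(p, x)| <= d(q, x), so an item
    whose two pivot distances differ by more than r cannot lie within r of q.
    Assumes at least one pivot.
    """
    q_to_p = [dist(query, p) for p in pivots]
    n = len(index[0])
    return [i for i in range(n)
            if all(abs(q_to_p[j] - index[j][i]) <= r for j in range(len(pivots)))]


random.seed(0)
data = [(random.random(), random.random()) for _ in range(1000)]
query = (0.5, 0.5)

for h in (2, 8):
    pivots = data[:h]                  # hypothetical pivot choice: first h items
    index = build_index(data, pivots)
    for r in (0.1, 0.3):
        print(h, r, len(candidates(query, pivots, index, r)))
```

In this sketch, every extra pivot adds one more test that an item must pass, so the candidate set can only shrink as h grows, while a larger r loosens every test and enlarges it. Items farther than r from the query may be dropped even if they belong to the true top-k set, which is the kind of false negative described above.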

Q5:

To evaluate the speedup gained by the upper bounding technique, we computed the similarity for 100 queries for k = 2^i with 0 ≤ i ≤ 6. For each k, we computed the top-k results without indexing, with and without the upper bounding. Figure 3 shows the speedup that is achieved by using the upper bounding technique. The T-Drive and Milan data sets profit immensely, with speedups ranging from over 4 to 14 and from 2 to 13, respectively. The speedups decrease with increasing k. The reason is that there are often only a few trajectories with very high similarity. If the algorithm finds these early on during the processing of the query and if the value of k is small, then the upper bounding is most effective. For larger k, the lowest of the top-k similarities is closer to the non-top-k similarities, and upper bounding, i.e., stopping the computation early, happens less often. There is no speedup in the case of the Facebook data sets. The reason is that the differences in the similarities between the query and the trajectories are small (see Figure 4). Moreover, due to the long trajectories (see Table 1), the upper bounds have to be updated often, such that the upper bounding in total cannot speed up the query.
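The effect described in Q5 can be sketched with a generic top-k scan that abandons a similarity computation once an upper bound shows it cannot enter the current top-k. The similarity model here is an assumption made for illustration: each candidate is a list of per-step contributions in [0, 1] and its similarity is their average. The paper's actual similarity function and bound are not reproduced, and the names `bounded_similarity` and `top_k` are hypothetical.

```python
import heapq
from typing import List, Optional, Sequence, Tuple


def bounded_similarity(contribs: Sequence[float], threshold: float) -> Optional[float]:
    """Average of per-step contributions in [0, 1], abandoned early if hopeless.

    Returns None as soon as even perfect remaining steps could not lift the
    final value above `threshold`.
    """
    n = len(contribs)
    if n == 0:
        raise ValueError("empty contribution list")
    partial = 0.0
    for i, c in enumerate(contribs):
        partial += c
        upper = (partial + (n - i - 1)) / n   # remaining steps add at most 1 each
        if upper <= threshold:
            return None
    return partial / n


def top_k(items: Sequence[Sequence[float]], k: int) -> List[Tuple[float, int]]:
    """Return the k best (similarity, index) pairs, best first."""
    heap: List[Tuple[float, int]] = []        # min-heap holding the current top k
    for idx, contribs in enumerate(items):
        # Until k results exist, nothing may be pruned.
        threshold = heap[0][0] if len(heap) == k else -1.0
        s = bounded_similarity(contribs, threshold)
        if s is None:
            continue                          # pruned by the upper bound
        if len(heap) < k:
            heapq.heappush(heap, (s, idx))
        else:
            heapq.heapreplace(heap, (s, idx))
    return sorted(heap, reverse=True)


toy = [
    [0.9, 0.9, 0.9, 0.9],
    [0.1, 0.0, 0.2, 0.1],
    [0.8, 1.0, 0.9, 0.7],
    [0.0, 0.0, 0.5, 0.5],
]
print([idx for _, idx in top_k(toy, 1)])  # [0]
print([idx for _, idx in top_k(toy, 2)])  # [0, 2]
```

In the toy run with k = 1, the second and fourth candidates are abandoned after their first step because the threshold is already high. With k = 2 the threshold is lower for longer, so fewer computations stop early, which mirrors the decreasing speedup for larger k.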

We studied computing the top-k most similar trajectories in a graph to a given query trajectory. For this, we proposed a new spatio-temporal similarity measure based on the work of Grossi et al. [6]. We derived a distance function from our new similarity function, which satisfies a triangle inequality under certain conditions. This forms the basis for our pivot-based filtering technique, which accelerates finding exact solutions of top-k trajectory queries. Furthermore, we suggested a tree-based temporal filtering method in combination with the pivot-based technique. Both approaches strongly outperform the baselines for all data sets except the Milan data set. Here, our new baseline algorithm that uses the upper bounding technique has the lowest running time. It is also the first exact algorithm for the top-k trajectory problem, as we showed that the baseline in [6] does not always find the exact solution.

Acknowledgments

This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – EXC-2047/1 – 390685813.

References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Foundations of Data Organization and Algorithms, pages 69–84. Springer Berlin Heidelberg, 1993.
[2] L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD ’05, pages 491–502. ACM, 2005.
[3] L. Chen, S. Shang, B. Yao, and K. Zheng. Spatio-temporal top-k term search over sliding window. World Wide Web, 22(5):1953–1970, 2019.
[4] Z. Chen, H. T. Shen, X. Zhou, Y. Zheng, and X. Xie. Searching trajectories by locations: An efficiency study. Pages 255–266. ACM, 2010.
[5] A. Driemel, I. Psarros, and M. Schmidt. Sublinear data structures for short Fréchet queries. CoRR, abs/1907.04420, 2019.
[6] R. Grossi, A. Marino, and S. Moghtasedi. Finding structurally and temporally similar trajectories in graphs. Volume 160 of LIPIcs, pages 24:1–24:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
[7] J.-R. Hwang, H.-Y. Kang, and K.-J. Li. Searching for similar trajectories on road networks using spatio-temporal similarity. In Advances in Databases and Information Systems, pages 282–295. Springer Berlin Heidelberg, 2006.
[8] P. Indyk. Approximate nearest neighbor algorithms for Fréchet distance via product metrics. Pages 102–106, 2002.
[9] J. J. McAuley and J. Leskovec. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, pages 548–556, 2012.
[10] S. Shang, L. Chen, Z. Wei, C. S. Jensen, K. Zheng, and P. Kalnis. Parallel trajectory similarity joins in spatial networks. VLDB J., 27(3):395–420, 2018.
[11] S. Shang, R. Ding, K. Zheng, C. S. Jensen, P. Kalnis, and X. Zhou. Personalized trajectory matching in spatial networks. VLDB J., 23(3):449–468, 2014.
[12] H. Su, S. Liu, B. Zheng, X. Zhou, and K. Zheng. A survey of trajectory distance measures and performance evaluation. VLDB J., 29(1):3–32, 2020.
[13] E. Tiakas, A. N. Papadopoulos, A. Nanopoulos, Y. Manolopoulos, D. Stojanovic, and S. Djordjevic-Kajan. Trajectory similarity search in spatial networks. Pages 185–192. IEEE, 2006.
[14] E. Tiakas and D. Rafailidis. Scalable trajectory similarity search based on locations in spatial networks. In Model and Data Engineering - 5th Intl. Conf., MEDI, volume 9344 of LNCS, pages 213–224. Springer, 2015.
[15] Y. Xia, G.-Y. Wang, X. Zhang, G.-B. Kim, and H.-Y. Bae. Spatio-temporal similarity measure for network constrained trajectory data. Intl. J. Computational Intelligence Systems, 4:1070–1079, 2011.
[16] J. Yuan, Y. Zheng, X. Xie, and G. Sun. Driving with knowledge from the physical world. Pages 316–324. ACM, 2011.
[17] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-drive: Driving directions based on taxi trajectories. In 18th ACM SIGSPATIAL Intl. Symp. Advances in Geographic Information Systems