Spatio-Temporal Top-k Similarity Search for Trajectories in Graphs
Lutz Oettershagen ∗ Anne Driemel † Petra Mutzel ∗

Abstract
We study the problem of finding the $k$ most similar trajectories to a given query trajectory. Our work is inspired by the work of Grossi et al. [6] that considers trajectories as walks in a graph. Each visited vertex is accompanied by a time interval. Grossi et al. define a similarity function that captures temporal and spatial aspects. We improve this similarity function to derive a new spatio-temporal distance function for which we can show that a specific type of triangle inequality is satisfied. This distance function is the basis for our index structures, which can be constructed efficiently, need only linear memory, and can quickly answer queries for the top-$k$ most similar trajectories. Our evaluation on real-world and synthetic data sets shows that our algorithms outperform the baselines with respect to indexing time by several orders of magnitude while achieving similar or better query time and quality of results.

Keywords: Trajectories, Indexing, Top-$k$ Query
∗ Institute of Computer Science, University of Bonn, Germany, {lutz.oettershagen,petra.mutzel}@cs.uni-bonn.de
† Hausdorff Center for Mathematics, University of Bonn, Germany
arXiv preprint [cs.DS]

More and more trajectory data is collected due to the ubiquitous availability of sensors and personal mobile devices that allow tracking of movement over time. Therefore, trajectory data mining is attracting increasing attention in the scientific literature [1, 2, 3, 4, 7, 11, 13, 14, 15]. For any fundamental task in trajectory data mining, the choice of similarity measure is a crucial step in the design process. Often there are spatial restrictions to the movement, and the trajectories of interest are related to a graph or are mapped to a spatial network. We are interested in a similarity that takes such spatial as well as temporal aspects into account. We consider two trajectories as similar if they visit the same or proximate vertices during the same periods of time.

Our work is inspired by Grossi et al. [6], who define a similarity function for two trajectories in a graph. The trajectories can be of different length, and the similarity function takes spatial and temporal similarity into account. It can be computed in linear time with respect to the length of the trajectories. We consider trajectories as sequences of vertices in a graph, and for each visited vertex there is a discrete time interval for the time the trajectory stays at the vertex. See Figure 1 for an example.

Figure 1: Example for trajectories with time intervals in a network, e.g., an online social network. The trajectories reveal the user behavior, i.e., the times a user visits and leaves a website.

Based on Grossi et al. [6], we introduce a new, improved spatio-temporal similarity and a corresponding distance function for trajectories in graphs. We show that a specific kind of triangle inequality holds for the distance function under reasonable assumptions. This distance function provides the basis for new index data structures that allow efficient top-$k$ similarity queries. A top-$k$ trajectory query $(Q, s)$ specifies a trajectory $Q$ and a time interval $s$. Given a set of trajectories $\mathcal{T}$, the result of a top-$k$ query consists of the subset of $\mathcal{T}$ containing all trajectories that have one of the $k$ highest similarities to $Q$ with respect to the time interval $s$. These queries have important real-life applications:
• Web analytics: Users of an online social network or web community follow links and visit user pages. The goal is to find similar browsing behavior. Figure 1 shows an example.
• Travel recommendation: Tourism is one of the largest industries, and the emergence of travel-focused social networks enables users to share their tours. The locations are points of interest (POI), and the intervals are the durations a person stays at a POI. A query is a request for a recommendation.
• Animal behavior: Consider wildlife that is tracked using GPS. The living space of the animals is divided into zones. The goal is to identify similarities in animal behavior. Vertices represent either specific locations, such as a waterhole or feeding place, or territories of animals.
• Traffic and crowd analysis: The goal is to identify person or vehicle flows at specific times through predefined areas. Vertices represent these areas. This also includes the application in contact tracing, where we need to determine the contacts of, e.g., an infected individual with other persons.

These applications have in common that one is interested in finding a set of the most similar trajectories to a given one. This is a fundamental problem in trajectory mining tasks such as clustering, outlier detection, classification, or prediction. It is necessary to select or define an adequate similarity measure or distance function, respectively, that fits the requirements of the application.
Contributions:
1. We introduce a spatio-temporal similarity function and show that the triangle inequality holds under certain conditions for the corresponding distance function. The similarity computation for two trajectories only needs linear time with respect to the length of the longer trajectory.
2. We design indices that can be constructed very efficiently and use linear memory with respect to the number of trajectories. The indices are based on spatial as well as on temporal filtering and allow heuristic top-$k$ similarity queries with short running times and high quality of the results. Additionally, we apply upper bounding, which allows a direct, highly efficient query even without the need for a preprocessed index data structure. In the latter case, the output is exact.
3. We evaluate our new algorithms on real-world and synthetic data sets. Our new solutions outperform the baselines (including [6]) with respect to indexing time by several orders of magnitude. Moreover, our query times are substantially faster, and the quality of the results is better than or on par with the baseline algorithms.

Since trajectory similarity is of high interest for many data analytics tasks, many different similarity measures have been used, e.g., based on dynamic time warping, Euclidean distances, or edit distances. Su et al. [12] provide an overview. For trajectory analysis in networks, many approaches have concentrated on the spatial similarity only, and a few consider spatio-temporal similarity. Hwang et al. [7] have suggested a similarity measure based on the network distance measuring spatial and temporal similarity. However, a set of nodes needs to be selected in advance, and spatial similarity then means passing through the same nodes simultaneously. Xia et al. [15] use a similarity measure for network-constrained trajectories based on an extension of the Jaccard similarity. As a similarity measure, they use the product of spatial and temporal similarity. Tiakas et al. [13, 14] suggest a weighted sum of spatial and temporal similarity. Their similarity function works for two trajectories with the same length and can be computed in linear time with respect to the length of the given trajectories. Shang et al. [10] also use a weighted sum of spatial and temporal similarity for similarity joins of trajectories.

Another way to approach the problem is to use a distance measure based on the discrete Fréchet distance or dynamic time warping, which optimize over all vertex mappings between the two trajectories that respect the time ordering, where the underlying metric would be derived from the shortest-path metric given by the graph. Near-neighbor data structures have been studied theoretically with specific conditions on the underlying graph and the length of the queries, see [5, 8].

Our work is inspired by Grossi et al. [6]. They suggest a spatio-temporal similarity measure for two trajectories in a graph. The trajectories can be of different length, and if the pairwise distances are given, then the measure can be computed in linear time with respect to the length of the trajectories. The authors also suggest an algorithm for answering the top-$k$ trajectory query problem. For speeding up the computations, they suggest an indexing method based on interval trees and a method to approximate their similarity measure. We provide a more detailed description of their work and a comparison to our approach in Section 5.

An undirected and weighted graph $G = (V, E, c)$ consists of a finite set of vertices $V$, a finite set $E \subseteq \{\{u, v\} \subseteq V \mid u \neq v\}$ of undirected edges, and a cost function $c\colon E \to \mathbb{R}_{>0}$ that assigns a positive cost to each edge $e \in E$. A walk in $G$ is an alternating sequence of vertices and edges connecting consecutive vertices. A path is a walk that visits each vertex at most once. The cost of a walk or path is the sum of its edge costs. Let $d(u, v)$ denote the shortest-path distance between $u, v \in V$.

Definition 1. (Trajectory)
Let $G = (V, E, c)$ be an undirected, weighted and connected graph. A trajectory $T$ is a sequence of pairs $((v_1, t_1), \ldots, (v_\ell, t_\ell))$, such that for $1 \le i \le \ell$ the pair $(v_i, t_i)$ consists of $v_i \in V$ and a discrete time interval $t_i = [a_i, b_i]$ with $a_i, b_i \in \mathbb{Z}$, $a_i < b_i$, and $a_{i+1} = b_i$ for $1 \le i < \ell$.

The starting time of $T$ is $T.start = a_1$ and the ending time is $T.end = b_\ell$. We denote with $I(T)$ the total interval in which trajectory $T$ exists, i.e., from $T.start$ to $T.end$. For a trajectory $T$ and a time interval $t = [a, b]$ we define $T[t]$ as the time-restricted trajectory that is intersected with $t$, i.e., $T[t] = ((v_i, t'_i), \ldots, (v_j, t'_j))$ with $v_i$ ($v_j$) being the first (last) vertex of $T$ such that for $t_i = [a_i, b_i]$ it holds that $b_i > a$ (and for $t_j = [a_j, b_j]$ that $a_j < b$, resp.). The first and last intervals are clipped to $t$, i.e., $t'_i = [\max\{a_i, a\}, b_i]$ and $t'_j = [a_j, \min\{b_j, b\}]$. We assume for $T = ((v_1, t_1), \ldots, (v_\ell, t_\ell))$ that $v_i \neq v_{i+1}$ for all $1 \le i < \ell$. We say trajectory $T$ intersects a time interval $t$ if there is a $(v_i, t_i) \in T$ with $t_i \cap t \neq \emptyset$.

We define our new similarity function for trajectories on networks. The goal is to capture both temporal and spatial aspects, such that if two trajectories are often in close proximity, i.e., visiting vertices that are close to each other during the same period of time, then the similarity should be high.
Definition 2. Let $T = ((v_1, t_1), \ldots, (v_\ell, t_\ell))$ and $Q = ((u_1, s_1), \ldots, (u_k, s_k))$ be two trajectories, and $s$ a time interval. We define the similarity of $Q$ and $T$ in the time interval $s$ as

$$\operatorname{Sim}(Q, T, s) = \frac{1}{|s|} \cdot \sum_{\substack{(v_i, t_i) \in T \\ (u_j, s_j) \in Q}} |s \cap t_i \cap s_j| \cdot e^{-d(v_i, u_j)}.$$

Notice that for two trajectories $T$ and $Q$, and a time interval $s$, it holds that $0 \le \operatorname{Sim}(Q, T, s) \le 1$. $\operatorname{Sim}(Q, T, s)$ is minimal if the common intersection of the time intervals is empty. In this case $\operatorname{Sim}(Q, T, s) = 0$.
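To make Definition 2 concrete, the following is a minimal Python sketch that evaluates the double sum directly. It is an illustration only, not the authors' C++ implementation. It assumes that a discrete interval $[a, b]$ has length $b - a$, that a trajectory is a list of `(vertex, (a, b))` pairs, and that `d` is a caller-supplied shortest-path distance; the names `overlap`, `sim_naive` and the toy path graph are hypothetical.

```python
import math

def overlap(*intervals):
    """Length of the common intersection of intervals (a, b).

    Assumption: the length of a discrete interval [a, b] is b - a.
    """
    lo = max(a for a, _ in intervals)
    hi = min(b for _, b in intervals)
    return max(0, hi - lo)

def sim_naive(Q, T, s, d):
    """Definition 2 evaluated as a plain double sum over all pairs."""
    total = 0.0
    for v, t_i in T:
        for u, s_j in Q:
            w = overlap(s, t_i, s_j)
            if w > 0:
                total += w * math.exp(-d(v, u))
    return total / (s[1] - s[0])

# Toy instance (hypothetical): a path graph 0 - 1 - 2 - 3 with unit edge
# costs, so the shortest-path distance is simply |u - v|.
d = lambda u, v: abs(u - v)
Q = [(0, (0, 2)), (1, (2, 4))]
T = [(0, (0, 1)), (1, (1, 4))]
s = (0, 4)

value = sim_naive(Q, T, s, d)  # equals (3 + e^{-1}) / 4 on this instance
```

On this instance the two trajectories disagree only during $[1, 2]$, where they are one hop apart, so the sum is $1 + e^{-1} + 2$ and the similarity is $(3 + e^{-1})/4$. The double loop takes quadratic time and is only meant to mirror the formula.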
Lemma 3.1. Let $Q$ and $T$ be trajectories and $s$ a time interval with $s \subseteq I(Q)$. It holds that
1. $\operatorname{Sim}(Q, T, s) = \operatorname{Sim}(T, Q, s)$, and
2. $Q[s] = T[s]$ if and only if $\operatorname{Sim}(Q, T, s) = 1$.

Proof. (1.) The shortest-path metric is symmetric, i.e., $d(u, v) = d(v, u)$ for all $u, v \in V$. The summation is over the same pairs of $(v_i, t_i) \in T$ and $(u_j, s_j) \in Q$, and the intersection of the intervals is commutative. Therefore, the result follows.
(2.) $\Rightarrow$: Notice that if $Q[s] = T[s]$, then in each step of the summation $e^{-d(u, v)} = 1$. Because $s \subseteq I(Q)$, the result of the summation is $|s|$, and after normalization it is 1.
$\Leftarrow$: Assume that $\operatorname{Sim}(Q, T, s) = 1$ but $Q[s] \neq T[s]$, i.e., $Q[s]$ and $T[s]$ differ in the vertices they visit or the times when they visit them. In the first case, due to the strictly positive edge weights, there is a vertex pair such that $e^{-d(u, v)} < 1$, whereas for all other vertex pairs the value $e^{-d(u', v')}$ is at most 1. Because the intervals are intersected with the interval $s$, the total sum will be less than $|s|$, which contradicts the assumption. Analogously, a contradiction follows in the case that $|I(T[s])| < |s|$. Now consider the case that $Q[s]$ and $T[s]$ differ in the times when they visit the vertices. Because of the assumption that a trajectory does not stay at the same vertex in two consecutive time intervals, there is an intersection of time intervals in which $Q[s]$ and $T[s]$ visit different vertices $u$ and $v$. Due to the strictly positive edge weights it is $e^{-d(u, v)} < 1$. This again leads to a contradiction.

For the computation of the similarity, the shortest-path distances between the vertices of the graph are needed. These distances can be precomputed for all vertices or computed on the fly for the vertices $u$ that are visited by the query trajectory $Q$.

Theorem 3.1. Let $Q$ and $T$ be trajectories, and $s$ a time interval. The computation of the similarity $\operatorname{Sim}(Q, T, s)$ takes $O(|Q| + |T|)$ time if the shortest-path distance $d(u, v)$ between $u, v \in V$ can be obtained in constant time.

Proof. Consider the query trajectory $Q = ((u_1, s_1), \ldots, (u_i, [a_i, b_i]), \ldots, (u_k, s_k))$ and the trajectory $T = ((v_1, t_1), \ldots, (v_j, [c_j, d_j]), \ldots, (v_\ell, t_\ell))$. We start the computation with $i = 1$ and $j = 1$, and $|s \cap t_j \cap s_i|$ is either zero or larger than zero. In the first case we can increase both $i$ and $j$. In the second case, we increase $i$ if $b_i < d_j$ or $j = \ell$, and we increase $j$ if $b_i > d_j$ or $i = k$. We repeat this at most $|Q| + |T|$ times and find all pairs $(u_i, s_i)$ and $(v_j, t_j)$ that have a non-empty intersection.

Definition 2 is similar to the similarity function defined in [6]; however, our improvements allow us to prove useful properties of the corresponding distance function. We now define the distance function based on the similarity and show a specific type of triangle inequality.

Definition 3. Let $Q$ and $T$ be trajectories, and $s$ a time interval. We define the distance $\operatorname{Dist}(Q, T, s) = 1 - \operatorname{Sim}(Q, T, s)$.

Lemma 3.2.
Let $Q$, $T$ and $R$ be trajectories, and $s$ a time interval. If $s \subseteq I(Q)$, then $\operatorname{Dist}(Q, T, s) \le \operatorname{Dist}(Q, R, s) + \operatorname{Dist}(R, T, s)$.

Proof. Let $t = I(T)$ and $r = I(R)$. We can assume without loss of generality that $s = I(Q)$. We show that $1 - \operatorname{Sim}(Q, T, s) \le 1 - \operatorname{Sim}(Q, R, s) + 1 - \operatorname{Sim}(R, T, s)$. This is equivalent to $1 + \operatorname{Sim}(Q, T, s) \ge \operatorname{Sim}(Q, R, s) + \operatorname{Sim}(R, T, s)$. By substituting Definition 2 and using the fact that

$$|s \cap t| = \sum_{\substack{(u_j, s_j) \in Q \\ (v_i, t_i) \in T}} |s \cap s_j \cap t_i|,$$

we can rewrite the above equivalently as

$$|s| - |s \cap t| + \sum_{\substack{(u_j, s_j) \in Q \\ (v_i, t_i) \in T}} |s \cap s_j \cap t_i| \cdot \left(1 + e^{-d(u_j, v_i)}\right) \ge \sum_{\substack{(u_j, s_j) \in Q \\ (w_k, r_k) \in R}} |s \cap s_j \cap r_k| \cdot e^{-d(u_j, w_k)} + \sum_{\substack{(w_k, r_k) \in R \\ (v_i, t_i) \in T}} |s \cap r_k \cap t_i| \cdot e^{-d(w_k, v_i)}.$$

We now want to show that the above inequality holds. Consider the following multisets of vertex pairs. $A$ contains the pairs $(u_j, v_i)$ that are summed up on the left side of the inequality for which $|s \cap s_j \cap t_i| > 0$, where each $(u_j, v_i)$ is in $A$ exactly $|s \cap s_j \cap t_i|$ times. Similarly, $B$ contains the pairs $(u_j, w_k)$ that are summed up during the first summation on the right-hand side of the inequality for which $|s \cap s_j \cap r_k| > 0$, where each $(u_j, w_k)$ is in $B$ exactly $|s \cap s_j \cap r_k|$ times. And finally, $C$ contains the pairs $(w_k, v_i)$ that are summed up during the second summation on the right-hand side of the inequality for which $|s \cap r_k \cap t_i| > 0$, where each $(w_k, v_i)$ is in $C$ exactly $|s \cap r_k \cap t_i|$ times. Then, we show

$$|s| - |s \cap t| + \sum_{(u_j, v_i) \in A} \left(1 + e^{-d(u_j, v_i)}\right) \ge \sum_{(u_j, w_k) \in B} e^{-d(u_j, w_k)} + \sum_{(w_k, v_i) \in C} e^{-d(w_k, v_i)}. \qquad (3.1)$$

We show that the multisets $A$, $B$ and $C$ contain vertex pairs such that the inequality holds. Let $p \subseteq s$ be an interval of length one. For each possible $p$ we may have some vertex pairs in the multisets. We need to consider the following cases:
1. $p \cap t \neq \emptyset$ and $p \cap r = \emptyset$: $A$ contains vertex pairs $(u, v)$, but neither $B$ nor $C$ contains corresponding pairs. This favors the left side of eq. (3.1).
2. $p \cap t \neq \emptyset$ and $p \cap r \neq \emptyset$: There are $(u_j, v_i) \in A$, $(u_j, w_k) \in B$ and $(w_k, v_i) \in C$. In this case it holds that $1 + e^{-d(u_j, v_i)} \ge e^{-d(u_j, w_k)} + e^{-d(w_k, v_i)}$. This follows because the triangle inequality for $d$ gives $e^{-d(u_j, v_i)} \ge e^{-d(u_j, w_k)} \cdot e^{-d(w_k, v_i)}$, and $1 + xy \ge x + y$ for all $x, y \in [0, 1]$.
3. $p \cap t = \emptyset$ and $p \cap r \neq \emptyset$: There are no corresponding vertex pairs in $A$ and $C$, but there are in $B$. However, this can only be the case for $|s| - |s \cap t|$ pairs, and each contributes at most 1 to the right-hand side.
4. $p \cap t = \emptyset$ and $p \cap r = \emptyset$: There are no corresponding vertex pairs in $A$, $B$ or $C$.

Now, we show a strong relationship between the similarities, or distances, of two trajectories with respect to two different time intervals.

Lemma 3.3. Let $Q$ and $T$ be trajectories, and $s$ and $t$ time intervals with $I(Q) = s \subseteq t$. It holds that

$$\operatorname{Dist}(Q, T, t) = 1 - \frac{|s|}{|t|} + \frac{|s|}{|t|} \operatorname{Dist}(Q, T, s).$$

Proof. Assuming $s \subseteq t$ and using Definition 2 it holds that

$$\operatorname{Sim}(Q, T, t) = \frac{1}{|t|} \sum_{\substack{(u_j, s_j) \in Q \\ (v_i, t_i) \in T}} |t \cap s_j \cap t_i| \cdot e^{-d(u_j, v_i)} = \frac{1}{|t|} \sum_{\substack{(u_j, s_j) \in Q \\ (v_i, t_i) \in T}} |s \cap s_j \cap t_i| \cdot e^{-d(u_j, v_i)},$$

since $s_j \subseteq s \subseteq t$ for all $s_j$. Now we can apply Definition 2 again and obtain

$$\operatorname{Sim}(Q, T, t) = \frac{|s|}{|t|} \operatorname{Sim}(Q, T, s).$$

Finally, applying Definition 3 leads to the result.

Lemma 3.2 and Lemma 3.3 are the basis for our indices that we present in the following section.
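The linear-time computation behind Theorem 3.1 and the rescaling identity of Lemma 3.3 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it uses a standard two-pointer merge over the interval sequences (advance whichever interval ends first), which is a common variant of the stepping rule in the proof rather than a literal transcription. As assumptions, an interval $[a, b]$ has length $b - a$, trajectories are lists of `(vertex, (a, b))` pairs with contiguous intervals, and `sim_sweep`, `dist` and the toy data are hypothetical names.

```python
import math

def sim_sweep(Q, T, s, d):
    """Sim(Q, T, s) via a two-pointer sweep in O(|Q| + |T|) steps.

    Assumes both trajectories have contiguous, time-ordered intervals
    and that d(u, v) is available in constant time.
    """
    i = j = 0
    total = 0.0
    while i < len(Q) and j < len(T):
        u, (a, b) = Q[i]
        v, (c, e) = T[j]
        # Length of s ∩ [a, b] ∩ [c, e], assuming |[x, y]| = y - x.
        w = min(b, e, s[1]) - max(a, c, s[0])
        if w > 0:
            total += w * math.exp(-d(u, v))
        # Advance the trajectory whose current interval ends first
        # (both if they end at the same time).
        if b <= e:
            i += 1
        if e <= b:
            j += 1
    return total / (s[1] - s[0])

def dist(Q, T, s, d):
    """Definition 3: Dist = 1 - Sim."""
    return 1.0 - sim_sweep(Q, T, s, d)

# Toy instance (hypothetical): path graph with unit costs, d(u, v) = |u - v|.
d = lambda u, v: abs(u - v)
Q = [(0, (0, 2)), (1, (2, 4))]                 # I(Q) = [0, 4]
T = [(0, (0, 1)), (1, (1, 6)), (2, (6, 10))]
s, t = (0, 4), (0, 10)                         # I(Q) = s ⊆ t

lhs = dist(Q, T, t, d)
ratio = (s[1] - s[0]) / (t[1] - t[0])
rhs = 1 - ratio + ratio * dist(Q, T, s, d)     # right-hand side of Lemma 3.3
```

On this instance `lhs` and `rhs` agree up to floating-point error, which is exactly what Lemma 3.3 states for $I(Q) = s \subseteq t$. Each loop iteration advances at least one pointer, so the number of iterations is at most $|Q| + |T|$.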
We introduce efficient indexing methods for the trajectories by applying temporal and spatial filters. First, we give a high-level view of our approach, which consists of two phases:
1. An offline phase for preparing the index. Given a set of trajectories $\mathcal{T}$, a preprocessing phase constructs the index $D$ that allows efficient queries.
2. The query phase. Given a query trajectory $Q$ and a time interval $s$, the index $D$ first determines a candidate set $\mathcal{C} \subseteq \mathcal{T}$. For each $T \in \mathcal{C}$ the query algorithm computes the similarity $\operatorname{Sim}(Q, T, s)$ and keeps all trajectories with a top-$k$ similarity in a heap data structure. The query result is the set of all trajectories with a top-$k$ similarity to $Q$ w.r.t. $s$.

Notice that the candidate set may contain all trajectories stored in $D$, e.g., if $k \ge |\mathcal{T}|$ or if all trajectories have the same similarity to the query. In the following, we describe the techniques that achieve small candidate sets wherever possible. Our techniques are based on filters for the temporal and the spatial domain. Our indexing algorithms, as well as our query algorithms, are embarrassingly parallel.

4.1 Pivot-Based Spatial Filters
We choose $h \in \mathbb{N}$ vertices $p_1, \ldots, p_h$ from which we construct $h$ pivot trajectories $P_1, \ldots, P_h$. The $h$ vertices are the ones that are most frequently visited by trajectories, where we also count multiple visits from a trajectory $T$ at a vertex. Each pivot trajectory $P_i$ stays stationary at vertex $p_i$ during the time interval $t = [a, b]$, where $a$ is the earliest starting time and $b$ the latest ending time over all trajectories $T \in \mathcal{T}$. Next we compute the pairwise distances $\operatorname{Dist}(T, P_i, t)$ between all $T \in \mathcal{T}$ and $P_i$ for $1 \le i \le h$ and store these distances together with the pivot trajectories. Given a query $(Q, s)$ we compute $\operatorname{Dist}(Q, P_i, t)$ for $1 \le i \le h$. Using Lemma 3.2 and Lemma 3.3, it follows that

$$|\operatorname{Dist}(Q, P_i, t) - \operatorname{Dist}(P_i, T, t)| \le \operatorname{Dist}(Q, T, t) = 1 - \frac{|s|}{|t|} + \frac{|s|}{|t|} \operatorname{Dist}(Q, T, s),$$

where we use that $t \subseteq I(P_i)$, which holds by construction of $P_i$. We can use the above bound to filter out many trajectories from the candidate set that are too far away from the query to be in the top-$k$ result set. To this end, we use a threshold radius $r$ such that we only keep trajectories $T$ for which $|\operatorname{Dist}(Q, P_i, t) - \operatorname{Dist}(T, P_i, t)| \le r$ for all pivots $P_i$ with $1 \le i \le h$.

The running time needed for filtering the trajectories during a query is in $O(|\mathcal{T}| + h \cdot |Q|)$. The construction of the index utilizing pivot-based spatial filtering is efficient: we only need to compute the distance between the $h$ pivot trajectories and all $T \in \mathcal{T}$, each in $O(|T| + |P_i|)$.

Theorem 4.1. The index based on pivot-based spatial filters can be computed in $O(|\mathcal{T}| \cdot hm)$ time, where $m$ is the maximal length of a trajectory over $\mathcal{T}$. The memory needed for storage is in $O(|\mathcal{T}| \cdot h)$.

Notice that trajectories that have an empty intersection with the query interval $s$ do not have to be considered in the candidate set $\mathcal{C} \subseteq \mathcal{T}$. To filter out such trajectories we construct a binary interval tree using the following procedure. For each node $h$ in the tree, we have a set of trajectories $\mathcal{T}_h$. We compute the median $m$ of the end-points in $\mathcal{T}_h$ and assign all trajectories that end before $m$ to the left child and all trajectories that start after $m$ to the right child of $h$. All other trajectories are stored at $h$. We proceed recursively until we reach a minimum size for the trajectory set $\mathcal{T}_l$, where $l$ is a leaf of the tree. We combine the temporal filter with the pivot-based spatial filter by using a pivot-based spatial filter at each node $h$ for the trajectories stored at node $h$.

The running time needed for temporal filtering during a query is in $O(|\mathcal{C}| + h \cdot |Q|)$, with $|\mathcal{C}|$ being the size of the candidate set returned by the index. During the construction, we have to do the pivot-based filter construction at each node of the tree.

Theorem 4.2. The tree index can be computed in $O(\log(|\mathcal{T}|) \cdot |\mathcal{T}| \cdot hm)$ time, where $m$ is the maximal length of a trajectory in $\mathcal{T}$. The memory needed for storage is in $O(|\mathcal{T}| \cdot h)$.

During the computations of the similarities between $Q$ and a trajectory $T$ in a set of trajectories $\mathcal{T}$ we can apply the following upper bounding technique. Let $T_1, \ldots, T_{|\mathcal{C}|}$ be the trajectories of the candidate set in order of processing. After computing the similarity of the first $k$ trajectories, we can stop the similarity computation between $Q$ and $T_h$ for $h > k$ early if we can assure that $\operatorname{Sim}(Q, T_h, s)$ is smaller than each of the $k$ highest similarities between $Q$ and any $T \in \mathcal{T}$ computed so far. To this end, we iteratively update an upper bound $\bar{s}$ for the value of $\operatorname{Sim}(Q, T_h, s)$. Consider the computation of the similarity $\operatorname{Sim}(Q, T_h, s)$ described in the proof of Theorem 3.1. At each step, before increasing $i$ or $j$, we obtain the upper bound $\bar{s}$ for $\operatorname{Sim}(Q, T_h, s)$ by assuming that in each remaining time step the trajectories are at the same vertices. If $\bar{s}$ is smaller than the lowest of the $k$ highest similarities found so far, we stop the computation of $\operatorname{Sim}(Q, T_h, s)$ and proceed with $\operatorname{Sim}(Q, T_{h+1}, s)$.

Grossi et al. [6] introduce three algorithms for answering top-$k$ similarity queries in a spatio-temporal setting. The idea of their baseline algorithm is to have a preprocessing phase that constructs an interval tree at each vertex $v$ of the graph. The interval tree at $v \in V$ contains all pairs $(T.id, t)$ if trajectory $T \in \mathcal{T}$ visits $v$ or any of its adjacent vertices during time interval $t$. Here, $T.id$ is the identifier of the trajectory $T$. Then, using the constructed index, a query $(Q, s)$ is answered by visiting all vertices $v$ with $(v, t) \in Q$ and collecting all ids of trajectories that visit vertex $v$ or any of its neighbors during $t \cap s$. With the collected set of ids, the candidate set of trajectories can be evaluated, and the top-$k$ similar trajectories are found. Therefore, the running time and memory requirements depend on the number of trajectories, the lengths of the trajectories, and the vertex degrees. Moreover, the algorithm solves a special case, in which only trajectories are considered that have at least one vertex in hop-distance at most 1 to a vertex of the query trajectory. A simple example for which the algorithm fails to find a similar trajectory can be constructed in a graph consisting of a chain of four vertices, i.e., $G = (\{v_1, \ldots, v_4\}, \{v_1 v_2, v_2 v_3, v_3 v_4\})$, and two single-vertex trajectories $T$ and $Q$ that both start at time 0 and are more than one hop apart. Only the interval trees at the vertex of $T$ and at its neighbor contain the id of $T$. A query for $Q$ only visits the vertex of $Q$ and cannot find $T$. Grossi et al. [6] also introduce two heuristic algorithms for the top-$k$ query problem. Their idea is to reduce the graph size and then shrink the length of the query trajectory or of all trajectories to save running time by reducing the number of distance computations between vertices. However, this can also lead to larger candidate sets, and hence more evaluations of the similarity function are necessary.

Our algorithms differ from the ones suggested by Grossi et al. [6] as follows. First, we introduced an alternative and improved similarity function for which we showed certain metric-like properties. Our indices use these properties to reduce the size of the trajectory candidate set and, hence, reduce the number of similarity computations. The memory requirements of our indices are independent of the size of the graph and only linear in the number of trajectories (see Theorems 4.1 and 4.2). By using the upper bounding technique without preprocessing and constructing an index, we obtain an exact algorithm that is competitive in terms of running time.

In this section, we evaluate our new algorithms and compare them to the approaches suggested in [6]. We are interested in answering the following questions:
Q1: How fast are the indexing times of our algorithms compared to the algorithms in [6]?
Q2: How fast are queries of our algorithms compared to the baseline and to the heuristics in [6]? Do our index solutions improve the query times?
Q3: How good is the quality of the approximated top-$k$ queries?
Q4: How do the choices of the radius $r$ and the number of pivots $h$ impact running time and accuracy?
Q5: How much does the upper bounding improve the running time?
We implemented the following new algorithms:
• Exact is the linear scan over the complete data set that does not use indexing.
• Tree is the index that uses an interval tree with additional pivot-based spatial filtering at each node of the interval tree (see Section 4.2).
• Pivot is the index that applies the pivot-based spatial filtering globally (see Section 4.1).
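The filtering step that Pivot applies globally, and Tree applies per node, can be illustrated with a short Python sketch. It is not taken from the authors' C++ code. It assumes that the distances $\operatorname{Dist}(T, P_i, t)$ were precomputed and stored per trajectory, and that the query's distances to the same pivots are already known; the function name `pivot_filter`, the ids and all numeric values are hypothetical.

```python
def pivot_filter(query_to_pivots, traj_to_pivots, r):
    """Return ids of trajectories that pass the pivot filter.

    A trajectory is kept only if, for every pivot, its stored distance
    differs from the query's distance to that pivot by at most r.
    """
    return [
        tid
        for tid, dists in traj_to_pivots.items()
        if all(abs(q - x) <= r for q, x in zip(query_to_pivots, dists))
    ]

# Hypothetical precomputed distances to h = 2 pivots.
traj_to_pivots = {
    "T1": [0.20, 0.70],
    "T2": [0.25, 0.90],
    "T3": [0.80, 0.72],
}
query_to_pivots = [0.22, 0.75]

candidates = pivot_filter(query_to_pivots, traj_to_pivots, r=0.1)
```

Here only `T1` survives: `T2` differs by 0.15 at the second pivot and `T3` by 0.58 at the first. The filter touches each stored distance once, so it costs $O(|\mathcal{T}| \cdot h)$ comparisons in this naive form, and the surviving candidates are then ranked with the actual similarity function. A larger $r$ keeps more candidates and moves the result closer to the exact answer at the price of more similarity computations.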
Exact, Tree and Pivot use the upper bounding technique (Section 4.3). Furthermore, we implemented the following algorithms from [6]:
• Gbase, the baseline algorithm in [6] (see Section 5).
• Gshq and Gshqt denote their heuristic algorithms based on shrinking the graph and the trajectories, which take advantage of the smaller graph size and reduced trajectory lengths (see [6]).

All of our implementations use the similarity measure defined in Definition 2. We implemented all algorithms in C++ using GNU CC Compiler 9.3.0 with the flag -O2. All experiments were conducted on a workstation with an AMD EPYC 7402P 24-Core Processor with 2.80 GHz and 256 GB of RAM running Ubuntu 18.04.3 LTS. The source code and data sets are available online at https://gitlab.com/tgpublic/topktraj .

[Table 1: Statistics and properties of the synthetic and real-world data sets. Columns: data set, $|V|$, $|E|$, ∅ trajectory length. Rows: Facebook1, Facebook2, Milan, T-Drive.]

For the evaluation of the algorithms, we used the following data sets:
• Facebook 1&2: The network consists of Facebook friendship relations [9] and is provided by the Stanford Network Analysis Project (https://snap.stanford.edu/data/ego-Facebook.html). We have generated synthetic trajectories.
• Milan: The Milan data set is based on GPS trajectories of private cars in the city of Milan (https://sobigdata.d4science.org/catalogue-sobigdata?path=/dataset/gps_track_milan_italy).
• T-Drive: The data set contains GPS data of taxi trajectories in Beijing [16, 17].

For the Milan and T-Drive data sets, we generated a graph by first interpreting each GPS location point as a vertex and then clustering these vertices using the $k$-means algorithm. The resulting clusters are the final vertices. Two clusters are connected by an edge if at least one trajectory visits a vertex in each of the two clusters in consecutive time intervals. We assign the distance between the centers of the clusters as the cost of the edge. Table 1 shows some statistics for the data sets.

6.3 Results
We answer questions Q1 to Q5.

Q1:
Table 2 shows the running times for indexing the data sets. For Tree and Pivot, we choose $h = 8$ pivots for both Facebook data sets and set $h = 64$ for the T-Drive data set. In the case of the Milan data set, we choose $h = 32$ for Tree and $h = 16$ for Pivot. The construction of our index structures is several orders of magnitude faster than that of the algorithms suggested in [6]. The largest speed-up is achieved for the T-Drive data set, for which Pivot is over 22 000 times faster. For Facebook2, Pivot is over 6 000 times faster. Out of all indexing approaches, as expected, Pivot is the fastest method for all data sets. Tree is the second fastest, with very large speed-ups compared to Gbase, Gshq and Gshqt. The low indexing times allow us to learn the parameter $h$, i.e., to find a suitable number of pivots.

Table 2: Indexing times in seconds.
Data set    Tree   Pivot   Gbase       Gshq     Gshqt
Facebook1   .14    0.06    188.23      32.77    25.
Facebook2   .80    0.47    3 132.22    485.26   324.
Milan       .26    0.07    232.13      43.66    34.
T-Drive     .09    0.49    11 823.81   475.06   403.

[Table 3: Threshold values $r$ used for the pivot-based filters during query time. Rows: Tree, Pivot. Columns: Facebook1, Facebook2, Milan, T-Drive.]

[Table 4: Query times in seconds for top-$k$ similarity queries. The running times are the averages and standard deviations over 100 queries. The fastest running time in each row is highlighted. Columns: data set, $k$, Exact, Tree, Pivot, Gbase, Gshq, Gshqt. Rows: Facebook1, Facebook2, Milan, T-Drive for each value of $k$.]

Q2:
We selected 100 trajectories randomly from the data sets as queries. The query interval is set to $I(Q)$. We ran the algorithms for four values of $k$, ranging from $k = 1$ to $k = 64$. Table 3 shows the threshold radii that we used for the pivot-based filtering, and Table 4 shows the average running times for querying a trajectory from the data set. First, note that the query times of our exact approach (Exact) are lower than the query times of Gbase for all data sets. For the Facebook2 and the T-Drive instances, Gbase is up to 0.2 seconds slower. We will see later that Gbase, in contrast to our exact approach, does not always find the optimal solution set. Both of our index structures lead to accelerated query times compared to the exact approach for all data sets but the Milan data set. The largest speed-up, a factor of about three to four, is achieved for Pivot on the Facebook instances. Note that the two heuristics Gshq and Gshqt are almost always much slower in answering the queries. For the Milan and T-Drive instances, they are even slower than our exact approach. The reason is that they often have large candidate sets, see Table 5. Tree is on par with Pivot and has, in most cases, only slightly higher running times despite the more complex data structure. The candidate set sizes of Tree and Pivot are similar for the Facebook data sets, see Table 5. For the Milan data set, Tree returns a smaller candidate set and has a slightly better running time for $k = 1$ and $k = 64$ compared to Pivot. The full potential of the Tree index does not come into play for the other data sets due to the temporal distribution of the trajectories and/or the limited size. We suspect that the high running time of Gbase for Facebook2 and $k = 1$ is an outlier and the result of the very high memory usage of the algorithm. Figure 2 shows the average running times for $k = 64$.

[Table 5: Average candidate set sizes and standard deviations over 100 queries. Columns: data set, Tree, Pivot, Gbase, Gshq, Gshqt. Rows: Facebook1, Facebook2, Milan, T-Drive.]

Q3:
In order to evaluate the quality of our query results, we use the similarity score ratio (SSR) defined in [6]. The SSR of two sets $\mathcal{T}_1$ and $\mathcal{T}_2$ of trajectories with respect to a query is defined as

$$\mathrm{SSR}(\mathcal{T}_1, \mathcal{T}_2, (Q, s)) = \frac{\sum_{T \in \mathcal{T}_1} \mathrm{Sim}(Q, T, s)}{\sum_{T \in \mathcal{T}_2} \mathrm{Sim}(Q, T, s)}.$$

We compare the results of the indices to the results of the exact algorithm Exact. Table 6 shows the average SSR values and the standard deviations over 100 queries.

First, we observe that, as expected (see Section 5), the baseline Gbase [6] does not always find the optimal solution set. Its SSR score takes values below one for the Milan and the T-Drive data sets, whereas for the optimal solution the SSR value should be one.

[Figure 2: The average running times for k = 64 over 100 queries on a logarithmic scale. y-axis: running time (s); series: Exact, Tree, Pivot, Gbase, Gshq, Gshqt.]

[Table 6: The SSR results are the averages and standard deviations over 100 queries. Rows: Facebook1, Facebook2, Milan, T-Drive for each value of k; columns Tree, Pivot, Gbase, Gshq, Gshqt; numeric entries not recoverable.]

[Table 7: Candidate set sizes |C|, running times in s, and SSR for varying number of pivot elements h and radii r for the T-Drive data set. We report the average and the standard deviation over 100 queries. Rows: Tree and Pivot, each with h ∈ {64, 128} and k ∈ {1, 64}; columns size, time, SSR for each of the two radii; numeric entries not recoverable.]

[Figure 3: Average speedup by using upper bounding during the calculation of the similarity over 100 queries. y-axis: speedup factor; data sets: Facebook 1, Facebook 2, Milan, T-Drive.]

With increasing k, the SSR value for our Tree algorithm decreases from 0.99 for k = 1 to 0.70 for k = 64. For our Pivot approach, the decrease is less pronounced; the SSR score is always above 0.91 for the instances Facebook1, Facebook2, and T-Drive. For the Milan instance, our heuristics do not behave very well for large k. Here, the SSR score for Tree takes a low value for k = 64 (see Table 6). The reason for this low value is the small value of r chosen in our experiments. However, a larger value of r would lead to an even higher running time compared to the exact computation, which is already faster. This is because the Milan trajectories are relatively short (see Table 1). For the Facebook2 instances, the SSR score of Pivot is always 0.99. The values of the Gshq and Gshqt heuristics for Facebook1, Facebook2, and T-Drive are always above 0.93 due to the use of large candidate sets (see Table 5). However, remember that their query times are much longer than those of Tree, Pivot, and even our exact computations.
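The SSR used in Q3 can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name `similarity_score_ratio` and the toy similarity values are hypothetical, the callable `sim` stands in for Sim(Q, ·, s) with the query Q and the parameter s already fixed, and we assume the index result is the numerator set and the result of Exact the denominator set, so that an optimal result yields a value of one.

```python
from typing import Callable, Hashable, Iterable


def similarity_score_ratio(
    result: Iterable[Hashable],
    reference: Iterable[Hashable],
    sim: Callable[[Hashable], float],
) -> float:
    """SSR of `result` w.r.t. `reference` for a fixed query.

    `sim(t)` is assumed to return Sim(Q, t, s) for the fixed query (Q, s).
    """
    numerator = sum(sim(t) for t in result)
    denominator = sum(sim(t) for t in reference)
    if denominator == 0:
        raise ValueError("reference set has total similarity zero")
    return numerator / denominator


# Toy similarities of four trajectories to one query (made-up values).
toy_sim = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.1}

exact_top2 = ["a", "b"]      # what an exact algorithm would return
heuristic_top2 = ["a", "c"]  # a heuristic that missed "b"

print(similarity_score_ratio(exact_top2, exact_top2, toy_sim.get))
print(similarity_score_ratio(heuristic_top2, exact_top2, toy_sim.get))
```

In this toy setting the exact result scores one against itself, while the heuristic result that misses a top-2 trajectory scores below one, which is the behavior described above for Gbase on Milan and T-Drive.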
Q4:
By increasing the number of pivot elements h, a larger number of trajectories may be excluded from the candidate set, since every pivot adds an additional filter. However, each additional pivot might lead to a higher number of false negatives, i.e., trajectories that are not part of the candidate set but are part of the optimal top-k set. For the T-Drive data set, we ran Tree and Pivot with h ∈ {64, 128} and two values of the radius r (see Table 7). For building the index, Tree took 1.41 seconds and Pivot took 0.85 seconds. Table 7 shows the effect on query times and quality. We compare these results to Table 6 and Table 4 (there, the values of r were chosen as listed in Table 3).

Lowering h and increasing r each increase the size of the candidate set. A larger candidate set may lead to better SSR values; however, it also increases the running time. Notice that for k = 1 we can achieve faster running times with a high SSR value by choosing a small radius r. For k = 64, Pivot improves its SSR value compared to Table 6 by choosing h = 128 pivots and radius r = 0.4, while being faster than Exact.

[Figure 4: Distribution of similarities (one panel per data set, e.g., Facebook 1). The x-axis ranges from 0 to 1. The y-axis shows the fraction of input-query pairs that have this similarity on a logarithmic scale. The red line highlights the 10th percentile for Facebook1 and the 1st percentile for the other data sets. All the similarities of trajectories found by our queries lie to the right of the displayed red lines.]
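The trade-off between h and r discussed in Q4 can be made concrete with a generic pivot filter. The sketch below is not the index of this paper: it assumes a distance `dist` that satisfies the ordinary triangle inequality (the paper only establishes a specific variant), uses one-dimensional points as stand-ins for trajectories, and all function names are hypothetical. A trajectory T is kept only if |dist(Q, p) − dist(T, p)| ≤ r for every pivot p; since this difference is a lower bound on dist(Q, T), every discarded trajectory has distance greater than r to the query.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def build_pivot_table(
    data: Sequence[T], pivots: Sequence[T], dist: Callable[[T, T], float]
) -> List[List[float]]:
    """Precompute dist(t, p) for every data item t and pivot p (the index)."""
    return [[dist(t, p) for p in pivots] for t in data]


def candidate_set(
    query: T,
    pivots: Sequence[T],
    table: List[List[float]],
    dist: Callable[[T, T], float],
    r: float,
) -> List[int]:
    """Indices of items that pass the filter of every pivot for radius r."""
    q_to_p = [dist(query, p) for p in pivots]
    return [
        idx
        for idx, row in enumerate(table)
        if all(abs(qp - tp) <= r for qp, tp in zip(q_to_p, row))
    ]


def top_k(
    query: T, data: Sequence[T], cand: List[int],
    dist: Callable[[T, T], float], k: int
) -> List[int]:
    """Rank only the candidates by their true distance to the query."""
    return sorted(cand, key=lambda i: dist(query, data[i]))[:k]


# Toy data: points on a line, absolute difference as the distance.
def line_dist(a: float, b: float) -> float:
    return abs(a - b)


data = [0.0, 0.1, 0.35, 0.5, 0.9]
pivots = [0.0, 1.0]
table = build_pivot_table(data, pivots, line_dist)
query = 0.12

wide = candidate_set(query, pivots, table, line_dist, r=0.25)
narrow = candidate_set(query, pivots, table, line_dist, r=0.05)
print(wide, top_k(query, data, wide, line_dist, k=2))
print(narrow, top_k(query, data, narrow, line_dist, k=2))
```

With the larger radius the candidate set contains both true top-2 items; with the smaller radius it shrinks to a single item, so a top-2 query necessarily has a false negative. This mirrors the observation that a small r is fast but can lower the SSR for larger k.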
Q5:
To evaluate the speedup gained by the upper bounding technique, we computed the similarity for 100 queries for k = 2^i with 0 ≤ i ≤ 6. For each k, we computed the top-k results without indexing, with and without the upper bounding. Figure 3 shows the speedup that is achieved by using the upper bounding technique. The T-Drive and Milan data sets profit immensely, with speedup factors ranging from over 4 to 14 and from 2 to 13, respectively. The speedups decrease with increasing k. The reason is that there are often only a few trajectories with very high similarity. If the algorithm finds these early on during the processing of the query and if the value of k is small, then the upper bounding is most effective. For larger k, the lowest of the top-k similarities is closer to the non-top-k similarities, and upper bounding, i.e., stopping the computation early, happens less often. There is no speedup in the case of the Facebook data sets. The reason is that the differences in the similarities between the query and the trajectories are small (see Figure 4). Moreover, due to the long trajectories (see Table 1), the upper bounds have to be updated often, such that the upper bounding in total cannot speed up the query.
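The effect described in Q5 can be reproduced with a small early-termination sketch. It is an illustration under assumptions, not the similarity computation of this paper: we assume the similarity of a trajectory is a sum of per-step contributions, each in [0, cap], so that the partial sum plus cap times the number of remaining steps is an upper bound. The names `bounded_similarity` and `top_k_with_upper_bound` and the toy contributions are hypothetical.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple


def bounded_similarity(
    contribs: Sequence[float], threshold: float, cap: float = 1.0
) -> Optional[float]:
    """Sum the contributions, but stop as soon as the upper bound
    (partial sum + cap * remaining steps) cannot exceed `threshold`.
    Returns None if the computation was cut off."""
    total = 0.0
    remaining = len(contribs)
    if cap * remaining <= threshold:
        return None
    for c in contribs:
        total += c
        remaining -= 1
        if total + cap * remaining <= threshold:
            return None
    return total


def top_k_with_upper_bound(
    candidates: Dict[str, Sequence[float]], k: int, cap: float = 1.0
) -> Tuple[List[Tuple[float, str]], int]:
    """Return the k best (similarity, id) pairs and the number of
    trajectories whose evaluation was stopped by the bound.
    Ties with the current k-th best are pruned."""
    heap: List[Tuple[float, str]] = []  # min-heap holding the current top-k
    pruned = 0
    for tid, contribs in candidates.items():
        # Until k results are known, nothing can be pruned.
        threshold = heap[0][0] if len(heap) == k else float("-inf")
        s = bounded_similarity(contribs, threshold, cap)
        if s is None:
            pruned += 1
        elif len(heap) < k:
            heapq.heappush(heap, (s, tid))
        else:
            heapq.heapreplace(heap, (s, tid))
    return sorted(heap, reverse=True), pruned


# Toy per-step contributions (made-up values, exactly representable floats).
toy = {
    "a": [1.0, 0.75, 0.75],
    "b": [0.25, 0.0, 0.25],
    "c": [1.0, 1.0, 0.25],
    "d": [0.0, 0.25, 0.0],
}
print(top_k_with_upper_bound(toy, k=1))
print(top_k_with_upper_bound(toy, k=2))
```

In this toy run the bound stops three evaluations for k = 1 but only one for k = 2, because the k-th best similarity, which serves as the pruning threshold, is lower for larger k. This is the same mechanism that the text gives for the decreasing speedups.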
We studied computing the top-k most similar trajectories in a graph to a given query trajectory. For this, we proposed a new spatio-temporal similarity measure based on the work of Grossi et al. [6]. We derived a distance function from our new similarity function, which satisfies a triangle inequality under certain conditions. This forms the basis for our pivot-based filtering technique, which accelerates finding exact solutions of top-k trajectory queries. Furthermore, we suggested a tree-based temporal filtering method in combination with the pivot-based technique. Both approaches strongly outperform the baselines for all data sets except the Milan data set. Here, our new baseline algorithm that uses the upper bounding technique has the lowest running time. It is also the first exact algorithm for the top-k trajectory problem, as we showed that the baseline in [6] does not always find the exact solution.

Acknowledgments

This work is funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy – EXC-2047/1 – 390685813.
References

[1] R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. In Foundations of Data Organization and Algorithms, pages 69–84. Springer Berlin Heidelberg, 1993.
[2] L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In […], SIGMOD '05, pages 491–502. ACM, 2005.
[3] L. Chen, S. Shang, B. Yao, and K. Zheng. Spatio-temporal top-k term search over sliding window. World Wide Web, 22(5):1953–1970, 2019.
[4] Z. Chen, H. T. Shen, X. Zhou, Y. Zheng, and X. Xie. Searching trajectories by locations: An efficiency study. In […], pages 255–266. ACM, 2010.
[5] A. Driemel, I. Psarros, and M. Schmidt. Sublinear data structures for short Fréchet queries. CoRR, abs/1907.04420, 2019.
[6] R. Grossi, A. Marino, and S. Moghtasedi. Finding structurally and temporally similar trajectories in graphs. In […], volume 160 of LIPIcs, pages 24:1–24:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
[7] J.-R. Hwang, H.-Y. Kang, and K.-J. Li. Searching for similar trajectories on road networks using spatio-temporal similarity. In Advances in Databases and Information Systems, pages 282–295. Springer Berlin Heidelberg, 2006.
[8] P. Indyk. Approximate nearest neighbor algorithms for Fréchet distance via product metrics. In […], pages 102–106, 2002.
[9] J. J. McAuley and J. Leskovec. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, pages 548–556, 2012.
[10] S. Shang, L. Chen, Z. Wei, C. S. Jensen, K. Zheng, and P. Kalnis. Parallel trajectory similarity joins in spatial networks. VLDB J., 27(3):395–420, 2018.
[11] S. Shang, R. Ding, K. Zheng, C. S. Jensen, P. Kalnis, and X. Zhou. Personalized trajectory matching in spatial networks. VLDB J., 23(3):449–468, 2014.
[12] H. Su, S. Liu, B. Zheng, X. Zhou, and K. Zheng. A survey of trajectory distance measures and performance evaluation. VLDB J., 29(1):3–32, 2020.
[13] E. Tiakas, A. N. Papadopoulos, A. Nanopoulos, Y. Manolopoulos, D. Stojanovic, and S. Djordjevic-Kajan. Trajectory similarity search in spatial networks. In […], pages 185–192. IEEE, 2006.
[14] E. Tiakas and D. Rafailidis. Scalable trajectory similarity search based on locations in spatial networks. In Model and Data Engineering - 5th Intl. Conf., MEDI, volume 9344 of LNCS, pages 213–224. Springer, 2015.
[15] Y. Xia, G.-Y. Wang, X. Zhang, G.-B. Kim, and H.-Y. Bae. Spatio-temporal similarity measure for network constrained trajectory data. Intl. J. Computational Intelligence Systems, 4:1070–1079, 2011.
[16] J. Yuan, Y. Zheng, X. Xie, and G. Sun. Driving with knowledge from the physical world. In […], pages 316–324. ACM, 2011.
[17] J. Yuan, Y. Zheng, C. Zhang, W. Xie, X. Xie, G. Sun, and Y. Huang. T-drive: Driving directions based on taxi trajectories. In 18th ACM SIGSPATIAL Intl. Symp. Advances in Geographic Information Systems