REPOSE: Distributed Top-k Trajectory Similarity Search with Local Reference Point Tries

Bolong Zheng, Lianggui Weng, Xi Zhao, Kai Zeng, Xiaofang Zhou, Christian S. Jensen

Huazhong University of Science and Technology, Wuhan, China. Email: {bolongzheng, liangguiweng, zhaoxi}@hust.edu.cn
Alibaba Group, Hangzhou, China. Email: [email protected]
The University of Queensland, Brisbane, Australia. Email: [email protected]
Aalborg University, Aalborg, Denmark. Email: [email protected]
Abstract—Trajectory similarity computation is a fundamental component in a variety of real-world applications, such as ridesharing, road planning, and transportation optimization. Recent advances in mobile devices have enabled an unprecedented increase in the amount of available trajectory data, such that efficient query processing can no longer be supported by a single machine. As a result, means of performing distributed in-memory trajectory similarity search are called for. However, existing distributed proposals either suffer from computing resource waste or are unable to support the range of similarity measures that are being used. We propose a distributed in-memory management framework called REPOSE for processing top-k trajectory similarity queries on Spark. We develop a reference point trie (RP-Trie) index to organize trajectory data for local search. In addition, we design a novel heterogeneous global partitioning strategy to eliminate load imbalance in distributed settings. We report on extensive experiments with real-world data that offer insight into the performance of the solution and show that the solution is capable of outperforming the state-of-the-art proposals.

Index Terms—trajectory similarity, top-k query, distributed

I. INTRODUCTION
With the widespread diffusion of GPS devices (such as smart phones), massive amounts of data describing the motion histories of moving objects, known as trajectories, are being generated and managed in order to serve a wide range of applications, e.g., travel time prediction [17], [18], taxi dispatching, and path planning [30], [36]. For example, Didi has released an open dataset that includes 750 million GPS points in Xi'an with a sampling rate of 2–4 seconds over the span of one month.

Top-k trajectory similarity search, which finds the k trajectories that are most similar to a query trajectory, is a basic operation in offline analytics applications. In the context of massive trajectory data, it is non-trivial to enable efficient top-k trajectory similarity search. Many existing studies [3], [4], [11]–[14], [16], [22], [25], [26] focus on optimizing query processing on a single machine. However, if the data cardinality exceeds the storage or processing capacity of a single machine, these methods do not extend directly to a distributed environment. Instead, a distributed algorithm is called for that is able to exploit the resources of multiple machines. DFT [28] and DITA [19] are state-of-the-art distributed trajectory similarity search frameworks. They include global partitioning methods that place trajectories with similar properties in the same partition, and they use a global index to prune irrelevant partitions. Then, they merge the results of local searches on the surviving partitions. Finally, they return a top-k result. However, these methods have two shortcomings that limit their use in practice.

(1) Computing resource waste. DITA and DFT aim to guarantee load balancing by means of their global partitioning strategies. In particular, DITA places trajectories with close first and last points in the same partition. DFT places trajectory segments with close centroids in the same partition.
However, only surviving partitions are employed on the distributed nodes, while compute nodes with no surviving partitions remain idle and are not utilized. This has an adverse effect on computing resource utilization.

(2) Limited support for similarity measures. We observe that DFT supports Hausdorff, Frechet [1], and DTW [31], but does not support LCSS [25], EDR [8], and ERP [7]. Further, DITA supports Frechet, DTW, EDR, and LCSS, but does not support Hausdorff and ERP. In order to support diverse application scenarios, it is important to be able to accommodate a wide range of similarity measures in a single system.

We propose an efficient distributed in-memory management framework, REPOSE, for top-k trajectory similarity query processing that supports a wide range of similarity measures. To eliminate poor computing resource utilization, REPOSE includes a novel heterogeneous global partitioning strategy that places similar trajectories in different partitions. Therefore, partitions have similar composition structure, with the effect that most partitions and compute nodes are likely to contribute to a query result, which improves computing resource utilization significantly and enables load balancing.

In addition, we propose a reference point trie (RP-Trie) index to organize the trajectories in each partition. During index construction, we convert each trajectory into a reference trajectory by adopting a Z-order [9] that preserves sketch information. Then, we build an RP-Trie based on this representation, which reduces the space consumption. During query processing, we traverse the local RP-Trie in a best-first manner. We first develop a one-side lower bound on internal nodes for pruning. Moreover, we devise a two-side lower bound on leaf nodes for further improvement. Finally, we design a pivot-based pruning strategy for metrics.

TABLE I
SUMMARY OF NOTATIONS

Notation      | Definition
D             | Trajectory dataset
τ             | A trajectory
τ*            | A reference trajectory
D_H(τ1, τ2)   | Hausdorff distance between τ1 and τ2
A             | A square region that encloses all trajectories
g             | A grid cell
δ             | The grid side length
LB_o          | One-side lower bound
LB_t          | Two-side lower bound
LB_p          | Pivot-based lower bound
N_p           | The number of pivot trajectories

In summary, we make the following contributions:

• We propose a distributed in-memory framework, REPOSE, for the processing of top-k trajectory similarity queries on Spark. The framework supports multiple trajectory similarity measures, including the Hausdorff, Frechet, DTW, LCSS, EDR, and ERP distances.
• We discretize trajectories into reference trajectories and develop a reference point trie to organize the reference trajectories. Several optimization techniques are proposed to accelerate query processing.
• We design a novel heterogeneous global partitioning strategy to achieve load balancing and to accelerate query processing by balancing the composition of partitions.
• We compare the performance of
REPOSE with the state-of-the-art distributed trajectory similarity search frameworks using real datasets. The experimental results indicate that REPOSE is able to outperform the existing frameworks.

The rest of the paper is organized as follows. Section II states the problem addressed. Section III discusses how to discretize trajectories into reference trajectories and build the RP-Trie. Section IV introduces query processing and several optimization techniques. Section V introduces the global partitioning strategy. Section VI describes how to extend our algorithm to other similarity measures. Section VII presents the results of our experimental study. Section VIII reviews related work. Finally, Section IX concludes the paper.

II. PROBLEM DEFINITION

We proceed to present the problem definition. Frequently used notation is summarized in Table I.
TABLE II
POINT COORDINATES OF TRAJECTORIES

[The table lists the sample-point coordinates of the trajectories τ1–τ5 and the query trajectory τq: τ1, τ2, and τ4 have four points each, τ3 and τ5 have five points each, and τq has three points.]

Fig. 1. A Running Example (the trajectories of Table II on an 8 × 8 grid whose rows and columns are labeled with the 3-bit codes 000–111)
Definition 1 (Trajectory). A trajectory τ is a finite, time-ordered sequence τ = ⟨p1, p2, ..., pn⟩, where each pi ∈ τ is a sample point with a longitude and a latitude.

A range of distance functions have been used for quantifying the similarity between trajectories, including the Hausdorff, Frechet, DTW, LCSS, EDR, and ERP distances. To ease the presentation, we initially focus on the Hausdorff distance and extend the coverage to include the other distances later.
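Before the formal definition, note that the Hausdorff distance reduces to a nested max–min over point pairs. The following Python sketch (helper names are ours; Euclidean point distance is assumed) mirrors the structure of Eq. (1):

```python
import math

def hausdorff(t1, t2):
    """Symmetric Hausdorff distance between two point sequences."""
    def d(p, q):
        # Euclidean distance between two (x, y) points
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def directed(a, b):
        # for each point of a, the distance to its nearest point of b; take the max
        return max(min(d(p, q) for q in b) for p in a)

    return max(directed(t1, t2), directed(t2, t1))
```

Note the asymmetry of the two directed terms: hausdorff([(0, 0)], [(0, 0), (0, 3)]) is 3.0, because the point (0, 3) has no nearby match in the first sequence.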
Definition 2 (Trajectory Distance). Given two trajectories τ1 = ⟨q1, q2, ..., qm⟩ and τ2 = ⟨p1, p2, ..., pn⟩, the Hausdorff distance between τ1 and τ2 is computed as follows:

D_H(τ1, τ2) = max{ max_{qi∈τ1} min_{pj∈τ2} d(qi, pj), max_{pj∈τ2} min_{qi∈τ1} d(qi, pj) },   (1)

where d(qi, pj) is the Euclidean distance.

Definition 3 (Trajectory Similarity Search). Given a set of trajectories D = {τ1, τ2, ..., τN}, a query trajectory τq, a distance function Dist(·), and an integer k, the top-k trajectory similarity search problem reports a set R of k trajectories such that ∀τ ∈ R and ∀τ′ ∈ D − R, we have Dist(τq, τ) < Dist(τq, τ′).

Example 1.
Given a query trajectory τq and the dataset D = {τ1, τ2, τ3, τ4, τ5} in Table II, we process a top-2 query. Computing the Hausdorff distance between τq and every trajectory in D, τ1 and τ4 have the two smallest distances. Therefore, the top-2 result is {τ1, τ4}.

III. REFERENCE POINT TRIE
Before describing the distributed framework, we proceed to introduce a reference point trie (RP-Trie) index for local search. First, we explain how to convert trajectories into reference trajectories. Second, we show how to build an RP-Trie on the reference trajectories. Finally, we propose an order-independence optimization to improve the performance.
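The first step, converting trajectories into reference trajectories, relies on Z-order (Morton) encoding of grid cells. A minimal Python sketch (helper names are ours; the bit order of the interleaving is an assumption, since the paper only requires that each cell get a unique z-value):

```python
def z_value(col, row, bits):
    """Interleave the bits of a cell's column and row index (Morton code)."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i + 1)  # column bits in odd positions
        z |= ((row >> i) & 1) << (2 * i)      # row bits in even positions
    return z

def cell_of(p, origin, delta, bits):
    """Z-value of the grid cell (side length delta) containing point p."""
    col = int((p[0] - origin[0]) // delta)
    row = int((p[1] - origin[1]) // delta)
    return z_value(col, row, bits)
```

With bits = 3 this yields the 6-bit codes of an 8 × 8 grid such as the one in Fig. 1; the reference point of a cell is its center, which can be recovered from the cell indices.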
A. Discretizing Trajectories with Z-order
Inspired by signature-based trajectory representation [23], we adopt the Z-order [9] to map trajectories from their native two-dimensional space to a one-dimensional space. Let A be a square region with side length U that encloses all trajectories. We partition A by means of a regular l × l grid with cell side length δ, where l = U/δ is a power of 2. Each cell g has a unique z-value and a reference point (the center point of g). Example 2.
In Fig. 1, we show an 8 × 8 grid over the region A that encloses the trajectories of Table II.

Definition 4 (Reference Trajectory). We convert a trajectory τ = ⟨p1, p2, ..., pn⟩ into a reference trajectory τ* = ⟨p*1, p*2, ..., p*n⟩, where p*i is the reference point of the cell gi that pi belongs to. Note that τ* corresponds to a sequence of z-values Z = ⟨z1, z2, ..., zn⟩, where zi is the z-value of gi.

Note that the grid granularity affects the fidelity of a reference trajectory. A small δ ensures a high fidelity.

B. Building an RP-Trie Index
We proceed to cover how to build an RP-Trie on a set of reference trajectories, which is similar to building a classical trie index. We thus use the z-value as the value of a trie node and insert all reference trajectories into the trie. The difference is that if one reference trajectory is a prefix of another reference trajectory, we append the character $ at its end, which guarantees that every reference trajectory ends at a leaf node.

The structure of an RP-Trie is shown in Fig. 2. In internal nodes, the Label attribute is the z-value of the node, which corresponds to the coordinates of the reference point, and Ptr is a pointer array pointing to the child nodes. In leaf nodes, the Label attribute has the same meaning as in internal nodes. Since each reference trajectory ends at a leaf node and may represent multiple trajectories, we use Tid to record the trajectory ids.
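The construction just described, including the $ terminator for prefix trajectories, can be sketched as follows (a simplified Python illustration; the HR and D_max attributes of Fig. 2 are omitted):

```python
class RPTrieNode:
    def __init__(self, label):
        self.label = label   # z-value of the node ('$' for terminator nodes)
        self.children = {}   # z-value -> child RPTrieNode
        self.tids = []       # trajectory ids (populated at leaf nodes only)

def insert(root, zvalues, tid):
    """Insert one reference trajectory (a z-value sequence) into the trie.

    Appending a '$' terminator guarantees that every reference trajectory,
    including one that is a prefix of another, ends at a leaf node.
    """
    node = root
    for z in list(zvalues) + ["$"]:
        node = node.children.setdefault(z, RPTrieNode(z))
    node.tids.append(tid)
```

Inserting ⟨1, 2, 3⟩ and then its prefix ⟨1, 2⟩ yields two distinct leaves, one under the node for z-value 2 and one under the node for z-value 3.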
Pivot trajectory.
For similarity metrics such as Hausdorff, Frechet, and ERP, the triangle inequality can be used for pruning. To enable this, we use pivot trajectories to estimate the lower bound distance between a query trajectory and all trajectories in a subtree rooted at a node. Specifically, we select N_p trajectories as global pivot trajectories. For each node, let T_sub be the set of reference trajectories covered by it. We then keep an N_p-dimensional array HR, where HR[i] stores a tuple (min, max) that represents the minimum and maximum distances between all reference trajectories in T_sub and the i-th pivot trajectory.

Fig. 2. The Structure of RP-Trie (internal nodes store Label, Ptr, and HR; leaf nodes store Label, Tid, HR, and D_max)

In particular, the selection of pivot trajectories has a substantial effect on the pruning performance. Thus, existing proposals consider how to find optimal pivots [5], [20]. In general, we aim to obtain a pivot set such that any two pivot trajectories in it are as distant as possible. With this in mind, we adopt a practical but effective method [21]. We uniformly and randomly sample m groups of N_p trajectories. In each group, we compute the distances between any two trajectories and let the sum of all distances be the score of the group. Finally, we choose the N_p trajectories in the group with the largest score as the set of pivot trajectories.

In addition, we store a value D_max for each leaf node, which is the maximum distance between the node's reference trajectory τ* and the trajectories in the leaf node.

Cost analysis.
Assume that we have N reference trajectories with maximum length L. As the RP-Trie has at most N · L nodes, the space cost is O(N · L · N_p). To construct an RP-Trie, the computation of the distances between the pivots and all trajectories dominates all other costs and takes O(N · L · N_p). Since both L and N_p are small (N_p = 5 is used in our experiments), both the space and time costs are affordable.

Succinct trie structure.
We observe that the upper levels of an RP-Trie consist of few nodes that are accessed frequently, while the lower levels comprise the majority of the nodes, which are accessed relatively infrequently due to pruning. Inspired by SuRF [37], we introduce a fast search structure for the upper levels of an RP-Trie and a space-efficient structure for the lower levels. Specifically, we optimize the RP-Trie by switching between bitmaps and byte arrays at different layers. For each upper-level node, we use two bitmaps B_c and B_l, each with as many bits as there are grid cells, to separately record the values and the states of the child nodes. If a cell is a child node of the node, we set the corresponding bit in B_c to 1. If a child node is not a leaf node, we set the corresponding bit in B_l to 1. The B_c and B_l bitmaps of all nodes are then concatenated separately in breadth-first order such that we can quickly access any upper-level node. For each lower-level node, we serialize the structure as a byte sequence. Given that the trie becomes sparse in these levels, it is space-efficient to use byte sequences.

Fig. 3. Optimized Reference Point Trie (left: an unoptimized RP-Trie; right: the optimized RP-Trie after z-value re-arrangement)
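A minimal sketch of the per-node bitmaps of the upper levels (Python, for illustration only; the breadth-first concatenation of all nodes' bitmaps is omitted). B_c marks which z-values occur as children and supports child lookup via a rank (popcount) query; B_l marks which children are internal nodes:

```python
class UpperNode:
    """Bitmap encoding of one upper-level RP-Trie node (a sketch)."""

    def __init__(self, num_cells):
        self.num_cells = num_cells  # bitmap width: one bit per grid cell
        self.b_c = 0                # bit z set: z-value z is a child
        self.b_l = 0                # bit z set: that child is NOT a leaf

    def add_child(self, z, is_leaf):
        self.b_c |= 1 << z
        if not is_leaf:
            self.b_l |= 1 << z

    def has_child(self, z):
        return (self.b_c >> z) & 1 == 1

    def child_rank(self, z):
        # index of child z in a dense, z-ordered child array:
        # the number of set bits in b_c below position z (a rank query)
        return bin(self.b_c & ((1 << z) - 1)).count("1")
```

Child existence and child position are thus answered by bit tests and popcounts rather than by pointer chasing, which is what makes the upper levels fast to search.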
C. Optimization by Z-value Re-arrangement
Given τq and a reference trajectory τ*, if we interchange the positions of any two points in τ* to generate a new reference trajectory τ*′, we have D_H(τq, τ*) = D_H(τq, τ*′). Therefore, unlike the other distance measures, Hausdorff is order independent. Hence, we propose an optimization that reduces the size of an RP-Trie by rearranging the reference trajectories to achieve longer common prefixes. Specifically, we simplify τ* in two steps: (1) z-value deduplication: we keep only one point when two or more points in τ* have identical z-values; (2) z-value re-ordering: we re-order the points in τ*. For example, assume that an RP-Trie is constructed from trajectories τ1 and τ2, as shown to the left in Fig. 3. If the z-values 000100 and 001000 of τ2 are swapped, the total number of nodes decreases, as shown to the right in Fig. 3.

Thus, we aim to build an optimized RP-Trie with fewer nodes by means of z-value re-arrangement. Let Z be the set of z-values of a reference trajectory, and let Z = {Z1, Z2, ..., ZN} be the collection of the z-value sets of all trajectories. In order to find the optimized RP-Trie structure, we want each level of the trie to have the minimum number of nodes. In other words, we need to find the smallest cell set G_opt = {g1, g2, ..., gr} such that every Zi in Z contains at least one cell in G_opt. This way, we regard g1, g2, ..., gr as child nodes of the root and partition Z into classes C1, C2, ..., Cr. Using this method to recursively partition each class Ci, we obtain an optimized RP-Trie.

However, it is difficult to find the set G_opt for Z. Before explaining how to solve this problem, we introduce a well-known NP-hard problem: the hitting set problem.

Definition 5 (HS(Z, b)). Given a collection of sets Z = {Z1, Z2, ..., ZN} and a budget b, the hitting set HS(Z, b) returns a set G that satisfies |G| ≤ b ∧ G ∩ Zi ≠ ∅ for every Zi in Z. If no such set G exists, HS(Z, b) returns an empty set.

Set G is called a hitting set of Z. Usually, a collection of sets Z has more than one hitting set. It is easy to find a hitting set when disregarding the budget b. However, it is costly to obtain a hitting set within a budget b.

Theorem 1. Finding an optimized RP-Trie is NP-hard.

Proof. To find an optimized RP-Trie for our reference trajectory collection Z, we find the smallest cell set G_opt with size |G_opt| = r such that every Zi in Z contains at least one cell in G_opt. Obviously, G_opt is the hitting set of Z with the minimum size. So, the result of HS(Z, r) is G_opt, and the result of HS(Z, r − 1) is an empty set. Therefore, finding G_opt is reduced to solving HS(Z, r) and HS(Z, r − 1). This completes the proof.

We employ a greedy algorithm to solve this problem. We make the most frequent z-value z1 in Z the first child node of the root. Then we put every Zi containing z1 into the subtree of z1 and remove them from Z. Next, we make the currently most frequent z-value z2 in Z the second child node, place every Zi containing z2 in its subtree, and remove them from Z. Repeating the above process until Z is empty, we obtain the division structure of the current level of the trie. We recursively repeat the process for the remaining levels to build an optimized RP-Trie. Assume that we have N reference trajectories and M cells. The computation cost of the greedy algorithm is O(N · M), where M is usually a small constant. The details can be found in the accompanying technical report [38].

IV. QUERY PROCESSING
We proceed to present the top-k query processing as well as several query optimization techniques. For ease of presentation, we mainly discuss the details of the Hausdorff distance. Extensions that support other distance measures, such as Frechet and DTW, can be found in Section VI. First, we introduce our basic algorithm for performing a top-k query, where the nodes in the RP-Trie are traversed in a best-first manner. We develop a pruning condition called the one-side lower bound to filter out unqualified internal nodes. We then provide three optimization techniques. The first technique reduces the computation cost at a node by using intermediate computation results. The second technique provides a tight pruning condition called the two-side lower bound for pruning at leaf nodes. The third technique utilizes the metric property for pivot-based pruning.

A. Search Procedure
To answer a top-k query, the nodes in the RP-Trie are traversed in ascending order of a lower bound, called the one-side lower bound LB_o, which is used to prune unqualified internal nodes. Let d_k be the k-th smallest distance found so far. An internal node is pruned when its LB_o is greater than d_k.

We briefly introduce the procedure as follows:
1) We construct a priority queue E that keeps the nodes in ascending order of their lower bounds. We initially insert the root node into E. A heap minHeap stores the trajectories with distance to τq no larger than d_k.
2) The head element t is popped from E. If the lower bound of t is smaller than d_k, we go to Step 3). Otherwise, we stop the procedure and return minHeap.
3) If node t is a leaf node, we compute the distances between τq and the trajectories recorded by Tid, and then we update d_k and minHeap. Otherwise, we compute LB_o for each child node of t and insert it into E.
4) We repeat Steps 2) and 3) while E is not empty. Otherwise, we stop the procedure and return minHeap.

We introduce how to compute the one-side lower bound LB_o. For simplicity, we also call a sub reference trajectory that starts at the root node and ends at an internal node a reference trajectory. Next, we give a formal definition of the one-side lower bound.

Definition 6 (One-Side Lower Bound). Given the query trajectory τq = ⟨q1, q2, ..., qm⟩ and a node whose reference trajectory is τ* = ⟨p*1, p*2, ..., p*n⟩, the one-side lower bound of the node is defined as follows:

LB_o(τq, τ*) = max{ max_{p*j∈τ*} min_{qi∈τq} d(qi, p*j) − (√2/2)δ, 0 }   (2)

The following two lemmas ensure that the results returned by the above procedure are the correct top-k query results for τq.

Lemma 1.
The LB_o of a leaf node is smaller than the minimum distance between τq and any trajectory stored in the node.

Proof. Given τq = ⟨q1, q2, ..., qm⟩, a node's reference trajectory τ* = ⟨p*1, p*2, ..., p*n⟩, and any trajectory τ = ⟨p1, p2, ..., pn′⟩ stored in the node, we prove that D_H(τq, τ) ≥ LB_o(τq, τ*). Assume that pi falls into the cell represented by p*k (we write pi ∈ p*k). Since the distance from pi to the cell center p*k is at most half the cell diagonal, the triangle inequality gives d(qj, pi) ≥ max{ d(qj, p*k) − (√2/2)δ, 0 }. Then we have

D_H(τq, τ) = max{ max_{qj∈τq} min_{pi∈τ} d(pi, qj), max_{pi∈τ} min_{qj∈τq} d(pi, qj) }
  ≥ max_{pi∈τ} min_{qj∈τq} d(pi, qj)
  ≥ max_{pi∈τ, pi∈p*k} min_{qj∈τq} max{ d(qj, p*k) − (√2/2)δ, 0 }
  = max_{p*k∈τ*} min_{qj∈τq} max{ d(qj, p*k) − (√2/2)δ, 0 }
  = max{ max_{p*k∈τ*} min_{qj∈τq} d(qj, p*k) − (√2/2)δ, 0 }
  = LB_o(τq, τ*)

Lemma 1 indicates that if a node is a leaf node, LB_o is a lower bound on the distance between τq and the trajectories recorded by Tid. Thus, the trajectories recorded by the node can be safely pruned when LB_o ≥ d_k. If the node is not a leaf node, Lemma 1 does not apply, which leads to the following Lemma 2. Lemma 2 shows that the LB_o of a node is at least that of its parent node. Hence, if LB_o is larger than d_k, no top-k results exist in the subtree of the current node, and we can safely prune it.

Lemma 2.
For an internal node, the LB_o between a child node and τq is greater than or equal to that between the node and τq.

Proof. Let τ*p = ⟨p*1, ..., p*_{n−1}⟩ and τ* = ⟨p*1, ..., p*n⟩ be the reference trajectories of a node and its child node, respectively. Given τq, we prove that LB_o(τq, τ*) ≥ LB_o(τq, τ*p):

LB_o(τq, τ*) = max_{p*k∈τ*} min_{qj∈τq} max{ d(qj, p*k) − (√2/2)δ, 0 }
  = max{ max_{p*k∈τ*p} min_{qj∈τq} max{ d(qj, p*k) − (√2/2)δ, 0 }, min_{qj∈τq} max{ d(qj, p*n) − (√2/2)δ, 0 } }
  ≥ max_{p*k∈τ*p} min_{qj∈τq} max{ d(qj, p*k) − (√2/2)δ, 0 }
  = LB_o(τq, τ*p)

Note that Lemmas 1 and 2 also hold for Frechet and DTW with a modification of the computation of LB_o. The details can be found in Section VI.

B. Pruning using Two-Side Lower Bound
The one-side lower bound is designed for pruning the subtrees of internal nodes. For an internal node, we only have the prefix reference trajectory from the root node to this internal node. In contrast, for a leaf node, we have the complete reference trajectories of the trajectories stored in this node. Therefore, a tighter lower bound, called the two-side lower bound, is provided to further improve the query efficiency. Based on the reference trajectory, the two-side lower bound is computed as follows:

Definition 7 (Two-Side Lower Bound). Given τq and a leaf node whose reference trajectory is τ*, the two-side lower bound LB_t is computed as follows:

LB_t(τq, τ*) = max{ D_H(τq, τ*) − D_max, 0 },   (3)

where D_max is the maximum distance from any trajectory in the current node to the reference trajectory.

According to the triangle inequality, it is easy to prove that LB_t is a lower bound on the distances of all the trajectories stored in the current node. In other words, LB_t is smaller than the distance between τq and any trajectory stored in the leaf node. The focuses of LB_o and LB_t are different. LB_o is able to prune all nodes of a subtree, but its value is loose compared with LB_t. LB_t is designed for pruning trajectories in a leaf node, and its value is close to the exact distance. Next, we introduce how to reduce the computation overhead of both LB_o and LB_t by using intermediate computation results.

Algorithm 1: CompLB(τq, p*, r, c_max)
Input: the query τq, the newly added reference point p*, an array r of size |τq|, and c_max
Output: LB_o, LB_t, the updated r and c_max
1:  i ← 1; r_max ← 0; c ← +∞
2:  while i ≤ |τq| do
3:      dist ← d(τq[i], p*)
4:      r[i] ← min{dist, r[i]}
5:      c ← min{dist, c}
6:      r_max ← max{r[i], r_max}
7:      i ← i + 1
8:  c_max ← max{c_max, c}
9:  LB_o ← max{c_max − (√2/2)δ, 0}
10: LB_t ← max{max{r_max, c_max} − D_max, 0}
11: return LB_o, LB_t, r, c_max

C. Optimization using Intermediate Results
To access a node of the RP-Trie, the computational overheads of LB_o and LB_t are both O(mn), where m and n are the lengths of the query trajectory and the reference trajectory, respectively. We propose an optimization that uses the intermediate computation results of the parent node to reduce the overhead to O(m).

Before we describe the optimization, we first review the steps of computing the Hausdorff distance. We assume that the lengths of τq and the reference trajectory τ* are both 2. As shown in Fig. 4, q1 and q2 are the points of the query trajectory, while p*1 and p*2 are the points of the reference trajectory. Let r_i be the minimum value in row i, e.g., the minimum of d(q_i, p*1) and d(q_i, p*2). Let c_j be the minimum value in column j, e.g., the minimum of d(q1, p*j) and d(q2, p*j). Let c_max be the maximum value among all c_j. According to the definition of Hausdorff, LB_o(τq, τ*) = max{c_max − (√2/2)δ, 0}, and LB_t(τq, τ*) = max{max{max_i{r_i}, c_max} − D_max, 0}.

After accessing p*2, we record r_1, r_2, and c_max as intermediate results. For a new reference trajectory τ*new = ⟨p*1, p*2, p*3⟩, both LB_t(τq, τ*new) and LB_o(τq, τ*new) can be quickly computed from d(q1, p*3), d(q2, p*3), and the intermediate results, since we only have to update the values of c_max and all r_i, where

r′_1 = min{r_1, d(q1, p*3)}
r′_2 = min{r_2, d(q2, p*3)}
c′_max = max{c_max, min{d(q1, p*3), d(q2, p*3)}}.   (4)

Similarly, r′_1, r′_2, and c′_max are the intermediate results after accessing p*3 that can be used for the subsequent computation. Therefore, the computational overheads of LB_o and LB_t at each node are O(m). Algorithm 1 describes the computation of LB_o and LB_t in detail.
Fig. 4. Distance Matrix (entries d(q_i, p*_j) with row minima r_i and column minima c_j, and the new column for p*_3)
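Algorithm 1 can be written down directly. The following Python sketch (variable names follow the pseudo-code; the 1-based indices become 0-based, and δ and D_max are passed as explicit parameters) updates the row minima r and the running column maximum c_max in O(m) when one reference point is appended:

```python
import math

def comp_lb(tq, p_star, r, c_max, delta, d_max):
    """Incrementally update LB_o and LB_t when reference point p_star is
    appended. r[i] holds min_j d(tq[i], p*_j) over the points seen so far;
    c_max holds the largest column minimum seen so far. O(m) per node."""
    c = math.inf
    for i, q in enumerate(tq):
        dist = math.hypot(q[0] - p_star[0], q[1] - p_star[1])
        r[i] = min(r[i], dist)   # update row minimum for q_i
        c = min(c, dist)         # column minimum of the new reference point
    c_max = max(c_max, c)
    half_diag = math.sqrt(2) * delta / 2  # half the grid-cell diagonal
    lb_o = max(c_max - half_diag, 0.0)
    lb_t = max(max(max(r), c_max) - d_max, 0.0)
    return lb_o, lb_t, r, c_max
```

Repeated calls with the returned r and c_max reproduce the incremental update of Eq. (4): only the new column's m distances are computed at each step.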
D. Pivot based Pruning
As the Hausdorff distance is a metric, the pivot-based pruning strategy [6] can be used to further improve the query efficiency. We therefore propose a pruning strategy that uses a pivot-based lower bound LB_p.

During preprocessing, we select N_p trajectories as pivot trajectories, and for each node in the RP-Trie, we store the distance array HR. During query processing, we first compute the distances d_qp between τq and all pivot trajectories. The time complexity is O(N_p · m · n), where m and n are the lengths of τq and a pivot trajectory, respectively.

We compute the pivot-based lower bound LB_p as follows:

LB_p = max_i{ |d_qp[i] − HR[i].max| − (√2/2)δ }.   (5)

Since N_p is small, the time cost of computing LB_p is small.

V. DISTRIBUTED FRAMEWORK
We proceed to cover the distributed framework. First, we discuss the drawbacks of existing global partitioning strategies. Then, we introduce our partitioning strategy, which aims to ensure better computing resource utilization during query processing. Finally, we discuss how to build a local RP-Trie based on Spark RDDs.
A. Drawbacks of Existing Global Partitioning Strategies
A straightforward partitioning method is to randomly divide the trajectories into partitions such that similar numbers of trajectories are kept in each partition. However, this does not guarantee load balancing, because the query times in different partitions can be quite different. Existing proposals, such as DITA, DFT, and DISAX [29], employ a homogeneous partitioning method that groups similar trajectories into the same partition. Then, they use a global index to prune irrelevant partitions, i.e., partitions without trajectories that are similar to the query trajectory. However, these methods have limited performance for two main reasons:

• Computing resource waste. After global pruning, only the compute nodes with surviving partitions compute local results, while the other nodes are idle and do not contribute.
• Less effective local pruning. If the trajectories in each partition are similar, the pruning in local search is less effective, since the MBRs of the local index may have large overlaps.

The homogeneous partitioning strategy targets batch search (e.g., high concurrency), where different partitions respond to different query trajectories. However, not all partitions are involved in batch query processing if the queries are skewed. For example, ride-hailing companies tend to issue batches of analysis queries in hot regions to increase profits. Therefore, some computing resource waste cannot be avoided.
B. Our Global Partitioning Strategy
An effective global partitioning strategy should ensure load balancing among partitions and should accelerate the local-search part of query processing. Unlike existing strategies, we propose a heterogeneous partitioning strategy that places similar trajectories in different partitions, based on the following observations:

• As mentioned, the pruning in local search is less effective in existing methods. Inspired by the principle of maximum entropy, a high degree of data discrimination in a partition is likely to enable better pruning.
• The composition structure of each partition is similar, since each partition contains trajectories from different groups of similar trajectories. It is possible that most partitions contain query results. Therefore, the query times of most partitions are similar, which enables load balancing.

Specifically, we use the simple but effective clustering algorithm SOM-TC [10] to partition the trajectories. We encode each trajectory τ as a reference trajectory τ* using geohash. If τ*1 = τ*2, we group τ1 and τ2 into a cluster. Note that we vary the space granularity of the geohash to control the number of clusters. At first, we set the space granularity to a small value such that each cluster contains only one trajectory. Then we enlarge the space granularity gradually and group the trajectories into larger clusters. This process stops when we reach about N/N_G clusters, where N is the dataset cardinality and N_G is the number of partitions. We sort the trajectories by cluster id and trajectory id. Finally, we assign the trajectories from the sorted sequence to partitions in round-robin fashion.

C. Building Local Index based on RDD
C. Building Local Index based on RDD

Spark Core [34] is based on the Resilient Distributed Dataset (RDD) abstraction, which supports two types of operations: transformations and actions. However, RDDs are designed for sequential processing, and random accesses through RDDs are expensive since such accesses may simply turn into full scans over the data. In our method, random access is inevitable. We solve this challenge by exploiting an existing method [28]: we package the data and index into a new class structure and use this class as the element type of the RDD.

Specifically, the abstract class Partitioner is provided by Spark for data partitioning, and Spark allows users to define their own partitioning strategies through inheritance from this abstract class. We use Partitioner to implement the strategy presented in Section V-B. Then, we package all trajectories within an RDD partition, together with the RP-Trie index, into a class structure called RpTraj that is defined as follows:

case class RpTraj(trajectory: Array, Index: RP-Trie)

We change the element type of the RDD to RpTraj:

type RpTrieRDD = RDD[RpTraj]

The RpTrieRDD is an RDD structure. The transformation operation mapPartitions is provided by Spark, and we use RpTrieRDD.mapPartitions to manipulate each partition. In particular, in each partition, we first use Index.build to construct the RP-Trie index. Then we use Index.query to execute the top-k query. Finally, the master collects the results from each partition with collect and determines the global top-k result.
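The per-partition search followed by the global merge on the master can be sketched in plain Python, with lists of (id, trajectory) pairs standing in for RDD partitions, a linear scan standing in for Index.query (in REPOSE this is where the RP-Trie pruning happens), and heapq.nsmallest emulating the merge after collect. The Hausdorff distance used here is the standard symmetric definition.

```python
import heapq
import math

def hausdorff(t1, t2):
    # Symmetric Hausdorff distance between two point sequences.
    def directed(a, b):
        return max(min(math.dist(p, q) for q in b) for p in a)
    return max(directed(t1, t2), directed(t2, t1))

def local_topk(partition, query, k):
    # Stand-in for Index.query inside one partition: a linear scan here.
    return heapq.nsmallest(k, ((hausdorff(query, t), tid)
                               for tid, t in partition))

def distributed_topk(partitions, query, k):
    # Emulates RpTrieRDD.mapPartitions(local_topk) followed by collect():
    # each partition contributes its local top-k; the master merges them.
    collected = [r for part in partitions for r in local_topk(part, query, k)]
    return heapq.nsmallest(k, collected)
```

The merge is correct because every global top-k trajectory is necessarily in the local top-k of its own partition.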
VI. OTHER SIMILARITY MEASURES

We proceed to extend our algorithm to Frechet and DTW. Due to the different properties of the similarity measures, the index structures differ from that for Hausdorff.
A. Extension on Frechet
The only difference from Hausdorff is that Frechet is sensitive to the order of points, so the RP-Trie is used without the trie optimization. As Frechet is also a metric, we still use pivot trajectories to accelerate the query processing. Given τq = {q1, q2, ..., qm} and τ = {p1, p2, ..., pn}, the Frechet distance between τq and τ is computed as follows:

D_F(τq, τ) =
  max_{j=1..n} d(q1, pj),  if m = 1
  max_{i=1..m} d(qi, p1),  if n = 1
  max{ d(qm, pn), min{ D_F(τq^{m-1}, τ^{n-1}), D_F(τq^{m-1}, τ), D_F(τq, τ^{n-1}) } },  otherwise   (6)

where τ^{m-1} is the prefix of τ obtained by removing the last point. We modify the computation of LB_o^F and LB_t^F for Frechet:

LB_o^F(τq, τ*) = max{ c_min − (√2/2)δ, 0 }   (7)
LB_t^F(τq, τ*) = max{ f_{m,n} − (√2/2)δ, 0 }   (8)

where f_{i,j} = D_F(τq^i, τ*^j) is the element in the i-th row and j-th column of the distance matrix, and c_min is the minimum value of the newly added column, i.e., c_min = min_i f_{i,j} for the newly added column j, as shown in Fig. 5. Fortunately, we are still able to quickly compute LB_o^F and LB_t^F through the intermediate results. Given the current last column values, the values in the new column j are computed as follows:

f_{i,j} = max{ d(qi, p*_j), min{ f_{i-1,j-1}, f_{i-1,j}, f_{i,j-1} } }.   (9)

Fig. 5. Distance Matrix of Frechet

In addition, as Frechet is a metric, we can still use the pivot-based lower bound LB_p^F for pruning.

Lemma 3. LB_o^F, LB_t^F, and LB_p^F have the same properties as those for Hausdorff.
1) The LB_o^F, LB_t^F, and LB_p^F of a leaf node are all smaller than the minimum Frechet distance between τq and any trajectory stored in the node.
2) For an internal node, the LB_o^F between its child node and τq is larger than that between the node and τq.

The first property indicates that LB_o^F, LB_t^F, and LB_p^F are correct lower bounds. The second property allows us to prune the sub-tree of an internal node with LB_o^F.

Proof. For property 1, since LB_t^F(τq, τ*) ≥ LB_o^F(τq, τ*), we only need to prove that LB_t^F and LB_p^F are correct lower bounds. According to Eq. 8 and f_{m,n} = D_F(τq, τ*), we have LB_t^F(τq, τ*) = max{ D_F(τq, τ*) − (√2/2)δ, 0 }. For a trajectory τ with a reference trajectory τ*, we have D_F(τ, τ*) ≤ (√2/2)δ. Therefore, due to the triangle inequality, we have LB_t^F(τq, τ*) ≤ D_F(τq, τ). Similarly, it is easy to show that LB_p^F(τq, τ*) ≤ D_F(τq, τ).

For property 2, we prove that LB_o^F(τq, τ*^j) ≥ LB_o^F(τq, τ*^{j-1}), 2 ≤ j ≤ n. According to Eq. 7,

LB_o^F(τq, τ*^j) = max{ min_i f_{i,j} − (√2/2)δ, 0 }
LB_o^F(τq, τ*^{j-1}) = max{ min_i f_{i,j-1} − (√2/2)δ, 0 }   (10)

Assume that h = argmin_i f_{i,j-1}. We have

f_{1,j} = max{ d(q1, p*_j), f_{1,j-1} } ≥ f_{1,j-1} ≥ f_{h,j-1}
f_{i,j} = max{ d(qi, p*_j), min{ f_{i-1,j-1}, f_{i-1,j}, f_{i,j-1} } } ≥ min{ f_{i-1,j-1}, f_{i-1,j}, f_{i,j-1} } ≥ f_{h,j-1}   (11)

Therefore, we have f_{i,j} ≥ f_{h,j-1}, 1 ≤ i ≤ m, and min_i f_{i,j} ≥ f_{h,j-1} = min_i f_{i,j-1}. Finally, LB_o^F(τq, τ*^j) ≥ LB_o^F(τq, τ*^{j-1}).
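The Frechet recursion (Eq. 6), the incremental column rule (Eq. 9), and the column-minimum monotonicity used in the proof of property 2 can be sketched in Python; this is an illustrative sketch assuming Euclidean point distances, not the paper's implementation.

```python
import math

def discrete_frechet(tq, t):
    # Discrete Frechet distance (Eq. 6), computed bottom-up over the
    # m x n matrix f of partial distances.
    m, n = len(tq), len(t)
    f = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            d = math.dist(tq[i], t[j])
            if i == 0 and j == 0:
                f[i][j] = d
            elif i == 0:
                f[i][j] = max(d, f[i][j - 1])
            elif j == 0:
                f[i][j] = max(d, f[i - 1][j])
            else:
                f[i][j] = max(d, min(f[i - 1][j - 1], f[i - 1][j],
                                     f[i][j - 1]))
    return f[m - 1][n - 1]

def first_column(tq, p0):
    # First column of the matrix against a reference trajectory.
    col = []
    for i, q in enumerate(tq):
        d = math.dist(q, p0)
        col.append(d if i == 0 else max(d, col[-1]))
    return col

def extend_column(tq, prev_col, p_new):
    # Eq. 9: append one reference point and compute the new column from
    # the previous one; min(new column) is the c_min used by LB_o.
    col = []
    for i, q in enumerate(tq):
        d = math.dist(q, p_new)
        if i == 0:
            col.append(max(d, prev_col[0]))
        else:
            col.append(max(d, min(prev_col[i - 1], prev_col[i], col[-1])))
    return col
```

Property 2 corresponds to min(col) never decreasing as columns are appended, which is what makes LB_o usable for pruning entire sub-tries.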
B. Extension on DTW

Since DTW is not a metric and is sensitive to the order of trajectory points, only the basic RP-Trie structure is used. The DTW between τq and τ is computed as follows:

D_DTW(τq, τ) =
  Σ_{j=1..n} d(q1, pj),  if m = 1
  Σ_{i=1..m} d(qi, p1),  if n = 1
  d(qm, pn) + min{ D_DTW(τq^{m-1}, τ^{n-1}), D_DTW(τq^{m-1}, τ), D_DTW(τq, τ^{n-1}) },  otherwise   (12)

The computations of LB_o^DTW and LB_t^DTW are similar to those for Frechet:

LB_o^DTW(τq, τ*) = c_min   (13)
LB_t^DTW(τq, τ*) = f_{m,n}   (14)

where f_{i,j} = D_DTW(τq^i, τ*^j) is the element in the i-th row and j-th column of the distance matrix, and c_min is the minimum value of the newly added column, i.e., c_min = min_i f_{i,j} at the j-th column. We update f_{i,j} as follows:

f_{i,j} = d′(qi, p*_j) + min{ f_{i-1,j-1}, f_{i-1,j}, f_{i,j-1} }.   (15)

Note that d′(qi, p*_j) is the minimum distance between qi and the cell that p*_j belongs to. We use it instead of d(qi, p*_j) since the triangle inequality does not apply to DTW.

Lemma 4. LB_o^DTW and LB_t^DTW have the same properties as those for Hausdorff.
1) The LB_o^DTW and LB_t^DTW of a leaf node are smaller than the minimum DTW distance between τq and any trajectory stored in the node.
2) For an internal node, the LB_o^DTW between its child node and τq is larger than that between the node and τq.

The proof is similar to that of Lemma 3. The basic RP-Trie and the heterogeneous partitioning strategy are applicable to other distance measures. For metrics, such as ERP, the RP-Trie with the trie optimization and the pivot-based pruning can be employed in a similar way as for Frechet. For non-metrics, such as LCSS and EDR, the basic RP-Trie can be used similarly as for DTW.
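A matching sketch for the DTW recursion (Eq. 12) and the column values underlying LB_o^DTW and LB_t^DTW (Eqs. 13-15). As a simplification, the exact point distance d stands in for d′ (the minimum distance from qi to the cell of p*_j), so this illustrates the recurrences rather than the exact bounds.

```python
import math

def dtw(tq, t):
    # DTW distance (Eq. 12) via the standard DP with an infinite border.
    m, n = len(tq), len(t)
    INF = float("inf")
    f = [[INF] * (n + 1) for _ in range(m + 1)]
    f[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = math.dist(tq[i - 1], t[j - 1])
            f[i][j] = d + min(f[i - 1][j - 1], f[i - 1][j], f[i][j - 1])
    return f[m][n]

def dtw_columns(tq, t):
    # Columns of the DTW matrix, built left to right as in Eq. 15;
    # LB_o is min(last column) and LB_t is its last entry (Eqs. 13-14).
    cols = []
    for j, p in enumerate(t):
        col = []
        for i, q in enumerate(tq):
            d = math.dist(q, p)
            if j == 0:
                col.append(d if i == 0 else d + col[-1])
            elif i == 0:
                col.append(d + cols[-1][0])
            else:
                col.append(d + min(cols[-1][i - 1], cols[-1][i], col[-1]))
        cols.append(col)
    return cols
```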
VII. EXPERIMENTS

A. Experimental Setup
All experiments are conducted on a cluster with 1 master node and 16 worker nodes. Each node has a 4-core Intel Xeon (Cascade Lake) Platinum 8269CY @ 2.50GHz processor and 32GB main memory and runs Ubuntu 18.04 LTS with Hadoop 2.6.0 and Spark 2.2.0.
Datasets. We conduct experiments on 3 types of datasets:
1) Small scale and small spatial span: San Francisco (SF), Porto, Rome, T-drive [33].
2) Large scale and small spatial span: Chengdu and Xi'an.
3) Large scale and large spatial span: OSM.
The dataset statistics are shown in Table III. (Dataset sources: http://sigspatial2017.sigspatial.org/giscup2017/home, http://crawdad.org/roma/taxi/20140717, https://gaia.didichuxing.com.) In the preprocessing stage, we remove the trajectories with length smaller than 10, and we split the trajectories with length larger than 1,000 into multiple trajectories. We uniformly and randomly select 100 trajectories as the query set.

TABLE III
STATISTICS OF DATASETS

Datasets   Cardinality   AvgLen   Spatial span      Size (GB)
T-drive    356,228       22.6     (1.89°, 1.17°)    0.16
SF         343,696       27.5     (0.54°, 0.76°)    0.19
Rome       99,473        152.4    (1.21°, 0.86°)    0.28
Porto      1,613,284     48.9     (11.7°, 14.2°)    1.24
Xi'an      6,645,727     230.1    (0.09°, 0.08°)    26.8
Chengdu    11,327,466    188.9    (0.09°, 0.07°)    37.7
OSM        4,464,399     596.3    (360°, 180°)      50.8

Competing algorithms. We compare the performance of REPOSE with three baseline algorithms:
1) DFT: A segment-based distributed top-k query algorithm that supports Hausdorff, Frechet, and DTW. Specifically, we choose the variant DFT-RB+DI that has the best query performance and the largest space overhead.
2) DITA: An in-memory distributed range query algorithm. In order to support top-k queries, DITA first estimates a threshold and then executes a range query to find candidate trajectories. Finally, DITA finds the k most similar trajectories. DITA does not support Hausdorff.
3) LS: Brute-force linear search, which computes the distances between the query and all trajectories in each partition and merges the results.

Parameter settings. For DFT, we set the partition pruning parameter C = 5. For DITA, we set N_L = 32; the pivot selection strategy is the neighbor distance strategy. For REPOSE, due to the different spatial spans and data densities of the datasets, we choose a different value of δ for each dataset, and for the Chengdu and Xi'an datasets the value further differs between Hausdorff and Frechet/DTW. We choose N_p = 5 pivot trajectories. For queries, we use k = 100 as the default. Since we have 16 worker nodes, each of which has a 4-core processor, we set the default number of partitions to 64 (each core processes a partition).

Performance metrics. We study three performance metrics: (1) query time (QT), (2) index size (IS), and (3) index construction time (IT). We repeat each query 20 times and report the average values. The index construction time of
REPOSE includes the time for converting trajectories to reference trajectories, clustering the trajectories, and building the trie.
B. Performance Evaluation
We compare the algorithms on all datasets and three similarity measures (Hausdorff, Frechet, and DTW). A performance overview is shown in Table IV. Due to space limitations, we only report the performance on the T-drive, Xi'an, and OSM datasets and for the Hausdorff and Frechet distances in the remaining experiments.
TABLE IV
PERFORMANCE OVERVIEW
(QT in s, IS in GB, and IT in s of REPOSE, DFT, DITA, and LS for Hausdorff, Frechet, and DTW on all seven datasets)

Performance overview. We make the following observations from Table IV: (1) In terms of query time and index construction time, REPOSE significantly outperforms the baseline methods. REPOSE also maintains a smaller index than DFT and DITA in most cases. REPOSE has a slightly larger index than DITA on Chengdu, Xi'an, and OSM for the Frechet and DTW distances. This occurs because DITA compresses all trajectories into fixed-length representative trajectories. Nevertheless, REPOSE can capture trajectory features adaptively and is able to improve the query efficiency. (2) The index size of DFT is about 4 times those of REPOSE and DITA on all datasets. For example, on Xi'an, DFT requires 142GB, while REPOSE and DITA only require 32GB. The reason is that DFT needs to regroup line segments into trajectories when computing distances and needs a dual index that takes up extra space. (3) For the first two types of datasets, it is interesting to see that the query time of LS is smaller than those of DITA and DFT. The reason is that the datasets are small, so the pruning capabilities are not utilized fully. In addition, some index-specific operations add computational overhead. For large datasets with small spatial span, the data density is high, which renders pruning more challenging.
Varying the value of k. Fig. 6 shows the performance of all algorithms when varying the value of k. We notice that with the increase of k, the trends differ across the algorithms. For example, the query time of DFT is unstable since it finds C·k trajectories at random from the dataset and uses the k-th smallest distance as the threshold; the query time thus depends on the quality of this threshold. The query time of DITA increases when k increases: DITA repeatedly reduces its threshold by half until the number of candidate trajectories is less than C·k, and the k-th smallest distance is then used to perform a range search. LS computes the distance between the query trajectory and all trajectories, so the value of k has little effect. REPOSE uses the current best k-th result as a pruning threshold, so its query time also increases when k increases. REPOSE achieves the best performance for all k and offers a considerable improvement.

Fig. 6. Performance when Varying k (query time on T-drive, Xi'an, and OSM with Hausdorff and Frechet)

Choice of δ. We study the effect of δ on REPOSE; Table V reports the query time for varying δ on T-drive, Xi'an, and OSM under Hausdorff (D_H) and Frechet (D_F). With the change of δ, there are descending and ascending trends in the query time. The reason for the descending trend is that reference trajectories become longer quickly when δ decreases. Although a long reference trajectory can improve the pruning, the benefit becomes small beyond a certain length, since time overheads for computing LB_o and LB_t are introduced. The ascending trend occurs because the number of cells decreases when δ increases. Then a reference trajectory cannot approximate a trajectory well, which results in poor pruning and larger query times. In summary, the value of δ affects the query time significantly, and it is important to select an appropriate δ value. Table V shows that the effect of δ is relatively consistent on the same dataset. Therefore, we can reuse the setting of δ from one measure to another when the query efficiency meets the application requirements.

Choice of N_p. Table VI shows the effect of N_p on the query time for T-drive, Xi'an, and OSM under Hausdorff and Frechet. Similar to δ, there are descending and ascending trends in the query time. As N_p increases beyond a boundary value, the improvement in pruning is small while additional computational overhead is introduced.

Fig. 7. Improvement by Optimized Trie ((a) reduced number of trie nodes; (b) improvement on query time)

Improvement by the optimized trie.
Fig. 7 shows the improvements from using the optimized trie on the query time and on the number of trie nodes. For the T-drive dataset, the number of trie nodes is reduced by about 20% (from 39K to 32K) compared to the unoptimized trie, and the query time is decreased by about 12%. For the OSM dataset, both the number of trie nodes (from 213K to 198K) and the query time are reduced by about 8%.
Effect of dataset cardinality.
From Fig. 8, we observe that the query time of all algorithms grows linearly. REPOSE has the best performance due to a better partitioning scheme and powerful pruning.
Effect of the number of partitions.
Fig. 8. Effect of Dataset Cardinality (query time vs. data scale on OSM with Hausdorff and Frechet)

Fig. 9. Effect of the Number of Partitions (query time with 16, 32, 48, and 64 partitions on OSM with Hausdorff and Frechet)

Fig. 9 shows that, as the number of partitions increases from 16 to 64, all algorithms gain performance improvements. We have 64 partitions by default, where each core processes a partition.
REPOSE achieves a higher performance gain (not including LS) because it implements a better partitioning scheme that equalizes the workload of each partition. LS has the highest performance gain because it suffers from a severe data skew issue when the number of partitions is small; when we increase the number of partitions, its performance improves significantly.
Effect of partitioning strategies.
We deploy three different partitioning strategies, heterogeneous, homogeneous, and random, with the RP-Trie as the local index to analyze the effects of the partitioning strategies. In the homogeneous strategy, we place the similar trajectories in a cluster in the same partition. In the random strategy, we place trajectories into partitions at random. The results are shown in Table VII. The heterogeneous partitioning strategy achieves the best performance, and the homogeneous strategy has the worst performance, which can be explained as follows: (1) The local pruning of the homogeneous strategy is less effective since the distances of the trajectories in a partition to the query trajectory are very similar. (2) In the heterogeneous strategy, a high degree of data discrimination in a partition improves the local pruning. In addition, since the composition of each partition is similar, the query times of all partitions are similar, which enables load balancing.
Comparison with DITA and DFT using heterogeneouspartitioning.
To further explore the benefits of the heterogeneous partitioning strategy, we apply it to DITA and DFT to examine the effects on performance. We denote these two methods as Heter-DITA and Heter-DFT. The results are
shown in Tables VIII and IX. Since DITA does not support Hausdorff, we examine its performance on DTW instead. The results show that both Heter-DITA and Heter-DFT outperform the original DITA and DFT, which offers evidence that the heterogeneous partitioning strategy is superior.

TABLE VII
EFFECT OF PARTITIONING STRATEGY

Distance    Partitioning     T-drive (s)   Xi'an (s)   OSM (s)
Hausdorff   Heterogeneous
            Homogeneous      1.64          35.93       344.74
            Random           1.51          21.38       35.12
Frechet     Heterogeneous
            Homogeneous      1.52          109.42      240.32
            Random           1.45          47.94       37.68

TABLE VIII
COMPARISON WITH DITA USING HETEROGENEOUS PARTITIONING

Distance    Algorithm     T-drive (s)   Xi'an (s)   OSM (s)
DTW         REPOSE
            Heter-DITA    2.74          184.97      503.59
            DITA          2.78          186.36      509.53
Frechet     REPOSE
            Heter-DITA    2.26          84.94       87.56
            DITA          2.39          91.31       94.12

VIII. RELATED WORK
Trajectory similarity search is a fundamental operation in many applications [3], [4], [11]–[14], [16], [25], [26], [28]. Based on the index structures, we classify the existing proposals into three categories:

(1) Space partitioning tree based approaches. Traditional space partitioning trees [13], [16], [26], [27] (e.g., the R-tree) construct MBRs by dividing trajectories into sub-trajectories and prune non-candidate trajectories based on MBR intersections. However, large MBR areas limit the pruning capabilities. For metric spaces, classical metric indexes, such as the VP-tree and the M-tree [15], [24], [32], are used for Hausdorff and Frechet. However, these methods perform poorly in practice since they abandon the features of the trajectories and use only the metric characteristics to index them.

(2) Grid hashing based approaches. The grid hashing based approaches [2], [12] use the grid sequence that a trajectory traverses to represent the trajectory, which captures the trajectory characteristics and makes it possible to quickly find candidate trajectories. Although these methods fail to return exact results, they demonstrate good performance in approximate trajectory similarity search.

(3) Distributed index based approaches. SmartTrace+ [35] is a real-time distributed trajectory similarity search framework for smartphones that employs a pruning and refinement paradigm similar to that of REPOSE. DFT [28] uses the R-tree to index trajectory line segments and prunes by using the distances between a query trajectory and the MBRs of line segments. Unfortunately, DFT needs to reconstruct the line segments
into trajectories when computing similarity, which incurs a high space overhead. DITA [19] selects pivot points to represent a trajectory and stores close pivot points in the same MBR. Then, DITA utilizes a trie to index all MBRs and prunes by using the distances between the query trajectory and the MBRs. However, DITA fails to retain the features of the original trajectories due to its pivot point selection strategy. Moreover, its query processing is inefficient for some distance measures.

TABLE IX
COMPARISON WITH DFT USING HETEROGENEOUS PARTITIONING

Distance    Algorithm    T-drive (s)   Xi'an (s)   OSM (s)
Hausdorff   REPOSE
            Heter-DFT    2.32          1645.59     260.77
            DFT          2.50          1857.88     273.68
Frechet     REPOSE
            Heter-DFT    2.21          1855.01     380.47
            DFT          2.33          2100.80     418.40

IX. CONCLUSION AND FUTURE WORK
We present a distributed in-memory framework,
REPOSE, for processing top-k trajectory similarity queries on Spark. We develop a heterogeneous global partitioning strategy that places similar trajectories in different partitions in order to achieve load balancing and to utilize all compute nodes during query processing. We design an effective index structure, RP-Trie, for local search that utilizes two optimizations, namely a succinct trie structure and z-value re-arrangement. To further accelerate the query processing, we provide several pruning techniques. The experimental study uses 7 widely used datasets and offers evidence that REPOSE outperforms its competitors in terms of query time, index size, and index construction time. Compared to the state-of-the-art proposals for Hausdorff,
REPOSE improves the query time by about 60%, the index size by about 80%, and the index construction time by about 85%. In future work, it is of interest to take the temporal dimension into account to enable top-k spatio-temporal trajectory similarity search in distributed settings. In addition, it is of interest to develop optimizations for other distance measures, such as LCSS, ERP, and EDR, when using the RP-Trie index.

REFERENCES

[1] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves. IJCGA, 5:75–91, 1995.
[2] M. S. Astefanoaei, P. Cesaretti, P. Katsikouli, M. Goswami, and R. Sarkar. Multi-resolution sketches and locality sensitive hashing for fast trajectory processing. In SIGSPATIAL/GIS, pages 279–288, 2018.
[3] P. Bakalov, M. Hadjieleftheriou, E. J. Keogh, and V. J. Tsotras. Efficient trajectory joins using symbolic representations. In MDM, pages 86–93, 2005.
[4] P. Bakalov, M. Hadjieleftheriou, and V. J. Tsotras. Time relaxed spatiotemporal trajectory joins. In GIS, pages 182–191, 2005.
[5] B. Bustos, G. Navarro, and E. Chávez. Pivot selection techniques for proximity searching in metric spaces. Pattern Recognit. Lett., 24(14):2357–2366, 2003.
[6] L. Chen, Y. Gao, B. Zheng, C. S. Jensen, H. Yang, and K. Yang. Pivot-based metric indexing. PVLDB, 10(10):1058–1069, 2017.
[7] L. Chen and R. T. Ng. On the marriage of lp-norms and edit distance. In VLDB, pages 792–803, 2004.
[8] L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD, pages 491–502, 2005.
[9] H. K. Dai and H. Su. On the locality properties of space-filling curves. In ISAAC, volume 2906, pages 385–394, 2003.
[10] P. Dewan, R. K. Ganti, and M. Srivatsa. SOM-TC: self-organizing map for hierarchical trajectory clustering. In ICDCS, pages 1042–1052, 2017.
[11] H. Ding, G. Trajcevski, and P. Scheuermann. Efficient similarity join of large sets of moving object trajectories. In TIME, pages 79–87, 2008.
[12] A. Driemel and F. Silvestri. Locality-sensitive hashing of curves. In SoCG, LIPIcs, pages 37:1–37:16, 2017.
[13] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In SIGMOD, pages 419–429, 1994.
[14] E. Frentzos, K. Gratsias, and Y. Theodoridis. Index-based most similar trajectory search. In ICDE, pages 816–825, 2007.
[15] A. W. Fu, P. M. Chan, Y. Cheung, and Y. S. Moon. Dynamic vp-tree indexing for n-nearest neighbor search given pair-wise distances. VLDBJ, 9(2):154–173, 2000.
[16] E. J. Keogh. Exact indexing of dynamic time warping. In VLDB, pages 406–417, 2002.
[17] D. Pfoser, S. Brakatsoulas, P. Brosch, M. Umlauft, N. Tryfona, and G. Tsironis. Dynamic travel time provision for road networks. In GIS, page 68, 2008.
[18] D. Pfoser, N. Tryfona, and A. Voisard. Dynamic travel time maps - enabling efficient navigation. In SSDBM, pages 369–378, 2006.
[19] Z. Shang, G. Li, and Z. Bao. DITA: distributed in-memory trajectory analytics. In SIGMOD, pages 725–740, 2018.
[20] M. B. Shapiro. The choice of reference points in best-match file searching. Commun. ACM, 20(5):339–343, 1977.
[21] T. Skopal, J. Pokorný, and V. Snásel. PM-tree: Pivoting metric tree for similarity search in multimedia databases. In ADBIS, 2004.
[22] H. Su, S. Liu, B. Zheng, X. Zhou, and K. Zheng. A survey of trajectory distance measures and performance evaluation. VLDBJ, 29(1):3–32, 2020.
[23] N. Ta, G. Li, Y. Xie, C. Li, S. Hao, and J. Feng. Signature-based trajectory similarity join. TKDE, 29(4):870–883, 2017.
[24] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett., 40(4):175–179, 1991.
[25] M. Vlachos, D. Gunopulos, and G. Kollios. Discovering similar multidimensional trajectories. In ICDE, pages 673–684, 2002.
[26] M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. J. Keogh. Indexing multi-dimensional time-series with support for multiple distance measures. In KDD, pages 216–225, 2003.
[27] Y. Wang, Y. Zheng, and Y. Xue. Travel time estimation of a path using sparse trajectories. In KDD, pages 25–34, 2014.
[28] D. Xie, F. Li, and J. M. Phillips. Distributed trajectory similarity search. PVLDB, 10(11):1478–1489, 2017.
[29] D. E. Yagoubi, R. Akbarinia, F. Masseglia, and T. Palpanas. Massively distributed time series indexing and querying. TKDE, 32(1):108–120, 2020.
[30] K. Yamamoto, K. Uesugi, and T. Watanabe. Adaptive routing of multiple taxis by mutual exchange of pathways. IJKESDP, 2(1):57–69, 2010.
[31] B. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. In ICDE, pages 201–208, 1998.
[32] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA, pages 311–321, 1993.
[33] J. Yuan, Y. Zheng, X. Xie, and G. Sun. T-drive: Enhancing driving directions with taxi drivers' intelligence. TKDE, 25(1):220–232, 2013.
[34] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauly, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 15–28, 2012.
[35] D. Zeinalipour-Yazti, C. Laoudias, C. Costa, M. Vlachos, M. I. Andreou, and D. Gunopulos. Crowdsourced trace similarity with smartphones. TKDE, 25(6):1240–1253, 2013.
[36] D. Zhang, N. Li, Z. Zhou, C. Chen, L. Sun, and S. Li. iBAT: detecting anomalous taxi trajectories from GPS traces. In UbiComp, pages 99–108, 2011.
[37] H. Zhang, H. Lim, V. Leis, D. G. Andersen, K. Keeton, and A. Pavlo. Succinct range filters. SIGMOD Rec., 48(1):78–85, 2019.
[38] B. Zheng, L. Weng, X. Zhao, K. Zeng, X. Zhou, and C. S. Jensen. REPOSE: Distributed top-k trajectory similarity search with local reference point tries. CoRR, abs/2101.08929, 2021.
APPENDIX
A. Pseudocode of Search Procedure
The search procedure is detailed in Algorithm 2.
Algorithm 2: LocalSearch(τ_q, root, k)
Input: The query trajectory τ_q, the root node, and k
Output: The top-k results

d_k ← +∞; minHeap ← ∅;
Initialize a priority queue E with root;
while E ≠ ∅ do
    t ← E.top;
    if t is a leaf node then
        if t.LB_t < d_k ∧ t.LB_p < d_k then
            for each τ ∈ t do
                Compute D_H(τ_q, τ);
                Update minHeap and d_k if necessary;
        else
            Break;
    else
        if t.LB_o < d_k ∧ t.LB_p < d_k then
            for each child node t′ of t do
                p* ← the reference point of t′;
                t′.LB ← CompLB(τ_q, p*, t.r, t.c_max);
                Insert t′ into E;
        else
            Break;
return k trajectories in minHeap;
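Algorithm 2 is a best-first, branch-and-bound search over the RP-Trie. The following Python rendering mirrors its control flow under simplifying assumptions: each node carries a single precomputed lower bound lb in place of LB_o/LB_t/LB_p and CompLB, leaves store trajectories directly, and dist is any exact distance function.

```python
import heapq
import itertools
import math

class Node:
    # Simplified stand-in for an RP-Trie node: internal nodes carry
    # children and a lower bound lb on the distance to anything below
    # them; leaf nodes carry trajectories.
    def __init__(self, lb, children=None, trajs=None):
        self.lb = lb
        self.children = children or []
        self.trajs = trajs or []

def local_search(query, root, k, dist):
    counter = itertools.count()           # tie-breaker for the heap
    heap = [(root.lb, next(counter), root)]
    best = []                             # max-heap of (-distance, traj)
    d_k = math.inf
    while heap:
        lb, _, node = heapq.heappop(heap)
        if lb >= d_k:
            break                         # no remaining node can improve top-k
        if node.trajs:                    # leaf: refine with exact distances
            for traj in node.trajs:
                d = dist(query, traj)
                if d < d_k or len(best) < k:
                    heapq.heappush(best, (-d, traj))
                    if len(best) > k:
                        heapq.heappop(best)
                    if len(best) == k:
                        d_k = -best[0][0]
        else:                             # internal: enqueue children
            for child in node.children:
                heapq.heappush(heap, (child.lb, next(counter), child))
    return sorted((-nd, t) for nd, t in best)
```

Because the frontier is ordered by lower bound, the first time a bound reaches the current k-th best distance the whole remaining frontier can be discarded, which is exactly the early-termination ("Break") step in Algorithm 2.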
B. Details of the Greedy Algorithm for the Optimized RP-Trie

Assume that we have a set of N reference trajectories Z = {Z_1, ..., Z_N} and M cells. Let L be the maximum length of the reference trajectories and L(Z) = Σ_{i=1..N} |Z_i| be the total number of z-values in Z.

First, for the root node, we count the frequency of each z-value in Z and store the counts in an array C(Z) of size M. The time for counting is O(L(Z)), and the space consumption of C(Z) is O(M). Second, we find the most frequent z-value z_1. Let Z_{z_1} be the set of reference trajectories containing z_1. We build a node e_1 with label z_1 as a child node of the root and put the trajectories in Z_{z_1} into the subtree of e_1. Similarly, for e_1, we count the frequency of each z-value in Z_{z_1} and store the counts in an array C(Z_{z_1}) of size M. Third, to find the next most frequent z-value z_2 among the remaining trajectories, we only need to compute C(Z) − C(Z_{z_1}) rather than count the frequency of each z-value in Z − Z_{z_1}. Thus, we find the most frequent z-value z_2 of Z − Z_{z_1} according to C(Z) − C(Z_{z_1}). Let Z_{z_2} be the set of reference trajectories containing z_2 in Z − Z_{z_1}. We build a node e_2 with label z_2 as a child node of the root and put the trajectories in Z_{z_2} into the subtree of e_2. We repeat this process until no trajectory is left.

TABLE X
TRAJECTORIES IN Z TO BE INDEXED

Fig. 10. Optimized RP-Trie for Z

Assume that we divide Z into Z_{z_1}, ..., Z_{z_t}, and we have C(Z_{z_1}), ..., C(Z_{z_t}). That is, we have t nodes in the first level of the RP-Trie. The time for finding all z_i is O(Mt). The accumulated time for computing the arrays is bounded by Σ_{i=1..t} O(L(Z_{z_i})) = O(L(Z)). Therefore, the time cost of building the first level of the RP-Trie is O(L(Z) + Mt). Similarly, if the i-th level has t_i nodes, the time cost of building the i-th level is O(L(Z) + M·t_i). The optimized RP-Trie has at most L + 1 levels and L(Z) + 1 nodes. Therefore, the overall time cost is O(L(Z)·L + M·L(Z)). Normally, L and M are small constants with L ≤ M, and L(Z) ≤ N·L, so the time cost of building an optimized RP-Trie is O(NM). In fact, most reference trajectories have lengths much less than L, and the number of nodes is smaller than L(Z), so the real time cost is much less than O(NM). Example 3.
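The greedy first-level construction described above can be sketched in Python. A dictionary stands in for the count array C(Z) of size M, and the subtraction C(Z) − C(Z_{z_1}) is realized by decrementing the counts contributed by the peeled-off trajectories instead of recounting.

```python
from collections import Counter

def greedy_first_level(ref_trajs):
    # Greedily group reference trajectories (lists of z-values) for the
    # first level of the optimized RP-Trie: repeatedly pick the most
    # frequent z-value among the remaining trajectories.
    remaining = dict(enumerate(ref_trajs))
    counts = Counter(z for traj in ref_trajs for z in set(traj))
    groups = []
    while remaining:
        z, _ = max(counts.items(), key=lambda kv: kv[1])
        members = [tid for tid, traj in remaining.items() if z in traj]
        # Subtract the counts contributed by the peeled-off trajectories
        # (C(Z) - C(Z_z)), rather than recounting the remainder.
        for tid in members:
            for zv in set(remaining[tid]):
                counts[zv] -= 1
            del remaining[tid]
        counts = Counter({k: v for k, v in counts.items() if v > 0})
        groups.append((z, sorted(members)))
    return groups
```

Each iteration yields one first-level node label together with the trajectory ids routed into its subtree; the same procedure is applied recursively within each subtree.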