Distributed Spatial-Keyword kNN Monitoring for Location-aware Pub/Sub
Shohei Tsuruoka, Daichi Amagata, Shunya Nishio, Takahiro Hara
Shohei Tsuruoka
Osaka [email protected]
Daichi Amagata
Osaka [email protected]
Shunya Nishio
Osaka [email protected]
Takahiro Hara
Osaka [email protected]
ABSTRACT
Recent applications employ publish/subscribe (Pub/Sub) systems so that publishers can easily receive the attention of customers and subscribers can monitor useful information generated by publishers. Due to the prevalence of smart devices and social networking services, a large number of objects that contain both spatial and keyword information have been generated continuously, and the number of subscribers also continues to increase. This poses a challenge to Pub/Sub systems: they need to continuously extract useful information from massive objects for each subscriber in real time. In this paper, we address the problem of k nearest neighbor monitoring on a spatial-keyword data stream for a large number of subscriptions. To scale well to massive objects and subscriptions, we propose a distributed solution, namely DkM-SKS. Given m workers, DkM-SKS divides a set of subscriptions into m disjoint subsets based on a cost model so that each worker has almost the same kNN-update cost, to maintain load balancing. DkM-SKS allows an arbitrary approach to updating the kNN of each subscription, so with a suitable in-memory index, DkM-SKS can accelerate update efficiency by pruning irrelevant subscriptions for a given new object. We conduct experiments on real datasets, and the results demonstrate the efficiency and scalability of DkM-SKS.
Due to the recent prevalence of GPS-enabled devices, many applications have been generating objects that contain location information and keywords [15, 27]. They often provide services that retrieve objects useful to users from the generated ones, based on a location-aware publish/subscribe (Pub/Sub) model [22, 23, 29, 33, 34]. In this model, users register queries that specify query locations and keywords as their subscriptions on a Pub/Sub system, and this system delivers appropriate objects generated by publishers (e.g., points of interest) to subscriptions based on their query locations and keywords. It is well known that range and k nearest neighbor (kNN) queries support location-aware Pub/Sub systems. A range query retrieves all objects existing within a user-specified range from a query point, so it cannot control the result size. This means that users may not obtain any objects or may obtain a huge number of objects, which is not desirable. On the other hand, a kNN query alleviates this drawback, since users can obtain a reasonable-sized result. In this paper, hence, we consider kNN queries.

Figure 1: An example of kNN monitoring in a location-aware Pub/Sub system, where o_i and s_j respectively denote a spatial-keyword object and a subscription. (a) At time t. (b) At time t + 1.

In Pub/Sub environments, objects are generated in a streaming fashion, so we have to continuously update the kNN objects for each subscription. For example:
Example 1.
Figure 1 illustrates an example of kNN monitoring in a location-aware Pub/Sub system. Three subscriptions are registered, and the Pub/Sub system monitors the kNN objects for each of them. Assume k = 1 and focus on the subscription that specifies Japanese and Noodle as keywords. At time t, i.e., in Figure 1(a), the NN object for this subscription is the object that contains the keyword Noodle and is the nearest to it among the three objects generated so far. The other two subscriptions likewise have their own NN objects. Assume further that a new object is generated at time t + 1, as shown in Figure 1(b). Since the new object also contains the keyword Japanese, the NN object of the focused subscription is updated to the new object (and the NN objects for the other subscriptions do not change).

Users require up-to-date results, so Pub/Sub systems have to efficiently update the kNN objects of their subscriptions when new objects are given. However, this is a difficult task, because many applications employing Pub/Sub systems have to deal with a large number of (often million-scale) subscriptions [35]. Besides, due to the usefulness of location-aware Pub/Sub systems, the number of subscriptions is further increasing [32]. It is therefore hard for a single server to update the result for each subscription in real time [14]. This suggests that we need to make location-aware Pub/Sub systems efficient and scalable, motivating us to consider a distributed solution: given multiple workers, each registered subscription is assigned to a specific worker so that parallel kNN update is enabled.
Challenge.
Although a distributed solution is promising, it faces some challenges in scaling well to massive objects and subscriptions (i.e., continuous spatial-keyword kNN queries).
(1) A distributed solution has to maintain load balancing. This is not trivial for continuous spatial-keyword kNN queries, because each subscription specifies arbitrary locations and keywords, i.e., the loads of subscriptions are different and not explicitly provided.
(2) It is necessary to deal with subscription insertions and deletions. Although some variants of the spatial-keyword kNN monitoring problem [22, 33] accept subscription insertions and deletions, these solutions consider centralized environments, and extending them to decentralized environments is not trivial. In addition, [14, 32] assume subscription insertions and deletions in distributed processing environments. However, [14] considers not the costs of subscriptions but their number, which is not effective for load balancing, and [32] does not consider load balancing.
We overcome these challenges and propose two baselines and DkM-SKS (Distributed kNN Monitoring on Spatial-Keyword data Stream). Our solutions employ
• Cost models for subscriptions: We design cost models for subscriptions, so that we can estimate the load of a given subscription when a new object is generated. Specifically, we propose keyword- and space-oriented cost models. Our models use a practical assumption and can deal with new subscriptions. Based on these models, we further propose a hybrid of these two models.
• Cost-based subscription partitioning: Based on our cost models, a set of subscriptions is divided into disjoint subsets, each of which is assigned to a specific worker. In particular, DkM-SKS considers both spatial and keyword information, so that kNN update costs can be minimized. We use a greedy algorithm for subscription partitioning, because optimal cost-based partitioning is NP-hard.
Furthermore, DkM-SKS allows an arbitrary exact algorithm for kNN update. This is a good property because it can adopt a state-of-the-art algorithm to accelerate performance. To demonstrate the efficiency of DkM-SKS, we conduct experiments on two real datasets. From the experimental results, we confirm that DkM-SKS outperforms the baselines and a state-of-the-art technique. This is the full version of our preliminary paper [31].
Organization.
The rest of this paper is organized as follows. We formally define our problem in Section 2. Then, we design baseline solutions in Section 3. We propose DkM-SKS in Section 4, and introduce our experimental results in Section 5. Related works are reviewed in Section 6. Finally, this paper is concluded in Section 7.
Problem definition.
Let us first define spatial-keyword objects.
Definition 1 (Spatial-keyword object). A spatial-keyword object o is defined as o = ⟨p, ψ, t⟩, where p is a 2-dimensional location of o, ψ is a set of keywords held by o, and t is the time-stamp when o is generated.
Without loss of generality, hereinafter, we simply call o an object. Note that we assume discrete time in this paper.

Figure 2: A toy example of objects and subscriptions that have been respectively generated and registered at time t.

Next, we define continuous spatial-keyword k nearest neighbor (kNN) queries.
Definition 2 (Continuous spatial-keyword kNN query). A continuous spatial-keyword kNN query s is defined as s = ⟨p, ψ, k, t⟩, where p is a 2-dimensional location of interest for s, ψ is a set of keywords in which s is interested, k is the number of results required by s, and t is the time-stamp when s is registered. Let O be a set of objects generated so far, and let O(s) be the set of objects o ∈ O where o.ψ ∩ s.ψ ≠ ∅ and s.t ≤ o.t. Given O(s), this query monitors a set of objects A that satisfy (i) |A| = k and (ii) ∀o ∈ A, ∀o′ ∈ O(s) − A, dist(o.p, s.p) ≤ dist(o′.p, s.p), where dist(p, p′) evaluates the Euclidean distance between points p and p′ (ties are broken arbitrarily).
That is, we consider continuous spatial-keyword kNN queries with a boolean (i.e., OR) semantic for keywords [5, 8, 11] and a time constraint [6, 30] for obtaining objects that are as fresh as possible. A subscription corresponds to a continuous spatial-keyword kNN query in this paper, as shown in Example 1. We hence use them interchangeably. Then, our problem is defined as follows:
Problem statement.
Given O and a set of registered subscriptions S, our problem is to exactly monitor A for each subscription ∈ S.
Example 2. Figure 2 illustrates a toy example which is used throughout this paper. Assume that the subscriptions (objects) shown in the figure have been registered (generated) at time t. Consider one of the subscriptions, s: O(s) consists of the four objects that contain a, b, or c. Assuming s.k = 2, A of s consists of the two objects in O(s) that are nearest to s.
This paper proposes a distributed solution to achieve real-time monitoring and scale well to large |O| and |S|.
System overview.
We assume that a location-aware Pub/Sub system employs a general distributed setting consisting of a main server and m workers [9, 24]. The main server (each worker) directly communicates with workers (the main server). (A worker can be a CPU core or a machine that can use a thread.) The main server takes the following roles: it
• assigns each subscription to a specific worker,
• receives a stream of objects and broadcasts them to all workers, and
• accepts subscription insertions and deletions.
The main operations of each worker are as follows: it
• accepts subscriptions assigned by the main server,
• removes requested subscriptions, and
• updates the kNN objects for each assigned subscription.
We see that the kNN objects for each subscription are updated in parallel, so this approach is promising for massive subscriptions. An important problem in achieving this is load balancing. That is, distributed solutions have to consider how to make the computation time of each worker almost equal when new objects are generated. We analyze this problem theoretically below.
Let C(s) be the kNN update cost of a subscription s (how to obtain C(s) is introduced later). Furthermore, let C(w_i) be the cost (load) of a worker w_i, which is defined as
$$C(w_i) = \sum_{s \in S(w_i)} C(s),$$
where S(w_i) is a set of subscriptions assigned to w_i. We want to minimize the load difference between workers with a good subscription assignment. This can be formalized as follows:
Definition 3 (Optimal subscription assignment problem). Given a set of objects O, a set of subscriptions S, and m workers, this problem is to find a subscription assignment that minimizes
$$\max_{i \in [1,m]} C(w_i) - \min_{j \in [1,m]} C(w_j).$$
We have the following theorem w.r.t. the above problem [7].
Theorem 1.
The optimal subscription assignment problem is NP-hard.
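To make the objective of Definition 3 concrete, the following is a minimal sketch in Python. It assumes that each subscription already has an estimated cost C(s) stored in a dictionary; the function names (`worker_loads`, `imbalance`, `brute_force_optimum`) and the toy costs are hypothetical and are not part of the original system. The brute-force search enumerates all m^|S| assignments, so it only illustrates why an exact search is impractical beyond toy inputs.

```python
from itertools import product


def worker_loads(costs, assignment, m):
    """C(w_i): sum of the estimated costs C(s) of the subscriptions on worker i."""
    loads = [0.0] * m
    for sub_id, worker in assignment.items():
        loads[worker] += costs[sub_id]
    return loads


def imbalance(costs, assignment, m):
    """Objective of Definition 3: maximum worker load minus minimum worker load."""
    loads = worker_loads(costs, assignment, m)
    return max(loads) - min(loads)


def brute_force_optimum(costs, m):
    """Enumerate every assignment (m ** len(costs) of them); toy inputs only."""
    ids = list(costs)
    best_value, best_assignment = None, None
    for choice in product(range(m), repeat=len(ids)):
        assignment = dict(zip(ids, choice))
        value = imbalance(costs, assignment, m)
        if best_value is None or value < best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment


# Hypothetical costs for three subscriptions and two workers.
toy_costs = {"s1": 3, "s2": 2, "s3": 1}
best_value, best_assignment = brute_force_optimum(toy_costs, 2)
```

With these toy costs, placing s1 alone on one worker and s2 and s3 on the other gives equal loads, so the best imbalance is zero.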
It can be seen, from this theorem, that it is not practical to obtain the optimal assignment, which suggests that we need a heuristic approach. We hence consider C(s) to capture the load of s and then design an approach that partitions S into m disjoint subsets whose loads are well balanced. Note that C(s) is dependent on a given cost model. In addition, we consider how to manage new subscriptions (we can easily deal with subscription deletions: the main server simply requests them to the corresponding workers).

Because this is the first work that proposes a distributed solution for processing continuous spatial-keyword kNN queries defined in Definition 2, we first design baseline solutions. We propose two baselines that respectively employ keyword- and space-oriented subscription partitioning. We assume that some subscriptions are registered at the initial time, and we partition S when O becomes sufficiently large. (This is common to DkM-SKS.) We use O_init to denote the set of objects when S is partitioned.

To start with, we design keyword-oriented partition. One possible approach partitions S so that the sets of distinct keywords of the subscriptions held by each worker are disjoint between workers. This approach is not efficient, because it does not consider keyword frequencies. In other words, if a worker has subscriptions with keywords that are contained by many objects, its load becomes heavy, causing load imbalance. Hence our keyword-oriented partition takes keyword frequencies into account.
Algorithm 1:
Subscription-Assignment
Input: S (a set of subscriptions) and m workers
  Sort S in descending order of cost
  Set C(w) = 0 for every worker w
  for each s ∈ S do
    w ← arg min_{w_i} C(w_i)
    S(w) ← S(w) ∪ {s}, C(w) ← C(w) + C(s)
Cost estimation.
Similar to [34], we estimate the load of a subscription based on the distributions of keywords in O_init, because the distributions of large datasets rarely change in practice [37]. Assume that the appearance probability of each keyword is independent. Given an object o, the probability that a keyword λ is contained in o.ψ, P(λ), is
$$P(\lambda) = \frac{|O_\lambda|}{|O_{init}|},$$
where O_λ ⊆ O_init is a set of objects o_i such that λ ∈ o_i.ψ. Recall that O(s) is a set of objects o_i where o_i.ψ ∩ s.ψ ≠ ∅. Therefore, the kNN update cost of a subscription s, C(s), can be estimated as:
$$C(s) = \sum_{\lambda \in s.\psi} P(\lambda). \quad (1)$$
Subscription partition.
Keyword-oriented partition employs the cost model defined in Equation (1) and a 3/2-approximation greedy algorithm [20] for subscription partitioning. This approach first computes C(s) for every s ∈ S, and sorts S in descending order of C(s). Then, this approach sequentially accesses subscriptions while assigning an accessed subscription to w_i with the minimum C(w_i). Algorithm 1 details this approach.
kNN update.
Each worker w maintains an inverted file w.I to index its assigned subscriptions. The inverted file is a set of postings lists w.I[λ] that maintain subscriptions containing a keyword λ. Given a new object o (broadcast by the main server), each worker w computes subscriptions that contain keywords in o.ψ, from w.I, while pruning irrelevant subscriptions. After that, w updates the kNN of the corresponding subscriptions.
Example 3.
We partition S in Figure 2 into two disjoint subsets for workers w_1 and w_2, based on keyword-oriented partition. Figure 3 illustrates an overview. The left part shows subscriptions and their costs obtained from Equation (1), and the right part shows the partition result, in which each of w_1 and w_2 has five subscriptions. They are maintained by inverted files (the rightmost tables).
Subscription insertion.
A new subscription s′ can also obtain its estimated cost from Equation (1), because its cost model assumes that the keyword distribution rarely changes [35]. The main server maintains C(w) for each worker w. (This is common to all of our solutions.) Given a new subscription s′, the main server computes C(s′) from Equation (1). Then the main server assigns s′ to the worker with the minimum C(w).

Figure 3: An example of keyword-oriented partition for two workers w_1 and w_2 (based on objects and subscriptions in Figure 2)

Subscription deletion. For subscription deletion, the main server simply requests the worker that has the corresponding subscription to remove it, then updates C(w). This is also common to our solutions.

We next design space-oriented partition. The most straightforward approach is to partition the data space into m equal-sized subspaces. Clearly, this is not efficient, because some of them have more objects than the others, which also causes load imbalance. We hence consider a space-based cost model below.
Cost estimation.
Consider a set S_r of subscriptions that exist in a subspace r, and let C(S_r) be its cost. Note that C(S_r) can be a probability that the kNN objects of subscriptions in S_r may be updated, given a new object o. Let o_k be the current k-th nearest neighbor object of a subscription s. Furthermore, let B(s) be a ball whose center and radius are respectively s.p and dist(o_k.p, s.p). We see that new objects that are generated within B(s) may become new kNNs of s. Now consider a rectangle R that encloses all balls of S_r. It is also true that new objects that are generated within R may become new kNNs of s ∈ S_r.
The space-based cost also utilizes the distribution of O_init. Given a set O_R of objects existing within R, the probability that a new object is generated within R, P(R), is
$$P(R) = \frac{|O_R|}{|O_{init}|}. \quad (2)$$
Then we define C(S_r) as follows:
$$C(S_r) = P(R) \cdot |S_r|. \quad (3)$$
It can be seen that C(S_r) takes the number of subscriptions into account. Assume that R is small but contains many subscriptions. We see that the kNN update cost of S_r is not small when a new object is generated within R. However, without |S_r|, C(S_r) is small, which contradicts the above intuition. We therefore make Equation (3) an expected value, different from Equation (1).
Subscription partition.
Here, we introduce how to obtain r (or S_r). Let ℛ be the space where objects and subscriptions exist. We partition ℛ in a similar way to a quadtree [19], motivated by a recent empirical evaluation on a spatial-keyword stream that confirms the superiority of quadtree-based space partition [5]. Specifically, we partition ℛ into four equal-sized subspaces and compute C(S_r) for each subspace r. Then we pick the subspace that has the largest C(S_r) and partition it in the same way. This is repeated until we have n ≥ θ · m, where n and θ are the number of subspaces and a threshold (system parameter), respectively.

Figure 4: An example of space-oriented partition for two workers w_1 and w_2 (based on objects and subscriptions in Figure 2). (a) Space-oriented partition for S in Figure 2. (b) Subscriptions that are assigned to w_1. (c) Subscriptions that are assigned to w_2.

Now we have n disjoint subsets of S and determine their assignment in a similar way to Algorithm 1. Note that space-oriented partition considers the assignment of subsets S_r, different from keyword-oriented partition. That is, the input of the greedy algorithm is a collection of subsets S_r.
kNN update.
Space-oriented partition takes a different approach from keyword-oriented partition. Assume that a worker w has a collection 𝒮(w) of S_r. For each S_r ∈ 𝒮(w), we build an inverted file I(S_r) for S_r. This aims at pruning irrelevant subscriptions, i.e., we can prune S_r when a new object is generated within R but does not contain any keywords in S_r.
Given a new object o that is generated within R, w computes subscriptions that contain the keywords in o.ψ by using I(S_r). If there are such subscriptions, w updates their kNNs.
Example 4.
We partition S in Figure 2 into two disjoint subsets for workers w_1 and w_2, based on space-oriented partition. Figure 4 illustrates an example. For simplicity, S is partitioned into four subsets, see Figure 4(a). Assume that their costs are 0.36, 0.11, 0.2, and 0.28, respectively. Then two of the subsets are assigned to w_1 and the other two to w_2, as shown in Figures 4(b) and 4(c), respectively.
Given a new object o that is shown in Figures 4(b) and 4(c), one worker needs to deal with four subscriptions, because o exists within the rectangle of the subset that contains them. Similarly, the other worker needs to consider one subscription.
New subscription.
Our space-oriented partition provides a cost for a set of subscriptions. On the other hand, for new subscriptions, we should provide their respective costs, because the number of new subscriptions at a given time is much smaller than that of the initial set of subscriptions. We therefore estimate the cost of a new subscription based on Equation (2).
Given a new subscription s, the main server computes its kNN among O_init to obtain B(s). (Note that its exact kNN is monitored after s is assigned to a worker, as O_init is not qualified for O(s).) Then the main server has a rectangle R (i.e., a space r) that encloses B(s). Now we can obtain its cost from Equation (2), because S_r = {s}, i.e., Equation (3) becomes Equation (2). How to assign s to a worker is the same as keyword-oriented partition.
Although the above approach can deal with new subscriptions, it loses the property of "space-oriented", because a new subscription s may be assigned to a worker w that does not have subscriptions close to s. This case may degrade the pruning performance, because the data space that w has to care about becomes larger. For example, assume that a new subscription s is registered and its location is a point in one of the subspaces of Figure 4(a). Assume furthermore that s is assigned to a worker that does not hold that subspace; then the space that this worker has to take care of becomes larger.

DkM-SKS
Motivation.
Our baselines partition S based only on either keyword or spatial information. However, given a subspace, a better partition approach is dependent on the space and keyword distributions of the subspace. For example:
Example 5. Figure 5 depicts two data distributions. Black points, dashed circles, and balloons represent the locations of subscriptions s, B(s), and keywords of s, respectively.
Focus on Figure 5(a) and let the solid rectangle show R. We see that B(s) of each subscription s is small, so the entire cost is small if we use space-oriented partition for R, because the pruning probability becomes large. Next, consider Figure 5(b). Each subscription has a large B(s) and it overlaps with the others. For this distribution, space-oriented partition is clearly not a good choice, because the size of the rectangle that encloses each ball does not change much even if R is partitioned.
Motivated by the above observation, DkM-SKS considers a better partitioning approach when it partitions a (sub)set of subscriptions, to minimize the entire load. Then DkM-SKS assigns each subscription to a specific worker based on the greedy algorithm [20] and an additional heuristic.

Figure 5: An example that depicts data distributions for considering a better partitioning approach. Black points, dashed circles, and balloons represent the locations of subscriptions s, B(s), and keywords of s, respectively. (a) A distribution in which space-oriented partition is effective. (b) A distribution in which space-oriented partition is not effective.

DkM-SKS also utilizes O_init to estimate the cost of a subscription. Different from keyword- and space-oriented partition, DkM-SKS considers both space and keyword information. Consider a subset S_r of S, and we have a rectangle R that encloses the balls of subscriptions in S_r.
Based on a similar idea to Equation (2), the probability that an object o is generated within R and there exist the other objects in R which contain a keyword λ ∈ o.ψ is
$$P(R, \lambda) = \frac{|O_{R,\lambda}|}{|O_{init}|}, \quad (4)$$
where O_{R,λ} is a set of objects ∈ O_init that exist within R and contain λ in their keywords. Now take a subscription s ∈ S_r. Equation (4) focuses on a single keyword, so an estimated cost of s is
$$C(s) = \sum_{\lambda \in s.\psi} P(R, \lambda). \quad (5)$$
Then the cost of S_r is defined as
$$C(S_r) = \sum_{s \in S_r} C(s). \quad (6)$$
Note that DkM-SKS provides a cost for both a single subscription and a set of subscriptions.
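Equations (4) to (6) can be read as simple counting over O_init. The sketch below is one possible Python rendering under stated assumptions: objects are `(location, keyword set)` pairs, a rectangle is an axis-aligned tuple `(xmin, ymin, xmax, ymax)`, and a subscription is represented only by its keyword set. The function names and the toy data are hypothetical, and a real implementation would likely use a spatial index rather than a linear scan.

```python
def in_rect(point, rect):
    """True if a 2-dimensional point lies inside the axis-aligned rectangle."""
    x, y = point
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax


def keyword_prob(o_init, rect, keyword):
    """Equation (4): fraction of O_init that lies within R and contains the keyword."""
    hits = sum(1 for loc, kws in o_init if in_rect(loc, rect) and keyword in kws)
    return hits / len(o_init)


def subscription_cost(o_init, rect, sub_keywords):
    """Equation (5): sum of P(R, keyword) over the keywords of one subscription."""
    return sum(keyword_prob(o_init, rect, kw) for kw in sub_keywords)


def subset_cost(o_init, rect, subset):
    """Equation (6): sum of C(s) over the subscriptions in S_r."""
    return sum(subscription_cost(o_init, rect, kws) for kws in subset)


# Hypothetical O_init with four objects; three of them fall inside the rectangle.
toy_o_init = [
    ((1, 1), {"a", "b"}),
    ((2, 2), {"a"}),
    ((9, 9), {"a"}),
    ((3, 1), {"c"}),
]
toy_rect = (0, 0, 4, 4)
toy_subset = [{"a", "b"}, {"c"}]
toy_total = subset_cost(toy_o_init, toy_rect, toy_subset)
```

In this toy input, keyword a has probability 2/4 within the rectangle and keywords b and c have 1/4 each, so the subscription with keywords {a, b} costs 0.75 and the whole subset costs 1.0.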
Subscription partition.
Given S_r, DkM-SKS selects a better approach for S_r by considering the following two partitioning approaches.
Space-only-Partition. Consider a space r where S_r exists. This approach partitions r into equal-sized disjoint subspaces r_i (1 ≤ i ≤ 4). DkM-SKS obtains a set of subscriptions S^s_i that exist in r_i.
Hybrid-Partition. This approach also partitions S_r into four disjoint subsets S^h_i (1 ≤ i ≤ 4), but it is based on C(s), which is defined in Equation (5), and considers both space and keyword information. More specifically, DkM-SKS sorts S_r in descending order of C(s), and runs the greedy algorithm to assign each s ∈ S_r to the subset S^h_i with the minimum Σ_{s ∈ S^h_i} C(s).
We do not consider keyword-only partition, because it does not consider spatial information and cannot reduce the size of R.
We define a better partition as the one with less entire cost than the other after S_r is partitioned. The entire costs C_s and C_h, which are respectively provided by Space-only-Partition and Hybrid-Partition, are defined as
$$C_s = \sum_{1 \le i \le 4} C(S^s_i) \quad \text{and} \quad C_h = \sum_{1 \le i \le 4} C(S^h_i).$$
If C_s < C_h, DkM-SKS selects Space-only-Partition. Otherwise, DkM-SKS selects Hybrid-Partition.
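The greedy step of Hybrid-Partition and the selection rule can be sketched as follows. This is only an illustration under assumptions: subscriptions are identified by hypothetical string ids, their costs from Equation (5) are given in a dictionary, and the names `hybrid_partition` and `choose_partition` are not from the original system. The sketch also leaves out the re-estimation of each resulting subset's cost from its own enclosing rectangle, which would be needed to obtain C_s and C_h as defined above.

```python
def hybrid_partition(costs, parts=4):
    """Greedily split subscriptions into `parts` subsets with balanced total cost.

    Subscriptions are visited in descending order of C(s); each one goes to the
    subset whose current total cost is the smallest.
    """
    subsets = [[] for _ in range(parts)]
    totals = [0.0] * parts
    for sub_id in sorted(costs, key=costs.get, reverse=True):
        target = totals.index(min(totals))
        subsets[target].append(sub_id)
        totals[target] += costs[sub_id]
    return subsets, totals


def choose_partition(c_s, c_h):
    """Selection rule: Space-only-Partition only if C_s is strictly smaller."""
    return "space-only" if c_s < c_h else "hybrid"


# Hypothetical costs for five subscriptions.
toy_sub_costs = {"s1": 4, "s2": 3, "s3": 2, "s4": 1, "s5": 1}
toy_subsets, toy_totals = hybrid_partition(toy_sub_costs)
```

A tie between C_s and C_h falls to Hybrid-Partition here, following the "otherwise" branch in the text.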
Algorithm description.
Now we are ready to introduce how to partition S through DkM-SKS. Algorithm 2 describes the detail. The objective of this algorithm is to obtain at least γ_1 · m subsets of S, where γ_1 is a system parameter.
For ease of presentation, assume that we are given a subset S′ of S. (At initialization, S′ = S.) DkM-SKS considers partitioning S′. Because Hybrid-Partition needs to compute Equation (5) for each subscription in S′, it incurs a large computational cost if |S′| is large. Therefore, if |S′| > γ_2, where γ_2 is also a system parameter, DkM-SKS always utilizes Space-only-Partition to partition S′ into four subsets. On the other hand, if |S′| ≤ γ_2, DkM-SKS tests both Space-only-Partition and Hybrid-Partition. DkM-SKS then selects the result of Space-only-Partition if C_s < C_h. Otherwise, DkM-SKS selects that of Hybrid-Partition. The four subsets obtained by the better partition are inserted into a collection 𝒮 of subsets. After that, 𝒮 is sorted in descending order of the estimated cost of each subset. DkM-SKS checks |𝒮|, and if |𝒮| < γ_1 · m, DkM-SKS picks the subset with the largest cost and repeats the above operations.
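The control flow described above can be sketched with a priority queue in place of repeated sorting. The sketch below is an interpretation under assumptions, not the authors' implementation: the two partitioning routines and the cost function are passed in as callables (`space_only`, `hybrid`, `cost`, all hypothetical names), the parameters are called `gamma1` and `gamma2`, empty pieces are dropped, and the loop stops early when the most expensive subset cannot be split further, a guard that the text does not mention.

```python
import heapq
from itertools import count


def partition_subscriptions(subs, m, gamma1, gamma2, space_only, hybrid, cost):
    """Split `subs` until at least gamma1 * m subsets exist (or no split is possible).

    space_only(subset) and hybrid(subset) each return a list of sub-lists;
    cost(subset) returns the estimated cost of a subset.
    """
    tie = count()  # tie-breaker so that subsets themselves are never compared
    heap = [(-cost(subs), next(tie), subs)]  # max-heap on cost via negation
    while len(heap) < gamma1 * m:
        _, _, current = heapq.heappop(heap)  # subset with the largest cost
        candidates = space_only(current)
        if len(current) <= gamma2:
            # Small enough to afford Hybrid-Partition: keep the cheaper result.
            alternative = hybrid(current)
            c_s = sum(cost(p) for p in candidates)
            c_h = sum(cost(p) for p in alternative)
            if not c_s < c_h:
                candidates = alternative
        pieces = [p for p in candidates if p]
        if len(pieces) <= 1:
            # Assumed guard: the subset cannot be split any further.
            heapq.heappush(heap, (-cost(current), next(tie), current))
            break
        for piece in pieces:
            heapq.heappush(heap, (-cost(piece), next(tie), piece))
    return [subset for _, _, subset in heap]
```

A caller would supply partitioning routines that implement the quadtree-style split and the greedy cost-based split, together with a cost function based on Equation (6). The heap keeps the subset with the largest estimated cost at the front, which plays the role of sorting the collection after every split.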
Subscription assignment.
From the subscription partition, DkM-SKS has a collection 𝒮 of subsets S′. It is important to note that S′ tends to contain subscriptions with close locations and similar keyword sets. That is, given a new object o, the kNN of all subscriptions in S′ may change by o. In this case, assigning each subscription s ∈ S′ to a different worker is better than assigning S′ to a single worker, to exploit the parallel kNN update. We use this heuristic for subscription assignment.
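One way to read this heuristic is as a greedy least-loaded assignment that walks through the subsets in descending order of cost and, within each subset, through the subscriptions in descending order of cost, so that the members of a costly subset tend to be spread over different workers. The sketch below assumes per-subscription costs in a dictionary; the name `assign_subsets` and the toy data are hypothetical.

```python
def assign_subsets(collection, costs, m):
    """Assign every subscription to the currently least-loaded of m workers."""
    loads = [0.0] * m
    assigned = [[] for _ in range(m)]

    def subset_cost(subset):
        # Cost of a subset as the sum of its subscriptions' costs (Equation (6)).
        return sum(costs[s] for s in subset)

    for subset in sorted(collection, key=subset_cost, reverse=True):
        for sub_id in sorted(subset, key=costs.get, reverse=True):
            worker = loads.index(min(loads))
            assigned[worker].append(sub_id)
            loads[worker] += costs[sub_id]
    return assigned, loads


# Hypothetical collection of two subsets and per-subscription costs.
toy_collection = [["s1", "s2"], ["s3", "s4", "s5"]]
toy_assign_costs = {"s1": 1, "s2": 1, "s3": 4, "s4": 2, "s5": 2}
toy_assigned, toy_loads = assign_subsets(toy_collection, toy_assign_costs, 2)
```

In the toy input, the more expensive subset is processed first and its three subscriptions are split across both workers before the cheaper subset is distributed.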
Algorithm description. DkM-SKS employs a similar approach to keyword-oriented partition for subscription assignment. In other words, DkM-SKS uses the greedy algorithm [20], but how to access subscriptions is different. Given 𝒮, we first sort 𝒮 in descending order of cost obtained from Equation (6). Then, for each S′ ∈ 𝒮, we sort S′ as with 𝒮 and run the greedy algorithm. Algorithm 3 elaborates this operation.
kNN Update Algorithm
Actually, DkM-SKS can employ an arbitrary index for updating the kNN of each subscription. This is a good property because it can always make use of a state-of-the-art index. By default, in DkM-SKS, each worker w utilizes a hybrid structure of a grid and an inverted file, because this structure is update-friendly. The grid is a set of cells, and for each cell, we implement an inverted file. More specifically, consider a subscription s ∈ S(w) and a cell c that overlaps with a ball of s, B(s). This cell c maintains s by its inverted file c.I (i.e., c.I[λ] maintains s if λ ∈ s.ψ).
Given a new object o broadcast by the main server, each worker w obtains the cell c to which o is mapped. Then w considers whether or not it needs to update the kNN of subscriptions in S(w) from c.I (i.e., subscriptions that do not contain any keywords in o.ψ are pruned). If necessary, w updates the kNN of the corresponding subscriptions, then updates c.I accordingly.

Algorithm 2: Subscription-Partitioning
Input: S (a set of subscriptions), m workers, γ_1, and γ_2 (system parameters)
  𝒮 ← ⟨S, C(S)⟩
  while |𝒮| < γ_1 · m do
    ⟨S′, C(S′)⟩ ← the front of 𝒮
    𝒮 ← 𝒮 − ⟨S′, C(S′)⟩
    if |S′| > γ_2 then
      𝒮_s ← Space-only-Partition(S′)
      for each S^s ∈ 𝒮_s do
        𝒮 ← 𝒮 ∪ ⟨S^s, C(S^s)⟩
    else
      𝒮_s ← Space-only-Partition(S′)
      𝒮_h ← Hybrid-Partition(S′)
      if C_s < C_h then
        for each S^s ∈ 𝒮_s do
          𝒮 ← 𝒮 ∪ ⟨S^s, C(S^s)⟩
      else
        for each S^h ∈ 𝒮_h do
          𝒮 ← 𝒮 ∪ ⟨S^h, C(S^h)⟩
    Sort 𝒮 in descending order of C(S′)
  return 𝒮

Algorithm 3: Subscription-Assignment for DkM-SKS
Input: 𝒮 (a collection of subsets of S) and m workers
  Set C(w) = 0 for every worker w
  Sort 𝒮 in descending order of cost
  for each S′ ∈ 𝒮 do
    Sort S′ in descending order of cost
    for each s ∈ S′ do
      w ← arg min_{w_i} C(w_i)
      S(w) ← S(w) ∪ {s}, C(w) ← C(w) + C(s)

Figure 6: An example of subscription partitioning and assignment of DkM-SKS. (a) Subscription partitioning of DkM-SKS. (b) Subscriptions assigned to w_1 and the data structure maintained by w_1. (c) Subscriptions assigned to w_2 and the data structure maintained by w_2.

Example 6. DkM-SKS partitions S in Figure 2 into two disjoint subsets for two workers w_1 and w_2. Assume that the table in Figure 6(a) depicts the estimated cost of each subscription in DkM-SKS. Assume furthermore that the result of partitioning is the collection of five subsets that is also shown in Figure 6(a). Following Algorithm 3, the result of subscription assignment of DkM-SKS is four subscriptions for w_1 and six subscriptions for w_2, which are respectively illustrated in Figures 6(b) and 6(c).
Consider a case where a new object o (see Figure 4), which contains keyword b, is generated. Since it is mapped to a single cell, one worker needs to consider only one subscription, which can be seen from the postings list of keyword b in the inverted file of that cell. Similarly, the other worker needs to consider only one subscription. Compared with Example 4, DkM-SKS shows a better load balancing.
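The grid and inverted-file update described for each worker can be sketched as below. This is a simplified illustration under assumptions, not the authors' C++ implementation: the data space is a square `[0, extent)` split into a fixed number of cells per side, a subscription is registered in every cell touched by the bounding box of its ball B(s) (a superset of the cells that the ball overlaps), a subscription whose kNN is not yet full is registered in all cells, and the time-stamp condition of Definition 2 is ignored. The class and method names are hypothetical.

```python
import heapq
import math
from collections import defaultdict


class GridWorker:
    """One worker: a uniform grid whose cells each hold an inverted file."""

    def __init__(self, extent, cells_per_side):
        self.n = cells_per_side
        self.size = extent / cells_per_side
        self.index = defaultdict(lambda: defaultdict(set))  # cell -> keyword -> ids
        self.subs = {}      # id -> (location, keywords, k)
        self.knn = {}       # id -> max-heap of (-distance, object id)
        self.cells_of = {}  # id -> cells in which the subscription is registered

    def _clamp(self, v):
        """Map a coordinate to a cell index, clipped to the grid."""
        return min(self.n - 1, max(0, int(v // self.size)))

    def _ball_cells(self, sid):
        """Cells covered by the bounding box of B(s); all cells if kNN is not full."""
        (x, y), _, k = self.subs[sid]
        heap = self.knn[sid]
        if len(heap) < k:
            lo_x = lo_y = 0
            hi_x = hi_y = self.n - 1
        else:
            r = -heap[0][0]  # distance to the current k-th nearest object
            lo_x, hi_x = self._clamp(x - r), self._clamp(x + r)
            lo_y, hi_y = self._clamp(y - r), self._clamp(y + r)
        return {(i, j) for i in range(lo_x, hi_x + 1) for j in range(lo_y, hi_y + 1)}

    def _reindex(self, sid):
        """Refresh the postings lists after B(s) has changed."""
        keywords = self.subs[sid][1]
        for cell in self.cells_of.get(sid, ()):
            for kw in keywords:
                self.index[cell][kw].discard(sid)
        self.cells_of[sid] = self._ball_cells(sid)
        for cell in self.cells_of[sid]:
            for kw in keywords:
                self.index[cell][kw].add(sid)

    def add_subscription(self, sid, location, keywords, k):
        self.subs[sid] = (location, frozenset(keywords), k)
        self.knn[sid] = []
        self._reindex(sid)

    def on_object(self, oid, location, keywords):
        """Process a broadcast object; return the ids whose kNN changed."""
        cell = (self._clamp(location[0]), self._clamp(location[1]))
        candidates = set()
        for kw in keywords:  # subscriptions sharing no keyword are pruned here
            candidates |= self.index[cell].get(kw, set())
        updated = []
        for sid in candidates:
            (sx, sy), _, k = self.subs[sid]
            d = math.hypot(location[0] - sx, location[1] - sy)
            heap = self.knn[sid]
            if len(heap) < k:
                heapq.heappush(heap, (-d, oid))
            elif d < -heap[0][0]:
                heapq.heapreplace(heap, (-d, oid))
            else:
                continue
            self._reindex(sid)
            updated.append(sid)
        return sorted(updated)
```

For instance, after registering a subscription with k = 1 and feeding it a far object followed by a near one, the subscription ends up registered only in the few cells around its location, so later objects mapped to distant cells no longer reach it.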
Recall that the estimated cost of a subscription s is obtained from S_r ⊆ S (s ∈ S_r), as described in Equations (4) and (5). When a new subscription s_n is registered, it does not belong to any subset of S. A straightforward approach to providing an estimated cost for s_n is to re-conduct Algorithm 2. It is obvious that this approach is very computationally expensive. However, it is desirable that s_n is handled as if s_n had been registered at the initial time. We therefore take an approximate approach to estimating the cost of s_n.
Consider a collection 𝒮 of subsets of S obtained by Algorithm 2. The main server maintains R for each subset S_r ∈ 𝒮. Given a new subscription s_n, we first do the same operation as space-oriented partition: the main server computes its kNN among O_init, and then computes the rectangle R_n that encloses B(s_n). Let R ∩ R_n be the overlapping area between R and R_n. Now |R ∩ R_n| can be the overlapped area size. The main server computes
$$S^* = \arg\max_{S_r \in \mathcal{S}} |R \cap R_n|. \quad (7)$$
(In practice, |𝒮| is small, so the cost of this computation is trivial.) Let R^* be the rectangle of S^*, i.e., R^* is the rectangle that overlaps with R_n the most. By using Equation (5) with R^*, s_n obtains its estimated cost. Then s_n is assigned to the worker w with the minimum cost C(w).

We conducted experiments on a cluster of six machines. One of them is a main server with a 3.0GHz Intel Xeon Gold and 512GB RAM. The others are equipped with a 6-core 2.4GHz Intel Core i7-8700T and 32GB RAM. We used one core as a worker. The main server and workers communicate via a 1Gbps Ethernet network. As with [33], we set |O_init| = [...] S into m workers. After that, we generated 1,000 objects and requested 100 subscription insertions and deletions for each time-stamp.
Dataset.
We used two real spatial-keyword stream datasets, Place [1] and Twitter [2]. Table 1 shows the statistics of these datasets. We generated subscriptions for each dataset so that they follow the distribution of the corresponding dataset [32]. When a subscription $s$ was generated, we randomly picked one object to determine its location $s.p$ and then picked at most five keywords uniformly at random from the keyword set of that object to obtain the keyword set of $s$. The value of $s.k$ was a random integer no larger than $k_{\max}$.

Algorithm.
We evaluated the following algorithms:
• PS Stream [14]: a state-of-the-art algorithm for continuous spatial-keyword range queries. We extended the original algorithm so that it can deal with our problem.
• KOP: our first baseline, keyword-oriented partition, introduced in Section 3.1.
• SOP: our second baseline, space-oriented partition, introduced in Section 3.2.
• DkM-SKS: our proposed solution in this paper.
All algorithms were implemented in C++.
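As an illustration of how DkM-SKS handles a newly registered subscription (Equation (7) followed by the worker assignment), the Python sketch below picks the subset whose rectangle overlaps the rectangle of the new subscription the most, reuses a per-subscription cost stored with that subset, and assigns the subscription to the least-loaded worker. The names `Rect`, `overlap_area`, and `estimate_and_assign` are hypothetical, and the stored cost value is only a stand-in for Equation (5), which is not reproduced here. This is a minimal sketch under those assumptions, not the authors' C++ implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle with corners (x1, y1) and (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float


def overlap_area(a: Rect, b: Rect) -> float:
    """Size of the overlapping area of two rectangles (0 if disjoint)."""
    width = min(a.x2, b.x2) - max(a.x1, b.x1)
    height = min(a.y2, b.y2) - max(a.y1, b.y1)
    return width * height if width > 0 and height > 0 else 0.0


def estimate_and_assign(new_rect, subsets, worker_cost):
    """Assign a new subscription to a worker.

    subsets: list of (Rect, cost) pairs; `cost` is a hypothetical
        precomputed per-subscription cost standing in for Equation (5).
    worker_cost: dict mapping worker id -> current total cost C(w).
    Returns (chosen worker id, estimated cost of the new subscription).
    """
    # Equation (7): the subset whose rectangle overlaps new_rect the most.
    _, cost = max(subsets, key=lambda rc: overlap_area(rc[0], new_rect))
    # Assign to the worker with the minimum current cost and update it.
    worker = min(worker_cost, key=worker_cost.get)
    worker_cost[worker] += cost
    return worker, cost


subsets = [(Rect(0, 0, 4, 4), 2.0), (Rect(3, 3, 10, 10), 5.0)]
loads = {"w1": 10.0, "w2": 7.5, "w3": 9.0}
print(estimate_and_assign(Rect(3, 3, 6, 6), subsets, loads))
```

In this toy input the new rectangle overlaps the second subset more (area 9 versus 1), so the second subset's cost is reused, and the subscription goes to the worker with the smallest accumulated cost.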
Parameter.
The default values of the number of workers, $k_{\max}$, and the initial $|S|$ are 20, 10, and 10,000,000, respectively. When we investigated the impact of a given parameter, the other parameters were fixed. In addition, we set π = πΎ = , πΎ = 20 from preliminary experiments.
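The subscription generation described under Dataset can be sketched as follows. This is an illustrative Python sketch, not the authors' generator: the function name `generate_subscription`, the dictionary layout of objects and subscriptions, and the lower bound of 1 for both the number of keywords and the value of k are assumptions.

```python
import random


def generate_subscription(objects, k_max, rng, max_keywords=5):
    """Derive one subscription from a randomly chosen object.

    objects: list of dicts with a location "p" and a non-empty set "keywords".
    Assumption: at least one keyword is picked and k is at least 1.
    """
    obj = rng.choice(objects)
    pool = sorted(obj["keywords"])  # sorted for reproducible sampling
    n = rng.randint(1, min(max_keywords, len(pool)))
    return {
        "p": obj["p"],                          # location of the chosen object
        "keywords": set(rng.sample(pool, n)),   # at most five of its keywords
        "k": rng.randint(1, k_max),             # random k up to k_max
    }


objects = [
    {"p": (34.7, 135.5), "keywords": {"sushi", "noodle", "japanese"}},
    {"p": (35.0, 135.8), "keywords": {"pasta"}},
]
rng = random.Random(0)
# 100 insertions per time-stamp, with the default k_max = 10.
subscriptions = [generate_subscription(objects, 10, rng) for _ in range(100)]
```

Because each subscription copies the location and keywords of an existing object, the generated subscriptions follow the spatial and textual distribution of the dataset itself.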
Criteria.
To evaluate the performance of each algorithm, we measured the following criteria:
• Update time: this is the average computation time for updating the kNN objects of all registered subscriptions, plus the time for dealing with subscription insertions and deletions, per time-stamp.
• Load balance: this is the average difference between the maximum and the minimum time to finish the kNN update among workers, per time-stamp.

[Figure 7: Update time as a function of time. Panels: (a) Place, (b) Twitter. x-axis: time-stamp; y-axis: update time [msec]. Legend: PS Stream, KOP, SOP, DkM-SKS.]

Table 1: Dataset statistics
Dataset       Place       Twitter
Cardinality   9,356,750   20,000,000
Distinct
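The update-time and load-balance criteria can be computed from per-time-stamp measurements as in the following Python sketch. The function names and the input layout (one list of per-worker kNN-update times per time-stamp) are assumptions made for illustration; how the timings are collected is not specified here.

```python
from statistics import fmean


def average_update_time(update_times_msec):
    """Average update time per time-stamp [msec]."""
    return fmean(update_times_msec)


def load_balance(worker_times_per_timestamp):
    """Average, over time-stamps, of (slowest worker - fastest worker) [msec]."""
    return fmean(max(times) - min(times) for times in worker_times_per_timestamp)


# Two time-stamps, three workers each (hypothetical numbers).
worker_times = [[120.0, 100.0, 110.0], [90.0, 95.0, 130.0]]
print(load_balance(worker_times))                    # differences 20 and 40
print(average_update_time([500.0, 520.0, 480.0]))
```

A smaller load-balance value means that the workers finish their kNN updates at nearly the same time, which is the behavior the cost model aims for.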
Justification of using $O_{init}$ and analysis. We first empirically demonstrate that cost estimation based on $O_{init}$ functions well. Figure 7 depicts the time series of the update time of each algorithm. The update time of DkM-SKS does not vary on either Place or Twitter, even as new objects are given, new subscriptions are inserted, and some subscriptions are removed. This suggests that DkM-SKS keeps the load balanced and that its cost estimation yields this result. Table 2 shows that the load of each worker in DkM-SKS is actually balanced.

We see that PS Stream also has this tendency. However, this result does not mean that PS Stream balances the load. We observed that the initial partition of PS Stream is very imbalanced, i.e., one worker $w$ has a very heavy load at initialization. Because of this, new subscriptions are assigned to the other workers, but this does not overcome the load imbalance. Therefore, the update time of PS Stream is simply determined by the load of $w$. This is also confirmed by Table 2, which shows that the load in PS Stream is significantly imbalanced.

Next, we focus on KOP. This algorithm also has a similar result. From Figure 7 and Table 2, we see that the load balance of KOP is not so bad. Because both subscription partition and kNN update in KOP consider only keyword information, they go well together. However, KOP is outperformed by DkM-SKS, which considers both spatial and keyword information. This result confirms the effectiveness of the partitioning approach in DkM-SKS.

Let us consider SOP here. Figure 7 shows that, unlike the other algorithms, the update time of SOP increases as time goes by. This is due to the low accuracy of its cost estimation for new subscriptions. Specifically, given a new subscription, its cost estimated by SOP is usually small, although it is large in practice. Because of this, a worker that receives new subscriptions becomes heavily loaded and turns into the bottleneck of the system. Table 2 also demonstrates this fact.

Table 2: Load balance [msec] (default parameters)
Algorithm   PS Stream   KOP      SOP        DkM-SKS
Place       27490.11    549.23   14037.33   32.08
Twitter     21962.50    486.97   12265.40   51.03

Table 3: Decomposed time [msec] on Place
Algorithm           PS Stream   KOP       SOP        DkM-SKS
kNN update          32892.90    7684.48   18549.04   471.00
Subscription ins.   1.51        1.59      1110.65    35.36
Subscription del.   1.54        1.42      0.79       1.10

Last, we investigate the detail of the update time. Table 3 decomposes the update time of each algorithm on Place into kNN update time, subscription insertion time, and subscription deletion time, each of which includes index update time. The result on Twitter is omitted, because its tendency is similar to that on Place. It can be seen that the main part of the update time is the kNN update time, and that subscription deletion incurs a trivial cost. In kNN update time, DkM-SKS significantly outperforms (is more than 10 times faster than) the other algorithms and exploits the available workers to reduce this time. We see that the subscription insertion time of DkM-SKS is longer than those of KOP and PS Stream. This is because DkM-SKS needs to compute Equation (7), which incurs a higher cost than Equation (2). Also, we can observe that the subscription insertion time of SOP is much longer than those of the others. As explained earlier, (most) new subscriptions are assigned to a single worker $w$. Hence $w$ incurs a long index update time.

Varying the number of workers. We next study the impact of the number of workers. Figure 8 depicts the result. Since PS Stream is significantly outperformed by DkM-SKS, we omit its result.

We see that each algorithm reduces its update time as the number of workers increases. This is an intuitive result, since subscriptions are distributed to more workers. The load balances of KOP and DkM-SKS are not affected much by the number of workers, since their cost estimations yield a balanced load. On the other hand, as the number of workers increases, the load balance of SOP decreases. The reason is simple: the update time of the worker with the largest load becomes shorter as the number of workers increases.

Varying $|S|$. To investigate the scalability of each algorithm, we studied the influence of $|S|$. Figure 9 shows the result.
Due to the load imbalance, the update time of SOP is long, even when $|S|$ is small. KOP and DkM-SKS scale linearly with $|S|$, and DkM-SKS always outperforms KOP. Note that the linear scalability shows that their subscription assignment functions well. From Figs. 9(c) and 9(d), we see that the load balance of each algorithm normally becomes larger as $|S|$ increases. The increase is, however, trivial. For example, in DkM-SKS on Twitter, the difference is only 40 [msec] between the cases of $|S|$ = . · and $|S|$ = · .

Varying $k_{\max}$. Last, we study the impact of $k_{\max}$, and the result is shown in Figure 10. The update time of KOP and DkM-SKS increases slightly as $k_{\max}$ increases. This is also a reasonable result, because, as $k_{\max}$ increases, the probability that new objects update the kNN of some subscriptions becomes higher. SOP shows a different pattern from KOP and DkM-SKS, because of its load imbalance. It can be seen that the update time of SOP is simply determined by its load balance.

[Figure 8: Impact of the number of workers. Panels: (a) Update time (Place), (b) Update time (Twitter), (c) Load balance (Place), (d) Load balance (Twitter). x-axis: number of workers; y-axis: update time [msec] or load balance [msec]. Legend: KOP, SOP, DkM-SKS.]

[Figure 9: Impact of $|S|$. Panels: (a) Update time (Place), (b) Update time (Twitter), (c) Load balance (Place), (d) Load balance (Twitter). x-axis: number of subscriptions [M]; y-axis: update time [msec] or load balance [msec].]

[Figure 10: Impact of $k_{\max}$. Panels: (a) Update time (Place), (b) Update time (Twitter), (c) Load balance (Place), (d) Load balance (Twitter). x-axis: $k_{\max}$; y-axis: update time [msec] or load balance [msec].]
Spatial-keyword search.
Due to the prevalence of spatial-keyword objects, algorithms for efficiently searching them have been extensively devised [11]. For indexing a given dataset, these algorithms employ hybrid structures of spatial indices, e.g., R-tree and quadtree, and textual indices, such as inverted files [17, 38]. A famous index is the IR-tree [17]. This is essentially an R-tree, but each node contains an inverted file. The hybrid structure of grid and inverted file, which is employed by DkM-SKS, is derived from the IR-tree. An R-tree is not update-efficient, because it partitions the data space based on a given dataset. We therefore did not employ R-tree-like structures. It is also important to notice that these works consider snapshot queries and static datasets, whereas we consider continuous queries and streaming data.
Some related queries support moving users [36], find potential users [16], and analyze the result of spatial-keyword search [12]. These works are also totally different from our problem.
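To illustrate why a grid combined with inverted files is cheap to update and lets a worker prune irrelevant subscriptions, the sketch below keeps one inverted file (keyword to subscription ids) per grid cell. The class `GridInvertedIndex`, its method names, and the choice to register a subscription in every cell touched by the bounding box of its current kNN circle are assumptions for illustration, not the exact index of DkM-SKS.

```python
import math
from collections import defaultdict


class GridInvertedIndex:
    """Uniform grid in which each cell holds an inverted file."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        # cell coordinates -> keyword -> set of subscription ids
        self.cells = defaultdict(lambda: defaultdict(set))

    def _cell(self, x, y):
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def register(self, sub_id, center, radius, keywords):
        """Register a subscription in all cells overlapping its bounding box."""
        cx0, cy0 = self._cell(center[0] - radius, center[1] - radius)
        cx1, cy1 = self._cell(center[0] + radius, center[1] + radius)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for kw in keywords:
                    self.cells[(cx, cy)][kw].add(sub_id)

    def candidates(self, location, keywords):
        """Subscriptions a new object may affect (a superset, to be verified)."""
        cell = self.cells.get(self._cell(*location))
        if cell is None:
            return set()
        found = set()
        for kw in keywords:
            found |= cell.get(kw, set())
        return found


index = GridInvertedIndex(cell_size=10.0)
index.register("s1", (5, 5), 3, {"sushi", "noodle"})
index.register("s2", (12, 5), 4, {"pasta"})
print(index.candidates((7, 7), {"sushi"}))
```

A new object is mapped to exactly one cell, and only the inverted lists of its own keywords in that cell are read. Registration and removal touch a fixed set of cells, with no data-dependent restructuring of the kind an R-tree needs.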
Distributed query processing system.
Recently, distributed spatial query processing systems have been developed on Hadoop, Spark, and Storm. (DkM-SKS is orthogonal to these systems.) For example, Hadoop-GIS [4] and SpatialHadoop [18] support efficient processing of spatial queries, e.g., range and kNN queries, in MapReduce environments. However, they do not consider keyword information and cannot deal with our problem.
Tornado [26] is a system based on Storm [3] and supports spatio-textual queries. The main focus of this system is to achieve efficient spatial-keyword query processing, not to support massive subscriptions. Hence, it is not trivial for Tornado to provide subscription partitioning for continuous spatial-keyword kNN queries. SSTD [13] is also a system that supports spatio-textual queries on streaming data. However, SSTD imposes the condition that an object has to contain all keywords specified by a query to match the query. This is too strict and can result in no matching objects.
Location-aware Pub/Sub.
There are many studies that addressed the problem of dealing with spatio-textual subscriptions. The works in [23, 25, 27, 34] considered continuous boolean range queries as subscriptions. Although PS Stream [14] also deals with boolean range queries, it is the work most related to ours, because it assumes the same distributed setting as ours. Our empirical study has demonstrated that the cost model proposed in [14] is not efficient for our problem and that DkM-SKS significantly outperforms PS Stream.
Some studies [10, 28, 29, 33] also tackled the problem of spatio-textual kNN (or top-k) monitoring. [10] considers a decay model for streaming data, while [29, 33] consider a sliding-window model. In addition, they consider an aggregation function for object scoring, i.e., spatial proximity and keyword (textual) similarity are aggregated into a score through a weighting parameter $\alpha$. Based on this scoring function, they monitor the top-k objects for each subscription. Their techniques are specific to this scoring function and their assumed streaming model (decay or sliding-window), and thus cannot deal with our problem. Besides, it is well known that specifying an appropriate $\alpha$ is generally hard for ordinary users [21]. We therefore consider boolean-based kNN monitoring, which is more user-friendly.
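To make boolean-based kNN monitoring concrete, the following Python sketch maintains the k nearest matching objects of a single subscription as new objects arrive. No weighting parameter is involved: an object either matches the keywords or it does not, and matching objects are ranked by distance alone. The class `Subscription`, the rule that an object matches when it shares at least one keyword with the subscription, and the omission of object expiration are assumptions of this sketch rather than details taken from the paper.

```python
import heapq
import math


class Subscription:
    """Keeps the k nearest keyword-matching objects seen so far."""

    def __init__(self, location, keywords, k):
        self.location = location
        self.keywords = set(keywords)
        self.k = k
        self._heap = []  # max-heap on distance via (-distance, object id)

    def matches(self, obj_keywords):
        # Assumed boolean semantics: at least one shared keyword.
        return not self.keywords.isdisjoint(obj_keywords)

    def offer(self, obj_id, obj_location, obj_keywords):
        """Process a new object; return True if the kNN result changed."""
        if not self.matches(obj_keywords):
            return False
        d = math.dist(self.location, obj_location)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-d, obj_id))
            return True
        if d < -self._heap[0][0]:  # closer than the current k-th neighbor
            heapq.heapreplace(self._heap, (-d, obj_id))
            return True
        return False

    def knn(self):
        """Object ids ordered from nearest to farthest."""
        return [obj_id for _, obj_id in sorted(self._heap, reverse=True)]


s = Subscription((0, 0), {"sushi"}, k=2)
s.offer("o1", (3, 4), {"sushi", "japanese"})
s.offer("o2", (1, 0), {"ramen"})   # no shared keyword, ignored
s.offer("o3", (6, 8), {"sushi"})
s.offer("o4", (0, 2), {"sushi"})   # replaces the farther object o3
print(s.knn())
```

The distance to the current k-th neighbor acts as a pruning threshold: any new object farther than it can be discarded without touching the result.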
In this paper, to scale well to massive objects and subscriptions in location-aware Pub/Sub environments, we proposed DkM-SKS, a distributed solution to the problem of spatial-keyword kNN monitoring of massive subscriptions. DkM-SKS employs a new cost model to effectively reflect the load of a given subscription. Besides, DkM-SKS partitions a set of subscriptions so that the entire load becomes as small as possible, and then assigns each subscription to a specific worker while considering load balancing. We conducted experiments on two real datasets, and the results demonstrate that DkM-SKS outperforms the baselines and a state-of-the-art algorithm and scales well to massive subscriptions.
ACKNOWLEDGMENTS
This research is partially supported by JSPS Grant-in-Aid for Scientific Research (A) Grant Number 18H04095.
REFERENCES
PVLDB, 6(11):1009–1020, 2013.
[5] A. Almaslukh and A. Magdy. Evaluating spatial-keyword queries on streaming data. In SIGSPATIAL, pages 209–218, 2018.
[6] D. Amagata and T. Hara. Diversified set monitoring over distributed data streams. In DEBS, pages 1–12, 2016.
[7] D. Amagata and T. Hara. Identifying the most interactive object in spatial databases. In ICDE, pages 1286–1297, 2019.
[8] D. Amagata, T. Hara, and S. Nishio. Distributed top-k query processing on multi-dimensional data with keywords. In SSDBM, pages 10:1–10:12, 2015.
[9] D. Amagata, T. Hara, and M. Onizuka. Space filling approach for distributed processing of top-k dominating queries. IEEE Transactions on Knowledge and Data Engineering, 30(6):1150–1163, 2018.
[10] L. Chen, G. Cong, X. Cao, and K.-L. Tan. Temporal spatial-keyword top-k publish/subscribe. In ICDE, pages 255–266, 2015.
[11] L. Chen, G. Cong, C. S. Jensen, and D. Wu. Spatial keyword query processing: an experimental evaluation. PVLDB, 6(3):217–228, 2013.
[12] L. Chen, X. Lin, H. Hu, C. S. Jensen, and J. Xu. Answering why-not questions on spatial keyword top-k queries. In ICDE, pages 279–290, 2015.
[13] Y. Chen, Z. Chen, G. Cong, A. R. Mahmood, and W. G. Aref. Sstd: A distributed system on streaming spatio-textual data. PVLDB, 13(11):2284–2296.
[14] Z. Chen, G. Cong, Z. Zhang, T. Z. Fuz, and L. Chen. Distributed publish/subscribe query processing on the spatio-textual data stream. In ICDE, pages 1095–1106, 2017.
[15] F. M. Choudhury, J. S. Culpepper, Z. Bao, and T. Sellis. Batch processing of top-k spatial-textual queries. ACM Transactions on Spatial Algorithms and Systems, 3(4):1–40, 2018.
[16] F. M. Choudhury, J. S. Culpepper, T. Sellis, and X. Cao. Maximizing bichromatic reverse spatial and textual k nearest neighbor queries. PVLDB, 9(6):456–467, 2016.
[17] G. Cong, C. S. Jensen, and D. Wu. Efficient retrieval of the top-k most relevant spatial web objects. PVLDB, 2(1):337–348, 2009.
[18] A. Eldawy and M. F. Mokbel. Spatialhadoop: A mapreduce framework for spatial data. In ICDE, pages 1352–1363, 2015.
[19] R. A. Finkel and J. L. Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 4(1):1–9, 1974.
[20] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM journal on Applied Mathematics, 17(2):416–429, 1969.
[21] Z. He and E. Lo. Answering why-not questions on top-k queries. IEEE Transactions on Knowledge and Data Engineering, 26(6):1300–1315, 2012.
[22] H. Hu, Y. Liu, G. Li, J. Feng, and K.-L. Tan. A location-aware publish/subscribe framework for parameterized spatio-textual subscriptions. In ICDE, pages 711–722, 2015.
[23] G. Li, Y. Wang, T. Wang, and J. Feng. Location-aware publish/subscribe. In KDD, pages 802–810, 2013.
[24] S. Luo, Y. Luo, S. Zhou, G. Cong, J. Guan, and Z. Yong. Distributed spatial keyword querying on road networks. In EDBT, pages 235–246, 2014.
[25] A. R. Mahmood, A. M. Aly, and W. G. Aref. Fast: frequency-aware indexing for spatio-textual data streams. In ICDE, pages 305–316, 2018.
[26] A. R. Mahmood, A. M. Aly, T. Qadah, E. K. Rezig, A. Daghistani, A. Madkour, A. S. Abdelhamid, M. S. Hassan, W. G. Aref, and S. Basalamah. Tornado: A distributed spatio-textual stream processing system. PVLDB, 8(12):2020–2023, 2015.
[27] A. R. Mahmood, A. Daghistani, A. M. Aly, M. Tang, S. Basalamah, S. Prabhakar, and W. G. Aref. Adaptive processing of spatial-keyword data over a distributed streaming cluster. In SIGSPATIAL, pages 219–228, 2018.
[28] S. Nishio, D. Amagata, and T. Hara. Geo-social keyword top-k data monitoring over sliding window. In DEXA, pages 409–424, 2017.
[29] S. Nishio, D. Amagata, and T. Hara. Lamps: Location-aware moving top-k pub/sub. IEEE Transactions on Knowledge and Data Engineering, 2020.
[30] M. Qiao, J. Gan, and Y. Tao. Range thresholding on streams. In SIGMOD, pages 571–582, 2016.
[31] S. Tsuruoka, D. Amagata, S. Nishio, and T. Hara. Distributed spatial-keyword knn monitoring for location-aware pub/sub. In International Conference on Advances in Geographic Information Systems, pages 111–114, 2020.
[32] X. Wang, W. Zhang, Y. Zhang, X. Lin, and Z. Huang. Top-k spatial-keyword publish/subscribe over sliding window. The VLDB Journal, 26(3):301–326, 2017.
[33] X. Wang, Y. Zhang, W. Zhang, X. Lin, and Z. Huang. Skype: top-k spatial-keyword publish/subscribe over sliding window. PVLDB, 9(7):588–599, 2016.
[34] X. Wang, Y. Zhang, W. Zhang, X. Lin, and W. Wang. Ap-tree: Efficiently support continuous spatial-keyword queries over stream. In ICDE, pages 1107–1118, 2015.
[35] X. Wang, Y. Zhang, W. Zhang, X. Lin, and W. Wang. Ap-tree: efficiently support location-aware publish/subscribe. The VLDB Journal, 24(6):823–848, 2015.
[36] D. Wu, M. L. Yiu, and C. S. Jensen. Moving spatial keyword queries: Formulation, methods, and analysis. ACM Transactions on Database Systems, 38(1):1–47, 2013.
[37] S. Yoon, J.-G. Lee, and B. S. Lee. Nets: extremely fast outlier detection from a data stream via set-based processing. PVLDB, 12(11):1303–1315, 2019.
[38] C. Zhang, Y. Zhang, W. Zhang, and X. Lin. Inverted linear quadtree: Efficient top k spatial keyword search.