A hybrid index model for efficient spatio-temporal search in HBase


arXiv [cs.DB], May

Chengyuan Zhang†, Lei Zhu†, Jun Long†, Shuangqiao Lin†, Zhan Yang†, Wenti Huang†
† School of Information Science, Central South University, PR China
{cyzhang,leizhu,jlong,linshq,zyang22,174601025}@csu.edu.cn

Abstract.

With advances in geo-positioning technologies and geo-location services, a rapidly growing amount of spatio-temporal data is being collected in many applications, such as location-aware devices and wireless communication, in which an object is described by its spatial location and its timestamp. Consequently, the study of spatio-temporal search, which explores both the geo-location information and the temporal information of the data, has attracted significant attention from research organizations and commercial communities. This work studies the problem of spatio-temporal $k$-nearest neighbors search (ST$k$NNS), which is fundamental to spatio-temporal queries. Based on HBase, a novel index structure is proposed, called Hybrid Spatio-Temporal HBase Index (HSTI for short), which is carefully designed and takes both spatial and temporal information into consideration to effectively reduce the search space. Based on HSTI, an efficient algorithm is developed to deal with spatio-temporal $k$-nearest neighbors search. Comprehensive experiments on real and synthetic data clearly show that HSTI is three to five times faster than the state-of-the-art technique.

Keywords: hybrid index; spatio-temporal; $k$-NN; HBase

Massive amounts of data, which include both geo-location and temporal information, are being generated at an unparalleled scale on the Web. For example, more than 3.2 billion comments are posted to Facebook every day [1], while more than 400 million daily tweets containing texts and images [17,27,25,16,33,21] are generated by 140 million active Twitter users [4]. Combined with the advances in location-aware devices [38,37,32,30] (GPS-enabled devices, RFIDs, etc.) and wireless communication [11,19,17], spatio-temporal data storage and processing have entered a new age. There is an increasing demand to manage spatio-temporal data in many applications, such as wireless sensor networks (WSNs) [10,29] and spatio-temporal multimedia retrieval [26,20,24,28,15]. Consequently, the study of spatio-temporal search, which explores both the geo-location information and the temporal information of the data, has attracted significant attention from research organizations and commercial communities.

This work investigates the problem of spatio-temporal $k$-nearest neighbors search (ST$k$NNS), which is applied in a variety of applications, such as spatio-temporal database management systems (STDBM), location-based web services, spatio-temporal information based recommender systems and industrial detection systems. For example, when a pipe leakage occurs in a residential area, the water inspector has to query the spatio-temporal data for a given region and a given time interval in order to confirm the leakage location and cause as soon as possible. However, a large amount of data may satisfy the given spatio-temporal constraint. Thus, the water inspector might wish to limit the query answer to the $k$ nearest spatio-temporal data objects.

Challenges.

There are three key challenges in spatio-temporal $k$-NN queries. Firstly, vast amounts of data, typically on the order of terabytes or even petabytes, are uploaded to the service. Thus, cloud storage systems, which can exploit a distributed hash table (DHT) approach to index data, should be adopted. Secondly, current cloud storage systems, such as HBase, provide a key-value store, but they do not yet natively support multi-attribute queries, which greatly restricts their application. Hence, it is important to design a transformation mechanism that converts multi-dimensional values to one-dimensional values. Thirdly, novel techniques need to be created to design a spatio-temporal indexing scheme that supports spatial pruning and temporal pruning simultaneously.

To the best of our knowledge, [34] is the only existing work that systematically studies the problem of spatio-temporal search based on HBase. STEHIX is proposed to match spatio-temporal data to the relevant spatial region and time interval. Although spatial and temporal information are taken into consideration during index construction, it is subject to two fundamental shortcomings. Firstly, separate spatial and temporal indices are built during index construction, regardless of the connections between space regions and time intervals. One of our key observations is that the combination of space region and time interval can significantly reduce the number of candidates that satisfy the query constraint, as a mass of unrelated data is excluded. Hence, to achieve better performance, an indexing mechanism should integrate both spatial and temporal information. Secondly, the spatial and temporal information are fully decoupled in STEHIX, which costs a lot of unnecessary I/O accesses, as most false positive results are caused by small overlapping cells.

Based on the above observations, a novel index technique is proposed, namely Hybrid Spatio-Temporal HBase Index (HSTI for short), to effectively organize spatio-temporal data. In brief, HSTI is a two-layered structure that follows the retrieval mechanism of HBase. In the first layer, the whole space is partitioned into equal-size cells, and a space-filling curve technique, Z-order, is employed to map these two-dimensional cells to one-dimensional sequence numbers. The Z-ordering values of these cells are used as the prefix of the row key and maintained in the META table. In the second layer, to effectively partition the spatio-temporal data, a three-dimensional tree structure is designed, named Z-Octree. To further improve the performance of Z-Octree, a minimum bounding rectangle optimization strategy is also designed to check non-fully overlapping cells: a non-fully overlapping cell is accessed further if and only if it passes the check. Comprehensive experiments demonstrate that HSTI achieves significant improvement compared with previous work.

Contributions.

The principal contributions of this work are summarized as follows.
– A novel Hybrid Spatio-Temporal HBase Index is devised to deal with the problem of spatio-temporal query. As far as we know, this is the first spatio-temporal indexing mechanism that integrates spatial and temporal information during index construction.
– Based on HSTI, an efficient spatio-temporal $k$-nearest neighbors query algorithm is developed.
– Comprehensive experiments on real and synthetic datasets demonstrate that our new index achieves substantial improvements over the state-of-the-art technique.

Roadmap.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 describes the data model and the index structure. Section 4 presents algorithms and optimization strategies for search and refinement. Extensive experiments are reported in Section 5. Finally, Section 6 concludes the paper.

With the emergence of the era of big data [9,23,31], relational DBMSs cannot cope with the increasing volume of data because of their low insertion rate and insufficient scalability. To manage and process multidimensional spatial data efficiently, tree-structured spatial indices, such as R-Tree [7], R*-Tree [2], Quad-Tree [5,29] and Kd-Tree [3], are widely used in traditional DBMSs.

Recently, spatial data with temporal attributes has become one of the largest volumes of data collected by web services. Traditional relational DBMSs can no longer handle this quantity, and thus researchers have studied many NoSQL database implementations for scaling datastores horizontally. Fox et al. [6] presented a spatio-temporal index built on top of Accumulo to store and search spatio-temporal data sets efficiently.

To process large volumes of data, SpatialHadoop [4,18] and Hadoop-GIS [1,22], which are based on MapReduce, are widely used. These systems can efficiently support high-performance spatial queries. However, they cannot be used directly in real-time systems, as they do not take temporal constraints into consideration.

Spatial indices have also been extended to NoSQL-based solutions. To partition the space on top of HBase, Nishimura et al. [13] built a multi-dimensional index layer, called MD-HBase, by using multidimensional index structures (K-d Tree and Quad-Tree). Linearization techniques (Z-ordering [12]) are used to convert multidimensional data to one dimension. However, MD-HBase does not index the inner storage structure of slave nodes and only provides an index layer in the META table. Hence, full scan operations are executed in each slave node, which reduces its efficiency.

Hsu et al. [8] presented a novel key formulation scheme for spatial indexing in HBase, called KR+-tree. An R+-tree is used to divide the data into disjoint rectangles, while a grid is used for further division. Then, the Hilbert curve is exploited to encode the grid cells. During processing, KR+-tree first searches the KeyTable for the rectangle cells that satisfy the query constraint. Then, it finds the corresponding data according to the rectangles. However, scan operations still need to be executed in slave nodes, because the lookup mechanism of HBase is not considered.

Zhang et al. [39] proposed a scalable spatial data storage based on HBase, called HBaseSpatial. Compared with MongoDB and MySQL, HBase shows better performance when searching complex spatial data, especially the MultiLineString and LineString data types. But this storage model only supports spatial queries in HBase.

To the best of our knowledge, [34] is the only work that takes both spatial information and temporal information into consideration. In [34], Chen et al. proposed a spatio-temporal index scheme, called STEHIX (Spatio-TEmporal HBase IndeX), based on HBase. However, the spatial information and temporal information are not considered as a whole.

This section first presents the problem description, then gives an overview of the storage model based on HBase, and finally introduces the proposed index structure HSTI, which is based on the two-level lookup mechanism of HBase. Table 1 summarizes the mathematical notations used throughout this paper to facilitate the discussion.

Notation    Definition
$O$         a given data set of spatio-temporal data
$o_i$       a spatio-temporal data object
$o_{id}$    the identification of an object
$x_i$       the longitude of a spatio-temporal data
$y_i$       the latitude of a spatio-temporal data
$t_i$       the timestamp of a spatio-temporal data
$p$         the location point of an object $o_i$
$z_n$       the value of Morton order in Z-order curve
$L$         the deepest level of Z-Octree
$\xi$       the division threshold value for a node in Z-Octree
$v$         the Morton order of a leaf node in Z-Octree
$q$         a spatio-temporal $k$-NN search
$k$         the nearest neighbors number of a $k$-NN search
$\delta_e$  the Euclidean distance between two points in spatial space

Table 1: The summary of notations

Definition 1 (Spatio-temporal object). A spatio-temporal object can be represented in the form $o_i(o_{id}, x_i, y_i, t_i)$, where $o_{id}$ is the identification of the object, which has a spatial location $p$ given by its longitude and latitude $(x_i, y_i)$ along the $x$ and $y$ dimensions at the timestamp $t_i$.

Definition 2 (Spatio-temporal k-nearest neighbors search). Given a set $O$ of spatio-temporal objects, a spatio-temporal $k$-nearest neighbors query is denoted as $q(x_q, y_q, [t_{start}, t_{end}], k)$, where $(x_q, y_q)$ is the query spatial location and $[t_{start}, t_{end}]$ is the query temporal interval. This work aims to find a set $O(q) \subseteq O$ with $|O(q)| = k$ such that $\delta_e(p, (x_q, y_q)) \le \delta_e(p', (x_q, y_q))$ for all $p \in O(q)$ and $p' \in O \setminus O(q)$ with $p.t, p'.t \in [t_{start}, t_{end}]$, where $\delta_e$ is the Euclidean distance.

As a commonly used NoSQL database, Apache HBase [38] relies on the Hadoop distributed file system (HDFS) to achieve scalability, and its data are organized in the form of key-value pairs. An HBase cluster, which allows large-scale data to be stored across multiple physical servers, usually comprises one or more master nodes (called Masters) and several slave nodes (called RegionServers). The HBase cluster depends on ZooKeeper.

Unlike the data structure of traditional RDBMSs, the basic storage unit in HBase is a cell, which is identified by <RowKey, ColumnFamily:ColumnName, TimeStamp>. Fig. 1 illustrates the logical view of a table in HBase. The value V can be retrieved by the condition <r, cf:cq, t>, where r is the row key, cf the column family name, cq the column name and t the timestamp of this value. Within a table, rows are sorted in lexicographical order according to their unique row keys. The timestamp marks each update of the data in the database, and each updated version corresponds to a timestamp.

Fig. 1: Logical view for HBase table
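
To make the logical view concrete, the following minimal Python sketch models a table whose values are addressed by <r, cf:cq, t> and whose rows are listed in lexicographical key order. It is a toy in-memory model, not the HBase client API: the class name `ToyTable`, its methods and the example column `d:loc` are hypothetical, and string row keys stand in for the byte arrays that HBase actually uses.

```python
from collections import defaultdict


class ToyTable:
    """Toy stand-in for the logical view of an HBase table (assumption, not HBase code)."""

    def __init__(self):
        # (row key, "family:qualifier") -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        # Each update is stored as a new version under its own timestamp.
        self._cells[(row, column)][timestamp] = value

    def get(self, row, column, timestamp=None):
        # Without a timestamp, return the most recent version of the cell.
        versions = self._cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def row_keys(self):
        # Rows are ordered lexicographically by their row keys.
        return sorted({row for row, _ in self._cells})


table = ToyTable()
table.put("0010|obj-a", "d:loc", 1, "first version")
table.put("0010|obj-a", "d:loc", 2, "second version")
table.put("0002|obj-b", "d:loc", 1, "other row")
latest = table.get("0010|obj-a", "d:loc")      # newest version of the cell
older = table.get("0010|obj-a", "d:loc", 1)    # a specific version
```

Because rows are ordered by key, objects whose keys share a prefix end up next to each other, which is the property the index design below relies on.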

As shown in Fig. 2, in the physical view of HBase, user tables are divided horizontally into several regions along their rows. Each region, which is the smallest unit for distributed storage and load balancing in an HBase cluster, is maintained by exactly one RegionServer. For all user tables, the location information of RegionServers is stored in the META table in HBase. The data in the META table are organized as key-value pairs, where the key is <TableName, RegionStartKey, RegionId> and the value is <RegionServer>. A key-value search depends on the two-level internal lookup mechanism of HBase to locate the value, which is illustrated in Fig. 2. The workflow is as follows: for a given row key, the location of the corresponding RegionServer is obtained from the META table, and then a full scan operation is performed in that RegionServer to find the value. For simplicity, some unrelated retrieval steps, such as the ROOT table search and ZooKeeper coordination, are omitted.

Fig. 2: The two-level lookup structure in HBase
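
The two-level lookup can be sketched in a few lines of Python. This is an illustrative in-memory model under simplifying assumptions (a single table, string row keys, one region per server); `ToyMeta`, `locate` and `lookup` are hypothetical names and not HBase APIs.

```python
import bisect


class ToyMeta:
    """Toy META table: region start keys mapped to RegionServer names."""

    def __init__(self, regions):
        # regions: iterable of (region_start_key, region_server_name)
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]

    def locate(self, row_key):
        # First level: the region whose start key is the largest one <= row_key.
        idx = bisect.bisect_right(self._starts, row_key) - 1
        if idx < 0:
            raise KeyError("no region covers row key %r" % (row_key,))
        return self._regions[idx][1]


def lookup(meta, servers, row_key):
    """Second level: scan the rows held by the located RegionServer."""
    server = meta.locate(row_key)
    for key, value in servers[server]:  # full scan, as described in the text
        if key == row_key:
            return value
    return None


meta = ToyMeta([("", "rs1"), ("m", "rs2")])
servers = {
    "rs1": [("apple", 1), ("kiwi", 2)],
    "rs2": [("melon", 3), ("pear", 4)],
}
value = lookup(meta, servers, "pear")
```

The first step is a cheap ordered search, while the second step touches every row held by the located server, which is the cost that a per-server index aims to avoid.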

A closer look at the internal mechanism of HBase shows that there is only a simple index layer in the META table. Obviously, the full scan operation in each corresponding RegionServer is inefficient, especially when the selectivity of the spatio-temporal queries is high.

This section introduces the novel two-layer index structure, namely HSTI, which is based on the two-level internal lookup mechanism in HBase. The first layer index in HSTI achieves efficient server routing in the HBase cluster. With the help of the second layer index in HSTI, the distributed retrievals in the involved RegionServers can be completed quickly. Fig. 3 shows the overall structure of HSTI. The details of the index design are described below.

Fig. 3: The overall structure of HSTI

The first layer: META table design.

This subsection introduces the first layer of HSTI. Firstly, the whole spatial space is divided into non-overlapping cells, in which spatio-temporal data points are distributed based on their spatial locations. The linearization technique Z-order curve [12] is used to map data points from multiple dimensions to one dimension. As will be shown in Section 4, the Z-order curve is chosen because the proposed index requires the following property: given any two-dimensional rectangle, the bottom-left point and the top-right point are mapped to the minimum value and the maximum value, respectively. All the points within the specified rectangle are then mapped to values within this range.

The figure illustrates Z-order curves. $Z_n(A)$ is the minimum value and $Z_n(C)$ is the maximum value with regard to all the data within the region.

Given a data point with spatio-temporal information, its spatial Z-order curve value can be calculated from its longitude and latitude $(x_i, y_i)$. According to the hash principle of HBase, the prefixes of row keys should differ so as to improve the insertion rate across different RegionServers and increase the chance of load balancing. Meanwhile, data points that are spatially close should share the same row key prefix, so that these points are stored adjacently according to their spatial proximity. Therefore, as the first layer for indexing spatio-temporal data, the Z-order curve value $Z_n$ is used as the row key in the META table, and each data point can then be placed into the corresponding RegionServer.
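
As an illustration of the first layer, the sketch below computes a Z-order (Morton) value by interleaving the bits of a cell's column and row indices and turns it into a fixed-width row key prefix. It is a minimal sketch under stated assumptions: the function names, the hexadecimal key format and the `cell_size` and `bits` parameters are hypothetical and are not taken from the paper.

```python
def z_value(cx, cy, bits=16):
    """Interleave the bits of the cell indices (cx, cy) into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((cx >> i) & 1) << (2 * i)      # x bits occupy the even positions
        z |= ((cy >> i) & 1) << (2 * i + 1)  # y bits occupy the odd positions
    return z


def cell_of(x, y, cell_size):
    """Map a location to the indices of the equal-size cell that contains it."""
    return int(x // cell_size), int(y // cell_size)


def row_key_prefix(x, y, cell_size, bits=16):
    """Zero-padded hex prefix, so lexicographic order matches numeric order."""
    cx, cy = cell_of(x, y, cell_size)
    width = (2 * bits + 3) // 4  # hex digits needed for 2 * bits bits
    return format(z_value(cx, cy, bits), "0{}x".format(width))


# Property used by the index: inside any rectangle of cells, the bottom-left
# cell has the smallest Z-value and the top-right cell has the largest one.
codes = [z_value(cx, cy) for cx in range(2, 6) for cy in range(1, 4)]
rectangle_ok = min(codes) == z_value(2, 1) and max(codes) == z_value(5, 3)
```

The property holds because the Morton code grows whenever either coordinate grows, so a rectangle query can be bounded by the codes of its two corners. Values between the two bounds may still belong to cells outside the rectangle, which is one reason a further filtering step is needed.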

The second layer: Z-Octree structure

In each RegionServer, the spatio-temporal data points are stored discretely. When a non-primary-key search is executed in HBase, such as a spatio-temporal $k$-nearest neighbors search, full scan operations would be performed in regions to find $k$ points that satisfy the query constraint. Obviously, scan operations without any optimization in RegionServers will degrade overall query performance in HBase. It is necessary to maintain a spatio-temporal index scheme in each RegionServer to avoid unnecessary scans in spatio-temporal $k$-nearest neighbors search.

To search local data efficiently, a novel indexing structure, namely Z-order Octree (Z-Octree), is proposed as the second layer of HSTI. Z-Octree is an in-memory list-like structure kept in each RegionServer. Each entry in Z-Octree records a list of addresses pointing to spatio-temporal data in StoreFiles.

In Z-Octree, the spatio-temporal space is considered as a three-dimensional space, into which all the data points are mapped by their spatial and temporal information. The octree is an efficient storage structure for three-dimensional data processing, so it is adopted here to store the code values. In an octree, each non-leaf node has 8 sub-nodes, as shown in Fig. 5(a). In [3], a three-dimensional space can be recursively partitioned into $8^{L-1}$ uniform subspaces, where $L$ is the level of the partition. Each subspace is assigned a Morton value based on its visiting order. Unlike this traditional process, a more adaptive strategy is proposed to construct an octree (called Z-Octree).

To keep the Z-Octree adaptive, it is constructed based on the data density and skewness. An inner subspace is divided into 8 subspaces only when it contains sufficient points. Hence, in the Z-Octree structure, leaf nodes can exist on different levels, which differs from the traditional octree. When constructing the Z-Octree, the whole spatio-temporal space is first processed as the root node, and then any space that contains more than $\xi$ points is recursively divided into 8 subspaces.
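
The threshold-driven division can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it keeps raw $(x, y, t)$ points in the leaves instead of StoreFile addresses, it does not assign Morton values, and the names `ZOctreeNode`, `insert` and `leaves` are hypothetical. The root is placed at level 1 and a node is divided only while its level is below the deepest level $L$ (`max_level`).

```python
class ZOctreeNode:
    """A cube [lo, hi) in (x, y, t) space; the root sits at level 1."""

    def __init__(self, lo, hi, level=1):
        self.lo, self.hi, self.level = lo, hi, level
        self.points = []      # only used while the node is a leaf
        self.children = None  # list of 8 sub-cubes once the node is divided


def _mid(node):
    return [(l + h) / 2 for l, h in zip(node.lo, node.hi)]


def _octant(node, p):
    # Bit d of the octant index is set when p lies in the upper half of axis d.
    mid = _mid(node)
    return sum(1 << d for d in range(3) if p[d] >= mid[d])


def _split(node):
    mid = _mid(node)
    node.children = []
    for code in range(8):
        lo = tuple(mid[d] if (code >> d) & 1 else node.lo[d] for d in range(3))
        hi = tuple(node.hi[d] if (code >> d) & 1 else mid[d] for d in range(3))
        node.children.append(ZOctreeNode(lo, hi, node.level + 1))


def insert(node, p, xi, max_level):
    """Insert p; divide a leaf once it holds more than xi points."""
    while node.children is not None:
        node = node.children[_octant(node, p)]
    node.points.append(p)
    if len(node.points) > xi and node.level < max_level:
        pending, node.points = node.points, []
        _split(node)
        for q in pending:  # redistribute; dense children may divide again
            insert(node, q, xi, max_level)


def leaves(node):
    if node.children is None:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


root = ZOctreeNode((0, 0, 0), (8, 8, 8))
for point in [(1, 1, 1), (1.5, 1, 1), (3, 3, 3), (0.5, 0.5, 0.5), (7, 7, 7)]:
    insert(root, point, xi=2, max_level=3)
leaf_levels = sorted({leaf.level for leaf in leaves(root)})
```

In this small run the dense corner of the space is divided twice while the sparse octants stay as single leaves, so leaves end up on different levels, as described above.
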
As shown in Fig. 5(a), the whole space is divided into eight sub-nodes, and only one of them satisfies the division threshold value $\xi$. In this way, Z-Octree can easily handle skewed data and eliminate hotspots efficiently.

The Morton order value of each leaf node is generated as follows. For the given deepest level $L$ of the octree, we assume that the whole spatio-temporal space can be divided into $8^{L-1}$ virtual subspaces, and the Morton order value of each virtual subspace is calculated [12]. Z-Octree does not have to materialize every virtual subspace: if the number of points in a node does not reach the division threshold value $\xi$, the node is not divided. The minimum Morton value among the virtual subspaces that make up an undivided leaf node is used as the Morton value $v$ of that node. Fig. 5(b) shows an example of a Z-Octree built from Fig. 5(a), with a deepest level of 3. The Morton value of each leaf node is denoted in the corresponding circle.

K-nearest neighbors search algorithm

A spatio-temporal $k$-NN query $q$ is defined in Subsection 3.1. Searching the $k$ nearest neighbors in the spatio-temporal dimension is usually more difficult than spatial $k$-NN search, because restrictions on both spatial and temporal information must be considered.

A $k$-NN query algorithm based on HSTI is proposed. For a given spatial location $(x_q, y_q)$, the Z-ordering space $Z_n$ containing $(x_q, y_q)$ is computed in the first layer of HSTI. Then the Z-Octree in the corresponding RegionServer is used to retrieve a list of data points. Meanwhile, the adjacent spaces of $Z_n$ are also computed. A priority queue $Q$ is maintained for all the points and adjacent spaces, where the priority metric is the distance from the location $(x_q, y_q)$ to the point or Z-ordering space. Elements are repeatedly dequeued from the priority queue, either being added to the result list or being expanded to obtain adjacent points to enqueue, until the $k$ nearest neighbors are found.
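
The priority-queue idea can be illustrated with a single-process Python sketch. It is a simplification under stated assumptions, not the paper's distributed algorithm: the Z-ordering spaces are modelled as square grid cells held in a dictionary, the Z-Octree cubes inside each cell are collapsed into a plain list filtered by time, and the names `mindist` and `st_knn` are hypothetical.

```python
import heapq
import itertools
import math


def mindist(q, cell, size):
    """Smallest Euclidean distance from q = (x, y) to a square grid cell."""
    dx = max(cell[0] * size - q[0], 0.0, q[0] - (cell[0] + 1) * size)
    dy = max(cell[1] * size - q[1], 0.0, q[1] - (cell[1] + 1) * size)
    return math.hypot(dx, dy)


def st_knn(grid, size, q, t_start, t_end, k):
    """grid maps a cell (cx, cy) to a list of (x, y, t) points."""
    start = (int(q[0] // size), int(q[1] // size))
    cells = list(grid) + [start]
    min_x, max_x = min(c[0] for c in cells), max(c[0] for c in cells)
    min_y, max_y = min(c[1] for c in cells), max(c[1] for c in cells)
    order = itertools.count()  # tie-breaker so heap entries never compare items
    queue = [(0.0, next(order), "cell", start)]
    seen = {start}
    result = []
    while queue and len(result) < k:
        _, _, kind, item = heapq.heappop(queue)
        if kind == "point":
            result.append(item)  # nothing left in the queue can be closer
            continue
        # Expand the cell: enqueue its points that satisfy the time interval.
        for x, y, t in grid.get(item, ()):
            if t_start <= t <= t_end:
                d = math.hypot(x - q[0], y - q[1])
                heapq.heappush(queue, (d, next(order), "point", (x, y, t)))
        # Enqueue the unseen adjacent cells, keyed by their MINDIST to q.
        for nx in range(item[0] - 1, item[0] + 2):
            for ny in range(item[1] - 1, item[1] + 2):
                nb = (nx, ny)
                if nb in seen or not (min_x <= nx <= max_x and min_y <= ny <= max_y):
                    continue
                seen.add(nb)
                heapq.heappush(queue, (mindist(q, nb, size), next(order), "cell", nb))
    return result


grid = {
    (0, 0): [(5, 5, 100), (6, 6, 9999)],
    (1, 0): [(12, 5, 100)],
    (2, 2): [(25, 25, 100)],
}
nearest = st_knn(grid, 10, (4, 4), 0, 500, 2)
```

A point is reported only when it leaves the queue, at which time every cell that is still waiting has a MINDIST at least as large, so no closer point can appear later. The point with timestamp 9999 is never enqueued for this query because it falls outside the time interval.
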
Algorithm 1 is the $k$-NN query pseudocode. The input is the query $q_k = (x_q, y_q, [t_{start}, t_{end}], k)$, and the output is the $k$ nearest points to the spatial location $(x_q, y_q)$ during $[t_{start}, t_{end}]$. The basic workflow is that points close to the location $(x_q, y_q)$ are repeatedly inserted into a list $S$ until the length of $S$ reaches $k$.

In line 2, a priority queue $Q$ is initialized to order the enqueued elements by distance. The computed Z-ordering space $Z_n$ is first enqueued into $Q$ (line 4), and the adjacent spaces of $Z_n$ are also found and enqueued, where the priority metric is the distance MINDIST [14]. Then elements in $Q$ are repeatedly dequeued and processed (line 6). If the element is a Z-ordering space, Procedure 3 Search (line 12) is executed in each involved RegionServer.

Since Z-Octree is used for space partitioning in each RegionServer, the query for $k$ nearest neighbors is transformed into a search of neighboring Z-Octree nodes. Each leaf node in the Z-Octree corresponds to a sub-cube in the spatio-temporal space. If cubes are adjacent, then the corresponding nodes in the octree are neighbors of each other. The $k$ nearest neighbors of a spatial location $(x_q, y_q)$ are searched not only in the covering cube but also in all the cubes adjacent to it within $[t_{start}, t_{end}]$. The figure illustrates all the cubes that need to be searched in Z-Octree, where the shaded cubes are covering cubes and the cubes overlapping with the dotted region are spatially adjacent cubes that satisfy the temporal predicate $[t_{start}, t_{end}]$.

Algorithm 1 k Nearest Neighbors Search
Input: x_q, y_q, t_start, t_end, k.
Output: S: list of k-nearest neighbors.
 1: S ← ∅; Z_n ← ∅; CoveringCubes ← ∅; AdjacentSpaces ← ∅; AdjacentCubes ← ∅; e ← ∅; AS ← ∅;
 2: Q ← CreatePriorityQueue();
 3: Z_n ← getZorderingSpace(x_q, y_q);
 4: Enqueue(Z_n, MINDIST((x_q, y_q), Z_n), Q);
 5: while Q ≠ ∅ do
 6:     e ← Dequeue(Q);
 7:     if e is typeof Zordering space then
 8:         AdjacentSpaces ← getAdjacentSpaces(e);
 9:         for each AdjacentSpace AS ∈ AdjacentSpaces do
10:             Enqueue(AS, MINDIST((x_q, y_q), AS), Q);
11:         end for
12:         Search(e, x_q, y_q, t_start, t_end, k);
13:     end if
14:     if e is typeof cube then
15:         PointsSet ← getPoints(e);
16:         for each point ∈ PointsSet do
17:             Enqueue(point, δ_e((x_q, y_q), point), Q);
18:         end for
19:     end if
20:     if e is typeof point then
21:         Add e into S;
22:         if S.length = k then
23:             return S;
24:         end if
25:     end if
26: end while

In Procedure 3, the covering nodes (line 4) and their spatially adjacent cubes (line 9) are computed by the functions getCubes() and getAdjacentCubes(), respectively. The points in the covering nodes are obtained by searching the Z-Octree in line 5. Then all the points and adjacent cubes are enqueued into the priority queue $Q$ (lines 4 to 12). The minimum distance is the Euclidean distance from the location $(x_q, y_q)$ to the point. If the element is an adjacent space of $Z_n$, the algorithm enqueues all the adjacent cubes neighboring the location $(x_q, y_q)$ (lines 15 to 18).

As shown in Algorithm 1, if the element $e$ dequeued from $Q$ is a cube, the algorithm enqueues the points in the cube by their distance (lines 14 to 18). Otherwise, if $e$ is a point, it is a result and is added to the result list $S$ (line 21). The procedure described above is repeated until the length of $S$ reaches $k$. The results of query $q_k$ are then available in $S$.

With the implementation of HSTI, a comprehensive experimental evaluation is conducted to verify the performance of the scheme in a real cloud environment. An eight-node HBase cluster was set up.
In the experiments, ZooKeeper was responsible for coordination and synchronization in the HBase cluster, the locations of all datasets were scaled to the two-dimensional space [0, 10000] × [0, 10000], and the timestamps of all datasets were scaled to [0, 5000]. In addition, the top-$k$ value varied from 10 to 500, and the cluster size varied from 2 to 8. By default, the top-$k$ value and the cluster size were set to 100 and 4, respectively. More detailed configuration of the experiments is shown in Table 2, where the default values of the parameters are marked.

Parameter          Configuration
CPU                Intel Core i5 @ 3.10GHz
Memory             4GB
Network            1Mbps bandwidth
OS                 Ubuntu (Version 14.04 LTS)
JVM                Java 1.8.0
Hadoop version     2.6.4
HBase version      1.2.4
ZooKeeper version  3.4.9
Spatial region     [0, 10000] × [0, 10000]
Time interval      [0, 5000]
k                  10, 20, 50, 100 (default), 200, 500
Cluster size       2, 4 (default), 6, 8

Table 2: The detail of configurations

Datasets.

Three different datasets were used in the experiments. One was a synthetic uniform dataset (UN for short) generated by a program, and the other two were real-world datasets: the first was collected in the GeoLife project [40] (GL for short) by 182 users from April 2007 to August 2012, and the second was T-Drive [35,36] (TD for short), generated by 33 thousand taxis on the Beijing road network over a period of 3 months. The important statistics of the three datasets are shown in Table 3.

Property                      UN     GL     TD
Number of records (millions)  20     25.84  17.76
Size of dataset (GB)          0.74   1.54   0.71

Table 3: Dataset statistics

For an accurate analysis and evaluation, STEHIX, which has an index scheme similar to the one we propose, was chosen as the baseline. In the experiments, the deepest level $L$ of Z-Octree was set to 16 and the split threshold value $\xi$ was set to 200. We investigate the query response time, index construction time and index size of the two algorithms on the three datasets GL, TD and UN, where the other parameters are set to their default values. Fig. ??(a) depicts the space occupied by the indices. The baseline requires more space because two kinds of indices (called s-index and t-index) are kept for all the entries, and its storage cost increases faster on larger datasets. In contrast, HSTI maintains an index record for each entry only once, so it saves more memory. Fig. ??(b) shows the difference in construction time between HSTI and the baseline. Owing to the simple split and coding algorithms of the Z-Octree, HSTI has a shorter construction time than the baseline, which needs to traverse two indices during construction. In Fig. ??(c), HSTI demonstrates superior performance in comparison with the baseline on all datasets.

Evaluation on the effect of the number of results k.

The figures report the query response time of the algorithms as a function of $k$ on the two datasets TD and UN. As expected, the performance of all algorithms degrades as $k$ increases (i.e., as the search region grows). Compared with UN, the search cost of HSTI grows much more slowly on the TD dataset. HSTI also consistently outperforms the baseline for both high-density and low-density points. The reasons are as follows. First of all, the retrieval procedure of the baseline is divided into two parts: s-index and t-index. Thus, each information extraction is decomposed into two processes that collect results in the temporal dimension and the spatial dimension, which incurs more I/O cost. On the other hand, the periodicity of t-index causes many discrete timestamps to be mixed together. For example, assume each cycle has 24 hours and is divided into several segments, say 8; then all data entries are mapped into these 8 segments by their temporal information. For a given $k$-NN query, results returned by the baseline may fall into the same time segment while belonging to different dates. Thus, the baseline has to spend some time removing false positive results, which leads to unnecessary time consumption.

Evaluation on the effect of cluster size.

As shown in the figures, the running time decreases gradually as the cluster size increases. More nodes in the cluster mean less data on each server, so the processing time declines. Meanwhile, we also observe that the performance of the $k$-NN query on the uniform dataset UN is better than that on the real-world dataset TD. This is because the data distribution of TD may be non-uniform, and there may be some hotspots in TD.

The problem of spatio-temporal $k$-NN search is important due to the increasing amount of spatio-temporal data collected in a wide spectrum of applications. The proposed hybrid spatio-temporal index scheme, HSTI, is based on the two-level internal lookup mechanism in HBase. Based on HSTI, an efficient algorithm is developed to support spatio-temporal $k$-NN search. Our comprehensive experiments on real and synthetic data clearly show that HSTI is able to reduce the processing time by 60-80% compared with prior state-of-the-art methods.

References

1. Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X., Saltz, J.H.: Hadoop-gis: A high performance spatialdata warehousing system over mapreduce. PVLDB (11), 1009–1020 (2013)2. Beckmann, N., Kriegel, H., Schneider, R., Seeger, B.: The r*-tree: An eﬃcient and robust access methodfor points and rectangles. In: Proceedings of the 1990 ACM SIGMOD International Conference onManagement of Data, Atlantic City, NJ, May 23-25, 1990., pp. 322–331 (1990)3. Brown, R.A.: Building a balanced k-d tree in o(kn log n) time. CoRR abs/1410.5420 (2014)4. Eldawy, A., Mokbel, M.F.: A demonstration of spatialhadoop: An eﬃcient mapreduce framework forspatial data. PVLDB (12), 1230–1233 (2013)5. Finkel, R.A., Bentley, J.L.: Quad trees: A data structure for retrieval on composite keys. Acta Inf. , 1–9(1974)6. Fox, A.D., Eichelberger, C.N., Hughes, J.N., Lyon, S.: Spatio-temporal indexing in non-relational dis-tributed databases. In: Proceedings of the 2013 IEEE International Conference on Big Data, 6-9 October2013, Santa Clara, CA, USA, pp. 291–299 (2013)7. Guttman, A.: R-trees: A dynamic index structure for spatial searching. In: SIGMOD’84, Proceedings ofAnnual Meeting, Boston, Massachusetts, June 18-21, 1984, pp. 47–57 (1984)8. Hsu, Y., Pan, Y., Wei, L., Peng, W., Lee, W.: Key formulation schemes for spatial index in cloud datamanagements. In: 13th IEEE International Conference on Mobile Data Management, MDM 2012, Ben-galuru, India, July 23-26, 2012, pp. 21–26 (2012)9. Labrinidis, A., Jagadish, H.V.: Challenges and opportunities with big data. PVLDB (12), 2032–2033(2012)10. Liu, A., Liu, X., Long, J.: A trust-based adaptive probability marking and storage traceback scheme forwsns. Sensors (4), 451 (2016). DOI 10.3390/s16040451. URL https://doi.org/10.3390/s16040451

11. Long, J., Dong, M., Ota, K., Liu, A.: A green TDMA scheduling algorithm for prolonging lifetime inwireless sensor networks. IEEE Systems Journal (2), 868–877 (2017). DOI 10.1109/JSYST.2015.2448355. URL https://doi.org/10.1109/JSYST.2015.2448355

12. M., M.G.: A computer oriented geodetic data base and a new technique in ﬁle sequencing. New York:International Business Machines Company (1966)13. Nishimura, S., Das, S., Agrawal, D., El Abbadi, A.: Md-hbase: A scalable multi-dimensional data infras-tructure for location aware services. In: 12th IEEE International Conference on Mobile Data Management,MDM 2011, Lule˚a, Sweden, June 6-9, 2011, Volume 1, pp. 7–16 (2011)14. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In: Proceedings of the 1995 ACMSIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995., pp.71–79 (1995)0 A hybrid index model for eﬃcient spatio-temporal search in HBase15. Wang, Y., Huang, X., Wu, L.: Clustering via geometric median shift over riemannian manifolds. Infor-mation Sciences , 292–305 (2013)16. Wang, Y., Lin, X., Wu, L., Zhang, W.: Eﬀective multi-query expansions: Robust landmark retrieval. In:ACM Multimedia, pp. 79–88 (2015)17. Wang, Y., Lin, X., Wu, L., Zhang, W.: Eﬀective multi-query expansions: Collaborative deep networks forrobust landmark retrieval. IEEE Trans. Image Processing (3), 1393–1404 (2017). DOI 10.1109/TIP.2017.2655449. URL https://doi.org/10.1109/TIP.2017.2655449

18. Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: Exploiting correlation consensus: Towards subspace clustering for multi-modal data. In: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03 - 07, 2014, pp. 981–984 (2014)

19. Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q.: LBMCH: learning bridging mapping for cross-modal hashing. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, August 9-13, 2015, pp. 999–1002 (2015)

20. Wang, Y., Lin, X., Wu, L., Zhang, W., Zhang, Q., Huang, X.: Robust subspace clustering for multi-view data by exploiting correlation consensus. IEEE Trans. Image Processing (11), 3939–3949 (2015)

21. Wang, Y., Lin, X., Zhang, Q.: Towards metric fusion on multi-view data: a cross-view based graph random walk approach. In: 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, pp. 805–810 (2013)

22. Wang, Y., Lin, X., Zhang, Q., Wu, L.: Shifting hypergraphs by probabilistic voting. In: Advances in Knowledge Discovery and Data Mining - 18th Pacific-Asia Conference, PAKDD 2014, Tainan, Taiwan, May 13-16, 2014. Proceedings, Part II, pp. 234–246 (2014)

23. Wang, Y., Wu, L.: Beyond low-rank representations: Orthogonal clustering basis reconstruction with optimized graph structure for multi-view spectral clustering. Neural Networks (2018)

24. Wang, Y., Wu, L., Lin, X., Gao, J.: Multiview spectral clustering via structured low-rank matrix factorization. IEEE Trans. Neural Networks and Learning Systems (2018)

25. Wang, Y., Zhang, W., Wu, L., Lin, X., Fang, M., Pan, S.: Iterative views agreement: An iterative low-rank based structured optimization method to multi-view spectral clustering. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 2153–2159 (2016)

26. Wang, Y., Zhang, W., Wu, L., Lin, X., Zhao, X.: Unsupervised metric fusion over multiview data by graph random walk-based cross-view diffusion. IEEE Trans. Neural Netw. Learning Syst. (1), 57–70 (2017)

27. Wu, L., Huang, X., Zhang, C., Shepherd, J., Wang, Y.: An efficient framework of bregman divergence optimization for co-ranking images and tags in a heterogeneous network. Multimedia Tools Appl. (15), 5635–5660 (2015)

28. Wu, L., Wang, Y.: Robust hashing for multi-view data: Jointly learning low-rank kernelized similarity consensus and hash functions. Image and Vision Computing, 58–66 (2016)

29. Wu, L., Wang, Y., Gao, J., Li, X.: Deep adaptive feature embedding with local sample distributions for person re-identification. Pattern Recognition, 275–288 (2018)

30. Wu, L., Wang, Y., Ge, Z., Hu, Q., Li, X.: Structured deep hashing with convolutional neural networks for fast person re-identification. Computer Vision and Image Understanding, 63–73 (2018)

31. Wu, L., Wang, Y., Li, X., Gao, J.: Deep attention-based spatially recursive networks for fine-grained visual recognition. IEEE Trans. Cybernetics (2018)

32. Wu, L., Wang, Y., Li, X., Gao, J.: What-and-where to match: Deep spatially multiplicative integration networks for person re-identification. Pattern Recognition, 727–738 (2018)

33. Wu, L., Wang, Y., Shepherd, J.: Efficient image and tag co-ranking: a bregman divergence optimization method. In: ACM Multimedia (2013)

34. Xiao-yin, C., Chong, Z., Bin, G., Wei-dong, X.: Efficient historical query in hbase for spatio-temporal decision support. International Journal of Computers, Communications and Control (5), 613–630 (2016)

35. Yuan, J., Zheng, Y., Xie, X., Sun, G.: Driving with knowledge from the physical world. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pp. 316–324 (2011)

36. Yuan, J., Zheng, Y., Zhang, C., Xie, W., Xie, X., Sun, G., Huang, Y.: T-drive: driving directions based on taxi trajectories. In: 18th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, ACM-GIS 2010, November 3-5, 2010, San Jose, CA, USA, Proceedings, pp. 99–108 (2010)

37. Zhang, C., Zhang, Y., Zhang, W., Lin, X.: Inverted linear quadtree: Efficient top k spatial keyword search. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8-12, 2013, pp. 901–912 (2013)

38. Zhang, C., Zhang, Y., Zhang, W., Lin, X.: Inverted linear quadtree: Efficient top K spatial keyword search. IEEE Trans. Knowl. Data Eng. (7), 1706–1721 (2016)

39. Zhang, N., Zheng, G., Chen, H., Chen, J., Chen, X.: Hbasespatial: A scalable spatial data storage based on hbase. In: 13th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, TrustCom 2014, Beijing, China, September 24-26, 2014, pp. 644–651 (2014)

40. Zheng, Y., Xie, X., Ma, W.: Geolife: A collaborative social networking service among user, location and trajectory. IEEE Data Eng. Bull.33