An Efficient Index for Contact Tracing Query in a Large Spatio-Temporal Database

Abstract

In this paper, we study a novel contact tracing query (CTQ) that finds users who have been in direct contact with the query user, or in contact with the already contacted users in subsequent timestamps, from a large spatio-temporal database. The CTQ is of paramount importance during the COVID-19 pandemic for finding the list of people potentially exposed to COVID-19. A straightforward way to answer the CTQ is to use traditional spatio-temporal indexes. However, these indexes cannot serve the purpose: each user covers a large area within the time-span of potential disease spreading, so the indexes can hardly apply efficient pruning techniques. We propose a multi-level index, namely the QR-tree, that considers both space coverage and co-visiting patterns to group users so that users who are likely to meet the query user are grouped together. More specifically, we use a quadtree to partition user movement traces w.r.t. space and time, and then exploit this space-time mapping of user traces to group users using an R-tree. The QR-tree facilitates efficient pruning and enables accessing only the sets of users who can be candidate answers for the CTQ. Experiments with real datasets show the effectiveness of our approach.


Mohammed Eunus Ali (a), Shadman Saqib Eusuf (a), Kazi Ashik Islam (b)
(a) Bangladesh University of Engineering and Technology, Bangladesh
(b) University of Virginia, USA


Keywords:

Contact tracing query, Spatio-temporal database, COVID-19


Preprint submitted to Journal of Big Data Research, August 11, 2020

1. Introduction

The world is witnessing an unprecedented pandemic as the coronavirus (SARS-CoV-2) continues to spread across the globe. As of today (May 13, 2020), there are more than four million confirmed cases in 185 countries, and more than 290,000 people have lost their lives to the respiratory infection caused by the virus, commonly referred to as COVID-19 [1]. The virus is extremely infectious and passes easily from person to person. Thus, to curb its spread, authorities around the world implemented lockdown measures for months. However, these lockdowns have brought much of global economic and social activity to a halt.

To avoid socio-economic catastrophe, the authorities have gradually started to ease the lockdowns. However, they are still struggling to find efficient techniques to monitor the mobility of potentially infected patients and of those who have been in contact with an infected person. Since people in close contact with someone who is infected are at higher risk of becoming infected themselves, and of potentially further infecting others, closely monitoring these contacts can prevent further transmission of the virus. This process of monitoring is known as contact tracing. In this paper, we study the problem of the contact tracing query (CTQ) in a spatio-temporal database.

Consider the following scenario. Let $D$ be the historical mobility traces (or, equivalently, trajectories) of users for the last $T$ days, obtained from GPS-enabled phones or from mobile signals through triangulation. Each user $u \in D$ is represented as a sequence of time-stamped locations $\{(l_1, t_1), (l_2, t_2), \ldots, (l_n, t_n)\}$ denoting her visited places $l_1, l_2, \ldots, l_n$ at times $t_1, t_2, \ldots, t_n$, respectively. Let $q$ be the mobility traces (or the trajectory) of a newly identified COVID-19 infected user, which is the query in our system. The objective of the CTQ is to identify a set of users $U \subset D$ who have been in direct contact with $q$ at any point of time, and subsequently to find users who came into contact with the already contacted users.

To process variants of trajectory-related queries such as range, join, and nearest-neighbor queries, a large body of trajectory indexing techniques has been proposed in the literature [2, 3, 4, 5]. These indexes are variants of traditional spatio-temporal indexes such as the R-tree [6] or the quad-tree [7], and are tailored for answering different types of queries. Though it may seem that the CTQ can be solved using existing indexes designed for range queries, running repetitive range queries for different points of the query trajectory makes it extremely inefficient. This is due to the following two reasons: (i) the mobility traces or historical trajectories of a user are usually a set of time-stamped, dispersed point locations covering a large area, which differs from normal point data such as POIs (Points of Interest) or trajectory data such as taxi trips, and is thus very hard to prune using traditional indexes; (ii) if a user's travel history matches the query at any instant, then the user is a candidate answer, and we need to run the process recursively as this user may have subsequently infected others.

To answer the CTQ efficiently, we propose a two-level index structure, namely the QR-tree, that exploits the strengths of both the quad-tree and the R-tree. In the first level, we use a quad-tree to partition the points of historical trajectories, where the location of a trajectory point is specified by the space-id of the smallest quad-tree block that contains the point. Similarly, the timestamp of each location of a trajectory is mapped to a time-id that corresponds to the time bucket containing the timestamp of the trajectory point. After that, we transform each trajectory into a sequence of (space-id, time-id) tuples. We consider this mapping as a transformation to a new coordinate system for the trajectory points. Next, we apply an R-tree on the trajectory points, represented in the new coordinate system, for grouping them and saving them on disk. Finally, we present an efficient divide-and-conquer approach to answer the CTQ, where a query is recursively divided and run through different levels of the index to find the users who match in both space and time.

The contributions of the paper are summarized as follows:

• We are the first to introduce a novel contact tracing query (CTQ) in a large spatio-temporal database, which is of paramount importance for identifying users who were potentially exposed to COVID-19 infected users.

• We propose a multi-level index structure, namely the QR-tree, that combines the regular space-partitioning strategy of a quadtree and the object grouping strategy of an R-tree to organize the users' spatio-temporal data in a way that facilitates faster processing of the CTQ.

• We present an efficient divide-and-conquer approach for answering CTQ queries using the QR-tree. We evaluate our indexes and algorithms through an extensive experimental study on real datasets, which demonstrates both the efficiency and effectiveness of our solution.

2. Problem Formulation

Let $D$ be the historical mobility traces (or, equivalently, trajectories) of users for the last $T$ days, obtained from GPS-enabled phones or from mobile signals through triangulation. Each user $u \in D$ is represented as a sequence of time-stamped locations $\{(l_1, t_1), (l_2, t_2), \ldots, (l_n, t_n)\}$ denoting her visited places $l_1, l_2, \ldots, l_n$ at times $t_1, t_2, \ldots, t_n$, respectively. Let $q$ be the mobility traces (or the trajectory) of a COVID-19 infected user, represented as the sequence $\{(q.l_1, q.t_1), (q.l_2, q.t_2), \ldots, (q.l_n, q.t_n)\}$, which is the query in our system.

Two users $u$ and $v$ meet each other if and only if some point of $u$ and some point of $v$ are within both a spatial and a temporal distance threshold of each other. Formally, the meeting condition for two trajectories can be expressed as follows.

Condition of u meets v: Let $\{(u.l_1, u.t_1), (u.l_2, u.t_2), \ldots, (u.l_n, u.t_n)\}$ and $\{(v.l_1, v.t_1), (v.l_2, v.t_2), \ldots, (v.l_m, v.t_m)\}$ be the two sequences of time-stamped locations of $u$ and $v$, respectively. Let $spatialDist$ and $temporalDist$ be the functions measuring the spatial distance between two locations and the temporal distance between two timestamps, respectively. If, for some $i \in [1, n]$ and $j \in [1, m]$, $spatialDist(u.l_i, v.l_j) \le \psi$ and $temporalDist(u.t_i, v.t_j) \le \tau$, then we say the trajectory $u$ meets the trajectory $v$. Here, $\psi$ and $\tau$ are the spatial (Euclidean) and temporal distance thresholds, respectively.

The objective of the CTQ is to identify the set of users $U \subset D$ where each user $u \in U$ has potentially been exposed to the coronavirus. We define the set $U$ as follows.

1. Let $U_0$ be the set of users where each user $u \in U_0$ met with $q$ at some timestamp $u.t$ in the last $T$ days. We say that $u$ was exposed at time $u.t_{exposed}$ $(= u.t)$. Here, the subscript zero in $U_0$ denotes that the number of intermediate carriers (of the virus) between user $q$ and $u$ is zero (i.e., $u$ was infected by the query user $q$).

2. Let $U_1$ be the set of users where each user $u \in U_1$ met with some user $v \in U_0$ at a timestamp $u.t > v.t_{exposed}$. We define $u.t_{exposed}$ $(= u.t)$ as the time of exposure for user $u$.

3. In a similar manner, we can define the set of users $U_i$ recursively, where each user $u \in U_i$ was exposed to some user $v \in U_{i-1}$.

Then, we define the set $U$ as:

$$U = \bigcup_{i=0}^{L-1} U_i$$

Here, $L$ is an integer that denotes the maximum allowed depth of the recursion and is passed as a parameter to the CTQ.

Based on the above definitions, we formally define our contact tracing query as follows.

Definition 2.1. CTQ. Given a set $D$ of user trajectories, a COVID-19 infected user trajectory $q$, a spatial proximity threshold $\psi$, a temporal proximity threshold $\tau$, and an integer $L$, a CTQ query finds a set of users $U \subset D$ such that $U = \bigcup_{i=0}^{L-1} U_i$, where $U_i$ is the set of users who were exposed to the query user through $i$ intermediate carrier users.
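To make the definition concrete, the following is a minimal brute-force sketch of the meeting condition and of the level-by-level construction of the exposed sets. It is an illustration under stated assumptions, not the indexed algorithm proposed in this paper: trajectories are assumed to be lists of `(x, y, t)` samples in planar meters and seconds, the names `meets` and `ctq_bruteforce` are hypothetical, and each user is assigned only to the lowest level at which she is exposed.

```python
from math import hypot

# Assumed representation: a trajectory is a list of (x, y, t) samples,
# with planar coordinates in meters and timestamps in seconds.

def meets(u, v, psi, tau, after=None):
    """Return the earliest timestamp of u at which u meets v, or None.

    If `after` is given, only points of u with timestamp > after count,
    which models the condition u.t > v.t_exposed.
    """
    best = None
    for ux, uy, ut in u:
        if after is not None and ut <= after:
            continue
        for vx, vy, vt in v:
            if hypot(ux - vx, uy - vy) <= psi and abs(ut - vt) <= tau:
                if best is None or ut < best:
                    best = ut
    return best

def ctq_bruteforce(D, q, psi, tau, L):
    """D maps user id -> trajectory. Returns {user id: (level, t_exposed)}."""
    exposed = {}
    # Level 0: users who met the query user directly.
    frontier = {}
    for uid, traj in D.items():
        t = meets(traj, q, psi, tau)
        if t is not None:
            frontier[uid] = t
    level = 0
    while frontier and level < L:
        for uid, t in frontier.items():
            exposed[uid] = (level, t)
        # Next level: users who met a frontier user after that user's exposure.
        nxt = {}
        for uid, traj in D.items():
            if uid in exposed:
                continue
            for vid, vt in frontier.items():
                t = meets(traj, D[vid], psi, tau, after=vt)
                if t is not None and (uid not in nxt or t < nxt[uid]):
                    nxt[uid] = t
        frontier = nxt
        level += 1
    return exposed
```

This sketch compares every pair of points, so its cost grows with the product of trajectory lengths and the number of users; it serves only as a reference against which an indexed solution can be checked.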

3. The Proposed Index

The trajectories in our datasets can be very long (e.g., the last 14 days of mobility traces of each user) and may cover large areas. Using an R-tree to index the two spatial dimensions and one temporal dimension of these trajectories might not be useful, as each trajectory's MBR (Minimum Bounding Rectangle) will most likely overlap with too many other trajectories' MBRs, making the pruning scheme of the R-tree ineffective. On the other hand, if we use a quadtree to index all points of the trajectories, the points of a single trajectory may end up in many quadtree blocks, making it hard to decide which trajectories should be stored together in a disk block to facilitate faster retrieval of candidate users.

3.1. QR-tree

The key intuition of our proposed index is that trajectories whose points are co-located at overlapping time instants are likely to match the same query. Based on this observation, we present a two-level index, the Quad R (QR) tree, that combines the strengths of both the quadtree and the R-tree.

First, the spatial data space is recursively partitioned using a quadtree, where each leaf quadtree block contains no more than $\theta$ points. Then we use a space-filling curve, specifically a z-curve (Morton order), to number these leaf quadtree blocks. We call such a number the spatial-id of the block. Thus, in the spatial domain each trajectory is represented as a list of spatial-ids. Similarly, each timestamp of a trajectory is mapped to a time bucket and assigned a number, the temporal-id. Each trajectory can thereby be represented as a sequence of (spatial-id, temporal-id) tuples. This new mapping of trajectories can be seen as a transformation to a new coordinate system, where the x-axis represents the spatial dimension and the y-axis represents the temporal dimension, and each trajectory is represented as a set of points in that space.

In the next step, we use an R-tree to group trajectories based on their sets of points in the transformed space. Essentially, each set of points in this new space is represented as an MBR, and the R-tree groups close-by MBRs in a leaf node. Each leaf node of the constructed R-tree is stored in a disk page. We maintain this disk-page id in all the leaf blocks of the first-level quadtree that contain a point of the trajectories stored in this disk page. In the quadtree block, we also maintain the associated temporal-ids denoting the time range of the trajectory points stored in the corresponding disk page. Note that we do not keep the hierarchical structure of the R-tree for query processing; rather, we only use the R-tree for grouping similar trajectories in the transformed space. Figure 1 and Figure 2 show the construction process of the QR-tree.
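The transformation described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes points lie in a half-open square `[x0, x1) x [y0, y1)`, visits quadrants in the fixed order SW, SE, NW, NE so that the position of a leaf in the resulting list serves as its z-order spatial-id, and uses fixed-width time buckets. The names `build_leaves`, `spatial_id`, `transform`, and `mbr` are hypothetical.

```python
def build_leaves(points, box, theta, leaves=None):
    """Split box = (x0, y0, x1, y1) until each leaf holds at most theta points.

    Children are visited in z-order (SW, SE, NW, NE), so the index of a leaf
    in the returned list is used as its spatial-id.
    """
    if leaves is None:
        leaves = []
    x0, y0, x1, y1 = box
    if len(points) <= theta or (x1 - x0) < 1e-9:
        leaves.append(box)
        return leaves
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    for cx0, cy0, cx1, cy1 in ((x0, y0, mx, my), (mx, y0, x1, my),
                               (x0, my, mx, y1), (mx, my, x1, y1)):
        inside = [(x, y) for x, y in points
                  if cx0 <= x < cx1 and cy0 <= y < cy1]
        build_leaves(inside, (cx0, cy0, cx1, cy1), theta, leaves)
    return leaves

def spatial_id(leaves, x, y):
    """Linear scan for the leaf containing (x, y); a real index would descend the tree."""
    for i, (x0, y0, x1, y1) in enumerate(leaves):
        if x0 <= x < x1 and y0 <= y < y1:
            return i
    raise ValueError("point outside the indexed space")

def transform(traj, leaves, t0, bucket):
    """Map (x, y, t) samples to (spatial-id, temporal-id) tuples."""
    return [(spatial_id(leaves, x, y), int((t - t0) // bucket))
            for x, y, t in traj]

def mbr(points):
    """Minimum bounding rectangle of 2D points in the transformed space."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))
```

For example, with `theta = 2` and six points of which three fall into the south-west quadrant, only that quadrant is split again, giving seven leaf blocks in total; the MBRs returned by `mbr` are what an R-tree would then group into disk pages.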
Figure 1(a) shows an example with four user trajectories, $\{u_1, \ldots, u_4\}$, where $\theta = 2$. The space is first divided into four quadrants $Q_1, \ldots, Q_4$. As one of these quadrants contains more than $\theta$ trajectory points, that block is further divided into four sub-quadrants. We then apply z-ordering to number the resulting leaf quadtree blocks $b_i$. After that, the timestamp of each trajectory point is assigned a time-bucket number $t_j$. Figure 1(b) then shows the new representation of the trajectories $u_1$ to $u_4$ as sequences of $(b_i, t_j)$ tuples. These points are then mapped into a new coordinate system in a two-dimensional space (Figure 2(a)), where the points of the four trajectories are shown in four different colors, and each set of points of a single trajectory is represented as an MBR. These MBRs are grouped together to form an R-tree. Each leaf-level node $R_k$ corresponds to a disk page. Finally, we maintain these disk-page references in the quadtree blocks of the QR-tree, as shown in Figure 2(b). For example, a leaf block $b_i$ is assigned a disk-page id $R_k$ along with the time-bucket ids of its points that are stored in that page.

Figure 1: (a) A quadtree based space partitioning of trajectories. (b) Mapping of trajectories.

Figure 2: (a) An R-tree based grouping of trajectory points in a transformed space. (b) A QR-tree index structure.

3.2. Q²R-tree

We further improve the proposed QR-tree index by augmenting it with another, top-level quadtree. The intuition behind adding this top-level quadtree is twofold: (i) it partitions the entire set of trajectories into different groups based on their extents, so the index has better pruning capability; (ii) since a longer trajectory will most likely contain more points than a shorter one, maintaining trajectories of different lengths in a single R-tree is a challenge, as they may occupy different amounts of storage space on disk.

In our proposed Q²R-tree, a trajectory is stored under a non-leaf or leaf node of the top-level quadtree based on its extent. We recursively partition the space, and a trajectory is stored in the smallest quadtree block that fully contains it. Thus, long trajectories are stored in upper-level quadtree blocks, whereas shorter trajectories are stored in lower-level quadtree blocks. For all the trajectories in a single quadtree block, we apply the QR-tree strategy to organize them on disk. Since we use two quadtrees and one R-tree in the index, we refer to this index as the Q²R-tree.
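The placement rule of the top-level quadtree can be sketched as a descent that stops as soon as the points of a trajectory fall into more than one quadrant. This is a hedged illustration only: the function name `containing_block`, the `max_depth` cut-off, and the encoding of a block as a path of quadrant indices (0 = SW, 1 = SE, 2 = NW, 3 = NE) are assumptions made for the example.

```python
def containing_block(traj, box, max_depth):
    """Path of quadrant indices to the deepest block that contains all points.

    traj is a list of (x, y, t) samples and box = (x0, y0, x1, y1) is the
    root block. An empty path means the trajectory stays at the root.
    """
    path = []
    x0, y0, x1, y1 = box
    for _ in range(max_depth):
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        # Quadrant of every point relative to the current block's center.
        quads = {(int(x >= mx), int(y >= my)) for x, y, _t in traj}
        if len(quads) != 1:
            break  # points straddle several quadrants: stop at this block
        qx, qy = quads.pop()
        path.append(qx + 2 * qy)
        x0, x1 = (mx, x1) if qx else (x0, mx)
        y0, y1 = (my, y1) if qy else (y0, my)
    return tuple(path)
```

A trajectory that spans the whole space gets the empty path and is kept at the root, while a short, local trajectory descends several levels; each block would then organize its own trajectories with the QR-tree strategy.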

4. Processing CTQ

In this section, we present an algorithm for processing the CTQ using our proposed QR-tree. Intuitively, the quadtree of the QR-tree supports faster range queries around query trajectory points, while the R-tree grouping ensures lower I/O overhead. We apply spatial pruning followed by temporal pruning using the QR-tree: the irrelevant quadtree nodes are pruned first, and then the time buckets are used to further prune the R-tree blocks to be retrieved. For simplicity, we present the first level of contact tracing, where the task is to find users who were directly exposed to $q$.

Algorithm 1 gives the pseudocode of a divide-and-conquer algorithm for the CTQ. A user $u$ can be infected by $q$ if a point of $u$ is within a threshold distance $\psi$ and a threshold time $\tau$ of any point of $q$. To facilitate this spatio-temporal range search, we consider an extended minimum bounding rectangle (EMBR), in terms of space, of every point of $q$ to include the infectious region of $q$.

Algorithm 1: matchCT(N, q)
Input: A quadtree node N of QR-tree, a COVID-19 positive user trajectory q
Output: A set U of user trajectories suspected to be exposed by q
1.1   U ← ∅
1.2   if q = ∅ then
1.3       return U
1.4   if N is a leaf then
1.5       t_b_list ← extendedTimeWindows(q)
1.6       U ← evaluateContacts(N, t_b_list, q)
1.7   N_children ← children(N)
1.8   q_children ← extendedIntersection(N_children, q)
1.9   for N_c ∈ N_children, q_c ∈ q_children do
1.10      U ← U ∪ matchCT(N_c, q_c)
1.11  return U

Initially, the function matchCT(·) is called with the root node $N$ of the QR-tree and $q$. It finds the relevant child nodes of $N$ that intersect with $q$ (or the EMBRs of $q$) in the function extendedIntersection(·) (Line 1.8). Thus, quadtree nodes that are within the spatial distance threshold $\psi$ of $q$ are considered. If a child node does not intersect with the EMBR, it can be safely pruned. Otherwise, each unpruned child node $N_c$ of $N$ and the corresponding components of $q$ are passed to the matchCT function (Line 1.10).

The recursive method has two base conditions: (i) when $q$ is empty, i.e., there is no point left in that subspace for repeated division (Line 1.3); and (ii) when $N$ is a leaf node. In the case of a leaf node, the possibly infectious time buckets corresponding to the points in $q$ are calculated with the function extendedTimeWindows(·). This function returns all the time windows within the temporal range $\tau$ of each point of $q$. Then the exposure of the trajectories stored in the disk blocks mapped to the node $N$ and to the entries of the temporal bucket list is computed with evaluateContacts(·).

The function evaluateContacts determines which trajectories meet with $q$. To compute this, we first need to retrieve the trajectories that have transformed coordinates $(N, t)$ for each entry $t$ in t_b_list. So we look up our in-memory QR-tree index and obtain a list of relevant R-tree nodes (i.e., disk block ids). We then fetch the trajectories stored in those disk blocks. For each retrieved user trajectory $t_i \in T_r$, we compute whether the user meets with $q$. The trajectory $t_i$ is included in the exposed set $U$ if it meets with $q$.

The above algorithm supports multi-level contact tracing by passing the depth level $L$ to matchCT(·) as a recursion depth parameter. Initially the parameter is set to 0. Then, in the aforementioned second base condition, we can call matchCT(·) recursively for each of the computed exposed trajectories with the depth parameter incremented by 1, until it has reached $L$.
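The first-level search of Algorithm 1 can be sketched in Python as below. This is an illustrative reading of the pseudocode under several assumptions, not the authors' Java implementation: the quadtree is held in memory as `Node` objects, each leaf keeps a hypothetical `pages` map from temporal-id to disk-page ids, `disk` is a plain dictionary standing in for disk pages, time buckets have a fixed width, and the EMBR test is applied per query point. The recursion over the depth parameter $L$ is omitted.

```python
from dataclasses import dataclass, field
from math import hypot

@dataclass
class Node:
    box: tuple                                    # (x0, y0, x1, y1)
    children: list = field(default_factory=list)  # empty for a leaf
    pages: dict = field(default_factory=dict)     # leaf: temporal-id -> page ids

def intersects_embr(box, p, psi):
    """True if the psi-extended square around query point p overlaps box."""
    x0, y0, x1, y1 = box
    x, y, _ = p
    return x0 - psi <= x <= x1 + psi and y0 - psi <= y <= y1 + psi

def extended_time_windows(q, tau, t0, bucket):
    """Temporal-ids of all buckets within tau of some point of q."""
    ids = set()
    for _, _, t in q:
        lo = int((t - tau - t0) // bucket)
        hi = int((t + tau - t0) // bucket)
        ids.update(range(lo, hi + 1))
    return ids

def meets(u, v, psi, tau):
    """Meeting condition: some pair of points within psi and tau."""
    return any(hypot(ux - vx, uy - vy) <= psi and abs(ut - vt) <= tau
               for ux, uy, ut in u for vx, vy, vt in v)

def match_ct(node, q, psi, tau, disk, t0=0, bucket=3600):
    """Return the ids of users stored under `node` who meet q."""
    if not q:                     # base case (i): no query point left
        return set()
    if not node.children:         # base case (ii): leaf node
        page_ids = set()
        for tb in extended_time_windows(q, tau, t0, bucket):
            page_ids |= node.pages.get(tb, set())
        # Fetch only the referenced pages and verify each candidate.
        return {uid for pid in page_ids
                for uid, traj in disk[pid].items()
                if meets(traj, q, psi, tau)}
    exposed = set()
    for child in node.children:
        # Keep only the query points whose EMBR touches this child.
        q_c = [p for p in q if intersects_embr(child.box, p, psi)]
        exposed |= match_ct(child, q_c, psi, tau, disk, t0, bucket)
    return exposed
```

In this sketch a child whose box is farther than $\psi$ from every query point receives an empty query and returns immediately, which corresponds to the spatial pruning step; the lookup in `pages` corresponds to the temporal pruning step.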

5. Experimental Evaluation

In this section, we compare the QR-tree with a baseline (BL) approach, where we use a 3D R-tree (for location and timestamp) for indexing. We use it as our baseline because, in contrast to other methods, a trajectory is saved in an R-tree leaf as a single object. This is ideal for retrieving the whole trajectory during the processing of the CTQ. We use the mobility traces from the CDR data collected by Grameenphone Ltd between June 19, 2012 and July 18, 2012 [8], hereafter referred to as BD Cellphone, as our default dataset. In addition, we use the Foursquare check-in dataset [9] of New York City, hereafter referred to as NYF, to evaluate the performance of the CTQ in a different spatio-temporal domain. We use JDK 1.8 for implementing our algorithms, which were run on an Intel Core i5-3570K processor (3.4 GHz) with 8 GB of RAM.

Performance Evaluation and Parameterization.

The parameters we varied, their ranges and default values (in bold) are shown in Table 1. We vary a single parameter in each experiment while the others are assigned their default values. We measure the impact of the parameters on runtime and I/O cost.

Table 1: Parameters

Parameters                              Ranges
No. of points per query trajectory      ..., 101-200, > ...
No. of indexed trajectories             10k, 25k, 50k, 100k
Spatial Distance Threshold ($\psi$)     1m, 2m, 4m, 10m
Temporal Distance Threshold ($\tau$)    1 min, 15 min, 30 min, 1 hour, 3 hour
Maximum Recursion Depth ($L$)           1, 2, 3

Note that the choice of spatial and temporal range thresholds is mostly application specific. We have varied them in the aforementioned ranges mainly to demonstrate the performance of our work. For COVID-19, 1 meter was considered the maximum distance for transmission via respiratory droplets [10], while some other studies suggest 2 meters [11]. In addition, there is some evidence, though perhaps no concrete proof, of transmission via aerosolized respiratory fluids, which are likely to travel farther. So we have varied the spatial range up to 10 meters, considering indoor environments. On the other hand, there is still no authoritative information on the temporal threshold for COVID-19 transmission. To the best of our knowledge, research on this topic is still ongoing. We have varied it from as low as 1 minute up to 3 hours, since this upper bound is suggested by a study [12] in the New England Journal of Medicine. However, when more reliable information about spatial and temporal thresholds becomes available, our algorithm should work with the updated parameter values without any modification to its design.

Figure 3: Evaluating CTQ for varying no. of points per query trajectory (a & c) and no. of trajectories (b & d). Panels (a) and (b) plot Time (sec); panels (c) and (d) plot # of Blocks.

BD Cellphone

(i) No. of points per query trajectory: Our algorithm using the QR-tree index outperforms the baseline by orders of magnitude in terms of both runtime (Figure 3a) and I/O cost (Figure 3c). As the number of points in the query trajectory increases, more user trajectories in different blocks are expected to be processed. So runtime as well as I/O should increase in both approaches, as reflected in the graphs (Figure 3a, 3c).

Figure 4: Evaluating CTQ for varying spatial distance threshold, $\psi$ (a & c) and temporal distance threshold, $\tau$ (b & d). Panels (a) and (b) plot Time (sec); panels (c) and (d) plot # of Blocks; legend: BL, QR.

Figure 5: Evaluating CTQ for varying maximum recursion depth, $L$. Panel (a) plots Time (sec); panel (b) plots # of Blocks; legend: BL, QR.

(ii) No. of indexed trajectories:

The QR-tree outperforms the baseline by orders of magnitude in terms of runtime (Figure 3b) and by around an order of magnitude (Figure 3d) in terms of disk I/O cost. Both approaches follow an increasing trend as the number of indexed trajectories increases. This is because more trajectories require more disk blocks to be stored. So a higher number of disk blocks, i.e., a larger number of trajectories, is expected to be retrieved in query processing, requiring more time to process.

(iii) Spatio-temporal thresholds ($\psi$, $\tau$): When we vary the spatial ($\psi$) or temporal ($\tau$) distance threshold, the QR-tree works better than the baseline by orders of magnitude in terms of runtime (Figure 4a, 4b) and by at least an order of magnitude (Figure 4c, 4d) in terms of disk I/O cost.

(iv) Maximum recursion depth ($L$): The performance advantage of the QR-tree is particularly important when we consider multiple levels of the CTQ, that is, when we consider exposure from already exposed users up to a certain level, instead of from confirmed patients only. The QR-tree outperforms the baseline approach by orders of magnitude in terms of both runtime (Figure 5a) and I/O cost (Figure 5b) when we vary the maximum recursion depth level, $L$. More importantly, note that the QR-tree can provide results in tens of seconds for up to three levels of exposure, while the baseline would require thousands of seconds to do so. The I/O benefits may seem misleading for higher depth levels (Figure 5b) because the CTQ processing gets saturated in terms of disk block access, i.e., it accesses almost all the blocks in both approaches (the baseline being marginally higher) to retrieve potential candidate trajectories. For this reason, the baseline approach has a somewhat flat tail, as it already accesses all the disk blocks. Running the experiment with a higher number of trajectories to demonstrate this I/O gain is not feasible because of the intractable runtime of the baseline method. Besides, instead of the default temporal distance threshold ($\tau$) value of 15 minutes, we have used $\tau$ = 1 minute for this experiment to keep the results demonstrable.

Figure 6: Evaluating CTQ on NYF for varying spatial distance threshold, $\psi$ (a & c) and temporal distance threshold, $\tau$ (b & d). Panels (a) and (b) plot Time (sec); panels (c) and (d) plot # of Blocks; legend: BL, QR.

NYF

The NYF dataset is significantly smaller than the BD Cellphone dataset. We report only the impact of varying the spatio-temporal distance thresholds in the experiments with this dataset, since varying the other parameters would be of little value given the dataset size.

(i) Spatio-temporal thresholds ($\psi$, $\tau$): The QR-tree works better than the baseline by orders of magnitude in terms of both runtime (Figure 6a, 6b) and I/O cost (Figure 6c, 6d). So the performance of the QR-tree is comparatively even better for the NYF dataset. The I/O graph of the baseline looks flat because it accesses all the disk blocks whatever the parameter values are. This is because the dataset spans a longer temporal domain than that of BD Cellphone.

R-tree

In a real system, the performance gain of our proposed index will largely be attributable to the lower I/O cost. This is not simulated in the runtime experiments. So, in a real deployment, the merits of the QR-tree are very likely to be manifold greater than what is reported here.

Figure 7: I/O comparison of enhanced tree for CTQ. The plot shows # of Blocks for 50k, 100k, 150k and 200k indexed trajectories.

The further enhancement we have proposed on the QR-tree, namely the Q²R-tree, works slightly better in terms of I/O cost, especially when we deal with a larger number of trajectories, as demonstrated in Figure 7. Note that this experiment is run by indexing 50k, 100k, 150k and 200k trajectories, respectively. The Q²R-tree achieves a reduction in the number of disk blocks accessed, especially for a higher number of trajectories. So, although CTQ processing using the Q²R-tree needs multiple levels of tree traversal, the marginally lower I/O overhead can eventually result in better runtime performance as well, which is subject to further experiments in real systems.

6. Related Works

The works related to ours mostly encompass studies of trajectory indexing in the spatio-temporal domain and of query processing using these indexes. In addition, the outbreak of the COVID-19 pandemic has prompted much ongoing research on contact tracing, most of which approaches the challenge from different perspectives.

Indexing of moving objects, i.e., storing trajectory data efficiently, has received considerable attention throughout the last two decades. Mokbel et al. present a summary of spatio-temporal indexing methods in their survey [13], covering some of the earlier studies in this field. They point out three techniques that have been used to index historic trajectory data: augmenting an existing spatial index with a temporal index, combining both spatial and temporal access in a single structure, and indexing mainly on temporal information while treating the spatial index as secondary. Nguyen-Dinh et al. extend this work in [14] and summarize the indexing methods adopted in the 2003-2010 period according to the aforementioned categories. Mahmood et al. focus on the more recent techniques in [15], in succession to the previous works. We describe some of these indexing methodologies briefly and compare the relevant ones with our work. The details and a more elaborate discussion can be found in [13], [14], [15] and the research works they address.

Indexing methods such as the RT-tree and the 3D R-tree deal with temporal information along with spatial data, as summarized in [13]. The RT-tree [16] simply augments the MBRs of the R-tree with time interval information. So it performs as well as the R-tree for spatial queries, but temporal queries often span the whole tree. Our proposed contact tracing query needs to process both spatial and temporal ranges, so the RT-tree would be inefficient for our purpose. On the other hand, the 3D R-tree [17] treats the temporal attribute as an additional dimension of the spatial R-tree, processing spatial and temporal queries alike. We have used this approach as our baseline method for its potential applicability to our proposed query. The spatio-temporal R-tree (STR-tree) [18] is another approach to indexing spatio-temporal data with the R-tree at its core but with a different insertion and splitting strategy.
It focuses on both spatial locality and trajectory preservation based on a configurable parameter [13]. However, in this approach different segments of a trajectory may be stored in different nodes, or spatio-temporally close trajectories may be grouped separately, both of which can deteriorate the performance of our proposed query processing.

Some trajectory-oriented access methods put more emphasis on grouping the points of each trajectory together. The TB-tree [18], SETI [5], etc. adopt such mechanisms. Spatial queries and keeping spatially close objects together are not among their primary concerns, as stated in [13].

The trajectory-bundle tree (TB-tree) strictly emphasizes trajectory preservation and gives up spatial locality if needed. It is also built on top of the R-tree, but its MBRs overlap considerably, whereas a regular R-tree minimizes such overlap. This structure can deal quite efficiently with trajectory-based queries involving topology with spatio-temporal attributes like area and time, or those based on navigation. But for the contact tracing query we need both trajectory interaction in terms of spatio-temporal locality and trajectory preservation for efficient processing. So it is not readily applicable to our problem either. Besides, both the TB-tree and the STR-tree retrieve trajectory segments incrementally [18]. If someone is infected in our case, her whole trajectory needs to be retrieved, which would be costly using these indexes.

The indexing mechanism SETI, proposed in [5], addresses the scalability issues of existing indexing schemes. It presents a two-level index structure: the first-level index partitions the spatial domain into static, uniform and non-overlapping cells, and the second-level index uses a traditional R-tree to index the time domain. Using the first index, the segments of each trajectory are assigned to the cells according to their spatial coordinates.
If a segment spans multiple cells, it is split at the cell boundary. Then, the time spans of the segments (i.e., the minimum and the maximum timestamps) in each cell are stored in an R-tree. The spatial and temporal dimensions are thus effectively decoupled. The authors mainly discuss range queries with their proposed index, where spatial filtering and then temporal filtering are done using the first- and second-level indexes respectively. After that, a refinement step retrieves the desired trajectories. The index also supports efficient insertion, deletion and update.

The Start-End timestamp B-tree (SEB-tree) [19] is another trajectory-oriented index similar to SETI. The space is partitioned into overlapping zones, which are indexed using the SEB-tree considering only start and end timestamps. The moving objects are mapped to their zones using hashing. Unlike SETI, however, the SEB-tree works on two-dimensional points instead of trajectories.

Other schemes presented in [13] for indexing historical trajectory data include the MR-tree, HR-tree, MV3R-tree, etc. These indexes separate the spatial and temporal dimensions, as they aim at storing spatial attributes of the trajectories at different timestamps in different R-trees.

The survey in [14] presents some specialized indexing methods besides some improvements on the previous works. The MTSB-tree [20] is similar to SETI in its spatio-temporal organization, except that it uses a Time-Split B-tree (TSB-tree) for temporal indexing of trajectory segments instead of SETI's R-tree based temporal index. Trajectory segments are therefore sorted in increasing order of time. The FNR-tree and MON-tree use multiple R-trees to store object movement locations and time intervals. The latter also uses a hash structure for mapping object movement lines to the lower-level temporal R-tree. The GS-tree indexes trajectories in a constrained graph by dividing them into nodes and edges. 
It is a balanced binary tree that separates the time dimension from the spatial one. Here a leaf node represents the MBR of edges and points to two different data structures for the spatial and temporal dimensions. The Compressive Start-End tree (CSE-tree) [21] divides space into disjoint regions like SETI and maintains a temporal index for each of these regions. It treats time intervals as two-dimensional points and maintains separate B+ trees, indexing end times followed by start times, to group trajectory segments. The Polar tree is specialized to index the direction of moving objects. It uses an in-memory unbalanced binary tree to index the orientation of objects with respect to a given focal point, and can efficiently determine whether many objects are approaching or moving away from a reference site. Besides, the RTR-tree and TP²R-tree provide better support for range queries in Euclidean space in indoor environments. The other indexing methods studied in the survey are beyond the scope of this discussion, as they bear little resemblance to our work.

The survey in [15] describes some of the more recent works in trajectory indexing. TrajStore partitions trajectories and clusters spatio-temporally close segments together on disk. TrajTree also relies on trajectory segmentation, where leaf nodes contain sub-trajectories and non-leaf nodes hold sequences of the bounding boxes of their child nodes. Most of the other spatio-temporal indexes are specialized and application-specific. For instance, UTH and UTGRID deal with trajectories with uncertain portions, PARINET is specific to trajectories along road networks, TRIFL optimizes trajectory indexing in flash storage, and so on.
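
The two-level organization described above for SETI (a static spatial grid first, a per-cell temporal structure second) can be illustrated with a minimal Python sketch. It is not the implementation from [5]: the class and field names are hypothetical, each segment is reduced to a single representative location so that no splitting at cell boundaries is needed, and each cell keeps a plain list that is scanned linearly where SETI would use an R-tree over the time spans.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A trajectory segment, simplified to one location plus its time span."""
    traj_id: int
    x: float
    y: float
    t_start: float
    t_end: float


class TwoLevelIndex:
    """Level 1: uniform, non-overlapping grid cells. Level 2: time spans per cell."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        # Assumption: a list per cell stands in for the per-cell temporal R-tree.
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Map a coordinate to the grid cell that contains it.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, seg):
        self.cells[self._cell(seg.x, seg.y)].append(seg)

    def range_query(self, x_min, y_min, x_max, y_max, t_min, t_max):
        """Return ids of trajectories with a segment inside the space-time box."""
        cx0, cy0 = self._cell(x_min, y_min)
        cx1, cy1 = self._cell(x_max, y_max)
        hits = set()
        # Spatial filtering: visit only the cells overlapping the query box.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for seg in self.cells.get((cx, cy), ()):
                    # Temporal filtering: skip time spans that miss the window.
                    if seg.t_end < t_min or seg.t_start > t_max:
                        continue
                    # Refinement: exact check against the query box.
                    if x_min <= seg.x <= x_max and y_min <= seg.y <= y_max:
                        hits.add(seg.traj_id)
        return hits
```

The sketch keeps the three query stages named in the text (spatial filtering, temporal filtering, refinement) visible as separate steps. Because space and time are decoupled, a cell that a user visits at many different times still has to be examined for every query touching that cell.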

The common queries in the spatio-temporal domain can be classified into two broad categories: coordinate-based queries and trajectory-based queries [18]. Coordinate-based queries include point-specific queries, range queries, nearest neighbor search, etc., while trajectory-based queries can involve topology or navigational details. Coordinate-based queries have been addressed since the earlier works in the spatio-temporal trajectory domain, such as the RT-tree and 3D R-tree [13], which were later improved upon by more efficient indexes such as SETI and the SEB-tree. Trajectory-oriented queries, where trajectory preservation can play an important role, were addressed and efficiently processed using the STR-tree and TB-tree in [18]. Later indexes propose improvements in both directions and work with new queries such as k-NN (e.g., TrajTree), but these queries do not align with the proposed contact tracing query, since both spatio-temporal range search and whole-trajectory retrieval are of utmost importance here.

A query somewhat similar to the contact tracing query (CTQ) is presented in [22], which the authors call the trajectory multi-range query (MRQ). The goal of MRQ is to find the set of trajectories that go through a set of given spatio-temporal ranges. We can consider CTQ as a multi-range query over trajectories too, where there is a query spatio-temporal range for each point of the CTQ query trajectory. However, there is a very important distinction between MRQ and CTQ. In MRQ, the resultant trajectories pass through all the given spatio-temporal ranges, whereas in CTQ, even if a trajectory goes through only one of the query spatio-temporal ranges, we need to return it and do further processing on it. Also, in CTQ we consider indirect contact/exposure to the query trajectory, which is not considered in MRQ.

Since the start of the COVID-19 pandemic, governments have rolled out contact tracing apps [23] in order to contain the spread of the virus. 
The aim of these apps is to determine whether a user was exposed to a person known to be infected with COVID-19 and, if so, to notify her for testing and starting the quarantine process. A review of the existing technologies for contact tracing is presented in [24]. Proximity-based contact tracing apps use Bluetooth and WiFi to infer relative proximity to other users. Location-based technologies, on the other hand, use GPS to locate the exact position of the users. Although GPS positioning is not very accurate in indoor spaces, that problem can be overcome with the additional use of crowd-sourced WiFi localisation. With modern WiFi access points, the accuracy is good enough for contact tracing [24]. However, both of these technologies require a smartphone and an app to be installed by the user. The infrastructure and devices required by these methods may not be available in certain places, especially in developing countries. In such situations, using the mobile operator's infrastructure to locate a user's phone is an option. This has the advantage of using existing infrastructure without requiring the user to do anything. However, accuracy and privacy concerns remain.
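
To make the distinction between MRQ and CTQ drawn earlier in this section concrete, the following brute-force Python sketch contrasts the two predicates and adds a simple propagation step for indirect contact. It is only an illustration under stated assumptions, not the query processing proposed in this paper: all function names are hypothetical, trajectories are lists of (x, y, t) samples taken at shared timestamps, a query range is a square of half-width d around a sample at that sample's timestamp, and a newly contacted user is assumed to start spreading only from the next timestamp.

```python
from collections import defaultdict


def in_range(p, r):
    """p = (x, y, t); r = (x_min, y_min, x_max, y_max, t_min, t_max)."""
    x, y, t = p
    return r[0] <= x <= r[2] and r[1] <= y <= r[3] and r[4] <= t <= r[5]


def passes_through(traj, r):
    return any(in_range(p, r) for p in traj)


def ranges_around(traj, d):
    """One spatio-temporal query range per sample of the query trajectory."""
    return [(x - d, y - d, x + d, y + d, t, t) for x, y, t in traj]


def mrq(trajs, ranges):
    """MRQ semantics: keep trajectories passing through ALL ranges."""
    return {u for u, tr in trajs.items()
            if all(passes_through(tr, r) for r in ranges)}


def ctq_direct(trajs, ranges):
    """Direct-contact part of CTQ: passing through ANY one range suffices."""
    return {u for u, tr in trajs.items()
            if any(passes_through(tr, r) for r in ranges)}


def ctq(trajs, query_id, d):
    """Direct and indirect contacts, found by sweeping timestamps in order."""
    by_time = defaultdict(list)
    for u, tr in trajs.items():
        for x, y, t in tr:
            by_time[t].append((u, x, y))
    exposed = {query_id}
    for t in sorted(by_time):
        # Users exposed before this timestamp act as sources at time t.
        sources = [(x, y) for u, x, y in by_time[t] if u in exposed]
        newly = {u for u, x, y in by_time[t]
                 if u not in exposed
                 and any(abs(x - sx) <= d and abs(y - sy) <= d
                         for sx, sy in sources)}
        exposed |= newly
    return exposed - {query_id}
```

Here `mrq` and `ctq_direct` differ only in `all` versus `any`, which is the distinction stated in the text, while `ctq` also returns users who met an already contacted user at a later timestamp. The sketch scans every user at every timestamp, which is the cost an index is meant to avoid.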

7. Conclusions

We have proposed a novel CTQ in the context of spatio-temporal databases and developed a multi-level index, namely QR-tree, to efficiently process the CTQ. Experimental results show that the QR-tree based approach outperforms the baseline by 1-2 orders of magnitude in terms of both processing time and I/O. In the future, we plan to develop a system based on the proposed index and make it available to the community.

References

[1] CoronavirusJHU, https://coronavirus.jhu.edu/

arXiv:2007.02806