Durable Top-K Instant-Stamped Temporal Records with User-Specified Scoring Functions
(Technical Report Version)
Junyang Gao § Google Inc.
Stavros Sintos § University of Chicago
Pankaj K. Agarwal
Duke University
Jun Yang
Duke University
Abstract—A way of finding interesting or exceptional records from instant-stamped temporal data is to consider their "durability," or, intuitively speaking, how well they compare with other records that arrived earlier or later, and how long they retain their supremacy. For example, people are naturally fascinated by claims with long durability, such as: "On January 22, 2006, Kobe Bryant dropped 81 points against the Toronto Raptors. Since then, this scoring record has yet to be broken."
In general, given a sequence of instant-stamped records, suppose that we can rank them by a user-specified scoring function f, which may consider multiple attributes of a record to compute a single score for ranking. This paper studies durable top-k queries, which find records whose scores were within the top k among those records within a "durability window" of given length, e.g., a 10-year window starting/ending at the timestamp of the record. The parameter k, the length of the durability window, and the parameters of the scoring function (which capture user preference) can all be given at query time. We illustrate why this problem formulation yields more meaningful answers in some practical situations than other similar types of queries considered previously. We propose new algorithms for solving this problem, and provide a comprehensive theoretical analysis of the complexities of the problem itself and of our algorithms. Our algorithms vastly outperform various baselines (by up to two orders of magnitude on real and synthetic datasets).

I. INTRODUCTION
Instant-stamped temporal data consists of a sequence of records, each timestamped by a time instant which we call the arrival time, and ordered by the arrival time. Such data is ubiquitous in a rich variety of domains, e.g., sports statistics, weather measurements, network traffic logs, and e-commerce transactions. A way of finding interesting or unusual records from such data is to consider their "durability," or, intuitively speaking, how well they compare with other records (i.e., records that arrive earlier or later) and how long they retain their supremacy. For example, consider the performance record: "On January 22, 2006, Kobe Bryant scored 81 points against the Toronto Raptors." While impressive by itself, this statement can be boosted by adding some temporal context: "At that time, this record was the top-1 scoring performance in the past 45 years of NBA history." Naturally, the further back we can extend the "durability" (while the record still remains on top), the more convincing the statement becomes. We can extend durability forward in time as well: "Since 2006, Kobe's 81-point scoring performance has yet to be broken as of today."

§ Most of the work was conducted when the authors were at Duke University.
The notion of durability is widely used in media and marketing, because people are naturally attracted to events that have "stood the test of time." Such analysis of durability is a useful part of the toolbox for anybody who works with historical data, and can be particularly helpful to journalists and marketers in identifying newsworthy facts and communicating their impressiveness to the public. Because temporal data can accumulate to very large sizes (especially for granular data such as weather or network statistics), and because users often want to find durable records with respect to different ranking criteria quickly, we need to answer durable top-k queries efficiently.
In this paper, we consider durable top-k queries for finding instant-stamped records that stand out in comparison to others within a surrounding time window. In general, each record may have multiple attributes (besides the timestamp) whose values are relevant to ranking these records. We assume that there is a user-specified scoring function f that takes a record as input, potentially considers its multiple attributes, and computes a single numeric score used for ranking. Intuitively, a durable top-k query returns, given a time duration τ, the records that are within the top k during a τ-length time window anchored relative to the arrival time of the record. How the window should be positioned relative to the arrival time depends on the application; our solution only stipulates that the relative positioning is done consistently across all records. In practice, we observe that most statements in media involving durability either end the window at the arrival time of the record (i.e., looking back into the past) or begin the window at the arrival time of the record (i.e., looking ahead into the future). Generally speaking, each record returned by our durable top-k query corresponds to a statement about the record that highlights the durability of its supremacy.
Note that there are different ways of capturing the notion of durability in queries, including some types that have been studied in the past, and different application scenarios may call for different semantics. To understand why our definition of durable top-k queries may be more appropriate than others in some scenarios, we examine the alternatives with a simple concrete example.
Fig. 1: A case study on finding durable noteworthy rebound performances in NBA history: (1) rebound highlights; (2) durable top-k query; (3) tumbling-window top-k; (4) sliding-window top-k. Red squares highlight results returned by the different queries, and line segments represent the durability time windows.

Example I.1. Suppose we are interested in finding exceptional
rebound performances (by individual players in individual games) in NBA history, particularly those that stood out as the top record (or tied for the top record) in a 5-year time span. Figure 1.(1) plots all relevant records (i.e., no fewer than 27 rebounds by a single player in a single game) in the entire NBA history. We consider the following three queries to accomplish our task; the latter two have been widely studied in the stream processing and top-k query processing literature. Note that in this example k = 1.
• Durable top-k (our query): This is the query that we propose. For each record, we look back in a 5-year window ending at the timestamp of the record, and check whether the record has the top score among all records within this window. Figure 1.(2) highlights the records (red squares) returned by our query; for each result record, we also show its 5-year durability window as a line segment ending at the record.
• Tumbling-window top-k: This query first partitions the timeline into a series of non-overlapping, fixed-sized (5-year) windows, and then returns the top record within each window. The placement of the windows is up to the user and can affect results. Results for one particular placement of the windows are shown in Figure 1.(3).
• Sliding-window top-k: This query slides a 5-year window along the timeline, and returns the top record for each position of the sliding window. Figure 1.(4) highlights a few representative sliding windows, as well as the top records during these windows.
All these queries are able to uncover some meaningful durable top records; i.e., for any data record (X, Y, Z) marked as a red square in Figure 1, we can claim that "player X grabbed Y rebounds in a game on date Z, which is the best in some 5-year time window." However, the queries differ in important ways:
• Tumbling-window vs. our query: The general observation is that the results of tumbling-window are highly sensitive to the choice of window placement. In Figure 1.(3), tumbling-window picks (Mutombo, 29, 2001) and the other two performances with 29 rebounds because they were the best ones during 2000–2005, but there were more impressive performances right before them, unfortunately leaving the impression that they stood out only because the windows were cherry-picked. Furthermore, if we choose to shift all windows slightly to the right such that the last window ends at the most recent arrival time, (Rodman, 34, 1992) will be eliminated by (Oakley, 35, 1988), and (Duncan, 27, 2009) will be overlooked since it is shadowed by (Love, 31, 2010). Overall, because of its high sensitivity to window boundaries, tumbling-window runs the risk of omitting important records that happen to be overshadowed by other records in the same window, and of picking less interesting records that happen to be the top ones in that specific window.
• Sliding-window vs. our query: Sliding-window is not susceptible to window placement, but it effectively considers all possible window placements and returns the union of the top records over all such placements. This approach leads to possibly many records that are not as meaningful in practice.
In Figure 1.(4), sliding-window returns overwhelmingly more results compared to our query, which makes it less suitable for mining the most noteworthy records. Even more unnatural is the fact that as we slide the window along the timeline, a record can come in and out of the result; i.e., there is no continuity. To illustrate, suppose we are interested in durable top-2 records with 5-year windows, and let us focus on Drummond's 29-rebound performance on 2015.11.3 (highlighted in Figure 1.(4)). It is surrounded by two top performances, (Howard, 30, 2018) and (Bynum, 30, 2013). Sliding-window will return this record when the window is positioned at 2014–2019, but not when positioned at 2013–2018; however, the record will be returned again when the window moves to 2012–2017. Such discontinuity makes the results rather unnatural to interpret.
In comparison, our query has neither the issue of sensitivity to window placement nor that of difficulty of interpretation, because we assess each record in a 5-year window that leads up to its own timestamp. Thus, our query result records can be consistently interpreted as having durability "within the past 5 years" and clearly communicated to the audience. The results from the other two queries would have to be qualified with rather specific durability windows, which may be perceived as cherry-picking. In general, we argue that the consistency and simplicity of our query make it more applicable to journalists, marketers, and data enthusiasts alike who seek results that are easily explainable to the public.
Although the above example ranks records by a single attribute, its argument extends to the general case where records are ranked by a user-specified scoring function that combines multiple attribute values into a single score. Besides sports, durable top-k queries have applications across many other domains. For instance, Wikipedia states that "In late January 2019, an extreme cold wave hit the Midwestern United States, and brought the coldest temperatures in the past 20 years to most locations in the affected region, including some all-time record lows." This statement stems from a simple durable top-k query over historical weather data, and allows the Wikipedia article to convey the severity of the event effectively. As an example involving more complex ranking, cybersecurity analysts rely on network traffic logs to identify unusual and potentially malicious intrusions. With an appropriately defined scoring function that combines multiple features of a session, such as duration, volume of data transfer, number of login attempts, and number of servers accessed, a durable top-k query can quickly help identify unusual traffic (relative to others around the same time) for further investigation. As another example, a financial broker may accompany a recommendation with a statement such as "the price-to-earnings ratio (P/E) of this stock last Friday was among the top 5 P/E's within its sector for more than 30 days," which is also a durable top-k query.
In sum, the efficiency of durable top-k queries makes them suitable for using large volumes of historical data efficiently to drive insights or identify leads for further investigation; the conceptual simplicity of these queries also makes them particularly attractive for explaining insights and communicating them effectively to the public.

Contributions.
Our contributions are as follows:
• We propose to find "interesting" records from large instant-stamped temporal datasets using durable top-k queries. Compared with other query types related to durability, our query produces results that are more robust (i.e., less sensitive to window placement than tumbling-window) and more meaningful (i.e., easier to interpret than sliding-window).¹
• We propose a suite of solutions based on two approaches that process "promising" records in different prioritization orders. We provide a comprehensive theoretical analysis of the complexities of the problem and of our proposed solutions.
• Our solutions are general and flexible. They do not dictate any specific scoring function f, but instead assume a well-defined building block for answering top-k queries using f, which can be "plugged into" our solutions and analysis. We give some concrete examples of f and the building block in later sections. In particular, f can be further parameterized according to user preference; these parameters, along with k, τ and I (the overall temporal range of history of interest), can all be specified at query time, making our solutions flexible and suitable for scenarios where users explore parameter settings at run time, interactively or automatically.
• We show that the query time complexity of our algorithms is proportional to O(|S| + k⌈|I|/τ⌉) in the worst case, where |S| is the answer size. Furthermore, we prove that the expected answer size |S| of a durable top-k query is O(k⌈|I|/τ⌉) under the random permutation model (where the data values can be arbitrarily chosen by an adversary but the arrival order is random); this result implies that the expected query time of our algorithms in practice is linear in the output size.

¹A related question is whether we can post-process the results of the sliding-window query to obtain the results of our query, e.g., filtering the result records in Figure 1.(4) to get those in Figure 1.(2). Unfortunately, such an approach, which we consider as one of the baselines in our experiments, is prohibitively slow on large datasets, as we shall show in later sections.

Paper Overview.
In a nutshell, our proposed algorithms 1) visit promising records in some manner, and 2) check the durability (with respect to a top-k query) of each record we visit. Our techniques for improvement mostly focus on how to efficiently identify candidate records and thereby reduce the total number of durability checks in the second step. The proposed algorithms come in two flavors: time-prioritized and score-prioritized, introduced in Section III and Section IV, respectively. The time-prioritized solution traverses and finds candidate records sequentially along the timeline, while the score-prioritized solution greedily chooses the unvisited candidate with the maximum score (with respect to f). Though they operate in different manners, we show in later sections that these two solutions equivalently reduce and bound the size of the candidate set (or, the number of durability checks). More interestingly, in Section V, we further demonstrate that the bound is proportional to the answer size of a durable top-k query, which means our algorithms run faster when the query is more selective, e.g., with smaller k or longer durability τ. Section VI experimentally evaluates our proposed solutions, including implementations inside a database system. Section VII reviews related work and Section VIII concludes.

II. PROBLEM STATEMENT AND PRELIMINARIES
Problem Statement.
Consider a dataset P with n records, where each record p ∈ P has d real-valued attributes and is represented as a point (p.x_1, p.x_2, ..., p.x_d) ∈ R^d. For simplicity, we consider a discrete time domain of interest T = {1, 2, ..., n}, and let p.t ∈ T denote the arrival time of p. All records in P are organized in increasing order of their arrival time.
TABLE I: Table of notation
  T                  Time domain
  p.t                Arrival time of p
  f                  Scoring function
  k                  Parameter of top-k query
  π≤k([t1, t2])      Top-k records in time interval [t1, t2]
  I                  Query interval
  τ                  Durability duration
  u                  Query vector
  s(n), q(n)         Space and query time of top-k index
Given a non-empty time window W = [t1, t2] ⊆ T, let P(W) denote the set of records that arrive between t1 and t2; i.e., P(W) = {p ∈ P | t1 ≤ p.t ≤ t2}. Assume a user-specified scoring function f: R^d → R that maps each record p to a real-valued score. Given a time window W = [t1, t2], a top-k query Q(k, W) asks for the k records from P(W) with the highest scores with respect to f. Let π≤k([t1, t2]) denote the result of Q(k, W); i.e., for every p ∈ π≤k([t1, t2]), there are no more than k − 1 records q ∈ P([t1, t2]) with f(q) > f(p).
For simplicity of exposition, we consider durability windows ending at the arrival time of each record (i.e., the "looking-back" version), but our solution extends to the general case where the windows are anchored consistently relative to the arrival times (including the "looking-ahead" version). We say a record p is τ-durable if p ∈ π≤k([p.t − τ, p.t]); that is, p remains in the top k throughout [p.t − τ, p.t]. (When τ is clear from the context, we simply say that a record is durable.) Note that if a record p is τ-durable, then it is also τ'-durable for any τ' ≤ τ. We are interested in finding records with "long enough" durability, i.e., durability at least τ. Given a query interval I and a durability threshold τ ∈ [1, |T|], a durable top-k query, denoted DurTop(k, I, τ), returns the set of τ-durable records that arrive during I; i.e., DurTop(k, I, τ) = {p ∈ P(I) | p ∈ π≤k([p.t − τ, p.t])}. For a record p ∈ DurTop(k, I, τ), we can also ask for the maximum duration for which it remains in the top k. Table I summarizes our notation.
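To make the definitions concrete, here is a minimal brute-force sketch that computes DurTop(k, I, τ) directly from the definition. It is only a correctness reference, not one of our proposed algorithms; the record layout and function names are ours, for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Record = Tuple[int, Tuple[float, ...]]  # (arrival time p.t, attributes p.x); illustrative layout

def dur_top_k_bruteforce(P: Sequence[Record],
                         f: Callable[[Tuple[float, ...]], float],
                         k: int, I: Tuple[int, int], tau: int) -> List[Record]:
    """Return DurTop(k, I, tau): records arriving in I that are tau-durable.

    A record p is tau-durable iff fewer than k records arriving in
    [p.t - tau, p.t] score strictly higher than p. Runs in O(n^2).
    """
    a, b = I
    answers = []
    for (t, x) in P:
        if a <= t <= b:
            higher = sum(1 for (t2, x2) in P
                         if t - tau <= t2 <= t and f(x2) > f(x))
            if higher <= k - 1:      # p is in the top-k of its own window
                answers.append((t, x))
    return answers
```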
Scoring Function and Top-k Query Building Block. As discussed earlier, our proposed algorithms and complexity analyses are applicable to any user-specified scoring function f as long as there exists a "building block" that can answer basic (non-durable) top-k queries under f. This building block can be a "black box": the novelty and major contribution of our algorithms come from their ability to reduce and bound the number of invocations of the building block, totally independent of how the building block operates internally. Of course, the overall algorithm complexity still depends on the efficiency of the building block. For a function f, we assume that an index of size O(s(n)) can be constructed in O(u(n)) time that answers top-k queries with respect to f in O(q(n) + k) time, where n is the data size and s(·), u(·), q(·) are functions of n. In this paper, we are more interested in top-k queries on a subset of the data specified by a time window W given at query time, i.e., computing Q(k, W), which reports the k records in P(W) with the highest scores with respect to f. With a little care, the top-k query building block can be used to solve this problem by paying a logarithmic factor in index size, query time and construction time. That is, for a function f we can construct an index of size O(s(n) log n) in O(u(n) log n) time so that, for given k and W, Q(k, W) can be computed in O((q(n) + k) log n) time. If the top-k building block supports updates (insertion/deletion of an item) in O(α(n)) time, our range top-k index also supports updates in O(α(n) log n) time.
Here, we give some concrete examples of f that are widely used in real-life applications, for which efficient top-k query building blocks exist. Consider the following class of scoring functions parameterized by u, which captures user preference (sketched in code below):
• linear: f_u(p) = Σ_{i=1}^d u_i · p.x_i;
• linear combination of monotone scoring functions: f_u(p) = Σ_{i=1}^d u_i · h(p.x_i), where h is a monotone function, e.g., h(·) = log(·);
• cosine: f_u(p) = (Σ_{i=1}^d u_i · p.x_i) / (|p||u|),
where u is a real-valued preference vector and f_u denotes that the scoring function f is parameterized by u. We refer to this class of functions as preference functions. Top-k queries using this class of scoring functions (in the above three forms) have been well studied over the past decades both in computational geometry [1]–[6] and in databases [7]–[10]. For example, for the preference functions above, there is an index with u(n) = O(n), s(n) = O(n), and q(n) = O(n^(1−1/⌊d/2⌋)), skipping polylog(n) factors. Using the results in [5], updates can also be supported in α(n) = O(polylog(n)) time.
As mentioned above, users can replace the scoring building block with other functions (e.g., non-linear or non-monotone). The centerpiece of our algorithms and analysis, which bounds the number of invocations of the top-k query building block, remains unchanged; but in that case, the complexity of the building block will affect the overall complexity bound. We choose these functions because 1) they are widely used in real-life applications that require ranking and 2) they are both linear and monotone, so preference top-k queries can be answered efficiently (using the same index).
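A minimal sketch of the three preference functions above, assuming plain Python sequences for u and p.x (names are ours; the log in the monotone example assumes strictly positive attribute values):

```python
import math
from typing import Sequence

def linear_score(u: Sequence[float], x: Sequence[float]) -> float:
    """Linear: f_u(p) = sum_i u_i * p.x_i."""
    return sum(ui * xi for ui, xi in zip(u, x))

def monotone_combo_score(u: Sequence[float], x: Sequence[float]) -> float:
    """Linear combination of monotone functions, here with h = log."""
    return sum(ui * math.log(xi) for ui, xi in zip(u, x))

def cosine_score(u: Sequence[float], x: Sequence[float]) -> float:
    """Cosine: f_u(p) = (sum_i u_i * p.x_i) / (|p| |u|)."""
    dot = sum(ui * xi for ui, xi in zip(u, x))
    return dot / (math.hypot(*x) * math.hypot(*u))
```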
Sliding Windows and Baseline Solution. Recall from the discussion in Example I.1 (Figures 1-(2) and 1-(4)) that there is a connection between our problem and the sliding-window version, which has been well studied [11]–[13]. Indeed, one of our baseline solutions is adapted from [11], with incremental top-k maintenance over sliding windows (in particular, the idea of the Skyband Maintenance Algorithm (SMA) to reduce the number of top-k re-computations from scratch). However, the standard sliding-window technique is better suited to data streams, where incoming data must be scanned linearly anyway. Our query, in contrast, analyzes historical data, and the linear complexity of sliding windows becomes infeasible, especially on large datasets. This limitation motivates our solutions in later sections. Experimental results demonstrate our algorithms' significant efficiency gains (up to two orders of magnitude) over sliding-window baselines.
Duration of durable top-k records. When an algorithm finds a record p in DurTop(k, I, τ), we can also compute the maximum duration for which p remains in the top k. We do so by running a binary search with respect to the arrival times of the records back in history. At each step of the binary search we ask a top-k query to check whether p is still among the top-k records. Correctness follows from the observation that if a record is τ'-durable then it is also τ-durable for any τ < τ'. The binary search has O(log n) steps and each top-k query takes O(q(n)) time, so over all records in DurTop(k, I, τ) this procedure takes O(|DurTop(k, I, τ)| · q(n) log n) time. Notice that this procedure is independent of the algorithm we use to find the τ-durable records in I, so it can be applied at the end of any of the algorithms we propose in the next sections (without increasing their total running time).
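A sketch of this binary search, under stated assumptions: the top-k building block is abstracted as a callable is_top_k, and arrivals is the sorted list of all arrival times (both signatures are ours).

```python
import bisect
from typing import Callable, List

def max_durability(p_t: int, arrivals: List[int],
                   is_top_k: Callable[[int, int], bool]) -> int:
    """Maximum duration for which the record arriving at p_t stays in the
    top-k of its look-back window. `is_top_k(lo, hi)` answers, via one call
    to the top-k building block, whether our record is among the top-k of
    window [lo, hi]. Monotonicity (tau'-durable implies tau-durable for
    tau <= tau') makes binary search over window starts valid: O(log n) checks.
    """
    idx = bisect.bisect_right(arrivals, p_t) - 1   # position of p itself
    lo, hi = 0, idx
    while lo < hi:
        mid = (lo + hi) // 2
        if is_top_k(arrivals[mid], p_t):           # window may start this far back
            hi = mid
        else:
            lo = mid + 1
    return p_t - arrivals[lo]
```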
III. TIME-PRIORITIZED APPROACH

The time-prioritized approach is straightforward: we visit records in time order and check their durability. We start with a baseline approach (Section III-A) and propose an improved version (Section III-B) using the observation that we can skip many unpromising records in practice. What is more interesting is how this simple improvement leads to a provably substantial reduction in complexity (Section III-C).
A. Time-Baseline Algorithm
We start with a baseline solution, referred to as Time-Baseline or T-Base. T-Base shares the same spirit as the solution proposed in [11], where the authors studied how to continuously monitor top-k queries over the most recent data in a streaming setting. The main idea is to incrementally maintain the top-k set over continuously sliding windows. We start at the right endpoint of the query interval, and sequentially slide a τ-length window backwards along the timeline. For each sliding window [t − τ, t], we need the top-k result to check whether the record arriving at time t is τ-durable. Between two adjacent windows W1 = [t − τ, t] and W2 = [t − τ − 1, t − 1], the top-k result can be updated incrementally if the expired record (the one arriving at t) is not in the top-k of W1; otherwise, we must recompute the top-k of window W2 from scratch to guarantee correctness. The procedure repeats until we have visited all records in the query interval I.
Next, we analyze the query time complexity of T-Base. There are only two types of records: durable and non-durable. After visiting each durable record, we need to issue a top-k query. After visiting each non-durable record, we only need to incrementally update the current top-k set with the newly entering record, in O(log k) time. Assuming a top-k query can be answered in O((q(n) + k) log n) time, T-Base runs in O(|S|(q(n) + k) log n + n log k) time, where |S| is the answer size. This algorithm takes super-linear time (in the number of records in the query interval).
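A compact sketch of T-Base under the paper's discrete time model (one record per instant; `score[t]` maps arrival time to score). The dictionary layout and the plain-scan recomputation are our simplifications, for brevity.

```python
from typing import Dict, List

def t_base(score: Dict[int, float], k: int, a: int, b: int, tau: int) -> List[int]:
    """Return arrival times of tau-durable records in [a, b] by sliding a
    tau-length window backwards and maintaining its top-k incrementally;
    recompute from scratch only when the expiring record was in the top-k.
    """
    def from_scratch(t: int) -> List[float]:
        window = [score[u] for u in range(t - tau, t + 1) if u in score]
        return sorted(window, reverse=True)[:k]

    answers, topk = [], from_scratch(b)
    for t in range(b, a - 1, -1):
        if t != b:
            expired = score.get(t + 1)        # record leaving on the right
            entering = score.get(t - tau)     # record entering on the left
            if expired is not None and expired in topk:
                topk = from_scratch(t)        # expensive case: recompute
            elif entering is not None:
                topk = sorted(topk + [entering], reverse=True)[:k]
        cur = score.get(t)
        if cur is not None and (len(topk) < k or cur >= topk[k - 1]):
            answers.append(t)                 # record at t is tau-durable
    return answers
```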
Next, we show a solution with sub-linear query time.

B. Time-Hop Algorithm

It is not hard to see that the durable top-k query can be viewed as an offline version of the top-k query in the sliding-window streaming model. Hence, the baseline algorithm introduced above does not best serve our needs: since the entire dataset is available in advance, the continuous sliding window wastes too much time on non-durable records. After all, a meaningful durable top-k query should be selective.
Fig. 2: Data skipping in the Time-Hop algorithm.
Before describing the algorithm, we illustrate the main idea using an example for k = 3, shown in Figure 2. By running a top-3 query Q(3, [t1 − τ, t1]), suppose the record p arriving at t1 (black circle) is not τ-durable, i.e., p ∉ π≤3([t1 − τ, t1]), and the current top-3 set contains records (red squares) that arrive at t2, t3 and t4, with t4 the most recent among them. Then no record arriving between t4 and t1 can be τ-durable, and we can safely hop from t1 back to t4. This simple and useful observation simplifies the query procedure, and allows larger strides for the sliding window.
Now, we present our algorithm Time-Hop (T-Hop); the pseudocode is given in Algorithm 1.

Algorithm 1: T-Hop(k, I, τ)
Input: P, k, τ, and I = [t1, t2]. Output: DurTop(k, I, τ)
 1  Initialize answer set S ← ∅ and top-k set π≤k ← ∅
 2  t_curr ← t2
 3  while t_curr ≥ t1 do
 4      π≤k ← Q(k, [t_curr − τ, t_curr])
 5      if P[t_curr] ∈ π≤k then
 6          S ← S ∪ {P[t_curr]}
 7          t_curr ← t_curr − 1
 8      else
 9          t_curr ← most recent arrival time of the records in π≤k
10  return S

For each record we visit with timestamp t_i, we run a top-k query on [t_i − τ, t_i] (Line 4). If the record is not durable, we slide the window back to the most recent arrival time, say t_j, of the records in the current top-k set (Line 9), skipping the non-durable records between t_j and t_i. Otherwise, if a durable record is found, we slide the window backwards by 1 (Line 7) as usual. Note that if we adopt the look-ahead version of durability, we just need to reverse the traversal order (and time-hopping) on the timeline.
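A direct Python transcription of Algorithm 1. The top-k building block is abstracted as a callable; its signature and the one-record-per-instant assumption (which guarantees non-empty windows) are ours.

```python
from typing import Callable, List, Tuple

def t_hop(k: int, I: Tuple[int, int], tau: int,
          top_k: Callable[[int, int], List[int]]) -> List[int]:
    """Return arrival times of tau-durable records in I = [t1, t2].
    `top_k(lo, hi)` returns the arrival times of the top-k records in
    window [lo, hi], one record per time instant assumed.
    """
    t1, t2 = I
    S: List[int] = []
    t_curr = t2
    while t_curr >= t1:
        winners = top_k(t_curr - tau, t_curr)   # durability check (Line 4)
        if t_curr in winners:
            S.append(t_curr)                    # durable: step back by one
            t_curr -= 1
        else:
            # hop: all winners precede t_curr, so progress is guaranteed
            t_curr = max(winners)
    return S
```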
C. Complexity Analysis of T-Hop

For the Time-Hop algorithm, the time complexity depends purely on the number of top-k queries issued during the query procedure. We provide a worst-case guarantee on the number of top-k queries performed, as shown by the lemma below (see Appendix B for full proofs).
Lemma 1. The total number of top-k queries performed by the Time-Hop algorithm is O(|S| + k⌈|I|/τ⌉).
Fig. 3: Blocking mechanism in the score-prioritized approach.
Proof (Sketch). For each record we visit in T-Hop, a top-k query is called for a durability check. If the record is not τ-durable, we refer to the check as a false check; otherwise, we add the record to the answer set. Hence, we only need to bound the total number of false checks. We decompose the false checks over a set of disjoint τ-length windows, and derive an upper bound on the number of false checks that can happen within such a window. In particular, let ρ be a window of length τ and let S_ρ be the set of τ-durable records in ρ. We divide the false checks in ρ into two types: if a false check appears immediately after a τ-durable record (found by the algorithm), it is a type-1 false check; otherwise it is a type-2 false check. By definition, the number of type-1 false checks in ρ is O(|S_ρ|). Furthermore, we show that after i type-2 false checks have occurred in ρ, a top-k query (called for a durability check) can only find k − i records in ρ; in this way we show that the number of type-2 false checks is O(k). Given a query interval I, there are at most ⌈|I|/τ⌉ disjoint τ-length sub-intervals, so the total number of top-k queries is O(|S| + k⌈|I|/τ⌉).
Overall, with an efficient top-k module, T-Hop answers a durable top-k query DurTop(k, I, τ) in O((|S| + k⌈|I|/τ⌉)(q(n) + k) log n) time. Compared to T-Base, T-Hop runs in sublinear query time (assuming that the ratio ⌈|I|/τ⌉ is not arbitrarily large); i.e., the running time does not have a linear dependency on the number of records in I. Our experimental results in Section VI suggest that T-Hop is one to two orders of magnitude faster than T-Base in practice. Furthermore, we recall that our index can be implemented with near-linear size and polylogarithmic update time for preference queries.
Notice that the number of top-k queries performed by T-Hop depends on |S| and k⌈|I|/τ⌉. Ideally, we would like to argue that the number of top-k queries is O(|S|); in theory, the term k⌈|I|/τ⌉ can be arbitrarily large compared to |S|. In Section V-A we study the expected size of S in a random permutation model, where a set of n scores chosen by an adversary is assigned randomly to the records. In that case we show that the expected size of S is roughly O(k⌈|I|/τ⌉), meaning that in practice we expect the number of top-k queries we execute to be asymptotically equal to |S|.

IV. SCORE-PRIORITIZED APPROACH
One weakness of the time-prioritized approach is that it does not pay much attention to scores and simply visits records sequentially along the timeline (with hops). Though Lemma 1 shows that T-Hop visits O(|S| + k⌈|I|/τ⌉) records in the worst case, it still potentially visits many low-score, non-durable records and asks more top-k queries. In contrast, the score-prioritized approach visits candidate records in descending order of their scores, because records with high scores have a higher chance of being durable top-k records. Furthermore, these high-score records can also serve as a benchmark for later-processed records, enabling a "blocking mechanism" to prune candidates.
Before describing the algorithms, we illustrate the main idea using the example shown in Figure 3. Suppose we answer a durable top-3 query with durability threshold τ by visiting records in descending order of their scores, p1, p2 and p3, and all three records are durable. Since p1 has the highest score in the entire query interval, any record that lies in the τ-length time interval [p1.t, p1.t + τ] is dominated by p1, which we refer to as being "blocked" by p1. Similarly, p2 (the second highest score) and p3 (the third highest score) each block a τ-length interval starting at their arrival times. The time axis is partitioned into intervals by the endpoints of all blocking intervals; in Figure 3, the number under each interval shows how many records block it. Notice the bold red interval: any record in it lies in three blocking intervals after p1, p2 and p3 have been processed. Since there are already three records with higher scores than any record in this interval, it cannot contain any τ-durable top-3 record, and we can safely remove this time interval from consideration. As we continue adding blocking intervals, eventually every remaining record in the query interval is blocked by at least three blocking intervals. The algorithm can then stop, because no more durable top records can be found. The procedure is straightforwardly applicable to the look-ahead version of durability by simply reversing the direction of the blocking intervals.
We describe three algorithms in the following sections. They differ in how the high-score records are found and how the blocking intervals are maintained.

A. Score-Baseline Algorithm
We start with a baseline method of the score-prioritized approach (S-Base), which sorts the records in the query interval in descending order of their scores. Given k, τ and a query interval [t1, t2]: (1) sort all records in the time interval [t1 − τ, t2] in descending order of scores; (2) for each record p in sorted order: if p.t ∈ [t1, t2] and p lies in fewer than k blocking intervals, add p to the answer set; otherwise, continue. In either case, add the blocking interval [p.t, p.t + τ].
Since all blocking intervals have the same length τ, we only need to maintain the left endpoints of these intervals (using a balanced binary search tree) to find intersection counts. The number of blocking intervals is O(n), so insertion and counting can both be done in O(log n) time. The sorting takes O(n log n) time, so the overall query time complexity of S-Base is O(n log n).
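A sketch of S-Base follows. A record at time t is covered by exactly the blocking intervals whose left endpoints fall in [t − τ, t], so a sorted list of left endpoints suffices for counting (it stands in for the balanced BST; bisect.insort's linear insertion cost is accepted for brevity, and names are ours).

```python
import bisect
from typing import List, Tuple

def s_base(records: List[Tuple[int, float]], k: int,
           I: Tuple[int, int], tau: int) -> List[Tuple[int, float]]:
    """Score-prioritized baseline; records are (arrival time, score) pairs."""
    t1, t2 = I
    starts: List[int] = []           # left endpoints of blocking intervals
    answers = []
    candidates = [r for r in records if t1 - tau <= r[0] <= t2]
    for t, s in sorted(candidates, key=lambda r: -r[1]):   # descending score
        covering = (bisect.bisect_right(starts, t)
                    - bisect.bisect_left(starts, t - tau))
        if t1 <= t <= t2 and covering < k:
            answers.append((t, s))   # fewer than k higher scorers nearby
        bisect.insort(starts, t)     # in either case, p blocks [t, t + tau]
    return answers
```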
Next we describe two better algorithms that avoid sorting all records in the query interval.

B. Score-Band Algorithm (Monotone f Only)
If we could quickly find a small set of candidate records C that is guaranteed to be a superset of the answers, i.e., S ⊆ C, then we could get a faster algorithm by sorting only C.
Fig. 4: Index for k-skyband duration.
It is well known that the k records with the highest scores, with respect to any monotone scoring function, belong to the k-skyband.² Hence, if a record p is τ-durable for a top-k query (with respect to a monotone f), then p must also be τ-durable for the k-skyband; i.e., p is in the k-skyband of the time interval [p.t − τ, p.t]. This observation enables us to construct an offline index of each record's duration of belonging to the k-skyband, and to efficiently produce a superset C of the answers to durable top-k queries. Note that the Score-Band algorithm has its limitation: the k-skyband technique only applies to monotone scoring functions.

²For p, q ∈ P, p dominates q if p is no worse than q in all dimensions, and p is better than q in at least one dimension. The k-skyband contains all points that are dominated by no more than k − 1 other points. The skyline is the special case of the k-skyband with k = 1.

Algorithm 2: S-Band(k, I, τ)
Input: P, k, τ, and I. Output: DurTop(k, I, τ)
 1  S ← ∅, Γ ← ∅
 2  Compute C ⊂ P by querying the durable k-skyband index
 3  Sort C in descending order of scores
 4  for p ∈ C do
 5      if p lies in < k blocking intervals in Γ then
 6          π≤k ← Q(k, [p.t − τ, p.t])
 7          if p ∈ π≤k then
 8              S ← S ∪ {p}
 9          else
10              for q ∈ π≤k such that q has not been visited before do
11                  Γ ← Γ ∪ {[q.t, q.t + τ]}
12      Γ ← Γ ∪ {[p.t, p.t + τ]}
13  return S
Index. The Score-Band algorithm needs an additional index for finding the candidate set C, which we refer to as the durable k-skyband index. Suppose the value of k is known. For each record p, we compute the longest duration τ_p for which p belongs to the k-skyband. We then map each record p to a two-dimensional point p̃ = (p.t, τ_p) in the "arrival time, duration" plane, and index all such points using a priority search tree [14] (or a kd-tree or R-tree in practice). To answer DurTop(k, I, τ), we first ask a range query with the 3-sided rectangle I × [τ, +∞). The set of points that fall into the search region is a superset of the actual set of durable records. This index can be constructed in O(n log n) time, has O(n) space, and answers a query in O(|C| + log n) time to retrieve the set C.
Fig. 5: Durability checks in S-Band and S-Hop.

Figure 4 shows an example. We have four records p1, p2, p3, p4 arriving at t1, t2, t3, t4, whose k-skyband durations are τ1, τ2, τ3 and τ4. We map them to p̃1, p̃2, p̃3 and p̃4 according to their arrival times and k-skyband durations. The 3-sided rectangle I × [τ, +∞) is shown as the shaded region; in this case, C consists of the records whose mapped points fall into this region.
In the general case, notice that we do not know the value of k upfront (a query has k as a parameter), so we cannot construct just one such index. There are two ways to handle this. If we have the guarantee that k ≤ κ for a small number κ, then we can construct κ such indexes with total space O(nκ). Otherwise, if k can be any integer in [1, n], we can construct O(log n) such indexes (priority search trees), one for each power of two up to n, so the space is O(n log n). Given a durable top-k query, we first find the number k̄ with k ≤ k̄ ≤ 2k, and then use the corresponding index to get the superset C. In this case, C contains the records that are τ-durable for the k̄-skyband, so S ⊆ C.
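Candidate retrieval then reduces to a range query over the two-dimensional points p̃ = (p.t, τ_p). The sketch below uses a linear filter in place of the priority search tree, which is enough to convey the query semantics (names are ours):

```python
from typing import List, Tuple

def skyband_candidates(points: List[Tuple[int, int]],
                       I: Tuple[int, int], tau: int) -> List[int]:
    """Answer the 3-sided range query I x [tau, +inf): return the arrival
    times of records whose k-skyband duration is at least tau and whose
    arrival time falls in I. `points` holds (p.t, tau_p) pairs; a priority
    search tree answers the same query in O(|C| + log n).
    """
    a, b = I
    return [t for (t, dur) in points if a <= t <= b and dur >= tau]
```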
Query Algorithm. We refer to this score-prioritized approach using durable k-skyband candidates as the Score-Band algorithm, or S-Band. The full algorithm is sketched in Algorithm 2 and described below. Given k, I and τ, we first retrieve the candidate set C using the durable k-skyband index as shown above. Then we sort C and visit its records in descending order of score. For each record p we visit, we first check the number of blocking intervals in which p lies. If p lies in fewer than k blocking intervals, it is a promising candidate and we run a top-k query on the time interval [p.t − τ, p.t] as a durability check. If p is indeed τ-durable, we add p to the answer set; otherwise, we add a blocking interval for each record returned by the top-k query (if we have not done so yet), since they all have higher scores than p. On the other hand, if p already lies in at least k blocking intervals, we can simply skip it. In either case, we finally add the blocking interval [p.t, p.t + τ] for p.
We can see that S-Band works similarly to S-Base. The only difference is that for a record that is blocked fewer than k times, we still have to execute a top-k query to check whether the record is τ-durable (Line 6). This durability check is necessary: though some records are guaranteed to be non-durable (i.e., not captured by C via the durable k-skyband), they can still block other, lower-scoring records from being durable. Consider the concrete example in Figure 5, where black dots represent candidate records in C and red squares represent records not in C. S-Band visits only the black dots; by the time it reaches the lowest-scoring of them, only one blocking interval has been added, even though the red-square records actually have higher scores. By running a durability check on this candidate, we discover these missing records and add the corresponding blocking intervals (Lines 10-11) for better pruning power in future steps.

Complexity.
The query time complexity of S-Band decomposes into three parts: 1) a range query to find the candidate set C; 2) sorting C by score; and 3) finding the durable records by scanning the sorted C. Summing these up, the overall query time complexity of S-Band is O(|C|(q(n) + k) log n), assuming that a top-k query can be answered in O(q(n) + k) time. In the worst case |C| = O(n), since all points can lie in the k-skyband. In Section V we show that, under the probabilistic model of [15] (where the coordinates of the points are randomly assigned), the expected size of C is O(k⌈|I|/τ⌉ log^(d−1) τ). Due to the blocking mechanism, in practice we expect the number of top-k queries to be smaller. However, notice that we always need to sort all records in C, which can make S-Band much slower because the size of C increases (in expectation) exponentially with the dimension d.

C. Score-Hop Algorithm
The data-reduction strategy of S-Band offers adequate benefits for improving the overall running time on low-dimensional datasets. However, the query overhead of searching and sorting candidate records becomes a huge burden on high-dimensional data, as it is well known that the size of the k-skyband tends to explode in high-dimensional spaces (or, equivalently, records in high-dimensional spaces tend to stay in the k-skyband for a longer duration). Furthermore, S-Band requires an additional index and only applies to monotone scoring functions. To overcome the drawbacks of S-Base and S-Band, we propose another approach that does not require sorting and has a better worst-case guarantee. The main idea is that there is no need to sort the records in advance; we can find the record with the next highest score one at a time, as we find durable records. With the help of the blocking mechanism, we can skip certain time intervals when finding the next highest-score record, despite the fact that there might be some high-score records in those intervals. This procedure is analogous to the Time-Hop algorithm: we effectively skip certain records while traversing records in descending order of score, as if taking a hop in the score domain.

Query Algorithm.
We refer to this solution as the Score-Hop algorithm, or S-Hop. The main idea is straightforward. In each iteration, we find the record with the maximum score among the records that lie in fewer than k blocking intervals; let p be that record. We run a top-k query as a durability check: if p is τ-durable, we add it to S; if not, we add a blocking interval for each record returned by the top-k query (if it has not been added before). In the end, we add the blocking interval [p.t, p.t + τ] and continue with the next highest-score record. The actual implementation is more subtle in order to guarantee a fast query time, as described below; pseudocode is provided in Algorithm 3.
Algorithm 3: S-Hop(k, I, τ)
Input: P, k, τ, and I = [a, b]. Output: DurTop(k, I, τ)
 1  H ← ∅, S ← ∅, Γ ← ∅
 2  for each [l_i, r_i] in the partition of I into disjoint τ-length intervals do
 3      M_i ← Q(k, [l_i, r_i])
 4      H.push(M_i.pop())
 5  while H ≠ ∅ do
 6      p ← H.pop(), and let p ∈ M_j
 7      if p lies in < k blocking intervals in Γ then
 8          π≤k ← Q(k, [p.t − τ, p.t])
 9          if p ∈ π≤k then
10              S ← S ∪ {p}
11          else
12              for q ∈ π≤k such that q has not been visited before do
13                  Γ ← Γ ∪ {[q.t, q.t + τ]}
14          M_j^− ← Q(k, [l_j, p.t − 1])
15          M_j^+ ← Q(k, [p.t + 1, r_j])
16          H.push(M_j^−.top()); H.push(M_j^+.top())
17      else
18          if M_j ≠ ∅ then H.push(M_j.pop())
19      if p has not been visited before then
20          Γ ← Γ ∪ {[p.t, p.t + τ]}
21  return S

Given a query interval I = [a, b], we partition it into a set of disjoint τ-length sub-intervals: [a, a + τ), [a + τ, a + 2τ), ..., [a + ⌊|I|/τ⌋τ, b]. Let [l_i, r_i] be the i-th sub-interval; in each sub-interval we find the k records with the highest scores, denoted M_i. We construct a max-heap H over the top-1 records from all sub-intervals.³ Besides the record itself, each node in H also keeps the originating interval [l_i, r_i] and the set M_i associated with the record. We repeat the following until H is empty. We take and pop the top record from H; let p be that record, originating from M_j. Then p is processed according to two cases. 1) If p lies in at least k blocking intervals, we update H by pushing the next top record in M_j (if there is any). 2) If p lies in fewer than k blocking intervals, we update H as follows. Assume that [l_j, r_j] is the sub-interval corresponding to M_j (and p). We first split [l_j, r_j] into the two intervals [l_j, p.t − 1] and [p.t + 1, r_j]. Then we run a top-k query on [l_j, p.t − 1] to get a new top-k set M_j^−, and similarly get another new set M_j^+ from [p.t + 1, r_j]. We replace the old set M_j with M_j^− and M_j^+, along with their corresponding intervals, and update H by pushing the current top records of M_j^− and M_j^+ onto the heap. In the end, we add the blocking interval for record p (if this is the first time we visit p). Figure 6 illustrates the main procedure of S-Hop for finding the next record with the highest score. It is worth mentioning that the hopping movement happens at Line 18: we effectively skip certain intervals by not splitting them and not asking new top-k queries on their sub-intervals.

³As a practical note, finding the top-1 record (instead of the top k) in each time interval can be more efficient on most real-life datasets.
Fig. 6: Illustration of the Score-Hop algorithm finding the next record with the highest score (when p lies in fewer than k blocking intervals).

Compared to S-Band, S-Hop does not have a strong dependency on the dimension of the data (only the running time of the top-k queries depends on the dimension) and makes better use of the blocking mechanism. In the end, we only find and process high-score records as needed, instead of acquiring a fully sorted order of the records in advance, which leads to better worst-case theoretical guarantees and faster query times. Experimental results in Section VI demonstrate that S-Hop can be one to two orders of magnitude faster than S-Band on high-dimensional (d ≥ 10) datasets.
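A compact Python sketch of Algorithm 3 under our assumptions: the top-k building block is a callable returning (score, arrival time) pairs best-first, and blocking intervals are counted by a linear scan instead of the O(log n) BST, purely for brevity.

```python
import heapq
from typing import Callable, List, Tuple

def s_hop(k: int, I: Tuple[int, int], tau: int,
          top_k: Callable[[int, int], List[Tuple[float, int]]]) -> List[int]:
    """Return arrival times of tau-durable records in I = [a, b].
    `top_k(lo, hi)` returns up to k (score, time) pairs of [lo, hi]."""
    a, b = I
    blocks: List[int] = []                    # left endpoints of blocking intervals
    visited, S = set(), []
    heap: List[tuple] = []                    # max-heap via negated scores

    def blocked(t: int) -> bool:              # does t lie in >= k intervals?
        return sum(1 for s in blocks if s <= t <= s + tau) >= k

    def push(lo: int, hi: int) -> None:       # push a window's best record
        if lo <= hi:
            M = top_k(lo, hi)
            if M:
                sc, t = M[0]
                heapq.heappush(heap, (-sc, t, lo, hi, M[1:]))

    for lo in range(a, b + 1, tau):           # disjoint tau-length sub-intervals
        push(lo, min(lo + tau - 1, b))

    while heap:
        _, t, lo, hi, rest = heapq.heappop(heap)
        if not blocked(t):
            winners = top_k(t - tau, t)       # durability check
            if any(w == t for (_, w) in winners):
                S.append(t)
            else:
                for (_, w) in winners:        # higher scorers become blockers
                    if w not in visited:
                        visited.add(w)
                        blocks.append(w)
            push(lo, t - 1)                   # split the sub-interval
            push(t + 1, hi)
        elif rest:                            # hop: advance within M_j only
            sc, t2 = rest[0]
            heapq.heappush(heap, (-sc, t2, lo, hi, rest[1:]))
        if t not in visited:
            visited.add(t)
            blocks.append(t)                  # p blocks [t, t + tau]
    return S
```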
Correctness.
The following lemma proves the correctness of S-Hop.

Lemma 2. Given k, I and τ, the Score-Hop algorithm returns the correct answer to the durable top-k query.

Proof (Sketch). Let S* be the set of τ-durable records in I. We show that S ⊆ S* and S* ⊆ S. The algorithm always verifies, by running a top-k query, whether a record should be in the solution (Line 8 of Algorithm 3), so S ⊆ S*.
Next we prove S* ⊆ S. The algorithm visits the records in descending score order, so it is not possible for a record p ∈ S* to lie in at least k blocking intervals before the algorithm visits p. We also need to show that the algorithm does not miss any durable record in a sub-interval [l_j, r_j] that corresponds to an empty M_j. If |P([l_j, r_j])| ≤ k, the result follows. Otherwise, we argue by induction that each time the algorithm finds a record p in M_j that is contained in at least k blocking intervals, every timestamp in the sub-interval [l_j, p.t] lies in at least k blocking intervals. Hence, if M_j is empty, every timestamp in [l_j, r_j] lies in at least k blocking intervals, and no durable record in [l_j, r_j] can be missed.

D. Complexity Analysis of S-Hop
The query complexity analysis of S-Hop is non-trivial and needs more care. There are three main sub-procedures in S-Hop: finding the next highest-score record, top-k queries for durability checks, and the blocking mechanism. As presented above, the first two components both rely on multiple top-k queries. We first show a worst-case guarantee on the total number of top-k queries called by the algorithm (see Appendix C for the full proof). As an implementation detail, once we have asked k top-1 queries in an interval, we remove it from the max-heap.
Lemma 3. The total number of top-k queries performed by the Score-Hop algorithm is O(|S| + k⌈|I|/τ⌉).

Proof (Sketch). As in the proof of Lemma 1, we need to bound the number of false checks. Let p be a false check and let p' be the record with the largest timestamp in Q(k, [p.t − τ, p.t]); we say that p is assigned to p'. If p'.t < a, where a is the left endpoint of I = [a, b], then we assign p to a. We first show that at the moment we find the false check p, the corresponding record p' can have only one of the following three properties: i) it lies in at least k blocking intervals; ii) p' ∈ S and it lies in at most k − 1 blocking intervals; iii) p' = a. If p' has property ii), then p is a type-1 false check; otherwise, p is a type-2 false check.
We first bound the number of type-1 false checks. Notice that after a type-1 false check p is assigned to p', all timestamps in the sub-interval [p'.t, p.t] lie in at least k blocking intervals. So if another false check q is later assigned to p' again, q can only be a type-2 false check. Hence, the number of type-1 false checks is bounded by O(|S|).
In order to bound the type-2 false checks, consider a window ρ of length τ in I. We make the following key observation: at the moment we find a type-2 false check p, it lies in at most k − 1 blocking intervals while p' lies in at least k blocking intervals, so there must be a blocking interval [l, r] whose right endpoint lies between p'.t and p.t, i.e., p'.t ≤ r ≤ p.t. (Notice that if p' = a is assigned more than once, then it also lies in at least k blocking intervals.) Using this observation along with other properties of false checks, we can show that after k type-2 false checks in ρ, every timestamp in ρ lies in at least k blocking intervals; hence the algorithm will not run any other top-k query in ρ. Since there are ⌈|I|/τ⌉ disjoint τ-length sub-intervals in I, the total number of type-2 false checks is O(k⌈|I|/τ⌉). Overall, the number of false checks plus the number of durable records in I is O(|S| + k⌈|I|/τ⌉).
The lemma above also shows that the number of different sets M_j created by the algorithm is O(|S| + k⌈|I|/τ⌉). For each set we can visit at most k records, so in total the algorithm may visit O(k(|S| + k⌈|I|/τ⌉)) records.⁴ Each top() or pop() operation takes O(log n) time, so we need O(k(|S| + k⌈|I|/τ⌉) log n) time to visit these records. Furthermore, recall that we need O(log n) time to check whether a record lies in at least k blocking intervals and O(log n) time to insert a blocking interval (using a binary search tree), so we also spend O(k(|S| + k⌈|I|/τ⌉) log n) time on the blocking mechanism. This running time is dominated by the time to answer O(|S| + k⌈|I|/τ⌉) top-k queries, so S-Hop answers a durable preference top-k query in O((|S| + k⌈|I|/τ⌉)(q(n) + k) log n) time (with an efficient top-k query procedure running in O(q(n) + k) time).
⁴The algorithm may visit some records that lie in at least k blocking intervals more than once; the upper bound O(k(|S| + k⌈|I|/τ⌉)) counts all the times the algorithm visits a record. We could modify the algorithm so that it never visits the same record twice, but that would complicate its description without decreasing the overall asymptotic complexity.

In practice, S-Hop issues fewer top-k queries than T-Hop, due to the candidate pruning brought by the blocking mechanism. This makes S-Hop run faster than T-Hop when the top-k query itself is expensive, e.g., for a larger k or on high-dimensional datasets.

V. EXPECTED COMPLEXITY
In the previous sections we presented two types of algorithms (time-prioritized and score-prioritized) for answering durable top-k queries, with the same worst-case guarantee on their query time; in particular, we showed that their query times depend on k⌈|I|/τ⌉ and |S|. In this section, we go beyond the worst-case analysis and analyze their performance in an "expected" sense. Most importantly, we show in Section V-A that the expected size of S is roughly k⌈|I|/τ⌉ if the scores of the data records are drawn randomly from an arbitrary distribution (which can be picked by a powerful adversary with advance knowledge of the query parameters). This result essentially establishes that, under this model, our best algorithms are in a sense optimal, because their complexity is expected to be linear in the output size. Secondly, in Section V-B, we study the expected complexity of the Score-Band algorithm by bounding the expected size of the τ-durable k-skyband candidate set C, using the same probabilistic model as [15].

A. Expected Answer Size
Consider a set of n records P with p_i.t = i for p_i ∈ P. We analyze the expected size of a query output when the scores of the records are assigned in a semi-random manner: the data values can be arbitrarily chosen, but they are assigned in a random order to the records. More formally, we consider a random permutation model (RPM). Let X = x_1 < x_2 < ... < x_n be a sequence of n arbitrary non-negative numbers chosen by an adversary, and let σ be a random permutation of {1, ..., n}. We set f(p_i) = x_σ(i), i.e., the score of record p_i is x_σ(i), where σ(i) is the image of i under σ. As argued in [16], the random permutation model is more general than the model in which all scores are drawn from an arbitrary unknown distribution, so our results hold for that model as well. The random permutation model has been widely used in a rich variety of domains and is considered a standard for complexity analysis, e.g., in online algorithms [17]–[19], discrete geometry [20]–[22], and query processing [16]. Our main result is the following.
Lemma 4. In the random permutation model, given k, τ and I, we have E[|S|] = k|I|/(τ + 1).

Proof. For a record p_i ∈ P(I), let X_i be the random variable that is 1 if p_i is a τ-durable record, and 0 otherwise. Thus E[|S|] = E[Σ_i X_i], and by linearity of expectation, E[Σ_i X_i] = Σ_i E[X_i] = Σ_i Pr[X_i = 1]. Our goal is therefore to compute Pr[X_i = 1]: the probability that fewer than k records in [p_i.t − τ, p_i.t) have score larger than f(p_i). Let P_i^τ = {p_(i−τ), ..., p_(i−1)}. For a subset Q ⊆ P_i^τ, let A_Q be the binary random variable that is 1 if all records in Q have score greater than f(p_i) and all records in Q̄ = P_i^τ \ Q have score less than f(p_i). We have

    Pr[X_i = 1] = Σ_(l=0)^(k−1) Σ_(Q ⊆ P_i^τ, |Q| = l) Pr[A_Q].    (1)

We estimate Pr[A_Q] as follows. Let V ⊂ X with |V| = τ + 1, and condition on the event that the records in P_i^τ ∪ {p_i} are assigned the scores in V. We consider all possible assignments (permutations) of V and count only those in which the records in Q have larger values than f(p_i) and the records in Q̄ have smaller values than f(p_i). A permutation satisfies this property exactly when it assigns the l largest values of V to Q, the (l + 1)-th largest value to p_i, and the remaining τ − l smaller values to Q̄. Under such an assignment, any orderings of the values within Q and within Q̄ are valid, so the number of valid permutations is l!(τ − l)!, while the total number of permutations of V is (τ + 1)!. Hence

    Pr[A_Q | V] = l!(τ − l)! / (τ + 1)! = 1 / ((τ + 1) · C(τ, l)),    (2)

where C(τ, l) denotes the binomial coefficient. Since (2) holds for every V, Pr[A_Q] = 1/((τ + 1) · C(τ, l)). Substituting this into (1), we obtain

    Pr[X_i = 1] = Σ_(l=0)^(k−1) C(τ, l) · 1/((τ + 1) · C(τ, l)) = Σ_(l=0)^(k−1) 1/(τ + 1) = k/(τ + 1).    (3)

Finally,

    E[|S|] = Σ_i Pr[X_i = 1] = k|I|/(τ + 1).    (4)

Combining Lemma 4 with the analysis of Sections III-C and IV-D, we conclude that in the random permutation model the expected query time complexity of both the Time-Hop and Score-Hop algorithms is O(|S|(q(n) + k) log n), or equivalently O(k⌈|I|/τ⌉(q(n) + k) log n), where O(q(n) + k) reflects the time complexity of answering a top-k query. In Section VI, our experimental results on real and synthetic datasets both confirm this finding.
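A quick Monte-Carlo check of Lemma 4 (a sketch; the parameters, and the truncation of windows at the start of history, are our choices):

```python
import random

def expected_answer_size(n: int, k: int, tau: int, trials: int = 20) -> float:
    """Estimate E[|S|] under the random permutation model with I covering
    the whole timeline, to compare against Lemma 4's k*|I|/(tau+1).
    Windows are truncated at the start of history, which slightly inflates
    the count for the first few instants.
    """
    total = 0
    for _ in range(trials):
        scores = list(range(1, n + 1))
        random.shuffle(scores)                    # random permutation model
        for i in range(n):
            window = scores[max(0, i - tau): i + 1]
            higher = sum(1 for s in window if s > scores[i])
            total += (higher <= k - 1)            # p_i is tau-durable
    return total / trials

# For example, expected_answer_size(2000, 5, 50) should be close to
# 5 * 2000 / 51, i.e. about 196.
```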
Dataset summary
Dataset Dimensionality Size (
NBA-X
Network-X
Syn-X are randomly generated. The following lemma bounds theexpected size of C (See Appendix D). Lemma 5.
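To make this model concrete, the following minimal C++ sketch (ours, not the paper's code) draws τ + 1 points with independently and uniformly distributed coordinates, one simple instance of the random model in [15], and measures the size of their k-skyband (the points dominated by fewer than k others). Its average grows only polylogarithmically in τ, in line with the lemma that follows.

// Sketch (ours) of the random model from [15]: each coordinate is
// drawn independently, and we measure the k-skyband size of tau+1
// points in d dimensions. Parameters are illustrative.
#include <iostream>
#include <random>
#include <vector>

int main() {
    const int tau = 1000, d = 3, k = 5, trials = 20;
    std::mt19937 gen(7);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    double total = 0;
    for (int t = 0; t < trials; ++t) {
        std::vector<std::vector<double>> pts(tau + 1, std::vector<double>(d));
        for (auto& p : pts)
            for (auto& x : p) x = unif(gen);
        int skyband = 0;
        for (int i = 0; i <= tau; ++i) {
            int dominators = 0;
            for (int j = 0; j <= tau && dominators < k; ++j) {
                if (j == i) continue;
                // Strict dominance in every coordinate; ties have
                // probability zero under a continuous distribution.
                bool dom = true;
                for (int c = 0; c < d; ++c)
                    if (pts[j][c] <= pts[i][c]) { dom = false; break; }
                if (dom) ++dominators;
            }
            if (dominators < k) ++skyband;  // fewer than k dominators
        }
        total += skyband;
    }
    std::cout << "avg k-skyband size: " << total / trials << "\n";
    return 0;
}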
The following lemma bounds the expected size of C (see Appendix D).

Lemma 5. In the random model of [15], given k, τ, and I, we have E[|C|] = O(k(|I|/τ) log^{d−1} τ).

Combining Lemma 5 with the analysis of Section IV-B, and assuming an efficient top-k query procedure that runs in O(q(n) + k) time, the expected query time complexity of the Score-Band algorithm is O(k⌈|I|/τ⌉(q(n) + k) log n log^{d−1} τ). Hence the expected complexity of the Score-Band algorithm can be higher than that of Time-Hop or Score-Hop by a factor of at most log^{d−1} τ. Experimental results in Section VI confirm this finding as we vary the data dimensionality: the curse of dimensionality makes the Score-Band algorithm perform worse even compared to other simple baselines. Again, Time-Hop and Score-Hop are both generally applicable to arbitrary user-specified scoring functions, while Score-Band only works for monotone functions.

VI. EXPERIMENTS
A. Experiment Setup
Datasets.
We use two real-life datasets and some synthetic ones, as summarized in Table II and described below:
NBA contains the performance of each NBA player in each game from 1983 to 2019, with in total ∼ million individual performance records on 15 numeric attributes. Records are naturally organized by date and time, and we break ties (e.g., performances of different players in the same game) arbitrarily. We choose subsets of the 15 attributes to create datasets with different dimensions, collectively referred to as NBA-X: NBA-1 selects only points; NBA-2 captures points and assists; NBA-3 chooses points, assists, and rebounds; NBA-5 includes five dimensions: points, assists, rebounds, steals, and blocks.

Network is the dataset from KDD Cup 1999 (https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). This dataset contains ∼ million records with 37 numeric attributes that describe network connections to a machine, including connection duration, packet size, etc. The query in this case utilizes a scoring function that weighs a variety of numerical attributes to rank connections, in order to identify unusual and potentially malicious ones. Records have unique timestamps and are ordered by these timestamps. Since the attributes have different measurement units, we scale the values of each dimension using MinMax normalization. To study the impact of data dimensionality on query efficiency, we choose the first 2, 3, 5, 10, 20, 30, and 37 attributes from the full 37 dimensions to create 7 different datasets collectively referred to as Network-X, where X represents the dimensionality of the dataset.

Syn is a synthetic two-dimensional dataset used for scalability tests of the proposed solutions. We generate Syn with independent (IND) and anti-correlated (ANTI) data distributed in the 2D unit square. For IND data, the attribute values of each tuple are generated independently, following a uniform distribution. ANTI data are drawn from the portion of an annulus centered at the origin (outer radius 1, inner radius 0.8) that lies inside the positive orthant, representing an environment where most of the records gather in the k-skyband. Figure 7 illustrates sample value distributions of IND and ANTI. The full size of Syn is 50 million records, and each data point has a unique arrival time. We further choose subsets of Syn with 1, 2, 5, 10, and 20 million records. These synthetic datasets are collectively referred to as Syn-X, where X represents the data size.

Fig. 7: Value distributions for the synthetic datasets: (1) IND; (2) ANTI.

TABLE III: Query Parameters (default value in bold)

Parameter   Range
k           15, 20, 25, 30, 35, 40, 45, 50
τ           25%, 30%, 40%, 50%
|I|         60%, 70%, 80%
d           3, 5, 10, 20, 30, 37

Query Parameters.
Table III summarizes the query parameters under investigation, along with their ranges and default values. Among these, the query interval length |I| and the durability τ are measured as percentages of the dataset size n. When varying the query interval length, we always fix the right endpoint of the interval to the most recent timestamp in the dataset and only move the left endpoint. Implementations & Evaluation Metric.
To make the discussion concrete and concise, we choose a linear, monotone preference scoring function throughout the experimental section, of the simple form f_u(p) = Σ_{i=1}^{d} u_i · p.x_i, where u is a user-specified preference vector and u_i is the (non-negative) weight of the i-th attribute of a record. At query time, the user specifies u as one of the input parameters. Since the focus of this paper is not to develop the best possible index for top-k queries Q_u(k, W), our implementation of the top-k building block simply adopts a tree index (on the time domain of P) and answers Q_u(k, W) in a straightforward top-down manner with a branch-and-bound method. More specifically, each tree node stores the skyline of all records that it contains. The skyline lets us quickly identify the maximum score of each node under any preference vector u. Then, to answer Q_u(k, W), it is sufficient to use at most k nodes (contained in the time window W) with the highest scores according to u. This index offers adequate performance in our experiments, but it can certainly be replaced by a more sophisticated index with better worst-case guarantees, without affecting the rest of our proposed solution.

Fig. 8: Performance comparison as τ varies: (1) NBA-2; (2) Network-2.

Fig. 9: Performance comparison as k varies: (1) NBA-2; (2) Network-2.

Using the building block of top-k queries described above, we further implement T-Base (Section III-A),
T-Hop (Section III-B),
S-Base (Section IV-A),
S-Band (Section IV-B), and
S-Hop (Section IV-C). The performance of the various methods is evaluated using two metrics: the number of top-k queries and the overall query time (in milliseconds). For each query parameter setting, we run the query 100 times with 100 different randomly generated preference vectors, and report the average with standard deviation. All methods were implemented in C++, and all experiments were performed on a Linux machine with two Intel Xeon E5-2640 v4 2.4GHz processors and 256GB of memory.
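To make the top-k building block concrete, the following is a minimal C++ sketch (ours; the Record type and function names are illustrative, not the paper's code) of the linear preference score f_u and the per-node score upper bound obtained from a node's stored skyline.

// Sketch (ours; names are illustrative) of the linear preference
// score f_u(p) = sum_i u[i] * p.x[i], and the per-node upper bound
// computed from a tree node's stored skyline.
#include <algorithm>
#include <vector>

struct Record {
    long long t;                 // arrival time instant
    std::vector<double> x;       // numeric attributes
};

// Linear, monotone preference score for a non-negative weight vector u.
double score(const std::vector<double>& u, const Record& p) {
    double s = 0;
    for (size_t i = 0; i < u.size(); ++i) s += u[i] * p.x[i];
    return s;
}

// Since f_u is monotone with non-negative weights, the maximum score
// over all records under a node is attained by a skyline record, so
// scanning the (typically small) stored skyline suffices.
double node_max_score(const std::vector<double>& u,
                      const std::vector<Record>& skyline) {
    double best = 0;
    for (const Record& p : skyline) best = std::max(best, score(u, p));
    return best;
}

This is exactly the property the branch-and-bound procedure relies on: a node whose upper bound is below the current k-th best score can be pruned without visiting its records.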
B. Algorithm Evaluations

According to the theoretical analysis of our algorithms in the previous sections, the query efficiency depends on the length of the durability window τ, the value of k, the length of the query interval I, the data dimensionality d, and the data size n. For a fair evaluation and comparison of algorithm efficiency, we designed a set of variable-controlling experiments such that each time we vary only one query parameter of interest and fix the others to their default values.

Comparison of Algorithms when Varying τ. In Figure 8, we investigate the performance of all durable top-k solutions as we vary the durability τ. Figure 8-1-(a) shows the query efficiency comparison on NBA-2. The sorting-based solution S-Base is the slowest, as it requires fully sorting all records in the time interval of length |I| + τ. T-Base is faster than S-Base and mostly independent of τ. All the remaining solutions, T-Hop, S-Hop, and S-Band, become more efficient as we increase τ, or equivalently, when the query is more selective. This finding confirms our analysis in Section V that the query efficiency bounds of the Hop-based solutions and S-Band both depend on the answer size, which is O(k|I|/τ). T-Hop and S-Hop perform nearly the same, while S-Band can be slightly faster. When the query is highly selective (τ is half of the length of the entire time domain), they are 1-2 orders of magnitude faster than T-Base and S-Base, respectively. Similar trends can be seen in Figure 8-2-(a), where we test the algorithms on the larger dataset Network-2. The only difference is that the baseline solutions (T-Base and S-Base) are more expensive, and the efficiency gap between the baselines and T-Hop/S-Hop/S-Band is even larger (up to 3 orders of magnitude).

Next, we take a closer look at T-Hop, S-Hop, and S-Band in Figure 8-1-(b), which compares the number of top-k queries needed by these three advanced algorithms. For S-Hop, the total number of top-k queries is decomposed into two parts: top-k queries for durability checks (unshaded region of a green bar) and top-k queries for finding the next highest-score record (shaded region). For S-Band, we also plot the size of the durable k-skyband candidate set C at the top of the figure as a red circled line, reflecting the overhead of sorting C for S-Band. It is now clear that the main reason why T-Hop/S-Hop/S-Band become faster when τ is large is that fewer top-k queries are needed. A more selective query with larger τ also makes the candidate set C of S-Band smaller, demonstrating the effectiveness of using the durable k-skyband to identify promising candidates. On the other hand, we can see that S-Hop and S-Band issue fewer top-k queries than T-Hop, demonstrating the pruning power of the blocking mechanism in the score-prioritized solutions. This figure also explains why S-Band runs slightly faster than S-Hop and T-Hop on NBA-2 in this case: S-Band requires the fewest top-k queries, and the overhead of sorting the candidate set C is relatively small on two-dimensional data. Again, similar trends can be found in Figure 8-2-(b).

Fig. 10: Performance comparison as |I| varies: (1) NBA-2; (2) Network-2.
Comparison of Algorithms when Varying k. Next, we study the effect of k on efficiency. Results are shown in Figure 9. When we increase k, not only do we need to issue more top-k queries (see Figure 9-1-(b) and Figure 9-2-(b)), but a top-k query itself also becomes more expensive. Thus, in both Figure 9-1-(a) and Figure 9-2-(a), we can see that all algorithms (except S-Base) are slower when k is larger. Especially when k reaches 50, top-k computations become the dominant factor in overall efficiency, and the differences among the various algorithms diminish. Still, S-Band and S-Hop have slight advantages over T-Hop for larger k, as they use the blocking mechanism to prune candidate records and are more conservative in issuing expensive top-k queries.

Comparison of Algorithms when Varying |I|. In Figure 10, we compare the performance of the proposed algorithms as we vary the query interval length |I|. In terms of efficiency, Figure 10-1-(a) and Figure 10-2-(a) show that T-Hop/S-Hop/S-Band are much faster than the baseline solutions T-Base and S-Base, especially on the large dataset Network-2. On the other hand, we also find that our proposed algorithms scale better with |I| than with k (recall Figure 9). The reason is that the time complexities of T-Hop/S-Hop and S-Band are quadratic in k but only linear in |I| (recall Lemma 4 and Lemma 5). In terms of the number of top-k queries, in Figure 10-1-(b) and Figure 10-2-(b), it is not surprising to see that all proposed solutions issue more top-k queries as |I| increases. The relative performance of the various algorithms is consistent with the previous experiments where we varied τ or k.

Comparison of Algorithms when Varying d. In this section, we study the effect of the data dimensionality d on algorithm performance. Since the sorting-based S-Base is clearly inferior to the other algorithms, here we only test T-Base, T-Hop, S-Band, and S-Hop on Network-X with varying dimensions. Results are shown in Figure 11.

Fig. 11: Performance comparison on Network-X as d varies.

Fig. 12: Scalability test on IND and ANTI Syn-X: (1) IND; (2) ANTI.

Let us first take a look at Figure 11-2. We can see that the number of top-k queries for all proposed algorithms stays stable as we increase the dimensionality. This finding again confirms our theoretical analysis that the number of top-k queries (or the answer size) depends only on k|I|/τ and is independent of the dimensionality d. On the other hand, we can see that the size of the candidate set C for S-Band rockets in high dimensions, and can be up to 4 orders of magnitude larger than the number of actual promising records. The sorting overhead on such huge candidate sets is already too big. Now let us go back to Figure 11-1. The query times of T-Base, T-Hop, and S-Hop increase slowly as we increase the dimensionality, because top-k queries in high dimensions become more expensive, yet these algorithms issue roughly the same number of top-k queries regardless of dimensionality. While S-Band still performs well on low-dimensional data (fewer than 5 dimensions), in higher dimensions S-Band becomes dramatically worse, even taking as much time as T-Base on Network-37. Scalability.
Finally, we use the two-dimensional synthetic dataset Syn-X to test the scalability of the proposed algorithms as we vary the input size from 1 million to 50 million. Figure 12 summarizes the results. As the input size increases, we also increase the query interval length proportionally (so it remains at a fixed percentage of the data size). As shown in Figure 12-1, T-Hop, S-Hop, and S-Band scale well on large IND datasets, and S-Band again performs slightly better than T-Hop and S-Hop. The running time of S-Base increases on larger datasets simply because we also make the query interval longer. Figure 12-1-(b) further illustrates that the total number of top-k queries issued by the different algorithms is also independent of the data size; a larger dataset only makes each top-k query more expensive. Although the size of the candidate set |C| increases on larger IND datasets, its growth rate here is much lower than when varying the dimensionality d in Figure 11. Overall, on IND synthetic data, |C| is only about 4-5 times bigger than the actual answer size, which does not incur a big sorting overhead for S-Band. However, the situation is much different for ANTI Syn-X. As shown in Figure 12-2, in terms of query efficiency, T-Hop and S-Hop still scale well, but S-Band now becomes much more expensive because of the data distribution of ANTI. Most records in ANTI data gather in the k-skyband, resulting in a C up to 3 orders of magnitude larger than the actual answer size (see Figure 12-2-(b)), which hurts the performance of S-Band. The efficiency of S-Band has a strong dependency on the candidate set C, or more generally, on the data distribution. In contrast, the performance of T-Hop and S-Hop in this case is nearly independent of both the size and the distribution of the data; it is linear only in the answer size.

Fig. 13: Runtime distribution on 5d NBA data.

Query Time Distribution over Different Real Datasets.
Figure 12 already clearly illustrates the performance difference of S-Band on IND and ANTI synthetic data, demonstrating the effect of data distribution on S-Band's query efficiency. Here, we further compare T-Hop, S-Hop, and S-Band on real data, and study how data distributions influence their performance in practice. We use NBA as the main data source, and select 20 random combinations of 5 dimensions out of the 15 attributes, e.g., (points, assists, rebounds, steals, blocks), (points, assists, steals, blocks, 3-pointers-made), etc. The resulting 20 datasets have the same dimensionality (5) but exhibit different distributions. We run queries with default settings on each dataset, and plot the running time distribution across all datasets. Results are shown in Figure 13. We can see that S-Band takes longer on average, and also has a wide spread in query time. This finding again confirms that S-Band is highly sensitive to the underlying data distribution. In contrast, the running times of T-Hop and S-Hop are concentrated in narrower ranges, showing their robustness to data distributions and further demonstrating their advantages over S-Band on real data.

TABLE IV: Query time (in seconds) comparison on NBA-2 when varying τ. PostgreSQL backend.

τ (as % of |T|)   10%    20%    30%    40%    50%
T-Hop             0.46   0.28   0.18   0.12   0.1
T-Base            2.2    1.9    1.8    1.7    1.7

TABLE V: Query time (in seconds) comparison on NBA-2 when varying |I|. PostgreSQL backend.

|I| (as % of |T|) 10%    20%    30%    40%    50%
T-Hop             0.1    0.16   0.17   0.2    0.26
T-Base            0.46   0.93   1.3    1.6    2

C. DBMS-Based Implementations
To demonstrate the generality of the proposed solutions and the possibility of integrating them into a DBMS, we further test the algorithms using PostgreSQL [23] as the backend DBMS. More specifically, we load the datasets NBA-2, Syn-500M (IND), and Syn-500M (ANTI) into PostgreSQL tables. The table schema consists of the numeric attributes of the records and an additional column representing the arrival time instant. For the algorithm implementations, we code T-Hop and T-Base as stored procedures using PL/Python together with PostgreSQL's natively supported operators. (The other proposed solution, S-Hop, requires a more delicate query procedure and data structures, recall Algorithm 3, and is thus better implemented as a wrapper function outside the DBMS.) Besides the data tables, we also create corresponding index tables to support efficient top-k record retrieval. The index table is similar to the tree-based index used in the previous experiments, providing sufficient data reduction for answering range top-k queries. Again, the top-k module serves as the building block of these implementations. In Tables IV and V, we vary the durability τ and the query interval length |I|, respectively, to compare query efficiencies.

TABLE VI: Query time (in seconds) comparison on different datasets. Dataset size (measured by DBMS storage size) is shown in parentheses. PostgreSQL backend.

Dataset   NBA-2 ( )   Syn-IND (30 GB)   Syn-ANTI (30 GB)
T-Hop     0.28        1.9               2.3
T-Base    1.9         773               787

Similar conclusions can be drawn here. T-Base always pays a linear cost (continuous sliding windows) to visit all records in the query interval; thus, its running time is linear in |I| (Table V) and nearly independent of τ (Table IV). In comparison, T-Hop's complexity is linear in the answer size, which makes it run faster as the query becomes more selective (smaller |I| or larger τ). Overall, T-Hop is at least 10× faster than T-Base.

In Table VI, we increase the dataset size to 500M records, which takes around 30 Gigabytes of disk space in PostgreSQL. Running the default queries in this setting, T-Hop is more than 100× faster than T-Base, bringing the query time down from nearly 13 minutes to about 2 seconds. T-Hop also scales well on large datasets, since its complexity is mostly linear in the answer size; the increase in query time comes solely from the more expensive top-k module. On the contrary, the continuous sliding-window nature of T-Base makes it prohibitively slow when dealing with large amounts of temporal data.

D. Summary of Experiments
In sum, we conclude that the Hop-based algorithms, T-Hop and S-Hop, are the best solutions for answering durable preference top-k queries. They scale well to large datasets as well as to high dimensions, and, most importantly, their query time complexity is proportional to the answer size. This property makes T-Hop and S-Hop run even faster when the query is highly selective, i.e., for smaller k or larger τ, which tend to be the more practical and meaningful query settings in real-life applications. While S-Band is also a reasonable approach, its performance depends highly on the data characteristics (faring poorly in high dimensions and for certain distributions), and it requires additional offline indexing for finding durable k-skyband candidates. Overall, as demonstrated by experiments on both real and synthetic data, the efficiency and robustness of the Hop-based solutions make them the more attractive choice. Even on very large and high-dimensional datasets, T-Hop and S-Hop need less than a second to return durable top records for any given preference, which enables interactive data exploration. Finally, T-Hop can be efficiently implemented inside a DBMS; for large datasets (tens of Gigabytes), it brings the query time down to just a couple of seconds, from more than 10 minutes required without our solution.

VII. RELATED WORK
The notion of “durability” on temporal data has been studied in prior work, but under different definitions of durability and/or different data models from ours. In [24] and [25], the authors implicitly considered “durability” in the form of prominent streaks in sequence data, and devised efficient algorithms for discovering such streaks. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (or small) values. Their algorithms can also be extended to find general top-k, multi-sequence, and multi-dimensional prominent streaks. Jiang and Pei [26] studied interval skyline queries on time series, which can be viewed as another type of “durability,” where segments of time series dominate others.

Another line of durability-related work on temporal data is represented by [27]-[30]. Consider a time-series dataset with a set of objects, where the data values of each object are measured at regular time intervals, e.g., stock markets. At each time t, objects are ranked according to their values at t. The definition of “durability” therein is the fraction of time during a given time window in which an object ranks k-th or above. This line of work mainly focused on how to efficiently aggregate rankings (rank ≤ k or not) over time. [30] applied durable top-k search to document archives, finding documents that are consistently among the most relevant to query keywords throughout a given time interval. In that setting, the challenge is how to efficiently merge multiple per-keyword relevance scores over time into a single rank.

Durable queries also arise in dynamic or temporal graphs, typically represented as sequences of graph snapshots. For example, in [31] and [32], the authors considered the problem of finding the (top-k) most durable matches of an input graph pattern query, that is, the matches that exist for the longest period of time. The main focus there is on the representations and indexes of the sequence of graph snapshots, and on adapting classic graph algorithms to this setting.

Besides durability, Mouratidis et al. [11] studied how to continuously monitor top-k results over the most recent data in a streaming setting. Our baseline solution used in Section VI shares the same spirit as the algorithms in [11] for incrementally maintaining top-k results over consecutive sliding windows.

VIII. CONCLUSION
In this paper, we have initiated a comprehensive study of the problem of finding durable top records in large instant-stamped temporal datasets using durable top-k queries. We proposed two types of novel algorithms for efficiently solving this problem, and provided an in-depth theoretical analysis of the complexity of the problem itself and of our algorithms. As demonstrated by experiments on real and synthetic data, our best solutions, Time-Hop and Score-Hop, find interesting durable top records in under a second on large and high-dimensional datasets, and can be up to 2 orders of magnitude faster than existing baselines.

REFERENCES

[1] P. Afshani and T. M. Chan, “Optimal halfspace range reporting in three dimensions,” in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009.
[2] B. Chazelle, L. J. Guibas, and D.-T. Lee, “The power of geometric duality,” BIT Numerical Mathematics, vol. 25, 1985.
[3] J. Matousek, “Reporting points in halfspaces,” Computational Geometry, vol. 2, 1992.
[4] P. K. Agarwal et al., “Efficient searching with linear constraints,” J. Comp. and System Sciences, vol. 61, 2000.
[5] P. K. Agarwal and J. Matoušek, “Dynamic half-space range reporting and its applications,” Algorithmica, vol. 13, 1995.
[6] T. M. Chan, “Three problems about dynamic convex hulls,” International Journal of Computational Geometry & Applications, vol. 22, 2012.
[7] Y.-C. Chang et al., “The onion technique: indexing for linear optimization queries,” in SIGMOD, vol. 29, 2000.
[8] K. Yi, H. Yu, J. Yang, G. Xia, and Y. Chen, “Efficient maintenance of materialized top-k views,” in ICDE, 2003.
[9] V. Hristidis and Y. Papakonstantinou, “Algorithms and applications for answering ranked queries using ranked views,” VLDB J., vol. 13, 2004.
[10] I. F. Ilyas, G. Beskales, and M. A. Soliman, “A survey of top-k query processing techniques in relational database systems,” CSUR, 2008.
[11] K. Mouratidis, S. Bakiras, and D. Papadias, “Continuous monitoring of top-k queries over sliding windows,” in SIGMOD, 2006.
[12] C. Jin, K. Yi, L. Chen, J. X. Yu, and X. Lin, “Sliding-window top-k queries on uncertain streams,” VLDB, vol. 1, 2008.
[13] G. Das, D. Gunopulos, N. Koudas, and N. Sarkas, “Ad-hoc top-k query answering for data streams,” in VLDB, 2007.
[14] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, 1997.
[15] J. L. Bentley, H.-T. Kung, M. Schkolnick, and C. D. Thompson, “On the average number of maxima in a set of vectors and applications,” Carnegie-Mellon Univ., Tech. Rep., 1977.
[16] P. K. Agarwal et al., “Range-max queries on uncertain data,” Journal of Computer and System Sciences, vol. 94, 2018.
[17] G. Goel and A. Mehta, “Online budgeted matching in random input models with applications to adwords,” in Proc. 19th Annual ACM-SIAM Symp. on Discrete Algorithms, 2008.
[18] M. Mahdian and Q. Yan, “Online bipartite matching with random arrivals: an approach based on strongly factor-revealing LPs,” in Proc. 43rd Annual ACM Symp. on Theory of Computing, 2011.
[19] A. Mehta, A. Saberi, U. Vazirani, and V. Vazirani, “Adwords and generalized on-line matching,” in FOCS, 2005.
[20] P. K. Agarwal, H. Kaplan, and M. Sharir, “Union of hypercubes and 3d Minkowski sums with random sizes,” in ICALP, 2018.
[21] P. K. Agarwal, S. Har-Peled, H. Kaplan, and M. Sharir, “Union of random Minkowski sums and network vulnerability analysis,” Discrete & Computational Geometry, vol. 52, 2014.
[22] S. Har-Peled and B. Raichel, “On the complexity of randomly weighted multiplicative Voronoi diagrams,” Discrete & Computational Geometry, vol. 53, 2015.
[23] PostgreSQL, https://www.postgresql.org/.
[24] X. Jiang, C. Li, P. Luo, M. Wang, and Y. Yu, “Prominent streak discovery in sequence data,” in SIGKDD, 2011.
[25] G. Zhang, X. Jiang, P. Luo, M. Wang, and C. Li, “Discovering general prominent streaks in sequence data,” TKDD, vol. 8, 2014.
[26] B. Jiang and J. Pei, “Online interval skyline queries on time series,” in ICDE, 2009.
[27] M. L. Lee, W. Hsu, L. Li, and W. H. Tok, “Consistent top-k queries over time,” in DASFAA, 2009.
[28] H. Wang, Y. Cai, Y. Yang, S. Zhang, and N. Mamoulis, “Durable queries over historical time series,” TKDE, vol. 26, 2014.
[29] J. Gao, P. K. Agarwal, and J. Yang, “Durable top-k queries on temporal data,” VLDB, vol. 11, 2018.
[30] N. Mamoulis, K. Berberich, S. Bedathur et al., “Durable top-k search in document archives,” in SIGMOD, 2010.
[31] K. Semertzidis and E. Pitoura, “Durable graph pattern queries on historical graphs,” in ICDE, 2016.
[32] K. Semertzidis et al., “Top-k durable graph pattern queries on temporal graphs,” TKDE, vol. 31, 2018.
[33] S. Borzsony, D. Kossmann, and K. Stocker, “The skyline operator,” in ICDE, 2001, pp. 421-430.

APPENDIX
A. Implementation Details
For simplicity and usability, we adopt a straightforward tree-based implementation that serves our purpose of answering a preference top-k query in a time window.

Algorithm 4: Tree Index Construction
Input: Dataset P
Output: A tree index T for preference top-k queries
 1: Def BuildTree(t_1, t_2):
 2:   if t_1 > t_2 then
 3:     return null;
 4:   else if t_1 == t_2 then
 5:     create a leaf node n;
 6:     n.skyline ← P[t_1];
 7:     n.interval ← [t_1, t_1];
 8:     return n;
 9:   else
10:     create a node n;
11:     t_m ← ⌊(t_1 + t_2)/2⌋;
12:     n.left_child ← BuildTree(t_1, t_m);
13:     n.right_child ← BuildTree(t_m + 1, t_2);
14:     n.interval ← [t_1, t_2];
15:     n.skyline ← S(S(P([t_1, t_m])) ∪ S(P([t_m + 1, t_2])));
16:     return n;

Consider a query time window W decomposed into m non-empty disjoint time intervals W = ⋃_{i=1}^{m} I_i. Assume that for each interval I_i we know the highest score (with respect to u) among P(I_i), referred to as the interval max score. It is then sufficient to use at most k out of the m intervals to answer a preference top-k query Q(u, k, W): the intervals with the k highest interval max scores, using all records in P that arrive during these k intervals to compute the top-k results. Based on this idea, our implementation takes advantage of two important properties of the skyline [33] to improve the efficiency of index construction and of the query procedure.

As shown in Algorithm 4, the tree index is built over the arrival times of all points in P, with S(·) denoting the skyline operator. Each leaf node corresponds to a single timestamp (Line 6) and each internal node represents a time window (Line 14). Each tree node also stores the skyline of the points arriving during its window; the skylines of all internal nodes can be computed efficiently bottom-up (Line 15).

Algorithm 5 specifies the query procedure using the tree index. Starting from the canonical intervals (nodes) covering the query window I (Line 4), we recursively split long intervals into smaller ones (Lines 10-13), and use a priority queue to remember at most k intervals that have the highest interval max scores (Line 15). Finally, the preference top-k result is computed using at most k such intervals and all corresponding records in P (no more than k · LENGTH_THRESHOLD records in total). The interval max score of any interval can be computed efficiently from the stored skylines (Lines 6 and 12). The pre-determined value of LENGTH_THRESHOLD controls the granularity of the chosen k intervals for the preference top-k computation; by default, we set LENGTH_THRESHOLD = 128.

Algorithm 5: Preference Top-k Query Q(u, k, I)
Input: P, T, u, k, and I
Output: π_{≤k}(u, I)
 1: Def PreferenceTopK(I, u, k):
 2:   candidates ← ∅;
 3:   Q (priority queue in descending order of key) ← ∅;
 4:   N ← a set of canonical nodes from T that covers I;
 5:   for n_i ∈ N do
 6:     compute the interval max score v_i using n_i.skyline;
 7:     Q.push(v_i, n_i);
 8:   while |candidates| < k and not Q.empty() do
 9:     (v, n) ← Q.top(); Q.pop();
10:     if |n.interval| > LENGTH_THRESHOLD then
11:       n_l ← n.left_child; n_r ← n.right_child;
12:       compute v_l, v_r from n_l, n_r using skylines;
13:       Q.push(v_l, n_l); Q.push(v_r, n_r);
14:     else
15:       candidates.push(n);
16:   compute π_{≤k}(u, I) using candidates;
17:   return π_{≤k}(u, I);
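For concreteness, the following is a C++ rendering (ours, simplified; the Record type and the naive skyline() helper are illustrative, not the paper's code) of Algorithm 4's construction. A production version would replace the quadratic skyline computation with a more efficient routine.

// C++ sketch (ours, simplified) of Algorithm 4.
#include <memory>
#include <vector>

struct Record { long long t; std::vector<double> x; };

// q dominates p if q >= p in every attribute and q > p in at least one.
bool dominates(const Record& q, const Record& p) {
    bool strict = false;
    for (size_t i = 0; i < q.x.size(); ++i) {
        if (q.x[i] < p.x[i]) return false;
        if (q.x[i] > p.x[i]) strict = true;
    }
    return strict;
}

// Naive O(m^2) skyline: keep the records dominated by no other record.
std::vector<Record> skyline(const std::vector<Record>& pts) {
    std::vector<Record> out;
    for (const Record& p : pts) {
        bool dominated = false;
        for (const Record& q : pts)
            if (dominates(q, p)) { dominated = true; break; }
        if (!dominated) out.push_back(p);
    }
    return out;
}

struct Node {
    long long lo, hi;             // covered time interval [lo, hi]
    std::vector<Record> sky;      // skyline of records in [lo, hi]
    std::unique_ptr<Node> left, right;
};

// P[t] holds the record(s) arriving at timestamp t.
std::unique_ptr<Node> build_tree(const std::vector<std::vector<Record>>& P,
                                 long long t1, long long t2) {
    if (t1 > t2) return nullptr;
    auto n = std::make_unique<Node>();
    n->lo = t1; n->hi = t2;
    if (t1 == t2) {                        // leaf: a single timestamp
        n->sky = skyline(P[t1]);
        return n;
    }
    long long tm = t1 + (t2 - t1) / 2;     // split the time range
    n->left = build_tree(P, t1, tm);
    n->right = build_tree(P, tm + 1, t2);
    // A node's skyline is the skyline of its children's skylines,
    // so internal skylines are filled in bottom-up (Line 15).
    std::vector<Record> merged;
    if (n->left)  merged.insert(merged.end(), n->left->sky.begin(),  n->left->sky.end());
    if (n->right) merged.insert(merged.end(), n->right->sky.begin(), n->right->sky.end());
    n->sky = skyline(merged);
    return n;
}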
B. Missing Proofs of Section III

Proof of Lemma 1. Let I = [a, b] and ρ = [b − τ, b]. Let S_ρ = S ∩ ρ, i.e., the set of durable records with timestamp in ρ. We show that the number of false checks in ρ is O(|S_ρ| + k). Without loss of generality, assume that for any pair of records p_i, p_j with i < j, we have p_i.t < p_j.t.

We consider two types of false checks in ρ. If the algorithm finds a false check immediately after a durable record, then this is a type-1 false check; otherwise, it is a type-2 false check. By definition, the number of type-1 false checks is bounded by O(|S_ρ|). Next, we show that the number of type-2 false checks in ρ is bounded by O(k). If the number of records in ρ is less than k, the result follows, so we assume that |P([b − τ, b])| > k.

Recall that when the algorithm visits a record p, it computes the top-k elements in [p.t − τ, p.t]. Let U_p be the list of the top-k items in [p.t − τ, p.t], and let Z_p = U_p ∩ ρ be the list of those top-k elements that lie in ρ. Generically we refer to Z_p as a Z list. At the beginning of the algorithm, assume that we find the top-k elements in a window of length τ ending at the rightmost item in ρ, so that we have a list Z with |Z| ≤ k. We show the following two observations: (i) each time the algorithm finds a type-2 false check, the new Z list of top-k records in ρ has cardinality at least one less than the previous list; and (ii) the cardinalities of the Z lists never increase as the algorithm runs in ρ. Given (i) and (ii), after the algorithm finds k type-2 false checks in ρ, the Z list is empty and the algorithm visits a record outside of ρ.

Without loss of generality, assume that the rightmost record in ρ was a type-2 false check. Let Z_r be the current list as defined above. The algorithm visits the record with the largest timestamp in Z_r, say p, which is a type-2 false check. Let Z_p = U_p ∩ ρ be the new list. We compare the new list Z_p with the old list Z_r. Notice that every record q ∉ Z_r with time q.t ∈ [b − τ, p.t] has f_u(q) < f_u(p) (1); otherwise, Z_r would not be a correct top-k list. Furthermore, p is a false check because there are at least k records in [p.t − τ, p.t) with score larger than the score of p (2). From (1) and (2), it follows that Z_p ⊂ Z_r. Hence, the cardinality of the new Z list is less than the cardinality of the previous Z list. In addition, notice that there are at least k − |Z_p| records in [p.t − τ, b − τ] with scores greater than the score of p, and more generally greater than the score of any record in P([b − τ, p.t]) \ Z_p (3).

To complete the proof, we need to determine the new Z list when the algorithm visits a series of durable records. Assume that Z_p is the current list (or the initial one) and the algorithm visits Z_p's record with the largest timestamp. Assume that the algorithm finds a series of durable records, where j of them belong to Z_p; notice that j ≥ 1. Let q be the type-1 false check that the algorithm visits (after the series of durable records) and let Z_q be the new list. We need to show that |Z_q| ≤ |Z_p|. We assume that q ∉ Z_p (if q ∈ Z_p, then Z_q ⊂ Z_p and the result follows). Recall from (3) that there are at least k − |Z_p| records with timestamp in [p.t − τ, b − τ] (and q.t − τ < p.t − τ, so these records lie inside q's window) and with score greater than the score of q; we call these records A. Moreover, there are |Z_p| − j records in Z_p with timestamp in [b − τ, q.t) and with score greater than the score of q; we call these records B. We have |Z_q| ≤ |B| + (k − |A| − |B|) = k − |A| ≤ |Z_p|. Hence, we conclude that there are O(k) type-2 false checks, and the total number of false checks in ρ is O(|S_ρ| + k).

There are ⌈|I|/τ⌉ intervals of length τ in I, so the total number of false checks is O(|S| + k⌈|I|/τ⌉).

C. Missing Proofs of Section IV
We first introduce some useful notation. Let dens(t) be the density of a timestamp t, i.e., the number of blocking intervals that contain t. Notice that dens(t) changes as we execute the algorithm. If a record p_i is blocked by at least k records, i.e., dens(p_i.t) ≥ k, at line 7 of Algorithm 3, then we call it an auxiliary record. Overall, a record can be a durable record, a false check (we run a top-k query but the record does not belong to the solution), or an auxiliary record. We start with a lemma that will be useful later.

Lemma 6.
Let M_i be a set that is empty after the algorithm considers an (auxiliary) record from M_i with density at least k, and let [l_i, r_i] be its corresponding sub-interval. Then one of the two cases holds: the density of each timestamp in [l_i, r_i] is at least k, or the algorithm has visited all records in P([l_i, r_i]).

Proof. If |P([l_i, r_i])| ≤ k, then the algorithm visits all records in P([l_i, r_i]), since we always consider the top-k records in [l_i, r_i]. If |P([l_i, r_i])| > k, then we show that when M_i is empty, every timestamp in [l_i, r_i] has density at least k.

We prove the following argument by induction: when the algorithm visits a new auxiliary record p_j in a set M_j, then every timestamp in [l_j, p_j.t] has density at least k. Let p_1 be the first auxiliary record that the algorithm finds, and let M_{i_1} be the set that it belongs to. Since p_1 is an auxiliary record, we have dens(p_1.t) ≥ k at the moment we visit p_1. Furthermore, the algorithm did not consider any other record in [l_{i_1}, p_1.t] in a previous iteration, so the density of every timestamp in [l_{i_1}, p_1.t] is at least k. In addition, it is not possible to find any durable record or false check in [l_{i_1}, p_1.t] in the future. As a result, if we visit p_1 again in the future, it will be an auxiliary record in a set whose left endpoint is the same timestamp l_{i_1}. Now let p_{h−1} be an auxiliary record that the algorithm visits in set M_{i_{h−1}}, and assume that every timestamp in [l_{i_{h−1}}, p_{h−1}.t] has density at least k. Let p_h be the next auxiliary record that the algorithm visits, and assume that it belongs to a set M_{i_h}. First assume that the algorithm has visited p_h in a previous iteration, and let M_f be the set that contained p_h when the algorithm first visited it. At that moment we had dens(p_h.t) ≥ k, and from the induction hypothesis every timestamp in [l_f, p_h.t] had density at least k. Hence, there could be no further durable record or false check in [l_f, p_h.t]. That means l_f = l_{i_h}, and so every timestamp in [l_{i_h}, p_h.t] has density at least k. Next, assume that this is the first time we visit the auxiliary record p_h. If it is the first auxiliary record in M_{i_h}, then every timestamp in [l_{i_h}, p_h.t] has density at least k, because dens(p_h.t) ≥ k and no sub-interval starts in [l_{i_h}, p_h.t]. Otherwise, p_h is not the first auxiliary record found in M_{i_h}; let p_u be the auxiliary record in M_{i_h} with the largest timestamp just before the algorithm found p_h. From the induction hypothesis, every timestamp in [l_{i_h}, p_u.t] has density at least k. If p_h.t ≤ p_u.t, then [l_{i_h}, p_h.t] ⊆ [l_{i_h}, p_u.t], so every timestamp in [l_{i_h}, p_h.t] has density at least k. The last case is p_h.t > p_u.t. Since dens(p_h.t) ≥ k, and since no sub-interval starts in (p_u.t, p_h.t), every timestamp in [l_{i_h}, p_h.t] has density at least k. We conclude that the density of every timestamp in [l_{i_h}, p_h.t] is at least k.

Now we are ready to prove the lemma. If |P([l_i, r_i])| > k and M_i is empty, then the algorithm has already considered k auxiliary records in [l_i, r_i]. Let p_u be the auxiliary record in M_i with the largest timestamp. From the induction we have that the density of every timestamp in [l_i, p_u.t] is at least k. Furthermore, the algorithm has visited k auxiliary records and hence has added at least k blocking intervals with left endpoint in [l_i, p_u.t]. All the intervals we add have length τ and r_i − l_i ≤ τ, so all timestamps in the interval [p_u.t, r_i] have density at least k. We conclude that the density of each timestamp in [l_i, r_i] is at least k.

Proof of Lemma 2.
Let S* be the set of durable records in I. We show that S ⊆ S* and S* ⊆ S, so that S = S*. The algorithm always checks, by running a top-k query, whether a record should be in the solution (line 8 of Algorithm 3), so S ⊆ S*. Next we show the other direction. The algorithm visits the records in descending order of score, so it is not possible for a record p ∈ S* to be blocked by at least k records before the algorithm visits p. Before concluding that S* ⊆ S, we also need to make sure that the algorithm does not miss any durable record in a sub-interval [l_j, r_j] that corresponds to an empty set M_j. In Lemma 6 we showed that all timestamps in [l_j, r_j] then have density at least k, so there is no additional durable record in this sub-interval. Hence S* ⊆ S, and overall we conclude that S = S*.

Let p_i be a false check that the algorithm just found, and let P'_i be the top-k records in [p_i.t − τ, p_i.t), as in the algorithm. Let p'_i be the record in P'_i with the largest timestamp. We say that p_i is assigned to p'_i. If p'_i.t < a, where a is the timestamp such that I = [a, b], then p_i is assigned to a. The next lemma follows from the definition.

Lemma 7.
Assume that the algorithm just found the false check p_i. After adding all the blocking intervals from P'_i, the density of every timestamp in [p'_i.t, p_i.t] is at least k.

We now show the next lemma, which is useful for bounding the number of false checks.
Lemma 8.
Let p_i be a false check and p'_i the record it is assigned to. Before adding the k blocking intervals from all records in P'_i (as defined above), we have that either dens(p'_i.t) ≥ k, or p'_i ∈ S and dens(p'_i.t) < k, or p'_i = a.

Proof. If p'_i.t < a, then by definition p_i is assigned to a. (Notice that if we find more than one false check assigned to a, then dens(a) > k, so this case can be treated the same as dens(p'_i.t) > k.)

Next, we assume that p'_i.t ≥ a. We prove the lemma by contradiction. Let p'_i be a record that does not belong to S with dens(p'_i.t) < k. Notice that f_u(p'_i) > f_u(p_i). Since p'_i is not in S, it is either a false check, an auxiliary record, or a record that the algorithm has not visited before. If p'_i is a false check, then from Lemma 7 we have dens(p'_i.t) ≥ k from the moment we first found p'_i, which is a contradiction. If p'_i is an auxiliary record, then from Lemma 6 we have dens(p'_i.t) ≥ k, which is a contradiction. If p'_i is a record that the algorithm has not considered before, there are two cases: a) p'_i belongs to an interval [l_j, r_j] of a set M_j that we have removed from M because we have already visited its top-k records; from Lemma 6 we know that dens(p'_i.t) ≥ k, which is a contradiction. b) p'_i belongs to an interval [l_j, r_j] of a set M_j that still exists in H; since f_u(p'_i) > f_u(p_i), it follows that p_i is not the record with the highest score among the sub-intervals that have not been removed from M, which is a contradiction.

In every case we proved that either p'_i has density at least k, or p'_i has density less than k and p'_i ∈ S, or p'_i = a.
Proof of Lemma 3. If a false check p_i is assigned to a durable record with density less than k, we call it a type-1 false check; otherwise, it is a type-2 false check.

Let p_i be a type-1 false check, so p'_i ∈ S and dens(p'_i.t) < k. After adding all the k segments from P'_i, we have dens(p'_i.t) ≥ k. The next time another false check is assigned to p'_i, the density of p'_i will be at least k, so that false check will be type-2. Hence, the number of type-1 false checks is bounded by O(|S|).

Next we focus on type-2 false checks. Let [l, r] be one of the initial disjoint τ-length windows from line 2 of Algorithm 3. We show that after finding k type-2 false checks in [l, r], the density of all timestamps in [l, r] is at least k; in that case, the algorithm will not find any other false check in [l, r].

Let t be any timestamp in [l, r]. We show that dens(t) ≥ k after the algorithm finds k type-2 false checks in [l, r]. If one of the false checks in [l, r] lies on t, then we already have dens(t) ≥ k. Suppose the algorithm finds k_1 type-2 false checks in [l, t) and k_2 type-2 false checks in (t, r], where k_1 + k_2 = k. If k_1 ≥ k, then dens(t) ≥ k, so the interesting case is when k_1 < k.
D. Missing Proofs of Section V

Proof of Lemma 5. We show the result by extending the main ideas from [15]. Let P(I) = {p_{j+1}, ..., p_{j+|I|}}. For p_i ∈ P(I), let X_i be a random variable which is 1 if p_i ∈ C, and 0 otherwise. From the linearity of expectation, we have E[|C|] = E[Σ_{i=j+1}^{j+|I|} X_i] = Σ_{i=j+1}^{j+|I|} E[X_i] = Σ_{i=j+1}^{j+|I|} Pr[X_i = 1]. We focus on computing Pr[X_i = 1]. Let P_i = P([p_{i−τ}.t, p_i.t]) = {p_{i−τ}, ..., p_{i−1}, p_i}. By independence, the probability of each point in P_i being in the k-skyband of P_i is the same, so we can compute Pr[X_i = 1] by first finding the expected size of the k-skyband of P_i and then dividing it by the number of points, τ + 1.

Let B_i be the k-skyband of the τ + 1 points in P_i. For 1 ≤ j ≤ d, let V_j ⊂ N with |V_j| = τ + 1 be the set of values assigned to the j-th coordinate of the points in P_i. We compute E[|B_i| | V_1, ..., V_d]. Let A(τ + 1, d) be the expected size of the k-skyband of a set P̄ of τ + 1 points in R^d in the d-dimensional random permutation model; notice that A(τ + 1, d) = E[|B_i| | V_1, ..., V_d]. We compute A(τ + 1, d) as follows. By the linearity of expectation, we can compute the probability that a point of P̄ belongs to the k-skyband and sum over the points: A(τ + 1, d) = Σ_{p̄ ∈ P̄} Pr[p̄ ∈ k-skyband of P̄]. Assume that a point p̄ ∈ P̄ has the g-th largest first coordinate among the points in P̄; this happens with probability 1/(τ + 1). Since the first coordinate of the g-th point (p̄) is greater than the first coordinates of g − 1 points, it cannot be dominated by any of those. Therefore, the g-th point belongs to the k-skyband if and only if its remaining d − 1 coordinates belong to the k-skyband among the points in P̄ with the g-th through the (τ + 1)-th largest first coordinates. By independence, the probability that the g-th point is in the k-skyband is the expected size of the k-skyband over the remaining points and coordinates, which is A(τ + 1 − g + 1, d − 1), divided by the total number of remaining points, τ + 1 − g + 1. Notice that A(k', y) = k' for k' ≤ k and any y. Hence,

    A(τ + 1, d) = Σ_{p̄ ∈ P̄} Σ_{g=1}^{τ+1} (1/(τ + 1)) · A(τ + 1 − g + 1, d − 1)/(τ + 1 − g + 1) = Σ_{J=1}^{τ+1} A(J, d − 1)/J.

Notice that A(x, y) is monotonically increasing in x: if x_1 ≤ x_2, then A(x_1, y) ≤ A(x_2, y) for any y. Furthermore, A(τ + 1, 1) = k, since in one dimension only the top-k points belong to the k-skyband. We have

    A(τ + 1, d) = Σ_{J=1}^{τ+1} A(J, d − 1)/J ≤ A(τ + 1, d − 1) · Σ_{J=1}^{τ+1} 1/J ≤ A(τ + 1, d − 1) · O(log τ).

Iterating this recurrence on d down to A(τ + 1, 1) = k gives the upper bound A(τ + 1, d) = O(k log^{d−1} τ). We conclude that E[|B_i| | V_1, ..., V_d] = O(k log^{d−1} τ). Since this bound holds for every choice of V_1, ..., V_d, and all possible choices are equally likely, we have E[|B_i|] = O(k log^{d−1} τ), and Pr[X_i = 1] = O(k log^{d−1} τ)/(τ + 1). Overall, we conclude that E[|C|] = Σ_{i=j+1}^{j+|I|} Pr[X_i = 1] = O(k(|I|/τ) log^{d−1} τ).
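As a sanity check (ours, not part of the original proof), instantiating the recurrence at d = 2 and using A(J, 1) = min(J, k) gives

\[
A(\tau+1,\,2) \;=\; \sum_{J=1}^{\tau+1} \frac{A(J,1)}{J}
 \;=\; \sum_{J=1}^{k} 1 \;+\; \sum_{J=k+1}^{\tau+1} \frac{k}{J}
 \;=\; k + k\,(H_{\tau+1} - H_{k}) \;=\; O(k \log \tau),
\]

where H_m denotes the m-th harmonic number, which matches the general bound O(k log^{d−1} τ) at d = 2.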