A Fully Dynamic Algorithm for k-Regret Minimizing Sets
Yanhao Wang†, Yuchen Li‡, Raymond Chi-Wing Wong♮, and Kian-Lee Tan†
†School of Computing, National University of Singapore
‡School of Information Systems, Singapore Management University
♮Department of Computer Science and Engineering, The Hong Kong University of Science and Technology
†{yanhao90,tankl}@comp.nus.edu.sg; ‡[email protected]; ♮[email protected]

Abstract—Selecting a small set of representatives from a large database is important in many applications such as multi-criteria decision making, web search, and recommendation. The k-regret minimizing set (k-RMS) problem was recently proposed for representative tuple discovery. Specifically, for a large database P of tuples with multiple numerical attributes, the k-RMS problem returns a size-r subset Q of P such that, for any possible ranking function, the score of the top-ranked tuple in Q is not much worse than the score of the kth-ranked tuple in P. Although the k-RMS problem has been extensively studied in the literature, existing methods are designed for the static setting and cannot maintain the result efficiently when the database is updated. To address this issue, we propose the first fully-dynamic algorithm for the k-RMS problem that can efficiently provide the up-to-date result w.r.t. any insertion and deletion in the database with a provable guarantee. Experimental results on several real-world and synthetic datasets demonstrate that our algorithm runs up to four orders of magnitude faster than existing k-RMS algorithms while returning results of near-equal quality.

Index Terms—k-regret minimizing set, dynamic algorithm, set cover, top-k query, skyline
I. INTRODUCTION
In many real-world applications such as multi-criteria decision making [22], web search [29], and ad recommendation [20], it is crucial to find a succinct representative subset of a large database that meets the requirements of various users. For example, when a user queries for a hotel on a website (e.g., booking.com or expedia.com), she/he will receive thousands of available options as results. The website would like to display the best choices in the first few pages, from which almost all users can find what they are most interested in. A common method is to rank all results using a utility function that denotes a user's preference over different attributes (e.g., price, rating, and distance to destination for hotels) and to present only the top-k tuples with the highest scores according to this function. However, due to the wide diversity of user preferences, it is infeasible to represent the preferences of all users by any single utility function. Therefore, to select a set of highly representative tuples, it is necessary to take all (possible) user preferences into account.

A well-established approach to finding such representatives is the skyline operator [9]. It is based on the concept of domination: a tuple p dominates a tuple q iff p is as good as q on all attributes and strictly better than q on at least one attribute. For a given database, a skyline query returns its Pareto-optimal subset, which consists of all tuples that are not dominated by any other tuple. It is guaranteed that any user can find her/his best choice in the skyline, because the top-ranked result according to any monotone function cannot be dominated. Unfortunately, although skyline queries are effective for representing low-dimensional databases, their result sizes cannot be controlled and increase rapidly as the dimensionality (i.e., the number of attributes of a tuple) grows, particularly for databases with anti-correlated attributes.

Recently, the k-regret minimizing set (k-RMS) problem [3], [4], [10], [11], [19], [22], [23], [32] was proposed to alleviate this deficiency of skyline queries. Specifically, given a database P of tuples with d numeric attributes, the k-RMS problem aims to find a subset Q ⊆ P such that, for any possible utility function, the top-ranked tuple in Q approximates the top-k tuples in P within a small error. Here, the maximum k-regret ratio [11] (mrr_k) is used to measure how well Q represents P. For a utility function f, the k-regret ratio (rr_k) of Q over P is defined to be 0 if the top-ranked tuple in Q is among the top-k tuples in P w.r.t. f, and otherwise to be one minus the ratio between the score of the top-ranked tuple in Q and the score of the kth-ranked tuple in P w.r.t. f. Then, the maximum k-regret ratio (mrr_k) is defined as the maximum of rr_k over a class of (possibly infinitely many) utility functions. Given a positive integer r, a k-RMS on a database P returns a subset Q ⊆ P of size r that minimizes mrr_k. As an illustrative example, the website could run a k-RMS on all available hotels to pick a set of r candidates such that every user can find at least one hotel close to her/his top-k choices.

The k-RMS problem has been extensively studied. Theoretically, it is NP-hard [3], [10], [11] on any database with d ≥ 3. In general, we categorize existing k-RMS algorithms into three types. The first type comprises dynamic programming algorithms [4], [10], [11] for k-RMS on two-dimensional data.
Although they can provide optimal solutions when d = 2, they are not suitable for higher dimensions due to the NP-hardness of k-RMS. The second type is the greedy heuristic [11], [22], [23], which always adds the tuple that maximally reduces mrr_k at each iteration. Although these algorithms provide high-quality results empirically, they have no theoretical guarantee and suffer from low efficiency on high-dimensional data. The third type transforms k-RMS into another problem, such as ε-kernel [3], [10], [19], [32], discretized matrix min-max [4], hitting set [3], [19], or k-MEDOID clustering [26], and then utilizes existing solutions of the transformed problem for k-RMS computation. Although these algorithms are more efficient than the greedy heuristics while having theoretical bounds, they are designed for the static setting and, to the best of our knowledge, none of them can support updates to the database. Typically, most of them precompute the skyline as the input for k-RMS computation; once an insertion or deletion triggers any change in the skyline, they cannot maintain the result without re-running from scratch. Hence, existing k-RMS algorithms become very inefficient in highly dynamic environments where tuples are frequently inserted into and deleted from the database.

However, dynamic databases are very common in real-world scenarios, especially for online services. For example, in a hotel booking system, the prices and availability of rooms change frequently over time. As another example, in an IoT network, a large number of sensors may often connect to or disconnect from the server, and the sensors also update their statistics regularly. Therefore, it is essential to address the problem of maintaining an up-to-date result for k-RMS when the database is frequently updated.

In this paper, we propose the first fully-dynamic k-RMS algorithm that can efficiently maintain the result of k-RMS w.r.t. any tuple insertion and deletion in the database, with both a theoretical guarantee and good empirical performance. Our main contributions are summarized as follows:
• In Section II, we formally define the notion of maximum k-regret ratio and the k-regret minimizing set (k-RMS) problem in a fully-dynamic setting.
• In Section III, we propose the first fully-dynamic algorithm, called FD-RMS, to maintain the k-RMS result over tuple insertions and deletions in a database. Our basic idea is to transform fully-dynamic k-RMS into a dynamic set cover problem. Specifically, FD-RMS computes the (approximate) top-k tuples for a set of randomly sampled utility functions and builds a set system based on the top-k results. Then, the k-RMS result is retrieved from an approximate solution for set cover on this set system. Furthermore, we devise a novel algorithm for dynamic set cover by introducing the notion of a stable solution, which is used to efficiently update the k-RMS result whenever an insertion or deletion triggers changes in the top-k results and thus in the set system. We also provide detailed theoretical analyses of FD-RMS.
• In Section IV, we conduct extensive experiments on several real-world and synthetic datasets to evaluate the performance of FD-RMS. The results show that FD-RMS achieves up to four orders of magnitude speedup over existing k-RMS algorithms while providing results of near-equal quality in a fully dynamic setting.

II. PRELIMINARIES
In this section, we formally define the problem we study in this paper. We first introduce the notion of maximum k-regret ratio. Then, we formulate the k-regret minimizing set (k-RMS) problem in a fully dynamic setting. Finally, we present the challenges of solving fully-dynamic k-RMS.

Fig. 1: A two-dimensional database of 8 tuples.

A. Maximum K-Regret Ratio
Let us consider a database P where each tuple p ∈ P has d nonnegative numerical attributes p[1], ..., p[d] and is represented as a point in the nonnegative orthant R^d_+. A user's preference is denoted by a utility function f : R^d_+ → R_+ that assigns a positive score f(p) to each tuple p. Following [4], [11], [22], [23], [32], we restrict the class of utility functions to linear functions. A function f is linear if and only if there exists a d-dimensional vector u = (u[1], ..., u[d]) ∈ R^d_+ such that f(p) = ⟨u, p⟩ = Σ_{i=1}^{d} u[i]·p[i] for any p ∈ R^d_+. W.l.o.g., we assume the range of values on each dimension is scaled to [0, 1] and any utility vector is normalized to be a unit vector, i.e., ‖u‖ = 1 (the normalization does not affect our results because the maximum k-regret ratio is scale-invariant [22]). Intuitively, the class of linear functions corresponds to the nonnegative orthant of the d-dimensional unit sphere U = {u ∈ R^d_+ : ‖u‖ = 1}.

We use ϕ_j(u, P) to denote the tuple p ∈ P with the jth-largest score w.r.t. vector u and ω_j(u, P) to denote its score. Note that multiple tuples may have the same score w.r.t. u, and any consistent rule can be adopted to break ties. For brevity, we drop the subscript j from the above notations when j = 1, i.e., ϕ(u, P) = argmax_{p∈P} ⟨u, p⟩ and ω(u, P) = max_{p∈P} ⟨u, p⟩. The set of top-k tuples in P w.r.t. u is represented as Φ_k(u, P) = {ϕ_j(u, P) : 1 ≤ j ≤ k}. Given a real number ε ∈ (0, 1), the set of ε-approximate top-k tuples in P w.r.t. u is denoted as Φ_{k,ε}(u, P) = {p ∈ P : ⟨u, p⟩ ≥ (1 − ε)·ω_k(u, P)}, i.e., the set of tuples whose scores are at least (1 − ε)·ω_k(u, P).

For a subset Q ⊆ P and an integer k ≥ 1, we define the k-regret ratio of Q over P for a utility vector u by rr_k(u, Q) = max(0, 1 − ω(u, Q)/ω_k(u, P)), i.e., the relative loss of replacing the kth-ranked tuple in P by the top-ranked tuple in Q. Since it is required to consider the preferences of all possible users, our goal is to find a subset whose k-regret ratio is small for an arbitrary utility vector. Therefore, we define the maximum k-regret ratio of Q over P by mrr_k(Q) = max_{u∈U} rr_k(u, Q). Intuitively, mrr_k(Q) measures how well the top-ranked tuple of Q approximates the kth-ranked tuple of P in the worst case. For a real number ε ∈ (0, 1), Q is said to be a (k, ε)-regret set of P iff mrr_k(Q) ≤ ε, or equivalently, ϕ(u, Q) ∈ Φ_{k,ε}(u, P) for any u ∈ U. By definition, it holds that mrr_k(Q) ∈ [0, 1].

Example 1. Fig. 1 illustrates a database P in R²_+ with 8 tuples {p_1, ..., p_8}. For two utility vectors u_1 and u_2, the top-2 results Φ_2(u_1, P) and Φ_2(u_2, P) each consist of the two tuples with the highest scores w.r.t. the respective vector. For a size-2 subset Q of P whose top-ranked tuple w.r.t. u_1 is not among Φ_2(u_1, P), we have rr_2(u_1, Q) = 1 − ω(u_1, Q)/ω_2(u_1, P) > 0, and mrr_2(Q) is the largest such ratio over all u ∈ U. A subset Q′ of three tuples with mrr_2(Q′) ≤ ε is a (2, ε)-regret set of P.
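As a concrete illustration of these definitions, the following minimal Python sketch estimates rr_k and mrr_k by sampling utility vectors from U. It is our own illustration (the paper's implementation is in Java): the maximization over the infinite class U is replaced by a sampling-based estimate, and all function names are our assumptions.

```python
import numpy as np

def topk_score(P, u, k):
    """omega_k(u, P): the k-th largest score <u, p> over all tuples p in P."""
    scores = P @ u
    return np.sort(scores)[-k]

def regret_ratio(P, Q, u, k):
    """rr_k(u, Q) = max(0, 1 - omega(u, Q) / omega_k(u, P))."""
    return max(0.0, 1.0 - (Q @ u).max() / topk_score(P, u, k))

def max_regret_ratio(P, Q, k, n_samples=10000, seed=42):
    """Estimate mrr_k(Q) over sampled unit utility vectors drawn uniformly
    from the nonnegative orthant of the unit sphere (a lower bound on the
    true maximum over the infinite class U)."""
    rng = np.random.default_rng(seed)
    U = np.abs(rng.standard_normal((n_samples, P.shape[1])))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return max(regret_ratio(P, Q, u, k) for u in U)
```

Note that sampling yields a lower bound on the true mrr_k; the experiments in Section IV estimate mrr_k in the same way, using a large test set of random utility vectors.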
B. K-Regret Minimizing Set

Based on the notion of maximum k-regret ratio, we can formally define the k-regret minimizing set (k-RMS) problem as follows.

Definition 1 (k-Regret Minimizing Set). Given a database P ⊂ R^d_+ and a size constraint r ∈ Z^+ (r ≥ d), the k-regret minimizing set (k-RMS) problem returns a subset Q* ⊆ P of at most r tuples with the smallest maximum k-regret ratio, i.e., Q* = argmin_{Q⊆P : |Q|≤r} mrr_k(Q).

For any given k and r, we denote the k-RMS problem by RMS(k, r) and the maximum k-regret ratio of the optimal result Q* for RMS(k, r) by ε*_{k,r}. In particular, the r-regret query studied in [4], [22], [23], [32] is a special case of our k-RMS problem when k = 1, i.e., 1-RMS.
Example 2. Let us continue with the example in Fig. 1. For a query RMS(2, r) on P, the optimal result Q* is the subset that attains the smallest maximum 2-regret ratio among all size-r subsets of P, i.e., ε*_{2,r} = mrr_2(Q*).
In this paper, we focus on the fully-dynamic k-RMS problem. We consider an initial database P_0 and a (possibly countably infinite) sequence of operations Δ = ⟨Δ_1, Δ_2, ...⟩. At each timestamp t (t ∈ Z^+), the database is updated from P_{t−1} to P_t by performing an operation Δ_t of one of the following two types:
• Tuple insertion Δ_t = ⟨p, +⟩: add a new tuple p to P_{t−1}, i.e., P_t ← P_{t−1} ∪ {p};
• Tuple deletion Δ_t = ⟨p, −⟩: delete an existing tuple p from P_{t−1}, i.e., P_t ← P_{t−1} \ {p}.
Note that the update of a tuple can be processed as a deletion followed by an insertion, and thus is not discussed separately in this paper. Given an initial database P_0, a sequence of operations Δ, and a query RMS(k, r), we aim to keep track of the result Q*_t for RMS(k, r) on P_t at any time t.

Fully-dynamic k-RMS faces two challenges. First, the k-RMS problem is NP-hard [3], [10], [11] for any d ≥ 3. Thus, the optimal solution of k-RMS is intractable for any database with three or more attributes unless P=NP, in both static and dynamic settings; hence, we focus on maintaining an approximate result of k-RMS in this paper. Second, existing k-RMS algorithms only work in the static setting. They must recompute the result from scratch once an operation triggers any update in the skyline (note that, since the result of k-RMS is a subset of the skyline [11], [22], it remains unchanged for any operation on non-skyline tuples).
However, frequent recomputation leads to significant overhead and causes low efficiency on highly dynamic databases. Therefore, we propose a novel method for fully-dynamic k-RMS that can efficiently maintain a high-quality result for RMS(k, r) on a database w.r.t. any tuple insertion and deletion.

Fig. 2: An illustration of FD-RMS.
III. THE FD-RMS ALGORITHM
In this section, we present our FD-RMS algorithm for k-RMS in a fully dynamic setting. The general framework of FD-RMS is illustrated in Fig. 2. The basic idea is to transform fully-dynamic k-RMS into a dynamic set cover problem. Let us consider how to compute the result of RMS(k, r) on database P_t. First of all, we draw a set of m random utility vectors {u_1, ..., u_m} from U and maintain the ε-approximate top-k result of each u_i (i ∈ [1, m]) on P_t, i.e., Φ_{k,ε}(u_i, P_t). Note that ε is given as an input parameter of FD-RMS, and we discuss how to specify its value at the end of Section III. Then, we construct a set system Σ = (U, S) based on the approximate top-k results, where the universe is U = {u_1, ..., u_m} and the collection S consists of n_t sets (n_t = |P_t|), each of which corresponds to one tuple in P_t. Specifically, for each tuple p ∈ P_t, we define S(p) as the set of utility vectors for which p is an ε-approximate top-k result on P_t, or formally, S(p) = {u ∈ U : p ∈ Φ_{k,ε}(u, P_t)}, and S = {S(p) : p ∈ P_t}. After that, we compute a result Q_t for RMS(k, r) on P_t from an (approximate) solution for set cover on Σ: letting C ⊆ S be a set-cover solution of Σ, i.e., ⋃_{S(p)∈C} S(p) = U, we use the set of tuples corresponding to C, i.e., Q_t = {p ∈ P_t : S(p) ∈ C}, as the result of RMS(k, r) on P_t.

Given the above framework, two challenges remain in updating the result of k-RMS in a fully dynamic setting. First, because the size of Q_t is restricted to r, it is necessary to keep an appropriate value of m over time so that |C| ≤ r. Second, the updates in the approximate top-k results triggered by tuple insertions and deletions lead to changes in the set collection S, so the set-cover solution C must be maintained over time under changes in S. In fact, both challenges can be treated as a dynamic set cover problem that maintains a set-cover solution w.r.t. changes in both U and S. Therefore, we first introduce the background on dynamic set cover in Section III-A and then elaborate on how FD-RMS processes k-RMS in a fully dynamic setting using the dynamic set cover algorithm in Section III-B.
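To illustrate the transformation just described, the following minimal Python sketch (our own illustration; the function names and the brute-force top-k computation are assumptions, since FD-RMS maintains the top-k results incrementally with the indexes of Section III-C) shows how the set system Σ is derived from the ε-approximate top-k results:

```python
import numpy as np

def approx_topk(P, u, k, eps):
    """Phi_{k,eps}(u, P): indices of tuples scoring >= (1 - eps) * omega_k(u, P)."""
    scores = P @ u
    threshold = (1.0 - eps) * np.sort(scores)[-k]
    return {i for i, s in enumerate(scores) if s >= threshold}

def build_set_system(P, utils, k, eps):
    """Sigma = (U, S): S[p] collects the indices of the utility vectors for
    which tuple p is an eps-approximate top-k result on P."""
    S = {p: set() for p in range(len(P))}
    for j, u in enumerate(utils):
        for p in approx_topk(P, u, k, eps):
            S[p].add(j)
    universe = set(range(len(utils)))
    return universe, S
```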
A. Background: Dynamic Set Cover

Given a set system Σ = (U, S), the SET COVER problem asks for the smallest subset C* of S whose union equals the universe U. It is one of Karp's 21 NP-complete problems [18] and cannot be approximated within a factor of (1 − o(1))·ln m (m = |U|) unless P=NP [14]. A common method to find an approximate set-cover solution is the greedy algorithm: starting from C = ∅, it always adds the set that contains the largest number of uncovered elements of U to C at each iteration, until ⋃_{S∈C} S = U. Theoretically, the solution C achieves an approximation ratio of (1 + ln m), i.e., |C| ≤ (1 + ln m)·|C*|. But obviously, the greedy algorithm cannot dynamically update the set-cover solution when the set system Σ changes.

Recently, there have been theoretical advances on covering and related problems (e.g., vertex cover, maximum matching, set cover, and maximal independent set) in dynamic settings [1], [8], [16], [17]. Although these results open up new ways to design dynamic set cover algorithms, they cannot be directly applied to the update procedure of FD-RMS because of two limitations. First, existing dynamic algorithms for set cover [1], [16] can only handle updates in the universe U and assume that the set collection S does not change; in our scenario, however, the changes in top-k results lead to updates of S. Second, due to the extremely large constants introduced in their analyses, the solutions they return may be far from optimal in practice.
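For reference, here is a minimal Python sketch of the classic greedy heuristic described above (the function name is our own; it assumes every element of the universe occurs in at least one set):

```python
def greedy_set_cover(universe, S):
    """Classic greedy set cover: repeatedly pick the set covering the most
    uncovered elements; achieves a (1 + ln m)-approximation."""
    uncovered = set(universe)
    cover = []
    while uncovered:
        best = max(S, key=lambda s: len(S[s] & uncovered))
        cover.append(best)
        uncovered -= S[best]
    return cover
```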
Therefore, we devise a more practical approach to dynamic set cover that supports any update in both U and S. Our basic idea is to introduce the notion of stability of a set-cover solution and to prove that any stable solution is O(log m)-approximate (m = |U|) for set cover. Based on this result, we design an algorithm that maintains a set-cover solution w.r.t. any change in Σ by guaranteeing its stability.

We first formalize the concept of stability. Let C ⊆ S be a set-cover solution on Σ = (U, S). We define an assignment φ from each element u ∈ U to a unique set S ∈ C that contains u (formally, φ : U → C). For each set S ∈ C, its cover set cov(S) is defined as the set of elements assigned to S, i.e., cov(S) = {u ∈ U : φ(u) = S}. By definition, the cover sets of different sets in C are mutually disjoint. Then, we organize the sets in C into levels according to the numbers of elements they cover: a set S ∈ C is put in a higher level if it covers more elements, and vice versa. Specifically, we associate each level L_j (j ∈ N) with a range [2^j, 2^{j+1}) of cover-set sizes, and a set S ∈ C is assigned to level L_j if 2^j ≤ |cov(S)| < 2^{j+1}. We use A_j to denote the set of elements assigned to any set in L_j, i.e., A_j = {u ∈ U : φ(u) ∈ L_j}. Moreover, the notations L and A with range subscripts, e.g., L_{>j}, L_{≥j}, and L_{≤j} (resp. A_{>j}, A_{≥j}, and A_{≤j}), denote the unions of the levels (resp. of the elements assigned to them) whose indices fall in the given range.

Definition 2 (Stable Solution). A set-cover solution C on Σ with assignment φ is stable if (1) |cov(S)| ≥ 2^j for every set S ∈ L_j, and (2) |S ∩ A_{≤j}| < 2^{j+1} for every set S ∈ S and every level L_j.

Theorem 1. If a set-cover solution C is stable, it satisfies |C| ≤ O(log m)·|C*|.

Proof. Let OPT = |C*|, ρ* = m/OPT, and j* be the level index such that 2^{j*} ≤ ρ* < 2^{j*+1}. According to Condition (1) of Definition 2, we have |cov(S)| ≥ 2^{j*} for any S ∈ L_{≥j*}. Thus, it holds that

|L_{≥j*}| ≤ |A_{≥j*}| / 2^{j*} ≤ m / 2^{j*} = (ρ* / 2^{j*}) · OPT ≤ 2 · OPT.

For a level L_j with j < j*, according to Condition (2) of Definition 2, any S ∈ S covers at most 2^{j+1} elements in A_{≤j}, and thus at most 2^{j+1} elements in A_j. Hence, C* needs at least |A_j| / 2^{j+1} sets to cover A_j, i.e., OPT ≥ |A_j| / 2^{j+1}. Since |cov(S)| ≥ 2^j for each S ∈ L_j, it holds that |L_j| ≤ |A_j| / 2^j ≤ 2 · OPT. As 1 ≤ |cov(S)| ≤ m, the range of level indices is [0, log m], so the number of levels below L_{j*} is at most log m. To sum up,

|C| = |L_{≥j*}| + |L_{<j*}| ≤ 2 · OPT + 2 log m · OPT = O(log m) · OPT,

which concludes the proof. ∎
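As an illustration of Definition 2, the following minimal Python sketch (our own, not the authors' code) checks the two stability conditions for a solution represented by its disjoint cover sets:

```python
import math

def level_of(size):
    """Level index j such that 2^j <= size < 2^(j+1)."""
    return int(math.log2(size))

def is_stable(sets, cov):
    """Check Definition 2. sets: set id -> elements of that set in S;
    cov: set id in C -> its (disjoint) cover set."""
    level = {s: level_of(len(c)) for s, c in cov.items() if c}
    # Condition (1) holds by construction: each set sits at the level
    # determined by the size of its cover set. Check Condition (2):
    for j in sorted(set(level.values())):
        a_le_j = set().union(*(c for s, c in cov.items() if level.get(s, 0) <= j))
        if any(len(elems & a_le_j) >= 2 ** (j + 1) for elems in sets.values()):
            return False
    return True
```

STABILIZE in the algorithm below can be viewed as repeatedly repairing the violations this checker reports: it grabs S ∩ A_{≤j} into cov(S), removes those elements from the cover sets they were assigned to, and relevels the affected sets.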
Our dynamic set cover method is summarized in Algorithm 1. It first calls GREEDY to initialize a set-cover solution C on Σ (Line 1); the detailed procedure in Lines 13–19 follows the classic greedy algorithm for set cover, the only difference being that the sets in C are assigned to levels according to the sizes of their cover sets. The procedure for updating C w.r.t. a set operation σ is shown in Lines 2–12. Our method supports four types of set operations on Σ: σ = (u, S, ±), i.e., adding/removing an element u to/from a set S ∈ S, and σ = (u, U, ±), i.e., adding/removing an element u to/from the universe U. We identify three cases in which the assignment of u must be changed for σ: when σ = (u, S, −) and φ(u) = S, u is reassigned to another set containing it; for σ = (u, U, ±), the assignment of u is added or deleted accordingly. After that, for each set with some change in its cover set, RELEVEL is called (e.g., Lines 5, 8, and 11) to check whether the set should be moved to a new level based on the updated size of its cover set; its detailed procedure is given in Lines 20–27. Finally, STABILIZE (Line 12) is always called for every σ to guarantee the stability of C, since C may become unstable due to the changes in Σ and φ. The stabilization procedure (Lines 28–32) finds all sets that violate Condition (2) of Definition 2 and adjusts C for these sets until no set needs to be adjusted anymore.

Algorithm 1: DYNAMICSETCOVER
Input: Set system Σ, set operation σ
Output: Stable set-cover solution C
/* compute an initial solution C on Σ */
1:  C ← GREEDY(Σ);
/* update C for σ = (u, S, ±) or (u, U, ±) */
2:  if σ = (u, S, −) and u ∈ cov(S) then
3:      cov(S) ← cov(S) \ {u};
4:      cov(S⁺) ← cov(S⁺) ∪ {u} for some S⁺ ∈ S s.t. u ∈ S⁺;
5:      RELEVEL(S) and RELEVEL(S⁺);
6:  else if σ = (u, U, +) then
7:      cov(S⁺) ← cov(S⁺) ∪ {u} for some S⁺ ∈ S s.t. u ∈ S⁺;
8:      RELEVEL(S⁺);
9:  else if σ = (u, U, −) then
10:     cov(S⁻) ← cov(S⁻) \ {u} if u ∈ cov(S⁻);
11:     RELEVEL(S⁻);
12: STABILIZE(C);
13: Function GREEDY(Σ):
14:     I ← U, L_j ← ∅ for every j ≥ 0;
15:     while I ≠ ∅ do
16:         S* ← argmax_{S∈S} |I ∩ S|, cov(S*) ← I ∩ S*;
17:         add S* to L_j s.t. 2^j ≤ |cov(S*)| < 2^{j+1};
18:         I ← I \ cov(S*);
19:     return C ← ⋃_{j≥0} L_j;
20: Function RELEVEL(S):
21:     if cov(S) = ∅ then
22:         C ← C \ {S};
23:     else
24:         let L_j be the current level of S;
25:         if |cov(S)| < 2^j or |cov(S)| ≥ 2^{j+1} then
26:             let j′ be the index s.t. 2^{j′} ≤ |cov(S)| < 2^{j′+1};
27:             move S from L_j to L_{j′};
28: Function STABILIZE(C):
29:     while ∃ S ∈ S and L_j s.t. |S ∩ A_{≤j}| ≥ 2^{j+1} do
30:         cov(S) ← cov(S) ∪ (S ∩ A_{≤j}), RELEVEL(S);
31:         while ∃ S′ ∈ C s.t. cov(S) ∩ cov(S′) ≠ ∅ do
32:             cov(S′) ← cov(S′) \ cov(S), RELEVEL(S′);

Theoretical Analysis: Next, we analyze Algorithm 1 theoretically. We first show that the set-cover solution returned by GREEDY is stable; we then prove that STABILIZE converges to a stable solution in a finite number of steps.

Lemma 1. The solution C returned by GREEDY is stable.

Proof. First of all, each set S ∈ C is assigned to the correct level according to the size of its cover set, so Condition (1) of Definition 2 is satisfied. Then, we sort the sets in C as S*_1, ..., S*_{|C|} by the order in which they are added. Let S*_i be the first set added to level L_j, i.e., the set such that |cov(S*_i)| < 2^{j+1} and |cov(S*_{i′})| ≥ 2^{j+1} for any i′ < i. We have |I ∩ S*_i| = |cov(S*_i)| < 2^{j+1}, where I is the set of uncovered elements before S*_i is added to C. If there were a set S ∈ S such that |S ∩ A_{≤j}| ≥ 2^{j+1}, we would acquire |I ∩ S| ≥ |S ∩ A_{≤j}| ≥ 2^{j+1} > |I ∩ S*_i|, which contradicts the choice of S*_i in Line 16 of Algorithm 1. Thus, C must also satisfy Condition (2) of Definition 2. To sum up, C is a stable solution. ∎

Lemma 2. STABILIZE converges to a stable solution in O(m log m) steps.

Proof. For an iteration of the while loop (i.e., Lines 28–32) that picks a set S and a level L_j, the new level L_{j′} of S always satisfies j′ > j. Accordingly, all the elements in cov(S) are moved from A_{≤j} to A_{j′}; at the same time, no element in A_{≥j′} is moved to a lower level. Since each level contains at most m elements (|A_j| ≤ m), STABILIZE moves at most m elements across O(log m) levels. Therefore, it must terminate in O(m log m) steps.
Furthermore, after termination, the set-cover solution C must satisfy both conditions of Definition 2. Thus, we conclude the proof. ∎

The above two lemmas guarantee that the set-cover solution provided by Algorithm 1 is always stable after any change in the set system. In the next subsection, we present how to use it for fully-dynamic k-RMS.

B. Algorithmic Description

Next, we present how FD-RMS maintains the k-RMS result by always keeping a stable set-cover solution on a dynamic set system built from the approximate top-k results over tuple insertions and deletions.

Initialization: We first present how FD-RMS computes an initial result Q_0 for RMS(k, r) on P_0 from scratch in Algorithm 2. There are two parameters in FD-RMS: the approximation factor ε of the top-k results and the upper bound M of the sample size. The lower bound of the sample size is set to r because we can always find a set-cover solution of size equal to the size of the universe (i.e., m in FD-RMS). First of all, it draws M utility vectors {u_1, ..., u_M}, where the first d vectors are the standard basis of R^d_+ and the remaining ones are uniformly sampled from U, and computes the ε-approximate top-k result of each vector. Subsequently, it finds an appropriate m ∈ [r, M] such that the size of the set-cover solution on the set system Σ built on U = {u_1, ..., u_m} is exactly r. The detailed procedure is presented in Lines 3–14: it performs a binary search on the range [r, M] to determine the value of m. For a given m, it first constructs the set system Σ according to Lines 5–7. Next, it runs GREEDY of Algorithm 1 to compute a set-cover solution C on Σ. After that, if |C| ≠ r and m < M, it refreshes the value of m and reruns the above procedure; otherwise, m is determined and all tuples whose corresponding sets are included in the current set-cover solution C are returned as the result Q_0 for RMS(k, r) on P_0 (Line 15).

Algorithm 2: INITIALIZATION
Input: Query RMS(k, r), initial database P_0, parameters ε ∈ (0, 1) and M ∈ Z^+ (M > r)
Output: Result Q_0 of RMS(k, r) on P_0
1:  draw M vectors {u_i ∈ U : i ∈ [1, M]}, where the first d are the standard basis of R^d_+ and the remaining are uniformly sampled from U;
2:  compute Φ_{k,ε}(u_i, P_0) of every u_i, where i ∈ [1, M];
3:  L ← r, H ← M, m ← ⌊(L + H)/2⌋;
4:  while true do
5:      foreach p ∈ P_0 do
6:          S(p) ← {u_i : i ∈ [1, m] ∧ p ∈ Φ_{k,ε}(u_i, P_0)};
7:      Σ ← (U, S), where U = {u_i : i ∈ [1, m]} and S = {S(p) : p ∈ P_0};
8:      C ← GREEDY(Σ);
9:      if |C| < r then
10:         L ← m + 1, m ← ⌊(L + H)/2⌋;
11:     else if |C| > r then
12:         H ← m − 1, m ← ⌊(L + H)/2⌋;
13:     else if |C| = r or m = M then
14:         break;
15: return Q_0 ← {p ∈ P_0 : S(p) ∈ C};

Update: The procedure of updating the result of RMS(k, r) w.r.t. Δ_t is shown in Algorithm 3. First, it updates the database from P_{t−1} to P_t and the approximate top-k result from Φ_{k,ε}(u_i, P_{t−1}) to Φ_{k,ε}(u_i, P_t) for each u_i w.r.t. Δ_t (Lines 1–3). Then, it maintains the set system Σ according to the changes in the approximate top-k results (Line 4). Next, it updates the set-cover solution C for the changes in Σ as follows.

Algorithm 3: UPDATE
Input: Query RMS(k, r), database P_{t−1}, operation Δ_t, set-cover solution C
Output: Result Q_t for RMS(k, r) on P_t
1:  update P_{t−1} to P_t w.r.t. Δ_t;
2:  for i ← 1, ..., M do
3:      update Φ_{k,ε}(u_i, P_{t−1}) to Φ_{k,ε}(u_i, P_t) w.r.t. Δ_t;
4:  maintain Σ based on the changes in Φ_{k,ε}(u_i, P_t);
5:  if Δ_t = ⟨p, +⟩ then
6:      foreach u ∈ S(p) do
7:          if u ∈ cov(S(p′)) and u ∉ S(p′) then
8:              update C for σ = (u, S(p′), −);
9:  else if Δ_t = ⟨p, −⟩ then
10:     delete S(p) from C if S(p) ∈ C;
11:     foreach u ∈ cov(S(p)) do
12:         update C for σ = (u, S(p), −);
13: if |C| ≠ r then m, C ← UPDATEM(Σ);
14: return Q_t ← {p ∈ P_t : S(p) ∈ C};

• Insertion: The procedure of updating C w.r.t. an insertion Δ_t = ⟨p, +⟩ is presented in Lines 5–8. The changes in the top-k results lead to two updates in Σ: (1) the insertion of S(p) into S; and (2) a series of deletions, each of which represents a tuple p′ that is deleted from Φ_{k,ε}(u, P_t) due to the insertion of p.
For each such deletion, it needs to check whether u was previously assigned to S(p′). If so, it updates C by reassigning u to a new set according to Algorithm 1, because u has been deleted from S(p′).
• Deletion: The procedure of updating C w.r.t. a deletion Δ_t = ⟨p, −⟩ is shown in Lines 9–12. In contrast to an insertion, the deletion of p leads to the removal of S(p) from S and a series of insertions. Thus, it must delete S(p) from C and then reassign each u ∈ cov(S(p)) to a new set according to Algorithm 1.

Then, it checks whether the size of C is still r. If not, it updates the sample size m and the universe U so that the set-cover solution C consists of r sets. The procedure of updating m and U, as well as maintaining C on the updated U, is shown in Algorithm 4. When |C| < r, it adds new utility vectors, from u_{m+1} onward, to the universe and maintains C until |C| = r or m = M. On the contrary, when |C| > r, it drops existing utility vectors, from u_m backward, from the universe and maintains C until |C| = r. Finally, the updated m and C are returned. After all the above procedures, Algorithm 3 returns the set Q_t corresponding to the set-cover solution C on the updated Σ as the result of RMS(k, r) on P_t.

Algorithm 4: UPDATEM(Σ)
Output: Updated sample size m and solution C on Σ
if |C| < r then
    while m < M and |C| < r do
        m ← m + 1, U ← U ∪ {u_m};
        foreach p ∈ Φ_{k,ε}(u_m, P_t) do S(p) ← S(p) ∪ {u_m};
        update C for σ = (u_m, U, +);
else if |C| > r then
    while |C| > r do
        U ← U \ {u_m};
        foreach p ∈ Φ_{k,ε}(u_m, P_t) do S(p) ← S(p) \ {u_m};
        update C for σ = (u_m, U, −);
        m ← m − 1;
return m, C;

Example 3. Fig. 3 illustrates how FD-RMS processes a k-RMS query with k = 1 and r = 3, using M = 9 candidate utility vectors and a small ε. In Fig. 3b, to compute Q_0 for RMS(1, 3) on P_0 = {p_1, ..., p_8}, it first uses m = (3 + 9)/2 = 6 and runs GREEDY to obtain a set-cover solution C consisting of three sets. Since |C| = 3 = r, it does not change m anymore and returns the three tuples corresponding to C as Q_0. Fig. 3c shows the result after the update procedures of Algorithm 3 for an insertion Δ_1 = ⟨p_9, +⟩ of a new tuple p_9, where the result Q_1 is updated accordingly. Finally, after the update procedures for a subsequent deletion Δ_2, as shown in Fig. 3d, m is updated and a new result Q_2 for RMS(1, 3) on P_2 is returned.

Theoretical Bound: The theoretical bound of FD-RMS is analyzed as follows. First of all, we verify that the set-cover solution C maintained by Algorithms 2–4 is always stable. According to Lemma 1, the set-cover solution returned by Algorithm 2 is stable. It then remains stable after the update procedures of Algorithms 3 and 4: both algorithms use Algorithm 1 to maintain the set-cover solution C, and by Lemma 2 the stability of C is guaranteed since STABILIZE is always called after every update in Algorithm 1.

Next, we indicate the relationship between the result of k-RMS and the set-cover solution, and provide a bound on the maximum k-regret ratio of the result Q_t returned by FD-RMS on P_t.

Theorem 2. The result Q_t returned by FD-RMS is a (k, O(ε*_{k,r′} + δ))-regret set of P_t with high probability, where r′ = O(r/log m) and δ = O(m^{−1/(d−1)}).
Fig. 3: An example of using FD-RMS to process a k-RMS with k = 1 and r = 3 ((a) dataset; (b) initial construction; (c) insertion of a tuple; (d) deletion of a tuple).

Proof. Given a parameter δ > 0, a δ-net [3] of U is a finite set U_δ ⊂ U such that, for any u ∈ U, there exists a vector u′ ∈ U_δ with ‖u − u′‖ ≤ δ. It is known that a random sample of O(δ^{−(d−1)} log(1/δ)) vectors from U is a δ-net of U with high probability [3].

Let B be the standard basis of R^d_+ and U_m = {u_1, ..., u_m} be a δ-net of U with B = {u_1, ..., u_d} ⊂ U_m. Since each tuple p ∈ P_t is scaled so that p[i] ≤ 1 for i ∈ [1, d], it holds that ‖p‖ ≤ √d. According to the definition of a δ-net, for every u ∈ U there exists a vector u′ ∈ U_m such that ‖u − u′‖ ≤ δ. Hence, for any tuple p ∈ P_t,

|⟨u, p⟩ − ⟨u′, p⟩| = |⟨u − u′, p⟩| ≤ ‖u − u′‖ · ‖p‖ ≤ δ√d.   (1)

Moreover, as Q_t corresponds to a set-cover solution C on Σ, for every u′ ∈ U_m there exists a tuple q ∈ Q_t such that ⟨u′, q⟩ ≥ (1 − ε) · ω_k(u′, P_t). We first consider a basis vector u_i ∈ U_m for some i ∈ [1, d]. We have ω(u_i, Q_t) ≥ (1 − ε) · ω_k(u_i, P_t) and thus ω(u_i, Q_t) ≥ (1 − ε) · c, where c = min_{i∈[1,d]} ω_k(u_i, P_t). Since ‖u‖ = 1, there must exist some i with u[i] ≥ 1/√d for any u ∈ U. Therefore, it holds that ω(u, Q_t) ≥ ω(u_i, Q_t) · (1/√d) ≥ (1 − ε) · c/√d for any u ∈ U.

Next, we discuss two cases for u ∈ U separately.
• Case 1 (ω_k(u, P_t) ≤ c/√d): In this case, ω(u, Q_t) ≥ (1 − ε) · c/√d ≥ (1 − ε) · ω_k(u, P_t), so there always exists q ∈ Q_t such that ⟨u, q⟩ ≥ (1 − ε) · ω_k(u, P_t).
• Case 2 (ω_k(u, P_t) > c/√d): Let u′ ∈ U_m be the utility vector such that ‖u − u′‖ ≤ δ, and let Φ_k(u, P_t) = {p_1, ..., p_k} be the top-k results of u on P_t. According to Equation (1), we have ⟨u′, p_i⟩ ≥ ⟨u, p_i⟩ − δ√d for all i ∈ [1, k], and thus ⟨u′, p_i⟩ ≥ ω_k(u, P_t) − δ√d. Hence, there exist k tuples in P_t with scores at least ω_k(u, P_t) − δ√d w.r.t. u′, and we acquire ω_k(u′, P_t) ≥ ω_k(u, P_t) − δ√d. Therefore, there exists q ∈ Q_t such that

⟨u, q⟩ ≥ ⟨u′, q⟩ − δ√d
       ≥ (1 − ε) · ω_k(u′, P_t) − δ√d
       ≥ (1 − ε) · (ω_k(u, P_t) − δ√d) − δ√d
       ≥ (1 − ε − (1 − ε) · dδ/c − dδ/c) · ω_k(u, P_t)
       ≥ (1 − ε − 2dδ/c) · ω_k(u, P_t),

where the fourth inequality uses δ√d ≤ (dδ/c) · ω_k(u, P_t), which follows from ω_k(u, P_t) > c/√d.

Considering both cases, we have ω(u, Q_t) ≥ (1 − ε − 2dδ/c) · ω_k(u, P_t) for any u ∈ U, and thus mrr_k(Q_t) over P_t is at most ε + 2dδ/c. Therefore, Q_t is a (k, O(ε + δ))-regret set of P_t with high probability for any c, d = O(1).
Moreover, since FD-RMS uses m utility vectors including B to compute Q_t and m = O(δ^{−(d−1)} log(1/δ)), we can acquire δ = O(m^{−1/(d−1)}) (here, we ignore the logarithmic factor).

Finally, any (k, ε)-regret set of P_t corresponds to a set-cover solution on Σ (otherwise, the regret ratio would be larger than ε for some utility vector). Since the stable solution C we maintain has exactly r sets, Theorem 1 implies that the optimal set-cover solution on Σ has size Ω(r/log m); hence, the maximum k-regret ratio of any subset of P_t with at most r′ = O(r/log m) tuples is at least ε, i.e., ε*_{k,r′} ≥ ε. Therefore, we conclude that Q_t is a (k, O(ε*_{k,r′} + δ))-regret set of P_t with high probability. ∎

The upper bound on the maximum k-regret ratio of the result Q_t returned by FD-RMS on P_t is given in the following corollary derived from Theorem 2.

Corollary 1. It satisfies mrr_k(Q_t) = O(r^{−1/(d−1)}) with high probability if we assume ε = O(m^{−1/(d−1)}).

Proof. As indicated in the proof of Theorem 2, U_m = {u_1, ..., u_m} is a δ-net of U with δ = O(m^{−1/(d−1)}) with high probability. Moreover, we have mrr_k(Q_t) = O(ε + δ) and thus mrr_k(Q_t) = O(m^{−1/(d−1)}) if ε = O(m^{−1/(d−1)}). In addition, at any time, U_m must contain at least r utility vectors, i.e., m ≥ r. Thus, mrr_k(Q_t) = O(r^{−1/(d−1)}) since m^{−1/(d−1)} ≤ r^{−1/(d−1)} for any d > 1, which concludes the proof. ∎

Since ε is tunable in FD-RMS, by trying different values of ε, we can always find an appropriate one such that ε = O(m^{−1/(d−1)}). Hence, Corollary 1 shows that, under a mild assumption, the upper bound of FD-RMS matches that of CUBE [22] and is slightly higher than that of SPHERE [32].

Complexity Analysis: First, we use tree-based methods to maintain the approximate top-k results for FD-RMS (see Section III-C for details). Here, the time complexity of each top-k query is O(n), where n = |P_0|, because the number of ε-approximate top-k tuples can be O(n); hence, it takes O(M·n) time to compute all top-k results. Then, GREEDY runs O(r) iterations to obtain a set-cover solution; at each iteration, it evaluates O(n) sets to find S* in Line 16 of Algorithm 1. Thus, the time complexity of GREEDY is O(r·n). FD-RMS calls GREEDY O(log M) times to determine the value of m. Therefore, the time complexity of computing Q_0 on P_0 is O((M + r log M) · n). In Algorithm 3, the time complexity of updating the top-k results and the set system Σ is O(u(Δ_t) · n_t), where u(Δ_t) is the number of utility vectors whose top-k results are changed by Δ_t. The maximum number of reassignments in cover sets for Δ_t is |S(p)|, which is bounded by O(u(Δ_t)). In addition, the time complexity of STABILIZE is O(m log m) according to Lemma 2, and the difference between the old and new values of m is O(m). Hence, the total time complexity of updating Q_t w.r.t. Δ_t is O(u(Δ_t) · n_t + m log m).

C. Implementation Issues

Index Structures: As indicated in Line 2 of Algorithm 2 and Line 3 of Algorithm 3, FD-RMS must compute the ε-approximate top-k result of each u_i (i ∈ [1, M]) on P_0 and update it w.r.t. Δ_t. Here, we elaborate on our implementation of top-k maintenance.
In order to process a large number of (approximate) top-k queries under frequent updates in the database, we implement a dual-tree [12], [25], [34] that comprises a tuple index TI and a utility index UI.

The goal of TI is to efficiently retrieve the ε-approximate top-k result Φ_{k,ε}(u, P_t) of any utility vector u on the up-to-date P_t. Hence, any space-partitioning index, e.g., the k-d tree [7] or the Quadtree [15], can serve as TI for top-k query processing; in practice, we use a k-d tree as TI. We adopt the scheme of [6] to transform a top-k query in R^d into a k-NN query in R^{d+1} (a sketch of this reduction appears at the end of this subsection). We then implement the standard top-down methods to construct TI on P_0 and to update it w.r.t. Δ_t, and use the branch-and-bound algorithm for top-k queries on TI.

The goal of UI is to cluster the sampled utility vectors so as to efficiently find each vector whose ε-approximate top-k result is updated by Δ_t. Since the top-k results of linear functions are determined merely by directions, the basic idea of UI is to cluster utilities with high cosine similarities together. Therefore, we adopt an angular-based binary space partitioning tree called the cone tree [25] as UI. We generally follow Algorithms 8–9 of [25] to build UI for {u_1, ..., u_M}, and implement a top-down approach based on Section 3.2 of [34] to update the top-k results affected by Δ_t.

Parameter Tuning: Now, we discuss how to specify ε, i.e., the approximation factor of top-k queries, and M, i.e., the upper bound of m, in FD-RMS. In general, the value of ε directly affects m as well as the efficiency and quality of results of FD-RMS. In particular, if ε is larger, the ε-approximate top-k result of each utility vector includes more tuples and the set system built on the top-k results becomes denser. As a result, to guarantee a result size of exactly r, FD-RMS uses more utility vectors (i.e., a larger m) for a larger ε. Therefore, a smaller ε leads to higher efficiency but lower solution quality, due to the smaller m and larger δ, and vice versa. In our implementation, we use a trial-and-error method to find appropriate values of ε and M: for each query RMS(k, r) on a dataset, we test a predefined list of increasing values of ε and, for each value of ε, set M to the smallest candidate from a predefined list that always guarantees m < M. If the result size is still less than r when m reaches the largest candidate for M, we do not test larger values of M due to efficiency issues. The values of ε and M that strike the best balance between efficiency and quality of results are then used. In Fig. 5, we present how the value of ε affects the performance of FD-RMS empirically.
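The sketch referenced in the description of TI above: a minimal Python illustration of the transformation of [6] from top-k queries in R^d to k-NN queries in R^{d+1}. The brute-force distance computation is used here only for clarity; the actual TI answers the k-NN query with branch-and-bound over a k-d tree, and the function names are our own.

```python
import numpy as np

def lift_tuples(P):
    """Lift tuples to R^{d+1}: pad each p with sqrt(phi^2 - |p|^2), where
    phi = max_p |p|, so that all lifted points have the same norm phi."""
    norms = np.linalg.norm(P, axis=1)
    phi = norms.max()
    pad = np.sqrt(phi ** 2 - norms ** 2)
    return np.hstack([P, pad[:, None]])

def topk_by_knn(P_lifted, u, k):
    """Top-k by inner product == k-NN in the lifted space for query (u, 0):
    |q - p'|^2 = |u|^2 + phi^2 - 2 * <u, p>, so minimizing the distance
    is equivalent to maximizing the score <u, p>."""
    q = np.append(u, 0.0)
    d2 = ((P_lifted - q) ** 2).sum(axis=1)
    return np.argsort(d2)[:k]
```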
IV. EXPERIMENTS

In this section, we evaluate the performance of FD-RMS on real-world and synthetic datasets. We first introduce the experimental setup in Section IV-A, and then present the experimental results in Section IV-B.

A. Experimental Setup

Algorithms: The algorithms compared are listed as follows.
• GREEDY: the greedy algorithm for 1-RMS in [22].
• GREEDY*: the randomized greedy algorithm for k-RMS with k > 1 proposed in [11].
• GEOGREEDY: a variation of the greedy algorithm for 1-RMS proposed in [23].
• DMM-RRMS & DMM-GREEDY: two discretized matrix min-max based algorithms for 1-RMS in [4].
• ε-KERNEL: computing an ε-kernel coreset [2] directly as the k-RMS result [3], [10].
• HS: a hitting-set based algorithm for k-RMS in [3].
• SPHERE: an algorithm that combines ε-kernel and GREEDY for 1-RMS in [32].
• FD-RMS: our fully-dynamic k-RMS algorithm proposed in this paper.

The algorithms that are only applicable to two-dimensional datasets are not compared in the experiments. Note that all the above algorithms except FD-RMS cannot directly work in a fully dynamic setting; in the experiments, they re-run from scratch to compute the up-to-date k-RMS result once the skyline is updated by any insertion or deletion. In addition, the algorithms that are not applicable when k > 1 are excluded from the experiments on the effect of k. Since ε-KERNEL and HS are proposed for min-size k-RMS, which returns the smallest subset whose maximum k-regret ratio is at most ε, we adapt them to our problem by performing a binary search on ε in the range (0, 1) to find the maximum value of ε that guarantees a result size of at most r.

Our implementation of FD-RMS is in Java 8 and publicly available on GitHub (github.com/yhwang1990/dynamic-rms). We used the C++ implementations of the baseline algorithms published by their authors and followed the default parameter settings described in the original papers. All experiments were conducted on a server running Ubuntu 18.04.1 with a 2.3GHz processor and 256GB memory.

Datasets: The datasets we use are listed as follows.
• BB is a basketball dataset in which each tuple represents one player/season combination with attributes such as points and rebounds.
• AQ includes hourly air-pollution and weather data from 12 monitoring sites in Beijing (archive.ics.uci.edu/ml/datasets/Beijing+Multi-Site+Air-Quality+Data). Each tuple has 9 attributes, including the concentrations of air pollutants such as PM2.5 as well as meteorological parameters such as temperature.
Performance Measures: The efficiency of each algorithm wasmeasured by average update time , i.e., the average wall-clocktime of an algorithm to update the result of k -RMS for eachoperation. For the static algorithms, we only took the timefor k -RMS computation into account and ignored the timefor skyline maintenance for fair comparison. The quality ofresults was measured by the maximum k -regret ratio ( mrr k ) archive.ics.uci.edu/ml/datasets/covertype grouplens.org/datasets/movielens for a given size constraint r , and, of course, the smaller mrr k the better. To compute mrr k ( Q ) of a result Q , we generated atest set of K random utility vectors and used the maximumregret value found as our estimate. Since the k -RMS resultswere recorded times for each query, we reported the averageof the maximum k -regret ratios of results for evaluation. B. Experimental Results Effect of parameter ε on FD-RMS: In Fig. 5, we present theeffect of the parameter ε on the performance of FD-RMS. Wereport the update time and maximum regret ratios of FD-RMSfor k = 1 and r = 50 on each dataset ( r = 20 on BB since themaximum regret ratio is always for r > ) with varying ε from . to . . We use the method described inSection III-C to set the value of M for each ε . Note that wewill not test larger values of ε after M reaches . First ofall, the update time of FD-RMS increases significantly with ε .This is because both the time to process an ε -approximate top- k query and the number of top- k queries (i.e., M ) grow with ε ,which requires a larger overhead to maintain both top- k resultsand set-cover solutions. Meanwhile, the quality of results firstbecomes better when ε is larger but then could degrade if ε is too large. The improvement in quality with increasing ε isattributed to larger m and thus smaller δ . However, once ε isgreater than the maximum regret ratio ε k,r of the optimal result(whose upper bound can be inferred from practical results),the result of FD-RMS will contain less than r tuples and itsmaximum regret ratio will be close to ε no matter how large m is. To sum up, by setting ε to the one that is slightly lower than ε k,r among [0 . , . . . , . , FD-RMS performs better interms of both efficiency and solution quality, and the valuesof ε are decided in this way for the remaining experiments. Effect of result size r : In Fig. 6, we present the performanceof different algorithms for -RMS (a.k.a. r -regret query) withvarying the result size r . We vary r from to onall datasets (except BB where r is varied from to because the maximum regret ratio has dropped to when r = 25 ). In general, the update time of each algorithm growswhile the maximum regret ratios decrease with increasing r .But, for FD-RMS, it could take less update time when r islarger in some cases. The efficiency of FD-RMS is positivelycorrelated with m but negatively correlated with ε . On aspecific dataset, FD-RMS typically chooses a smaller ε when r is large, and vice versa. When ε becomes smaller, m willdecrease even for a larger r . Therefore, the update time ofFD-RMS can decrease with r because of smaller ε . Amongall algorithms tested, G REEDY has the lowest efficiency. Itoften cannot provide results within one day when r > onseveral datasets (e.g., AQ , CT , and AntiCor ). G EO G REEDY runs much faster than G REEDY while achieving equivalentquality on low-dimensional data. However, it cannot scale upto high dimensions (i.e., d > ) because the cost of finding happy points grows significantly with d . 
DMM-RRMS andDMM-G REEDY suffer from two drawbacks: (1) They alsocannot scale up to d > due to extremely large memoryconsumption; (2) The quality of results is not competitive ε (× − ) − t i m e ( m s ) m rr (a) BB ε (× − ) − t i m e ( m s ) m rr (b) AQ ε (× − ) t i m e ( m s ) m rr (c) CT ε (× − ) − t i m e ( m s ) m rr (d) Movie ε (× − ) − − t i m e ( m s ) m rr (e) Indep ε (× − ) − − t i m e ( m s ) m rr (f) AntiCorFig. 5: Performance of FD-RMS with varying ε ( k = 1 ; r = 20 for BB and r = 50 for other datasets). Note that the red line represents the update time andthe blue bars denote the maximum regret ratios. DMM-Greedy DMM-RRMS ε -Kernel GeoGreedy Greedy HS Sphere FD-RMS r − − t i m e ( m s ) r m rr (a) BB 10 40 70 100 r − t i m e ( m s ) 10 40 70 100 r m rr (b) AQ 10 40 70 100 r t i m e ( m s ) 10 40 70 100 r m rr (c) CT 20 40 60 80 100 r t i m e ( m s ) 20 40 60 80 100 r m rr (d) Movie 10 40 70 100 r − − t i m e ( m s ) 10 40 70 100 r m rr (e) Indep 10 40 70 100 r − t i m e ( m s ) 10 40 70 100 r m rr (f) AntiCorFig. 6: Update time and maximum regret ratios with varying the result size r ( k = 1 ) when r ≥ because of the sparsity of space discretization.The solution quality of ε -K ERNEL is typically inferior to anyother algorithm because the size of an ε -kernel coreset is muchlarger than that of the minimum (1 , ε ) -regret set for the same ε .S PHERE and FD-RMS achieve better overall performance thanother algorithms. Although HS can provide results with similarquality to those of S PHERE and FD-RMS in most cases, itruns two to three orders of magnitude slower. Compared withS PHERE , FD-RMS performs much better on datasets withlarger skyline sizes, e.g., CT and AntiCor . On these datasets,FD-RMS runs up to three orders of magnitude faster thanS PHERE . At the same time, the maximum regret ratios of theresults of FD-RMS are very close (the differences are lessthan . ) to those of S PHERE . Generally, FD-RMS alwaysruns faster than S PHERE for different r on all datasets exceptfor r = 25 on BB . To sum up, FD-RMS outperforms allstatic algorithms for -RMS, especially on datasets with largeskyline sizes, in a fully dynamic setting. Effect of k : The results for k -RMS with varying k from to are illustrated in Fig. 7. We only compare FD-RMS withG REEDY ∗ , ε -K ERNEL , and HS because other algorithms arenot applicable to the case when k > . We set r = 10 for BB and Indep and r = 50 for the other datasets. The results ofG REEDY ∗ for k > are only available on BB and Indep . Forthe other datasets, G REEDY ∗ fails to return any result withinone day when k > . We can see all algorithms run muchslower when k increases. For FD-RMS, lower efficiencies arecaused by higher cost of maintaining top- k results. HS and ε -K ERNEL must consider all tuples in the datasets insteadof only skylines to validate that the maximum k -regret ratio is at most ε when k > . For G REEDY ∗ , the number oflinear programs to compute k -regret ratios increases drasticallywith k . Meanwhile, the maximum k -regret ratios drop with k ,which is obvious according to its definition. FD-RMS achievesspeedups of up to four orders of magnitude than the baselineson all datasets. At the same time, the solution quality of FD-RMS is also better on all datasets except Movie and CT , wherethe results of HS are of slightly higher quality in some cases. Scalability: We evaluate the scalability of different algorithmsw.r.t. the dimensionality d and dataset size n on syntheticdatasets. 
To evaluate the effect of d , we fix n = 100 K, k = 1 , r = 50 , and vary d from to . The performance withvarying d is shown in Fig. 8a–8b. Both the update time andmaximum regret ratios of all algorithms increase dramaticallywith d . Only FD-RMS and S PHERE can provide results on AntiCor when d = 9 and . Compared with S PHERE , FD-RMS has better efficiency when the dimensionality is higher. Itachieves speedups of times over S PHERE while providingresults of equivalent quality when d ≥ . To evaluate theeffect of n , we fix d = 6 , k = 1 , r = 50 , and vary n from K to M. The performance with varying n is shown inFig. 8c–8d. For static algorithms, we observe different trendsin efficiency on two datasets: The update time slightly dropson Indep but keeps steady on AntiCor . The efficiencies aredetermined by two factors, i.e., the number of skyline tuples and the frequency of skyline updates . When n becomes large,the number of skyline tuples increases but the frequency ofskyline updates decreases. On Indep , the benefits of lowerupdate frequencies outweigh the cost of more skyline tuples;on AntiCor , two factors cancel each other. FD-RMS runs -Kernel Greedy ∗ HS FD-RMS k − − t i m e ( m s ) k m rr (a) BB k t i m e ( m s ) k m rr (b) AQ k t i m e ( m s ) k m rr (c) CT k t i m e ( m s ) k m rr (d) Movie k − t i m e ( m s ) k m rr (e) Indep k − t i m e ( m s ) k m rr (f) AntiCorFig. 7: Update time and maximum regret ratios with varying k ( r = 10 for BB and Indep; r = 50 for other datasets) DMM-Greedy DMM-RRMS ε -Kernel GeoGreedy Greedy HS Sphere FD-RMS d − t i m e ( m s ) d m rr (a) Indep, varying d d − t i m e ( m s ) d m rr (b) AntiCor, varying d n( × ) − t i m e ( m s ) n( × ) m rr (c) Indep, varying n n( × ) t i m e ( m s ) n( × ) m rr (d) AntiCor, varying n Fig. 8: Scalability with varying the dimensionality d and dataset size n ( k = 1 , r = 50 ) slower when n increases due to higher cost of maintaining top- k results on Indep . But, on AntiCor , the update time keepssteady with n because of smaller values of ε and m , whichcancel the higher cost of maintaining top- k results. In addition,the maximum regret ratios are not significantly affected by n .And the solution quality of FD-RMS is always close to thebest of static algorithms with varying n . In general, FD-RMSalways outperforms the baselines for different values of n . Summary: The experimental results have shown the supe-riority of FD-RMS over existing k -RMS algorithms in afully dynamic setting. In the case of -RMS (a.k.a. r -regretquery), FD-RMS outperforms all static algorithms in termsof efficiency. Meanwhile, its solution quality is very close tothat of the best static algorithm. In addition, FD-RMS showsboth higher efficiency and better solution quality than all staticalgorithms for k -RMS when k > in most cases. Finally, FD-RMS demonstrates better scalability for higher dimensionalityand larger dataset sizes than all static algorithms.V. R ELATED W ORK There have been extensive studies on the k -regret minimiz-ing set ( k -RMS) problem (see [31] for a survey). Nanongkaiet al. [22] first introduced the notions of maximum regret ratio and r -regret query (i.e., maximum -regret ratio and -RMS in this paper). They proposed the C UBE algorithm toprovide an upper-bound guarantee for the maximum regretratio of the optimal solution of -RMS. They also proposedthe G REEDY heuristic for -RMS, which always picks a tuplethat can maximally reduces the maximum regret ratio at eachiteration. 
Peng and Wong [23] proposed the GEOGREEDY algorithm, which improves the efficiency of GREEDY by utilizing the geometric properties of 1-RMS. Asudeh et al. [4] proposed two discretized matrix min-max based algorithms, i.e., DMM-RRMS and DMM-GREEDY, for 1-RMS. Xie et al. [32] devised the SPHERE algorithm for 1-RMS based on the notion of ε-kernel [2]. The aforementioned algorithms cannot be used for k-RMS when k > 1. Chester et al. [11] first extended the notion of 1-RMS to k-RMS. They also proposed a randomized GREEDY∗ algorithm that extends the GREEDY heuristic to support k-RMS when k > 1. The min-size version of k-RMS, which returns the minimum subset whose maximum k-regret ratio is at most ε for a given ε ∈ (0, 1), was studied in [3], [19]. These works proposed two efficient algorithms for min-size k-RMS based on the notions of ε-kernel [2] and hitting set, respectively. However, all of the above algorithms are specific to the static setting. To the best of our knowledge, FD-RMS is the only fully-dynamic k-RMS algorithm that can incrementally maintain the result w.r.t. tuple insertions and deletions.

Different variations of the regret minimizing set problem have also been studied recently. The 1-RMS problem with nonlinear utility functions was studied in [13], [24], [27]; specifically, these works generalized the class of utility functions to convex functions [13], multiplicative functions [24], and submodular functions [27], respectively. Asudeh et al. [5] proposed the rank-regret representative (RRR) problem. The difference between RRR and RMS is that the regret in RRR is defined by rank while the regret in RMS is defined by score. Several studies [26], [28], [35] investigated the average regret minimization (ARM) problem: instead of minimizing the maximum regret ratio, ARM returns a subset of r tuples such that the average regret of all possible users is minimized. Shetiyam et al. [26] proposed a unified approach to RMS and ARM based on k-MEDOID clustering. The problem of interactive regret minimization, which enhances regret minimization with user interactions, was studied in [21], [30]. Xie et al. [33] proposed the α-happiness query, which is a variation of the min-size version of RMS. Since these variations have different formulations from the original k-RMS problem, the algorithms proposed for them cannot be directly applied to the k-RMS problem. Moreover, these algorithms are still designed for the static setting and do not consider any updates in the dataset.

VI. CONCLUSION

In this paper, we studied the problem of maintaining k-regret minimizing sets (k-RMS) on dynamic datasets with arbitrary insertions and deletions of tuples. We proposed the first fully-dynamic k-RMS algorithm, called FD-RMS, which is based on transforming fully-dynamic k-RMS into a dynamic set cover problem and can dynamically maintain the result of k-RMS with a theoretical guarantee. Extensive experiments on real-world and synthetic datasets confirmed the efficiency, effectiveness, and scalability of FD-RMS compared with existing static approaches to k-RMS. For future work, it would be interesting to investigate whether our techniques can be extended to k-RMS and related problems in higher dimensions or with nonlinear utility functions (e.g., [13], [24], [27]) in dynamic settings.

REFERENCES

[1] A. Abboud, R. Addanki, F. Grandoni, D. Panigrahi, and B. Saha, “Dynamic set cover: improved algorithms and lower bounds,” in STOC, 2019, pp. 114–125.
[2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan, “Approximating extent measures of points,” J. ACM, vol. 51, no. 4, pp. 606–635, 2004.
[3] P. K. Agarwal, N. Kumar, S. Sintos, and S. Suri, “Efficient algorithms for k-regret minimizing sets,” in SEA, 2017, pp. 7:1–7:23.
[4] A. Asudeh, A. Nazi, N. Zhang, and G. Das, “Efficient computation of regret-ratio minimizing set: A compact maxima representative,” in SIGMOD, 2017, pp. 821–834.
[5] A. Asudeh, A. Nazi, N. Zhang, G. Das, and H. V. Jagadish, “RRR: Rank-regret representative,” in SIGMOD, 2019, pp. 263–280.
[6] Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet, “Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces,” in RecSys, 2014, pp. 257–264.
[7] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
[8] S. Bhattacharya, M. Henzinger, and G. F. Italiano, “Deterministic fully dynamic data structures for vertex cover and matching,” SIAM J. Comput., vol. 47, no. 3, pp. 859–887, 2018.
[9] S. Börzsönyi, D. Kossmann, and K. Stocker, “The skyline operator,” in ICDE, 2001, pp. 421–430.
[10] W. Cao, J. Li, H. Wang, K. Wang, R. Wang, R. C. Wong, and W. Zhan, “K-regret minimizing set: Efficient algorithms and hardness,” in ICDT, 2017, pp. 11:1–11:19.
[11] S. Chester, A. Thomo, S. Venkatesh, and S. Whitesides, “Computing k-regret minimizing sets,” PVLDB, vol. 7, no. 5, pp. 389–400, 2014.
[12] R. R. Curtin, A. G. Gray, and P. Ram, “Fast exact max-kernel search,” in SDM, 2013, pp. 1–9.
[13] T. K. Faulkner, W. Brackenbury, and A. Lall, “K-regret queries with nonlinear utilities,” PVLDB, vol. 8, no. 13, pp. 2098–2109, 2015.
[14] U. Feige, “A threshold of ln n for approximating set cover,” J. ACM, vol. 45, no. 4, pp. 634–652, 1998.
[15] R. A. Finkel and J. L. Bentley, “Quad trees: A data structure for retrieval on composite keys,” Acta Inf., vol. 4, pp. 1–9, 1974.
[16] A. Gupta, R. Krishnaswamy, A. Kumar, and D. Panigrahi, “Online and dynamic algorithms for set cover,” in STOC, 2017, pp. 537–550.
[17] N. Hjuler, G. F. Italiano, N. Parotsidis, and D. Saulpic, “Dominating sets and connected dominating sets in dynamic graphs,” in STACS, 2019, pp. 35:1–35:17.
[18] R. M. Karp, “Reducibility among combinatorial problems,” in Complexity of Computer Computations, 1972, pp. 85–103.
[19] N. Kumar and S. Sintos, “Faster approximation algorithm for the k-regret minimizing set and related problems,” in ALENEX, 2018, pp. 62–74.
[20] N. N. Liu, X. Meng, C. Liu, and Q. Yang, “Wisdom of the better few: cold start recommendation via representative based rating elicitation,” in RecSys, 2011, pp. 37–44.
[21] D. Nanongkai, A. Lall, A. D. Sarma, and K. Makino, “Interactive regret minimization,” in SIGMOD, 2012, pp. 109–120.
[22] D. Nanongkai, A. D. Sarma, A. Lall, R. J. Lipton, and J. Xu, “Regret-minimizing representative databases,” PVLDB, vol. 3, no. 1, pp. 1114–1124, 2010.
[23] P. Peng and R. C. Wong, “Geometry approach for k-regret query,” in ICDE, 2014, pp. 772–783.
[24] J. Qi, F. Zuo, H. Samet, and J. C. Yao, “K-regret queries using multiplicative utility functions,” ACM Trans. Database Syst., vol. 43, no. 2, pp. 10:1–10:41, 2018.
[25] P. Ram and A. G. Gray, “Maximum inner-product search using cone trees,” in KDD, 2012, pp. 931–939.
[26] S. Shetiyam, A. Asudeh, S. Ahmed, and G. Das, “A unified optimization algorithm for solving ‘regret-minimizing representative’ problems,” PVLDB, vol. 13, no. 3, pp. 239–251, 2019.
[27] T. Soma and Y. Yoshida, “Regret ratio minimization in multi-objective submodular function maximization,” in AAAI, 2017, pp. 905–911.
[28] S. Storandt and S. Funke, “Algorithms for average regret minimization,” in AAAI, 2019, pp. 1600–1607.
[29] J. Stoyanovich, K. Yang, and H. V. Jagadish, “Online set selection with fairness and diversity constraints,” in EDBT, 2018, pp. 241–252.
[30] M. Xie, R. C. Wong, and A. Lall, “Strongly truthful interactive regret minimization,” in SIGMOD, 2019, pp. 281–298.
[31] ——, “An experimental survey of regret minimization query and variants: bridging the best worlds between top-k query and skyline query,” VLDB J., vol. 29, no. 1, pp. 147–175, 2020.
[32] M. Xie, R. C. Wong, J. Li, C. Long, and A. Lall, “Efficient k-regret query algorithm with restriction-free bound for any dimensionality,” in SIGMOD, 2018, pp. 959–974.
[33] M. Xie, R. C. Wong, P. Peng, and V. J. Tsotras, “Being happy with the least: Achieving α-happiness with minimum number of tuples,” in ICDE, 2020, pp. 1009–1020.
[34] A. Yu, P. K. Agarwal, and J. Yang, “Processing a large number of continuous preference top-k queries,” in SIGMOD, 2012, pp. 397–408.
[35] S. Zeighami and R. C. Wong, “Finding average regret ratio minimizing set in database,” in ICDE, 2019, pp. 1722–1725.

APPENDIX: FREQUENTLY USED NOTATIONS

A list of frequently used notations in this paper is summarized in Table II.

TABLE II: Frequently used notations

P_t : the database at time t (t ≥ 0)
p : a tuple in database P_t
d : the dimensionality of P_t
n_t : the number of tuples in P_t
∆ : a sequence ⟨∆_1, ∆_2, ...⟩ of operations for database update
∆_t : an operation ⟨p, +⟩ or ⟨p, −⟩ at time t that updates the database from P_{t−1} to P_t by adding/deleting tuple p
U : the space of all nonnegative utility vectors
u : a nonnegative utility vector in U
ϕ(u, P_t) : the top-ranked tuple in P_t w.r.t. u
ω(u, P_t) : the score of ϕ(u, P_t) w.r.t. u
ϕ_j(u, P_t) : the j-th ranked tuple in P_t w.r.t. u
ω_j(u, P_t) : the score of ϕ_j(u, P_t) w.r.t. u
Φ_k(u, P_t) : the set of top-k results of P_t w.r.t. u, i.e., Φ_k(u, P_t) = {ϕ_j(u, P_t) : 1 ≤ j ≤ k}
Φ_{k,ε}(u, P_t) : the set of ε-approximate top-k results of P_t w.r.t. u, i.e., Φ_{k,ε}(u, P_t) = {p ∈ P_t : ⟨u, p⟩ ≥ (1 − ε) · ω_k(u, P_t)}
rr_k(u, Q) : the k-regret ratio of a subset Q ⊆ P_t w.r.t. u
mrr_k(Q) : the maximum k-regret ratio of a subset Q ⊆ P_t
RMS(k, r) : a k-RMS problem with size constraint r ∈ Z^+ and r ≥ d
Q*_t : the optimal result of RMS(k, r) on P_t
ε*_{k,r} : the (optimal) maximum regret ratio of Q*_t over P_t
Q_t : the result of RMS(k, r) on P_t returned by FD-RMS
Σ = (U, S) : a set system with universe U and collection S of sets; when Σ is built on the approximate top-k results of m utility vectors on P_t, we have m = |U| and n_t = |S|
σ : an operation ⟨u, S, ±⟩ or ⟨u, U, ±⟩ that updates the set system Σ by adding/removing u to/from a set S ∈ S or to/from the universe U
S(p) : the set in S containing the utility vectors in U for which p is an approximate top-k result on P_t
C* : the optimal solution for set cover on Σ
C : an approximate solution for set cover on Σ
cov(S) : the cover set of S in C
ε : an input parameter specifying the approximation factor of top-k queries in FD-RMS
M : an input parameter providing an upper bound on the number m of utility vectors used in FD-RMS
u(∆_t) : the number of utility vectors whose approximate top-k results are changed by ∆_t
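Using the set-system notation above, the following is a minimal static sketch of the underlying reduction, not the authors' dynamic algorithm: each tuple p induces a set S(p) of sampled utility vectors for which p is an ε-approximate top-k result, and an approximate set cover of the universe by these sets yields a k-RMS result. The utility sampling, the function names, and the plain greedy cover are illustrative assumptions; FD-RMS instead maintains an approximate set cover under insertions and deletions with a provable guarantee.

    import numpy as np

    def build_set_system(P, utils, k, eps):
        # Build Σ = (U, S): S(p) collects the indices of the sampled utility
        # vectors u for which tuple p is an ε-approximate top-k result, i.e.,
        # ⟨u, p⟩ ≥ (1 − ε) · ω_k(u, P). P is (n, d); utils is (m, d).
        scores = utils @ P.T                  # (m, n) score matrix
        kth = np.sort(scores, axis=1)[:, -k]  # ω_k(u, P) for each utility u
        return [set(np.nonzero(scores[:, j] >= (1.0 - eps) * kth)[0].tolist())
                for j in range(P.shape[0])]

    def greedy_set_cover(m, S):
        # Classic greedy set cover over the universe U = {0, ..., m-1}:
        # always pick the set covering the most still-uncovered utility
        # vectors. A cover always exists here, since each u is covered by
        # the set of its own top-1 tuple.
        uncovered, chosen = set(range(m)), []
        while uncovered:
            j = max(range(len(S)), key=lambda i: len(S[i] & uncovered))
            chosen.append(j)
            uncovered -= S[j]
        return chosen  # tuple indices forming the (approximate) k-RMS result

For example, greedy_set_cover(len(utils), build_set_system(P, utils, k, eps)) returns a candidate result whose size depends on ε; as the experiments in Section IV suggest, FD-RMS adjusts ε and the number m of utility vectors as the dataset changes rather than fixing them in advance.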