Dynamic Skyline Queries on Encrypted Data Using Result Materialization

Abstract

Skyline computation is an increasingly popular query, with broad applicability in domains such as healthcare, travel and finance. Given the recent trend to outsource databases and query evaluation, and due to the proprietary and sometimes highly sensitive nature of the data (e.g., in healthcare), it is essential to evaluate skylines on encrypted datasets. Several research efforts acknowledged the importance of secure skyline computation, but existing solutions suffer from at least one of the following shortcomings: (i) they only provide ad-hoc security; (ii) they are prohibitively expensive; or (iii) they rely on unrealistic assumptions, such as the presence of multiple non-colluding parties in the protocol. Inspired by solutions for secure nearest-neighbors (NN) computation, we conjecture that the most secure and efficient way to compute skylines is through result materialization. However, this approach is significantly more challenging for skylines than for NN queries. We exhaustively study and provide algorithms for pre-computation of skyline results, and we perform an in-depth theoretical analysis of this process. We show that pre-computing results while minimizing storage overhead is NP-hard, and we provide dynamic programming and greedy heuristics that solve the problem more efficiently, while maintaining storage at reasonable levels. Our algorithms are novel and applicable to plain-text skyline computation, but we focus on the encrypted setting where materialization reduces the cost of skyline computation from hours to seconds. Extensive experiments show that we clearly outperform existing work in terms of performance, and our security analysis proves that we obtain a smaller (and quantifiable) data leakage than competitors.


arXiv [cs.DB], Feb

Secure Dynamic Skyline Queries Using Result Materialization

Sepanta Zeighami
University of Southern California
[email protected]

Gabriel Ghinita
University of Massachusetts Boston
[email protected]

Cyrus Shahabi
University of Southern California
[email protected]


1. INTRODUCTION

The skyline query finds points in a dataset which are not dominated by any other data point in at least one attribute value. These points have the property of "standing out" among other data points. For instance, in airfare booking, the skyline may contain routes that are either cheapest, shortest, or have the fewest stopovers. In a hospital database, the skyline may contain patients with lowest age, or patients with minimum value for a certain test result (e.g., hemoglobin level). Many research efforts in the past decade focused on efficient computation of skylines over plaintext data [1, 16, 4]. However, few solutions exist for the problem of secure skyline, where the data and query execution are outsourced to a service provider (SP). Since the data may be proprietary or protected by law (e.g., healthcare records), the computation must be executed over encrypted datasets.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. XX, No. xxx. ISSN 2150-8097. DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx

The work in [2] was the first to formulate the secure skyline problem, but the solution proposed only provided ad-hoc security. Later, in [24, 22, 14], several solutions were proposed that used either homomorphic encryption or secure multi-party computation. However, they either leak excessive information to the SP, or they incur prohibitive computation and communication costs. The state-of-the-art approach in [22] assumes a system architecture with two non-colluding parties that engage in a secure multi-party protocol that needs to scan the entire dataset for each query, and perform expensive operations for a large subset of the Cartesian product of all records. This results in a response time of around 3 hours for a single query, which is clearly impractical. Each query starts anew, and cannot use any information computed from the previous query.

We propose a different approach, which has been shown in the seminal work of [30] to be the only secure and efficient approach for computing nearest-neighbor (NN) queries on encrypted data. The main idea of [30] is to pre-compute query results using a Voronoi diagram, and then partition and materialize the results in a data structure whose properties are not dependent on the data characteristics (to minimize leakage).
At query time, the user provides an encrypted representation of the query point, and then the SP and the user engage in an interactive protocol that allows the user to retrieve the partition that contains the results to the query (together with a number of possible additional results, i.e., false positives). It is shown in [30] that any method that uses some sort of encrypted processing directly on the data points leaks a significant amount of information, in the form of either inter-point distances, or distance order.

Inspired by [30], we extend the idea of pre-computation and partitioning to the skyline query. This leads to a more secure solution, and a much faster response time, as we do not process the entire dataset for each query. However, pre-computing skyline results in the dynamic case (i.e., for every possible query point) is a very challenging problem. In fact, the equivalent of this problem on plaintext has attracted limited attention, because doing so would be too expensive. Contrast this to the case of NN queries, where Voronoi diagrams have been extensively used even for plaintext queries. Our work is pioneering in that it provides a thorough analysis of skyline result pre-computation, which may find applications beyond encrypted data. Although expensive in comparison to other plaintext skyline computation counterparts, we believe that the pre-computation approach becomes valuable, and in fact the only viable approach, in the context of encrypted data. This is because techniques that do not perform materialization require hours of processing for a single query. Our approach can answer a query in less than a second, even though there is a one-time setup cost.

Footnote: The Voronoi diagram materialization used in SNN was first introduced by the authors of this submission in [10] for private NN queries on public datasets.

In a nutshell, our materialization approach reduces the skyline query to a simple index look-up.
First, we perform a partitioning of the space into non-overlapping regions called skyline tiles, such that the answers to all skyline queries that fall within the same tile are identical. This is done by finding, for each data point, the space of queries for which the data point is in the skyline (see Figs. 1 and 2). Then, skyline tiles are created by intersecting these regions (see Fig. 3). We store the query answer in each tile, and we index the tiles in a data structure. This way, we can answer a dynamic skyline query by simply performing a lookup in the index. The index is then encrypted to ensure the security of our approach. Finally, note that reducing skyline query computation to an index look-up allows us to utilize significantly less expensive cryptographic primitives, which manifests itself in orders of magnitude improvement in query time compared with existing approaches.

Our specific contributions are:

1. We comprehensively study the problem of pre-computing skyline results, and we perform an in-depth theoretical analysis of this process.
2. We show that the problem of pre-computing results while minimizing storage overhead is NP-hard, and we provide dynamic programming and greedy heuristics that solve the problem more efficiently, while maintaining storage costs at reasonable levels.
3. We perform an extensive experimental evaluation showing that our techniques clearly outperform existing work in terms of performance.
4. We provide an in-depth security analysis to measure the leakage of our proposed approach, and conclude that the amount of leakage is quantifiable, and smaller compared to existing techniques.

The rest of the paper is organized as follows: we provide background information in Section 2. Section 3 introduces a construction to pre-compute skyline query results for data units called tiles. We show how to aggregate tiles and perform data partitioning to reduce storage overhead in Section 4.
We generalize the tile concept in Section 5 to obtain further performance gains, and outline the complete, end-to-end solution to the secure skyline query in Section 6. Section 7 presents our security analysis, followed by experiments in Section 8. We review related work in Section 9 and conclude in Section 10.

2. BACKGROUND AND DEFINITIONS

2.1 Preliminaries

Consider database $D$ with $n$ points in the $d$-dimensional space $\mathbb{R}^d$. For point $p \in D$, $p[i]$ denotes its value in the $i$-th dimension. A query is denoted by $q \in \mathbb{R}^d$. For ease of discussion, let $\infty$ be a large constant and assume that the query space is bounded by $\infty$ in every dimension (i.e., $q[i] < \infty$ for all $i$). The domination relationship between two points in $D$ with respect to a query $q$ is defined as follows:

Definition 1 (Domination). A point $p \in D$ dominates another point $p' \in D$ with respect to $q$, denoted by $p \succ_q p'$, if and only if $\forall i, 1 \le i \le d$, $|p[i] - q[i]| \le |p'[i] - q[i]|$ and $\exists i, 1 \le i \le d$, $|p[i] - q[i]| < |p'[i] - q[i]|$. We use $p \nsucc_q p'$ if $p$ does not dominate $p'$ with respect to $q$.

Intuitively, $p \succ_q p'$ implies that $p$ is at least as close to $q$ as $p'$ in all dimensions, and it is closer to $q$ in at least one.

Definition 2 (Dynamic Skyline Query). Given a database $D$ and a query point $q$, return a set $S \subseteq D$, such that no point in $S$ is dominated by a point in $D$ with respect to $q$, that is, $S = \{p \in D \mid \forall p' \in D, p' \nsucc_q p\}$.

Note that the conventional skyline query (where $q$ is the domain origin) can be defined as a special case of dynamic skyline. Our focus is on the more challenging dynamic skyline setting, so whenever we mention skyline query, we refer to the dynamic case (given query $q$). We use Fig. 1 (a) as a running example: the skyline query answers for queries q and q are { p , p , p } , and { p , p } , respectively.

We assume three types of participants: the data owner (DO), the service provider (SP), and users. Users are trusted by the DO, and wish to obtain the result to dynamic skyline queries on the dataset owned by DO. The DO does not have the infrastructure to run such a service, so it outsources the functionality to SP (e.g., a commercial entity), which is honest but curious. The SP runs the protocol correctly, but may try to infer private details about the data.
In addition, the SP may be compromised by an attacker, in which case the data may be exposed, with serious consequences (e.g., leakage of healthcare records). To address such threats, the DO first encrypts the dataset, and only shares the encrypted version with the SP. At runtime, the DO may be offline, and only the SP and the user engage in a protocol to determine the encrypted result to the user's skyline query $q$.

Users are trusted with some secret tokens (e.g., encryption keys), and are assumed not to collude with the SP (in practice, the users may be highly vetted individuals, e.g., medical doctors). The user retrieves a superset of the actual query result in encrypted form (i.e., including false positives), and performs a local lightweight filtering step to narrow down the exact result. Our solution guarantees that the user always obtains the exact result, and we also provide an upper bound on the total number of false positives, in order to minimize the communication and computational cost on the user.

To support this protocol, the DO must prepare and encrypt the dataset, which may incur a significant overhead. However, this is a one-time setup cost. It helps significantly reduce the query processing overhead at runtime, which is the most important component of the cost, since that is the response time perceived by the user. We also assume that when the user registers for the service, there is a one-time setup cost on the user device. This may include transferring a relatively small amount of metadata needed to run the protocol with the SP (our evaluation shows that the user download size is on the order of tens of MB, which is a reasonable amount even on a mobile connection).

Table 1: Summary of Notations

Symbol                      Definition
$D$, $n$, $d$               $d$-dimensional database $D$ of cardinality $n$
$p \succ_q p'$              $p$ dominates $p'$ with respect to query $q$
$D^p_{p'}$                  Domination region of $p'$ with respect to $p$
$S_p$                       Skyline region of $p$
$m_{p,p'}$                  Mid-point between $p$ and $p'$
$cnt(T)$, $spc(T)$          For $T = (S, P)$, $cnt(T) = P$, $spc(T) = S$
$T_D$                       Set of skyline tiles for database $D$
$L_i$                       Boundaries of skyline tiles in dim. $i$
$N_i$, $N$                  $N_i = |L_i|$, $N = \max_i \{N_i\}$
$I$                         Set of skyline indices
$l$                         Number of skyline regions to intersect
$k$                         Max. number of false positives allowed
$m$                         Pre-partitioning parameter
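As an illustration of Definitions 1 and 2, the following is a minimal plaintext sketch of the domination test and of a brute-force dynamic skyline. It is not the materialization-based method developed in this paper; the function names `dominates` and `dynamic_skyline` and the sample points are assumptions made only for this example.

```python
from typing import List, Sequence

Point = Sequence[float]


def dominates(p: Point, p_prime: Point, q: Point) -> bool:
    """True if p dominates p_prime with respect to q (Definition 1)."""
    dist_p = [abs(a - c) for a, c in zip(p, q)]
    dist_pp = [abs(a - c) for a, c in zip(p_prime, q)]
    # p must be at least as close to q in every dimension ...
    no_worse = all(x <= y for x, y in zip(dist_p, dist_pp))
    # ... and strictly closer in at least one dimension.
    strictly_better = any(x < y for x, y in zip(dist_p, dist_pp))
    return no_worse and strictly_better


def dynamic_skyline(D: List[Point], q: Point) -> List[Point]:
    """Brute-force O(n^2 d) evaluation of Definition 2 (hypothetical helper)."""
    return [p for p in D if not any(dominates(o, p, q) for o in D)]


if __name__ == "__main__":
    D = [(1, 1), (2, 2), (4, 0), (0, 5)]  # made-up sample data
    print(dynamic_skyline(D, (0, 0)))
    print(dynamic_skyline(D, (2, 2)))
```

With the query at the origin this reduces to the conventional skyline; moving the query point changes which points are dominated, which is what makes the dynamic case harder to pre-compute.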

3. SKYLINE RESULT MATERIALIZATION

In this section, we introduce some preliminary concepts that are built upon later in Sections 4 and 5 to obtain efficient algorithms for building and storing an index on the plaintext data. Section 6 presents a complete, end-to-end processing algorithm on plaintext data. Finally, in Section 7 we show how the index is encrypted using a special transformation before being sent to the SP, and how traversal is performed on the encrypted structure.

Consider a point $p \in D$. Recall that for a query $q$, $p$ is in the skyline if $p$ is not dominated by any other point in $D$ with respect to $q$. We denote by $S_p$ the skyline region of $p$, i.e., the set of all query points for which $p$ is a skyline point: $S_p = \{x \in \mathbb{R}^d \mid \forall p' \in D, p' \nsucc_x p\}$. Due to the properties of the domination relation, $S_p$ is a polytope of a specific shape and can be constructed easily. We introduce several auxiliary concepts needed to define skyline regions.

Domination Region of a point. First, consider two points $p$ and $p'$. Recall that $p' \succ_q p$ if and only if

$$\forall i, 1 \le i \le d: \quad |p'[i] - q[i]| \le |p[i] - q[i]| \tag{1}$$

and

$$\exists i, 1 \le i \le d: \quad |p'[i] - q[i]| < |p[i] - q[i]| \tag{2}$$

Rephrasing the definition of domination, observe that $p'$ dominates $p$ for all the query points $q$ in $D^p_{p'} = \{q \in \mathbb{R}^d \mid q \text{ satisfies (1) and (2)}\}$. We refer to $D^p_{p'}$ as the domination region of $p'$ with respect to $p$. For convenience, define $D^p_p = \emptyset$. The domination regions of all the points with respect to p are shown in Fig. 1 (b)-(e).

Note that $D^p_{p'}$ is the solution to inequalities (1) and (2). We first focus on the solutions to Eq. (1). Observe that

$$|p'[i] - q[i]| \le |p[i] - q[i]| \iff \begin{cases} q[i] \ge \frac{p'[i] + p[i]}{2} & p'[i] > p[i] \\ q[i] \le \frac{p'[i] + p[i]}{2} & p'[i] < p[i] \\ \text{true} & \text{otherwise} \end{cases} \tag{3}$$

Given $p$ and $p'$ and for each $i$, (3) is an inequality of the form $q[i] \le c$ or $q[i] \ge c$, for some constant $c$ depending on $p[i]$ and $p'[i]$. That is, the solution to the inequality for each $i$ is of the form $(-\infty, c]$ or $[c, \infty)$. Let $m_{p,p'}$ be the mid-point between $p$ and $p'$, i.e., $m_{p,p'}[i] = \frac{p'[i] + p[i]}{2}$. Based on Eq. (3), a $q \in \mathbb{R}^d$ satisfies Eq. (1) if $\forall i, 1 \le i \le d$,

$$q[i] \in \begin{cases} [m_{p,p'}[i], \infty) & p'[i] > p[i] \\ (-\infty, m_{p,p'}[i]] & p'[i] < p[i] \\ (-\infty, \infty) & p'[i] = p[i] \end{cases} \tag{4}$$

Let $Z^p_{p'}$ be the set of $q$ that satisfies Eq. (4).
Observe that $Z^p_{p'}$ is a hyper-rectangle with its axes parallel to the coordinate axes, starting at $m_{p,p'}$ and going to infinity or negative infinity in the direction of $p'$.

Finally, note that $D^p_{p'}$ is a subset of $Z^p_{p'}$ that also satisfies Eq. (2). The interior of $Z^p_{p'}$ always satisfies (2), but some points on the boundary of $Z^p_{p'}$ may not. E.g., $m_{p,p'}$ does not satisfy (2) by definition. Hence, $D^p_{p'}$ is the set $Z^p_{p'}$ except some of its boundaries. Since $Z^p_{p'}$ is defined by a set of hyper-planes, to define $D^p_{p'}$ we also use the set of hyper-planes, but in addition to its coordinates we store the possible exceptions in the boundaries. We formalize our notation of hyper-planes later in this section. Fig. 1(b)-(e) shows the domination regions of all points with respect to p.

Skyline region of a point. Recall that $S_p$ is the space where $p$ is not dominated by any other point. Using the terminology above, $S_p$ is the entire space except $\cup_{p' \in D} D^p_{p'}$, that is, $S_p = \mathbb{R}^d \setminus (\cup_{p' \in D} D^p_{p'})$. This is because $\cup_{p' \in D} D^p_{p'}$ is the region where $p$ is dominated by some point in $D$. $S_p$ can be defined as a subset of $\mathbb{R}^d$ except the union of hyper-rectangles with their axes parallel to the coordinate axes. Fig. 2(d) shows how we can find the skyline region of point p. We first find the domination regions of all points with respect to p. Then we take their union, and the skyline region of p is the entire space except the union of the domination regions. As Fig. 2(d) shows, not all points in $D$ contribute to the skyline region of p.

Skyline hyper-planes. The concepts defined so far consider subspaces of the query space defined by axis-parallel hyper-planes. In the rest of the paper, we refer by skyline hyper-plane (or simply hyper-plane) to the following data structure: for a hyper-plane $H$, the structure contains for each dimension $i$ two values $min[i]$ and $max[i]$, representing the smallest and largest value on the hyper-plane in the $i$-th dimension.
For any hyper-plane, there exists a dimension $i$ such that $min[i] = max[i]$. We say that a hyper-plane is in the $i$-th dimension if all the points on the hyper-plane have exactly the same value in that dimension. Furthermore, if the hyper-plane corresponds to the skyline region $S_p$, it stores the point $p$, and $p$ is referred to as the hyper-plane's generator. In general, $p$ is not a skyline point on the hyper-plane. However, as discussed before, a hyper-plane stores a set of exceptions, corresponding to coordinates (if any) on the hyper-plane for which $p$ is actually a skyline point. Furthermore, we say that a hyper-plane is bounded by a set of hyper-planes $\mathcal{H}$ if every point on its boundary (defined by $min$ and $max$) also belongs to another hyper-plane in $\mathcal{H}$.

Figure 1: (a) A database and 4 sample queries. (b)-(e) Domination regions of all the points w.r.t. p.

Figure 2: Skyline regions of all points.

Border points. Not all the points in $D$ contribute to $\cup_{p' \in D} D^p_{p'}$. That is, for two points $p', p'' \in D$, $D^p_{p'}$ may be a subset of $D^p_{p''}$, in which case $D^p_{p'}$ does not impact $S_p$. We refer to all the points $p'$ such that $\forall p'' \in D$, $D^p_{p'} \not\subset D^p_{p''}$ as border points of $p$. Fig. 2 (a)-(e) shows the skyline regions of all points in the database. In each figure, the points in red are the border points and the points in blue are non-border points. The importance of border points is that they define how complex the skyline region of a point is. That is, the more border points there are, the more edges the skyline region has. The following property helps us understand how many border points any point has.

Property 1. $p$ divides the space into $2^d$ quadrants. Let $X_i$ contain the points in $D \setminus \{p\}$ that are in the $i$-th quadrant. $p'$ is a border point of $p$ if, for some $i$, $p'$ is a skyline point with respect to a query at $p$ for the database $X_i$.

Observe that for a query $q$ we can tell that $p$ is in the skyline iff $q \in S_p$. The answer $S$ to the skyline query $q$ is $S = \{p \in D \mid q \in S_p\}$. We need to find an efficient way to check whether $q$ is in $S_p$ for all points $p$ in $D$. To that end, we intersect $S_p$ for all $p$ with each other.

First, define a partitioning of a space $Q$ as a set $\Pi$, such that for each $\pi \in \Pi$, $\pi \subseteq Q$, $\cup_{\pi \in \Pi} \pi = Q$ and $\pi_i \cap \pi_j = \emptyset$ for any $\pi_i, \pi_j \in \Pi$, $\pi_i \ne \pi_j$. Let $H_p$ be the set of hyper-planes defining the skyline region $S_p$ for a given $p$. Now consider the set $H = \cup_{p \in D} H_p$. Observe that $H_p$ is a set of intersecting hyper-planes where each hyper-plane is bounded by other hyper-planes. Consider the set of all polytopes created by the intersection of hyper-planes in $H_p$. Each polytope creates a partition, and their union $\Pi_D$ is a partitioning of the space. For the exceptions stored in each hyper-plane, we also consider them to be a partition in the partitioning $\Pi_D$. (For ease of discussion, in the remainder of this paper, we do not explicitly mention the partitions created by the exceptions, as their shape is different from the polytope partitions, but the discussion either directly holds for both or the extension to the exceptions is straightforward.)
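The region-based view above, i.e., Eqs. (1)-(4) together with $S = \{p \in D \mid q \in S_p\}$, can be sketched in a few lines of Python. This is only an illustrative plaintext sketch: the helper names are hypothetical, the sample points are made up, and boundary handling follows Eq. (2) directly (a query lying exactly on the mid-point in every differing dimension is not in the domination region).

```python
from typing import List, Sequence

Point = Sequence[float]


def in_domination_region(q: Point, p: Point, p_prime: Point) -> bool:
    """True if q lies in the domination region of p_prime w.r.t. p."""
    if tuple(p) == tuple(p_prime):
        return False  # by convention, the region of p w.r.t. itself is empty
    strict = False
    for qi, pi, ppi in zip(q, p, p_prime):
        m = (pi + ppi) / 2  # mid-point coordinate, as in Eq. (3)
        if ppi > pi:
            if qi < m:
                return False  # violates Eq. (4): q[i] must be >= m
            strict = strict or qi > m
        elif ppi < pi:
            if qi > m:
                return False  # violates Eq. (4): q[i] must be <= m
            strict = strict or qi < m
        # p_prime[i] == p[i]: no constraint, and never strictly closer
    return strict  # Eq. (2): strictly closer in at least one dimension


def in_skyline_region(q: Point, p: Point, D: List[Point]) -> bool:
    """q is in S_p iff q is in no domination region w.r.t. p."""
    return not any(in_domination_region(q, p, o) for o in D)


def skyline_via_regions(D: List[Point], q: Point) -> List[Point]:
    """Answer a query as S = {p in D | q in S_p}."""
    return [p for p in D if in_skyline_region(q, p, D)]


if __name__ == "__main__":
    D = [(1, 1), (2, 2), (4, 0), (0, 5)]  # made-up sample data
    print(skyline_via_regions(D, (0, 0)))
```

The point of the sketch is that membership in a domination region only requires comparing $q[i]$ against mid-point coordinates, which is why the regions are axis-parallel and can be stored as hyper-planes.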
We use this partitioning to define skyline tiles. Formally, a tile is defined as follows.

Definition 3 (Tiles). A tile, $T$, is defined as a tuple $T = (S, P)$, where $S$ is a subspace of $\mathbb{R}^d$ defined by a set of hyper-planes and $P$ is a subset of $D$. $S$ satisfies the following conditions. Firstly, each hyper-plane of $S$ is parallel to one of the axes. Secondly, each hyper-plane is bounded by another hyper-plane on all sides.

For a tile $T = (S, P)$, $S$ is called the location of $T$ and $P$ is called the content of $T$, also written as $cnt(T) = P$. Moreover, define the space of $T$, written as $spc(T)$, as the subset of $\mathbb{R}^d$ that is inside $S$. We say a query falls inside $T$ if $q \in spc(T)$. Finally, a tiling of the space is a set of tiles, $X$, such that $\cup_{T \in X} spc(T) = \mathbb{R}^d$ and $spc(T) \cap spc(T') = \emptyset$ for all distinct $T, T' \in X$.

For each partition $\pi_i \in \Pi_D$, a tile $\tau_i$ is defined as follows. Let $P_{\tau_i} = \{p \in D \mid \pi_i \cap S_p \ne \emptyset\}$ (i.e., the set of points whose skyline region intersects $\pi_i$). Let tile $\tau_i = (\pi_i, P_{\tau_i})$. We call $\tau_i$ a skyline tile and the set $T_D = \{\tau_i, \forall i\}$ the set of skyline tiles of $D$. Skyline tiles have the following properties.

Property 2.

For a query $q$ that falls inside a skyline tile $T$, the answer to the skyline query at $q$ is $cnt(T)$.

Property 3.

Consider two skyline tiles $T_1$ and $T_2$ such that both have a hyper-plane, $H$, as one of their edges. Assume hyper-plane $H$ corresponds to some point $p$. Then, either $p \in cnt(T_1)$ or $p \in cnt(T_2)$, but not both. Furthermore, $cnt(T_1) \setminus \{p\} = cnt(T_2) \setminus \{p\}$. In other words, the contents of $T_1$ and $T_2$ differ in exactly one point, $p$.

Fig. 3 shows the skyline tiles created from intersecting all the skyline regions in Fig. 2. In Fig. 3, observe that t is a tile defined by four hyper-planes (or lines, since $d = 2$), and its content is the set of points { p , p , p }. Now consider queries q and q in Fig. 3. Both queries fall in the tile t and therefore their answer is p , p and p . The query q is in tile t and its answer is p ; whereas the answer to q is { p , p }. This shows how an answer to a query can easily be retrieved by finding which tile the query falls into.

Border Locations.

Recall that skyline tiles are created based on a set of hyper-planes, $H$. Consider the set $H_i$ of all the hyper-planes in $H$ that are in the $i$-th dimension. Let the set $L_i = \{x \mid h \in H_i, x = h.min[i] = h.max[i]\}$ (note that $h.min[i] = h.max[i]$ holds because $h$ is a hyper-plane in the $i$-th dimension). Observe that the set $L_i$ contains the $i$-th dimension boundary of all the skyline tiles. Define $N_i = |L_i|$ for all $i$ and let $N = \max_i N_i$. Observe that there is a relationship between the number of border points and the number of border locations. More specifically, every border point creates at most one hyper-plane in every dimension. Therefore, for every border point, we have a hyper-plane in $H_i$ that corresponds to a location in $L_i$. Thus, we can study the value of $N$ by analyzing the number of border points. This analysis is used in studying the performance of our algorithms.

Figure 3: Skyline tiles.

Figure 4: A solution to TAP.

Figure 5: A solution to APP.

Two important properties of skyline tiles that are utilized in our analysis are the value of $N$ and the total number of tiles. They determine the space and time complexity of the algorithms discussed in the rest of the paper.

Number of tiles.

Tiles are the intersection of $n$ different skyline regions, and each region contains at most n hyper-planes in each dimension. For a hyper-plane in the first dimension, consider the maximum number of tiles it can be part of. In any other dimension, there can be at most 2 d × n hyper-planes intersecting it. Thus, it can be a part of at most 2 d ( d − n d − tiles. There are at most n hyper-planes in the first dimension, and every tile must have one of those as its edge. Thus, there are at most O (2 d n d +1 ) tiles in total.

Value of $N$. Let $B_p$ be the set of border points for $p$. According to Property 1, $|B_p|$ is the size of a skyline query at $p$. Let $B_{avg} = \frac{\sum_{p \in D} |B_p|}{n}$. Observe that $N$ is at most $\sum_{p \in D} |B_p|$. Thus, we can write $N = n B_{avg}$. If data points are uniformly distributed, there will be $O\left(\frac{n (2 \log n)^d}{d!}\right)$ skyline points [3] ($\frac{(\log n)^d}{d!}$ is the expected number of skyline points for a uniform distribution, there are $n$ data points, and we need to consider skyline points for each of the $2^d$ quadrants created by $p$). Therefore, in such a scenario, $B_{avg} = O\left(\frac{n (2 \log n)^d}{d!}\right)$ in expectation.

Challenges. The number of tiles created by this approach is exponential in data size and makes the problem intractable. This occurs because skyline regions may impact parts of the space far from their generating point. Working directly with skyline tiles may be inefficient. We present two different methods in Sections 4 and 5 to deal with this issue.

Figure 6: Generalized tiling with $l = 3$.
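To make the "query as index look-up" idea of this section concrete, the sketch below materializes answers on a plain grid and answers queries by locating the grid cell. It is a deliberate simplification and not the tile construction of this paper: as an assumption, it uses the mid-points of all pairs of points as candidate boundaries in each dimension (a superset of the border locations $L_i$), so it stores many more cells than there are skyline tiles. Within any open grid cell, no coordinate comparison in Definition 1 can change, so the answer is constant there. Queries lying exactly on a boundary (the "exceptions" above) are not handled. All function names are hypothetical.

```python
from bisect import bisect_left
from itertools import combinations, product


def dynamic_skyline(D, q):
    """Brute-force dynamic skyline (Definition 2), used to fill the cells."""
    def dist(p):
        return [abs(a - b) for a, b in zip(p, q)]

    def dominates(a, b):
        da, db = dist(a), dist(b)
        return (all(x <= y for x, y in zip(da, db))
                and any(x < y for x, y in zip(da, db)))

    return [p for p in D if not any(dominates(o, p) for o in D)]


def candidate_boundaries(D, dim):
    """Sorted mid-points of all pairs in one dimension (superset of L_i)."""
    return sorted({(a[dim] + b[dim]) / 2
                   for a, b in combinations(D, 2) if a[dim] != b[dim]})


def cell_of(q, bounds):
    """Grid cell index of q; assumes q is not exactly on a boundary."""
    return tuple(bisect_left(b, x) for x, b in zip(q, bounds))


def materialize(D):
    """Pre-compute the answer for one representative query per grid cell."""
    bounds = [candidate_boundaries(D, i) for i in range(len(D[0]))]
    reps = []
    for b in bounds:
        if b:
            inner = [(x + y) / 2 for x, y in zip(b, b[1:])]
            reps.append([b[0] - 1] + inner + [b[-1] + 1])
        else:
            reps.append([0.0])  # single cell covering the whole dimension
    table = {}
    for cell in product(*(range(len(r)) for r in reps)):
        rep_q = tuple(r[j] for r, j in zip(reps, cell))
        table[cell] = dynamic_skyline(D, rep_q)
    return bounds, table


if __name__ == "__main__":
    D = [(1, 1), (2, 2), (4, 0), (0, 5)]  # made-up sample data
    bounds, table = materialize(D)
    print(len(table), table[cell_of((1.2, 2.7), bounds)])
```

The number of grid cells grows as the product of the per-dimension boundary counts, which mirrors the storage blow-up discussed under Challenges and motivates the aggregation and generalization techniques of Sections 4 and 5.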

4. AGGREGATING TILES

One approach to address the high space complexity of skyline tiles is to aggregate some of the tiles together. This saves space by storing the content of multiple tiles only once, but it may require the inclusion of false positives in the answer. Recent work [30] utilizes the computational power of the user to filter the final results based on the data returned to the user. Considering that the user can decrypt the data and operate on plaintexts, the amount of work required to filter out the false positives is very small. On the other hand, allowing a small number of false positives can significantly increase storage and processing efficiency at the SP. In essence, we no longer partition the data domain so that the answer to each skyline query is the same within each tile. Instead, we partition the domain such that the answer to a query does not change by much within each partition. We achieve this by combining some of the tiles together. The tile aggregation problem is formalized as follows.

When false positives are permitted, a solution $S^q$ returned by the skyline processing algorithm for query $q$ consists of two sets: the set $S^q_c$, which contains only (and all) points that are in the skyline of $q$, and the set $S^q_f$, which contains none of the skyline points of $q$. We refer to the points in $S^q_f$ as false hits and to $|S^q_f|$ as the number of false hits for the query $q$.

For efficient post-processing, we bound the number of false hits for any query. We define the false-hit requirement as follows: let $k$ be an integer; then the maximum number of false hits in a solution for any query must be at most $k$, that is, $\max_q |S^q_f| \le k$. Recall that, by allowing false hits, we aim at reducing the space complexity of our skyline result materialization structure. We reduce the space complexity by aggregating some of the skyline tiles. An aggregation is formally defined as follows: let $T = \{t_1 = (S_1, P_1), t_2 = (S_2, P_2), \ldots, t_r = (S_r, P_r)\}$ for the following definitions.

Definition 4 (Aggregation). An aggregation (or aggregate tile), $A$, of a set of tiles $T$ is itself a tile $A = (S, P)$ that satisfies the following conditions: (1) $P = \cup_i P_i$ and (2) $S$ is the smallest set such that $\cup_i S_i \subseteq S$. The tiles in $T$ are called the component tiles of $A$.

Definition 5 (Location-wise validity). An aggregation $A$ of $T$ is location-wise valid if the aggregation creates a single connected polytope.

Intuitively, an aggregation is location-wise valid if all of its component tiles are next to each other, i.e., there is no empty space between them.

Definition 6 (Cardinality-wise validity). We say that an aggregation $A$ of $T$ is cardinality-wise valid if $|P| \le k + \min_{1 \le i \le r} |P_i|$, for a parameter $k$.

That is, the aggregation contains at most $k$ additional points compared to any of its component tiles. Observe that if an aggregation $A$ is location-wise valid, any query $q$ that falls inside the aggregation also falls inside one of its component tiles, $T$. If the aggregation is also cardinality-wise valid, we can return the content of $A$ instead of $T$ to answer the query $q$, and the solution will satisfy the false-hit requirement. Thus, we aim at finding aggregations that are both location-wise and cardinality-wise valid. Furthermore, we aim to reduce the total space used by the algorithm. We can express this as an optimization problem as follows.

Definition 7 (Tile Aggregation Problem). Given a set of tiles $T$ and an integer $k$, return a set $S \subseteq 2^T$ such that $\cup_{t \in S} t = T$, $\forall t_i \in S$ the aggregation $a_i$ of $t_i$ is both location-wise and cardinality-wise valid, and $\sum_i |cnt(a_i)|$ is minimized.

Fig. 4 shows a feasible solution to the Tile Aggregation Problem (TAP) when $k = 2$. Observe that the content of each aggregated tile is the union of the points in each of the component tiles. Furthermore, for queries q and q , the answer is the same as the answer with no aggregation (see Fig. 3). However, q now returns two false hits, i.e., p and p while q returns one false hit, i.e., p .

The tile aggregation problem (TAP) as defined above is difficult to solve optimally. Specifically, we show that it is NP-hard even in two dimensions.

Theorem 1.

TAP is NP-hard.

Proof.

See Appendix A.

Two issues arise when solving TAP. First, solving TAP optimally is NP-hard. Second, the aggregate tiles of the TAP solution can have complicated shapes, which slows down searching them. We address these issues next.
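To make Definitions 4 to 6 concrete, the following is a minimal Python sketch of aggregation and the two validity checks. It is an illustration under stated assumptions rather than the paper's implementation: the space of a tile is modeled as a set of unit grid cells, connectivity is tested with 4-neighborhood adjacency, and the names `Tile`, `aggregate`, `location_valid` and `cardinality_valid` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Tile:
    # Assumption: the space of a tile is modeled as a set of unit grid cells.
    cells: FrozenSet[Tuple[int, int]]
    # Identifiers of the skyline points stored in the tile (its content).
    points: FrozenSet[str]


def aggregate(tiles: List[Tile]) -> Tile:
    """Aggregate tile: union of the spaces and union of the contents."""
    cells = frozenset().union(*(t.cells for t in tiles))
    points = frozenset().union(*(t.points for t in tiles))
    return Tile(cells, points)


def location_valid(tiles: List[Tile]) -> bool:
    """True if the aggregated space is one connected region (4-neighborhood)."""
    cells = set(aggregate(tiles).cells)
    if not cells:
        return False
    start = next(iter(cells))
    seen, stack = {start}, [start]
    while stack:
        x, y = stack.pop()
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nb in cells and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == cells


def cardinality_valid(tiles: List[Tile], k: int) -> bool:
    """True if |P| <= k + min_i |P_i| (Definition 6)."""
    agg = aggregate(tiles)
    return len(agg.points) <= k + min(len(t.points) for t in tiles)
```

With this model, an aggregation that passes both checks can answer any query falling inside it with at most `k` false hits.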

We propose a relaxation of the aggregation problem. We restrict the possible choices by enforcing that aggregations must have a certain shape. However, we allow aggregations to split existing tiles into two (that is, half of a tile may belong to one aggregation and the other half to another) to avoid over-restrictive requirements. These modifications make it more intuitive to formulate the problem as a space partitioning problem, as discussed below.

Note that any feasible solution, $S$, to TAP corresponds to a set of aggregations $A$ whose union of space, i.e., $\cup_{a \in A} spc(a)$, covers the entire data domain. On the other hand, if we allow splitting of the tiles during aggregation, any partitioning of the space can be used to create an aggregation that in turn can be used to answer the skyline query. Consider a partitioning of the query space $\Pi = \{\pi_1, \pi_2, \ldots, \pi_r\}$. Define a set of tiles $T = \{t_1, \ldots, t_r\}$ such that $spc(t_i) = \pi_i$. Moreover, for a tile $t_i$, let $B_i = \{\tau \in T_D \mid spc(\tau) \cap spc(t_i) \ne \emptyset\}$ and set $cnt(t_i) = \cup_{\tau \in B_i} cnt(\tau)$. Note that the set $T$ defines a tiling of the space that covers the entire domain, and if a query falls into a tile, the content of the tile will be a superset of the answer to the query.

Since every partitioning of the query space corresponds to a tiling by the construction above, we formulate this relaxation of TAP (it is a relaxation because we now allow splitting of tiles) in terms of a space partitioning problem that minimizes the storage cost. Finding a space partitioning in the general case is also difficult, since many aggregations are also space partitionings. However, we reduce the complexity of the problem by restricting the shape of the partitionings allowed.

Definition 8 (Shape-wise validity). A partitioning is shape-wise valid if it can be represented by a set of hyper-planes where a hyper-plane in the $i$-th dimension has boundaries from $-\infty$ to $\infty$ in all dimensions $j$ when $j > i$.
Intuitively, the partitioning contains hyper-planes in dimension $i$ that are not bounded by any hyper-plane in dimensions after $i$.

Definition 9 (Aggregation by Partitioning Problem). For the query space $Q$, find a set of hyper-planes that define a shape-wise valid partitioning of $Q$ such that the tiling, $T$, corresponding to the partitioning is cardinality-wise valid and $\sum_{t \in T} |cnt(t)|$ is minimized.

We provide a dynamic programming solution that solves the Aggregation by Partitioning Problem (APP) optimally, and then discuss a number of heuristics that solve it without approximation guarantees. The quality of our solution to APP (i.e., whether it is optimal or not) affects only the storage cost of our index. Our guarantees regarding the properties of tiles (e.g., the false-hit requirement) hold irrespective of whether the problem is solved optimally. Thus, heuristics can be useful in practice when using minimal space is not critical.

We can solve APP optimally by dynamic programming. We limit our discussion to two dimensions for ease of illustration. The idea can be extended to higher dimensions, and we can obtain a polynomial running time for any fixed dimensionality. However, in practice, the run time becomes large for high dimensions, and using heuristics may be a better option in those scenarios.

The dynamic programming formulation uses two observations. First, the shape-wise validity requirement of APP in two dimensions implies that the space partitioning is defined by a set of vertical lines that cross the entire space and a set of horizontal lines that start and end between adjacent vertical lines. Second, there exists an optimal solution where every line overlaps the boundary of some skyline tile. This is because any line that does not overlap the boundary of any skyline tile can be moved until it does overlap a boundary, and the total cost will not increase. Thus, to find the optimal solution, we only need to consider lines that pass through the boundaries of the skyline tiles.

Recurrence relation.

Let $L_v$ be an array of all possible $x$ locations for the vertical lines and let $L_h$ be an array of all possible $y$ locations for the horizontal lines, both sorted in ascending order. To obtain $L_v$ and $L_h$, we enumerate all the skyline tiles, find the $x$ and $y$ values of their boundary lines, and add them to $L_v$ and $L_h$ if they are not already present. Let $N_h = |L_h|$ and $N_v = |L_v|$. Furthermore, consider $R = ((x_1, y_1), (x_2, y_2))$ as a rectangle with lower left coordinates $(x_1, y_1)$ and upper right coordinates $(x_2, y_2)$. Define $C(s, i, t, j)$ as the size of the content of the aggregate tile whose space is $R = ((L_v[s], L_h[i]), (L_v[t], L_h[j]))$, or infinity if such an aggregation is not cardinality-wise valid. That is, let $B = \{\tau \in T_D \mid spc(\tau) \cap R \ne \emptyset\}$. Define $smallest = \min_{\tau \in B} |cnt(\tau)|$ and $size = |\cup_{\tau \in B} cnt(\tau)|$. Then let

$$C(s, i, t, j) = \begin{cases} size & \text{if } size \le smallest + k \\ \infty & \text{otherwise} \end{cases}$$

Define $V(i)$ as the optimal solution to APP in the space where $x > L_v[i]$, given that there exists a vertical line at $L_v[i]$. Note that the optimal solution to our problem is $V(0)$. Furthermore, define $H(i; s, t)$ as the optimal solution to APP in the space $L_v[s] < x < L_v[t]$, $y > L_h[i]$ with only horizontal lines, given that there are vertical lines at $L_v[s]$ and $L_v[t]$ and a horizontal line at $L_h[i]$. Then, we can write the following recurrence relations, where $V$ and $H$ are zero at the last vertical and horizontal boundary, respectively:

$$V(i) = \min_{j > i} \big( H(0; i, j) + V(j) \big)$$

$$H(i; s, t) = \min_{j > i} \big( C(s, i, t, j) + H(j; s, t) \big)$$
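The tabulation of $V$ and $H$ can be sketched in Python as follows. This is a hedged illustration, not the paper's code: it assumes axis-aligned rectangular tiles given as `(x1, y1, x2, y2, points)` tuples that cover the domain, it recomputes the cost function by scanning all tiles instead of using Property 3, and it omits the $v_{max}$ and $h_{max}$ pruning. The function name `solve_app_2d` is hypothetical, and indices are 0-based so the base cases sit at the last boundary.

```python
import math


def solve_app_2d(tiles, k):
    """Minimum total content size of a shape-wise valid 2D partitioning.

    tiles: list of (x1, y1, x2, y2, points), with points a set of ids.
    """
    # Candidate line positions: the boundaries of the skyline tiles.
    xs = sorted({x for x1, _, x2, _, _ in tiles for x in (x1, x2)})
    ys = sorted({y for _, y1, _, y2, _ in tiles for y in (y1, y2)})

    def cost(s, i, t, j):
        # Content size of the cell [xs[s], xs[t]] x [ys[i], ys[j]],
        # or infinity if the cell is not cardinality-wise valid.
        rx1, ry1, rx2, ry2 = xs[s], ys[i], xs[t], ys[j]
        hit = [p for (x1, y1, x2, y2, p) in tiles
               if x1 < rx2 and rx1 < x2 and y1 < ry2 and ry1 < y2]
        if not hit:
            return math.inf  # cannot happen when the tiles cover the domain
        size = len(set().union(*hit))
        smallest = min(len(p) for p in hit)
        return size if size <= smallest + k else math.inf

    nv, nh = len(xs), len(ys)
    V = [math.inf] * nv
    V[nv - 1] = 0  # nothing left to partition beyond the last vertical line
    for i in range(nv - 2, -1, -1):
        for j in range(i + 1, nv):
            # H[a]: best cost of the strip between xs[i] and xs[j] above ys[a].
            H = [math.inf] * nh
            H[nh - 1] = 0
            for a in range(nh - 2, -1, -1):
                for b in range(a + 1, nh):
                    H[a] = min(H[a], cost(i, a, j, b) + H[b])
            V[i] = min(V[i], H[0] + V[j])
    return V[0]
```

For two side-by-side tiles with contents `{"a"}` and `{"a", "b"}`, the sketch keeps them separate when `k = 0` (total content 3) and merges them when `k = 1` (total content 2).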

Algorithm 1 implements the recurrence relation. The algorithm starts backwards and tabulates the values for $H$ and $V$. Note that one optimization compared with the recurrence relation is that the distance between $i$ and $j$ is bounded by $v_{max}$ (and similarly $s$ and $t$ by $h_{max}$). This is because of the false-hit requirement of the solution. That is, $v_{max}$ is the maximum possible integer such that the space from $L_v[v_{max}]$ to $L_v[v_{max} + i]$ has a feasible solution using only horizontal lines (i.e., a partitioning exists that satisfies the false-hit requirement) for all $i \le |L_v| - v_{max}$.

Correctness.

Observe that, given that a vertical line exists at $L_v[i]$, the optimal solution must contain a next vertical line at a location $L_v[j]$, for some $j > i$ (except for the base case). Each $j$ splits the space into two: (1) the space from $L_v[i]$ to $L_v[j]$; and (2) the space after $L_v[j]$. Note that no partition is allowed to cross $L_v[j]$ because the partitioning has to be shape-wise valid. As a result, the two instances can be solved independently. Consider instance (1): it is the problem of finding the optimal tiling of the space between $L_v[i]$ and $L_v[j]$ using only horizontal lines, given that there are vertical lines at $L_v[i]$ and $L_v[j]$. This is the same as $H(0; i, j)$. Consider instance (2): it is the problem of finding the optimal tiling of the space after $L_v[j]$ given that there is a vertical line at $L_v[j]$. This is the same as $V(j)$. Therefore, for each $j$, the minimum possible cost is $H(0; i, j) + V(j)$. Since the optimal solution has to contain one of the possible values of $j$, its cost must be $\min_j H(0; i, j) + V(j)$. A similar argument proves the correctness of the recurrence relation for $H$.

Algorithm 1 DP($T_D$)
    L_h ← horizontal boundaries of T_D, sorted
    L_v ← vertical boundaries of T_D, sorted
    N_v ← |L_v| and N_h ← |L_h|
    V(N_v) ← 0
    for i ← N_v − 1 to 0 do
        V(i) ← ∞
        for j ← i + 1 to N_v and j − i ≤ v_max do
            H(N_h) ← 0
            for s ← N_h − 1 to 0 do
                H(s) ← ∞
                for t ← s + 1 to N_h and t − s ≤ h_max do
                    c ← C(i, s, j, t) + H(t)
                    if c ≤ H(s) then
                        H(s) ← c
            c ← H(0) + V(j)
            if c ≤ V(i) then
                V(i) ← c
    return V(0)

Time Complexity.

As Algorithm 1 shows, there are four nested loops, which run for at most $N_v$, $v_{max}$, $N_h$ and $h_{max}$ iterations, respectively. $C(i, s, j, t)$ can be found in $O(\log n)$ time using Property 3 (we keep track of the content of the tiles, and whenever a hyper-plane is crossed, we use Property 3 to determine the content of the new tile encountered). $O(\log n)$ time is needed to determine whether the differing point mentioned in Property 3 already exists in the previous tile or not. Let $N = \max\{N_v, N_h\}$. Thus, the total running time is $O(v_{max} h_{max} N^2 \log n)$ ($C(i, s, j, t)$ can also be pre-computed to avoid the $\log n$ factor at the expense of storage cost). $v_{max}$ and $h_{max}$ depend on $k$ and on how the tiles are distributed. They are generally similar to $k$ in value, since every skyline tile differs in exactly one point from any of its neighbors.

We present two heuristics that can be used either independently, or together with our dynamic programming approach, to improve the running time of our algorithm at the cost of losing optimality.

Prepartitioning. To reduce the time complexity of the algorithm, we first pre-partition the space. We can do this by placing a vertical and a horizontal line at every multiple of $m$ coordinates, where $m$ is a fixed parameter. This bounds $v_{max}$ and $h_{max}$ to $m$. We solve the problem for each partition created using DP. Each partition now has $m$ possible vertical positions and $m$ horizontal positions. The run-time of the algorithm is reduced to O ( N m log n ), where $m$ can be used to trade off optimality against run-time. The solution approaches the optimal as $m$ increases.

Greedy. Another heuristic is to choose the lines greedily, that is, starting from the beginning and choosing the next vertical line as far away from the current position as possible, and repeating that for the horizontal dimension. This reduces the cost to $O(|T_D|)$.
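The greedy rule can be illustrated with a small Python sketch. As a simplifying assumption, it works in one dimension only: tiles are given as a left-to-right sequence of content sets, and each group is extended as far as possible while it stays cardinality-wise valid. The function name `greedy_groups` is hypothetical, and the two-dimensional procedure in the paper additionally repeats the choice for horizontal lines.

```python
def greedy_groups(contents, k):
    """Greedily merge consecutive tiles into cardinality-wise valid groups.

    contents: list of sets, the content of each tile from left to right.
    Returns a list of (start, end, content) with end exclusive.
    """
    groups = []
    start, n = 0, len(contents)
    while start < n:
        union = set(contents[start])
        smallest = len(contents[start])
        end = start + 1
        while end < n:
            new_union = union | contents[end]
            new_smallest = min(smallest, len(contents[end]))
            # Extending only grows the union and shrinks the minimum,
            # so the first violation is also the farthest valid position.
            if len(new_union) > new_smallest + k:
                break
            union, smallest = new_union, new_smallest
            end += 1
        groups.append((start, end, frozenset(union)))
        start = end
    return groups
```

Each tile is visited once, which matches the linear cost stated above for this simplified setting.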

5. GENERALIZED SKYLINE TILES

Another way to reduce the storage overhead is to generalize the skyline tile concept, and to allow for a tunable parameter providing a trade-off between space complexity and query time. Recall that, to generate skyline tiles, we intersected the skyline regions of all the points, which led to the creation of a large number of tiles. In this section, we generalize the process by intersecting only $l$ of the skyline regions at a time, for a parameter $l$, which gives us $\lceil n/l \rceil$ different sets of tiles. This helps reduce the space complexity, because it reduces the fragmentation of tiles (fewer data points result in fewer skyline region intersections). However, it increases query time, since we need to search multiple sets of tiles to find the final answer to the query. In the extreme case when $l = n$, we obtain one set of tiles that intersects all the skyline regions. When $l = 1$ we do not intersect the skyline regions at all, but we use each skyline region individually to determine whether a point is in the skyline or not.

Observe that generalizations of Property 2 and Property 3 hold for generalized skyline tiles. For Property 2, observe that a skyline tile created from the intersection of the skyline regions of a set of points $D_i \subseteq D$ contains a subset of the answer to the skyline query in $D$. In other words, the union of the tiles a skyline query falls into is the answer to the skyline query. Furthermore, note that the aggregation methods discussed in Section 4 can still be applied to each set of generalized tiles created. That is, we apply the method $\lceil n/l \rceil$ times. However, the total number of false hits will now be $\lceil n/l \rceil \times k$, as $k$ is the number of false hits per use of the aggregation method.

Fig. 6 illustrates the generalization concept. Setting $l = 3$, we get two different sets of tiles. Fig. 6(a) shows the intersection of skyline regions for p , p and p , whereas Fig. 6(b) shows the intersection for the remaining skyline regions.
Note that a query has to search both sets of tiles. For instance, queries q and q fall into the tile t from which we obtain points p and p . They also fall into the tile t from which we obtain p . Thus, the answer to q and q is p , p and p . Observe that q falls into t , but t now does not contain any point at all (this is possible when the intersection is not done on all the skyline regions). q also falls into t which contains p . Thus, the answer to q is p .

Number of Tiles. Each index contains $l$ skyline regions and each skyline region contains at most $n$ hyper-planes in each of the $d$ dimensions. For a hyper-plane in the first dimension, consider the maximum number of tiles it can be a part of. There can be at most $2^d \times l$ hyper-planes intersecting it in any other dimension. Thus, it can be a part of at most $2^d l^{d-1}$ tiles. There are at most $nl$ hyper-planes in the first dimension, and every tile has to have one of those as its edge. Thus, there are at most $O(2^d n l^d)$ tiles in total for each set of tiles.

Value of N. Let $B_p$ be the set of border points for $p$. According to Property 1, $|B_p|$ is the size of a skyline query at $p$. Let $B_{avg} = \frac{\sum_p |B_p|}{l}$. Since $N$ is at most $\sum_p |B_p|$, we can write $N = l B_{avg}$. If data points are uniformly distributed, there will be $O(n \frac{(2 \log n)^d}{d!})$ skyline points. This is because $\frac{(\log n)^d}{d!}$ is the expected number of skyline points if the points are uniformly distributed, there are $n$ data points, and we need to consider the number of skyline points for each of the $2^d$ quadrants created based on $p$. Therefore, in such a scenario, $B_{avg} = O(n \frac{(2 \log n)^d}{d!})$ in expectation.

Remark. Observe that based on our choice of $l$, we can avoid both space and time complexity $n^d$ (for instance, by setting l = n d ). Thus, generalized tiling is particularly useful as dimensionality increases.

Figure 7: Building a kd-tree for a solution to APP

Algorithm 2 GetSkylineRegion($D$, $p$)
 1: B ← border points of p
 2: S_p ← ∅
 3: for p_i ∈ B do
 4:     m ← p + p_i
 5:     for dim ≤ d do
 6:         h ← d-dimensional hyper-plane
 7:         for dim′ ≤ d do
 8:             m ← (p[dim′] + p_i[dim′]) / 2
 9:             end ← end point of hyper-plane
10:             h[dim] ← (m, end)
11:         S_p ← S_p ∪ {h}
12: return S_p

6. COMPLETE SKYLINE ALGORITHM

We described several methods to pre-compute the skyline result for any query within the data domain. Our approaches achieve various trade-offs between computation time, storage size and number of false hits. In this section, we show how the final query can be determined using plaintext data and returned to the user based on our pre-computed set of results. In the next section, we present how the data structures are encrypted, and how the query answering operations are performed using encrypted data.

The performance of our approach can be tuned using two main parameters: $l$ and $k$. The former determines the number of partitions we divide our dataset into, and the latter the number of false positives allowed. Setting $k = 0$, the set of tiles will provide an exact answer, while in general there can be $\lceil n/l \rceil k$ false hits in the answer. Following the result pre-computation, we generate a set of indices, $\mathcal{I}$, containing $\lceil n/l \rceil$ separate index structures, used to answer the skyline query. We call $I_i \in \mathcal{I}$ a skyline index.

According to our validity conditions (Sec. 4), $I_i$ must partition the space recursively while disallowing overlaps between partitions. We employ a kd-tree index for this purpose. Alg. 3 illustrates the combined process of result pre-computation and index construction, with four stages:

Algorithm 3 GetSkylineIndices($D$, $l$, $k$)
 1: for p_i ∈ D do
 2:     S_{p_i} ← GetSkylineRegion(D, p_i)
 3: I ← ∅
 4: for i ≤ ⌈n/l⌉ do
 5:     H ← AggregateTiles(k, S_{p_{i×l}} ∪ S_{p_{i×l+1}} ∪ ... ∪ S_{p_{(i+1)×l}})
 6:     I_i ← kd-tree on H
 7:     for each leaf node n_l in I_i do
 8:         n_l.points ← GetSkylinePoints(n_l)
 9:     Add I_i to I
10: return I

1. Construction of Skyline Region.

To create the skyline region of a point $p$, we first find its border points by issuing a skyline query at $p$ on $D \setminus \{p\}$. Then, we iterate through all the border points $p'$ of $p$ and find $D_{pp'}$. $D_{pp'}$ is defined in terms of a number of hyper-planes. Thus, we store these hyper-planes for each point for future use.
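The border-point step can be sketched in Python as a brute-force dynamic skyline query. The sketch assumes the standard dynamic-skyline dominance relation (coordinate-wise absolute distance to the query point); the function names `dominates`, `dynamic_skyline` and `border_points` are hypothetical, and the quadratic scan is chosen for clarity rather than speed.

```python
def dominates(a, b, q):
    """True if a dominates b with respect to query point q.

    Assumed relation: a is at least as close to q as b in every dimension
    and strictly closer in at least one.
    """
    da = [abs(x - y) for x, y in zip(a, q)]
    db = [abs(x - y) for x, y in zip(b, q)]
    return (all(x <= y for x, y in zip(da, db))
            and any(x < y for x, y in zip(da, db)))


def dynamic_skyline(points, q):
    """Points not dominated by any other point with respect to q."""
    return [p for p in points
            if not any(dominates(o, p, q) for o in points if o != p)]


def border_points(data, p):
    """Border points of p: the dynamic skyline at p over the data without p."""
    others = [x for x in data if x != p]
    return dynamic_skyline(others, p)
```

The same `dynamic_skyline` function could also serve on the user side to filter false hits out of a returned superset, since the true skyline points are never dominated.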

2. Aggregating Tiles.

Line 5 of Alg. 3 uses one of the methods discussed in Sec. 4 to perform tile aggregation and returns the resulting hyper-planes. Note that, even when $k = 0$, we run an APP algorithm with $k = 0$ and obtain the corresponding hyper-planes. This approach traverses the tiles once and splits some tiles into two, but constructs hyper-planes that can be easily used to create a balanced index (see below). Alternatively, when $k = 0$, we can skip this step, but the process of creating a balanced tree becomes more complicated.

3. Building kd-tree.

We build a balanced kd-tree from our tiles, as follows. All the hyper-planes are given in advance by our solution to APP, so we build a balanced kd-tree. Since, for a solution to APP, the hyper-planes in the $i$-th dimension do not cross hyper-planes in the $j$-th dimension for $j < i$, we impose the following ordering on tiles by utilizing the partitioning. We traverse the tiles by going through the hyper-planes in the first dimension iteratively, and for each hyper-plane, recursively going through the hyper-planes in the next dimensions that fall right before it. Fig. 7(a) shows how we can do this in two dimensions. We start from the left-most vertical line, and go through the tiles in ascending order of the horizontal lines. Then we move on to the next vertical line. This gives us the ordering shown in Fig. 7(a), where the numbers show the position of each tile in that order.

Then, to build the tree, we choose the split points so that half of the tiles are stored in one sub-tree and the other half in another. For instance, in Fig. 7, tiles 1-7 are in the left sub-tree and tiles 8-14 are in the right sub-tree. Note that the condition for the split can be defined by 2 × d −

4. Assigning Skyline Points.

For each leaf node of the kd-tree we assign the corresponding skyline points. The content of each leaf node is already determined by running APP. Thus, we traverse all the leaf nodes and copy the content from the corresponding aggregation.

Performing Queries. At runtime, the query result is determined by a simple traversal of the kd-tree index. The search locates the leaf node that encloses the query, and the list of points stored in that leaf represents the (super)set of the skyline query result. In the case of generalized tiles, the process is run separately for each index structure (i.e., $\lceil n/l \rceil$ times). All searches are completely independent, so they can be run in parallel at the SP, thus improving response time.

Index Construction Time.

Alg. 2 first finds the border points of $p$, which takes O ( n ). Then, for each point, it finds the hyper-planes delimiting its skyline region, which takes O ( d n ) (line 9 takes O ( dn ) because for any candidate end-point, we need to check if another end-point covers it). Overall, Alg. 2 takes O ( d n ). Alg. 3 calls Alg. 2 for all the points, which costs O ( d n ). Then, for each group of $l$ points, Alg. 3 builds a kd-tree. Observe that the height of the kd-tree is O ( d log N ), and in total there are O ( dN ) hyper-planes. Thus, building the index costs O ( d N log N ). Then, there are a total of N d leaf nodes, and filling the content of each takes O ( n ), which is in total O ( nN d ).

Query Time.

Searching each index takes O ( d log N ) (index height is at most N d and each level requires O ( d ) comparisons), for a query time of O ( ⌈n/l⌉ d log n ).

Space Complexity.

Every tile can contain $O(l)$ points, and there are $O(n/l)$ separate index structures. Therefore, the total space complexity is O (2 d n l d − ). In general, we can observe that increasing $l$ reduces query time but increases space complexity. Thus, $l$ can be set depending on the space constraints that exist at the service provider.
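The query procedure described under Performing Queries can be sketched in Python as follows. The sketch assumes a plaintext kd-tree whose internal nodes store a split dimension and value and whose leaves store the materialized point identifiers; the names `Node`, `lookup` and `query` are hypothetical, and the searches are run sequentially here although they are independent and could be parallelized.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence


@dataclass
class Node:
    dim: int = 0                      # split dimension (internal nodes)
    split: float = 0.0                # split value (internal nodes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    points: Optional[FrozenSet[str]] = None  # set only on leaves; may be empty


def lookup(root: Node, q: Sequence[float]) -> FrozenSet[str]:
    """Descend to the leaf enclosing q and return its materialized content."""
    node = root
    while node.points is None:
        node = node.left if q[node.dim] < node.split else node.right
    return node.points


def query(indexes: List[Node], q: Sequence[float]) -> set:
    """Union of the leaf contents over all skyline indexes."""
    result = set()
    for root in indexes:
        result |= lookup(root, q)
    return result
```

With `k > 0` the returned set may contain up to `k` false hits per index, which the user removes during post-processing.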

7. ENCRYPTED SKYLINE SEARCH

Our result materialization approach reduces the skyline query to a simple index look-up. The benefits of our method become even clearer when performing skyline queries on encrypted data. We do not require any distance calculations at query time, as existing methods do. We only require value comparisons for traversing the index. Furthermore, these comparisons are not performed on the actual data points, but on index node extents. In addition, we bulk-load the indices, which hides any data distribution details and makes the indexes fully balanced. These features allow us to utilize simple and efficient cryptographic primitives, while at the same time providing strong security guarantees.

We need to encrypt a set of skyline indexes $\mathcal{I}$. For each index, we must encrypt (1) the data points stored in the leaf set and (2) the index structure itself.

Encrypting Data Points. The search does not perform comparisons on data points, so we can use conventional symmetric encryption, such as AES, which provides strong protection and also achieves semantic security. After traversing the index and reaching the leaf level, we return the entire contents of the leaf to the user, who decrypts them locally.

Encrypting Index Structures. Since a kd-tree is used, we only need to perform comparisons at each index level. We employ two alternative encryption techniques: mutable order-preserving encryption (mOPE) [26] and practical order revealing encryption (pORE) [7]. mOPE has been proven to be ideal, and does not leak any information about values (e.g., no value distribution, density, etc.). However, it requires the encoding for each index value to be determined in advance, and the user needs to be aware of this mapping. As a result, the user must perform a one-time setup operation through which it downloads the mapping from the DO. On the other hand, with pORE, the user can compute the ciphertext of an arbitrary data value based on the secret key alone, without any mapping tables. However, pORE incurs a small and measurable leakage in the form of the position of the most significant bit that differs when comparing two values. Recall that the comparison is not performed directly on data points, but on intermediate index values. Still, this amount of leakage may not be acceptable in some scenarios. Therefore, our two solutions offer a measurable trade-off between protection and setup cost: if the user is willing to download the mapping locally, then the ideal mOPE can be used. Otherwise, if the user is willing to trade a small amount of leakage, then pORE can be used, with no additional one-time setup required.
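To make the stated pORE leakage concrete, the following Python sketch computes, on plaintext integers, the position of the most significant bit at which two values differ. It illustrates only what a comparison is said to reveal; it is not an implementation of the scheme in [7], and the function name `first_differing_bit` and the fixed bit width are assumptions made for illustration.

```python
def first_differing_bit(x, y, width=32):
    """Index (0 = most significant) of the first bit where x and y differ.

    Returns None when the two values are equal. Both values are assumed
    to be non-negative integers that fit in `width` bits.
    """
    diff = x ^ y
    if diff == 0:
        return None
    return width - diff.bit_length()
```

For example, with a 4-bit width, 0b1010 and 0b1000 first differ at index 2, so a comparison between them would reveal that index together with the comparison outcome.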

We assume the clients do not collude with the server and that the server is honest-but-curious (i.e., it correctly follows the protocol, but tries to infer additional information). Our analysis quantifies the security leakage and answers the following questions: (1) what can the server learn about the data by just observing the skyline index (static data leakage), (2) what can the server learn about the data while performing encrypted queries on the index (dynamic data leakage) and (3) what can the server learn about a query when performing the encrypted query (query security).

Static Data Leakage. The SP observes the index and attempts to learn information from the structure. We need to show that, given a leakage function $L_S$, the server can only distinguish with negligible probability between a real index, $R$, created based on the original data and a simulated index, $S$, created based on the leakage function $L_S$. The security game is to repeatedly show the server a pair $(R_i, S_i)$ for a polynomial number of rounds and let the server guess whether the first index or the second index is the real one (the game is, in essence, similar to that of IND-OCPA [26]). The following theorem quantifies our leakage.

Theorem 2.

The server can distinguish between a real and simulated index with negligible probability given leakage function $L(\mathcal{I}) = (|\mathcal{I}|, \{\forall I \in \mathcal{I} \mid (h_I, \sigma(I))\})$, where $|\mathcal{I}|$ is the number of indices, $h_I$ is the height of the index $I$ and $\sigma(I)$ is the size of the content of each node in the index $I$.

Proof sketch. Given the leakage function, and a fixed leaf set size, a simulated index can be constructed to have exactly the same structure. This results from our proposed index construction method, where we bulk-load the tree and ensure that all leaves are at the same height. Thus, the security of our method boils down to the security of the underlying order-preserving encryption scheme.

The above result is possible because we build a completely balanced tree, and only the order of magnitude for the leaf set is revealed. From an empirical perspective, the height of the index reveals very little about the data, as a wide range of data sizes lead to a skyline index structure of the same height. If leaking the leaf set size is considered unacceptable, one can employ padding, where fake points are added to the leaf nodes. Doing so will not have an effect on search performance, but increases communication cost.

Dynamic Data Leakage and Query Security. We study dynamic data security and query security together, as they are both determined by the encryption method used for performing comparisons.

mOPE. If we use mOPE [26], we can guarantee that the only leakage from an individual query is its traversal path, and that there is no extra leakage (in addition to $L_S$) from the data. This follows the security analysis of mOPE [26]. Intuitively, this holds because a query is translated into a traversal path of the index and then sent to the server. As a result, the server only learns the path accessed by the query and nothing more. At the same time, the server only accesses the index nodes based on the path provided by the query, which it could have done without the query as well.

pORE.
Using pORE [7] increases the leakage of our algorithm but removes the extra storage requirement that mOPE imposes at the user side. Note that our encryption algorithm sends an independently encrypted value for comparison at each level of the index (that is, the encrypted query may contain the same dimension encrypted multiple times, and the size of the query is equal to the height of the tree). Intuitively, this ensures that, at the server side, every comparison is independently done, and the server cannot learn anything by cross-examining the queries. Thus, the leakage is reduced to that of pORE [7], which is the index of the first bit that is different between the query and the index node. Since the server does not encrypt the query points, this leakage may not have any meaningful implication regarding the security of the data. However, it provides a lower level of guarantee compared with mOPE.

Protecting Traversal Patterns. We do not directly protect access patterns to the index. While access patterns themselves may not lead to an adversary learning the data values, they may reveal information about the query. To avoid such disclosure, our approach can be used in conjunction with oblivious RAM structures. There are a number of existing techniques [28, 29] that can be used in conjunction with generic index structures (including kd-trees) in order to hide access patterns through re-balancing. That will generate additional maintenance cost, although query response time will not be significantly affected. The details of combining ORAM with our approach are not specific to skyline queries, so they fall outside the scope of this submission.

8. EMPIRICAL EVALUATION

8.1 Experimental Setup

We performed experiments on an Intel i9-9980XE CPU(3GHz) with 128GB RAM running Ubuntu 18.04 LTS.

Dataset.

Following the setup from [22], we use both synthetic and real datasets, but with larger sizes. We used the NBA dataset with 2438 points in 5 dimensions, where each point represents an NBA player's performance metric (e.g., points scored, blocks, assists, etc.). We use three synthetic datasets: uniform, correlated (Gaussian distribution with correlation coefficient 0.9) and anti-correlated (Gaussian with coefficient -0.9). We consider up to 50,000 points, and dimensionality up to 5.

Retrieved from https://stats.nba.com/ on 04/15/2015

Figure 8: Varying false hits on smaller dataset. Panels: (a) Construction Time, (b) Query Time, (c) Storage Cost, each plotted against the maximum number of false hits. Series: DP-100, DP-1000, GREEDY-100, GREEDY-1000.

Figure 9: Varying false hits on larger dataset. Panels: (a) Construction Time, (b) Query Time, (c) Storage Cost, each plotted against the maximum number of false hits. Series: GREEDY-1000, GREEDY-10000.

Algorithms.

We evaluate our dynamic programming algorithm (label DP) and greedy algorithm (label GREEDY) from Sec. 4, and for each of them we consider the generalized tiling option discussed in Sec. 5. The notation GREEDY-l (respectively DP-l) for some value of $l$ refers to the greedy (respectively DP) algorithm with the generalized tiling parameter set to $l$. We also include the standalone generalized tiling algorithm (without aggregation, i.e., skyline indexes are built directly on the skyline tiles) with label GEN-TILE.

To the best of our knowledge, the state-of-the-art work for skyline queries on encrypted data is that of [22] and [14]. The former requires two non-colluding servers, and it takes around three hours to complete a query, whereas the latter requires a time in the order of seconds for 100 data points. Since our method is much faster (sub-second query response time), we did not include a direct comparison with these approaches, which we clearly outperform, mainly due to our approach of materializing results.

Measurements.

We report construction time, query time and storage cost. Construction time is the time required to build and encrypt the index structure(s) at the data owner. Query time is the time to execute a query on the encrypted index. Storage cost measures the amount of storage required at the SP. Result filtering time at the user takes less than half a second, so we omit it from the measurements.

Experiments on two-dimensional data.

First, we compare the performance of the proposed DP and greedy algorithms for multiple settings of $k$ and $l$. Since the DP approach is costly, we restrict these comparative runs to a dataset of 1,000 records, and evaluate the greedy approach later on using larger data sizes. Fig. 8 summarizes our findings. Fig. 8(c) shows that although

GREEDY may not return optimal solutions, in practice, it returns solutions with storage cost very close to that of DP. However, Fig. 8(a) shows that DP takes much longer to run, due to its higher time complexity. Fig. 8(b) shows that the query time is almost the same regardless of whether GREEDY or DP is used. Observe that the construction time is generally smaller for $l = 100$ compared with $l = 1000$, while query times are about a multiplicative factor apart ($l = 100$ requires 10 different indexes to be searched, while $l = 1000$ searches only one index of larger size). We emphasize that the query time, which is the delay perceived by the user at runtime, is very small, always less than one millisecond.

Figure 10: Varying data size. Panels: (a) Construction Time (s), (b) Query Time (μs), (c) Storage Cost (total index size). Series: GREEDY-100, GREEDY-1000, GREEDY-10000.

Figure 11: Varying dimensionality. Panels: (a) Construction Time (s), (b) Query Time (ms), (c) Storage Cost (total index size, GB). Series: GEN-TILE-1.

The results show that DP provides a relatively small storage advantage compared with GREEDY, whereas its index construction time overhead is considerably larger. DP becomes impractical for larger values of $k$ even when the data set contains only 1,000 points. In the remainder of the experiments, we exclude DP from our evaluation. However, it remains an interesting approach from a theoretical perspective, and may prove valuable in future work as a base for deriving effective heuristics that reduce storage cost.

Next, we increase data size to $n = 10{,}000$ and vary parameter $k$ (which controls the maximum number of false hits allowed). Results are shown in Fig. 9. In general, $k$ does not significantly impact construction time, and only has a visible impact on query time when l = 1 , l = 1 , l = 10 , k being smaller (ranges from 2 to 10) when $l$ is 1,000 and increasing it allows for more flexibility of aggregations. Finally, we observed that index structure size is about 50MB for $k = 60$, which shows that the communication cost for transferring the structure when using mOPE is small.

We further increase data size up to 50,000 data points and set $k = l/100$, so that the false hits count is $k \times \frac{n}{l} = \frac{n}{100}$, i.e., 1% of the data. Fig. 10 shows the results. For $l = 10{,}000$ the storage cost and construction time of the index become prohibitive, but the query time is at least an order of magnitude smaller than in other cases. However, smaller values of $l$ can be used in practice to handle this workload.

Experiments on higher dimensional data.

Observ-ing the theoretical results of Sec. 5, we only use our general-ized tiling approach with no aggregation in this section (i.e.,the

Gen-Tile algorithm). Our decision is due to the follow-ing factors: recall that in Sec. 5 we discussed decreasing thevalue of l for higher dimensional data to be able to overcomethe curse of dimensionality. At the same time, the numberof false hits, whenever k >

0, increases when decreasing l .Moreover, to be able to achieve construction time similarto that for 2D data when dimensionality increases, we needto decrease the value of l . As a consequence, we need toset l to a small value, but then setting k > k = 0 and l to a small11 BA COR ANTI UNI01000200030004000 C o n s t r u c t i o n t i m e ( ) (a) Con truction Time NBA COR ANTI UNI10 −2 −1 Q u e r y t i m e ( m ) (b) Query Time NBA COR ANTI UNI10 T o t a l I n d e x S i z e ( M B ) (c) Storage Co t GREEDY-10000 GREEDY-1000 GET-TILE-1 GET-TILE-2

Figure 12: Other Distributions value. That is, we perform no aggregation, but increase thenumber of skyline indexes. Thus, in this section, we onlyreport results for

Gen-Tile .Fig. 11 shows the results when n = 10 , l = 1since for d >

4, larger values of l incur signiﬁcantly morestorage cost. Overall, the query time increases linearly withdimensionality. However, for higher dimensional data, stor-age cost surpasses 10GB and makes this approach less ap-plicable for dimensionality higher than ﬁve. We performed experiments on non-uniform data and realdatasets as well. The results are shown in Fig. 8.3. Thealgorithms performs almost identical when comparing dif-ferent synthetic distributions. This shows one signiﬁcantdiﬀerence between dynamic skyline queries and conventionalskyline queries. Since in dynamic skyline queries the querypoint can be anywhere in the space, an anti-correlated dis-tribution does not necessarily increase the size of the skylineresult for a query (whereas in the traditional skyline, num-ber of skyline results may change signiﬁcantly by changingthe data distribution). Finally, on the real NBA dataset, ouralgorithm can perform skyline queries within milliseconds,and when l = 1 only requires storage cost of about 300MB.
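The remark about anti-correlated data can be illustrated with a small, unoptimized sketch. It assumes the usual definition of the dynamic skyline (each point is mapped to its per-dimension absolute distance from the query, and smaller is better); the function names and the toy dataset are hypothetical and are not taken from our implementation. The conventional skyline is obtained here as the special case of a query at the origin, which is only valid for non-negative coordinates.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse in every dimension, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dynamic_skyline(data: List[Point], q: Point) -> List[Point]:
    """Quadratic-time dynamic skyline of `data` with respect to query point `q`."""
    # Map every point to its vector of per-dimension distances from q.
    mapped = [tuple(abs(x - c) for x, c in zip(p, q)) for p in data]
    # Keep the points whose mapped vector is not dominated by any other mapped vector.
    return [p for p, mp in zip(data, mapped)
            if not any(dominates(other, mp) for other in mapped)]


def static_skyline(data: List[Point]) -> List[Point]:
    """Conventional (min) skyline; assumes all coordinates are non-negative."""
    origin = tuple(0.0 for _ in data[0])
    return dynamic_skyline(data, origin)


# Toy anti-correlated dataset: all points lie on the line x + y = 6.
anti = [(1.0, 5.0), (2.0, 4.0), (3.0, 3.0), (4.0, 2.0), (5.0, 1.0)]
```

On this toy dataset, `static_skyline(anti)` returns all five points, whereas `dynamic_skyline(anti, (3.0, 3.0))` returns only `(3.0, 3.0)`: the size of a dynamic skyline depends on where the query falls, not only on the shape of the distribution.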

9. RELATED WORK

Plain-text Skyline Queries. The skyline query was first discussed in [17], and gained significant attention following the more recent work in [1]. Variations of the query under different scenarios have been extensively studied [8, 25, 27, 4, 19, 21, 31, 16]. The dynamic skyline query was formalized in [8], although general skyline algorithms [25] were able to answer dynamic skyline queries before that.

Closer to our work are algorithms focusing on continuous skyline queries for location-based services [18, 15, 20, 5]. These algorithms find ranges where the answer to the query does not change, and incrementally update the skyline when the answer does change. These algorithms exploit spatio-temporal coherence, which focuses on how the query and objects move over time to determine how the result of the skyline query evolves. This creates a fundamentally different problem, as the dominance relationship in our problem is defined differently, and there is no assumption on how queries change over time. As an example, observe that a data point far from the query (measured by Euclidean distance) will be spatially dominated [27] by closer points, but this is not necessarily the case with our dynamic skyline problem. In fact, this lack of locality is the main challenge in materializing dynamic skyline queries, as data points far from a query point can impact the result.

The work in [23] studied, independently from us, how to materialize the result of dynamic skyline queries. In [23], the space is partitioned into a grid, and grid cells with the same skyline result are merged. This leads to redundant time and space utilization, although the outcome is similar to our skyline tile concept discussed in Sec. 3. The authors of [23] proposed various methods for merging the grid cells to obtain the skyline tiles. Our method of finding skyline regions allows us to directly create exact skyline tiles. More importantly, [23] ignores the large storage cost of full materialization, which is the motivation for our contributions in Secs. 4 and 5. In contrast to our approach, [23] always performs full materialization, which is highly impractical due to the storage cost.
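The grid-then-merge idea can be sketched as follows for 2D data. This is only an illustration of the general principle, not the algorithm of [23] nor our tile construction from Sec. 3. It relies on one observation: the order of |a - q| and |b - q| along a dimension can only change when q crosses the midpoint (a + b)/2, so the dynamic skyline is constant inside each cell of the grid induced by all pairwise midpoints. The function names, the bounding range `lo`/`hi`, and the choice to group cells by result without checking adjacency are assumptions made for brevity.

```python
from collections import defaultdict
from itertools import combinations


def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dynamic_skyline_ids(data, q):
    """Indices of the points in the dynamic skyline of `data` with respect to q."""
    mapped = [tuple(abs(x - c) for x, c in zip(p, q)) for p in data]
    return frozenset(i for i, mp in enumerate(mapped)
                     if not any(dominates(o, mp) for o in mapped))


def breakpoints(values):
    """Midpoints of all pairs of distinct coordinates along one dimension."""
    return sorted({(a + b) / 2 for a, b in combinations(values, 2) if a != b})


def cell_centers(cuts, lo, hi):
    """Centers of the intervals obtained by cutting [lo, hi] at the given breakpoints."""
    edges = [lo] + [c for c in cuts if lo < c < hi] + [hi]
    return [(a + b) / 2 for a, b in zip(edges, edges[1:])]


def materialize(data, lo=0.0, hi=10.0):
    """Map each distinct skyline result to the centers of the grid cells that share it."""
    xs = cell_centers(breakpoints([p[0] for p in data]), lo, hi)
    ys = cell_centers(breakpoints([p[1] for p in data]), lo, hi)
    regions = defaultdict(list)
    for cx in xs:
        for cy in ys:
            # One skyline evaluation per cell: this is the redundant work
            # that a direct tile construction avoids.
            regions[dynamic_skyline_ids(data, (cx, cy))].append((cx, cy))
    return regions


regions = materialize([(2.0, 2.0), (8.0, 8.0)])
```

For the two points above the grid has four cells but only three distinct results; the two cells that return both points are diagonal to each other, which also hints at the lack of locality discussed earlier. Since the number of midpoints grows quadratically with the data size, the number of cells (and skyline evaluations) grows quickly, which is why evaluating every cell is wasteful.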

Secure Skyline Queries. The powerful trend towards outsourcing data storage and querying [11] led to a significant body of research on querying encrypted data. Most of this work focused on nearest-neighbor (NN) queries [12, 9, 13], culminating with the work in [30], which showed that the most secure and efficient way to answer NN queries on encrypted data is through materialization of results and encryption of the resulting structure. Our work follows a similar model, but we tackle the dynamic skyline query, for which result materialization is much more challenging.

The problem of securely answering skyline queries only received attention very recently. The work in [6] was one of the first to address outsourced skyline queries, but it only focused on authentication of results, not data confidentiality. Skyline queries on encrypted data were first considered in [24], where multiple parties engage in an interactive protocol to execute skyline queries on their joint datasets. Each party has access to its own data in plaintext (e.g., there are multiple data owners, and they all must be online at query time). This setting is considerably less challenging than ours. The work in [22] considers a model with two non-colluding servers that engage in a secure multi-party computation protocol to determine the result of the skyline query. The solution is slow, and the assumption that two non-colluding SPs agree to jointly offer a secure skyline service may not be feasible in practice. Finally, the recent work in [14] provides a single-SP solution based on fast secure permutations and comparisons. However, the solution relies on bilinear map pairings, which are notoriously expensive. The experimental results in [14] show performance numbers only up to 200 data points.

10. CONCLUSION

We proposed the first approach to use query result materialization for answering dynamic skyline queries on encrypted data. Compared to existing work on secure nearest-neighbors, the problem we tackle is much more complex, due to the fact that pre-computing skyline results lacks the locality property that allows NN solutions to be so efficient. We provided an in-depth theoretical analysis of skyline result materialization, and investigated extensively the trade-off that emerges between computational and storage costs. Our proposed heuristics are able to build the result materialization structure in reasonable time, while keeping the storage overhead at practical levels. Our ability to create balanced result materialization structures helps minimize the amount of leakage. In future work, we plan to study additional heuristics that can further reduce construction time. In addition, we will focus on the challenging problem of supporting incremental updates to skyline result materialization, so we can efficiently handle fast-changing datasets.

11. REFERENCES

[1] S. Borzsony, D. Kossmann, and K. Stocker. The skyline operator. In Proceedings 17th international conference on data engineering, pages 421–430. IEEE, 2001.
[2] S. Bothe, A. Cuzzocrea, P. Karras, and A. Vlachou. Skyline query processing over encrypted data: An attribute-order-preserving free approach. In Proc. of Intl. Workshop Privacy and Security for Big Data, pages 37–43, 2014.
[3] C. Buchta. On the average number of maxima in a set of vectors. Information Processing Letters, 33(2):63–65, 1989.
[4] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. K. Tung, and Z. Zhang. Finding k-dominant skylines in high dimensional space. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 503–514, 2006.
[5] M. A. Cheema, X. Lin, W. Zhang, and Y. Zhang. A safe zone based approach for monitoring moving skyline queries. In Proceedings of the 16th international conference on extending database technology, pages 275–286, 2013.
[6] W. Chen, M. Liu, R. Zhang, Y. Zhang, and S. Liu. Secure outsourced skyline query processing via untrusted cloud service providers. In Proceedings of Annual IEEE Intl. Conf. on Computer Communications, 2016.
[7] N. Chenette, K. Lewi, S. A. Weis, and D. J. Wu. Practical order-revealing encryption with limited leakage. In International Conference on Fast Software Encryption, pages 474–493. Springer, 2016.
[8] E. Dellis and B. Seeger. Efficient computation of reverse skyline queries. In VLDB, volume 7, pages 291–302, 2007.
[9] Y. Elmehdwi, B. K. Samanthula, and W. Jiang. Secure k-nearest neighbor query over encrypted data in outsourced environments. In IEEE International Conference on Data Engineering (ICDE), pages 664–675, 2013.
[10] G. Ghinita, P. Kalnis, A. Khoshgozaran, C. Shahabi, and K. L. Tan. Private queries in location based services: Anonymizers are not necessary. In Proceedings of International Conference on Management of Data (ACM SIGMOD), 2008.
[11] H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra. Executing sql over encrypted data in the database-service-provider model. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 216–227, 06 2002.
[12] T. Hashem, L. Kulik, and R. Zhang. Privacy preserving group nearest neighbor queries. In IEEE International Conference on Data Engineering (ICDE), pages 489–500, 01 2010.
[13] H. Hu, J. Xu, C. Ren, and B. Choi. Processing private queries over untrusted data cloud through privacy homomorphism. In IEEE International Conference on Data Engineering (ICDE), pages 601–612, 2011.
[14] J. Hua, H. Zhu, F. Wang, X. Liu, R. Lu, H. Li, and Y. Zhang. Cinema: Efficient and privacy-preserving online medical primary diagnosis with skyline query. IEEE Internet of Things, 6(2):1450–1551, 2019.
[15] Z. Huang, H. Lu, B. C. Ooi, and A. K. Tung. Continuous skyline queries for moving objects. IEEE transactions on knowledge and data engineering, 18(12):1645–1658, 2006.
[16] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In Proc. of Very Large Data Bases, pages 275–286, 2002.
[17] H.-T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. Journal of the ACM (JACM), 22(4):469–476, 1975.
[18] M.-W. Lee and S.-w. Hwang. Continuous skylining on volatile moving data. In , pages 1568–1575. IEEE, 2009.
[19] X. Lian and L. Chen. Monochromatic and bichromatic reverse skyline search over uncertain databases. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 213–226, 2008.
[20] X. Lin, J. Xu, and H. Hu. Range-based skyline queries in mobile environments. IEEE Transactions on Knowledge and Data Engineering, 25(4):835–849, 2011.
[21] J. Liu, L. Xiong, J. Pei, J. Luo, and H. Zhang. Finding pareto optimal groups: Group-based skyline. Proceedings of the VLDB Endowment, 8(13):2086–2097, 2015.
[22] J. Liu, J. Yang, L. Xiong, and J. Pei. Secure and efficient skyline queries on encrypted data. IEEE Transactions on Knowledge and Data Engineering, 31(7):1397–1411, 2018.
[23] J. Liu, J. Yang, L. Xiong, J. Pei, and J. Luo. Skyline diagram: Finding the voronoi counterpart for skyline queries. In , pages 653–664. IEEE, 2018.
[24] X. Liu, R. Lu, J. Ma, L. Chen, and H. Bao. Efficient and privacy-preserving skyline computation framework across domains. Future Generation Computer Systems, 62:161–174, 2016.
[25] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 467–478, 2003.
[26] R. A. Popa, F. H. Li, and N. Zeldovich. An ideal-security protocol for order-preserving encoding. In , pages 463–477. IEEE, 2013.
[27] M. Sharifzadeh and C. Shahabi. The spatial skyline queries. In Proceedings of the 32nd international conference on Very large data bases, pages 751–762. Citeseer, 2006.
[28] E. Stefanov, M. V. Djik, E. Shi, T. H. Chan, C. Fletcher, L. Ren, X. Yu, and S. Devadas. Path oram: An extremely simple oblivious ram protocol. Journal of the ACM, 65(4), April 2018.
[29] E. Stefanov, E. Shi, and D. Song. Towards practical oblivious ram. In Proc. of Network and Distributed System Security Symposium (NDSS), 2012.
[30] B. Yao, F. Li, and X. Xiao. Secure nearest neighbor revisited. In Proc. of Intl. Conf. on Data Engineering, pages 733–744, 2013.
[31] W. Yu, Z. Qin, J. Liu, L. Xiong, X. Chen, and H. Zhang. Fast algorithms for pareto optimal group-based skyline. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 417–426, 2017.

Figure 13: Reduction from PVC to TAP. (a) A planar graph; (b) corresponding TAP instance.

APPENDIX

A. PROOFS

Proof of Theorem 1. We provide a polynomial-time reduction from an instance of planar vertex cover to an instance of TAP.

Definition 10. Planar Vertex Cover (PVC). Given a planar graph $G = (E, V)$, find a set of vertices of minimum cardinality, $S \subseteq V$, such that for every edge $e \in E$ there exists a vertex $v \in S$ where $e$ is incident on $v$.

Recall that an instance of PVC, $I_{PVC}$, is defined by a graph $G$, whereas an instance of TAP, $I_{TAP}$, is defined by a set of tiles $T$, a database $D$ and an integer $k$, such that the content of each tile in $T$ is a subset of $D$. Thus, we describe a polynomial-time algorithm that returns $T$ and $k$ given a graph $G$, such that by solving $I_{TAP}$ optimally we can solve $I_{PVC}$ optimally as well.

Our reduction works as follows. For each edge in $E$ we create a corresponding tile in $T$ (therefore $|T| = |E|$). For each vertex in $V$ and each edge in $E$ we create a corresponding data point (to be assigned to the tiles; therefore, $|D| = |E| + |V|$).

The reduction has two steps. First, we construct the tiles, and then we assign points to the tiles. The construction of the tiles is done so that, for any set of edges $B$ incident on a given vertex, the aggregation of the tiles corresponding to $B$ is location-wise valid. We show below that such a set of tiles can be constructed in polynomial time.

Lemma 1. Given a planar graph $G = (E, V)$, we can construct, in polynomial time, a set of tiles such that for any set of edges $B$ incident on a given vertex, the aggregation of the tiles corresponding to $B$ is location-wise valid.

Now consider allocating points to the tiles. Consider two sets of points: $D_V = \{p_1, p_2, ..., p_{|V|}\}$, where $p_i$ is a point corresponding to the vertex $v_i$, and $D_E = \{x_1, x_2, ..., x_{|E|}\}$, where $x_i$ is a point corresponding to the edge $e_i$. Let $D = D_E \cup D_V$. First, for each tile, we insert the entire set $D_E$ as its content. Furthermore, for a tile $t_i$ corresponding to the edge $e_i = (v_x, v_y)$ we insert the set $D_V \setminus \{p_x, p_y\}$ into its content. Therefore, each tile contains exactly $|D| - 2$ points. The purpose of adding points from $D_V$ to each tile is to control which tile aggregations are cardinality-wise valid. The intuition behind adding $D_E$ to all the tiles is simply to increase the size of the content of each tile. Doing this forces TAP to select the fewest number of aggregations and helps us translate TAP's objective (which is in terms of the total number of points stored) to PVC's objective (which is in terms of the total number of aggregations performed), as shown below. Fig. 13 shows the reduction from an instance of PVC to an instance of TAP.

Finally, set $k = 1$. This enforces that the aggregation of two tiles that do not correspond to incident edges is cardinality-wise invalid. This is because any such aggregation will always have exactly $|D|$ points, since it will contain $D_E \cup (D_V \setminus \{p_x, p_y\}) \cup (D_V \setminus \{p_{x'}, p_{y'}\})$. Note that $\{p_{x'}, p_{y'}\} \subseteq D_V \setminus \{p_x, p_y\}$ and therefore $(D_V \setminus \{p_x, p_y\}) \cup (D_V \setminus \{p_{x'}, p_{y'}\})$ contains both $p_{x'}$ and $p_{y'}$, as well as $D_V \setminus \{p_{x'}, p_{y'}\}$, which implies that it contains the entire $D_V$. Moreover, the aggregation of two tiles corresponding to incident edges is cardinality-wise valid.
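The point-allocation step and the two cardinality claims can be checked mechanically on a small graph. The sketch below is illustrative only: the function names are hypothetical, the tile geometry (location-wise validity, Lemma 1) is ignored, and the validity test encodes an assumption about the meaning of $k$, namely that an aggregation is cardinality-wise valid when it adds at most $k$ points beyond the content of each of its member tiles.

```python
def build_tap_instance(vertices, edges):
    """Contents of the tiles produced by the reduction, one tile per edge."""
    d_v = {("p", v) for v in vertices}   # one point per vertex (the set D_V)
    d_e = {("x", e) for e in edges}      # one point per edge (the set D_E)
    database = d_v | d_e
    # Tile of edge (u, w): all of D_E plus D_V without the two endpoint points.
    tiles = {e: d_e | (d_v - {("p", e[0]), ("p", e[1])}) for e in edges}
    return database, tiles


def aggregated_content(tiles, group):
    """Union of the contents of the tiles in `group`."""
    merged = set()
    for e in group:
        merged |= tiles[e]
    return merged


def cardinality_valid(tiles, group, k):
    """Assumed test: the aggregation adds at most k extra points to every member tile."""
    merged = aggregated_content(tiles, group)
    return all(len(merged - tiles[e]) <= k for e in group)


# Path graph a - b - c - d: |V| = 4, |E| = 3, so |D| = 7.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
database, tiles = build_tap_instance(vertices, edges)

incident = [("a", "b"), ("b", "c")]       # both edges share vertex b
non_incident = [("a", "b"), ("c", "d")]   # no shared vertex
```

On this path graph every tile holds $|D| - 2 = 5$ points, the aggregation of the two edges sharing vertex `b` holds $|D| - 1 = 6$ points and passes the assumed test with $k = 1$, and the aggregation of the two disjoint edges holds all $|D| = 7$ points and fails it.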
The aggregation of two tiles corresponding to incident edges is valid because it contains $(D_V \setminus \{p_x, p_y\}) \cup (D_V \setminus \{p_x, p_{y'}\})$, which is equal to $D_V \setminus \{p_x\}$.

Now consider any feasible solution $S$ to the TAP problem. The solution contains aggregations of two types: aggregations that contain more than one tile, and aggregations that contain exactly one tile. For each aggregation consisting of exactly one tile $t_i$, consider any one of the two endpoints of $e_i$, say $v_x$, and let $C_1 = \bigcup \{v_x\}$. For aggregations with more than one tile, all the tiles must correspond to edges incident on a particular vertex $v_x$; let $C_2 = \bigcup_x \{v_x\}$. Consider the set $C_S = C_1 \cup C_2$. Note that $C_S$ is a vertex cover of $G$. This is because all tiles are part of some aggregation, and the edge corresponding to each tile is covered by the vertex corresponding to the aggregation. In general, for any feasible solution $S$ to TAP we denote the corresponding feasible solution to PVC as $C_S$. Note that we have $|S| = |C_S|$. Therefore, $|S^*| = |C_{S^*}| \geq |C^*|$.

Next, we show that $|S^*| \leq |S_{C^*}| = |C^*|$. Observe that for any feasible solution $C$ to PVC, we can construct a feasible solution $S$ to TAP by just taking, for each vertex in $C$, the aggregation of all tiles corresponding to edges incident on that vertex. We denote by $S_C$ the corresponding solution to TAP for a solution $C$ to PVC. Let the cost of $S$, denoted by $c(S)$ for a solution $S$ to TAP, be the value of the objective function for the solution $S$. First note that since $S^*$ is optimal, $c(S^*) \leq c(S_{C^*})$. Note that $S^* = S^*_1 \cup S^*_2$, where $S^*_1$ contains the aggregations that contain exactly one tile and $S^*_2$ contains the rest of the aggregations. Then

$$c(S^*) = |S^*_1|(|D| - 2) + |S^*_2|(|D| - 1) = (|S^*_1| + |S^*_2|)(|D| - 1) - |S^*_1|$$

and similarly, with $S_{C^*,1}$ and $S_{C^*,2}$ defined analogously for $S_{C^*}$,

$$c(S_{C^*}) = (|S_{C^*,1}| + |S_{C^*,2}|)(|D| - 1) - |S_{C^*,1}|.$$

Now assume that $|S_{C^*}| < |S^*|$, i.e., $|S^*| - |S_{C^*}| \geq 1$. Because $c(S^*) \leq c(S_{C^*})$, we have that

$$(|S^*_1| + |S^*_2|)(|D| - 1) - |S^*_1| \leq (|S_{C^*,1}| + |S_{C^*,2}|)(|D| - 1) - |S_{C^*,1}|$$

$$|S^*|(|D| - 1) - |S^*_1| \leq |S_{C^*}|(|D| - 1) - |S_{C^*,1}|$$

Therefore,

$$|S^*_1| - |S_{C^*,1}| \geq |S^*|(|D| - 1) - |S_{C^*}|(|D| - 1) \geq |D| - 1.$$

This is a contradiction, because $|S^*_1| \leq |T| = |E| = |D| - |V| < |D| - 1$ whenever the graph has at least two vertices. Hence $|C^*| = |S_{C^*}| \geq |S^*|$. Finally, we get that $|C^*| = |S^*| = |C_{S^*}|$. Therefore, the PVC solution corresponding to $S^*$ is an optimal solution to PVC.

Figure 14: Construction of tiles. (a) Before assigning tiles; (b) after assigning tiles; (c) redrawing boundaries; (d) final tiles.

Proof of Lemma 1. The construction has two steps. The first step is assigning tiles, and it works as follows: (1) draw the planar graph on a plane with all edges as straight lines and draw a minimum bounding rectangle around the graph, (2) arbitrarily assign each edge to one of the two faces of the graph that the edge is a border of, (3) split faces with more than one corresponding edge, (4) if a face has no assigned edge, remove one of its edges, and (5) move the boundaries. Step (1) is straightforward and can be done by Fáry's theorem [X]. Step (2) is also straightforward, and each edge is a border of at most two faces because the graph is planar. For step (3), consider a point inside a face with more than one edge assigned to it, say edges $e_i$ and $e_j$. Connect that point to both ends of $e_i$ (or $e_j$; the choice of the edge can be made arbitrarily). This creates exactly one new face (because the graph is planar). Assign $e_i$ to this newly created face and remove it from its old assignment. Repeat this process until all faces have at most one point. For example, see t in Fig. 14. For step (4), any bordering edge of a face with no points assigned can be removed. The resulting faces will have exactly one point left in them. For example, see t8