Secure and Efficient Skyline Queries on Encrypted Data
Jinfei Liu, Member, IEEE, Juncheng Yang, Member, IEEE, Li Xiong, Member, IEEE, and Jian Pei, Fellow, IEEE
Abstract—Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically-secure encryption. As a key subroutine, we present a new secure dominance protocol, which can be also used as a building block for other queries. Furthermore, we demonstrate two optimizations, data partitioning and lazy merging, to further reduce the computation load. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
Index Terms—Skyline, Secure, Efficient, Parallel, Semi-honest.

1 Introduction

As an emerging computing paradigm, cloud computing attracts increasing attention from both research and industry communities. Outsourcing data and computation to cloud server provides a cost-effective way to support large scale data storage and query processing. However, due to security and privacy concerns, sensitive data need to be protected from the cloud server as well as other unauthorized users.

Fig. 1: Secure similarity queries.

A common approach to protect the confidentiality of outsourced data is to encrypt the data (e.g., [15], [34]). To protect the confidentiality of the query from cloud server, authorized clients also send encrypted queries to the cloud server. Figure 1 illustrates our problem scenario of secure query processing over encrypted data in the cloud. The data owner outsources encrypted data to the cloud server. The cloud server processes encrypted queries from the client on the encrypted data and returns the query result to the client. During the query processing, the cloud server should not gain any knowledge about the data, data patterns, query, and query result.

• Jinfei Liu, Juncheng Yang, and Li Xiong are with the Department of Mathematics and Computer Science, Emory University. E-mail: {jinfei.liu, juncheng.yang, lxiong}@emory.edu
• Jian Pei is with School of Computing Science, Simon Fraser University. E-mail: [email protected]
Received XXXXXX; revised XXXXXX.
Fully homomorphic encryption schemes [15] ensure strong security while enabling arbitrary computations on the encrypted data. However, the computation cost is prohibitive in practice. Trusted hardware such as Intel's Software Guard Extensions (SGX) brings a promising alternative, but still has limitations in its security guarantees [10]. Many techniques (e.g., [18], [40]) have been proposed to support specific queries or computations on encrypted data with varying degrees of security guarantee and efficiency (e.g., by weaker encryptions). Focusing on similarity search, secure k-nearest neighbor (kNN) queries, which return the k most similar (closest) records given a query record, have been extensively studied [12], [21], [42], [44].

In this paper, we focus on the problem of secure skyline queries on encrypted data, another type of similarity search important for multi-criteria decision making. The skyline or Pareto of a multi-dimensional dataset given a query point consists of the data points that are not dominated by other points. A data point dominates another if it is closer to the query point in at least one dimension and at least as close to the query point in every other dimension. The skyline query is particularly useful for selecting similar (or best) records when a single aggregated distance metric with all dimensions is hard to define. The assumption of kNN queries is that the relative weights of the attributes are known in advance, so that a single similarity metric can be computed between a pair of records aggregating the similarity between all attribute pairs. However, this assumption does not always hold in practical applications. In many scenarios, it is desirable to retrieve similar records considering all possible relative weights of the attributes (e.g., considering only one attribute, or an arbitrary combination of attributes), which is essentially the skyline or the "pareto-similar" records.
Motivating Example.
Consider a hospital that wishes to outsource its electronic health records to the cloud, with the data encrypted to ensure data confidentiality. Let P denote a sample heart disease dataset with attributes ID, age, and trestbps (resting blood pressure). We sampled four patient records p_1, ..., p_4 from the heart disease dataset of the UCI machine learning repository [23], as shown in Table 1(a) and Figure 2. Consider a physician who is treating a heart disease patient q = (41, 125) and wishes to retrieve the records most similar to q. While it is unclear how to define the attribute weights for kNN queries (p_1 is the nearest if only age is considered, while p_2 and p_3 are the nearest if only trestbps is considered), the skyline provides all pareto-similar records that are not dominated by any other record. The skyline includes all possible 1NN results by considering all possible relative attribute weights, and hence can serve as a filter for users. Given the query q, we can map the data points to a new space with q as the origin and the distance to q as the mapping function. The mapped records t_i[j] = |p_i[j] − q[j]| + q[j] on each dimension j are shown in Table 1(b) and also in Figure 2. It is easy to see that t_1 and t_2 are the skyline in the mapped space, which means p_1 and p_2 are the skyline with respect to query q.

Our goal is for the cloud server to compute the skyline query given q on the encrypted data without revealing the data, the query q, the final result set {p_1, p_2}, or any intermediate result (e.g., t_2 dominates t_3) to the cloud. We note that skyline computation (with the query point at the origin) is a special case of skyline queries.

TABLE 1: Sample of heart disease dataset.

(a) Original data.
ID    age   trestbps
p_1   40    140
p_2   39    120
p_3   45    130
p_4   37    140

(b) Mapped data.
ID    age   trestbps
t_1   42    140
t_2   43    130
t_3   45    130
t_4   45    140
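Fig. 2: Dynamic skyline query. [Plot of trestbps against age showing the query point q, the original points p_1, ..., p_4, and the mapped points t_1, ..., t_4; p_3 and t_3 coincide.]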
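The motivating example can be checked with a few lines of plaintext Python. This is only an illustrative sketch of the mapping and the dominance test on unencrypted data, not part of the secure protocol; the names map_point, dominates, and dynamic_skyline are hypothetical, and the query value q = (41, 125) is the one consistent with Table 1.

```python
# Points from Table 1(a); attributes are (age, trestbps).
points = {"p1": (40, 140), "p2": (39, 120), "p3": (45, 130), "p4": (37, 140)}
q = (41, 125)  # query tuple consistent with the mapped values in Table 1(b)


def map_point(p, q):
    """Map p to the space of Table 1(b): t[j] = |p[j] - q[j]| + q[j]."""
    return tuple(abs(pj - qj) + qj for pj, qj in zip(p, q))


def dominates(a, b):
    """a dominates b: no larger in every dimension, strictly smaller in one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def dynamic_skyline(points, q):
    """Return the names of the points not dominated in the mapped space."""
    mapped = {name: map_point(p, q) for name, p in points.items()}
    return sorted(
        name
        for name, t in mapped.items()
        if not any(dominates(u, t) for other, u in mapped.items() if other != name)
    )


mapped = {name: map_point(p, q) for name, p in points.items()}
skyline = dynamic_skyline(points, q)
print(mapped)
print(skyline)
```

The quadratic pairwise check is chosen for readability; it follows the definition directly rather than any optimized skyline algorithm.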
Challenges.
Designing a fully secure protocol for skyline queries over encrypted data presents significant challenges due to the complex comparisons and computations. Let P denote a set of n tuples p_1, ..., p_n with m attributes and let q denote the input query tuple. In kNN queries, we only need to compute the distances between each tuple p_i and the query tuple q and then choose the k tuples corresponding to the k smallest distances. In skyline queries, for each tuple p_i, we need to compare it with all other tuples to check dominance. For each comparison between two tuples p_a and p_b, we need to compare all their m attributes, and for the comparison of each attribute p[j], there are three different outputs, i.e., p_a[j] < (=, >) p_b[j]. Therefore, there are 3^m different outputs for each comparison between two tuples, based on which we need to determine if one tuple dominates the other. How to identify the 2^m − 1 outputs in which p_a dominates p_b efficiently while protecting intermediate results (e.g., whether two attribute values are the same) is particularly challenging.

Such complex comparisons and computations require more complex protocol design in order to carry out the computations on the encrypted data given an encryption scheme with semantic security (instead of weaker order-preserving or other property-preserving encryptions). In addition, the extensive intermediate results mean that more indirect information about the data can potentially be revealed (e.g., which tuple dominates which other, whether there are duplicate tuples or equivalent attribute values) even if the exact data is protected. This makes it challenging to design a fully secure and efficient skyline query protocol in which the cloud should not gain any knowledge about the data, including indirect data patterns.

Contributions.
We summarize our contributions as follows.

• We study the secure skyline problem on encrypted data with semantic security for the first time. We assume the data is encrypted using the Paillier cryptosystem, which provides semantic security and is partially homomorphic.
• We propose a fully secure dominance protocol, which can be used as a building block for skyline queries as well as other queries, e.g., reverse skyline queries [11] and k-skyband queries [35].
• We present two secure skyline query protocols. The first one, serving as a basic and efficient solution, leaks some indirect data patterns to the cloud server. The second one is fully secure and ensures that the cloud gains no knowledge about the data, including indirect patterns. The proposed protocols exploit the partial (additive) homomorphism as well as novel permutation and perturbation techniques to ensure the correct result is computed while guaranteeing privacy. We provide security and complexity analysis of the proposed protocols.
• Compared with our conference version [31], we present two new optimizations, data partitioning and lazy merging, to further reduce the computation load. For the data partitioning, we theoretically analyze the optimal number of partitions given the number of points, the expected number of output skyline points, the number of decomposed bits, and the number of dimensions. In addition, we propose a lazy merging scheme that aims to reduce computation overhead due to the smaller partition sizes at the later stage of the partitioning scheme.
• We also provide a complete implementation including both serial and parallelized versions which can be deployed in practical settings. We empirically study the efficiency and scalability of the implementations under different parameter settings, verifying the feasibility of our proposed solutions.

Organization.
The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 introduces background definitions as well as our problem setting. The security subprotocols for general functions that will be used in our secure skyline protocol are introduced in Section 4. The key subroutine of the secure skyline protocols, the secure dominance protocol, is shown in Section 5. The complete secure skyline protocols are presented in Section 6. We illustrate two optimizations to further reduce the computation load in Section 7. We report the experimental results and findings in Section 8. Section 9 concludes the paper.

2 Related Work

Skyline.
The skyline computation problem was first studied in the computational geometry field [3], [26], where the focus was on worst-case time complexity. The works [24], [30] proposed output-sensitive algorithms achieving O(n log k) in the worst case, where k is the number of skyline points, which is far less than n in general.

Since the introduction of the skyline operator by Börzsönyi et al. [5], skyline has been extensively studied in the database field. Kossmann et al. [25] studied the progressive algorithm for skyline queries. Different variants of the skyline problem have been studied (e.g., subspace skyline [8], uncertain skyline [33], [37], group-based skyline [27], [29], [46], skyline diagram [32]).

Secure query processing on encrypted data.
Fully homomorphic encryption schemes [15] enable arbitrary computations on encrypted data. Even though it has been shown [15] that such encryption schemes can be built with polynomial time, they remain far from practical even with state-of-the-art implementations [19].

Many techniques (e.g., [18], [40]) have been proposed to support specific queries or computations on encrypted data with varying degrees of security guarantee and efficiency (e.g., by weaker encryptions). We are not aware of any formal work on secure skyline queries over encrypted data with semantic security. Bothe et al. [6] and their demo version [7] illustrated an approach to skyline queries on so-called "encrypted" data without any formal security guarantee. Another work [9] studied the verification of skyline query results returned by an untrusted cloud server.

The most closely related work is on secure kNN queries [12], [20], [21], [36], [38], [42], [44], [45], which we discuss in more detail here. Wong et al. [42] proposed a new encryption scheme called asymmetric scalar-product-preserving encryption. In their work, data and query are encrypted using slightly different encryption schemes and all clients know the private key. Hu et al. [21] proposed a method based on a provably secure privacy homomorphism encryption scheme. However, both schemes are vulnerable to chosen-plaintext attacks, as illustrated by Yao et al. [44]. Yao et al. [44] proposed a new method based on secure Voronoi diagrams. Instead of asking the cloud server to retrieve the exact kNN result, their method retrieves a relevant encrypted partition that is guaranteed to contain the kNN of the query point. Hashem et al. [20] identified the challenges in preserving user privacy for group nearest neighbor queries and provided a comprehensive solution to this problem. Yi et al. [45] proposed solutions for secure kNN queries based on the oblivious transfer paradigm. Recently, Elmehdwi et al. [12] proposed a secure kNN query protocol on data encrypted using the Paillier cryptosystem that ensures data privacy and query privacy, as well as low (or no) computation overhead on the client and data owner, using two non-colluding cloud servers. Our work follows this setting and addresses skyline queries.

Other works studied kNN queries in the secure multi-party computation (SMC) setting [38] (data is distributed between two parties who want to cooperatively compute the answers without revealing their private data to each other), or in the private information retrieval (PIR) setting [36] (the query is private while the data is public), which are different from our setting.

Secure Multi-party Computation (SMC).
SMC was first proposed by Yao [43] for the two-party setting and then extended by Goldreich et al. [17] to the multi-party setting. SMC refers to the problem where a set of parties with private inputs wish to compute some joint function of their inputs. There are techniques such as garbled circuits [22] and secret sharing [2] that can be used for SMC. In this paper, all protocols assume a two-party setting, but one different from the traditional SMC setting. Namely, we have party C_1 with encrypted input and party C_2 with the private key sk. The goal is for C_1 to obtain an encrypted result of a function on the input without disclosing the original input to either C_1 or C_2.

3 Preliminaries and Problem Definitions

In this section, we first illustrate some background knowledge on skyline computation and dynamic skyline query, and then describe the security model we use in this paper. For reference, a summary of notations is given in Table 2.

TABLE 2: The summary of notations.
Notation     Definition
P            dataset of n points/tuples/records
p_i[j]       the j-th attribute of p_i
q            query tuple of client
n            number of points in P
m            number of dimensions
k            number of skyline points
l            number of bits
K            key size
pk/sk        public/private key
⟦a⟧          encrypted vector of the individual bits of a
â            binary bit
(a)_B^(i)    the i-th bit of binary number a

Definition 1. (Skyline). Given a dataset P = {p_1, ..., p_n} in m-dimensional space, let p_a and p_b be two different points in P. We say p_a dominates p_b, denoted by p_a ≺ p_b, if for all j, p_a[j] ≤ p_b[j], and for at least one j, p_a[j] < p_b[j], where p_i[j] is the j-th dimension of p_i and 1 ≤ j ≤ m. The skyline points are those points that are not dominated by any other point in P.

Definition 2. (Dynamic Skyline Query) [11]. Given a dataset P = {p_1, ..., p_n} and a query point q in m-dimensional space, let p_a and p_b be two different points in P. We say p_a dynamically dominates p_b with regard to the query point q, denoted by p_a ≺ p_b, if for all j, |p_a[j] − q[j]| ≤ |p_b[j] − q[j]|, and for at least one j, |p_a[j] − q[j]| < |p_b[j] − q[j]|, where p_i[j] is the j-th dimension of p_i and 1 ≤ j ≤ m. The skyline points are those points that are not dynamically dominated by any other point in P.

The traditional skyline definition is a special case of the dynamic skyline query in which the query point is the origin. On the other hand, a dynamic skyline query is equivalent to traditional skyline computation if we map the points to a new space with the query point q as the origin and the absolute distances to q as mapping functions. So the protocols we present in this paper also work for traditional skyline computation (without an explicit query point).

Example 1.
Consider Table 1 and Figure 2 as a running example. Given data points p_1 to p_4 and query point q, the mapped data points are computed as t_i[j] = |p_i[j] − q[j]| + q[j]. We see that t_1, t_2 are the skyline in the mapped space, and p_1, p_2 are the skyline with respect to query q in the original space.

Skyline computation has been extensively studied, as we discussed in Section 2. We illustrate an iterative skyline computation algorithm (Algorithm 1) which will be used as the basis of our secure skyline protocol. We note that this is not the most efficient algorithm to compute the skyline on plaintext compared to the divide-and-conquer algorithm [26]. We construct our secure skyline protocol based on this algorithm for two reasons: 1) the divide-and-conquer approach is less suitable, if not impossible, for a secure implementation compared to the iterative approach, and 2) the performance of the divide-and-conquer algorithm deteriorates with the "curse of dimensionality".

The general idea of Algorithm 1 is to first map the data points to the new space with the query point as origin (Lines 1-3). Given the new data points, it computes the sum of all attributes for each tuple, S(t_i) (Line 6), and chooses the tuple t_min with the smallest S(t_i) as a skyline point because no other tuple can dominate it. It then deletes those tuples dominated by t_min. The algorithm repeats this process for the remaining tuples until the dataset T is empty.

Algorithm 1: Skyline Computation.
input : A dataset P and a query q.
output: Skyline of P.
1  for i = 1 to n do
2      for j = 1 to m do
3          t_i[j] = |p_i[j] − q[j]|;
4  while the dataset T is not empty do
5      for i = 1 to size of dataset T do
6          S(t_i) = Σ_{j=1}^{m} t_i[j];
7      choose the tuple t_min with smallest S(t_i) as a skyline;
8      add corresponding tuple p_min to the skyline pool;
9      delete those tuples dominated by t_min from T;
10     delete tuple t_min from T;
11 return skyline pool;

Example 2.
Given the mapped data points t_1, ..., t_4, we begin by computing the attribute sum for each tuple as S(t_1) = 16, S(t_2) = 7, S(t_3) = 9, and S(t_4) = 19. We choose the tuple with the smallest S(t_i), i.e., t_2, as a skyline tuple, delete t_2 from dataset T, and add p_2 to the skyline pool. We then delete tuples t_3 and t_4 from T because they are dominated by t_2. Now, there is only t_1 in T. We add p_1 to the skyline pool. After deleting t_1 from T, T is empty and the algorithm terminates. p_1 and p_2 in the skyline pool are returned as the query result.

We now describe our problem setting for secure skyline queries over encrypted data. Consider a data owner (e.g., hospital, CDC) with a dataset P. Before outsourcing the data, the data owner encrypts each attribute of each record p_i[j] using a semantically secure public-key cryptosystem. Fully homomorphic encryption schemes ensure strong security while enabling arbitrary computations on the encrypted data. However, the computation cost is prohibitive in practice. Partially homomorphic encryption is much more efficient but only provides partially (either additive or multiplicative) homomorphic properties. Among them, we chose Paillier [34] mainly due to its additive homomorphic properties, as we employ significantly more additions than multiplications in our protocol. Furthermore, we can also utilize its homomorphic multiplication between ciphertext and plaintext. We use pk and sk to denote the public key and private key, respectively. The data owner sends E_pk(p_i[j]) for i = 1, ..., n and j = 1, ..., m to cloud server C_1.

Consider an authorized client (e.g., physician) who wishes to query the skyline tuples corresponding to query tuple q = (q[1], ..., q[m]). In order to protect the sensitive query tuple, the client uses the same public key pk to encrypt the query tuple and sends E_pk(q) = (E_pk(q[1]), ..., E_pk(q[m])) to cloud server C_1.

Fig. 3: Overview of protocol setting. [Diagram: the data owner holds P, pk, sk and sends E_pk(P) to C_1 and sk to C_2; the client holds q, pk and sends E_pk(q) to C_1; C_1 holds E_pk(P), E_pk(q), pk and C_2 holds pk, sk; each server returns a partial skyline result to the client.]

Our goal is to enable the cloud server to compute and return the skyline to the client without learning any information about the data and the query. In addition to guaranteeing the correctness of the result and the efficiency of the computation, the computation should require no or minimal interaction from the client or the data owner for practicality. To achieve this, we assume there is an additional non-colluding cloud server, C_2, which holds the private key sk shared by the data owner and assists with the computation. This way, the data owner does not need to participate in any computation. The client also does not need to participate in any computation except combining the partial results from C_1 and C_2 into the final result. An overview of the protocol setting is shown in Figure 3.

Adversary Model.
We adopt the semi-honest adversary model in our study. In any multi-party computation setting, a semi-honest party correctly follows the protocol specification, yet attempts to learn additional information by analyzing the transcript of messages received during the execution. By adopting the semi-honest model, this work implicitly assumes that the two cloud servers do not collude.

There are two main reasons to adopt the semi-honest adversary model in our study. First, developing protocols under the semi-honest setting is an important first step towards constructing protocols with stronger security guarantees [22]. Using zero-knowledge proofs [14], these protocols can be transformed into secure protocols under the malicious model. Second, the semi-honest model is realistic in the current cloud market. C_1 and C_2 are assumed to be two cloud servers run by legitimate, well-known companies (e.g., Amazon, Google, and Microsoft). A collusion between them is highly unlikely. Therefore, following the work done in [12], [28], [47], we also adopt the semi-honest adversary model for this paper. Please see Security Definition in the Semi-honest Model and Paillier Cryptosystem in the appendix.

Desired Privacy Properties.
Our security goal is to protect the data and the query as well as the query result from the cloud servers. We summarize the desired privacy properties below. After the execution of the entire protocol, the following should be achieved.

• Data Privacy. Cloud servers C_1 and C_2 know nothing about the exact data except the size pattern; the client knows nothing about the dataset except the skyline query result.
• Data Pattern Privacy. Cloud servers C_1 and C_2 know nothing about the data patterns (indirect data knowledge) due to intermediate results, e.g., which tuple dominates which other tuple.
• Query Privacy. The data owner and cloud servers C_1 and C_2 know nothing about the query tuple q.
• Result Privacy.
Cloud servers C_1 and C_2 know nothing about the query result, e.g., which tuples are in the skyline result.

4 Basic Security Subprotocols

In this section, we present a set of secure subprotocols for computing basic functions on encrypted data that will be used to construct our secure skyline query protocol. All protocols assume a two-party setting, namely, C_1 with encrypted input and C_2 with the private key sk, as shown in Figure 3. The goal is for C_1 to obtain an encrypted result of a function on the input without disclosing the original input to either C_1 or C_2. We note that this is different from the traditional two-party secure computation setting with techniques such as garbled circuits [22], where each party holds a private input and they wish to compute a function of the inputs. For each function, we describe the input and output, and present our proposed protocol or provide a reference if existing solutions are available. Due to limited space, we omit the security proofs, which can be derived from the simulation and composition theorems in a straightforward way. Please see Secure Multiplication (SM), Secure Bit Decomposition (SBD), and Secure Boolean Operations in the appendix.

Secure minimum and secure comparison protocols have been extensively studied in the cryptography community [1], [13], [41] and the database community [12], [47]. A secure comparison protocol can be easily adapted to a secure minimum protocol, and vice versa. For example, if we set E_pk(out) as the result of the secure comparison E_pk(Bool(a ≤ b)) known by cloud server C_1 (it will be E_pk(1) when a ≤ b and E_pk(0) when a > b), C_1 can get E_pk(min(a, b)) by computing E_pk(a ∗ out + b ∗ ¬out).

We analyzed the existing protocols and observed that both secure minimum (SMIN) algorithms [12], [47] from the database community for selecting a minimum have a security weakness, i.e., C_2 can determine whether the two numbers are equal to each other. We point out the security weakness in the appendix. Therefore, we adapted the secure minimum/comparison protocols [41] from the cryptography community in this paper. The basic idea of those protocols is that for any two l-bit numbers a and b, the most significant bit (z_l) of z = 2^l + a − b indicates the relationship between a and b, i.e., z_l = 0 ⇔ a < b. We list the secure minimum/comparison protocols we used in this paper below.

Secure Less Than or Equal (SLEQ).
Assume a cloud server C_1 with encrypted input E_pk(a) and E_pk(b), and a cloud server C_2 with the private key sk, where a and b are two numbers not known to C_1 and C_2. The goal of the SLEQ protocol is to securely compute the encrypted boolean output E_pk(Bool(a ≤ b)), such that only C_1 knows E_pk(Bool(a ≤ b)) and no information related to a and b is revealed to C_1 or C_2.

Secure Equal (SEQ). Assume a cloud server C_1 with encrypted input E_pk(a) and E_pk(b), and a cloud server C_2 with the private key sk, where a and b are two numbers not known to C_1 and C_2. The goal of the SEQ protocol is to securely compute the encrypted boolean output E_pk(Bool(a == b)), such that only C_1 knows E_pk(Bool(a == b)) and no information related to Bool(a == b) is revealed to C_1 or C_2.

Secure Less (SLESS). Assume a cloud server C_1 with encrypted input E_pk(a) and E_pk(b), and a cloud server C_2 with the private key sk, where a and b are two numbers not known to C_1 and C_2. The goal of the SLESS protocol is to securely compute the encrypted boolean output E_pk(Bool(a < b)), such that only C_1 knows E_pk(Bool(a < b)) and no information related to Bool(a < b) is revealed to C_1 or C_2. This can be implemented simply as the conjunction of the SLEQ output with the negation of the SEQ output, since a < b holds exactly when a ≤ b and a ≠ b.

Secure Minimum (SMIN).
Assume a cloud server C_1 with encrypted input E_pk(a) and E_pk(b), and a cloud server C_2 with the private key sk, where a and b are two numbers not known to either party. The goal of the SMIN protocol is to securely compute the encrypted minimum value of a and b, E_pk(min(a, b)), such that only C_1 knows E_pk(min(a, b)) and no information related to a and b is revealed to C_1 or C_2. Benefiting from the probabilistic property of Paillier, the ciphertext of min(a, b), i.e., E_pk(min(a, b)), is different from the ciphertexts of a and b, i.e., E_pk(a) and E_pk(b). Therefore, C_1 does not know which of a or b is min(a, b). In general, assuming C_1 has n encrypted values, the goal of the SMIN protocol is to securely compute the encrypted minimum of the n values.

5 Secure Dominance Protocol

The key to computing the skyline is to compute the dominance relationship between two tuples. Assume a cloud server C_1 with encrypted tuples a = (a[1], ..., a[m]), b = (b[1], ..., b[m]) and a cloud server C_2 with the private key sk, where a and b are not known to either party. The goal of the secure dominance (SDOM) protocol is to securely compute E_pk(Bool(a ≺ b)) such that only C_1 knows E_pk(1) if a ≺ b, otherwise, E_pk(0).

Protocol Design.
Given any two tuples a = (a[1], ..., a[m]) and b = (b[1], ..., b[m]), recall the definition of skyline: we say a ≺ b if for all j, a[j] ≤ b[j] and for at least one j, a[j] < b[j] (1 ≤ j ≤ m). If for all j, a[j] ≤ b[j], we have either a = b or a ≺ b. We refer to this case as a ⪯ b. The basic idea of the secure dominance protocol is to first determine whether a ⪯ b, and then determine whether a = b.

The detailed protocol is shown in Algorithm 2. For each attribute, C_1 and C_2 cooperatively use the secure less than or equal (SLEQ) protocol to compute δ_j = E_pk(Bool(a[j] ≤ b[j])). Then C_1 and C_2 cooperatively use SAND to compute Φ = δ_1 ∧ ... ∧ δ_m. If Φ = E_pk(1), it means a ⪯ b, otherwise, a ⋠ b. We note that the dominance relationship information Φ is known only to C_1 in ciphertext. Therefore, neither C_1 nor C_2 learns any information about whether a ⪯ b.

Next, we need to determine if a ≠ b. Given a ⪯ b, we have a ≺ b only if a ≠ b. One naive way is to employ the SEQ protocol for each pair of attributes and then take the conjunction of the outputs. We propose a more efficient way, which is to check whether S(a) < S(b), where S(a) is the attribute sum of tuple a. If S(a) < S(b), then it is impossible that a = b; conversely, when a ⪯ b and a ≠ b, at least one attribute of a is strictly smaller and none is larger, so S(a) < S(b). As the algorithm shows, C_1 computes the encrypted sums of all attributes α = E_pk(a[1] + ... + a[m]) and β = E_pk(b[1] + ... + b[m]) based on the additive homomorphic property. Then C_1 and C_2 cooperatively use the SLESS protocol to compute σ = E_pk(Bool(α < β)). Finally, C_1 and C_2 cooperatively use the SAND protocol to compute the final dominance relationship Ψ = σ ∧ Φ, which is only known to C_1 in ciphertext. Ψ = E_pk(1) means a ≺ b, otherwise, a ⊀ b.

Algorithm 2: Secure Dominance Protocol.
input : C_1 has E_pk(a), E_pk(b) and C_2 has sk.
output: C_1 gets E_pk(1) if a ≺ b, otherwise, C_1 gets E_pk(0).
C_1 and C_2:
    for j = 1 to m do
        C_1 gets δ_j = E_pk(Bool(a[j] ≤ b[j])) by SLEQ;
    use SAND to compute Φ = δ_1 ∧ ... ∧ δ_m;
C_1:
    compute α = E_pk(a[1]) × ... × E_pk(a[m]);
    compute β = E_pk(b[1]) × ... × E_pk(b[m]);
C_1 and C_2:
    C_1 gets σ = E_pk(Bool(α < β)) by employing SLESS;
    C_1 gets Ψ = σ ∧ Φ as the final dominance relationship using SAND;

Security Analysis.
Based on the composition theorem (Theorem2), the security of secure dominance protocol relies on the securityof SLEQ, SLESS, and SAND, which have been shown in existingworks.
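The logic of the secure dominance protocol can be sanity-checked on plaintext values. The sketch below is not the secure protocol itself: it replaces each encrypted subprotocol output (SLEQ, SAND, SLESS) with the plain boolean it would encrypt, and the function names dominates_by_definition, dominates_sdom_style, and check_equivalence are hypothetical. It checks exhaustively, on a small grid, that Ψ = σ ∧ Φ agrees with the definition of dominance.

```python
from itertools import product


def dominates_by_definition(a, b):
    """a dominates b: a[j] <= b[j] for all j and a[j] < b[j] for some j."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def dominates_sdom_style(a, b):
    """Plaintext mirror of the SDOM steps (no encryption involved)."""
    # delta_j stands in for the SLEQ output Bool(a[j] <= b[j]).
    deltas = [int(x <= y) for x, y in zip(a, b)]
    # Phi stands in for the SAND of all delta_j, i.e. "a is no worse than b".
    phi = 1
    for d in deltas:
        phi &= d
    # sigma stands in for the SLESS output on the attribute sums.
    sigma = int(sum(a) < sum(b))
    # Psi = sigma AND Phi is the final dominance bit.
    return bool(sigma & phi)


def check_equivalence(m=3, max_value=3):
    """Compare both tests on every pair of m-dimensional tuples in a small grid."""
    tuples = list(product(range(max_value + 1), repeat=m))
    return all(
        dominates_by_definition(a, b) == dominates_sdom_style(a, b)
        for a in tuples
        for b in tuples
    )


print(check_equivalence())
print(dominates_sdom_style((2, 5), (4, 5)))
```

Note that the sum comparison alone is not a dominance test; it is only meaningful in conjunction with Φ, which is why the two bits are combined at the end.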
Complexity Analysis.
To determine whether a ⪯ b, Algorithm 2 requires O(m) encryptions and decryptions. Then, to determine whether a = b, Algorithm 2 requires O(1) encryptions and decryptions. Therefore, our secure dominance protocol requires O(m) encryptions and decryptions in total.

6 Secure Skyline Protocol

In this section, we first propose a basic secure skyline protocol and show why such a simple solution is not secure. Then we propose a fully secure skyline protocol. Both protocols are constructed by using the security primitives discussed in Section 4 and the secure dominance protocol in Section 5.

As mentioned in Algorithm 1, given a skyline query q, it is equivalent to compute the skyline in a transformed space with the query point q as the origin and the absolute distances to q as mapping functions. Hence we first show a preprocessing step in Algorithm 3 which maps the dataset to the new space. Since the skyline only depends on the order of the attribute values, we use (p_i[j] − q[j])^2, which is easier to compute than |p_i[j] − q[j]|, as the mapping function (see footnote 1). After Algorithm 3, C_1 has the encrypted datasets E_pk(P) and E_pk(T), and C_2 has the private key sk. The goal is for C_1 and C_2 to securely compute the skyline without participation of the data owner and the client.

We first illustrate a straw-man protocol which is straightforward but not fully secure (as shown in Algorithm 4). The idea is to implement each of the steps in Algorithm 1 using the primitive secure protocols. C_1 first checks the termination condition: if no tuple exists in dataset E_pk(T), the protocol ends; otherwise, the protocol proceeds as follows.

Compute minimum attribute sum. C_1 first computes the sum of E_pk(t_i[j]) for 1 ≤ j ≤ m, denoted as E_pk(S(t_i)), for each tuple t_i. Then C_1 and C_2 use the SMIN protocol such that C_1 obtains E_pk(S(t_min)).
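The additive homomorphism that the preprocessing and attribute-sum steps rely on can be sketched with a toy Paillier implementation. This is an illustrative sketch only, not the authors' implementation: the function names (keygen, encrypt, decrypt, add), the tiny primes, and the generator choice g = n + 1 are assumptions made for the example, and the key size is far too small to be secure. It requires Python 3.8 or later for the modular inverse via pow.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is assumed to be g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (negative values are represented modulo n)."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 equals 1 + m*n mod n^2.
    return (1 + (m % n) * n) * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    """Decrypt c and decode values above n/2 as negative numbers."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    m = (u - 1) // n * mu % n
    return m - n if m > n // 2 else m


def add(n, c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % (n * n)


n, sk = keygen()

# Preprocessing step: E(p[j] - q[j]) = E(p[j]) * E(-q[j]) mod n^2.
p2, q = (39, 120), (41, 125)
enc_p = [encrypt(n, v) for v in p2]
enc_neg_q = [encrypt(n, -v) for v in q]
enc_diff = [add(n, cp, cq) for cp, cq in zip(enc_p, enc_neg_q)]
diffs = [decrypt(n, sk, c) for c in enc_diff]

# Attribute sum: the squared differences (4, 25) are assumed to be already
# available in encrypted form, as they would be after the SM protocol.
enc_t2 = [encrypt(n, 4), encrypt(n, 25)]
attr_sum = decrypt(n, sk, add(n, enc_t2[0], enc_t2[1]))

print(diffs, attr_sum)
```

In the protocol, only the party holding sk could run decrypt; it is called here solely to show that the homomorphic operations produce the expected plaintexts.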
1. We use | p i [ j ] − q [ j ] | in our running example for simplicity. Algorithm 3:
Preprocessing. input : C has E pk ( P ), C has sk , and the client has q . output: C obtains the new encrypted dataset E pk ( T ). Client: send ( E pk ( − q [1]) , ..., E pk ( − q [ m ])) to C ; C : for i = to n do for j = to m do E pk ( temp i [ j ]) = E pk ( p i [ j ] − q [ j ]) = E pk ( p i [ j ]) × E pk ( − q [ j ]) mod N ; C and C : use SM protocol to compute E pk ( T ) = ( E pk ( t ) , ..., E pk ( t n ))only known by C , where E pk ( t i ) = ( E pk ( t i [1]) , ..., E pk ( t i [ m ]))and E pk ( t i [ j ]) = E pk ( temp i [ j ]) × E pk ( temp i [ j ]); Select the skyline with minimum attribute sum.
The challenge now is to select the tuple E_pk(t_min) with the smallest E_pk(S(t_i)) as a skyline tuple. A naive way to do this is for C_1 to compute E_pk(S(t_i) − S(t_min)) for all tuples and send them to C_2. C_2 can decrypt them, determine which one is equal to 0, and return the index to C_1. C_1 then adds the tuple E_pk(p_min) to the skyline pool.

Eliminate dominated tuples.
Once the skyline tuple is selected, C_1 and C_2 cooperatively use the SDOM protocol to determine the dominance relationship between E_pk(t_min) and the other tuples. To delete the tuples that are dominated by E_pk(t_min), a naive way is for C_1 to send the encrypted dominance output to C_2, who can decrypt it and send back to C_1 the indexes of the dominated tuples. C_1 can then delete the tuples dominated by E_pk(t_min), together with E_pk(t_min) itself, from E_pk(T). The algorithm continues until no tuples are left.

Return skyline results to client.
Once C_1 has the encrypted skyline result, it could send it directly to the client if the client had the private key. However, in our setting, the client does not have the private key, for better security. Lines 25 to 39 in Algorithm 4 illustrate how the client obliviously obtains the final skyline query result with the help of C_1 and C_2, while C_1 and C_2 learn nothing about the final result. Consider the skyline tuples E_pk(p_1), ..., E_pk(p_k) in the skyline pool, where k is the number of skyline points. The idea is for C_1 to add a random noise r_i[j] to each p_i[j] in ciphertext and then send the encrypted randomized values α_i[j] to C_2. C_1 also sends the noise r_i[j] to the client. At the same time, C_2 decrypts the randomized values α_i[j] and sends the result r_i[j]' to the client. The client receives the random noise r_i[j] from C_1 and the randomized values r_i[j]' of the skyline points from C_2, and removes the noise by computing p_i[j] = r_i[j]' − r_i[j] for i = 1, ..., k and j = 1, ..., m as the final result.

The basic protocol clearly reveals the following information to C_1 and C_2.
• When selecting the skyline tuple with minimum attribute sum, C_1 and C_2 know which tuples are skyline points, which violates our result privacy requirement.
• When eliminating dominated tuples, C_1 and C_2 know the dominance relationship among tuples with respect to the query tuple q, which violates our data pattern privacy requirement.

Algorithm 4: Basic Secure Skyline Protocol.
input : C_1 has E_pk(P), E_pk(T) and C_2 has sk.
output: client knows the skyline query result.
1 Compute minimum attribute sum;
2 C_1:
3   if there is no tuple in E_pk(T) then
4     break;
5   for i = 1 to n do
6     E_pk(S(t_i)) = E_pk(t_i[1]) × ... × E_pk(t_i[m]) mod N^2;
7 C_1 and C_2:
8   E_pk(S(t_min)) = SMIN(E_pk(S(t_1)), ..., E_pk(S(t_n)));
9 Select the skyline with minimum attribute sum;
10 C_1:
11   for i = 1 to n do
12     α_i = E_pk(S(t_min))^(N−1) × E_pk(S(t_i)) mod N^2;
13     α_i' = α_i^(r_i) mod N^2, where r_i ∈ Z_N^+;
14   send α' to C_2;
15 C_2:
16   decrypt α' and tell C_1 which one equals 0;
17 C_1:
18   add the corresponding E_pk(p_min) to the skyline pool;
19 Eliminate dominated tuples;
20 C_1 and C_2:
21   use the SDOM protocol to determine the dominance relationship between E_pk(t_min) and other tuples;
22   delete the tuples dominated by E_pk(t_min), and E_pk(t_min) itself;
23   GOTO Line 1;
24 Return skyline results to client;
25 C_1:
26   for i = 1 to k do
27     for j = 1 to m do
28       α_i[j] = E_pk(p_i[j]) × E_pk(r_i[j]) mod N^2, where r_i[j] ∈ Z_N^+;
29   send α_i[j] to C_2 and r_i[j] to client for all i = 1, ..., k; j = 1, ..., m;
30 C_2:
31   for i = 1 to k do
32     for j = 1 to m do
33       r_i[j]' = D_sk(α_i[j]);
34   send r_i[j]' to client;
35 Client:
36   receive r_i[j] from C_1 and r_i[j]' from C_2;
37   for i = 1 to k do
38     for j = 1 to m do
39       p_i[j] = r_i[j]' − r_i[j];

6.2 Fully Secure Skyline Protocol

To address these leakages, we propose a fully secure protocol in Algorithm 5. The steps that compute the minimum attribute sum and return the results to the client are the same as in the basic protocol. We focus on the following steps, which are designed to address the disclosures of the basic protocol.
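The homomorphic steps that Algorithm 4 relies on (ciphertext multiplication as plaintext addition, exponentiation by N − 1 as negation, and the randomized difference test for the minimum sum) can be illustrated with a toy Paillier instance. This is only a sketch under stated assumptions: the function names, the tiny primes, and the choice g = N + 1 are illustrative and not taken from the paper, the minimum is encrypted directly instead of being obtained through SMIN, and `pow(x, -1, n)` needs Python 3.8 or later.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair; the tiny primes are for illustration only."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def random_unit(n, rng):
    """Random value in [1, n-1] that is coprime to n."""
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return r

def encrypt(n, m, rng):
    """E_pk(m) = (1 + n)^m * r^n mod n^2."""
    n2 = n * n
    return pow(1 + n, m % n, n2) * pow(random_unit(n, rng), n, n2) % n2

def decrypt(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def blinded_differences(n, enc_sums, enc_min, rng):
    """alpha'_i = (E(S_min)^(n-1) * E(S_i))^(r_i) mod n^2 = E(r_i * (S_i - S_min))."""
    n2 = n * n
    neg_min = pow(enc_min, n - 1, n2)  # E(-S_min)
    return [pow(neg_min * c % n2, random_unit(n, rng), n2) for c in enc_sums]

rng = random.Random(7)
n, sk = keygen()
sums = [16, 7, 9, 19]  # plaintext attribute sums, hypothetical input
enc_sums = [encrypt(n, s, rng) for s in sums]

# Multiplying two ciphertexts adds the underlying plaintexts.
total = decrypt(n, sk, enc_sums[0] * enc_sums[1] % (n * n))

# Only the tuple holding the minimum sum decrypts to 0 after blinding.
blinded = blinded_differences(n, enc_sums, encrypt(n, min(sums), rng), rng)
zero_positions = [i for i, c in enumerate(blinded) if decrypt(n, sk, c) == 0]
```

In the basic protocol, the party holding sk sees exactly which position decrypts to 0, which is the leakage that the fully secure protocol removes by permuting the blinded values first.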
Select skyline with minimum attribute sum.
Once C_1 obtains the encrypted minimum attribute sum E_pk(S(t_min)), the challenge is how to select the tuple E_pk(t_min) with the minimum sum E_pk(S(t_min)) as a skyline tuple such that C_1 and C_2 learn nothing about which tuple is selected. We present a protocol for this in Algorithm 6.

We first need to determine which S(t_i) is equal to S(t_min). Note that this cannot be achieved by the SMIN protocol, which only selects the minimum value. Here we propose an efficient way, exploiting the fact that it is acceptable for C_2 to know that there is one equal case (since we are selecting one skyline tuple) as long as it does not know which one. C_1 first computes α_i' = E_pk((S(t_i) − S(t_min)) × r_i), and then sends a permuted list β = π(α') to C_2 based on a random permutation sequence π. The permutation hides from C_2 which sum is equal to the minimum, while the uniformly random noise r_i masks the difference between each sum and the minimum sum. Note that α_i' encrypts a uniformly random value in Z_N^+ except when S(t_i) − S(t_min) = 0, in which case it encrypts 0. C_2 decrypts β_i; if the result is 0, the corresponding tuple has the smallest E_pk(S(t_i)), and C_2 sends E_pk(1) to C_1; otherwise, it sends E_pk(0).

After receiving the encrypted permuted bit vector U as the equality result, C_1 applies the reverse permutation and obtains an encrypted bit vector V, in which the tuple with the minimum sum has bit 1. To obtain the attribute values of this tuple, C_1 and C_2 employ the SM protocol to compute the encrypted products of the bit vector and the attribute values, E_pk(t_i[j]') and E_pk(p_i[j]'). Since these products are 0 for all tuples except the one with the minimum sum, we can sum all E_pk(t_i[j]') and all E_pk(p_i[j]') on each attribute, and C_1 obtains the encrypted attribute values of the skyline tuple.

Algorithm 5: Fully Secure Skyline Protocol.
input : C_1 has E_pk(P), E_pk(T) and C_2 has sk.
output: C_1 knows the encrypted skyline E_pk(p_sky).
1 Order preserving perturbation;
2 C_1:
3   for i = 1 to n do
4     E_pk(S(t_i)) = E_pk(t_i[1]) × ... × E_pk(t_i[m]) mod N^2;
5 C_1 and C_2:
6   for i = 1 to n do
7     ⟦E_pk(S(t_i))⟧ = SBD(E_pk(S(t_i)));
8 C_1:
9   for i = 1 to n do
10     ⟦E_pk(S(t_i))⟧ = ⟨E_pk((S(t_i))_B^(1)), ..., E_pk((S(t_i))_B^(l)), E_pk((S(t_i))_B^(l+1)), ..., E_pk((S(t_i))_B^(l+⌈log n⌉))⟩, where (S(t_i))_B^(l+1), ..., (S(t_i))_B^(l+⌈log n⌉) is the binary representation of an exclusive value of [0, n − 1];
11     E_pk(S(t_i)) = Π_{γ=1}^{l} E_pk((S(t_i))_B^(γ))^(2^(l−γ)) mod N^2;
12 C_1 and C_2:
13   E_pk(S(t_min)) = SMIN(E_pk(S(t_1)), ..., E_pk(S(t_n)));
14 C_1:
15   λ = (E_pk(S(t_min)) × E_pk(MAX)^(−1))^r mod N^2, where r ∈ Z_N^+;
16   send λ to C_2;
17 C_2:
18   if D_sk(λ) = 0 then
19     break;
20 Select skyline with minimum attribute sum;
21 (E_pk(p_sky), E_pk(t_sky)) = FindOneSkyline(E_pk(P), E_pk(T), E_pk(S(t_i)), E_pk(S(t_min))) (Algorithm 6);
22 Eliminate dominated tuples;
23 C_1 and C_2:
24   for i = 1 to n do
25     for γ = 1 to l do
26       E_pk((S(t_i))_B^(γ)) = SOR(V_i, E_pk((S(t_i))_B^(γ)));
27 C_1:
28   for i = 1 to n do
29     E_pk(S(t_i)) = Π_{γ=1}^{l} E_pk((S(t_i))_B^(γ))^(2^(l−γ)) mod N^2;
30 C_1 and C_2:
31   for i = 1 to n do
32     V_i = SDOM(E_pk(t_sky), E_pk(t_i));
33 Lines 23-32;
34 GOTO Line 1;

Algorithm 6: Find One Skyline.
input : C_1 has encrypted dataset E_pk(P), E_pk(T), E_pk(S(t_i)), and E_pk(S(t_min)), C_2 has private key sk.
output: C_1 knows one encrypted skyline E_pk(p_sky) and E_pk(t_sky).
1 C_1:
2   for i = 1 to n do
3     α_i = E_pk(S(t_min))^(N−1) × E_pk(S(t_i)) mod N^2;
4     α_i' = α_i^(r_i) mod N^2, where r_i ∈ Z_N^+;
5   send β = π(α') to C_2;
6 C_2:
7   receive β from C_1;
8   for i = 1 to n do
9     β_i' = D_sk(β_i);
10     if β_i' = 0 then
11       U_i = E_pk(1);
12     else
13       U_i = E_pk(0);
14   send U to C_1;
15 C_1:
16   receive U from C_2;
17   V = π^(−1)(U);
18   for i = 1 to n do
19     for j = 1 to m do
20       E_pk(t_i[j]') = SM(V_i, E_pk(t_i[j]));
21       E_pk(p_i[j]') = SM(V_i, E_pk(p_i[j]));
22   for j = 1 to m do
23     E_pk(t[j]') = Π_{i=1}^{n} E_pk(t_i[j]') mod N^2;
24     E_pk(p[j]') = Π_{i=1}^{n} E_pk(p_i[j]') mod N^2;
25   add E_pk(p_sky) = ⟨E_pk(p[1]'), ..., E_pk(p[m]')⟩ to skyline pool;
26   use E_pk(t_sky) = ⟨E_pk(t[1]'), ..., E_pk(t[m]')⟩ to compare with other E_pk(t_i);

Order preserving perturbation.
We can show that Algorithm 6 is secure and correctly selects the skyline tuple if there is only one minimum. A potential issue is that multiple tuples may have the same minimum sum. If this happens, not only is this information revealed to C_2, but the skyline tuple also cannot be selected (computed) correctly, since the bit vector contains more than one 1 bit. To address this, we employ order-preserving perturbation, which adds a set of mutually different bit sequences to a set of values such that: 1) if the original values are equal to each other, the perturbed values are guaranteed not to be equal to each other, and 2) if the original values are not equal to each other, their order is preserved. The perturbed values are then used as the input for Algorithm 6.

Concretely, given n numbers in their binary representations, we append a ⌈log n⌉-bit sequence to the end of each E_pk(S(t_i)), where each sequence represents a unique value in the range [0, n − 1]. The perturbed values are therefore different from each other, while their order is preserved since the added bits are the least significant bits. Line 10 of Algorithm 5 shows this step. We note that we could instead multiply each sum E_pk(S(t_i)) by n and add a unique value from [0, n − 1] to each E_pk(S(t_i)), which also guarantees that they are not equal to each other. This would be more efficient than adding a bit sequence. However, since we need to perform the bit decomposition later in the protocol anyway to allow bit operators, we run the decomposition by the SBD protocol for l bits at the beginning of the protocol rather than for l + ⌈log n⌉ bits later.

Eliminate dominated tuples.
Once the skyline tuple is selected, it can be added to the skyline pool and then used to eliminate dominated tuples. To do this, C_1 and C_2 cooperatively use the SDOM protocol to determine the dominance relationship between E_pk(t_min) and the other tuples. The challenge is then how to eliminate the dominated tuples without C_1 and C_2 knowing which tuples are being dominated and eliminated. Our idea is that instead of eliminating the dominated tuples, we "flag" them by securely setting their attribute sums to the maximum domain value. This way, they will not be selected as skyline tuples in the remaining iterations. Concretely, we set the binary representation of their attribute sum to all 1s so that it represents the domain maximum. Since we added ⌈log n⌉ bits to ⟦E_pk(S(t_i))⟧, the new ⟦E_pk(S(t_i))⟧ has l + ⌈log n⌉ bits. Therefore, the maximum value is MAX = 2^(l+⌈log n⌉) − 1. To obliviously set the attribute sums of only the dominated tuples to MAX, based on the encrypted dominance output V_i of the dominance protocol, C_1 and C_2 cooperatively employ SOR on the dominance boolean output and the bits of S(t_i). This way, if the tuple is dominated, its sum is set to MAX; otherwise, it remains the same. If E_pk(S(t_min)) = E_pk(MAX), all the tuples have been processed, i.e., flagged either as a skyline tuple or as a dominated tuple, and the protocol ends.
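The effect of the two steps above can be checked on plaintext values. The sketch below is not the secure protocol: it applies the same bit manipulations to unencrypted integers, with hypothetical helper names `perturb` and `flag`, and it uses the attribute sums 16, 7, 9, 19 with l = 5 and n = 4 from the running example (Example 3). The assignment of tags to tuples is an input here, since any exclusive assignment of values from [0, n − 1] works.

```python
def perturb(sums, tags, n):
    """Append ceil(log2 n) low-order bits that hold a distinct tag from [0, n-1]."""
    k = max(1, (n - 1).bit_length())  # equals ceil(log2 n) for n >= 2
    assert sorted(tags) == list(range(n)), "tags must be an exclusive choice of 0..n-1"
    return [(s << k) | t for s, t in zip(sums, tags)], k

def flag(values, v_bits, total_bits):
    """OR every bit of a value with its flag V_i: flagged values become MAX."""
    max_value = (1 << total_bits) - 1  # MAX = 2^(l + ceil(log n)) - 1
    return [x | (max_value if v else 0) for x, v in zip(values, v_bits)]

l = 5
perturbed, k = perturb([16, 7, 9, 19], tags=[3, 2, 1, 0], n=4)
# Flag the selected skyline tuple, then the tuples it dominates.
after_selection = flag(perturbed, [0, 1, 0, 0], l + k)
after_elimination = flag(after_selection, [0, 0, 1, 1], l + k)
```

Because the tags occupy the least significant bits, ties between equal sums are broken without changing the order of unequal sums, and a flagged value can never again be the minimum unless every value has been flagged.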
Example 3.
We illustrate the entire protocol through the running example shown in Table 3. Please note that all column values are in encrypted form except columns π and β'. Given the mapped data points t_i, C_1 first computes the attribute sum E_pk(S(t_i)) shown in the third column. We set l = 5, and C_1 gets the binary representation of the attribute sum ⟦E_pk(S(t_i))⟧. Because n = 4, C_1 obliviously appends an order-preserving perturbation of ⌈log 4⌉ = 2 bits to each ⟦E_pk(S(t_i))⟧ to get the new E_pk(S(t_i)) (shown in the sixth column). Then C_1 gets E_pk(S(t_min)) = E_pk(30) by employing SMIN.

The protocol then turns to the subroutine Algorithm 6 to select the first skyline based on the minimum attribute sum. C_1 computes α_i = E_pk(S(t_i) − S(t_min)). Assume the random noise vector r = ⟨3, 9, 31, 2⟩ and the permutation sequence π = ⟨2, 1, 4, 3⟩. C_1 sends the encrypted, permuted, and randomized difference vector β to C_2. After decrypting β, C_2 gets β' and then sends U to C_1. C_1 computes V by applying the reverse permutation. By employing SM with V, C_1 computes (E_pk(t_i[1]'), E_pk(t_i[2]')) and (E_pk(p_i[1]'), E_pk(p_i[2]')). After summing all column values, C_1 adds E_pk(p_sky) = (E_pk(39), E_pk(120)) to the skyline pool and uses E_pk(t_sky) = (E_pk(2), E_pk(5)) to eliminate dominated tuples.

The protocol now turns back to the main routine in Algorithm 5 to eliminate dominated tuples. C_1 and C_2 use SOR with V to make E_pk(S(t_min)) = E_pk(127) and E_pk(S(t_i)) = E_pk(S(t_i)) for i ≠ min. Now, only E_pk(S(t_min)) = E_pk(S(t_2)) has changed to E_pk(127), which is "flagged" as MAX. We emphasize that C_1 does not know this value has changed because the ciphertexts of all tuples have changed. Next, C_1 and C_2 find the dominance relationship between E_pk(t_sky) and E_pk(t_i) by the SDOM protocol. C_1 obtains the dominance vector V. Using the same method, C_1 flags E_pk(S(t_3)) and E_pk(S(t_4)) to E_pk(127). The protocol continues until all sums are set to MAX.

TABLE 3: Example of Algorithm 5. Column groups, left to right: C_1 (t_i through π), C_2 (β', U), C_1 (remaining columns).

| t_i | (t_i[1], t_i[2]) | S(t_i) | ⟦S(t_i)⟧ | pert. | S(t_i) | S(t_i) − S(t_min) | r | π | β' | U | V | (t_i[1]', t_i[2]') | (p_i[1]', p_i[2]') | S(t_i) | V | S(t_i) |
| t_1 | (1, 15) | 16 | 1,0,0,0,0 | 1,1 | 67 | 67 − 30 | 3 | 2 | 0 | 1 | 0 | (0, 0) | (0, 0) | 67 | 0 | 67 |
| t_2 | (2, 5) | 7 | 0,0,1,1,1 | 1,0 | 30 | 30 − 30 | 9 | 1 | 111 | 0 | 1 | (2, 5) | (39, 120) | 127 | 0 | 127 |
| t_3 | (4, 5) | 9 | 0,1,0,0,1 | 0,1 | 37 | 37 − 30 | 31 | 4 | 92 | 0 | 0 | (0, 0) | (0, 0) | 37 | 1 | 127 |
| t_4 | (4, 15) | 19 | 1,0,0,1,1 | 0,0 | 76 | 76 − 30 | 2 | 3 | 217 | 0 | 0 | (0, 0) | (0, 0) | 76 | 1 | 127 |

Security Analysis.
Based on Theorem 1, the protocol is secure if the subprotocols are secure and the intermediate results are random or pseudo-random. We focus on the intermediate results here. From C_1's view, the intermediate result includes U. Because U is ciphertext and C_1 does not have the secret key, C_1 can simulate U based on its input and output. From C_2's view, the intermediate result includes β. β contains one E_pk(0) and n − 1 encryptions of random values. Because of the permutation π applied by C_1, C_2 cannot determine the position of the E_pk(0). Therefore, C_2 can simulate β based on its input and output. Hence the protocol is secure.

Computational Complexity Analysis.
The subroutine Algorithm 6 requires O(n) decryptions in Line 9, and O(nm) encryptions and decryptions in Lines 20 and 21. Thus, Algorithm 6 requires O(nm) encryptions and decryptions in all. In Algorithm 5, Line 7 requires O(nl) encryptions and decryptions. Line 10 requires O(n⌈log n⌉) encryptions. Line 12 requires O((l + ⌈log n⌉)n) encryptions and decryptions. Line 26 requires O(l + ⌈log n⌉) encryptions and decryptions. Line 32 requires O(nm) encryptions and decryptions. Thus, this part requires O((l + ⌈log n⌉)n + nm) encryptions and decryptions. Because this part runs k times, the fully secure skyline protocol requires O(k(l + ⌈log n⌉)n + knm) encryptions and decryptions in total.

7 Performance Analysis and Optimizations

In this section, we illustrate two optimizations to further reduce the computation load. We first show a data partitioning optimization in Subsection 7.1, and then show a lazy merging optimization in Subsection 7.2.

7.1 Data Partitioning
As shown in the previous section, the overall run time complexity depends on the number of points (n), the number of skyline points (k), the number of decomposed bits (l), which is determined by the domain of the attribute values, and the number of dimensions (m). A straightforward way to enhance the performance is to partition the input dataset into subdatasets and use a divide-and-conquer approach to avoid unnecessary computations. Furthermore, the partitioning also allows effective parallelism.

The basic idea of data partitioning is to divide the dataset into a set of initial partitions, compute the skyline in each partition, and then repeatedly merge the skyline results of the partitions into new partitions and compute their skyline, until all partitions are merged into the final result. This can be implemented with either a single thread (sequentially) or multiple threads (in parallel).

We describe our data partitioning scheme in Algorithm 7. Given an input dataset, the number of partitions s is specified as a parameter. We show how to calculate the optimal number of partitions in Subsection 7.1.1. We first divide the input data into s partitions and compute the skyline in each partition in Line 1, and then set the state of all partitions as uncomputed in Line 2. In Line 7, the algorithm continues as long as there are uncomputed partitions or idle threads. In Line 8, if there are uncomputed partitions and idle threads, we assign one uncomputed partition to each idle thread in Line 9. In Line 11, if there is no uncomputed partition (n_p == 0), only one unmerged result remains (n_um == 1), and all other threads are idle (n_it == n_t − 1), the loop terminates.

Algorithm 7: Parallel implementation via data partitioning.
input : A dataset P of n points in m dimensions.
output: Skyline of P.
1 divide n points into s partitions and compute the skyline points in each partition;
2 set the state of all partitions as uncomputed;
3 n_p ← number of uncomputed partitions;
4 n_t ← number of threads;
5 n_it ← number of idle threads;
6 n_um ← number of computed and unmerged results;
7 while n_p > 0 or n_it > 0 do
8   if n_p > 0 and n_it > 0 then
9     assign one uncomputed partition to each idle thread;
10   else if n_p == 0 and n_it == n_t − 1 and n_um == 1 then
11     break;
12   wait until at least one thread finishes;
13   set the state of computed partition as unmerged;
14   if n_um > 1 then
15     merge each two into one new partition;
16     set new partition state as uncomputed;

7.1.1 Optimal Number of Partitions

In this subsection, we show how to calculate the optimal number of partitions for minimizing the total computation load given an independent and identically distributed random dataset. We first state the theorem on the expected number of skyline points as follows.
Theorem 1. (Number of Skyline Points) [4]. Given an independent and identically distributed random dataset of n points in m dimensional space, the expected number of skyline points is O(ln^(m−1) n).

In the computational complexity analysis of the fully secure skyline protocol, the time complexity is O(kn(l + m + ⌈log n⌉)). According to Theorem 1, the expected output size for input data of size n/s in m dimensional space is ln^(m−1)(n/s). Therefore, in this step, the computation load required for each partition is ln^(m−1)(n/s) × (n/s) × (log(n/s) + m + l). Since we have s partitions, the total computation load required is s × ln^(m−1)(n/s) × (n/s) × (log(n/s) + m + l) = n × ln^(m−1)(n/s) × (log(n/s) + m + l). This is the initial layer of the computation, which we refer to as layer_0. We use 0 because the following layers have a slightly different formula.

Before we proceed, we denote the number of layers excluding layer_0 as n_layer. For each layer_i, we denote the number of partitions that need to be computed as n_{p,i}, the size of a single input partition as size_{in,i}, the output size of a single partition as size_{out,i}, and the amount of computation load as W_{layer_i}. The layer structure is shown in Figure 4. In the ideal case, we have s = 2^h partitions, where h is an integer. For each layer, we reduce the number of partitions by merging two partitions to form a new partition which contains the skyline points of those two merged partitions. After h layers of merging, we obtain only one partition, which is the final skyline result.

Number of Partitions and Layers.
To simplify the analysis, we assume the merging of two partitions happens at the same layer (although mergings from different layers may happen at the same time). As shown in Figure 4, the datasets for layer_i (i > 0) are merged from the outputs of layer_{i−1}. Therefore, in layer_i, the number of partitions (n_{p,i}) is s/2^i given that the number of partitions in layer_0 is s. Meanwhile, layer_0 has s partitions, layer_1 has s/2 partitions, and the last layer has one partition, so the number of layers excluding layer_0 (n_layer) is log s.

Fig. 4: Layer structure (interResult is short for intermediate result).
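The quantities annotated in Figure 4 can be tabulated for a concrete setting. The following sketch is an illustration only: `layer_plan` is a hypothetical helper, logarithms inside the size estimate are natural logarithms as in Theorem 1, and the constant hidden by the O-notation is taken to be 1, as the analysis in this subsection does.

```python
import math

def layer_plan(n, s, m):
    """Per-layer partition counts and expected sizes for s = 2^h initial partitions."""
    h = int(math.log2(s))
    assert 2 ** h == s, "the analysis assumes s is a power of two"
    plan = []
    for i in range(h + 1):
        # Expected skyline size of a partition that covers 2^i initial partitions.
        size_out = math.log(2 ** i * n / s) ** (m - 1)
        # Layer 0 reads raw points; later layers read two outputs of the previous layer.
        size_in = n / s if i == 0 else 2 * math.log(2 ** (i - 1) * n / s) ** (m - 1)
        plan.append({"layer": i, "partitions": s // 2 ** i,
                     "size_in": size_in, "size_out": size_out})
    return plan

plan = layer_plan(n=512, s=8, m=3)
for row in plan:
    print(row)
```

With s = 8 the plan has layer 0 plus log s = 3 merging layers, and the last layer holds a single partition whose expected output is the estimate for the whole dataset.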
Output Size.
A partition in layer_i is merged from 2^i partitions in layer_0. Therefore, the expected output size of one partition at layer_i corresponds to the expected output size of 2^i partitions in layer_0. That is, in layer_i, the expected output size of a single partition (size_{out,i}) is ln^(m−1)(2^i n/s).

Input Size.
In layer_i, the size of each input partition (size_{in,i}) is twice the single partition output size of the previous layer, because it is the merging of two outputs from the previous layer. In other words, size_{in,i} = 2 × size_{out,i−1} = 2 × ln^(m−1)(2^(i−1) n/s). For example, the expected single partition output size of layer_0 is ln^(m−1)(n/s) and the expected size of each input partition in layer_1 is 2 × ln^(m−1)(n/s).

Computation Load.
With n_{p,i}, size_{in,i}, and size_{out,i}, we can obtain the general formula for the computation load of layer_i (i ≠ 0) as W_{layer_i} = n_{p,i} × size_{out,i} × size_{in,i} × (log(size_{in,i}) + m + l), according to the time complexity of our fully secure skyline protocol. Since we have n_layer layers, the overall computation load is calculated as follows.

\[
\begin{aligned}
W_{all} &= W_{layer_0} + \sum_{i=1}^{n_{layer}} W_{layer_i} \\
&= W_{layer_0} + \sum_{i=1}^{n_{layer}} n_{p,i} \times size_{out,i} \times size_{in,i} \times \left(\log(size_{in,i}) + m + l\right) \\
&= n \ln^{m-1}\!\left(\frac{n}{s}\right) \left(\log\frac{n}{s} + m + l\right)
 + \sum_{i=1}^{\log s} \frac{s}{2^{i}} \ln^{m-1}\!\left(\frac{2^{i} n}{s}\right) \cdot 2\ln^{m-1}\!\left(\frac{2^{i-1} n}{s}\right) \left(\log\!\left(2\ln^{m-1}\!\left(\frac{2^{i-1} n}{s}\right)\right) + m + l\right)
\end{aligned}
\]

Optimal Number of Partitions.
Without loss of generality, from now on, we assume n = 2^u and s = 2^v, where u, v ∈ Z^+ and 1 ≤ v < u. To find the optimal number of partitions, our goal is to minimize W_all with respect to s, or equivalently v. Because n = 2^u and s = 2^v, the computation load W(v) corresponding to the number of partitions s = 2^v is as follows.

\[
\begin{aligned}
W(v) ={}& 2^{u} (u-v)^{m-1} \ln^{m-1} 2 \, (u - v + m + l) \\
&+ \sum_{i=1}^{v} 2^{v-i+1} (i+u-v)^{m-1} (i-1+u-v)^{m-1} \ln^{2m-2} 2 \left(\log\!\left(2 (i-1+u-v)^{m-1} \ln^{m-1} 2\right) + m + l\right)
\end{aligned}
\]

We denote the summand as WI_{v,i}. Noticing that WI_{v,i} = WI_{v+1,i+1}, we have

\[
\begin{aligned}
W(v+1) - W(v) &= W_{layer_0,v+1} - W_{layer_0,v} + \sum_{i=1}^{v+1} WI_{v+1,i} - \sum_{i=1}^{v} WI_{v,i} \\
&= W_{layer_0,v+1} - W_{layer_0,v} + WI_{v+1,1}
\end{aligned}
\]

The minimal value of W lies at the position where W(v+1) − W(v) changes from negative to positive. In our setting, all variables can only be positive integers, which means we need to find the integer v at which f(v) = W(v+1) − W(v) changes from negative to positive. By letting x = u − v, we have

\[
\begin{aligned}
f(x) ={}& WI_{v+1,1} + W_{layer_0,v+1} - W_{layer_0,v} \\
={}& 2^{v+1} x^{m-1} (x-1)^{m-1} \ln^{2m-2} 2 \left(\log\!\left(2 (x-1)^{m-1} \ln^{m-1} 2\right) + m + l\right) \\
&+ 2^{u} (x-1)^{m-1} \ln^{m-1} 2 \, (x - 1 + m + l) - 2^{u} x^{m-1} \ln^{m-1} 2 \, (x + m + l) \\
={}& 2^{u} \ln^{m-1} 2 \, \Big( 2^{1-x} x^{m-1} (x-1)^{m-1} \ln^{m-1} 2 \left(\log\!\left(2 (x-1)^{m-1} \ln^{m-1} 2\right) + m + l\right) \\
&+ (x-1)^{m-1} (x - 1 + m + l) - x^{m-1} (x + m + l) \Big)
\end{aligned}
\]

To locate the minimum of W, we can ignore the leading factor 2^u ln^(m−1) 2 and find the x at which f(x) changes from positive to negative, given m and l (the direction is reversed because x decreases as v increases).

For example, we set l = 20 in our experiments. If m = 2, then the minimal value of W(v) is obtained at x = 1, i.e., u − v = 1. If m = 3, we have x = 6, i.e., u − v = 6. That is, for three dimensional datasets, the optimal number of partitions is 2^(u−6) and each partition has 2^6 points.

7.2 Lazy Merging

In this subsection, we show another optimization with lazy merging.
Lazy Merging.
In the hierarchical divide-and-conquer approachproposed in the last subsection, results from any two computedpartitions are merged immediately as a new partition for com-puting skyline points. However, immediate merging might not beoptimal in the later stage of the program because it requires 1)more merging overhead and 2) more unnecessary computations. Inthe later stage of the program, there are only a few points in eachpartition. At this time, merging overhead is high compared to thecomputation time. Therefore, we can employ lazy merging whichincurs less merging overhead. Furthermore, in the later stage ofthe program, those remaining points are likely to follow an anti-correlated distribution as they are skyline points of a partitionat a previous layer. For anti-correlated dataset, data partitioningwill incur more unnecessary computations. Consider an extremeexample, if all the remaining points are the final skyline points, allthe computations in each partition are unnecessary. Therefore, wecan employ lazy merging to avoid those unnecessary computationsand delay the merging operation to a later time when morecomputed results are ready.
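The policy described above can be sketched as a small buffering routine. This is an illustration under assumptions rather than the paper's implementation: `lazy_merge` and its `threshold` parameter are hypothetical names, partial results are plain Python lists, and the skyline computation on each merged partition is left out.

```python
def lazy_merge(results, threshold):
    """Group partial skyline results so that each merged partition holds at least `threshold` points."""
    merged, buffer = [], []
    for result in results:          # results arrive in completion order
        buffer.extend(result)
        if len(buffer) >= threshold:
            merged.append(buffer)   # large enough: hand over for skyline computation
            buffer = []
    if buffer:                      # whatever is left forms the last partition
        merged.append(buffer)
    return merged

# Hypothetical partial results with 10, 20, 40, 5, 70 and 3 points.
partials = [list(range(k)) for k in (10, 20, 40, 5, 70, 3)]
sizes = [len(p) for p in lazy_merge(partials, threshold=64)]
```

Immediate merging would have paired these six results into three partitions regardless of their size, while the buffered version only triggers a computation once enough points have accumulated.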
Merging Timing.
With lazy merging, we can reduce the running time only if the merge is well timed. Merging too early (immediate merging) or merging too late does not provide enough benefit, or even hurts the performance. As shown in the last subsection, for a given dataset, we can calculate the optimal number of partitions, which is related to the dataset size. For example, given l = 20 and m = 3, the optimal number of partitions is n/2^6, which effectively states that the optimal size of each partition is 2^6 = 64 in the initial layer. Therefore, in our algorithm, we heuristically wait until the size of the merged partition reaches 64 before sending it for computation in this example. That is, there are at least 64 points in each partition (excluding the final partition, which contains the final skyline points) when the skyline points are computed.
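The partition size used as the merging threshold can be reproduced numerically from the computation load W(v) of Subsection 7.1.1. The sketch below evaluates that formula directly instead of using the sign change of f(x); the function names are hypothetical, log is taken base 2 and ln is the natural logarithm, and the search is restricted to 1 ≤ v < u as in the analysis.

```python
import math

LN2 = math.log(2)

def workload(u, v, m, l):
    """W(v) for n = 2^u points and s = 2^v initial partitions."""
    x = u - v
    total = 2 ** u * (x * LN2) ** (m - 1) * (x + m + l)       # layer 0
    for i in range(1, v + 1):                                  # merging layers
        size_out = ((i + x) * LN2) ** (m - 1)
        size_in = 2 * ((i - 1 + x) * LN2) ** (m - 1)
        total += 2 ** (v - i) * size_out * size_in * (math.log2(size_in) + m + l)
    return total

def optimal_partition_exponent(u, m, l):
    """Return x = u - v for the v in [1, u-1] that minimises W(v)."""
    best_v = min(range(1, u), key=lambda v: workload(u, v, m, l))
    return u - best_v

x3 = optimal_partition_exponent(u=9, m=3, l=20)   # 512 points, 3 dimensions
x2 = optimal_partition_exponent(u=9, m=2, l=20)   # 512 points, 2 dimensions
print(2 ** x3, 2 ** x2)  # points per initial partition
```

For l = 20 this evaluation agrees with the values stated in the text, x = 6 for m = 3 and x = 1 for m = 2, so the threshold of 64 points corresponds to 2^6.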
Security Analysis.
In the extreme case where the distribution of the subsets differs from the distribution of the entire dataset, the cloud servers can tell whether the subsets are skewed or uniformly distributed from the different numbers of skyline points returned by each partition. However, the probability is very low because we randomly partition the dataset, so the distribution of the subsets should be very similar to the distribution of the entire dataset. Moreover, this leakage can be easily removed by returning all the tuples in each iteration. That is, cloud servers C_1 and C_2 return all skyline tuples with their true values and non-skyline tuples with MAX values. In this way, the cloud servers cannot know the skyline distribution of the subsets and thus cannot get any new information from the partitions.

8 Experiments

In this section, we describe our experimental setup and optimized parallel system design. For comparison purposes, we have implemented both protocols: the Basic Secure Skyline Protocol (BSSP) in Section 6.1, and the Fully Secure Skyline Protocol (FSSP) in Section 6.2. Since there is no existing solution for secure skyline computation, we use the basic approach as a baseline, which is efficient but leaks some indirect data patterns to the cloud server. We have also designed a parallel framework for effectively reducing the computation time together with the two optimizations, data partitioning and lazy merging.

We implemented all algorithms in C, with all multithreading using POSIX threads and all communication using sockets. We ran single-machine experiments on a machine with an Intel Core i7-6700K 4.0GHz running Ubuntu 16.04. The distributed version was tested on a cluster of 64 machines with Intel Core i7-2600 3.40GHz running CentOS 6, for which we provide more details in the next section. In our experiment setup, both C_1 and C_2 were running on the same machine. The reported computation time is the total computation time of C_1 and C_2.

Datasets.
We used both synthetic datasets and a real NBA dataset in our experiments. To study the scalability of our methods, we generated independent (INDE), correlated (CORR), and anti-correlated (ANTI) datasets following the seminal work [5]. We also built a dataset that contains 2384 NBA players who are league leaders of playoffs (see footnote 2). Each player has five attributes that measure the player's performance: Points (PTS), Rebounds (REB), Assists (AST), Steals (STL), and Blocks (BLK).

Data Partitioning.
This procedure can be done using either a single thread or multiple threads. We conducted the single-thread experiment to verify the optimal number of partitions, and we refer to the multithreaded implementation as local parallelism. The algorithm is shown in Algorithm 7.

To further demonstrate the scalability of our algorithm, we also implemented a distributed version, which employs a manager-worker model. The manager distributes partitions to workers, and the workers compute the skyline points of the given dataset and return the results to the manager, which works similarly to the local parallelism. The only difference is that the manager could implement a sophisticated load balancing algorithm to fully utilize the computation resources. The overall data partitioning scheme is very similar to the existing MapReduce approach. However, we did not employ an existing MapReduce framework because the existing crypto libraries in Java do not satisfy our requirements.

Lazy Merging.
Lazy merging delays the merging operation until there are enough results to form a partition of optimal size, as detailed in Section 7.1.1. All experiments using the optimizations are conducted on 10 different independent and identically distributed random datasets of size 512 and dimension 3, with three repeated runs for each dataset.

In this subsection, we evaluate our protocols by varying the number of tuples (n), the number of dimensions (m), and the key size (K) on datasets of various distributions.
Impact of number of tuples n.
Figures 6(a)(b)(c)(d) show the time cost of different n on CORR, INDE, ANTI, and NBA datasets, respectively. We observe that for all datasets, the time cost increases approximately linearly with the number of tuples n, which is consistent with our complexity analysis. While BSSP is very efficient, FSSP does incur more computational overhead for full security. Comparing different datasets, the time cost is in slightly increasing order for CORR, INDE, and ANTI, due to the increasing number of skyline points of the datasets. The time for the NBA dataset is low due to its small number of tuples.

2. The data was extracted from http://stats.nba.com/leaders/all-time/?ls=iref:nba:gnav on 04/ / .

Impact of number of dimensions m.
Figures 7(a)(b)(c)(d) show the time cost of different m on CORR, INDE, ANTI, and NBA datasets, respectively. For all datasets, the time cost increases approximately linearly with the number of dimensions m. FSSP also shows more computational overhead than BSSP. The different datasets show a similar comparison as in Figure 6. The time for the NBA dataset is lower than for the CORR dataset, which suggests that the NBA data is strongly correlated.

Impact of encryption key size K.
Figures 8(a)(b)(c)(d) show the time cost with different key sizes used in the Paillier cryptosystem on CORR, INDE, ANTI, and NBA datasets, respectively. Stronger security indeed comes at the price of computation overhead, i.e., the time cost increases significantly, almost exponentially, when K grows.

Communication overhead.
We also measured the overall time, which includes the computation time reported earlier and the communication time between the two server processes. Figure 5 shows the computation and communication time of FSSP for different n on the INDE dataset. We observe that computation time accounts for only about one third of the total time in this setting.

Fig. 5: Computation and communication time cost of different n (m = 2, K = ).

In this subsection, we evaluate the efficiency of our two proposed optimizations, data partitioning and lazy merging.

Data Partitioning.
Figure 9 shows the relationship between the theoretical computation load and the real computation time. The theoretical computation load has an optimal value at a partition number of 8, which indicates that dividing the original dataset into 8 partitions gives the smallest computation load. Using ten datasets and three repeated runs for each dataset, we obtained the real computation time, which matches the theoretical computation load well in the region with a small number of partitions. With a large number of partitions, the experimental results deviate from the theoretical derivation. The reason for the deviation is that when the number of partitions is large, each partition contains too few points, and the number of skyline points in each partition violates our assumption about the data distribution. For example, it is hard to say that a dataset with only five points is an independent and identically distributed random dataset. Therefore, the computation time for each partition does not follow our derivation. Furthermore, a large number of partitions incurs more merging overhead.
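The partition-then-merge idea behind this optimization can be sketched on plaintext data. The following minimal example is not the secure protocol: it ignores encryption entirely, and the function names, the naive quadratic skyline routine, and the convention that smaller values are preferred are all assumptions made for illustration. It only shows that merging the local skylines of the partitions yields the same result as computing the skyline of the whole dataset.

```python
import random


def dominates(p, q):
    """p dominates q if p is no worse in every dimension and strictly
    better in at least one (assumption: smaller values are preferred)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive quadratic skyline: keep every point not dominated by another."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def partitioned_skyline(points, num_partitions):
    """Compute local skylines per partition, then merge the candidates."""
    if not points:
        return []
    size = -(-len(points) // num_partitions)  # ceiling division
    candidates = []
    for start in range(0, len(points), size):
        candidates.extend(skyline(points[start:start + size]))
    # A final skyline pass over the (usually much smaller) candidate set.
    return skyline(candidates)


# Hypothetical data mirroring the experimental setting: 512 points, 3 dimensions.
random.seed(0)
data = [tuple(random.randint(0, 1000) for _ in range(3)) for _ in range(512)]
assert sorted(set(partitioned_skyline(data, 8))) == sorted(set(skyline(data)))
```

With very many partitions each local skyline keeps most of its points, so the final merge does nearly all the work, which is in line with the merging overhead noted above.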
Lazy Merging.
As yet another optimization, lazy merging plays an important role, especially when the number of partitions is large. In Figure 10, we show the computation time with and without lazy merging. Overall, lazy merging effectively reduces the run time. The larger the number of partitions, the larger the time difference, which is reasonable because more partitions mean more merging operations and more rounds of computation. We can also see that for one partition (no partitioning) and two partitions there is no time reduction, because no merging operation is needed for one partition and no lazy merging takes place for two partitions.

To summarize, both data partitioning and lazy merging prove effective and can significantly reduce the computation time even with a single thread.

In this subsection, we demonstrate the speedup of our protocol by using multithreading (local parallelism) on independent and identically distributed random datasets with 512 points, and distributed computing with 64 commercial desktops (global parallelism) on independent and identically distributed random datasets with 65536 points.

As shown in Figure 11, if we use one machine with up to 4 threads, the protocol shows an almost linear speedup: as the number of threads doubles, the computation time is roughly halved. However, as we further increase the number of threads, we only see sub-linear speedup. We believe this is due to the small size of the dataset. In the distributed computation experiments, we employed 4, 8, 16, 32, 64, and 128 threads, respectively. At the beginning the protocol clearly shows a linear speedup. When the number of threads reaches 64, it switches to sub-linear speedup, again due to the small size of the dataset. In both local and global parallelism, the difference between running with and without lazy merging is too small to be observed.
In other words, when we have enough computation power, lazy merging provides limited improvement, which is the opposite of what we see in the single-thread experiment.

Conclusions

In this paper, we proposed a fully secure skyline protocol on encrypted data using two non-colluding cloud servers under the semi-honest model. It ensures semantic security in that the cloud servers know nothing about the data (including indirect data patterns), the query, or the query result. In addition, the client and data owner do not need to participate in the computation. We also presented a secure dominance protocol, which can be used by skyline queries as well as other queries. Furthermore, we demonstrated two optimizations, data partitioning and lazy merging, to further reduce the computation load. Finally, we presented our implementation of the protocol and demonstrated the feasibility and efficiency of the solution. As for future work, we plan to optimize the communication time complexity to further improve the performance of the protocol.

Fig. 6: The impact of n (m = 2, K = ). Panels: time cost of (a) CORR, (b) INDE, (c) ANTI, (d) NBA. x-axis: number of tuples n; y-axis: time (seconds); series: BSSP, FSSP.

Fig. 7: The impact of m (n = , K = ). Panels: time cost of (a) CORR, (b) INDE, (c) ANTI, (d) NBA. x-axis: number of dimensions m; y-axis: time (seconds); series: BSSP, FSSP.

Fig. 8: The impact of K (n = , m = ). Panels: time cost of (a) CORR, (b) INDE, (c) ANTI, (d) NBA. x-axis: key size K (256, 512, 1024, 2048); y-axis: time (seconds); series: BSSP, FSSP.

Fig. 9: Theoretical and experimental results. x-axis: number of partitions; y-axes: time (s) and computation load; series: Real Computation Time, Theoretical Computation Load.

Fig. 10: Computation time with and without lazy merging. x-axis: number of partitions; y-axis: time (s); series: W/O Lazy Merge, With Lazy Merge.

Fig. 11: Local parallelism and global parallelism. Panels: (a) local parallelism, (b) global parallelism. x-axis: number of threads; y-axis: time (s); series: With Lazy Merge, W/O Lazy Merge.

Acknowledgement

This research is supported in part by the Patient-Centered Outcomes Research Institute (PCORI) under award ME-1310-07058, the National Institute of Health (NIH) under award R01GM114612, and an NSERC Discovery grant.

References

[1] F. Baldimtsi and O. Ohrimenko. Sorting and searching behind the curtain. In FC 2015, pages 127–146, 2015.
[2] A. Beimel. Secret-sharing schemes: a survey. In
International Conference on Coding and Cryptology, pages 11–46. Springer, 2011.
[3] J. L. Bentley. Multidimensional divide-and-conquer. Commun. ACM, 23(4):214–229, 1980.
[4] J. L. Bentley, H. T. Kung, M. Schkolnick, and C. D. Thompson. On the average number of maxima in a set of vectors and applications. J. ACM, 25(4):536–543, 1978.
[5] S. Börzsönyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE 2001.
[6] S. Bothe, A. Cuzzocrea, P. Karras, and A. Vlachou. Skyline query processing over encrypted data: An attribute-order-preserving-free approach. In PSBD@CIKM, pages 37–43, 2014.
[7] S. Bothe, P. Karras, and A. Vlachou. eskyline: Processing skyline queries over encrypted data. PVLDB, 6(12):1338–1341, 2013.
[8] C. Y. Chan, H. V. Jagadish, K.-L. Tan, A. K. H. Tung, and Z. Zhang. Finding k-dominant skylines in high dimensional space. In SIGMOD Conference, pages 503–514, 2006.
[9] W. Chen, M. Liu, R. Zhang, Y. Zhang, and S. Liu. Secure outsourced skyline query processing via untrusted cloud service providers. In INFOCOM 2016.
[10] V. Costan and S. Devadas. Intel sgx explained. Technical report, Cryptology ePrint Archive, Report 2016 / // eprint.iacr.org.
[11] E. Dellis and B. Seeger. Efficient computation of reverse skyline queries. In VLDB, pages 291–302, 2007.
[12] Y. Elmehdwi, B. K. Samanthula, and W. Jiang. Secure k-nearest neighbor query over encrypted data in outsourced environments. In ICDE 2014.
[13] Z. Erkin, M. Franz, J. Guajardo, S. Katzenbeisser, I. Lagendijk, and T. Toft. Privacy-preserving face recognition. In PETS, pages 235–253, 2009.
[14] U. Feige, A. Fiat, and A. Shamir. Zero-knowledge proofs of identity. J. Cryptology, 1(2):77–94, 1988.
[15] C. Gentry. Fully homomorphic encryption using ideal lattices. In STOC 2009.
[16] O. Goldreich. The Foundations of Cryptography - Volume 2, Basic Applications. Cambridge University Press, 2004.
[17] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game or A completeness theorem for protocols with honest majority. In ACM Symposium on Theory of Computing, pages 218–229, 1987.
[18] H. Hacigümüs, B. R. Iyer, C. Li, and S. Mehrotra. Executing SQL over encrypted data in the database-service-provider model. In SIGMOD 2002, pages 216–227, 2002.
[19] S. Halevi and V. Shoup. Bootstrapping for helib. In EUROCRYPT 2015, pages 641–670, 2015.
[20] T. Hashem, L. Kulik, and R. Zhang. Privacy preserving group nearest neighbor queries. In EDBT 2010.
[21] H. Hu, J. Xu, C. Ren, and B. Choi. Processing private queries over untrusted data cloud through privacy homomorphism. In ICDE 2011.
[22] Y. Huang, D. Evans, J. Katz, and L. Malka. Faster secure two-party computation using garbled circuits. In USENIX 2011, 2011.
[23] A. Janosi, W. Steinbrunn, M. Pfisterer, and R. Detrano. Heart disease dataset, https://archive.ics.uci.edu/ml/datasets/heart+disease. In The UCI Archive 1998.
[24] D. G. Kirkpatrick and R. Seidel. Output-size sensitive algorithms for finding maximal vectors. In Symposium on Computational Geometry, pages 89–96, 1985.
[25] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In VLDB 2002, 2002.
[26] H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. JACM, 1975.
[27] C. Li, N. Zhang, N. Hassan, S. Rajasekaran, and G. Das. On skyline groups. In CIKM, pages 2119–2123, 2012.
[28] A. Liu, K. Zheng, L. Li, G. Liu, L. Zhao, and X. Zhou. Efficient secure similarity computation on encrypted trajectory data. In ICDE, pages 66–77, 2015.
[29] J. Liu, L. Xiong, J. Pei, J. Luo, and H. Zhang. Finding pareto optimal groups: Group-based skyline. PVLDB, 8(13):2086–2097, 2015.
[30] J. Liu, L. Xiong, and X. Xu. Faster output-sensitive skyline computation algorithm. Inf. Process. Lett., 2014.
[31] J. Liu, J. Yang, L. Xiong, and J. Pei. Secure skyline queries on cloud platform. In ICDE, pages 633–644, 2017.
[32] J. Liu, J. Yang, L. Xiong, J. Pei, and J. Luo. Skyline diagram: Finding the voronoi counterpart for skyline queries. In ICDE, 2018.
[33] J. Liu, H. Zhang, L. Xiong, H. Li, and J. Luo. Finding probabilistic k-skyline sets on uncertain data. In CIKM, pages 1511–1520, 2015.
[34] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology - EUROCRYPT '99, pages 223–238, 1999.
[35] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive skyline computation in database systems. ACM Trans. Database Syst., 30(1):41–82, 2005.
[36] S. Papadopoulos, S. Bakiras, and D. Papadias. Nearest neighbor search with strong location privacy. PVLDB, 2010.
[37] J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In VLDB, pages 15–26, 2007.
[38] Y. Qi and M. J. Atallah. Efficient privacy-preserving k-nearest neighbor search. In ICDCS 2008.
[39] B. K. Samanthula, C. Hu, and W. Jiang. An efficient and probabilistic secure bit-decomposition. In ASIA CCS, pages 541–546, 2013.
[40] D. X. Song, D. Wagner, and A. Perrig. Practical techniques for searches on encrypted data. In IEEE Symposium on Security and Privacy, 2000.
[41] T. Veugen, F. Blom, S. J. A. de Hoogh, and Z. Erkin. Secure comparison protocols in the semi-honest model. J. Sel. Topics Signal Processing, 9(7):1217–1228, 2015.
[42] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis. Secure knn computation on encrypted databases. In SIGMOD 2009.
[43] A. C. Yao. Protocols for secure computations (extended abstract). In FOCS, pages 160–164, 1982.
[44] B. Yao, F. Li, and X. Xiao. Secure nearest neighbor revisited. In ICDE 2013.
[45] X. Yi, R. Paulet, E. Bertino, and V. Varadharajan. Practical k nearest neighbor queries with location privacy. In ICDE 2014.
[46] W. Yu, Z. Qin, J. Liu, L. Xiong, X. Chen, and H. Zhang. Fast algorithms for pareto optimal group-based skyline. In CIKM, pages 417–426, 2017.
[47] H. Zhu, X. Meng, and G. Kollios. Privacy preserving similarity evaluation of time series data. In EDBT, pages 499–510, 2014.
Jinfei Liu is a joint postdoctoral research fellow at Emory University and Georgia Institute of Technology. His research interests include skyline queries, data privacy and security, and machine learning. He has published over 20 papers in premier journals and conferences including VLDB, ICDE, CIKM, and IPL.

Juncheng Yang is a master's student at Emory University. His research interests include computer security, databases, smart caches in storage, and distributed systems. He has published over 10 papers in premier conferences including ICDE and SoCC.

Li Xiong is a Professor of Computer Science and Biomedical Informatics at Emory University. She conducts research that addresses both fundamental and applied questions at the interface of data privacy and security, spatiotemporal data management, and health informatics. She has published over 100 papers in premier journals and conferences including TKDE, JAMIA, VLDB, ICDE, CCS, and WWW. She currently serves as associate editor for IEEE Transactions on Knowledge and Data Engineering (TKDE) and on numerous program committees for data management and data security conferences.

Jian Pei is currently a Canada Research Chair (Tier 1) in Big Data Science and a Professor in the School of Computing Science at Simon Fraser University, Canada. He is one of the most cited authors in data mining, database systems, and information retrieval. Since 2000, he has published one textbook, two monographs, and over 200 research papers in refereed journals and conferences, which have been cited more than 77,000 times in the literature. He was the editor-in-chief of the IEEE Transactions on Knowledge and Data Engineering (TKDE) in 2013-2016, and is currently a director of the Special Interest Group on Knowledge Discovery in Data (SIGKDD) of the Association for Computing Machinery (ACM). He is a Fellow of the ACM and of the IEEE.

Appendix A
Basic Security Subprotocols

Secure Multiplication (SM).
Assume a cloud server $C_1$ with encrypted input $E_{pk}(a)$ and $E_{pk}(b)$, and a cloud server $C_2$ with the private key $sk$, where $a$ and $b$ are two numbers not known to $C_1$ and $C_2$. The Secure Multiplication (SM) protocol [12] (based on the additively homomorphic property of Paillier) securely computes the encrypted product of $a$ and $b$, $E_{pk}(a \times b)$, such that only $C_1$ knows $E_{pk}(a \times b)$, and no information related to $a$ or $b$ is revealed to $C_1$ or $C_2$.

Secure Bit Decomposition (SBD).
Assume a cloud server $C_1$ with encrypted input $E_{pk}(a)$ and a cloud server $C_2$ with the private key $sk$, where $a$ is a number not known to $C_1$ and $C_2$. The Secure Bit Decomposition (SBD) protocol [39] securely computes the encrypted individual bits of the binary representation of $a$, denoted as $[\![a]\!] = \langle E_{pk}((a)_B^{(1)}), \ldots, E_{pk}((a)_B^{(l)}) \rangle$, where $l$ is the number of bits, and $(a)_B^{(1)}$ and $(a)_B^{(l)}$ denote the most and least significant bits of $a$, respectively. At the end of the protocol, the output $[\![a]\!]$ is known only to $C_1$ and no information related to $a$ is revealed to $C_1$ or $C_2$.

A.1 Secure Boolean Operations
Secure OR (SOR).
Assume a cloud server $C_1$ with encrypted input $E_{pk}(\hat{a})$ and $E_{pk}(\hat{b})$, and a cloud server $C_2$ with the private key $sk$, where $\hat{a}$ and $\hat{b}$ are two bits not known to $C_1$ and $C_2$. The Secure OR (SOR) protocol [12] securely computes the encrypted result of the bit-wise OR of the two bits, $E_{pk}(\hat{a} \vee \hat{b})$, such that only $C_1$ knows $E_{pk}(\hat{a} \vee \hat{b})$ and no information related to $\hat{a}$ and $\hat{b}$ is revealed to $C_1$ or $C_2$.

Secure AND (SAND).
Assume a cloud server $C_1$ with encrypted input $E_{pk}(\hat{a})$ and $E_{pk}(\hat{b})$, and a cloud server $C_2$ with the private key $sk$, where $\hat{a}$ and $\hat{b}$ are two bits not known to $C_1$ and $C_2$. The goal of the SAND protocol is to securely compute the encrypted result of the bit-wise AND of the two bits, $E_{pk}(\hat{a} \wedge \hat{b})$, such that only $C_1$ knows $E_{pk}(\hat{a} \wedge \hat{b})$ and no information related to $\hat{a}$ and $\hat{b}$ is revealed to $C_1$ or $C_2$. We can simply use the secure multiplication (SM) protocol on the two bits.

Secure NOT (SNOT).
Assume a cloud server $C_1$ with encrypted input $E_{pk}(\hat{a})$ and a cloud server $C_2$ with the private key $sk$, where $\hat{a}$ is a bit not known to $C_1$ and $C_2$. The goal of the SNOT protocol is to securely compute the encrypted complement bit of $\hat{a}$, $E_{pk}(\neg \hat{a})$, such that only $C_1$ knows $E_{pk}(\neg \hat{a})$ and no information related to $\hat{a}$ is revealed to $C_1$ or $C_2$. The Secure NOT protocol can be easily implemented by $E_{pk}(1 - \hat{a}) = E_{pk}(1)\, E_{pk}(\hat{a})^{N-1}$.

Appendix B
Disclosure of Binary based SMIN
Given two numbers in binary representation, the idea of the Binary representation based SMIN protocol (BSMIN) [12] is for $C_1$ to randomly choose a boolean functionality $F$ (by flipping a coin), where $F$ is either $a > b$ or $b > a$, and then securely compute $F$ with $C_2$, such that the output of $F$ is oblivious to both $C_1$ and $C_2$. Based on the output and the chosen $F$, $C_1$ computes $\min(a, b)$ locally using homomorphic properties. More specifically, given the binary representation of the two numbers, for each bit, $C_1$ computes an encrypted boolean output $W_i$ of the two bits based on $F$ (e.g., if $F$ is $a > b$, $W_i = E_{pk}(1)$ if $(a)_B^{(i)} > (b)_B^{(i)}$ and $E_{pk}(0)$ otherwise) and an encrypted randomized difference between $(a)_B^{(i)}$ and $(b)_B^{(i)}$. This way, the order and difference of the two numbers are not disclosed to $C_2$. However, when $a = b$, whatever $F$ is, we have $W_i = E_{pk}(0)$ for all bits. We can show that through the intermediate results (the encrypted randomized difference between $(a)_B^{(i)}$ and $(b)_B^{(i)}$, $\Gamma_i = E_{pk}(r_i)$ for $1 \le i \le l$, and the bit-wise XOR of $(a)_B^{(i)}$ and $(b)_B^{(i)}$, $G_i = E_{pk}(0)$ for $1 \le i \le l$), $C_2$ can determine that $a$ equals $b$.

3. The SMIN protocol for n values can be constructed by employing BSMIN for two values at a time in a hierarchical fashion as suggested in [12] or simply in a linear fashion.

Appendix C
Disclosure of Perturbation based SMIN
The Perturbation based SMIN protocol (PSMIN) [47] assumes $C_1$ has $E_{pk}(a)$ and $E_{pk}(b)$. $C_1$ generates a set of $v$ random values uniformly from a certain range, $\{r_1, \ldots, r_v \mid r_1 < r_i, i \ge 2\}$. $C_1$ then sends a set of $2 + v - 1$ encrypted values $\{E_{pk}(a + r_1), E_{pk}(b + r_1), E_{pk}(x_2 + r_2), \ldots, E_{pk}(x_v + r_v)\}$ to $C_2$, where each $x_i$, $i \ge 2$, is chosen from $a, b$. The idea is that the smallest number, after being perturbed by $r_1$ (which is smaller than $r_i$, $i \ge 2$), remains the smallest among the values received by $C_2$. Although not mentioned by the original paper, we point out that $C_1$ also needs to shuffle the encrypted values before sending them to $C_2$; otherwise the differences between the values will be disclosed to $C_2$ after decryption. After decrypting those $2 + v - 1$ values, $C_2$ takes the minimum $min$ and sends $E_{pk}(min)$ to $C_1$. $C_1$ computes $E_{pk}(min - r_1)$ as the result. The security weakness of PSMIN is due to the fact that if two numbers are equal, their perturbed values remain equal. Since $C_1$ sends $\{E_{pk}(a + r_1), E_{pk}(b + r_1), E_{pk}(x_2 + r_2), \ldots, E_{pk}(x_v + r_v)\}$ to $C_2$, $C_2$ can learn that the two numbers are equal based on $a + r_1$ and $b + r_1$.

Appendix D
Security Definition in the Semi-honest Model

Considering the privacy properties above, we adopt the formal security definition from the multi-party computation setting under the semi-honest model [16]. Intuitively, a protocol is secure if whatever can be computed by a party participating in the protocol can be computed based on its input and output only. This is formalized according to the simulation paradigm. Loosely speaking, we require that a party's view in a protocol execution can be simulated given only its input and output. This then implies that the parties learn nothing from the protocol execution. For the detailed and strict definition, please see [16].
Theorem 2. (Composition Theorem) [16]. If a protocol consists of subprotocols, the protocol is secure as long as the subprotocols are secure and all the intermediate results are random or pseudo-random.

In this work, the proposed secure skyline protocols are constructed based on a sequential composition of subprotocols. To formally prove the security under the semi-honest model, according to the composition theorem given in Theorem 2, one needs to show that the simulated view of each subprotocol is computationally indistinguishable from the actual execution view and that the protocol produces random or pseudo-random shares as intermediate results.

Appendix E
Paillier Cryptosystem

We use the Paillier cryptosystem [34] as the encryption scheme in this paper and briefly describe Paillier's additive homomorphic properties, which are used in our protocols.

• Homomorphic addition of plaintexts: $D_{sk}(E_{pk}(a) \times E_{pk}(b) \bmod N^2) = (a + b) \bmod N$
• Homomorphic multiplication of plaintexts: $D_{sk}(E_{pk}(a)^b \bmod N^2) = a \times b \bmod N$

It is easy to see that the Paillier cryptosystem is additively homomorphic, and that we can compute a new probabilistic encryption $E_{pk}(a)$ from a given encryption $E_{pk}(a)$ without knowing the private key $sk$.
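As a rough illustration of these properties, and of the SNOT identity from Appendix A.1, the following toy sketch implements textbook Paillier in Python (3.8 or later). The function names, the generator choice g = N + 1, and the tiny fixed primes are assumptions made for this example only; this is not the implementation evaluated in the paper, and the parameters are far too small to be secure.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy key generation with tiny fixed primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is assumed to be g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """E_pk(m) = (1 + n)^m * r^n mod n^2 with a fresh random r coprime to n."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(1 + n, m % n, n2) * pow(r, n, n2) % n2


def decrypt(n, sk, c):
    """D_sk(c) = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) // n."""
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def add(n, ca, cb):
    """Homomorphic addition: multiply ciphertexts modulo n^2."""
    return ca * cb % (n * n)


def scalar_mul(n, ca, b):
    """Homomorphic multiplication by a plaintext b: exponentiate modulo n^2."""
    return pow(ca, b, n * n)


def rerandomize(n, c):
    """Fresh encryption of the same plaintext, computed without the private key."""
    return c * encrypt(n, 0) % (n * n)


def secure_not(n, c_bit):
    """SNOT identity: E(1 - a) = E(1) * E(a)^(n - 1) mod n^2."""
    return encrypt(n, 1) * pow(c_bit, n - 1, n * n) % (n * n)


n, sk = keygen()
ca, cb = encrypt(n, 20), encrypt(n, 22)
assert decrypt(n, sk, add(n, ca, cb)) == 42
assert decrypt(n, sk, scalar_mul(n, ca, 3)) == 60
assert decrypt(n, sk, rerandomize(n, ca)) == 20
assert [decrypt(n, sk, secure_not(n, encrypt(n, b))) for b in (0, 1)] == [1, 0]
```

Only `decrypt` needs the private key; every other operation works on ciphertexts and the public modulus alone.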