[PDF] Obscure: Information-Theoretically Secure, Oblivious, and Verifiable Aggregation Queries on Secret-Shared Outsourced Data -- Full Version

Abstract

Despite exciting progress on cryptography, secure and efficient query processing over outsourced data remains an open challenge. We develop a communication-efficient and information-theoretically secure system, entitled Obscure for aggregation queries with conjunctive or disjunctive predicates, using secret-sharing. Obscure is strongly secure (i.e., secure regardless of the computational-capabilities of an adversary) and prevents the network, as well as, the (adversarial) servers to learn the user's queries, results, or the database. In addition, Obscure provides additional security features, such as hiding access-patterns (i.e., hiding the identity of the tuple satisfying a query) and hiding query-patterns (i.e., hiding which two queries are identical). Also, Obscure does not require any communication between any two servers that store the secret-shared data before/during/after the query execution. Moreover, our techniques deal with the secret-shared data that is outsourced by a single or multiple database owners, as well as, allows a user, which may not be the database owner, to execute the query over secret-shared data. We further develop (non-mandatory) privacy-preserving result verification algorithms that detect malicious behaviors, and experimentally validate the efficiency of Obscure on large datasets, the size of which prior approaches of secret-sharing or multi-party computation systems have not scaled to.

Full PDF

IIEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 1 O B S C U R E : Information-Theoretically Secure,Oblivious, and Veriﬁable Aggregation Queries onSecret-Shared Outsourced Data

Peeyush Gupta, Yin Li, Sharad Mehrotra, Nisha Panwar, Shantanu Sharma, and Sumaya Almanee

Abstract —Despite exciting progress on cryptography, secure and efﬁcient query processing over outsourced data remains an openchallenge. We develop a communication-efﬁcient and information-theoretically secure system, entitled O

BSCURE for aggregationqueries with conjunctive or disjunctive predicates, using secret-sharing. O

BSCURE is strongly secure ( i . e ., secure regardless of thecomputational-capabilities of an adversary) and prevents the network, as well as, the (adversarial) servers to learn the user’s queries,results, or the database. In addition, O BSCURE provides additional security features, such as hiding access-patterns ( i . e ., hiding theidentity of the tuple satisfying a query) and hiding query-patterns ( i . e ., hiding which two queries are identical). Also, O BSCURE does notrequire any communication between any two servers that store the secret-shared data before/during/after the query execution.Moreover, our techniques deal with the secret-shared data that is outsourced by a single or multiple database owners, as well as, allowsa user, which may not be the database owner, to execute the query over secret-shared data. We further develop (non-mandatory)privacy-preserving result veriﬁcation algorithms that detect malicious behaviors, and experimentally validate the efﬁciency of O

BSCURE on large datasets, the size of which prior approaches of secret-sharing or multi-party computation systems have not scaled to.

Index Terms —Computation and data privacy, data and computation outsourcing, multi-party computation, Shamir’s secret-sharing,result veriﬁcation. (cid:70)

NTRODUCTION

Database-as-a-service (DaS) [36] allows authenticated usersto execute their queries on an untrusted public cloud. Over thelast two decades, several cryptographic techniques ( e . g ., [8], [33],[35], [44], [46]) have been proposed secure and privacy-preservingcomputations in the DaS model. These techniques can be broadlyclassiﬁed based on cryptographic security into two categories: Computationally secure techniques that assume the adversarylacks adequate computational capabilities to break the underlyingcryptographic mechanism in polynomial time ( i . e ., a practicallyshort amount of time). Non-deterministic encryption [35], homo-morphic encryption [33], order-preserving encryption (OPE) [8],and searchable-encryption [46] are examples of such techniques. Information-theoretically secure techniques that are uncon-ditionally secure and independent of adversary’s computationalcapabilities. Shamir’s secret-sharing (SSS) [44] is a well-knowninformation-theoretically secure protocol. In SSS, multiple (se-cure) shares of a dataset are kept at mutually suspicious servers,such that a single server cannot learn anything about the data.Secret-sharing-based techniques are secure under the assumptionthat a majority of the servers (equal to the threshold of the secret-sharing mechanism) do not collude. Secret-sharing mechanisms

A preliminary version of this work was accepted in VLDB 2019.Corresponding author:

Shantanu Sharma [email protected]

P. Gupta, S. Mehrotra, S. Sharma, and S. Almanee are with University ofCalifornia, Irvine, USA. Yin Li is with Dongguan University of Technology,P.R. China. N. Panwar is with Augusta University and University of California,Irvine, USA.Manuscript received 05 Oct. 2019; accepted 22 Mar. 2020. DOI:10.1109/TKDE.2020.2983932. c (cid:13) also have applications in other areas such as Byzantine agreement,secure multiparty computations (MPC), and threshold cryptogra-phy, as discussed in [14].The computationally or information-theoretically securedatabase techniques can also be broadly classiﬁed into two cate-gories, based on the supported queries: ( i ) Techniques that supportselection/join : Different cryptographic techniques are built forselection queries, e . g ., searchable encryption, deterministic/non-deterministic encryption, and OPE; and ( ii ) Techniques that sup-port aggregation : Cryptographic techniques that exploit homomor-phic mechanisms such as homomorphic encryption, SSS, or MPCtechniques.While both computationally and information-theoretically se-cure techniques have been studied extensively in the cryptographicdomain, secure data management has focused disproportionatelyon computationally secure techniques ( e . g ., OPE, homomorphicencryption, searchable-encryption, and bucketization [36]) result-ing in systems such as CryptDB [42], Monomi [48], MariaDB [1],CorrectDB [10]). Some exceptions to the above include [28], [29],[30], [50] that have focused on secret-sharing.Recently, both academia [23], [28], [30], [50] and indus-tries [2], [9], [16] have begun to explore information-theoreticallysecure techniques using MPC that efﬁciently supports OLAP tasksinvolving aggregation queries, while achieving higher securitythan computationally secure techniques. For instance, commer-cial systems, such as Jana [9] by Galois, Pulsar [2] by StealthSoftware, Sharemind [16] by Cybernetica, and products by compa-nies such as Unbound Tech., Partisia, Secret Double Octopus, and Some of the computationally secure mechanisms are vulnerable to computationallypowerful adversaries. For instance, Google, with sufﬁcient computational capabilities,broke SHA-1 [3]. a r X i v : . [ c s . D B ] A p r EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 2

SecretSkyDB Ltd. have explored MPC-based databases systemsthat offer strong security guarantees. Beneﬁts of MPC-based meth-ods in terms of both higher-level security and relatively efﬁcientsupport for aggregation queries have been extensively discussed inboth scientiﬁc articles [24], [31], [41], [43] and popular media [4],[5], [6], [7].Much of the above work on MPC-based secure data manage-ment requires several servers to collaborate to answer queries.These collaborations require several rounds of communicationamong non-colluding servers. Instead, we explore secure datamanagement based on SSS that does not require servers tocollaborate to generate answers and can, hence, be implementedmore efﬁciently. There is prior work on exploring secret-sharingfor SQL processing [28], [29], [30], [50], but the developedtechniques suffer from several drawbacks, e . g ., weak securityguarantees such as leakage of access patterns, signiﬁcant overheadof maintaining polynomials for generating shares at the database(DB) owner, no support for third-party query execution on thesecret-shared outsourced database, etc. We discuss the limitationsof existing secret-sharing-based data management techniques indetails in § Contribution.

Our contributions in this paper are threefold:1) SSS-based algorithms (entitled O

BSCURE ) that support a largeclass of access-pattern-hiding aggregation queries with selection .O BSCURE supports count, sum, average, maximum, minimum,top-k, and reverse top-k, queries, without revealing anything aboutdata/query/results to an adversary.2) An oblivious result veriﬁcation algorithm for aggregation queriessuch that an adversary does not learn anything from the veriﬁ-cation. O

BSCURE ’s veriﬁcation step is not mandatory. A queriermay run veriﬁcation occasionally to conﬁrm the correctness ofresults.3) A comprehensive experimental evaluation of O

BSCURE on a vari-ety of queries that clearly highlight its scalability to moderate-sizedatasets and its efﬁciency compared to both state-of-the-art MPC-based solutions, as well as, to the simple strategy of downloadingencrypted data at the client, decrypting it, and running queries atthe (trusted) client.

Applications.

Our proposed algorithms can deal with datasetsoutsourced by a single or multiple DB owners. Here, we provideexamples of each scenario.

DB outsourcing by a single DB owner: Hospital database.

Ahospital may outsource its patient database to an (untrusted cloud)server. Given the sensitivity of the patient records, such data needsto be secured cryptographically. The hospital may still wish toexecute analytical queries on the sever over such data ( e . g ., numberof inﬂuenza patients seen in the last month) for its own internallogistical planning. DB outsourcing by multiple DB owners: Smart metering (orIoT sensors).

Smart meters’ data outsourcing is an example ofmultiple DB owners and a single querier. In smart meter settings,smart meter devices keep the energy consumptions of a homeat given time intervals and send the data to the servers [45].This data contains behavioral information of the user; hence, acryptographic technique should be used to make it secure beforeoutsourcing. Users may execute queries on this secure databasefor monitoring and comparing their usage to that of others in theneighborhood. Executing such aggregate queries involve count,sum, and maximum operations in an oblivious manner at theserver for preventing access to users’ behavioral information.Our proposed algorithms prevent an adversarial server to learn the user’s behaviors, when storing the database or executing aquery. Privacy-preserving data integration [11], [12], [13], [15],[39], where different datasets – owned by different DB owners– are intergraded into a single dataset, is also an example DBoutsourcing by multiple owners. However, [11], [12], [13], [15],[39] deal with only encrypted data integration; thus, we do notdiscuss such techniques in detail.

Outline of the paper : § § § § § § § Appendix.

In appendix, we provide the following: an exampleof count query veriﬁcation using secret-shared data, an approachfor ﬁnding maximum over SSS databases outsourced by multipleDB owners, approaches for the minimum and top-k, an outlinefor security proofs, and a communication-efﬁcient strategy forknowing tuples that satisﬁed a query predicate.

ACKGROUND

Here, we provide an overview of secret-sharing with an exampleand compare our proposed approach with existing works. O BSCURE is based on SSS, string-matching operations overSSS, and order-preserving secret-sharing (OP-SS). This sectionprovides an overview of these existing techniques.

Shamir’s secret-sharing (SSS).

In SSS [44], the DB ownerdivides a secret value, say S , into c different fragments, called shares , and sends each share to a set of c non-communicatingparticipants/servers. These servers cannot know the secret S untilthey collect c (cid:48) < c shares. In particular, the DB owner randomlyselects a polynomial of degree c (cid:48) with c (cid:48) random coefﬁcients, i . e ., f ( x ) = a + a x + a x + · · · + a c (cid:48) x c (cid:48) , where f ( x ) ∈ F p [ x ] , p is a prime number, F p is a ﬁnite ﬁeld of order p , a = S , and a i ∈ N (1 ≤ i ≤ c (cid:48) ) . The DB owner distributes the secret S into c shares by placing x = 1 , , . . . , c into f ( x ) . The secretcan be reconstructed based on any c (cid:48) + 1 shares using Lagrangeinterpolation [22]. Note that c (cid:48) ≤ c , where c is often taken to belarger than c (cid:48) to tolerate malicious adversaries that may modifythe value of their shares. For this paper, however, since we are notaddressing the availability of data, we will consider c and c (cid:48) to beidentical.SSS allows an addition of shares, i . e ., if s ( a ) i and s ( b ) i areshares of two values a and b , respectively, at the server i , then theserver i can compute an addition of a and b itself, i . e ., a + b = s ( a ) + s ( b ) , without knowing real values of a and b . String-matching operation on secret-shares.

Accumulating-Automata (AA) [27] is a new string-matching technique on secret-shares that do not require servers to collaborate to do the operation,unlike MPC-techniques [9], [16], [17], [18], [25], [38]. Here, weexplain AA to show how string-matching can be performed onsecret-shares.Let D be the cleartext data. Let S ( D ) i ( ≤ i ≤ c ) bethe i th secret-share of D stored at the i th server, and c bethe number of non-communicating servers. AA allows a user to EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 3 search a pattern, pt , by creating c secret-shares of pt (denotedby S ( pt ) i , ≤ i ≤ c ), so that the i th server can search thesecret-shared pattern S ( pt ) i over S ( D ) i . The result of the string-matching operation is either of secret-share form, if S ( pt ) i matches with a secret-shared string in S ( D ) i or of secret-share form; otherwise. Note that when searching a pattern on theservers, AA uses multiplication of shares, as well as, the additiveproperty of SSS, which will be clear by the following example.Thus, if the user wants to search a pattern of length l in only onecommunication round , while the DB owner and the user are usinga polynomial of degree one, then due to multiplication of shares,the ﬁnal degree of the polynomial will be l , and solving such apolynomial will require at least l + 1 shares. Example.

Assume that the domain of symbols has only threesymbols, namely A, B, and C. Thus, A can be represented as (cid:104) , , (cid:105) . Similarly, B and C can be represented as (cid:104) , , (cid:105) and (cid:104) , , (cid:105) , respectively. DB owner side . Suppose that the DB owner wants to outsource Bto the (cloud) servers. Hence, the DB owner may represent B asits unary representation: (cid:104) , , (cid:105) . If the DB owner outsources thevector (cid:104) , , (cid:105) to the servers, it will reveal the symbol. Thus, theDB owner uses any three polynomials of an identical degree, asshown in Table 1, to create three shares. Vector values Polynomials Firstshares Second shares Thirdshares0 x x

10 19 280 x TABLE 1: Secret-shares of a vector (cid:104) , , (cid:105) , created by the DBowner. User-side . Suppose that the user wants to search for a symbol B.The user will ﬁrst represent B as a unary vector, (cid:104) , , (cid:105) , andthen, create secret-shares of B, as shown in Table 2. Note thatthere is no need to ask the DB owner to send any polynomials tocreate shares or ask the DB owner to execute the search query. Vector values Polynomials Firstshares Second shares Thirdshares0 x x x TABLE 2: Secret-shares of a vector (cid:104) , , (cid:105) , created by theuser/querier. Server-side . Each server performs position-wise multiplication ofthe vectors that they have, adds all the multiplication resultants,and sends them to the user, as shown in Table 3. An importantpoint to note here is that the server cannot deduce the keyword, aswell as, the data by observing data/query/results.

Computation onServer 1 Server 2 Server 3 × × × × × × × × ×

12 = 72

43 147 313

TABLE 3: Multiplication of shares and addition of ﬁnal shares bythe servers.

User-side . After receiving the outputs ( (cid:104) y = 43 , y = 147 , y =313 (cid:105) ) from the three servers, the user executes Lagrange in-terpolation [22] to construct the secret answer, as follows: ( x − x x − x x − x x − x × y + ( x − x x − x x − x x − x × y + ( x − x x − x x − x x − x × y = ( x − x − − − ×

43 + ( x − x − − − ×

147 + ( x − x − − − ×

313 = 1

The ﬁnal answer is 1 that conﬁrms that the secret-shares at theservers have B.

Note.

In this paper, we use AA that utilizes unary representation as a building block. A recent paper Prio [23] also uses a unaryrepresentation; however, we use signiﬁcantly fewer number ofbits compared to Prio’s unary representation. One can use Prio’sunary representation too or use a different private string-matchingtechnique over secret-shares that supports string-matching over theshares.

Order-preserving secret-sharing (OP-SS).

The concept of OP-SS was introduced in [29]. OP-SS maintains the order of the valuesin secret-shares too, e . g ., if v and v are two values in cleartextsuch that v < v , then S ( v ) < S ( v ) at any server. It is clearthat ﬁnding records with maximum or minimum values using OP-SS are trivial. However, ordering revealed by OP-SS can leak moreinformation about records. Consider, for instance, an employeerelation, given in Table 5 on page 5. For explanation purpose,we represent Table 5 in cleartext. In Table 5, the salary ﬁeldcan be stored using OP-SS. If we know (background knowledge)that employees in the security department earn more money thanothers, we can infer from the representation that the second tuplecorresponds to someone from the security department. Thus, OP-SS, by itself, offers little security. However, as we will see later in §

7, by splitting the ﬁelds such as salary that can be stored usingOP-SS, while storing other ﬁelds using SSS, we, thus, can beneﬁtfrom the ordering supported by OP-SS without compromising onsecurity. [26] provided 2’scomplement-based sigbit computation. We will use signbit toﬁnd if two numbers are equal or not, as follows: A ≥ B if signbit ( A − B ) = 0 , and A < B if signbit ( A − B ) = 1 .Let A = [ a n , a n − , . . . , a ] be a n bit number and B =[ b n , b n − , . . . , b ] be a n bit number. 2’s complement subtractionconverts B − A into B + ¯ A + 1 , where ¯ A + 1 is 2’s complementrepresentation of − A . We start from the least signiﬁcant bit (LSB)and go through the rest of the bits. The method inverts a i (by doing − a i , where ≤ i ≤ n ), calculates ¯ a + b + 1 and its carrybit. After ﬁnishing this on all the n bits, the most signiﬁcant bit(MSB) keeps the signbit. Comparison with SSS databases.

In 2006, Emekc¸i et al. [29]introduced the ﬁrst work on SSS data for executing sum, maxi-mum, and minimum queries. However, [29] uses a trusted-third-party to perform queries and is not secure, since it uses OP-SS toanswer maximum/minimum queries. Another paper by Emekc¸i etal. [30] on OP-SS based aggregation queries requires the database(DB) owner to retain each polynomial, which was used to createdatabase shares, resulting in the DB owner to store n × m polyno-mials, where n and m are the numbers of tuples and attributes ina relation. [30] is also not secure, since it reveals access-patterns( i . e ., the identity of tuples that satisfy a query) and using OP-SS. Like [30], [50] proposed a similar approach and also suffersfrom similar disadvantages. [47] proposed SSS-based sum andaverage queries; however, they also require the DB owner to retain While [9], [28], [29], [30] have explored mechanisms to support selection andjoin operations over the secret-shared data, these techniques are not secure ( e . g ., leakinformation from access-patterns), are inefﬁcient (often requiring quadratic computations),and require transmitting entire dataset to users. SS can primarily be used to support OLAPstyle aggregation queries, which is our focus in this paper. EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 4 tuple-ids of qualifying tuples. [28] used a novel string-matchingoperation over the shares at the server, but it cannot performgeneral aggregations with selection over complex predicates. Inshort, all the SSS-based solutions for aggregation queries eitheroverburden the DB owner (by storing enough data related topolynomials and fully participating in a query execution), areinsecure due to OP-SS, reveal access-patterns, or support a verylimited form of aggregation queries without any selection criteria.In contrast, O

BSCURE eliminates all such limitations. It pro-vides a fully secure and efﬁcient solution for implementing ag-gregation queries with selections. Our experimental results willshow that O

BSCURE scales to datasets with 6M tuples on TPC-H queries, the size of which prior secret-sharing and/or MPC-based techniques have never scaled to. The key to the efﬁcientperformance of O

BSCURE still is exploiting OP-SS – while OP-SS, in itself, is not secure (it is prone to background knowledgeattacks, for instance). The way O

BSCURE uses OP-SS, as will beclear in §

4, it prevents such attacks by appropriately partitioningdata, while still being able to exploit OP-SS for efﬁciency. In ad-dition, to support aggregation with selections, O

BSCURE exploitsthe string-matching techniques over shares developed in [27].Furthermore, as we will see in experimental section ( § BSCURE scales to datasets with 6M tuples on TPC-H queries.

Comparison with MPC-techniques. O BSCURE also overcomesseveral limitations of existing MPC-based solutions. Recentwork, Prio [23] supports a mechanism for conﬁrming the max-imum number, if the maximum number is known; however,Prio [23] does not provide any mechanism to compute the max-imum/minimum. Also, Prio does not provide methods to executeconjunctive and disjunctive count/sum queries. Another recentwork [17] deals with adding shares in an array under maliciousservers and malicious users, using the properties of SSS andpublic-key settings. However, [17] is unable to execute a single-dimensional, conjunctive, or disjunctive sum query. Note that (asper our assumption) though, [17] can tolerate malicious users,while O

BSCURE is designed to only handle malicious servers, andit assumes users to be trustworthy.Other works, e . g ., Sepia [18] and [25], perform additionand less than operations, and use many communication rounds.In contrast, O BSCURE uses minimal communication rounds be-tween the user and each server, (when having enough shares).Speciﬁcally, count, sum, average, and their veriﬁcation algorithmsrequire at most two rounds between each server and the user.However, maximum/minimum ﬁnding algorithms require at mostfour communication rounds. In addition, our scheme achieves theminimum communication cost for aggregate queries, especiallyfor count, sum, and average queries, by aggregating data locally ateach server.

Comparison with MPC/SSS-based veriﬁcation ap-proaches. [38] and [47] developed veriﬁcation approachesfor secret-shared data. [38] considered veriﬁcation process forMPC using a trusted-third-party veriﬁer. While overburdeningthe DB owner by keeping metadata for each tuple, [47] providedmetadata-based operation veriﬁcation ( i . e ., whether all the desiredtuples are scanned or not) for only sum queries, unlike O BSCURE ’sresult veriﬁcation for all queries. O

BSCURE veriﬁcation methodsneither involve the DB owner to verify the results nor require atrusted-third-party veriﬁer.

RELIMINARY

This section provides a description of entities, an adversarialmodel, and security properties for obliviously executing queries.

We assume the following three entities in our model.1) A set of c > non-communicating servers. The servers do notexchange data with each other to compute any answer. The onlypossible data exchange of a server is with the user/querier or thedatabase owner.2) The trusted database (DB) owner, that creates c secret-shares ofthe data and transfers the i th share to the i th server. The secret-shares are created by an algorithm that supports non-interactive addition and multiplication of two shares, which is required toexecute the private string-matching operation, at the server, asexplained in §

3) An (authenticated, authorized, and trusted ) user/querier, whoexecutes queries on the secret-shared data at the servers. Thequery is sent to servers. The user fetches the partial outputsfrom the servers and performs a simple operation (polynomialinterpolation using Lagrange polynomials [22]) to obtain thesecret-value.

We consider two adversarial models, in both of which the cloudservers (storing secret-shares) are not trustworthy. In the honestbut curious model, the server correctly computes the assignedtask without tampering with data or hiding answers. However,the server may exploit side information ( e . g ., query execution,background knowledge, and output size) to gain as much informa-tion as possible about the stored data. Such a model is consideredwidely in many cryptographic algorithms and in widely used inDaS [20], [36], [49], [51]. We also consider a malicious adversarythat could deviate from the algorithm and delete tuples from therelation. Users and database owners, in contrast, are assumed tobe not malicious.Only authenticated users can request query on servers. Further,we follow the restriction of the standard SSS that the adversarycannot collude with all (or possibly the majority of) the servers.Thus, the adversary cannot generate/insert/update shares at themajority of the servers. Also, the adversary cannot eavesdropon a majority of communication channels between the user andthe servers. This can be achieved by either encrypting the trafﬁcbetween user and servers, or by using anonymous routing [34],in which case the adversary cannot gain knowledge of serversthat store the secret-shares. Note that if the adversary could eithercollude with or successfully eavesdrop on the communicationchannels between the majority of servers and user, the secret-sharing technique will not apply. The validity of the assumptionsbehind secret-sharing has been extensively discussed in priorwork [24], [31], [41], [43]. The adversary can be aware of thepublic information, such as the actual number of tuples andnumber of attributes in a relation, which will not affect the security The choice of the underlying non-interactive and string-matching-based secret-sharing mechanism does not change our proposed aggregation and veriﬁcation algorithms. The DB owner/user can use anonymous routing to send their data to the servers,thereby preventing an adversary from determining which user is connecting to whichserver. If the adversary knows the majority of the communication channels/servers, thenit can construct the secret-shared query, outputs to the query, and the database.

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 5

Simple aggregation queries select α ( A i ) from R Multi-dimensional aggregation queries select α ( A i ) from R where A k = v k OP A l = v l Multi-dimensional aggregation queries with group-by select A i , α ( A j ) from R where A k = v k OP A l = v l group by A i Multi-dimensional aggregation queries with group-by and having clause select A i , α ( A j ) from R where A k = v k OP A l = v l group by A i having γ ( A j ) = v i Notations : α and γ are aggregation operators, such as count , sum , avg , max , and min . OP is a conjunctive or disjunctive operation. A i , A j , A k , and A l are someattributes of a relation R TABLE 4: Query types supported by O

BSCURE .of the proposed scheme, though such leakage can be prevented byadding fake tuples and attributes. In the above-mentioned adversarial model, an adversary wishesto learn the (entire/partial) data and query predicates. Hence, asecure algorithm must prevent an adversary to learn the data ( i )by just looking the cryptographically-secure data and deduce thefrequency of each value ( i . e ., frequency-count attacks), and ( ii )when executing a query and deduce which tuples satisfy a querypredicate ( i . e ., access-pattern attacks) and how many tuples satisfya query predicate ( i . e ., output-size attacks). Thus, in order toprevent these attacks, our security deﬁnitions are identical to thestandard security deﬁnition as in [19], [21], [32]. An algorithm is privacy-preserving if it maintains the privacy of the querier ( i . e .,query privacy), the privacy of data from the servers, and performsidentical operations, regardless of the user query. Query/Querier’s privacy requires that the user’s query must behidden from the server, the DB owner, and the communicationchannel. In addition, the server cannot distinguish between two ormore queries of the same type based on the output. Queries are ofthe same type based on their output size. For instance, all countqueries are of the same type since they return almost an identicalnumber of bits.

Deﬁnition: Users privacy.

For any probabilistic polynomial timeadversarial server having a secret-shared relation S ( R ) and anytwo input query predicates, say p and p , the server cannotdistinguish p or p based on the executed computations for either p and p . Privacy from the server requires that the stored input data,intermediate data during a computation, and output data are notrevealed to the server, and the secret value can only be recon-structed by the DB owner or an authorized user. In addition, two ormore occurrences of a value in the relation must be different at theserver to prevent frequency analysis while data at rest. Recall thatdue to secret-shared relations (by following the approach given in § Deﬁnition: Privacy from the server.

For any given secret-sharedrelation S ( R ) at a server, any query predicate qp , and any real The adversary cannot launch any attack against the DB owner. We do not considercyber-attacks that can exﬁltrate data from the DB owner directly, since defending againstgeneric cyber-attacks is outside the scope of this paper. user, say U , there exists a probabilistic polynomial time (PPT)user U (cid:48) in the ideal execution, such that the outputs to U and U (cid:48) for the query predicate qp on the relation S ( R ) are identical. Properties of veriﬁcation.

We provide veriﬁcation propertiesagainst malicious behaviors. A veriﬁcation method must be obliv-ious and ﬁnd any misbehavior of the servers when computing aquery. We follow the veriﬁcation properties from [38], as follows:( i ) the veriﬁcation method cannot be refuted by the majority of themalicious servers, and ( ii ) the veriﬁcation method should not leakany additional information. Algorithms’ performance.

We analyze our oblivious aggregationalgorithms on the following parameters, which are stated inTable 7: ( i ) Communication rounds . The number of rounds thatis required between the user and each server to obtain an answerto the query. ( ii ) Scan cost at the server . We measure scan costat the server in terms of the number of the rounds that the serverperforms to read the entire dataset . ( iii ) Computational cost atthe user . The number of values/tuples that the user interpolates toknow the ﬁnal output. O BSCURE

Overview

Let us introduce O

BSCURE at a high-level. O

BSCURE allowssingle-dimensional and multi-dimensional conjunctive/disjunctiveequality queries. Note that the method of O

BSCURE for handlingthese types of queries is different from SQL, since O

BSCURE does not support query optimization and indexing due to secret-shared data. Further, O BSCURE handles range-based queries byconverting the range into equality queries. Executing a query onO

BSCURE requires four phases, as follows:P

HASE Data upload by DB owner(s).

The DB owner uploadsdata to non-communicating servers using a secret-sharing mech-anism that allows addition and multiplication ( e . g ., [27]) at theservers.P HASE Query generation by the user.

The user generates aquery, creates secret-shares of the query predicate, and sends themto the servers. For generating secret-shares of the query predicate,the user follows the strategies given in § § § § § HASE Query processing by the servers.

The servers processan input query in an oblivious manner such that neither the querynor the results satisfying the query are revealed to the adversary.Finally, the servers transfer their outputs to the user.P

HASE Result construction by the user.

The user performsLagrange interpolation on the received results, which provide ananswer to the query. The user can also verify these results byfollowing the methods given in § § § BSCURE , where α and γ are aggregation operators, such as count, sum, average, maximum,and minimum. In order to execute these operators, we provide For the class of queries considered (viz. aggregation with selection), the mainoptimization in standard databases is to push selections down and to determine whether anindex-scan should be used or not. In secret-sharing, an index scan cannot be used (at leastnot in any obvious way), since sub-setting the data processed will reveal access-patterns,making the technique less secure. Hence, we avoid using any indexing structure.

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 6 algorithms in the following sections. As will become clear soon,the proposed algorithms may take at most three communicationrounds between the servers and the user. Further, note that inO

BSCURE , a group-by query requires us to know the name ofgroups, prior to query execution. For example, if the group-byoperation is executed on

Department attribute, then we need toknow all unique department names.

ATA O UTSOURCING

This section provides details on creating and outsourcing adatabase of secret-shared form. The DB owner wishes to outsourcea relation R having attributes A , A , . . . , A m and n tuples, andcreates the following two relations R and R : • Relation R that consists of all the attributes A , A , . . . , A m along with two additional attributes, namely TID (tuple-id) and

Index . As will become clear in §

7, the

TID attribute will helpin ﬁnding tuples having the maximum/minimum/top-k values, andthe

Index attribute will be used to know the tuples satisfying thequery predicate. The i th values of the TID and

Index attributeshave the same and unique random number between 1 to n . • Relation R that consists of three attributes CTID (cleartexttuple-id),

SSTID (secret-shared tuple-id), and an attribute, say A c , on which a comparison operator (minimum, maximum, andtop-k) needs to be supported. The i th values of the attributes CTID and

SSTID of the rela-tion R keep the i th value of the TID attribute of the relation R .The i th value of the attributes A c of the relation R keeps the i th value of an attribute of the relation R on which the user wants toexecute a comparison operator. Further, the tuples of the relations R are randomly permuted. The reason for doing permutation isthat the adversary cannot relate any tuple of both the secret-sharedrelations, which will be clear soon by the example below. Note.

The relation S ( R ) will be used to answer count and sumqueries, while it will be clear in § S ( R ) and S ( R ) together to fetch a tuple havingmaximum/minimum/top-k/reverse-top-k value in an attribute. EmpID Name Salary Dept

E101 John 1000 TestingE101 John 100000 SecurityE102 Adam 5000 TestingE103 Eve 2000 DesignE104 Alice 1500 DesignE105 Mike 2000 Design

TABLE 5: A relation:

Employee . Example . Consider the

Employee relation (see Table 5). The DBowner creates R = Employee1 relation (see Table 6a) with TID and

Index attributes. Further, the DB owner creates R = Employee2 relation (see Table 6b) having three attributes

CTID , SSTID , and

Salary . Creating secret-shares.

Let A i [ a j ] ( ≤ i ≤ m + 1 and ≤ j ≤ n ) be the j th value of the attribute A i . The DBowner creates c secret-shares of each attribute value A i [ a j ] of therelation R using a secret-sharing mechanism that allows string-matching operations at the server (as speciﬁed in § c shares of the j th value of the attribute A m +2 ( i . e ., Index ) If there are x attributes on which comparison operators will be executed, then theDB owner will create x relations, each with attributes CTID , SSTID , and one of the x attributes. For verifying results of count and sum queries, we add two more attributes to thisrelation. However, we do not show here, since veriﬁcation is not a mandatory step.

EmpID Name Salary Dept TID Index

E101 John 1000 Testing 3 3E101 John 100000 Security 2 2E102 Adam 5000 Testing 5 5E103 Eve 2000 Design 4 4E104 Alice 1500 Design 1 1E105 Mike 2000 Design 6 6 (a) R = Employee1 relation.

CTID SSTID Salary (b) R = Employee2 relation.

TABLE 6: Two relations obtained from

Employee relation.

Algorithms Queryconditions Scanrounds ata server Comm.rounds Interpolatedvalues atuserCount § Sum § Unconditional max./min.(

SDBMax § m Conditionalmaximum/minimum(

SDBMax § n + 1 or T + 1 Tuple fetching 2 2 n + m or T + m Maximum/Minimum(

MDBMax ) Oneoccurrences § B Counting n + 1 n + 3 T + (cid:96)m Group-by § g Top-k or reverse top-k § C Unique occur-rence 1 or k k × m Notations. m : n : D = n × m : the database. 1D: Single-dimensional equality query. CE: Conjunctive equality query. DE: Disjunctive equalityquery. T : (cid:96) : g : Condition : the above-mentioned rounds are givenwhen we have l + 1 shares, where l is the maximum length of a keyword. TABLE 7: Complexities of the algorithms.are obtained using SSS. This will result in c relations: S ( R ) , S ( R ) , . . . , S ( R ) c , each having m + 2 attributes. The notation S ( R ) k denotes the k th secret-shared relation of R at the server k . We use the notation A i [ S ( a j )] k to indicate the j th secret-shared value of the i th attribute of a secret-shared relation at theserver k .Further, on the relation R , the DB owner creates c secret-shares of each value of SSTID using a secret-sharing mechanismthat allows string-matching operations on the servers and eachvalue of A c using order-preserving secret-sharing [29], [30], [37].The secret-shares of the relation R are denoted by S ( R ) i ( ≤ i ≤ c ). The attribute CTID is outsourced in cleartext withthe shared relation S ( R ) i . It is important to mention that CTID attribute allows fast search due to cleartext representation than

SSTID attribute, which allows search over shares.Note that the DB owner’s objective is to hide any relationshipbetween the two relations when creating shares of the relations S ( R ) and S ( R ) , i . e ., the adversary cannot know by just observ-ing any two tuples of the two relations that whether these tuplesshare a common value in the attribute TID / SSTID and A c or not.Thus, shares of an i th ( ≤ i ≤ n ) value of the attribute TID in the relation S ( R ) j and in the attribute SSTID of the relation S ( R ) j must be different at the j th server. Also, by default, theattribute A c have different shares in both the relations, due tousing different secret-sharing mechanisms for different attributes.The DB owner outsources the relations S ( R ) i and S ( R ) i to the i th server. Note.

Naveed et al. [40] showed that a cryptographically secureddatabase that is also an using order-preserving cryptographictechnique ( e . g ., order-preserving encryption or OP-SS) may revealthe entire data when mixed with publicly known databases. Hence, EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 7 in order to overcome such a vulnerability of order-preserving cryp-tographic techniques, we created two relations, and importantly,the above-mentioned representation, even though it uses OP-SSdoes not suffer from attacks based on background knowledge, asmentioned in §

2. Of course, instead of using the two relations,the DB owner can outsource only a single relation without usingOP-SS. In the case of a single relation, while we reduce the sizeof the outsourced dataset, we need to compare each pair of twoshares, and it will result in increased communication cost, as wellas, communication rounds, as shown in previous works [18], [25],which were developed to compare two shares.

OUNT Q UERY AND V ERIFICATION

In this section, we develop techniques to support count queriesover secret-shared dataset outsourced by a single or multiple DBowners. The query execution does not involve the DB owner orthe querier to answer the query. Further, we develop a method toverify the count query results.

Conjunctive count query.

Our conjunctive equality-based countquery scans the entire relation only once for checking sin-gle/multiple conditions of the query predicate. For exam-ple, consider the following conjunctive count query: selectcount(*) from R where A = v ∧ A = v ∧ . . . ∧ A m = v m .The user transforms the query predicates to c secret-shares that result in the following query at the j th server: select count(*) from S ( R ) j where A = S ( v ) j ∧ A = S ( v ) j ∧ . . . ∧ A m = S ( v m ) j . Note that thesingle-dimensional query will have only one condition. Eachserver j performs the following operations: Output = (cid:80) k = nk =1 (cid:81) i = mi =1 ( A i [ S ( a k )] j ⊗ S ( v i ) j ) ⊗ shows a string-matching operation that depends on the under-lying text representation. For example, if the text is represented asa unary vector, as explained in § ⊗ is a bit-wise multiplicationand addition over a vector’s elements, whose results will be or of secret-share form. Each server j compares the query predicatevalue S ( v i ) against k th value ( ≤ k ≤ n ) of the attribute A i ,multiplies all the resulting comparison for each of the attributes forthe k th tuple. This will result in a single value for the k th tuple,and ﬁnally, the server adds all those values. Since secret-sharing allows the addition of two shares , the sum of all n resultant sharesprovides the occurrences of tuples that satisfy the query predicateof secret-share form in the relation S ( R ) at the j th server. Onreceiving the values from the servers, the user performs Lagrangeinterpolation [22] to get the ﬁnal answer in cleartext. Correctness . The occurrence of k th tuple will only be includedwhen the multiplication of m comparisons results in of secret-share form. Having only a single as a comparison resultant overan attribute of k th tuple produce of secret-share form; thus, the k th tuple will not be included. Thus, the correct occurrences overall tuples are included that satisfy the query’s where clause. Example.

We explain the above conjunctive count query methodusing the following query on the

Employee relation (referto Table 5): select count(*) from Employee whereName = ‘John’ and Salary = ‘1000’ . Table 8 showsthe result of the private string-matching on the attribute

Name ,denoted by o , and on the attribute Salary , denoted by o .Finally, the last column shows the result of the query for eachrow and the ﬁnal count answer for all the tuples. Note that for Name o Salary o o × o John 1 1000 1 1John 1 100000 0 0Adam 0 5000 0 0Eve 0 2000 0 0Alice 0 1500 0 0Mike 0 2000 0 01

TABLE 8: An execution of the conjunctive count query.the purpose of explanation, we use cleartext values; however, theserver will perform all operations over secret-shares. For the ﬁrsttuple, when the servers check the ﬁrst value of

Name attributeagainst the query predicate

John and the ﬁrst value of

Salary attribute against the query predicate , the multiplication ofboth the results of string-matching becomes 1. For the secondtuple, when the server checks the second value of

Name and

Salary attributes against the query predicate

John and ,respectively, the multiplication of both the results become 0. Allthe other tuples are processed in the same way.

Disjunctive count query.

Our disjunctive count query also scansthe entire relation only once for checking multiple conditions ofthe query predicate, like the conjunctive count query. Consider,for example, the following disjunctive count query: selectcount(*) from R where A = v ∨ A = v ∨ . . . ∨ A m = v m The user transforms the query predicates to c secret-sharesthat results in the following query at the j th server: selectcount(*) from S ( R ) j where A = S ( v ) j ∨ . . . ∨ A m = S ( v m ) j The server j performs the following operation: Result ki = A i [ S ( a k )] j ⊗ S ( v i ) j , ≤ i ≤ m Output = (cid:80) k = nk =1 ((( Result k OR Result k ) OR Result k ) . . . OR Result km ) To capture the OR operation for each tuple k , the servergenerates m different results either 0 or 1 of secret-shareform, denoted by Result i ( ≤ i ≤ m ), each of whichcorresponds to the comparison for one attribute. To com-pute the ﬁnal result of the OR operation for each tuple k ,one can perform binary-tree style computation. However, forsimplicity, we used an iterative OR operation, as follows: temp k = Result k + Result k − Result k × Result k temp k = temp k + Result k − temp k × Result k ... Output k = temp km − + Result km − temp km − × Result km After performing the same operation on each tuple, ﬁnally, theserver adds all the resultant of the OR operation ( (cid:80) k = nk =1 Output k )and sends to the user. The user performs an interpolation on thereceived values that is the answer to the disjunctive count query. Correctness . The disjunctive counting operation counts only thosetuples that satisfy one of the query predicates. Thus, by performing OR operation over string-matching resultants for an i th tupleresults in 1 of secret-share form, if the tuple satisﬁed one ofthe query predicates. Thus, the sum of the OR operation resultantsurely provides an answer to the query. Information leakage discussion.

The user sends query predicatesof secret-share form, and the string-matching operation is executedon all the values of the desired attribute. Hence, access-patterns arehidden from the adversary, so that the server cannot distinguishany query predicate in the count queries. The output of any countquery is of secret-share form and contains an identical numberof bits. Thus, based on the output size, the adversary cannotknow the exact count, as well as, differentiate two count queries.

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 8

However, the adversary can know whether the count query issingle-dimensional, conjunctive or disjunctive count query.

In this section, we describe how results of count query can beveriﬁed. Note that we explain the algorithms only for a single-dimensional query predicate. Conjunctive and disjunctive predi-cates can be handled in the same way.Here, our objective is to verify that ( i ) all tuples of thedatabases are checked against the count query predicates, and ( ii )all answers to the query predicate ( or of secret-share form) areincluded in the answer. In order to verify both the conditions, theserver performs two functions, f and f , as follows: op = f ( x ) = (cid:80) i = ni =1 ( S ( x i ) ⊗ o i ) op = op + f ( y ) = op + (cid:80) i = ni =1 f ( S ( y i ) ⊗ (1 − o i )) i . e ., the server executes the functions f and f on n secret-sharedvalues each (of two newly added attributes A x and A y , outsourcedby the DB owner, described below). In the above equations o i is the output of the string-matching operation carried on the i th value of an attribute, say A j , on which the user wants to executethe count query. The server sends the outputs of the function f , denoted by op , and the sum of the outputs of f and f ,denoted by op , to the user. The outputs op and op ensure thecount result veriﬁcation and that the server has checked each tuple,respectively. The veriﬁcation method for a count query works asfollows: The DB owner.

For enabling a count query result veriﬁcation overany attribute, the DB owner adds two attributes, say A x and A y ,having initialized with one, to the relation R . The values of theattributes A x and A y are also outsourced of SSS form (not unaryrepresentations) to the servers. Server.

Each server k executes the count query, as mentioned in § i . e ., it executes the private string-matching operation on the i th ( ≤ i ≤ n ) value of the attribute A j against the querypredicate and adds all the resultant values. In addition, each server k executes the functions f and f . The function f (and f )multiplies the i th value of the A x (and A y ) attribute by the i th string-matching resultant (and by the complement of the i th string-matching resultant). The server k sends the following three things:( i ) the sum of the string-matching operation over the attribute A j ,as a result, say (cid:104) result (cid:105) k , of the count query, ( ii ) the outputs of thefunction f : (cid:104) op (cid:105) k , and ( iii ) the sum of outputs of the function f and f : (cid:104) op (cid:105) k , to the user. User-side.

The user interpolates the received three values fromeach server, which result in

Iresult , Iop , and Iop . If the serverfollowed the algorithm, the user will obtain: Iresult = Iop and Iop = n , where n is the number of tuples in the relation, and itis known to the user. Example.

In Appendix A, we provide an example of count queryveriﬁcation over secret-shares. However, here, we explain theabove method using the following query on the

Employee rela-tion (refer to Table 5): select count(*) from Employeewhere Name = ‘John’ . Table 9 shows the result of the pri-vate string-matching, functions f and f at a server. Note thatfor the purpose of explanation, we use cleartext values; however,the server will perform all operations over secret-shares. For theﬁrst tuple, when the servers check the ﬁrst value of Name attributeagainst the query predicate, the result of string-matching becomes1 that is multiplied by the ﬁrst value of the attribute A x , andresults in 1. The complement of the resultant is multiplied by Name

String-matching results f f John 1 1 0John 1 1 0Adam 0 0 1Eve 0 0 1Alice 0 0 1Mike 0 0 12 2 4

TABLE 9: An execution of the count query veriﬁcation.the ﬁrst value of the attribute A y , and results in 0. All the othertuples are processed in the same way. Note that for this query, result = op = 2 and op = 6 , if server performs each operationcorrectly. Correctness . Consider two cases: ( i ) all servers discard an entireidentical tuple for processing, or ( ii ) all servers correctly processeach value of the attribute A j , op , and op ; however, they do notadd an identical resultant, o i ( ≤ i ≤ n ), of the string-matchingoperation. In the ﬁrst case, the user ﬁnds Iresult = Iop to betrue. However, the second condition ( Iop = n ) will never betrue, since discarding one tuple will result in Iop = n − . In thesecond case, the servers will send the wrong result by discardingan i th count query resultant, and they will also discard the i th value of the attribute A x to produce Iresult = Iop at the user-side. Here, the user, however, ﬁnds the second condition Iop = n to be false.Thus, the above veriﬁcation method correctly veriﬁes thecount query result, always, under the assumption of SSS that anadversary cannot collude all (or the majority of) the servers, asgiven in § UM AND A VERAGE Q UERIES

The sum and average queries are based on the search operationas mentioned above in the case of conjunctive/disjunctive countqueries. In this section, we brieﬂy present sum and average querieson a secret-shared database outsourced by single or multiple DBowners. Then, we develop a result veriﬁcation approach for sumqueries.

Conjunctive sum query.

Consider the following query: selectsum( A (cid:96) ) from R where A = v ∧ A = v ∧ . . . ∧ A m = v m .In the secret-sharing setting, the user transforms the abovequery into the following query at the j th server: selectsum( A (cid:96) ) from S ( R ) j where A = S ( v ) j ∧ A = S ( v ) j ∧ . . . ∧ A m = S ( v m ) j . This query will be executedin a similar manner as conjunctive count query except for thedifference that the i th resultant of matching the query predicateis multiplied by the i th values of the attribute A (cid:96) . The j th serverperforms the following operation on each attribute on which theuser wants to compute the sum, i . e ., A (cid:96) and A q : (cid:80) k = nk =1 A (cid:96) [ S ( a k )] j × ( (cid:81) i = mi =1 ( A i [ S ( a k )] j ⊗ S ( v i ) j )) Correctness . The correctness of conjunctive sum queries is similarto the argument for the correctness of conjunctive count queries.

Disjunctive sum query.

Consider the following query: selectsum( A (cid:96) ) from R where A = v ∨ A = v ∨ . . . ∨ A m = v m . The user transforms the query predicates to c secret-sharesthat results in the following query at the j th server: select sum( A (cid:96) ) from S ( R ) j where A = S ( v ) j ∨ A = S ( v ) j ∨ . . . ∨ A m = S ( v m ) j The server j executes the following computation: Result ki = A i [ S ( a k )] j ⊗ S ( v i ) j , ≤ i ≤ m, ≤ k ≤ n Output = (cid:80) k = nk =1 A (cid:96) [ S ( a k )] j × ((( Result k OR Result k ) OR EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 9

Result k ) . . . OR Result kn ) The server multiplies the k th comparison resultant by the k th value of the attribute, on which the user wants to execute the sumoperation ( e . g ., A (cid:96) ), and then, adds all values of the attribute A (cid:96) . Correctness . The correctness of a disjunctive sum query is similarto the correctness of a disjunctive count query.

Average queries.

In our settings, computing the average queryis a combination of the counting and the sum queries. The userrequests the server to send the count and the sum of the desiredvalues, and the user computes the average at their end.

Information leakage discussion.

Sum queries work identicallyto count queries. Sum queries, like count queries, hide the factswhich tuples are included in the sum operation, and the sum of thevalues.

Now, we develop a result veriﬁcation approach for a single-dimensional sum query. The approach can be extended for con-junctive and disjunctive sum queries. Let A (cid:96) be an attribute whosevalues will be included by the following sum query. selectsum( A (cid:96) ) from R where A q = v .Here, our objective is to verify that ( i ) all tuples of thedatabases are checked against the sum query predicates, A q = v ,and ( ii ) only all qualiﬁed values of the attribute A (cid:96) are includedas an answer to the sum query. The veriﬁcation of a sum queryﬁrst veriﬁes the occurrences of the tuples that qualify the querypredicate, using the mechanism for count query veriﬁcation ( § f and f , to verifyboth the conditions of sum-query veriﬁcation in an obliviousmanner, as follows: op = f ( x ) = (cid:80) i = ni =1 o i ( x i + a i + o i ) op = f ( x ) = (cid:80) i = ni =1 o i ( y i + a i + o i ) i . e ., the server executes the functions f and f on n values,described below. In the above equations, o i is the output of thestring-matching operation carried on the i th value of the attribute A q , and a i be the i th ( ≤ i ≤ n ) value of the attribute A (cid:96) . Theserver sends the sum of the outputs of the function f , denotedby op , and the outputs of f , denoted by op , to the user.Particularly, the veriﬁcation method for a sum query works asfollows: The DB owner.

Analogous with the count veriﬁcation method,if the data owner wants to provide veriﬁcation for sum queries,new attributes should be added. Thus, the DB owner adds twoattributes, say A x and A y , to the relation R . The i th values ofthe attributes A x and A y are any two random numbers whosedifference equals to − a i , where a i is the i th value of the attribute A (cid:96) . The values of the attributes A x and A y are also secret-sharedusing SSS. For example, in Table 10, boldface numbers show theserandom numbers of the attribute A x and A y in cleartext. Servers.

The servers execute the above-mentioned sum query, i . e .,each server k executes the private string-matching operation onthe i th ( ≤ i ≤ n ) value of the attribute A q against the querypredicate v and multiplies the resultant value by the i th value ofthe attribute A (cid:96) . The server k adds all the resultant values of theattributes A (cid:96) . Veriﬁcation stage . The server k executes the functions f and f on each value x i and y i of the attributes A x and A y , by followingthe above-mentioned equations. Finally, the server k sends thefollowing three things to the user: ( i ) the sum of the resultantvalues of the attributes A (cid:96) , say (cid:104) sum (cid:96) (cid:105) k , ( ii ) the sum of the output Dept

Salary o val-ues A x and f A y and f Testing 1000 1 1( +1000+1) = − +1000+1) = − +100000+0) = − +100000+0) = − +5000+1) = −

899 1( +5000+1) = +2000+0) = − +2000+1) = +1500+0) = − +1500+0) = − +2000+0) = +2000+0) = (cid:80) f = (cid:80) f = TABLE 10: An execution of the sum query veriﬁcation.of the string-matching operations carried on the attribute A q , say (cid:104) sum q (cid:105) k , against the query predicate, and ( iii ) the sum of outputsof the functions f and f , say (cid:104) sum f f (cid:105) k . User-side.

The user interpolates the received three values fromeach server, which results in

Isum (cid:96) , Isum q , and Isum f f . Theuser checks the value of Isum f f − × Isum q and Isum (cid:96) , and ifit ﬁnds equal, then it implies that the server has correctly executedthe sum query.

Example.

We explain the above method using the fol-lowing query on the

Employee relation (refer to Ta-ble 5): select sum(Salary) from Employee whereDept = ‘Testing’ . Table 10 shows the result of the privatestring-matching ( o ), the values of the attributes A x and A y inboldface, and the execution of the functions f and f at aserver. Note that for the purpose of explanation, we show theveriﬁcation operation in cleartext; however, the server will performall operations over secret-shares.For the ﬁrst tuple, when the server checks the ﬁrst value of Dept attribute against the query predicate, the string-matchingresultant, o , becomes 1 that is multiplied by the ﬁrst value ofthe attribute Salary . Also, the server adds the salary of the ﬁrsttuple to the ﬁrst values of the attributes A x and A y with o . Then,the server multiplies the summation outputs by o .For the second tuple, the servers perform the same operations,as did on the ﬁrst tuple; however, the string-matching resultant o becomes , which results in the second values of the attributes A x and A y to be . The servers perform the same operations on theremaining tuples. Finally, the servers send the summation of o i ( i . e ., ), the sum of the salaries of qualiﬁed tuples ( i . e ., ), andthe sum of outputs of the functions f and f ( i . e ., ), to theuser. Note that for this query, Isum f f − × Isum q = Isum (cid:96) , i . e ., − × . Correctness . The occurrences of qualiﬁed tuples against a querypredicates can be veriﬁed using the method given in § i ) all servers discard an entire identical tuple forprocessing, or ( ii ) all servers correctly process the query predicate,but they discard the i th values of the attributes A (cid:96) , A x , and A y .The ﬁrst case is easy to deal with, since the count queryveriﬁcation will inform the user that an identical tuple is dis-carded by the server for any processing. In the second case,the user ﬁnds Isum f f − × Isum q (cid:54) = Isum (cid:96) , since anadversary cannot provide a wrong value of

Isum q , which isdetected by count query veriﬁcation. In order to hold the equation Isum f f − × Isum q = Isum (cid:96) , the adversary needs to generateshares such that

Isum f f − Isum (cid:96) = 2 × Isum q , but anadversary cannot generate any share, as per the assumption ofSSS that an adversary cannot produce a share, since it requires tocollude all (or the majority of) the servers, which is impossibledue to the assumption of SSS, as mentioned in § If users are interested, they can also verify this result using the method given in § EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 10

AXIMUM Q UERY

This section provides methods for ﬁnding the maximum valueand retrieving the corresponding tuples for the two types ofqueries, where the ﬁrst type of query (

QMax1 ) does not have anyquery condition, while another (

QMax2 ) is a conditional query, asfollows:

QMax1 . select * from Employee where Salaryin (select max(Salary) from Employee)

QMax2 . select * from Employee as E1 whereE1.Dept = ’Testing’ and Salary in (selectmax(salary) from Employee as E2 whereE2.Dept = ’Testing’) Note that the string-matching secret-sharing algorithms (as ex-plained in §

2) cannot ﬁnd the maximum value, as these algo-rithms provide only equality checking mechanisms, not comparingmechanisms to compare between values. For answering maxi-mum queries, we provide two methods: The ﬁrst method, called

SDBMax is applicable for the case when only a single DB owneroutsources the database. It will be clear soon that

SDBMax takesonly one communication round when answering an unconditionalquery (like

QMax1 ) and at most two communication rounds foranswering a conditional query (like

QMax2 ). The second method,called

MDBMax is applicable to the scenario when multiple DBowners outsource their data to the servers.

SDBMax . In this section, we assume that A c be an attribute ofthe relation S ( R ) on which the user wishes to execute maximumqueries. Our idea is based on a combination of OP-SS [29], [37]and SSS [27], [44] techniques. Speciﬁcally, for answering maxi-mum queries, SDBMax uses the two relations S ( R ) and S ( R ) ,which are secured using secret-shared and OP-SS, respectively,as explained in § § A c will exist in the relations S ( R ) i and S ( R ) i at the server i . The strategy is to jointly execute a queryon the relations S ( R ) i and S ( R ) i and obliviously retrieve theentire tuple from S ( R ) i . In this paper, due to space restrictions,we develop SDBMax for the case when only a single tuple hasthe maximum value; for example, in

Employee relation (seeTable 5), the maximum salary over all employees is unique.

Recall that by observing the shares of the attribute A c of therelation S ( R ) , the server cannot ﬁnd the maximum value of theattribute A c . However, the server can ﬁnd the maximum value ofthe attribute A c using the relation S ( R ) , which is secret-sharedusing OP-SS. Thus, to retrieve a tuple having the maximum valuein the attribute A c of the relation S ( R ) i , the i th server executesthe following steps:1) On the relation S ( R ) i . Since the secret-shared values of theattribute A c of the relation S ( R ) i are comparable, the server i ﬁnds a tuple (cid:104) S ( t k ) , S ( value ) (cid:105) i having the maximum value inthe attribute A c , where S ( t k ) i is the k th secret-shared tuple-id(in the attribute SSTID ) and S ( value ) i is the secret-shared valueof the A c attribute in the k th tuple.2) On the relation S ( R ) i . Now, the server i performs thejoin of the tuple (cid:104) S ( t k ) , S ( value ) (cid:105) i with all the tuples Note that we considered only a single-dimensional condition in

QMax2 query.Our proposed algorithms (without any modiﬁcation) can ﬁnd maximum/minimum whilesatisfying conjunctive and disjunctive conditions. of the relation S ( R ) i by comparing the tuple-ids ( TID attribute’s values) of the relation S ( R ) i with S ( t k ) i , as follows: (cid:80) k = nk =1 A p [ S ( a k )] i × ( TID [ S ( a k )] i ⊗ S ( t k ) i ) Where p ( ≤ p ≤ m ) is the number of attributes in the relation R and TID is the tuple-id attribute of S ( R ) i . The server i compares the tuple-id (cid:104) S ( t k ) (cid:105) i with each k th value of theattribute TID of S ( R ) i and multiplies the resultant by the ﬁrst m attribute values of the tuple k . Finally, the server i adds all thevalues of each m attribute. Correctness . The server i can ﬁnd the tuple having the maximumvalue in the attribute A c of the relation S ( R ) i . Afterward, thecomparison of the tuple-id S ( t k ) i with all the values of the TID attribute of the relation S ( R ) i results in n − zeros (when thetuple-ids do not match) and only one (when the tuple-ids match)of secret-share form. Further, the multiplication of the resultant ( or of secret-share form) by the entire tuple will leave only onetuple in the relation S ( R ) i , which satisﬁes the query. Information leakage discussion.

The adversary will know onlythe order of the values, due to OP-SS implemented on the relation S ( R ) . However, revealing only the order is not threatening, sincethe adversary may know the domain of the values, for example,the domain of age or salary.Recall that, as mentioned in § S ( R ) and S ( R ) share attributes: TID / SSTID and A c (the attribute onwhich a comparison operation will be carried). However, by justobserving these two relations, the adversary cannot know anyrelationship between them, as well as, which tuple of the relation S ( R ) has the maximum value in the attribute A c , due to differentrepresentations of common TID / SSTID and A c values betweenthe relations. Furthermore, after the above-mentioned maximumquery ( QMax1 ) execution, the adversary cannot learn which tupleof the relation S ( R ) has the maximum value in the attribute A c ,due to executing an identical operation on each tuple of S ( R ) when joining with a single tuple of S ( R ) . The maximum value of the attribute A c may be different fromthe A c ’s maximum value of the tuple satisfying the where clauseof a query. For example, in Employee relation, the maximumsalary of the testing department is 2000, while the maximumsalary of the employees is 100000. Thus, the method given foranswering unconditional maximum queries is not applicable here.In the following, we provide a method to answer maximumqueries that have conditional predicates (like

QMax2 ), and thatuses two communication rounds between the user and the servers,as follows:

Round 1 . The user obliviously knows the indexes of the relation S ( R ) satisfying the where clause of the query (the method forobliviously ﬁnding the indexes is given below). Round 2 . The user interpolates the received indexes and sendsthe desired indexes in cleartext to the servers. Each server i ﬁndsthe maximum value of the attribute A c in the requested indexesby looking into the attribute CTID of the relation S ( R ) i andresults in a tuple, say (cid:104) S ( t k ) , S ( value ) (cid:105) i , where S ( t k ) i showsthe secret-shared tuple-id (from SSTID attribute) and S ( value ) i shows the secret-shared maximum value. Now, the server i per-forms a join operation between all the tuples of S ( R ) i and (cid:104) S ( t k ) , S ( value ) (cid:105) i , as performed when answering unconditionalmaximum ( QMax1 ) queries; see § EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 11

Note.

The difference between the methods for answering uncon-ditional and conditional maximum queries is that ﬁrst we need toknow the desired indexes of S ( R ) relation satisfying the where clause of a query in the case of conditional maximum queries. Correctness.

The correctness of the above method can be arguedin a similar way as the method for answering unconditionalmaximum queries.

Information leakage discussion.

In round 1, due to obliviouslyretrieving indexes of S ( R ) , the adversary cannot know whichtuples satisfy the query predicate. In round 2, the user sendsonly the desired indexes in cleartext to fasten the lookup of themaximum salary. Note that by sending indexes, the adversarylearns the number of tuples that satisﬁes the query predicate; however, the adversary cannot learn which tuples of the relation S ( R ) have those indexes. Due to OP-SS, the adversary alsoknows only the order of values of A c attribute in the requestedindexes. However, joining the tuple of S ( R ) , which has themaximum value in A c attribute, with all tuples of S ( R ) willnot reveal which tuple satisﬁes the query predicate, as well as,have the maximum value in A c . Aside: Hiding frequency-analysis in round 2 used for condi-tional maximum queries . In the above-mentioned round 2, theuser reveals the number of tuples satisfying a query predicate.Now, below, we provide a method to hide frequency-count infor-mation:

User-side.

The user interpolates the received indexes (after round1) and sends the desired indexes with some fake indexes, whichdo not satisfy the query predicate in the round 1, in cleartext to theservers. Let x = r + f be the indexes that are transmitted to theservers, where r and f be the real and fake indexes, respectively.Note that the maximum value of the attribute A c over x tuples maybe more than the maximum value over r tuples. Hence, the userdoes the following computation to appropriately send the indexes:The user arranges the x indexes in a √ x × √ x matrix, where all r real indexes appear before f fake indexes. Then, the user creates √ x groups of tuples ids, say g , g , . . . , g √ x , where all tuples idsin an i th row of the matrix become a part of the group g i . Notethat in this case only one of the groups, say g mix , may containboth the real and fake indexes. Now, the user asks the server toﬁnd the maximum value of the attribute A c in each group exceptfor the group g mix and to fetch all √ x tuples of the group g mix . Server.

For each group, g j , except the group g mix , each server i ﬁnds the maximum value of the attribute A c by looking into theattribute CTID of the relation S ( R ) i and results in a tuple, say (cid:104) S ( t k ) , S ( value ) (cid:105) i . Further, the server i fetches all √ x tuples ofthe group g mix . Then, the server i performs a join operation (basedon the attribute TID and

SSTID , as performed in the second stepfor answering unconditional maximum queries; see § S ( R ) i and √ x − tuples obtained from therelation S ( R ) , and returns √ x − tuples to the user. The userﬁnds the maximum value over the r real tuples. Note that √ x − tuples must satisfy a conditional maximum query; however, due tospace restrictions, we do not prove this claim here.Note that this method, on one hand, hides the frequency-count;on the other hand, it requires the servers and the user process moretuples than the method that reveals the frequency-count. The adversary may already know the classiﬁcation of tuples based on some criteria,due to her background knowledge. For example, the number of employees working ina department or the number of employees of certain names/age. Hence, revealing thenumber of tuples satisfying a query does not matter a lot; however, revealing that whichtuples satisfy a query may jeopardize the data security/privacy.

EmpID (cid:48)

Name (cid:48)

Salary (cid:48)

Dept (cid:48)

TID o A x A y

106 47 1000 80 3 1 1( +1233)=1733 1( -733 +1233)=500106 47 100000 120 2 0 0( +100273)=0 0( -99873 +100273)=0107 19 5000 80 5 0 0( +5211)=0 0( -5011 +5211)=0108 32 2000 51 4 0 0( +2195)=0 0( -1595 +2195)=0109 30 1500 51 1 0 0( +1690)=0 0( -1390 +1690)=0110 38 2000 51 6 0 0( +2199)=0 0( -2099 +2199)=0 op = 1733 op = 500 TABLE 11: An execution of the tuple retrieval veriﬁcation.

Obliviously ﬁnding the indexes.

For ﬁnding the indexes, eachserver k executes the following operation: Index [ i ] k × ( A p [ i ] k ⊗ S ( v ) k ) , i . e ., the server executes string-matching operations oneach value of the desired attribute, say A p , of the relation S ( R ) and checks the occurrence of the query predicate v . Then, theserver k multiplies the i th resultant of the string-matching oper-ation by the i th value of Index attribute of the relation S ( R ) .Finally, the server sends all the n values of the attribute Index tothe user, where n is the number of tuples in the relation. The userinterpolates the received values and knows the desired indexes. This section provides a method to verify the tuple having max-imum value in an attribute, A c . Note that verifying only themaximum value of the tuple is trivial, since (cid:104) S ( value ) (cid:105) i of S ( R ) i is also a part of the attribute of A c of S ( R ) i , and serverssend a joined output of the relations (see step 2 in § Veriﬁcation of retrieved tuple.

This method is an extension of thesum veriﬁcation method (as given in § f and f , in an oblivious manner, as follows: op = f ( x ) = (cid:80) i = ni =1 o i ( x i + s ij ) op = f ( x ) = (cid:80) i = ni =1 o i ( y i + s ij ) i . e ., the server executes the functions f and f on n values,described below. In the above equations, o i is the output of thestring-matching operation carried on the i th value of the TID attribute, and s i,j be the i th ( ≤ i ≤ n ) value of the attribute j , where ≤ j ≤ m . The server sends the difference of theoutputs of the functions f and f to the user. Particularly, thetuple veriﬁcation method works as follows: The DB owner.

The DB owner adds one value to each of theattribute values of a tuple along with new attributes, say A x and A y .Let A be an attribute having only numbers. For A attribute,the newly added i th value in cleartext is same as the existing i th value in A attribute. Let A be an attribute having Englishalphabets, say attribute Name in Employee relation in Table 5. Thenew value is the sum of the positions of each appeared alphabet inEnglish letters; for example, the ﬁrst value in the attribute

Name is John , the DB owner adds 47 (10+15+8+14). When creatingshares of the two values at the i th position of the attribute A or A , the ﬁrst value’s shares are created using the mechanism thatsupports string-matching at the server, as mentioned in § i th values of the attributes A x and A y are two randomnumbers whose difference equals to − a i , where a i is the i th The servers can also check conjunctive and/or disjunctive conditions, like one-dimensional condition (see § i th resultant of conjunctive and/ordisjunctive conditions matching by the i th value of Index attribute of the relation S ( R ) , and then, sends all the n values of the attribute Index to the user.

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 12 value obtained after summing all the newly added values to eachattribute of the i th tuple. The values of the attributes A x and A y are secret-shared using SSS. E.g., in Table 11, numbers shownewly added values to attributes Name (cid:48) , Dept (cid:48) , and randomnumbers (in bold-face) of the attributes A x and A y in cleartext(a prime ( (cid:48) ) symbol is used to distinguish these values from theoriginal attribute values). Servers.

Each server k executes the method for tuple retrievalas given in step 2 in § k executes functions f and f , i . e ., adds all the m newly added values (one in eachattribute) to x i and y i of the attributes A x and A y , respectively,and then, multiply the resultant of the string-matching operationcarried on TID attribute of the relation S ( R ) k . Finally, the server k sends the following two things to the user: ( i ) the tuple havingthe maximum value in the attribute A c of the relation S ( R ) k ;and ( ii ) the difference of outputs of the functions f and f , say (cid:104) diﬀ f f (cid:105) k . User.

After interpolation, the user obtains the desired tuple anda value, say

Idiﬀ f f . Like the DB owner, the user generatesa value for each of the attribute values of the received tuple(see the ﬁrst step above for generating values), compares against Isum f f , and if it ﬁnds equal, then it implies that the server hascorrectly sent the tuple. Example.

Table 11 shows the veriﬁcation process for the ﬁrsttuple-id of employee relation; see Table 5. Note that the valuesand computation are shown in the cleartext; however, the valuesare of secret-share form and the computation will be carried onshares at servers.

THER O PERATIONS

This section considers two more cases of a maximum query, wherethe maximum value can occur in multiple tuples ( § § B). Further,we present an algorithm for a group-by query.

In practical applications, more than one tuple may have themaximum value in an attribute, e . g ., two employees (E103 andE015) earn the maximum salary in design department; see Table 5.However, the above-mentioned methods (for QMax1 or QMax2 )cannot fetch all those tuples from the relation S ( R ) in one round.The reason is that since the server i uses OP-SS values of theattribute A c in the relation S ( R ) i for ﬁnding the maximumvalue, where more than one occurrences of a value have differentrepresentations, the sever i cannot ﬁnd all the tuples of S ( R ) having the identical maximum value, by looking OP-SS values.In this subsection, we, thus, provide a simple two-communication-round method for solving unconditional maxi-mum queries. This method can be easily extended to conditionalmaximum queries. Data outsourcing.

The DB owner outsources the relation S ( R ) as mentioned in § S ( R ) with four columns: CTID , SSTID , OP-SS- A c ,and SS- A c . The ﬁrst three columns are created in the same wayas mentioned in § i th value of SS- A c attribute has thesame value as the i th value of OP-SS- A c attribute. However,this value is secret-shared using the unary representation, as thecolumn A c of the relation S ( R ) has, and the DB owner usesdifferent polynomials over the i th value of the attribute A c of S ( R ) and the attribute SS- A c of S ( R ) ; thus, the adversarycannot observe that which two values are identical in the tworelations. Query execution.

The method uses two communication rounds asfollows:

Round 1.

In round 1, the server i ﬁnds a tuple (cid:104) S ( t k ) , S ( value ) , S ( value ) (cid:105) i having the maximum value (de-noted by (cid:104) S ( value ) (cid:105) i ) in the attribute A c , where S ( t k ) i isthe k th secret-shared tuple-id (in the attribute SSTID ) and (cid:104) S ( value ) (cid:105) i is the secret-shared value of the SS- A c attributein the k th tuple. Afterward, the server i performs the following: Index [ k ] × ( A C [ S ( k )] i ⊗ S ( value ) i ) , ≤ k ≤ n i . e ., the server compares (cid:104) S ( value ) (cid:105) i with each k th value of theattribute A c of the relation S ( R ) and multiplies the resultant bythe k th index values. The server i provides a list of n numbers tothe user. Round 2.

After interpolating n numbers, the user gets a listof n numbers having 0 and Index values, where the de-sired maximum value of the attribute A c exists. Then, theuser fetches all the tuples having the maximum values basedon the received Index value. In particular, the user createsnew secret-shares of the matching indexes in a way that theserver can perform searching operation on

TID attribute. Theserver executes the following computation to retrieve all thetuples, say T , having the maximum value in the attribute A c : (cid:80) k = nk =1 A p [ S ( a k )] i × ( TID [ S ( a k )] i ⊗ S ( t j ) i ) Where ≤ p ≤ m , ≤ j ≤ T and ≤ k ≤ n , i . e .,the server i compares each received tuple-id T with each tuple-idof the relation S ( R ) i and multiplies the resultant to the ﬁrst m attributes of the relation S ( R ) i . Finally, the server i adds all theattribute values for each tuple-id T . Complexities . As mentioned, fetching all tuples having the maxi-mum value in the attribute A c requires two communication roundswhen answering an unconditional query. Further, each server scansthe entire relation S ( R ) twice. However, ﬁnding the maximumnumber over the attribute OP-SS- A c can be done using an index. Information leakage discussion . The adversary learns the orderof the values. The adversary will not learn which tuple has themaximum value in the attribute A c . But, the adversary may learnhow many tuples have the maximum value. This can be preventedby asking queries for fake tuples in round 2 by generating random TID values, which should be larger than n (the number of tuplesin the relation). Aside.

We can prevent having to outsource S ( R ) , by adding oneadditional communication round between the user and the server.In that case, the server provides a tuple having the maximum valuein the attribute A c , and then, the user ﬁnds occurrences of themaximum value in the relation S ( R ) by using one additionalround. Note. Answering conditional maximum query.

The abovemechanism can easily be extended to support conditional maxi-mum queries. For answering a conditional maximum query, theuser includes the above-mentioned two steps to the method givenin § A c requires three communication rounds, and eachserver scans the entire relation S ( R ) three times. In particular, inthe ﬁrst round, the server i provides Index values to the user. Inthe second round, the server i ﬁnds the tuple having the maximumvalue in the attribute A c from the requested tuple-ids, implementsthe above-mentioned method given in round 1, and provides a list EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 13 of n numbers. In the last round, the user fetches all the desiredtuples. A group-by query in combination with aggregation (viz.,count/sum), can be executed similar to the aggregation query asmentioned in § §

6, if the set of possible values – for theattribute on which the group-by query will be executed – is knownto the user in advance. For example, consider the following group-by query on

Employee relation, shown in Table 5: select Dept, count(Dept) from Employeegroup by Dept

In this query, the user needs to know the name of departments, i . e , Testing , Security , and

Design , and then, the user canexecute the query at the servers for each department. Below,we brieﬂy summarize, the execution of a group-by query withcount/sum aggregation operation.

Group-by query with count . Consider the following group-byquery: select A i , count( A i ) from R group by A i .For answering this group-by query, the server j executes thefollowing computation on each tuple of the relation R for eachgroup ( to g ): Output l = (cid:80) k = nk =1 ( A i [ S ( a k )] j ⊗ S ( v l ) j ) Where ≤ l ≤ g , v l is the name of each group, ⊗ shows a string-matching operation, and Output l is the answer to the group-byquery. The server j will return (cid:104) S ( v l ) j , Output l (cid:105) , where ≤ l ≤ g . The user interpolates the received answers from the server toobtain the ﬁnal answer to the query. Note that since the user willreceive each group name, the user will know the correct answer togroup-by queries for each group. Group-by query with sum . Consider the following group-by queryinvolving sum operation: select A i , sum( A (cid:96) ) from R group by A i . For answering this group-by query, the server j executes the following computation on each tuple of the relation R for each group ( to g ): Sum l = (cid:80) k = nk =1 A (cid:96) [ S ( a k )] j × ( A i [ S ( a k )] j ⊗ S ( v l ) j ) Where ≤ l ≤ g , v l is the name of each group, ⊗ shows astring-matching operation, and Sum l is the answer to the group-byquery. The server j will return (cid:104) S ( v l ) j , Sum l (cid:105) , where ≤ l ≤ g . Information leakage discussion.

In executing the followingquery: select Dept, count(Dept) from Employeegroup by Dept , the adversary may learn the number ofgroups in an attribute, by receiving only three values, one valuefor each department. However, the user may also hide suchinformation, by asking queries for additional fake groups. Forexample, the user may ask the count query for fake groups suchas

Sale and

Production , including the three real groups(

Testing , Security , and

Design ). Since the stored data andquery predicates are secret-shared, the adversary cannot learn howmany unique values exist in an attribute. In this case, the countquery answer for real groups, after interpolation at the user-side,will produce the desired answers, but for the fake group, the userwill obtain zero as the answer. Since the user knows the real andfake groups, the user can distinguish the results.Note that since the proposed algorithms for group-by queriesproduce the result of secret-shared form, it prevents the adversaryto know the frequency-count of each group. In addition, sincethe proposed algorithms check each group name against thedesired attribute’s values of each tuple, it hides access-patternsand prevents the adversary to know which group name is real or fake. Further, note that different attributes in a relation mayhave a different number of unique values, and hence, group-byqueries over different attributes will produce a different numberof answers, (depending on the unique values in attributes). Wecan also hide this fact by executing a group-by query for fakegroups. While such a method will prevent information leakagesbased on the number of groups across different groups, it willincur computational cost and communication cost.

As we mentioned, we convert a range query into several pointqueries that cover the entire range. However, as per Exp 8 (Fig-ure 5), as the range increases, the computation time also increases.In order to reduce the computation time, we propose a new methodthat creates bins over the domain of attribute values and organizesthese bins into a k -way tree, where k is the number of child nodesof a node or the number of values in each node at the lowest level.The bucketization-based range queries works as follows: DB owner.

Assume that the domain of values in an attribute has , , . . . , n numbers. The DB owner ﬁrst creates a k -way tree, bycreating n/k nodes at the 0 th -level by placing , , . . . , k numbersin the ﬁrst node, k + 1 , k + 2 , . . . , k numbers in the second node,and so on. The ﬁrst level node has (cid:100) n/k (cid:101) nodes, where the ﬁrstnode of the ﬁrst level becomes the parent of the ﬁrst k nodes of0 th node. The second node of the ﬁrst level becomes a parentof k + 1 , k + 2 , . . . , k nodes of 0 th -level. In this way, the DBowner constructs a k -way tree of height (cid:100) log k ( n/k )+1 (cid:101) . Now foreach level, except the root node and the leaf level, the DB owneradds one attribute in the relation R . An i th value of the attributecorresponding to a level, say j , is set to be the node id of the j th level’s node that covers the i th value at the leaf level ( i . e ., level0). Assume that an attribute A of a relation R has 32 numbers( , , . . . , ). Here, we show how does the DB owner create a -way tree and three additional columns. Figure 1 shows a -waytree for 32 numbers. In a -way tree, the 0 th level has n/k = 16 nodes, each with two numbers. The tree height is log k ( n/k )+1 =5 . Here, the DB owner adds three columns, say A , A , and A ,in the relation for levels 1, 2, and 3 of the tree; see Table 12.Note that, for example, th value of the attributes A , A , and A contains node-ids of the respective levels that cover th valueof the level 0. Thus, the attribute A contains 103, since Node103 covers the value 9, the attribute A contains 201, since Node201 covers the value 9, and the attribute A contains 301, since Node 301 covers the value 9.

Creating secret-shares of the relation.

The DB owner c secret-shares of each attribute value A i [ a j ] of the relation R using asecret-sharing mechanism that allows string-matching operationsat the server (as speciﬁed in § User.

We assume that the user is aware of the k value used inthe k -way tree creation. For a given range, the user ﬁrst ﬁnds theminimal set of nodes that cover the range, and then, creates secret-shares of those node values. We follow a least-match method forsearching node values. Assume a query for counting the numberof tuples having values between and . The best-match methodwill ﬁnd only Node 301 that satisﬁes this query. However, it will

For simplicity of presentation, we assume that the attribute has 32 continuousnumbers. Having any 32 numbers will not affect the algorithm. In the case of any 32number, we will create k -way tree for the minimum and maximum value in the domain,so that the resulting tree will have many empty nodes. EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 14

102 103

104 105 106

107 108202 203 204201

Level 0

Level 1Level 2

Level 3

Level 4

Fig. 1: 2-way tree for 32 values.cover some other values too, resulting in a wrong answer to thequery. Thus, using a minimal set of nodes that cover the range, theuser breaks the range into sub-ranges such as

Node 201 , Node103 , and value . Note that by breaking the range from 1-13 intopoint queries requires searching 13 different values. However, inthe modiﬁed representation of ranges using -way tree, the serverwill search only for three values.Finally, the user creates secret-shares of these three values( Node 201 , Node 103 , and value 13) and sends them to theservers with the information of the desired attribute on which theserver should search for a value.

Server.

The server executes the count query as mentioned in § Node 201 in the attribute A , for Node 103 in the attribute A , and for thevalue in the attribute A . Finally, the server adds the outputsof all three individual searches, which produce the ﬁnal answer tothe count query. Note.

By following the same idea of breaking a range into sub-range, one can execute conjunctive and disjunctive count/sumqueries.

XPERIMENTS

This section evaluates the scalability of O

BSCURE and comparesit against other SSS- and MPC-based systems. We used a 16GBRAM machine as a DB owner, as well as, a user that communi-cates with AWS servers. For our experiments, we used two typesof AWS servers – a relatively weaker 32 GB, 2.5 GHz, Intel XeonCPU (Exp 2, 5, 6), and a powerful 144GB RAM, 3.0GHz IntelXeon CPU with 72 cores to study the impact of multi-threadedprocessing (Exp 3, 8). O BSCURE

Evaluation

Secret-share (SS) dataset generation.

We used four columns (Or-derkey (OK), Partkey (PK), Linenumber (LN), and Suppkey(SK))of LineItem table of TPC-H benchmark to generate 1M and6M rows. To the best of our knowledge, this is the ﬁrst suchexperiment of SSS-based approaches to such large datasets. Wenext explain the method followed to generate SS data for 1Mrows. A similar method was used for generating SS data for 6Mrows.The four columns of LineItem table only contain numbers:OK: 1 to 300,000 (1,500,000 in 6M), PK: 1 to 40,000 (200,000in 6M), LN: 1 to 7, and SK: 1 to 2000 (200,000 in 6M). Thefollowing steps are required to generate SS of the four columns in1M rows:1) The ﬁrst step was to pad each number of each column with zeros.Hence, all numbers in a column contain identical digits to prevent

A A A A TABLE 12: A relation R having three new attributes, A , A , and A , based on bucketization of range values.an adversary to know the distribution of values. For example, afterpadding 1 of OK was 000,001. Similarly, values of PK and SKwere padded. We did not pad LN values, since they took only onedigit.2) The second step was representing each digit into a set of tennumbers, as mentioned in § i . e ., 0,a group of 11th to 20th number shows the second digit, i . e ., 0, andso on. Similarly, each value of PK, SK, and LN was converted.We also added columns for TID, Index, count, sum, and maximumveriﬁcation, and it resulted in the relation R . Further, we createdanother relation, R , with three attributes CTID, SSTID, and OK,as mentioned in § f ( x ) = secret value + a x , where a was selectedrandomly between 1 to 10M for each number, the modulus ischosen as 15,000,017, and x was varied from one to ﬁfteen toobtain ﬁfteen shares of each value. On R , we implemented OP-SS on OK attribute, and also generated ﬁfteen shares of SSTID.Thus, we got S ( R ) i and S ( R ) i , ≤ i ≤ . (Exp 5 willdiscuss in detail why are we generating ﬁfteen shares.) For sumand tuple retrieval queries’ time minimization, we add four moreattributes corresponding to each of the four attributes in LineItemtable. A value of each of the four attributes has only one secret-shared value, created using SSS (not after padding). But, one canalso implement the same query on secret-shared values obtained One may use binary representation for representing secret-shares, since it iscompact as compared to unary representation. However, in binary representation, thepolynomial degree increases signiﬁcantly, when we perform string-matching operations.For example, consider a decimal number, say n ( = 400 ), having l d ( = 3 ) digits indecimal, and takes l b ( = 9 ) digits in binary ( ). Here, representing 400 usingunary and binary representations will take 30 and 9 numbers, respectively. However, whenthe user wishes to perform the minimum computation by interpolating only the desiredanswer, we need at least × l d + 1 and × l b + 1 servers for string-matching, whenusing unary and binary representation, respectively. EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 15 T i m e ( s e c )

1D Count2CE Count3DE Count1D Sum2CE Sum3DE SumUnc-MaxC-MaxGroupByData Fetch Time(1D)Data Fetch Time(DE)Data Fetch Time(Unc-Max) (a) 1M rows. T i m e ( s e c )

1D Count2CE Count3DE Count1D Sum2CE Sum3DE SumUnc-MaxC-MaxGroupByData Fetch Time(1D)Data Fetch Time(DE)Data Fetch Time(Unc-Max) (b) 6M rows.

Fig. 2: Exp 3. Impact of parallelism, evaluated using

AWS serverswith 144GB RAM, 3.0GHz Intel Xeon CPU with 72 cores. after step 2.4) Lastly, we placed i th share of S ( R ) and S ( R ) to i th AWSserver.

Exp 1. Data generation time.

Table 13 shows the time to generatesecret-shared LineItem table of size 1M and 6M rows, at the DBowner machine. Note that due to unary representation, the size ofthe data is large; however, the data generation time of O

BSCURE is signiﬁcantly less than an MPC system, which will be discussedin § Tuples Time Size (in GB)1M ≈

10 mins | S ( R ) | = 1 . , | S ( R ) | = 0 . ≈ | S ( R ) | = 14 , | S ( R ) | = 3 TABLE 13: Exp. 1. Average time and size for shared datageneration using single-threaded implementation at the DB owner.

Exp 2. O

BSCURE performance.

In O

BSCURE , we usedmultiple cores of CPU by writing parallel programs forone-dimensional (1D) count/sum, two-dimensional conjunctive-equality (2CE) count/sum, three-dimensional disjunctive-equality(3DE) count/sum, unconditional maximum (Unc-Max), condi-tional maximum (C-Max), and group-by queries on the LineItemtable having 1M and 6M rows. A parallel program divides rows into blocks with one thread processing one block, and then, theintermediate results (generated by each thread) are reduced by themaster thread to produce the ﬁnal result.For this experiment having ﬁfteen shares, we used

AWS serverswith 144GB RAM, 3.0GHz Intel Xeon CPU with 72 cores , andvaried the degree of parallelism up to 48 (number of parallelthreads). Increasing more threads did not provide speed-up, sincethe execution time reached close to the time spent in the sequentialpart of the program (Amdahl’s law); furthermore, the executiontime increases due to thread maintenance overheads. Figure 2shows as the number of threads increases, the computation timedecreases. Also, observe that the data fetch time from the databaseremains (almost) same and less than the processing time. Further,the computation time reduces signiﬁcantly due to using manythreads on powerful servers (Figure 2). Also, note that as the sizeof data increases, the time increases slightly more than linearly.This is due to the unary representation that requires 10 morenumbers (for the 6M rows table) to cover one new additional digitin all attribute values (except LN attribute). This increase resultsin additional multiplications during string-matching. An importantobservation is that executing any query took at most 13seconds on1M rows and 75seconds on 6M rows.

Count and sum queries . Figure 2 shows the time taken by1D, 2CE conjunctive-equality, and three-dimensional disjunctive-equality (3DE) count and sum queries. CE queries were executedon OK and LN, and DE queries involved OK, PK, and LNattributes. Observe that as the number of predicates increases, thecomputation time also increases, due to an increasing number ofmultiplications. The time difference between computations on 1Mand 6M rows is about 6-7.4%.

Maximum queries . Fetching the tuple having the maximum valuein an unconditional maximum query was very efﬁcient, due toOP-SS, and took at most 9seconds on 1M rows and at most50seconds on 6M rows; see Figure 2. We executed 1D conditionalmaximum query (C-Max). C-Max requires to know the tuple-idsthat satisfy the condition in relation S ( R ) , and then, determiningthe maximum value from S ( R ) . Note that in both UnC-Max andC-Max, we achieve the maximum efﬁciently, due to OP-SS, (whilealso preventing background-knowledge-based attacks on OP-SS).The time difference between fetching a tuple having the maximumvalue from 1M and 6M data is about 5.5-6.6%. Group-by queries . A group-by query works in a similar manner to1D count/sum query. Figure 2 shows the time taken by a group-byquery when the number of groups was seven (due to LN attributethat has seven values), where we counted the number of OK valuescorresponding to each LN value.

Exp 3. Impact of local processing at a resource-constraineduser.

To show the practicality of O

BSCURE , we did an experiment,where a resource-constrained user downloads the entire encrypteddata and executes the computation at their end after decrypting thedata and loading into a database system. We restricted the userto have a machine with 1GB RAM and single-core 1.35GHz CPUusing docker, unlike multicore servers used in Exp 3, and executedthe same queries that we executed in Exp 3. With this setup,decryption time at the user side was 54s and 259s for 1M and 6Mrows, respectively. Further, loading decrypted data into a databasesystem (MySQL) at the user-side took 20s and 120s for 1M and6M rows, respectively. All queries used in Exp 2 were executed in1-5s for both 1M and 6M rows. Note that the user computationtime is signiﬁcantly higher compared to the computation time

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 16 of queries in Exp 3. For example, end-to-end 1D count queryexecution in Exp 3 over 6M secret-shared rows took 26s (seeFigure 2b), while the same query took 385s when decrypting andloading the data into MySQL at the resource-constrained user.

Exp 4. Overheads of result veriﬁcation.

This experiment ﬁndsthe overheads of the result veriﬁcation approaches. Figure 3ashows that count result veriﬁcation steps do not incur a signiﬁcantcost at the servers, since executing result veriﬁcation requires onlytwo more multiplications and modulo on each row’s A x and A y values (see § E x e c u t i o n T i m e ( s ) W/O Verification (1M Rows)W Verification (1M Rows)W/O Verification (6M Rows)W Verification (6M Rows) (a) Count query. E x e c u t i o n T i m e ( s ) W/O Verification (1M Rows)W Verification (1M Rows)W/O Verification (6M Rows)W Verification (6M Rows) (b) Sum query.

UnC-Max-Tuple-Fetch Cond-Max-Tuple-Fetch010203040506070 E x e c u t i o n T i m e ( s ) W/O Verification (1M Rows)W Verification (1M Rows)W/O Verification (6M Rows)W Verification (6M Rows) (c) Maximum tuple fetch query.

Fig. 3: Exp 4. Result veriﬁcation.

Exp 5. Impact of num-ber of shares.

In thisexperiment, we studythe impact of the num-ber of shares on the per-formance of O

BSCURE .For this experiment, weused four different se-tups with data, secretshared between 3, 5,11, and 15 servers. Dueto space restrictions, weshow results for 1Mrows only. Figure 9shows computation timeat the server and userside, with a differentnumber of shares.The results demon-strate two tradeoffs, ﬁrstbetween the number ofshares and computationtime at the user, and sec-ond between the num-ber of shares and theamount of data trans-ferred from each serverto the user. As thenumber of shares de-creases, the computationtime at the user in-creases; since the string-matching operation re-sults in the degree ofpolynomials to be dou-bled, and if servers donot have enough shares,they cannot compute the E x e c u t i o n T i m e ( s ) User Processing TimeServer Processing Time

Fig. 4: Exp 5. Impact of the number of shares, using a singlethreaded implementation on 32GB RAM, 2.5GHz Intel XeonCPU.ﬁnal answer and may re-quire more than one round of communication with the user tocompute the SS aggregate value. Thus, the communication costalso increases with a decreasing number of shares.From Figure 9, it is clear that as the number of sharesincreases, the computation time at the user decreases and at theserver increases, while the overall query execution time decreases,generally. In Appendix E, we discuss the processing of each queryunder a different number of shares.

Exp 6. Impact of communication cost.

An interesting point wasthe impact of the communication cost. Since servers send data tothe user over the network, it may affect the overall performance.As mentioned in Exp 4., using 3 servers, the communication costincreases as compared to 15 servers. For instance, in executingDE count/sum queries over PK, LN, and OK attributes took thehighest amount of data transfer when using 3 servers. Since thenumber of digits of the three predicates was 12 in 1M rows and14 in 6M rows, each server sends 12 ﬁles (each of size 7MB) inthe case of 1M rows and 14 ﬁles (each of size 48MB).Hence, the server to user communication was 84MB/server inthe case of 1M rows and 672MB/server in the case of 6M rows.However, in the case of 15 servers, the server to user communi-cation was 7MB/server in the case of 1M rows and 48MB/serverin the case of 6M rows. When using slow (100MB/s), medium(500MB/s), and fast (1GB/s) speed of data transmission, thedata transmission time in the case of 15 servers was negligible.However, in the case of 6M, it took 7s, 1s, less than 1s per server,respectively, on slow, medium, and fast transmission speed.Observe that the computation time at the server was at least 40sin any query on 6M rows (when using 72 core servers; Figure 2b)that was signiﬁcantly more than the communication time betweenuser and servers. Thus, the communication time does not affect theservers’s computation time, which was the bottleneck . Exp 7. Range queries.

We evaluated range queries for 1D-count and 1D-sum operations. Given a range query involving k continuous values, we converted it into k k -times at the server. In order to reducethe number of scans, we processed (as per 1D-count or sum query)all the k -values in the range on each tuple, before processing thenext tuple. As a result, we got k values (as per the 1D-count orsum query) after processing the entire relation. Finally, the serveradds all k values and sends them to the user. We implemented arange query involving 1D-count/sum operations, using 48 threadson AWS servers with 144GB RAM, 3.0GHz Intel Xeon CPU with EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 17

Length of range345678910 T i m e ( s ) Count(1M Rows)Sum(1M Rows)

Fig. 5: Exp 7: Impact of executing rangequeries. T i m e ( s ) With Buckets(Fanout 10)Without Buckets

Fig. 6: Exp 8: Impact of executing rangequeries using bucketization. N u m b e r o f m u l t i p li c a t i o n s ( p e r r o w ) With Buckets(Fanout 10)Without Buckets

Fig. 7: Exp 8: Number of multiplication op-erations used in string-matching operationin range queries.72 cores. Figure 5 shows that as the length of range increasesthe computation time also increases. In Appendix 8.3, we providea bucketization-based approach to reduce the computation timewhile increasing the range values.

Exp 8. Bucketization-based range queries.

We pre-computedthe range information with k = 10 for OrderKey values (withdomain of 1-150000) of LineItem table of TPC-H. We outsourcedthe secret-shared version of this range information along with theoriginal data. We executed 1D-count query on OrderKey values.Figure 6 shows that with pre-computed range information in theform of buckets signiﬁcantly decreases the computation time ( i . e .,the amount of time spent in range-based ﬁltering over secret-shared data). Note that the na¨ıve range implementation scaleslinearly with the length of range, whereas the bucketized pre-computed range information along with each tuple, takes almostconstant time even after increasing the range length. Furthermore,it sometimes drops as fewer buckets are able to cover the entirerange. For example, a range of 1-99 requires 19 searches (9 for9 buckets covering values from 1 to 90 and extra 9 searchesfor values 91 to 99), whereas for a range from 1 to 100 weonly need 1 bucket to represent it, therefore the time required toexecute a range query for 1 to 100 decreases. Figure 7 shows howthe number of multiplications per row varies with the increasein the range length. We can see that the na¨ıve implementationrequires many multiplications as compared to the bucketization-based range algorithm. However, since we store range informationalong with each tuple, the size of the database increases, requiringmore time to scan the table. Figure 6 shows that for small-sizedranges (length < ), the na¨ıve algorithm performs better as thenumber of multiplication require by both the algorithms are equal,but the scanning time for the na¨ıve algorithm is smaller than thebucketization-based algorithm. Exp 9. O

BSCURE performance on a weaker machine.

Inthis experiment, we explored O

BSCURE on a relatively weakersingle-threaded machine with 32GB.

We chose this machine since(as will be clear in § O BSCURE on 32GB AWS servers.

Note that single-threaded implementation of O

BSCURE incurstime overheads, which are signiﬁcantly reduced when using manythreads on powerful servers; see Exp 2. Likewise Exp 2, weexecuted count, sum, unconditional and conditional maximum,and group-by queries on the LineItem table having 1M and 6Mrows using ﬁfteen shares; see Figure 8. Note that Figure 8 showsthat determining only the maximum value is efﬁcient due to OP-SS, in the case of unconditional maximum queries (UnC-Max-Det,

QMax1 , see § The previous works on SSS-based techniques either did not reportany experiments [27], [29] or scaled to only a very small dataset, orused techniques that, while efﬁcient, were insecure [30], [50]. Forinstance, [30], [50] are both vulnerable to access-pattern attacks.Furthermore, these approaches achieve efﬁcient query processingtimes ( e . g ., 90 ms for aggregation queries on databases of size150K) by executing queries on SS data identically to that oncleartext, which requires user sides to retain polynomials, whichwere used to generate SS-data. Thus, as mentioned in § n × m polynomials, where n and m are the numberof rows and columns in a database, respectively.MPC-based methods, e . g ., [9], [16], [17], [18], are secure,they also do not scale to large datasets due to high overhead ofshare creation and/or query execution. For example, MPC-basedSepia [18] used 65K values for only count operation without E x e c u t i o n T i m e ( s ) Obscure (1M Rows)Obscure (6M Rows)

Fig. 8: Exp 9. O

BSCURE performance using a single-threaded implementation on 32GB RAM , 2.5GHz Intel Xeon CPU.

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 18 any condition with the help of three to nine servers, and recentBonawitz et al. [17] (appeared in CCS 2017) used only 500Kvalues for count and sum of the numbers. Note that Sepia [18]and Bonawitz et al. [17] do not support conjunctive/disjunctivecount/sum queries.We evaluated one of the state-of-the-art industrial MPC-basedsystems that we refer to system Z to get a better sense of itsperformance compared to O

BSCURE , whose performance is givenin Figure 8. However, we note that the MPC systems, as mentionedin §

1, are not available as either open source, and, often, not evenavailable for purchase, except in the context of a contract. We wereable to gain access to System Z, due to our ongoing collaborationwith the team under the anonymity understanding. We installedsystem Z (having three SS of LineItem) on the local machine,since it was not allowed to install it on AWS. Also, note that wecannot directly compare system Z and O

BSCURE , since system Zuses a single machine to keep all three shares. Inserting 1M rowsin system Z took 9 hours, while the size of SS data was 1GB.We executed the same queries using the system Z, which tookthe following time: 532s for 1D count, 808s for CE count, 1099sfor DE count, 531s for 1D sum, 801s for CE sum, 1073s for DEsum, 2205s for UnC-Max-Tuple-Fetch, and 2304s for Cond-Max-Tuple-Fetch.

10 C

ONCLUSION

We proposed O

BSCURE that is a information-theoretically secure,oblivious, and communication efﬁcient system for answeringaggregation queries (count, sum, and maximum having single-dimensional, conjunctive, or disjunctive query predicates) on asecret-shared dataset outsourced by either a single DB or multipleDB owners. O

BSCURE also supports efﬁcient result veriﬁcationalgorithms to protect against malicious adversarial cloud serversthat deviate from the algorithm, due to software/hardware bugs.Our experimental results on 1M rows and 6M secret-shared rowsusing AWS servers show better performance as compared a simplestrategy of downloading encrypted data, decrypting, and then,executing the query at a resource-constrained user. Further, weshowed a tradeoff between the number of shares and performance.

Future directions.

While O

BSCURE supports a wide range ofaggregation queries, there are some issues that we plan to extendin the future, listed below:1) Reducing the number of communication rounds between the userand the server to one for any aggregation query.2) Designing an algorithm for group-by queries without knowing theunique values of the attribute on which the group-by query will beexecuted; recall that the existing algorithm for group-by queriesrequires to know the unique values in an attribute.3) Dealing with multiple aggregation operators in a query. Forexample, O

BSCURE can execute the following query in onecommunication round between the user and the server:

SELECTavg(age), max(age) FORM Employee by creating twosub-queries, one for average and another for maximum. One mayconsider how to execute such a query only in one communicationround, without creating two sub-queries.4) Extending this work on GPU-based efﬁcient join and nestedqueries, since the proposed algorithms use multiplication and ad-dition operations, which can be supported by GPU very efﬁciently. A CKNOWLEDGMENT

This material is based on research sponsored by DARPA underagreement number FA8750-16-2-0021. The U.S. Government isauthorized to reproduce and distribute reprints for Governmentalpurposes notwithstanding any copyright notation thereon. Theviews and conclusions contained herein are those of the authorsand should not be interpreted as necessarily representing theofﬁcial policies or endorsements, either expressed or implied, ofDARPA or the U.S. Government. This work is partially supportedby NSF grants 1527536 and 1545071. This work of Y. Li is sup-ported by National Natural Science Foundation of China (Grantno. 61402393, 61601396). R EFERENCES

SIGMOD , pages 563–574, 2004.[9] D. W. Archer et al. From keys to databases - real-world applications ofsecure multi-party computation.

IACR Cryptology ePrint , 2018.[10] S. Bajaj et al. Correctdb: SQL engine with practical query authentication.

PVLDB , 6(7):529–540, 2013.[11] M. Barhamgi et al. PrivComp: a privacy-aware data service compositionsystem. In

EDBT , pages 757–760, 2013.[12] M. Barhamgi et al. Secure and privacy-preserving execution model fordata services. In

CAiSE , pages 35–50, 2013.[13] M. Barhamgi et al. Privacy in data service composition.

IEEE Transac-tions on Services Computing , 2019.[14] A. Beimel. Secret-sharing schemes: A survey. In

IWCC , pages 11–46,2011.[15] D. Benslimane et al. PAIRSE: a privacy-preserving service-oriented dataintegration system.

SIGMOD Record , 42(3):42–47, 2013.[16] D. Bogdanov et al. Sharemind: A framework for fast privacy-preservingcomputations. In

ESORICS , volume 5283, pages 192–206, 2008.[17] K. Bonawitz et al. Practical secure aggregation for privacy-preservingmachine learning. In

CCS , pages 1175–1191, 2017.[18] M. Burkhart et al. SEPIA: privacy-preserving aggregation of multi-domain network events and statistics. In

USENIX , pages 223–240, 2010.[19] R. Canetti. Security and composition of multiparty cryptographic proto-cols.

J. Cryptology , 13(1):143–202, 2000.[20] R. Canetti et al. Adaptively secure multi-party computation. In

STOC ,pages 639–648, 1996.[21] C. Chu et al. Efﬁcient k -out-of- n oblivious transfer schemes with adaptiveand non-adaptive queries. In PKC , pages 172–183, 2005.[22] R. M. Corless et al. A graduate introduction to numerical methods.

AMC ,10:12, 2013.[23] H. Corrigan-Gibbs et al. Prio: Private, robust, and scalable computationof aggregate statistics. In

NSDI , pages 259–282, 2017.[24] R. Cramer et al.

Secure Multiparty Computation and Secret Sharing .Cambridge University Press, 2015.[25] I. Damg˚ard et al. Unconditionally secure constant-rounds multi-partycomputation for equality, comparison, bits and exponentiation. In

TCC ,pages 285–304, 2006.[26] S. Dolev et al. Secret shared random access machine. In

ALGOCLOUD ,volume 9511, pages 19–34.[27] S. Dolev et al. Accumulating automata and cascaded equations automatafor communicationless information theoretically secure multi-party com-putation: Extended abstract. In

SCC , pages 21–29, 2015.[28] S. Dolev et al. Privacy-preserving secret shared computations usingmapreduce.

CoRR , abs/1801.10323, 2018.[29] F. Emekc¸i et al. Privacy preserving query processing using third parties.In

ICDE , page 27, 2006.[30] F. Emekc¸i et al. Dividing secrets to secure data outsourcing.

Inf. Sci. ,263:198–210, 2014.

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 19 [31] J. Frankle et al. Practical accountability of secret processes. In

USENIXSecurity , pages 657–674, 2018.[32] M. J. Freedman et al. Keyword search and oblivious pseudorandomfunctions. In

TCC , pages 303–324, 2005.[33] C. Gentry.

A fully homomorphic encryption scheme . PhD thesis, StanfordUniversity, 2009.[34] D. M. Goldschlag et al. Onion routing.

Commun. ACM , 42(2):39–41,1999.[35] S. Goldwasser et al. Probabilistic encryption.

J. Comput. Syst. Sci. ,28(2):270–299, 1984.[36] H. Hacig¨um¨us et al. Executing SQL over encrypted data in the database-service-provider model. In

SIGMOD , pages 216–227, 2002.[37] M. A. Hadavi et al. AS5: A secure searchable secret sharing scheme forprivacy preserving database outsourcing. In

DPM , pages 201–216, 2012.[38] W. Jiang et al. Transforming semi-honest protocols to ensure account-ability.

Data Knowl. Eng. , 65(1):57–74, 2008.[39] H. W. Lim et al.

PrivateLink : Privacy-preserving integration and sharingof datasets.

IEEE Trans. Information Forensics and Security , 15:564–577,2020.[40] M. Naveed et al. Inference attacks on property-preserving encrypteddatabases. In

CCS , pages 644–655, 2015.[41] C. Orlandi. Is multiparty computation any good in practice? In

ICASSP ,pages 5848–5851, 2011.[42] R. A. Popa et al. CryptDB: processing queries on an encrypted database.

Commun. ACM , 55(9):103–111, 2012.[43] A. Rajan et al. Callisto: A cryptographic approach to detecting serialperpetrators of sexual misconduct. In

COMPASS , pages 49:1–49:4, 2018.[44] A. Shamir. How to share a secret.

Commun. ACM , 22(11):612–613,1979.[45] L. V. Silva et al. Security and privacy preserving data aggregation incloud computing. In

SAC , pages 1732–1738, 2017.[46] D. X. Song et al. Practical techniques for searches on encrypted data. In

IEEE SP , pages 44–55, 2000.[47] B. Thompson et al. Privacy-preserving computation and veriﬁcation ofaggregate queries on outsourced databases. In

PETS , pages 185–201,2009.[48] S. Tu et al. Processing analytical queries over encrypted data.

Proc.VLDB Endow. , 6(5):289–300, 2013.[49] C. Wang et al. Secure ranked keyword search over encrypted cloud data.In

ICDCS , pages 253–262, 2010.[50] T. Xiang et al. Processing secure, veriﬁable and efﬁcient SQL overoutsourced database.

Inf. Sci. , 348:163–178, 2016.[51] S. Yu et al. Attribute based data sharing with attribute revocation. In

ASIACCS , pages 261–270, 2010.

Peeyush Gupta is a Ph.D. student, advised byProf. Sharad Mehrotra, at University of Califor-nia, Irvine, USA. He obtained his Master of Tech-nology degree in Computer Science from IndianInstitute of Technology, Bombay, India, in 2013.His research interests include IoT data manage-ment, time series database systems, and datasecurity and privacy.

Yin Li is an associate professor in the Schoolof Cyberspace Security, Dongguan University ofTechnology, China. Previously, he was an as-sociate professor in Department of ComputerScience and Technology, Xinyang Normal Uni-versity, China. He received his Ph.D. degreein Computer Science from Shanghai JiaotongUniversity (SJTU),Shanghai in 2011. He re-ceived his B.Sc. degree and M.Sc. degree fromInformation Engineering University, Zhenzhou,in2004 and 2007. He was a Post Doc at Ben-Gurion University of the Negev, Israel, assistedby Prof. Shlomi Dolev.His current research interests include algorithm and architectures forcomputation in ﬁnite ﬁeld, computer algebra, and secure cloud comput-ing.

Sharad Mehrotra received the PhD degree incomputer science from the University of Texas,Austin, in 1993. He is currently a professor inDepartment of Computer Science, University ofCalifornia, Irvine. Previously, he was a professorwith the University of Illinois at Urbana Cham-paign. He has received numerous awards andhonors, including the 2011 SIGMOD Best Pa-per Award, 2007 DASFAA Best Paper Award,SIGMOD test of time award, 2012, DASFAA tenyear best paper awards for 2013 and 2014, 1998CAREER Award from the US National Science Foundation (NSF), andACM ICMR best paper award for 2013. His primary research interestsinclude the area of database management, distributed systems, securedatabases, and Internet of Things.

Nisha Panwar is an assistant professor at Au-gusta University, Georgia. She obtained herPh.D. in Computer Science from Ben-GurionUniversity, Israel, in 2016, where he worked withProf. Shlomi Dolev and Prof. Michael Segal. Shereceived her Master of Technology (M.Tech.) de-gree in Computer Engineering from National In-stitute of Technology, Kurukshetra, India in 2011.She was a Post Doc at University of California,Irvine, USA. Her research interests include se-curity and privacy issues in IoT systems, as wellas, in vehicular networks, computer network and communication, anddistributed algorithms.

Shantanu Sharma received his Ph.D. in Com-puter Science in 2016 from Ben-Gurion Univer-sity, Israel. During his Ph.D., he worked withProf. Shlomi Dolev and Prof. Jeffrey Ullman. Heobtained his Master of Technology (M.Tech.) de-gree in Computer Science from National Instituteof Technology, Kurukshetra, India, in 2011. Hewas awarded a gold medal for the ﬁrst positionin his M.Tech. degree. Currently, he is pursuinghis Post Doc at University of California, Irvine,USA, assisted by Prof. Sharad Mehrotra. Hisresearch interests include data security and privacy, building secureand privacy-preserving systems on sensor data for smart buildings,designing models for MapReduce computations, distributed algorithms,mobile computing, and wireless communication.

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 20 A PPENDIX AC OUNT Q UERY V ERIFICATION OVER S ECRET -S HARED V ALUES

This section shows an example for count query veriﬁcation over asecret-shared relation.

Example.

Assume that the domain of symbols has only twosymbols, namely A and B. Thus, A can be represented as (cid:104) , (cid:105) ,and B can be represented as (cid:104) , (cid:105) . DB owner side . Suppose that the DB owner wants to outsourcethree rows having A, B, A, respectively. The DB owner adds twoattributes, A x and A y , initialized with one, to the relation; seeTable 14. Values A x A y A 1 0B 1 0A 1 0

TABLE 14: Non-secret-shared relation at the DB owner.The DB owner uses any polynomials of an identical degree, asshown in Table 15, to create four shares. Further, the i th share isplaced to the i th server. Values Vectorvalues Polynomials Firstshares Secondshares Thirdshares FourthsharesA 1 x + 1 x + 0 x + 0 x + 1 x + 1 x + 0 A x x + 1 x + 1 x + 1 A y x + 1 x + 1 x + 1 TABLE 15: Secret-shares of a relation shown in Table 14.

User-side . Suppose that the user wants to search for a symbol B.The user will ﬁrst represent B as a unary vector, (cid:104) , (cid:105) , and then,create secret-shares of B, as shown in Table 16. Vectorvalues Polynomials Firstshares Secondshares Thirdshares Fourthshares0 x + 0 x + 1 TABLE 16: Secret-shares of a vector (cid:104) , (cid:105) , created by theuser/querier. Server-side . Each server executes the count query, as mentionedin §

5, and the functions f and f . op = f ( x ) = (cid:80) i = ni =1 ( S ( x i ) ⊗ o i ) op = op + f ( y ) = op + (cid:80) i = ni =1 f ( S ( y i ) ⊗ (1 − o i )) The function f (and f ) multiplies the i th value of the A x (and A y ) attribute by the i th string-matching resultant (and by thecomplement of the i th string-matching resultant). Each server i ( ≤ i ≤ ) sends the following three things: ( i ) the result of thecount query (cid:104) result (cid:105) i , ( ii ) the outputs of the function f : (cid:104) op (cid:105) i ,and ( iii ) the sum of outputs of the function f and f : (cid:104) op (cid:105) i , tothe user. Tables 17 - 20 show the working of servers over secret-shares. User-side . The user interpolates the received values from eachserver, which result in

Iresult , Iop , and Iop , as follows: Iresult = ( x − x − x − − − − ×

44 + ( x − x − x − − − − × ( x − x − x − − − − ×

304 + ( x − x − x − − − − ×

521 = 1

Iop = ( x − x − x − − − − ×

162 + ( x − x − x − − − − × Value SMR ( o ) Function f − o Function f A (2 , ⊗ (2 ,

2) = 10 2 ×

10 = 20 1 −

10 = − ×− − B (4 , ⊗ (2 ,

2) = 14 3 ×

14 = 42 1 −

14 = −

13 6 × −

13 = − A (6 , ⊗ (2 ,

2) = 20 5 ×

20 = 100 1 −

20 = −

19 3 × −

19 = − result =44 (cid:104) op (cid:105) =162 (cid:104) op (cid:105) = − TABLE 17: Server 1 execution.

Value SMR ( o ) Function f − o Function f A (3 , ⊗ (4 ,

3) = 30 3 ×

30 = 90 1 −

30 = −

29 7 × −

29 = − B (8 , ⊗ (4 ,

3) = 47 5 ×

47 = 235 1 −

47 = −

46 11 × −

46 = − A (11 , ⊗ (4 ,

3) = 68 9 ×

68 = 612 1 −

68 = −

67 5 × −

67 = − result =145 (cid:104) op (cid:105) =937 (cid:104) op (cid:105) = − TABLE 18: Server 2 execution. ( x − x − x − − − − × ( x − x − x − − − − × Iop = ( x − x − x − − − − × − ( x − x − x − − − − × − ( x − x − x − − − − × −

363 + ( x − x − x − − − − × −

849 = 3

Note that the user obtains:

Iresult = Iop and Iop = n ,where n is the number of tuples in the relation, and it is knownto the user. Thus, it is proved that the servers followed the countquery algorithm. A PPENDIX BF INDING M AXIMUM OVER D ATASETS O UTSOURCEDBY M ULTIPLE

DB O

WNERS

In this section, we explain a method, named

MDBMax for thecase when multiple DB owners outsource their data to servers, e . g ., smart meters. Note that for the case of multiple DB owners, SDBMax method cannot work, as different DB owners do notshare any information for creating OP-SS. We describe

MDBMax for a list, say A c , having n numbers outsourced by k DBowners/devices, where k ≤ n . Data outsourcing.

Consider that an i th DB owner wishes tooutsource a number, say v . The i th DB owner creates shares of v using a secret-sharing mechanism that allows string-matchingoperations at the server and sends to the c non-communicatingservers, as described in § Query execution.

MDBMax uses 2’s complement-based signbitcomputation for each pair of shares at a server. The server j considers an i th ( ≤ i ≤ n ) share as the maximum value andcompares the i th share against the remaining n − shares.Thus, for each number at the i th position, say V i , the server j computes the signbit with all the other numbers using 2’scomplement-based subtraction, i . e ., signbit ( V i − V x ) , x (cid:54) = i , and ≤ x, i ≤ n . Recall that the signbit results in of secret-shareform, if V i < V x ; otherwise, . Then, the server j adds all n − signbit values computed for the i th share of the list A c . Therefore,after comparing each pair of inputs and adding corresponding n − signbit values, the server j has a vector, say vec , of n shares. Theuser asks the count query ( §

5) to ﬁnd the occurrences of in vec (it will be clear soon why the user is asking for counting ) and EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 21

Value SMR ( o ) Function f − o Function f A (4 , ⊗ (6 ,

4) = 60 4 ×

60 = 240 1 −

60 = −

59 10 × −

59 = − B (12 , ⊗ (6 ,

4) = 100 7 ×

100 = 600 1 −

100 = −

99 16 × −

99 = − A (16 , ⊗ (6 ,

4) = 144 13 ×

44 =1872 1 −

144 = −

143 7 × −

143 = − result =304 (cid:104) op (cid:105) =2812 (cid:104) op (cid:105) = − TABLE 19: Server 3 execution.

Value SMR ( o ) Function f − o Function f A (5 , ⊗ (8 ,

5) = 100 5 ×

100 = 500 1 −

100 = −

99 13 × −

99 = − B (16 , ⊗ (8 ,

5) = 173 9 ×

173 =1557 1 −

173 = −

172 21 × −

172 = − A (21 , ⊗ (8 ,

5) = 248 17 ×

248 =4216 1 −

248 = −

247 9 × −

247 = − result =521 (cid:104) op (cid:105) =6273 (cid:104) op (cid:105) = − TABLE 20: Server 4 execution.the sum of the values of A c for which the count query resulted in1 of secret-shared form. Example.

The following table shows how does the server ﬁnd themaximum value without using OP-SS. Note that for the purpose ofexplanation, we use cleartext values and computations; however,the server will perform all operations over secret-shared numbers.The list A c contains ﬁve numbers: 10, 20, 90, 50, and 90. Notethat the sum of signbit for the maximum value is 0. The serverexecutes the count query for the value of 0, multiplies the i th resultant to the i th value of A c , and sends the sum of the countquery results and the sum of values of A c after multiplication. Theuser receives 2 and 180 as the output of the count and sum queries,respectively, and so that the user knows the maximum value is 90. A c Signbits Sum ofsignbits String-matchingresult Maximumvalue10 20 90 50 9010 0 1 1 1 1 4 0 020 0 0 1 1 1 3 0 090 0 0 0 0 0 0 1 9050 0 0 1 0 1 2 0 090 0 0 0 0 0 0 1 90Answers to the count and sum queries 2 180

Complexities.

MDBMax requires n comparisons and n + 1 scanrounds of the list A c , where the ﬁrst n rounds are used incomparing each pair of numbers, other n rounds are used foradding n − signbits for each number, and one additional roundfor executing count and sum queries. Minimum queries over numbers outsourced by multiple DBowners.

Here, we also compare each pair of numbers. However,for each number at the i th position, say V i , we compute the signbitwith all the other numbers using 2’s complement-based subtrac-tion, as follows: signbit ( V x − V i ) , x (cid:54) = i , and ≤ x, i ≤ n . As aresult, after adding n − signbits for each number, the minimumvalues has 0, and the user asks for the count query for 0 and thesum of the values of A c for which the count query resulted in 1 ofsecret-shared form. A PPENDIX CM INIMUM AND T OP - K In this section, we focus on the minimum and top-k/reverse-top-kﬁnding algorithms on an attribute, say A c . Further, we assume thatany value in the attribute A c appears only once. Minimum.

Consider the following two queries

QMin1 (uncondi-tional minimum) and

QMin2 (conditional minimum).

QMin1 . select * from Employee where Salaryin (select min(Salary) from Employee)

QMin2 . select * from Employee as E1 whereE1.Dept = ’Testing’ and Salary in (selectmin(salary) from Employee as E2 whereE2.Dept = ’Testing’)

Here, in short, we explain how to execute these queries on the re-lations S ( R ) and S ( R ) , since these queries are similar to max-imum queries §

7. To execute an unconditional minimum query,the user follows the same strategy for solving

QMax1 ( § S ( R ) . First, each server i ﬁnds a tuple, say (cid:104) S ( t k ) , S ( value ) (cid:105) i ,where S ( t k ) i is the k th secret-shared tuple-id (in the attribute SSTID ) and S ( value ) i is the secret-shared minimum value ofthe A c attribute in the k th tuple. Finally, the server i comparesthe tuple-id (cid:104) S ( t k ) (cid:105) i with each k th value of the attribute TID of S ( R ) i and multiplies the resultant by the ﬁrst m attribute valuesof the tuple k . Finally, the server i adds all the values of each m attribute.To execute a conditional minimum query, the user operates intwo rounds, like a conditional maximum query; see § S ( R ) satisfying query predicate. In round 2, the user interpolatesthe received tuple-ids and sends the desired tuple-ids in cleartext tothe servers. Each server i ﬁnds the minimum value of the attribute A c in the requested tuple-ids by looking into the attribute CTID ofthe relation S ( R ) i and results in a tuple, say (cid:104) S ( t k ) , S ( value ) (cid:105) i ,where S ( t k ) i shows the secret-shared tuple-id (from SSTID attribute) and S ( value ) i shows the secret-shared minimum value.Finally, the server i performs a join operation between all thetuples of S ( R ) i and (cid:104) S ( t k ) , S ( value ) (cid:105) i , as performed whenanswering unconditional maximum ( QMax1 ) queries; see § Correctness and information leakage . The correctness argumentsand information leakage of a minimum query is similar to maxi-mum queries.

Top-k.

We again consider unconditional and conditional queriesin the case of a top-k query. In both the cases, the user followsa similar approach, like maximum queries; see §

7; however, theuser asks for top-k values instead of the maximum value.

Unconditional top-k query . To retrieve tuples having the top-kvalues in the attribute A c of the relation S ( R ) i , the i th serverexecutes the following steps:1) On the relation S ( R ) i . Since the secret-shared values ofthe attribute A c of the relation S ( R ) i are comparable, theserver i ﬁnds a set of k tuples, where k tuples have the top-k values in the attribute A c . One of the k tuples is denoted by (cid:104) S ( t (cid:96) ) , S ( value ) (cid:105) i , where S ( t (cid:96) ) i is the (cid:96) th secret-shared tuple-id (in the attribute SSTID ) and S ( value ) i is the secret-sharedvalue of the A c attribute in the j th tuple.2) On the relation S ( R ) i . Now, the server i performsthe join of all the top- k tuples with all the tuplesof the relation S ( R ) i by comparing the tuple-ids ( TID attribute’s values) of the relation S ( R ) i : (cid:80) j = nj =1 A p [ S ( a j )] i × ( TID [ S ( a j )] i ⊗ S ( t (cid:96) ) i ) Where ≤ (cid:96) ≤ k and p ( ≤ p ≤ m ) is the number ofattributes in the relation R and TID is the tuple-id attribute of S ( R ) i . To say, the server i compares each tuple-id (cid:104) S ( t (cid:96) ) (cid:105) i with each j th value of the attribute TID of S ( R ) i andmultiplies the resultant by the ﬁrst m attribute values of the EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 22

TID ( r ) Dept SM result ( o ) Count ( a ) x = 1 x = 2 r × o [1 − ( signbit ( x −

1) + signbit (1 − x ))] = 3 r × o [1 − ( signbit ( x −

1) + signbit (1 − x ))] = 0 r × o [1 − ( signbit ( x −

2) + signbit (2 − x ))] = 0 r × o [1 − ( signbit ( x −

2) + signbit (2 − x ))] = 5 r × o [1 − ( signbit ( x −

2) + signbit (2 − x ))] = 0 r × o [1 − ( signbit ( x −

2) + signbit (2 − x ))] = 0 Tuple-ids after adding values of the columns 3 5

TABLE 21: Knowing tuple-ids of employees working in testing department.tuple j . Finally, the server i adds all the values of each m attribute. Conditional top-k query . Answering conditional top-k queriesrequire when all the values of the attribute A c are unique requirestwo communication rounds between the user and the servers, likea conditional maximum query, see § Round 1 . The user obliviously knows the tuple-ids of the relation S ( R ) satisfying the query predicate. Round 2 . The user interpolates the received tuple-ids and sendsthe desired tuple-ids in cleartext to the servers. Each server i ﬁndsthe top-k values of the attribute A c in the requested tuple-ids bylooking into the attribute CTID of the relation S ( R ) i and resultsin a set of k tuples. Now, the server i performs a join operationbetween all the tuples of S ( R ) i and each of the k tuples of therelation S ( R ) , as performed above in answering an unconditionaltop-k query. Note.

A reverse-top-k query can also be executed in the samemanner as top-k queries; however, the user asks for the minimum-k values. A PPENDIX DM ETHODS FOR F INDING T UPLE -I DS A trivial solution for knowing the tuple-ids satisfying a querypredicate is given in § n numbers from eachserver to the user. In the following method, we allow the adversaryto know an upper bound on the number of tuples, say T , satisfythe query predicate. The method executes T computations on eachtuple and maintains T variables for each tuple. Thus, the serverperforms signiﬁcant computations, when T is large. The method.

The server creates T columns, one for each tuple-id that satisﬁes the query predicate, say v . Note that actually wedo not need to create any column during implementation, we needto have T variables. For the purpose of explanation, we show T columns. Each column has allocated one of the values from 1 to T of secret-share form (provided by the user). After an obliviouscomputation over each tuple, if there are T occurrences of v ,then each of the T columns will have one of the exact tuple-id where v occurs. The server executes the following operation: r × o [1 − ( signbit ( x − a ) + signbit ( a − x ))] Where r is the tuple-id; o = A (cid:96) [ S ( a i )] ⊗ S ( v ) , ≤ i ≤ n , i . e ., the resultant output of matching the predicate v with eachvalue of the attribute A (cid:96) ; a = (cid:80) i = ni =1 A (cid:96) [ S ( a i )] ⊗ S ( v ) , i . e ., theaccumulated counting of the predicate v in the attribute A (cid:96) ; and x ( ≤ x ≤ T ) is a value of the column, created for storing thetuple-id. Details . For r th ( ≤ r ≤ n ) value of the attribute A (cid:96) , theserver executes counting operations for ﬁnding the occurrences of

15. The user either provides an upper bound on the number of tuples thatcan satisfy the query predicate or knows the occurrences of the query predicateby executing the count query. v in A (cid:96) . The occurrences of v in the above-mentioned equation isdenoted by a . For each resultant a , the server compares a againsteach of the T values using 2’s complement method (as given in § v matches with only one of the T values,and thus, signbit ( x − a ) + signbit ( a − x ) results in , i . e ., thedifference of signbits of comparing two identical numbers is . Forall the other subtraction, it will be either or − of secret-shareform. Note that for all the values of T that do not match with a ,the above-mentioned equation will be 0 of secret-share form.Since for the occurrence of v matching with one of the valuesof T , signbit ( x − a ) + signbit ( a − x ) results in , we subtractit from to keep on which we can multiply the tuple-id r .Thus, if the tuple r has v in the attribute A (cid:96) , the server keeps r to one of the T columns. It is important to note that if the r th tuple has v in attribute A (cid:96) and ( r + 1) th tuple do not have v inattribute A (cid:96) , the value of accumulated count, a , will be same forthe tuples r th and ( r + 1) th . Hence, the server may also keepthe ( r + 1) th tuple-id in the same column where it has kept r th tuple-id. In order to prevent this, we also multiply the result of thestring-matching operation (denoted by o , see the above equation.Thus, the ( r + 1) th tuple-id will not be stored. Finally, the serverperforms the addition operations on each T column and sends theﬁnal sum of each column to the user. Example.

Table 21 shows an implementation of tuple-id ﬁndingmethod in cleartext to know the tuple-ids that have

Dept =Testing ; see Figure 6a for Employee relation. Note that foreach row, we perform string-matching operations whose resultsare stored in the variable o , and all the occurrences of the querypredicate are stored in the variable a . The user asks the server tocreate two columns ( T = 2 ) for keeping tuple-ids.For the ﬁrst tuple, the string-matching operation results in o = 1 and a = 1 , since the occurrence of the query predicate( Testing ) matches with the department of the ﬁrst tuple. Theserver computes the signbit (by placing x = 1 and a = 1 )that results in , and subtracts it from before multiplyingby r = 3 and o = 1 . Hence, the ﬁrst column keeps thetuple-id 3. The second column of the ﬁrst row has 0, since signbit (2 −

1) + signbit (1 −

2) = 0 + 1 = 1 . Note that whenprocessing the second row, the server ﬁnds the signbit of a equalsto the value of the ﬁrst column, while the second tuple does nothave Testing department. The multiplication of the resultant ofthe signbit comparison by o makes the values of the ﬁrst column ,while the second column has too. The server processes the thirdtuple like the ﬁrst tuple. Here, the second column keeps the tuple-id, since for the second column the current value of accumulatedcount a matches with the column number, while the ﬁrst columnstores , due to − ( signbit (1 − signbit (2 −

1) = 1 − (1+0) .The server processes the remaining tuples in a similar manner. EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 23 A PPENDIX EI MPACT OF N UMBER OF S HARES

In this section, we discuss the impact of number of shares on queryexecution (Experiment 5).

Consider a 1D-count on OK attribute (seeFigure 9). Each OK value needs six digits, which we denoteas: (cid:104) d , d , d , d , d , d (cid:105) . In order to evaluate a query predicateover OK attribute in a 1D-count query, using one round ofcommunication between the user and the servers, we need at mostthirteen servers/shares, as mentioned in § i . e ., the server checks (cid:104) d , d (cid:105) , (cid:104) d , d (cid:105) , and (cid:104) d , d (cid:105) , and sends the partial results to theuser. Note that checking two digits of each OK value requires moremultiplication and modulus operations than using three shares,and thus, the server computation time increases to 83s. Here, theuser receives a smaller set of partial results, and thus, the user’sinterpolation task reduces. Note that the total time when usingﬁve shares is higher than three shares, since the server performsmore computations. In the case of eleven shares, the computationtime at the server is higher as compared to three and ﬁve shares,since the server is able to check ﬁve digits ( (cid:104) d , d , d , d , d (cid:105) )of the query predicate. In the case of ﬁfteen servers, each serverchecks the entire predicate, and hence more computation time isrequired at the server, due to more multiplication and modulooperation. However, in the case of ﬁfteen shares, the user paysonly for interpolating one value, which is the answer to the countquery. Now, consider a 2CE-count query on OK and LNattributes, where the number of digits in OK and LN were 6 and1 respectively. In order to execute a 2CE-count query, we need atmost ﬁfteen shares. A 2CE-count query execution time follows atrend similar to 1D-count query (see Figure 9) except that thereare more number of digits involved. Hence, we skip details here.

Sum queries behave differently than count queries, withan increasing number of shares. 1D-sum queries include querypredicate on OK attribute, wherein each value has six digits,which we represent as (cid:104) d , d , d , d , d , d (cid:105) . Generally, in asum query, if the server does not have enough shares, they needto communicate with the user, who reduces the degree of thepolynomial of searching predicate attribute, and then, the useragain sends the shares of string-matching resultant to the server. E x e c u t i o n T i m e ( s ) User Processing TimeServer Processing Time

Fig. 9: Impact of the number of shares, using a single threadedimplementation on 32GB RAM, 2.5GHz Intel Xeon CPU.Afterwards, the server performs sum operations by multiplying i th secret-shared result with the i th value of desired attribute onwhich the user is executing a sum query, and then, the server addsall the values.As we increase the number of servers from three to ﬁfteen, theserver computation time increases, due to more computations, likecount 1D-count queries, and the user time decreases. Also, here,the total time increases when going to three to ﬁve shares, like 1D-count query. However, note that the time when using eleven shares,in the case of 1D-sum query is higher than ﬁve shares, unlike1D-count query. In the case of eleven shares, the server checks (cid:104) d , d , d , d , d (cid:105) and sends their output to reduce the degree ofthe polynomial. However, the server does not send the output ofstring-matching operation over (cid:104) d (cid:105) digit. Now, the user createsﬁve shares of each value after reducing the degree. On receivingnew shares, the server multiplies the i th new share to the outputof (cid:104) d (cid:105) digit string-matching, whose resultant is multiplied by i th share of the attribute on which the sum operation is carried out.Note that after this multiplication, the degree of the polynomialis four; thus, the user sent ﬁve shares to recover the secret value.Note that while user time is almost same when using ﬁve or elevenshares, the server time is higher in the case of eleven shares, sincethe server is matching almost the entire query predicate, exceptthe last digit. Thus, the server time in the case of eleven shares ishigher than ﬁve shares. In the case of ﬁfteen shares, the user timeis minimum, since servers sends the ﬁnal answer to the sum query,while server time is maximum. Now, consider a 2CE-sum query on OK and LNattributes, where the number of digits in OK and LN were six andone, respectively. In order to execute this 2CE-sum query, we needat most ﬁfteen shares. A 2CE-sum query execution time followsa similar trend like of a 1D-sum query (see Figure 9). Hence, weomit details here.

UnC-Max-Tuple-Fetch query.

While retrieving a tuple having themaximum value for an attribute, say A c , the server joins two rela-tions S ( R ) and S ( R ) , based on TID , as mentioned in § TID consists of seven dig-its, which we represent as (cid:104) d , d , d , d , d , d , d , d (cid:105) . Whenincreasing number of shares from three to ﬁfteen, the servertime increases and the user time decreases, similar to all theabove-mentioned queries. Moreover, in this case, the total timeof computation is decreasing. For this the reason is as follows:During the join operation, the degree of the polynomial usedto create shares of TID values increases. In order to execute

EEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE); ACCEPTED MARCH 2020. 24 string-matching over

TID values and tuple retrieval in one round,we need at least ﬁfteen and sixteen shares, respectively. Thus,in our setting, the user needs to reduce the degree of string-matching resultant and re-generate shares of this to fetch the tuple,regardless of three, ﬁve, eleven, or ﬁfteen shares.In the case of three shares, the server compares only each digitand sends partial results to the user. After interpolating the partialresults, which consists of zeros for every tuple except for one, foreach tuple, the user creates three secret-shares of this vector andsends to the servers to retrieve the desired tuple. This operationrequires interpolating seven shares, and then, generating three newshares.In the case of ﬁve shares, the server compares and sends partialresults of string-matching over (cid:104) d , d (cid:105) , (cid:104) d , d (cid:105) , (cid:104) d , d (cid:105) to theuser. Note that the user interpolates three shares and generates ﬁvenew shares. Hence, the user time decreases as compared to thecase of three shares.In the case of eleven shares, each server sends partial resultsof string-matching over (cid:104) d , d , d , d , d (cid:105) its share, and the usergenerates ﬁve new shares. Thus, the user time again decreases inthis case. In the case of ﬁfteen servers, each server checks theentire TID value and sends partial results for degree reduction.After interpolating the values, the user generates three new secret-shared ﬁles. Thus, the user time again decreases in this case ascompared to eleven servers. In addition, as the servers check morenumber of digits in the

TID value, their time increases.

Cond-Max-Det.

As mentioned in § e . g ., three, ﬁve, oreleven), the server ﬁrst checks partial query predicates and sendsresults to the user for degree reduction, like 1D-count query.The user decreases the degree of string-matching resultant andsends new shares, where the new i th share gets multiplied by i th value of Index attribute to know the tuple-ids. Finally, theserver sends the resultant to the user. After interpolation, the userknows the tuple-ids that satisfy the query predicate. Hence, inthe case of three, ﬁve, or eleven shares, the user executes theinterpolation operation two times, while in the case of ﬁfteenshares, the user executes interpolation operation only one time.Hence, the user computation time reduced when increasing thenumber of shares. After knowing tuple-ids, the user asks theserver to ﬁnd the maximum value in the given tuple-ids usingthe relation S ( R ) , and this operation takes same time regardlessof the number of shares. While increasing the number of shares,the server computation time increases, as it happened in all above-mentioned queries. Cond-Max-Tuple-Fetch.

Fetching a tuple having maximum valueaccording to a conditional query requires two rounds, as stated in § S ( R ) with one of the tuples of S ( R ) based on TID attribute of S ( R ) .In this query, we set a condition on OK attribute, which requiressix digits to represent a value. Hence, we need at least × shares to know the tuple-ids in one communication round.Further, to get the desired tuple, based on join over TID , that hasseven digits, we need at least sixteen shares (ﬁfteen shares for string-matching operations ( × ) and one more sharefor reconstructing the tuple values).In the ﬁrst round, the user interpolates at least twice in thecase of three, ﬁve, and eleven shares, and at least once in thecase of ﬁfteen shares to know the tuple-ids. Further, we use atmost ﬁfteen shares in our experiments; hence, the user needs toreduce the degree at least once, of string-matching resultant in thesecond round to get the desired tuple. Thus, user computation timedecreases as the number of shares increases. Further, the servertime decreases, as we increase the number of shares, similar toother queries. A PPENDIX FS ECURITY P ROOF O UTLINE

Now, we provide the security proof outline for O

BSCURE . Inour context, we, ﬁrst, need to show that an adversary cannotdistinguish any two queries of the same type based on the outputsize, i . e ., the query/user privacy will be maintained. Once we canprove the query privacy, we will show how the server privacy ( i . e .,not revealing more information to the user) is achieved. Theorem 1

If the adversarial cloud can distinguish two inputqueries, then either the random polynomials used for creatingshares of a query is not correct or O BSCURE does not providequery privacy.

In order to show that the adversary can never know the exact queryvalue, we consider two instances of the datasets, as follows: D and D , where D differs from D only at one value each, say v and v , i . e ., v is in D but D and v is in D but D . Here,we show that if the adversary can distinguish the single differentvalue in D and D , she can break O BSCURE . In this setting, theserver executes the input queries on D and D .By our assumption of ciphertext indistinguishability (men-tioned in § D and D are identical or different. Note that if the DB owner uses only onepolynomial ( i . e ., a weak cryptographic plan), then the adversarycan ﬁnd which value is the only single values of D that isdifferent from values of D . Moreover, it reveals frequency-countof values.Now assume the queries for the value v and v that will bemapped to secret-shared queries, q v ( D ) and q v ( D ) , respec-tively. Further, assume that q v ( D ) and q v ( D ) are identical.Hence, the adversary will consider both of them as an identicalquery, while they are for different queries. Hence, the adversarycannot distinguish two queries. Now, assume that q v ( D ) and q v ( D ))