Privacy-preserving Targeted Advertising
Theja Tulabandhula∗, Shailesh Vaya and Aritra Dhar
University of Illinois at Chicago, The Aquarian Inventors, ETH Zurich
∗email: [email protected]

Abstract
Recommendation systems form the centerpiece of a rapidly growing trillion-dollar online advertisement industry. Even with numerous optimizations and approximations, collaborative filtering (CF) based approaches require real-time computations involving very large vectors. Curating and storing such related profile information vectors on web portals seriously breaches the user's privacy. Modifying such systems to achieve private recommendations further requires communication of long encrypted vectors, making the whole process inefficient. We present a more efficient recommendation system alternative, in which user profiles are maintained entirely on their device, and appropriate recommendations are fetched from web portals in an efficient privacy-preserving manner. We base this approach on association rules.
Introduction

Targeted advertising (TA) uses keywords, their frequencies, link structure of the web, user interests/demographics, recent/overall buying histories etc. to deliver personalized advertisements [1]. TA is enabled by (unique) cookies (random hash-maps) stored on the user's devices [2]. When a cookie is retrieved from a device, the web server storing this string can recall the profile of the user associated/stored with it at its end. This mechanism constitutes a serious breach of privacy, as it allows websites to build very elaborate profiles of users at their end [3, 4, 5]. This leads to the question: can we achieve TA without employing the cookie mechanism? Any such alternative approach would require the user to maintain their profile on their own device, and use it for interactively computing and fetching the appropriate TA from the server. In fact, solving this problem effectively can be considered a highly valuable contribution [6]. We present a solution framework for this very important commercial problem of wide-spread interest. In the presentation, we focus on the crux of the solution: a privacy-preserving recommendation system.

We present a novel approach for a privacy-preserving recommendation system (RS), which shifts most of the computational work to a pre-processing stage. In the query processing phase, i.e., when we need to provide recommendations to a user according to their profile, most of the computation is done by the server in our proposed system, while the client merely computes and exchanges encryptions of a few messages. We will use recommender systems terminology in the paper, wherein items can also denote ads. Our system is based on selection and application of association rules (AR), to produce an ordered list of recommended items. ARs [7] capture the relation that if a user has already bought a set of items p (called the antecedent), then she is very likely to also buy another set of items q (called the consequent).
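To make the rule semantics concrete, here is a toy illustration (our own example, not from the paper): checking whether a rule p → q applies to a user profile is simply a subset test, and the consequent supplies candidate recommendations.

```python
# A rule p -> q is "applicable" to a transaction T when p is a subset of T;
# the items in q then become candidate recommendations.
rule = (frozenset({"bread", "butter"}), frozenset({"milk"}))  # p -> q
transaction = {"bread", "butter", "eggs"}                     # user profile T

p, q = rule
if p <= transaction:                  # antecedent contained in the profile
    recommended = q - transaction     # do not re-recommend owned items
    print(sorted(recommended))        # -> ['milk']
```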
ARs are mined from large databases of historical user purchase data on the server-side, and often filtered to retain only the most meaningful insights [8]. While CF methods [9] extract and use pairwise marginal statistics from the joint probability distribution of user preferences over items, ARs take into account more complex summary statistics of the joint probability distribution, which capture information about much larger collections of items. Most computation in a CF based system is done both offline as well as online when the recommendation is computed. However, ARs are generated at an offline pre-processing stage, and query processing merely requires selection and application of the right AR interactively. If done correctly, this process can have lower communication complexity in a privacy-preserving system. Lastly, note that CF encounters more bottlenecks when privacy-preserving properties are desired: (a) explicit and consistent feedback from users is hard to obtain, store, organize and compile, given that it has to be attached to the right user ID; and (b) the spatio-temporal profiles of users can change continuously, leading to inaccurate targeted recommendations from a CF system. Because ARs are user-agnostic (they only track population-level statistics), they can be more robust to these two issues.

In order to build a privacy-preserving recommendation system using ARs, we first devise meaningful criteria for selecting the most relevant ARs given a query (user profile) and forming an ordered list of recommended items. We then present fast exact and approximate algorithms whose outputs satisfy these criteria. We provide their privacy-preserving versions, in which the client does not reveal its profile (we will use the word transaction, borrowed from the AR literature, to denote a profile) to the server, and the server does not reveal its database of ARs to the client.
Finally, we present experimental results to demonstrate the practicality of our solutions, in terms of the latency overheads incurred due to privacy, in e-commerce and other domains.

Problem Statement and Contributions
There is a client C, which has a set of items, referred to as a transaction set T ⊆ I, with I being the universal item set and 2^I denoting its power set. The transaction set T represents the current profile of a user. A server S processes a large historical transaction database and stores a database of association rules D, in the form of {p_i → q_i}_{i=1}^{|D|}. Here |·| represents the size of a set/database, and p_i and q_i are sets (belonging to 2^I) that define the i-th AR. In the non-private version of the problem, given a single transaction T, the server S computes and sends recommended items that are based on the consequents of matching association rules, where matching is defined suitably. For instance, if there are multiple ARs applicable to the transaction, then an ordered list of recommended items is prepared by collating the items recommended by each of the multiple ARs, taking into consideration the (optional) weights attached to these rules (for instance, these can be lift, conviction, Piatetsky-Shapiro etc., or weights directly optimized for recommendation accuracy); such an ordering can also be based on the items in the input transaction, where each item may be assigned a weight according to some monotonically decreasing function of the time elapsed since the item was active (e.g., purchased). In the privacy-preserving version, given an input transaction T, defined as an ordered list of items, held by C, and a database of ARs (D) held by S, the client and server privately and interactively compute the most relevant item, or ordered list of items, to be recommended to C. Below is a list of our contributions:

(1) Criteria: We formulate several criteria for selecting the appropriate set of association rules. A rule is applicable if its antecedent is contained in the transaction. These criteria are differentiated using parameters such as a threshold weight w, which is used to eliminate all rules below the given threshold weight, an antecedent length threshold t, which is used to eliminate all rules above the given length threshold, and a parameter k, which is used to select the top-k association rules under a specified ordering. We relate these criteria to a newly defined set optimization problem called the Generalized Subset Containment Search (GSCS). We show how the GSCS problem is a strict super-set of the well known Maximum Inner Product Search (MIPS) criterion [10].

(2) Algorithms: We develop efficient exact and approximate algorithms for computing recommended items based on the criteria above. Exact implementations build on a novel two-level hashing based data structure that stores the ARs in a manner such that their antecedents can be appropriately matched, and the corresponding consequent(s) can be efficiently fetched. A key benefit of the data structure is that it provides a weak form of privacy by itself, and is readily amenable to the privacy protocols mentioned below. Our implementations are parallelizable, and can exploit multi-threaded machines. We also design a novel fast randomized approximation algorithm for GSCS that fetches applicable ARs based on Locality Sensitive Hashing (LSH) [11] and hashing based algorithms for MIPS [12].

(3) Privacy-preserving 2-party Protocols: Next, we design communication-efficient privacy-preserving protocols corresponding to the above exact and approximate algorithms. These protocols are based on oblivious transfer and are straightforward to implement. Further, the protocol for the approximation algorithm can be easily extended to embed many other large scale data processing tasks that rely on LSH (for instance, record linkage, data cleaning and duplicity detection problems [13], to name a few). Finally, we extensively evaluate the impact that adding privacy has in terms of latency in recommending items. We emphasize that these latencies are manageable for reasonably sized rule databases (see Section 6) and practical for certain targeted advertising settings. We do note that achieving truly web-scale targeted advertising (orders of magnitude larger problem instances) under the design choices made in our ARs based recommendation system, while not impossible, would need further research.

Related Work
The GSCS problem introduced in Section 3 is similar to other popular search problems on sets, including the Jaccard Similarity (JS) problem, and vector space problems such as the Nearest-Neighbor (NN) and the Maximum Inner Product Search (MIPS) problems. For these related problems, solutions based on hashing techniques such as LSH are already available. For instance, one can find a set maximizing Jaccard Similarity with a query set using a technique called Minhash. The NN problem can be addressed using L2LSH [11] and variants. The MIPS problem can be solved approximately using methods such as L2-ALSH(SL) [10] and Simple-LSH [12], among others. Note that the GSCS problem is different from all of these, and also different from the similar sounding Maximum Containment Search problem defined in [13]. The problem in [13] is equivalent to the MIPS problem, while GSCS is not. In this paper, we give a new approximate algorithm for the GSCS problem using the hashing techniques listed above.

Privacy-preserving recommendation systems have been well studied in the past. For instance, privacy based solutions for different types of collaborative filtering systems have been proposed in [14, 15, 16]. Roughly, in that setting the problem reduces to computing the dot product of a matrix with a vector of real numbers, where the (recommendation) matrix is possessed by the server and the client possesses the vector, and both the client and the server are interested in preserving the privacy of their data. Since embedding such schemes into a privacy protocol based on cryptography is difficult, many solutions resort to data modification and adding noise. For instance, in [17], the authors propose a perturbation based method for preserving privacy in data mining problems. This approach is only applicable when one is interested in aggregate statistics and does not work when more fine-grained privacy is needed. In [18], the authors propose a decentralized distributed storage scheme along with data perturbation to achieve certain notions of privacy in the collaborative filtering setting. For the same setting, a method based on perturbations is also proposed in [19] and [20]. In [21], the authors proposed a theoretical approach for a system called Alambic, which splits customer data between the merchant and a semi-trusted third party. The security assumption is that these parties do not collude to reveal customers' data.
A major difference between our work and all these solutions is that we base our privacy solutions on cryptographic primitives (notably oblivious transfer) and build specific protocols that work with association rules. This is attractive because ARs are already heavily used in practice for exploratory analysis in the industry. In particular, we propose one of the first practical distributed privacy-preserving protocols for recommendation systems based on the selection and application of association rules. Note that privacy has also been well studied in the context of generating association rules from historical transaction data [22, 23], but not as much for the problem of their selection and application in a recommendation or targeted advertising context.

Our work is closely related to the literature on secure two-party computation [24, 25]. In this model, two players with independent inputs want to compute a function of the union of their inputs while not revealing their own inputs to the other party. GMW and Yao [26, 27] prove the feasibility of secure two-party computation (assuming honest-but-curious parties); their constructions are based on a Boolean circuit computing the desired function, although, because of their generality, they require a lot of communication. On the other hand, specialized protocols (for instance, our private protocols) for prominent classes of problems (for instance, targeted advertising or recommendations) are worth designing because they can reduce privacy overheads considerably. Similar to the generic protocols, our solutions also implement the same functionality as a trusted third party. Secure sorting is an integral part of our solution, and has been previously studied in the literature [28]. Note that our work is also distinct from Trusted Execution Environments (TEE) such as Intel SGX, ARM TrustZone etc. The latter provide integrity and secrecy of computation by placing all executions in isolated encrypted memory.

Our work addresses not just the resultant secure two-party computation problem, but also allows dealing with a large number of association rules via LSH based indexing, which sets it apart from the rest of the literature. Because LSH based retrieval is similar to a standard database query retrieval problem, we are able to design privacy protocols for approximate retrieval as well as exact retrieval of item recommendations using the same primitives. These protocols consider only honest-but-curious corruptions of the involved parties, in which the parties cannot deviate from the main protocol, but can try to glean whatever extra information they can from the transcripts of the protocol's execution.
Overview
In Section 2, we present different criteria for the selection of applicable association rules, assuming that the rules have been mined beforehand from historical transaction data on the server-side. While the rules could be mined in a privacy-preserving manner as well, we sidestep this aspect here (see the related work above for prior solutions). While the accuracy (or any other information retrieval measure) of the recommendations will depend on the rule mining step in conjunction with the selection criteria, we take a decoupled approach in this paper to isolate the impact of introducing privacy on the recommendation system at query time. So for the first step, we assume that an off-the-shelf rule miner (such as SPMF, which implements Apriori/FPGrowth) has already been used to generate rules, and for the second step, we proceed with designing rule selection criteria optimized for query-time performance. In this setting, rules can be optimized for recommendation performance in the first step itself by assuming a weighted decision-list model class and optimizing the weights of the model by minimizing a suitable recommendation error. A rule with a higher weight can signify higher importance and should be considered first while recommending, and our selection criteria take this into account while fetching the most applicable ARs (see ordering functions in Section 2). We present new approximate and exact algorithms that implement the proposed criteria in Sections 3 and 4 respectively. We then develop their privacy-preserving counterpart protocols in Section 5. In Section 6 we describe experimental evaluations that validate the practicality of the proposed algorithms and their privacy-preserving versions in moderate-scale recommendation systems. It is important to note that neither AR based nor CF based recommendations dominate the other in terms of recommendation quality; their relative performance depends on the specific dataset. Hence, we do not pursue this comparison in this paper. Our conclusions are presented in Section 7.
We have a database D of association rules {p_i → q_i}_{i=1}^{|D|} with p_i, q_i ⊆ I, and additional attributes (for instance, a weight attribute w_i could denote an interestingness measure or the rule's importance for recommendation quality). Given a query transaction T ⊆ I of the client, we perform two steps: a Fetch operation followed by a Collate operation. We propose multiple criteria for deciding which association rules should be fetched. We also define a search problem called the Generalized Subset Containment Search (GSCS) and relate it to one of the criteria. Our algorithm in Section 3 solves the GSCS problem in a computationally efficient, albeit approximate, manner. In Section 4, we give linear-time-query and space-efficient exact algorithms that are easily adaptable to the private protocol design in Section 5.
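The two-step Fetch/Collate pipeline can be sketched as follows. The data model and names (Rule, fetch, collate) are our own simplifications, and the collate step shown is a weight-accumulating, capacity-bounded variant of the kind described later in this section.

```python
from collections import namedtuple

# A rule is (antecedent, consequent, weight); sets are frozensets of items.
Rule = namedtuple("Rule", ["antecedent", "consequent", "weight"])

def fetch(rules, T):
    """Fetch step: keep rules whose antecedent is contained in T."""
    return [r for r in rules if r.antecedent <= T]

def collate(applicable, T, cap):
    """Collate step: accumulate rule weights per candidate item, keep top cap."""
    scores = {}
    for r in applicable:
        for item in r.consequent - T:          # skip items already owned
            scores[item] = scores.get(item, 0) + r.weight
    ranked = sorted(scores, key=lambda it: -scores[it])
    return ranked[:cap]

rules = [
    Rule(frozenset({"a"}), frozenset({"c"}), 3),
    Rule(frozenset({"a", "b"}), frozenset({"c", "d"}), 5),
    Rule(frozenset({"e"}), frozenset({"f"}), 9),   # antecedent not in T
]
T = {"a", "b"}
print(collate(fetch(rules, T), T, cap=2))  # 'c' scores 3+5=8, 'd' scores 5
```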
Fetch Step: Selection of Rules
We say that an association rule i is applicable to a transaction T if and only if p_i ⊆ T. The selection criteria are differentiated based on parameters such as a threshold weight w ∈ Z_+, which is used to eliminate all applicable association rules below the threshold weight; t ∈ Z_+, which is used to eliminate rules with antecedent lengths greater than t; and k ∈ Z_+, which is used to select the top-k association rules according to a predefined notion of ordering specified using an ordering function f : {1, ..., |D|} → Z.

Ordering Functions. The function f determines which of the applicable rules are the top-k rules. We can define f such that rules are ordered according to: (a) their weights w_i (e.g., f(i) = w_i), or (b) their antecedent lengths |p_i| (e.g., f(i) = |p_i|), or (c) a combination of both (e.g., f(i) = g_1(w_i) + g_1(w_max) · g_2(|p_i|), where g_1 and g_2 are strictly monotonic integer-valued functions and w_max = max_{i=1,...,|D|} w_i). For a pair of rules with antecedents p_1 and p_2 and weights w_1 and w_2, this latter function has the following properties: (i) if |p_1| < |p_2|, then f(1) ≤ f(2); and (ii) if |p_1| = |p_2| and w_1 ≤ w_2, then f(1) ≤ f(2). Another example of a combination ordering function is f(i) = g_1(|p_i|) + g_1(|I|) · g_2(w_i), which prefers weights first and then lengths in case of ties.

The Criteria.
The TOP-Assoc(k, w, t, f) criterion outputs a set of applicable rules based on three parameters and an ordering function f. Parameter w filters out rules with weights < w. Parameter t retains rules with antecedents of length ≤ t. Parameter k ∈ Z_+ controls the number of applicable rules that are finally output. We can write an optimization formulation representing this criterion as follows:

max_{x ∈ {0,1}^{|D|}} Σ_{i=1}^{|D|} x_i · f(i)  such that  Σ_i x_i ≤ k, and x_i ≤ min{ 1[|p_i| ≤ t], 1[w_i ≥ w], 1[p_i ⊆ T] }.

Here, x_i is a binary decision variable that takes a value in {0, 1} and indicates whether an applicable rule is selected. The inequality x_i ≤ min{1[|p_i| ≤ t], 1[w_i ≥ w], 1[p_i ⊆ T]} is mathematical programming notation for the following constraint: variable x_i can be set to 1 if and only if |p_i| ≤ t, w_i ≥ w and p_i ⊆ T. The term 1[expr] denotes an indicator function that takes the value 1 if expr is true, and 0 otherwise. Since x_i is constrained to be less than the minimum of three indicator functions, it can only take the value 1 if all three evaluate to 1. Otherwise, x_i necessarily takes the value 0.

The TOP-1-Assoc(f) criterion is the special case of TOP-Assoc(k, w, t, f) where k = 1, w = 0 and t = |I|. The TOP-K-Assoc(k, f) criterion is the special setting where w = 0 and t = |I|. The ALL-Assoc(w, t) criterion is another special case where k = |D|, in which case f does not matter. Finally, under the ANY-Assoc(k, w, t) criterion, the output contains at most k applicable association rules with weights ≥ w and rule antecedent lengths ≤ t. The corresponding optimization formulation is max_{x ∈ {0,1}^{|D|}} Σ_i x_i such that Σ_i x_i ≤ k, and x_i ≤ min{1[|p_i| ≤ t], 1[w_i ≥ w], 1[p_i ⊆ T]}. To summarize, TOP-Assoc, TOP-1-Assoc, TOP-K-Assoc, ALL-Assoc, and ANY-Assoc are some of the selection criteria we propose.
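A direct linear-scan sketch of the TOP-Assoc(k, w, t, f) criterion with the combined ordering function (c) above; the identity choices for g_1 and g_2 are our own (any strictly increasing integer-valued functions would do, for non-negative weights).

```python
# Linear-scan TOP-Assoc: filter by antecedent length t, weight w, and
# containment in T, then order by f and return the top k rules.
def top_assoc(rules, T, k, w, t):
    w_max = max(wt for _, _, wt in rules)
    g1 = g2 = lambda x: x                  # our choice: identity functions

    def f(p, wt):
        # combined ordering: prefers longer antecedents, ties broken by weight
        return g1(wt) + g1(w_max) * g2(len(p))

    applicable = [
        (p, q, wt) for (p, q, wt) in rules
        if len(p) <= t and wt >= w and p <= T
    ]
    applicable.sort(key=lambda r: -f(r[0], r[2]))
    return applicable[:k]

rules = [
    (frozenset("ab"), frozenset("c"), 2),
    (frozenset("a"), frozenset("d"), 7),
    (frozenset("abc"), frozenset("e"), 1),  # antecedent not contained in T
]
T = frozenset("ab")
# f({'a','b'}, 2) = 2 + 7*2 = 16 beats f({'a'}, 7) = 7 + 7*1 = 14:
print(top_assoc(rules, T, k=1, w=0, t=10))
```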
Generalized Subset Containment Search. The TOP-1-Assoc(f) criterion with an f(·) that (all else being equal) prefers longer applicable rules leads to two new search problems: (a) the Largest Subset Containment Search (LSCS) problem, and (b) its generalization, the Generalized Subset Containment Search (GSCS) problem.

Notation: Until now, we used p_i, q_i and T to denote item sets (and |·| denotes their size). With a slight abuse of notation, we will also denote their set characteristic vectors with the same symbols. A set characteristic vector has its coordinates equal to 1 if the corresponding items are present in the set, and 0 otherwise. Thus, in this case, p_i, q_i and T are also binary valued vectors in the |I|-dimensional space. Also, we denote the j-th coordinate of p_i as p_i^j, where 1 ≤ j ≤ |I|, and the ℓ_2-norm of p_i as ‖p_i‖. The meaning of the symbols will hopefully be clear from the context, and will also be reiterated as needed.

Largest Subset Containment Search: The problem max_{1 ≤ i ≤ |D|} Σ_{j=1}^{|I|} T^j · p_i^j subject to p_i^j ≤ T^j for 1 ≤ j ≤ |I| attempts to find a set (i.e., a set characteristic vector) whose inner product with the set characteristic vector T is the highest among all sets that are subsets of T. It is related to the TOP-1-Assoc(f) criterion for certain ordering functions, as shown below.

Lemma 1. When f(i) ∝ |p_i| for all 1 ≤ i ≤ |D|, the TOP-1-Assoc(f) criterion is equivalent to the LSCS problem.

The above lemma holds because if p_i ⊆ T, then |p_i| is equal to Σ_{j=1}^{|I|} p_i^j T^j, which is the objective of LSCS. This connection to LSCS allows us to design a sub-linear time (i.e., o(|D|)) randomized approximate algorithm for fetching ARs (see Section 3) under the TOP-1-Assoc criterion (with an appropriately chosen f), provided we can come up with such algorithms for the LSCS problem. And the way we achieve this is by building on fast unconstrained inner-product search techniques [12] that solve the Maximum Inner Product Search (MIPS) problem, which is the following: given a collection of "database" vectors r_i ∈ R^d, 1 ≤ i ≤ |D|, and a query vector s ∈ R^d (d is the dimension), find a data vector maximizing the inner product with the query, r* ∈ arg max_{1 ≤ i ≤ |D|} Σ_{j=1}^{d} r_i^j s^j.

MIPS and LSCS are not equivalent to each other. To see this, consider the following MIPS instance constructed to mimic an LSCS instance. Let d = |I|. Let the r_i be equal to normalized antecedent vectors, r_i = p_i/‖p_i‖^2 with p_i ∈ {0,1}^{|I|} for 1 ≤ i ≤ |D|, and let the query s be equal to the set characteristic vector T ∈ {0,1}^{|I|}. The normalization in the definition of the r_i's ensures that smaller length antecedents are preferred in the MIPS instance in order to mimic the subset containment or 'applicable' property. Let p_LSCS and p_MIPS be the optimal solutions of the LSCS and MIPS instances constructed above. Then the following statements hold.
Lemma 2. (1) [LSCS feasible implies MIPS optimal] If the LSCS instance is feasible (i.e., there exists at least one p such that p^j = 0 for all j where T^j = 0), then all feasible solutions of the LSCS instance are optimal for the MIPS instance. (2) [MIPS optimal does not imply LSCS optimal] If there exist p_1, p_2 with ‖p_1‖ < ‖p_2‖ such that p_1^j = 0 and p_2^j = 0 for all j where T^j = 0, then p_1 (as well as p_2) is optimal for the MIPS instance, but p_1 is not optimal for the LSCS instance.

In other words, part (1) implies that if the LSCS instance is feasible, then there exists an optimal solution p_MIPS for the constructed MIPS instance that has p_MIPS^j = 0 for all coordinates where T^j = 0. Part (2) implies that the optimal solutions of the MIPS instance are potentially feasible for the LSCS instance if they satisfy a condition. However, there is no guarantee that they will be optimal for the LSCS problem. In the worst case, there could be as many as O(2^{|T|}) optimal solutions for the MIPS instance but only a unique solution for the LSCS instance.

Generalized Subset Containment Search: LSCS can be generalized to get the GSCS problem, which is as follows: max_{1 ≤ i ≤ |D|} f′(i) · Σ_{j=1}^{|I|} T^j · p_i^j subject to p_i^j ≤ T^j for 1 ≤ j ≤ |I|, where f′(i) is an ordering function. When f′(i) = 1, this is LSCS. In other words, for LSCS the ordering determining the top is not dependent on attributes such as the weight w_i and is only dependent on the antecedent length (through the inner product term in the objective). On the other hand, GSCS can account for arbitrary ordering functions, especially when such functions capture task specific meaning (such as being related to recommendation accuracy, for instance). Since the GSCS problem is more general, it is clear that the GSCS and the MIPS problems are also different.

Collate Step: Item Recommendations
Once we have generated a set L of applicable association rules according to one of the criteria described above (assumed non-empty; otherwise we return a predefined list such as the list of globally most frequent items), we can compile a list of item recommendations in the following two ways:

Uncapacitated setting: We simply return the union of the consequent(s) q_i of the association rules in L; however, this list can be potentially large.

Capacitated setting: The client may have a constraint k′ << |I| on the number of items it can recommend to the user. In this case, we derive associated weights w̃_j for each item j ∈ ∪_{i∈L} q_i by adding up the weights w_i of the rules where item j is in the consequent q_i. The list of recommended items is then sorted according to these accumulated weights. If the bound k′ < |∪_{i∈L} q_i|, we return the top k′ items from this sorted list; else we return all the items.

We give approximate randomized algorithms that allow for sublinear time fetching of applicable association rules under two different criteria: TOP-1-Assoc and TOP-K-Assoc. We first introduce the basics of locality sensitive hashing that will be key to the design of our algorithms. Next, we propose our first algorithm for the TOP-1-Assoc(f) criterion. And finally, we show how selecting rules for TOP-K-Assoc can be reduced to finding solutions for multiple instances of the former, allowing us to reuse the data structure construction and query processing steps.
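As a numerical sanity check of the LSCS-to-MIPS construction from the previous section (database vectors r_i = p_i/‖p_i‖^2 and query s = T, under our reading of the normalization), the following sketch shows that applicable antecedents all attain inner product exactly 1, while non-applicable ones score strictly below 1 — illustrating both parts of Lemma 2.

```python
import numpy as np

I = ["a", "b", "c", "d"]                       # universal item set (toy)

def charvec(items):
    """Set characteristic vector over the universal item set I."""
    return np.array([1.0 if x in items else 0.0 for x in I])

antecedents = [{"a"}, {"a", "b"}, {"c", "d"}]  # first two are subsets of T
T = charvec({"a", "b"})

scores = []
for p in antecedents:
    v = charvec(p)
    r = v / (v @ v)                            # r_i = p_i / ||p_i||^2
    scores.append(float(r @ T))

print(scores)                                  # -> [1.0, 1.0, 0.0]
```

Note that both applicable rules tie at 1.0 in the MIPS instance even though LSCS prefers the longer antecedent {a, b} — exactly the gap described in Lemma 2, part (2).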
Hashing based Data Structures for Rule Retrieval
LSH [29, 30] is a technique for finding vectors from a known set of vectors that are most 'similar' to a given query vector in an efficient manner (using randomization and approximation). We will use it to solve the MIPS and GSCS problems associated with the criteria mentioned above. The method uses hash functions that have the following locality-sensitive property.
Definition 1. An (r, cr, P_1, P_2)-sensitive family of hash functions (h ∈ H) for a metric space (X, d) satisfies the following properties for any two points p, q ∈ X:

• If d(p, q) ≤ r, then Pr_H[h(q) = h(p)] ≥ P_1, and
• If d(p, q) ≥ cr, then Pr_H[h(q) = h(p)] ≤ P_2.

LSH solves the nearest neighbor problem via solving a near-neighbor problem defined below.
Definition 2. The (c, r)-NN (approximate near neighbor) problem with failure probability f ∈ (0, 1) is to construct a data structure over a set of points P that supports the following query: given a point q, if min_{p∈P} d(q, p) ≤ r, then report some point p′ ∈ P ∩ {p : d(p, q) ≤ cr} with probability 1 − f. Here, d(q, p) represents the distance between points q and p according to a metric that captures the notion of neighbors. Similarly, the c-NN (approximate nearest neighbor) problem with failure probability f ∈ (0, 1) is to construct a data structure over a set of points P that supports the following query: given a point q, report a c-approximate nearest neighbor of q in P (i.e., return p′ such that d(p′, q) ≤ c · min_{p∈P} d(p, q)) with probability 1 − f.

The following theorem states that we can construct a data structure that solves the approximate near neighbor problem in sub-linear time.
Theorem 1 ([30], Theorem 3.4). Given an (r, cr, P_1, P_2)-sensitive family of hash functions, there exists a data structure for the (c, r)-NN (approximate near neighbor) problem over points in the set P (with |P| = N) such that the time complexity of returning a result is O(nN^ρ/P_1 · log_{1/P_2} N) and the space complexity is O(nN^{1+ρ}/P_1). Here ρ = log(1/P_1)/log(1/P_2). Further, the failure probability is upper bounded by 1/3 + 1/e.

Amplification. The data structure is as follows [29]: we employ multiple hash functions to increase the confidence in reporting near neighbors by amplifying the gap between P_1 and P_2. The number of such hash functions is determined by parameters L_1 and L_2. We choose L_2 functions of dimension L_1, denoted as g_j(q) = (h_{1,j}(q), h_{2,j}(q), ..., h_{L_1,j}(q)), where the h_{t,j} with 1 ≤ t ≤ L_1, 1 ≤ j ≤ L_2 are chosen independently and uniformly at random from the family of hash functions. The data structure for searching points with high similarity is constructed by taking each point x (in our setting, these would be the set characteristic vectors of antecedents) and storing it in the location (bucket) indexed by g_j(x), 1 ≤ j ≤ L_2. When a new query point q is received, g_j(q), 1 ≤ j ≤ L_2 are calculated and all the points from the search space in the buckets g_j(q), 1 ≤ j ≤ L_2 are retrieved. We then compute the similarity of these points with the query vector in a sequential manner and return any point that has a similarity greater than the specified threshold r. We also interrupt the search after finding the first L_3 points, including duplicates (this is necessary for the guarantees in Theorem 1 to hold). Choosing L_1 = log N, L_2 = N^ρ and L_3 = 3L_2 allows for sublinear query time. The storage space for the data structure is O(nN^{1+ρ}), which is not too expensive (we need O(nN) space just to store the points).

The following theorem states that an approximate nearest neighbor data structure can be constructed using an approximate near neighbor data structure.
Theorem 2 ([30] Theorem 2.9) . Let P be a given set of N points in a metric space, and let c, f ∈ (0 , and γ ∈ ( N , be parameters. Assume we have a data-structure for the ( c, r ) -NN (approximate near neighbor) problem that uses space S and has query time Q and failureprobability f . Then there exists a data structure for answering c (1 + O ( γ )) -NN (approximate nearest neighbor) problem queries in time O ( Q log N ) with failure probability O ( f log N ) . Theresulting data structure uses O ( S/γ log N ) space. Instead of using the data structure above, below we will use a slightly sub-optimal datastructure (see
Approx-GSCS-Prep ) but amenable to private protocols (Section 5) as follows:We create multiple near neighbor data structures as described in Theorem 1 using differentthreshold values ( r ) but with the same success probability 1 − f (by amplification for instance).When a query vector is received, we calculate the near-neighbors using the hash structure withthe lowest threshold. We continue checking with increasing value of thresholds till we findat least one near neighbor. Let e r be the first threshold for which there is at least one nearneighbor. This implies that the probability that we don’t find the true nearest neighbor is atmost f because the near neighbor data structure with the threshold e r has success probability1 − f . If the different radii in the data structures are not too far apart so that not too manypoints are retrieved, we can still get sublinear query times. Solving for TOP-1-Assoc( f ) For the
TOP-1-Assoc criterion, our goal is to return a single rule whose antecedent set is contained within the query set T (i.e., is applicable) and whose f(i) is the highest among such applicable rules. Our algorithm solves the GSCS formulation of this criterion by constructing a corresponding approximating MIPS instance and solving it using locality sensitive hashing (LSH) based techniques [12]. Our scheme has two parts: (a) a preprocessing stage involving Approx-GSCS-Prep (Algorithm 1), and (b) a query stage involving
Approx-GSCS-Query (Algorithm 3). In
Approx-GSCS-Prep, the algorithm prepares a data structure, based on all the rules, that can be efficiently searched at query time to obtain applicable association rules. In
Approx-GSCS-Query, the algorithm takes a transaction T and outputs the rule that satisfies the TOP-1-Assoc(f) criterion via a linear scan (worst case O(2^{|T|})).

Pre-processing Stage.
The GSCS instance for obtaining applicable rules is: max_{1 ≤ i ≤ |D|} f(i) · Σ_{j=1}^{|I|} p_{ji} T_j such that p_{ji} ≤ T_j for all 1 ≤ i ≤ |D|, 1 ≤ j ≤ |I|. Since a GSCS instance cannot be transformed exactly to an MIPS instance, as detailed in Section 2, we create an approximating MIPS instance that results in a set of candidate rules containing the rule which is optimal for the TOP-1-Assoc criterion. From these rules, we prune for the one that satisfies the
TOP-1-Assoc(f) criterion by evaluating the ordering function at query time (see below). The approximate MIPS instance we construct is as follows: max_{1 ≤ i ≤ |D|} (1/‖p_i‖²) · Σ_{j=1}^{|I|} p_{ji} T_j, where we have replaced the hard constraints related to subset containment with a proxy scaling coefficient. The effect of the coefficient is that it prefers applicable antecedents with small antecedent length, which ensures that the chosen rule obeys the original containment constraint (see Lemma 2). The objective of the MIPS instance can be viewed as an inner product between two real vectors, where the first vector is (1/‖p_i‖²) p_i. Let the vector p'_i = (1/‖p_i‖²) p_i ∈ R^{|I|}, which satisfies ‖p'_i‖ ≤
1. This re-parameterization achieves two things: (a) using p'_i in the MIPS instance has the same effect as using (1/‖p_i‖²) p_i, and (b) its ℓ2 norm is at most 1, allowing us to apply the technique proposed in [12] to build our data structure using Approx-GSCS-Prep. This algorithm (shown in Algorithm 1) uses
Simple-LSH-Prep (see Algorithm 2, proposed in [12]) to create the desired data structure that allows for sublinear time querying of applicable association rules. Our scheme also works on the LSH principle described above. The subroutine that we use, namely
Simple-LSH-Prep (Algorithm 2), relies on the inner product similarity measure and comes with corresponding guarantees on the retrieval quality. To construct a data structure for fast retrieval, it uses hash functions parameterized by spherical Gaussian vectors a ~ N(0, I) such that h_a(x) = sign(a^T x) (sign() is a scalar function that outputs 1 if its argument is positive and 0 otherwise). Given the scaled p'_i vectors, Simple-LSH-Prep constructs a data structure DS as follows. It defines a mapping P for vectors x ∈ {x ∈ R^{|I|} : ‖x‖ ≤ 1} as: P(x) = [x; sqrt(1 − ‖x‖²)] ∈ R^{|I|+1}. Then, for any p'_i, due to our scaling, we have ‖P(p'_i)‖ = 1. Let T' = T/‖T‖ be the scaled version of the transaction T. Then, for any scaled vector p'_i and T' we have the following property: P(p'_i)^T P(T') = (1/‖p_i‖²) Σ_{j=1}^{|I|} p_{ji} T'_j, where (·)^T represents the transpose operation. This implies that the inner product in the space defined by the mapping P is equal to our MIPS instance objective (up to the uniform scaling of T). Further, in this new space, maximizing the inner product is the same as minimizing the Euclidean distance. Thus, using the hash functions {h_a} defined earlier to perform fast Euclidean nearest neighbor search achieves an inner product search in the original space defined by the domain of P(). Let the approximation guarantee for the nearest neighbor obtained be 1 + ν (e.g., set the parameters of the data structure in Theorem 2 such that it solves the (1 + ν)-NN problem). Then the following straightforward relation gives the approximation guarantee for the MIPS problem.

Lemma 3.
If x is a (1 + ν)-approximate solution to the nearest neighbor problem for a vector y, then (1 + ν)²(max_{p ∈ P} p · y − 1) + 1 ≤ x · y ≤ max_{p ∈ P} p · y.

Simple-LSH-Prep uses multiple parameters, including concatenation parameters {K_m}_{m=1}^{M} and repetition parameters {L_m}_{m=1}^{M} (these are determined by a sequence of increasing radii {r_m}_{m=1}^{M} needed for the nearest neighbor problem, see above). For every m, Simple-LSH-Prep picks a sequence of K_m · L_m hash functions from {h_a} and gets a K_m · L_m dimensional signature for each vector P(p'_i) ∈ R^{|I|+1}, 1 ≤ i ≤ |D|. These signatures and the chosen hash functions (across all radii) are output as DS.

Algorithm 1: Approx-GSCS-Prep({p_i}_{i=1}^{|D|})
Input: |D| rules with antecedents p_i ∈ {0, 1}^{|I|}.
Output:
Data structure DS containing the rule representations and hashing constants
begin
  in parallel, forall i = 1, ..., |D| do p'_i ← p_i/‖p_i‖²
  DS ←
Simple-LSH-Prep({p'_i}_{i=1}^{|D|})
  return DS

Query Stage.
Given a transaction T , Approx-GSCS-Query queries
Simple-LSH-Query (see Algorithm 4) to obtain rules that are applicable.
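As a concrete illustration, the transformation and signature machinery used by this subroutine can be sketched as follows (a minimal sketch: the scaling p'_i = p_i/‖p_i‖² and all parameter values are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

def P(x):
    """Simple-LSH map P(x) = [x; sqrt(1 - ||x||^2)], defined for ||x|| <= 1."""
    return np.append(x, np.sqrt(max(0.0, 1.0 - float(x @ x))))

def signature(v, A):
    """Sign signature: one bit per Gaussian hyperplane a (the rows of A)."""
    return tuple(int(b) for b in (A @ v >= 0))

# Antecedents as 0/1 characteristic vectors; assumed scaling p'_i = p_i/||p_i||^2.
ants = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
scaled = [p / (p @ p) for p in ants]

A = rng.standard_normal((12, ants.shape[1] + 1))  # 12 = K_m * L_m hash functions
buckets = {}
for i, p in enumerate(scaled):
    buckets.setdefault(signature(P(p), A), []).append(i)

# At query time the transaction T is hashed as [T; 0]: no scaling is needed,
# since sign(a^T P(T/||T||)) = sign(a^T [T; 0]).
T = np.array([1, 1, 1, 0], dtype=float)
query_sig = signature(np.append(T, 0.0), A)
candidates = buckets.get(query_sig, [])
```

The last two lines exercise the scaling-free query property that later makes the OT step in Section 5 operate on integer-valued vectors.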
Simple-LSH-Query solves the approximating MIPS problem in sub-linear time by filtering out the most similar neighbors of the transaction vector T from the set {p'_i : 1 ≤ i ≤ |D|}. It does this by constructing the vector [T; 0] ∈ R^{|I|+1} and getting (for each m = 1, ..., M sequentially) its K_m · L_m dimensional signature with the same hash functions as used before for the p'_i vectors. Then it collects rules that share the same signature as the transformed transaction vector (it stops this process at the first m for which it is able to find a rule). In particular, it ensures that the signatures agree in at least one K_m-length chunk out of the L_m chunks. Appropriate choices of K_m and L_m (which do not depend on the number of rules |D|) allow for retrieval of the top candidates with high approximation quality. Approx-GSCS-Query processes this candidate list to get the top rule in terms of the ordering function f. This is a linear search with worst case time complexity O(2^{|T|}). In case the candidate list is empty, it returns a predefined baseline rule. Note that scaling T to T' = T/‖T‖ before passing it to the transformation map P() is not necessary at query time. This is because, given a Gaussian vector a ∈ R^{d+1} and a transaction vector T ∈ {0, 1}^d, sign(a^T P(T/‖T‖)) = sign(a^T [T; 0]). The advantage of this change is that we do not have to work with a real-valued vector at query time, leading to an efficient oblivious transfer (OT) step in the privacy preserving counterpart protocol that embeds this method (see Section 5 for more details).

Solving for TOP-K-Assoc(k, f)

The approximate algorithm above can be adapted to the
TOP-K-Assoc criterion via a reduction from the approximate k-nearest neighbor problem to the approximate 1-nearest neighbor problem (the reduction and its analysis are due to Sariel Har-Peled, 2018). The reduction is as follows. Given the database D and the parameter k, we construct N = k log |D| copies of the database (D_1, ..., D_N), where in each copy every rule is included independently with probability 1/k. Given these N databases, we apply Approx-GSCS-Prep to each to generate DS_1, ..., DS_N. When we want to run a query T to get the top-k applicable association rules according to an ordering function f, we seek the most highly applicable rule from each of the data structures using Approx-GSCS-Query. Once we retrieve N highly applicable rules, we prune this list by linear scanning and sorting to obtain the top-k

Algorithm 2: Simple-LSH-Prep({v_i}_{i=1}^{|D|})
Input:
Vectors v_i ∈ R^d for 1 ≤ i ≤ |D|.
Output:
Data structure DS capturing the hash functions and the hash signatures of the input ARs
begin
  Preprocessing (generate hash functions):
  G ← φ
  in parallel, forall m = 1, ..., M, l = 1, ..., L_m, k = 1, ..., K_m do
    draw a ∈ R^{d+1} ~ N(0, I); G[m, l, k] ← a
  Preprocessing (hash the data vectors):
  Define P(x) = [x; sqrt(1 − ‖x‖²)] for x ∈ {x ∈ R^d : ‖x‖ ≤ 1}
  H ← φ
  in parallel, forall 1 ≤ i ≤ |D|, m = 1, ..., M, 1 ≤ l ≤ L_m do
    index ← φ
    forall 1 ≤ k ≤ K_m do
      if G[m, l, k]^T P(v_i) ≥ 0 then index.append(1) else index.append(0)
    H[m, l, index].add(i)
  DS ← (G, H)
  return DS

rules. This reduction increases the query time roughly by a factor that is linear in k and logarithmic in the number of rules |D|. An intuitive argument for the reduction (Anastasios Sidiropoulos, 2018) is the following: let X be the set of k nearest neighbors (rules satisfying the given criterion) to the query T. When sampling a subset of the rules, for any x ∈ X, with probability Θ(1/k) we include x and exclude every other rule in X. The specified number of sampled copies of the database D is just enough to recover the top-k rules with high probability, even when there are approximations.

For exact retrieval of applicable association rules according to any of the criteria in Section 2, we essentially perform a linear scan over all rules, filter them according to the appropriate thresholds, and sort them according to the given ordering function. Our main contribution here is a two-level data structure to store the ARs that has two attractive properties: (a) it can efficiently store the rules for quick fetching (in terms of the overall communication complexity and computational complexity needed), and (b) the data structure is easy to privatize for use in a privacy-preserving protocol (see Section 5). The data structure (denoted as H) is common to the exact implementations of all the criteria specified in Section 2.
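The sampling-based reduction for TOP-K-Assoc described earlier can be sketched as follows (here top1 stands in for running Approx-GSCS-Query against the structure built from one sampled copy; all names and constants are illustrative):

```python
import math
import random

random.seed(0)

def topk_via_sampling(rules, f, k, top1):
    """Top-k retrieval via top-1 searches over randomly sampled database copies.

    rules: rule identifiers; f: the ordering function; top1: a stand-in for
    Approx-GSCS-Query run on the data structure of one sampled copy.
    """
    n_copies = max(1, int(k * math.log(max(2, len(rules)))))  # N = k log|D| copies
    found = set()
    for _ in range(n_copies):
        # each copy keeps every rule independently with probability 1/k
        copy = [r for r in rules if random.random() < 1.0 / k]
        best = top1(copy)
        if best is not None:
            found.add(best)
    # prune by linear scanning and sorting to obtain the final top-k
    return sorted(found, key=f, reverse=True)[:k]
```

A trivial stand-in for top1 is `lambda sub: max(sub, key=f, default=None)`; in the actual scheme this call is the sublinear LSH query, so the reduction costs roughly a k log |D| factor in query time.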
We describe the data structure in a generic way for the retrieval of strings. Adapting the notion of strings to rules (and their attributes) in our setting is straightforward.

Algorithm 3: Approx-GSCS-Query(T, DS, f)
Input:
Data structure DS from Approx-GSCS-Prep , query
T ∈ {0, 1}^{|I|}
Output:
Topmost rule according to the ordering function f
begin
  Set
S ← φ
  S ←
Simple-LSH-Query(T, DS)
  if S = φ then return a pre-defined default rule
  Set S' ← φ; Set fval ← 0
  forall i ∈ {i : p'_i ∈ S} do
    if f(i) ≥ fval then S' ← {p_i → q_i}; fval ← f(i)
  return S'

Algorithm 4: Simple-LSH-Query(u, DS)
Input:
Query u ∈ R^d, data structure DS
Output:
Vector(s) with high inner products with the query u
begin
  (G, H) ← DS
  forall m = 1, ..., M do
    Set
S ← φ
    forall 1 ≤ l ≤ L_m do
      index ← φ
      forall 1 ≤ k ≤ K_m do
        if G[m, l, k]^T [u; 0] ≥ 0 then index.append(1) else index.append(0)
      S.add(H[m, l, index])
      if |S| > L_m then break
    if S ≠ φ then return S
  return S

Pre-processing Stage
Consider a database D of strings, each of maximum length M (i.e., each string is a sequence of symbols from some ground set Σ). We generate a data structure H that stores these strings using Exact-Fetch-Prep (Algorithm 5). The structure is an adaptation of [31], but unlike [31] it is symmetric and has two levels. By symmetric we mean that a fixed hash function (h_r) is chosen for hashing all elements at the first level, and another hash function (h_s) is chosen for hashing all elements at the second level. Such a choice helps with the complexity of the oblivious transfer (OT) protocol in Section 5, where we will essentially use the same data structure. That is, using the same data structure, the client can compute encrypted indices on its end with the knowledge of (h_r, h_s) and can use OT to retrieve objects; efficient implementations for this already exist in practice. The hash functions map elements from [|D|] (for a number N, the notation [N] represents the set {1, ..., N}) to a range of size L = 16 · |D|, and are chosen randomly from a 2-universal hash function family [32] H = {h : [U] → [L]}, where U is a large positive integer (note that these are not locality-sensitive). The choice of two-level hashing is inspired by the treatment in [31], where the authors show that two-level hashing can simultaneously achieve linear storage complexity and constant worst-case retrieval time, whereas single-level hashing typically trades off storage space against retrieval time; in addition, to avoid collisions, the range of a single hash would have to be very large. The two-level structure, on the other hand, does not require the first hash function to be collision-free, and this helps with the storage vs. retrieval tradeoff. Additionally, we first choose two large integers r and l, a string r' of length l, and the MD5 hash function [33] (denoted as C_r : Σ^{M+l} → [2^r]) to transform the database strings.
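A minimal sketch of this two-level construction and the corresponding lookup, assuming a Σ_i b_i² ≤ 4|D| acceptance condition for the first level and using MD5 both to derive integer keys and as the verification hash C_r (all names and constants here are illustrative):

```python
import hashlib
import random

random.seed(0)
PRIME = (1 << 89) - 1  # Mersenne prime exceeding the 64-bit keys below

def make_h(L):
    """Draw h(x) = ((a*x + b) mod PRIME) mod L from a 2-universal family."""
    a, b = random.randrange(1, PRIME), random.randrange(PRIME)
    return lambda x: ((a * x + b) % PRIME) % L

def key_int(s):
    """Map a string to a 64-bit integer key via its MD5 digest."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def exact_fetch_prep(db, r_prime="pad"):
    """FKS-style two-level table: one h_r for level one, one h_s for level two."""
    L = 16 * len(db)
    keys = [key_int(r_prime + x) for x in db]
    while True:  # level one: resample h_r until the buckets are small overall
        h_r = make_h(L)
        buckets = {}
        for kk in keys:
            buckets.setdefault(h_r(kk), []).append(kk)
        if sum(len(b) ** 2 for b in buckets.values()) <= 4 * len(db):
            break
    while True:  # level two: resample h_s until collision-free in every bucket
        h_s = make_h(L)
        if all(len({h_s(kk) for kk in b}) == len(b) for b in buckets.values()):
            break
    # store the C_r-hash (here MD5) of r' ∘ x at index L*h_r(x) + h_s(x)
    H = {L * h_r(kk) + h_s(kk): hashlib.md5((r_prime + x).encode()).hexdigest()
         for kk, x in zip(keys, db)}
    return H, h_r, h_s, L

def exact_fetch_query(s, H, h_r, h_s, L, r_prime="pad"):
    kk = key_int(r_prime + s)
    return H.get(L * h_r(kk) + h_s(kk))
```

Because h_s is injective inside every first-level bucket, two distinct stored strings can never share the combined index L·h_r + h_s, which is what makes the retrieved C_r-hash a reliable membership witness.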
Once H is created on the server, the server publicly declares the hash functions h_r and h_s as well as the constant string r' it generated. Details of the construction of H using Exact-Fetch-Prep are shown in Algorithm 5.
Algorithm 5: Exact-Fetch-Prep( D ) : Creating two-level data structure H Input:
Database D Output:
Hash table with two-level hashes H, hashing functions h_r, h_s and string r'
begin
  Choose: (a) large positive integers r and l, (b) an arbitrary string r' of length l, and (c) a collision resistant cryptographic hash function C_r : Σ^{l+M} → [2^r].
  D_e ← φ
  forall x' ∈ D do
    x ← r' ◦ x' (◦ denotes the concatenation operator)
    D_e ← D_e ∪ {x}
  repeat
    h_r ~ Uniform(H)
    forall i = 1, ..., L do B_i ← {x ∈ D_e : h_r(x) = i}; b_i ← |B_i|
  until Σ_{i=1}^{L} b_i² ≤ 4|D|
  repeat
    h_s ~ Uniform(H)
  until for all 1 ≤ i ≤ L and all distinct x, y ∈ B_i, h_s(x) ≠ h_s(y)
  Initialize an array H of size L² + L.
  forall x ∈ D_e do H[L · h_r(x) + h_s(x)] ← {C_r(x) and other data associated with x}
  return (H, h_s, h_r, r')

We now show two properties that
Exact-Fetch-Prep satisfies. First, it identifies two hash functions h_r, h_s that lead to no collisions with high probability. Second, it constructs H in expected polynomial time and uses O(|D|) storage. In particular, building on the analysis in [31], the probability for random hash functions h_r and h_s to be successful (i.e., to have no collisions) in the first and second stages of Exact-Fetch-Prep can be bounded as follows.
Lemma 4. (1) For a randomly chosen h_r ∈ H, Pr[Σ_{i=1}^{L} b_i² ≤ 4|D|] ≥ 1/2, where b_i is the size of hash bucket i. (2) The pair of functions h_r, h_s ∈ H succeeds with no collisions with probability ≥ 3/8.

Proof. The proof of Part (1) is similar to the analysis in [31]. For Part (2), define the random variable X_i = |{(x, y) | x ≠ y, x, y ∈ {p_j : 1 ≤ j ≤ |D|}, h_r(x) = h_r(y) = i, h_s(x) = h_s(y)}|, and let X = Σ_{i=1}^{16|D|} X_i. If X >
0, then the two-level hashing of Algorithm 5 fails. If we show that
Pr[X ≤ 1/2] ≥ 1 − 1/4 = 3/4, then it would imply that X = 0 (as X is a natural number) with probability at least 3/4, and the two-level hashing succeeds with high probability. For a randomly chosen hash function h_s ∈ H, we have that for any x ≠ y, Pr[h_s(x) = h_s(y)] = 1/(16|D|). We estimate the expected value of the random variable X: E[X_i] ≤ (|B_i| choose 2) · 1/(16|D|) = |B_i|(|B_i| − 1)/(32|D|), where B_i is the hash bucket corresponding to index i, with size b_i. Thus, the expected number of colliding pairs summed over all buckets is E[X] = Σ_i E[X_i] ≤ Σ_{i=1}^{L} b_i²/(32|D|). Since Σ_{i=1}^{L} b_i² ≤ 4|D|, we have E[X] ≤ 1/8, and by Markov's inequality Pr[X ≥ 1/2] ≤ E[X]/(1/2) ≤ 1/4, i.e., Pr[X ≤ 1/2] ≥ 3/4.

Lemma 5.
The data structure H can be constructed by Exact-Fetch-Prep in expected polynomial time.
Query Stage
Given H on the server side, we do a linear scan on the server side at query time to retrieve the consequents of applicable rules. A basic building block used in the linear scan is the retrieval of a single element from H; we discuss this first. To query whether a string str is present in H, one computes the quantities x = r' ◦ str (◦ denotes the concatenation operator) and the index i = L · h_r(x) + h_s(x). We can then fetch the indexed element H[i], including one of its attributes H[i].C_r(x') (here x' corresponds to the string present at location i). This attribute can be used to verify whether str was indeed present in the database. If a client sends a query for the presence of an element str to a server that only holds H, then the client can have limited privacy. In particular, the server does not know what str is, although it knows the index i and the element that was returned (which may not contain str itself). Since the hashes are not invertible, this affords partial privacy, as the client does not reveal its string str. Exact-Fetch-Query (Algorithm 6) implements this query process.
Algorithm 6: Exact-Fetch-Query( str ) : Query data structure H Input:
Query string str , and data structure H from Algorithm 5 Output:
Value in H corresponding to the query
begin
  Client C computes x = r' ◦ str, h_r(x) and h_s(x).
  C computes i = L · h_r(x) + h_s(x).
  C queries server S for the entry at index i in H.
  Server returns H[i].

The algorithm description for querying the server under the
TOP-Assoc(k, w, t, f) criterion is provided in Algorithm 7 (the algorithms for TOP-1-Assoc, TOP-K-Assoc, ALL-Assoc and
ANY-Assoc are similar, hence omitted).

Algorithm 7: Exact-TOP-Assoc-Query(T, H): Query with the TOP-Assoc criterion
Input:
Transaction T, data structure H from Algorithm 5, threshold weight w ∈ Z+, antecedent length parameter t ∈ Z+, output size parameter k ∈ Z+, and ordering function f.
Output:
Set of k consequents of applicable association rules, L
begin
  Initialize L_all ← φ
  in parallel, forall p_i ⊆ T with |p_i| ≤ t and w_i ≥ w do
    q ← Exact-Fetch-Query(p_i) (Algorithm 6, ignoring the client/server distinction)
    if q = φ then continue else L_all.add(q)
  L ← first k elements from sort(L_all, f) (break ties arbitrarily if needed)
  return L

We address three related privacy-preserving tasks in sequence. First, we discuss how the
oblivious transfer protocol is a solution to the privacy-preserving database query problem. Then we present how the consequents of all applicable association rules for a given criterion can be fetched and collated in a privacy preserving manner. Finally, we discuss how the protocol can be extended to the setting where we use the approximate algorithms from Section 3 for fetching.
Private Protocol for Database Lookup
Consider the following two-party task: a client C has an index i, and a server S has a database D represented as a vector v[1 : |D|] of |D| elements. The client's goal is to fetch the i-th element v[i] such that: (a) the client learns nothing more than the element it fetched from the server, and (b) the server learns nothing about the client's query. Specifically, this leads to the following definition of oblivious transfer (OT):

Definition 3.
An oblivious transfer (OT) protocol is one in which C retrieves the i-th element from S holding the elements [1, ..., n] iff the following conditions hold:
(1) The ensembles View_S(S(v), C(i)) and View_S(S(v), C(j)) are computationally indistinguishable for all pairs (i, j), where the random variable View_S refers to the transcript of the server created by the execution of the protocol.
(2) There is a (probabilistic polynomial time) simulator Sim such that, for any query element c, the ensembles Sim(c, v[c]) and View_C(S(v), C(c)) are computationally indistinguishable.

We use the notation OT[C : i, S : [1, ..., |D|]] to represent the above protocol in Definition 3. Without going into the details of the OT implementation, we make the design choice to use a fast and parallel implementation described in [34]. This scheme is based on length-preserving additively homomorphic encryption, described next. Homomorphic encryption of a message m with public key pk is denoted as c = E_pk(m). Decryption with private key sk is denoted as m = D_sk(c). Any operation over the ciphertext is also reflected in the decrypted plaintext. For instance, let c1 and c2 be two ciphertexts such that c1 = E_pk(m1) and c2 = E_pk(m2), and let + represent a binary operation. Then c1 + c2 = E_pk(m1 + m2). Furthermore, the scheme is length preserving, so that an l-bit input is mapped to an output of size l + N, where N is a constant. We now discuss how to answer the question of whether a string str is in D in a privacy-preserving manner. Recall the data structure H output by Exact-Fetch-Prep (see Section 4), from which C fetches an element of the database stored with S. We can readily get a stronger privacy-preserving protocol for the same by employing the OT protocol from Definition 3.
That is, by a single execution of OT between client C and server S, C can privately fetch the C_r-hash of the string r' ◦ str, stored in the record H[L · h_r(r' ◦ str) + h_s(r' ◦ str)].C_r(). Note that for two strings str, str', it may be that h_r(r' ◦ str) = h_r(r' ◦ str') and h_s(r' ◦ str) = h_s(r' ◦ str'), yet the corresponding C_r-hashes of str and str' may not be equal. Thus, the choice of using a C_r-hash leads to the following guarantee on the OT based protocol.

Lemma 6.
There exists a two party protocol
Private-Exact-Fetch-Query such that: (1) C learns whether str ∈ D with high probability, given the description of the associated hash functions (h_r, h_s), and (2) the computationally bounded S learns nothing.

Private Protocol based on Exact Algorithms
A client computes an ordered list of recommended items from the set of consequents of all applicable association rules, chosen according to some selection criterion. We break down the process of making this recommendation private into the following subtasks.
(1) Expunge infrequent items and anonymize the item list:
First, note that only a few items from the client's transaction may be frequent and belong to any rule. So it is important for the client to remove all infrequent items from its transaction before further processing. The task (denoted
Preprocess, see Algorithm 8) is to remove infrequent items and anonymize the input transaction of the client. We assume that the initial list of items is given identifiers from the range [|I|], which are publicly available (hence available to the client).
(2) Privately fetch and privately, interactively collate applicable association rules:
We need to select applicable rules according to the given criterion, and, given the consequents of these applicable rules along with their respective weights, we need to privately collate them to produce a list of recommended items using these weights. For this, the client is given a list of identities with associated weights (which are homomorphically encrypted), obtained from the selection of rules. Client C and server S interactively execute a two-party private sorting protocol (denoted Private-Two-Party-Sort, see Algorithm 10) to sort the list of items using their encrypted weights, and the client finally produces an ordered list of recommended items at its end, sorted according to their weights.
(3) De-anonymize and recommend:
Given a final list of k' anonymized item identities, we de-anonymize them to obtain the actual names of the recommended items. For this, the client C fetches their actual identifiers by executing OT (see Definition 3) with the server S (similar to Preprocess above) on the reverse mapping (RT, see Algorithm 8), and obtains the true identifiers of the items to be recommended. The above steps are captured in Protocol 9, which builds on the exact implementation of ALL-Assoc(w, t) from Section 4. For brevity, we discuss the special case where t = |I|. This protocol makes use of the Private-Exact-Fetch-Query(p) algorithm (see above) as a subroutine. After its execution, the client C has fetched all association rules with weights ≥ w in a privacy preserving manner, and from these rules it collates the list of recommended items. The private versions of the exact implementations of TOP-Assoc(k, w, t), TOP-1-Assoc(f), TOP-K-Assoc(f) and ANY-Assoc(k, w, t) can be designed in a similar manner. We note that the above process ensures the privacy of the client's data with respect to the server, and the privacy of the server's data with respect to the client, by only revealing the relevant consequents of association rules to the client. A much simpler privacy preserving protocol may be devised if only the privacy of the client's data (the transaction T) is to be guaranteed with respect to the server.

Protocol 8: Pre-processing and Anonymization
Input: C holds the transaction set T and S holds the itemset I
Output:
Anonymized item identifiers
begin
  Server preprocessing:
    π ← a random permutation of [1, ..., |I|]
    T ← a table with |I| + 1 entries; store π in T such that T[i] = π(i)
    Map every item I' ∈ I with freq(I') < θ to the symbol inf; set T[inf] to a common dummy entry
    Let RT be the reverse map, i.e., RT ◦ T(i) = i for all frequent items
  C has the transaction T = {i_1, i_2, ..., i_{|T|}} and S has the table T.
  C and S execute OT (see Definition 3), with C seeking the table entries T[i_1], ..., T[i_{|T|}]
  S randomly permutes the outputs of the OT before sending them to the client C.
  C receives the outputs of the OT from S, decrypts all outputs, and discards all inf entries, corresponding to infrequent items.

Protocol 9: Private-Exact-ALL-Assoc-Query(w, t = |I|)
Input:
Client: Transaction T Input:
Server: Threshold weight w , data structure H containing D association rules Output:
Client: set of recommended items I_rec
begin
  C ↔ S:
execute Preprocess to anonymize T
  C ← S: A[1..t], obtained by calling Private-Exact-Fetch-Query (Lemma 6) t = |T| times
  C ↔ S: execute
Private-Two-Party-Sort (Algorithm 10), based on whether the weight of the associated rule is ≥ w
  C ← S: L, the list of association rules sorted according to their weights
  C collates the consequents of the rules in L and calculates I_rec

Details of the Privacy-preserving Sorting Protocol.
We now discuss the details of
Private-Two-Party-Sort mentioned above. It is based on a primitive that makes O(n log n) pairwise comparisons, which are chosen at the pre-processing phase and depend only on the value of n (in our case, n = |D|). Using [35], S produces the identities of the m = c · n log n pairs, knowing the comparisons of which one can execute the oblivious sorting algorithm Private-Two-Party-Sort in one shot. It is detailed in Protocol 10, and takes only two rounds of communication, with a communication complexity of O(n log n) per round. We recap the properties of the primitive used in the protocol below.

Theorem 3 ([35]). There exists a deterministic pair of algorithms (AKS_1, AKS_2) which satisfy the following:
(1) Given input n, AKS_1 produces a list L_n of O(n log n) pairs of indexes (i, j).
(2) Given an input list of integers I_n = a_1, a_2, ..., a_i, ..., a_n and the values of the comparisons of (a_i, a_j) for all (i, j) ∈ L_n, the deterministic algorithm AKS_2 sorts the input list I_n.

Protocol 10: Private-Two-Party-Sort: Private two-party sorting
Input: C has the list of values V (with encrypted weights)
Input: S has the comparison-pair identities
Output: C arranges V in sorted form
begin
  C: let the index pairs of V be (x_1, y_1), ..., (x_m, y_m), obtained using AKS_1
  C: (T'_{x_i}, T'_{y_i}) ← Enc(T_{x_i}, T_{y_i}), the encrypted weights of each pair
  C: (E_{x_i}, E_{y_i}) ← Rand(T'_{x_i}, T'_{y_i}) (Rand is given in Algorithm 11)
  C: P ← {(E_{x_i}, E_{y_i}) : i = 1, ..., m}
  C → S: P
  S: D = (d_1, ..., d_m), where d_i ← Dec(p_i), p_i ∈ P
  S: val ← the comparison outcomes of the pairs in D
  C ← S: val
  C then applies the AKS_2 algorithm to sort V using val.

Algorithm 11:
Rand : Randomization of an encrypted data pair
Input:
Data pair (T_1, T_2)
Output:
Randomized encrypted data pair (E_1, E_2)
begin
  Let T_1, T_2 ∈ Z_t
  (a_1, b_1), (a_2, b_2) ← random elements of Z_t
  (E_1, E_2) ← (a_1 · T_1 + b_1, a_2 · T_2 + b_2), where E_1, E_2 ∈ Z_t and the parameters are chosen such that E_1, E_2 preserve the order of T_1 and T_2

Private Protocol based on Approximate Algorithms
A few modifications are needed to the previous protocol when working with the data structures designed in Section 3. Recall the functioning of
Approx-GSCS-Prep and
Approx-GSCS-Query: the server chooses l random maps (l = Σ_{m=1}^{M} L_m · K_m), where the i-th map func_i maps a set T, represented as a characteristic vector T of length |I|, to a string T_i of k (say k = max_{m=1,...,M} K_m) bits. Thus, each antecedent p_i of our rules (for 1 ≤ i ≤ |D|) is mapped to l strings p_i^1, p_i^2, ..., p_i^l of length k bits each. For an input transaction T, an association rule p → q is selected if and only if any of the l maps satisfies func_i(T) = func_i(p). We then proceed along the lines of the previous private protocol with the following modifications.

Pre-processing the D rules: We create an enhanced database D_e by first choosing l random strings r_1, r_2, ..., r_l ∈ {0, 1}^s, where s is a security parameter (a large positive integer). We then concatenate these random strings to the l maps as follows: r_1 ◦ func_1(p_i), r_2 ◦ func_2(p_i), ..., r_j ◦ func_j(p_i), ..., r_l ◦ func_l(p_i), for each 1 ≤ i ≤ |D|. The new database D_e has l · |D| elements, each of which stores all the relevant information for the ARs. All strings r_j ◦ func_j(p_i), along with the corresponding consequents q_i in D_e, are stored in H as defined by Exact-Fetch-Prep (Algorithm 5).
Pre-processing the query:
The client C obtains the definitions of the l maps func_i, i ∈ {1, ..., l}, along with the random prefixes r_i, which are declared publicly. It then applies the l maps to the characteristic vector T corresponding to its input transaction T, computes func_i(T), and from this prepares r_i ◦ func_i(T) for i ∈ {1, ..., l}.

Privately receiving answers to the query: C queries S for the existence of each string r_i ◦ func_i(T), for i ∈ {1, ..., l}, with S possessing the enhanced data structure, via Private-Exact-Fetch-Query, which is based on OT (Lemma 6).
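The matching of prefixed query strings against the server's stored strings can be sketched as follows (sign-pattern maps stand in for the func_i, and all parameters are illustrative; the real protocol derives the maps from Approx-GSCS-Prep and fetches via OT rather than comparing in the clear):

```python
import random

random.seed(0)

def make_func(dim, kbits):
    """One public map func_j: the k-bit sign pattern of random hyperplanes."""
    A = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(kbits)]
    return lambda t: "".join(
        "1" if sum(a * x for a, x in zip(row, t)) >= 0 else "0" for row in A)

# Public parameters: l maps and l random prefixes r_1, ..., r_l (s = 16 bits here).
dim, l, kbits = 4, 3, 8
funcs = [make_func(dim, kbits) for _ in range(l)]
prefixes = [format(random.getrandbits(16), "016b") for _ in range(l)]

def server_strings(p):
    """Strings r_j ∘ func_j(p) the server stores in H for an antecedent p."""
    return {r + f(p) for r, f in zip(prefixes, funcs)}

def client_strings(T):
    """Strings r_j ∘ func_j(T) whose presence the client checks, one OT each."""
    return [r + f(T) for r, f in zip(prefixes, funcs)]
```

The per-map random prefix r_j keeps signatures produced by different maps from colliding in the shared table H, while leaving both sides able to compute the same strings from the public parameters.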
Remarks on Security Analysis.
The protocols presented here are two-party protocols, executed between a client C and a server S, and assume that both are honest-but-curious. That is, they will execute the given protocols, but may be curious to learn more (about the inputs of the other party, via the transcripts). The security of these protocols can be defined along the standard Ideal/Real paradigm definitions of security, and hinges on the security of the underlying encryption system. In the Ideal/Real paradigm, an ideal functionality is defined which captures the desired input/output functionality for the two parties. First, the parties submit their inputs to a TP (trusted party), and then receive some outputs. The real protocol is said to realize this ideal functionality if there is a PPT (probabilistic polynomial time) simulator that can compute a distribution of views of the parties that is indistinguishable from the distribution of views generated in the real process. If this is the case, then the protocol is claimed to be secure; in other words, a user can assume that the designed protocol provides the security guarantees captured by the ideal functionality. Proofs of security of the presented protocols can be given along these lines. For example, for the retrieval of an element from the data structure, one would need to capture what is learned by the server about the client's input T, and likewise by the client about the server's database of D association rules. This could be the number of consequents fetched, the crypto-hash of some antecedent of a fetched association rule, the size of the client's input, etc.; this collection is defined as the outputs of the respective parties. A simulation based proof can then proceed along standard lines.
Here, we choose to focus on the formulations of the recommendation problem and approaches to solving it efficiently as well as securely, and omit the elaborate proofs of security (which are standard because our parties are honest-but-curious, albeit long and relatively less insightful). Finally, note that in our solutions, for every new client session the server S has to re-organize the database (i.e., use a new set of hash functions in Private-Exact-Fetch-Query). Otherwise, a client may be able to correlate the information received from the server across multiple sessions and gain 'unintended to be shared' information about the association rules. This leads to more computation by the server per client session, and is unavoidable if this type of information leak is to be prevented. There is always a trade-off between the information shared about the server's database with the client (and vice-versa) and the computations done by the client/server. Our solution chooses one end of this spectrum; many other choices are equally valid.
Our empirical evaluations illustrate that although the computational and latency requirements generally increase, privacy properties can still be guaranteed at roughly the same time scale as the non-private counterparts for reasonably sized problem instances. The goal of the experiments is to validate how the recommendation latencies are influenced as a function of the number of rules, the selection criteria and various other problem parameters. For instance, we know that OT does introduce measurable latency between the client and the server. But as we show below, for moderately sized datasets (10000 rules, for instance), the communication overhead is quite manageable (well within a few seconds), which may be appealing for near-real-time advertising (i.e., where time scales of the order of a few seconds are acceptable).

Parameter | Symbol | Default value | Range
RSA modulus size | N | – | {1024, 2048}
Number of rules | |D| | – | –
TOP-Assoc output size | k | – | –
TOP-Assoc length | t | – | –
Size of query transaction | |T| | – | –

Table 1: Parameters explored in the experiments.

|T| | t | Time (|D| = 100K) | Time (|D| = 500K) | Time (|D| = 1M)
5 | 3 | 58 | 229 | 744
5 | 5 | 69 | 273 | 931
10 | 3 | 278 | 436 | 1279
10 | 5 | 232 | 525 | 1413
15 | 3 | 212 | 678 | 1911
15 | 5 | 213 | 751 | 2009
20 | 3 | 263 | 999 | 2615
20 | 5 | 229 | 987 | 2957

Table 2: Execution times (in milliseconds) incurred by the Exact-ALL-Assoc(w, t) algorithm for fetching applicable rules. Symbols K and M denote 10^3 and 10^6, respectively.

Experimental setup and evaluation metrics.
Experiments were conducted on a laptop using Java version 8 update 60 with an allocated heap space of 8 gigabytes. We explored the parameters listed in Table 1 over the corresponding ranges to evaluate our algorithms and their private versions. The RSA modulus size N is the key size used in the underlying crypto-system. Increasing N causes a significant reduction in performance while increasing security. To assess this trade-off, we ran experiments using both RSA-1024 and RSA-2048. All experiments were executed 1000 times to compute amortized execution times.

Evaluating the Exact and Approximate Algorithms
First, an exact implementation of the ALL-Assoc(w, t) criterion is evaluated. This criterion was picked because the size of the list of rules output by the exact algorithm is larger than the outputs of the other criteria. Further, the computational burden imposed by parameter k for any k < |D| is negligible in terms of the total processing time. We generated synthetic datasets and evaluated our implementation, whose median processing times are listed in Table 2. As can be inferred, even when the number of association rules is very large (for instance, see the entry corresponding to |D| equal to 1 million), our implementation is observed to be very efficient.
Second, we evaluated Approx-GSCS-Query(T, DS, f) on two real-world transaction datasets: (a) Retail [36], and (b) Accidents [37]. The Retail dataset consists of market basket data collected from an anonymous Belgian retail store for approximately 5 months during the period 1999-2000. The number of transactions is 88163 and the number of items is 16470. We use SPMF's [38] implementation of the FP-Growth algorithm (setting minimum support and minimum confidence values to 0.001 and 0.01, respectively) to get 16147 association rules. The Accidents dataset consists of traffic accidents during the period 1991-2000 in Flanders, Belgium. The number of transactions in this dataset is roughly 340 thousand, and the attributes describe the circumstances in which the accidents occurred. The total number of attributes (corresponding to I) is 572. Again, we use SPMF's [38] implementation of the FP-Growth algorithm, with suitably small minimum support and minimum confidence thresholds, to obtain the association rules.

Dataset | |D| | T_q (ms) | |D_o| | T_o (s) | A_10 | A_16 | A_32
Retail | – | – | – | – | – | – | –
Accidents | 334K | 1.69 | 10K | 27 | – | – | –

Table 3: Performance of Approx-GSCS-Query.

Recommendations based on the ALL-Assoc criterion had acceptable validation-set precision (fraction of items that were correct among the recommended items). We used the confidence of the rules as their weights. We did not cross-validate to get the best parameter choices (the weight threshold w or the antecedent length threshold t for our criterion here) as this was not our primary goal. Instead, we show below how the latency overhead due to the private protocol varies as a function of our system parameters. In particular, column T_q of Table 3 lists the median processing times for a collection of predefined query transactions on these two datasets (the thresholding parameter for antecedent length, t, was varied between 1 and 5) in the non-private setting. Contrast this with column T_o, which lists the median processing times with privacy on a sub-sampled set of rules, whose size is shown in column |D_o|. The reduced size of the set of rules considered for private fetching is needed to ensure that the latencies are manageable. The sub-sampling was based on the confidence weights of the ARs (rules with higher weights were picked). As can be inferred, these processing times are comparable to the numbers in Table 2 in an absolute sense (seconds vs. milliseconds). Columns A_10, A_16 and A_32 provide accuracies of Approx-GSCS-Query with hash lengths (k = max_{m=1,...,M} K_m, see Section 3) set to 10, 16 and 32 bits, respectively, averaged over 1000 queries of length 3. When the hash length is only 10 bits, Approx-GSCS-Query suffers from low accuracy as irrelevant association rules fall into the same buckets. We omit more extensive results on the approximation quality for brevity (see [13, 12] for extensive performance profiling of similar approximation schemes).
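The accuracy degradation at short hash lengths can be illustrated with a small standalone sketch (hypothetical parameters, not the exact hashing scheme of Section 3): antecedents are mapped to k-bit buckets by truncating a cryptographic hash, and with k = 10 there are only 1024 buckets for roughly 16K antecedents, so most of them necessarily collide, while at k = 32 collisions essentially vanish.

```python
import hashlib
import random

def bucket(antecedent, bits):
    """Map an antecedent (a frozenset of item ids) to a `bits`-bit bucket id
    by truncating a SHA-256 digest."""
    key = ",".join(map(str, sorted(antecedent))).encode()
    digest = int.from_bytes(hashlib.sha256(key).digest(), "big")
    return digest % (1 << bits)

random.seed(0)
# Roughly 16K distinct antecedents of length 3 over 1000 items.
antecedents = {frozenset(random.sample(range(1000), 3)) for _ in range(16000)}

collisions = {}
for bits in (10, 16, 32):
    n_buckets = len({bucket(a, bits) for a in antecedents})
    collisions[bits] = len(antecedents) - n_buckets
    print(f"{bits}-bit hash: {collisions[bits]} antecedents share a bucket")
```

With 10-bit buckets, nearly all antecedents collide with some other antecedent, which is exactly why irrelevant rules get fetched; at 32 bits, collisions are essentially absent.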
Evaluating Timing Overheads due to Privacy
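Before presenting the numbers, it is useful to recall the 1-out-of-n oblivious transfer primitive that dominates these overheads. A toy sketch in the style of the classic Even-Goldreich-Lempel construction follows (deliberately tiny RSA modulus and no padding, purely illustrative; our actual implementation follows [39]): the client learns only the message at its chosen index, while the server learns nothing about that index.

```python
from random import randrange

# Server's toy RSA key pair over N = 61 * 53 = 3233 (illustrative only;
# real deployments use 1024/2048-bit moduli, as in our experiments).
N, e, d = 3233, 17, 2753  # d = e^{-1} mod phi(N)

def server_round1(n):
    # Server sends one random value per database slot.
    return [randrange(N) for _ in range(n)]

def client_round(xs, choice):
    # Client blinds the value at its chosen slot with k^e mod N.
    k = randrange(2, N)
    v = (xs[choice] + pow(k, e, N)) % N
    return k, v

def server_round2(msgs, xs, v):
    # Server masks every message; only the mask at the chosen slot equals k,
    # the other masks are unpredictable to the client.
    return [(m + pow((v - x) % N, d, N)) % N for m, x in zip(msgs, xs)]

def client_recover(masked, choice, k):
    return (masked[choice] - k) % N

msgs = [42, 99, 7, 1001]          # server's "rules" (each < N in this toy)
xs = server_round1(len(msgs))
k, v = client_round(xs, choice=2)
masked = server_round2(msgs, xs, v)
recovered = client_recover(masked, 2, k)  # equals msgs[2] = 7
```

The server performs one modular exponentiation per slot, which is why the overhead in Table 4 grows with |D| and with the modulus size N.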
Table 4 documents the timing overhead introduced by a single 1-out-of-n oblivious transfer, which is used to make the exact implementation of ALL-Assoc privacy preserving. The number of rules (|D|) was varied from 1 thousand to 10 thousand and the RSA modulus was varied between 1024 and 2048. Our implementation of the oblivious transfer protocol is single-threaded and is based on [39] (a multi-threaded, faster implementation can be found in [34], which can be used as a plug-in module to improve the overhead time by an order of magnitude or more). From the table we can infer that for moderately sized databases, private fetching of applicable rules is competitive and practical. For instance, to fetch applicable rules from a database with 10^4 rules, the median time taken is ~40 seconds for a query of size 5 (RSA modulus set to 1024). Practicality is further supported by the fact that in the client-server settings applicable to many cloud-based applications on the world wide web, multiple servers will be handling multiple query requests.

[Table 4: Overhead times (in seconds) of the private protocol (see Section 5) that embeds the exact implementation of ALL-Assoc(w, t); the RSA modulus N was varied over {1024, 2048}, |D| from 1K to 10K, and |T| from 5 to 20.]

As discussed briefly earlier, the timing overheads incurred by the privacy preserving counterpart of Approx-GSCS-Query for the real datasets are shown in Table 3 (column T_o). We choose the RSA modulus value to be 1024 here. Although the processing times are now multiple orders of magnitude larger than the vanilla processing times (column T_q), they are still practical and manageable for an e-commerce setting (again, due to the fact that in practice multiple servers service multiple queries). These times are also comparable to those of similar-sized datasets benchmarked in Table 4. Thus, our solutions and their private versions are very competitive in fetching applicable association rules and making item recommendations.

Our work proposes a rich set of methods for the selection and application of association rules for recommendations that have a strong theoretical basis as well as pragmatic grounding. The ability to reuse association rules, which are frequently used in industry, to bootstrap a scalable and privacy-aware recommendation system makes our solution very attractive to practitioners. Our experiments further highlight the practicality of achieving privacy-preserving recommendations for moderate to large-scale e-commerce applications.
References

[1] A. Korolova, "Privacy violations using microtargeted ads: A case study," in IEEE International Conference on Data Mining Workshops, 2010, pp. 474-482.
[2] J. R. Mayer and J. C. Mitchell, "Third-party web tracking: Policy and technology," in IEEE Symposium on Security and Privacy, 2012, pp. 413-427.
[3] C. Duhigg, "How companies learn your secrets," Feb. 2012.
[4] G. Lubin, "The incredible story of how Target exposed a teen girl's pregnancy," Feb. 2012.
[5] E. Steel and G. Fowler, "Facebook in privacy breach," The Wall Street Journal, vol. 18, pp. 21-22, 2010.
[6] G. J. Udo, "Privacy and security concerns as major barriers for e-commerce: a survey study," Information Management & Computer Security, vol. 9, no. 4, pp. 165-174, 2001.
[7] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases, 1994, pp. 487-499.
[8] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, "Analysis of recommendation algorithms for e-commerce," in Proceedings of the 2nd ACM Conference on Electronic Commerce. ACM, 2000, pp. 158-167.
[9] ——, "Item-based collaborative filtering recommendation algorithms," in Proceedings of the 10th International Conference on World Wide Web. ACM, 2001, pp. 285-295.
[10] A. Shrivastava and P. Li, "Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS)," in Advances in Neural Information Processing Systems, 2014, pp. 2321-2329.
[11] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, "Locality-sensitive hashing scheme based on p-stable distributions," in Proceedings of the 20th Annual Symposium on Computational Geometry. ACM, 2004, pp. 253-262.
[12] B. Neyshabur and N. Srebro, "On symmetric and asymmetric LSHs for inner product search," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 1926-1934.
[13] A. Shrivastava and P. Li, "Asymmetric Minwise hashing for indexing binary inner products and set containment," in Proceedings of the 24th International Conference on World Wide Web, 2015, pp. 981-991.
[14] D. Li, Q. Lv, H. Xia, L. Shang, T. Lu, and N. Gu, "Pistis: a privacy-preserving content recommender system for online social communities," in Proceedings of the 2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, 2011, pp. 79-86.
[15] F. McSherry and I. Mironov, "Differentially private recommender systems: building privacy into the Netflix Prize contenders," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 627-636.
[16] S. Zhang, J. Ford, and F. Makedon, "A privacy-preserving collaborative filtering scheme with two-way communication," in Proceedings of the 7th ACM Conference on Electronic Commerce, 2006, pp. 316-323.
[17] D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the 20th ACM Symposium on Principles of Database Systems, 2001, pp. 247-255.
[18] S. Berkovsky, Y. Eytani, T. Kuflik, and F. Ricci, "Enhancing privacy and preserving accuracy of a distributed collaborative filtering," in Proceedings of the 2007 ACM Conference on Recommender Systems, 2007, pp. 9-16.
[19] H. Polat and W. Du, "SVD-based collaborative filtering with privacy," in Proceedings of the ACM Symposium on Applied Computing, 2005, pp. 791-795.
[20] J. Canny, "Collaborative filtering with privacy via factor analysis," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 238-245.
[21] E. Aïmeur, G. Brassard, J. M. Fernandez, and F. S. M. Onana, "Alambic: a privacy-preserving recommender system for electronic commerce," International Journal of Information Security, vol. 7, no. 5, pp. 307-334, 2008.
[22] J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, pp. 639-644.
[23] S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases, 2002, pp. 682-693.
[24] C. Hazay and Y. Lindell, Efficient Secure Two-Party Protocols: Techniques and Constructions. Springer Science & Business Media, 2010.
[25] B. Pinkas, T. Schneider, N. P. Smart, and S. C. Williams, "Secure two-party computation is practical," in International Conference on the Theory and Application of Cryptology and Information Security. Springer, 2009, pp. 250-267.
[26] O. Goldreich, S. Micali, and A. Wigderson, "How to play any mental game," in Proceedings of the 19th Annual ACM Symposium on Theory of Computing. ACM, 1987, pp. 218-229.
[27] A. C.-C. Yao, "How to generate and exchange secrets," in Proceedings of the 27th Annual Symposium on Foundations of Computer Science. IEEE, 1986, pp. 162-167.
[28] K. V. Jónsson, G. Kreitz, and M. Uddin, "Secure multi-party sorting and applications," IACR Cryptology ePrint Archive, p. 122, 2011.
[29] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Communications of the ACM, vol. 51, no. 1, p. 117, 2008.
[30] S. Har-Peled, P. Indyk, and R. Motwani, "Approximate nearest neighbor: Towards removing the curse of dimensionality," Theory of Computing, vol. 8, no. 1, pp. 321-350, 2012.
[31] M. L. Fredman, J. Komlós, and E. Szemerédi, "Storing a sparse table with O(1) worst case access time," Journal of the ACM, vol. 31, no. 3, pp. 538-544, 1984.
[32] J. L. Carter and M. N. Wegman, "Universal classes of hash functions," in Proceedings of the 9th Annual ACM Symposium on Theory of Computing, 1977, pp. 106-112.
[33] R. L. Rivest, "RFC 1321: The MD5 message-digest algorithm," Internet Engineering Task Force, vol. 143, 1992.
[34] E. Unal and E. Savas, "On acceleration and scalability of number theoretic private information retrieval," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-1, 2015.
[35] M. Ajtai, J. Komlós, and E. Szemerédi, "An O(n log n) sorting network," in Proceedings of the 15th Annual ACM Symposium on Theory of Computing, 1983, pp. 1-9.
[36] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets, "Using association rules for product assortment decisions: A case study," in Knowledge Discovery and Data Mining, 1999, pp. 254-260.
[37] K. Geurts, G. Wets, T. Brijs, and K. Vanhoof, "Profiling high frequency accident locations using association rules," in Proceedings of the 82nd Annual Meeting of the Transportation Research Board, 2003.
[38] P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu, and V. S. Tseng, "SPMF: a Java open-source pattern mining library," Journal of Machine Learning Research, vol. 15, pp. 3389-3393, 2014.
[39] H. Lipmaa, "An oblivious transfer protocol with log-squared communication," in International Conference on Information Security (ISC). Springer, 2005.