On Sharing Private Data with Multiple Non-Colluding Adversaries
Theodoros Rekatsinas, University of Maryland, [email protected]
Amol Deshpande, University of Maryland, [email protected]
Ashwin Machanavajjhala, Duke University, [email protected]
ABSTRACT
We present SPARSI, a novel theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing markets, where detailed and fine-grained user data must be disclosed. We consider a new data sharing paradigm and introduce the problem of privacy-aware data partitioning, where a sensitive dataset must be partitioned among k untrusted parties (adversaries). The goal is to maximize the utility derived by partitioning and distributing the dataset, while minimizing the total amount of sensitive information disclosed. The data should be distributed so that an adversary, without colluding with other adversaries, cannot draw additional inferences about the private information by linking together multiple pieces of information released to her. The assumption of no collusion is both reasonable and necessary in the above application domains that require release of private user information. SPARSI enables us to formally define privacy-aware data partitioning using the notion of sensitive properties for modeling private information and a hypergraph representation for describing the interdependencies between data entries and private information. We show that solving privacy-aware partitioning is, in general, NP-hard, but for specific information disclosure functions, good approximate solutions can be found using relaxation techniques. Finally, we present a local search algorithm applicable to generic information disclosure functions. We apply SPARSI together with the proposed algorithms on data from a real advertising scenario and show that we can partition data with no disclosure to any single advertiser.
1. INTRODUCTION
The landscape of online services has changed significantly in recent years. More and more sensitive information is released on the Web and is processed by online services. A prominent example is online social networks, which people rely on to communicate and share information with each other. This leads to a diverse collection of voluntarily published user data. Online services such as Web search, news portals, recommendation and e-commerce systems collect and store this data in their effort to provide high-quality personalized experiences to a heterogeneous user base. Naturally, this leads to increased concerns related to an individual's privacy and the possibility of private personal information being aggregated by untrusted third parties such as advertisers.

A different application domain that is increasingly popular is crowdsourcing markets. Tasks, typically decomposed into micro-tasks, are submitted by users to a crowdsourcing market and are fulfilled by a collection of workers. The user needs to provide each worker with the necessary data to accomplish each micro-task. However, this data may contain information that is sensitive, and care must be taken not to disclose any more sensitive information than minimally needed to accomplish the task. Consider, for example, the task of labeling a dataset that contains information about the locations of different individuals, to be used as input to a machine learning algorithm. Since the cost of hand-labeling the dataset is high, submitting this task to a crowdsourcing market provides an inexpensive alternative. However, the dataset might contain sensitive information about the trajectories the individuals follow as well as the structure of the social network they form. Hence, we must perform a clever partitioning of the dataset to the different untrusted workers in order to avoid disclosing sensitive information.
Observe that under this paradigm, the sensitive information contained in the dataset is not necessarily associated with a particular data entry. Similarly, with the rise of cloud computing, increasing volumes of private data are being stored and processed on untrusted servers in the cloud. Even if the data is stored and processed in an encrypted form, an adversary may be able to infer some of the private information by aggregating, over a period of time, the information that is available to it (e.g., password hashes of users, workload information). This has led security researchers to recommend splitting data and workloads across systems or organizations to remove such points of compromise.

In all applications presented above, a party, called the publisher, is required to distribute a collection of data (e.g., user information) to many different third parties. The utility in sharing data results either from the improved quality of personalized services or from the cost reduction in fulfilling a decomposable task. The sensitive information is often not limited to the identity of a particular entity in the dataset (e.g., a user of a social network based service), but rather arises from the combination of a set of data items. It is these sets we would like to partition across different adversaries. We next use two real-world examples to illustrate this.

EXAMPLE 1. Consider a location based social network, such as Gowalla (http://en.wikipedia.org/wiki/Gowalla) or Brightkite (http://en.wikipedia.org/wiki/Brightkite), where users check in at different places they visit. The available data contains information about the locations of the users at different time instances and the structure of the social network connecting the users. User location data is of particular interest to advertisers, as analyzing it can provide them with a rather detailed profile of the habits of the user. Using such data allows advertisers to devise highly efficient personalized marketing strategies.
Hence, they are willing to pay large amounts of money to the data publisher for user information. However, analyzing the locations of multiple users collectively can reveal information about the friendship links between users, thus revealing the structure of the social network [2]. Disclosing the structure of the social network might not be desirable for the online social network provider, as it can be used for viral marketing purposes, which may drive users away from using the social network. It is easy to see that a natural tradeoff exists between publishing user data and receiving high monetary utility, versus keeping this data private to ensure the popularity of the social network.

This example shows how an adversary may infer some sensitive information that is not explicitly mentioned in the dataset but is related to the provided data and can be inferred only when particular entries of the dataset are collectively analyzed. Not revealing all those entries together to an adversary prevents disclosure of the sensitive information. We further exemplify this setup using a crowdsourcing application.

EXAMPLE 2. Consider a data publisher with a collection of medical prescriptions to be transcribed. Each prescription contains sensitive information, such as the disease of the patient, the prescribed medication, the identity of the patient, and the identity of the doctor. Furthermore, the publisher would like to minimize the total cost of the transcription. Thus, she considers using a crowdsourcing solution where she partitions the task into micro-tasks to be submitted to multiple workers. It is obvious that if all fields in the prescription are revealed to the same worker, highly sensitive information is disclosed. However, if the dataset is partitioned in such a way that different workers are responsible for transcribing different fields of one prescription, no information is disclosed, as patients cannot be linked with a particular disease or a particular doctor. In this case, the utility of the publisher stems from fulfilling the task at a reduced cost.
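To make Example 2 concrete, the following sketch partitions the fields of each prescription round-robin across workers, so that no single worker sees both the patient identity and the disease of the same record. All field names, record ids, and worker ids are illustrative, not taken from the paper.

```python
def partition_fields(prescriptions, workers):
    """Assign each field of each prescription to a different worker, so
    that no single worker can link together fields of the same record."""
    tasks = {w: [] for w in workers}
    for rec_id, fields in prescriptions.items():
        # Iterate fields in a fixed order and deal them out round-robin.
        for i, (name, value) in enumerate(sorted(fields.items())):
            w = workers[i % len(workers)]
            tasks[w].append((rec_id, name, value))
    return tasks

# Toy prescription with four sensitive fields, split across four workers.
rx = {"rx1": {"patient": "P. Doe", "disease": "flu",
              "medication": "oseltamivir", "doctor": "Dr. A"}}
tasks = partition_fields(rx, ["w1", "w2", "w3", "w4"])
```

This simple scheme assumes at least as many workers as fields per record; with fewer workers, one worker could still link two fields, which is exactly the kind of disclosure the partitioning problem studied below is designed to control.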
Despite being simplistic, the second example illustrates how distributing a dataset can allow one to use it for a particular task, while minimizing the disclosure of sensitive information. Motivated by applications such as the ones presented above, we introduce the problem of privacy-aware partitioning of a dataset, where our goal is to partition a dataset among k untrusted parties and to maximize either the user's utility, or the third parties' utilities, or a combination of those. Further, we would like to do this while minimizing the total amount of sensitive information disclosed.

Most of the previous work has either considered sharing privacy-preserving summaries of data, where the aggregate information about the population of users is preserved, or has bypassed the use of personal data and its disclosure to multiple advertisers [20, 17, 12]. These approaches focus on worst-case scenarios assuming arbitrary collusion among adversaries. Therefore, all adversaries are combined and treated as a single adversary. However, this strong threat model does not allow publishing of fine-grained information. Other approaches have explicitly focused on online advertising, and have developed specialized systems that limit the disclosure of sensitive user-related information when deployed to a user's Web browser [11, 21]. Finally, Krause et al. have studied how the disclosure of a subset of the attributes of a data entry can allow access to fine-grained information [14]. While they examine the utility and disclosure tradeoff, their proposed framework does not take into account the interdependencies across different data entries and assumes a single adversary (third party).
In this work we propose SPARSI, a new framework that allows us to formally reason about leakage of sensitive information in scenarios such as the ones presented above, namely, setups where we are given a dataset to be partitioned among a set of non-colluding adversaries in order to obtain some utility. We consider a generalized form of utility that captures both the utility that each adversary obtains by receiving part of the data and the user's personal utility derived by fulfilling a task. We elaborate more on this generalization in the next section. This raises a natural tradeoff between maximizing the overall utility and minimizing information disclosure. We provide a formal definition of the privacy-aware data partitioning problem as an optimization of the aforementioned tradeoff.

While non-collusion results in a somewhat weaker threat model, we argue that it is a reasonable and practical assumption in a variety of scenarios, including the ones discussed above. In setups like online advertising and cloud computing there is no particular incentive for adversaries to collude, due to conflicting monetary interests. In crowdsourcing scenarios the probability that adversaries who may collude will be assigned to the same task is minuscule, due to the large number of anonymous available workers. Attempts to collude can often be detected easily, and the possibility of strict penalization (by the crowdsourcing market) provides additional disincentive to collude. Finally, we note that an assumption of no collusion is a necessary and practical one in most of these situations; otherwise there would be no way to accomplish those tasks.

The main contributions of this paper are as follows:

• We introduce the problem of privacy-aware data partitioning across multiple adversaries, and analyze its complexity. To our knowledge this is the first work that addresses the problem of minimizing information leakage when partitioning a dataset across multiple adversaries.
• We introduce SPARSI, a rigorous framework based on the notion of sensitive properties that allows us to formally reason about how information is leaked and the total amount of information disclosure. We represent the interdependencies between data and sensitive properties using a hypergraph, and we show how the problem of privacy-aware partitioning can be cast as an optimization problem that is NP-hard, by reducing it to hypergraph partitioning.

• We analyze the problem for specific families of information disclosure functions, including step and linear functions, and show how good solutions can be derived by using relaxation techniques. Furthermore, we propose a set of algorithms, based on a generic greedy randomized local search algorithm, for obtaining approximate solutions to this problem under generic families of utility and information disclosure functions.

• Finally, we demonstrate how, using SPARSI, one can distribute user-location data, as in Example 1, to multiple advertisers while ensuring that almost no sensitive information about potential user friendship links is revealed. Moreover, we experimentally verify the performance of the proposed algorithms for both synthetic and real-world datasets. We compare the performance of the proposed greedy local search algorithm against approaches tailored to specific disclosure functions, and show that it is capable of producing solutions that are close to the optimal.

For the remainder of the paper we treat adversaries and third parties as the same.
2. SPARSI FRAMEWORK
In this section we describe the different components of SPARSI. More precisely, we show how one can formally reason about the sensitive information contained in a dataset by introducing the notion of sensitive properties. Then, we show how to model the interdependencies between data entries and sensitive properties rigorously, and how to reason about the leakage of sensitive information in a principled manner.
Let D denote the dataset to be partitioned among different adversaries. Moreover, let A denote the set of adversaries. We assume that D is comprised of data entries d_i ∈ D that disclose minimal sensitive information if revealed alone. To clarify this, consider Example 1, where each data entry is the check-in location of a user. The user is sharing this information voluntarily with the social network service in exchange for local advertisement services; hence, this entry is assumed not to disclose sensitive information. In Example 2, the data entries to be published are the fields of the prescriptions. Observe that if the disease field is revealed in isolation, no information is leaked about possible individuals carrying it.

However, revealing several data entries together discloses sensitive information. We define a sensitive property to be a property that is related to a subset of data entries but not explicitly represented in the dataset, and that can be inferred if the data entries are collectively analyzed. Let P denote the set of sensitive properties that are related to data entries in D. To formalize this abstract notion of indirect information disclosure, we assume that each sensitive property p ∈ P is associated with a variable (either numerical or categorical) V_p with true value v*_p. Let D_p ⊂ D be the smallest set of data entries from which an adversary can infer the true value v*_p of V_p with high confidence, if all entries in D_p are revealed to her. We assume that there is a unique such D_p corresponding to each property p. We say that data entries in D_p disclose information about property p ∈ P and that information disclosure can be modeled as a function over D_p (see Section 2.2).

We assume that sensitive properties are specified by an expert, and that the dependencies between data entries in D and properties in P, via the sets D_p for all p ∈ P, are represented as an undirected bipartite graph, called a dependency graph.
Returning to the example applications presented above, we have the following: In Example 1 the sensitive properties correspond to the friendship links between users, and the associated datasets D_p correspond to the check-in information of the pairs of users participating in friendship links. In Example 2, the sensitive properties correspond to the links between a patient's id and a particular disease, or a doctor's id and a particular medication. In general, it has been shown that data mining techniques can be used to determine the dependencies between data items and sensitive properties [16].

Let G_d denote such a dependency graph. G_d has two types of nodes, i.e., nodes P that correspond to sensitive properties and nodes D that correspond to data entries. An edge connects a data entry d ∈ D with a property p ∈ P only if d can potentially disclose information about p. Alternatively, we can use an equivalent hypergraph representation that is easier to reason about in some cases. Converting the dependency graph G_d into an equivalent dependency hypergraph is simply done by mapping each property node into a hyperedge.

Figure 1: An example of a dependency graph between data entries and sensitive properties. Data entries corresponding to the same user are colored using the same color.

We assume that the dimension (i.e., the size of the largest hyperedge) of the dependency hypergraph is bigger than the number of adversaries. An example of a bipartite graph and its equivalent hypergraph is shown in Figure 1. Recall that in this example we do not want to disclose any information about the structure of the social network, i.e., the sensitive properties are the friendship links between individuals. However, if an adversary is given the check-in locations of two individuals, she can infer whether there is a friendship link between them [2]. The dependencies between check-ins and friendship links are captured by the edges in the bipartite graph.
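The bipartite-to-hypergraph conversion just described can be sketched in a few lines. The function name and the toy check-in data below are illustrative, not from the paper: each dependency edge links a data entry to a sensitive property, and each property becomes a hyperedge equal to its set D_p.

```python
def to_hypergraph(dep_edges):
    """Convert bipartite dependency edges (data_entry, property) into
    the equivalent hypergraph: one hyperedge per sensitive property,
    containing exactly the set D_p of entries that can disclose it."""
    hyperedges = {}
    for d, p in dep_edges:
        hyperedges.setdefault(p, set()).add(d)
    return hyperedges

# Toy Example-1-style instance: check-ins are data entries, friendship
# links are the sensitive properties they may jointly disclose.
edges = [
    ("checkin_u1_parkA", "link_u1_u2"), ("checkin_u2_parkA", "link_u1_u2"),
    ("checkin_u2_cafeB", "link_u2_u3"), ("checkin_u3_cafeB", "link_u2_u3"),
]
H = to_hypergraph(edges)
dimension = max(len(e) for e in H.values())  # size of the largest hyperedge
```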
We model the information disclosed to an adversary a using a vector-valued function f_a : P(D) → [0, 1]^{|P|}, which takes as input the subset of data entries published to an adversary and returns a vector of disclosure values, one per sensitive property. That is, f_a(S_a)[i] denotes the information disclosed to adversary a ∈ A about the i-th property when a has access to the subset S_a of data entries. We assume that information disclosure takes values in [0, 1], with 0 indicating no disclosure and 1 indicating full disclosure. Generic disclosure functions, including posterior beliefs and distribution distances, can be naturally represented in SPARSI. The only requirement is that the disclosure function returns a value for each sensitive property.

Based on the disclosure functions of all adversaries we define the overall disclosure function f as an aggregate of all functions in F. Before presenting the formal definition, we define the assignment set, given as input to f.

DEFINITION 1 (ASSIGNMENT SET). Let x_da be an indicator variable set to 1 if data entry d ∈ D is published to adversary a ∈ A. We define the assignment set S to be the set of all variables x_da, i.e., S = {x_11, ..., x_1|A|, ..., x_|D||A|}, and the adversary's assignment set S_a to be the set of indicator variables corresponding to adversary a ∈ A, i.e., S_a = {x_1a, x_2a, ..., x_|D|a}.

Worst Disclosure. The overall disclosure can be expressed as:

f_∞(S) = max_{a ∈ A} ||f_a(S_a)||_∞    (1)

Observe that using the infinity norm accounts for the worst-case disclosure across properties. Thus, full disclosure of at least one sensitive property suffices to maximize the information leakage. This function is indifferent to the total number of sensitive properties that are fully disclosed in a particular partitioning and gives the same score to all partitionings that have at least one fully disclosed property. However, there are cases where one is not interested in the worst-case disclosure but only in the total information disclosed to any adversary. Following this observation we introduce another variation of the overall disclosure function that considers the total information disclosure per adversary.

Average Disclosure.
We replace the infinity norm in the equation above with the L1 norm, normalized by the number of properties:

f_L1(S) = max_{a ∈ A} ( ||f_a(S_a)||_1 / |P| )    (2)

Observe that both Equation 1 and Equation 2 consider the maximum of the disclosure across adversaries, i.e., they can be written as:

f(S) = max_{a ∈ A} f'_a(S_a)    (3)

where f'_a(S_a) = ||f_a(S_a)||_∞ or f'_a(S_a) = ||f_a(S_a)||_1 / |P|.

Let u denote the utility derived by partitioning the dataset across multiple adversaries. We have that u : P(D × A) → R+, where P(D × A) denotes the powerset of possible data-to-adversary assignments. As we show below, this function quantifies either the utility of the adversaries when acquiring part of the dataset D (see Example 1) or the publisher's utility derived by fulfilling a particular task that requires partitioning the data (see Example 2). In many real-world examples these two different kinds of utility can be unified under a single utility. Consider Example 1. Typically, advertisers pay higher amounts for data that provide higher utility. Thus, maximizing the utility of each individual advertiser maximizes the utility (perhaps monetary) of the data publisher as well. Based on this observation we unify the two types of utilities under a single formulation based on the utility of adversaries.

First we focus on the utility of adversaries. Intuitively, we would expect that the more data an adversary receives, the less the observation of a new, previously unobserved data entry would increase the gain of the adversary. This notion of diminishing returns is formalized by the combinatorial notion of submodularity and has been shown to hold in many real-world scenarios [18, 14].
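As a concrete reading of the two aggregates in Equations 1 and 2, the sketch below assumes each adversary's disclosure vector f_a(S_a) has already been computed as a list of per-property values in [0, 1]; all names and the toy vectors are illustrative.

```python
def worst_disclosure(per_adversary_vectors):
    """Equation 1: max over adversaries of the L-infinity norm."""
    return max(max(vec) for vec in per_adversary_vectors.values())

def average_disclosure(per_adversary_vectors):
    """Equation 2: max over adversaries of the L1 norm divided by |P|."""
    return max(sum(vec) / len(vec) for vec in per_adversary_vectors.values())

# Adversary a1 fully learns one of three properties; a2 learns a little
# about each. Both aggregates are driven by a1 here (1.0 and 1/3).
vectors = {"a1": [1.0, 0.0, 0.0], "a2": [0.2, 0.2, 0.2]}
```

Note how a single fully disclosed property pushes the worst-case aggregate to 1.0, while the average aggregate still distinguishes partitionings by how much is leaked overall.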
More formally, a set function G : 2^V → R mapping subsets A ⊆ V to real numbers is called submodular [4] if for all A ⊆ B ⊆ V and v' ∈ V \ B it holds that G(A ∪ {v'}) − G(A) ≥ G(B ∪ {v'}) − G(B), i.e., adding v' to a set A increases G more than adding v' to a superset B of A. G is called nondecreasing if for all A ⊆ B ⊆ V it holds that G(A) ≤ G(B).

Let u_a be a set function that quantifies the utility of each adversary a. As mentioned above, we assume that u_a is a nondecreasing submodular function. For convenience we will occasionally drop the nondecreasing qualification in the remainder of the paper. Let U_A denote the set of all utility functions for a given set of adversaries A. The overall utility function u can be defined as an aggregate function of all utilities u_a ∈ U_A. We require that u is also a submodular function. For example, the overall utility may be defined as a linear combination, i.e., a weighted sum, of all functions in U_A, of the form:

u(S) = Σ_{a ∈ A} w_a u_a(S_a)    (4)

where S and S_a are defined in Definition 1. Since all functions in U_A are submodular, u will also be submodular, as it is expressed as a linear combination of submodular functions [7].

An example of a submodular function u_a is an additive function. Assume that each data entry d ∈ D has some utility w_da for an adversary a ∈ A. We have that u_a(S_a) = Σ_{d ∈ D} w_da x_da, where x_da is an indicator variable that takes value 1 when data entry d is revealed to adversary a and 0 otherwise. For the remainder of the paper we will assume that the utility u is normalized so that u ∈ [0, 1].
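A minimal sketch of the additive utility just described, instantiating Equation 4 with per-entry weights w_da; the weights, entry names, and adversary names below are hypothetical.

```python
def adversary_utility(entry_weights, assigned_entries):
    """Additive (hence nondecreasing submodular) utility u_a(S_a):
    the sum of w_da over the entries revealed to this adversary."""
    return sum(entry_weights[d] for d in assigned_entries)

def overall_utility(adversary_weights, entry_weights, assignment):
    """Equation 4: a weighted sum of the per-adversary utilities."""
    return sum(adversary_weights[a] * adversary_utility(entry_weights[a], s)
               for a, s in assignment.items())

# Entry d1 is worth more to a1, d2 more to a2; both adversaries get d1.
w = {"a1": {"d1": 0.5, "d2": 0.25}, "a2": {"d1": 0.1, "d2": 0.9}}
assignment = {"a1": {"d1"}, "a2": {"d1", "d2"}}
u = overall_utility({"a1": 1.0, "a2": 1.0}, w, assignment)
```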
3. PRIVACY-AWARE DATA PARTITIONING
In this section, we describe two formulations of the privacy-aware partitioning problem. We show how both can be expressed as maximization problems that are, in general, NP-hard to solve. We consider a dataset D that needs to be partitioned across a given set of adversaries A. We assume that each data entry must be revealed to at least one and at most t adversaries. The lower bound arises naturally in both applications discussed in the introduction. The upper bound is necessary to model cases where the number of assignments per data entry needs to be restricted, as it might incur some cost, e.g., monetary cost in crowdsourcing applications.

We assume that the functions to compute the overall utility and information disclosure are given. Let these functions be denoted by u and f respectively. Ideally, we wish to maximize the utility while minimizing the disclosure; however, there is a natural tradeoff between the two optimization goals. A traditional approach is to set a requirement on information disclosure while optimizing the gain. Accordingly, we can define the following optimization problem.

DEFINITION 2 (DISCBUDGET). Let D be a set of data entries, A be a set of adversaries, and τ_I be a budget on information disclosure. This formulation of the privacy-aware partitioning problem finds a data-entry-to-adversary assignment set S that maximizes u(S) under the constraint f(S) < τ_I. More formally, we have the following optimization problem:

maximize_{S ∈ P(D × A)} u(S)
subject to f(S) < τ_I,
1 ≤ Σ_{a=1}^{k} x_da ≤ t, ∀ d ∈ D,
x_da ∈ {0, 1}.    (5)

where x_da and S are defined in Definition 1 as before and t ≥ 1 is the maximum number of adversaries to whom a particular data entry can be published.

This optimization problem already captures our desire to reduce the information disclosure while increasing the utility. However, depending on the value of τ_I, the optimization problem presented above might be infeasible. Infeasibility occurs when τ_I is so small that no assignment of data to adversaries exists such that Σ_{a=1}^{k} x_da ≥ 1 for all d ∈ D and f(S) < τ_I.

To overcome this, we consider a different formulation of the privacy-aware data partitioning problem where we seek to maximize the difference between the utility and the information disclosure functions. We consider the Lagrangian relaxation of the previous optimization problem. Again, we assume that both functions are measured using the same unit. We have the following:

DEFINITION 3 (TRADEOFF). Let D be a set of data entries, A be a set of adversaries, and τ_I be a budget on information disclosure. This formulation of the privacy-aware partitioning problem finds a data-entry-to-adversary assignment S that maximizes the tradeoff between the overall utility and the overall information disclosure, i.e., u(S) + λ(τ_I − f(S)), where λ is a nonnegative weight. More formally, we have the following optimization problem:

maximize_{S ∈ P(D × A)} u(S) + λ(τ_I − f(S))
subject to 1 ≤ Σ_{a=1}^{k} x_da ≤ t, ∀ d ∈ D,
x_da ∈ {0, 1}.    (6)

where x_da and S are defined in Def. 1 and t is the maximum number of adversaries to whom a data entry can be published.

We prove that both problems above are NP-hard by reducing the hypergraph coloring problem [3] to each version of the privacy-aware data partitioning problem.

THEOREM 1. Both formulations of the privacy-aware data partitioning problem are, in general, NP-hard to solve.

PROOF. Fix a set of adversaries denoted by A and a set of data entries denoted by D. Let P denote the sensitive properties that correspond to data entries in D. We need to partition D across the adversaries in A. Consider the following instance of the privacy-aware data partitioning problem. We require that each data entry be published to exactly one adversary. Moreover, we set the maximum budget on information disclosure to be 1. We also fix the overall information disclosure to be a step function of the following form: if all the data entries corresponding to a particular sensitive property are revealed to the same adversary, the overall disclosure is 1; otherwise it is 0. Finally, we consider a constant utility function, which is always equal to 1. Considering the hypergraph representation of data and properties, it is easy to see that this problem is equivalent to the hypergraph coloring problem, which is NP-hard [3]. Reversing the above steps, one can easily reduce any instance of hypergraph coloring to the privacy-aware data partitioning problem.

In the remainder of the paper we describe efficient heuristics for solving the partitioning problem: we present approximation algorithms for specific information disclosure functions in Section 4, and a greedy local-search heuristic for the general problem in Section 5. Due to space constraints, henceforth, we will only focus on the TRADEOFF formulation. Many of our algorithms also work for the DISCBUDGET formulation (with slight modifications).
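To make the TRADEOFF objective u(S) + λ(τ_I − f(S)) concrete, the sketch below brute-forces it over all assignments of a toy instance where each entry goes to exactly one of two adversaries (t = 1). The utility and disclosure functions and all names are illustrative stand-ins, and exhaustive search is only viable at this toy scale, consistent with the NP-hardness result above.

```python
from itertools import product

def best_assignment(entries, adversaries, utility, disclosure, lam, tau):
    """Exhaustively maximize u(S) + lam * (tau - f(S)) when each data
    entry is published to exactly one adversary (t = 1)."""
    best, best_val = None, float("-inf")
    for choice in product(adversaries, repeat=len(entries)):
        S = {a: {d for d, c in zip(entries, choice) if c == a}
             for a in adversaries}
        val = utility(S) + lam * (tau - disclosure(S))
        if val > best_val:
            best, best_val = S, val
    return best, best_val

# Toy instance: one sensitive property, fully disclosed (value 1) only
# when d1 and d2 land on the same adversary; utility counts placed entries.
entries, adversaries = ["d1", "d2"], ["a1", "a2"]
utility = lambda S: sum(len(s) for s in S.values()) / len(entries)
disclosure = lambda S: 1.0 if any({"d1", "d2"} <= s for s in S.values()) else 0.0
S, val = best_assignment(entries, adversaries, utility, disclosure, lam=1.0, tau=0.0)
```

On this instance every assignment has the same utility, so the objective is maximized precisely by splitting d1 and d2 across the two adversaries, which keeps the sensitive property undisclosed.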
4. ANALYZING SPECIFIC CASES OF INFORMATION DISCLOSURE
In this section, we present approximation algorithms for cases where the information disclosure function takes the following special forms: 1) step functions, 2) linearly increasing functions, and 3) quadratic functions. The utility function is assumed to be submodular.
Information disclosure functions that correspond to a step function can model cases where each sensitive property p ∈ P is either fully disclosed or fully private. A natural application of step functions is the crowdsourcing scenario shown in Example 2. When certain fields of a medical transcription, e.g., the name together with the diagnosis, or the gender together with the zip code and birth date, are revealed to an adversary, the corresponding sensitive property of the patient is revealed.

We continue by describing such functions formally. Let D_p ⊂ D be the set of data entries associated with p. Property p is fully disclosed only if D_p is published in its entirety to an adversary. This can be modeled using a set of step functions f_a ∈ F: f_a(D_a)[p] = 1 if the set of data items assigned to adversary a contains all the elements of D_p associated with property p. If D_p ⊄ D_a, then f_a(D_a)[p] = 0. Observe that information disclosure is minimized (and is equal to 0) when no adversary gets all the elements in D_p, for all p. For step functions we consider worst-case disclosure, since ideally we do not want to fully disclose any property.

Considering DISCBUDGET and TRADEOFF separately is not meaningful for step functions. Since disclosure can only take the extreme values {0, 1}, τ_I in TRADEOFF should be set to 0. Hence, full disclosure of a property always penalizes the utility. One can therefore reformulate the problem and seek solutions that maximize the utility function under the constraint that information disclosure is 0, i.e., no property exists such that all its corresponding data entries are published to the same adversary.

Given these families of information disclosure functions and a submodular utility function, both formulations of privacy-aware data partitioning can be represented as an integer program (IP):

maximize_{S ∈ P(D × A)} u(S)
subject to Σ_{d ∈ D_p} x_da < |D_p|, ∀ p ∈ P, ∀ a ∈ A,
1 ≤ Σ_{a=1}^{k} x_da ≤ t, ∀ d ∈ D,
x_da ∈ {0, 1}.    (7)

where t is the maximum number of adversaries to whom a particular data entry can be published. The first constraint enforces that there is no full disclosure of a sensitive property. The partitioning constraint enforces that a data entry is revealed to at least one but no more than t adversaries. Solving the optimization problem in (7) corresponds to maximizing a submodular function under linear constraints. Recall that the utility function is submodular, and observe that all constraints in the optimization problem presented above are linear. In fact, they can be viewed as packing constraints.

For additive utility functions (u = Σ_{a ∈ A} Σ_{d ∈ D} w_da x_da), Equation 7 becomes an integer linear program that can be approximately solved in PTIME in two steps.
First, one can solve a linear relaxation of Equation 7, where x_da is a fraction in [0, 1]. The resulting fractional solution can then be converted into an integral solution using a rounding strategy.

The simplest rounding strategy, called randomized rounding [19], works as follows: assign data entry d to an adversary a with probability equal to x̂_da, where x̂_da is the fractional solution to the linear relaxation. The value of the objective function achieved by the resulting integral solution is in expectation equal to the optimal value of the objective achieved by the linear relaxation. Moreover, randomized rounding preserves all constraints in expectation. A different kind of rounding, called dependent rounding [8], ensures that constraints are satisfied in the integral solution with probability 1. For an overview of different randomized rounding techniques and the quality of the derived solutions for budgeted problems, we refer the reader to the work by Doerr et al. [4].

One can solve the general problem with worst-case approximation guarantees by leveraging a recent result on submodular maximization under multiple linear constraints by Kulik et al. [15].

THEOREM 2. Let the overall utility function u be a nondecreasing submodular function. One can find a feasible solution to the optimization problem in (7) with expected approximation ratio of (1 − ε)(1 − e^{−1}), for any ε > 0, in polynomial time.

PROOF. This holds directly by Theorem 2.1 of Kulik et al. [15].

To achieve this approximation ratio, Kulik et al. introduce a framework that first obtains an approximate solution for a continuous relaxation of the problem, and then uses a non-trivial combination of a randomized rounding procedure with two enumeration phases, one on the most profitable elements, and the other on the 'big' elements, i.e., elements with high cost. This combination enables one to show that the rounded solution can be converted to a feasible one with high expected profit.
Due to the intricacy of the algorithm, we refer the reader to Kulik et al. [15] for a detailed description.
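To make the rounding step concrete, the naive randomized rounding described above could be sketched as follows. This is a minimal illustration, not SPARSI's actual (MATLAB) implementation; the dictionary-based representation and the function names are our own.

```python
import random

def randomized_round(x_frac, seed=None):
    """Naive randomized rounding: set each x_da to 1 independently with
    probability equal to its fractional LP value, so that an additive
    objective is preserved in expectation (constraints hold only in
    expectation; dependent rounding [8] would enforce them exactly)."""
    rng = random.Random(seed)
    return {da: 1 if rng.random() < v else 0 for da, v in x_frac.items()}

def additive_utility(x, w):
    """u(S) = sum of w_da over the published (d, a) pairs."""
    return sum(w[da] * x[da] for da in x)

# Hypothetical fractional solution for two entries and two adversaries.
x_hat = {("d1", "a1"): 1.0, ("d1", "a2"): 0.0,
         ("d2", "a1"): 0.5, ("d2", "a2"): 0.5}
x_int = randomized_round(x_hat, seed=7)
```

Entries whose fractional values are already 0 or 1 round deterministically; only the genuinely fractional ones are randomized, which is why the fractional solutions being near-integral (as observed in Section 6) makes this naive scheme effective.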
In this section, we consider linearly increasing disclosure functions. Linear disclosure functions naturally model situations where each data entry independently affects the likelihood of disclosure. In particular, if normalized log-likelihood is used as a measure of information disclosure, the disclosure function takes the linear form presented below. We consider the following additive form for the disclosure of property p:

    f_a(·)[p] = Σ_{d ∈ D_p} a_{dp} x_{da}    (8)

where a_{dp} is a weight associated with the information that is disclosed about property p when data entry d is revealed to an adversary. We can rewrite the TRADEOFF problem statement as:

    maximize_{S ∈ P(D × A)}   u(S) + λ(τ_I − max_{a ∈ A} f'_a(S_a))
    subject to   1 ≤ Σ_{a=1}^{k} x_{da} ≤ t,   ∀ d ∈ D,
                 x_{da} ∈ {0, 1}.    (9)

When the utility function is additive, the above problem is an integer linear program, and hence can be solved by rounding the LP relaxation as explained in the previous section. However, for general submodular u(·), the objective is no longer submodular: the max of the additive information disclosure functions is not necessarily supermodular [7]. Hence, unlike the case of step functions, we cannot use the result of Kulik et al. [15] to get an efficient approximate solution. Nevertheless, we can compute approximate solutions in PTIME by considering the following max-min formulation of the problem:

    maximize_{S ∈ P(D × A)}   min_{a ∈ A} (u(S) + λ(τ_I − f'_a(S_a)))
    subject to   1 ≤ Σ_{a=1}^{k} x_{da} ≤ t,   ∀ d ∈ D,
                 x_{da} ∈ {0, 1}.    (10)

Since the overall utility function is a nondecreasing submodular function, and the disclosure for each adversary is additive, the objective now is a max-min of submodular functions.
More precisely, for worst-case disclosure (Equation 1), the optimization objective can be rewritten as:

    maximize_{S ∈ P(D × A)}   min_{a ∈ A, p ∈ P} (u(S) + λ(τ_I − f_a(S_a)[p]))    (11)

and, for average disclosure (Equation 2), it can be written as:

    maximize_{S ∈ P(D × A)}   min_{a ∈ A} (u(S) + λ(τ_I − (1/|P|) Σ_{p ∈ P} f_a(S_a)[p]))    (12)

The above max-min problem formulation is closely related to the max-min fair allocation problem [10] for both types of information disclosure functions. The main difference between Problem (10) and the max-min fair allocation problem is that data items may be assigned to multiple adversaries; in the max-min fair allocation problem, a data item is assigned exactly once. If t = 1, the two problems are equivalent, and thus one can provide worst-case guarantees on the quality of the approximate solution. The problem of max-min fair allocation was studied by Golovin [10], Khot and Ponnuswami [13], and Goemans et al. [9]. Let n denote the total number of data entries (goods in the max-min fair allocation problem) and m denote the number of adversaries (buyers in the max-min fair allocation problem). The first two papers focus on additive functions and give algorithms achieving an (n − m + 1)-approximation and a (2m + 1)-approximation respectively, while the third one gives an O(√n m^{1/4} log n log^{3/2} m)-approximation.

In this section we consider quadratic disclosure functions of the following form:

    f_a(·)[p] = (Σ_{d ∈ D_p} a_{dp} x_{da})^2    (13)

where a_{dp} and x_{da} are defined as before. Since f_a(D_p)[p] = 1, we have that (Σ_{d ∈ D_p} a_{dp})^2 = 1. We assume that the utility is an additive function following the form of Equation 4, and do not consider the generic submodular case. The TRADEOFF optimization function can be rewritten as:

    maximize_{S ∈ P(D × A)}   Σ_{d ∈ D} Σ_{a ∈ A} w_{da} x_{da} + λ(τ_I − max_{a ∈ A} f'_a(S_a))    (14)

where f'_a(·) is a quadratic function.
The internal maximization over information disclosure functions can be removed from the objective by rewriting the optimization problem as:

    maximize_{S ∈ P(D × A)}   Σ_{d ∈ D} Σ_{a ∈ A} w_{da} x_{da} + λ(τ_I − y)
    subject to   y ≥ f'_a(S_a),   ∀ a ∈ A,
                 1 ≤ Σ_{a=1}^{k} x_{da} ≤ t,   ∀ d ∈ D,
                 x_{da} ∈ {0, 1}.    (15)

Since all constraints are either linear or quadratic, the above problem is an integer quadratically constrained program [?], which is, in general, NP-hard to solve. In order to derive an approximate solution in PTIME, we relax this problem to an equivalent Second Order Conic Programming (SOCP) problem [?]. An SOCP problem can be solved in polynomial time to within any level of accuracy using an interior-point method [?], resulting in a fractional solution x̂_{da}.

First, we focus on the constraints shown in Problem (15). The constraints for worst and average disclosure can be written as:

    (Σ_{d ∈ D_p} a_{dp} x_{da})^2 = (A_{ap} x_a)^T A_{ap} x_a,   ∀ p ∈ P, ∀ a ∈ A
    (1/|P|) Σ_{p ∈ P} (Σ_{d ∈ D_p} a_{dp} x_{da})^2 = (A_p x_a)^T A_p x_a,   ∀ a ∈ A    (16)

where x_a corresponds to a vector representation of the assignment set S_a and A_p^T = [a'_1, a'_2, ..., a'_{|D|}] is a positive vector that contains the appropriate weights for all data entries with respect to all properties p ∈ P. Both constraints follow the quadratic form (Ax)^T Ax, where x is a vector representation of the assignment set S_a and A is a matrix containing the appropriate A_{ap}'s or A_p's based on the type of information disclosure we are using. Let C denote the total number of constraints with respect to the disclosure function. Observe that we have C = |P||A| for worst-case disclosure and C = |A| for average disclosure.

The next step is to incorporate the variable y into the optimization problem. For that we extend the vector x to include y; the new variable vector is [y x]^T.
We can rewrite the quantities in Equation 16 as follows:

    ([0  A_c] [y x]^T)^T ([0  A_c] [y x]^T),   ∀ c ∈ [C]    (17)

where A_c and x are as defined above. Finally, the SOCP problem equivalent to the initial QCP problem is:

    maximize   [−λ  w]^T q + λτ_I
    subject to   [1  0 ... 0] q ≥ (A'_c q)^T (A'_c q),   ∀ c ∈ [C],
                 1 ≤ Σ_{a=1}^{k} x_{da} ≤ t,   ∀ d ∈ D.    (18)

where q = [y x]^T and A'_c = [0  A_c].

Finally, the fractional solution x̂_{da} obtained from the SOCP needs to be converted to an integral solution. We point out that for the TRADEOFF problem no guarantees can be derived on the value of the objective function of the integral solution when naive randomized rounding schemes, such as setting x_{da} = 1 with Pr[X_{da} = 1] = x̂_{da}, are used. Thus, finding a rounding scheme that ensures that the objective of the integral solution is equal to that of the fractional solution in expectation remains an open problem.
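To summarize Sections 4.1 through 4.3, the per-property disclosure seen by a single adversary under the three families could be computed as below. This is an illustrative sketch; the function names and data layout are our own assumptions, not part of SPARSI.

```python
def step_disclosure(revealed, d_p):
    """Step disclosure (Section 4.1): full disclosure (1) iff every
    entry of the property's support D_p reaches the same adversary."""
    return 1.0 if set(d_p) <= set(revealed) else 0.0

def linear_disclosure(revealed, a_p):
    """Linear disclosure (Equation 8): sum of the weights a_dp of the
    revealed entries; the weights for a property sum to 1."""
    return sum(a_p.get(d, 0.0) for d in revealed)

def quadratic_disclosure(revealed, a_p):
    """Quadratic disclosure (Equation 13): the square of the linear
    form, so partial releases are penalized less than linearly."""
    return linear_disclosure(revealed, a_p) ** 2

d_p = ["d1", "d2"]             # entries that determine property p
a_p = {"d1": 0.4, "d2": 0.6}   # weights a_dp, summing to 1
```

Note how, for a partial release such as ["d1"], the quadratic form yields 0.16 versus 0.4 for the linear form, while both reach 1 when all of D_p is revealed to the same adversary; this matches the constraint (Σ_{d ∈ D_p} a_dp)² = 1 above.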
5. A GREEDY LOCAL SEARCH HEURISTIC
So far we studied specific families of disclosure functions to derive worst-case guarantees on the quality of approximate solutions. In this section we present two greedy heuristic algorithms suitable for any general disclosure function. We still require the utility function to be submodular. Our heuristics are based on hill climbing and the Greedy Randomized Adaptive Search Procedure (GRASP) [6]. Notice that local search heuristics are known to perform well when maximizing a submodular objective function [7]. Again, we focus only on the TRADEOFF optimization problem (see Equation 6).

Algorithm 1 Overall Algorithm
1: Input: A: set of adversaries; G: objective function; r: number of repetitions; t: max. adversaries per data item
2: Output: M_opt: a data-to-adversary assignment matrix
3: for all i = 1 → r do
4:   M_∅ ← empty assignment, g_opt ← G(M_∅)
5:   ⟨M_ini, g_ini⟩ ← CONSTRUCTION(G, A, t);
6:   ⟨M, g⟩ ← LOCALSEARCH(G, A, t, M_ini, g_ini);
7:   if g > g_opt then
8:     M_opt ← M; g_opt ← g;
9: return M_opt;

Our algorithm proceeds in two phases (Algorithm 1). The first phase, which we call construction, constructs a data-to-adversary assignment matrix M_ini by greedily picking assignments that maximize the specified objective function G(·), i.e., the tradeoff between utility and information disclosure, while ensuring that each data item is assigned to at least one and at most t adversaries. The second phase, called local-search, searches for a better solution in the neighborhood of M_ini, changing one assignment of one data item at a time if it improves the objective function, resulting in an assignment M. The construction algorithm may be randomized; hence, the overall algorithm is executed r times, and the best solution M_opt = argmax_{M_1,...,M_r} G(M_i) is returned as the final solution.

Algorithm 2
CONSTRUCTION
1: Input: G: objective function; A: set of adversaries; t: max. adversaries per data item
2: Output: ⟨M, g⟩: data-to-adversary assignment, objective value
3: maxIterations ← t · |D|
4: Initialize: M ← empty assignment
5: for all i ∈ [1, maxIterations] do
6:   // Compute a set of candidate assignments
7:   D_M ← data entries assigned to < t adversaries in M;
8:   Let S ← D_M × A − M
9:   // Pick the next best assignment that improves the objective
10:  ⟨d, a⟩ ← PICKNEXTBEST(M, g, S, G)
11:  if ⟨d, a⟩ is NULL then
12:    break; // No new assignments improve the objective
13:  Assign the selected data entry d to the selected adversary a;
14: return ⟨M, G(M)⟩;

The construction phase (Algorithm 2) starts with an empty data-to-adversary assignment matrix and greedily adds a new ⟨d, a⟩ assignment to the mapping M if it improves the objective function. This is achieved by iteratively performing two steps. The algorithm first computes a set of candidate assignments S: for any data item d (which does not already have t assignments) and any adversary a, ⟨d, a⟩ is a candidate assignment if it does not appear in M.

Algorithm 3 PICKNEXTBEST
1: Input: G: objective function; M: current assignment; g: current value of objective; S: possible new assignments
2: Output: new assignment ⟨d*, a*⟩, or NULL
3: GREEDY:
4:   ⟨d*, a*⟩ ← argmax_{⟨d,a⟩ ∈ S} G(M ∪ ⟨d, a⟩)
5: GRASP:
6:   Pick from S the top-n assignments S_n having the highest values g_{⟨d,a⟩} = G(M ∪ ⟨d, a⟩), with g_{⟨d,a⟩} > g.
7:   ⟨d*, a*⟩ is drawn uniformly at random from S_n
8: if G(M ∪ ⟨d*, a*⟩) > g then return ⟨d*, a*⟩ else return NULL

Second, the algorithm picks the next best assignment from the candidates (using PICKNEXTBEST, Algorithm 3). We consider two methods for picking the next best assignment, GREEDY and GRASP. The GREEDY strategy picks the ⟨d*, a*⟩ that maximizes the objective G(M ∪ ⟨d*, a*⟩). On the other hand, GRASP identifies a set S_n of the top n assignments that have the highest value for the objective g_{⟨d,a⟩} = G(M ∪ ⟨d, a⟩), such that g_{⟨d,a⟩} is greater than the current value of the objective g; note that S_n can contain fewer than n assignments. The GRASP strategy picks an assignment ⟨d*, a*⟩ at random from S_n. Both strategies return NULL if ⟨d*, a*⟩ does not improve the value of the objective function. The construction stops when no new assignment can improve the objective function.

Complexity: The run time complexity of the construction phase is O(t · |A| · |D|^2): there are O(t · |D|) iterations, and each iteration may have a worst-case running time of O(|D| · |A|). PICKNEXTBEST permits a simple parallel implementation.
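A compact version of the construction phase with the GREEDY selection rule can be sketched as follows. This is a minimal sketch under our own set-based representation of the assignment M; it omits the GRASP variant and the parallel implementation of PICKNEXTBEST.

```python
def construct(D, A, t, G):
    """Greedy construction (in the spirit of Algorithm 2, GREEDY
    selection): repeatedly add the candidate (d, a) pair that most
    improves the objective G, giving no data entry more than t
    adversaries, and stop after t*|D| iterations or when no candidate
    improves the objective."""
    M = set()
    for _ in range(t * len(D)):
        counts = {d: sum(1 for (x, _) in M if x == d) for d in D}
        candidates = [(d, a) for d in D if counts[d] < t
                      for a in A if (d, a) not in M]
        best, best_g = None, G(M)
        for da in candidates:
            g = G(M | {da})
            if g > best_g:
                best, best_g = da, g
        if best is None:   # no assignment improves the objective
            break
        M.add(best)
    return M
```

For example, with a plain additive utility G(M) = Σ w_da over the pairs in M and t = 1, each data entry ends up with its highest-weight adversary, since every step adds the single most profitable remaining pair.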
Algorithm 4 LOCALSEARCH
1: Input: G: objective function; A: set of adversaries; t: max. assignments per data item; M: current assignment; g: current objective value
2: Output: ⟨M_opt, g_opt⟩: the new assignment and the corresponding objective value
3: for all d ∈ D do
4:   A_d ← the set of adversaries to whom data item d is assigned (according to the current assignment M);
5:   // Construct a set of neighboring assignments
6:   N_d ← {M}
7:   if |A_d| < t then N_d ← N_d ∪ {M ∪ {⟨d, a'⟩} | ∀ a' ∉ A_d};
8:   for each adversary a ∈ A_d do
9:     N_d ← N_d ∪ {M − {⟨d, a⟩}}
10:    N_d ← N_d ∪ {M − {⟨d, a⟩} ∪ {⟨d, a'⟩} | ∀ a' ∉ A_d};
11:  // Pick the neighboring assignment with maximum objective
12:  M_opt ← argmax_{M' ∈ N_d} G(M')
13:  M ← M_opt
14: return ⟨M_opt, G(M_opt)⟩;

The second phase employs local search (Algorithm 4) to improve the initial assignment M_ini output by the construction phase. In this phase, the data items are considered exactly once in some (random) order. Given the current assignment M, for each data item d a set of neighboring assignments N_d (including M itself) is considered, constructed by (i) removing an assignment of d to an adversary a in M, (ii) modifying an assignment from adversary a to an adversary a' (to which d was not already assigned in M), and (iii) adding a new assignment (if d is not already assigned to t adversaries in M). Next, the neighboring assignment in N_d with the maximum value of the objective, M_opt, is picked. The next iteration considers the data item succeeding d (in the ordering) with M_opt as the current assignment. We found that making a second pass over the dataset in the local-search phase does not improve the value of the objective function.

Complexity: The run time complexity of the local-search phase is O(t · |A| · |D|).

The construction phase (Algorithm 2) has a run time that is quadratic in the size of D. This is because in each iteration the PICKNEXTBEST subroutine computes a global maximum assignment across all data-adversary pairs. While this approach makes the algorithm more effective at avoiding local minima, it reduces its scalability due to its quadratic cost. To improve scalability, one can adopt a local myopic approach during construction. Instead of considering all possible (data, adversary) pairs when constructing the list of candidate assignments (see Ln. 8 in Algorithm 2), one can consider a single data entry d and populate the set of candidate assignments S using only (data, adversary) pairs that contain d. More specifically, we fix a total ordering O of the data entries, and perform t iterations of the following:
• Consider the next data item d in O. Let M be the current assignment.
• Construct S as ({d} × A) − M.
• Pick the next best assignment in S using Algorithm 3 (GREEDY or GRASP) that improves the objective function.
• Update the current assignment M, and proceed with the next data entry in O.

Complexity:
The run time complexity of the myopic-construction phase is O(t · |A| · |D|).

While both the construction and local-search phases carefully ensure that each data item is assigned to no more than t adversaries, we still need to prove that each data item is assigned to at least one adversary. To ensure this lower bound on the cardinality, we use the following objective function G:

    G(·) = u(·) + λ(τ_I − f(·)) − C    (19)

where C is the number of data items that are not assigned to any adversary and λ ∈ [0, 1]. The above objective function adds a penalty term C to the tradeoff between the utility and the information disclosure, i.e., u(·) + λ(τ_I − f(·)). We can show that this penalty ensures that every data item is assigned to at least one adversary. We have the following theorem:

THEOREM. Using G(·) = u(·) + λ(τ_I − f(·)) − C with λ ∈ [0, 1] as the objective function in Algorithm 1 returns a solution where all cardinality constraints are satisfied.

PROOF. Both the construction and the local-search phases ensure that no data item is assigned to more than t adversaries (Ln. 7 in Algorithms 2 and 4). We need to show that every data item is assigned to at least one adversary, i.e., that at the end of the algorithm C equals 0. We focus on the global version of Algorithm 1; the proof for the local version is analogous. The main intuition behind the proof is the following:
• At the end of the construction phase, if C > 0, then some data item must be assigned to more than t adversaries.
• In the local-search phase, making a data item unassigned never results in a better objective function.
At the start of the construction phase, C = |D|. We show that if at some iteration C = i > 0, then in that iteration some unassigned data item is assigned to an adversary and C reduces by 1.
If C = i > 0, there are three possible paths the algorithm can follow after the iteration is over: (1) no new assignment is chosen; (2) the algorithm chooses to assign a data entry to an adversary so that the number of violated constraints remains the same (i.e., that data entry is already assigned to at least one adversary); and (3) a data-to-adversary assignment is chosen so that C = i − 1. We will show that, given the objective function we are using, the first path will never be chosen. We evaluate the value of the objective function for paths 1 and 3. For the third path we consider the worst-case scenario, where only disclosure increases (by Δf) and utility remains the same. The value of the objective function for paths 1 and 3 is as follows:

    G_1 = u + λ(τ_I − f) − C
    G_3 = u + λ(τ_I − f − Δf) − (C − 1)

Since λΔf ≤ 1, we have G_3 ≥ G_1, and thus path 1 will never be taken if C > 0. Notice that the objective values for paths 2 and 3 cannot be directly compared. However, we will show that over the t|D| iterations we perform, path 3 will be chosen |D| times, and hence C = 0 at the end of Algorithm 2. Let n_1, n_2 and n_3 denote the number of times paths 1, 2 and 3 are chosen, respectively. By construction we have n_1 + n_2 + n_3 = t|D|. After t|D| iterations, if C = 0 we are done. If C > 0 then, based on the above, n_1 = 0, and thus n_2 + n_3 = t|D|. Now we show that n_3 always equals |D|. Suppose n_3 < |D|; then n_2 = t|D| − n_3 > (t − 1)|D|. Recall that when path 2 is chosen, the algorithm assigns to an adversary a data entry that is already assigned to another adversary. Therefore, if n_2 > (t − 1)|D|, then some data item is assigned to more than t adversaries, which does not happen. Therefore, n_3 = |D|.

To prove that C = 0 at the end of the local-search phase, it suffices to show that the algorithm never chooses to remove an assignment from a data item assigned to exactly one adversary.
Consider the state of the local-search algorithm right before an iteration of its main for-loop (Ln. 3-13 of Algorithm 4). Assume the algorithm considers a data entry d with a single assignment, and that C = 0. Moreover, let u and f be the utility and disclosure at that point, and let u' and f' denote the utility and disclosure if d is assigned to no adversary. Consider the best-case scenario where no utility is lost, i.e., u' = u, and f' = 0; note that C becomes 1. The new objective value would be u' + λ(τ_I − f') − C = u + λτ_I − 1. We have that u + λτ_I − 1 ≤ u + λ(τ_I − f), since λf ≤ 1 always holds. Hence, unassigning d never improves the objective.
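The penalized objective of Equation 19, and the worst-case inequality used in the proof above, can be checked numerically with a small sketch. The callables standing in for the utility and disclosure functions are placeholders of our own, chosen to realize the proof's worst case (utility unchanged, disclosure jumping from 0 to its maximum of 1).

```python
def penalized_objective(M, D, u, f, lam, tau):
    """G(M) = u(M) + lam*(tau - f(M)) - C, where C counts the data
    entries of D left unassigned by M (Equation 19), with lam in [0, 1]."""
    assigned = {d for (d, _) in M}
    C = sum(1 for d in D if d not in assigned)
    return u(M) + lam * (tau - f(M)) - C

# Worst case for assigning the lone entry d: utility stays 0 while
# disclosure jumps from 0 to its maximum value 1.
D = ["d"]
u = lambda M: 0.0
f = lambda M: 1.0 if M else 0.0
g_unassigned = penalized_objective(set(), D, u, f, lam=1.0, tau=0.0)
g_assigned = penalized_objective({("d", "a1")}, D, u, f, lam=1.0, tau=0.0)
```

Since λ·Δf ≤ 1, g_assigned ≥ g_unassigned even in this extreme case (both evaluate to −1 when λ = 1), and for any λ < 1 assigning the entry is strictly better, matching the argument that path 1 is never chosen while C > 0.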
6. EXPERIMENTS
In this section we present an empirical evaluation of SPARSI. The main questions we seek to address are: (1) how the two versions of the privacy-aware partitioning problem, TRADEOFF and DISCBUDGET, compare with each other, and how well they exploit the presence of multiple adversaries with respect to disclosure and utility; (2) how the different algorithms perform in optimizing the overall utility and disclosure; and (3) how practical SPARSI is for distributing real-world datasets across multiple adversaries.

We empirically study these questions using both real and synthetic datasets. After describing the data and the experimental methodology, we present results that demonstrate the effectiveness of our framework in partitioning sensitive data. The evaluation is performed on an Intel(R) Core(TM) i5 2.3 GHz/64-bit/8 GB machine. SPARSI is implemented in MATLAB and uses MOSEK, a commercial optimization toolbox.
Real-World Dataset:
For the first set of experiments we present how SPARSI can be applied to real-world domains. We considered a social networking scenario as discussed in Example 1, using the Brightkite dataset published by Cho et al. [2]. This dataset was extracted from Brightkite, a former location-based social networking service provider where users shared their locations by checking in. Each check-in entry contains the id of the user and the timestamp and location of the check-in; the dataset contains the public check-in data of users together with their friendship network. The original dataset contains 4.5 million check-ins, 58,228 users and 214,078 edges. We subsampled the dataset to extract a smaller set of check-ins, together with the friendship network induced on the corresponding users. In Section 6.1, we discuss how we modeled the utility and information disclosure for this data.

Synthetic Data:
For the second set of experiments we used synthetically generated data to better understand the properties of different disclosure functions and the performance of the proposed algorithms. There are two data-related components in our framework. The first is a hypergraph that describes the interaction between data entries and sensitive properties (see Section 2.1), and the second is a set of weights w_da representing the utility received when data entry d ∈ D is published to adversary a ∈ A. The synthetic data are generated as follows. First, we vary the total number of data entries |D|, the total number of sensitive properties |P|, and the total number of adversaries |A| ∈ {2, 3, 5, 7, 10}.

Next, we describe the scheme we used to generate the utility weights w_da. There are two properties that such a scheme needs to satisfy. The first is that assigning any entry to an adversary should induce some minimum utility, since it allows us to fulfil the task under consideration (see Example 2). The second is that certain data items should induce higher utilities when assigned to specific adversaries; e.g., some workers may have better accuracy than others in crowdsourcing, or advertisers may pay more for certain types of data. To achieve both properties, we first choose a minimum utility value u_min from a uniform distribution. Then, we iterate over all possible data-to-adversary assignments and set the corresponding weight w_da to a value drawn from a uniform distribution over a higher range with probability p_u, or to u_min with probability 1 − p_u. For our experiments we set the probability p_u to 0.4. Notice that both properties are satisfied. Finally, the weights are scaled down by dividing them by the number of adversaries |A|.

Next, we describe how we generate a random hypergraph H = (X, E), with |X| = |D| and |E| = |P|, describing the interaction between data entries and sensitive properties. To create H, we simply generate an equivalent bipartite dependency graph G (see Section 2.1) and convert it to the equivalent dependency hypergraph. In particular, we iterate over the possible data-to-sensitive-property pairs and insert the corresponding edge into G with probability p_f. For our experiments we set p_f to 0.3.

Algorithms:
We evaluated the following algorithms:
• RAND+: Each data entry is assigned to exactly t adversaries. The probability of assigning a data entry to an adversary is proportional to the corresponding utility weight w_da. We run the random partitioning 100 times, and select the data-to-adversary assignment that maximizes the objective function.
• LP: We solve the LP relaxation of the optimization problems for step (Section 4.1) and linear (Section 4.2) disclosure functions, and generate an integral solution from the resulting fractional solution using naive randomized rounding (see Section 4.1). Note that the constraints are satisfied in expectation. Moreover, we perform a second pass over the derived integral solution to guarantee that the cardinality constraints are satisfied: if a data item is not assigned to any adversary, we assign it to the adversary with the highest weight, i.e., the highest corresponding fractional solution; on the other hand, if a data item is assigned to more than t adversaries, we remove the assignments with the lowest weights. This is a naive, yet effective, rounding scheme, because the fractional solutions we obtain are close to integral; more sophisticated rounding techniques can be used [4]. We run the rounding 100 times and select the data-to-adversary assignment with the maximum value of the objective.
• ILP: We solve the exact ILP for step and linear disclosure functions.
• GREEDY: Algorithm 1 with the GREEDY strategy for picking a candidate assignment.
• GRASP: Algorithm 1 with the GRASP strategy for picking candidate assignments, using n = 5 and r = 10.
• GREEDYL: Local myopic variant of Algorithm 1 (see Section 5.4) with the GREEDY strategy for picking a candidate.
• GRASPL: Local myopic variant of Algorithm 1 (see Section 5.4) with the GRASP strategy for picking candidates, using n = min(k, ·) and r = 10.

Evaluation.
To evaluate the performance of the aforementioned algorithms we used the following metrics: (1) the total utility u corresponding to the final assignment, (2) the information disclosure f for the final assignment, and (3) the tradeoff between utility and disclosure, given by u + λ(τ_I − f). We evaluated the different algorithms using different step and linear information disclosure functions for TRADEOFF.

For all experiments we set λ = 1 and assume an additive utility function of the form

    u_a(S_a) = (Σ_{d ∈ D} w_da x_da) / (Σ_{d ∈ D} Σ_{a ∈ top-t(A)} w_da),

where x_da is an indicator variable that takes value 1 when data entry d is revealed to adversary a and 0 otherwise, and top-t(A) returns the top t adversaries with respect to the weights w_da. Observe that the normalization corresponds to the maximum total utility a valid data-to-adversary assignment may have when ignoring disclosure; using this value ensures that the total utility and the quantity τ_I − f have the same scale [0, 1]. For convenience we fixed the upper information disclosure threshold to τ_I = 1. Finally, for RAND+ we perform 10 runs and report the average, while for LP we perform the aforementioned rounding procedure 10 times and report the average. The corresponding standard errors are shown as error bars in the plots below.

We start by presenting how SPARSI can be applied to real-world applications. In particular, we evaluate the performance of the proposed local-search meta-heuristic on generic information disclosure functions using Brightkite. As described at the beginning of the section, this dataset contains the check-in locations of users and their corresponding friendship network. As illustrated in Example 1, we wish to distribute the check-in information to advertisers while minimizing the information we disclose about the structure of the network. We first show how SPARSI can be used in this scenario.

Utility Weights.
Utility Weights . We start by modeling the utility provided whenadvertisers receive a subset of data entries. As mentioned aboveeach check-in entry contains information about location. We as-sume a total number of k advertisers, so that each adversary is par-ticularly interested in check-in entries that occurred in a certain ge-ographical area. Given an adversary a ∈ A , we draw w da from auniform distribution U (0 . , for all entries d ∈ D that satisfy thelocation criteria of the adversary, and w da = 0 . otherwise. Wesimulate this process by performing a random partitioning of thelocation ids across adversaries. As mentioned above we assume anadditive utility function. Sensitive Properties and Information Disclosure . The sensi-tive property in the particular setup is the structure of the socialnetwork. More precisely, we require that no information is leakedabout the existence of any friendship link among users. It is easy tosee that each friendship link is associated with a sensitive property.Now, we examine how check-in entries leak information about thepresence of a friendship link. Cho et al. [2] proved that there isa strong correlation between the trajectory similarity and the exis-tence of a friendship link for two users. Computing the trajectorysimilarity for a pair of users is easy and can be done by computingthe cosine similarity of the users given the published set of check-ins. Because of this strong correlation we assume that the informa-tion leakage for a sensitive property, i.e., a the link between a pairof users, is equal to the trajectory similarity.More precisely, let D a ⊂ D be the check-in data published to adversary a ∈ A . 
Let U denote the set of users referenced in D a .Given a sensitive property p = e ( u i , u j ) , u i , u j ∈ U, i (cid:54) = j wehave that the information disclosure for p is: f ( D a )[ p ] = CosineSimilarity ( D a ( u i ) , D a ( u j )) (20)where D a ( u i ) and D a ( u j ) denote the set of check-in data for users u i and u j respectively. We aggregate the given check-ins based ontheir unique pairs of users and locations, and we extract , data entries that contain the user information, the location and thenumber of times that user visited that particular location. Cosinesimilarity is computed over these counts new data entries. Results.
As mentioned above, we aim to minimize the information leaked about any edge in the network. We model this requirement by considering the average-case information leakage and setting τ_I = 0; in particular, we would like no information to be leaked at all, if possible. To partition the dataset we solve the corresponding TRADEOFF problem. Since we consider cosine similarity, we are limited to using RAND+ and the local-search heuristics; in particular, we compare the quality of the solutions for RAND+, GREEDYL and GRASPL. Because our implementation of GREEDY and GRASP is single-threaded, these algorithms do not scale for this particular task, and thus are omitted. However, as illustrated later (see Section 6.2), the myopic algorithms are very efficient at minimizing information leakage, and thus suitable for our current objective. Again, we run experiments for |A| ∈ {2, 3, 5, 7, 10}.

First, we point out that in all experiments GREEDYL and GRASPL reported data-to-adversary assignments with zero disclosure. This means that our algorithms were able to distribute the data entries across adversaries so that no information at all is leaked about the structure of the social network (i.e., the friendship links) to any adversary. On the other hand, the average information disclosure for RAND+ ranges from 0.99 (almost full disclosure) down to 0.1 as the number of adversaries varies from 2 to 10 (see Table 1). Disclosing the structure of the entire network with probability almost 1 violates our initial specifications; hence, RAND+ fails to solve the problem when the number of adversaries is small. As the number of adversaries increases, the average amount of disclosed information decreases.

We continue our analysis with the utility and the tradeoff objective. The corresponding results are shown in Figure 2: Figures 2(a) and 2(b) correspond to the utility and the utility-disclosure tradeoff, respectively. The corresponding disclosure is shown in Table 1.
As shown, for a small number of adversaries both GREEDYL and GRASPL generate solutions with low total utility. This is expected, since both algorithms place particular emphasis on minimizing information disclosure, due to the tradeoff formulation of the underlying optimization problem. As the number of adversaries increases, both algorithms can exploit the structure of the problem better, and offer solutions whose utility is comparable to or higher than that reported by RAND+. However, looking only at utility values can be misleading, as a very high utility value might also incur a high disclosure value. In fact, RAND+ exhibits exactly this behavior: its high-utility data-to-adversary assignments, when the number of adversaries is small, are associated with almost full disclosure of the structure of the entire social network. This is captured in Figure 2(b), where we see that the local-search algorithms clearly outperform RAND+, since no information is disclosed (see Table 1). As shown in this figure, in most cases the average objective value for RAND+ is significantly lower than the values reported by both GREEDYL and GRASPL. Recall that we compute the average over multiple runs of RAND+, where for each run we execute the algorithm multiple times and consider the best solution reported. The large error bars are indicative of the non-robustness of RAND+ for this problem.

Table 1: Average information disclosure reported by RAND+, GREEDYL and GRASPL for Brightkite. Notice that the local-search algorithms generate solutions that reveal no information about the structure of the friendship network. Standard errors are reported in parentheses.

Avg. Information Disclosure for different values of k
Alg.      k=2       k=3        k=5       k=7        k=10
RAND+     0.99(0)   0.99(0.3)  0.3(0.4)  0.1(0.18)  0.1(0.018)
GREEDYL   0         0          0         0          0
GRASPL    0         0          0         0          0

Figure 2: Tradeoff objective and utility for Brightkite: (a) utility, (b) tradeoff objective, for RAND+, GREEDYL and GRASPL. RAND+ generates solutions with high utility but almost full disclosure (see Table 1), which leads to a poor tradeoff value. GREEDYL and GRASPL disclose no information, and outperform RAND+ with respect to the overall optimization objective.
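The RAND+ baseline described above (best of several random assignments per run, averaged over independent runs) can be sketched as follows. The function names and the `tries` parameter are illustrative assumptions, not the paper's implementation; the objective is supplied by the caller.

```python
import random

def rand_plus(items, k, objective, tries=50):
    """RAND+-style baseline (sketch): draw several uniformly random
    item-to-adversary assignments and keep the best one under the
    supplied objective. The averages reported in the text come from
    repeating this whole procedure over independent runs."""
    best_score, best_assignment = None, None
    for _ in range(tries):
        assignment = {item: random.randrange(k) for item in items}
        score = objective(assignment)
        if best_score is None or score > best_score:
            best_score, best_assignment = score, assignment
    return best_assignment
```

Because each draw ignores the dependency structure entirely, the variance across runs is large, which is consistent with the wide error bars observed for RAND+.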
6.2 Synthetic Data

In this section, we present our results on the synthetic data. We examined the behavior of the proposed algorithms under several scenarios, varying the properties of the dataset to be partitioned, the number of adversaries, and the family of disclosure functions considered.
Step Functions.
We started by considering step functions. Under this family of disclosure functions, both DISCBUDGET and TRADEOFF correspond to the same optimization problem (see Section 4.1). Moreover, assuming feasibility of the optimization problem, the information disclosure will always be zero; in such cases, considering the total utility alone is sufficient to compare the performance of the different algorithms. However, when information is disclosed, comparing the number of fully disclosed properties allows us to evaluate the performance of the different algorithms.

First, we fixed the number of data entries in the dataset to |D| = 500 and considered several values of |P|. Figure 3 shows the utility derived by the data-to-adversary assignments corresponding to the different algorithms for |P| = 50. As depicted, all algorithms that exploit the structure of the dependency graph while solving the underlying optimization problem (i.e., LP, GREEDY, GREEDYL, GRASP and GRASPL) outperform RAND+. In most cases, LP, GREEDYL, GREEDY and GRASP were able to find the optimal solution reported by ILP. The high performance of the LP algorithm is justified by the fact that the fractional solution it reported was in fact an integral solution.

GRASPL found solutions with non-optimal utilities, which are still better than those of RAND+. We conjecture that this performance decrease is due to randomization. This is more obvious if we contrast the performance of GRASPL with GRASP: randomization leads to worse solutions (with respect to utility) when keeping a myopic view of the given optimization problem, whereas it is helpful in the case of non-myopic local search. Recall that the reported numbers correspond to no information disclosure, while missing values correspond to full disclosure of at least one sensitive property. As depicted, GREEDY failed to find a valid solution when splitting the data across 2 adversaries; however, when randomization was used, GRASP was able to find the optimal solution. We observed similar performance for different values of |D|; these results are omitted due to space constraints.

Next, we ran a second set of experiments to investigate the performance of the proposed heuristics with respect to the amount of disclosed information. Recall that the local-search algorithms do not explicitly check for infeasible values of disclosure for step functions, since they optimize the tradeoff between utility and disclosure; therefore, they might report infeasible solutions. We fixed the number of sensitive properties to |P| = 50 and the number of adversaries to k = 2, and we varied the number of data items |D| ∈ {100, 200, 300, 500}. We observed the same behavior as in the previous experiment, i.e., all local-search heuristics failed to report feasible solutions.
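A minimal sketch of a step disclosure function and of counting fully disclosed properties (the quantity reported in Table 2 below) follows. The exact form of the step function is stated as an assumption: here a property is treated as fully disclosed once a single adversary holds at least a `threshold` fraction of its data items, and nothing leaks below that point. All names are illustrative.

```python
def step_disclosure(fraction, threshold=1.0):
    """Step disclosure function (one plausible form, assumed here):
    zero leakage below the threshold, full disclosure at or above it."""
    return 1.0 if fraction >= threshold else 0.0

def fully_disclosed_properties(items_of_prop, assignment, k, threshold=1.0):
    """Count the sensitive properties fully disclosed to at least one
    adversary under a step disclosure function."""
    count = 0
    for prop, items in items_of_prop.items():
        held = [0] * k
        for item in items:
            held[assignment[item]] += 1
        if any(step_disclosure(h / len(items), threshold) == 1.0 for h in held):
            count += 1
    return count
```

Under this formulation, zero disclosure is achieved exactly when no adversary accumulates a threshold fraction of any property's items, which is why feasible solutions always have zero disclosure for step functions.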
Figure 3: Utility for step disclosure functions (|D| = 500, |P| = 50), for RAND+, LP, ILP, GREEDY, GREEDYL, GRASP and GRASPL. All reported numbers correspond to no information disclosure, while missing values correspond to full disclosure.
The corresponding results are presented in Table 2. As the number of data entries increases for a fixed number of sensitive properties, the number of fully disclosed properties decreases. This behavior is expected: as the number of data entries per property increases, it becomes easier for our heuristics to find a partitioning where the data items for a single property are split across adversaries, inducing zero disclosure.
Table 2: Fully disclosed properties for data-to-adversary assignments under step disclosure functions. As the number of data entries per property increases, the local-search heuristics exploit the structure of the problem better and report solutions with fewer fully disclosed properties.
Number of fully disclosed properties (|P| = 50, k = 2)
Alg.      |D|=100  |D|=200  |D|=300  |D|=500
GREEDYL   11       6        3        0
GRASPL    11       6        4        0
GREEDY    11       6        3        1
GRASP     11       6        5        0

The experiments above show that GREEDY and GRASP are viable alternatives for the case of step functions. However, for harder instances of the problem, where the number of sensitive properties is large and the number of adversaries is small, solving the LP relaxation offers a more robust and reliable alternative.

Linear Functions.
We continue our discussion and present our experimental results for linear functions. First, we compared the quality of the solutions produced when solving DISCBUDGET and TRADEOFF optimally, for both worst-case and average disclosure. We generated a synthetic instance of the problem by setting |D| = 50 and |P| = 10, and ran ILP for |A| ∈ {2, 3, 5, 7, 10}. We set the maximum allowed disclosure to τ_I = 0.… and t = 2.

The utility and the corresponding disclosure when DISCBUDGET and TRADEOFF are solved optimally are shown in Figure 4. As shown, the worst-case disclosure remains at the same level across different numbers of adversaries. Now, consider the case of average disclosure (see Equation 2): for both versions of the optimization problem, the disclosure decreases as the number of adversaries increases. However, the optimization corresponding to TRADEOFF is able to exploit the presence of multiple adversaries better, reducing disclosure while maintaining high utility.

Consequently, we evaluated RAND+, LP, GREEDYL, GRASPL, GREEDY and GRASP on solving TRADEOFF. We set the upper disclosure to τ_I = 0, denoting our requirement to minimize disclosure as much as possible. We do not report any results for the ILP, since for |D| > … it was not able to terminate in reasonable time. First, we fixed the number of properties to |P| = 50 and considered instances with several values of |D|, using average disclosure. We show the performance of the algorithms for |D| = 500 in Figure 5. As shown, LP, GREEDY and GRASP outperform RAND+ both in terms of the overall utility and the average disclosure. In fact, RAND+ performs poorly, as it returns solutions with higher disclosure but lower utility than the LP. Furthermore, the performance gap between the local-search algorithms and RAND+ keeps increasing as the number of adversaries increases. This is expected, as the proposed heuristics take into account the structure of the underlying dependency graph, and hence can exploit the presence of multiple adversaries to achieve higher overall utility and lower disclosure.

Furthermore, we see that the solutions derived using the LP approximation provide the largest utility, while the solutions derived using the proposed local-search algorithms minimize the disclosure. As presented in Figure 5, using the myopic construction returns solutions with low overall utility. This behavior is expected, since the algorithm does not maintain a global view of the data-to-adversary assignment. Observe that randomization improves the quality of the solution with respect to total utility when the global-view construction is used (see GREEDY and GRASP), while it yields lower-quality solutions when the myopic construction is used.

Measuring the average disclosure is an indicator of the overall performance of the proposed algorithms; however, it does not provide detailed feedback about the information disclosure across properties. To understand the exact information disclosure for the solutions returned by the different algorithms, we measured the total number of properties that exceed a particular disclosure level. We present the corresponding plots for |D| = 500, |P| = 50, k = 2 and k = 10 in Figure 6. As shown, the proposed local-search algorithms can exploit the presence of multiple adversaries very effectively in order to minimize disclosure: if we compare Figures 6(a) and 6(b), we see that the total number of properties reaches zero for a significantly smaller disclosure threshold in the presence of ten adversaries.

Figure 4: The (a) utility and (b) disclosure when solving DISCBUDGET and TRADEOFF optimally (|D| = 50, |P| = 10), for both worst-case and average disclosure. TRADEOFF can exploit the presence of multiple adversaries better, reducing disclosure while maintaining high utility.
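The contrast drawn above between the myopic and the global-view constructions can be illustrated with a sketch of a myopic greedy assignment: each item is placed, in turn, with the adversary that currently offers the best marginal value, and no global view of the final assignment is maintained. The `marginal_value` callback is a hypothetical stand-in for the paper's utility/disclosure tradeoff objective; all names are illustrative.

```python
def greedy_assign(items, k, marginal_value):
    """Myopic greedy construction (sketch): items are placed one at a
    time with the adversary maximizing the caller-supplied marginal
    value of the current partial assignment."""
    assignment = {}
    for item in items:
        assignment[item] = max(range(k),
                               key=lambda a: marginal_value(assignment, item, a))
    return assignment
```

For example, with a marginal value that simply penalizes loaded adversaries, the construction balances items across adversaries; a disclosure-aware objective would instead steer each property's items away from any single adversary.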
Figure 6: The number of properties that exceed a particular disclosure level, for |D| = 500 and |P| = 50, for RAND+, LP, GREEDYL, GRASPL, GREEDY and GRASP. GREEDYL and GRASPL can exploit the presence of multiple adversaries more effectively to minimize disclosure. The total number of properties reaches zero for a significantly smaller disclosure threshold in the presence of ten adversaries.
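The per-level tallies behind curves like those in Figure 6 amount to a simple count over the per-property disclosure values; a minimal sketch (function name is illustrative):

```python
def properties_exceeding(per_prop_disclosure, levels):
    """For each disclosure level, count how many sensitive properties
    exceed it -- the quantity plotted against the disclosure level."""
    return [sum(1 for d in per_prop_disclosure if d > level) for level in levels]
```

A curve that drops to zero at a small level indicates that every property's disclosure is bounded by that level, which is the behavior the local-search algorithms exhibit with many adversaries.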
7. RELATED WORK
There has been much work on the problem of publishing (or allowing aggregate queries over) sensitive datasets (see the surveys [1, 5]). Here, information disclosure is characterized by a privacy definition, which consists of either syntactic constraints on the output dataset (e.g., k-anonymity [20] or ℓ-diversity [17]) or constraints on the publishing or query-answering algorithm (e.g., ε-differential privacy [5]). Each privacy definition is associated with a privacy level (k, ℓ, ε, etc.) that represents a bound on the information disclosure. Typical algorithmic techniques for data publishing or query answering, which include generalization (coarsening of values), suppression, output perturbation, and sampling, attempt to maximize the utility of the published data given some level of privacy (i.e., a bound on the disclosure). Krause et al. [14] consider the problem of trading off utility for disclosure, with general submodular utility and supermodular disclosure functions; their paper formulates a submodular optimization problem and presents efficient algorithms for it. However, all the above techniques assume that all the data is published to a single adversary. Even when multiple parties may ask different queries, prior work makes the worst-case assumption that they collude arbitrarily. In this paper, on the other hand, we formulate the novel problem of multiple non-colluding adversaries, and develop near-optimal algorithms for trading off utility for information disclosure in this setting.

Figure 5: Tradeoff objective, utility and disclosure for linear functions considering average disclosure (|D| = 500, |P| = 50): (a) tradeoff objective, (b) utility, (c) average disclosure, for RAND+, LP, GREEDYL, GRASPL, GREEDY and GRASP. LP, GREEDY and GRASP outperform RAND+ both in terms of total utility and average disclosure. LP maximizes utility, while the local-search heuristics are the most effective in minimizing disclosure.
8. CONCLUSIONS AND FUTURE WORK
More and more sensitive information is released on the Web and processed by online services, naturally raising privacy concerns in domains where detailed and fine-grained information must be published. In this paper, motivated by applications like online advertising and crowdsourcing markets, we introduce the problem of privacy-aware k-way data partitioning, namely, the problem of splitting a sensitive dataset among k untrusted parties. We present SPARSI, a theoretical framework that allows us to formally define the problem as an optimization of the tradeoff between the utility derived by publishing the data and the maximum information disclosure incurred to any single adversary. Moreover, we prove that solving it is NP-hard via a reduction from hypergraph partitioning. We present a performance analysis of different approximation algorithms on a variety of synthetic and real-world datasets, and demonstrate how SPARSI can be applied in the domain of online advertising. Our algorithms are able to partition user-location data among multiple advertisers while ensuring that almost no sensitive information about potential friendship links between these users can be inferred by any single advertiser.

Our research so far has raised several interesting directions. To our knowledge, this is the first work that leverages the presence of multiple adversaries to minimize the disclosure of private information while maximizing utility. While we provided worst-case guarantees for several families of disclosure functions, an interesting future direction is to examine whether rigorous guarantees can be provided for other widely used information disclosure functions, like information gain, or whether the current ones can be improved. Finally, it is of particular interest to consider how the proposed framework can be extended to interactive scenarios, where data are published to adversaries more than once, or to streaming data, where the partitioning must be done in an online manner.
9. REFERENCES

[1] B.-C. Chen, D. Kifer, K. LeFevre, and A. Machanavajjhala. Privacy-preserving data publishing. Foundations and Trends in Databases, 2(1-2):1-167, 2009.
[2] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In KDD, 2011.
[3] I. Dinur, O. Regev, and C. D. Smyth. The hardness of 3-uniform hypergraph coloring. In FOCS, 2002.
[4] B. Doerr, M. Künnemann, and M. Wahlström. Randomized rounding for routing and covering problems: experiments and improvements. In SEA, 2010.
[5] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.
[6] T. A. Feo and M. G. Resende. Greedy randomized adaptive search procedures. Journal of Global Optimization, 1995.
[7] S. Fujishige. Submodular Functions and Optimization. Annals of Discrete Mathematics. Elsevier, 2005.
[8] R. Gandhi, S. Khuller, S. Parthasarathy, and A. Srinivasan. Dependent rounding and its applications to approximation algorithms. J. ACM, 2006.
[9] M. X. Goemans, N. J. A. Harvey, S. Iwata, and V. Mirrokni. Approximating submodular functions everywhere. In SODA, 2009.
[10] D. Golovin. Max-min fair allocation of indivisible goods. Technical report, 2005.
[11] S. Guha, B. Cheng, and P. Francis. Privad: practical privacy in online advertising. In NSDI, 2011.
[12] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. CoRR, 2010.
[13] S. Khot and A. K. Ponnuswami. Approximation algorithms for the max-min allocation problem. In APPROX/RANDOM, 2007.
[14] A. Krause and E. Horvitz. A utility-theoretic approach to privacy and personalization. In AAAI, 2008.
[15] A. Kulik, H. Shachnai, and T. Tamir. Maximizing submodular set functions subject to multiple linear constraints. In SODA, 2009.
[16] T. Li and N. Li. Injector: Mining background knowledge for data anonymization. In ICDE, 2008.
[17] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. ℓ-diversity: Privacy beyond k-anonymity. In ICDE, 2006.
[18] A. Marshall. Principles of Economics. 1890.
[19] P. Raghavan and C. D. Thompson. Randomized rounding: A technique for provably good algorithms and algorithmic proofs. Combinatorica, 1987.
[20] L. Sweeney. k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst., 2002.
[21] V. Toubiana, A. Narayanan, D. Boneh, H. Nissenbaum, and S. Barocas. Adnostic: Privacy preserving targeted advertising. In NDSS, 2010.