Finding the Sweet Spot for Data Anonymization: A Mechanism Design Perspective
Abdelrahman Eldosouky, Tapadhir Das, Anuraag Kotra, Shamik Sengupta
Abstract—Data sharing between different organizations is an essential process in today's connected world. However, recently there have been many concerns about data sharing, as sharing sensitive information can jeopardize users' privacy. To preserve privacy, organizations use anonymization techniques to conceal users' sensitive data. However, these techniques are vulnerable to de-anonymization attacks, which aim to identify individual records within a dataset. In this paper, a two-tier mathematical framework is proposed for analyzing and mitigating de-anonymization attacks, by studying the interactions between the sharing organizations, the data collector, and a prospective attacker. In the first level, a game-theoretic model is proposed to enable the sharing organizations to optimally select their anonymization levels for k-anonymization under two potential attacks: the background-knowledge attack and the homogeneity attack. In the second level, a contract-theoretic model is proposed to enable the data collector to optimally reward the organizations for their data. The formulated problems are studied under single-time sharing and repeated sharing scenarios. Different Nash equilibria for the proposed game and the optimal solution of the contract-based problem are analytically derived for both scenarios. Simulation results show that the organizations can optimally select their anonymization levels, while the data collector can benefit from incentivizing the organizations to share their data.
Index Terms—Data anonymization, k-anonymity, game theory, contract theory
This paper is an extension of the work originally presented in [1].

1 INTRODUCTION

The rise of Big Data has helped generate tremendous amounts of digital information that are continually being collected, analyzed, and distributed. This technology has helped organizations personalize their services, optimize their decision making, and predict future trends [2]. Nevertheless, these operations tend to raise public concern because much of the data contains users' sensitive information. To address these concerns and preserve user privacy, organizations deploy robust security mechanisms to protect their data against different forms of cyber-attacks [3]. Consequently, concepts like data security, privacy, and trust have recently received significant attention in the literature as different forms of preserving the data [4], [5], [6], [7], [8], [9].

Yet, conventional security mechanisms are not handy when it comes to data sharing. For instance, encryption-based mechanisms can help secure data shared between different parts or sites of the same organization, e.g., patients' remote monitoring [10]. However, it is not feasible to widely share encrypted data among many organizations, due to key management issues. One solution for data sharing is to remove the sensitive information, e.g., name, phone number, and address, from a dataset before sharing it. However, it was shown that the remaining unique characteristics of the dataset can still be used to identify users [11]. To further preserve privacy when sharing data, anonymization techniques have been proposed to ensure that each record in a dataset is indistinguishable from the others, by removing identifiable features and, hence, reducing the probability of identifying individual records. Prominent anonymization methods include k-anonymization [12], l-diversity [13], and t-closeness [14], where k-anonymization was introduced first and then l-diversity and t-closeness were introduced as expansions on it that provide further modifications to the dataset, making it more challenging to differentiate the individual records.

Despite these developments, specific de-anonymization techniques, like background knowledge attacks [15] and homogeneity attacks [16], are able to compromise the security of these approaches, i.e., k-anonymization, l-diversity, and t-closeness. As k-anonymization is the basic technique behind l-diversity and t-closeness, the research conducted in this paper focuses primarily on k-anonymization. In k-anonymization, the magnitude of k directly corresponds to the level of privacy achieved on the dataset. However, it also corresponds to the amount of information loss from the dataset, which may reduce its usefulness to other organizations if the value of k is too big. The goal for the organizations is, then, to optimally select k such that it maximizes privacy while minimizing information loss. The work in [17] proposed two algorithms to reduce the information loss associated with k-anonymization. However, as these algorithms depend on the structure of the data, they cannot be generalized. To this end, choosing the optimal value of k in k-anonymization remains an open problem.

Different techniques have been proposed to enable the organizations to preserve the privacy of their shared information [18], [19], [20], [21], [22], [23].
The authors in [18] studied the case of asymmetric information sharing in which a data collector interacts with multiple data owners sequentially and each data owner possesses a record desired by the collector. The authors proposed a pricing technique to enable the data collector to determine the optimal value of the data. The work in [19] addressed the privacy issue of data sharing using blockchains, in which the authors proposed a novel and efficient protocol to mitigate Sybil attacks by malicious users through a time-locked deposit protocol while minimizing the execution costs. The work in [20] introduced a data analytics system for privacy preservation in data forwarding and aggregation, by using summary statistics of encrypted value aggregations. The authors in [21] proposed a technique to share trained machine learning models, instead of the original data, in applications that use the data for prediction purposes. Meanwhile, the works in [22] and [23] investigated the concept of data sharing using an information sharing platform, with the goal of achieving collaborative information sharing between the organizations. For instance, the work in [22] proposed a collaborative information sharing environment to provide cyber incident prevention and protection for the shared data. The work in [23] addressed the problem of information leakage by preventing employees in collaborating organizations from transferring sensitive information, accidentally or deliberately, to non-authorized users. However, one limitation of the works in [18], [19], [20], [21], [22], [23] is that they focus on preserving the privacy of the shared data from a passive standpoint, i.e., they do not consider active attacks that try to directly reveal the sensitive information of the dataset.

Other works in the literature have studied privacy, under possible attacks, by analyzing the interactions between the sharing organizations and the attacker using game theory [24]. Game theory is a powerful mathematical framework that enables studying the interactions between different decision makers and is widely used in many security domains [25], [26], [27]. Similarly, game theory has recently been used to preserve the privacy of shared data [28], [29], [30]. For instance, the work in [28] studied the trade-off between information sharing and the cost of privacy preservation based on incentives from the information sharing platform. The authors in [29] developed a game-theoretic approach to share genomic data that accounts for the adversarial behavior and the available resources. The work in [30] studied the problem of sharing security data for investment purposes, and the authors proposed a game-theoretic approach to make decisions based on the privacy risk and the security knowledge needs. However, while the works in [28], [29], [30] help to preserve the privacy of the shared data, their approaches do not apply to data anonymization and, hence, to the problem of optimizing the anonymization level.

Finally, we note that game theory has also been used for data anonymization [31], [32], [33]. In [31], the authors introduced a game-theoretic approach to ensure k-anonymity when generating dummy data for a location-based service that involves multiple users. The authors in [32] used a coalitional game-theoretic model to fix the anonymization level of k-anonymity based on a given threshold for information loss.
However, the works in [31] and [32] are based on the assumption of a fixed anonymization level and do not enable optimizing its selection. The authors in [33] introduced an approach to optimize the anonymization level selection in a scenario consisting of three different parties: a data provider, a data collector, and a data user. The equilibrium of the game formulated in [33] was analytically derived to choose a value of k that represents the shared agreements between the different parties. However, the work in [33] does not consider the presence of an attacker, which can affect the utilities of the different parties.

The main contribution of this paper is a general framework to optimize the selection of anonymization levels under possible attacks. In particular, we propose a multi-level framework to study the interactions between the different entities involved in data sharing, i.e., the sharing organizations, an information sharing platform (data collector), and an attacker. In the first level, we formulate a game-theoretic model to analyze the interactions between the organizations and the attacker, where each organization chooses a value of k that maximizes its outcome based on the expected attacks and the choices of other organizations. Meanwhile, an attacker can choose from a set of attacks based on its expected outcome. The framework considers two types of de-anonymization attacks: the background knowledge attack [15] and the homogeneity attack [16]. First, we consider the case of single-time sharing, which corresponds to a static nonzero-sum game, for which we analytically derive both the pure and the mixed-strategy equilibrium points. Then, we solve a dynamic game, which represents a repeated sharing scenario, by tracing the change of the utilities over time.

In the second level of the framework, we formulate an optimal contract-theoretic problem to study the interactions between the various organizations and the data collector. In particular, the data collector offers contracts to the organizations that maximize its own reward while incentivizing each organization to accept a contract and to share its data with the data collector. The problem is formulated using the framework of contract theory [34], which provides a set of tools for modeling the relations between a principal (the data collector) and a number of agents (the organizations). Note that contract theory has been used in a wide variety of applications to solve the principal-agent problem, e.g., [35], [36], [37]. However, to the best of our knowledge, this is the first work to use contract theory in information sharing problems to mitigate de-anonymization attacks. To this end, the optimal solution of the contract-based problem is analytically derived under the single-time sharing scenario. Then, we propose an approach in the repeated sharing scenario to incentivize more organizations to share their data.

The rest of the paper is organized as follows. The system model and the proposed two-tier framework are formulated in Section 2. The equilibrium analysis of the proposed game and the optimal solution of the contract-based model are derived for the static case in Section 3. The solutions of the dynamic case are derived in Section 4. Numerical results are presented and analyzed in Section 5. Finally, conclusions are drawn in Section 6.

2 SYSTEM MODEL
The goal of k-anonymization is to make each record in the shared dataset indistinguishable from at least k − 1 other records [12]. This can be achieved by applying some operations on the dataset attributes, such as generalization and suppression. In general, the attributes of a dataset can be classified into key attributes, such as name and address; quasi-identifiers, such as date of birth, zip code, and gender; and sensitive attributes, which are specific attributes such as medical records and salaries. When sharing a dataset, researchers are interested in the sensitive attributes, which represent the valuable information in the dataset. To preserve the privacy of the individuals in the dataset, key attributes are always removed before sharing. On the other hand, quasi-identifiers are the attributes that are processed to achieve k-anonymity. For instance, in the generalization process, the quasi-identifiers are replaced with less specific values, e.g., removing the last two digits from the zip code. Suppression is used to remove specific records, usually outliers, that, if kept, would cause too much information loss to achieve a specific k value under the generalization process.

After achieving a specific k value, the probability of identifying an individual record is reduced to 1/k. However, some attacks can increase this probability for an attacker. For instance, under a background knowledge attack, the attacker collects side information to help eliminate some values from the dataset, thereby increasing the probability of identifying specific records [38]. On the other hand, under a homogeneity attack, the sensitive attributes might be the same for more than one record among the k indistinguishable records, which increases the probability of inferring the sensitive information of a specific record [13]. Note that, in this work, we do not consider special datasets with frequent common values in the sensitive attributes that can self-reveal the sensitive attributes even after being anonymized.

Consider an information exchange scenario in which some organizations interact with a data collector to share sets of data. The data collector is a system that collects data from different organizations and manages how this data is shared later for different purposes, e.g., research, marketing, etc. Typically, the shared data can contain sensitive information; therefore, the organizations usually apply an anonymization technique to their data before sharing it. We also consider the presence of an attacker that targets the shared data, at the data collector side, by applying de-anonymization techniques and extracting the sensitive information. Fig. 1 shows the interactions in our system model for the case of two organizations. As discussed earlier, we consider the case where the organizations perform the k-anonymization technique to preserve privacy. Here, we assume that the organizations are able to achieve the desired k level by using both generalization and suppression techniques.

As there are three types of parties in the system, we propose a two-tier model to capture the different interactions between the parties. First, we use game theory to model the interactions between the organizations and the attacker, to enable the organizations to strategically choose the appropriate level of data anonymization. In the second tier, we formulate a contract-theoretic problem for the data collector to optimize its rewards from the shared data by determining the optimal payments to the organizations.
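Before formalizing the two tiers, the anonymization operations described above can be made concrete with a short sketch. The following Python example (the toy records, quasi-identifiers, and generalization rules are illustrative assumptions, not taken from the paper) generalizes a small table and checks whether it satisfies k-anonymity:

```python
from collections import Counter

# Toy records: (name, zip, age, diagnosis). Name is a key attribute,
# (zip, age) are quasi-identifiers, diagnosis is the sensitive attribute.
records = [
    ("Alice", "89501", 34, "flu"),
    ("Bob",   "89502", 36, "cold"),
    ("Carol", "89503", 33, "flu"),
    ("Dave",  "89509", 38, "asthma"),
]

def generalize(record):
    """Drop the key attribute and coarsen the quasi-identifiers:
    keep the first three zip digits and bucket age into decades."""
    _, zip_code, age, diagnosis = record
    decade = 10 * (age // 10)
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", diagnosis)

def is_k_anonymous(rows, k):
    """A table is k-anonymous if every quasi-identifier combination
    (all columns except the sensitive one) occurs at least k times."""
    groups = Counter(row[:-1] for row in rows)
    return all(count >= k for count in groups.values())

anonymized = [generalize(r) for r in records]
print(anonymized)
print(is_k_anonymous(anonymized, k=4))  # True: all rows share ("895**", "30-39")
```

Note that the anonymized rows in this example also illustrate the homogeneity risk: if all records in an equivalence class shared the same diagnosis, k-anonymity alone would not hide it.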
Next, we define the different entities in the system, following which we formulate the mathematical models for both of the system's tiers.

Fig. 1. The interactions between the organizations, the data collector, and the attacker.
Organizations: Consider a set N of N organizations that share their data. Each organization uses the k-anonymization technique to make its shared data anonymous. The goal of organization i is to choose the best value of k_i to maximize its payoff, given the other organizations' k values and the possible attacks on the data. Each organization can take an action by choosing an anonymization level from a set D of available levels.

Attacker: An attacker targets the data, at the data collector side, in order to reveal the sensitive information. We assume the attacker can anticipate the level of anonymization used by analyzing the structure of the dataset. The attacker has three actions to choose from. Let a ∈ A = {B, H, N} represent the attacker's action of performing a background knowledge attack, performing a homogeneity attack, or not attacking, respectively.

Data Collector: A data collector is a system, or an independent organization, whose objective is to collect data from the different organizations and earn profits by performing data mining. Here, the data collector needs to find the appropriate rewards so that it can achieve a positive outcome while incentivizing the organizations to share their data.
In the first tier of the framework, we study the interactions between the organizations and the attacker. Since the attacker has goals opposing those of the organizations, game theory [24] represents a suitable mathematical framework for modeling such interactions. In the proposed game, each player wants to maximize its payoff based on its actions and the other players' actions. The players' payoffs are given in the form of utility functions, which map the players' actions to their outcomes, as discussed next.
Organizations: The utility of each organization is given as a function of the reward it gets from the data collector, r_i(k_i), the cost of applying the anonymization technique, c_i(k_i), the level of trust in the data collector, T_i(k_i), and the probability of data breach, b_i(k_i, k_{-i}, a), where k_{-i} refers to the other organization's action. Let u_i be the utility of organization i; it can then be given by:

$$u_i(k_i, k_{-i}, a) = r_i(k_i) \cdot (1 - b_i(k_i, k_{-i}, a)) - c_i(k_i) + T_i(k_i), \quad (1)$$

where the first term represents the probability of receiving the reward based on all players' actions. The reward function r_i(k_i) is given by the data collector, as discussed later.

The cost function c_i(k_i) depends on the choice of k_i. Here, we propose to define the anonymization cost as a function of the computational complexity of executing the k-anonymization procedure. This complexity was shown in [39] to equal $O(|V|^{k})$, where |V| represents the number of different subsets of the dataset based on their common attributes for anonymization purposes. Using this complexity function, we propose to define the cost function c_i(k_i) as:

$$c_i(k) = \log\left(|V|^{k}\right), \quad (2)$$

where the log function converts the time complexity into monetary values on the same scale as the reward values.

Next, we consider the level of trust in the data collector, T_i(k_i), which represents how much an organization trusts the data collector to protect its dataset against cyber threats, e.g., breaches. We propose to define the level of trust as:

$$T_i(k_i) = \gamma \cdot k_i, \quad (3)$$

where γ is the coefficient of trust. Note that the level of trust is defined as an increasing function of k_i: when the anonymization level increases, the organizations will be more confident that their information will be safe even under breaches, as it is less informative.

Finally, we consider the breach probability for each organization's shared data, b_i(k_i, k_{-i}, a). We use the probability of breach defined in [40] as:

$$b(a, k_i) = \frac{p(a)}{\alpha k_i + 1}, \quad (4)$$

where p(a) is the probability of a successful attack, based on the attack type, and α > 0 is a measure of information security. Note that, in [40], the probability of breach is given as a function of the organization's investment, which is analogous to the level of anonymization k_i as the organization's investment in protecting its shared data.

Equation (4) represents the organization's own probability of breach. When multiple organizations share to the same platform, the probability of successful attacks increases, as an attacker can link information from different datasets to identify the records [41]. Here, we propose to model this interdependency similar to the model in [42], such that the interdependent risk between two organizations is given as:

$$b(a, k_i, k_{-i}) = 1 - \left(1 - \frac{p(a)}{\alpha k_i + 1}\right)\left(1 - \frac{p(a)}{\alpha k_{-i} + 1}\right). \quad (5)$$

Substituting (5) in (1), the utility of any organization, in terms of its own and other organizations' anonymization levels, can then be given as:

$$u_i(k_i, k_{-i}, a) = r_i(k_i) \cdot \left(1 - \frac{p(a)}{\alpha k_i + 1}\right) \cdot \left(1 - \frac{p(a)}{\alpha k_{-i} + 1}\right) - c_i(k_i) + \gamma \cdot k_i. \quad (6)$$

Attacker: For the case of two organizations, the attacker's utility can be given in terms of its probability of achieving the reward from the information and the cost of applying its attack.
Thus, we define the attacker's utility u_a as follows:

$$u_a(k_1, k_2, a) = b(a, k_1, k_2)\, R_a - c_a(a), \quad (7)$$

where R_a is the reward for revealing the real data, which can be achieved based on the combined breach probabilities of the datasets, and c_a(a) is the cost of performing each type of attack. Note that (5) can be rewritten as:

$$b(a, k_1, k_2) = \frac{p(a)}{\alpha k_1 + 1} + \frac{p(a)}{\alpha k_2 + 1} - \frac{p(a)}{\alpha k_1 + 1} \cdot \frac{p(a)}{\alpha k_2 + 1}, \quad (8)$$

and, thus, the attacker's utility in (7) can be written as:

$$u_a(k_1, k_2, a) = \left(\frac{p(a)}{\alpha k_1 + 1} + \frac{p(a)}{\alpha k_2 + 1} - \frac{p(a)}{\alpha k_1 + 1} \cdot \frac{p(a)}{\alpha k_2 + 1}\right) R_a - c_a(a). \quad (9)$$

Here, according to the nature of the homogeneity attack, the attacker will benefit if the two organizations use the same anonymization level for sharing datasets with related information. This is because the similar structure of the shared data increases the probability of identifying the individual records. Let p(H_s) be the success probability of the homogeneity attack when the organizations use the same anonymization level. Similarly, let p(H_d) be the success probability of the homogeneity attack when the organizations use different anonymization levels, such that p(H_s) > p(H_d). We assume p(B) > p(H_d) > 0, i.e., the success probability of the background knowledge attack is higher than that of the homogeneity attack under different anonymization levels. That is because the attacker can use the background knowledge to link with the shared data and increase its chances of identifying the records. However, p(B) can be higher or lower than p(H_s).

For the attack cost, the cost of performing the background knowledge attack is assumed to be higher than that of the homogeneity attack, i.e., c_a(B) > c_a(H) > 0. This is because the attacker will spend more time collecting the background information and linking the similar information. Note that, when the attacker chooses not to attack, its utility u_a(N, k_1, k_2) will equal zero. This choice will be superior for the attacker if the cost of performing the attack exceeds the reward from revealing the information.

To this end, we define a game G = {N, D, A, U} such that N is the set of the players, which includes all the organizations as well as the attacker; D is the set of the organizations' strategies; A is the set of the attacker's strategies; and U is the set of all players' utilities.
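To make the payoff structure concrete, the following Python sketch (a minimal implementation whose numeric parameters are arbitrary illustrative choices, not values from the paper) evaluates the organization utility in (6) and the attacker utility in (9) for one pair of anonymization levels:

```python
import math

def breach(p_a, alpha, k):
    # Single-organization breach probability, Eq. (4).
    return p_a / (alpha * k + 1)

def joint_breach(p_a, alpha, k1, k2):
    # Interdependent breach probability of two organizations, Eq. (5).
    return 1 - (1 - breach(p_a, alpha, k1)) * (1 - breach(p_a, alpha, k2))

def org_utility(r, p_a, alpha, k_i, k_j, V, gamma):
    # Organization utility, Eq. (6): discounted reward - cost + trust.
    cost = math.log(V ** k_i)   # Eq. (2)
    trust = gamma * k_i         # Eq. (3)
    return (r * (1 - breach(p_a, alpha, k_i))
              * (1 - breach(p_a, alpha, k_j)) - cost + trust)

def attacker_utility(p_a, alpha, k1, k2, R_a, c_a):
    # Attacker utility, Eqs. (7)-(9): expected reward minus attack cost.
    return joint_breach(p_a, alpha, k1, k2) * R_a - c_a

# Arbitrary illustrative parameters: a background attack with p(B) = 0.6
# against organizations anonymizing at k = 3 and k = 7.
print(org_utility(r=40.0, p_a=0.6, alpha=0.5, k_i=3, k_j=7, V=8, gamma=1.0))
print(attacker_utility(p_a=0.6, alpha=0.5, k1=3, k2=7, R_a=100.0, c_a=12.0))
```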
In the second tier of the framework, we study the interactions between the data collector and the organizations. We notice that the data collector gives rewards to the organizations for their shared data. Since the data collector has the power to price the data, it can make take-it-or-leave-it offers, i.e., if an organization is offered a low price, it can refuse it and not share its data with the data collector. To this end, we propose to use contract theory [34] to model the data collector's problem of finding the optimal prices for the data.

The goal of the data collector is to collect the data from the organizations and make profits by performing data mining on it. The data collector will be referred to as the principal in this section. The utility function of the principal can be given as:

$$U_d = \sum_{i=1}^{N} \theta_i (v_i - r_i(k_i)), \quad (10)$$

where v_i is the principal's evaluation of the received data from organization i, and r_i(k_i) is the reward paid to organization i for its data. The evaluation v_i represents the data collector's expected profits from obtaining this data. Finally, θ_i is the organization's type, which specifies how the principal perceives different organizations in the market, such that 0 ≤ θ_i ≤ 1. Since the main difference between the organizations is their k selection, θ_i needs to be a function of k_i.

Here, we propose to define θ = 1/k, which satisfies 0 ≤ θ_i ≤ 1. Moreover, the principal's utility will be a declining function of k: when the level of anonymization increases, the information will be less informative and, hence, its value will decrease. In return, the data collector will give smaller rewards to the organizations for their data. Note that, by using θ = 1/k, when k = 1, i.e., no anonymization, the data collector can obtain the full value of the data and the organizations can obtain the full reward. For every k > 1, the reward will be declining such that, for large values of k, any increase in k causes only a small decrease in r_i. This can be interpreted as follows: when the anonymization level increases, the information becomes less useful up to some point where an increased k has very little effect on the information loss (reward). This is captured by the heavy tail of the function θ = 1/k.

The principal's problem is to design different contracts for the different organization types such that organizations of the same type will be given equal payments. The contracts offered by the principal need to be feasible for the organizations, i.e., they need to be persuasive enough for the organizations to accept. To this end, the contracts must satisfy two key properties [34], which are individual rationality (IR) and incentive compatibility (IC).

1) Individual Rationality (IR):
As the organizations are rational, the given payments need to ensure a non-negative utility for each organization, i.e.,

$$\theta_i r_i(k_i) - c_i(k_i) + T_i(k_i) \geq 0, \quad \forall i \in \mathcal{N}. \quad (11)$$

2) Incentive Compatibility (IC):
Each organization must always prefer the contract designed for its type over all other contracts. This ensures that an organization can achieve a better utility only if it chooses the contract designed for its type:

$$\theta_i r_i(k_i) - c_i(k_i) + T_i(k_i) \geq \theta_i r_j(k_j) - c_j(k_j) + T_j(k_j), \quad \forall i, j \in \mathcal{N},\ i \neq j. \quad (12)$$

To this end, the principal's problem of finding the optimal contracts can be given as:

$$\max_{r_i(k_i)} \sum_{i=1}^{N} \theta_i (v_i - r_i(k_i))$$
$$\text{s.t.} \quad \theta_i r_i(k_i) - c_i(k_i) + T_i(k_i) \geq 0, \quad \forall i \in \mathcal{N},$$
$$\theta_i r_i(k_i) - c_i(k_i) + T_i(k_i) \geq \theta_i r_j(k_j) - c_j(k_j) + T_j(k_j), \quad \forall i, j \in \mathcal{N},\ i \neq j. \quad (13)$$
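For fixed anonymization levels, problem (13) is linear in the rewards r_i(k_i), so it can be handed directly to an off-the-shelf LP solver. The sketch below does this with scipy.optimize.linprog; the three types, |V|, and γ are hypothetical inputs, with γ deliberately chosen larger than log|V| so that the IC constraint set stays feasible under the cost and trust models of (2)-(3):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical inputs for three organization types (k = 2, 4, 8).
k     = np.array([2.0, 4.0, 8.0])
theta = 1.0 / k              # organization type, theta = 1/k
cost  = k * np.log(8.0)      # Eq. (2): log(|V|^k), with |V| = 8 assumed
trust = 2.5 * k              # Eq. (3): gamma * k, with gamma = 2.5 assumed
n = len(k)

# max sum theta_i (v_i - r_i)  <=>  min sum theta_i r_i  (v_i is constant).
A_ub, b_ub = [], []
for i in range(n):
    # IR (11): theta_i r_i - c_i + T_i >= 0  ->  -theta_i r_i <= T_i - c_i
    row = np.zeros(n)
    row[i] = -theta[i]
    A_ub.append(row)
    b_ub.append(trust[i] - cost[i])
    # IC (12): theta_i r_i - c_i + T_i >= theta_i r_j - c_j + T_j
    for j in range(n):
        if i != j:
            row = np.zeros(n)
            row[i], row[j] = -theta[i], theta[i]
            A_ub.append(row)
            b_ub.append((trust[i] - cost[i]) - (trust[j] - cost[j]))

res = linprog(c=theta, A_ub=np.vstack(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, None)] * n)
print(res.status, res.x)  # optimal rewards r_1, ..., r_n
```

Since the v_i terms do not appear in the constraints, maximizing the principal's utility reduces to minimizing the total type-weighted payment, which is what the objective vector encodes.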
3 SINGLE-TIME SHARING

In this section, we solve the problems formulated in Section 2, of the proposed two-tier model, for the static case. In this case, we assume that the organizations share their data only once with the data collector, while the attacker tries to identify the anonymized data. In particular, we first solve the optimal contract-based model to determine the optimal rewards. Then, we derive the different Nash equilibria of the static game-theoretic model using the optimal rewards.
To solve the problem in (13), we notice that the number of constraints is large. For instance, the number of IR constraints equals N and the number of IC constraints equals N(N − 1). However, the number of constraints can be significantly reduced, as shown next. Lemma 1.
The IC constraints are equivalent to r_i > r_j for every pair of organizations such that θ_i > θ_j.

Proof. We prove this lemma by using the IC constraints for any two organizations with types θ_1 and θ_2 such that θ_1 > θ_2. The downward and the upward IC constraints are:

$$\theta_1 r_1(k_1) - c_1(k_1) + T_1(k_1) \geq \theta_1 r_2(k_2) - c_2(k_2) + T_2(k_2),$$
$$\theta_2 r_2(k_2) - c_2(k_2) + T_2(k_2) \geq \theta_2 r_1(k_1) - c_1(k_1) + T_1(k_1).$$

By adding the two inequalities, we get:

$$\theta_1 r_1(k_1) - c_1(k_1) + T_1(k_1) + \theta_2 r_2(k_2) - c_2(k_2) + T_2(k_2) \geq \theta_1 r_2(k_2) - c_2(k_2) + T_2(k_2) + \theta_2 r_1(k_1) - c_1(k_1) + T_1(k_1),$$

which can be rearranged as:

$$\theta_1 r_1(k_1) + \theta_2 r_2(k_2) \geq \theta_1 r_2(k_2) + \theta_2 r_1(k_1),$$
$$(\theta_1 - \theta_2)\, r_1(k_1) \geq (\theta_1 - \theta_2)\, r_2(k_2).$$

Since θ_1 > θ_2, we can conclude that r_1(k_1) > r_2(k_2).

Using Lemma 1, we can reduce the number of IC constraints to just the N − 1 constraints between each two consecutive organization types. Next, we show the solution to the reduced contract problem for two organizations sharing the same data but with different anonymization values k_L and k_H such that k_L < k_H. Theorem 1.
The optimal rewards for two organizations sharing the same data and using k_L and k_H, such that k_L < k_H, are:

$$r_H(k_H) = \frac{c_H(k_H) - T_H(k_H)}{\theta_H},$$
$$r_L(k_L) = \max\left(\frac{c_L(k_L) - T_L(k_L)}{\theta_L},\; r_H(k_H)\right).$$

Proof.
When two organizations share the same data with different anonymization levels, the principal's problem becomes:

$$\max_{r_i(k_i)} \; \theta_L (v - r_L(k_L)) + \theta_H (v - r_H(k_H))$$
$$\text{s.t.} \quad \theta_L r_L(k_L) - c_L(k_L) + T_L(k_L) \geq 0,$$
$$\theta_H r_H(k_H) - c_H(k_H) + T_H(k_H) \geq 0,$$
$$r_L(k_L) > r_H(k_H).$$

Since r_L(k_L) needs to be higher than r_H(k_H) according to the last constraint, the second constraint will bind, i.e., θ_H r_H(k_H) − c_H(k_H) + T_H(k_H) = 0. This leads r_H(k_H) to equal (c_H(k_H) − T_H(k_H))/θ_H. The value of r_L(k_L) needs to be at least r_H(k_H) and needs to satisfy the first constraint; therefore, r_L(k_L) will be the higher of both values, i.e., max((c_L(k_L) − T_L(k_L))/θ_L, r_H(k_H)).

From Theorem 1, we notice that the organization with the higher k_H will get a reward that makes its utility equal zero, i.e., neither benefiting nor losing from the sharing. However, this might be seen as not enough of an incentive for the organizations to share their data. Therefore, in the dynamic case in Section 4, we discuss long-term contracts that enable the data collector to design incentivizing contracts for the organizations when they sign a long-term contract. Note that in Theorem 1 the solution is shown for the case of two organization types; however, the solution of the general case of multiple organization types follows similarly. In that case, the organization with the highest k value will get a reward that makes its utility zero. Each of the remaining organizations will get a reward that equals the maximum of its own cost and the higher organizations' reward.

Finally, we note that the solution presented in Theorem 1 represents the case where both organizations share the same dataset. However, for the general case when each organization has a different value of v_i, the principal needs to solve the general problem in (13) using any optimization technique, e.g., linear programming. In such a case, each organization will achieve a different utility that does not need to equal zero.

After determining the rewards for each anonymization level, we can use these values in solving the formulated game. Recall that, in the organizations' utilities in (6), each organization obtains a fraction of the reward r_i(k_i) based on the attack's success probability. Let

$$\delta = \left(1 - \frac{p_{\max}(a)}{\alpha k_i + 1}\right) \cdot \left(1 - \frac{p_{\max}(a)}{\alpha k_{-i} + 1}\right) \quad (14)$$

be the minimum fraction of r_i(k_i) that an organization can achieve based on the maximum success probability of the available attacks, i.e., p_max(a). We refer to δ r_i(k_i) as the minimum profit factor.

In the proposed game, the goal of each player is to take actions that maximize its utility given the actions of other players. When no player can improve its utility by unilaterally changing its actions, the game is said to be at equilibrium. The notion of equilibrium in game theory is referred to as Nash equilibrium [43]. A Nash equilibrium can be either a pure-strategy Nash equilibrium or a mixed-strategy Nash equilibrium. A pure-strategy equilibrium is when every player has only one action/strategy at equilibrium. On the other hand, a mixed-strategy Nash equilibrium represents a probability distribution over each player's set of available actions [44]. The proposed game, under the static scenario, is a finite static nonzero-sum game, which is known to have a Nash equilibrium, either pure or mixed-strategy [24].
For the sake of analytical tractability, we consider the case in which each organization can choose between two different k values, i.e., k_L and k_H. These values represent choosing a low and a high value for k, respectively. Based on these values, each organization will have two minimum profit factors, δ_L and δ_H, corresponding to the choices of k_L and k_H, respectively. Let p_1 be the probability for the first organization to choose k_L, such that it chooses k_H with probability 1 − p_1. Similarly, the second organization chooses k_L and k_H with probabilities p_2 and 1 − p_2, respectively. The attacker, on the other hand, will have a probability distribution (q_B, q_H, q_N) for choosing the actions B, H, and N, respectively. We start the analysis by considering the cases in which the game G can have a pure-strategy Nash equilibrium. Proposition 1.
Let $k_i^* = \arg\max_{k_i} r_i(k_i) - c_i(k_i) + \gamma \cdot k_i$. Then, the tuple $(k_i^*, k_{-i}^*, N)$ constitutes a pure-strategy Nash equilibrium for G when the attacker cannot achieve a positive utility.

Proof. We note that the attacker's utility for no-attack is zero, i.e., u_a(k_1, k_2, N) = 0. The attacker can only turn to this choice if all the other actions yield a negative utility, i.e., all the utility instances for choosing B and H, with the different combinations of k_L and k_H for each organization, result in a negative attacker's utility. Therefore, choosing the action N will be a dominant strategy for the attacker. In this case, each organization's utility will be:

$$u_i(k_i, k_{-i}, N) = r_i(k_i) - c_i(k_i) + \gamma \cdot k_i, \quad (15)$$

which clearly depends only on the organization's action and not on the other players' actions. In this case, each organization will choose the value of k that maximizes its utility in (15). Hence, $k_i^* = \arg\max_{k_i} r_i(k_i) - c_i(k_i) + \gamma \cdot k_i$ represents the optimal organization's choice under the no-attack scenario. In this case, no player will have an incentive to change its choice and, therefore, the action tuple $(k_i^*, k_{-i}^*, N)$ is a pure-strategy Nash equilibrium for the game.

From Proposition 1, the attacker's probability q_N of choosing the action N will be either 0 or 1, based on whether the action N dominates the other actions or is dominated by another action. Thus, the actions B and H can be selected with probabilities q and 1 − q, respectively, when N is not selected. Similar to the attacker, each organization can have a dominant strategy under some circumstances and, hence, the probability p can be either 0 or 1 based on the dominant strategy. Proposition 2.
Each organization will have a dominant strategy when the rewards assigned by the data collector r_i(k_i) are high enough such that the minimum profit factor is the dominant factor in the organization's utility, i.e., δ_H r_i(k_H) > γ · k_H − c_i(k_H) and δ_L r_i(k_L) > γ · k_L − c_i(k_L). The dominant strategy can then be given as the solution of:

$$k_i^* = \arg\max_{j \in \{L, H\}} \delta_j\, r_i(k_j) - c_i(k_j) + \gamma \cdot k_j.$$

Proof.
The values of δ_H r_i(k_H) and δ_L r_i(k_L) represent the minimum fractions of the reward each organization can achieve under the attacker's maximum probability of success. When the values of r_i(k_i) are high enough to make these minimum profit factors higher than the rest of the factors of the utility, each organization can expect that no attacker's action will lower its utility. Thus, the organization can determine its dominant strategy while neglecting the attacker's effect.

Note that, in Proposition 2, a high reward can eliminate the attacker's effect; however, it cannot be used solely to determine the organization's action, as this is affected by the other factors in the organization's utility.

To this end, when no player has a dominant strategy, the players will randomize over their strategies using the probability distributions of the mixed-strategy Nash equilibrium. These mixed strategies can be calculated when the players are indifferent between choosing their actions, i.e., the expected utility of choosing each action is the same. For instance, the organizations can choose their p such that the attacker's expected utility from choosing the action B equals that of choosing the action H. The attacker's expected utility from choosing the action B can be given by:

$$E(u_a(k_1, k_2, B)) = p_1 p_2\, u_a(k_L, k_L, B) + p_1 (1 - p_2)\, u_a(k_L, k_H, B) + (1 - p_1) p_2\, u_a(k_H, k_L, B) + (1 - p_1)(1 - p_2)\, u_a(k_H, k_H, B). \quad (16)$$

Similarly, the expected utility of choosing action H is:

$$E(u_a(k_1, k_2, H)) = p_1 p_2\, u_a(k_L, k_L, H) + p_1 (1 - p_2)\, u_a(k_L, k_H, H) + (1 - p_1) p_2\, u_a(k_H, k_L, H) + (1 - p_1)(1 - p_2)\, u_a(k_H, k_H, H). \quad (17)$$

For the attacker to be indifferent between its actions, the utility in (16) must equal the utility in (17). Solving both equations together, with the symmetric solution p_1 = p_2 = p, the organizations' probability of choosing k_L, i.e., p, can then be given as the solution of the quadratic equation:

$$\Phi_2\, p^2 + \Phi_1\, p + \Phi_0 = 0, \quad (18)$$

where

$$\Phi_2 = \big[u_a(k_L,k_L,B) - u_a(k_L,k_H,B) - u_a(k_H,k_L,B) + u_a(k_H,k_H,B)\big] - \big[u_a(k_L,k_L,H) - u_a(k_L,k_H,H) - u_a(k_H,k_L,H) + u_a(k_H,k_H,H)\big],$$
$$\Phi_1 = \big[u_a(k_L,k_H,B) + u_a(k_H,k_L,B) - 2\, u_a(k_H,k_H,B)\big] - \big[u_a(k_L,k_H,H) + u_a(k_H,k_L,H) - 2\, u_a(k_H,k_H,H)\big],$$
$$\Phi_0 = u_a(k_H,k_H,B) - u_a(k_H,k_H,H).$$

After calculating the probability p, the attacker's probability q can be calculated in a similar way by considering the expected utility of one of the organizations. Note that, due to the symmetry between the organizations, considering the utilities of both organizations would be redundant. To this end, the first organization's expected utility from choosing k_L can be given by:

$$E(u_1(k_L, k_2, a)) = p\, q\, u_1(k_L, k_L, B) + p (1 - q)\, u_1(k_L, k_L, H) + (1 - p)\, q\, u_1(k_L, k_H, B) + (1 - p)(1 - q)\, u_1(k_L, k_H, H). \quad (19)$$

Similarly, the first organization's expected utility from choosing k_H can be given by:

$$E(u_1(k_H, k_2, a)) = p\, q\, u_1(k_H, k_L, B) + p (1 - q)\, u_1(k_H, k_L, H) + (1 - p)\, q\, u_1(k_H, k_H, B) + (1 - p)(1 - q)\, u_1(k_H, k_H, H). \quad (20)$$

For the organization to be indifferent between its actions, the utility in (19) must equal the utility in (20).
Solving both equations together, the attacker's probability of choosing B, i.e., q, can then be given as:

$$q = \frac{u_1(k_H,k_H,H) - u_1(k_L,k_H,H) + p\,\big(u_1(k_H,k_L,H) - u_1(k_L,k_L,H) - u_1(k_H,k_H,H) + u_1(k_L,k_H,H)\big)}{u_1(k_L,k_H,B) - u_1(k_H,k_H,B) - u_1(k_L,k_H,H) + u_1(k_H,k_H,H) + p\,\big(u_1(k_L,k_L,B) - u_1(k_H,k_L,B) - u_1(k_L,k_H,B) + u_1(k_H,k_H,B) - u_1(k_L,k_L,H) + u_1(k_H,k_L,H) + u_1(k_L,k_H,H) - u_1(k_H,k_H,H)\big)}. \quad (21)$$

Given the value of p from (18), the value of q can be uniquely computed from (21). The Nash equilibrium mixed strategies can then be given as (p, 1 − p) for the organizations and (q, 1 − q) for the attacker.

Note that the solution of equations (18) and (21) gives one Nash equilibrium of the game G. However, the game, as it involves multiple players, might have other Nash equilibrium solutions [45]. These equilibrium points will have different values for p and q but the same outcomes for the players. Here, we only consider the solution in which all the organizations adopt the same strategy.
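Putting the static analysis together, the following sketch prices the two contracts via Theorem 1 and then computes the mixed-strategy equilibrium numerically: p is obtained as a root of the quadratic (18) and q from (21). All numeric parameters (the attack success probabilities, attack costs, α, γ, |V|, and R_a) are illustrative assumptions; parameter sets for which the quadratic has no root in [0, 1] admit no indifference point and correspond to the dominant-strategy cases discussed above.

```python
import math
import numpy as np

# Illustrative parameters (assumptions, not the paper's values).
k_L, k_H = 3, 7
alpha, gamma, V, R_a = 0.5, 1.0, 8, 100.0
p_B, p_Hs, p_Hd = 0.6, 0.7, 0.3   # p(B), p(H_s), p(H_d); p(H_d) smallest
c_B, c_Ha = 12.0, 8.0             # attack costs, c_a(B) > c_a(H)

cost  = lambda k: math.log(V ** k)   # Eq. (2)
trust = lambda k: gamma * k          # Eq. (3)

# Theorem 1 rewards (theta = 1/k, both organizations share the same data).
r = {k_H: (cost(k_H) - trust(k_H)) * k_H}
r[k_L] = max((cost(k_L) - trust(k_L)) * k_L, r[k_H])

def succ(att, ki, kj):
    # Attack success probability; homogeneity is stronger on equal levels.
    return p_B if att == "B" else (p_Hs if ki == kj else p_Hd)

def u_att(ki, kj, att):
    # Attacker utility, Eqs. (7)-(9).
    f = lambda k: succ(att, ki, kj) / (alpha * k + 1)
    return (1 - (1 - f(ki)) * (1 - f(kj))) * R_a - (c_B if att == "B" else c_Ha)

def u_org(ki, kj, att):
    # Organization utility, Eq. (6).
    f = lambda k: succ(att, ki, kj) / (alpha * k + 1)
    return r[ki] * (1 - f(ki)) * (1 - f(kj)) - cost(ki) + trust(ki)

def quad(att):
    # Coefficients of E[u_a | att] as a quadratic in p (symmetric p1 = p2 = p).
    uLL, uLH, uHL, uHH = (u_att(k_L, k_L, att), u_att(k_L, k_H, att),
                          u_att(k_H, k_L, att), u_att(k_H, k_H, att))
    return np.array([uLL - uLH - uHL + uHH, uLH + uHL - 2 * uHH, uHH])

def q_of(p):
    # Eq. (21): attacker's probability of B given the organizations' p.
    u = u_org
    num = (u(k_H, k_H, "H") - u(k_L, k_H, "H")
           + p * (u(k_H, k_L, "H") - u(k_L, k_L, "H")
                  - u(k_H, k_H, "H") + u(k_L, k_H, "H")))
    den = (u(k_L, k_H, "B") - u(k_H, k_H, "B") - u(k_L, k_H, "H")
           + u(k_H, k_H, "H")
           + p * (u(k_L, k_L, "B") - u(k_H, k_L, "B") - u(k_L, k_H, "B")
                  + u(k_H, k_H, "B") - u(k_L, k_L, "H") + u(k_H, k_L, "H")
                  + u(k_L, k_H, "H") - u(k_H, k_H, "H")))
    return num / den

# Eq. (18): find a p in [0, 1] that makes the attacker indifferent, then
# keep it only if the implied q is also a valid probability.
for root in np.roots(quad("B") - quad("H")):
    if abs(root.imag) < 1e-9 and 0 <= root.real <= 1:
        p = root.real
        q = q_of(p)
        if 0 <= q <= 1:
            print(f"mixed equilibrium: p = {p:.3f}, q = {q:.3f}")
```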
4 REPEATED SHARING

In this section, we consider the case in which the organizations share their datasets with the data collector more than once over a period of time. At each time step, the organizations share different datasets with their choice of anonymization levels. The problems formulated in Section 2 are extended by adding the notion of time. In particular, we first consider the dynamic contracts offered by the data collector; then, we solve the repeated game between the organizations and an attacker.
The problem formulated and solved in the static case in (13) can be used by the data collector, at each time step, to determine the optimal reward value that will be given to each organization. The utility function of the data collector can be extended from (10) to be:

$$U_{d,T} = \sum_{t=1}^{T} \sum_{i=1}^{N} \theta_{i,t} \left(v_{i,t} - r_{i,t}(k_{i,t})\right), \quad (22)$$

where each parameter is the same as in (10) but considered at a specific time step t, and T is the total number of time steps. Solving the static problem in (13) at every time step will ensure that the data collector maximizes its reward over all time steps.

However, considering the dynamic nature of the problem in (22), we propose a new technique for the data collector to improve its outcome while providing more incentive for the organizations to participate. Recall that in the static case, according to Theorem 1, one organization can get a reward that equals its cost if it uses the highest anonymization level. This makes the net outcome of this organization equal zero, which can be seen as a limiting factor that might hinder the organization from participating. In the repeated case, we propose that the data collector can sign long-term contracts with the organizations and offer a minimum net outcome for each organization at each time step.

To this end, the individual rationality constraint in (11) will be modified to maintain a minimum outcome value m_i at each time step, such that m_i > 0:

$$\theta_i r_i(k_i) - c_i(k_i) + T_i(k_i) \geq m_i, \quad i \in \mathcal{N}. \quad (23)$$

The range of the values of m_i can be calculated using the following theorem. Theorem 2.
To incentivize their participation, the data collector can offer each organization i a value m_i, at each time step, as a minimum net outcome such that the total offered amount satisfies:

$$0 < \sum_{t \in \mathcal{T}} m_{i,t} < \sum_{t \in \mathcal{T}} \theta_{i,t}\left(v_{i,t} - r_{i,t}(k_{i,t})\right),$$

where $\mathcal{T}$ is the set of the time steps at which the organization was expected to have a zero utility in the static case.

Proof. Considering the problem in (13), the data collector will either gain θ_i(v_i − r_i(k_i)) when an organization shares its dataset, or zero if it does not share. The idea here is to allow the data collector to offer part of its expected outcome to the organizations, to ensure their participation. In return, the data collector will increase its total reward by the difference between what it pays, m_{i,t}, and the newly secured benefit θ_{i,t}(v_{i,t} − r_{i,t}(k_{i,t})) for this specific time step.

Let $\mathcal{T}$ be the set of time steps in which the individual rationality constraint will be binding in (13). The data collector can assume that the organizations will not be willing to participate at these time steps. The expected gain that the data collector will miss at these time steps can be expressed as follows:

$$u_{m,i} = \sum_{t \in \mathcal{T}} \theta_{i,t}\left(v_{i,t} - r_{i,t}(k_{i,t})\right),$$

where u_{m,i} is the gain that the data collector can miss if organization i does not participate. It is clear that the data collector can improve its outcome if it can obtain part of u_{m,i}. This can be secured by signing a long-term contract with the organizations to offer a minimum reward that is less than u_{m,i}.

In this section, we also consider the case in which the same game is repeated over time. This is different from the static case considered in Section 3 in that the players' utilities change over time. In particular, the organizations' utility will be:

$$u_i(k_i, k_{-i}, a, t_i) = \frac{R_i(t_i)}{k_i(t_i)} \cdot \left(1 - \frac{p(a)}{\alpha k_i(t_i) + 1}\right) \cdot \left(1 - \frac{p(a)}{\alpha k_{-i}(t_i) + 1}\right) - c_{i,t_i}(k_i(t_i)) + \gamma(t_i) \cdot k_i(t_i), \quad (24)$$

where every parameter is a function of the time step t_i. However, from a practical point of view, the probability of a successful attack p(a) is expected to be constant over time; thus, it is not defined as a function of time. Note that the functions R_i(t_i) and c_{i,t_i} will be evaluated at each time step. Thus, their values will change based on the datasets shared at a given time step. On the other hand, the coefficient of trust, γ(t_i), will need to be a function that changes over time.

Here, we propose to define the coefficient of trust, γ(t_i), based on the organization's observation of the security level of the common platform. When an organization shares its dataset with the common platform, it is the responsibility of the platform to maintain and protect the data. Thus, an organization can build its trust in the common platform based on the rate of successful and unsuccessful attacks in the previous time steps. The coefficient of trust, γ(t_i), can thus be given as:

$$\gamma(t_i) = \frac{\gamma(t_{i-1}) + \mathbb{1}_{un}}{2}, \quad (25)$$

where $\mathbb{1}_{un}$ is the indicator function, which equals 1 in case of an unsuccessful attack and 0 if the attack was successful. Following (25), the coefficient of trust will equal half of its value in the previous step if an attack was successful. On the other hand, if an attack was blocked by the common platform, the coefficient of trust will be the average of the previous value and 1. This ensures that the value of the coefficient remains between 0 and 1, when the initial value is less than 1.
Note that the design of the trust function in (25) punishes a successful attack more than it rewards blocking an attack. This design is different from other trust coefficients proposed in the literature, e.g., [46], which treat positive and negative updates equally. Our proposed coefficient ensures that the data collector does its best to protect the data and prevent any undesired access to it.

In Fig. 2, we study the evolution of the coefficient of trust over time. We model three scenarios. In the first, there is no successful attack; in this case, the coefficient of trust increases from its initial value until it reaches almost 1 after six time steps. This means that after six times of sharing, the organization will be able to fully trust the data collector. The second scenario is when there is a successful attack after every alternate iteration (data share). Since our coefficient punishes the successful attacks more than it rewards the positive ones, we notice that the value of the coefficient of trust oscillates between a lower and a higher value without converging. Finally, we consider the scenario in which there is only one random successful attack, after the value has reached its maximum. We notice that this successful attack causes the value of the coefficient of trust to drop sharply, and it needs several iterations to return to 1 again. The effect of the coefficient of trust evolution over time is studied in the simulations section.

Fig. 2. Coefficient of trust updates over time under three scenarios: no successful attack, alternating successful attacks, and one random successful attack.
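The trust update in (25) is straightforward to simulate. The short Python sketch below reproduces the three scenarios of Fig. 2; the initial trust value of 0.5 and the horizon of 12 steps are assumptions chosen for illustration:

```python
def update_trust(gamma_prev, attack_blocked):
    # Eq. (25): average with 1 when the attack is blocked (indicator = 1),
    # halve the previous value when an attack succeeds (indicator = 0).
    return (gamma_prev + (1 if attack_blocked else 0)) / 2

def trajectory(blocked_flags, gamma0=0.5):  # initial value assumed
    gammas = [gamma0]
    for blocked in blocked_flags:
        gammas.append(update_trust(gammas[-1], blocked))
    return gammas

steps = 12
no_attack   = [True] * steps                      # trust climbs toward 1
alternating = [t % 2 == 0 for t in range(steps)]  # oscillates, never settles
one_breach  = [t != 7 for t in range(steps)]      # single drop, slow recovery

for name, flags in [("no successful attack", no_attack),
                    ("alternating breaches", alternating),
                    ("one random breach", one_breach)]:
    print(name, [round(g, 3) for g in trajectory(flags)])
```

Under the no-attack scenario, the sequence 0.5, 0.75, 0.875, ... approaches 1 after about six updates, matching the first curve described above.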
5 SIMULATION RESULTS AND ANALYSIS

For our simulations, we choose two values of k: the lower k_L = 3 and the higher k_H = 7. These values are kept the same for all the experiments in this section. Unless otherwise stated, the measure of information security α and the coefficient of trust γ are kept fixed, and the success probabilities of the different attack types are chosen such that p(H_d) < p(H_s) and p(H_d) < p(B), which follows the discussion in Section 2.2.2 about the relation between the different probabilities. Finally, we assume similar dataset structures between the different organizations, i.e., the organizations have the same value of |V| in (2), so that the cost function is affected only by the choice of k. We note that the reward given to each organization by the data collector depends on the value of its shared dataset, as in (10). However, as these values can fall within a wide range that would affect the game equilibrium, we consider the case in which the two organizations share datasets with the same values. Thus, the effect of different dataset values on the equilibrium analysis is eliminated.

First, we solve the formulated game G using the analysis in Section 3.2. We consider the value of the organizations' datasets to vary from 10 to 100. These values represent the monetary rewards the data collector will give to the organizations for sharing the data. Here, we use abstract values; however, in a real-life scenario, the data collector needs to estimate these values to be proportional to the cost. Throughout the following results, we assume that the attacker can achieve the full value of the information, i.e., R_a = v, while the organizations are given rewards r(k_L) and r(k_H) that are fixed fractions of v, with r(k_L) > r(k_H) according to Lemma 1. Note that, for a limited number of simulation parameters, there was no feasible solution for equations (18) and (21). In such cases, we used a numerical
10 20 30 40 50 60 70 80 90 100
Reward R a P l a y e r s ' u t ili t i e s Organizations' utility - Equilibrium strategyOrganizations' utility - Random strategyAttacker's utility - Equilibrium strategyAttacker's utility - Random strategy
Fig. 3. The organization’s and the attacker’s utilities at equilibrium atdifferent reward R values. solver for the game we have chosen the equilibrium pointin which the organizations have the same strategy.Using the previous parameters, the equilibrium strate-gies for both the attacker and the organizations are shownin Tables 1 and 2, respectively. We note that, when thevalues of R a are less than , the attacker cannot achievea positive utility. Hence, it will choose not to attack. Thissituation corresponds to the case of Proposition 1 and theorganization’s utility is calculated using (15). In this case,the organization will have a pure strategy of choosing k H when R a = 10 and a pure strategy of choosing k L when R a = 20 . This change occurs as k L achieves a higher rewardfor the organization starting from R a = 20 , i.e., if therewere no attacks for higher rewards, the organization willchoose k L . For the values of R a between and , boththe attacker and the organization will have mixed strategies,i.e., choosing their actions with certain probabilities. In thecase where R a is , the attacker has a higher probability ofchoosing homogeneity attack. Correspondingly, the organi-zation will prioritize using k H . However, for large values of R a , the attacker will benefit if it performed the backgroundknowledge attack, in this case the organization can choosebetween the two values of k with k L being superior, i.e., ithas a high probability to be chosen.Next, we study the utilities associated with the previousequilibrium strategies in Fig. 3. The expected utilities, i.e.,the summation of every outcome multiplied by the equilib-rium probabilities of choosing these outcomes from Tables1 and 2, of the players are shown in Fig. 3. These utilitiesrepresent the outcomes of the game which each player willachieve. In Fig. 3, we can see that when the attacker choosesnot to attack, its utility will equal zero. Meanwhile, theorganization will be able to achieve a utility slightly higherthan the reward value. On the other hand, for the rewardvalues R a ≥ , the utility of the organization will be lessthan the reward as the attack reduces the organization’sutility according to (1). However, for all the values of R a , theplayers’ utilities witness a monotonic increase in the value of R a . Then, we compare these equilibrium utilities to the casewhere one player chooses random probabilities while theother player sticks to its equilibrium strategy. From Fig. 3,we can see that when a player deviates from the equilibrium TABLE 1Attacker’s equilibrium strategies R a
10 20 30 40 50 60 70 80 90 100 B H N TABLE 2Organization’s equilibrium strategies r ( k H ) r ( k L ) k L k H strategy, to a random strategy, it cannot achieve a higherutility as its utility will be lower or equal to the equilibriumutility. This corroborates the importance of finding Nashequilibrium strategies as they represent the best that eachplayer can achieve given their opponent’s actions.In Fig. 4, we study the effect of the success probability ofthe background knowledge attack, i.e., p ( B ) on the equilib-rium strategies of the players, under two different scenariosof lower reward ( R a = 50 ) and a higher reward ( R a = 100 ).All the other parameters are the same as in Fig. 3. Notethat, the values of p ( B ) are chosen to start at . to satisfythe assumption p ( B ) > p ( H d ) . For each value of p ( B ) and R a , the game G is solved and the equilibrium strategies areshown in Fig. 4 in a similar way to the values in Tables 1and 2. From Fig. 4, we can see that when p ( B ) is slightlyhigher than p ( H d ) i.e., p ( B ) = 0 . the attacker will have azero value of q which corresponds to exclusively choosingto perform homogeneity attack, for both R a = 50 and R a = 100 . At the same point, the organization will choose k L with slightly higher probability for both R a = 50 and R a = 100 . However, as the value of p ( B ) increases, underlower reward, the organization will prefer to use k H as itprovides a higher utility. Correspondingly, the attacker’sprobability for performing background knowledge attacksincreases till it becomes close to pure strategy. On theother hand, under high reward ( R a = 100 ), the attacker’sprobability of performing background knowledge attackwill become near pure-strategy due to the increased proba-bility of success. However, the organization’s probability ofchoosing k L will be very high, i.e, near pure-strategy. In Fig.4, the effect of the reward that each organization is given isclear on its equilibrium strategies, as for low rewards, theorganization will be choosing k H while for higher rewardsthe organization will be choosing k L .In Fig. 5, we study the effect of the success probabilityof the homogeneity attack, at similar values of k , i.e, p ( H s ) on the equilibrium strategies of the players, under the samevarying reward scenarios as the Fig 4. Similar to Fig. 4, thevalues of p ( H s ) are starting at . so that p ( H s ) > p ( H d ) .The rest of the simulation parameters are the same as Fig.3. From Fig. 5, we can see that when p ( H s ) is less than p ( B ) i.e., p ( H s ) < . , under low reward, the attacker willhave a near pure-strategy probability of choosing the back-ground knowledge attack. This probability will decrease as p ( H s ) is equal to p ( B ) or higher. In this case, the attackerwill choose the homogeneity attack with higher probabil-ity especially with the increase in its success probability. Success probability of the background knowledge attack p(B) E qu ili b r i u m p r ob a b iliti e s Defender's probability (p), R=100Attacker's probability (q), R=100Defender's probability (p), R=50Attacker's probability (q), R=50
Fig. 4. The organization’s and the attacker’s equilibrium probabilitiesunder low and high rewards at different success probabilities for back-ground knowledge attack p ( B ) values. Similarly, when p ( H s ) < . , the organization will choose k H with higher probability. However, this probability willbe decreasing as p ( H s ) increases. Under the high rewardscenario, when p ( H s ) < . , the attacker will be performingbackground knowledge attack in a near exclusive manner,while as p ( H s ) is equal to p ( B ) or higher, this probabilitydecreases. Similarly, when p ( H s ) < . , the organizationwill have a higher probability on k L . As p ( H s ) increases,the organizations will place more emphasis on playing k H .We note that, the low values of p ( H S ) ≤ . are similarto the case of Proposition 2, where the organizations areable to play a pure strategy that maximize their utility.Consequently, the attacker also plays a pure strategy ofchoosing the background knowledge attack. However, asthe values of p ( H s ) increase, the organizations’ portion ofthe utility decrease and the game witnesses a mixed strategysolution.Next, we study the dynamic case of the game in Fig.6. We notice from (25) that the coefficient of trust changesover time based on its previous values that depend on thebreaches in the previous steps. However, these successfulbreaches cannot be predicted beforehand and they can fol-low different behavior as in Fig. 2. Therefore, in Fig. 6, weshow the equilibrium strategies against different γ values.In a dynamic scenario, the equilibrium at each time stepcan be identified through the corresponding γ in Fig. 6.Similar to the previous figures, we study the effect of γ under low and high values of R a , while the rest of the Success probability of the homogeneity attack p(H s ) E qu ili b r i u m p r ob a b iliti e s Defender's probability (p), R=100Attacker's probability (q), R=100Defender's probability (p), R=50Attacker's probability (q), R=50
Fig. 5. The organization's and the attacker's equilibrium probabilities under low and high rewards for different values of the homogeneity attack's success probability p(H_s).
Fig. 6. The organization's and the attacker's equilibrium probabilities under low and high rewards for different values of the coefficient of trust γ.

In Fig. 6, we can see that, for the higher reward R_a = 50, the trust factor γ has no visible effect on the equilibrium strategies, i.e., both players choose a pure strategy (except for the organization at the lowest γ value). However, for the lower reward R_a = 20, the equilibrium strategies change: the equilibrium probabilities of both the attacker and the organization deviate significantly as γ increases. The difference between the high and the low reward values is that, in the lower reward case, the trust factor represents a significant portion of the utility and thus affects the equilibrium strategy, whereas for high values of R_a the coefficient of trust is negligible compared to the reward value and does not affect the equilibrium.

Finally, in Fig. 7, we study the contract utilities of different organizations when they accept contracts from the data collector that either match or do not match their anonymization levels. For this case, we consider three organizations adopting three anonymization levels: k_L, k_M, and k_H. We use a generalized solution of Theorem 1 to assign the contract utilities of the organizations while maintaining the incentive compatibility constraints between the different organizations' types.

Fig. 7. The utility of each organization when accepting the contract designed for its type or the contracts designed for other types.

From Fig. 7, we can clearly see that it is better for each organization to choose the contract designed for its type: sticking to its incentive-compatible contract yields the highest utility, whereas accepting a reward designed for another anonymization level, whether higher or lower, lowers its net utility.
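The incentive-compatibility property illustrated in Fig. 7 can be checked mechanically. In the sketch below, an organization's utility from a contract is modeled as the contract's reward minus a quadratic mismatch cost between its own anonymization level and the contracted one; these numbers are hypothetical stand-ins for the menu derived from Theorem 1, and only the diagonal-dominance check itself is general.

    # Hypothetical three-type contract menu: anonymization levels
    # (k_L, k_M, k_H) and the reward attached to each contract.
    levels = [2, 5, 10]
    rewards = [10.0, 18.0, 30.0]
    menu = list(zip(rewards, levels))

    def utility(own_level, contract):
        """Assumed utility: reward minus a quadratic mismatch cost."""
        r, k = contract
        return r - (k - own_level) ** 2

    # U[i][j]: utility of a type-i organization accepting the contract
    # designed for type j (the setting plotted in Fig. 7).
    U = [[utility(levels[i], menu[j]) for j in range(3)] for i in range(3)]

    # Incentive compatibility: each type weakly prefers its own contract,
    # i.e., the diagonal entry is the maximum of its row.
    assert all(U[i][i] == max(U[i]) for i in range(3))

With these numbers, each row of U peaks on the diagonal, reproducing the qualitative shape of Fig. 7: deviating to a contract meant for a higher or a lower anonymization level strictly lowers the organization's utility.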
CONCLUSIONS
In this paper, we have proposed a two-tier model to study the interactions between organizations that share anonymized datasets with a data collector and an attacker who performs de-anonymization attacks. In the first tier, a game-theoretic model has been used to determine the optimal anonymization level for k-anonymization under possible attacks; two common attack types have been considered, namely the homogeneity attack and the background knowledge attack. In the second tier, a contract-theoretic model has been formulated to determine the data collector's optimal rewards given to the organizations for their data. For both tiers of the problem, closed-form solutions have been derived. The problem has also been studied in both a static scenario and a dynamic scenario to highlight the effect of repeated sharing on the players' behavior. Through simulation results, we have shown that the sharing organizations can optimize their anonymization level selection, while the data collector can economically benefit by incentivizing more organizations to share their data.

REFERENCES

[1] A. Kotra, A. Eldosouky, and S. Sengupta, "Every anonymization begins with k: A game-theoretic approach for optimized k selection in k-anonymization," pp. 1–6, 2020.
[2] G. Zyskind, O. Nathan, et al., "Decentralizing privacy: Using blockchain to protect personal data," pp. 180–184, IEEE, 2015.
[3] S. Badsha, I. Vakilinia, and S. Sengupta, "Privacy preserving cyber threat information sharing and learning for cyber defense," pp. 0708–0714, IEEE, 2019.
[4] J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez, and D. Megías, "Individual differential privacy: A utility-preserving formulation of differential privacy guarantees," IEEE Transactions on Information Forensics and Security, vol. 12, no. 6, pp. 1418–1429, 2017.
[5] T. Zhu, P. Xiong, G. Li, and W. Zhou, "Correlated differential privacy: Hiding information in non-iid data set," IEEE Transactions on Information Forensics and Security, vol. 10, no. 2, pp. 229–242, 2014.
[6] M. Keshavarz, A. Shamsoshoara, F. Afghah, and J. Ashdown, "A real-time framework for trust monitoring in a network of unmanned aerial vehicles," in IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 677–682, IEEE, 2020.
[7] F. Afghah, A. Shamsoshoara, L. L. Njilla, and C. A. Kamhoua, "Cooperative spectrum sharing and trust management in IoT networks," Modeling and Design of Secure Internet of Things, pp. 79–109, 2020.
[8] M. Boreale, F. Corradi, and C. Viscardi, "Relative privacy threats and learning from anonymized data," IEEE Transactions on Information Forensics and Security, vol. 15, pp. 1379–1393, 2020.
[9] J. Domingo-Ferrer, J. Soria-Comas, and R. Mulero-Vellido, "Steered microaggregation as a unified primitive to anonymize data sets and data streams," IEEE Transactions on Information Forensics and Security, vol. 14, no. 12, pp. 3298–3311, 2019.
[10] A. Eldosouky and W. Saad, "On the cybersecurity of m-health IoT systems with LED bitslice implementation," pp. 1–6, IEEE, 2018.
[11] L. Sweeney, "k-anonymity: A model for protecting privacy," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 557–570, 2002.
[12] L. Sweeney, "Achieving k-anonymity privacy protection using generalization and suppression," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 10, no. 05, pp. 571–588, 2002.
[13] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "l-diversity: Privacy beyond k-anonymity," pp. 24–24, IEEE, 2006.
[14] N. Li, T. Li, and S. Venkatasubramanian, "t-closeness: Privacy beyond k-anonymity and l-diversity," pp. 106–115, IEEE, 2007.
[15] T. Li, N. Li, and J. Zhang, "Modeling and integrating background knowledge in data anonymization," pp. 6–17, IEEE, 2009.
[16] Q. Wang, Z. Xu, and S. Qu, "An enhanced k-anonymity model against homogeneity attack," Journal of Software, vol. 6, no. 10, pp. 1945–1952, 2011.
[17] Z. Liang and R. Wei, "Efficient k-anonymization for privacy preservation," pp. 737–742, IEEE, 2008.
[18] L. Xu, C. Jiang, Y. Qian, Y. Zhao, J. Li, and Y. Ren, "Dynamic privacy pricing: A multi-armed bandit approach with time-variant rewards," IEEE Transactions on Information Forensics and Security, vol. 12, no. 2, pp. 271–285, 2017.
[19] M. Baza, N. Lasla, M. Mahmoud, G. Srivastava, and M. Abdallah, "B-Ride: Ride sharing with privacy-preservation, trust and fair payment atop public blockchain," IEEE Transactions on Network Science and Engineering, 2019.
[20] J. M. de Fuentes, L. González-Manzano, J. Tapiador, and P. Peris-Lopez, "PRACIS: Privacy-preserving and aggregatable cybersecurity information sharing," Computers & Security, vol. 69, pp. 127–141, 2017.
[21] M. Baza, A. Salazar, M. Mahmoud, M. Abdallah, and K. Akkaya, "On sharing models instead of data using mimic learning for smart health applications," pp. 231–236, IEEE, 2020.
[22] W. Zhao and G. White, "A collaborative information sharing framework for community cyber security," pp. 457–462, IEEE, 2012.
[23] M. Alawneh and I. M. Abbadi, "Preventing information leakage between collaborating organisations," in Proceedings of the 10th International Conference on Electronic Commerce, pp. 1–10, 2008.
[24] Z. Han, D. Niyato, W. Saad, T. Başar, and A. Hjørungnes, Game Theory in Wireless and Communication Networks: Theory, Models, and Applications. Cambridge University Press, 2012.
[25] T. Das, A. Eldosouky, and S. Sengupta, "Think smart, play dumb: Analyzing deception in hardware trojan detection using game theory," pp. 1–8, IEEE, 2020.
[26] A. Eldosouky, A. Ferdowsi, and W. Saad, "Drones in distress: A game-theoretic countermeasure for protecting UAVs against GPS spoofing," IEEE Internet of Things Journal, vol. 7, no. 4, pp. 2840–2854, 2020.
[27] A. Ferdowsi, A. Eldosouky, and W. Saad, "Interdependence-aware game-theoretic framework for secure intelligent transportation systems," IEEE Internet of Things Journal, pp. 1–1, 2020.
[28] I. Vakilinia, D. K. Tosh, and S. Sengupta, "3-way game model for privacy-preserving cybersecurity information exchange framework," in MILCOM 2017 - 2017 IEEE Military Communications Conference (MILCOM), pp. 829–834, IEEE, 2017.
[29] Z. Wan, Y. Vorobeychik, W. Xia, E. W. Clayton, M. Kantarcioglu, and B. Malin, "Expanding access to large-scale genomic data while promoting privacy: A game theoretic approach," The American Journal of Human Genetics, vol. 100, no. 2, pp. 316–322, 2017.
[30] M. Ezhei and B. T. Ladani, "Information sharing vs. privacy: A game theoretic analysis," Expert Systems with Applications, vol. 88, pp. 327–337, 2017.
[31] X. Liu, K. Liu, L. Guo, X. Li, and Y. Fang, "A game-theoretic approach for achieving k-anonymity in location based services," pp. 2985–2993, IEEE, 2013.
[32] S. L. Chakravarthy, V. V. Kumari, and C. Sarojini, "A coalitional game theoretic mechanism for privacy preserving publishing based on k-anonymity," Procedia Technology, vol. 6, pp. 889–896, 2012.
[33] R. Karimi Adl, M. Askari, K. Barker, and R. Safavi-Naini, "Privacy consensus in anonymization systems via game theory," in Data and Applications Security and Privacy XXVI (N. Cuppens-Boulahia, F. Cuppens, and J. Garcia-Alfaro, eds.), pp. 74–89, 2012.
[34] P. Bolton, M. Dewatripont, et al., Contract Theory. MIT Press, 2005.
[35] A. Eldosouky, W. Saad, and N. Mandayam, "Resilient critical infrastructure: Bayesian network analysis and contract-based optimization," Reliability Engineering & System Safety, p. 107243, 2020.
[36] L. Duan, L. Gao, and J. Huang, "Cooperative spectrum sharing: A contract-based approach," IEEE Transactions on Mobile Computing, vol. 13, no. 1, pp. 174–187, 2012.
[37] A. Eldosouky, W. Saad, C. Kamhoua, and K. Kwiat, "Contract-theoretic resource allocation for critical infrastructure protection," pp. 1–6, Dec. 2015.
[38] E. K. Wang, B. Jia, and N. Ke, "Modeling background knowledge for privacy preserving medical data publishing," pp. 136–141, IEEE, 2017.
[39] A. Meyerson and R. Williams, "On the complexity of optimal k-anonymity," in Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 223–228, 2004.
[40] L. A. Gordon and M. P. Loeb, "The economics of information security investment," ACM Transactions on Information and System Security, vol. 5, pp. 438–457, Nov. 2002.
[41] A. S. Sattar, J. Li, J. Liu, R. Heatherly, and B. Malin, "A probabilistic approach to mitigate composition attacks on privacy in non-coordinated environments," Knowledge-Based Systems, vol. 67, pp. 361–372, 2014.
[42] H. Ogut, N. Menon, and S. Raghunathan, "Cyber insurance and IT security investment: Impact of interdependence risk," in WEIS, 2005.
[43] S. Sengupta, M. Chatterjee, and K. Kwiat, "A game theoretic framework for power control in wireless sensor networks," IEEE Transactions on Computers, vol. 59, no. 2, pp. 231–242, 2009.
[44] A. Eldosouky, W. Saad, and D. Niyato, "Single controller stochastic games for optimized moving target defense," pp. 1–6, IEEE, 2016.
[45] K.-H. Lee and R. Baldick, "Solving three-player games by the matrix approach with application to an electric power market," IEEE Transactions on Power Systems, vol. 18, no. 4, pp. 1573–1580, 2003.
[46] Z. A. Khan, "Using energy-efficient trust management to protect IoT networks for smart cities,"