Data sharing games
Víctor Gallego a,b,∗, Roi Naveiro a,b, David Ríos Insua a,c, Wolfram Rozas d

a ICMAT-CSIC, Spain
b The Statistical and Applied Mathematical Sciences Institute (SAMSI), NC, USA
c School of Management, USST, China
d IBM ILBD

∗ Corresponding author: [email protected]
Abstract
Data sharing issues pervade online social and economic environments. To foster social progress, it is important to develop models of the interaction between data producers and consumers that can promote the rise of cooperation between the involved parties. We formalize this interaction as a game, the data sharing game, based on the Iterated Prisoner's Dilemma and deal with it through multi-agent reinforcement learning techniques. We consider several strategies for how the citizens may behave, depending on the degree of centralization sought. Simulations suggest mechanisms for cooperation to take place and, thus, achieve maximum social utility: data consumers should perform some kind of opponent modeling, or a regulator should transfer utility between both players and incentivise them.
Keywords: data sharing; iterated prisoner's dilemma; multi-agent reinforcement learning.
1. Introduction
As recently discussed [24], data, as the intangible asset par excellence in the 21st century, is the most disputed raw material at global scale. Ours is a data-driven society and economy, with data guiding most business actions and decisions. This is becoming even more important as many business processes are articulated through a cycle of sensing-processing-acting. Indeed, Big Data is the consequence of a digitized world where people, objects and operations are fully instrumented and interconnected, producing all sorts of data,
both machine-readable (numbers and labels, known as structured) and human-readable (text, audio or video, known as unstructured). As data acquisition grows at sub-second speed, the capability to monetize them arises through the ability to derive new synthetic data. Thus, considered as an asset, data create markets and enhance competition. Unfortunately, they are creating bad practices as well. See [17] for an early discussion as well as the recent European directives and legislative initiatives to promote public-private B2G data partnerships, e.g. [8].

This is the main reason for analyzing data sharing games with mechanisms that could foster cooperation to guarantee and promote social progress. Data sharing problems have been the object of several contributions and studied from different perspectives. For example, [16] proposes a game theoretic approach to help users determine their optimal policy in terms of sharing data in online social networks, based on a confrontation between a user (aimed at sharing certain information and hiding the rest) and an attacker (aimed at exposing the user's PI or concealing the information the user is willing to share). This is modelled through a zero-sum Markov game; a Markov equilibrium is computed and the corresponding Markov strategies are used to give advice to users. [12] reviews the impact of data sharing in science and society and presents guidelines to improve the efficiency of data sharing processes, quoting [22], who provide a game theoretical analysis suggesting that sharing data with the community can be the most profitable and stable strategy. Similarly, [10] consider a setting in which a group of firms must decide whether to cooperate in a project that requires the combination of data held by several of them; the authors address the question of how to compensate the firms for the data they contribute with, framing the problem as a transferable utility game and characterizing its Shapley value as a compensation mechanism.

Our approach models interactions between data owners and consumers inspired by the iterated prisoner's dilemma (IPD) [3]. This is an elegant incarnation of the problem of how to achieve agents' cooperation in competitive settings. Other authors have used similar models in other socio-technical problems, as in politics [5] and security [18], among others. Our approach to model agents' behavior is different and relies on multi-agent reinforcement learning (MARL) arguments [13]. Reinforcement learning (RL) has been successfully applied to games that are repeated over time, thus making it possible for agents to optimize their strategies [19]. The work of [25] also discusses the use of RL in iterated games such as the Prisoner's Dilemma, although they do not focus on the issue of incentivizing cooperation between the players. Through RL we are able to identify relevant mechanisms to promote cooperation.

The structure of the paper is as follows. First, a qualitative description of the problem, the intervening agents and their strategies is provided. We next model it quantitatively and develop scenarios that could promote cooperation through MARL. We study those in simulated environments, confirming that cooperation is both possible and the best social strategy, ending with a brief discussion.
2. Data sharing: categories, agents and strategies
Before modeling interactions between data consumers and producers, it is convenient to understand the data categories available. Even though admittedly with a blurry frontier, from a legal standpoint, there are two main ones:

• Data that should not be bought/sold. This refers to personal information (PI), as e.g. the data preserved in the European Union through the General Data Protection Regulation (GDPR) [11] and other citizen defense frameworks aimed at guaranteeing civic liberties. PI includes data categories such as internal information (like knowledge and beliefs, and health data); financial information (like accounts or credit data); social information (like criminal records or communication data); or tracking information (like computer device or location data).

• Data that might be purchased. Citizens' data is a property, there being a need to guarantee a fair and transparent compensation. Accountability mitigates market frictions. For traceability and transparency reasons, blockchain-based platforms are being implemented at the moment.

A characterization of what type of data belongs to each category will depend on the context and is, most of the time, subjective.

In any case, in the last decades, modern data analytics techniques and strategies are enabling the generation of new types of data:

• Data that might be estimated/derived.
Currently available analytics technologies have the ability of estimating efficiently citizen behavior and other characteristics by deeply analyzing Big Data. For instance, platforms such as IBM Personality Insights [15] can estimate personality traits of a given individual using his/her tweets, thus facilitating marketing activities. As a result, the originating data becomes a new asset for a company willing to undertake its analysis.

Having mapped the available data, there is a need to understand the knowledge actually available and how it is uncovered. Within the above scenario, we consider two players in a data sharing game: the data providers (Citizen, she) and the Dominant Data Owner (DDO, he). A DDO could be a private company, e.g. GAFA (Google, Apple, Facebook, Amazon) or Telefonica, or a public institution (Government). Inspired by the classic Johari window [20], we inter-relate now what a Citizen knows or does not with what a DDO knows or does not, to obtain these scenarios:

1. Citizen knows what DDO does. The citizen has created a data asset which she sells to a DDO. Sellable data create a market which could evolve in a sustainable manner if accountability and transparency are somehow guaranteed.

2. Citizen knows what DDO does not. This is the PI realm. Citizens would want legal frameworks like the GDPR or data standards preserving citizen rights, mainly ARCO-PL (access, rectification, cancellation, objection, portability and limitation), so that PI is respected.

3. Citizen does not know what DDO does. The DDO has unveiled the citizen's PI through deep analysis of Big Data, as in the famous Target pregnant teenager case [14]. This analysis may be acceptable if data are dealt with just as a target. Data protection frameworks should guarantee civil rights and liberties in such activities.

Note that we could also think of a fourth scenario in which neither the citizen knows, nor the DDO does, although this is clearly unreachable.

Once explained how knowledge is shared, we analyze how knowledge creation can be fostered to stimulate social progress, studying cooperation scenarios between Citizen and DDO. We simplify by considering two strategies for both players, respectively designated
Cooperate (C) and
Defect (D), leading to the four scenarios in Table 1.

                      DDO cooperates                          DDO defects
Citizen cooperates    Citizen sells data and demands          Citizen taken for a ride: she sells
                      data protection; DDO purchases and      data, while the DDO does not pay for
                      respects Citizen data.                  the Citizen's data with services.
Citizen defects       DDO taken for a ride purchasing:        Citizen sells wrong/noisy data; the
                      the Citizen does not pay for DDO        DDO does not pay for data with
                      services, selling wrong/noisy data,     services.
                      and becomes a free rider.

Table 1: Scenarios in the data sharing game.

Reflecting about them, the only one that ultimately fosters knowledge creation and, therefore, stimulates social progress is mutual cooperation. It is the best scenario and produces mutual value. Cooperation begs for a FATE (fair, accountable, transparent, ethical) technology like blockchain. In such a scenario, data (Big Data), algorithms and processing technology would boost knowledge. Mutual cooperation is underpinned by decency and indulgence values such as being nice (cooperate when the other party does); provokable (punish non-cooperation); forgiving (after punishing, immediately cooperate, resetting credit); and clear (the other party easily understands and realises that the best next move is to cooperate).

Mutual defection is the worst scenario in societal terms: it produces a data market failure, stagnating social progress. As there is no respect from either side, no valuable data trade will happen, and even a noisy data vs. unveiled data war will take place. Loss of freedom may arise as a result.

The scenario (Citizen cooperates, DDO defects) is the worst for the citizen, leading to data power abuses, as with the UK "ghost" plan. It would generate asymmetric information, adverse selection, and moral hazard problems, in turn producing data market failures. The DDO behaves incorrectly, there being a need to punish unethical and illegal behaviour. As an example, the GDPR sets the right to receive explanations for algorithmic decisions. There is also a need to mitigate systematic cognitive biases in algorithms. Citizens may respond by sending noisy data, rejecting data services, imposing standards over data services or setting prices according to success.

Finally, the scenario (Citizen defects, DDO cooperates) is the worst for the DDO. It leads to data market failures and shrinks knowledge. This stems from a behavior of not paying for public/private services that can be obtained anyway. In the long run, this erodes the quality and creativity of public and private services. This misbehavior should be punished to restore cooperation, and a fair price should be demanded for services.

              DDO
              C         D
Citizen  C    R, R      S, T
         D    T, S      P, P

Table 2: Payoffs in the data sharing game.
3. A model for the data sharing game
We model interactions between citizens and DDOs over time from the perspective of the IPD. Table 2 shows its reward bimatrix. The row player will be the Citizen, for whom cooperate means that she wishes to sell and protect her data, whereas defect means she either sells wrong data or decides not to contribute. The DDO will be the column player, for whom cooperate means that he purchases and protects data, whereas defect means that he is not going to pay for the collected data or will not protect it. Payoffs satisfy the usual conditions in the IPD, that is, T > R > P > S and 2R > T + S. When numerics are necessary, we adopt the choice T = 6, R = 5, P = 1, and S = 0.

It is well-known that in the one-shot version of the IPD game, the unique Nash equilibrium is (D, D), leading to the social dilemma described above: the selfish rational point of view of both players leads to an inferior societal position. Similarly, if the game is played N times, and this is known by the players, these have no incentive to cooperate, as we may reason by backwards induction [4]. However, in realistic scenarios, players are not sure about whether they will meet again in the future and, consequently, they cannot be sure when the last interaction will take place [3]. Thus, it seems reasonable to assume that players will interact an indefinite number of times, or that there is a positive probability of meeting again. This possibility that players might interact again is precisely what makes cooperation emerge.

The framework that we adopt to deal with the problem is MARL [7]. Each agent a ∈ {C, DDO} maintains its policy π_a(d_a | o_a, θ_a), used to select a decision d_a under some observed state of the game o_a (for example, the previous pair of decisions) and parameterised by certain parameters θ_a. Each agent learns how to make decisions by optimizing his policy under the expected sum of discounted utilities

    max_{θ_a} E_{π_a} [ ∑_{t=0}^{∞} γ^t r_{a,t} ],

where γ ∈ (0, 1) is a discount factor and r_{a,t} is the reward that agent a attains at time t. The previous optimization can be performed through Q-learning or policy gradient methods [27]. The main limitation with this approach in the multi-agent setting is that if the agents are unaware of each other, they are shown to fail to cooperate [13], leading to defection every time, which is undesirable in the data sharing game.

As an alternative, in order to foster collaboration, we propose three approaches, depending on the degree of decentralization and incentivisation sought.

• In a (totally) decentralized case, C and DDO are alone and we resort to opponent modelling strategies, as showcased in Section 4.1. However, this approach may fail under severe misspecification of the opponent's model. Ideally, we would like to encourage collaboration without making strong assumptions about the learning algorithms used by each player.

• Alternatively, a third party could become a regulator of the data market: C and DDO use it and the regulator introduces taxes, as showcased in Section 4.2. The benefit of this approach is that the regulator only needs to observe the actions adopted by the agents, not needing to make any assumption about their models or motivations, and optimizing their behaviors based on whatever social metric he considers.

• Finally, in Section 4.3 we augment the capabilities of the previous regulator to enable it to incentivize the agents, leading to further increases in the social metric considered.

To fix ideas, we focus on a social utility (SU) metric defined as the agents' average utility

    SU_t = (r_{C,t} + r_{DDO,t}) / 2.    (1)

This requires adopting a notion of transferable utility, serving as a common medium of exchange that can be transferred between agents, see e.g. [2].

4. Three solutions via Reinforcement Learning

4.1. A decentralized solution: opponent modelling

Our first approach models the interaction between both agents as an IPD, and simulates such interactions to assess the impact of different DDO strategies over social utility. We first fix the strategy of the DDO, assume that the citizen models the DDO behaviour, and simulate interactions between both agents, finally assessing social utility.

We model the Citizen as a Fictitious Play Q-learner (FPQ) in the spirit of [13]. She chooses her action d_a ∈ {C, D} maximizing her expected utility ψ(d_a), defined through

    ψ(d_a) = E_{p_{FP}(d_b)} [ Q(d_a, d_b) ] = ∑_{d_b ∈ {C,D}} Q(d_a, d_b) p_{FP}(d_b),

where p_{FP}(d_b) reflects the Citizen's beliefs about her opponent's actions d_b ∈ {C, D}, and Q(d_a, d_b) is the augmented Q-function from the threatened Markov decision processes defined in [13], an estimate of the expected utility obtained by the Citizen if both players were to commit to actions d_a, d_b.

We estimate the probabilities p_{FP}(d_b) using the empirical frequencies of the opponent's past plays, as in Fictitious Play [6]. To further favor learning, the Citizen could place a Beta prior over p_C ∼ B(α, β), the probability of the DDO cooperating, with probability p_D = 1 − p_C of defecting. Then, if the opponent chooses, for instance, cooperate, the citizen updates her beliefs leading to the posterior p_C ∼ B(α + 1, β), and so on.

We may also augment the citizen model to have memory of the previous opponent's action. This can be straightforwardly done replacing Q(d_a, d_b) with Q(s, d_a, d_b) and p_{FP}(d_b) with p_{FP}(d_b | s), where s ∈
{C, D} × {C, D} is the previous pair of actions both players took. Thus, we need to keep track of four Beta distributions, one for each value of s. This FPQ agent with memory will be called FPM. Clearly, this approach could be expanded to account for longer memories over the action sequences. However, [21] shows that agents with a good memory-1 strategy can effectively force the iterated game to be played as memory-1, even if the opponent has a longer memory. Code for all the simulations performed can be found at https://github.com/vicgalle/data-sharing.

4.1.1. Experiments

We simulate the previous IPD under different strategies for the DDO and measure the impact over social utility. For each scheme, we display the social utility attained over time by the agents. For all experiments, we model the citizen as an FPM agent (with memory-1). The discount factor was set to 0.96.
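To fix notation before presenting the results, the following sketch (in Python) illustrates how such a simulation can be set up. It is only a minimal illustration written by us: class and function names are hypothetical, and the Q-value update is a simplified stand-in for the augmented Q-function of [13]; the actual implementation is the one in the repository linked above.

import numpy as np

C, D = 0, 1                                  # action encoding: cooperate / defect
T_, R_, P_, S_ = 6.0, 5.0, 1.0, 0.0          # payoff values of Table 2

# REWARDS[(citizen action, DDO action)] -> (citizen reward, DDO reward)
REWARDS = {(C, C): (R_, R_), (C, D): (S_, T_),
           (D, C): (T_, S_), (D, D): (P_, P_)}

class FPMCitizen:
    """Fictitious-play Q-learner with memory 1 (names are ours).
    Keeps one Beta posterior over the DDO cooperating for each
    previous joint action s, plus a table Q(s, d_a, d_b)."""

    def __init__(self, gamma=0.96, lr=0.1):
        self.gamma, self.lr = gamma, lr
        self.beliefs = {s: [1.0, 1.0] for s in REWARDS}   # Beta(alpha, beta) per state
        self.Q = {s: np.zeros((2, 2)) for s in REWARDS}   # Q(s, d_a, d_b)

    def opponent_probs(self, s):
        a, b = self.beliefs[s]
        return np.array([a, b]) / (a + b)                 # P(DDO cooperates / defects)

    def act(self, s):
        psi = self.Q[s] @ self.opponent_probs(s)          # expected utility per own action
        return int(np.argmax(psi))

    def update(self, s, d_a, d_b, reward, s_next):
        self.beliefs[s][d_b] += 1.0                       # fictitious-play posterior update
        v_next = np.max(self.Q[s_next] @ self.opponent_probs(s_next))
        td = reward + self.gamma * v_next - self.Q[s][d_a, d_b]
        self.Q[s][d_a, d_b] += self.lr * td               # simplified Q-learning step

def simulate(citizen, ddo_policy, n_steps=1000):
    """Iterate the game; ddo_policy maps the previous joint action to a move."""
    s, social = (C, C), []
    for _ in range(n_steps):
        d_c, d_o = citizen.act(s), ddo_policy(s)
        r_c, r_o = REWARDS[(d_c, d_o)]
        citizen.update(s, d_c, d_o, r_c, (d_c, d_o))
        social.append((r_c + r_o) / 2.0)                  # social utility, Eq. (1)
        s = (d_c, d_o)
    return social

With this sketch, a selfish DDO that always defects is simply ddo_policy = lambda s: D.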
Selfish DDO.
When we assume a selfish DDO, playing always defect, our simulation confirms that this strategy will force the citizen to play defect and sell wrong data, not having incentives to abandon such strategy. Even when citizens have strong prior beliefs that the DDO will cooperate, after a few iterations they will learn that the DDO is always defecting and thus choose also to defect, as shown in Figure 1(a).

Figure 1: Agents' utilities and social utility in case of the DDO always defecting. (a) Agents' utilities. (b) Social utility.
Figure 1(b) shows that under the defecting strategy, the social utility achieves its minimum value.
A Tit for Tat DDO.
We next model the DDO as a player using the Tit for Tat (TfT) strategy (it will first cooperate and then subsequently replicate the opponent's previous action: if the opponent was previously cooperative, the agent is cooperative; if not, it defects). This policy has been widely used in the IPD because of its simplicity and effectiveness [3]. A recent experimental study [9] tested real-life people's behaviour in IPD scenarios, showing that TfT was one of the most widely used strategies. Figure 2 shows that under TfT the social utility achieves its maximum value: mutual cooperation is achieved, thus leading to the optimal social utility.

Figure 2: Social utility of an FPM citizen against a TfT DDO.
It is important to mention, though, that if the citizen had no memory about previous actions, the policy of the DDO could not be learnt and mutual cooperation would not be achieved.
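In the notation of the earlier sketch, a TfT DDO (again, an illustration of ours rather than the repository code) just replays the Citizen's previous move:

def tft_ddo(s):
    """Tit for Tat: repeat the opponent's (Citizen's) previous action."""
    prev_citizen_action, _prev_ddo_action = s
    return prev_citizen_action

# e.g.: social = simulate(FPMCitizen(), tft_ddo)

Since the simulation starts from the state (C, C), the first TfT move is to cooperate, as in the description above.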
Random behaviour among citizens.
Previously, all citizens were assumed to act according to the FPM model. However, assuming that the whole population will behave following such complex strategies is unrealistic. A more reasonable assumption considers having a subpopulation of citizens that acts randomly. To simulate this, we modify the FP/FPM model to draw a random action with probability 0 < ε < 1. With ε = 0.7, this entails a huge decrease in social utility.
A forgiving DDO.
A possible solution for this decrease in social utility consists of forcing the DDO to eventually forgive the Citizen and play cooperate regardless of her previous actions. We model this as follows: with probability p the DDO will cooperate, whereas with probability 1 − p he will play TfT. To assess what proportion of the time the DDO should forgive, we evaluated a grid of values from 0% to 100%, and chose the one that produced the highest increase in social utility. The optimal value was forgiving 70% of the time. As Figure 3 shows, this produces an increase of approximately half a unit in the average social utility with respect to the case of never forgiving.

Note, though, that there exists a limit value for the forgiving rate such that, if surpassed, the social utility will decrease to around 3. The reason for this is that, in this regime, when not acting randomly, the Citizen will learn that the DDO cooperates most of the time, and thus her optimal strategy will be to defect. Thus, in most iterations the actions chosen will be (C, D), leading to a social utility of around 3.

Figure 3: Social utility when citizens act randomly 70% of the time against a TfT and a forgiving TfT DDO.
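Both variants just discussed are small modifications of the earlier sketch (names and structure are ours): a citizen that acts randomly with probability ε on top of the FPM model, and a DDO that forgives with probability p and plays TfT otherwise, using the 0.7 values reported above.

import numpy as np

class NoisyCitizen(FPMCitizen):
    """FPM citizen that, with probability eps, plays a uniformly random action."""
    def __init__(self, eps=0.7, **kwargs):
        super().__init__(**kwargs)
        self.eps = eps

    def act(self, s):
        if np.random.rand() < self.eps:
            return np.random.randint(2)      # random cooperate/defect
        return super().act(s)

def forgiving_tft_ddo(s, p_forgive=0.7):
    """Cooperate with probability p_forgive; otherwise play Tit for Tat."""
    return C if np.random.rand() < p_forgive else tft_ddo(s)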
4.2. A centralized solution: taxes

We now discuss an alternative solution to promote cooperation, introducing a third player, a Regulator (R, it). Its objective is to nudge the behaviour of the other players through utility transfer, based on taxes. Appendix A discusses a one-shot version identifying its equilibria. As in Section 4.1, our focus is on the iterated version of this game.

At each turn, the regulator will choose a tax policy for the agents,

    (tax_{C,t}, tax_{DDO,t}) ∼ π_R(· | o_R, θ_R),

where o_R is the observed state of the game and θ_R are relevant parameters for the regulator. Then, the other two agents will receive their corresponding adjusted utility ˜r_{a,t} through

    ˜r_{a,t} = r_{a,t} − tax_{a,t} + (1/2) ∑_{a'} tax_{a',t},

where the first term is the original utility (Table 2); the second one is the tax that the regulator collects from that agent; and, finally, the third one is the (evenly) redistributed collected reward. Note that SU_t = (r_{C,t} + r_{DDO,t})/2 = (˜r_{C,t} + ˜r_{DDO,t})/2, since the collected taxes are fully redistributed. The regulator's goal is to maximize the expected discounted social utility,

    max_{θ_R} E_{π_R} [ ∑_{t=0}^{∞} γ^t SU_t ].

Then, two nested RL problems are considered: first, the regulator selects a tax regime and, next, the other two players optimally adjust their behaviour to this regime. After a few steps, the regulator updates its policy to further encourage cooperation (higher SU_t), and so on. At the end of this process, we would expect that both players' behaviours would have been nudged towards cooperation.

We thus frame learning as a bi-level RL problem with two nested loops, using policy gradient methods:

1. (Outer loop) The regulator has parameters θ_R, imposing a certain tax policy.
   (a) (Inner loop) The agents learn under this tax policy for T iterations:
   (b) They update their parameters: θ_{a,t+1} = θ_{a,t} + η ∇ E_{π_a} [ ∑_{t=0}^{∞} γ^t r_{a,t} ].
2. The regulator updates its parameters: θ_{R,t+1} = θ_{R,t} + η ∇ E_{π_R} [ ∑_{t=0}^{∞} γ^t SU_t ].

Let us highlight a few benefits of this approach. First, the Regulator makes no assumptions about the policy models of the other players (thus it does not matter whether they are just single-RL agents or are opponent-modelling). Moreover, this framework is also agnostic to the social welfare function to be optimized; for simplicity, we just use the expression (1). It is also scalable to more than two players: the regulator only needs to collect taxes from each player, and then redistributes the wealth. In case we were considering k > 2 players, the collected taxes would be redistributed evenly, each player receiving a fraction 1/k of the total.

4.2.1. Experiments

This experiment illustrates the performance of the general framework, showing how the inclusion of a Regulator encourages the emergence of cooperative behavior. Consider the interactions between a Citizen and a DDO. The parameter for each player is a vector θ_a ∈ R^2, with a ∈ {C, DDO}, representing the logits of choosing the actions, i.e. the unnormalized probabilities of choosing each decision. We consider two types of regulators.
The first one has a discrete action space: tax_{a,t} = 0 if a_R = 0, while for a_R ∈ {1, 2, 3} the regulator retains an increasing fraction of r_{a,t}; for example, when a_R = 2 the tax rate reaches 30%. In this case, θ_R ∈ R^4 represents the logits of a categorical random variable taking the previous values (0, 1, 2, 3).

The second regulator adopts a Gaussian policy, π_R(d_R | o_R, θ_R) = N(d_R | θ_R, σ²) with a fixed variance, and sets tax_{a,t} proportional to sigmoid(d_R) · r_{a,t}, which allows the tax rate to take values in a continuous range.

Agents interact during T = 1000 iterations. After each iteration, both agents perform one gradient update of their policy parameters. The regulator updates its parameters using policy gradients every 50 iterations. Updating the regulator less frequently than the other agents allows them to learn and adapt to the new tax regime, stabilising the overall learning of the system. Figure 4 displays the results. For each of the three variants (no intervention, discrete, continuous) we plot 5 different runs and their corresponding means in a darker color.

Clearly, under no intervention, both agents fail to learn to cooperate, converging to the static Nash equilibrium (D, D). We also appreciate that the discrete policy is not effective either, also converging to (D, D), albeit at a much slower pace. On the other hand, the Gaussian regulator is more efficient, as it manages to avoid convergence to (D, D), although it does not achieve convergence to (C, C). This regulator is more effective than its discrete counterpart because it can better exploit the policy gradient information. Because of this, in the next subsection we will focus on this Gaussian regulator.

In summary, the addition of a Regulator can make a positive impact on the social utility attained in the market, preventing collapse into (D, D).

Figure 4: Social utility under three different regulation scenarios.
However, introducing taxes to the players is not sufficient, since in Figure 4 the social utility converged towards a value of 3, far away from the optimal value of 5.
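A compact sketch of this bi-level scheme with the Gaussian regulator is given below (ours, for illustration only: all hyper-parameter values and the maximum tax rate are assumptions, the agents' updates are plain one-step REINFORCE, and the regulator's gradient is a crude single-batch REINFORCE estimate rather than the exact procedure used for Figure 4; REWARDS is the payoff table from the sketch in Section 4.1.1).

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
theta = {"C": np.zeros(2), "DDO": np.zeros(2)}        # action logits per agent
theta_R, sigma_R = 0.0, 0.3                           # Gaussian regulator mean / std (assumed)
TAX_MAX, LR, LR_R, UPDATE_EVERY = 0.5, 0.05, 0.01, 50 # assumed hyper-parameters

grad_R, social = 0.0, []
for t in range(1000):
    # Agents sample actions from their softmax policies
    acts, score = {}, {}
    for a in theta:
        probs = softmax(theta[a])
        d = rng.choice(2, p=probs)                    # 0 = cooperate, 1 = defect
        acts[a] = d
        score[a] = np.eye(2)[d] - probs               # gradient of log pi(d)
    r = dict(zip(("C", "DDO"), REWARDS[(acts["C"], acts["DDO"])]))

    # Regulator samples a tax level, collects taxes and redistributes them evenly
    d_R = rng.normal(theta_R, sigma_R)
    rate = TAX_MAX * sigmoid(d_R)
    taxes = {a: rate * r[a] for a in r}
    pot = sum(taxes.values()) / 2.0
    r_adj = {a: r[a] - taxes[a] + pot for a in r}

    # Inner loop: one REINFORCE step per agent on the adjusted reward
    for a in theta:
        theta[a] += LR * r_adj[a] * score[a]

    # Outer loop: every UPDATE_EVERY steps the regulator ascends social utility
    su = (r["C"] + r["DDO"]) / 2.0
    social.append(su)
    grad_R += su * (d_R - theta_R) / sigma_R**2       # REINFORCE grad for the Gaussian mean
    if (t + 1) % UPDATE_EVERY == 0:
        theta_R += LR_R * grad_R / UPDATE_EVERY
        grad_R = 0.0

The regulator is updated less frequently than the agents, mirroring the schedule described above.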
4.3. A centralized solution: taxes and incentives

In order to further stimulate cooperative behavior, we introduce incentives to the players via the Regulator: if both players cooperate at a given turn, they will receive an extra amount I of utility, a scalar that adds to their perceived rewards. Appendix B shows that incentives complement the tax framework well, so that mutual cooperation is possible in the one-shot version of this game. Note that, when I > T − R, instead of the Prisoner's Dilemma we have an instance of the Stag Hunt game [26], in which both (C, C) and (D, D) are pure Nash equilibria; achieving mutual cooperation is much simpler in this case.

From now on, we focus the discussion on the iterated version. In this batch of experiments, players interact over T = 1000 iterations, and the Regulator only provides incentives during the first 500 iterations. After that, he will only collect taxes from the players and redistribute them as in Section 4.2. Figure 5 shows results from several runs under different incentive values. A few comments are in order.

Figure 5: Social utility under different incentives with tax collection.
Figure 6: Social utility under different incentives with no tax collection.

Firstly, note that as the incentive increases, so does the social utility. For an incentive of 1, the maximum reward of (C, C) and (D, C) is the same (6) for the Citizen, and cooperation emerges naturally. Also note that, since the policies of each player are stochastic, it is virtually impossible to maintain an exact convergence towards the optimal value of 5, since a small fraction of the time the agents deviate from (C, C) due to the stochasticity in their actions. Second, observe that even when the Regulator stops incentivizing players in the middle of the simulations, both players keep cooperating over time.

We hypothesize that the underlying tax system from Section 4.2 is necessary for players to learn to cooperate and maintain that behaviour even after the Regulator stops incentivizing them. To test this hypothesis, we repeat the experiments removing tax collection, ceteris paribus. Results are shown in Figure 6. Observe now that even under the presence of high incentives, both agents fail to cooperate, with social utility decaying over time. Thus, we have shown that the tax collection framework from Section 4.2 has a synergic effect with the incentives introduced in this section.
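In terms of the previous sketches, the incentive only changes how the per-step rewards are produced; for instance (an illustration of ours, with the incentive switched off after the first 500 iterations as in the experiments above):

def incentivised_rewards(d_citizen, d_ddo, t, incentive=1.0, horizon=500):
    """Payoffs of Table 2, plus an extra amount I for both players when
    (C, C) is played during the first `horizon` iterations."""
    r_c, r_o = REWARDS[(d_citizen, d_ddo)]
    if t < horizon and d_citizen == C and d_ddo == C:
        r_c, r_o = r_c + incentive, r_o + incentive
    return r_c, r_o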
5. Discussion
A defining trend in modern society is the abundance of data, which opens up new opportunities, challenges and threats. In the upcoming years, social progress will be essentially conditioned by the capacity of society to gather, analyze and understand data, as this will facilitate better and more informed decisions. Thus, to guarantee social progress, efficient mechanisms for data sharing are key. Obviously, such mechanisms should not only facilitate the data sharing process, but must also guarantee the protection of the citizen's personal information. As a consequence, the problem of data sharing not only has importance from a socioeconomic perspective, but also from the legislative point of view. This is well described in numerous recent legislative pieces from the EU, e.g. [8], as well as in the concept of flourishing in a data-enabled society [1].

We have studied the problem of data sharing from a game theoretic perspective with two agents. Within our setting, mutual cooperation emerges as the strategy leading to the best social outcome, and it must be promoted somehow. We have proposed modelling the confrontation between dominant data owners and citizens using two versions of the iterated prisoner's dilemma via multi-agent reinforcement learning: the decentralized case, in which both agents interact freely, and the centralized case, in which the interaction is regulated by an external agent/institution. In the first case, we have shown that there are strategies with which mutual cooperation is possible, and that a forgiving policy by the DDO can be beneficial in terms of social utility. In the centralized case, regulating the interaction between citizens and DDOs via an external agent could foster mutual cooperation through taxes and incentives.

Besides fostering cooperation, the data sharing game may be seen as an instance of a two-sided market [23]. Therefore, the creation of intermediary platforms that facilitate the connection between dominant data owners and citizens to enable data sharing would be key to guarantee social progress.
Acknowledgements.
This work was partially supported by the NSF under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute and a BBVA Foundation project. RN also acknowledges support of the Spanish Ministry for his grant FPU15-03636. VG also acknowledges support of the Spanish Ministry for his grant FPU16-05034. DRI is grateful to the MTM2017-86875-C3-1-R AEI/FEDER EU project, and the AXA-ICMAT Chair in Adversarial Risk Analysis.
References

[1] ALLEA. Flourishing in a data-enabled society. ALLEA Discussion Paper, 2019.

[2] Robert J. Aumann. Linearity of unrestrictedly transferable utilities. Naval Research Logistics Quarterly, 7(3):281–284, 1960.

[3] Robert Axelrod. The Evolution of Cooperation. Basic, New York, 1984.

[4] Robert Axelrod and William Hamilton. The evolution of cooperation. Science, pages 1390–1396, 1981.

[5] Steve Brams. Game Theory and Politics. Dover, New York, 2011.

[6] George W. Brown. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, pages 374–376, 1951.

[7] Lucian Busoniu, Robert Babuska, and Bart De Schutter. Multi-agent reinforcement learning: An overview. In Innovations in Multi-Agent Systems and Applications-1, pages 183–221. Springer, 2010.

[8] European Commission. A European strategy for data. 2020. URL: https://ec.europa.eu/digital-single-market/en/policies/building-european-data-economy.

[9] Pedro Dal Bó and Guillaume R. Fréchette. Strategy choice in the infinitely repeated prisoner's dilemma. American Economic Review, 109(11):3929–52, 2019.

[10] Pierre Dehez and Daniela Tellone. Data games: Sharing public goods with exclusion. Journal of Public Economic Theory, 15(4):654–673, 2013.

[11] EUR-Lex. Regulation (EU) 2016/679 of the EU Parliament and of the Council. General Data Protection Regulation. 2016. URL: https://data.europa.eu/eli/reg/2016/679/oj.

[12] Ana Sofia Figueiredo. Data sharing: convert challenges into opportunities. Frontiers in Public Health, 5:327, 2017.

[13] Victor Gallego, Roi Naveiro, David Rios Insua, and David Gomez-Ullate Oteiza. Opponent aware reinforcement learning. arXiv preprint arXiv:1908.08773, 2019.

[14] Kashmir Hill. How Target figured out a teen girl was pregnant. Forbes, 2012.

[15] IBM. Watson Personality Insights. 2020. URL: https://cloud.ibm.com/docs/personality-insights/science.html.

[16] Charles A. Kamhoua, Kevin A. Kwiat, and Joon S. Park. A game theoretic approach for modeling optimal data sharing on online social networks, pages 1–6. IEEE, 2012.

[17] Benn R. Konsynski and F. Warren McFarlan. Information partnerships–shared data, shared scale. Harvard Business Review, 68(5):114–120, 1990.

[18] Howard Kunreuther and Geoff Heal. Interdependent security. Journal of Risk and Uncertainty, 26:231–249, 2003.

[19] Ratul Lahkar and Robert M. Seymour. Reinforcement learning in population games. Games and Economic Behavior, 80:10–38, 2013.

[20] John Luft and Harold Ingham. The Johari window as a graphic model of interpersonal awareness. In Proc. Western Training Lab. in Group Development. UCLA Ext. Off., 1955.

[21] William H. Press and Freeman J. Dyson. Iterated prisoner's dilemma contains strategies that dominate any evolutionary opponent. Proceedings of the National Academy of Sciences, 109(26):10409–10413, 2012.

[22] Tessa E. Pronk, Paulien H. Wiersma, Anne van Weerden, and Feike Schieving. A game theoretic analysis of research data sharing. PeerJ, 3:e1242, 2015.

[23] Jean-Charles Rochet and Jean Tirole. Two-sided markets: a progress report. The RAND Journal of Economics, 37(3):645–667, 2006.

[24] Eric Seuillet and Patrick Duvaut. Blockchain, a technology that also protects and promotes your intangible assets. Harvard Business Review France, 1990.

[25] Aric P. Shafran. Learning in games with risky payoffs. Games and Economic Behavior, 75(1):354–371, 2012.

[26] Brian Skyrms. The Stag Hunt and the Evolution of Social Structure. Cambridge University Press, 2004.

[27] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Appendix A. One-shot game for the centralized case
We model the one-shot version of the centralized case game as a three-agent sequential game. The regulator acts first, choosing a tax policy; after observing it, the agents take their actions. Introducing a regulator can foster cooperation in the one-shot game.

For simplicity, consider the following policy: the regulator will retain a percentage x of the reward if the agent decides to defect, and 0 if it decides to cooperate. Then, the regulator will share evenly the amount collected between both agents. With this, given the regulator's action x, the payoff matrix is as in Table A.3, recalling that T > R > P > S.

              DDO
              C          D
Citizen  C    R, R       S′, T′
         D    T′, S′     P, P

Table A.3: Utilities for the data sharing game.
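For clarity, we make the adjusted payoffs explicit (this short derivation is ours, following the tax-and-redistribution scheme just described): if one agent defects while the other cooperates, the regulator collects xT from the defector and returns xT/2 to each player, so that

    T′ = T − xT + xT/2 = T(1 − x/2),    S′ = S + xT/2,

while the (C, C) and (D, D) payoffs are unchanged, since in those cases the collected amounts are returned symmetrically.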
Assume that if one agent defects and the other cooperates, the first one will receive a higher payoff, that is T′ > S′, which means that x < (T − S)/T. Depending on x, three scenarios arise:

1. T′ > R > P > S′ ⟺ x < 2(P − S)/T. This is equivalent to the prisoner's dilemma. (D, D) strictly dominates, thus being the unique Nash equilibrium.

2. R > T′ > S′ > P ⟺ x > 2(T − R)/T. In this case, (C, C) strictly dominates, becoming the unique Nash equilibrium.

3. T′ > R > S′ > P ⟺ x ∈ (2(P − S)/T, 2(T − R)/T). This is a coordination game. There are two possible Nash equilibria with pure strategies, (C, D) and (D, C).

Moving backwards, consider the regulator's decision. Recall that the regulator maximizes social utility. Again, three scenarios emerge:

1. x < 2(P − S)/T. The social utility is P.
2. x > 2(T − R)/T. The social utility is R.
3. x ∈ (2(P − S)/T, 2(T − R)/T). The social utility is (S + T)/2.

As R > P and 2R > T + S (as requested in the IPD), the regulator maximizes his payoff choosing x > 2(T − R)/T. Therefore, (x, C, C), with x > 2(T − R)/T, is a subgame perfect equilibrium, and we can foster cooperation in the one-shot version of the game.

Appendix B. One-shot game for the centralized case plus incentives
Under this scenario, we consider the reward bimatrix in Table B.4, where I is the incentive introduced by the Regulator.

              DDO
              C                D
Citizen  C    R + I, R + I     S, T
         D    T, S             P, P

Table B.4: Utilities for the data sharing game with incentives.
Consider the case in which the agents take the (C, D) pair of actions. In this case, they perceive rewards (S, T). After tax collection and distribution, it leads to (S − Sx + Tx, T − Tx + Sx), with x being the tax rate collected by the Regulator. In order to ensure that (C, C) is a Nash equilibrium, two conditions must hold:

• S − Sx + Tx > P, so that agents do not switch from (C, D) to (D, D). This simplifies to x > (P − S)/(T − S).

• R + I > T − Tx + Sx, so that the agents do not switch from (C, C) to (D, C). This simplifies to x > (T − R − I)/(T − S).
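As a numerical illustration (ours, using the payoff choice from Section 3), take T = 6, R = 5, P = 1, S = 0 and an incentive I = 1. The two conditions then reduce to

    x > (P − S)/(T − S) = 1/6    and    x > (T − R − I)/(T − S) = 0,

so any tax rate above 1/6 suffices for (C, C) to be a Nash equilibrium of the one-shot game with incentives.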