Valuating User Data in a Human-Centric Data Economy
Marius Paraschiv, IMDEA Networks, Leganés – Madrid, [email protected]
Nikolaos Laoutaris, IMDEA Networks, Leganés – Madrid, [email protected]
Abstract — The idea of paying people for their data is increasingly seen as a promising direction for resolving privacy debates, improving the quality of online data, and even offering an alternative to labour-based compensation in a future dominated by automation and self-operating machines. In this paper we demonstrate how a Human-Centric Data Economy would compensate the users of an online streaming service. We borrow the notion of the Shapley value from cooperative game theory to define what a fair compensation for each user should be for movie scores offered to the recommender system of the service. Since determining the Shapley value exactly is computationally inefficient in the general case, we derive faster alternatives using clustering, dimensionality reduction, and partial information. We apply our algorithms to a movie recommendation data set and demonstrate that different users may have a vastly different value for the service. We also analyse the reasons that some movie ratings may be more valuable than others and discuss the consequences for compensating users fairly.
I. INTRODUCTION

Data, and the economy around it, are said to be driving the fourth industrial revolution. Interestingly, the people whose data is what moves the new economy have a rather passive role in it, as they are left outside the direct value flow that transforms raw data into huge monetary benefits. This is a consequence of the de facto understanding (or one may say misunderstanding) between people and companies, that the former get unpaid access to online services in exchange for unpaid access to their personal data. This is increasingly being challenged by various voices who call for the establishment of a new, renegotiated relationship between users and services. Indeed, a variety of pathologies can be traced back to the way the data economy has been working so far. Some are direct and obvious, such as privacy risks for individuals, and market failures and dangers for the economy from the rise of data monopolies and oligopolies. Others are less obvious, and further reaching into the future, such as mass unemployment due to data-driven automation. It was estimated recently [1] that, if automation due to artificial intelligence reaches maturity and fair remuneration algorithms are set in place, a family of four could earn up to $20,000 per year from their data. The idea of micropayments, or providing small contributions to users in exchange for their presence on a platform or for accessing a service, is of course much older. In the pre-World Wide Web era, France developed a videotex online service called Minitel, which included micropayments as part of its design, but Jaron Lanier brought the idea to public attention in 2013, in his book "Who Owns the Future?" [2]. In it, he argues that we have only undergone half of the Data Revolution: the part that compensates users with implicit benefits, but not the part that also compensates them with explicit monetary benefits. There have been a series of proposed approaches for how this compensation might materialise.
The simplest, at least in theory, would be to assign a context-free value to data, a kind of dollar-per-bit measure. This has been proven to be very hard [3], [4], [5], [12]. Indeed, since the value of data is strongly connected to its intended use, it becomes very difficult to argue about how to assign an a priori average value. For traditional currencies, we are able to have a context-free appreciation of their value for the simple reason that we have been using these currencies long enough to be able to do so. Although we clearly understand nowadays that one's browsing and mobility patterns, social network, or past purchases all have value, we are far from being able to appreciate how much this value is in terms of dollars or euros. The latter is further complicated by our inability to tell in advance by how many parties, and how many times, a piece of data may be utilized. As an analogy, selling an individual's data, or rather renting it temporarily, is as difficult and risky as renting an infinitely fast vehicle, with no gas and maintenance costs, and without any prior restrictions with regard to mileage or the person driving it.

A second proposed method has been to compensate users for their privacy damage [13], [14]. Processing massive amounts of data can lead to privacy infringements, such as the leakage of habitual user behaviour, location or other personally identifiable information (PII). Users are thus seen as victims who must be compensated for their damage.

Our approach is different: we consider users as active partners in the data value chain. Such a chain requires a business model, smart predictive algorithms for extracting useful information from raw data, and online marketing for attracting and retaining users, among many others.
The fundamental component of the value chain, however, is the user, and it is ultimately a matter of common sense that they should be rewarded in a fair manner, which may or may not exceed the perceived privacy-related damages.

In a Human-Centric Data Economy, when a transaction, or set of transactions, is converted, a proportion of the obtained revenue will be returned to the users. Defining the right amount to be returned to the users is difficult, as it depends on many market characteristics of a multilateral value chain, such as competition and user loyalty [6]. In this paper, we assume that the total amount of revenue to be redistributed to users is given, e.g., 5% or 10%, or any other number produced by the competition between services within a given sector. We thus turn our attention to the next important question, namely: given a fixed total amount to be distributed, how should this distribution be performed? How should different users be compensated based on the value of the data that they contribute to a specific service?

One obvious answer would be to split the sum by the number of users on the platform. This may not be fair, however, because it is unreasonable to assume that all users contribute equally to the service. In the case of a recommendation system, for example, some users may rate and view hundreds of items, while others may have a much sparser activity. Another example would be a traffic application. Users who travel regularly through a given area and constantly provide feedback may be more relevant to the platform than occasional passers-by. Even at an equal level of intensity, some movie scores or location data may be more important for a recommender than others. For example, a score for a movie that just aired may be more important than a score for a well-known (and heavily voted) blockbuster movie.
Similarly, speed information for a known-to-be-congested route is less useful to a navigation app than speed data for an unknown alternative route with less traffic.

Another important point to understand is that the Data Economy is not a zero-sum game. Paying users for data need not be seen as a measure that will reduce the revenue or profitability of online services. More and better data can lead to better services and thus more revenue and profit. It is thus not surprising that the vision of paying directly or through taxes for data has received positive comments from industry leaders such as Bill Gates [7], Elon Musk [8] and Mark Zuckerberg [9]. Also, despite not happening today, there is a realistic path that can lead to wide adoption. It only requires that a small number of visionary companies start offering micro-payments to get a competitive advantage in terms of user retention, for others to follow, and the practice to get traction in the online services market.

Our paper brings a threefold technical contribution towards the realisation of a Human-Centric Data Economy:

• We define data payoff fairness in terms of accepted economic notions. Specifically, we use the Shapley value from collaborative game theory to define how to split a total payback among all the users that have provided data to a revenue-generating online service. We also sketch a Contribution-Reward framework for implementing such paybacks in practice.

• We develop two algorithms that can provide efficient estimates of the Shapley value, which in its raw form does not scale for large data sets. The first one applies minibatch k-Means with N d-dimensional points, for k clusters and t iterations, directly on the definition of Shapley to reduce the complexity from O(N!) to O(Nkdt + k!), where k, d, t ≪ N. The second is an O(N) heuristic based on local information that does not use the Shapley definition.
• We apply the above algorithms to a movie scoring data set and study how different users may be in terms of their value for the service. We observe that some users may be vastly more valuable than others.

• Finally, we study how scoring behaviour impacts user value. We find that what differentiates a "good" user from a "bad" one is that the good user tends to vote mainly for the most popular movies and is consistent with the movie popularity hierarchy, namely popular movies get high ratings and unpopular ones, low ratings. By contrast, users at the other end of the value distribution tend to vote for, and provide high scores to, unpopular movies as well.

While human centricity and fairness need to reside at the core of the proposed framework, transparency must also be considered. Users need a method of verifying for themselves the amount of payment received, and must be able to connect their behaviour on an online platform to the obtained revenue, in a clearly interpretable manner. We discuss how to do this via an accounting meta-data layer at the end of the article.

Fig. 1. The Contribution-Reward framework. In establishing the proportion of each user's contribution, one first determines important features or relevant measures that result in revenue. One such measure could be the number of useful recommendations obtained from a recommender system. After a measure or relevant feature has been identified, user contributions to it are computed and users are ordered in a hierarchy. Based on their place in the value hierarchy, they are remunerated.

II. BACKGROUND
In the Background section, we introduce the Contribution-Reward framework and describe the setting for our two algorithms: a recommender system generating recommendations for the users of an online movie streaming service. We also introduce the Shapley value, as the backbone of the first proposed algorithm and an accepted point of reference in terms of fair credit distribution. The section ends with a toy example that shows how the exact Shapley value is computed in a simple two-user setting.
A. The Contribution-Reward Framework
The proposed Contribution-Reward framework, depicted in Fig. 1, aims to provide users with a remuneration scheme that is both fair and sustainable. With usage, the service gathers and stores relevant data. Based on the data type, a number of important features are chosen and the contribution of each user to the said features is estimated. Finally, the user receives a repayment from the platform, proportional to their estimated contribution. The difficulty of finding the said estimation resides in the need to first observe which are the relevant features that make a data set valuable. In the case of recommendation data, for example, it could be that users voting for the most popular movies, thus contributing to an already existing hierarchy, are considered more valuable; or, on the contrary, users reviewing items which are initially unpopular could bring an element of novelty, and thus a higher overall value. On the other hand, if the accuracy and update rate of the data set are essential, for example in traffic applications, users with a higher contribution frequency may be considered of higher value.

It thus becomes clear that user value depends on the structure of the data and its intended use. Due to this subjective nature, deriving a generally-applicable valuation framework is not a straightforward task. The question we raise here is twofold: first, how can one determine a hierarchy of value for a given set of users and a particular use case (recommender systems); and second, how can one quantify the position of each user in the value hierarchy? An answer to the second question is of particular importance, since if one can assign scores to users based on their contributions, one can then define a mapping from these scores to an actual financial amount.

For the rest of this paper, the setting is thus fixed: a recommender system is trained on a training set, consisting of movie reviews, and then makes predictions on a separate test set.
If the recommendations made to a user on the test set have an error below a fixed threshold, we consider that there is a high likelihood that the user will want to watch the recommended item. We refer to these recommendations as "clicks". As such, when we say "user A has generated 5 clicks", this is to be interpreted as "the recommender has made 5 recommendations to user A, all with an error below our set threshold". Clicks here play the role of the important feature in the Contribution-Reward framework.

The most naïve approach would be to train the recommender system by removing one user at a time, and counting the difference in the overall number of clicks obtained on the test set. There are a series of disadvantages to this: first, a data set may contain reviews from hundreds of thousands of users, quickly making the leave-one-out training computationally unfeasible. The second drawback resides in the fact that removing only a single user may not have the expected effect on the recommender. Indeed, if the user provides a significant amount of novelty, it may happen that the system produces more clicks in the absence of the user than in their presence, leading to a negative assigned value. This would further be difficult to map to a monetary contribution. In the following sections, we present two alternative methods which avoid these shortcomings and can be implemented in a computationally-scalable way.

After a theoretical discussion of the proposed algorithms, they are applied to a case study based on a subset of the MovieLens data set [15], depicting how groups of users can contribute differently to the overall performance of the recommender system, and thus hold different values with respect to the service.
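The click-counting criterion described above amounts to a simple thresholding step over prediction errors. The following fragment is an illustrative sketch, not part of the original system; the function name, the 0.5 threshold and the rating values are our own assumptions:

```python
import numpy as np

def count_clicks(pred_ratings, true_ratings, threshold=0.5):
    """Count recommendations whose absolute prediction error is below
    the threshold; each such recommendation counts as one 'click'."""
    errors = np.abs(np.asarray(pred_ratings) - np.asarray(true_ratings))
    return int(np.sum(errors < threshold))

# Hypothetical test-set predictions for one user.
predicted = [4.2, 3.1, 2.0, 4.8]
actual    = [4.0, 3.9, 2.1, 4.7]
print(count_clicks(predicted, actual))  # errors 0.2, 0.8, 0.1, 0.1 -> 3 clicks
```

In practice the threshold would be tuned per data set, since it directly controls how many recommendations qualify as "clicks".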
B. Introducing the Shapley Value
Credit assignment in cooperative games has long been a central problem of cooperative game theory. To this end, Shapley [16], [17] proposed that players should be rewarded in a manner proportional to their average marginal contribution to the payoff of any coalition they could join.

Let 𝒩 be a set of N players and S ⊂ 𝒩 be a coalition with cost v(S). The Shapley value is a uniquely determined vector of the form (φ_1(v), ..., φ_N(v)), where the element representing player i is given by

    φ_i(v) = (1/N!) Σ_{π ∈ S_N} [ v(S(π, i) ∪ {i}) − v(S(π, i)) ],    (1)

where π is a permutation representing the arrival order of set 𝒩, while S(π, i) represents the set of players that have arrived into the system before player i.

The Shapley value satisfies a series of important properties:

• Efficiency: the total gain is completely distributed among the players,

    Σ_{i ∈ 𝒩} φ_i(v) = v(𝒩).    (2)

• Symmetry: if i and j are two players who bring equal contributions, in the sense that, for every subset S that contains neither i nor j, v(S ∪ {i}) = v(S ∪ {j}), then their respective Shapley values are also equal, φ_i(v) = φ_j(v).

• Linearity: if two coalition games, denoted v and w, are combined, the resulting gain is the sum of the gains derived from each game separately,

    φ_i(v + w) = φ_i(v) + φ_i(w),    (3)

and also φ_i(α · v) = α · φ_i(v), for any real α.

• Null player: the Shapley value of a null player is zero. A player is considered null if they do not bring a contribution to any possible coalition.

Unfortunately, computing the Shapley value has also been proven to be NP-hard for many domains [18], [19], [20]. Since it takes into account all possible coalitions, for each user, the number of terms scales with N!, where N represents the number of users, such that it quickly becomes computationally unfeasible. In Ref.
[21] the authors use Monte Carlo to approximate the Shapley value for computing the cost contribution of individual households to the peak-hour traffic and costs of an Internet Service Provider (ISP). In that case, the relatively simpler structure of the problem made Monte Carlo an appropriate technique for approximating Shapley. Other recent works have presented approximation algorithms for Shapley for specific problems of lower complexity than recommendation [10], [11]. In the context of the current proposal, the inherently higher complexity of the considered value functions v(), which may represent the workings of complex ML algorithms for tasks like recommendation, makes using Monte Carlo inaccurate according to our preliminary tests. Instead, we intend to use clustering to reduce the input size of the problem. Our approach will be to first group users into a number of clusters according to the similarity of their movie ratings (or trajectories, in the case of mobility-related applications), and then compute the Shapley value of each cluster instead of each user. By controlling the number of clusters we can make the computation as precise and complex as our computing resources allow. In Section III we will apply this method to a movie recommendation data set containing a number of users for which the exact computation of Eq. 1 is not possible.

C. A Toy Example
The purpose of this section is to provide a very simple example of computing the Shapley value for the case of two users. Consider an artificial data set containing movie recommendations from two different users. The predictions made by the recommender are the net contribution in our case. We start from the marginal contributions V({1}) = 12 and V({2}) = 15, which would mean that the presence of user 1 alone in the data set causes the recommender to produce 12 useful movie recommendations, and the presence of user 2 alone results in 15 useful recommendations. Let us further assume that the presence of both users simultaneously increases the number of recommendations to 28, hence V({1, 2}) = 28. Since it is clear that the two contributions are not equal, the following question arises: how can we find a factor, proportional to the user contributions, that would be useful in determining a fair repayment?

For this, we compute the Shapley value, as defined in Eq. 1. We first note that there are two possible orders of arrival of the users, with equally likely probabilities of occurrence: [1, 2] or [2, 1]. In the first situation, user 1 comes first, bringing a contribution of V({1}) = 12, followed by user 2, who increases the overall useful contribution to V({1, 2}) = 28. Thus, the net contributions of the two users, for this particular order of arrival, are

    φ_[1,2](1) = V({1}) = 12,
    φ_[1,2](2) = V({1, 2}) − V({1}) = 28 − 12 = 16.

In the second case, the order of arrival is reversed, namely [2, 1]. User 2 arrives first, bringing a contribution V({2}) = 15, followed by user 1, who increases the net contribution to V({1, 2}) = 28, thus

    φ_[2,1](1) = V({1, 2}) − V({2}) = 28 − 15 = 13,
    φ_[2,1](2) = V({2}) = 15.

Since the two cases are equally probable, the Shapley value of each user is the sum of their two marginal contributions divided by the number of arrival orders (the factorial of the number of users):

    φ(1) = (1/2!) · (φ_[1,2](1) + φ_[2,1](1)) = 12.5,
    φ(2) = (1/2!) · (φ_[1,2](2) + φ_[2,1](2)) = 15.5.

We thus see that the average marginal contribution of user 2 is higher than that of user 1, and based on this factor, we can devise a scheme of fair remuneration, to be presented as part of the Contribution-Reward framework.

It is also worth noting that, in order to compute the Shapley terms, we need to compute N! marginal contributions, which leads to scaling problems if an exact computation of the Shapley value is required. The methods presented in this paper provide ways of overcoming this obstacle through various approximate approaches.

III. USER VALUE ESTIMATION
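As a baseline for the approximations discussed in this section, the brute-force computation of Eq. 1, illustrated by the two-user toy example of the Background section, can be sketched in a few lines of Python. Function names and the dictionary encoding of coalition worths are ours; this approach is feasible only for a handful of players:

```python
from itertools import permutations
from math import factorial

def exact_shapley(players, v):
    """Brute-force Shapley value: average each player's marginal
    contribution over all N! arrival orders (Eq. 1)."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        arrived = set()
        for p in order:
            phi[p] += v(arrived | {p}) - v(arrived)
            arrived.add(p)
    return {p: phi[p] / factorial(len(players)) for p in players}

# Coalition worths from the two-user toy example.
worth = {frozenset(): 0, frozenset({1}): 12,
         frozenset({2}): 15, frozenset({1, 2}): 28}
v = lambda s: worth[frozenset(s)]
print(exact_shapley([1, 2], v))  # {1: 12.5, 2: 15.5}
```

Note that the two values sum to v({1, 2}) = 28, illustrating the efficiency property (Eq. 2).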
In this section we first present and then apply the two proposed algorithms. We then discuss the results obtained on a subset of the MovieLens data set. At the end of this section and in the next, we will interpret user behaviour and try to understand the relationship between the votes given, the number of votes, the types of movies reviewed and the assigned user value.
A. Approximate Shapley Value Estimation (ASVE)
The first user-contribution estimation method is based on the Shapley value, described in the previous sections. The algorithm is constructed around a recommender system framework and we choose our value of interest to be the number of "clicks", or useful recommendations, that the model provides for a given data set. The pseudocode for this algorithm is provided as Algorithm 1.

The input data consists of user identifiers, product identifiers (movie categorical IDs) and votes. From Eq. 1 we observe that computing the Shapley value of each user in the data set directly is unfeasible, as the computational complexity is of order N!, where N (the number of users in the data set) is typically extremely large (in the order of hundreds of thousands or even millions for services like YouTube and Netflix). Clearly such an approach does not scale.

There are two plausible directions one can pursue in order to avoid the complexity barrier. The first is to attempt to estimate the Shapley value using Monte Carlo sampling, as done in Ref. [21]. The second, which we shall employ here, takes advantage of the similarities between user behavioural patterns. In general, when treating data related to user preference, consumers tend to cluster into a limited number of similar groups. For example, there may be individuals with a stronger preference for action movies as opposed to romantic ones. Exploiting such relationships allows us to greatly simplify the estimate of a user's contribution to the overall data set.

Algorithm 1 Approximate Shapley Value Estimation
 1: procedure ASVE
 2:   clusteredData, clusterLabels ← Cluster(inputData)
 3:   clusterCoalitions ← Compute(clusterLabels)

 5:   K ← TrainPredict(inputDataset)

 7:   for S in clusterCoalitions do
 8:     V_S ← TrainPredict(trainData, predData)

11:   perm ← Permute(clusterLabels)
12:   for π in perm do
13:     for i in clusterLabels do
14:       V_i ← V(π) − V(π \ {i})
15:       margContrib[i] += V_i

18:   for i in clusterLabels do
19:     φ_i ← (1 / len(clusterLabels)!) · sum(margContrib[i])
20:   for userId in clusteredData do
21:     φ̃(userId) ← Σ_k [ φ_k / (EuclideanDist(user, centroid(k)) + 1) ]
22:     userValSum += φ̃(userId)
24:   for userId in clusteredData do
25:     φ(userId) ← φ̃(userId) · K / userValSum

Algorithm 2 Neighbourhood Similarity Value Estimation
 1: procedure NSVE
 2:   Rec ← 0
 3:   K ← TrainPredict(inputDataset)

 5:   for userId in inputDataset do
 6:     neighbors ← FindNeighbors(inputDataset, userId)
 7:     for id in neighbors do
 8:       newItems ← IntersectLeft(userId, id)
 9:       for item in newItems do
10:         Rec += 1
11:     RecList.append(Rec)
12:     Rec ← 0

14:   for userRec in RecList do
15:     φ(userId) ← K · userRec / Σ_j userRec_j

The algorithm starts by clustering the input data, prior to which a dimensionality reduction method (such as PCA) is applied. The next step is the calculation of all possible coalitions of clusters. For example, for three clusters, these are {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}, where {1} represents a data set where only the users corresponding to the first cluster are present. The recommender system is then trained on the filtered data sets, corresponding only to the clusters in the above coalitions. A train-test split is performed prior to this, but omitted in line 8 of the pseudocode for clarity. During the prediction phase, the model makes recommendations to users and also provides an error estimate for each recommendation. We assert that, for errors under a certain threshold, the recommendations can be valuable, and we refer to these recommendations as "clicks" for the rest of this paper, in analogy to advertising recommendations where, if the user is interested in an advert, they will click the banner.

After obtaining the number of clicks for every one of the possible subsets of clusters, one must compute the marginal contribution of each cluster. This takes into account every possible complete coalition, where the order of arrival matters. For example, in our three-cluster case, the complete coalitions are the orderings of {1, 2, 3}, such as [1, 2, 3], [2, 1, 3] and [3, 1, 2]. The relevant code is between lines 11-19.

Finally, when the Shapley value (computed in line 19) for each cluster is known, we can proceed to determine the value of each individual user. For this, the centroid of the cluster is labelled with the corresponding Shapley value of the cluster. Thus, the value of each user is equal to the sum of all individual cluster values, each divided by the Euclidean distance between the point representing the user and the respective cluster centroid, with one added for stability.
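The final distribution step of ASVE described above can be sketched as follows, assuming the cluster-level Shapley values have already been computed. All names, as well as the toy coordinates and values, are illustrative:

```python
import numpy as np

def user_values(points, centroids, cluster_phi, total_clicks):
    """Distribute cluster-level Shapley values to individual users.

    Each user's raw score sums the cluster values weighted by the
    inverse Euclidean distance to each centroid (+1 for stability);
    the scores are then rescaled so they sum to the total number of
    clicks K, satisfying the efficiency condition."""
    raw = []
    for x in points:
        dists = np.linalg.norm(centroids - x, axis=1) + 1.0  # +1 for stability
        raw.append(np.sum(cluster_phi / dists))
    raw = np.array(raw)
    return total_clicks * raw / raw.sum()

# Tiny example: 3 users and 2 clusters in a projected 2-D space.
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
cluster_phi = np.array([20.0, 8.0])  # hypothetical cluster Shapley values
vals = user_values(points, centroids, cluster_phi, total_clicks=28.0)
print(vals.sum())  # 28.0, i.e. the values sum to K
```

Users close to a high-value centroid inherit most of that cluster's value, while distant users receive only a small, smoothly decaying share.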
The assigned values, from both proposed algorithms, are then scaled with the total number of clicks produced by the recommender on the complete data set, to ensure that the efficiency condition is met. In this manner, to every point on a projected two-dimensional surface representing the space of all users, we can assign an approximate Shapley value.

We have applied the ASVE algorithm to a subset of the MovieLens data set, containing 92,394 total ratings on 4,180 movies from 610 users, with the intent of understanding whether the value hierarchy provided by the method can be intuitively understood. First, the users were separated into three classes: the best users (representing the ones with the highest scores given by the ASVE method), average users, and bad users. We considered whether or not there is a relationship between movie popularity (based on the overall number of ratings a movie has received) and the vote distribution of the three classes of users.

In Fig. 2, the user vote distributions are presented for the three classes of users. The movies are grouped into equal-sized ordered buckets, with their popularity decreasing in the positive direction of the horizontal axis. We observe that users which the ASVE method considers as being the most valuable tend to vote mostly on popular movies, whereas users considered to have low value have a more widespread distribution of votes. This reinforces our initial assumption that the ASVE method gives high scores to users who agree with a predetermined hierarchy, not necessarily to users with a novel contribution.

Fig. 2. User vote distributions based on movie popularity for the ASVE method. Votes from the most valuable users are shown in red, those from average users in blue and votes from the least valuable users are shown in green. For clarity, the movies have been grouped into ordered buckets, with popularity decreasing in the positive direction of the horizontal axis.

Fig. 3. User vote distributions based on movie popularity for the NSVE method. Votes from the most valuable users are shown in red, those from average users in blue and votes from the least valuable users are shown in green. For clarity, the movies have been grouped into ordered buckets, with popularity decreasing in the positive direction of the horizontal axis.

One remark that must be made is that, by design, the application of ASVE requires one prior check. When removing a cluster of users, it is entirely possible that one or more clusters provide a negative contribution. It is therefore essential that one checks the values assigned to the clusters for non-negativity, prior to applying the method to individual users.
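The non-negativity check mentioned above can be implemented as a simple guard before the per-user distribution step. This sketch, including the optional clipping fallback, is our own suggestion rather than part of the original method:

```python
def check_cluster_values(cluster_phi, clip=False):
    """Verify that every cluster's Shapley value is non-negative before
    distributing value to individual users; optionally clip negatives
    to zero instead of raising an error."""
    negatives = [i for i, phi in enumerate(cluster_phi) if phi < 0]
    if negatives and not clip:
        raise ValueError(f"Clusters with negative Shapley value: {negatives}")
    return [max(phi, 0.0) for phi in cluster_phi]

print(check_cluster_values([20.0, 8.0, -1.5], clip=True))  # [20.0, 8.0, 0.0]
```

Raising an error is the conservative option, since clipping silently discards the information that a cluster hurt overall recommender performance.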
B. Neighbourhood Similarity Value Estimation (NSVE)
The second approach relies on first reducing the dimensionality of the data set (similar to ASVE) and creating a neighbourhood around each user. Based on the predictive capability of the user on its neighbours, one can approximate the user's importance.

The algorithm works as follows: first, the recommender system is trained on the complete data set and a total number of clicks (similar to the previous method) is obtained (line 3 of Algorithm 2). For each user in the data set, we then construct a neighbourhood and observe the number of items that the central user has rated and the neighbours have not. These items act as an estimator of the user's "predictive potential". For example, if user A has seen and rated movies M1, M2 and M5, and user B, which is in user A's neighbourhood, has only seen movie M1 (out of the list that user A has rated overall), then A could potentially recommend movies M2 and M5 to B, so we say that "A has a predictive potential of 2 (movies) over B". This predictive potential is then used, in the final loop of Algorithm 2, to proportionately assign clicks (from the total number K) to each user, based on their respective predictive potentials.

One must keep in mind that the size of the neighbourhood is a hyperparameter that must be determined through trial and error. Furthermore, there may be distance metrics other than the Euclidean distance chosen here that could potentially offer a better estimate; however, for simplicity, we restrict ourselves to Euclidean distances in the present paper.

In order to compare the two methods, a proper scaling must be found for the two corresponding distributions. The unscaled distributions, shown in Fig. 4, are correlated, with a correlation distance of 0.29 (0 representing a perfect correlation) for the MovieLens data set. We were thus able to learn an approximate mapping that projects the second distribution onto the domain of the first, making a direct comparison possible, as depicted in Fig. 5.

Fig. 4. The ASVE distribution, shown in blue, and the unscaled NSVE distribution, shown in orange. It is important to note that, even though the variance of the two distributions is different, the user value hierarchy is maintained to a large degree.

Fig. 5. The ASVE distribution, in blue, and scaled NSVE distribution, in orange, showing artefacts due to the approximate mapping between the two.

Due to the nature of this mapping being approximate, one sees artefacts, such as the unusually tall peaks of the projected NSVE distribution compared to its ASVE counterpart. While this is an issue that needs to be considered when comparing method predictions, the mapping process would not be necessary in a real-world implementation, as its sole purpose is to aid the comparison, and it would be irrelevant for the purpose of credit assignment.

Having the two distributions on the same scale, we continued our investigation by asking whether or not the second method produces the same distribution of user votes based on movie popularity. As seen in Fig. 3, both methods agree on the fact that users considered valuable tend to vote for popular movies.

It is instructive, at this point, to also see how users voted based on movie popularity. That is, is it the case that more popular movies got higher ratings than less popular ones? Also, how do the three classes of users rate movies of various levels of popularity?

In Fig. 6, we present the vote distribution by user category. Rather surprisingly, there is somewhat little difference between users considered very valuable and those considered average, as both classes seem to give high scores to very popular movies and low scores to unpopular ones. Less valuable users (according to both methods) seem to offer high scores to a wide array of movies, going against the established popularity hierarchy.
The vote distributions in Fig. 6 further confirm that users who reinforce an existing order are rated higher.

One point to note is that, under NSVE, a user cannot by design be assigned a negative value, so no prior checks are required, unlike under ASVE. A potential problem, however, is the method's reliance on the neighbourhood radius hyperparameter, which strongly depends on the data set. Furthermore, users outside a given neighbourhood contribute nothing to the final rating of the neighbourhood's source. This is in stark contrast to the smooth value assignment of ASVE, and it gives rise to a series of artefacts, one being that there are always (at least in all practical situations) users with an assigned value of zero, which makes it harder to map values to actual financial amounts.

A second artefact can be observed once we ask how many clicks movies generate (over all users in a fixed test set) as a function of their popularity. As can be seen in Fig. 7, the neighbourhood-based method is very sensitive to the relative distance between users. As movies become less popular and the users who vote for them are spread over larger distances, the number of clicks generated (which in the case of NSVE represents movies that the central user can recommend to its immediate neighbours) quickly falls to zero. Thus, NSVE strongly favours users who vote for popular movies, around which like-minded users cluster together.

From a computational standpoint, the NSVE algorithm requires only one training of the recommender, in order to obtain the total number of clicks, and the neighbourhood comparison operations are linear. Another advantage of this method is that, in a production environment, the model would train continuously as new user data becomes available, and would only need to perform occasional inference to update the total number of clicks.

Another potential impediment to both methods is their sensitivity to badly-labelled data.
If, for example, the tags describing movie genre in our data set are inconsistent or, worse, misleading, this will affect the distances between users and cause both methods to produce inaccurate value estimates. Data cleaning and a careful analysis of the data set are therefore important before the algorithms are employed.

IV. DISCUSSION
In the previous section, we presented two credit-assignment methods that could become part of an online service and turn users into active participants in the information value flow, from its raw form to the final generated revenue. Since users are the main producers of raw data, fairness requires that they share part of the resulting revenue, in a manner proportional to the significance of their data.

The two proposed methods agree on a few key points. The first is based on the commonly-used Shapley value, which is approximated by being computed at cluster level; users are then scored based on their relative distance from the cluster centers. The second method exploits user similarity, creating a neighbourhood around each user in which users with similar preferences may be found. It is thus assumed that the items the central user can recommend are close to the interests of their neighbours, such that they become potential clicks. This is a crude, but very effective, method for performing distance-based recommendations.

Fig. 6. Movie ratings given by the three user classes, sorted by movie popularity. The average ratings from the best users (over both methods) are depicted in blue, those from users with an intermediate value in orange, and those from the least valuable users in green.

Fig. 7. Movie votes and the number of clicks generated by the corresponding bucket, according to the two credit-assignment methods. The overall number of votes is shown in blue; the number of clicks that movies in the corresponding bucket generate is shown in orange for ASVE and in green for NSVE.

We have seen that the two algorithms provide consistent predictions on key aspects, such as the distribution of user votes based on their attributed value, the scores given by users of a given attributed value, and the number of clicks movies produce based on their popularity.
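The cluster-level Shapley computation can be sketched as follows: exact Shapley values, phi_i = sum over S ⊆ N\{i} of |S|!(n−|S|−1)!/n! · [v(S∪{i}) − v(S)], are computed over a small number of clusters, for which full coalition enumeration is feasible, and each cluster's value is then divided among its members with weights that decrease linearly with distance from the cluster center. The additive toy value function and all names below are illustrative stand-ins; in a real instantiation the value of a coalition of clusters would come from the recommender itself.

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley(players, v):
    """Exact Shapley value over a small set of players (here: clusters),
    enumerating every coalition, which is feasible because k is small."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            for coalition in combinations(others, r):
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += weight * (v(set(coalition) | {p}) - v(set(coalition)))
    return phi

def member_scores(points, center, cluster_value):
    """Divide a cluster's value among its members, with weights that
    decrease linearly with distance from the cluster center."""
    d = np.linalg.norm(points - center, axis=1)
    w = 1.0 - d / (d.max() + 1e-9)
    return cluster_value * w / w.sum()

# Toy value function (assumed): each cluster contributes its own click count,
# so the game is additive and each Shapley value equals that count.
clicks_per_cluster = {0: 60.0, 1: 30.0, 2: 10.0}
v = lambda coalition: sum(clicks_per_cluster[c] for c in coalition)
phi = shapley([0, 1, 2], v)

# Split cluster 0's value among three member users (hypothetical coordinates).
points = np.array([[0.0, 0.0], [0.5, 0.5], [2.0, 2.0]])
scores = member_scores(points, center=np.array([0.0, 0.0]), cluster_value=phi[0])
```

Because the toy game is additive, each cluster's Shapley value reduces to its own click count; the linear distance weighting illustrates why users close to a cluster center receive larger shares than peripheral ones.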
These consistent predictions confirm our initial, intuition-based hypothesis that users receiving high scores vote in line with items' popularity levels. There are, however, points on which the two methods do not agree. NSVE is more sensitive to distances between users in the projected plane: in areas where users are sparsely distributed, it tends to assign them zero values. This is not the case for the ASVE algorithm, as belonging to a cluster ensures a non-zero contribution to the final outcome.

In contrast to the sharp cut-off at the neighbourhood's boundary in NSVE, the scores assigned to users by the Shapley-based method change in a much smoother and more continuous manner from one region to another. This is reflected in a much narrower distribution (as seen in Fig. 4). The user score distribution of the NSVE method, on the other hand, has a very high variance, assigning very small scores to users who are far apart and very high scores to those in densely-populated regions.

One issue not yet discussed is how new users should be treated with respect to their overall value. As their contribution is minimal, a valid argument would be to start by assigning them relatively small scores (and hence small proportional payments). This is indeed the case for both methods. Under ASVE, a new user would likely be far from the center of any cluster and, as the proportional payment decreases linearly with the distance from the cluster center, would receive a small reward. Under NSVE the situation is similar: a new user would either be in a sparsely-populated area of the projected space, or in a densely-populated area but with very few possible recommendations for their neighbours. In either case, they would receive a small score.

Finally, what should be the impact of low votes? This question is of particular importance if either of the two proposed algorithms is to be implemented in a realistic setting. Observing Fig. 6, we see that valuable users give high votes to popular films and low votes to unpopular ones. If this rule were to hold across a vast array of data sets, users could attempt to game the system by voting in line with popularity and thereby obtaining a high reward. This problem stems from the objective of the two methods, namely maximizing the number of clicks for each item, without necessarily considering the element of novelty brought into the system by each individual.

V. CONCLUSION
The current business model for online services is reaching the limits of sustainability. The exchange of free services for free data raises concerns both in terms of safety, with users allowing highly sensitive private information to be used as a currency, and in terms of fairness, as an ever-increasing number of voices call for an evaluation of the real worth of an individual's online impact.

In the present paper, we have proposed a framework for determining the value of every user to an online service, based on their contribution to a quantity or metric of interest. The framework must be fair, such that users are rewarded proportionally to their contribution, and transparent, such that payments are interpretable and linked to user behaviour on the platform.

By restricting ourselves to the specific case of a recommender system, we aim to understand what type of behaviour increases the overall worth of an individual to the service. Once the profile of valuable users has been studied in a series of particular cases, our goal is to generalize and establish value in a broader sense. A second direction of future research is the creation of a transparent accounting meta-data layer that users can access to make sure that they have been fairly compensated, in a manner that does not harm their privacy by leaking data.

As our goal is ultimately to expand our analysis to a broad spectrum of domains, the proposed Contribution-Reward framework aims to be as wide-reaching as possible. It makes no assumptions about the types of data or their use cases. The framework simply states that, once a set of important features or relevant metrics has been identified, based on data type and use case, a value hierarchy for users can be established. The metrics or features are selected in terms of their impact on the overall revenue produced by the service, and this selection may not be a trivial procedure.
Once this has been achieved, however, the second step of the framework is the ranking of users, and this can be done with algorithms similar to the ones proposed in our case study. The obstacles reside not only in determining the value of an individual, but also in ensuring the safety of their data and the fairness of their remuneration. Once every user has been assigned a score, the final step of the framework consists of mapping the scores to actual financial quantities. This is certainly not a static map, as it may depend on a series of continuously-varying parameters, such as the revenue produced, the amount of revenue to be distributed among users, and possibly even exchange rates between currencies. In this paper we cover the second problem, which is concerned with ranking.

Following this line of reasoning, we have presented two algorithms for assigning proportional value to users for a specific task. These algorithms agree on a number of points, despite the fact that both rely on approximations. We also introduced the exact method for determining proportional contributions, the Shapley value, which in its raw form does not scale to large coalitions, but can serve as a reference or as the basis for approximate methods.

The predictions of the two algorithms are in line with common-sense assessments of the value of a particular user, for example with respect to their votes for the most popular, and hence most revenue-producing, movies. While this offers further validation of the two methods, their limitations have also been highlighted; these stem from the fact that both are, to different degrees, estimates of an exact but uncomputable quantity.

As this field, which may aptly be named
Data Economics, is still in its infancy, the number of open research directions is considerable. One such path would be to treat other particular cases, such as traffic data and the case of a traffic application, and to understand the reasons behind the difference in value between various types of users. One can then hope that these cases, once understood, will provide some intuition towards possible generalizations. A different research direction would be to address the other two components of the reward framework, namely feature or metric identification and score-to-currency mapping.

An important aspect of our future work is designing and implementing a layer of transparency in the Contribution-Reward framework, with the explicit aim of allowing users to verify the manner in which their payments are distributed and connected to their behaviour. Transparency is a fundamental characteristic, as it will increase user confidence in participating on a platform that implements the proposed framework; growing confidence will, in turn, boost user presence and, hence, generated revenue.

The purpose of the present paper has been to pose a series of essential questions in Data Economics, especially those related to fair credit assignment, and to show that, for a particular use case, a transparent algorithmic solution can be found, based on a commonly-accepted economic method, the Shapley value. It is our hope that, as our data economy reaches maturity, the remaining open problems in the field will soon be addressed.
REFERENCES

[1] Posner E. A. and Weyl E. G., Radical Markets: Uprooting Capitalism and Democracy for a Just Society. Princeton University Press, 2018, ISBN 9780691177502.
[2] Lanier J., Who Owns the Future?. Simon & Schuster, 2013, ISBN 9781451654967.
[3] Moody D. L. and Walsh P., Measuring the value of information – an asset valuation approach. ECIS, (496 – 512), 1999.
[4] King K., A case study in the valuation of a database. Journal of Database Marketing & Customer Strategy, 2, (110 – 119), 2007. https://doi.org/10.1057/palgrave.dbm.3250041
[5] Alstyne M. W., A Proposal for Valuing Information and Instrumental Goods. Proceedings of the 20th International Conference on Information Systems, (328 – 345), 1999.
[6] Gyarmati L., Laoutaris N., Sdrolias K., Rodriguez P. and Courcoubetis C., From advertising profits to bandwidth prices – A quantitative methodology for negotiating premium peering. NetEcon'14, 2014. arXiv:1404.4208
[7] Delaney K. J., The robot that takes your job should pay taxes, says Bill Gates. 2017. Retrieved from: https://qz.com/911968/bill-gates-the-robot-that-takes-your-job-should-pay-taxes/
[8] Thomas L., Universal basic income debate sharpens as observers grasp for solutions to inequality. 2017. Retrieved from:
[9] Gillespie P., Mark Zuckerberg supports universal basic income. What is it?. 2017. Retrieved from: https://money.cnn.com/2017/05/26/news/economy/mark-zuckerberg-universal-basic-income/index.html
[10] Cabello S. and Chan T. M., Computing Shapley values in the plane. 2018. arXiv:1804.03894
[11] Zhao K., Mahboobi S. H. and Bagheri S. R., Shapley Value Methods for Attribution Modeling in Online Advertising. 2018. arXiv:1804.05327
[12] Blackwell D., Equivalent Comparisons of Experiments. Annals of Mathematical Statistics, (265 – 272), 1953.
[13] Carrascal J. P., Riederer C., Erramilli V., Cherubini M. and Oliveira R., Your browsing behavior for a big mac: economics of personal information online. Proceedings of the 22nd International Conference on World Wide Web, (189 – 200), 2013.
[14] Acquisti A., Taylor C. R. and Wagman L., The Economics of Privacy. Journal of Economic Literature, 2, 2016.
[15] Harper F. M. and Konstan J. A., The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems, 4, Article No. 19, 2016.
[16] Shapley L. S., A Value for n-Person Games. Annals of Mathematics Study, (307 – 317), 1953.
[17] Winter E., The Shapley Value. In The Handbook of Game Theory, North-Holland, 2002, ISBN 9780444894281.
[18] Papadimitriou C. H., Computational Complexity. Addison-Wesley, 1994, ISBN 9780201530827.
[19] Deng X. and Papadimitriou C. H., On the Complexity of Cooperative Solution Concepts. Mathematics of Operations Research, 2, (257 – 266), 1994.
[20] Bachrach Y., Elkind E., Meir R., Pasechnik D., Zuckerman M., Rothe J. and Rosenschein J. S., The Cost of Stability in Coalitional Games. Proceedings of SAGT, (112 – 134), 2009.
[21] Stanojevic R., Laoutaris N. and Rodriguez P., On Economic Heavy Hitters: Shapley value analysis of 95th-percentile pricing. IMC'10, Melbourne, Australia, November 1-3, 2010.