Dealer: End-to-End Data Marketplace with Model-based Pricing
ABSTRACT
Data-driven machine learning (ML) has witnessed great success across a variety of application domains. Since ML model training crucially relies on a large amount of data, there is a growing demand for high-quality data to be collected for ML model training. However, from the data owners' perspective, it is risky for them to contribute their data. To incentivize data contribution, it would be ideal if their data were used only under their preset restrictions and they were paid for their contribution. In this paper, we take a formal data market perspective and propose the first end-to-end data marketplace with model-based pricing (Dealer) towards answering the question: How can the broker assign value to data owners based on their contribution to the models to incentivize more data contribution, and determine pricing for a series of models for various model buyers to maximize the revenue with an arbitrage-free guarantee? For the former, we introduce a Shapley value-based mechanism to quantify each data owner's value towards all the models trained out of the contributed data. For the latter, we design a pricing mechanism based on models' privacy parameters to maximize the revenue. More importantly, we study how the data owners' data usage restrictions affect market design, which is a striking difference between our approach and the existing methods. Furthermore, we show a concrete realization, DP-Dealer, which provably satisfies the desired formal properties. Extensive experiments show that DP-Dealer is efficient and effective.

PVLDB Reference Format:
Jinfei Liu. A Sample Proceedings of the VLDB Endowment Paper in LaTeX Format. PVLDB, 12(xxx): xxxx-yyyy, 2019.
DOI: https://doi.org/xxxxxxx.xxxxxxx
1. INTRODUCTION
Machine learning has witnessed great success across various types of tasks and is being applied in an ever-growing number of industries and businesses. High-usability machine learning models depend on a large amount of high-quality training data, which makes it obvious that data are valuable. Recent studies and practices approach the commoditization of data in various ways. A data marketplace sells data either in direct or indirect (derived) forms.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment,
Vol. 12, No. xxx
ISSN 2150-8097.
DOI: https://doi.org/10.14778/xxxxxxx.xxxxxxx
These data marketplaces can be generally categorized based on their pricing mechanisms: 1) data-based pricing, 2) query-based pricing, and 3) model-based pricing.

Data marketplaces with data-based pricing sell datasets and allow buyers to access the data entries directly, e.g., Dawex [1], Twitter [3], Bloomberg [4], and Iota [5]. Under these marketplaces, data owners have limited control over their data usage, e.g., privacy abuse, which makes it challenging for the market to incentivize more data owners to contribute. Also, it can be overpriced for buyers to purchase the whole dataset when they are only interested in particular information extracted from the dataset. Therefore, the marketplace operates in an inefficient way that cannot maximize the revenue.

Data marketplaces with query-based pricing [25, 26], e.g., Google BigQuery [2], partially alleviate these shortcomings by charging buyers and compensating data owners on a per-query basis. The marketplace makes decisions about the restrictions on data usage (e.g., returning query answers with privacy protection [29]), compensation allocation, and query pricing. However, most queries considered by these marketplaces are too simplistic to support sophisticated data analytics and decision making.

Data marketplaces with model-based pricing [6, 13, 24] have been proposed recently. In [13], the authors focus on pricing a series of model instances depending on their model quality to maximize revenue, while [24] considers how to allocate compensation fairly among data owners when their data are utilized for a particular machine learning model, k-nearest neighbors (k-NN). Thus, they are limited to one end of the marketplace but not both. Most recently, [6] approaches the problem from a relatively more complete perspective by studying both ends of the marketplace, proposing strategies for the broker to set the model usage charge for the buyers and to distribute compensation to the data owners.
However, [6] oversimplifies the role of the two end entities in the overall marketplace: the data owners and the model buyers. For example, the data owners still have no means to control the way their data is used, while the model buyers have no choice over the quality of the model that best suits their needs and budgets.

Gaps and Challenges. Though efforts have been made to ensure the broker follows important market design principles in [6, 13, 24], how the marketplace should respond to the needs of both the data owners and the model buyers is still understudied. It is therefore tempting to ask: how can we build a marketplace dedicated to machine learning models which can simultaneously satisfy the needs of all three entities, i.e., data owners, broker, and model buyers? We summarize the gaps and challenges from the perspective of each entity as follows.

• Data owners. Under the existing data marketplace solutions [6, 13, 24], how to model the data owners' restrictions and their associated effect on model manufacturing, model pricing, and compensation allocation?

• Model buyers.
As with the common practice of selling digital commodities in several versions, current work [6, 13] provides a series of models for sale with different levels of quality. However, their oversimplified noise injection-based version control quantifies the model quality via the magnitude of the noise, which does not directly align with the model buyers' valuation of the model in terms of its utility. How do we incorporate the model buyers' perspective in the model valuation and use it to optimize both model pricing and model manufacturing?

• Broker. The data owners' restrictions and the model buyers' estimation of the model value should be taken into consideration by the broker when making market decisions, e.g., compensation allocation and model pricing. How can the broker align the two ends' requirements with the already complicated market design principles in an efficient and effective way? That is, how can the broker remain competitive (e.g., train higher-utility models under the data owner restrictions) while maximizing its revenue to maintain a sustainable data market?

Figure 1: An end-to-end data marketplace with model-based pricing framework.
Contributions.
In this paper, we bridge the above gaps and address the identified challenges by proposing Dealer: an end-to-end data marketplace with model-based pricing. Dealer provides both an abstract and general mathematical formalization of the end-to-end marketplace dynamics (Section 4) and a specific and practical differentially private marketplace with algorithm designs that ensure differential privacy for the data owners, one type of instantiation of the model restriction.

End-to-End Mathematical Formalization of the Marketplace. We first propose an abstract mathematical formalization for Dealer with an emphasis on the understudied parts of the marketplace as discussed above. An illustration is provided in Figure 1, which includes the three main entities and their interactions. In this general Dealer (Gen-Dealer), we define the abilities and constraints of the three entities (i.e., data owners, broker, and model buyers). From the data owners' perspective, Gen-Dealer aims to 1) allow the data owners to restrict their data usage by the broker; and 2) give them the option to receive extra compensation if they are willing to partially relax their restrictions. From the model buyers' perspective, Gen-Dealer provides a utility measure suitable for the model buyers' model valuation, based on which the potential model buyers provide their willingness to purchase and payment estimation. From the broker's perspective, Gen-Dealer depicts the full marketplace dynamics through abstract behavior functions. In addition to the new features brought into the market design consideration by Gen-Dealer, commonly recognized market design principles are also well accommodated in our approach. Prominent ones include: 1) compensation allocation in a fair and rational way, e.g., based on the widely accepted Shapley value notion; and 2) arbitrage-freeness in model pricing, which prevents the model buyers from taking advantage of the marketplace by combining lower-tier models sold at cheaper prices into a higher-tier model to escape the designated price for that tier.
Marketplace Instantiation with Differential Privacy. We also provide a concrete end-to-end differentially private data marketplace instance. We focus on empirical risk minimization, a widely used and well-studied family of supervised machine learning models. For the data owners' restrictions, we consider the privacy restriction, which is arguably the most pressing concern for personal data contributors. Differential privacy (DP) [16, 17], nowadays the de facto standard in privacy-preserving data analysis, is introduced to exemplify the data owners' restriction requirements, and we refer to the differentially private data marketplace with model-based pricing instance as DP-Dealer. In DP-Dealer, the marketplace sells a series of differentially private models to respect the data owners' privacy restrictions. The higher-tier models correspond to models trained on data subsets contributed by the data owners with lower DP restrictions; on the contrary, the lower-tier models are trained on data subsets with higher DP restrictions. We consider two types of data owner DP restrictions: 1) a hard restriction, which has a rigid cutting point beyond which the data owners' data cannot be used for training; and 2) a negotiable restriction, which has a negotiable range within which the marketplace still has the option to use the data but with extra compensation. At the model buyers' end, DP-Dealer addresses the challenge of mismatched model tier ranking standards by converting the model "manufacturing" tier ranking standard to the model utility standard adopted by the model buyers in making purchasing decisions. DP-Dealer accommodates market design principles like fair compensation allocation and arbitrage-free model pricing. To support the end-to-end marketplace dynamics with all design considerations, DP-Dealer establishes constrained objective functions and develops efficient algorithms to optimize market decisions.

We briefly summarize our contributions as follows.

• A general end-to-end data marketplace with model-based pricing framework, Gen-Dealer, which is the first systematic study that includes all market participants (i.e., data owners, broker, and model buyers). Gen-Dealer formalizes the abilities and restrictions of the three entities and models the interactions among them.

• A differentially private data marketplace with model-based pricing, DP-Dealer, which instantiates the general framework. In addition to incorporating market design principles, DP-Dealer proposes two data owner restriction schemes and provides the utility estimation for the model buyers to choose models best suiting their needs. DP-Dealer formulates a series of optimization problems and develops efficient algorithms to make the market decisions.

• A series of experiments are conducted to justify the design of DP-Dealer and verify the efficiency and effectiveness of the proposed algorithms.

Organization. The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 provides the background information, including the concept of Shapley value and its computation, the machine learning models exemplified in this paper, and differential privacy related definitions and properties. We provide the first end-to-end data marketplace with model-based pricing formalization and discuss the desiderata in Section 4. A concrete instance of a differentially private data marketplace with efficient algorithms is derived in Section 5. We report the experimental results and findings in Section 6. Finally, Section 7 draws a conclusion and discusses future work.
2. RELATED WORK
In this section, we discuss related work on data pricing and compensation allocation.
Ghosh et al. [20] initiated the study of markets for private data using differential privacy. They modeled the first framework in which data buyers would like to buy sensitive information to estimate a population statistic. They defined, for the first time, a property named envy-freeness, which ensures that no individual would prefer to switch their payment and privacy cost with another's. Guruswami et al. [21] studied the optimization problem of revenue maximization with an envy-free guarantee. They investigated two cases of inputs, unit-demand consumers and single-minded consumers, showed the optimization problem is APX-hard for both cases, and gave an efficient logarithmic approximation algorithm. Li et al. [28, 29, 30] presented the first theoretical framework for assigning value to noisy query answers as a function of their accuracy, and for dividing the price among data owners who deserve compensation for their loss of privacy. They defined an enhanced edition of envy-freeness named arbitrage-freeness, which ensures the data buyer cannot purchase the desired information at a lower price by combining two low-price queries. Lin et al. [31] proposed necessary conditions for avoiding arbitrage and provided new arbitrage-free pricing functions. They also presented a couple of negative results on the tension between flexible pricing and arbitrage-freeness, and illustrated how this tension often results in unreasonable prices. In addition to arbitrage-freeness, Koutris et al. [27] proposed another desirable property for the pricing function, discount-freeness, which requires that the prices offer no additional discounts beyond the ones specified by the broker; in fact, discount-freeness is the discrete version of arbitrage-freeness. Furthermore, they presented a polynomial time algorithm for pricing generalized chain queries.

Recently, Chawla et al. [12] investigated three types of succinct pricing functions and studied the corresponding revenue maximization problems. Due to the increasing pervasiveness of machine learning based analytics, there is an emerging interest in studying the cost of acquiring data for machine learning. Chen et al. [13] proposed the first and the only existing model-based pricing framework, which, instead of pricing the data, directly prices machine learning model instances. They formulated an optimization problem to find the arbitrage-free price that maximizes the revenue of the broker, and proved such an optimization problem is coNP-hard. However, their work only focuses on the interactions between the broker and the model buyers. Furthermore, they assume there is only one survey price for each model, which is overly simplified.

A straightforward method to evaluate a data point's importance/value to a model is leave-one-out (LOO), which compares the predictor's performance when trained on the entire dataset with its performance when trained on the entire dataset minus one point [11]. However, LOO does not satisfy all the ideal properties that we expect for data valuation. For example, given a point p in a dataset, if there is an exact copy p′ in the dataset, removing p from the dataset does not change the predictor at all since p′ is still there. Therefore, LOO will assign zero value to p regardless of how important p is.

Shapley value is a concept in cooperative game theory, named in honor of Lloyd Shapley [33]. Shapley value is the only value division scheme used for compensation allocation that meets three desirable criteria: group rationality, fairness, and additivity [24]. Combined with its flexibility to support different utility functions, Shapley value has been extensively employed in the data pricing field [6, 8, 19, 24]. One major challenge of applying Shapley value is its prohibitively high computational complexity. Evaluating the exact Shapley value involves the computation of the marginal utility of each user to every coalition, which is #P-complete [18]. Such exponential computation is clearly impractical for evaluating a large number of training points. Even worse, for machine learning tasks, evaluating the utility function is extremely expensive since it requires training models; in the worst case, we need to train O(2^n) models to compute the exact Shapley value for each data owner.

A number of approximation methods have been developed to overcome the computational hardness of finding the exact Shapley value. The most representative is the Monte Carlo method [10, 18], which is based on random sampling of permutations. However, the time cost is still prohibitively high due to the high training cost of deep learning models. Therefore, Ghorbani et al. [19] and Ancona et al. [8] illustrated how to compute the approximate Shapley value by performing stochastic gradient descent on one data point at a time.
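The duplicate-point weakness of LOO versus the Shapley value discussed above can be checked end to end with a tiny coalition utility. Below is a minimal sketch; the 0/1 utility function is a made-up toy, not a trained model:

```python
import itertools
from math import comb

def loo(utility, n, i):
    """Leave-one-out value: utility of all points minus utility without point i."""
    full = set(range(n))
    return utility(full) - utility(full - {i})

def shapley(utility, n, i):
    """Exact Shapley value of point i: marginal contribution averaged
    over all coalitions S of the remaining points (Shapley's formula)."""
    others = set(range(n)) - {i}
    total = 0.0
    for r in range(n):
        for S in itertools.combinations(others, r):
            marginal = utility(set(S) | {i}) - utility(set(S))
            total += marginal / (n * comb(n - 1, r))
    return total

# Points 0 and 1 are exact copies; the utility is 1 as soon as either is present.
U = lambda S: 1.0 if (0 in S or 1 in S) else 0.0
print(loo(U, 3, 0))      # → 0.0: removing point 0 changes nothing, copy 1 remains
print(shapley(U, 3, 0))  # → 0.5: each copy shares the credit equally
```

LOO assigns point 0 zero value despite its usefulness, while the Shapley value splits the credit evenly between the two copies, which is exactly the fairness property exploited later for compensation allocation.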
3. BACKGROUND AND PRELIMINARIES
In this section, we introduce the background and preliminaries of Dealer. We summarize the frequently used notations in Table 1. In particular, we denote the data owners by O_1, ..., O_i, ..., O_n, a series of models prepared by the broker for sale by M_1, ..., M_m, ..., M_M, and the model buyers by B_1, ..., B_k, ..., B_K.

There are a number of market design principles, which are also considered in data marketplaces with data-based pricing, query-based pricing, and model-based pricing. In the following, we review the most widely adopted ones and introduce related techniques.
Shapley value based compensation is a prevalently adopted approach, mostly due to its theoretical properties, especially fairness. Shapley value measures the marginal improvement of model utility contributed by z_i of data owner O_i, averaged over all possible coalitions of the data owners. The formal Shapley value of data owner O_i is defined as follows:

SV_i = (1/n) Σ_{S ⊆ {z_1, ..., z_n} \ {z_i}} [U(S ∪ {z_i}) − U(S)] / C(n−1, |S|),   (1)

where U(·) is the utility of the model trained by a coalition of the data owners, and the model utility is tested on the training dataset.

Table 1: The summary of notations.

Notation | Definition
O_i | the i-th data owner
M_m | the m-th model
B_k | the k-th model buyer
Z_train = {z_1, z_2, ..., z_n} | training dataset
X_train = {x_1, x_2, ..., x_n} | features of the training dataset
y_train = {y_1, y_2, ..., y_n} | labels of the training dataset
z_i = {x_i, y_i} | the i-th training data
Z_test | testing dataset
X_test | features of the testing dataset
y_test | labels of the testing dataset
U | model utility
SV | Shapley value
UV | utility valuation
UF | utility function
MR | model risk factor
MB | manufacturing budget
DR | data owner restriction function
(ε, δ) | parameters for DP
bc | base compensation
ec | extra compensation
tm | target model
⟨p(ε_1), p(ε_2), ..., p(ε_M)⟩ | optimal pricing
(m, sp_m[j]) | survey price point
(m, p_m[j]) | complete price point

Monte Carlo Simulation Method.
Since the exact Shapley value computation is based on enumeration, which is prohibitively expensive, we adopt a commonly used Monte Carlo simulation method [10, 18] to compute the approximate Shapley value. We first sample random permutations of the data points corresponding to different data owners, then scan each permutation from the first element to the last and calculate the marginal contribution of every new data point. Repeating the same procedure over multiple Monte Carlo permutations, the final estimation of the Shapley value is simply the average of all the calculated marginal contributions. This Monte Carlo sampling gives an unbiased estimate of the Shapley value. In practical applications, we generate Monte Carlo estimates until the average has empirically converged, and the experiments show that the estimates converge very quickly. Therefore, the Monte Carlo simulation method can control the degree of approximation, i.e., the more permutations, the better the accuracy. The detailed algorithm is shown in Algorithm 1, where |π| is the number of permutations; the larger |π|, the more accurate the computed Shapley value.

Algorithm 1: Monte Carlo Shapley value computation.
input: Z_train = (X_train, y_train) and Z_test = (X_test, y_test).
output: Shapley value SV_i for each data z_i in Z_train.
initialize SV_i = 0;
for k = 1 to |π| do
    let Z_train^k = {z_{π_k(1)}, z_{π_k(2)}, ..., z_{π_k(n)}} be the training dataset ordered by permutation π_k;
    for i = 1 to n do
        SV(z_{π_k(i)}) = U({z_{π_k(1)}, ..., z_{π_k(i)}}) − U({z_{π_k(1)}, ..., z_{π_k(i−1)}});
        SV_{π_k(i)} = SV_{π_k(i)} + SV(z_{π_k(i)});
finally, average each SV_i over the |π| permutations;

In a perfect world, the broker would sell a personalized model to each model buyer at a different price to maximize the revenue. However, such personalized pricing is rarely possible in practical applications. On the one hand, it is expensive to train different models for different model buyers. On the other hand, it is difficult to set an array of prices for the same model; even if we could, it would be impossible to get model buyers to stay within their intended pricing strata rather than look for the lowest price. Finally, the broker runs the risk of annoying or even alienating model buyers if they charge different prices for the same model.

There is a practical way to set different prices for the same training dataset without incurring high costs or offending model buyers: offering the training dataset in different versions designed to attract different types of model buyers. With this strategy, which is called versioning [32], model buyers segment themselves; the model version they choose reveals the value they place on the training dataset and the price they are willing to pay. Therefore, in our model marketplace, the broker trains M different model versions by injecting different noise into different subsets of training data contributed by the data owners.

Arbitrage is possible when a "better" model can be obtained more cheaply than the advertised price by combining two or more worse models with lower prices.
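A minimal runnable sketch of the Monte Carlo estimator in Algorithm 1 follows. The additive coalition utility below is a toy stand-in for training a model and measuring its utility, chosen so that each point's true Shapley value equals its per-point quality:

```python
import random

def monte_carlo_shapley(utility, n, num_permutations=2000, seed=0):
    """Algorithm 1: sample random permutations, scan each from first to last,
    and average the marginal contribution of every data point."""
    rng = random.Random(seed)
    sv = [0.0] * n
    for _ in range(num_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        coalition, prev_u = set(), utility(set())
        for i in perm:
            coalition.add(i)
            u = utility(coalition)
            sv[i] += u - prev_u            # marginal contribution of z_i
            prev_u = u
    return [v / num_permutations for v in sv]  # average over |pi| permutations

# Toy additive utility: a coalition is worth the sum of per-point qualities,
# so the Shapley value of each point is exactly its quality.
quality = [3.0, 1.0, 2.0]
U = lambda S: sum(quality[i] for i in S)
print(monte_carlo_shapley(U, 3))  # → [3.0, 1.0, 2.0]
```

In the marketplace, evaluating `utility` means training a model on the coalition's data, which is why the number of sampled permutations must be kept small in practice.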
Arbitrage complicates the interactions between the broker and the model buyers, i.e., the model buyers need to carefully choose the models to achieve the lowest price, while the broker may not achieve the revenue intended by her advertised prices. Therefore, an arbitrage-free pricing function is highly desirable. We say a pricing function is arbitrage-free if it satisfies the following two properties proved in [13, 30].

Property (Monotonicity). Given a function f: (R^+)^k → R^+, we say f is monotone if and only if for any two vectors x, y ∈ (R^+)^k with x ≤ y, we have f(x) ≤ f(y).

Property (Subadditivity). Given a function f: (R^+)^k → R^+, we say f is subadditive if and only if for any two vectors x, y ∈ (R^+)^k, we have f(x + y) ≤ f(x) + f(y).

In this paper, we focus on Empirical Risk Minimization (ERM), a widely applied tool in machine learning. Denote the training dataset by Z_train := {z_i}, i = 1, 2, ..., n, where z_i ∼ D and z_i = (x_i, y_i). Here x_i ∈ R^d is the d-dimensional feature vector and y_i is the response value, which can be {−1, +1} for a binary classification task or [0, 1] for a regression task. The ERM has the following objective function:

arg min_{w ∈ Ω} L(w; Z_train) + λ||w||² = arg min_{w ∈ Ω} (1/n) Σ_{i=1}^{n} l(w; z_i) + λ||w||²,   (2)

where L(w; Z_train) = (1/n) Σ_{i=1}^{n} l(w; z_i) is the empirical loss averaged over the losses of all z_i, λ||w||² is the regularizer, and Ω is the constraint set. In this paper, we focus on models with convex, Lipschitz continuous, and smooth loss functions (with respect to w). The formal definitions are as follows.

Definition (Convex Loss Function). A loss function l(w): R^d → R is called convex if for all w_1, w_2 ∈ R^d,

l(w_1) − l(w_2) ≥ ⟨∇l(w_2), w_1 − w_2⟩.   (3)

In addition, if l(w_1) − l(w_2) ≥ ⟨∇l(w_2), w_1 − w_2⟩ + (µ/2)||w_1 − w_2||² for µ > 0, l(w) is µ-strongly convex.

Definition (Lipschitz Continuous Loss Function). A loss function l(w): R^d → R is called L-Lipschitz continuous if for all w_1, w_2 ∈ R^d,

|l(w_1) − l(w_2)| ≤ L||w_1 − w_2||.   (4)

Definition (Smooth Loss Function). A loss function l(w): R^d → R is called β-smooth if for all w_1, w_2 ∈ R^d,

l(w_1) − l(w_2) ≤ ⟨∇l(w_2), w_1 − w_2⟩ + (β/2)||w_1 − w_2||².   (5)

We focus on loss functions satisfying the above assumptions, such as the least squares loss, logistic loss, and smoothed hinge loss. In machine learning, these losses are thoroughly studied, with theoretical properties like generalization performance, and are easy to use with efficient optimization algorithms with guaranteed convergence. Furthermore, their differentially private versions are also equipped with efficiency, utility, and privacy guarantees.
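As a concrete instance of Eq. (2), the sketch below implements the regularized ERM objective and its gradient for the logistic loss on synthetic data; the step size, regularization constant, and data are illustrative choices, not values from the paper:

```python
import numpy as np

def logistic_loss(w, x, y):
    """l(w; z) = log(1 + exp(-y <w, x>)): convex, ||x||-Lipschitz,
    and (||x||^2 / 4)-smooth in w."""
    return np.log1p(np.exp(-y * x.dot(w)))

def erm_objective(w, X, y, lam):
    """Eq. (2): empirical loss averaged over all z_i plus L2 regularizer."""
    n = X.shape[0]
    return sum(logistic_loss(w, X[i], y[i]) for i in range(n)) / n + lam * np.dot(w, w)

def erm_gradient(w, X, y, lam):
    n = X.shape[0]
    margins = y * X.dot(w)
    coeffs = -y / (1.0 + np.exp(margins))   # derivative of the loss w.r.t. the margin
    return X.T.dot(coeffs) / n + 2 * lam * w

# Synthetic data and a few steps of plain gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=50))
w = np.zeros(3)
for _ in range(200):
    w -= 0.5 * erm_gradient(w, X, y, lam=0.01)
print(erm_objective(w, X, y, lam=0.01))  # decreases from log(2) at w = 0
```

Because the objective is convex and smooth, plain gradient descent with a small enough step size is guaranteed to converge, which is the property the DP training algorithm in the next subsection relies on.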
As a result, we focus on this particular type of machine learning models in our market design for a thorough understanding of our proposed market. However, we would like to mention that most of our algorithms can be applied to other types of machine learning models, like the popular deep learning family.

Differential privacy is a formal mathematical tool for rigorously providing privacy protection. However, none of the existing (published) data marketplace with model-based pricing papers has considered it, and it is still unknown how it can be incorporated in a data marketplace with model-based pricing and how it affects market designs when adopted.

Definition (Differential Privacy). A randomized algorithm A is (ε, δ)-differentially private if for any pair of datasets S and S′ that differ in one data sample, and for all possible sets of outputs O of A, the following holds:

P[A(S) ∈ O] ≤ e^ε P[A(S′) ∈ O] + δ,   (6)

where the probability is taken over the randomness of A.

In practice, for a meaningful DP guarantee, the parameters are chosen as 0 < ε ≤ 1 and δ ≪ 1/N, where N is the number of data samples.

Lemma 1 (Simple Composition). Let A_j be an (ε_j, δ_j)-differentially private algorithm. Then A = (A_1, ..., A_J) is (Σ_{j=1}^{J} ε_j, Σ_{j=1}^{J} δ_j)-differentially private.

Lemma 1 is essential for DP mechanism design and analysis, as it enables algorithm designers to compose elementary DP operations into more sophisticated ones. More importantly, we will show that it plays a crucial role in model market design as well. That is, Lemma 1 determines that DP is an appropriate mechanism for versioning models, based on which prices have to satisfy the arbitrage-free property.

Definition (ℓ2-sensitivity). A function f: D^N → R^d has ℓ2-sensitivity ∆ if

max_{neighboring S, S′} ||f(S) − f(S′)||_2 = ∆.   (7)

For training ERM with a DP restriction, a popular method is objective perturbation, which perturbs the objective function of the model with quantified noise. Conventional objective perturbation only supports a DP guarantee for the exact optimum, which is hardly achievable in practice. In this paper, we follow the enhanced objective perturbation called approximate minima perturbation [9, 22], which allows solving the perturbed objective up to α approximation. It uses a two-phase noise injection strategy that perturbs both the objective and the approximate output. The details are summarized in Algorithm 2. The algorithm trains an (ε, δ)-DP model based on the training dataset Z_train and outputs the model parameter w_DP. In particular, Line 2 perturbs the objective with calibrated noise N_1, Line 3 optimizes the perturbed objective, which is followed by an output perturbation with noise N_2. Finally, w_DP is obtained by a projection onto the constraint set Ω.

Algorithm 2: Objective perturbation for differentially private ERM training.
input: Z_train and (ε, δ).
output: w_DP.
1: Sample N_1 ∼ N(0_d, σ_1² I_d), where σ_1 = L√(log(1/δ))/ε;
2: Objective perturbation: L_OP(w) = L(w; Z_train) + λ||w||² + (1/n)⟨N_1, w⟩;
3: Optimize L_OP(w) to obtain an α-approximate solution ŵ;
4: Sample N_2 ∼ N(0_d, σ_2² I_d), where σ_2 = √(α log(1/δ))/(λε);
5: return w_DP = proj_Ω(ŵ + N_2);
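A minimal sketch of the two-phase noise injection in Algorithm 2, using regularized logistic regression as the ERM instance. The noise scales mirror the σ_1 and σ_2 pattern above, but the constants and the assumed optimization accuracy α are illustrative, not a verified DP calibration:

```python
import numpy as np

def dp_erm_objective_perturbation(X, y, lam, eps, delta, radius=1.0,
                                  steps=500, lr=0.1, seed=0):
    """Sketch of Algorithm 2 (approximate minima perturbation):
    1) perturb the ERM objective with Gaussian noise N1,
    2) approximately minimize the perturbed objective,
    3) perturb the approximate solution with N2 and project onto Omega.
    Noise multipliers here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L_lip = np.max(np.linalg.norm(X, axis=1))      # Lipschitz constant of the loss
    sigma1 = L_lip * np.sqrt(np.log(1.0 / delta)) / eps
    N1 = rng.normal(0.0, sigma1, size=d)

    def grad(w):                                   # gradient of the perturbed objective
        margins = y * X.dot(w)
        coeffs = -y / (1.0 + np.exp(margins))
        return X.T.dot(coeffs) / n + 2 * lam * w + N1 / n

    w = np.zeros(d)
    for _ in range(steps):                         # alpha-approximate minimization
        w -= lr * grad(w)

    alpha = 1e-3                                   # assumed optimization accuracy
    sigma2 = np.sqrt(alpha * np.log(1.0 / delta)) / (lam * eps)
    w_noisy = w + rng.normal(0.0, sigma2, size=d)

    norm = np.linalg.norm(w_noisy)                 # project onto the ball Omega
    return w_noisy if norm <= radius else w_noisy * (radius / norm)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0])
w_dp = dp_erm_objective_perturbation(X, y, lam=0.1, eps=1.0, delta=1e-5)
print(np.linalg.norm(w_dp))  # never exceeds the radius of Omega
```

Larger ε (a weaker privacy restriction) shrinks both noise scales, which is exactly why the broker can version models by their privacy parameters.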
4. GENERAL END-TO-END DATA MARKETPLACE WITH MODEL-BASED PRICING
In this section, we propose Gen-Dealer to model the entire end-to-end data marketplace with model-based pricing. Gen-Dealer models the following three aspects of the data marketplace: 1) the functionalities and restrictions of the three participating entities; 2) the interactions between the participating entities; and 3) the market decisions taken by the broker. In addition to the well recognized market design principles considered by previous work, we bring new considerations for the two ends of the marketplace (i.e., the data owners and the model buyers), and study how the marketplace interactions should respond to their requirements and how the market decisions will be affected to maintain a sustainable (or even profitable) data marketplace with model-based pricing. To begin with, we formalize the three participating entities as follows.
Data Owners.
Data owners can be professional institutes, organizations, or individuals. In this paper, we focus on the individual case where each data owner O_i contributes their own data z_i. For organizations or institutes which collect multiple individual owners' data for sale, we treat every atomic data tuple within the dataset as an individual data owner. Individual data owners are interested in contributing their data for compensation, e.g., coupons, exclusive sales, or cashback, but are cautious about their data usage, e.g., personal privacy exposure. The three main functionalities of the data owners are as follows.

1. Contributing Data: each data owner can contribute their personal data to the broker. Denote the data of data owner O_i by z_i = (x_i, y_i), where x_i is the feature vector and y_i is the response vector.

2. Setting Usage Restriction: the data owners set restrictions on how their data can be used, e.g., what types of models or how many models their data can be used for. In this paper, we consider a natural strategy, where O_i sets one restriction per model to train. That is, for model M_m, O_i provides a Data Restriction function DR_i^m. Two types of restriction functions are modeled in this paper. The first is the simpler "in or out" restriction, where DR_i^m ∈ {0, 1} is an indicator function providing a hard restriction on whether z_i is allowed to be used for training the m-th model (DR_i^m = 1) or not (DR_i^m = 0). The second is the negotiable restriction, where DR_i^m is not only a function of the model tier but is also related to the extra compensation ec_i^m: DR_i^m(ec_i^m) ∈ {0, 1}. That is, z_i can be used for training model M_m if the extra compensation ec_i^m satisfies O_i's expectation.

3. Receiving Compensation: after the sales, the data owners receive compensation based on their data usage. In addition, extra compensation is paid if their data is used after negotiation. In particular, the extra compensation ec_i^m is a function of the risk factor risk_i^m, to be introduced in the next subsection: ec_i^m = 0 if risk_i^m ≤ MR_m, and ec_i^m is a nondecreasing nonnegative function of risk_i^m − MR_m if risk_i^m > MR_m.

Discussion. Compared to existing machine learning model marketplace designs, which limit the data owners' actions to merely contributing data and receiving compensation, our formulation has the following two strengths: 1) it allows the data owners to set data usage restrictions; 2) the negotiable-type restrictions help data owners better estimate the value of their personal data, which is often difficult for individuals who have limited market information and evaluation of data usage risk. We believe that by better modeling the data owners, i.e., granting the rights to set data usage restrictions and receive extra compensation, the marketplace will eventually incentivize more data owners to contribute their data.

Assumption. The data owners do not fake data. Each data owner only contributes one data sample and all data are independent.

Remark. In practice, each data owner may have multiple atomic data tuples, and those data tuples among different data owners may be correlated. Therefore, we need to consider their relationship when we allocate compensation. For correlated data, we leave this as an open question.

Model Buyers.
Model buyers can be industries or everyday users who are interested in purchasing machine learning models to either integrate into their products or support certain decision making. They have very different budgets and model utility requirements. In this paper, we focus on single-minded model buyers who purchase at most one model. The two functionalities of the model buyers are as follows.
1. Providing Purchase Willingness: B_k provides the purchase willingness as a pair (tm_k, v_k), where the target model tm_k ∈ {1, ..., M} indicates the target model of B_k and v_k is the purchasing budget. We note that the number of potential model buyers can be very large, but the purchase willingness of hundreds of sampled model buyers is enough for the broker to make market decisions.
2. Model Transaction: B_k decides whether to purchase the target model by comparing the price p(tm_k) released by the broker with her budget.
Broker.
The broker collects data from the data owners and trains a series of machine learning models for sale to the model buyers. Let the models be M_1, ..., M_m, ..., M_M. As discussed, a common versioning strategy is to sell the models in various tiers, from the lowest tier (say M_1) to the highest tier (say M_M). For these models, the broker sets prices ⟨p(M_1), ..., p(M_M)⟩. In addition, we assume the broker is honest in the sense that it strictly follows its contract with the data owners, e.g., respecting their usage restrictions and allocating compensation based on the true data usage. The broker also wishes to remain competitive by training the best model for each tier within that tier's resource budget. More importantly, the broker interacts with both the data owners and the model buyers, and makes various market decisions, which are detailed in the later sections. Finally, we assume there is a model risk factor MR_m associated with each model M_m, to be detailed in Section 4.3. The model risk measures how large a certain risk is to the data owners who participate in the model training. When the risk is low, the broker usually puts high restrictions on data usage, which often leads to limited data information extraction. As a result, a lower-risk model often corresponds to a lower-tier model coming with a lower price. We denote this connection by ⟨p(MR_1), ..., p(MR_M)⟩, where MR_1 ≤ ... ≤ MR_M and the prices satisfy arbitrage-freeness with respect to the model risk. In a more generalized market setting, multiple brokers can co-exist, which forms a competitive relationship. To focus on the principal functionalities of a broker, we follow existing work [6, 13, 24] and consider only the single-broker case in this paper. In this subsection, we formalize the data marketplace dynamics, which consist of the interaction between the data owner and the broker and the interaction between the model buyer and the broker.
Interaction between Data Owner and Broker.
• Data Collection: The broker posts the model tiers M_1, ..., M_M, explains each tier's model risk MR_m to the data owners, and describes how each tier will possibly be compensated if a data owner chooses to participate. The data owner O_i contributes data z_i and sets the data usage restriction DR_i^m, as well as a potential extra compensation ec_i^m if she is willing to negotiate, for all m = 1, ..., M. Recall that DR_i^m and ec_i^m are functions of MR_m.
• Compensation Allocation: After training the models under the data usage restrictions (see the Model Training part in the next subsection), the broker pays compensation to the data owners according to the Compensation Allocation market decision algorithm detailed in the next subsection. Three key quantities are the utility valuation of z_i to model M_m: UV_i^m, its base compensation: bc_i^m, and its extra compensation: ec_i^m.
Interaction between Model Buyer and Broker.
• Market Survey: The broker posts the model tiers M_1, ..., M_M to the potential model buyers. Each model buyer B_k provides her purchase willingness (tm_k, v_k).
• Model Transaction: The broker sets the model prices ⟨p_1, ..., p_M⟩ (see Model Pricing in the next subsection) based on (tm_k, v_k). The model buyers then make purchase decisions and complete the transaction if the price meets their budget restriction.
Discussion. The data owners' contributing willingness is based on the risk level if their data is used for training model M_m, while the model buyers' purchasing willingness is based on the usefulness of the model (i.e., model utility). Thus, the two ends have different standards for the same model tier, which requires the broker to make optimal market decisions to bridge the two ends' different requirements.
4.3 Marketplace Decision Making
In this part, we propose a brand new versioning strategy, which centers on the data owners' perspective, rather than simply lowering the quality of the model, which merely considers the model buyers' payment ability as in [13].
Versioning.
In general, more data owners' contributions make the "manufacturing cost" of the corresponding model cheaper, because the broker has many alternative choices and tends to dominate the compensation bargaining. On the contrary, for models with higher risk, far fewer data owners are willing to participate, which forces the broker to pay higher compensation to attract more data contributions. With an increased manufacturing cost, the price of the product (i.e., the model for sale) should increase. Thus, we propose to set versions based on the participation risk, which takes the data owners' perspective. However, since the model buyers in general do not care about the risk factor but about the model utility, the broker should provide a conversion from the risk-based tiering standard to the model utility standard for each model tier. To bridge both ends, the broker needs to make market decisions (e.g., set model prices) with all participating entities' requirements as constraints. This supports our claim that designing a data marketplace with model-based pricing from end to end is both necessary and important. Recall that each model M_m is associated with model risk MR_m. Versioning is the process in which the broker trains a series of models with different model risks MR_1, ..., MR_M under the restrictions DR_i^m of all n data owners and the extra compensation requirements ec_i^m. On the model buyer's end, the broker in our marketplace provides a utility function UF(MR_m) for each risk-based model tier, since utility is the fundamental model property of interest to the model buyers. In comparison, the versioning strategy of the existing model market [13] produces different versions of models by controlling model utility through directly adding noise to the model parameters, in order to suit model buyers with various payment abilities. Obviously, this simplified versioning strategy fails to reflect the true "model manufacturing cost".
Model Utility Maximization with Manufacturing Budget.
Given O_i = (z_i, DR_i^m) or O_i = (z_i, DR_i^m(ec_i^m)), for i = 1, 2, ..., n, the broker trains each model M_m under the data usage restrictions and tries to train the best model for each model tier to remain competitive. Let data owner O_i's preferred model risk be risk_i, which indicates the highest risk she is willing to take (without extra compensation). In this paper, we instantiate the data restriction function DR_i^m as follows: 1) for the hard restriction case, DR_i^m = I(MR_m ≤ risk_i), i.e., the data can be used for M_m only when the model risk MR_m is no higher than the data owner's preferred risk risk_i; 2) for the negotiable case, DR_i^m(ec_i^m) = I(MR_m ≤ risk_i) ∨ I(ec_i^m), i.e., the data can be used either when the model risk is no higher than the preferred risk or when the extra compensation is made.
For the simpler hard data usage restriction, the broker trains model M_m with the data subset S_m = {z_i : i ∈ {1, 2, ..., n} s.t. DR_i^m = 1}. For the negotiable data usage restriction, under a limited manufacturing budget MB_m, the broker needs to decide whose data is worth the extra compensation, so that the utility valuation of the trained model is maximized and the broker remains competitive in the market. Denote the utility valuation (to be detailed in the next subsection) of z_i to model M_m by UV_i^m. We formalize the selection of the subset S_m as the following training-budget-constrained utility valuation maximization problem:

arg max_{S_m ⊆ {z_1, ..., z_n}} Σ_{i ∈ S_m} UV_i^m,   (8)
s.t. Σ_{i ∈ S_m} ( bc_i^m + ec_i^m(max{0, MR_m − risk_i}) ) ≤ MB_m.   (9)

In the above, the utility valuation UV_i^m and the base compensation bc_i^m should satisfy certain market design principles, which are discussed in the following. In this part, we elaborate how Gen-Dealer allocates the base compensation bc_i^m and the utility valuation UV_i^m. Recall that for model M_m, data owner O_i has a base compensation bc_i^m and an extra compensation ec_i^m. The extra compensation is a function of the data owner's preferred risk risk_i and the model risk MR_m: if MR_m ≤ risk_i, data owner O_i participates in the training of M_m with only the base compensation; if MR_m > risk_i, ec_i^m is charged with respect to MR_m − risk_i, i.e., the broker needs to pay for the extra risk the data owner suffers. Together, ec_i^m is a function of max{0, MR_m − risk_i}, as shown in Equation (9).
For the base compensation, Gen-Dealer allocates it based on z_i's utility valuation UV_i^m, where UV_i^m is based on the (approximate) Shapley value, and divides bc_i^m according to the relative Shapley value. In this way, the true contribution of data owner O_i to model M_m can be evaluated, and the base compensation is consistent with market design principles. To be practical, efficient approximation algorithms will be utilized. To summarize, Gen-Dealer allocates the compensation to data owner O_i for participating in model M_m as bc_i^m + ec_i^m(max{0, MR_m − risk_i}). The total compensation allocated to data owner O_i is

Σ_{m=1}^{M} I(i ∈ S_m) · [ bc_i^m + ec_i^m(max{0, MR_m − risk_i}) ],   (10)

where I(i ∈ S_m) is an indicator function indicating whether z_i is in the data subset S_m for training M_m. In this part, we show how to construct the market survey between the broker and the potential model buyers, and how the broker maximizes the revenue based on the market survey.
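The instantiated restriction functions and the total compensation in Equation (10) can be sketched in code. This is a minimal sketch: the helper names and the example numbers are illustrative assumptions, not part of Gen-Dealer's specification.

```python
# Hard restriction: DR_i^m = I(MR_m <= risk_i).
def hard_DR(MR_m: float, risk_i: float) -> bool:
    return MR_m <= risk_i

# Negotiable restriction: usable if the risk is acceptable OR the
# extra compensation has been allocated.
def negotiable_DR(MR_m: float, risk_i: float, ec_paid: bool) -> bool:
    return MR_m <= risk_i or ec_paid

def total_compensation(i, MR, S, bc, ec, risk):
    """Total payout to owner O_i over all models (Equation (10) sketch):
    sum over m of I(i in S_m) * (bc_i^m + ec_i^m(max(0, MR_m - risk_i)))."""
    return sum(bc[m][i] + ec[m][i](max(0.0, MR[m] - risk[i]))
               for m in range(len(MR)) if i in S[m])

# A hypothetical two-model example: model 1 exceeds owner 1's
# preferred risk by 0.3, triggering a linear extra compensation.
MR = [0.2, 0.8]                                  # model risks
S = [{1, 2}, {1}]                                # training subsets S_m
bc = [{1: 1.0, 2: 0.5}, {1: 2.0}]                # base compensations
ec = [{1: lambda x: 0.0, 2: lambda x: 0.0},
      {1: lambda x: 3.0 * x}]                    # extra compensation functions
risk = {1: 0.5, 2: 0.3}                          # preferred risks
payout = total_compensation(1, MR, S, bc, ec, risk)   # 1.0 + (2.0 + 0.9)
```

Under these assumed numbers, owner 1 receives only base compensation for the low-risk model and an extra payment proportional to the excess risk for the high-risk model.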
Market Survey.
Prior to releasing models and prices for sale, the broker estimates the price for each model through a market survey, which can be done by the broker himself or by third-party companies such as consultation service providers. Let the survey size be K′, i.e., K′ potential model buyers are recruited to provide their purchasing willingness. The survey result contains K′ tuples, one from each survey participant. The k-th survey participant provides (tm_k, v_k), where tm_k ∈ {1, ..., M} is the target model and v_k is the acceptable price at which she is willing to purchase model tm_k. We note that the survey participants may have an incentive to report lower valuations in order to decrease the price, which can be alleviated by the digital goods auction [7] based on two approaches: random-sampling mechanisms and consensus estimates.
Revenue Maximization (Model Pricing).
With the surveyed purchasing willingness, the broker prices each model with the aim of maximizing revenue while following the market design principle of arbitrage-freeness. To do so, the revenue maximization problem is formulated as follows:

arg max_{⟨p(MR_1), ..., p(MR_M)⟩} Σ_{m=1}^{M} Σ_{k=1}^{K′} p(MR_m) · I(tm_k == m) · I(p(MR_m) ≤ v_k),   (11)
s.t. p(MR_m) + p(MR_m′) ≥ p(MR_m + MR_m′), ∀ MR_m, MR_m′ ≥ 0,   (12)
p(MR_m) ≥ p(MR_m′), ∀ MR_m ≥ MR_m′ ≥ 0,   (13)
p(MR_m) ≥ 0, ∀ MR_m ≥ 0,   (14)

where MR_m is the model risk defined in Section 4.1 and p(MR_m) is the price of model M_m whose model risk is MR_m. In the next section, we will see that this problem is coNP-hard, and we will provide an efficient approximation algorithm with an accuracy bound for our DP-Dealer instance.
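For small instances, the pricing problem (11)-(14) can be explored by brute force. The sketch below is an illustration under simplifying assumptions: candidate prices are restricted to the surveyed budgets (which may miss constrained optima), and the subadditivity constraint (12) is checked only on pairs of risks whose sum equals another tier's risk.

```python
from itertools import product

def max_revenue(risks, survey):
    """Brute-force sketch of problem (11)-(14) for small M.
    risks:  model risks MR_1 <= ... <= MR_M.
    survey: list of (target_model_index, budget) tuples (tm_k, v_k)."""
    M = len(risks)
    # Candidate prices per model: the surveyed budgets, plus "free".
    candidates = [sorted({v for tm, v in survey if tm == m} | {0.0})
                  for m in range(M)]
    best, best_p = 0.0, None
    for p in product(*candidates):
        # Monotone in model risk (13) and non-negative (14).
        if any(p[a] < p[b] for a in range(M) for b in range(M)
               if risks[a] >= risks[b]):
            continue
        # Subadditive over risks (12), checked where the sum is a tier.
        if not all(p[a] + p[b] >= p[c]
                   for a in range(M) for b in range(M) for c in range(M)
                   if abs(risks[a] + risks[b] - risks[c]) < 1e-12):
            continue
        # Revenue (11): each buyer pays iff the price fits her budget.
        rev = sum(p[tm] for tm, v in survey if p[tm] <= v)
        if rev > best:
            best, best_p = rev, p
    return best, best_p
```

For instance, with risks [1.0, 2.0] and survey [(0, 5.0), (0, 3.0), (1, 8.0)], constraint (12) rules out pricing the low tier at 3.0 while keeping the high tier at 8.0.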
5. A DIFFERENTIALLY PRIVATE DATA MARKETPLACE INSTANCE AND EFFICIENT APPROXIMATE OPTIMIZATION
In this section, we propose a concrete realization of Gen-Dealer: a differentially private data marketplace with model-based pricing, DP-Dealer. We illustrate the functionalities and restrictions of the data owners and the model buyers in DP-Dealer in Sections 5.1 and 5.2, respectively. Furthermore, in Section 5.3, we present the broker's functioning in DP-Dealer by providing concrete solutions, i.e., efficient approximate optimization algorithms that make the market practical. Finally, we summarize the complete DP-Dealer dynamics in Section 5.4.
We provide a data owner instance by instantiating the risk factor, the data restrictions, and the extra compensation functions. For the risk factor, we focus on privacy preservation, which is arguably one of the major concerns limiting individual users from contributing their data. The data owners wish to contribute data for model training with a certain level of privacy in exchange for a fair share of compensation. We follow the differential privacy notion and instantiate both the hard and the negotiable data usage restriction cases. Let the model risk factor MR_m of model M_m be described by ε_m, which corresponds to ε_m-differential privacy of the model. Traditional DP system and algorithm designs mostly consider the differential privacy strictness from the broker's and the model buyers' perspective, treating it as a tradeoff factor as long as the model utility stays within a certain level. This overlooks the true privacy demand of the data owners. Designs that do consider personalized DP budgets seldom consider what value the privacy parameter actually carries for the data owner. In such a lack-of-reward scenario, the data owner still has difficulty evaluating her own privacy demands. Our design allows the data owners to choose their own privacy preferences and receive rewards for providing more useful personal information. We believe it is a good starting point for a practical data marketplace with model-based pricing that respects the data owners' privacy demands and incentivizes the data owners' personal data contribution. Under the differential privacy risk factor, the three functionalities of data owner O_i are as follows.
1. Contributing Data: data owner O_i contributes her data z_i = (x_i, y_i) to the broker;
2. Setting Usage Restriction: let the personal risk preference risk_i be (ε_i, δ)-DP.
Hard DP requirement.
In the first simplified case, each data owner O_i chooses whether her data is allowed for training model M_m under a certain level of privacy restriction given by the DP parameter ε_i. That is, data owner O_i only allows her data to be used for models with DP restrictions stricter than ε_i, i.e., ε_m ≤ ε_i. Then, the data restriction is DR_i^m = I(ε_m ≤ ε_i).
Negotiable DP requirement.
We further consider a more sophisticated data owner strategy, where the data owners have more options for making their own trade-offs between compensation and privacy risk. For data owner O_i, in addition to the DP requirement ε_i, she is also willing to trade some of her privacy for more compensation. To do so, we introduce an extra compensation function ec_i^m(ε_i, ε_m), which pays an extra fraction of the base compensation (allocated based on the Shapley value) to compensate for the higher privacy risk. In this case, the data usage restriction function is also a function of the extra compensation: I(ec_i^m(ε_i, ε_m)) indicates whether the extra compensation has been allocated. Thus, DR_i^m(ec_i^m) = [I(ε_m ≤ ε_i) ∨ I(ec_i^m(ε_i, ε_m))].
3. Receiving Compensation: For both cases, we let the base compensation bc_i^m be proportional to the relative approximated Shapley value (see the broker instance in the following). In addition, for the negotiable case, we introduce the extra compensation function if ε_m > ε_i but the broker is willing to use z_i for training M_m to maximize the model value (subject to the constraint of the manufacturing budget).
Extra Compensation Function.
In particular, we present three types of the extra compensation function ec_i^m(ε_i, ε_m): concave, linear, and convex, to model three user inclinations regarding their personal privacy risks: reserved, balanced, and casual, correspondingly.
• linear: ec_i^m(ε_i, ε_m) = ρ_i^m · bc_i^m · max{0, ε_m − ε_i};
• convex: ec_i^m(ε_i, ε_m) = ρ_i^m · bc_i^m · (max{0, ε_m − ε_i})^2;
• concave: ec_i^m(ε_i, ε_m) = ρ_i^m · bc_i^m · (max{0, ε_m − ε_i})^(1/2).
For ease of presentation, we write ec_i^m for ec_i^m(ε_i, ε_m) to denote the extra compensation of data owner O_i on model M_m in the following.
Discussion. Across all cases, each data owner O_i has the maximum total potential privacy leakage Σ_{m=1}^{M} ε_i^m. In this work, we assume the data owners do not have much information about the data marketplace except the information given by the broker. Thus, they invariably invest their total privacy budget in each of the M models, which can be suboptimal for certain owners. In the future, we will consider more informed data owners, who have not only more knowledge about the data marketplace, e.g., the demand for each type of model, but also the quality and privacy restrictions of other data owners. With the additional market information, data owners can allocate their total privacy leakage more intelligently by investing their privacy budget in models that return them more compensation. In this paper, we assume δ is sufficiently small, so for convenience we do not consider its value and composition in the remainder of the paper.
Assumption. The goal of adding DP noise to models is to limit what can be inferred from the models about individual training data tuples. Therefore, it is better for the broker to explain to the data owners the relationship between the DP parameter ε and what can be inferred from the models.
We note that such a relationship can be implemented by [23].
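The three compensation shapes above can be sketched as a single function. This is a minimal sketch that assumes exponent 2 for the convex shape and 1/2 for the concave shape; the function name and the example parameter values are illustrative.

```python
import math

def extra_compensation(shape: str, rho: float, bc: float,
                       eps_i: float, eps_m: float) -> float:
    """Extra compensation ec_i^m(eps_i, eps_m) for the three inclinations:
    reserved (concave), balanced (linear), casual (convex).
    Assumes exponents 2 and 1/2 for the convex/concave shapes."""
    excess = max(0.0, eps_m - eps_i)     # privacy risk beyond the preference
    if shape == "linear":
        return rho * bc * excess
    if shape == "convex":
        return rho * bc * excess ** 2
    if shape == "concave":
        return rho * bc * math.sqrt(excess)
    raise ValueError(f"unknown shape: {shape}")
```

For a small excess (excess < 1), the concave shape demands the most extra compensation and the convex shape the least, matching the reserved vs. casual inclinations.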
The model buyers in DP-Dealer have the same functionalities as the model buyers in Gen-Dealer shown in Section 4.1, under the differential privacy instantiation.
Assumption. The arbitrage-free property of the models is established in terms of the differential privacy budget.
Remark. In practice, the model buyers can buy several weak models and convert them into strong ones by employing machine learning techniques such as ensemble learning, bagging, and boosting. However, it may be infeasible to formally characterize how model combinations behave in terms of model utility. Therefore, instead of ensuring the models satisfy arbitrage-freeness in terms of model utility, we ensure that the models satisfy arbitrage-freeness in terms of the DP parameter.
In this part, we present the broker's functioning in the differentially private data marketplace by providing concrete solutions for selecting optimal training subsets under budget constraints, conducting the market survey with potential model buyers, and pricing models for revenue maximization with an arbitrage-free guarantee. Given the training data along with the data owners' privacy and extra compensation functions, the broker aims to train the highest-valued model for each price tier under the privacy and manufacturing/compensation budget constraints. Depending on the data owners' requirements, the broker has two types of workflows.
Processing the Hard DP Restriction.
In this case, for model tier ε_m, the broker is allowed to release a model M_m trained strictly with a subset S_m of the data owners O_i with ε_m ≤ ε_i. We formalize the optimization problem as follows:

arg max_{S_m} Σ_{i ∈ S_m} SV_i^m,  s.t. Σ_{i ∈ S_m} bc_i^m ≤ MB_m.   (15)

We omit the solution of Equation (15) because it is a special case of the following Equation (16).
Processing the Negotiable DP Restriction.
A more practical case is the negotiable DP restriction, where the broker has the option to entice high-quality data owners to relax their privacy restrictions with extra compensation. We formalize it as the following Budget Constrained Maximum Value Problem (BCMVP) on model M_m:

arg max_{S_m} Σ_{i ∈ S_m} SV_i^m,  s.t. Σ_{i ∈ S_m} (bc_i^m + ec_i^m) ≤ MB_m,   (16)

where MB_m is the manufacturing budget of model M_m. In essence, we reallocate the payments of lower-valued data owners to become the extra compensation of higher-valued data owners. The above problem is difficult to solve exactly; in fact, we prove that it is NP-hard. Given this NP-hardness, we present three approximation algorithms. First, we present a pseudo-polynomial time algorithm using dynamic programming. Then, we present a polynomial-time approximation algorithm with a worst-case bound when no data owner's compensation is too large. Finally, we propose an enumeration-guess-based polynomial-time approximation algorithm with the same worst-case bound that relaxes the compensation constraint and uses the greedy algorithm as a subroutine.
NP-hardness proof.
We prove that BCMVP is NP-hard by showing that the well-known partition problem is polynomial-time reducible to BCMVP.
Definition (Decision Version of BCMVP). Given a set S of n data owners with their corresponding privacy compensations bc_1^m + ec_1^m, bc_2^m + ec_2^m, ..., bc_n^m + ec_n^m and Shapley values SV_1^m, SV_2^m, ..., SV_n^m, the decision version of BCMVP asks whether there is a subset S_1 ⊆ S such that Σ_{i ∈ S_1} (bc_i^m + ec_i^m) ≤ B and Σ_{i ∈ S_1} SV_i^m ≥ V.
Definition (Decision Version of Partition Problem). Given a set S of n positive integer values v_1, v_2, ..., v_n, the decision version of the partition problem asks whether the given set S can be partitioned into two subsets S_1 and S_2 such that the sum of the integers in S_1 equals the sum of the integers in S_2.
Theorem 1. The decision version of BCMVP is NP-hard.
Proof. We show that there exists a polynomial reduction by proving that there exists a subset S_1 ⊆ S such that Σ_{i ∈ S_1} (bc_i^m + ec_i^m) ≤ B and Σ_{i ∈ S_1} SV_i^m ≥ V if and only if there is a partition S_1 and S_2 such that the sum of the integer values in S_1 equals the sum of the integer values in S_2. We construct the polynomial reduction as follows. Consider the following instance of BCMVP: bc_i^m + ec_i^m = v_i and SV_i^m = v_i for i = 1, 2, ..., n, and B = V = (1/2) Σ_{i=1}^{n} v_i.
(1) If there exists a partition S_1 and S_2 such that the sum of the integer values in S_1 equals the sum of the integer values in S_2, then Σ_{i ∈ S_1} v_i = Σ_{i ∈ S_2} v_i = (1/2) Σ_{i=1}^{n} v_i. We choose the set of data owners S_1 in BCMVP, and we have Σ_{i ∈ S_1} (bc_i^m + ec_i^m) = Σ_{i ∈ S_1} v_i = (1/2) Σ_{i=1}^{n} v_i = B and Σ_{i ∈ S_1} SV_i^m = Σ_{i ∈ S_1} v_i = (1/2) Σ_{i=1}^{n} v_i = V. Therefore, there exists a subset S_1 ⊆ S such that Σ_{i ∈ S_1} (bc_i^m + ec_i^m) ≤ B and Σ_{i ∈ S_1} SV_i^m ≥ V.
(2) If there exists a subset S_1 ⊆ S such that Σ_{i ∈ S_1} (bc_i^m + ec_i^m) ≤ B and Σ_{i ∈ S_1} SV_i^m ≥ V, we partition the set S into S_1 and S_2 = S − S_1. We have Σ_{i ∈ S_1} (bc_i^m + ec_i^m) = Σ_{i ∈ S_1} v_i ≤ B = (1/2) Σ_{i=1}^{n} v_i and Σ_{i ∈ S_1} SV_i^m = Σ_{i ∈ S_1} v_i ≥ V = (1/2) Σ_{i=1}^{n} v_i. This implies that Σ_{i ∈ S_1} v_i = (1/2) Σ_{i=1}^{n} v_i. We also have Σ_{i ∈ S_2} v_i = Σ_{i=1}^{n} v_i − (1/2) Σ_{i=1}^{n} v_i = (1/2) Σ_{i=1}^{n} v_i. Therefore, there exists a partition S_1 and S_2 such that Σ_{i ∈ S_1} v_i = Σ_{i ∈ S_2} v_i = (1/2) Σ_{i=1}^{n} v_i.
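The reduction in the proof of Theorem 1 can be checked on small instances with a brute-force decision procedure. This is an illustrative sketch, not part of the marketplace itself; the function names are ours.

```python
from itertools import combinations

def bcmvp_decision(costs, values, B, V):
    """Brute-force decision version of BCMVP: is there a subset S_1
    with total cost <= B and total value >= V?"""
    n = len(costs)
    return any(sum(costs[i] for i in S) <= B and
               sum(values[i] for i in S) >= V
               for r in range(n + 1)
               for S in combinations(range(n), r))

def partition_via_bcmvp(vals):
    """Reduction from the partition problem, as in the proof:
    cost_i = value_i = v_i and B = V = (1/2) * sum(v_i)."""
    half = sum(vals) / 2
    return bcmvp_decision(vals, vals, half, half)
```

With cost and value both at most half the total and at least half the total, the selected subset must sum to exactly half, i.e., a valid partition.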
We present a pseudo-polynomial time algorithm for BCMVP. Pseudo-polynomial means that the algorithm's time complexity is polynomial in MB_m rather than only in the number of data owners n. We divide MB_m into ⌈MB_m / a⌉ parts, where a is the greatest common divisor of the bc_i^m + ec_i^m for all i = 1, 2, ..., n. We define SV[i, j] as the maximum total Shapley value that can be attained with compensation budget ≤ j × a using only the first i data owners. The detailed algorithm is shown in Algorithm 3. In Line 5, if the compensation budget is not enough, we do not need to consider the i-th data owner. Otherwise, in Line 8 we take O_i if doing so yields more value than using only O_1, ..., O_{i−1}.
Algorithm 3: Pseudo-polynomial time algorithm for BCMVP.
input: bc_i^m + ec_i^m, MB_m, and SV_i^m for i = 1, 2, ..., n.
output: S_m.
1 for j = 0 to ⌈MB_m / a⌉ do
2   SV[0, j] = 0;
3 for i = 1 to n do
4   for j = 0 to ⌈MB_m / a⌉ do
5     if bc_i^m + ec_i^m > j × a then
6       SV[i, j] = SV[i − 1, j];
7     else
8       SV[i, j] = max{ SV[i − 1, j], SV[i − 1, j − (bc_i^m + ec_i^m)/a] + SV_i^m };
9 backtrack from SV[n, ⌈MB_m / a⌉] to SV[1, 0] to find the selected O_i;
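A compact Python sketch of Algorithm 3 follows. For brevity it uses the common one-dimensional knapsack array instead of the full SV[i, j] table, plus a keep matrix for the backtracking step; the function name is ours.

```python
from math import gcd
from functools import reduce

def bcmvp_dp(costs, values, budget):
    """Pseudo-polynomial DP for BCMVP (Algorithm 3 sketch): a 0/1 knapsack
    over compensation units a = gcd of the (integer) costs bc_i^m + ec_i^m."""
    a = reduce(gcd, costs)
    cap = budget // a                       # number of affordable units
    best = [0] * (cap + 1)                  # best[j]: max value within j units
    keep = [[False] * (cap + 1) for _ in costs]
    for i, (c, v) in enumerate(zip(costs, values)):
        w = c // a
        for j in range(cap, w - 1, -1):     # reverse scan: each owner used once
            if best[j - w] + v > best[j]:
                best[j] = best[j - w] + v
                keep[i][j] = True
    # Backtrack to recover the selected owners S_m.
    S, j = [], cap
    for i in range(len(costs) - 1, -1, -1):
        if keep[i][j]:
            S.append(i)
            j -= costs[i] // a
    return best[cap], sorted(S)
```

The outer loop over owners and the inner loop over budget units give the O(n · MB_m / a) pseudo-polynomial running time discussed above.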
The time cost of the pseudo-polynomial time algorithm in Algorithm 3 is dominated by the compensation budget. We therefore propose a simple yet efficient polynomial-time approximation algorithm in Algorithm 4, which is not sensitive to the compensation budget. We sort the data owners in decreasing order of Shapley value per unit of compensation, SV_i^m / (bc_i^m + ec_i^m), in Line 3. In Lines 6-8, we proceed to take the data owners, starting from the highest SV_i^m / (bc_i^m + ec_i^m), until the budget is exhausted. We also present a lower bound for Algorithm 4 in Theorem 2, where MAX is the maximum value obtainable in problem (16).
Algorithm 4: Polynomial-time approximation algorithm for BCMVP.
input: bc_i^m + ec_i^m, MB_m, and SV_i^m for i = 1, 2, ..., n.
output: S_m.
1 for i = 1 to n do
2   compute SV_i^m / (bc_i^m + ec_i^m);
3 sort SV_i^m / (bc_i^m + ec_i^m) for i = 1, 2, ..., n in decreasing order and denote the result as SV_1^m / (bc_1^m + ec_1^m) ≥ SV_2^m / (bc_2^m + ec_2^m) ≥ ... ≥ SV_n^m / (bc_n^m + ec_n^m);
4 B = 0;
5 i = 1;
6 while B ≤ MB_m do
7   add bc_i^m + ec_i^m to B;
8   i = i + 1;
9 return the O_i corresponding to the bc_i^m + ec_i^m counted in B;
Theorem 2. If for all i, bc_i^m + ec_i^m ≤ ζ · MB_m, Algorithm 4 has the lower bound guarantee (1 − ζ) · MAX.
Proof. Let bc_k^m + ec_k^m be the first data owner that is not accepted by Algorithm 4, i.e., we choose the data owners corresponding to bc_1^m + ec_1^m, bc_2^m + ec_2^m, ..., bc_{k−1}^m + ec_{k−1}^m. For 1 ≤ i ≤ k, we have
SV_i^m / (bc_i^m + ec_i^m) ≥ SV_k^m / (bc_k^m + ec_k^m)
⇒ SV_i^m ≥ (bc_i^m + ec_i^m) · SV_k^m / (bc_k^m + ec_k^m)
⇒ SV_1^m + SV_2^m + ... + SV_k^m ≥ (bc_1^m + ec_1^m + bc_2^m + ec_2^m + ... + bc_k^m + ec_k^m) · SV_k^m / (bc_k^m + ec_k^m).
Because bc_k^m + ec_k^m is the first not accepted, i.e., bc_1^m + ec_1^m + bc_2^m + ec_2^m + ... + bc_k^m + ec_k^m > MB_m, we have
⇒ SV_k^m ≤ (SV_1^m + SV_2^m + ... + SV_k^m) · (bc_k^m + ec_k^m) / MB_m
⇒ SV_k^m ≤ ζ · (SV_1^m + SV_2^m + ... + SV_k^m)
⇒ SV_k^m ≤ ζ / (1 − ζ) · (SV_1^m + SV_2^m + ... + SV_{k−1}^m).
Because SV_1^m + SV_2^m + ... + SV_k^m ≥ MAX, we have SV_1^m + SV_2^m + ... + SV_{k−1}^m ≥ (1 − ζ) · MAX. Therefore, Algorithm 4 has the lower bound guarantee (1 − ζ) · MAX.
Lemma 2. In any optimal solution, there are at most ⌈1/α⌉ data owners with compensation bc_i^m + ec_i^m whose corresponding Shapley value SV_i^m is at least α · MAX.
Lemma 2 is easy to see; otherwise, the value of the optimal solution would be larger than MAX, which is a contradiction.
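The greedy Algorithm 4 can be sketched in Python as follows. Note this is a sketch under one small assumption: this variant skips owners who no longer fit and continues down the ranking, a common refinement of the stop-at-first-overflow rule stated above; the function name is ours.

```python
def bcmvp_greedy(costs, values, budget):
    """Greedy sketch of Algorithm 4: take owners by decreasing value density
    SV_i^m / (bc_i^m + ec_i^m) until the compensation budget runs out."""
    order = sorted(range(len(costs)),
                   key=lambda i: values[i] / costs[i],
                   reverse=True)
    S, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:      # skip unaffordable owners
            S.append(i)
            spent += costs[i]
    return sum(values[i] for i in S), sorted(S)
```

Sorting dominates the cost, so the sketch runs in O(n log n) time, independent of the budget MB_m.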
Enumeration-guess-based polynomial-time approximation algorithm.
Although Algorithm 4 can achieve (1 − ζ) · MAX, the requirement bc_i^m + ec_i^m ≤ ζ · MB_m is too strict. We present another algorithm with the same worst-case bound but without this requirement. Let α ∈ (0, 1) be a fixed constant and h = ⌈1/α⌉. We try to guess the h most profitable data owners in an optimal solution and compute the rest greedily as in Algorithm 4. The detailed algorithm is shown in Algorithm 5. We first enumerate all subsets of data owners of size ≤ h in Lines 1-3. We delete the subsets whose compensation budget is higher than MB_m in Lines 4-6. In Lines 7-10, for each remaining subset, we call Algorithm 4 to maximize the value with the budget remaining after taking the ≤ h data owners.
Algorithm 5: Enumeration-guess-based polynomial-time approximation algorithm for BCMVP.
input: bc_i^m + ec_i^m, MB_m, and SV_i^m for i = 1, 2, ..., n.
output: S_m.
1 for i = 1 to h do
2   choose i data owner(s) to compose a subset S′;
3 we have Σ_{i=1}^{h} C(n, i) such subsets;
4 for j = 1 to Σ_{i=1}^{h} C(n, i) do
5   compute the compensation budget of the data owners in S′_j;
6 delete those S′ whose compensation budget is larger than MB_m; we have r remaining subsets S′_1, S′_2, ..., S′_r;
7 for each subset S′_j, j = 1, 2, ..., r do
8   let O_a be the data owner with the least Shapley value in S′_j;
9   remove from S − S′_j all data owners whose Shapley value is larger than SV_a^m, obtaining a new subset S′′_j;
10  run Algorithm 4 on S′′_j with the remaining compensation budget MB_m − Σ_{i ∈ S′_j} (bc_i^m + ec_i^m);
11 return the data owners in S′_j and S′′_j, where S′_j and S′′_j have the highest total Shapley value among j = 1, 2, ..., r;
Theorem 3. Algorithm 5 runs in O(n^{⌈1/α⌉+1}) time with the (1 − α) · MAX worst-case bound.
Proof. For the time complexity, we have at most Σ_{i=1}^{h} C(n, i) subsets S′ after deleting the subsets whose compensation budget is larger than MB_m; that is, we have at most n^h different subsets S′. For each subset S′, the greedy Algorithm 4 only requires linear time to handle the remaining data owners. Therefore, the total time cost of Algorithm 5 is O(n^{⌈1/α⌉+1}).
For the worst-case approximation bound, we assume the subset S′ in the optimal solution has exactly h data owners. We note that the subset S′ in the optimal solution may have fewer than h data owners, but it is easy to see that this does not affect the analysis: if the number of data owners in the optimal solution is less than h, the optimal solution is included among the enumerated subsets S′. In the following, we discuss the case where the number of data owners in the optimal solution is larger than h. We have h + k data owners O_1, ..., O_h, O_{h+1}, ..., O_{h+k−1}, O_{h+k} to consider, where O_1, ..., O_h are the data owners in subset S′, O_{h+i} is the data owner with the i-th highest SV_i^m / (bc_i^m + ec_i^m) in S′′, and O_{h+k} is the data owner with the highest SV_i^m / (bc_i^m + ec_i^m) rejected by the greedy Algorithm 4. Let MAX′ be the optimal value for the data owners in S′′. Therefore, we have SV(S′′) + SV_{h+k}^m ≥ MAX′.
⇒ SV(S′′) ≥ MAX′ − SV_{h+k}^m.
Based on Lemma 2, there are at most ⌈1/α⌉ data owners with compensation bc_i^m + ec_i^m whose corresponding Shapley value SV_i^m is at least α · MAX in any optimal solution, and those ⌈1/α⌉ data owners are already pruned in Line 9. Therefore, we have SV_{h+k}^m ≤ α · MAX.
⇒ SV(S′′) ≥ MAX′ − α · MAX
⇒ SV(S′) + SV(S′′) ≥ MAX′ + SV(S′) − α · MAX
⇒ SV(S′) + SV(S′′) ≥ MAX − α · MAX.
That is, Algorithm 5 has the worst-case bound $(1 - \alpha)MAX$.

In the previous subsection, the broker requires the model budget as a constraint to manufacture the models, which is set before the models are available to the model buyers. To acquire the budget variable, a common practice is to perform a market survey to collect purchasing willingness from potential model buyers. That is, the broker presents a series of potential models along with their performance estimates to the potential model buyers, who then state which model they are willing to purchase and at what price. Based on the survey results, the broker can estimate the budget by solving the revenue maximization problem in the next subsection. The market survey stage is sometimes the earliest stage in the overall market dynamics.

In the following, we propose a survey approach that overcomes two difficulties. First, the broker encounters different standards for categorizing the tier of each model. During manufacture, each data owner uses a differential privacy budget to differentiate the model tiers; the model buyers, however, are unlikely to care about the privacy restriction of the model they purchase. On the contrary, it is the model's prediction performance that they pay attention to. Thus, to sell models to the model buyers, the broker needs to translate the $\epsilon^m$-DP based model description into a prediction performance based model description. This raises the second difficulty: the Shapley value based utility measure is not available at the survey stage (the data may not even have been collected yet). To overcome both difficulties, we utilize a common estimate of utility for DP ERM models, which converts the DP parameter to a general excess population loss by assuming all data samples are independently and identically distributed.
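Stepping back to the selection procedure itself, the guess-and-greedy strategy of Algorithm 5 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the data-owner records and names are ours, and Algorithm 4's greedy step is inlined as a ratio-based fill under the remaining budget.

```python
from itertools import combinations

def greedy_fill(owners, budget):
    """Greedy step in the spirit of Algorithm 4: repeatedly take the
    remaining owner with the highest Shapley-value-per-compensation
    ratio while the budget allows."""
    chosen, spent = [], 0.0
    for o in sorted(owners, key=lambda o: o["sv"] / o["cost"], reverse=True):
        if spent + o["cost"] <= budget:
            chosen.append(o)
            spent += o["cost"]
    return chosen

def guess_and_greedy(owners, budget, h):
    """Sketch of Algorithm 5: guess every seed subset S' of size <= h that
    fits the budget, prune owners whose Shapley value exceeds the smallest
    one in S', greedily fill the rest, and keep the best total value."""
    best, best_sv = [], -1.0
    for r in range(1, h + 1):
        for seed in combinations(owners, r):
            cost = sum(o["cost"] for o in seed)
            if cost > budget:
                continue  # the seed alone already violates the budget
            sv_floor = min(o["sv"] for o in seed)
            rest = [o for o in owners if o not in seed and o["sv"] <= sv_floor]
            filled = greedy_fill(rest, budget - cost)
            total = sum(o["sv"] for o in list(seed) + filled)
            if total > best_sv:
                best_sv, best = total, list(seed) + filled
    return best, best_sv
```

With $h = \lceil 1/\alpha \rceil$, enumerating the seeds dominates the cost, matching the $O(n^{\lceil 1/\alpha \rceil + 1})$ bound above.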
It also reveals the relation between the number of training samples and the utility estimate, which provides a guide to the data collection. For training each model $M_m$ subject to DP restriction $\epsilon^m$, the broker uses a subset $S^m$ of all data available in the market $D$, whose data owners have $\epsilon_i \geq \epsilon^m$. Recall that the full dataset $D$ is drawn from distribution $\mathcal{D}$, i.e., $D \sim \mathcal{D}$. The model buyers care about the performance of the model on their prediction tasks. That is, for $z_{predict} = (x_{predict}, y_{predict})$, where $z_{predict} \sim \mathcal{D}$, the broker estimates the value of a particular tier of model by estimating $l(w^m, z_{predict})$. To formalize this, we utilize the notion of population loss as follows.

Definition (Population Loss).
\[ L(w; \mathcal{D}) := \mathbb{E}_{z \sim \mathcal{D}}[l(w, z)], \tag{17} \]
where the expectation is over the distribution of the data.

Thus, it measures the expected prediction loss of a model $M_m$ given the output $w^m$. With the population loss, the broker provides the maximum discrepancy between the ideal model with model parameter $w^*$ and the DP one for sale, $w_{DP}^m$. The following excess population loss notion formalizes this discrepancy.

Definition (Excess Population Loss [9]).
\[ \Delta L(\mathcal{A}^m; S^m) := \mathbb{E}[L(w_{DP}^m; \mathcal{D}) - L(w^*; \mathcal{D})], \tag{18} \]
where $w^* = \arg\min_{w \in \mathcal{W}} L(w; \mathcal{D})$ ($\mathcal{W}$ is the model parameter space), $\mathcal{A}^m$ denotes the DP algorithm in Algorithm 2 for training model $M_m$ on the DP restricted dataset $S^m$, and the expectation is taken over the randomness of Algorithm 2.

The specific accuracy measure can vary from application to application and can be chosen according to the model buyers. The order of the population loss is more universal and general than a specific choice of accuracy metric.
Also, we believe the order makes more sense than a particular number reported on a given testing dataset. Thus, the excess population loss serves as a good estimate of utility at the market survey stage.

The following theorem from [9] provides an estimate of the excess population loss for model $M_m$.

Theorem. Under certain conditions, the excess population loss for the output $w_{DP}^m$ of the objective perturbation based training algorithm $\mathcal{A}^m$ is
\[ \Delta L(\mathcal{A}^m; \epsilon^m, S^m) = O\left(\max\left\{ \frac{1}{\sqrt{|S^m|}},\; \frac{\sqrt{d \log(1/\delta)}}{\epsilon^m |S^m|} \right\}\right). \tag{19} \]

From the above theorem, we can see that the model price, dominated by the excess population loss, is an increasing function of $\epsilon$ for two reasons. 1) From the model buyer perspective, the larger the $\epsilon$ (i.e., the looser the privacy restriction), the higher the model utility, which better meets the model buyers' needs. This is consistent with the common belief that a better product costs more. 2) From the data owner perspective, fewer owners are willing to allow their data to be used for loosely privacy-protected models, which leads to fewer contributors without extra compensation, so the broker incurs an increased manufacturing cost to recruit more data owners. This is also consistent with the common belief that a product with a higher manufacturing cost should be more expensive.

In addition to the privacy budget, the training set size plays an important role. First, in the previous model-based pricing paper [13], the authors claim that versioning models of various quality does not result in extra cost for the broker; this is, however, not true according to the $|S^m|$ term in the above theorem. In fact, the manufacturing cost actually increases for higher-tier models (i.e., those with a larger $\epsilon^m$). That is, fewer data owners are willing to contribute their data to models with larger $\epsilon^m$ without extra compensation.
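The bound in Theorem (19) is what the broker would use to translate a privacy tier $\epsilon^m$ into a performance estimate for the survey. The following sketch computes the order-level estimate; the function name and the unknown leading constant $c$ are our own illustrative assumptions.

```python
import math

def excess_loss_bound(n, eps, d, delta, c=1.0):
    """Order-level estimate of the excess population loss in Theorem (19):
    O(max{1/sqrt(n), sqrt(d*log(1/delta)) / (eps*n)}), where n = |S^m| is
    the training set size, d the dimension, and (eps, delta) the DP
    parameters.  c stands in for the unspecified constant."""
    return c * max(1.0 / math.sqrt(n),
                   math.sqrt(d * math.log(1.0 / delta)) / (eps * n))
```

Consistent with the discussion above, the estimate decreases as either the privacy budget $\epsilon$ or the training set size $n$ grows, so a looser-privacy tier needs either a higher price or a larger (more expensive) training subset.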
Thus, the broker has to spend more manufacturing cost on recruiting more data owners for model training (i.e., on extra compensation); otherwise, the smaller $S^m$ will lead to lower model utility despite the increased $\epsilon$. Second, the $|S^m|$ term in the above theorem also provides good guidance to the broker in the data collection phase, i.e., at least how many data owners the broker has to engage.

Pricing Models for Revenue Maximization with Arbitrage-free Guarantee.

Before the market survey, the broker provides the excess population loss estimate for each DP model to the $K'$ survey participants, who are potential model buyers and are interested in the model performance rather than the privacy risk. Each participant $B_k$ is asked to state which model they want to purchase (target), $tm_k$, and at what price, $v_k$. To make the pricing $p(\epsilon^m)$ for model $M_m$ with differential privacy $\epsilon^m$ arbitrage-free, the objective function for the broker is
\[ \arg\max_{\langle p(\epsilon^1), ..., p(\epsilon^M) \rangle} \sum_{m=1}^{M} \sum_{k=1}^{K'} p(\epsilon^m) \cdot I(tm_k == m) \cdot I(p(\epsilon^m) \leq v_k), \tag{20} \]
\[ s.t.\;\; p(\epsilon^m) + p(\epsilon^{m'}) \geq p(\epsilon^m + \epsilon^{m'}), \quad \forall \epsilon^m, \epsilon^{m'} \geq 0, \tag{21} \]
\[ p(\epsilon^m) \geq p(\epsilon^{m'}) \geq 0, \quad \forall \epsilon^m \geq \epsilon^{m'} \geq 0. \tag{22} \]
We refer to this optimization problem as the Revenue Maximization (RM) problem and denote the optimal revenue for RM as $OPT(RM)$. We use $(m, sp_m[j])$ to denote the $j$th lowest survey price point for the $m$th model. For example, in Figure 2, we have six survey participants shown as black disks: $(1, sp_1[1] = 1)$, $(1, sp_1[2] = 4)$, $(2, sp_2[1] = 3)$, $(2, sp_2[2] = 7)$, $(3, sp_3[1] = 5)$, and $(3, sp_3[2] = 8)$. The smaller the $\epsilon$, the stricter the privacy restriction, which results in a lower model price.
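Objective (20) is simple to evaluate for a fixed price vector. The sketch below does so on the six survey price points of the running example of Figure 2 (pairs of target model and valuation); the data layout and names are ours.

```python
# Six survey price points (target model, valuation) from the running
# example of Figure 2; models are numbered 1..3.
SURVEYS = [(1, 1), (1, 4), (2, 3), (2, 7), (3, 5), (3, 8)]

def revenue(prices, surveys):
    """Objective (20): participant k buys her target model tm_k iff the
    posted price p(eps^{tm_k}) does not exceed her valuation v_k; the
    broker collects the posted price for every such purchase."""
    return sum(prices[tm] for tm, v in surveys if prices[tm] <= v)
```

For instance, the price vectors $\langle 4, 5, 5 \rangle$ and $\langle 4, 7, 8 \rangle$ both collect a revenue of 19 on these survey points, which is the optimum derived later in this subsection.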
The reason for this price trend will be revealed in the next subsection.

Given a set of survey price points $(m, sp_m[1])$ for $m = 1, 2, ..., M$, deciding whether there exists a pricing function $p(\epsilon^m)$ that 1) is positive, monotone, and subadditive, and 2) ensures $p(\epsilon^m) = sp_m[1]$ for all $m = 1, 2, ..., M$, is a co-NP hard problem [13]. It is easy to see that this co-NP hard problem is a special case of our RM problem, i.e., the case with only one survey price point per model. Therefore, it is suspected that there is no polynomial-time algorithm for the proposed RM problem.

To overcome the hardness of the original RM optimization problem, we approximately solve the problem by relaxing the subadditivity constraint. We relax the constraint $p(\epsilon^m) + p(\epsilon^{m'}) \geq p(\epsilon^m + \epsilon^{m'})$ in Equation (21) to
\[ p(\epsilon^m)/\epsilon^m \geq p(\epsilon^{m'})/\epsilon^{m'}, \quad \forall \epsilon^{m'} \geq \epsilon^m, \tag{23} \]
which still satisfies the arbitrage-free requirement. We refer to this relaxed problem as the Relaxed Revenue Maximization (RRM) problem. Generally speaking, we want to make sure that the unit price for large purchases is smaller than or equal to the unit price for small purchases, which is practical in real marketplaces.

In the following, we show that the maximum revenue for RRM, $OPT(RRM)$, has a lower bound with respect to the maximum revenue for RM, $OPT(RM)$.

Theorem. The maximum revenue for RM has the following relationship with the maximum revenue for RRM:
\[ OPT(RRM) \geq OPT(RM)/2. \]

Proof. Given a feasible solution $p_{RM}$ of the revenue maximization problem, we construct a solution $p_{RRM}$ such that for all $m > 0$, $p_{RRM}(\epsilon^m) = \epsilon^m \times \min_{0 < x \leq m} \{p_{RM}(\epsilon^x)/\epsilon^x\}$, where $p_{RRM}(\epsilon^m)$ is the price in RRM for model $M_m$ with privacy parameter $\epsilon^m$. Let $0 < m \leq m'$. We show that $p_{RRM}$ is a feasible solution of the relaxed maximization problem as follows.

We first prove that $p_{RRM}$ satisfies the monotonicity property. Let $x'_{min} = \arg\min_{0 < x \leq m'} \{p_{RM}(\epsilon^x)/\epsilon^x\}$. We have two cases for $x'_{min}$: $0 < x'_{min} \leq m$ and $m < x'_{min} \leq m'$. For the first case, $0 < x'_{min} \leq m$, we have $\min_{0 < x \leq m'} \{p_{RM}(\epsilon^x)/\epsilon^x\} = \min_{0 < x \leq m} \{p_{RM}(\epsilon^x)/\epsilon^x\}$ because $x'_{min}$ lies in the range $(0, m]$, and then $p_{RRM}(\epsilon^{m'}) = \epsilon^{m'} \times \min_{0 < x \leq m'} \{p_{RM}(\epsilon^x)/\epsilon^x\} \geq \epsilon^m \times \min_{0 < x \leq m} \{p_{RM}(\epsilon^x)/\epsilon^x\} = p_{RRM}(\epsilon^m)$. For the second case, $m < x'_{min} \leq m'$, we have $p_{RRM}(\epsilon^m) = \epsilon^m \times \min_{0 < x \leq m} \{p_{RM}(\epsilon^x)/\epsilon^x\} \leq \epsilon^m \times \{p_{RM}(\epsilon^m)/\epsilon^m\} = p_{RM}(\epsilon^m) \leq p_{RM}(\epsilon^{x'_{min}}) = \epsilon^{x'_{min}} \{p_{RM}(\epsilon^{x'_{min}})/\epsilon^{x'_{min}}\}$.
Because $x'_{min} \leq m'$ and $p_{RM}(\epsilon^{x'_{min}})/\epsilon^{x'_{min}} = \min_{0 < x \leq m'} \{p_{RM}(\epsilon^x)/\epsilon^x\}$, we have $\epsilon^{x'_{min}} \{p_{RM}(\epsilon^{x'_{min}})/\epsilon^{x'_{min}}\} \leq \epsilon^{m'} \times \min_{0 < x \leq m'} \{p_{RM}(\epsilon^x)/\epsilon^x\} = p_{RRM}(\epsilon^{m'})$. That is, $p_{RRM}(\epsilon^m) \leq p_{RRM}(\epsilon^{m'})$.

We next prove that $p_{RRM}$ satisfies the relaxed subadditivity constraint. We have $p_{RRM}(\epsilon^m)/\epsilon^m = \min_{0 < x \leq m} \{p_{RM}(\epsilon^x)/\epsilon^x\} \geq \min_{0 < x \leq m'} \{p_{RM}(\epsilon^x)/\epsilon^x\} = p_{RRM}(\epsilon^{m'})/\epsilon^{m'}$. Therefore, we have $p_{RRM}(\epsilon^m)/\epsilon^m \geq p_{RRM}(\epsilon^{m'})/\epsilon^{m'}$, which is the relaxed subadditivity constraint.

Let $x_{min} = \arg\min_{0 < x \leq m} \{p_{RM}(\epsilon^x)/\epsilon^x\}$, i.e., $x_{min} \leq m$. We show that for every $m > 0$, we have $p_{RM}(\epsilon^m)/2 \leq p_{RRM}(\epsilon^m)$, as follows. We have $p_{RM}(\epsilon^m) = p_{RM}(\epsilon^{x_{min}} \cdot \frac{\epsilon^m}{\epsilon^{x_{min}}}) \leq p_{RM}(\epsilon^{x_{min}} \cdot \lceil \frac{\epsilon^m}{\epsilon^{x_{min}}} \rceil) \leq \lceil \frac{\epsilon^m}{\epsilon^{x_{min}}} \rceil \, p_{RM}(\epsilon^{x_{min}})$ because $p_{RM}$ satisfies the subadditivity constraint. Therefore,
\[ p_{RRM}(\epsilon^m) = \epsilon^m \cdot \frac{p_{RM}(\epsilon^{x_{min}})}{\epsilon^{x_{min}}} \geq \frac{\epsilon^m}{\epsilon^{x_{min}}} \cdot \frac{p_{RM}(\epsilon^m)}{\lceil \epsilon^m/\epsilon^{x_{min}} \rceil} \geq \frac{\epsilon^m}{\epsilon^{x_{min}}} \cdot \frac{p_{RM}(\epsilon^m)}{\epsilon^m/\epsilon^{x_{min}} + 1} \geq \frac{p_{RM}(\epsilon^m)}{2}, \]
where the last inequality holds because $\epsilon^m/\epsilon^{x_{min}} \geq 1$ (since $x_{min} \leq m$). Because $p_{RRM}(\epsilon^m) = \epsilon^m \times \min_{0 < x \leq m} \{p_{RM}(\epsilon^x)/\epsilon^x\} \leq \epsilon^m \times \{p_{RM}(\epsilon^m)/\epsilon^m\} = p_{RM}(\epsilon^m)$, we have $p_{RRM}(\epsilon^m) \leq p_{RM}(\epsilon^m)$. Therefore, for each $m > 0$, we have $\sum_{k=1}^{K'} I(tm_k == m) \cdot I(p_{RRM}(\epsilon^m) \leq v_k) \geq \sum_{k=1}^{K'} I(tm_k == m) \cdot I(p_{RM}(\epsilon^m) \leq v_k)$, i.e., every buyer who can afford $p_{RM}$ can also afford $p_{RRM}$. Combined with $p_{RM}(\epsilon^m)/2 \leq p_{RRM}(\epsilon^m)$, we conclude that $OPT(RM)/2 \leq OPT(RRM)$.

Dynamic Programming Algorithm.
In this part, we show an efficient dynamic programming algorithm to solve the relaxed revenue maximization problem.
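The dynamic program developed in the rest of this subsection can be sketched as follows. It assumes the candidate prices per model (the complete solution space, constructed below by Algorithm 6) are already given, and implements recurrence (24) with the per-model revenue $MR$ computed from the survey points; names and data layout are ours, with models 0-indexed.

```python
def dp_pricing(eps, points, surveys):
    """Sketch of the dynamic program (recurrence (24)).  eps[m]: privacy
    parameter of model m; points[m]: candidate prices for model m from the
    complete solution space; surveys: (target model, valuation) pairs.
    Feasibility: prices monotone in eps and unit price p/eps non-increasing."""
    def mr(m, p):  # revenue of model m if it is priced at p
        return p * sum(1 for tm, v in surveys if tm == m and v >= p)
    opt = [{p: mr(0, p) for p in points[0]}]   # model 0 has no constraint
    for m in range(1, len(eps)):
        layer = {}
        for p in points[m]:
            feas = [opt[m - 1][q] for q in points[m - 1]
                    if q <= p and q / eps[m - 1] >= p / eps[m]]
            if feas:  # price p is reachable under both constraints
                layer[p] = max(feas) + mr(m, p)
        opt.append(layer)
    return max(opt[-1].values())
```

On the running example of Figure 2 (with its complete solution space), this sketch returns the optimal revenue of 19 derived later.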
Figure 2: Revenue maximization example.

At first glance, for each model, it seems that any value in the price range could be an optimal price, which makes the problem arguably intractable. In the following, we show how to construct a complete solution space over a discrete set of price points and prove that this complete solution space is sufficient for obtaining the maximum revenue.

Constructing Complete Solution Space.
It is easy to see that the survey price points should be contained in the complete solution space. Each survey price point $(m, sp_m[j])$ determines a unit price $sp_m[j]/\epsilon^m$ and a price $sp_m[j]$. The general idea is that if we choose $(m, sp_m[j])$ as the price point for model $M_m$, it affects the prices for models $M_k$, $k = 1, ..., m-1$, and $M_k$, $k = m+1, ..., M$, due to the subadditivity constraint. If we set the optimal price of model $M_m$ to $sp_m[j]$, the unit price of the models after model $M_m$ cannot be larger than $sp_m[j]/\epsilon^m$. Therefore, for each survey price point $(m, sp_m[j])$, we draw a line $l_{(m, sp_m[j])}$ through the survey price point $(m, sp_m[j])$ and the origin. For each model $M_m$, we draw a vertical line $l_{M_m}$. By intersecting line $l_{(m, sp_m[j])}$ with these vertical lines, we obtain $M - m$ new price points $(l_{M_k}, l_{(m, sp_m[j])})$ for $k = m+1, ..., M$. We note that we do not need to generate such price points for $k = 1, ..., m-1$ because the price point in model $M_m$ can only constrain the unit price of models $M_k$, $k = m+1, ..., M$. Furthermore, for each model, its price is also constrained by the survey prices of its right neighbors. Therefore, we need to add the survey price points of model $M_m$ to models $M_k$, $k = 1, ..., m-1$. We use a flag $f(m, p_m[j])$ to distinguish the survey price points from the other points in the complete solution space. For ease of presentation in the following, we name the price points from Line 1 SV (survey) points, the price points from Line 8 SC (subadditivity constraint) points, and the price points from Line 10 MC (monotonicity constraint) points.

Algorithm 6: Constructing the complete solution space for the relaxed revenue maximization problem.
input: Models with noise parameters $\epsilon^m$ and the $j$th lowest survey price point for each model, denoted $(m, sp_m[j])$.
output: Complete solution space.
1 add all the survey price points $(m, sp_m[j])$ to the complete solution space;
2 for each survey price point $(m, sp_m[j])$ do
3   draw a line $l_{(m, sp_m[j])}$ through this point and the origin;
4 for each model with noise parameter $\epsilon^m$ do
5   draw a vertical line $l_{M_m}$;
6 for each line $l_{M_m}$ do
7   for each line $l_{(m, sp_m[j])}$ do
8     add the point $(l_{M_k}, l_{(m, sp_m[j])})$ obtained by intersecting line $l_{M_k}$ and line $l_{(m, sp_m[j])}$ to the complete solution space for $k = m+1, ..., M$;
9 for each survey price point $(m, sp_m[j])$ do
10   add price point $(k, sp_m[j])$ to the complete solution space for $k = 1, ..., m-1$;
11 for each price point $(m, p_m[j])$ in the complete solution space do
12   if $(m, p_m[j])$ is a survey price point then
13     $f(m, p_m[j]) = 1$;
14   else
15     $f(m, p_m[j]) = 0$;

Example. We show a running example of Algorithm 6. In Figure 2, we assume $\epsilon^1 = 1$, $\epsilon^2 = 2$, and $\epsilon^3 = 3$. We add the survey price points $(1, sp_1[1] = 1)$, $(1, sp_1[2] = 4)$, $(2, sp_2[1] = 3)$, $(2, sp_2[2] = 7)$, $(3, sp_3[1] = 5)$, and $(3, sp_3[2] = 8)$ to the complete solution space in Line 1. In Line 2, for the survey price point $(1, 1)$, we draw a line $l_{(1,1)}$ through this point and the origin in Line 3. In Line 4, for model $M_2$ with noise parameter $\epsilon^2 = 2$, we draw a vertical line $l_{M_2}$. In Lines 6-8, for $l_{(1,1)}$ and $l_{M_2}$, we add the intersection $(l_{M_2}, l_{(1, sp_1[1])}) = (2, 2)$ to the complete solution space. In total, we have six such new price points, shown in boxes. In Line 9, for the survey price point $(3, sp_3[1]) = (3, 5)$, we add the price points $(2, 5)$ and $(1, 5)$ to the complete solution space in Line 10. Similarly, we also have six such new price points, shown in circles. Therefore, for the complete solution space, we have six price points for each of models $M_1$ and $M_3$.
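The point-generation step of Algorithm 6 can be sketched directly, without the geometric detour; this is our own compact rendering, with models 0-indexed.

```python
def complete_solution_space(eps, survey):
    """Sketch of Algorithm 6.  survey[m] lists the survey prices of model m.
    SV points: the survey prices themselves (Line 1).  SC points: each
    survey point's unit price projected onto every later model, i.e.,
    p / eps[m] * eps[k] for k > m (Line 8).  MC points: each survey price
    copied to every earlier model, k < m (Line 10)."""
    M = len(eps)
    space = [set(survey[m]) for m in range(M)]      # SV points
    for m in range(M):
        for p in survey[m]:
            for k in range(m + 1, M):               # SC points
                space[k].add(p / eps[m] * eps[k])
            for k in range(m):                      # MC points
                space[k].add(p)
    return [sorted(s) for s in space]
```

On the running example ($\epsilon = 1, 2, 3$ with survey prices $\{1,4\}$, $\{3,7\}$, $\{5,8\}$), the sketch reproduces the counts above: six, five, and six price points for the three models, duplicates being merged automatically.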
We have five price points for model $M_2$ because the intersection point of $l_{M_2}$ and $l_{(1, sp_1[2])}$, namely $(2, 8)$, is the same as the MC point $(k, sp_3[2]) = (2, 8)$ for $k = 2$. In Lines 12-15, we have, e.g., $f(2, 3) = 1$ for a survey point and $f(2, 2) = 0$ for an SC point.

Theorem. The complete solution space constructed by Algorithm 6 is sufficient for finding the optimal solution of the relaxed revenue maximization problem.

Proof. As discussed in constructing the complete solution space, each survey price point $(m, sp_m[j])$ can affect the revenue of model $M_m$, the unit price for models $M_k$, $k = m+1, ..., M$, and the price for models $M_k$, $k = 1, ..., m-1$. We prove that the SC and MC points are non-recursive, i.e., we do not need to generate new price points in the complete solution space based on the generated SC and MC points.

Given a survey price point $(m, sp_m[j])$ in model $M_m$, it determines an SC point in model $M_{m'}$, where $m' > m$, namely $(m', sp_m[j]/\epsilon^m \times \epsilon^{m'})$. We do not need to generate SC points based on $(m', sp_m[j]/\epsilon^m \times \epsilon^{m'})$ because it yields the same SC points for models $M_k$, $k = m'+1, ..., M$, as $(m, sp_m[j])$. We might use $(m', sp_m[j]/\epsilon^m \times \epsilon^{m'})$ to generate an MC point $(k, sp_m[j]/\epsilon^m \times \epsilon^{m'})$. If $k > m$, the new point $(k, sp_m[j]/\epsilon^m \times \epsilon^{m'})$ is not necessary because $sp_m[j]/\epsilon^m \times \epsilon^{m'}/\epsilon^k > sp_m[j]/\epsilon^m$, which violates the subadditivity constraint. If $k < m$, the new point $(k, sp_m[j]/\epsilon^m \times \epsilon^{m'})$ is also not necessary because $sp_m[j]/\epsilon^m \times \epsilon^{m'} > sp_m[j]$, which violates the subadditivity constraint.

Given a survey price point $(m, sp_m[j])$ in model $M_m$, it also determines an MC point in model $M_{m'}$, where $m' < m$, namely $(m', sp_m[j])$. We do not need to generate MC points based on $(m', sp_m[j])$ because those MC points are already determined by $(m, sp_m[j])$. It is also unnecessary to generate SC points based on $(m', sp_m[j])$ because if $(m', sp_m[j])$ and $(m, sp_m[j])$ are both chosen as optimal prices, all the optimal prices for models $M_k$, $k = m', ..., m, ..., M$, are determined, i.e., $p(\epsilon^{m'}) = ... = p(\epsilon^m) = ... = p(\epsilon^M)$.

A recursive solution.
We define the revenue of an optimal solution recursively in terms of the optimal solutions to subproblems. We pick as our subproblems the problems of determining the maximum revenue $OPT(m, j)$, where $OPT(m, j)$ denotes the maximum revenue when considering the first $m$ models and taking the $j$th lowest price point in the complete solution space of model $M_m$. For the full problem, the maximum revenue is $\max_j \{OPT(M, j)\}$ over all the price points $(M, p_M[j])$ in the complete solution space of model $M_M$. For the price points in the complete solution space of model $M_1$, we can directly compute $OPT(1, j)$ for all price points because there is no initial constraint. For the price points in the complete solution spaces of the other models, we need to consider both the monotonicity constraint and the subadditivity constraint. We have the following recurrence:
\[ OPT(m, j) = \max_{j'} \{OPT(m-1, j')\} + MR(m, j), \tag{24} \]
where $j'$ ranges over the price points with $p_{m-1}[j'] \leq p_m[j]$ and $p_{m-1}[j'] / \epsilon^{m-1} \geq p_m[j] / \epsilon^m$, and $MR(m, j)$ denotes the revenue from model $M_m$ if we price model $M_m$ at $p_m[j]$.

Computing the maximum revenue.
Now we can write a dynamic programming algorithm, Algorithm 7, based on recurrence (24), where $|p_m|$ is the number of price points for model $M_m$.

Theorem. Algorithm 7 finishes in $O(N^2 M^2)$ time.

Algorithm 7: Dynamic programming algorithm for finding an optimal solution of the relaxed revenue maximization problem.
input: Models with noise parameters $\epsilon^m$ and their corresponding price points in the complete solution space.
output: $OPT(RRM)$.
1 for each model $M_m$ do
2   sort the price points in the complete solution space in increasing order; use $(m, p_m[j])$ to denote the $j$th lowest price point;
3   $MR(m, |p_m|) = p_m[|p_m|] \cdot f(m, p_m[|p_m|])$;
4   for $j = |p_m| - 1$ down to $1$ do
5     $MR(m, j) = p_m[j] \sum_{k=j}^{|p_m|} f(m, p_m[k])$;
6 for $j = 1$ to $|p_1|$ do
7   $OPT(1, j) = MR(1, j)$;
8 for each model $M_m$, $m = 2, ..., M$ do
9   for each price point $(m, p_m[j])$ do
10    $OPT(m, j) = \max_{j'} \{OPT(m-1, j')\} + MR(m, j)$, where $p_{m-1}[j'] \leq p_m[j]$ and $p_{m-1}[j'] / \epsilon^{m-1} \geq p_m[j] / \epsilon^m$;
11    $p(L.OPT(m, j)) = $ the $p_{m-1}[j']$ that attains $OPT(m, j)$ in Line 10;
12 $OPT(RRM) = \max_j \{OPT(M, j)\}$, $j = 1, ..., |p_M|$;

Proof. For the $O(N)$ survey price points, we generate $O(NM)$ price points in the complete solution space. For each price point in the complete solution space, we need $O(NM)$ time to update $OPT(m, j)$. Therefore, Algorithm 7 requires $O(N^2 M^2)$ time.

Example. In model $M_1$ of Figure 2, we have $OPT(1, j)$ for $j = 1, 2, ..., 6$, shown in Table 2. For computing $OPT(2, 1)$, there is only one price point, $(1, 1)$, satisfying both the monotonicity constraint and the subadditivity constraint within model $M_1$. Therefore, we have $OPT(2, 1) = OPT(1, 1) + MR(2, 1) = 2 + 4 = 6$. Similarly, we can fill the entire table, shown in Table 2.

Constructing an optimal solution. Although Algorithm 7 determines the maximum revenue of
RRM, it does not directly show the optimal price for each model, $p(\epsilon^m)$. However, for each price point $(m, p_m[j])$ in the complete solution space, we record in Line 11 of Algorithm 7 the price point $p(L.OPT(m, j))$ in model $M_{m-1}$ that has the maximum revenue among the price points satisfying both the monotonicity constraint and the subadditivity constraint with $(m, p_m[j])$. Therefore, we can recursively backtrack from the optimal price point in model $M_m$ to the optimal price point in model $M_{m-1}$. We need $O(NM)$ time to find the maximum value in $OPT(M, j)$ and $O(M)$ time to backtrack. Therefore, we can construct an optimal solution in $O(NM)$ time. We note that such a solution may be one of several solutions that achieve the optimal value.

We show a running example in Table 2. We first obtain $OPT(3, 3) = 19 = \max_j \{OPT(3, j)\}$ for $j = 1, 2, ..., 6$ and set $p(\epsilon^3) = 5$. We backtrack to $OPT(2, 3)$ in model $M_2$ and set $p(\epsilon^2) = 5$. And then we backtrack to $OPT(1, 3)$ in model $M_1$ and set $p(\epsilon^1) = 4$. Finally, an optimal pricing setting is $\langle p(\epsilon^1), p(\epsilon^2), p(\epsilon^3) \rangle = \langle 4, 5, 5 \rangle$. We note that in our running example, the pricing setting $\langle 4, 7, 8 \rangle$ also has the maximum revenue of 19.

Table 2: Example for constructing an optimal solution.
Model | $OPT(m, j)$ for $j = 1, 2, ...$
$M_1$ | 2, 3, 4, 0, 0, 0
$M_2$ | 6, 9, 9, 11, 4
$M_3$ | 15, 18, 19, 19, 11, 4

Assumption. The broker is honest but curious.

Remark. In practice, the broker may be semi-honest or even malicious. For these cases, we can take advantage of local differential privacy [15], in which the data owners and the model buyers add DP noise by themselves before sending their data to the broker. Encryption-based techniques can also be incorporated into the market design.

We summarize the differentially private data marketplace with model-based pricing dynamics from the broker's perspective, which is an end-to-end data marketplace with practical considerations and consists of computationally efficient component algorithms. The detailed procedure is shown in Algorithm 8, which integrates all the algorithms proposed in the previous sections. We assume that the broker can set appropriate parameters $M$, $\epsilon^m$, and $MB^m$ based on her market experience, which is reasonable. For example, Microsoft can easily determine the different features assigned to the Windows 10 Home version and the Windows 10 Pro version. Finally, although instantiated with the DP market, we stress that Algorithm 8 can be applied to the general setting by switching the DP parameter to a risk factor.

Algorithm 8:
The complete broker functioning in the DP-Dealer pipeline.
1 collect data and usage restrictions among $n$ data owners: collect dataset $D = \{z_i\}$ along with DP restriction parameters $\epsilon_i$ and extra compensation functions $ec_i$ for $i = 1, 2, ..., n$;
2 decide a set of $M$ models to train, with model privacy parameter $\epsilon^m$ and manufacturing budget $MB^m$ for $m = 1, 2, ..., M$;
%% Model training and releasing
3 for $m = 1$ to $M$ do
4   data valuation: call Algorithm 1 to compute the approximate Shapley values $SV_i^m$ for $i = 1, 2, ..., n$;
5   base compensation: compute $bc_i^m = \frac{SV_i^m}{\sum_{i=1}^{n} SV_i^m} MB^m$;
6   data selection: call Algorithm 3, 4, or 5 to select a training subset $S^m$ within manufacturing budget $MB^m$ that maximizes $SV(S^m)$;
7   model training: train the model on subset $S^m$ by Algorithm 2;
8   model releasing: release model $M_m$, its pricing $p(\epsilon^m)$, and the estimated excess population loss $\Delta L(\mathcal{A}^m; \epsilon^m, S^m)$;
9 perform a market survey among $K'$ sampled model buyers (survey participants): collect the surveyed model demands $tm_k$ and valuations $v_k$ for $k = 1, 2, ..., K'$;
%% Model pricing
10 model pricing: call Algorithms 6 and 7 to compute the optimal price $p(\epsilon^m)$ of model $M_m$ for $m = 1, 2, ..., M$;
%% Compensation allocation
11 for $m = 1$ to $M$ do
12   compute $bc_i$ and $ec_i$ of each data owner $O_i$, $i \in S^m$, by proportionally dividing $\frac{p(\epsilon^m)}{\sum_{m=1}^{M} p(\epsilon^m)} OPT(RRM)$, and allocate the corresponding compensation to $O_i$;

6. EXPERIMENTS

In this section, we present experimental studies validating that: 1) our proposed mechanisms for compensation allocation are efficient and effective; 2) our proposed mechanisms for pricing models generate more revenue for the data owners and the broker; 3) our carefully designed dynamic programming algorithms for pricing models significantly outperform the baseline algorithms.

We ran experiments on a machine with an Intel Core i7-8700K, two NVIDIA GeForce GTX 1080 Ti GPUs, and 64GB of memory, running Ubuntu.
We employed an SVM classifier as our model and used both synthetic datasets and the real Breast Cancer dataset [14] in our experiments. We implemented the following algorithms in Matlab 2018a.
• Greedy: The greedy algorithm for compensation allocation in Algorithm 4.
• PPDP: The pseudo-polynomial dynamic programming algorithm for compensation allocation in Algorithm 3.
• GuessGreedy: The guess and greedy algorithm for compensation allocation in Algorithm 5.
• Dealer: The optimal prices computed by the dynamic programming algorithm in Algorithm 7 on the survey price space.
• Dealer+: The optimal prices computed by the dynamic programming algorithm in Algorithm 7 on the complete solution space.
• Linear: We take the lowest survey price from model $M_1$ and the highest survey price from model $M_M$ and use linear interpolation for the remaining models $M_2, ..., M_{M-1}$ based on the two end prices.
• Low: We set the lowest price among all survey prices for all models.
• Median: We set the median price among all survey prices for all models.
• High: We set the highest price among all survey prices for all models.
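The four pricing heuristics at the end of the list are straightforward; the sketch below renders them as we read the descriptions (0-indexed models, illustrative survey prices), and is not taken from the authors' Matlab code.

```python
import statistics

def baseline_prices(survey, M):
    """Sketch of the Linear / Low / Median / High baselines listed above.
    survey[m] holds the survey prices of model m (0-indexed)."""
    flat = sorted(p for prices in survey for p in prices)
    low, high, med = flat[0], flat[-1], statistics.median(flat)
    lo_end = min(survey[0])    # lowest survey price of the first model
    hi_end = max(survey[-1])   # highest survey price of the last model
    linear = [lo_end + (hi_end - lo_end) * m / (M - 1) for m in range(M)]
    return {"Low": [low] * M, "Median": [med] * M,
            "High": [high] * M, "Linear": linear}
```

None of these baselines adapts the per-model price to the per-model demand, which is why the dynamic programming algorithms can outperform them in revenue.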
Figures 3(a) and 3(b) show the compensation allocation time cost and accuracy of Greedy, PPDP, and GuessGreedy for varying numbers of data owners, respectively. Because PPDP is significantly affected by the budget and cannot handle a very large budget, we set all budgets to 10000 in these experiments for a fair comparison. Figure 3(a) shows the time cost for varying numbers of data owners. Greedy significantly outperforms both PPDP and GuessGreedy due to its simplicity. GuessGreedy has the highest time cost because it needs to enumerate $\binom{n}{v}$ subsets, where $n$ is the total number of data owners and $v$ is the size of the sampled subsets during enumeration. In our experiments, the time cost of GuessGreedy is prohibitively high even when we set $v = 2$. We skip some results of GuessGreedy and PPDP in the figures due to their prohibitively high time cost.

Figure 3(b) shows the accuracy of the algorithms on five models. We employ the three algorithms Greedy, PPDP, and GuessGreedy to choose subsets for each of five manufacturing budgets $MB^m$ set to increasing multiples of $\sum_{i=1}^{n} SV_i$, the largest being $2.0 \sum_{i=1}^{n} SV_i$. We also use ALL as a baseline, which includes all the patients. We add differential privacy with parameters $\epsilon = 0.01, 0.1, 1, 5$, and $10$ in the training process (Algorithm 2) for the five models, respectively. We can see that although the subsets selected by Greedy, PPDP, and GuessGreedy contain fewer patients than ALL, their accuracy is higher for larger $\epsilon$, which verifies the effectiveness of the Shapley value. For example, for $\epsilon = 10$, there are only 337 patients in the subset selected by Greedy, but the accuracy on that subset is higher than on the entire dataset ALL. For smaller $\epsilon$, the accuracy on ALL is higher. The reason is that for smaller $\epsilon$, with the smaller budget, we obtain a smaller subset. For example, for $\epsilon = 0.01$, there are only 259 patients in the subset selected by Greedy. Comparing the algorithms, the accuracy of the subsets selected by PPDP is only slightly higher than that of Greedy. Therefore, we can employ Greedy in most cases.
Figure 3: Compensation allocation. (a) Time cost; (b) accuracy.

We experimentally study the revenue gain of our proposed algorithms on differently distributed datasets. We generate two datasets, each with 100 survey price points (i.e., collected from 100 potential model buyers). The number of survey price points on each model follows an independent random distribution and a Gaussian random distribution, respectively. For the first model of both datasets, we generate the survey price points following an independent distribution over a range starting from 1000. Figures 4(a)(b)(c)(d) show the survey price distribution, the price settings, the affordability ratio (the fraction of the model buyers that can afford to buy a model), and the revenue on the independent randomly distributed dataset, respectively. Figure 4(b) shows that Dealer+, Dealer, and Linear have similar price setting distributions. All models in Dealer have different prices. In the price settings of Dealer+, the first and second models have the same price, as do the fifth and sixth models and the eighth and ninth models, which maximizes the revenue compared to Dealer and verifies the effectiveness of our complete solution space construction. Figure 4(c) shows that Dealer+ has the highest affordability ratio except for Low. For the most critical metric, revenue, Dealer+ outperforms the other algorithms by at least 10%, which verifies the gain of our complete solution space construction.

In practical applications, it is more likely that the survey price point datasets follow a Gaussian distribution rather than an independent random distribution. Figures 5(a)(b)(c)(d) show that all algorithms perform similarly on the Gaussian distributed dataset as on the independently distributed dataset.

We experimentally study the efficiency of our proposed algorithms for pricing models. Because Linear, Low, Median, and High
Dealer+ Dealer Linear Low Median High algorithm r e v enue (c) Ratio Dealer+ Dealer Linear Low Median High algorithm r e v enue (d) RevenueFigure 4: Independent distribution. model p r i c e (a) Data distribution model p r i c e Dealer+DealerLinearLowMedianHigh (b) Price
Dealer+ Dealer Linear Low Median High algorithm r e v enue (c) Ratio Dealer+ Dealer Linear Low Median High algorithm r e v enue (d) RevenueFigure 5: Gaussian distribution.algorithms only need to scan through the survey price points once,the time cost is low. For the ease of presentation, we omit the exper-imental results for those four algorithms. Instead, we compare ourproposed Dealer and Dealer + with the classic exhaustion based ap-proach. We first apply exhaustion-based approach to our completesolution space named Base. However, the time cost of most of theexperiments is prohibitively high. Therefore, we apply exhaustion-based approach to the survey price space named BaseAppr.
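The single-pass behavior of these baseline pricers can be sketched as follows. For a single model, the revenue at price p is p times the number of surveyed buyers whose reported price point is at least p, and a revenue-maximizing price can always be found among the survey points themselves. The function name below is hypothetical and the sketch is illustrative, not any of the paper's exact algorithms:

```python
def best_single_model_price(survey_points):
    """Return (price, revenue) maximizing p * |{s in survey_points : s >= p}|.

    An optimal price is always one of the survey points: raising p to the
    next survey point above it loses no buyers while increasing the
    per-buyer revenue.
    """
    pts = sorted(survey_points, reverse=True)
    best_price, best_revenue = 0.0, 0.0
    for i, p in enumerate(pts):  # exactly i + 1 buyers can afford price p
        revenue = p * (i + 1)
        if revenue > best_revenue:
            best_price, best_revenue = p, revenue
    return best_price, best_revenue
```

For example, with survey points [10, 20, 30, 40, 100], the candidate revenues are 100, 80, 90, 80, and 50, so the single-model optimum is price 100 with revenue 100.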
Figure 6: Time cost.

Figure 6 shows the time cost of Dealer, Dealer+, Base, and BaseAppr on varying numbers of survey price points. Both Dealer+ and Dealer scale linearly with the number of survey price points, which verifies the efficiency of our proposed dynamic programming algorithm. In practical applications, 12,500 survey points suffice for most surveys, and the optimal Dealer+ requires only dozens of seconds on a PC. If time cost is critical, we can instead employ Dealer, which searches the original survey price space with a slight tradeoff in optimal revenue. The time cost of both Base and BaseAppr is prohibitively high due to the large number of price combinations across different models.
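For intuition on why a dynamic program over candidate prices runs in time linear in the number of survey points, consider the following simplified sketch. It assumes models are ordered from least to most useful (e.g., by decreasing privacy noise), reduces the arbitrage-free constraint to requiring non-decreasing prices along that order, and restricts candidate prices to a sorted grid such as the survey price points. This is an illustrative stand-in, not the paper's exact complete-solution-space construction:

```python
def max_revenue_monotone(survey, grid):
    """Simplified monotone-price dynamic program.

    survey : list of survey-price-point lists, one per model, with models
             ordered from least to most useful
    grid   : candidate prices, sorted ascending (e.g., the survey points)
    Enforces price(model i) <= price(model i+1), a simplified stand-in
    for the arbitrage-free constraint.
    """
    def revenue(points, p):
        # revenue at price p: p times the buyers who can afford it
        return p * sum(1 for s in points if s >= p)

    prev = [0.0] * len(grid)          # best total revenue through model i-1
    for points in survey:
        run = float("-inf")           # running prefix max over prices <= p
        cur = []
        for j, p in enumerate(grid):
            run = max(run, prev[j])
            cur.append(revenue(points, p) + run)
        prev = cur
    return max(prev)
```

Because the prefix maximum is carried along in one pass, each model costs one sweep over the grid, so the table filling stays linear in the number of candidate prices per model, matching the linear growth observed in Figure 6.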
7. CONCLUSION AND FUTURE WORK
In this paper, we proposed the first end-to-end data marketplace with model-based pricing framework towards answering the question: how can the broker assign value to the data owners based on their contribution to the models to incentivize more data contributions, and determine optimal prices for a series of models for various model buyers to maximize the revenue with an arbitrage-free guarantee. For the former, we introduced a Shapley value-based mechanism to quantify each data owner's value towards all the models trained from the contributed data, while giving data owners the ability to control their data usage. For the latter, we designed a pricing mechanism based on the models' privacy parameters to maximize the revenue. We proposed Gen-Dealer to model the end-to-end data marketplace with model-based pricing and illustrated a concrete realization of a differentially private data marketplace with model-based pricing, DP-Dealer, which provably satisfies the desired formal properties. Extensive experiments verified that DP-Dealer is efficient.

There are several exciting directions for future work. First, multiple brokers may co-exist in practical applications, forming a competitive relationship as each seeks to maximize its own revenue. Second, multiple risk factors forming a risk vector can be considered, enabling the market to accommodate different types of demands from the data owners. Third, personalized model manufacturing can be considered, which tailors model training to each model buyer to best suit their budget and model usage scenario.
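As a pointer on the compensation side, the Shapley value averages a data owner's marginal contribution over all orderings of owners; exact computation is exponential in the number of owners, so implementations typically resort to Monte Carlo permutation sampling. The sketch below assumes a black-box utility function (e.g., validation accuracy of a model trained on a coalition's data) and is illustrative rather than the paper's compensation algorithm:

```python
import random

def shapley_monte_carlo(owners, utility, num_permutations=1000, seed=0):
    """Estimate Shapley values by sampling permutations of data owners.

    owners  : list of owner identifiers
    utility : assumed black box mapping a frozenset of owners to the
              utility of a model trained on their combined data
    """
    rng = random.Random(seed)
    value = {o: 0.0 for o in owners}
    for _ in range(num_permutations):
        perm = owners[:]
        rng.shuffle(perm)
        coalition, prev_u = set(), utility(frozenset())
        for o in perm:
            coalition.add(o)
            u = utility(frozenset(coalition))
            value[o] += u - prev_u   # marginal contribution of o in this order
            prev_u = u
    return {o: v / num_permutations for o, v in value.items()}
```

For an additive utility (each owner contributes a fixed amount regardless of the coalition), every sampled permutation yields the same marginal contributions, so the estimate recovers each owner's contribution exactly.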