A Survey on Data Pricing: from Economics to Data Science

Abstract

Data are invaluable. How can we assess the value of data objectively, systematically and quantitatively? Pricing data, or information goods in general, has been studied and practiced in dispersed areas and principles, such as economics, marketing, electronic commerce, data management, data mining and machine learning. In this article, we present a unified, interdisciplinary and comprehensive overview of this important direction. We examine various motivations behind data pricing, understand the economics of data pricing and review the development and evolution of pricing models according to a series of fundamental principles. We discuss both digital products and data products. We also consider a series of challenges and directions for future work.


Jian Pei, E-mail: [email protected] 1, 2020


In this digital economics era, data are well recognized as an essential resource for work and life. Many products and services are delivered purely in digital forms. Many big data applications are built on the second use or reuse of data [196], that is, the same data are customized and reused by many applications for different purposes. The extensive sharing and reuse of data have profound implications for the economy. For example, digital maps are often produced for traffic and directions as the immediate usage. However, Nagaraj [153] finds that mining activities were strongly benefited by open maps or maps sponsored by governments, particularly for smaller firms with fewer resources. Universal availability of data often helps minority parties and emerging initiatives.

In business and economic activities where data are shared, exchanged and reused, it is essential to measure the value of data properly. While there exist many possible ways to appreciate and represent the value of data, a general approach that scales to massive applications and is acceptable to many parties is to set a price at which data can be sold or purchased, that is, data pricing. The importance of pricing in business is well recognized in financial modeling [120], with price being one of the four Ps of the marketing mix. (The four Ps are product, price, place and promotion [120].)

Pricing data is far from trivial. Data have many different aspects. Consequently, the term "price of data" may carry different meanings and refer to different properties of data. To illustrate the complexity, let us quickly consider the following three scenarios involving price information related to data.

• Data transmission. Imagine the scenario where a mobile service provider offers a smartphone user the price of its data package. Here, the price is quoted for the data transmission service and is decided by several factors, such as the amount of data the user wants to transmit in a month, the location (roaming or not, for example), and the transmission speed. The price does not include and is independent of the content, that is, what the data are about, such as data quality, and how the data are collected, stored or processed.

• Digital products. Imagine that a person wants to watch a movie at home. This is a purchase of data, since the movie is sent to the customer's home as a stream of bits. The price here typically is related to the content, but is independent of the data transmission service, that is, how the data are transmitted to the user's home.

• Data products. Many logistics companies want to pay for weather information to support their business operations. While historical data are relevant, more often than not those companies want to subscribe to weather forecasting information instead. Some companies may want [...]

[...] digital products refer to those intangible goods that can be consumed through electronics, such as e-books, downloadable music, online ads, and internet coupons. Many digital products have physical correspondences in one way or another, though this is not absolutely necessary.

Data products refer to data sets as products and information services derived from data sets. We build the linkage between these two categories by pointing out that many ideas and methods for pricing digital products can be generalized and applied to pricing data products. In some scenarios, the boundary between digital products and data products is also blurry. Hereafter, we use the term information goods to refer to both digital products and data products.

The research into data pricing happens simultaneously in multiple domains, including but not limited to economics, marketing, e-commerce, databases and data management, operational research, management science, machine learning and AI. However, to the best of our knowledge, there exists very limited effort to provide an interdisciplinary survey of the related work. This article presents our endeavor to produce a comprehensive picture.

There are some previous surveys related to data pricing. For example, Liang et al. [136] survey the life cycle of big data and review 11 data pricing models. They also discuss data trading and protection. Fricker and Maksimov [75] report a literature survey over 18 research articles regarding several research questions, including maturity of the pricing models. Very recently, Zhang and Beltrán [220] review the state-of-the-art data pricing methods. They categorize data pricing methods according to two important data properties, granularity and privacy. This article covers a substantially broader scope than those [75, 136, 220]. We connect economics, digital product pricing and data product pricing. We also discuss a series of desirable properties in data pricing, including arbitrage-freeness, revenue maximization, fairness, truthfulness, and privacy preservation, and review the techniques achieving those properties.

Data pricing is related to cloud pricing, since a lot of data for pricing and trading are hosted on the cloud. Wu et al. [208] present a comprehensive survey on cloud pricing models. They systematically categorize three fundamental pricing strategies, namely value-based pricing, cost-based pricing and market-based pricing. Then, they further categorize nine pricing tactical objects. Specifically, value-based pricing is demand driven and consists of customer value-based pricing, experience-based pricing, and service-based pricing. Cost-based pricing is supply driven and consists of expenditure-based pricing, resource-based pricing and utility-based pricing. Market-based pricing is an equilibrium of supply and demand and consists of free and pay later pricing, retail-based pricing, and auction and online pricing. They cover in total 60 pricing models. While data and cloud are highly related, data pricing and cloud pricing are fundamentally different. Data pricing is selling data, while cloud pricing is selling cloud resources (e.g., storage and computation), including physical resources, virtual resources and stateless resources.

In addition, Sen et al. [181] survey the major broadband pricing proposals, including the realizations in various consumer data plans around the world. Murthy et al. [150] list different pricing models and pricing schemes used by some popular IaaS (infrastructure-as-a-service) providers. The authors of [211] propose pricing as a service, which is essentially a personalized pricing service for IaaS. Aazam and Huh [1] propose broker as a service, which matches cloud services among cloud service providers and users. The key idea is to predict resource demands and thus derive prices.

As data are often hosted online, one interesting question is the fair sharing of the cost among data owners, data users and brokers. This is related to data pricing, because the costs of data hosting and processing have to be recovered from data pricing. Kantere et al. [116] study the fair allocation of costs in query services. They develop a stochastic model, which predicts the extent of cost amortization in time and number of services based on query traffic statistics. The model can be implemented on top of a cloud DBMS. Al-Kiswany et al. [12] provide a cost assessment tool to evaluate the cost of a desired data sharing. One useful feature of the tool is that a user can explore the cost space of alternative configurations using various factors, such as quality, staleness, and accuracy. The technique is based on what-if analysis.

We take a multi-disciplinary approach in this survey. The rest of the article is organized as follows.

In Section 2, we start from economics and focus on two aspects. First, we discuss cost reduction in information goods that contributes to their prices and has impact on economics. Then, we discuss the differences between digital products and data products.

In Section 3, we discuss the fundamental principles of data pricing. We first present versioning as a general framework for pricing information goods. Then, we identify several desirable properties in data pricing, including truthfulness, fairness, revenue maximization, arbitrage-freeness, privacy preservation and computational efficiency.

In Section 4, we discuss pricing digital products. We first review the three major streams of revenues for digital products. Then, we revisit the bundling and subscription planning pricing models. Last, we consider auctions, which are widely used in pricing digital products.

In Section 5, we discuss pricing data products. We first overview the structures, players, and ways to produce data products in data marketplaces. Then, we examine several important areas in pricing data products, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing, and privacy preservation in pricing. We also discuss dynamic data pricing, online pricing, and pricing in federated and collaborative learning.

Last, in Section 6, we discuss challenges and future directions.

In general, pricing is the practice in which a business sets a price at which a product or a service can be sold. Pricing is often part of the marketing plan of a business. To set prices, a business often considers a series of objectives, such as profitability, fitness in marketplace, market positioning, price consistency across categories and products, and meeting or preventing competition. Some major pricing strategies in literature [38, 58, 108, 155, 159] include operation-oriented pricing, revenue-oriented pricing, customer-oriented pricing, value-oriented pricing, and relationship-oriented pricing. There is a rich body of studies in economics and marketing research on pricing tactics, which are far beyond the scope and capacity of this survey.

In this section, to understand the economic factors specific to data pricing, we examine the cost reduction in information goods. Then, we inspect the differences between digital products and data as products.

"Technology changes. Economic laws do not." [182] The production, distribution, and consumption of information goods, compared with those of physical products in the long history of human economies, are distinguished by significant cost reductions in five aspects, namely search costs, production costs, replication costs, transportation costs, and tracking and verification costs. Essentially, digital and data economics investigates how standard economic models adjust when those costs are reduced dramatically. Goldfarb and Tucker [93] present a thorough discussion, whose framework is largely followed here.

2.1.1 Search Costs

"Search costs are the costs of looking for information" [182], which are incurred in any information collection activities. Information goods allow more effective and efficient online search. The consequent low search costs facilitate users' discovering digital products and data sets, as well as comparing prices of similar products and services. For example, Brynjolfsson and Smith [40] show that online prices of books and CDs are clearly lower than offline prices, though the price dispersion does not shrink accordingly.

Low search costs facilitate the sales of rare and long tail products [15, 214]. Thus, more variety is often observed in information goods and services. The degree of variety may be heavily impacted by recommender systems. Specific to consumption of media, one of the major categories of digital products, Gentzkow and Shapiro [82] show that online media consumption is more diverse than offline. At the same time, customers may tend to consume more content that aligns more or less with their viewpoints, which is called the "echo chamber" effect [188].

Low search costs give strong rise to the prevalent platform businesses, which provide extensive matching services to customers and improve trade efficiency [115]. Interoperability, compatibility and standards are strategic tools for both building platforms and running platform businesses [99].

Producing digital products, such as online courses, eBooks, software, graphics and digital arts, and photography, is very different from manufacturing physical products, like bread, shoes, and jackets. Moreover, collecting and processing massive data so that parts of data can be sold and can meet customers' needs is also different from traditional production. A wide spectrum of production costs in traditional products are substantially reduced in information goods.

First, some essential major costs in traditional production, such as materials, semi-finished products and their transportation, are dramatically reduced in producing information goods. In many cases, the costs of obtaining, producing and transporting raw materials and physical semi-finished products can be reduced to very low or can even approach zero in making information goods. Second, a substantial cost of a traditional physical product often belongs to the product itself and cannot be further reduced through sharing. The unit costs of information goods can approach zero through sharing as long as there are sufficient reuses and sales volume. Last, smart manufacturing and customer-to-manufacturing can reduce the supply chain costs in traditional physical production [88, 187]. Information goods often can reduce the costs of customization to the extreme.

The substantial reduction in production cost in materials, semi-finished products, customization and sharing gives rise to a series of innovative business models, such as economics of sharing, pay-as-you-go and query-based data consumption. This also encourages innovation and long tail products that address diverse and smaller groups of potential customers.
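The claim that unit costs approach zero through sharing can be illustrated with simple arithmetic. The following is a minimal Python sketch, assuming a one-off production cost spread evenly over all units sold plus a constant per-unit cost. The function name `unit_cost` and all numbers are hypothetical and are not taken from the cited studies.

```python
def unit_cost(fixed_cost: float, marginal_cost: float, units_sold: int) -> float:
    """Average cost per unit: the fixed cost shared by all units plus the per-unit cost."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return fixed_cost / units_sold + marginal_cost


if __name__ == "__main__":
    # Hypothetical numbers: both goods cost 100,000 to develop.
    # The physical good costs 20 per unit to make; copying the information good costs 0.
    for n in (1_000, 100_000, 10_000_000):
        physical = unit_cost(100_000.0, 20.0, n)
        information = unit_cost(100_000.0, 0.0, n)
        print(f"{n:>10} units: physical {physical:8.2f}, information {information:8.4f}")
```

Under these assumptions, the unit cost of the physical good can never fall below its per-unit cost of 20, while the unit cost of the information good keeps shrinking as the sales volume grows.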

One distinct feature of information goods versus traditional products is that information goods are non-rival. That is, one customer consuming an information good does not reduce the amount or quality of the product available to other customers. The zero marginal costs and the non-rival property of information goods empower innovative opportunities and bring in new challenges.

In order to structure pricing of a large variety of non-rival information goods with zero marginal costs, bundling is often used [182], that is, multiple products are sold together at a single price. Since a large number of information goods can be bundled together without a substantial increase in cost, economically it may be optimal to bundle thousands of digital products together to meet diverse and independent customer preferences [10, 25, 26].

Due to the zero marginal costs and the non-rival property, many information goods are made publicly available, such as Wikipedia and open source software [131]. People contribute to open source or publicly available digital products and data to demonstrate their professional skills to potential employers. Companies support those products to complement their sales of other products.

The zero marginal costs and the non-rival property pose challenges to copyright policies and enforcement. Waldfogel [203] shows that low replication costs, though they may reduce revenue, help supplies and demands, and [...]

Thanks to the Internet, the costs of transporting information goods approach zero. This may imply, in many scenarios, that local communities may not affect adoptions and consumptions of information goods, often known as the effect of flat world [76]. Interestingly, this is not true all the time, as some studies demonstrate that tastes may still be local in music [73] and content consumption [81].

While the cost of physical transportation may approach zero, regulation may put sophisticated constraints on locations. For example, when Wikipedia was blocked in China in October 2005, more contributors from outside China were motivated to contribute [221]. Copyright policies may also affect the availability and consumption of information goods in different regions, such as news media [46], and thus may be reflected by price.

The capability of tracking users with relatively low costs is an important feature of information goods [182]. The low tracking costs give rise to extensive personalized markets and possible price discrimination [77, 165]. Behavioral price discrimination is an immediate type, which sets prices according to customers' previous behavior. Correspondingly, if customers are well aware of the benefits of tracking information to a monopoly, they may likely choose to be privacy sensitive and withhold the information [193]. Another type of price discrimination is versioning [183], which sells information at different prices to different customers using different versions. Versioning is discussed in detail in Section 3.1.

The advantage of low tracking costs also leads to the blooming businesses of personalized advertising [69]. A challenge for a company, however, is how to set prices for many advertisements that may be shown to a massive number of customers. The same advertisement may have different prices for different customers. Auctions are often used to address the challenge [19], and can even be used to discover prices for information goods [164]. At the same time, auctions may be less useful when online marketplaces become mature [66].

The low tracking costs and the consequences, such as price discrimination, lead to serious concerns on privacy [4]. As discussed later in this article, whether privacy should be treated as a good and how privacy is priced have been investigated [74, 163]. Moreover, privacy regulation and the impact on welfare are important topics, though they are far beyond the scope of this survey.

As a byproduct of low tracking costs, the costs of verifying identity and reputation of producers and users of information goods are dramatically lower than those in traditional scenarios. The low verification costs facilitate online transactions extensively and lower the costs of trust dramatically.

This survey focuses on pricing two categories of information goods, digital products and data products. While digital products and data products share a series of common ideas and methods in pricing, they are also essentially different from each other in at least four aspects.

First, the units of digital products are often well defined and fixed. For example, individual movies and music tracks are often priced and sold as a whole. The consumption of one digital product is often independent of that of another. For example, it would be rare that two digital books have to be read at the same time. In contrast, although the basic unit in a data set can be at a very small granularity, such as a record in a relational table, the units for pricing and consumption often vary from one customer to another. For example, a customer may be interested in the sales data of female customers in a province, while another customer may be interested in the sales data on electronics during the Christmas season. Correspondingly, one individual unit of data at the lowest granularity may not be valuable as a data product. For example, one customer purchase record, after proper anonymization, may not be useful for a retailer. Instead, more often than not, many basic units of data are combined, aggregated and consumed together.

Second, different from digital products, data sets as data products have very strong and flexible aggregatability. Customers often aggregate data using various dimensions. The aggregatability, on the one hand, enables many opportunities for innovations in data business, and, on the other hand, poses many technical and business challenges, such as ensuring arbitrage-freeness, as discussed later in this article. In many business scenarios, digital products like movies and music are bundled. However, bundles are not aggregates. Customers still get digital products and consume them individually. Bundling takes advantage of the low replication costs of digital products to boost sales and meet customers' diverse demands [10, 25, 26].

Third, the means of consuming digital products and data products are also very different. Typically digital products are consumed directly by people, such as movies watched by people and music enjoyed by fans. Data sets are more often than not consumed by computers. They are, for example, analyzed, summarized or used to train machine learning models. The outputs of models are used to automate operations or support human decision making.

Last, digital products and data products are dramatically different in the ways they are reused and resold. Digital products are easy to be consumed by others, that is, to be reused, or even to be resold to others as a whole. Data sets, on the contrary, can be reused by others in different ways, such as aggregation in different dimensions and analysis for different purposes. Moreover, data can be easily processed and transformed so that they can be resold in a hard-to-detect manner.

The above differences between digital products and data products lead to different considerations in pricing principles and methods, which are discussed later. Before we leave this topic, we want to point out that the same information can be regarded as digital products in some situations and as data products in some other situations. For example, social media content like tweets and customer reviews can be regarded as digital products when a customer reads them online. At the same time, they can be collected and processed in batch by analytic tools to detect events, discover customer profiles and feed recommender systems. In this situation, a systematic collection of social media content can be priced and sold as a data product.

In summary, information goods, including digital products and data products, distinguish themselves from the traditional physical products in significant cost reductions, particularly in search costs, production costs, replication costs, transportation costs, and tracking and verification costs. The significant reduction of costs has profound impact on pricing information goods, which is discussed in the later sections of this article. There are several major differences between digital products and data products, including consumption units, aggregatability, means of consumption, and reusing and reselling.

In this section, we first review the idea of versioning [182, 183], which is a fundamental framework of designing information goods and pricing them. Then, we review several important properties in cost models of digital and data products.

As the replication costs of information goods are very low, even approaching zero in many cases, the price of an information good tends to be very low in marketplaces, too. The potential of very low prices of information goods, on the one hand, makes information goods economically appealing, and, on the other hand, may also make information goods economically dangerous, as competitors may easily enter the market [182, 183]. This dilemma keeps many traditional pricing strategies far away from being effective for information goods.

To tackle the dilemma, the core idea is "linking price to value", that is, setting the price to reflect the value that a customer places on the information. Specifically, the versioning strategy [183] makes different versions to appeal to different types of customers. For example, for a piece of software, different versions have different subsets of features. Different versions of a movie may provide different image resolutions and sound effects. Essentially, versioning divides customers into subgroups so that each subgroup may regard some features as highly valuable and some other features as of little value. A version corresponding to the demands of each subgroup can be provided.

There are many different ways to produce different versions of information goods. For example, as information is often time sensitive, delay is often a good basis. In stock market information services, an expensive version may deliver real time quotes while a basic version delivers the same information 20 minutes later. In addition, versions may be defined by convenience (e.g., data can be accessed only by PDF file or by downloadable spreadsheet), comprehensiveness (e.g., the length of historical data available), manipulation (e.g., whether users can store, duplicate, or print the information), community (e.g., availability of posting and reading discussion boards), annoyance (e.g., the option of no advertisements), the means of customer support (e.g., by website only or by talking to experts), and many other factors. Most versions of information goods are created by subtracting value from the most technologically advanced and complete version.

In many situations where customers may not realize the value of an information good unless they try it, even free versions may be provided. The rationale is that the free versions give potential customers opportunities to test the product. The objectives of offering free versions include building awareness, gaining follow-on sales, creating a customer network, attracting attention, and gaining competitive advantages.

The number of versions of an information good may be decided by two major considerations. First, the characteristics of the information to be sold are important. An information good that can be used in many different ways opens the door to many different versions. The second important factor is the value that different customers may place on it. The larger the variance, the more versions may be needed.

The versioning strategy has been investigated in pricing data products, for example, relational data sets and query results [27, 28]. Relational views provide a natural and flexible technical means to produce versions of an information source. A series of technical challenges are identified, such as arbitrage in pricing, fine-grained data pricing, pricing updates, integrated data and competing data sources, which are reviewed further in this article.

3.2 Important Desiderata in Data Pricing

There are many different ways to design and implement pricing models for information goods, but a small number of desiderata are pursued by most models. How to implement those desiderata in pricing models is discussed in the later sections.

To make a market efficient, the market is preferred to be truthful. A market is truthful if every buyer is selfish and only offers the price that maximizes the buyer's true utility value. In other words, in a truthful market, no buyer pays more than is sufficient to purchase a product. Here, different buyers may have different utility values on the same product. Truthfulness can facilitate a wide spectrum of pricing mechanisms, such as many kinds of auctions [7]. Auctions of digital products are discussed in Section 4.3.
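As one concrete illustration of truthfulness, consider the sealed-bid second-price (Vickrey) auction, a standard textbook mechanism in which bidding one's true value is a weakly dominant strategy. It is used here only as an example and is not claimed to be the mechanism studied in [7]. In the minimal Python sketch below, the function names, the bidder names and all values are hypothetical.

```python
from typing import Dict, Tuple


def second_price_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Highest bidder wins and pays the second-highest bid."""
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


def buyer_utility(true_value: float, my_bid: float, rival_bids: Dict[str, float]) -> float:
    """Utility of the buyer "me": true value minus payment if winning, otherwise 0."""
    bids = dict(rival_bids)
    bids["me"] = my_bid
    winner, price = second_price_auction(bids)
    return true_value - price if winner == "me" else 0.0


if __name__ == "__main__":
    true_value = 10.0
    for rivals in ({"a": 7.0, "b": 4.0}, {"a": 12.0}):
        truthful = buyer_utility(true_value, true_value, rivals)
        # No alternative bid on this grid does better than bidding the true value.
        best_other = max(buyer_utility(true_value, float(b), rivals) for b in range(0, 21))
        print(rivals, "truthful:", truthful, "best alternative:", best_other)
```

In this sketch, underbidding can only lose an auction that would have been profitable, and overbidding can only win an auction at a price above the buyer's value, so misreporting never helps the buyer.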

Pricing models can optimize different objectives, such as lowest cost, highest profit, and largest sales. The objective of maximizing revenue is often of special interest in designing pricing strategies. The rationale is that, for a business to be successful in the long term, a more immediate and important requirement is to win over as many customers as possible.

For traditional physical products, it is often assumed that the marginal cost goes up after a certain number of units are manufactured, and thus the profit can be maximized if the output level is set so that the marginal revenue is equal to the marginal cost, and the revenue can be maximized if the marginal revenue becomes zero. However, given that the replication costs of information goods are very low, revenue maximization and profit maximization for information products become quite different from those for physical products [7, 42].
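The two optimality conditions above can be worked through in a small example. The following minimal Python sketch assumes a linear inverse demand p(q) = a - b q and, as a simplification of the rising marginal cost mentioned above, a constant marginal cost c. The demand curve, the function names and all numbers are hypothetical. Revenue is q (a - b q), so the marginal revenue is a - 2 b q.

```python
def price(q: float, a: float, b: float) -> float:
    """Assumed linear inverse demand: the price at which q units can be sold."""
    return a - b * q


def marginal_revenue(q: float, a: float, b: float) -> float:
    """Derivative of revenue q * (a - b * q) with respect to q."""
    return a - 2.0 * b * q


def revenue_max_quantity(a: float, b: float) -> float:
    """Quantity where the marginal revenue is zero."""
    return a / (2.0 * b)


def profit_max_quantity(a: float, b: float, c: float) -> float:
    """Quantity where the marginal revenue equals the constant marginal cost c."""
    return max(0.0, (a - c) / (2.0 * b))


if __name__ == "__main__":
    a, b = 100.0, 1.0
    for c in (40.0, 5.0, 0.0):  # from a physical-like good to a zero-replication-cost good
        q_profit = profit_max_quantity(a, b, c)
        q_revenue = revenue_max_quantity(a, b)
        print(f"c={c:5.1f}: profit-max q={q_profit:5.1f} at price {price(q_profit, a, b):5.1f}; "
              f"revenue-max q={q_revenue:5.1f} at price {price(q_revenue, a, b):5.1f}")
```

Under these assumptions, the profit-maximizing and revenue-maximizing output levels differ when the marginal cost is substantial and coincide as the marginal cost approaches zero, which is one simple way to see why the two objectives behave differently for information goods than for physical products.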

Essentially, a market is fair if each seller gets the fair share of the revenuein coalition. In his seminal article [184], Shapley lays out the fundamentalrequirements of fairness in markets. Suppose there are k sellers cooperatively14articipate in a transaction that leads to a payment v . There are four basicrequirements for being fair. • Balance : the sum of the payment to each seller should be equal to v .That is, the payment is fully distributed to all sellers. • Symmetry : for a set of sellers S and two additional sellers s and s (cid:48) who are not in S , that is, s, s (cid:48) (cid:54)∈ S , if S ∪ { s } and S ∪ { s (cid:48) } produce thesame payment, then s and s (cid:48) should receive the same payment. Thatis, the same contribution to utility should be paid the same. • Zero element : for a set of sellers S and an additional seller s (cid:54)∈ S ,if S ∪ { s } and S produce the same payment, then s should receive apayment of 0. That is, no contribution, no payment. • Additivity : If the goods can be used for two tasks T and T withpayment v and v , respectively, then the payment to complete bothtasks T + T is v + v .In the above well celebrated Shapley fairness, the Shapley value is theunique allocation of payment that satisﬁes all the requirements. ψ ( s ) = 1 n (cid:88) S ⊆ D \{ s } U ( S ∪ ( s )) − U ( S ) (cid:0) n − | S | (cid:1) (1)where U () is the utility function, D is the complete set of sellers, S ⊆ D isa set of sellers, and s is a seller.Equivalently, Equation 1 can also be written as ψ ( s ) = 1 N ! (cid:88) π ∈ Π( D ) ( U ( P πs ∪ { s } ) − U ( P πs )) (2)where π ∈ Π( D ) is a permutation of all sellers, and P πi is the set of sellerspreceding s in π .Agarwal et al. 
[7] observe that, because the replication costs of information goods are very low and the marginal costs of production are close to zero, a seller can produce more units of the same information good to obtain a larger Shapley value and thus a larger portion of the payment, which is unjustified in business. This is a challenge in designing fair marketplaces for information goods.

3.2.4 Arbitrage-free Pricing

Arbitrage refers to activities that take advantage of price differences between two or more markets or channels. For example, consider a scenario where a user wants to purchase access to an article whose listed price is higher than $25, while a one-month subscription to the journal costs $25. Then, the user can conduct arbitrage by subscribing to the journal for only one month and obtain the article at a price cheaper than the listed price.

Arbitrage is often undesirable in pricing models. At the least, one should be able to check whether a pricing model is arbitrage-free. However, arbitrage can sneak into pricing models that are not thoroughly designed. For example, suppose a data service provider sells query results with prices based on variance [133]: a variance of 10 for $5 and a variance of 1 for $100 per query result. Each answer is perturbed independently. A customer who wants an answer with variance 1 can purchase the variance-10 query 10 times and compute the average. Because the noise is independent across purchases, the average has variance 10/10 = 1 at a total cost of $50, and thus the customer saves $50 by arbitrage.
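The averaging attack in the variance example can be checked mechanically. The sketch below is only an illustration under stated assumptions: the price list ($5 for variance 10, $100 for variance 1) is the one consistent with the $50 saving stated above, variances are integers, and the function names `cheapest_purchase` and `has_averaging_arbitrage` are hypothetical helpers, not part of any pricing framework from the literature. It relies only on the fact that the mean of n independent answers of variance v has variance v / n.

```python
from fractions import Fraction
from math import ceil


def cheapest_purchase(target_variance, price_list):
    """Cheapest way to reach `target_variance` by averaging copies of ONE listed answer.

    `price_list` maps an integer variance to its price per answer.
    Returns (total_cost, listed_variance, copies).
    """
    best = None
    for variance, price in price_list.items():
        # n independent copies of variance v average to variance v / n,
        # so we need n >= v / target_variance copies (at least one).
        copies = max(1, ceil(Fraction(variance, target_variance)))
        option = (copies * price, variance, copies)
        if best is None or option < best:
            best = option
    return best


def has_averaging_arbitrage(price_list):
    """True if some listed answer can be obtained more cheaply by averaging."""
    return any(
        cheapest_purchase(variance, price_list)[0] < price
        for variance, price in price_list.items()
    )


prices = {10: 5, 1: 100}  # assumed price list: variance -> dollars per answer
print(cheapest_purchase(1, prices))
print(has_averaging_arbitrage(prices))
```

Under these assumptions, raising the variance-10 price to $10 removes this particular arbitrage opportunity, since ten copies then cost as much as the variance-1 answer. The check covers only averaging of identical answers, not every arbitrage strategy.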

Privacy is becoming an increasingly serious concern for information goods. In general, privacy is the ability of an individual or a group to keep themselves, or the information about themselves, hidden from being identified or approached by other people. Privacy is highly related to information and information exchange, which are what information goods are about.

As explained in Section 2.1.5, due to the low tracking costs of information goods, it is easier to collect data about user privacy [4]. Whether privacy should be treated as a good and how privacy is priced are investigated [74, 163].

It is highly desirable to preserve privacy in marketplaces of information goods. In general, transactions in a marketplace may disclose the privacy of various parties in many different ways.

First, the privacy of buyers is highly vulnerable. Their identities, the location and time of purchases, the specific products purchased, the purchase prices and the total amounts may reflect their privacy. It has been reported from time to time that e-commerce providers leak customer information by mistake, such as an incident reported recently.

Second, the privacy of information good providers may also be disclosed. For example, medical treatment information in hospitals is highly valuable for many businesses, such as pharmacy and medical equipment companies. Imagine that hospitals can collect and anonymize medical treatment data properly and provide the corresponding data products in marketplaces so that individual patients cannot be re-identified. Buyers, however, may be able to infer from the data the success rates of a specific treatment in a hospital, which may be regarded as the privacy of the hospital.

Last, transactions in marketplaces may also disclose the privacy of a third party involved. For example, an AI technology company may provide machine learning model building services to data product buyers.
However, machine learning models, which are regarded as the privacy of the AI technology company, may be stolen [194].

To protect privacy in marketplaces of information goods, various directions are being explored, such as hiding the information about what, when and how much a buyer purchases [11], building decentralized and trustworthy privacy-preserving data marketplaces [50, 107], investigating the tradeoff between payments and accuracy when privacy is present [160], and aggregating non-verifiable information from a privacy-sensitive population [86]. There are many studies on preserving privacy in information goods. We refer interested readers to the rich body of surveys [8, 35, 61, 72, 78, 114, 212, 224] and others. We do not discuss further details about general privacy preservation techniques in this article, since they are far beyond the scope and capacity of this survey.

As many information goods may be sold to a huge number of potential buyers, a pricing model has to match goods/sellers and buyers with an appropriate price. Computing prices efficiently with respect to a large number of goods and a large number of buyers presents technical challenges [28].

Versioning is a common mechanism in designing and pricing information goods, so that prices of different versions can be linked to the values placed by various customer groups. There are a series of important requirements on pricing information goods, including truthfulness, revenue maximization, fairness, arbitrage-free pricing, privacy preservation, and computational efficiency. Those requirements pose technical challenges to pricing models.

Although the focus of this article is on pricing data products, we provide a brief review of pricing digital products here, since some general ideas in pricing digital products can be borrowed and extended to data products. In some cases, the boundary between digital products and data products is even blurry.

We first discuss the three major streams of revenues for digital products. Then, we look at two major types of pricing models. The first is bundling and subscription, and the second is auctions. These pricing models are popularly adopted by digital product marketplaces.

As discussed in Section 3.2.2, revenue maximization often serves as the basic objective in pricing mechanisms, including pricing digital products. Therefore, the understanding of pricing digital products can naturally start with an analysis of the possible sources of revenue for digital products.

Lambrecht et al. [127] summarize that there are three streams of revenues for digital products that are delivered online.

• Money. A provider can sell to customers content or, more broadly, services, such as movies and e-books.
• Information/privacy. Instead of charging customers directly, a provider can collect customer information by tracking (e.g., using cookies) and sell the information about customers to generate revenues.
• Time/attention. A provider can sell space in their digital products to advertisers to produce revenue.

Often, a firm has to design a revenue model for its digital products that combines more than one revenue stream. The three streams are not independent. Instead, they compete with each other, and thus a good tradeoff has to be settled [79]. On the one hand, in some situations, revenues from the money stream may be increased at the cost of those from the time/attention stream. For example, customers may pay for the content and avoid ads [168, 171], or convert from free versions to premium versions with fitting functions [202]. On the other hand, customers may be highly price sensitive in some digital products, and thus growth in the time/attention stream may be easier. For example, an online news site experienced a dramatic loss of customer visits after introducing a paywall [45]. Free samples may stimulate long-term sales [37]. A possible tradeoff between money and time/attention has to be carefully designed.

Typical approaches in revenue models of content and services [173] include rigid pricing (e.g., each movie is priced at a fixed price), designing pricing tiers (e.g., basic versus premium versions), setting up the duration of subscription plans (e.g., a 6-month promotion period with a very low subscription price) and designing freemium models.
One important and unique feature of digital product consumption is micropayments, which means that a customer can pay a very small amount that is typically impractical in traditional transactions using standard credit cards due to network service fees. Micropayments and subscriptions have different effects on consumer behavior [20].

As a concrete example of revenue models, consider pricing software products [130]. The major parameters of pricing models include formation of price, structure of payment flow, assessment base, price discrimination, price building and dynamic strategies. The formation of price considers price determination, that is, cost-based, value-based or competition-oriented, as well as the degree of interaction, unilateral versus interactive. In terms of payment flow, payment may be made by a single payment, recurring payments or a combination. The assessment base of pricing may be usage-dependent (e.g., by transaction or time) or usage-independent (e.g., server types and GPU).

As the tracking costs of digital products are low, a firm can collect customer personal data and sell such data for revenue, that is, generate revenues from the information/privacy stream. Typically, personal data may include customers' identities, behavior patterns, preferences and needs. There are various ways to sell customer data, which are also discussed in Section 5 when data products and their marketplaces are discussed. For example [32, 36], a website can provide direct marketing companies with user activity information. Moreover, websites can also collaborate with data management platforms (DMP, for advertising) [67] and produce revenues by helping businesses identify audience segments. For example, the information about how customers are connected in social networks can be used to design customized discounts in marketing campaigns [215].
Bergemann and Bonatti [33] develop a model of pricing customer-level information such that the data about each customer are sold individually and individual queries to the database are priced linearly. As new technologies of customer tracking become available, more pricing models may emerge.

We want to point out that selling customer data, though it serves the purpose of selling digital products, crosses the boundary between selling digital products and selling data products. We review some studies on setting prices for customer data and privacy information in the next section.

To produce revenues from the time/attention stream, many digital product producers and service providers embed advertisements in their products in one way or another, and obtain remarkable or even dominant advertising income. However, as John Wanamaker (1838-1922) wisely said, "Half the money I spend on advertising is wasted; the trouble is I don't know which half." It is well recognized that it is hard to accurately measure advertising effects [95, 132]. Advertisers customize ads for online display [190, 216]. One feasible way to improve advertising effectiveness is to combine user information and advertising opportunities. Retargeted advertising [128] is such an approach, which combines customer online and offline behavior data and makes firms focus on customers showing prior interest in the related products. For example, Athey et al. [21] consider customers with multiple homes and investigate the advertising strategies and effectiveness.

In summary, digital product and service suppliers produce revenues through three major streams: money, information/privacy and time/attention. Orthogonally, a firm can bundle its digital products and also design subscription plans that provide products and services in a specific period for a price, which is discussed next.

Product bundling organizes products or services into bundles, such that a bundle of products or services is for sale as one combined product or service package. Product bundling is a common marketing practice, particularly in traditional industries like telecommunication services, financial services, healthcare, and consumer electronics.

As discussed in Section 2.1.3, the low replication costs of information goods allow prevalent adoption of bundling in pricing digital products [182]. Designing product bundles is essentially a combinatorial optimization problem. The basic and static setting is that a customer wants to buy either one or multiple products at a time, which was investigated well before digital products became available [6]. A series of studies [18, 148, 169] develop pricing strategies with two products under different types of bundling. They share the basic assumption that demand for a bundle is elastic compared with demand for individual products. For example, Armstrong [18] studies the scenarios where products may be substituted or provided by separate sellers.

Bundling multiple products is analyzed, often under the independent value distribution framework [152]. Consider the situation where there are n heterogeneous products for one buyer, and the objective is to maximize expected revenue. Assume that the value distributions on products are independent. That is, for each product \(x_i\), the price that a buyer would like to pay follows an arbitrary distribution \(D_i\) in range \([a_i, b_i]\), where \(0 \le a_i \le b_i < \infty\), and those distributions \(D_1, \ldots, D_n\) are independent of each other. Further assume that the buyer is additive, that is, the buyer's value for a set of products is the sum of the buyer's values of the individual products in the set. Babaioff et al. [51] show that either selling each item separately or selling all items together as a grand bundle produces at least a constant fraction of the optimal revenue.
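To make the comparison between separate selling and the grand bundle concrete, the following sketch computes the best posted-price revenue of each strategy on a small, fixed population of additive buyers. It is a toy illustration only: it uses empirical revenue on a hand-made list of valuations rather than expected revenue over distributions, it does not compute the optimal mechanism, and the function names are hypothetical.

```python
def best_posted_price_revenue(values):
    """Best revenue from one posted price when each buyer buys iff value >= price.

    An optimal posted price can always be found among the buyers' own values.
    """
    return max(p * sum(1 for v in values if v >= p) for p in values)


def separate_revenue(buyers):
    """Price every product on its own; buyers are tuples of per-product values."""
    n_products = len(buyers[0])
    return sum(
        best_posted_price_revenue([buyer[j] for buyer in buyers])
        for j in range(n_products)
    )


def grand_bundle_revenue(buyers):
    """Sell all products together; an additive buyer values the bundle at the sum."""
    return best_posted_price_revenue([sum(buyer) for buyer in buyers])


def better_simple_strategy(buyers):
    """Pick whichever of the two simple strategies earns more on this population."""
    sep, bundle = separate_revenue(buyers), grand_bundle_revenue(buyers)
    return ("bundle", bundle) if bundle > sep else ("separate", sep)


# Negatively correlated tastes: bundling evens out the valuations.
print(better_simple_strategy([(1, 3), (3, 1), (2, 2)]))
# One dominant product: separate pricing does better here.
print(better_simple_strategy([(5, 0), (0, 1)]))
```

The two toy populations show that neither strategy dominates the other in general, which is why taking the better of the two is the natural simple rule.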
The constant-fraction guarantee allows a simple yet effective bundling strategy: either pricing each product individually or pricing the grand bundle at the expected price. In practice, many platforms, such as Hulu and Amazon Prime Video, offer grand bundle subscriptions for their products.

More recently, Haghpanah and Hartline [97, 98] show that the grand bundle is optimal if more price-sensitive buyers consider the products more complementary. When multiple buyers are considered, whose preferences are unknown, Balcan et al. [30] give a simple pricing model that achieves a surprisingly strong guarantee: in the case of unlimited supply, a random single price achieves expected revenue within a logarithmic factor for customers with general valuation functions. This result allows great convenience in practice, that is, setting a uniform price for all products. It is easier to price a bundle of a larger number of products, since the law of large numbers makes it possible to predict customers' valuations more accurately for a larger bundle of products [2].

Orthogonal to bundling, subscription is to price the interactions between customers and a platform over a period of time. Subscribing customers are in general heterogeneous in both usage rate and value of products. On the one hand, customers with higher usage rates may prefer subscribing to larger subscription sets. On the other hand, in order to maximize revenue, the platform wants customers with lower usage rates to subscribe, and customers with higher usage rates to rent. Moreover, different users may have different values for a product. Many platforms offer subscription and renting at the same time. For a platform, the subscription model is to select a subscription fee and the period for each set of products and also set the rental price for each product [13].

Alaei et al. [13] follow the model of the grand bundle and consider grand subscription, a single rental price for the set that includes all products.
They establish the sufficient and necessary condition for the optimality of grand subscription. They also show that subscription fees can be set proportional to the cardinality of a set of products and can achieve \(\frac{1}{4 \log 2m + \log n}\) of the optimal revenue for n types of customers and m types of products. This approximation is tight in the sense that it cannot be improved more than \(\Omega(n)\) in polynomial time.

After all, modeling bundling and subscriptions is computationally challenging due to the combinatorial nature. Dynamic pricing of bundles and subscriptions, such as promotions and coupons, has rarely been touched yet.

Auctions have a long history dating back to the Babylonian and Roman empires [185]. There are many excellent surveys on auctions (e.g., [24, 68, 118, 145]). A comprehensive review of auctions is far beyond the scope and capacity of this article. In this article, we instead focus only on the important role of auctions as a pricing mechanism for digital products.

There are four basic types of auctions widely used.

• In the ascending-bid auction (also known as the English auction), the price is raised successively until only one bidder remains, who wins the object at the final price.
• The descending auction (also known as the Dutch auction) works the other way by starting at a very high price and lowering the price continuously, until the first bidder calls out and accepts the current price.
• In the first-price sealed-bid auction, every bidder submits a bid without knowing the others' bids. The one making the highest bid wins and pays the bid price.
• The second-price sealed-bid auction (also known as the Vickrey auction [198]) works in the same way as the first-price sealed-bid auction does, except that the winner pays only the second highest bid.

There are two basic models of the value information in auctions. The private-value model assumes that every bidder has an independent value on the object for sale. The value is also private to the bidder only. The pure common-value model assumes that the actual value of the object for sale is the same for all bidders, but bidders have different private information about that actual value. Every bidder adjusts her/his estimate of the actual value by learning other bidders' signals. There are also models considering both values private to individual bidders and values common to all bidders.

One fundamental principle in auction theory is the revenue equivalence theorem [152, 177, 198, 199], which essentially states that, for a set of risk-neutral bidders with independent private valuations of an object drawn from a common cumulative distribution that is strictly increasing and atomless on \([v_{\min}, v_{\max}]\), any auction mechanism yields the same expected revenue, and thus any bidder with valuation v makes the same expected payment, if (1) the object is allocated to the bidder with the highest valuation; and (2) any bidder with valuation \(v_{\min}\) has an expected utility of 0.
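Revenue equivalence can be illustrated numerically in the simplest textbook setting. The sketch below assumes n risk-neutral bidders with valuations drawn independently and uniformly from [0, 1]. Under that assumption, bidding truthfully is a dominant strategy in the second-price auction, and the standard symmetric equilibrium bid in the first-price auction is (n − 1)/n times the valuation. Both formats then have expected revenue (n − 1)/(n + 1). The function name and parameters are hypothetical, and the simulation is a sanity check, not a proof.

```python
import random


def simulate_revenues(n_bidders=3, rounds=20000, seed=0):
    """Monte Carlo estimate of (first_price_revenue, second_price_revenue).

    Assumes i.i.d. uniform[0, 1] private values and equilibrium bidding.
    """
    rng = random.Random(seed)
    shade = (n_bidders - 1) / n_bidders  # equilibrium shading in first-price
    first_total = 0.0
    second_total = 0.0
    for _ in range(rounds):
        values = sorted(rng.random() for _ in range(n_bidders))
        # First price: the highest-value bidder wins and pays her shaded bid.
        first_total += shade * values[-1]
        # Second price: the highest-value bidder wins and pays the runner-up value.
        second_total += values[-2]
    return first_total / rounds, second_total / rounds


first, second = simulate_revenues()
theory = (3 - 1) / (3 + 1)  # (n - 1) / (n + 1) with n = 3
print(f"first-price ~ {first:.3f}, second-price ~ {second:.3f}, theory = {theory}")
```

The two estimates agree only on average. In any single round the payments generally differ, which is exactly what the theorem allows.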
Based on the revenue equivalence theorem, the four basic types of auctions lead to the same expected payment by the winner and the same expected revenue.

While most studies in auction theory make some simple assumptions about the independence of customer valuations, empirical studies [106] demonstrate that, in practice, the wrong assumption of valuation independence causes inefficient auctions in e-commerce.

Online ad and sponsored search auctions [126, 172, 197] are one important application of auctions in pricing digital products. Sponsored search [110] is the business model where content providers pay search engines for traffic to their websites. In sponsored search, advertisers and, more generally, content providers bid for keywords in search engines, and search engines decide which ad to display in which position to answer a query from a user.

GoTo.com created the first sponsored search auction [110].

Different pricing models can be used in sponsored search auctions, such as pay-per-mille/pay-per-impression (PPM, that is, the cost of 1,000 advertisement impressions), pay-per-click (PPC), and pay-per-action (PPA). In the early days of sponsored search, a generalized first-price auction was used. Each advertiser bids on multiple keywords, and can set a bidding price for each keyword. When a user query, which is a keyword, is answered, the ads with the top k bidding prices on the keyword are displayed. If an ad is clicked by the user, the corresponding advertiser pays the bidding price. The first-price auction mechanism is unstable, costs advertisers time and reduces search engine profits [64]. Later, Google generalized the second-price auction.

One unique feature of digital products is that the replication costs are very low, and thus digital products may have almost unlimited supply. Products of unlimited supply lead to new challenges and opportunities in auction mechanism design. For example, the second-price auction can be straightforwardly generalized for k identical products: the top k highest bidders win and each pays the (k + 1)-th bidding price. However, when there are unlimited identical products, the (k + 1)-th bidding price approaches 0. The lack of competition due to excessive supply prevents bidders from offering any high prices. In other words, the challenge is how to ensure that the bids are truthful, that is, that they reflect the bidders' true valuations of the digital products.

Denote by B the set of bidders, and by \(b_1, b_2, \ldots\) the bidding prices in descending order, that is, \(b_i \ge b_{i+1} \ge 0\) for \(i > 0\). Suppose the generalized second-price auction mechanism is used. That is, if k bids are taken, each winning bidder pays \(b_{k+1}\). The auction objective is to maximize \(k \cdot b_{k+1}\). An auction is competitive if it yields revenue within a constant factor of the optimal fixed pricing. The tricky part is that, when there is unlimited supply, the Vickrey auction is not competitive if the seller chooses the number of products to sell before knowing the bids, and is not truthful if the seller chooses after knowing the bids [92].

Goldberg et al. [92] propose the first competitive auction for digital goods with unlimited supply. The major idea is the framework of the random sampling auction. An auction is bid-independent if bidder i's bid value determines only whether the bidder wins the auction, but not the price. We select a sample \(B'\) of B at random, independent of the bid values. We use the bids in \(B'\) to compute the optimal bid threshold \(f_{B'}\) that maximizes the revenue in \(B'\), and every bidder in \(B - B'\) whose bid value is over \(f_{B'}\) wins. Symmetrically, we use the bids in \(B - B'\) to compute the optimal bid threshold \(f_{B - B'}\) that maximizes the revenue in \(B - B'\), and every bidder in \(B'\) whose bid value is higher than \(f_{B - B'}\) wins. In general, \(f_{B'} = f_{B - B'}\) does not necessarily hold. Random sampling auctions are competitive in both the single-price version and the multi-price version. Indeed, random sampling auctions are 15-competitive in the worst case [70] and 4-competitive for a large class of instances where there are at least 6 bids that are as good as the optimal sale price [14]. There are a series of improvements on random sampling auctions. For example, Hartline and McGrew [102] further improve the competitiveness.

Goldberg and Hartline [89] extend the scope from a single digital product with unlimited supply to multiple products with unlimited supply.
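Returning to the single-product random sampling auction of Goldberg et al. [92] described above, the following minimal sketch shows the two-threshold mechanics. It makes several simplifying assumptions that are not taken from the original paper: each bidder is placed in the sample independently with probability 1/2, a bidder wins when the bid is at least the threshold computed from the other side, ties between equally good thresholds are broken toward the higher price, and each winner pays that threshold. All function names are hypothetical.

```python
import random


def optimal_single_price(bids):
    """Price p maximizing p * |{b : b >= p}|; infinite if there are no bids."""
    if not bids:
        return float("inf")  # nobody can meet this threshold
    candidates = sorted(set(bids), reverse=True)
    # max() keeps the first maximizer, i.e. the highest price among ties.
    return max(candidates, key=lambda p: p * sum(b >= p for b in bids))


def settle(sample, rest):
    """Apply each side's optimal price as the take-it-or-leave-it offer to the other side.

    Returns (threshold_from_sample, threshold_from_rest, total_revenue).
    """
    f_sample = optimal_single_price(sample)
    f_rest = optimal_single_price(rest)
    revenue = sum(f_sample for b in rest if b >= f_sample)
    revenue += sum(f_rest for b in sample if b >= f_rest)
    return f_sample, f_rest, revenue


def random_sampling_auction(bids, seed=None):
    """Split bidders at random (independently of bid values), then settle."""
    rng = random.Random(seed)
    sample, rest = [], []
    for b in bids:
        (sample if rng.random() < 0.5 else rest).append(b)
    return settle(sample, rest)


# A fixed split makes the mechanics easy to follow by hand.
print(settle([10, 8, 1], [9, 7, 2]))
```

Because the price offered to a bidder depends only on the bids of the other side, no bidder can influence her own price, which is the bid-independence property mentioned above.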
Given a set of bids, Goldberg and Hartline [89] show that the bidder-optimal product assignment given the bids and the optimal sale prices can be determined by solving the following integer programming problem.

\[
\begin{aligned}
\max \quad & \sum_j \sum_i x_{ij} r_j \\
\text{subject to} \quad & r_m = 0 \\
& \sum_j x_{ij} \le 1, && 1 \le i \le n \\
& x_{ij} \ge 0, && 1 \le i \le n,\ 1 \le j \le m \\
& p_i + r_j \ge a_{ij}, && 1 \le i \le n,\ 1 \le j \le m \\
& \sum_i p_i = \sum_j \sum_i x_{ij} (a_{ij} - r_j)
\end{aligned}
\tag{3}
\]

where \(x_{ij}\) is the assignment of product j to bidder i, \(r_j\) is the optimal price for product j, \(p_i\) is the profit of bidder i, and \(a_{ij}\) is the bid from bidder i on product j.

Then, we can solve the optimal pricing problem in the following random sampling auction. Let B be the set of bidders. First, we obtain a sample \(B'\) of bidders. Second, we compute the optimal sale prices for \(B'\). Last, we run the fixed-price auction on \(B - B'\) using the sale prices computed in Equation 3. All bidders in \(B'\) lose the auction. The random sampling auction is shown to be truthful and competitive [89].

Most of the proposed auctions for digital goods with unlimited supply are randomized auctions. Goldberg et al. [92] show that no deterministic auction can be competitive. Aggarwal et al. [9] later point out that the result does not hold for asymmetric auctions [144]. In a symmetric ex ante auction, buyers' preference parameters are drawn from a symmetric probability distribution, and thus there exists a symmetric equilibrium if an equilibrium exists at all. In an asymmetric auction, each buyer has the same information about the product but a different opportunity cost of obtaining the product, that is, bidders' valuations are drawn from different distributions. Aggarwal et al. [9] give an asymmetric deterministic auction that can approximate the revenue of any optimal single-price sale in the worst case. Indeed, they develop a general derandomization technique to transform any randomized auction into an asymmetric deterministic auction with approximately the same revenue.
The general idea follows the deterministic maximum flow solution to the well-known hat problem [63].

One drawback of random sampling auctions is that some bidders may lose even though they bid higher than some winning bidders do, since the bidders in \(B'\) and \(B - B'\) use different thresholds (i.e., \(f_{B - B'}\) and \(f_{B'}\), respectively) in the one-product version, and all bidders in \(B'\) lose in the multiple-product version.

Goldberg and Hartline [91] establish a fundamental result: an auction cannot be truthful, competitive and envy-free at the same time. They also explore possible tradeoffs between truthfulness and envy-freeness based on the consensus revenue estimate (CORE) technique [90]. Specifically, using a similar idea in combinatorial auctions with single-parameter agents [16], we can relax the truthfulness requirement by requiring the auction to be truthful with probability \((1 - \epsilon)\), and always guarantee envy-freeness. The auction is highly truthful when \(\epsilon\) approaches 0 and the number of winners in the auction approaches infinity. The other type of auction relaxes the envy-free requirement to being envy-free with probability \((1 - \epsilon)\), and guarantees truthfulness. Both auctions are competitive, and the probability is over the random coin tosses made by the randomized auction mechanism and not over the input.

4.3.5 Online Auctions

In addition to potentially unlimited supply, another important feature of digital goods is that a digital good, such as a movie or a song, may be sold repetitively. Therefore, auctions on digital goods may run continuously instead of for only one round. Moreover, customers may want to have prompt answers to their bids.

Online auctions [129] are designed to address the setting where different customers bid at different times. The auction mechanism has to make a decision about each bid as it arrives. An (online) auction is incentive compatible if the bidders are rationally motivated to reveal their true valuations of the object.
Lavi and Nisan [129] show that an online auction is incentive compatible if and only if it is based on supply curves under the assumption of limited supply, that is, before it receives the i-th bid \(b_i(q)\), it fixes the supply curve \(p_i(q)\) based on the previous bids, and (1) the quantity \(q_i\) sold to customer i is the quantity q that maximizes the sum \(\sum_{j=1}^{q} (b_i(j) - p_i(j))\); and (2) the price paid by i is \(\sum_{j=1}^{q_i} p_i(j)\).

To tackle the challenges when there is unlimited supply, Bar-Yossef et al. [31] point out that supply curves are not available anymore. Instead, they propose an extremely simple incentive-compatible randomized online auction. For each bidder i, the auction picks a random number \(t \in \{0, \ldots, \lfloor \log h \rfloor\}\) and sets the price threshold to \(s_i = 2^t\), where h is the ratio of the highest valuation to the lowest valuation among all bidders. This auction is \(O(\log h)\)-competitive.

The auction mechanism can be further improved to achieve even better competitiveness while remaining incentive compatible. Specifically, we can divide a sequence of bids \(b_1, b_2, \ldots\) into \(l = \lfloor \log h \rfloor + 1\) buckets, such that bucket \(B_j\) contains the bids whose values fall in the range \([2^j, 2^{j+1})\). The weight of bucket \(B_j\) is the sum of the bids within \(B_j\), that is, \(w_j = \sum_{i \in B_j} b_i\). For a new bidder, one of the buckets is chosen at random with probability depending on the bucket weight, and the bidder pays the price of the lowest bid of the bucket. The price \(s_i\) that bidder i pays follows the probability distribution

\[
\Pr[s_i = 2^j] = \left( \frac{w_j}{\sum_{r=0}^{l-1} w_r} \right)^{d},
\]

where d is a parameter. The auction is shown to be \(O(3^d (\log h)^{1/(d+1)})\)-competitive. By setting \(d = \sqrt{\log \log h}\), the auction is \(O(\exp(\sqrt{\log \log h}))\)-competitive.

4.4 Summary

As revenue maximization plays a fundamental role in pricing digital products, we review the three major streams of revenues for digital products, namely money, information/privacy, and time/attention.
Then, we revisit bundling and subscription planning for digital products, which echoes the opportunities and challenges due to the low replication costs of information goods. Auctions are widely used in pricing digital products. We review some basic types of auctions and their applications in digital products, including sponsored search auctions, auctions with unlimited supply, envy-free auctions and online auctions. Some ideas employed in pricing digital products are also used in pricing data products, as discussed in the next section.

In this section, we discuss pricing in marketplaces of data. We first obtain an overall understanding of data markets and the major players in such markets. Then, we look into several of the most studied technical problems in data product pricing, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing, and privacy preservation in data marketplaces. Last, we discuss pricing in novel application scenarios, including dynamic data pricing, online pricing and federated learning pricing.

Marketplaces for data have been actively developed for over a decade. An early survey [179] identifies different categories and dimensions of data marketplaces and data vendors in 2012. There are many studies on various issues about data markets and pricing strategies. Before we discuss any specifics in detail, it is important to obtain an overall understanding of data markets, such as what is sold and for what purposes, who the sellers are, who the buyers are, and what the basic pricing models are.

Pantelis and Aija [167] present a brief economic analysis of data taxonomy as a market mechanism. Data and databases are legally protected by either copyright or database right. Copyright protects expression and the significant creative effort that creates and organizes data. Database right protects a whole database. One challenge is that both copyright and database right are hard to enforce due to the non-rivalrous nature of data.

In general, data may be owned by governments, private parties or individuals. Consequently, data can be categorized into three types: open, public, and private data [167]. Open data are common pool resources [166], such as the data made available by the open data initiatives. Public data, such as the data collected by the government in the United States, are valuable resources subject to the "tragedy of the commons" [101]. Public data are often produced by individuals or organizations for research and used by governments and local authorities, but may also be employed by commercial parties to enhance their proprietary resources or services. Private data are generated by private applications or services.

To understand what is sold in data markets and for what purposes, Muschalle et al. [151] consider the common queries and demands on data markets, as well as the pricing strategies. They observe two major types of queries.
The first type of queries is to estimate the value of a "thing" or compare the values of "things", where examples of the "things" include webpages for advertisements, starlets, politicians and products. The second type is to show all about a "thing". Those queries are raised by seven categories of beneficiaries, namely analysts, application vendors, data processing algorithm developers, data providers, consultants, licensing and certification entities, and data market owners. Muschalle et al. [151] also identify three types of market structures. First, in a monopoly, a supplier is powerful enough to set prices to maximize profits. Second, an oligopoly is dominated by a small number of strong competitors. Last, in strong competition markets, prices may align with marginal costs.

A series of pricing strategies and models may be considered in data markets [151]. First, free data may be obtained from public authorities, may help to attract customers and suppliers of commercial data, and may be integrated into private and not-free data products. Second, prices can be based on usage, such as charging customers per hour of data usage. Third, package pricing allows a customer to obtain a certain amount of data or API calls for a fixed fee. A few studies [116, 210] try to optimize package pricing models. Fourth, in the flat fee tariff model, a data product or service is offered at a flat rate, regardless of usage. It is simple and easy to use. The drawback is the lack of flexibility, particularly for buyers. Fifth, combining package pricing and flat fee tariff results in a two-part tariff, that is, a fixed basic fee plus an additional fee per unit consumed. This model is popular in data services. Specifically, Wu and Banker [209] show that, under zero marginal costs and monitoring costs, flat fee and two-part tariff pricing are on par, and two-part tariff is the most profitable strategy.
Last, in the freemium model, users can use basic products or services for free and pay for premium functions or services.

Recently, machine learning, particularly deep learning [94], has become disruptive in many applications, such as computer vision [139, 201] and natural language processing [218]. In most situations, powerful deep models rely heavily on large amounts of training data [156]. Monetization of data and of machine learning models built on data through markets is attracting stronger and stronger interest from industry. Specific to data as an economic good and data pricing as a monetization mechanism in this context, a series of studies focus on data utility for model building and the associated pricing, particularly considering privacy.

Some data owners may have detailed knowledge of specific machine learning tasks and thus dedicate corresponding effort to collecting high quality data for building better models. Babaioff et al. [23] study the design of optimal mechanisms for a monopoly data provider to sell her/his data. Specifically, they show that it is feasible to achieve optimal revenue by a simple one-round protocol, that is, a protocol where a buyer and a seller each send a single message, and there is a single money transfer. The optimal mechanism can be computed in polynomial time. For a buyer who may abort the interaction with a seller prematurely, multiple rounds of partial information disclosure interleaved with payments may be needed to ensure optimal revenue. Cummings et al. [49] study the optimal design for data buyers to purchase data estimators with different variances and combine the estimators to meet a required quality guarantee on variance with the lowest total cost.

The role of privacy in data collection and machine learning model building is investigated. For example, Ghosh and Roth [87] develop auctions that are truthful and approximately optimal for data buyers to obtain accurate estimates on data from owners who are compensated for privacy loss.
They show that the classic Vickrey auction [198] can minimize the buyer’s total payment and meet the accuracy requirement. They also develop a mechanism that can maximize the accuracy given a budget.

In general, modeling data owners’ costs of privacy loss is very difficult, since the costs may be correlated with private data arbitrarily. It is impossible to design a direct revelation mechanism that can provide a non-trivial guarantee on accuracy and, at the same time, is rational for individual data owners. To tackle the issue, Ligett and Roth [137] design a take-it-or-leave-it mechanism, which randomly approaches individuals from a population and makes offers. This mechanism can be used for some data collection scenarios, such as surveys.

Versioning is an important strategy in data pricing. A data seller can customize data into different versions according to buyers’ needs. Bergemann et al. [34] develop the optimal menu of information products that a monopoly data supplier can offer to a data buyer, so that one product can fit the buyer’s willingness to buy the information at the offered price, and the revenue is maximized. One important finding is that information products indeed allow larger scopes of price discrimination. There are at least two dimensions that sellers can explore to derive various subsets of a data set, namely data quality and data position.

When data are used to build machine learning models, it is important to assess the value of each data record within a data set. There exist various methods for assessment, such as leave-one-out [47] and leverage or influence score [48].

Ghorbani and Zou [85] propose to apply the Shapley fairness on the data used to train a machine learning model, and thus define data Shapley for a record $i$ in a training data set $D$ of $n$ records as

$\psi_i = C \sum_{S \subseteq D - \{i\}} \frac{U(S \cup \{i\}) - U(S)}{\binom{n-1}{|S|}}$,

where $C$ is an arbitrary (positive) constant, and $U(S)$ is the performance score of the model trained on data $S \subseteq D$.
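The data Shapley formula can be evaluated directly on very small data sets. The following is a minimal sketch, not the implementation of [85]: it enumerates every subset $S \subseteq D - \{i\}$, and it assumes the normalization $C = 1/n$ so that the values sum to $U(D) - U(\emptyset)$. The function name `exact_data_shapley` and the toy "coverage" utility are hypothetical stand-ins for a real model performance score.

```python
from itertools import combinations
from math import comb
from typing import Callable, FrozenSet, List, Optional


def exact_data_shapley(
    n: int,
    utility: Callable[[FrozenSet[int]], float],
    C: Optional[float] = None,
) -> List[float]:
    """psi_i = C * sum over S subset of D-{i} of (U(S+{i}) - U(S)) / binom(n-1, |S|)."""
    if C is None:
        C = 1.0 / n  # assumption: makes the values sum to U(D) - U(empty set)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):  # |S| ranges over 0 .. n-1
            weight = 1.0 / comb(n - 1, size)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (utility(s | {i}) - utility(s))
        values.append(C * total)
    return values


# Hypothetical toy utility: record i "covers" some items, and U(S) is the
# fraction of the three items {a, b, c} covered by the records in S.
COVERS = [{"a", "b"}, {"b"}, {"c"}]


def coverage_utility(subset: FrozenSet[int]) -> float:
    covered = set()
    for i in subset:
        covered |= COVERS[i]
    return len(covered) / 3.0


shapley_values = exact_data_shapley(len(COVERS), coverage_utility)
```

In this toy case the redundant record (index 1) receives the smallest value. Each record requires $2^{n-1}$ subset evaluations, and in practice every evaluation means training a model, so the enumeration is only feasible for a handful of records.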
One challenge is that computing the exact data Shapley values on large data sets for sophisticated models, such as deep neural networks, is computationally prohibitive. Ghorbani and Zou [85] also develop Monte Carlo and gradient-based methods for estimation.

If a data point $p$ appears in two samples $D_1$ and $D_2$ from the same data distribution, intuitively the Shapley values of $p$ in $D_1$ and $D_2$ should be similar. Mathematically, the intrinsic Shapley value of $p$ in a distribution should be the expectation of the Shapley value of $p$ in the distribution. Based on this intuition, Ghorbani et al. [84] propose the notion of distributional Shapley. Let $\mathcal{Z}$ be a universe in question. For example, in classification problems, conventionally $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the feature space and $\mathcal{Y}$ is the output. Let $\mathcal{D}$ be a data distribution in $\mathcal{Z}$. Assuming a potential function or a performance metric $U : \mathcal{Z}^* \to [0, 1]$ and a sample size $m > 0$, the distributional Shapley value of a point $z \in \mathcal{Z}$ is the expected Shapley value over data sets of size $m$ containing $z$, that is,

$\nu(z; U, \mathcal{D}, m) = E_{S \sim \mathcal{D}^{m-1}}[\psi(z; U, S \cup \{z\})]$,

where $S \sim \mathcal{D}^{m-1}$ is a set of $m - 1$ points sampled i.i.d. from $\mathcal{D}$. They show that distributional Shapley values are stable. Kwon et al. [125] further derive computationally tractable expressions for distributional Shapley for a series of models, including linear regression, binary classification and non-parametric density estimation.

Alternative to Shapley values, there are some other data valuation methods. For example, in machine learning, influence functions [119, 206] approximate leave-one-out to assess the value of a data item. Cai et al. [41] propose strategy-proof mechanisms for data elicitation and trade off between model accuracy and reward. Richardson et al. [176] focus on the case of linear regression. Recently, Yoon et al. [217] propose data valuation using reinforcement learning. They use a data value estimator to learn how much a data item, as an element in the training data, contributes to improving model performance. One distinct advantage is that the model being trained and the data value estimator can improve each other’s performance.

Data quality is an important issue [170]. There are many studies on assessment of data quality [103, 170, 204]. Some studies specifically focus on pricing based on data quality and the impact on data markets. Heckman et al. [103] propose a simple linear model,

$\text{Value of data} = \text{fixed cost} + \sum_i w_i \cdot \text{factor}_i$,

where the factors include but are not limited to age of data, periodicity of data, volume of data, and accuracy of data, and $w_i$ is the associated weight. One practical difficulty in using the model is that the parameters in the model are hard to estimate. Another difficulty is that many data sets do not have public prices associated.
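As a small illustration of the linear model by Heckman et al. [103], the sketch below evaluates the formula for one data set. The factor names, weights and fixed cost are hypothetical numbers chosen only for the example; estimating such weights from real market prices is exactly the difficulty noted above.

```python
from typing import Dict


def linear_data_value(
    fixed_cost: float,
    weights: Dict[str, float],
    factors: Dict[str, float],
) -> float:
    """Value of data = fixed cost + sum_i w_i * factor_i."""
    missing = set(weights) - set(factors)
    if missing:
        raise ValueError(f"no factor value for: {sorted(missing)}")
    return fixed_cost + sum(w * factors[name] for name, w in weights.items())


# Hypothetical weights: older data is worth less, more volume and higher
# accuracy are worth more.
example_weights = {"age_days": -0.5, "volume_gb": 2.0, "accuracy": 50.0}
example_factors = {"age_days": 30.0, "volume_gb": 10.0, "accuracy": 0.9}

example_value = linear_data_value(100.0, example_weights, example_factors)
```

With these made-up numbers the age term lowers the value by 15, while volume and accuracy add 20 and 45, giving a value of about 150.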
Yu and Zhang [219] consider pricing multiple versions formed by multiple factors of data quality and build a two-level model. The first level is the data platform where a single owner is assumed, who designs the number of versions. The second level is the customers who want to maximize the data utility. Each level is modeled as a maximization problem and thus the whole model is a bi-level programming problem, which is NP-hard.

Another way to form multiple versions of data products is to charge by queries [121–124]. Intuitively, a data seller may treat a view of a data set as a version. Setting the price for every possible view is not only tedious but also tricky. If prices on views are not set properly, arbitrage or prices lower than the highest achievable may occur. Koutris et al. [121, 124] propose a framework of query and view based data pricing. The major idea is that a seller only needs to specify the prices on a few views, and then the prices of other views can be decided algorithmically. They advocate two desiderata, arbitrage-freeness and discount-freeness. Theoretically, they show the existence and uniqueness of pricing functions satisfying the requirements. They also show the complexity of computing the pricing functions. Unfortunately, only selection views and conjunctive queries without self-joins are tractable. They present polynomial time algorithms for chain queries and cyclic queries.

Technically, the core idea in the view and query based pricing framework is query determinacy [157, 158, 180]. A query $Q$ is said to be determined by a set of views $V$ if the answer to $Q$ can be completely derived from the views. Query determinacy enables the feasibility of arbitrage detection. If $V$ determines $Q$, then arbitrage happens if and only if the price of $V$ is cheaper than that of $Q$.

Koutris et al. [123] further explore the technical challenges in practical implementation of view and query based data pricing.
Specifically, they develop an integer linear programming formulation for the pricing problem with a large number of queries. Considering the scenario where a user may purchase multiple queries over time or the database is updated, such that information in multiple queries and updates may overlap, they also leverage query history to avoid double charging. To handle the situation where there are multiple sellers, they define the share of a seller as the maximum revenue that the seller can get among all minimum-cost solutions, and accordingly define a fair revenue distribution policy. A prototype demonstration system is reported in [122].

Tang et al. [192] follow the view and query based pricing framework and consider the minimum granularity of data, that is, each tuple is a view. Their model assigns to each tuple a price and prices queries based on minimal provenances. Tang et al. [191] extend view and query based pricing to XML documents and consider the situation where a customer may just want to purchase a sample instead of the complete query result.

Arbitrage is probably the most intensively studied issue in pricing data products. As introduced in Section 3.2.4, in general, arbitrage refers to activities that take advantage of price differences between two or more markets or channels. Arbitrage is undesirable in many pricing models. Unfortunately, arbitrage may sneak into pricing models without rigorous design. For example, Balazinska et al. [28] show that subscription based pricing, possibly with a query limit, allows arbitrage. Muschalle et al. [151] point out that a pricing model that charges users a fixed rate for a certain number of API calls may potentially allow arbitrage, depending on the package size.

Arbitrage-freeness is one of the fundamental properties of pricing models in query and view based pricing [121–124]. Li and Miklau [134] and Li et al. [133] develop frameworks of pricing linear aggregate queries. Specifically, Li et al. [133] consider linear queries.
Given a data set of $n$ tuples $x = (x_1, \ldots, x_n)$, a linear query $q = (q_1, \ldots, q_n)$ is a real-valued vector, and the answer is $q(x) = \sum_{i=1}^{n} q_i x_i$. For a multiset of queries $S = \{Q_1, \ldots, Q_k\}$ and query $Q$, if the answer to $Q$ can be linearly derived from the answers to the queries in $S$, then $Q$ is said to be determined by $S$, denoted by $S \to Q$. A pricing function $\pi(Q)$ is arbitrage-free if for any multiset $S$ and query $Q$ such that $S \to Q$, $\pi(Q) \le \sum_{i=1}^{k} \pi(Q_i)$.

Under the general intuition of arbitrage-freeness, Li et al. [133] consider a specific form of queries, linear queries with variance $(q, v)$, that is, the estimation of the answer to query $q$ should have a variance no larger than $v$. Using different values of $v$, different versions are formed. A pricing model not carefully designed may allow arbitrage.

Li et al. [133] first establish the observation that $\pi(q, v) = \Omega(1/v)$. Then, they synthesize the pricing function $\pi(q, v) = f^2(q)/v$, which is arbitrage-free if $f$ is positive and a semi-norm. For any arbitrage-free pricing functions $\pi_1, \ldots, \pi_k$, $f(\pi_1(q), \ldots, \pi_k(q))$ is also arbitrage-free if $f$ is a subadditive and nondecreasing function.

Footnote: A function $f : R^n \to R$ is a semi-norm if for any $c \in R$ and any query $q \in R^n$, $f(cq) = |c| f(q)$; and for any $q_1, q_2 \in R^n$, $f(q_1 + q_2) \le f(q_1) + f(q_2)$.

As Roth [178] summarizes, the framework by Li et al. [133] still faces three important challenges. First, arbitrage is still possible when answers to a bundle of queries are derived from another bundle of queries and their answers. Second, arbitrage is still possible on biased estimators for statistical queries. Last, it is unclear whether we can obtain arbitrage-free pricing maximizing profit given the distribution of buyer demands. Later, Deep and Koutris [54] provide some interesting insights into arbitrage-free pricing for bundles.

Lin and Kifer [138] investigate arbitrage-free pricing for general data queries.
They consider three types of pricing models for query bundles, where a query bundle is a set of queries posted simultaneously as a batch. First, an instance-independent pricing function depends on the query bundle but not the database instance. Second, an up-front dependent pricing function depends on both the query bundle and the database instance. A customer knows an up-front dependent pricing function, and decides whether to purchase the query answers or not. Last, a delayed pricing function depends on both the query bundle and the answers computed by the query bundle on the current database instance. The customer knows the pricing function, but does not know the exact price. Once agreeing, the customer is charged when the answers are computed.

Lin and Kifer [138] also summarize five different types of arbitrage situations. First, if prices are quoted by queries, in order to avoid price-based arbitrage, answers to queries should not be deducible from prices alone. Second, a buyer may use multiple accounts to derive answers to a query bundle. To avoid separate account arbitrage, the price of a query bundle $[q_1, q_2]$ should be at most the sum of the prices of $q_1$ and $q_2$. Third, if the answers to a query bundle $q'$ can always be deduced from answers to another query bundle $q$, to prevent post-processing arbitrage from happening, the price of $q$ should be no cheaper than that of $q'$. Fourth, although the answers to a query bundle $q$ may not always be derivable from the answers to another query bundle $q'$ on all database instances, still for a specific database instance $I$, the answers to $q$ may be derived from the answers to $q'$. If so, a serendipitous arbitrage happens. Last, if two queries behave almost identically but their prices are dramatically different, almost-certain arbitrage

Footnote: A function $f$ is subadditive if for any $x_1, \ldots, x_k$, $f(\sum_{i=1}^{k} x_i) \le \sum_{i=1}^{k} f(x_i)$.
and SSB benchmark datasets as demonstration.

The key idea in Qirana is that it regards a query as an uncertainty reduction mechanism. Initially, a buyer faces a set of possible databases $\mathcal{I}$ defined by a database schema, primary keys and predefined constraints. Once a buyer obtains the answer $E$ to a query $Q$, all possible databases $D$ such that $E \ne Q(D)$ are eliminated. The price assigned to $Q$ should be a function of how much the set of possible databases shrinks. Let $\mathcal{S}$ be the set of possible databases before the query $Q$ is answered. $\mathcal{S}$ is called the support set. Then, a weighted coverage function assigns a weight $w_i$ to every $D_i \in \mathcal{S}$, and computes the price of a query by $p_{wc}(Q, D) = \sum_{D_i \in \mathcal{S}: Q(D_i) \ne Q(D)} w_i$. Alternatively, consider the equivalence relation in $\mathcal{S}$: $D_i \sim D_j$ if and only if $Q(D_i) = Q(D_j)$. Assign to each possible database $D_i \in \mathcal{S}$ a weight $w_i$ such that $\sum_{D_i \in \mathcal{S}} w_i = 1$. Let $\mathcal{P}_Q$ be the set of equivalence classes. For each class $B \in \mathcal{P}_Q$, denote by $w_B = \sum_{D_i \in B} w_i$. The Shannon entropy function is used to compute the price of query $Q$ as the entropy of the query output, $P_H(Q, D) = -\sum_{B \in \mathcal{P}_Q} w_B \log w_B$. The q-entropy function (also known as Tsallis entropy) for $q = 2$ is used to assign to $Q$ the price $P_T(Q, D) = \sum_{B \in \mathcal{P}_Q} w_B (1 - w_B)$. Deep and Koutris [54] show that the weighted coverage function, the Shannon entropy function and the 2-entropy function are all arbitrage-free.

Using the complete set of possible databases as the support set leads to a #P-hard problem. To make the price calculation computationally feasible, Qirana uses uniform random sample and random neighbors as the support

properly to avoid arbitrage is important. Xia and Muthukrishnan [213] consider the following problem. Denote by $q_i$ a selection query over user attributes, by $U_i$ the set of all users satisfying $q_i$, and by $p_i$ the price of each user in $U_i$.
If a buyer purchases $n$ users ($1 \le n \le |U_i|$) in $U_i$, she/he has to pay $n \cdot p_i$. If prices of different queries are not well coordinated, version-arbitrage may arise. If two queries $q_i$ and $q_j$ return similar user sets but $q_i$ is dramatically more expensive than $q_j$, then a user who wants $q_i$ may purchase $q_j$ instead. Xia and Muthukrishnan [213] point out that uniform pricing, that is, every query has the same price, is arbitrage-free, but is a logarithmic approximation to the maximum revenue arbitrage-free pricing solution. Then, they present a greedy non-uniform pricing design. The design starts with the optimal uniform pricing that is arbitrage-free, and then iteratively updates the pricing function. If the price of a query can be updated to increase the revenue, it is increased so that the arbitrage-free property is retained. This greedy algorithm is still a logarithmic approximation to the maximum revenue arbitrage-free pricing solution.

Chen et al. [44] develop an arbitrage-free pricing design for multiple versions of a machine learning model. They assume that a broker trains the optimal model on the complete raw data. Then, random Gaussian noise is added to the optimal model to produce different versions for different buyers. The assumption is that the error of a machine learning model instance is monotonic with respect to the variance of the noise injected into the model. In this setting, a pricing function is arbitrage-free if and only if the price of a randomized model instance is monotonically increasing and subadditive with respect to the inverse of the variance.

As explained in Section 3.2.2, the objective of revenue maximization is often of special interest in designing pricing strategies, since for a business to be successful long term, a more immediate and important requirement is to win

Footnote: Here, “buying a user” is short for purchasing the impression of a user in online advertising and a user email in targeted email advertising, for example.
$O(D)$ approximation algorithm to maximize revenue, where $D$ is the largest minimal demand among all buyers.

Chawla et al. [42] consider query and view based pricing for arbitrage-free revenue maximization under the assumption that all buyers are single-minded and the supply is unlimited. A buyer is single-minded if the buyer wants to purchase the answer to a single set of queries. They consider three types of pricing functions. Uniform bundle pricing sets the price of every bundle identical. Additive or item pricing prices each item and charges a bundle the sum of prices for the items in the bundle. Fractionally subadditive pricing or XOS sets $k$ weights $w_j^1, \ldots, w_j^k$ for each item $j$, and for a bundle $e$, the price is set to $\max_{i=1}^{k} \sum_{j \in e} w_j^i$. Building on the extensive studies on revenue maximization with single-minded buyers and unlimited supply [29, 39, 96], they develop new heuristics.

It is well known that there exists uniform bundle pricing that is an $O(\log m)$ approximation of revenue maximization, where $m$ is the number of bundles. Swamy and Cheung [189] show that item pricing can achieve an $O(\log B)$ approximation of maximum revenue, where $B$ is the maximum number of bundles an item can be involved in. Chawla et al. [42] show some new lower bounds, that is, uniform bundle pricing, item pricing and XOS pricing combining a constant number of item pricing functions are still $\Omega(\log m)$ away from maximum revenue. They also present approximation algorithms.

To maximize revenue in machine learning models, Chen et al. [44] show that the optimization problem is coNP-hard. Thus, they relax the subadditive constraint $p(x + y) \le p(x) + p(y)$ by $q(x)/x \ge q(y)/y$ for every $0 < x \le y$, and turn to finding a pricing function $q()$ such that $q(x)/x$ is decreasing with respect to $x$.
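A short derivation may help to see why the ratio condition is convenient to work with. It is a sketch that is not taken from [44] and assumes only that $q$ is defined on the positive reals: if $q(x)/x \ge q(y)/y$ whenever $0 < x \le y$, then $q$ is subadditive. For any $x, y > 0$, both $x \le x + y$ and $y \le x + y$ hold, so

\[
q(x+y) \;=\; x \cdot \frac{q(x+y)}{x+y} + y \cdot \frac{q(x+y)}{x+y}
\;\le\; x \cdot \frac{q(x)}{x} + y \cdot \frac{q(y)}{y}
\;=\; q(x) + q(y).
\]

Hence any pricing function satisfying the ratio condition also satisfies the original constraint, while the ratio condition only compares a price point with larger ones and is therefore easier to enforce over a sorted list of price points.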
They show that, for every well standing pricing function $p()$, there exists a pricing function $q()$ with the relaxed subadditive constraint such that $p(x)/2 \le q(x) \le p(x)$, and $q(x)$ can be computed using dynamic programming in $O(n^2)$ time, where $n$ is the number of interpolated price points.

Fairness and truthfulness are important for data product markets. Recall that fairness means that the revenue generated by a sale transaction in the data market is distributed among sellers in an unprejudiced manner so that they are paid for their marginal contributions. Truthfulness means that buyers in the market are well motivated to report their internal valuations of data products truthfully.

Agarwal et al. [7] propose a mathematical model of data marketplaces that are fair, truthful, revenue maximizing, and scalable. They assume each seller $j$ supplies a data stream $X_j$ and each buyer $n$ conducts a prediction task $Y_n$, where $X_j, Y_n \in R^T$. For example, $X_j$ may be a stream of customers’ interest in different products, and $Y_n$ is a task predicting a new customer’s interest. Taking a prediction task $Y_n$ and an estimate $\hat{Y}_n$, a prediction gain function $\mathcal{G}_n : R^{2T} \to [0, 1]$ measures the quality of the prediction. The value that buyer $n$ gets from estimate $\hat{Y}_n$ is $\mu_n \cdot \mathcal{G}(Y_n, \hat{Y}_n)$, where $\mu_n$ is the price rate that the buyer is willing to pay for a unit increase in $\mathcal{G}$. A machine learning model $\mathcal{M} : R^{MT} \to R^T$ uses data from $M$ sellers to produce an estimate $\hat{Y}_n$ for buyer $n$’s prediction task $Y_n$. Let $p_n$ and $b_n$ be the price and the bid, respectively. Then, the allocation function $AF : (p_n, b_n; X_M) \to \tilde{X}_M$ determines the quality at which buyer $n$ is allocated the data on sale $X_M$, where $\tilde{X}_M \in R^M$. The revenue function $RF : (p_n, b_n, Y_n; \mathcal{M}, \mathcal{G}, X_M) \to r_n$ calculates how much revenue $r_n \in R_+$ to extract from the buyer. The utility that buyer $n$ receives by bidding $b_n$ for $Y_n$ is

$U(b_n, Y_n) = \mu_n \cdot \mathcal{G}(Y_n, \hat{Y}_n) - RF(p_n, b_n, Y_n)$,

where $\hat{Y}_n = \mathcal{M}(Y_n, \tilde{X}_M)$ and $\tilde{X}_M = AF(p_n, b_n; X_M)$. A market is truthful if for all prediction tasks $Y_n$, $\mu_n = \arg\max_{z \in R_+} U(z, Y_n)$. They adopt the notion of fairness following the famous Shapley fairness [184].

One main result [7] is that the data market defined as such is truthful if and only if function $AF^*$ is monotonic, that is, an increase in the difference between price rate $p_n$ and bid $b_n$ leads to a decrease in prediction gain $\mathcal{G}$. They also give randomized $\epsilon$-approximation algorithms for fair data markets, that is, $\|\psi_{n,\text{Shapley}} - \hat{\psi}_n\|_\infty < \epsilon$ with probability $1 - \delta$, where $\psi_{n,\text{Shapley}}$ is the Shapley-fair payment division among sellers, $\hat{\psi}_n$ is the output of the approximation algorithm, and $\delta, \epsilon > 0$. Their algorithms are polynomial.

Shapley fairness [184] is popularly adopted as the foundation of fairness in data markets. However, computing Shapley values exactly takes exponential time [57]. Maleki et al. [141] present a permutation sampling method that approximates Shapley values for any bounded utility function. The basic idea is to use Equation 2 and estimate $\psi(s) = E[U(P_s^\pi \cup \{s\}) - U(P_s^\pi)]$ by the sample mean, where $P_s^\pi$ is the set of sellers preceding $s$ in a random permutation $\pi$. Following Hoeffding’s inequality [104], to achieve an $(\epsilon, \delta)$-approximation, that is, $P(\|\hat{s} - s\|_p \le \epsilon) \ge 1 - \delta$, where $\hat{s}$ is the estimate, we need $O(\frac{r^2 N}{\epsilon^2} \log \frac{N}{\delta})$ samples and evaluate the utility function $O(N^2 \log N)$ times, where $r$ is the range of the utility function $U$.

Jia et al. [112] present approximation algorithms for Shapley value that can substantially reduce the number of times that the utility function is evaluated. First, they apply the idea of feature selection using group testing [60, 225]. For seller $s$, let $\beta_s$ be the indicator random variable of whether $s$ appears in a random sample of sellers. Then, for sellers $s_i$ and $s_j$, the difference in Shapley values between $s_i$ and $s_j$ is

$\psi(s_i) - \psi(s_j) = \frac{1}{N-1} \sum_{S \subseteq D \setminus \{s_i, s_j\}} \frac{U(S \cup \{s_i\}) - U(S \cup \{s_j\})}{\binom{N-2}{|S|}} = E[(\beta_{s_i} - \beta_{s_j}) U(\beta_{s_1}, \ldots, \beta_{s_N})]$,

where $U(\beta_{s_1}, \ldots, \beta_{s_N})$ is the utility computed using the sellers appearing in the random sample. They can use group testing to first estimate the Shapley differences and then derive the Shapley values from the differences by solving a feasibility problem. They show that this algorithm is an $(\epsilon, \delta)$-approximation that evaluates the utility function at most $O(\sqrt{N} (\log N)^2)$ times. They further observe that most of the Shapley values are around the mean.
Exploiting this approximate sparsity, they give an $(\epsilon, \delta)$-approximation algorithm that evaluates the utility function only $O(N (\log N) \log(\log N))$ times.

Ghorbani and Zou [85] propose a principled framework of fair data valuation in supervised learning, together with Monte-Carlo and gradient-based approximation methods. Their Monte-Carlo method follows a general idea similar to that in Jia et al. [112]. They generate Monte-Carlo estimates until the average empirically converges. They also argue that, in practice, it is sufficient to estimate Shapley values up to the intrinsic noise in the predictive performance on the test data set. Adding one tuple as a training data point does not significantly affect the performance of a model trained using a large training data set. Therefore, truncation can be used in practice based on the bootstrap variation on the test set. In their gradient Shapley method, they train a model using one “epoch” of the training data, and then update the model by gradient descent on one data point at a time, where the marginal contribution is the change in the performance of the model.

In general, computing Shapley values requires an exponential number of model evaluations. However, for some specific models, the computation may be reduced dramatically. For example, Jia et al. [111] show that for unweighted kNN classifiers, the exact computation needs only $O(N \log N)$ time and an $(\epsilon, \delta)$-approximation can be achieved in $O(N^{h(\epsilon, k)} \log N)$ time when $\epsilon$ is not too small and $k$ is not too large. They also propose a Monte-Carlo approximation of $O(N (\log N)^2 / (\log k)^2)$ for weighted kNN classifiers. A key enabler of the progress is the specific utility function of a kNN classifier,

$U_{kNN}(S) = \frac{1}{k} \sum_{i=1}^{\min\{k, |S|\}} \mathbb{1}[y_{\alpha_i(S)} = y_{test}]$,

where $\alpha_i(S)$ is the index of the training feature that is the $i$-th closest to $x_{test}$ among the training examples in $S$.
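To make the kNN utility concrete, the following minimal sketch evaluates $U_{kNN}(S)$ for a single test point. It assumes squared Euclidean distance and breaks distance ties by input order; both choices, as well as the function name `knn_utility` and the toy data, are illustrative assumptions rather than details from [111].

```python
from typing import List, Sequence, Tuple

# A training example is a (feature vector, label) pair.
Example = Tuple[Sequence[float], int]


def knn_utility(
    train: List[Example],
    x_test: Sequence[float],
    y_test: int,
    k: int,
) -> float:
    """U_kNN(S): (1/k) * number of the min(k, |S|) nearest examples labeled y_test."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # sorted() is stable, so ties keep their input order (an assumption here).
    ranked = sorted(train, key=lambda ex: sq_dist(ex[0], x_test))
    nearest = ranked[: min(k, len(ranked))]
    return sum(1 for _, label in nearest if label == y_test) / k


# Toy one-dimensional training set.
toy_train: List[Example] = [((0.0,), 1), ((1.0,), 0), ((2.0,), 1), ((5.0,), 1)]

utility_all = knn_utility(toy_train, (0.4,), 1, 3)
utility_one = knn_utility(toy_train[:1], (0.4,), 1, 3)
```

The utility depends on $S$ only through the labels of its nearest neighbors to the test point, so adding a far-away example to a set that already has $k$ closer ones leaves the utility unchanged. This is the kind of structure that makes exact Shapley computation tractable for kNN.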
Moreover, the sublinear approximation for unweighted kNN classifiers is facilitated by locality sensitive hashing [52].

Recently, Jia et al. [113] leverage the efficient computation of Shapley values in kNN [111] to tackle general classification problems. They propose to first train a target model, such as a deep neural network, and identify the features. Then, they conduct a model distillation to kNN by training a kNN classifier using the features to mimic the performance of the original model and tune parameter $k$, the number of nearest neighbors considered. Last, they apply the Shapley value estimation method in kNN [111] to approximate the Shapley values in the target model.

Many classic rewarding methods, such as Shapley values, may be vulnerable to data-replication attacks. One data provider may replicate its data and act as an additional provider to obtain extra, undeserved rewards. To prevent data-replication attacks from happening, replication-robust payoff mechanisms are proposed. Han et al. [100] propose a fix to Shapley value based payoff mechanisms. The idea is to down-weight the Shapley value: a data provider gets a smaller reward if there are multiple copies of its data in the coalitions.

Related to fairness and truthfulness in a market, cooperation among different agents in a market may happen. Building trust in a sub-community within a data marketplace becomes an interesting subject. Armstrong and Durfee [17] analyze factors that may influence the efficiency of building trust and conducting cooperation in a data market. For each agent in a market, the other agents can be divided into two categories, namely remembered agents and strange or forgotten agents. They have a few interesting findings.
Cooperation arising from iterated interactions is inversely proportional to the rate of system mixing, the number of initially misbehaving agents, and the rate at which agents explore alternative strategies. Cooperation is also initially inversely proportional to population size. At the same time, cooperation is proportional to average memory size and better estimation of the likelihood of strange agents to misbehave.

Privacy is a serious concern and also a critical tipping point in designing marketplaces of data. When a user shares her/his data with some others, the user may disclose her/his privacy to some extent. Therefore, it is important to explore how to protect or minimize the privacy leakage. At the same time, it is also important to understand how a seller’s privacy disclosure may be properly compensated through data pricing.

Ghosh and Roth [87] design truthful marketplaces where data buyers want to purchase data to estimate statistics and sellers want compensation for their privacy loss. In the design, there is only one query and the individuals’ valuations of their data are private. Data owners are asked to report the costs for the use of their data. Under the assumption of differential privacy [61, 62], they transform the problem into variants of multi-unit procurement auction. They show that, when a buyer holds an accuracy goal, the classic Vickrey auction can minimize the buyer’s total cost and guarantee the accuracy. When the buyer has a budget, they give an approximation algorithm to maximize the accuracy under the budget constraint.

The method by Ghosh and Roth [87] may not work well when the costs and the data are correlated. For example, a store with more customer traffic may request a higher cost for using the data. Correspondingly, reporting the cost may reveal the privacy of the store. Fleischer and Lyu [74] tackle the scenario where costs are correlated with data and propose a posted-price-like mechanism.
Given a set of data sellers categorized into different types and the associated distributions of costs, the mechanism offers each seller a contract with the expected payment corresponding to the type. If a seller takes the offer, the payment is determined by the seller’s verifiable type and the associated payment in the contract. All sellers have the same probability to take or reject their contracts independently. The sellers are truthful, that is, a seller takes the offer if the payment is larger than or equal to the privacy loss. This posted-price-like mechanism is Bayesian incentive compatible (i.e., every seller’s strategy is a Bayesian-Nash equilibrium), ex-interim individually rational (i.e., the expected utility is non-negative for every seller when the seller decides truthfully), $O(\epsilon^{-1})$-accurate, perfectly data private (i.e., whenever the mechanism’s posterior belief about a seller’s data differs from its prior belief, the mechanism pays the seller) and $\epsilon$-differentially private.

Li et al. [133] tackle the same problem as Ghosh and Roth [87] do, but assume that individual valuations are public and focus on returning unbiased estimations and pricing multiple queries consistently. To address the concerns on privacy loss, they develop a theoretical framework to divide the price among data owners who contribute to the aggregate computation and thus have loss of privacy. Their framework extends several principles from both differential privacy and query pricing in data markets.

The fairness mechanism considered by Li et al. [133] only compensates a seller whose data are used. Niu et al. [163] further consider the scenario where multiple sellers’ data are correlated and extend to dependent fairness. In dependent fairness, a seller $s$ is still compensated if the data of another seller $s'$ are used and are correlated with the data of $s$. They propose two approaches to privacy compensation.
In the bottom-up approach, the broker first satisfies each individual seller’s privacy compensation and then decides the price for the statistic sold to a buyer. In the top-down design, the broker decides the total price of a data aggregate product sold to a buyer, and then sets aside a fraction of the total price for privacy compensation. The privacy compensation is divided and assigned to individual data sellers by solving a budget allocation problem. Each seller receives a compensation roughly proportional to the privacy loss due to the data sharing. Niu et al. [161] further extend to time series data that may have temporal correlations. They adopt Pufferfish privacy [117] to measure privacy losses under temporal correlations.

While various efforts have been made to address the challenges of privacy loss compensation when user data are correlated in one way or another, as Ghosh and Roth [87] point out, in general, it is impossible for any mechanism to compensate individuals for privacy loss properly if correlations between their private data and their cost functions are unknown beforehand.

In the classical setting of physical goods [143], using contract theory [142] with hidden information, that is, unobservable types of buyers, a seller can design a set of contracts with different consumption levels to maximize revenue from buyers. Naghizadeh and Sinha [154] extend the contract design model to price a bundle of queries at different privacy levels to maximize revenue. They also consider adversarial users. Their work also adopts differential privacy [61, 62]. For a query bundle $\{Q_1, \ldots, Q_k\}$, a contract is a tuple $(p, \epsilon, s)$, where $p > 0$ is the price, $\epsilon$ is the privacy budget, such that a buyer can get an answer to query $Q_i$ ($1 \le i \le k$) with $\epsilon_i$-differential privacy guarantee and $\epsilon \ge \sum_{i=1}^{k} \epsilon_i$, and $s$ is the post-hoc fine to be paid if the buyer is found misusing the query answers.
It is assumed that an adversarial buyer derives a benefit $C(\epsilon)$, which is monotonically increasing and convex, with $C(0) = 0$. One interesting finding is that, in traditional contract theory, if there are $n$ types of honest buyers and one type of adversarial buyers, the seller should design up to $n + 1$ contracts. In the data marketplace situation, they show that up to $n$ contracts are sufficient. In other words, a data seller should not design a contract for the adversary. Instead, the seller should adjust the contracts' pricing to account for the risks from adversarial users. They also design post-hoc fines in pricing query bundles that can help to reduce loss due to privacy leakage by adversarial buyers. They provide a fast approximation algorithm to compute the contracts.

A data owner has to decide a tradeoff between privacy and data utility. Li and Raghunathan [135] design an economics-based incentive-compatible mechanism for a data owner to price and disseminate private data. Specifically, let the two-part tariff pricing function $R(s, x) = \alpha_s + \beta_s x$ be the price for $x$ amount of data at sensitivity level $s$, where $\alpha_s$ and $\beta_s$ are the fixed and variable price factors, respectively. Assuming two types of data users, one type for aggregate information and patterns in data and the other type for individual identity and personal information, the proposed mechanism works in four stages. First, the data owner selects a variety of sensitivity types to offer. Second, the data owner offers different prices for data with different sensitivity types. Third, a data user selects a certain sensitivity type with the corresponding price, and thus reveals the user type. Last, the data user selects the optimal amount of data with the chosen sensitivity type. The core idea is that the data owner can identify the sensitive attributes in the data, such as the identifying attributes, which are not useful for aggregate analysis but necessary for individual communication.
A data owner can offer a lower price for data without sensitive attributes, and charge a higher price for data with sensitive attributes. This approach provides an orthogonal idea to the popular ways of tuning the parameter in differential privacy.

Due to privacy concerns, when a company has opportunities to collect data about its customers, should it do so (i.e., collecting and revealing the data) or not (i.e., a blanket policy of never collecting)? Jaisingh et al. [109] find that the company should not collect customer data if the total gains from trading the data cannot cover the privacy loss. In practice, there is an increasing tendency for consumers to overestimate their loss of privacy, particularly when the use of the private data is uncertain. In other cases, the company should offer two contracts on its services and products. One contract collects the customer data at a certain price, and the other contract does not collect any customer data at a different price.

While most of the studies on privacy preserving data marketplaces focus on the privacy of data owners, transactions may also disclose private information about data buyers, such as what, when and how much they buy. For example, a retail company purchasing query results may regard what queries it purchases (e.g., the products or customer groups involved in the queries), when (e.g., the periods that the queries concern), and how much data it purchases as private, and may want to keep the information confidential from all others, including the data sellers and the broker. Aiello et al. [11] design a mechanism such that after making an initial deposit and maintaining a sufficient balance, a buyer can engage in an unlimited number of price-oblivious transfer protocols where the sellers and the broker cannot know anything other than the amount of interaction and the initial deposit amount. The broker cannot even know the buyer's current balance and when the buyer's balance runs out.
This is achieved by adapting conditional disclosure [83] to the two-party setting.

Distribution and use of private data are another important step where privacy may leak. Hynes et al. [107] demonstrate Sterling, a decentralized marketplace for private data, which supports privacy-preserving distribution and use of data. The central technical idea comes from privacy-preserving smart contracts on a permissionless blockchain. To provide strong security and privacy guarantees, they combine blockchain smart contracts, trusted execution environments and differential privacy. Particularly, smart contracts allow enforcement of constraints on data usage and enable payments and rewards.

The demand for data pricing arises in many novel application scenarios. In this subsection, we particularly discuss three emerging situations: dynamic data pricing, online pricing and pricing in federated learning.

Many applications are built on dynamic and online data. How to price temporal views on data streams properly is an important issue for practical data markets. One central task is to estimate and optimize the operational costs, which are the costs to evaluate queries of different users on the fly. The pricing decisions involve not only data sellers but also data buyers. For example, suppose two data buyers $b_1$ and $b_2$ purchase two queries $q_1$ and $q_2$, such that $q_2$ can be written as a further selection on top of $q_1$ (e.g., $q_1$ is about all customers in North America, while $q_2$ keeps all the same as $q_1$ but focuses on only customers in Canada). The optimal pricing of $q_1$ and $q_2$ should take advantage of the overlap between the two queries so that the sharing can save the operational costs, and, at the same time, be fair to $b_1$ and $b_2$.

Al-Kiswany et al. [12] propose a greedy method that enumerates all possible sharing plans and selects the one with the minimum additional cost. It does not come with any quality guarantee. Liu and Hacigümüş [140] propose an improved method that takes some risk in the sharing plan.
If the costs of the previous sharings have already accumulated to a high level, and the additional cost of a new sharing (i.e., the risk) is moderate and can be amortized well by the previous sharings, then the new sharing may be taken. They also give five rules to ensure fair pricing. Let $AC(S)$ be the cost attributed to a sharing $S$. First, for two identical sharings $S_1 = S_2$, $AC(S_1) = AC(S_2)$ should hold. Second, for any sharing $S$, $AC(S)$ should be no higher than the lowest cost of $S$ if no other sharing exists. Third, for two sharings $S_1$ and $S_2$, if the query of $S_1$ is contained by the query of $S_2$, that is, the result of $S_1$ is a subset of the result of $S_2$, and the lowest cost of $S_1$ is smaller than the lowest cost of $S_2$ if no other sharing exists, then $AC(S_1) \le AC(S_2)$. Fourth, a sharing plan with common subexpressions with other sharings should be compensated. Last, the cost of the global plan should be equal to the sum of costs attributed to all sharings.

In order to purchase dynamic data, a buyer may have to call a seller's API repeatedly. A buyer may have to pay for the same data multiple times. Upadhyaya et al. [195] explore how to modify APIs to achieve optimal history-aware pricing, that is, buyers are charged only once for data purchased and not updated. The central idea is the introduction of the notion of refund: a user can ask for refunds of data that she/he has bought before. For each query, the seller issues a coupon in addition to the query result, where the coupon records the identity information of the data in the query result. Specifically, a coupon $c = ((id, uid, v), \tau, H(id \oplus \tau \oplus \kappa))$, where $id$ is a tuple identifier, $uid$ is a user-id, $v$ is a version-id that is monotonically increasing, $\tau$ is a query identifier that is also monotonically increasing, $H$ is a cryptographic hash function [59], such as SHA-1, SHA-256 and SHA-3, and $\kappa$ is a secret key only known to the seller.
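The coupon construction of Upadhyaya et al. [195] can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Coupon`, `issue_coupon`, `is_authentic` and `refundable` are hypothetical, identifiers and the key are modeled as non-negative integers below $2^{256}$, and SHA-256 is picked as the hash $H$. The refund test encodes the rule of the scheme that two coupons from different purchases covering the same $(id, uid, v)$ triple entitle the buyer to a refund.

```python
import hashlib
import hmac
from typing import NamedTuple


class Coupon(NamedTuple):
    tuple_id: int  # id: tuple identifier
    uid: int       # user identifier
    v: int         # version of the tuple
    tau: int       # query identifier
    tag: str       # H(id XOR tau XOR kappa), hex-encoded


def _tag(tuple_id: int, tau: int, kappa: int) -> str:
    # Assumption: all values fit in 256 bits, so the XOR fits in 32 bytes.
    return hashlib.sha256((tuple_id ^ tau ^ kappa).to_bytes(32, "big")).hexdigest()


def issue_coupon(tuple_id: int, uid: int, v: int, tau: int, kappa: int) -> Coupon:
    """Seller side: attach a keyed tag that only the holder of kappa can produce."""
    return Coupon(tuple_id, uid, v, tau, _tag(tuple_id, tau, kappa))


def is_authentic(c: Coupon, kappa: int) -> bool:
    """Seller side: recompute the tag and compare in constant time."""
    return hmac.compare_digest(c.tag, _tag(c.tuple_id, c.tau, kappa))


def refundable(c1: Coupon, c2: Coupon, kappa: int) -> bool:
    """Two authentic coupons from different queries on the same (id, uid, v)."""
    same_data = (c1.tuple_id, c1.uid, c1.v) == (c2.tuple_id, c2.uid, c2.v)
    return (is_authentic(c1, kappa) and is_authentic(c2, kappa)
            and c1.tau != c2.tau and same_data)


# Hypothetical usage: the same tuple version bought twice, then once after an update.
KAPPA = 0x5EED_CAFE_F00D
first = issue_coupon(tuple_id=7, uid=42, v=3, tau=100, kappa=KAPPA)
second = issue_coupon(tuple_id=7, uid=42, v=3, tau=101, kappa=KAPPA)
updated = issue_coupon(tuple_id=7, uid=42, v=4, tau=102, kappa=KAPPA)
```

Here `refundable(first, second, KAPPA)` holds because both purchases returned version 3 of tuple 7, whereas `updated` refers to a newer version and does not qualify. The sketch checks only authenticity and equality of the triple; it says nothing about arbitrage.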
If a buyer gets two coupons $c_1$ and $c_2$ in two different purchases such that $c_1[(id, uid, v)] = c_2[(id, uid, v)]$, then the buyer can ask the seller for a refund by showing the two coupons. As pointed out by Deep and Koutris [55], the refund mechanism does not provide any arbitrage-free guarantee.

Qirana [55, 56] can support history-aware pricing. To incorporate a query history, suppose a buyer has already purchased queries $\mathbf{Q} = (Q_1, \ldots, Q_k)$ and paid a total of $p(\mathbf{Q}, D)$ so far. When a new query $Q_{k+1}$ comes, let the support set $\mathcal{S}_{k+1} = \{D_i \in \mathcal{S} \mid \mathbf{Q}(D_i) = \mathbf{Q}(D),\ Q_{k+1}(D_i) \ne Q_{k+1}(D)\}$. Then, the new total price is $p((Q_1, \ldots, Q_k, Q_{k+1}), D) = p(\mathbf{Q}, D) + \sum_{D_i \in \mathcal{S}_{k+1}} w_i$. This history-aware pricing function is shown to be arbitrage-free.

Zheng et al. [223] consider online pricing for mobile crowd-sensing data markets. Different from most of the work on data markets, they assume that data providers are distributed in space and there are three types of spatial queries from buyers, namely single-data query (e.g., inquiring the value at a specific location), multi-data query (e.g., inquiring the mean in a region) and range query (e.g., inquiring the probability that the data at a region falls in a given range). The vendor uses raw data from data providers and produces a statistical model through a Gaussian process to answer queries. To form different versions of data products, the vendor generates different conditional Gaussian distributions with respect to locations and uses the conditional entropy to quantify the quality of the versions. They propose a randomized online pricing strategy so that the price can adapt to the historical queries. They show that the pricing mechanism is arbitrage-free and is a constant factor approximation of revenue maximization.

Niu et al.
[162] consider an online data market where a query may be sold to different buyers at different times and the broker can adjust prices over time. The objective is to maximize the broker's cumulative revenue by posting reasonable prices for sequential queries. They design a contextual dynamic pricing mechanism with the reserve price constraint. The central idea is to use the properties of ellipsoids for efficient online optimization. Their method can support both linear and non-linear market value models with uncertainty.

Federated learning [146, 147] trains a machine learning model across multiple decentralized parties, where each party holds local data without any peer-wise data exchanging. The parties and their data sets are often ordered in a federated learning process. To accommodate the participation order and value data in federated learning, Wang et al. [205] develop the federated Shapley value. Let $I$ be the set of participants and $U$ be the utility function, where $U(A + B)$ is the utility of training first on $A$ and then on $B$. For participant $i$ at round $t$ in a federated learning process, the federated Shapley value is

$$\psi_t(i) = \frac{1}{|I_t|} \sum_{S \subseteq I_t \setminus \{i\}} \frac{1}{\binom{|I_t| - 1}{|S|}} \left[ U(I_{t-1} + (S \cup \{i\})) - U(I_{t-1} + S) \right]$$

if $i \in I_t$ and $\psi_t(i) = 0$ otherwise. The federated Shapley value of a party is the sum of the values of all rounds, that is, $\psi(i) = \sum_{t=1}^{T} \psi_t(i)$.

Wang et al. [205] show that the federated Shapley values have instantaneous group rationality, that is, $\sum_{i \in I_t} \psi_t(i) = U(I_t) - U(I_{t-1})$. The fairness is guaranteed at each round. That is, for any two parties $i$ and $j$, $\psi_t(i) = \psi_t(j)$ at round $t$ if $\forall S \subseteq I_t \setminus \{i, j\}$, $U(I_{t-1} + (S \cup \{i\})) = U(I_{t-1} + (S \cup \{j\}))$. Moreover, for any party $i$ at round $t$, $\psi_t(i) = 0$ if $\forall S \subseteq I_t \setminus \{i\}$, $U(I_{t-1} + (S \cup \{i\})) = U(I_{t-1} + S)$. They also extend the previous Shapley value approximation techniques to compute federated Shapley values.

Sim et al.
[186] consider the more general situation of collaborative machine learning and advocate using information gain as the utility function. For a model $\theta$ trained on data $D$, the information gain is $I(\theta; D) = H(\theta) - H(\theta \mid D)$, which is the reduction in uncertainty. They generalize to $\rho$-Shapley fairness by assigning a reward $r_i = k \psi_i^{\rho}$ to a party $i$. By tuning parameter $\rho$, they can trade off among Shapley fairness, individual rationality, stability of the grand coalition and group welfare.

Hu and Gong [105] consider privacy leaking in federated learning and design an incentive mechanism to compensate the cost of privacy leakage of the users that are most likely to provide reliable data. Their problem is formulated in a two-stage Stackelberg game [200]. Richardson et al. [176] use influence functions to reward data contributions to linear regression in the federated learning setting.

In this section, we review the topic of pricing data products. We first analyze the structures, players, and ways to produce data products in data marketplaces. Then, we examine several important areas in pricing data products, including arbitrage-free pricing, revenue maximization pricing, fair and truthful pricing and privacy preserving pricing. We also discuss how to price dynamic data and online pricing. When pricing data products in a data marketplace, these considerations are typically incorporated and integrated in one way or another.
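As a concrete illustration of the per-round federated Shapley value of Wang et al. [205] reviewed in this section, the following sketch enumerates all subsets of one round's participants. It is a brute-force illustration under stated assumptions rather than the authors' algorithm: the function names, the toy square-root utility and the data sizes are hypothetical, and $I_{t-1}$ is read as the participants already trained on before round $t$ (the `prefix` argument).

```python
from itertools import combinations
from math import comb, sqrt


def federated_shapley_round(i, round_participants, prefix, utility):
    """Per-round federated Shapley value of participant i.

    round_participants: set of participants in this round (I_t).
    prefix: participants trained on before this round (assumed reading of I_{t-1}).
    utility(prefix, coalition): utility of training on prefix, then on coalition.
    """
    if i not in round_participants:
        return 0.0
    others = sorted(round_participants - {i})
    n = len(round_participants)
    total = 0.0
    for size in range(len(others) + 1):
        weight = 1.0 / comb(n - 1, size)  # 1 / C(|I_t| - 1, |S|)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Marginal contribution of i on top of prefix + S.
            total += weight * (utility(prefix, s | {i}) - utility(prefix, s))
    return total / n


# Hypothetical toy setting: utility grows with the square root of the data seen.
sizes = {"a": 100, "b": 300, "c": 300}


def toy_utility(prefix, coalition):
    return sqrt(sum(sizes[p] for p in prefix) + sum(sizes[p] for p in coalition))


prefix = ("a",)                  # trained on in earlier rounds
round_t = frozenset({"b", "c"})  # participants of the current round
values = {p: federated_shapley_round(p, round_t, prefix, toy_utility)
          for p in sorted(sizes)}
```

In this toy setting, `values["b"]` and `values["c"]` coincide because the two parties are interchangeable, `values["a"]` is zero because `a` does not participate in the round, and the values of the round's participants add up to the utility gained in the round. The enumeration is exponential in the number of participants per round, which is why approximation techniques are needed in practice.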

Discussion and Open Challenges

Data pricing comes from practical demands and has been tackled in multiple disciplines. Although there is a rich body of literature addressing a series of issues in data pricing, many questions remain unexplored. In this section, we discuss some interesting challenges for possible future work. By no means is our list exhaustive. Instead, we hope our discussion can attract more extensive interest and research effort into this fast growing area.

At the macro level, although many studies focus on different steps in data marketplaces, we clearly observe a lack of systematic investigation on data supply chains and development of end-to-end solutions. As data products are abundant and diversified, to develop ecologically sustainable marketplaces, supply chains of data products have to be built. Here, we introduce and advocate the notion of data supply chains, which connect all parties involved in data production and consumption, including data providers, data processors, data analysts, data product and services consumers and other possible roles. Each party in a data supply chain connects its upstream providers and its downstream consumers, provides its value-added contributions and obtains rewards. Feedback mechanisms through pricing and marketing have to be created in a data supply chain so that supply and consumption can be matched, coordinated and balanced. Most of those problems have not yet been studied thoroughly.

Although the notion of data supply chain is not mentioned in the literature, some specific trends and challenges are discussed sporadically. For example, Muschalle et al. [151] identify some trends and challenges in data consumption and marketplaces. First, they assert that many data processing tasks are essential for data markets, such as labeling, annotating and aggregating data. Second, data markets will be integrated with numerous application domains. To enable domain data markets, it is important to customize general data processing technologies for niche domains. Third, customers want to have data faster. Thus, it is important to create online data query services and develop corresponding pricing models. Fourth, as there are more data, more data providers and more analysts, a data product may be substituted by others. To hatch a healthy ecological data marketplace, it is important to establish standard data processing mashups to facilitate data product substitution.
Fifth, to maintain a fair data market overall, it is important to provide price transparency so that data product providers have to optimize their data and data processing/analysis services. Last, customer preferences and experience are critical for data markets.

Recently, Acemoglu et al. [3] present an insightful study on the ecological effect of data markets. They demonstrate that a user's sharing of data may likely reveal some other users' privacy and depress the price of other users' data. The depressed prices lead to excessive data sharing and thus further reduce welfare. Their study suggests the need for mediation in data sharing in data markets.

Most recently, Fernandez et al. [71] analyze the challenges and propose a research agenda around constructing a data market platform to address the sharing, discovery and integration of data among many parties. Their big picture covers both market design and system development. The focus is to create the incentives and mechanisms to connect data supply and demand. As the middlemen, arbiters build data mashups to match data supply and demand. The market platforms advocated by the authors can be regarded as the data exchange mechanisms in a data supply chain.

One challenge associated with the macro view of data supply chain is the interdisciplinary nature of data pricing research. As can be observed in this article, data pricing is studied in many different disciplines, such as economics, marketing, electronic commerce, data management, data mining and machine learning. The communication and dialog among different areas have to be strengthened.

At the micro level, many research problems remain open. We name a few examples of fundamental problems.

First, most of the studies suggest relative prices of data products. Very few studies connect theoretical models with data pricing practice and investigate absolute prices of data products and their marketing effect. As data pricing is a market mechanism and user behavior in practice is hard to model completely, experimental studies of data pricing models are essential and should be connected to theoretical investigations.

Second, pricing is based on valuation and equilibrium among multiple parties. Different parties may have different valuations of data, data products and data services. It is important to systematically establish the principles of value assessment for various parties in data marketplaces, such as data providers, data owners, data users, and data brokers. Moreover, it is important to understand what messages are passed to different parties in data marketplaces through data pricing actions, and how. So far, value assessment of data and negotiations among different parties in data marketplaces have largely not been analyzed in detail.

Third, many pricing models are proposed in the literature. It is important to understand how data pricing models and their assumptions can be implemented and enforced in practice. Specifically, accounting and auditing in data marketplaces are critical to achieve transparency in data pricing and efficiency in data marketplaces. Accounting and auditing in data marketplaces, however, are interesting problems that have not been investigated in depth yet. We need principles, quality guarantees and designs of operational procedures for accounting and auditing in data pricing, transactions and adversary detection.

Fourth, most of the studies on data pricing develop general models. At the same time, as data science transforms many application domains, data pricing has to deal with specific applications.
Mechanisms, regulations and constraints in a specific domain may facilitate data pricing in some aspects, and pose challenges in some other aspects. For example, Jia et al. [111] show that, although fair pricing in general takes exponential computation time, it can be achieved in polynomial time in kNN models (Section 5.4). It is interesting and highly desirable to explore fairness, truthfulness, and privacy preservation of data pricing in specific applications.

Last but not least, almost all applications are dynamic in nature. The values of data, data products and data services may also evolve over time. The changes may be caused by the updates in demands and supplies. It is important to develop mechanisms to capture and monitor changes in demand and supply of data, data products and data services, and explore corresponding dynamic pricing.

References

[1] M. Aazam and E. Huh, “Broker as a service (baas) pricing and resource estimation model,” in , Dec 2014, pp. 463–468.
[2] T. Abdallah, “On the benefit (or cost) of large-scale bundling,”
Production and Operations Management, vol. 28, no. 4, pp. 955–969, 2019. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/poms.12958
[3] D. Acemoglu, A. Makhdoumi, A. Malekian, and A. Ozdaglar, “Too Much Data: Prices and Inefficiencies in Data Markets,” National Bureau of Economic Research, Inc, NBER Working Papers 26296, Sep. 2019. [Online]. Available: https://ideas.repec.org/p/nbr/nberwo/26296.html
[4] A. Acquisti, C. Taylor, and L. Wagman, “The economics of privacy,” Journal of Economic Literature ∼ acquisti/papers/acquisti-REV.pdf
[6] W. J. Adams and J. L. Yellen, “Commodity bundling and the burden of monopoly,” The Quarterly Journal of Economics
Proceedings of the 2019 ACM Conference on Economics and Computation, ser. EC’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 701–726. [Online]. Available: https://doi.org/10.1145/3328526.3329589
[8] C. C. Aggarwal and P. S. Yu, “Privacy-preserving data mining: A survey,” in Handbook of Database Security: Applications and Trends, M. Gertz and S. Jajodia, Eds. Boston, MA: Springer US, 2008, pp. 431–460. [Online]. Available: https://doi.org/10.1007/978-0-387-48533-1_18
[9] G. Aggarwal, A. Fiat, A. V. Goldberg, J. D. Hartline, N. Immorlica, and M. Sudan, “Derandomization of auctions,” in Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, ser. STOC’05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 619–625. [Online]. Available: https://doi.org/10.1145/1060590.1060682
[10] L. Aguiar and J. Waldfogel, “As streaming reaches flood stage, does it stimulate or depress music sales?” International Journal of Industrial Organization, vol. 57, no. C, pp. 278–307, 2018. [Online]. Available: https://ideas.repec.org/a/eee/indorg/v57y2018icp278-307.html
[11] W. Aiello, Y. Ishai, and O. Reingold, “Priced oblivious transfer: How to sell digital goods,” in Advances in Cryptology - EUROCRYPT 2001, International Conference on the Theory and Application of Cryptographic Techniques, Innsbruck, Austria, May 6-10, 2001, Proceeding, ser. Lecture Notes in Computer Science, vol. 2045. Springer, 2001, pp. 119–135. [Online]. Available: https://iacr.org/archive/eurocrypt2001/20450118.pdf
[12] S. Al-Kiswany, H. Hacigümüş, Z. Liu, and J. Sankaranarayanan, “Cost exploration of data sharings in the cloud,” in Proceedings of the 16th International Conference on Extending Database Technology, ser. EDBT’13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 601–612. [Online]. Available: https://doi.org/10.1145/2452376.2452447
[13] S. Alaei, A. Makhdoumi, and A. Malekian, “Optimal subscription planning for digital goods,” SSRN Electronic Journal, 01 2019.
[14] S. Alaei, A. Malekian, and A. Srinivasan, “On random sampling auctions for digital goods,” in Proceedings of the 10th ACM Conference on Electronic Commerce, ser. EC’09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 187–196. [Online]. Available: https://doi.org/10.1145/1566374.1566402
[15] C. Anderson, The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[16] A. Archer, C. Papadimitriou, K. Talwar, and E. Tardos, “An approximate truthful mechanism for combinatorial auctions with single parameter agents,” in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA’03. USA: Society for Industrial and Applied Mathematics, 2003, pp. 205–214.
[17] A. A. Armstrong and E. H. Durfee, “Mixing and memory: Emergent cooperation in an information marketplace,” in Proceedings of the 3rd International Conference on Multi Agent Systems, ser. ICMAS’98. USA: IEEE Computer Society, 1998, p. 34.
[18] M. Armstrong, “A more general theory of commodity bundling,” Journal of Economic Theory, vol. 148, no. 2, pp. 448–472, 2013. [Online]. Available: https://ideas.repec.org/a/eee/jetheo/v148y2013i2p448-472.html
[19] N. Arnosti, M. Beck, and P. Milgrom, “Adverse selection and auction design for internet display advertising,” in Proceedings of the Sixteenth ACM Conference on Economics and Computation
Management Science, vol. 64, pp. 1574–1590, 2018.
[22] J. Auerbach, J. Galenson, and M. Sundararajan, “An empirical analysis of return on investment maximization in sponsored search auctions,” in
Proceedings of the 2nd International Workshop on Data Mining and Audience Intelligence for Advertising, ser. ADKDD’08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 1–9. [Online]. Available: https://doi.org/10.1145/1517472.1517473
[23] M. Babaioff, R. Kleinberg, and R. Paes Leme, “Optimal mechanisms for selling information,” in Proceedings of the 13th ACM Conference on Electronic Commerce, ser. EC’12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 92–109. [Online]. Available: https://doi.org/10.1145/2229012.2229024
[24] P. Bajari and A. Hortacsu, “Economic insights from internet auctions,” Journal of Economic Literature
Manage. Sci., vol. 45, no. 12, pp. 1613–1630, Dec. 1999. [Online]. Available: https://doi.org/10.1287/mnsc.45.12.1613
[26] Y. Bakos and E. Brynjolfsson, “Bundling and competition on the internet,” Marketing Science, vol. 19, no. 1, pp. 63–82, Feb. 2000.
[27] M. Balazinska, B. Howe, P. Koutris, D. Suciu, and P. Upadhyaya, “A discussion on pricing relational data,” in In Search of Elegance in the Theory and Practice of Computation: Essays Dedicated to Peter Buneman, V. Tannen, L. Wong, L. Libkin, W. Fan, W.-C. Tan, and M. Fourman, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 167–173. [Online]. Available: https://doi.org/10.1007/978-3-642-41660-6_7
[28] M. Balazinska, B. Howe, and D. Suciu, “Data markets in the cloud: An opportunity for the database community,” PVLDB, vol. 4, no. 12, pp. 1482–1485, 2011. [Online]. Available: http://dblp.uni-trier.de/db/journals/pvldb/pvldb4.html
Proceedings of the 7th ACM Conference on Electronic Commerce, ser. EC’06. New York, NY, USA: Association for Computing Machinery, 2006, pp. 29–35. [Online]. Available: https://doi.org/10.1145/1134707.1134711
[30] M.-F. Balcan, A. Blum, and Y. Mansour, “Item pricing for revenue maximization,” in Proceedings of the 9th ACM Conference on Electronic Commerce, ser. EC’08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 50–59. [Online]. Available: https://doi.org/10.1145/1386790.1386802
[31] Z. Bar-Yossef, K. Hildrum, and F. Wu, “Incentive-compatible online auctions for digital goods,” in Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA’02. USA: Society for Industrial and Applied Mathematics, 2002, pp. 964–970.
[32] J. Benfield and W. Szlemko, “Internet-based data collection: Promises and realities,” Journal of Research Practice, vol. 2, no. 2, 1 2006.
[33] D. Bergemann and A. Bonatti, “Selling cookies,” American Economic Journal: Microeconomics
American Economic Review
Privacy-Preserving Data Mining: Models and Algorithms, C. C. Aggarwal and P. S. Yu, Eds. Boston, MA: Springer US, 2008, pp. 183–205. [Online]. Available: https://doi.org/10.1007/978-0-387-70992-5_8
[36] S. J. Best and B. S. Krueger, Internet Data Collection, ser. Quantitative Applications in the Social Sciences. Thousand Oaks, CA: SAGE Publications, Inc., 2004.
[37] A. Boom, ““download for free”: When do providers of digital goods offer free samples?” Free University Berlin, School of Business & Economics, Discussion Papers 2004/28, 2004.
[38] R. Brennan, L. Canning, and R. Mcdowell, Business-to-business marketing. Sage Publications, 01 2013.
[39] P. Briest and P. Krysta, “Single-minded unlimited supply pricing on sparse instances,” in
Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, ser. SODA’06. USA: Society for Industrial and Applied Mathematics, 2006, pp. 1093–1102.
[40] E. Brynjolfsson and M. D. Smith, “Frictionless commerce? a comparison of internet and conventional retailers,” Management Science, vol. 46, no. 4, pp. 563–585, 2000. [Online]. Available: https://doi.org/10.1287/mnsc.46.4.563.12061
[41] Y. Cai, C. Daskalakis, and C. Papadimitriou, “Optimum statistical estimation with strategic data sources,” in Proceedings of The 28th Conference on Learning Theory, ser. Proceedings of Machine Learning Research, P. Grünwald, E. Hazan, and S. Kale, Eds., vol. 40. Paris, France: PMLR, 03–06 Jul 2015, pp. 280–296.
[42] S. Chawla, S. Deep, P. Koutris, and Y. Teng, “Revenue maximization for query pricing,” Proc. VLDB Endow., vol. 13, no. 1, pp. 1–14, Sep. 2019. [Online]. Available: https://doi.org/10.14778/3357377.3357378
[43] Y.-K. Che, S. Choi, and J. Kim, “An experimental study of sponsored-search auctions,” Games and Economic Behavior
Proceedings of the 2019 International Conference on Management of Data, ser. SIGMOD’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 1535–1552. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3299869.3300078
[45] L. Chiou and C. Tucker, “Paywalls and the demand for news,” Information Economics and Policy, vol. 25, no. 2, pp. 61–69, 2013. [Online]. Available: https://EconPapers.repec.org/RePEc:eee:iepoli:v:25:y:2013:i:2:p:61-69
[46] L. Chiou and C. Tucker, “Content aggregation by platforms: The case of the news media,” Journal of Economics & Management Strategy, vol. 26, no. 4, pp. 782–805, 2017. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/jems.12207
[47] R. D. Cook, “Detection of influential observation in linear regression,” Technometrics, vol. 19, no. 1, pp. 15–18, Feb. 1977.
[48] R. Cook and S. Weisberg, Residuals and Influence in Regression, ser. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1982. [Online]. Available: https://books.google.ca/books?id=MVSqAAAAIAAJ
[49] R. Cummings, K. Ligett, A. Roth, Z. S. Wu, and J. Ziani, “Accuracy for sale: Aggregating data with a variance constraint,” in Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, ser. ITCS’15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 317–324. [Online]. Available: https://doi.org/10.1145/2688073.2688106
[50] D. Dao, D. Alistarh, C. Musat, and C. Zhang, “Databright: Towards a global exchange for decentralized data ownership and trusted computation,” CoRR, vol. abs/1802.04780, 2018. [Online]. Available: http://arxiv.org/abs/1802.04780
[51] C. Daskalakis, A. Deckelbaum, and C. Tzamos, “Strong duality for a multiple-good monopolist,” Econometrica, vol. 85, no. 3, pp. 735–767, 2017. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA12618
[52] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry, ser. SCG’04. New York, NY, USA: Association for Computing Machinery, 2004, pp. 253–262. [Online]. Available: https://doi.org/10.1145/997817.997857
[53] D. Davydov, S. Izmalkov, and A. Smirnov, “Sponsored-Search Auctions: Empirical and Experimental Works,”
Journal of the NewEconomic Association , vol. 28, no. 4, pp. 56–73, 2015. [Online].Available: https://ideas.repec.org/a/nea/journl/y2015i28p56-73.html 6054] S. Deep and P. Koutris, “The design of arbitrage-free data pricingschemes,”

CoRR , vol. abs/1606.09376, 2016. [Online]. Available:http://arxiv.org/abs/1606.09376[55] ——, “Qirana: A framework for scalable query pricing,” in

Proceedingsof the 2017 ACM International Conference on Management ofData , ser. SIGMOD’17. New York, NY, USA: Association forComputing Machinery, 2017, pp. 699–713. [Online]. Available:https://doi-org.proxy.lib.sfu.ca/10.1145/3035918.3064017[56] S. Deep, P. Koutris, and Y. Bidasaria, “Qirana demonstration:Real time scalable query pricing,”

Proc. VLDB Endow. , vol. 10,no. 12, pp. 1949–1952, Aug. 2017. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.14778/3137765.3137816[57] X. Deng and C. H. Papadimitriou, “On the complexity ofcooperative solution concepts,”

Mathematics of Operations Research ,vol. 19, no. 2, pp. 257–266, 1994. [Online]. Available: https://doi.org/10.1287/moor.19.2.257[58] S. Dibb, L. Simkin, W. M. Pride, and O. Ferrell,

Marketing: Conceptsand Strategies. 5th Edition . Abingdon, UK: Houghton Miﬄin, April2005. [Online]. Available: http://oro.open.ac.uk/2041/[59] W. Diﬃe and M. Hellman, “New directions in cryptography,”

IEEETrans. Inf. Theor. , vol. 22, no. 6, pp. 644–654, Sep. 2006. [Online].Available: https://doi.org/10.1109/TIT.1976.1055638[60] D.-Z. Du and F. K. Hwang,

Combinatorial Group Testing andIts Applications

Theory and Applications of Models of Computation, M. Agrawal, D. Du, Z. Duan, and A. Li, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 1–19.
[62] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography, S. Halevi and T. Rabin, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, pp. 265–284.
[63] T. Ebert, “Applications of recursive operators to randomness and complexity,” Ph.D. dissertation, University of California, Santa Barbara, 1998.
[64] B. Edelman and M. Ostrovsky, “Strategic bidder behavior in sponsored search auctions,” Decision Support Systems, vol. 43, no. 1, pp. 192–198, Feb. 2007. [Online]. Available: https://doi.org/10.1016/j.dss.2006.08.008
[65] B. Edelman, M. Ostrovsky, and M. Schwarz, “Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords,” American Economic Review

Journal of Political Economy, vol. 126, no. 1, pp. 178–215, 2018. [Online]. Available: https://ideas.repec.org/a/ucp/jpolec/doi10.1086-695529.html
[67] H. Elmeleegy, Y. Li, Y. Qi, P. Wilmot, M. Wu, S. Kolay, A. Dasdan, and S. Chen, “Overview of turn data management platform for digital advertising,” Proc. VLDB Endow., vol. 6, no. 11, pp. 1138–1149, Aug. 2013. [Online]. Available: https://doi.org/10.14778/2536222.2536238
[68] R. Engelbrecht-Wiggans, “Auctions and bidding models: A survey,” Management Science

Journal of Economic Perspectives

Proceedings of the First International Conference on Internet and Network Economics, ser. WINE’05. Berlin, Heidelberg: Springer-Verlag, 2005, pp. 878–886. [Online]. Available: https://doi.org/10.1007/11600930_89
[71] R. C. Fernandez, P. Subramaniam, and M. J. Franklin, “Data market platforms: Trading data assets to solve data problems,” Proc. VLDB Endow., vol. 13, no. 12, pp. 1933–1947, Jul. 2020. [Online]. Available: https://doi.org/10.14778/3407790.3407800
[72] M. A. Ferrag, L. Maglaras, and A. Ahmim, “Privacy-preserving schemes for ad hoc social networks: A survey,” IEEE Communications Surveys & Tutorials, vol. 19, no. 4, pp. 3015–3045, 2017.
[73] F. Ferreira and J. Waldfogel, “Pop internationalism: Has half a century of world music trade displaced local culture?” Economic Journal, vol. 123, no. 569, pp. 634–664, Jun. 2013.
[74] L. K. Fleischer and Y.-H. Lyu, “Approximately optimal auctions for selling privacy when costs are correlated with data,” in Proceedings of the 13th ACM Conference on Electronic Commerce, ser. EC’12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 568–585. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2229012.2229054
[75] S. A. Fricker and Y. V. Maksimov, “Pricing of data products in data marketplaces,” in Software Business, A. Ojala, H. Holmström Olsson, and K. Werder, Eds. Cham: Springer International Publishing, 2017, pp. 49–66.
[76] T. L. Friedman, The World Is Flat: A Brief History of the Twenty-First Century, 1st ed. New York: Farrar, Straus and Giroux, 2005.
[77] D. Fudenberg and J. M. Villas-Boas, “Price discrimination in the digital economy,” in The Oxford Handbook of the Digital Economy

ACM Comput. Surv., vol. 42, no. 4, Jun. 2010. [Online]. Available: https://doi.org/10.1145/1749603.1749605
[79] J. M. Gallaugher, P. Auger, and A. BarNir, “Revenue streams and digital content providers: an empirical investigation,” Information & Management

Internet and Network Economics, X. Deng and F. C. Graham, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 541–548.
[81] N. Gandal, “Native language and internet usage,” International Journal of the Sociology of Language

The Quarterly Journal of Economics, vol. 126, no. 4, pp. 1799–1839, Nov. 2011. [Online]. Available: https://doi.org/10.1093/qje/qjr044
[83] Y. Gertner, Y. Ishai, E. Kushilevitz, and T. Malkin, “Protecting data privacy in private information retrieval schemes,” J. Comput. Syst. Sci., vol. 60, no. 3, pp. 592–629, Jun. 2000. [Online]. Available: https://doi.org/10.1006/jcss.1999.1689
[84] A. Ghorbani, M. P. Kim, and J. Zou, “A distributional framework for data valuation,” in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[85] A. Ghorbani and J. Zou, “Data shapley: Equitable valuation of data for machine learning,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 2242–2251. [Online]. Available: http://proceedings.mlr.press/v97/ghorbani19c.html
[86] A. Ghosh, K. Ligett, A. Roth, and G. Schoenebeck, “Buying private data without verification,” in Proceedings of the Fifteenth ACM Conference on Economics and Computation, ser. EC’14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 931–948. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2600057.2602902
[87] A. Ghosh and A. Roth, “Selling privacy at auction,” in Proceedings of the 12th ACM Conference on Electronic Commerce, ser. EC’11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 199–208. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/1993574.1993605
[88] A. Gilchrist, Industry 4.0: The Industrial Internet of Things, 1st ed. USA: Apress, 2016.
[89] A. V. Goldberg and J. D. Hartline, “Competitive auctions for multiple digital goods,” in Proceedings of the 9th Annual European Symposium on Algorithms, ser. ESA’01. Berlin, Heidelberg: Springer-Verlag, 2001, pp. 416–427.
[90] A. V. Goldberg and J. D. Hartline, “Competitiveness via consensus,” in Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA’03. USA: Society for Industrial and Applied Mathematics, 2003, pp. 215–222.
[91] A. V. Goldberg and J. D. Hartline, “Envy-free auctions for digital goods,” in Proceedings of the 4th ACM Conference on Electronic Commerce, ser. EC’03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 29–35. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/779928.779932
[92] A. V. Goldberg, J. D. Hartline, and A. Wright, “Competitive auctions and digital goods,” in Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA’01. USA: Society for Industrial and Applied Mathematics, 2001, pp. 735–744.
[93] A. Goldfarb and C. Tucker, “Digital economics,” Journal of Economic Literature

Deep Learning

Marketing Science, vol. 38, no. 2, pp. 193–225, 2019. [Online]. Available: https://doi.org/10.1287/mksc.2018.1135
[96] V. Guruswami, J. D. Hartline, A. R. Karlin, D. Kempe, C. Kenyon, and F. McSherry, “On profit-maximizing envy-free pricing,” in Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, ser. SODA’05. USA: Society for Industrial and Applied Mathematics, 2005, pp. 1164–1173.
[97] N. Haghpanah and J. Hartline, “Reverse mechanism design,” in Proceedings of the Sixteenth ACM Conference on Economics and Computation

American Economic Journal: Microeconomics

ArXiv, vol. abs/2006.14583, 2020.
[101] G. Hardin, “The tragedy of the commons,” Science, vol. 162, no. 3859, pp. 1243–1248, 1968. [Online]. Available: https://science.sciencemag.org/content/162/3859/1243
[102] J. D. Hartline and R. McGrew, “From optimal limited to unlimited supply auctions,” in Proceedings of the 6th ACM Conference on Electronic Commerce, ser. EC’05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 175–182. [Online]. Available: https://doi.org/10.1145/1064009.1064028
[103] J. Heckman, E. Peters, N. G. Kurup, E. Boehmer, and M. Davaloo, “A pricing model for data markets,” in iConference 2015 Proceedings. iSchools, 2015.
[104] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association

ArXiv, vol. abs/2009.05604, 2020.
[106] W. Hu and A. Bolivar, “Online auctions efficiency: A survey of ebay auctions,” in Proceedings of the 17th International Conference on World Wide Web, ser. WWW’08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 925–934. [Online]. Available: https://doi.org/10.1145/1367497.1367621
[107] N. Hynes, D. Dao, D. Yan, R. Cheng, and D. Song, “A demonstration of sterling: A privacy-preserving data marketplace,” Proc. VLDB Endow., vol. 11, no. 12, pp. 2086–2089, Aug. 2018. [Online]. Available: https://doi.org/10.14778/3229863.3236266
[108] G. Irvin, Modern Cost-Benefit Methods. London: Macmillan Publishers Limited, 1978.
[109] J. Jaisingh, J. Barron, S. Mehta, and A. Chaturvedi, “Privacy and pricing personal information,” European Journal of Operational Research

International Journal of Electronic Business, vol. 6, pp. 114–131, Jan. 2008.
[111] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. M. Gurel, B. Li, C. Zhang, C. Spanos, and D. Song, “Efficient task-specific data valuation for nearest neighbor algorithms,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1610–1623, Jul. 2019. [Online]. Available: https://doi.org/10.14778/3342263.3342637
[112] R. Jia, D. Dao, B. Wang, F. A. Hubis, N. Hynes, N. M. Gürel, B. Li, C. Zhang, D. Song, and C. J. Spanos, “Towards efficient data valuation based on the shapley value,” in Proceedings of Machine Learning Research, ser. Proceedings of Machine Learning Research, K. Chaudhuri and M. Sugiyama, Eds., vol. 89. PMLR, 16–18 Apr 2019, pp. 1167–1176. [Online]. Available: http://proceedings.mlr.press/v89/jia19a.html
[113] R. Jia, X. Sun, J. Xu, C. Zhang, B. Li, and D. Song, “An empirical and comparative analysis of data valuation with scalable algorithms,” CoRR, vol. abs/1911.07128, 2019. [Online]. Available: http://arxiv.org/abs/1911.07128
[114] H. Jiang, J. Pei, D. Yu, J. Yu, B. Gong, and X. Cheng, “Differential privacy and its applications in social network analysis: A survey,” ArXiv, vol. abs/2010.02973, 2020.
[115] B. Jullien, “Two-sided b to b platforms,” in The Oxford Handbook of the Digital Economy

Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD’11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 325–336. [Online]. Available: https://doi.org/10.1145/1989323.1989358
[117] D. Kifer and A. Machanavajjhala, “A rigorous and customizable framework for privacy,” in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ser. PODS’12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 77–88. [Online]. Available: https://doi.org/10.1145/2213556.2213571
[118] P. Klemperer, “Auction theory: A guide to the literature,” Journal of Economic Surveys, vol. 13, no. 3, pp. 227–286, 1999. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-6419.00083
[119] P. W. Koh and P. Liang, “Understanding black-box predictions via influence functions,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17. JMLR.org, 2017, pp. 1885–1894.
[120] P. Kotler, Marketing Management: The Millennium Edition. Boston, MA: Pearson Custom Pub., 2000.
[121] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, “Query-based data pricing,” in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, ser. PODS’12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 167–178. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2213556.2213582
[122] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, “Querymarket demonstration: Pricing for online data markets,” Proc. VLDB Endow., vol. 5, no. 12, pp. 1962–1965, Aug. 2012. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.14778/2367502.2367548
[123] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, “Toward practical query pricing with querymarket,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD’13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 613–624. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2463676.2465335
[124] P. Koutris, P. Upadhyaya, M. Balazinska, B. Howe, and D. Suciu, “Query-based data pricing,” J. ACM, vol. 62, no. 5, Nov. 2015. [Online]. Available: https://doi.org/10.1145/2770870
[125] Y. Kwon, M. A. Rivas, and J. Zou, “Efficient computation and analysis of distributional shapley values,” ArXiv, vol. abs/2007.01357, 2020.
[126] S. Lahaie, D. M. Pennock, A. Saberi, and R. V. Vohra, “Sponsored search auctions,” in Algorithmic Game Theory, N. Nisan, T. Roughgarden, E. Tardos, and V. V. Vazirani, Eds. Cambridge University Press, 2007, pp. 699–716.
[127] A. Lambrecht, A. Goldfarb, A. Bonatti, A. Ghose, D. Goldstein, R. Lewis, A. Rao, N. Sahni, and S. Yao, “How do firms make money selling digital goods online?” Marketing Letters, vol. 25, pp. 331–341, Sep. 2014.
[128] A. Lambrecht and C. Tucker, “When does retargeting work? information specificity in online advertising,” Journal of Marketing Research, vol. 50, no. 5, pp. 561–576, 2013. [Online]. Available: https://doi.org/10.1509/jmr.11.0503
[129] R. Lavi and N. Nisan, “Competitive analysis of incentive compatible on-line auctions,” in Proceedings of the 2nd ACM Conference on Electronic Commerce, ser. EC’00. New York, NY, USA: Association for Computing Machinery, 2000, pp. 233–241. [Online]. Available: https://doi.org/10.1145/352871.352897
[130] S. Lehmann and P. Buxmann, “Pricing strategies of software vendors,” Business & Information Systems Engineering, vol. 1, pp. 452–462, Dec. 2009.
[131] J. Lerner, P. A. Pathak, and J. Tirole, “The dynamics of open-source contributors,” American Economic Review

SSRN Electronic Journal, Jan. 2013.
[133] C. Li, D. Y. Li, G. Miklau, and D. Suciu, “A theory of pricing private data,” ACM Trans. Database Syst., vol. 39, no. 4, Dec. 2015. [Online]. Available: https://doi.org/10.1145/2691190.2691191
[134] C. Li and G. Miklau, “Pricing aggregate queries in a data marketplace,” in Proceedings of the 15th International Workshop on the Web and Databases 2012, WebDB 2012, Scottsdale, AZ, USA, May 20, 2012, Z. G. Ives and Y. Velegrakis, Eds., 2012, pp. 19–24. [Online]. Available: http://db.disi.unitn.eu/pages/WebDB2012/papers/p15.pdf
[135] X.-B. Li and S. Raghunathan, “Pricing and disseminating customer data with privacy awareness,” Decision Support Systems

IEEE Access, vol. PP, pp. 1–1, Feb. 2018.
[137] K. Ligett and A. Roth, “Take it or leave it: Running a survey when privacy comes at a cost,” in Proceedings of the Eighth International Workshop on Internet and Network Economics (WINE’12), ser. Lecture Notes in Computer Science, P. W. Goldberg and M. Guo, Eds., vol. 7695. Berlin, Heidelberg: Springer, 2012, pp. 378–391.
[138] B.-R. Lin and D. Kifer, “On arbitrage-free pricing for general data queries,” Proc. VLDB Endow., vol. 7, no. 9, pp. 757–768, May 2014. [Online]. Available: https://doi.org/10.14778/2732939.2732948
[139] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, ser. SIGMOD’14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 1359–1370. [Online]. Available: https://doi.org/10.1145/2588555.2593679
[141] S. Maleki, L. Tran-Thanh, G. Hines, T. Rahwan, and A. Rogers, “Bounding the estimation error of sampling-based shapley value approximation with/without stratifying,” CoRR, vol. abs/1306.4265, 2013. [Online]. Available: http://arxiv.org/abs/1306.4265
[142] A. Mas-Colell, M. Whinston, and J. Green, Microeconomic Theory. Oxford University Press, 1995. [Online]. Available: https://EconPapers.repec.org/RePEc:oxp:obooks:9780195102680
[143] E. Maskin and J. Riley, “Monopoly with incomplete information,” The RAND Journal of Economics

The Review of Economic Studies, vol. 67, no. 3, pp. 413–438, Jul. 2000. [Online]. Available: https://doi.org/10.1111/1467-937X.00137
[145] R. P. McAfee and J. McMillan, “Auctions and Bidding,” Journal of Economic Literature, vol. 25, no. 2, pp. 699–738, June 1987. [Online]. Available: https://ideas.repec.org/a/aea/jeclit/v25y1987i2p699-738.html
[146] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 1273–1282.
[147] B. McMahan and D. Ramage, “Federated learning: Collaborative machine learning without centralized training data,” Google AI Blog, April 2017. [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
[148] D. Menicucci, S. Hurkens, and D.-S. Jeon, “On the optimality of pure bundling for a monopolist,” Journal of Mathematical Economics

Journal of Economic Perspectives

Proceedings of the International Conference on Advances in Computing, Communications and Informatics, ser. ICACCI’12. New York, NY, USA: Association for Computing Machinery, 2012, pp. 143–147. [Online]. Available: https://doi.org/10.1145/2345396.2345421
[151] A. Muschalle, F. Stahl, A. Löser, and G. Vossen, “Pricing approaches for data markets,” in Enabling Real-Time Business Intelligence, M. Castellanos, U. Dayal, and E. A. Rundensteiner, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 129–144.
[152] R. B. Myerson, “Optimal auction design,” Math. Oper. Res., vol. 6, no. 1, pp. 58–73, Feb. 1981. [Online]. Available: https://doi.org/10.1287/moor.6.1.58
[153] A. Nagaraj, “The private impact of public information: Landsat satellite maps and gold exploration,” Unpublished, Jul. 2016. [Online]. Available: http://abhishekn.com/files/nagaraj_landsat2020.pdf
[154] P. Naghizadeh and A. Sinha, “Adversarial contract design for private data commercialization,” in Proceedings of the 2019 ACM Conference on Economics and Computation, ser. EC’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 681–699. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3328526.3329633
[155] T. T. Nagle and J. Hogan, The Strategy and Tactics of Pricing: A Guide to Growing More Profitably. Prentice Hall, 2010.
[156] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic, “Deep learning applications and challenges in big data analytics,” Journal of Big Data, vol. 2, no. 1, p. 1, Feb. 2015. [Online]. Available: https://doi.org/10.1186/s40537-014-0007-7
[157] A. Nash, L. Segoufin, and V. Vianu, “Determinacy and rewriting of conjunctive queries using views: A progress report,” in Proceedings of the 11th International Conference on Database Theory, ser. ICDT’07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 59–73. [Online]. Available: https://doi.org/10.1007/11965893_5
[158] A. Nash, L. Segoufin, and V. Vianu, “Views and queries: Determinacy and rewriting,” ACM Trans. Database Syst., vol. 35, no. 3, Jul. 2010. [Online]. Available: https://doi.org/10.1145/1806907.1806913
[159] M. Neumeier, The Brand Flip: Why Customers Now Run Companies–and How to Profit from It. San Francisco: New Riders, 2015.
[160] K. Nissim, S. Vadhan, and D. Xiao, “Redrawing the boundaries on purchasing data from privacy-sensitive individuals,” in Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ser. ITCS’14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 411–422. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/2554797.2554835
[161] C. Niu, Z. Zheng, S. Tang, X. Gao, and F. Wu, “Making big money from small sensors: Trading time-series data under pufferfish privacy,” in IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, April 2019, pp. 568–576.
[162] C. Niu, Z. Zheng, F. Wu, S. Tang, and G. Chen, “Online pricing with reserve price constraint for personal data markets,” CoRR, vol. abs/1911.12598, 2019. [Online]. Available: http://arxiv.org/abs/1911.12598
[163] C. Niu, Z. Zheng, F. Wu, S. Tang, X. Gao, and G. Chen, “Unlocking the value of privacy: Trading aggregate statistics over private correlated data,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Proceedings of the 5th International Conference on Electronic Commerce, ser. ICEC’03. New York, NY, USA: Association for Computing Machinery, 2003, pp. 355–366. [Online]. Available: https://doi.org/10.1145/948005.948051
[166] E. Ostrom, Governing the Commons: The Evolution of Institutions for Collective Action, ser. Canto Classics. Cambridge University Press, 2015.
[167] K. Pantelis and L. Aija, “Understanding the value of (big) data,” in , 2013, pp. 38–42.
[168] K. Pauwels and A. Weiss, “Moving from free to fee: How online firms market to change their business model successfully,” Journal of Marketing, vol. 72, no. 3, pp. 14–31, 2008. [Online]. Available: https://doi.org/10.1509/JMKG.72.3.014
[169] A. Pavan, I. Segal, and J. Toikka, “Dynamic mechanism design: A myersonian approach,” Econometrica, vol. 82, no. 2, pp. 601–653, 2014. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA10269
[170] L. L. Pipino, Y. W. Lee, and R. Y. Wang, “Data quality assessment,” Commun. ACM, vol. 45, no. 4, pp. 211–218, Apr. 2002. [Online]. Available: https://doi.org/10.1145/505248.506010
[171] A. Prasad, V. Mahajan, and B. Bronnenberg, “Advertising versus pay-per-view in electronic media,” International Journal of Research in Marketing

ACM Trans. Intell. Syst. Technol., vol. 5, no. 4, Jan. 2015. [Online]. Available: https://doi.org/10.1145/2668108
[173] A. Rao, “Online Content Pricing: Purchase and Rental Markets,” Marketing Science, vol. 34, no. 3, pp. 430–451, May 2015. [Online]. Available: https://ideas.repec.org/a/inm/ormksc/v34y2015i3p430-451.html
[174] J. M. Rao and D. H. Reiley, “The economics of spam,” Journal of Economic Perspectives

Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 363–372. [Online]. Available: https://doi.org/10.1145/3292500.3330870
[176] A. Richardson, A. Filos-Ratsikas, and B. Faltings, “Rewarding high-quality data via influence functions,” CoRR, vol. abs/1908.11598, 2019. [Online]. Available: http://arxiv.org/abs/1908.11598
[177] J. Riley and W. F. Samuelson, “Optimal auctions,” American Economic Review, vol. 71, no. 3, pp. 381–392, 1981. [Online]. Available: https://EconPapers.repec.org/RePEc:aea:aecrev:v:71:y:1981:i:3:p:381-92
[178] A. Roth, “Technical perspective: Pricing information (and its implications),” Commun. ACM, vol. 60, no. 12, p. 78, Nov. 2017. [Online]. Available: https://doi-org.proxy.lib.sfu.ca/10.1145/3139455
[179] F. Schomm, F. Stahl, and G. Vossen, “Marketplaces for data: An initial survey,” SIGMOD Rec., vol. 42, no. 1, pp. 15–26, May 2013. [Online]. Available: https://doi.org/10.1145/2481528.2481532
[180] L. Segoufin and V. Vianu, “Views and queries: Determinacy and rewriting,” in Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ser. PODS’05. New York, NY, USA: Association for Computing Machinery, 2005, pp. 49–60. [Online]. Available: https://doi.org/10.1145/1065167.1065174
[181] S. Sen, C. Joe-Wong, S. Ha, and M. Chiang, “A survey of smart data pricing: Past proposals, current plans, and future trends,” ACM Computing Surveys, vol. 46, no. 2, Nov. 2013. [Online]. Available: https://doi.org/10.1145/2543581.2543582
[182] C. Shapiro and H. R. Varian, Information Rules: A Strategic Guide to the Network Economy, ser. Strategy/Technology / Harvard Business School Press. Harvard Business School Press, 1998. [Online]. Available: https://books.google.ca/books?id=aE_J4Iv_PVEC
[183] C. Shapiro and H. R. Varian, “Versioning: The smart way to sell information,” Harvard Business Review

Auctions, Bidding, and Contracting, M. Shubik and J. Stark, Eds. New York University Press, 1983, pp. 33–52.
[186] R. H. L. Sim, Y. Zhang, M. C. Chan, and B. K. H. Low, “Collaborative machine learning with incentive-aware model rewards,” in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[187] B. Squire, S. Brown, J. Readman, and J. Bessant, “The impact of mass customisation on manufacturing trade-offs,” Production and Operations Management, vol. 15, pp. 10–21, Jan. 2009.
[188] C. Sunstein, Echo Chambers: Bush V. Gore, Impeachment, and Beyond. Princeton University Press, 2001. [Online]. Available: https://books.google.ca/books?id=sEgHAAAACAAJ
[189] C. Swamy and M. Cheung, “Approximation algorithms for single-minded envy-free profit-maximization problems with limited supply,” in . Los Alamitos, CA, USA: IEEE Computer Society, Oct. 2008, pp. 35–44. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/FOCS.2008.15
[190] G. Tang, Y. Yang, and J. Pei, “Price information patterns in web search advertising: An empirical case study on accommodation industry,” in . Los Alamitos, CA, USA: IEEE Computer Society, Dec. 2013, pp. 737–746. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/ICDM.2013.100
[191] R. Tang, A. Amarilli, P. Senellart, and S. Bressan, “Get a sample for a discount,” in Database and Expert Systems Applications, H. Decker, L. Lhotská, S. Link, M. Spies, and R. R. Wagner, Eds. Cham: Springer International Publishing, 2014, pp. 20–34.
[192] R. Tang, H. Wu, Z. Bao, S. Bressan, and P. Valduriez, “The price is right,” in Database and Expert Systems Applications, H. Decker, L. Lhotská, S. Link, J. Basl, and A. M. Tjoa, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 380–394.
[193] C. R. Taylor, “Consumer Privacy and the Market for Customer Information,” RAND Journal of Economics, vol. 35, no. 4, pp. 631–650, Winter 2004. [Online]. Available: https://ideas.repec.org/a/rje/randje/v35y20044p631-650.html
[194] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart, “Stealing machine learning models via prediction apis,” in Proceedings of the 25th USENIX Conference on Security Symposium, ser. SEC’16. USA: USENIX Association, 2016, pp. 601–618.
[195] P. Upadhyaya, M. Balazinska, and D. Suciu, “Price-optimal querying with data apis,” Proc. VLDB Endow., vol. 9, no. 14, pp. 1695–1706, Oct. 2016. [Online]. Available: https://doi.org/10.14778/3007328.3007335
[196] S. van de Sandt, S. Dallmeier-Tiessen, A. Lavasa, and V. Petras, “The definition of reuse,” Data Science Journal, vol. 18, no. 1, p. 22, 2019.
[197] H. R. Varian, “Online ad auctions,” American Economic Review

The Journal of Finance

Recent Advances in Game Theory. Princeton, New Jersey: Princeton University Conference, 1962, pp. 15–27.
[200] H. von Stackelberg, Market Structure and Equilibrium. J. Springer, 1934.
[201] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: A brief review,” Computational Intelligence and Neuroscience, vol. 2018, p. 7068349, Feb. 2018. [Online]. Available: https://doi.org/10.1155/2018/7068349
[202] T. Wagner, A. Benlian, and T. Hess, “Converting freemium customers from free to premium–the role of the perceived premium fit in the case of music as a service,” Electronic Markets, vol. 24, pp. 259–268, Dec. 2014.
[203] J. Waldfogel, “Copyright research in the digital age: Moving from piracy to the supply of new products,” American Economic Review

Journal of Management Information Systems, vol. 12, no. 4, pp. 5–33, 1996. [Online]. Available: https://doi.org/10.1080/07421222.1996.11518099
[205] T. Wang, J. Rausch, C. Zhang, R. Jia, and D. Song, “A principled approach to data valuation for federated learning,” ArXiv, vol. abs/2009.06192, 2020.
[206] Z. Wang, H. Zhu, Z. Dong, X. He, and S. Huang, “Less is better: Unweighted data subsampling via influence function,” CoRR, vol. abs/1912.01321, 2019. [Online]. Available: http://arxiv.org/abs/1912.01321
[207] H. L. Williams, “Intellectual property rights and innovation: Evidence from the human genome,” Journal of Political Economy, vol. 121, no. 1, pp. 1–27, 2013. [Online]. Available: https://doi.org/10.1086/669706
[208] C. Wu, R. Buyya, and K. Ramamohanarao, “Cloud pricing models: Taxonomy, survey, and interdisciplinary challenges,” ACM Comput. Surv., vol. 52, no. 6, Oct. 2019. [Online]. Available: https://doi.org/10.1145/3342103
[209] S. Wu and R. Banker, “Best pricing strategy for information services,” Journal of the Association for Information Systems, vol. 11, no. 6, pp. 339–366, Jan. 2010.
[210] S. Wu and P. Pavlou, “On the optimal fixed-up-to pricing for information services,” Journal of the Association for Information Systems, vol. 20, no. 10, pp. 1447–1474, Jan. 2019.
[211] X. Wu, W. Zhang, and W. Dou, “Pricing as a service: Personalized pricing strategy in cloud computing,” in , Oct. 2012, pp. 1119–1124.
[212] X. Wu, X. Ying, K. Liu, and L. Chen, “A survey of privacy-preservation of graphs and social networks,” in Managing and Mining Graph Data, C. C. Aggarwal and H. Wang, Eds. Boston, MA: Springer US, 2010, pp. 421–453. [Online]. Available: https://doi.org/10.1007/978-1-4419-6045-0_14
[213] C. Xia and S. Muthukrishnan, “Arbitrage-free pricing in user-based markets,” in Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, ser. AAMAS’18. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems, 2018, pp. 327–335.
[214] H. Yang, “Targeted search and the long tail effect,” RAND Journal of Economics, vol. 44, no. 4, pp. 733–756, December 2013.
[215] Y. Yang, X. Mao, J. Pei, and X. He, “Continuous influence maximization: What discounts should we offer to social network users?” in Proceedings of the 2016 International Conference on Management of Data, ser. SIGMOD’16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 727–741. [Online]. Available: https://doi.org/10.1145/2882903.2882961
[216] Y. Yang, Q. S. Lu, G. Tang, and J. Pei, “The Impact of Market Competition on Search Advertising,” Journal of Interactive Marketing, vol. 30, no. C, pp. 46–55, 2015. [Online]. Available: https://ideas.repec.org/a/eee/joinma/v30y2015icp46-55.html
[217] J. Yoon, S. Arik, and T. Pfister, “Data valuation using reinforcement learning,” in Proceedings of the International Conference on Machine Learning 1 pre-proceedings (ICML 2020), 2020.
[218] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing [review article],” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
[219] H. Yu and M. Zhang, “Data pricing strategy based on data quality,” Computers & Industrial Engineering

SSRN, April 2020. [Online]. Available: https://ssrn.com/abstract=3609120 or http://dx.doi.org/10.2139/ssrn.3609120
[221] X. M. Zhang and F. Zhu, “Group size and incentives to contribute: A natural experiment at chinese wikipedia,” American Economic Review

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 1021–1030. [Online]. Available: https://doi.org/10.1145/3219819.3219918
[223] Z. Zheng, Y. Peng, F. Wu, S. Tang, and G. Chen, “An online pricing mechanism for mobile crowdsensing data markets,” in Proceedings of the 18th ACM International Symposium on Mobile Ad Hoc Networking and Computing, ser. Mobihoc’17. New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: https://doi.org/10.1145/3084041.3084044
[224] B. Zhou, J. Pei, and W. Luk, “A brief survey on anonymization techniques for privacy preserving publishing of social network data,” SIGKDD Explor. Newsl., vol. 10, no. 2, pp. 12–22, Dec. 2008. [Online]. Available: https://doi.org/10.1145/1540276.1540279
[225] Y. Zhou, U. Porwal, C. Zhang, H. Ngo, X. Nguyen, C. Ré, and V. Govindaraju, “Parallel feature selection inspired by group testing,” in