Optimization of Privacy-Utility Trade-offs under Informational Self-determination
Thomas Asikis, Evangelos Pournaras
Professorship of Computational Social Science, ETH Zurich, Zurich, Switzerland
{asikist, epournaras}@ethz.ch

Abstract
The pervasiveness of the Internet of Things results in vast volumes of personal data generated by smart devices of users (data producers) such as smart phones, wearables and other embedded sensors. It is a common requirement, especially for Big Data analytics systems, to transfer these large in scale and distributed data to centralized computational systems for analysis. Nevertheless, third parties that run and manage these systems (data consumers) do not always guarantee users' privacy. Their primary interest is to improve utility, which is usually a metric related to the performance, costs and the quality of service. There are several techniques that mask user-generated data to ensure privacy, e.g. differential privacy. Setting up a process for masking data, referred to in this paper as a 'privacy setting', decreases on the one hand the utility of data analytics, while, on the other hand, it increases privacy. This paper studies parameterizations of privacy settings that regulate the trade-off between maximum utility, minimum privacy and minimum utility, maximum privacy, where utility refers to the accuracy in the estimations of aggregation functions. Privacy settings can be universally applied as system-wide parameterizations and policies (homogeneous data sharing). Nonetheless, they can also be applied autonomously by each user or decided under the influence of (monetary) incentives (heterogeneous data sharing). This latter diversity in data sharing by informational self-determination plays a key role on the privacy-utility trajectories as shown in this paper both theoretically and empirically. A generic and novel computational framework is introduced for measuring privacy-utility trade-offs and their Pareto optimization. The framework computes a broad spectrum of such trade-offs that form privacy-utility trajectories under homogeneous and heterogeneous data sharing. The practical use of the framework is experimentally evaluated using real-world data from a Smart Grid pilot project in which energy consumers protect their privacy by regulating the quality of the shared power demand data, while utility companies make accurate estimations of the aggregate load in the network to manage the power grid. Over 20,000 differential privacy settings are applied to shape the computational trajectories that in turn provide a vast potential for data consumers and producers to participate in viable participatory data sharing systems.
Keywords: data sharing, privacy, utility, trade-off, optimization, masking, differential privacy, data transformation, diversity, Internet of Things, Big Data
1. Introduction
High data volumes are generated in real-time from users' smart devices such as smartphones, wearables and embedded sensors. Big Data systems process these data, generate information and enable services that support critical sectors of economy, e.g. health, energy, transportation etc. Such systems often rely on centralized servers or cloud computing systems. They are managed by corporate third parties, referred to in this paper as data consumers, who collect the data of users, referred to respectively as data producers. Data consumers perform data analytics for decision-making and automation of business processes. However, data producers are not always aware of how their data are used and processed. Terms of Use are shown to be limited and ineffective [4, 32]. Security and privacy of users' data depend entirely on data consumers and as a result misuse of personal information is possible, for instance, discrimination or limited freedom and autonomy by personalized persuasive systems [22, 7, 24, 21]. Giving control back to data producers by self-regulating the amount/quality of shared data can limit these threats [37]. Incentivizing the sharing of a higher amount/quality of data results in improved quality of service, i.e. higher accuracy in predictions [44, 10, 26]. At the same time, data sharing empowers data producers with an economic value to claim.

Several applications do not require storage of the individual data generated by data producers. Instead, data consumers may only require aggregated data. For instance, Smart Grid utility companies compute the total daily power load or the average voltage stability to prevent possible network failures and bottlenecks, predict future power demand, optimize power production and design pricing policies [27, 6]. Privacy-preserving masking mechanisms [1], i.e. differential privacy, accurately approximate the actual aggregate values without transmitting the privacy-sensitive individual data of data producers. Masking is a numerical transformation of the sensor values that usually relies on the generation of random noise and is irreversible: it is computationally infeasible to compute the original data using the transformed data.

Privacy-preserving masking mechanisms are studied by calculating metrics of privacy q and utility u. The former represents the amount of personal information that a data producer preserves when sharing a masked data value. The latter represents the benefit that a data consumer preserves when using certain masked data for aggregation, e.g. accuracy in data analytics. Literature work [1, 30, 25] shows that privacy and utility are negatively correlated, meaning that an increase in one results in a decrease in the other. This paper studies the optimization of computational trade-offs between privacy and utility that can be used to model information sharing as supply-demand systems run by computational markets [37, 29]. These trade-offs can be measured by the opportunity cost between privacy-preservation and the performance of algorithms operating on masked data, i.e. prediction accuracy. Trade-offs can be made by choosing different parameters for different masking mechanisms, each influencing the mean or the variance of the generated noise distributions [1]. Each parameterization results in a pair of privacy and utility values within a trajectory of possible privacy-utility values.

The selection of parameters for masking mechanisms that maximize privacy and utility is studied in this paper as an optimization problem [30, 25]. In contrast to related work that exclusively focuses on universal optimal privacy settings (homogeneous data sharing), this paper studies the optimization of privacy-utility trade-offs under diversity in data sharing (heterogeneous data sharing). This is a challenging but more realistic scenario for participatory data sharing systems that allow informational self-determination via freedom and autonomy in the amount/quality of data shared by each data producer.
A novel computational framework is introduced to compute the privacy settings that realize different privacy-utility trade-offs.

The main contributions of this article are the following: (i) The introduction of a generalized, domain-independent, data-driven optimization framework, which selects privacy settings that maximize privacy and utility. (ii) A formal proof on how high utility can be achieved under informational self-determination (heterogeneous data sharing) originated from the diversity in the privacy settings selected by the users. (iii) The introduction of new privacy and utility metrics based on statistical properties of the generated noise. (iv) The introduction of a new masking mechanism. (v) An empirical analysis of privacy-utility trajectories of more than 20,000 privacy settings computed using real-world data from a Smart Grid pilot project.

This paper is outlined as follows: Section 2 includes related work on privacy masking mechanisms, the privacy-utility trade-off as well as privacy-utility maximization problems. Section 3 defines the optimization problem and illustrates the research challenge that this paper tackles. Section 4 introduces the proposed optimization framework. Section 5 outlines the experimental settings on which the proposed framework is tested and evaluated. Section 6 shows the results of the experimental evaluation. Finally, Section 7 concludes this paper and outlines future work.
2. Related Work
Several algorithms are proposed to perform data aggregation without transmitting the raw data. The basic idea behind such algorithms is to irreversibly transform the data (a process also known as masking), so that the original values cannot be estimated. While doing so, some of the properties of the data should be preserved to accurately estimate aggregation functions such as sum, count or multiplication [1, 10, 37, 18, 11]. The masking process enables the data producers to control the amount of personal information sent to data consumers. These methods also ensure that the data remain private even when a non-authorized party acquires them, for example in the case of a man-in-the-middle attack. An overview of privacy-preserving mechanisms is illustrated below:
2.1.1. Perturbative masking

Perturbative masking mechanisms allow the data producers to share their data after masking individual values. Each value is perturbed by replacing it with a new value that is usually generated via a process of random noise generation or vector quantization techniques on current and past data values [1]. Some of the most well-known perturbative masking methods are the following:
Additive noise: A privacy-preserving approach is the addition of randomized noise [11, 12, 46]. This approach is often used in differential privacy schemes [12]. Differential privacy is ensured when the masking process prohibits the estimation of the real data values, even if the data consumer can utilize previously known data values or the identity of the individual who sends the data [13]. Algorithms that achieve differential privacy rely on the notion that the change of a single element in a database does not affect the probability distribution of the elements in the database [11, 13, 46, 47]. Furthermore, the removed element cannot be identified when comparing the version of the database before and after the removal. This is achieved by adding randomly generated noise to each data value. The distribution of the random noise is parameterized, is usually symmetric around 0 and relies on the cancellation of noises with opposite values. Increasing the number of noise values also increases the noise cancellation, since a larger number of opposite values are sampled. This property can be used to combine differential privacy mechanisms in order to ensure privacy while achieving high utility [23]. Statistical aggregation queries on the masked data return an approximate numerical result, which is close to the actual result. Differential privacy can be applied to discrete and continuous variables for the calculation of several aggregation functions [10]. Differential privacy can be combined with the usage of deep neural networks [43, 35], to apply more complex aggregation operations on statistical databases. Furthermore, several additive noise implementations are susceptible to noise filtering attacks, such as the use of Kalman filters [20] or reconstruction attacks [14]. These attacks can be prevented when the noise is not autocorrelated or the distribution of its autocorrelation is approximately uniform.
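The additive-noise scheme above can be made concrete with a short Python sketch. It is a minimal illustration written for this text, not the paper's implementation: the function name, the example readings and the noise scale are all assumed for demonstration.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_mask(values, scale):
    """Mask each sensor value with zero-mean Laplace noise.

    A larger scale produces stronger masking (more privacy) and a
    noisier aggregate (less utility).
    """
    return values + rng.laplace(loc=0.0, scale=scale, size=len(values))

# Symmetric noise tends to cancel out in large aggregates, so the sum of
# the masked values approximates the true sum without exposing any value.
readings = rng.uniform(0.1, 2.0, size=10_000)   # hypothetical power readings
masked = laplace_mask(readings, scale=0.5)
print(readings.sum(), masked.sum())             # the two sums stay close
```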
Microaggregation: Microaggregation relies on the replacement of each data value with a representative data value that is derived from the statistical properties of the dataset it belongs to. A well-known application of microaggregation is K-anonymity, which relies on the notion that at least K original data values are mapped to the same value [40]. When a crisp clustering algorithm, such as K-means, is applied on the data, each data value is mapped to the centroid of the cluster it belongs to. K is the minimum number of elements in a cluster. Using crisp clustering techniques may result in vulnerabilities to specific attacks, so membership or fuzzy clustering is preferred instead [33]. Membership clustering assigns a data point to multiple clusters with a probability that is often proportional to the distance from each cluster centroid. Membership clustering techniques usually require large amounts of data, and the storage and computational capacity of sensor devices cannot usually support such processes [33, 1].
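As a concrete illustration of the crisp variant (the fuzzy variant recommended above is omitted for brevity), the sketch below replaces each value with the centroid of its K-means cluster; the reliance on scikit-learn and every name in it are assumptions of this sketch rather than the paper's method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
readings = rng.uniform(0.1, 2.0, size=(1_000, 1))  # one-dimensional sensor values

def microaggregate(values, n_clusters):
    """Replace each value by the centroid of the cluster it belongs to.

    Fewer clusters yield coarser representatives: more privacy, less utility.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(values)
    return km.cluster_centers_[km.labels_]

masked = microaggregate(readings, n_clusters=8)
print(abs(readings.mean() - masked.mean()))  # aggregate mean roughly preserved
```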
Synthetic microdata generation: A new dataset is synthesized based on the original data and multiple imputations [1]. The “synthetic” dataset is used instead of the original one for aggregation calculations. The application of synthetic microdata generation on sensor devices may produce prohibitive processing and storage costs. Furthermore, the availability of historical data on each sensor device may not be adequate for such methods to achieve comparable performance and efficiency with the perturbative masking methods [1].

2.1.2. Encryption

Several approaches use encryption to produce an encrypted set of numbers or symbols, known as ciphers. The aggregation operations can be performed on the ciphers and produce an encrypted aggregation value. The encrypted aggregate value can then be decrypted to the original aggregate one, with the usage of the corresponding private and public keys and decryption schemes, providing maximum utility and privacy to the recipient. The encrypted individual values cannot be transformed to the original values by an adversary without the usage of the appropriate keys, so maximum privacy is ensured. Currently, there is extensive research in this area, and there has been a recent breakthrough with the development of fully homomorphic encryption schemes [15, 16, 18, 19]. Homomorphic encryption schemes, though, require high computational and communication costs, especially when applied in large scale networks [17, 9].
Multi-Party Computation (MPC) [48, 3] can also be used for privacy-preservation [8] by moving data from one device to another. In such an approach, security and integrity of the data depend on the resilience and security of the network. Most of the methods that rely on encryption can calculate the exact sum of the data, but they can also be violated if an attacker manages to gain access to the private key or uses an algorithm that can guess it. Furthermore, in most cases they rely on communication protocols that burden the system with extra computational and communication costs [38]. These costs are often prohibitive for devices such as IoT sensors and smartphone wearables in which computational power and storage are limited [3].
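One standard MPC building block is additive secret sharing, sketched below: each reading is split into random shares that sum to it, so the exact total can be reconstructed while no single party sees an individual value. This is a textbook construction given for illustration, not a protocol from the works cited above; the modulus and readings are arbitrary.

```python
import random

MODULUS = 2**61 - 1  # arbitrary large prime modulus

def share(value, n_parties):
    """Split an integer into additive shares that sum to `value` mod MODULUS.

    Any subset of n-1 shares is uniformly random, revealing nothing alone.
    """
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

readings = [120, 340, 95]                       # hypothetical integer readings
all_shares = [share(r, n_parties=3) for r in readings]
# Each party sums the shares it received; the totals of those partial sums
# reconstruct the exact aggregate.
per_party = [sum(col) % MODULUS for col in zip(*all_shares)]
print(sum(per_party) % MODULUS == sum(readings) % MODULUS)  # True
```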
A supply-demand system operating on a computational market of data can be created with the introduction of self-regulatory privacy-preserving information systems [37]. Privacy preservation is utilized to create such systems, for instance by using K-means for microaggregation and different numbers of clusters for each sensor. Varying the number of clusters produces different levels of privacy and utility. The resulting trade-off between privacy and utility is used to create a reward system, where data consumers offer rewards for the data provided by the data producers. The rewards are based on the demand of transformed data that enables the estimation of more accurate aggregate values.

A reward system can be combined with pricing strategies from existing literature on pricing private data [29], in which three actors are introduced. Various pricing functions are proposed to the Market Maker so that the privacy-utility requirements of both data consumers and data producers are satisfied. The optimization framework of the current paper can utilize any parametric masking mechanism of the literature mentioned in Section 2.1. The output of the optimization can be used along with pricing functions on participatory computational markets, to create fully functional and self-regulatory data markets.
The challenge of an automated selection of privacy settings that satisfy different trade-offs is not tackled in the aforementioned mechanisms. Privacy-utility trajectories have not been earlier studied extensively and empirically as in the rest of this paper. The optimization of privacy-utility trade-offs under diversity in data sharing, originated from informational self-determination, is the challenge tackled in this paper. To the best of the authors' knowledge, this challenge is not the focus of earlier work.
3. Problem Definition
Related work [25, 44, 30, 41, 37] on privacy-utility trade-offs focuses on the parameter optimization of a single masking mechanism. A masking mechanism is often a noise generation process, which samples random noise values from a Laplace distribution and then aggregates them with the data; for instance, the sampled noise is added to the data to achieve differential privacy [11]. The result of the optimization is usually a vector of parameter values $\theta_{\eta,k}$, for a masking mechanism $\eta$ and parameter index $k$. The pair of the masking mechanism and the parameter values is referred to as a privacy setting $f_\eta(S, \theta_{\eta,k})$ of a set of sensor values $S \subseteq \mathbb{R}$. This privacy setting produces a pair of privacy-utility values $(\hat{q}, \hat{u})$, such that:

$$\hat{q} \to \max(Q) \quad (1)$$
$$\hat{u} \to \max(U) \quad (2)$$

where $(\hat{q}, \hat{u})$ is a (sub-optimal) privacy-utility pair of values, which is computed by an optimization algorithm that searches for the optimal privacy-utility value pair. $\max(Q)$ and $\max(U)$ are the maximum privacy and utility values of a privacy value set $Q$ and a utility value set $U$. These sets are generated by the application of a masking mechanism.

The optimization of an objective function that satisfies both Relations (1) and (2) simultaneously is an NP-hard problem [25], in the case that privacy and utility are orthogonal ($q \perp u$) or opposite ($q\uparrow, u\downarrow$), and often intractable to solve, since privacy-utility trade-offs prohibit the satisfaction of both Relations (1) and (2). (In the case that privacy and utility are positively correlated ($q\uparrow, u\uparrow$), the problem is reduced to NTIME-hard, and especially in the case that privacy and utility are proportional, $q \propto u$, to DTIME-hard [5]. The solution of the problem is provided by linearly evaluating all pairs of privacy and utility values once, without comparing to all other pairs.) Particularly, maximizing utility and privacy simultaneously usually yields sub-optimal values, which are lower than the corresponding optimal values computed by optimizing each metric separately [25]. Furthermore, such optimization is applicable for statistical databases [13, 1], where data are stored in a centralized system. In such a case, a specific privacy setting is chosen by the designer/administrator of the system. As a result, this approach relies on the assumption that a specific privacy setting should be used by all data producers.

However, remaining at a fixed privacy setting may be limiting for data producers, especially when a data producer wishes to switch to a different privacy setting to improve privacy further. In this case, the optimization of different objective functions is formalized in the following inequality:

$$q^* > \hat{q} + \delta \;\wedge\; u^* > \hat{u} + c \quad (3)$$

where $\delta$ measures the change in privacy, which denotes whether the data producers require higher privacy, $\delta > 0$, or lower privacy, $\delta < 0$, from the system; $c$ measures the change in utility, which denotes whether the data consumer demands lower utility, $c > 0$, or higher utility, $c < 0$, from the system. Finally, $(q^*, u^*)$ denotes a new (sub-optimal) pair of privacy-utility values, computed by an optimization algorithm that searches for the optimal pair of privacy-utility values with respect to the privacy requirements of the data producers and the utility requirements of the data consumer, expressed by $\delta$ and $c$ respectively.

The optimization of an objective function to satisfy Relation (3) is also based on the assumption that all data producers agree to use the same privacy setting. This means that data producers may acquire a different privacy level by changing the value of $\delta$ via the collective selection of a different privacy setting. Consequently, a single privacy setting is generated and it produces a pair of privacy-utility values, which satisfy Inequality (3). The value of $\delta$ is determined via a collective decision-making process applied by the data producers, e.g. voting between different privacy-utility requirements. Such a system is referred to as a homogeneous privacy system, where data producers are able to influence the amount of privacy applied on the data by actively participating in the market; nevertheless, they all share the same value for $\delta$. The data consumer can bargain for higher utility by offering higher rewards to the data producers to lower their privacy requirements.

Another challenge that arises is the optimization between privacy and utility when each user decides and self-determines a preferred privacy setting instead of using a universal privacy setting. In such a scenario, Inequality (3) is substituted by the following set of inequalities:

$$(q_1^* > \hat{q}_1 + \delta_1) \wedge \dots \wedge (q_{|N|}^* > \hat{q}_{|N|} + \delta_{|N|}) \wedge (u^* > \hat{u} + c) \quad (4)$$

where $\delta_n$ measures the change in privacy, which denotes whether a data producer $n$ belonging to a set of users $N$ requires higher privacy, $\delta_n > 0$, or lower privacy, $\delta_n < 0$. $q_n^*$ denotes a new (sub-optimal) privacy value for each data producer $n$. The value is computed by an optimization algorithm that searches for the optimal privacy value with respect to the data producer's privacy requirements expressed by $\delta_n$.

A system in which the inequalities of Relation (4) hold is referred to as a heterogeneous privacy system, where each data producer self-determines and autonomously applies a privacy setting based on a preferred privacy value and an expected reward for increasing system utility.
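For illustration, the homogeneous and heterogeneous conditions of Relations (3) and (4) can be checked mechanically once candidate privacy settings are represented as privacy-utility pairs. The sketch below is hypothetical: the Setting class, function names and numbers are illustrative and not part of the paper's framework.

```python
from dataclasses import dataclass

@dataclass
class Setting:
    name: str
    q: float  # privacy value produced by the setting
    u: float  # utility value produced by the setting

def satisfies_homogeneous(new, base, delta, c):
    """Relation (3): a single universal setting for all data producers."""
    return new.q > base.q + delta and new.u > base.u + c

def satisfies_heterogeneous(new_qs, base, deltas, new_u, c):
    """Relation (4): one privacy requirement delta_n per data producer n."""
    return all(q_n > base.q + d_n for q_n, d_n in zip(new_qs, deltas)) \
        and new_u > base.u + c

base = Setting("old", q=0.20, u=0.98)
candidate = Setting("new", q=0.40, u=0.84)
# The producers ask for at least 0.1 more privacy; the consumer tolerates
# up to 0.2 less utility (c < 0).
print(satisfies_homogeneous(candidate, base, delta=0.1, c=-0.2))  # True
```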
4. Framework
The design of a new privacy-preserving optimization framework is introduced in this section to tackle the challenges posed in Section 3. Additive noise masking mechanisms require a lower number of parameters in general and they are often used in privacy-utility optimization [1, 13, 25]. Each privacy setting is illustrated as an ellipse in Figure 1a; the elliptical shape is chosen for the sake of illustration and indicates a symmetrical distribution of the privacy-utility values, generated by a privacy setting, within the ellipse area. Each point within the ellipse is a possible privacy-utility pair of values. The ellipse center is chosen based on the privacy and utility mode of the setting. The mode is the value with the highest density; in symmetric distributions, it can be measured via the mean. The vertical radius of the ellipse denotes the dispersion of utility values, while the horizontal radius denotes the dispersion of privacy values. If the values belong to a Gaussian distribution, the standard deviation is used to measure the dispersion; since this is not always the case, other measures of scale can be used, such as the inter-quantile range (IQR). Additive noise is stochastic, which means that applying the same privacy setting on the same dataset yields varying privacy-utility values. The choice of an optimal privacy-utility pair cannot be achieved by only evaluating the mode of privacy and utility for each privacy setting. If the privacy-utility values of a privacy setting with a high utility mode vary to a large extent, there is a high probability that unexpected non-optimal values are observed. To overcome this challenge, the objective function of the parameter optimization algorithm selects the parameters that minimize the dispersion of privacy-utility values while maximizing the expected utility.

A data producer selects any privacy setting, among different ones, that satisfies personal privacy requirements. The proposed framework divides the range of privacy values into a number of equally sized bins, as illustrated in Figure 1b. Within each bin, a fitness value is calculated for each privacy setting, based on the privacy-utility mode and dispersion. Each privacy setting should produce privacy values with low dispersion. This is done by applying a constraint on the dispersion of privacy values and evaluating only privacy settings that satisfy this constraint, as shown in Figure 1c. The optimization framework evaluates several privacy settings, to find the parameters that achieve maximum privacy-utility values that vary as little as possible. This is illustrated in Figure 1d, in which the ellipses with the highest utility mode and lowest utility dispersion are filtered for each privacy bin.
Figure 1: A graphical representation of the algorithm. Each ellipse denotes the privacy-utility values of a privacy setting. (a) Privacy-utility trajectory; (b) binning of the privacy range; (c) evaluation via the objective function; (d) bin optimization; (e) objective function scale (low to high). In Figures 1c and 1d the varying color denotes the fitness value; a lighter red color denotes higher fitness.
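The per-bin filtering sketched in Figure 1 can be expressed in a few lines. The sketch below anticipates the concrete objective of Section 5 by using the median as a proxy for the mode and percentiles for dispersion; the bin edges, the dispersion threshold and all names are assumptions of this illustration.

```python
import numpy as np

def select_per_bin(settings, n_bins=5, max_q_spread=0.1):
    """Keep, for each privacy bin, the setting with the best utility profile.

    `settings` maps a setting id to two arrays of sampled privacy and
    utility values obtained by repeatedly applying the setting.  A setting
    is eligible only if its privacy dispersion (median minus 10th
    percentile) stays below `max_q_spread`; its fitness is the median plus
    the 10th percentile of its utility values.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    best = {}
    for sid, (q, u) in settings.items():
        q_med = np.percentile(q, 50)
        if q_med - np.percentile(q, 10) >= max_q_spread:
            continue  # privacy values too dispersed: setting is discarded
        fitness = np.percentile(u, 50) + np.percentile(u, 10)
        b = int(np.digitize(q_med, edges)) - 1  # index of the privacy bin
        if b not in best or fitness > best[b][1]:
            best[b] = (sid, fitness)
    return best

rng = np.random.default_rng(0)
demo = {"A": (rng.normal(0.1, 0.01, 200), rng.normal(0.95, 0.02, 200))}
print(select_per_bin(demo))  # {0: ('A', ...)}: setting A wins the first bin
```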
In a homogeneous data sharing system, a universal privacy setting is selected by the data producers via, for instance, voting [34]. Alternatively, in a heterogeneous system, the data producers self-determine the privacy setting independently. Theorem 1 below proves that aggregation functions can be accurately approximated (utility can be maximized) even if different privacy settings from the same or different masking mechanisms are selected.
Theorem 1.
Let the transformation of $|I|$ disjoint subsets of sensor values $S_i$ into the respective subsets of masked values $M_i$ use a certain privacy setting $f_i$ for each such transformation. It holds that the aggregation of the generated multisets of masked values $M_i$ approximates the aggregation of the sensor values multiset $S$:

$$g\left(\bigcup_{i=1}^{|I|} M_i\right) \to g(S), \quad (5)$$

given that the commutative and associative properties hold between each of the privacy settings $f_i$ and the aggregation function $g$.

Proof. Let a multiset of real sensor values $S \subseteq \mathbb{R}$ and $|I|$ disjoint subsets of $S$ such that:

$$\bigcup_{i=1}^{|I|} S_i = S, \quad S_i \neq \emptyset \;\; \forall i \in \{1, ..., |I|\} \quad (6)$$

Let a privacy setting $f : S, \Psi \to M$ be a pairwise element operation between a set of sensor values $S$ and a set of noise values $\Psi$, which transforms each sensor value $s \in S$ by aggregating it with a randomly selected noise value $\psi$ from $\Psi$ to produce a masked value $m$:

$$f(S, \Psi) = g(S \cup \Psi) = M \;\Leftrightarrow\; f(s, \psi) = g(\{s, \psi\}) = m \quad (7)$$

Let $g : A \to \mathbb{R}$ be an aggregation function which aggregates all elements of real-valued multisets $S, \Psi, M \subseteq A \subseteq \mathbb{R}$ into a single real value $g(A) = z_A \in \mathbb{R}$. Assume that $g$ is defined in a recursive manner so that it satisfies the following equation for a multiset $A$ and any union of all possible combinations of disjoint subsets $A_i$ that satisfy Relation (6):

$$g(A) = g\left(\bigcup_{i=1}^{|I|} A_i\right) = g\left(\bigcup_{i=1}^{|I|} g(A_i)\right) \quad (8)$$

According to the literature [2], the family of aggregation functions for which Relation (8) applies is referred to as extended aggregation functions; a subset of those functions are the averaging functions, which include aggregations such as the mean, weighted mean, Gini mean, Bonferroni mean, Choquet integrals etc. The pairwise operation between $s$ and $\psi$ in $f$ is designed in such a way that it satisfies the commutative and associative properties when combined with the pairwise operation of $g$:

$$g(f(S, \Psi)) = f(g(S), g(\Psi)) \quad (9)$$

where $g(\Psi) \to \iota$, and $\iota$ is the strong neutral element of the extended aggregation function $g$, such that:

$$g(g(A) \cup \iota) = g(A) \;\Rightarrow\; g(g(A) \cup g(\Psi)) \to g(A) \quad (10)$$

This property is used in the noise cancellation of Section 2.1.1. Let $|I|$ multisets $\Psi_i$ of noise satisfy Relation (6); then the following relation holds:

$$g(M_i) = g(f(S_i, \Psi_i)) \overset{(9)}{\Leftrightarrow} g(M_i) = f(g(S_i), g(\Psi_i)) \overset{(10)}{\Rightarrow} g(M_i) \to g(S_i), \quad (11)$$

which means that each noise multiset $\Psi_i$ is generated in such a way that the aggregation $g(M_i)$ approximates the aggregation $g(S_i)$. An illustrative example is the Laplace noise used in the literature for the aggregation functions of count or summation [11, 9], which satisfies Relations (7), (8) and (10). Now it can be proven that:

$$g\left(\bigcup_{i=1}^{|I|} f_i(S_i, \Psi_i)\right) \overset{(8)}{=} g\left(\bigcup_{i=1}^{|I|} g(f_i(S_i, \Psi_i))\right) \overset{(11)}{\Leftrightarrow} g\left(\bigcup_{i=1}^{|I|} M_i\right) \to g\left(\bigcup_{i=1}^{|I|} g(S_i)\right) \overset{(6),(8)}{\Leftrightarrow} g\left(\bigcup_{i=1}^{|I|} M_i\right) \to g(S) \quad (12)$$

Thus, Theorem 1 is proven.

The practical implication of Theorem 1 is that the aggregation of sensor values is approximated by the aggregation of masked values produced by different privacy settings. The approximation stands as long as the noise values produced by the different privacy settings satisfy Relations (9) and (10). According to Relation (6), each subset of sensor values should be masked by one privacy setting. Regarding the complexity of these operations, applying the masking on top of sensor values is linearly dependent on the number of sensor values $|S_i|$ assigned to each privacy setting. Due to Relation (6), applying the proposed framework in real time increases computational complexity by $O(|S|)$. The original values are not stored or transmitted at runtime, thus the storage and communication complexity does not change. During optimization, all the privacy settings $i \in I$ are applied to a training set of sensor values $S$. In that case, real sensor values are stored and transmitted along with the masked values for each setting. The storage and communication costs increase by $O(|I| \cdot |S|)$. The computation costs also increase to $O(|I| \cdot |S|)$, which is a quadratic complexity in the worst case $|I| = |S|$. In most real-world applications, it is safe to assume that the sensor values have a considerably higher volume than the evaluated privacy settings, $|I| \ll |S|$, thus the expected computational, storage and communication complexity are linear in the number of sensor values.

The framework can be applied as a multi-agent system. It requires two types of agents representing the data consumers and data producers. This scheme can be applied in both centralized and decentralized aggregation services, such as MySQL or DIAS [36]. Finally, in both heterogeneous and homogeneous systems, the data consumer can influence the data producer's choice by offering a higher amount of reward to achieve a higher utility.
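A quick numerical illustration of Theorem 1, with summation assumed as the aggregation function g: three disjoint subsets of sensor values are masked with three different zero-mean privacy settings, and the aggregate of the masked values still approximates the true aggregate. The subset sizes and noise scales are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
s = rng.uniform(0.1, 2.0, size=90_000)   # the sensor value multiset S
subsets = np.array_split(s, 3)           # disjoint subsets S_i, Relation (6)

# Three heterogeneous privacy settings: zero-mean noise distributions with
# different shapes and scales (g(Psi_i) approaches the neutral element 0).
masked = [
    subsets[0] + rng.laplace(0.0, 0.5, subsets[0].size),
    subsets[1] + rng.laplace(0.0, 2.0, subsets[1].size),
    subsets[2] + rng.uniform(-1.0, 1.0, subsets[2].size),
]

# Relation (12): the sum of all masked values approximates the sum of S.
print(sum(m.sum() for m in masked), s.sum())
```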
5. Experimental Settings
This section illustrates the experimental settings, which are used to empirically evaluate the proposed framework. A set of sensor values $S$ is used for the evaluation. Each sensor value $s_{n,t}$ belongs to a user $n$ and is generated at time $t$. For each sensor value, a privacy setting that operates on the device of the data producer masks the sensor value, $f_\eta(s_{n,t}, \theta_{\eta,k})$, by using the masking mechanism $\eta$ with parameters $\theta_{\eta,k}$. Two metrics are used to evaluate privacy and utility.

The main metric, which is used to calculate privacy, is the relative difference between the masked value and the original value, defined as the local error:

$$\varepsilon_{n,t} = \left| \frac{f_\eta(s_{n,t}, \theta_{\eta,k}) - s_{n,t}}{s_{n,t}} \right| \quad (13)$$

For a privacy setting to achieve high privacy, a data consumer should not be able to estimate the local error for the sensor values sent by data producers. This is achieved by choosing privacy settings that generate noise that is difficult to estimate. As shown in the literature [37, 25, 13, 1], the noise is difficult to estimate if it is highly random and causes a significant change in the original value. To avoid noise filtering attacks, noise with low or no autocorrelation is generated. The range of autocorrelation values can be determined analytically when the noise generation function is defined. In case this is not possible, a metric quantifying the color of the noise can be included in the objective function. Randomness is evaluated by measuring the Shannon entropy [42] $H(E)$ of the local error for all local error values $E$. The entropy is calculated by creating a histogram of the error values and then applying the discrete Shannon entropy calculation. Each bin of the histogram has a size of 0.001. The significance of change is measured by calculating the mean local error $\mu(E)$ and standard deviation $\sigma(E)$. When comparing privacy settings, higher mean, variance and entropy indicate higher privacy [1]. In this article, the objective function that measures privacy for a privacy setting $f_{\eta,k}$ is defined as follows:

$$q = \alpha_1 \frac{\mu(E_{\eta,k})}{\max(\mu(E_{\eta,k}))} + \alpha_2 \frac{\sigma(E_{\eta,k})}{\max(\sigma(E_{\eta,k}))} + \alpha_3 \frac{H(E_{\eta,k})}{\max(H(E_{\eta,k}))} \quad (14)$$

where $\alpha_1$, $\alpha_2$, $\alpha_3$ are weighting parameters used to control the effect of each metric in the privacy objective function. $\max(\cdot)$ is the maximum observed value for a metric during the experiments. This value is produced by evaluating all privacy settings $f_{\eta,k}$. Dividing by this value normalizes the metrics in $[0, 1]$.

The utility of the system is estimated by measuring the error the system accumulates within a time period, by computing an aggregation function $g(\cdot)$ on the masked sensor values. Examples of such aggregation functions are the daily total, daily average and weekly variance of the sensor values. The accumulated error is referred to as the global error:

$$\epsilon_t = \left| \frac{g(M_t) - g(S_t)}{g(S_t)} \right| \quad (15)$$

The error functions described in (13) and (15) are also known in the literature as the absolute percentage error (APE) [31]. The error values are easy to interpret, as APE measures the relative change of the sensor values and aggregate values caused by masking. Yet, when the denominator of the function approaches zero, the absolute relative error cannot be calculated. If the sensor values are sparse, another error function can be used, such as MAPE.

A sample set of global error values $\epsilon$ is created by applying the masking process for a number of time periods of the dataset. The mean, entropy and variance of the global error of a privacy setting $f_{\eta,k}$ are calculated over this sample. The mean global error $\mu(\epsilon_{\eta,k})$ indicates the expected error between the masked and actual aggregate. The standard deviation $\sigma(\epsilon_{\eta,k})$ and the entropy $H(\epsilon_{\eta,k})$ of the global error indicate how much and how often the masked aggregate diverges from the expected value. Minimizing all three quantities to 0 ensures that the masked aggregate approximates the actual aggregate efficiently. Thus, after the global error sample is created for each privacy setting, the corresponding utility objective function is calculated:

$$u = 1 - \left( \gamma_1 \frac{\mu(\epsilon_{\eta,k})}{\max(\mu(\epsilon_{\eta,k}))} + \gamma_2 \frac{\sigma(\epsilon_{\eta,k})}{\max(\sigma(\epsilon_{\eta,k}))} + \gamma_3 \frac{H(\epsilon_{\eta,k})}{\max(H(\epsilon_{\eta,k}))} \right) \quad (16)$$

where the weighting parameters $\gamma_1$, $\gamma_2$, $\gamma_3$ are used to control the effect of each metric in the utility objective function. $\max(\cdot)$ is the maximum observed value for a metric during the experiments. This value is produced by evaluating all privacy settings $f_{\eta,k}$. Dividing by this value normalizes the metrics in $[0, 1]$.

As a result, the maximization of the following objective function is based on utility:

$$\text{perc}(U, 50) + \text{perc}(U, 10) \quad (17)$$

where $\text{perc}(U, i)$ calculates the $i$-th percentile of a set of utility values $U$ produced by the application of a privacy setting. The factors that maximize Relation (17) are: (i) the value of the mode, which is assumed to be approximated by the median, and (ii) the dispersion towards values lower than the median, which is expressed by adding the 10th percentile to the median. (It is confirmed in some experimental settings that some privacy settings generate samples of privacy-utility values that do not pass a Kolmogorov-Smirnov normality test [39] and are also non-symmetrical.) The objective function evaluates the median and the negative dispersion (10th percentile) of the utility values. Positive dispersion is not taken into account in the optimization, since the abstract objective of the optimization is to ensure the least expected utility of a privacy setting for the data consumers. The privacy is constrained by evaluating only privacy settings in which the 10th percentile differs from the privacy median by at most $\omega$, as shown in Inequality (18). The value of $\omega$ is constrained to be lower than or equal to the bin size of the optimization to ensure low privacy dispersion:

$$\text{perc}(Q, 50) - \text{perc}(Q, 10) < \omega \quad (18)$$

where $\text{perc}(Q, i)$ calculates the $i$-th percentile of a set of privacy values $Q$ produced by the application of a privacy setting.
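The error metrics and the privacy objective of Relations (13)-(15) translate directly into code. In this sketch the histogram approximates the fixed 0.001 bin size via a bin count, the default weights mirror the α values reported in Section 6, and the normalization constants, which the paper takes as maxima over all evaluated settings, are left as placeholder arguments.

```python
import numpy as np

def local_errors(masked, original):
    """Per-value absolute percentage error of Relation (13)."""
    return np.abs((masked - original) / original)

def global_error(masked, original):
    """Aggregate absolute percentage error of Relation (15), with g = sum."""
    return abs((masked.sum() - original.sum()) / original.sum())

def shannon_entropy(errors, bin_size=0.001):
    """Discrete Shannon entropy of a histogram of error values."""
    n_bins = max(1, int(np.ceil(errors.max() / bin_size)))
    counts, _ = np.histogram(errors, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

def privacy_score(E, a=(0.2, 0.4, 0.4), norms=(1.0, 1.0, 1.0)):
    """Privacy objective of Relation (14); `norms` stand for the maximum
    observed value of each metric across all evaluated settings."""
    metrics = (E.mean(), E.std(), shannon_entropy(E))
    return sum(w * v / n for w, v, n in zip(a, metrics, norms))

rng = np.random.default_rng(0)
orig = rng.uniform(0.1, 2.0, size=10_000)
masked = orig + rng.laplace(0.0, 0.3, orig.size)
print(privacy_score(local_errors(masked, orig)), global_error(masked, orig))
```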
6. Experimental Evaluation
The proposed framework is evaluated experimentally by applying it to a real-world dataset. Privacy and utility are evaluated using over 20,000 privacy settings for empirical evaluation.
The Electricity Customer Behavior Trial (ECBT) dataset contains sensor data that measure the energy consumption of 6,435 energy data producers. The data are sampled every 30 minutes daily for 536 days. For the proposed framework, a set of sensor values $S$ of $|N| = 6{,}435$ users and $|T| = 536$ time periods is used. The total number of sensor values in the set is $|S| = 165{,}559{,}680$ (6,435 users over 536 days with 48 daily measurements). The aggregation function evaluated is the daily summation of the sensor values, $g(S_t) = \sum_{n=1}^{|N|} s_{n,t}$. Around 10% of the daily measurements are missing values, and are not included in the experiments. The significance of the missing values reduces as the aggregation interval increases. Therefore, a daily summation is chosen over a more granular summation.

During the experiments, the local error of Relation (13) results in a non-finite number only for a low number of maskings: when the original sensor value is zero, the result of Relation (13) is infinite for non-zero noise or indefinite for zero noise. Hence, these values are excluded from the experiments, so that the calculation of finite local error values is feasible. Concluding, the proposed framework operates on 90% of the ECBT dataset.

Among several masking mechanisms [1], two are used for the evaluation of the framework. Each mechanism is parameterized using the grid search algorithm [28], also known as parameter sweep. The majority of masking mechanisms are parameterized with real numerical values. A grid search discretizes these values, and then evaluates exhaustively all possible combinations of parameter values.

Laplace mechanism: This mechanism is widely used in the literature [13, 1, 10]. The noise in the experiments of this paper is generated by sampling a Laplace distribution with zero mean. The scale parameter $b$ of the distribution is selected to ensure maximum privacy. Part of the privacy can be sacrificed to increase utility if the privacy requirements of the data producers are not high. In this masking mechanism, this is achieved by reducing $b$. The scale parameter for each Laplace masking setting is generated starting from the value $b = 0.001$ and, during the parameter sweep, the value increases by 0.001 until it reaches $b = 10$.

Sine polyonym mechanism: This mechanism is introduced in this paper. The mechanism generates random noise that can be added to each sensor value. Assume a uniform random variable $\upsilon$. The noise generated by the introduced masking mechanism is calculated as follows:

$$m = \sum_{\xi=0}^{|\Xi|} \left[ \theta_\xi \sin(2\pi\upsilon) \right]^{\xi+1} \quad (19)$$

The coefficients of the polyonym are denoted as $\theta_\xi$, and $\xi$ denotes the index of the coefficient. Both the length of the polyonym $|\Xi|$ and the individual coefficient values can be tuned to optimize the resulting privacy-utility values of the masking mechanism. The generated noise is symmetrically distributed around zero, because the odd power of the sine function produces both negative and positive noise with equal probability. The sine function and its odd powers are always symmetrical towards the horizontal axis, meaning that $|[\theta_\xi \sin(2\pi\upsilon)]^{\xi+1}| = |[-\theta_\xi \sin(2\pi\upsilon)]^{\xi+1}|$. Hence, the integral of each factor is zero, $\int_0^1 [\theta_\xi \sin(2\pi\upsilon)]^{\xi+1} d\upsilon = 0$, and the distribution of generated values is symmetrical around zero for $\upsilon \in [0, 1]$. The coefficient values are swept with a step of 0.03 until the value of 0.3, to evaluate settings that create low noise. Then the step changes to 0.3 until the value of 1.8, to evaluate privacy settings that generate higher values of noise. The sine polyonym masking settings are generated by creating all possible permutations of these values for 5 coefficients. This yields around 10,000 masking settings. Preliminary analysis on the autocorrelation and the spectrograms of the proposed sine polyonym noise does not show autocorrelation or recurring patterns over different spectrograms; further analysis on this is possible future work and is out of the scope of this article, and can be performed by introducing a metric that measures noise color in the privacy function.

Each privacy setting that results from the parameterization of the mechanisms is evaluated by analyzing the local and the global error that it generates on varying subset sizes of the ECBT dataset. By sampling varying sizes of the dataset, the utility and privacy dispersion metrics are evaluated on a varying number of sensor values, measuring the effect of varying participation in the system. To this end, a subset of users $N_{test}$ is chosen; in each repetition the users are chosen randomly. All users use a universal privacy setting. The initial size of the subset is 50 users, and then it increases by 50 users until $|N_{test}| = 500$ users. Then, the size of the subset increases by 500 users until $|N_{test}| = 6{,}435$ users. The cumulative distribution function (CDF), which denotes the probability of a generated value being lower than or equal to the corresponding domain axis value [45], is shown for each metric in Figure 2.

The sine polyonym mechanism can produce a wider range of local and global error values compared to the laplace mechanism, since almost every sine polyonym CDF curve covers a wider range on the domain axis compared to the respective laplace CDF curves. The majority of the range axis values of the sine polyonym CDF curves are higher than the corresponding range values of the laplace CDF curves. This indicates that it is more probable to generate a lower global or local error value by using a sine polyonym setting compared to a laplace setting. Concluding, the sine polyonym settings are expected to produce a wider range of privacy-utility trade-offs. Based on the CDF charts, sine polyonym settings are more likely to achieve higher utility, whereas laplace settings are expected to achieve higher privacy.

For the experiments, the $\alpha$ and $\gamma$ parameters are defined to calculate privacy and utility. The choice of these parameters may vary based on the distribution of the sensor values and the kind of aggregation. Also, data producers and data consumers may have varying requirements that affect the choice of those values. In this paper, these values are determined empirically, to showcase an empirical evaluation. If a data consumer successfully calculates the local error mean by acquiring the corresponding original values of a masked set, then it is possible to estimate the original sensor values of other masked sets as well, by subtracting the calculated mean. This challenge is addressed by using privacy settings with high noise variance.
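The sine polyonym noise of Relation (19) can be sketched as follows. Since the symmetry argument above holds for odd powers of the sine, this sketch assumes the exponent 2ξ+1 per factor so that every term is zero-mean; with even exponents included literally, the noise would bias the aggregate. All names and the example coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sine_polyonym_noise(coeffs, size):
    """Noise as in Relation (19): a sum of powers of theta_xi * sin(2*pi*u),
    with u uniform in [0, 1); odd exponents (assumed here) keep every factor
    symmetric around zero, so the noise cancels in large aggregates."""
    u = rng.uniform(0.0, 1.0, size=size)
    s = np.sin(2 * np.pi * u)
    return sum((theta * s) ** (2 * xi + 1) for xi, theta in enumerate(coeffs))

readings = rng.uniform(0.1, 2.0, size=100_000)
noise = sine_polyonym_noise([0.6, 0.6, 0.0, 0.9, 0.3], size=readings.size)
print(readings.sum(), (readings + noise).sum())  # the sums stay close
```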
Figure 2: Cumulative distribution function of each local and global error metric computed by all settings of each masking mechanism. (a) Local error mean; (b) global error mean; (c) standard deviation of local error; (d) standard deviation of global error; (e) local error entropy; (f) global error entropy.

Still, high variance does not guarantee that the masking process is irreversible. If the noise varies between a small finite number of real values, then the data consumer can also estimate the original value of the data by subtracting the variance. To overcome this challenge, privacy settings that produce noise with high entropy, and therefore high randomness, are chosen. Consequently, a lower value is chosen for the coefficient of the local error mean, $\alpha_1 = 0.2$, while the entropy and standard deviation of the local error share a higher coefficient value of $\alpha_2 = \alpha_3 = 0.4$, so that the weights sum to one. The $\gamma$ weighting parameters of the utility objective are chosen in a corresponding manner, and only privacy settings whose global error mean $\mu(\epsilon)$ and standard deviation $\sigma(\epsilon)$ remain below empirically chosen thresholds are considered.

In the laplace mechanism, increasing the scale parameter $b$ of the distribution also increases the total noise added to the dataset. In the sine mechanism, increasing the number and values of the coefficients also increases the total generated noise. In Figure 3, a comparison of privacy and utility is shown between the two types of mechanisms. The values of utility and privacy are generated as shown in Section 6.3. The total noise is generated by measuring the noise level of each privacy setting on a sample of 100,000 sensor values; this sample size is chosen to be large enough for statistical significance and small enough to reduce computation costs. The lines are smoothed by applying a moving average, to make the comparison clearer. For the same amount of total absolute generated noise $\sum_t |\psi_t|$, the laplace privacy settings achieve higher privacy, often more than 1% over the sine polyonym privacy settings. The sine polyonym privacy settings achieve higher utility, around 1% over the laplace privacy settings. Therefore, the results illustrated in Figure 2 are reflected in the privacy and utility values generated from the above parameterization. Moreover, the trade-off between privacy and utility is observable, as privacy increases with the decrease of utility and vice versa for both mechanisms.

Figure 3: Comparison of sine polyonym and laplace masking mechanisms in terms of privacy and utility.

All the generated privacy settings are evaluated via the framework proposed in Section 4. The proposed framework filters out five privacy settings for five privacy bins of size 0.2. The constraint value for evaluating privacy settings is chosen empirically to be half of the bin size,
$\omega = 0.1$, to ensure low privacy dispersion, based on Relation (18). The resulting privacy settings are summarized in Table 1.

Table 1: A table summarizing the performance of the five optimal privacy settings based on the parameters, denoting the coefficient value for each factor of the sine polyonym or the scale value for a laplace mechanism. In the case of the sine polyonym, the first number from the right is mapped to the first factor ($\xi = 1$) and so on.

| ID | Masking | Parameters           | Privacy | Utility |
|----|---------|----------------------|---------|---------|
| A  | cosine  | 0.0-0.0-0.0-0.18-0.0 | 0.01    | 0.99    |
| B  | laplace | 0.005                | 0.20    | 0.98    |
| C  | cosine  | 0.6-0.6-0.0-0.9-0.3  | 0.40    | 0.84    |
| D  | cosine  | 1.2-0.3-0.6-1.2-0.9  | 0.60    | 0.76    |
| E  | cosine  | 1.5-1.5-1.2-0.3-1.2  | 0.80    | 0.68    |
| N  | none    | -                    | 0.00    | 1.00    |

Figure 4a shows the generated privacy-utility values for all the privacy settings tested. Each color is mapped to the masking mechanism that is used to produce the setting. The line denotes the median value of utility at the given privacy value. The non-median privacy-utility values occur in the semi-transparent area; the upper and lower edges of the area denote the minimum and maximum utility value for the corresponding privacy value. Lower utility values for a given privacy point are generated from applications of the privacy setting on small subsets of the ECBT dataset.

In a heterogeneous system, the framework performance is evaluated under the use of different privacy settings by each user. The difference in privacy and utility between homogeneous and heterogeneous systems is quantified. This quantification is done by performing an exhaustive simulation. The simulation combines the ECBT dataset and the six privacy settings in Table 1. Every user of the ECBT dataset is assigned a privacy setting from Table 1.

Figure 4: Figures 4a & 4c show the privacy-utility trajectory of the privacy settings grouped by masking mechanism in the same color. Figures 4b & 4d illustrate the trajectories of the privacy settings which are generated by the proposed framework. (a) Trajectory for all user set sizes; (b) optimization results for all user set sizes; (c) trajectory for all user sets with more than 1000 users; (d) optimization results for user sets with more than 1000 users.

Figure 5 shows the median and the interquantile range (IQR) of privacy and utility for all distributions of privacy setting choices in which one privacy setting has a higher percentage of users compared to the others; such a setting is referred to as the dominant setting. IQR is considered a robust measure of scale, which is especially used for non-symmetric distributions; it measures the range between the 25th and the 75th quantiles. This sorting of settings is done to examine the privacy-utility changes while users move from a higher to the next lower utility setting. The top row of each heatmap shows the homogeneous scenario case, where 100% of the users chose only one setting.

The analysis of the heatmap in Figure 5a shows an increase in privacy when the majority of users choose the more privacy-preserving settings of the homogeneous scenario. This effect is observed for any percentage of users for a dominant setting. A decrease in the utility median is confirmed in Figure 5c, when the majority of users shifts from less private to more private settings. The trade-off between privacy and utility is thus observed in the heterogeneous scenario as well.

Figure 5: The heatmaps in Figures 5a-5d show the privacy and utility median and IQR values for various distributions of privacy setting choices among the users. (a) Privacy median; (b) privacy IQR; (c) utility median; (d) utility IQR.
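A minimal sketch of the heterogeneous simulation just described: every user is assigned one of a handful of settings and the global error of the masked aggregate is compared to the homogeneous case. The per-setting Laplace scales stand in for the optimized settings of Table 1 and, like all names here, are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical noise scales standing in for the settings of Table 1.
SCALES = {"N": 0.0, "A": 0.05, "B": 0.005, "C": 0.5, "D": 1.0, "E": 2.0}

def global_error(readings, choices):
    """Mask each user's reading with the scale of its chosen setting and
    return the absolute percentage error of the aggregate sum."""
    noise = np.array([rng.laplace(0.0, SCALES[c]) if SCALES[c] > 0 else 0.0
                      for c in choices])
    masked = readings + noise
    return abs((masked.sum() - readings.sum()) / readings.sum())

readings = rng.uniform(0.1, 2.0, size=1_000)
homogeneous = ["C"] * readings.size
heterogeneous = rng.choice(list("ABCDE"), size=readings.size)
print(global_error(readings, homogeneous), global_error(readings, heterogeneous))
```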
7. Conclusion and Future Work
An optimization framework for the selection of privacy settings is introduced in this paper. The framework computes privacy settings that maximize utility for different values of privacy. This framework can be utilized in privacy-preserving systems that calculate aggregation functions over privatized sensor data. The data producers of such a system can self-determine the privacy setting of their choice, since it is guaranteed that it produces the desired privacy with very low deviation. For the data consumer of the system, it is guaranteed that if the data producers are incentivized to use low-privacy, high-utility settings, the approximated aggregate is highly accurate. Analytical as well as empirical evaluation using over 20,000 privacy settings and real-world data from a Smart Grid pilot project confirm the viability of participatory data sharing under informational self-determination.

For future work, the proposed framework can be improved by incorporating a machine learning process that computes personalized recommendations of privacy settings for each data producer, by identifying the prior distribution of the sensor data and also the preferences of the data producer. Further empirical evaluations of the framework can be performed by implementing other aggregation functions and using different datasets. Finally, an analytical proof that the sine polyonym additive noise is not colored and is differentially private can be performed.
8. Acknowledgments
This work is supported by the European Community's H2020 Program under the scheme 'ICT-10-2015 RIA', grant agreement .
9. References

[1] Aggarwal, C. C. and Yu, P. S., editors (2008). Privacy-Preserving Data Mining, volume 34 of Advances in Database Systems. Springer US, Boston, MA.
[2] Beliakov, G., Pradera, A., and Calvo, T. (2007). Aggregation functions: A guide for practitioners. In Studies in Fuzziness and Soft Computing.
[3] Bennati, S. and Pournaras, E. (2018). Privacy-enhancing aggregation of internet of things data via sensors grouping. Sustainable Cities and Society, 39:387–400.
[4] Bhatia, J. and Breaux, T. D. (2015). Towards an information type lexicon for privacy policies. In , pages 19–24.
[5] Book, R. V. (1974). Comparing complexity classes. Journal of Computer and System Sciences, 9(2):213–229.
[6] Burns, A. C. and Bush, R. F. (2014). Marketing research. Pearson.
[7] Carmichael, L., Stalla-Bourdillon, S., and Staab, S. (2016). Data Mining and Automated Discrimination: A Mixed Legal/Technical Perspective. IEEE Intelligent Systems, 31(6):51–55.
[8] Du, W. and Atallah, M. J. (2001). Secure multi-party computation problems and their applications. In Proceedings of the 2001 Workshop on New Security Paradigms - NSPW '01, page 13, New York, New York, USA. ACM Press.
[9] Duan, Y. (2009). Differential privacy for sum queries without external noise. In ACM Conference on Information and Knowledge Management (CIKM).
[10] Duan, Y. and Yitao (2009). Privacy without noise. In Proceeding of the 18th ACM Conference on Information and Knowledge Management - CIKM '09, page 1517, New York, New York, USA. ACM Press.
[11] Dwork, C. (2006). Differential privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, pages 1–12.
[12] Dwork, C. (2008). Differential Privacy: A Survey of Results. In Theory and Applications of Models of Computation, pages 1–19. Springer Berlin Heidelberg, Berlin, Heidelberg.
[13] Dwork, C. and Roth, A. (2014). The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9:211–407.
[14] Dwork, C., Smith, A., Steinke, T., and Ullman, J. (2017). Exposed! A survey of attacks on private data. Annual Review of Statistics and Its Application, 4(1):61–84.
[15] Gentry, C. (2009a). A fully homomorphic encryption scheme. PhD thesis, Stanford University. crypto.stanford.edu/craig.
[16] Gentry, C. (2009b). Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, STOC '09, pages 169–178, New York, NY, USA. ACM.
[17] Gentry, C. (2009c). Fully homomorphic encryption using ideal lattices. Proceedings of the 41st Annual ACM Symposium on Theory of Computing STOC 09, 19(September):169.
[18] Gentry, C. (2010). Computing arbitrary functions of encrypted data. Commun. ACM, 53(3):97–105.
[19] Gentry, C. and Halevi, S. (2011). Implementing Gentry's Fully-Homomorphic Encryption Scheme, pages 129–148. Springer Berlin Heidelberg, Berlin, Heidelberg.
[20] Gibson, J. D., Koo, B., and Gray, S. D. (1991). Filtering of colored noise for speech enhancement and coding. IEEE Transactions on Signal Processing, 39(8):1732–1742.
[21] Gkika, S. and Lekakos, G. (2014). Investigating the effectiveness of persuasion strategies on recommender systems. In , pages 94–97.
[22] Helbing, D. and Pournaras, E. (2015). Build digital democracy. Nature, 527(7576):33–34.
[23] Kairouz, P., Oh, S., and Viswanath, P. (2017). The composition theorem for differential privacy. IEEE Transactions on Information Theory, 63(6):4037–4049.
[24] Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences. SAGE Publications.
[25] Krause, A. and Horvitz, E. (2008). A utility-theoretic approach to privacy and personalization. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI'08, pages 1181–1188. AAAI Press.
[26] Krutz, R. L. and Vines, R. D. (2010). Cloud Security: A Comprehensive Guide to Secure Cloud Computing. Wiley Publishing.
[27] Kursawe, K., Danezis, G., and Kohlweiss, M. (2011). Privacy-Friendly Aggregation for the Smart-Grid, pages 175–191. Springer Berlin Heidelberg, Berlin, Heidelberg.
[28] Lerman, P. M. (1980). Fitting Segmented Regression Models by Grid Search. Applied Statistics, 29(1):77.
[29] Li, C., Li, D. Y., Miklau, G., and Suciu, D. (2014). A Theory of Pricing Private Data. ACM Transactions on Database Systems, 39(4):1–28.
[30] Li, T. and Li, N. (2009). On the tradeoff between privacy and utility in data publishing. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '09, page 517, New York, New York, USA. ACM Press.
[31] Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4):527–529.
[32] McDonald, A. M. and Cranor, L. F. (2008). The Cost of Reading Privacy Policies. A Journal of Law and Policy for the Information Society, 4(3):543–568.
[33] Nin, J., Herranz, J., and Torra, V. (2008). On the disclosure risk of multivariate microaggregation. Data & Knowledge Engineering, 67(3):399–412.
[34] Nurmi, H. (2012). Comparing Voting Systems. Theory and Decision Library A. Springer Netherlands.
[35] Phan, N., Wang, Y., Wu, X., and Dou, D. (2016). Differential Privacy Preservation for Deep Auto-Encoders: An Application of Human Behavior Prediction. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), pages 1309–1316.
[36] Pournaras, E., Nikolic, J., Omerzel, A., and Helbing, D. (2017). Engineering democratization in internet of things data analytics. In International Conference on Advanced Information Networking and Applications (AINA), pages 994–1003.
[37] Pournaras, E., Nikolic, J., Velásquez, P., Trovati, M., Bessis, N., and Helbing, D. (2016). Self-regulatory information sharing in participatory social sensing. EPJ Data Science, 5(1):1–24.
[38] Prabhakaran, M. and Rosulek, M. (2008). Cryptographic Complexity of Multi-Party Computation Problems: Classifications and Separations, pages 262–279. Springer Berlin Heidelberg, Berlin, Heidelberg.
[39] Rao, P. V., Schuster, E. F., and Littell, R. C. (1975). Estimation of shift and center of symmetry based on Kolmogorov-Smirnov statistics. The Annals of Statistics, 3(4):862–873.
[40] Samarati, P. and Sweeney, L. (1998). Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, Computer Science Laboratory, SRI International.
[41] Sánchez, D., Domingo-Ferrer, J., and Martínez, S. (2014). Improving the Utility of Differential Privacy via Univariate Microaggregation. In Lecture Notes in Computer Science, pages 130–142. Springer, Cham.
[42] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27:379–423.
[43] Shokri, R. and Shmatikov, V. (2015). Privacy-Preserving Deep Learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security - CCS '15, pages 1310–1321, New York, New York, USA. ACM Press.
[44] Soria-Comas, J., Domingo-Ferrer, J., Sánchez, D., and Megías, D. (2017). Individual differential privacy: A utility-preserving formulation of differential privacy guarantees. IEEE Transactions on Information Forensics and Security, 12(6):1418–1429.
[45] Spiegel, M. (1992). Schaum's Outline of Statistics.
[46] Wang, Z., Fan, K., Zhang, J., and Wang, L. (2013). Efficient Algorithm for Privately Releasing Smooth Queries.
[47] Wang, Z., Jin, C., Fan, K., Zhang, J., Huang, J., Zhong, Y., and Wang, L. (2016). Differentially Private Data Releasing for Smooth Queries. Journal of Machine Learning Research, 17(51):1–42.
[48] Yao, A. C. (1982). Protocols for secure computations. In Proceedings of the 23rd Annual Symposium on Foundations of Computer Science, SFCS '82, pages 160–164, Washington, DC, USA. IEEE Computer Society.
Nomenclature

$A$ — A multiset of real values. Any capital letter is treated as a multiset of real values, unless stated otherwise.
$g(A)$ — A function that aggregates all elements of a set $A$ into a real value, e.g. for the sum: $g_{sum}(A) = \sum_{i=0}^{|A|} a_i$.
$\mu(A)$ — The mean value of all elements of a set, where $a_i \in A$.
$m(A)$ — The median value of all elements of a set, where $a_i \in A$.
$\max(A)$ — The maximum value of all elements of a set, where $a_i \in A$.
$\min(A)$ — The minimum value of all elements of a set, where $a_i \in A$.
$H(A)$ — The Shannon entropy value for all elements of a set, where $a_i \in A$.
$\hat{a}$ — A suboptimal value that approaches an optimal value, e.g. $\hat{a} \to \max(A)$ or $\hat{a} \to \min(A)$.
$a^*$ — A new suboptimal value that approaches an existing suboptimal value $\hat{a}$.
$n$ — A user.
$N$ — A set of users.
$t$ — A time index.
$s_{n,t}$ — A sensor value generated at time $t$ by the user $n$.
$\eta$ — A masking mechanism, which consists of a parametric algorithm that masks the sensor values of a multiset $S$.
$\theta_{\eta,k}$ — A parameterization $k$ for a masking mechanism $\eta$.
$\upsilon$ — A uniformly distributed variable.
$f_\eta(S, \theta_{\eta,k})$ — A privacy setting consisting of a masking mechanism $\eta$ parameterized with parameters $\theta_{\eta,k}$ and operating on a set of sensor values $S$. It produces a masked set of sensor values $f_\eta(S, \theta_{\eta,k}) = M$, such that $|S| = |M|$.
$Q$ — A multiset of privacy values.
$\alpha_i$ — A parameter that weights the importance of privacy factors for calculating the privacy values.
$\delta$ — A parameter that denotes the amount of privacy that the data producer sacrifices or gains over the existing privacy.
$c$ — A parameter that denotes the amount of utility that a data consumer sacrifices or gains over the existing utility value.
$U$ — A multiset of utility values.
$\gamma_i$ — A parameter that weights the importance of utility factors for calculating the utility values.