Publishing Location Dataset Differentially Privately with Isotonic Regression
Chengfang Fang Ee-Chien Chang
School of Computing, National University of Singapore
[email protected] [email protected]
Abstract.
We consider the problem of publishing location datasets, in particular 2D spatial pointsets, in a differentially private manner. Many existing mechanisms focus on frequency counts over some a priori partition of the domain, which is difficult to determine. We propose an approach that adds noise directly to each point, or to a group of neighboring points. Our approach is based on the observation that the sensitivity of sorting, as a function on sets of real numbers, can be bounded. Together with isotonic regression, the dataset can be accurately reconstructed. To extend the mechanism to higher dimensions, we employ a locality preserving function to map the dataset to a bounded interval. Although there are fundamental limits on the performance of locality preserving functions, fortunately, our problem only requires distance preservation in the “easier” direction, and the well-known Hilbert space-filling curve suffices to provide high accuracy. The publishing process is simple from the publisher’s point of view: the publisher just needs to map the data, sort them, group them, add Laplace noise and publish the dataset. The only parameter to determine is the group size, which can be chosen based on predicted generalization errors. Empirical study shows that the published dataset can also be exploited to answer other queries, for example, range queries and median queries, accurately.
1 Introduction

The popularity of personal devices equipped with location sensors leads to large amounts of location data being gathered. Such data contain rich information and would be valuable if they could be shared and published. As the data may reveal the locations of identified individuals, it is important to anonymize the data before publishing. The recently developed notion of differential privacy [5] provides a strong form of privacy assurance regardless of the background information held by the adversaries. Such assurance is important as many case studies and past events have shown that a seemingly anonymized dataset, together with additional knowledge held by the adversary, could reveal information on individuals.

Most studies on differential privacy focus on publishing statistical values, for instance, k-means [2], private coresets [7], and the median of the database [19]. Publishing specific statistics or data-mining results is meaningful if the publisher knows what the public specifically wants. However, there are situations where the publishers want to give the public greater flexibility in analyzing and exploring the data, for example, using different visualization techniques. In such scenarios, it is desired to “publish data, not the data mining result” [8].

Fig. 1.
Twitter location data cropped at the North America region. To avoid clogging, only 10% of the points (randomly chosen) are plotted.

In this paper, we consider the problem of publishing location data, or other low dimensional data, in a differentially private manner. An example is shown in Fig. 1, which depicts the locations of 183,072 Twitter users in North America [1], and Fig. 2 shows a sequence of sorted real numbers obtained by mapping the points in Fig. 1 into the unit interval. We propose a mechanism based on the observation that sorting, as a function that takes in a set of real numbers from the unit interval, interestingly has sensitivity one (Theorem 1). Hence ε-differential privacy can be achieved by adding Laplace noise with scale parameter 1/ε directly to the sorted sequence. Fig. 3 shows such noisy data obtained by adding noise to the curve in Fig. 2. Although seemingly noisy, as the original sequence is sorted, there are dependencies among the values to be exploited. Fig. 4 shows a reconstructed sequence using isotonic regression.

To further reduce the perturbation induced by the Laplace noise, consecutive elements in the sorted sequence can be grouped. However, grouping introduces generalization error. The amount of generalization error in the “worst case” can be analytically determined, and together with the model of error induced by the Laplace noise, the publisher can choose an appropriate group size k based on the privacy requirement ε and the total number of points n. For the example in Fig. 1, the group size determined is 300 and the corresponding published and reconstructed data are depicted in Fig. 5. Fig. 6 shows a comparison of the error of each point in the reconstructed sequence. After reconstruction, the inverse mapping is applied to the data. Our variant of isotonic regression outputs a dataset with a larger number of repetitions, as shown in Fig. 7. Fig. 8 shows the reconstructed pointset, with a post-processing that maps the repeating points to the surrounding area of the location. The post-processing is solely for the purpose of visualization. Fig. 9 shows a zoomed version of the region around New York City.

Fig. 2.
The 1D data mapped from the 2D Twitter locations (Fig. 1) to the unit interval [0, 1].

Fig. 3.
The published noisy location data with group size k = 1 and ε = 1, that is, there is no grouping. To avoid clogging, only 2% of the published points (randomly chosen) are plotted.

Our choice of the group size k is determined by minimizing an error function which measures the Earth-Mover-Distance (EMD) between the original and reconstructed pointsets. In one dimension, the EMD of two equal-sized pointsets is simply the L1 distance between the two respective sorted sequences. Although designed to minimize EMD, the proposed mechanism achieves good accuracy w.r.t. other utilities. Experimental studies show that the proposed mechanism achieves higher accuracy compared to the wavelet-based method for range queries [26], and outperforms the equi-width histogram w.r.t. the accuracy of estimating the underlying probability density function.
Fig. 4.
The reconstructed data obtained by performing isotonic regression on the published data shown in Fig. 3, plotted together with the original data.
Fig. 5.
The published noisy location data with grouping (group size k = 300) and the corresponding reconstructed data through isotonic regression. The figure shows only the region from 0.5 to 0.7 in the unit interval.

An advantage of the proposed mechanism is its simplicity from the publisher’s viewpoint. The publisher only has to map the points to the unit interval, sort them, add Laplace noise, and publish the results. By publishing the “raw” noisy data instead of the reconstructed data, users in the public are not confined to a particular inference technique, and have the flexibility of using different variants of isotonic regression to suit their needs.

In contrast to an equi-width histogram, the bins of an equi-depth histogram contain the same number of elements, while their widths vary. Intuitively, the size of a bin is larger in locations with a lower “density” of points. There are extensive studies on equi-depth histograms, and it is generally well accepted that an equi-depth histogram provides more useful statistical information [20] compared to an equi-width histogram. However, it is not clear how to generate an equi-depth histogram while achieving differential privacy. Interestingly, grouping in our proposed mechanism naturally produces equi-depth histograms: grouping of k elements leads to a depth of k.
Fig. 6.
Differences of the two reconstructed datasets from the original. The black dashed line is the displacement of the reconstructed data without grouping (see Fig. 4); the blue solid line is the displacement of the reconstructed data with group size k = 300 (see Fig. 5).

Fig. 7.
Inverse mapping of the reconstructed data with ε = 5. Each point in the figure represents a group of repeating points.

Our approach can also be applied to obtain order statistics, for example, the median. Finding the median is challenging due to its large sensitivity. An accurate mechanism can be derived by adding Laplace noise proportional to the smooth sensitivity [19] instead of the global sensitivity. However, computing the smooth sensitivity takes Θ(n²) time, where n is the dataset’s size. In contrast, our mechanism takes O(n) time when the dataset is already sorted. Experimental studies on datasets with 129 elements suggest that the proposed mechanism is less sensitive to a higher local sensitivity, or a small ε. As it is computationally intensive to compute the smooth sensitivity, we are unable to repeat the experiments for significantly larger n.
Fig. 8.
Inverse mapping of the reconstructed data followed by post-processing. The post-processing “diffuses” a repeated point to its surroundings and is performed solely for visualization purposes. To avoid clogging, only 10% of the points (randomly chosen) are plotted.

Fig. 9.
A zoom-in of Fig. 8 to the region within the indicated rectangle.

The locality preserving map is a key component in our mechanism, taking the role of transforming the data points to the one-dimensional space. Although there are fundamental limits on locality preserving mappings, fortunately, our problem only requires preservation in the “easier” direction, i.e., any pair of neighbors in the one-dimensional domain are also neighbors in the multi-dimensional domain. The classic Hilbert space-filling curve suffices to provide high accuracy. For other types of non-spatial data, our techniques can be applied as long as an appropriate locality preserving mapping is available.
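To make the role of the map concrete, here is a minimal Python sketch of one way such a mapping can be realized with the classic Hilbert curve. The grid order, the bounding-box handling and the function names are our own illustrative choices, not the authors’ implementation.

```python
def xy_to_hilbert(x, y, order):
    """Map integer grid coordinates (x, y), 0 <= x, y < 2**order,
    to the Hilbert curve index d in [0, 4**order). Classic bitwise algorithm."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate/reflect the quadrant so the recursion pattern repeats
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def locality_map(lon, lat, bbox, order=10):
    """Quantize a point to a 2**order grid inside the bounding box,
    then normalize its Hilbert index into the unit interval [0, 1]."""
    (lon0, lon1), (lat0, lat1) = bbox
    n = 1 << order
    gx = min(int((lon - lon0) / (lon1 - lon0) * n), n - 1)
    gy = min(int((lat - lat0) / (lat1 - lat0) * n), n - 1)
    return xy_to_hilbert(gx, gy, order) / float(4 ** order)
```

Only the “easy” direction matters here: two Hilbert indices that are close always come from grid cells that are close, which is exactly the property the reconstruction relies on.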
Organization:
We first describe some background materials in the next section (Section 2). In Section 3 we present our main ideas and mechanism, and show that the proposed mechanism achieves differential privacy in Section 4. Next, in Section 5, we formulate and analyze the noise incurred by the Laplace noise and the generalization noise. Based on the noise model, we derive a strategy to choose the group size. In Section 6, we compare our mechanism with three known mechanisms: (1) the equi-width histogram, (2) the wavelet-based method for range queries, and (3) smooth-sensitivity based median finding. In Section 7, we describe a few possible extensions, in particular, a hybrid of our mechanism with the equi-width histogram. Lastly, we describe related works in Section 8 and conclude in Section 9.
2 Background

2.1 Differential Privacy

We treat a database as a multiset (i.e. a set with possibly repeating elements), and define two databases D and D′ to be neighbors when D′ can be obtained from D by replacing one element, i.e. D′ = {x} ∪ D \ {y} for some x and y. Let us call the above definition of neighborhood the replacement neighborhood.

A randomized algorithm (also known as a mechanism) A achieves ε-differential privacy if

Pr[A(D) ∈ S] ≤ exp(ε) · Pr[A(D′) ∈ S]

for all S ⊆ Range(A), where Range(A) denotes the output range of the algorithm A, and for any pair of neighboring datasets D and D′.

The replacement neighborhood we adopt is similar to the notion used by Nissim et al. [19]. This variant differs from the well-adopted notion that treats two datasets D, D′ as neighbors iff D = D′ ∪ {x} or D′ = D \ {x} for some x. Note that a mechanism that achieves differential privacy under the replacement neighborhood can be converted to one that achieves privacy under the well-adopted neighborhood.

For a function f : D → R^k, the sensitivity [5] of f is defined as

Δ(f) := max ‖f(D) − f(D′)‖₁,

where the maximum is taken over all pairs of neighboring D and D′. It can be shown [6] that the mechanism

A(D) = f(D) + (Lap(Δ(f)/ε))^k

achieves ε-differential privacy, where (Lap(Δ(f)/ε))^k is a vector of k independently and randomly chosen values from the Laplace distribution with standard deviation √2·Δ(f)/ε.

It is meaningless if the output of a mechanism is simply noise, even if privacy is achieved. The accuracy of a mechanism is measured by a utility function u(X, y) that measures the quality of the output y given that the dataset is X. Alternatively, the utility can be measured by an error function that measures the distance of the output from the ideal output.

The notion of differential privacy has a useful sequential composition property [16]: if mechanisms M₁ and M₂ achieve ε₁- and ε₂-differential privacy respectively, then the combined mechanism of applying M₁ followed by M₂ achieves (ε₁ + ε₂)-differential privacy.

2.2 Isotonic Regression

Given a sequence of n real numbers a_1, …, a_n, the problem of finding the least-squares fit x_1, …, x_n subject to the constraints x_i ≤ x_j for all i < j ≤ n is known as isotonic regression. Formally, we want to find the x_1, …, x_n that minimize

Σ_{i=1}^{n} (x_i − a_i)²,  subject to x_i ≤ x_j for all 1 ≤ i < j ≤ n.

The unique solution can be efficiently found using the pool-adjacent-violators algorithm in O(n) time [10]. When minimizing w.r.t. the ℓ1 norm, there is also an efficient O(n log n) algorithm [23]. There are many variants of isotonic regression, for example, having a smoothness component in the objective function [25,17].

2.3 Locality Preserving Maps

A locality preserving map T : R^d → R maps d-dimensional points to real numbers while preserving “locality”. In this paper, we seek mappings whereby two neighboring points in the one-dimensional range are also neighboring points in the d-dimensional domain. Specifically, there is some constant A s.t. for any x, y ∈ R^d,

‖x − y‖ ≤ A · |T(x) − T(y)|^{1/d}.

The well-known Hilbert curve achieves ‖x − y‖ ≤ c·√|T(x) − T(y)| for a small constant c and x, y in R² [9]. Niedermeier et al. [18] showed that with careful construction, the constant can be improved. Note that such mappings cannot guarantee the converse, namely that neighboring points in the d-dimensional domain are also neighboring points in the one-dimensional range.
Fortunately, in our problem, such a property is not required.

2.4 Datasets

We conduct experiments on two datasets: the locations of Twitter users [1] (herein called the Twitter location dataset) and the dataset collected by Kaluža et al. [13] (herein called Kaluža’s dataset). The Twitter location dataset contains over 1 million Twitter users’ data from the period of March 2006 to March 2010, among which around 200,000 tuples are labeled with a location (represented as latitude and longitude), and most of the tuples are in the North American continent, concentrated in the regions around the states of New York and California. Fig. 1 shows the cropped region covering most of the North American continent. The cropped region contains 183,072 tuples. Kaluža’s dataset contains 164,860 tuples collected from tags that continuously record the location information of 5 individuals.
3 Proposed Approach
Given the privacy requirement ε and a dataset D of size n, the publisher carries out the following:

A1. Maps each point in D to a real number in the unit interval [0, 1] using a locality preserving map T. Let T(D) be the set of transformed points. Determines a group size k based on n and ε. For clarity in exposition, let us assume that k divides n.

A2. Sorts T(D). Divides the sorted sequence into groups of k consecutive elements. For each group, determines its sum. Let the sums be S = ⟨s_1, …, s_{n/k}⟩.

A3. Publishes S̃ = S + (Lap(ε⁻¹))^{n/k} and the group size k.

A user in the public may extract information from the published data as follows:

B1. Performs isotonic regression on k⁻¹·S̃, and maps the data points back to their original domain. That is, computes D̃ = T⁻¹(IR(k⁻¹·S̃)), where IR(·) denotes isotonic regression. Let us call D̃ the reconstructed data.

Remark.
1. The size of the dataset n is not considered to be a secret and can be derived from the published S̃. The transformation T and the lookup table for k are public knowledge prior to the publishing.

2. When the database size n is unknown to the user, the publisher can exploit the sequential composition property of differential privacy and carry out the following steps: (1) first, publish a noisy size ñ using a portion of the privacy “budget”; (2) next, extract exactly ñ points from the dataset using a deterministic padding algorithm as follows: if ñ > n, insert (ñ − n) 0’s into the dataset; if ñ < n, remove the (n − ñ) smallest elements; (3) lastly, publish the padded pointset using our proposed mechanism.

3. To relieve the public users from computing step B1, the regression can be carried out by the publisher on behalf of the users. Nevertheless, the raw data S̃ should still be published alongside the reconstructed data (though this is not strictly necessary).

4. The public is not confined to a particular isotonic regression. After S̃ is published, various inference techniques can be applied. For instance, a user may perform a variant of isotonic regression that optimizes objective functions with a smoothness component [25,17].

5. The publisher’s main design decisions are the choice of T and the group size k. The choice of T depends on the underlying metric of the points. For Euclidean distance in two-dimensional space, the classic Hilbert curve already attains good performance. The group size k can be computed from the lookup table constructed using our proposed noise model.
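Steps A1-A3 and B1 are simple enough to sketch end-to-end. The following minimal Python rendition (an illustration, not the authors’ code) implements the isotonic regression with the pool-adjacent-violators algorithm; it assumes the points have already been mapped into [0, 1] by T and that k divides n.

```python
import numpy as np

def pava(y):
    """Least-squares isotonic fit via pool-adjacent-violators, O(n)."""
    sums, cnts = [], []
    for v in y:
        sums.append(float(v)); cnts.append(1)
        # merge adjacent blocks while their means violate monotonicity
        while len(sums) > 1 and sums[-2] * cnts[-1] > sums[-1] * cnts[-2]:
            sums[-2] += sums[-1]; cnts[-2] += cnts[-1]
            sums.pop(); cnts.pop()
    return np.concatenate([np.full(c, s / c) for s, c in zip(sums, cnts)])

def publish(mapped_points, eps, k):
    """Steps A2-A3: sort, sum groups of k, add Laplace noise Lap(1/eps)."""
    x = np.sort(np.asarray(mapped_points))
    s = x.reshape(-1, k).sum(axis=1)
    return s + np.random.laplace(scale=1.0 / eps, size=s.size)

def reconstruct(noisy_sums, k):
    """Step B1: isotonic regression on the noisy group means."""
    return np.repeat(pava(np.asarray(noisy_sums) / k), k)
```

A user would then apply the inverse map T⁻¹ (Hilbert index back to grid cell) to obtain the reconstructed pointset D̃.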
4 Security Analysis: Sensitivity of Sorting is Bounded

In this section, we show that the proposed mechanism (Steps A1 to A3) achieves differential privacy, and thus so does the reconstructed pointset output by B1. The following theorem shows that sorting, as a function, interestingly has sensitivity 1. Note that a straightforward analysis that treats each element independently could lead to a bound of n, which is too large to be useful.

Theorem 1.
Let S_n(D) be a function that, on input D, which is a multiset containing n real numbers from the unit interval [0, 1], outputs the sorted sequence of elements in D. The sensitivity of S_n w.r.t. the replacement neighborhood is 1.

Proof. Let D and D′ be any two neighbouring datasets, and let ⟨x_1, x_2, …, x_i, …, x_n⟩ be S_n(D), i.e. the sorted sequence of D. WLOG, let us assume that an element x_i is replaced by a larger value A to give D′, for some 1 ≤ i ≤ n and x_i < A. Let j be the largest index s.t. x_j < A ≤ 1. Hence, the sorted sequence of D′ is:

x_1, x_2, …, x_{i−1}, x_{i+1}, …, x_j, A, x_{j+1}, …, x_n.

The L1 difference due to the replacement is

‖S_n(D) − S_n(D′)‖₁ = |x_{i+1} − x_i| + |x_{i+2} − x_{i+1}| + … + |x_j − x_{j−1}| + |A − x_j|
 = (x_{i+1} − x_i) + (x_{i+2} − x_{i+1}) + … + (x_j − x_{j−1}) + (A − x_j)
 = A − x_i ≤ 1.

The bound is attained by some D and D′ where the difference A − x_i = 1. Hence, the sensitivity is 1.

In the proof, the fact that the sequence is sorted is exploited to obtain the bound. Since the sensitivity is 1, the mechanism S_n(D) + Lap(1/ε)^n enjoys ε-differential privacy. Also note that the value of n is fixed. Hence, in the context of data publishing, the size of D is not a secret and is made known to the public. Next, we show that grouping (in Step A2) has no effect on the sensitivity.
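The bound is also easy to check empirically. The following sketch (an illustration, not part of the proof) repeatedly replaces one random element and measures the L1 gap between the two sorted sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, worst = 1000, 0.0
for _ in range(200):
    d = rng.random(n)                    # dataset of n values in [0, 1]
    d2 = d.copy()
    d2[rng.integers(n)] = rng.random()   # replacement neighbor
    gap = np.abs(np.sort(d) - np.sort(d2)).sum()
    worst = max(worst, gap)
print(worst)   # never exceeds 1, as Theorem 1 guarantees
```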
Corollary 1. Consider a partition H = {h_1, h_2, …, h_m} of the indices {1, 2, …, n}. Let S_H(D) be the function that, on input D, which is a multiset containing n real numbers from the unit interval [0, 1], outputs a sequence of m numbers:

y_i = Σ_{j∈h_i} x_j,  for 1 ≤ i ≤ m,

where ⟨x_1, x_2, …, x_n⟩ is the sorted sequence of D. The sensitivity of S_H w.r.t. the replacement neighbourhood is 1.

Proof. Let us consider two neighbouring datasets D and D′, and let their respective sorted sequences be x_1, x_2, …, x_n and x′_1, x′_2, …, x′_n. WLOG, let us assume that D′ is obtained by replacing an element in D with a strictly larger element. Thus, x_i ≤ x′_i for all i. The L1 difference due to the replacement is

‖S_H(D) − S_H(D′)‖₁ = Σ_{i=1}^{m} |Σ_{j∈h_i} x′_j − Σ_{j∈h_i} x_j| = Σ_{i=1}^{m} (Σ_{j∈h_i} x′_j − Σ_{j∈h_i} x_j) = Σ_{j=1}^{n} (x′_j − x_j) ≤ 1.

Note that the proof does not require the groups h_i to be consecutive. Hence, Corollary 1 gives a more general result where H can be any partition. From Corollary 1, the proposed mechanism that publishes S̃ achieves ε-differential privacy.

5 Noise Analysis

The main goal of this section is to analyze the effect of the privacy requirement ε, the dataset size n and the group size k on the error in the reconstructed data, which in turn provides a strategy for choosing the parameter k from the given n and ε. Intuitively, in the absence of “generalization noise”, when n is larger, there are more constraints in the isotonic regression, leading to a more accurate reconstruction. Grouping affects the accuracy in two opposing ways: it reduces the number of constraints for regression and introduces generalization error; on the other hand, the Laplace noise is essentially reduced by a factor of k. By taking the above factors into account, we can determine the optimal k.

5.1 Error function

We use an error function related to the Earth-Mover-Distance (EMD) [21] to quantify the utility of the published data. The EMD between two pointsets of equal size is defined to be the minimum cost of a bipartite matching between the two sets, where the cost of an edge linking two points is the cost of moving one point to the other. Hence, EMD can be viewed as the minimum cost of transforming one pointset into the other. Different variants of EMD differ on how the cost is defined. In this paper, we adopt the typical definition that takes the cost to be the Euclidean distance between the two points.

In one-dimensional space, the EMD between two sets D and D̃ is simply the L1 norm of the differences between the two respective sorted sequences, i.e. ‖S_n(D) − S_n(D̃)‖₁, which can be efficiently computed.
In other words,

EMD(D, D̃) = Σ_{i=1}^{n} |p_i − p̃_i|,  (1)

where the p_i’s and p̃_i’s are the sorted sequences of D and D̃ respectively.

Given a pointset D and the published pointset D̃ of a mechanism M where |D| = |D̃| = n, let us define the normalized error as (1/n)·EMD(D, D̃) and denote by Err_{M,D} the expected normalized error,

Err_{M,D} = Exp[(1/n)·EMD(D, D̃)],  (2)

where the expectation is taken over the randomness in the mechanism.

Although EMD can be computed efficiently for one-dimensional pointsets, the best known algorithm that computes EMD in higher dimensions has cubic running time [14]. Jang et al. [12] proposed a fast approximation that employs a space-filling curve. Similarly, for higher dimensional space, we approximate the EMD by first mapping each point to a real number in [0, 1] through a space-filling curve, and then computing the EMD in the one-dimensional space in O(n log n) time.
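In code, equations (1) and (2) amount to a sort and a sum; a small sketch:

```python
import numpy as np

def emd_1d(a, b):
    """EMD of two equal-sized 1D pointsets: L1 gap of sorted sequences (eq. 1)."""
    return np.abs(np.sort(a) - np.sort(b)).sum()

def normalized_error(a, b):
    """Normalized error (1/n) * EMD(a, b), as in eq. (2) before expectation."""
    return emd_1d(a, b) / len(a)
```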
1] through a space-fillingcurve, and then compute the EMD in the one-dimensional space in O ( n log n )time. Let us first omit the effect of grouping and consider cases where k = 1. Weconduct experimental studies on four types of pointsets with varying size n : (1)Multisets containing elements with the same value 0.5 (herein called “repeatingsingle-value dataset), (2) sets containing equally-spaced numbers ( i/ ( n − i = 0 , . . . , n − n randomly chosen elements from the Twitter location data [1], and (4) setscontaining n randomly chosen elements from the Kaluˇza’s data [13] .Fig. 10 shows the expected normalized error. Each value on the graph is theaverage over 500 sample runs. Not surprisingly, the expected error reduces whenthe number of points increases. Fig. 11 shows the expected normalized errorfor dataset on equally-spaced points for different (cid:15) . The results agree with theintuition that when (cid:15) is increased by a factor of c , the error would approximatelydecrease by factor of c as shown in Fig. 12. E rr o r Repeated single−value dataEqually spaced dataKaluza’s dataTwitter location data
Fig. 10.
The expected normalized error without grouping versus the size of the dataset. The red solid line is for the repeating single-value dataset, the black dashed line is for equally-spaced numbers, the purple dotted line is for Kaluža’s dataset and the blue dash-dot line is for the Twitter location dataset.

Fig. 11.
The expected normalized error without grouping versus the size of the dataset, for different values of the security parameter ε.

5.3 Effect of grouping

Now, we consider cases where k > 1. Grouping reduces the number of constraints by a factor of k. As suggested by Fig. 10, when the number of data points decreases, the error increases. On the other hand, recall that the regression is performed on the published values divided by k (see the role of k in Step B1). This essentially reduces the level of Laplace noise by a factor of k. Hence, the accuracy attained by grouping k elements is “equivalent” to the accuracy attained without grouping but with the privacy parameter ε increased by a factor of k.

From Fig. 10, we can predict the effects of grouping on the repeating single-value dataset. For instance, if n = 10,000, ε = 1 and k = 5, without grouping, the reconstructed points are expected to have a 0.02 error; whereas with grouping of size 5, the expected error is 0.05/5 = 0.01.
Fig. 12.
The ratio of the expected normalized error with different ε against the expected normalized error with ε = 1.

Fig. 13 shows the predicted errors under different k’s, for n = 10,000 and ε = 1, on the different datasets.
Fig. 13.
Predicted error versus group size, without generalization noise, for different datasets of size n = 10,000 and ε = 1. The red solid line is the upper bound; the black dashed line, purple dotted line and blue dash-dot line are for the equally-spaced numbers dataset, Kaluža’s dataset and the Twitter location dataset respectively.
5.4 Generalization noise

The negative effect of grouping is the generalization noise, as all elements in a group are represented by their mean. Before giving a formal description of generalization noise, let us introduce some notation.

Given a sequence D = ⟨x_1, …, x_n⟩ of n numbers, and a parameter k, where k divides n, let us call the following function downsampling:

↓k(D) = ⟨s_1, …, s_{n/k}⟩,

where each s_i is the average of x_{k(i−1)+1}, …, x_{ik}. Given a sequence D′ = ⟨s′_1, …, s′_m⟩ and k, let us call the following function upsampling:

↑k(D′) = ⟨x′_1, …, x′_{mk}⟩,

where x′_i = s′_{⌊(i−1)/k⌋+1} for each i. The normalized generalization error is defined as

Gen_{D,k} = (1/n)·‖D − ↑k(↓k(D))‖₁.

It is easy to see that, for any k and D, the normalized generalization error is at most k/(2n). Fig. 14 shows the generalization error for different group sizes on a dataset containing 10,000 equally-spaced values, a dataset containing 10,000 numbers randomly drawn from the transformed Kaluža’s data, and a dataset containing 10,000 numbers randomly drawn from the transformed Twitter location data. They agree with our upper bound on the generalization error.

Furthermore, the worst case occurs when the values in a group are divided equally between two values, for example, half of them have value 0 and half of them have value 1. This is very unlikely. Intuitively, even if the elements in a group only take two distinct values a and b, the number of elements having value a may vary. Hence, one would expect the average generalization error to be k/(4n). Fig. 14 shows that such an approximation is very accurate and consistent across the various datasets.
Fig. 14.
Generalization error versus group size for different datasets of size n = 10,000 and ε = 1. The red solid line is for Kaluža’s dataset, the blue dotted line is for the equally-spaced dataset and the black dashed line for the Twitter location dataset.
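The operators ↓k and ↑k, and the generalization error, translate directly into code. A short sketch, using the equally-spaced dataset to check the k/(4n) approximation:

```python
import numpy as np

def down(x, k):            # downsample: average each block of k consecutive values
    return x.reshape(-1, k).mean(axis=1)

def up(s, k):              # upsample: repeat each block average k times
    return np.repeat(s, k)

def gen_error(x, k):       # Gen_{D,k} = ||D - up(down(D))||_1 / n
    return np.abs(x - up(down(x, k), k)).sum() / x.size

n, k = 10_000, 50
x = np.linspace(0.0, 1.0, n)          # equally-spaced dataset
print(gen_error(x, k), k / (4 * n))   # close to the k/(4n) approximation
```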
5.5 Combined effects of grouping

Now, let us combine the effects of both the grouping and the Laplace noise on the normalized error Err_D. Let us consider the mechanism that, on input D and the parameter k, outputs

M_k(D) = ↑k(IR(↓k(S_n(D)) + Lap(1)^{n/k})).

This mechanism is essentially similar to our proposed method, but differs in how k is chosen: here, k is given as a parameter, whereas in Step A1 of the proposed method, k is chosen from a lookup table. Recall that the expected normalized error produced by this mechanism M_k on D is denoted by Err_{M_k,D}. For abbreviation, we write it as Err_{k,D}.

Let S̃ be an instance of ↓k(S_n(D)) + Lap(1)^{n/k}, and D̃ the corresponding reconstructed dataset generated by M_k, i.e. D̃ = ↑k(IR(S̃)). We have

EMD(D, D̃) = ‖S_n(D) − ↑k(IR(S̃))‖₁
 = ‖S_n(D) − ↑k(↓k(S_n(D))) + ↑k(↓k(S_n(D))) − ↑k(IR(S̃))‖₁
 ≤ n·Gen_{D,k} + ‖↑k(↓k(S_n(D))) − ↑k(IR(S̃))‖₁
 = n·Gen_{D,k} + k·‖↓k(S_n(D)) − IR(S̃)‖₁
 = n·Gen_{D,k} + k·EMD(↓k(S_n(D)), IR(S̃)).  (3)

Note that the first term n·Gen_{D,k} is a constant independent of the random choices made by the mechanism. Also note that the second term is the EMD between the down-sampled dataset and its reconstructed copy obtained using group size 1. By taking the expectation over the randomness of the mechanism (i.e. the Laplace noise Lap(1)^{n/k}), we have

Err_{k,D} ≤ Gen_{D,k} + Err_{1,↓k(D)}.  (4)

In other words, the expected normalized error is bounded by the sum of the normalized generalization error and the normalized error incurred by the Laplace noise. Figure 15 shows the three values versus the group size k for equally-spaced data of size 10,000. The minimum of the expected normalized error suggests the optimal group size k.

Fig. 16 illustrates the expected errors for different k on the Twitter location data with 10,000 points. The red dotted line is Err_{k,D} whereas the blue solid line is the sum on the right-hand side of inequality (4). Note that the differences between the two graphs are small. We have conducted experiments on other datasets and observed similarly small differences. Hence, we take the sum as an approximation to the expected normalized error,

Err_{k,D} ≈ Gen_{D,k} + Err_{1,↓k(D)}.  (5)
Fig. 15.
The normalized error derived from the generalization error and the perturbation error for different group sizes k, for a database of size n = 10,000 and ε = 1.

5.6 Choosing k

Now, we are ready to find the optimal k given ε and n. From Fig. 10 and Fig. 14 and the approximation given in equation (5), we can determine the best group size k given the size of the database n and the security requirement ε. From ε and Fig. 10, we can obtain the value of Err_{1,↓k(D)} for different k (the errors in Fig. 10 are measured at ε = 1 and scale by 1/ε). From the database’s size n and Fig. 14, we can approximate Gen_{D,k} for different k. Thus, we can approximate the normalized error Err_{k,D} with equation (5), as illustrated in Fig. 15. Using the same approach, the best group size for different n and ε can be calculated, as presented in Table 1.
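The selection itself is mechanical once the Laplace-noise curve of Fig. 10 is available. A hedged sketch, where laplace_err is a stand-in name for an empirical lookup of the expected normalized Laplace error for a given number of points and privacy level (e.g. read off Fig. 10 and rescaled):

```python
def predicted_error(n, eps, k, laplace_err):
    """Approximation (5): generalization term k/(4n) plus the Laplace term.
    Grouping by k behaves like n/k points at privacy level k*eps."""
    return k / (4.0 * n) + laplace_err(n // k, k * eps)

def best_group_size(n, eps, laplace_err, candidates):
    """Pick the k (e.g. among divisors of n) minimizing the predicted error."""
    return min(candidates, key=lambda k: predicted_error(n, eps, k, laplace_err))
```

Table 1 can be precomputed this way for the (n, ε) pairs of interest.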
Fig. 16.
The predicted error for 10,000 points with ε = 1, and the actual error for a dataset containing 10,000 points randomly selected from the Twitter location dataset.

Table 1.
The best group size k given n and ε.

              ε = 0.5   ε = 1   ε = 2   ε = 3
n = 2,000        44       29      20      12
n = 5,000        59       37      27      18
n = 10,000       79       51      36      27
n = 20,000      121       83      61      41
n = 100,000     234      150      98      73

Isotonic regression is not unbiased. Reconstructed points on the left side (i.e. having values smaller than the median) tend to have a negative bias, whereas points on the right side (i.e. having values larger than the median) tend to have a positive bias. The bias is usually smaller for points nearer to the median or the two ends.

We conducted experiments on equally-spaced data of size 1000 to measure the displacement (the difference between the reconstructed point and the original point). Fig. 17 shows the estimated distribution of the displacement of a smaller point (the 100th point) and a larger point (the 900th point), derived from 200,000 runs of the experiment. The displacement of the smaller point has a negative mean, and that of the larger point a positive mean.

Fig. 17.
The displacement distribution of the 100th point and the 900th point in the reconstructed sequence of our process on an equally-spaced pointset of size 1000.
6 Comparisons

In this section, we compare the performance of the proposed mechanism with three mechanisms w.r.t. different utility functions. The first mechanism outputs equi-width histograms differentially privately. We treat the generated equi-width histogram as an estimate of the underlying probability density function, and use the statistical distance between density functions as a measure of utility. Next, we investigate the wavelet-based mechanism proposed by Xiao et al. [26] and measure the accuracy of range queries. Lastly, we consider the problem of estimating the median, and compare with a mechanism based on smooth sensitivity proposed by Nissim et al. [19]. We remark that although the comparisons are based on different utility functions, our proposed mechanism stays the same; in particular, the parameter k is chosen from the same lookup table.

6.1 Equi-width histogram

The equi-width histogram comprises equal-sized non-overlapping bins. With respect to the replacement neighborhood, histogram generation has a sensitivity of 2, and thus adding Laplace noise
Lap(2/ε) to each of the frequency counts gives a differentially private histogram. Note that the size of the bins has to be determined prior to publishing. Without good background knowledge of the pointset, it is difficult to determine a good bin size, as the same bin size can lead to significantly different accuracy for different pointsets.
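For reference, the baseline mechanism takes only a few lines. A sketch over the unit interval (the 2D version applies the same noise to a grid of counts):

```python
import numpy as np

def private_equiwidth_histogram(points, bins, eps):
    """Equi-width histogram with Lap(2/eps) noise per count
    (sensitivity 2 under the replacement neighborhood)."""
    counts, edges = np.histogram(points, bins=bins, range=(0.0, 1.0))
    noisy = counts + np.random.laplace(scale=2.0 / eps, size=counts.size)
    return noisy, edges
```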
Estimating the underlying pdf: A histogram of a pointset can be treated as an estimate of an underlying density function f from which the points are drawn. The value of f(x) can be estimated by the ratio of the number of samples to the width of the bin where x belongs, multiplied by a normalizing constant. Essentially, this estimation employs a step function as its kernel in estimating the density function.

In this section, we treat the estimation of the density function as the main usage of the published data. Hence, we quantify the mechanism’s utility by the distance between the two estimated density functions: one derived from the original dataset, and the other derived from the mechanism’s output.

To facilitate comparison, we need an algorithm to estimate the density function from the original pointset D. There are many ways to estimate the density function, and we adopt the following method: let B be the set of distinct points in D, and let us consider the Voronoi diagram of B. The cells in the Voronoi diagram are taken to be the bins of a histogram, from which an estimate of the density function is obtained. Note that the bins generated have variable sizes, and thus the above process can be treated as a form of “variable-bandwidth” kernel density estimation [22], where the kernels are step functions with different shapes.

Similarly, we need to estimate the density function when given the output of our mechanism. Since isotonic regression is performed on the space-filling curve, we adopt a variant of the above estimation with the Voronoi diagram computed in the transformed one-dimensional space. In other words, given the reconstructed D̃ of multidimensional points, let B be the set of distinct values in T(D̃), where T is the locality preserving map. Next, determine the Voronoi diagram of B, which comprises a sequence of intervals. This sequence of intervals forms the bins of a histogram, from which an estimate of the density function is obtained.
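In the transformed one-dimensional space, the Voronoi cells are simply the intervals delimited by midpoints between consecutive distinct values, so the estimate is short to write down. A sketch, assuming the domain is [0, 1]:

```python
import numpy as np

def voronoi_density(values):
    """Density estimate with 1D Voronoi cells of the distinct values as bins."""
    v, counts = np.unique(np.asarray(values), return_counts=True)
    # cell boundaries: midpoints between consecutive distinct values
    edges = np.concatenate(([0.0], (v[:-1] + v[1:]) / 2.0, [1.0]))
    density = counts / (np.diff(edges) * counts.sum())   # integrates to 1
    return edges, density
```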
Experimental results: Figs. 18, 19 and 20 show the density function estimated from the Twitter location dataset, and the density functions reconstructed from the noisy equi-width histogram and from the dataset published by our mechanism. For comparison, 1% of the original points are plotted on top of the two reconstructed density functions. For the original dataset and the dataset reconstructed by our mechanism, we quantize the location domain into 1024 × 1024 units. For the equi-width histogram, each bin covers 25 × 25 units, so there are in total 1681 bins. Figs. 21 and 22 show the details of the two reconstructed density functions in the region [420, 580] × [720, 840].
Fig. 18.
Density function estimated from the original dataset, with darker areas representing larger values.

The statistical difference, measured with the ℓ1-norm and ℓ2-norm, between the two estimated density functions derived from the original dataset and from the mechanism’s output is shown in Table 2. We remark that it is not easy to determine the optimal bin size for the equi-width histogram prior to publishing. Figure 23 shows that the optimal bin size differs significantly for three different datasets.

Table 2.
The statistical difference between the estimated density functions.

           equi-width histogram method   proposed method
ℓ1-norm              1.38                      1.19
ℓ2-norm              0.25                      0.20
Fig. 19.
Density function estimated from the dataset published by our mechanism with ε = 3, with 1% of the original points (randomly selected) plotted on top.
Fig. 20.
Density function estimated using the equi-width histogram mechanism with ε = 3, with 1% of the original points (randomly selected) plotted on top.
6.2 Range query

Many applications need the total number of points within a query range of a dataset. It is desirable to publish a noisy pointset that meets the privacy requirement, and yet is able to provide accurate answers to range queries. Publishing an equi-width histogram would not attain high accuracy if the size of the query ranges varies drastically. Intuitively, wavelet-based techniques are natural solutions to address such multi-scale queries. Xiao et al. [26] proposed a mechanism that adds Laplace noise to the coefficients of a wavelet transformation of an equi-width histogram. The noisy wavelet coefficients are then published, from which range queries can be answered. Essentially, what is being published is a series of equi-width histograms with different widths (scales). Note that there are quite a number of parameters to be determined prior to publishing, including the widths at various scales and the amounts of privacy budget they consume.

The answer to range queries can also be inferred from the output of our mechanism. Given a range, we can estimate the number of points within the range from the estimated density function (as described in Section 6.1), by accumulating the probability over the query region and then multiplying by the total number of points.
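A cruder alternative, which we sketch only for illustration, is to count the reconstructed points that fall inside the range directly; the experiments below instead accumulate the density estimate as described above.

```python
import numpy as np

def range_count(points, lon_range, lat_range):
    """Estimate a square range-count query from the reconstructed pointset."""
    p = np.asarray(points)                     # shape (n, 2): lon, lat
    inside = ((p[:, 0] >= lon_range[0]) & (p[:, 0] <= lon_range[1]) &
              (p[:, 1] >= lat_range[0]) & (p[:, 1] <= lat_range[1]))
    return int(inside.sum())
```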
Fig. 21.
A zoom-in view of Fig. 19.
Fig. 22.
A zoom-in view of Fig. 20.
Fig. 23.
The errors versus the bin size for different datasets. Each value on the graph is the average over 100 sample runs.

We compare the wavelet-based mechanism, our mechanism and the equi-width histogram mechanism on the Twitter location dataset. For each range query, the absolute difference between the true answer and the answer derived from the mechanism’s output is taken as the error. We only consider square range queries in our experiments. For each query size y, 1,000 randomly selected square ranges with width y are taken as the queries, and the average error is shown in Fig. 24.

In this experiment, we use the Haar wavelet, and perform the wavelet transform on an equi-width histogram with 512 × 512 bins.
After that, appropriate noise is added to ensure ε-differential privacy. To incorporate the knowledge of the database’s size n, the DC component of the wavelet transform is set to be exactly n. Under this setting, the best group size for our mechanism is 51.

Observe that since all mechanisms know the exact value of n, the accuracy improves when the range of the query covers more than half of the dataset. As expected, the wavelet-based method outperforms the equi-width histogram mechanism on larger range queries, but performs badly on small ranges due to the accumulation of noise. Surprisingly, our mechanism outperforms the equi-width histogram method on small range queries, and outperforms the wavelet-based method for all sizes. This is possibly due to the fact that the locations of our queries are uniformly randomly chosen over a continuous domain, and thus it is very likely that the query boundaries do not match the bins, leading to large errors.

6.3 Median

Finding the median accurately in a differentially private manner is challenging due to the high “global sensitivity”: there are two datasets that differ by one element but have completely different medians.
Fig. 24.
The average range query error over 1,000 random square range queries for each query size. The blue dash-dot line is the error of the equi-width histogram mechanism, the purple solid line is that of the wavelet method, the red dashed line is our mechanism with group size 20 and the black dotted line is our mechanism with group size 51.

Nevertheless, for many instances, the “local sensitivity” is small. Nissim et al. [19] showed that, in general, adding noise proportional to the “smooth sensitivity” of the database instance, instead of the global sensitivity, can also ensure differential privacy. They also gave a Θ(n²) algorithm that finds the smooth sensitivity w.r.t. the median.

Our mechanism outputs the sorted sequence differentially privately, so it naturally gives the median. Compared to the smooth sensitivity-based mechanism, our mechanism can be carried out in O(n) time on a sorted dataset. We conduct experiments on datasets of size 129 to compare the accuracy of both mechanisms. Due to the quadratic running time of determining the smooth sensitivity, we are unable to investigate larger datasets. The experiments are conducted for different local sensitivities and different ε values. To construct a dataset with a given local sensitivity, 66 random numbers are generated with an exponential distribution and then scaled to the unit interval; the dataset contains the 66 random numbers and 63 ones. Figure 25 shows the average noise level of both mechanisms for different local sensitivities, and Figure 26 shows the noise level for different ε on a dataset with a fixed local sensitivity. When ε is smaller, the accuracy of our mechanism decreases more slowly than that of the smooth sensitivity-based method.
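Reading the median off the published output is then immediate, since the reconstructed sequence is already sorted. A sketch:

```python
import numpy as np

def private_median(reconstructed):
    """Median read directly off the differentially private
    reconstructed sequence (already sorted after isotonic regression)."""
    x = np.asarray(reconstructed)
    return float(x[x.size // 2])
```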
Fig. 25.
The error of the median output by the two mechanisms versus different local sensitivities. The blue dots are the errors incurred by our mechanism and the black circles are the errors incurred by the smooth sensitivity-based mechanism.
Fig. 26.
The error of the median versus different ε. The blue dashed line and black solid line are the errors incurred by our mechanism and the smooth sensitivity-based mechanism respectively.
7 Extensions

The proposed mechanism can be viewed as publishing a “fixed-depth” histogram, where the number of elements in each histogram bin is fixed prior to publishing and the mechanism outputs noisy bin boundaries. On the other hand, mechanisms based on frequency counts can be viewed as “fixed-width” histograms, where the boundaries of the bins are fixed prior to publishing and the mechanisms output noisy counts of elements in the bins. The fixed-depth and fixed-width histograms could complement each other, by alternately publishing one after another. Here are two possibilities:

Fixed-width-then-fixed-depth
Let us take the Twitter location dataset shown in Figure 1 as an example. Observe that a large portion of the region is sparse. If the sparse regions can be omitted, the sensitivity of sorting would be significantly reduced. This could be achieved by (1) first publishing a coarse equi-width histogram with a large width; (2) next, for each bin, using the deterministic padding algorithm (Section 3, Remark 2) to extract ñ points, where ñ is the noisy count output by the equi-width histogram; (3) finally, publishing the extracted points using our fixed-depth mechanism. Note that the sensitivity for the fixed-depth mechanism is the width (or area) of the bin, which could be significantly smaller than the width (or area) of the whole domain.

Fixed-depth-then-fixed-width
The unique solution of isotonic regression is a piecewise constant function. The steps in the solution lead to artifacts of clustered data. It is interesting to investigate whether a subsequent fixed-width histogram could “break” the steps.
It is also interesting to investigate whether the proposed techniques can be applied to multidimensional data other than spatial data, for instance, tuples with attributes of age and gender.
8 Related Works

There are extensive works on privacy-preserving data publishing. The recent survey by Fung et al. [8] gives a comprehensive overview of various notions, for example, k-anonymity [24], ℓ-diversity [15], and differential privacy [5]. Hay et al. [11] proposed exploiting redundancies in the published data to boost accuracy, with supporting examples. One of the examples employs isotonic regression, but in a way different from our mechanism. They consider publishing an unattributed histogram, which is the (unordered) multiset of the frequencies of a histogram. As the frequencies are unattributed (i.e. the order of appearance is irrelevant), Hay et al. proposed publishing the sorted frequencies and later employing isotonic regression to improve accuracy. In contrast, our mechanism publishes the whole database.

The median is without doubt an important statistic. Finding the median in a differentially private way is not easy due to the large global sensitivity. Nissim et al. [19] introduced the notion of smooth sensitivity and proposed a Θ(n²) algorithm that computes the smooth sensitivity of an instance w.r.t. the median. The median has also been used in the construction of other differentially private mechanisms, e.g. database learning [3] and spatial decompositions [4].

9 Conclusion

Our mechanism is very simple from the publisher’s point of view. The publisher just has to sort the points, group consecutive values, add Laplace noise and publish the noisy data. There is also minimal tuning to be carried out by the publisher. The main design decisions are the choice of the group size k, which can be determined using our proposed noise models, and the locality preserving map, for which the classic Hilbert curve suffices to attain high accuracy. Through empirical studies, we have shown that the published raw data contain rich information for the public to harvest, and provide high accuracy even for usages like median finding and range searching, for which our mechanism was not initially designed. Such flexibility is desired for the need to “publish data, not the data mining result”, as deliberated by Fung et al. [8].

References
1. Twitter census: Twitter users by location.
2. A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. pages 128–138, 2005.
3. A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. pages 609–618, 2008.
4. G. Cormode, M. Procopiuc, E. Shen, D. Srivastava, and T. Yu. Differentially private spatial decompositions. Arxiv preprint arXiv:1103.5170, 2011.
5. C. Dwork. Differential privacy. Automata, Languages and Programming, pages 1–12, 2006.
6. C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. Theory of Cryptography, pages 265–284, 2006.
7. D. Feldman, A. Fiat, H. Kaplan, and K. Nissim. Private coresets. pages 361–370, 2009.
8. B. Fung, K. Wang, R. Chen, and P. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys, page 14, 2010.
9. C. Gotsman and M. Lindenbaum. On the metric properties of discrete space-filling curves. IEEE Transactions on Image Processing, pages 794–797, 1996.
10. S. Grotzinger and C. Witzgall. Projections onto order simplexes. Applied Mathematics and Optimization, pages 247–270, 1984.
11. M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. VLDB Endowment, pages 1021–1032, 2010.
12. M. Jang, S. Kim, C. Faloutsos, and S. Park. A linear-time approximation of the earth mover’s distance. Arxiv preprint arXiv:1106.1521, 2011.
13. B. Kaluža, V. Mirchevska, E. Dovgan, M. Luštrek, and M. Gams. An agent-based approach to care in independent living. Ambient Intelligence, pages 177–186, 2010.
14. E. Lawler. Combinatorial Optimization: Networks and Matroids. Dover Publications, 2001.
15. A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. ℓ-diversity: Privacy beyond k-anonymity. International Conference on Data Engineering, pages 24–24, 2006.
16. F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. SIGMOD, pages 10–30, 2009.
17. M. C. Meyer. Inference using shape-restricted regression splines. Annals of Applied Statistics, pages 1013–1033, 2008.
18. R. Niedermeier, K. Reinhardt, and P. Sanders. Towards optimal locality in mesh-indexings. pages 364–375, 1997.
19. K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. pages 75–84, 2007.
20. G. Piatetsky-Shapiro and C. Connell. Accurate estimation of the number of tuples satisfying a condition. pages 256–276, 1984.
21. Y. Rubner, L. Guibas, and C. Tomasi. The earth mover’s distance, multi-dimensional scaling, and color-based image retrieval. pages 661–668, 1997.
22. D. W. Scott. Variable kernel density estimation. Annals of Statistics, 20:1236–1265, 1992.
23. Q. F. Stout. Optimal algorithms for unimodal regression. Computer Science and Statistics, pages 109–122, 2000.
24. L. Sweeney. k-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, pages 557–570, 2002.
25. X. Wang and F. Li. Isotonic smoothing spline regression. Journal of Computational and Graphical Statistics, pages 21–37, 2008.
26. X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms.