One Step to Efficient Synthetic Data
Jordan Awan
Department of Statistics
Pennsylvania State University
University Park, PA 16802
[email protected]
Zhanrui Cai
Department of Statistics
Pennsylvania State University
University Park, PA 16802
[email protected]
Abstract
We propose a general method of producing synthetic data, which is widely applicable for parametric models, has asymptotically efficient summary statistics, and is both easily implemented and highly computationally efficient. Our approach allows for the construction of both partially synthetic datasets, which preserve the summary statistics without formal privacy methods, as well as fully synthetic data which satisfy the strong guarantee of differential privacy (DP), both with asymptotically efficient summary statistics. While our theory deals with asymptotics, we demonstrate through simulations that our approach offers high utility in small samples as well. In particular, we 1) apply our method to the Burr distribution, evaluating the parameter estimates as well as distributional properties with the Kolmogorov-Smirnov test, 2) demonstrate the performance of our mechanism on a log-linear model based on a car accident dataset, and 3) produce DP synthetic data for the beta distribution using a customized Laplace mechanism.
1 Introduction

With the advances in modern technology, government and other research agencies are able to collect massive amounts of data from individual respondents. These data are valuable for scientific progress and policy research, but they also come with increased privacy risk [Lane et al., 2014]. To publish useful information while preserving confidentiality of sensitive information, numerous methods of generating synthetic data have been proposed (see Hundepool et al. [2012, Chapter 3] for a survey). The goal of synthetic data is to produce a new dataset which preserves distributional properties of the original dataset, while protecting the privacy of the participating individuals. There are two main types of synthetic data: partially synthetic data, which allows for certain statistics or attributes to be released without privacy while protecting the other aspects of the data, and fully synthetic data, where all statistics and attributes of the data are protected.

In this paper, we propose a general method of producing synthetic data which is widely applicable for parametric models, has asymptotically efficient summary statistics, is both easily implemented and highly computationally efficient, and can produce either partially synthetic data or differentially private fully synthetic data. More formally, given sensitive data $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$ with unknown parameter $\theta$, we present a method of producing a synthetic dataset $(Y_i)_{i=1}^n$, which satisfies $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$ under relatively mild conditions. In occasions where $\hat\theta_X$ itself must be protected, we show that the proposed synthetic mechanism can be easily modified to satisfy the strong guarantee of differential privacy by first privatizing $\hat\theta_X$. Our approach can be viewed as one step of an approximate Newton method, which aims to solve $\hat\theta_Y = \hat\theta_X$. Similar to the classical one-step estimator [Van der Vaart, 2000], our one-step approach to synthetic data has efficient summary statistics.

Differential privacy (DP) was proposed in Dwork et al. [2006b] as a framework to develop formally private methods. Methods which satisfy DP require the introduction of additional randomness, beyond sampling, in order to obscure the effect of one individual on the output. Intuitively, DP ensures plausible deniability for those participating in the dataset. As the literature on DP has developed, there are now many privacy tools to preserve sample statistics with asymptotically negligible noise (e.g., Smith [2011], Reimherr and Awan [2019]), and to produce DP synthetic data (see related work).

Related work
A common approach to synthetic data is that of Liew et al. [1985], who propose drawing synthetic data from a fitted model. While Liew et al. [1985] do not incorporate formal privacy methods, this approach is often used to produce differentially private synthetic data. Hall et al. [2013] develop DP tools for kernel density estimators, which can be sampled to produce DP synthetic data. Machanavajjhala et al. [2008] develop a synthetic data method based on a multinomial model, which satisfies a modified version of DP to accommodate sparse spatial data. McClure and Reiter [2012] sample from the posterior predictive distribution to produce DP synthetic data, which is asymptotically similar to that of Liew et al. [1985] (see Example 3.1). Liu [2016] also uses a Bayesian framework: first producing DP estimates of the Bayesian sufficient statistics, then drawing the parameter from the distribution conditional on the DP statistics, and finally sampling synthetic data conditional on the sampled parameter. Zhang et al. [2017] propose a method of developing high-dimensional DP synthetic data which draws from a fitted model based on differentially private marginals.

There is also a line of research which produces synthetic data from a conditional distribution, preserving certain statistics. The most fundamental perspective of this approach is that of Muralidhar and Sarathy [2003], who propose drawing confidential variables from the distribution conditional on the non-confidential variables. Burridge [2003] generates partially synthetic data, preserving the mean and covariance for normally distributed variables. This approach was extended to a computationally efficient version by Mateo-Sanz et al. [2004], and Ting et al. [2005] give an alternative approach to preserving the mean and covariance by using random orthogonal matrix multiplication.

There are also tools, often based in algebraic statistics, to sample conditional distributions preserving certain statistics for contingency tables. Karwa and Slavkovic [2013] give a survey of Markov chain Monte Carlo (MCMC) techniques to sample conditional distributions. Another approach is sequential importance sampling, proposed by Chen et al. [2006]. Slavković and Lee [2010] use these techniques to generate synthetic contingency tables that preserve conditional frequencies.

In differential privacy, there are also synthetic data methods which preserve sample statistics. Karwa and Slavković [2012] generate DP synthetic networks from the beta exponential random graph model, conditional on the degree sequence. Li et al. [2018] produce DP high-dimensional synthetic contingency tables using a modified Gibbs sampler. Hardt et al. [2012] give a distribution-free algorithm to produce a DP synthetic dataset, which approximately preserves several linear statistics.

While this paper is focused on producing synthetic data for parametric models, there are several non-parametric methods of producing synthetic data, using tools such as multiple imputation [Rubin, 1993, Raghunathan et al., 2003, Drechsler, 2011], regression trees [Reiter, 2005, Drechsler and Reiter, 2008], and random forests [Caiola and Reiter, 2010]. Recently there has been success in producing differentially private synthetic data using generative adversarial neural networks [Jordon et al., 2018, Triastcyn and Faltings, 2018, Xu et al., 2019].
Our contributions and organization
The related work cited above largely fits into one of two categories: 1) sampling from a fitted distribution or 2) sampling from a distribution conditional on sample statistics. We illustrate in Example 3.1 that the first approach results in samples with reduced asymptotic relative efficiency compared to the original sample. On the other hand, in the case of certain distributions (e.g., exponential families), the second approach is often able to result in samples equal in distribution to the original sample, maintaining asymptotic performance.

However, there are important limitations to the previous works which sample from a conditional distribution. First, the previous approaches are all highly specific to the model at hand, and require different techniques for different models. Second, many of the approaches are difficult to implement and computationally expensive, involving complex iterative sampling schemes such as MCMC.

Our approach also preserves summary statistics, but unlike previous methods it is applicable to a wide variety of parametric models, easily implemented, and highly computationally efficient. Indeed, the regularity conditions required for our asymptotics are similar to those required for the central limit theorem of the maximum likelihood estimator (MLE), the computations only require efficient estimators for the parameters and the ability to sample the model, and the computational time is proportional to simply fitting the model. Our approach can be used to produce partially synthetic data, or fully synthetic DP data by first privatizing the efficient estimator.

The rest of the paper is organized as follows: In Section 2, we review some statistics background and notation. We give our asymptotic results in Section 3, and illustrate the performance of our approach with the Burr distribution and a log-linear model. In Section 4, we recall the basics of differential privacy and extend our approach to produce DP synthetic data. We also include an example which constructs an efficient DP estimator for the beta distribution, and demonstrate the performance of our approach via simulations. We end in Section 5 with some discussion.
2 Background and notation

In this section, we review some background and notation that we use throughout the paper.

For a parametric random variable, we write $X \sim f_\theta$ to indicate that $X$ has probability density function (pdf) $f_\theta$. To indicate that a sequence of random variables from the model $f_\theta$ are independent and identically distributed (i.i.d.), we write $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$. We write $X = (X_i)_{i=1}^n = (X_1, \ldots, X_n)^\top$.

Let $A$ be a random vector, $A_n$ be a sequence of random vectors, and $r_n$ be a positive numerical sequence. We write $A_n \xrightarrow{d} A$ to denote that $A_n$ converges in distribution to $A$. We write $A_n = o_p(r_n)$ to denote that $A_n/r_n \xrightarrow{d} 0$. We write $A_n = O_p(r_n)$ to denote that $A_n/r_n$ is bounded in probability.

For multivariate derivatives, we will overload the $\frac{d}{d\theta}$ operator as follows. For a function $f : \mathbb{R}^p \to \mathbb{R}$, we write $\frac{d}{d\theta} f(\theta)$ to denote the $p \times 1$ vector of partial derivatives $(\frac{\partial}{\partial \theta_j} f(\theta))_{j=1}^p$. For a function $g : \mathbb{R}^p \to \mathbb{R}^q$, we write $\frac{d}{d\theta} g(\theta)$ to denote the $p \times q$ matrix $(\frac{\partial}{\partial \theta_j} g_k(\theta))_{j,k=1}^{p,q}$.

For $X \sim f_\theta$, we denote the score function as $S(\theta, x) = \frac{d}{d\theta} \log f_\theta(x)$, and the Fisher information as $I(\theta) = E_\theta[S(\theta, X) S^\top(\theta, X)]$. An estimator $\hat\theta : \mathcal{X}^n \to \Theta$ is efficient if for $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$, we have $\sqrt{n}(\hat\theta(X) - \theta) \xrightarrow{d} N(0, I^{-1}(\theta))$. We will often write $\hat\theta_X$ in place of $\hat\theta(X)$.

3 One step synthetic data

In this section, we present our synthetic data procedure and its asymptotics in Theorem 3.2. We also include a pseudo-code version of our approach in Algorithm 1, to aid implementation. We demonstrate the finite-sample performance of our procedure on both a continuous and a discrete example. With the Burr type XII distribution, we study both the properties of the fitted parameters as well as distributional properties as measured by the Kolmogorov-Smirnov test. We then investigate the performance of our synthetic data on a log-linear model with two-way interactions, demonstrating our approach on more complex datasets.

We saw in the related work that a common approach to synthetic data is to sample from a fitted distribution. However, this approach results in suboptimal asymptotics, as illustrated in Example 3.1.
Example 3.1. Suppose that $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, 1)$. We estimate $\hat\mu(X) = n^{-1} \sum_{i=1}^n X_i$ and draw $Z_1, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} N(\hat\mu(X), 1)$. We can compute $\mathrm{Var}(\hat\mu(X)) = n^{-1}$, whereas $\mathrm{Var}(\hat\mu(Z)) = 2n^{-1}$. By using the synthetic data $Z$, we have lost half of the effective sample size.

While a simple example, the implications of Example 3.1 are quite general. Recall that the majority of estimators in the statistical literature are $\sqrt{n}$-consistent and follow asymptotic normal distributions. For example, given an i.i.d. sample $(X_i)_{i=1}^n$ drawn from $f_\theta$, an efficient estimator (e.g., the maximum likelihood estimator) $\hat\theta_X$ has variance $n^{-1} I^{-1}(\theta) + o(n^{-1})$. Drawing $(Z_i)_{i=1}^n$ i.i.d. from $f_{\hat\theta_X}$, it is easily verified that the variance of an efficient estimator $\hat\theta_Z$ is $2 n^{-1} I^{-1}(\theta) + o(n^{-1})$. Half of the effective sample size is lost here as well.

Our approach avoids the asymptotic problem of Example 3.1 by producing a sample $(Y_i)_{i=1}^n$ such that $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$. Then marginally, the asymptotic distributions of $\hat\theta_Y$ and $\hat\theta_X$ are identical. The intuition behind the method is that after fixing the "seed," we search for a parameter $\theta_{\text{new}}$ such that when $(Y_i)_{i=1}^n$ are sampled from $f_{\theta_{\text{new}}}$, we have that $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$. To arrive at the value $\theta_{\text{new}}$, we use one step of an approximate Newton method, described in Theorem 3.2.

To facilitate the asymptotic analysis, we assume regularity conditions (R0)-(R3). (R1)-(R3) are similar to standard conditions to ensure that there exists an efficient estimator, which are relatively mild and widely assumed in the literature [Serfling, 1980, Lehmann, 2004]. We include measure-theoretic assumptions in (R0) to formalize our method.

(R0) Let $(\Omega, \mathcal{F}, P)$ be a probability space of the seed $\omega$. Let $X_\theta : \Omega \to \mathcal{X}$ be a measurable function, where $(\mathcal{X}, \mathcal{G})$ is a measurable space and $\theta \in \Theta \subset \mathbb{R}^p$, where $\Theta$ is compact. We assume that there exists a measure $\mu$ on $(\mathcal{X}, \mathcal{G})$ which dominates $P X_\theta^{-1}$ for all $\theta \in \Theta$. Then there exist densities $f_\theta : \mathcal{X} \to \mathbb{R}_{\geq 0}$ such that $\int_A dP X_\theta^{-1} = \int_A f_\theta \, d\mu$ for all $A \in \mathcal{G}$.

(R1) Let $\theta_0 \in \Theta \subset \mathbb{R}^p$ be the true parameter. Assume there exists an open ball $B(\theta_0) \subset \Theta$ about $\theta_0$, the model $f_\theta$ is identifiable, and that the set $\{x \in \mathcal{X} \mid f_\theta(x) > 0\}$ does not depend on $\theta$.

(R2) The pdf $f_\theta(x)$ has three derivatives in $\theta$ for all $x$, and there exist functions $g_i(x)$, $g_{ij}(x)$, $g_{ijk}(x)$ for $i, j, k = 1, \ldots, p$ such that for all $x$ and all $\theta \in B(\theta_0)$,
$$\left| \frac{\partial f_\theta(x)}{\partial \theta_i} \right| \leq g_i(x), \qquad \left| \frac{\partial^2 f_\theta(x)}{\partial \theta_i \partial \theta_j} \right| \leq g_{ij}(x), \qquad \left| \frac{\partial^3 f_\theta(x)}{\partial \theta_i \partial \theta_j \partial \theta_k} \right| \leq g_{ijk}(x).$$
We further assume that each $g$ satisfies $\int g(x) \, dx < \infty$ and $E_\theta g_{ijk}(X) < \infty$ for $\theta \in B(\theta_0)$.

(R3) The Fisher information matrix $I(\theta_0) = E_{\theta_0}\left[ \left( \frac{d}{d\theta_0} \log f_{\theta_0}(X) \right) \left( \frac{d}{d\theta_0} \log f_{\theta_0}(X) \right)^\top \right]$ consists of finite entries, and is positive definite.
Theorem 3.2. Assume that (R0)-(R3) hold. Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_{\theta_0}$ and let $\omega_1, \ldots, \omega_n \overset{\text{i.i.d.}}{\sim} P$. Choose $\theta_{\text{new}} \in \operatorname*{argmin}_{\theta \in \Theta} \left\| \theta - \left( 2\hat\theta_X - \hat\theta_Z \right) \right\|$, where $\hat\theta$ is an efficient estimator and $(Z_i)_{i=1}^n = (X_{\hat\theta_X}(\omega_i))_{i=1}^n$. Then for $(Y_i)_{i=1}^n = (X_{\theta_{\text{new}}}(\omega_i))_{i=1}^n$, we have $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$.

Proof sketch. The proof is based on two expansions: $\hat\theta_Z - \hat\theta_X = I^{-1}(\theta_0) \frac{1}{n} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2})$ and $\hat\theta_Y = \theta_{\text{new}} + I^{-1}(\theta_0) \frac{1}{n} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2})$. As $n \to \infty$, $\theta_{\text{new}} = 2\hat\theta_X - \hat\theta_Z$ with probability tending to one. Combining the results gives $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$.
Algorithm 1: One Step Synthetic Data Pseudo-Code
INPUT: Seed $\omega$, parametric family $\{f_\theta \mid \theta \in \Theta\}$, efficient estimator $\hat\theta(\cdot)$, and sample $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$
1. set.seed($\omega$) and sample $Z_1, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} f_{\hat\theta_X}$
2. Choose $\theta_{\text{new}} \in \operatorname*{argmin}_{\theta \in \Theta} \| \theta - (2\hat\theta_X - \hat\theta_Z) \|$
3. set.seed($\omega$) and sample $Y_1, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} f_{\theta_{\text{new}}}$
OUTPUT: $Y_1, \ldots, Y_n$

Remark 3.3 (Seeds). In the case of continuous real-valued random variables, we can be more explicit about the "seeds." Recall that for $U \sim U(0,1)$, $F_\theta^{-1}(U) \sim f_\theta$, where $F_\theta^{-1}(\cdot)$ is the quantile function. So in this case, the distribution $P$ can be taken as $U(0,1)$, and $X_\theta(\cdot)$ can be replaced with $F_\theta^{-1}(\cdot)$. When implementing the procedure of Theorem 3.2, it may be convenient to use numerical seeds. For example in R, the command set.seed can be used to emulate the result of drawing $Z_i$ and $Y_i$ with the same seed $\omega_i$. In Algorithm 1, we describe the procedure in pseudo-code.

We illustrate Theorem 3.2 with two examples. In Example 3.4, we simulate from the Burr type XII distribution, demonstrating that $\hat\theta_Y$ has similar performance as $\hat\theta_X$, whereas $\hat\theta_Z$ has inflated variance. We also investigate the distributional properties of the $(Y_i)$ with the Kolmogorov-Smirnov test. We chose the Burr distribution because it is neither location-scale nor exponential family, and so provides a non-trivial setting to test our approach. (For location-scale families, a linear transformation can be used to produce a sample with the desired statistics. In exponential families, if the efficient statistic is sufficient, then the distribution conditional on the statistic is independent of the parameter and can thus be sampled, in principle, without knowing the true parameter.) In Example 3.5, we apply our approach to a log-linear model to show how our approach performs on more complex datasets.

Table 1: Average squared $\ell_2$-distance between the MLE and the vector $(2, 4)$. $(X_i)$ are drawn i.i.d. from Burr$(2, 4)$, $(Z_i)$ are i.i.d. from Burr$(\hat\theta_X)$, and $(Y_i)$ are from Algorithm 1. Results are averaged over 10000 replicates, for each $n$. The first and third lines are accurate up to approximately ± in the third digit of each value with confidence. The second line has error ± in the third digit.

n:                 100    1000    10000
$\hat\theta_X$:     …      …       …
$\hat\theta_Z$:     …      …       …
$\hat\theta_Y$:     …      …       …
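Before turning to the examples, here is a minimal R sketch of Algorithm 1 for the normal location model of Example 3.1, using the quantile-function construction of Remark 3.3 in place of set.seed. The function name is illustrative, not from any released code; for this location family the one-step correction is exact, in line with the parenthetical note above.

```r
# Minimal sketch of Algorithm 1 for the N(mu, 1) model of Example 3.1.
# The seeds are the uniforms of Remark 3.3, so X_theta(omega) = qnorm(omega, theta).
one_step_normal <- function(x) {
  n <- length(x)
  theta_x <- mean(x)                    # efficient estimator on the real data
  omega <- runif(n)                     # fixed seeds omega_1, ..., omega_n
  z <- qnorm(omega, mean = theta_x)     # Z_i = X_{theta_hat_X}(omega_i)
  theta_new <- 2 * theta_x - mean(z)    # one approximate Newton step
  qnorm(omega, mean = theta_new)        # Y_i = X_{theta_new}(omega_i)
}

set.seed(1)
x <- rnorm(1000, mean = 3)
y <- one_step_normal(x)
c(mean(x), mean(y))  # the two estimates coincide for this location family
```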
Example 3.4 (Burr type XII distribution). The Burr type XII distribution, denoted Burr$(c, k)$, also known as the Singh-Maddala distribution, is a useful model for income [McDonald, 2008]. The distribution has pdf $f(x) = c k x^{c-1} (1 + x^c)^{-(k+1)}$, with support $x > 0$. Both $c$ and $k$ are positive. For the simulation, we set $c = 2$ and $k = 4$, and denote $\theta = (c, k)$. Let $\hat\theta_{\text{MLE}}$ be the MLE. We draw $X_i \overset{\text{i.i.d.}}{\sim}$ Burr$(2, 4)$, $Z_i \overset{\text{i.i.d.}}{\sim}$ Burr$(\hat\theta_{\text{MLE}}(X))$, and $(Y_i)_{i=1}^n$ from Algorithm 1. The simulation is conducted for $n \in \{100, 1000, 10000\}$, with results averaged over 10000 replicates for each $n$.

Over the replicates, we compute the MLE and report the average squared $\ell_2$-distance to the true parameters, which estimates the variance. The results are in Table 1. When sampling from the fitted model, $\hat\theta_Z$ has about twice the variance as $\hat\theta_X$, whereas $\hat\theta_Y$ has very similar variance as $\hat\theta_X$.

We also calculate the empirical power of the Kolmogorov-Smirnov (K-S) test, comparing each sample with the true distribution Burr$(2, 4)$, at type I error $0.05$. The results are presented in Table 2. We see that the $(X_i)$ have empirical power approximately $0.05$, confirming that the type I error is appropriately calibrated. We also see that the K-S test using $(Y_i)$ has power approximately $0.05$, indicating that the empirical distribution of the $(Y_i)$ is very close to the true distribution. On the other hand, we see that the K-S test with $(Z_i)$ has power significantly higher than the type I error, indicating that the $(Z_i)$ are from a fundamentally different distribution than the $(X_i)$.

Table 2: Empirical power of the Kolmogorov-Smirnov test for the distribution Burr$(2, 4)$ at type I error $0.05$. $(X_i)$ are drawn i.i.d. from Burr$(2, 4)$, $(Z_i)$ are drawn i.i.d. from Burr$(\hat\theta_X)$, and $(Y_i)$ are from Algorithm 1. Results are averaged over 10000 replicates, for each $n$. Standard errors are approximately … for lines 1 and 3, and … for line 2.

n:         100    1000    10000
$(X_i)$:    …      …       …
$(Z_i)$:    …      …       …
$(Y_i)$:    …      …       …
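As a hedged illustration of Example 3.4, the following R sketch draws Burr$(2, 4)$ data through the inverse CDF $F^{-1}(u) = ((1-u)^{-1/k} - 1)^{1/c}$, computes the MLE numerically with optim, applies the one-step update, and runs the K-S test. The helpers qburr and mle_burr are our own, and the crude coordinate-wise floor stands in for the argmin projection onto $\Theta$ in Theorem 3.2.

```r
# Sketch of Algorithm 1 for Burr(c, k), as in Example 3.4.
qburr <- function(u, cc, kk) ((1 - u)^(-1 / kk) - 1)^(1 / cc)

mle_burr <- function(x) {
  nll <- function(th) {  # negative Burr log-likelihood
    cc <- th[1]; kk <- th[2]
    -sum(log(cc) + log(kk) + (cc - 1) * log(x) - (kk + 1) * log(1 + x^cc))
  }
  optim(c(1, 1), nll, method = "L-BFGS-B", lower = c(1e-3, 1e-3))$par
}

set.seed(42)
n <- 1000
x <- qburr(runif(n), 2, 4)                           # original sample from Burr(2, 4)
theta_x <- mle_burr(x)
omega <- runif(n)                                    # fixed seeds
z <- qburr(omega, theta_x[1], theta_x[2])            # sample from the fitted model
theta_new <- pmax(2 * theta_x - mle_burr(z), 1e-3)   # one step, projected onto Theta
y <- qburr(omega, theta_new[1], theta_new[2])        # synthetic sample
rbind(theta_x, theta_y = mle_burr(y))                # the two MLEs nearly coincide
ks.test(y, function(q) 1 - (1 + q^2)^(-4))           # K-S test against Burr(2, 4)
```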
Example 3.5 (Log-linear model). This example is based on a dataset of 68,694 passengers in automobiles and light trucks involved in accidents in the state of Maine in 1991. Table 3 reports the number of passengers according to gender (G), location (L), seatbelt status (S), and injury status (I). As in Agresti [2003], we fit a hierarchical log-linear model based on all one-way effects and two-way interactions. The model is summarized in Equation (1), where $\mu_{ijk\ell}$ represents the expected count in bin $i, j, k, \ell$. The parameter $\lambda^G_i$ represents the effect of Gender, and the parameter $\lambda^{GL}_{ij}$ represents the interaction between Gender and Location. The other main effects and interactions are analogous.

$$\log \mu_{ijk\ell} = \lambda + \lambda^G_i + \lambda^L_j + \lambda^S_k + \lambda^I_\ell + \lambda^{GL}_{ij} + \lambda^{GS}_{ik} + \lambda^{GI}_{i\ell} + \lambda^{LS}_{jk} + \lambda^{LI}_{j\ell} + \lambda^{SI}_{k\ell} \quad (1)$$

For our simulations, we treat the fitted parameters as the true parameters, to ensure that model assumptions are met. We simulate from the fitted model at sample sizes $n \in \{100, 1000, 10000, 100000\}$ and compare the performance in terms of the fitted probabilities for each bin of the contingency table. The results are plotted in Figure 1a, with both axes on log-scale. The "mean error" is the average squared $\ell_2$-distance between the estimated parameter vector and the true parameter vector, averaged over 200 replicates. To interpret the plot, note that if the error is of the form $\text{error} = c n^{-1}$, where $c$ is a constant, then $\log(\text{error}) = \log c + (-1) \log(n)$. So, the slope represents the convergence rate, and the vertical offset represents the asymptotic variance. In Figure 1a, we see that the curve for $\hat\theta_Y$ approaches the curve for $\hat\theta_X$, indicating that they have the same asymptotic rate and variance. On the other hand, the curve for $\hat\theta_Z$ has the same slope, but does not approach the $\hat\theta_X$ curve, indicating that $\hat\theta_Z$ has the same rate but inflated variance.

Recall that our procedure approximately preserves the sufficient statistics, similar to sampling from a conditional distribution. Previous work has proposed procedures to sample directly from conditional distributions for contingency table data. However, these approaches require sophisticated tools from algebraic statistics, and are computationally expensive (e.g., MCMC) [Karwa and Slavkovic, 2013]. In contrast, our approach is incredibly simple to implement and highly computationally efficient. Our approach is also applicable for a wide variety of models, whereas the techniques to sample directly from the conditional distribution require a tailored approach for each setting.

Table 3: Injury, Seat-Belt Use, Gender, and Location. Source: Agresti [2003, Table 8.8]. Originally credited to Cristanna Cook, Medical Care Development, Augusta, Maine.
                                     Injury
Gender   Location   Seatbelt      No        Yes
Female   Urban      No            7,287     996
                    Yes           11,587    759
         Rural      No            3,246     973
                    Yes           6,134     757
Male     Urban      No            10,381    812
                    Yes           10,969    380
         Rural      No            6,123     1,084
                    Yes           6,693     513
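As a sketch of Example 3.5, the R code below enters the Table 3 counts and fits the two-way interaction model of Equation (1) as a Poisson GLM; glm with the formula (G + L + S + I)^2 is a standard way to fit this hierarchical log-linear model, though it need not match the exact code used for the simulations. The final comment indicates how a fitted-model synthetic table could be drawn before applying the one-step correction.

```r
# Table 3 counts, with injury varying fastest, then seatbelt, location, gender.
accidents <- expand.grid(
  injury   = c("No", "Yes"),
  seatbelt = c("No", "Yes"),
  location = c("Urban", "Rural"),
  gender   = c("Female", "Male")
)
accidents$count <- c(7287, 996, 11587, 759, 3246, 973, 6134, 757,
                     10381, 812, 10969, 380, 6123, 1084, 6693, 513)

# All main effects and two-way interactions, as in Equation (1).
fit <- glm(count ~ (gender + location + seatbelt + injury)^2,
           family = poisson, data = accidents)
p_hat <- fitted(fit) / sum(accidents$count)  # fitted cell probabilities
# A fitted-model synthetic table would be rmultinom(1, sum(accidents$count), p_hat);
# Algorithm 1 then adjusts the fitted parameters before the final draw.
```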
4 Differentially private synthetic data

In this section, we review the basics of differential privacy, and modify our synthetic data procedure to satisfy DP in Corollary 4.2. In Example 4.5, we construct an efficient DP estimator for the beta distribution and demonstrate the result of Corollary 4.2 through a simulation study.

The concept of differential privacy (DP) was proposed in Dwork et al. [2006b] as a framework to develop methods of preserving privacy, with mathematical guarantees. Intuitively, the constraint of differential privacy requires that for all possible databases, the change in one person's data does not significantly change the distribution of outputs. Consequently, having observed the DP output, an adversary cannot accurately determine the input value of any single person in the database. Definition 4.1 gives a formal definition of DP. In Definition 4.1, $H : \mathcal{X}^n \times \mathcal{X}^n \to \mathbb{Z}_{\geq 0}$ represents the Hamming metric, defined by $H(x, x') = \#\{i \mid x_i \neq x'_i\}$.

Definition 4.1 (Differential privacy: Dwork et al. [2006b]). Let $\epsilon > 0$ and $n \in \{1, 2, \ldots\}$ be given. Let $\mathcal{X}$ be any set, and $(\mathcal{Y}, \mathcal{S})$ a measurable space. Let $\mathcal{M} = \{M_x \mid x \in \mathcal{X}^n\}$ be a set of probability measures on $(\mathcal{Y}, \mathcal{S})$, which we call a mechanism. We say that $\mathcal{M}$ satisfies $\epsilon$-differential privacy ($\epsilon$-DP) if $M_x(S) \leq e^\epsilon M_{x'}(S)$ for all $S \in \mathcal{S}$ and all $x, x' \in \mathcal{X}^n$ such that $H(x, x') = 1$.

An important property of differential privacy is that it is invariant to post-processing. Applying any data-independent procedure to the output of a DP mechanism preserves $\epsilon$-DP [Dwork et al., 2014, Proposition 2.1]. Furthermore, Smith [2011] demonstrated that under conditions similar to (R1)-(R3), there exist efficient DP estimators for parametric models. Using these techniques, we modify our synthetic data procedure to satisfy differential privacy in Corollary 4.2.

Corollary 4.2 (Differentially private synthetic data). Assume that (R0)-(R3) hold. Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_{\theta_0}$, and let $\omega_1, \ldots, \omega_n \overset{\text{i.i.d.}}{\sim} P$. Set $\theta_{\text{new}} \in \operatorname*{argmin}_{\theta \in \Theta} \left\| \theta - \left( 2\hat\theta_X - \hat\theta(Z) \right) \right\|$, where $\hat\theta_X$ is an $\epsilon$-DP efficient estimator, $\hat\theta$ is efficient, and $(Z_i)_{i=1}^n = (X_{\hat\theta_X}(\omega_i))_{i=1}^n$. Releasing $(Y_i)_{i=1}^n = (X_{\theta_{\text{new}}}(\omega_i))_{i=1}^n$ satisfies $\epsilon$-DP, and $\hat\theta(Y) = \hat\theta_X + o_p(n^{-1/2})$.
Figure 1: Both figures plot the average squared $\ell_2$-distance between the estimated parameters and the true parameters on the log-scale (mean error versus $n$, with both axes on log-scale). Averages are over 200 replicates for both plots. $\hat\theta_X$ is from the true model, $\hat\theta_Z$ from the fitted model, and $\hat\theta_Y$ from Algorithm 1. (a) Simulations corresponding to the log-linear model with two-way interactions from Example 3.5. (b) Simulations for the beta distribution from Example 4.5; $\hat\theta_X$ is the MLE, and $\hat\theta_Z$ and $\hat\theta_Y$ both satisfy 1-DP.

The proof of Corollary 4.2 is trivial, as $(Y_i)_{i=1}^n$ satisfies $\epsilon$-DP by post-processing and $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$ by Theorem 3.2. Note that in Corollary 4.2, only $\hat\theta_X$ needs to satisfy $\epsilon$-DP. The estimator $\hat\theta$, applied to $(Z_i)_{i=1}^n$ and $(Y_i)_{i=1}^n$, must be efficient but need not satisfy DP. In fact, to improve finite-sample performance, we recommend using a non-private estimator for $\hat\theta$.
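As a hedged sketch of the pipeline in Corollary 4.2, consider an Exponential(rate) model with data clamped to $[0, B]$: the clamped mean has $\ell_1$-sensitivity $B/n$, so Laplace noise of scale $B/(n\epsilon)$ yields an $\epsilon$-DP estimate via the Laplace mechanism reviewed below (Proposition 4.4), and Algorithm 1 is then run from that estimate with a non-private MLE, as recommended above. All function names here are illustrative, and clamping to a fixed $B$ incurs the $O_p(1)$ bias discussed below, so this is purely an illustration.

```r
# Sketch of Corollary 4.2 for an Exponential(rate) model.
rlaplace <- function(m, scale) rexp(m, 1 / scale) - rexp(m, 1 / scale)

dp_one_step_exp <- function(x, eps, B = 10) {
  n <- length(x)
  # eps-DP estimate: clamp to [0, B], so the mean has l1-sensitivity B/n.
  dp_mean <- mean(pmin(x, B)) + rlaplace(1, scale = B / (n * eps))
  theta_dp <- 1 / dp_mean                              # privatized rate estimate
  omega <- runif(n)                                    # fixed seeds
  z <- qexp(omega, rate = theta_dp)                    # sample from privatized fit
  theta_new <- max(2 * theta_dp - 1 / mean(z), 1e-8)   # one step, non-private MLE
  qexp(omega, rate = theta_new)                        # eps-DP output, by post-processing
}

set.seed(2)
y <- dp_one_step_exp(rexp(5000, rate = 2), eps = 1)
1 / mean(y)  # close to the privatized estimate of the true rate 2
```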
Remark 4.3. Besides Definition 4.1, there are many other variations of differential privacy, the majority of which are relaxations of Definition 4.1 which also allow for efficient estimators. For instance, approximate DP [Dwork et al., 2006a], concentrated DP [Dwork and Rothblum, 2016, Bun and Steinke, 2016], truncated-concentrated DP [Bun et al., 2018], and Rényi DP [Mironov, 2017] all allow for efficient estimators. On the other hand, local differential privacy [Kasiviswanathan et al., 2011, Duchi et al., 2013] in general does not permit efficient estimators and would not fit in our framework. For an axiomatic treatment of formal privacy, see Kifer and Lin [2012].

While there are some general methods of producing efficient DP parameter estimates, such as in Smith [2011], often these approaches do not perform well in practical sample sizes. We demonstrate our approach using a modification of the standard
Laplace mechanism. Given a statistic $T$, the Laplace mechanism adds independent Laplace noise to each entry of the statistic, with scale parameter proportional to the sensitivity of the statistic. Informally, the sensitivity of $T$ is the largest amount that $T$ changes, when one person's data is changed in the dataset.

Proposition 4.4 (Sensitivity and Laplace mechanism: Dwork et al. [2006b]). Let $\epsilon > 0$ be given, and let $T : \mathcal{X}^n \to \mathbb{R}^p$ be a statistic. The $\ell_1$-sensitivity of $T$ is $\Delta_n(T) = \sup \|T(x) - T(x')\|_1$, where the supremum is over all $x, x' \in \mathcal{X}^n$ such that $H(x, x') = 1$. Provided that $\Delta_n(T)$ is finite, releasing the vector $(T_j(x) + L_j)_{j=1}^p$ satisfies $\epsilon$-DP, where $L_1, \ldots, L_p \overset{\text{i.i.d.}}{\sim} \text{Laplace}(\Delta_n(T)/\epsilon)$.

Often, to ensure finite sensitivity, the data are clamped to artificial bounds $[a, b]$, introducing bias in the DP estimate. Typically, these bounds are fixed in $n$, resulting in asymptotically negligible Laplace noise, but $O_p(1)$ bias. In Example 4.5, we show that for the beta distribution, it is possible to increase the bounds in $n$ to produce both noise and bias of order $o_p(n^{-1/2})$, resulting in an efficient DP estimator. Furthermore, we show through simulations that using this estimator in Algorithm 1 results in a DP sample with optimal asymptotics. While we work with the beta distribution, this approach may be of value for other exponential family distributions as well.

Example 4.5 (Beta distribution with DP). We assume that $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \text{Beta}(\alpha, \beta)$, where $\alpha, \beta \geq 1$. The complete sufficient statistics for the beta distribution are $n^{-1} \sum_{i=1}^n \log(X_i)$ and $n^{-1} \sum_{i=1}^n \log(1 - X_i)$. We will add Laplace noise to each of these statistics to achieve differential privacy. However, the sensitivity of these quantities is unbounded. First we will pre-process the data by setting $\widetilde{X}_i = \min\{\max(X_i, t), 1 - t\}$, where $t$ is a threshold that will depend on $n$. Then the $\ell_1$-sensitivity of the pair of sufficient statistics is $\Delta(t) = 2 n^{-1} |\log(t) - \log(1 - t)|$. We add independent noise to each of the statistics from the distribution Laplace$(\Delta(t)/\epsilon)$, which results in $\epsilon$-DP versions of these statistics. Finally, we estimate $\theta = (\alpha, \beta)$ by plugging the privatized sufficient statistics into the log-likelihood function and maximizing with respect to $\theta$. The resulting parameter estimate satisfies $\epsilon$-DP by post-processing.

We must carefully choose the threshold $t$ to ensure that the resulting estimate is efficient. The choice of $t$ must satisfy $\Delta(t) = o(n^{-1/2})$ to ensure that the noise does not affect the asymptotics of the likelihood function. We also require that both $P(X_i < t) = o(n^{-1/2})$ and $P(X_i > 1 - t) = o(n^{-1/2})$ to ensure that $\widetilde{X}_i = X_i + o_p(n^{-1/2})$, which limits the bias to $o_p(n^{-1/2})$. For the beta distribution, we can calculate that $P(X_i < t) = O(t^\alpha)$ and $P(X_i > 1 - t) = O(t^\beta)$. Since we assume that $\alpha, \beta \geq 1$, so long as $t = o(n^{-1/2})$ the probability bounds will hold. Taking $t = \min\{1/2,\ 10/(\log(n)\sqrt{n})\}$ satisfies $t = o(n^{-1/2})$, and we estimate the sensitivity as $\Delta(t) \leq 2 n^{-1} \log(t^{-1}) \leq 2 n^{-1} \log(\log(n)\sqrt{n}) = O(\log(n)/n) = o(n^{-1/2})$, which satisfies our requirement for $\Delta$.
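A hedged R sketch of this construction, with our own helper names: clamp to $[t, 1-t]$, privatize the two sufficient statistics with Laplace noise scaled to $\Delta(t)$, and maximize the privatized log-likelihood, which depends on the data only through these two averages. The Beta$(5, 3)$ used in the usage line is an arbitrary choice for illustration.

```r
# Sketch of the eps-DP beta estimator from Example 4.5.
rlaplace <- function(m, scale) rexp(m, 1 / scale) - rexp(m, 1 / scale)

dp_beta_mle <- function(x, eps) {
  n <- length(x)
  t <- min(1 / 2, 10 / (log(n) * sqrt(n)))        # threshold from Example 4.5
  xt <- pmin(pmax(x, t), 1 - t)                   # clamped data
  delta <- 2 / n * abs(log(t) - log(1 - t))       # l1-sensitivity of the pair
  s1 <- mean(log(xt)) + rlaplace(1, delta / eps)  # eps-DP sufficient statistics
  s2 <- mean(log(1 - xt)) + rlaplace(1, delta / eps)
  nll <- function(th) {  # privatized negative log-likelihood (per observation)
    -((th[1] - 1) * s1 + (th[2] - 1) * s2 - lbeta(th[1], th[2]))
  }
  optim(c(2, 2), nll, method = "L-BFGS-B", lower = c(1, 1))$par
}

set.seed(3)
theta_dp <- dp_beta_mle(rbeta(10000, 5, 3), eps = 1)  # DP estimate of (5, 3)
```

Feeding theta_dp into Algorithm 1 in place of $\hat\theta_X$ then yields the DP synthetic sample of Corollary 4.2.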
While there are many choices of $t$ which would satisfy the requirements, our threshold (including the constant 10) was chosen to optimize the finite-sample performance, so that the asymptotics could be demonstrated using smaller sample sizes.

For the simulation, we sample $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \text{Beta}(\alpha, \beta)$ with $\alpha = 5$, for $n \in \{100, 1000, 10000, 100000\}$. We estimate $\hat\theta_X$ with the MLE. Using $\epsilon = 1$, we privatize the sufficient statistics as described above, and obtain $\hat\theta_{\text{DP}}$ from the privatized log-likelihood function. We sample $Z_1, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} f_{\hat\theta_{\text{DP}}}$ and estimate $\hat\theta_Z$ using maximum likelihood. We produce $(Y_i)_{i=1}^n$ from Algorithm 1 using $\hat\theta_{\text{DP}}$ in place of $\hat\theta_X$. In Figure 1b, we plot the average squared $\ell_2$ error between each estimate of $\theta$ and the true value. The errors are averaged over 200 replicates, and are plotted on the log-scale. We see that $\hat\theta_{\text{DP}}$ and $\hat\theta_Y$ have the same asymptotic performance as the MLE, whereas $\hat\theta_Z$ has inflated variance. See the discussion in Example 3.5 to understand this interpretation of the plot.

5 Discussion

In this paper, we proposed a simple method of producing synthetic data from a parametric model, which approximately preserves efficient statistics, ensuring optimal asymptotics. Our approach is widely applicable to parametric models, requiring standard regularity conditions, and is both easily implemented and highly computationally efficient.

A useful aspect of our approach is that it allows for both partially synthetic data, as well as differentially private fully synthetic data. While we investigated pure differential privacy, alternatives such as concentrated DP and approximate DP can also be used, with potentially higher finite-sample utility.

Another strength of our approach is that it only requires the ability to estimate parameters and sample from the model. This is great for usability, as many practitioners can easily implement Algorithm 1, but may not have the expertise to implement a customized MCMC procedure.

We saw in Example 3.4 that in the case of the Burr distribution, the Kolmogorov-Smirnov test cannot distinguish between our synthetic sample and the true distribution which generated the original sample. This suggests that the marginal distribution of the output of Algorithm 1 is very similar to that of the original sample. Future work should investigate ways of quantifying this observation.

While the focus of the paper is on asymptotics, we saw in our simulations that our approach offers high finite-sample utility as well. However, as our approach is a "one-step" procedure, using an iterated version could improve finite-sample utility. Such an iterated algorithm is included in the Supplementary Materials, including a parameter for momentum for improved convergence. We found in additional simulations that this iterated version improves the utility in small samples at an increased computational cost. Other finite-sample improvements to our approach are worth investigating.

References
Alan Agresti. Categorical Data Analysis, volume 482. John Wiley & Sons, 2003.
Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In TCC, 2016.
Mark Bun, Cynthia Dwork, Guy N Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74-86, 2018.
Jim Burridge. Information preserving statistical obfuscation. Statistics and Computing, 13(4):321-327, 2003.
Gregory Caiola and Jerome P Reiter. Random forests for generating partially synthetic, categorical data. Trans. Data Privacy, 3(1):27-42, 2010.
Yuguo Chen, Ian H Dinwoodie, Seth Sullivant, et al. Sequential importance sampling for multiway tables. The Annals of Statistics, 34(1):523-545, 2006.
Jörg Drechsler. Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, volume 201. Springer Science & Business Media, 2011.
Jörg Drechsler and Jerome P Reiter. Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In International Conference on Privacy in Statistical Databases, pages 227-238. Springer, 2008.
John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429-438. IEEE, 2013.
Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.
Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486-503. Springer, 2006a.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, 2006b.
Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211-407, 2014.
Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(Feb):703-727, 2013.
Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pages 2339-2347, 2012.
Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, and Peter-Paul De Wolf. Statistical Disclosure Control. John Wiley & Sons, 2012.
James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. 2018.
Vishesh Karwa and Aleksandra Slavković. Conditional inference given partial information in contingency tables using Markov bases. Wiley Interdisciplinary Reviews: Computational Statistics, 5(3):207-218, 2013.
Vishesh Karwa and Aleksandra B Slavković. Differentially private graphical degree sequences and synthetic graphs. In International Conference on Privacy in Statistical Databases, pages 273-285. Springer, 2012.
Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793-826, 2011.
Daniel Kifer and Bing-Rong Lin. An axiomatic view of statistical privacy and utility. Journal of Privacy and Confidentiality, 4(1), 2012.
Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum. Privacy, Big Data, and the Public Good: Frameworks for Engagement. Cambridge University Press, 2014.
Erich Leo Lehmann. Elements of Large-Sample Theory. Springer Science & Business Media, 2004.
Bai Li, Vishesh Karwa, Aleksandra Slavković, and Rebecca Carter Steorts. A privacy preserving algorithm to release sparse high-dimensional histograms. Journal of Privacy and Confidentiality, 8(1), 2018.
Chong K Liew, Uinam J Choi, and Chung J Liew. A data distortion by probability distribution. ACM Transactions on Database Systems (TODS), 10(3):395-411, 1985.
Fang Liu. Model-based differentially private data synthesis. arXiv preprint arXiv:1606.08052, 2016.
Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber. Privacy: Theory meets practice on the map. In 2008 IEEE 24th International Conference on Data Engineering, pages 277-286. IEEE, 2008.
Josep Maria Mateo-Sanz, Antoni Martínez-Ballesté, and Josep Domingo-Ferrer. Fast generation of accurate synthetic microdata. In International Workshop on Privacy in Statistical Databases, pages 298-306. Springer, 2004.
David McClure and Jerome P Reiter. Differential privacy and statistical disclosure risk measures: An investigation with binary synthetic data. Trans. Data Privacy, 5(3):535-552, 2012.
James B McDonald. Some generalized functions for the size distribution of income. In Modeling Income Distributions and Lorenz Curves, pages 37-55. Springer, 2008.
Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263-275. IEEE, 2017.
Krishnamurty Muralidhar and Rathindra Sarathy. A theoretical basis for perturbation methods. Statistics and Computing, 13(4):329-335, 2003.
Trivellore E Raghunathan, Jerome P Reiter, and Donald B Rubin. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1):1, 2003.
Matthew Reimherr and Jordan Awan. KNG: The k-norm gradient mechanism. In Advances in Neural Information Processing Systems, pages 10208-10219, 2019.
Jerome P Reiter. Using CART to generate partially synthetic public use microdata. Journal of Official Statistics, 21(3):441, 2005.
Donald B Rubin. Statistical disclosure limitation. Journal of Official Statistics, 9(2):461-468, 1993.
R. J. Serfling. Approximation Theorems of Mathematical Statistics. New York: Wiley, 1980.
Aleksandra B Slavković and Juyoun Lee. Synthetic two-way contingency tables that preserve conditional frequencies. Statistical Methodology, 7(3):225-239, 2010.
Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 813-822, 2011.
Daniel Ting, Stephen Fienberg, and Mario Trottini. ROMM methodology for microdata release. Monographs of Official Statistics, page 89, 2005.
Aleksei Triastcyn and Boi Faltings. Generating differentially private datasets using GANs. 2018.
Aad W Van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
Chugui Xu, Ju Ren, Deyu Zhang, Yaoxue Zhang, Zhan Qin, and Kui Ren. GANobfuscator: Mitigating information leakage under GAN via differential privacy. IEEE Transactions on Information Forensics and Security, 14(9):2358-2371, 2019.
Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):1-41, 2017.
Supplementary Materials

Algorithm 2: Iterated One-Step with Momentum
INPUT: Seeds $\omega_1, \ldots, \omega_n \in \Omega$, measurable function $X_\theta : \Omega \to \mathcal{X}$ for all $\theta \in \Theta$, value $\hat\theta_X \in \Theta$, efficient estimator $\hat\theta(\cdot)$, momentum value $\rho \in [0, 1)$
1. Sample $(Z_1, \ldots, Z_n) \sim f^n_{\hat\theta_X}(\omega)$
2. Set $\theta^{(0)} = \hat\theta_X$, $\hat\theta^{(0)} = \hat\theta(Z)$, and $i = 0$
3. repeat
4.   Set $i = i + 1$
5.   Set $\theta^{(i)} = (1 - \rho)\left( \hat\theta_X - [\hat\theta^{(i-1)} - \theta^{(i-1)}] \right) + \rho\, \theta^{(i-1)}$
6.   Sample $(Y^{(i)}_1, \ldots, Y^{(i)}_n) \sim f^n_{\theta^{(i)}}(\omega)$
7.   Set $\hat\theta^{(i)} = \hat\theta\big( (Y^{(i)}_j)_{j=1}^n \big)$
8. until convergence
OUTPUT: Sample $Y_1, \ldots, Y_n \sim f^n_{\theta^{(i)}}(\omega)$.
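For concreteness, a minimal R sketch of Algorithm 2 for the $N(\mu, 1)$ model, again with the seed construction of Remark 3.3; setting $\rho = 0$ iterates the plain one-step update, and the function and argument names are illustrative.

```r
# Sketch of Algorithm 2 (iterated one-step with momentum) for N(mu, 1).
iterated_one_step <- function(x, rho = 0.5, max_iter = 50, tol = 1e-10) {
  theta_x <- mean(x)                  # efficient estimate on the real data
  omega <- runif(length(x))           # fixed seeds
  theta <- theta_x                    # theta^(0) = theta_hat_X
  for (i in seq_len(max_iter)) {
    y <- qnorm(omega, mean = theta)   # Y^(i) drawn from f_theta with fixed seeds
    theta_next <- (1 - rho) * (theta_x - (mean(y) - theta)) + rho * theta
    converged <- abs(theta_next - theta) < tol
    theta <- theta_next
    if (converged) break
  }
  qnorm(omega, mean = theta)          # final synthetic sample
}
```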
Parts 1 and 2 of Lemma 6.1 can be rephrased as the following: $\hat\theta$ is efficient if and only if it is consistent and $n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) = o_p(n^{-1/2})$. The third property of Lemma 6.1 is similar to many standard expansions used in asymptotics, for example in Van der Vaart [2000]. However, we require the expansion for arbitrary efficient estimators, and include a proof for completeness.

Lemma 6.1. Suppose $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_{\theta_0}$, and assume that (R1)-(R3) hold. Let $\hat\theta$ be an efficient estimator, which is a sequence of zeros of the score equations. Suppose that $\widetilde\theta$ is a $\sqrt{n}$-consistent estimator of $\theta_0$. Then

1. If $n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = o_p(n^{-1/2})$, then $\widetilde\theta - \hat\theta = o_p(n^{-1/2})$.
2. If $\widetilde\theta$ is efficient, then $n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = o_p(n^{-1/2})$.
3. If $\widetilde\theta$ is efficient, then $\widetilde\theta = \theta_0 + I^{-1}(\theta_0)\, n^{-1} \sum_{i=1}^n S(\theta_0, X_i) + o_p(n^{-1/2})$.

Proof. As $\widetilde\theta$ and $\hat\theta$ are both $\sqrt{n}$-consistent, we know that $\widetilde\theta - \hat\theta = O_p(n^{-1/2})$. So, we may consider a Taylor expansion of the score function about $\widetilde\theta = \hat\theta$:

$$n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) + \left( \frac{d}{d\hat\theta}\, n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) \right) (\widetilde\theta - \hat\theta) + O_p(n^{-1}) = 0 + \left[ \frac{d}{d\hat\theta}\, n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) + O_p(n^{-1/2}) \right] (\widetilde\theta - \hat\theta) = \left[ -I(\theta_0) + o_p(1) \right] (\widetilde\theta - \hat\theta), \quad (2)$$

where we used assumptions (R1)-(R3) to justify that 1) the second derivative is bounded in a neighborhood about $\theta_0$ (as both $\hat\theta$ and $\widetilde\theta$ converge to $\theta_0$), 2) the derivative of the score converges to $-I(\theta_0)$ by Lehmann [2004, Theorem 7.2.1] along with the law of large numbers, and 3) that $I(\theta_0)$ is finite, by (R3).

To establish property 1, note that the left-hand side of Equation (2) is $o_p(n^{-1/2})$, implying that $(\widetilde\theta - \hat\theta) = o_p(n^{-1/2})$. Recall that by Lehmann [2004, Page 479], if $\widetilde\theta$ and $\hat\theta$ are both efficient, then $(\widetilde\theta - \hat\theta) = o_p(n^{-1/2})$. Plugging this into the right-hand side of Equation (2) gives $n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = o_p(n^{-1/2})$, establishing property 2.

For property 3, we consider a slightly different expansion:

$$o_p(n^{-1/2}) = n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = n^{-1} \sum_{i=1}^n S(\theta_0, X_i) + \frac{d}{d\theta_0}\, n^{-1} \sum_{i=1}^n S(\theta_0, X_i)\, (\widetilde\theta - \theta_0) + O_p(n^{-1}) = n^{-1} \sum_{i=1}^n S(\theta_0, X_i) + \left( -I(\theta_0) + o_p(1) \right) (\widetilde\theta - \theta_0) + O_p(n^{-1}),$$

where we used property 2 for the first equality, expanded the score about $\widetilde\theta = \theta_0$ for the second, and justify the $O_p(n^{-1})$ by (R2). By (R1)-(R2) and the law of large numbers along with Lehmann [2004, Theorem 7.2.1], we have the convergence of the derivative of the score to $-I(\theta_0)$. By (R3), $I(\theta_0)$ is invertible. Solving the equation for $\widetilde\theta$ gives the desired result.
Lemma 6.2. Assume that (R0)-(R3) hold, and let $\omega_1, \ldots, \omega_n \overset{\text{i.i.d.}}{\sim} P$. Then
$$n^{-1} \sum_{i=1}^n \frac{d}{d\theta} S(\theta, X_\theta(\omega_i)) = o_p(1).$$
Proof. First we can express the derivative as
$$n^{-1} \sum_{i=1}^n \frac{d}{d\theta} S(\theta, X_\theta(\omega_i)) = n^{-1} \sum_{i=1}^n \left( \frac{d}{d\alpha} S(\alpha, X_\theta(\omega_i)) + \frac{d}{d\alpha} S(\theta, X_\alpha(\omega_i)) \right) \Big|_{\alpha = \theta}.$$
The result follows from the law of large numbers, provided that
$$E_{\omega \sim P} \left( \frac{d}{d\alpha} S(\alpha, X_\theta(\omega)) + \frac{d}{d\alpha} S(\theta, X_\alpha(\omega)) \right) \Big|_{\alpha = \theta} = 0.$$
The expectation of the first term is $-I(\theta)$, by Lehmann [2004, Theorem 7.2.1]. For the second term, we compute
$$E_{\omega \sim P}\, \frac{d}{d\alpha} S(\theta, X_\alpha(\omega)) \Big|_{\alpha=\theta} = \int_\Omega \frac{d}{d\alpha} S(\theta, X_\alpha(\omega)) \Big|_{\alpha=\theta} \, dP(\omega) = \int_{\mathcal{X}} \frac{d}{d\alpha}\, S(\theta, x) f_\alpha(x) \Big|_{\alpha=\theta} \, d\mu(x) = \int_{\mathcal{X}} S(\theta, x) \left( \frac{d}{d\alpha} f_\alpha(x) \Big|_{\alpha=\theta} \right)^\top d\mu(x) = \int_{\mathcal{X}} S(\theta, x) \left( \frac{\frac{d}{d\theta} f_\theta(x)}{f_\theta(x)} \right)^\top f_\theta(x) \, d\mu(x) = \int_{\mathcal{X}} S(\theta, x) S^\top(\theta, x) f_\theta(x) \, d\mu(x) = E_{X \sim f_\theta} \left[ S(\theta, X) S^\top(\theta, X) \right] = I(\theta).$$
Proof of Theorem 3.2. We expand $\hat\theta_Z$ about $\hat\theta_X$:
$$\hat\theta_Z = \hat\theta_X + I^{-1}(\hat\theta_X)\, n^{-1} \sum_{i=1}^n S(\hat\theta_X, X_{\hat\theta_X}(\omega_i)) + o_p(n^{-1/2}) \quad (3)$$
$$= \hat\theta_X + I^{-1}(\theta_0)\, n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2}), \quad (4)$$
where (3) is a standard expansion of efficient estimators by Lemma 6.1; for (4), the continuous mapping theorem justifies that $I^{-1}(\hat\theta_X) = I^{-1}(\theta_0) + o_p(1)$, we use that $n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) = O_p(n^{-1/2})$, and the score can be expanded about $\hat\theta_X = \theta_0$:
$$n^{-1} \sum_{i=1}^n S(\hat\theta_X, X_{\hat\theta_X}(\omega_i)) = n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + \left( \frac{d}{d\theta^*}\, n^{-1} \sum_{i=1}^n S(\theta^*, X_{\theta^*}(\omega_i)) \right) (\hat\theta_X - \theta_0) = n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(1)\, O_p(n^{-1/2}),$$
where $\theta^*$ is between $\hat\theta_X$ and $\theta_0$; by Lemma 6.2, we justify that the derivative is $o_p(1)$.

Using the same techniques, we do an expansion for $\hat\theta_Y$ about $\theta_{\text{new}} = 2\hat\theta_X - \hat\theta_Z$:
$$\hat\theta_Y = \theta_{\text{new}} + I^{-1}(\theta_{\text{new}})\, n^{-1} \sum_{i=1}^n S(\theta_{\text{new}}, X_{\theta_{\text{new}}}(\omega_i)) + o_p(n^{-1/2}) \quad (5)$$
$$= \theta_{\text{new}} + I^{-1}(\theta_0)\, n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2}) \quad (6)$$
$$= \theta_{\text{new}} + [\hat\theta_Z - \hat\theta_X] + o_p(n^{-1/2}) \quad (7)$$
$$= \hat\theta_X + o_p(n^{-1/2}), \quad (8)$$
where line (6) is a similar expansion as used for equation (3), in line (7) we substituted the expression from (4), and line (8) uses the fact that as $n \to \infty$, $\theta_{\text{new}} = 2\hat\theta_X - \hat\theta_Z$ with probability tending to one. Indeed, since $2\hat\theta_X - \hat\theta_Z$ is a consistent estimator of $\theta_0$, we have that as $n \to \infty$,
$$P(2\hat\theta_X - \hat\theta_Z \in \Theta) \geq P(2\hat\theta_X - \hat\theta_Z \in B(\theta_0)) \to 1.$$