One Step to Efficient Synthetic Data
Jordan Awan
Department of Statistics
Pennsylvania State University
University Park, PA 16802
[email protected]
Zhanrui Cai
Department of Statistics
Pennsylvania State University
University Park, PA 16802
[email protected]
Abstract
We propose a general method of producing synthetic data, which is widely applicable for parametric models, has asymptotically efficient summary statistics, and is both easily implemented and highly computationally efficient. Our approach allows for the construction of both partially synthetic datasets, which preserve the summary statistics without formal privacy methods, as well as fully synthetic data which satisfy the strong guarantee of differential privacy (DP), both with asymptotically efficient summary statistics. While our theory deals with asymptotics, we demonstrate through simulations that our approach offers high utility in small samples as well. In particular, we 1) apply our method to the Burr distribution, evaluating the parameter estimates as well as distributional properties with the Kolmogorov-Smirnov test, 2) demonstrate the performance of our mechanism on a log-linear model based on a car accident dataset, and 3) produce DP synthetic data for the beta distribution using a customized Laplace mechanism.
1 Introduction

With the advances in modern technology, government and other research agencies are able to collect massive amounts of data from individual respondents. These data are valuable for scientific progress and policy research, but they also come with increased privacy risk [Lane et al., 2014]. To publish useful information while preserving confidentiality of sensitive information, numerous methods of generating synthetic data have been proposed (see Hundepool et al. [2012, Chapter 3] for a survey). The goal of synthetic data is to produce a new dataset which preserves distributional properties of the original dataset, while protecting the privacy of the participating individuals. There are two main types of synthetic data: partially synthetic data, which allows for certain statistics or attributes to be released without privacy while protecting the other aspects of the data, and fully synthetic data, where all statistics and attributes of the data are protected.

In this paper, we propose a general method of producing synthetic data which is widely applicable for parametric models, has asymptotically efficient summary statistics, is both easily implemented and highly computationally efficient, and can produce either partially synthetic data or differentially private fully synthetic data. More formally, given sensitive data $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$ with unknown parameter $\theta$, we present a method of producing a synthetic dataset $(Y_i)_{i=1}^n$, which satisfies $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$ under relatively mild conditions. In occasions where $\hat\theta_X$ itself must be protected, we show that the proposed synthetic mechanism can be easily modified to satisfy the strong guarantee of differential privacy by first privatizing $\hat\theta_X$. Our approach can be viewed as one step of an approximate Newton method, which aims to solve $\hat\theta_Y = \hat\theta_X$. Similar to the classical one-step estimator [Van der Vaart, 2000], our one-step approach to synthetic data has efficient summary statistics.

Differential privacy (DP) was proposed in Dwork et al. [2006b] as a framework to develop formally private methods. Methods which satisfy DP require the introduction of additional randomness, beyond sampling, in order to obscure the effect of one individual on the output. Intuitively, DP ensures plausible deniability for those participating in the dataset. As the literature on DP has developed, there are now many privacy tools to preserve sample statistics with asymptotically negligible noise (e.g., Smith [2011], Reimherr and Awan [2019]), and to produce DP synthetic data (see related work).

Related work
A common approach to synthetic data is that of Liew et al. [1985], who propose drawing synthetic data from a fitted model. While Liew et al. [1985] do not incorporate formal privacy methods, this approach is often used to produce differentially private synthetic data. Hall et al. [2013] develop DP tools for kernel density estimators, which can be sampled to produce DP synthetic data. Machanavajjhala et al. [2008] develop a synthetic data method based on a multinomial model, which satisfies a modified version of DP to accommodate sparse spatial data. McClure and Reiter [2012] sample from the posterior predictive distribution to produce DP synthetic data, which is asymptotically similar to that of Liew et al. [1985] (see Example 3.1). Liu [2016] also uses a Bayesian framework: first producing DP estimates of the Bayesian sufficient statistics, then drawing the parameter from the distribution conditional on the DP statistics, and finally sampling synthetic data conditional on the sampled parameter. Zhang et al. [2017] propose a method of developing high-dimensional DP synthetic data which draws from a fitted model based on differentially private marginals.

There is also a line of research which produces synthetic data from a conditional distribution, preserving certain statistics. The most fundamental perspective of this approach is that of Muralidhar and Sarathy [2003], who propose drawing confidential variables from the distribution conditional on the non-confidential variables. Burridge [2003] generates partially synthetic data, preserving the mean and covariance for normally distributed variables. This approach was extended to a computationally efficient version by Mateo-Sanz et al. [2004], and Ting et al. [2005] give an alternative approach to preserving the mean and covariance by using random orthogonal matrix multiplication.

There are also tools, often based in algebraic statistics, to sample conditional distributions preserving certain statistics for contingency tables. Karwa and Slavkovic [2013] give a survey of Markov chain Monte Carlo (MCMC) techniques to sample conditional distributions. Another approach is sequential importance sampling, proposed by Chen et al. [2006]. Slavković and Lee [2010] use these techniques to generate synthetic contingency tables that preserve conditional frequencies.

In differential privacy, there are also synthetic data methods which preserve sample statistics. Karwa and Slavković [2012] generate DP synthetic networks from the beta exponential random graph model, conditional on the degree sequence. Li et al. [2018] produce DP high-dimensional synthetic contingency tables using a modified Gibbs sampler. Hardt et al. [2012] give a distribution-free algorithm to produce a DP synthetic dataset, which approximately preserves several linear statistics.

While this paper is focused on producing synthetic data for parametric models, there are several non-parametric methods of producing synthetic data, using tools such as multiple imputation [Rubin, 1993, Raghunathan et al., 2003, Drechsler, 2011], regression trees [Reiter, 2005, Drechsler and Reiter, 2008], and random forests [Caiola and Reiter, 2010]. Recently there has been success in producing differentially private synthetic data using generative adversarial neural networks [Jordon et al., 2018, Triastcyn and Faltings, 2018, Xu et al., 2019].
Our contributions and organization
The related work cited above largely fits into one of two categories: 1) sampling from a fitted distribution or 2) sampling from a distribution conditional on sample statistics. We illustrate in Example 3.1 that the first approach results in samples with reduced asymptotic relative efficiency compared to the original sample. On the other hand, in the case of certain distributions (e.g., exponential families), the second approach is often able to result in samples equal in distribution to the original sample, maintaining asymptotic performance.

However, there are important limitations to the previous works which sample from a conditional distribution. First, the previous approaches are all highly specific to the model at hand, and require different techniques for different models. Second, many of the approaches are difficult to implement and computationally expensive, involving complex iterative sampling schemes such as MCMC.

Our approach also preserves summary statistics, but unlike previous methods it is applicable to a wide variety of parametric models, easily implemented, and highly computationally efficient. Indeed, the regularity conditions required for our asymptotics are similar to those required for the central limit theorem of the maximum likelihood estimator (MLE), the computations only require efficient estimators for the parameters and the ability to sample the model, and the computational time is proportional to simply fitting the model. Our approach can be used to produce partially synthetic data, or fully synthetic DP data by first privatizing the efficient estimator.

The rest of the paper is organized as follows: In Section 2, we review some statistics background and notation. We give our asymptotic results in Section 3, and illustrate the performance of our approach with the Burr distribution and a log-linear model. In Section 4, we recall the basics of differential privacy and extend our approach to produce DP synthetic data. We also include an example which constructs an efficient DP estimator for the beta distribution, and demonstrate the performance of our approach via simulations. We end in Section 5 with some discussion.
2 Background and notation

In this section, we review some background and notation that we use throughout the paper.

For a parametric random variable, we write $X \sim f_\theta$ to indicate that $X$ has probability density function (pdf) $f_\theta$. To indicate that a sequence of random variables from the model $f_\theta$ are independent and identically distributed (i.i.d.), we write $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$. We write $X = (X_i)_{i=1}^n = (X_1, \ldots, X_n)^\top$.

Let $A$ be a random vector, $A_n$ be a sequence of random vectors, and $r_n$ be a positive numerical sequence. We write $A_n \xrightarrow{d} A$ to denote that $A_n$ converges in distribution to $A$. We write $A_n = o_p(r_n)$ to denote that $A_n/r_n \xrightarrow{d} 0$. We write $A_n = O_p(r_n)$ to denote that $A_n/r_n$ is bounded in probability.

For multivariate derivatives, we will overload the $\frac{d}{d\theta}$ operator as follows. For a function $f : \mathbb{R}^p \to \mathbb{R}$, we write $\frac{d}{d\theta} f(\theta)$ to denote the $p \times 1$ vector of partial derivatives $(\frac{\partial}{\partial \theta_j} f(\theta))_{j=1}^p$. For a function $g : \mathbb{R}^p \to \mathbb{R}^q$, we write $\frac{d}{d\theta} g(\theta)$ to denote the $p \times q$ matrix $(\frac{\partial}{\partial \theta_j} g_k(\theta))_{j,k=1}^{p,q}$.

For $X \sim f_\theta$, we denote the score function as $S(\theta, x) = \frac{d}{d\theta} \log f_\theta(x)$, and the Fisher information as $I(\theta) = E_\theta[S(\theta, X) S^\top(\theta, X)]$. An estimator $\hat\theta : \mathcal{X}^n \to \Theta$ is efficient if for $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$, we have $\sqrt{n}(\hat\theta(X) - \theta) \xrightarrow{d} N(0, I^{-1}(\theta))$. We will often write $\hat\theta_X$ in place of $\hat\theta(X)$.

3 One step synthetic data

In this section, we present our synthetic data procedure and its asymptotics in Theorem 3.2. We also include a pseudo-code version of our approach in Algorithm 1, to aid implementation. We demonstrate the finite-sample performance of our procedure on both a continuous and a discrete example. With the Burr type XII distribution, we study both the properties of the fitted parameters as well as distributional properties as measured by the Kolmogorov-Smirnov test. We then investigate the performance of our synthetic data on a log-linear model with two-way interactions, demonstrating our approach on more complex datasets.

We saw in the related work that a common approach to synthetic data is to sample from a fitted distribution. However, this approach results in suboptimal asymptotics, as illustrated in Example 3.1.
Example 3.1. Suppose that $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, 1)$. We estimate $\hat\mu(X) = n^{-1} \sum_{i=1}^n X_i$ and draw $Z_1, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} N(\hat\mu(X), 1)$. We can compute $\mathrm{Var}(\hat\mu(X)) = n^{-1}$, whereas $\mathrm{Var}(\hat\mu(Z)) = 2n^{-1}$. By using the synthetic data $Z$, we have lost half of the effective sample size.

While a simple example, the implications of Example 3.1 are quite general. Recall that the majority of estimators in the statistical literature are $\sqrt{n}$-consistent and follow asymptotic normal distributions. For example, given an i.i.d. sample $(X_i)_{i=1}^n$ drawn from $f_\theta$, an efficient estimator (e.g., the maximum likelihood estimator) $\hat\theta_X$ has variance $n^{-1} I^{-1}(\theta) + o(n^{-1})$. Drawing $(Z_i)_{i=1}^n$ i.i.d. from $f_{\hat\theta_X}$, it is easily verified that the variance of an efficient estimator $\hat\theta_Z$ is $2 n^{-1} I^{-1}(\theta) + o(n^{-1})$. Half of the effective sample size is lost here as well.

Our approach avoids the asymptotic problem of Example 3.1 by producing a sample $(Y_i)_{i=1}^n$ such that $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$. Then marginally, the asymptotic distributions of $\hat\theta_Y$ and $\hat\theta_X$ are identical. The intuition behind the method is that after fixing the "seed," we search for a parameter $\theta_{\text{new}}$ such that when $(Y_i)_{i=1}^n$ are sampled from $f_{\theta_{\text{new}}}$, we have that $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$. To arrive at the value $\theta_{\text{new}}$, we use one step of an approximate Newton method, described in Theorem 3.2.

To facilitate the asymptotic analysis, we assume regularity conditions (R0)-(R3). (R1)-(R3) are similar to standard conditions to ensure that there exists an efficient estimator, which are relatively mild and widely assumed in the literature [Serfling, 1980, Lehmann, 2004]. We include measure-theoretic assumptions in (R0) to formalize our method.

(R0) Let $(\Omega, \mathcal{F}, P)$ be a probability space of the seed $\omega$. Let $X_\theta : \Omega \to \mathcal{X}$ be a measurable function, where $(\mathcal{X}, \mathcal{G})$ is a measurable space and $\theta \in \Theta \subset \mathbb{R}^p$, where $\Theta$ is compact. We assume that there exists a measure $\mu$ on $(\mathcal{X}, \mathcal{G})$ which dominates $P X_\theta^{-1}$ for all $\theta \in \Theta$. Then there exist densities $f_\theta : \mathcal{X} \to \mathbb{R}_{\geq 0}$ such that $\int_A dP X_\theta^{-1} = \int_A f_\theta \, d\mu$ for all $A \in \mathcal{G}$.

(R1) Let $\theta_0 \in \Theta \subset \mathbb{R}^p$ be the true parameter. Assume there exists an open ball $B(\theta_0) \subset \Theta$ about $\theta_0$, the model $f_\theta$ is identifiable, and that the set $\{x \in \mathcal{X} \mid f_\theta(x) > 0\}$ does not depend on $\theta$.

(R2) The pdf $f_\theta(x)$ has three derivatives in $\theta$ for all $x$, and there exist functions $g_i(x)$, $g_{ij}(x)$, $g_{ijk}(x)$ for $i, j, k = 1, \ldots, p$ such that for all $x$ and all $\theta \in B(\theta_0)$,
$$\left| \frac{\partial f_\theta(x)}{\partial \theta_i} \right| \leq g_i(x), \qquad \left| \frac{\partial^2 f_\theta(x)}{\partial \theta_i \partial \theta_j} \right| \leq g_{ij}(x), \qquad \left| \frac{\partial^3 f_\theta(x)}{\partial \theta_i \partial \theta_j \partial \theta_k} \right| \leq g_{ijk}(x).$$
We further assume that each $g$ satisfies $\int g(x) \, dx < \infty$ and $E_\theta g_{ijk}(X) < \infty$ for $\theta \in B(\theta_0)$.

(R3) The Fisher information matrix $I(\theta_0) = E_{\theta_0}\left[ \left( \frac{d}{d\theta_0} \log f_{\theta_0}(X) \right) \left( \frac{d}{d\theta_0} \log f_{\theta_0}(X) \right)^\top \right]$ consists of finite entries, and is positive definite.
Theorem 3.2. Assume that (R0)-(R3) hold. Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_{\theta_0}$ and let $\omega_1, \ldots, \omega_n \overset{\text{i.i.d.}}{\sim} P$. Choose $\theta_{\text{new}} \in \operatorname*{argmin}_{\theta \in \Theta} \left\| \theta - \left( 2\hat\theta_X - \hat\theta_Z \right) \right\|$, where $\hat\theta$ is an efficient estimator and $(Z_i)_{i=1}^n = (X_{\hat\theta_X}(\omega_i))_{i=1}^n$. Then for $(Y_i)_{i=1}^n = (X_{\theta_{\text{new}}}(\omega_i))_{i=1}^n$, we have $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$.

Proof sketch. The proof is based on two expansions: $\hat\theta_Z - \hat\theta_X = I^{-1}(\theta_0) \frac{1}{n} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2})$ and $\hat\theta_Y = \theta_{\text{new}} + I^{-1}(\theta_0) \frac{1}{n} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2})$. As $n \to \infty$, $\theta_{\text{new}} = 2\hat\theta_X - \hat\theta_Z$ with probability tending to one. Combining the results gives $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$.
Algorithm 1: One Step Synthetic Data Pseudo-Code
INPUT: Seed $\omega$, parametric family $\{f_\theta \mid \theta \in \Theta\}$, efficient estimator $\hat\theta(\cdot)$, and sample $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_\theta$
1. set.seed($\omega$) and sample $Z_1, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} f_{\hat\theta_X}$
2. Choose $\theta_{\text{new}} \in \operatorname*{argmin}_{\theta \in \Theta} \| \theta - (2\hat\theta_X - \hat\theta_Z) \|$
3. set.seed($\omega$) and sample $Y_1, \ldots, Y_n \overset{\text{i.i.d.}}{\sim} f_{\theta_{\text{new}}}$
OUTPUT: $Y_1, \ldots, Y_n$

Remark 3.3 (Seeds). In the case of continuous real-valued random variables, we can be more explicit about the "seeds." Recall that for $U \sim U(0,1)$, $F_\theta^{-1}(U) \sim f_\theta$, where $F_\theta^{-1}(\cdot)$ is the quantile function. So in this case, the distribution $P$ can be taken as $U(0,1)$, and $X_\theta(\cdot)$ can be replaced with $F_\theta^{-1}(\cdot)$. When implementing the procedure of Theorem 3.2, it may be convenient to use numerical seeds. For example in R, the command set.seed can be used to emulate the result of drawing $Z_i$ and $Y_i$ with the same seed $\omega_i$. In Algorithm 1, we describe the procedure in pseudo-code.

We illustrate Theorem 3.2 with two examples. In Example 3.4, we simulate from the Burr type XII distribution, demonstrating that $\hat\theta_Y$ has similar performance as $\hat\theta_X$, whereas $\hat\theta_Z$ has inflated variance. We also investigate the distributional properties of the $(Y_i)$ with the Kolmogorov-Smirnov test. We chose the Burr distribution because it is neither location-scale nor exponential family, and so provides a non-trivial setting to test our approach. (For location-scale families, a linear transformation can be used to produce a sample with the desired statistics. In exponential families, if the efficient statistic is sufficient, then the distribution conditional on the statistic is independent of the parameter and can thus be sampled, in principle, without knowing the true parameter.) In Example 3.5, we apply our approach to a log-linear model to show how our approach performs on more complex datasets.

Table 1: Average squared $\ell_2$-distance between the MLE and the vector $(2, 4)$. $(X_i)$ are drawn i.i.d. from Burr$(2, 4)$, $(Z_i)$ are i.i.d. from Burr$(\hat\theta_X)$, and $(Y_i)$ are from Algorithm 1. Results are averaged over 10000 replicates, for each $n$. The first and third lines are accurate up to approximately ± in the third digit of each value with confidence. The second line has error ± in the third digit.

n:                 100    1000    10000
$\hat\theta_X$:     …      …       …
$\hat\theta_Z$:     …      …       …
$\hat\theta_Y$:     …      …       …
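Before turning to the examples, here is a minimal R sketch of Algorithm 1 for the normal location model of Example 3.1, using the quantile-function construction of Remark 3.3 in place of set.seed. The function name is illustrative, not from any released code; for this location family the one-step correction is exact, in line with the parenthetical note above.

```r
# Minimal sketch of Algorithm 1 for the N(mu, 1) model of Example 3.1.
# The seeds are the uniforms of Remark 3.3, so X_theta(omega) = qnorm(omega, theta).
one_step_normal <- function(x) {
  n <- length(x)
  theta_x <- mean(x)                    # efficient estimator on the real data
  omega <- runif(n)                     # fixed seeds omega_1, ..., omega_n
  z <- qnorm(omega, mean = theta_x)     # Z_i = X_{theta_hat_X}(omega_i)
  theta_new <- 2 * theta_x - mean(z)    # one approximate Newton step
  qnorm(omega, mean = theta_new)        # Y_i = X_{theta_new}(omega_i)
}

set.seed(1)
x <- rnorm(1000, mean = 3)
y <- one_step_normal(x)
c(mean(x), mean(y))  # the two estimates coincide for this location family
```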
Example 3.4 (Burr type XII distribution). The Burr type XII distribution, denoted Burr$(c, k)$, also known as the Singh-Maddala distribution, is a useful model for income [McDonald, 2008]. The distribution has pdf $f(x) = c k x^{c-1} (1 + x^c)^{-(k+1)}$, with support $x > 0$. Both $c$ and $k$ are positive. For the simulation, we set $c = 2$ and $k = 4$, and denote $\theta = (c, k)$. Let $\hat\theta_{\text{MLE}}$ be the MLE. We draw $X_i \overset{\text{i.i.d.}}{\sim}$ Burr$(2, 4)$, $Z_i \overset{\text{i.i.d.}}{\sim}$ Burr$(\hat\theta_{\text{MLE}}(X))$, and $(Y_i)_{i=1}^n$ from Algorithm 1. The simulation is conducted for $n \in \{100, 1000, 10000\}$, with results averaged over 10000 replicates for each $n$.

Over the replicates, we compute the MLE and report the average squared $\ell_2$-distance to the true parameters, which estimates the variance. The results are in Table 1. When sampling from the fitted model, $\hat\theta_Z$ has about twice the variance as $\hat\theta_X$, whereas $\hat\theta_Y$ has very similar variance as $\hat\theta_X$.

We also calculate the empirical power of the Kolmogorov-Smirnov (K-S) test, comparing each sample with the true distribution Burr$(2, 4)$, at type I error $0.05$. The results are presented in Table 2. We see that the $(X_i)$ have empirical power approximately $0.05$, confirming that the type I error is appropriately calibrated. We also see that the K-S test using $(Y_i)$ has power approximately $0.05$, indicating that the empirical distribution of the $(Y_i)$ is very close to the true distribution. On the other hand, we see that the K-S test with $(Z_i)$ has power significantly higher than the type I error, indicating that the $(Z_i)$ are from a fundamentally different distribution than the $(X_i)$.

Table 2: Empirical power of the Kolmogorov-Smirnov test for the distribution Burr$(2, 4)$ at type I error $0.05$. $(X_i)$ are drawn i.i.d. from Burr$(2, 4)$, $(Z_i)$ are drawn i.i.d. from Burr$(\hat\theta_X)$, and $(Y_i)$ are from Algorithm 1. Results are averaged over 10000 replicates, for each $n$. Standard errors are approximately … for lines 1 and 3, and … for line 2.

n:         100    1000    10000
$(X_i)$:    …      …       …
$(Z_i)$:    …      …       …
$(Y_i)$:    …      …       …
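As a hedged illustration of Example 3.4, the following R sketch draws Burr$(2, 4)$ data through the inverse CDF $F^{-1}(u) = ((1-u)^{-1/k} - 1)^{1/c}$, computes the MLE numerically with optim, applies the one-step update, and runs the K-S test. The helpers qburr and mle_burr are our own, and the crude coordinate-wise floor stands in for the argmin projection onto $\Theta$ in Theorem 3.2.

```r
# Sketch of Algorithm 1 for Burr(c, k), as in Example 3.4.
qburr <- function(u, cc, kk) ((1 - u)^(-1 / kk) - 1)^(1 / cc)

mle_burr <- function(x) {
  nll <- function(th) {  # negative Burr log-likelihood
    cc <- th[1]; kk <- th[2]
    -sum(log(cc) + log(kk) + (cc - 1) * log(x) - (kk + 1) * log(1 + x^cc))
  }
  optim(c(1, 1), nll, method = "L-BFGS-B", lower = c(1e-3, 1e-3))$par
}

set.seed(42)
n <- 1000
x <- qburr(runif(n), 2, 4)                           # original sample from Burr(2, 4)
theta_x <- mle_burr(x)
omega <- runif(n)                                    # fixed seeds
z <- qburr(omega, theta_x[1], theta_x[2])            # sample from the fitted model
theta_new <- pmax(2 * theta_x - mle_burr(z), 1e-3)   # one step, projected onto Theta
y <- qburr(omega, theta_new[1], theta_new[2])        # synthetic sample
rbind(theta_x, theta_y = mle_burr(y))                # the two MLEs nearly coincide
ks.test(y, function(q) 1 - (1 + q^2)^(-4))           # K-S test against Burr(2, 4)
```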
Example 3.5 (Log-linear model). This example is based on a dataset of 68,694 passengers in automobiles and light trucks involved in accidents in the state of Maine in 1991. Table 3 reports the number of passengers according to gender (G), location (L), seatbelt status (S), and injury status (I). As in Agresti [2003], we fit a hierarchical log-linear model based on all one-way effects and two-way interactions. The model is summarized in Equation (1), where $\mu_{ijk\ell}$ represents the expected count in bin $i, j, k, \ell$. The parameter $\lambda^G_i$ represents the effect of Gender, and the parameter $\lambda^{GL}_{ij}$ represents the interaction between Gender and Location. The other main effects and interactions are analogous.

$$\log \mu_{ijk\ell} = \lambda + \lambda^G_i + \lambda^L_j + \lambda^S_k + \lambda^I_\ell + \lambda^{GL}_{ij} + \lambda^{GS}_{ik} + \lambda^{GI}_{i\ell} + \lambda^{LS}_{jk} + \lambda^{LI}_{j\ell} + \lambda^{SI}_{k\ell} \quad (1)$$

For our simulations, we treat the fitted parameters as the true parameters, to ensure that model assumptions are met. We simulate from the fitted model at sample sizes $n \in \{100, 1000, 10000, 100000\}$ and compare the performance in terms of the fitted probabilities for each bin of the contingency table. The results are plotted in Figure 1a, with both axes on log-scale. The "mean error" is the average squared $\ell_2$-distance between the estimated parameter vector and the true parameter vector, averaged over 200 replicates. To interpret the plot, note that if the error is of the form $\text{error} = c n^{-1}$, where $c$ is a constant, then $\log(\text{error}) = \log c + (-1) \log(n)$. So, the slope represents the convergence rate, and the vertical offset represents the asymptotic variance. In Figure 1a, we see that the curve for $\hat\theta_Y$ approaches the curve for $\hat\theta_X$, indicating that they have the same asymptotic rate and variance. On the other hand, the curve for $\hat\theta_Z$ has the same slope, but does not approach the $\hat\theta_X$ curve, indicating that $\hat\theta_Z$ has the same rate but inflated variance.

Recall that our procedure approximately preserves the sufficient statistics, similar to sampling from a conditional distribution. Previous work has proposed procedures to sample directly from conditional distributions for contingency table data. However, these approaches require sophisticated tools from algebraic statistics, and are computationally expensive (e.g., MCMC) [Karwa and Slavkovic, 2013]. In contrast, our approach is incredibly simple to implement and highly computationally efficient. Our approach is also applicable for a wide variety of models, whereas the techniques to sample directly from the conditional distribution require a tailored approach for each setting.

Table 3: Injury, Seat-Belt Use, Gender, and Location. Source: Agresti [2003, Table 8.8]. Originally credited to Cristanna Cook, Medical Care Development, Augusta, Maine.
                                     Injury
Gender   Location   Seatbelt      No        Yes
Female   Urban      No            7,287     996
                    Yes           11,587    759
         Rural      No            3,246     973
                    Yes           6,134     757
Male     Urban      No            10,381    812
                    Yes           10,969    380
         Rural      No            6,123     1,084
                    Yes           6,693     513
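As a sketch of Example 3.5, the R code below enters the Table 3 counts and fits the two-way interaction model of Equation (1) as a Poisson GLM; glm with the formula (G + L + S + I)^2 is a standard way to fit this hierarchical log-linear model, though it need not match the exact code used for the simulations. The final comment indicates how a fitted-model synthetic table could be drawn before applying the one-step correction.

```r
# Table 3 counts, with injury varying fastest, then seatbelt, location, gender.
accidents <- expand.grid(
  injury   = c("No", "Yes"),
  seatbelt = c("No", "Yes"),
  location = c("Urban", "Rural"),
  gender   = c("Female", "Male")
)
accidents$count <- c(7287, 996, 11587, 759, 3246, 973, 6134, 757,
                     10381, 812, 10969, 380, 6123, 1084, 6693, 513)

# All main effects and two-way interactions, as in Equation (1).
fit <- glm(count ~ (gender + location + seatbelt + injury)^2,
           family = poisson, data = accidents)
p_hat <- fitted(fit) / sum(accidents$count)  # fitted cell probabilities
# A fitted-model synthetic table would be rmultinom(1, sum(accidents$count), p_hat);
# Algorithm 1 then adjusts the fitted parameters before the final draw.
```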
4 Differentially private synthetic data

In this section, we review the basics of differential privacy, and modify our synthetic data procedure to satisfy DP in Corollary 4.2. In Example 4.5, we construct an efficient DP estimator for the beta distribution and demonstrate the result of Corollary 4.2 through a simulation study.

The concept of differential privacy (DP) was proposed in Dwork et al. [2006b] as a framework to develop methods of preserving privacy, with mathematical guarantees. Intuitively, the constraint of differential privacy requires that for all possible databases, the change in one person's data does not significantly change the distribution of outputs. Consequently, having observed the DP output, an adversary cannot accurately determine the input value of any single person in the database. Definition 4.1 gives a formal definition of DP. In Definition 4.1, $H : \mathcal{X}^n \times \mathcal{X}^n \to \mathbb{Z}_{\geq 0}$ represents the Hamming metric, defined by $H(x, x') = \#\{i \mid x_i \neq x'_i\}$.

Definition 4.1 (Differential privacy: Dwork et al. [2006b]). Let $\epsilon > 0$ and $n \in \{1, 2, \ldots\}$ be given. Let $\mathcal{X}$ be any set, and $(\mathcal{Y}, \mathcal{S})$ a measurable space. Let $\mathcal{M} = \{M_x \mid x \in \mathcal{X}^n\}$ be a set of probability measures on $(\mathcal{Y}, \mathcal{S})$, which we call a mechanism. We say that $\mathcal{M}$ satisfies $\epsilon$-differential privacy ($\epsilon$-DP) if $M_x(S) \leq e^\epsilon M_{x'}(S)$ for all $S \in \mathcal{S}$ and all $x, x' \in \mathcal{X}^n$ such that $H(x, x') = 1$.

An important property of differential privacy is that it is invariant to post-processing. Applying any data-independent procedure to the output of a DP mechanism preserves $\epsilon$-DP [Dwork et al., 2014, Proposition 2.1]. Furthermore, Smith [2011] demonstrated that under conditions similar to (R1)-(R3), there exist efficient DP estimators for parametric models. Using these techniques, we modify our synthetic data procedure to satisfy differential privacy in Corollary 4.2.

Corollary 4.2 (Differentially private synthetic data). Assume that (R0)-(R3) hold. Let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_{\theta_0}$, and let $\omega_1, \ldots, \omega_n \overset{\text{i.i.d.}}{\sim} P$. Set $\theta_{\text{new}} \in \operatorname*{argmin}_{\theta \in \Theta} \left\| \theta - \left( 2\hat\theta_X - \hat\theta(Z) \right) \right\|$, where $\hat\theta_X$ is an $\epsilon$-DP efficient estimator, $\hat\theta$ is efficient, and $(Z_i)_{i=1}^n = (X_{\hat\theta_X}(\omega_i))_{i=1}^n$. Releasing $(Y_i)_{i=1}^n = (X_{\theta_{\text{new}}}(\omega_i))_{i=1}^n$ satisfies $\epsilon$-DP, and $\hat\theta(Y) = \hat\theta_X + o_p(n^{-1/2})$.
Figure 1: Both figures plot the average squared $\ell_2$-distance between the estimated parameters and the true parameters on the log-scale (mean error versus $n$, with both axes on log-scale). Averages are over 200 replicates for both plots. $\hat\theta_X$ is from the true model, $\hat\theta_Z$ from the fitted model, and $\hat\theta_Y$ from Algorithm 1. (a) Simulations corresponding to the log-linear model with two-way interactions from Example 3.5. (b) Simulations for the beta distribution from Example 4.5; $\hat\theta_X$ is the MLE, and $\hat\theta_Z$ and $\hat\theta_Y$ both satisfy 1-DP.

The proof of Corollary 4.2 is trivial, as $(Y_i)_{i=1}^n$ satisfies $\epsilon$-DP by post-processing and $\hat\theta_Y = \hat\theta_X + o_p(n^{-1/2})$ by Theorem 3.2. Note that in Corollary 4.2, only $\hat\theta_X$ needs to satisfy $\epsilon$-DP. The estimator $\hat\theta$, applied to $(Z_i)_{i=1}^n$ and $(Y_i)_{i=1}^n$, must be efficient but need not satisfy DP. In fact, to improve finite-sample performance, we recommend using a non-private estimator for $\hat\theta$.
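As a hedged sketch of the pipeline in Corollary 4.2, consider an Exponential(rate) model with data clamped to $[0, B]$: the clamped mean has $\ell_1$-sensitivity $B/n$, so Laplace noise of scale $B/(n\epsilon)$ yields an $\epsilon$-DP estimate via the Laplace mechanism reviewed below (Proposition 4.4), and Algorithm 1 is then run from that estimate with a non-private MLE, as recommended above. All function names here are illustrative, and clamping to a fixed $B$ incurs the $O_p(1)$ bias discussed below, so this is purely an illustration.

```r
# Sketch of Corollary 4.2 for an Exponential(rate) model.
rlaplace <- function(m, scale) rexp(m, 1 / scale) - rexp(m, 1 / scale)

dp_one_step_exp <- function(x, eps, B = 10) {
  n <- length(x)
  # eps-DP estimate: clamp to [0, B], so the mean has l1-sensitivity B/n.
  dp_mean <- mean(pmin(x, B)) + rlaplace(1, scale = B / (n * eps))
  theta_dp <- 1 / dp_mean                              # privatized rate estimate
  omega <- runif(n)                                    # fixed seeds
  z <- qexp(omega, rate = theta_dp)                    # sample from privatized fit
  theta_new <- max(2 * theta_dp - 1 / mean(z), 1e-8)   # one step, non-private MLE
  qexp(omega, rate = theta_new)                        # eps-DP output, by post-processing
}

set.seed(2)
y <- dp_one_step_exp(rexp(5000, rate = 2), eps = 1)
1 / mean(y)  # close to the privatized estimate of the true rate 2
```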
Remark 4.3. Besides Definition 4.1, there are many other variations of differential privacy, the majority of which are relaxations of Definition 4.1 which also allow for efficient estimators. For instance, approximate DP [Dwork et al., 2006a], concentrated DP [Dwork and Rothblum, 2016, Bun and Steinke, 2016], truncated-concentrated DP [Bun et al., 2018], and Rényi DP [Mironov, 2017] all allow for efficient estimators. On the other hand, local differential privacy [Kasiviswanathan et al., 2011, Duchi et al., 2013] in general does not permit efficient estimators and would not fit in our framework. For an axiomatic treatment of formal privacy, see Kifer and Lin [2012].

While there are some general methods of producing efficient DP parameter estimates, such as in Smith [2011], often these approaches do not perform well in practical sample sizes. We demonstrate our approach using a modification of the standard
Laplace mechanism. Given a statistic $T$, the Laplace mechanism adds independent Laplace noise to each entry of the statistic, with scale parameter proportional to the sensitivity of the statistic. Informally, the sensitivity of $T$ is the largest amount that $T$ changes, when one person's data is changed in the dataset.

Proposition 4.4 (Sensitivity and Laplace mechanism: Dwork et al. [2006b]). Let $\epsilon > 0$ be given, and let $T : \mathcal{X}^n \to \mathbb{R}^p$ be a statistic. The $\ell_1$-sensitivity of $T$ is $\Delta_n(T) = \sup \|T(x) - T(x')\|_1$, where the supremum is over all $x, x' \in \mathcal{X}^n$ such that $H(x, x') = 1$. Provided that $\Delta_n(T)$ is finite, releasing the vector $(T_j(x) + L_j)_{j=1}^p$ satisfies $\epsilon$-DP, where $L_1, \ldots, L_p \overset{\text{i.i.d.}}{\sim} \text{Laplace}(\Delta_n(T)/\epsilon)$.

Often, to ensure finite sensitivity, the data are clamped to artificial bounds $[a, b]$, introducing bias in the DP estimate. Typically, these bounds are fixed in $n$, resulting in asymptotically negligible Laplace noise, but $O_p(1)$ bias. In Example 4.5, we show that for the beta distribution, it is possible to increase the bounds in $n$ to produce both noise and bias of order $o_p(n^{-1/2})$, resulting in an efficient DP estimator. Furthermore, we show through simulations that using this estimator in Algorithm 1 results in a DP sample with optimal asymptotics. While we work with the beta distribution, this approach may be of value for other exponential family distributions as well.

Example 4.5 (Beta distribution with DP). We assume that $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \text{Beta}(\alpha, \beta)$, where $\alpha, \beta \geq 1$. The complete sufficient statistics for the beta distribution are $n^{-1} \sum_{i=1}^n \log(X_i)$ and $n^{-1} \sum_{i=1}^n \log(1 - X_i)$. We will add Laplace noise to each of these statistics to achieve differential privacy. However, the sensitivity of these quantities is unbounded. First we will pre-process the data by setting $\widetilde{X}_i = \min\{\max(X_i, t), 1 - t\}$, where $t$ is a threshold that will depend on $n$. Then the $\ell_1$-sensitivity of the pair of sufficient statistics is $\Delta(t) = 2 n^{-1} |\log(t) - \log(1 - t)|$. We add independent noise to each of the statistics from the distribution Laplace$(\Delta(t)/\epsilon)$, which results in $\epsilon$-DP versions of these statistics. Finally, we estimate $\theta = (\alpha, \beta)$ by plugging the privatized sufficient statistics into the log-likelihood function and maximizing with respect to $\theta$. The resulting parameter estimate satisfies $\epsilon$-DP by post-processing.

We must carefully choose the threshold $t$ to ensure that the resulting estimate is efficient. The choice of $t$ must satisfy $\Delta(t) = o(n^{-1/2})$ to ensure that the noise does not affect the asymptotics of the likelihood function. We also require that both $P(X_i < t) = o(n^{-1/2})$ and $P(X_i > 1 - t) = o(n^{-1/2})$ to ensure that $\widetilde{X}_i = X_i + o_p(n^{-1/2})$, which limits the bias to $o_p(n^{-1/2})$. For the beta distribution, we can calculate that $P(X_i < t) = O(t^\alpha)$ and $P(X_i > 1 - t) = O(t^\beta)$. Since we assume that $\alpha, \beta \geq 1$, so long as $t = o(n^{-1/2})$ the probability bounds will hold. Taking $t = \min\{1/2,\ 10/(\log(n)\sqrt{n})\}$ satisfies $t = o(n^{-1/2})$, and we estimate the sensitivity as $\Delta(t) \leq 2 n^{-1} \log(t^{-1}) \leq 2 n^{-1} \log(\log(n)\sqrt{n}) = O(\log(n)/n) = o(n^{-1/2})$, which satisfies our requirement for $\Delta$.
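A hedged R sketch of this construction, with our own helper names: clamp to $[t, 1-t]$, privatize the two sufficient statistics with Laplace noise scaled to $\Delta(t)$, and maximize the privatized log-likelihood, which depends on the data only through these two averages. The Beta$(5, 3)$ used in the usage line is an arbitrary choice for illustration.

```r
# Sketch of the eps-DP beta estimator from Example 4.5.
rlaplace <- function(m, scale) rexp(m, 1 / scale) - rexp(m, 1 / scale)

dp_beta_mle <- function(x, eps) {
  n <- length(x)
  t <- min(1 / 2, 10 / (log(n) * sqrt(n)))        # threshold from Example 4.5
  xt <- pmin(pmax(x, t), 1 - t)                   # clamped data
  delta <- 2 / n * abs(log(t) - log(1 - t))       # l1-sensitivity of the pair
  s1 <- mean(log(xt)) + rlaplace(1, delta / eps)  # eps-DP sufficient statistics
  s2 <- mean(log(1 - xt)) + rlaplace(1, delta / eps)
  nll <- function(th) {  # privatized negative log-likelihood (per observation)
    -((th[1] - 1) * s1 + (th[2] - 1) * s2 - lbeta(th[1], th[2]))
  }
  optim(c(2, 2), nll, method = "L-BFGS-B", lower = c(1, 1))$par
}

set.seed(3)
theta_dp <- dp_beta_mle(rbeta(10000, 5, 3), eps = 1)  # DP estimate of (5, 3)
```

Feeding theta_dp into Algorithm 1 in place of $\hat\theta_X$ then yields the DP synthetic sample of Corollary 4.2.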
While there are many choices of $t$ which would satisfy the requirements, our threshold (including the constant 10) was chosen to optimize the finite-sample performance, so that the asymptotics could be demonstrated using smaller sample sizes.

For the simulation, we sample $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \text{Beta}(\alpha, \beta)$ with $\alpha = 5$, for $n \in \{100, 1000, 10000, 100000\}$. We estimate $\hat\theta_X$ with the MLE. Using $\epsilon = 1$, we privatize the sufficient statistics as described above, and obtain $\hat\theta_{\text{DP}}$ from the privatized log-likelihood function. We sample $Z_1, \ldots, Z_n \overset{\text{i.i.d.}}{\sim} f_{\hat\theta_{\text{DP}}}$ and estimate $\hat\theta_Z$ using maximum likelihood. We produce $(Y_i)_{i=1}^n$ from Algorithm 1 using $\hat\theta_{\text{DP}}$ in place of $\hat\theta_X$. In Figure 1b, we plot the average squared $\ell_2$ error between each estimate of $\theta$ and the true value. The errors are averaged over 200 replicates, and are plotted on the log-scale. We see that $\hat\theta_{\text{DP}}$ and $\hat\theta_Y$ have the same asymptotic performance as the MLE, whereas $\hat\theta_Z$ has inflated variance. See the discussion in Example 3.5 to understand this interpretation of the plot.

5 Discussion

In this paper, we proposed a simple method of producing synthetic data from a parametric model, which approximately preserves efficient statistics, ensuring optimal asymptotics. Our approach is widely applicable to parametric models, requiring standard regularity conditions, and is both easily implemented and highly computationally efficient.

A useful aspect of our approach is that it allows for both partially synthetic data, as well as differentially private fully synthetic data. While we investigated pure differential privacy, alternatives such as concentrated DP and approximate DP can also be used, with potentially higher finite-sample utility.

Another strength of our approach is that it only requires the ability to estimate parameters and sample from the model. This is great for usability, as many practitioners can easily implement Algorithm 1, but may not have the expertise to implement a customized MCMC procedure.

We saw in Example 3.4 that in the case of the Burr distribution, the Kolmogorov-Smirnov test cannot distinguish between our synthetic sample and the true distribution which generated the original sample. This suggests that the marginal distribution of the output of Algorithm 1 is very similar to that of the original sample. Future work should investigate ways of quantifying this observation.

While the focus of the paper is on asymptotics, we saw in our simulations that our approach offers high finite-sample utility as well. However, as our approach is a "one-step" procedure, using an iterated version could improve finite-sample utility. Such an iterated algorithm is included in the Supplementary Materials, including a parameter for momentum for improved convergence. We found in additional simulations that this iterated version improves the utility in small samples at an increased computational cost. Other finite-sample improvements to our approach are worth investigating.

References
Alan Agresti. Categorical Data Analysis, volume 482. John Wiley & Sons, 2003.
Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In TCC, 2016.
Mark Bun, Cynthia Dwork, Guy N Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74-86, 2018.
Jim Burridge. Information preserving statistical obfuscation. Statistics and Computing, 13(4):321-327, 2003.
Gregory Caiola and Jerome P Reiter. Random forests for generating partially synthetic, categorical data. Trans. Data Privacy, 3(1):27-42, 2010.
Yuguo Chen, Ian H Dinwoodie, Seth Sullivant, et al. Sequential importance sampling for multiway tables. The Annals of Statistics, 34(1):523-545, 2006.
Jörg Drechsler. Synthetic Datasets for Statistical Disclosure Control: Theory and Implementation, volume 201. Springer Science & Business Media, 2011.
Jörg Drechsler and Jerome P Reiter. Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In International Conference on Privacy in Statistical Databases, pages 227-238. Springer, 2008.
John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429-438. IEEE, 2013.
Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. CoRR, abs/1603.01887, 2016.
Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486-503. Springer, 2006a.
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265-284. Springer, 2006b.
Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211-407, 2014.
Rob Hall, Alessandro Rinaldo, and Larry Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(Feb):703-727, 2013.
Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pages 2339-2347, 2012.
Anco Hundepool, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, and Peter-Paul De Wolf. Statistical Disclosure Control. John Wiley & Sons, 2012.
James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. 2018.
Vishesh Karwa and Aleksandra Slavković. Conditional inference given partial information in contingency tables using Markov bases. Wiley Interdisciplinary Reviews: Computational Statistics, 5(3):207-218, 2013.
Vishesh Karwa and Aleksandra B Slavković. Differentially private graphical degree sequences and synthetic graphs. In International Conference on Privacy in Statistical Databases, pages 273-285. Springer, 2012.
Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793-826, 2011.
Daniel Kifer and Bing-Rong Lin. An axiomatic view of statistical privacy and utility. Journal of Privacy and Confidentiality, 4(1), 2012.
Julia Lane, Victoria Stodden, Stefan Bender, and Helen Nissenbaum. Privacy, Big Data, and the Public Good: Frameworks for Engagement. Cambridge University Press, 2014.
Erich Leo Lehmann. Elements of Large-Sample Theory. Springer Science & Business Media, 2004.
Bai Li, Vishesh Karwa, Aleksandra Slavković, and Rebecca Carter Steorts. A privacy preserving algorithm to release sparse high-dimensional histograms. Journal of Privacy and Confidentiality, 8(1), 2018.
Chong K Liew, Uinam J Choi, and Chung J Liew. A data distortion by probability distribution. ACM Transactions on Database Systems (TODS), 10(3):395-411, 1985.
Fang Liu. Model-based differentially private data synthesis. arXiv preprint arXiv:1606.08052, 2016.
Ashwin Machanavajjhala, Daniel Kifer, John Abowd, Johannes Gehrke, and Lars Vilhuber. Privacy: Theory meets practice on the map. In 2008 IEEE 24th International Conference on Data Engineering, pages 277-286. IEEE, 2008.
Josep Maria Mateo-Sanz, Antoni Martínez-Ballesté, and Josep Domingo-Ferrer. Fast generation of accurate synthetic microdata. In International Workshop on Privacy in Statistical Databases, pages 298-306. Springer, 2004.
David McClure and Jerome P Reiter. Differential privacy and statistical disclosure risk measures: An investigation with binary synthetic data. Trans. Data Privacy, 5(3):535-552, 2012.
James B McDonald. Some generalized functions for the size distribution of income. In Modeling Income Distributions and Lorenz Curves, pages 37-55. Springer, 2008.
Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263-275. IEEE, 2017.
Krishnamurty Muralidhar and Rathindra Sarathy. A theoretical basis for perturbation methods. Statistics and Computing, 13(4):329-335, 2003.
Trivellore E Raghunathan, Jerome P Reiter, and Donald B Rubin. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1):1, 2003.
Matthew Reimherr and Jordan Awan. KNG: The k-norm gradient mechanism. In Advances in Neural Information Processing Systems, pages 10208-10219, 2019.
Jerome P Reiter. Using CART to generate partially synthetic public use microdata. Journal of Official Statistics, 21(3):441, 2005.
Donald B Rubin. Statistical disclosure limitation. Journal of Official Statistics, 9(2):461-468, 1993.
R. J. Serfling. Approximation Theorems of Mathematical Statistics. New York: Wiley, 1980.
Aleksandra B Slavković and Juyoun Lee. Synthetic two-way contingency tables that preserve conditional frequencies. Statistical Methodology, 7(3):225-239, 2010.
Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, pages 813-822, 2011.
Daniel Ting, Stephen Fienberg, and Mario Trottini. ROMM methodology for microdata release. Monographs of Official Statistics, page 89, 2005.
Aleksei Triastcyn and Boi Faltings. Generating differentially private datasets using GANs. 2018.
Aad W Van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
Chugui Xu, Ju Ren, Deyu Zhang, Yaoxue Zhang, Zhan Qin, and Kui Ren. GANobfuscator: Mitigating information leakage under GAN via differential privacy. IEEE Transactions on Information Forensics and Security, 14(9):2358-2371, 2019.
Jun Zhang, Graham Cormode, Cecilia M Procopiuc, Divesh Srivastava, and Xiaokui Xiao. PrivBayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):1-41, 2017.
Supplementary Materials

Algorithm 2: Iterated One-Step with Momentum
INPUT: Seeds $\omega_1, \ldots, \omega_n \in \Omega$, measurable function $X_\theta : \Omega \to \mathcal{X}$ for all $\theta \in \Theta$, value $\hat\theta_X \in \Theta$, efficient estimator $\hat\theta(\cdot)$, momentum value $\rho \in [0, 1)$
1. Sample $(Z_1, \ldots, Z_n) \sim f^n_{\hat\theta_X}(\omega)$
2. Set $\theta^{(0)} = \hat\theta_X$, $\hat\theta^{(0)} = \hat\theta(Z)$, and $i = 0$
3. repeat
4.   Set $i = i + 1$
5.   Set $\theta^{(i)} = (1 - \rho)\left( \hat\theta_X - [\hat\theta^{(i-1)} - \theta^{(i-1)}] \right) + \rho\, \theta^{(i-1)}$
6.   Sample $(Y^{(i)}_1, \ldots, Y^{(i)}_n) \sim f^n_{\theta^{(i)}}(\omega)$
7.   Set $\hat\theta^{(i)} = \hat\theta\big( (Y^{(i)}_j)_{j=1}^n \big)$
8. until convergence
OUTPUT: Sample $Y_1, \ldots, Y_n \sim f^n_{\theta^{(i)}}(\omega)$.
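For concreteness, a minimal R sketch of Algorithm 2 for the $N(\mu, 1)$ model, again with the seed construction of Remark 3.3; setting $\rho = 0$ iterates the plain one-step update, and the function and argument names are illustrative.

```r
# Sketch of Algorithm 2 (iterated one-step with momentum) for N(mu, 1).
iterated_one_step <- function(x, rho = 0.5, max_iter = 50, tol = 1e-10) {
  theta_x <- mean(x)                  # efficient estimate on the real data
  omega <- runif(length(x))           # fixed seeds
  theta <- theta_x                    # theta^(0) = theta_hat_X
  for (i in seq_len(max_iter)) {
    y <- qnorm(omega, mean = theta)   # Y^(i) drawn from f_theta with fixed seeds
    theta_next <- (1 - rho) * (theta_x - (mean(y) - theta)) + rho * theta
    converged <- abs(theta_next - theta) < tol
    theta <- theta_next
    if (converged) break
  }
  qnorm(omega, mean = theta)          # final synthetic sample
}
```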
Parts 1 and 2 of Lemma 6.1 can be rephrased as the following: $\hat\theta$ is efficient if and only if it is consistent and $n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) = o_p(n^{-1/2})$. The third property of Lemma 6.1 is similar to many standard expansions used in asymptotics, for example in Van der Vaart [2000]. However, we require the expansion for arbitrary efficient estimators, and include a proof for completeness.

Lemma 6.1. Suppose $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} f_{\theta_0}$, and assume that (R1)-(R3) hold. Let $\hat\theta$ be an efficient estimator, which is a sequence of zeros of the score equations. Suppose that $\widetilde\theta$ is a $\sqrt{n}$-consistent estimator of $\theta_0$. Then

1. If $n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = o_p(n^{-1/2})$, then $\widetilde\theta - \hat\theta = o_p(n^{-1/2})$.
2. If $\widetilde\theta$ is efficient, then $n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = o_p(n^{-1/2})$.
3. If $\widetilde\theta$ is efficient, then $\widetilde\theta = \theta_0 + I^{-1}(\theta_0)\, n^{-1} \sum_{i=1}^n S(\theta_0, X_i) + o_p(n^{-1/2})$.

Proof. As $\widetilde\theta$ and $\hat\theta$ are both $\sqrt{n}$-consistent, we know that $\widetilde\theta - \hat\theta = O_p(n^{-1/2})$. So, we may consider a Taylor expansion of the score function about $\widetilde\theta = \hat\theta$:

$$n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) + \left( \frac{d}{d\hat\theta}\, n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) \right) (\widetilde\theta - \hat\theta) + O_p(n^{-1}) = 0 + \left[ \frac{d}{d\hat\theta}\, n^{-1} \sum_{i=1}^n S(\hat\theta, X_i) + O_p(n^{-1/2}) \right] (\widetilde\theta - \hat\theta) = \left[ -I(\theta_0) + o_p(1) \right] (\widetilde\theta - \hat\theta), \quad (2)$$

where we used assumptions (R1)-(R3) to justify that 1) the second derivative is bounded in a neighborhood about $\theta_0$ (as both $\hat\theta$ and $\widetilde\theta$ converge to $\theta_0$), 2) the derivative of the score converges to $-I(\theta_0)$ by Lehmann [2004, Theorem 7.2.1] along with the law of large numbers, and 3) that $I(\theta_0)$ is finite, by (R3).

To establish property 1, note that the left-hand side of Equation (2) is $o_p(n^{-1/2})$, implying that $(\widetilde\theta - \hat\theta) = o_p(n^{-1/2})$. Recall that by Lehmann [2004, Page 479], if $\widetilde\theta$ and $\hat\theta$ are both efficient, then $(\widetilde\theta - \hat\theta) = o_p(n^{-1/2})$. Plugging this into the right-hand side of Equation (2) gives $n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = o_p(n^{-1/2})$, establishing property 2.

For property 3, we consider a slightly different expansion:

$$o_p(n^{-1/2}) = n^{-1} \sum_{i=1}^n S(\widetilde\theta, X_i) = n^{-1} \sum_{i=1}^n S(\theta_0, X_i) + \frac{d}{d\theta_0}\, n^{-1} \sum_{i=1}^n S(\theta_0, X_i)\, (\widetilde\theta - \theta_0) + O_p(n^{-1}) = n^{-1} \sum_{i=1}^n S(\theta_0, X_i) + \left( -I(\theta_0) + o_p(1) \right) (\widetilde\theta - \theta_0) + O_p(n^{-1}),$$

where we used property 2 for the first equality, expanded the score about $\widetilde\theta = \theta_0$ for the second, and justify the $O_p(n^{-1})$ by (R2). By (R1)-(R2) and the law of large numbers along with Lehmann [2004, Theorem 7.2.1], we have the convergence of the derivative of the score to $-I(\theta_0)$. By (R3), $I(\theta_0)$ is invertible. Solving the equation for $\widetilde\theta$ gives the desired result.
Lemma 6.2. Assume that (R0)-(R3) hold, and let $\omega_1, \ldots, \omega_n \overset{\text{i.i.d.}}{\sim} P$. Then
$$n^{-1} \sum_{i=1}^n \frac{d}{d\theta} S(\theta, X_\theta(\omega_i)) = o_p(1).$$
Proof. First we can express the derivative as
$$n^{-1} \sum_{i=1}^n \frac{d}{d\theta} S(\theta, X_\theta(\omega_i)) = n^{-1} \sum_{i=1}^n \left( \frac{d}{d\alpha} S(\alpha, X_\theta(\omega_i)) + \frac{d}{d\alpha} S(\theta, X_\alpha(\omega_i)) \right) \Big|_{\alpha = \theta}.$$
The result follows from the law of large numbers, provided that
$$E_{\omega \sim P} \left( \frac{d}{d\alpha} S(\alpha, X_\theta(\omega)) + \frac{d}{d\alpha} S(\theta, X_\alpha(\omega)) \right) \Big|_{\alpha = \theta} = 0.$$
The expectation of the first term is $-I(\theta)$, by Lehmann [2004, Theorem 7.2.1]. For the second term, we compute
$$E_{\omega \sim P}\, \frac{d}{d\alpha} S(\theta, X_\alpha(\omega)) \Big|_{\alpha=\theta} = \int_\Omega \frac{d}{d\alpha} S(\theta, X_\alpha(\omega)) \Big|_{\alpha=\theta} \, dP(\omega) = \int_{\mathcal{X}} \frac{d}{d\alpha}\, S(\theta, x) f_\alpha(x) \Big|_{\alpha=\theta} \, d\mu(x) = \int_{\mathcal{X}} S(\theta, x) \left( \frac{d}{d\alpha} f_\alpha(x) \Big|_{\alpha=\theta} \right)^\top d\mu(x) = \int_{\mathcal{X}} S(\theta, x) \left( \frac{\frac{d}{d\theta} f_\theta(x)}{f_\theta(x)} \right)^\top f_\theta(x) \, d\mu(x) = \int_{\mathcal{X}} S(\theta, x) S^\top(\theta, x) f_\theta(x) \, d\mu(x) = E_{X \sim f_\theta} \left[ S(\theta, X) S^\top(\theta, X) \right] = I(\theta).$$
Proof of Theorem 3.2. We expand $\hat\theta_Z$ about $\hat\theta_X$:
$$\hat\theta_Z = \hat\theta_X + I^{-1}(\hat\theta_X)\, n^{-1} \sum_{i=1}^n S(\hat\theta_X, X_{\hat\theta_X}(\omega_i)) + o_p(n^{-1/2}) \quad (3)$$
$$= \hat\theta_X + I^{-1}(\theta_0)\, n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2}), \quad (4)$$
where (3) is a standard expansion of efficient estimators by Lemma 6.1; for (4), the continuous mapping theorem justifies that $I^{-1}(\hat\theta_X) = I^{-1}(\theta_0) + o_p(1)$, we use that $n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) = O_p(n^{-1/2})$, and the score can be expanded about $\hat\theta_X = \theta_0$:
$$n^{-1} \sum_{i=1}^n S(\hat\theta_X, X_{\hat\theta_X}(\omega_i)) = n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + \left( \frac{d}{d\theta^*}\, n^{-1} \sum_{i=1}^n S(\theta^*, X_{\theta^*}(\omega_i)) \right) (\hat\theta_X - \theta_0) = n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(1)\, O_p(n^{-1/2}),$$
where $\theta^*$ is between $\hat\theta_X$ and $\theta_0$; by Lemma 6.2, we justify that the derivative is $o_p(1)$.

Using the same techniques, we do an expansion for $\hat\theta_Y$ about $\theta_{\text{new}} = 2\hat\theta_X - \hat\theta_Z$:
$$\hat\theta_Y = \theta_{\text{new}} + I^{-1}(\theta_{\text{new}})\, n^{-1} \sum_{i=1}^n S(\theta_{\text{new}}, X_{\theta_{\text{new}}}(\omega_i)) + o_p(n^{-1/2}) \quad (5)$$
$$= \theta_{\text{new}} + I^{-1}(\theta_0)\, n^{-1} \sum_{i=1}^n S(\theta_0, X_{\theta_0}(\omega_i)) + o_p(n^{-1/2}) \quad (6)$$
$$= \theta_{\text{new}} + [\hat\theta_Z - \hat\theta_X] + o_p(n^{-1/2}) \quad (7)$$
$$= \hat\theta_X + o_p(n^{-1/2}), \quad (8)$$
where line (6) is a similar expansion as used for equation (3), in line (7) we substituted the expression from (4), and line (8) uses the fact that as $n \to \infty$, $\theta_{\text{new}} = 2\hat\theta_X - \hat\theta_Z$ with probability tending to one. Indeed, since $2\hat\theta_X - \hat\theta_Z$ is a consistent estimator of $\theta_0$, we have that as $n \to \infty$,
$$P(2\hat\theta_X - \hat\theta_Z \in \Theta) \geq P(2\hat\theta_X - \hat\theta_Z \in B(\theta_0)) \to 1.$$