Private Tabular Survey Data Products through Synthetic Microdata Generation
Jingchen Hu∗, Terrance D. Savitsky†, and Matthew R. Williams‡

January 18, 2021
Abstract
We propose three synthetic microdata approaches to generate private tabular survey data products for public release. We adapt a disclosure risk-based weighted pseudo posterior mechanism to survey data with a focus on producing tabular products under a formal privacy guarantee. Two of our approaches synthesize the observed sample distribution of the outcome and survey weights, jointly, such that both quantities together possess a probabilistic differential privacy guarantee. The privacy-protected outcome and sampling weights are used to construct tabular cell estimates and associated standard errors to correct for survey sampling bias. The third approach synthesizes the population distribution from the observed sample under a pseudo posterior construction that treats survey sampling weights as fixed to correct the sample likelihood to approximate that for the population. Each by-record sampling weight in the pseudo posterior is, in turn, multiplied by the associated privacy, risk-based weight for that record to create a composite pseudo posterior mechanism that both corrects for survey bias and provides a privacy guarantee for the observed sample. Through a simulation study and a real data application to the Survey of Doctorate Recipients public use file, we demonstrate that our three microdata synthesis approaches to construct tabular products provide superior utility preservation as compared to the additive-noise approach of the Laplace Mechanism. Moreover, all our approaches allow the release of microdata to the public, enabling additional analyses at no extra privacy cost.

Keywords: differential privacy, pseudo posterior, sampling weights, synthetic data, tabular survey data, weight smoothing

∗Vassar College, Box 27, 124 Raymond Ave, Poughkeepsie, NY 12604, [email protected]
†U.S. Bureau of Labor Statistics, Office of Survey Methods Research, Suite 5930, 2 Massachusetts Ave NE, Washington, DC 20212, [email protected]
‡National Center for Science and Engineering Statistics, National Science Foundation, 2415 Eisenhower Avenue, Alexandria, VA 22314, [email protected]
1 Introduction

Survey data are collected by government statistical agencies from individuals, households, and business establishments to support research and policy making. For example, the Current Employment Statistics (CES) survey administered by the U.S. Bureau of Labor Statistics conducts a survey of business establishments for the purpose of measuring employment totals for metropolitan statistical areas, states, and for the country as a whole. The CES is used to compose unemployment statistics and to support state and federal policy formulation.

Survey sampling designs typically employ unequal probabilities for the selection of respondents from the population in order to over-sample important sub-populations or to improve the efficiency of a domain (e.g., state-by-industry) estimator (e.g., of employment totals). Correlations are also induced in the sampling designs through the sampling of geographic clusters of correlated respondents, which is done for convenience and cost. As a result of unequal inclusion probabilities and dependence induced by the survey sampling design, the distributions of variables of interest (e.g., total employment) are expected to be different in the observed sample than in the underlying population. Therefore, models and statistics estimated on the observed sample without correction will be biased.

At the same time, many government statistical agencies are under legal obligation (such as Title 13 in the U.S.) to protect the privacy and confidentiality of survey participants. Agencies utilize statistical disclosure control procedures before releasing any statistics derived from survey responses to the public. More recent disclosure limitation methods include the addition of noise to statistics to perturb their values (Dwork et al., 2006), on the one hand, and the release of synthetic data generated from a model that encodes smoothing to replace the closely-held data (Little, 1993; Rubin, 1993), on the other hand.
Both classes of methods induce distortion into statistics or data targeted for release to encode privacy protection. By contrast, survey sampling weights (constructed to be inversely proportional to respondent inclusion probabilities) are used to correct statistics estimated in the sample to the population. The goal is to reduce distortion or bias. Such is the case with use of the Horvitz-Thompson survey expansion estimator for producing population statistics from an observed sample. Therefore, there is a tension between using survey sampling weights to correct the distortion in the observed sample statistics for the population, on the one hand, and the injection of distortion into data statistics to induce privacy protection, on the other hand.

There is a strong connection between smoothness and disclosure risk: the more local or less smooth the closely-held data distribution, the higher the identification disclosure risks for survey respondents. The distribution of the sampling weights is typically highly skewed with extremely large values. In the CES, relatively high inclusion probabilities are assigned to large business establishments to reduce the variance of the domain estimator for total employment. As a result, relatively small business establishments are assigned low inclusion probabilities, which produces high magnitude sampling weight values. This skewed distribution for sampling weights can inadvertently accentuate the disclosure risk for a relatively isolated participant by assigning them a large sampling weight. Employing the sampling weights to correct estimates of statistics computed on the observed sample back to the population adds peakedness or roughness to those statistics that, in turn, must be smoothed in a disclosure risk-limiting procedure.

This paper focuses on the adaptation of statistical disclosure control procedures for the production of synthetic data from survey data for estimation of tabular statistics for the population.
In the sequel, we develop three alternatives that both correct the survey data distribution to the target population of interest, while simultaneously inducing distortion in the generated synthetic data to reduce the identification disclosure risks for survey respondents.

Our focus metric for measuring the relative privacy guarantees of our additive noise and synthetic data measures is differential privacy (Dwork et al., 2006). We next provide a formal definition of differential privacy.
Definition 1 (Differential Privacy)
Let $D \in \mathbb{R}^{n \times k}$ be a database in input space $\mathcal{D}$. Let $M$ be a randomized mechanism such that $M(\cdot): \mathbb{R}^{n \times k} \rightarrow O$. Then $M$ is $\epsilon$-differentially private if
$$\frac{\Pr[M(D) \in O]}{\Pr[M(D') \in O]} \leq \exp(\epsilon),$$
for all possible outputs $O \subseteq \mbox{Range}(M)$ under all possible pairs of datasets $D, D' \in \mathcal{D}$ of the same size which differ by only one row (Hamming-1 distance).
Differential privacy assigns a disclosure risk to a statistic to be released to the public, $f(D)$ (e.g., total employment for a state-industry), of any $D \in \mathcal{D}$ based on the global sensitivity,
$$\Delta_G = \sup_{D, D' \in \mathcal{D}:\, \delta(D, D') = 1} \left| f(D) - f(D') \right|,$$
over the space of databases, $\mathcal{D}$, where $\delta(D, D') = 1$ denotes the Hamming-1 distance in which $D$ differs from $D'$ by a single record. If the value of the statistic, $f$, expresses a high magnitude change after the change of a data record, then the mechanism will be required to induce a relatively higher level of distortion to $f$. The more sensitive a statistic is to the change of a record, the higher its disclosure risk.

Under additive noise processes, such as the Laplace Mechanism (Dwork et al., 2006), the scale of the noise added to produce a differential privacy guarantee of $\epsilon$ is proportional to $\Delta_G/\epsilon$, where the larger the sensitivity, $\Delta_G$, the higher the required scale for the addition of Laplace-distributed noise to $f$. The differential privacy guarantee, $\epsilon$, is a property of the randomized mechanism $M$, not the actual released data $M(D)$.

An alternative to the addition of noise to statistics is the generation of synthetic data to replace the closely-held data for release to the public. Synthetic data are produced by estimating a model on the closely-held data, followed by generating replicate data from the model posterior predictive distribution after integrating over the model parameters (Little, 1993; Rubin, 1993). A major advantage of synthetic data mechanisms, in contrast with additive noise mechanisms, is that they do not need to account for interactive query data releases that spend a privacy budget for each release mechanism computed on the closely-held data.
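As a concrete illustration of the additive-noise baseline (our own sketch, not the authors' implementation), the Laplace Mechanism releases $f(D)$ plus noise of scale $\Delta_G/\epsilon$:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace noise of scale Delta_G / epsilon."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon  # larger sensitivity or smaller epsilon -> more noise
    u = rng.random() - 0.5         # Uniform(-0.5, 0.5)
    # Inverse-CDF draw from Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

# An unweighted count query has global sensitivity 1.
noisy_count = laplace_mechanism(1250, sensitivity=1.0, epsilon=0.5)
```

The draw is post-processing-free: any function of the released value incurs no further privacy cost.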
The synthetic data release comes with a guarantee and those data may be used for any purpose, including producing unlimited tables with cells at any level of granularity. Thus there is no subsequent privacy "accounting" required after the initial creation of the synthetic data.

Dimitrakakis et al. (2017) employ the Exponential Mechanism of McSherry and Talwar (2007) for generating synthetic data by selecting the model log-likelihood as the utility function, which produces the posterior distribution, $\xi(\theta \mid \mathbf{X})$, as the random mechanism, $M(\mathbf{X}, \theta)$. They demonstrate a connection between the model-indexed sensitivity,
$$\sup_{\mathbf{x}, \mathbf{y} \in \mathcal{X}^n:\, \delta(\mathbf{x}, \mathbf{y}) = 1}\ \sup_{\theta \in \Theta} \left| f_\theta(\mathbf{x}) - f_\theta(\mathbf{y}) \right| \leq \Delta,$$
and $\epsilon \leq 2\Delta$, where $f_\theta(\mathbf{x})$ is the model log-likelihood and $\Delta$ denotes a Lipschitz bound. The guarantee applies to all databases $\mathbf{x}$ in the space of databases of size $n$, $\mathcal{X}^n$. The posterior mechanism suffers from a non-finite $\Delta = \infty$ for most Bayesian probability models used in practice. Savitsky et al. (2020) extend Dimitrakakis et al. (2017) by employing a vector of record-indexed likelihood weights, $\alpha_i, i = 1, \ldots, n$ (where $n$ denotes the number of records). The $\alpha_i \in [0, 1]$ serve to downweight the likelihood contributions, with each $\alpha_i \propto 1/\sup_{\theta \in \Theta} \lvert f_\theta(x_i) \rvert$, such that highly risky records are more strongly downweighted. The differential downweighting of each record intends to better preserve utility by focusing the downweighting on high risk records. The method sets $\alpha_i = 0$ for any record with an infinite log-likelihood, which ensures a finite
$$\Delta_\alpha = \sup_{\mathbf{x}, \mathbf{y} \in \mathcal{X}^n:\, \delta(\mathbf{x}, \mathbf{y}) = 1}\ \sup_{\theta \in \Theta} \left| \alpha(\mathbf{x}) \times f_\theta(\mathbf{x}) - \alpha(\mathbf{y}) \times f_\theta(\mathbf{y}) \right| < \infty.$$
We see that $\Delta_\alpha \leq \Delta$ since $\alpha_i \leq 1$.

The differential privacy guarantee may be relaxed to hold except with some probability $\delta > 0$, called $(\epsilon, \delta)$-differential privacy. We might also estimate "local" versions of $\epsilon_{\mathbf{x}}$ based on an observed database, $\mathbf{x}$, and then account for additional information "leakage" over all databases, $\mathbf{x}' \in \mathcal{X}^n$ (Nissim et al., 2007). Differential privacy might be replaced with a weaker version that is met with some probability $\delta \in [0, 1]$, called $(\epsilon, \delta)$-probabilistic differential privacy (Machanavajjhala et al., 2008). Here, $\delta \in [0, 1]$ denotes the probability that there is some database in $\mathcal{X}^n$ that exceeds the global $\epsilon$ differential privacy guarantee. This relaxed version of probabilistic differential privacy is met by the weighted (pseudo) posterior mechanism of Savitsky et al. (2020), reviewed above, where the authors demonstrate that $\delta \downarrow 0$ as the sample size, $n$, increases, which means that the local privacy guarantee, $\epsilon_{\mathbf{x}}$, contracts onto the global privacy guarantee, $\epsilon$.

Suppose a sample $S$ of $n$ individuals is taken from a population $U$ of size $N$. The sample is taken under a design distribution that assigns indicators $\omega_i \in \{0, 1\}$ to each individual in $U$ with probability of selection $P(\omega_i = 1 \mid A) = \pi_i$. Often the selection probabilities $\pi_i$ are related to the response of interest $y_i$ (given some population information $A$). The balance of information in the observed sample $y_i, i \in S$ is different from the balance of information in the population $y_\ell, \ell \in U$. We call such a sampling design informative. To account for this imbalance, survey weights $w_i = 1/\pi_i$ are used to create estimators on the observed sample that reduce bias; for example, an unbiased and consistent estimate of the population mean is $\hat{\mu} = \sum_S w_i y_i / \sum_S w_i$. When the estimation focus moves beyond simple statistics, consistent estimation for more general models can be based on the exponentiated pseudo likelihood: $L_w(\theta) = \prod_S p(y_i \mid \theta)^{w_i}$. Use of this pseudo likelihood provides for consistent estimation of $\theta$ for broad classes of both population models (Savitsky and Toth, 2016) and complex survey sampling designs (Williams and Savitsky, 2020a).

The use of survey weights increases the influence of individual observations, thus increasing the sensitivity of the output mechanism. For example, a weighted count would have a sensitivity of $\max_i w_i$ instead of 1.
Thus direct usage of additive noise and perturbation can lead to a large amount of noise at the expense of utility.

For estimation, survey weights mitigate the estimation bias, but the uncertainty distribution (covariance structure) also needs to be estimated and adjusted (Williams and Savitsky, 2020b). The typical assumption for variance estimation (of the same models) is to assume an arbitrary amount of within-cluster dependence both in the sampling design and the population generating model (Heeringa et al., 2010; Rao et al., 1992). The de facto approach for variance estimation is based on the approximate sampling independence of the primary sampling units (Heeringa et al., 2010). Variance estimation can be in the form of Taylor linearization or replication based methods, with a variety of implementations available for each (Binder, 1996; Rao et al., 1992). Williams and Savitsky (2020b) propose a hybrid approach made possible by recent advances in algorithmic differentiation (Margossian, 2018). Each of these methods re-uses the data and thus requires an additional $\epsilon$ privacy expenditure. Trade-offs between efficient estimation of variance (more clusters or more replicates for reduced variability of variance estimates) and conserving an $\epsilon$ budget (aggregating clusters and fewer replicates for larger variability of variance estimates) are an open challenge.

1.4 Main Approach

In this work, we aim at producing tabular data products with privacy guarantee $\epsilon$, conditioned on the local survey sample database. Specifically, we propose to synthesize microdata under the pseudo posterior mechanism (Savitsky et al., 2020), coupled with survey weights. We extend Savitsky et al. (2020) for the generation of synthetic data for survey data. We may view the focus on tabular data as one type of data utility for measuring the value of our adapted pseudo posterior differentially-private mechanism.
We note that, once generated, synthetic data may be used for many purposes, each with distinct measures of utility. Our focus on tabular statistics owes to their being the main data product released by government statistical agencies. For simplicity we assume the survey design is a single stage sample with unequal probabilities of selection. Our motivating example will be the Survey of Doctorate Recipients (SDR), which provides demographic, education, and career history information from individuals with a U.S. research doctoral degree in a science, engineering, or health (SEH) field. The SDR is sponsored by the National Center for Science and Engineering Statistics and by the National Institutes of Health.

We consider a local survey database $\mathbf{y}_n = (y_1, \cdots, y_n)$, design information in variables $\mathbf{X}_n = (\mathbf{x}_1, \cdots, \mathbf{x}_n)$, and associated survey weights $\mathbf{w}_n = (w_1, \cdots, w_n)$. We consider a univariate continuous outcome variable $y_i$, such as salary, and a categorical design information vector $\mathbf{x}_i$, such as field of study and gender, where fields of study are strata. Our goals are to create private tables of field counts (cell, marginal, and total) and average salary (cell, marginal, and total) from synthesized microdata of these variables, with $\epsilon$ privacy guarantee.

The basic setup of our proposed survey microdata synthesizers uses privacy protection weights $\boldsymbol{\alpha}$ under the pseudo posterior mechanism (Savitsky et al., 2020). The pseudo posterior mechanism estimates an $\alpha$-weighted pseudo posterior distribution,
$$\xi^{\alpha(\mathbf{y}_n)}(\theta \mid \mathbf{y}_n, \mathbf{X}_n) \propto \prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \theta)^{\alpha_i} \times \xi(\theta), \quad (1)$$
where $\xi(\theta)$ is the prior distribution and $p(\cdot)$ denotes the likelihood function, with corresponding utility $\sum_{i=1}^{n} \alpha_i u(y_i, \mathbf{x}_i, \theta) = \log\left(\prod_{i=1}^{n} p(y_i \mid \mathbf{x}_i, \theta)^{\alpha_i}\right)$. Define $\Delta_\alpha$ as the $\alpha$-weighted Lipschitz bound over the space of databases and the space of parameters.
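A minimal sketch of how record-level risk weights of this general form might be computed from per-record worst-case log-likelihood magnitudes (our own illustration; the authors' exact procedure is Algorithm 1 of Hu et al. (2020), and the input bounds here are assumed given):

```python
import math

def risk_weights(sup_abs_loglik):
    """Compute alpha_i in [0, 1] roughly proportional to 1 / sup_theta |f_theta(x_i)|.

    sup_abs_loglik[i] is a worst-case absolute log-likelihood bound for record i;
    larger values mean higher disclosure risk and therefore smaller alpha_i.
    Records with an infinite bound receive alpha_i = 0 (fully suppressed),
    which is what guarantees a finite Lipschitz bound Delta_alpha.
    """
    finite = [v for v in sup_abs_loglik if math.isfinite(v)]
    floor = min(finite)  # least risky record anchors alpha = 1
    weights = []
    for v in sup_abs_loglik:
        if not math.isfinite(v):
            weights.append(0.0)
        else:
            weights.append(min(1.0, floor / v))
    return weights

# Record 2 is twice as "risky" as record 1; record 4 is unbounded.
alphas = risk_weights([2.0, 4.0, 2.0, math.inf])  # -> [1.0, 0.5, 1.0, 0.0]
```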
This mechanism indirectly sets the local privacy guarantee, $(\epsilon_{\mathbf{y}} = 2\Delta_\alpha)$, through the computation of the likelihood weights $\boldsymbol{\alpha}$. The details of the steps for calculating each $\alpha_i \in [0, 1]$ based on the local database $(\mathbf{y}_n, \mathbf{X}_n)$ are laid out in Algorithm 1 in Hu et al. (2020). We note that the previous work and the presentation above focus on databases that are either complete or simple random samples without survey weights.

The pseudo posterior mechanism allows us to take a synthesizer model with good utility and modify it to become differentially private. For survey data, we could synthesize the original population from the observed data using the survey sampling weights released with the observed sample. Alternatively, we could synthesize the sample distribution together with the survey weights to extrapolate back to the population. The former method uses the sampling weights, $\mathbf{w}_n$, to synthesize $(\mathbf{y}_n, \mathbf{X}_n)$ under the population distribution and then discards the weights. The latter method synthesizes both the data and weights under the distribution of the observed sample and then uses the synthesized weights to correct the data back to the population distribution. If we are able to assume that covariate information is public, we can perform partial synthesis (Little, 1993). If we must protect covariate information, then a full synthesis is required (Rubin, 1993).
To explore these trade-offs, we propose three different survey microdata synthesizers. Each microdata synthesizer uses survey weights, where $\mathbf{y}_n$ denotes the outcome variable, $\mathbf{X}_n$ denotes the design variables, and $\mathbf{w}_n$ denotes the sampling weights of the local database: (i) the fully Bayes model for the observed sample (FBS) models $(\mathbf{y}_n, \mathbf{w}_n)$ with a bivariate normal model using privacy protection weights $\boldsymbol{\alpha}$, and $\mathbf{X}_n$ are used as predictors and un-synthesized; (ii) the fully Bayes model for the population (FBP) forms the exact likelihood for $(\mathbf{y}_n, \mathbf{w}_n)$ in the observed sample to model $\mathbf{y}_n$ and corrects for population bias (Leon-Novelo and Savitsky, 2019), where privacy protection weights $\boldsymbol{\alpha}$ are used and $\mathbf{X}_n$ are used as predictors and un-synthesized; and (iii) the pseudo Bayes model for the population (PBP) jointly models $(\mathbf{y}_n, \mathbf{X}_n)$ estimated on the sample with combined weights $\tilde{w}_i = w_i \times \alpha_i$ used to exponentiate each likelihood contribution, and $\mathbf{X}_n$ are modeled by a multinomial distribution and synthesized. The $w_i$ in PBP serves to correct the likelihood on the observed sample back to the population as an approximation, while the $\alpha_i \in [0, 1]$ induce the privacy guarantee. Each synthesizer achieves a local privacy guarantee of $(\epsilon_{\mathbf{y}} = 2\Delta_{\alpha, (\mathbf{y}_n, \mathbf{X}_n, \mathbf{w}_n)} \times m)$, where $m \geq 1$ denotes the number of synthetic datasets used to construct tabular point estimates and standard error estimates.

For comparison, we add noise proportional to the sensitivity local to the database from the Laplace Mechanism (see Appendix A for a review of the Laplace Mechanism). We use the same level of privacy guarantee $(\epsilon_{\mathbf{y}} = 2\Delta_{\alpha, (\mathbf{y}_n, \mathbf{X}_n, \mathbf{w}_n)} \times m)$ as in the microdata synthesis approaches. With the same level of privacy guarantee, we compare their utility performances in the created survey tables of counts and average outcome by design variables. This privacy guarantee is "local" in that it applies to the single observed dataset rather than the collection of all possible datasets of the same type (i.e., global).
These local results are useful because they allow for a lower bound on the privacy loss and an upper bound on the utility, and can be shown to be related to the global bounds through careful adjustments and arguments (Nissim et al., 2007; Savitsky et al., 2020). For the pseudo posterior mechanism, Savitsky et al. (2020) demonstrate that it provides an $(\epsilon, \delta)$-probabilistic differential privacy guarantee where $\delta$, the probability that there exists a database exceeding the global privacy guarantee, $\epsilon$, goes rapidly to 0 as the sample size, $n$, increases. In addition to providing a theoretical guarantee, the authors use a Monte Carlo simulation study to show that $\delta$ contracts onto 0 for relatively small sample sizes such that the local privacy guarantee, $\epsilon_{\mathbf{y}}$, for database, $\mathbf{y}$, becomes arbitrarily close to the global differential privacy guarantee, $\epsilon$.

The remainder of the paper is organized as follows. In Section 2, we lay out the details of each of our three proposed microdata synthesis approaches one-by-one, from Section 2.1 to Section 2.3. Furthermore, we discuss and compare the three approaches in Section 2.4, and describe in detail how to create survey tables from synthetic microdata in each approach. Section 2.5 describes details of how to construct the sensitivity of the Laplace Mechanism that we use as an additive-noise alternative to our synthetic data generation smoothing alternatives for the creation of tabular data statistics. We present an extensive simulation study in Section 3, where we focus on utility comparison between the Laplace Mechanism and our three microdata synthesis approaches. Having determined that FBS is the preferred microdata synthesis approach, we apply it to the public use file of the Survey of Doctorate Recipients in Section 4, where we compare the utility performances of the Laplace Mechanism and FBS. The paper ends with concluding remarks in Section 5.
A review of the alternative mechanisms compared in our study is included in Appendix A.

2 Methods for Differentially Private Tabular Survey Data
In this section, we describe the fully Bayes model for the observed sample in Section 2.1, the fully Bayes model for the population in Section 2.2, and the pseudo Bayes model for the population in Section 2.3. We discuss and compare these three approaches in Section 2.4, where we also describe in detail how to create survey tables from synthetic microdata in each approach. Finally, we present how to add noise from the Laplace Mechanism in Section 2.5.
2.1 Fully Bayes Model for Observed Sample (FBS)

Our first approach is fully Bayesian because it jointly models the outcome $\mathbf{y}_n$ and sampling weights $\mathbf{w}_n$ of the observed sample of size $n$. We label it FBS (Fully Bayes Sample). Since we model the observed sample, not the population, we do not suppose the model estimated on the sample is the population generating model. In fact, the distributions of the response and weight variables are generally expected to be different in the observed sample than for the underlying population. We retain the smoothed/model-estimated version of the response and weights, and we utilize the latter to correct the distribution in the sample back to the population. We use a bivariate normal synthesizer for the joint distribution of $(y_i, w_i)$ of unit $i$ with predictors $\mathbf{x}_i$, as in Equation (2):
$$\begin{pmatrix} y_i \\ w_i \end{pmatrix} \sim \mbox{MVN}(\mathbf{x}_i \boldsymbol{\beta}, \Sigma) = \mbox{MVN}\left( \begin{pmatrix} \mathbf{x}_i \boldsymbol{\beta}_y \\ \mathbf{x}_i \boldsymbol{\beta}_w \end{pmatrix}, \Sigma \right). \quad (2)$$
We specify independent and identically-distributed multivariate Gaussian priors for the coefficient locations, $\boldsymbol{\beta}_y$ and $\boldsymbol{\beta}_w$, and an LKJ prior for the covariance parameter $\Sigma$ (Stan Development Team, 2016). For brevity, the details of the prior specification are included in the Supplementary Material for further reading. In practice, we may first transform the response and weight, for example $(\log(y_i), \log(w_i))$.

After fitting this unweighted synthesizer for $(\mathbf{y}_n, \mathbf{w}_n)$, we calculate the unit-level privacy protection weights $\boldsymbol{\alpha} = (\alpha_1, \cdots, \alpha_n)$. For each unit $i$, we exponentiate its likelihood by $\alpha_i$, so that we arrive at the $\alpha$-weighted pseudo posterior distribution of $(\boldsymbol{\beta}, \Sigma)$, as in Equation (3):
$$\xi^{\alpha(\mathbf{y}_n, \mathbf{X}_n, \mathbf{w}_n)}(\boldsymbol{\beta}, \Sigma \mid \mathbf{y}_n, \mathbf{X}_n, \mathbf{w}_n) \propto \prod_{i=1}^{n} p(y_i, w_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \Sigma)^{\alpha_i} \times \xi(\boldsymbol{\beta}, \Sigma). \quad (3)$$
Once we estimate this $\alpha$-weighted pseudo posterior distribution of $(\boldsymbol{\beta}, \Sigma)$, we simulate $m$ posterior samples, which achieve the local $(\epsilon_{\mathbf{y}} = 2\Delta_{\alpha, (\mathbf{y}_n, \mathbf{X}_n, \mathbf{w}_n)} \times m)$ privacy guarantee (Savitsky et al., 2020).
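Given one posterior draw of $(\boldsymbol{\beta}, \Sigma)$, generating a synthetic record under a bivariate normal model of this kind can be sketched as follows (our own illustration with hypothetical parameter values, not the authors' code):

```python
import math
import random

def synthesize_record(x_i, beta_y, beta_w, Sigma, rng):
    """Draw (y*, w*) from MVN with mean (x_i'beta_y, x_i'beta_w) and covariance Sigma.

    Uses the Cholesky factor of the 2x2 covariance to transform two
    independent standard normal draws.
    """
    mu_y = sum(x * b for x, b in zip(x_i, beta_y))
    mu_w = sum(x * b for x, b in zip(x_i, beta_w))
    # 2x2 Cholesky factorization: Sigma = L L'
    l11 = math.sqrt(Sigma[0][0])
    l21 = Sigma[1][0] / l11
    l22 = math.sqrt(Sigma[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    y_star = mu_y + l11 * z1
    w_star = mu_w + l21 * z1 + l22 * z2
    return y_star, w_star

rng = random.Random(1)
# One stratum indicator x_i and hypothetical draws of beta_y, beta_w, Sigma
y_star, w_star = synthesize_record([1.0, 0.0], [11.0, 10.5], [2.0, 3.0],
                                   [[0.25, 0.05], [0.05, 0.50]], rng)
```

In practice the response and weight would be modeled on the log scale, as noted above.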
Given the simulated $m$ posterior samples of $(\boldsymbol{\beta}, \Sigma)$, we can generate $m$ synthetic survey datasets following the bivariate normal model in Equation (2), denoted as $(\mathbf{Y}^*, \mathbf{X}, \mathbf{W}^*) = \{(\mathbf{y}_n^{*,(1)}, \mathbf{X}_n^{(1)}, \mathbf{w}_n^{*,(1)}), \cdots, (\mathbf{y}_n^{*,(m)}, \mathbf{X}_n^{(m)}, \mathbf{w}_n^{*,(m)})\}$.

Each synthetic survey dataset $(\mathbf{y}_n^{*,(\ell)}, \mathbf{X}_n^{(\ell)}, \mathbf{w}_n^{*,(\ell)})$, $\ell = 1, \cdots, m$, is used to form survey tables, which are to be released. This table creation process does not cost additional privacy budget since it is post-processing (the protected data $\mathbf{y}_n$ is not used; Dwork et al. (2006); Nissim et al. (2007)). Moreover, since the predictors $\mathbf{x}_i$ are not synthesized, this approach produces partially synthetic data, and the survey tables are created by combining rules of partial synthesis (Reiter and Raghunathan, 2007; Drechsler, 2011). For estimand $Q$, let $q^{(\ell)}$ be the point estimator $q$ of $Q$, and $u^{(\ell)}$ be the variance of $q$ in the $\ell$th synthetic dataset $(\mathbf{y}_n^{*,(\ell)}, \mathbf{X}_n^{(\ell)}, \mathbf{w}_n^{*,(\ell)})$. The analyst can use $\bar{q}_m = \sum_{\ell=1}^{m} q^{(\ell)}/m$ to estimate $Q$, and $T_p = b_m/m + \bar{u}_m$ to estimate the variance of $\bar{q}_m$, where $b_m = \sum_{\ell=1}^{m} (q^{(\ell)} - \bar{q}_m)^2/(m-1)$ and $\bar{u}_m = \sum_{\ell=1}^{m} u^{(\ell)}/m$, with $u^{(\ell)}$ denoting the within-database $\ell$ variance for $\ell = 1, \ldots, m$.

In addition, we may use smoothed $\mathbf{w}_n^{*,(\ell)}$ from the conditional normal distribution derived from Equation (2), $E(w_i \mid y_i) = \mathbf{x}_i \boldsymbol{\beta}_w + \rho (y_i - \mathbf{x}_i \boldsymbol{\beta}_y) \sigma_w / \sigma_y$ (where $\rho$ denotes the correlation between $\mathbf{y}_n$ and $\mathbf{w}_n$, and $\sigma_y$ and $\sigma_w$ are the standard deviations of $\mathbf{y}_n$ and $\mathbf{w}_n$, respectively), to create survey tables from the synthetic survey sample $(\mathbf{y}_n^{*,(\ell)}, \mathbf{X}_n^{(\ell)}, \mathbf{w}_n^{*,(\ell)})$. The smoothed weights $\mathbf{w}_n^{*,(\ell)}$ will provide survey tables with less noise, an appealing feature of modeling $\mathbf{w}_n$ with outcome variable $\mathbf{y}_n$.

Alternatively, we can directly release synthetic survey samples $(\mathbf{Y}^*, \mathbf{X}, \mathbf{W}^*)$ to the public. Data users will need to know how to incorporate survey weights $\mathbf{w}_n^{*,(\ell)}$ for unbiased inference with respect to the population. The availability of synthetic survey samples allows data users to perform analyses of their interest which are only feasible with microdata, such as a weighted regression of income $\mathbf{y}_n$ on predictors $\mathbf{X}_n$. This increases the utility of the synthetic data compared to direct table protection, without additional privacy loss.

2.2 Fully Bayes Model for Population (FBP)

Our second approach is also fully Bayesian and jointly models the outcome $\mathbf{y}_n$ and the sampling weights $\mathbf{w}_n$. However, the specific joint specification is for the generative model of the population and the sample design, rather than directly modelling the sample itself. We label it FBP (Fully Bayes Population). To form the exact likelihood for $(y_i, w_i)$ in the observed sample which corrects for population bias, we follow the fully Bayesian approach proposed by Leon-Novelo and Savitsky (2019) and form the exact likelihood through the inclusion probability $w_i = 1/\pi_i$; that is, we model $(y_i, \pi_i)$ in the observed sample.
We first assume a linear model for the population as
$$y_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma_y \sim \mbox{Normal}(\mathbf{x}_i^t \boldsymbol{\beta}, \sigma_y). \quad (4)$$
Given $y_i$, the conditional population model for inclusion probabilities is
$$\pi_i \mid y_i, \mathbf{x}_i, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi \sim \mbox{Lognormal}(\kappa_y y_i + \mathbf{x}_i^t \boldsymbol{\kappa}_x, \sigma_\pi), \quad (5)$$
where $\kappa_y$ and $\boldsymbol{\kappa}_x$ are regression coefficients for $y_i$ and $\mathbf{x}_i^t$, respectively. We specify independent multivariate normal priors for $\boldsymbol{\beta}$ and $(\kappa_y, \boldsymbol{\kappa}_x)$ and half-Cauchy priors for $\sigma_y$ and $\sigma_\pi$. For brevity, the details of the prior specification are included in the Supplementary Material for further reading. In practice, we may first transform the response ($\log(y_i)$).

Leon-Novelo and Savitsky (2019) have shown that the posterior distribution observed on the sample is
$$\xi_s(y_i, \pi_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma_y, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi) = \frac{\mbox{Normal}(\log \pi_i \mid \kappa_y y_i + \mathbf{x}_i^t \boldsymbol{\kappa}_x, \sigma_\pi)}{\exp\left\{\mathbf{x}_i^t \boldsymbol{\kappa}_x + \sigma_\pi^2/2 + \kappa_y \mathbf{x}_i^t \boldsymbol{\beta} + \kappa_y^2 \sigma_y^2/2\right\}} \times \mbox{Normal}(y_i \mid \mathbf{x}_i \boldsymbol{\beta}, \sigma_y). \quad (6)$$

After fitting this unweighted synthesizer for $(\mathbf{y}_n, \mathbf{X}_n, \boldsymbol{\pi}_n)$, we calculate the unit-level privacy protection weights $\boldsymbol{\alpha} = (\alpha_1, \cdots, \alpha_n)$. For each unit $i$, we exponentiate its likelihood by $\alpha_i$, so that we arrive at the $\alpha$-weighted pseudo posterior distribution of $(\boldsymbol{\beta}, \sigma_y, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi)$, as in Equation (7):
$$\xi^{\alpha(\mathbf{y}_n, \mathbf{X}_n, \boldsymbol{\pi}_n)}(\boldsymbol{\beta}, \sigma_y, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi \mid \mathbf{y}_n, \boldsymbol{\pi}_n, \mathbf{X}_n) \propto \prod_{i=1}^{n} p(y_i, \pi_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma_y, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi)^{\alpha_i} \times \xi(\boldsymbol{\beta}, \sigma_y, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi). \quad (7)$$

Once we estimate this $\alpha$-weighted pseudo posterior distribution of $(\boldsymbol{\beta}, \sigma_y, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi)$, we simulate $m$ posterior samples, which achieve the local $(\epsilon_{\mathbf{y}} = 2\Delta_{\alpha, (\mathbf{y}_n, \mathbf{X}_n, \mathbf{w}_n)} \times m)$ privacy guarantee (Savitsky et al., 2020); note that for coherence with the other two approaches, we use $\mathbf{w}_n$ instead of $\boldsymbol{\pi}_n$ in the expression of $\epsilon_{\mathbf{y}}$, and this can be done since $w_i = 1/\pi_i$. Given the simulated $m$ posterior samples of $(\boldsymbol{\beta}, \sigma_y, \kappa_y, \boldsymbol{\kappa}_x, \sigma_\pi)$, we can generate $m$ synthetic survey datasets following the population model in Equation (4) and Equation (5), denoted as $(\mathbf{Y}^*, \mathbf{X}, \mathbf{W}^*) = \{(\mathbf{y}_n^{*,(1)}, \mathbf{X}_n^{(1)}, \mathbf{w}_n^{*,(1)}), \cdots, (\mathbf{y}_n^{*,(m)}, \mathbf{X}_n^{(m)}, \mathbf{w}_n^{*,(m)})\}$, where each $w_i^{*,(\ell)} = 1/\pi_i^{*,(\ell)}$, and $\ell = 1, \cdots, m$.

Similar to FBS, each synthetic survey dataset $(\mathbf{y}_n^{*,(\ell)}, \mathbf{w}_n^{*,(\ell)})$, $\ell = 1, \cdots, m$, is used to form survey tables. Partial synthesis combining rules (reviewed in Section 2.1) are used to create the survey tables due to the fact that the predictors $\mathbf{x}_i$ are not synthesized. As with FBS, the table creation process does not cost additional privacy budget. Moreover, smoothed weights $\mathbf{w}_n^{*,(\ell)}$ are generated from the smoothed $\boldsymbol{\pi}_n^{*,(\ell)}$ by using only the $\kappa_y y_i$ component of the mean in Equation (5). These weights are only needed when combining data across strata. Within strata, the individuals in the synthetic population are equally weighted. See Section 2.4 for more details.

Alternatively, we can directly release synthetic survey samples $(\mathbf{Y}^*, \mathbf{X}, \mathbf{W}^*)$ to the public. Data users do not need to know how to incorporate survey weights $\mathbf{w}_n^{*,(\ell)}$ for unbiased inference with respect to the population, since $\mathbf{Y}^*$ are corrected for survey sampling bias. However, data users need to aggregate survey weights $\mathbf{w}_n^{*,(\ell)}$ to account for differences in population sizes across strata, for example when creating tables of counts. As with FBS, the release of synthetic survey samples $(\mathbf{Y}^*, \mathbf{X}, \mathbf{W}^*)$ increases the utility of the synthetic data compared to direct table protection, without additional privacy loss.
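One draw from a population model of this form, with the weight recovered as $w^* = 1/\pi^*$, might be sketched as follows (our illustration with hypothetical parameter draws; a real implementation would also enforce $\pi^* \in (0, 1]$):

```python
import math
import random

def synthesize_population_record(x_i, beta, sigma_y, kappa_y, kappa_x, sigma_pi, rng):
    """Draw y* from the linear population model and pi* from the lognormal
    inclusion-probability model; return (y*, w* = 1 / pi*)."""
    mu_y = sum(x * b for x, b in zip(x_i, beta))
    y_star = rng.gauss(mu_y, sigma_y)                   # Normal(x'beta, sigma_y)
    mu_log_pi = kappa_y * y_star + sum(x * k for x, k in zip(x_i, kappa_x))
    pi_star = math.exp(rng.gauss(mu_log_pi, sigma_pi))  # Lognormal draw
    return y_star, 1.0 / pi_star

rng = random.Random(7)
y_star, w_star = synthesize_population_record(
    [1.0, 0.0], beta=[11.0, 10.5], sigma_y=0.3,
    kappa_y=0.1, kappa_x=[-4.0, -3.5], sigma_pi=0.2, rng=rng)
```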
2.3 Pseudo Bayes Model for Population (PBP)

Our third approach is a pseudo Bayesian model for jointly modeling $(y_i, \mathbf{x}_i)$ in the sample with plug-in survey weight $w_i$, corrected for population bias. We label it PBP (Pseudo Bayes Population). This method is pseudo Bayesian because it is only partially generative (i.e., we specify the population generative model for $y_i$ but not for the pair $(y_i, w_i)$ in either the observed sample or the population). Treating survey weights $\mathbf{w}_n$ as fixed, we use the following pseudo likelihood function for the parameters $\Theta$ of the population:
$$L_w(\Theta \mid \mathbf{y}_n, \mathbf{X}_n) = \prod_{i=1}^{n} p(y_i, \mathbf{x}_i \mid \Theta)^{w_i}. \quad (8)$$
Specifically, for the joint model of $(y_i, \mathbf{x}_i \mid \Theta)$, we specify a marginal multinomial model for $\mathbf{x}_i$ as
$$\mathbf{x}_i \mid \boldsymbol{\theta} \sim \mbox{Multinomial}(\boldsymbol{\theta}), \quad (9)$$
where $\boldsymbol{\theta}$ is a probability vector and each $\mathbf{x}_i$ is an indicator vector with a single 1 and the remaining entries 0. Next, we specify a conditional model of $(y_i \mid \mathbf{x}_i)$:
$$y_i \mid \mathbf{x}_i, \boldsymbol{\beta} \sim \mbox{Normal}(\mathbf{x}_i \boldsymbol{\beta}, \sigma). \quad (10)$$
We additionally model and synthesize $\mathbf{X}_n$ because we cannot release the un-modelled weights $\mathbf{w}_n$. The weighted sampled $\mathbf{X}_n$ distribution is therefore also sensitive. We therefore model and release estimates of the population distribution for $\mathbf{X}_n$. This requires a fully synthetic generation of $(\mathbf{y}_n, \mathbf{X}_n)$.

We specify a gamma prior for $\boldsymbol{\theta}$, an independent and identically-distributed multivariate Gaussian prior for the coefficient locations, $\boldsymbol{\beta}$, and a Student-t prior for $\sigma$. For brevity, the details of the prior specification are included in the Supplementary Material for further reading. In practice, we may first transform the response ($\log(y_i)$).

After fitting this survey-weighted synthesizer $\xi^{w}(\boldsymbol{\theta}, \boldsymbol{\beta}, \sigma \mid \mathbf{y}_n, \mathbf{X}_n)$ for $(\mathbf{y}_n, \mathbf{X}_n)$, we calculate the unit-level privacy protection weights $\boldsymbol{\alpha} = (\alpha_1, \cdots, \alpha_n)$.
For each unit i, we exponentiate its likelihood contribution by α_i, arriving at the α-weighted pseudo posterior distribution of (θ, β, σ), as in Equation (11):

ξ_{w, α(y_n, X_n, w_n)}(θ, β, σ | y_n, X_n) ∝ ∏_{i=1}^{n} p(y_i, x_i | θ, β, σ)^{w_i × α_i} × ξ(θ, β, σ). (11)

Once we estimate this α-weighted pseudo posterior distribution of (θ, β, σ), we simulate m posterior samples, which achieve the local (ε_y = 2∆_{α(y_n, X_n, w_n)} × m) privacy guarantee (Savitsky et al., 2020). Given the m simulated posterior samples of (θ, β, σ), we can generate m synthetic survey datasets following the joint model in Equation (9) and Equation (10), denoted as (Y∗, X∗) = {(y_n^{∗,(1)}, X_n^{∗,(1)}), · · · , (y_n^{∗,(m)}, X_n^{∗,(m)})}.

To form survey tables with each synthetic survey dataset (y_n^{∗,(ℓ)}, X_n^{∗,(ℓ)}), ℓ = 1, · · · , m, since X_n^{∗,(ℓ)} is synthesized, we use full synthesis combining rules to obtain final estimates from each table (Reiter and Raghunathan, 2007; Drechsler, 2011). For estimand Q, let q^{(ℓ)} be the point estimator q of Q and u^{(ℓ)} be the variance of q in the ℓth synthetic dataset (y_n^{∗,(ℓ)}, X_n^{∗,(ℓ)}). The analyst can use q̄_m = Σ_{ℓ=1}^{m} q^{(ℓ)}/m to estimate Q, and T_f = (1 + m^{−1}) b_m − ū_m to estimate the variance of q̄_m, where b_m = Σ_{ℓ=1}^{m} (q^{(ℓ)} − q̄_m)²/(m − 1) and ū_m = Σ_{ℓ=1}^{m} u^{(ℓ)}/m. In the case where T_f < 0, the analyst can use the modified T∗_f = (n_syn/n) ū_m. In our application, n_syn = n, so that n_syn/n = 1 and T∗_f = ū_m.

As with FBS and FBP, the table creation process does not cost additional privacy budget. However, in contrast to FBS and FBP, we do not model or release the survey weights w_n. Therefore no weights are used in tabulation, and there are no opportunities for efficiencies from survey weight smoothing.

Alternatively, we can directly release synthetic populations (Y∗, X∗) to the public, where all records are corrected for survey sampling bias. These synthetic data are then analyzed as if they were generated from a simple random sample. As with FBS and FBP, the release of synthetic populations (Y∗, X∗) increases the utility of the synthetic data compared to direct table protection, without additional privacy loss.

We summarize the key features and comparisons of our three proposed approaches in Table 1.
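The full synthesis combining rules above, including the nonpositive-variance fallback T∗_f, can be sketched as follows (the function name is ours):

```python
import numpy as np

def full_synthesis_combine(q, u, n_syn, n):
    """Full synthesis combining rules: q_bar = mean of q^(l);
    T_f = (1 + 1/m) * b_m - u_bar, with the modified estimator
    T_f* = (n_syn / n) * u_bar used when T_f is nonpositive."""
    q = np.asarray(q, dtype=float)
    u = np.asarray(u, dtype=float)
    m = len(q)
    q_bar = q.mean()
    b_m = q.var(ddof=1)     # between-dataset variance
    u_bar = u.mean()        # average within-dataset variance
    T_f = (1.0 + 1.0 / m) * b_m - u_bar
    if T_f <= 0:
        T_f = (n_syn / n) * u_bar  # fallback T_f*
    return q_bar, T_f
```

With n_syn = n, as in our application, the fallback reduces to T∗_f = ū_m.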
                          FBS                   FBP                   PBP
Modeling w_n              yes                   yes                   no
Bias correction stage     analysis              synthesis/analysis    synthesis
Synthesis type            partial synthesis     partial synthesis     full synthesis
Synthesized variables     (y_n, w_n)            (y_n, w_n)            (y_n, X_n)
Unsynthesized variables   X_n                   X_n                   w_n
m synthetic samples       (Y∗, X, W∗)           (Y∗, X, W∗)           (Y∗, X∗)
Post analysis             weighted/stratified   stratified            simple

Table 1: Features of the three approaches: Fully Bayes for observed Sample (FBS), Fully Bayes for Population (FBP), and Pseudo Bayes for Population (PBP).

FBS and FBP are fully Bayesian approaches that jointly model the outcome variable and the sampling weights, whereas PBP is a pseudo Bayes approach that treats the sampling weights as fixed. FBS models the observed sample without correction for population bias; therefore we use the synthesized sampling weights w∗_n to form survey tables (correcting the bias at the analysis stage). FBP corrects for bias at the modelling stage. However, since FBP does not co-model X_n, the synthetic weights are used to generate marginal estimates across different values of X_n. In other words, the partial synthesis FBP does not correct for the difference in the distributions of X_n between the sample and the population. FBS and FBP implement partial synthesis because the design variables X_n are used as predictors but not synthesized, whereas PBP is full synthesis because it both uses the design variables X_n as predictors and synthesizes them. We also note that while FBP can incorporate population bias correction directly into the model, it is less flexible than FBS in more complicated settings, for example when there is more than one outcome variable of mixed types.

A very important aspect of comparing the three approaches is how survey tables are created from each approach once the synthetic microdata are obtained.
These tables include both point estimates, such as counts or means, and estimates of variability, such as standard error estimates.

FBS requires correction for population bias of the outcome variable with weights. Therefore, with the ℓth (ℓ = 1, · · · , m) synthetic sample from FBS, (y_n^{∗,(ℓ)}, X_n^{(ℓ)}, w_n^{∗,(ℓ)}), we use the smoothed weights w_n^{∗,(ℓ)} to create the by-field-and-gender counts and average salary values. Variance estimates for a cell total count and salary for each database ℓ are produced via Taylor linearization (Binder, 1996), a standard method for survey samples. The m databases are used to compute a between-databases variance, and combining rules of partial synthesis are used to compute final point and standard error estimates.

FBP incorporates correction for population bias of the outcome variable in the model for given values of X_n. In our example, X_n are categorical data. Therefore, we create by-field-and-gender average salary values without any weights. Because we do not synthesize X_n, however, we use the smoothed weights w_n^{∗,(ℓ)} to construct marginal salary values (e.g., over gender and field). When creating count (or size) estimates for the population based on the X_n categories, we create by-field-and-gender counts with the smoothed weights w_n^{∗,(ℓ)}.
As with FBS, we use combining rules of partial synthesis to create final point and standard error estimates. We estimate variances for the tabular data based on stratified simple random samples (with strata defined by X_n) for each database, and apply the standard combining rules across databases. With co-modeling of the weights together with the outcome salary, and with the use of smoothed weights in table construction, FBS and FBP will produce more accurate point estimates and smaller standard error estimates for the count and average salary survey tables, as long as the model estimation process is efficient.

Similar to FBP, PBP also incorporates correction for population bias of the outcome variable in the model, although here the weights w_n are not synthesized while field and gender X_n are synthesized. Therefore, with the ℓth synthetic sample from PBP, (y_n^{∗,(ℓ)}, X_n^{∗,(ℓ)}), we use a new set of constant weights (N/n) for each unit to create the by-field-and-gender counts and average salary values. Variance estimates for each database are based on the basic formulas for simple random samples.

Unlike FBS and FBP, which are partial synthesis, for PBP we use combining rules of full synthesis to create final point and standard error estimates. Without co-modeling of the weights, treating them instead as fixed, PBP will produce less efficient point estimates and larger standard error estimates.
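The table construction step for FBS and FBP, weighted counts and weighted average salary per field-by-gender cell, amounts to Horvitz–Thompson-style sums of the (smoothed) weights. A minimal sketch with hypothetical arrays (names are ours):

```python
import numpy as np

def cell_estimates(field, gender, salary, w):
    """Per (field, gender) cell: population count estimate = sum of
    the weights; average salary = weight-weighted mean of salary."""
    out = {}
    for f, g in sorted(set(zip(field, gender))):
        mask = (field == f) & (gender == g)
        out[(f, g)] = (w[mask].sum(),
                       np.average(salary[mask], weights=w[mask]))
    return out
```

For PBP the same function applies with constant weights N/n for every unit, so within-cell averages reduce to unweighted means.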
As a comparison to our three proposed Bayesian data synthesizers, which use smoothing to encode disclosure protection, we consider the alternative of adding noise to tabular products produced directly from the original, sensitive survey data. Each product (e.g., cell means, cell counts, and corresponding standard errors) has a different amount of noise added from the Laplace distribution (see Appendix A for a review of the Laplace Mechanism). We present sensitivity calculations for adding Laplace noise that are based on the outcome variable and survey weights local to the database (y_n, X_n, w_n). Since neither the weights nor the outcomes are required to be bounded, global additive noise mechanisms do not exist (i.e., they would add noise of infinite or unbounded scale). Let S_{f,g} represent the set of observations in field of study f and gender g. Consistent with our use of partial synthesis for FBS and FBP, we assume that the unweighted sample sizes n_{f,g} are not sensitive (i.e., publicly available), which is often the case for demographic surveys.

The local sensitivity ∆^c_{f,g} for the count of field f and gender g (cell count) is

∆^c_{f,g} = max_{i∈S_{f,g}} w_i − min_{i∈S_{f,g}} w_i. (12)

The local sensitivity ∆^a_{f,g} for the average salary of field f and gender g (cell average) is

∆^a_{f,g} = (max_{i∈S_{f,g}} w_i y_i − min_{i∈S_{f,g}} w_i y_i) / (Σ_{i∈S_{f,g}} w_i − (max_{i∈S_{f,g}} w_i − min_{i∈S_{f,g}} w_i)). (13)

The marginal and total counts and averages are calculated in a similar fashion, once S_f, S_g, and S are defined accordingly. Finally, with the calculated local sensitivity ∆^c_{f,g} for the count of field f and gender g, the noise to be added to that cell count is sampled from the following Laplace distribution (Dwork et al., 2006):

Laplace(0, ∆^c_{f,g}/ε), (14)

where ε is the privacy budget. A similar process produces the noise to be added to the average salary of field f and gender g, with the calculated local sensitivity ∆^a_{f,g}.
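The local sensitivities in Equations (12) and (13) and the Laplace draw of Equation (14) depend only on the cell's weights and weighted outcomes, so they can be computed directly. A minimal sketch (function names are ours):

```python
import numpy as np

def count_sensitivity(w):
    """Local sensitivity of a weighted cell count, Equation (12)."""
    return w.max() - w.min()

def average_sensitivity(w, y):
    """Local sensitivity of a weighted cell average, Equation (13)."""
    wy = w * y
    return (wy.max() - wy.min()) / (w.sum() - (w.max() - w.min()))

def laplace_noise(sensitivity, eps, rng=None):
    """One draw from Laplace(0, sensitivity / eps), Equation (14)."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.laplace(0.0, sensitivity / eps)
```

Note that a single extreme weight in a cell inflates both sensitivities, and hence the noise scale, for that entire cell.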
With a given privacy budget ε, the larger the local sensitivity ∆^c_{f,g} or ∆^a_{f,g}, the larger the scale of the added noise. The observed values of y_i and w_i within each field f and gender g cell may not represent the full range in the unobserved population. As a step towards a more global or comprehensive sensitivity (representing a more complete span of y_i and w_i), we take the maximum of the cell-specific sensitivities and use that instead for the cell-level Laplace noise: ∆^{c∗} = max_{f,g} ∆^c_{f,g} and ∆^{a∗} = max_{f,g} ∆^a_{f,g}.

In addition to the counts and averages, we must also add noise to their corresponding variance estimates. We use a replication method (Rao et al., 1992) with a set of R = 10 replicates. Treating the variance as a post-processing step, we add noise to each of these 10 replicated point estimates, based on the sensitivity calculations for point estimates above, to achieve a target ε_vc = 10 ε_rep for a given cell.

Each individual's data is used in four table cells (one interior cell, plus the row and column margins and the grand margin) to calculate two estimates (point and variance). There are two tables produced (counts and means). Assuming equal budget across each of these 8 pairs of estimates, the total budget is ε = 8ε_pc + 8ε_vc. If we choose to assign equal budget to the point estimates and the variance estimates, then setting a global budget of, say, ε = 80, we use ε_pc = ε/16 = 5 for each cell point estimate and ε_rep = ε/((16)(10)) = 0.5, so that ε_vc = 5.

3 Simulation Study
We now turn to our simulation study, where, for a simulated sample from a simulated population, we use each of the three approaches to synthesize the sample and create survey tables from the synthetic microdata. We scale the three approaches to equivalent levels of privacy guarantee in order to compare their utility performance. We also add noise under the Laplace Mechanism, at the equivalent level of privacy guarantee, to the table formed from the simulated confidential sample, and compare its utility performance to that of our proposed microdata synthesis approaches.
We design our simulation study based on the public use file of the 2017 Survey of Doctorate Recipients. We simulate a population of N = 100,000 units containing unit-level information on salary, field of expertise (8 levels), and gender (2 levels). In our simulated population, the field and gender percentages follow those in the public use file. Given simulated field and gender, each unit's salary value y_i is simulated from a lognormal distribution with a field- and gender-specific mean (obtained from the public use file) and a fixed scale of 0.4. Moreover, we generate additive noise, noise_i, for log(y_i) from a normal distribution with mean 0 and the same fixed scale of 0.4, so that we can simulate the inclusion probability π_i for unit i from the following construction:

log(π_i) = log(y_i) + noise_i. (15)

We then obtain the survey weight w_i = 1/π_i. Less noise corresponds to a more informative sampling design, whereas more noise corresponds to a weaker relationship between the outcome and the selection probability. We chose a moderate level of noise, corresponding to a moderately informative design.

Finally, we take a stratified probability proportional to size (PPS) sample of n = 1000 units, where fields are used as strata (and y is used as the size variable in Equation (15)). We denote by y_n the outcome variable salary, by X_n the field and gender variables, and by w_n the sampling weights of the sample.

Our goal is to create survey tables of counts of observations and average salary by field and gender, along with corresponding standard error estimates, all with privacy protection. Specifically, we fit our three microdata synthesis models presented in Section 2.1 to Section 2.3 to generate synthetic survey samples, from which we create the desired survey tables containing point estimates and standard error estimates. As a comparison method, we add noise to the point estimates and standard error estimates using the Laplace distribution, with local sensitivities for the Laplace Mechanism constructed as outlined in Section 2.5.
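The data-generating steps above can be sketched as follows. The field and gender draws and the cell-specific log-salary means here are placeholders (the paper takes both from the public use file), and the normalization of π_i to an expected sample size is our simplification of the stratified PPS step:

```python
import numpy as np

rng = np.random.default_rng(2017)
N, n, scale = 100_000, 1_000, 0.4

# Placeholder field (8 levels) and gender (2 levels) draws; the paper
# uses the public-use-file percentages instead of uniform draws.
field = rng.integers(0, 8, size=N)
gender = rng.integers(0, 2, size=N)

# Placeholder cell-specific log-salary means (paper: from the PUF).
mu = np.log(60_000.0) + 0.05 * field + 0.10 * gender
y = np.exp(rng.normal(mu, scale))      # lognormal salary, scale 0.4

# Equation (15): log(pi_i) = log(y_i) + noise_i, same scale 0.4.
noise = rng.normal(0.0, scale, size=N)
pi = np.exp(np.log(y) + noise)
pi *= n / pi.sum()   # our simplification: expected sample size n
w = 1.0 / pi         # survey weights
```

A smaller noise scale would tighten the link between y_i and π_i, i.e., a more informative design, matching the discussion above.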
Note that we scale the maximum Lipschitz bound of the three microdata synthesis approaches to an equivalent level, denoted ∆_{α,(y_n, X_n, w_n)}. We generate m = 10 synthetic datasets from each microdata synthesis approach; we therefore use (ε_y = 2∆_{α,(y_n, X_n, w_n)} × 10) as the total privacy budget for adding Laplace noise under the Laplace Mechanism comparison method.
We first evaluate the model fits of our proposed three microdata synthesis approaches. Figure 1a plots the distributions of the record-level Lipschitz bounds ∆_{α,(y_i, X_i, w_i)} for the three approaches. The maximum value of each violin plot on the y-axis corresponds to the overall Lipschitz bound ∆_{α,(y_n, X_n, w_n)} of each approach. As Figure 1a shows, our three approaches have equivalent maximum Lipschitz bounds of about 3.4, indicating that every approach provides an (ε_y = 2 × 3.4 × 10) privacy guarantee with m = 10 simulated synthetic datasets. Note that the overall Lipschitz bounds of the three approaches without α weighting are 8.64, 9.95, and 45.39, respectively. These values show that our α-weighted approaches produce substantially lower overall Lipschitz bounds, demonstrating the ability of the weight vector α to control the (ε_y = 2∆_{α,(y_n, X_n, w_n)})-DP privacy guarantee (Savitsky et al., 2020; Hu et al., 2020).

Figure 1a further shows that, compared to our pseudo Bayes model PBP, our two fully Bayes models, FBS and FBP, have more records expressing relatively high values of ∆_{α,(y_i, X_i, w_i)} concentrated around the overall Lipschitz bound ∆_{α,(y_n, X_n, w_n)} for the entire database. Although only the overall Lipschitz bound (i.e., the maximum) controls the overall privacy guarantee, we observe that our two fully Bayes models avoid overly downweighting records' likelihood contributions through α, which results in higher utility, as we will see in Section 3.3 (Hu et al., 2020).

Figure 1: Violin plots of model fits. (a) Lipschitz bounds. (b) Privacy weights α_i.
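The rule for setting the α_i is not given in this excerpt; the following schematic is our own illustration, assuming a simple cap-at-target downweighting rule, together with the ε_y arithmetic that is stated in the text:

```python
import numpy as np

def alpha_weights(delta_record, delta_target):
    """Schematic rule (ours, for intuition only, not the paper's exact
    mechanism): downweight records whose record-level Lipschitz bound
    exceeds a target, leaving low-risk records at alpha_i = 1."""
    return np.minimum(1.0, delta_target / delta_record)

def epsilon_y(delta_alpha, m):
    """Guarantee used in the text: epsilon_y = 2 * Delta_alpha * m."""
    return 2.0 * delta_alpha * m

# With the overall alpha-weighted Lipschitz bound of about 3.4 and
# m = 10 synthetic datasets reported above:
print(epsilon_y(3.4, 10))  # 68.0
```

The point of the arithmetic is that the privacy loss scales linearly in both the overall Lipschitz bound and the number of released synthetic datasets.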
Figure 1b plots the distributions of the privacy weights (α_i) under each of our model synthesizing methods. Each violin plot represents the distribution of the record-level weights (α_i)_{i=1}^{n} ∈ [0, 1] for one of the three approaches. These results further support our conclusion that the two fully Bayes models avoid overly downweighting in comparison to the pseudo Bayes method. Our fully Bayes models, FBS and FBP, clearly produce more records with weight α_i close to 1, compared to our pseudo Bayes model PBP. In fact, our fully Bayes model for the observed sample, FBS, produces the largest number of records with high α_i values, as can be seen from the concentration of distribution mass near 1. As demonstrated in previous works, record-level α weights skewed towards 1 result in a higher level of utility of the simulated synthetic microdata (Hu et al., 2019; Savitsky et al., 2020; Hu et al., 2020), regardless of whether the primary use is to produce tabular data statistics or any other purpose. We will see in Section 3.3 that synthetic microdata that better preserve the distribution of the closely-held data will, in turn, produce survey tables with a higher level of utility, an intuitive and desirable effect of creating private survey tables from microdata through synthesis.

Recall that our primary goal is to produce survey tables with a formal privacy guarantee. All of our proposed microdata synthesis approaches achieve this goal through Bayesian modeling of the confidential survey sample (y_n, X_n, w_n), generating synthetic microdata samples with an ε privacy guarantee (asymptotically), and subsequently producing survey tables where the tables, no matter how fine the resolution of cells, provide the same ε guarantee.

Since all three microdata synthesis approaches are scaled to have equivalent overall Lipschitz bounds ∆_{α,(y_n, X_n, w_n)}, they all achieve an (ε_y = 2 × 3.4 × 10) privacy guarantee with m = 10 simulated synthetic datasets. Therefore, to evaluate and compare their utility performance in the production of tabular statistics, we focus on the point estimates and associated standard error estimates for the cell counts and average salary variables. Specifically for our simulation, we compare counts of observations by field and gender (i.e., the sampling weights w_n by design variables X_n) and average salary by field and gender (i.e., the outcome variable y_n by design variables X_n) between the original sample and the survey tables created from the synthetic microdata samples. As a comparison, we produce corresponding point estimates and standard error estimates with Laplace noise added to the original sample estimates, implementing the Laplace Mechanism under an equivalent (ε_y = 2 × 3.4 × 10) privacy guarantee.

To create survey tables from the original, confidential sample (y_n, X_n, w_n), we treat fields as strata and produce counts of observations by field and gender and average salary by field and gender. To add Laplace noise to these point estimates of counts and average salary by field and gender, we calculate cell sensitivities according to Equations (12) and (13) in Section 2.5 and take the maximums (∆^{c∗} and ∆^{a∗}). As described in Section 2.5, we also produce R = 10 replicate tables with added noise to create variance estimates. To create survey tables from the three microdata synthesis approaches, we follow the procedures described in Section 2.4.

Turning to the utility evaluation, we present the utility results and comparison for counts of observations by field and gender in Table 2 and for average salary by field and gender in Table 4. Note that there are 8 fields and 2 genders. For the counts table in Table 2, we create 9 blocks representing the total across all fields and the 8 fields individually; within each block, there are 3 rows: overall for both genders, male only, and female only. For the columns, first, we have one column for the population truth from our simulated N = 100,000 units, labeled "Pop.", and one column that uses the sample of n = 1000 units to create a survey weighted expansion estimate of the population values, labeled "Sample". Next, we have noise added from the Laplace Mechanism, labeled "Laplace". Finally, we have three columns corresponding to the three microdata approaches applied to the sample, labeled "FBS", "FBP", and "PBP", respectively.

Except for the "Pop." column, which only contains the true generating point values, every other column includes point estimates, with standard error estimates in parentheses. Since all privacy protection methods achieve the same level of privacy guarantee, we compare their utility performance in terms of: (i) how close the point estimate is to the "Pop." estimate, and (ii) which method most efficiently covers the truth based on the estimated standard errors. For readability of Table 2, we highlight the point estimate closest to the "Pop." estimate in each row in rose and the smallest standard error in blue.

Among the four privacy protection methods, FBS produces counts with the smallest standard error estimates: it has the smallest standard error estimate for all but five rows. For those five rows, all of which concern the female or male count of a specific field, PBP produces the smallest standard error estimates. However, it is important to note that having the smallest standard error estimates does not come at the price of under-coverage, since FBS also produces about half of all 26 closest point estimates: it has 12 counts with the closest point estimates, while Laplace has 5, FBP has 7, and PBP has 2. Table 3 presents the values of the root mean squared estimation error (RMSE) for each cell and method.
It verifies our conclusion that FBS produces the most efficient estimates, followed by FBP, which in turn outperforms Laplace. The Laplace Mechanism performs no smoothing, and so its accuracy is upper-bounded by the underlying accuracy of the survey expansion estimator. By contrast, our three model-based methods (particularly the fully Bayes methods) provide smoothing (noise removal), such that the estimator may be more efficient than the un-modeled survey expansion estimator. This is exactly what we see in Table 3, as many cells have lower RMSE values under FBS and FBP than under the Sample.

Table 2: Mean and SE of counts by field and gender of: the survey weighted expansion estimate of the population values based on the observed sample (Sample), noise added from the Laplace distribution to the observed sample (Laplace), and our three microdata synthesis approaches: Fully Bayes for observed Sample (FBS), Fully Bayes for Population (FBP), and Pseudo Bayes for Population (PBP). For each row, the point estimate closest to the "Pop." is highlighted in rose and the smallest standard error is highlighted in blue.

By contrast, Laplace performs best for cells where the underlying sample estimate performs well (e.g., requires little smoothing). In general, however, Laplace produces relatively inefficient estimates as compared to FBS and FBP. Given a fixed privacy budget ε, the amount of noise to be added from the Laplace Mechanism depends solely on the cell sensitivity: the scale of the Laplace noise is proportional to the cell sensitivity, so that larger cell sensitivity results in larger added noise (Dwork et al., 2006). When units are accompanied by sampling weights, and if the sampling weight distribution has large variability, as is often the case in practice, the cell sensitivity of counts can be large. Moreover, the Laplace method adds substantial noise to the standard error estimates as well, roughly 2 to 10 times those of FBS. In addition to these drawbacks, the Laplace method does not by default maintain certain key features, such as the counts of males and females adding up to the count of both genders in a given field, which are naturally maintained by the microdata synthesis approaches. Some modifications can be made to the query or the post-processing to enforce these consistency requirements (see, for example, Li et al.
, 2010).

The main reason for PBP's underperformance lies in its treatment of the sampling weights as fixed. As described in detail in Section 2.3, PBP treats the sampling weights w_n as fixed and uses them in the exponents of the likelihood contributions to correct for population bias. This approach fails to utilize the sampling weights in modeling the outcome y_n, producing more variable point estimates and larger standard error estimates in the survey tables. PBP must also fully synthesize both the response y_n and the covariates X_n (i.e., full synthesis). On the other hand, FBS and FBP are fully Bayesian approaches that co-model the outcome and sampling weights (y_n, w_n), and both can use non-sensitive ("public") data more efficiently (i.e., partial synthesis). While the specific models differ, both FBS and FBP take advantage of co-modeling the weights and outcome, which often leads to weight smoothing (less variable weights) and results in more stable estimates of the domain counts tabulated from the weights (see, for example, Beaumont, 2008).

The main advantage of co-modeling is to fully utilize any information in the sampling weights to improve estimation of the outcome variable(s). We now turn to the survey tables of average salary by field and gender in Table 4, where we see that, again, FBS and FBP outperform PBP and Laplace. As in Table 2 for counts, we highlight the closest point estimate in rose and the smallest standard error estimate in blue for each row of Table 4.

We again observe that the standard error estimates under Laplace are quite large, reflecting the variability of the added noise. Overall, the quality of the Laplace point estimates is better for average salary than for cell counts because the sensitivity for average salary is scaled by the sum of the weights.

Among the three microdata synthesis approaches, the pseudo Bayes approach PBP has the worst point estimate performance, although it has the smallest standard error estimate for several rows.
Table 5 again demonstrates that FBS and FBP produce the most efficient estimates, while PBP is less efficient and Laplace provides the least efficient estimates over most tabular cells. We expect a survey weighted pseudo Bayes (PBP) approach to generally undercover without an adjustment to the Markov chain Monte Carlo (MCMC) parameter draws (Williams and Savitsky, 2020b). However, such an adjustment would require additional use of the restricted data, increasing the privacy budget. In addition, the pseudo Bayes approach fails to co-model the outcome and sampling weights, the main reason for its under-performance for both counts and average salary.

We observe that, between the two fully Bayes approaches that co-model the outcome and sampling weights, FBP produces the majority of the closest point estimates (12 rows in rose out of 27 rows in total) and the most of the smallest standard error estimates (11 rows in blue out of 27 rows in total). These results suggest that FBP outperforms FBS in creating a survey table of average salary, indicating that the co-modeling in FBP achieves better estimation of the outcome salary using the sampling weights. Yet FBS and FBP produce estimates of very similar efficiency, as seen in Table 5.

Table 4: Mean and SE of average salary by field and gender of: the survey weighted expansion estimate of the population values based on the observed sample (Sample), noise added from the Laplace distribution to the observed sample (Laplace), and our three microdata synthesis approaches: Fully Bayes for observed Sample (FBS), Fully Bayes for Population (FBP), and Pseudo Bayes for Population (PBP). For each row, the point estimate closest to the "Pop." is highlighted in rose and the smallest standard error is highlighted in blue.
To further illustrate the difference in their performance, Figure 2 shows a series of scatter plots for FBS: (i) salary versus sampling weight in the original, confidential sample; (ii) salary versus sampling weight in one synthetic dataset with synthesized weights; and (iii) salary versus sampling weight in one synthetic dataset with smoothed weights. Figure 3 shows a series of scatter plots for FBP with the same arrangement.
For both approaches, we illustrate the use of synthesized weights, where the sampling weights are simulated from the bivariate normal model in Equation (2) for FBS and from the lognormal model for the inclusion probability in Equation (5) for FBP, once the posterior draws of the parameters are available after MCMC estimation of each model. Furthermore, we illustrate the use of smoothed weights, where the sampling weights are conditional expectations (given synthesized y_n) from the corresponding models for FBS and FBP. Using smoothed weights removes extra noise unrelated to y_n from the synthesizing step, which ultimately produces smaller standard error estimates. We note that the results of FBS and FBP in Table 2 for counts, and of FBS in Table 4 for average salary, are reported with smoothed weights. Comparing the right-hand graphs in Figure 2 and Figure 3, we observe that the co-modeling in FBP provides slightly smoother estimation of the weights compared to that in FBS. Even though we do not use the smoothed weights directly in FBP (as we do in FBS), the smoothed weights in FBP allow a more precise estimation of the exact posterior distribution for the sample.

In summary, our microdata synthesis approaches outperform the Laplace Mechanism in utility preservation of point estimates and standard error estimates. Among our three microdata synthesis approaches, PBP has the lowest utility performance, due to increased variability from using the sampling weights as plug-in values rather than co-modeling the outcome and sampling weights. Between the two fully Bayes approaches that co-model the outcome and sampling weights, FBS produces counts with higher utility, while FBP produces average salary with slightly higher utility. Overall, both methods perform reasonably well across the two sets of tables. We now turn to a Monte Carlo simulation of repeated sampling to compare the frequentist properties of FBS and FBP.
To compare the frequentist properties of our two preferred fully Bayes models, we conduct a Monte Carlo simulation by drawing $R = 100$ samples of size $n = 1000$ from our simulated population of $N = 100{,}000$ units, described in Section 3.1. For each of the $r = 1, \cdots, R$ samples, we apply the FBS and FBP approaches and scale them to have equivalent $\left(\epsilon_y = 2\Delta_{\alpha, (y_n, X_n, w_n)} \times m\right)$, where we simulate $m = 10$ synthetic datasets from each approach. For each of the two sets of synthetic datasets across the $R = 100$ simulations, we calculate the effective coverage rate of the nominal 95% confidence interval of the population count and average salary by field and gender. The closer the coverage rate is to 0.95, the better the performance. In addition, we calculate the average ratio of the coefficients of variation (CV), which are the standard errors scaled by the corresponding estimates, and compare between the two approaches. The CV provides information about relative efficiency: the lower the CV, the better the performance. Detailed results are included in Appendix B. Figures 4 and 5 present the coverage of the 95% nominal interval on the x-axis against the CV on the y-axis for cell counts and average salary, respectively. Each point represents a field-by-gender cell.

For the results on counts shown in Figure 4, the FBP approach has higher coverage than the FBS approach in all fields except one, field 7, which shows serious undercoverage for both genders combined and for males alone. The higher coverage of FBP comes at the cost of wider average confidence intervals: on average, the length of the confidence intervals of the FBP approach is 1.6 times that of the FBS approach (see Table 8 in Appendix B for details). These results indicate that FBP overcovers and produces overly long intervals, which may sometimes be desirable for its conservatism. By contrast, the FBS approach only slightly undercovers and achieves greater efficiency with smaller variances and shorter intervals. For the results on average salary shown in Figure 5, FBS and FBP perform equally well in terms of coverage rates and the coverage efficiency manifested by the CV.
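The two utility metrics used here can be sketched as follows, with simulated stand-in replicate estimates rather than actual FBS/FBP output (function and variable names are illustrative):

```python
import numpy as np

def coverage_and_cv(estimates, std_errors, truth, z=1.96):
    """Empirical coverage of nominal 95% intervals across R replicates,
    and the average coefficient of variation (SE / estimate)."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    coverage = np.mean((lower <= truth) & (truth <= upper))
    cv = np.mean(std_errors / estimates)
    return coverage, cv

# Stand-in replicate estimates for a single field-by-gender cell.
rng = np.random.default_rng(0)
truth = 27963.0                              # population count for the cell
est = truth + rng.normal(0, 1500, size=100)  # R = 100 replicate estimates
se = np.full(100, 1500.0)                    # replicate standard errors
cov, cv = coverage_and_cv(est, se, truth)
```

When the standard errors correctly reflect the replicate variability, the empirical coverage sits near the nominal 0.95; a lower CV at comparable coverage indicates the more efficient procedure.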
Table 9 in Appendix B presents approximately equivalent average confidence interval lengths between the two approaches.

Figure 4: Coverage vs. CV of count by field and gender of Fully Bayes for observed Sample (FBS) and Fully Bayes for Population (FBP). A red dashed line at coverage = 0.95 is included for reference.

Figure 5: Coverage vs. CV of average salary by field and gender of Fully Bayes for observed Sample (FBS) and Fully Bayes for Population (FBP). A red dashed line at coverage = 0.95 is included for reference.
Based on the simulation results of the previous section, combined with the modeling flexibility of the FBS approach, we conclude that FBS is the preferred microdata synthesis approach. In particular, the FBS approach is straightforward to implement and accommodates any desired model for the observed sample. In our real data application to the Survey of Doctorate Recipients, therefore, we focus on FBS as our microdata synthesis approach and compare its results to those of noise-added point and standard error estimates from the Laplace Mechanism. We compare privacy-protected tables to the corresponding unprotected tables from the survey sample. (Results of FBS weighted vs. unweighted appear in two tables in Appendix C.)

Figure 6: Violin plots of model fits: (a) Lipschitz bounds; (b) weights $\alpha_i$'s.
We first examine the FBS approach. Figure 6 demonstrates that we efficiently downweight the likelihood to reduce $\Delta_\alpha$ (as compared to the unweighted synthesizer) while keeping most record-level weights close to 1.

Table 6 compares the point and standard error estimates of the by-cell (field-by-gender) counts for each of the survey weighted expansion estimator in "Sample", the Laplace Mechanism in "Laplace", and the fully Bayes synthesizer sample-based estimator in "FBS". As in the simulation study, FBS generally produces lower standard error estimates in the real data application. Since this is a real data application, we do not know the population truth, but the results of the simulation study suggest that FBS will produce the most accurate and efficient point estimates. As a result, we may interpret the large difference in point estimates for the Field 1 - Female cell between FBS, on the one hand, and Sample and Laplace, on the other hand, as reflecting the greater accuracy and efficiency gained from jointly modeling the survey sampling weights and the income variable. We saw in the simulation study that removing noise unrelated to the response / income from the weights often produces an estimator with lower RMSE than Sample.

Table 7 is constructed in the same fashion as Table 6, but using the average salary variable. The point estimates are relatively similar across the three methods, though FBS produces estimates with lower standard errors, as we saw in the simulation study.
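The survey weighted expansion estimator behind the "Sample" column is the standard construction that sums sampling weights within a cell; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def expansion_estimates(weights, salary, in_cell):
    """Survey weighted expansion estimates for one field-by-gender cell:
    the cell count sums the sampling weights of its members, and the
    average salary is the weight-normalized mean."""
    w = np.asarray(weights, dtype=float)[in_cell]
    y = np.asarray(salary, dtype=float)[in_cell]
    count_hat = w.sum()                  # estimated population count
    mean_hat = (w * y).sum() / w.sum()   # estimated average salary
    return count_hat, mean_hat

# Toy sample of four records, three of which fall in the cell of interest.
weights = np.array([10.0, 20.0, 30.0, 40.0])
salary = np.array([50.0, 60.0, 70.0, 80.0])
in_cell = np.array([True, True, False, True])
count_hat, mean_hat = expansion_estimates(weights, salary, in_cell)
# count_hat = 70.0; mean_hat = (500 + 1200 + 3200) / 70 = 70.0
```

The FBS estimator replaces the observed $(y_i, w_i)$ pairs in this construction with their privacy-protected synthetic counterparts before forming the same cell-level quantities.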
We address the issue of formal privacy for data collected from an informative sampling design. There are three major challenges to inducing formal privacy protection into survey data: (a) The correlation between weights and outcome affects data utility, since the distribution in the sample is different from that of the underlying population. It also makes privacy protection more complex because the sampling weights also need to be protected. (b) Under the skewed and possibly unbounded outcomes (weights and salary) typically present in survey data, traditional additive noise mechanisms perform poorly, and global privacy guarantees are typically not available for additive noise processes due to unbounded variables and sampling weights that produce unbounded sensitivities. (c) Both standard errors and point estimates must be produced and released in a private manner, and standard error calculations from complex survey data are non-standard.

We apply several competing modeling approaches to perform data synthesis across different variations: (a) we synthesize a joint sample of the response variable and weights (FBS) vs. synthesizing a population (FBP, PBP); (b) we jointly model weights and outcomes (FBS, FBP) vs. using plug-in weights (PBP); (c) we conduct partial synthesis (FBS, FBP) vs. full synthesis (PBP). These design choices are not independent and impact each other, as we see with our three alternatives: 1. FBS, 2. FBP, and 3. PBP. Our initial screening study favors the full (joint modeling) approach over the pseudo (plug-in) approach, also corresponding to partial vs. full synthesis. Our simulation study shows that both methods perform relatively well, with a slight edge for FBS.
Given the more flexible nature of FBS and its ease of implementation for any synthesizing model chosen by the owner of the closely-held data, we prefer this approach and demonstrated its performance on a real application data set, where we achieve both more accurate and more efficient estimates for FBS as compared to the additive noise Laplace Mechanism. Our FBS synthesizer even often produces superior RMSE performance as compared to the survey expansion estimator, as demonstrated in the simulation study; yet, the former embeds formal privacy protection while the latter does not.

Future extensions of this work will include multiple variables / responses, two-stage sample surveys, and alternative formulations of DP for Bayesian methods; for example, censoring / transforming the log-likelihood to ensure global DP for all sample sizes (Savitsky et al., 2020). We acknowledge that there are more advanced methods to add Laplace noise (for example, Li et al., 2010) that may lead to some efficiencies for the additive noise method.

Acknowledgement

This research was supported, in part, by the National Science Foundation (NSF), National Center for Science and Engineering Statistics (NCSES) by the Oak Ridge Institute for Science and Education (ORISE) for the Department of Energy (DOE). ORISE is managed by Oak Ridge Associated Universities (ORAU) under DOE contract number DE-SC0014664.

The authors also wish to thank Phillip Leclerc, a Mathematical Statistician at the U.S. Census Bureau, for his guidance and advice on our implementation of the added noise Laplace Mechanism.

All opinions expressed in this paper are the authors' and do not necessarily reflect the policies and views of NSF, BLS, DOE, ORAU, or ORISE.
References
Beaumont, J.-F. (2008). A new approach to weighting and inference in sample surveys. Biometrika, 3, 539–553.

Binder, D. A. (1996). Linearization methods for single phase and two-phase samples: a cookbook approach. Survey Methodology, 17–22.

Dimitrakakis, C., Nelson, B., Zhang, Z., Mitrokotsa, A., and Rubinstein, B. I. P. (2017). Differential privacy for Bayesian inference through posterior sampling. Journal of Machine Learning Research, 1, 343–381.

Drechsler, J. (2011). Synthetic Datasets for Statistical Disclosure Control. Springer: New York.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, 265–284, Berlin, Heidelberg. Springer-Verlag.

Heeringa, S. G., West, B. T., and Berglund, P. A. (2010). Applied Survey Data Analysis. Chapman and Hall/CRC.

Hu, J., Savitsky, T. D., and Williams, M. R. (2019). Risk-efficient Bayesian data synthesis for privacy protection. arXiv:1908.07639.

Hu, J., Savitsky, T. D., and Williams, M. R. (2020). Re-weighting of vector-weighted mechanisms for utility maximization under differential privacy. arXiv:2006.01230.

Leon-Novelo, L. G. and Savitsky, T. D. (2019). Fully Bayesian estimation under informative sampling. Electronic Journal of Statistics, 1608–1645.

Li, C., Hay, M., Rastogi, V., Miklau, G., and McGregor, A. (2010). Optimizing linear counting queries under differential privacy. In Proceedings of the Twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 123–134.

Little, R. J. A. (1993). Statistical analysis of masked data. Journal of Official Statistics, 407–426.

Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., and Vilhuber, L. (2008). Privacy: Theory meets practice on the map. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, 277–286. IEEE Computer Society.

Margossian, C. C. (2018). A review of automatic differentiation and its efficient implementation. CoRR abs/1811.05031.

McSherry, F. and Talwar, K. (2007). Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, 94–103.

Nissim, K., Raskhodnikova, S., and Smith, A. (2007). Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, 75–83.

Rao, J. N. K., Wu, C. F. J., and Yue, K. (1992). Some recent work on resampling methods for complex surveys. Survey Methodology, 209–217.

Reiter, J. P. and Raghunathan, T. E. (2007). The multiple adaptations of multiple imputation. Journal of the American Statistical Association, 1462–1471.

Rubin, D. B. (1993). Discussion: statistical disclosure limitation. Journal of Official Statistics, 461–468.

Savitsky, T. D. and Toth, D. (2016). Bayesian estimation under informative sampling. Electronic Journal of Statistics, 1, 1677–1708.

Savitsky, T. D., Williams, M. R., and Hu, J. (2020). Bayesian pseudo posterior mechanism under differential privacy. arXiv:1909.11796.

Snoke, J. and Slavkovic, A. (2018). pMSE mechanism: Differentially private synthetic data with maximal distributional similarity. In J. Domingo-Ferrer and F. Montes, eds., Privacy in Statistical Databases, vol. 11126 of Lecture Notes in Computer Science, 138–159. Springer.

Stan Development Team (2016). RStan: the R interface to Stan. R package version 2.14.1.

Wang, Y.-X., Fienberg, S., and Smola, A. (2015). Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In F. Bach and D. Blei, eds., Proceedings of the 32nd International Conference on Machine Learning, vol. 37 of Proceedings of Machine Learning Research, 2493–2502, Lille, France. PMLR.

Wasserman, L. and Zhou, S. (2010). A statistical framework for differential privacy. Journal of the American Statistical Association, 375–389.

Williams, M. R. and Savitsky, T. D. (2020a). Bayesian estimation under informative sampling with unattenuated dependence. Bayesian Analysis, 1, 57–77.

Williams, M. R. and Savitsky, T. D. (2020b). Uncertainty estimation for pseudo-Bayesian inference under complex sampling. International Statistical Review.

A Review of Differentially Private Mechanisms
In this section, we describe some of the major classes of mechanisms for data release that can be shown to be differentially private (i.e., to satisfy the property in Definition 1). In particular, we describe additive noise, the Exponential Mechanism, and the Bayesian posterior mechanism, and discuss connections among the three. Variations of these approaches are compared in our analyses.

Perhaps the simplest way to protect a data release is to generate random noise and add it to the original data value (record level or tabular). The key insight from Dwork et al. (2006) is that we can calibrate the amount of noise (scale or variability) to meet specific target values of $\epsilon$. For smaller $\epsilon$, more noise (larger scale) is needed. The "right" amount of noise to add is based on the sensitivity of the mechanism $M$.

Definition 2. Let
$D, D' \in \mathbb{R}^{k \times n}$. Let $q$ define a (non-differentially private) data release (e.g., a mean), $q(\cdot): \mathbb{R}^{k \times n} \rightarrow S$. Then define the $L_1$-sensitivity of $q$ as
$$\Delta q = \sup_{D, D' \in \mathbb{R}^{k \times n}:\, \delta(D, D') = 1} \left\| q(D) - q(D') \right\|_1,$$
where $\delta(D, D') = 1$ indicates that $D$ and $D'$ differ in a single record.

The sensitivity of the desired data release mechanism $q(\cdot)$ tells us how much noise to add to obtain a differentially private version of the mechanism $M(\cdot)$. The most common example is the Laplace Mechanism, which adds noise from a Laplace distribution with density
$$f_{LAP}(x \mid \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right).$$
Given a deterministic data release $q(\cdot): \mathbb{R}^{k \times n} \rightarrow S$, define the Laplace Mechanism as $M_L(\cdot) = q(\cdot) + \mathrm{LAP}(0, \Delta q / \epsilon)$. Then $M_L$ is an $\epsilon$-differentially private release mechanism (Dwork et al., 2006).

Dwork et al. (2006) also consider metrics beyond the absolute difference measure (L1) for sensitivity and how to create a mechanism that is differentially private. Subsequent works clarified this formulation, named it the Exponential Mechanism, and showed how the Laplace Mechanism is a special case (McSherry and Talwar, 2007; Wasserman and Zhou, 2010).

Switching notation slightly, let $\theta$ be the desired output (previously $s$), which could be a synthetic value, a model parameter, or a tabular summary statistic. The Exponential Mechanism inputs a non-private mechanism for $\theta$ and generates $\theta$ in such a way that induces a DP guarantee on the overall mechanism.

Definition 3.
The Exponential Mechanism $M_E$ uses a utility function $u(D, \theta)$ to generate values of $\theta$ from a distribution proportional to
$$M_E(\theta \mid D) \propto \exp\left( \frac{\epsilon \, u(D, \theta)}{2 \Delta u} \right) \xi(\theta \mid \gamma), \quad (16)$$
where $\Delta u = \sup_{D, D' \in \mathbb{R}^{k \times n}:\, \delta(D, D') = 1}\, \sup_{\theta \in \Theta} \left| u(D, \theta) - u(D', \theta) \right|$ is the sensitivity, defined globally over $D \in \mathbb{R}^{k \times n}$; $\delta(D, D') = \#\{i : D_i \neq D'_i\}$ is the Hamming distance between $D, D' \in \mathbb{R}^{k \times n}$; and $\xi(\theta \mid \gamma)$ is a proper probability measure.

Each single draw of $\theta$ from the Exponential Mechanism $M_E(\theta \mid D)$ satisfies $\epsilon$-DP; see McSherry and Talwar (2007) or Wasserman and Zhou (2010).

The main benefit of the Exponential Mechanism over (symmetric) additive noise is that the general utility function $u(D, \theta)$ can be constructed in such a way that restrictions on the range of the output (for example, enforcing positive counts) are possible, and the input data can be less restrictive (for example, variables without natural bounds, such as revenue). In other words, the perturbation is not symmetric by default and can be applied to more variable types besides categories and counts. A major challenge when using the Exponential Mechanism is actually sampling from the implied distribution in (16). For stability, a base (or prior) probability measure is often used to guarantee that the implied distribution is a proper probability measure ($\xi(\theta \mid \gamma)$ integrates to 1 instead of $\infty$). Still, generating a $\theta$ using an arbitrary utility function $u(\cdot)$ is a significant challenge (Wasserman and Zhou, 2010; Snoke and Slavkovic, 2018).

Wang et al. (2015) show the connection between the Exponential Mechanism and sampling from the posterior distribution of a probability model by setting the utility $u$ to the log-likelihood.
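As a concrete illustration of the two mechanisms above, consider a minimal sketch (function and variable names are illustrative, and a flat base measure $\xi$ over a finite candidate set is assumed for the Exponential Mechanism):

```python
import numpy as np

def laplace_mechanism(q_value, sensitivity, epsilon, rng):
    """Release q(D) + LAP(0, Delta_q / epsilon); epsilon-DP when
    `sensitivity` is the L1-sensitivity of the query q."""
    return q_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def exponential_mechanism(candidates, utilities, sensitivity, epsilon, rng):
    """Draw one candidate with probability proportional to
    exp(epsilon * u / (2 * Delta_u)), as in Equation (16) with a flat base."""
    scores = np.asarray(utilities, dtype=float) * epsilon / (2.0 * sensitivity)
    scores -= scores.max()                 # stabilize before exponentiating
    probs = np.exp(scores)
    probs /= probs.sum()
    idx = rng.choice(len(candidates), p=probs)
    return candidates[idx], probs

rng = np.random.default_rng(42)
# Laplace: privatize a cell count of 120 with sensitivity 1 at epsilon = 0.5.
noisy_count = laplace_mechanism(120.0, sensitivity=1.0, epsilon=0.5, rng=rng)
# Exponential: higher utility means a candidate is closer to the target.
candidates = [10, 20, 30, 40]
utilities = [-5.0, -1.0, 0.0, -3.0]
choice, probs = exponential_mechanism(candidates, utilities,
                                      sensitivity=1.0, epsilon=1.0, rng=rng)
```

Note how a smaller $\epsilon$ inflates the Laplace noise scale ($\Delta q / \epsilon$) and flattens the Exponential Mechanism's selection probabilities toward uniform, trading utility for privacy in both cases.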
The main benefit of using Bayesian methods is the extensive research into computational methods to generate samples, something that presents a significant challenge for the general Exponential Mechanism. Dimitrakakis et al. (2017) provide alternative extensions and proofs of the differential privacy property of samples from the posterior distribution. They also assume the log-likelihood is bounded and suggest truncating the support of $\theta$ to achieve this. Savitsky et al. (2020) extend these results to incorporate individual-level adjustments, where the weights $\alpha_i$ are related to record-specific sensitivity estimates $\hat{\Delta}_i$:
$$\xi^{\alpha}(\theta \mid \mathbf{x}, \gamma) \propto \left[ \prod_{i=1}^{n} p(x_i \mid \theta)^{\alpha_i} \right] \xi(\theta \mid \gamma). \quad (17)$$

B Monte Carlo results of simulation study
                  Pop.   FBS coverage  FBS avg CI  FBP coverage  FBP avg CI
Field 1  total    27963      0.89         2844         1.00         6519
         Male     14599      0.98         3496         1.00         4766
         Female   13364      1.00         4124         1.00         5136
Field 2  total     2734      0.84          928         0.95         1817
         Male      1951      0.98         1125         1.00         1475
         Female     783      1.00         1163         1.00         1064
Field 3  total     4035      0.89         1148         1.00         2719
         Male      2918      0.98         1397         1.00         2217
         Female    1117      1.00         1460         1.00         1594
Field 4  total    18109      0.88         2320         1.00         5232
         Male     12381      1.00         2836         1.00         4278
         Female    5728      1.00         3013         1.00         3332
Field 5  total    11197      0.87         1852         1.00         4701
         Male      3861      1.00         2035         1.00         2672
         Female    7336      0.99         2654         1.00         3986
Field 6  total    12842      0.85         1987         1.00         4792
         Male      6433      1.00         2406         1.00         3327
         Female    6409      1.00         2847         1.00         3631
Field 7  total    18131      0.84         2285         0.47         4598
         Male     14358      0.95         2761         0.49         4032
         Female    3773      1.00         2653         1.00         2474
Field 8  total     4989      0.84         1244         1.00         2860
         Male      2033      1.00         1437         1.00         1767
         Female    2956      1.00         1804         1.00         2279

Table 8: Coverage of average counts and length of confidence intervals of FBS and FBP under 100 repeated samples.

                  Pop.   FBS coverage  FBS avg CI  FBP coverage  FBP avg CI
Field 1  total   110659      0.93        11346         0.86        11102
         Male    121125      0.95        15661         0.83        15164
         Female   99226      0.98        14686         0.92        14283
Field 2  total   140742      0.98        46945         0.97        44778
         Male    146916      0.99        54948         0.99        51907
         Female  125358      0.99        69852         0.93        66609
Field 3  total   114668      0.94        31082         0.94        29541
         Male    118338      0.96        36480         0.97        34524
         Female  105081      0.99        47825         0.94        45360
Field 4  total   115647      0.87        14954         0.92        14451
         Male    122601      0.86        17753         0.95        17212
         Female  100618      1.00        23113         0.98        22296
Field 5  total   105808      0.96        17094         0.91        16819
         Male    120531      0.97        28873         0.96        28303
         Female   98060      0.97        19413         0.92        18700
Field 6  total   110159      0.97        16671         0.91        16263
         Male    122676      0.99        23240         0.89        22379
         Female   97595      0.97        21332         0.96        20526
Field 7  total   132249      0.92        16706         0.92        16122
         Male    136370      0.93        18718         0.97        18013
         Female  116566      0.99        30682         0.95        29770
Field 8  total   119005      0.95        29297         0.94        28152
         Male    137349      1.00        44490         0.90        42430
         Female  106388      0.95        34603         0.98        32946

Table 9: Coverage of average salary and length of confidence intervals of FBS and FBP under 100 repeated samples.