A Bayesian Framework for Generation of Fully Synthetic Mixed Datasets
JOSEPH FELDMAN AND DANIEL R. KOWAL
Department of Statistics, Rice University
Abstract.
Much of the micro data used for epidemiological studies contain sensitive measurements on real individuals. As a result, such micro data cannot be published out of privacy concerns, rendering any published statistical analyses on them nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic, high dimensional micro datasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.

1. Introduction
The combined impact of social and environmental stressors on childhood cognitive development remains understudied, limited in part by accessibility to rich, individual-level data. Of particular interest is the modeling of proxies for cognitive development, such as academic achievement, as potentially complex, nonlinear functions of different social and environmental exposures. In such work, the primary statistical tool used to study these relationships is regression analysis, which requires ample micro data collected on individual subjects within a population [1, 2]. Such micro data often contain sensitive information that precludes their public release. While the inferences drawn from analyzing such micro data may have profound policy implications, the micro data themselves often remain under lock and key out of concern for subject confidentiality, which undermines the scientific reproducibility of these analyses.

For this work, we examine a dataset built by linking three administrative datasets for the state of North Carolina (NC): detailed birth records, which include maternal demographics, maternal and infant health, and maternal obstetrics history for all documented live births in NC; blood lead surveillance data from the state registry maintained by the Childhood Lead Poisoning Prevention Program of the Children's Environmental Health Unit, Department of Health and Human Services in Raleigh, NC, which includes integer-valued blood lead levels; and end-of-grade (EOG) standardized testing data from the NC Education Research Data Center of Duke University in Durham, which includes EOG reading and mathematics test scores from the 1995-1996 school year to the present, student identifying information, and data on demographics and socioeconomics.
Using residential addresses, these datasets are further linked with extensive exposure data, including environmental exposures (ambient air quality and temperature) and social exposures (indices for racial isolation and neighborhood deprivation); additional details are provided in the supplementary material as well as [2].

These micro data are high dimensional and consist of child-level measurements of various data types (categorical, binary, count, continuous). See Table 1. Several of these measurements are also highly sensitive, such as those taken on children's health, academic achievement, and socioeconomic status. While there are no unique identifiers in the data, other features in the dataset that contain high resolution geographic information connected to places of residence could be used by an ill-intended adversary to
link individuals to these protected attributes with high accuracy. As a result, publishing these data could spawn a myriad of privacy and legal issues.
Table 1. North Carolina dataset description

| Type        | Variable                                | Range/Values                                                                        | Label                   |
|-------------|-----------------------------------------|-------------------------------------------------------------------------------------|-------------------------|
| Categorical | Mother's Race                           | White, Black, Hispanic, Asian/Pacific Islander, Other                               | M_RACEGROUP             |
|             | Mother's Education                      | No H.S. Diploma, H.S. Diploma, Some College/Associates Degree, Bachelor's or Higher | M_EDUCGROUP             |
| Binary      | Gender                                  | Male, Female                                                                        | MALE                    |
|             | Prenatal Care                           | Yes, No                                                                             | NOPNC                   |
|             | Marital Status                          | Married, Not Married                                                                | NOTMARRIED              |
|             | Smoker                                  | Yes, No                                                                             | SMOKER                  |
|             | Econ. Disadvantaged                     | Yes, No                                                                             | ED                      |
|             | Medicaid                                | Yes, No                                                                             | MEDICAID                |
| Integer     | EOG Reading Score                       | 316 to 370                                                                          | ReadScal1               |
|             | EOG Math Score                          | 321 to 373                                                                          | MathScal1               |
|             | Mother's Age (years)                    | 15 to 40                                                                            | MAGE                    |
|             | Birth Weight Percentile                 | 0 to 100                                                                            | BWTpctl_clin            |
|             | Gestational Period (weeks)              | 32 to 42                                                                            | GEST                    |
|             | Blood Lead Test Result                  | 1 to 10                                                                             | PBresult                |
| Continuous  | Temperature (by trimester)              | 27.8 to 86.60 (Fahrenheit)                                                          | Ti_temp_avg, i = 1,2,3  |
|             | PM2.5 (by trimester)                    | 5.60 to 31.06 (PPM)                                                                 | Ti_pm25_24hr, i = 1,2,3 |
|             | Acute PM2.5 Exposure                    | 6.026 to 17.891 (PPM)                                                               | PM25_30days_June        |
|             | Chronic PM2.5 Exposure                  | 7.549 to 11.709 (PPM)                                                               | PM25_1yr_June           |
|             | Racial Isolation at Birth               | 0 to 1                                                                              | RI_nhb_Birth            |
|             | Racial Isolation at Test                | 0 to 1                                                                              | RI_nhb_Educ             |
|             | Neighborhood Deprivation Index at Birth | -4.5174 to 11.3888                                                                  | NDI_Birth               |
|             | Neighborhood Deprivation Index at Test  | -4.17036 to 10.3524                                                                 | NDI_Educ                |
To address this issue, we develop a Bayesian framework for the construction of fully synthetic datasets that is compatible with mixed data types and able to scale to high dimensions. It is important to emphasize the ability of this framework to seamlessly incorporate modeling of unordered categorical variables alongside other data types, as unordered categorical variables, especially race or ethnicity, are critical in health disparities research and practice.

We apply our method to create a fully synthetic version of the aforementioned North Carolina dataset. In addition to providing consistent univariate, bivariate, and multivariate relationships in the synthetic data, our method ensures that outcomes measuring cognitive development of subjects display similar relationships with key exposure variables observed in the original data. Given the aims of studying the original dataset, maintaining these structures is the most desired feature of any synthetic version. Because the synthetic data comprise no real individuals, they do not jeopardize the privacy of individuals in the original dataset. In addition, we confirm minimal attribute disclosure risks via several tests. As such,
researchers may request access to the synthetic dataset at https://doi.org/10.25614/synthetic_data, and unrestricted public access is currently pending approval.

The literature on the construction of fully synthetic datasets to preserve confidentiality dates back at least to [3], who approached data synthesis as a problem of multiple data imputation. This approach constructs synthetic datasets by learning a Bayesian generative model from the data and then simulating from a posterior predictive distribution. In our case, the dataset to be synthesized features categorical, binary, count, and continuous measurements, necessitating a generative model simultaneously compatible with each type.

Several fully Bayesian generative models exist that are able to jointly model data comprised of mixed scales. [4] develop a hierarchical mixture model specifically aimed at multiple imputation for mixed continuous and categorical outcomes. [5] provides a joint model for multivariate categorical data, and [6] uses this model for the generation of synthetic multivariate categorical data. [7] use Bayesian marked point process modeling to create synthetic datasets for epidemiological analyses that preserve spatial dependence structures. Closely aligned with this work, [8] and [9] estimate a semiparametric Gaussian copula for data of mixed continuous, count, and ordinal types through a marginal likelihood called the extended rank likelihood. However, none of these models is simultaneously compatible with unordered categorical data.

Other methods are designed specifically for the generation of fully synthetic datasets. Generally, they rely on iterative conditioning: using the relationship P(Y_1, ..., Y_p) = P(Y_1) ∏_{j=2}^p P(Y_j | Y_1, ..., Y_{j-1}), separate models are fit for each P(Y_j | Y_1, ..., Y_{j-1}) and then Y_j^syn is synthesized through bootstrapping, posterior predictive sampling, or density estimation.
[10] used parametric linear regression models at each iteration of the conditioning. More recently, nonparametric, tree-based methods have emerged [11, 12, 13], which can provide a more effective approach for modeling nonlinear relationships in the data.

The primary limitation of this approach is the need to choose the variable ordering for data synthesis. As demonstrated in Section 6.3, the utility and privacy of the synthetic data are highly sensitive to the ordering. Ideally, one might consider optimizing the ordering for maximal utility and privacy. Yet this becomes intractable for moderate to large p, since the number of orderings to consider is p!. In our setting, the dimension is p = 26, which is far too large for a combinatorial search.

Our main contribution is the development of a Bayesian semiparametric generative model for the joint distribution of categorical, binary, count, and continuous random variables. The model is based on a semiparametric Gaussian copula, which provides a convenient platform for linking univariate marginal distributions under a multivariate dependence structure to induce a joint distribution. A synthetic dataset can then be constructed by simulating samples from the posterior predictive distribution. Under this framework, variables are synthesized jointly, which obviates the need to declare an ordering for data synthesis. Additionally, our data synthesizer is capable of coherently and accurately capturing the relationships among categorical, binary, count, and continuous variables. Therefore, a notable achievement of this work is the development of a joint model for unordered categorical, binary, count, and continuous variables, a previously significant hurdle for existing Bayesian methods.

The North Carolina dataset is particularly useful for deriving insights into the relationships between social and environmental exposures and EOG test scores.
Consequently, we design the synthetic data generating process to target and capture the regression associations for modeling EOG test scores using the other demographic, socioeconomic, and exposure variables. To do so, we incorporate a nonparametric and nonlinear Bayesian regression model using the EOG test scores as the response variable. This approach capitalizes on the benefits of iterative conditioning by directly identifying the key response variables of interest, yet avoids the difficulties in selecting an ordering by generating the remaining variables (all but the two EOG reading and mathematics scores) jointly using our semiparametric copula model.

Ultimately, we are able to demonstrate several key properties of our synthetic data. The semiparametric and multivariate components provide remarkable consistency for marginal distributions and in crucial bivariate relationships. Perhaps more important to our application, we demonstrate the high utility of one dataset produced under our method when summarizing the inference from linear regression models for both synthetic and observed data. Such validation is vital if the synthetic data are to be used by additional investigators beyond the primary managers of confidential databases. We must ensure that the synthetic data "tell the same story" as the observed data.

The organization of this paper is as follows. In Section 2, we introduce a semiparametric Gaussian copula model for estimating a joint distribution. Section 3 describes our main methodological innovation: the extended rank-probit likelihood as a model for unordered categorical, binary, count, and continuous variables. Section 4 includes the modeling details, the MCMC sampling algorithm, and the data synthesis. Section 5 details the modifications for nonlinear regression. The results are presented in Section 6. We conclude in Section 7.

2. A Semiparametric Gaussian Copula
To generate a fully synthetic dataset, we first build a model for the joint distribution of the observed data. Copula models provide an effective strategy: by linking marginal distributions with a multivariate dependence structure, copula models can preserve both marginal properties and multivariate relationships. By Sklar's Theorem [14], the joint distribution of a p-dimensional random vector Y = (y_1, ..., y_p) can be expressed through the univariate marginals F_j, j = 1, ..., p and a copula C:

F(y_1, ..., y_p) = C{F_1(y_1), ..., F_p(y_p)}.

For computational convenience and modeling flexibility, we build upon the Gaussian copula

(1) C(u_1, ..., u_p) = Φ_p{Φ^{-1}(u_1), ..., Φ^{-1}(u_p)}

where u_j ∈ [0, 1] for j = 1, ..., p and Φ_p is the cumulative distribution function (CDF) of a p-dimensional Gaussian random vector with correlation matrix C. The joint distribution of Y is derived by combining (1) with univariate marginal distributions {F_j}_{j=1}^p. Since F_j(y_j) ∼ Uniform(0, 1),

(2) F(y_1, ..., y_p) = Φ_p[Φ^{-1}{F_1(y_1)}, ..., Φ^{-1}{F_p(y_p)}]

which provides a generative model for multivariate data Y.

A Bayesian approach requires a probability model for the marginal distributions {F_j}_{j=1}^p and the parameters that govern the Gaussian copula, namely the correlation matrix C. Given samples from the posterior distribution of these parameters, it is then possible to simulate posterior predictive samples, enabling construction of a fully synthetic dataset. In particular, the data generating process is defined by (i) sampling latent data z ∼ N_p(0, C) and (ii) computing the marginals y_j = F_j^{-1}{Φ(z_j)} for j = 1, ..., p. The latent Gaussian random variables z capture the dependence structure among the variables, while the marginal inverse CDFs F_j^{-1} link these dependent random variables to the correct scale of the observed data. Since the marginal distributions F_j can be estimated accurately using empirical CDFs or other marginal models even for small to moderate sample sizes, the important modeling task centers on the dependence structure C.

Given data {y_ij}, a natural semiparametric strategy for inference on C is to compute pseudo-data z_ij = Φ^{-1}{F̂_j(y_ij)} for observations i = 1, ..., n, where F̂_j is an estimate of each marginal CDF, and to perform inference on the model z_i ∼ N_p(0, C) independently for i = 1, ..., n. However, problems arise for data of mixed types.
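The two-step generative process (i)-(ii) can be sketched numerically. In this illustrative snippet, the correlation matrix and the "observed" columns are hypothetical stand-ins for the estimated copula parameters and the empirical CDFs:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical copula correlation matrix C for three variables:
# one continuous, one count, one binary margin.
C = np.array([[1.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.0]])

# "Observed" columns, used here only to define empirical inverse CDFs F_j^{-1}.
observed = [rng.normal(50, 10, 500),    # continuous
            rng.poisson(4, 500),        # count
            rng.binomial(1, 0.3, 500)]  # binary

# Step (i): latent Gaussian draws with dependence structure C.
z = rng.multivariate_normal(np.zeros(3), C, size=1000)

# Step (ii): push each latent coordinate through Phi, then the empirical F_j^{-1}.
u = norm.cdf(z)
y_syn = np.column_stack([
    np.quantile(observed[j], u[:, j], method="inverted_cdf")
    for j in range(3)
])

# Synthetic margins live on the observed scales (reals, counts, {0, 1}).
assert y_syn.shape == (1000, 3)
```

Because the inverse CDFs return observed sample values, the synthetic columns automatically inherit the correct supports of their margins.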
Consider maximum likelihood estimation of C: when the marginal distributions F_j are continuous, the estimator is consistent; yet when the marginal distributions are discrete, the pseudo-data transformation changes only the sample space and not the data distribution, so the resulting estimator is inconsistent [8]. These issues persist for Bayesian models.

For continuous, count, and ordinal variables, [8] proposes a remedy based on the rank likelihood. Since CDFs are non-decreasing, observing y_ij < y_lj implies that z_ij < z_lj. We can make this partial ordering precise for each variable j:

(3) D(y_j) = {z ∈ R^n : y_ij < y_lj ⟹ z_ij < z_lj, ∀ i ≠ l ∈ {1, ..., n}},

so D(y_j) is the set of all values of z_j that match the ordering of the observed data y_j. For n observations Y = (Y_1, ..., Y_n)', the latent variables Z = (Z_1, ..., Z_n)' must satisfy the event

D = {Z ∈ R^{n×p} : max{z_kj : y_kj < y_ij} < z_ij < min{z_kj : y_ij < y_kj}, ∀ j = 1, ..., p}.

The full data likelihood can thus be
decomposed:

P(Y | C, F_1, ..., F_p) = P(Y, Z ∈ D | C, F_1, ..., F_p)   (4)
= P(Z ∈ D | C) × P(Y | Z ∈ D, C, F_1, ..., F_p)   (5)

The equivalence in (4) is true by construction, since observing Y implies that Z ∈ D, and (5) follows because the event Z ∈ D is independent of the marginal distributions F_1, ..., F_p. [8] proposes to estimate the copula parameters C by treating P(Z ∈ D | C) as the likelihood, which he refers to as the extended rank likelihood. With this likelihood, [8] models C using an inverse-Wishart prior; [9] adopt a similar approach based on a factor model. Although the extended rank likelihood is a rank-based approximation for the true likelihood, it nonetheless contains much of the information about C. [9] confirm this intuition by showing strong posterior consistency for C in a reduced rank setting.

3. The Extended Rank-Probit Likelihood
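The interval that the event D implies for a single latent entry can be read directly off the observed column: z_ij must lie between the largest latent value tied to a smaller observation and the smallest latent value tied to a larger one. A toy sketch with made-up values:

```python
import numpy as np

def rank_bounds(y, z, i):
    """Truncation interval for z[i] implied by the rank constraint:
    max{z[k] : y[k] < y[i]} < z[i] < min{z[k] : y[k] > y[i]}."""
    lower = z[y < y[i]]
    upper = z[y > y[i]]
    lo = lower.max() if lower.size else -np.inf
    hi = upper.min() if upper.size else np.inf
    return lo, hi

# Toy count column and latent values consistent with its ordering.
y = np.array([0, 1, 1, 2, 3])
z = np.array([-1.2, -0.3, 0.1, 0.8, 1.5])

lo, hi = rank_bounds(y, z, 2)   # y[2] = 1: bounded by z at y = 0 and y = 2
assert lo == -1.2 and hi == 0.8
```

Note that ties (here the two observations with y = 1) impose no ordering on each other, which is exactly why the rank likelihood handles discrete margins gracefully.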
There are three crucial limitations in the Gaussian copula model and the extended rank likelihood. These limitations are described here and resolved in subsequent sections.

First, an incongruity arises in applying the extended rank likelihood for data that contain unordered categorical variables. The extended rank likelihood is built upon (i) inverse marginal CDFs and (ii) the ordering of the observed data. Unordered categorical variables have neither a well-defined inverse CDF nor a natural ordering. The ability to model unordered categorical variables is crucial. For example, race or ethnicity is a critical factor in health and health disparities research and practice. Use of the extended rank likelihood here would require an ordering of the races, a task that is both unethical and nonsensical. A simple workaround is to one-hot-encode each categorical variable with k levels as k − 1 binary variables; however, the dependence structure implied by C is then restricted and depends on an arbitrary choice of reference level. Second, we demonstrate improved modeling of unordered categorical variables using our proposed approach, which is confirmed for both simulated data and the North Carolina dataset (see Section 6).

Third, the correlation matrix C only captures linear dependencies on the latent scale. As a result, the Gaussian copula may be an overly simplistic model for outcomes of interest for which nonlinear dependencies need be preserved. We address this challenge in Section 5.

In what follows, we propose an extended rank-probit likelihood, which more naturally models categorical and ordinal data with few levels. We use this likelihood to define a new Gaussian copula model for mixed categorical, binary, ordinal, count, and continuous data.

3.1. A Marginal Likelihood for Unordered Categorical Variables.
Suppose that a categorical variable in the dataset, y_j, possesses k distinct levels. We represent this variable as k binary variables γ_jl ∈ {0, 1}, l = 1, ..., k, such that

y_ij = m ⟺ {γ_ijm = 1} ∩ {γ_ijl = 0}, ∀ l ≠ m.

These binary variables can be expressed using a latent Gaussian representation akin to probit regression [15]. Specifically, we introduce latent variables (z_ij1, ..., z_ijk) such that the event {γ_ijm = 1} ∩ {γ_ijl = 0}, ∀ l ≠ m is equivalent to

(6) (z_ij1, ..., z_ijk) ∈ d'(y_j) ≡ ∪_{m=1}^k {z_i ∈ R^k : z_im > 0, z_il < 0, l ≠ m}

analogous to D(y_j) in (3). The latent variables in (6) describe the inclination of observations to belong to one level over another. With this representation, the multinomial data do imply a partial ordering among latent variables, but the ordering comes within each observation: the ranking occurs among the levels of a categorical variable, and this ranking is unique to the individual. This is contrary to what is induced by the rank likelihood, where a ranking is induced among individuals through a comparison of numerical quantities observed in the population. We can then combine this latent variable representation of categorical data with the rank-based representation of numerical measurements, which we refer to as the extended rank-probit likelihood.

Consider a dataset with p columns comprised of q categorical variables, each with a potentially unique number of levels k_1, ..., k_q, and r variables that are ordinal, count, continuous, or binary. For each categorical variable, we apply the latent binary representation from (6) and aggregate across the entire dataset: D'_c = {Z ∈ R^{n×k_c} : ∪_{j=1}^{k_c} {z_ij > 0, z_il < 0, l ≠ j}} for c = 1, ..., q.
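A minimal sketch of the latent representation in (6): a k-level observation is encoded by a latent vector with exactly one positive component, and the assigned level can be read back from the sign pattern. The numerical values below are purely illustrative:

```python
import numpy as np

def category_from_latent(z):
    """Decode a k-level category from its latent vector under (6):
    exactly one component is positive; that index is the assigned level."""
    positive = np.flatnonzero(z > 0)
    assert positive.size == 1, "latent vector must lie in the diagonal orthant set"
    return int(positive[0])

def in_latent_set(z, m):
    """Check z in d'(y_j) for level m: z_m > 0 and z_l < 0 for all l != m."""
    return z[m] > 0 and np.all(np.delete(z, m) < 0)

z = np.array([-0.4, 1.3, -0.2])      # encodes level 1 of a 3-level variable
assert category_from_latent(z) == 1
assert in_latent_set(z, 1) and not in_latent_set(z, 0)
```

The ranking here is among the k latent components of a single observation, not across individuals, matching the within-observation ordering described above.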
Across all q categorical variables, the observed group memberships in the dataset satisfy the event D' = ∪_{c=1}^q D'_c. As was the case with the rank likelihood, the probability that Z_q satisfies D'(Y_q) does not depend on the marginal distributions of the categorical variables, which allows us to join this event with the event D. More formally, we have that observing the full dataset Y ∈ R^{n×p*}, where p* = r + Σ_{c=1}^q k_c, implies that Z must satisfy the new event E = D ∪ D'. Like (4)-(5), we can decompose the full data likelihood using this new event:

P(Y | C, F_1, ..., F_p) = P(Y, Z ∈ E | C, F_1, ..., F_p)   (7)
= P(Z ∈ E | C) × P(Y | Z ∈ E, C, F_1, ..., F_p)   (8)

where the equivalence in (7) once again arises since observing Y implies that Z satisfies the event E. In the Gaussian copula, the parameter of interest is C, and in the decomposition of the full data likelihood in (8), we see that the left term depends solely on C. As in [8], we proceed to estimate the Gaussian copula parameters solely based on the marginal likelihood, P(Z ∈ E | C), which is the extended rank-probit likelihood.

3.2. Adjusting the Gaussian Copula Sampling Model.
The event D'_c(y_c) is identical to the set restriction imposed for the data augmentation in the diagonal orthant multinomial probit model of [16]. In this model, it is possible to derive a convenient link function whereby the mean of the multivariate normal distribution governing the latent z gives class probabilities for each observation. Specifically, the event D'_c(y_c) induces the following probability distribution on y_c:

(9) P(y_c = h) = P(z_h > 0, {z_l < 0}_{l ≠ h}).

This general format lends insight into how we can interpret latent variables corresponding to categorical levels that satisfy the extended rank-probit likelihood. For categorical data, there is no relationship between latent variables and realizations of observed data through a marginal CDF. Instead, each categorical observation is represented through a k_c-dimensional latent vector which encodes the class membership probabilities for that observation.

This results in a simple modification to the Gaussian copula model, namely that we must estimate an intercept, α, for the p*-dimensional Gaussian distribution that characterizes the multivariate dependence structure in our data. This intercept is non-zero only for components of the latent vector corresponding to levels of a categorical variable. The reason for this is simple: (9) tells us that if we were to leave the mean vector at 0, our model would imply equal probability among the levels of each categorical variable.

4. Bayesian Estimation and Data Synthesis
4.1. A Factor Model for the Extended Rank-Probit Likelihood.
Bayesian inference and synthetic data generation for the extended rank-probit likelihood require a model for the latent Gaussian variables z. Factor models offer a natural approach: they capture dependence among high dimensional data through a parsimonious low-rank model and computationally scalable posterior sampling algorithms. Specifically, our model is given by

(10) z_i = α + Λ η_i + ε_i,  ε_i ~iid N(0, Σ)

where α encodes categorical probabilities, Λ is a p* × k matrix of factor loadings, η_i is a k × 1 vector of latent factors, and Σ = diag(σ_1², ..., σ_{p*}²).
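As a quick numerical check of this low-rank construction: marginalizing (10) over η_i leaves a latent covariance of the form ΛΛ' + Σ, which can be rescaled to a correlation matrix for copula inference. The dimensions and values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

p_star, k = 8, 3                               # illustrative sizes, with k << p*
Lam = rng.normal(0, 0.5, (p_star, k))          # factor loadings Lambda
Sig = np.diag(rng.uniform(0.5, 1.0, p_star))   # diagonal noise covariance Sigma

# Marginal covariance implied by the factor model: Omega = Lam Lam' + Sig.
Omega = Lam @ Lam.T + Sig

# Rescale to the copula correlation: C_ij = omega_ij / sqrt(Omega_ii Omega_jj).
d = np.sqrt(np.diag(Omega))
Ccorr = Omega / np.outer(d, d)

assert np.allclose(np.diag(Ccorr), 1.0)
assert np.all(np.abs(Ccorr) <= 1.0 + 1e-12)
```

Working on the unconstrained covariance scale and rescaling afterward is what preserves conjugacy in the Gibbs sampler, as discussed next.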
The dependence among the latent variables is captured in the lower dimensional vector of factors, η_i, usually with k ≪ p*. By marginalizing over η, the latent variables satisfy z_i ∼ N(α, Ω) with reduced rank covariance Ω = ΛΛ' + Σ. The advantage of modeling the data on the covariance scale, rather than using correlations, is computational simplicity. Priors for λ, σ_j², and η_i enjoy conjugacy under a Gaussian likelihood. As a result, a simple and effective Gibbs sampling algorithm may be developed. In contrast, correlation matrices have rigid structure; the diagonal terms must be one and the off-diagonal terms must be between zero and one in absolute value. A comparable factor model for a correlation matrix would require alternative priors and more complex sampling algorithms. Still, posterior inference for the correlation matrix is available under (10): given posterior samples of the covariance matrix Ω, we can easily rescale C_ij = ω_ij / √(Ω_ii Ω_jj) and similarly α̃_j = α_j / √(Ω_jj).

We specify the following priors for the non-zero components of α, σ_j^{-2}, and η_i:

α_j ∼ N(0, 1),  σ_j^{-2} ∼ IG(a_σ, b_σ),  η_i ∼ N(0, I_k)

independently. For the elements of Λ, we utilize the multiplicative gamma process prior of [17]: [λ_{j,h} | φ_{jh}, τ_h] ∼ N(0, φ_{jh}^{-1} τ_h^{-1}) independently, with local scale parameters φ_{jh} ∼ Gamma(ν/2, ν/2) and global scale parameters τ_h = ∏_{l=1}^h δ_l for δ_1 ∼ Gamma(a_1, 1) and δ_l ∼ Gamma(a_2, 1), l ≥ 2. Because of the stochastically increasing nature of τ_h (under the constraint that a_2 > 1), the loadings are increasingly shrunk toward zero as the column index h grows, yielding a parsimonious representation of Ω. The parsimony afforded by this prior on the factor loadings allows estimation to scale well with p, making joint modeling, and therefore synthesis, of high dimensional and highly correlated datasets more feasible.

For the marginal CDFs F_j (for non-categorical variables) in the extended rank-probit likelihood, we substitute an estimate using the empirical CDFs F̂_j, as in [8] and [9]. This choice ensures that data realized through the generating model come on the correct scale with consistent marginal properties, and can be re-scaled to avoid infinities. For continuous variables, we estimate the CDFs using a kernel smoother to ensure unique values in the data synthesis as an additional layer of privacy protection. More broadly, the extended rank-probit likelihood is compatible with any marginal model for each (non-categorical) variable j.

4.2. MCMC Sampling Algorithm.
The MCMC sampling algorithm proceeds as follows, and is decomposed into three blocks for clarity:

(1) Sample the factor model parameters:
• λ_{j,·} | − ∼ N(V_j η' σ_j^{-2} (z_j − α_j 1_n), V_j), where V_j = (D_j^{-1} + σ_j^{-2} η' η)^{-1}, D_j^{-1} = diag(φ_{j1} τ_1, ..., φ_{jk} τ_k), z_j = (z_{1j}, ..., z_{nj})', and η = (η_1, ..., η_n)', for j = 1, ..., p*
• σ_j^{-2} | − ∼ Gamma(a_σ + n/2, b_σ + (1/2) Σ_{i=1}^n {z_ij − (α_j + λ_j' η_i)}²), for j = 1, ..., p*
• η_i | − ∼ N_k((I_k + Λ' Σ^{-1} Λ)^{-1} Λ' Σ^{-1} (z_i − α), (I_k + Λ' Σ^{-1} Λ)^{-1}), where z_i = (z_{i1}, ..., z_{ip*})', for i = 1, ..., n
• φ_{jh} | − ∼ Gamma((ν + 1)/2, (ν + τ_h λ_{jh}²)/2), for j = 1, ..., p*, h = 1, ..., k
• δ_1 | − ∼ Gamma(a_1 + p*k/2, 1 + (1/2) Σ_{l=1}^k τ_l^{(1)} Σ_{j=1}^{p*} φ_{jl} λ_{jl}²), and for h ≥ 2, δ_h | − ∼ Gamma(a_2 + p*(k − h + 1)/2, 1 + (1/2) Σ_{l=h}^k τ_l^{(h)} Σ_{j=1}^{p*} φ_{jl} λ_{jl}²), where τ_l^{(h)} = ∏_{t=1, t≠h}^l δ_t, for h = 1, ..., k

(2) Sample the intercepts α_j for columns corresponding to categorical variable levels:
• α_j | − ∼ N((n σ_j^{-2} + 1)^{-1} σ_j^{-2} Σ_{i=1}^n (z_ij − Σ_{h=1}^k λ_{jh} η_{ih}), (n σ_j^{-2} + 1)^{-1})

(3) Sample Z: for each column, sample z_ij from a truncated normal, with lower and upper bounds for each observation specified by the extended rank-probit likelihood:
• z_ij | − ∼ TN(α_j + Σ_{h=1}^k λ_{jh} η_{ih}, σ_j², z_ij^l, z_ij^u), where TN(μ, σ², a, b) denotes a truncated univariate normal with mean μ, variance σ², lower truncation a, and upper truncation b. For ordinal, count, and continuous variables, the truncation limits are z_ij^l = max{z_{i'j} : y_{i'j} < y_ij} and z_ij^u = min{z_{i'j} : y_{i'j} > y_ij}, where the maximum and minimum are taken over the entries of z_j excluding z_ij. For columns corresponding to categorical levels, the upper and lower truncation limits are

(11) z_ij^l = { 0 if y_ij = 1; −∞ if y_ij = 0 },  z_ij^u = { ∞ if y_ij = 1; 0 if y_ij = 0 }
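Block (3) reduces to independent univariate truncated normal draws. A sketch using SciPy's `truncnorm` (which is parameterized by standardized bounds), for both a rank-constrained numerical entry and a categorical-level entry under (11); the means, variances, and bounds are illustrative:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(2)

def sample_z(mean, sd, lo, hi):
    """One draw from TN(mean, sd^2, lo, hi), the latent update in block (3).
    truncnorm takes standardized bounds (lo - mean)/sd and (hi - mean)/sd."""
    a, b = (lo - mean) / sd, (hi - mean) / sd
    return truncnorm.rvs(a, b, loc=mean, scale=sd, random_state=rng)

# Ordinal-type entry: bounds come from neighboring latent values (rank constraint).
z_num = sample_z(mean=0.2, sd=1.0, lo=-0.5, hi=0.9)
assert -0.5 <= z_num <= 0.9

# Categorical-level entry with y_ij = 1: bounds (0, inf) per (11).
z_cat = sample_z(mean=-0.3, sd=1.0, lo=0.0, hi=np.inf)
assert z_cat > 0
```

Each coordinate update is O(1) given its bounds, which is the key to the scalability noted below.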
Most notably, the factor model parameters and the intercepts benefit from the conditional Gaussianity in (10), while the latent data imputation of {z_ij} consists of univariate truncated normal sampling steps. In conjunction, these components produce an MCMC algorithm that is surprisingly scalable given the complexity and dimensionality of the model.

4.3. Data Synthesis.
Posterior predictive sampling is used to construct a fully synthetic dataset. In particular, given draws from the posterior distribution of z_ij via (10) using the aforementioned sampling algorithm, posterior predictive draws for the non-categorical variables are simply given by ỹ_ij = F̂_j^{-1}{Φ(z_ij)}. As a result, posterior predictive samples can be generated rapidly for these variables.

However, posterior predictive draws for the categorical variables present additional challenges due to the ordering constraints. For each synthetic observation i, the latent variables z_ij1, ..., z_ijk for categorical variable j must satisfy the restriction that one and only one of these components is positive, which corresponds to the assigned category level. Since the full latent vector z may contain multiple categorical variables as well as a variety of non-categorical variables, enforcing such a constraint in the presence of the multivariate dependence implied by (10) creates significant computational issues. Note that this issue persists only for posterior predictive sampling: in the case of posterior sampling of z_ij1, ..., z_ijk, the observed data {y_ij} inform exactly which of these components must be positive, while the remaining components are constrained to be negative. In that case, samples are generated from independent univariate truncated normals. By comparison, for posterior predictive sampling, we only know that exactly one of these components, but not which component, must be positive. Hence, the constraint region is more difficult to enforce.

To circumvent these challenges, we design a posterior predictive sampling approach that generates joint realizations ỹ = (ỹ_cat, ỹ_−cat) by (i) marginally simulating the categorical variables ỹ_cat and (ii) simulating the non-categorical variables ỹ_−cat | ỹ_cat conditionally on the categorical variables. The main advantages follow from (ii).
First, given the categorical realizations ỹ_cat, the corresponding latent variables z̃_cat are generated exactly as in the MCMC algorithm. More specifically, ỹ_cat determines precisely which of the components of z_cat must be positive, while the others must be negative. Hence, simulating z_cat | ỹ_cat, − from the full conditional distribution only requires sampling from a multivariate truncated normal distribution, a task made simple by the availability of several existing software packages such as TruncatedNormal. Second, given these draws of the latent z_cat, the augmented full conditional distribution z_−cat | z_cat, − simply modifies (10) using standard results for Gaussian conditionals.

For the marginal sampling of ỹ_cat in (i), we adopt a fast and simple approach similar to the empirical CDF estimates of F_j for non-categorical variables. Let p̂ denote the vector of empirical probabilities over the cross-classification of all categorical variables. For example, the North Carolina dataset features two variables that we model as categorical, mother's race and education level, with five and four categories, respectively. Consequently, p̂ is a 20-dimensional vector of empirical probabilities. Using this empirical estimate, we simply generate ỹ_cat from a multinomial distribution with probability vector p̂. Notably, this approach ensures that categorical observations belong to exactly one level, and that we observe consistent marginal properties for these categorical variables. As for the empirical estimates of F̂_j, we may substitute any marginal model for p̂ at this stage.
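The Gaussian conditioning in (ii) uses only standard multivariate normal results. A sketch of computing the conditional moments of z_−cat given z_cat under z ∼ N(α, C); the indices, intercepts, and correlation matrix below are illustrative:

```python
import numpy as np

def conditional_moments(alpha, C, cat_idx, z_cat):
    """Moments of z_{-cat} | z_cat under z ~ N(alpha, C): the standard
    Gaussian conditional used in step (ii) of the synthesis algorithm."""
    p = len(alpha)
    rest = np.setdiff1d(np.arange(p), cat_idx)
    C_rc = C[np.ix_(rest, cat_idx)]
    C_cc_inv = np.linalg.inv(C[np.ix_(cat_idx, cat_idx)])
    mean = alpha[rest] + C_rc @ C_cc_inv @ (z_cat - alpha[cat_idx])
    cov = C[np.ix_(rest, rest)] - C_rc @ C_cc_inv @ C_rc.T
    return mean, cov

alpha = np.array([0.5, -0.5, 0.0, 0.0])   # non-zero only on categorical levels
C = np.array([[1.0, 0.0, 0.4, 0.1],
              [0.0, 1.0, 0.2, 0.3],
              [0.4, 0.2, 1.0, 0.5],
              [0.1, 0.3, 0.5, 1.0]])
mean, cov = conditional_moments(alpha, C, cat_idx=np.array([0, 1]),
                                z_cat=np.array([1.2, -0.8]))
assert mean.shape == (2,) and cov.shape == (2, 2)
assert np.all(np.linalg.eigvalsh(cov) > 0)   # a valid covariance matrix
```

Drawing z̃_−cat from N(mean, cov) and mapping through the empirical inverse CDFs then completes a synthetic record.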
Note that this step does not preclude the inclusion of the categorical variables in the extended rank-probit likelihood and factor model (10): the sampling step ỹ_-cat | ỹ_cat relies on the learned associations from the latent z_-cat | z_cat, − under the proposed model.

Therefore, this approach maintains the core of a semiparametric Gaussian copula model, in which the marginal distributions are estimated empirically and the joint distributions are modeled via latent Gaussian variables. The primary difference in the proposed approach is that both the marginal and the joint models appropriately incorporate unordered categorical variables along with count, binary, and continuous variables. The algorithm to create one synthetic observation is summarized below:

  ỹ_cat ∼ Multinomial(p̂)
  z̃_cat | ỹ_cat, − ∼ TMVN(α_cat, C_cat, l, u)
  z̃_-cat | z̃_cat, − ∼ MVN(α*, C*)
  ỹ_{-cat,j} = F̂_j^{-1}{Φ(z̃_{-cat,j})}

where (α_cat, C_cat) is the sub-vector and sub-matrix of (α, C) corresponding to categorical components, (l, u) are the vectors of lower and upper truncation limits defined in (11) and determined by the realization of ỹ_cat, and

  α* = α_-cat + C_{-cat,cat} C_{cat,cat}^{-1} (z̃_cat − α_cat),
  C* = C_{-cat,-cat} − C_{-cat,cat} C_{cat,cat}^{-1} C_{cat,-cat}

are the moments of the conditional multivariate Gaussian distribution z_-cat | z_cat, −. Note that this algorithm uses posterior samples of the key parameters (α, C).

5. Adaptations for Nonlinear Regression
For our dataset, perhaps the most desired quality of the synthetic data is to maintain the relationships between educational development (EOG test scores) and social or environmental exposures. However, the Gaussian copula may be insufficient for modeling more complex relationships due to the linear factor model on the latent z. While a semiparametric, fully nonlinear joint model is an attractive alternative to consider, the required modeling and computational demands introduce significant challenges, including a high risk of overfitting. In the case of synthetic data generation, this may result in high attribute disclosure risk and jeopardize privacy. Instead, we deploy the general flexibility of our framework specifically for modeling the EOG test scores conditional on the remaining variables using a semiparametric regression model.

Let Y_resp denote the response variables of interest and Y_cop the remaining variables. We decompose the data synthesis into P(Y_cop, Y_resp) = P(Y_cop) P(Y_resp | Y_cop), where P(Y_cop) is given by the extended rank-probit likelihood factor model and P(Y_resp | Y_cop) is a semiparametric regression model. By isolating these variables Y_resp from the joint model and synthesizing them conditional on social and environmental exposures using a flexible, semiparametric regression, we enhance the complexity of the synthetic data without jeopardizing the convenience of creating the vast majority of the synthetic dataset through the joint copula model.

Using this strategy, we develop a nonlinear regression model motivated by the semiparametric Gaussian copula, where the augmented data adhere to the rank likelihood and observed data are realized by linking latent variables through the empirical CDF. By treating Y_cop as covariates, the conditional model for each y_i^resp given y_i^cop is

  z_i = f(y_i^cop) + ε_i,  ε_i ~ iid N(0, σ²)   (12)
  y_i^resp = F̂^{-1}{Φ(z_i)}   (13)

where f is a flexible regression function.
Since the EOG test scores are integer-valued, (12)-(13) presents a semiparametric version of [18]. More importantly, (12)-(13) is coupled with a rank likelihood for {z_i}, so these latent data preserve the ordering in the observed data {y_i^resp}. The presence of the empirical CDF F̂ ensures that the posterior predictive draws associated with (12)-(13) will maintain the marginal properties of {y_i^resp}, while the regression function f captures the conditional dependence of Y_resp | Y_cop. For the regression function f, we use Bayesian Additive Regression Trees (BART), which are capable of modeling nonlinear and low-order interaction effects among the covariates Y_cop [19]. Because of the conditional Gaussianity in (12)-(13), the parameters of f can be sampled using existing algorithms for Gaussian BART models, such as the dbarts package in R. Specifically, the sampling algorithm has two blocks:

  • Sample the latent data: z_i | f, σ ~ TN(f(y_i^cop), σ², z_i^l, z_i^u) for i = 1, ..., n, where z_i^l = max{z_j : y_j < y_i}, z_i^u = min{z_j : y_j > y_i}, and the maximum and minimum are taken over j ≠ i
  • Sample the BART parameters: f, σ | z

The first block follows from the rank likelihood construction, while the second block proceeds using a Gaussian BART sampling step [19].

For data synthesis, let ỹ_i^cop denote a draw from the posterior predictive distribution of the copula model for Y_cop. The nonlinear outcomes are synthesized as ỹ_i^resp = F̂^{-1}{Φ(z̃_i)} where z̃_i ~ N(f̂(ỹ_i^cop), σ̂²). In the sampling step for z̃_i, we use the posterior means f̂ and σ̂² of f and σ², respectively. Although we could use any posterior samples of f and σ at this step, the use of the posterior means simplifies and stabilizes the data synthesis.
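The first sampling block can be sketched as follows, with a fixed stand-in regression function f in place of the BART fit and a bisection-based inverse-CDF draw for the truncated normal; the rank-based initialization and all data values are illustrative:

```python
import math
import random

rng = random.Random(3)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_quantile(p):
    """Standard normal quantile by bisection (adequate for a sketch)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if std_normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_latents(y, x, f, sigma):
    """One Gibbs pass of the rank-likelihood block:
    z_i ~ TN(f(x_i), sigma^2, z_i^l, z_i^u), with the truncation bounds taken
    from the current latents of the other observations."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    z = [0.0] * n
    for r, i in enumerate(order):
        z[i] = std_normal_quantile((r + 0.5) / n)   # rank-based init, ordered like y
    for i in range(n):
        lower = max((z[j] for j in range(n) if j != i and y[j] < y[i]), default=-math.inf)
        upper = min((z[j] for j in range(n) if j != i and y[j] > y[i]), default=math.inf)
        a = std_normal_cdf((lower - f(x[i])) / sigma) if lower > -math.inf else 0.0
        b = std_normal_cdf((upper - f(x[i])) / sigma) if upper < math.inf else 1.0
        u = min(max(rng.uniform(a, b), 1e-12), 1.0 - 1e-12)
        z[i] = f(x[i]) + sigma * std_normal_quantile(u)
    return z

# Illustrative data with a stand-in nonlinear f; in the actual algorithm f is
# a BART fit (e.g., via the dbarts package in R).
f = lambda t: math.sin(t)
x = [0.1 * t for t in range(30)]
y = [round(10 * f(t)) + rng.randint(0, 2) for t in x]   # integer-valued response
z = sample_latents(y, x, f, sigma=0.5)
```

Each update keeps z_i strictly inside its rank-preserving interval, so after the pass the latent data retain the ordering of the observed responses, which is exactly the rank-likelihood constraint.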
Either way, we note that the nonlinearity afforded by (12) requires minimal computational burden: the other variables Y_cop are modeled and synthesized separately using the extended rank-probit likelihood and factor model, while the modeling and synthesis of Y_resp | Y_cop features the same computational complexity as a BART probit regression.

6. Results
6.1. Simulation Study.
We first highlight the gains of the rank-probit likelihood (RPL) over the rank likelihood (RL; [8]) as a model for mixed unordered categorical and numerical data through a simulation study. We construct a simulated dataset with two variables: one unordered categorical variable (x_1) with five levels and one integer-valued variable (x_2). The simulation design is comparable to the relationship between mother's race (x_1) and EOG test score (x_2) in the North Carolina dataset. First, we select the marginal proportions p for the five categories of x_1 and simulate x_1 from a multinomial distribution. Next, conditional on the level, we simulate the integer-valued variable x_2 | x_1 = l ~ Poisson(r_l), where the Poisson rates are r = (342, ...), for n = 5000 individuals. Note that the levels of x_1 are assigned integer values from 1 to 5 for notational simplicity, but we do not assume an ordering among the levels of x_1.

For both the RPL and the RL, 500 synthetic datasets are generated, and the properties of these datasets are compared to those of the original data. For the RPL, x_1 is given the full binarization in (6), the copula correlation matrix and a non-zero intercept α are estimated through the factor model (10), and datasets are synthesized as in Section 4.3. For the RL, we employ the workaround described in Section 3: the categorical variable is one-hot-encoded as four independent binary variables with base category l = 1. Estimation of the copula proceeds under the default prior specification using the sbgcop package [8], and data synthesis is carried out through posterior predictive sampling. The MCMC for both models was run for 15,000 iterations, of which 9,000 were discarded as burn-in. Each of the 500 synthetic datasets has the same size (n = 5000) as the original simulated data.

For each synthetic dataset, we recorded the mean of x_2 by categorical level, X̄_l^syn for l = 1, ..., 5, akin to computing the average EOG test score by mother's race level in the North Carolina dataset. We then compute the mean squared error between these values and the ground truth, MSE = 5^{-1} Σ_{l=1}^{5} (X̄_l^syn − X̄_l)², where X̄_l is the sample mean of x_2 | x_1 = l. This MSE measures the ability of the synthetic data to capture the dependence between an unordered categorical variable and an integer-valued variable. We compute the average and standard error of this statistic across all 500 synthetic datasets. In addition, we record the proportion of individuals for which the categorical variable x_1 is erroneously assigned multiple categories. A sizeable proportion of multiply classified individuals in a synthetic dataset is problematic, since these individuals would have to be discarded to maintain consistent categorizations with the observed data. Such an ad hoc process can skew the marginal categorical proportions, negatively impact other bivariate or joint dependencies, and require further modifications to ensure that the synthetic data have the same dimensions (n) as the original data.

The results are presented in Table 2. Clearly, the proposed RPL approach is better suited for modeling relationships between categorical and numerical measurements. Conditional distributions x_2 | x_1 = l consistently and closely concentrate around the ground-truth group means. In addition, our method for data synthesis provides universally feasible categorical observations, while the RL with one-hot encoding generates infeasible categorical data in more than 12% of observations.

  Method   Avg. MSE   S.E. MSE   % Multiple Classified Individuals
  RPL                            0
  RL       2.359      0.763      12.4
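The simulation design and the MSE utility statistic above can be sketched as follows; the proportions and Poisson rates here are illustrative stand-ins (the paper's exact p and r are not fully reproduced above), and a second independent simulation plays the role of one synthetic dataset:

```python
import math
import random

rng = random.Random(11)

def poisson(lam, rng):
    """Knuth's method; adequate here since exp(-lam) remains representable."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p < L:
            return k
        k += 1

def simulate(n, p, r, rng):
    """x1 ~ multinomial(p) over levels 1..5; x2 | x1 = l ~ Poisson(r_l)."""
    x1 = rng.choices(range(1, 6), weights=p, k=n)
    x2 = [poisson(r[l - 1], rng) for l in x1]
    return x1, x2

def group_means(x1, x2):
    sums, counts = [0.0] * 5, [0] * 5
    for l, v in zip(x1, x2):
        sums[l - 1] += v
        counts[l - 1] += 1
    return [s / c for s, c in zip(sums, counts)]

def mse(syn_means, true_means):
    """MSE = (1/5) * sum over l of (Xbar_l_syn - Xbar_l)^2."""
    return sum((a - b) ** 2 for a, b in zip(syn_means, true_means)) / 5

# Illustrative proportions and rates (stand-ins for the paper's values).
p = [0.4, 0.25, 0.15, 0.12, 0.08]
r = [342, 345, 348, 351, 354]
x1, x2 = simulate(2000, p, r, rng)       # plays the role of the observed data
truth = group_means(x1, x2)
x1s, x2s = simulate(2000, p, r, rng)     # stand-in for one synthetic dataset
utility = mse(group_means(x1s, x2s), truth)
```

A synthesizer that preserves the categorical-numeric dependence should yield group means close to the observed ones and hence a small MSE.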
Table 2.
The proposed RPL more closely and consistently preserves categorical-numeric relationships and completely avoids erroneous assignment of multiple category levels for an individual.

Further advantages of the RPL over the RL are apparent in the modeling of ordinal variables with few levels. For instance, mother's education is an ordinal variable, but has a small number of levels (our dataset uses four levels). While the RL is specifically designed to model such variables as ordinal, with the RPL we are afforded the choice of whether to treat them as ordinal or unordered categorical variables. Perhaps surprisingly, the latter approach can offer significant improvements.

We return to the same simulated dataset used in the previous demonstration, but this time the integer assignments to x_1 are treated as numerical values in the RL, which is how [8] and [9] propose to handle ordinal variables. Since the expectation of x_2 | x_1 = l is increasing in l, an ordinal treatment would appear to be reasonable. For the RPL, we continue to treat this variable as categorical. We once again create 500 datasets under each method and record the group conditional means X̄_l^syn for each. The distribution of these group conditional means across the 500 synthetic datasets, along with overlaid 95% intervals (dotted green lines) and the ground-truth group conditional mean X̄_l (solid red line), is shown in Figure 1.
Figure 1.
Distribution of X̄_l^syn across synthetic datasets where x_1 is treated as categorical (top row) or ordinal (bottom row). The performance of the RPL in modeling ordinal variables with few levels as categorical suggests additional advantages over the RL.

When the ordering on x_1 is acknowledged and the RL is employed, synthetic data are decidedly inconsistent; the distribution of X̄_l^syn across synthetic datasets concentrates far away from the ground truth for all but one level. Under the RPL, the analyst can choose whether to treat such variables as ordinal or categorical, and this choice can greatly improve the quality of the synthetic data. We provide further simulation results in the supplementary materials regarding the sensitivity of these results to the number of levels of the ordinal x_1. Based on this analysis, we recommend that any ordinal variable with fewer than 10 levels be treated as categorical within the RPL.

6.2. Synthetic North Carolina Data.
We apply the proposed methods to construct a synthetic version of the North Carolina dataset. The dataset, described in Table 1, contains n = 19,364 records of p = 26 variables. Both mother's race and mother's education are modeled as unordered categorical variables. Although a natural ordering exists for mother's education, the simulations from Section 6.1 strongly suggest that because the variable has so few levels (four), accuracy can be improved significantly by treating it as an unordered categorical variable. In addition, we target the synthesis for prediction of EOG reading and mathematics scores based on the remaining variables, including demographic and health information, environmental exposures, and social stressors. Specifically, we apply the model from Section 5 with the EOG reading and mathematics scores as the response variables Y_resp. By using the BART model in (12)-(13), our data synthesis can capture and reproduce nonlinear and interactive associations among these demographic, health, and exposure variables for predicting cognitive development. The MCMC algorithm for the remaining variables Y_cop, modeled using the extended rank-probit likelihood and factor model, was run for 50,000 iterations, of which 25,000 were discarded as burn-in. The sampler for Y_resp was run for 1,100 iterations, of which 100 were discarded as burn-in.

To evaluate our synthetic data, we first conduct univariate (Figure 2) and bivariate (Table 3) assessments of data utility by comparing synthetic and observed data. Because our method utilizes empirical CDFs for continuous (with an additional kernel smoother), ordinal, and count variables, the synthetic data capture notably non-Gaussian margins. In addition, we see consistent cross-tabulations of the two categorical variables, mother's education and mother's race, between synthetic and observed data, based on our approach to data synthesis outlined in Section 4.3.

Figure 2.
Comparison of synthetic (light blue) and observed (dark blue) marginal distributions for mother's age (left) and chronic PM2.5 exposure (right). Despite the notable non-Gaussianity, the synthetic data distributions closely match the observed data distributions.
Synthetic (Observed) Cross-Tabulations of Mother’s Education and Race
                                NH White       NH Black       Hisp           NH Asian/PI    NH Other
No High School Diploma          0.091 (0.093)  0.086 (0.085)  0.087 (0.091)  0.005 (0.005)  0.002 (0.003)
High School Diploma             0.172 (0.167)  0.134 (0.135)  0.032 (0.032)  0.006 (0.007)  0.002 (0.002)
Some College/Associates Deg.    0.118 (0.116)  0.078 (0.078)  0.008 (0.008)  0.003 (0.003)  0.002 (0.001)
Bachelor's or Above             0.126 (0.127)  0.037 (0.036)  0.004 (0.004)  0.005 (0.005)  0.001 (0.001)
Table 3.
The synthetic RPL data preserve the multivariate categorical properties observed in the original data.

Regression of EOG math and reading scores on social and environmental exposures can be used to uncover main and low-order interaction effects relating the exposure profiles of children to educational development [2]. An important assurance of the quality of our synthetic data is whether regression models fit to synthetic and observed data produce highly similar results. This type of utility must be prioritized in the creation of our synthetic dataset given how it will likely be analyzed by others.

To assess these conditional or regression associations, we fit Bayesian linear regression models to both the synthetic and observed data and compare the posterior inference on the resulting regression coefficients. First, EOG reading scores are regressed on social stressors, environmental exposures, and other baseline demographic and health information. Next, we augment this model to include several key interactions between social and environmental exposures as in [2]. In both models, numerical predictors are scaled to mean zero and unit variance. The regression coefficients are assigned horseshoe priors [20] so that, in addition to comparing the strength of posterior signals, we can also evaluate whether the synthetic data produce consistent parameter shrinkage. This procedure applied to the synthetic data is a posterior predictive functional that is specifically customized for our task, reliable data synthesis for regression, as advocated by [21].

For baseline comparisons, we again construct a synthetic dataset using the RL [8]. We employ the categorical workaround for mother's race and treat mother's education as an ordinal variable with increasing levels from 1 (No High School Diploma) to 4 (Bachelor's or Above). Once again, the issue of multiply classified observations became apparent, as the synthesis of mother's race resulted in 500 individuals belonging to at least two race groups.
We discarded these individuals prior to the regression analysis, since synthetic datasets should adhere to the same categorization conventions as the observed data, especially for a variable as important in context as race. While this decision maintained a necessary consistency between synthetic and observed data, it immediately diminished the utility of the RL: the resultant synthetic dataset possessed inconsistent marginal proportions for mother's race.

For each linear model and synthetic dataset, we compute statistics to measure point and interval estimation accuracy. First, we compute standardized mean squared errors between the regression coefficient estimates under the observed and synthetic datasets. Concretely, for each predictor, we take the posterior mean by dataset, β̂_obs and β̂_syn, along with the standard deviation of the posterior samples of this regression coefficient from the observed data, σ̂_{β_obs}, and compute {(β̂_obs − β̂_syn)/σ̂_{β_obs}}². This statistic penalizes large deviations between posterior mean estimates for coefficients with low posterior uncertainty in the observed data. Therefore, synthetic data with low MSEs will have highly similar posterior mean coefficient estimates, reflecting a pattern of consistent inference.

Next, we compute the average interval overlap for 95% posterior credible intervals fit to the observed data and to each synthetic dataset. For each variable, we compute

  CIO = 0.5 × [ {min(u_obs, u_syn) − max(l_obs, l_syn)} / (u_obs − l_obs) + {min(u_obs, u_syn) − max(l_obs, l_syn)} / (u_syn − l_syn) ],

where (l_obs, u_obs) and (l_syn, u_syn) are the lower and upper interval endpoints for the observed dataset and each synthetic dataset, respectively. Larger values reflect more consistent overlap of credible intervals and thus more useful synthetic data [22].

Table 4 reports average CIOs and MSEs for each model fit to the competing synthetic datasets. According to these metrics, the RPL synthetic data is far more useful than the RL synthetic data for regression with or without interactions. These improvements persist in both estimation accuracy and interval overlap, which suggests that the proposed approach for data synthesis offers more fidelity to the observed data for regression analysis.

Utility Statistics for Models fit to Competing Synthetic Datasets
  Model                             MSE     CIO
  Main Effects RPL
  Main Effects RL                   7.417   0.515
  Main Effects + Interactions RPL
  Main Effects + Interactions RL    5.755   0.567
Table 4.
RPL synthetic data has significantly higher utility than RL synthetic data according to MSE and average CIO.

A more nuanced view of the performance of each method comes in Figure 3, which compares posterior mean coefficient estimates and 95% credible intervals between synthetic and observed datasets under the augmented linear model. In this plot, we see striking discrepancies in the intervals and parameter estimates for the variables describing mother's race and highest education level for the RL data (see the red bounding boxes). This confirms the insight provided by the simulation study, namely that the RL is insufficient for modeling categorical variables and ordinal data with few levels. In treating both of these variables as categorical under the RPL, we observe significant improvement, as evidenced by the substantial overlap in credible intervals for these variables in the left panel of Figure 3.
Figure 3.
Posterior credible intervals and mean coefficient estimates (bisecting 'x') under a Bayesian linear regression with several first-order interactions. RPL synthetic data outperforms RL synthetic data, especially for categorical predictors (red bounding boxes).
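The two utility statistics used above, the standardized coefficient MSE and the credible interval overlap (CIO), reduce to short functions; the sketch below uses our own variable names and illustrative interval values:

```python
def standardized_mse(beta_obs, beta_syn, sd_obs):
    """Average of {(beta_obs - beta_syn) / sd(beta_obs)}^2 across predictors."""
    terms = [((bo - bs) / s) ** 2 for bo, bs, s in zip(beta_obs, beta_syn, sd_obs)]
    return sum(terms) / len(terms)

def interval_overlap(l_obs, u_obs, l_syn, u_syn):
    """CIO = 0.5 * [(min(u) - max(l)) / (u_obs - l_obs)
                  + (min(u) - max(l)) / (u_syn - l_syn)];
    equals 1 for identical intervals and is negative for disjoint ones."""
    overlap = min(u_obs, u_syn) - max(l_obs, l_syn)
    return 0.5 * (overlap / (u_obs - l_obs) + overlap / (u_syn - l_syn))

# Identical intervals give perfect overlap; half-overlapping intervals give 0.5.
cio_same = interval_overlap(-1.0, 1.0, -1.0, 1.0)   # 1.0
cio_half = interval_overlap(0.0, 2.0, 1.0, 3.0)     # 0.5
```

In practice these functions would be applied coefficient by coefficient to the posterior summaries from the observed-data and synthetic-data regressions and then averaged.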
6.3. Comparisons with Iterative Conditioning.
An alternative strategy for data synthesis relies on iterative conditioning: since P(Y_1, ..., Y_p) = P(Y_1) ∏_{j=2}^{p} P(Y_j | Y_1, ..., Y_{j-1}), a synthetic dataset can be generated by fitting separate models for each iterative conditional P(Y_j | Y_1, ..., Y_{j-1}) and then simulating from these models. [11] uses classification and regression trees (CART) for each iterative conditional, which is implemented in the synthpop package in R.

The primary limitation of iterative synthesis is the need to specify an ordering of the variables. Ideally, the analyst should choose the ordering that provides the most useful synthetic data. Although any ordering provides a valid deconstruction of the joint distribution of Y_1, ..., Y_p, the quality of the synthetic datasets can be highly sensitive to the ordering. Furthermore, it is not straightforward how best to optimize the ordering: a purely combinatorial search would require evaluating p! orderings, which is infeasible even for moderate dimensions. In addition, exhaustive searches increase the risk of overfitting, which has implications for privacy protection. By comparison, the proposed RPL approach generates synthetic data jointly, which obviates the need to select an ordering. At the same time, if there exists a favorable ordering or a conditional relationship of particular interest, such as a regression analysis with a certain response variable, the modifications in Section 5 can capitalize on this information.

To illustrate this point, we create 100 synthetic North Carolina datasets using iterative synthesis with CART. Each dataset is created with a different synthesis ordering specified by a unique, random permutation of the p = 26 variables in the dataset. For the CART models, we require at least five observations at the terminal nodes of each tree to preserve confidentiality.
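For intuition, iterative synthesis under a given ordering can be sketched with a deliberately simple conditional model; here a k-nearest-neighbor resampler stands in for CART, and the `order` argument is exactly the modeling choice that the joint RPL approach avoids:

```python
import random

rng = random.Random(5)

def iterative_synthesis(data, order, k=5):
    """Synthesize by the chain rule P(Y1,...,Yp) = P(Y1) * prod_j P(Yj | Y1..Y_{j-1}).
    Each conditional is a crude k-nearest-neighbor resampler standing in for CART."""
    n, p = len(data), len(data[0])
    synth = [[None] * p for _ in range(n)]
    first = order[0]
    for row in synth:                       # marginal model for the first variable
        row[first] = data[rng.randrange(n)][first]
    for depth, j in enumerate(order[1:], start=1):
        done = order[:depth]                # variables already synthesized
        for row in synth:
            # rank observed records by distance on the already-synthesized variables
            nearest = sorted(range(n),
                             key=lambda i: sum((data[i][v] - row[v]) ** 2 for v in done))
            donor = data[rng.choice(nearest[:k])]
            row[j] = donor[j]               # resample Yj from a near neighbor
    return synth

# Illustrative numeric data with p = 3 variables; different orderings can yield
# synthetic data of noticeably different quality.
data = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]
synth = iterative_synthesis(data, order=[2, 0, 1])
```

Rerunning with a different `order` generally produces a different synthetic dataset, which is the sensitivity examined in this section.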
Given each synthetic dataset, we again fit the Bayesian linear regression model with main effects and interactions from Section 6.2 and compute MSEs and CIOs.

The results are presented in Figure 4, which displays the distribution of these utility metrics over all of the synthetic datasets produced by varying the ordering for synthesis. The overlaid red lines depict the MSEs and average CIOs calculated from the synthetic dataset produced under our method. The variability in the utility statistics across synthetic datasets is notable, and the degree to which posterior inference remains consistent between observed and synthetic data is clearly sensitive to the ordering of synthesis. Additionally, our synthetic dataset offers highly competitive performance among all the CART datasets produced.
Figure 4.
Distributions of MSE and average CIO across synthetic CART datasets: synthetic data utility is sensitive to ordering, while our synthetic dataset provides highly competitive utility.

For more specific comparisons, we extract the "significant" predictors of EOG reading scores, defined as those variables for which the 95% credible interval for the regression coefficient excludes zero in the model fit to the observed data. In Figure 5, we compare the posterior means of these "significant" coefficients with the same estimates obtained from the RPL synthetic data, as well as the estimates computed from all 100 CART synthetic datasets with distinct variable orderings. The strongest signals display the highest variability among the different CART datasets. For instance, across CART datasets, estimates of the effect of Bachelor's or Above range between 4.4 and 7.2, and for Non-Hispanic Black the effects range between -4.8 and -2.7. Meanwhile, the significant coefficient estimates under synthetic RPL data are generally quite close to the observed values, yielding similar inference from this linear model whether it is fit to our synthetic data or the observed data.
Figure 5.
Distribution of significant coefficient estimates across CART datasets, with estimates from the observed data and from the RPL synthetic data overlaid. The variability of the estimates among CART datasets is high, especially for large signals, which would lead to differing inference. In contrast, the RPL estimates are consistently close to the observed estimates.

6.4. Privacy Implications.
Finally, we highlight the privacy preservation of our synthetic dataset. For fully synthetic data, the risk of re-identification is generally considered non-existent, since synthetic individuals do not correspond to any real individuals. However, the risk of sensitive attribute disclosure may persist, since an adversary can potentially use synthetic data to accurately identify specific sensitive attributes of real individuals [23]. For example, suppose it is public knowledge that a student was born in a neighborhood that has the highest Neighborhood Deprivation Index (NDI). In addition, suppose that the student scored a 347 on the EOG reading assessment, which is sensitive information. In the presence of overfitting, it is possible that a synthesized individual has an EOG score of 347 and the maximum NDI value. As a result, an adversary knowing only publicly available information can infer this student's sensitive attribute by simply referring to the synthetic data. Clearly, attribute disclosure risk may arise when (i) an individual has outliers for publicly available data and (ii) the synthesizer suffers from overfitting.

To assess the attribute disclosure risk of our synthetic data, we study the sensitive attributes corresponding to outlier points in both the synthetic and observed data. Specifically, for a given publicly available attribute, we examine the sensitive attributes corresponding to the first and last fifty order statistics for this variable in both the observed and synthetic data. For numerical sensitive attributes, we measure the R² from the regression of observed sensitive values on synthetic sensitive values.
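This numerical disclosure diagnostic reduces to the R² from a simple least-squares fit of observed sensitive values on synthetic sensitive values; the sketch below uses hypothetical data, whereas in the paper the inputs are the sensitive values attached to the extreme fifty order statistics of a public variable:

```python
def r_squared(x_syn, y_obs):
    """R^2 from regressing observed sensitive values on synthetic ones
    (for simple OLS this equals the squared sample correlation)."""
    n = len(x_syn)
    mx = sum(x_syn) / n
    my = sum(y_obs) / n
    sxx = sum((x - mx) ** 2 for x in x_syn)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_syn, y_obs))
    syy = sum((y - my) ** 2 for y in y_obs)
    return (sxy ** 2) / (sxx * syy)

# A protective synthesizer should give R^2 near 0; a perfect leak gives 1.
leaky = r_squared([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])        # 1.0
weak = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0])       # 0.2
```

Values near zero indicate that the synthetic outliers carry little predictive power about the corresponding observed sensitive attributes.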
If the synthetic data are protective, this statistic should be close to zero: sensitive attributes corresponding to synthetic outlier points should have little predictive power for the corresponding observed sensitive values. For categorical or binary variables, we can similarly estimate the accuracy with which an adversary could use these corresponding sensitive attributes in the synthetic data to predict observed sensitive values.

We focus on two diagnostics. First, we examine the synthesized and observed EOG reading scores corresponding to extreme values of NDI at birth. Next, we examine birth weight percentile corresponding to extreme values of first trimester average temperature exposure. The results are presented in Figure 6. In both cases, sensitive attributes corresponding to synthetic outlier points have little explanatory power (R² = 0.107 for EOG reading scores given NDI outliers; R² = 0.0036 for birth weight percentiles given temperature outliers). For the EOG-NDI relationship, it is important to note that there is a non-negligible positive correlation in the observed data between the two variables. This is likely why we observe a higher R² value: outlier points in each dataset are likely to reflect this relationship consistently between synthetic and observed data. We conducted similar checks for several other combinations of sensitive and public variables and found no noticeable attribute disclosure risks.
Figure 6.
Outlier attribute disclosure risk: the dashed line is the fit from regressing observed on synthetic sensitive values. Scales on the x and y axes have been removed to prevent the dissemination of any information in the observed data.

For thorough comparisons, we also recorded the EOG-NDI disclosure risk for all synthetic datasets produced under the CART iterative synthesizer. While the average R² for these synthetic datasets (0.11) was comparable to the RPL value, one synthetic dataset from CART produced an R² of 0.5. This shows the potential for the iterative synthesizer to drastically overfit the data, leading to privacy concerns. In particular, the choice of ordering for iterative synthesizers is critical not only for synthetic data utility, but also for attribute disclosure risk. The ability of the RPL synthesizer to sidestep these challenges is a major benefit of the proposed approach.

7. Conclusion
We have developed a Bayesian semiparametric modeling framework for mixed categorical, binary, count, and continuous variables. The joint Bayesian model is particularly useful for generating reliable and privacy-preserving synthetic datasets based on sensitive micro data. Micro data collected on individuals are essential for health and health disparities research and practice, yet often cannot be released due to privacy concerns. This undermines scientific reproducibility and inhibits further study of critical questions about public health and child development. The proposed approach remedies these issues by generating synthetic datasets that empirically capture marginal, bivariate, and (nonlinear) regression associations among variables, while limiting privacy concerns from re-identification and attribute disclosure risks. Unlike existing methods for data synthesis, the proposed extended rank-probit likelihood incorporates unordered categorical variables, alongside binary, count, and continuous variables, and can synthesize variables without needing to pre-select a variable ordering. Since regression analysis is a common tool in epidemiological studies, we introduce a modified data synthesis strategy to target and preserve key conditional relationships, including both nonlinearities and interactions.

Using the proposed methods, we generated a synthetic version of a confidential dataset containing dozens of health, educational, and social measurements on nearly 20,000 North Carolina children. Compared to the original dataset, the synthetic version matches the marginal distributions, even for highly non-Gaussian variables; reproduces cross-tabulations between categorical variables; preserves posterior inference for linear regressions with shrinkage priors and interactions; and limits both re-identification and attribute disclosure risks.
The proposed approach demonstrates substantial improvements over existing methods, especially for unordered categorical variables and ordinal variables with few levels, in both utility and privacy preservation. The role of unordered categorical variables, in conjunction with binary, count, and continuous variables, cannot be overstated: race or ethnicity is one such variable, and is a critical factor in health and health disparities research and practice.

Although we have deployed the extended rank-probit likelihood factor model for data synthesis, this Bayesian modeling framework may also be useful for inference on mixed data. For instance, the copula correlation matrices C offer a new way to quantify associations among categorical, binary, count, and continuous variables.
In addition, the extended rank-probit likelihood provides a natural extension of multinomial choice models. Specifically, the diagonal orthant probit multinomial model of [16] relies on a diagonal covariance for model fitting, which translates to an assumption of independence of irrelevant alternatives. The extended rank-probit likelihood, coupled with the factor model (10), offers an opportunity to relax this assumption.

Beyond these inferential extensions, future work will investigate the possibility of incorporating nonlinearity into the Gaussian copula model. For instance, replacing (10) with a nonlinear factor model may offer similar advantages to the nonlinear regression in Section 5, yet without the need to single out the response variables of interest. However, additional modeling complexity will introduce new computational challenges and will require special care to avoid overfitting, which can lead to greater attribute disclosure risks.
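To make the inferential use of the copula correlation matrices C concrete, here is a simple frequentist plug-in version for fully continuous columns: transform each column to normal scores via its ranks, then correlate the scores. This is the classical rank-based estimator in the spirit of [8], not the paper's Bayesian posterior for C; `normal_scores_correlation` is an illustrative name.

```python
import numpy as np
from scipy.stats import norm, rankdata

def normal_scores_correlation(data):
    """Plug-in estimate of a Gaussian copula correlation matrix: map each
    column to normal scores through its ranks, then correlate the scores.
    The estimate is invariant to monotone transformations of each margin."""
    n, p = data.shape
    z = np.empty((n, p))
    for j in range(p):
        # ranks -> points in (0, 1) -> standard normal quantiles
        z[:, j] = norm.ppf(rankdata(data[:, j]) / (n + 1))
    return np.corrcoef(z, rowvar=False)

rng = np.random.default_rng(2)
latent = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=2000)
# Monotone distortions of each margin leave the copula correlation intact.
data = np.column_stack([np.exp(latent[:, 0]), latent[:, 1] ** 3])
C_hat = normal_scores_correlation(data)
```

The Bayesian treatment in the paper goes further by handling ties, counts, and unordered categories through the extended rank-probit likelihood, and by delivering full posterior uncertainty for C rather than a point estimate.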
Acknowledgements
The authors would like to thank Marie Lynn Miranda and Katherine B. Ensor for their valuable comments and feedback. Research reported in this publication was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under award number R01ES028819 and the Army Research Office (Kowal) under award number W911NF-20-1-0184. The content, views, and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Institutes of Health, the North Carolina Department of Health and Human Services, Division of Public Health, the Army Research Office, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
References

[1] M. L. Miranda, D. Kim, M. A. O. Galeano, C. J. Paul, A. P. Hull, and S. P. Morgan, "The relationship between early childhood blood lead levels and performance on end-of-grade tests," Environmental Health Perspectives, vol. 115, no. 8, pp. 1242–1247, 2007.
[2] D. R. Kowal, M. Bravo, H. Leong, R. J. Griffin, K. B. Ensor, and M. L. Miranda, "Bayesian Variable Selection for Understanding Mixtures in Environmental Exposures," 2020.
[3] D. B. Rubin, "Statistical disclosure limitation," Journal of Official Statistics, vol. 9, no. 2, pp. 461–468, 1993.
[4] J. S. Murray and J. P. Reiter, "Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence," Journal of the American Statistical Association, vol. 111, no. 516, pp. 1466–1479, 2016.
[5] D. B. Dunson and C. Xing, "Nonparametric Bayes modeling of multivariate categorical data," Journal of the American Statistical Association, vol. 104, no. 487, pp. 1042–1051, 2009.
[6] J. Hu, J. P. Reiter, and Q. Wang, "Disclosure risk evaluation for fully synthetic categorical data," in International Conference on Privacy in Statistical Databases, pp. 185–199, Springer, 2014.
[7] H. Quick, S. H. Holan, C. K. Wikle, and J. P. Reiter, "Bayesian marked point process modeling for generating fully synthetic public use data with point-referenced geography," Spatial Statistics, vol. 14, pp. 439–451, 2015.
[8] P. D. Hoff, "Extending the rank likelihood for semiparametric copula estimation," The Annals of Applied Statistics, vol. 1, no. 1, pp. 265–283, 2007.
[9] J. S. Murray, D. B. Dunson, L. Carin, and J. E. Lucas, "Bayesian Gaussian copula factor models for mixed data," Journal of the American Statistical Association, vol. 108, no. 502, pp. 656–665, 2013.
[10] J. P. Reiter, "Releasing multiply imputed, synthetic public use microdata: an illustration and empirical study," Journal of the Royal Statistical Society: Series A (Statistics in Society), vol. 168, no. 1, pp. 185–205, 2005.
[11] J. P. Reiter, "Using CART to generate partially synthetic public use microdata," Journal of Official Statistics, vol. 21, no. 3, p. 441, 2005.
[12] G. Caiola and J. P. Reiter, "Random forests for generating partially synthetic, categorical data," Trans. Data Priv., vol. 3, no. 1, pp. 27–42, 2010.
[13] S. K. Kinney, J. P. Reiter, A. P. Reznek, J. Miranda, R. S. Jarmin, and J. M. Abowd, "Towards unrestricted public use business microdata: The synthetic longitudinal business database," International Statistical Review, vol. 79, no. 3, pp. 362–384, 2011.
[14] M. Sklar, "Fonctions de répartition à n dimensions et leurs marges," Publ. Inst. Statist. Univ. Paris, vol. 8, pp. 229–231, 1959.
[15] J. H. Albert and S. Chib, "Bayesian analysis of binary and polychotomous response data," Journal of the American Statistical Association, vol. 88, no. 422, pp. 669–679, 1993.
[16] J. Johndrow, D. Dunson, and K. Lum, "Diagonal orthant multinomial probit models," in Artificial Intelligence and Statistics, pp. 29–38, 2013.
[17] A. Bhattacharya and D. B. Dunson, "Sparse Bayesian infinite factor models," Biometrika, pp. 291–306, 2011.
[18] D. R. Kowal and A. Canale, "Simultaneous transformation and rounding (STAR) models for integer-valued data," Electronic Journal of Statistics, vol. 14, no. 1, pp. 1744–1772, 2020.
[19] H. A. Chipman, E. I. George, and R. E. McCulloch, "BART: Bayesian additive regression trees," The Annals of Applied Statistics, vol. 4, no. 1, pp. 266–298, 2010.
[20] C. M. Carvalho, N. G. Polson, and J. G. Scott, "The horseshoe estimator for sparse signals," Biometrika, vol. 97, no. 2, pp. 465–480, 2010.
[21] D. R. Kowal, "Fast, Optimal, and Targeted Predictions using Parametrized Decision Analysis," arXiv preprint arXiv:2006.13107, 2020.
[22] J. Snoke, G. Raab, B. Nowok, C. Dibben, and A. Slavkovic, "General and specific utility measures for synthetic data," arXiv preprint arXiv:1604.06651, 2016.
[23] J. P. Reiter, Q. Wang, and B. Zhang, "Bayesian estimation of disclosure risks for multiply imputed, synthetic data," Journal of Privacy and Confidentiality, vol. 6, no. 1, 2014.
[24] T. E. Raghunathan, J. P. Reiter, and D. B. Rubin, "Multiple imputation for statistical disclosure limitation," Journal of Official Statistics, vol. 19, no. 1, p. 1, 2003.
[25] K. M. Quinn, "Bayesian factor analysis for mixed ordinal and continuous responses," Political Analysis, vol. 12, no. 4, pp. 338–353, 2004.
[26] B. Nowok, G. Raab, J. Snoke, and C. Dibben, "synthpop: Generating synthetic versions of sensitive microdata for statistical disclosure control," R package version, pp. 1–3, 2016.
[27] J. Drechsler and J. P. Reiter, "An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets," Computational Statistics & Data Analysis, vol. 55, no. 12, pp. 3232–3243, 2011.
[28] B. Nowok, "Utility of synthetic microdata generated using tree-based methods," UNECE Statistical Data Confidentiality Work Session, 2015.
[29] M.-J. Woo, J. P. Reiter, A. Oganian, and A. F. Karr, "Global measures of data utility for microdata masked for disclosure limitation," Journal of Privacy and Confidentiality, vol. 1, no. 1, 2009.
[30] J. P. Reiter and R. Mitra, "Estimating risks of identification disclosure in partially synthetic data," Journal of Privacy and Confidentiality, vol. 1, no. 1, 2009.
[31] F. Ferrari and D. B. Dunson, "Bayesian factor analysis for inference on interactions," Journal of the American Statistical Association, pp. 1–12, 2020.
[32] J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, "Bayesian factor regression models in the 'large p, small n' paradigm."