[PDF] A smoothing approach for masking spatial data

Abstract

Individual-level health data are often not publicly available due to confidentiality; masked data are released instead. Therefore, it is important to evaluate the utility of using the masked data in statistical analyses such as regression. In this paper we propose a data masking method which is based on spatial smoothing techniques. The proposed method allows for selecting both the form and the degree of masking, thus resulting in a large degree of flexibility. We investigate the utility of the masked data sets in terms of the mean square error (MSE) of regression parameter estimates when fitting a Generalized Linear Model (GLM) to the masked data. We also show that incorporating prior knowledge on the spatial pattern of the exposure into the data masking may reduce the bias and MSE of the parameter estimates. By evaluating both utility and disclosure risk as functions of the form and the degree of masking, our method produces a risk-utility profile which can facilitate the selection of masking parameters. We apply the method to a study of racial disparities in mortality rates using data on more than 4 million Medicare enrollees residing in 2095 zip codes in the Northeast region of the United States.

Full PDF

aa r X i v : . [ s t a t . A P ] N ov The Annals of Applied Statistics (cid:13)

Institute of Mathematical Statistics, 2010

A SMOOTHING APPROACH FOR MASKING SPATIAL DATA

By Yijie Zhou, Francesca Dominici and Thomas A. Louis Merck Research Laboratories, Harvard University and Johns HopkinsUniversity

Individual-level health data are often not publicly available due toconﬁdentiality; masked data are released instead. Therefore, it is im-portant to evaluate the utility of using the masked data in statisticalanalyses such as regression. In this paper we propose a data maskingmethod which is based on spatial smoothing techniques. The pro-posed method allows for selecting both the form and the degree ofmasking, thus resulting in a large degree of ﬂexibility. We investi-gate the utility of the masked data sets in terms of the mean squareerror (MSE) of regression parameter estimates when ﬁtting a Gener-alized Linear Model (GLM) to the masked data. We also show thatincorporating prior knowledge on the spatial pattern of the exposureinto the data masking may reduce the bias and MSE of the parameterestimates. By evaluating both utility and disclosure risk as functionsof the form and the degree of masking, our method produces a risk-utility proﬁle which can facilitate the selection of masking parame-ters. We apply the method to a study of racial disparities in mortalityrates using data on more than 4 million Medicare enrollees residingin 2095 zip codes in the Northeast region of the United States.

1. Introduction.

Individual-level information such as health data col-lected by, for example, government agencies, are often not publicly available

Received October 2007; revised December 2009. Supported in part by the National Institute for Environmental Health SciencesGrant R01ES012054 and by the Environmental Protection Agency Grants R83622 andRD83241701. The content is solely the responsibility of the authors and does not neces-sarily represent the oﬃcial views of the National Institute of Environmental Health Sciencenor of the Environmental Protection Agency. Supported in part by the National Institute of Diabetes, Digestive and Kidney DiseasesGrant R01 DK061662. The content is solely the responsibility of the authors and does notnecessarily represent the oﬃcial views of the National Institute of Diabetes, Digestive andKidney Diseases.

Key words and phrases.

Statistical disclosure limitation, data masking, data utility, dis-closure risk, spatial smoothing.

This is an electronic reprint of the original article published by theInstitute of Mathematical Statistics in

The Annals of Applied Statistics ,2010, Vol. 4, No. 3, 1451–1475. This reprint diﬀers from the original in paginationand typographic detail. 1

Y. ZHOU, F. DOMINICI AND T. A. LOUIS in order to preserve conﬁdentiality. On the other hand, there is public de-mand on these individual-level data for research purposes. As an example,associations of individual health with various risk factors are of great interestand concern nowadays. Statistical research that addresses these two compet-ing needs is known as statistical disclosure limitation , where a large numberof methods are developed on how to process and release information thatis subject to conﬁdentiality concern [Duncan and Lambert (1986); Fienbergand Willenborg (1998); Willenborg and Waal (1996, 2001)]. In this paper werefer to those methods that alter the original data values as “data masking.”Corresponding to the two competing needs, a data masking method shouldbe evaluated from both the utility of the masked data which represents theinformation retained after the masking, and the disclosure risk of the maskeddata which is the risk that a data intruder can obtain conﬁdential informa-tion (e.g., obtain original data values and/or identify an individual to whoma data record belongs). Ideally, masked data would have low disclosure riskwhile preserving data utility as much as possible.Examples of commonly used data masking methods include aggregatedtabular counts for categorical data [Fienberg and Slavkovic (2004)], dataswapping which exchanges values between selected records, with its variousextensions [Dalenius and Reiss (1982); Fienberg and McIntyre (2005)], cellsuppression where certain cells of contingency tables are not displayed [Cox(1995)], simulating synthetic data which have the same (conditional) dis-tribution as the original data [Rubin (1993); Fienberg, Makov and Steele(1998); Raghunathan, Reiter and Rubin (2003); Reiter (2003, 2005b)], andadditive random noise for continuous variables [Kim (1986); Sullivan andFuller (1989); Fuller (1993); Trottini et al. (2004)], etc.Among these methods, data aggregation, data swapping, additive ran-dom noise and many other methods can be formulated as matrix mask-ing [Duncan and Pearson (1991)]. Suppose data on n observations and p variables are stored in a n × p matrix. Matrix masking takes the generalform of Z ∗ = A Z B + C , where Z is the original data matrix and Z ∗ is themasked data matrix. Matrices A , B and C are row (observation) operator,column (variable) operator and random noise, respectively. Links betweenthe above masking methods to matrix masking are investigated in Duncanand Pearson (1991), Cox (1994), Fienberg (1994) and Fienberg, Makov andSteele (1998).Measuring and evaluating utility of masked data is important. In generalthere are two classes of utility measures. One is global utility measures whichreﬂect the general distribution of masked data compared to that of theoriginal data and are not speciﬁc to any analysis. Such measures includethe number of swaps in data swapping, the added variance in the additiverandom noise approach, diﬀerences between continuous original and maskeddata in their ﬁrst and second moments, etc. More sophisticated measures SMOOTHING APPROACH FOR MASKING SPATIAL DATA that compare distributions of masked and original data can be found inDobra et al. (2002), Gomatam, Karr and Sanil (2005) and Woo et al. (2009).In addition, Bayesian decision theory-based utility is discussed in Trottiniand Fienberg (2002) and Dobra, Fienberg and Trottini (2003).The second class of utility measures is analysis-speciﬁc tailored to an-alysts’ inference. For the utility associated with regression inference, Karret al. (2006) examine the overlap in the conﬁdence intervals of linear regres-sion coeﬃcients estimated with original and masked data. Kim (1986) andFuller (1993) show for the additive random noise approach that if maskeddata preserve the ﬁrst two moments of original data, then coeﬃcient es-timates from linear regression using masked data are (approximately) un-biased. In addition, the methods of aggregated tabular counts and dataswapping can produce valid results for loglinear models because they pre-serve the marginal total of contingency tables. This is equivalent to preserv-ing suﬃcient statistics for loglinear models, given that the margins of allhigher-order interactions that appear in the model are preserved [Fienbergand Slavkovic (2004); Fienberg and McIntyre (2005)]. Recently, Slavkovicand Lee (2010) investigated logistic regression inference for contingency ta-bles that preserve marginal total or conditional probabilities. However, for ageneral data structure additional research is needed. For example, bias andvariance of parameter estimates from nonlinear regression using masked dataare not quantiﬁed as functions of masking parameters.We propose a special case of matrix masking where we construct row(observation) transformed data, that is, Z ∗ = A Z , using spatial smoothing.We investigate the mean square error (MSE) of the regression parameterestimates when ﬁtting a Generalized Linear Model (GLM) to the maskeddata, and we provide guidance on how to select the masking parameters toreduce the MSE. Speciﬁcally, for both regressors and outcome we constructmasked data which are weighted averages of the original individual-leveldata by using linear smoothers. The shape of the smoothing weight func-tion deﬁnes the “form” of masking and the smoothness parameter measuresthe “degree” of masking. By choosing an appropriate weight function andsmoothness parameter value, the masked data can account for prior knowl-edge on the spatial pattern of individual-level data, and parameter estimatesfrom nonlinear regression using such masked data may be less subject to biasand MSE. Although data utility is our main focus, we also evaluate iden-tiﬁcation disclosure risk. We consider the scenario wherein a data intruderhas correct information on the risk factor regressors (e.g., exposure or de-mographic data) from some external data sources, and his/her objective isto obtain the conﬁdential information on the health outcome through recordmatching. Using our method, we can evaluate both the utility and the disclo-sure risk as functions of the form and the degree of masking, which produces Y. ZHOU, F. DOMINICI AND T. A. LOUIS a risk-utility proﬁle and can facilitate the selection of the masking parame-ters. We also derive a closed-form expression for calculating the ﬁrst-orderbias of the regression parameter estimates when estimated using the maskeddata, for any assumed distribution of the outcome given the regressors inthe exponential family.We apply our method to a study of racial disparities in risks of mortalityfor a large sample of the U.S. Medicare population. This study consistsof more than 4 million individuals in the Northeast region of the UnitedStates. We develop and apply statistical models to estimate the age andgender adjusted association between race and risks of mortality when usingboth the original individual-level data and the masked data. The estimatedassociation obtained from using the original individual-level data is the gold-standard, and we compare it to the estimated association obtained fromusing the masked data. We also calculate the identiﬁcation disclosure riskof the masked data sets.In Section 2 we detail the method, and in Section 3 we present the simula-tion studies. In Section 4 we apply our method to the Medicare data set, andin Section 5 we discuss the method and the results. The R code is providedin the Supplement [Zhou, Dominici and Louis (2010b)], while the Medicaredata set is not provided due to a conﬁdentiality agreement. Derivation ofthe closed-form expression for the ﬁrst-order bias of the GLM regressionparameter estimates when estimated using the masked data is presented inthe Appendix.

2. Methods.

Matrix masking using spatial smoothing.

Assume that the outcomevariable Y and the regressors X are spatial processes { Y ( s ) , X ( s ) } , andthe observed individual-level data { ( Y i , X i ) , i = 1 , . . . , N } are realizations ofthe spatial processes at locations s = { s , . . . , s N } , that is, X i = X ( s i ) , Y i = Y ( s i ), i = 1 , . . . , N . We construct masked data at s using spatial smoothing,and we show later that this masking approach is a special case of matrixmasking by row (observation) transformation.Let W λ ( u, s ; S ) denote the relative weight assigned to data at location s when generating smoothed data for the target location u , where λ ≥ S denotes all spatial locations in a study areaso s is a subset of S . The parameter λ controls the degree of smoothness,with smoothness increasing with λ . For notational convenience we suppressthe dependence of W on S .We consider a subclass of linear smoothers under which the smoothedspatial processes at location u are deﬁned as follows. For λ > Y λ ( u ) = Z Y ( s ) W λ ( u, s ) dN ( s ) . Z W λ ( u, s ) dN ( s ) , SMOOTHING APPROACH FOR MASKING SPATIAL DATA (2.1) X λ ( u ) = Z X ( s ) W λ ( u, s ) dN ( s ) . Z W λ ( u, s ) dN ( s ) , where N ( s ) is the counting process for locations with available data from thespatial processes { Y ( s ) , X ( s ) } . For ∀ u ∈ s we require that W ( u, s ) = I { s = u } .If W is continuous in λ , we deﬁne W ( u, s ) as lim λ ↓ W λ ( u, s ). Therefore, wehave that { Y ( s i ) , X ( s i ) } = { Y i , X i } , the original individual-level data.We generate masked data by taking the predictions from (2.1) at s wherethe original individual-level data are available, that is, { Y λ ( s i ) , X λ ( s i ) , i =1 , . . . , N } . By deﬁnition in (2.1), the masked data are weighted averages ofthe original individual-level data { Y ( s i ) , X ( s i ) } . The shape of the weightfunction W and the degree of smoothness λ control the form and the de-gree of masking, respectively, where the degree of masking increases withthe degree of smoothness. In practice, the masked data at location s i arecomputed by Y λ ( s i ) = N X k =1 Y k W λ ( s i , s k ) (cid:30) N X k =1 W λ ( s i , s k ) , (2.2) X λ ( s i ) = N X k =1 X k W λ ( s i , s k ) (cid:30) N X k =1 W λ ( s i , s k ) , where the same W and λ are applied to both Y and X . Examples of com-monly used smoothers within this class include parametric linear regressionsﬁtted by ordinary least square and weighted least square, penalized lin-ear splines with truncated polynomial basis, kernel smoothers and LOESSsmoothers [Simonoﬀ (1996); Bowman and Azzalini (1997); Hastie, Tibshi-rani and Friedman (2001); Ruppert, Wand and Carroll (2003)].Let Y and Y λ denote the vectors of { Y i } and { Y λ ( s i ) } , and let X and X λ denote the matrices of { X i } and { X λ ( s i ) } , respectively, where X i and X λ ( s i ) , i = 1 , . . . , N , are row vectors. It can be seen that Y λ = A λ Y and X λ = A λ X , where A λ = ( A λ ij ) = ( W λ ( s i , s j ) / P Nj =1 W λ ( s i , s j )). Therefore,constructing masked data by equation (2.2) is a special case of matrix mask-ing by row (observation) transformation. Reidentiﬁcation from ( Y λ , X λ ) to( Y , X ) requires knowledge of both W and λ as well as the existence of A − λ .2.2. Bias and variance in nonlinear regression using masked data.

Biasmay arise when a nonlinear model that is speciﬁed for the original individual-level data is ﬁtted to the masked data. Speciﬁcally, we assume the followingmodel for the original individual-level data which is viewed as the “truth,” g ( E {Y|X } ) = X β . (2.3) Y. ZHOU, F. DOMINICI AND T. A. LOUIS

Model (2.3) implies the analogous model for the masked data g ( E {Y λ |X λ } ) = X λ β (2.4)only for a linear function g ( x ) = ax , where a is a constant (except for fewspecial circumstances such as X i = x , i.e., constant exposure). Speciﬁcally, g ( E {Y λ |X λ } ) = aE {Y λ |X λ } = aE {Y λ |X } = a A λ E {Y|X } model (3) = a A λ a − X β = X λ β . It follows that for a nonlinear regression model (2.3), the coeﬃcient estimateobtained by ﬁtting model (2.4) will be a biased estimate of β . Therefore, it isimportant to evaluate the bias of the coeﬃcient estimate under model (2.4)as well as how the bias varies as a function of the form and the degree of datamasking. To consider both the bias and variance of the coeﬃcient estimateobtained by ﬁtting model (2.4), we evaluate the MSE as a function of theform and the degree of masking.It is common to assume that the masked data are mutually indepen-dent. However, they are generally correlated, since they combine informationacross the same original data. To investigate the impact of this correlationon the uncertainty of the coeﬃcient estimate when using the masked data,we compare the “naive” conﬁdence interval under model (2.4) which doesnot account for this correlation with an appropriate conﬁdence interval ob-tained by using simulation or bootstrap methods [Efron (1979); Efron andTibshirani (1993)].2.3. Identiﬁcation disclosure risk of masked data.

We evaluate the iden-tiﬁcation disclosure risk of the masked data by calculating the probabilityof identiﬁcation as developed in Reiter (2005a). To compute the risk of thereleased masked data set, we ﬁrst compute the probability of matching fora particular data record.Speciﬁcally, let Z = ( Y , X ) denote the unmasked data set and Z λ = ( Y λ , X λ )denote the released masked data set. Let t denote a data vector possessed bya data intruder, where t contains the true values for a particular individual. Z λ can be divided into two components: Z Uλ which consists of variables thatare not available in t , and Z Ap λ which consists of variables that are availablein t . Z = ( Z U , Z Ap ) is the same decomposition of the true data set. Let J be a random variable that equals j if to match t with the j th individual in Z λ . The probability of matching is Pr( J = j | t , Z λ ) , j = 1 , . . . , N , assumingthat t always corresponds to an individual within Z λ . Assumptions aboutthe knowledge and behavior of the intruder are used to determine this prob-ability. Using Bayes’ rule,Pr( J = j | t , Z λ ) = Pr( Z λ | J = j, t ) Pr( J = j | t ) P Nj =1 Pr( Z λ | J = j, t ) Pr( J = j | t ) , SMOOTHING APPROACH FOR MASKING SPATIAL DATA where Pr( Z λ | J = j, t ) can be decomposed intoPr( z λ, , . . . , z λ,j − , z λ,j +1 , . . . , z λ,N | z λ,j , J = j, t ) · Pr( z Uλ,j | z Ap λ,j , J = j, t ) · Pr( z Ap λ,j | J = j, t ) . Following the guidance in Reiter (2005a), we compute each component ofPr( J = j | t , Z λ ) as follows:1. Pr( J = j | t ) = 1 /N . This is because the true values are replaced by someweighted averages upon releasing, so exact matching between t and any Z Ap λ record is not possible.2. Pr( z Ap λ,j | J = j, t ) equals 1 − k z Ap λ,j − t k max Nk =1 k z Ap λ,k − t k , (2.5)which is the tail probability of a uniform distribution with density 1 / max Nk =1 k z Ap λ,k − t k . We assume the intruder knows that the masked dataare weighted averages of the original data. As we point out at the endof Section 2.1, detailed information on W and λ shall not be released.Therefore, it is a reasonable assumption that the intruder will assumea uniform distribution based on the diﬀerence from t . The larger thediﬀerence, the smaller the probability.3. Pr( z Uλ,j | z Ap λ,j , J = j, t ) is computed through Z Pr( z Uλ,j | z Uj , z Ap λ,j , J = j, t ) Pr( z Uj | z Ap λ,j , J = j, t ) d z Uj , (2.6)where Pr( z Uλ,j | z Uj , z Ap λ,j , J = j, t ) = 1 − k z Uλ,j − z Uj k max Nk =1 k z Uλ,k − z Uj k , Pr( z Uj | z Ap λ,j , J = j, t )is obtained through regression of Z U on Z Ap λ , and the integral is computedusing Monte Carlo integration.4. Pr( z λ, , . . . , z λ,j − , z λ,j +1 , . . . , z λ,N | z λ,j , J = j, t ) is conservatively assumedto be equal to 1. As pointed out in Reiter (2005a), such assumption pro-vides the upper limit on the identiﬁcation risks and greatly simpliﬁes thecalculation.Assuming a record t is matched to the individuals with the largest proba-bility of matching, we measure the identiﬁcation disclosure risk of the entirereleased data set using the expected percentage of correct matches. Same asin Reiter (2005a), we assume that the intruder possesses correct records forall individuals in the released data set and seeks to match each record withan individual with replacement, that is, matching of one record is indepen-dent from matching of another record. Let m j be the number of individual Y. ZHOU, F. DOMINICI AND T. A. LOUIS records with the maximum matching probability for t j , j = 1 , . . . , N . Let I j = 1 if the m j individual records contain the correct match, and I j = 0otherwise. The expected percentage of correct matches is P Nj =1 1 m j I j /N .

3. Simulation studies.

Data generation, parameter estimation and disclosure risk evalua-tion.

In this section we conduct simulation studies to illustrate that pa-rameter estimates from regression using masked data may be less subject tobias and MSE when the selection of the smoothing weight function accountsfor the spatial patterns of exposure. We illustrate this point using three ex-amples. In each case, we deﬁne the study area to be [ − , × [ − , s where individual-levelexposure and outcome data are obtained.In each example, we deﬁne a spatial process of exposure X ( s ) and weobtain X ( s i ) for s i ∈ s . We simulate the individual-level outcome data at s from a model of the general form Y ( s i ) i . i . d . ∼ Poisson( e µ + βX ( s i ) ) , (3.1)with the individual-level exposure coeﬃcient β being the parameter of in-terest. The values of µ and β are selected to achieve reasonable variabilityof E { Y ( s i ) | X ( s i ) } under model (3.1) across the locations.We construct the masked data { Y λ ( s i ) , X λ ( s i ) } using kernel smoothers,and we estimate the exposure coeﬃcient β λ under model Y λ ( s i ) i . i . d . ∼ Poisson( e µ λ + β λ X λ ( s i ) ) , (3.2)which is analogous to model (3.1) but ﬁtted to the masked data. The maskeddata are constructed and β λ is estimated for each combination of 20 λ valuesand two diﬀerent kernel weights, respectively, so we can evaluate the biasand the MSE as functions of both the smoothing weight and λ .In addition, we construct spatially aggregated data by equally partitioningthe study area into 7 × Y + j = P n j i =1 Y ( s i ) and¯ X · j = P n j i =1 X ( s i ) /n j , where n j is the total number of individual-level datapoints in cell j , j = 1 , . . . ,

49. We estimate the exposure coeﬃcient β e usingthe aggregated data { Y + j , ¯ X · j } under the analogous model Y + j i . i . d . ∼ n j · Poisson( e µ e + β e ¯ X · j ) . (3.3)To evaluate the identiﬁcation disclosure risk, we consider the scenariothat a data intruder possesses the correct exposure data, that is, X ( s i ) for s i ∈ s , and seeks the matches with the released data set in order to obtaininformation on the health outcome Y . Speciﬁcally, Z Ap is X and Z U is Y .We generate 500 replicates of the individual-level outcome data. For eachreplicate β λ and β e are estimated as above, and the estimates are averagedacross the 500 replicates. SMOOTHING APPROACH FOR MASKING SPATIAL DATA Choice of smoothing weight function.

To select a weight functionthat may lead to less bias and possibly smaller MSE when estimating theexposure coeﬃcient using the masked data, we notice that expectation ofthe masked outcome Y λ ( s i ) with respect to model (3.2) is E { Y λ ( s i ) | X λ ( s i ) } = e µ λ + β λ X λ ( s i ) , while expectation of Y λ ( s i ) with respect to model (3.1) is E { Y λ ( s i ) | X } = Z e µ + βX ( s ) W λ ( s i , s ) dN ( s )= e µ + βX λ ( s i ) Z e β [ X ( s ) − X λ ( s i )] W λ ( s i , s ) dN ( s ) , where X = { X ( s ) } . The comparison between E { Y λ ( s i ) | X } and E { Y λ ( s i ) | X λ ( s i ) } suggests that we can reduce the bias and possibly theMSE of estimating µ and β when using the masked data by selecting a W s.t. R e β [ X ( s ) − X λ ( s i )] W λ ( s i , s ) dN ( s ) is close to 1. One way to construct sucha W is to assign high weights to locations that receive similar exposure asthe target location and low weights otherwise. The W constructed in thisway has the property that it accounts for prior knowledge on the spatialpattern of the exposure. In our examples, this is also the spatial pattern ofthe outcome due to the model assumption (3.1). Therefore, to assess thediﬀerence in bias and MSE when varying the smoothing weight function, weconstruct two diﬀerent kernel weights for data masking in the way that oneweight accounts for prior knowledge on the spatial pattern of the exposureas above, while the other does not.3.3. Example I . We assume that the exposure is eradiated from a pointsource A and decreases symmetrically in all directions as the Euclideandistance from A increases. Speciﬁcally, we deﬁne X ( s ) = 7 exp( − r s / . s ∈ [ − , × [ − , r s is the Euclidean distance between lo-cation s and the point source A . Figure 1(a) shows the contour plot of X ( s ). The individual-level outcome is simulated from Y ( s i ) i . i . d . ∼ Poisson( e − X ( s i ) ). Aggregated data of exposure and outcome are con-structed by calculating group summaries of { Y ( s i ) , X ( s i ) } as described inSection 3.1.We construct masked data { Y λ ( s i ) , X λ ( s i ) } by using equation (2.2) withboth the Euclidean kernel weight W ∗ λ and the ring kernel weight W λ whichare deﬁned as follows: W ∗ λ ( u, s ) = exp( −k s − u k /λ ) , (3.4) W λ ( u, s ) = exp( −| r s − r u | /λ ) . (3.5) Y. ZHOU, F. DOMINICI AND T. A. LOUIS

The ring kernel weight W λ ( u, s ) decreases exponentially as the diﬀerencebetween r s and r u increases, and such diﬀerence is positively associated withthe diﬀerence between X ( s ) and X ( u ) according to the spatial pattern ofthe exposure. Figure 1(b) shows the contour plot of W λ ( s , · ). On the otherhand, the Euclidean kernel weight W ∗ λ ( u, s ) solely depends on k s − u k , theEuclidean distance between location u and location s , and therefore doesnot account for prior knowledge on the spatial distribution of the exposure. Fig. 1.

Example I of spatially varying exposure, weight function for spatial smooth-ing, estimates, and disclosure risk. (a) Contour plot of exposure from point source A : X ( s ) = 7 exp( − r s / . , with cells for spatial aggregation. (b) Contour plot of ringweight function W λ ( s , s ) = exp( −| r s − r s | /λ ) for calculating spatially smoothed ex-posure and outcome data at location s , from individual-level exposure X ( s ) in (a) and individual-level outcome Y ( s ) simulated by Y ( s ) ∼ Poisson(exp( −

25 + 4 X ( s ))) where β = 4 , with λ = 0 . . (c) Estimates of β λ with “naive” 95 % conﬁdence intervalsby ﬁtting model Y λ ( s ) ∼ Poisson(exp( µ λ + β λ X λ ( s ))) where { Y λ ( s ) , X λ ( s ) } are con-structed using the ring weight function in (b) and using the Euclidean weight function W ∗ λ ( s , s ) = exp( −k s − s k /λ ) , with reference lines at β = 4 and at the estimate fromaggregated data. (d) Mean square error (MSE) of β λ using “naive” variance. (e) Identi-ﬁcation disclosure risk measured by the expect percentage of correct record matching. (f)

Disclosure risk versus MSE for utility-risk trade-oﬀ.

SMOOTHING APPROACH FOR MASKING SPATIAL DATA Example II . We assume that the exposure is eradiated from a pointsource A and toward a certain direction. Speciﬁcally, we deﬁne X ( s ) = 7 × exp( − r s / − cos θ s /

3) for s ∈ [ − , × [ − , θ s is the angle betweenthe direction from point source A to location s and the direction that theexposure is toward, and r s is deﬁned the same as in Example I. Figure 2(a)shows the contour plot of X ( s ). The individual-level outcome is simulatedfrom Y ( s i ) i . i . d . ∼ Poisson( e − X ( s i ) ). Aggregated data of exposure and out-come are constructed by calculating group summaries of { Y ( s i ) , X ( s i ) } asdescribed in Section 3.1.We construct masked data { Y λ ( s i ) , X λ ( s i ) } by using equation (2.2) withthe Euclidean kernel weight (3.4) and the ring angle kernel weight W λ ( u, s ) = exp( − ( | r s − r u | + 2 | cos θ s − cos θ u | ) /λ ) , which decreases exponentially as the diﬀerence between r s and r u increasesas well as the diﬀerence between cos θ s and cos θ u increases. Figure 2(b)shows the contour plot of W λ ( s , · ).3.5. Example

III . We assume that the exposure is eradiated from a pointsource A but blocked in a certain area, such as blocked by a mountain, so theblocked area receives no exposure. Speciﬁcally, we deﬁne the unblocked areato be s x ≤ . ϑ s ≤ .

625 for s ∈ [ − , × [ − , s x is the x -axisvalue of location s and ϑ s is the angle between the positive x -axis and thedirection from point source A to location s . We deﬁne the exposure X ( s ) =7 exp( − r s / . · I s for s ∈ [ − , × [ − , I s is the indicator that s islocated within the unblocked area, and r s is deﬁned the same as in ExamplesI and II. Figure 3(a) shows the contour plot of X ( s ). The individual-leveloutcome is simulated from Y ( s i ) i . i . d . ∼ Poisson( e − X ( s i ) ). Aggregated dataof exposure and outcome are constructed by calculating group summariesof { Y ( s i ) , X ( s i ) } as described in Section 3.1.We construct masked data { Y λ ( s i ) , X λ ( s i ) } by using equation (2.2) withthe Euclidean kernel weight (3.4) and the ring block kernel weight W λ ( u, s ) = exp( −| r s − r u | /λ ) · ( I s = I u ) , which assigns nonzero weight only when location u and location s are bothin the blocked or unblocked area. In addition, the nonzero weight from W λ ( u, s ) decreases exponentially as the diﬀerence between r s and r u in-creases. Figure 3(b) shows the contour plot of W λ ( s , · ).3.6. Results.

Results of Example I on parameter estimates, MSE andidentiﬁcation risk averaged across the 500 simulation replicates are shown inFigure 1(c)–(e), respectively. Speciﬁcally, Figure 1(c) shows the estimated β λ as a function of λ for the ring kernel weight (3.5) and the Euclidean Y. ZHOU, F. DOMINICI AND T. A. LOUIS

Fig. 2.

Example II of spatially varying exposure, weight function for spatialsmoothing, estimates, and disclosure risk. (a) Contour plot of exposure frompoint source A toward a certain direction: X ( s ) = 7 exp( − r s / − cos θ s / , withcells for spatial aggregation. (b) Contour plot of ring angle weight function W λ ( s , s ) = exp( − ( | r s − r s | + 2 | cos θ s − cos θ s | ) /λ ) for calculating spatially smoothedexposure and outcome data at location s , from individual-level exposure X ( s ) in (a) andindividual-level outcome Y ( s ) simulated by Y ( s ) ∼ Poisson(exp( −

36 + βX ( s ))) where β = 4 , with λ = 0 . . (c) Estimates of β λ with “naive” 95 % conﬁdence intervals by ﬁt-ting model Y λ ( s ) ∼ Poisson(exp( µ λ + β λ X λ ( s ))) where { Y λ ( s ) , X λ ( s ) } are constructedusing the ring angle weight function in (b) and using the Euclidean weight function W ∗ λ ( s , s ) = exp( −k s − s k /λ ) , with reference lines at β = 4 and at the estimate fromaggregated data. (d) Mean square error (MSE) of β λ using “naive” variance. (e) Identi-ﬁcation disclosure risk measured by the expect percentage of correct record matching. (f)

Disclosure risk versus MSE for utility-risk trade-oﬀ. kernel weight (3.4), with the “naive” 95% conﬁdence intervals. By “naive”we mean that the conﬁdence intervals are computed by ﬁtting model (3.2)directly, and therefore do not account for the possible correlation betweenthe masked data as pointed out earlier in Section 2.2. The reference linesare placed at the true value of β and at the estimated β e , from which thebias of estimating the exposure coeﬃcient by using the estimated β λ canbe evaluated. Figure 1(d) shows the MSE as a function of λ for the twokernel weights, where in this example MSE is largely determined by the SMOOTHING APPROACH FOR MASKING SPATIAL DATA Fig. 3.

Example

III of spatially varying exposure, weight function for spatial smooth-ing, estimates, and disclosure risk. (a)

Contour plot of exposure from point source A but blocked in certain area: X ( s ) = 7 exp( − r s / . · I s where I s is the indicator of loca-tion s in the unblocked area, with cells for spatial aggregation. (b) Contour plot of ringblock weight function W λ ( s , s ) = exp( −| r s − r s | /λ ) · ( I s = I s ) for calculating spatiallysmoothed exposure and outcome data at location s , from individual-level exposure X ( s ) in (a) and individual-level outcome Y ( s ) simulated by Y ( s ) ∼ Poisson(exp( −

24 + βX ( s ))) where β = 4 , with λ = 0 . . (c) Estimates of β λ with “naive” 95 % conﬁdence intervalsby ﬁtting model Y λ ( s ) ∼ Poisson(exp( µ λ + β λ X λ ( s ))) where { Y λ ( s ) , X λ ( s ) } are con-structed using the ring block weight function in (b) and using the Euclidean weight func-tion W ∗ λ ( s , s ) = exp( −k s − s k /λ ) , with reference lines at β = 4 and at the estimate fromaggregated data. (d) Mean square error (MSE) of β λ using “naive” variance. (e) Identi-ﬁcation disclosure risk measured by the expect percentage of correct record matching. (f)

Disclosure risk versus MSE for utility-risk trade-oﬀ. bias. The reference lines are placed at the MSE from regression using theoriginal data (in which the bias part is 0) and the MSE of β e . Figure 1(e)shows the identiﬁcation disclosure risk of the masked data set measured bythe expected percentage of correct record matching, as a function of λ forthe two kernel weights. Figure 1(f) plots the disclosure risk versus MSE,which shows the trade-oﬀ between data utility and disclosure risk.We ﬁnd that data masking using the ring kernel weight (3.5) leads tosmaller bias and MSE when estimating the exposure coeﬃcient than mask- Y. ZHOU, F. DOMINICI AND T. A. LOUIS ing using the Euclidean kernel weight (3.4), for all λ values that are con-sidered. It suggests that when using the masked data for loglinear regres-sion, a masking procedure that preserves the spatial pattern of the originalindividual-level exposure and outcome data can lead to better estimates interms of smaller bias and MSE than a masking procedure that does notdo so. As λ increases, the bias and MSE increase for both kernel weights,while the diﬀerences in the bias and MSE between the two kernel weightsdecrease. This increase in the bias/MSE and decrease in the bias/MSE dif-ferences suggest that in the presence of a high degree of masking, choicefor the form of masking may be less inﬂuential on the resultant bias/MSE.Moreover, comparing the estimated β λ and β e , we ﬁnd that for small valuesof λ , the bias and MSE is smaller when using the estimated β λ from thering kernel weight (3.5).On the other hand, we ﬁnd that the disclosure risk is lower when using theEuclidean kernel weight (3.4) for data masking compared to using the ringkernel weight (3.5). This is not unexpected because masked data constructedusing the ring kernel weight is more informative about the original truevalues. However, with a tolerable potential disclosure risk [ < λ . Same as the trendfor bias and MSE, the diﬀerences in the disclosure risk between the twokernel weights become small as λ increases.Similar results of Example II and Example III are shown in Figure 2(c)–(f)and Figure 3(c)–(f).Figure 4 shows the width ratios comparing the 95% “naive” conﬁdenceintervals versus the percentile conﬁdence intervals obtained from the empiri-cal distribution of the estimates across the 500 simulations, for the estimatesof β λ in the three examples respectively. Width ratio when λ = 0 (the soliddot) is calculated using the nonsmoothed data, that is, the individual-leveldata. We ﬁnd that in these three examples, the “naive” conﬁdence intervalsgenerally overestimate the uncertainty of the β λ estimates, and the degreeof overestimation increases as λ increases. In addition, for Examples II andIII where the spatial patterns of exposure are nonisotropic, the degree ofoverestimation diﬀers for the weight functions with and without accountingfor prior knowledge on the spatial pattern of exposure.

4. Application to Medicare data.

We apply our method to the studyof racial disparities in risks of mortality for a sample of the U.S. Medicarepopulation.4.1.

Data source.

We extract a large data set at individual-level fromthe Medicare government database. Speciﬁcally, it includes individual age,

SMOOTHING APPROACH FOR MASKING SPATIAL DATA Fig. 4.

Width ratios comparing the 95 % “naive” conﬁdence intervals (CI) versus thepercentile CI obtained from the empirical distributions of the estimates across the 500simulations, for the estimates of β λ in (a) Example I , (b) Example II , and (c) Example

III of the simulation studies. Width ratio when λ = 0 (the solid dot) is calculated using thenonsmoothed data. race, gender and a day-speciﬁc death indicator over the period 1999–2002,for more than 4 million black and white Medicare enrollees who are 65years and older residing in the Northeast region of the U.S. People who areyounger than 65 at enrollment are eliminated because they are eligible forthe Medicare program due to the presence of either a certain disability orEnd Stage Renal Disease and therefore do not represent the general Medicarepopulation.Figure 5 shows the study area which includes 2095 zip codes in 64 countiesin the Northeast region of the U.S. We select the counties whose centroidsare located within the range that covers the Northeast coast region of theU.S., and we exclude zip codes without available study population from thestudy map. This area covers several large, urban cities including WashingtonDC, Baltimore, Philadelphia, New York City, New Haven and Boston. It hasthe advantage of high population density and substantial racial diversity.We categorize the age of individuals into 5 intervals based on age in his/herﬁrst year of observation: [65, 70), [70, 75), [75, 80), [80, 85) and [85, +).This categorization facilitates detection of age eﬀects because diﬀerences Y. ZHOU, F. DOMINICI AND T. A. LOUIS

Fig. 5.

Location of the 2095 zip codes included in our study area. in the risks of mortality for one-year increase in age are relatively small.We “coarsen” the daily survival information into yearly survival indicators.By doing so, we deﬁne our outcome as the probability of the occurrence ofdeath for an individual in one year. This deﬁnition adjusts for the diﬀerentialfollow-up time.4.2.

Statistical models and data masking.

Let i denote individual, j de-note zip code, t denote year, and D ijt be the death indicator for individual i in zip code j in year t . Similarly as in Zhou, Dominici and Louis (2010a),we deﬁne the individual-level model as logit Pr( D tij = 1) = β + β race ij + age ij β (4.1) + β gender ij + ( age × gender ) ij β . Geographic locations for each individual are needed to spatially smooththe individual-level data. However, from the Medicare data we only havethe longitude and latitude of the zip code centroids. Therefore, we applya two-step masking procedure on the individual-level data, where we ﬁrstaggregate the individual-level data to zip code-level, and we then spatiallysmooth the zip-code level aggregated data to construct the masked data atthe zip code-level.Speciﬁcally, let D ++ j denote the total death count and n j denote thetotal person-years of zip code j . We ﬁrst obtain from aggregation { % black j ,% agecat j , % male j , % ( agecat × male ) j , p j = D ++ j /n j , j = 1 , . . . , J } , whichare the marginal distributions of race, age, gender, the joint distribution ofage and gender, and the mortality rate, respectively, of each zip code. SMOOTHING APPROACH FOR MASKING SPATIAL DATA Due to the complex spatial pattern of the zip code-level covariates, weuse kernel smoothers with bivariate normal density kernel weights for spatialsmoothing, so the shape of the smoothing weight is ﬂexible by varying thecorrelation parameter value of the bivariate normal distribution. Let thevector s = { s , s } denote the location of a zip code, where s and s arethe longitude and latitude of the zip code centroid, respectively. We usesmoothing kernel weights of the general form W λ ( u, s ) = exp( − ( s − u , s − u ) T Σ − λ ( s − u , s − u ) / , where Σ λ = λ (cid:18) σ ρσ σ ρσ σ σ (cid:19) ,σ and σ are the variances of the longitude and latitude data of the 2095zip codes, respectively. We consider for ρ the following three values:1. ρ = 0, so the weight solely depends on the Euclidean distance k s − u k ;2. ρ = 0 .

5, so higher weight is assigned to s in the northeast and southwestdirections of u ;3. ρ = − .

5, so higher weight is assigned to s in the northwest and southeastdirections of u .Let p jλ denote the smoothed mortality rate of zip code j from whichwe calculate the smoothed death count D ++ jλ = p jλ · n j . Let % black jλ ,% agecat jλ , % male jλ , % ( agecat × male ) jλ denote the smoothed marginaldistributions of race, age, gender and the smoothed joint distribution of ageand gender, respectively, of zip code j . We specify the model for maskeddata as D ++ jλ ∼ Bin( n j , p jλ ) , logit p jλ = β λ + β λ % black jλ + β λ % agecat jλ (4.2) + β λ % male jλ + β λ % ( agecat × male ) jλ . The zip code-level nonsmoothed aggregated data are also used to ﬁt model(4.2).To evaluate the identiﬁcation disclosure risk, we consider the scenario thata data intruder possesses correct zip code-level demographic data and seeksthe matching with the masked zip code-level data set in order to obtaininformation on the zip code-level mortality. Speciﬁcally, the released dataset consists of % black jλ , % agecat jλ , % male jλ and p jλ , j = 1 , . . . , black j , % agecat j and % male j . Y. ZHOU, F. DOMINICI AND T. A. LOUIS

Choice of association measure.

The common approach to report theassociation between race and mortality risks is to report the race coeﬃcients β in model (4.1) and β λ in model (4.2), whose interpretation is subjected tothe coding of the race covariate. For direct understanding of the diﬀerencein the risk of death between the black and white populations, we deﬁneand report the population-level odds ratio (OR) of death comparing Blacksversus Whites, which is a function of the predicted values [Zhou, Dominiciand Louis (2010a)]. Therefore, interpretation of this association measuredoes not depend on model parameterization (e.g., on covariate centeringand scaling).Speciﬁcally, let P tijb = Pr( D tij = 1 | race ij = Black , age ij , gender ij ) ,P tijw = Pr( D tij = 1 | race ij = White , age ij , gender ij )denote the predicted probabilities of death in year t for a black person anda white person, respectively, whose other covariates values are the same asthe i th individual in the j th zip code. We deﬁne the population-level ORfrom the individual-level model (4.1) as follows: OR = P ··· b Q ··· w P ··· w Q ··· b , where P ··· b = X t,i,j P tijb , P ··· w = X t,i,j P tijw ,Q ··· b = 1 − P ··· b , Q ··· w = 1 − P ··· w . Similarly, we deﬁne population-level OR λ from model (4.2) using summaryprobabilities P · bλ = P j n j P jbλ P j n j and P · wλ = P j n j P jwλ P j n j , where P jbλ and P jwλ are the predicted probabilities of death in one yearfor zip codes that consist of solely black and solely white populations, re-spectively, and whose marginal and joint distributions of age and genderare the same as zip code j . “Naive” standard errors of log OR λ are cal-culated using the multivariate Delta Method [Casella and Berger (2002)].In addition, bootstrap conﬁdence intervals for log OR λ are calculated using1000 nonparametric bootstrap samples. Both “naive” and bootstrap conﬁ-dence intervals for OR λ are obtained by exponentiating the correspondingconﬁdence intervals for log OR λ . SMOOTHING APPROACH FOR MASKING SPATIAL DATA Fig. 6.

Estimates of OR λ under model (4.2) and identiﬁcation disclosure risk as a func-tion of λ for the three weight functions. Estimates of OR λ is plotted with the 95 % “naive”conﬁdence intervals (CI), CI using bootstrap standard error (SE) estimates, and bootstrappercentile CI. OR is estimated by ﬁtting model (4.2) to the nonsmoothed zip code-level ag-gregated data. (a) Estimates of OR λ for bivariate normal density kernel weight with ρ = 0 . (b) Estimates of OR λ for bivariate normal density kernel weight with ρ = 0 . . (c) Esti-mates of OR λ for bivariate normal density kernel weight with ρ = − . . (d) Identiﬁcationdisclosure risk measured by expected percentage of correct matching.

Results.

Figure 6(a)–(c) shows the estimates of OR λ under model (4.2)as a function of λ for the three kernel weights respectively, with the 95%“naive” conﬁdence intervals, conﬁdence intervals using bootstrap standarderror estimates and bootstrap percentile conﬁdence intervals. OR is esti-mated by ﬁtting model (4.2) to the nonsmoothed zip code-level aggregateddata. The reference line is placed at the estimate of OR under the individual-level model (4.1).For small values of λ ( < OR λ for all three kernelweights are smaller than the estimate of OR and therefore produce negative Y. ZHOU, F. DOMINICI AND T. A. LOUIS bias, while for larger values of λ the bias diﬀers substantially for diﬀerentkernel weights. For example, data masking using the kernel weight with ρ = 0 . λ valuesthat are considered. When using the kernel weight with ρ = − . OR λ are less subject to bias than those from usingthe other two kernel weights. Diﬀerences in MSE between the three kernelscan also be inferred from the plots, and we ﬁnd that using the kernel weightwith ρ = − . OR λ estimates, which is in the opposite direction ofthe relation between the “naive” and the appropriate conﬁdence intervalsin the simulation studies. The two bootstrap conﬁdence intervals are widerthan the “naive” conﬁdence interval when λ = 0, which suggests a systematicdiﬀerence between the bootstrap conﬁdence intervals and the “naive” con-ﬁdence intervals regardless of smoothing. This systematic diﬀerence occursbecause the nonsmoothed zip code-level aggregated data may not satisfy theBinomial model assumption in (4.2).Figure 6(d) shows the identiﬁcation disclosure risk of the masked dataset as measured by the expected percentage of correct matches when usingthe three kernel weights for masking, as a function of λ . The disclosure riskfor all three kernel weights are small, ranging from 0.01–0.04. The risk issimilar for the masked data sets when using the kernel weight with ρ = 0and ρ = − . ρ = 0 .

5. Discussion.

We propose a special case of matrix masking based onspatial smoothing techniques, where the smoothing weight function controlsthe form of masking, and the smoothness parameter value directly measuresthe degree of masking. Therefore, data utility and disclosure risk can becalculated as functions of both the form and the degree of masking. In fact,the smoothing weight function W can be any weight function and is not re-stricted by existing smoothing methods. With the variety of combinations ofweight functions and smoothness parameter values, it is feasible to constructmasked data that maintain high data utility while preserving conﬁdentiality.We consider a subclass of linear smoothers that produces masked data asweighted averages of the original data. Therefore, the masked data valuesare within a reasonable range. More importantly, correlation among thevariables is invariant under linear transformation, which may intrinsicallycontribute to better data utility of the masked data. On the other hand,this subclass is a large class. It includes many commonly used smoothers.We do not expect major restriction by focusing on this subclass of linearsmoothers.Using our method, we investigate the utility of the masked data in termsof bias, variance and MSE of parameter estimates when using the masked SMOOTHING APPROACH FOR MASKING SPATIAL DATA data for loglinear and logistic regression analysis. Note that similar studiescan be applied to any GLM. In addition, we evaluate the identiﬁcation dis-closure risk of the masked data set by calculating the expected percentage ofcorrect record matching. In the simulation studies, we provide guidance forconstructing masked data that can lead to better regression parameter esti-mates in terms of smaller bias and MSE for loglinear models, and we showthe trade-oﬀ between better estimates and lower disclosure risk. Speciﬁcally,masked data can be constructed by using a smoothing weight function thataccounts for prior knowledge on the spatial pattern of individual-level expo-sure, together with a reasonably low degree of masking. We provide guidancefor how to select such a smoothing weight function for loglinear models. Inaddition, we provide candidate weight functions for three simpliﬁed but rep-resentative spatial patterns of exposure.As is expected, masked data that can lead to better estimates are generallymore informative about the original data values and therefore are subjectto relatively higher identiﬁcation disclosure risk. However, the ﬂexibility inour data masking method enables constructing the masked data that canlead to good parameter estimates, while the disclosure risk is controlled ata low level. In the meanwhile, caution should be placed to the institutein releasing detailed information on the data masking approach along withmasked data. It is pointed out in Section 2.1 that simultaneously releasingthe smoothing weight function W and the smoothness parameter λ in theexistence of A − λ can lead to reidentiﬁcation of original data. However, evenif only partial information is released, for example, only the informationthat data are masked using smoothing and the smoothing weight function isreleased while the smoothing parameter value is not released, it is possiblethat a smart data intruder can still reconstruct the transformation matrix A λ .We apply our data masking method to the study of racial disparities inrisks of mortality for the Medicare population, and show how the bias andthe variance of the estimated OR of death comparing blacks to whites, andhow the identiﬁcation disclosure risk, vary with the form and the degreeof masking. The results suggest that in the absence of clear guidance, it ishelpful to explore a large ﬂexible family such as the bivariate normal densitykernel to identify a weight function that can lead to both good utility andlow identiﬁcation risk for the masked data.We compare the “naive” conﬁdence intervals with the appropriate oneswhich account for the possible correlation among masked data in both thesimulation studies and the data application, where we observe opposite di-rections in the relation between the “naive” and the appropriate conﬁdenceintervals. It suggests no general direction for that relation. One possible rea-son, which is also pointed out in Section 4.4, is that the unmasked data in Y. ZHOU, F. DOMINICI AND T. A. LOUIS the simulation study are simulated from Poisson distributions, while the un-masked data in the data application are real data and do not strictly followthe assumed binomial distribution. Therefore, in the data application, thestandard errors account for both the correlation among the masked dataand the discrepancy of the original data distribution from binomial.The simulation study and data application results show that masked dataconstructed using our method can well preserve conﬁdentiality. Speciﬁcally,the identiﬁcation disclosure risk is reasonably low for all scenarios that weconsider. Note that our calculation of the disclosure risk is conservative: weassume that an intruder possesses true values for all the regressors, and weuse probability 1 for the component Pr( z λ, , . . . , z λ,j − , z λ,j +1 , . . . , z λ,N | z λ,j ,J = j, t ) in the calculation. In addition, the ﬂexibility in the selection ofsmoothing weight function W and smoothness parameter λ can also helpcontrol disclosure risk in addition to improving data utility.Based on our method, we additionally derive a closed-form expressionfor ﬁrst-order bias of the parameter estimates obtained using the maskeddata, for GLM that belong to the exponential family. The ﬁrst-order biascalculation is not necessary when both individual-level exposure and healthoutcome data are available so the actual bias can be computed. It may beused by researchers who have only the individual-level exposure informationto explore the possible bias in their analysis using masked data.Although our proposed method uses spatial smoothing and therefore ap-plies to spatial data, it can be easily generalized to other data types becausethe masking procedure is a smoothing technique that takes weighted aver-ages of the original data. For example, the proposed method can be general-ized to smoothing time series data by using the smoothing weight function W λ ( µ, s ), where µ and s denote time points. Also, note that an alternativemethod to mask spatial data is to mask the individual spatial location [seeArmstrong, Rushton and Zimmerman (1999); Wieland et al. (1998)].APPENDIX: FIRST-ORDER BIASWe derive a closed-form expression for the ﬁrst-order bias of estimatingthe regression coeﬃcients in a GLM that belongs to the exponential family,when using data masked by our method. Let β denote the vector of regres-sion coeﬃcients of a model speciﬁed for the original individual-level data.When the model belongs to the exponential family, its log likelihood can beexpressed as LL( β ) = N X i =1 Y i X i β − b ( X i β ) a ( φ ) + C ( Y i , φ ) ,b ′ ( X i β ) = g − ( X i β ), where b ′ ( · ) is the derivative of function b ( · ), and g ( · )is the link function. Substituting the individual-level data { Y i , X i } by the SMOOTHING APPROACH FOR MASKING SPATIAL DATA masked data { Y λ ( s i ) , X λ ( s i ) } , we obtain log likelihood of the analogousmodel when ﬁtted to the masked data,LL m ( β λ ; λ ) = N X i =1 Y λ ( s i ) X λ ( s i ) β λ − b ( X λ ( s i ) β λ ) a λ ( φ λ ) + C λ ( Y λ ( s i ) , φ λ ) , (A.1)where β λ denotes the corresponding vector of regression coeﬃcients. In orderto calculate the MLE of β λ , it is common procedure to calculate the scorefunction from the likelihood (A.1) and take its expectation with respectto the “true” individual-level model E { Y i | X i } . Denote the expected scorefunction as S ( λ, β λ ) and denote β ( λ ) as the solution s.t. S ( λ, β ( λ )) = 0. Itcan be shown that β (0) = β . Taking the derivative of S ( λ, β ( λ )) = 0 withrespect to λ and evaluating it at λ = 0, we obtain the standard result: β ′ (0) = − ( S (0 , β (0)) − · S (0 , β (0)) , (A.2)where S and S are the partial derivatives with respect to the ﬁrst andsecond components of ∂S/∂λ , respectively. Speciﬁcally, S (0 , β (0)) = N X i =1 X Ti (cid:18)Z h ( X ( s ) β ) R ( s i , s ) dN ( s ) − h ′ ( X i β ) Z X ( s ) T R ( s i , s ) dN ( s ) · β (cid:19) (A.3) S (0 , β (0)) = − N X i =1 h ′ ( X i β ) · X Ti X i , where R ( s i , s ) = ∂ ( W λ ( s i ,s ) / R W λ ( s i ,s ) dN ( s )) ∂λ | λ =0 and h ( · ) = g − ( · ), inverse ofthe link function of the GLM. In practice, S (0 , β (0)) in (A.3) is calculatedby substituting the the integrals by summations over all locations where theoriginal individual-level data are available.The quantity β ′ (0) denotes the instant bias of estimating β using maskeddata, when changing from no masking to a very low degree of masking. Asexpected, when (i) X ( s ) is constant across all locations in s , or (ii) g ( · ) is alinear function, S (0 , β (0)) is calculated to be 0, and therefore β ′ (0) = 0.Using β ′ (0), we can approximate the bias of estimating β when ﬁtting aGLM using masked data whose degree of masking is λ , by calculating β ( λ ) − β ≈ β ′ (0) · λ. This bias calculation can be extended to any function of β , for example, thepredicted value. Speciﬁcally, bias in estimating f ( β ) can be approximatedby f ( β ( λ )) − f ( β ) ≈ f ′ ( β ) · ( β ( λ ) − β ) ≈ f ′ ( β ) · β ′ (0) · λ. Y. ZHOU, F. DOMINICI AND T. A. LOUIS

It can be seen that the ﬁrst-order bias approximation can be easily gen-eralized to approximation using higher-order terms of the Taylor series ex-pansion in addition to the ﬁrst-order term. Speciﬁcally, β ( λ ) − β ≈ β ′ (0) · λ + β ′′ (0) · λ / · · · (A.4) + β ( n ) (0) · λ n /n ! , n ≥ . Similarly, we can generalize the bias approximation of estimating f ( β ).A limitation of the bias approximation using Taylor series expansion (A.4)is that we ignore the remainder term β ( n +1) ( ξ ) · λ n +1 ( n +1)! , ξ ∈ (0 , λ ), which maynot be small for large values of λ . Therefore, the approximation only capturesthe bias for λ ≈

0, that is, the instant direction and magnitude of the biaswhen changing from no masking to a very low degree of masking. It may notcapture the total bias for a speciﬁed degree of masking. In the applicationof our method to the Medicare data, the ﬁrst-order bias is calculated tobe 0 for all three kernel weights because R in (A.3) equals 0. In addition,when applying the bias approximation (A.4) to the three examples in thesimulation studies for n = 1 , . . . ,

5, the bias approximation is calculated tobe 0, while nonzero bias is shown by comparing the parameter estimateswhen using the masked data with the true parameter value.

Acknowledgment.

Thanks to Dr. Aidan McDermott for the help on theMedicare data sources.SUPPLEMENTARY MATERIAL

Supplement: R code (DOI: 10.1214/09-AOAS325SUPP). We provide theR code for (1) the simulation study utility part of the three examples, (2)the function to compute the disclosure risk, and (3) the calculation of thebivariate normal density kernel weight matrix.REFERENCES

Armstrong, M. P., Rushton, G. and

Zimmerman, D. L. (1999). Geographically mask-ing health data to preserve conﬁdentiality.

Stat. Med. Bowman, A. W. and

Azzalini, A. (1997).

Applied Smoothing Techniques for Data Anal-ysis: The Kernel Approach with S–Plus Illustrations . Oxford Univ. Press, Oxford.

Casella, G. and

Berger, R. L. (2002).

Statistical Inference . Duxbury Press, NorthScituate, MA.

Cox, L. H. (1994). Matrix masking methods for disclosure limitation in microdata.

SurveyMethodology Cox, L. H. (1995). Network models for complementary cell suppression.

J. Amer. Statist.Assoc. Dalenius, T. and

Reiss, S. P. (1982). Data-swapping: A technique for disclosure control.

J. Statist. Plann. Inference Dobra, A., Fienberg, S. E. and

Trottini, M. (2003). Assessing the risk of disclosureof conﬁdential categorical data. In

Bayesian Statistics 7 , Proceedings of the SeventhValencia International Meeting on Bayesian Statistics (J. Bernardo et al., eds) 125–144.Oxford Univ. Press, Oxford. MR2003170

Dobra, A., Karr, A., Sanil, A. and

Fienberg, S. (2002). Software systems for tabulardata releases.

Internat. J. Uncertain. Fuzziness Knowledge-Based Systems Duncan, G. T. and

Lambert, D. (1986). Disclosure-limited data dissemination.

J. Amer.Statist. Assoc. Duncan, G. T. and

Pearson, R. W. (1991). Rejoinder: “Enhancing access to microdatawhile protecting conﬁdentiality: Prospects for the future.”

Statist. Sci. Efron, B. (1979). Bootstrap methods: Another look at the jackknife.

Ann. Statist. Efron, B. and

Tibshirani, R. (1993).

An Introduction to the Bootstrap . Chapman &Hall, New York. MR1270903

Fienberg, S. E. (1994). Conﬂicts between the needs for access to statistical informationand demands for conﬁdentiality.

Journal of Oﬃcial Statistics Fienberg, S. E., Makov, U. E. and

Steele, R. J. (1998). Disclosure limitation usingperturbation and related methods for categorical data (with discussion).

Journal ofOﬃcial Statistics Fienberg, S. E. and

McIntyre, J. (2005). Data swapping: Variations on a theme byDalenius and Reiss.

Journal of Oﬃcial Statistics Fienberg, S. E. and

Slavkovic, A. B. (2004). Making the release of conﬁdential datafrom multi-way tables count.

Chance Fienberg, S. E. and

Willenborg, L. C. R. J. (1998). Introduction to the specialissue: Disclosure limitation methods for protecting the conﬁdentiality of statistical data.

Journal of Oﬃcial Statistics Fuller, W. A. (1993). Masking procedures for microdata disclosure limitation (withdiscussion).

Journal of Oﬃcial Statistics Gomatam, S., Karr, A. F. and

Sanil, A. P. (2005). Data swapping as a decisionproblem.

Journal of Oﬃcial Statistics Hastie, T., Tibshirani, R. and

Friedman, J. H. (2001).

The Elements of StatisticalLearning: Data Mining, Inference, and Prediction: With 200 Full-Color Illustrations .Springer, New York. MR1851606

Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P. and

Sanil, A. P. (2006).A framework for evaluating the utility of data altered to protect conﬁdentiality.

Amer.Statist. Kim, J. (1986). A method for limiting disclosure in microdata based on random noiseand transformation. In

American Statistical Association, Proceedings of the Section onSurvey Research Methods

Raghunathan, T. E., Reiter, J. P. and

Rubin, D. B. (2003). Multiple imputation forstatistical disclosure limitation.

Journal of Oﬃcial Statistics Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets.

SurveyMethodology Reiter, J. P. (2005a). Estimating risks of identiﬁcation disclosure in microdata.

J. Amer.Statist. Assoc.

Reiter, J. P. (2005b). Releasing multiply imputed, synthetic public use microdata: Anillustration and empirical study.

J. Roy. Statist. Soc. Ser. A

Rubin, D. B. (1993). Comment on “Statistical disclosure limitation.”

Journal of OﬃcialStatistics Y. ZHOU, F. DOMINICI AND T. A. LOUIS

Ruppert, D., Wand, M. and

Carroll, R. (2003).

Semiparametric Regression . Cam-bridge Series in Statistical and Probabilistic Mathematics . Cambridge Univ. Press,Cambridge. MR1998720 Simonoff, J. S. (1996).

Smoothing Methods in Statistics . Springer, New York. MR1391963

Slavkovic, A. and

Lee, J. (2010). Synthetic two-way contingency tables that preserveconditional frequencies.

Stat. Methodol . DOI: 10.1016/j.stamet.2009.11.002. To appear.

Sullivan, G. and

Fuller, W. A. (1989). The use of measurement error to avoid disclo-sure. In

American Statistical Association, Proceedings of the Section on Survey ResearchMethods

Trottini, M. and

Fienberg, S. E. (2002). Modelling user uncertainty for disclosurerisk and data utility.

Internat. J. Uncertain. Fuzziness Knowledge-Based Systems Trottini, M., Fienberg, S., Makov, U. E. and

Meyer, M. (2004). Additive noiseand multiplicative bias as disclosure limitation techniques for continuous microdata:A simulation study.

Journal of Computational Methods for Science and Engineering Wieland, S. C., Cassa, C. A., Mandl, K. D. and

Berger, B. (1998). Revealing thespatial distribution of a disease while preserving privacy.

Proc. Natl. Acad. Sci. USA

Willenborg, L. C. R. J. and

Waal, T. D. (1996).

Statistical Disclosure Control inPractice . Lecture Notes in Statistics . Springer, New York.

Willenborg, L. C. R. J. and

Waal, T. D. (2001).

Elements of Statistical DisclosureControl . Springer, New York. MR1866909

Woo, M.-J., Reiter, J. P., Oganian, A. and

Karr, A. F. (2009). Global measures ofdata utility for microdata masked for disclosure limitation.

The Journal of Privacy andConﬁdentiality Zhou, Y., Dominici, F. and

Louis, T. A. (2010a). Racial disparities in risks of mortalityin a sample of the U.S. medicare population.

J. Roy. Statist. Soc. Ser. C Zhou, Y., Dominici, F. and

Louis, T. A. (2010b). Supplement to “A smoothing approachfor masking spatial data.” DOI: 10.1214/09-AOAS325SUPP.

Y. ZhouMerck Research LaboratoriesP.O. Box 2000RY34-A304Rahway, New Jersey 07065USAE-mail: yijie [email protected]

F. DominiciDepartment of BiostatisticsHarvard University655 Huntington AvenueBoston, Massachusetts 02115USAE-mail: [email protected]