A smoothing approach for masking spatial data
aa r X i v : . [ s t a t . A P ] N ov The Annals of Applied Statistics (cid:13)
Institute of Mathematical Statistics, 2010
A SMOOTHING APPROACH FOR MASKING SPATIAL DATA
By Yijie Zhou, Francesca Dominici and Thomas A. Louis Merck Research Laboratories, Harvard University and Johns HopkinsUniversity
Individual-level health data are often not publicly available due toconfidentiality; masked data are released instead. Therefore, it is im-portant to evaluate the utility of using the masked data in statisticalanalyses such as regression. In this paper we propose a data maskingmethod which is based on spatial smoothing techniques. The pro-posed method allows for selecting both the form and the degree ofmasking, thus resulting in a large degree of flexibility. We investi-gate the utility of the masked data sets in terms of the mean squareerror (MSE) of regression parameter estimates when fitting a Gener-alized Linear Model (GLM) to the masked data. We also show thatincorporating prior knowledge on the spatial pattern of the exposureinto the data masking may reduce the bias and MSE of the parameterestimates. By evaluating both utility and disclosure risk as functionsof the form and the degree of masking, our method produces a risk-utility profile which can facilitate the selection of masking parame-ters. We apply the method to a study of racial disparities in mortalityrates using data on more than 4 million Medicare enrollees residingin 2095 zip codes in the Northeast region of the United States.
1. Introduction.
Individual-level information such as health data col-lected by, for example, government agencies, are often not publicly available
Received October 2007; revised December 2009. Supported in part by the National Institute for Environmental Health SciencesGrant R01ES012054 and by the Environmental Protection Agency Grants R83622 andRD83241701. The content is solely the responsibility of the authors and does not neces-sarily represent the official views of the National Institute of Environmental Health Sciencenor of the Environmental Protection Agency. Supported in part by the National Institute of Diabetes, Digestive and Kidney DiseasesGrant R01 DK061662. The content is solely the responsibility of the authors and does notnecessarily represent the official views of the National Institute of Diabetes, Digestive andKidney Diseases.
Key words and phrases.
Statistical disclosure limitation, data masking, data utility, dis-closure risk, spatial smoothing.
This is an electronic reprint of the original article published by theInstitute of Mathematical Statistics in
The Annals of Applied Statistics ,2010, Vol. 4, No. 3, 1451–1475. This reprint differs from the original in paginationand typographic detail. 1
Y. ZHOU, F. DOMINICI AND T. A. LOUIS in order to preserve confidentiality. On the other hand, there is public de-mand on these individual-level data for research purposes. As an example,associations of individual health with various risk factors are of great interestand concern nowadays. Statistical research that addresses these two compet-ing needs is known as statistical disclosure limitation , where a large numberof methods are developed on how to process and release information thatis subject to confidentiality concern [Duncan and Lambert (1986); Fienbergand Willenborg (1998); Willenborg and Waal (1996, 2001)]. In this paper werefer to those methods that alter the original data values as “data masking.”Corresponding to the two competing needs, a data masking method shouldbe evaluated from both the utility of the masked data which represents theinformation retained after the masking, and the disclosure risk of the maskeddata which is the risk that a data intruder can obtain confidential informa-tion (e.g., obtain original data values and/or identify an individual to whoma data record belongs). Ideally, masked data would have low disclosure riskwhile preserving data utility as much as possible.Examples of commonly used data masking methods include aggregatedtabular counts for categorical data [Fienberg and Slavkovic (2004)], dataswapping which exchanges values between selected records, with its variousextensions [Dalenius and Reiss (1982); Fienberg and McIntyre (2005)], cellsuppression where certain cells of contingency tables are not displayed [Cox(1995)], simulating synthetic data which have the same (conditional) dis-tribution as the original data [Rubin (1993); Fienberg, Makov and Steele(1998); Raghunathan, Reiter and Rubin (2003); Reiter (2003, 2005b)], andadditive random noise for continuous variables [Kim (1986); Sullivan andFuller (1989); Fuller (1993); Trottini et al. (2004)], etc.Among these methods, data aggregation, data swapping, additive ran-dom noise and many other methods can be formulated as matrix mask-ing [Duncan and Pearson (1991)]. Suppose data on n observations and p variables are stored in a n × p matrix. Matrix masking takes the generalform of Z ∗ = A Z B + C , where Z is the original data matrix and Z ∗ is themasked data matrix. Matrices A , B and C are row (observation) operator,column (variable) operator and random noise, respectively. Links betweenthe above masking methods to matrix masking are investigated in Duncanand Pearson (1991), Cox (1994), Fienberg (1994) and Fienberg, Makov andSteele (1998).Measuring and evaluating utility of masked data is important. In generalthere are two classes of utility measures. One is global utility measures whichreflect the general distribution of masked data compared to that of theoriginal data and are not specific to any analysis. Such measures includethe number of swaps in data swapping, the added variance in the additiverandom noise approach, differences between continuous original and maskeddata in their first and second moments, etc. More sophisticated measures SMOOTHING APPROACH FOR MASKING SPATIAL DATA that compare distributions of masked and original data can be found inDobra et al. (2002), Gomatam, Karr and Sanil (2005) and Woo et al. (2009).In addition, Bayesian decision theory-based utility is discussed in Trottiniand Fienberg (2002) and Dobra, Fienberg and Trottini (2003).The second class of utility measures is analysis-specific tailored to an-alysts’ inference. For the utility associated with regression inference, Karret al. (2006) examine the overlap in the confidence intervals of linear regres-sion coefficients estimated with original and masked data. Kim (1986) andFuller (1993) show for the additive random noise approach that if maskeddata preserve the first two moments of original data, then coefficient es-timates from linear regression using masked data are (approximately) un-biased. In addition, the methods of aggregated tabular counts and dataswapping can produce valid results for loglinear models because they pre-serve the marginal total of contingency tables. This is equivalent to preserv-ing sufficient statistics for loglinear models, given that the margins of allhigher-order interactions that appear in the model are preserved [Fienbergand Slavkovic (2004); Fienberg and McIntyre (2005)]. Recently, Slavkovicand Lee (2010) investigated logistic regression inference for contingency ta-bles that preserve marginal total or conditional probabilities. However, for ageneral data structure additional research is needed. For example, bias andvariance of parameter estimates from nonlinear regression using masked dataare not quantified as functions of masking parameters.We propose a special case of matrix masking where we construct row(observation) transformed data, that is, Z ∗ = A Z , using spatial smoothing.We investigate the mean square error (MSE) of the regression parameterestimates when fitting a Generalized Linear Model (GLM) to the maskeddata, and we provide guidance on how to select the masking parameters toreduce the MSE. Specifically, for both regressors and outcome we constructmasked data which are weighted averages of the original individual-leveldata by using linear smoothers. The shape of the smoothing weight func-tion defines the “form” of masking and the smoothness parameter measuresthe “degree” of masking. By choosing an appropriate weight function andsmoothness parameter value, the masked data can account for prior knowl-edge on the spatial pattern of individual-level data, and parameter estimatesfrom nonlinear regression using such masked data may be less subject to biasand MSE. Although data utility is our main focus, we also evaluate iden-tification disclosure risk. We consider the scenario wherein a data intruderhas correct information on the risk factor regressors (e.g., exposure or de-mographic data) from some external data sources, and his/her objective isto obtain the confidential information on the health outcome through recordmatching. Using our method, we can evaluate both the utility and the disclo-sure risk as functions of the form and the degree of masking, which produces Y. ZHOU, F. DOMINICI AND T. A. LOUIS a risk-utility profile and can facilitate the selection of the masking parame-ters. We also derive a closed-form expression for calculating the first-orderbias of the regression parameter estimates when estimated using the maskeddata, for any assumed distribution of the outcome given the regressors inthe exponential family.We apply our method to a study of racial disparities in risks of mortalityfor a large sample of the U.S. Medicare population. This study consistsof more than 4 million individuals in the Northeast region of the UnitedStates. We develop and apply statistical models to estimate the age andgender adjusted association between race and risks of mortality when usingboth the original individual-level data and the masked data. The estimatedassociation obtained from using the original individual-level data is the gold-standard, and we compare it to the estimated association obtained fromusing the masked data. We also calculate the identification disclosure riskof the masked data sets.In Section 2 we detail the method, and in Section 3 we present the simula-tion studies. In Section 4 we apply our method to the Medicare data set, andin Section 5 we discuss the method and the results. The R code is providedin the Supplement [Zhou, Dominici and Louis (2010b)], while the Medicaredata set is not provided due to a confidentiality agreement. Derivation ofthe closed-form expression for the first-order bias of the GLM regressionparameter estimates when estimated using the masked data is presented inthe Appendix.
2. Methods.
Matrix masking using spatial smoothing.
Assume that the outcomevariable Y and the regressors X are spatial processes { Y ( s ) , X ( s ) } , andthe observed individual-level data { ( Y i , X i ) , i = 1 , . . . , N } are realizations ofthe spatial processes at locations s = { s , . . . , s N } , that is, X i = X ( s i ) , Y i = Y ( s i ), i = 1 , . . . , N . We construct masked data at s using spatial smoothing,and we show later that this masking approach is a special case of matrixmasking by row (observation) transformation.Let W λ ( u, s ; S ) denote the relative weight assigned to data at location s when generating smoothed data for the target location u , where λ ≥ S denotes all spatial locations in a study areaso s is a subset of S . The parameter λ controls the degree of smoothness,with smoothness increasing with λ . For notational convenience we suppressthe dependence of W on S .We consider a subclass of linear smoothers under which the smoothedspatial processes at location u are defined as follows. For λ > Y λ ( u ) = Z Y ( s ) W λ ( u, s ) dN ( s ) . Z W λ ( u, s ) dN ( s ) , SMOOTHING APPROACH FOR MASKING SPATIAL DATA (2.1) X λ ( u ) = Z X ( s ) W λ ( u, s ) dN ( s ) . Z W λ ( u, s ) dN ( s ) , where N ( s ) is the counting process for locations with available data from thespatial processes { Y ( s ) , X ( s ) } . For ∀ u ∈ s we require that W ( u, s ) = I { s = u } .If W is continuous in λ , we define W ( u, s ) as lim λ ↓ W λ ( u, s ). Therefore, wehave that { Y ( s i ) , X ( s i ) } = { Y i , X i } , the original individual-level data.We generate masked data by taking the predictions from (2.1) at s wherethe original individual-level data are available, that is, { Y λ ( s i ) , X λ ( s i ) , i =1 , . . . , N } . By definition in (2.1), the masked data are weighted averages ofthe original individual-level data { Y ( s i ) , X ( s i ) } . The shape of the weightfunction W and the degree of smoothness λ control the form and the de-gree of masking, respectively, where the degree of masking increases withthe degree of smoothness. In practice, the masked data at location s i arecomputed by Y λ ( s i ) = N X k =1 Y k W λ ( s i , s k ) (cid:30) N X k =1 W λ ( s i , s k ) , (2.2) X λ ( s i ) = N X k =1 X k W λ ( s i , s k ) (cid:30) N X k =1 W λ ( s i , s k ) , where the same W and λ are applied to both Y and X . Examples of com-monly used smoothers within this class include parametric linear regressionsfitted by ordinary least square and weighted least square, penalized lin-ear splines with truncated polynomial basis, kernel smoothers and LOESSsmoothers [Simonoff (1996); Bowman and Azzalini (1997); Hastie, Tibshi-rani and Friedman (2001); Ruppert, Wand and Carroll (2003)].Let Y and Y λ denote the vectors of { Y i } and { Y λ ( s i ) } , and let X and X λ denote the matrices of { X i } and { X λ ( s i ) } , respectively, where X i and X λ ( s i ) , i = 1 , . . . , N , are row vectors. It can be seen that Y λ = A λ Y and X λ = A λ X , where A λ = ( A λ ij ) = ( W λ ( s i , s j ) / P Nj =1 W λ ( s i , s j )). Therefore,constructing masked data by equation (2.2) is a special case of matrix mask-ing by row (observation) transformation. Reidentification from ( Y λ , X λ ) to( Y , X ) requires knowledge of both W and λ as well as the existence of A − λ .2.2. Bias and variance in nonlinear regression using masked data.
Biasmay arise when a nonlinear model that is specified for the original individual-level data is fitted to the masked data. Specifically, we assume the followingmodel for the original individual-level data which is viewed as the “truth,” g ( E {Y|X } ) = X β . (2.3) Y. ZHOU, F. DOMINICI AND T. A. LOUIS
Model (2.3) implies the analogous model for the masked data g ( E {Y λ |X λ } ) = X λ β (2.4)only for a linear function g ( x ) = ax , where a is a constant (except for fewspecial circumstances such as X i = x , i.e., constant exposure). Specifically, g ( E {Y λ |X λ } ) = aE {Y λ |X λ } = aE {Y λ |X } = a A λ E {Y|X } model (3) = a A λ a − X β = X λ β . It follows that for a nonlinear regression model (2.3), the coefficient estimateobtained by fitting model (2.4) will be a biased estimate of β . Therefore, it isimportant to evaluate the bias of the coefficient estimate under model (2.4)as well as how the bias varies as a function of the form and the degree of datamasking. To consider both the bias and variance of the coefficient estimateobtained by fitting model (2.4), we evaluate the MSE as a function of theform and the degree of masking.It is common to assume that the masked data are mutually indepen-dent. However, they are generally correlated, since they combine informationacross the same original data. To investigate the impact of this correlationon the uncertainty of the coefficient estimate when using the masked data,we compare the “naive” confidence interval under model (2.4) which doesnot account for this correlation with an appropriate confidence interval ob-tained by using simulation or bootstrap methods [Efron (1979); Efron andTibshirani (1993)].2.3. Identification disclosure risk of masked data.
We evaluate the iden-tification disclosure risk of the masked data by calculating the probabilityof identification as developed in Reiter (2005a). To compute the risk of thereleased masked data set, we first compute the probability of matching fora particular data record.Specifically, let Z = ( Y , X ) denote the unmasked data set and Z λ = ( Y λ , X λ )denote the released masked data set. Let t denote a data vector possessed bya data intruder, where t contains the true values for a particular individual. Z λ can be divided into two components: Z Uλ which consists of variables thatare not available in t , and Z Ap λ which consists of variables that are availablein t . Z = ( Z U , Z Ap ) is the same decomposition of the true data set. Let J be a random variable that equals j if to match t with the j th individual in Z λ . The probability of matching is Pr( J = j | t , Z λ ) , j = 1 , . . . , N , assumingthat t always corresponds to an individual within Z λ . Assumptions aboutthe knowledge and behavior of the intruder are used to determine this prob-ability. Using Bayes’ rule,Pr( J = j | t , Z λ ) = Pr( Z λ | J = j, t ) Pr( J = j | t ) P Nj =1 Pr( Z λ | J = j, t ) Pr( J = j | t ) , SMOOTHING APPROACH FOR MASKING SPATIAL DATA where Pr( Z λ | J = j, t ) can be decomposed intoPr( z λ, , . . . , z λ,j − , z λ,j +1 , . . . , z λ,N | z λ,j , J = j, t ) · Pr( z Uλ,j | z Ap λ,j , J = j, t ) · Pr( z Ap λ,j | J = j, t ) . Following the guidance in Reiter (2005a), we compute each component ofPr( J = j | t , Z λ ) as follows:1. Pr( J = j | t ) = 1 /N . This is because the true values are replaced by someweighted averages upon releasing, so exact matching between t and any Z Ap λ record is not possible.2. Pr( z Ap λ,j | J = j, t ) equals 1 − k z Ap λ,j − t k max Nk =1 k z Ap λ,k − t k , (2.5)which is the tail probability of a uniform distribution with density 1 / max Nk =1 k z Ap λ,k − t k . We assume the intruder knows that the masked dataare weighted averages of the original data. As we point out at the endof Section 2.1, detailed information on W and λ shall not be released.Therefore, it is a reasonable assumption that the intruder will assumea uniform distribution based on the difference from t . The larger thedifference, the smaller the probability.3. Pr( z Uλ,j | z Ap λ,j , J = j, t ) is computed through Z Pr( z Uλ,j | z Uj , z Ap λ,j , J = j, t ) Pr( z Uj | z Ap λ,j , J = j, t ) d z Uj , (2.6)where Pr( z Uλ,j | z Uj , z Ap λ,j , J = j, t ) = 1 − k z Uλ,j − z Uj k max Nk =1 k z Uλ,k − z Uj k , Pr( z Uj | z Ap λ,j , J = j, t )is obtained through regression of Z U on Z Ap λ , and the integral is computedusing Monte Carlo integration.4. Pr( z λ, , . . . , z λ,j − , z λ,j +1 , . . . , z λ,N | z λ,j , J = j, t ) is conservatively assumedto be equal to 1. As pointed out in Reiter (2005a), such assumption pro-vides the upper limit on the identification risks and greatly simplifies thecalculation.Assuming a record t is matched to the individuals with the largest proba-bility of matching, we measure the identification disclosure risk of the entirereleased data set using the expected percentage of correct matches. Same asin Reiter (2005a), we assume that the intruder possesses correct records forall individuals in the released data set and seeks to match each record withan individual with replacement, that is, matching of one record is indepen-dent from matching of another record. Let m j be the number of individual Y. ZHOU, F. DOMINICI AND T. A. LOUIS records with the maximum matching probability for t j , j = 1 , . . . , N . Let I j = 1 if the m j individual records contain the correct match, and I j = 0otherwise. The expected percentage of correct matches is P Nj =1 1 m j I j /N .
3. Simulation studies.
Data generation, parameter estimation and disclosure risk evalua-tion.
In this section we conduct simulation studies to illustrate that pa-rameter estimates from regression using masked data may be less subject tobias and MSE when the selection of the smoothing weight function accountsfor the spatial patterns of exposure. We illustrate this point using three ex-amples. In each case, we define the study area to be [ − , × [ − , s where individual-levelexposure and outcome data are obtained.In each example, we define a spatial process of exposure X ( s ) and weobtain X ( s i ) for s i ∈ s . We simulate the individual-level outcome data at s from a model of the general form Y ( s i ) i . i . d . ∼ Poisson( e µ + βX ( s i ) ) , (3.1)with the individual-level exposure coefficient β being the parameter of in-terest. The values of µ and β are selected to achieve reasonable variabilityof E { Y ( s i ) | X ( s i ) } under model (3.1) across the locations.We construct the masked data { Y λ ( s i ) , X λ ( s i ) } using kernel smoothers,and we estimate the exposure coefficient β λ under model Y λ ( s i ) i . i . d . ∼ Poisson( e µ λ + β λ X λ ( s i ) ) , (3.2)which is analogous to model (3.1) but fitted to the masked data. The maskeddata are constructed and β λ is estimated for each combination of 20 λ valuesand two different kernel weights, respectively, so we can evaluate the biasand the MSE as functions of both the smoothing weight and λ .In addition, we construct spatially aggregated data by equally partitioningthe study area into 7 × Y + j = P n j i =1 Y ( s i ) and¯ X · j = P n j i =1 X ( s i ) /n j , where n j is the total number of individual-level datapoints in cell j , j = 1 , . . . ,
49. We estimate the exposure coefficient β e usingthe aggregated data { Y + j , ¯ X · j } under the analogous model Y + j i . i . d . ∼ n j · Poisson( e µ e + β e ¯ X · j ) . (3.3)To evaluate the identification disclosure risk, we consider the scenariothat a data intruder possesses the correct exposure data, that is, X ( s i ) for s i ∈ s , and seeks the matches with the released data set in order to obtaininformation on the health outcome Y . Specifically, Z Ap is X and Z U is Y .We generate 500 replicates of the individual-level outcome data. For eachreplicate β λ and β e are estimated as above, and the estimates are averagedacross the 500 replicates. SMOOTHING APPROACH FOR MASKING SPATIAL DATA Choice of smoothing weight function.
To select a weight functionthat may lead to less bias and possibly smaller MSE when estimating theexposure coefficient using the masked data, we notice that expectation ofthe masked outcome Y λ ( s i ) with respect to model (3.2) is E { Y λ ( s i ) | X λ ( s i ) } = e µ λ + β λ X λ ( s i ) , while expectation of Y λ ( s i ) with respect to model (3.1) is E { Y λ ( s i ) | X } = Z e µ + βX ( s ) W λ ( s i , s ) dN ( s )= e µ + βX λ ( s i ) Z e β [ X ( s ) − X λ ( s i )] W λ ( s i , s ) dN ( s ) , where X = { X ( s ) } . The comparison between E { Y λ ( s i ) | X } and E { Y λ ( s i ) | X λ ( s i ) } suggests that we can reduce the bias and possibly theMSE of estimating µ and β when using the masked data by selecting a W s.t. R e β [ X ( s ) − X λ ( s i )] W λ ( s i , s ) dN ( s ) is close to 1. One way to construct sucha W is to assign high weights to locations that receive similar exposure asthe target location and low weights otherwise. The W constructed in thisway has the property that it accounts for prior knowledge on the spatialpattern of the exposure. In our examples, this is also the spatial pattern ofthe outcome due to the model assumption (3.1). Therefore, to assess thedifference in bias and MSE when varying the smoothing weight function, weconstruct two different kernel weights for data masking in the way that oneweight accounts for prior knowledge on the spatial pattern of the exposureas above, while the other does not.3.3. Example I . We assume that the exposure is eradiated from a pointsource A and decreases symmetrically in all directions as the Euclideandistance from A increases. Specifically, we define X ( s ) = 7 exp( − r s / . s ∈ [ − , × [ − , r s is the Euclidean distance between lo-cation s and the point source A . Figure 1(a) shows the contour plot of X ( s ). The individual-level outcome is simulated from Y ( s i ) i . i . d . ∼ Poisson( e − X ( s i ) ). Aggregated data of exposure and outcome are con-structed by calculating group summaries of { Y ( s i ) , X ( s i ) } as described inSection 3.1.We construct masked data { Y λ ( s i ) , X λ ( s i ) } by using equation (2.2) withboth the Euclidean kernel weight W ∗ λ and the ring kernel weight W λ whichare defined as follows: W ∗ λ ( u, s ) = exp( −k s − u k /λ ) , (3.4) W λ ( u, s ) = exp( −| r s − r u | /λ ) . (3.5) Y. ZHOU, F. DOMINICI AND T. A. LOUIS
The ring kernel weight W λ ( u, s ) decreases exponentially as the differencebetween r s and r u increases, and such difference is positively associated withthe difference between X ( s ) and X ( u ) according to the spatial pattern ofthe exposure. Figure 1(b) shows the contour plot of W λ ( s , · ). On the otherhand, the Euclidean kernel weight W ∗ λ ( u, s ) solely depends on k s − u k , theEuclidean distance between location u and location s , and therefore doesnot account for prior knowledge on the spatial distribution of the exposure. Fig. 1.
Example I of spatially varying exposure, weight function for spatial smooth-ing, estimates, and disclosure risk. (a) Contour plot of exposure from point source A : X ( s ) = 7 exp( − r s / . , with cells for spatial aggregation. (b) Contour plot of ringweight function W λ ( s , s ) = exp( −| r s − r s | /λ ) for calculating spatially smoothed ex-posure and outcome data at location s , from individual-level exposure X ( s ) in (a) and individual-level outcome Y ( s ) simulated by Y ( s ) ∼ Poisson(exp( −
25 + 4 X ( s ))) where β = 4 , with λ = 0 . . (c) Estimates of β λ with “naive” 95 % confidence intervalsby fitting model Y λ ( s ) ∼ Poisson(exp( µ λ + β λ X λ ( s ))) where { Y λ ( s ) , X λ ( s ) } are con-structed using the ring weight function in (b) and using the Euclidean weight function W ∗ λ ( s , s ) = exp( −k s − s k /λ ) , with reference lines at β = 4 and at the estimate fromaggregated data. (d) Mean square error (MSE) of β λ using “naive” variance. (e) Identi-fication disclosure risk measured by the expect percentage of correct record matching. (f)
Disclosure risk versus MSE for utility-risk trade-off.
SMOOTHING APPROACH FOR MASKING SPATIAL DATA Example II . We assume that the exposure is eradiated from a pointsource A and toward a certain direction. Specifically, we define X ( s ) = 7 × exp( − r s / − cos θ s /
3) for s ∈ [ − , × [ − , θ s is the angle betweenthe direction from point source A to location s and the direction that theexposure is toward, and r s is defined the same as in Example I. Figure 2(a)shows the contour plot of X ( s ). The individual-level outcome is simulatedfrom Y ( s i ) i . i . d . ∼ Poisson( e − X ( s i ) ). Aggregated data of exposure and out-come are constructed by calculating group summaries of { Y ( s i ) , X ( s i ) } asdescribed in Section 3.1.We construct masked data { Y λ ( s i ) , X λ ( s i ) } by using equation (2.2) withthe Euclidean kernel weight (3.4) and the ring angle kernel weight W λ ( u, s ) = exp( − ( | r s − r u | + 2 | cos θ s − cos θ u | ) /λ ) , which decreases exponentially as the difference between r s and r u increasesas well as the difference between cos θ s and cos θ u increases. Figure 2(b)shows the contour plot of W λ ( s , · ).3.5. Example
III . We assume that the exposure is eradiated from a pointsource A but blocked in a certain area, such as blocked by a mountain, so theblocked area receives no exposure. Specifically, we define the unblocked areato be s x ≤ . ϑ s ≤ .
625 for s ∈ [ − , × [ − , s x is the x -axisvalue of location s and ϑ s is the angle between the positive x -axis and thedirection from point source A to location s . We define the exposure X ( s ) =7 exp( − r s / . · I s for s ∈ [ − , × [ − , I s is the indicator that s islocated within the unblocked area, and r s is defined the same as in ExamplesI and II. Figure 3(a) shows the contour plot of X ( s ). The individual-leveloutcome is simulated from Y ( s i ) i . i . d . ∼ Poisson( e − X ( s i ) ). Aggregated dataof exposure and outcome are constructed by calculating group summariesof { Y ( s i ) , X ( s i ) } as described in Section 3.1.We construct masked data { Y λ ( s i ) , X λ ( s i ) } by using equation (2.2) withthe Euclidean kernel weight (3.4) and the ring block kernel weight W λ ( u, s ) = exp( −| r s − r u | /λ ) · ( I s = I u ) , which assigns nonzero weight only when location u and location s are bothin the blocked or unblocked area. In addition, the nonzero weight from W λ ( u, s ) decreases exponentially as the difference between r s and r u in-creases. Figure 3(b) shows the contour plot of W λ ( s , · ).3.6. Results.
Results of Example I on parameter estimates, MSE andidentification risk averaged across the 500 simulation replicates are shown inFigure 1(c)–(e), respectively. Specifically, Figure 1(c) shows the estimated β λ as a function of λ for the ring kernel weight (3.5) and the Euclidean Y. ZHOU, F. DOMINICI AND T. A. LOUIS
Fig. 2.
Example II of spatially varying exposure, weight function for spatialsmoothing, estimates, and disclosure risk. (a) Contour plot of exposure frompoint source A toward a certain direction: X ( s ) = 7 exp( − r s / − cos θ s / , withcells for spatial aggregation. (b) Contour plot of ring angle weight function W λ ( s , s ) = exp( − ( | r s − r s | + 2 | cos θ s − cos θ s | ) /λ ) for calculating spatially smoothedexposure and outcome data at location s , from individual-level exposure X ( s ) in (a) andindividual-level outcome Y ( s ) simulated by Y ( s ) ∼ Poisson(exp( −
36 + βX ( s ))) where β = 4 , with λ = 0 . . (c) Estimates of β λ with “naive” 95 % confidence intervals by fit-ting model Y λ ( s ) ∼ Poisson(exp( µ λ + β λ X λ ( s ))) where { Y λ ( s ) , X λ ( s ) } are constructedusing the ring angle weight function in (b) and using the Euclidean weight function W ∗ λ ( s , s ) = exp( −k s − s k /λ ) , with reference lines at β = 4 and at the estimate fromaggregated data. (d) Mean square error (MSE) of β λ using “naive” variance. (e) Identi-fication disclosure risk measured by the expect percentage of correct record matching. (f)
Disclosure risk versus MSE for utility-risk trade-off. kernel weight (3.4), with the “naive” 95% confidence intervals. By “naive”we mean that the confidence intervals are computed by fitting model (3.2)directly, and therefore do not account for the possible correlation betweenthe masked data as pointed out earlier in Section 2.2. The reference linesare placed at the true value of β and at the estimated β e , from which thebias of estimating the exposure coefficient by using the estimated β λ canbe evaluated. Figure 1(d) shows the MSE as a function of λ for the twokernel weights, where in this example MSE is largely determined by the SMOOTHING APPROACH FOR MASKING SPATIAL DATA Fig. 3.
Example
III of spatially varying exposure, weight function for spatial smooth-ing, estimates, and disclosure risk. (a)
Contour plot of exposure from point source A but blocked in certain area: X ( s ) = 7 exp( − r s / . · I s where I s is the indicator of loca-tion s in the unblocked area, with cells for spatial aggregation. (b) Contour plot of ringblock weight function W λ ( s , s ) = exp( −| r s − r s | /λ ) · ( I s = I s ) for calculating spatiallysmoothed exposure and outcome data at location s , from individual-level exposure X ( s ) in (a) and individual-level outcome Y ( s ) simulated by Y ( s ) ∼ Poisson(exp( −
24 + βX ( s ))) where β = 4 , with λ = 0 . . (c) Estimates of β λ with “naive” 95 % confidence intervalsby fitting model Y λ ( s ) ∼ Poisson(exp( µ λ + β λ X λ ( s ))) where { Y λ ( s ) , X λ ( s ) } are con-structed using the ring block weight function in (b) and using the Euclidean weight func-tion W ∗ λ ( s , s ) = exp( −k s − s k /λ ) , with reference lines at β = 4 and at the estimate fromaggregated data. (d) Mean square error (MSE) of β λ using “naive” variance. (e) Identi-fication disclosure risk measured by the expect percentage of correct record matching. (f)
Disclosure risk versus MSE for utility-risk trade-off. bias. The reference lines are placed at the MSE from regression using theoriginal data (in which the bias part is 0) and the MSE of β e . Figure 1(e)shows the identification disclosure risk of the masked data set measured bythe expected percentage of correct record matching, as a function of λ forthe two kernel weights. Figure 1(f) plots the disclosure risk versus MSE,which shows the trade-off between data utility and disclosure risk.We find that data masking using the ring kernel weight (3.5) leads tosmaller bias and MSE when estimating the exposure coefficient than mask- Y. ZHOU, F. DOMINICI AND T. A. LOUIS ing using the Euclidean kernel weight (3.4), for all λ values that are con-sidered. It suggests that when using the masked data for loglinear regres-sion, a masking procedure that preserves the spatial pattern of the originalindividual-level exposure and outcome data can lead to better estimates interms of smaller bias and MSE than a masking procedure that does notdo so. As λ increases, the bias and MSE increase for both kernel weights,while the differences in the bias and MSE between the two kernel weightsdecrease. This increase in the bias/MSE and decrease in the bias/MSE dif-ferences suggest that in the presence of a high degree of masking, choicefor the form of masking may be less influential on the resultant bias/MSE.Moreover, comparing the estimated β λ and β e , we find that for small valuesof λ , the bias and MSE is smaller when using the estimated β λ from thering kernel weight (3.5).On the other hand, we find that the disclosure risk is lower when using theEuclidean kernel weight (3.4) for data masking compared to using the ringkernel weight (3.5). This is not unexpected because masked data constructedusing the ring kernel weight is more informative about the original truevalues. However, with a tolerable potential disclosure risk [ < λ . Same as the trendfor bias and MSE, the differences in the disclosure risk between the twokernel weights become small as λ increases.Similar results of Example II and Example III are shown in Figure 2(c)–(f)and Figure 3(c)–(f).Figure 4 shows the width ratios comparing the 95% “naive” confidenceintervals versus the percentile confidence intervals obtained from the empiri-cal distribution of the estimates across the 500 simulations, for the estimatesof β λ in the three examples respectively. Width ratio when λ = 0 (the soliddot) is calculated using the nonsmoothed data, that is, the individual-leveldata. We find that in these three examples, the “naive” confidence intervalsgenerally overestimate the uncertainty of the β λ estimates, and the degreeof overestimation increases as λ increases. In addition, for Examples II andIII where the spatial patterns of exposure are nonisotropic, the degree ofoverestimation differs for the weight functions with and without accountingfor prior knowledge on the spatial pattern of exposure.
4. Application to Medicare data.
We apply our method to the studyof racial disparities in risks of mortality for a sample of the U.S. Medicarepopulation.4.1.
Data source.
We extract a large data set at individual-level fromthe Medicare government database. Specifically, it includes individual age,
SMOOTHING APPROACH FOR MASKING SPATIAL DATA Fig. 4.
Width ratios comparing the 95 % “naive” confidence intervals (CI) versus thepercentile CI obtained from the empirical distributions of the estimates across the 500simulations, for the estimates of β λ in (a) Example I , (b) Example II , and (c) Example
III of the simulation studies. Width ratio when λ = 0 (the solid dot) is calculated using thenonsmoothed data. race, gender and a day-specific death indicator over the period 1999–2002,for more than 4 million black and white Medicare enrollees who are 65years and older residing in the Northeast region of the U.S. People who areyounger than 65 at enrollment are eliminated because they are eligible forthe Medicare program due to the presence of either a certain disability orEnd Stage Renal Disease and therefore do not represent the general Medicarepopulation.Figure 5 shows the study area which includes 2095 zip codes in 64 countiesin the Northeast region of the U.S. We select the counties whose centroidsare located within the range that covers the Northeast coast region of theU.S., and we exclude zip codes without available study population from thestudy map. This area covers several large, urban cities including WashingtonDC, Baltimore, Philadelphia, New York City, New Haven and Boston. It hasthe advantage of high population density and substantial racial diversity.We categorize the age of individuals into 5 intervals based on age in his/herfirst year of observation: [65, 70), [70, 75), [75, 80), [80, 85) and [85, +).This categorization facilitates detection of age effects because differences Y. ZHOU, F. DOMINICI AND T. A. LOUIS
Fig. 5.
Location of the 2095 zip codes included in our study area. in the risks of mortality for one-year increase in age are relatively small.We “coarsen” the daily survival information into yearly survival indicators.By doing so, we define our outcome as the probability of the occurrence ofdeath for an individual in one year. This definition adjusts for the differentialfollow-up time.4.2.
Statistical models and data masking.
Let i denote individual, j de-note zip code, t denote year, and D ijt be the death indicator for individual i in zip code j in year t . Similarly as in Zhou, Dominici and Louis (2010a),we define the individual-level model as logit Pr( D tij = 1) = β + β race ij + age ij β (4.1) + β gender ij + ( age × gender ) ij β . Geographic locations for each individual are needed to spatially smooththe individual-level data. However, from the Medicare data we only havethe longitude and latitude of the zip code centroids. Therefore, we applya two-step masking procedure on the individual-level data, where we firstaggregate the individual-level data to zip code-level, and we then spatiallysmooth the zip-code level aggregated data to construct the masked data atthe zip code-level.Specifically, let D ++ j denote the total death count and n j denote thetotal person-years of zip code j . We first obtain from aggregation { % black j ,% agecat j , % male j , % ( agecat × male ) j , p j = D ++ j /n j , j = 1 , . . . , J } , whichare the marginal distributions of race, age, gender, the joint distribution ofage and gender, and the mortality rate, respectively, of each zip code. SMOOTHING APPROACH FOR MASKING SPATIAL DATA Due to the complex spatial pattern of the zip code-level covariates, weuse kernel smoothers with bivariate normal density kernel weights for spatialsmoothing, so the shape of the smoothing weight is flexible by varying thecorrelation parameter value of the bivariate normal distribution. Let thevector s = { s , s } denote the location of a zip code, where s and s arethe longitude and latitude of the zip code centroid, respectively. We usesmoothing kernel weights of the general form W λ ( u, s ) = exp( − ( s − u , s − u ) T Σ − λ ( s − u , s − u ) / , where Σ λ = λ (cid:18) σ ρσ σ ρσ σ σ (cid:19) ,σ and σ are the variances of the longitude and latitude data of the 2095zip codes, respectively. We consider for ρ the following three values:1. ρ = 0, so the weight solely depends on the Euclidean distance k s − u k ;2. ρ = 0 .
5, so higher weight is assigned to s in the northeast and southwestdirections of u ;3. ρ = − .
5, so higher weight is assigned to s in the northwest and southeastdirections of u .Let p jλ denote the smoothed mortality rate of zip code j from whichwe calculate the smoothed death count D ++ jλ = p jλ · n j . Let % black jλ ,% agecat jλ , % male jλ , % ( agecat × male ) jλ denote the smoothed marginaldistributions of race, age, gender and the smoothed joint distribution of ageand gender, respectively, of zip code j . We specify the model for maskeddata as D ++ jλ ∼ Bin( n j , p jλ ) , logit p jλ = β λ + β λ % black jλ + β λ % agecat jλ (4.2) + β λ % male jλ + β λ % ( agecat × male ) jλ . The zip code-level nonsmoothed aggregated data are also used to fit model(4.2).To evaluate the identification disclosure risk, we consider the scenario thata data intruder possesses correct zip code-level demographic data and seeksthe matching with the masked zip code-level data set in order to obtaininformation on the zip code-level mortality. Specifically, the released dataset consists of % black jλ , % agecat jλ , % male jλ and p jλ , j = 1 , . . . , black j , % agecat j and % male j . Y. ZHOU, F. DOMINICI AND T. A. LOUIS
Choice of association measure.
The common approach to report theassociation between race and mortality risks is to report the race coefficients β in model (4.1) and β λ in model (4.2), whose interpretation is subjected tothe coding of the race covariate. For direct understanding of the differencein the risk of death between the black and white populations, we defineand report the population-level odds ratio (OR) of death comparing Blacksversus Whites, which is a function of the predicted values [Zhou, Dominiciand Louis (2010a)]. Therefore, interpretation of this association measuredoes not depend on model parameterization (e.g., on covariate centeringand scaling).Specifically, let P tijb = Pr( D tij = 1 | race ij = Black , age ij , gender ij ) ,P tijw = Pr( D tij = 1 | race ij = White , age ij , gender ij )denote the predicted probabilities of death in year t for a black person anda white person, respectively, whose other covariates values are the same asthe i th individual in the j th zip code. We define the population-level ORfrom the individual-level model (4.1) as follows: OR = P ··· b Q ··· w P ··· w Q ··· b , where P ··· b = X t,i,j P tijb , P ··· w = X t,i,j P tijw ,Q ··· b = 1 − P ··· b , Q ··· w = 1 − P ··· w . Similarly, we define population-level OR λ from model (4.2) using summaryprobabilities P · bλ = P j n j P jbλ P j n j and P · wλ = P j n j P jwλ P j n j , where P jbλ and P jwλ are the predicted probabilities of death in one yearfor zip codes that consist of solely black and solely white populations, re-spectively, and whose marginal and joint distributions of age and genderare the same as zip code j . “Naive” standard errors of log OR λ are cal-culated using the multivariate Delta Method [Casella and Berger (2002)].In addition, bootstrap confidence intervals for log OR λ are calculated using1000 nonparametric bootstrap samples. Both “naive” and bootstrap confi-dence intervals for OR λ are obtained by exponentiating the correspondingconfidence intervals for log OR λ . SMOOTHING APPROACH FOR MASKING SPATIAL DATA Fig. 6.
Estimates of OR λ under model (4.2) and identification disclosure risk as a func-tion of λ for the three weight functions. Estimates of OR λ is plotted with the 95 % “naive”confidence intervals (CI), CI using bootstrap standard error (SE) estimates, and bootstrappercentile CI. OR is estimated by fitting model (4.2) to the nonsmoothed zip code-level ag-gregated data. (a) Estimates of OR λ for bivariate normal density kernel weight with ρ = 0 . (b) Estimates of OR λ for bivariate normal density kernel weight with ρ = 0 . . (c) Esti-mates of OR λ for bivariate normal density kernel weight with ρ = − . . (d) Identificationdisclosure risk measured by expected percentage of correct matching.
Results.
Figure 6(a)–(c) shows the estimates of OR λ under model (4.2)as a function of λ for the three kernel weights respectively, with the 95%“naive” confidence intervals, confidence intervals using bootstrap standarderror estimates and bootstrap percentile confidence intervals. OR is esti-mated by fitting model (4.2) to the nonsmoothed zip code-level aggregateddata. The reference line is placed at the estimate of OR under the individual-level model (4.1).For small values of λ ( < OR λ for all three kernelweights are smaller than the estimate of OR and therefore produce negative Y. ZHOU, F. DOMINICI AND T. A. LOUIS bias, while for larger values of λ the bias differs substantially for differentkernel weights. For example, data masking using the kernel weight with ρ = 0 . λ valuesthat are considered. When using the kernel weight with ρ = − . OR λ are less subject to bias than those from usingthe other two kernel weights. Differences in MSE between the three kernelscan also be inferred from the plots, and we find that using the kernel weightwith ρ = − . OR λ estimates, which is in the opposite direction ofthe relation between the “naive” and the appropriate confidence intervalsin the simulation studies. The two bootstrap confidence intervals are widerthan the “naive” confidence interval when λ = 0, which suggests a systematicdifference between the bootstrap confidence intervals and the “naive” con-fidence intervals regardless of smoothing. This systematic difference occursbecause the nonsmoothed zip code-level aggregated data may not satisfy theBinomial model assumption in (4.2).Figure 6(d) shows the identification disclosure risk of the masked dataset as measured by the expected percentage of correct matches when usingthe three kernel weights for masking, as a function of λ . The disclosure riskfor all three kernel weights are small, ranging from 0.01–0.04. The risk issimilar for the masked data sets when using the kernel weight with ρ = 0and ρ = − . ρ = 0 .
5. Discussion.
We propose a special case of matrix masking based onspatial smoothing techniques, where the smoothing weight function controlsthe form of masking, and the smoothness parameter value directly measuresthe degree of masking. Therefore, data utility and disclosure risk can becalculated as functions of both the form and the degree of masking. In fact,the smoothing weight function W can be any weight function and is not re-stricted by existing smoothing methods. With the variety of combinations ofweight functions and smoothness parameter values, it is feasible to constructmasked data that maintain high data utility while preserving confidentiality.We consider a subclass of linear smoothers that produces masked data asweighted averages of the original data. Therefore, the masked data valuesare within a reasonable range. More importantly, correlation among thevariables is invariant under linear transformation, which may intrinsicallycontribute to better data utility of the masked data. On the other hand,this subclass is a large class. It includes many commonly used smoothers.We do not expect major restriction by focusing on this subclass of linearsmoothers.Using our method, we investigate the utility of the masked data in termsof bias, variance and MSE of parameter estimates when using the masked SMOOTHING APPROACH FOR MASKING SPATIAL DATA data for loglinear and logistic regression analysis. Note that similar studiescan be applied to any GLM. In addition, we evaluate the identification dis-closure risk of the masked data set by calculating the expected percentage ofcorrect record matching. In the simulation studies, we provide guidance forconstructing masked data that can lead to better regression parameter esti-mates in terms of smaller bias and MSE for loglinear models, and we showthe trade-off between better estimates and lower disclosure risk. Specifically,masked data can be constructed by using a smoothing weight function thataccounts for prior knowledge on the spatial pattern of individual-level expo-sure, together with a reasonably low degree of masking. We provide guidancefor how to select such a smoothing weight function for loglinear models. Inaddition, we provide candidate weight functions for three simplified but rep-resentative spatial patterns of exposure.As is expected, masked data that can lead to better estimates are generallymore informative about the original data values and therefore are subjectto relatively higher identification disclosure risk. However, the flexibility inour data masking method enables constructing the masked data that canlead to good parameter estimates, while the disclosure risk is controlled ata low level. In the meanwhile, caution should be placed to the institutein releasing detailed information on the data masking approach along withmasked data. It is pointed out in Section 2.1 that simultaneously releasingthe smoothing weight function W and the smoothness parameter λ in theexistence of A − λ can lead to reidentification of original data. However, evenif only partial information is released, for example, only the informationthat data are masked using smoothing and the smoothing weight function isreleased while the smoothing parameter value is not released, it is possiblethat a smart data intruder can still reconstruct the transformation matrix A λ .We apply our data masking method to the study of racial disparities inrisks of mortality for the Medicare population, and show how the bias andthe variance of the estimated OR of death comparing blacks to whites, andhow the identification disclosure risk, vary with the form and the degreeof masking. The results suggest that in the absence of clear guidance, it ishelpful to explore a large flexible family such as the bivariate normal densitykernel to identify a weight function that can lead to both good utility andlow identification risk for the masked data.We compare the “naive” confidence intervals with the appropriate oneswhich account for the possible correlation among masked data in both thesimulation studies and the data application, where we observe opposite di-rections in the relation between the “naive” and the appropriate confidenceintervals. It suggests no general direction for that relation. One possible rea-son, which is also pointed out in Section 4.4, is that the unmasked data in Y. ZHOU, F. DOMINICI AND T. A. LOUIS the simulation study are simulated from Poisson distributions, while the un-masked data in the data application are real data and do not strictly followthe assumed binomial distribution. Therefore, in the data application, thestandard errors account for both the correlation among the masked dataand the discrepancy of the original data distribution from binomial.The simulation study and data application results show that masked dataconstructed using our method can well preserve confidentiality. Specifically,the identification disclosure risk is reasonably low for all scenarios that weconsider. Note that our calculation of the disclosure risk is conservative: weassume that an intruder possesses true values for all the regressors, and weuse probability 1 for the component Pr( z λ, , . . . , z λ,j − , z λ,j +1 , . . . , z λ,N | z λ,j ,J = j, t ) in the calculation. In addition, the flexibility in the selection ofsmoothing weight function W and smoothness parameter λ can also helpcontrol disclosure risk in addition to improving data utility.Based on our method, we additionally derive a closed-form expressionfor first-order bias of the parameter estimates obtained using the maskeddata, for GLM that belong to the exponential family. The first-order biascalculation is not necessary when both individual-level exposure and healthoutcome data are available so the actual bias can be computed. It may beused by researchers who have only the individual-level exposure informationto explore the possible bias in their analysis using masked data.Although our proposed method uses spatial smoothing and therefore ap-plies to spatial data, it can be easily generalized to other data types becausethe masking procedure is a smoothing technique that takes weighted aver-ages of the original data. For example, the proposed method can be general-ized to smoothing time series data by using the smoothing weight function W λ ( µ, s ), where µ and s denote time points. Also, note that an alternativemethod to mask spatial data is to mask the individual spatial location [seeArmstrong, Rushton and Zimmerman (1999); Wieland et al. (1998)].APPENDIX: FIRST-ORDER BIASWe derive a closed-form expression for the first-order bias of estimatingthe regression coefficients in a GLM that belongs to the exponential family,when using data masked by our method. Let β denote the vector of regres-sion coefficients of a model specified for the original individual-level data.When the model belongs to the exponential family, its log likelihood can beexpressed as LL( β ) = N X i =1 Y i X i β − b ( X i β ) a ( φ ) + C ( Y i , φ ) ,b ′ ( X i β ) = g − ( X i β ), where b ′ ( · ) is the derivative of function b ( · ), and g ( · )is the link function. Substituting the individual-level data { Y i , X i } by the SMOOTHING APPROACH FOR MASKING SPATIAL DATA masked data { Y λ ( s i ) , X λ ( s i ) } , we obtain log likelihood of the analogousmodel when fitted to the masked data,LL m ( β λ ; λ ) = N X i =1 Y λ ( s i ) X λ ( s i ) β λ − b ( X λ ( s i ) β λ ) a λ ( φ λ ) + C λ ( Y λ ( s i ) , φ λ ) , (A.1)where β λ denotes the corresponding vector of regression coefficients. In orderto calculate the MLE of β λ , it is common procedure to calculate the scorefunction from the likelihood (A.1) and take its expectation with respectto the “true” individual-level model E { Y i | X i } . Denote the expected scorefunction as S ( λ, β λ ) and denote β ( λ ) as the solution s.t. S ( λ, β ( λ )) = 0. Itcan be shown that β (0) = β . Taking the derivative of S ( λ, β ( λ )) = 0 withrespect to λ and evaluating it at λ = 0, we obtain the standard result: β ′ (0) = − ( S (0 , β (0)) − · S (0 , β (0)) , (A.2)where S and S are the partial derivatives with respect to the first andsecond components of ∂S/∂λ , respectively. Specifically, S (0 , β (0)) = N X i =1 X Ti (cid:18)Z h ( X ( s ) β ) R ( s i , s ) dN ( s ) − h ′ ( X i β ) Z X ( s ) T R ( s i , s ) dN ( s ) · β (cid:19) (A.3) S (0 , β (0)) = − N X i =1 h ′ ( X i β ) · X Ti X i , where R ( s i , s ) = ∂ ( W λ ( s i ,s ) / R W λ ( s i ,s ) dN ( s )) ∂λ | λ =0 and h ( · ) = g − ( · ), inverse ofthe link function of the GLM. In practice, S (0 , β (0)) in (A.3) is calculatedby substituting the the integrals by summations over all locations where theoriginal individual-level data are available.The quantity β ′ (0) denotes the instant bias of estimating β using maskeddata, when changing from no masking to a very low degree of masking. Asexpected, when (i) X ( s ) is constant across all locations in s , or (ii) g ( · ) is alinear function, S (0 , β (0)) is calculated to be 0, and therefore β ′ (0) = 0.Using β ′ (0), we can approximate the bias of estimating β when fitting aGLM using masked data whose degree of masking is λ , by calculating β ( λ ) − β ≈ β ′ (0) · λ. This bias calculation can be extended to any function of β , for example, thepredicted value. Specifically, bias in estimating f ( β ) can be approximatedby f ( β ( λ )) − f ( β ) ≈ f ′ ( β ) · ( β ( λ ) − β ) ≈ f ′ ( β ) · β ′ (0) · λ. Y. ZHOU, F. DOMINICI AND T. A. LOUIS
It can be seen that the first-order bias approximation can be easily gen-eralized to approximation using higher-order terms of the Taylor series ex-pansion in addition to the first-order term. Specifically, β ( λ ) − β ≈ β ′ (0) · λ + β ′′ (0) · λ / · · · (A.4) + β ( n ) (0) · λ n /n ! , n ≥ . Similarly, we can generalize the bias approximation of estimating f ( β ).A limitation of the bias approximation using Taylor series expansion (A.4)is that we ignore the remainder term β ( n +1) ( ξ ) · λ n +1 ( n +1)! , ξ ∈ (0 , λ ), which maynot be small for large values of λ . Therefore, the approximation only capturesthe bias for λ ≈
0, that is, the instant direction and magnitude of the biaswhen changing from no masking to a very low degree of masking. It may notcapture the total bias for a specified degree of masking. In the applicationof our method to the Medicare data, the first-order bias is calculated tobe 0 for all three kernel weights because R in (A.3) equals 0. In addition,when applying the bias approximation (A.4) to the three examples in thesimulation studies for n = 1 , . . . ,
5, the bias approximation is calculated tobe 0, while nonzero bias is shown by comparing the parameter estimateswhen using the masked data with the true parameter value.
Acknowledgment.
Thanks to Dr. Aidan McDermott for the help on theMedicare data sources.SUPPLEMENTARY MATERIAL
Supplement: R code (DOI: 10.1214/09-AOAS325SUPP). We provide theR code for (1) the simulation study utility part of the three examples, (2)the function to compute the disclosure risk, and (3) the calculation of thebivariate normal density kernel weight matrix.REFERENCES
Armstrong, M. P., Rushton, G. and
Zimmerman, D. L. (1999). Geographically mask-ing health data to preserve confidentiality.
Stat. Med. Bowman, A. W. and
Azzalini, A. (1997).
Applied Smoothing Techniques for Data Anal-ysis: The Kernel Approach with S–Plus Illustrations . Oxford Univ. Press, Oxford.
Casella, G. and
Berger, R. L. (2002).
Statistical Inference . Duxbury Press, NorthScituate, MA.
Cox, L. H. (1994). Matrix masking methods for disclosure limitation in microdata.
SurveyMethodology Cox, L. H. (1995). Network models for complementary cell suppression.
J. Amer. Statist.Assoc. Dalenius, T. and
Reiss, S. P. (1982). Data-swapping: A technique for disclosure control.
J. Statist. Plann. Inference Dobra, A., Fienberg, S. E. and
Trottini, M. (2003). Assessing the risk of disclosureof confidential categorical data. In
Bayesian Statistics 7 , Proceedings of the SeventhValencia International Meeting on Bayesian Statistics (J. Bernardo et al., eds) 125–144.Oxford Univ. Press, Oxford. MR2003170
Dobra, A., Karr, A., Sanil, A. and
Fienberg, S. (2002). Software systems for tabulardata releases.
Internat. J. Uncertain. Fuzziness Knowledge-Based Systems Duncan, G. T. and
Lambert, D. (1986). Disclosure-limited data dissemination.
J. Amer.Statist. Assoc. Duncan, G. T. and
Pearson, R. W. (1991). Rejoinder: “Enhancing access to microdatawhile protecting confidentiality: Prospects for the future.”
Statist. Sci. Efron, B. (1979). Bootstrap methods: Another look at the jackknife.
Ann. Statist. Efron, B. and
Tibshirani, R. (1993).
An Introduction to the Bootstrap . Chapman &Hall, New York. MR1270903
Fienberg, S. E. (1994). Conflicts between the needs for access to statistical informationand demands for confidentiality.
Journal of Official Statistics Fienberg, S. E., Makov, U. E. and
Steele, R. J. (1998). Disclosure limitation usingperturbation and related methods for categorical data (with discussion).
Journal ofOfficial Statistics Fienberg, S. E. and
McIntyre, J. (2005). Data swapping: Variations on a theme byDalenius and Reiss.
Journal of Official Statistics Fienberg, S. E. and
Slavkovic, A. B. (2004). Making the release of confidential datafrom multi-way tables count.
Chance Fienberg, S. E. and
Willenborg, L. C. R. J. (1998). Introduction to the specialissue: Disclosure limitation methods for protecting the confidentiality of statistical data.
Journal of Official Statistics Fuller, W. A. (1993). Masking procedures for microdata disclosure limitation (withdiscussion).
Journal of Official Statistics Gomatam, S., Karr, A. F. and
Sanil, A. P. (2005). Data swapping as a decisionproblem.
Journal of Official Statistics Hastie, T., Tibshirani, R. and
Friedman, J. H. (2001).
The Elements of StatisticalLearning: Data Mining, Inference, and Prediction: With 200 Full-Color Illustrations .Springer, New York. MR1851606
Karr, A. F., Kohnen, C. N., Oganian, A., Reiter, J. P. and
Sanil, A. P. (2006).A framework for evaluating the utility of data altered to protect confidentiality.
Amer.Statist. Kim, J. (1986). A method for limiting disclosure in microdata based on random noiseand transformation. In
American Statistical Association, Proceedings of the Section onSurvey Research Methods
Raghunathan, T. E., Reiter, J. P. and
Rubin, D. B. (2003). Multiple imputation forstatistical disclosure limitation.
Journal of Official Statistics Reiter, J. P. (2003). Inference for partially synthetic, public use microdata sets.
SurveyMethodology Reiter, J. P. (2005a). Estimating risks of identification disclosure in microdata.
J. Amer.Statist. Assoc.
Reiter, J. P. (2005b). Releasing multiply imputed, synthetic public use microdata: Anillustration and empirical study.
J. Roy. Statist. Soc. Ser. A
Rubin, D. B. (1993). Comment on “Statistical disclosure limitation.”
Journal of OfficialStatistics Y. ZHOU, F. DOMINICI AND T. A. LOUIS
Ruppert, D., Wand, M. and
Carroll, R. (2003).
Semiparametric Regression . Cam-bridge Series in Statistical and Probabilistic Mathematics . Cambridge Univ. Press,Cambridge. MR1998720 Simonoff, J. S. (1996).
Smoothing Methods in Statistics . Springer, New York. MR1391963
Slavkovic, A. and
Lee, J. (2010). Synthetic two-way contingency tables that preserveconditional frequencies.
Stat. Methodol . DOI: 10.1016/j.stamet.2009.11.002. To appear.
Sullivan, G. and
Fuller, W. A. (1989). The use of measurement error to avoid disclo-sure. In
American Statistical Association, Proceedings of the Section on Survey ResearchMethods
Trottini, M. and
Fienberg, S. E. (2002). Modelling user uncertainty for disclosurerisk and data utility.
Internat. J. Uncertain. Fuzziness Knowledge-Based Systems Trottini, M., Fienberg, S., Makov, U. E. and
Meyer, M. (2004). Additive noiseand multiplicative bias as disclosure limitation techniques for continuous microdata:A simulation study.
Journal of Computational Methods for Science and Engineering Wieland, S. C., Cassa, C. A., Mandl, K. D. and
Berger, B. (1998). Revealing thespatial distribution of a disease while preserving privacy.
Proc. Natl. Acad. Sci. USA
Willenborg, L. C. R. J. and
Waal, T. D. (1996).
Statistical Disclosure Control inPractice . Lecture Notes in Statistics . Springer, New York.
Willenborg, L. C. R. J. and
Waal, T. D. (2001).
Elements of Statistical DisclosureControl . Springer, New York. MR1866909
Woo, M.-J., Reiter, J. P., Oganian, A. and
Karr, A. F. (2009). Global measures ofdata utility for microdata masked for disclosure limitation.
The Journal of Privacy andConfidentiality Zhou, Y., Dominici, F. and
Louis, T. A. (2010a). Racial disparities in risks of mortalityin a sample of the U.S. medicare population.
J. Roy. Statist. Soc. Ser. C Zhou, Y., Dominici, F. and
Louis, T. A. (2010b). Supplement to “A smoothing approachfor masking spatial data.” DOI: 10.1214/09-AOAS325SUPP.
Y. ZhouMerck Research LaboratoriesP.O. Box 2000RY34-A304Rahway, New Jersey 07065USAE-mail: yijie [email protected]
F. DominiciDepartment of BiostatisticsHarvard University655 Huntington AvenueBoston, Massachusetts 02115USAE-mail: [email protected]