Privacy-Preserving Causal Inference via Inverse Probability Weighting

Si Kai Lee*, Luigi Gresele, Mijung Park, Krikamol Muandet
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
University of Tübingen, Tübingen, Germany
Abstract
The use of inverse probability weighting (IPW) methods to estimate the causal effect of treatments from observational studies is widespread in econometrics, medicine and social sciences. Although these studies often involve sensitive information, thus far there has been no work on privacy-preserving IPW methods. We address this by providing a novel framework for privacy-preserving IPW (PP-IPW) methods. We include a theoretical analysis of the effects of our proposed privatisation procedure on the estimated average treatment effect, and evaluate our PP-IPW framework on synthetic, semi-synthetic and real datasets. The empirical results are consistent with our theoretical findings.
1 Introduction

The increasing ubiquity of machine learning in our daily lives has created a pressing need for trustworthy artificial intelligence (AI). One key tenet of trustworthy AI, defined in the European Commission's Ethics Guidelines for Trustworthy AI [1], is privacy. Preserving patient privacy is critical in medicine as it builds trust and fosters thoughtful decision making, which in turn helps improve patient care. Although the privacy requirements in the medical field are especially high, such requirements are not unique. In observational studies for the social sciences, it is commonly assumed that sensitive information such as employment, education and criminal records used in analyses will be kept private. The datasets compiled from these studies often contain personal information, hence conclusions drawn from statistical analyses on such datasets run the risk of violating the privacy of those present in the datasets. Such risk extends to causal inference methods [2, 3, 4, 5] as they fall under the umbrella of statistical estimation tools.

We illustrate the problem of causal inference with an example. In medicine, it is crucial to have a well-rounded understanding of the efficacy of different medical treatments, since a treatment can produce a broad range of responses across patients and multiple treatments are usually available. The treatment effect can be modelled from observational data by looking at how patients responded to different treatments in the past. However, two formidable obstacles need to be overcome in order to obtain a sound causal effect estimation. First, for each patient we only observe the outcome associated with the treatment that patient received, and no other [6].

* Contact: [email protected]
Second, the treatments that patients receive are not assigned at random, as doctors assign the treatment they expect to work best for each patient; as a result, treatment assignments and outcomes are subject to confounding, which could result in a biased estimate of the outcome of each treatment. An accurate estimate of the treatment effect should take confounding into account, possibly modelling how individual characteristics of the patient determine the assigned treatment. However, this often requires collecting sensitive information from patients.

The propensity score is arguably one of the most used quantities in causal inference for observational studies. It forms the basis of popular techniques such as matching, stratification, and inverse probability weighting (IPW) [7, 8, 9, 10], which are extensively used in econometrics, medicine, and social sciences [3]. Moreover, the IPW estimator based on propensity scores is the backbone of several counterfactual inference algorithms in the machine learning literature [11, 12, 13, 14]. Despite its widespread use, the propensity score, defined as the probability of assignment of a particular treatment given observed covariates, could depend on sensitive information about the patients, such as age, gender, race, and ethnicity. Little concern has thus far been raised regarding the privacy issues related to the use of propensity scores in such methods. Previously, propensity scores have been used by [15] as privatised substitutes for individual covariates. However, as we show in Section 3, this method still violates privacy [16] since sensitive observational data is used for estimating propensity scores.

Inverse probability weighting (IPW) methods, which employ estimated propensity scores, are frequently used to estimate the average treatment effect (ATE) [17, 18].
Since estimating the ATE with IPW methods still requires sensitive observational data, privacy is further violated. To address this important but neglected issue, we develop a novel framework for privacy-preserving IPW methods. This framework consists of two steps: (1) we learn a privacy-preserving propensity score estimator, and (2) we output a privacy-preserving ATE estimator. In addition, we investigate the effect of privatisation in both steps on the performance of the resulting causal analysis.

Related Work.
To the best of our knowledge, this paper is the first work which formally investigates the privatisation of the propensity score function and of average treatment effect estimates obtained with IPW methods. There have only been a few prior attempts to privatise causal inference techniques in different contexts. For example, in [19], the authors demonstrated how one could privatise statistical dependence scores such as Spearman's ρ and Kendall's τ under the additive noise model. The main focus of [19] is to obtain privatised scores that can still correctly identify the causal direction between two random variables. In [20], the authors developed a differentially private constraint-based causal graph discovery method for categorical data. None of the above papers considered propensity score-based causal inference methods, which are our focus. Since propensity score-based methods are fundamental to econometrics, medicine, the social sciences, etc., we expect this work to impact diverse fields.

Contributions.
In this paper, we propose a privacy-preserving framework for IPW methods which comprises a propensity score estimator and an IPW-based average treatment effect estimator. We privatise the parameters of the logistic regression model used to estimate the propensity scores, as well as the output of inverse probability weighting. This guarantees the privacy of the individuals in the training dataset used to learn the propensity score estimator and of the individuals in the estimation dataset used to estimate the ATE. We analyse the effect of the noise added to enforce privacy on the resulting estimated causal effect. Our analysis provides guidelines on how many samples we need to guarantee a certain level of privacy while providing accurate causal inference. We test our method on synthetic, semi-synthetic and real-world datasets to illustrate its effectiveness.

The rest of this paper is organised as follows. Section 2 provides a review of propensity scores in causal inference and of differential privacy. The privacy-preserving propensity scores, the IPW estimator, as well as their theoretical guarantees are then presented in Section 3, followed by experimental results in Section 4. Finally, Section 5 concludes the paper.

Table 1: Key Quantities

  Symbol   Description
  ε        Privacy loss
  δ        Failure probability
  μ_t      Population mean under treatment t
  μ̂_t      Estimated μ_t
  μ̂_t^ε    Privatised μ̂_t
  τ        Average treatment effect (ATE), μ_1 − μ_0
  τ̂        ATE estimate, μ̂_1 − μ̂_0
  τ̂_n      Partially privatised ATE estimate
  τ̂_n^ε    Fully privatised ATE estimate
  π_w      Propensity score function
  π_ŵ      Estimated π_w
  π_ŵ^ε    Privatised π_ŵ

2 Background

In this section, we introduce relevant concepts from causal inference and differential privacy. The key quantities are summarised in Table 1.
The potential outcomes framework is one of the most widely-used approaches in causal inference [21, 22, 23]. It provides the mathematical basis for estimating the outcome of an experiment which has not been performed, given outcomes observed under other experimental settings.

Consider the setting where we want to estimate whether a given treatment has a positive, negative or null effect on different units/individuals. We define T as the treatment variable and Y_t as the random variable representing the potential outcome associated with treatment T = t. In medicine, T could represent different cancer treatments and Y_t an indicator for patient recovery after treatment t. Throughout this paper, we focus on the binary treatment setting, i.e., T ∈ {0, 1},
and refer to the subset of the population with T = 1 as the treatment group, and the rest, with T = 0, as the control group. The random variables Y_1 and Y_0 are the outcomes associated with the treatment and control groups, respectively.

The question we want to answer is: what is the effect of administering a treatment to a unit compared to not doing so? Quantitatively, this can be characterised by the difference between the outcome when the treatment is administered and the outcome when it is not, i.e., Y_1 − Y_0. To estimate this, we would require both the outcome under treatment and the outcome under no treatment to be observed for every unit. However, for each unit we can observe only either Y_1 or Y_0. In practice, we substitute each unobserved quantity with an estimate of its expected outcome, μ_t := E[Y_t], and evaluate the average treatment effect (ATE) with τ = E[Y_1] − E[Y_0]. Given a dataset D = {(t_1, y_1), ..., (t_N, y_N)}, we can approximate μ_t with μ̂_t = n_t^{-1} Σ_{i=1}^N 1(t_i = t) · y_i, where 1(·) is an indicator that returns 1 when t_i = t and 0 otherwise, and n_t is the number of units with t_i = t.

If D is collected from a randomised experiment, τ̂ := μ̂_1 − μ̂_0 is an unbiased estimate of τ. However, the problem with most observational studies is that τ̂ is generally biased because of potential confounding variables X that affect both T and Y_t. For example, X could be the current stage of a patient's cancer, which could influence both the decision of the physician regarding the treatment and the outcome of the treatment. To obtain an unbiased ATE estimate despite the confounding variables X for a dataset D = {(x_i, t_i, y_i)}_{i=1}^N, we expand each E[Y_t] as E[Y_t] = E_X[E[Y_t | X, T = t]] and compute the difference of this quantity with t = 1 and with t = 0.
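The adjustment E_X[E[Y_t | X, T = t]] can be illustrated with a small, self-contained sketch (a hypothetical example with a single confounder, not the paper's experimental setup): when a confounder x drives both t and y, the naive difference of group means is biased, while averaging within-stratum differences over strata of x approximately recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data with one confounder x: x raises both the
# probability of treatment t and the outcome y. The true ATE is 1.0.
n = 100_000
x = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-2 * x))).astype(int)  # confounded assignment
y = x + 1.0 * t + 0.1 * rng.normal(size=n)

# Naive estimate mu_1_hat - mu_0_hat: biased, since x differs across groups.
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusted estimate E_X[E[Y | X, T=1] - E[Y | X, T=0]], approximated by
# stratifying x into deciles and averaging the within-stratum differences.
edges = np.quantile(x, np.linspace(0, 1, 11))
stratum = np.clip(np.digitize(x, edges[1:-1]), 0, 9)
diffs, weights = [], []
for b in range(10):
    m = stratum == b
    if t[m].sum() > 0 and (1 - t[m]).sum() > 0:  # need both groups in the stratum
        diffs.append(y[m][t[m] == 1].mean() - y[m][t[m] == 0].mean())
        weights.append(m.sum())
adjusted = np.average(diffs, weights=weights)

print(naive, adjusted)  # naive overshoots 1.0; adjusted lands close to it
```

Here the strata play the role of conditioning on X; IPW, introduced below, achieves the same adjustment by reweighting with the propensity score instead of binning.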
The validity of this estimate can be assessed given three technical requirements, which we assume throughout:

(i) Stable Unit Treatment Value Assumption (SUTVA): the observed outcome of the i-th unit, Y^(i), is unaffected by the treatments assigned to other units.
(ii) Ignorability: T ⊥ (Y_0, Y_1) | X.
(iii) Positivity: 0 < P(T = 1 | X = x) < 1 for all x.

See [3] for a thorough exposition of these assumptions.

Propensity Scores.
The propensity score is one of the most widely used quantities for causal analysis in observational studies [7, 8, 9, 10]. In the binary treatment setting, the propensity score π(x) is defined as the probability of a unit with covariate x receiving treatment T = 1, π(x) := P(T = 1 | X = x). It has been shown that, under the above set of assumptions, the propensity score π(x) summarises all the relevant information in X for causal inference, so that T ⊥ (Y_0, Y_1) | π(X) holds [8]. Unfortunately, in most observational studies, the true treatment assignment mechanism is not known. Thus, a common practice is to fit a propensity score function on data D using standard statistical models π_w(x) = f_w(x), where f_w represents a model parameterised by a parameter vector w. In this work, we focus on the logistic regression model since it is the most frequently used model for fitting propensity scores [24]. The model is defined as

    π_w(x) = 1 / (1 + e^{−w^⊤ x}),    (1)

where w ∈ R^d. Other popular techniques for propensity score estimation include classification and regression trees (CART), boosted CART, random forests, etc.

Inverse Probability Weighting (IPW).
One popular propensity score-based method for estimating τ from observational data is IPW [17, 18]. Using IPW, we can obtain an unbiased ATE estimate τ̂ = μ̂_1 − μ̂_0 with μ̂_1 and μ̂_0 defined as

    μ̂_1 := (1/N) Σ_{i=1}^N y_i t_i / π(x_i),    μ̂_0 := (1/N) Σ_{i=1}^N y_i (1 − t_i) / (1 − π(x_i)),    (2)

where N is the number of units in D. In practice, we replace the propensity score function π with its empirical estimate π_ŵ.

Differential Privacy.
The notion of differential privacy (DP) [16] provides a well-defined framework to describe the privacy properties of statistical estimation algorithms. DP states that a privacy-preserving, randomised algorithm behaves similarly on similar datasets. Specifically, the algorithm's behaviour is quantified in terms of a probability ratio, which describes how the algorithm's output changes when different datasets are used as input. Intuitively, the probability ratio does not change much (the algorithm behaves similarly) if the input datasets differ by a single entry (similar datasets). The formal definition is given below.

Definition 1.
A randomised algorithm A with domain N^{|X|}, where X is the data universe, satisfies (ε, δ)-differential privacy, i.e., is (ε, δ)-DP, if for all S ⊆ Range(A) and for all neighbouring D, D′ ∈ N^{|X|} such that ‖D − D′‖_1 ≤ 1, i.e., the two datasets D, D′ differ in only one entry,

    P[A(D) ∈ S] ≤ exp(ε) P[A(D′) ∈ S] + δ,

where the probability space is over the outputs of A.

Here, ε is defined as the privacy loss controlling the level of privacy. For 0 < δ < 1, δ defines the failure probability, i.e., an algorithm is ε-DP with probability at least 1 − δ.

Gaussian Mechanism.
The Gaussian mechanism [25] is commonly used to privatise models (see, e.g., [16, 26] for other DP mechanisms). The mechanism privatises a vector-valued function f : D ↦ R^p by adding Gaussian noise to it. The noise is calibrated based on the L2-sensitivity of f, defined by S(f) = max_{D, D′: ‖D−D′‖_1 = 1} ‖f(D) − f(D′)‖_2. The privatised function has the form f̃(D) = f(D) + N(0, σ² I_p). A choice of σ ≥ ε^{-1} √(2 log(1.25/δ)) S(f) produces a function f̃(D) that is (ε, δ)-DP.

Differentially Private Empirical Risk Minimisation (DP-ERM).
Let ℓ : R × R → R_+ be the loss function and the vector w the model parameters of π_w. Under the ERM framework, the optimal model parameters ŵ are obtained by minimising the empirical risk function J(w, D) = m^{-1} Σ_{i=1}^m ℓ(π_w(x_i), t_i) + λ Ω(w), where Ω(·) is the regulariser and λ > 0 the regularisation constant. If logistic regression (1) is used to model the propensity score function π_w, the parameters w can be learned using ERM with an L2-regulariser. We assume throughout that X is contained in the L2-unit ball, i.e., ‖x_i‖_2 ≤ 1 for all x_i ∈ X (a typical assumption on the dataset in the differential privacy literature). Note that the dataset D_m = {(x_i, t_i, y_i)}_{i=1}^m contains the observed treatment effects y_i, but that they are not used to fit π_w. In our case, the regularised cross-entropy loss J(w, D) is

    −(1/m) Σ_{i=1}^m [t_i log p_i + (1 − t_i) log(1 − p_i)] + λ ‖w‖²,    (3)

where p_i := π_w(x_i). This loss is equivalent to the logistic loss used in [27]. The L2-sensitivity of the ŵ obtained by minimising (3) is given by S(ŵ) = max_{D, D′: ‖D−D′‖_1 = 1} ‖ŵ(D) − ŵ(D′)‖_2 ≤ 2(mλ)^{-1} [27].

3 Privacy-Preserving Inverse Probability Weighting

We start by explaining why logistic regression-based estimation of the propensity score by minimising (3) is not private. An intuitive explanation is the following: since an ERM solution can be written as a linear combination of training samples, the ŵ and ŵ′ estimated from datasets D and D′ differing by one single entry could be completely different, e.g., if the differing entry is an outlier. Hence, the likelihood ratio of the two models ŵ and ŵ′ estimated from two neighbouring datasets would be unbounded, which makes ŵ not ε-DP.
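As a concrete illustration of the pieces introduced so far, the sketch below fits the regularised logistic propensity model (3) on a split D_m by gradient descent, privatises ŵ with the Gaussian mechanism calibrated by the sensitivity bound S(ŵ) ≤ 2(mλ)^{-1}, and plugs the resulting scores into the IPW estimator (2) on the remaining split. The data-generating process, sample sizes and optimiser settings are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data, with covariates rescaled into the unit L2 ball
# (as the sensitivity bound assumes). Sizes are illustrative only.
d, m, n = 5, 2000, 2000
X = rng.normal(size=(m + n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True).max()          # ensures ||x_i|| <= 1
w_true = rng.normal(size=d)
t = (rng.random(m + n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
tau_true = 2.0
y = X @ rng.normal(size=d) + tau_true * t + 0.1 * rng.normal(size=m + n)

# Split: D_m fits the propensity model, D_n estimates the ATE.
Xm, tm = X[:m], t[:m]
Xn, tn, yn = X[m:], t[m:], y[m:]

def fit_logreg(X, t, lam, lr=1.0, iters=2000):
    """Gradient descent on the regularised cross-entropy objective (3)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (p - t) / len(t) + 2 * lam * w  # d/dw of (3)
        w -= lr * grad
    return w

lam, eps, delta = 0.01, 1.0, 1e-5
w_hat = fit_logreg(Xm, tm, lam)

# Gaussian mechanism on the weights, with noise scale calibrated by the
# ERM sensitivity bound S(w_hat) <= 2 / (m * lam) stated in the text.
sigma = np.sqrt(2 * np.log(1.25 / delta)) / eps * 2 / (m * lam)
w_eps = w_hat + rng.normal(scale=sigma, size=d)

def ipw_ate(w, X, t, y):
    """IPW estimate (2) of the ATE using propensity scores from w."""
    pi = 1 / (1 + np.exp(-X @ w))
    return np.mean(y * t / pi) - np.mean(y * (1 - t) / (1 - pi))

tau_hat = ipw_ate(w_hat, Xn, tn, yn)   # non-private IPW estimate
tau_eps = ipw_ate(w_eps, Xn, tn, yn)   # estimate with privatised weights
print(tau_hat, tau_eps)
```

With a moderate ε the estimate from the privatised weights typically stays close to the non-private one, at the cost of extra variance from z; shrinking ε inflates σ and can even flip the sign of the estimate, which is exactly the failure mode quantified later in the paper.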
A more rigorous proof can be found in [27]. Since existing propensity score estimation often relies on ERM, standard methods for modelling the propensity score function yield a model that violates differential privacy: given π_ŵ and all points in dataset D bar one, it is possible to infer the covariates x_i of the omitted unit. This is a serious problem, since propensity score-based methods are frequently used to estimate causal effects from observational studies containing sensitive data, and it highlights the need for privacy-preserving propensity score estimators and propensity score-based methods.

Privatising the ATE estimated via IPW requires two layers of privatisation. The first step of our proposed method is to divide the dataset D into D_m = {(x_i, t_i, y_i)}_{i=1}^m and D_n = {(x_i, t_i, y_i)}_{i=1}^n, and to use the first m points to learn ŵ and the remaining n points to estimate τ̂. We separately privatise the estimated propensity score function with respect to the m datapoints and the estimated ATE with respect to the n datapoints, to ensure that all N datapoints (where N = m + n) in D are protected. We describe each step in detail in the next two subsections.

To privatise the logistic regression model, we first compute a non-private version of the propensity score π_ŵ by minimising (3). Next, we use the Gaussian mechanism to generate a privacy-preserving version of ŵ, denoted ŵ_ε.

Definition 2.
Let ŵ be the solution of (3). A privacy-preserving propensity score function is

    π_ŵ^ε(x) = 1 / (1 + exp(−ŵ_ε^⊤ x)) = 1 / (1 + exp(−ŵ^⊤ x − z^⊤ x)),    (4)

where ŵ_ε := ŵ + z with z ∼ N(0, σ² I_d) and σ = ε^{-1} √(2 log(1.25/δ)) S(ŵ), for ε ∈ (0, 1] and any δ ∈ (0, 1].

Alongside Definition 2, we define the counterparts of τ̂, μ̂_1 and μ̂_0 which use π_ŵ^ε(x) in place of π_ŵ(x) over the n points as τ̂_n, μ̂_1^ε and μ̂_0^ε. This yields an ATE that is DP w.r.t. D_m: τ̂_n := μ̂_1^ε − μ̂_0^ε, where

    μ̂_1^ε := (1/n_1) Σ_i y_i / π_ŵ^ε(x_i),    μ̂_0^ε := (1/n_0) Σ_i y_i / (1 − π_ŵ^ε(x_i)),    (5)
with the first sum taken over the n_1 points where t_i = 1, the second sum over the n_0 points where t_i = 0, and n_1 + n_0 = n. The estimator τ̂_n only safeguards the privacy of m points within the dataset, those belonging to the split D_m. Privatisation of the points in D_n is discussed in the next subsection. Here, we focus on the effect of the noise added for protecting D_m on causal inference.

Characterising τ̂_n. Both μ̂_1^ε and μ̂_0^ε are weighted sums of correlated log-normal random variables, where the magnitudes and signs of the weights depend on the data. This can be seen from their formulae

    μ̂_1^ε = (1/n_1) Σ_i y_i (1 + exp(−ŵ^⊤ x_i) exp(−z^⊤ x_i)),    (6)
    μ̂_0^ε = (1/n_0) Σ_i y_i (1 + exp(ŵ^⊤ x_i) exp(z^⊤ x_i)),    (7)

and by recalling that z is a Gaussian random variable. Obtaining closed-form expressions for the above quantities with finite samples is highly nontrivial [28, 29]. As a first step towards understanding the random variable τ̂_n, we show how its expected value can be rewritten in terms of the non-privatised ATE estimate τ̂ and the variance of the added noise. See Appendix A for the proof.

Lemma 3.
Let α_i := (−1)^{1−t_i} y_i exp((−1)^{t_i} ŵ^⊤ x_i) be a constant for i = 1, ..., n. Then we have E[τ̂_n] = τ̂ + g(ε, m, n, λ, δ), where the function g(ε, m, n, λ, δ) is given by

    (1/n) Σ_{i=1}^n α_i [exp(σ² ‖x_i‖² / 2) − 1],    (8)

with σ as in Definition 2.

Lemma 3 allows us to interpret τ̂_n as a biased estimate of τ̂, where the additive bias term is a function of the privacy loss ε, the sample sizes m and n, the regularisation constant λ, and the failure probability δ. Notice that the bias g(ε, m, n, λ, δ) converges to zero as either the privacy budget ε or the total number of points m + n goes to infinity, as is to be expected. We complement our insights with numerical simulations (see Figure 1) that describe the behaviour of this nontrivial bias term.

To study the probability of τ̂_n having the opposite sign w.r.t. the non-private estimator, we assume that its support is bounded both from above and below. This allows us to employ standard concentration inequality results for variables with bounded support. We bound the support of the estimator by either deterministic or probabilistic means. An in-depth discussion of how we do so is provided in Appendix B. The next theorem characterises the behaviour of τ̂_n.

Theorem 4.
Assume that τ̂ > 0 and sign(τ̂) = sign(τ). If |τ̂_n| ≤ η for some η > 0, we have

    P(τ̂_n ≤ 0 | τ̂ > 0) ≤ exp(−(τ̂ + g(ε, m, n, λ, δ))² / (2η²)).

Proof.
Given that |τ̂_n| ≤ η (or the result in Lemma 6 with probability at least 1 − γ), we can apply Lemma 3 and Hoeffding's inequality for bounded variables to obtain the result.

Qualitatively, this is what we expect: the theorem implies that the larger the magnitude of the (true) estimated ATE, the smaller the probability of drawing incorrect causal conclusions from the partially privatised estimator. Interestingly, the bound in Theorem 4 provides a quantitative characterisation, showing that this probability decreases exponentially as a function of τ̂. The probability of drawing incorrect conclusions from τ̂_n also depends exponentially on the bias g(ε, m, n, λ, δ). We provide empirical results clarifying the dependency on this term in Section 4.

We now proceed to privatising the n points used to compute τ̂_n. Given that P(T = 1 | x_i) is bounded above and below by ω_u and ω_l respectively for all i ∈ [n], τ̂_n is consequently bounded as well. By further assuming that |y_i| ≤ C_y for all i ∈ [n], we ensure that the L2-sensitivity of τ̂_n, S(τ̂_n), is bounded by n^{-1} C_y max[1/ω_l, 1/(1 − ω_u)]. We show how this quantity is obtained in Appendix C. We apply the Gaussian mechanism to τ̂_n to obtain a privacy-preserving approximation τ̂_n^ε that is (ε, δ)-DP w.r.t. the n points used to obtain the estimate. This yields an ATE that is DP w.r.t. both D_m and D_n: τ̂_n^ε = τ̂_n + e, where the noise is drawn from e ∼ N(0, σ_n²) with standard deviation σ_n := ε^{-1} √(2 log(1.25/δ)) S(τ̂_n).

Characterising τ̂_n^ε. We now present our main result, which bounds the probability that both τ̂_n^ε and τ̂_n yield incorrect causal conclusions. See Appendix E for the proof.

Theorem 5.
Assume that τ̂ > 0 and sign(τ̂) = sign(τ). If |τ̂_n| ≤ η for some η > 0, we have

    P(τ̂_n^ε ≤ 0, τ̂_n ≤ 0 | τ̂ > 0) ≤ (1/2) exp(−(τ̂ + g)² / (2η²)) [1 − erf(|τ̂_n| / (σ_n √2))],

where g := g(ε, m, n, λ, δ).

The additional (1/2)[1 − erf(|τ̂_n| / (σ_n √2))] term in Theorem 5 represents the probability that the noise e added by the Gaussian mechanism exceeds |τ̂_n|, computed via the Gaussian CDF. The bound in Theorem 5 thus further accounts for τ̂_n and σ_n: the probability that both τ̂_n^ε and τ̂_n are negative decreases exponentially with |τ̂_n| and grows with σ_n, where the latter is a function of ε, δ and S(τ̂_n). This theorem provides a full characterisation of our proposed differentially private ATE estimation procedure. It summarises the effects of protecting all m + n = N points, and shows that the probability of drawing an incorrect causal conclusion decays exponentially w.r.t. the values of τ̂ and τ̂_n.

Figure 1: Behaviour of g(ε, m, n, λ, δ) w.r.t. the sample size m (for ε ∈ {0.2, 0.4, 0.6, 0.8, 0.99}) and the privacy loss ε (for m ∈ {10, 100, 1000}). As expected, the bias decreases as the sample size m and the privacy loss ε increase.

Algorithm 1 PP-IPW
Input: data D = {(x_i, t_i, y_i)}_{i=1}^N and privacy parameters (ε, δ).
Split D into two random subsets D_m and D_n consisting of m and n data points.
A. Obtain the DP propensity score function:
   Minimise J(w, D_m) in (3) to obtain the non-private estimate ŵ.
   Set ŵ_ε = ŵ + z, where z ∼ N(0, σ_m² I_d) and σ_m = ε^{-1} √(2 log(1.25/δ)) S(ŵ).
   Output the DP propensity score function π_ŵ^ε(x) = 1 / (1 + exp(−ŵ_ε^⊤ x)).
B. Obtain the DP ATE:
   Given D_n, compute τ̂_n := μ̂_1^ε − μ̂_0^ε, where μ̂_1^ε := n_1^{-1} Σ_{i: t_i=1} y_i / π_ŵ^ε(x_i) and μ̂_0^ε := n_0^{-1} Σ_{i: t_i=0} y_i / (1 − π_ŵ^ε(x_i)).
   Output the DP ATE τ̂_n^ε = τ̂_n + e, where e ∼ N(0, σ_n²) and σ_n := ε^{-1} √(2 log(1.25/δ)) S(τ̂_n).
Output: DP propensity score function π_ŵ^ε (w.r.t. D_m) and DP ATE τ̂_n^ε (w.r.t. D).

Remark.
Notice that in Theorem 5 we chose to bound the probability P(τ̂_n^ε ≤ 0, τ̂_n ≤ 0 | τ̂ > 0). However, depending on the application, other quantities, such as P(τ̂_n^ε ≤ 0 | τ̂ > 0) or P(τ̂_n^ε ≤ 0 | τ̂_n ≤ 0, τ̂ > 0), might be more interesting. It is also possible to bound those quantities; the proofs follow from that of Theorem 5 presented in Appendix E, and we omit them for brevity.

Putting everything together, our method is presented in Algorithm 1. For the sake of simplicity, we assigned the same privacy budget to privatising the propensity score and the average treatment effect, but one could choose separate privacy levels for the two quantities. We also include in Appendix F bounds for the average treatment effect for the treated (ATT) and the average treatment effect for the controls (ATC). We leave the development of privatised estimators based on more sophisticated techniques as future work.

4 Experiments

In this section, we demonstrate our theoretical findings with experiments on synthetic, semi-synthetic and real data. We set the probability of failure δ = 10^− and the regularisation coefficient λ = 0. for all experiments. The logistic regression model is implemented in PyTorch [30] and optimised via gradient descent using the entire dataset. To ensure reproducibility, we set the random seed for both NumPy and PyTorch to 1.

Synthetic Data.
For this set of experiments, we vary the m points used to fit the logistic regression model, use n = 1000 points to estimate the ATE, and average over 100 trials. We generate the covariates by sampling the x_i's separately from a zero-mean isotropic Gaussian N(0, ·I) for each trial and standardising each separately sampled set of x_i's by the maximum L2-norm of the x_i's in the set. For each standardised x_i, the treatment assignment t_i and outcome y_i are generated across trials in the following manner:

    t_i ∼ Bernoulli(s(a^⊤ x_i)),    a ∼ N(0, I),
    y_i = b^⊤ x_i + t_i τ + ϑ,      b ∼ N(0, I),

where ϑ is zero-mean Gaussian noise, τ ∈ {0.1, 2} is a non-zero bias, and s(·) is the sigmoid function. We perturb the weights of the learned logistic regression model with a sample from N(0, σ² I), with σ as defined alongside (4).

Figures 2a and 2b show that the probability of the sign of τ̂ disagreeing with those of τ̂_n and τ̂_n^ε decreases as ε and τ increase, with m set to 1000. These results reflect the exponential dependence of P(τ̂_n ≤ 0 | τ̂ > 0) and P(τ̂_n^ε ≤ 0, τ̂_n ≤ 0 | τ̂ > 0) on g(ε, m, n, λ, δ) = O(1/ε²) and on τ̂ in Theorems 4 and 5. The convergence of τ̂_n^ε to τ̂ as m increases from 500 to 2500 at intervals of 500, seen in Figures 2c and 2d, reinforces the inverse relationship between g(ε, m, n, λ, δ) and m demonstrated earlier in Figure 1. The error bars represent the 95% confidence interval of the mean of the various estimates. The above plots also show that increasing τ yields exponentially larger mean ATE estimates, which is consistent with the form of μ̂_1^ε and μ̂_0^ε in (6) and (7), as the expectation of the log-normal random variables weighting the y_i increases monotonically with the variance of the Gaussian random variable.

Figure 2: Experimental results for synthetic data. Panels (a) and (b) show the probability of a sign change against the privacy loss ε; panels (c) and (d) show the true τ, τ̂ and τ̂_n^ε against the sample size m for ε ∈ {0.1, 0.3, 0.5}. (a) and (c) correspond to the low-confidence setting with τ = 0.1; (b) and (d) to the high-confidence setting with τ = 2. See main text for interpretation.

Semi-synthetic Data.
Next, we test our methods on the semi-synthetic binary-treatment Infant Health and Development Programme (IHDP) dataset introduced in [31]. We use the train and test sets from [32] for fitting the logistic regression and for estimating the ATE, respectively, with the true ATE of the train/test splits being 4. IHDP is a real-world dataset with 25 covariates describing 747 children and their mothers, de-randomised binary treatments, and synthetic continuous outcomes that can be used to compute a ground-truth ATE [31]. We create balanced training and ATE estimation datasets where m = 500 and n = 500 by sampling with replacement 250 units with T = 1 and T = 0, and 100 units with T = 1 and T = 0, respectively, from the above train and test sets. As the IHDP dataset comes with 1000 different realisations of train and test data, we average over all realisations. In Table 2, we see that increasing ε generally increases the fidelity of the mean ATE estimate and reduces P(sgn(τ̂_n) ≠ sgn(τ̂)) and P(sgn(τ̂_n^ε) ≠ sgn(τ̂)), respectively. We do not include the average standard deviation of the estimates, as estimating the ATE with IPW is known to have high variance due to the unboundedness of the inverse propensity weights. Lastly, another critical observation from Table 2, especially for practitioners, is that too small an ε can lead to unreliable estimates of τ̂_n and τ̂_n^ε; see the estimates for IHDP at the smallest ε.

Table 2: Average τ̂_n, τ̂_n^ε, ρ(τ̂, τ̂_n) := P(sgn(τ̂_n) ≠ sgn(τ̂)), and ρ(τ̂, τ̂_n^ε) := P(sgn(τ̂_n^ε) ≠ sgn(τ̂)) for various ε over 1000 runs on the IHDP dataset. The average τ̂ is 4.80.

  Estimate      |             Privacy loss (ε)
  τ̂_n          | −237487.30   32.34   8.37   5.22   5.31
  τ̂_n^ε        | −237477.13   32.20   8.50   5.09   5.40
  ρ(τ̂, τ̂_n)   |
  ρ(τ̂, τ̂_n^ε) |

Table 3: Average τ̂_n, τ̂_n^ε, ρ(τ̂, τ̂_n), and ρ(τ̂, τ̂_n^ε) for various ε over 1000 runs on the Lalonde dataset. The average τ̂ is 902.11.

Real Data.
To further verify our proposed method, we use the Lalonde observational studies benchmark [33], obtained from [34]. As we do not account for unbalanced datasets, we only use the original Lalonde dataset with 297 treated and 425 control individuals, and subsample from it to create our training and ATE estimation datasets. There are 9 covariates containing sensitive information such as age, education, and race, and the outcome is 1978 earnings. As no train/test splits are provided, we sample without replacement 100 units with T = 1 and T = 0 to create the ATE estimation dataset of size 200, and sample with replacement 250 units with T = 1 and T = 0 from the remaining points to generate the training dataset of size 500. We repeat this procedure 1000 times to obtain the same number of realisations as for the IHDP dataset. The results for Lalonde in Table 3 supplement those for IHDP: increasing ε improves the accuracy of the mean ATE estimate and decreases P(sgn(τ̂_n) ≠ sgn(τ̂)) and P(sgn(τ̂_n^ε) ≠ sgn(τ̂)).

5 Conclusion

We proposed a differentially private method for estimating the average treatment effect under the inverse probability weighting framework. A key element of our proposed method is the use of a newly defined private propensity score estimator. Unlike traditional propensity scores, ours can be deployed in causal analyses without running the risk of exposing the covariates of any unit used in estimating the propensity score function. Furthermore, we demonstrate, both theoretically and empirically, that the ATE estimate resulting from an application of our method is consistent with its non-private counterpart with high probability.
In other words, the proposed propensity score function not only safeguards privacy, but also yields valid causal analyses with high probability.

We believe this work not only highlights long-neglected privacy concerns associated with the use of propensity scores in causal inference, but also paves the way for subsequent developments at the intersection of differential privacy and causal inference. Although the starting point of our work is a specific choice of non-private method for ATE estimation, the analyses can be extended to more sophisticated estimators. In particular, we note that IPW is mathematically equivalent to other estimation methods such as stratification and the back-door correction (see, e.g., [35]). Future extensions of our work will be dedicated to the development of privatised estimators based on alternative ATE estimation methods, as well as to other causal estimands such as the conditional average treatment effect (CATE) [32]. The effectiveness of private propensity scores in more complex methods and settings remains an open question.
References

[1] High-Level Expert Group on Artificial Intelligence. Ethics guidelines for trustworthy AI, 2019.
[2] Thomas A. Glass, Steven N. Goodman, Miguel A. Hernán, and Jonathan M. Samet. Causal inference in public health. Annual Review of Public Health, 34(1):61–75, 2013.
[3] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY, USA, 2015.
[4] Richard M. Shiffrin. Drawing causal inference from big data. Proceedings of the National Academy of Sciences, 113(27):7308–7309, 2016.
[5] Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
[6] Paul Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
[7] Peter C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.
[8] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[9] Paul R. Rosenbaum and Donald B. Rubin. Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387):516–524, 1984.
[10] Paul R. Rosenbaum and Donald B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38, 1985.
[11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.
[12] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
[13] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, pages 814–823. JMLR.org, 2015.
[14] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudík, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30, pages 3632–3642. Curran Associates, Inc., 2017.
[15] Jeremy A. Rassen, Daniel H. Solomon, Jeffrey Curtis, Lisa Herrinton, and Sebastian Schneeweiss. Privacy-maintaining propensity score-based pooling of multiple databases applied to a study of biologics. Medical Care, 48:S83–S89, 2010.
[16] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[17] James M. Robins, Miguel Ángel Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000.
[18] Paul R. Rosenbaum. Model-based direct adjustment. Journal of the American Statistical Association, 82(398):387–394, 1987.
[19] Matt J. Kusner, Yu Sun, Karthik Sridharan, and Kilian Q. Weinberger. Private causal inference. In AISTATS, volume 51 of JMLR Workshop and Conference Proceedings, pages 1308–1317. JMLR.org, 2016.
[20] D. Xu, S. Yuan, and X. Wu. Differential privacy preserving causal graph discovery. Pages 60–71, August 2017.
[21] Jerzy Neyman. Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Master's thesis, 1923. Excerpts reprinted in English in Statistical Science, 5:463–472 (D. M. Dabrowska and T. P. Speed, translators).
[22] Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
[23] Donald B. Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469):322–331, 2005.
[24] M. Soledad Cepeda, Ray Boston, John T. Farrar, and Brian L. Strom. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. American Journal of Epidemiology, 158(3):280–287, 2003.
[25] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
[26] Anand D. Sarwate and Kamalika Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE Signal Processing Magazine, 30(5):86–94, 2013.
[27] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
[28] Archil Gulisashvili and Peter Tankov. Tail behavior of sums and differences of log-normal random variables. Bernoulli, 22(1):444–493, 2016.
[29] Chi-Fai Lo. The sum and difference of two lognormal random variables. Journal of Applied Mathematics, 2012, 2012.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[31] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
[32] Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 3076–3085. JMLR.org, 2017.
[33] Robert J. LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.
[34] Christian Fong, Marc Ratkovic, Kosuke Imai, Chad Hazlett, Xiaolin Yang, and Sida Peng. CBPS: Covariate Balancing Propensity Score, 2019. R package.
[35] Miguel A. Hernán and James M. Robins. Causal Inference. Chapman & Hall/CRC, Boca Raton, 2019. Forthcoming.
[36] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

A  Proof of Lemma 3
Proof.
Given a dataset $\mathcal{D} = \{(x_i, t_i, y_i)\}_{i=1}^n$, let $\alpha_i$ be defined as in Lemma 3 and $\beta_i := \exp(\sigma^2 \|x_i\|^2 / 2)$ for $i = 1, \ldots, n$. Then, taking the expectation of (6) w.r.t. the noise variable yields
$$\mathbb{E}[\hat{\mu}_1^\epsilon] = \frac{1}{n} \sum_{t_i = 1} \left[ y_i + y_i \beta_i \exp(-\hat{w}^\top x_i) \right], \qquad \mathbb{E}[\hat{\mu}_0^\epsilon] = \frac{1}{n} \sum_{t_i = 0} \left[ y_i + y_i \beta_i \exp(\hat{w}^\top x_i) \right].$$
We further rewrite $\mathbb{E}[\hat{\mu}_1^\epsilon]$ and $\mathbb{E}[\hat{\mu}_0^\epsilon]$ as
$$\mathbb{E}[\hat{\mu}_1^\epsilon] = \frac{1}{n} \sum_{t_i = 1} \left\{ y_i + y_i \exp(-\hat{w}^\top x_i) \right\} + \frac{1}{n} \sum_{t_i = 1} y_i \exp(-\hat{w}^\top x_i)\,(\beta_i - 1) = \hat{\mu}_1 + \frac{1}{n} \sum_{t_i = 1} y_i \exp(-\hat{w}^\top x_i)\,(\beta_i - 1),$$
$$\mathbb{E}[\hat{\mu}_0^\epsilon] = \hat{\mu}_0 + \frac{1}{n} \sum_{t_i = 0} y_i \exp(\hat{w}^\top x_i)\,(\beta_i - 1).$$
Consequently,
$$\mathbb{E}[\hat{\tau}^\epsilon] = \mathbb{E}[\hat{\mu}_1^\epsilon] - \mathbb{E}[\hat{\mu}_0^\epsilon] = \hat{\mu}_1 - \hat{\mu}_0 + \frac{1}{n} \sum_{i=1}^n \alpha_i (\beta_i - 1) = \hat{\tau} + g(\sigma),$$
where $g(\sigma) := n^{-1} \sum_{i=1}^n \alpha_i (\beta_i - 1)$. Setting $\sigma = 2(\epsilon m \lambda)^{-1} \sqrt{2 \log(1.25/\delta)}$ and substituting back into each $\beta_i$ yields the result (8).

B  Bounds for $\hat{\tau}_n$

Deterministic Bounds.
The first case corresponds to using trimmed or clipped propensity scores. Given a constant $0 < \xi < 1$, we define the trimmed version of $\hat{\tau}_n$ as
$$\hat{\tau}_{n,\xi} := \frac{1}{n} \sum_{i=1}^n \frac{t_i\, y_i}{\max\{\xi,\, \pi_{\hat{w}}^\epsilon(x_i)\}} - \frac{1}{n} \sum_{i=1}^n \frac{(1 - t_i)\, y_i}{\max\{\xi,\, 1 - \pi_{\hat{w}}^\epsilon(x_i)\}}.$$
While $\hat{\tau}_{n,\xi}$ is a biased estimate of $\tau$ with a bounded variance, it is often preferred due to its robustness to outliers. If $|y_i| \le C_y$ for all $i \in [n]$, it follows that $|\hat{\tau}_{n,\xi}| \le C_y \xi^{-1}$ with probability 1.

Probabilistic Bounds.
In the second case, we consider what happens if no trimming is applied. Although the variance of $\hat{\tau}_n$ can be unbounded, it is a deterministic function of a single sub-Gaussian noise variable $z \sim \mathcal{N}(0, \sigma^2 I_d)$. Hence, we expect the bounded difference condition for $\hat{\tau}_n$ to hold with high probability. To this end, let $S := \sum_{j=1}^d z_j$. Since each component of $z$ is independent, we have $S \sim \mathcal{N}(0, d\sigma^2)$. With $S$ being a sub-Gaussian random variable, Chernoff's inequality [36, pp. 21] gives $P(|S| \ge \zeta) \le 2\exp\!\left(-\zeta^2 (2 d \sigma^2)^{-1}\right)$ for some $\zeta > 0$. This implies that $|S| \le \zeta$ holds with probability at least $1 - \gamma$, where $\gamma = 2\exp(-\zeta^2 (2 d \sigma^2)^{-1})$. The following lemma gives the probabilistic bounded difference condition for $\hat{\tau}_n$.

Lemma 6.
Let $\hat{\tau}_n$ and $\hat{\tau}_n'$ be two estimates computed with different noise vectors $z$ and $z'$, respectively. Then, with probability at least $1 - 2\gamma$, we have $|\hat{\tau}_n - \hat{\tau}_n'| \le \eta$, where
$$\eta := \frac{2}{n} \sinh(\zeta) \left( \sum_{t_i = 1} y_i \exp(-\hat{w}^\top x_i) + \sum_{t_i = 0} y_i \exp(\hat{w}^\top x_i) \right).$$

Proof.
Let $\phi(z)$ be the deterministic function mapping the noise variable $z$ to $\hat{\tau}_n$ defined in (5). Furthermore, let $C_i^{(1)} := y_i \exp(-\hat{w}^\top x_i)$ for units with $t_i = 1$ and $C_i^{(0)} := y_i \exp(\hat{w}^\top x_i)$ for units with $t_i = 0$. Given that $|S| \le \zeta$ holds for both noise vectors with probability at least $1 - 2\gamma$, it follows that
$$|\phi(z) - \phi(z')| \le \frac{1}{n} \sum_{t_i = 1} C_i^{(1)} \left| \exp(-z^\top x_i) - \exp(-z'^\top x_i) \right| + \frac{1}{n} \sum_{t_i = 0} C_i^{(0)} \left| \exp(z^\top x_i) - \exp(z'^\top x_i) \right|$$
$$\le \frac{\exp(\zeta) - \exp(-\zeta)}{n} \left( \sum_{t_i = 1} C_i^{(1)} + \sum_{t_i = 0} C_i^{(0)} \right) = \frac{2}{n} \sinh(\zeta) \left( \sum_{t_i = 1} C_i^{(1)} + \sum_{t_i = 0} C_i^{(0)} \right) =: \eta$$
also holds with probability at least $1 - 2\gamma$. This concludes the proof.

C  L2-Sensitivity of $\hat{\tau}_n$

Proof. If $|y_i| \le C_y$ and $\Omega_1 < P(T = 1 \mid x_i) < \Omega_2$ for all $i \in [n]$, where $C_y$, $\Omega_1$ and $\Omega_2$ are constants, we have
$$S(\hat{\tau}_n) = \max_{\substack{\mathcal{D}, \mathcal{D}' \\ \|\mathcal{D} - \mathcal{D}'\| = 1}} \left\| \left( \frac{1}{n} \sum_{i=1}^n \frac{y_i t_i}{\pi_{\hat{w}}(x_i)} - \frac{1}{n} \sum_{i=1}^n \frac{y_i (1 - t_i)}{1 - \pi_{\hat{w}}(x_i)} \right) - \left( \frac{1}{n} \sum_{i=1}^n \frac{y_i' t_i'}{\pi_{\hat{w}}(x_i')} - \frac{1}{n} \sum_{i=1}^n \frac{y_i' (1 - t_i')}{1 - \pi_{\hat{w}}(x_i')} \right) \right\|$$
$$= \frac{1}{n} \max \left\{ \max_{\substack{\{x_i, y_i, t_i\}, \{x_i', y_i', t_i'\} \\ t_i = t_i' = 1}} \left\| \frac{y_i}{\pi_{\hat{w}}(x_i)} - \frac{y_i'}{\pi_{\hat{w}}(x_i')} \right\|, \; \max_{\substack{\{x_i, y_i, t_i\}, \{x_i', y_i', t_i'\} \\ t_i = t_i' = 0}} \left\| \frac{y_i}{1 - \pi_{\hat{w}}(x_i)} - \frac{y_i'}{1 - \pi_{\hat{w}}(x_i')} \right\| \right\}$$
$$\le \frac{2 C_y}{n} \max \left\{ \Omega_1^{-1},\; (1 - \Omega_2)^{-1} \right\}.$$
Therefore, the L2-sensitivity of $\hat{\tau}_n$ is bounded from above by $2 n^{-1} C_y \max\{\Omega_1^{-1}, (1 - \Omega_2)^{-1}\}$.

D  Error of $\hat{\tau}^\epsilon$

This corollary bounds the error we incur by privatising the propensity score function.
Corollary 7.
For any constant $\Delta > 0$,
$$P(|\hat{\tau}^\epsilon - \hat{\tau}| \ge \Delta) \le \frac{1}{n\Delta} \left| \sum_{i=1}^n \alpha_i \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right] \right|.$$

Proof.
By Lemma 3 and the Markov inequality,
$$P(|\hat{\tau}^\epsilon - \hat{\tau}| \ge \Delta) \le \frac{\left| \mathbb{E}[\hat{\tau}^\epsilon - \hat{\tau}] \right|}{\Delta} = \frac{\left| \hat{\tau} + g(\epsilon, m, n, \lambda, \delta) - \hat{\tau} \right|}{\Delta} = \frac{\left| g(\epsilon, m, n, \lambda, \delta) \right|}{\Delta} = \frac{1}{n\Delta} \left| \sum_{i=1}^n \alpha_i \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right] \right|.$$
The last step follows from the definition of $g(\epsilon, m, n, \lambda, \delta)$. This concludes the proof.

E  Proof of Theorem 5
Proof.
By the chain rule of probability,
$$P(\hat{\tau}_n^\epsilon < 0,\, \hat{\tau}_n < 0 \mid \hat{\tau} > 0) = P(\hat{\tau}_n^\epsilon < 0 \mid \hat{\tau}_n < 0,\, \hat{\tau} > 0)\; P(\hat{\tau}_n < 0 \mid \hat{\tau} > 0).$$
Since we already possess an upper bound on $P(\hat{\tau}_n < 0 \mid \hat{\tau} > 0)$ (see Theorem 4), we focus on obtaining $P(\hat{\tau}_n^\epsilon < 0 \mid \hat{\tau}_n < 0,\, \hat{\tau} > 0)$. Since $\hat{\tau}_n^\epsilon$ is normally distributed with mean $\hat{\tau}_n$, this is just the probability that $\hat{\tau}_n^\epsilon < 0$ given that its mean is negative. Hence, we can obtain the exact probability using the Gaussian CDF, i.e.,
$$P(\hat{\tau}_n^\epsilon < 0 \mid \hat{\tau}_n < 0,\, \hat{\tau} > 0) = \Phi\left( \frac{|\hat{\tau}_n|}{\sigma_n} \right) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_n|}{\sigma_n \sqrt{2}} \right) \right],$$
where $\Phi(\cdot)$ denotes the CDF of the standard normal distribution and $\mathrm{erf}(\cdot)$ is the error function. Combining this with the bound in Theorem 4 yields
$$P(\hat{\tau}_n^\epsilon \le 0,\, \hat{\tau}_n \le 0 \mid \hat{\tau} > 0) \le \frac{1}{2} \exp\left( -\frac{2(\hat{\tau} + g)^2}{\Delta^2} \right) \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_n|}{\sigma_n \sqrt{2}} \right) \right],$$
where $g := g(\epsilon, m, n, \lambda, \delta)$. This concludes the proof.

F  Privatised ATT and ATC Estimates
The IPW estimators for the ATT and ATC have the form
$$\hat{\tau}_{\mathrm{ATT}} = \frac{1}{n} \sum_{i=1}^n \left( t_i - (1 - t_i) \frac{\pi_{\hat{w}}(x_i)}{1 - \pi_{\hat{w}}(x_i)} \right) y_i, \qquad \hat{\tau}_{\mathrm{ATC}} = \frac{1}{n} \sum_{i=1}^n \left( t_i \frac{1 - \pi_{\hat{w}}(x_i)}{\pi_{\hat{w}}(x_i)} - (1 - t_i) \right) y_i,$$
and all related quantities are denoted with a subscripted ATT or ATC. First, we simplify the $\frac{\pi(x_i)}{1 - \pi(x_i)}$ and $\frac{1 - \pi(x_i)}{\pi(x_i)}$ terms in $\hat{\tau}_{\mathrm{ATT}}$ and $\hat{\tau}_{\mathrm{ATC}}$ to obtain
$$\hat{\tau}_{\mathrm{ATT}} = \frac{1}{n} \sum_{i=1}^n \left( t_i - (1 - t_i) \exp(\hat{w}^\top x_i) \right) y_i, \qquad \hat{\tau}_{\mathrm{ATC}} = \frac{1}{n} \sum_{i=1}^n \left( t_i \exp(-\hat{w}^\top x_i) - (1 - t_i) \right) y_i.$$
By employing the privacy-preserving propensity scores from Definition 2, the perturbed $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ acquire additional multiplicative terms $\exp(z^\top x_i)$ and $\exp(-z^\top x_i)$ on top of $\exp(\hat{w}^\top x_i)$ and $\exp(-\hat{w}^\top x_i)$ respectively, with $z \sim \mathcal{N}(0, \sigma^2 I_d)$ and $\sigma = \epsilon^{-1} \sqrt{2 \log(1.25/\delta)}\, S(\hat{w})$ for $\epsilon \in (0, 1)$ and any $\delta \in (0, 1)$. This yields estimates that are DP w.r.t. $\mathcal{D}_m$.

We then adapt Lemma 3 to the ATT and ATC by adding and subtracting $\mu_{0,\mathrm{ATT}}$ and $\mu_{1,\mathrm{ATC}}$ to the expectation of the ATT and ATC respectively. Let $\alpha_{i,\mathrm{ATT}} := -\mathbb{1}_{t_i = 0}\, y_i \exp(\hat{w}^\top x_i)$ and $\alpha_{i,\mathrm{ATC}} := \mathbb{1}_{t_i = 1}\, y_i \exp(-\hat{w}^\top x_i)$, where $\mathbb{1}$ is the indicator function and its subscript the condition under which the function equals 1. With these in place, we obtain $\mathbb{E}[\hat{\tau}_{n,\mathrm{ATT}}] = \hat{\tau}_{\mathrm{ATT}} + g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta)$, where
$$g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta) = \frac{1}{n} \sum_{i=1}^n \alpha_{i,\mathrm{ATT}} \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right],$$
and $\mathbb{E}[\hat{\tau}_{n,\mathrm{ATC}}] = \hat{\tau}_{\mathrm{ATC}} + g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta)$, where
$$g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta) = \frac{1}{n} \sum_{i=1}^n \alpha_{i,\mathrm{ATC}} \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right].$$
We bound the supports of $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ by ensuring that $\exp(-\hat{w}^\top x_i)$ and $\exp(\hat{w}^\top x_i)$ are smaller than or equal to some constant $\xi$ for all $i \in [n]$; this can be achieved by the techniques described in Appendix B. With that in place, we can then extend Theorem 4 to cover the cases where $\hat{\tau}_{\mathrm{ATT}}$ and $\hat{\tau}_{\mathrm{ATC}}$ are both assumed to be greater than 0 and to have the same sign as their true counterparts. The next two theorems illustrate the behaviour of $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ respectively.
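As a numeric sanity check of the simplified forms above, note that for a logistic propensity $\pi_{\hat{w}}(x) = \mathrm{sigmoid}(\hat{w}^\top x)$ the odds $\pi/(1-\pi)$ equal $\exp(\hat{w}^\top x)$ exactly. The sketch below verifies this identity and evaluates both estimators; the weights and toy data are illustrative assumptions, not the paper's code.

```python
# Sketch of the simplified ATT/ATC estimators above for a logistic propensity
# pi(x) = sigmoid(w_hat @ x), so that pi / (1 - pi) = exp(w_hat @ x).
# w_hat and the toy data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_hat = np.array([0.5, -0.2, 0.1])    # assumed fitted propensity weights
pi = 1.0 / (1.0 + np.exp(-X @ w_hat))
t = rng.binomial(1, pi)
y = 2.0 * t + rng.normal(size=n)

odds = np.exp(X @ w_hat)              # equals pi / (1 - pi) exactly
tau_att = np.mean((t - (1 - t) * odds) * y)    # simplified ATT estimator
tau_atc = np.mean((t / odds - (1 - t)) * y)    # simplified ATC estimator
```

Note that the $1/n$ normalisation follows the displayed estimators; other treatments of the ATT normalise by the number of treated units instead.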
Theorem 8.
Assume that $\hat{\tau}_{\mathrm{ATT}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATT}}) = \mathrm{sign}(\tau_{\mathrm{ATT}})$. If $|\hat{\tau}_{n,\mathrm{ATT}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATT}} \le 0 \mid \hat{\tau}_{\mathrm{ATT}} > 0) \le \exp\left( -2\eta^{-2} \left( \hat{\tau}_{\mathrm{ATT}} + g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta) \right)^2 \right).$$

Theorem 9.
Assume that $\hat{\tau}_{\mathrm{ATC}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATC}}) = \mathrm{sign}(\tau_{\mathrm{ATC}})$. If $|\hat{\tau}_{n,\mathrm{ATC}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATC}} \le 0 \mid \hat{\tau}_{\mathrm{ATC}} > 0) \le \exp\left( -2\eta^{-2} \left( \hat{\tau}_{\mathrm{ATC}} + g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta) \right)^2 \right).$$
Unsurprisingly, the results of Theorems 8 and 9 have similar implications as Theorem 4 for the ATE: the probability of drawing incorrect conclusions is exponentially related to the true estimated ATT/ATC and its corresponding bias $g_{\mathrm{ATT}}$/$g_{\mathrm{ATC}}$.

Continuing with our analysis, we apply the Gaussian mechanism to $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ to ensure that the estimates are DP w.r.t. the remaining $\mathcal{D}_n$ points. We first conduct a standard sensitivity analysis on $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ to compute the variance of the noise added to the estimates by the Gaussian mechanism. Assuming that $|y_i| \le C_y$, $\exp(-\hat{w}^\top x_i) \le \xi$ and $\exp(\hat{w}^\top x_i) \le \xi$ for all $i \in [n]$ without loss of generality, we have
$$S(\hat{\tau}_{n,\mathrm{ATT}}) = S(\hat{\tau}_{n,\mathrm{ATC}}) \le \frac{2 C_y}{n} \max\{1, \xi\}.$$
We then fully privatise the ATT and ATC estimates, making them DP w.r.t. both $\mathcal{D}_m$ and $\mathcal{D}_n$, by adding noise $e \sim \mathcal{N}(0, \sigma_n^2)$ to $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$, with noise standard deviation $\sigma_{n,\mathrm{ATT/ATC}} := \epsilon^{-1} \sqrt{2 \log(1.25/\delta)}\, S(\hat{\tau}_{n,\mathrm{ATT/ATC}})$. The following theorems bound the probability that both $\hat{\tau}_{n,\mathrm{ATT}}^\epsilon$ and $\hat{\tau}_{n,\mathrm{ATT}}$ / $\hat{\tau}_{n,\mathrm{ATC}}^\epsilon$ and $\hat{\tau}_{n,\mathrm{ATC}}$ yield incorrect causal conclusions. The proofs are modified from that in Appendix E.
Theorem 10.
Assume that $\hat{\tau}_{\mathrm{ATT}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATT}}) = \mathrm{sign}(\tau_{\mathrm{ATT}})$. If $|\hat{\tau}_{n,\mathrm{ATT}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATT}}^\epsilon \le 0,\, \hat{\tau}_{n,\mathrm{ATT}} \le 0 \mid \hat{\tau}_{\mathrm{ATT}} > 0) \le \frac{1}{2} \exp\left( -\frac{2(\hat{\tau}_{\mathrm{ATT}} + g)^2}{\eta^2} \right) \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_{n,\mathrm{ATT}}|}{\sigma_{n,\mathrm{ATT}} \sqrt{2}} \right) \right],$$
where $g := g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta)$.

Theorem 11.
Assume that $\hat{\tau}_{\mathrm{ATC}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATC}}) = \mathrm{sign}(\tau_{\mathrm{ATC}})$. If $|\hat{\tau}_{n,\mathrm{ATC}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATC}}^\epsilon \le 0,\, \hat{\tau}_{n,\mathrm{ATC}} \le 0 \mid \hat{\tau}_{\mathrm{ATC}} > 0) \le \frac{1}{2} \exp\left( -\frac{2(\hat{\tau}_{\mathrm{ATC}} + g)^2}{\eta^2} \right) \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_{n,\mathrm{ATC}}|}{\sigma_{n,\mathrm{ATC}} \sqrt{2}} \right) \right],$$
where $g := g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta)$.
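To close, here is a minimal end-to-end sketch of the two-stage privatisation analysed in this appendix: Gaussian noise on the propensity weights (DP w.r.t. the $m$ fitting units), then the Gaussian mechanism on the estimate itself (DP w.r.t. the $n$ estimation units). All constants, the sensitivity choices and the data-generating process are illustrative assumptions; in particular, $C_y$ is taken from the data here for illustration only, whereas in practice it would be a public bound.

```python
# Two-stage privatisation sketch: noise z on the propensity weights, then the
# Gaussian mechanism on the ATE estimate. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)

def gauss_sigma(eps, delta, sens):
    # Gaussian-mechanism scale: sigma = sens * sqrt(2 log(1.25/delta)) / eps.
    return sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

n, d, m, lam = 500, 3, 500, 0.1
eps, delta = 1.0, 1e-5

X = rng.normal(size=(n, d))
w_hat = np.array([0.8, -0.5, 0.3])            # assumed fitted weights
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_hat)))
y = 4.0 * t + rng.normal(size=n)              # toy data, true effect 4

# Stage 1: perturb the weights; S(w_hat) = 2/(m * lam) for regularised
# logistic regression, as assumed in the analysis above.
z = rng.normal(scale=gauss_sigma(eps, delta, 2.0 / (m * lam)), size=d)
pi_eps = np.clip(1.0 / (1.0 + np.exp(-X @ (w_hat + z))), 0.05, 0.95)
tau_n = np.mean(t * y / pi_eps) - np.mean((1 - t) * y / (1 - pi_eps))

# Stage 2: Gaussian mechanism on the estimate, using the Appendix C bound
# (2 C_y / n) * max(1/Omega1, 1/(1 - Omega2)). C_y from the data is for
# illustration only; it should be a public bound in practice.
C_y, Om1, Om2 = np.max(np.abs(y)), 0.05, 0.95
S_tau = (2.0 * C_y / n) * max(1.0 / Om1, 1.0 / (1.0 - Om2))
tau_n_eps = tau_n + rng.normal(scale=gauss_sigma(eps, delta, S_tau))
print(f"private ATE estimate: {tau_n_eps:.2f}")
```

The clipping in stage 1 plays the role of the trimming from Appendix B, which both bounds the support of the estimate and keeps the stage-2 sensitivity finite.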