Privacy-Preserving Causal Inference via Inverse Probability Weighting

Si Kai Lee*, Luigi Gresele, Mijung Park, Krikamol Muandet
Max Planck Institute for Intelligent Systems, Tübingen, Germany
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
University of Tübingen, Tübingen, Germany
Abstract
The use of inverse probability weighting (IPW) methods to estimate the causal effect of treatments from observational studies is widespread in econometrics, medicine and social sciences. Although these studies often involve sensitive information, thus far there has been no work on privacy-preserving IPW methods. We address this by providing a novel framework for privacy-preserving IPW (PP-IPW) methods. We include a theoretical analysis of the effects of our proposed privatisation procedure on the estimated average treatment effect, and evaluate our PP-IPW framework on synthetic, semi-synthetic and real datasets. The empirical results are consistent with our theoretical findings.
1 Introduction

The increasing ubiquity of machine learning in our daily lives has created a pressing need for trustworthy artificial intelligence (AI). One key tenet of trustworthy AI, defined in the European Commission's Ethics Guidelines for Trustworthy AI [1], is privacy. Preserving patient privacy is critical in medicine as it builds trust and fosters thoughtful decision making, which in turn helps improve patient care. Although the privacy requirements in the medical field are especially high, such requirements are not unique. In observational studies for the social sciences, it is commonly assumed that sensitive information such as employment, education and criminal records used in analyses will be kept private. The datasets compiled from these studies often contain personal information, hence conclusions drawn from statistical analyses on such datasets run the risk of violating the privacy of those present in the datasets. Such risk extends to causal inference methods [2, 3, 4, 5] as they fall under the umbrella of statistical estimation tools.

We illustrate the problem of causal inference with an example. In medicine, it is crucial to have a well-rounded understanding of the efficacy of different medical treatments, since a treatment can produce a broad range of responses across patients and multiple treatments are usually available. The treatment effect can be modelled from observational data by looking at how patients responded to different treatments in the past. However, two formidable obstacles need to be overcome in order to obtain a sound causal effect estimation. First, for each patient we only observe the outcome associated with the treatment that patient received, and no other [6].

* Contact: [email protected]
Second, the treatments that patients receive are not assigned at random, as doctors assign the treatment they expect to work best for each patient; as a result, treatment assignments and outcomes are subject to confounding, which could result in a biased estimate of the outcome of each treatment. An accurate estimate of the treatment effect should take confounding into account, possibly modelling how individual characteristics of the patient determine the assigned treatment. However, this often requires collecting sensitive information from patients.

The propensity score is arguably one of the most used quantities in causal inference for observational studies. It forms the basis of popular techniques such as matching, stratification, and inverse probability weighting (IPW) [7, 8, 9, 10], which are extensively used in econometrics, medicine, and social sciences [3]. Moreover, the IPW estimator based on propensity scores is the backbone of several counterfactual inference algorithms in the machine learning literature [11, 12, 13, 14]. Despite its widespread use, the propensity score, defined as the probability of assignment of a particular treatment given observed covariates, could depend on sensitive information about the patients, such as age, gender, race, and ethnicity. Little concern has thus far been raised regarding the privacy issues related to the use of propensity scores in such methods. Previously, propensity scores have been used by [15] as privatised substitutes for individual covariates. However, as we show in Section 3, this method still violates privacy [16] since sensitive observational data is used for estimating propensity scores.

Inverse probability weighting (IPW) methods, which employ estimated propensity scores, are frequently used to estimate the average treatment effect (ATE) [17, 18].
Since estimating the ATE with IPW methods still requires sensitive observational data, privacy is further violated. To address this important but neglected issue, we develop a novel framework for privacy-preserving IPW methods. This framework consists of two steps: (1) we learn a privacy-preserving propensity score estimator, and (2) we output a privacy-preserving ATE estimator. In addition, we investigate the effect of privatisation in both steps on the performance of the resulting causal analysis.

Related Work.
To the best of our knowledge, this paper is the first work which formally investigates the privatisation of the propensity score function and of average treatment effect estimates obtained with IPW methods. There have only been a few prior attempts to privatise causal inference techniques in different contexts. For example, in [19], the authors demonstrated how one could privatise statistical dependence scores such as Spearman's ρ and Kendall's τ under the additive noise model. The main focus of [19] is to obtain privatised scores that can still correctly identify the causal direction between two random variables. In [20], the authors developed a differentially private constraint-based causal graph discovery method for categorical data. None of the above papers considered propensity score-based causal inference methods, which are our focus. Since propensity score-based methods are fundamental to econometrics, medicine, the social sciences, etc., we expect this work to impact diverse fields.

Contributions.
In this paper, we propose a privacy-preserving framework for IPW methods which comprises a propensity score estimator and an IPW-based average treatment effect estimator. We privatise the parameters of the logistic regression model used to estimate the propensity scores, as well as the output of inverse probability weighting. This guarantees the privacy of the individuals in the training dataset used to learn the propensity score estimator and of the individuals in the estimation dataset used to estimate the ATE. We analyse the effect of the noise added to enforce privacy on the resulting estimated causal effect. Our analysis provides guidelines on how many samples we need to guarantee a certain level of privacy while providing accurate causal inference. We test our method on synthetic, semi-synthetic and real-world datasets to illustrate its effectiveness.

The rest of this paper is organised as follows. Section 2 provides a review of propensity scores in causal inference and of differential privacy. The privacy-preserving propensity scores, the IPW estimator, as well as their theoretical guarantees are then presented in Section 3, followed by experimental results in Section 4. Finally, Section 5 concludes the paper.

Table 1: Key Quantities

  Symbol   Description
  ε        Privacy loss
  δ        Failure probability
  μ_t      Population mean under treatment t
  μ̂_t      Estimated μ_t
  μ̂_t^ε    Privatised μ̂_t
  τ        Average treatment effect (ATE), μ_1 − μ_0
  τ̂        ATE estimate, μ̂_1 − μ̂_0
  τ̂_n      Partially privatised ATE estimate
  τ̂_n^ε    Fully privatised ATE estimate
  π_w      Propensity score function
  π_ŵ      Estimated π_w
  π_ŵ^ε    Privatised π_ŵ

2 Background

In this section, we introduce relevant concepts from causal inference and differential privacy. The key quantities are summarised in Table 1.
The potential outcomes framework is one of the most widely-used approaches in causal inference [21, 22, 23]. It provides the mathematical basis for estimating the outcome of an experiment which has not been performed, given outcomes observed under other experimental settings.

Consider the setting where we want to estimate whether a given treatment has a positive, negative or null effect on different units/individuals. We define T as the treatment variable and Y_t as the random variable representing the potential outcome associated with treatment T = t. In medicine, T could represent different cancer treatments and Y_t an indicator for patient recovery after treatment t. Throughout this paper, we focus on the binary treatment setting, i.e., T ∈ {0, 1},
and refer to the subset of the population with T = 1 as the treatment group, and the rest, with T = 0, as the control group. The random variables Y_1 and Y_0 are the outcomes associated with the treatment and control groups, respectively.

The question we want to answer is: what is the effect of administering a treatment to a unit compared to not doing so? Quantitatively, this can be characterised by the difference between the outcome when the treatment is administered and the outcome when it is not, i.e., Y_1 − Y_0. To estimate this, we would require both the outcome under treatment and the outcome under no treatment to be observed for every unit. However, for each unit we can observe only either Y_1 or Y_0. In practice, we substitute each unobserved quantity with an estimate of its expected outcome, μ_t := E[Y_t], and evaluate the average treatment effect (ATE) with τ = E[Y_1] − E[Y_0]. Given a dataset D = {(t_1, y_1), ..., (t_N, y_N)}, we can approximate μ_t with μ̂_t = n_t^{-1} Σ_{i=1}^N 1(t_i = t) · y_i, where 1(·) is an indicator that returns 1 when t_i = t and 0 otherwise, and n_t is the number of units with t_i = t.

If D is collected from a randomised experiment, τ̂ := μ̂_1 − μ̂_0 is an unbiased estimate of τ. However, the problem with most observational studies is that τ̂ is generally biased because of potential confounding variables X that affect both T and Y_t. For example, X could be the current stage of a patient's cancer, which could influence both the decision of the physician regarding the treatment and the outcome of the treatment. To obtain an unbiased ATE estimate despite the confounding variables X for a dataset D = {(x_i, t_i, y_i)}_{i=1}^N, we expand each E[Y_t] as E[Y_t] = E_X[E[Y_t | X, T = t]] and compute the difference of this quantity with t = 1 and with t = 0.
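The adjustment E_X[E[Y_t | X, T = t]] can be illustrated with a small, self-contained sketch (a hypothetical example with a single confounder, not the paper's experimental setup): when a confounder x drives both t and y, the naive difference of group means is biased, while averaging within-stratum differences over strata of x approximately recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observational data with one confounder x: x raises both the
# probability of treatment t and the outcome y. The true ATE is 1.0.
n = 100_000
x = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-2 * x))).astype(int)  # confounded assignment
y = x + 1.0 * t + 0.1 * rng.normal(size=n)

# Naive estimate mu_1_hat - mu_0_hat: biased, since x differs across groups.
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusted estimate E_X[E[Y | X, T=1] - E[Y | X, T=0]], approximated by
# stratifying x into deciles and averaging the within-stratum differences.
edges = np.quantile(x, np.linspace(0, 1, 11))
stratum = np.clip(np.digitize(x, edges[1:-1]), 0, 9)
diffs, weights = [], []
for b in range(10):
    m = stratum == b
    if t[m].sum() > 0 and (1 - t[m]).sum() > 0:  # need both groups in the stratum
        diffs.append(y[m][t[m] == 1].mean() - y[m][t[m] == 0].mean())
        weights.append(m.sum())
adjusted = np.average(diffs, weights=weights)

print(naive, adjusted)  # naive overshoots 1.0; adjusted lands close to it
```

Here the strata play the role of conditioning on X; IPW, introduced below, achieves the same adjustment by reweighting with the propensity score instead of binning.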
The validity of this estimate can be assessed given three technical requirements, which we assume throughout:

(i) Stable Unit Treatment Value Assumption (SUTVA): the observed outcome of the i-th unit, Y^(i), is unaffected by the treatments assigned to other units.
(ii) Ignorability: T ⊥ (Y_0, Y_1) | X.
(iii) Positivity: 0 < P(T = 1 | X = x) < 1 for all x.

See [3] for a thorough exposition of these assumptions.

Propensity Scores.
The propensity score is one of the most widely used quantities for causal analysis in observational studies [7, 8, 9, 10]. In the binary treatment setting, the propensity score π(x) is defined as the probability of a unit with covariate x receiving treatment T = 1, π(x) := P(T = 1 | X = x). It has been shown that, under the above set of assumptions, the propensity score π(x) summarises all the relevant information in X for causal inference, so that T ⊥ (Y_0, Y_1) | π(X) holds [8]. Unfortunately, in most observational studies, the true treatment assignment mechanism is not known. Thus, a common practice is to fit a propensity score function on data D using standard statistical models π_w(x) = f_w(x), where f_w represents a model parameterised by a parameter vector w. In this work, we focus on the logistic regression model since it is the most frequently used model for fitting propensity scores [24]. The model is defined as

    π_w(x) = 1 / (1 + e^{−w^⊤ x}),    (1)

where w ∈ R^d. Other popular techniques for propensity score estimation include classification and regression trees (CART), boosted CART, random forests, etc.

Inverse Probability Weighting (IPW).
One popular propensity score-based method for estimating τ from observational data is IPW [17, 18]. Using IPW, we can obtain an unbiased ATE estimate τ̂ = μ̂_1 − μ̂_0 with μ̂_1 and μ̂_0 defined as

    μ̂_1 := (1/N) Σ_{i=1}^N y_i t_i / π(x_i),    μ̂_0 := (1/N) Σ_{i=1}^N y_i (1 − t_i) / (1 − π(x_i)),    (2)

where N is the number of units in D. In practice, we replace the propensity score function π with its empirical estimate π_ŵ.

Differential Privacy.
The notion of differential privacy (DP) [16] provides a well-defined framework to describe the privacy properties of statistical estimation algorithms. DP states that a privacy-preserving, randomised algorithm behaves similarly on similar datasets. Specifically, the algorithm's behaviour is quantified in terms of a probability ratio, which describes how the algorithm's output changes when different datasets are used as input. Intuitively, the probability ratio does not change much (the algorithm behaves similarly) if the input datasets differ by a single entry (similar datasets). The formal definition is given below.

Definition 1.
A randomised algorithm A with domain N^{|X|}, where X is the data universe, satisfies (ε, δ)-differential privacy, i.e., is (ε, δ)-DP, if for all S ⊆ Range(A) and for all neighbouring D, D′ ∈ N^{|X|} such that ‖D − D′‖_1 ≤ 1, i.e., the two datasets D, D′ differ in only one entry,

    P[A(D) ∈ S] ≤ exp(ε) P[A(D′) ∈ S] + δ,

where the probability space is over the outputs of A.

Here, ε is defined as the privacy loss controlling the level of privacy. For 0 < δ < 1, δ defines the failure probability, i.e., an algorithm is ε-DP with probability at least 1 − δ.

Gaussian Mechanism.
The Gaussian mechanism [25] is commonly used to privatise models (see, e.g., [16, 26] for other DP mechanisms). The mechanism privatises a vector-valued function f : D ↦ R^p by adding Gaussian noise to it. The noise is calibrated based on the L2-sensitivity of f, defined by S(f) = max_{D, D′: ‖D−D′‖_1 = 1} ‖f(D) − f(D′)‖_2. The privatised function has the form f̃(D) = f(D) + N(0, σ² I_p). A choice of σ ≥ ε^{-1} √(2 log(1.25/δ)) S(f) produces a function f̃(D) that is (ε, δ)-DP.

Differentially Private Empirical Risk Minimisation (DP-ERM).
Let ℓ : R × R → R_+ be the loss function and the vector w the model parameters of π_w. Under the ERM framework, the optimal model parameters ŵ are obtained by minimising the empirical risk function J(w, D) = m^{-1} Σ_{i=1}^m ℓ(π_w(x_i), t_i) + λ Ω(w), where Ω(·) is the regulariser and λ > 0 the regularisation constant. If logistic regression (1) is used to model the propensity score function π_w, the parameters w can be learned using ERM with an L2-regulariser. We assume throughout that X is contained in the L2-unit ball, i.e., ‖x_i‖_2 ≤ 1 for all x_i ∈ X (a typical assumption on the dataset in the differential privacy literature). Note that the dataset D_m = {(x_i, t_i, y_i)}_{i=1}^m contains the observed treatment effects y_i, but that they are not used to fit π_w. In our case, the regularised cross-entropy loss J(w, D) is

    −(1/m) Σ_{i=1}^m [t_i log p_i + (1 − t_i) log(1 − p_i)] + λ ‖w‖²,    (3)

where p_i := π_w(x_i). This loss is equivalent to the logistic loss used in [27]. The L2-sensitivity of the ŵ obtained by minimising (3) is given by S(ŵ) = max_{D, D′: ‖D−D′‖_1 = 1} ‖ŵ(D) − ŵ(D′)‖_2 ≤ 2(mλ)^{-1} [27].

3 Privacy-Preserving Inverse Probability Weighting

We start by explaining why logistic regression-based estimation of the propensity score by minimising (3) is not private. An intuitive explanation is the following: since an ERM solution can be written as a linear combination of training samples, the ŵ and ŵ′ estimated from datasets D and D′ differing by one single entry could be completely different, e.g., if the differing entry is an outlier. Hence, the likelihood ratio of the two models ŵ and ŵ′ estimated from two neighbouring datasets would be unbounded, which makes ŵ not ε-DP.
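As a concrete illustration of the pieces introduced so far, the sketch below fits the regularised logistic propensity model (3) on a split D_m by gradient descent, privatises ŵ with the Gaussian mechanism calibrated by the sensitivity bound S(ŵ) ≤ 2(mλ)^{-1}, and plugs the resulting scores into the IPW estimator (2) on the remaining split. The data-generating process, sample sizes and optimiser settings are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data, with covariates rescaled into the unit L2 ball
# (as the sensitivity bound assumes). Sizes are illustrative only.
d, m, n = 5, 2000, 2000
X = rng.normal(size=(m + n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True).max()          # ensures ||x_i|| <= 1
w_true = rng.normal(size=d)
t = (rng.random(m + n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
tau_true = 2.0
y = X @ rng.normal(size=d) + tau_true * t + 0.1 * rng.normal(size=m + n)

# Split: D_m fits the propensity model, D_n estimates the ATE.
Xm, tm = X[:m], t[:m]
Xn, tn, yn = X[m:], t[m:], y[m:]

def fit_logreg(X, t, lam, lr=1.0, iters=2000):
    """Gradient descent on the regularised cross-entropy objective (3)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ w))
        grad = X.T @ (p - t) / len(t) + 2 * lam * w  # d/dw of (3)
        w -= lr * grad
    return w

lam, eps, delta = 0.01, 1.0, 1e-5
w_hat = fit_logreg(Xm, tm, lam)

# Gaussian mechanism on the weights, with noise scale calibrated by the
# ERM sensitivity bound S(w_hat) <= 2 / (m * lam) stated in the text.
sigma = np.sqrt(2 * np.log(1.25 / delta)) / eps * 2 / (m * lam)
w_eps = w_hat + rng.normal(scale=sigma, size=d)

def ipw_ate(w, X, t, y):
    """IPW estimate (2) of the ATE using propensity scores from w."""
    pi = 1 / (1 + np.exp(-X @ w))
    return np.mean(y * t / pi) - np.mean(y * (1 - t) / (1 - pi))

tau_hat = ipw_ate(w_hat, Xn, tn, yn)   # non-private IPW estimate
tau_eps = ipw_ate(w_eps, Xn, tn, yn)   # estimate with privatised weights
print(tau_hat, tau_eps)
```

With a moderate ε the estimate from the privatised weights typically stays close to the non-private one, at the cost of extra variance from z; shrinking ε inflates σ and can even flip the sign of the estimate, which is exactly the failure mode quantified later in the paper.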
A more rigorous proof can be found in [27]. Since existing propensity score estimation often relies on ERM, standard methods for modelling the propensity score function yield a model that violates differential privacy: given π_ŵ and all points in dataset D bar one, it is possible to infer the covariates x_i of the omitted unit. This is a serious problem, since propensity score-based methods are frequently used to estimate causal effects from observational studies containing sensitive data, and it highlights the need for privacy-preserving propensity score estimators and propensity score-based methods.

Privatising the ATE estimated via IPW requires two layers of privatisation. The first step of our proposed method is to divide the dataset D into D_m = {(x_i, t_i, y_i)}_{i=1}^m and D_n = {(x_i, t_i, y_i)}_{i=1}^n, and to use the first m points to learn ŵ and the remaining n points to estimate τ̂. We separately privatise the estimated propensity score function with respect to the m datapoints and the estimated ATE with respect to the n datapoints, to ensure that all N datapoints (where N = m + n) in D are protected. We describe each step in detail in the next two subsections.

To privatise the logistic regression model, we first compute a non-private version of the propensity score π_ŵ by minimising (3). Next, we use the Gaussian mechanism to generate a privacy-preserving version of ŵ, denoted ŵ_ε.

Definition 2.
Let ŵ be the solution of (3). A privacy-preserving propensity score function is

    π_ŵ^ε(x) = 1 / (1 + exp(−ŵ_ε^⊤ x)) = 1 / (1 + exp(−ŵ^⊤ x − z^⊤ x)),    (4)

where ŵ_ε := ŵ + z with z ∼ N(0, σ² I_d) and σ = ε^{-1} √(2 log(1.25/δ)) S(ŵ), for ε ∈ (0, 1] and any δ ∈ (0, 1].

Alongside Definition 2, we define the counterparts of τ̂, μ̂_1 and μ̂_0 which use π_ŵ^ε(x) in place of π_ŵ(x) over the n points as τ̂_n, μ̂_1^ε and μ̂_0^ε. This yields an ATE that is DP w.r.t. D_m: τ̂_n := μ̂_1^ε − μ̂_0^ε, where

    μ̂_1^ε := (1/n_1) Σ_i y_i / π_ŵ^ε(x_i),    μ̂_0^ε := (1/n_0) Σ_i y_i / (1 − π_ŵ^ε(x_i)),    (5)
with the first sum taken over the n_1 points where t_i = 1, the second sum over the n_0 points where t_i = 0, and n_1 + n_0 = n. The estimator τ̂_n only safeguards the privacy of m points within the dataset, those belonging to the split D_m. Privatisation of the points in D_n is discussed in the next subsection. Here, we focus on the effect of the noise added for protecting D_m on causal inference.

Characterising τ̂_n. Both μ̂_1^ε and μ̂_0^ε are weighted sums of correlated log-normal random variables, where the magnitudes and signs of the weights depend on the data. This can be seen from their formulae

    μ̂_1^ε = (1/n_1) Σ_i y_i (1 + exp(−ŵ^⊤ x_i) exp(−z^⊤ x_i)),    (6)
    μ̂_0^ε = (1/n_0) Σ_i y_i (1 + exp(ŵ^⊤ x_i) exp(z^⊤ x_i)),    (7)

and by recalling that z is a Gaussian random variable. Obtaining closed-form expressions for the above quantities with finite samples is highly nontrivial [28, 29]. As a first step towards understanding the random variable τ̂_n, we show how its expected value can be rewritten in terms of the non-privatised ATE estimate τ̂ and the variance of the added noise. See Appendix A for the proof.

Lemma 3.
Let α_i := (−1)^{1−t_i} y_i exp((−1)^{t_i} ŵ^⊤ x_i) be a constant for i = 1, ..., n. Then we have E[τ̂_n] = τ̂ + g(ε, m, n, λ, δ), where the function g(ε, m, n, λ, δ) is given by

    (1/n) Σ_{i=1}^n α_i [exp(σ² ‖x_i‖² / 2) − 1],    (8)

with σ as in Definition 2.

Lemma 3 allows us to interpret τ̂_n as a biased estimate of τ̂, where the additive bias term is a function of the privacy loss ε, the sample sizes m and n, the regularisation constant λ, and the failure probability δ. Notice that the bias g(ε, m, n, λ, δ) converges to zero as either the privacy budget ε or the total number of points m + n goes to infinity, as is to be expected. We complement our insights with numerical simulations (see Figure 1) that describe the behaviour of this nontrivial bias term.

To study the probability of τ̂_n having the opposite sign w.r.t. the non-private estimator, we assume that its support is bounded both from above and below. This allows us to employ standard concentration inequality results for variables with bounded support. We bound the support of the estimator by either deterministic or probabilistic means. An in-depth discussion of how we do so is provided in Appendix B. The next theorem characterises the behaviour of τ̂_n.

Theorem 4.
Assume that τ̂ > 0 and sign(τ̂) = sign(τ). If |τ̂_n| ≤ η for some η > 0, we have

    P(τ̂_n ≤ 0 | τ̂ > 0) ≤ exp(−(τ̂ + g(ε, m, n, λ, δ))² / (2η²)).

Proof.
Given that |τ̂_n| ≤ η (or the result in Lemma 6 with probability at least 1 − γ), we can apply Lemma 3 and Hoeffding's inequality for bounded variables to obtain the result.

Qualitatively, this is what we expect: the theorem implies that the larger the magnitude of the (true) estimated ATE, the smaller the probability of drawing incorrect causal conclusions from the partially privatised estimator. Interestingly, the bound in Theorem 4 provides a quantitative characterisation, showing that this probability decreases exponentially as a function of τ̂. The probability of drawing incorrect conclusions from τ̂_n also depends exponentially on the bias g(ε, m, n, λ, δ). We provide empirical results clarifying the dependency on this term in Section 4.

We now proceed to privatising the n points used to compute τ̂_n. Given that P(T = 1 | x_i) is bounded above and below by ω_u and ω_l respectively for all i ∈ [n], τ̂_n is consequently bounded as well. By further assuming that |y_i| ≤ C_y for all i ∈ [n], we ensure that the L2-sensitivity of τ̂_n, S(τ̂_n), is bounded by n^{-1} C_y max[1/ω_l, 1/(1 − ω_u)]. We show how this quantity is obtained in Appendix C. We apply the Gaussian mechanism to τ̂_n to obtain a privacy-preserving approximation τ̂_n^ε that is (ε, δ)-DP w.r.t. the n points used to obtain the estimate. This yields an ATE that is DP w.r.t. both D_m and D_n: τ̂_n^ε = τ̂_n + e, where the noise is drawn from e ∼ N(0, σ_n²) with standard deviation σ_n := ε^{-1} √(2 log(1.25/δ)) S(τ̂_n).

Characterising τ̂_n^ε. We now present our main result, which bounds the probability that both τ̂_n^ε and τ̂_n yield incorrect causal conclusions. See Appendix E for the proof.

Theorem 5.
Assume that τ̂ > 0 and sign(τ̂) = sign(τ). If |τ̂_n| ≤ η for some η > 0, we have

    P(τ̂_n^ε ≤ 0, τ̂_n ≤ 0 | τ̂ > 0) ≤ (1/2) exp(−(τ̂ + g)² / (2η²)) [1 − erf(|τ̂_n| / (σ_n √2))],

where g := g(ε, m, n, λ, δ).

The additional (1/2)[1 − erf(|τ̂_n| / (σ_n √2))] term in Theorem 5 represents the probability that the noise e added by the Gaussian mechanism exceeds |τ̂_n|, computed via the Gaussian CDF. The bound in Theorem 5 thus further accounts for τ̂_n and σ_n: the probability that both τ̂_n^ε and τ̂_n are negative decreases exponentially with |τ̂_n| and grows with σ_n, where the latter is a function of ε, δ and S(τ̂_n). This theorem provides a full characterisation of our proposed differentially private ATE estimation procedure. It summarises the effects of protecting all m + n = N points, and shows that the probability of drawing an incorrect causal conclusion decays exponentially w.r.t. the values of τ̂ and τ̂_n.

Figure 1: Behaviour of g(ε, m, n, λ, δ) w.r.t. the sample size m (for ε ∈ {0.2, 0.4, 0.6, 0.8, 0.99}) and the privacy loss ε (for m ∈ {10, 100, 1000}). As expected, the bias decreases as the sample size m and the privacy loss ε increase.

Algorithm 1 PP-IPW
Input: data D = {(x_i, t_i, y_i)}_{i=1}^N and privacy parameters (ε, δ).
Split D into two random subsets D_m and D_n consisting of m and n data points.
A. Obtain the DP propensity score function:
   Minimise J(w, D_m) in (3) to obtain the non-private estimate ŵ.
   Set ŵ_ε = ŵ + z, where z ∼ N(0, σ_m² I_d) and σ_m = ε^{-1} √(2 log(1.25/δ)) S(ŵ).
   Output the DP propensity score function π_ŵ^ε(x) = 1 / (1 + exp(−ŵ_ε^⊤ x)).
B. Obtain the DP ATE:
   Given D_n, compute τ̂_n := μ̂_1^ε − μ̂_0^ε, where μ̂_1^ε := n_1^{-1} Σ_{i: t_i=1} y_i / π_ŵ^ε(x_i) and μ̂_0^ε := n_0^{-1} Σ_{i: t_i=0} y_i / (1 − π_ŵ^ε(x_i)).
   Output the DP ATE τ̂_n^ε = τ̂_n + e, where e ∼ N(0, σ_n²) and σ_n := ε^{-1} √(2 log(1.25/δ)) S(τ̂_n).
Output: DP propensity score function π_ŵ^ε (w.r.t. D_m) and DP ATE τ̂_n^ε (w.r.t. D).

Remark.
Notice that in Theorem 5 we chose to bound the probability P(τ̂_n^ε ≤ 0, τ̂_n ≤ 0 | τ̂ > 0). However, depending on the application, other quantities, such as P(τ̂_n^ε ≤ 0 | τ̂ > 0) or P(τ̂_n^ε ≤ 0 | τ̂_n ≤ 0, τ̂ > 0), might be more interesting. It is also possible to bound those quantities; the proofs follow from that of Theorem 5 presented in Appendix E, and we omit them for brevity.

Putting everything together, our method is presented in Algorithm 1. For the sake of simplicity, we assigned the same privacy budget to privatising the propensity score and the average treatment effect, but one could choose separate privacy levels for the two quantities. We also include in Appendix F bounds for the average treatment effect for the treated (ATT) and the average treatment effect for the controls (ATC). We leave the development of privatised estimators based on more sophisticated techniques as future work.

4 Experiments

In this section, we demonstrate our theoretical findings with experiments on synthetic, semi-synthetic and real data. We set the probability of failure δ = 10^− and the regularisation coefficient λ = 0. for all experiments. The logistic regression model is implemented in PyTorch [30] and optimised via gradient descent using the entire dataset. To ensure reproducibility, we set the random seed for both NumPy and PyTorch to 1.

Synthetic Data.
For this set of experiments, we vary the m points used to fit the logistic regression model, use n = 1000 points to estimate the ATE, and average over 100 trials. We generate the covariates by sampling the x_i's separately from a zero-mean isotropic Gaussian N(0, ·I) for each trial and standardising each separately sampled set of x_i's by the maximum L2-norm of the x_i's in the set. For each standardised x_i, the treatment assignment t_i and outcome y_i are generated across trials in the following manner:

    t_i ∼ Bernoulli(s(a^⊤ x_i)),    a ∼ N(0, I),
    y_i = b^⊤ x_i + t_i τ + ϑ,      b ∼ N(0, I),

where ϑ is zero-mean Gaussian noise, τ ∈ {0.1, 2} is a non-zero bias, and s(·) is the sigmoid function. We perturb the weights of the learned logistic regression model with a sample from N(0, σ² I), with σ as defined alongside (4).

Figures 2a and 2b show that the probability of the sign of τ̂ disagreeing with those of τ̂_n and τ̂_n^ε decreases as ε and τ increase, with m set to 1000. These results reflect the exponential dependence of P(τ̂_n ≤ 0 | τ̂ > 0) and P(τ̂_n^ε ≤ 0, τ̂_n ≤ 0 | τ̂ > 0) on g(ε, m, n, λ, δ) = O(1/ε²) and on τ̂ in Theorems 4 and 5. The convergence of τ̂_n^ε to τ̂ as m increases from 500 to 2500 at intervals of 500, seen in Figures 2c and 2d, reinforces the inverse relationship between g(ε, m, n, λ, δ) and m demonstrated earlier in Figure 1. The error bars represent the 95% confidence interval of the mean of the various estimates. The above plots also show that increasing τ yields exponentially larger mean ATE estimates, which is consistent with the form of μ̂_1^ε and μ̂_0^ε in (6) and (7), as the expectation of the log-normal random variables weighting the y_i increases monotonically with the variance of the Gaussian random variable.

Figure 2: Experimental results for synthetic data. Panels (a) and (b) show the probability of a sign change against the privacy loss ε; panels (c) and (d) show the true τ, τ̂ and τ̂_n^ε against the sample size m for ε ∈ {0.1, 0.3, 0.5}. (a) and (c) correspond to the low-confidence setting with τ = 0.1; (b) and (d) to the high-confidence setting with τ = 2. See main text for interpretation.

Semi-synthetic Data.
Next, we test our methods on the semi-synthetic binary-treatment Infant Health and Development Programme (IHDP) dataset introduced in [31]. We use the train and test sets from [32] for fitting the logistic regression and for estimating the ATE, respectively, with the true ATE of the train/test splits being 4. IHDP is a real-world dataset with 25 covariates describing 747 children and their mothers, de-randomised binary treatments, and synthetic continuous outcomes that can be used to compute a ground-truth ATE [31]. We create balanced training and ATE estimation datasets where m = 500 and n = 500 by sampling with replacement 250 units with T = 1 and T = 0, and 100 units with T = 1 and T = 0, respectively, from the above train and test sets. As the IHDP dataset comes with 1000 different realisations of train and test data, we average over all realisations. In Table 2, we see that increasing ε generally increases the fidelity of the mean ATE estimate and reduces P(sgn(τ̂_n) ≠ sgn(τ̂)) and P(sgn(τ̂_n^ε) ≠ sgn(τ̂)), respectively. We do not include the average standard deviation of the estimates, as estimating the ATE with IPW is known to have high variance due to the unboundedness of the inverse propensity weights. Lastly, another critical observation from Table 2, especially for practitioners, is that too small an ε can lead to unreliable estimates of τ̂_n and τ̂_n^ε; see the estimates for IHDP at the smallest ε.

Table 2: Average τ̂_n, τ̂_n^ε, ρ(τ̂, τ̂_n) := P(sgn(τ̂_n) ≠ sgn(τ̂)), and ρ(τ̂, τ̂_n^ε) := P(sgn(τ̂_n^ε) ≠ sgn(τ̂)) for various ε over 1000 runs on the IHDP dataset. The average τ̂ is 4.80.

  Estimate      |             Privacy loss (ε)
  τ̂_n          | −237487.30   32.34   8.37   5.22   5.31
  τ̂_n^ε        | −237477.13   32.20   8.50   5.09   5.40
  ρ(τ̂, τ̂_n)   |
  ρ(τ̂, τ̂_n^ε) |

Table 3: Average τ̂_n, τ̂_n^ε, ρ(τ̂, τ̂_n), and ρ(τ̂, τ̂_n^ε) for various ε over 1000 runs on the Lalonde dataset. The average τ̂ is 902.11.

Real Data.
To further verify our proposed method, we use the Lalonde observational studies benchmark [33], obtained from [34]. As we do not account for unbalanced datasets, we only use the original Lalonde dataset with 297 treated and 425 control individuals, and subsample from it to create our training and ATE estimation datasets. There are 9 covariates containing sensitive information such as age, education, and race, and the outcome is 1978 earnings. As no train/test splits are provided, we sample without replacement 100 units with T = 1 and T = 0 to create the ATE estimation dataset of size 200, and sample with replacement 250 units with T = 1 and T = 0 from the remaining points to generate the training dataset of size 500. We repeat this procedure 1000 times to obtain the same number of realisations as for the IHDP dataset. The results for Lalonde in Table 3 supplement those for IHDP: increasing ε improves the accuracy of the mean ATE estimate and decreases P(sgn(τ̂_n) ≠ sgn(τ̂)) and P(sgn(τ̂_n^ε) ≠ sgn(τ̂)).

5 Conclusion

We proposed a differentially private method for estimating the average treatment effect under the inverse probability weighting framework. A key element of our proposed method is the use of a newly defined private propensity score estimator. Unlike traditional propensity scores, ours can be deployed in causal analyses without running the risk of exposing the covariates of any unit used in estimating the propensity score function. Furthermore, we demonstrate, both theoretically and empirically, that the ATE estimate resulting from an application of our method is consistent with its non-private counterpart with high probability.
In other words, the proposed propensity score function not only safeguards privacy, but also yields valid causal analyses with high probability.

We believe this work not only highlights long-neglected privacy concerns associated with the use of propensity scores in causal inference, but also paves the way for subsequent developments at the intersection of differential privacy and causal inference. Although the starting point of our work is a specific choice of non-private method for ATE estimation, the analyses can be extended to more sophisticated estimators. In particular, we note that IPW is mathematically equivalent to other estimation methods such as stratification and the back-door correction (see, e.g., [35]). Future extensions of our work will be dedicated to the development of privatised estimators based on alternative ATE estimation methods, as well as to other causal estimands such as the conditional average treatment effect (CATE) [32]. The effectiveness of private propensity scores in more complex methods and settings remains an open question.
References

[1] High-Level Expert Group on Artificial Intelligence. Ethics guidelines for trustworthy AI, 2019.
[2] Thomas A. Glass, Steven N. Goodman, Miguel A. Hernán, and Jonathan M. Samet. Causal inference in public health. Annual Review of Public Health, 34(1):61–75, 2013.
[3] Guido W. Imbens and Donald B. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY, USA, 2015.
[4] Richard M. Shiffrin. Drawing causal inference from big data. Proceedings of the National Academy of Sciences, 113(27):7308–7309, 2016.
[5] Elias Bareinboim and Judea Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.
[6] Paul Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
[7] Peter C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.
[8] Paul R. Rosenbaum and Donald B. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[9] Paul R. Rosenbaum and Donald B. Rubin. Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387):516–524, 1984.
[10] Paul R. Rosenbaum and Donald B. Rubin. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38, 1985.
[11] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.
[12] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14:3207–3260, 2013.
[13] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, pages 814–823. JMLR.org, 2015.
[14] Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudík, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Advances in Neural Information Processing Systems 30, pages 3632–3642. Curran Associates, Inc., 2017.
[15] Jeremy A. Rassen, Daniel H. Solomon, Jeffrey Curtis, Lisa Herrinton, and Sebastian Schneeweiss. Privacy-maintaining propensity score-based pooling of multiple databases applied to a study of biologics. Medical Care, 48:S83–S89, 2010.
[16] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[17] James M. Robins, Miguel Ángel Hernán, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000.
[18] Paul R. Rosenbaum. Model-based direct adjustment. Journal of the American Statistical Association, 82(398):387–394, 1987.
[19] Matt J. Kusner, Yu Sun, Karthik Sridharan, and Kilian Q. Weinberger. Private causal inference. In AISTATS, volume 51 of JMLR Workshop and Conference Proceedings, pages 1308–1317. JMLR.org, 2016.
[20] D. Xu, S. Yuan, and X. Wu. Differential privacy preserving causal graph discovery. Pages 60–71, August 2017.
[21] Jerzy Neyman. Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. Master's thesis, 1923. Excerpts reprinted in English in Statistical Science, 5:463–472 (D. M. Dabrowska and T. P. Speed, translators).
[22] Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688–701, 1974.
[23] Donald B. Rubin. Causal inference using potential outcomes. Journal of the American Statistical Association, 100(469):322–331, 2005.
[24] M. Soledad Cepeda, Ray Boston, John T. Farrar, and Brian L. Strom. Comparison of logistic regression versus propensity score when the number of events is low and there are multiple confounders. American Journal of Epidemiology, 158(3):280–287, 2003.
[25] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 486–503. Springer, 2006.
[26] Anand D. Sarwate and Kamalika Chaudhuri. Signal processing and machine learning with differential privacy: Algorithms and challenges for continuous data. IEEE Signal Processing Magazine, 30(5):86–94, 2013.
[27] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.
[28] Archil Gulisashvili and Peter Tankov. Tail behavior of sums and differences of log-normal random variables. Bernoulli, 22(1):444–493, 2016.
[29] Chi-Fai Lo. The sum and difference of two lognormal random variables. Journal of Applied Mathematics, 2012, 2012.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.
[31] Jennifer L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.
[32] Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 3076–3085. JMLR.org, 2017.
[33] Robert J. LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.
[34] Christian Fong, Marc Ratkovic, Kosuke Imai, Chad Hazlett, Xiaolin Yang, and Sida Peng. CBPS: Covariate Balancing Propensity Score, 2019. R package.
[35] Miguel A. Hernán and James M. Robins. Causal Inference. Chapman & Hall/CRC, Boca Raton, 2019. Forthcoming.
[36] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

A  Proof of Lemma 3
Proof.
Given a dataset $\mathcal{D} = \{(x_i, t_i, y_i)\}_{i=1}^n$, let $\alpha_i$ be defined as in Lemma 3 and $\beta_i := \exp(\sigma^2 \|x_i\|^2 / 2)$ for $i = 1, \ldots, n$. Then, taking the expectation of (6) w.r.t. the noise variable yields
$$\mathbb{E}[\hat{\mu}_1^\epsilon] = \frac{1}{n} \sum_{t_i = 1} \left[ y_i + y_i \beta_i \exp(-\hat{w}^\top x_i) \right], \qquad \mathbb{E}[\hat{\mu}_0^\epsilon] = \frac{1}{n} \sum_{t_i = 0} \left[ y_i + y_i \beta_i \exp(\hat{w}^\top x_i) \right].$$
We further rewrite $\mathbb{E}[\hat{\mu}_1^\epsilon]$ and $\mathbb{E}[\hat{\mu}_0^\epsilon]$ as
$$\mathbb{E}[\hat{\mu}_1^\epsilon] = \frac{1}{n} \sum_{t_i = 1} \left\{ y_i + y_i \exp(-\hat{w}^\top x_i) \right\} + \frac{1}{n} \sum_{t_i = 1} y_i \exp(-\hat{w}^\top x_i)\,(\beta_i - 1) = \hat{\mu}_1 + \frac{1}{n} \sum_{t_i = 1} y_i \exp(-\hat{w}^\top x_i)\,(\beta_i - 1),$$
$$\mathbb{E}[\hat{\mu}_0^\epsilon] = \hat{\mu}_0 + \frac{1}{n} \sum_{t_i = 0} y_i \exp(\hat{w}^\top x_i)\,(\beta_i - 1).$$
Consequently,
$$\mathbb{E}[\hat{\tau}^\epsilon] = \mathbb{E}[\hat{\mu}_1^\epsilon] - \mathbb{E}[\hat{\mu}_0^\epsilon] = \hat{\mu}_1 - \hat{\mu}_0 + \frac{1}{n} \sum_{i=1}^n \alpha_i (\beta_i - 1) = \hat{\tau} + g(\sigma),$$
where $g(\sigma) := n^{-1} \sum_{i=1}^n \alpha_i (\beta_i - 1)$. Setting $\sigma = 2(\epsilon m \lambda)^{-1} \sqrt{2 \log(1.25/\delta)}$ and substituting back into each $\beta_i$ yields the result (8).

B  Bounds for $\hat{\tau}_n$

Deterministic Bounds.
The first case corresponds to using trimmed or clipped propensity scores. Given a constant $0 < \xi < 1$, we define the trimmed version of $\hat{\tau}_n$ as
$$\hat{\tau}_{n,\xi} := \frac{1}{n} \sum_{i=1}^n \frac{t_i\, y_i}{\max\{\xi,\, \pi_{\hat{w}}^\epsilon(x_i)\}} - \frac{1}{n} \sum_{i=1}^n \frac{(1 - t_i)\, y_i}{\max\{\xi,\, 1 - \pi_{\hat{w}}^\epsilon(x_i)\}}.$$
While $\hat{\tau}_{n,\xi}$ is a biased estimate of $\tau$ with a bounded variance, it is often preferred due to its robustness to outliers. If $|y_i| \le C_y$ for all $i \in [n]$, it follows that $|\hat{\tau}_{n,\xi}| \le C_y \xi^{-1}$ with probability 1.

Probabilistic Bounds.
In the second case, we consider what happens if no trimming is applied. Although the variance of $\hat{\tau}_n$ can be unbounded, it is a deterministic function of a single sub-Gaussian noise variable $z \sim \mathcal{N}(0, \sigma^2 I_d)$. Hence, we expect the bounded difference condition for $\hat{\tau}_n$ to hold with high probability. To this end, let $S := \sum_{j=1}^d z_j$. Since each component of $z$ is independent, we have $S \sim \mathcal{N}(0, d\sigma^2)$. With $S$ being a sub-Gaussian random variable, Chernoff's inequality [36, pp. 21] gives $P(|S| \ge \zeta) \le 2\exp\!\left(-\zeta^2 (2 d \sigma^2)^{-1}\right)$ for some $\zeta > 0$. This implies that $|S| \le \zeta$ holds with probability at least $1 - \gamma$, where $\gamma = 2\exp(-\zeta^2 (2 d \sigma^2)^{-1})$. The following lemma gives the probabilistic bounded difference condition for $\hat{\tau}_n$.

Lemma 6.
Let $\hat{\tau}_n$ and $\hat{\tau}_n'$ be two estimates computed with different noise vectors $z$ and $z'$, respectively. Then, with probability at least $1 - 2\gamma$, we have $|\hat{\tau}_n - \hat{\tau}_n'| \le \eta$, where
$$\eta := \frac{2}{n} \sinh(\zeta) \left( \sum_{t_i = 1} y_i \exp(-\hat{w}^\top x_i) + \sum_{t_i = 0} y_i \exp(\hat{w}^\top x_i) \right).$$

Proof.
Let $\phi(z)$ be the deterministic function mapping the noise variable $z$ to $\hat{\tau}_n$ defined in (5). Furthermore, let $C_i^{(1)} := y_i \exp(-\hat{w}^\top x_i)$ for units with $t_i = 1$ and $C_i^{(0)} := y_i \exp(\hat{w}^\top x_i)$ for units with $t_i = 0$. Given that $|S| \le \zeta$ holds for both noise vectors with probability at least $1 - 2\gamma$, it follows that
$$|\phi(z) - \phi(z')| \le \frac{1}{n} \sum_{t_i = 1} C_i^{(1)} \left| \exp(-z^\top x_i) - \exp(-z'^\top x_i) \right| + \frac{1}{n} \sum_{t_i = 0} C_i^{(0)} \left| \exp(z^\top x_i) - \exp(z'^\top x_i) \right|$$
$$\le \frac{\exp(\zeta) - \exp(-\zeta)}{n} \left( \sum_{t_i = 1} C_i^{(1)} + \sum_{t_i = 0} C_i^{(0)} \right) = \frac{2}{n} \sinh(\zeta) \left( \sum_{t_i = 1} C_i^{(1)} + \sum_{t_i = 0} C_i^{(0)} \right) =: \eta$$
also holds with probability at least $1 - 2\gamma$. This concludes the proof.

C  L2-Sensitivity of $\hat{\tau}_n$

Proof. If $|y_i| \le C_y$ and $\Omega_1 < P(T = 1 \mid x_i) < \Omega_2$ for all $i \in [n]$, where $C_y$, $\Omega_1$ and $\Omega_2$ are constants, we have
$$S(\hat{\tau}_n) = \max_{\substack{\mathcal{D}, \mathcal{D}' \\ \|\mathcal{D} - \mathcal{D}'\| = 1}} \left\| \left( \frac{1}{n} \sum_{i=1}^n \frac{y_i t_i}{\pi_{\hat{w}}(x_i)} - \frac{1}{n} \sum_{i=1}^n \frac{y_i (1 - t_i)}{1 - \pi_{\hat{w}}(x_i)} \right) - \left( \frac{1}{n} \sum_{i=1}^n \frac{y_i' t_i'}{\pi_{\hat{w}}(x_i')} - \frac{1}{n} \sum_{i=1}^n \frac{y_i' (1 - t_i')}{1 - \pi_{\hat{w}}(x_i')} \right) \right\|$$
$$= \frac{1}{n} \max \left\{ \max_{\substack{\{x_i, y_i, t_i\}, \{x_i', y_i', t_i'\} \\ t_i = t_i' = 1}} \left\| \frac{y_i}{\pi_{\hat{w}}(x_i)} - \frac{y_i'}{\pi_{\hat{w}}(x_i')} \right\|, \; \max_{\substack{\{x_i, y_i, t_i\}, \{x_i', y_i', t_i'\} \\ t_i = t_i' = 0}} \left\| \frac{y_i}{1 - \pi_{\hat{w}}(x_i)} - \frac{y_i'}{1 - \pi_{\hat{w}}(x_i')} \right\| \right\}$$
$$\le \frac{2 C_y}{n} \max \left\{ \Omega_1^{-1},\; (1 - \Omega_2)^{-1} \right\}.$$
Therefore, the L2-sensitivity of $\hat{\tau}_n$ is bounded from above by $2 n^{-1} C_y \max\{\Omega_1^{-1}, (1 - \Omega_2)^{-1}\}$.

D  Error of $\hat{\tau}^\epsilon$

This corollary bounds the error we incur by privatising the propensity score function.
Corollary 7.
For any constant $\Delta > 0$,
$$P(|\hat{\tau}^\epsilon - \hat{\tau}| \ge \Delta) \le \frac{1}{n\Delta} \left| \sum_{i=1}^n \alpha_i \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right] \right|.$$

Proof.
By Lemma 3 and the Markov inequality,
$$P(|\hat{\tau}^\epsilon - \hat{\tau}| \ge \Delta) \le \frac{\left| \mathbb{E}[\hat{\tau}^\epsilon - \hat{\tau}] \right|}{\Delta} = \frac{\left| \hat{\tau} + g(\epsilon, m, n, \lambda, \delta) - \hat{\tau} \right|}{\Delta} = \frac{\left| g(\epsilon, m, n, \lambda, \delta) \right|}{\Delta} = \frac{1}{n\Delta} \left| \sum_{i=1}^n \alpha_i \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right] \right|.$$
The last step follows from the definition of $g(\epsilon, m, n, \lambda, \delta)$. This concludes the proof.

E  Proof of Theorem 5
Proof.
By the chain rule of probability,
$$P(\hat{\tau}_n^\epsilon < 0,\, \hat{\tau}_n < 0 \mid \hat{\tau} > 0) = P(\hat{\tau}_n^\epsilon < 0 \mid \hat{\tau}_n < 0,\, \hat{\tau} > 0)\; P(\hat{\tau}_n < 0 \mid \hat{\tau} > 0).$$
Since we already possess an upper bound on $P(\hat{\tau}_n < 0 \mid \hat{\tau} > 0)$ (see Theorem 4), we focus on obtaining $P(\hat{\tau}_n^\epsilon < 0 \mid \hat{\tau}_n < 0,\, \hat{\tau} > 0)$. Since $\hat{\tau}_n^\epsilon$ is normally distributed with mean $\hat{\tau}_n$, this is just the probability that $\hat{\tau}_n^\epsilon < 0$ given that its mean is negative. Hence, we can obtain the exact probability using the Gaussian CDF, i.e.,
$$P(\hat{\tau}_n^\epsilon < 0 \mid \hat{\tau}_n < 0,\, \hat{\tau} > 0) = \Phi\left( \frac{|\hat{\tau}_n|}{\sigma_n} \right) = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_n|}{\sigma_n \sqrt{2}} \right) \right],$$
where $\Phi(\cdot)$ denotes the CDF of the standard normal distribution and $\mathrm{erf}(\cdot)$ is the error function. Combining this with the bound in Theorem 4 yields
$$P(\hat{\tau}_n^\epsilon \le 0,\, \hat{\tau}_n \le 0 \mid \hat{\tau} > 0) \le \frac{1}{2} \exp\left( -\frac{2(\hat{\tau} + g)^2}{\Delta^2} \right) \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_n|}{\sigma_n \sqrt{2}} \right) \right],$$
where $g := g(\epsilon, m, n, \lambda, \delta)$. This concludes the proof.

F  Privatised ATT and ATC Estimates
The IPW estimators for the ATT and ATC have the form
$$\hat{\tau}_{\mathrm{ATT}} = \frac{1}{n} \sum_{i=1}^n \left( t_i - (1 - t_i) \frac{\pi_{\hat{w}}(x_i)}{1 - \pi_{\hat{w}}(x_i)} \right) y_i, \qquad \hat{\tau}_{\mathrm{ATC}} = \frac{1}{n} \sum_{i=1}^n \left( t_i \frac{1 - \pi_{\hat{w}}(x_i)}{\pi_{\hat{w}}(x_i)} - (1 - t_i) \right) y_i,$$
and all related quantities are denoted with a subscripted ATT or ATC. First, we simplify the $\frac{\pi(x_i)}{1 - \pi(x_i)}$ and $\frac{1 - \pi(x_i)}{\pi(x_i)}$ terms in $\hat{\tau}_{\mathrm{ATT}}$ and $\hat{\tau}_{\mathrm{ATC}}$ to obtain
$$\hat{\tau}_{\mathrm{ATT}} = \frac{1}{n} \sum_{i=1}^n \left( t_i - (1 - t_i) \exp(\hat{w}^\top x_i) \right) y_i, \qquad \hat{\tau}_{\mathrm{ATC}} = \frac{1}{n} \sum_{i=1}^n \left( t_i \exp(-\hat{w}^\top x_i) - (1 - t_i) \right) y_i.$$
By employing the privacy-preserving propensity scores from Definition 2, the perturbed $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ acquire additional multiplicative terms $\exp(z^\top x_i)$ and $\exp(-z^\top x_i)$ on top of $\exp(\hat{w}^\top x_i)$ and $\exp(-\hat{w}^\top x_i)$ respectively, with $z \sim \mathcal{N}(0, \sigma^2 I_d)$ and $\sigma = \epsilon^{-1} \sqrt{2 \log(1.25/\delta)}\, S(\hat{w})$ for $\epsilon \in (0, 1)$ and any $\delta \in (0, 1)$. This yields estimates that are DP w.r.t. $\mathcal{D}_m$.

We then adapt Lemma 3 to the ATT and ATC by adding and subtracting $\mu_{0,\mathrm{ATT}}$ and $\mu_{1,\mathrm{ATC}}$ to the expectation of the ATT and ATC respectively. Let $\alpha_{i,\mathrm{ATT}} := -\mathbb{1}_{t_i = 0}\, y_i \exp(\hat{w}^\top x_i)$ and $\alpha_{i,\mathrm{ATC}} := \mathbb{1}_{t_i = 1}\, y_i \exp(-\hat{w}^\top x_i)$, where $\mathbb{1}$ is the indicator function and its subscript the condition under which the function equals 1. With these in place, we obtain $\mathbb{E}[\hat{\tau}_{n,\mathrm{ATT}}] = \hat{\tau}_{\mathrm{ATT}} + g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta)$, where
$$g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta) = \frac{1}{n} \sum_{i=1}^n \alpha_{i,\mathrm{ATT}} \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right],$$
and $\mathbb{E}[\hat{\tau}_{n,\mathrm{ATC}}] = \hat{\tau}_{\mathrm{ATC}} + g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta)$, where
$$g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta) = \frac{1}{n} \sum_{i=1}^n \alpha_{i,\mathrm{ATC}} \left[ \exp\left( \frac{4 \log(1.25/\delta)\, \|x_i\|^2}{\epsilon^2 m^2 \lambda^2} \right) - 1 \right].$$
We bound the supports of $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ by ensuring that $\exp(-\hat{w}^\top x_i)$ and $\exp(\hat{w}^\top x_i)$ are smaller than or equal to some constant $\xi$ for all $i \in [n]$; this can be achieved by the techniques described in Appendix B. With that in place, we can then extend Theorem 4 to cover the cases where $\hat{\tau}_{\mathrm{ATT}}$ and $\hat{\tau}_{\mathrm{ATC}}$ are both assumed to be greater than 0 and to have the same sign as their true counterparts. The next two theorems illustrate the behaviour of $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ respectively.
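As a numeric sanity check of the simplified forms above, note that for a logistic propensity $\pi_{\hat{w}}(x) = \mathrm{sigmoid}(\hat{w}^\top x)$ the odds $\pi/(1-\pi)$ equal $\exp(\hat{w}^\top x)$ exactly. The sketch below verifies this identity and evaluates both estimators; the weights and toy data are illustrative assumptions, not the paper's code.

```python
# Sketch of the simplified ATT/ATC estimators above for a logistic propensity
# pi(x) = sigmoid(w_hat @ x), so that pi / (1 - pi) = exp(w_hat @ x).
# w_hat and the toy data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_hat = np.array([0.5, -0.2, 0.1])    # assumed fitted propensity weights
pi = 1.0 / (1.0 + np.exp(-X @ w_hat))
t = rng.binomial(1, pi)
y = 2.0 * t + rng.normal(size=n)

odds = np.exp(X @ w_hat)              # equals pi / (1 - pi) exactly
tau_att = np.mean((t - (1 - t) * odds) * y)    # simplified ATT estimator
tau_atc = np.mean((t / odds - (1 - t)) * y)    # simplified ATC estimator
```

Note that the $1/n$ normalisation follows the displayed estimators; other treatments of the ATT normalise by the number of treated units instead.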
Theorem 8.
Assume that $\hat{\tau}_{\mathrm{ATT}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATT}}) = \mathrm{sign}(\tau_{\mathrm{ATT}})$. If $|\hat{\tau}_{n,\mathrm{ATT}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATT}} \le 0 \mid \hat{\tau}_{\mathrm{ATT}} > 0) \le \exp\left( -2\eta^{-2} \left( \hat{\tau}_{\mathrm{ATT}} + g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta) \right)^2 \right).$$

Theorem 9.
Assume that $\hat{\tau}_{\mathrm{ATC}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATC}}) = \mathrm{sign}(\tau_{\mathrm{ATC}})$. If $|\hat{\tau}_{n,\mathrm{ATC}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATC}} \le 0 \mid \hat{\tau}_{\mathrm{ATC}} > 0) \le \exp\left( -2\eta^{-2} \left( \hat{\tau}_{\mathrm{ATC}} + g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta) \right)^2 \right).$$
Unsurprisingly, the results of Theorems 8 and 9 have similar implications as Theorem 4 for the ATE: the probability of drawing incorrect conclusions is exponentially related to the true estimated ATT/ATC and its corresponding bias $g_{\mathrm{ATT}}$/$g_{\mathrm{ATC}}$.

Continuing with our analysis, we apply the Gaussian mechanism to $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ to ensure that the estimates are DP w.r.t. the remaining $\mathcal{D}_n$ points. We first conduct a standard sensitivity analysis on $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$ to compute the variance of the noise added to the estimates by the Gaussian mechanism. Assuming that $|y_i| \le C_y$, $\exp(-\hat{w}^\top x_i) \le \xi$ and $\exp(\hat{w}^\top x_i) \le \xi$ for all $i \in [n]$ without loss of generality, we have
$$S(\hat{\tau}_{n,\mathrm{ATT}}) = S(\hat{\tau}_{n,\mathrm{ATC}}) \le \frac{2 C_y}{n} \max\{1, \xi\}.$$
We then fully privatise the ATT and ATC estimates, making them DP w.r.t. both $\mathcal{D}_m$ and $\mathcal{D}_n$, by adding noise $e \sim \mathcal{N}(0, \sigma_n^2)$ to $\hat{\tau}_{n,\mathrm{ATT}}$ and $\hat{\tau}_{n,\mathrm{ATC}}$, with noise standard deviation $\sigma_{n,\mathrm{ATT/ATC}} := \epsilon^{-1} \sqrt{2 \log(1.25/\delta)}\, S(\hat{\tau}_{n,\mathrm{ATT/ATC}})$. The following theorems bound the probability that both $\hat{\tau}_{n,\mathrm{ATT}}^\epsilon$ and $\hat{\tau}_{n,\mathrm{ATT}}$ / $\hat{\tau}_{n,\mathrm{ATC}}^\epsilon$ and $\hat{\tau}_{n,\mathrm{ATC}}$ yield incorrect causal conclusions. The proofs are modified from that in Appendix E.
Theorem 10.
Assume that $\hat{\tau}_{\mathrm{ATT}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATT}}) = \mathrm{sign}(\tau_{\mathrm{ATT}})$. If $|\hat{\tau}_{n,\mathrm{ATT}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATT}}^\epsilon \le 0,\, \hat{\tau}_{n,\mathrm{ATT}} \le 0 \mid \hat{\tau}_{\mathrm{ATT}} > 0) \le \frac{1}{2} \exp\left( -\frac{2(\hat{\tau}_{\mathrm{ATT}} + g)^2}{\eta^2} \right) \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_{n,\mathrm{ATT}}|}{\sigma_{n,\mathrm{ATT}} \sqrt{2}} \right) \right],$$
where $g := g_{\mathrm{ATT}}(\epsilon, m, n, \lambda, \delta)$.

Theorem 11.
Assume that $\hat{\tau}_{\mathrm{ATC}} > 0$ and $\mathrm{sign}(\hat{\tau}_{\mathrm{ATC}}) = \mathrm{sign}(\tau_{\mathrm{ATC}})$. If $|\hat{\tau}_{n,\mathrm{ATC}}| \le \eta$ for some $\eta > 0$, we have
$$P(\hat{\tau}_{n,\mathrm{ATC}}^\epsilon \le 0,\, \hat{\tau}_{n,\mathrm{ATC}} \le 0 \mid \hat{\tau}_{\mathrm{ATC}} > 0) \le \frac{1}{2} \exp\left( -\frac{2(\hat{\tau}_{\mathrm{ATC}} + g)^2}{\eta^2} \right) \left[ 1 + \mathrm{erf}\left( \frac{|\hat{\tau}_{n,\mathrm{ATC}}|}{\sigma_{n,\mathrm{ATC}} \sqrt{2}} \right) \right],$$
where $g := g_{\mathrm{ATC}}(\epsilon, m, n, \lambda, \delta)$.
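To close, here is a minimal end-to-end sketch of the two-stage privatisation analysed in this appendix: Gaussian noise on the propensity weights (DP w.r.t. the $m$ fitting units), then the Gaussian mechanism on the estimate itself (DP w.r.t. the $n$ estimation units). All constants, the sensitivity choices and the data-generating process are illustrative assumptions; in particular, $C_y$ is taken from the data here for illustration only, whereas in practice it would be a public bound.

```python
# Two-stage privatisation sketch: noise z on the propensity weights, then the
# Gaussian mechanism on the ATE estimate. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(2)

def gauss_sigma(eps, delta, sens):
    # Gaussian-mechanism scale: sigma = sens * sqrt(2 log(1.25/delta)) / eps.
    return sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

n, d, m, lam = 500, 3, 500, 0.1
eps, delta = 1.0, 1e-5

X = rng.normal(size=(n, d))
w_hat = np.array([0.8, -0.5, 0.3])            # assumed fitted weights
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ w_hat)))
y = 4.0 * t + rng.normal(size=n)              # toy data, true effect 4

# Stage 1: perturb the weights; S(w_hat) = 2/(m * lam) for regularised
# logistic regression, as assumed in the analysis above.
z = rng.normal(scale=gauss_sigma(eps, delta, 2.0 / (m * lam)), size=d)
pi_eps = np.clip(1.0 / (1.0 + np.exp(-X @ (w_hat + z))), 0.05, 0.95)
tau_n = np.mean(t * y / pi_eps) - np.mean((1 - t) * y / (1 - pi_eps))

# Stage 2: Gaussian mechanism on the estimate, using the Appendix C bound
# (2 C_y / n) * max(1/Omega1, 1/(1 - Omega2)). C_y from the data is for
# illustration only; it should be a public bound in practice.
C_y, Om1, Om2 = np.max(np.abs(y)), 0.05, 0.95
S_tau = (2.0 * C_y / n) * max(1.0 / Om1, 1.0 / (1.0 - Om2))
tau_n_eps = tau_n + rng.normal(scale=gauss_sigma(eps, delta, S_tau))
print(f"private ATE estimate: {tau_n_eps:.2f}")
```

The clipping in stage 1 plays the role of the trimming from Appendix B, which both bounds the support of the estimate and keeps the stage-2 sensitivity finite.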