Private Causal Inference
Matt J. Kusner Yu Sun Karthik Sridharan Kilian Q. Weinberger
Washington University in St. Louis [email protected]
Cornell University [email protected]
Cornell University [email protected]
Cornell University [email protected]
Abstract
Causal inference deals with identifying which random variables "cause" or control other random variables. Recent advances on the topic of causal inference based on tools from statistical estimation and machine learning have resulted in practical algorithms for causal inference. Causal inference has the potential to have significant impact on medical research, prevention and control of diseases, and identifying factors that impact economic changes, to name just a few. However, these promising applications for causal inference often involve sensitive or personal data of users that need to be kept private (e.g., medical records, personal finances, etc.). Therefore, there is a need for the development of causal inference methods that preserve data privacy. We study the problem of inferring causality using the current, popular causal inference framework, the additive noise model (ANM), while simultaneously ensuring privacy of the users. Our framework provides differential privacy guarantees for a variety of ANM variants. We run extensive experiments, and demonstrate that our techniques are practical and easy to implement.
Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.

Causal identification allows one to reason about how manipulations of certain random variables (the causes) affect the outcomes of others (the effects). Uncovering these causal structures has implications ranging from creating government policies to informing health-care practices. Causal inference was motivated by the impossibility of randomized intervention experiments in many cases, and by the ambiguity of conditional independence testing [32, 25]. In the absence of interventions, it attempts to discover the underlying causal relationships of a set of random variables entirely based on samples from their joint distribution. The field of causal inference is now a mature research area, covering learning topics as diverse as supervised batch inference [19, 23, 26], time-series causal prediction [10], and linear dynamical systems [30]. Many inference methods require only a regression technique and a way to compute the independence between two distributions given samples [13, 16].

One would hope that researchers could publicly release their causal inference findings to inform individuals and policy makers. One of the primary roadblocks to doing so is that causal inference is often performed on data that individuals may wish to keep private, such as data in the fields of medical diagnosis, fraud detection, and risk analysis. Currently, no causal inference method has formal guarantees about the privacy of individual data, which may be inferred via attacks such as reconstruction attacks [3].

Arguably one of the best notions of privacy is differential privacy, introduced by Dwork et al. [6] and since used throughout machine learning [4, 15, 20, 2, 5]. Differential privacy guarantees that the outcome of an algorithm only reveals aggregate information about the entire dataset and never about an individual. An individual who is considering whether to participate in a study can be reassured that, with extremely high probability, his/her personal information cannot be recovered.

To our knowledge, this paper is the first to investigate private causal inference.
We show that it is possible to privately release the quantities produced by the highly successful additive noise model (ANM) framework by adding small amounts of noise, as dictated by differential privacy. Furthermore, these private quantities, with high probability, do not change the causal inference result, so long as it is confident enough. We demonstrate on a set of real-world causal inference datasets how our privacy-preserving methods can be readily and usefully applied.

Discovering the causal nature between random events has captivated researchers and philosophers long before the formal development of statistics. This interest was formalized by Reichenbach & Reichenbach [28], who argued that all statistical correlations in data arise from underlying causal structures between the concerned random variables. For example, the correlation between smoking and lung cancer was found to arise from a direct causal link [7].

One of the most popular causal inference alternatives to conditional independence testing is the Additive Noise Model (ANM) approach developed by Hoyer et al. [13] and used in many recent works [35, 21, 18, 1]. ANMs, originally designed for inferring whether X → Y or Y → X and later extended to large numbers of random variables, work under the assumption that the effect is a non-linear function of the cause plus independent noise. ANMs are one of many proposed causal inference methods in recent literature [16, 9, 19, 29].

Work by Spirtes et al. [32] and Pearl [25] shows how to determine whether X → Y when these variables are part of a larger 'causal network', via conditional independence testing. One downside to conditional-independence-based approaches is that they inherently cannot distinguish between Markov-equivalent graphs. Thus it may be possible that a certain set of conditional independences implies both X → Y and Y → X.
Furthermore, if X and Y are the only variables in the causal network, there is no conditional independence test to determine whether X → Y or Y → X.

Our aim is to protect the privacy of individuals who submit personal information about two random variables of interest X and Y. Their information should remain private when it is used to infer whether X causes Y (X → Y), or Y causes X (Y → X), using the ANM framework. This personal information comes in the form of i.i.d. samples {(x_i, y_i)}_{i=1}^n from the joint distribution P_{X,Y}. We will assume that: 1. there is no confounding variable Z that commonly causes, or is a common effect of, X and Y; 2. X and Y do not simultaneously cause each other.

Deciding on the causal direction between two variables X and Y from a finite sample set has motivated an array of research [8, 17, 33, 13, 35, 22, 16, 18, 19]. Perhaps one of the most popular results is the Additive Noise Model (ANM) proposed by Hoyer et al. [13]. The ANM framework assumption is defined as follows. Definition 1.
Two random variables X, Y with joint density p(x, y) are said to 'satisfy an ANM' X → Y if there exists a non-linear function f : R → R and a random noise variable N_Y, independent from X (i.e., X ⊥⊥ N_Y), such that Y = f(X) + N_Y.

As defined, an ANM X → Y implies a functional relationship mapping X to Y, alongside independent noise. In order for this model to be useful for causal inference, we would like the induced joint distribution P_{X,Y} for this ANM to be somehow identifiably different from the one induced by the ANM Y → X. If so, we say that the causal direction is identifiable [23]. If not, we have no hope of recovering the causal direction purely from samples under the ANM.

Hoyer et al. [13] showed that ANMs are generically identifiable from i.i.d. samples from P_{X,Y} (except for a few special cases of non-linear functions f and noise distributions). The intuition is as follows: for the X → Y ANM, for most non-linear f and (for simplicity) zero-mean N_Y, the density p(y | x) has mean f(x), with distribution given by N_Y. This implies that p(y − f(x) | x) has distribution N_Y, which is independent of X. However, p(x − f^{-1}(y) | y) is, for many choices of f and N_Y, not independent of y. Algorithm 1
ANM Causal Inference [23].
Input: train/test data {x_i, y_i}_{i=1}^n, {x'_i, y'_i}_{i=1}^m
1. Regress on the training data to yield f̂, ĝ such that f̂(x_i) ≈ y_i and ĝ(y_i) ≈ x_i for all i.
2. Compute residuals on the test data: r'_Y := y' − f̂(x'), r'_X := x' − ĝ(y').
3. Calculate dependence scores: s_{X→Y} := s(x', r'_Y), s_{Y→X} := s(y', r'_X).
Return: s_{X→Y}, s_{Y→X}, and D, where D = X → Y if s_{X→Y} < s_{Y→X}, and D = Y → X if s_{X→Y} > s_{Y→X}.

Mooij et al. [23] give a practical algorithm for determining the causal relationship between X and Y (i.e., either X → Y or Y → X), as shown in Algorithm 1. The first step is to partition the i.i.d. samples into a training and a testing set. We use the training set to train the regression functions f̂ : X → Y and ĝ : Y → X. We use the testing set to compute the residuals r'_Y = y' − f̂(x') and r'_X := x' − ĝ(y'). If we have an ANM X → Y, then the residual r'_Y is an estimate of the noise N_Y, which is assumed to be independent of X. Therefore, we calculate the dependence between the residual r'_Y and the input x', s_{X→Y} := s(x', r'_Y), as well as s_{Y→X} := s(y', r'_X), using a dependence score s(·,·). If s_{X→Y} is less than s_{Y→X}, then we declare X → Y; otherwise Y → X.

Crucially, the ANM approach hinges on the choice of dependence score s(·,·). There have been many proposals, and we give a quick review of the most popular methods (for a detailed review see Mooij et al. [23]).

Spearman's ρ is a rank correlation coefficient that describes the extent to which one random variable is a monotonic function of the other. Specifically, imagine independently sorting the observations {a_1, . . . , a_m} and {b_1, . . .
, b_m} by value in increasing order. Let o^a_i be the rank of a_i in the a-ordering and, similarly, o^b_i for b_i in the b-ordering. Then Spearman's ρ score is

s(a, b) := | 1 − 6 Σ_{i=1}^m d_i² / (m(m² − 1)) |,

where d_i := o^a_i − o^b_i are the rank differences for a, b.

Kendall's τ. Similar to Spearman's ρ, the Kendall τ rank score calls a pair of indices (i, j) concordant if a_i > a_j and b_i > b_j, or a_i < a_j and b_i < b_j. Otherwise (i, j) is called discordant. Then the dependence score is defined as

s(a, b) := |C − D| / (m(m − 1)/2),

where C is the number of concordant pairs and D is the number of discordant pairs. HSIC Score.
The first proposed score for ANM causal inference is based on the Hilbert-Schmidt Independence Criterion (HSIC) [11], which was used by Hoyer et al. [13]. They compute an estimate of the p-value of the HSIC under the null hypothesis of independence, selecting the causal direction with the lower p-value. Alternatively, one can use an estimator of the HSIC value itself:

s(a, b) := HSIC-hat_{k_θ(a), k_θ(b)}(a, b),   (1)

where k_θ is a kernel with parameters θ. Mooij et al. [23] show that, under certain assumptions, Algorithm 1 with the HSIC dependence score is consistent for estimating the causal direction in an ANM. Variance Score.
When the noise variables in the ANM are Gaussian, Bühlmann et al. [1] proposed the variance score, defined as s(a, b) := log V(a) + log V(b). Changes to a single input value can induce arbitrarily large changes to this score, which makes the variance score ill suited to preserving differential privacy. IQR Score.
We introduce a robust version of this score by replacing the variance of the random variables with their interquartile range (IQR). The IQR is the difference between the third and first quartiles of the distribution and can be estimated empirically. We define the following IQR-based score:

s(a, b) := log IQR(a) + log IQR(b).   (2)

We assume that the data set D = {(x_i, y_i)} contains sensitive data that should not be inferable from the release of the dependence scores. One of the most widely accepted mechanisms for private data release is differential privacy [6]. In a nutshell, it ensures that the released scores can only be used to infer aggregate information about the data set, and never about an individual datum (x_i, y_i).

Let us define the Hamming distance d_H(D, D̃) between two data sets D and D̃ as the number of elements in which the two sets differ. If a data set D is changed to a data set D̃ with d_H(D, D̃) ≤ 1, we call D and D̃ neighboring. Definition 2.
A randomized algorithm A is (ε, δ)-differentially private for ε, δ ≥ 0 if for all O ∈ Range(A) and for all neighboring datasets D, D̃ with d_H(D, D̃) ≤ 1 we have that

Pr[A(D) = O] ≤ e^ε Pr[A(D̃) = O] + δ.   (3)

One of the most popular methods for making an algorithm (ε, 0)-differentially private is the Laplace mechanism. It is based on the global sensitivity Δ_A, describing how much A can change when D changes:

Δ_A := max_{D, D̃ ⊆ X s.t. d_H(D, D̃) ≤ 1} |A(D) − A(D̃)|.

The Laplace mechanism hides the output of A with a small amount of additive random noise, large enough to hide the impact of any single datum (x_i, y_i). Definition 3.
Given a dataset D and an algorithm A, the Laplace mechanism returns A(D) + ω, where ω is a noise variable drawn from Lap(0, Δ_A/ε), the Laplace distribution with scale parameter Δ_A/ε.

It may be that the global sensitivity of an algorithm A is unbounded in general, but can be bounded in the context of a specific data set D over all of its neighbors D̃. For such datasets we can bound the local sensitivity

Δ(D)_A := max_{D̃ ⊆ X s.t. d_H(D, D̃) ≤ 1} |A(D) − A(D̃)|.

Table 1: Dependence scores and their privacy. A checkmark indicates that there exist meaningful bounds on either the global or local sensitivity.

                    Test sensitivity     Training sensitivity
  Score             Global    Local      Global    Local
  Spearman's ρ        ✓         ✓          –         ✓
  Kendall's τ         ✓         ✓          –         ✓
  HSIC                ✓         ✓          ✓         ✓
  IQR                 –         ✓          –         ✓

If an algorithm has bounded global sensitivity, it certainly has bounded local sensitivity. Nissim et al. [24], Dwork & Lei [4], and Jain & Thakurta [14] show how to use the local sensitivity to cleverly produce private quantities for datasets with bounded local sensitivity.
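Definition 3 is simple to implement. The sketch below is our own minimal illustration (the function names are ours, and the sampler uses the standard inverse-CDF trick), applied to a toy statistic whose global sensitivity is known, rather than to the dependence scores themselves:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Lap(0, scale) via the inverse CDF of a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(value, sensitivity, epsilon):
    """Release value + Lap(0, sensitivity/epsilon). This is (epsilon, 0)-
    differentially private for any statistic whose global sensitivity is
    at most `sensitivity` (Definition 3)."""
    return value + laplace_noise(sensitivity / epsilon)

# Toy example: privately release the mean of n values in [0, 1].
# Changing one datum moves the mean by at most 1/n, its global sensitivity.
data = [0.2, 0.9, 0.4, 0.7]
private_mean = laplace_mechanism(sum(data) / len(data), 1.0 / len(data), epsilon=1.0)
```

For the dependence scores, `sensitivity` would be instantiated with the test-set bounds derived below (e.g., Theorem 1's bounds for the rank scores).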
The data is partitioned into a training and a test set, which are used in different ways. We therefore introduce mechanisms that preserve training and test set privacy respectively, which can be used jointly. Specifically, we show how to privatize the dependence scores s_{X→Y}, s_{Y→X}. The reasons for this are three-fold: 1. Privatizing the dependence scores immediately privatizes the causal direction D, because operations on differentially private outputs preserve privacy (so long as they do not touch the data). 2. Releasing the scores indicates how confident the ANM method is about the causal direction, which is absent from the binary output D. 3. It is unclear which dependence score is best for a particular dataset, so we privatize multiple scores and leave this choice to the practitioner. In this section we begin with test set privacy; we describe training set privacy in Section 5. Table 1 gives an overview of the test and training set privacy results for the dependence scores that we consider.

Let (x', y') be the initial test data and (x̃', ỹ') be the test data after a single change in the dataset. Let x̃' = [x'_1, . . . , x'_{k−1}, x̃'_k, x'_{k+1}, . . . , x'_m]^T, and similarly for ỹ', so that this single change occurs at some index k. The key to preserving privacy is to show that the selected dependence score s(·,·) can be privatized. We show that if our dependence score is a rank correlation coefficient (Spearman's ρ, Kendall's τ) or the HSIC score [11], we can readily bound its test set global sensitivity when applied to (x', y') versus (x̃', ỹ'). As the IQR score has bounded test set local sensitivity, we can instead apply the algorithm of Dwork & Lei [4] for privacy. We first demonstrate global sensitivity for the two rank correlation scores in Section 3.
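To fix concrete form for what follows, here is a minimal sketch (our own code, not the authors') of the two rank-correlation scores and the decision rule of Algorithm 1; `a` plays the role of the test inputs x' and `b` the residuals r'_Y:

```python
def ranks(values):
    """0-based rank of each element when sorted in increasing order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman_score(a, b):
    """|Spearman's rho|: s(a, b) = |1 - 6 * sum(d_i^2) / (m (m^2 - 1))|."""
    m = len(a)
    d = [ra - rb for ra, rb in zip(ranks(a), ranks(b))]
    return abs(1.0 - 6.0 * sum(di * di for di in d) / (m * (m * m - 1)))

def kendall_score(a, b):
    """|Kendall's tau|: |C - D| divided by the number of pairs, m(m-1)/2."""
    m = len(a)
    concordant = discordant = 0
    for i in range(m):
        for j in range(i + 1, m):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return abs(concordant - discordant) / (m * (m - 1) / 2.0)

def anm_decision(s_x_to_y, s_y_to_x):
    """Algorithm 1's final step: the direction whose residual looks *less*
    dependent on the putative cause wins."""
    return "X->Y" if s_x_to_y < s_y_to_x else "Y->X"
```

A low score indicates that the residual looks independent of the putative cause, which is exactly the ANM criterion.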
Theorem 1.
The rank correlation coefficients have the following global sensitivities:

1. Let ρ(·,·) be the Spearman's ρ score; then |ρ(x', r'_Y) − ρ(x̃', r̃'_Y)| ≤ 30/m.

2. Let τ(·,·) be the Kendall's τ score; then |τ(x', r'_Y) − τ(x̃', r̃'_Y)| ≤ 8/m. Proof.
Our goal is to bound the global sensitivity |s(x', r'_Y) − s(x̃', r̃'_Y)| for both scores. For Spearman's ρ, suppose the change is to a_k and b_k. It is easy to verify that: 1) d_i changes by at most 2 for i ≠ k; 2) d_k changes by at most m − 1; 3) |d_i| ≤ m − 1. Since d̃_i² − d_i² = (d̃_i − d_i)(d̃_i + d_i) ≤ 2 · 2(m − 1) = 4(m − 1) for i ≠ k, the maximum change inside the summation is upper bounded by 4(m − 1)² + (m − 1)² = 5(m − 1)². Therefore the global sensitivity of ρ is bounded by 6 · 5(m − 1)²/(m(m² − 1)) = 30(m − 1)/(m(m + 1)) ≤ 30/m.

For Kendall's τ, we can affect at most m − 1 pairs by moving a single element of x', and likewise at most m − 1 pairs by moving a single element of r'_Y (either from concordant pairs to discordant pairs, or vice versa). Each affected pair changes C − D by at most 2, so the global sensitivity of Kendall's τ is |s(x', r'_Y) − s(x̃', r̃'_Y)| ≤ 4(m − 1) / (m(m − 1)/2) = 8/m.

The bound on the global sensitivity Δ of our scores enables us to apply the Laplace mechanism [6] to produce (2ε, 0)-differentially private scores p_{X→Y}, p_{Y→X}. Specifically, we add Laplace noise Lap(0, Δ/ε) to our Spearman's ρ and Kendall's τ scores to preserve privacy w.r.t. the test set. Moreover, as a general property of differential privacy, we can compute any functions of these private scores and, so long as they do not touch the data, the outputs of these functions are also private. This means that we can privately compute the inequality p_{X→Y} < p_{Y→X} to decide whether X causes Y or vice versa.

An important consideration is to what degree the addition of noise affects the true decision s_{X→Y} < s_{Y→X}. Importantly, we can prove that, in certain cases, the Laplace noise required by the mechanism is small enough not to change the direction of causal inference. These are the cases in which there is a large 'margin' between the scores s_{X→Y} and s_{Y→X}. So long as this margin is large enough and in the correct order, the addition of Laplace noise has no effect on the inference, with high probability. Theorem 2.
Given two random variables X, Y which have, w.l.o.g., the causal relationship X → Y, assume that they produce correctly-ordered scores s_{X→Y} < s_{Y→X}. If the margin s_{Y→X} − s_{X→Y} is sufficiently large relative to the Laplace noise scale Δ/ε, then with high probability the privatized scores preserve this ordering, and the released causal direction is unchanged.

Recall the HSIC dependence score of eq. (1). Given kernels k and l, its empirical estimator on the test set is

HSIC-hat_{k,l}(x', r'_Y) := trace(KHLH)/(m − 1)²,   (4)

where K_ij := k(x'_i, x'_j), L_ij := l(r'_{Y,i}, r'_{Y,j}), and H_ij := δ{i = j} − 1/m. We assume k, l are bounded above by 1 (e.g., the squared exponential kernel, the Matérn kernel [27]). Our goal is to show that when we replace (x', y') with (x̃', ỹ'), the global sensitivity is small. Specifically, we prove the following theorem. Theorem 3.
The score in eq. (4) has a global sensitivity of at most 8(2m − 1)/(m − 1)². Specifically,

|HSIC-hat_{k,l}(x', r'_Y) − HSIC-hat_{k,l}(x̃', r̃'_Y)| ≤ 8(2m − 1)/(m − 1)². Proof.
For simplicity, define H(·,·) := HSIC-hat_{k,l}(·,·). Note that, as the trace is cyclic, trace(KHLH) = trace(HKHL). Further, let K̃, L̃ be the kernel matrices defined on the modified data (x̃', ỹ'). Then, as the data enters purely through the kernel matrices and the trace is Lipschitz w.r.t. these matrices, we can apply the triangle inequality to yield

|H(x', r'_Y) − H(x̃', r̃'_Y)| ≤ (||HLH||_∞ ||K − K̃||_1 + ||HKH||_∞ ||L − L̃||_1) / (m − 1)².

To bound the infinity norms, let L̄ = HLH; then

|L̄_ij| = | L_ij − (1/m) Σ_{a=1}^m L_aj − (1/m) Σ_{b=1}^m L_ib + (1/m²) Σ_{a,b=1}^m L_ab | ≤ 4,

since every kernel entry is at most 1 (and similarly for HKH). Finally, note that as there is only a single-element difference between (x', r'_Y) and (x̃', r̃'_Y), only one row and one column of K change, so ||K − K̃||_1 ≤ 2m − 1 (and similarly for L, L̃), which yields the bound. In fact, this bound can be improved by a constant factor using trace identities; we leave the proof of this to the appendix.

Given this global sensitivity bound, we can use Theorem 2 to guarantee that, under certain conditions, the Laplace mechanism w.h.p. does not change the direction of causal inference.

Unfortunately, the IQR does not have a bounded global sensitivity, as there exist datasets for which the IQR can change by an unbounded amount. Instead, Dwork & Lei [4] offer an efficient technique to privately release the IQR; we give a slightly modified version of their algorithm in the appendix. First, the algorithm defines two intervals B_1 and B_2 which both contain IQR(X). If the IQR were to be pushed out of both of these intervals, it would imply that the IQR changed by a factor of e. Therefore we loop over both intervals and calculate the number of points A_j that an adversary would need to change to push the IQR out of B_1 or B_2. Note that A_j is itself a data-sensitive query and so, to preserve the privacy of this query, we can add Laplace noise to it.
Then, if one of these noisy estimates R_j = A_j + z, where z ∼ Lap(0, 1/ε), is larger than some threshold, it implies with high probability (exactly 1 − δ) that IQR(X) has a multiplicative sensitivity of at most e for the specific dataset X. Note that this is precisely the local sensitivity as defined in Section 3, as it is specific to X. This means that we can add Laplace noise z to log IQR(X). If neither of the R_j is above the threshold, then the algorithm returns null: ⊥. This algorithm was shown to be (3ε, δ)-differentially private.

In our case we would like to release four private IQR scores. Note that we must look at x' three separate times: for IQR(x'), IQR(r'_Y), and IQR(r'_X) (and likewise three times for y'). Therefore, for both x' and y', we are composing three differentially private outputs. Under simple composition this would lead to (9ε, 3δ)-differential privacy for both x' and y'. However, we can make use of Corollary 3.21 in Dwork & Roth [5] to give (ε', 3δ + δ')-differential privacy, for 0 < ε' < 1 and δ' > 0, over the three repeated mechanisms, by ensuring each private mechanism is (3ε, δ)-private, where 3ε = ε'/(2√(6 ln(1/δ'))).

The remaining question is whether this noise addition causes one to infer the incorrect causal direction. Again, as long as there is a significant margin between the scores, we can preserve the correct causal inference with high probability, as follows. Theorem 4.

Let Q_{x'} = log IQR(x'), and similarly for Q_{y'}, Q_{r'_X}, Q_{r'_Y}, be the true log-IQR scores. As well, let P_{x'}, P_{y'}, P_{r'_X}, P_{r'_Y} be the private versions, multiplied by e^z noise, where z ∼ Lap(0, 1/ε). Then the following results hold:

1. [4] If the number of data points A_j needed to significantly change the IQR is below the release threshold, then the probability that any one of the private IQRs P_* is released is small:

P[P_* ≠ ⊥ | A_1 or A_2 below threshold] ≤ δ.
2. If all private log-IQR scores are released, and the relationship between the true scores Q_{x'} + Q_{r'_Y} < Q_{y'} + Q_{r'_X} holds (which implies X → Y), then the probability that we make the correct causal inference from the private scores is large:

P[P_{x'} + P_{r'_Y} < P_{y'} + P_{r'_X}] = 1 − (e^{−γ/σ} / (96σ³)) (48σ³ + 33σ²γ + 9σγ² + γ³),

where γ = (Q_{y'} + Q_{r'_X}) − (Q_{x'} + Q_{r'_Y}) and σ = 1/ε.

The proof of these results is in the appendix. The first result says that the probability that we release an IQR score just because too much noise was added to A_j is small. The second result says that with high probability we recover the true causal direction, depending on the size of the dataset.

Let (x, y) be the initial training data and (x̃, ỹ) be the training data after a change in the dataset. Note that x and x̃ differ in at most one element (similarly for y and ỹ). The length of both training datasets is n. From Algorithm 1, the only way the training set can affect the dependency scores s_{X→Y}, s_{Y→X} is through the regression functions f̂, ĝ used to compute the test set residuals r'_Y, r'_X. We use kernel ridge regression, so the functions f̂ (and ĝ) can be written in the form f̂(w, x) = w^T φ(x), where φ(x) is a (possibly infinite-dimensional) feature mapping into the Hilbert space corresponding to the kernel function used. Similar to other work on private regression [34], we assume that |x_i|, |y_i| ≤ 1. The ridge regression problem can now be written as

w* = argmin_{w ∈ H} λ||w||²_H + (1/n) Σ_{i=1}^n (w^T φ(x_i) − y_i)²,   (5)

where H is the corresponding Hilbert space. Practically speaking, even though w may be infinite-dimensional, because it always appears in an inner product with the feature mapping φ(x), we can utilize the 'kernel trick' k(x_i, x_j) = φ(x_i)^T φ(x_j) to avoid representing w explicitly.

Let f̂(w*, ·) and f̂(w̃*, ·) be the regressors resulting from the optimization problem in eq. (5) when trained on (x, y) and (x̃, ỹ), respectively (and similarly for ĝ). We show that the residuals in Algorithm 1 are bounded. Theorem 5.
Say λ ≤ 1. Given that the regressors f̂(w*, ·), f̂(w̃*, ·) are the result of the optimization problem in eq. (5), the residuals r'_Y, r̃'_Y of these functions are bounded as

|r'_{i,Y} − r̃'_{i,Y}| ≤ 8/(nλ^{3/2})   (6)

for all i = 1, . . . , m, where r'_{i,Y}, r̃'_{i,Y} are the i-th elements of r'_Y, r̃'_Y and m is the size of the test set.

This bound holds equally for r'_X, r̃'_X. The proof of the above is inspired by the work of Shalev-Shwartz et al. [31] and Jain & Thakurta [14]; we place it in the appendix for the interested reader. As far as we are aware, this is the tightest such bound for the optimization problem in eq. (5), which has a non-Lipschitz loss. In the following, we use this bound to preserve training set privacy for the dependence scores considered in the previous section.

Note that the bound in Theorem 5 directly implies that the ranking dependence scores have global sensitivity 1 (equal to the size of their ranges). To see this, consider an adversarial situation in which the rank of every element of the residual r'_Y changes when the training set is altered in one element (as all the residual elements may change). This means that the Laplace mechanism cannot guarantee useful privacy. Instead, note that both ranking scores may still have reasonably bounded local sensitivity. Specifically, if we consider the list of sorted residuals, it may be that there are large gaps between neighboring residuals. If this is the case, then changing the training set by one point may not change the residual rankings. Thus, the ranking scores can be stable on such datasets, which motivates the following definition.

Figure 1: Probability of correctly identifying the causal direction on datasets selected from the Cause-Effect Pairs Challenge [12].
Datasets for which the scores perform well were selected in order to isolate the effect of privatization on the scores.
Figure 2: Training set privacy for the HSIC score. The three left-most plots show how λ affects the probability of correctly inferring the causal direction, while the right-most plot depicts this probability when the best λ is selected, over a range of ε values. Definition 4.
We call a function f k-stable on a dataset D if modifying any k elements of D does not change the value of f; specifically, f(D) = f(D*) for all D* such that D can be transformed into D* with at most k element substitutions. We say f is unstable on D if it is not even 1-stable on D.

The distance to instability of a dataset D w.r.t. a function f is the number of elements that must be changed to reach an unstable dataset. With these definitions, we will use a modification of the Propose-Test-Release framework that makes use of this stability, as described in Algorithm 13 of Dwork & Roth [5].
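A minimal sketch of the distance-to-instability release (our own simplification of Algorithm 13 in Dwork & Roth [5]; the function and variable names are ours). The caller supplies a lower bound on the distance to instability, e.g. one computed from the residual stability bound:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Lap(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def propose_test_release(f_value, dist_to_instability, epsilon, delta):
    """Release f(D) exactly when a noisy test certifies that f is stable on D;
    otherwise return None (the null symbol, written ⊥ in the text). Only the
    *test* consumes privacy budget, since the released value is unchanged
    whenever the data is far from instability."""
    noisy_dist = dist_to_instability + laplace_noise(1.0 / epsilon)
    threshold = math.log(1.0 / delta) / epsilon
    if noisy_dist > threshold:
        return f_value
    return None
```

When the distance to instability is well above (log(1/δ) + log(1/β))/ε, the value is released with probability at least 1 − β, mirroring Theorem 6 below.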
Theorem 6. [5] Algorithm 13 of [5] is (ε, δ)-differentially private. Further, for all β > 0, if s(x', r'_Y) is (log(1/δ) + log(1/β))/ε-stable on r'_Y, then Algorithm 13 releases s(x', r'_Y) with probability at least 1 − β.

A lower bound on the distance to instability d is easily given by noting that s(x', r'_Y) always outputs the same result as long as none of the ranks of r'_Y change. Let γ be the smallest absolute distance between any two sorted residuals. Then a lower bound on d is d > ⌊nγλ^{3/2}/16⌋. This is the largest number of training points that may change such that the closest residuals, moving towards each other, do not overlap (given that each changes by at most the amount in eq. (6)). This lower bound is sufficient to use Algorithm 13 of [5] to privatize the ranking dependence scores.

For m ≥ 2, with kernels k, l ≤ 1, where l is L_l-Lipschitz, the HSIC score has a training set sensitivity of

|HSIC-hat_{k,l}(x', r'_Y) − HSIC-hat_{k,l}(x', r̃'_Y)| ≤ R L_l √m / n,

where R = 8/λ^{3/2}. The proof follows directly from Theorem 5 and Lemma 16 in Mooij et al. [23]. Thus, the Laplace mechanism gives us (ε, 0)-differential privacy for the HSIC score with respect to the training set.
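The training-set analysis rests on the kernel ridge regression of eq. (5), whose standard kernel-trick solution is α = (K + nλI)^{-1} y with f̂(x) = Σ_j α_j k(x_j, x). A minimal sketch (our own illustration, with a Gaussian kernel and a hypothetical bandwidth choice, not the authors' code):

```python
import math

def rbf(a, b, bandwidth=0.5):
    """Squared-exponential kernel; bounded above by 1, as assumed in the text."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (A is small and positive
    definite here, so this is well behaved)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def kernel_ridge_fit(xs, ys, lam, bandwidth=0.5):
    """Fit eq. (5) via its dual: alpha = (K + n*lam*I)^{-1} y, so that
    f_hat(x) = sum_j alpha_j k(x_j, x)."""
    n = len(xs)
    A = [[rbf(xs[i], xs[j], bandwidth) + (n * lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(A, list(ys))
    return lambda x: sum(aj * rbf(xj, x, bandwidth) for aj, xj in zip(alpha, xs))

def residuals(f_hat, xs_test, ys_test):
    """Test-set residuals r'_Y = y' - f_hat(x'), as in Algorithm 1."""
    return [y - f_hat(x) for x, y in zip(xs_test, ys_test)]
```

Theorem 5 then bounds how much any single residual can move when one training pair is replaced, which is exactly the stability certificate that propose-test-release consumes.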
Similar to the test-set privacy section, we use propose-test-release to give a useful, private IQR score. In fact, we use the IQR algorithm almost identically, except that we define A_j as the number of training points required to move the IQR out of an interval. Note that a lower bound on A_j is simply the number of points required to move every input less than the median to the left and every input larger than the median to the right (or the reverse of these), using the bound on r in eq. (6). The aforementioned privacy and utility results of the IQR propose-test-release framework apply here. The only difference is that we need to add noise to the IQR scores computed on the residuals, which implies (6ε, δ)-privacy and means that the results of Theorem 4 can be tightened.

Table 2: The non-private accuracies of the ANM model on a subset of the Cause-Effect Pairs Challenge [12], as well as the probability of correct causal inference after privatization. [Columns: dataset ids 4031, 597, 2209, 2967, 161, 2132, 1656, 901, 3484, 1627, with sizes 7713, 7748, 7766, 7771, 7782, 7784, 7803, 7820, 7853, 7862. Rows: Spearman's ρ, Kendall's τ, HSIC [11], and IQR [1], grouped by privacy level ε = ∞ (non-private accuracies), a small ε (value garbled), ε = 1, and ε = 2; the numeric entries are not recoverable from this extraction.]

We test our methods for private release of causal-inference statistics on small subsets of the Cause-Effect Pairs Competition collection of Guyon [12]. Specifically, we randomly select 10 of the largest 25 datasets that have a causal direction of either X → Y or Y → X. We average over 10 random 50/50 train/test splits of the data. Table 2 shows the non-private accuracy of the four dependence scores over these datasets, along with how the probability of correct causal inference changes as these scores are made private w.r.t. the test set. Note that these scores are often complementary, with the ranking-based scores performing well on datasets on which HSIC does worse, and vice versa.

Figure 1 shows the effect of privatization on the dependence scores HSIC and IQR. Note that, for low ε (increased privacy), the probability of correct inference is lower, as the amount of noise required blurs the true dependence scores. However, as ε increases, so does this probability, in some cases drastically. For the IQR score, recall that there is a probability that the algorithm returns null (⊥) if R_j is less than a threshold controlled by δ. We investigated this probability by varying δ over several orders of magnitude and sampling 10,
000 points from the appropriate Laplace distribution. We found that, for the IQR dataset in Figure 1, no sample moved R_j below the null threshold; the probability of null is therefore essentially 0.

The three left-most plots in Figure 2 demonstrate how λ, which has a large effect on the training-set sensitivity (as described in eq. (6)), affects the probability of correct inference. We perform this experiment for different settings of ε, and each one produces a distinctive 'hump' shape. This is because for small λ the sensitivity bound (6) is too large to produce meaningful causal inference. Similarly, for large λ the kernelized regression algorithm (5) is overly regularized, which produces a poor regressor and poor dependence scores. Only when λ is within a certain range do we balance the size of the sensitivity bound against the strength of the regularization. This range grows larger as ε increases and the privacy setting becomes less strict (requiring less noise). The right-most plot shows the correct-inference probability using the best λ for a range of ε; by tuning λ, we can achieve high-quality causal inference that maintains privacy w.r.t. the training set.

We have presented, to the best of our knowledge, the first work towards differentially private causal inference. There are numerous directions for future work, including privatizing other causal inference frameworks (e.g., IGCI [16]), analyzing the ANM algorithm without train/test splits, and considering other dependence scores. As there is significant overlap between the applications of causal inference and private learning, we believe this work constitutes an important step towards making causal inference practical.
Acknowledgments
KQW and MJK are supported by NSF grants IIA-1355406, IIS-1149882, and EFRI-1137211. We thank the anonymous reviewers for their useful comments.
References

[1] Bühlmann, Peter, Peters, Jonas, Ernest, Jan, et al. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.

[2] Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. JMLR, 12:1069–1109, 2011.

[3] Dinur, Irit and Nissim, Kobbi. Revealing information while preserving privacy. In Proceedings of the SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202–210. ACM, 2003.

[4] Dwork, Cynthia and Lei, Jing. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 371–380. ACM, 2009.

[5] Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2013.

[6] Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265–284. Springer, 2006.

[7] Centers for Disease Control and Prevention. How tobacco smoke causes disease: The biology and behavioral basis for smoking-attributable disease: A report of the Surgeon General. Centers for Disease Control and Prevention (US), 2010.

[8] Friedman, Nir and Nachman, Iftach. Gaussian process networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 211–219. Morgan Kaufmann Publishers Inc., 2000.

[9] Geiger, Philipp, Janzing, Dominik, and Schölkopf, Bernhard. Estimating causal effects by bounding confounding. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 240–249, 2014.

[10] Geiger, Philipp, Zhang, Kun, Schölkopf, Bernhard, Gong, Mingming, and Janzing, Dominik. Causal inference by identification of vector autoregressive processes with hidden components. In ICML, pp. 1917–1925, 2015.

[11] Gretton, Arthur, Bousquet, Olivier, Smola, Alex, and Schölkopf, Bernhard. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pp. 63–77. Springer, 2005.

[12] Guyon, I. Cause-effect pairs Kaggle competition, 2013.

[13] Hoyer, Patrik O, Janzing, Dominik, Mooij, Joris M, Peters, Jonas, and Schölkopf, Bernhard. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pp. 689–696, 2009.

[14] Jain, Prateek and Thakurta, Abhradeep. Differentially private learning with kernels. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 118–126, 2013.

[15] Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep. Differentially private online learning. COLT, 2012.

[16] Janzing, Dominik, Mooij, Joris, Zhang, Kun, Lemeire, Jan, Zscheischler, Jakob, Daniušis, Povilas, Steudel, Bastian, and Schölkopf, Bernhard. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31, 2012.

[17] Kano, Yutaka and Shimizu, Shohei. Causal inference using nonnormality. In Proceedings of the International Symposium on Science of Modeling, the 30th Anniversary of the Information Criterion, pp. 261–270, 2003.

[18] Kpotufe, Samory, Sgouritsa, Eleni, Janzing, Dominik, and Schölkopf, Bernhard. Consistency of causal inference under the additive noise model. In ICML, 2014.

[19] Lopez-Paz, David, Muandet, Krikamol, Schölkopf, Bernhard, and Tolstikhin, Iliya. Towards a learning theory of cause-effect inference. In ICML, 2015.

[20] McSherry, Frank and Talwar, Kunal. Mechanism design via differential privacy. In FOCS, pp. 94–103. IEEE, 2007.

[21] Mooij, Joris M, Stegle, Oliver, Janzing, Dominik, Zhang, Kun, and Schölkopf, Bernhard. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems, pp. 1687–1695, 2010.

[22] Mooij, Joris M, Janzing, Dominik, Heskes, Tom, and Schölkopf, Bernhard. On causal discovery with cyclic additive noise models. In Advances in Neural Information Processing Systems, pp. 639–647, 2011.

[23] Mooij, Joris M, Peters, Jonas, Janzing, Dominik, Zscheischler, Jakob, and Schölkopf, Bernhard. Distinguishing cause from effect using observational data: methods and benchmarks. arXiv preprint arXiv:1412.3773, 2014.

[24] Nissim, Kobbi, Raskhodnikova, Sofya, and Smith, Adam. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pp. 75–84. ACM, 2007.

[25] Pearl, Judea. Causality: Models, Reasoning, and Inference. 2000.

[26] Peters, Jonas, Mooij, Joris M, Janzing, Dominik, and Schölkopf, Bernhard. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014.

[27] Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. 2006.

[28] Reichenbach, Hans and Reichenbach, Maria. The Direction of Time. Univ of California Press, 1956.

[29] Sgouritsa, Eleni, Janzing, Dominik, Hennig, Philipp, and Schölkopf, Bernhard. Inference of cause and effect with unsupervised inverse regression. In AISTATS, pp. 847–855, 2015.

[30] Shajarisales, Naji, Janzing, Dominik, Schölkopf, Bernhard, and Besserve, Michel. Telling cause from effect in deterministic linear dynamical systems. In ICML, 2015.

[31] Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Stochastic convex optimization. In COLT, 2009.

[32] Spirtes, Peter, Glymour, Clark N, and Scheines, Richard. Causation, Prediction, and Search, volume 81. MIT Press, 2000.

[33] Sun, Xiaohai, Janzing, Dominik, and Schölkopf, Bernhard. Causal reasoning by evaluating the complexity of conditional densities with kernel methods. Neurocomputing, 71(7):1248–1256, 2008.

[34] Talwar, Kunal, Thakurta, Abhradeep, and Zhang, Li. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.

[35] Zhang, Kun and Hyvärinen, Aapo. On the identifiability of the post-nonlinear causal model. In