Private Causal Inference
Matt J. Kusner Yu Sun Karthik Sridharan Kilian Q. Weinberger
Washington University in St. Louis [email protected]
Cornell University [email protected]
Cornell University [email protected]
Cornell University [email protected]
Abstract
Causal inference deals with identifying which random variables "cause" or control other random variables. Recent advances on the topic of causal inference based on tools from statistical estimation and machine learning have resulted in practical algorithms for causal inference. Causal inference has the potential to have significant impact on medical research, prevention and control of diseases, and identifying factors that impact economic changes, to name just a few. However, these promising applications for causal inference often involve sensitive or personal data of users that need to be kept private (e.g., medical records, personal finances, etc.). Therefore, there is a need for the development of causal inference methods that preserve data privacy. We study the problem of inferring causality using the current, popular causal inference framework, the additive noise model (ANM), while simultaneously ensuring privacy of the users. Our framework provides differential privacy guarantees for a variety of ANM variants. We run extensive experiments, and demonstrate that our techniques are practical and easy to implement.
Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.

Causal identification allows one to reason about how manipulations of certain random variables (the causes) affect the outcomes of others (the effects). Uncovering these causal structures has implications ranging from creating government policies to informing health-care practices. Causal inference was motivated by the impossibility of randomized intervention experiments in many cases, and by the ambiguity of conditional independence testing [32, 25]. In the absence of interventions, it attempts to discover the underlying causal relationships of a set of random variables entirely based on samples from their joint distribution. The field of causal inference is now a mature research area, covering learning topics as diverse as supervised batch inference [19, 23, 26], time-series causal prediction [10], and linear dynamical systems [30]. Many inference methods require only a regression technique and a way to compute the independence between two distributions given samples [13, 16].

One would hope that researchers could publicly release their causal inference findings to inform individuals and policy makers. One of the primary roadblocks to doing so is that causal inference is often performed on data that individuals may wish to keep private, such as data in the fields of medical diagnosis, fraud detection, and risk analysis. Currently, no causal inference method has formal guarantees about the privacy of individual data, which may be inferred via attacks such as reconstruction attacks [3].

Arguably one of the best notions of privacy is differential privacy, introduced by Dwork et al. [6] and since used throughout machine learning [4, 15, 20, 2, 5]. Differential privacy guarantees that the outcome of an algorithm only reveals aggregate information about the entire dataset and never about an individual. An individual who is considering whether to participate in a study can be reassured that, with extremely high probability, his/her personal information cannot be recovered.

To our knowledge, this paper is the first to investigate private causal inference.
We show that it is possible to privately release the quantities produced by the highly successful additive noise model (ANM) framework by adding small amounts of noise, as dictated by differential privacy. Furthermore, these private quantities, with high probability, do not change the causal inference result, so long as it is confident enough. We demonstrate on a set of real-world causal inference datasets how our privacy-preserving methods can be readily and usefully applied.

Discovering the causal nature between random events has captivated researchers and philosophers long before the formal development of statistics. This interest was formalized by Reichenbach & Reichenbach [28], who argued that all statistical correlations in data arise from underlying causal structures between the concerned random variables. For example, the correlation between smoking and lung cancer was found to arise from a direct causal link [7].

One of the most popular causal inference alternatives to conditional independence testing is the Additive Noise Model (ANM) approach developed by Hoyer et al. [13] and used in many recent works [35, 21, 18, 1]. ANMs, originally designed for inferring whether X → Y or Y → X and later extended to large numbers of random variables, work under the assumption that the effect is a non-linear function of the cause plus independent noise. ANMs are one of many proposed causal inference methods in recent literature [16, 9, 19, 29].

Work by Spirtes et al. [32] and Pearl [25] shows how to determine whether X → Y when these variables are part of a larger 'causal network', via conditional independence testing. One downside to conditional-independence-based approaches is that they inherently cannot distinguish between Markov-equivalent graphs. Thus it may be possible that a certain set of conditional independences implies both X → Y and Y → X.
Furthermore, if X and Y are the only variables in the causal network, there is no conditional independence test to determine whether X → Y or Y → X.

Our aim is to protect the privacy of individuals who submit personal information about two random variables of interest X and Y. Their information should remain private when it is used to infer whether X causes Y (X → Y), or Y causes X (Y → X), using the ANM framework. This personal information comes in the form of i.i.d. samples {(x_i, y_i)}_{i=1}^n from the joint distribution P_{X,Y}. We will assume that: 1. there is no confounding variable Z that commonly causes, or is a common effect of, X and Y; 2. X and Y do not simultaneously cause each other.

Deciding on the causal direction between two variables X and Y from a finite sample set has motivated an array of research [8, 17, 33, 13, 35, 22, 16, 18, 19]. Perhaps one of the most popular results is the Additive Noise Model (ANM) proposed by Hoyer et al. [13]. The ANM framework assumption is defined as follows. Definition 1.
Two random variables X, Y with joint density p(x, y) are said to 'satisfy an ANM' X → Y if there exists a non-linear function f : R → R and a random noise variable N_Y, independent from X (i.e., X ⊥⊥ N_Y), such that Y = f(X) + N_Y.

As defined, an ANM X → Y implies a functional relationship mapping X to Y, alongside independent noise. In order for this model to be useful for causal inference, we would like the induced joint distribution P_{X,Y} for this ANM to be somehow identifiably different from the one induced by the ANM Y → X. If so, we say that the causal direction is identifiable [23]. If not, we have no hope of recovering the causal direction purely from samples under the ANM.

Hoyer et al. [13] showed that ANMs are generically identifiable from i.i.d. samples from P_{X,Y} (except for a few special cases of non-linear functions f and noise distributions). The intuition is as follows: for the X → Y ANM, for most non-linear f and (for simplicity) zero-mean N_Y, the density p(y | x) has mean f(x), with distribution given by N_Y. This implies that p(y − f(x) | x) has distribution N_Y, which is independent of X. However, p(x − f^{-1}(y) | y) is, for many choices of f and N_Y, not independent of y. Algorithm 1
ANM Causal Inference [23].
Input: train/test data {x_i, y_i}_{i=1}^n, {x'_i, y'_i}_{i=1}^m
1. Regress on the training data to yield f̂, ĝ such that f̂(x_i) ≈ y_i and ĝ(y_i) ≈ x_i for all i.
2. Compute residuals on the test data: r'_Y := y' − f̂(x'), r'_X := x' − ĝ(y').
3. Calculate dependence scores: s_{X→Y} := s(x', r'_Y), s_{Y→X} := s(y', r'_X).
Return: s_{X→Y}, s_{Y→X}, and D, where D = X → Y if s_{X→Y} < s_{Y→X}, and D = Y → X if s_{X→Y} > s_{Y→X}.

Mooij et al. [23] give a practical algorithm for determining the causal relationship between X and Y (i.e., either X → Y or Y → X), as shown in Algorithm 1. The first step is to partition the i.i.d. samples into a training and a testing set. We use the training set to train the regression functions f̂ : X → Y and ĝ : Y → X. We use the testing set to compute the residuals r'_Y = y' − f̂(x') and r'_X := x' − ĝ(y'). If we have an ANM X → Y, then the residual r'_Y is an estimate of the noise N_Y, which is assumed to be independent of X. Therefore, we calculate the dependence between the residual r'_Y and the input x', s_{X→Y} := s(x', r'_Y), as well as s_{Y→X} := s(y', r'_X), using a dependence score s(·,·). If s_{X→Y} is less than s_{Y→X}, then we declare X → Y; otherwise Y → X.

Crucially, the ANM approach hinges on the choice of dependence score s(·,·). There have been many proposals, and we give a quick review of the most popular methods (for a detailed review see Mooij et al. [23]).

Spearman's ρ is a rank correlation coefficient that describes the extent to which one random variable is a monotonic function of the other. Specifically, imagine independently sorting the observations {a_1, . . . , a_m} and {b_1, . . .
, b_m} by value in increasing order. Let o^a_i be the rank of a_i in the a-ordering and, similarly, o^b_i for b_i in the b-ordering. Then Spearman's ρ score is

s(a, b) := | 1 − 6 Σ_{i=1}^m d_i² / (m(m² − 1)) |,

where d_i := o^a_i − o^b_i are the rank differences for a, b.

Kendall's τ. Similar to Spearman's ρ, the Kendall τ rank score calls a pair of indices (i, j) concordant if a_i > a_j and b_i > b_j, or a_i < a_j and b_i < b_j. Otherwise (i, j) is called discordant. Then the dependence score is defined as

s(a, b) := |C − D| / (m(m − 1)/2),

where C is the number of concordant pairs and D is the number of discordant pairs. HSIC Score.
The first proposed score for ANM causal inference is based on the Hilbert-Schmidt Independence Criterion (HSIC) [11], which was used by Hoyer et al. [13]. They compute an estimate of the p-value of the HSIC under the null hypothesis of independence, selecting the causal direction with the lower p-value. Alternatively, one can use an estimator of the HSIC value itself:

s(a, b) := HSIC-hat_{k_θ(a), k_θ(b)}(a, b),   (1)

where k_θ is a kernel with parameters θ. Mooij et al. [23] show that, under certain assumptions, Algorithm 1 with the HSIC dependence score is consistent for estimating the causal direction in an ANM. Variance Score.
When the noise variables in the ANM are Gaussian, Bühlmann et al. [1] proposed the variance score, defined as s(a, b) := log V(a) + log V(b). Changes to a single input value can induce arbitrarily large changes to this score, which makes the variance score ill suited to preserving differential privacy. IQR Score.
We introduce a robust version of this score by replacing the variance of the random variables with their interquartile range (IQR). The IQR is the difference between the third and first quartiles of the distribution and can be estimated empirically. We define the following IQR-based score:

s(a, b) := log IQR(a) + log IQR(b).   (2)

We assume that the data set D = {(x_i, y_i)} contains sensitive data that should not be inferable from the release of the dependence scores. One of the most widely accepted mechanisms for private data release is differential privacy [6]. In a nutshell, it ensures that the released scores can only be used to infer aggregate information about the data set, and never about an individual datum (x_i, y_i).

Let us define the Hamming distance d_H(D, D̃) between two data sets D and D̃ as the number of elements in which the two sets differ. If a data set D is changed to a data set D̃ with d_H(D, D̃) ≤ 1, we call D and D̃ neighboring. Definition 2.
A randomized algorithm A is (ε, δ)-differentially private for ε, δ ≥ 0 if for all O ∈ Range(A) and for all neighboring datasets D, D̃ with d_H(D, D̃) ≤ 1 we have that

Pr[A(D) = O] ≤ e^ε Pr[A(D̃) = O] + δ.   (3)

One of the most popular methods for making an algorithm (ε, 0)-differentially private is the Laplace mechanism. It is based on the global sensitivity Δ_A, describing how much A can change when D changes:

Δ_A := max_{D, D̃ ⊆ X s.t. d_H(D, D̃) ≤ 1} |A(D) − A(D̃)|.

The Laplace mechanism hides the output of A with a small amount of additive random noise, large enough to hide the impact of any single datum (x_i, y_i). Definition 3.
Given a dataset D and an algorithm A, the Laplace mechanism returns A(D) + ω, where ω is a noise variable drawn from Lap(0, Δ_A/ε), the Laplace distribution with scale parameter Δ_A/ε.

It may be that the global sensitivity of an algorithm A is unbounded in general, but can be bounded in the context of a specific data set D over all of its neighbors D̃. For such datasets we can bound the local sensitivity

Δ(D)_A := max_{D̃ ⊆ X s.t. d_H(D, D̃) ≤ 1} |A(D) − A(D̃)|.

Table 1: Dependence scores and their privacy. A checkmark indicates that there exist meaningful bounds on either the global or local sensitivity.

                    Test sensitivity     Training sensitivity
  Score             Global    Local      Global    Local
  Spearman's ρ        ✓         ✓          –         ✓
  Kendall's τ         ✓         ✓          –         ✓
  HSIC                ✓         ✓          ✓         ✓
  IQR                 –         ✓          –         ✓

If an algorithm has bounded global sensitivity, it certainly has bounded local sensitivity. Nissim et al. [24], Dwork & Lei [4], and Jain & Thakurta [14] show how to use the local sensitivity to cleverly produce private quantities for datasets with bounded local sensitivity.
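Definition 3 is simple to implement. The sketch below is our own minimal illustration (the function names are ours, and the sampler uses the standard inverse-CDF trick), applied to a toy statistic whose global sensitivity is known, rather than to the dependence scores themselves:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Lap(0, scale) via the inverse CDF of a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(value, sensitivity, epsilon):
    """Release value + Lap(0, sensitivity/epsilon). This is (epsilon, 0)-
    differentially private for any statistic whose global sensitivity is
    at most `sensitivity` (Definition 3)."""
    return value + laplace_noise(sensitivity / epsilon)

# Toy example: privately release the mean of n values in [0, 1].
# Changing one datum moves the mean by at most 1/n, its global sensitivity.
data = [0.2, 0.9, 0.4, 0.7]
private_mean = laplace_mechanism(sum(data) / len(data), 1.0 / len(data), epsilon=1.0)
```

For the dependence scores, `sensitivity` would be instantiated with the test-set bounds derived below (e.g., Theorem 1's bounds for the rank scores).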
The data is partitioned into a training and a test set, which are used in different ways. We therefore introduce mechanisms that preserve training and test set privacy respectively, which can be used jointly. Specifically, we show how to privatize the dependence scores s_{X→Y}, s_{Y→X}. The reasons for this are three-fold: 1. Privatizing the dependence scores immediately privatizes the causal direction D, because operations on differentially private outputs preserve privacy (so long as they do not touch the data). 2. Releasing the scores indicates how confident the ANM method is about the causal direction, which is absent from the binary output D. 3. It is unclear which dependence score is best for a particular dataset, so we privatize multiple scores and leave this choice to the practitioner. In this section we begin with test set privacy; we describe training set privacy in Section 5. Table 1 gives an overview of the test and training set privacy results for the dependence scores that we consider.

Let (x', y') be the initial test data and (x̃', ỹ') be the test data after a single change in the dataset. Let x̃' = [x'_1, . . . , x'_{k−1}, x̃'_k, x'_{k+1}, . . . , x'_m]^T, and similarly for ỹ', so that this single change occurs at some index k. The key to preserving privacy is to show that the selected dependence score s(·,·) can be privatized. We show that if our dependence score is a rank correlation coefficient (Spearman's ρ, Kendall's τ) or the HSIC score [11], we can readily bound its test set global sensitivity when applied to (x', y') versus (x̃', ỹ'). As the IQR score has bounded test set local sensitivity, we can instead apply the algorithm of Dwork & Lei [4] for privacy. We first demonstrate global sensitivity for the two rank correlation scores in Section 3.
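To fix concrete form for what follows, here is a minimal sketch (our own code, not the authors') of the two rank-correlation scores and the decision rule of Algorithm 1; `a` plays the role of the test inputs x' and `b` the residuals r'_Y:

```python
def ranks(values):
    """0-based rank of each element when sorted in increasing order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman_score(a, b):
    """|Spearman's rho|: s(a, b) = |1 - 6 * sum(d_i^2) / (m (m^2 - 1))|."""
    m = len(a)
    d = [ra - rb for ra, rb in zip(ranks(a), ranks(b))]
    return abs(1.0 - 6.0 * sum(di * di for di in d) / (m * (m * m - 1)))

def kendall_score(a, b):
    """|Kendall's tau|: |C - D| divided by the number of pairs, m(m-1)/2."""
    m = len(a)
    concordant = discordant = 0
    for i in range(m):
        for j in range(i + 1, m):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return abs(concordant - discordant) / (m * (m - 1) / 2.0)

def anm_decision(s_x_to_y, s_y_to_x):
    """Algorithm 1's final step: the direction whose residual looks *less*
    dependent on the putative cause wins."""
    return "X->Y" if s_x_to_y < s_y_to_x else "Y->X"
```

A low score indicates that the residual looks independent of the putative cause, which is exactly the ANM criterion.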
Theorem 1.
The rank correlation coefficients have the following global sensitivities:

1. Let ρ(·,·) be the Spearman's ρ score; then |ρ(x', r'_Y) − ρ(x̃', r̃'_Y)| ≤ 30/m.

2. Let τ(·,·) be the Kendall's τ score; then |τ(x', r'_Y) − τ(x̃', r̃'_Y)| ≤ 8/m. Proof.
Our goal is to bound the global sensitivity |s(x', r'_Y) − s(x̃', r̃'_Y)| for both scores. For Spearman's ρ, suppose the change is to a_k and b_k. It is easy to verify that: 1) d_i changes by at most 2 for i ≠ k; 2) d_k changes by at most m − 1; 3) |d_i| ≤ m − 1. Since d̃_i² − d_i² = (d̃_i − d_i)(d̃_i + d_i) ≤ 2 · 2(m − 1) = 4(m − 1) for i ≠ k, the maximum change inside the summation is upper bounded by 4(m − 1)² + (m − 1)² = 5(m − 1)². Therefore the global sensitivity of ρ is bounded by 6 · 5(m − 1)²/(m(m² − 1)) = 30(m − 1)/(m(m + 1)) ≤ 30/m.

For Kendall's τ, we can affect at most m − 1 pairs by moving a single element of x', and likewise at most m − 1 pairs by moving a single element of r'_Y (either from concordant pairs to discordant pairs, or vice versa). Each affected pair changes C − D by at most 2, so the global sensitivity of Kendall's τ is |s(x', r'_Y) − s(x̃', r̃'_Y)| ≤ 4(m − 1) / (m(m − 1)/2) = 8/m.

The bound on the global sensitivity Δ of our scores enables us to apply the Laplace mechanism [6] to produce (2ε, 0)-differentially private scores p_{X→Y}, p_{Y→X}. Specifically, we add Laplace noise Lap(0, Δ/ε) to our Spearman's ρ and Kendall's τ scores to preserve privacy w.r.t. the test set. Moreover, as a general property of differential privacy, we can compute any functions of these private scores and, so long as they do not touch the data, the outputs of these functions are also private. This means that we can privately compute the inequality p_{X→Y} < p_{Y→X} to decide whether X causes Y or vice versa.

An important consideration is to what degree the addition of noise affects the true decision s_{X→Y} < s_{Y→X}. Importantly, we can prove that, in certain cases, the Laplace noise required by the mechanism is small enough not to change the direction of causal inference. These are the cases in which there is a large 'margin' between the scores s_{X→Y} and s_{Y→X}. So long as this margin is large enough and in the correct order, the addition of Laplace noise has no effect on the inference, with high probability. Theorem 2.
Given two random variables X, Y which have, w.l.o.g., the causal relationship X → Y, assume that they produce correctly-ordered scores s_{X→Y} < s_{Y→X}. If the margin s_{Y→X} − s_{X→Y} is sufficiently large relative to the Laplace noise scale Δ/ε, then with high probability the privatized scores preserve this ordering, and the released causal direction is unchanged.

Recall the HSIC dependence score of eq. (1). Given kernels k and l, its empirical estimator on the test set is

HSIC-hat_{k,l}(x', r'_Y) := trace(KHLH)/(m − 1)²,   (4)

where K_ij := k(x'_i, x'_j), L_ij := l(r'_{Y,i}, r'_{Y,j}), and H_ij := δ{i = j} − 1/m. We assume k, l are bounded above by 1 (e.g., the squared exponential kernel, the Matérn kernel [27]). Our goal is to show that when we replace (x', y') with (x̃', ỹ'), the global sensitivity is small. Specifically, we prove the following theorem. Theorem 3.
The score in eq. (4) has a global sensitivity of at most 8(2m − 1)/(m − 1)². Specifically,

|HSIC-hat_{k,l}(x', r'_Y) − HSIC-hat_{k,l}(x̃', r̃'_Y)| ≤ 8(2m − 1)/(m − 1)². Proof.
For simplicity, define H(·,·) := HSIC-hat_{k,l}(·,·). Note that, as the trace is cyclic, trace(KHLH) = trace(HKHL). Further, let K̃, L̃ be the kernel matrices defined on the modified data (x̃', ỹ'). Then, as the data enters purely through the kernel matrices and the trace is Lipschitz w.r.t. these matrices, we can apply the triangle inequality to yield

|H(x', r'_Y) − H(x̃', r̃'_Y)| ≤ (||HLH||_∞ ||K − K̃||_1 + ||HKH||_∞ ||L − L̃||_1) / (m − 1)².

To bound the infinity norms, let L̄ = HLH; then

|L̄_ij| = | L_ij − (1/m) Σ_{a=1}^m L_aj − (1/m) Σ_{b=1}^m L_ib + (1/m²) Σ_{a,b=1}^m L_ab | ≤ 4,

since every kernel entry is at most 1 (and similarly for HKH). Finally, note that as there is only a single-element difference between (x', r'_Y) and (x̃', r̃'_Y), only one row and one column of K change, so ||K − K̃||_1 ≤ 2m − 1 (and similarly for L, L̃), which yields the bound. In fact, this bound can be improved by a constant factor using trace identities; we leave the proof of this to the appendix.

Given this global sensitivity bound, we can use Theorem 2 to guarantee that, under certain conditions, the Laplace mechanism w.h.p. does not change the direction of causal inference.

Unfortunately, the IQR does not have a bounded global sensitivity, as there exist datasets for which the IQR can change by an unbounded amount. Instead, Dwork & Lei [4] offer an efficient technique to privately release the IQR; we give a slightly modified version of their algorithm in the appendix. First, the algorithm defines two intervals B_1 and B_2 which both contain IQR(X). If the IQR were to be pushed out of both of these intervals, it would imply that the IQR changed by a factor of e. Therefore we loop over both intervals and calculate the number of points A_j that an adversary would need to change to push the IQR out of B_1 or B_2. Note that A_j is itself a data-sensitive query and so, to preserve the privacy of this query, we can add Laplace noise to it.
Then, if one of these noisy estimates R_j = A_j + z, where z ∼ Lap(0, 1/ε), is larger than some threshold, it implies with high probability (exactly 1 − δ) that IQR(X) has a multiplicative sensitivity of at most e for the specific dataset X. Note that this is precisely the local sensitivity as defined in Section 3, as it is specific to X. This means that we can add Laplace noise z to log IQR(X). If neither of the R_j is above the threshold, then the algorithm returns null: ⊥. This algorithm was shown to be (3ε, δ)-differentially private.

In our case we would like to release four private IQR scores. Note that we must look at x' three separate times: for IQR(x'), IQR(r'_Y), and IQR(r'_X) (and likewise three times for y'). Therefore, for both x' and y', we are composing three differentially private outputs. Under simple composition this would lead to (9ε, 3δ)-differential privacy for both x' and y'. However, we can make use of Corollary 3.21 in Dwork & Roth [5] to give (ε', 3δ + δ')-differential privacy, for 0 < ε' < 1 and δ' > 0, over the three repeated mechanisms, by ensuring each private mechanism is (3ε, δ)-private, where 3ε = ε'/(2√(6 ln(1/δ'))).

The remaining question is whether this noise addition causes one to infer the incorrect causal direction. Again, as long as there is a significant margin between the scores, we can preserve the correct causal inference with high probability, as follows. Theorem 4.

Let Q_{x'} = log IQR(x'), and similarly for Q_{y'}, Q_{r'_X}, Q_{r'_Y}, be the true log-IQR scores. As well, let P_{x'}, P_{y'}, P_{r'_X}, P_{r'_Y} be the private versions, multiplied by e^z noise, where z ∼ Lap(0, 1/ε). Then the following results hold:

1. [4] If the number of data points A_j needed to significantly change the IQR is below the release threshold, then the probability that any one of the private IQRs P_* is released is small:

P[P_* ≠ ⊥ | A_1 or A_2 below threshold] ≤ δ.
2. If all private log-IQR scores are released, and the relationship between the true scores Q_{x'} + Q_{r'_Y} < Q_{y'} + Q_{r'_X} holds (which implies X → Y), then the probability that we make the correct causal inference from the private scores is large:

P[P_{x'} + P_{r'_Y} < P_{y'} + P_{r'_X}] = 1 − (e^{−γ/σ} / (96σ³)) (48σ³ + 33σ²γ + 9σγ² + γ³),

where γ = (Q_{y'} + Q_{r'_X}) − (Q_{x'} + Q_{r'_Y}) and σ = 1/ε.

The proof of these results is in the appendix. The first result says that the probability that we release an IQR score just because too much noise was added to A_j is small. The second result says that with high probability we recover the true causal direction, depending on the size of the dataset.

Let (x, y) be the initial training data and (x̃, ỹ) be the training data after a change in the dataset. Note that x and x̃ differ in at most one element (similarly for y and ỹ). The length of both training datasets is n. From Algorithm 1, the only way the training set can affect the dependency scores s_{X→Y}, s_{Y→X} is through the regression functions f̂, ĝ used to compute the test set residuals r'_Y, r'_X. We use kernel ridge regression, so the functions f̂ (and ĝ) can be written in the form f̂(w, x) = w^T φ(x), where φ(x) is a (possibly infinite-dimensional) feature mapping into the Hilbert space corresponding to the kernel function used. Similar to other work on private regression [34], we assume that |x_i|, |y_i| ≤ 1. The ridge regression problem can now be written as

w* = argmin_{w ∈ H} λ||w||²_H + (1/n) Σ_{i=1}^n (w^T φ(x_i) − y_i)²,   (5)

where H is the corresponding Hilbert space. Practically speaking, even though w may be infinite-dimensional, because it always appears in an inner product with the feature mapping φ(x), we can utilize the 'kernel trick' k(x_i, x_j) = φ(x_i)^T φ(x_j) to avoid representing w explicitly.

Let f̂(w*, ·) and f̂(w̃*, ·) be the regressors resulting from the optimization problem in eq. (5) when trained on (x, y) and (x̃, ỹ), respectively (and similarly for ĝ). We show that the residuals in Algorithm 1 are bounded. Theorem 5.
Say λ ≤ 1. Given that the regressors f̂(w*, ·), f̂(w̃*, ·) are the result of the optimization problem in eq. (5), the residuals r'_Y, r̃'_Y of these functions are bounded as

|r'_{i,Y} − r̃'_{i,Y}| ≤ 8/(nλ^{3/2})   (6)

for all i = 1, . . . , m, where r'_{i,Y}, r̃'_{i,Y} are the i-th elements of r'_Y, r̃'_Y and m is the size of the test set.

This bound holds equally for r'_X, r̃'_X. The proof of the above is inspired by the work of Shalev-Shwartz et al. [31] and Jain & Thakurta [14]; we place it in the appendix for the interested reader. As far as we are aware, this is the tightest such bound for the optimization problem in eq. (5), which has a non-Lipschitz loss. In the following, we use this bound to preserve training set privacy for the dependence scores considered in the previous section.

Note that the bound in Theorem 5 directly implies that the ranking dependence scores have global sensitivity 1 (equal to the size of their ranges). To see this, consider an adversarial situation in which the rank of every element of the residual r'_Y changes when the training set is altered in one element (as all the residual elements may change). This means that the Laplace mechanism cannot guarantee useful privacy. Instead, note that both ranking scores may still have reasonably bounded local sensitivity. Specifically, if we consider the list of sorted residuals, it may be that there are large gaps between neighboring residuals. If this is the case, then changing the training set by one point may not change the residual rankings. Thus, the ranking scores can be stable on such datasets, which motivates the following definition.

Figure 1: Probability of correctly identifying the causal direction on datasets selected from the Cause-Effect Pairs Challenge [12].
Datasets for which the scores perform well were selected in order to isolate the effect of privatization on the scores.
Figure 2: Training set privacy for the HSIC score. The three left-most plots show how λ affects the probability of correctly inferring the causal direction, while the right-most plot depicts this probability when the best λ is selected, over a range of ε values. Definition 4.
We call a function f k-stable on a dataset D if modifying any k elements of D does not change the value of f; specifically, f(D) = f(D*) for all D* such that D can be transformed into D* with at most k element substitutions. We say f is unstable on D if it is not even 1-stable on D.

The distance to instability of a dataset D w.r.t. a function f is the number of elements that must be changed to reach an unstable dataset. With these definitions, we will use a modification of the Propose-Test-Release framework that makes use of this stability, as described in Algorithm 13 of Dwork & Roth [5].
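A minimal sketch of the distance-to-instability release (our own simplification of Algorithm 13 in Dwork & Roth [5]; the function and variable names are ours). The caller supplies a lower bound on the distance to instability, e.g. one computed from the residual stability bound:

```python
import math
import random

def laplace_noise(scale):
    """Sample from Lap(0, scale)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def propose_test_release(f_value, dist_to_instability, epsilon, delta):
    """Release f(D) exactly when a noisy test certifies that f is stable on D;
    otherwise return None (the null symbol, written ⊥ in the text). Only the
    *test* consumes privacy budget, since the released value is unchanged
    whenever the data is far from instability."""
    noisy_dist = dist_to_instability + laplace_noise(1.0 / epsilon)
    threshold = math.log(1.0 / delta) / epsilon
    if noisy_dist > threshold:
        return f_value
    return None
```

When the distance to instability is well above (log(1/δ) + log(1/β))/ε, the value is released with probability at least 1 − β, mirroring Theorem 6 below.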
Theorem 6. [5] Algorithm 13 of [5] is (ε, δ)-differentially private. Further, for all β > 0, if s(x', r'_Y) is (log(1/δ) + log(1/β))/ε-stable on r'_Y, then Algorithm 13 releases s(x', r'_Y) with probability at least 1 − β.

A lower bound on the distance to instability d is easily given by noting that s(x', r'_Y) always outputs the same result as long as none of the ranks of r'_Y change. Let γ be the smallest absolute distance between any two sorted residuals. Then a lower bound on d is d > ⌊nγλ^{3/2}/16⌋. This is the largest number of training points that may change such that the closest residuals, moving towards each other, do not overlap (given that each changes by at most the amount in eq. (6)). This lower bound is sufficient to use Algorithm 13 of [5] to privatize the ranking dependence scores.

For m ≥ 2, with kernels k, l ≤ 1, where l is L_l-Lipschitz, the HSIC score has a training set sensitivity of

|HSIC-hat_{k,l}(x', r'_Y) − HSIC-hat_{k,l}(x', r̃'_Y)| ≤ R L_l √m / n,

where R = 8/λ^{3/2}. The proof follows directly from Theorem 5 and Lemma 16 in Mooij et al. [23]. Thus, the Laplace mechanism gives us (ε, 0)-differential privacy for the HSIC score with respect to the training set.
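The training-set analysis rests on the kernel ridge regression of eq. (5), whose standard kernel-trick solution is α = (K + nλI)^{-1} y with f̂(x) = Σ_j α_j k(x_j, x). A minimal sketch (our own illustration, with a Gaussian kernel and a hypothetical bandwidth choice, not the authors' code):

```python
import math

def rbf(a, b, bandwidth=0.5):
    """Squared-exponential kernel; bounded above by 1, as assumed in the text."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (A is small and positive
    definite here, so this is well behaved)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def kernel_ridge_fit(xs, ys, lam, bandwidth=0.5):
    """Fit eq. (5) via its dual: alpha = (K + n*lam*I)^{-1} y, so that
    f_hat(x) = sum_j alpha_j k(x_j, x)."""
    n = len(xs)
    A = [[rbf(xs[i], xs[j], bandwidth) + (n * lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(A, list(ys))
    return lambda x: sum(aj * rbf(xj, x, bandwidth) for aj, xj in zip(alpha, xs))

def residuals(f_hat, xs_test, ys_test):
    """Test-set residuals r'_Y = y' - f_hat(x'), as in Algorithm 1."""
    return [y - f_hat(x) for x, y in zip(xs_test, ys_test)]
```

Theorem 5 then bounds how much any single residual can move when one training pair is replaced, which is exactly the stability certificate that propose-test-release consumes.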
Similar to the test-set privacy section, we use propose-test-release to give a useful, private IQR score. In fact, we use the IQR algorithm almost identically, except that we define A_j as the number of training points required to move the IQR out of an interval. Note that a lower bound on A_j is simply the number of points required to move every input less than the median to the left and every input larger than the median to the right (or the reverse of these), using the bound on r in eq. (6). The aforementioned privacy and utility results of the IQR propose-test-release framework apply here. The only difference is that we need to add noise to the IQR scores computed on the residuals, which implies (6ε, δ)-privacy and means that the results of Theorem 4 can be tightened.

Table 2: The non-private accuracies of the ANM model on a subset of the Cause-Effect Pairs Challenge [12], as well as the probability of correct causal inference after privatization. [Columns: dataset ids 4031, 597, 2209, 2967, 161, 2132, 1656, 901, 3484, 1627, with sizes 7713, 7748, 7766, 7771, 7782, 7784, 7803, 7820, 7853, 7862. Rows: Spearman's ρ, Kendall's τ, HSIC [11], and IQR [1], grouped by privacy level ε = ∞ (non-private accuracies), a small ε (value garbled), ε = 1, and ε = 2; the numeric entries are not recoverable from this extraction.]

We test our methods for private release of causal-inference statistics on small subsets of the Cause-Effect Pairs Competition collection of Guyon [12]. Specifically, we randomly select 10 of the largest 25 datasets that have a causal direction of either X → Y or Y → X. We average over 10 random 50/50 train/test splits of the data. Table 2 shows the non-private accuracy of the four dependence scores over these datasets, along with how the probability of correct causal inference changes as these scores are made private w.r.t. the test set. Note that these scores are often complementary, with the ranking-based scores performing well on datasets on which HSIC does worse, and vice versa.

Figure 1 shows the effect of privatization on the dependence scores HSIC and IQR. Note that, for low ε (increased privacy), the probability of correct inference is lower, as the amount of noise required blurs the true dependence scores. However, as ε increases, so does this probability, in some cases drastically. For the IQR score, recall that there is a probability that the algorithm returns null (⊥) if R_j is less than a threshold controlled by δ. We investigated this probability by varying δ over several orders of magnitude and sampling 10,
000 points from the appropriate Laplace distribution. We found that, for the IQR dataset in Figure 1, no sample moved R_j below the null threshold; the probability of null is therefore essentially 0.

The three left-most plots in Figure 2 demonstrate how λ, which has a large effect on the training-set sensitivity (as described in eq. (6)), affects the probability of correct inference. We perform this experiment for different settings of ε, and each one produces a distinctive 'hump' shape. This is because for small λ the sensitivity bound (6) is too large to produce meaningful causal inference. Similarly, for large λ the kernelized regression algorithm (5) is overly regularized, which produces a poor regressor and poor dependence scores. Only when λ is within a certain range do we balance the size of the sensitivity bound against the strength of the regularization. This range grows larger as ε increases and the privacy setting becomes less strict (requiring less noise). The right-most plot shows the correct-inference probability using the best λ for a range of ε; by tuning λ, we can achieve high-quality causal inference that maintains privacy w.r.t. the training set.

We have presented, to the best of our knowledge, the first work towards differentially private causal inference. There are numerous directions for future work, including privatizing other causal inference frameworks (e.g., IGCI [16]), analyzing the ANM algorithm without train/test splits, and considering other dependence scores. As there is significant overlap between the applications of causal inference and private learning, we believe this work constitutes an important step towards making causal inference practical.
Acknowledgments
KQW and MJK are supported by NSF grants IIA-1355406, IIS-1149882, and EFRI-1137211. We thank the anonymous reviewers for their useful comments.
References

[1] Bühlmann, Peter, Peters, Jonas, Ernest, Jan, et al. CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526–2556, 2014.

[2] Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. JMLR, 12:1069–1109, 2011.

[3] Dinur, Irit and Nissim, Kobbi. Revealing information while preserving privacy. In Proceedings of the SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 202–210. ACM, 2003.

[4] Dwork, Cynthia and Lei, Jing. Differential privacy and robust statistics. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, pp. 371–380. ACM, 2009.

[5] Dwork, Cynthia and Roth, Aaron. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2013.

[6] Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography, pp. 265–284. Springer, 2006.

[7] Centers for Disease Control and Prevention. How tobacco smoke causes disease: The biology and behavioral basis for smoking-attributable disease: A report of the Surgeon General. Centers for Disease Control and Prevention (US), 2010.

[8] Friedman, Nir and Nachman, Iftach. Gaussian process networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 211–219. Morgan Kaufmann Publishers Inc., 2000.

[9] Geiger, Philipp, Janzing, Dominik, and Schölkopf, Bernhard. Estimating causal effects by bounding confounding. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 240–249, 2014.

[10] Geiger, Philipp, Zhang, Kun, Schölkopf, Bernhard, Gong, Mingming, and Janzing, Dominik. Causal inference by identification of vector autoregressive processes with hidden components. In ICML, pp. 1917–1925, 2015.

[11] Gretton, Arthur, Bousquet, Olivier, Smola, Alex, and Schölkopf, Bernhard. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pp. 63–77. Springer, 2005.

[12] Guyon, I. Cause-effect pairs Kaggle competition, 2013.

[13] Hoyer, Patrik O, Janzing, Dominik, Mooij, Joris M, Peters, Jonas, and Schölkopf, Bernhard. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pp. 689–696, 2009.

[14] Jain, Prateek and Thakurta, Abhradeep. Differentially private learning with kernels. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 118–126, 2013.

[15] Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep. Differentially private online learning. COLT, 2012.

[16] Janzing, Dominik, Mooij, Joris, Zhang, Kun, Lemeire, Jan, Zscheischler, Jakob, Daniušis, Povilas, Steudel, Bastian, and Schölkopf, Bernhard. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182:1–31, 2012.

[17] Kano, Yutaka and Shimizu, Shohei. Causal inference using nonnormality. In Proceedings of the International Symposium on Science of Modeling, the 30th Anniversary of the Information Criterion, pp. 261–270, 2003.

[18] Kpotufe, Samory, Sgouritsa, Eleni, Janzing, Dominik, and Schölkopf, Bernhard. Consistency of causal inference under the additive noise model. In ICML, 2014.

[19] Lopez-Paz, David, Muandet, Krikamol, Schölkopf, Bernhard, and Tolstikhin, Iliya. Towards a learning theory of cause-effect inference. In ICML, 2015.

[20] McSherry, Frank and Talwar, Kunal. Mechanism design via differential privacy. In FOCS, pp. 94–103. IEEE, 2007.

[21] Mooij, Joris M, Stegle, Oliver, Janzing, Dominik, Zhang, Kun, and Schölkopf, Bernhard. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems, pp. 1687–1695, 2010.

[22] Mooij, Joris M, Janzing, Dominik, Heskes, Tom, and Schölkopf, Bernhard. On causal discovery with cyclic additive noise models. In Advances in Neural Information Processing Systems, pp. 639–647, 2011.

[23] Mooij, Joris M, Peters, Jonas, Janzing, Dominik, Zscheischler, Jakob, and Schölkopf, Bernhard. Distinguishing cause from effect using observational data: methods and benchmarks. arXiv preprint arXiv:1412.3773, 2014.

[24] Nissim, Kobbi, Raskhodnikova, Sofya, and Smith, Adam. Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, pp. 75–84. ACM, 2007.

[25] Pearl, Judea. Causality: Models, Reasoning, and Inference. 2000.

[26] Peters, Jonas, Mooij, Joris M, Janzing, Dominik, and Schölkopf, Bernhard. Causal discovery with continuous additive noise models. The Journal of Machine Learning Research, 15(1):2009–2053, 2014.

[27] Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. 2006.

[28] Reichenbach, Hans and Reichenbach, Maria. The Direction of Time. Univ of California Press, 1956.

[29] Sgouritsa, Eleni, Janzing, Dominik, Hennig, Philipp, and Schölkopf, Bernhard. Inference of cause and effect with unsupervised inverse regression. In AISTATS, pp. 847–855, 2015.

[30] Shajarisales, Naji, Janzing, Dominik, Schölkopf, Bernhard, and Besserve, Michel. Telling cause from effect in deterministic linear dynamical systems. In ICML, 2015.

[31] Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Stochastic convex optimization. In COLT, 2009.

[32] Spirtes, Peter, Glymour, Clark N, and Scheines, Richard. Causation, Prediction, and Search, volume 81. MIT Press, 2000.

[33] Sun, Xiaohai, Janzing, Dominik, and Schölkopf, Bernhard. Causal reasoning by evaluating the complexity of conditional densities with kernel methods. Neurocomputing, 71(7):1248–1256, 2008.

[34] Talwar, Kunal, Thakurta, Abhradeep, and Zhang, Li. Private empirical risk minimization beyond the worst case: The effect of the constraint set geometry. arXiv preprint arXiv:1411.5417, 2014.

[35] Zhang, Kun and Hyvärinen, Aapo. On the identifiability of the post-nonlinear causal model. In