Causal Discovery Using Proxy Variables
Mateo Rojas-Carulla
Marco Baroni    David Lopez-Paz

Abstract
Discovering causal relations is fundamental to reasoning and intelligence. In particular, observational causal discovery algorithms estimate the cause-effect relation between two random entities X and Y, given n samples from P(X, Y). In this paper, we develop a framework to estimate the cause-effect relation between two static entities x and y: for instance, an art masterpiece x and its fraudulent copy y. To this end, we introduce the notion of proxy variables, which allow the construction of a pair of random entities (A, B) from the pair of static entities (x, y). Then, estimating the cause-effect relation between A and B using an observational causal discovery algorithm leads to an estimation of the cause-effect relation between x and y. For example, our framework detects the causal relation between unprocessed photographs and their modifications, and orders in time a set of shuffled frames from a video. As our main case study, we introduce a human-elicited dataset of 10,000 causally-linked pairs of words from natural language. Our methods discover 75% of these causal relations. Finally, we discuss the role of proxy variables in machine learning, as a general tool to incorporate static knowledge into prediction tasks.
1. Introduction
Discovering causal relations is a central task in science (Pearl, 2009; Beebee et al., 2009), and empowers humans to explain their experiences, predict the outcome of their interventions, wonder about what could have happened but never did, or plan which decisions will shape the future to their maximum benefit. Causal discovery is essential to the development of common-sense (Kuipers, 1984; Waldrop, 1987). In machine learning, it has been argued that causal discovery algorithms are a necessary step towards machine reasoning (Bottou, 2014; Bottou et al., 2013; Lopez-Paz, 2016) and artificial intelligence (Lake et al., 2016).

The gold standard to discover causal relations is to perform active interventions (also called experiments) in the system under study (Pearl, 2009). However, interventions are in many situations expensive, unethical, or impossible to realize. In all of these situations, there is a prime need to discover and reason about causality purely from observation. Over the last decade, the state-of-the-art in observational causal discovery has matured into a wide array of algorithms (Shimizu et al., 2006; Hoyer et al., 2009; Daniusis et al., 2012; Peters et al., 2014; Mooij et al., 2016; Lopez-Paz et al., 2015; Lopez-Paz, 2016). All these algorithms estimate the causal relations between the random variables (X_1, ..., X_p) by estimating various asymmetries in P(X_1, ..., X_p). In the interest of simplicity, this paper considers the problem of discovering the causal relation between two variables X and Y, given n samples from P(X, Y).

The methods mentioned estimate the causal relation between two random entities X and Y, but often we are interested instead in two static entities x and y.

[Affiliations: Facebook AI Research, Paris, France; University of Cambridge, Cambridge, UK; MPI for Intelligent Systems, Tübingen, Germany. Correspondence to: Mateo Rojas-Carulla <[email protected]>.]
These are a pair of single objects for which it is not possible to define a probability distribution directly. Examples of such static entities include one art masterpiece and its fraudulent copy, one translated document and its original version, or one pair of causally linked words in natural language, such as "virus" and "death". Looking into the distant future, an algorithm able to discover the causal structure between static entities in natural language could read through medical journals, and discover the causal mechanisms behind a new cure for a specific disease: the very goal of the ongoing $45 million Big Mechanism DARPA initiative (Cohen, 2015). Or, if we were able to establish the causal relation between two arbitrary natural language statements, we could tackle general-AI tasks such as the Winograd schema challenge (Levesque et al., 2012), which are out of reach for current algorithms. The above and many more are situations where causal discovery between static entities is in demand.
Our Contributions
First, we introduce the framework of proxy variables to estimate the causal relation between static entities (Section 3).

Second, we apply our framework to the task of inferring the cause-effect relation between pairs of images (Section 4). In particular, our methods are able to infer the causal relation between an image and its stylistic modification, and can recover the correct ordering of a set of shuffled video frames (Section 4.2).

Third, we apply our framework to discover the cause-effect relation between pairs of words in natural language (Section 5). To this end, we introduce a novel dataset of 10,000 human-elicited pairs of words with known causal relation (Section 5.2). Our methods are able to recover 75% of the cause-effect relations (such as "accident → injury" or "sentence → trial") in this challenging task (Section 5.4).

Fourth, we discuss the role of proxy variables as a tool to incorporate external knowledge, as provided by static entities, into general prediction problems (Section 6).

All our code and data are available at anonymous.

We start the exposition by introducing the basic language of observational causal discovery, as well as motivating its role in machine learning.
2. Causal Discovery in Machine Learning
The goal of observational causal discovery is to reveal the cause-effect relation between two random variables X and Y, given n samples (x_1, y_1), ..., (x_n, y_n) from P(X, Y). In particular, we say that "X causes Y" if there exists a mechanism F that transforms the values taken by the cause X into the values taken by the effect Y, up to the effects of some random noise N. Mathematically, we write Y ← F(X, N). Such an equation highlights an asymmetric assignment rather than a symmetric equality. If we were to intervene and change the value of the cause X, then a change in the value of the effect Y would follow. On the contrary, if we were to manipulate the value of the effect Y, a change in the cause X would not follow.

When two random variables share a causal relation, they often become statistically dependent. However, when two random variables are statistically dependent, they do not necessarily share a causal relation. This is at the origin of the famous warning "dependence does not imply causality". This relation between dependence and causality was formalized by Reichenbach (1956) into the following principle.

Principle 1 (Principle of common cause). If two random variables X and Y are statistically dependent (X ⊥̸ Y), then one of the following causal explanations must hold:
i) X causes Y (write X → Y), or
ii) Y causes X (write X ← Y), or
iii) there exists a random variable Z that is the common cause of both X and Y (write X ← Z → Y).

Figure 1. Example of an Additive Noise Model (ANM): (a) Y = F(X) + N, with X ⊥ N; (b) X = G(Y) + E, with Y ⊥̸ E.
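Case (iii) of Principle 1 is easy to simulate directly (a minimal illustrative sketch, not part of the paper's experiments): X and Y are strongly correlated because of a hidden common cause Z, yet regressing out Z leaves nearly uncorrelated residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Z is a hidden common cause of X and Y (case iii of Principle 1).
Z = rng.normal(size=n)
X = Z + 0.5 * rng.normal(size=n)
Y = Z + 0.5 * rng.normal(size=n)

# X and Y are strongly dependent...
corr_xy = np.corrcoef(X, Y)[0, 1]

# ...but conditionally independent given Z: regress Z out of each
# variable and check the correlation of the residuals.
rx = X - np.polyval(np.polyfit(Z, X, 1), Z)
ry = Y - np.polyval(np.polyfit(Z, Y, 1), Z)
partial = np.corrcoef(rx, ry)[0, 1]
```

With these (arbitrary) coefficients, corr_xy comes out close to 0.8 while the partial correlation is close to zero, even though X and Y never influence each other.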
In the third case, X and Y are conditionally independent given Z (write X ⊥ Y | Z).

In machine learning, these three types of statistical dependence are exploited without distinction, as dependence is sufficient to perform optimal predictions about identically and independently distributed (iid) data (Schölkopf et al., 2012). However, we argue that taking into account the Principle of common cause would have far-reaching benefits in non-iid machine learning. For example, assume that we are interested in predicting the values of a target variable Y, given the values taken by two features (X_1, X_2). Then, understanding the causal structure underlying (X_1, X_2, Y) brings two benefits.

First, interpretability. Explanatory questions such as "Why does Y = 2 when (X_1, X_2) = (...)?", and counterfactual questions such as "What value would Y have taken, had X_1 = ...?" cannot be answered using statistics alone, since their answers depend on the particular causal structure underlying the data.

Second, robustness. Predictors which estimate the values taken by a target variable Y given only its direct causes are robust with respect to distributional shifts on their inputs. For example, let X_1 ∼ P(X_1), Y ← F_1(X_1), and X_2 ← F_2(Y). Then, the predictor E(Y | X_1) is invariant to changes in the joint distribution P(X_1, X_2) as long as the causal mechanism F_1 does not change. However, the predictor E(Y | X_1, X_2) can vary wildly even if the causal mechanism F_1 (the only one involved in computing Y) does not change (Peters et al., 2016; Rojas-Carulla et al., 2015).

The previous two points apply to the common "non-iid" situations where we have access to data drawn from some distribution P, but we are interested in some different but related distribution P̃.
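This invariance can be checked numerically. In the sketch below (the linear mechanisms and the intervention are our own illustrative choices, not the paper's), F_1 never changes, while the mechanism generating X_2 is replaced by an intervention in the second environment: the regression of Y on the direct cause X_1 keeps the same coefficient across environments, while the regression on (X_1, X_2) shifts.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, intervene_on_x2):
    # Causal graph: X1 -> Y -> X2; X1 is the only direct cause of Y.
    x1 = rng.normal(size=n)
    y = 2.0 * x1 + rng.normal(size=n)   # mechanism F1: never changes
    if intervene_on_x2:
        x2 = rng.normal(size=n)         # intervention replaces F2
    else:
        x2 = y + rng.normal(size=n)     # mechanism F2
    return x1, x2, y

def ols(columns, target):
    coef, *_ = np.linalg.lstsq(np.column_stack(columns), target, rcond=None)
    return coef

n = 200_000
x1, x2, y = sample(n, intervene_on_x2=False)
b_cause_obs = ols([x1], y)        # ~[2.0]
b_full_obs = ols([x1, x2], y)     # ~[1.0, 0.5]

x1, x2, y = sample(n, intervene_on_x2=True)
b_cause_int = ols([x1], y)        # ~[2.0]: E(Y | X1) is invariant
b_full_int = ols([x1, x2], y)     # ~[2.0, 0.0]: E(Y | X1, X2) shifted
```

The cause-only predictor keeps its coefficient under the intervention, while the weight the full predictor places on X_2 collapses from roughly 0.5 to roughly 0.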
One natural way to phrase and leverage the similarities between P and P̃ is in terms of shared causal structures (Peters, 2012; Lopez-Paz, 2016).

While it is indeed an attractive endeavor, discovering the causal relation between two random variables purely from observation is an impossible task when considered in full generality. Indeed, any of the three causal structures outlined in Principle 1 could explain the observed dependence between two random variables. However, one can in many cases impose assumptions to render the causal relation between two variables identifiable from their joint distribution. For example, consider the family of Additive Noise Models, or ANM (Hoyer et al., 2009; Peters et al., 2014; Mooij et al., 2016). In ANM, one assumes that the causal model has the form Y = F(X) + N, where X ⊥ N. It turns out that, under some assumptions, the reverse ANM X = G(Y) + E will not satisfy the independence assumption Y ⊥ E (Fig. 1). The statistical dependence shared by the cause and noise in the wrong causal direction is the footprint that renders the causal relation between X and Y identifiable from statistics alone.

In situations where the ANM assumption is not satisfied (e.g., multiplicative or heteroskedastic noise), one may prefer learning-based causal discovery tools, such as the Randomized Causation Coefficient, or RCC (Lopez-Paz et al., 2015). RCC assumes access to a causal dataset D = {(S_i, ℓ_i)}_{i=1}^N, where S_i = {(x_{i,j}, y_{i,j})}_{j=1}^{n_i} ∼ P_i(X_i, Y_i) is a bag of examples drawn from some distribution P_i, ℓ_i = +1 if X_i → Y_i, and ℓ_i = −1 if X_i ← Y_i.
By featurizing each of the training distribution samples S_i using kernel mean embeddings (Smola et al., 2007), RCC learns a binary classifier on D to reveal the causal footprints necessary to classify new pairs of random variables.

However, both ANM- and RCC-based methods need n ≫ 1 samples from P(X, Y) to classify the causal relation between the random variables X and Y. Therefore, these methods are not suited to infer the causal relation between static entities such as, for instance, one painting and its fraudulent copy. In the following section, we propose a framework to extend the state-of-the-art in causal discovery methods to this important case.
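As an illustration of the ANM footprint, the bivariate decision rule can be sketched in a few lines: fit a regression in each direction and prefer the direction whose residuals are more independent of the putative cause. This is a minimal sketch, not the implementation of the cited papers; in particular, a simple biased HSIC estimator with Gaussian kernels stands in for the dependence measure.

```python
import numpy as np

def rbf_gram(v):
    # Gaussian kernel matrix with median-distance bandwidth.
    d = np.abs(v[:, None] - v[None, :])
    sigma = np.median(d[d > 0])
    return np.exp(-((d / sigma) ** 2))

def hsic(u, v):
    # Biased HSIC estimate: a simple nonlinear dependence measure.
    n = len(u)
    H = np.eye(n) - 1.0 / n                       # centering matrix
    return np.trace(rbf_gram(u) @ H @ rbf_gram(v) @ H) / n ** 2

def anm_direction(x, y, degree=4):
    # Residuals of a polynomial fit in each direction; the direction
    # with more independent residuals is the preferred causal one.
    res_fwd = y - np.polyval(np.polyfit(x, y, degree), x)
    res_bwd = x - np.polyval(np.polyfit(y, x, degree), y)
    return "x->y" if hsic(x, res_fwd) < hsic(y, res_bwd) else "x<-y"

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=500)
y = x + x ** 3 + rng.normal(scale=0.3, size=500)  # ANM in the x->y direction
```

On this synthetic ANM, the forward residuals are essentially the independent noise term, while the backward residuals inherit a visible dependence on y, so the rule favors "x->y".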
3. The Main Concepts: Static Entities, Proxy Variables and Proxy Projections
In the following, we consider two static entities x, y in some space S that satisfy the relation "x causes y". Formally, this causal relation manifests the existence of a (possibly noisy) mechanism f such that the value y is computed as y ← f(x). This asymmetric assignment guarantees that changes in the static cause x would lead to changes in the static effect y, but the converse would not hold.

As mentioned previously, traditional causal discovery methods cannot be directly applied to static entities. In order to discover the causal relation between the pair of static entities x and y, we introduce two main concepts: proxy variables W, and proxy projections π.

Figure 2.
A pair of static entities (x, y) share a causal relation of interest (thick blue arrow). A proxy variable W, together with a proxy projection π, produces the random entities (A, B), which share the causal footprint of (x, y), denoted by the dotted blue arrow.

First, a proxy random variable W is a random variable taking values in some set W, which can be understood as a random source of information related to x and y. This definition is on purpose rather vague, and will be illustrated through several examples in the following sections.

Second, a proxy projection is a function π : W × S → R. Using a proxy variable and projection, we can construct a pair of scalar random variables A = π(W, x) and B = π(W, y). A proxy variable and projection are causal if the pair of random entities (A, B) share the same causal footprint as the pair of static entities (x, y).

If the proxy variable and projection are causal, we may estimate the cause-effect relation between the static entities x and y in three steps. First, draw (a_1, b_1), ..., (a_n, b_n) from P(A, B). Second, use an observational causal discovery algorithm to estimate the cause-effect relation between A and B given {(a_i, b_i)}_{i=1}^n. Third, conclude "x causes y" if A → B, or "y causes x" if A ← B. This process is summarized in Figure 2.

Note that the causal relation X → Y does not imply the causal relation A → B in the interventional sense: even if A is a copy of X and B is a copy of Y, intervening on A will not change B! We only care here about the presence of statistically observable causal footprints between the variables. Furthermore, our framework extends readily to the case where x and y live in different modalities (say, x is an image and y is a piece of audio describing the image). In this case, all we need is a proxy variable W = (W_x, W_y) and a pair of proxy projections (π_x, π_y) with the appropriate structure.
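The three-step procedure above can be written down generically. In the sketch below, the static entities, the proxy sampler, and the projection are all illustrative placeholders (here, vectors probed by random Gaussian directions under a dot-product projection):

```python
import numpy as np

def proxy_samples(x, y, draw_proxy, project, n=1000):
    """Turn two static entities (x, y) into n paired scalar draws
    (a_i, b_i) via a proxy sampler and a proxy projection pi."""
    a, b = np.empty(n), np.empty(n)
    for i in range(n):
        w = draw_proxy()        # step 1: sample the proxy W
        a[i] = project(w, x)    # A = pi(W, x)
        b[i] = project(w, y)    # B = pi(W, y)
    return a, b

# Toy illustration: vectors as "static entities", random Gaussian
# probes as the proxy, and a dot product as the projection.
rng = np.random.default_rng(0)
x = rng.normal(size=64)
y = np.tanh(x)                  # y is a deterministic effect of x
a, b = proxy_samples(x, y, lambda: rng.normal(size=64), np.dot)
```

The resulting pairs (a_i, b_i) are then handed to any observational causal discovery algorithm, exactly as in the three steps above.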
For simplicity and throughout this paper, we will choose our proxy variables and projections based on domain knowledge. Learning proxy variables and projections from data is an exciting area left for future research.

[Footnote: The concept of causal footprint is relative to our assumptions. For instance, when assuming an ANM Y ← f(X) + N, the causal footprint is the statistical independence between X and N.]

[Figure 3 panels: x = original image; y = stylized image; sampled patch values a_i, b_i; scatterplot of B against A.]

Figure 3.
Sampling random patches at paired locations produces a proxy variable to discover the causal relation between two images.
4. Causal Discovery Using Proxies in Images
Consider the two images shown in Figure 3. The image on the left is an unprocessed photograph of the Tübingen Neckarfront, while the one on the right is the same photograph after being stylized with the algorithm of Gatys et al. (2016). From a causal point of view, the unprocessed image x is the cause of the stylized image y. How can we leverage the ideas from Section 3 to recover such a causal relation?

The following is one possible solution. Assume that the two images are represented by pixel intensity vectors x and y, respectively. For n ≫ 1 and j = 1, ..., n:

• Draw a mask-image w_j, which contains ones inside a patch at random coordinates, and zeroes elsewhere.
• Compute a_j = ⟨w_j, x⟩, and b_j = ⟨w_j, y⟩.

This process returns a sample {(a_j, b_j)}_{j=1}^n drawn from P(A, B), the joint distribution of the two scalar random variables (A, B). The conversion from static entities (x, y) to random variables (A, B) is obtained by virtue of i) the randomness generated by the proxy variable W, which in this particular case is incarnated as random masks, and ii) a causal projection π, here a simple dot product.

At this point, if the causal footprint between the random entities (A, B) resembles the causal footprint between x and y, we can apply a regular causal discovery algorithm to (A, B) to estimate the causal relation between x and y.

The intuition behind causal discovery using proxy variables is that, although we observe (x, y) as static entities, these are underlyingly complex, high-dimensional, structured objects that carry rich information about their causal relation. The proxy variable W introduces randomness to sample different views of the high-dimensional causal structures, and π summarizes those views into scalar values.
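A minimal sketch of this patch-based construction on synthetic images (the images and the "style" transformation here are stand-ins, not the data used in the paper). Since the mask is binary, the dot product ⟨w_j, ·⟩ reduces to summing a patch.

```python
import numpy as np

def patch_proxy_samples(img_x, img_y, n=1024, k=10, seed=0):
    """Draw n random k-by-k patch masks at shared coordinates and
    project both images onto them (the <w_j, .> projection above)."""
    rng = np.random.default_rng(seed)
    h, w = img_x.shape
    a, b = np.empty(n), np.empty(n)
    for j in range(n):
        r = rng.integers(0, h - k)
        c = rng.integers(0, w - k)
        a[j] = img_x[r:r + k, c:c + k].sum()   # <w_j, x>
        b[j] = img_y[r:r + k, c:c + k].sum()   # <w_j, y>
    return a, b

# Synthetic stand-in for an (original, stylized) pair: the "style"
# is a pointwise nonlinearity plus noise applied to the original.
rng = np.random.default_rng(1)
original = rng.uniform(size=(128, 128))
stylized = np.sqrt(original) + 0.05 * rng.normal(size=(128, 128))
A, B = patch_proxy_samples(original, stylized)
```

The returned scalars (A, B) can then be fed to any bivariate causal discovery algorithm, as described in Section 3.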
But why should the causal footprint of these summaries cue the causal relation between x and y?

We formalize this question for the specific case of stylized images, where x is the original image and y its stylized version. Let the causal mechanism mapping x to y operate locally. More precisely, assume that each k-subset y_{S_i} in the stylized image is computed from the k-subset x_{S_i} in the original image, as described by the ANM y_{S_i} = f(x_{S_i}) + ε_{S_i}. Then, the stylized image is y = F(x) + ε, where F(x)_{S_i} = f(x_{S_i}). For simplicity, assume that f(x_S) = g(β x_S), where β is a k × k matrix and g acts element-wise. Then, let P(W) be a distribution over masks extracting random k-subsets, and let π(·, ·) = ⟨·, ·⟩, to obtain:

  A = π(W, x) = ⟨W, x⟩,
  B = π(W, y) = ⟨W, y⟩
    = Σ_{j=1}^k g_j( Σ_{l=1}^k β_{jl} (x_S)_l ) + N
    = Σ_{j=1}^k g_j( α_j A ) + N,

where N = Σ_{j=1}^k (ε_S)_j, and where we assume that β is such that α_j = β_{jl} for all l ≤ k. Since A ⊥ N, the pair (A, B) also follows an ANM. We leave for future work the investigation of identifiability conditions for causal inference using proxy variables.

In order to illustrate the use of causal discovery using proxy variables in images, we conducted two small experiments. In these experiments, we extract n = 1024 square patches of size k = 10 pixels, and use the Additive Noise Model (Hoyer et al., 2009) to estimate the causal relation between the constructed scalar random variables A and B.

First, we collected unprocessed images together with stylizations (including the one from Figure 3), made

Figure 4.
Causal discovery using proxy variables uncovers the causal time signal to reorder a shuffled sequence of video frames.

using the algorithm of Gatys et al. (2016). When applying causal discovery using proxy variables to this dataset, we can correctly identify the direction of causation from the original image to its stylized version in … of the cases.

Second, we decomposed a video of drops of ink mixing with water into frames {x_i}, shown in Figure 4. Using the same mask proxy variable as above, we construct a matrix M, with one row and column per frame, such that M_{ij} = 1 if x_i → x_j according to our method, and M_{ij} = 0 otherwise. Then, we consider M to be the adjacency matrix of the causal DAG describing the causal structure between the frames. By employing topological sort on this graph, we were able to obtain the true ordering, unique among all possible orderings.
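Once the pairwise decision matrix M is available, the ordering step is plain topological sorting. A sketch (Kahn's algorithm, with an illustrative M; the pairwise causal test itself is abstracted away):

```python
import numpy as np

def topological_order(M):
    """Kahn's algorithm on the pairwise-decision matrix M,
    where M[i, j] = 1 encodes the inferred edge i -> j."""
    n = M.shape[0]
    order = []
    remaining = set(range(n))
    while remaining:
        # nodes with no incoming edge among the remaining ones
        roots = [i for i in remaining
                 if not any(M[j, i] for j in remaining)]
        if not roots:
            raise ValueError("cycle: the pairwise decisions are not a DAG")
        for i in sorted(roots):
            order.append(i)
            remaining.remove(i)
    return order

# If every pairwise direction is inferred correctly, M is the strict
# upper-triangular matrix and the true frame order is recovered.
n = 6
M = np.triu(np.ones((n, n), dtype=int), k=1)
```

With a consistent tournament like this one, the recovered order is unique; inconsistent pairwise decisions would surface as a cycle.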
5. Causal Discovery Using Proxies in NLP
As our main case study, consider discovering the causal relation between pairs of words appearing in a large corpus of natural language. For instance, given the pair of words (virus, death), which represent our static entities x and y, together with a large corpus of natural language, we want to recover causal relations such as "virus → death", "sun → radiation", "trial → sentence", or "drugs → addiction".

This problem is extremely challenging for two reasons. First, word pairs are extremely varied in nature (compare "avocado causes guacamole" to "cat causes purr"), and some are very rare ("wrestler causes pin"). Second, the causal relation between two words can always be tweaked in context-specific ways. For instance, one can construct sentences where "virus causes death" (e.g., the virus led to a quick death), but also sentences where "death causes virus" (e.g., the multiple deaths in the area further spread the virus).

We are hereby interested in the canonical causal relation between pairs of words, assumed by human subjects when specific contexts are not provided (see Section 5.2). Furthermore, our interest lies in discovering the causal relation between pairs of words without the use of language-specific knowledge or heuristics. To the contrary, we aim to discover such causal relations by using generic observational causal discovery methods, such as the ones described in Section 2.

In the following, Section 5.1 frames this problem in the language of causal discovery between static entities. Then, Section 5.2 introduces a novel, human-generated, human-validated dataset to test our methods. Section 5.3 reviews prior work on causal discovery in language. Finally, Section 5.4 presents experiments evaluating our methods.

In the language of causal discovery with proxies, a pair of words is a pair of static entities: (x, y) = (virus, death). In order to discover the causal relation between x and y, we are in need of a proxy variable W, as introduced in Section 3.
We will use a simple proxy: let P(W = w) be the probability of the word w appearing in a sentence drawn at random from a large corpus of natural language.

Using the proxy W, we need to define the pair of random variables A = π(W, x) and B = π(W, y) in terms of a causal projection π. Once we have defined the causal projection π, we can sample w_1, ..., w_n ∼ P(W), construct a_i = π(w_i, x) and b_i = π(w_i, y), and apply a causal discovery algorithm to the sample {(a_i, b_i)}_{i=1}^n. Specifically, we estimate P(W) from a large corpus of natural language, and sample n = 10,000 words without replacement.

[Footnote: This is equivalent to sampling approximately the top 10,000 most frequent words in the corpus. Due to the extremely skewed nature of word frequency distributions (Baayen, 2001), sampling with replacement would produce a list of very frequent words such as a and the, sampled many times.]

Throughout our experimental evaluation, we will use and compare different proxy projections π(w, x):

1) π_w2vii(w, x) = ⟨v^i_w, v^i_x⟩, where v^i_z ∈ R^d is the input word2vec representation (Mikolov et al., 2013) of the word z. The dot-product ⟨v^i_w, v^i_x⟩ measures the similarity in meaning between the pair of words (w, x).
2) π_w2vio(w, x) = ⟨v^i_w, v^o_x⟩, where v^o_z ∈ R^d is the output word2vec representation of the word z. The dot-product ⟨v^i_w, v^o_x⟩ is an unnormalized estimate of the conditional probability p(x | w) (Melamud et al., 2015).
3) π_w2voi(w, x) = ⟨v^o_w, v^i_x⟩, an unnormalized estimate of the conditional probability p(w | x).
4) π_counts(w, x) = p(w, x), where the pmf p(w, x) is directly estimated from counting within-sentence co-occurrences in the corpus.
5) π_prec-counts(w, x), similar to the one above, but computed only over sentences where w precedes x.
6) π_pmi(w, x) = p(w, x)/(p(w) p(x)), where the pmfs p(w), p(x), and p(w, x) are estimated from counting words and (sentence-based) co-occurrences in the corpus. The log of this quantity is known as point-wise mutual information, or PMI (Church & Hanks, 1990).
7) π_prec-pmi(w, x), similar to the one above, but computed only over sentences where w precedes x.

Applying the causal projections to our sample from proxy W, we construct the n-vector

  Π^x_proj = (π_proj(w_1, x), ..., π_proj(w_n, x)),   (1)

and similarly for Π^y_proj, where

  proj ∈ {w2vii, w2vio, w2voi, counts, prec-counts, pmi, prec-pmi}.   (2)

In particular, we use the skip-gram model implementation of fastText (Bojanowski et al., 2016) to compute d-dimensional word2vec representations.

We introduce a human-elicited, human-filtered dataset of 10,000 pairs of words with a known causal relation. This dataset was constructed in two steps:

1) We asked workers from Amazon Mechanical Turk to create pairs of words linked by a causal relation. We provided the turks with examples of words with a clear causal link (such as "sun causes radiation") and examples of related words not sharing a causal relation (such as "knife" and "fork"). For details, see Appendix A.
2) Each of the pairs collected from the previous step was randomly shuffled and submitted to different turks, none of whom had created any of the word pairs. Each turk was required to classify the pair of words (x, y) as "x causes y", "y causes x", or "x and y do not share a causal relation". For more details, see Appendix B.

This procedure resulted in a dataset of 10,000 causal word pairs (x, y), each accompanied with three numbers: the number of turks that voted "x causes y", the number of turks that voted "y causes x", and the number of turks that voted "x and y do not share a causal relation".
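The count-based projections π_counts and π_pmi defined above can be estimated directly from within-sentence co-occurrences. A toy sketch (the corpus is illustrative; the precedence-based variants would additionally require token order, which sets discard):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_projections(sentences):
    """Estimate p(w) and p(w, x) from within-sentence co-occurrences,
    yielding the pi_counts and pi_pmi projections of the text."""
    word_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        words = set(sent.split())              # each sentence counted once
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    n = len(sentences)
    p = {w: c / n for w, c in word_counts.items()}

    def pi_counts(w, x):                       # p(w, x)
        return pair_counts[frozenset((w, x))] / n

    def pi_pmi(w, x):                          # p(w, x) / (p(w) p(x))
        return pi_counts(w, x) / (p[w] * p[x])

    return pi_counts, pi_pmi

corpus = ["the virus caused a quick death",
          "the virus spread fast",
          "a death in the area"]
pi_counts, pi_pmi = cooccurrence_projections(corpus)
```

Evaluating these projections at each sampled vocabulary word w against a fixed word x produces one coordinate of the vector Π^x in (1).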
The NLP community has devoted much attention to the problem of identifying the semantic relation holding between two words, with causality as a special case. Girju et al. (2009) discuss the results of the large shared task on relation classification they organized (their benchmark included only 220 examples of cause-effect). The task required recognizing relations in context, but, as discussed by the authors, most contexts display the default relation we are after here (e.g., "The mutant virus gave him a severe flu" instantiates the default relation in which virus is the cause, flu is the effect). All participating systems used extra resources, such as ontologies and syntactic parsing, on top of corpus data. They are thus outside the scope of the purely corpus-based methods we are considering here.

Most NLP work specifically focusing on the causality relation relies on informative linking patterns co-occurring with the target pairs (such as, most obviously, the conjunction because). These patterns are extracted and processed with sophisticated methods, involving annotation, ontologies, bootstrapping, and/or manual filtering (see, e.g., Blanco et al. 2008, Hashimoto et al. 2012, Radinsky et al. 2012, and references therein). We experimented with extracting linking patterns from our corpus, but, due to the relatively small size of the latter, results were extremely sparse (note that patterns can only be extracted from sentences in which both cause and effect words occur). More recent work started looking at causal chains of events as expressed in text (see Mirza & Tonelli 2016 and references therein). Applying our generic method to this task is a direction for future work.

A semantic relation that received particular attention in NLP is that of entailment between words (dog entails animal). As causality is intuitively related to entailment, we will apply below entailment detection methods to cause/effect classification.
Most lexical entailment methods rely on distributional representations of the words in the target pair. Traditionally, entailing pairs have been identified with unsupervised asymmetric similarity measures applied to distributed word representations (Geffet & Dagan, 2005; Kotlerman et al., 2010; Lenci & Benotto, 2012; Weeds et al., 2004). We will test one of these related measures, namely Weeds Precision (WS). More recently, Santus et al. (2014) showed that the relative entropy of distributed vectors representing the words in a pair is an effective cue to which word is entailing the other, and we also look at entropy for our task. However, the most effective method to detect entailment is to apply a supervised classifier to the concatenation of the vectors representing the words in a pair (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014).
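As a concrete illustration of an entropy-style score over a projection vector (our own minimal version, based on a histogram estimate, not the exact measure of Santus et al.):

```python
import numpy as np

def hist_entropy(v, bins=32):
    # Shannon entropy (in nats) of a histogram estimate of v's
    # distribution: spread-out vectors score high, peaked ones low.
    counts, _ = np.histogram(v, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
broad = rng.uniform(-1, 1, size=10_000)   # mass spread across bins
peaked = broad ** 5                       # mass concentrated near zero
```

Scores of this kind, computed on the vectors Π^x and Π^y of (1), serve as the entropy-based decision statistics in the experiments of Section 5.4.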
We evaluate a variety of methods to discover the causal relation between two words appearing in a large corpus of natural language. We study methods that fall within three categories: baselines, distribution-based causal discovery methods, and feature-based supervised methods. These three families of methods consider an increasing amount of information about the task at hand, and therefore exhibit increasing classification accuracy.

All our computations are based on the full English Wikipedia, as post-processed by Matt Mahoney. We study the N pairs of words, out of the 10,000 in the dataset described in Section 5.2, that achieved a consensus among the turks. We use RCC to estimate the causal relation between pairs of random variables.

5.4.1. Baselines
These are a variety of unsupervised, heuristic baselines. Each baseline computes two scores, denoted by S_{x→y} and S_{x←y}, predicting x → y if S_{x→y} > S_{x←y}, and x ← y if S_{x→y} < S_{x←y}. The baselines are:

• frequency: S_{x→y} is the number of sentences where x appears in the corpus, and S_{x←y} is the number of sentences where y appears in the corpus.
• precedence: considering only sentences from the corpus where both x and y appear, S_{x→y} is the number of sentences where x occurs before y, and S_{x←y} is the number of sentences where y occurs before x.
• counts (entropy): S_{x→y} is the entropy of Π^x_counts, and S_{x←y} is the entropy of Π^y_counts, as defined in (1).
• counts (WS): using the WS measure of Weeds & Weir (2003), S_{x→y} = WS(Π^x_counts, Π^y_counts), and S_{x←y} = WS(Π^y_counts, Π^x_counts).
• prec-counts (entropy): S_{x→y} is the entropy of Π^x_prec-counts, and S_{x←y} is the entropy of Π^y_prec-counts.
• prec-counts (WS): analogous to the previous.

The baselines PMI (entropy), PMI (WS), prec-PMI (entropy), and prec-PMI (WS) are analogous to the last four, but use (Π^x_(prec-)pmi, Π^y_(prec-)pmi) instead of (Π^x_(prec-)counts, Π^y_(prec-)counts).

Figure 5 shows the performance of these baselines in blue.

5.4.2. Distribution-based causal discovery methods
These methods implement our framework of causal discovery using proxy variables.
They classify n samples from a 2-dimensional probability distribution as a whole. Recall that a vocabulary (w_j)_{j=1}^n drawn from the proxy is available. Given N word pairs (x_i, y_i), this family of methods constructs a dataset D = {({(a_{ij}, b_{ij})}_{j=1}^n, ℓ_i)}_{i=1}^N, where a_{ij} = π_proj(w_j, x_i), b_{ij} = π_proj(w_j, y_i), ℓ_i = +1 if x_i → y_i, and ℓ_i = −1 otherwise. In short, D is a dataset of N "scatterplots" annotated with binary labels. The i-th scatterplot contains the n projections of x_i and y_i against the n vocabulary words drawn from the proxy.

The samples {(a_{ij}, b_{ij})}_{j=1}^n are computed using a deterministic projection of iid draws from the proxy, meaning that {(a_{ij}, b_{ij})}_{j=1}^n ∼ P^n(A_i, B_i). Therefore, we could permute the points inside each scatterplot without altering the results of these methods. In principle, we could also remove some of the points in the scatterplot without a significant drop in performance. Therefore, these methods search for causal footprints at the 2-dimensional distribution level, and we term them distribution-based causal discovery methods.

The methods in this family first split the dataset D into a training set D_tr and a test set D_te. Then, the methods train RCC on the training set D_tr, and test its classification accuracy on D_te. This process is repeated ten times, splitting D at random into a training set and a disjoint test set. Each method builds on top of a causal projection from (2) above. Figure 5 shows the test accuracy of these methods in green.

5.4.3. Feature-based supervised methods
These methods use the same data generated by our causal projections, but treat them as fixed-size vectors fed to a generic classifier, rather than as random samples to be analyzed with an observational causal discovery method. They can be seen as an oracle to upper-bound the amount of causal signals (and signals correlated to causality) contained in our data. Specifically, they use 2n-dimensional vectors given by the concatenation of those in (1). Given N word pairs (x_i, y_i), they build a dataset D = {((Π^{x_i}_proj, Π^{y_i}_proj), ℓ_i)}_{i=1}^N, where ℓ_i = +1 if x_i → y_i, ℓ_i = −1 if x_i ← y_i, and "proj" is a projection from (2). Next, we split the dataset D into a training set D_tr and a disjoint test set D_te. To evaluate the accuracy of each method in this family, we train a random forest using D_tr, and report its classification accuracy over D_te. This process is repeated ten times, by splitting the dataset D at random. The results are presented as red bars in Figure 5. We also report the classification accuracy of training the random forest on the raw word2vec representations of the pair of words (top three bars).

5.4.4. Discussion of results
Baseline methods are the lowest performing. We believe that the performance of the best baseline, precedence, is due to the fact that most of Wikipedia is written in the active voice, which often aligns with the temporal sequence of events, and thus correlates with causality.

The feature-based methods achieve the highest test classification accuracy. However, feature-based methods enjoy the flexibility of considering each of the n = 10,000 elements in the causal projection as a distinct feature. Therefore, feature-based methods do not focus on patterns to be found at a distributional level (such as causality), and are vulnerable to the permutation or removal of features. We believe that feature-based methods may achieve their superior performance by overfitting to biases in our dataset, which are not necessarily related to causality.

Impressively, the best distribution-based causal discovery method achieves 75% test classification accuracy, a significant improvement over the best baseline method. Importantly, our distribution-based methods take a whole 2-dimensional distribution as input to the classifier; as such, these methods are robust with respect to permutations and removals of the n distribution samples. We find it encouraging that the best distribution-based method is the one based on π_w2voi. This suggests the intuitive interpretation that the distribution of a vocabulary conditioned on the cause word causes the distribution of the vocabulary conditioned on the effect word. Even more encouragingly, Figure 6 shows a positive dependence between the test classification accuracy of RCC and the confidence of human annotations, obtained by computing the test classification accuracy over all causal pairs annotated with a human confidence above each of a set of increasing thresholds. Thus, our proxy variables and projections arguably capture a notion of causality aligned with that of human annotators.
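The contrast between the two families can be made concrete with a small sketch. The scatterplot values below are hypothetical stand-ins for the causal projections, and the moment-based summary is a crude stand-in for the kernel mean embeddings used by RCC, not the paper's actual featurization; it only illustrates why the distribution-based view is permutation-invariant while the flattened feature-based view is not.

```python
import numpy as np

# One toy "scatterplot" of n = 3 projection pairs (a_ij, b_ij) for a
# single word pair; the values are hypothetical stand-ins for the
# causal projections described above.
scatter = np.array([[0.0, 1.0],
                    [2.0, 3.0],
                    [4.0, 5.0]])
shuffled = scatter[::-1]  # same points, different vocabulary-word order

def distribution_summary(s):
    """Crude moment-based summary of the empirical distribution: a
    stand-in for the kernel mean embeddings used by RCC."""
    return np.r_[s.mean(axis=0), s.var(axis=0), [(s[:, 0] * s[:, 1]).mean()]]

# Distribution-based view: invariant to permuting the n samples.
print(np.allclose(distribution_summary(scatter),
                  distribution_summary(shuffled)))  # True

# Feature-based view: flattening into one fixed vector, as fed to the
# random forest, is sensitive to the ordering of vocabulary words.
print(np.array_equal(scatter.ravel(), shuffled.ravel()))  # False
```

This is the sense in which feature-based methods may exploit per-feature biases that a permutation-invariant, distribution-level method cannot.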
6. Proxy Variables in Machine Learning
The central concept in this paper is that of a proxy variable. This is a random variable W providing a random source of information related to x and y. However, we can also consider the reverse process: using a static entity w to augment the random statistics about a pair of random variables X and Y. As it turns out, this could be a useful process in general prediction problems.

To illustrate, consider a supervised learning problem mapping a feature random variable X into a target random variable Y. Such a problem is often solved by considering a sample {(x_i, y_i)}_{i=1}^n ∼ P^n(X, Y). In this scenario, we may contemplate an unpaired, external, static source of information w (such as a memory), which might help solve the supervised learning problem at hand. One could incorporate the information in the static source w by constructing the proxy projections w_i = π(x_i, w), adding them to the dataset to obtain {((x_i, w_i), y_i)}_{i=1}^n, and building the predictor f(x, π(x, w)).
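A minimal sketch of this augmentation, under assumed toy data: the static source w (here a small "memory" of reference values, entirely hypothetical) is folded into the supervised problem through an assumed projection π(x, w), and its output is appended to the features that the predictor f(x, π(x, w)) consumes.

```python
import numpy as np

# Hypothetical static entity w, e.g. a memory of reference values.
w = np.array([1.0, 2.0, 3.0])

def pi(x, w):
    """Assumed projection: similarity of the input x to each memory slot."""
    return np.exp(-(w - x) ** 2)

def augment(x):
    """Build the augmented example (x, pi(x, w)) fed to the predictor f."""
    return np.r_[[x], pi(x, w)]

xs = [0.5, 1.5, 2.5]
features = np.stack([augment(x) for x in xs])
print(features.shape)  # (3, 4): the original feature plus three proxy features
```

Any off-the-shelf supervised learner can then be trained on the augmented features, which is the sense in which proxy projections offer a general tool for incorporating static knowledge into prediction tasks.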
7. Conclusion
We have introduced the necessary machinery to estimate the causal relation between pairs of static entities x and y: one piece of art and its forgery, one document and its translation, or the concepts underlying a pair of words appearing in a corpus of natural language. We have done so by introducing the tool of proxy variables and projections, reducing our problem to one of observational causal inference between random entities. Throughout a variety of experiments, we have shown the empirical effectiveness of our proposed method, and we have connected it to the general problem of incorporating external sources of knowledge as additional features in machine learning problems.

Figure 5. Results for all methods on the NLP experiment; accuracies above the significance threshold of a Binomial test are statistically significant.

Figure 6. RCC accuracy versus human confidence.
References
Baayen, H. Word Frequency Distributions. Kluwer, 2001.

Baroni, M., Bernardi, R., Do, N.-Q., and Shan, C.-C. Entailment above the word level in distributional semantics. In EACL, 2012.

Beebee, H., Hitchcock, C., and Menzies, P. The Oxford Handbook of Causation. Oxford University Press, 2009.

Blanco, E., Castell, N., and Moldovan, D. Causal relation extraction. In LREC, 2008.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. arXiv, 2016.

Bottou, L. From machine learning to machine reasoning. Machine Learning, 2014.

Bottou, L., Peters, J., Charles, D. X., Chickering, M., Portugaly, E., Ray, D., Simard, P. Y., and Snelson, E. Counterfactual reasoning and learning systems: the example of computational advertising. JMLR, 2013.

Church, K. and Hanks, P. Word association norms, mutual information, and lexicography. Computational Linguistics, 1990.

Cohen, P. R. DARPA's Big Mechanism program. Physical Biology, 2015.

Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. Inferring deterministic causal relations. arXiv, 2012.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In CVPR, 2016.

Geffet, M. and Dagan, I. The distributional inclusion hypotheses and lexical entailment. In ACL, 2005.

Girju, R., Nakov, P., Nastase, V., Szpakowicz, S., Turney, P., and Yuret, D. Classification of semantic relations between nominals. Language Resources and Evaluation, 2009.

Hashimoto, C., Torisawa, K., De Saeger, S., Oh, J.-H., and Kazama, J. Excitatory or inhibitory: A new semantic orientation extracts contradiction and causality from the web. In EMNLP, 2012.

Hoyer, P., Janzing, D., Mooij, J., Peters, J., and Schölkopf, B. Nonlinear causal discovery with additive noise models. In NIPS, 2009.

Kotlerman, L., Dagan, I., Szpektor, I., and Zhitomirsky-Geffet, M. Directional distributional similarity for lexical inference. Natural Language Engineering, 2010.

Kuipers, B. Commonsense reasoning about causality: deriving behavior from structure. Artificial Intelligence, 1984.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. arXiv, 2016.

Lenci, A. and Benotto, G. Identifying hypernyms in distributional semantic spaces. In *SEM, 2012.

Levesque, H., Davis, E., and Morgenstern, L. The Winograd Schema Challenge. In KR, 2012.

Lopez-Paz, D. From Dependence to Causation. PhD thesis, University of Cambridge, 2016.

Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. Towards a learning theory of cause-effect inference. In ICML, 2015.

Melamud, O., Levy, O., Dagan, I., and Ramat-Gan, I. A simple word embedding model for lexical substitution. In Workshop on Vector Space Modeling for Natural Language Processing, 2015.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv, 2013.

Mirza, P. and Tonelli, S. CATENA: CAusal and TEmporal relation extraction from NAtural language texts. In COLING, 2016.

Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 2016.

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.

Peters, J. Restricted Structural Equation Models for Causal Inference. PhD thesis, ETH Zurich, 2012.

Peters, J., Mooij, J., Janzing, D., and Schölkopf, B. Causal discovery with continuous additive noise models. JMLR, 2014.

Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference using invariant prediction: identification and confidence intervals. JRSS B, 2016.

Radinsky, K., Davidovich, S., and Markovitch, S. Learning causality for news events prediction. In WWW, 2012.

Reichenbach, H. The Direction of Time, 1956.

Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Causal transfer in machine learning. arXiv, 2015.

Roller, S., Erk, K., and Boleda, G. Inclusive yet selective: Supervised distributional hypernymy detection. In COLING, 2014.

Santus, E., Lenci, A., Lu, Q., and Schulte im Walde, S. Chasing hypernyms in vector spaces with entropy. In EACL, 2014.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In ICML, 2012.

Shimizu, S., Hoyer, P., Hyvärinen, A., and Kerminen, A. A linear non-Gaussian acyclic model for causal discovery. JMLR, 2006.

Smola, A., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In ALT. Springer, 2007.

Waldrop, M. M. Causality, structure, and common sense. Science, 1987.

Weeds, J. and Weir, D. A general framework for distributional similarity. In EMNLP, 2003.

Weeds, J., Weir, D., and McCarthy, D. Characterising measures of lexical distributional similarity. In COLING, 2004.

Weeds, J., Clarke, D., Reffin, J., Weir, D., and Keller, B. Learning to distinguish hypernyms and co-hyponyms. In COLING, 2014.

Supplementary material to
Causal discovery using proxy variables
A. Instructions for word pair creators
We will ask you to write word pairs (for instance, WordA and WordB) for which you believe the statement "WordA causes WordB" is true.

To provide us with high quality word pairs, we ask you to follow these indications:

• All word pairs must have the form "WordA → WordB". It is essential that the first word (WordA) is the cause, and the second word (WordB) is the effect.

• WordA and WordB must be one word each (no spaces, and no "recessive gene → red hair"). Avoid compound words such as "snow-blind".

• In most situations, you may come up with a word pair that can be justified both as "WordA → WordB" and "WordB → WordA". In such situations, prefer the causal direction with the easiest explanation. For example, consider the word pair "virus → death". Most people would agree that "virus causes death". However, "death causes virus" can be true in some specific scenario (for example, "because of all the deaths in the region, a new family of virus emerged."). However, the explanation "virus causes death" is preferred, because it is more general and depends less on the context.

• We do not accept word pairs with an ambiguous causal relation, such as "book - paper".

• We do not accept simple variations of word pairs. For example, if you wrote down "dog → bark", we will not credit you for other pairs such as "dogs → bark" or "dog → barking".

• Use frequent words (avoid strange words such as "clithridiate").

• Do not rely on our examples, and use your creativity. We are grateful if you come up with diverse word pairs! Please do not add any numbers (for example, "1 - dog → bark"). For your guidance, we provide you with examples of word pairs that belong to different categories. Please bear in mind that we will reward your creativity: therefore, focus on providing new word pairs with an evident causal direction, and do not limit yourself to the categories shown below.
1) Physical phenomenon: there exists a clear physical mechanism that explains why "WordA → WordB".

• sun → radiation (The sun is a source of radiation. If the sun were not present, then there would be no radiation.)
• altitude → temperature
• winter → cold
• oil → energy
2) Events and consequences: WordA is an action or event, and WordB is a consequence of that action or event.

• crime → punishment
• accident → death
• smoking → cancer
• suicide → death
• call → ring
3) Creator and producer: WordA is a creator or producer, WordB is the creation of the producer.

• writer → book (the creator is a person)
• painter → painting
• father → son
• dog → bark
• bacteria → sickness
• pen → drawing (the creator is an object)
• chef → food
• instrument → music
• bomb → destruction
• virus → death
4) Other categories! Up to you, please use your creativity!

• fear → scream
• age → salary

B. Instructions for word pair validators
Please classify the relation between pairs of words A and B into one of three categories: either "A causes B", "B causes A", or "Non-causal or unrelated".

For example, given the pair of words "virus and death", the correct answer would be:

• virus causes death (correct);
• death causes virus (wrong);
• non-causal or unrelated (wrong).

Some of the pairs that will be presented are non-causal. This may happen if:

• The words are unrelated, like "toilet and beach".
• The words are related, but there is no clear causal direction. This is the case of "salad and lettuce", since we can eat salad without lettuce, or eat lettuce in a burger.

To provide us with high quality categorization of word pairs, we ask you to follow these indications:

• Prefer the causal direction with the simplest explanation. Most people would agree that "virus causes death". However, "death causes virus" can be true in some specific scenario (for example, "because of all the deaths in the region, a new virus emerged."). However, the explanation "virus causes death" is preferred, because it is true in more general contexts.

• If no direction is clearer, mark the pair as non-causal. Here, conservative is good!

• Think twice before deciding. We will present the pairs in random order!

Please classify all the presented pairs. If one or more has not been answered, the whole batch will be invalid. PLEASE DOUBLE CHECK THAT YOU HAVE ANSWERED ALL 40 WORD PAIRS.
Examples of causal word pairs:

• "sun and radiation": sun causes radiation
• "energy and oil": oil causes energy
• "punishment and crime": crime causes punishment
• "instrument and music": instrument causes music
• "age and salary": age causes salary

Examples of non-causal word pairs:

• "video and games": non-causal or unrelated
• "husband and wife": non-causal or unrelated
• "salmon and shampoo": non-causal or unrelated
• "knife and gun": non-causal or unrelated