Causal Relational Learning
Babak Salimi, Harsh Parikh, Moe Kayali, Sudeepa Roy, Lise Getoor, Dan Suciu
University of Washington, Duke University, University of California at Santa Cruz

April 9, 2020

Abstract
Causal inference is at the heart of empirical research in the natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods critically rely on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CaRL for capturing causal background knowledge and assumptions and for specifying causal queries using simple Datalog-like rules. CaRL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CaRL in social sciences and healthcare.
The importance of causal inference for making informed policy decisions has long been recognised in health, medicine, social sciences, and other domains. However, today's decision-making systems typically do not go beyond predictive analytics and thus fail to answer questions such as "What would happen to revenue if the price of X is lowered?" While predictive analytics has achieved remarkable success in diverse applications, it is mostly restricted to fitting a model to observational data based on associational patterns [39]. Causal inference, on the other hand, goes beyond associational patterns to the process that generates the data, thereby enabling analysts to reason about interventions (e.g., "Would requiring flu shots in schools reduce the chance of a future flu epidemic?") and counterfactuals (e.g., "What would have happened if past flu shots were not taken?"). This adds significantly more information in data analysis compared to simple correlation or regression analysis; e.g., as the number of flu cases increases, the rate of flu shots might also increase, but that does not imply that giving flu shots increases the spread of flu. This emphasizes the common saying that "correlation is not causation", which is known to all, but easy to overlook when analyzing data for insights and possible actions.

The gold standard in causal analysis is performing randomized controlled trials, where the subjects or units of study are assigned randomly to a treatment or a control (i.e., withheld from the treatment) group. The difference between the distributions of the outcome variable of the treated and control groups represents the causal effect of the treatment on the outcome. However, controlled experiments are not always feasible due to ethical, legal, or cost constraints [47, 2].
∗ This is an extended version of a paper accepted for publication in the Proceedings of the 2020 International Conference on Management of Data [51].

An attractive alternative that has been used in statistics, economics, and social sciences simulates controlled experiments using
Figure 1: Number of publications that include observational studies vs. controlled experiments (obtained from SemanticScholar [53]).

available observational data. While we can no longer assume that the treatment has been randomly assigned, under appropriate assumptions we can still estimate causal effects. Rubin's Potential Outcome Framework [44] and Pearl's Causal Models [36] (reviewed in Section 2) are two well-established frameworks which have been extensively studied in the literature and used in various applications for causal inference from observational data [6, 46, 35, 31, 2]. A quick search on SemanticScholar reveals a growing interest in observational studies compared to controlled experiments, as shown in Figure 1.

Causal frameworks, however, rely on the critical assumption that the units of study are sampled from a population of homogeneous units; in other words, the data can be represented in a single flat table. This assumption is called the unit homogeneity assumption [17, 2]. In many real-world settings, however, the study domain consists of heterogeneous units that have a complex relational structure, and the data is naturally represented as multiple related tables. For instance, as presented later in our experiments with real data [20, 15], hospitals record, in several tables, information about patients, medical practitioners, hospital stays, treatments performed, insurance, bills, and so on. Standard notions used in causal analysis, such as units or subjects who receive a treatment, no longer readily apply to relational data, preventing us from adopting existing causal inference frameworks in relational domains. We illustrate these challenges with the following example.
Example 1.1 (Review Data) OpenReview [34] is a collection of paper submissions from several conferences, mostly in ML and AI, along with their reviews. Significantly, the collection contains review scores and author information for both accepted and rejected papers. Scopus [54] is a large, well-maintained database of peer-reviewed literature, including scientific journals, books, and conference proceedings. The Shanghai University Ranking [40] is one of three authoritative global university rankings. We crawled and integrated these sources to produce a relational database, which we show in simplified form in Figure 2. Data sources like this represent a treasure trove of information for the leadership of scientific conferences and journals. For example, they can help answer questions like "Does double-blind reviewing achieve its desired effect?" or "Does increasing (or decreasing) the page limit affect paper quality, and if so, how?". To answer these real-life questions, discovering association is not sufficient; instead, decision makers need to know if there exists a causal effect. For example, suppose a conference currently requires double-blind submissions, and the leadership is questioning the effectiveness of this policy. If the leadership reverts to single-blind submissions, would that represent an unfair advantage for authors from prestigious institutions? Given a dataset like that in Figure 2, one can run a few SQL queries and check whether those authors consistently get better reviews at single- rather than double-blind venues. However, this only proves or disproves correlation, not causation. Alternatively, one could apply Rubin's Potential Outcome Framework [44, 45], but that requires data to be presented as a single table of independent units. Doing this naively on our dataset (e.g., computing the universal table [60]) means one cannot account for what statisticians refer to as interference and contagion effects [58, 32], both of which prohibit standard causal analysis. For instance, the prestige of an author not only influences his or her acceptance rate, but also has a spill-over effect on the acceptance rate of his or her co-authors; this is called interference. Further, authors' qualifications can be contagious, meaning that if a junior author collaborates frequently with a senior author, the overall quality of his or her own research may increase over time.
Our contributions. In this paper, we propose a declarative framework for Causal Relational Learning, a foundation for causal inference over relational domains. Our first contribution is a declarative language, CaRL (Causal Relational Language), for representing causal background knowledge and assumptions in relational domains (Section 3). CaRL can represent complex causal models using just a few rules. The syntax of CaRL is designed to be intuitive for users to represent complex causal models and ask causal queries, while the details of their semantics and query answering are abstracted away from users, who need not be statisticians.
Authors
person   prestige   qualification (h-index)
Bob      1          50
Carlos   0          20
Eva      1          2

Submissions
sub   score
s1    0.75
s2    0.4
s3    0.1

Authorship
person   sub
Bob      s1
Eva      s1
Eva      s2
Eva      s3
Carlos   s3

Submitted
sub   conf
s1    ConfDB
s2    ConfAI
s3    ConfAI

Conferences
conf     blind
ConfDB   Single
ConfAI   Double
Figure 2: A multi-relational Review Data instance.

Our second contribution is to define semantics for complex causal queries, where the treatment units and outcome units might be heterogeneous and where controlling for confounding may require performing multiple joins and aggregates (Section 4). Using CaRL, we can answer complex causal queries such as: "What is the effect of not having insurance on the mortality of a patient in a critical care unit?", where we are interested in estimating the average treatment effect (defined later), or "What is the effect of authors' collaborators' prestige on the acceptance of a paper?", where we are interested in estimating the average relational effect; several other types of queries are also supported.

Our third contribution is an algorithm for answering causal queries from the given relational data (Section 5). The algorithm performs a static analysis of the causal query and constructs a unit table specific to the query and the relational causal model by identifying a set of attributes that are sufficient for confounding adjustment. The constructed unit table is amenable to sound causal inference using existing techniques.

Finally, we present an end-to-end experimental evaluation of CaRL on both real and synthetic data (Section 6). The experiments are conducted on the following real-world relational datasets: 1) Review Data [34, 40, 41], 2) MIMIC-III (Medical Information Mart for Intensive Care) [20], and 3) NIS (National Inpatient Sample) [15]. We examine the following causal queries:

• Review Data. What is the effect of authors' prestige on the scores given by the reviewers under single-blind and double-blind review processes?
• MIMIC-III. What is the effect of not having insurance on a patient's mortality and length of hospital stay?
• NIS. What is the effect of hospital size on healthcare affordability?

In each setting, we report contrasts between correlation and causation, further highlighting the need for principled causal analysis. Evaluation of CaRL on synthetic data shows that causal analysis that ignores the relational structure of the data fails to recover the ground truth, whereas CaRL successfully recovers accurate results.
In this section we review fundamental concepts in causal analysis. We use capital letters X to denote random variables and lower-case letters x to denote their values. We use boldface X, x to denote tuples of random variables and constants, respectively; Dom(X) denotes the domain of variable X.

Probabilistic causal models.
A probabilistic causal model [37] is a tuple M = ⟨U, V, Pr_U, F⟩, where U is a set of exogenous variables that cannot be observed, V is a set of observable or endogenous variables, and Pr_U is a joint probability distribution on the exogenous variables U. The set F = (F_X)_{X ∈ V} is a set of non-parametric structural equations of the form F_X : Dom(Pa_V(X)) × Dom(Pa_U(X)) → Dom(X), where Pa_U(X) ⊆ U and Pa_V(X) ⊆ V − {X} are called the exogenous parents and endogenous parents of X, respectively. Intuitively, the exogenous variables U are not known, but we know their probability distribution; the endogenous variables are completely determined by their parents (exogenous and/or endogenous).

Causal DAGs.
A probabilistic causal model is associated with a causal DAG (directed acyclic graph) G, whose nodes are the endogenous variables V and whose edges are all pairs (Z, X) such that Z ∈ Pa_V(X). The causal DAG hides the exogenous variables (since we cannot observe them anyway) and instead captures their effect by inducing a probability distribution on the endogenous variables. We will only refer to endogenous variables in the rest of the paper and drop the subscript V from Pa_V. Similarly, we will drop the subscript U from the probability distribution Pr_U when it is clear from the context. Then the formula for Pr(V) is the same as that for a Bayesian network (this is possible under the causal sufficiency assumption: for any two variables X, Y ∈ V, their exogenous parents are disjoint and independent, Pa_U(X) ⊥⊥ Pa_U(Y); when this assumption fails, one adds more endogenous variables to the model to expose their dependencies):
Figure 3: A standard causal DAG for Example 3.2, with nodes Qualification, Quality, Prestige, and Score.

Pr(V) = ∏_{X ∈ V} Pr(X | Pa(X))    (1)

Figure 3 shows a simple example of a causal graph based on Example 1.1: the Score of a paper is affected by its Quality and by the Prestige of the author (assuming the reviews are single-blind), whereas both Quality and Prestige are affected by the author's Qualification. Here V = {Qualification, Quality, Prestige, Score} are endogenous variables; the exogenous variables U are unknown (e.g., the mood of a reviewer while reviewing the paper, the expected number of papers to be accepted, the scores of other papers the reviewer reviewed, etc.), leading to a probability distribution on V. The dependencies can be represented by three structural equations:

Quality ⇐ Qualification;  Prestige ⇐ Qualification;  Score ⇐ Quality, Prestige.    (2)

Interventions and the do operator. Causal models give semantics to interventions. An intervention represents actively setting an endogenous variable to some fixed value and observing the effect, denoted by the do-operator introduced by Pearl [38]. Formally, an intervention do(W = w) consists of setting variables W ⊆ V to some values W = w, and it defines the probability distribution Pr(V | do(W = w)) given by (1), where we remove all factors Pr(X | Pa(X)) with X ∈ W. In other words, we modify the causal DAG by removing all edges entering the variables W on which we intervene; this fundamentally differs from conditioning, Pr(V | W = w). Pearl has an extensive discussion of the rationale for the do-operator and describes several equivalent formulas for estimating the effect of do(W = w) from an observed distribution [37].

Average treatment effect (ATE).
Causal analysis estimates the effect of a treatment variable T (typically a binary variable) on some outcome variable Y. This effect is often measured by the following quantity, known as the average treatment effect (ATE), which is expressed as follows in our notation:

ATE(Y, T) = E[Y | do(T = 1)] − E[Y | do(T = 0)]    (3)

Much of the literature on causal inference in statistics addresses efficient estimation of the ATE from observational data.

Unit of analysis and SUTVA.
Both Pearl's [37] and Rubin's [45] frameworks rely on the assumption that the study domain consists of a set of units, or physical objects (e.g., authors, patients, publications, etc.), that can be subject to a treatment/intervention and exhibit a response to it. Furthermore, they rely on the assumption of no interference between the units, or the Stable Unit Treatment Value Assumption (SUTVA) [45]. Intuitively, SUTVA states that intervening on or treating a unit does not have any consequences on the response of other units. In settings where the units of analysis are relationally connected, this assumption is typically violated. In Example 1.1, the prestige of an author (treatment) influences the acceptance chance (response) of his or her co-authors and collaborators, which leads to a violation of SUTVA.
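The do-semantics and the ATE in (3) can be made concrete with a small simulation of the toy DAG in Figure 3. The functional forms, coefficients, and noise terms below are illustrative assumptions of ours, not part of any model in this paper.

```python
import random

def simulate(n=20000, do_prestige=None, seed=0):
    """Average Score over n samples from a toy version of Figure 3.

    With do_prestige set, the intervention do(Prestige = do_prestige) is
    applied: the edge Qualification -> Prestige is cut and Prestige is
    forced to the given constant; all other mechanisms stay intact.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        qualification = rng.gauss(50, 15)        # exogenous noise enters here
        if do_prestige is None:
            prestige = 1 if qualification > 55 else 0
        else:
            prestige = do_prestige               # intervention ignores parents
        quality = 0.01 * qualification + rng.gauss(0, 0.1)
        total += 0.6 * quality + 0.2 * prestige + rng.gauss(0, 0.05)
    return total / n

# ATE(Score, Prestige) = E[Score | do(Prestige=1)] - E[Score | do(Prestige=0)].
# The direct effect is 0.2 by construction; sharing the seed gives the two
# runs common random numbers, so the estimate is essentially exact.
ate = simulate(do_prestige=1) - simulate(do_prestige=0)
```

Conditioning instead of intervening, i.e., observationally comparing high- and low-prestige authors, would mix in the confounding path through Qualification, which is exactly the correlation-versus-causation gap discussed above.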
Related Work.
Previous work has studied causal inference in the presence of interference [31, 11, 13, 14, 58, 61, 4, 55, 32, 30]. These works address applications such as the study of infectious diseases [61, 14] or behavior and interactions in social networks [55, 30, 57, 11, 61, 24, 18]. But in these studies the units are still homogeneous (e.g., people connected by a social network), and they are unable to capture different entities of interest, such as papers, authors, and reviews, and the complex many-to-many relationships among them that we focus on in CaRL. There has been prior work on learning causality from relational data [23, 3, 22, 25]; it focuses on discovering the structure of probabilistic graphical models for this data. These models were originally proposed for Statistical Relational Learning, which aims to model a joint probability distribution over relational data amenable to probabilistic reasoning rather than causal inference [9]. This line of work differs from ours in that our objective is to develop a declarative framework to answer complex causal queries about the effect of interventions, given the existing background knowledge. Note that causality has been used in various other contexts [28], namely, to understand responsibility in query answering [26, 48], in database repair [27, 49, 7], and to motivate explanations and diagnosis [43, 50, 7]. It has also inspired applications such as hypothetical reasoning [5, 21, 8, 29]. These differ from our work in that they identify parts of the input that are correlated with the output of a transformation, which is useful but does not reflect the true causality needed for decision making.
In this section, we present our declarative language CaRL (Causal Relational Language), which extends causal modeling to relational data by allowing the user to (1) specify assumptions and background knowledge on the interactions among heterogeneous units (Section 3.2), and (2) pose various causal queries (Section 3.3). We start with our data model, which forms the basis for our language.
The input schema for CaRL corresponds to any standard multi-relational database, e.g., Review Data in Figure 2, but we assume the data is given in the following 'entity-relationship-attribute' form for a simpler generalization of Pearl's causal models. A relational causal schema is a tuple S = (P, A), where P = E ∪ R represents a set of entities E and their relationships R, and A represents a set of attribute functions (or simply attributes) that encode the standard attributes of the entities and their relationships, with the only difference that some of these attributes may be 'unobserved', with missing values in all instances. The entities and their relationships are denoted by P(.), the attribute functions are denoted by A[.], and A_Obs ⊆ A denotes the set of observed attribute functions. We illustrate the mapping from the standard relational model to a relational causal schema using our running example.

Example 3.1
The relational causal schema corresponding to the relational Review Data in Figure 2 (with renames) is:

P = {Person(A), Author(A, S), Submission(S), Submitted(S, C), Conference(C)}
A = {Prestige[A], Qualification[A], Score[S], Blind[C], Quality[S]}

Here P consists of the entities in the Review Data, E = {Person, Submission, Conference}, and their relationships, R = {Author, Submitted}. The attribute functions A correspond to the attributes of these entities and relationships: Prestige[A] = the prestige of the author's institution (e.g., rankings); Qualification[A] = the qualification of an author measured by h-index; Score[S] ∈ [0, 1] = the average score reviewers gave to a submission; Blind[C] = whether a conference's review policy is single- or double-blind; Quality[S] = the quality of a submission. Note that Quality is missing in the Submissions table in Figure 2, since it is an unobserved attribute function; it will be used in causal analysis based on our background knowledge that the quality of a submission may have an impact on its score.

Observed instance and relational skeleton (instance).
Similar to a standard database instance for a standard relational schema (as shown in Figure 2), an observed relational instance (or simply an instance) conforms to a given relational causal schema S = (P, A) with specific values (i.e., constants); however, some (unobserved) attribute functions may be missing in the instance (like Quality). The set of (constant, or grounded) entities and relationships in an instance (excluding the grounded attribute functions) is referred to as the relational skeleton of the instance and is denoted by ∆.

Example 3.2
For the relational causal schema given in Example 3.1 and the instance in Figure 2, the relational skeleton comprises entities and relationships like Person(“Bob”), Submission(“s1”), Author(“Bob”, “s1”), etc. The observed instance comprises the relational skeleton and attribute functions like Score[“s1”], Blind[“ConfDB”], etc., but not unobserved attributes like Quality[“s1”]. Note that all observed attribute functions assume a fixed value in any given instance.
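For concreteness, the skeleton ∆ and the observed attribute functions of Figure 2 can be held in plain dictionaries; this encoding is our illustrative choice, not part of CaRL's specification, and the names follow Example 3.1.

```python
# Relational skeleton: grounded entities and relationships (no attributes).
skeleton = {
    "Person": [("Bob",), ("Carlos",), ("Eva",)],
    "Submission": [("s1",), ("s2",), ("s3",)],
    "Conference": [("ConfDB",), ("ConfAI",)],
    "Author": [("Bob", "s1"), ("Eva", "s1"), ("Eva", "s2"),
               ("Eva", "s3"), ("Carlos", "s3")],
    "Submitted": [("s1", "ConfDB"), ("s2", "ConfAI"), ("s3", "ConfAI")],
}

# Observed attribute functions, keyed by grounded units; the unobserved
# attribute Quality is deliberately absent, mirroring Figure 2.
attributes = {
    "Prestige": {("Bob",): 1, ("Carlos",): 0, ("Eva",): 1},
    "Qualification": {("Bob",): 50, ("Carlos",): 20, ("Eva",): 2},
    "Score": {("s1",): 0.75, ("s2",): 0.4, ("s3",): 0.1},
    "Blind": {("ConfDB",): "Single", ("ConfAI",): "Double"},
}
```

Each observed attribute function is a map from a tuple of grounded entities to a fixed value, matching the statement that observed attribute functions assume a fixed value in any given instance.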
The first step in using CaRL is encoding the user's background knowledge about potential causal dependencies among attributes in an application. This is expressed in CaRL through a set of relational causal rules (defined below) that capture the causal assumptions. We refer to the set of relational causal rules specified by the user as the relational causal model.

(For the purpose of this paper, we assume that the input is given in the form of a relational causal schema; we envision that in an end-to-end system with a graphical user interface, this mapping will be done semi-automatically with the user's input. There can be other measures of qualification as well, e.g., the number of publications or citations, or experience in terms of years.)

Definition 3.3 A relational causal rule over a relational causal schema S = (P, A) has the following form:

A0[X0] ⇐ A1[X1], . . . , Ak[Xk] WHERE Q(Y)    (4)

Here, A0, A1, · · · , Ak ∈ A are attribute functions, Q is a (standard) conjunctive query over the schema P, and X0, Xi (i = 1, · · · , k), Y are sets of variables and/or constants. All variables in X0 ∪ ⋃i Xi must also occur in Y. We call A0[X0] the head of the rule, A1[X1], . . . , Ak[Xk] the body of the rule, and Q(Y) the condition. We denote by φ_A the set of rules with head A.

Example 3.4
Consider the following relational causal model Φ for Review Data in Figure 2.

Prestige[A] ⇐ Qualification[A] WHERE Person(A)    (5)
Quality[S] ⇐ Qualification[A], Prestige[A] WHERE Author(A, S)    (6)
Score[S] ⇐ Prestige[A] WHERE Author(A, S)    (7)
Score[S] ⇐ Quality[S] WHERE Submission(S)    (8)

Rule (5) says that the qualification of a person causally affects his or her institution's prestige; rule (6) says that the quality of a submission is affected by its authors' qualifications and prestige (authors from prestigious institutions have access to more resources); rules (7) and (8) say that reviewers' scores are based on the quality of a submission but may also be influenced by the prestige of its authors.

A major advantage of specifying background knowledge using causal rules is that users simply express intuitive potential causal dependencies among attributes, without specifying 'how' or associating any 'weight' with them, while CaRL uses the rules to answer different causal queries (Section 3.3). A grounded rule is a rule of the form (4) that contains only constants from a given instance (no variables) and has no condition (i.e., Q ≡ true). A relational causal rule is a template for generating multiple grounded rules.

Definition 3.5
Let ∆ be a relational skeleton. Fix a rule of the form (4), and let Z denote all variables occurring in X0 ∪ X1 ∪ . . . ∪ Xk. We associate to this rule the set of grounded rules obtained by substituting Z with any set of constants z such that ∆ |= Q([Y/z]). In other words, the query Q must be true in the database ∆ after substituting the variables Z with the constants z and treating the variables Y − Z as existentially quantified.

Given a relational causal model Φ comprising a set of relational causal rules, and a relational skeleton ∆ comprising the entities and relationships in an instance, Φ∆ denotes the set of all grounded rules. From Φ∆, we construct the relational causal graph G(Φ∆). The vertices of G(Φ∆), denoted A∆, comprise all grounded attributes A[x] in Φ∆; recall that x represents a tuple of constants: an attribute function A corresponding to an entity has a single constant parameter, as in Example 3.2, whereas an A corresponding to a relationship predicate will have multiple parameters. The edges of G(Φ∆) are all pairs (A[x], Aj[xj]) where A[x] and Aj[xj] appear in the head and body, respectively, of a grounded rule (4). We assume that the relational causal model is non-recursive; therefore, the causal graph is a DAG.

Example 3.6
Given the skeleton ∆ in Figure 2, Φ generates the following grounded rules:

Prestige[“Bob”] ⇐ Qualification[“Bob”]   (also for “Carlos”, “Eva”)
Quality[“s1”] ⇐ Qualification[“Bob”], Qualification[“Eva”]
Quality[“s2”] ⇐ Qualification[“Eva”]
Quality[“s3”] ⇐ Qualification[“Carlos”], Qualification[“Eva”]
Score[“s1”] ⇐ Quality[“s1”], Prestige[“Bob”], Prestige[“Eva”]
Score[“s2”] ⇐ Quality[“s2”], Prestige[“Eva”]
Score[“s3”] ⇐ Quality[“s3”], Prestige[“Carlos”], Prestige[“Eva”]    (9)

These in turn lead to the causal graph shown in Figure 4.

(This fact, along with the declarative nature of the language, makes CaRL friendlier to users who are not causal inference experts. While our language allows for recursive rules, which capture feedback loops and contagion, their treatment is beyond the scope of this paper and is interesting future work.)
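A minimal sketch of grounding (Definition 3.5) for the single rule Score[S] ⇐ Prestige[A] WHERE Author(A, S): every binding of (A, S) that satisfies Author(A, S) in ∆ contributes one parent edge. A full implementation would evaluate arbitrary conjunctive queries; the list encoding below is our own simplification.

```python
from collections import defaultdict

# Author relationship from the skeleton in Figure 2.
author = [("Bob", "s1"), ("Eva", "s1"), ("Eva", "s2"),
          ("Eva", "s3"), ("Carlos", "s3")]

# Ground Score[S] <= Prestige[A] WHERE Author(A, S): each satisfying
# binding (a, s) adds Prestige[a] to the parents of Score[s].
parents = defaultdict(set)
for a, s in author:
    parents[("Score", s)].add(("Prestige", a))
```

This reproduces the Prestige parents of the Score nodes in (9); e.g., Score[“s1”] gets both Prestige[“Bob”] and Prestige[“Eva”], while Score[“s2”] gets only Prestige[“Eva”].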
Figure 4: Relational causal graph corresponding to the grounded rules in Example 3.6.

Note that the relational causal graph in Figure 4 is an extension of the standard causal DAG (in Pearl's model [36]) shown in Figure 3: the latter describes the potential causal dependence of the attributes, whereas the former describes a more fine-grained version based on the entities and relationships in the relational data. For example, we do not have a single node Score, as in Figure 3, but instead have many nodes Score[“s1”], Score[“s2”], etc., one for each submission in ∆ in Figure 4. As in Section 2, the causal graph G(Φ∆) defines a joint probability distribution with one conditional probability for each grounded attribute A[x]:

Pr(A[x] | Pa(A[x]))    (10)

We describe these conditional probabilities in Section 4.1.

Using CaRL, one can extend the set of attribute functions A with new aggregated attribute functions using aggregate rules of the following form, for A ∈ A:

AGG_A[W] ⇐ A[X] WHERE Q(Z)    (11)

Here, Z ⊇ X ∪ W and AGG is an aggregate function on A, e.g., AVG (average) or VAR (variance). The new aggregated attribute functions AGG_A are included in the extended attribute functions A (for simplicity, we use A for both the given and the extended attribute functions). Similar to relational causal rules, aggregate rules define a set of grounded rules with corresponding vertices and edges in the relational causal graph G(Φ∆). However, instead of a conditional probability distribution, a deterministic function AGG(Pa(AGG_A[w])) is associated with each grounded AGG_A[w]. For example, the following aggregate rule defines the average review score for each author:

AVG_Score[A] ⇐ Score[S] WHERE Author(A, S)    (12)

Figure 5 shows the extension of Figure 4 with (12).

Once the relational causal model Φ is specified, users can start asking causal queries. CaRL supports three types of causal queries of the following forms (their semantics are discussed in Section 4.4, and answering these queries is discussed in Section 5). The ATE query extends the standard notion of ATE (discussed in Section 2) to relational data. CaRL also supports queries for aggregated response, isolated effect, and relational effect.
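The aggregate rule (12) above is deterministic: it amounts to a plain group-by average over the Score nodes that Author(A, S) connects to each author. A sketch with the Figure 2 values (the dictionary encoding is our own illustrative choice):

```python
from collections import defaultdict

# Authorship and observed scores from Figure 2.
author = [("Bob", "s1"), ("Eva", "s1"), ("Eva", "s2"),
          ("Eva", "s3"), ("Carlos", "s3")]
score = {"s1": 0.75, "s2": 0.4, "s3": 0.1}

# AVG_Score[A] <= Score[S] WHERE Author(A, S): a deterministic aggregate
# over the Score parents of each grounded AVG_Score node.
by_author = defaultdict(list)
for a, s in author:
    by_author[a].append(score[s])
avg_score = {a: sum(v) / len(v) for a, v in by_author.items()}
# avg_score["Eva"] averages the scores of s1, s2, and s3.
```

Unlike the probabilistic nodes in (10), these grounded AVG_Score nodes carry no conditional distribution; their value is a function of their parents.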
Average treatment effect (ATE) query.
An ATE query estimates the average treatment effect (see Section 2) of a treatment attribute T[X] on a response attribute Y[X′] and has the following form (formally defined in Section 4.4.1):

Y[X′] ⇐ T[X]?    (13)

This asks "what is the effect of T on Y?". For example, the query Score[S] ⇐ Prestige[A]? computes the ATE of the Prestige of authors on the Score of a paper, i.e., it compares papers' scores in two hypothetical worlds in which all authors are, and are not, affiliated with prestigious institutions (the causal effect of only 'some' authors being from prestigious institutions can be estimated with the relational-effect queries described below). Following the standard assumption of binary treatments in the causality literature, we require the treatment attribute to have a binary domain, which can be enforced by using a threshold or a predicate on a non-binary domain.
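For instance, a non-binary attribute such as an institution's rank can be turned into a binary treatment with a threshold before posing an ATE query; the university names, ranks, and the cutoff of 50 below are all hypothetical.

```python
# Hypothetical institution ranks; treat "prestigious" as rank within the top 50.
ranking = {"UnivA": 3, "UnivB": 120, "UnivC": 48}

def binarize(rank, cutoff=50):
    """Map a non-binary attribute to a binary treatment value T in {0, 1}."""
    return 1 if rank <= cutoff else 0

prestige = {u: binarize(r) for u, r in ranking.items()}
```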
Figure 5: Extended relational causal graph from Figure 4 with the aggregated attribute AVG_Score[A] defined by (12). The directed path from relational peer Eva's prestige to the average score of Bob is highlighted (Section 4.3).

Aggregated response query. An aggregated response query allows causal analysis on an aggregated form of the response variable and has the following syntax (formally defined in Section 4.4.2):

AGG_Y[X′] ⇐ T[X]?    (14)

For example, AVG_Score[A] ⇐ Prestige[A]? computes the treatment effect of the prestige of authors on the average score received by an author.

Relational, isolated, and overall effects queries. In relational domains, units that are relationally connected can have a causal influence on each other. For example, the Prestige of an author not only influences their average submission scores but also their collaborators' average submission scores. To measure such complex relational causal interactions, CaRL supports queries of the following form, which output three quantities: relational, isolated, and overall causal effects (formally defined in Section 4.4.3):

Y[X′] ⇐ T[X]? WHEN ⟨cnd⟩ PEERS TREATED    (15)

where ⟨cnd⟩ is a condition with the following grammar:

⟨cnd⟩ ← ⟨LESS | MORE⟩ THAN k% | AT ⟨MOST | LEAST⟩ k | EXACTLY k | ALL | NONE    (16)

For example, the query Score[S] ⇐ Prestige[A]? WHEN ALL PEERS TREATED computes three values for the (i) isolated (an author's own prestige), (ii) relational (his/her coauthors' prestige), and (iii) overall (all authors' prestige) effect of prestige on a submission's score.
This section defines the semantics of the causal queries described in Section 3.3. We fix a relational causal schema S, a relational skeleton ∆, and a relational causal model Φ with a corresponding grounded causal graph G(Φ∆). For an attribute function A ∈ A, let U_A denote the set of all tuples of grounded entities x such that A[x] ∈ A∆. For example, U_Prestige consists of all authors, e.g., {“Bob”, “Eva”, “Carlos”}, whereas U_Score consists of all submissions, e.g., {“s1”, “s2”, “s3”}. We refer to each element x ∈ U_A as a unit of an attribute function A.

As discussed in Section 2, a causal DAG associates a conditional probability distribution Pr(X | Pa(X)) with each node X ∈ V; these conditional probability distributions are unknown and must be estimated from available data even for the standard causal graphs described in Section 2, and there are additional challenges for relational causal graphs. As described in Section 3.2, in CaRL the relational causal graph G(Φ∆) is obtained by grounding the rules w.r.t. the skeleton database ∆, and the number of nodes is not fixed ahead of time but depends on ∆.

We introduce the following structural homogeneity assumption, which is critical in CaRL for estimating the conditional probability distributions from a given observed dataset, and thereby conducting causal inference. Recall that A∆ denotes the set of all groundings of an attribute A ∈ A:
9, 2020•
Structural Homogeneity:
All grounded attributes A [ x ] ∈ A ∆ of the same attribute A ∈ A share the samestructural equation and, hence, the same conditional probability distribution in equation (10).For instance, in Example 3.4, we assume that all groundings of type Prestige have the same structural equations.Note that the structural homogeneity assumption concerns only the underlying causal model in relational domainsthat consist of heterogeneous units; It is fundamentally different from the assumption of homogeneous units made intraditional causal inference (cf. Section 1).The structural homogeneity assumption, however, is not easily captured because different groundings of the sameattribute can have different number of parents. For instance, consider the atoms Score [“ s and Score [“ s fromequation (9). Score [“ s has two Prestige parents (since it has two authors, “Eva” and “Bob” ), whereas Score [“ s has one Prestige parent ( “Eva." ). We address this issue by using another layer of aggregate functions, that we call embeddings , ψ , and change Equation (10) to Pr (cid:16) A [ x ] | Ψ A (cid:0) Pa ( A [ x ]) (cid:1)(cid:17) (17)where Ψ A is a collection of mappings that projects the parents of A [ x ] into a low-dimensional vector with fixeddimensionality for all A [ x ] ∈ A ∆ . Intuitively, we assume that the mappings provide sufficient statistics for evaluatingthe underling structural questions corresponding to all A [ x ] ∈ A ∆ . More formally, we assume that Ψ A is known andconsists of a set of mappings { ψ AA , ψ AA , . . . } , one for each type of attribute A j occurring on the RHS of a rule (4),where each ψ AA j is an embedding function that maps the set of values of all parents of type A j into a low-dimensional embedding space with fixed dimensionality. The embedding function can be a simple aggregate like average; othertypes of embeddings are discussed in Section 5.2.2. Example 4.1
Consider the three nodes of type Score for "s1", "s2", "s3" in Figure 4, and consider their parents of type Prestige. The number of their parents is 2 (for "s1": "Bob" and "Eva"), 1 (for "s2": "Eva"), and 2 (for "s3": "Eva" and "Carlos"), respectively (the prestige values of the authors are given in Figure 2). Under the homogeneity assumption, however, the conditional probability of scores given the prestige of the authors is computed by the same function, using a mapping ψ^Score_Prestige to aggregate the vectors of Prestige values; we discuss choices for this aggregate function in Section 5.2.2.

To summarize, the grounded causal graph G(Φ, ∆) defined by a relational causal model defines a joint probability distribution given by:

Pr(𝒜^∆) = ∏_{A[x] ∈ 𝒜^∆} Pr( A[x] | Ψ_A( Pa(A[x]) ) )    (18)

In some scenarios the structural homogeneity assumption may not hold; for instance, the structural equations for single-blind and double-blind conferences can be different. Such situations can be expressed in CaRL by adding multiple rules at different granularities in which the structural homogeneity assumption is perceived to hold, e.g.,

SBlind_Score[S] ⇐ Quality[S] WHERE Submission(S)
DBlind_Score[S] ⇐ Quality[S] WHERE Submission(S)

In standard causal analysis, the units can be considered tuples in a single unit table, with one attribute corresponding to the treatment and another attribute corresponding to the response. For instance, in the schema given in Figure 2 and the relational causal graph in Figure 4, one could analyze the causal effect of the qualification of authors on their prestige, and then the authors form both the treated and response units.
In contrast, for multi-relational causal analysis in CaRL, when one analyzes the causal effect of the prestige of authors on the scores of submissions, then intuitively the authors form the treated units and the submissions form the response units. Even when authors (or submissions) form both the treated and response units, CaRL allows the inclusion of additional attributes from other relations that are covariates required for answering causal queries (see Section 5.1). Next we formally define these concepts.

In relational causal analysis, we are given a treatment attribute function T[X] ∈ 𝒜 and a response attribute function Y[X′] ∈ 𝒜; the set of units U_T (resp. U_Y) denotes the entities or relationships corresponding to the treatment (resp. response) attribute function T (resp. Y). For example, to study the effect of authors' prestige on submission scores, Prestige[A] is the treatment attribute function and Score[S] is the response attribute function; U_Prestige denotes all authors as treated units and U_Score denotes all submissions as response units (we assume without loss of generality that the attribute function names are unique and correspond to a single entity or relationship). We assume the treatment attribute has binary values, whereas the response can be any real number.

Given a set of treated units U_T = {x₁, x₂, …} and a binary vector t⃗ = (t₁, t₂, …), we are interested in the effect of a set of interventions do(T(x_i) = t_i) for all treated units x_i, where each intervention replaces the NSE associated with T(x_i) with a constant t_i. In our example of the effect of prestige on score, the vector t⃗ corresponds to a particular assignment of prestige to all authors; e.g., the vector 1⃗ identifies an intervention that hypothetically changes all authors' affiliations to prestigious ones. By abuse of notation, we denote with do(T[S] = t⃗_S) a set of interventions in which an arbitrary subset of treated units S ⊆ U_T receives t⃗_S (with an implicit assumption on the order of elements in the set S). Having treated/response units and the treatment vectors allows us to have (1) non-uniform units that may be different entities or relationships, and (2) different types of treatments, e.g., forcing all authors to be from prestigious institutions, as 1⃗, vs. one or some of the authors, as (1, 0, 0, ⋯). CaRL aims to answer causal queries that compare the average response of the response units U_Y under two alternative intervention strategies t⃗ and t⃗′ applied to the treated units U_T, which we discuss next.

Before we can formalize the semantics of the causal queries described in Section 3.3, especially the isolated and relational effects, we need to establish a one-to-one correspondence between treated and response units by using aggregations carefully. To this end, we first define relational paths.
Definition 4.2 A relational path is a sequence of entities and relationships of the following form:

P : E₁(X₁) ←R₁(X₁,X₂)→ E₂(X₂) ⋯ E_{ℓ−1}(X_{ℓ−1}) ←R_{ℓ−1}(X_{ℓ−1},X_ℓ)→ E_ℓ(X_ℓ)    (19)

where E_i(X_i) ∈ 𝐄 and R_{i−1}(X_{i−1}, X_i) ∈ 𝐑, for i = 1, ⋯, ℓ.

For instance, Conference(C) ←Submitted(S,C)→ Submission(S) is a relational path in our example. The treated and response units corresponding to treatment and response attribute functions T and Y are said to be relationally connected if there exists a relational path P that includes the entities or relationships for T and Y either as the endpoints of the path or as the labels of the edges at the ends of the path. For example, for T[X] = Prestige[A] and Y[X′] = Score[S], the treatment is an attribute function of the entity Author(A), the response is an attribute function of the relationship Author(A, S), and the treated and response units are relationally connected by the following relational path:

Author(A) ←Author(A,S)→ Submission(S)    (20)

In this paper, we make the natural assumption that the treated and response units are relationally connected by at least one relational path, as otherwise the effect of the treatment on the response is not meaningful. These units can then be unified using the aggregated response AGG_Y[X₁], defined with the following aggregate rule (see Section 3.2.4) that maps attribute Y of the response units U_Y to the treated units U_T, where the units can be either entities or relationships:

AGG_Y[X₁] ⇐ Y[X′] WHERE R₁(X₁, X₂), …, R_{ℓ−1}(X_{ℓ−1}, X_ℓ)    (21)

For example, to unify the treated and response units associated with T[X] = Prestige[A] and Y[X′] = Score[S], the aggregate rule associated with the relational path in (20) coincides with (12): AVG_Score[A] ⇐ Score[S] WHERE Author(A, S). Therefore, we assume from here on that the response units U_Y are the same as the treated units U_T. Henceforth, we simply refer to elements of U_Y and U_T as units and denote them by U_(T,Y) = U_T = U_Y. In our example, after the unification, AVG_Score[A] can be considered a new attribute function of authors (as in a 'view' in relational databases), and the authors form U_(T,Y).

Relational Peers.
Next, we define the notion of relational peers of a unit, which is central to the notions of relational and isolated effects. Recall that the grounded causal graph G(Φ, ∆) is extended with vertices and edges corresponding to aggregated attributes, as discussed in Section 3.2.4.

Definition 4.3 Given a treatment attribute function T[X] and a (possibly aggregated) response attribute function Y[X], we define the relational peers of a unit x ∈ U_(T,Y) as a set of units P(x) ⊆ U_(T,Y) − {x} such that for each p ∈ P(x), there exists a directed path from T[p] to Y[x] in G(Φ, ∆). (We aggregate the response and not the treatment, since aggregating treatments may lead to interventions that are not well defined.)

For example, in Figure 5, for the treatment and aggregated response functions Prestige[A] and AVG_Score[A], P("Bob") = {"Eva"} and P("Eva") = {"Bob", "Carlos"}. In practice, the relational causal model is expected to form relational peers P(x) that consist only of units in some relational proximity of x, e.g., authors from the same institution, with the same research interests, etc.

The following quantity measures the expected response of a unit x ∈ U_(T,Y) when it receives the treatment t and its relational peers receive the vector of treatments t⃗ = (t₁, t₂, …):

Y_x(t, t⃗) ≝ E[ Y[x] | do(T[x] = t), do(T[P(x)] = t⃗) ]    (22)

In this paper, we assume do(T[P(x)] = t⃗) is a well-defined intervention for all units x, i.e., it uniquely determines which relational peers of a unit would receive which treatment. For instance, this holds if P(x) is of the same size for all x and either has a natural ordering or is ordering-invariant. However, we allow several relaxations on the size and type of t⃗ in our framework, as discussed later.

In this section, we define the semantics of the causal queries outlined in Section 3.3 in terms of interventions; how these causal queries are answered in CaRL using unification of treated and response units, embeddings, and selection of covariates is discussed in Section 5.
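As an illustration of Definition 4.3, the following sketch hand-codes the grounded graph of the running example (authorships as stated above: "s1" by Bob and Eva, "s2" by Eva, "s3" by Eva and Carlos; the edge-building loop is ours, not CaRL's actual grounding code) and recovers the peer sets quoted above:

```python
# A toy sketch of Definition 4.3: relational peers from a grounded causal
# graph. Edges encode Prestige[a] -> Score[s] for each author a of s, and
# Score[s] -> AVG_Score[a] for the aggregated response node of each author.
authors_of = {"s1": ["Bob", "Eva"], "s2": ["Eva"], "s3": ["Eva", "Carlos"]}

edges = []
for s, authors in authors_of.items():
    for a in authors:
        edges.append((("Prestige", a), ("Score", s)))
        edges.append((("Score", s), ("AVG_Score", a)))

def reachable(src, dst, edges):
    """Is there a directed path src -> ... -> dst?"""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        if node == dst:
            return True
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                frontier.append(v)
    return False

def relational_peers(x, units, edges):
    """Units p != x whose treatment node Prestige[p] reaches AVG_Score[x]."""
    return {p for p in units if p != x
            and reachable(("Prestige", p), ("AVG_Score", x), edges)}

units = {"Bob", "Eva", "Carlos"}
peers_bob = relational_peers("Bob", units, edges)  # {"Eva"}
peers_eva = relational_peers("Eva", units, edges)  # {"Bob", "Carlos"}
```

The computed sets match the peer sets P("Bob") and P("Eva") stated in the text.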
The primary causal query in CaRL is the average treatment effect (ATE) query of the form Y[X′] ⇐ T[X]?, as given in (13). Given treatment and response attribute functions T, Y, the ATE is defined as follows:

ATE(T, Y) ≝ (1/m) ∑_{x′ ∈ U_Y} ( E[ Y[x′] | do(T[U_T] = 1⃗) ] − E[ Y[x′] | do(T[U_T] = 0⃗) ] )    (23)

where m = |U_Y|. Intuitively, the ATE compares the expected response of the response units under two regimes of intervention: one in which all units receive the treatment and another where none do. For example, ATE(PRESTIGE, SCORE) compares the scores of submissions under two interventions in which all authors are, and are not, affiliated with prestigious institutions.

Aggregate response queries of the form AGG_Y[X′] ⇐ T[X]?, as given in (14), are defined similarly to the ATE above, replacing Y with AGG_Y everywhere. Note that in the extended relational causal graphs there are nodes corresponding to AGG_Y, as shown in Figure 5.

The CaRL query (15) computes the following three quantities, which compare the average isolated (AIE), relational (ARE), and overall (AOE) effects of two alternative intervention strategies (t, t⃗) and (t′, t⃗′) over n response units:

AIE(t; t′ | t⃗) ≝ (1/n) ∑_{x ∈ U_(T,Y)} Y_x(t, t⃗) − Y_x(t′, t⃗)    (24)

ARE(t⃗; t⃗′ | t) ≝ (1/n) ∑_{x ∈ U_(T,Y)} Y_x(t, t⃗) − Y_x(t, t⃗′)    (25)

Footnotes: This assumption is far less strict than the assumption of partial interference, which is standard in statistics for extending Rubin's causality to handle interference [58]. Also note that the assumption of no interference (SUTVA) [44] translates to the statement P(x) = ∅ for all x ∈ U_(T,Y) in relational causal models. We do not need the treatment vectors t⃗, t⃗′ applied to the peers to have the same size, although they are applied to all units x. We also do not need all units x to have the same number of peers in P(x). As the grammar defined in (16) describes, we can assign treatments to "at least/most k or k%" neighbors, and that is well defined for all units x even if they do not have the exact same number of peers in P(x). On the other hand, for such conditions we do need to assume that the effects of interventions on the peers are ordering-invariant, e.g., the intervention can be applied to any of the k peers (with possible truncation for smaller peer sets) in P(x).
Figure 6: Final relational causal graph obtained by (further) augmenting the graph in Figure 5 with embedding functions. For clarity, ψ^Quality_Qualifications[S] is represented as ψ_Q, and ψ^Quality_Prestige[S] and ψ^Score_Prestige[S] as ψ_P.

AOE(t, t⃗; t′, t⃗′) ≝ (1/n) ∑_{x ∈ U_(T,Y)} Y_x(t, t⃗) − Y_x(t′, t⃗′)    (26)

Intuitively, the isolated causal effect of a treatment fixes the treatment of the relational peers of a unit and compares its expected response under two treatment strategies assigned to the unit. The relational causal effect of a treatment, on the other hand, fixes the treatment of a unit x and compares its expected response under two treatment strategies assigned to its relational peers. For example, the relational effect of Prestige[A] on AVG_Score[A] fixes the prestige of an author such as "Bob" and compares the counterfactual response AVG_Score["Bob"] under two regimes of interventions in which the relational peers of "Bob", e.g., "Eva", receive two different treatment strategies, e.g., all of them have prestigious affiliations versus none of them having such affiliations. Note that the overall causal effect is an extension of the ATE (23) to two arbitrary treatment strategies. Indeed, the ATE coincides with AOE(1, 1⃗; 0, 0⃗) when the treated and response units are unified. The following proposition shows the connection between the relational, isolated, and overall effects (we omit the proof due to lack of space).

Proposition 4.1
The average overall effect can be decomposed into the average isolated and average relational effects, as follows:

AOE(t, t⃗; t′, t⃗′) = AIE(t, t′ | t⃗) + ARE(t⃗, t⃗′ | t′) = AIE(t, t′ | t⃗′) + ARE(t⃗, t⃗′ | t)    (27)

Given the syntax of the different causal queries in Section 3.3 and their semantics in Section 4.4, we now describe how these queries are answered in CaRL. The query answering component of CaRL consists of covariate detection (Section 5.1) and covariate adjustment (Section 5.2). The goal of covariate detection is to identify a sufficient set of covariates that should be adjusted for to remove confounding effects. Then, in the process of covariate adjustment, the data is transformed into a flat, single-table format so that causal inference can be performed using standard methods.
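The decomposition in Proposition 4.1 follows by a telescoping argument and therefore holds pointwise for any response function; it can be sanity-checked numerically. The response below is an arbitrary nonlinear stand-in for (22), not the paper's model, and the quantities are single-unit versions of (24)–(26):

```python
# Numerical sanity check of Proposition 4.1 (Eq. 27) on a toy response.
def Y_x(t, tvec):
    # Arbitrary toy stand-in for Eq. (22), with a treatment-peer interaction.
    return 2.0 * t + sum(tvec) + 0.5 * t * sum(tvec)

def AIE(t, t2, tvec):          # isolated effect (Eq. 24), single unit
    return Y_x(t, tvec) - Y_x(t2, tvec)

def ARE(tvec, tvec2, t):       # relational effect (Eq. 25), single unit
    return Y_x(t, tvec) - Y_x(t, tvec2)

def AOE(t, tvec, t2, tvec2):   # overall effect (Eq. 26), single unit
    return Y_x(t, tvec) - Y_x(t2, tvec2)

t, t2 = 1, 0
tvec, tvec2 = (1, 1, 0), (0, 0, 0)
lhs = AOE(t, tvec, t2, tvec2)
d1 = AIE(t, t2, tvec) + ARE(tvec, tvec2, t2)   # first decomposition in (27)
d2 = AIE(t, t2, tvec2) + ARE(tvec, tvec2, t)   # second decomposition in (27)
```

Both decompositions agree with the overall effect, as the proposition states.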
Given treatment and response attribute functions T[X] and Y[X′], to answer all types of causal queries defined in Section 4.4 we need to estimate the effect of interventions of the form do(T[S] = t⃗_S), on a set of treated units S ⊆ U_T, on a response unit x′ ∈ U_Y. This section proves a graphical criterion for selecting a sufficient set of covariates from a relational causal graph G(Φ, ∆) that enables the estimation of quantities of the form E[ Y[x′] | do(T[S] = t⃗_S) ], and thereby the queries in Section 4.4. For this purpose, we use the extended relational causal graph, as shown in Figure 5, to map a possibly varying number of parent nodes to a fixed and smaller dimension by adopting the idea of the embedding functions introduced in Section 4.1. We illustrate this with an example below.
Example 5.1 In Example 4.1, ψ^Score_Prestige(S) now corresponds to a new attribute of a submission that maps the Prestige attributes of the Authors of that submission. Figure 6 shows the relational causal graph with the augmented attributes computed using the mapping functions, or embeddings, represented by the triangles.
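A minimal sketch of this augmentation step, using a mean embedding over the (variable-size) author sets of the running example; the prestige values below are hypothetical, not those of Figure 2:

```python
# Sketch of the embedding step (Section 4.1 / Example 5.1): each submission's
# variable-size set of author Prestige values is mapped to a fixed-dimensional
# augmented attribute via an aggregate. Data is illustrative.
authors_of = {"s1": ["Bob", "Eva"], "s2": ["Eva"], "s3": ["Eva", "Carlos"]}
prestige = {"Bob": 1, "Eva": 1, "Carlos": 0}  # hypothetical treatment values

def psi_mean(values):
    """Mean embedding: one simple choice of psi, mapping any-size set to R."""
    return sum(values) / len(values)

# psi^Score_Prestige[s]: the augmented attribute of each submission
psi_prestige = {s: psi_mean([prestige[a] for a in authors])
                for s, authors in authors_of.items()}
```

With these hypothetical values, `psi_prestige` maps "s1" to 1.0 and "s3" to 0.5, even though the two submissions have different numbers of authors.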
The following theorem formalizes how the do-operator for a relational causal graph can be estimated from observed data. The theorem uses the concept of d-separation for conditional independence in graphical models [36], denoted by X ⊥⊥ Y |_G Z. The review of these concepts and the proof of the theorem are deferred to the full version of the paper due to lack of space.

Theorem 5.2 (Relational Adjustment Formula) Given an augmented relational causal graph G(Φ, ∆), treatment and response attribute functions T, Y, and a set of treated units (entities or relationships) S with their treatment assignment vector t⃗_S, we have the following relational adjustment formula:

E[ Y[x′] | do(T[S] = t⃗_S) ] = ∑_{z ∈ Dom(Z)} E[ Y[x′] | Z = z, T[S′] = t⃗_{S′} ] Pr(Z = z)    (28)

where S′ ⊆ S is such that, for each unit x ∈ S′, there exists a directed path from the node T[x] to the node Y[x′] in G(Φ, ∆), and Z is a set of nodes in G(Φ, ∆) corresponding to the groundings of a subset of the observed attribute functions 𝒜_Obs such that:

Y[x′] ⊥⊥ ( ⋃_{x ∈ S} Pa(T[x]) )  |_{G(Φ, ∆)}  ( ⋃_{x ∈ S} T[x], Z )    (29)

Further, choosing Z to be the parent nodes of S′ in G(Φ, ∆) always satisfies (29) as a sufficient condition. (Intuitively, it is always sufficient to condition on the 'parents' of the treated units, as they separate them from the rest of the graph, ensuring independence.) We illustrate with an example.

Example 5.3
To compute ATE(Prestige, Score) in our example, we need to compute expectations of the form

E[ Score[s] | do(Prestige[{"Bob", "Eva", "Carlos"}] = t⃗) ]  for t⃗ ∈ {1⃗, 0⃗}    (30)

where we intervene on all three authors in the example. Applying Theorem 5.2 for submission s = "s1", note that directed paths to Score["s1"] exist only from Prestige["Bob"] and Prestige["Eva"], which form the subset S′. Further, it is sufficient to condition on the parents of these two Prestige nodes, i.e., Z = {Qualifications["Bob"], Qualifications["Eva"]}. Therefore, (30) reduces to:

∑_{z ∈ Dom(Z)} E[ Score["s1"] | Z = z, Prestige[{"Bob", "Eva"}] = t⃗ ] Pr(Z = z)    (31)

Similarly, for s = "s2" and Z = {Qualifications["Eva"]}, we obtain

∑_{z ∈ Dom(Z)} E[ Score["s2"] | Z = z, Prestige[{"Eva"}] = t ] Pr(Z = z)    (32)

Note that the relational adjustment formula in (28) controls for an adequate set of covariates Z that confound the causal effect of a treatment on an outcome (Z is called the set of confounding covariates, or simply covariates). For example, the causal effect of Prestige on Score is confounded by Qualifications: qualified researchers are likely to belong to prestigious universities, and qualified researchers are more likely to submit high-quality papers. Therefore, to compute the ATE of Prestige on Score we need to control for the authors' qualifications, as in (31).

For estimating ATE(Quality, Score) (assuming quality is observed), by applying (29) we find that, for each submission s, E[ Score[s] | do(Prestige[U_T] = 1⃗) ] can be estimated by adjusting for the embedded attribute functions Z = {ψ^Score_Prestige[s], ψ^Score_Qualifications[s]}.

To estimate ATE(Quality, AVG_Score) (the effect on the average acceptance rate of an author), on the other hand, we need to estimate Pr( AVG_Score[A] = y | do(Prestige[U_T] = 1⃗) ) for each author. According to equation (29), this can be done by adjusting for the joint distribution of the qualifications of all their coauthors, which is potentially very high-dimensional; therefore we need another round of embeddings to aggregate that information, as discussed next.
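A minimal sketch of the adjustment in (28) on a flat toy dataset, stratifying on a single discrete confounder. The data-generating process is ours (deterministic outcome, chosen so the true effect is exactly 1.0) and only illustrates the formula, not CaRL's estimator:

```python
# Sketch of the relational adjustment formula (Eq. 28) on a toy dataset:
# estimate E[Y | do(T=t)] as sum_z E[Y | Z=z, T=t] * Pr(Z=z).
import random

random.seed(0)
rows = []
for _ in range(20000):
    z = random.random() < 0.5                  # confounder, e.g. Qualifications
    t = random.random() < (0.8 if z else 0.2)  # treatment (Prestige) depends on Z
    y = 2.0 * z + 1.0 * t                      # outcome (Score): true effect is 1.0
    rows.append((z, t, y))

def adjusted_mean(rows, t):
    """sum_z E[Y | Z=z, T=t] * Pr(Z=z), from empirical strata."""
    total = 0.0
    for z in (False, True):
        stratum = [r for r in rows if r[0] == z]
        in_arm = [y for (_, ti, y) in stratum if ti == t]
        total += (sum(in_arm) / len(in_arm)) * (len(stratum) / len(rows))
    return total

ate = adjusted_mean(rows, True) - adjusted_mean(rows, False)  # recovers 1.0
```

Because Z confounds T and Y, the naive difference of group means would overstate the effect, while the stratified estimate recovers it.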
Algorithm 1: Constructing a unit table.
Input: An augmented relational causal graph G(Φ, ∆); treatment and outcome attribute functions T[X] and Y[X′].
Output: The unit table D(Y, ψ^Y_T, Ψ^Y_Z).
for x′ ∈ U_Y do
    U′_T ← a minimal subset of U_T such that there exists a directed path in G(Φ, ∆) from T[x] to Y[x′] for all x ∈ U′_T
    Z ← a minimal set of vertices in G(Φ, ∆) that satisfies the d-separation statement in Eq. (29)
    ψ_T ← ψ^Y_T(⟨T[x₁], …, T[x_{|U′_T|}]⟩)
    Ψ_Z ← Ψ^Y_Z(Z)
    Insert the tuple (Y[x′], ψ_T[x′], Ψ_Z[x′]) into the unit table D

Unit: Author ID | Outcome (Y): AVG_Score | Embedded coauthors' treatments (ψ^Y_T): Prestige (AVG) | Embedded collaborators' covariates (Ψ^Y_Z): Centrality (COUNT) | H-index (AVG)
Bob    | 0.75 | 1   | 1 | 2
Carlos | 0.1  | 1   | 1 | 2
Eva    | 0.41 | 0.5 | 2 | 35

Table 1: The unit table for T[X] = Prestige[A] and Y[X′] = AVG_Score[A], based on Figure 2.

There are two challenges in estimating the causal queries of Section 4.4 using the relational adjustment formula (28): (1) when the set of confounding covariates Z has high dimensionality, estimating the conditional expectation in (28) from data is challenging, and (2) the causal queries need to compute averages across all response units; hence the formula (28) would have to be estimated separately for each response unit, which is not feasible. For instance, in Example 5.3, (31) and (32) need to be estimated separately.

To address these issues, similarly to Section 4.1, we use a set of embedding functions ψ^Y_T and Ψ^Y_Z to project the treatment and covariate vectors, respectively, into a low-dimensional embedding space with fixed dimensionality for all response units. This enables us to transform a (multi-)relational instance into a single low-dimensional flat table.

In the classical causal inference framework discussed in Section 2, the units of interest are stored in a single unit table with attributes corresponding to the treatment, response, and confounding covariates as the columns. Here we generalize this concept to capture units in relational causal analysis. Given a relational causal graph G(Φ, ∆) and treatment and response attribute functions T[X] and Y[X′], we use Algorithm 1 to construct a unit table, which is a standard relation (table) with schema D(Y, ψ^Y_T, Ψ^Y_Z) (note that Ψ^Y_Z denotes a vector of values for possibly multiple covariates Z). It consists of tuples (Y[x′], ψ^Y_T[x′], Ψ^Y_Z[x′]) for each response unit x′ ∈ U_Y, where ψ^Y_T[x′] and Ψ^Y_Z[x′] (with abuse of notation) are relational embedded attribute functions that correspond to the result of applying ψ^Y_T and Ψ^Y_Z to the treatment and covariate vectors, respectively.

Example 5.4
Table 1 shows the unit table corresponding to T[X] = Prestige[A] and Y[X′] = AVG_Score[A]. Here the authors constitute the response units, and the aggregated response is an attribute of authors. In this table, simple mappings such as average and count are used for the embedding. Note that Table 1 also serves as the unit table for T[X] = Prestige[A] and Y[X′] = Score[S]; in this case, since the treated and response units are different, CaRL uses the aggregated response AVG_Score[A] for unification (see Section 4.3).

By rewriting the RHS of the relational adjustment formula (28) in terms of the attributes of the unit table and t⃗^e_{S′}, the embedded representation of the treatment assignment t⃗_{S′} (i.e., t⃗^e_{S′} = ψ^Y_T(t⃗_{S′})), we obtain

∑_{z ∈ Dom(Ψ^Y_Z)} E[ Y | Ψ^Y_Z = z, ψ^Y_T = t⃗^e_{S′} ] Pr(Ψ^Y_Z = z)    (33)
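The loop in Algorithm 1 can be sketched in plain Python as follows. The coauthor graph, treatments, and covariates below are made up (they mirror the shape of Table 1, not its values), and mean embeddings stand in for ψ^Y_T and Ψ^Y_Z:

```python
# Sketch of Algorithm 1's output: one flat row (Y[x], psi_T[x], Psi_Z[x]) per
# response unit, with mean embeddings over peers' treatments and covariates.
# All data below is illustrative.
coauthors = {"a1": ["a2"], "a2": ["a1", "a3"], "a3": ["a2"]}
prestige = {"a1": 1, "a2": 0, "a3": 1}         # treatment T
hindex = {"a1": 10, "a2": 30, "a3": 20}        # covariate Z
avg_score = {"a1": 0.7, "a2": 0.4, "a3": 0.6}  # aggregated response Y

def mean(xs):
    return sum(xs) / len(xs)

unit_table = []
for x in coauthors:                  # response units after unification
    peers = coauthors[x]
    unit_table.append((
        x,
        avg_score[x],                         # Y[x]
        mean([prestige[p] for p in peers]),   # psi_T[x]: embedded peer treatments
        mean([hindex[p] for p in peers]),     # Psi_Z[x]: embedded peer covariates
    ))
```

Each row of `unit_table` is flat and fixed-width, regardless of how many coauthors a unit has, which is exactly what makes (33) estimable with standard single-table methods.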
Figure 7: (a) The average treatment effect estimates and Pearson's correlation for single-blind and double-blind submissions. (b) Pearson's correlation, average isolated effect, average relational effect, and average overall effect for all authors on submissions in single-blind venues.

Once we have a flat unit table with columns for the treatment, response, and covariates, as in Section 2, the causal queries in Section 4.4 can be estimated from (33) by applying standard approaches to causal analysis, such as regression [2] (the conditional expectation in (33) is a regression function) or matching methods [16, 12, 19] (matching treated and control units with the same or similar covariate values). The validity of the treatment effect estimates is conditional on the assumption that the background knowledge is accurate.
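One standard instantiation of this last step is a linear regression of the outcome on the embedded treatment and covariates, reading off the treatment coefficient as the effect estimate. The sketch below uses simulated data (ours, not the paper's); with a linear ground truth, ordinary least squares recovers the effect:

```python
# Sketch of estimating Eq. (33) from a flat unit table via linear regression.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
z = rng.normal(size=n)                   # embedded covariate Psi_Z
t = (z + rng.normal(size=n) > 0) * 1.0   # treatment correlated with Z (confounded)
y = 1.5 * t + 2.0 * z + rng.normal(scale=0.1, size=n)  # true effect = 1.5

X = np.column_stack([np.ones(n), t, z])  # intercept, treatment, covariate
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
effect_estimate = beta[1]                # close to the true 1.5
```

Matching methods would replace the regression fit with pairing of treated and control rows on `z`; both operate on the same flat unit table.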
Embedding as a technique addresses both the high dimensionality and the variable size of the treatments and covariates corresponding to the response units, thereby making the estimation of causal queries tractable. However, (33) only approximates (28); hence the quality of the answers depends on whether the embeddings preserve sufficient statistics. In this work we use the following natural choices of embeddings; a formal study of the choice of embedding functions in multi-relational causal analysis is an interesting direction for future work. (1) Mean and median: uses basic aggregation functions, such as mean and median, together with the cardinality of the vectors (to account for the underlying topology of the relational skeleton, e.g., the number of authors or collaborators). (2) Padding: pads each variable-size vector with out-of-band "empty markers" to create same-sized vectors that are used directly as the embedding. (3) Moments: uses a vector consisting of k moments (i.e., mean, variance, skewness, etc.), where k is chosen to minimize the response prediction loss.

In this section, we conduct an experimental evaluation of CaRL, addressing three questions.
End-to-end performance: is CaRL effective in answering causal queries on relational data? Can it avoid discovering mere correlations instead of true causation? Can it distinguish isolated effects from relational effects? Quality of estimates: when ground truth is available, can CaRL recover the true treatment effects? And is the relational structure necessary for recovering the correct treatment effect? Sensitivity to embeddings: how sensitive is CaRL's performance to the choice of embedding strategy?
Experimental Setup.
The experiments were performed locally on a 64-bit Linux server with 1TB RAM and 4 Intel Xeon processors with 15 cores @ 2.8GHz each.
We used three real datasets, two from the medical domain and one about conferences, summarized in Table 2. All datasets contain interesting relationships that inform CaRL's causal analysis. In addition, we generated a synthetic dataset in order to have control over the ground truth.
MIMIC-III
The Multiparameter Intelligent Monitoring in Intensive Care III (MIMIC-III) database is a large-volume, multi-parameter dataset collected from the ICUs of Beth Israel Deaconess Medical Center from 2008 to 2014.

Dataset | Tables | Att. | Rows | Unit Table Cons. | Query Ans.
MIMIC-III | 26 | 324 | 400M | 6h | 4.5h
NIS | — | — | — | — | —
ReviewData | — | — | — | — | —
SyntheticReviewData | — | — | — | — | —

Table 2: Dataset summary and runtimes.

The causal rules specified for MIMIC-III include:

…[P] ⇐ Eth[P], Religion[P], Sex[P] WHERE Pa(P)
Diag[P] ⇐ Eth[P], Religion[P], Sex[P] WHERE Pa(P)
Dose[D] ⇐ Diag[P], Severe[P], Doc[C] WHERE Drug(C, D), Care(C, P)
Death[P] ⇐ Len[P], Diag[P], Dose[D], Doc[C] WHERE Care(C, P), Given(D, P)
Len[P] ⇐ Dose[D], Diag[P] WHERE Given(D, P)

NIS
The Nationwide Inpatient Sample (NIS) [15] is a dataset of hospital stays across the US, produced by the Department of Health and Human Services once annually. We use the sample for the year 2006, which comprises 8 million hospital admissions across 1035 hospitals. Each admission is associated with a hospital and the patient's demographic information, admission source, health history, performed procedures, and new diagnoses. Information available about each hospital includes its size, location, and ownership. We specified a causal model in CaRL using 16 intuitive causal rules, using attributes such as whether the hospital is classified as large [1], the patient's medical bill, etc.; we mention a few below:

Bill[P] ⇐ Illness_Severity[P]
Bill[P] ⇐ Private_Ownership[H] WHERE Admitted(P, H)
Bill[P] ⇐ Surgery_Performed[P]
Admitted_to_large[P] ⇐ Illness_Severity[P]

REVIEWDATA. REVIEWDATA consists of 2,075 papers submitted for review between 2017 and 2019 at 10 computer science conferences and workshops, which have acceptance rates between 40% and 84%. Each submission is associated with a number of referee reviews and an acceptance or rejection decision. About half of all submissions are double-blind, while the other half reveal author names to the reviewer. All submissions were unblinded after the conferences concluded. The dataset also contains an authors table, with the citation count, h-index, publishing experience (in years), and university ranking for each of the 4490 authors who contributed to a paper in the dataset. REVIEWDATA was built by scraping, cleaning, and normalizing data from OpenReview [34], Scopus [54], and the Shanghai University Rankings [40]. Scraping Scopus was done using the tool proposed in [41]. We plan to make REVIEWDATA publicly available.

SYNTHETICREVIEWDATA. We generated SYNTHETICREVIEWDATA mimicking the probability distributions observed in REVIEWDATA. The relational skeleton was generated keeping in mind the correlations we observed in the real data, e.g., authors with high productivity tend to be affiliated with more prestigious institutions, and authors from more prestigious institutions tend to collaborate with each other more. However, for each paper we let the number of authors and each submission's choice of venue be determined randomly. We generated … authors, with affiliations to … different institutions, along with … papers submitted to … different venues. Next, we generated two datasets to explore CaRL's performance with and without relational effects. The first dataset had a treatment effect of prestige on review score of 0 for double-blind and 1 for single-blind venues, for all submissions. In the second dataset, the isolated effects stay the same for both double- and single-blind venues, while there is a constant effect of … on the review score of each submission if the authors' collaborators are prestigious.
9, 2020Figure 8: Comparing CATEs estimated using the universal table obtained by joining all base relations and CaRL.
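The generative process described above can be sketched as follows. This is a toy illustration, not the exact generator used for SyntheticReviewData; all probabilities, field names, and the team-size range are assumptions.

```python
import random

def generate_skeleton(n_authors, n_papers, seed=0):
    """Toy relational skeleton: some authors are prestigious, and
    prestigious authors tend to co-author with each other, while the
    number of authors per paper and the venue type are random."""
    rng = random.Random(seed)
    # Hypothetical: ~30% of authors are at prestigious institutions.
    authors = [{"id": a, "prestige": rng.random() < 0.3}
               for a in range(n_authors)]
    papers = []
    for p in range(n_papers):
        k = rng.randint(1, 4)             # number of authors is random
        double_blind = rng.random() < 0.5  # venue choice is random
        # Homophily: with some probability, draw the whole team from
        # the prestigious pool only (falling back to all authors).
        prestigious = [a for a in authors if a["prestige"]]
        pool = prestigious if (prestigious and rng.random() < 0.4) else authors
        team = rng.sample(pool, min(k, len(pool)))
        papers.append({"id": p,
                       "authors": [a["id"] for a in team],
                       "double_blind": double_blind})
    return authors, papers

authors, papers = generate_skeleton(100, 50)
```

A treatment effect on review scores (e.g., 1 for single-blind venues, 0 for double-blind) would then be layered on top of this skeleton when generating outcomes.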
In this section we used CaRL to answer several causal queries (including all the kinds defined in Section 3.3) on the real datasets and evaluated the quality of the answers. Since we do not have ground truth for these data, we discuss which results are more in agreement with intuition or with the literature in the field. We also compared CaRL's answers with more naive, correlation-based answers. All runtimes for these experiments are reported in Table 2.
MIMIC-III
We asked the following causal queries: what is the effect of not having health insurance on the mortality rate? And what is the effect of non-insurance on the length of stay (in the hospital)?

(a) Death[P] ⇐ SelfPay[P]?    (b) Len[P] ⇐ SelfPay[P]?    (34)

The treated and control groups consist of patients without insurance (self-payers) and with insurance, respectively. Table 3 shows the results for both the average treatment effect (ATE) and the naive difference of averages between the two groups. Computed naively, there is a significant difference in both the mortality rate and the length of stay between the insured and non-insured groups. However, after adjusting for confounders and mediators, we observe that there is almost no effect on the mortality rate; in other words, caregivers do not discriminate in treating patients with and without insurance. The discrepancy is due to the fact that self-payers tend to defer checking into a hospital until the problem is severe. The treatment effect on the length of stay is also attenuated compared to the naive difference between the average outcomes of the treated and control groups.

Causal Query      Avg. of Treated   Avg. of Control   Diff. of Averages   ATE
MIMIC 1 (34-a)
MIMIC 2 (34-b)
NIS 1 (35)              64%               31%                33%          -10%

Table 3: The average treatment effect (ATE) compared to naively computing the difference between the averages of the treated and control groups.
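The gap between the naive difference of averages and the adjusted ATE can be illustrated with a small stratified (back-door) adjustment. This sketch uses simple covariate stratification rather than CaRL's actual estimation pipeline, and the `severity` confounder and all numbers are synthetic.

```python
from collections import defaultdict

def naive_diff(rows):
    """Difference of outcome averages between treated and control."""
    t = [r["y"] for r in rows if r["t"]]
    c = [r["y"] for r in rows if not r["t"]]
    return sum(t) / len(t) - sum(c) / len(c)

def adjusted_ate(rows, confounder="severity"):
    """Back-door adjustment: average the within-stratum treated/control
    difference, weighting each stratum by its share of the population."""
    strata = defaultdict(list)
    for r in rows:
        strata[r[confounder]].append(r)
    ate, n = 0.0, len(rows)
    for group in strata.values():
        t = [r["y"] for r in group if r["t"]]
        c = [r["y"] for r in group if not r["t"]]
        if t and c:
            ate += (sum(t) / len(t) - sum(c) / len(c)) * len(group) / n
    return ate

# Severity confounds treatment (self-pay) and outcome (length of stay):
# severe patients are both more likely to be self-payers and stay longer.
rows = ([{"severity": 1, "t": True,  "y": 10}] * 30 +
        [{"severity": 1, "t": False, "y": 10}] * 10 +
        [{"severity": 0, "t": True,  "y": 2}]  * 10 +
        [{"severity": 0, "t": False, "y": 2}]  * 30)

print(naive_diff(rows))     # 4.0 — the naive difference is inflated
print(adjusted_ate(rows))   # 0.0 — within each severity level, no effect
```

Within each stratum the treated and control outcomes coincide, so the true effect is zero; the naive comparison is driven entirely by the confounder, which mirrors the insured/self-payer discrepancy above.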
NIS
We asked the following causal query: are patients admitted to large hospitals charged more than those admitted to small hospitals? Expressed in CaRL, the query is:

AVG_Bill[H] ⇐ Admitted_to_Large[P]?    (35)

The treated and control groups are large and small hospitals, respectively. As before, we compared the ATE with the naive difference of the average bills of the two groups; the results are shown in Table 3. The naive computation shows that the average bill per patient at large hospitals is 33% higher (i.e., less affordable). However, when computing the ATE, CaRL adjusts for the profile of the patients each hospital receives, and we obtain a surprising reversal of the trend. The reason for this discrepancy is that patients with more severe (and thus more costly) conditions tend to go to large hospitals, while small hospitals tend to have patients with milder conditions. In fact, the medical literature reports that, all else being equal, a larger hospital will provide more affordable treatment than a small one.
                            AIE      ARE      AOE
Single-Blind   Estimated
               Ground Truth
Double-Blind   Estimated
               Ground Truth

Table 4: Averages for isolated, relational, and overall effects for SyntheticReviewData by the query in (36).

Method           Embedding        Single-Blind            Double-Blind
                                  Estimated      True     Estimated      True
CaRL             Mean                            1.00                    0.00
                 Median                          1.00                    0.00
                 Moment Summary                  1.00                    0.00
                 Padding                         1.00                    0.00
UniversalTable   N/A                             1.00                    0.00

Table 5: Comparing the sensitivity of the quality of the query answer to different choices of embeddings on SyntheticReviewData, using the query in (37).

One meta-analysis [10] reports that economies of scale are present in the healthcare sector, and so finds support for the policy of several national governments to consolidate smaller hospitals to increase productivity and efficiency.

ReviewData
We asked two causal queries: what is the effect of an author's prestige on the average score of his or her submissions? And what is the effect on a submission's score when more than 1/3 of its authors' co-authors are treated? Expressed in CaRL, the queries are:

AVG_Score[A] ⇐ Prestige[A]?    (36)
Score[S] ⇐ Prestige[A]? WHEN MORE THAN 1/3 PEERS TREATED    (37)

We ran each query twice, once on single-blind conferences and once on double-blind ones; in CaRL, this is achieved by adding a WHERE condition to the queries (not shown here), and we computed the ATE in both cases. In addition, we computed the Pearson correlation between the score distributions of prestigious and non-prestigious authors. The results, shown in Figure 7(a), reveal a significant correlation for both single-blind and double-blind conferences. However, CaRL found that the causal effect of prestige on submission scores was significant for single-blind venues, but not significant for double-blind venues. A naive interpretation of correlation-as-causation would lead to the false conclusion that double blinding is not effective in reducing bias. While the validity of these findings depends on the validity of the underlying assumptions made in this paper, we believe they surpass naive correlation. In particular, we note that our results are in accordance with a series of controlled experiments that suggest double-blind reviewing does indeed reduce institutional prestige bias [42, 56, 33, 59].

Given its networked structure, ReviewData offers a great opportunity to compute peer effects. (In contrast, there are no relational peers for the causal queries on MIMIC-III and NIS.) We computed the effect of prestige across peers on review scores in single-blind conferences, and used CaRL to compute the isolated, relational, and overall effects as in (37). Figure 7(b) reveals that the isolated effect (AIE) is more significant than the relational effect (ARE), meaning that an author's own prestige has a stronger effect on his or her average submission score than their collaborators' prestige does, as we might expect. Furthermore, one can verify that we obtained AOE = AIE + ARE, which independently conforms with Proposition 4.1.
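The additivity AOE = AIE + ARE follows from the linearity of averages once per-unit isolated and relational effects are available. A minimal sketch of the decomposition check (the per-unit effect values here are made up):

```python
def decompose_effects(units):
    """Average per-unit isolated and relational effects, and return the
    triple (AIE, ARE, AOE); AOE = AIE + ARE holds by linearity."""
    n = len(units)
    aie = sum(u["isolated"] for u in units) / n
    are = sum(u["relational"] for u in units) / n
    aoe = sum(u["isolated"] + u["relational"] for u in units) / n
    return aie, are, aoe

# Hypothetical per-author effect estimates on submission scores.
units = [{"isolated": 1.0, "relational": 0.2},
         {"isolated": 0.8, "relational": 0.4}]
aie, are, aoe = decompose_effects(units)
assert abs(aoe - (aie + are)) < 1e-9  # Proposition 4.1's additivity
```

In the experiments this identity serves as an internal consistency check on the three estimates, since each is computed independently.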
As the ground truth is not known for the real datasets, we use SyntheticReviewData to evaluate the quality of the estimates CaRL provides. We report the estimated and true ATE, ARE, AIE, and AOE to scrutinize CaRL's performance. As seen in Table 4, CaRL is able to disentangle the isolated and relational effects present in SyntheticReviewData. It is able to do so for both sub-populations, which have different generative rules. The different estimates are correctly recovered, and the property AOE = AIE + ARE from Proposition 4.1 is again respected.

To test the ability to utilize relational structure, we computed the treatment-effect estimates for the causal query (37) using CaRL and compared them to propensity score matching on the universal table obtained by joining all base relations. Table 5 compares the estimates of these two approaches with the ground truth. As shown, in all tested cases CaRL approximately recovered the ground truth within a reasonable error bound, whereas causal inference on the universal table resulted in an incorrect ATE with considerable variance.

Figure 9: Relative likelihood of effects and corresponding averages for isolated, relational, and overall effects for (a) single-blind venues and (b) double-blind venues.

Figure 10: Comparing the sensitivity of the quality of the query answer (CATE) to different choices of embeddings for (a) single-blind submissions and (b) double-blind submissions.
This experiment reveals that ignoring the relationalstructure in relational domains can lead to incorrect estimates and erroneous conclusions.
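A minimal illustration of one way the universal (joined) table distorts estimates: joining authors with their papers duplicates each author's covariates once per paper, so prolific authors are over-weighted in any per-row statistic, and the peer structure is discarded. The schema and numbers here are hypothetical.

```python
# Base relations: one row per author, and an authorship relation.
authors = {1: {"prestige": 1}, 2: {"prestige": 0}}
wrote = [(1, "p1"), (1, "p2"), (1, "p3"), (2, "p4")]  # (author, paper)

# "Universal table": join authors with their papers into one flat table.
universal = [{"author": a, "paper": p, **authors[a]} for a, p in wrote]

# Author 1 now appears three times and author 2 once, although each is a
# single unit; per-row averages are biased toward prolific authors.
share_flat = sum(r["prestige"] for r in universal) / len(universal)
share_true = sum(v["prestige"] for v in authors.values()) / len(authors)
print(share_flat)  # 0.75
print(share_true)  # 0.5
```

The same duplication skews the treated/control composition that propensity score matching sees on the flat table, which is one reason the universal-table baseline in Table 5 misestimates the ATE.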
Assessing the effect of embeddings requires access to the ground truth, so we restrict ourselves to testing on SyntheticReviewData in this subsection. Table 5 shows that while CaRL consistently recovers the ATE, the correct choice of embedding can improve its performance. We observe that simple embeddings (such as the mean or median) approximately recovered the true average treatment effect; however, their estimates were less centered around the ground truth than those of embeddings like padding or moment summarization. While padding had the tightest variance, moment summarization also showed promising results. These trends hold regardless of whether we consider single- or double-blind venues, each of which has a different generative model and ground truth. It is important to note that moment summarization is one of the simpler approaches to set embedding. Additionally, the padding technique tends to create vectors that grow in proportion to the size of the relational skeleton, which limits its applicability. More sophisticated approaches exist, e.g., recurrent neural networks or kernel density estimators; we leave these for future work.
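The embeddings compared in Table 5 all map a variable-sized set of peer attribute values to a fixed-length vector. Illustrative implementations are sketched below; the number of moments and the padding conventions (sorting, zero fill) are our own assumptions, not necessarily those used in the experiments.

```python
import statistics

def embed_mean(values):
    """Summarize the set by its mean: a 1-dimensional embedding."""
    return [statistics.fmean(values)]

def embed_median(values):
    """Summarize the set by its median: a 1-dimensional embedding."""
    return [statistics.median(values)]

def embed_moments(values, k=3):
    """Moment summarization: the mean followed by central moments
    of orders 2..k, giving a k-dimensional embedding."""
    m = statistics.fmean(values)
    return [m] + [statistics.fmean([(v - m) ** p for v in values])
                  for p in range(2, k + 1)]

def embed_padding(values, max_len, fill=0.0):
    """Pad (or truncate) the sorted set to a fixed length; the vector
    must grow with the skeleton size, which limits applicability."""
    v = sorted(values)[:max_len]
    return v + [fill] * (max_len - len(v))

peers = [0.2, 0.9, 0.4]  # e.g., prestige values of an author's peers
```

Note that mean and median compress the set to a single number, so two very different peer sets can collide; moments and padding preserve progressively more information at the cost of higher dimension.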
We introduced the Causal Relational Learning framework for performing causal inference on relational data. The framework allows users to encode background knowledge in a declarative language called CaRL (Causal Relational Language) using simple Datalog-like rules, and to ask various complex causal queries over relational data. CaRL is designed for researchers and analysts with a social science, healthcare, academic, or legal background who are interested in inferring causality from complex relational data. CaRL adds to the existing causal inference literature by relaxing the unit-homogeneity assumption and allowing the confounders, treatment units, and outcome units to be of different kinds. We evaluated CaRL's completeness and correctness on real-world and synthetic data from the academic and healthcare domains. CaRL successfully recovers the treatment effects for complex causal queries that may require multiple joins and aggregates.

In future work, we aim to extend CaRL to deal with complex cyclic causal dependencies using the stationary distributions of stochastic processes. We plan to study stochastic interventions and complex interventions on relational skeletons, which are assumed to be fixed in this paper. We also plan a theoretical and methodological study of the functionality of different types of embeddings, and aim to develop a principled learning approach for finding efficient embeddings using graph representation learning and graph embedding. Recently, it has been shown that causality is foundational to the emerging field of algorithmic fairness [52]. In future work we plan to use causal relational learning to study a causality-based framework for fairness and discrimination in relational domains.
References

[1] Agency for Healthcare Research and Quality. NIS data elements: Bedsize categories.
[2] Joshua D. Angrist and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.
[3] David T. Arbour, Dan Garant, and David D. Jensen. Inferring network effects from observational data. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 715–724, 2016.
[4] Peter M. Aronow and Cyrus Samii. Estimating average causal effects under general interference, with application to a social network experiment. Ann. Appl. Stat., 11(4):1912–1947, 2017.
[5] Andrey Balmin, Thanos Papadimitriou, and Yannis Papakonstantinou. Hypothetical queries in an OLAP environment. In VLDB, volume 220, page 231, 2000.
[6] Abhijit V. Banerjee, Abhijit Banerjee, and Esther Duflo. Poor Economics: A Radical Rethinking of the Way to Fight Global Poverty. PublicAffairs, 2011.
[7] Leopoldo Bertossi and Babak Salimi. Causes for query answers from databases: Datalog abduction, view-updates, and integrity constraints. International Journal of Approximate Reasoning, 90:226–252, 2017.
[8] Daniel Deutch, Zachary G. Ives, Tova Milo, and Val Tannen. Caravan: Provisioning for what-if analysis. In CIDR, 2013.
[9] Lise Getoor and Ben Taskar. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press, 2007.
[10] Monica Giancotti, Annamaria Guglielmo, and Marianna Mauro. Efficiency and optimal size of hospitals: Results of a systematic search. PLOS ONE, 12(3):e0174533, March 2017.
[11] Bryan S. Graham, Guido W. Imbens, and Geert Ridder. Measuring the effects of segregation in the presence of social spillovers: A nonparametric approach. Technical report, National Bureau of Economic Research, 2010.
[12] Xing Sam Gu and Paul R. Rosenbaum. Comparison of multivariate matching methods: Structures, distances, and algorithms. Journal of Computational and Graphical Statistics, 2(4):405–420, 1993.
[13] M. Elizabeth Halloran and Michael G. Hudgens. Causal inference for vaccine effects on infectiousness. The International Journal of Biostatistics, 8(2):1–40, 2012.
[14] M. Elizabeth Halloran and Claudio J. Struchiner. Causal inference in infectious diseases. Epidemiology, 6(2):142–151, 1995.
[15] Healthcare Cost and Utilization Project (HCUP). HCUP Nationwide Inpatient Sample (NIS), 2006.
[16] Daniel E. Ho, Kosuke Imai, Gary King, and Elizabeth A. Stuart. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis, 15(3):199–236, 2007.
[17] Paul W. Holland. Statistics and causal inference. Journal of the American Statistical Association, 81(396):945–960, 1986.
[18] Guanglei Hong and Stephen W. Raudenbush. Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. Journal of the American Statistical Association, 101(475):901–910, 2006.
[19] Stefano M. Iacus, Gary King, Giuseppe Porro, et al. CEM: Software for coarsened exact matching. Journal of Statistical Software, 30(9):1–27, 2009.
[20] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, H. Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.
[21] Laks V. S. Lakshmanan, Alex Russakovsky, and Vaishnavi Sashikanth. What-if OLAP queries with changing dimensions. In International Conference on Data Engineering, pages 1334–1336. IEEE, 2008.
[22] Sanghack Lee and Vasant Honavar. On learning causal models from relational data. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[23] Marc Maier, Katerina Marazopoulou, David Arbour, and David Jensen. A sound and complete algorithm for learning causal models from relational data. arXiv preprint arXiv:1309.6843, 2013.
[24] Marc Maier, Katerina Marazopoulou, and David Jensen. Reasoning about independence in probabilistic models of relational data. arXiv preprint arXiv:1302.4381, 2013.
[25] Marc Maier, Brian Taylor, Huseyin Oktay, and David Jensen. Learning causal models of relational domains. In Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[26] Alexandra Meliou, Wolfgang Gatterbauer, Katherine F. Moore, and Dan Suciu. The complexity of causality and responsibility for query answers and non-answers. Proc. VLDB Endow. (PVLDB), 4(1):34–45, 2010.
[27] Alexandra Meliou, Wolfgang Gatterbauer, Suman Nath, and Dan Suciu. Tracing data errors with view-conditioned causality. In ACM SIGMOD International Conference on Management of Data, pages 505–516, 2011.
[28] Alexandra Meliou, Sudeepa Roy, and Dan Suciu. Causality and explanations in databases. Proceedings of the VLDB Endowment, 7(13):1715–1716, 2014.
[29] Alexandra Meliou and Dan Suciu. Tiresias: The database oracle for how-to queries. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 337–348. ACM, 2012.
[30] Elizabeth L. Ogburn, Ilya Shpitser, and Youjin Lee. Causal inference, social networks, and chain graphs. arXiv preprint arXiv:1812.04990, 2018.
[31] Elizabeth L. Ogburn, Oleg Sofrygin, Ivan Diaz, and Mark J. van der Laan. Causal inference for social network data. arXiv preprint arXiv:1705.08527, 2017.
[32] Elizabeth L. Ogburn, Tyler J. VanderWeele, et al. Causal diagrams for interference. Statistical Science, 29(4):559–578, 2014.
[33] Kanu Okike, Kevin T. Hug, Mininder S. Kocher, and Seth S. Leopold. Single-blind vs double-blind peer review in the setting of author prestige. JAMA, 316(12):1315–1316, 2016.
[34] OpenReview. https://openreview.net.
[35] Harsh Parikh, Cynthia Rudin, and Alexander Volfovsky. MALTS: Matching after learning to stretch. arXiv preprint arXiv:1811.07415, 2018.
[36] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[37] Judea Pearl. Causality. Cambridge University Press, 2009.
[38] Judea Pearl et al. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.
[39] Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
[40] Shanghai University Ranking.
[41] Michael E. Rose and John R. Kitchin. pybliometrics: Scriptable bibliometrics using a Python interface to Scopus. SoftwareX, 10:100263, July 2019.
[42] Joseph S. Ross, Cary P. Gross, Mayur M. Desai, Yuling Hong, Augustus O. Grant, Stephen R. Daniels, Vladimir C. Hachinski, Raymond J. Gibbons, Timothy J. Gardner, and Harlan M. Krumholz. Effect of blinded peer review on abstract acceptance. JAMA, 295(14):1675–1680, 2006.
[43] Sudeepa Roy and Dan Suciu. A formal approach to finding explanations for database queries. In ACM SIGMOD International Conference on Management of Data, 2014.
[44] Donald B. Rubin. The Use of Matched Sampling and Regression Adjustment in Observational Studies. Ph.D. thesis, Department of Statistics, Harvard University, Cambridge, MA, 1970.
[45] Donald B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
[46] Donald B. Rubin. Matched Sampling for Causal Effects. Cambridge University Press, 2006.
[47] Donald B. Rubin et al. For objective causal inference, design trumps analysis. The Annals of Applied Statistics, 2(3):808–840, 2008.
[48] Babak Salimi, Leopoldo Bertossi, Dan Suciu, and Guy Van den Broeck. Quantifying causal effects on query answering in databases. In TaPP, 2016.
[49] Babak Salimi and Leopoldo E. Bertossi. From causes for database queries to repairs and model-based diagnosis and back. In International Conference on Database Theory, pages 342–362, 2015.
[50] Babak Salimi and Leopoldo E. Bertossi. Causes for query answers from databases, datalog abduction and view-updates: The presence of integrity constraints. In FLAIRS, pages 674–679, 2016.
[51] Babak Salimi, Harsh Parikh, Moe Kayali, Lise Getoor, Sudeepa Roy, and Dan Suciu. Causal relational learning. To appear in ACM SIGMOD International Conference on Management of Data, 2020.
[52] Babak Salimi, Luke Rodriguez, Bill Howe, and Dan Suciu. Capuchin: Causal database repair for algorithmic fairness. arXiv preprint arXiv:1902.08283, 2019.
[53] Semantic Scholar.
[54] Scopus.
[55] Cosma Rohilla Shalizi and Andrew C. Thomas. Homophily and contagion are generically confounded in observational social network studies. Sociological Methods & Research, 40(2):211–239, 2011.
[56] Richard Snodgrass. Single- versus double-blind reviewing: An analysis of the literature. ACM SIGMOD Record, 35(3):8–21, 2006.
[57] Michael E. Sobel. What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association, 101(476):1398–1407, 2006.
[58] Eric J. Tchetgen Tchetgen and Tyler J. VanderWeele. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75, 2012.
[59] Andrew Tomkins, Min Zhang, and William D. Heavlin. Reviewer bias in single- versus double-blind peer review. Proceedings of the National Academy of Sciences, 114(48):12708–12713, 2017.
[60] Jeffrey D. Ullman. Principles of Database Systems. Pitman, 2nd edition, 1982.
[61] Tyler J. VanderWeele and Eric J. Tchetgen Tchetgen. Bounding the infectiousness effect in vaccine trials. Epidemiology, 22(5):686, 2011.
Proof 7.1 (Proof of Theorem 5.2) (Sketch) Since the embeddings $\Psi_\Delta$ are deterministic functions of the ground attributes $\mathbf{A}_\Delta$, they are also random variables; hence we can define the joint probability distribution $\Pr(\mathbf{A}_\Delta, \Psi_\Delta)$. Also note that the parents of an atom in the augmented ground causal diagram $\hat{G}(\Phi_\Delta)$ correspond to the embedded parents of the same node in the ground causal diagram $G(\Phi_\Delta)$.

Since each atomic intervention $do(T[x_i] = t_i)$ modifies the augmented ground causal diagram $\hat{G}(\Phi_\Delta)$ by removing the parents of $T[x_i]$ from $\hat{G}(\Phi_\Delta)$ (implied by the factorization in Eq. 18), the post-intervention distribution $\Pr(\mathbf{A}_\Delta, \Psi_\Delta \mid do(T[S] = \vec{t}_S))$ can be obtained from the pre-intervention (observed) distribution $\Pr(\mathbf{A}_\Delta)$ by removing all factors $\Pr(A(x) \mid \Psi_A(Pa'(A(x))))$ from $\Pr(\mathbf{A}_\Delta, \Psi_\Delta)$ (cf. Eq. 18); hence we obtain the following:

$$\Pr\big(\mathbf{A}_\Delta, \Psi_\Delta \mid do(T[S] = \vec{t}_S)\big) = \frac{\Pr(\mathbf{A}_\Delta)}{\prod_{x \in S} \Pr\big(A(x) \mid \Psi_A(Pa'(A(x)))\big)} \quad (38)$$

The following factorization is implied by the chain rule of probability:

$$\Pr(\mathbf{A}_\Delta, \Psi_\Delta) = \Pr\Big(\bigcup_{x \in S} Pa(T[x])\Big)\, \Pr\Big(T[x_1] \,\Big|\, \bigcup_{x \in S} Pa(T[x])\Big) \quad (39)$$
$$\cdot\; \Pr\Big(T[x_2] \,\Big|\, T[x_1], \bigcup_{x \in S} Pa(T[x])\Big) \quad (40)$$
$$\cdots\; \Pr\Big(T[x_i] \,\Big|\, \bigcup_{j=0}^{i-1} T[x_j], \bigcup_{x \in S} Pa(T[x])\Big) \quad (41)$$
$$\cdots\; \Pr\Big(\mathbf{A}'_\Delta, \Psi'_\Delta \,\Big|\, \bigcup_{x \in S} Pa(T[x]), \bigcup_{x \in S} T[x]\Big) \quad (42)$$

where $\mathbf{A}'_\Delta \cup \Psi'_\Delta$ consists of all ground atoms in $\mathbf{A}_\Delta \cup \Psi_\Delta$ except for $\bigcup_{x \in S} Pa(T[x])$ and $\bigcup_{x \in S} T[x]$. The acyclicity of $\hat{G}(\Phi_\Delta)$ implies that $\bigcup_{x \in S} Pa(T[x])$ and $\bigcup_{x \in S} T[x]$ are disjoint; hence the above factorization is valid.

The following is implied by the factorizations in Eq. 42 and Eq. 38:

$$\Pr\big(Y[x'] = y \mid do(T[S] = \vec{t}_S)\big) = \sum_{\bigcup_{x \in S} pa(T[x])} \Pr\Big(Y[x'] = y \,\Big|\, \bigcup_{x \in S} pa(T[x]),\, T[S] = \vec{t}_S\Big)\, \Pr\Big(\bigcup_{x \in S} pa(T[x])\Big) \quad (43)$$

Now, given a set $Z \subseteq \Psi_\Delta$, we can rewrite the RHS of Eq. 43 into the following equivalent formula:

$$\mathrm{RHS} = \sum_{\bigcup_{x \in S} pa(T[x])} \sum_{z \in Dom(Z)} \Pr\Big(Y[x'] = y \,\Big|\, \bigcup_{x \in S} pa(T[x]),\, T[S] = \vec{t}_S,\, Z = z\Big)\, \Pr\Big(Z = z \,\Big|\, \bigcup_{x \in S} pa(T[x]),\, T[S] = \vec{t}_S\Big)\, \Pr\Big(\bigcup_{x \in S} pa(T[x])\Big) \quad (44)$$

Now from the conditional independence in Eq. 29 and the conditional independence statement $T[x] \perp\!\!\!\perp Z, \bigcup_{x \in S} pa(T[x]) \mid pa(T[x])$ for each $x \in S$, Eq. 44 can be simplified as follows:

$$\mathrm{RHS} = \sum_{\bigcup_{x \in S} pa(T[x])} \sum_{z \in Dom(Z)} \Pr\Big(Y[x'] = y \,\Big|\, \bigcup_{x \in S} pa(T[x]),\, T[S] = \vec{t}_S,\, Z = z\Big) \cdots$$