Towards Causal Representation Learning

Bernhard Schölkopf†, Francesco Locatello†, Stefan Bauer⋆, Nan Rosemary Ke⋆, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio

†: equal contribution. ⋆: equal contribution.
B. Schölkopf is at the Max-Planck Institute for Intelligent Systems, Max-Planck-Ring 4, 72076 Tübingen, Germany, [email protected].
F. Locatello is at ETH Zurich, Computer Science Department, and the Max Planck Institute for Intelligent Systems. Work partially done while interning at Google Research Amsterdam. [email protected].
S. Bauer is at the Max-Planck Institute for Intelligent Systems, [email protected].
N. R. Ke is at Mila and the University of Montreal, [email protected].
N. Kalchbrenner is at Google Research Amsterdam, [email protected].
A. Goyal is at Mila and the University of Montreal, [email protected].
Y. Bengio is at Mila, the University of Montreal, CIFAR Senior Fellow, [email protected].

Abstract—The two fields of machine learning and graphical causality arose and developed separately. However, there is now cross-pollination and increasing interest in both fields to benefit from the advances of the other. In the present paper, we review fundamental concepts of causal inference and relate them to crucial open problems of machine learning, including transfer and generalization, thereby assaying how causality can contribute to modern machine learning research. This also applies in the opposite direction: we note that most work in causality starts from the premise that the causal variables are given. A central problem for AI and causality is, thus, causal representation learning, the discovery of high-level causal variables from low-level observations. Finally, we delineate some implications of causality for machine learning and propose key research areas at the intersection of both communities.
I. INTRODUCTION
If we compare what machine learning can do to what animals accomplish, we observe that the former is rather limited at some crucial feats where natural intelligence excels. These include transfer to new problems and any form of generalization that is not from one data point to the next (sampled from the same distribution), but rather from one problem to the next — both have been termed generalization, but the latter is a much harder form thereof, sometimes referred to as horizontal, strong, or out-of-distribution generalization. This shortcoming is not too surprising, given that machine learning often disregards information that animals use heavily: interventions in the world, domain shifts, temporal structure — by and large, we consider these factors a nuisance and try to engineer them away. In accordance with this, the majority of current successes of machine learning boil down to large scale pattern recognition on suitably collected independent and identically distributed (i.i.d.) data. To illustrate the implications of this choice and its relation to causal models, we start by highlighting key research challenges.

a) Issue 1 – Robustness: With the widespread adoption of deep learning approaches in computer vision [101, 140],
natural language processing [54], and speech recognition [85], a substantial body of literature explored the robustness of the predictions of state-of-the-art deep neural network architectures. The underlying motivation originates from the fact that in the real world there is often little control over the distribution from which the data comes. In computer vision [75, 228], changes in the test distribution may, for instance, come from aberrations like camera blur, noise or compression quality [106, 129, 170, 206], or from shifts, rotations, or viewpoints [7, 11, 63, 282]. Motivated by this, new benchmarks were proposed to specifically test generalization of classification and detection methods with respect to simple algorithmically generated interventions like spatial shifts, blur, changes in brightness or contrast [106, 170], time consistency [94, 227], control over background and rotation [11], as well as images collected in multiple environments [19]. Studying the failure modes of deep neural networks under simple interventions has the potential to lead to insights into the inductive biases of state-of-the-art architectures. So far, there has been no definitive consensus on how to solve these problems, although progress has been made using data augmentation, pre-training, self-supervision, and architectures with suitable inductive biases w.r.t. a perturbation of interest [233, 59, 63, 137, 170, 206]. It has been argued [188] that such fixes may not be sufficient, and that generalizing well outside the i.i.d. setting requires learning not mere statistical associations between variables, but an underlying causal model. The latter contains the mechanisms giving rise to the observed statistical dependences, and allows us to model distribution shifts through the notion of interventions [183, 237, 218, 34, 188, 181].

b) Issue 2 – Learning Reusable Mechanisms:
Infants’ understanding of physics relies upon objects that can be tracked over time and behave consistently [52, 236]. Such a representation allows children to quickly learn new tasks as their knowledge and intuitive understanding of physics can be re-used [15, 52, 144, 250]. Similarly, intelligent agents that robustly solve real-world tasks need to re-use and re-purpose their knowledge and skills in novel scenarios. Machine learning models that incorporate or learn structural knowledge of an environment have been shown to be more efficient and generalize better [14, 10, 16, 84, 197, 212, 8, 274, 26, 76, 83, 141, 157, 177, 211, 245, 258, 272, 57, 182]. In a modular representation of the world where the modules correspond to physical causal mechanisms, many modules can be expected to behave similarly across different tasks and environments. An agent facing a new environment or task may thus only need to adapt a few modules in its internal representation of the world [220, 84]. When learning a causal model, one should thus require fewer examples to adapt, as most knowledge, i.e., modules, can be re-used without further training.

c) A Causality Perspective: Causation is a subtle concept that cannot be fully described using the language of Boolean logic [151] or that of probabilistic inference; it requires the additional notion of intervention [237, 183]. The manipulative definition of causation [237, 183, 118] focuses on the fact that conditional probabilities (“seeing people with open umbrellas suggests that it is raining”) cannot reliably predict the outcome of an active intervention (“closing umbrellas does not stop the rain”). Causal relations can also be viewed as the components of reasoning chains [151] that provide predictions for situations that are very far from the observed distribution and may even remain purely hypothetical [163, 183] or require conscious deliberation [128]. In that sense, discovering causal relations means acquiring robust knowledge that holds beyond the support of an observed data distribution and a set of training tasks, and it extends to situations involving forms of reasoning.
Our Contributions:
In the present paper, we argue thatcausality, with its focus on representing structural knowledgeabout the data generating process that allows interventions andchanges, can contribute towards understanding and resolvingsome limitations of current machine learning methods. Thiswould take the field a step closer to a form of artificial intelli-gence that involves thinking in the sense of Konrad Lorenz, i.e.,acting in an imagined space [163]. Despite its success, statisticallearning provides a rather superficial description of reality thatonly holds when the experimental conditions are fixed. Instead,the field of causal learning seeks to model the effect of inter-ventions and distribution changes with a combination of data-driven learning and assumptions not already included in thestatistical description of a system. The present work reviews andsynthesizes key contributions that have been made to this end: • We describe different levels of modeling in physical systemsin Section II and present the differences between causal andstatistical models in Section III. We do so not only in termsof modeling abilities but also discuss the assumptions andchallenges involved. • We expand on the Independent Causal Mechanisms (ICM)principle as a key component that enables the estimationof causal relations from data in Section IV. In particular,we state the Sparse Mechanism Shift hypothesis as a con-sequence of the ICM principle and discuss its implicationsfor learning causal models. • We review existing approaches to learn causal relationsfrom appropriate descriptors (or features) in Section V. Wecover both classical approaches and modern re-interpretationsbased on deep neural networks, with a focus on theunderlying principles that enable causal discovery. • We discuss how useful models of reality may be learnedfrom data in the form of causal representations, and discussseveral current problems of machine learning from a causalpoint of view in Section VI. • We assay the implications of causality for practical machinelearning in Section VII. Using causal language, we revisitrobustness and generalization, as well as existing commonpractices such as semi-supervised learning, self-supervised
learning, data augmentation, and pre-training. We discuss examples at the intersection between causality and machine learning in scientific applications and speculate on the advantages of combining the strengths of both fields to build a more versatile AI.

The present paper expands [221], leading to partial text overlap.

II. LEVELS OF CAUSAL MODELING
The gold standard for modeling natural phenomena is a set of coupled differential equations modeling the physical mechanisms responsible for the time evolution. This allows us to predict the future behavior of a physical system, reason about the effect of interventions, and predict statistical dependencies between variables that are generated by the coupled time evolution. It also offers physical insights, explaining the functioning of the system, and lets us read off its causal structure. To this end, consider the coupled set of differential equations

dx/dt = f(x), x ∈ R^d, (1)

with initial value x(t_0) = x_0. The Picard–Lindelöf theorem states that, at least locally, if f is Lipschitz, there exists a unique solution x(t). This implies in particular that the immediate future of x is implied by its past values.

If we formally write this in terms of infinitesimal differentials dt and dx = x(t + dt) − x(t), we get:

x(t + dt) = x(t) + dt · f(x(t)). (2)

From this, we can ascertain which entries of the vector x(t) mathematically determine the future of others x(t + dt). This tells us that if we have a physical system whose physical mechanisms are correctly described using such an ordinary differential equation (1), solved for dx/dt (i.e., the derivative only appears on the left-hand side), then its causal structure can be directly read off.

While a differential equation is a rather comprehensive description of a system, a statistical model can be viewed as a much more superficial one. It often does not refer to dynamic processes; instead, it tells us how some of the variables allow prediction of others as long as experimental conditions do not change. E.g., if we drive a differential equation system with certain types of noise, or we average over time, then it may be the case that statistical dependencies between components of x emerge, and those can then be exploited by machine learning. Such a model does not allow us to predict the effect of interventions; however, its strength is that it can often be learned from observational data, while a differential equation usually requires an intelligent human to come up with it. Causal modeling lies in between these two extremes. Like models in physics, it aims to provide understanding and predict the effect of interventions. However, causal discovery and learning try to arrive at such models in a data-driven way, replacing expert knowledge with weak and generic assumptions.

Note that this requires that the differential equation system describes the causal physical mechanisms. If, in contrast, we considered a set of differential equations that phenomenologically correctly describe the time evolution of a system without capturing the underlying mechanisms (e.g., due to unobserved confounding, or a form of coarse-graining that does not preserve the causal structure [207]), then (2) may not be causally meaningful [221, 190].
TABLE I
A simple taxonomy of models. The most detailed model (top) is a mechanistic or physical one, usually in terms of differential equations. At the other end of the spectrum (bottom), we have a purely statistical model; this can be learned from data, but it often provides little insight beyond modeling associations between epiphenomena. Causal models can be seen as descriptions that lie in between, abstracting away from physical realism while retaining the power to answer certain interventional or counterfactual questions.

Model                | Predict in i.i.d. setting | Predict under distr. shift/intervention | Answer counterfactual questions | Obtain physical insight | Learn from data
Mechanistic/physical | yes                       | yes                                     | yes                             | yes                     | ?
Structural causal    | yes                       | yes                                     | yes                             | ?                       | ?
Causal graphical     | yes                       | yes                                     | no                              | ?                       | ?
Statistical          | yes                       | no                                      | no                              | no                      | yes

The overall situation is summarized in Table I, adapted from [188]. Below, we address some of the tasks listed in Table I in more detail.
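Before turning to these tasks, the following minimal sketch (illustrative code, not from the paper; the toy mechanism f and its coefficients are made up) shows how the causal structure can be read off a system of the form (1)–(2): with an explicit Euler step, which components of x(t) enter the update of each component of x(t + dt) is exposed by the sparsity pattern of the Jacobian of f.

```python
import numpy as np

# Toy mechanism f for (1): dx/dt = f(x) with three coupled components.
# x[0] evolves autonomously, x[1] is driven by x[0], and x[2] by x[1].
def f(x):
    return np.array([
        -0.5 * x[0],               # dx0/dt depends only on x0
        1.0 * x[0] - 0.2 * x[1],   # dx1/dt depends on x0 and x1
        0.8 * x[1] - 0.1 * x[2],   # dx2/dt depends on x1 and x2
    ])

def euler_step(x, dt=1e-2):
    # Discretization (2): x(t + dt) = x(t) + dt * f(x(t)).
    return x + dt * f(x)

x0 = np.ones(3)
x1 = euler_step(x0)   # one step of the time evolution, as in (2)

# Which entries of x(t) determine x_i(t + dt)? Read off the (here constant)
# Jacobian of f by finite differences: a nonzero entry (i, j) means x_j -> x_i.
eps = 1e-6
jac = np.array([(f(x0 + eps * e) - f(x0)) / eps for e in np.eye(3)]).T
print("edge x_j -> x_i wherever Jacobian[i, j] != 0:\n", np.abs(jac) > 1e-8)
```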
A. Predicting in the i.i.d. setting
Statistical models are a superficial description of realityas they are only required to model associations. For a givenset of input examples X and target labels Y , we may beinterested in approximating P ( Y | X ) to answer questions like:“what is the probability that this particular image containsa dog?” or “what is the probability of heart failure givencertain diagnostic measurements (e.g., blood pressure) carriedout on a patient?”. Subject to suitable assumptions, thesequestions can be provably answered by observing a sufficientlylarge amount of i.i.d. data from P ( X, Y ) [257]. Despite theimpressive advances of machine learning, causality offers anunder-explored complement: accurate predictions may notbe sufficient to inform decision making. For example, thefrequency of storks is a reasonable predictor for human birthrates in Europe [168]. However, as there is no direct causal linkbetween those two variables, a change to the stork populationwould not affect the birth rates, even though a statistical modelmay predict so. The predictions of a statistical model are onlyaccurate within identical experimental conditions. Performingan intervention changes the data distribution, which may leadto (arbitrarily) inaccurate predictions [183, 237, 218, 188]. B. Predicting Under Distribution Shifts
Interventional questions are more challenging than predic-tions as they involve actions that take us out of the usual i.i.d.setting of statistical learning. Interventions may affect boththe value of a subset of causal variables and their relations.For example, “is increasing the number of storks in a countrygoing to boost its human birth rate?” and “would fewer peoplesmoke if cigarettes were more socially stigmatized?”. Asinterventions change the joint distribution of the variablesof interest, classical statistical learning guarantees [257] nolonger apply. On the other hand, learning about interventionsmay allow to train predictive models that are robust againstthe changes in distribution that naturally happen in the realworld. Here, interventions do not need to be deliberate actionsto achieve a goal. Statistical relations may change dynamicallyover time (e.g., people’s preferences and tastes) or there maysimply be a mismatch between a carefully controlled trainingdistribution and the test distribution of a model deployed in production. The robustness of deep neural networks has recentlybeen scrutinized and become an active research topic relatedto causal inference. We argue that predicting under distributionshift should not be reduced to just the accuracy on a test set. Ifwe wish to incorporate learning algorithms into human decisionmaking, we need to trust that the predictions of the algorithmwill remain valid if the experimental conditions are changed.
C. Answering Counterfactual Questions
Counterfactual problems involve reasoning about why thingshappened, imagining the consequences of different actions inhindsight, and determining which actions would have achieveda desired outcome. Answering counterfactual questions canbe more difficult than answering interventional questions.However, this may be a key challenge for AI, as an intelligentagent may benefit from imagining the consequences of itsactions as well as understanding in retrospect what led tocertain outcomes, at least to some degree of approximation. We have above mentioned the example of statistical predictionsof heart failure. An interventional question would be “howdoes the probability of heart failure change if we convince apatient to exercise regularly?” A counterfactual one would be“would a given patient have suffered heart failure if they hadstarted exercising a year earlier?”. As we shall discuss below,counterfactuals, or approximations thereof, are especiallycritical in reinforcement learning. They can enable agents toreflect on their decisions and formulate hypotheses that can beempirically verified in a process akin to the scientific method.
D. Nature of Data: Observational, Interventional, (Un)structured
The data format plays a substantial role in which typeof relation can be inferred. We can distinguish two axesof data modalities: observational versus interventional, andhand-engineered versus raw (unstructured) perceptual input. Note that the two types of questions occupy a continuum: to thisend, consider a probability which is both conditional and interventional P ( A | B, do ( C )) . If B is the empty set, we have a classical intervention;if B contained all (unobserved) noise terms, we have a counterfactual. If B isnot identical to the noise terms, but nevertheless informative about them, weget something in between. For instance, reinforcement learning practitionersmay call Q functions as providing counterfactuals, even though they model P (return from t | agent state at time t , do (action at time t )), and thereforecloser to an intervention (which is why they can be estimated from data). Observational and Interventional Data: an extreme form ofdata which is often assumed but seldom strictly available isobservational i.i.d. data, where each data point is independentlysampled from the same distribution. Another extreme isinterventional data with known interventions, where we observedata sets sampled from multiple distributions each of whichis the result of a known intervention. In between, we havedata with (domain) shifts or unknown interventions. Thisis observational in the sense that the data is only observedpassively, but it is interventional in the sense that there areinterventions/shifts, but unknown to us.
Hand Engineered Data vs. Raw Data: especially in classicalAI, data is often assumed to be structured into high-level andsemantically meaningful variables which may partially (modulosome variables being unobserved) correspond to the causalvariables of the underlying graph.
Raw Data, in contrast, is unstructured and does not expose any direct information about causality.

While statistical models are weaker than causal models, they can be efficiently learned from observational data alone, on both hand-engineered features and raw perceptual input such as images, videos, speech, etc. On the other hand, although methods for learning causal structure from observations exist [237, 188, 229, 113, 174, 187, 139, 17, 246, 277, 175, 123, 186, 176, 36, 82, 161], learning causal relations frequently requires collecting data from multiple environments, or the ability to perform interventions [251]. In some cases, it is assumed that all common causes of measured variables are also observed (causal sufficiency). Overall, a significant amount of prior knowledge is encoded in which variables are measured. Moving forward, one would hope to develop methods that replace expert data collection with suitable inductive biases and learning paradigms such as meta-learning and self-supervision. If we wish to learn a causal model that is useful for a particular set of tasks and environments, the appropriate granularity of the high-level variables depends on the tasks of interest and on the type of data we have at our disposal, for example which interventions can be performed and what is known about the domain.

III. CAUSAL MODELS AND INFERENCE
As discussed, reality can be modeled at different levels, from the physical one to statistical associations between epiphenomena. In this section, we expand on the difference between statistical and causal modeling and review a formal language to talk about interventions and distribution changes.
A. Methods driven by i.i.d. data
The machine learning community has produced impressivesuccesses with machine learning applications to big dataproblems [148, 171, 223, 231, 53]. In these successes, thereare several trends at work [215]: (1) we have massive amountsof data, often from simulations or large scale human labeling,(2) we use high capacity machine learning systems (i.e.,complex function classes with many adjustable parameters), There are also algorithms that do not require causal sufficiency [237]. (3) we employ high-performance computing systems, andfinally (often ignored, but crucial when it comes to causality)(4) the problems are i.i.d. The latter can be guaranteed bythe construction of a task including training and test set (e.g.,image recognition using benchmark datasets). Alternatively,problems can be made approximately i.i.d., e.g.. by carefullycollecting the right training set for a given application problem,or by methods such as “experience replay” [171] where areinforcement learning agent stores observations in order tolater permute them for the purpose of re-training.For i.i.d. data, strong universal consistency results fromstatistical learning theory apply, guaranteeing convergence ofa learning algorithm to the lowest achievable risk. Such algo-rithms do exist, for instance, nearest neighbor classifiers, sup-port vector machines, and neural networks [257, 217, 239, 66].Seen in this light, it is not surprising that we can indeed matchor surpass human performance if given enough data. However,current machine learning methods often perform poorly whenfaced with problems that violate the i.i.d. assumption, yet seemtrivial to humans. Vision systems can be grossly misled ifan object that is normally recognized with high accuracy isplaced in a context that in the training set may be negativelycorrelated with the presence of the object. Distribution shiftsmay also arise from simple corruptions that are common inreal-world data collection pipelines [9, 106, 129, 170, 206].An example of this is the impact of socio-economic factors inclinics in Thailand on the accuracy of a detection system forDiabetic Retinopathy [18]. More dramatically, the phenomenonof “adversarial vulnerability” [249] highlights how even tiny buttargeted violations of the i.i.d. assumption, generated by addingsuitably chosen perturbations to images, imperceptible to hu-mans, can lead to dangerous errors such as confusion of trafficsigns. Overall, it is fair to say that much of the current practice(of solving i.i.d. benchmark problems) and most theoreticalresults (about generalization in i.i.d. settings) fail to tackle thehard open challenge of generalization across problems.To further understand how the i.i.d. assumption is problem-atic, let us consider a shopping example. Suppose Alice islooking for a laptop rucksack on the internet (i.e., a rucksackwith a padded compartment for a laptop). The web shop’srecommendation system suggests that she should buy a laptopto go along with the rucksack. This seems odd because sheprobably already has a laptop, otherwise she would not belooking for the rucksack in the first place. In a way, the laptopis the cause, and the rucksack is an effect. Now suppose we aretold whether a customer has bought a laptop. This reduces ouruncertainty about whether she also bought a laptop rucksack,and vice versa —- and it does so by the same amount (the mutual information ), so the directionality of cause and effectis lost. 
However, the directionality is present in the physical mechanisms generating statistical dependence, for instance the mechanism that makes a customer want to buy a rucksack once she owns a laptop. Recommending an item to buy constitutes an intervention in a system, taking us outside the i.i.d. setting. We no longer work with the observational distribution, but a distribution where certain variables or mechanisms have changed.

Note that the physical mechanisms take place in time, and if available, time order may provide additional information about causality.
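To see the symmetry concretely, here is a small illustration (the joint probability table is hypothetical and only meant to mimic the shopping example): for any joint distribution, the mutual information between the two purchases is the same in both directions, so it cannot by itself reveal which variable is the cause.

```python
import numpy as np

# Hypothetical joint distribution P(laptop, rucksack) over {0, 1} x {0, 1}.
p = np.array([[0.55, 0.05],   # no laptop:  rarely buys a laptop rucksack
              [0.15, 0.25]])  # has laptop: often buys a laptop rucksack

def mutual_information(p_xy):
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])).sum())

# I(X; Y) is symmetric: knowing either variable reduces uncertainty about
# the other by the same amount, regardless of which one is the cause.
print(mutual_information(p), mutual_information(p.T))  # identical values
```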
B. The Reichenbach Principle: From Statistics to Causality
Reichenbach [198] clearly articulated the connection between causality and statistical dependence. He postulated:
Common Cause Principle : if two observables X and Y are statistically dependent, then there exists a variable Z that causally influences both and explains all thedependence in the sense of making them independentwhen conditioned on Z .As a special case, this variable can coincide with X or Y .Suppose that X is the frequency of storks and Y the humanbirth rate. If storks bring the babies, then the correct causalgraph is X → Y . If babies attract storks, it is X ← Y . If thereis some other variable that causes both (such as economicdevelopment), we have X ← Z → Y .Without additional assumptions, we cannot distinguish thesethree cases using observational data. The class of observationaldistributions over X and Y that can be realized by thesemodels is the same in all three cases. A causal model thuscontains genuinely more information than a statistical one.While causal structure discovery is hard if we have only twoobservables [187], the case of more observables is surprisinglyeasier, the reason being that in that case, there are nontrivialconditional independence properties [238, 51, 74] implied bycausal structure. These generalize the Reichenbach Principleand can be described by using the language of causal graphsor structural causal models, merging probabilistic graphicalmodels and the notion of interventions [237, 183]. They are bestdescribed using directed functional parent-child relationshipsrather than conditionals. While conceptually simple in hindsight,this constituted a major step in the understanding of causality. C. Structural causal models (SCMs)
The SCM viewpoint considers a set of observables (or variables) X_1, . . . , X_n associated with the vertices of a directed acyclic graph (DAG). We assume that each observable is the result of an assignment

X_i := f_i(PA_i, U_i), (i = 1, . . . , n), (3)

using a deterministic function f_i depending on X_i's parents in the graph (denoted by PA_i) and on an unexplained random variable U_i. Mathematically, the observables are thus random variables, too. Directed edges in the graph represent direct causation, since the parents are connected to X_i by directed edges and through (3) directly affect the assignment of X_i. The noise U_i ensures that the overall object (3) can represent a general conditional distribution P(X_i | PA_i), and the noises U_1, . . . , U_n are assumed to be jointly independent. If they were not, then by the Common Cause Principle there should be another variable that causes their dependence, and thus our model would not be causally sufficient.

If we specify the distributions of U_1, . . . , U_n, recursive application of (3) allows us to compute the entailed observational joint distribution P(X_1, . . . , X_n). This distribution has structural properties inherited from the graph [147, 183]: it satisfies the causal Markov condition, stating that conditioned on its parents, each X_j is independent of its non-descendants. Intuitively, we can think of the independent noises as “information probes” that spread through the graph (much like independent elements of gossip can spread through a social network). Their information gets entangled, manifesting itself in a footprint of conditional dependencies, making it possible to infer aspects of the graph structure from observational data using independence testing. Like in the gossip analogy, the footprint may not be sufficiently characteristic to pin down a unique causal structure. In particular, it certainly is not if there are only two observables, since any nontrivial conditional independence statement requires at least three variables. The two-variable problem can be addressed by making additional assumptions, as not only the graph topology leaves a footprint in the observational distribution, but the functions f_i do, too. This point is interesting for machine learning, where much attention is devoted to properties of function classes (e.g., priors or capacity measures), and we shall return to it below.

a) Causal Graphical Models: The graph structure along with the joint independence of the noises implies a canonical factorization of the joint distribution entailed by (3) into causal conditionals that we refer to as the causal (or disentangled) factorization,

P(X_1, . . . , X_n) = ∏_{i=1}^{n} P(X_i | PA_i). (4)

While many other entangled factorizations are possible, e.g.,

P(X_1, . . . , X_n) = ∏_{i=1}^{n} P(X_i | X_{i+1}, . . . , X_n), (5)

the factorization (4) yields practical computational advantages during inference, which is in general hard, even when it comes to non-trivial approximations [210]. But more interestingly, it is the only one that decomposes the joint distribution into conditionals corresponding to the structural assignments (3). We think of these as the causal mechanisms that are responsible for all statistical dependencies among the observables. Accordingly, in contrast to (5), the disentangled factorization represents the joint distribution as a product of causal mechanisms.
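To make (3) and (4) concrete, here is a minimal sketch (not from the paper; the three-variable chain X_1 → X_2 → X_3 and all functional forms are illustrative assumptions) of an SCM sampled by recursively applying its assignments in topological order, together with a crude check of one conditional independence implied by the causal Markov condition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Structural assignments (3) for the toy graph X1 -> X2 -> X3.
# Each variable is a deterministic function of its parents and an
# independent, unexplained noise term U_i.
def sample_scm(n):
    u1 = rng.normal(size=n)
    u2 = rng.normal(size=n)
    u3 = rng.normal(size=n)
    x1 = u1                               # X1 := f1(U1)
    x2 = np.tanh(2.0 * x1) + 0.3 * u2     # X2 := f2(X1, U2)
    x3 = 0.7 * x2 + 0.3 * u3              # X3 := f3(X2, U3)
    return x1, x2, x3

x1, x2, x3 = sample_scm(100_000)

# Footprint of the causal Markov condition: X3 is independent of X1 given X2.
# A crude check: within a thin slice of X2, the association between X1 and X3
# should (approximately) vanish, while it is clearly present unconditionally.
mask = np.abs(x2 - 0.5) < 0.05
print("corr(X1, X3)            =", np.corrcoef(x1, x3)[0, 1])
print("corr(X1, X3 | X2 ~ 0.5) =", np.corrcoef(x1[mask], x3[mask])[0, 1])
```

Ancestral sampling in topological order is exactly the recursive application of (3) mentioned above; an intervention will later amount to replacing one of these assignments while leaving the others untouched.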
b) Latent variables and Confounders: Variables in acausal graph may be unobserved, which can make causalinference particularly challenging. Unobserved variables may confound two observed variables so that they either appearstatistically related while not being causally related (i.e., neitherof the variables is an ancestor of the other), or their statisticalrelation is altered by the presence of the confounder (e.g., onevariable is a causal ancestor for the other, but the confounderis a causal ancestor of both). Confounders may or may not beknown or observed. c) Interventions:
The SCM language makes it straightforward to formalize interventions as operations that modify a subset of assignments (3), e.g., changing U_i, setting f_i (and thus X_i) to a constant, or changing the functional form of f_i (and thus the dependency of X_i on its parents) [237, 183]. Several types of interventions may be possible [62], which can be categorized as:
• No intervention: only observational data is obtained from the causal model.
• Hard/perfect: the function in the structural assignment (3) of a variable (or, analogously, of multiple variables) is set to a constant (implying that the value of the variable is fixed), and then the entailed distribution for the modified SCM is computed.
• Soft/imperfect: the structural assignment (3) for a variable is modified by changing the function or the noise term (this corresponds to changing the conditional distribution given its parents).
• Uncertain: the learner is not sure which mechanism/variable is affected by the intervention.

One could argue that stating the structural assignments as in (3) is not yet sufficient to formulate a causal model. In addition, one should specify the set of possible interventions on the structural causal model. This may be done implicitly via the functional form of the structural equations, by allowing any intervention over the domain of the mechanisms. This becomes relevant when learning a causal model from data, as the SCM depends on the interventions. Pragmatically, we should aim at learning causal models that are useful for specific sets of tasks of interest [207, 267], on appropriate descriptors (in terms of which causal statements they support) that must either be provided or learned. We will return to the assumptions that allow learning causal models and features in Section IV.
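As an illustration (a sketch under the same toy SCM assumptions as above, not taken from the paper), a hard intervention do(X_2 = c) replaces the assignment of X_2 by a constant, while a soft intervention only changes its function or noise term; the downstream mechanism of X_3 is left untouched and is sampled as before.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, intervention=None):
    """Toy SCM X1 -> X2 -> X3; `intervention` optionally replaces the
    structural assignment of X2 (hard or soft), as in the categories above."""
    x1 = rng.normal(size=n)
    if intervention == "hard":       # do(X2 = 2): cuts the edge X1 -> X2
        x2 = np.full(n, 2.0)
    elif intervention == "soft":     # change the noise of f2, keep the parent
        x2 = np.tanh(2.0 * x1) + 1.0 * rng.normal(size=n)
    else:                            # observational mechanism
        x2 = np.tanh(2.0 * x1) + 0.3 * rng.normal(size=n)
    x3 = 0.7 * x2 + 0.3 * rng.normal(size=n)   # mechanism of X3 is untouched
    return x1, x2, x3

for setting in [None, "hard", "soft"]:
    x1, x2, x3 = sample(50_000, setting)
    print(setting, "E[X3] =", round(x3.mean(), 3),
          "corr(X1, X3) =", round(np.corrcoef(x1, x3)[0, 1], 3))
```

Only the hard intervention shifts the mean of X_3 and destroys the dependence between X_1 and X_3, since it disconnects X_2 from its parent.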
D. Difference Between Statistical Models, Causal Graphical Models, and SCMs
An example of the difference between a statistical and a causal model is depicted in Figure 1. A statistical model may be defined, for instance, through a graphical model, i.e., a probability distribution along with a graph such that the former is Markovian with respect to the latter (in which case it can be factorized as (4)). However, the edges in a (generic) graphical model do not need to be causal [97]. For instance, the two graphs X_1 → X_2 → X_3 and X_1 ← X_2 ← X_3 imply the same conditional independence(s) (X_1 and X_3 are independent given X_2). They are thus in the same Markov equivalence class, i.e., if a distribution is Markovian w.r.t. one of the graphs, then it also is w.r.t. the other graph. Note that the above serves as an example that the Markov condition is not sufficient for causal discovery. Further assumptions are needed, cf. below and [237, 183, 188].

A graphical model becomes causal if the edges of its graph are causal (in which case the graph is referred to as a “causal graph”), cf. (3). This allows us to compute interventional distributions as depicted in Figure 1. When a variable is intervened upon, we disconnect it from its parents, fix its value, and perform ancestral sampling on its children.

A structural causal model is composed of (i) a set of causal variables and (ii) a set of structural equations with a distribution over the noise variables U_i (or a set of causal conditionals). While both causal graphical models and SCMs allow us to compute interventional distributions, only SCMs allow us to compute counterfactuals. To compute counterfactuals, we need to fix the value of the noise variables. Moreover, there are many ways to represent a conditional as a structural assignment (by picking different combinations of functions and noise variables).

a) Causal Learning and Reasoning: The conceptual basis of statistical learning is a joint distribution P(X_1, . . . , X_n) (where often one of the X_i is a response variable denoted as Y), and we make assumptions about the function classes used to approximate, say, a regression E[Y | X]. Causal learning considers a richer class of assumptions, and seeks to exploit the fact that the joint distribution possesses a causal factorization (4). It involves the causal conditionals P(X_i | PA_i) (e.g., represented by the functions f_i and the distribution of U_i in (3)), how these conditionals relate to each other, and the interventions or changes that they admit. Once a causal model is available, either through external human knowledge or a learning process, causal reasoning allows us to draw conclusions on the effect of interventions, counterfactuals, and potential outcomes. In contrast, statistical models only allow us to reason about the outcome of i.i.d. experiments.
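A minimal sketch of the counterfactual computation just described (illustrative only; the linear mechanism Y := 2X + U_Y and the observed values are hypothetical): in an additive-noise SCM we can recover the noise value consistent with an observed unit, replace one assignment, and re-evaluate the remaining mechanisms with the noise held fixed.

```python
import numpy as np

# Toy additive-noise SCM: X -> Y with Y := 2*X + U_Y.
f_y = lambda x: 2.0 * x

# Observed (factual) unit.
x_obs, y_obs = 1.0, 2.5

# 1) Abduction: infer the noise value consistent with the observation.
u_y = y_obs - f_y(x_obs)          # here U_Y = 0.5

# 2) Action: modify the SCM, e.g. do(X = 0).
x_cf = 0.0

# 3) Prediction: re-evaluate downstream assignments with the SAME noise.
y_cf = f_y(x_cf) + u_y
print("Counterfactual Y, had X been 0 for this unit:", y_cf)   # 0.5

# Contrast: the interventional prediction E[Y | do(X = 0)] averages over U_Y
# (here 0 if U_Y has zero mean); the counterfactual keeps this unit's noise.
```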
IV. INDEPENDENT CAUSAL MECHANISMS

We now return to the disentangled factorization (4) of the joint distribution P(X_1, . . . , X_n). This factorization according to the causal graph is always possible when the U_i are independent, but we will now consider an additional notion of independence relating the factors in (4) to one another.

Whenever we perceive an object, our brain assumes that the object and the mechanism by which the information contained in its light reaches our brain are independent. We can violate this by looking at the object from an accidental viewpoint, which can give rise to optical illusions [188]. The above independence assumption is useful because in practice it holds most of the time, and our brain thus relies on objects being independent of our vantage point and the illumination. Likewise, there should not be accidental coincidences, such as 3D structures lining up in 2D, or shadow boundaries coinciding with texture boundaries. In vision research, this is called the generic viewpoint assumption.

If we move around the object, our vantage point changes, but we assume that the other variables of the overall generative process (e.g., lighting, object position and structure) are unaffected by that. This is an invariance implied by the above independence, allowing us to infer 3D information even without stereo vision (“structure from motion”).

For another example, consider a dataset that consists of altitude A and average annual temperature T of weather stations [188]. A and T are correlated, which we believe is due to the fact that altitude has a causal effect on temperature. Suppose we had two such datasets, one for Austria and one for Switzerland. The two joint distributions P(A, T) may be rather different since the marginal distributions P(A) over altitudes will differ. The conditionals P(T | A), however, may be (close to) invariant, since they characterize the physical mechanisms that generate temperature from altitude. This similarity is lost upon us if we only look at the overall joint distribution, without information about the causal structure A → T. The causal factorization P(A)P(T | A) will contain a component P(T | A) that generalizes across countries, while the entangled factorization P(T)P(A | T) will exhibit no such robustness. Cum grano salis, the same applies when we consider interventions in a system.

Fig. 1. Difference between statistical (left) and causal models (right) on a given set of three variables. While a statistical model specifies a single probability distribution, a causal model represents a set of distributions, one for each possible intervention (indicated with a symbol in the figure).
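The following toy simulation (illustrative only; the lapse-rate mechanism and the two "countries" are synthetic stand-ins, not data from the paper) shows the asymmetry just described: a model of P(T | A) fitted in one environment transfers to another environment with a different altitude distribution, whereas a model of P(A | T) degrades.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_country(n, alt_mean, alt_std):
    """A -> T: temperature is generated from altitude by a fixed mechanism,
    while the altitude marginal P(A) differs between environments."""
    a = rng.normal(alt_mean, alt_std, size=n)             # altitude in meters
    t = 15.0 - 6.5e-3 * a + rng.normal(0, 1.0, size=n)    # ~6.5 degC per km
    return a, t

a_at, t_at = sample_country(20_000, alt_mean=800, alt_std=400)    # "Austria"
a_ch, t_ch = sample_country(20_000, alt_mean=1500, alt_std=600)   # "Switzerland"

def transfer_ratio(x_tr, y_tr, x_te, y_te):
    """MSE in the new environment divided by MSE in the training environment."""
    w = np.polyfit(x_tr, y_tr, 1)
    mse_in = np.mean((np.polyval(w, x_tr) - y_tr) ** 2)
    mse_out = np.mean((np.polyval(w, x_te) - y_te) ** 2)
    return mse_out / mse_in

print("causal     P(T|A):", round(transfer_ratio(a_at, t_at, a_ch, t_ch), 2))  # ~1: invariant mechanism
print("anticausal P(A|T):", round(transfer_ratio(t_at, a_at, t_ch, a_ch), 2))  # > 1: degrades under shift
```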
For a model to correctly predict the effect of interventions, it needs to be robust when generalizing from an observational distribution to certain interventional distributions. One can express the above insights as follows [218, 188]:
Independent Causal Mechanisms (ICM) Principle.
The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.
This principle entails several notions important to causality, including separate intervenability of causal variables, modularity and autonomy of subsystems, and invariance [183, 188]. If we have only two variables, it reduces to an independence between the cause distribution and the mechanism producing the effect distribution.

Applied to the causal factorization (4), the principle tells us that the factors should be independent in the sense that
(a) changing (or performing an intervention upon) one mechanism P(X_i | PA_i) does not change any of the other mechanisms P(X_j | PA_j) (i ≠ j) [218], and
(b) knowing some other mechanisms P(X_i | PA_i) (i ≠ j) does not give us information about a mechanism P(X_j | PA_j) [120].
This notion of independence thus subsumes two aspects: the former pertaining to influence, and the latter to information.

The notion of invariant, autonomous, and independent mechanisms has appeared in various guises throughout the history of causality research [99, 71, 111, 183, 120, 240, 188]. Early work on this was done by Haavelmo [99], stating the assumption that changing one of the structural assignments leaves the other ones invariant. Hoover [111] attributes to Herb Simon the invariance criterion: the true causal order is the one that is invariant under the right sort of intervention. Aldrich [4] discusses the historical development of these ideas in economics. He argues that the “most basic question one can ask about a relation should be: How autonomous is it?” [71, preface]. Pearl [183] discusses autonomy in detail, arguing that a causal mechanism remains invariant when other mechanisms are subjected to external influences. He points out that causal discovery methods may best work “in longitudinal studies conducted under slightly varying conditions, where accidental independencies are destroyed and only structural independencies are preserved.” Overviews are provided by Aldrich [4], Hoover [111], Pearl [183], and Peters et al. [188, Sec. 2.2]. These seemingly different notions can be unified [120, 240].

We view any real-world distribution as a product of causal mechanisms. A change in such a distribution (e.g., when moving from one setting/domain to a related one) will always be due to changes in at least one of those mechanisms. Consistent with implication (a) of the ICM Principle, we state the following hypothesis:

Sparse Mechanism Shift (SMS).
Small distribution changes tend to manifest themselves in a sparse or local way in the causal/disentangled factorization (4), i.e., they should usually not affect all factors simultaneously.
In contrast, if we consider a non-causal factorization, e.g.,(5), then many, if not all, terms will be affected simultaneouslyas we change one of the physical mechanisms responsible fora system’s statistical dependencies. Such a factorization maythus be called entangled , a term that has gained popularity inmachine learning [23, 109, 158, 247].The SMS hypothesis was stated in [181, 24, 221, 115], andin earlier form in [218, 279, 220]. An intellectual ancestoris Simon’s invariance criterion, i.e., that the causal structureremains invariant across changing background conditions [235].The hypothesis is also related to ideas of looking for featuresthat vary slowly [69, 270]. It has recently been used for learning causal models [131], modular architectures [84, 28]and disentangled representations [159].We have informally talked about the dependence of twomechanisms P ( X i | PA i ) and P ( X j | PA j ) when discussingthe ICM Principle and the disentangled factorization (4). Notethat the dependence of two such mechanisms does not coincidewith the statistical dependence of the random variables X i and X j . Indeed, in a causal graph, many of the random variableswill be dependent even if all mechanisms are independent.Also, the independence of the noise terms U i does not translateinto the independence of the X i . Intuitively speaking, theindependent noise terms U i provide and parameterize theuncertainty contained in the fact that a mechanism P ( X i | PA i ) is non-deterministic, and thus ensure that each mechanismadds an independent element of uncertainty. In this sense, theICM Principle contains the independence of the unexplainednoise terms in an SCM (3) as a special case.In the ICM Principle, we have stated that independenceof two mechanisms (formalized as conditional distributions)should mean that the two conditional distributions do not inform or influence each other. The latter can be thought of asrequiring that independent interventions are possible. To betterunderstand the former, we next discuss a formalization in termsof algorithmic independence . In a nutshell, we encode eachmechanism as a bit string, and require that joint compressionof these strings does not save space relative to independentcompressions.To this end, first recall that we have so far discussed linksbetween causal and statistical structures. Of the two, the morefundamental one is the causal structure, since it captures thephysical mechanisms that generate statistical dependencies inthe first place. The statistical structure is an epiphenomenonthat follows if we make the unexplained variables random.It is awkward to talk about statistical information containedin a mechanism since deterministic functions in the genericcase neither generate nor destroy information. This servesas a motivation to devise an alternative model of causalstructures in terms of Kolmogorov complexity [120]. TheKolmogorov complexity (or algorithmic information) of a bitstring is essentially the length of its shortest compression on aTuring machine, and thus a measure of its information content.Independence of mechanisms can be defined as vanishingmutual algorithmic information; i.e., two conditionals areconsidered independent if knowing (the shortest compressionof) one does not help us achieve a shorter compression of theother.Algorithmic information theory provides a natural frameworkfor non-statistical graphical models [120, 126]. 
Just like the latter are obtained from structural causal models by making the unexplained variables U_i random, we obtain algorithmic graphical models by making the U_i bit strings, jointly independent across nodes, and viewing X_i as the output of a fixed Turing machine running the program U_i on the input PA_i. Similar to the statistical case, one can define a local causal Markov condition, a global one in terms of d-separation, and an additive decomposition of the joint Kolmogorov complexity in analogy to (4), and prove that they are implied by the structural causal model [120]. Interestingly, in this case, independence of noises and independence of mechanisms coincide, since the independent programs play the role of the unexplained noise terms. This approach shows that causality is not intrinsically bound to statistics.

(In the sense that the mapping from PA_i to X_i is described by a non-trivial conditional distribution, rather than by a function.)
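As a rough, purely illustrative approximation of this idea (not from the paper; an off-the-shelf compressor stands in for Kolmogorov complexity, which is uncomputable, and the mechanism strings are made up), one can ask whether jointly compressing the descriptions of two mechanisms saves space relative to compressing them separately:

```python
import zlib

def clen(s: str) -> int:
    """Length of a compressed description; a crude stand-in for
    Kolmogorov complexity, which is uncomputable."""
    return len(zlib.compress(s.encode(), 9))

# Two "mechanisms", encoded as strings (e.g. descriptions of conditionals).
mech_a = "P(T|A): T = 15.0 - 0.0065*A + Normal(0,1)"
mech_b = "P(R|S): R = 0.2*S**2 + Normal(0,0.5)"
mech_c = "P(T'|A'): T' = 15.0 - 0.0065*A' + Normal(0,1)"  # near-copy of mech_a

def approx_mutual_info(x: str, y: str) -> int:
    # Vanishing algorithmic mutual information ~ no gain from joint compression.
    return clen(x) + clen(y) - clen(x + y)

print("unrelated mechanisms:      ", approx_mutual_info(mech_a, mech_b))
print("near-duplicate mechanisms: ", approx_mutual_info(mech_a, mech_c))
```

The near-duplicate pair shows a much larger joint-compression gain, i.e., the two conditionals inform each other, which is exactly what the ICM Principle rules out for the true causal mechanisms.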
V. CAUSAL DISCOVERY AND MACHINE LEARNING

Let us turn to the problem of causal discovery from data. Subject to suitable assumptions such as faithfulness [237], one can sometimes recover aspects of the underlying graph from observational data by performing conditional independence tests (one can recover the causal structure up to a Markov equivalence class, where DAGs have the same undirected skeleton and “immoralities” X_i → X_j ← X_k). However, there are several problems with this approach. One is that our datasets are always finite in practice, and conditional independence testing is a notoriously difficult problem, especially if conditioning sets are continuous and multi-dimensional. So while, in principle, the conditional independencies implied by the causal Markov condition hold irrespective of the complexity of the functions appearing in an SCM, for finite datasets conditional independence testing is hard without additional assumptions [225]. Recent progress in (conditional) independence testing heavily relies on kernel function classes to represent probability distributions in reproducing kernel Hilbert spaces [90, 91, 73, 278, 60, 191, 42]. The other problem is that in the case of only two variables, the ternary concept of conditional independence collapses and the Markov condition thus has no nontrivial implications.

It turns out that both problems can be addressed by making assumptions on function classes. This is typical for machine learning, where it is well known that finite-sample generalization without assumptions on function classes is impossible. Specifically, although there are universally consistent learning algorithms, i.e., approaching minimal expected error in the infinite sample limit, there are always cases where this convergence is arbitrarily slow. So for a given sample size, it will depend on the problem being learned whether we achieve low expected error, and statistical learning theory provides probabilistic guarantees in terms of measures of complexity of function classes [55, 257].

Returning to causality, we provide an intuition why assumptions on the functions in an SCM should be necessary to learn about them from data. Consider a toy SCM with only two observables X → Y. In this case, (3) turns into

X = U (6)
Y = f(X, V) (7)

with U ⊥⊥ V. Now think of V acting as a random selector variable choosing from among a set of functions F = {f_v(x) ≡ f(x, v) | v ∈ supp(V)}. If f(x, v) depends on v in a non-smooth way, it should be hard to glean information about the SCM from a finite dataset, given that V is not observed and its value randomly selects among arbitrarily different f_v.
This motivates restricting the complexity with which f depends on V. A natural restriction is to assume an additive noise model

X = U (8)
Y = f(X) + V. (9)

If f in (7) depends smoothly on V, and if V is relatively well concentrated, this can be motivated by a local Taylor expansion argument. It drastically reduces the effective size of the function class — without such assumptions, the latter could depend exponentially on the cardinality of the support of V. Restrictions of function classes not only make it easier to learn functions from data, but it turns out that they can break the symmetry between cause and effect in the two-variable case: one can show that given a distribution over X, Y generated by an additive noise model, one cannot fit an additive noise model in the opposite direction (i.e., with the roles of X and Y interchanged) [113, 174, 187, 139, 17], cf. also [246]. This is subject to certain genericity assumptions, and notable exceptions include the case where U, V are Gaussian and f is linear. It generalizes results of Shimizu et al. [229] for linear functions, and it can be generalized to include non-linear rescalings [277], loops [175], confounders [123], and multi-variable settings [186]. Empirically, there is a number of methods that can detect the causal direction better than chance [176], some of them building on the above Kolmogorov complexity model [36], some on generative models [82], and some directly learning to classify bivariate distributions into causal vs. anticausal [161].

While restrictions of function classes are one possibility to identify the causal structure, other assumptions or scenarios are possible. So far, we have discussed that causal models are expected to generalize under certain distribution shifts since they explicitly model interventions. By the SMS hypothesis, much of the causal structure is assumed to remain invariant. Hence distribution shifts such as observing a system in different “environments / contexts” can significantly help to identify causal structure [251, 188]. These contexts can come from interventions [218, 189, 192], non-stationary time series [117, 100, 193], or multiple views [89, 115]. The contexts can likewise be interpreted as different tasks, which provides a connection to meta-learning [22, 67, 213].

The work of Bengio et al. [24] ties the generalization in meta-learning to invariance properties of causal models, using the idea that a causal model should adapt faster to interventions than purely predictive models. This was extended to multiple variables and unknown interventions in [131], proposing a framework for causal discovery using neural networks by turning the discrete graph search into a continuous optimization problem. While [24, 131] focus on learning a causal model using neural networks with an unsupervised loss, the work of Dasgupta et al. [50] explores learning a causal model using a reinforcement learning agent. These approaches have in common that semantically meaningful abstract representations are given and do not need to be learned from high-dimensional and low-level (e.g., pixel) data.
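A compact sketch of the additive-noise idea (illustrative, not the specific estimators used in the cited works; it combines a simple polynomial regression with a median-heuristic HSIC-style dependence score): fit an additive noise model in both directions and prefer the direction in which the residual is more nearly independent of the input.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_gram(z):
    d2 = (z[:, None] - z[None, :]) ** 2
    bw = np.median(d2[d2 > 0])            # median-heuristic bandwidth
    return np.exp(-d2 / bw)

def hsic(x, y):
    """Biased HSIC estimate; larger means more dependent."""
    n = len(x)
    k, l = rbf_gram(x), rbf_gram(y)
    h = np.eye(n) - np.ones((n, n)) / n
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2

def anm_score(cause, effect, deg=5):
    """Fit effect = f(cause) + residual and score dependence of residual on cause."""
    coeffs = np.polyfit(cause, effect, deg)
    resid = effect - np.polyval(coeffs, cause)
    return hsic(cause, resid)

# Ground truth: X -> Y with a non-linear mechanism and non-Gaussian noise.
n = 500
x = rng.uniform(-2, 2, n)
y = np.tanh(2 * x) + 0.3 * rng.uniform(-1, 1, n)

s_xy = anm_score(x, y)   # forward (causal) direction
s_yx = anm_score(y, x)   # backward (anticausal) direction
print("dependence of residual on input, X->Y:", s_xy)
print("dependence of residual on input, Y->X:", s_yx)
print("inferred direction:", "X -> Y" if s_xy < s_yx else "Y -> X")
```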
VI. LEARNING CAUSAL VARIABLES

Traditional causal discovery and reasoning assume that the units are random variables connected by a causal graph. However, real-world observations are usually not structured into those units to begin with, e.g., objects in images [162]. Hence, the emerging field of causal representation learning strives to learn these variables from data, much like machine learning went beyond symbolic AI in not requiring that the symbols that algorithms manipulate be given a priori (cf. Bonet and Geffner [33]). To this end, we could try to connect causal variables S_1, . . . , S_n to observations

X = G(S_1, . . . , S_n), (10)

where G is a non-linear function. An example can be seen in Figure 2, where high-dimensional observations are the result of a view on the state of a causal system that is then processed by a neural network to extract high-level variables that are useful on a variety of tasks. Although causal models in economics, medicine, or psychology often use variables that are abstractions of underlying quantities, it is challenging to state general conditions under which coarse-grained variables admit causal models with well-defined interventions [41, 207]. Defining objects or variables that can be causally related amounts to coarse-graining of more detailed models of the world, including microscopic structural equation models [207], ordinary differential equations [173, 208], and temporally aggregated time series [78]. The task of identifying suitable units that admit causal models is challenging for both human and machine intelligence. Still, it aligns with the general goal of modern machine learning to learn meaningful representations of data, where meaningful can include robust, explainable, or fair [142, 133, 276, 130, 260].

Fig. 2. Illustration of the causal representation learning problem setting. Perceptual data, such as images or other high-dimensional sensor measurements, can be thought of as entangled views of the state of an unknown causal system as described in (10). With the exception of possible task labels, none of the variables describing the causal variables generating the system may be known. The goal of causal representation learning is to learn a representation (partially) exposing this unknown causal structure (e.g., which variables describe the system, and their relations). As full recovery may often be unreasonable, neural networks may map the low-level features to some high-level variables supporting causal statements relevant to a set of downstream tasks of interest. For example, if the task is to detect the manipulable objects in a scene, the representation may separate intrinsic object properties from their pose and appearance to achieve robustness to distribution shifts on the latter variables. Usually, we do not get labels for the high-level variables, but the properties of causal models can serve as useful inductive biases for learning (e.g., the SMS hypothesis).

To combine structural causal modeling (3) and representation learning, we should strive to embed an SCM into larger machine learning models whose inputs and outputs may be high-dimensional and unstructured, but whose inner workings are at least partly governed by an SCM (that can be parameterized with a neural network). The result may be a modular architecture, where the different modules can be individually fine-tuned and re-purposed for new tasks [181, 84], and the SMS hypothesis can be used to enforce the appropriate structure. We visualize an example in Figure 3, where changes are sparse for the appropriate causal variables (the position of the finger and the cube changed as a result of moving the finger), but dense in other representations, for example in the pixel space (as finger and cube move, many pixels change their value). At the extreme, all pixels may change as a result of a sparse intervention, for example if the camera view or the lighting changes.

Fig. 3. Example of the SMS hypothesis where an intervention (which may or may not be intentional/observed) changes the position of one finger, and as a consequence, the object falls. The change in pixel space is entangled (or distributed), in contrast to the change in the causal model.

We now discuss three problems of modern machine learning in the light of causal representation learning.

a) Problem 1 – Learning Disentangled Representations: We have earlier discussed the ICM Principle implying both the independence of the SCM noise terms in (3) and thus the feasibility of the disentangled representation

P(S_1, . . . , S_n) = ∏_{i=1}^{n} P(S_i | PA_i), (11)

as well as the property that the conditionals P(S_i | PA_i) be independently manipulable and largely invariant across related problems. Suppose we seek to reconstruct such a disentangled representation using independent mechanisms (11) from data, but the causal variables S_i are not provided to us a priori. Rather, we are given (possibly high-dimensional) X = (X_1, . . . , X_d) (below, we think of X as an image with pixels X_1, . . . , X_d) as in (10), from which we should construct causal variables S_1, . . . , S_n (n ≪ d) as well as mechanisms, cf. (3),

S_i := f_i(PA_i, U_i), (i = 1, . . . , n), (12)

modeling the causal relationships among the S_i. To this end, as a first step, we can use an encoder q : R^d → R^n taking X to a latent “bottleneck” representation comprising the unexplained noise variables U = (U_1, . . . , U_n). The next step is the mapping f(U) determined by the structural assignments f_1, . . . , f_n. Finally, we apply a decoder p : R^n → R^d. For suitable n, the system can be trained using reconstruction error to satisfy p ∘ f ∘ q ≈ id on the observed images. If the causal graph is known, the topology of a neural network implementing f can be fixed accordingly; if not, the neural network decoder learns the composition p̃ = p ∘ f. In practice, one may not know f, and thus only learn an autoencoder p̃ ∘ q, where the causal graph effectively becomes an unspecified part of the decoder p̃, possibly aided by a suitable choice of architecture [149].

Much of the existing work on disentanglement [109, 158, 159, 256, 157, 135, 202, 61] focuses on independent factors of variation. This can be viewed as the special case where the causal graph is trivial, i.e., ∀i : PA_i = ∅ in (12). In this case, the factors are functions of the independent exogenous noise variables, and thus independent themselves. However, the ICM Principle is more general and contains statistical independence as a special case. Note that the problem of object-centric representation learning [10, 39, 83, 86, 87, 138, 155, 160, 262, 255] can also be considered a special case of disentangled factorization as discussed here. Objects are constituents of scenes that in principle permit separate interventions.
A disentangled representation of a scene containing objects should probably use objects as some of the building blocks of an overall causal factorization, complemented by mechanisms such as orientation, viewing direction, and lighting. (Objects can be represented at different levels of granularity [207], i.e., as a single entity or as a composition of other causal variables encoding parts, properties, and other factors of variation.)

The problem of recovering the exogenous noise variables is ill-defined in the i.i.d. case, as there are infinitely many equivalent solutions yielding the same observational distribution [158, 116, 188]. Additional assumptions or biases can help favor certain solutions over others [158, 205]. Leeb et al. [149] propose a structured decoder that embeds an SCM and automatically learns a hierarchy of disentangled factors. To make (12) causal, we can use the ICM Principle: we should make the U_i statistically independent, and we should make the mechanisms independent. This could be done by ensuring that they are invariant across problems, exhibit sparse changes in response to actions, or can be independently intervened upon [221, 21, 29]. Locatello et al. [159] showed that the sparse mechanism shift hypothesis stated above is theoretically sufficient given suitable training data. Further, the SMS hypothesis can be used as a supervision signal in practice even if PA_i ≠ ∅ [252]. However, which factors of variation can be disentangled depends on which interventions can be observed [230, 159]. As discussed by Schölkopf et al. [220] and Shu et al. [230], different supervision signals may be used to identify subsets of factors. Similarly, when learning causal variables from data, which variables can be extracted, and at what granularity, depends on which distribution shifts, explicit interventions, and other supervision signals are available.
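The autoencoder construction described above (encoder q producing noise variables U, structural assignments f_i, decoder p) can be written down compactly. The following is a minimal illustrative sketch in Python/PyTorch, assuming a toy two-variable graph S_1 → S_2 and arbitrary layer sizes; it is not the architecture of any specific published method.

# Minimal sketch (not any paper's implementation) of embedding a small SCM
# between an encoder and a decoder, as in the discussion of (10)-(12).
# The two-variable graph S1 -> S2 and all layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CausalAutoencoder(nn.Module):
    def __init__(self, d=784, n=2, hidden=128):
        super().__init__()
        # Encoder q: R^d -> R^n produces the "noise" variables U = (U1, U2).
        self.q = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n))
        # Structural assignments f_i(PA_i, U_i) for the assumed graph S1 -> S2.
        self.f1 = nn.Linear(1, 1)          # S1 := f1(U1)
        self.f2 = nn.Linear(2, 1)          # S2 := f2(S1, U2)
        # Decoder p: R^n -> R^d maps causal variables back to observations.
        self.p = nn.Sequential(nn.Linear(n, hidden), nn.ReLU(), nn.Linear(hidden, d))

    def forward(self, x):
        u = self.q(x)                      # unexplained noise variables
        u1, u2 = u[:, :1], u[:, 1:]
        s1 = self.f1(u1)
        s2 = self.f2(torch.cat([s1, u2], dim=1))
        s = torch.cat([s1, s2], dim=1)     # causal variables S = (S1, S2)
        return self.p(s)                   # reconstruction p(f(q(x)))

model = CausalAutoencoder()
x = torch.randn(32, 784)                   # stand-in for a batch of images
loss = ((model(x) - x) ** 2).mean()        # reconstruction error, p ∘ f ∘ q ≈ id
loss.backward()

If the graph is unknown, the two assignment modules above would be absorbed into the decoder, recovering the ordinary autoencoder p̃ ∘ q discussed in the text.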
b) Problem 2 – Learning Transferable Mechanisms: An artificial or natural agent in a complex world is faced with limited resources. This concerns training data: we only have limited data for each task/domain, and thus need to find ways of pooling and re-using data, in stark contrast to the current industry practice of large-scale labeling work done by humans. It also concerns computational resources: animals have constraints on the size of their brains, and evolutionary neuroscience knows many examples where brain regions get re-purposed. Similar constraints on size and energy apply as ML methods get embedded in (small) physical devices that may be battery-powered. Future AI models that robustly solve a range of problems in the real world will thus likely need to re-use components, which requires them to be robust across tasks and environments [220].

An elegant way to do this is to employ a modular structure that mirrors a corresponding modularity in the world. In other words, if the world is indeed modular, in the sense that components/mechanisms of the world play roles across a range of environments, tasks, and settings, then it would be prudent for a model to employ corresponding modules [84]. For instance, if variations of natural lighting (the position of the sun, clouds, etc.) imply that the visual environment can appear in brightness conditions spanning several orders of magnitude, then visual processing algorithms in our nervous system should employ methods that can factor out these variations, rather than building separate sets of face recognizers, say, for every lighting condition. If, for example, our nervous system were to compensate for the lighting changes by a gain control mechanism, then this mechanism in itself need not have anything to do with the physical mechanisms bringing about brightness differences. However, it would play a role in a modular structure that corresponds to the role the physical mechanisms play in the world's modular structure. This could produce a bias towards models that exhibit certain forms of structural homomorphism to a world that we cannot directly recognize, which would be rather intriguing, given that ultimately our brains do nothing but turn neuronal signals into other neuronal signals.

A sensible inductive bias for learning such models is to look for independent causal mechanisms [180], and competitive training can play a role in this, as sketched below. For pattern recognition tasks, [181, 84] suggest that learning causal models that contain independent mechanisms may help in transferring modules across substantially different domains.
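The competitive training mentioned above can be made concrete with a much-simplified sketch of a competition-of-experts idea: several candidate mechanisms compete for each example, the best-performing expert wins it, and only the winner is updated on it. The number of experts, their architecture, and the reconstruction objective below are illustrative assumptions, not the procedure of [180].

# Simplified competition-of-experts sketch: each sample is routed to the expert
# (candidate mechanism) that currently handles it best, and only that expert
# is trained on it. Expert count, losses, and data are illustrative assumptions.
import torch
import torch.nn as nn

experts = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])   # candidate mechanisms
opts = [torch.optim.Adam(e.parameters(), lr=1e-3) for e in experts]

def train_step(x_transformed, x_canonical):
    # Each expert tries to map a transformed sample back to its canonical version.
    with torch.no_grad():
        losses = torch.stack([((e(x_transformed) - x_canonical) ** 2).mean(dim=1)
                              for e in experts])          # shape: (num_experts, batch)
        winners = losses.argmin(dim=0)                    # winning expert per sample
    for i, (e, opt) in enumerate(zip(experts, opts)):
        mask = winners == i
        if mask.any():                                    # update only on "won" samples
            opt.zero_grad()
            loss = ((e(x_transformed[mask]) - x_canonical[mask]) ** 2).mean()
            loss.backward()
            opt.step()

train_step(torch.randn(64, 10), torch.randn(64, 10))      # stand-in data

Over training, such a scheme can encourage each expert to specialize on one mechanism, which is the kind of modular specialization the paragraph above argues for.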
c) Problem 3 – Learning Interventional World Models and Reasoning: Deep learning excels at learning representations of data that preserve relevant statistical properties [23, 148]. However, it does so without taking into account the causal properties of the variables, i.e., it does not care about the interventional properties of the variables it analyzes or reconstructs. Causal representation learning should move beyond the representation of statistical dependence structures towards models that support intervention, planning, and reasoning, realizing Konrad Lorenz' notion of thinking as acting in an imagined space [163]. This ultimately requires the ability to reflect back on one's actions and envision alternative scenarios, possibly necessitating (the illusion of) free will [184]. The biological function of self-consciousness may be related to the need for a variable representing oneself in one's Lorenzian imagined space, and free will may then be a means to communicate about actions taken by that variable, crucial for social and cultural learning, a topic which has not yet entered the stage of machine learning research although it is at the core of human intelligence [107].

VII. IMPLICATIONS FOR MACHINE LEARNING
All of this discussion calls for a learning paradigm that does not rest on the usual i.i.d. assumption. Instead, we wish to make a weaker assumption: that the data on which the model will be applied comes from a possibly different distribution, but involving (mostly) the same causal mechanisms [188]. This raises serious challenges: (a) in many cases, we need to infer abstract causal variables from the available low-level input features; (b) there is no consensus on which aspects of the data reveal causal relations; (c) the usual experimental protocol of training and test sets may not be sufficient for inferring and evaluating causal relations on existing data sets, and we may need to create new benchmarks, for example with access to environment information and interventions; (d) even in the limited cases we understand, we often lack scalable and numerically sound algorithms. Despite these challenges, we argue that this endeavor has concrete implications for machine learning and may shed light on desiderata and current practices alike.
A. Semi-Supervised Learning (SSL)
Suppose our underlying causal graph is X → Y, and at the same time we are trying to learn a mapping X → Y. The causal factorization (4) for this case is

P(X, Y) = P(X) P(Y | X).   (13)

The ICM Principle posits that the modules in a joint distribution's causal decomposition do not inform or influence each other. This means that, in particular, P(X) should contain no information about P(Y | X), which implies that SSL should be futile insofar as it uses additional information about P(X) (from unlabelled data) to improve our estimate of P(Y | X = x).

In the opposite (anticausal) direction (i.e., the direction of prediction is opposite to the causal generative process), however, SSL may be possible. To see this, we refer to Daniušis et al. [49], who define a measure of dependence between input P(X) and conditional P(Y | X) (other dependence measures have been proposed for high-dimensional linear settings and time series [124, 226, 27, 122, 119, 125]). Assuming that this measure is zero in the causal direction (applying the ICM assumption described in Section IV to the two-variable case), they show that it is strictly positive in the anticausal direction. Applied to SSL in the anticausal direction, this implies that the distribution of the input (now: effect) variable should contain information about the conditional of output (cause) given input, i.e., the quantity that machine learning is usually concerned with.

The study [218] empirically corroborated these predictions, thus establishing an intriguing bridge between the structure of learning problems and certain physical properties (cause-effect direction) of real-world data generating processes. It also led to a range of follow-up work [279, 266, 280, 77, 114, 281, 32, 96, 263, 243, 195, 152, 156, 153, 167, 204, 115], complementing the studies of Bareinboim and Pearl [12, 185], and it inspired a thread of work in the statistics community exploiting invariance for causal discovery and other tasks [189, 192, 105, 104, 115]. On the SSL side, subsequent developments include further theoretical analyses [121, 188, Section 5.1.2] and a form of conditional SSL [259].

The view of SSL as exploiting dependencies between a marginal P(X) and a non-causal conditional P(Y | X) is consistent with the common assumptions employed to justify SSL [44]. The cluster assumption asserts that the labeling function (which is a property of P(Y | X)) should not change within clusters of P(X). The low-density separation assumption posits that the area where P(Y | X) takes the value 0.5 should have small P(X); and the semi-supervised smoothness assumption, applicable also to continuous outputs, states that if two points in a high-density region are close, then so should be the corresponding output values. Note, moreover, that some of the theoretical results in the field use assumptions well-known from causal graphs (even if they do not mention causality): the co-training theorem [31] makes a statement about learnability from unlabelled data, and relies on an assumption of predictors being conditionally independent given the label, which we would normally expect if the predictors are (only) caused by the label, i.e., an anticausal setting. This is nicely consistent with the above findings.
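To make the asymmetry concrete, the following toy simulation (an assumed synthetic setup, not an experiment from the literature) illustrates the cluster assumption from a causal perspective: when Y causes X, the valley in the marginal P(X) coincides with the class boundary, so unlabeled data is informative; when X causes Y, the unimodal marginal says nothing about where the labeling threshold lies.

# Illustrative synthetic sketch of why unlabeled data can help in the anticausal
# direction but not in the causal one: cluster structure in P(X) locates the
# class boundary only when Y causes X. Setup and numbers are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def gmm_boundary(x):
    """Midpoint between the two fitted component means: the boundary suggested
    by unlabeled data alone (the low-density valley, if there is one)."""
    gm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))
    return float(np.mean(gm.means_))

# Anticausal: Y -> X.  X = 2*Y + noise, so the Bayes boundary sits at x = 1,
# exactly in the low-density valley of P(X) that the mixture model finds.
y = rng.integers(0, 2, size=5000)
x_anti = 2.0 * y + rng.normal(0.0, 0.5, size=5000)
print("anticausal: true boundary 1.0, suggested by P(X):", round(gmm_boundary(x_anti), 2))

# Causal: X -> Y.  X ~ N(0, 1) and Y = 1[X > 0.8]; the threshold 0.8 is invisible
# in the unimodal marginal P(X), so the fit suggests an unrelated split near 0.
x_causal = rng.normal(0.0, 1.0, size=5000)
print("causal:     true boundary 0.8, suggested by P(X):", round(gmm_boundary(x_causal), 2))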
B. Adversarial Vulnerability

One can hypothesize that the causal direction should also have an influence on whether classifiers are vulnerable to adversarial attacks. These attacks have recently become popular, and consist of minute changes to inputs, invisible to a human observer yet changing a classifier's output [249]. This is related to causality in several ways. First, these attacks clearly constitute violations of the i.i.d. assumption that underlies statistical machine learning. If all we want to do is prediction in an i.i.d. setting, then statistical learning is fine. In the adversarial setting, however, the modified test examples are not drawn from the same distribution as the training examples. The adversarial phenomenon also shows that the kind of robustness current classifiers exhibit is rather different from the one a human exhibits. If we knew both robustness measures, we could try to maximize one while minimizing the other. Current methods can be viewed as crude approximations to this, effectively modeling the human's robustness as a mathematically simple set, say, an ℓ_p ball of radius ε > 0: they often try to find examples which lead to maximal changes in the classifier's output, subject to the constraint that they lie in an ℓ_p ball in the pixel metric. As we think of a classifier as the approximation of a function, the large gradients exploited by these attacks are either a property of this function or a defect of the approximation.

There are different ways of relating this to causal models. As described in [188, Section 1.4], different causal models can generate the same statistical pattern recognition model. In one of those, we might provide a writer with a sequence of class labels y, with the instruction to produce a set of corresponding images x. Clearly, intervening on y will impact x, but intervening on x will not impact y, so this is an anticausal learning problem. In another setting, we might ask the writer to decide for herself which digits to write, and to record the labels alongside the digit (in this case, the classifier would try to predict one effect from another one, a situation which we might call a confounded one). In a last one, we might provide images to a person, and ask the person to generate labels by classifying them.

Let us now assume that we are in the causal setting, where the causal generative model factorizes into independent components, one of which is (essentially) the classification function. As discussed in Section III, when specifying a causal model, one needs to determine which interventions are allowed, and a structural assignment will then, by definition, be valid under every possible (allowed) intervention. One may thus expect that if the predictor approximates the causal mechanism that is inherently transferable and robust, adversarial examples should be harder to find [216, 134] (although adversarial attacks may still exploit the quality of the (parameterized) approximation of a structural equation). Recent work supports this view: it was shown that a possible defense against adversarial attacks is to solve the anticausal classification problem by modeling the causal generative direction, a method which in vision is referred to as analysis by synthesis [222]. A related defense method proceeds by reconstructing the input using an autoencoder before feeding it to a classifier [95].
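The ℓ_p-ball attack described above can be illustrated with a single fast-gradient-sign step on a linear logistic classifier, for which the loss gradient is analytic. The weights, input dimension, and radius below are illustrative assumptions; real attacks operate on deep networks and images, but the structure of the optimization is the same.

# Sketch (toy, assumed setup) of the attack described above: maximize the change
# in a classifier's output subject to an l_infinity ball of radius eps around x.
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.1          # weights of a (pretend) trained classifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, eps):
    """One fast-gradient-sign step within the l_inf ball of radius eps."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w                 # gradient of the logistic loss w.r.t. x
    return x + eps * np.sign(grad_x)     # stays inside the l_inf ball by construction

x, y = rng.normal(size=20), 1.0
x_adv = fgsm(x, y, eps=0.1)
print("clean prediction:      ", round(float(sigmoid(w @ x + b)), 3))
print("adversarial prediction:", round(float(sigmoid(w @ x_adv + b)), 3))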
C. Robustness and Strong Generalization
We can speculate that structures composed of autonomous modules, such as given by a causal factorization (4), should be relatively robust to swapping out or modifying individual components. Robustness should also play a role when studying strategic behavior, i.e., decisions or actions that take into account the actions of other agents (including AI agents). Consider a system that tries to predict the probability of successfully paying back a credit, based on a set of features. The set could include, for instance, the current debt of a person, as well as their address. To get a higher credit score, people could thus change their current debt (by paying it off), or they could change their address by moving to a more affluent neighborhood. The former probably has a positive causal impact on the probability of paying back; for the latter, this is less likely. Thus, we could build a scoring system that is more robust with respect to such strategic behavior by only using causal features as inputs [132].

To formalize this general intuition, one can consider a form of out-of-distribution generalization, which can be optimized by minimizing the empirical risk over a class of distributions induced by a causal model of the data [5, 204, 169, 189, 218]. To describe this notion, we start by recalling the usual empirical risk minimization setup. We have access to data from a distribution P(X, Y) and train a predictor g in a hypothesis space H (e.g., a neural network with a certain architecture predicting Y from X) to minimize the empirical risk:

g^\star = \arg\min_{g \in H} \hat{R}_{P(X,Y)}(g),   (14)

where

\hat{R}_{P(X,Y)}(g) = \hat{E}_{P(X,Y)}[ loss(Y, g(X)) ].   (15)

Here, we denote by \hat{E}_{P(X,Y)} the empirical mean computed from a sample drawn from P(X, Y). When we refer to "out-of-distribution generalization", we mean having a small expected risk for a different distribution P†(X, Y):

R^{OOD}_{P†(X,Y)}(g) = E_{P†(X,Y)}[ loss(Y, g(X)) ].   (16)

Clearly, the gap between \hat{R}_{P(X,Y)}(g) and R^{OOD}_{P†(X,Y)}(g) will depend on how different the test distribution P† is from the training distribution P. To quantify this difference, we call environments the collection of different circumstances that give rise to the distribution shifts, such as locations, times, experimental conditions, etc. Environments can be modeled in a causal factorization (4), as they can be seen as interventions on one or several causal variables or mechanisms. As a motivating example, one environment may correspond to where a measurement is taken (for example a certain room), and from each environment, we obtain a collection of measurements (images of objects in the same room). It is nontrivial (and in some cases provably hard [20]) to learn statistical models that are stable across training environments and generalize to novel testing environments [189, 204, 167, 5, 2] drawn from the same environment distribution.

Using causal language, one could restrict P†(X, Y) to be the result of a certain set of interventions, i.e., P†(X, Y) ∈ P_G, where P_G is a set of interventional distributions over a causal graph G. The worst-case out-of-distribution risk then becomes

R^{OOD}_{P_G}(g) = \max_{P† ∈ P_G} E_{P†(X,Y)}[ loss(Y, g(X)) ].   (17)

To learn a robust predictor, we should have available a subset of environment distributions E ⊂ P_G and solve

g^\star = \arg\min_{g \in H} \max_{P† ∈ E} \hat{E}_{P†(X,Y)}[ loss(Y, g(X)) ].   (18)
In practice, solving (18) requires specifying a causal model with an associated set of interventions. If the set of observed environments E does not coincide with the set of possible environments P_G, we have an additional estimation error that may be arbitrarily large in the worst case [5, 20].
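As a concrete, if simplistic, reading of (18), the following sketch minimizes the worst empirical risk of a linear predictor across a handful of synthetic "environments" by taking gradient steps on whichever environment currently has the largest risk. The environments, model class, loss, and step size are illustrative assumptions, not a recommendation of a specific algorithm.

# Minimal sketch (assumed synthetic setup) of the min-max objective (18):
# minimize, over a linear predictor g, the worst empirical risk across a set E
# of observed environments, via gradient steps on the currently worst one.
import numpy as np

rng = np.random.default_rng(0)

def make_env(shift):
    """One environment = an interventional shift of the input distribution."""
    X = rng.normal(loc=shift, scale=1.0, size=(500, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=500)
    return X, y

envs = [make_env(s) for s in (0.0, 1.0, -2.0)]      # E: observed environments
w = np.zeros(5)                                      # hypothesis g(x) = w @ x

for _ in range(500):
    risks = [np.mean((X @ w - y) ** 2) for X, y in envs]
    X, y = envs[int(np.argmax(risks))]               # currently worst environment
    grad = 2 * X.T @ (X @ w - y) / len(y)            # subgradient of the max-risk
    w -= 0.01 * grad

print("worst-case risk:", round(max(np.mean((X @ w - y) ** 2) for X, y in envs), 3))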
D. Pre-training, Data Augmentation, and Self-Supervision

Learning predictive models that solve the min-max optimization problem of (18) is challenging. We now interpret several common techniques in machine learning as means of approximating (18).

The first approach is enriching the distribution of the training set. This does not mean obtaining more examples from P(X, Y), but training on a richer dataset [244, 53], for example, through pre-training on a huge and diverse corpus [196, 54, 112, 137, 59, 35, 45, 253]. Since this strategy is based on standard empirical risk minimization, it can achieve stronger generalization in practice only if the new training distribution is sufficiently diverse to contain information about other distributions in P_G.

The second approach, often coupled with the previous one, is to rely on data augmentation to increase the diversity of the data by "augmenting" it through a certain type of artificially generated interventions [9, 234, 140]. For the visual domain, common augmentations include performing transformations such as rotating the image, translating the image by a few pixels, or flipping the image horizontally. The high-level idea behind data augmentation is to encourage a system to learn underlying invariances or symmetries present in the augmented data distribution. For example, in a classification task, translating the image by a few pixels does not change the class label. One may view it as specifying a set of interventions E the model should be robust to (e.g., random crops, interpolations, translations, rotations, etc.). Instead of computing the maximum over all distributions in E, one can relax the problem by sampling from the interventional distributions and optimizing an expectation over the different augmented images on a suitably chosen subset [38] (see the sketch at the end of this subsection), using a search algorithm like reinforcement learning [48] or an algorithm based on density matching [154].

The third approach is to rely on self-supervision to learn about P(X). Certain pre-training methods [196, 54, 112, 35, 45, 253] have shown that it is possible to achieve good results using only very few class labels by first pre-training on a large unlabeled dataset and then fine-tuning on few labeled examples. Similarly, pre-training on large unlabeled image datasets can improve performance by learning representations that can efficiently transfer to a downstream task, as demonstrated by [179, 110, 102, 46, 92]. These methods fall under the umbrella of self-supervised learning, a family of techniques for converting an unsupervised learning problem into a supervised one by using so-called pretext tasks with artificially generated labels, without human annotations. The basic idea behind using pretext tasks is to force the learner to learn representations that contain information about P(X) that may be useful for (an unknown) downstream task. Much of the work on methods that use self-supervision relies on carefully constructing pretext tasks. A central challenge here is to extract features that are indeed informative about the data generating distribution. Ideas from the ICM Principle could help develop methods that can automate the process of constructing pretext tasks.

Finally, one can explicitly optimize (18), for example, through adversarial training [79].
In that case, P_G would contain the set of attacks an adversary might perform, whereas here we consider a set of natural interventions. An interesting research direction is the combination of all these techniques: large-scale training, data augmentation, self-supervision, and robust fine-tuning on the available data from multiple, potentially simulated environments.
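The relaxation of (18) by data augmentation, mentioned above, can be sketched as follows: rather than maximizing over E, we sample simple "interventions" (shifts and flips) and optimize an expectation over the augmented inputs. The augmentations, toy loss, and data below are illustrative assumptions.

# Sketch (assumed setup) of data augmentation as sampling from a set of simple
# interventions: optimize an expectation over randomly drawn shifts and flips
# of each input instead of the max over E in (18).
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Sample one 'intervention': a small horizontal translation and a possible flip."""
    img = np.roll(img, shift=rng.integers(-2, 3), axis=1)   # horizontal shift
    if rng.random() < 0.5:
        img = img[:, ::-1]                                   # horizontal flip
    return img

def expected_loss(w, images, labels, samples=4):
    """Monte-Carlo estimate of the loss under the augmentation distribution."""
    total = 0.0
    for img, y in zip(images, labels):
        for _ in range(samples):
            x = augment(img).ravel()
            total += (x @ w - y) ** 2                        # toy squared loss
    return total / (len(images) * samples)

images = rng.normal(size=(8, 16, 16))                        # stand-in image batch
labels = rng.normal(size=8)
w = np.zeros(16 * 16)
print("loss under augmentation distribution:", round(expected_loss(w, images, labels), 3))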
E. Reinforcement Learning

Reinforcement Learning (RL) is closer to causality research than the machine learning mainstream in that it sometimes effectively estimates do-probabilities directly. E.g., on-policy learning estimates do-probabilities for the interventions specified by the policy (note that these may not be hard interventions if the policy depends on other variables). However, as soon as off-policy learning is considered, in particular in the batch (or observational) setting [146], issues of causality become subtle [164, 81]. An emerging line of work devoted to the intersection of RL and causality includes [13, 21, 164, 37, 50, 275, 1]. Causal learning applied to reinforcement learning can be divided into two aspects: causal induction and causal inference.
Causal induction (discovery) involves learning causal relations from data, for example, an RL agent learning a causal model of the environment.
Causal inference learns to plan and act based on a causal model. Causal induction in an RL setting poses different challenges than the classic causal learning settings, where the causal variables are often given. However, there is accumulating evidence supporting the usefulness of an appropriately structured representation of the environment [2, 26, 258].

a) World Models:
Model-based RL [248, 67] is related to causality as it aims at modeling the effect of actions (interventions) on the current state of the world. Particularly relevant for causal learning are generative world models that capture some of the causal relations underlying the environment and serve as Lorenzian imagined spaces (see the Introduction above) to train RL agents [127, 248, 98, 47, 271, 178, 232, 214, 268].
Structured generative approaches further aim at decomposing an environment into multiple entities with causally correct relations among them, modulo the completeness of the variables and confounding [58, 265, 43, 264, 14, 136]. However, many of the current approaches (regardless of structure) only build partial models of the environment [88]. Since they do not observe the environment at every time step, the environment may become an unobserved confounder affecting both the agent's actions and the reward. To address this issue, a model can use the backdoor criterion, conditioning on its policy [200].

b) Generalization, Robustness, and Fast Transfer:
While RL has already achieved impressive results, the sample complexity required to achieve consistently good performance is often prohibitively high. Further, RL agents are often brittle (if data is limited) in the face of even tiny changes to the environment (either visual or mechanistic changes) unseen in the training phase. The question of generalization in RL is essential to the field's future both in theory and practice. One proposed solution towards the goal of designing machines that can extrapolate experience across environments and tasks is to learn invariances in a causal graph structure. A key requirement for learning invariances from data may be the possibility to perform and learn from interventions. Work in developmental psychology argues that there is a need to experiment in order to discover causal relationships [80]. This can be modelled as an RL environment, where the agent can discover causal factors through interventions and observing their effects. Further, causal models may allow modeling the environment as a set of underlying independent causal mechanisms such that, if there is a change in distribution, not all the mechanisms need to be re-learned. However, there are still open questions about the right way to think about generalization in RL, the right way to formalize the problem, and the most relevant tasks.

c) Counterfactuals:
Counterfactual reasoning has been found to improve the data efficiency of RL algorithms [37, 165], to improve performance [50], and it has been applied to communicate about past experiences in the multi-agent setting [68, 241]. These findings are consistent with work in cognitive psychology [64] arguing that counterfactuals allow reasoning about the usefulness of past actions and transferring these insights to corresponding behavioral intentions in future scenarios [203, 199, 145]. We argue that future work in RL should consider counterfactual reasoning as a critical component to enable acting in imagined spaces and formulating hypotheses that can be subsequently tested with suitably chosen interventions.

d) Offline RL:
The success of deep learning methods in the case of supervised learning can be largely attributed to the availability of large datasets and methods that can scale to large amounts of data. In the case of reinforcement learning, collecting large amounts of high-fidelity, diverse data from scratch can be expensive and hence becomes a bottleneck. Offline RL [72, 150] tries to address this concern by learning a policy from a fixed dataset of trajectories, without requiring any experimental or interventional data (i.e., without any interaction with the environment). The effective use of observational data (or logged data) may make real-world RL more practical by incorporating diverse prior experiences. To succeed at it, an agent should be able to infer the consequences of sets of actions different from those seen during training (i.e., the actions in the logged data), which essentially makes it a counterfactual inference problem. The distribution mismatch between the current policy and the policy that was used to collect the offline data makes offline RL challenging, as it requires us to move well beyond the assumption of independently and identically distributed data. Incorporating invariances by factorizing knowledge in terms of independent causal mechanisms can help make progress towards the offline RL setting.
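The confounding issue raised above for partial world models and logged (offline) data can be illustrated with the backdoor adjustment mentioned in the world-models paragraph. In the following sketch, a hidden state S influences both the logging policy's action A and the reward R, so naive conditioning on the logged action differs from the interventional quantity P(R | do(A)). All probabilities are purely illustrative assumptions.

# Purely illustrative numbers (not from any real system): with logged data, a
# hidden state S confounds the logging policy's action A and the reward R, so
# naive conditioning P(R=1 | A=1) differs from the interventional quantity
# P(R=1 | do(A=1)) = sum_s P(R=1 | A=1, S=s) P(S=s)   (backdoor adjustment).
import numpy as np

p_s = np.array([0.7, 0.3])                  # P(S): distribution of the hidden state
p_a_given_s = np.array([0.2, 0.9])          # logging policy: P(A=1 | S=s)
p_r_given_a_s = np.array([[0.3, 0.8],       # P(R=1 | A=a, S=s), rows a = 0, 1
                          [0.4, 0.9]])

# Naive estimate from logged data: condition on A=1 (S is tilted towards s=1).
p_s_given_a1 = p_a_given_s * p_s / np.sum(p_a_given_s * p_s)
naive = float(np.sum(p_r_given_a_s[1] * p_s_given_a1))

# Backdoor adjustment: average over the *marginal* P(S) instead.
adjusted = float(np.sum(p_r_given_a_s[1] * p_s))

print("P(R=1 | A=1)     =", round(naive, 3))     # inflated by confounding
print("P(R=1 | do(A=1)) =", round(adjusted, 3))  # what the agent would actually get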
F. Scientific Applications
A fundamental question in the application of machine learning in the natural sciences is to what extent we can complement our understanding of a physical system with machine learning. One interesting aspect is physics simulation with neural networks [93], which can substantially increase the efficiency of hand-engineered simulators [103, 143, 269, 211, 264]. Significant out-of-distribution generalization of learned physical simulators may not be necessary if experimental conditions are carefully controlled, although the simulator has to be completely re-trained if the conditions change.

On the other hand, the lack of systematic experimental conditions may become problematic in other applications such as healthcare. One example is personalized medicine, where we may wish to build a model of a patient's health state from a multitude of data sources, like electronic health records and genetic information [65, 108]. However, if we train a clinical system on doctors' actions in controlled settings, the system will likely provide little additional insight compared to the doctors' knowledge and may fail in surprising ways when deployed [18]. While it may be useful to automate certain decisions, an understanding of causality may be necessary to recommend treatment options that are personalized and reliable [201, 242, 224, 273, 6, 3, 30, 165].

Causality also has significant potential in helping understand medical phenomena, e.g., in the current Covid-19 pandemic, where causal mediation analysis helps disentangle different effects contributing to case fatality rates when a textbook example of Simpson's paradox was observed [261]; a stylized numeric illustration of this pattern is given below.

Another example of a scientific application is in astronomy, where causal models were used to identify exoplanets under the confounding of the instrument. Exoplanets are often detected as they partially occlude their host star when they transit in front of it, causing a slight decrease in brightness. Shared patterns in measurement noise across stars light-years apart can be removed in order to reduce the instrument's influence on the measurement [219], which is especially critical in the context of partial technical failures, as experienced in the Kepler exoplanet search mission. The application of [219] led to the discovery of 36 planet candidates [70], of which 21 were subsequently validated as bona fide exoplanets [172]. Four years later, astronomers found traces of water in the atmosphere of the exoplanet K2-18b, the first such discovery for an exoplanet in the habitable zone, i.e., allowing for liquid water [25, 254]. This planet turned out to be one that had first been detected in [70, exoplanet candidate EPIC 201912552].
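The Simpson's-paradox pattern referred to above can be illustrated with a stylized calculation. The numbers are purely hypothetical and are not the statistics analyzed in [261]; they only show how one group can have a lower case fatality rate in every age stratum yet a higher aggregate rate once the age distribution of cases differs across groups.

# Hypothetical numbers (purely illustrative) showing the Simpson's-paradox pattern:
# country B has a lower case fatality rate (CFR) in every age group, yet a higher
# aggregate CFR, because its cases are concentrated in the older stratum.
import numpy as np

age_groups = ["<60", ">=60"]
cases_a  = np.array([9000.0, 1000.0])   # country A: mostly young cases
deaths_a = np.array([  45.0,  100.0])
cases_b  = np.array([2000.0, 8000.0])   # country B: mostly old cases
deaths_b = np.array([   6.0,  640.0])

for g, ca, da, cb, db in zip(age_groups, cases_a, deaths_a, cases_b, deaths_b):
    print(f"age {g:>4}: CFR A = {da/ca:.3f}, CFR B = {db/cb:.3f}")   # B lower in both

print(f"aggregate: CFR A = {deaths_a.sum()/cases_a.sum():.3f}, "
      f"CFR B = {deaths_b.sum()/cases_b.sum():.3f}")                 # yet B higher overall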
G. Multi-Task Learning and Continual Learning

State-of-the-art AI is relatively narrow, i.e., trained to perform specific tasks, as opposed to the broad, versatile intelligence allowing humans to adapt to a wide range of environments and develop a rich set of skills. The human ability to discover robust, invariant high-level concepts and abstractions, and to identify causal relationships from observations, appears to be one of the key factors allowing for successful generalization from prior experiences to new, often quite different, "out-of-distribution" settings.

Multi-task learning refers to building a system that can solve multiple tasks across different environments [40, 209]. These tasks usually share some common traits. By learning similarities across tasks, a system could utilize knowledge acquired from previous tasks more efficiently when encountering a new task. One possibility for learning such similarities across tasks is to learn a shared underlying data-generating process as a causal generative model whose components satisfy the SMS hypothesis [220]. In certain cases, causal models adapt faster to sparse interventions in distribution [131, 194].

At the same time, we have clearly come a long way already without explicitly treating the multi-task problem as a causal one. Fuelled by abundant data and compute, AI has made remarkable advances in a wide range of applications, from image processing and natural language processing [35] to beating human world champions in games such as chess, poker and Go [223], improving medical diagnoses [166], and generating music [56]. A critical question thus arises: "Why can't we just train a huge model that learns environments' dynamics (e.g., in an RL setting), including all possible interventions? After all, distributed representations can generalize to unseen examples, and if we train over a large number of interventions we may expect that a big neural network will generalize across them."

To address this, we make several points. To begin with, if the data was not sufficiently diverse (which is an untestable assumption a priori), the worst-case error under unseen shifts may still be arbitrarily high (see Section VII-C). While in the short term we can often beat "out-of-distribution" benchmarks by training bigger models on bigger datasets, causality offers an important complement. The generalization capabilities of a model are tied to its assumptions (e.g., how the model is structured and how it was trained). The causal approach makes these assumptions more explicit and aligned with our understanding of physics and human cognition, for instance by relying on the Independent Causal Mechanisms principle. When these assumptions are valid, a learner that does not use them should fare worse than one that does. Further, if we had a model that was successful under all interventions in a certain environment, we may want to use it in different environments that share similar, albeit not necessarily identical, dynamics. The causal approach, and in particular the ICM principle, point to the need to decompose knowledge about the world into independent and recomposable pieces (recomposable depending on the interventions or changes in environment), which suggests more work on modular ML architectures and other ways to enforce the ICM principle in future ML approaches.

At its core, i.i.d. pattern recognition is but a mathematical abstraction, and causality may be essential to most forms of animate learning. Until now, machine learning has neglected a full integration of causality, and this paper argues that it would indeed benefit from integrating causal concepts.
We argue that combining the strengths of both fields, i.e., current deep learning methods as well as tools and ideas from causality, may be a necessary step on the path towards versatile AI systems.

VIII. CONCLUSION
In this work, we discussed different levels of models, including causal and statistical ones. We argued that this spectrum builds upon a range of assumptions, both in terms of modeling and of data collection. In an effort to bring together the causality and machine learning research programs, we first presented a discussion of the fundamentals of causal inference. Second, we discussed how the independent mechanism assumptions and related notions such as invariance offer a powerful bias for causal learning. Third, we discussed how causal relations might be learned from observational and interventional data when causal variables are observed. Fourth, we discussed the open problem of causal representation learning, including its relation to the recent interest in the concept of disentangled representations in deep learning. Finally, we discussed how some open research questions in the machine learning community may be better understood and tackled within the causal framework, including semi-supervised learning, domain generalization, and adversarial robustness.

Based on this discussion, we list some critical areas for future research:

a) Learning Non-Linear Causal Relations at Scale: Not all real-world data is unstructured, and the effect of interventions can often be observed, for example, by stratifying the data collection across multiple environments. The approximation abilities of modern machine learning methods may prove useful to model non-linear causal relations among large numbers of variables. For practical applications, classical tools are limited not only by the linearity assumptions often made but also by their scalability. The paradigms of meta- and multi-task learning are close to the assumptions and desiderata of causal modeling, and future work should consider (1) understanding under which conditions non-linear causal relations can be learned, (2) which training frameworks allow one to best exploit the scalability of machine learning approaches, and (3) providing compelling evidence of the advantages over (non-causal) statistical representations in terms of generalization, re-purposing, and transfer of causal modules on real-world tasks.

b) Learning Causal Variables: "Disentangled" representations learned by state-of-the-art neural network methods are still distributed in the sense that they are represented in a vector format with an arbitrary ordering of the dimensions. This fixed format implies that the representation size cannot be dynamically changed; for example, we cannot change the number of objects in a scene. Further, structured and modular representations should also arise when a network is trained for (sets of) specific tasks, not only autoencoding. Different high-level variables may be extracted depending on the task and affordances at hand. Understanding under which conditions causal variables can be recovered could provide insights into which interventions we are robust to in predictive tasks.

c) Understanding the Biases of Existing Deep Learning Approaches:
Scaling to massive data sets and relying on data augmentation and self-supervision have all been successfully explored to improve the robustness of the predictions of deep learning models. It is nontrivial to disentangle the benefits of the individual components, and it is often unclear which "trick" should be used when dealing with a new task, even if we have an intuition about useful invariances. The notion of strong generalization over a specific set of interventions may be used to probe existing methods, training schemes, and datasets in order to build a taxonomy of inductive biases. In particular, it is desirable to understand how design choices in pre-training (e.g., which datasets/tasks) positively impact both transfer and robustness downstream in a causal sense.

d) Learning Causally Correct Models of the World and the Agent:
In many real-world reinforcement learning (RL) settings, abstract state representations are not available. Hence, the ability to derive abstract causal variables from high-dimensional, low-level pixel representations and then recover causal graphs is important for causal induction in real-world reinforcement learning settings. Moreover, building a causal description for both a model of the agent and a model of the environment (world models) should be essential for robust and versatile model-based reinforcement learning.

IX. ACKNOWLEDGMENTS
Many thanks to the past and present members of the Tübingen causality team, without whose work and insights this article would not exist, in particular to Dominik Janzing, Chaochao Lu and Julius von Kügelgen, who gave helpful comments on [221]. The text has also benefitted from discussions with Elias Bareinboim, Christoph Bohle, Leon Bottou, Isabelle Guyon, Judea Pearl, and Vladimir Vapnik. Thanks to Wouter van Amsterdam for pointing out typos in the first version. We also thank Thomas Kipf, Klaus Greff, and Alexander d'Amour for the useful discussions. Finally, we thank the thorough anonymous reviewers for highly valuable feedback and suggestions.

REFERENCES

[1] Ossama Ahmed, Frederik Träuble, Anirudh Goyal, Alexander Neitz, Manuel Wuthrich, Yoshua Bengio, Bernhard Schölkopf, and Stefan Bauer. Causalworld: A robotic manipulation benchmark for causal structure and transfer learning. In
International Conference on Learning Representations , 2021.[2] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, MateuszLitwin, Bob McGrew, Arthur Petron, Alex Paino, MatthiasPlappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’scube with a robot hand. arXiv preprint 1910.07113 , 2019.[3] Ahmed Alaa and Mihaela Schaar. Limits of estimating hetero-geneous treatment effects: Guidelines for practical algorithmdesign. In
International Conference on Machine Learning ,pages 129–138, 2018.[4] J. Aldrich. Autonomy.
Oxford Economic Papers , 41:15–34,1989.[5] Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and DavidLopez-Paz. Invariant risk minimization. arXiv preprint1907.02893 , 2019.[6] Onur Atan, James Jordon, and Mihaela van der Schaar.Deep-treat: Learning optimal personalized treatments fromobservational data using neural networks. In
Thirty-SecondAAAI Conference on Artificial Intelligence , 2018.[7] Aharon Azulay and Yair Weiss. Why do deep convolutionalnetworks generalize so poorly to small image transformations?
Journal of Machine Learning Research , 20(184):1–25, 2019.[8] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch,Thien Huu Nguyen, Harm de Vries, and Aaron Courville.Systematic generalization: what is required and can it belearned? arXiv preprint 1811.12889 , 2018.[9] H. Baird. Document image defect models. In
Proc., IAPRWorkshop on Syntactic and Structural Pattern Recognition ,pages 38–46, Murray Hill, NJ, 1990.[10] Victor Bapst, Alvaro Sanchez-Gonzalez, Carl Doersch, Kim-berly Stachenfeld, Pushmeet Kohli, Peter Battaglia, and JessicaHamrick. Structured agents for physical construction. In
International Conference on Machine Learning , pages 464–474, 2019.[11] Andrei Barbu, David Mayo, Julian Alverio, William Luo,Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and BorisKatz. Objectnet: A large-scale bias-controlled dataset forpushing the limits of object recognition models. In
Advancesin Neural Information Processing Systems , pages 9448–9458,2019. [12] E. Bareinboim and J. Pearl. Transportability from multipleenvironments with limited experiments: Completeness results.In Advances in Neural Information Processing Systems 27 ,pages 280–288, 2014.[13] E. Bareinboim, A. Forney, and J. Pearl. Bandits with unobservedconfounders: A causal approach. In
Advances in NeuralInformation Processing Systems 28 , pages 1342–1350, 2015.[14] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo JimenezRezende, et al. Interaction networks for learning about objects,relations and physics. In
Advances in neural informationprocessing systems , pages 4502–4510, 2016.[15] Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenenbaum.Simulation as an engine of physical scene understanding.
Proceedings of the National Academy of Sciences , 110(45):18327–18332, 2013.[16] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, AlvaroSanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski,Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner,et al. Relational inductive biases, deep learning, and graphnetworks. arXiv preprint 1806.01261 , 2018.[17] S. Bauer, B. Schölkopf, and J. Peters. The arrow of timein multivariate time series. In
Proceedings of the 33ndInternational Conference on Machine Learning , volume 48of
JMLR Workshop and Conference Proceedings , pages 2043–2051, 2016.[18] Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko,Lauren Wilcox, Paisan Ruamviboonsuk, and Laura M Var-doulakis. A human-centered evaluation of a deep learningsystem deployed in clinics for the detection of diabeticretinopathy. In
Proceedings of the 2020 CHI Conference onHuman Factors in Computing Systems , pages 1–12, 2020.[19] Sara Beery, Grant Van Horn, and Pietro Perona. Recognitionin terra incognita. In
Proceedings of the European Conferenceon Computer Vision (ECCV) , pages 456–473, 2018.[20] S. Ben-David, T. Lu, T. Luu, and D. Pál. Impossibilitytheorems for domain adaptation. In
Proceedings of theInternational Conference on Artificial Intelligence and Statistics13 (AISTATS) , pages 129–136, 2010.[21] Emmanuel Bengio, Valentin Thomas, Joelle Pineau, DoinaPrecup, and Yoshua Bengio. Independently controllable features. arXiv preprint 1703.07718 , 2017.[22] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier.
Learninga synaptic learning rule . IJCNN-91-Seattle International JointConference on Neural Networks (Vol. 2, pp. 969-vol). IEEE.,1990.[23] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Rep-resentation learning: A review and new perspectives. arXivpreprint 1206.5538 , 2012.[24] Yoshua Bengio, Tristan Deleu, Nasim Rahaman, RosemaryKe, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, andChristopher Pal. A meta-transfer objective for learning todisentangle causal mechanisms. arXiv preprint 1901.10912 ,2019.[25] Björn Benneke, Ian Wong, Caroline Piaulet, Heather A.Knutson, Ian J. M. Crossfield, Joshua Lothringer, Caroline V.Morley, Peter Gao, Thomas P. Greene, Courtney Dressing,Diana Dragomir, Andrew W. Howard, Peter R. McCullough,Eliza M. R. Kempton Jonathan J. Fortney, and Jonathan Fraine.Water vapor on the habitable-zone exoplanet K2-18b. arXivpreprint 1909.04642 , 2019.[26] Christopher Berner, Greg Brockman, Brooke Chan, VickiCheung, Przemysław Dkebiak, Christy Dennison, David Farhi,Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2with large scale deep reinforcement learning. arXiv preprint1912.06680 , 2019.[27] M. Besserve, N. Shajarisales, B. Schölkopf, and D. Janzing.Group invariance principles for causal generative models. In
Proceedings of the 21st International Conference on ArtificialIntelligence and Statistics (AISTATS) , pages 557–565, 2018. [28] M. Besserve, R. Sun, D. Janzing, and B. Schölkopf. A theory ofindependent mechanisms for extrapolation in generative models.In , February 2021.[29] Michel Besserve, Rémy Sun, and Bernhard Schölkopf. Coun-terfactuals uncover the modular structure of deep generativemodels. arXiv preprint 1812.03253, published at ICLR 2020 ,2018.[30] Ioana Bica, Ahmed M Alaa, and Mihaela van der Schaar. Timeseries deconfounder: Estimating treatment effects over time inthe presence of hidden confounders. arXiv preprint 1902.00450 ,2019.[31] Avrim Blum and Tom Mitchell. Combining labeled andunlabeled data with co-training. In
Proceedings of the EleventhAnnual Conference on Computational Learning Theory , pages92–100, New York, NY, USA, 1998. ACM.[32] Patrick Blöbaum, Takashi Washio, and Shohei Shimizu. Errorasymmetry in causal and anticausal regression. arXiv preprint1610.03263 , 2016.[33] Blai Bonet and Hector Geffner. Learning first-order symbolicrepresentations for planning from the structure of the statespace. arXiv preprint 1909.05546 , 2019.[34] L. Bottou, J. Peters, J. Quiñonero-Candela, D. X. Charles, D. M.Chickering, E. Portugualy, D. Ray, P. Simard, and E. Snelson.Counterfactual reasoning and learning systems: The exampleof computational advertising.
Journal of Machine LearningResearch , 14:3207–3260, 2013.[35] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah,Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, PranavShyam, Girish Sastry, Amanda Askell, et al. Language modelsare few-shot learners. arXiv preprint 2005.14165 , 2020.[36] Kailash Budhathoki and Jilles Vreeken. Causal inference bycompression. In
IEEE 16th International Conference on DataMining , 2016.[37] Lars Buesing, Theophane Weber, Yori Zwols, SebastienRacaniere, Arthur Guez, Jean-Baptiste Lespiau, and NicolasHeess. Woulda, coulda, shoulda: Counterfactually-guided policysearch. arXiv preprint 1811.06272 , 2018.[38] C. J. C. Burges and B. Schölkopf. Improving the accuracy andspeed of support vector learning machines. In M. Mozer,M. Jordan, and T. Petsche, editors,
Advances in NeuralInformation Processing Systems , volume 9, pages 375–381,Cambridge, MA, USA, 1997. MIT Press.[39] Christopher P Burgess, Loic Matthey, Nicholas Watters,Rishabh Kabra, Irina Higgins, Matt Botvinick, and AlexanderLerchner. Monet: Unsupervised scene decomposition andrepresentation. arXiv preprint 1901.11390 , 2019.[40] Rich Caruana. Multitask learning.
Machine learning , 28(1):41–75, 1997.[41] Krzysztof Chalupka, Pietro Perona, and Frederick Eberhardt.Multi-level cause-effect systems. arXiv preprint 1512.07942 ,2015.[42] Krzysztof Chalupka, Pietro Perona, and Frederick Eberhardt.Fast conditional independence test for vector variables withlarge sample sizes. arXiv preprint 1804.02747 , 2018.[43] Michael B Chang, Tomer Ullman, Antonio Torralba, andJoshua B Tenenbaum. A compositional object-based approachto learning physical dynamics. In , 2017.[44] O. Chapelle, B. Schölkopf, and A. Zien, editors.
Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[45] In Proceedings of the 37th International Conference on Machine Learning, 2020.
[46] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint 2002.05709, 2020.
[47] Silvia Chiappa, Sébastien Racaniere, Daan Wierstra, and Shakir Mohamed. Recurrent environment simulators. In , 2017.
[48] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 113–123, 2019.
[49] P. Daniušis, D. Janzing, J. M. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. In
Proceedings of the 26th Annual Conference onUncertainty in Artificial Intelligence (UAI) , pages 143–150,2010.[50] Ishita Dasgupta, Jane Wang, Silvia Chiappa, Jovana Mitrovic,Pedro Ortega, David Raposo, Edward Hughes, Peter Battaglia,Matthew Botvinick, and Zeb Kurth-Nelson. Causal reasoningfrom meta-reinforcement learning. arXiv preprint 1901.08162 ,2019.[51] A. P. Dawid. Conditional independence in statistical theory.
Journal of the Royal Statistical Society B , 41(1):1–31, 1979.[52] Stanislas Dehaene.
How We Learn: Why Brains Learn BetterThan Any Machine... for Now . Penguin, 2020.[53] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, andLi Fei-Fei. Imagenet: A large-scale hierarchical image database.In , pages 248–255. Ieee, 2009.[54] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and KristinaToutanova. Bert: Pre-training of deep bidirectional transformersfor language understanding. arXiv preprint 1810.04805 , 2018.[55] L. Devroye, L. Györfi, and G. Lugosi.
A Probabilistic Theory ofPattern Recognition , volume 31 of
Applications of Mathematics .Springer, New York, NY, 1996.[56] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong WookKim, Alec Radford, and Ilya Sutskever. Jukebox: A generativemodel for music. arXiv preprint 2005.00341 , 2020.[57] Andrea Dittadi, Frederik Träuble, Francesco Locatello, ManuelWüthrich, Vaibhav Agrawal, Ole Winther, Stefan Bauer, andBernhard Schölkopf. On the transfer of disentangled repre-sentations in realistic settings. In
International Conference onLearning Representations , 2021.[58] Carlos Diuk, Andre Cohen, and Michael L Littman. An object-oriented representation for efficient reinforcement learning. In
Proceedings of the 25th international conference on Machinelearning , pages 240–247, 2008.[59] Josip Djolonga, Jessica Yung, Michael Tschannen, Rob Romi-jnders, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver,Matthias Minderer, Alexander D’Amour, Dan Moldovan, et al.On robustness and transferability of convolutional neuralnetworks. arXiv preprint 2007.08558 , 2020.[60] G. Doran, K. Muandet, K. Zhang, and B. Schölkopf. Apermutation-based kernel conditional independence test. InN. L. Zhang and J. Tian, editors,
Proceedings of the 30thConference on Uncertainty in Artificial Intelligence , pages132–141, Corvallis, OR, 2014. AUAI Press. URL http://auai.org/uai2014/proceedings/individuals/194.pdf.[61] Cian Eastwood and Christopher KI Williams. A framework forthe quantitative evaluation of disentangled representations. In
International Conference on Learning Representations , 2018.[62] Daniel Eaton and Kevin Murphy. Exact Bayesian structurelearning from uncertain interventions. In
Artificial Intelligenceand Statistics , pages 107–114, 2007.[63] Logan Engstrom, Brandon Tran, Dimitris Tsipras, LudwigSchmidt, and Aleksander Madry. Exploring the landscape ofspatial robustness. arXiv preprint 1712.02779 , 2017.[64] Kai Epstude and Neal J Roese. The functional theory ofcounterfactual thinking.
Personality and social psychologyreview , 12(2):168–192, 2008.[65] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, ClaireCui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guideto deep learning in healthcare.
Nature Medicine , 25(1):24–29,2019.[66] András Faragó and Gábor Lugosi. Strong universal consistencyof neural network classifiers.
IEEE Transactions on InformationTheory , 39(4):1146–1151, 2006.[67] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint 1703.03400 , 2017.[68] Jakob N Foerster, Gregory Farquhar, Triantafyllos Afouras,Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In
Thirty-second AAAI conference onartificial intelligence , 2018.[69] Peter Földiák. Learning invariance from transformationsequences.
Neural Computation , 3(2):194–200, 1991.[70] D. Foreman-Mackey, B. T. Montet, D. W. Hogg, T. D. Morton,D. Wang, and B. Schölkopf. A systematic search for transitingplanets in the K2 data.
The Astrophysical Journal , 806(2),2015. URL http://stacks.iop.org/0004-637X/806/i=2/a=215.[71] R. Frisch, T. Haavelmo, T.C. Koopmans, and J. Tinbergen.
Autonomy of economic relations . Universitets SocialøkonomiskeInstitutt, Oslo, Norway, 1948.[72] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In
International Conference on Machine Learning , pages 2052–2062, 2019.[73] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernelmeasures of conditional dependence. In
Advances in NeuralInformation Processing Systems 20 , pages 489–496, 2008.[74] D. Geiger and J. Pearl. Logical and algorithmic propertiesof independence and their application to Bayesian networks.
Annals of Mathematics and Artificial Intelligence , 2:165–178,1990.[75] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, MatthiasBethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape biasimproves accuracy and robustness. arXiv preprint 1811.12231 ,2018.[76] Muhammad Waleed Gondal, Manuel Wüthrich, Djordje Miladi-novi´c, Francesco Locatello, Martin Breidt, Valentin Volchkov,Joel Akpo, Olivier Bachem, Bernhard Schölkopf, and StefanBauer. On the transfer of inductive bias from simulation tothe real world: a new disentanglement dataset. In
Advances inNeural Information Processing Systems , pages 15740–15751,2019.[77] M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, andB. Schölkopf. Domain adaptation with conditional transfer-able components. In
Proceedings of the 33nd InternationalConference on Machine Learning , pages 2839–2848, 2016.[78] M. Gong, K. Zhang, B. Schölkopf, C. Glymour, and D. Tao.Causal discovery from temporally aggregated time series. In
Proceedings of the Thirty-Third Conference on Uncertainty inArtificial Intelligence (UAI) , page ID 269, 2017.[79] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy.Explaining and harnessing adversarial examples. arXiv preprint1412.6572 , 2014.[80] Alison Gopnik, Clark Glymour, David M Sobel, Laura E Schulz,Tamar Kushnir, and David Danks. A theory of causal learningin children: causal maps and Bayes nets.
Psychological review ,111(1):3, 2004.[81] Omer Gottesman, Fredrik Johansson, Joshua Meier, Jack Dent,Donghun Lee, Srivatsan Srinivasan, Linying Zhang, Yi Ding,David Wihl, Xuefeng Peng, Jiayu Yao, Isaac Lage, ChristopherMosch, Li wei H. Lehman, Matthieu Komorowski, MatthieuKomorowski, Aldo Faisal, Leo Anthony Celi, David Sontag,and Finale Doshi-Velez. Evaluating reinforcement learningalgorithms in observational health settings. arXiv preprint1805.12298 , 2018. [82] Olivier Goudet, Diviyan Kalainathan, Philippe Caillou, IsabelleGuyon, David Lopez-Paz, and Michèle Sebag. Causal genera-tive neural networks. arXiv preprint 1711.08936 , 2017.[83] Anirudh Goyal, Alex Lamb, Phanideep Gampa, PhilippeBeaudoin, Sergey Levine, Charles Blundell, Yoshua Bengio,and Michael Mozer. Object files and schemata: Factorizingdeclarative and procedural knowledge in dynamical systems. arXiv preprint 2006.16225 , 2020.[84] Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani,Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. Re-current independent mechanisms. In International Conferenceon Learning Representations , 2021.[85] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton.Speech recognition with deep recurrent neural networks. In , pages 6645–6649. IEEE, 2013.[86] Klaus Greff, Raphaël Lopez Kaufman, Rishabh Kabra, NickWatters, Christopher Burgess, Daniel Zoran, Loic Matthey,Matthew Botvinick, and Alexander Lerchner. Multi-objectrepresentation learning with iterative variational inference. In
International Conference on Machine Learning , pages 2424–2433, 2019.[87] Klaus Greff, Sjoerd van Steenkiste, and Jürgen Schmidhuber.On the binding problem in artificial neural networks. arXivpreprint 2012.05208 , 2020.[88] Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, YanWu, Hamza Merzic, and Aaron van den Oord. Shaping beliefstates with generative environment models for rl. In
Advancesin Neural Information Processing Systems , pages 13475–13487,2019.[89] Luigi Gresele, Paul K Rubenstein, Arash Mehrjou, FrancescoLocatello, and Bernhard Schölkopf. The incomplete rosettastone problem: Identifiability results for multi-view nonlinearica. arXiv preprint 1905.06642 , 2019.[90] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Mea-suring statistical dependence with Hilbert-Schmidt norms. In
Algorithmic Learning Theory , pages 63–78. Springer-Verlag,2005.[91] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, andB. Schölkopf. Kernel methods for measuring independence.
Journal of Machine Learning Research , 6:2075–2129, 2005.[92] Jean-Bastien Grill, Florian Strub, Florent Altché, CorentinTallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch,Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Ghesh-laghi Azar, et al. Bootstrap your own latent: A new approachto self-supervised learning. arXiv preprint 2006.07733 , 2020.[93] Radek Grzeszczuk, Demetri Terzopoulos, and Geoffrey Hinton.Neuroanimator: Fast neural network emulation and controlof physics-based models. In
Proceedings of the 25th annualconference on Computer graphics and interactive techniques ,pages 9–20, 1998.[94] Keren Gu, Brandon Yang, Jiquan Ngiam, Quoc Le, andJonathan Shlens. Using videos to evaluate image modelrobustness. arXiv preprint 1904.10076 , 2019.[95] Shixiang Gu and Luca Rigazio. Towards deep neuralnetwork architectures robust to adversarial examples, 2014.arXiv:1412.5068.[96] Ruocheng Guo, Lu Cheng, Jundong Li, P. Richard Hahn, andHuan Liu. A survey of learning causality with data: Problemsand methods. arXiv preprint 1809.09337 , 2018.[97] I. Guyon, D. Janzing, and B. Schölkopf. Causality: Objectivesand assessment. In I. Guyon, D. Janzing, and B. Schölkopf,editors,
JMLR Workshop and Conference Proceedings: Volume6 , pages 1–42, Cambridge, MA, USA, 2010. MIT Press.[98] David Ha and Jürgen Schmidhuber. World models. arXivpreprint 1803.10122 , 2018.[99] T. Haavelmo. The probability approach in econometrics.
Econometrica, 12:S1–S115 (supplement), 1944. [100] Hermanni Hälvä and Aapo Hyvärinen. Hidden Markov nonlinear ICA: Unsupervised learning from nonstationary time series. arXiv preprint 2006.12107, 2020. [101] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedingsof the IEEE conference on computer vision and patternrecognition , pages 770–778, 2016.[102] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and RossGirshick. Momentum contrast for unsupervised visual represen-tation learning. In
Proceedings of the IEEE/CVF Conference onComputer Vision and Pattern Recognition , pages 9729–9738,2020.[103] Siyu He, Yin Li, Yu Feng, Shirley Ho, Siamak Ravanbakhsh,Wei Chen, and Barnabás Póczos. Learning to predict thecosmological structure formation.
Proceedings of the NationalAcademy of Sciences , 116(28):13825–13832, 2019.[104] Christina Heinze-Deml and Nicolai Meinshausen. Conditionalvariance penalties and domain shift robustness. arXiv preprint1710.11469 , 2017.[105] Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen.Invariant causal prediction for nonlinear models. arXiv preprint1706.08576 , 2017.[106] Dan Hendrycks and Thomas Dietterich. Benchmarking neuralnetwork robustness to common corruptions and perturbations. arXiv preprint 1903.12261 , 2019.[107] Joseph Henrich.
The Secret of Our Success. Princeton University Press, 2016. [108] Katharine E Henry, David N Hager, Peter J Pronovost, and Suchi Saria. A targeted real-time early warning score (TREWScore) for septic shock.
Science translational medicine , 7(299):299ra122–299ra122, 2015.[109] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess,Xavier Glorot, Matthew Botvinick, Shakir Mohamed, andAlexander Lerchner. beta-vae: Learning basic visual conceptswith a constrained variational framework. In
InternationalConference on Learning Representations , 2016.[110] R Devon Hjelm and William Buchwalter. Learning represen-tations by maximizing mutual information across views. In
Advances in Neural Information Processing Systems , pages15535–15545, 2019.[111] K. D. Hoover. Causality in economics and econometrics. InS. N. Durlauf and L. E. Blume, editors,
The New PalgraveDictionary of Economics . Palgrave Macmillan, Basingstoke,UK, 2nd edition, 2008.[112] Jeremy Howard and Sebastian Ruder. Universal language modelfine-tuning for text classification. arXiv preprint 1801.06146 ,2018.[113] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, andB. Schölkopf. Nonlinear causal discovery with additive noisemodels. In
Advances in Neural Information Processing Systems21 (NIPS) , pages 689–696, 2009.[114] B. Huang, K. Zhang, J. Zhang, R. Sanchez-Romero, C. Gly-mour, and B. Schölkopf. Behind distribution shift: Miningdriving forces of changes and causal arrows. In
IEEE 17thInternational Conference on Data Mining (ICDM 2017) , pages913–918, 2017.[115] Biwei Huang, Kun Zhang, Jiji Zhang, Joseph Ramsey, RubenSanchez-Romero, Clark Glymour, and Bernhard Schölkopf.Causal discovery from heterogeneous/nonstationary data.
Jour-nal of Machine Learning Research , 21(89):1–53, 2020. URLhttp://jmlr.org/papers/v21/19-232.html.[116] Aapo Hyvärinen and Petteri Pajunen. Nonlinear independentcomponent analysis: Existence and uniqueness results.
Neural Networks, 12(3):429–439, 1999. [117] Aapo Hyvärinen and Hiroshi Morioka. Nonlinear ICA of temporally dependent stationary sources. In
Proceedings of MachineLearning Research , 2017.[118] Guido W Imbens and Donald B Rubin.
Causal inferencein statistics, social, and biomedical sciences . CambridgeUniversity Press, 2015. [119] D. Janzing. Causal regularization. In Advances in NeuralInformation Processing Systems 33 , 2019.[120] D. Janzing and B. Schölkopf. Causal inference using the algo-rithmic Markov condition.
IEEE Transactions on InformationTheory , 56(10):5168–5194, 2010.[121] D. Janzing and B. Schölkopf. Semi-supervised interpolation inan anticausal learning scenario.
Journal of Machine LearningResearch , 16:1923–1948, 2015.[122] D. Janzing and B. Schölkopf. Detecting non-causal artifacts inmultivariate linear regression models. In
Proceedings of the35th International Conference on Machine Learning (ICML) ,pages 2250–2258, 2018.[123] D. Janzing, J. Peters, J. M. Mooij, and B. Schölkopf. Identifyingconfounders using additive noise models. In
Proceedings of the25th Annual Conference on Uncertainty in Artificial Intelligence(UAI) , pages 249–257, 2009.[124] D. Janzing, P. Hoyer, and B. Schölkopf. Telling cause fromeffect based on high-dimensional observations. In J. Fürnkranzand T. Joachims, editors,
Proceedings of the 27th InternationalConference on Machine Learning , pages 479–486, 2010.[125] D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler,P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions.
ArtificialIntelligence , 182–183:1–31, 2012.[126] D. Janzing, R. Chaves, and B. Schölkopf. Algorithmic indepen-dence of initial condition and dynamical law in thermodynamicsand causal inference.
New Journal of Physics , 18(9), 2016.URL http://stacks.iop.org/1367-2630/18/i=9/a=093052.[127] Leslie Pack Kaelbling, Michael L Littman, and Andrew WMoore. Reinforcement learning: A survey.
Journal of artificialintelligence research , 4:237–285, 1996.[128] Daniel Kahneman.
Thinking, Fast and Slow, 2011. [129] , pages 1–5. IEEE, 2016. [130] Amir-Hossein Karimi, Julius von Kügelgen, Bernhard Schölkopf, and Isabel Valera. Algorithmic recourse under imperfect causal knowledge: a probabilistic approach. arXiv preprint 2006.06831, 2020. Published at NeurIPS. [131] Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael Mozer, Chris Pal, and Yoshua Bengio. Learning neural causal models from unknown interventions. arXiv preprint 1910.01075v2, 2020. [132] Moein Khajehnejad, Behzad Tabibian, Bernhard Schölkopf, Adish Singla, and Manuel Gomez-Rodriguez. Optimal decision making under strategic behavior. arXiv preprint 1905.09239, 2019. [133] N. Kilbertus, M. Rojas Carulla, G. Parascandolo, M. Hardt, D. Janzing, and B. Schölkopf. Avoiding discrimination through causal reasoning. In
Advances in Neural Information ProcessingSystems 30 , pages 656–666, 2017.[134] Niki Kilbertus, Giambattista Parascandolo, and BernhardSchölkopf. Generalization in anti-causal learning. arXivpreprint 1812.00524 , 2018.[135] Hyunjik Kim and Andriy Mnih. Disentangling by factorising.In
International Conference on Machine Learning , 2018.[136] Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling,and Richard Zemel. Neural relational inference for interactingsystems. In
International Conference on Machine Learning ,pages 2688–2697, 2018.[137] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby.Big transfer (bit): General visual representation learning. arXivpreprint 1912.11370 , 2019.[138] Adam Kosiorek, Hyunjik Kim, Yee Whye Teh, and IngmarPosner. Sequential attend, infer, repeat: Generative modellingof moving objects.
Advances in Neural Information ProcessingSystems , 31:8606–8616, 2018.[139] S. Kpotufe, E. Sgouritsa, D. Janzing, and B. Schölkopf.Consistency of causal inference under the additive noise model.In
Proceedings of the 31th International Conference on MachineLearning , pages 478–486, 2014.[140] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.Imagenet classification with deep convolutional neural networks.In
Advances in neural information processing systems , pages1097–1105, 2012.[141] Tejas D Kulkarni, Ankush Gupta, Catalin Ionescu, Sebas-tian Borgeaud, Malcolm Reynolds, Andrew Zisserman, andVolodymyr Mnih. Unsupervised learning of object keypointsfor perception and control. In
Advances in Neural InformationProcessing Systems , pages 10723–10733, 2019.[142] Matt J Kusner, Joshua Loftus, Chris Russell, and Ricardo Silva.Counterfactual fairness. In
Advances in Neural InformationProcessing Systems 30 , pages 4066–4076. Curran Associates,Inc., 2017.[143] L’ubor Ladick`y, SoHyeon Jeong, Barbara Solenthaler, MarcPollefeys, and Markus Gross. Data-driven fluid simulationsusing regression forests.
ACM Transactions on Graphics (TOG) ,34(6):1–9, 2015.[144] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, andSamuel J Gershman. Building machines that learn and thinklike people.
Behavioral and brain sciences , 40, 2017.[145] Janet Landman, Elizabeth A Vandewater, Abigail J Stewart,and Janet E Malley. Missed opportunities: Psychologicalramifications of counterfactual thought in midlife women.
Journal of Adult Development , 2(2):87–97, 1995.[146] Sascha Lange, Thomas Gabel, and Martin Riedmiller. Batchreinforcement learning. In Marco Wiering and Martijn van Ot-terlo, editors,
Reinforcement Learning: State-of-the-Art , pages45–73. Springer, Berlin, Heidelberg, 2012.[147] S. L. Lauritzen.
Graphical Models . Oxford University Press,New York, NY, 1996.[148] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deeplearning.
Nature , 521(7553):436–444, 2015.[149] Felix Leeb, Yashas Annadani, Stefan Bauer, and BernhardSchölkopf. Structural autoencoders improve representations forgeneration and transfer. arXiv preprint 2006.07796 , 2020.[150] Sergey Levine, Aviral Kumar, George Tucker, and JustinFu. Offline reinforcement learning: Tutorial, review, andperspectives on open problems. arXiv preprint 2005.01643 ,2020.[151] David Lewis. Causation.
The journal of philosophy , 70(17):556–567, 1974.[152] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, andDacheng Tao. Domain generalization via conditional invariantrepresentation. arXiv preprint 1807.08479 , 2018.[153] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, TongliangLiu, Kun Zhang, and Dacheng Tao. Deep domain generalizationvia conditional invariant adversarial networks. In
The EuropeanConference on Computer Vision (ECCV) , 2018.[154] Sungbin Lim, Ildoo Kim, Taesup Kim, Chiheon Kim, andSungwoong Kim. Fast autoaugment. In
Advances in NeuralInformation Processing Systems , pages 6665–6675, 2019.[155] Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, WeihaoSun, Gautam Singh, Fei Deng, Jindong Jiang, and SungjinAhn. Space: Unsupervised object-oriented scene representationvia spatial attention and decomposition. In
InternationalConference on Learning Representations , 2019.[156] Zachary C. Lipton, Yu-Xiang Wang, and Alex Smola. Detectingand correcting for label shift with black box predictors. arXiv preprint 1802.03916 , 2018.[157] Francesco Locatello, Gabriele Abbati, Tom Rainforth, StefanBauer, Bernhard Schölkopf, and Olivier Bachem. On thefairness of disentangled representations. In Advances in NeuralInformation Processing Systems , pages 14544–14557, 2019.[158] Francesco Locatello, Stefan Bauer, Mario Lucic, GunnarRätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem.Challenging common assumptions in the unsupervised learningof disentangled representations.
Proceedings of the 36thInternational Conference on Machine Learning , 2019.[159] Francesco Locatello, Ben Poole, Gunnar Rätsch, BernhardSchölkopf, Olivier Bachem, and Michael Tschannen. Weakly-supervised disentanglement without compromises. In
Proceed-ings of the 37th International Conference on Machine Learning(ICML) , 2020.[160] Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner,Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, AlexeyDosovitskiy, and Thomas Kipf. Object-centric learning withslot attention. In
Advances in Neural Information ProcessingSystems , 2020.[161] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin.Towards a learning theory of cause-effect inference. In
Proceedings of the 32nd International Conference on MachineLearning , pages 1452–1461, 2015.[162] D. Lopez-Paz, R. Nishihara, S. Chintala, B. Schölkopf, andL. Bottou. Discovering causal signals in images. In
IEEEConference on Computer Vision and Pattern Recognition(CVPR) , pages 58–66, 2017.[163] K. Lorenz.
Die Rückseite des Spiegels . R. Piper & Co. Verlag,1973.[164] Chaochao Lu, Bernhard Schölkopf, and José Miguel Hernández-Lobato. Deconfounding reinforcement learning in observationalsettings. arXiv preprint 1812.10576 , 2018.[165] Chaochao Lu, Biwei Huang, Ke Wang, José Miguel Hernández-Lobato, Kun Zhang, and Bernhard Schölkopf. Sample-efficientreinforcement learning via counterfactual-based data augmen-tation. arXiv preprint 2012.09092 , 2020.[166] Alexander Selvikvåg Lundervold and Arvid Lundervold. Anoverview of deep learning in medical imaging focusing on MRI.
Zeitschrift für Medizinische Physik , 29(2):102–127, 2019.[167] Sara Magliacane, Thijs van Ommen, Tom Claassen, StephanBongers, Philip Versteeg, and Joris M. Mooij. Domainadaptation by using causal inference to predict invariantconditional distributions. In
Proc. NeurIPS , 2018.[168] Robert Matthews. Storks deliver babies (p= 0.008).
Teaching Statistics, 22(2):36–38, 2000. [169] Nicolai Meinshausen. Causality from a distributional robustness point of view. In , pages 6–10. IEEE, 2018. [170] Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint 1907.07484, 2019. [171] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning.
Nature , 518(7540):529–533, 2015.[172] B. T. Montet, T. D. Morton, D. Foreman-Mackey, J. A. Johnson,D. W. Hogg, B. P. Bowler, D. W. Latham, A. Bieryla, andA. W. Mann. Stellar and planetary properties of K2 campaign1 candidates and validation of 17 planets, including a planetreceiving earth-like insolation.
The Astrophysical Journal , 809(1):25, 2015.[173] J. Mooij, D. Janzing, and B. Schölkopf. From ordinary differ-ential equations to structural causal models: the deterministic case. In A. Nicholson and P. Smyth, editors,
Proceedings of the Twenty-Ninth Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2013. [174] In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 745–752, 2009. [175] J. M. Mooij, D. Janzing, T. Heskes, and B. Schölkopf. On causal discovery with cyclic additive noise models. In
Advancesin Neural Information Processing Systems 24 (NIPS) , 2011.[176] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, andB. Schölkopf. Distinguishing cause from effect using obser-vational data: methods and benchmarks.
Journal of MachineLearning Research , 17(32):1–102, 2016.[177] Damian Mrowca, Chengxu Zhuang, Elias Wang, Nick Haber,Li Fei-Fei, Josh Tenenbaum, and Daniel L K Yamins. Flexibleneural representation for physics prediction. In
Advancesin Neural Information Processing Systems , pages 8799–8810,2018.[178] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, andSatinder Singh. Action-conditional video prediction using deepnetworks in atari games. In
Advances in neural informationprocessing systems , pages 2863–2871, 2015.[179] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representa-tion learning with contrastive predictive coding. arXiv preprint1807.03748 , 2018.[180] G. Parascandolo, M. Rojas-Carulla, N. Kilbertus, andB. Schölkopf. Learning independent causal mechanisms.In
Workshop: Learning Disentangled Representations: fromPerception to Control at the 31st Conference on NeuralInformation Processing Systems (NIPS) , 2017.[181] G. Parascandolo, N. Kilbertus, M. Rojas-Carulla, andB. Schölkopf. Learning independent causal mechanisms. In
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:4036–4044, 2018. [182] Giambattista Parascandolo, Alexander Neitz, Antonio Orvieto, Luigi Gresele, and Bernhard Schölkopf. Learning explanations that are hard to vary. In
International Conferenceon Learning Representations , 2021.[183] J. Pearl.
Causality: Models, Reasoning, and Inference . Cam-bridge University Press, New York, NY, 2nd edition, 2009.[184] J. Pearl. Giving computers free will.
Forbes , 2009.[185] Judea Pearl and Elias Bareinboim. External validity: From do-calculus to transportability across populations. arXiv preprint1503.01603 , 2015.[186] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Identifia-bility of causal graphs using functional models. In
Proceedingsof the 27th Annual Conference on Uncertainty in ArtificialIntelligence (UAI) , pages 589–598, 2011.[187] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Causaldiscovery with continuous additive noise models.
Journalof Machine Learning Research , 15:2009–2053, 2014. URLhttp://jmlr.org/papers/v15/peters14a.html.[188] J. Peters, D. Janzing, and B. Schölkopf.
Elements of CausalInference - Foundations and Learning Algorithms . MIT Press,Cambridge, MA, USA, 2017.[189] Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causalinference by using invariant prediction: identification andconfidence intervals.
Journal of the Royal Statistical Society:Series B (Statistical Methodology) , 78(5):947–1012, 2016.[190] Jonas Peters, Stefan Bauer, and Niklas Pfister. Causal modelsfor dynamical systems. arXiv preprint 2001.06208 , 2020.[191] N. Pfister, P. Bühlmann, B. Schölkopf, and J. Peters. Kernel-based tests for joint independence.
Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , 80(1):5–31, 2018.[192] Niklas Pfister, Stefan Bauer, and Jonas Peters. Learning stable and predictive structures in kinetic systems. Proceedings of theNational Academy of Sciences , 116(51):25405–25411, 2019.[193] Niklas Pfister, Peter Bühlmann, and Jonas Peters. Invariantcausal prediction for sequential data.
Journal of the AmericanStatistical Association , 114(527):1264–1276, 2019.[194] Rémi Le Priol, Reza Babanezhad Harikandeh, Yoshua Bengio,and Simon Lacoste-Julien. An analysis of the adaptation speedof causal models. arXiv preprint 2005.09136 , 2020.[195] Stephan Rabanser, Stephan Günnemann, and Zachary C. Lipton.Failing loudly: An empirical study of methods for detectingdataset shift. arXiv preprint 1810.11953 , 2018.[196] Alec Radford, Karthik Narasimhan, Tim Salimans, and IlyaSutskever. Improving language understanding by generativepre-training, 2018.[197] Nasim Rahaman, Anirudh Goyal, Muhammad Waleed Gondal,Manuel Wuthrich, Stefan Bauer, Yash Sharma, Yoshua Bengio,and Bernhard Schölkopf. Spatially structured recurrent modules.In
International Conference on Learning Representations , 2021.[198] H. Reichenbach.
The Direction of Time . University of CaliforniaPress, Berkeley, CA, 1956.[199] Laine K Reichert and John R Slate. Reflective learning: Theuse of “if only...” statements to improve performance.
SocialPsychology of Education , 3(4):261–275, 1999.[200] Danilo J Rezende, Ivo Danihelka, George Papamakarios,Nan Rosemary Ke, Ray Jiang, Theophane Weber, Karol Gregor,Hamza Merzic, Fabio Viola, Jane Wang, et al. Causally correctpartial models for reinforcement learning. arXiv preprint2002.02836 , 2020.[201] Jonathan G Richens, Ciarán M Lee, and Saurabh Johri.Improving the accuracy of medical diagnosis with causalmachine learning.
Nature Communications , 11(1):3923, 2020.[202] Karl Ridgeway and Michael C Mozer. Learning deep disen-tangled embeddings with the f-statistic loss. In
Advances inNeural Information Processing Systems , pages 185–194, 2018.[203] Neal J Roese. The functional basis of counterfactual thinking.
Journal of personality and Social Psychology , 66(5):805, 1994.[204] M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters.Invariant models for causal transfer learning.
Journal ofMachine Learning Research , 19(36):1–34, 2018.[205] Michal Rolinek, Dominik Zietlow, and Georg Martius. Varia-tional autoencoders pursue PCA directions (by accident). In
Proceedings of the IEEE Conference on Computer Vision andPattern Recognition , pages 12406–12415, 2019.[206] Prasun Roy, Subhankar Ghosh, Saumik Bhattacharya, andUmapada Pal. Effects of degradations on deep neural networkarchitectures. arXiv preprint 1807.10108 , 2018.[207] P. K. Rubenstein, S. Weichwald, S. Bongers, J. M. Mooij,D. Janzing, M. Grosse-Wentrup, and B. Schölkopf. Causalconsistency of structural equation models. In
Proceedingsof the Thirty-Third Conference on Uncertainty in ArtificialIntelligence , pages 808–817, 2017.[208] P. K. Rubenstein, S. Bongers, B. Schölkopf, and J. M. Mooij.From deterministic ODEs to dynamic structural causal models.In
Proceedings of the 34th Conference on Uncertainty inArtificial Intelligence (UAI) , 2018.[209] Sebastian Ruder. An overview of multi-task learning in deepneural networks. arXiv preprint 1706.05098 , 2017.[210] Stuart Russell and Peter Norvig.
Artificial intelligence: amodern approach . Prentice Hall, 2002.[211] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, RexYing, Jure Leskovec, and Peter W Battaglia. Learning tosimulate complex physics with graph networks. arXiv preprint2002.09405 , 2020.[212] Adam Santoro, David Raposo, David G Barrett, MateuszMalinowski, Razvan Pascanu, Peter Battaglia, and TimothyLillicrap. A simple neural network module for relationalreasoning. In
Advances in neural information processingsystems , pages 4967–4976, 2017.[213] Jürgen Schmidhuber.
Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook .PhD thesis, Technische Universität München, 1987.[214] Jürgen Schmidhuber. Curious model-building control systems.In
Proc. international joint conference on neural networks ,pages 1458–1463, 1991.[215] B. Schölkopf. Artificial intelligence: Learning to see and act.
Nature , 518(7540):486–487, 2015.[216] B. Schölkopf. Causal learning, 2017. Invited Talk, 34thInternational Conference on Machine Learning (ICML), https://vimeo.com/238274659.[217] B. Schölkopf and A. J. Smola.
Learning with Kernels . MITPress, Cambridge, MA, 2002.[218] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, andJ. M. Mooij. On causal and anticausal learning. In
Proceedingsof the 29th International Conference on Machine Learning(ICML) , pages 1255–1262, 2012.[219] B. Schölkopf, D. Hogg, D. Wang, D. Foreman-Mackey, D. Janz-ing, C.-J. Simon-Gabriel, and J. Peters. Modeling confoundingby half-sibling regression.
Proceedings of the National Academyof Science (PNAS) , 113(27):7391–7398, 2016.[220] B. Schölkopf, D. Janzing, and D. Lopez-Paz. Causal andstatistical learning. In
Oberwolfach Reports , volume 13(3),pages 1896–1899, 2016. doi: 10.14760/OWR-2016-33. URLhttps://publications.mfo.de/handle/mfo/3537.[221] Bernhard Schölkopf. Causality for machine learning. arXivpreprint 1911.10500 , 2019.[222] Lukas Schott, Jonas Rauber, Matthias Bethge, and WielandBrendel. Towards the first adversarially robust neural networkmodel on MNIST. In
International Conference on LearningRepresentations , 2019. URL https://openreview.net/forum?id=S1EHOsC9tX.[223] Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert,Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez,Edward Lockhart, Demis Hassabis, Thore Graepel, et al.Mastering atari, go, chess and shogi by planning with a learnedmodel. arXiv preprint 1911.08265 , 2019.[224] Peter Schulam and Suchi Saria. Reliable decision support usingcounterfactual models. In
Advances in Neural InformationProcessing Systems , pages 1697–1708, 2017.[225] Rajen D. Shah and Jonas Peters. The hardness of conditionalindependence testing and the generalised covariance measure. arXiv preprint 1804.07203 , 2018.[226] N. Shajarisales, D. Janzing, B. Schölkopf, and M. Besserve.Telling cause from effect in deterministic linear dynamicalsystems. In
Proceedings of the 32nd International Conferenceon Machine Learning (ICML) , pages 285–294, 2015.[227] Vaishaal Shankar, Achal Dave, Rebecca Roelofs, Deva Ra-manan, Benjamin Recht, and Ludwig Schmidt. Do imageclassifiers generalize across time? arXiv preprint 1906.02168 ,2019.[228] Rakshith Shetty, Bernt Schiele, and Mario Fritz. Not using thecar to see the sidewalk–quantifying and controlling the effects ofcontext in classification and segmentation. In
Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition ,pages 8218–8226, 2019.[229] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. J. Kerminen. Alinear non-Gaussian acyclic model for causal discovery.
Journalof Machine Learning Research , 7:2003–2030, 2006.[230] Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon, andBen Poole. Weakly supervised disentanglement with guarantees. arXiv preprint 1910.09772 , 2019.[231] David Silver, Aja Huang, Chris J Maddison, Arthur Guez,Laurent Sifre, George Van Den Driessche, Julian Schrittwieser,Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al.Mastering the game of go with deep neural networks and treesearch.
Nature , 529(7587):484–489, 2016.[232] David Silver, Hado Hasselt, Matteo Hessel, Tom Schaul, ArthurGuez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, NeilRabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. In International Conference on MachineLearning , pages 3191–3199. PMLR, 2017.[233] Patrice Simard, Bernard Victorri, Yann LeCun, and JohnDenker. Tangent prop - a formalism for specifying selectedinvariances in an adaptive network. In J. Moody, S. Hanson,and R. P. Lippmann, editors,
Advances in Neural Informa-tion Processing Systems , volume 4, pages 895–903. Morgan-Kaufmann, 1992. URL https://proceedings.neurips.cc/paper/1991/file/65658fde58ab3c2b6e5132a39fae7cb9-Paper.pdf.[234] Patrice Y Simard, David Steinkraus, John C Platt, et al. Bestpractices for convolutional neural networks applied to visualdocument analysis. In
Proceedings of the Seventh InternationalConference on Document Analysis and Recognition (ICDAR2003) , volume 3, 2003.[235] H. A. Simon. Causal ordering and identifiability. In W. C. Hoodand T. C. Koopmans, editors,
Studies in Econometric Methods ,pages 49–74. John Wiley & Sons, New York, NY, 1953. CowlesCommission for Research in Economics, Monograph No. 14.[236] Elizabeth S Spelke. Principles of object perception.
Cognitivescience , 14(1):29–56, 1990.[237] P. Spirtes, C. Glymour, and R. Scheines.
Causation, Prediction,and Search . MIT Press, Cambridge, MA, 2nd edition, 2000.[238] W. Spohn.
Grundlagen der Entscheidungstheorie . Scriptor-Verlag, 1978.[239] I. Steinwart and A. Christmann.
Support Vector Machines .Springer, New York, NY, 2008.[240] B. Steudel, D. Janzing, and B. Schölkopf. Causal Markovcondition for submodular information measures. In
Proceedingsof the 23rd Annual Conference on Learning Theory (COLT) ,pages 464–476, 2010.[241] Jianyu Su, Stephen Adams, and Peter A Beling. Counterfactualmulti-agent reinforcement learning with graph convolutioncommunication. arXiv preprint 2004.00470 , 2020.[242] Adarsh Subbaswamy and Suchi Saria. Counterfactual nor-malization: Proactively addressing dataset shift and improvingreliability using causal mechanisms. arXiv preprint 1808.03253 ,2018.[243] Adarsh Subbaswamy, Peter Schulam, and Suchi Saria. Prevent-ing failures due to dataset shift: Learning predictive modelsthat transport. arXiv preprint 1812.04597 , 2018.[244] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and AbhinavGupta. Revisiting unreasonable effectiveness of data in deeplearning era. In
Proceedings of the IEEE internationalconference on computer vision , pages 843–852, 2017.[245] Chen Sun, Per Karlsson, Jiajun Wu, Joshua B Tenenbaum, andKevin Murphy. Stochastic prediction of multi-agent interactionsfrom partial observations. arXiv preprint 1902.09641 , 2019.[246] X. Sun, D. Janzing, and B. Schölkopf. Causal inferenceby choosing graphs with most plausible Markov kernels. In
Proceedings of the 9th International Symposium on ArtificialIntelligence and Mathematics , 2006.[247] Raphael Suter, Djordje Miladinovic, Bernhard Schölkopf, andStefan Bauer. Robustly disentangled causal mechanisms:Validating deep representations for interventional robustness.In
International Conference on Machine Learning , pages 6056–6065. PMLR, 2019.[248] Richard S Sutton, Andrew G Barto, et al.
Introduction toreinforcement learning , volume 135. MIT press Cambridge,1998.[249] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, JoanBruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus.Intriguing properties of neural networks. arXiv preprint1312.6199 , 2013.[250] Ern˝o Téglás, Edward Vul, Vittorio Girotto, Michel Gonzalez,Joshua B Tenenbaum, and Luca L Bonatti. Pure reasoning in12-month-old infants as probabilistic inference.
Science , 332(6033):1054–1059, 2011.[251] J. Tian and J. Pearl. Causal discovery from changes. In
Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI) , pages 512–522, 2001.[252] Frederik Träuble, Elliot Creager, Niki Kilbertus, AnirudhGoyal, Francesco Locatello, Bernhard Schölkopf, and StefanBauer. Is independence all you need? on the generalization ofrepresentations learned from correlated data. arXiv preprint2006.07886 , 2020.[253] Michael Tschannen, Josip Djolonga, Marvin Ritter, AravindhMahendran, Neil Houlsby, Sylvain Gelly, and Mario Lucic.Self-supervised learning of video-induced visual invariances.In
Proceedings of the IEEE/CVF Conference on ComputerVision and Pattern Recognition , pages 13806–13815, 2020.[254] Angelos Tsiaras, Ingo Waldmann, G. Tinetti, Jonathan Ten-nyson, and Sergei Yurchenko. Water vapour in the atmosphereof the habitable-zone eight-earth-mass planet K2-18b.
Nature Astronomy, 2019. doi: 10.1038/s41550-019-0878-9. [255] Sjoerd van Steenkiste, Michael Chang, Klaus Greff, and Jürgen Schmidhuber. Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. In International Conference on Learning Representations, 2018. [256] Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? In
Advances in NeuralInformation Processing Systems , pages 14178–14191, 2019.[257] V. N. Vapnik.
Statistical Learning Theory . Wiley, New York,NY, 1998.[258] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki,Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David HChoi, Richard Powell, Timo Ewalds, Petko Georgiev, et al.Grandmaster level in StarCraft II using multi-agent reinforce-ment learning.
Nature , 575(7782):350–354, 2019.[259] J. von Kügelgen, A. Mey, M. Loog, and B. Schölkopf. Semi-supervised learning, causality and the conditional clusterassumption.
Conference on Uncertainty in Artificial Intelligence(UAI) , 2020.[260] Julius von Kügelgen, Umang Bhatt, Amir-Hossein Karimi,Isabel Valera, Adrian Weller, and Bernhard Schölkopf. Onthe fairness of causal algorithmic recourse. arXiv 2010.06529 ,2020.[261] Julius von Kügelgen, Luigi Gresele, and Bernhard Schölkopf.Simpson’s paradox in Covid-19 case fatality rates: a mediationanalysis of age-related causal effects. arXiv 2005.07180 , 2020.[262] Julius von Kügelgen, Ivan Ustyuzhaninov, Peter Gehler,Matthias Bethge, and Bernhard Schölkopf. Towards causalgenerative scene models via competition of experts. arXiv2004.12906 , 2020.[263] Haohan Wang, Zexue He, Zachary C. Lipton, and Eric P.Xing. Learning robust representations by projecting superficialstatistics out. arXiv preprint 1903.06256 , 2019.[264] Nicholas Watters, Daniel Zoran, Theophane Weber, PeterBattaglia, Razvan Pascanu, and Andrea Tacchetti. Visualinteraction networks: Learning a physics simulator from video.In
Advances in Neural Information Processing Systems, pages 4539–4547, 2017. [265] Nicholas Watters, Loic Matthey, Matko Bosnjak, Christopher P Burgess, and Alexander Lerchner. COBRA: Data-efficient model-based RL through unsupervised object discovery and curiosity-driven exploration. arXiv preprint 1905.09275, 2019. [266] S. Weichwald, B. Schölkopf, T. Ball, and M. Grosse-Wentrup. Causal and anti-causal learning in pattern recognition for neuroimaging. In . IEEE, 2014. [267] Sebastian Weichwald.
Pragmatism and Variable Transforma-tions in Causal Modelling . PhD thesis, ETH Zurich, 2019.[268] Marco Wiering and Martijn Van Otterlo.
Reinforcementlearning , volume 12. Springer, 2012.[269] Steffen Wiewel, Moritz Becher, and Nils Thuerey. Latent spacephysics: Towards learning the temporal evolution of fluid flow.In
Computer Graphics Forum , volume 38, pages 71–82. Wiley Online Library, 2019.[270] Laurenz Wiskott and Terrence J Sejnowski. Slow feature anal-ysis: unsupervised learning of invariances.
Neural Computation, 14(4):715–770, April 2002. ISSN 0899-7667. [271] Chris Xie, Sachin Patil, Teodor Moldovan, Sergey Levine, and Pieter Abbeel. Model-based reinforcement learning with parametrized physical models and optimism-driven exploration. In , pages 504–511. IEEE, 2016. [272] Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. CLEVRER: Collision events for video representation and reasoning. arXiv preprint 1910.01442, 2019. [273] Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GANITE: Estimation of individualized treatment effects using generative adversarial nets. In
International Conference onLearning Representations , 2018.[274] Vinicius Zambaldi, David Raposo, Adam Santoro, Victor Bapst,Yujia Li, Igor Babuschkin, Karl Tuyls, David Reichert, TimothyLillicrap, Edward Lockhart, et al. Deep reinforcement learningwith relational inductive biases. In
International Conferenceon Learning Representations , 2018.[275] J. Zhang and E. Bareinboim. Near-optimal reinforcementlearning in dynamic treatment regimes. In
Advances in NeuralInformation Processing Systems 33 , pages 13401–13411, 2019.[276] Junzhe Zhang and Elias Bareinboim. Fairness in decision-making - the causal explanation formula. In
Proceedings ofthe Thirty-Second AAAI Conference on Artificial Intelligence,New Orleans, Louisiana, USA , pages 2037–2045, 2018.[277] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In
Proceedings of the 25th AnnualConference on Uncertainty in Artificial Intelligence (UAI) ,pages 647–655, 2009.[278] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causaldiscovery. In
Proceedings of the 27th Annual Conference onUncertainty in Artificial Intelligence (UAI) , pages 804–813,2011.[279] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domainadaptation under target and conditional shift. In
Proceedingsof the 30th International Conference on Machine Learning(ICML) , pages 819–827, 2013.[280] K. Zhang, M. Gong, and B. Schölkopf. Multi-source domainadaptation: A causal view. In
Proceedings of the 29th AAAIConference on Artificial Intelligence , pages 3150–3157, 2015.[281] K. Zhang, B. Huang, J. Zhang, C. Glymour, and B. Schölkopf.Causal discovery from nonstationary/heterogeneous data: Skele-ton estimation and orientation determination. In
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), pages 1347–1353, 2017. [282] Richard Zhang. Making convolutional networks shift-invariant again. arXiv preprint 1904.11486, 2019.