Causality in cognitive neuroscience: concepts, challenges, and distributional robustness
Sebastian Weichwald & Jonas Peters

July 2020
Abstract

While probabilistic models describe the dependence structure between observed variables, causal models go one step further: they predict, for example, how cognitive functions are affected by external interventions that perturb neuronal activity. In this review and perspective article, we introduce the concept of causality in the context of cognitive neuroscience and review existing methods for inferring causal relationships from data. Causal inference is an ambitious task that is particularly challenging in cognitive neuroscience. We discuss two difficulties in more detail: the scarcity of interventional data and the challenge of finding the right variables. We argue for distributional robustness as a guiding principle to tackle these problems. Robustness (or invariance) is a fundamental principle underlying causal methodology. A causal model of a target variable generalises across environments or subjects as long as these environments leave the causal mechanisms intact. Consequently, if a candidate model does not generalise, then either it does not consist of the target variable's causes or the underlying variables do not represent the correct granularity of the problem. In this sense, assessing generalisability may be useful when defining relevant variables and can be used to partially compensate for the lack of interventional data.
1 Introduction

Cognitive neuroscience aims to describe and understand the neuronal underpinnings of cognitive functions such as perception, attention, or learning. The objective is to characterise brain activity and cognitive functions, and to relate one to the other. The submission guidelines for the Journal of Cognitive Neuroscience, for example, state: "The Journal will not publish research reports that bear solely on descriptions of function without addressing the underlying brain events, or that deal solely with descriptions of neurophysiology or neuroanatomy without regard to function." We think that understanding this relation requires us to relate brain events and cognitive function in terms of the cause-effect relationships that govern their interplay. A causal model could, for example, describe how cognitive functions are affected by external interventions that perturb neuronal activity (cf. Section 2.1). Reid et al. (2019) argue that "the ultimate phenomenon of theoretical interest in all FC [functional connectivity] research is understanding the causal interaction among neural entities".

Causal inference in cognitive neuroscience is of great importance and notoriously difficult. This motivates our discussion of two pivotal challenges. First, the scarcity of interventional data is problematic as several causal models may be equally compatible with the observed data while making conflicting predictions only about the effects of interventions (cf. Section 3.1). Second, the ability to understand how neuronal activity gives rise to cognition depends on finding the right variables to represent the neuronal activity (cf. Section 3.2). Our starting point is the well-known observation that causal models of a target (or response) variable are distributionally robust and thus generalise across environments, subjects, and interventional shifts (Haavelmo, 1944; Aldrich, 1989; Pearl, 2009). Models that do not generalise are either based upon the wrong variables that do not represent causal entities or include variables that are not causes of the target variable. We thus propose to pursue robust (or invariant) models. That way, distributional robustness may serve as a guiding principle towards a causal understanding of cognitive function and may help us tackle both challenges mentioned above.

1.1 Running examples

We consider the following simplified examples. Assume that the consumption of alcohol affects reaction times in a cognitive task. In a randomised controlled trial we find that drinking alcoholic (versus non-alcoholic) beer results in slowed reaction times thereafter. Therefore, we may write 'alcohol → reaction time' and call alcohol a cause of reaction time and reaction time an effect of alcohol. Intervening on the cause results in a change in the distribution of the effect. In our example, prohibiting the consumption of any alcoholic beers results in faster reaction times.

In cognitive neuroscience one may wish to describe how the neuronal activity is altered upon beer consumption and how this change in turn affects the reaction time. For this, we additionally require a measurement of neuronal activity, say a functional magnetic resonance imaging (fMRI) scan and voxel-wise blood-oxygen-level dependent (BOLD) signals, that can serve as explanans in a description of the phenomenon 'alcohol → neuronal activity → reaction time'. We distinguish the following two scenarios.
Running Example A, illustrated in Figure 1a. A so-called treatment or stimulus variable T (say, consumption of alcohol) affects neuronal activity as measured by a d-dimensional feature vector X = [X_1, ..., X_d]^T, and the target variable Y reflects a cognitive function (say, reaction time). We may concisely write T → X → Y for a treatment that affects neuronal activity which in turn maintains a cognitive function (this is analogous to the 'stimulus → brain activity → response' set-up considered in Weichwald, Schölkopf, Ball & Grosse-Wentrup, 2014; Weichwald et al., 2015).

Running Example B, illustrated in Figure 1b. We may wish to describe how neuronal entities cause one another and hence designate one such entity as the target variable Y. In this example, we consider a target variable corresponding to a specific brain signal or region instead of a behavioural or cognitive response.

1.2 Related work

Several methods such as Granger causality or constraint-based methods have been applied to the problem of inferring causality from cognitive neuroscience data. We describe these methods in Section 2.3. In addition, there are ongoing conceptual debates that revolve around the principle of causality in cognitive neuroscience, some of which we now mention. Mehler and Kording (2018) raise concerns about the "lure of causal statements" and expound the problem of confounding when interpreting functional connectivity. Confounders are similarly problematic for multi-voxel pattern analyses (Todd, Nystrom & Cohen, 2013; Woolgar, Golland & Bode, 2014). The causal interpretation of encoding and decoding (forward and backward, univariate and multivariate) models has received much attention as they are common in the analysis of neuroimaging data: Davis et al. (2014) examine the differences between the model types, Haufe et al. (2014) point out that the weights of linear backward models may be misleading, and Weichwald et al. (2015) extend the latter argument to non-linear models and clarify which causal interpretations are warranted from either model type. Feature relevance in mass-univariate and multivariate models can be linked to marginal and conditional dependence statements that yield an enriched causal interpretation when both are combined (Weichwald et al., 2015); this consideration yields refined results in neuroimaging analyses (Huth et al., 2016; Bach, Symmonds, Barnes & Dolan, 2017; Varoquaux et al., 2018) and explains improved functional connectivity results when combining bivariate and partial linear dependence measures (Sanchez-Romero & Cole, 2020). Problems such as indirect measurements and varying temporal delays complicate causal Bayesian network approaches for fMRI (Ramsey et al., 2010; Mumford & Ramsey, 2014). Smith et al. (2011) present a simulation study evaluating several methods for estimating brain networks from fMRI data and demonstrate that identifying the direction of network links is difficult. The discourse on how to leverage connectivity analyses to understand mechanisms in brain networks is ongoing (Valdes-Sosa, Roebroeck, Daunizeau & Friston, 2011; Waldorp, Christoffels & van de Ven, 2011; Smith, 2012; Mill, Ito & Cole, 2017). Many of the above problems and findings are related to the two key challenges that we discuss in Section 3.
We begin Section 2 by formally introducing causal concepts. In Section 2.1, we outline why we believe there is a need for causal models in cognitive neuroscience by considering what types of questions could be answered by an OraCle Modelling (OCM) approach. We discuss the problem of models that are observationally equivalent yet make conflicting predictions about the effects of interventions in Section 2.2. In Section 2.3, we review different causal discovery methods and their underlying assumptions. We focus on two challenges for causality in cognitive neuroscience that are expounded in Section 3: (1) the scarcity of interventional data and (2) the challenge of finding the right variables. In Section 4, we argue that one should seek distributionally robust variable representations and models to tackle these challenges. Most of our arguments in this work are presented in an i.i.d. setting and we briefly discuss the implications for time-dependent data in Section 4.5. We conclude in Section 5 and outline ideas that we regard as promising for future research.

[Figure 1: Illustration of two scenarios in cognitive neuroscience where we seek a causal explanation focusing on a target variable Y that either resembles (a) a cognitive function (Running Example A), or (b) a neuronal entity (Running Example B). The variables T, X = [X_1, ..., X_d]^T, and H represent treatment, measurements of neuronal activity, and an unobserved variable, respectively. Edges whose presence is undetermined are marked with '?'.]

2 Causal concepts

In contrast to classical probabilistic models, causal models induce not only an observational distribution but also a set of so-called interventional distributions. That is, they predict how a system reacts under interventions. We present an introduction to causal models that is based on pioneering work by Pearl (2009) and Spirtes, Glymour and Scheines (2001). Our exposition is inspired by Weichwald (2019, Chapter 2), which provides more introductory intuition into causal models viewed as structured sets of interventional distributions. For both simplicity and focus of exposition, we omit a discussion of counterfactual reasoning and of other akin causality frameworks such as the potential outcomes formulation of causality (Imbens & Rubin, 2015). We phrase this article within the framework and terminology of Structural Causal Models (SCMs) (Bollen, 1989; Pearl, 2009).

An SCM over variables Z = [Z_1, ..., Z_d]^T consists of structural equations that relate each variable Z_k to its parents PA(Z_k) ⊆ {Z_1, ..., Z_d} and a noise variable N_k via a function f_k such that

    Z_k := f_k(PA(Z_k), N_k),

and a noise distribution P_N of the noise variables N = [N_1, ..., N_d]^T. We associate each SCM with a directed causal graph where the nodes correspond to the variables Z_1, ..., Z_d and we draw an edge from Z_i to Z_j whenever Z_i appears on the right-hand side of the equation Z_j := f_j(PA(Z_j), N_j). That is, if Z_i ∈ PA(Z_j) the graph contains the edge Z_i → Z_j. Here, we assume that this graph is acyclic. The structural equations and noise distributions together induce the observational distribution P_Z of Z_1, ..., Z_d as the simultaneous solution to the equations. (Bongers, Peters, Schölkopf and Mooij (2018) formally define SCMs when the graph includes cycles.)

The following is an example of a linear Gaussian SCM:

    Z_1 := f_1(PA(Z_1), N_1) = Z_2 + N_1
    Z_2 := f_2(PA(Z_2), N_2) = N_2
    Z_3 := f_3(PA(Z_3), N_3) = Z_1 + 2·Z_2 + N_3

with mutually independent standard-normal noise variables N_1, N_2, N_3.
The corresponding graph is Z_2 → Z_1 → Z_3 together with the edge Z_2 → Z_3. The structural equations and noise distributions induce the observational distribution P_Z, which is the multivariate Gaussian distribution

\[ \begin{pmatrix} Z_1 \\ Z_2 \\ Z_3 \end{pmatrix} \sim P_Z = \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 & 1 & 4 \\ 1 & 1 & 3 \\ 4 & 3 & 11 \end{pmatrix} \right). \tag{1} \]

In addition to the observational distribution, an SCM induces interventional distributions. Each intervention denotes a scenario in which we fix a certain subset of the variables to certain values. For example, the intervention do(Z_2 := 0, Z_3 := 5) denotes the scenario where we force Z_2 and Z_3 to take on the values 0 and 5, respectively. The interventional distributions are obtained by (a) replacing the structural equations of the intervened-upon variables by the new assignments, and (b) considering the distribution induced by the thus obtained new set of structural equations. For example, the distribution under intervention do(Z_1 := a) for a ∈ R, denoted by P_Z^{do(Z_1 := a)}, is obtained by changing the equation Z_1 := f_1(PA(Z_1), N_1) to Z_1 := a. In the above example, we find

\[ P_Z^{do(Z_1 := a)} = \mathcal{N}\left( \begin{pmatrix} a \\ 0 \\ a \end{pmatrix}, \begin{pmatrix} 0 & 0 & 0 \\ 0 & 1 & 2 \\ 0 & 2 & 5 \end{pmatrix} \right), \]

where X ~ N(a, 0) if and only if P(X = a) = 1. Analogously, for b ∈ R and an intervention on Z_2 we have

\[ P_Z^{do(Z_2 := b)} = \mathcal{N}\left( \begin{pmatrix} b \\ b \\ 3b \end{pmatrix}, \begin{pmatrix} 1 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 2 \end{pmatrix} \right). \]

The distribution of Z_1 differs between the observational distribution and the interventional distribution, that is, P_{Z_1} ≠ P_{Z_1}^{do(Z_2 := b)}. We call a variable X an (indirect) cause of a variable Y if there exists an intervention on X under which the distribution of Y is different from its distribution in the observational setting. Thus, Z_2 is a cause of Z_1. The edge Z_2 → Z_1 in the above causal graph reflects this cause-effect relationship. In contrast, Z_2 remains standard-normally distributed under all interventions do(Z_1 := a) on Z_1. Because the distribution of Z_2 remains unchanged under any intervention on Z_1, Z_1 is not a cause of Z_2.

In general, interventional distributions do not coincide with the corresponding conditional distributions. In our example we have P_{Z_2 | Z_1 = a} ≠ P_{Z_2}^{do(Z_1 := a)} while P_{Z_1 | Z_2 = b} = P_{Z_1}^{do(Z_2 := b)}. We further have that the conditional distribution P_{Z_3 | Z_1, Z_2} of Z_3 given its parents Z_1 and Z_2 is invariant under interventions on variables other than Z_3. We call a model of Z_3 based on Z_1, Z_2 invariant (cf. Section 4.1).
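The induced distributions can be checked numerically. The following minimal Python sketch (ours, for illustration; not part of the original article) samples from the example SCM and from an interventional counterpart; the empirical moments approach the Gaussian distributions stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

def sample(do_z1=None, do_z2=None):
    """Sample the example SCM; do_z1/do_z2 implement interventions."""
    n1, n2, n3 = rng.standard_normal((3, n))
    z2 = n2 if do_z2 is None else np.full(n, do_z2)
    z1 = z2 + n1 if do_z1 is None else np.full(n, do_z1)
    z3 = z1 + 2 * z2 + n3
    return np.stack([z1, z2, z3])

obs = sample()
print(np.cov(obs).round(1))        # approx [[2,1,4],[1,1,3],[4,3,11]], cf. Eq. (1)

intv = sample(do_z2=1.0)           # intervention do(Z2 := 1)
print(intv.mean(axis=1).round(1))  # approx [1, 1, 3], cf. the do(Z2 := b) display
```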
We have demonstrated how an SCM induces a set of observational and interventional distributions. The interventional distributions predict observations of the system upon intervening on some of its variables. As such, a causal model holds additional content compared to a common probabilistic model, which amounts to one distribution describing future observations of the same unchanged system. Sometimes we are only interested in modelling certain interventions or cannot perform others as there may be no well-defined corresponding real-world implementation. For example, we cannot intervene on a person's gender. In these cases it may be helpful to explicitly restrict ourselves to a set of interventions of interest. Furthermore, the choice of an intervention set puts constraints on the granularity of the model (cf. Section 3.2 and Rubenstein & Weichwald et al., 2017).

2.1 The need for causal models

We do not always need causal models to answer our research question. For some scientific questions it suffices to consider probabilistic, that is, observational models. For example, if we wish to develop an algorithm for early diagnosis of Alzheimer's disease from brain scans, we need to model the conditional distribution of Alzheimer's disease given brain activity. Since this can be computed from the joint distribution, a probabilistic model suffices. If, however, we wish to obtain an understanding that allows us to optimally prevent progression of Alzheimer's disease by, for example, cognitive training or brain stimulation, we are in fact interested in a causal understanding of the Alzheimer's disease and require a causal model. Distinguishing between these types of questions is important as it informs us about the methods we need to employ in order to answer the question at hand. To elaborate upon this distinction, we now discuss scenarios related to our running examples and the relationship between alcohol consumption and reaction time (cf. Section 1.1). Assume we have access to a powerful OraCle Modelling (OCM) machinery that is unaffected by statistical problems such as model misspecification, multiple testing, or small sample sizes.
By asking ourselves what queries must be answered by OCM for us to 'understand' the cognitive function, the difference between causal and non-causal questions becomes apparent.

Assume, firstly, that we ran the reaction task experiment with multiple subjects, fed all observations to our OCM machinery, and have Kim visiting our lab today. Since OCM yields us the exact conditional distribution of reaction times P_{Y | T = t} for Kim having consumed T = t units of alcoholic beer, we may be willing to bet against our colleagues on how Kim will perform in the reaction task experiment they are just about to participate in. No causal model for brain activity is necessary.

Assume, secondly, that we additionally record BOLD responses X = [X_1, ..., X_d]^T at certain locations and times during the reaction task experiment. We can query OCM for the distribution of BOLD signals that we are about to record, that is, P_{X | T = t}, or the distribution of reaction times given we measure Kim's BOLD responses X = x, that is, P_{Y | T = t, X = x}. As before, we may bet against our colleagues on what Kim's BOLD signals will look like in the upcoming reaction task experiment, or bet on their reaction time once we have observed the BOLD activity X = x prior to a reaction cue. Again, no causal model for brain activity is required.

In both of the above situations, we have learned something useful. Given that the data were obtained in an experiment in which alcohol consumption was randomised, we have learned, in the first situation, to predict reaction times after an intervention on alcohol consumption. This may be considered an operational model for alcohol consumption and reaction time. In the second situation, we have learned how the BOLD signal responds to alcohol consumption. Yet, in none of the above situations have we gained understanding of the neuronal underpinnings of the cognitive function and the reaction times. Knowing the conditional distributions P_{Y | T = t} and P_{Y | T = t, X = x} for any t yields no insight into any of the following questions. Which brain regions maintain fast reaction times? Where in the brain should we release drugs that excite neuronal activity in order to counterbalance the effect of alcohol? How do we need to update our prediction if we learn that Kim just took a new drug that lowers blood pressure in the prefrontal cortex? To answer such questions, we require causal understanding.

If we had a causal model, say in the form of an SCM, we could address the above questions. An SCM offers an explicit way to model the system under manipulations. Therefore, a causal model can help to answer questions about where to release an excitatory drug. It may enable us to predict whether medication that lowers blood pressure in the prefrontal cortex will affect Kim's reaction time; in general, this is the case if the corresponding variables appear in the structural equations for Y or any of Y's ancestors.

Instead of identifying conditional distributions, one may formulate the problem as a regression task with the aim to learn the conditional mean functions t ↦ E[X | T = t] and (t, x) ↦ E[Y | T = t, X = x].
These functions are then parameterised in terms of t, or of t and x. We argue in Section 2.2, point (2), that such parameters do not carry a causal meaning and thus do not help to answer the questions above.

Promoted by slogans such as 'correlation does not imply causation', careful and associational language is sometimes used in the presentation of cognitive neuroscience studies. We believe, however, that a clear language is needed that states whether a model should be interpreted causally (that is, as an interventional model) or non-causally (that is, as an observational model). This will help to clarify both the real-world processes the model can be used for and the purported scientific claims.

Furthermore, causal models may generalise better than non-causal models. We expect systematic differences between subjects and between different trials or recording days of the same subject. These different situations, or environments, are presumably not arbitrarily different. If they were, we could not hope to gain any scientific insight from such experiments. The apparent question is which parts of the model we can expect to generalise between environments. It is well known that causal models capture one such invariance property, which is implicit in the definition of interventions. An intervention on one variable leaves the assignments of the other variables unaffected. Therefore, the conditional distributions of these other variables, given their parents, are also unaffected by the intervention (Haavelmo, 1944; Aldrich, 1989). Thus, causal models may enable us to formulate more clearly which mechanisms we assume to be invariant between subjects. For example, we may assume that the mechanism by which alcohol intake affects brain activity differs between subjects, whereas the mechanism from signals in certain brain regions to reaction time is invariant. We discuss the connection between causality and robustness in Section 4.

2.2 Equivalence of causal models

Causal models entail strictly more information than observational models. We now introduce the notion of equivalence of models (Pearl, 2009; Peters, Janzing & Schölkopf, 2017; Bongers et al., 2018). This notion allows us to discuss the falsifiability of causal models, which is important when assessing candidate models and their ability to capture the cause-effect relationships that govern a cognitive process under investigation.

We call two models observationally equivalent if they induce the same observational distribution. Two models are said to be interventionally equivalent if they induce the same observational and interventional distributions. As discussed above, for some interventions there may not be a well-defined corresponding experiment in the real world. We therefore also consider interventional equivalence with respect to a restricted set of interventions.

One reason why learning causal models from observational data is difficult is the existence of models that are observationally but not interventionally equivalent. Such models agree in their predictions about the observed system yet disagree in their predictions about the effects of certain interventions. We continue the example from Section 2 and consider the following two SCMs:

    Z_1 := Z_2 + N_1                Z_1 := √2 · N_1
    Z_2 := N_2                      Z_2 := 1/2 · Z_1 + 1/√2 · N_2
    Z_3 := Z_1 + 2·Z_2 + N_3        Z_3 := Z_1 + 2·Z_2 + N_3

where in both cases N_1, N_2, N_3 are mutually independent standard-normal noise variables. The two SCMs are observationally equivalent as they induce the same observational distribution, the one shown in Equation (1). The models are not interventionally equivalent, however, since P_{Z_2}^{do(Z_1 := 1)} = N(0, 1) and P_{Z_2}^{do(Z_1 := 1)} = N(1/2, 1/2) for the left and the right model, respectively. The two models can be told apart when interventions on Z_1 or Z_2 are considered. They are interventionally equivalent with respect to interventions on Z_3.
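This pair of models can again be examined numerically. The sketch below (ours, for illustration) samples both SCMs observationally and under do(Z_1 := 1); the covariance matrices agree while the interventional distributions of Z_2 differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

def left(do_z1=None):
    n1, n2, n3 = rng.standard_normal((3, n))
    z2 = n2
    z1 = z2 + n1 if do_z1 is None else np.full(n, do_z1)
    return z1, z2, z1 + 2 * z2 + n3

def right(do_z1=None):
    n1, n2, n3 = rng.standard_normal((3, n))
    z1 = np.sqrt(2) * n1 if do_z1 is None else np.full(n, do_z1)
    z2 = z1 / 2 + n2 / np.sqrt(2)      # here Z2 depends on Z1
    return z1, z2, z1 + 2 * z2 + n3

# Observationally equivalent: both match the covariance in Equation (1).
print(np.cov(left()).round(1))
print(np.cov(right()).round(1))

# Not interventionally equivalent: mean of Z2 under do(Z1 := 1).
print(left(do_z1=1.0)[1].mean().round(2))    # approx 0.0
print(right(do_z1=1.0)[1].mean().round(2))   # approx 0.5
```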
The existence of observationally equivalent models that are not interventionally equivalent has several implications. (1) Without assumptions, it is impossible to learn causal structure from observational data. This is not exclusive to causal inference from data and an analogous statement holds true for regression (Györfi, Kohler, Krzyżak & Walk, 2002). The regression problem is solvable only under certain simplicity assumptions, for example, on the smoothness of the regression function, which have proven useful in real-world applications. Similarly, there are several assumptions that can be exploited for causal discovery. We discuss some of these assumptions in Section 2.3. (2) As a consequence, without further restrictive assumptions on the data generating process, the estimated parameters do not carry any causal meaning. For example, given any finite sample from the observational distribution, both of the above SCMs yield exactly the same likelihood. Therefore, the above structures cannot be told apart by a method that employs the maximum likelihood estimation principle. Instead, which SCM and thus which parameters are selected in such a situation may depend on starting values, optimisation technique, or numerical precision. (3) Assume that we are given a probabilistic (observational) model of a data generating process. To falsify it, we may apply a goodness-of-fit test based on an observational sample from that process. An interventional model, in contrast, cannot be falsified based on observational data alone and one has to also take into account the outcomes of interventional experiments. This requires that we are in agreement about how to perform the intervention in practice (see also Section 3.2). Interventional data may be crucial in particular for rejecting some of the observationally equivalent models (cf. the example above). The scarcity of interventional data therefore poses a challenge for causality in cognitive neuroscience (cf. Section 3.1).

2.3 Causal discovery

The task of learning a causal model from observational (or a combination of observational and interventional) data is commonly referred to as causal discovery or causal structure learning. We have argued in the preceding section that causal discovery from purely observational data is impossible without any additional assumptions or background knowledge. In this section, we discuss several assumptions that render (parts of) the causal structure identifiable from the observational distribution. In short, assumptions concern how causal links manifest in observable statistical dependences, functional forms of the mechanisms, certain invariances under interventions, or the order of time. We briefly outline how these assumptions can be exploited in algorithms. Depending on the application at hand, one may be interested in learning the full causal structure as represented by its graph or in identifying a local structure such as the causes of a target variable Y. The methods described below cover either of the two cases. We keep the description brief, focussing on the main ideas and intuition, while more details can be found in the respective references.
Randomisation. The oft-called 'gold standard' for establishing whether T causes Y is to introduce controlled perturbations, that is, targeted interventions, to a system. Without randomisation, a dependence between T and Y could stem from a confounder between T and Y or from a causal link from Y to T. If T is randomised, it is no longer governed by the outcome of any other variable or mechanism. Instead, it only depends on the outcome of a randomisation experiment, such as the roll of a die. If we observe that, under the randomisation, Y depends on T, say the higher T the higher Y, then there must be a (possibly indirect) causal influence from T to Y. In our running examples, this allows us to conclude that the amount of alcoholic beer consumed causes reaction times (cf. Section 1.1). When falsifying interventional models, it suffices to consider randomised experiments as interventions (Peters et al., 2017, Proposition 6.48). In practice, however, performing randomised experiments is often infeasible due to cost or ethical concerns, or impossible as, for example, we can neither randomise gender nor fully control neuronal activity in the temporal lobe. While it is sometimes argued that the experiment conducted by James Lind in 1747 to identify a treatment for scurvy is among the first randomised controlled trials, the mathematical theory and methodology were popularised by Ronald A. Fisher in the early 20th century (Conniffe, 1991). A simulation illustrating why randomisation breaks confounding is sketched below.
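The following minimal simulation (ours, not from the article) contrasts an observational regime, in which a hidden confounder H drives both T and Y, with a randomised regime in which T is assigned externally. We set the causal effect of T on Y to zero, so any observational correlation is entirely due to confounding and vanishes under randomisation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def simulate(randomise_t):
    h = rng.standard_normal(n)              # hidden confounder
    if randomise_t:
        t = rng.standard_normal(n)          # randomised treatment
    else:
        t = h + rng.standard_normal(n)      # treatment driven by H
    y = 2 * h + rng.standard_normal(n)      # T has no effect on Y
    return np.corrcoef(t, y)[0, 1]

print(simulate(randomise_t=False))   # clearly non-zero: confounding
print(simulate(randomise_t=True))    # approx 0: no causal effect of T on Y
```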
Constraint-based methods rely on two assumptions that connect properties of the causal graph with conditional independence statements in the induced distribution. The essence of the first assumption is sometimes described as Reichenbach's common cause principle (Reichenbach, 1956): if X and Y are dependent, then there must be some cause-effect structure that explains the observed dependence, that is, either X causes Y, or Y causes X, or another unobserved variable H causes both X and Y, or some combination of the aforementioned. This principle is formalised by the Markov condition (see, for example, Lauritzen, 1996). This assumption is considered to be mild. Any distribution induced by an acyclic SCM satisfies the Markov condition with respect to the corresponding graph (Lauritzen, Dawid, Larsen & Leimer, 1990; Pearl, 2009). The second assumption, often referred to as faithfulness, states that any (conditional) independence between random variables is implied by the graph structure (Spirtes et al., 2001). For example, if two variables are independent, then neither causes the other nor do they share a common cause. Both assumptions together establish a one-to-one correspondence between conditional independences in the distribution and graphical separation properties between the corresponding nodes.

The backbone of constraint-based causal discovery algorithms such as the PC algorithm is to test for marginal and conditional (in)dependences in observed data and to find all graphs that encode the same list of separation statements (Spirtes et al., 2001; Pearl, 2009). This allows us to infer a so-called Markov equivalence class of graphs: all of its members encode the same set of conditional independences. It has been shown that two directed acyclic graphs (assuming that all nodes are observed) are Markov equivalent if and only if they have the same skeleton and the same v-structures → ◦ ← (Verma & Pearl, 1990). Allowing for hidden variables, as done by the FCI algorithm, for example, enlarges the class of equivalent graphs and the output is usually less informative (Spirtes et al., 2001).

The following example further illustrates the idea of a constraint-based search; a numerical sketch follows below. For simplicity, we assume a linear Gaussian setting, so that (conditional) independence coincides with vanishing (partial) correlation. Say we observe X, Y, and Z. Assume that the partial correlation between X and Z given Y vanishes while none of the other correlations and partial correlations vanish. Under the Markov and faithfulness assumptions there are multiple causal structures that are compatible with those constraints, such as X → Y → Z, X ← Y ← Z, X ← Y → Z, or structures involving an unobserved variable H, for example X ← H → Y → Z or X ← Y ← H → Z. Still, the correlation pattern rules out certain other causal structures. For example, neither X → Y ← Z nor X ← H → Y ← Z can be the correct graph structure since either case would imply that X and Z are uncorrelated (and X ⊥⊥ Z | Y is not satisfied).
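As an illustration (ours, assuming the chain X → Y → Z as ground truth), the following sketch estimates the marginal correlation of X and Z and their partial correlation given Y; only the latter vanishes, consistent with the separation statement X ⊥⊥ Z | Y.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Assumed ground truth: the chain X -> Y -> Z.
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)
z = y + rng.standard_normal(n)

def partial_corr(a, b, given):
    """Correlation of residuals after linearly regressing out `given`."""
    res_a = a - np.polyval(np.polyfit(given, a, 1), given)
    res_b = b - np.polyval(np.polyfit(given, b, 1), given)
    return np.corrcoef(res_a, res_b)[0, 1]

print(np.corrcoef(x, z)[0, 1])   # clearly non-zero
print(partial_corr(x, z, y))     # approx 0: X independent of Z given Y
```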
Variants of the above setting were considered in neuroimaging where a randomised experimental stimulus or time-ordering was used to further disambiguate between the remaining possible structures (Grosse-Wentrup, Janzing, Siegel & Schölkopf, 2016; Weichwald, Gretton, Schölkopf & Grosse-Wentrup, 2016a; Weichwald, Grosse-Wentrup & Gretton, 2016b; Mastakouri, Schölkopf & Janzing, 2019). Constraint-based causal inference methodology also clarifies the interpretation of encoding and decoding analyses in neuroimaging and has informed a refined understanding of the neural dynamics of probabilistic reward prediction and an improved functional atlas (Weichwald et al., 2015; Bach et al., 2017; Varoquaux et al., 2018).

Direct applications of this approach in cognitive neuroscience are difficult, not only due to the key challenges discussed in Section 3, but also due to indirect and spatially smeared neuroimaging measurements that effectively spoil conditional independences. In the linear setting, there are recent advances that explicitly tackle the problem of inferring the causal structure between latent variables, say the neuronal entities, based on observations of recorded variables (Silva, Scheines, Glymour & Spirtes, 2006). Further practical challenges include the difficulty of testing for non-parametric conditional independence (Shah & Peters, 2020) and near-faithfulness violations (Uhler, Raskutti, Bühlmann & Yu, 2013).

Score-based methods. Instead of directly exploiting the (conditional) independences to inform our inference about the causal graph structure, score-based methods assess different graph structures by their ability to fit observed data (see, for example, Chickering, 2002). This approach is motivated by the idea that graph structures which encode the wrong (conditional) independences will also result in bad model fit. Assuming a parametric model class, we can evaluate the log-likelihood of the data and score different candidate graph structures by the Bayesian Information Criterion, for example. The number of possible graph structures to search over grows super-exponentially in the number of variables. That combinatorial difficulty can be dealt with by applying greedy search procedures which, however, usually do not come with finite-sample guarantees. Alternatively, Zheng, Dan, Aragam, Ravikumar and Xing (2020) exploit an algebraic characterisation of graph structures to maximise a score over acyclic graphs by solving a continuous optimisation problem. The score-based approach relies on correctly specifying the model class. Furthermore, in the presence of hidden variables, the search space grows even larger and model scoring is complicated by the need to marginalise over those hidden variables (Jabbari, Ramsey, Spirtes & Cooper, 2017).
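As a toy illustration of scoring (ours; linear Gaussian models scored by BIC), the sketch below fits the two Markov-equivalent chains X → Y → Z and X ← Y ← Z as well as the collider X → Y ← Z to data generated from the first chain. The collider, which encodes the wrong independences, scores worse; the two chains score (almost) identically, which also illustrates that scoring cannot distinguish members of the same Markov equivalence class.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5_000
data = {}
data["x"] = rng.standard_normal(n)
data["y"] = data["x"] + rng.standard_normal(n)
data["z"] = data["y"] + rng.standard_normal(n)

def node_loglik(child, parents):
    """Gaussian log-likelihood of `child` regressed on its parents."""
    c = data[child]
    if parents:
        design = np.column_stack([np.ones(n)] + [data[p] for p in parents])
        beta, *_ = np.linalg.lstsq(design, c, rcond=None)
        resid = c - design @ beta
    else:
        resid = c - c.mean()
    return -0.5 * n * (np.log(2 * np.pi * resid.var()) + 1)

def bic(graph):
    """`graph` maps each node to its parent list; BIC = -2 logL + k log n."""
    ll = sum(node_loglik(v, ps) for v, ps in graph.items())
    k = sum(len(ps) + 2 for ps in graph.values())  # weights, intercept, variance
    return -2 * ll + k * np.log(n)

chain    = {"x": [], "y": ["x"], "z": ["y"]}      # X -> Y -> Z
rchain   = {"z": [], "y": ["z"], "x": ["y"]}      # X <- Y <- Z
collider = {"x": [], "z": [], "y": ["x", "z"]}    # X -> Y <- Z
for g in (chain, rchain, collider):
    print(bic(g))   # the two chains tie; the collider scores worse
```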
Restricted structural causal models.
Another possibility is to restrict the class of functions in the structural assignments and the noise distributions. Linear non-Gaussian acyclic models (Shimizu, Hoyer, Hyvärinen & Kerminen, 2006), for example, assume that the structural assignments are linear and the noise distributions are non-Gaussian. As for independent component analysis, identifiability of the causal graph follows from the Darmois-Skitovich theorem (Darmois, 1953; Skitovič, 1962). Similar results hold for non-linear models with additive noise (Hoyer, Janzing, Mooij, Peters & Schölkopf, 2008; Zhang & Hyvärinen, 2009; Peters, Mooij, Janzing & Schölkopf, 2014; Bühlmann, Peters & Ernest, 2014) or linear Gaussian models when the error variances of the different variables are assumed to be equal (Peters & Bühlmann, 2014). The additive noise assumption is a powerful, yet restrictive, assumption that may be violated in practical applications.
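A minimal sketch (ours) of the idea behind such restricted model classes: with a linear relation and non-Gaussian (here uniform) noise, regressing in the causal direction leaves residuals that are independent of the input, whereas the anti-causal regression does not. We proxy the independence check by the correlation between absolute values, which suffices in this toy setting; a proper analysis would use a dedicated independence test such as HSIC.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Assumed ground truth: linear model X -> Y with uniform (non-Gaussian) noise.
x = rng.uniform(-1, 1, n)
y = x + rng.uniform(-1, 1, n)

def residual_dependence(cause, effect):
    """Regress `effect` on `cause`; return a crude dependence score between
    the residual and the regressor. Under the true causal direction the
    residual equals the independent noise term, so the score is near zero."""
    slope = np.cov(cause, effect)[0, 1] / cause.var()
    resid = effect - slope * cause
    return abs(np.corrcoef(np.abs(resid), np.abs(cause))[0, 1])

print(residual_dependence(x, y))   # approx 0: consistent with X -> Y
print(residual_dependence(y, x))   # clearly non-zero: rejects Y -> X
```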
Dynamic causal modelling (DCM).
We may have prior beliefs about the existence and direction of some of the edges. Incorporating these by careful specification of the priors is an explicit modelling step in DCM (Valdes-Sosa et al., 2011). Given such a prior, we may prefer one model over the other among the two observationally equivalent models presented in Section 2.2, for example. Since the method's outcome relies on this prior information, any disagreement on the validity of that prior information necessarily yields a discourse about the method's outcome (Lohmann, Erfurth, Müller & Turner, 2012). Further, a simulation study raised concerns regarding the validity of the model selection procedure in DCM (Friston, Harrison & Penny, 2003; Lohmann et al., 2012; Friston, Daunizeau & Stephan, 2013; Breakspear, 2013; Lohmann, Müller & Turner, 2013).
Granger causality.
Granger causality is among the most popular approaches for the analysis of connectivity between time-evolving processes. It exploits the existence of time and the fact that causes precede their effects. Together with its non-linear extensions, it has been considered for the analysis of neuroimaging data with applications to electroencephalography (EEG) and fMRI data (Marinazzo, Pellicoro & Stramaglia, 2008; Marinazzo, Liao, Chen & Stramaglia, 2011; Stramaglia, Wu, Pellicoro & Marinazzo, 2012; Stramaglia, Cortes & Marinazzo, 2014). The idea is sometimes wrongly described as follows: if including the past of Y_t improves our prediction of X_t compared to a prediction that is based on the past of X_t alone, then Y Granger-causes X. Granger (1969) himself put forward a more careful definition that includes a reference to all the information in the universe: if the prediction of X_t based on all the information in the universe up to time t is better than the prediction where we use all the information in the universe up to time t apart from the past of Y_t, then Y Granger-causes X. In practice, we may instead resort to a multivariate formulation of Granger causality. If all relevant variables are observed (often referred to as causal sufficiency), there is a close correspondence between Granger causality and the constraint-based approach (Peters et al., 2017, Chapter 10.3.3). Observing all relevant variables, however, is a strong assumption which is most likely violated for data sets in cognitive neuroscience. While Granger causality may be combined with a goodness-of-fit test to at least partially detect the existence of confounders (Peters, Janzing & Schölkopf, 2013), it is commonly applied as a computationally efficient black-box approach that always outputs a result. In the presence of instantaneous effects (for example, due to undersampling) or hidden variables, these results may be erroneous (see, for example, Sanchez-Romero et al., 2019).
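For concreteness, here is a minimal bivariate sketch (ours) of the pragmatic formulation criticised above: it compares the residual variance of an autoregression of X on its own past with and without the past of Y. The simulated system has Y driving X with a one-step lag.

```python
import numpy as np

rng = np.random.default_rng(6)
T = 10_000

# Assumed ground truth: Y_t drives X_{t+1}; X does not drive Y.
y = np.zeros(T)
x = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + rng.standard_normal()
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.standard_normal()

def residual_var(target, lagged_regressors):
    """Residual variance of an OLS regression on lagged regressors."""
    design = np.column_stack([np.ones(T - 1)] + lagged_regressors)
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return (target - design @ beta).var()

# Does the past of Y improve the prediction of X beyond X's own past?
restricted = residual_var(x[1:], [x[:-1]])
full = residual_var(x[1:], [x[:-1], y[:-1]])
print(restricted, full)   # full << restricted: Y Granger-causes X
# The reverse comparison shows no comparable improvement for predicting Y.
```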
Inferring causes of a target variable. We now consider a problem that is arguably simpler than inferring the full causal graph: identifying the causes of some target variable of interest. As outlined in the running examples in Section 1.1, we assume that we have observations of the variables T, Y, X_1, ..., X_d, where Y denotes the target variable. Assume that there is an unknown structural causal model that includes the variables T, Y, X_1, ..., X_d and that describes the data generating process well. To identify the variables among X_1, ..., X_d that cause Y, it does not suffice to regress Y on X_1, ..., X_d. The following example of an SCM shows that a good predictive model for Y is not necessarily a good interventional model for Y (a simulated version follows below). Consider

    X_1 := N_1
    Y   := X_1 + N_Y
    X_2 := 2·Y + N_{X_2}

with graph X_1 → Y → X_2, where N_1, N_{X_2}, N_Y are mutually independent standard-normal noise variables. X_2 is a good predictor for Y, but X_2 does not have any causal influence on Y: the distribution of Y is unchanged upon interventions on X_2.

Recently, causal discovery methods have been proposed that aim to infer the causal parents of Y if we are given data from different environments, that is, from different experimental conditions, repetitions, or different subjects. These methods exploit a distributional robustness property of causal models and are described in Section 4.
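A simulation (ours) of this SCM, with the coefficient in the assignment of X_2 chosen as 2 for illustration: X_2 predicts Y even better than the actual cause X_1 does, yet intervening on X_2 leaves the distribution of Y untouched.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

def sample(do_x2=None):
    x1 = rng.standard_normal(n)
    y = x1 + rng.standard_normal(n)
    x2 = 2 * y + rng.standard_normal(n) if do_x2 is None else np.full(n, do_x2)
    return x1, y, x2

x1, y, x2 = sample()
print(np.corrcoef(x1, y)[0, 1])   # approx 0.71: the cause predicts Y
print(np.corrcoef(x2, y)[0, 1])   # approx 0.94: the effect predicts Y better

_, y_do, _ = sample(do_x2=10.0)   # intervene on X2
print(y.mean(), y_do.mean())      # both approx 0: Y unaffected by do(X2)
```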
Cognitive function versus brain activity as the target variable. When we are interested in inferring direct causes of a target variable Y, it can be useful to include background knowledge. Consider our Running Example A (cf. Section 1.1 and Figure 1a) with reaction time as the target variable and assume we are interested in inferring which of the variables measuring neuronal activity are causal for the reaction time Y. We have argued in the preceding paragraph that if a variable X_j is predictive of Y, it does not necessarily have to be causal for Y. Assuming, however, that we can exclude that the cognitive function 'reaction time' causes brain activity (for example, because of time ordering), we obtain the following simplification: every X_j that is predictive of Y must be an indirect or direct cause of Y, confounded with Y, or a combination of both. This is different if our target variable is a neuronal entity as in Running Example B (cf. Figure 1b). Here, predictive variables can be either ancestors of Y, confounded with Y, descendants of Y, or some combination of the aforementioned (these statements follow from the Markov condition).

3 Two challenges for causal inference in cognitive neuroscience

Performing causal inference on measurements of neuronal activity comes with several challenges, many of which have been discussed in the literature (cf. Section 1.2). In the following two subsections we explicate two challenges that we think deserve special attention. In Section 4, we elaborate on how distributional robustness across environments, such as different recording sessions or subjects, can serve as a guiding principle for tackling those challenges.
3.1 Scarcity of interventional data

In Section 2.2 we discussed that different causal models may induce the same observational distribution while making different predictions about the effects of interventions. That is, observationally equivalent models need not be interventionally equivalent. This implies that some models can only be refuted when we observe the system under interventions which perturb some specific variables in our model. In contrast to broad perturbations of the system, we call targeted interventions those for which the intervention target is known and for which we can list the intervened-upon variables in our model, say "X_1, X_2, X_3 have been intervened upon." Even if some targeted interventions are available, there may still be multiple models that are compatible with all observations obtained under those available interventions. In the worst case, a sequence of up to d targeted interventional experiments may be required to distinguish between the possible causal structures over d observables X_1, ..., X_d when the existence of unobserved variables cannot be excluded while assuming Markovianity, faithfulness, and acyclicity (Eberhardt, 2013). In general, the more interventional scenarios are available to us, the more causal models we can falsify and the further we can narrow down the set of causal models compatible with the data.

Therefore, the scarcity of targeted interventional data is a barrier to causal inference in cognitive neuroscience. Our ability to intervene on neural entities such as the BOLD level or oscillatory bandpower in a brain region is limited, and so is our ability to either identify the right causal model from interventional data or to test causal hypotheses that are made in the literature. One promising avenue is non-invasive brain stimulation techniques such as transcranial magnetic or direct/alternating current stimulation, which modulate neural activity by creating a field inside the brain (Nitsche et al., 2008; Herrmann, Rach, Neuling & Strüber, 2013; Bestmann & Walsh, 2017; Kar, Ito, Cole & Krekelberg, 2020). Since the stimulation acts broadly and its neurophysiological effects are not yet fully understood, transcranial stimulation cannot be understood as a targeted intervention on some specific neuronal entity in our causal model (Antal & Herrmann, 2016; Vosskuhl, Strüber & Herrmann, 2018). The inter-individual variability in response to stimulation further impedes its direct use for probing causal pathways between brain regions (López-Alonso, Cheeran, Rio-Rodriguez & Fernández-del-Olmo, 2014). Bergmann and Hartwigsen (2020) review the obstacles to inferring causality from non-invasive brain stimulation studies and provide guidelines to attenuate the aforementioned. Invasive stimulation techniques, such as deep brain stimulation relying on electrode implants (Mayberg et al., 2005), may enable temporally and spatially more fine-grained perturbations of neural entities. Dubois et al. (2017) exemplify how to revise causal structures inferred from observational neuroimaging data on a larger cohort through direct stimulation of specific brain regions and concurrent fMRI on a smaller cohort of neurosurgical epilepsy patients. In non-human primates, concurrent optogenetic stimulation with whole-brain fMRI has been used to map the wiring of the medial prefrontal cortex (Liang et al., 2015; Lee et al., 2010).
Yet, there are ethical barriers to large-scale invasive brain stimulation studies and it may not be exactly clear how an invasive stimulation corresponds to an intervention on, say, the BOLD response measured in some voxels. We thus believe that targeted interventional data will remain scarce due to the physical and ethical limits to non-invasive and invasive brain stimulation.

Consider the following variant of our Running Example B (cf. Section 1.1). Assume that (a) the consumption of alcoholic beer T slows neuronal activity in the brain regions X_1, X_2, and Y, (b) X_1 is a cause of X_2, and (c) X_1 is a cause of Y. Here, (a) could have been established by randomising T, whereas (b) and (c) may be background knowledge. Nothing is known, however, about the causal relationship between X_2 and Y (apart from the confounding effect of X_1). The corresponding graph contains the edges from T to X_1, X_2, and Y, the edges X_1 → X_2 and X_1 → Y, and a relationship between X_2 and Y that remains undetermined (marked '?').

Assume we establish on observational data that there is a dependence between X_2 and Y and that we cannot render these variables conditionally independent by conditioning on any combination of the remaining observable variables T and X_1. Employing the widely accepted Markov condition, we can conclude that either X_2 → Y, or X_2 ← Y, or X_2 ← H → Y for some unobserved variable H, or some combination of the aforementioned settings. Without any further assumptions, however, these models are observationally equivalent. That is, we cannot refute any of the above possibilities based on observational data alone. Even randomising T does not help: the above models are interventionally equivalent with respect to interventions on T. We could apply one of the causal discovery methods described in Section 2.3. All of these methods, however, employ further assumptions on the data generating process that go beyond the Markov condition. We may deem some of those assumptions implausible given prior knowledge about the system. Yet, in the absence of targeted interventions on X_1, X_2, or Y, we can neither falsify candidate models obtained by such methods nor can we test all of the underlying assumptions. In Section 4.2, we illustrate how we may benefit from heterogeneity in the data, that is, from interventional data where the intervention target is unknown.

3.2 Finding the right variables

Causal discovery often starts by considering observations of some variables Z_1, ..., Z_d among which we wish to infer cause-effect relationships, thereby implicitly assuming that those variables are defined or constructed in a way that they can meaningfully be interpreted as causal entities in our model. This, however, is not necessarily the case in neuroscience. Without knowing how higher-level causal concepts emerge from lower levels, for example, it is hard to imagine how to make sense and use of a causal model of the 86 billion neurons in a human brain (Herculano-Houzel, 2012). One may hypothesise that a model of averaged neuronal activity in distinct functional brain regions may be pragmatically useful to reason about the effect of different treatments and to understand the brain. For such an approach we need to find the right transformation of the high-dimensional observed variables to obtain the right variables for a causal explanation of the system.

The problem of relating causal models with different granularity and finding the right choice of variable transformations that enable causal reasoning has received attention in the causality literature also outside of neuroscience applications.
Eberhardt (2016) fleshes out an instructive two-variable example that demonstrates that the choice of variables for causal modelling may be underdetermined even if interventions were available. For a wrong choice of variables, our ability to causally reason about a system breaks down. An example of this is the historic debate about whether a high-cholesterol diet is beneficial or harmful with respect to heart disease. It can be partially explained by an ambiguity in how exactly total cholesterol is manipulated. Today, we know that low-density lipoproteins (LDL) and high-density lipoproteins (HDL) have opposing effects on heart disease risk. Merging these variables into total cholesterol does not yield a variable with a well-defined intervention: referring to an intervention on total cholesterol does not specify what part of the intervention is due to a change in LDL versus HDL. As such, including only total cholesterol instead of LDL and HDL may be regarded as a too coarse-grained variable representation that breaks a model's causal semantics, that is, the ability to map every intervention to a well-defined interventional distribution (Spirtes & Scheines, 2004; Steinberg, 2007; Truswell, 2010).

Yet, we may sometimes prefer to transform micro-variables into macro-variables. This can result in a concise summary of the causal information that abstracts away detail, is easier to communicate and operationalise, and more effectively represents the information necessary for a certain task (Hoel, Albantakis & Tononi, 2013; Hoel, 2017; Weichwald, 2019); for example, a causal model over 86 billion neurons may be unwieldy for a brain surgeon aiming to identify and remove malignant brain tissue guided by the cognitive impairments observed in a patient. Rubenstein & Weichwald et al. (2017) formalise a notion of exact transformations that ensures causally consistent reasoning between two causal models where the variables in one model are transformations of the variables in the other. Roughly speaking, two models are considered causally consistent if the following two ways to reason about how the distribution of the macro-variables changes upon a macro-level intervention agree with one another: (a) find an intervention on the micro-variables that corresponds to the considered macro-level intervention, and consider the macro-level distribution implied by the micro-level intervention; and (b) obtain the interventional distribution directly within the macro-level structural causal model, sidestepping any need to refer to the micro-level. If the two resulting distributions agree with one another for all (compositions of) interventions, then the two models are said to be causally consistent and we can view the macro-level model as an exact transformation of the micro-level causal model that preserves its causal semantics. A formal exposition of the framework and its technical subtleties can be found in the aforementioned work. Here, we revisit a variant of the cholesterol example to illustrate what it entails for two causal models to be causally consistent, and to illustrate a failure mode. Consider variables L (LDL), H (HDL), and D (disease), where D := H − L + N_D for L, H, N_D mutually independent random variables. Then a macro-level model based on the transformed variables T = L + H and D itself is in general not causally consistent with the original model: for (l_1, h_1) ≠ (l_2, h_2) with l_1 + h_1 = l_2 + h_2, the interventional distributions induced by the micro-level model corresponding to setting L := l_1 and H := h_1, or alternatively L := l_2 and H := h_2, do in general not coincide due to the differing effects of L and H on D. Both interventions, however, correspond to the same level of T and thus to the same intervention do(T := t) with t = l_1 + h_1 = l_2 + h_2 in the macro-level model. Thus, the distributions obtained from reasoning (a) and (b) above do not coincide. If, on the other hand, we had D̃ := H + L + N_D, then we could indeed use a macro-level model where we consider T = H + L to reason about the distribution of D̃ under the intervention do(T := t) without running into conflict with the interventional distributions implied by all corresponding interventions in the micro-level model. This example can analogously be considered in the context of our running examples (cf. Section 1.1): instead of LDL, HDL, and disease, one could alternatively think of some neuronal activity (L) that delays motor response, some neuronal activity (H) that increases attention levels, and the detected reaction time (D) assessed by subjects performing a button press; the scenario then translates into how causal reasoning about the cause of slowed reaction times is hampered once we give up on considering H and L as two separate neural entities and instead try to reason about the average activity T. A numerical illustration follows below.
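The failure mode can be made concrete (our sketch): with D := H − L + N_D, two micro-interventions that set the same total T = L + H produce different distributions of D, so do(T := t) is ill-defined; with D̃ := H + L + N_D it is well-defined.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 100_000

def disease(l, h, opposing=True):
    """Micro-level mechanism; L and H have opposing effects if `opposing`."""
    n_d = rng.standard_normal(n)
    return (h - l if opposing else h + l) + n_d

# Two micro-interventions with identical macro value T = L + H = 4.
d_a = disease(l=1.0, h=3.0)
d_b = disease(l=3.0, h=1.0)
print(d_a.mean(), d_b.mean())   # approx 2 vs -2: do(T := 4) is ill-defined

# With aligned effects, all micro-interventions with T = 4 agree.
d_a = disease(l=1.0, h=3.0, opposing=False)
d_b = disease(l=3.0, h=1.0, opposing=False)
print(d_a.mean(), d_b.mean())   # both approx 4: do(T := 4) is well-defined
```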
Janzing, Rubenstein and Schölkopf (2018) observe similar problems for causal reasoning when aggregating variables and show that the observational and interventional stationary distributions of a bivariate autoregressive process cannot in general be described by a two-variable causal model. A recent line of research focuses on developing a notion of approximate transformations of causal models (Beckers & Halpern, 2019; Beckers, Eberhardt & Halpern, 2019). While there exist first approaches to learning discrete causal macro-variables from data (Chalupka, Perona & Eberhardt, 2015; Chalupka, Eberhardt & Perona, 2016), we are unaware of any method that is generally applicable and learns causal variables from complex high-dimensional data.

In cognitive neuroscience, we commonly treat large-scale brain networks or brain systems as causal entities and then proceed to infer interactions between those (Yeo et al., 2011; Power et al., 2011). Smith et al. (2011) demonstrate that this should be done with caution: network identification is strongly susceptible to slightly wrong or different definitions of the regions of interest (ROIs) or the so-called atlas. Analyses based on Granger causality depend on the level of spatial aggregation and were shown to reflect intra-areal properties instead of the interactions among brain regions if an ill-suited aggregation level is considered (Chicharro & Panzeri, 2014). Currently, there does not seem to be consensus as to which macroscopic entities and brain networks are the right ones to (causally) reason about cognitive processes (Uddin, Yeo & Spreng, 2019).
Furthermore, the observed variables themselves are already aggregates: a single fMRI voxel or the local field potential at some cortical location reflects the activity of thousands of neurons (Logothetis, 2008; Einevoll, Kayser, Logothetis & Panzeri, 2013); EEG recordings are commonly considered a linear superposition of cortical electromagnetic activity, which has spurred the development of blind source separation algorithms that try to invert this linear transformation to recover the underlying cortical variables (Nunez & Srinivasan, 2006).

4 Distributional robustness

4.1 Invariance of causal models

The concept of causality is linked to invariant models and distributional robustness. Consider again the setting with a target variable Y and covariates X_1, ..., X_d, as described in the running examples in Section 1.1. Suppose that the system is observed in different environments. Suppose further that the generating process can be described by an SCM, that PA(Y) ⊆ {X_1, ..., X_d} are the causal parents of Y, and that the different environments correspond to different interventions on some of the covariates, while we neither (need to) know the interventions' targets nor their precise form. In our reaction time example, the two environments may represent two subjects (say, a left-handed subject right after having dinner and a trained race car driver just before a race) that differ in the mechanisms for X_1, X_2, and X_3. Then the joint distribution over Y, X_1, ..., X_d may be different between the environments and also the marginal distributions may vary. Yet, if the interventions do not act directly on Y, the causal model is invariant in the following sense: the conditional distribution of Y | X_PA(Y) is the same in all environments. In the reaction time examples this could translate to the neuronal causes that facilitate fast (versus slow) reaction times being the same across subjects. This invariance can be formulated in different ways. For example, we have for all k and ℓ, where k and ℓ denote the indices of two environments, and for almost all x,

    (Y^k | X^k_PA(Y) = x) = (Y^ℓ | X^ℓ_PA(Y) = x)  in distribution.    (2)

Equivalently,

    E ⊥⊥ Y | X_PA(Y),    (3)

where the variable E represents the environment. In practice, we often work with model classes such as linear or logistic regression for modelling the conditional distribution Y | X_PA(Y). For such model classes, the above statements simplify. In the case of linear models, for example, Equations (2) and (3) translate to regression coefficients and error variances being equal across the different environments.

For an example, consider a system that, for environment E = 1, is governed by the following structural assignments:
    X_1 := N_1
    X_2 := 2·X_1 + N_2
    X_3 := N_3
    Y   := X_1 + X_2 + X_3 + N_Y
    X_4 := Y + N_4

with N_1, N_2, N_3, N_4, N_Y mutually independent and standard-normal, and where environment E = −1 changes the weight of X_1 in the assignment for X_2 to −1.

[Figure: the graph of this SCM alongside a scatter plot of X_1 versus Y with separate regression lines for the environments E = 1 and E = −1.]

Here, for example, {X_1, X_2, X_3} and {X_1, X_2, X_4} are so-called invariant sets: the conditionals Y | X_1, X_2, X_3 and Y | X_1, X_2, X_4 are the same in both environments. The invariant models Y | X_1, X_2, X_3 and Y | X_1, X_2, X_4 generalise to a new environment E = −2, which changes the same weight to −2, in that they would still predict well. Note that Y | X_1, X_2, X_4 is a non-causal model. The lack of invariance of Y | X_1 is illustrated by the different regression lines in the scatter plot; a simulation is sketched below.
2, in that they would still predict well. Note that Y | X , X , X isa non-causal model. The lack of invariance of Y | X is illustrated by the different regressionlines in the scatter plot on the right.The validity of (2) and (3) follows from the fact that the interventions do not act on Y directly and can be proved using the equivalence of Markov conditions (Lauritzen, 1996;Peters, Bühlmann & Meinshausen, 2016, Section 6.6). Here, we try to argue that it also makessense intuitively. Suppose that someone proposes to have found a complete causal model fora target variable Y , using certain covariates X S (for Y , we may again think of the reactiontime in Example A). Suppose that fitting that model for different subjects yields significantlydifferent model fits – maybe even with different signs for the causal effects from variablesin X S to Y such that E ⊥⊥ Y | X S is violated. In this case, we would become sceptical aboutwhether the proposed model is indeed a complete causal model. Instead, we might suspectthat the model is missing an important variable describing how reaction time depends onbrain activity.In practice, environments can represent different sources of heterogeneity. In a cognitiveneuroscience setting, environments may be thought of as different subjects who react differ-ently, yet not arbitrarily so (cf. Section 2.1), to varying levels of alcohol consumption. Like-wise, different experiments that are thought to involve the same cognitive processes may bethought of as environments; for example, the relationship ‘neuronal activity → reaction time’(cf. Example A, Section 1.1) may be expected to translate from an experiment that compares19eaction times after consumption of alcoholic versus non-alcoholic beers to another experi-ment where subjects are exposed to Burgundy wine versus grape juice. The key assumptionis that the environments do not alter the mechanism of Y —that is, f Y ( PA ( Y ) , N Y ) —directlyor, more formally, there are no interventions on Y . To test whether a set of covariates isinvariant, as described in (2) and (3), no causal background knowledge is required.The above invariance principle is also known as ‘modularity’ or ‘autonomy’. It has beendiscussed not only in the field of econometrics (Haavelmo, 1944; Aldrich, 1989; Hoover, 2008),but also in philosophy of science. Woodward (2005) discusses how the invariance idea rejectsthat ‘either a generalisation is a law or else is purely accidental’. In our notion, the criteria (2)and (3) depend on the environments E . In particular, a model may be invariant with respectto some changes, but not with respect to others. In this sense, robustness and invarianceshould always be thought with respect to a certain set of changes. Woodward (2005) intro-duces the possibility to talk about various degrees of invariance, beyond the mere existence orabsence of invariance, while acknowledging that mechanisms that are sensitive even to mildchanges in the background conditions are usually considered as not scientifically interesting.Cartwright (2003) analyses the relationship between invariant and causal relations using lin-ear deterministic systems and draws conclusions analogous to the ones discussed above. Inthe context of the famous Lucas critique (Lucas, 1976), it is debated to which extent invari-ance can be used for predicting the effect of changes in economic policy (Cartwright, 2009):Economy consists of many individual players who are capable of adapting their behaviourto a change in policy. 
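A short simulation of the SCM as reconstructed above makes the invariance visible in the regression fits: the regression of $Y$ on the parents $\{X_1, X_2, X_3\}$ yields stable coefficients across environments, while the regression on the child $X_4$ does not.

```python
import numpy as np

def simulate(weight, seed, n=200_000):
    """Simulate the SCM of Section 4.1; `weight` is the coefficient of X1
    in the assignment of X2 that the environment changes."""
    rng = np.random.default_rng(seed)
    N1, N2, N3, N4, NY = rng.normal(size=(5, n))
    X1 = N1
    X2 = weight * X1 + N2
    X3 = N3
    Y = X1 + X2 + X3 + NY
    X4 = Y + N4
    return X1, X2, X3, X4, Y

def ols(y, *features):
    A = np.column_stack(features)
    return np.linalg.lstsq(A, y, rcond=None)[0].round(2)

for seed, weight in enumerate([1, -1]):   # environments E = 1 and E = -1
    X1, X2, X3, X4, Y = simulate(weight, seed)
    print(f"E = {weight:+d}:",
          "Y ~ X1,X2,X3 ->", ols(Y, X1, X2, X3),   # stable: (1, 1, 1)
          "| Y ~ X4 ->", ols(Y, X4))               # slope differs across E
```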
The validity of (2) and (3) follows from the fact that the interventions do not act on $Y$ directly and can be proved using the equivalence of Markov conditions (Lauritzen, 1996; Peters, Bühlmann & Meinshausen, 2016, Section 6.6). Here, we try to argue that it also makes sense intuitively. Suppose that someone proposes to have found a complete causal model for a target variable $Y$, using certain covariates $X_S$ (for $Y$, we may again think of the reaction time in Example A). Suppose that fitting that model for different subjects yields significantly different model fits, maybe even with different signs for the causal effects from variables in $X_S$ to $Y$, such that $E \perp\!\!\!\perp Y \mid X_S$ is violated. In this case, we would become sceptical about whether the proposed model is indeed a complete causal model. Instead, we might suspect that the model is missing an important variable describing how reaction time depends on brain activity.

In practice, environments can represent different sources of heterogeneity. In a cognitive neuroscience setting, environments may be thought of as different subjects who react differently, yet not arbitrarily so (cf. Section 2.1), to varying levels of alcohol consumption. Likewise, different experiments that are thought to involve the same cognitive processes may be thought of as environments; for example, the relationship 'neuronal activity → reaction time' (cf. Example A, Section 1.1) may be expected to translate from an experiment that compares reaction times after consumption of alcoholic versus non-alcoholic beers to another experiment where subjects are exposed to Burgundy wine versus grape juice. The key assumption is that the environments do not alter the mechanism of $Y$—that is, $f_Y(\mathrm{PA}(Y), N_Y)$—directly or, more formally, that there are no interventions on $Y$. To test whether a set of covariates is invariant, as described in (2) and (3), no causal background knowledge is required.

The above invariance principle is also known as 'modularity' or 'autonomy'. It has been discussed not only in the field of econometrics (Haavelmo, 1944; Aldrich, 1989; Hoover, 2008), but also in philosophy of science. Woodward (2005) discusses how the invariance idea rejects the dichotomy that 'either a generalisation is a law or else is purely accidental'. In our notion, the criteria (2) and (3) depend on the environments $E$. In particular, a model may be invariant with respect to some changes, but not with respect to others. In this sense, robustness and invariance should always be thought of with respect to a certain set of changes. Woodward (2005) introduces the possibility to talk about various degrees of invariance, beyond the mere existence or absence of invariance, while acknowledging that mechanisms that are sensitive even to mild changes in the background conditions are usually considered not scientifically interesting. Cartwright (2003) analyses the relationship between invariant and causal relations using linear deterministic systems and draws conclusions analogous to the ones discussed above. In the context of the famous Lucas critique (Lucas, 1976), it is debated to which extent invariance can be used for predicting the effect of changes in economic policy (Cartwright, 2009): the economy consists of many individual players who are capable of adapting their behaviour to a change in policy. In cognitive neuroscience, we believe that the situation is different. Cognitive mechanisms do change and adapt, but not necessarily arbitrarily quickly. The cognitive mechanisms of an individual on a given day, say, can be assumed to be invariant with respect to changes in the visual input. Depending on the precise setup, however, we may expect moderate changes of the mechanisms due to, for example, the development of cognitive function in children or learning effects. In other settings, where mechanisms may be subject to arbitrarily large changes, scientific insight seems impossible (see Section 2.1).

Recently, the principle of invariance has also received increasing attention in the statistics and machine learning community (Schölkopf et al., 2012; Peters et al., 2016; Arjovsky et al., 2019). It can also be applied to models that do not have the form of an SCM. Examples include dynamical models that are governed by differential equations (Pfister, Bauer & Peters, 2019a).

The idea of distributional robustness across changing background conditions may help us to falsify causal hypotheses, even when interventional data is difficult to obtain, and in this sense may guide us towards models that are closer to the causal ground truth. For this, suppose that the data are obtained in different environments and that we expect a causal model for $Y$ to yield robust performance across these environments (see Section 4.1). Even if we lack targeted interventional data in cognitive neuroscience and thus cannot test a causal hypothesis directly, we can test the above implication. We can test the invariance, for example, using conditional independence tests or specialised tests for linear models (Chow, 1960). We can, as a surrogate, hold out one environment, train our model on the remaining environments, and evaluate how well that model performs on the held-out data (cf. Figure 2); the reasoning is that a non-invariant model may not exhibit robust predictive performance and may instead yield a bad predictive performance on one or more of the folds.
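For linear models, such an invariance check can be implemented as a Chow-type F-test comparing the pooled regression fit against environment-wise fits. The sketch below (our own minimal version for two environments, without intercepts in the data-generating process) rejects invariance when one coefficient flips sign between environments:

```python
import numpy as np
from scipy import stats

def chow_test(X_a, y_a, X_b, y_b):
    """F-test for equality of regression coefficients across two environments."""
    def rss(X, y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ beta
        return residuals @ residuals
    k = X_a.shape[1]
    pooled = rss(np.vstack([X_a, X_b]), np.concatenate([y_a, y_b]))
    split = rss(X_a, y_a) + rss(X_b, y_b)
    df1, df2 = k, len(y_a) + len(y_b) - 2 * k
    F = (pooled - split) / df1 / (split / df2)
    return F, stats.f.sf(F, df1, df2)   # small p-value: invariance rejected

rng = np.random.default_rng(0)
X_a = rng.normal(size=(500, 2))
y_a = X_a @ np.array([1.0, 2.0]) + rng.normal(size=500)
X_b = rng.normal(size=(500, 2))
y_b = X_b @ np.array([1.0, -2.0]) + rng.normal(size=500)  # one sign flipped
print(chow_test(X_a, y_a, X_b, y_b))   # tiny p-value, as it should be
```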
If a model fails the above, then either (1) we included the wrong variables, (2) we have not observed important variables, or (3) the environment directly affects $Y$. Tackling (1), we can try to refine our model and search for different variable representations and variable sets that render our model invariant and robust in the post-analysis. In general, there is no way to recover from (2) and (3), however.

While a model that is not invariant across environments cannot be the complete causal model (assuming the environments do not act directly on the target variable), it may still have non-trivial prediction performance and predict better than a simple baseline method in a new, unseen environment. The usefulness of a model is questionable, however, if its predictive performance on held-out environments is not significantly better than a simple baseline. Conversely, if our model shows robust performance on the held-out data and is invariant across environments, it has the potential of being a causal model (while it need not be; see Section 4.1 for an example). Furthermore, a model that satisfies the invariance property is interesting in itself as it may enable predictions in new, unseen environments. For this line of argument, it does not suffice to employ a cross-validation scheme that ignores the environment structure and only assesses predictability of the model on data pooled across environments. Instead, we need to respect the environment structure and assess the distributional robustness of the model across these environments.

For an illustration of the interplay between invariance and predictive performance, consider a scenario in which $X_1 \to Y \to H \to X_2$, where $H$ is unobserved. Here, we regard different subjects as different environments and suppose that (unknown to us) the environment acts on $H$: one may think of a variable $E$ pointing into $H$. Let us assume that our study contains two subjects, one that we use for training and another one that we use as held-out fold. We compare a model of the form $\hat Y = f(X_{\mathrm{PA}(Y)}) = f(X_1)$ with a model of the form $\tilde Y = g(X) = g(X_1, X_2)$. On a single subject, the latter model, which includes all observed variables, has more predictive power than the former model, which only includes the causes of $Y$. The reason is that $X_2$ carries information about $H$, which can be leveraged to predict $Y$. As a result, $g(X_1, X_2)$ may predict $Y$ well (and even better than $f(X_1)$) on the held-out subject if it is similar to the training subject in that the distribution of $H$ does not change between the subjects. If, however, $H$ was considerably shifted for the held-out subject, then the performance of predicting $Y$ by $g(X_1, X_2)$ may be considerably impaired. Indeed, the invariance is violated and we have $E \not\perp\!\!\!\perp Y \mid X_1, X_2$. In contrast, the causal parent model $f(X_1)$ may have worse accuracy on the training subject but satisfies invariance: even if the distribution of $H$ is different for held-out subjects compared to the training subject, the predictive performance of the model $f(X_1)$ does not change. We have $E \perp\!\!\!\perp Y \mid X_1$.
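The following simulation instantiates this scenario with hypothetical linear assignments ($X_1 := N_1$, $Y := X_1 + N_Y$, $H := Y + \text{shift} + N_H$, $X_2 := H + N_2$, where the environment acts through the shift on $H$):

```python
import numpy as np

def subject(h_shift, seed, n=100_000):
    rng = np.random.default_rng(seed)
    X1 = rng.normal(size=n)
    Y = X1 + rng.normal(size=n)             # causal model: Y := X1 + N_Y
    H = Y + h_shift + rng.normal(size=n)    # hidden; the environment shifts H
    X2 = H + rng.normal(size=n)             # X2 is an effect of H (and thus of Y)
    return np.column_stack([X1, X2]), Y

def fit(X, y):
    A = np.column_stack([X, np.ones(len(y))])   # linear model with intercept
    return np.linalg.lstsq(A, y, rcond=None)[0]

def mse(b, X, y):
    A = np.column_stack([X, np.ones(len(y))])
    return np.mean((A @ b - y) ** 2)

X_tr, y_tr = subject(h_shift=0.0, seed=1)   # training subject
b_causal = fit(X_tr[:, :1], y_tr)           # f(X1): parents only
b_full = fit(X_tr, y_tr)                    # g(X1, X2): also uses the effect X2

for shift in [0.0, 5.0]:                    # similar vs shifted held-out subject
    X_te, y_te = subject(h_shift=shift, seed=2)
    print(f"shift={shift}: causal {mse(b_causal, X_te[:, :1], y_te):.2f}, "
          f"full {mse(b_full, X_te, y_te):.2f}")
# On a similar held-out subject the full model predicts better; under a shift
# in H its error grows markedly, while the causal model's error is unchanged.
```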
In practice, we often consider more than two environments. We hence have access to several environments when training our model, even if we leave out one of the environments to test on. In principle, we can thus already during training distinguish between invariant and non-invariant models. While some methods have been proposed that explicitly make use of these different environments during training time (cf. Section 4.4), we regard this as a mainly unexplored but promising area of research.

Figure 2: Illustration of a cross-validation scheme across $K$ environments (cf. Section 4.2). Environments can correspond to recordings on different days, during different tasks, or on different subjects, and define how the data is split into folds for the cross-validation scheme. We propose to assess a model by (a) leave-one-environment-out cross-validation, testing for robust predictive performance on the held-out fold, and (b) an invariance test across environments, assessing whether the model is invariant across folds. The cross-validation scheme (a) is repeated $K$ times, so that each environment acts as a held-out fold once. Models whose predictive performance does not generalise to held-out data or that are not invariant across environments can be refuted as non-causal. For linear models, for example, invariance across environments can be assessed by evaluating to which extent regression coefficients and error variances differ across folds (cf. Section 4.2).

In Section 4.2.1, we present a short analysis of classifying motor imagery conditions on EEG data that demonstrates how leveraging robustness may yield models that generalise better to unseen subjects. In summary, employing distributional robustness as a guiding principle prompts us to reject models as non-causal if they are not invariant or if they do not generalise better than a simple baseline to unseen environments, such as sessions, days, neuroimaging modalities, subjects, or other slight variations of the experimental setup. Models that are distributionally robust and do generalise to unseen environments are not necessarily causal but satisfy the prerequisites for being interesting candidate models when it comes to capturing the underlying causal mechanisms.

Here, we illustrate the proposed cross-validation scheme presented in Figure 2 on motor imagery EEG data due to Tangermann et al. (2012). The data consist of EEG recordings of 9 subjects performing multiple trials of 4 different motor imagery tasks. For each subject, 22-channel EEG recordings at 250 Hz sampling frequency are available for 2 days with 6 runs of 48 trials each. We analysed the publicly available data, which is bandpass filtered between 0.5 and 100 Hz, and computed trial features as bandpower in different frequency bands (among them 8–20 Hz and 20–30 Hz, further bands extending up to 30 Hz, and a high-frequency band above 58 Hz). For each of the resulting 24 configurations of trial features, we fitted 5 different linear discriminant analysis (LDA) classifiers: without shrinkage, with automatic shrinkage based on the Ledoit-Wolf lemma, and with the shrinkage parameter set to 0.2, 0.5, and 0.8. These 120 pipelines were fitted once on the entire training data, and classification accuracies and area under the receiver operating characteristic curve (AUC) scores were obtained on 4 held-out subjects (y-axes in Figure 3). Classifier performance was cross-validated on the training data following three different cross-validation schemes (cross-validation scores are shown on the x-axes in Figure 3):

loso-cv: Leave-one-subject-out cross-validation is the proposed cross-validation scheme. We hold out the data corresponding to each training subject once, fit an LDA classifier on the remaining training data, and assess the model's accuracy on the held-out training subject. The average of those cross-validation scores reflects how robustly each of the 120 classifier models performs across environments (here, subjects).
lobo-cv: Leave-one-block-out cross-validation is a 7-fold cross-validation scheme that is similar to the above loso-cv scheme, where the training data is split into 7 random blocks of roughly equal size. Not respecting the environment structure within the training data, this cross-validation scheme does not capture a model's robustness across environments.

looo-cv: Leave-one-observation-out cross-validation leaves out a single observation and is equivalent to lobo-cv with a block size of one.

[Figure 3: two scatter panels of held-out performance against cross-validation estimates on the 5 training subjects for all 120 models; left panel: mean accuracy (loso $\tau$ = 0.59, lobo $\tau$ = 0.52, looo $\tau$ = 0.53), right panel: mean AUC (loso $\tau$ = 0.76, lobo $\tau$ = 0.71), with the top models selected by each scheme and the best-possible model marked.]

Figure 3: We compare 120 models for the prediction of 4 motor imagery tasks that leverage different EEG components, bandpower features in different frequency bands, and different classifiers. The left and right panels consider classification accuracy and AUC averaged over the four held-out subjects as performance measure, respectively. The leave-one-subject-out (loso) cross-validation accuracy on the 5 training subjects captures how robust a model is across training subjects. This leave-one-environment-out cross-validation scheme (see Figure 2) seems indeed able to identify models that generalise slightly better to new, unseen environments (here, the 4 held-out subjects) than a comparable 7-fold leave-one-block-out (lobo) or leave-one-observation-out (looo) scheme. This is reflected in the Kendall's $\tau$ rank correlations and the scores of the top-ranked models. All top-ranked models outperform random guessing on the held-out subjects (which corresponds to 25% accuracy and 50% AUC in the left and right panel, respectively). The displacement along the x-axis of the lobo- and looo-cv scores indicates the previously reported overestimation of held-out performance when using those cross-validation schemes.

In Figure 3, we display the results of the different cross-validation schemes and the Kendall's $\tau$ rank correlation between the different cv-scores derived on the training data and a model's classification performance on the four held-out subjects. The loso-cv scores correlate more strongly with held-out model performance and thereby better resolve the relative model performance. Considering the held-out performance of the models with top cv-scores, we observe that selecting models based on the loso-cv score may indeed select models that tend to perform slightly better on new, unseen subjects. Furthermore, comparing the displacement of the model scores from the diagonal shows that the loso-cv scheme's estimates are less biased than the lobo and looo cross-validation scores when used as an estimate of the performance on held-out subjects; this is in line with Varoquaux et al. (2017).
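Such a scheme is easy to set up with standard tooling. The sketch below uses randomly generated stand-in data (the variables X, y, and groups are placeholders, not the actual competition data) and LDA configurations mirroring those above; scikit-learn's LeaveOneGroupOut produces exactly the subject-respecting folds:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(9 * 100, 22))       # stand-in bandpower features
y = rng.integers(0, 4, size=9 * 100)     # 4 motor imagery classes
groups = np.repeat(np.arange(9), 100)    # subject id per trial (the environment)

# Candidate pipelines: LDA without shrinkage, Ledoit-Wolf, and fixed shrinkage.
models = {
    "lda": LinearDiscriminantAnalysis(),
    "lda-lw": LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"),
    "lda-0.5": LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.5),
}

loso = LeaveOneGroupOut()   # folds respect the subject structure
for name, model in models.items():
    scores = cross_val_score(model, X, y, groups=groups, cv=loso)
    # Mean score plus per-subject scores; large spread across subjects
    # indicates a lack of robustness across environments.
    print(name, scores.mean().round(3), scores.round(2))
```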
Whether distributional robustness holds can depend on whether we consider the right variables. This is shown by the following example. Assume that the target $Y$ is caused by the two brain signals $X_1$ and $X_2$ via
$$Y := \alpha X_1 + \beta X_2 + N_Y,$$
for some $\alpha \neq \beta \neq 0$ and noise variable $N_Y$. Assume further that the environment influences the covariates $X_1$ and $X_2$ via $X_1 := X_2 + E + N_1$ and $X_2 := E + N_2$, but does not influence $Y$ directly. Here, $X_1$ and $X_2$ may represent neuronal activity in two brain regions that are causal for reaction times, while $E$ may indicate the time of day or respiratory activity. We then have the invariance property $E \perp\!\!\!\perp Y \mid X_1, X_2$. If, however, we were to construct or—due to limited measurement ability—only be able to observe $\tilde X := X_1 + X_2$, then whenever $\alpha \neq \beta$ we would find that $E \not\perp\!\!\!\perp Y \mid \tilde X$. This conditional dependence is due to many value pairs for $(X_1, X_2)$ leading to the same value of $\tilde X$: given $\tilde X = \tilde x$, the value of $E$ holds information about whether, say, $(X_1, X_2) = (\tilde x, 0)$ or $(X_1, X_2) = (0, \tilde x)$ is more probable and thus—since $X_1$ and $X_2$ enter $Y$ with different weights—holds information about $Y$; $E$ and $Y$ are conditionally dependent given $\tilde X$. Thus, the invariance may break down when aggregating variables in an ill-suited way. This example is generic in that the same conclusions hold for all assignments $X_1 := f_1(X_2, E, N_1)$ and $X_2 := f_2(E, N_2)$, as long as causal minimality, a weak form of faithfulness, is satisfied (Spirtes et al., 2001).
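Numerically, the breakdown already shows up in environment-wise linear fits. In the sketch below (our own illustration, with the hypothetical choice $\alpha = 1$, $\beta = 2$), the regression of $Y$ on $(X_1, X_2)$ is identical across environments, while the regression on the aggregate $\tilde X = X_1 + X_2$ has an environment-dependent intercept, exhibiting $E \not\perp\!\!\!\perp Y \mid \tilde X$:

```python
import numpy as np

def environment(e, seed, n=200_000):
    rng = np.random.default_rng(seed)
    N1, N2, NY = rng.normal(size=(3, n))
    X2 = e + N2
    X1 = X2 + e + N1
    Y = 1.0 * X1 + 2.0 * X2 + NY   # alpha = 1, beta = 2 (hypothetical)
    return X1, X2, Y

def ols(y, *features):
    # Least squares with intercept; returns (coefficients..., intercept).
    A = np.column_stack(features + (np.ones(len(y)),))
    return np.linalg.lstsq(A, y, rcond=None)[0].round(2)

for e in [-1, 1]:
    X1, X2, Y = environment(e, seed=10 + e)
    print("E =", e,
          "| Y ~ X1,X2:", ols(Y, X1, X2),      # identical across environments
          "| Y ~ Xtilde:", ols(Y, X1 + X2))    # intercept depends on E
```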
Rather than taking the lack of robustness as a deficiency, we believe that this observation has the potential to help us find the right variables and granularity to model our system of interest. If we are given several environments, the guiding principle of distributional robustness can nudge our variable definition and ROI definition towards the construction of variables that are more suitable for causally modelling some cognitive function. If some ROI activity or some EEG bandpower feature does not satisfy any invariance across environments, then we may conclude that our variable representation is misaligned with the underlying causal mechanisms or that important variables have not been observed (assuming that the environments do not act on $Y$ directly).

This idea can be illustrated by a thought experiment that is a variation of the LDL-HDL example in Section 3.2: assume we wish to aggregate multiple voxel activities and represent them by the activity of a ROI defined by those voxels. For example, let us consider the reaction time scenario (Example A, Section 1.1) and voxels $X_1, \dots, X_d$. Then we may aggregate the voxels, $X = \sum_{i=1}^d X_i$, to obtain a macro-level model in which we can still sensibly reason about the effect of an intervention on the treatment variable $T$ onto the distribution of $X$, the ROI's average activity. Yet, the model is in general not causally consistent with the original model. First, our ROI may be chosen too coarse such that for $\hat x \neq \tilde x \in \mathbb{R}^d$ with $\sum_{i=1}^d \hat x_i = \sum_{i=1}^d \tilde x_i = x$ the interventional distributions induced by the micro-level model corresponding to setting all $X_i := \tilde x_i$ or alternatively $X_i := \hat x_i$ do not coincide—for example, a ROI that effectively captures the global average voxel activity cannot resolve whether a higher activity is due to increased activity of reaction-time-driving neuronal entities or due to some upregulation of other neuronal processes unrelated to the reaction time, such as respiratory activity. This ROI would be ill-suited for causal reasoning and non-robust, as there are two micro-level interventions that imply different distributions on the reaction times while corresponding to the same intervention setting $X := x$ with $x = \sum_{i=1}^d \hat x_i = \sum_{i=1}^d \tilde x_i$ in the macro-level model. Second, our ROI may be defined too fine-grained such that, for example, the variable representation only picks up on the left-hemisphere hub of a distributed neuronal process relevant for reaction times. If the neuronal process has different laterality in different subjects, then predictions of the effects of interventions on only the left-hemispherical neuronal activity cannot be expected to translate to all subjects. Here, a macro-variable that averages more voxels, say symmetrically over both hemispheres, may be more robust for reasoning about the causes of reaction times than the more fine-grained unilateral ROI. In this sense, seeking variable constructions that enable distributionally robust models across subjects may nudge us towards meaningful causal entities. The spatially refined and finer-resolved cognitive atlas obtained by Varoquaux et al. (2018), whose map-definition procedure was geared towards an atlas that would be robust across multiple studies and 196 different experimental conditions, may be seen as an indicative manifestation of the above reasoning.

4.4 Existing methods exploiting robustness

We now present some existing methods that explicitly consider the invariance of a model. While many of these methods are still in their infancy when considering real-world applications, we believe that further development in that area could play a vital role when tackling causal questions in cognitive neuroscience.
Robust Independent Component Analysis.
Independent component analysis (ICA) is commonly performed in the analysis of magneto- and electroencephalography (MEG and EEG) data in order to invert the inevitable measurement transformation that leaves us with observations of a linear superposition of underlying cortical (and non-cortical) activity. The basic ICA model assumes that our vector of observed variables $X$ is generated as $X = AS$, where $A$ is a mixing matrix and $S = [S_1, \dots, S_d]^\top$ is a vector of unobserved, mutually independent source signals. The aim is to find the unmixing matrix $V = A^{-1}$. If we perform ICA on individual subjects' data separately, the resulting unmixing matrices will often differ between subjects. This not only hampers the interpretation of the resulting sources as some cortical activity that we can identify across subjects, it also hints—in light of the above discussion—at some unexplained variation that is due to shifts in background conditions between subjects, such as different cap positioning or neuroanatomical variation. Instead of simply pooling data across subjects, Pfister et al. (2019b) propose a methodology that explicitly exploits the existence of environments, that is, the fact that EEG samples can be grouped by the subjects they were recorded from. This way, the proposed confounding-robust ICA (coroICA) procedure identifies an unmixing of the signals that generalises to new subjects. For their considered example, the additional robustness resulted in improved classification accuracies on held-out subjects and can be viewed as a first-order adjustment for subject-specific differences. The application of ICA procedures to pooled data will generally result in components that do not robustly transfer to new subjects and that are thus necessarily variables that do not lend themselves to a causal interpretation. The coroICA procedure aims to exploit the environments to identify unmixing matrices that generalise across subjects.
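In code, the essential difference to pooled ICA is that the grouping of samples by subject is handed to the estimator. The sketch below assumes the interface of the publicly available coroica Python package accompanying Pfister et al. (2019b), namely a CoroICA estimator whose fit method accepts a group_index and which exposes the estimated unmixing matrix as V_; the toy data-generating model is our own illustration:

```python
import numpy as np
from coroica import CoroICA  # package accompanying Pfister et al. (2019b)

rng = np.random.default_rng(0)
n_per_subject, d = 5_000, 4
A = rng.normal(size=(d, d))   # mixing matrix shared across subjects

X, groups = [], []
for subject in range(3):
    S = rng.laplace(size=(n_per_subject, d))    # non-Gaussian sources
    offset = rng.normal(size=(1, d)) * 0.5      # subject-specific disturbance
    X.append(S @ A.T + offset)                  # observed mixtures
    groups.append(np.full(n_per_subject, subject))
X, groups = np.vstack(X), np.concatenate(groups)

ica = CoroICA()
ica.fit(X, group_index=groups)   # exploits the subject grouping
V = ica.V_                       # estimated unmixing matrix (assumed attribute)
print(np.round(V @ A, 2))        # ~ scaled permutation if recovery succeeded
```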
Causal discovery with exogenous variation. Invariant causal prediction (ICP), proposed by Peters et al. (2016), aims at identifying the parents of $Y$ within a set of covariates $X_1, \dots, X_d$. We have argued that the true causal model is invariant across environments, see Equation (2), if the data are obtained in different environments and the environment does not directly influence $Y$. That is, when enumerating all invariant models by searching through subsets of $X_1, \dots, X_d$, one of these subsets must be the set of causal parents of $Y$. As a result, the intersection $\hat S = \bigcap_{S \colon S \text{ invariant}} S$ of all sets of covariates that yield invariant models is guaranteed to be a subset of the causes $\mathrm{PA}(Y)$ of $Y$. (Here, we define the intersection over the empty index set as the empty set.) Testing invariance with a hypothesis test at level $\alpha$, say $\alpha = 0.05$, ensures that $\hat S$ is contained in the set of parents of $Y$ with high probability, $P(\hat S \subseteq \mathrm{PA}(Y)) \geq 1 - \alpha$. If the environment acts on $Y$ directly, for example, there is no invariant set; and in the presence of hidden variables, the intersection $\hat S$ of invariant models can still be shown to be a subset of the ancestors of $Y$ with large probability.

It is further possible to model the environment as a random variable (using an indicator variable, for example), which is often called a context variable. One can then exploit the background knowledge of its exogeneity to identify the full causal structure instead of focussing on identifying the causes of a target variable. Several approaches have been suggested (for example, Spirtes et al., 2001; Eaton & Murphy, 2007; Zhang, Huang, Zhang, Glymour & Schölkopf, 2017; Mooij, Magliacane & Claassen, 2020). Often, these methods first identify the target of the intervention and then exploit known techniques of constraint- or score-based methods. Some of the above methods also make use of time as a context variable or environment (Zhang et al., 2017; Pfister, Bühlmann & Peters, 2018).
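A minimal prototype of ICP enumerates subsets of covariates, accepts those whose regressions look invariant across environments, and intersects the accepted sets. The invariance test below (a Chow-type coefficient test combined with a Levene test on residual variances) is our own crude stand-in for the more careful tests used in the reference implementations:

```python
import itertools
import numpy as np
from scipy import stats

def _rss(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r, r

def invariant(X, y, env, subset, alpha):
    """Accept `subset` if the regression of y on X[:, subset] has equal
    coefficients (Chow-type F-test) and equal residual variances (Levene)
    across environments; Bonferroni combination of the two p-values."""
    cols = [X[:, list(subset)]] if subset else []
    A = np.column_stack(cols + [np.ones(len(y))])
    k, m = A.shape[1], len(np.unique(env))
    rss_pooled, _ = _rss(A, y)
    rss_split, residuals = 0.0, []
    for e in np.unique(env):
        r2, r = _rss(A[env == e], y[env == e])
        rss_split += r2
        residuals.append(r)
    df1, df2 = (m - 1) * k, len(y) - m * k
    F = (rss_pooled - rss_split) / df1 / (rss_split / df2)
    p_coef = stats.f.sf(F, df1, df2)
    p_var = stats.levene(*residuals).pvalue
    return min(p_coef, p_var) > alpha / 2

def icp(X, y, env, alpha=0.05):
    d = X.shape[1]
    accepted = [set(S) for r in range(d + 1)
                for S in itertools.combinations(range(d), r)
                if invariant(X, y, env, S, alpha)]
    return set.intersection(*accepted) if accepted else set()

# On data from the SCM of Section 4.1, the intersection of invariant sets
# is a subset of the parent indices {0, 1, 2} of Y (typically {0, 1} here):
rng = np.random.default_rng(0)
n = 10_000
env = np.repeat([0, 1], n // 2)
w = np.where(env == 0, 1.0, -1.0)   # the weight changed by the environment
X1 = rng.normal(size=n)
X2 = w * X1 + rng.normal(size=n)
X3 = rng.normal(size=n)
Y = X1 + X2 + X3 + rng.normal(size=n)
X4 = Y + rng.normal(size=n)
print(icp(np.column_stack([X1, X2, X3, X4]), Y, env))
```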
Anchor regression. We argued above that focusing on invariance has an advantage when inferring causal structure from data. If we are looking for generalisability across environments, however, focusing solely on invariance may be too restrictive. Instead, we may select the most predictive model among all invariant models. The idea of anchor regression is to explicitly trade off invariance and predictability (Rothenhäusler, Meinshausen, Bühlmann & Peters, 2018). For a target variable $Y$, predictor variables $X$, and so-called anchor variables $A = [A_1, \dots, A_q]^\top$ that represent the different environments and are normalised to have unit variance, the anchor regression coefficients are obtained as solutions to the following minimisation problem:
$$\hat b_\gamma := \arg\min_{b \in \mathbb{R}^d} \; \underbrace{\mathbb{E}\big[(Y - b^\top X)^2\big]}_{\text{prediction}} \; + \; \gamma \, \underbrace{\big\| \mathbb{E}\big[A (Y - b^\top X)\big] \big\|^2}_{\text{invariance}}.$$
Higher parameters $\gamma$ steer the regression towards more invariant predictions (converging against the two-stage least squares solutions in identifiable instrumental variable settings). For $\gamma = 0$, this reduces to ordinary least-squares regression; in general, $\hat b_\gamma$ can be shown to have the best predictive power under shift interventions up to a certain strength that depends on $\gamma$. As before, the anchor variables can code time, environments, subjects, or other factors, and we thus obtain a regression that is distributionally robust against shifts in those factors.
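A closed-form implementation follows the finite-sample formulation of Rothenhäusler et al. (2018), in a slightly different parameterisation of the trade-off where $\gamma = 1$ corresponds to ordinary least squares: transform $X$ and $Y$ by $W_\gamma = I - (1 - \sqrt{\gamma})\,\Pi_A$, with $\Pi_A$ the projection onto the column span of the anchors, and run OLS on the transformed data. The toy data are our own illustration:

```python
import numpy as np

def anchor_regression(X, Y, A, gamma):
    """Finite-sample anchor regression: OLS after transforming the data with
    W = I - (1 - sqrt(gamma)) * P_A, where P_A projects onto span(A).
    gamma = 1 recovers OLS; gamma -> infinity approaches 2SLS."""
    P = A @ np.linalg.pinv(A)                       # projection onto span(A)
    W = np.eye(len(Y)) - (1 - np.sqrt(gamma)) * P
    return np.linalg.lstsq(W @ X, W @ Y, rcond=None)[0]

# Toy data: a hidden confounder H biases OLS; the anchor restores the
# causal coefficient as gamma grows.
rng = np.random.default_rng(0)
n = 2_000
A = rng.normal(size=(n, 1))      # anchor / environment variable
H = rng.normal(size=n)           # hidden confounder
X1 = A[:, 0] + H + rng.normal(size=n)
Y = X1 + 2 * H + rng.normal(size=n)   # causal coefficient is 1
X = X1[:, None]

for gamma in [0.0, 1.0, 100.0]:
    print(gamma, anchor_regression(X, Y, A, gamma).round(2))
# gamma = 1 gives the confounded OLS estimate (~1.67 here); large gamma
# approaches the instrumental variable solution, recovering ~1.
```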
So far, we have mainly focused on the setting of i.i.d. data. Most of the causal inference literature dealing with time dependency considers discrete-time models. This comes with additional complications for causal inference. For example, there are ongoing efforts to adapt causal inference algorithms to account for sub- or super-sampling and temporal aggregation (Danks & Plis, 2013; Hyttinen, Plis, Järvisalo, Eberhardt & Danks, 2016). Problems of temporal aggregation relate to Challenge 2 of finding the right variables, which is a conceptual problem in time series models that requires us to clarify our notion of intervention for time-evolving systems (Rubenstein & Weichwald et al., 2017). When we observe time series with non-stationarities, we may consider these as resulting from some unknown shift interventions. That is, non-stationarities over time may be due to shifts in the background conditions and as such can be understood as shifts in environments analogous to the i.i.d. setting. This way, we may again leverage the idea of distributional robustness for inference on time-evolving systems for which targeted interventional data is scarce. Extensions of invariant causal prediction to time series data that aim to leverage such variation have been proposed by Christiansen and Peters (2018) and Pfister et al. (2018), and the ICA procedure described in Section 4.4 also exploits non-stationarity over time. SCMs extend to continuous-time models (Peters, Bauer & Pfister, 2020), where the idea of trading off prediction and invariance has been applied to the problem of inferring chemical reaction networks (Pfister et al., 2019a).

A remark is in order if we wish to describe time-evolving systems by one causal summary graph where each time series component is collapsed into one node: for this to be reasonable, we need to assume a time-homogeneous causal structure. Furthermore, it requires us to carefully clarify its causal semantics: while summary graphs can capture the existence of cause-effect relationships, they do not in general correspond to a structural causal model that admits a causal semantics or enables interventional predictions that are consistent with the underlying time-resolved structural causal model (Rubenstein & Weichwald et al., 2017; Janzing et al., 2018). That is, the wrong choice of time-agnostic variables and corresponding interventions may be ill-suited to represent the cause-effect relationships of a time-evolving system (cf. Challenge 2 and Rubenstein & Weichwald et al., 2017; Janzing et al., 2018).
Causal inference in cognitive neuroscience is ambitious. It is important to continue the open discourse about the many challenges, some of which are mentioned above. Thanks to the open and critical discourse, there is great awareness and caution when interpreting neural correlates (Rees, Kreiman & Koch, 2002). Yet, "FC [functional connectivity] researchers already work within a causal inference framework, whether they realise it or not" (Reid et al., 2019).

In this article, we have provided our view on the numerous obstacles to a causal understanding of cognitive function. If we, explicitly or often implicitly, ask causal questions, we need to employ causal assumptions and methodology. We propose to exploit the fact that causal models using the right variables are distributionally robust. In particular, we advocate distributional robustness as a guiding principle for causality in cognitive neuroscience. While causal inference in general, and in cognitive neuroscience in particular, is a challenging task, we can at least exploit this rationale to refute models and variables as non-causal when they are frail to shifts in the environment. This guiding principle does not necessarily identify causal variables or causal models, but it nudges our search in the right direction, away from frail models and non-causal variables. While we presented first attempts that aim to leverage observations obtained in different environments (cf. Section 4.4), this article poses more questions for future research than it answers.

We believe that procedures that exploit environments during training are a promising avenue for future research. While we saw mildly positive results in our case study, further research needs to show whether this trend persists in studies with many subjects. It may be possible to obtain improvements when combining predictive scores on held-out subjects with other measures of invariance and robustness. The development of such methods may be spurred and guided by field-specific benchmarks (or competitions) that assess models' distributional robustness across a wide range of scenarios, environments, cognitive tasks, and subjects.

When considering robustness or invariance across training environments, the question arises how the ability to infer causal structure and the generalisation performance to unseen environments depend on the number of training environments. While first attempts have been made to theoretically understand that relation (Rothenhäusler et al., 2018; Christiansen, Pfister, Jakobsen, Gnecco & Peters, 2020), most of the underlying questions are still open. We believe that an answer would depend on the strength of the interventions, the sample size, the complexity of the model class and, possibly, properties of the (meta-)distribution of environments.

We believe that advancements regarding the errors-in-variables problem may have important implications for cognitive neuroscience. Nowadays, we can obtain neuroimaging measurements at various spatial and temporal resolutions using, among others, fMRI, MEG and EEG, positron emission tomography, or near-infrared spectroscopy (Filler, 2009; Poldrack, 2018). Yet, all measurement modalities are imperfect and come with different complications. One general problem is that the observations are corrupted by measurement noise. The errors-in-variables problem complicates even classical regression techniques where we wish to model $Y \approx \beta X^* + \epsilon$ but only have access to observations of a noise-corrupted $X = X^* + \eta$ (Schennach, 2016).
This problem inevitably carries over to, and hinders, causal inference, as the measurement noise spoils conditional independence testing, biases any involved regression steps, and troubles both additive noise approaches that aim to exploit noise properties for directing causal edges and methods testing for invariance. First steps addressing these issues in the context of causal discovery have been proposed by Zhang et al. (2018), Blom, Klimovskaia, Magliacane and Mooij (2018), and Scheines and Ramsey (2016).

Summarising, we believe that there is a need for causal models if we aim to understand the neuronal underpinnings of cognitive function. Only causal models equip us with concepts that allow us to explain, describe, predict, manipulate, deal and interact with, and reason about a system, and that allow us to generalise to new, unseen environments. A merely associational model suffices to predict naturally unfolding disease progression, for example. We need to obtain understanding in the form of a causal model if our goal is to guide rehabilitation after cognitive impairment or to inform the development of personalised drugs that target specific neuronal populations. Distributional robustness and generalisability to unseen environments is an ambitious goal, in particular in biological systems and even more so in complex systems such as the human brain. Yet, it may be the only and most promising way forward.

Acknowledgments

The authors thank the anonymous reviewers for their constructive and helpful feedback on an earlier version of this manuscript. SW was supported by the Carlsberg Foundation.

References
Aldrich, J. (1989). Autonomy. Oxford Economic Papers, 15–34.
Antal, A. & Herrmann, C. S. (2016). Transcranial alternating current and random noise stimulation: possible mechanisms. Neural Plasticity.
Bach, D. R., Symmonds, M., Barnes, G. & Dolan, R. J. (2017). Whole-Brain Neural Dynamics of Probabilistic Reward Prediction. Journal of Neuroscience, (14), 3789–3798.
Beckers, S., Eberhardt, F. & Halpern, J. Y. (2019). Approximate Causal Abstraction. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press.
Beckers, S. & Halpern, J. Y. (2019). Abstracting Causal Models. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. AAAI Press.
Bergmann, T. O. & Hartwigsen, G. (2020). Inferring Causality from Noninvasive Brain Stimulation in Cognitive Neuroscience. Journal of Cognitive Neuroscience, 1–29. Forthcoming.
Bestmann, S. & Walsh, V. (2017). Transcranial electrical stimulation. Current Biology, (23), R1258–R1262.
Blom, T., Klimovskaia, A., Magliacane, S. & Mooij, J. M. (2018). An upper bound for random measurement error in causal discovery. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI).
Bollen, K. A. (1989). Structural Equations with Latent Variables. John Wiley & Sons.
Bongers, S., Peters, J., Schölkopf, B. & Mooij, J. M. (2018). Theoretical Aspects of Cyclic Structural Causal Models. arXiv preprint arXiv:1611.06221.
Breakspear, M. (2013). Dynamic and stochastic models of neuroimaging data: A comment on Lohmann et al. NeuroImage, 270–274.
Bühlmann, P., Peters, J. & Ernest, J. (2014). CAM: Causal Additive Models, high-dimensional order search and penalized regression. Annals of Statistics, 2526–2556.
Cartwright, N. (2003). Two theorems on invariance and causality. Philosophy of Science, (1), 203–224.
Cartwright, N. (2009). Causality, invariance and policy. In M. Klopotek, A. Przepiorkowski, S. T. Wierzchon & K. Trojanowski (Eds.), The Oxford Handbook of Philosophy of Economics (pp. 410–423). New York, NY: Oxford University Press.
Chalupka, K., Eberhardt, F. & Perona, P. (2016). Multi-Level Cause-Effect Systems. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (Vol. 51, pp. 361–369). Proceedings of Machine Learning Research. PMLR.
Chalupka, K., Perona, P. & Eberhardt, F. (2015). Visual Causal Feature Learning. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press.
Chicharro, D. & Panzeri, S. (2014). Algorithms of causal inference for the analysis of effective connectivity among brain regions. Frontiers in Neuroinformatics, 64.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 507–554.
Chow, G. C. (1960). Tests of equality between sets of coefficients in two linear regressions. Econometrica, 591–605.
Christiansen, R. & Peters, J. (2018). Invariant Causal Prediction in the Presence of Latent Variables. arXiv preprint arXiv:1808.05541.
Christiansen, R., Pfister, N., Jakobsen, M., Gnecco, N. & Peters, J. (2020). The difficult task of distribution generalization in nonlinear models. arXiv preprint arXiv:2006.07433.
Conniffe, D. (1991). R. A. Fisher and the development of statistics - a view in his centenary year. Journal of the Statistical and Social Inquiry Society of Ireland, (3), 55–108.
Danks, D. & Plis, S. (2013). Learning causal structure from undersampled time series.
Darmois, G. (1953). Analyse générale des liaisons stochastiques. Revue de l'Institut International de Statistique, 2–8.
Davis, T., LaRocque, K. F., Mumford, J. A., Norman, K. A., Wagner, A. D. & Poldrack, R. A. (2014). What do differences between multi-voxel and univariate analysis mean? How subject-, voxel-, and trial-level variance impact fMRI analysis. NeuroImage, 271–283.
Dubois, J., Oya, H., Tyszka, J. M., Howard, M., Eberhardt, F. & Adolphs, R. (2017). Causal mapping of emotion networks in the human brain: Framework and initial findings. Neuropsychologia.
Eaton, D. & Murphy, K. P. (2007). Exact Bayesian structure learning from uncertain interventions. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS) (pp. 107–114).
Eberhardt, F. (2013). Experimental indistinguishability of causal structures. Philosophy of Science, (5), 684–696.
Eberhardt, F. (2016). Green and grue causal variables. Synthese, (4), 1029–1046.
Einevoll, G. T., Kayser, C., Logothetis, N. K. & Panzeri, S. (2013). Modelling and analysis of local field potentials for studying the function of cortical circuits. Nature Reviews Neuroscience, (11), 770–785.
Filler, A. (2009). The history, development and impact of computed imaging in neurological diagnosis and neurosurgery: CT, MRI, and DTI. Nature Precedings.
Friston, K. J., Harrison, L. & Penny, W. (2003). Dynamic causal modelling. NeuroImage, (4), 1273–1302.
Friston, K., Daunizeau, J. & Stephan, K. E. (2013). Model selection and gobbledygook: Response to Lohmann et al. NeuroImage, 275–278.
Granger, C. W. J. (1969). Investigating Causal Relations by Econometric Models and Cross-spectral Methods. Econometrica, (3), 424–438.
Grosse-Wentrup, M., Janzing, D., Siegel, M. & Schölkopf, B. (2016). Identification of causal relations in neuroimaging data with latent confounders: An instrumental variable approach. NeuroImage, 825–833.
Györfi, L., Kohler, M., Krzyżak, A. & Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer.
Haavelmo, T. (1944). The Probability Approach in Econometrics. Econometrica, S1–S115 (supplement).
Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.-D., Blankertz, B. & Bießmann, F. (2014). On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage, 96–110.
Herculano-Houzel, S. (2012). The remarkable, yet not extraordinary, human brain as a scaled-up primate brain and its associated cost. Proceedings of the National Academy of Sciences, (Supplement 1), 10661–10668.
Herrmann, C. S., Rach, S., Neuling, T. & Strüber, D. (2013). Transcranial alternating current stimulation: a review of the underlying mechanisms and modulation of cognitive processes. Frontiers in Human Neuroscience, 279.
Hoel, E. P. (2017). When the Map Is Better Than the Territory. Entropy, (5).
Hoel, E. P., Albantakis, L. & Tononi, G. (2013). Quantifying causal emergence shows that macro can beat micro. Proceedings of the National Academy of Sciences, (49), 19790–19795.
Hoover, K. D. (2008). Causality in economics and econometrics. In S. N. Durlauf & L. E. Blume (Eds.), The New Palgrave Dictionary of Economics (2nd ed.). Basingstoke, UK: Palgrave Macmillan.
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J. & Schölkopf, B. (2008). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21 (NeurIPS) (pp. 689–696). NeurIPS Foundation.
Huth, A. G., Lee, T., Nishimoto, S., Bilenko, N. Y., Vu, A. T. & Gallant, J. L. (2016). Decoding the Semantic Content of Natural Movies from Human Brain Activity. Frontiers in Systems Neuroscience, 81.
Hyttinen, A., Plis, S., Järvisalo, M., Eberhardt, F. & Danks, D. (2016). Causal discovery from subsampled time series data by constraint optimization. In Proceedings of the Eighth International Conference on Probabilistic Graphical Models (Vol. 52, pp. 216–227). Proceedings of Machine Learning Research. PMLR.
Imbens, G. W. & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Jabbari, F., Ramsey, J., Spirtes, P. & Cooper, G. (2017). Discovery of causal models that contain latent variables through Bayesian scoring of independence constraints. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 142–157). Springer.
Janzing, D., Rubenstein, P. & Schölkopf, B. (2018). Structural causal models for macro-variables in time-series. arXiv preprint arXiv:1804.03911.
Kar, K., Ito, T., Cole, M. W. & Krekelberg, B. (2020). Transcranial alternating current stimulation attenuates BOLD adaptation and increases functional connectivity. Journal of Neurophysiology, (1), 428–438.
Lauritzen, S. (1996). Graphical Models. Oxford University Press.
Lauritzen, S. L., Dawid, A. P., Larsen, B. N. & Leimer, H.-G. (1990). Independence properties of directed Markov fields. Networks, 491–505.
Lee, J. H., Durand, R., Gradinaru, V., Zhang, F., Goshen, I., Kim, D.-S., Fenno, L. E., Ramakrishnan, C. & Deisseroth, K. (2010). Global and local fMRI signals driven by neurons defined optogenetically by type and wiring. Nature, (7299), 788–792.
Liang, Z., Watson, G. D., Alloway, K. D., Lee, G., Neuberger, T. & Zhang, N. (2015). Mapping the functional network of medial prefrontal cortex by combining optogenetics and fMRI in awake rats. NeuroImage, 114–123.
Logothetis, N. K. (2008). What we can do and what we cannot do with fMRI. Nature, (7197), 869–878.
Lohmann, G., Erfurth, K., Müller, K. & Turner, R. (2012). Critical comments on dynamic causal modelling. NeuroImage, (3), 2322–2329.
Lohmann, G., Müller, K. & Turner, R. (2013). Response to commentaries on our paper: Critical comments on dynamic causal modelling. NeuroImage, 279–281.
López-Alonso, V., Cheeran, B., Rio-Rodriguez, D. & Fernández-del-Olmo, M. (2014). Inter-individual variability in response to non-invasive brain stimulation paradigms. Brain Stimulation, (3), 372–380.
Lucas, R. J. (1976). Econometric policy evaluation: A critique. Carnegie-Rochester Conference Series on Public Policy, (1), 19–46.
Arjovsky, M., Bottou, L., Gulrajani, I. & Lopez-Paz, D. (2019). Invariant Risk Minimization. arXiv preprint arXiv:1907.02893.
Marinazzo, D., Pellicoro, M. & Stramaglia, S. (2008). Kernel-Granger causality and the analysis of dynamical networks. Physical Review E, (5), 056215.
Marinazzo, D., Liao, W., Chen, H. & Stramaglia, S. (2011). Nonlinear connectivity by Granger causality. NeuroImage, (2), 330–338.
Mastakouri, A., Schölkopf, B. & Janzing, D. (2019). Selecting causal brain features with a single conditional independence test per feature. In Advances in Neural Information Processing Systems 32 (NeurIPS) (pp. 12532–12543). NeurIPS Foundation.
Mayberg, H. S., Lozano, A. M., Voon, V., McNeely, H. E., Seminowicz, D., Hamani, C., Schwalb, J. M. & Kennedy, S. H. (2005). Deep Brain Stimulation for Treatment-Resistant Depression. Neuron, (5), 651–660.
Mehler, D. M. A. & Kording, K. P. (2018). The lure of causal statements: rampant mis-inference of causality in estimated connectivity. arXiv preprint arXiv:1812.03363.
Mill, R. D., Ito, T. & Cole, M. W. (2017). From connectome to cognition: the search for mechanism in human functional brain networks. NeuroImage, 124–139.
Mooij, J. M., Magliacane, S. & Claassen, T. (2020). Joint causal inference from multiple contexts. Journal of Machine Learning Research, (99), 1–108.
Mumford, J. A. & Ramsey, J. D. (2014). Bayesian networks for fMRI: A primer. NeuroImage, 573–582.
Nitsche, M. A., Cohen, L. G., Wassermann, E. M., Priori, A., Lang, N., Antal, A., Paulus, W., Hummel, F., Boggio, P. S., Fregni, F. & Pascual-Leone, A. (2008). Transcranial direct current stimulation: State of the art 2008. Brain Stimulation, (3), 206–223.
Nunez, P. L. & Srinivasan, R. (2006). Electric Fields of the Brain: The Neurophysics of EEG (2nd ed.). Oxford University Press.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
Peters, J., Bauer, S. & Pfister, N. (2020). Causal models for dynamical systems. arXiv preprint arXiv:2001.06208.
Peters, J. & Bühlmann, P. (2014). Identifiability of Gaussian Structural Equation Models with Equal Error Variances. Biometrika, (1), 219–228.
Peters, J., Bühlmann, P. & Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (with discussion), (5), 947–1012.
Peters, J., Janzing, D. & Schölkopf, B. (2013). Causal Inference on Time Series using Structural Equation Models. In Advances in Neural Information Processing Systems 26 (NeurIPS) (pp. 585–592). NeurIPS Foundation.
Peters, J., Janzing, D. & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
Peters, J., Mooij, J. M., Janzing, D. & Schölkopf, B. (2014). Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research, 2009–2053.
Pfister, N., Bauer, S. & Peters, J. (2019a). Learning stable and predictive structures in kinetic systems. Proceedings of the National Academy of Sciences, (51), 25405–25411.
Pfister, N., Bühlmann, P. & Peters, J. (2018). Invariant Causal Prediction for Sequential Data. Journal of the American Statistical Association, (527), 1264–1276.
Pfister, N., Weichwald, S., Bühlmann, P. & Schölkopf, B. (2019b). Robustifying Independent Component Analysis by Adjusting for Group-Wise Stationary Noise. Journal of Machine Learning Research, (147), 1–50. (Co-first authorship between NP and SW.)
Poldrack, R. A. (2018). The New Mind Readers: What Neuroimaging Can and Cannot Reveal about Our Thoughts. Princeton University Press.
Power, J. D., Cohen, A. L., Nelson, S. M., Wig, G. S., Barnes, K. A., Church, J. A., Vogel, A. C., Laumann, T. O., Miezin, F. M., Schlaggar, B. L. et al. (2011). Functional network organization of the human brain. Neuron, (4), 665–678.
Ramsey, J. D., Hanson, S. J., Hanson, C., Halchenko, Y. O., Poldrack, R. A. & Glymour, C. (2010). Six problems for causal inference from fMRI. NeuroImage, (2), 1545–1558.
Rees, G., Kreiman, G. & Koch, C. (2002). Neural correlates of consciousness in humans. Nature Reviews Neuroscience, 261–270.
Reichenbach, H. (1956). The Direction of Time. University of California Press.
Reid, A. T., Headley, D. B., Mill, R. D., Sanchez-Romero, R., Uddin, L. Q., Marinazzo, D., Lurie, D. J., Valdés-Sosa, P. A., Hanson, S. J., Biswal, B. B. et al. (2019). Advancing functional connectivity research from association to causation. Nature Neuroscience, (10).
Rothenhäusler, D., Meinshausen, N., Bühlmann, P. & Peters, J. (2018). Anchor regression: heterogeneous data meets causality. arXiv preprint arXiv:1801.06229.
Rubenstein, P. K., Weichwald, S., Bongers, S., Mooij, J. M., Janzing, D., Grosse-Wentrup, M. & Schölkopf, B. (2017). Causal Consistency of Structural Equation Models. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence (UAI). (Co-first authorship between PKR and SW.)
Sanchez-Romero, R. & Cole, M. W. (2020). Combining Multiple Functional Connectivity Methods to Improve Causal Inferences. Journal of Cognitive Neuroscience, 1–15. Forthcoming.
Sanchez-Romero, R., Ramsey, J. D., Zhang, K., Glymour, M. R. K., Huang, B. & Glymour, C. (2019). Estimating feedforward and feedback effective connections from fMRI time series: Assessments of statistical methods. Network Neuroscience, (2), 274–306.
Scheines, R. & Ramsey, J. (2016). Measurement error and causal discovery. In CEUR Workshop Proceedings (Vol. 1792, pp. 1–7).
Schennach, S. M. (2016). Recent advances in the measurement error literature. Annual Review of Economics, 341–377.
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K. & Mooij, J. M. (2012). On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML) (pp. 1255–1262). Omnipress.
Shah, R. & Peters, J. (2020). The Hardness of Conditional Independence Testing and the Generalised Covariance Measure. Annals of Statistics (accepted), arXiv e-prints (1804.07203).
Shimizu, S., Hoyer, P. O., Hyvärinen, A. & Kerminen, A. (2006). A Linear Non-Gaussian Acyclic Model for Causal Discovery. Journal of Machine Learning Research, (10), 2003–2030.
Silva, R., Scheine, R., Glymour, C. & Spirtes, P. (2006). Learning the structure of linear latent variable models. Journal of Machine Learning Research, 191–246.
Skitovič, V. P. (1962). Linear combinations of independent random variables and the normal distribution law. Selected Translations in Mathematical Statistics and Probability, 211–228.
Smith, S. M. (2012). The future of FMRI connectivity. NeuroImage, (2), 1257–1266.
Smith, S. M., Miller, K. L., Salimi-Khorshidi, G., Webster, M., Beckmann, C. F., Nichols, T. E., Ramsey, J. D. & Woolrich, M. W. (2011). Network modelling methods for FMRI. NeuroImage, (2), 875–891.
Spirtes, P., Glymour, C. N. & Scheines, R. (2001). Causation, Prediction, and Search (2nd ed.). MIT Press.
Spirtes, P. & Scheines, R. (2004). Causal Inference of Ambiguous Manipulations. Philosophy of Science, (5), 833–845.
Steinberg, D. (2007). The Cholesterol Wars: The Skeptics vs the Preponderance of Evidence. Academic Press.
Stramaglia, S., Wu, G.-R., Pellicoro, M. & Marinazzo, D. (2012). Expanding the transfer entropy to identify information circuits in complex systems. Physical Review E, (6), 066211.
Stramaglia, S., Cortes, J. M. & Marinazzo, D. (2014). Synergy and redundancy in the Granger causal analysis of dynamical networks. New Journal of Physics, (10), 105003.
Tangermann, M., Müller, K.-R., Aertsen, A., Birbaumer, N., Braun, C., Brunner, C., Leeb, R., Mehring, C., Miller, K., Müller-Putz, G., Nolte, G., Pfurtscheller, G., Preissl, H., Schalk, G., Schlögl, A., Vidaurre, C., Waldert, S. & Blankertz, B. (2012). Review of the BCI Competition IV. Frontiers in Neuroscience, 55.
Todd, M. T., Nystrom, L. E. & Cohen, J. D. (2013). Confounds in multivariate pattern analysis: Theory and rule representation case study. NeuroImage, 157–165.
Truswell, A. S. (2010). Cholesterol and Beyond: The Research on Diet and Coronary Heart Disease 1900–2000. Springer.
Uddin, L. Q., Yeo, B. T. T. & Spreng, R. N. (2019). Towards a Universal Taxonomy of Macroscale Functional Human Brain Networks. Brain Topography, (6), 926–942.
Uhler, C., Raskutti, G., Bühlmann, P. & Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. Annals of Statistics, (2), 436–463.
Valdes-Sosa, P. A., Roebroeck, A., Daunizeau, J. & Friston, K. (2011). Effective connectivity: influence, causality and biophysical modeling. NeuroImage, (2), 339–361.
Varoquaux, G., Raamana, P. R., Engemann, D. A., Hoyos-Idrobo, A., Schwartz, Y. & Thirion, B. (2017). Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines. NeuroImage, 166–179. Individual Subject Prediction.
Varoquaux, G., Schwartz, Y., Poldrack, R. A., Gauthier, B., Bzdok, D., Poline, J.-B. & Thirion, B. (2018). Atlases of cognition with large-scale human brain mapping. PLOS Computational Biology, (11), 1–18.
Verma, T. & Pearl, J. (1990). Equivalence and Synthesis of Causal Models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI).
Vosskuhl, J., Strüber, D. & Herrmann, C. S. (2018). Non-invasive brain stimulation: a paradigm shift in understanding brain oscillations. Frontiers in Human Neuroscience, 211.
Waldorp, L., Christoffels, I. & van de Ven, V. (2011). Effective connectivity of fMRI data using ancestral graph theory: Dealing with missing regions. NeuroImage, (4), 2695–2705.
Weichwald, S. (2019). Pragmatism and Variable Transformations in Causal Modelling (Doctoral dissertation, ETH Zurich).
Weichwald, S., Gretton, A., Schölkopf, B. & Grosse-Wentrup, M. (2016a). Recovery of non-linear cause-effect relationships from linearly mixed neuroimaging data. In Pattern Recognition in Neuroimaging (PRNI), 2016 International Workshop on. IEEE.
Weichwald, S., Grosse-Wentrup, M. & Gretton, A. (2016b). MERLiN: Mixture Effect Recovery in Linear Networks. IEEE Journal of Selected Topics in Signal Processing, (7), 1254–1266.
Weichwald, S., Meyer, T., Özdenizci, O., Schölkopf, B., Ball, T. & Grosse-Wentrup, M. (2015). Causal interpretation rules for encoding and decoding models in neuroimaging. NeuroImage, 48–59.
Weichwald, S., Schölkopf, B., Ball, T. & Grosse-Wentrup, M. (2014). Causal and anti-causal learning in pattern recognition for neuroimaging. In Pattern Recognition in Neuroimaging (PRNI), 2014 International Workshop on. IEEE.
Woodward, J. (2005). Making Things Happen: A Theory of Causal Explanation. Oxford University Press.
Woolgar, A., Golland, P. & Bode, S. (2014). Coping with confounds in multivoxel pattern analysis: What should we do about reaction time differences? A comment on Todd, Nystrom & Cohen 2013. NeuroImage, 506–512.
Yeo, B. T. T., Krienen, F. M., Sepulcre, J., Sabuncu, M. R., Lashkari, D., Hollinshead, M., Roffman, J. L., Smoller, J. W., Zöllei, L., Polimeni, J. R., Fischl, B., Liu, H. & Buckner, R. L. (2011). The organization of the human cerebral cortex estimated by intrinsic functional connectivity. Journal of Neurophysiology, (3), 1125–1165.
Zhang, K., Gong, M., Ramsey, J., Batmanghelich, K., Spirtes, P. & Glymour, C. (2018). Causal discovery with linear non-Gaussian models under measurement error: structural identifiability results. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press.
Zhang, K., Huang, B., Zhang, J., Glymour, C. & Schölkopf, B. (2017). Causal discovery from nonstationary/heterogeneous data: skeleton estimation and orientation determination. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI) (pp. 1347–1353).
Zhang, K. & Hyvärinen, A. (2009). On the Identifiability of the Post-Nonlinear Causal Model. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press.
Zheng, X., Dan, C., Aragam, B., Ravikumar, P. & Xing, E. P. (2020). Learning sparse nonparametric DAGs. In