MERLiN: Mixture Effect Recovery in Linear Networks
Sebastian Weichwald, Moritz Grosse-Wentrup, Arthur Gretton
This is the author's version of an article that is published in IEEE Journal of Selected Topics in Signal Processing.
Abstract—Causal inference concerns the identification of cause-effect relationships between variables, e.g. establishing whether a stimulus affects activity in a certain brain region. The observed variables themselves often do not constitute meaningful causal variables, however, and linear combinations need to be considered. In electroencephalographic studies, for example, one is not interested in establishing cause-effect relationships between electrode signals (the observed variables), but rather between cortical signals (the causal variables) which can be recovered as linear combinations of electrode signals. We introduce MERLiN (Mixture Effect Recovery in Linear Networks), a family of causal inference algorithms that implement a novel means of constructing causal variables from non-causal variables. We demonstrate through application to EEG data how the basic MERLiN algorithm can be extended for application to different (neuroimaging) data modalities. Given an observed linear mixture, the algorithms can recover a causal variable that is a linear effect of another given variable. That is, MERLiN allows us to recover a cortical signal that is affected by activity in a certain brain region, while not being a direct effect of the stimulus. The Python/Matlab implementation for all presented algorithms is available on https://github.com/sweichwald/MERLiN.
Index Terms—causal inference, causal variable construction, linear mixtures
I. INTRODUCTION
Causal inference requires causal variables. Observed variables do not themselves always constitute the causal relata that admit meaningful causal statements, however, and transformations of the variables might be required to isolate causal signals. Images, for example, consist of microscopic variables (pixel colour values) while the identification of meaningful cause-effect relationships requires the construction of macroscopic causal variables (e.g. whether the image shows a magic wand) [1]. That is, it is implausible that a description of effects of changing the colour value of one single pixel, the microscopic variable, leads to characterisation of a meaningful cause-effect relationship; however, the existence of a magic wand, the macroscopic variable, may lead to meaningful statements of the form "manipulating the image such that it shows a magic wand affects the chances of little Maggie favouring this image". A similar problem often occurs whenever only a linear mixture of causal variables can be observed. In electroencephalography (EEG) studies, for
example, measurements at electrodes placed on the scalp are considered to be instantaneously and linearly superimposed electromagnetic activity of sources in the brain [2]. Again, statements about the microscopic variables are meaningless, e.g. "manipulating the electrode's signal affects the subject's attentional state"; however, macroscopic variables such as the activity in the parietal cortex, extracted as a linear combination of electrode signals, admit meaningful causal statements such as "manipulating the activity in the parietal cortex affects the subject's attentional state". Standard causal inference methods require that all underlying causal variables, i.e., the sources in the brain, are first constructed (or rather recovered) from the observed mixture, i.e., the electrode signals.
There exists a plethora of methods to construct macroscopic variables both from images and linear mixtures. However, prevailing methods to learn visual features [3], [4], [5] ignore the causal aspect, and are fundamentally different from the recent and (to our knowledge) only work that demonstrates how visual causal features can be learned by a sequence of interventional experiments [1]. Likewise, methods to (re-)construct variables from linear mixtures commonly ignore the causal aspect and often rest on implausible assumptions.

Sebastian Weichwald is with the Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany, and has been with the Centre for Computational Statistics and Machine Learning, University College London, London, United Kingdom, e-mail: [email protected].
Moritz Grosse-Wentrup is with the Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany, e-mail: [email protected].
Arthur Gretton is with the Gatsby Computational Neuroscience Unit, Sainsbury Wellcome Centre, London, United Kingdom, e-mail: [email protected].
For instance, independent component analysis (ICA), commonly employed in the analysis of EEG data, rests on the assumption of mutually independent sources [6], [7], [8]. One may argue that muscular or ocular artefacts are independent of the cortical sources and can be extracted via ICA [9], [10]. It seems implausible, though, that cortical sources are mutually independent. In fact, if they were mutually independent there would be no cause-effect relationships between them. Thus, methods ignoring the causal aspect are unsuited to construct meaningful causal variables.
Mixture Effect Recovery in Linear Networks (MERLiN) aims to construct a causal variable from a linear mixture without requiring multiple interventional experiments. The fundamental idea is to directly search for statistical in- and dependences that imply, under assumptions discussed below, a certain cause-effect relationship. In essence, given iid samples of a univariate randomised variable S, a univariate causal effect C1 of S, and a multivariate variable F, MERLiN searches for a linear combination w such that w^T F is a causal effect of C1, i.e., S → C1 → w^T F. This implements the novel idea to construct causal variables such that the resulting statistical properties guarantee meaningful cause-effect relationships.

Fig. 1. Problem illustration where S → C1 → C2, C3 is the underlying causal graph and F1, F2, F3 are observed variables that are a linear mixture of C1, C2, C3.

As an illustration, consider the directed acyclic graph (DAG) S → C1 → C2, C3 shown in Figure 1, where the gap indicates that C3 is disconnected from all other variables. In this notation edges denote cause-effect relationships starting at the cause and pointing towards the effect. S denotes a randomised variable. We assume that only a linear mixture
F = A [C1, C2, C3]^T (A is an unknown mixing matrix) of the causal variables C1, C2, C3 can be observed, and that v such that C1 = v^T F is known. MERLiN's goal is to recover from the observed linear mixture F = [F1, F2, F3]^T a causal variable that is an effect of C1, i.e., to find w such that w^T F is an effect of C1, where w^T F = C2 is a valid solution.
We introduce the basic MERLiN_{Σ^{-1}} algorithm that can recover the sought-after causal variable when the cause-effect relationships are linear with additive Gaussian noise (cf. Section III). We also demonstrate how the algorithm can be extended for application to different (neuroimaging) data modalities by (a) including data specific preprocessing steps into the optimisation procedure, and (b) incorporating a priori domain knowledge (cf. Section V). Here, both concepts are demonstrated through application to EEG data when cause-effect relationships within individual frequency bands are of interest.
Thus, we present three related algorithms MERLiN_{Σ^{-1}}, MERLiN^{bp}_{Σ^{-1}}, and MERLiN^{bp+}_{Σ^{-1}} that are based on optimisation of precision matrix entries (indicated by the subscript Σ^{-1}). The MERLiN^{bp}_{Σ^{-1}} and MERLiN^{bp+}_{Σ^{-1}} algorithms include preprocessing steps that allow application to timeseries data, identifying a linear combination of timeseries signals such that the resulting log-bandpower (indicated by the superscript bp) reveals the sought-after cause-effect relationship. A further extension (indicated by the superscript bp+) takes domain knowledge about time-lags into account.
For stimulus-based neuroimaging studies [11], the MERLiN^{bp}_{Σ^{-1}} and MERLiN^{bp+}_{Σ^{-1}} algorithms can establish a cause-effect relationship between brain state features that are observed only as part of a linear mixture. As such, MERLiN is able to provide insights into brain networks beyond those readily obtained from encoding and decoding models trained on pre-defined variables [12].
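The setup of Figure 1 can be written out numerically; the following is a toy sketch (mixing matrix, coefficients, and sample size are arbitrary illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000

# Underlying causal variables: S -> C1 -> C2, with C3 disconnected (Figure 1).
S = rng.standard_normal(m)
C1 = S + rng.standard_normal(m)
C2 = C1 + rng.standard_normal(m)
C3 = rng.standard_normal(m)
C = np.vstack([C1, C2, C3])                        # (3, m)

# Only the linear mixture F = A C is observed; A is unknown to the algorithm.
A, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # random orthogonal mixing
F = A @ C

# v is assumed known and extracts C1 from the mixture: C1 = v^T F.
v = np.linalg.inv(A)[0, :]
assert np.allclose(v @ F, C1)

# Goal: find w such that w^T F is an effect of C1; recovering C2 is valid.
w_valid = np.linalg.inv(A)[1, :]
assert np.allclose(w_valid @ F, C2)
```

The sketch only demonstrates that, in principle, rows of A^{-1} undo the mixing; MERLiN has to find such a w without knowing A.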
Furthermore, it employs the framework of Causal Bayesian Networks (CBNs) that has recently been fruitful in the neuroimaging community [13], [14], [15], [16], [12], [17] — the important advantage over methods based on information flow being that it yields testable predictions on the impact of interventions [18], [19]. MERLiN shows good performance both on synthetic and EEG data recorded during neurofeedback experiments. The Python/Matlab implementation for all presented algorithms is available on https://github.com/sweichwald/MERLiN.

II. PRELIMINARIES
A. Causal Bayesian Networks
In general causal inference requires three steps:
1) construction of (causal) variables
2) inference of cause-effect relationships among the variables defined in 1)
3) estimation of the functional form and strength of the causal links established in 2)
In this manuscript we focus on and merge the first two steps. More specifically, a causal variable is implicitly constructed by optimising for properties that at the same time establish a specific cause-effect relationship for this variable. Here we briefly introduce the main aspects of Causal Bayesian Networks (CBNs) that define causation in terms of effects of interventions and allow inference of cause-effect relationships (step 2) from conditional independences in the observed distribution. For an exhaustive treatment see [20], [21].
Definition 1 (Structural Equation Model). We define a structural equation model (SEM) S as a set of structural equations X_i = f_i(PA_i, N_i), i ∈ N_s := {n ∈ N : 1 ≤ n ≤ s}, where the so-called noise variables are independently distributed according to P_{N_1,...,N_s} = P_{N_1} ··· P_{N_s}. For i ∈ N_s the set PA_i ⊆ {X_1, ..., X_s} \ X_i contains the so-called parents of X_i and f_i describes how X_i relates to the random variables in PA_i and N_i. This induces the unique joint distribution denoted by P_S := P_{X_1,...,X_s}.

Replacing at least one of the functions f_i, i ∈ N_s by a constant ♠ yields a new SEM. We say X_i has been intervened on, which is denoted by do(X_i = ♠), leads to the SEM S|do(X_i = ♠), and induces the interventional distribution P_{S|do(X_i=♠)} := P_{X_1,...,X_s|do(X_i=♠)}.

Definition 2 (Cause and Effect). X_i is a cause of X_j (i, j ∈ N_s, i ≠ j) wrt. a SEM S iff there exists ♥ ∈ R such that P_{X_j|do(X_i=♥)} ≠ P_{X_j}. X_j is an effect of X_i iff X_i is a cause of X_j. Often the considered SEM S is omitted if it is clear from the context.

For each SEM S there is a corresponding graph G_S(V, E) with V := {X_1, ..., X_s} and E := {(X_i, X_j) : X_i ∈ PA_j, X_j ∈ V} that has the random variables as nodes and directed edges pointing from parents to children. We employ the common assumption that this graph is acyclic, i.e., G_S will always be a directed acyclic graph (DAG).

It is insightful to consider the following implication of Definition 2: If in G_S there is no directed path from X_i to X_j, X_i is not a cause of X_j (wrt. S). The following example shows that without further assumptions the converse is not true in general, i.e., existence of a path does not generally imply a cause-effect relationship. This nuisance will be accounted for by the faithfulness assumption (cf. Definition 6 below). We provide supportive graphical depictions in Figure 2.
Note that the distribution P_S has a density if P_{N_1,...,N_s} has a density and the functions f_i, i ∈ N_s are differentiable. P_{X_j|do(X_i=♥)} and P_{X_j} denote the marginal distributions of X_j corresponding to P_{S|do(X_i=♥)} and P_S respectively.

Fig. 2. SEMs and graphs accompanying Example 3: (a) the SEM S with structural equations X1 = N1, X2 = −X1 + N2, X3 = X1 + X2 + N3 and its graph G_S; (b) the SEM S|do(X1 = ♥) and graph G_{S|do(X1=♥)}; (c) the SEM S|do(X2 = ♥) and graph G_{S|do(X2=♥)}; (d) the SEM S|do(X3 = ♥) and graph G_{S|do(X3=♥)}.

Example 3.
Consider a SEM S with structural equations and graph G_S shown in Figure 2a and noise variables (N1, N2, N3) ∼ N(0, I_{3×3}). In G_S there is a directed path (in fact even a directed edge) from X1 to X3 while P_{X3|do(X1=♥)} = P_{X3} = P_{N2+N3} = N(0, 2) for all ♥ ∈ R, i.e., intervening on X1 does not alter the distribution of X3. That is, X1 is not a cause of X3 wrt. S despite the existence of the edge X1 → X3 (cf. Figure 2b).

Observe that P_{X3|do(X2=♥)} = N(♥, 2) ≠ N(0, 2) = P_{X3} for ♥ ≠ 0, i.e., X2 is, as one may intuitively expect, a cause of X3 wrt. S (cf. Figure 2c). Likewise, X3 indeed turns out not to be a cause of X1 or X2 as can be seen from Figure 2d.

So far a DAG G_S simply depicts all parent-child relationships defined by the SEM S. Missing directed paths indicate missing cause-effect relationships. In order to specify the link between statistical independence (denoted by ⊥⊥) wrt. the joint distribution P_S and properties of the DAG G_S (representing a SEM S) we need the following definitions.

Definition 4 (d-separation). For a fixed graph G disjoint sets of nodes A and B are d-separated by a third disjoint set C (denoted by A ⊥d-sep B | C) iff all pairs of nodes a ∈ A and b ∈ B are d-separated by C. A pair of nodes a ≠ b is d-separated by C iff every path between a and b is blocked by C. A path between nodes a and b is blocked by C iff there is an intermediate node z on the path such that (i) z ∈ C and z is tail-to-tail (← z →) or head-to-tail (→ z →), or (ii) z is head-to-head (→ z ←) and neither z nor any of its descendants is in C.

Definition 5 (Markov property). A distribution P_{X_1,...,X_s} satisfies the global Markov property wrt. a graph G if A ⊥d-sep B | C ⟹ A ⊥⊥ B | C. It satisfies the local Markov property wrt. G if each node is conditionally independent of its non-descendants given its parents. Both properties are equivalent if P_{X_1,...,X_s} has a density (cf. [22, Theorem 3.27]); in this case we say P_{X_1,...,X_s} is Markov wrt. G.

Definition 6 (Faithfulness). P_S generated by a SEM S is said to be faithful wrt. G_S if A ⊥d-sep B | C ⟸ A ⊥⊥ B | C.

Conveniently the distribution P_S generated by a SEM S is Markov wrt. G_S (cf. [21, Theorem 1.4.1] for a proof).
Hence, if we assume faithfulness, conditional independences and d-separation properties become equivalent:

A ⊥d-sep B | C ⟺ A ⊥⊥ B | C.

Summing up, we have defined interventional causation in terms of SEMs and have seen how a SEM gives rise to a DAG. This DAG has two convenient features. Firstly, the DAG yields a visualisation that makes it easy to grasp missing cause-effect relationships, which correspond to missing directed paths. Secondly, assuming faithfulness, d-separation properties of this DAG are equivalent to conditional independence properties of the joint distribution. Thus, conditional independences translate into causal statements, e.g. 'a variable becomes independent of all its non-effects given its immediate causes' or 'cause and effect are marginally dependent'. Furthermore, the causal graph G_S can be identified from conditional independences observed in P_S — at least up to a so-called Markov equivalence class, the set of graphs that entail the same conditional independences [23].
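Example 3 can also be verified numerically; a minimal simulation sketch (sample size and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200_000

def sample_x3(do_x1=None):
    """Sample X3 from the SEM of Example 3, optionally under do(X1 = c)."""
    n1, n2, n3 = rng.standard_normal((3, m))
    x1 = np.full(m, do_x1) if do_x1 is not None else n1
    x2 = -x1 + n2
    x3 = x1 + x2 + n3        # the X1 contributions cancel: X3 = N2 + N3
    return x3

# X3 ~ N(0, 2) regardless of any intervention on X1, i.e. X1 is not a
# cause of X3 even though the edge X1 -> X3 exists.
for c in (None, 0.0, 3.0):
    x3 = sample_x3(do_x1=c)
    assert abs(x3.mean()) < 0.05 and abs(x3.var() - 2.0) < 0.05
```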
B. Optimisation on the Stiefel manifold
The proposed algorithms require optimisation of objective functions over the unit sphere S^{d−1} := {x ∈ R^d : ||x|| = 1}. For generality we view the sphere as a special case of the Stiefel manifold V_{d×p} := {M ∈ R^{d×p} : M^T M = I_{p×p}} (p ≤ d) for p = 1. Implementing the respective objective functions in Theano [24], [25], we use the Python toolbox Pymanopt [26] to perform optimisation directly on the respective Stiefel manifold using a steepest descent algorithm with standard back-tracking line-search. This approach is exact and efficient, relying on automated differentiation and respecting the manifold geometry.
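The core idea of optimisation on the sphere can be illustrated without Pymanopt; the following is a bare-bones numpy sketch (not the paper's implementation) of Riemannian gradient descent with a crude back-tracking rule, shown on a Rayleigh-quotient objective whose optimum is known:

```python
import numpy as np

def sphere_descent(grad_f, f, x0, steps=2000, step=0.1):
    """Minimise f over the unit sphere: project the Euclidean gradient onto
    the tangent space, take a step, and renormalise (retraction)."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        g = grad_f(x)
        g_tan = g - (g @ x) * x            # tangent-space projection
        if np.linalg.norm(g_tan) < 1e-10:
            break
        x_new = x - step * g_tan
        x_new /= np.linalg.norm(x_new)     # back onto the sphere
        if f(x_new) > f(x):
            step *= 0.5                    # crude back-tracking
        else:
            x = x_new
    return x

# Minimising -x^T M x over the sphere recovers the top eigenvector of M.
rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
M = G @ G.T
x = sphere_descent(lambda x: -2 * M @ x, lambda x: -x @ M @ x,
                   rng.standard_normal(5))
top = np.linalg.eigh(M)[1][:, -1]
assert abs(abs(x @ top) - 1.0) < 1e-3
```

Pymanopt automates exactly these ingredients (tangent projection, retraction, line-search) and obtains the gradient by automatic differentiation.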
III. THE BASIC MERLiN ALGORITHM
We consider a situation in which only a linear combination of observed variables constitutes a meaningful causal variable. These scenarios naturally arise if only samples of a linear mixture F1, ..., Fd' of the underlying causal variables C1, ..., Cd are accessible (cf. Figure 1). Standard causal inference methods cannot infer cause-effect relationships among the causal variables C1, ..., Cd without first undoing the unknown linear mixing (also known as blind source separation). MERLiN aims to establish a cause-effect relationship among causal variables in a linear network while reconstructing a causal variable at the same time. In other words, a causal variable is reconstructed by optimising for the statistical properties that imply a certain kind of cause-effect relationship.
In this section we first provide the formal problem description. We then derive sufficient conditions for the kind of cause-effect relationship MERLiN aims to establish, and discuss assumptions on the linear mixing. Finally, the basic precision matrix based MERLiN algorithm is introduced, which optimises for these sufficient statistical properties in order to recover a linear causal effect from an observed linear mixture.

Fig. 3. Example of a causal graph underlying the described problem scenario (cf. Section III-A). h is a hidden variable.

For simplicity we assume that distributions have a density wrt. some product measure throughout this text.
Intuitively, this is saying that conditional independences are due to the causal structure and not accidents of parameter values [20, p. 9].
For the experiments presented in this manuscript we set both the minimum step size and gradient norm to − (arbitrary choice) and the maximum number of steps to (generous choice based on preliminary test runs that met the former stopping criteria in much earlier iterations). Our toolbox allows to adjust both parameters.

A. Formal problem description
The terminology introduced in Section II-A allows us to precisely state the problem as follows.
1) Assumptions:
Let S and C1, ..., Cd denote (finitely many) real-valued random variables. We assume existence of a SEM S, potentially with unobserved variables h1, ..., hl, that induces P_S = P_{S, C1, ..., Cd, h1, ..., hl}. We refer to the corresponding graph G_S as the true causal graph and call its nodes causal variables. We further assume that
• S affects C2 indirectly via C1,
• P_S is faithful wrt. G_S,
• there are no edges in G_S pointing into S.
In an experimental setting the last condition is ensured by randomising S. Figure 3 depicts an example of how G_S might look.
2) Given data:
• m iid samples S = [s1, ..., sm]^T of S and F = [f_{i,j}]_{i=1:m, j=1:d'} of F, where F := [F1, ..., Fd']^T = A C is the observed linear mixture of the causal variables C := [C1, ..., Cd]^T and A ∈ R^{d'×d} denotes the unknown mixing matrix
• the linear combination v ∈ R^{d'} that extracts the causal variable C1 = v^T F is assumed known
That is, we have samples of S, F, and C1 but not of C2, ..., Cd, where F is an unknown linear mixture of C1, ..., Cd.

By saying a variable X causes Z indirectly via Y we imply (a) existence of a directed path X ⇢ Y ⇢ Z, and (b) that there is no directed path X ⇢ Z without Y on it (this also excludes the edge X → Z).
Randomisation corresponds to an intervention: the structural equation of S is replaced by S = N_S where N_S is an independent randomisation variable, e.g. assigning placebo or treatment according to an independent Bernoulli variable.
iid: independent and identically distributed
3) Desired output:
Find w ∈ R^{d'} such that aCi = w^T F where Ci is an effect of C1 (i ∈ N_d, a ∈ R \ {0}). In other words, the aim is to recover a causal variable (up to scaling) that is an effect of C1. For example, recovery of the causal variable C2 is a valid solution. The factor a reflects the scale indeterminacy that results from the linear mixing, i.e., since A is unknown the scale of the causal variables cannot be determined unless further assumptions are employed or a priori knowledge is available.

B. MERLiN's strategy
We are given that there exists at least one causal variable C2 that is indirectly affected by S via C1. However, we only have access to samples of the linear mixture F and samples of S. Note the following properties of C2:
• Since P_S is faithful wrt. G_S it follows that C2 ⊥̸⊥ C1 (and C2 ⊥̸⊥ S).
• Since P_S is Markov wrt. G_S it follows that C2 ⊥⊥ S | C1.
We can derive the following sufficient conditions for a causal variable to be indirectly affected by S via C1.

Claim 7. Given the assumptions in Section III-A1 and a causal variable Y. If Y ⊥̸⊥ C1 and Y ⊥⊥ S | C1, then S indirectly affects Y via C1. In particular, a directed path from C1 to Y, denoted by C1 ⇢ Y, exists.

Proof:
From Y ⊥̸⊥ C1 and P_S being Markov wrt. G_S it follows that Y and C1 are not d-separated in G_S by the empty set. In G_S there must be at least one path C1 ⇢ Y, C1 ⇠ Y, or C1 ⇠ X ⇢ Y for some node X. By assumption C1 is affected by S, i.e., we have S ⇢ C1 in G_S. Hence, in G_S there must be at least one path S ⇢ C1 ⇢ Y, S ⇢ C1 ⇠ Y, or S ⇢ C1 ⇠ X ⇢ Y for some node X. Under the assumption of faithfulness, the latter two cases contradict Y ⊥⊥ S | C1. Hence, in G_S at least one path S ⇢ C1 ⇢ Y exists.
From Y ⊥⊥ S | C1 and P_S being faithful wrt. G_S it follows that Y and S are d-separated in G_S by C1. That is, given C1 every path between S and Y is blocked. In particular, in G_S there is no edge S → Y and no path S ⇢ Y without C1 on it. Hence, Y is indeed indirectly affected by S via C1.
This leads to our general idea on how to find a linear combination that recovers a causal effect of C1. If MERLiN finds w ∈ R^{d'} such that the following two statistical properties hold true
(a) w^T F ⊥̸⊥ C1, and
(b) w^T F ⊥⊥ S | C1,
then we have identified a candidate causal effect of C1, i.e., we have identified a variable such that S → C1 → w^T F. Note that conditioning on S cannot unblock a path that was blocked before since there are no edges pointing into S; conversely the conditional dependence w^T F ⊥̸⊥ C1 | S implies the marginal dependence w^T F ⊥̸⊥ C1. Hence, MERLiN can also optimise for the following alternative statistical properties
(a') w^T F ⊥̸⊥ C1 | S, and
(b) w^T F ⊥⊥ S | C1,
to recover a candidate causal effect of C1.
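In the linear Gaussian case such (in)dependence conditions can be checked via partial correlations (regress out the conditioning variable, correlate the residuals); a minimal sketch with illustrative variables:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing out z from both."""
    Z = np.column_stack([z, np.ones_like(z)])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
m = 100_000
S = rng.standard_normal(m)
C1 = S + rng.standard_normal(m)
C2 = C1 + rng.standard_normal(m)   # a true effect of C1
C3 = rng.standard_normal(m)        # a disconnected variable

assert abs(partial_corr(C2, C1, S)) > 0.5    # (a') holds for C2
assert abs(partial_corr(C2, S, C1)) < 0.02   # (b)  holds for C2
assert abs(partial_corr(C3, C1, S)) < 0.02   # (a') fails for C3
```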
This reformulation is useful since it allows optimisation of two analogous conditional (in)dependence properties instead of marginal and conditional (in)dependence. Ideally and under mixing assumptions discussed below, optimising w wrt. these statistical properties will indeed recover a causal variable, i.e., w^T F = aCi (i ∈ N_d, a ∈ R \ {0}), that is an effect of C1. Note that this approach even works in the presence of hidden confounders.

C. Mixing assumptions
MERLiN's strategy presented in the previous section is to optimise a linear combination of the observed linear mixture such that two statistical properties are fulfilled. However, without imposing further assumptions on the linear mixing it may be impossible to recover the desired causal variable by this procedure. Here we discuss problems that can occur for arbitrary mixing and specify our mixing assumptions, namely that A is an orthogonal d × d matrix.
In the first place, there may not exist a solution to MERLiN's problem if A has rank less than d. Hence, assume that A has rank d and, for simplicity, that A is a square d × d matrix. This guarantees existence of a solution: if the mixing matrix A is invertible a solution to the problem is to recover C2 via w^T = (A^{-1})_{2,1:d}.
However, if we only assume A to be invertible MERLiN may not be able to recover (a multiple of) a causal variable Ci from the sought-after statistical properties alone. The following example demonstrates the problem that arises from the fact that C1 itself is part of the observed linear mixture.

Example 8.
Assume S → C1 → C2, C3 is the true but unknown causal graph, where the gap indicates that C3 is disconnected from all other variables. Assume all variables are non-degenerate and that the unknown mixing matrix A is invertible. Then, any variable recovered as linear combination from the observed linear mixture F = A C = A [C1, C2, C3]^T can be written as

(a (A^{-1})_{1,1:3} + b (A^{-1})_{2,1:3} + c (A^{-1})_{3,1:3}) F = aC1 + bC2 + cC3 =: Y_{a,b,c}

for some a, b, c ∈ R. MERLiN aims to recover the causal variable C2 (up to scaling) by optimising a, b, c such that the statistical properties
• Y_{a,b,c} ⊥̸⊥ C1 (or equivalently Y_{a,b,c} ⊥̸⊥ C1 | S), and
• Y_{a,b,c} ⊥⊥ S | C1
hold true (cf. Section III-B). Indeed bC2 = Y_{0,b,0} (b ≠ 0) fulfils these statistical properties and is the desired output. However, all Y_{a,0,c} (a, c ≠ 0) likewise fulfil the statistical properties while not being (a multiple of) a causal variable.

This example demonstrates that without imposing further constraints on the linear mixing MERLiN may recover C1 (ensuring the dependence on S) with independent variables added on top (ensuring conditional independence of S given C1), e.g. Y_{1,0,1} = C1 + C3 in above example. Although the sought-after statistical properties hold true for this variable, this is clearly not a desirable output and does not recover a causal variable.
One way to mitigate this situation is to restrict search to the orthogonal complement v⊥ of v. This way, the signal of C1 in the linear mixture F is attenuated. In particular, if the mixing matrix A is orthogonal restricting search to v⊥ amounts to complete removal of C1's signal from F.

Note that A being at least rank d is not a necessary condition, i.e., an effect of C1 may be recoverable even in cases where A has rank less than d. As an example consider the case where C2 is an effect of C1 and A = [I_{d'×d'} 0_{d'×(d−d')}]. However, the focus is a sufficient condition for the existence of a solution.
We therefore assume that A is an orthogonal d × d matrix and restrict the search to v⊥. It is then no longer possible to add arbitrary multiples of C1 onto independent variables to introduce the sought-after dependence, i.e., the recovery of non-causal variables like C1 + C3 in above example is prevented.
Note that while adding independent variables onto effects is still possible (e.g. consider Y_{0,1,1} = C2 + C3 in above example), it will be counter-acted by setting up the objective function accordingly — roughly speaking, as we 'maximise dependence', these independent variables will be suppressed, since they act as noise and reduce dependence.

D. MERLiN_{Σ^{-1}}: precision matrix magic

The basic MERLiN algorithm aims to recover a linear causal effect from an observed linear mixture. In particular, we assume that the cause-effect relationships S → C1 → C2 between the underlying causal variables S, C1 and C2 are linear with additive Gaussian noise. In such a linear network, zero entries in the precision matrix imply missing edges in the graph [22]. Hence, if Y is a linear effect of C1 the precision matrix of the three variables S, C1 and Y is of the form

Σ^{-1} := Σ^{-1}_{S,C1,Y} = [ ⋆ ⋆ 0 ]
                            [ ⋆ ⋆ ⋆ ]
                            [ 0 ⋆ ⋆ ]

where stars indicate non-zero entries. This implies the partial correlations ρ_{Y,C1|S} = ⋆ and ρ_{Y,S|C1} = 0 which, in the Gaussian case, amounts to the desired conditional (in)dependences (a') Y ⊥̸⊥ C1 | S and (b) Y ⊥⊥ S | C1 (cf. Section III-B) [27]. Exploiting this link, the precision matrix based MERLiN_{Σ^{-1}} algorithm (cf. Algorithm 1) implements the general idea presented in Section III-B by maximising the objective function

f(w) = |(Σ̂^{-1}_w)_{2,3}| − |(Σ̂^{-1}_w)_{1,3}|

where Σ̂^{-1}_w := Σ̂^{-1}_{S,C1,w^T F} (here we assumed d ≤ m and invertibility).
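This objective can be written out directly; a minimal numpy sketch with illustrative variables (for clarity, the projection onto v⊥ and the manifold optimisation are omitted here):

```python
import numpy as np

def merlin_objective(w, S, C1, F):
    """f(w) = |(Sigma^-1)_{2,3}| - |(Sigma^-1)_{1,3}| for (S, C1, w^T F)."""
    X = np.column_stack([S, C1, F.T @ w])          # samples of the 3 variables
    X = X - X.mean(axis=0)                         # centering, as done by H_m
    prec = np.linalg.inv(X.T @ X / (len(S) - 1))   # empirical precision matrix
    return abs(prec[1, 2]) - abs(prec[0, 2])

rng = np.random.default_rng(0)
m = 50_000
S = rng.standard_normal(m)
C1 = S + rng.standard_normal(m)
C2 = C1 + rng.standard_normal(m)    # a true linear effect of C1
C3 = rng.standard_normal(m)         # an unrelated variable
F = np.vstack([C2, C3])             # identity "mixing", for illustration only

w_effect, w_noise = np.array([1.0, 0.0]), np.array([0.0, 1.0])
assert merlin_objective(w_effect, S, C1, F) > merlin_objective(w_noise, S, C1, F)
```

For w_effect both desired properties hold, so the first precision entry is large and the second vanishes; for w_noise both entries vanish and the objective is near zero.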
Optimisation is performed over the unit sphere S^{d−2} after projecting F onto the orthogonal complement v⊥.

IV. SIMULATION EXPERIMENTS
A. Description of synthetic data

D^{d×m}_T(a, b) denotes the synthetic dataset that is generated by Algorithm 2. It consists of samples of an orthogonal linear mixture of underlying causal variables that follow the causal graph shown in Figure 3. The parameters a and b determine

For numerical reasons one might want to use the approximation √(·² + ε) ≈ |·| for small 0 < ε ∈ R to ensure that f is differentiable everywhere.

Algorithm 1
MERLiN_{Σ^{-1}}
Input: S ∈ R^{m×1}, F ∈ R^{d×m}, v ∈ R^{d×1}
Procedure:
• set C1 := F^T v and F := P(v) F ∈ R^{(d−1)×m}
• define the objective function for w ∈ S^{d−2} as
  f(w) = |(Σ̂^{-1}_w)_{2,3}| − |(Σ̂^{-1}_w)_{1,3}|
  where the empirical precision matrix is
  Σ̂^{-1}_w = ( (1/(m−1)) [S, C1, F^T w]^T H_m [S, C1, F^T w] )^{-1}
• optimise f as described in Section II-B to obtain the vector w* ∈ S^{d−2}
Output: w = P(v)^T w* ∈ S^{d−1}
Definitions:
• P(v) is the (d−1) × d orthonormal matrix that accomplishes projection onto the orthogonal complement v⊥
• H_m = I_{m×m} − (1/m) 1_{m×m} is the m × m centering matrix

the statistical properties of the generated dataset as follows. The parameter b adjusts the severity of hidden confounding between C1 and C4. Note also that the link between S and C2 is weaker for higher values of b, i.e., corr(S, C2) = 1/√(2 + b² + a²). The link between C1 and C2 becomes noisier for higher values of a, i.e., corr(C1, C2) = √(2 + b²)/√(2 + b² + a²). Furthermore, the value of the objective function for recovering C2 is lower for higher values of a since, in the infinite sample limit, we have

|(Σ^{-1}_{S,C1,C2})_{2,3}| − |(Σ^{-1}_{S,C1,C2})_{1,3}| = 1/a².

Hence, these datasets allow us to analyse the behaviour of the algorithm for cause-effect relationships of different strengths and its robustness against hidden confounding.
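The generation procedure of Algorithm 2 can be sketched in numpy as follows; this is a hedged reconstruction (the exact variable indices of the SEM are assumptions), and the final assertion Monte Carlo checks the stated correlation corr(S, C2) = 1/√(2 + b² + a²):

```python
import numpy as np

def generate_dataset(d, m, a, b, T="G", seed=0):
    """Sketch of the synthetic dataset generation (indices are assumptions)."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal mixing
    v, w_G = A[:, 0], A[:, 1]            # rows 1 and 2 of A^{-1} = A^T
    mu = rng.standard_normal(d + 1)      # mean parameters mu_1..mu_d, mu_h
    N = rng.standard_normal((d, m))
    h = mu[-1] + rng.standard_normal(m)  # hidden confounder
    S = rng.choice([-1.0, 1.0], m) if T == "B" else rng.standard_normal(m)
    C = mu[:d, None] + N
    C[0] += S + b * h                    # C1 = mu1 + N1 + S + b*h
    C[1] += (a - 1) * N[1] + C[0]        # C2 = mu2 + a*N2 + C1
    C[2] += S                            # C3 = mu3 + N3 + S
    C[3] += b * h                        # C4 = mu4 + N4 + b*h
    return S, A @ C, v, w_G              # observed mixture F = A C

S, F, v, w_G = generate_dataset(d=6, m=100_000, a=1.0, b=1.0)
# w_G extracts C2 from the mixture; corr(S, C2) = 1/sqrt(2 + 1 + 1) = 0.5
assert abs(np.corrcoef(S, w_G @ F)[0, 1] - 0.5) < 0.03
```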
B. Assessing MERLiN’s performance
We introduce two performance measures to assess MERLiN's performance on synthetic data with known ground truth w_G. Since a solution can only be identified up to scaling, we only need to consider the (d−1)-sphere S^{d−1} = {x ∈ R^d : ||x|| = 1}. The closer a vector w ∈ S^{d−1} or its negation −w is to the ground truth w_G ∈ S^{d−1} the better. This leads to the performance measure of angular distance

andi_{w_G}(w) := min(∠(w, w_G), ∠(−w, w_G)) ∈ [0, π/2].

Another approach is to assess the quality of the recovered w by the probability of obtaining a vector that is closer to w_G if chosen uniformly at random on the (d−1)-sphere. We define the probability of a better vector as

pobv_{w_G}(w) := P[|w_r · w_G| > |w · w_G|]

where w_r ∼ Unif(S^{d−1}) and d is the dimension of the input vector. This quantity is obtained by dividing the area

Algorithm 2
Generating the synthetic dataset D d × mT ( a, b ) . Input: d, m ∈ N , a, b ∈ R , T ∈ { G, B } Procedure: • generate a random orthogonal a d × d matrix A byGram-Schmidt orthonormalising a matrix with entriesindependently drawn from a standard normal distribution • set v (cid:62) := (cid:0) A − (cid:1) , d = (cid:0) A (cid:62) (cid:1) , d • set w (cid:62) G := (cid:0) A − (cid:1) , d = (cid:0) A (cid:62) (cid:1) , d • generate independent mean parameters µ , ..., µ d , µ h from N (0 , • generate m independent samples according to the follow-ing SEM S = N C = µ + N + S + bh C = µ + aN + C C = µ + N + SC = µ + N + bh C k = µ k + N k ( k ∈ N d ) where ( N , ..., N d ) ∼ N (0 , d , h ∼ N ( µ h , , and N ∼ Unif( {− , +1 } ) if T = B or S ∼ N (0 , if T = G • arrange the m samples s , ..., s m of S in a column vector S • arrange each sample of ( C , ..., C d ) in a column vectorand (pre-)multiply by A to obtain the correspondingsample of ( F , ..., F d ) • arrange the m samples of ( F , ..., F d ) as columns of a d × m matrix F Output: S , F , v , w G a Since we can ignore scaling, it is not a problem that we infact generate an orthonormal matrix.of the smallest hyperspherical cap centred at w G that con-tains w or − w by half the area of the ( d − -sphere.The former equals the area of the hyperspherical cap ofheight h = 1 − | w · w G | , the latter equals the area of thehyperspherical cap of height r = 1 . Exploiting the conciseformulas for the area of a hyperspherical cap with radius r presented in [28] we obtain pobv w G ( w ) = I h (2 − h ) (cid:18) d − , (cid:19) where h = 1 −| w · w G | and I x ( a, b ) is the regularized incom-plete beta function. 
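Both performance measures are straightforward to compute. The following sketch estimates pobv by Monte-Carlo sampling instead of evaluating the regularized incomplete beta function; the helper names are ours, not those of the published implementation:

```python
import numpy as np

def andi(w, w_G):
    """Angular distance min(angle(w, w_G), angle(-w, w_G)) in [0, pi/2]."""
    return np.arccos(np.clip(abs(np.dot(w, w_G)), 0.0, 1.0))

def pobv(w, w_G, n=200000, seed=0):
    """Monte-Carlo estimate of the probability of a better vector."""
    rng = np.random.default_rng(seed)
    w_r = rng.standard_normal((n, len(w_G)))
    w_r /= np.linalg.norm(w_r, axis=1, keepdims=True)  # uniform on the sphere
    return float(np.mean(np.abs(w_r @ w_G) > abs(np.dot(w, w_G))))

w_G = np.array([1.0, 0.0, 0.0])
w = np.array([0.6, 0.8, 0.0])
assert andi(w_G, w_G) == 0.0
assert abs(andi(w, w_G) - np.arccos(0.6)) < 1e-12
# for d = 3 the dot product with a fixed unit vector is Unif[-1, 1]
# (Archimedes' hat-box theorem), so pobv should be close to 1 - |w . w_G| = 0.4
assert abs(pobv(w, w_G) - 0.4) < 0.01
```

The $d = 3$ check exploits that the dot product with a fixed unit vector is uniform on $[-1, 1]$, giving a closed-form value to compare against.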
It is interesting to note that $I_x\!\left(\frac{d-1}{2}, \frac{1}{2}\right)$ is the cumulative distribution function of a $\mathrm{Beta}\!\left(\frac{d-1}{2}, \frac{1}{2}\right)$ distribution such that $|w_r \cdot w_G|^2 \sim \mathrm{Beta}\!\left(\frac{1}{2}, \frac{d-1}{2}\right)$. For simplicity we drop the ground truth vector $w_G$ from the notation and simply assume that the corresponding ground truth vector is always the point of reference. Both performance measures are related inasmuch as $\mathrm{pobv}(w) = 0$ iff $\mathrm{andi}(w) = 0$ and $\mathrm{pobv}(w) = 1$ iff $\mathrm{andi}(w) = \pi/2$. However, they capture somewhat complementary information: $\mathrm{andi}(w)$ assesses how close the vector is in absolute terms, while $\mathrm{pobv}(w)$ accounts for the increased problem complexity in higher dimensions.

Fig. 4. The boxplots summarise the results of experiments running MERLiN$_{\Sigma^{-1}}$ on datasets $D^{d \times m}_T(a,b)$ for $T = G$ (cf. Section IV-A). The performance measure $\mathrm{pobv}_{w_G}(w)$ is shown on the $y$-axes and described in Section IV-B (low values are good). The box for $d = 100$, $m = 50$ is missing since MERLiN$_{\Sigma^{-1}}$ can only be applied if $d \leq m$.

C. Experimental results
We applied the precision matrix based MERLiN$_{\Sigma^{-1}}$ algorithm (cf. Algorithm 1) to the synthetic datasets $D^{d \times m}_T(a,b)$ described in Section IV-A. The results of repeated runs for different configurations of $d, m, a, b$ are summarised as boxplots in Figures 4 and 5. Recall that lower values of $\mathrm{pobv}_{w_G}(w)$ and $\mathrm{andi}_{w_G}(w)$ indicate that $w$ is closer to the ground truth $w_G$. We observe the following:
• The results for Gaussian ($T = G$) or binary ($T = B$; not shown here) variable $S$ are indistinguishable.
• Performance is insensitive to the severity of hidden confounding, which can be seen by comparing the plots row-wise for the different values of $b$. This behaviour is expected since $C_4 \not\perp\!\!\!\perp S \mid C_1$.
• Performance decreases with increasing noise level, i.e., with increasing values of $a$. Note that $C_2$ is the sum of $C_1$ and noise $aN_2$ with variance $2 + b^2$ and $a^2$ respectively.
• The problem becomes harder in higher dimensions, resulting in worse performance. However, the results for $\mathrm{pobv}_{w_G}(w)$ indicate that even if the solution is not that close to $w_G$ in an absolute sense ($\mathrm{andi}_{w_G}(w)$) the solution is good in a probabilistic sense.
• More samples increase performance. Especially if the noise level $a$ and the dimension $d$ are not both high at the same time, MERLiN still achieves good performance on $m = 300$ samples (cf. the results for $a = 0.1$, $d = 100$ or $a = 1$, $d = 5$).

Footnote: For each run we create a new dataset. This is the case for all experiments on synthetic data. The performance measures $\mathrm{andi}_{w_G}(w)$ and $\mathrm{pobv}_{w_G}(w)$ are always considered wrt. the corresponding $w_G$ of each dataset instance.

Fig. 5. The boxplots summarise the results of experiments running MERLiN$_{\Sigma^{-1}}$ on datasets $D^{d \times m}_T(a,b)$ for $T = G$ (cf. Section IV-A).
The performance measure $\mathrm{andi}_{w_G}(w)$ is shown on the $y$-axes and described in Section IV-B (low values are good). The box for $d = 100$, $m = 50$ is missing since MERLiN$_{\Sigma^{-1}}$ can only be applied if $d \leq m$.

V. HOW TO EXTEND MERLiN

In this section we demonstrate how the basic MERLiN algorithm can be extended to enable application to different data modalities by (a) including data specific preprocessing steps into the optimisation procedure (cf. Section V-A), and (b) incorporating a priori domain knowledge (cf. Section V-B). In particular, we demonstrate this for neuroimaging data, since stimulus-based experiments pose a prototype application scenario for MERLiN for the following reasons.
1) In stimulus-based experiments the stimulus $S$ is randomised, meeting the assumption in Section III-A1 [11].
2) Recent work in the neuroimaging community focusses on functional networks, i.e., a (linear) combination of activity spread across the brain that is functionally (read causally) relevant [29]. Additionally, the recorded activity is often assumed to be a linear combination of underlying cortical variables, as for example in EEG [2].
3) Simple univariate methods suffice to identify an effect $C_1$ of $S$ [12, Interpretation rule S1].
The proposed method can readily be applied and complement the standard analysis procedures employed in such experiments. More precisely, MERLiN can recover meaningful cortical networks (read causal variables) that are causally affected by $C_1$, thereby establishing a cause-effect relationship between two functional brain state features.

A. MERLiN$^{bp}_{\Sigma^{-1}}$: adaptation to EEG data

Analyses of EEG data commonly focus on trial-averaged log-bandpower in a particular frequency band. Accordingly, we aim at identifying a linear combination of electrode signals such that the trial-averaged log-bandpower of the recovered signal is indirectly affected by the stimulus via another predefined cortical signal. We demonstrate how to do so by extending the basic MERLiN algorithm to include the log-bandpower computation into the optimisation procedure.
More precisely, we consider EEG trial-data of the form $\tilde{F} \in \mathbb{R}^{d \times m \times n}$ where $d$ denotes the number of electrodes, $m$ the number of trials, and $n$ the length of the time series $\tilde{F}_{i,j,1:n}$ for each electrode $i \in \mathbb{N}_d$ and each sample $j \in \mathbb{N}_m$. The aim is to identify a linear combination $w \in \mathbb{R}^{d \times 1}$ such that the log-bandpower of the resulting one-dimensional trial signals $w^\top \tilde{F}_{1:d,j,1:n}$ is a causal effect of the log-bandpower of the one-dimensional trial signals $v^\top \tilde{F}_{1:d,j,1:n}$. However, since the two operations of computing the log-bandpower (after windowing) and taking a linear combination do not commute, we cannot compute the trial-averaged log-bandpower for each channel first and then apply the standard precision matrix based MERLiN$_{\Sigma^{-1}}$ algorithm. Instead, MERLiN$^{bp}_{\Sigma^{-1}}$ has been adapted to the analysis of EEG data by switching in the log-bandpower computation.

To simplify the optimisation loop we exploit the fact that applying a Hanning window and the FFT to each channel's signal commutes with taking a linear combination of the windowed and Fourier transformed time series. Note that averaging of the log-moduli ($\log(|\cdot|)$) of the Fourier coefficients does not commute with taking a linear combination. Hence, windowing and computing the FFT is done in a separate preprocessing step (cf. Algorithm 3), while the trial-averaged bandpower is computed within the optimisation loop after taking the linear combination. Implementation details for the bandpower and precision matrix based MERLiN$^{bp}_{\Sigma^{-1}}$ algorithm are described in Algorithm 4. To ease implementation we treat the complex numbers as a two-dimensional vector space over the reals.

B. MERLiN$^{bp+}_{\Sigma^{-1}}$: incorporating a priori knowledge

Here we demonstrate how to incorporate a priori domain knowledge by modifying the objective function of the MERLiN$^{bp}_{\Sigma^{-1}}$ algorithm.
Utilising a priori knowledge about volume conduction in EEG recordings results in the refined MERLiN$^{bp+}_{\Sigma^{-1}}$ algorithm.

A cortical source projects into more than one EEG electrode. In general, these volume conduction artefacts might lead to wrong conclusions about interactions between sources [30]. Imaginary coherency, as introduced in [31], may help to differentiate volume conduction artefacts from interactions between cortical sources. To briefly recap the rationale, we employ the common assumption that the signals measured at the EEG electrodes have no time-lag to the cortical signals [32]. The coherency at a certain frequency $j$ of two time series $X$ and $Y$ with Fourier coefficients $x(j), y(j)$, $j \in \mathbb{N}_n$, is defined as
$$\mathrm{coh}_{X,Y}(j) = \frac{E[x(j)y^*(j)]}{\sqrt{E[x(j)x^*(j)]\,E[y(j)y^*(j)]}}$$
where $^*$ denotes complex conjugation. Next consider the coherency of $X$ and $Y + X$
$$\mathrm{coh}_{X,Y+X}(j) = \frac{E[x(j)y^*(j)] + E[x(j)x^*(j)]}{\sqrt{E[x(j)x^*(j)]\,E[(y(j)+x(j))(y^*(j)+x^*(j))]}}$$

Footnote: Note that the MERLiN$_{\Sigma^{-1}}$ algorithm takes data of the form $F \in \mathbb{R}^{d \times m}$ as input and cannot readily be applied to time series data $\tilde{F} \in \mathbb{R}^{d \times m \times n}$.
Footnote: We apply a Hanning window in order to keep the feature computation in line with [17].
Algorithm 3 Preprocessing for bp algorithms
Input: $S \in \mathbb{R}^{m \times 1}$, $\tilde{F} \in \mathbb{R}^{d \times m \times n}$, $v \in \mathbb{R}^{d \times 1}$, the sampling frequency $f_s$, and the desired frequency range defined by $\omega_1$ and $\omega_2$
Procedure:
• set $a := \lfloor \omega_1 n / f_s \rfloor$, $b := \lfloor \omega_2 n / f_s \rfloor$, and $n' := b - a + 1$
• for $i$ from $1$ to $d$, for $j$ from $1$ to $m$
  – center, apply Hanning window and compute FFT, i.e., treat $\tilde{F}_{i,j,1:n}$ as a column vector and set $\tilde{F}_{i,j,1:n} := T W H_n \tilde{F}_{i,j,1:n}$
• extract relevant Fourier coefficients corresponding to $v$, i.e., set
  $V^{\mathrm{Im}} := \mathrm{Im}\left(v^\top \tilde{F}_{1:d,j,a:b}\right)_{j=1:m} \in \mathbb{R}^{m \times n'}$
  $V^{\mathrm{Re}} := \mathrm{Re}\left(v^\top \tilde{F}_{1:d,j,a:b}\right)_{j=1:m} \in \mathbb{R}^{m \times n'}$
• remove direction $v$ from $\tilde{F}$, i.e., for $j$ from $1$ to $m$ set
  $F^{\mathrm{Im}}_{1:(d-1),j,1:n'} := \mathrm{Im}\left(P(v)\tilde{F}_{1:d,j,a:b}\right) \in \mathbb{R}^{(d-1) \times n'}$
  $F^{\mathrm{Re}}_{1:(d-1),j,1:n'} := \mathrm{Re}\left(P(v)\tilde{F}_{1:d,j,a:b}\right) \in \mathbb{R}^{(d-1) \times n'}$
  such that $F^{\mathrm{Im}}, F^{\mathrm{Re}} \in \mathbb{R}^{(d-1) \times m \times n'}$
Output: $V^{\mathrm{Im}}, V^{\mathrm{Re}} \in \mathbb{R}^{m \times n'}$ and $F^{\mathrm{Im}}, F^{\mathrm{Re}} \in \mathbb{R}^{(d-1) \times m \times n'}$
Definitions:
• $P(v)$ is the $(d-1) \times d$ orthonormal matrix that accomplishes projection onto the orthogonal complement $v^\perp$
• $H_n = I_{n \times n} - \frac{1}{n}\mathbb{1}_{n \times n}$ is the $n \times n$ centering matrix
• $W = \left[\delta_{kl}\,\tfrac{1}{2}\left(1 - \cos\tfrac{2\pi k}{n-1}\right)\right]_{k,l=1:n}$ is the $n \times n$ Hanning window matrix
• $T = \left[\exp\left(-\imath\,\tfrac{2\pi k l}{n}\right)\right]_{k,l=1:n}$ is the $n \times n$ FFT matrix
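The per-trial core of Algorithm 3 can be sketched in numpy as follows (hypothetical helper name; the published implementation may differ in details such as FFT conventions). The final assertion checks the commutation fact from Section V-A that justifies running the preprocessing once outside the optimisation loop:

```python
import numpy as np

def band_coefficients(X, fs, omega1, omega2):
    """Center, Hanning-window and FFT each channel of one trial X
    (shape: channels x samples), then keep the bins in [omega1, omega2] Hz.
    Corresponds to applying T W H_n per channel and slicing bins a..b."""
    d, n = X.shape
    a = int(np.floor(omega1 * n / fs))
    b = int(np.floor(omega2 * n / fs))
    Xc = X - X.mean(axis=1, keepdims=True)          # centering (H_n)
    coef = np.fft.fft(np.hanning(n) * Xc, axis=1)   # window (W), then FFT (T)
    return coef[:, a:b + 1]                         # n' = b - a + 1 bins

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 256))                   # 4 channels, one trial
w = rng.standard_normal(4)
# mixing channels before or after the preprocessing gives the same result,
# which is why Algorithm 3 can run once outside the optimisation loop
assert np.allclose(band_coefficients((w @ X)[None, :], 100.0, 8.0, 12.0),
                   w @ band_coefficients(X, 100.0, 8.0, 12.0))
```

Since centering, windowing, FFT, and bin selection are all linear, the two orders of operations agree exactly.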
Algorithm 4 MERLiN$^{bp}_{\Sigma^{-1}}$
Refer to Algorithm 5 and instead use the objective function
$$f(w) = \left|\left(\widehat{\Sigma}^{-1}_w\right)_{2,3}\right| - \left|\left(\widehat{\Sigma}^{-1}_w\right)_{1,3}\right|$$

and observe that $E[x(j)x^*(j)]$ is real. This shows that non-zero imaginary coherency
$$\mathrm{icoh}_{X,Y}(j) \triangleq \mathrm{Im}(\mathrm{coh}_{X,Y}(j))$$
cannot be due to volume conduction and indicates time-lagged interaction between sources, since it implies that $\mathrm{Im}(E[x(j)y^*(j)]) \neq 0$.

This a priori knowledge is incorporated in MERLiN$^{bp+}_{\Sigma^{-1}}$ by adapting the objective function to be
$$f(w) = \left|\sum_{j=1}^{n'} \mathrm{icoh}(j)\right| \cdot \left|\left(\widehat{\Sigma}^{-1}_w\right)_{2,3}\right| - \left|\left(\widehat{\Sigma}^{-1}_w\right)_{1,3}\right|$$
where $\widehat{\Sigma}^{-1}_w$ denotes the empirical precision matrix of the log-bandpower features after taking the linear combination $w$

Footnote: Here we exploit the assumption of instantaneous mixing mentioned above.
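A per-frequency estimate of the imaginary coherency can be sketched as follows (hypothetical helper, following the definition above; `x` and `y` hold one complex Fourier coefficient per trial). The checks illustrate the rationale: zero-lag leakage yields vanishing imaginary coherency, a phase-shifted interaction does not.

```python
import numpy as np

def icoh(x, y):
    """Estimate Im(coh_{X,Y}) at one frequency from per-trial Fourier
    coefficients x, y (1-d complex arrays, one entry per trial)."""
    num = np.mean(x * np.conj(y))
    den = np.sqrt(np.mean(np.abs(x) ** 2) * np.mean(np.abs(y) ** 2))
    return float(np.imag(num / den))

rng = np.random.default_rng(0)
x = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
# zero-lag (volume conduction) leakage of x into itself: icoh vanishes ...
assert abs(icoh(x, x)) < 1e-12
# ... whereas a time-lagged (phase-shifted) interaction does not
assert abs(icoh(x, np.exp(1j * np.pi / 4) * x) + np.sin(np.pi / 4)) < 1e-12
```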
Algorithm 5 MERLiN$^{bp+}_{\Sigma^{-1}}$
Input: $S \in \mathbb{R}^{m \times 1}$, $\tilde{F} \in \mathbb{R}^{d \times m \times n}$, $v \in \mathbb{R}^{d \times 1}$, the sampling frequency $f_s$, and the desired frequency range defined by $\omega_1$ and $\omega_2$
Procedure:
• obtain $V^{\mathrm{Im}}, V^{\mathrm{Re}} \in \mathbb{R}^{m \times n'}$ and $F^{\mathrm{Im}}, F^{\mathrm{Re}} \in \mathbb{R}^{d' \times m \times n'}$ via Algorithm 3 where $d' = d - 1$
• set
$$C_1 := \left(\frac{1}{n'}\sum_{j=1}^{n'} \log^*\!\left(\frac{\sqrt{(V^{\mathrm{Im}}_{i,j})^2 + (V^{\mathrm{Re}}_{i,j})^2}}{n}\right)\right)_{i=1:m} \in \mathbb{R}^{m \times 1}$$
(average log-bandpower per trial)
• define the objective function for $w \in O^{d-2}$ as
$$f(w) = \left|\sum_{j=1}^{n'} \mathrm{icoh}(j)\right| \cdot \left|\left(\widehat{\Sigma}^{-1}_w\right)_{2,3}\right| - \left|\left(\widehat{\Sigma}^{-1}_w\right)_{1,3}\right|$$
where the empirical precision matrix is
$$\widehat{\Sigma}^{-1}_w = \left(\frac{1}{m-1}[S, C_1, D_w]^\top H_m [S, C_1, D_w]\right)^{-1},$$
the average log-bandpower per trial depending on $w$ is
$$D_w = \left(\frac{1}{n'}\sum_{j=1}^{n'} \log^*\!\left(\frac{\sqrt{(w^\top F^{\mathrm{Im}}_{1:d',i,j})^2 + (w^\top F^{\mathrm{Re}}_{1:d',i,j})^2}}{n}\right)\right)_{i=1:m} \in \mathbb{R}^{m \times 1},$$
and the imaginary coherency $\mathrm{icoh}(j)$ for each frequency $j \in \mathbb{N}_{n'}$ equals
$$\frac{\left\langle V^{\mathrm{Im}}_{i,j} \cdot w^\top F^{\mathrm{Re}}_{1:d',i,j} - V^{\mathrm{Re}}_{i,j} \cdot w^\top F^{\mathrm{Im}}_{1:d',i,j}\right\rangle_{i=1:m}}{\sqrt{\left\langle (V^{\mathrm{Im}}_{i,j})^2 + (V^{\mathrm{Re}}_{i,j})^2\right\rangle_{i=1:m}\left\langle (w^\top F^{\mathrm{Im}}_{1:d',i,j})^2 + (w^\top F^{\mathrm{Re}}_{1:d',i,j})^2\right\rangle_{i=1:m}}}$$
• optimise $f$ as described in Section II-B to obtain the vector $w^* \in O^{d-2}$
Output: $w = P(v)^\top w^* \in O^{d-1}$
Definitions:
• $P(v)$ is the $(d-1) \times d$ orthonormal matrix that accomplishes projection onto the orthogonal complement $v^\perp$
• $H_m = I_{m \times m} - \frac{1}{m}\mathbb{1}_{m \times m}$ is the $m \times m$ centering matrix
• $\log^*$ is the extended log function with $\log^*(x) = \log(x)$ for $x > 0$ and $\log^*(0) = 0$
• the notation $\langle\cdot\rangle_{i=1:m}$ denotes the empirical mean, i.e., $\langle a_i \rangle_{i=1:m} = \frac{1}{m}\sum_{i=1}^m a_i$

and $\mathrm{icoh}(j)$ denotes the imaginary coherency between the signals corresponding to $v$ and $w$ estimated as average over all trials (cf. Algorithm 5 for details). While there are several ways of setting up the objective function we have chosen this multiplicative set-up as it quite naturally captures the following idea: whenever we find the resulting bandpower to be dependent on $C_1$ we also want to ensure that this is not just an artefact due to volume conduction. Note that this extension may also help disentangle true cortical sources, i.e., the causal variables, by avoiding a mixture of distinct sources affected by $C_1$ that have different time-lags and hence result in lower imaginary coherency.

VI. EXPERIMENTS ON EMPIRICAL EEG DATA
A. Sanity check of the included log-bandpower computation
We ran simulation experiments with the MERLiN$^{bp}_{\Sigma^{-1}}$ algorithm analogous to those presented in Section IV. For this we used datasets $TD^{d \times m \times n}_T(a,b)$ that are generated from $D^{d \times m}_T(a,b)$ with fixed mixing matrix $A = I_{d \times d}$ as follows. While $S$, $v$, $w_G$ remain unchanged, the $d \times m$ matrix $F$ is replaced by a $d \times m \times n$ tensor $\tilde{F}$ that consists of $dm$ chunks of randomly chosen real EEG signals of length $n$. Each signal $\tilde{F}_{i,j,1:n}$ is modified such that the log-bandpower in the desired frequency band equals $F_{i,j}$.

The log-bandpower computation was incorporated into the algorithm in such a way that the optima for MERLiN$^{bp}_{\Sigma^{-1}}$ on $TD^{d \times m \times n}_T(a,b)$ coincide with those for MERLiN$_{\Sigma^{-1}}$ on the corresponding dataset $D^{d \times m}_T(a,b)$; however, the shape of the objective functions is different. Accordingly and as expected, sanity checks of MERLiN$^{bp}_{\Sigma^{-1}}$ on $TD^{d \times m \times n}_T(a,b)$ show trends for varying parameters $T, d, m, a, b$ similar to those discussed in Section IV-C.
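The text does not spell out how each EEG chunk is modified to match the target log-bandpower; a minimal sketch, assuming a global rescaling of the chunk suffices, is:

```python
import numpy as np

def match_log_bandpower(signal, fs, omega1, omega2, target):
    """Rescale a 1-d signal so that its mean power in [omega1, omega2] Hz
    has logarithm `target`. One possible construction, not necessarily
    the one used in the original experiments."""
    n = len(signal)
    coef = np.fft.fft(np.hanning(n) * (signal - signal.mean()))
    a, b = int(omega1 * n // fs), int(omega2 * n // fs)
    bp = np.mean(np.abs(coef[a:b + 1]) ** 2)      # current bandpower
    return signal * np.sqrt(np.exp(target) / bp)  # scale the whole chunk

rng = np.random.default_rng(0)
s = match_log_bandpower(rng.standard_normal(512), 100.0, 8.0, 12.0, target=1.5)
coef = np.fft.fft(np.hanning(512) * (s - s.mean()))
a, b = int(8.0 * 512 // 100.0), int(12.0 * 512 // 100.0)
assert abs(np.log(np.mean(np.abs(coef[a:b + 1]) ** 2)) - 1.5) < 1e-9
```

Because scaling a signal scales its Fourier coefficients linearly, the bandpower scales quadratically and the target is met exactly.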
B. Description of empirical data
We next evaluate MERLiN with EEG data recorded during a neurofeedback experiment [33]. Subjects in this study were instructed in pseudo-randomised order to up- or down-regulate the amplitude of $\gamma$-oscillations in the right superior parietal cortex (SPC). For the feedback, the activity in the SPC was extracted by a linearly-constrained-minimum-variance (LCMV) beamformer [34] that was trained on resting-state EEG data.

Each recording session (three subjects with two sessions each, referred to as S1R1, S1R2, S2R1, ...) consists of trials of equal length. The stimulus variable $S$ is either $+1$ or $-1$ depending on whether the subject was instructed to up- or down-regulate $\gamma$-power in the SPC. Electromagnetic artefacts were attenuated as described in [33, Section 2.4.1] and the EEG data downsampled. We are also given the spatial filter $v$ for each session, i.e., the beamformer that was used to extract the feedback signal. Thus, the data of one session can be arranged as $S \in \{-1,+1\}^{m \times 1}$, $v \in \mathbb{R}^{d \times 1}$ and $\tilde{F} \in \mathbb{R}^{d \times m \times n}$ where $\tilde{F}$ contains the time series (of length $n$) for each channel and trial.

Footnote: Data was recorded at active electrodes placed according to the extended 10–20 system and converted to common average reference.

C. Assessing MERLiN's performance

MERLiN's performance on these data is assessed by comparing against results from an earlier exhaustive search approach. The hypothesis in [17] is that $\gamma$-oscillations in the SPC modulate $\gamma$-oscillations in the medial prefrontal cortex (MPFC); it was derived from previous transcranial magnetic stimulation studies [35]. In order to test this hypothesis, the signal of $K$ dipoles across the cortical surface was extracted using an LCMV beamformer and a three-shell spherical head model [36]. The SCI algorithm was used to assess for every dipole whether its $\gamma$-log-bandpower is a causal effect of the $\gamma$-log-bandpower in the SPC. This analysis confirmed the MPFC as a causal effect of the SPC (cf. Figure 6).

Fig. 6. Figure adapted from [17]. The neurofeedback target area in the right SPC is indicated by a pink circle. The SCI value denotes the percentage of dipoles within a radius of 7 mm that were found to be modulated by the SPC. From these results, the authors inferred the primary targets of the right SPC to be the MPFC and additionally the right anterior middle frontal gyrus (aMFG).

To allow comparison against these results we derive a vector $g \in \mathbb{R}^{K \times 1}$ that represents the involvement of each cortical dipole in the signal identified by MERLiN$^{bp+}_{\Sigma^{-1}}$ as the linear combination $w$ of electrode signals. A scalp topography is readily obtained via $a \propto \Sigma w$ where the $i$th entry of $\Sigma w$ is the covariance between the $i$th EEG channel and the source that is recovered by $w$ [37, Equation (7)]. Here $\Sigma$ denotes the subject-specific covariance matrix in the $\gamma$-frequency band. A dipole involvement vector $g$ is obtained from $a$ via dynamic statistical parametric mapping (dSPM; with the identity as noise covariance matrix) [38]. The resulting vectors are expected to be in line with previous findings and the hypothesis that the MPFC is affected by the SPC.

D. Experimental results
We applied MERLiN$^{bp+}_{\Sigma^{-1}}$ several times, i.e., with different random initialisations, to the data of each of the sessions. We found that the $\gamma$-activation maps $a$ obtained for each spatial filter $w$ were (a) rather smooth and similar to what is typically assumed to be neurophysiologically plausible [39], and (b) consistent across different initialisations within sessions. The group average and individual dipole involvement vectors are shown in Figure 7, and Table I shows the resulting absolute (partial) correlations between $\gamma$-bandpower in the SPC ($C_1$), the $\gamma$-bandpower of the signal $w^\top F$ identified by the MERLiN$^{bp+}_{\Sigma^{-1}}$ algorithm ($C_2$), and the instruction to up- or down-regulate the $\gamma$-bandpower in the SPC ($S$).

Footnote: Since there are only few samples per session we decided to select a subset of EEG channels distributed across the scalp (again according to the 10–20 system) after performing the preprocessing according to Algorithm 3. Hence, each run of the algorithm yields a spatial filter $w$ and a dipole involvement vector $g \in \mathbb{R}^{K \times 1}$.

TABLE I
ABSOLUTE (PARTIAL) CORRELATIONS BETWEEN $\gamma$-BANDPOWER IN THE SPC ($C_1$), THE $\gamma$-BANDPOWER OF THE SIGNAL $w^\top F$ IDENTIFIED BY THE MERLiN$^{bp+}_{\Sigma^{-1}}$ ALGORITHM ($C_2$), AND THE INSTRUCTION TO UP- OR DOWN-REGULATE THE $\gamma$-BANDPOWER IN THE SPC ($S$).

Session   $|\rho_{S,C_1}|$   $|\rho_{C_1,C_2}|$   $|\rho_{S,C_2}|$   $|\rho_{S,C_2|C_1}|$
S1R1      0.88               0.36                 0.38               0.
S1R2      0.81               0.64                 0.51               0.
S2R1      0.34               0.92                 0.40               0.
S2R2      0.44               0.55                 0.17               0.
S3R1      0.93               0.90                 0.83               0.
S3R2      0.88               0.95                 0.93               0.

Our results are in line with the previous findings (cf. Figure 6) inasmuch as they support the hypothesis that the MPFC is a causal effect of the SPC, i.e., $S \to C_1 \to C_2$. We observe the following:
• For five out of six sessions and on group average the MPFC shows up as being causally affected by the SPC.
• Comparing the marginal correlation $\rho_{S,C_2}$ to the partial correlation $\rho_{S,C_2|C_1}$ suggests that indeed $C_1$ screens off $S$ and $C_2$, which is incompatible with the causal graph $C_1 \leftarrow S \to C_2$. Recall that $C_1 \not\perp\!\!\!\perp C_2$ and $C_2 \perp\!\!\!\perp S \mid C_1$ are sufficient to uniquely infer $S \to C_1 \to C_2$ (cf. Section III-B).
• Unlike the results in [17], the anterior middle frontal gyrus does not show up in Figure 7.
• The parietal/posterior cingulate cortex shows up for session S1R1, here in addition to the MPFC, and for session S3R2.

Note that we used MERLiN to recover only one causal variable and hence, that the results are not expected to exactly resemble the exhaustive search results in [17]. If the true underlying graph is as depicted in Figure 8, then MERLiN may recover any combination $aC_2 + bC_3$ as causal effect of $C_1$. This may explain both why the anterior middle frontal gyrus does not show up in our analysis (MERLiN recovering only one effect, namely $C_2$ but not $C_3$) and the lack of intra-subject consistency (slight inter-session differences may lead to recovery of different combinations of effects of $C_1$). Also note that if the assumption of orthogonal mixing is violated the SPC signal can only be attenuated but not removed (cf. Section III-C). This may explain the outlier result for session S3R2: the high correlation between $C_1$ and $C_2$ indicates that essentially the SPC signal was recovered, i.e., $C_2 \approx C_1$.

VII. DISCUSSION
A. Summary of contributions
We have proposed a novel idea on how to construct causal variables from observed non-causal variables by explicitly optimising for the statistical properties that imply a certain cause-effect relationship. This tackles the important problem of causal variable construction, an issue in causal inference that often goes unaddressed and is circumvented by presupposing pre-defined meaningful variables among which cause-effect relationships are to be inferred. The resulting MERLiN algorithm can recover (or construct) a causal variable from an observed linear mixture that is linearly affected by another given variable. MERLiN can moreover be extended to enable application to different data modalities by (a) including data specific computation routines into the optimisation procedure, and (b) incorporating further constraints derived from a priori domain knowledge. We chose to demonstrate this through application to EEG data, since stimulus-based neuroimaging studies are a natural application scenario for MERLiN (cf. Section V). Results on empirical EEG data indicate that MERLiN can infer meaningful brain state features (read causal variables) and establish a cause-effect relationship between two cortical signals.

Fig. 7. Spatial pattern of the effect of the SPC as identified by MERLiN$^{bp+}_{\Sigma^{-1}}$. Group average (first row) and for individual sessions (bottom rows). Each subplot consists of a lateral (top) and medial (bottom) view of the left (left) and right (right) hemisphere. (All colorscales from "blue" to "red" range from 0 to the largest value to be plotted.)

Fig. 8. Example causal graph. $C_2$ and $C_3$ are two distinct effects of $C_1$.
B. Applications in neuroimaging
As discussed in Section V, interesting application scenarios for MERLiN naturally arise in stimulus-based neuroimaging studies. MERLiN's fundamental idea is that the construction of causal variables should explicitly take into account statistical properties that correspond to causal structure. This supersedes source separation procedures that often rest on implausible assumptions and are not tailored towards subsequent causal analyses (e.g. ICA in the context of EEG data). Besides MERLiN's conceptual vantage it is computationally efficient and enables us to bypass both source localisation (e.g. beamforming, dSPM) and an exhaustive search over dipoles.

While we have chosen EEG as an example use case for extended MERLiN algorithms, the extension presented in this manuscript is hoped to serve as a demonstration that will help researchers to adapt the MERLiN algorithm to other neuroimaging modalities. Future research may focus on extending MERLiN to enable application to functional magnetic resonance imaging data. This will, due to the high dimensionality compared to the number of samples, again require a modification of the objective function regularising the complexity of the linear combination $w$ to avoid perfect recovery of the stimulus variable.

C. Limitations and future research
MERLiN tries to identify $w^\top F = C_2$ in $S \to C_1 \to C_2$ and rests on the assumption that there is no direct effect $S \to C_2$. This assumption narrows down the class of causal variables MERLiN can recover, e.g. if the true causal graph is as shown in Figure 9a then MERLiN cannot recover $C_2$. However, we argue that this is not a strong limitation. First, in stimulus-based neuroimaging experiments the assumption is likely to be fulfilled if $C_1$ is chosen to be a brain state feature that reflects upstream processing of sensory input, e.g. the secondary visual cortex V2 may be assumed to be only indirectly affected by visual stimuli via the primary visual cortex V1. Second, the MERLiN algorithm is robust in the sense that the statistical properties that it optimises for are sufficient to infer the cause-effect relationships $S \to C_1 \to w^\top F$. In other words, we are on the safe side as long as we refrain from drawing a conclusion if the statistical properties are not met for the identified variable. Future research may focus on how to recover $C_2$ in scenarios like in Figure 9a. This is complicated since the graphs in Figure 9a and Figure 9b are Markov equivalent, i.e., they entail the same conditional (in-)dependences. Hence, the cause-effect relationship $C_1 \to C_2$ cannot be reliably inferred from conditional (in-)dependences alone.

Fig. 9. Example causal graphs where $h$ denotes a hidden confounder.

MERLiN may be applied in the $d > m$ (high dimension and small sample) setting if an additional regularisation term penalises the complexity of the linear combination $w$. This leads to the following more general form of the objective function
$$f(w) = (1-\lambda)\left|\left(\widehat{\Sigma}^{-1}_w\right)_{2,3}\right| - \lambda\left|\left(\widehat{\Sigma}^{-1}_w\right)_{1,3}\right| - \mathrm{complexity}(w)$$
where the additional parameter $\lambda \in [0, 1]$ may enable improved performance by weighing dependence against conditional independence depending on the problem structure at hand.

Another limitation is that the MERLiN algorithm presented in this manuscript only works for linear networks, i.e., it fails for non-linear cause-effect relationships. This may not be a strong limitation for neuroimaging applications given that there is empirical evidence that the relations found in EEG and functional magnetic resonance imaging are predominantly linear [40], [41]. Nevertheless, future research will focus on extending MERLiN to non-linear cause-effect relationships, with preliminary results already available [42].

Future research may also investigate possibilities to assess the statistical significance and uncertainty associated with the linear combination identified by MERLiN. The former may be accomplished by a permutation scheme that involves running the optimisation for each permutation. However, it cannot be accomplished by standard significance tests for (conditional) dependence, since an optimisation procedure is employed in obtaining the variables being tested, and this procedure must be corrected for when determining the threshold for significance. The latter may be accomplished by bootstrap techniques.

D. Disentangling multiple effects
MERLiN cannot unambiguously recover multiple effects separately (e.g. $A$, $B$ or $C$ in Figure 10) as opposed to any linear combination of those effects, since all such combinations satisfy the sought-after statistical properties (e.g. $aA + bB + cC$ in Figure 10). However, incorporating a priori knowledge as demonstrated in Section V-B can mitigate this. When analysing EEG data, for instance, one could a priori exclude spatial filters that are neurophysiologically implausible and run optimisation over the complement set instead of the whole unit-sphere.

Fig. 10. Example causal graph for which it is supposed that the indirect ($X \to A \to B$) and direct ($X \to B$) effects of $X$ on $B$ cancel.
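The ambiguity discussed above can be illustrated numerically: any linear combination of distinct effects of $C_1$ satisfies the two statistical properties MERLiN optimises for. A self-contained sketch with two hypothetical effects `A` and `B`:

```python
import numpy as np

def pcorr(x, y, z):
    """Partial correlation of x and y given z (all 1-d sample arrays)."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

rng = np.random.default_rng(0)
m = 20000
S = rng.standard_normal(m)
C1 = S + rng.standard_normal(m)
A = C1 + rng.standard_normal(m)            # two distinct effects of C1
B = C1 + rng.standard_normal(m)
mix = 0.3 * A + 0.7 * B                    # an arbitrary mixture of the effects ...
assert abs(pcorr(S, mix, C1)) < 0.03       # ... is screened off from S by C1
assert np.corrcoef(C1, mix)[0, 1] > 0.5    # ... and still depends on C1
```

Hence the objective alone cannot distinguish the individual effects from their mixtures, which is what the a priori constraints of Section V-B address.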
While the faithfulness assumption remains untestable, we are unlikely to encounter violations in practice; e.g. we can show that faithfulness holds almost surely if the causal relationships are linear [43]. Multivariate causal inference methods may be robust against certain violations of faithfulness, and hence offer an alternative to such arguments. MERLiN, for example, is able to identify cause-effect relationships in unfaithful scenarios that cannot be revealed by classical univariate approaches. Consider the graph shown in Figure 10 and suppose that the indirect ($X \to A \to B$) and direct ($X \to B$) effects of $X$ on $B$ cancel, i.e., $X \perp\!\!\!\perp B$ wrt. the resulting and unfaithful joint distribution. In this example, univariate methods cannot infer the existence of the edge $X \to B$, while MERLiN can in principle determine that $B$ is part of the revealed linear combination and as such directly affected by $X$. The link to faithfulness prompts further research on multivariate methods and variants of the faithfulness assumption. Furthermore, it stresses the importance of causal variable construction.

REFERENCES
[1] K. Chalupka, P. Perona, and F. Eberhardt, "Visual causal feature learning," in Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.
[2] P. Nunez and R. Srinivasan, Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press, 2006.
[3] D. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2, pp. 1150–1157, 1999.
[4] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893, 2005.
[5] H. Bay, A. Ess, T. Tuytelaars, and L. V. Gool, "Speeded-up robust features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[6] A. Hyvärinen and E. Oja, "Independent component analysis: algorithms and applications," Neural Networks, vol. 13, no. 4, pp. 411–430, 2000.
[7] S. Makeig, M. Westerfield, T.-P. Jung, S. Enghoff, J. Townsend, E. Courchesne, and T. J. Sejnowski, "Dynamic brain sources of visual evoked responses," Science, vol. 295, no. 5555, pp. 690–694, 2002.
[8] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. Adaptive and Cognitive Dynamic Systems: Signal Processing, Learning, Communications and Control, Wiley, 2004.
[9] J. Iriarte, E. Urrestarazu, M. Valencia, M. Alegre, A. Malanda, C. Viteri, and J. Artieda, "Independent component analysis as a tool to eliminate artifacts in EEG: A quantitative study," Journal of Clinical Neurophysiology, vol. 20, no. 4, pp. 249–257, 2003.
[10] A. Delorme, T. Sejnowski, and S. Makeig, "Enhanced detection of artifacts in EEG data using higher-order statistics and independent component analysis," NeuroImage, vol. 34, no. 4, pp. 1443–1449, 2007.
[11] S. Weichwald, B. Schölkopf, T. Ball, and M. Grosse-Wentrup, "Causal and anti-causal learning in pattern recognition for neuroimaging," in Pattern Recognition in Neuroimaging, 2014 International Workshop on, pp. 1–4, June 2014.
[12] S. Weichwald, T. Meyer, O. Özdenizci, B. Schölkopf, T. Ball, and M. Grosse-Wentrup, "Causal interpretation rules for encoding and decoding models in neuroimaging," NeuroImage, vol. 110, pp. 48–59, 2015.
[13] J. D. Ramsey, S. J. Hanson, C. Hanson, Y. O. Halchenko, R. A. Poldrack, and C. Glymour, "Six problems for causal inference from fMRI," NeuroImage, vol. 49, no. 2, pp. 1545–1558, 2010.
[14] M. Grosse-Wentrup, B. Schölkopf, and J. Hill, "Causal influence of gamma oscillations on the sensorimotor rhythm," NeuroImage, vol. 56, no. 2, pp. 837–842, 2011.
[15] J. D. Ramsey, S. J. Hanson, and C. Glymour, "Multi-subject search correctly identifies causal connections and most causal directions in the DCM models of the Smith et al. simulation study,"
NeuroImage , vol. 58,no. 3, pp. 838–848, 2011.[16] J. A. Mumford and J. D. Ramsey, “Bayesian networks for fMRI: Aprimer,”
NeuroImage , vol. 86, pp. 573–582, 2014.[17] M. Grosse-Wentrup, D. Janzing, M. Siegel, and B. Sch¨olkopf, “Identifi-cation of causal relations in neuroimaging data with latent confounders:An instrumental variable approach,”
NeuroImage , vol. 125, pp. 825–833,2016.[18] M. Eichler and V. Didelez, “On Granger causality and the effect ofinterventions in time series,”
Lifetime Data Analysis , vol. 16, no. 1,pp. 3–32, 2010.[19] J. T. Lizier and M. Prokopenko, “Differentiating information transferand causal effect,”
The European Physical Journal B , vol. 73, no. 4,pp. 605–615, 2010.[20] P. Spirtes, C. Glymour, and R. Scheines,
Causation, Prediction, andSearch . MIT press, 2nd ed., 2000.[21] J. Pearl,
Causality: Models, Reasoning and Inference . CambridgeUniversity Press, 2nd ed., 2009.[22] S. L. Lauritzen,
Graphical Models . Oxford University Press, 1996.[23] T. Verma and J. Pearl, “Equivalence and synthesis of causal models,”in
Proceedings of the 6th Conference on Uncertainty in ArtificialIntelligence , (Amsterdam, NL), pp. 255–268, Elsevier Science, 1990.[24] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Des-jardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: A CPUand GPU math expression compiler,” in
Proceedings of the Python forScientific Computing Conference (SciPy) , June 2010. Oral Presentation.[25] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow,A. Bergeron, N. Bouchard, and Y. Bengio, “Theano: New featuresand speed improvements.” Deep Learning and Unsupervised FeatureLearning NIPS 2012 Workshop, 2012.[26] J. Townsend, N. Koep, and S. Weichwald, “Pymanopt: A PythonToolbox for Manifold Optimization using Automatic Differentiation,” arXiv preprint arXiv:1603.03236 , 2016.[27] K. Baba, R. Shibata, and M. Sibuya, “Partial correlation and conditionalcorrelation as measures of conditional independence,”
Australian & NewZealand Journal of Statistics , vol. 46, no. 4, pp. 657–664, 2004.[28] S. Li, “Concise formulas for the area and volume of a hypersphericalcap,”
Asian Journal of Mathematics and Statistics , vol. 4, no. 1, pp. 66–70, 2011.[29] M. P. van den Heuvel and H. E. H. Pol, “Exploring the brain network:A review on resting-state fMRI functional connectivity,”
EuropeanNeuropsychopharmacology , vol. 20, no. 8, pp. 519 – 534, 2010. [30] P. L. Nunez, R. Srinivasan, A. F. Westdorp, R. S. Wijesinghe, D. M.Tucker, R. B. Silberstein, and P. J. Cadusch, “EEG coherency: I:statistics, reference electrode, volume conduction, laplacians, corticalimaging, and interpretation at multiple scales,”
Electroencephalographyand Clinical Neurophysiology , vol. 103, no. 5, pp. 499–515, 1997.[31] G. Nolte, O. Bai, L. Wheaton, Z. Mari, S. Vorbach, and M. Hallett,“Identifying true brain interaction from EEG data using the imaginarypart of coherency,”
Clinical Neurophysiology , vol. 115, no. 10, pp. 2292–2307, 2004.[32] J. Stinstra and M. Peters, “The volume conductor may act as a temporalfilter on the ECG and EEG,”
Medical and Biological Engineering andComputing , vol. 36, no. 6, pp. 711–716, 1998.[33] M. Grosse-Wentrup and B. Sch¨olkopf, “A brain–computer interfacebased on self-regulation of gamma-oscillations in the superior parietalcortex,”
Journal of neural engineering , vol. 11, no. 5, p. 056015, 2014.[34] B. D. Van Veen, W. Van Drongelen, M. Yuchtman, and A. Suzuki, “Lo-calization of brain electrical activity via linearly constrained minimumvariance spatial filtering,”
Biomedical Engineering, IEEE Transactionson , vol. 44, no. 9, pp. 867–880, 1997.[35] A. C. Chen, D. J. Oathes, C. Chang, T. Bradley, Z.-W. Zhou, L. M.Williams, G. H. Glover, K. Deisseroth, and A. Etkin, “Causal in-teractions between fronto-parietal central executive and default-modenetworks in humans,”
Proceedings of the National Academy of Sciences ,vol. 110, no. 49, pp. 19944–19949, 2013.[36] J. C. Mosher, R. M. Leahy, and P. S. Lewis, “EEG and MEG: Forwardsolutions for inverse methods,”
Biomedical Engineering, IEEE Transac-tions on , vol. 46, no. 3, pp. 245–259, 1999.[37] S. Haufe, F. Meinecke, K. G¨orgen, S. D¨ahne, J.-D. Haynes, B. Blankertz,and F. Bießmann, “On the interpretation of weight vectors of linearmodels in multivariate neuroimaging,”
NeuroImage , vol. 87, pp. 96–110,2014.[38] A. M. Dale, A. K. Liu, B. R. Fischl, R. L. Buckner, J. W. Belliveau,J. D. Lewine, and E. Halgren, “Dynamic statistical parametric mapping:Combining fMRI and MEG for high-resolution imaging of corticalactivity,”
Neuron , vol. 26, no. 1, pp. 55–67, 2000.[39] A. Delorme, J. Palmer, J. Onton, R. Oostenveld, and S. Makeig,“Independent EEG sources are dipolar,”
PLoS ONE , vol. 7, p. e30135,02 2012.[40] K.-R. M¨uller, C. W. Anderson, and G. E. Birch, “Linear and nonlinearmethods for brain-computer interfaces,”
Neural Systems and Rehabili-tation Engineering, IEEE Transactions on , vol. 11, no. 2, pp. 165–169,2003.[41] T. Naselaris, K. N. Kay, S. Nishimoto, and J. L. Gallant, “Encoding anddecoding in fmri,”
Neuroimage , vol. 56, no. 2, pp. 400–410, 2011.[42] S. Weichwald, A. Gretton, B. Sch¨olkopf, and M. Grosse-Wentrup,“Recovery of non-linear cause-effect relationships from linearly mixedneuroimaging data,” arXiv preprint arXiv:1605.00391 , 2016. to appearin Pattern Recognition in Neuroimaging, 2016 International Workshopon.[43] C. Meek, “Strong completeness and faithfulness in bayesian networks,”in