Granger causality and transfer entropy are equivalent for Gaussian variables
Lionel Barnett∗
Centre for Computational Neuroscience and Robotics, School of Informatics, University of Sussex, Brighton BN1 9QJ, UK
Adam B. Barrett† and Anil K. Seth‡
Sackler Centre for Consciousness Science, School of Informatics, University of Sussex, Brighton BN1 9QJ, UK
(Dated: November 10, 2009)
Abstract
Granger causality is a statistical notion of causal influence based on prediction via vector autoregression. Developed originally in the field of econometrics, it has since found application in a broader arena, particularly in neuroscience. More recently transfer entropy, an information-theoretic measure of time-directed information transfer between jointly dependent processes, has gained traction in a similarly wide field. While it has been recognized that the two concepts must be related, the exact relationship has until now not been formally described. Here we show that for Gaussian variables, Granger causality and transfer entropy are entirely equivalent, thus bridging autoregressive and information-theoretic approaches to data-driven causal inference.
PACS numbers: 87.10.Mn, 87.19.L-, 87.19.lj, 87.19.lo, 89.70.Cf
Keywords: Granger causality, transfer entropy, causal inference

Granger causality (G-causality), which originated with Wiener [2] and was given operational form by Granger [3] in terms of linear autoregression, is a statistical notion of causal influence between time series. Given two variables X and Y, it is said that "Y G-causes X" if, in an appropriate statistical sense, Y assists in predicting the future of X beyond the degree to which X already predicts its own future. Importantly, identification of a G-causality interaction is not identical to identifying a physically instantiated causal interaction in a system. Although the two descriptions are intimately related [4, 5], physically instantiated causal structure can only be unambiguously identified by perturbing a system and observing the consequences [1]. Nonetheless, G-causality is pragmatic, well-defined, and has delivered many insights into the functional connectivity of systems in a variety of fields, particularly in neuroscience [6].

The information-theoretic notion of transfer entropy was formulated by Schreiber [7] as a measure of directed (time-asymmetric) information transfer between joint processes. In contrast to G-causality, transfer entropy is framed not in terms of prediction but in terms of resolution of uncertainty. One can say that "the transfer entropy from Y to X" is the degree to which Y disambiguates the future of X beyond the degree to which X already disambiguates its own future. There is therefore an attractive symmetry between the notions ("predicts" ↔ "disambiguates") which has been noted previously (see e.g. [8]) but never explicitly specified. In this Letter we show that under Gaussian assumptions they are in fact entirely equivalent. Our results therefore provide a framework for inferring causality which unifies information-theoretic and autoregressive approaches.

We use a standard mathematical vector/matrix notation in which bold type generally denotes vector quantities and upper-case type denotes matrices or random variables, according to context. All vectors are considered to be row vectors. The symbol $\intercal$ denotes the transpose operator and $\oplus$ denotes concatenation of vectors, so that for $\mathbf{x} = (x_1, \ldots, x_n)$ and $\mathbf{y} = (y_1, \ldots, y_m)$, $\mathbf{x} \oplus \mathbf{y}$ is the $1 \times (n+m)$ vector $(x_1, \ldots, x_n, y_1, \ldots, y_m)$.

Given jointly distributed multivariate random variables (i.e. random vectors) $\mathbf{X}$, $\mathbf{Y}$, we denote by $\Sigma(\mathbf{X})$ the $n \times n$ matrix of covariances $\operatorname{cov}(X_i, X_j)$ and by $\Sigma(\mathbf{X}, \mathbf{Y})$ the $n \times m$ matrix of cross-covariances $\operatorname{cov}(X_i, Y_\alpha)$. We then use $\Sigma(\mathbf{X} \mid \mathbf{Y})$ to denote the $n \times n$ matrix

$$\Sigma(\mathbf{X} \mid \mathbf{Y}) \equiv \Sigma(\mathbf{X}) - \Sigma(\mathbf{X}, \mathbf{Y})\, \Sigma(\mathbf{Y})^{-1}\, \Sigma(\mathbf{X}, \mathbf{Y})^{\intercal} \tag{1}$$

defined when $\Sigma(\mathbf{Y})$ is invertible. $\Sigma(\mathbf{X} \mid \mathbf{Y})$ appears as the covariance matrix of the residuals of a linear regression of $\mathbf{X}$ on $\mathbf{Y}$ [cf. eq. (3) below]; thus, by analogy with partial correlation [9], we term $\Sigma(\mathbf{X} \mid \mathbf{Y})$ the partial covariance [28] of $\mathbf{X}$ given $\mathbf{Y}$.

Suppose we have a multivariate stochastic process $\mathbf{X}_t$ in discrete time [29] (i.e. the random variables $X_{ti}$ are jointly distributed). We use the notation $\mathbf{X}^{(p)}_t \equiv \mathbf{X}_t \oplus \mathbf{X}_{t-1} \oplus \cdots \oplus \mathbf{X}_{t-p+1}$ to denote $\mathbf{X}_t$ itself, along with $p-1$ lags, so that $\mathbf{X}^{(p)}_t$ is a $1 \times pn$ random vector for each $t$. Given the lag $p$, we use the shorthand notation $\mathbf{X}^{-}_t \equiv \mathbf{X}^{(p)}_{t-1}$ for the lagged variable.
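To make the definition concrete, the following minimal sketch (our illustration, not part of the Letter; it assumes NumPy and synthetic jointly Gaussian data) computes the partial covariance (1) from sample covariance blocks and checks it against the residual covariance of an ordinary least squares regression of X on Y, anticipating eq. (3) below.

```python
# Minimal sketch: partial covariance (1) vs. OLS residual covariance, cf. eq. (3).
# Illustrative only; assumes NumPy and synthetic jointly Gaussian data.
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 3, 2, 100_000                    # dim(X), dim(Y), sample size

Y = rng.standard_normal((N, m))            # rows are observations
X = Y @ rng.standard_normal((m, n)) + 0.5 * rng.standard_normal((N, n))

# Sample covariance blocks Sigma(X), Sigma(Y) and cross-covariance Sigma(X, Y).
S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Syy, Sxy = S[:n, :n], S[n:, n:], S[:n, n:]

# Partial covariance, eq. (1): Sigma(X|Y) = Sigma(X) - Sigma(X,Y) Sigma(Y)^{-1} Sigma(X,Y)^T.
S_partial = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)

# Residual covariance from an OLS regression of X on Y (with constant terms).
R = np.hstack([np.ones((N, 1)), Y])
coef, *_ = np.linalg.lstsq(R, X, rcond=None)
S_resid = np.cov(X - R @ coef, rowvar=False)

print(np.allclose(S_partial, S_resid))     # True: the two matrices coincide
```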
Let $\mathbf{X}$, $\mathbf{Y}$ be jointly distributed random vectors and consider the linear regression

$$\mathbf{X} = \boldsymbol{\alpha} + \mathbf{Y} \cdot A + \boldsymbol{\varepsilon} \tag{2}$$

where the $m \times n$ matrix $A$ comprises the regression coefficients, $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ are the constant terms and the random vector $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)$ comprises the residuals. The mean squared error (MSE) may then be written in terms of the covariance matrix of the residuals as $E = \operatorname{trace}(\Sigma(\boldsymbol{\varepsilon}))$; $E$ is just the sum of the variances of the $\varepsilon_i$, sometimes known as the total variance. Performing an Ordinary Least Squares (OLS) fit to find the coefficients $A$ that minimize $E$ yields [assuming $\Sigma(\mathbf{Y})$ invertible] $A = \Sigma(\mathbf{Y})^{-1}\, \Sigma(\mathbf{X}, \mathbf{Y})^{\intercal}$, and we find that for the least squares fit the covariance matrix of the residuals is given by

$$\Sigma(\boldsymbol{\varepsilon}) = \Sigma(\mathbf{X} \mid \mathbf{Y}) \tag{3}$$

with $\Sigma(\mathbf{X} \mid \mathbf{Y})$ the partial covariance as defined by (1). We note that the same coefficients $A$ which minimize the total variance $E$ also minimize the generalized variance $|\Sigma(\boldsymbol{\varepsilon})|$ [10], where $|\cdot|$ denotes the determinant (this procedure is sometimes referred to as "Least Generalized Variance"; see e.g. [11]).

If the residuals $\boldsymbol{\varepsilon}$ can be taken to be uncorrelated with the regressors $\mathbf{Y}$ in (2), as would be the case, for instance, for a multivariate autoregressive (MVAR) model, the residual covariance matrix can be derived directly from (2). Taking the covariance of both sides of (2) yields

$$\Sigma(\mathbf{X}) = A^{\intercal} \Sigma(\mathbf{Y})\, A + \Sigma(\boldsymbol{\varepsilon}) \tag{4}$$

Since the residuals and regressors are uncorrelated, we also have

$$0 = \Sigma(\mathbf{Y}, \boldsymbol{\varepsilon}) = \Sigma(\mathbf{Y},\, \mathbf{X} - \boldsymbol{\alpha} - \mathbf{Y} \cdot A) = \Sigma(\mathbf{X}, \mathbf{Y})^{\intercal} - \Sigma(\mathbf{Y})\, A \tag{5}$$

Solving (5) for $A$ and substituting in (4) we recover eq. (3) for $\Sigma(\boldsymbol{\varepsilon})$. We note that eqs. (4) and (5) are essentially Yule–Walker equations [6] for the regression (2).

Suppose now we have three jointly distributed, stationary [30] multivariate stochastic processes $\mathbf{X}_t$, $\mathbf{Y}_t$, $\mathbf{Z}_t$ ("variables" for brevity). Consider the regression models:

$$\mathbf{X}_t = \boldsymbol{\alpha}_t + \left(\mathbf{X}^{(p)}_{t-1} \oplus \mathbf{Z}^{(r)}_{t-1}\right) \cdot A + \boldsymbol{\varepsilon}_t \tag{6}$$

$$\mathbf{X}_t = \boldsymbol{\alpha}'_t + \left(\mathbf{X}^{(p)}_{t-1} \oplus \mathbf{Y}^{(q)}_{t-1} \oplus \mathbf{Z}^{(r)}_{t-1}\right) \cdot A' + \boldsymbol{\varepsilon}'_t \tag{7}$$

so that the "predictee" variable $\mathbf{X}$ is regressed firstly on the previous $p$ lags of itself plus $r$ lags of the conditioning variable $\mathbf{Z}$ and secondly, in addition, on $q$ lags of the "predictor" variable $\mathbf{Y}$ [31]. The G-causality of $\mathbf{Y}$ to $\mathbf{X}$ given $\mathbf{Z}$ is a measure of the extent to which inclusion of $\mathbf{Y}$ in the second model (7) reduces the prediction error of the first model (6).

The standard measure of G-causality in the literature is defined for univariate predictor and predictee variables $Y$ and $X$, and is given by the natural logarithm of the ratio of the residual variance in the restricted regression (6) to that of the unrestricted regression (7). In our notation [32]

$$\mathcal{F}_{Y \to X \mid \mathbf{Z}} \equiv \ln\left(\frac{\operatorname{var}(\varepsilon_t)}{\operatorname{var}(\varepsilon'_t)}\right) = \ln\left(\frac{\Sigma(\varepsilon_t)}{\Sigma(\varepsilon'_t)}\right) = \ln\left(\frac{\Sigma(X \mid X^{-} \oplus \mathbf{Z}^{-})}{\Sigma(X \mid X^{-} \oplus Y^{-} \oplus \mathbf{Z}^{-})}\right) \tag{8}$$

where the last equality follows from the general formula (3). By stationarity this expression does not depend on time $t$, so we drop the subscript when there is no danger of confusion. Note that the residual variance of the first regression will always be larger than or equal to that of the second, so that $\mathcal{F}_{Y \to X \mid \mathbf{Z}} \geq 0$. In sample, the estimate $\hat{\mathcal{F}}_{Y \to X \mid \mathbf{Z}}$ will have (asymptotically for large samples) a $\chi^2$-distribution under the null hypothesis $\mathcal{F}_{Y \to X \mid \mathbf{Z}} = 0$ [12, 13] and a non-central $\chi^2$-distribution under the alternative hypothesis $\mathcal{F}_{Y \to X \mid \mathbf{Z}} > 0$ [14, 15].

The predictor and/or predictee variables may also be taken to be multivariate; see [16] and [17] for motivation and discussion regarding this generalization. For the case of a multivariate predictor, eq. (8) above (with $Y$ replaced by the bold-type $\mathbf{Y}$) is a valid and consistent formula for G-causality. However, generalization to the case of a multivariate predictee is less clear cut and there does not yet appear to be a standard definition for G-causality in the literature.
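As a concrete illustration of (8), here is a minimal sketch (ours, not the authors' code; it assumes NumPy, fixes the lag orders by hand rather than by a model-order criterion, and omits the conditioning variable Z), applied to a toy system in which Y drives X with a one-step delay.

```python
# Minimal sketch of the G-causality estimator (8) for univariate X, Y (no Z).
# Illustrative only; lag orders p, q are fixed by hand.
import numpy as np

def lagged(x, lags, t0):
    """Columns x[t-1], ..., x[t-lags] for t = t0, ..., len(x)-1."""
    return np.column_stack([x[t0 - k : len(x) - k] for k in range(1, lags + 1)])

def granger(x, y, p=2, q=2):
    t0 = max(p, q)
    target = x[t0:]

    def resid_var(regressors):
        R = np.column_stack([np.ones(len(target)), regressors])
        coef, *_ = np.linalg.lstsq(R, target, rcond=None)
        return np.var(target - R @ coef)

    restricted = lagged(x, p, t0)                            # model (6), sans Z
    full = np.hstack([restricted, lagged(y, q, t0)])         # model (7), sans Z
    return np.log(resid_var(restricted) / resid_var(full))   # eq. (8)

# Toy system: Y drives X with a one-step delay; X does not drive Y.
rng = np.random.default_rng(1)
N = 10_000
y = rng.standard_normal(N)
x = np.zeros(N)
for t in range(1, N):
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.standard_normal()

print(granger(x, y))   # clearly positive: Y G-causes X
print(granger(y, x))   # near zero: X does not G-cause Y
```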
Here we use an extension first proposed by Geweke [14], in which the residual variance $\operatorname{var}(\varepsilon_t) = \Sigma(\varepsilon_t)$ is replaced by the generalized variance $|\Sigma(\boldsymbol{\varepsilon}_t)|$:

$$\mathcal{F}_{\mathbf{Y} \to \mathbf{X} \mid \mathbf{Z}} \equiv \ln\left(\frac{|\Sigma(\boldsymbol{\varepsilon}_t)|}{|\Sigma(\boldsymbol{\varepsilon}'_t)|}\right) = \ln\left(\frac{|\Sigma(\mathbf{X} \mid \mathbf{X}^{-} \oplus \mathbf{Z}^{-})|}{|\Sigma(\mathbf{X} \mid \mathbf{X}^{-} \oplus \mathbf{Y}^{-} \oplus \mathbf{Z}^{-})|}\right) \tag{9}$$

This formula always produces a non-negative quantity, and for a univariate predictee reduces to (8). Moreover, its estimator is also asymptotically $\chi^2$-distributed. Geweke [14] lists a number of motivations for this choice, to which we add the result presented in this Letter. (An alternative formulation for multivariate G-causality is proposed in [16], although see [17] for more detailed discussion and further motivation for the form (9).)

With $\mathbf{X}_t$, $\mathbf{Y}_t$, $\mathbf{Z}_t$ as before, the transfer entropy of $\mathbf{Y}$ to $\mathbf{X}$ given $\mathbf{Z}$ [7, 18] is defined as the difference between the entropy of $\mathbf{X}$ conditioned on its own past and the past of $\mathbf{Z}$, and its entropy conditioned, in addition, on the past of $\mathbf{Y}$:

$$\mathcal{T}_{\mathbf{Y} \to \mathbf{X} \mid \mathbf{Z}} \equiv H\!\left(\mathbf{X} \mid \mathbf{X}^{-} \oplus \mathbf{Z}^{-}\right) - H\!\left(\mathbf{X} \mid \mathbf{X}^{-} \oplus \mathbf{Y}^{-} \oplus \mathbf{Z}^{-}\right) \tag{10}$$

where $H(\cdot)$ denotes entropy and $H(\cdot \mid \cdot)$ conditional entropy. Again, by stationarity transfer entropy does not depend on time $t$, and $\mathcal{T}_{\mathbf{Y} \to \mathbf{X} \mid \mathbf{Z}} \geq 0$. $\mathcal{T}_{\mathbf{Y} \to \mathbf{X} \mid \mathbf{Z}}$ may be understood as the degree of uncertainty of $\mathbf{X}$ resolved by the past of $\mathbf{Y}$ over and above the degree of uncertainty of $\mathbf{X}$ resolved by its own past. As with Granger causality, the transfer entropy literature generally deals only with univariate variables, although in this case the extension of (10) to the multivariate case is unproblematic.

We now turn to the equivalence with G-causality. For a multivariate Gaussian random variable $\mathbf{X}$ we have the well-known expression [19]

$$H(\mathbf{X}) = \tfrac{1}{2} \ln(|\Sigma(\mathbf{X})|) + \tfrac{1}{2} n \ln(2\pi e)$$

for entropy in terms of the determinant of the covariance matrix, where $n$ is the dimension of $\mathbf{X}$. We now show that the conditional entropy $H(\mathbf{X} \mid \mathbf{Y})$ for two jointly multivariate Gaussian variables may be expressed in terms of the determinant of the corresponding partial covariance matrix:

$$H(\mathbf{X} \mid \mathbf{Y}) = \tfrac{1}{2} \ln(|\Sigma(\mathbf{X} \mid \mathbf{Y})|) + \tfrac{1}{2} n \ln(2\pi e) \tag{11}$$

To see this, we have

$$H(\mathbf{X} \mid \mathbf{Y}) \equiv H(\mathbf{X} \oplus \mathbf{Y}) - H(\mathbf{Y}) = \tfrac{1}{2} \ln(|\Sigma(\mathbf{X} \oplus \mathbf{Y})|) - \tfrac{1}{2} \ln(|\Sigma(\mathbf{Y})|) + \tfrac{1}{2} n \ln(2\pi e)$$

Now

$$\Sigma(\mathbf{X} \oplus \mathbf{Y}) = \begin{pmatrix} \Sigma(\mathbf{X}) & \Sigma(\mathbf{X}, \mathbf{Y}) \\ \Sigma(\mathbf{X}, \mathbf{Y})^{\intercal} & \Sigma(\mathbf{Y}) \end{pmatrix}$$

and from the block determinant identity [20]

$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |D|\, \left| A - B D^{-1} C \right|$$

we have $|\Sigma(\mathbf{X} \oplus \mathbf{Y})| = |\Sigma(\mathbf{Y})| \cdot |\Sigma(\mathbf{X} \mid \mathbf{Y})|$, from which we obtain (11) [33].

If, then, the processes $\mathbf{X}_t$, $\mathbf{Y}_t$, $\mathbf{Z}_t$ are jointly multivariate Gaussian (i.e. any finite subset of the component variables $X_{ti}$, $Y_{s\alpha}$, $Z_{ua}$ has a joint Gaussian distribution) it follows from (11) that the expression (10) for transfer entropy becomes [34]

$$\mathcal{T}_{\mathbf{Y} \to \mathbf{X} \mid \mathbf{Z}} = \tfrac{1}{2} \ln\left(\frac{|\Sigma(\mathbf{X} \mid \mathbf{X}^{-} \oplus \mathbf{Z}^{-})|}{|\Sigma(\mathbf{X} \mid \mathbf{X}^{-} \oplus \mathbf{Y}^{-} \oplus \mathbf{Z}^{-})|}\right) \tag{12}$$

Comparing (12) with (9) leads directly to our central result: if all processes are jointly Gaussian, then

$$\mathcal{F}_{\mathbf{Y} \to \mathbf{X} \mid \mathbf{Z}} = 2\, \mathcal{T}_{\mathbf{Y} \to \mathbf{X} \mid \mathbf{Z}} \tag{13}$$

so that Granger causality and transfer entropy are equivalent up to a factor of 2. This result holds, in particular, for a univariate predictee $X$ with the standard definition (8) of G-causality.
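The equivalence is easily checked numerically. In the sketch below (again our illustration, not the authors' code; it assumes NumPy and a simple bivariate Gaussian AR(1) system with one lag and no conditioning variable Z), the transfer entropy (10) is estimated from Gaussian entropies computed via determinants of sample covariance matrices, and the G-causality (8) from OLS residual variances; the two agree to floating-point precision, as (13) requires.

```python
# Minimal numerical check of F = 2T, eq. (13), for a Gaussian AR process.
# Illustrative only; p = q = 1 lag, no conditioning variable Z.
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
x, y = np.zeros(N), np.zeros(N)
for t in range(1, N):
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()
    x[t] = 0.4 * x[t - 1] + 0.7 * y[t - 1] + rng.standard_normal()

X, Xlag, Ylag = x[1:], x[:-1], y[:-1]      # predictee X_t and lagged X^-, Y^-

def H(*cols):
    """Gaussian entropy up to its additive constant: 0.5 * ln|Sigma|."""
    S = np.atleast_2d(np.cov(np.column_stack(cols), rowvar=False))
    return 0.5 * np.log(np.linalg.det(S))

# Transfer entropy (10) via H(A|B) = H(A + B) - H(B); the constants cancel.
T = (H(X, Xlag) - H(Xlag)) - (H(X, Xlag, Ylag) - H(Xlag, Ylag))

def resid_var(regressors):
    R = np.column_stack([np.ones(len(X))] + regressors)
    coef, *_ = np.linalg.lstsq(R, X, rcond=None)
    return np.var(X - R @ coef)

# G-causality (8) via the OLS residual-variance ratio.
F = np.log(resid_var([Xlag]) / resid_var([Xlag, Ylag]))

print(F, 2 * T)   # agree to floating-point precision, consistent with (13)
```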
Empirically, numerical equivalence between G-causality and transfer entropy will depend on the method used to estimate the transfer entropy in sample. If it is assumed at the outset that the data may be reasonably modeled as Gaussian, so that conditional entropies may be estimated from the appropriate sample covariance matrices, then of course numerical equivalence will be guaranteed. If, however, conditional entropies are estimated directly from sampled probability distributions, results will vary with the estimation technique. It is known that naive estimation of transfer entropy by partitioning of the state space is problematic [7] and that such estimators frequently fail to converge to the correct result [18]. In practice, more sophisticated techniques, such as kernel [21] or k-nearest neighbour estimators [22, 23], will need to be deployed; however, such techniques may entail their own assumptions about the empirical distribution of the data (see [18] for a good discussion of these points). Furthermore, unlike G-causality, for which the (asymptotic) distribution of the sample statistic is known, we are not aware of any such general result for transfer entropy. Thus, in particular, significance testing for transfer entropy estimates is likely to be hard.
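By way of contrast with the transfer entropy case, a significance test for a sample G-causality is immediate given the asymptotic theory cited above. The following sketch is our illustration, not the Letter's prescription: it assumes SciPy, and the standard large-sample likelihood-ratio result that under the null hypothesis the scaled statistic N·F̂ is asymptotically χ²-distributed with degrees of freedom equal to the number of coefficients added by the predictor lags (q, in the univariate case).

```python
# Sketch of a significance test for a sample G-causality estimate F_hat.
# Assumption: under H0 (F = 0), N * F_hat is asymptotically chi^2(q) [12, 13].
from scipy import stats

def granger_pvalue(F_hat, N, q):
    """P-value for H0: F = 0, from the asymptotic chi-squared null distribution."""
    return stats.chi2.sf(N * F_hat, df=q)

# E.g. an estimate F_hat = 0.004 from N = 1000 observations with q = 2 lags:
print(granger_pvalue(0.004, 1000, 2))   # ~0.14, not significant at the 5% level
```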
Our result (13) provides for the first time a unified framework for data-driven causal inference that bridges information-theoretic and autoregressive methods. In particular, it opens new research possibilities for transforming findings originally developed in one domain into the other. For example, an advantage of the autoregressive approach is that it admits a straightforward decomposition by frequency [6, 14]; our result now provides a foundation for the development of spectral implementations of transfer entropy. In the opposite direction, the invariance of information-theoretic quantities under general nonlinear transformations [18] could potentially prove useful in the identification of appropriate nonlinear autoregressive models [24, 25]. Preliminary work by the authors indicates, perhaps surprisingly, that under Gaussian assumptions there is nothing extra for nonlinear extensions of G-causality to account for, since a stationary Gaussian AR process is necessarily linear [17]. This finding has practical significance because sensitivity to nonlinear data features is often presented as a reason to prefer transfer entropy to G-causality (see e.g. [26]).

As regards Gaussian assumptions, although their appropriateness may be disputed in the context of specific physical systems, they are nevertheless widely employed in neuroscience, econometrics and beyond, frequently in the role of an analytical benchmark for subsequent, more physically motivated analysis. In practice, given empirical data, it is likely to be difficult to establish the extent to which Gaussian assumptions are tenable, particularly for highly multivariate datasets and limited sample sizes. Further research is thus required to characterize, both analytically and in sample, the manner in which the equivalence (13) breaks down when Gaussian assumptions fail. As a starting point it is known, at least, that in the generic (non-Gaussian) case, nonzero G-causality implies nonzero transfer entropy [27].

More generally, G-causality is typically implemented within the well-understood and easily applicable framework of MVAR modeling. This implementation, however, implies many assumptions about how the data should be modeled. Transfer entropy, by contrast, although "model agnostic" on a theoretical level (in the sense that it involves no presumptions about the joint statistical distribution of the data), may present severe difficulties in empirical application. Investigators are thus free to use whichever practical methods best suit their data. Numerical issues aside, the analytical equivalence (13) furnishes the essential point that, under Gaussian assumptions, G-causality has a natural interpretation as transfer entropy and vice versa.

Acknowledgments.
AKS is supported by EPSRC Leadership Fellowship EP/G007543/1, which also supports the work of ABB.

∗ Electronic address: [email protected]
† Electronic address: [email protected]
‡ Electronic address: [email protected]
[1] J. Pearl, Causality: Models, Reasoning, and Inference (Cambridge University Press, Cambridge, UK, 1999).
[2] N. Wiener, in Modern Mathematics for Engineers, edited by E. Beckenbach (McGraw-Hill, New York, 1956).
[3] C. Granger, Econometrica 37, 424 (1969).
[4] A. Seth and G. Edelman, Neural Comput. 19, 910 (2007).
[5] A. J. Cadotte, T. B. DeMarse, P. He, and M. Ding, PLoS One 3, e3355 (2008).
[6] M. Ding, Y. Chen, and S. Bressler, in Handbook of Time Series Analysis, edited by B. Schelter, M. Winterhalder, and J. Timmer (Wiley, Weinheim, 2006), pp. 438–460.
[7] T. Schreiber, Phys. Rev. Lett. 85, 461 (2000).
[8] M. Paluš, V. Komárek, Z. Hrnčíř, and K. Štěrbová, Phys. Rev. E 63, 046211 (2001).
[9] M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, vol. 2: Inference and Relationship (Griffin, London, 1979).
[10] S. S. Wilks, Biometrika 24, 471 (1932).
[11] J. Davidson, Econometric Theory (Wiley-Blackwell, 2000).
[12] C. W. J. Granger, Inform. Control 6, 28 (1963).
[13] P. Whittle, J. Royal Stat. Soc. B 15, 125 (1953).
[14] J. Geweke, J. Am. Stat. Assoc. 77, 304 (1982).
[15] A. Wald, T. Am. Math. Soc. 54, 426 (1943).
[16] C. Ladroue, S. Guo, K. Kendrick, and J. Feng, PLoS One 4, e6899 (2009).
[17] L. Barnett, A. B. Barrett, and A. Seth (2009), unpublished.
[18] A. Kaiser and T. Schreiber, Physica D 166, 43 (2002).
[19] A. Papoulis and S. Pillai, Probability, Random Variables, and Stochastic Processes, 4th ed. (McGraw-Hill, New York, NY, 2002).
[20] R. A. Horn and C. R. Johnson, Matrix Analysis (Cambridge University Press, 1985).
[21] B. W. Silverman, Density Estimation for Statistics and Data Analysis, vol. 26 of Monographs on Statistics and Applied Probability (Chapman & Hall, London, 1986).
[22] A. Kraskov, H. Stögbauer, and P. Grassberger, Phys. Rev. E 69, 066138 (2004).
[23] S. Frenzel and B. Pompe, Phys. Rev. Lett. 99, 204101 (2007).
[24] Y. Chen, G. Rangarajan, J. Feng, and M. Ding, Phys. Lett. A 324, 26 (2004).
[25] N. Ancona, D. Marinazzo, and S. Stramaglia, Phys. Rev. E 70, 056221 (2004).
[26] O. Sporns and M. Lungarella, PLoS Comput. Biol. 2, e144 (2006).
[27] D. Marinazzo, M. Pellicoro, and S. Stramaglia, Phys. Rev. Lett. 100, 144103 (2008).
[28] This is to be distinguished from the conditional covariance, which will in general be a random variable, although later we note that for Gaussian variables the notions coincide.
[29] While our analysis may be extended to continuous time, we focus here on the discrete time case.
[30] The analysis carries through for the non-stationary case, but for simplicity we assume here that all processes are stationary.
[31] Eqs. (6) and (7) are not to be interpreted as specifying actual MVAR processes; indeed, as such they could not be consistent, due to the $\mathbf{Y}^{(q)}_{t-1}$ dependence in (7). Furthermore, we note that the variables $\mathbf{X}_t$, $\mathbf{Y}_t$, $\mathbf{Z}_t$ may depend on latent (unknown) or exogenous (unmeasured) variables not included in the regressions. Rather, we should view the regressions purely as predictive models specified by the parameters $A$ and $A'$ respectively, which are then to be fitted by an OLS or equivalent procedure.
[32] Note that even though $X$ and $Y$ are univariate, the lagged variables $X^{-}$ and $Y^{-}$ will generally be multivariate (at least if $p, q > 1$).
[33] For jointly multivariate Gaussian $\mathbf{X}$, $\mathbf{Y}$ a standard result states that the covariance matrix of $\mathbf{X}$ given $\mathbf{Y} = \mathbf{y}$ does not depend on the value of $\mathbf{y}$, so that the conditional covariance $\operatorname{cov}(\mathbf{X} \mid \mathbf{Y})$ is a well-defined (non-random) covariance matrix, which is just the partial covariance $\Sigma(\mathbf{X} \mid \mathbf{Y})$ of (1).
[34] This is essentially a multivariate, conditional version of the formula given in [18], eq. (19).