Fluctuation-response theorem for Kullback-Leibler divergences to quantify causation
Andrea Auconi, Benjamin M. Friedrich, and Andrea Giansanti
cfaed, Technische Universität Dresden, 01069 Dresden, Germany
Dipartimento di Fisica, Sapienza Università di Roma, 00185 Rome, Italy
(Dated: February 16, 2021)

We define a new measure of causation from a fluctuation-response theorem for Kullback-Leibler divergences, based on the information-theoretic cost of perturbations. This information response has both the invariance properties required for an information-theoretic measure and the physical interpretation of a propagation of perturbations. In linear systems, the information response reduces to the transfer entropy, providing a connection between Fisher and mutual information.
In the general framework of stochastic dynamical systems, the term causation refers to the influence that a variable x exerts over the dynamics of another variable y. Measures of causation find application in neuroscience [1], climate studies [2], cancer research [3], and finance [4]. However, a widely accepted quantitative definition of causation is still missing.

Causation manifests itself in two inseparable forms: information flow [5–8], and propagation of perturbations [9–12]. Ideally, a quantitative measure of causation should connect both perspectives.

Information flow is commonly quantified by the transfer entropy [13–17], that is, the average conditional mutual information corresponding to the uncertainty reduction in forecasting the time evolution of y that is achieved upon knowledge of x. The mutual information is a special case of Kullback-Leibler (KL) divergence, a dimensionless measure of distinguishability between probability distributions [18]. As such, the transfer entropy abstracts from the underlying physics to give an invariant description in terms of the strength of probabilistic dependencies.

From the interventional point of view [9–12], causation is identified with how a perturbation applied to x propagates in the system to affect y. Although a direct perturbation of observables is unfeasible in most real-world situations, the fluctuation-response theorem establishes a connection between the response to a small perturbation and the correlation of fluctuations in the natural (unperturbed) dynamics [19–22].

The fluctuation-response theorem considers the first-order expansion of the response with respect to the perturbation. The corresponding linear response coefficient has been suggested as a measure of causation [11, 12]. However, it has the same physical units as y/x, and it can assume negative values; thus, it is not directly related to any information-theoretic measure.

In stochastic dynamical systems with nonlinear interactions, perturbing x may not only affect the evolution of the expectation value of y, but it may also affect the evolution of the variance of y, and in fact its entire probability distribution. The KL divergence from the natural to the perturbed probability densities has recently been identified as the universal upper bound to the physical response of any observable relative to its natural fluctuations [23].

In this Letter, we define a new measure of causation in the form of a linear response coefficient between KL divergences, which we would like to call information response. In particular, we consider the ratio of two KL divergences, one for the response and one for the perturbation, where the latter represents an information-theoretic cost of the perturbation. For small perturbations, we formulate a fluctuation-response theorem that expresses this ratio as a ratio of Fisher informations.

In linear systems, this new information response reduces to the transfer entropy, which provides a connection between Fisher and mutual information, and thus a connection between fluctuation-response theory and information flows.

Kullback-Leibler (KL) divergence.
Consider two probability distributions p(w) and q(w) of a random variable w. The KL divergence from q(w) to p(w) is defined as

$$ D\big[\,p(w)\,\big\|\,q(w)\,\big] \equiv \int dw\; p(w)\,\ln\!\left(\frac{p(w)}{q(w)}\right); \qquad (1) $$

it is not symmetric in its arguments, and it is non-negative. Importantly, it is invariant under invertible transformations w → w̃ [18], namely D[p(w)||q(w)] = D[p̃(w̃)||q̃(w̃)], where p̃ and q̃ denote the transformed densities.
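As a quick numerical illustration of Eq. (1) (an addition, not part of the original manuscript), the following Python sketch compares a brute-force integration of the KL divergence between two one-dimensional Gaussians with its known closed form; the chosen means and variances are arbitrary.

```python
import numpy as np

def kl_gaussians_numerical(mu_p, var_p, mu_q, var_q, n=200001, span=12.0):
    """Numerically integrate D[p||q] = int dw p(w) ln(p(w)/q(w)) for 1D Gaussians."""
    s = np.sqrt(max(var_p, var_q))
    w = np.linspace(min(mu_p, mu_q) - span * s, max(mu_p, mu_q) + span * s, n)
    p = np.exp(-(w - mu_p) ** 2 / (2 * var_p)) / np.sqrt(2 * np.pi * var_p)
    q = np.exp(-(w - mu_q) ** 2 / (2 * var_q)) / np.sqrt(2 * np.pi * var_q)
    return np.trapz(p * np.log(p / q), w)

def kl_gaussians_exact(mu_p, var_p, mu_q, var_q):
    """Closed form of the same divergence; asymmetric in (p, q) and non-negative."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

print(kl_gaussians_numerical(0.0, 1.0, 0.3, 1.5))
print(kl_gaussians_exact(0.0, 1.0, 0.3, 1.5))
```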
The problem of causation.

Consider a stochastic system of n variables evolving with ergodic Markovian dynamics. Our goal is to define a quantitative measure of causation, i.e., of the influence that a variable x exerts over the dynamics of another variable y. We want this definition to have both the invariance property of KL divergences and the physical interpretation of a propagation of perturbations.

Since the dynamics is ergodic, and therefore stationary, it suffices to consider the stochastic variables x ≡ x(t = 0), y ≡ y(t = 0) at t = 0, and, a time interval τ later, y_τ ≡ y(t = τ). To avoid cluttered notation, we will implicitly assume that the current values of the remaining n − 2 variables z are absorbed into the conditioning on y, e.g., p(y_τ|y) ≡ p(y_τ|y, z). Conditioning on z avoids confounding variables in z introducing spurious causal links between x and y [24].

Local response divergence.
Let us consider the system at t = 0 with steady-state distribution p(x, y). We make an ideal measurement of its actual state (x, y). Immediately after the measurement, we perturb the state by introducing a small displacement ε > 0 of x, namely x ⇒ x + ε. If the effect of this perturbation propagates to y, then it is reflected in the KL divergence from the natural to the perturbed prediction,

$$ d^{x\to y}_\tau(x, y, \epsilon) \equiv D\big[\, p(y_\tau \,|\, x, y;\; x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau \,|\, x, y) \,\big], \qquad (2) $$

which is a function of the local condition (x, y) and the perturbation strength ε. We name it local response divergence, and denote its ensemble average by ⟨d^{x→y}_τ(x, y, ε)⟩.

The concept of causation, interpreted in the framework of fluctuation-response theory, is only meaningful with respect to an arrow of time. That means to postulate that the perturbation cannot have effects at past times,

$$ p(y_\tau \,|\, x, y;\; x \Rightarrow x+\epsilon) \equiv \begin{cases} p(y_\tau \,|\, x+\epsilon, y) & \text{for } \tau \ge 0,\\[2pt] p(y_\tau \,|\, x, y) & \text{for } \tau < 0. \end{cases} \qquad (3) $$

In writing the conditional probability p(y_τ|x+ε, y), we implicitly assumed p(x+ε, y) > 0, meaning that the condition provoked by the perturbation is possible under the natural statistics. This implies that the response statistics can be predicted without actually perturbing the system, which is the main idea of fluctuation-response theory [19–22].
Information-theoretic cost.
The mean local response divergence ⟨d^{x→y}_τ(x, y, ε)⟩, like any response function in fluctuation-response theory, is defined in relation to a perturbation, irrespective of how difficult it may be to perform this perturbation. Intuitively, we expect that it takes more effort to perturb those variables that fluctuate less. Therefore, we consider the KL divergence from the natural to the perturbed ensemble of conditions,

$$ c_x(\epsilon) \equiv D\big[\, p(x-\epsilon, y) \,\big\|\, p(x, y) \,\big], \qquad (4) $$

to quantify the information-theoretic cost of perturbations, and call it perturbation divergence.

For example, for an underdamped Brownian particle, the perturbation divergence is equivalent to the average thermodynamic work required to perform an ε perturbation of its velocity, up to a factor being the temperature, see Supplementary Information (SI). For an equilibrium ensemble in a potential U(x), with Boltzmann distribution p(x) ∼ exp(−βU(x)), the perturbation divergence is the average reversible work c_x(ε) = β⟨U(x+ε) − U(x)⟩. Note that the definition of Eq. (4) is general, and can be applied to more abstract models where thermodynamic quantities are not clearly identified.

FIG. 1. Here we show, on a concrete example, the origin of the two KL divergences entering the information response of Eq. (5). (Upper) Response to the perturbation x ⇒ x + ε at the trajectory level; x*_t (y*_t) is the perturbed trajectory of x_t (y_t), for the same noise realization. (Lower left) Local response divergence d^{x→y}_τ(x, y, ε): change of the predicted distribution of y_τ for the condition (x, y) at timescale τ = 3. (Lower right) Perturbation divergence c_x(ε): instantaneous displacement of the steady-state ensemble conditional on a particular y. The dynamics follows the nonlinear stochastic model of Eq. (17) with parameters t_R = 10, q = 0.…, α = 0.…, β = 0.2, for a perturbation ε = 0.….
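As a small numerical illustration of the equilibrium example above (an addition, not part of the original manuscript), the sketch below takes a quadratic potential U(x) = kx²/2, an arbitrary choice, and compares the reversible-work form of the perturbation divergence with the closed-form Gaussian KL divergence of the shifted ensemble.

```python
import numpy as np

# Perturbation divergence for a Boltzmann ensemble p(x) ~ exp(-beta*U(x)) with the
# illustrative choice U(x) = 0.5*k*x**2: compare the reversible-work form
# c_x(eps) = beta*<U(x+eps) - U(x)> with the Gaussian KL divergence eps^2/(2*sigma^2).
rng = np.random.default_rng(0)
beta, k, eps = 1.0, 2.0, 0.1
sigma2 = 1.0 / (beta * k)                        # equilibrium variance of x
x = rng.normal(0.0, np.sqrt(sigma2), 1_000_000)  # samples from the Boltzmann distribution

U = lambda z: 0.5 * k * z ** 2
work_form = beta * np.mean(U(x + eps) - U(x))    # average reversible work, times beta
gaussian_kl = eps ** 2 / (2 * sigma2)            # D[p(x - eps) || p(x)] for this Gaussian p

print(work_form, gaussian_kl)                    # agree up to Monte Carlo error
```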
Information response.

We introduce the information response as the ratio between mean local response divergence and perturbation divergence, in the limit of a small perturbation,

$$ \Gamma^{x\to y}_\tau \equiv \lim_{\epsilon\to 0} \frac{\langle d^{x\to y}_\tau(x, y, \epsilon)\rangle}{c_x(\epsilon)} . \qquad (5) $$

We can interpret Γ^{x→y}_τ as an information-theoretic linear response coefficient. This information response is our measure of x → y causation with respect to the timescale τ, see Fig. 1. The time arrow requirement (Eq. (3)) implies Γ^{x→y}_τ = 0 for τ < 0. Introducing the local information response γ^{x→y}_τ(x, y) ≡ lim_{ε→0} d^{x→y}_τ(x, y, ε)/c_x(ε), we can equivalently write Γ^{x→y}_τ = ⟨γ^{x→y}_τ(x, y)⟩.

The information response in the form of Eq. (5) inherently relies on the concept of controlled perturbations. We can reformulate it in purely observational form, in the spirit of the fluctuation-response theorem [19–22], provided p(x, y, y_τ) is sufficiently smooth.

Fisher information.
The one-parameter family {p(y_τ|x, y)}_x of probability densities parametrized by x (for fixed y) can be equipped with a Riemannian metric having d^{x→y}_τ(x, y, ε) as squared line element. In fact, the leading-order term in the Taylor expansion of a KL divergence between probabilities that differ only by a small perturbation of a parameter is of second order, with coefficients known as Fisher information [18, 25]. Explicitly, expanding the mean response divergence for τ > 0, we obtain

$$ \langle d^{x\to y}_\tau(x,y,\epsilon)\rangle = -\frac{\epsilon^2}{2}\,\big\langle \partial_x^2 \ln p(y_\tau|x,y) \big\rangle + O(\epsilon^3), \qquad (6) $$

where we used the interventional causality requirement (Eq. (3)) and probability normalization. Similarly, for the perturbation divergence we have

$$ c_x(\epsilon) = -\frac{\epsilon^2}{2}\,\big\langle \partial_x^2 \ln p(x|y) \big\rangle + O(\epsilon^3). \qquad (7) $$

Applying the Fisher information representation to the information response, we get, for τ > 0,

$$ \Gamma^{x\to y}_\tau = \frac{\big\langle \partial_x^2 \ln p(y_\tau|x,y) \big\rangle}{\big\langle \partial_x^2 \ln p(x|y) \big\rangle}, \qquad (8) $$

that is, the fluctuation-response theorem for KL divergences. For generalizations and a discussion of the connection with the classical fluctuation-response theorem see [26] and the SI text. Eq. (8) is the ratio of two second derivatives over the same physical variable x, and it can be regarded as an application of L'Hôpital's rule to Eq. (5).

In general, Fisher information is not easily connected to Shannon entropy and mutual information [27]. Below, we show that for linear stochastic systems, the information response, which is a ratio of Fisher informations (Eq. (8)), is equivalent to the transfer entropy, a conditional form of mutual information.
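As a minimal check of Eq. (8) (an addition, not part of the original), consider a toy linear-Gaussian model with p(y_τ|x, y) = N(ax + by, s²) and p(x|y) = N(cy, v); all parameter values below are arbitrary assumptions. The Fisher-information ratio can then be compared with the ratio of closed-form Gaussian KL divergences entering Eq. (5).

```python
import numpy as np

# Toy linear-Gaussian check of Eq. (8): Gamma from the Fisher-information ratio
# versus the small-eps ratio of KL divergences in Eq. (5). Hypothetical parameters:
a, b, s2 = 0.7, -0.3, 0.5      # p(y_tau | x, y) = N(a*x + b*y, s2)
v = 1.2                        # p(x | y) = N(c*y, v); only the variance matters here

def gamma_fisher():
    # Eq. (8): ratio of expected second derivatives of the log-likelihoods wrt x
    return (-a ** 2 / s2) / (-1.0 / v)

def gamma_kl_ratio(eps):
    # Eq. (5): mean response divergence over perturbation divergence, Gaussian closed forms
    mean_response_div = (a * eps) ** 2 / (2 * s2)   # conditional mean of y_tau shifts by a*eps
    perturbation_div = eps ** 2 / (2 * v)           # p(x|y) shifts by eps
    return mean_response_div / perturbation_div

print(gamma_fisher())          # a^2 * v / s2
print(gamma_kl_ratio(1e-3))    # same value; exact already, since the model is Gaussian
```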
Transfer entropy.

The most widely used measure of information flow is the conditional mutual information

$$ T^{x\to y}_\tau \equiv \Big\langle D\big[\, p(x, y_\tau\,|\,y) \,\big\|\, p(x\,|\,y)\, p(y_\tau\,|\,y) \,\big] \Big\rangle, \qquad (9) $$

which is generally called transfer entropy [13–17]. It is the average KL divergence from conditional independence of x and y_τ given y.

The transfer entropy is used in the nonequilibrium thermodynamics of measurement-feedback systems, where it is related to work extraction and dissipation through fluctuation theorems [16, 28, 29]; in data science, causal network reconstruction from time series is based on statistical significance tests for the presence of transfer entropy [24].

If uncertainty is measured by the Shannon entropy S[p(x)] = −∫ dx p(x) ln p(x), then the transfer entropy quantifies how much, on average, the uncertainty in predicting y_τ from y decreases if we additionally get to know x, T^{x→y}_τ = ⟨S[p(y_τ|y)] − S[p(y_τ|x, y)]⟩.

While the joint probability p(x, y, y_τ) contains all the physics of the interacting dynamics of x and y, the description in terms of the scalar transfer entropy T^{x→y}_τ represents a form of coarse-graining.

We introduce the local transfer entropy t^{x→y}_τ(x, y) = D[p(y_τ|x, y)||p(y_τ|y)]; thus, for the (macroscopic) transfer entropy, T^{x→y}_τ = ⟨t^{x→y}_τ(x, y)⟩.

We next show that T^{x→y}_τ and Γ^{x→y}_τ are intimately related for linear systems.

Linear stochastic dynamics.
As an example of application, we study the information response in Ornstein-Uhlenbeck (OU) processes [30], i.e., linear stochastic systems of the type

$$ \frac{d\xi^{(i)}_t}{dt} + \sum_{j=1}^{n} A_{ij}\,\xi^{(j)}_t = \eta^{(i)}_t , \qquad (10) $$

where ⟨η^{(i)}_t η^{(j)}_{t'}⟩ = q_{ij} δ(t − t') is Gaussian white noise with symmetric and constant covariance matrix. For the system to be stationary, we require the eigenvalues of the interaction matrix A_{ij} to have positive real part. For our setting, we identify x ≡ ξ^{(i)} and y ≡ ξ^{(j)} for some particular (i, j), and z ≡ {ξ^{(k)}}_{k=1,...,n} \ {ξ^{(i)}, ξ^{(j)}} as the remaining variables. Here, probability densities are normal distributions, p(y_τ|x, y) = N_{y_τ}(⟨y_τ|x, y⟩, σ²_{y_τ|x,y}), with mean ⟨y_τ|x, y⟩ and variance σ²_{y_τ|x,y} ≡ ⟨y²_τ|x, y⟩ − ⟨y_τ|x, y⟩², and similarly for p(y_τ|y) and p(x|y). Expectations depend linearly on the conditions, ∂²_x⟨y_τ|x, y⟩ = 0, and variances are independent of them, ∂_x σ²_{y_τ|x,y} = 0. Recall the implicit conditioning on the confounding variables z through y.

Applying these Gaussian properties to Eq. (8), the information response becomes

$$ \Gamma^{x\to y}_\tau = \frac{\big(\partial_x \langle y_\tau | x, y\rangle\big)^2\, \sigma^2_{x|y}}{\sigma^2_{y_\tau|x,y}} , \qquad (11) $$

where ∂_x⟨y_τ|x, y⟩ can be interpreted as the coefficient of x in the linear regression for y_τ based on the predictors (x, y), and σ²_{y_τ|x,y} as its error variance. The variance σ²_{x|y} quantifies the strength of the natural fluctuations of x (the variable to be perturbed) conditional on y (the other variables). In fact, the information-theoretic cost of the perturbation, c_x(ε) = (ε²/2) σ⁻²_{x|y} + O(ε³), is higher if x and y are more correlated.

In linear systems, the transfer entropy is equivalent to Granger causality [31],

$$ T^{x\to y}_\tau = \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right) , \qquad (12) $$

as can be seen by substituting the Gaussian expressions for p(y_τ|x, y) and p(y_τ|y) into Eq. (9).

FIG. 2. Local information response (Left) and local transfer entropy (Right) are different, although their expectation values agree in linear systems. The model is the OU process of Eq. (15) with parameters t_R = 10, q = 0.…, α = 0.…, β = 0.…, τ = 3.

The decrease in uncertainty when adding the predictor x to the linear regression of y_τ based on y reads

$$ \sigma^2_{y_\tau|y} - \sigma^2_{y_\tau|x,y} = \sigma^2_{x|y}\,\big(\partial_x \langle y_\tau | x, y\rangle\big)^2 , \qquad (13) $$

see SI text. Comparing Eq. (11) with Eq. (12) and using Eq. (13), we obtain a non-trivial equivalence between information response and transfer entropy for OU processes,

$$ \Gamma^{x\to y}_\tau = e^{2 T^{x\to y}_\tau} - 1 . \qquad (14) $$

Remarkably, despite the equivalence of the macroscopic quantities Γ^{x→y}_τ and T^{x→y}_τ, the corresponding local quantities are markedly different, see Fig. 2.

In Fig. 2, we show the local information response γ^{x→y}_τ(x, y) and the local transfer entropy t^{x→y}_τ(x, y) for the hierarchical OU process of two variables,

$$ \frac{dx}{dt} = -\frac{x}{t_R} + \eta_t , \qquad \frac{dy}{dt} = \alpha x - \beta y , \qquad (15) $$

with ⟨η_t η_{t'}⟩ = q δ(t − t'), and parameters α, β > 0, t_R > 0, q > 0. This is possibly the simplest model of nonequilibrium stationary interacting dynamics with continuous variables [32]. However, the pattern of Fig. 2 is qualitatively the same for any linear OU process. In fact, the perturbation x ⇒ x + ε shifts the prediction p(y_τ|x, y) by the same amount on the y axis, ε∂_x⟨y_τ|x, y⟩, independently of the condition (x, y), without affecting the variance σ²_{y_τ|x,y}. Hence, d^{x→y}_τ(x, y, ε) is constant in space, and the local contribution only reflects the density p(x, y), here a bivariate Gaussian. On the contrary, the KL divergence corresponding to the change of the prediction p(y_τ|y) into p(y_τ|x, y) given by the knowledge of x is strongly dependent on (x, y). In fact, the local transfer entropy reads

$$ t^{x\to y}_\tau(x,y) = T^{x\to y}_\tau + \frac{\big(\partial_x\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|y}} \Big[ \big(x - \langle x|y\rangle\big)^2 - \sigma^2_{x|y} \Big] , \qquad (16) $$

see SI text. In particular, for likely values x ≈ ⟨x|y⟩, the divergence t^{x→y}_τ(x, y) is smaller compared to the unlikely situations x ≫ ⟨x|y⟩ and x ≪ ⟨x|y⟩. Thus, when multiplied by the steady-state density p(x, y), t^{x→y}_τ(x, y) attains a bimodal shape.
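The following sketch (an addition, not part of the original manuscript) checks Eq. (14) for the two-variable OU process of Eq. (15) using its exact Gaussian covariances; the parameter values are illustrative guesses, since the figure's exact values are not recoverable here.

```python
import numpy as np

# Gaussian check of Eq. (14), Gamma = exp(2*T) - 1, for the 2D OU process of Eq. (15).
t_R, q, alpha, beta, tau = 10.0, 0.1, 0.5, 0.2, 3.0   # illustrative guesses

A = np.array([[1.0 / t_R, 0.0],       # drift matrix: d(xi)/dt = -A xi + eta
              [-alpha,    beta]])
Q = np.array([[q, 0.0],               # white-noise covariance (only x is driven)
              [0.0, 0.0]])

# Stationary covariance C solves A C + C A^T = Q (vectorized Lyapunov equation).
I2 = np.eye(2)
C = np.linalg.solve(np.kron(A, I2) + np.kron(I2, A), Q.flatten()).reshape(2, 2)

# Time-lagged covariance <xi(tau) xi(0)^T> = expm(-A*tau) C, via eigendecomposition.
w, V = np.linalg.eig(-A * tau)
lagged = (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real @ C

# Joint covariance of (x0, y0, y_tau); stationarity gives var(y_tau) = C[1, 1].
S = np.array([[C[0, 0],      C[0, 1],      lagged[1, 0]],
              [C[0, 1],      C[1, 1],      lagged[1, 1]],
              [lagged[1, 0], lagged[1, 1], C[1, 1]]])

# Gaussian conditional variances and the regression coefficient of x in <y_tau | x, y>.
coef = np.linalg.solve(S[:2, :2], S[:2, 2])            # regression coefficients on (x, y)
var_ytau_xy = S[2, 2] - S[2, :2] @ coef                # sigma^2_{y_tau | x, y}
var_ytau_y = S[2, 2] - S[1, 2] ** 2 / S[1, 1]          # sigma^2_{y_tau | y}
var_x_y = S[0, 0] - S[0, 1] ** 2 / S[1, 1]             # sigma^2_{x | y}

Gamma = coef[0] ** 2 * var_x_y / var_ytau_xy           # Eq. (11)
T = 0.5 * np.log(var_ytau_y / var_ytau_xy)             # Eq. (12)
print(Gamma, np.exp(2 * T) - 1)                        # coincide, Eq. (14)
```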
Nonlinear example.

As a counter-example to the general validity of Eq. (14) for nonlinear systems, consider the following nonlinear Langevin equation for two variables,

$$ \frac{dx}{dt} = -\frac{x}{t_R} + \eta_t , \qquad \frac{dy}{dt} = \alpha x^2 - \beta y . \qquad (17) $$

Numerical simulations (same parameters as for Eq. (15)) show that Eq. (14) is violated, see SI for details. Hence, in general, the transfer entropy is not easily connected to the information response.
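Below is a minimal Euler-Maruyama sketch (an addition, not in the original) for generating stationary trajectories of the nonlinear model of Eq. (17); the integration step, trajectory length, and parameters are arbitrary choices, and the resulting samples could then be used to estimate the divergences numerically.

```python
import numpy as np

# Euler-Maruyama integration of Eq. (17): dx/dt = -x/t_R + eta, dy/dt = alpha*x**2 - beta*y,
# with <eta_t eta_t'> = q*delta(t - t'). Parameters are illustrative guesses.
rng = np.random.default_rng(1)
t_R, q, alpha, beta = 10.0, 0.1, 0.5, 0.2
dt, n_steps = 0.01, 200_000

x, y = 0.0, 0.0
traj = np.empty((n_steps, 2))
for i in range(n_steps):
    x += (-x / t_R) * dt + np.sqrt(q * dt) * rng.normal()
    y += (alpha * x ** 2 - beta * y) * dt
    traj[i] = x, y

# Discard a transient of a few relaxation times; the remaining samples approximate the
# stationary ensemble from which p(x, y, y_tau), Gamma, and T could be estimated.
stationary = traj[int(5 * t_R / dt):]
print(stationary.mean(axis=0), stationary.std(axis=0))
```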
Ensemble information response.

Similar to the above, we can define an analogous information response at the ensemble level. From the same perturbation x ⇒ x + ε, we consider the unconditional response divergence

$$ \tilde d^{\,x\to y}_\tau(\epsilon) \equiv D\big[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\big] , \qquad (18) $$

i.e., we evaluate the response at the ensemble level, without knowledge of the measurement (x, y),

$$ p(y_\tau \,|\, x \Rightarrow x+\epsilon) = \big\langle p(y_\tau \,|\, x, y;\; x \Rightarrow x+\epsilon) \big\rangle . \qquad (19) $$

In general, $\tilde d^{\,x\to y}_\tau(\epsilon) \neq \langle d^{x\to y}_\tau(x,y,\epsilon)\rangle$. We define the ensemble information response as

$$ \tilde\Gamma^{x\to y}_\tau \equiv \lim_{\epsilon\to 0} \frac{\tilde d^{\,x\to y}_\tau(\epsilon)}{c_x(\epsilon)} = - \frac{\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y) \,\big|\, y_\tau \big\rangle^2 \Big\rangle}{\big\langle \partial_x^2 \ln p(x|y) \big\rangle} , \qquad (20) $$

where the second equality, valid only for τ > 0, is the corresponding fluctuation-response theorem. A straightforward generalization to arbitrary perturbation profiles ε(x, y) is discussed in the SI text. Note that we could write $\tilde d^{\,x\to y}_\tau(\epsilon)$ through the Fisher information $\langle \partial^2_\epsilon \ln \langle p(y_\tau|x+\epsilon,y)\rangle\rangle\big|_{\epsilon=0}$, but the partial derivative would be over the perturbation parameter ε, and we found it more natural to consider the self-prediction quantity ⟨⟨∂_x ln p(y_τ|x, y)|y_τ⟩²⟩. See SI text for technical details on expectation brackets.

In linear systems, the ensemble information response takes the form

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, e^{-2 I_{xy,y_\tau}} = e^{-2 I_{y,y_\tau}} \big( 1 - e^{-2 T^{x\to y}_\tau} \big) , \qquad (21) $$

where $I_{y,y_\tau} \equiv D[\,p(y,y_\tau)\,\|\,p(y)\,p(y_\tau)\,]$ is the mutual information between y and y_τ, and $I_{xy,y_\tau} = I_{y,y_\tau} + T^{x\to y}_\tau$ is the mutual information that the two predictors (x, y) together have about the output y_τ, see SI text.

From the nonnegativity of informations, we obtain the bound 0 ≤ Γ̃^{x→y}_τ ≤ 1. We see that Γ̃^{x→y}_τ increases with the transfer entropy T^{x→y}_τ, and decreases with the autocorrelation I_{y,y_τ}. Since I_{y,y_τ} diverges for τ → 0, the ensemble information response vanishes in this limit: the perturbation of the x ensemble takes a finite time to fully propagate its effect to the y ensemble. Since time-lagged informations vanish for τ → ∞ in ergodic processes, ensembles relax asymptotically towards steady state after a perturbation, and correspondingly the ensemble information response vanishes. This provides a trade-off shape for Γ̃^{x→y}_τ as a function of the timescale τ. Note the asymptotics Γ̃^{x→y}_τ / Γ^{x→y}_τ → 1 for τ → ∞, also resulting from ergodicity.

Discussion.
In this Letter, we introduced a new measure of causation that has both the invariance properties required for an information-theoretic measure and the physical interpretation of a propagation of perturbations. It has the form of a linear response coefficient between Kullback-Leibler divergences, and it is based on the information-theoretic cost of perturbations. We would like to call it information response.

We study the behavior of the information response analytically in linear stochastic systems, and show that it reduces to the known transfer entropy in this case. This establishes a first connection between fluctuation-response theory and information flow, i.e., the two main perspectives on the problem of causation at present. Additionally, it provides a new relation between Fisher and mutual information.

We suggest our information response for the design of new quantitative causal inference methods [24]. Its practical estimation on time series, as is normally the case for information-theoretic measures, depends on the learnability of probability distributions from a finite amount of data [33, 34].
Acknowledgments
We thank M. Scazzocchio for helpful discussions. AA is supported by the DFG through FR3429/3 to BMF; AA and BMF are supported through the Excellence Initiative by the German Federal and State Governments (Cluster of Excellence PoL EXC-2068).

[1] A. K. Seth, A. B. Barrett, and L. Barnett, Granger causality analysis in neuroscience and neuroimaging, Journal of Neuroscience, 3293 (2015).
[2] J. Runge, S. Bathiany, E. Bollt, G. Camps-Valls, D. Coumou, E. Deyle, C. Glymour, M. Kretschmer, M. D. Mahecha, J. Muñoz-Marí, et al., Inferring causation from time series in Earth system sciences, Nature Communications, 1 (2019).
[3] L. Luzzatto and P. P. Pandolfi, Causality and chance in the development of cancer, N Engl J Med, 84 (2015).
[4] O. Kwon and J.-S. Yang, Information flow between stock indices, EPL (Europhysics Letters), 68003 (2008).
[5] S. Ito and T. Sagawa, Information thermodynamics on causal networks, Physical Review Letters, 180603 (2013).
[6] J. M. Horowitz and M. Esposito, Thermodynamics with continuous information flow, Physical Review X, 031015 (2014).
[7] R. G. James, N. Barnett, and J. P. Crutchfield, Information flows? A critique of transfer entropies, Physical Review Letters, 238701 (2016).
[8] A. Auconi, A. Giansanti, and E. Klipp, Causal influence in linear Langevin networks without feedback, Physical Review E, 042315 (2017).
[9] J. Pearl, Causality (Cambridge University Press, 2009).
[10] D. Janzing, D. Balduzzi, M. Grosse-Wentrup, B. Schölkopf, et al., Quantifying causal influences, The Annals of Statistics, 2324 (2013).
[11] E. Aurell and G. Del Ferraro, Causal analysis, correlation-response, and dynamic cavity, in Journal of Physics: Conference Series, Vol. 699 (2016) p. 012002.
[12] M. Baldovin, F. Cecconi, and A. Vulpiani, Understanding causation via correlations and linear response theory, Physical Review Research, 043436 (2020).
[13] J. Massey, Causality, feedback and directed information, in Proc. Int. Symp. Inf. Theory Applic. (ISITA-90) (Citeseer, 1990) pp. 303–305.
[14] T. Schreiber, Measuring information transfer, Physical Review Letters, 461 (2000).
[15] N. Ay and D. Polani, Information flows in causal networks, Advances in Complex Systems, 17 (2008).
[16] J. M. Parrondo, J. M. Horowitz, and T. Sagawa, Thermodynamics of information, Nature Physics, 131 (2015).
[17] T. M. Cover, Elements of Information Theory (John Wiley & Sons, 1999).
[18] S. I. Amari,
Information Geometry and Its Applications, Vol. 194 (Springer, 2016).
[19] R. Kubo, The fluctuation-dissipation theorem, Reports on Progress in Physics, 255 (1966).
[20] R. Kubo, Brownian motion and nonequilibrium statistical mechanics, Science, 330 (1986).
[21] U. M. B. Marconi, A. Puglisi, L. Rondoni, and A. Vulpiani, Fluctuation-dissipation: response theory in statistical physics, Physics Reports, 111 (2008).
[22] C. Maes, Response theory: A trajectory-based approach, Frontiers in Physics, 229 (2020).
[23] A. Dechant and S.-i. Sasa, Fluctuation-response inequality out of equilibrium, Proceedings of the National Academy of Sciences, 6430 (2020).
[24] J. Runge, Causal network reconstruction from time series: From theoretical assumptions to practical estimation, Chaos: An Interdisciplinary Journal of Nonlinear Science, 075310 (2018).
[25] S. Ito and A. Dechant, Stochastic time evolution, information geometry, and the Cramér-Rao bound, Physical Review X, 021056 (2020).
[26] Eq. (8) holds for a larger class of divergences beyond the KL divergence, because the Fisher information is the unique invariant metric [18].
[27] X.-X. Wei and A. A. Stocker, Mutual information, Fisher information, and efficient coding, Neural Computation, 305 (2016).
[28] T. Sagawa and M. Ueda, Nonequilibrium thermodynamics of feedback control, Physical Review E, 021104 (2012).
[29] M. L. Rosinberg and J. M. Horowitz, Continuous information flow fluctuations, EPL (Europhysics Letters), 10007 (2016).
[30] H. Risken, Fokker-Planck equation, in The Fokker-Planck Equation (Springer, 1996) pp. 63–95.
[31] L. Barnett, A. B. Barrett, and A. K. Seth, Granger causality and transfer entropy are equivalent for Gaussian variables, Physical Review Letters, 238701 (2009).
[32] A. Auconi, A. Giansanti, and E. Klipp, Information thermodynamics for time series of signal-response models, Entropy, 177 (2019).
[33] W. Bialek, C. G. Callan, and S. P. Strong, Field theories for learning probability distributions, Physical Review Letters, 4693 (1996).
[34] W. Bialek, S. E. Palmer, and D. J. Schwab, What makes it possible to learn probability distributions in the natural world?, arXiv preprint arXiv:2008.12279 (2020).

Supplementary Information for the manuscript "Fluctuation-response theorem for Kullback-Leibler divergences to quantify causation"

Andrea Auconi, Benjamin Friedrich, and Andrea Giansanti
A. Convention on expectation symbols
In order not to overload the formalism, when taking expectations we do not specify the variables over which they are taken. However, here we show how they can be understood immediately from the context. As an example, let us consider the following expression (the numerator of Eq. (20) in the main text),

$$ \Big\langle \big\langle \partial_x \ln p(y_\tau | x, y) \,\big|\, y_\tau \big\rangle^2 \Big\rangle , \qquad (1) $$

where p(y_τ|x, y) denotes the conditional probability of y_τ given the knowledge of (x, y). The term ∂_x ln p(y_τ|x, y) is a function of the three variables (x, y, y_τ). The expectation ⟨∂_x ln p(y_τ|x, y)|y_τ⟩ is conditional on the knowledge of y_τ, therefore it is taken over the remaining variables (x, y) with respect to the conditional probability p(x, y|y_τ),

$$ \big\langle \partial_x \ln p(y_\tau|x,y) \,\big|\, y_\tau \big\rangle \equiv \int\!\!\int dx\, dy\; p(x,y|y_\tau)\, \partial_x \ln p(y_\tau|x,y) . \qquad (2) $$

The outer expectation, being left with only the variable y_τ, is necessarily taken with respect to the unconditional probability p(y_τ),

$$ \Big\langle \big\langle \partial_x \ln p(y_\tau|x,y) \,\big|\, y_\tau \big\rangle^2 \Big\rangle = \int dy_\tau\; p(y_\tau) \left( \int\!\!\int dx\, dy\; p(x,y|y_\tau)\, \partial_x \ln p(y_\tau|x,y) \right)^{\!2} . \qquad (3) $$

As a second example, consider the Kullback-Leibler divergence

$$ D\big[\, \langle p(y_\tau|x+\epsilon,y)\rangle \,\big\|\, p(y_\tau) \,\big] \equiv \int dy_\tau \; \langle p(y_\tau|x+\epsilon,y)\rangle \, \ln\!\left( \frac{\langle p(y_\tau|x+\epsilon,y)\rangle}{p(y_\tau)} \right) , \qquad (4) $$

which quantifies the distinguishability of two probability distributions defined over the same variable y_τ. When introduced in its context (below Eq. (29)), we know that (x, y) are stochastic variables, while ε is a scalar. Therefore the expectation ⟨p(y_τ|x+ε, y)⟩ is necessarily taken over the conditions (x, y),

$$ \langle p(y_\tau|x+\epsilon,y)\rangle = \int\!\!\int dx\, dy\; p(x,y)\, p(y_\tau|x+\epsilon,y) . \qquad (5) $$
1. Iterated conditioning theorem
We often use the iterated conditioning theorem, ⟨⟨y|x⟩⟩ = ⟨y⟩:

$$ \langle\langle y|x\rangle\rangle = \int dx\, p(x) \int dy\, p(y|x)\, y = \int dy\, p(y)\, y \int dx\, p(x|y) = \int dy\, p(y)\, y = \langle y \rangle . \qquad (6) $$

For a more general proof in terms of σ-algebras, see [1, 2]. Importantly, note that also ⟨⟨⟨y_τ|x⟩|y⟩⟩ = ⟨y_τ⟩:

$$ \langle\langle\langle y_\tau|x\rangle|y\rangle\rangle = \int dy\, p(y) \int dx\, p(x|y) \int dy_\tau\, p(y_\tau|x)\, y_\tau = \int dx\, p(x) \int dy_\tau\, p(y_\tau|x)\, y_\tau = \langle y_\tau \rangle . \qquad (7) $$
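A quick Monte Carlo illustration of Eq. (6) (added here, not in the original SI), using a correlated Gaussian pair whose conditional mean is known in closed form; the correlation and scales are arbitrary choices.

```python
import numpy as np

# Check <<y|x>> = <y> by Monte Carlo for a correlated Gaussian pair (x, y):
# for zero means, <y|x> has the closed form rho*(sd_y/sd_x)*x.
rng = np.random.default_rng(3)
rho, sd_x, sd_y = 0.6, 1.0, 2.0
cov = [[sd_x ** 2, rho * sd_x * sd_y], [rho * sd_x * sd_y, sd_y ** 2]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=500_000).T

cond_mean_y_given_x = rho * (sd_y / sd_x) * x   # <y|x> for this Gaussian model
print(cond_mean_y_given_x.mean(), y.mean())     # both estimate <y> = 0
```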
B. Perturbation divergence in an underdamped Brownian particle

Let us consider an underdamped Brownian particle of mass m immersed in a thermal reservoir at temperature T and with viscous damping λ. The stochastic dynamics of its velocity v_t follows the Langevin equation [3],

$$ m\,\dot v_t = -\lambda v_t + \xi_t + F_t , \qquad (8) $$

where ξ_t denotes Gaussian white noise with covariance ⟨ξ_t ξ_{t'}⟩ = 2λT δ(t − t'), and we put the Boltzmann constant to unity, k_B = 1. The external perturbation is exerted through a force of intensity f/Δt applied to the particle during the time interval [0, Δt], that is written F_t = (f/Δt) I_{[0,Δt]}(t), and converges to a pulse for Δt → 0. Before the perturbation is applied at time t = 0, the ensemble is at equilibrium and velocities are Gaussian distributed: p(v_0) = G_{v_0}(0, T/m). For a generic time instant t in 0 ≤ t ≤ Δt the formal solution for v_t is written

$$ v_t = v_0\, e^{-\frac{\lambda}{m}t} + \frac{1}{m}\int_0^t dt'\; \xi_{t'}\, e^{-\frac{\lambda}{m}(t-t')} + \frac{f}{\lambda\,\Delta t}\left(1 - e^{-\frac{\lambda}{m}t}\right), \qquad (9) $$

the three terms corresponding respectively to the deterministic relaxation of the initial condition, the noise, and the perturbation. Let us evaluate the total variation of the velocity at the end of the perturbation period,

$$ \Delta v \equiv v_{\Delta t} - v_0 = \left(\frac{f}{\lambda\,\Delta t} - v_0\right)\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) + \frac{1}{m}\int_0^{\Delta t} dt'\; \xi_{t'}\, e^{-\frac{\lambda}{m}(\Delta t - t')}. \qquad (10) $$

Let us consider its average over the noise realizations conditional on the initial velocity v_0,

$$ \langle \Delta v \,|\, v_0\rangle = \left(\frac{f}{\lambda\,\Delta t} - v_0\right)\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) = \frac{f}{m} + O(\Delta t), \qquad (11) $$

which is independent of v_0 at zero order in Δt. The fluctuations around ⟨Δv|v_0⟩ are described by the variance

$$ \sigma^2_{\Delta v|v_0} \equiv \left\langle \big(\Delta v - \langle \Delta v|v_0\rangle\big)^2 \,\Big|\, v_0 \right\rangle = \frac{1}{m^2}\int_0^{\Delta t}\!\!\int_0^{\Delta t}\! dt'\, dt''\; \langle \xi_{t'}\xi_{t''}\rangle\, e^{-\frac{\lambda}{m}(2\Delta t - t' - t'')} = \frac{T}{m}\left(1 - e^{-\frac{2\lambda}{m}\Delta t}\right) = O(\Delta t), \qquad (12) $$

which vanishes in the limit of a pulse perturbation Δt → 0. Therefore, if the perturbation is performed fast enough, the Gaussian p(Δv|v_0) converges pointwise to the Dirac delta δ(Δv − ⟨Δv⟩), and we can simply write

$$ \Delta v = \langle \Delta v\rangle = \frac{f}{m}. \qquad (13) $$

Let us now consider the amount of work required to perform the perturbation,

$$ W \equiv \int_0^{\Delta t} F_t\, v_t\, dt = \frac{f}{\Delta t}\int_0^{\Delta t} v_t\, dt = \frac{f}{\Delta t}\left[ v_0\,\frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) + \frac{1}{m}\int_0^{\Delta t}\! dt \int_0^{t}\! dt'\; \xi_{t'}\, e^{-\frac{\lambda}{m}(t-t')} + \frac{f}{\lambda\,\Delta t}\left(\Delta t - \frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right)\right)\right]. \qquad (14) $$

Its conditional expectation is

$$ \langle W|v_0\rangle = \frac{f}{\Delta t}\left[ v_0\,\frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) + \frac{f}{\lambda\,\Delta t}\left(\Delta t - \frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right)\right)\right] = f\left(v_0 + \frac{f}{2m}\right) + O(\Delta t) = m\,\Delta v\left(v_0 + \frac{\Delta v}{2}\right) + O(\Delta t), \qquad (15) $$

and its variance is bounded as

$$ \sigma^2_{W|v_0} = \left(\frac{f}{m\,\Delta t}\right)^2 \int_0^{\Delta t}\!dt\int_0^{\Delta t}\!ds\int_0^{t}\!dt'\int_0^{s}\!ds'\; \langle \xi_{t'}\xi_{s'}\rangle\, e^{-\frac{\lambda}{m}(t-t'+s-s')} \le 2\lambda T\left(\frac{f}{m\,\Delta t}\right)^2 \int_0^{\Delta t}\!dt\int_0^{\Delta t}\!ds\int_0^{\min(t,s)}\!dt' \le 2\lambda T\left(\frac{f}{m}\right)^2 \Delta t = O(\Delta t). \qquad (16) $$

We see that the work performed in a pulse perturbation is W = f(v_0 + f/(2m)), which strongly depends on the condition v_0 for small f. The ensemble average is ⟨W⟩ = f²/(2m) = (m/2)(Δv)², and the variance is σ²_W = f²σ²_{v_0} = f²T/m = 2⟨W⟩T. We can identify the average work ⟨W⟩ = (m/2)(Δv)² with the instantaneous change in kinetic energy of the ensemble, which is therefore reversible [4].

Let us consider the perturbation divergence, Eq. (4) in the main text, for this 1D example,

$$ c_v(\Delta v) = D\big[\,p(v_0 - \Delta v)\,\big\|\,p(v_0)\,\big] = D\big[\,p(v_0)\,\big\|\,p(v_0 + \Delta v)\,\big] = \int dv_0\; G_{v_0}\!\left(0, \tfrac{T}{m}\right) \ln\!\left(\frac{G_{v_0}(0, \tfrac{T}{m})}{G_{v_0}(\Delta v, \tfrac{T}{m})}\right) = \frac{m}{2T}\int dv_0\; G_{v_0}\!\left(0, \tfrac{T}{m}\right)\big(\!-2 v_0\, \Delta v + (\Delta v)^2\big) = \frac{m\,(\Delta v)^2}{2T} = \frac{\langle W\rangle}{T}. \qquad (17) $$

For the underdamped Brownian particle, Eq. (17) formalizes the equivalence of the information-theoretic cost and the thermodynamic cost of perturbations, up to a factor being the temperature T.
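The following Langevin simulation (an addition, not in the original SI) illustrates the claim ⟨W⟩/T = m(Δv)²/(2T) for a short force pulse; mass, damping, temperature, pulse impulse, and step sizes are all arbitrary choices.

```python
import numpy as np

# Sketch of the pulse-work identity <W>/T = m*dv**2/(2*T) for an underdamped
# Brownian particle, m*dv/dt = -lam*v + xi + F, with <xi xi'> = 2*lam*T*delta.
rng = np.random.default_rng(2)
m, lam, T = 1.0, 1.0, 1.0            # mass, damping, temperature (k_B = 1)
f, dt_pulse = 0.2, 1e-3              # impulse f applied as a force f/dt_pulse over [0, dt_pulse]
dt, n_part = 1e-5, 200_000

v = rng.normal(0.0, np.sqrt(T / m), n_part)   # equilibrium initial velocities
W = np.zeros(n_part)
force = f / dt_pulse
for _ in range(int(dt_pulse / dt)):
    W += force * v * dt                        # accumulate the work F*v*dt
    noise = rng.normal(0.0, np.sqrt(2 * lam * T * dt), n_part)
    v += (-lam * v + force) * dt / m + noise / m

dv = f / m                                     # pulse-limit velocity shift
print(W.mean() / T, m * dv ** 2 / (2 * T))     # agree up to O(dt_pulse) and sampling error
```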
C. Proof of Equation (6) - Fisher information

Here we review the basic connection between KL divergence and Fisher information, applied to our framework. The interventional causality requirement, Eq. (3) in the main text, imposes that for positive τ > 0 the perturbed prediction is p(y_τ|x, y; x ⇒ x+ε) ≡ p(y_τ|x+ε, y), thus the local response divergence is d^{x→y}_τ(x, y, ε) = D[p(y_τ|x+ε, y)||p(y_τ|x, y)]. Let us take its ensemble average and expand in orders of the perturbation,

$$ \begin{aligned} \langle d^{x\to y}_\tau(x,y,\epsilon)\rangle &= \big\langle D\big[\,p(y_\tau|x+\epsilon,y)\,\big\|\,p(y_\tau|x,y)\,\big]\big\rangle \\ &= \int\!\!\int dx\,dy\; p(x,y)\int dy_\tau\; p(y_\tau|x+\epsilon,y)\,\ln\!\left(\frac{p(y_\tau|x+\epsilon,y)}{p(y_\tau|x,y)}\right) \\ &= \int\!\!\int dx\,dy\; p(x-\epsilon,y)\int dy_\tau\; p(y_\tau|x,y)\,\ln\!\left(\frac{p(y_\tau|x,y)}{p(y_\tau|x-\epsilon,y)}\right) \\ &= -\frac{\epsilon^2}{2}\int\!\!\int dx\,dy\; p(x-\epsilon,y)\int dy_\tau\; p(y_\tau|x,y)\,\partial_x^2 \ln p(y_\tau|x,y) + O(\epsilon^3) \\ &= -\frac{\epsilon^2}{2}\int\!\!\int dx\,dy\; p(x,y)\int dy_\tau\; p(y_\tau|x,y)\,\partial_x^2 \ln p(y_\tau|x,y) + O(\epsilon^3) \\ &= -\frac{\epsilon^2}{2}\,\big\langle \partial_x^2 \ln p(y_\tau|x,y)\big\rangle + O(\epsilon^3), \end{aligned} \qquad (18) $$

where we used the probability normalization,

$$ \int dy_\tau\; p(y_\tau|x,y)\,\partial_x \ln p(y_\tau|x,y) = \int dy_\tau\; \partial_x p(y_\tau|x,y) = \partial_x \int dy_\tau\; p(y_\tau|x,y) = \partial_x 1 = 0 . $$

The quantity −⟨∂²_x ln p(y_τ|x, y)⟩ is called Fisher information [5].

D. Proof of Equation (13)
Here we derive the relation between the variances σ²_{y_τ|x,y} and σ²_{y_τ|y}, starting from the linearity of conditional expectations, ⟨y_τ|x, y⟩ = x ∂_x⟨y_τ|x, y⟩ + y ∂_y⟨y_τ|x, y⟩, and ⟨x|y⟩ = y ∂_y⟨x|y⟩. Recall that the current state of confounding variables z is absorbed into y, so that ⟨y_τ|x, y⟩ ≡ ⟨y_τ|x, y, z⟩, and y ∂_y⟨y_τ|x, y⟩ ≡ y ∂_y⟨y_τ|x, y, z⟩ + z ∂_z⟨y_τ|x, y, z⟩.

We apply the iterated conditioning to p(y_τ|y),

$$ p(y_\tau|y) = \int dx\; p(x, y_\tau|y) = \int dx\; p(x|y)\, p(y_\tau|x,y) = \int dx\; G_x\!\big(\langle x|y\rangle, \sigma^2_{x|y}\big)\, G_{y_\tau}\!\big(\langle y_\tau|x,y\rangle, \sigma^2_{y_\tau|x,y}\big) = A \int dx\; e^{-Bx^2 + Cx} = A\,\sqrt{\frac{\pi}{B}}\; e^{\frac{C^2}{4B}}, \qquad (19) $$

where in the last line we recognized the form of a Gaussian integral with

$$ A = \frac{1}{2\pi\, \sigma_{x|y}\, \sigma_{y_\tau|x,y}}\, \exp\!\left[ -\frac{\big(y_\tau - y\,\partial_y\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|x,y}} - \frac{\langle x|y\rangle^2}{2\,\sigma^2_{x|y}} \right], \qquad (20) $$

$$ B = \frac{1}{2\,\sigma^2_{x|y}} + \frac{\big(\partial_x\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|x,y}}, \qquad (21) $$

$$ C = \frac{\big(y_\tau - y\,\partial_y\langle y_\tau|x,y\rangle\big)\,\partial_x\langle y_\tau|x,y\rangle}{\sigma^2_{y_\tau|x,y}} + \frac{\langle x|y\rangle}{\sigma^2_{x|y}}. \qquad (22) $$

Now we equate the expression of Eq. (19) with the Gaussian p(y_τ|y) = G_{y_τ}(⟨y_τ|y⟩, σ²_{y_τ|y}), which can be done already by equating the prefactors, and obtain

$$ \sigma^2_{y_\tau|y} - \sigma^2_{y_\tau|x,y} = \sigma^2_{x|y}\,\big(\partial_x\langle y_\tau|x,y\rangle\big)^2, \qquad (23) $$

which relates a reduction in variance to the corresponding linear regression coefficient.
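Equation (23) is a general identity for Gaussian triples; the short sketch below (an addition, not in the original SI) verifies it numerically for a random covariance matrix.

```python
import numpy as np

# Check Eq. (23) for a generic Gaussian triple with variable order (x, y, y_tau):
# sigma^2_{ytau|y} - sigma^2_{ytau|x,y} = sigma^2_{x|y} * (d<ytau|x,y>/dx)^2.
rng = np.random.default_rng(4)
M = rng.normal(size=(3, 3))
S = M @ M.T                                         # a random (positive definite) covariance

coef = np.linalg.solve(S[:2, :2], S[:2, 2])         # regression of y_tau on (x, y)
var_ytau_xy = S[2, 2] - S[2, :2] @ coef
var_ytau_y = S[2, 2] - S[1, 2] ** 2 / S[1, 1]
var_x_y = S[0, 0] - S[0, 1] ** 2 / S[1, 1]

print(var_ytau_y - var_ytau_xy, var_x_y * coef[0] ** 2)   # equal
```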
E. Proof of Equation (16)

Here we derive the form of the local contribution t^{x→y}_τ(x, y) to the transfer entropy T^{x→y}_τ, defined by

$$ T^{x\to y}_\tau \equiv \int dx\, dy\, dy_\tau\; p(x,y,y_\tau)\, \ln\!\left(\frac{p(y_\tau|x,y)}{p(y_\tau|y)}\right) \equiv \int dx\, dy\; p(x,y)\; t^{x\to y}_\tau(x,y), \qquad (24) $$

where we identify t^{x→y}_τ(x, y) = D[p(y_τ|x, y)||p(y_τ|y)]. Substituting the Gaussian expressions p(y_τ|x, y) = G(⟨y_τ|x, y⟩, σ²_{y_τ|x,y}) and p(y_τ|y) = G(⟨y_τ|y⟩, σ²_{y_τ|y}), we get

$$ \begin{aligned} t^{x\to y}_\tau(x,y) &= \int dy_\tau\; p(y_\tau|x,y)\, \ln\!\left(\frac{p(y_\tau|x,y)}{p(y_\tau|y)}\right) \\ &= -\frac{1}{2} + \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right) + \frac{1}{2\,\sigma^2_{y_\tau|y}}\int dy_\tau\; p(y_\tau|x,y)\,\big(y_\tau - \langle y_\tau|x,y\rangle + \langle y_\tau|x,y\rangle - \langle y_\tau|y\rangle\big)^2 \\ &= -\frac{1}{2} + \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right) + \frac{\sigma^2_{y_\tau|x,y} + \big(\langle y_\tau|x,y\rangle - \langle y_\tau|y\rangle\big)^2}{2\,\sigma^2_{y_\tau|y}} . \end{aligned} \qquad (25) $$

From the linear regression ⟨y_τ|x, y⟩ = x ∂_x⟨y_τ|x, y⟩ + y ∂_y⟨y_τ|x, y⟩ we find

$$ \langle y_\tau|x,y\rangle - \langle y_\tau|y\rangle = \langle y_\tau|x,y\rangle - \int dx'\; p(x'|y)\,\langle y_\tau|x',y\rangle = \big(x - \langle x|y\rangle\big)\,\partial_x\langle y_\tau|x,y\rangle . \qquad (26) $$

Transfer entropy and Granger causality are equivalent in linear systems [6],

$$ T^{x\to y}_\tau = \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right), \qquad (27) $$

as can be found immediately by substituting the corresponding Gaussian expressions in Eq. (24). This relation together with Eq. (23) and Eq. (26) gives

$$ t^{x\to y}_\tau(x,y) - T^{x\to y}_\tau = \frac{\big(\partial_x\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|y}}\Big[ \big(x - \langle x|y\rangle\big)^2 - \sigma^2_{x|y} \Big], \qquad (28) $$

which relates the deviation of the local transfer entropy t^{x→y}_τ(x, y) from its macroscopic counterpart T^{x→y}_τ to the local conditional log-likelihood through (x − ⟨x|y⟩)². The minimum local transfer entropy is attained at x = ⟨x|y⟩ for any y, and it gives min_x t^{x→y}_τ(x, y) = (1/2)(e^{−2T^{x→y}_τ} + 2T^{x→y}_τ − 1) ≥ 0.

F. Proof of Eq. (20) - Ensemble information response
Here we give a more detailed derivation of the fluctuation-response theorem for the ensemble information response. Consider the ensemble response divergence of Eqs. (18)-(19) in the main text,

$$ \tilde d^{\,x\to y}_\tau(\epsilon) \equiv D\big[\, p(y_\tau|x\Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\big] = D\big[\, \langle p(y_\tau|x,y;\, x\Rightarrow x+\epsilon)\rangle \,\big\|\, p(y_\tau) \,\big] = D\big[\, \langle p(y_\tau|x+\epsilon,y)\rangle \,\big\|\, p(y_\tau) \,\big], \qquad (29) $$

where the last equality holds only for τ > 0. Here p(y_τ|x ⇒ x+ε) is the probability of y_τ given that at time t = 0 the perturbation x ⇒ x+ε was applied, but without knowledge of the current state (x, y). Assuming p(x, y, y_τ) to be smooth, and expanding in orders of the perturbation, from Eq. (29) we obtain

$$ \begin{aligned} \tilde d^{\,x\to y}_\tau(\epsilon) &= \int dy_\tau\; \langle p(y_\tau|x+\epsilon,y)\rangle\, \ln\!\left(\frac{\langle p(y_\tau|x+\epsilon,y)\rangle}{p(y_\tau)}\right) \\ &= \int dy_\tau\; \big[ p(y_\tau) + \epsilon\,\langle \partial_x p(y_\tau|x,y)\rangle + \tfrac{\epsilon^2}{2}\,\langle \partial^2_x p(y_\tau|x,y)\rangle \big]\, \ln\!\left( 1 + \frac{\epsilon}{p(y_\tau)}\langle \partial_x p(y_\tau|x,y)\rangle + \frac{\epsilon^2}{2\,p(y_\tau)}\langle \partial^2_x p(y_\tau|x,y)\rangle \right) + O(\epsilon^3) \\ &= \frac{\epsilon^2}{2}\int dy_\tau\; \frac{\langle \partial_x p(y_\tau|x,y)\rangle^2}{p(y_\tau)} + O(\epsilon^3) \\ &= \frac{\epsilon^2}{2}\,\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle^2 \Big\rangle + O(\epsilon^3), \end{aligned} \qquad (30) $$

where we used ⟨p(y_τ|x, y)⟩ = p(y_τ); from the second line we expanded the logarithm, ln(1+δ) = δ − δ²/2 + O(δ³), inverted the order of integration and derivation, and used the normalization of probability, namely

$$ \int dy_\tau\; \langle \partial^n_x p(y_\tau|x,y)\rangle = \left\langle \partial^n_x \int dy_\tau\; p(y_\tau|x,y) \right\rangle = \langle \partial^n_x 1\rangle = 0, \qquad (31) $$

for any n ∈ ℕ⁺. The last line of Eq. (30) follows from

$$ \langle \partial_x p(y_\tau|x,y)\rangle = \langle p(y_\tau|x,y)\, \partial_x \ln p(y_\tau|x,y)\rangle = \int dx\, dy\; p(x,y,y_\tau)\, \partial_x \ln p(y_\tau|x,y) = p(y_\tau)\,\big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle . \qquad (32) $$

We define the ensemble information response as

$$ \tilde\Gamma^{x\to y}_\tau \equiv \lim_{\epsilon\to 0} \frac{\tilde d^{\,x\to y}_\tau(\epsilon)}{c_x(\epsilon)}, \qquad (33) $$

and from Eq. (30) the fluctuation-response theorem directly follows,

$$ \tilde\Gamma^{x\to y}_\tau = - \frac{\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle^2 \Big\rangle}{\big\langle \partial^2_x \ln p(x|y)\big\rangle} . \qquad (34) $$

G. Proof of Equation (21)
Let us substitute the Gaussian expressions for the probabilities of a linear Ornstein-Uhlenbeck process into the ensemble information response (Eq. (34)),

$$ \begin{aligned} \tilde\Gamma^{x\to y}_\tau &= - \frac{\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle^2 \Big\rangle}{\big\langle \partial^2_x \ln p(x|y)\big\rangle} = \sigma^2_{x|y}\left(\frac{\partial_x\langle y_\tau|x,y\rangle}{\sigma^2_{y_\tau|x,y}}\right)^{\!2} \Big\langle \big\langle y_\tau - \langle y_\tau|x,y\rangle \,\big|\, y_\tau\big\rangle^2 \Big\rangle \\ &= \Gamma^{x\to y}_\tau\, \frac{1}{\sigma^2_{y_\tau|x,y}}\, \Big\langle \big( y_\tau - \langle\langle y_\tau|x,y\rangle|y_\tau\rangle \big)^2 \Big\rangle = \Gamma^{x\to y}_\tau\, \frac{\sigma^2_{y_\tau}}{\sigma^2_{y_\tau|x,y}}\, \big( 1 - \partial_{y_\tau}\langle x|y_\tau\rangle\, \partial_x\langle y_\tau|x,y\rangle - \partial_{y_\tau}\langle y|y_\tau\rangle\, \partial_y\langle y_\tau|x,y\rangle \big)^2, \end{aligned} \qquad (35) $$

where in the third passage we used Γ^{x→y}_τ = (∂_x⟨y_τ|x, y⟩)² σ²_{x|y} / σ²_{y_τ|x,y} (Eq. (11) in the main text), and ⟨x|y_τ⟩ = y_τ ∂_{y_τ}⟨x|y_τ⟩. The last line can be written more compactly as

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, \frac{\sigma^2_{y_\tau}}{\sigma^2_{y_\tau|x,y}}\, \big( 1 - \partial_{y_\tau}\langle\langle y_\tau|x,y\rangle|y_\tau\rangle \big)^2 . \qquad (36) $$

FIG. 1. Information response Γ^{x→y}_τ and its ensemble counterpart Γ̃^{x→y}_τ as a function of the timescale τ. The model is the (linear) OU process of Eq. (44) with parameters t_R = 10, β = 0.…, α = 0.…, D = 0.….

To relate it with Shannon information and transfer entropy, let us consider the conditional variance

$$ \sigma^2_{y_\tau|x,y} \equiv \langle y^2_\tau|x,y\rangle - \langle y_\tau|x,y\rangle^2 . \qquad (37) $$

Using the linear property of variances being independent of the conditions, ∂_x σ²_{y_τ|x,y} = 0, which implies ⟨σ²_{y_τ|x,y}⟩ = σ²_{y_τ|x,y}, we take the expectation of Eq. (37), obtaining

$$ \sigma^2_{y_\tau|x,y} = \big\langle \langle y^2_\tau|x,y\rangle \big\rangle - \big\langle \langle y_\tau|x,y\rangle^2 \big\rangle . \qquad (38) $$

We see that the first term on the RHS is simply the unconditional variance σ²_{y_τ} = σ²_y. Indeed, from iterated conditioning,

$$ \big\langle \langle y^2_\tau|x,y\rangle \big\rangle = \int\!\!\int dx\, dy\; p(x,y)\, \langle y^2_\tau|x,y\rangle = \int\!\!\int\!\!\int dx\, dy\, dy_\tau\; p(x,y)\, p(y_\tau|x,y)\, y^2_\tau = \int dy_\tau\; p(y_\tau)\, y^2_\tau = \langle y^2_\tau\rangle = \sigma^2_y . \qquad (39) $$

Then, substituting in Eq. (38) we get

$$ \begin{aligned} \sigma^2_y - \sigma^2_{y_\tau|x,y} &= \big\langle \langle y_\tau|x,y\rangle^2 \big\rangle = \big\langle \langle y_\tau|x,y\rangle \big( x\,\partial_x\langle y_\tau|x,y\rangle + y\,\partial_y\langle y_\tau|x,y\rangle \big) \big\rangle \\ &= \langle x\, y_\tau\rangle\, \partial_x\langle y_\tau|x,y\rangle + \langle y\, y_\tau\rangle\, \partial_y\langle y_\tau|x,y\rangle = \sigma^2_y\, \big( \partial_{y_\tau}\langle x|y_\tau\rangle\, \partial_x\langle y_\tau|x,y\rangle + \partial_{y_\tau}\langle y|y_\tau\rangle\, \partial_y\langle y_\tau|x,y\rangle \big) = \sigma^2_y\, \partial_{y_\tau}\langle\langle y_\tau|x,y\rangle|y_\tau\rangle, \end{aligned} \qquad (40) $$

which substituted into Eq. (36) gives

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, \frac{\sigma^2_{y_\tau|x,y}}{\sigma^2_{y_\tau}} . \qquad (41) $$

Let us introduce the total information as the mutual information between the couple of variables (x, y) and y_τ,

$$ I_{xy,y_\tau} \equiv D\big[\, p(x,y,y_\tau) \,\big\|\, p(x,y)\, p(y_\tau) \,\big] = D\big[\, p(y,y_\tau) \,\big\|\, p(y)\, p(y_\tau) \,\big] + \big\langle D\big[\, p(x,y_\tau|y) \,\big\|\, p(x|y)\, p(y_\tau|y) \,\big] \big\rangle = I_{y,y_\tau} + T^{x\to y}_\tau . \qquad (42) $$

In linear systems the mutual information is I_{y,y_τ} = (1/2) ln(σ²_{y_τ}/σ²_{y_τ|y}) and the transfer entropy is T^{x→y}_τ = (1/2) ln(σ²_{y_τ|y}/σ²_{y_τ|x,y}), so that the total information is I_{xy,y_τ} = I_{y,y_τ} + T^{x→y}_τ = (1/2) ln(σ²_{y_τ}/σ²_{y_τ|x,y}), and from Eq. (41) we obtain

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, e^{-2 I_{xy,y_\tau}} = e^{-2 I_{y,y_\tau}}\big( 1 - e^{-2 T^{x\to y}_\tau} \big), \qquad (43) $$

which relates the two definitions of information response. In Fig. 1 we plot them for the 2D hierarchical OU process [9] (Eq. (15) in the main text),

$$ \frac{dx}{dt} = -\frac{x}{t_R} + \eta_t , \qquad \frac{dy}{dt} = \alpha x - \beta y , \qquad (44) $$

with ⟨η_t η_{t'}⟩ = q δ(t − t'), and parameters α, β > 0, t_R > 0, q > 0.
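The sketch below (an addition, not in the original SI) checks Eq. (43) for the linear OU process of Eq. (44) via its exact Gaussian covariances; parameter values are illustrative guesses, as in the earlier sketches.

```python
import numpy as np

# Gaussian check of Eq. (43): Gamma_tilde = Gamma*exp(-2*I_xy) = exp(-2*I_y)*(1 - exp(-2*T)).
t_R, q, alpha, beta, tau = 10.0, 0.1, 0.5, 0.2, 3.0   # illustrative guesses
A = np.array([[1.0 / t_R, 0.0], [-alpha, beta]])
Q = np.array([[q, 0.0], [0.0, 0.0]])
I2 = np.eye(2)

C = np.linalg.solve(np.kron(A, I2) + np.kron(I2, A), Q.flatten()).reshape(2, 2)
w, V = np.linalg.eig(-A * tau)
lagged = (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real @ C    # <xi(tau) xi(0)^T>

S = np.array([[C[0, 0],      C[0, 1],      lagged[1, 0]],
              [C[0, 1],      C[1, 1],      lagged[1, 1]],
              [lagged[1, 0], lagged[1, 1], C[1, 1]]])            # covariance of (x0, y0, y_tau)

coef = np.linalg.solve(S[:2, :2], S[:2, 2])
var_ytau_xy = S[2, 2] - S[2, :2] @ coef
var_ytau_y = S[2, 2] - S[1, 2] ** 2 / S[1, 1]
var_x_y = S[0, 0] - S[0, 1] ** 2 / S[1, 1]

Gamma = coef[0] ** 2 * var_x_y / var_ytau_xy                     # Eq. (11)
T = 0.5 * np.log(var_ytau_y / var_ytau_xy)                       # transfer entropy
I_y = 0.5 * np.log(S[2, 2] / var_ytau_y)                         # mutual information I_{y,ytau}
Gamma_tilde = Gamma * var_ytau_xy / S[2, 2]                      # Eq. (41)
print(Gamma_tilde, np.exp(-2 * I_y) * (1 - np.exp(-2 * T)))      # Eq. (43), equal
```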
H. Nonlinear example

We considered the nonlinear SDE

$$ \frac{dx}{dt} = -\frac{x}{t_{\rm rel}} + \eta_t , \qquad \frac{dy}{dt} = \alpha x^2 - \beta y , \qquad (45) $$

with white noise ⟨η_t η_{t'}⟩ = q δ(t − t'), and parameters α, β > 0, t_rel > 0, q > 0. For intuition, x can be interpreted as an external fluctuating concentration signal with timescale t_rel, and y as a noiseless biochemical response that is more activated when the signal is far from its average value x = 0 in either the positive or the negative direction. We checked numerically that the equivalence between transfer entropy and information response for linear OU processes (Eq. (14) in the main text) does not hold here (see Fig. 3), and the transfer entropy is not easily connected to interventional causation. For a specific τ = 3 we plot the local contributions to the response divergence and transfer entropy, see Fig. 2. The local response divergence is governed, at least qualitatively, by the squared derivative of the quadratic interaction, ∼(∂_x x²)² ∼ x². As a result, the product d^{x→y}_τ(x, y, ε) p(x, y) is bimodal. The conditional local density p(x|y), at least for large y, is also bimodal because of the quadratic driving and the finite correlation time of the signal. For a given y, the local transfer entropy t^{x→y}_τ(x, y) ≡ D[p(y_τ|x, y)||p(y_τ|y)] is larger for unlikely x, which means, given the bimodality of p(x|y), in addition to the increase in the two tails x → ±∞ as in the linear case, also towards a peak at x = 0. Therefore, when multiplied by the local density p(x, y), the local transfer entropy contribution t^{x→y}_τ(x, y) p(x, y) has four peaks for a fixed y (three for small y).

I. General perturbations
This manuscript is based on a particular type of perturbation, namely an ε-shift of a variable at t = 0. In the local response divergence, since the measurement completely resolves the uncertainty, the perturbation corresponds to a shift of the corresponding delta distribution, δ(x(0)−x) δ(y(0)−y) ⇒ δ(x(0)−x−ε) δ(y(0)−y). In the ensemble response divergence, instead, the perturbation is written p(x, y) ⇒ p(x−ε, y). Note that in both cases we use the information-theoretic cost at the ensemble level, c_x ≡ D[p(x, y)||p(x−ε, y)], since the KL divergence between two different Dirac deltas is not defined.

More generally, a perturbation of x at the ensemble level can be written in the form

$$ p(x|y) \;\Rightarrow\; p(x|y)\,\big[1 + \epsilon\, h_x(x,y)\big] \equiv p^*(x|y), \qquad (46) $$

with ∫ dx p(x|y) h_x(x, y) = 0. The perturbed probability of y_τ is written

$$ p^*(y_\tau) = \int\!\!\int dx\, dy\; p(y_\tau|x,y)\, p(y)\, p^*(x|y) = \int\!\!\int dx\, dy\; p(y_\tau|x,y)\, p(y)\, p(x|y)\,\big[1+\epsilon\, h_x(x,y)\big] = p(y_\tau)\,\big[1 + \epsilon\, \langle h_x(x,y)|y_\tau\rangle\big]. \qquad (47) $$

FIG. 2. Local information response and local transfer entropy in the nonlinear model of Eq. (45) with parameters t_rel = 10, β = 0.…, α = 0.…, D = 0.1, for a timescale τ = 3.

We can define the generalized response divergence as a functional,

$$ \begin{aligned} d^{x\to y}_\tau[h](\epsilon) &\equiv D\big[\, p^*(y_\tau) \,\big\|\, p(y_\tau) \,\big] = \int dy_\tau\; p(y_\tau)\,\big[1+\epsilon\,\langle h_x(x,y)|y_\tau\rangle\big]\, \ln\!\big(1+\epsilon\,\langle h_x(x,y)|y_\tau\rangle\big) \\ &= \epsilon\,\big\langle \langle h_x(x,y)|y_\tau\rangle \big\rangle + \frac{\epsilon^2}{2}\,\big\langle \langle h_x(x,y)|y_\tau\rangle^2 \big\rangle + O(\epsilon^3) = \frac{\epsilon^2}{2}\,\big\langle \langle h_x(x,y)|y_\tau\rangle^2 \big\rangle + O(\epsilon^3), \end{aligned} \qquad (48) $$

where in the last passage we used the iterated conditioning theorem and the normalization of h_x(x, y), ⟨⟨h_x(x, y)|y_τ⟩⟩ = ⟨h_x(x, y)⟩ = ⟨∫ dx p(x|y) h_x(x, y)⟩ = 0. Similarly, the information-theoretic cost is

$$ c_x[h](\epsilon) \equiv D\big[\, p^*(x|y)\, p(y) \,\big\|\, p(x,y) \,\big] = \big\langle D\big[\, p^*(x|y) \,\big\|\, p(x|y) \,\big] \big\rangle = \left\langle \int dx\; p(x|y)\,\big[1+\epsilon\, h_x(x,y)\big]\, \ln\!\big[1+\epsilon\, h_x(x,y)\big] \right\rangle = \frac{\epsilon^2}{2}\,\big\langle h_x(x,y)^2 \big\rangle + O(\epsilon^3). \qquad (49) $$

Then the generalized information response and its corresponding fluctuation-response theorem are written

$$ \tilde\Gamma^{x\to y}_\tau[h] \equiv \lim_{\epsilon\to 0} \frac{d^{x\to y}_\tau[h](\epsilon)}{c_x[h](\epsilon)} = \frac{\big\langle \langle h_x(x,y)|y_\tau\rangle^2 \big\rangle}{\big\langle h_x(x,y)^2 \big\rangle} . \qquad (50) $$
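A minimal Monte Carlo sketch of Eq. (50) (an addition, not in the original SI), for the special tilt h_x(x) = x applied to a correlated Gaussian pair (x, y_τ); in this case the ratio reduces to the squared correlation ρ², which the binned estimate below should recover. The distribution, the tilt, and the binning scheme are all arbitrary choices for illustration.

```python
import numpy as np

# Generalized response of Eq. (50) with the zero-mean linear tilt h_x(x) = x:
# Gamma[h] = <<x|y_tau>^2> / <x^2>, which equals rho^2 for a Gaussian pair.
rng = np.random.default_rng(5)
rho, n = 0.6, 2_000_000
x = rng.normal(size=n)
y_tau = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

# Estimate the conditional expectation <x | y_tau> by binning y_tau into quantiles.
edges = np.quantile(y_tau, np.linspace(0, 1, 201))
idx = np.clip(np.digitize(y_tau, edges[1:-1]), 0, 199)
cond_mean = np.bincount(idx, weights=x, minlength=200) / np.bincount(idx, minlength=200)
weights = np.bincount(idx, minlength=200) / n

gamma_h = np.sum(weights * cond_mean ** 2) / np.mean(x ** 2)
print(gamma_h, rho ** 2)   # close, up to binning and sampling error
```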
In this section we review the classical fluctuation-response theorem [7, 10–13] and the linear fluctuation-responseinequality for the corresponding KL divergence [14], and we motivate the introduction of the information response inthis framework. Let us expand the average response of y τ to the small perturbation x ⇒ x + (cid:15) , for τ > (cid:10) y τ (cid:12)(cid:12) x ⇒ x + (cid:15) (cid:11) = (cid:10)(cid:10) y τ (cid:12)(cid:12) x + (cid:15), y (cid:11)(cid:11) ≡ R R R dx dy dy τ y τ p ( x , y ) p ( y τ (cid:12)(cid:12) x + (cid:15), y )= R R R dx dy dy τ y τ p ( y τ (cid:12)(cid:12) x , y ) p ( x − (cid:15), y )= R R R dx dy dy τ y τ p ( y τ (cid:12)(cid:12) x , y ) (cid:2) p ( x , y ) − (cid:15)∂ x p ( x , y ) + O ( (cid:15) ) (cid:3) = h y τ i − (cid:15) h y τ ∂ x ln p ( x , y ) i + O ( (cid:15) ) . (51) x y e T x y x y e I ( y , y ) (1 e T x y ) FIG. 3. Information response Γ x → yτ and its ensemble counterpart ^ Γ x → yτ as a function of the timescale τ , compared to thecorresponding combination of mutual informations they reduce to in linear systems. The model is the nonlinear Langevinsystem of Eq. (45) with parameters t rel = 10, β = 0 . α = 0 . D = 0 . In the limit (cid:15) → fluctuation-response theorem :lim (cid:15) → (cid:10) y τ (cid:12)(cid:12) x ⇒ x + (cid:15) (cid:11) − h y τ i (cid:15) = − h y τ ∂ x ln p ( x , y ) i , (52)which equates the linear response coefficient to a correlation evaluated in the unperturbed dynamics.For those systems having a symmetry in the correlation function, h y τ ∂ x ln p ( x , y ) i = ± h y − τ ∂ x ln p ( x , y ) i , theWiener-Kintchine theorem applied to Eq. (52) gives the equivalence between subseptibility and cross-spectral density,that applied to Brownian motion gives the celebrated Einstein relation [7].Let us now take the absolute value of both sides in the fluctuation-response theorem (Eq. (52)), apply the iteratedconditioning to the RHS, and then the Cauchy-Schwarz inequality (cid:12)(cid:12) R f ( x ) g ( x ) dx (cid:12)(cid:12) ≤ R (cid:12)(cid:12) f ( x ) (cid:12)(cid:12) dx R (cid:12)(cid:12) g ( x ) (cid:12)(cid:12) dx , toobtain (cid:12)(cid:12)(cid:12)(cid:12) lim (cid:15) → D y τ (cid:12)(cid:12) x ⇒ x + (cid:15) E −h y τ i (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) h y τ ∂ x ln p ( x , y ) i (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) (cid:10) y τ (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) (cid:10) ( y τ − h y τ i ) (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) R dy τ p ( y τ ) ( y τ − h y τ i ) (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11) (cid:12)(cid:12)(cid:12)(cid:12) ≤ qR dy τ p ( y τ ) ( y τ − h y τ i ) R dy τ p ( y τ ) (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11) = r σ y τ D(cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11) E , (53)where we used (cid:10) h y τ i (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) = h y τ i (cid:10)(cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) = h y τ i h ∂ x ln p ( x , y ) i = 0, and identifiedthe variance σ y τ = R dy τ p ( y τ ) ( y τ − h y τ i ) . Using the expressions for the ensemble response divergence of Eq. 
Using the expressions for the ensemble response divergence of Eqs. (29)-(30), namely $\widetilde{d}^{\,x \to y}_\tau(\epsilon) \equiv D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] = \frac{\epsilon^2}{2} \bigl\langle \bigl\langle \partial_x \ln p(x,y) \,\big|\, y_\tau \bigr\rangle^2 \bigr\rangle + O(\epsilon^3)$, we obtain the linear fluctuation-response inequality [14]
\begin{equation}
\bigl| \bigl\langle y_\tau \,\big|\, x \Rightarrow x+\epsilon \bigr\rangle - \langle y_\tau \rangle \bigr| \le \sigma_{y_\tau} \sqrt{ 2 \, D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] } + O(\epsilon^2) , \tag{54}
\end{equation}
which identifies the KL divergence $D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr]$ as the information-theoretic bound on the response of $y_\tau$ relative to its natural fluctuations $\sigma_{y_\tau}$.
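As a worked example of the inequality (54) (our illustration, under the assumption of linear Gaussian dynamics $y_\tau = a x + b y + s \xi$), the perturbation $x \Rightarrow x + \epsilon$ only shifts the mean of $y_\tau$, so that
\[
p(y_\tau \,|\, x \Rightarrow x+\epsilon) = \mathcal{N}\bigl( \langle y_\tau \rangle + a \epsilon , \; \sigma^2_{y_\tau} \bigr) , \qquad
D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] = \frac{a^2 \epsilon^2}{2 \sigma^2_{y_\tau}} ,
\]
and therefore
\[
\sigma_{y_\tau} \sqrt{ 2 \, D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] } = | a \epsilon | = \bigl| \bigl\langle y_\tau \,\big|\, x \Rightarrow x+\epsilon \bigr\rangle - \langle y_\tau \rangle \bigr| ,
\]
i.e., the bound is saturated for linear Gaussian dynamics; for nonlinear dynamics the perturbation also distorts the shape of $p(y_\tau)$ and the inequality is in general strict.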
The two fundamental results derived above suggest the possibility of a fluctuation-response theorem for KL divergences, which is what we derive in the main text. In particular, starting from the KL divergence $D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr]$, which describes the response, we define a second KL divergence to quantify the information-theoretic cost of perturbations. We then expand these two KL divergences separately; both are of order $O(\epsilon^2)$ for $\epsilon \to 0$, with the corresponding Taylor coefficients having the form of Fisher information. The resulting linear response coefficient is then a ratio of Fisher information, a relation that we interpret as an information-theoretic fluctuation-response theorem.
Here we sketch the analogy between our result and the classical fluctuation-response theorem:
\[
\lim_{\epsilon \to 0} \frac{\bigl\langle y_\tau \,\big|\, x \to x+\epsilon \bigr\rangle - \langle y_\tau \rangle}{\epsilon} = - \bigl\langle y_\tau \, \partial_x \ln p(x,y) \bigr\rangle .
\]
[Schematic annotation: the terms of both relations are labeled Response, Perturbation, and Correlation.]
\[
\hat{\Gamma}^{x \to y}_\tau \equiv \lim_{\epsilon \to 0} \frac{ D\bigl[\, \bigl\langle p(y_\tau \,|\, x+\epsilon, y) \bigr\rangle \,\big\|\, p(y_\tau) \,\bigr] }{ D\bigl[\, p(x-\epsilon, y) \,\big\|\, p(x,y) \,\bigr] }
= \frac{ \Bigl\langle \bigl\langle \partial_x \ln p(y_\tau \,|\, x, y) \,\big|\, y_\tau \bigr\rangle^2 \Bigr\rangle }{ \bigl\langle \bigl( \partial_x \ln p(x \,|\, y) \bigr)^2 \bigr\rangle }
\overset{\text{linear}}{=} e^{-2 I(y, y_\tau)} \bigl( 1 - e^{-2 T^{x \to y}_\tau} \bigr) ,
\]
where we added the connection between fluctuation-response theory and mutual informations obtained for linear systems (Eq. (21) in the main text).
We outlined the analogy of the classical fluctuation-dissipation theorem with our ensemble information response $\hat{\Gamma}^{x \to y}_\tau$, but in the main text we first focus on the information response $\Gamma^{x \to y}_\tau$, which is the averaged conditional (local) version of $\hat{\Gamma}^{x \to y}_\tau$. While the connection with the original fluctuation-response theorem is loose, the structure of perturbation-response-correlation is analogous,
\[
\Gamma^{x \to y}_\tau \equiv \lim_{\epsilon \to 0} \frac{ \bigl\langle D\bigl[\, p(y_\tau \,|\, x+\epsilon, y) \,\big\|\, p(y_\tau \,|\, x, y) \,\bigr] \bigr\rangle }{ D\bigl[\, p(x,y) \,\big\|\, p(x-\epsilon, y) \,\bigr] }
= \frac{ \bigl\langle \bigl( \partial_x \ln p(y_\tau \,|\, x, y) \bigr)^2 \bigr\rangle }{ \bigl\langle \bigl( \partial_x \ln p(x \,|\, y) \bigr)^2 \bigr\rangle }
\overset{\text{linear}}{=} e^{2 T^{x \to y}_\tau} - 1 .
\]
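For completeness, here is a short consistency check of the linear-system relations quoted above (our explicit parametrization, assuming Gaussian conditionals): take $p(y_\tau \,|\, x, y) = \mathcal{N}(a x + b y, s^2)$ and let $\sigma^2_{x|y}$ denote the conditional variance of $x$ given $y$. Then
\[
\Gamma^{x \to y}_\tau = \frac{ \bigl\langle \bigl( \partial_x \ln p(y_\tau \,|\, x, y) \bigr)^2 \bigr\rangle }{ \bigl\langle \bigl( \partial_x \ln p(x \,|\, y) \bigr)^2 \bigr\rangle } = \frac{a^2 / s^2}{1 / \sigma^2_{x|y}} = \frac{a^2 \sigma^2_{x|y}}{s^2} , \qquad
T^{x \to y}_\tau = \frac{1}{2} \ln \frac{\sigma^2_{y_\tau | y}}{\sigma^2_{y_\tau | x, y}} = \frac{1}{2} \ln \Bigl( 1 + \frac{a^2 \sigma^2_{x|y}}{s^2} \Bigr) ,
\]
so that $\Gamma^{x \to y}_\tau = e^{2 T^{x \to y}_\tau} - 1$. Similarly, $\hat{\Gamma}^{x \to y}_\tau = a^2 \sigma^2_{x|y} / \sigma^2_{y_\tau} = e^{-2 I(y, y_\tau)} \bigl( 1 - e^{-2 T^{x \to y}_\tau} \bigr)$, using $e^{-2 I(y, y_\tau)} = \sigma^2_{y_\tau | y} / \sigma^2_{y_\tau}$ and $\sigma^2_{y_\tau | y} = a^2 \sigma^2_{x|y} + s^2$.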
K. Application to data science

In the main text, we present our results in relation to the current literature in theoretical fields such as fluctuation-response theory, information theory, and nonequilibrium thermodynamics. Here we also motivate our study in relation to current trends in data science.
The accuracy of predictions is one of the main goals in statistics and applied physics. In general, predictions can be obtained from mechanistic models, where physical intuition plays a role in selecting the relevant observables and characterizing their interactions [15], or from machine-learning approaches, where the wide availability of (labeled) data enables high-dimensional computing architectures to be trained for pattern recognition [16]. In the latter case, predictions are not explainable in terms of intuitive mechanisms or geometrical relations. In other words, the ability to make predictions does not imply understanding [17, 18].
With the aim of explainability, a helpful representation of the dynamics is given by causal networks [9, 19], where weighted directed links between nodes represent the propagation of perturbations between variables in the network, or the information flow. Causal networks are coarse-grained representations of the dynamics and its interactions, limited to a set of scalars quantifying how much a variable influences the dynamics of the other variables, with the dynamics observed over a timescale $\tau$ (possibly tunable). As an example, the simplest way to define a causal network is the correlation matrix, which, however, is not always the most appropriate choice.
To quantify this degree of causation we motivate the use of our information response, defined as the ratio of the change of a prediction over the change of a predictor, both evaluated as KL divergences. It has the form of an information-theoretic fluctuation-response theorem, and therefore it has both the invariance properties from information theory and the physical interpretation of a propagation of perturbations. While the present setting is limited to dyadic relations between variables, a generalization in terms of simultaneous perturbations of multiple variables is possible, and will be discussed in a future manuscript.
Once a particular definition of causation is chosen, determining and quantifying the strength of causal links becomes a problem of statistical estimation, which is the subject of causal inference [20, 21]. In this manuscript we are interested in the former problem, i.e., defining a quantitative measure of causation.
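As a minimal illustration of this estimation problem, the following Python sketch (our illustration under a linear-Gaussian assumption, not an implementation from the main text) builds a simple causal network from multivariate time series: for each ordered pair of variables it estimates the Gaussian transfer entropy at lag $\tau$ from residual variances of linear regressions, and assigns the link weight $\Gamma = e^{2T} - 1$ of the linear-system relation discussed above. All function names, the toy data, and the parameter values are placeholders.

```python
# Minimal sketch (linear-Gaussian assumption, not from the paper): a causal
# network whose directed link weights are Gamma[i, j] = exp(2*T_{i->j}) - 1,
# with T the Gaussian transfer entropy at lag tau estimated from regressions.
import numpy as np

def gaussian_transfer_entropy(x, y, tau):
    """T_{x->y} at lag tau for scalar time series, under a Gaussian assumption."""
    y_future, y_past, x_past = y[tau:], y[:-tau], x[:-tau]
    # residual variance of y_future given y_past only
    r1 = y_future - np.polyval(np.polyfit(y_past, y_future, 1), y_past)
    # residual variance of y_future given (x_past, y_past)
    A = np.column_stack([x_past, y_past, np.ones_like(y_past)])
    coef, *_ = np.linalg.lstsq(A, y_future, rcond=None)
    r2 = y_future - A @ coef
    return 0.5 * np.log(r1.var() / r2.var())

def causal_network(series, tau=1):
    """Matrix of link weights Gamma[i, j] for the columns of `series`."""
    n = series.shape[1]
    gamma = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                t_ij = gaussian_transfer_entropy(series[:, i], series[:, j], tau)
                gamma[i, j] = np.exp(2.0 * t_ij) - 1.0
    return gamma

# toy usage: series 0 drives series 1 with a one-step delay, series 2 is noise
rng = np.random.default_rng(3)
T_len = 20_000
x = rng.standard_normal(T_len)
y = np.zeros(T_len)
for t in range(1, T_len):
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + 0.5 * rng.standard_normal()
z = rng.standard_normal(T_len)
print(causal_network(np.column_stack([x, y, z]), tau=1).round(3))
```

For this toy data only the link from the first to the second series is appreciably different from zero; nonlinear dependencies would require nonparametric estimators of the divergences, which is the estimation problem addressed by causal inference.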
[1] S. E. Shreve, Stochastic calculus for finance II: Continuous-time models, Vol. 11 (Springer Science & Business Media, 2004).
[2] I. Karatzas and S. E. Shreve, in Brownian Motion and Stochastic Calculus (Springer, 1998), pp. 47–127.
[3] R. Kubo, M. Toda, and N. Hashitsume, Statistical physics II: Nonequilibrium statistical mechanics, Vol. 31 (Springer Science & Business Media, 2012).
[4] J. M. Horowitz and H. Sandberg, New Journal of Physics 16, 125007 (2014).
[5] S.-i. Amari, Information geometry and its applications, Vol. 194 (Springer, 2016).
[6] L. Barnett, A. B. Barrett, and A. K. Seth, Physical Review Letters 103, 238701 (2009).
[7] H. Risken, in