Fluctuation-response theorem for Kullback-Leibler divergences to quantify causation
Andrea Auconi, Benjamin M. Friedrich, and Andrea Giansanti
cfaed, Technische Universität Dresden, 01069 Dresden, Germany
Dipartimento di Fisica, Sapienza Università di Roma, 00185 Rome, Italy
(Dated: February 16, 2021)

We define a new measure of causation from a fluctuation-response theorem for Kullback-Leibler divergences, based on the information-theoretic cost of perturbations. This information response has both the invariance properties required for an information-theoretic measure and the physical interpretation of a propagation of perturbations. In linear systems, the information response reduces to the transfer entropy, providing a connection between Fisher and mutual information.
In the general framework of stochastic dynamical systems, the term causation refers to the influence that a variable x exerts over the dynamics of another variable y. Measures of causation find application in neuroscience [1], climate studies [2], cancer research [3], and finance [4]. However, a widely accepted quantitative definition of causation is still missing.

Causation manifests itself in two inseparable forms: information flow [5–8], and propagation of perturbations [9–12]. Ideally, a quantitative measure of causation should connect both perspectives.

Information flow is commonly quantified by the transfer entropy [13–17], that is, the average conditional mutual information corresponding to the uncertainty reduction in forecasting the time evolution of y that is achieved upon knowledge of x. The mutual information is a special case of Kullback-Leibler (KL) divergence, a dimensionless measure of distinguishability between probability distributions [18]. As such, the transfer entropy abstracts from the underlying physics to give an invariant description in terms of the strength of probabilistic dependencies.

From the interventional point of view [9–12], causation is identified with how a perturbation applied to x propagates in the system to affect y. Although a direct perturbation of observables is unfeasible in most real-world situations, the fluctuation-response theorem establishes a connection between the response to a small perturbation and the correlation of fluctuations in the natural (unperturbed) dynamics [19–22].

The fluctuation-response theorem considers the first-order expansion of the response with respect to the perturbation. The corresponding linear response coefficient has been suggested as a measure of causation [11, 12]. However, it has the same physical units as y/x, and it can assume negative values; thus, it is not directly related to any information-theoretic measure.

In stochastic dynamical systems with nonlinear interactions, perturbing x may not only affect the evolution of the expectation value of y, but it may also affect the evolution of the variance of y, and in fact its entire probability distribution. The KL divergence from the natural to the perturbed probability densities has recently been identified as the universal upper bound to the physical response of any observable relative to its natural fluctuations [23].

In this Letter, we define a new measure of causation in the form of a linear response coefficient between KL divergences, which we would like to call information response. In particular, we consider the ratio of two KL divergences, one for the response and one for the perturbation, where the latter represents an information-theoretic cost of the perturbation. For small perturbations, we formulate a fluctuation-response theorem that expresses this ratio as a ratio of Fisher informations.

In linear systems, this new information response reduces to the transfer entropy, which provides a connection between Fisher and mutual information, and thus a connection between fluctuation-response theory and information flows.

Kullback-Leibler (KL) divergence.
Consider two probability distributions p(w) and q(w) of a random variable w. The KL divergence from q(w) to p(w) is defined as

$$ D\big[\,p(w)\,\big\|\,q(w)\,\big] \equiv \int dw\; p(w)\,\ln\!\left(\frac{p(w)}{q(w)}\right); \qquad (1) $$

it is not symmetric in its arguments, and it is non-negative. Importantly, it is invariant under invertible transformations w → w̃ [18], namely D[p(w)||q(w)] = D[p̃(w̃)||q̃(w̃)], where p̃ and q̃ denote the transformed densities.
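As a quick numerical illustration of Eq. (1) (an addition, not part of the original manuscript), the following Python sketch compares a brute-force integration of the KL divergence between two one-dimensional Gaussians with its known closed form; the chosen means and variances are arbitrary.

```python
import numpy as np

def kl_gaussians_numerical(mu_p, var_p, mu_q, var_q, n=200001, span=12.0):
    """Numerically integrate D[p||q] = int dw p(w) ln(p(w)/q(w)) for 1D Gaussians."""
    s = np.sqrt(max(var_p, var_q))
    w = np.linspace(min(mu_p, mu_q) - span * s, max(mu_p, mu_q) + span * s, n)
    p = np.exp(-(w - mu_p) ** 2 / (2 * var_p)) / np.sqrt(2 * np.pi * var_p)
    q = np.exp(-(w - mu_q) ** 2 / (2 * var_q)) / np.sqrt(2 * np.pi * var_q)
    return np.trapz(p * np.log(p / q), w)

def kl_gaussians_exact(mu_p, var_p, mu_q, var_q):
    """Closed form of the same divergence; asymmetric in (p, q) and non-negative."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

print(kl_gaussians_numerical(0.0, 1.0, 0.3, 1.5))
print(kl_gaussians_exact(0.0, 1.0, 0.3, 1.5))
```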
The problem of causation.

Consider a stochastic system of n variables evolving with ergodic Markovian dynamics. Our goal is to define a quantitative measure of causation, i.e., of the influence that a variable x exerts over the dynamics of another variable y. We want this definition to have both the invariance property of KL divergences and the physical interpretation of a propagation of perturbations.

Since the dynamics is ergodic, and therefore stationary, it suffices to consider the stochastic variables x ≡ x(t = 0), y ≡ y(t = 0) at t = 0, and, a time interval τ later, y_τ ≡ y(t = τ). To avoid cluttered notation, we will implicitly assume that the current values of the remaining n − 2 variables z are absorbed into the conditioning on y, e.g., p(y_τ|y) ≡ p(y_τ|y, z). Conditioning on z avoids confounding variables in z introducing spurious causal links between x and y [24].

Local response divergence.
Let us consider the system at t = 0 with steady-state distribution p(x, y). We make an ideal measurement of its actual state (x, y). Immediately after the measurement, we perturb the state by introducing a small displacement ε > 0 of x, namely x ⇒ x + ε. If the effect of this perturbation propagates to y, then it is reflected in the KL divergence from the natural to the perturbed prediction,

$$ d^{x\to y}_\tau(x, y, \epsilon) \equiv D\big[\, p(y_\tau \,|\, x, y;\; x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau \,|\, x, y) \,\big], \qquad (2) $$

which is a function of the local condition (x, y) and the perturbation strength ε. We name it local response divergence, and denote its ensemble average by ⟨d^{x→y}_τ(x, y, ε)⟩.

The concept of causation, interpreted in the framework of fluctuation-response theory, is only meaningful with respect to an arrow of time. That means to postulate that the perturbation cannot have effects at past times,

$$ p(y_\tau \,|\, x, y;\; x \Rightarrow x+\epsilon) \equiv \begin{cases} p(y_\tau \,|\, x+\epsilon, y) & \text{for } \tau \ge 0,\\[2pt] p(y_\tau \,|\, x, y) & \text{for } \tau < 0. \end{cases} \qquad (3) $$

In writing the conditional probability p(y_τ|x+ε, y), we implicitly assumed p(x+ε, y) > 0, meaning that the condition provoked by the perturbation is possible under the natural statistics. This implies that the response statistics can be predicted without actually perturbing the system, which is the main idea of fluctuation-response theory [19–22].
Information-theoretic cost.
The mean local response divergence ⟨d^{x→y}_τ(x, y, ε)⟩, like any response function in fluctuation-response theory, is defined in relation to a perturbation, irrespective of how difficult it may be to perform this perturbation. Intuitively, we expect that it takes more effort to perturb those variables that fluctuate less. Therefore, we consider the KL divergence from the natural to the perturbed ensemble of conditions,

$$ c_x(\epsilon) \equiv D\big[\, p(x-\epsilon, y) \,\big\|\, p(x, y) \,\big], \qquad (4) $$

to quantify the information-theoretic cost of perturbations, and call it perturbation divergence.

For example, for an underdamped Brownian particle, the perturbation divergence is equivalent to the average thermodynamic work required to perform an ε perturbation of its velocity, up to a factor being the temperature, see Supplementary Information (SI). For an equilibrium ensemble in a potential U(x), with Boltzmann distribution p(x) ∼ exp(−βU(x)), the perturbation divergence is the average reversible work c_x(ε) = β⟨U(x+ε) − U(x)⟩. Note that the definition of Eq. (4) is general, and can be applied to more abstract models where thermodynamic quantities are not clearly identified.

FIG. 1. Here we show, on a concrete example, the origin of the two KL divergences entering the information response of Eq. (5). (Upper) Response to the perturbation x ⇒ x + ε at the trajectory level; x*_t (y*_t) is the perturbed trajectory of x_t (y_t), for the same noise realization. (Lower left) Local response divergence d^{x→y}_τ(x, y, ε): change of the predicted distribution of y_τ for the condition (x, y) at timescale τ = 3. (Lower right) Perturbation divergence c_x(ε): instantaneous displacement of the steady-state ensemble conditional on a particular y. The dynamics follows the nonlinear stochastic model of Eq. (17) with parameters t_R = 10, q = 0.…, α = 0.…, β = 0.2, for a perturbation ε = 0.….
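As a small numerical illustration of the equilibrium example above (an addition, not part of the original manuscript), the sketch below takes a quadratic potential U(x) = kx²/2, an arbitrary choice, and compares the reversible-work form of the perturbation divergence with the closed-form Gaussian KL divergence of the shifted ensemble.

```python
import numpy as np

# Perturbation divergence for a Boltzmann ensemble p(x) ~ exp(-beta*U(x)) with the
# illustrative choice U(x) = 0.5*k*x**2: compare the reversible-work form
# c_x(eps) = beta*<U(x+eps) - U(x)> with the Gaussian KL divergence eps^2/(2*sigma^2).
rng = np.random.default_rng(0)
beta, k, eps = 1.0, 2.0, 0.1
sigma2 = 1.0 / (beta * k)                        # equilibrium variance of x
x = rng.normal(0.0, np.sqrt(sigma2), 1_000_000)  # samples from the Boltzmann distribution

U = lambda z: 0.5 * k * z ** 2
work_form = beta * np.mean(U(x + eps) - U(x))    # average reversible work, times beta
gaussian_kl = eps ** 2 / (2 * sigma2)            # D[p(x - eps) || p(x)] for this Gaussian p

print(work_form, gaussian_kl)                    # agree up to Monte Carlo error
```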
Information response.

We introduce the information response as the ratio between mean local response divergence and perturbation divergence, in the limit of a small perturbation,

$$ \Gamma^{x\to y}_\tau \equiv \lim_{\epsilon\to 0} \frac{\langle d^{x\to y}_\tau(x, y, \epsilon)\rangle}{c_x(\epsilon)} . \qquad (5) $$

We can interpret Γ^{x→y}_τ as an information-theoretic linear response coefficient. This information response is our measure of x → y causation with respect to the timescale τ, see Fig. 1. The time arrow requirement (Eq. (3)) implies Γ^{x→y}_τ = 0 for τ < 0. Introducing the local information response γ^{x→y}_τ(x, y) ≡ lim_{ε→0} d^{x→y}_τ(x, y, ε)/c_x(ε), we can equivalently write Γ^{x→y}_τ = ⟨γ^{x→y}_τ(x, y)⟩.

The information response in the form of Eq. (5) inherently relies on the concept of controlled perturbations. We can reformulate it in purely observational form, in the spirit of the fluctuation-response theorem [19–22], provided p(x, y, y_τ) is sufficiently smooth.

Fisher information.
The one-parameter family {p(y_τ|x, y)}_x of probability densities parametrized by x (for fixed y) can be equipped with a Riemannian metric having d^{x→y}_τ(x, y, ε) as squared line element. In fact, the leading-order term in the Taylor expansion of a KL divergence between probabilities that differ only by a small perturbation of a parameter is of second order, with coefficients known as Fisher information [18, 25]. Explicitly, expanding the mean response divergence for τ > 0, we obtain

$$ \langle d^{x\to y}_\tau(x,y,\epsilon)\rangle = -\frac{\epsilon^2}{2}\,\big\langle \partial_x^2 \ln p(y_\tau|x,y) \big\rangle + O(\epsilon^3), \qquad (6) $$

where we used the interventional causality requirement (Eq. (3)) and probability normalization. Similarly, for the perturbation divergence we have

$$ c_x(\epsilon) = -\frac{\epsilon^2}{2}\,\big\langle \partial_x^2 \ln p(x|y) \big\rangle + O(\epsilon^3). \qquad (7) $$

Applying the Fisher information representation to the information response, we get, for τ > 0,

$$ \Gamma^{x\to y}_\tau = \frac{\big\langle \partial_x^2 \ln p(y_\tau|x,y) \big\rangle}{\big\langle \partial_x^2 \ln p(x|y) \big\rangle}, \qquad (8) $$

that is, the fluctuation-response theorem for KL divergences. For generalizations and a discussion of the connection with the classical fluctuation-response theorem see [26] and the SI text. Eq. (8) is the ratio of two second derivatives over the same physical variable x, and it can be regarded as an application of L'Hôpital's rule to Eq. (5).

In general, Fisher information is not easily connected to Shannon entropy and mutual information [27]. Below, we show that for linear stochastic systems, the information response, which is a ratio of Fisher informations (Eq. (8)), is equivalent to the transfer entropy, a conditional form of mutual information.
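As a minimal check of Eq. (8) (an addition, not part of the original), consider a toy linear-Gaussian model with p(y_τ|x, y) = N(ax + by, s²) and p(x|y) = N(cy, v); all parameter values below are arbitrary assumptions. The Fisher-information ratio can then be compared with the ratio of closed-form Gaussian KL divergences entering Eq. (5).

```python
import numpy as np

# Toy linear-Gaussian check of Eq. (8): Gamma from the Fisher-information ratio
# versus the small-eps ratio of KL divergences in Eq. (5). Hypothetical parameters:
a, b, s2 = 0.7, -0.3, 0.5      # p(y_tau | x, y) = N(a*x + b*y, s2)
v = 1.2                        # p(x | y) = N(c*y, v); only the variance matters here

def gamma_fisher():
    # Eq. (8): ratio of expected second derivatives of the log-likelihoods wrt x
    return (-a ** 2 / s2) / (-1.0 / v)

def gamma_kl_ratio(eps):
    # Eq. (5): mean response divergence over perturbation divergence, Gaussian closed forms
    mean_response_div = (a * eps) ** 2 / (2 * s2)   # conditional mean of y_tau shifts by a*eps
    perturbation_div = eps ** 2 / (2 * v)           # p(x|y) shifts by eps
    return mean_response_div / perturbation_div

print(gamma_fisher())          # a^2 * v / s2
print(gamma_kl_ratio(1e-3))    # same value; exact already, since the model is Gaussian
```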
Transfer entropy.

The most widely used measure of information flow is the conditional mutual information

$$ T^{x\to y}_\tau \equiv \Big\langle D\big[\, p(x, y_\tau\,|\,y) \,\big\|\, p(x\,|\,y)\, p(y_\tau\,|\,y) \,\big] \Big\rangle, \qquad (9) $$

which is generally called transfer entropy [13–17]. It is the average KL divergence from conditional independence of x and y_τ given y.

The transfer entropy is used in the nonequilibrium thermodynamics of measurement-feedback systems, where it is related to work extraction and dissipation through fluctuation theorems [16, 28, 29]; in data science, causal network reconstruction from time series is based on statistical significance tests for the presence of transfer entropy [24].

If uncertainty is measured by the Shannon entropy S[p(x)] = −∫ dx p(x) ln p(x), then the transfer entropy quantifies how much, on average, the uncertainty in predicting y_τ from y decreases if we additionally get to know x, T^{x→y}_τ = ⟨S[p(y_τ|y)] − S[p(y_τ|x, y)]⟩.

While the joint probability p(x, y, y_τ) contains all the physics of the interacting dynamics of x and y, the description in terms of the scalar transfer entropy T^{x→y}_τ represents a form of coarse-graining.

We introduce the local transfer entropy t^{x→y}_τ(x, y) = D[p(y_τ|x, y)||p(y_τ|y)]; thus, for the (macroscopic) transfer entropy, T^{x→y}_τ = ⟨t^{x→y}_τ(x, y)⟩.

We next show that T^{x→y}_τ and Γ^{x→y}_τ are intimately related for linear systems.

Linear stochastic dynamics.
As an example of application, we study the information response in Ornstein-Uhlenbeck (OU) processes [30], i.e., linear stochastic systems of the type

$$ \frac{d\xi^{(i)}_t}{dt} + \sum_{j=1}^{n} A_{ij}\,\xi^{(j)}_t = \eta^{(i)}_t , \qquad (10) $$

where ⟨η^{(i)}_t η^{(j)}_{t'}⟩ = q_{ij} δ(t − t') is Gaussian white noise with symmetric and constant covariance matrix. For the system to be stationary, we require the eigenvalues of the interaction matrix A_{ij} to have positive real part. For our setting, we identify x ≡ ξ^{(i)} and y ≡ ξ^{(j)} for some particular (i, j), and z ≡ {ξ^{(k)}}_{k=1,...,n} \ {ξ^{(i)}, ξ^{(j)}} as the remaining variables. Here, probability densities are normal distributions, p(y_τ|x, y) = N_{y_τ}(⟨y_τ|x, y⟩, σ²_{y_τ|x,y}), with mean ⟨y_τ|x, y⟩ and variance σ²_{y_τ|x,y} ≡ ⟨y²_τ|x, y⟩ − ⟨y_τ|x, y⟩², and similarly for p(y_τ|y) and p(x|y). Expectations depend linearly on the conditions, ∂²_x⟨y_τ|x, y⟩ = 0, and variances are independent of them, ∂_x σ²_{y_τ|x,y} = 0. Recall the implicit conditioning on the confounding variables z through y.

Applying these Gaussian properties to Eq. (8), the information response becomes

$$ \Gamma^{x\to y}_\tau = \frac{\big(\partial_x \langle y_\tau | x, y\rangle\big)^2\, \sigma^2_{x|y}}{\sigma^2_{y_\tau|x,y}} , \qquad (11) $$

where ∂_x⟨y_τ|x, y⟩ can be interpreted as the coefficient of x in the linear regression for y_τ based on the predictors (x, y), and σ²_{y_τ|x,y} as its error variance. The variance σ²_{x|y} quantifies the strength of the natural fluctuations of x (the variable to be perturbed) conditional on y (the other variables). In fact, the information-theoretic cost of the perturbation, c_x(ε) = (ε²/2) σ⁻²_{x|y} + O(ε³), is higher if x and y are more correlated.

In linear systems, the transfer entropy is equivalent to Granger causality [31],

$$ T^{x\to y}_\tau = \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right) , \qquad (12) $$

as can be seen by substituting the Gaussian expressions for p(y_τ|x, y) and p(y_τ|y) into Eq. (9).

FIG. 2. Local information response (Left) and local transfer entropy (Right) are different, although their expectation values agree in linear systems. The model is the OU process of Eq. (15) with parameters t_R = 10, q = 0.…, α = 0.…, β = 0.…, τ = 3.

The decrease in uncertainty when adding the predictor x to the linear regression of y_τ based on y reads

$$ \sigma^2_{y_\tau|y} - \sigma^2_{y_\tau|x,y} = \sigma^2_{x|y}\,\big(\partial_x \langle y_\tau | x, y\rangle\big)^2 , \qquad (13) $$

see SI text. Comparing Eq. (11) with Eq. (12) and using Eq. (13), we obtain a non-trivial equivalence between information response and transfer entropy for OU processes,

$$ \Gamma^{x\to y}_\tau = e^{2 T^{x\to y}_\tau} - 1 . \qquad (14) $$

Remarkably, despite the equivalence of the macroscopic quantities Γ^{x→y}_τ and T^{x→y}_τ, the corresponding local quantities are markedly different, see Fig. 2.

In Fig. 2, we show the local information response γ^{x→y}_τ(x, y) and the local transfer entropy t^{x→y}_τ(x, y) for the hierarchical OU process of two variables,

$$ \frac{dx}{dt} = -\frac{x}{t_R} + \eta_t , \qquad \frac{dy}{dt} = \alpha x - \beta y , \qquad (15) $$

with ⟨η_t η_{t'}⟩ = q δ(t − t'), and parameters α, β > 0, t_R > 0, q > 0. This is possibly the simplest model of nonequilibrium stationary interacting dynamics with continuous variables [32]. However, the pattern of Fig. 2 is qualitatively the same for any linear OU process. In fact, the perturbation x ⇒ x + ε shifts the prediction p(y_τ|x, y) by the same amount on the y axis, ε∂_x⟨y_τ|x, y⟩, independently of the condition (x, y), without affecting the variance σ²_{y_τ|x,y}. Hence, d^{x→y}_τ(x, y, ε) is constant in space, and the local contribution only reflects the density p(x, y), here a bivariate Gaussian. On the contrary, the KL divergence corresponding to the change of the prediction p(y_τ|y) into p(y_τ|x, y) given by the knowledge of x is strongly dependent on (x, y). In fact, the local transfer entropy reads

$$ t^{x\to y}_\tau(x,y) = T^{x\to y}_\tau + \frac{\big(\partial_x\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|y}} \Big[ \big(x - \langle x|y\rangle\big)^2 - \sigma^2_{x|y} \Big] , \qquad (16) $$

see SI text. In particular, for likely values x ≈ ⟨x|y⟩, the divergence t^{x→y}_τ(x, y) is smaller compared to the unlikely situations x ≫ ⟨x|y⟩ and x ≪ ⟨x|y⟩. Thus, when multiplied by the steady-state density p(x, y), t^{x→y}_τ(x, y) attains a bimodal shape.
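The following sketch (an addition, not part of the original manuscript) checks Eq. (14) for the two-variable OU process of Eq. (15) using its exact Gaussian covariances; the parameter values are illustrative guesses, since the figure's exact values are not recoverable here.

```python
import numpy as np

# Gaussian check of Eq. (14), Gamma = exp(2*T) - 1, for the 2D OU process of Eq. (15).
t_R, q, alpha, beta, tau = 10.0, 0.1, 0.5, 0.2, 3.0   # illustrative guesses

A = np.array([[1.0 / t_R, 0.0],       # drift matrix: d(xi)/dt = -A xi + eta
              [-alpha,    beta]])
Q = np.array([[q, 0.0],               # white-noise covariance (only x is driven)
              [0.0, 0.0]])

# Stationary covariance C solves A C + C A^T = Q (vectorized Lyapunov equation).
I2 = np.eye(2)
C = np.linalg.solve(np.kron(A, I2) + np.kron(I2, A), Q.flatten()).reshape(2, 2)

# Time-lagged covariance <xi(tau) xi(0)^T> = expm(-A*tau) C, via eigendecomposition.
w, V = np.linalg.eig(-A * tau)
lagged = (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real @ C

# Joint covariance of (x0, y0, y_tau); stationarity gives var(y_tau) = C[1, 1].
S = np.array([[C[0, 0],      C[0, 1],      lagged[1, 0]],
              [C[0, 1],      C[1, 1],      lagged[1, 1]],
              [lagged[1, 0], lagged[1, 1], C[1, 1]]])

# Gaussian conditional variances and the regression coefficient of x in <y_tau | x, y>.
coef = np.linalg.solve(S[:2, :2], S[:2, 2])            # regression coefficients on (x, y)
var_ytau_xy = S[2, 2] - S[2, :2] @ coef                # sigma^2_{y_tau | x, y}
var_ytau_y = S[2, 2] - S[1, 2] ** 2 / S[1, 1]          # sigma^2_{y_tau | y}
var_x_y = S[0, 0] - S[0, 1] ** 2 / S[1, 1]             # sigma^2_{x | y}

Gamma = coef[0] ** 2 * var_x_y / var_ytau_xy           # Eq. (11)
T = 0.5 * np.log(var_ytau_y / var_ytau_xy)             # Eq. (12)
print(Gamma, np.exp(2 * T) - 1)                        # coincide, Eq. (14)
```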
Nonlinear example.

As a counter-example to the general validity of Eq. (14) for nonlinear systems, consider the following nonlinear Langevin equation for two variables,

$$ \frac{dx}{dt} = -\frac{x}{t_R} + \eta_t , \qquad \frac{dy}{dt} = \alpha x^2 - \beta y . \qquad (17) $$

Numerical simulations (same parameters as for Eq. (15)) show that Eq. (14) is violated, see SI for details. Hence, in general, the transfer entropy is not easily connected to the information response.
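Below is a minimal Euler-Maruyama sketch (an addition, not in the original) for generating stationary trajectories of the nonlinear model of Eq. (17); the integration step, trajectory length, and parameters are arbitrary choices, and the resulting samples could then be used to estimate the divergences numerically.

```python
import numpy as np

# Euler-Maruyama integration of Eq. (17): dx/dt = -x/t_R + eta, dy/dt = alpha*x**2 - beta*y,
# with <eta_t eta_t'> = q*delta(t - t'). Parameters are illustrative guesses.
rng = np.random.default_rng(1)
t_R, q, alpha, beta = 10.0, 0.1, 0.5, 0.2
dt, n_steps = 0.01, 200_000

x, y = 0.0, 0.0
traj = np.empty((n_steps, 2))
for i in range(n_steps):
    x += (-x / t_R) * dt + np.sqrt(q * dt) * rng.normal()
    y += (alpha * x ** 2 - beta * y) * dt
    traj[i] = x, y

# Discard a transient of a few relaxation times; the remaining samples approximate the
# stationary ensemble from which p(x, y, y_tau), Gamma, and T could be estimated.
stationary = traj[int(5 * t_R / dt):]
print(stationary.mean(axis=0), stationary.std(axis=0))
```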
Ensemble information response.

Similar to the above, we can define an analogous information response at the ensemble level. From the same perturbation x ⇒ x + ε, we consider the unconditional response divergence

$$ \tilde d^{\,x\to y}_\tau(\epsilon) \equiv D\big[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\big] , \qquad (18) $$

i.e., we evaluate the response at the ensemble level, without knowledge of the measurement (x, y),

$$ p(y_\tau \,|\, x \Rightarrow x+\epsilon) = \big\langle p(y_\tau \,|\, x, y;\; x \Rightarrow x+\epsilon) \big\rangle . \qquad (19) $$

In general, $\tilde d^{\,x\to y}_\tau(\epsilon) \neq \langle d^{x\to y}_\tau(x,y,\epsilon)\rangle$. We define the ensemble information response as

$$ \tilde\Gamma^{x\to y}_\tau \equiv \lim_{\epsilon\to 0} \frac{\tilde d^{\,x\to y}_\tau(\epsilon)}{c_x(\epsilon)} = - \frac{\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y) \,\big|\, y_\tau \big\rangle^2 \Big\rangle}{\big\langle \partial_x^2 \ln p(x|y) \big\rangle} , \qquad (20) $$

where the second equality, valid only for τ > 0, is the corresponding fluctuation-response theorem. A straightforward generalization to arbitrary perturbation profiles ε(x, y) is discussed in the SI text. Note that we could write $\tilde d^{\,x\to y}_\tau(\epsilon)$ through the Fisher information $\langle \partial^2_\epsilon \ln \langle p(y_\tau|x+\epsilon,y)\rangle\rangle\big|_{\epsilon=0}$, but the partial derivative would be over the perturbation parameter ε, and we found it more natural to consider the self-prediction quantity ⟨⟨∂_x ln p(y_τ|x, y)|y_τ⟩²⟩. See SI text for technical details on expectation brackets.

In linear systems, the ensemble information response takes the form

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, e^{-2 I_{xy,y_\tau}} = e^{-2 I_{y,y_\tau}} \big( 1 - e^{-2 T^{x\to y}_\tau} \big) , \qquad (21) $$

where $I_{y,y_\tau} \equiv D[\,p(y,y_\tau)\,\|\,p(y)\,p(y_\tau)\,]$ is the mutual information between y and y_τ, and $I_{xy,y_\tau} = I_{y,y_\tau} + T^{x\to y}_\tau$ is the mutual information that the two predictors (x, y) together have about the output y_τ, see SI text.

From the nonnegativity of informations, we obtain the bound 0 ≤ Γ̃^{x→y}_τ ≤ 1. We see that Γ̃^{x→y}_τ increases with the transfer entropy T^{x→y}_τ, and decreases with the autocorrelation I_{y,y_τ}. Since I_{y,y_τ} diverges for τ → 0, the ensemble information response vanishes in this limit: the perturbation of the x ensemble takes a finite time to fully propagate its effect to the y ensemble. Since time-lagged informations vanish for τ → ∞ in ergodic processes, ensembles relax asymptotically towards steady state after a perturbation, and correspondingly the ensemble information response vanishes. This provides a trade-off shape for Γ̃^{x→y}_τ as a function of the timescale τ. Note the asymptotics Γ̃^{x→y}_τ / Γ^{x→y}_τ → 1 for τ → ∞, also resulting from ergodicity.

Discussion.
In this Letter, we introduced a new measure of causation that has both the invariance properties required for an information-theoretic measure and the physical interpretation of a propagation of perturbations. It has the form of a linear response coefficient between Kullback-Leibler divergences, and it is based on the information-theoretic cost of perturbations. We would like to call it information response.

We study the behavior of the information response analytically in linear stochastic systems, and show that it reduces to the known transfer entropy in this case. This establishes a first connection between fluctuation-response theory and information flow, i.e., the two main perspectives on the problem of causation at present. Additionally, it provides a new relation between Fisher and mutual information.

We suggest our information response for the design of new quantitative causal inference methods [24]. Its practical estimation on time series, as is normally the case for information-theoretic measures, depends on the learnability of probability distributions from a finite amount of data [33, 34].
Acknowledgments
We thank M. Scazzocchio for helpful discussions. AA is supported by the DFG through FR3429/3 to BMF; AA and BMF are supported through the Excellence Initiative by the German Federal and State Governments (Cluster of Excellence PoL EXC-2068).

[1] A. K. Seth, A. B. Barrett, and L. Barnett, Granger causality analysis in neuroscience and neuroimaging, Journal of Neuroscience, 3293 (2015).
[2] J. Runge, S. Bathiany, E. Bollt, G. Camps-Valls, D. Coumou, E. Deyle, C. Glymour, M. Kretschmer, M. D. Mahecha, J. Muñoz-Marí, et al., Inferring causation from time series in Earth system sciences, Nature Communications, 1 (2019).
[3] L. Luzzatto and P. P. Pandolfi, Causality and chance in the development of cancer, N Engl J Med, 84 (2015).
[4] O. Kwon and J.-S. Yang, Information flow between stock indices, EPL (Europhysics Letters), 68003 (2008).
[5] S. Ito and T. Sagawa, Information thermodynamics on causal networks, Physical Review Letters, 180603 (2013).
[6] J. M. Horowitz and M. Esposito, Thermodynamics with continuous information flow, Physical Review X, 031015 (2014).
[7] R. G. James, N. Barnett, and J. P. Crutchfield, Information flows? A critique of transfer entropies, Physical Review Letters, 238701 (2016).
[8] A. Auconi, A. Giansanti, and E. Klipp, Causal influence in linear Langevin networks without feedback, Physical Review E, 042315 (2017).
[9] J. Pearl, Causality (Cambridge University Press, 2009).
[10] D. Janzing, D. Balduzzi, M. Grosse-Wentrup, B. Schölkopf, et al., Quantifying causal influences, The Annals of Statistics, 2324 (2013).
[11] E. Aurell and G. Del Ferraro, Causal analysis, correlation-response, and dynamic cavity, in Journal of Physics: Conference Series, Vol. 699 (2016) p. 012002.
[12] M. Baldovin, F. Cecconi, and A. Vulpiani, Understanding causation via correlations and linear response theory, Physical Review Research, 043436 (2020).
[13] J. Massey, Causality, feedback and directed information, in Proc. Int. Symp. Inf. Theory Applic. (ISITA-90) (Citeseer, 1990) pp. 303–305.
[14] T. Schreiber, Measuring information transfer, Physical Review Letters, 461 (2000).
[15] N. Ay and D. Polani, Information flows in causal networks, Advances in Complex Systems, 17 (2008).
[16] J. M. Parrondo, J. M. Horowitz, and T. Sagawa, Thermodynamics of information, Nature Physics, 131 (2015).
[17] T. M. Cover, Elements of Information Theory (John Wiley & Sons, 1999).
[18] S. I. Amari,
Information Geometry and Its Applications, Vol. 194 (Springer, 2016).
[19] R. Kubo, The fluctuation-dissipation theorem, Reports on Progress in Physics, 255 (1966).
[20] R. Kubo, Brownian motion and nonequilibrium statistical mechanics, Science, 330 (1986).
[21] U. M. B. Marconi, A. Puglisi, L. Rondoni, and A. Vulpiani, Fluctuation-dissipation: response theory in statistical physics, Physics Reports, 111 (2008).
[22] C. Maes, Response theory: A trajectory-based approach, Frontiers in Physics, 229 (2020).
[23] A. Dechant and S.-i. Sasa, Fluctuation-response inequality out of equilibrium, Proceedings of the National Academy of Sciences, 6430 (2020).
[24] J. Runge, Causal network reconstruction from time series: From theoretical assumptions to practical estimation, Chaos: An Interdisciplinary Journal of Nonlinear Science, 075310 (2018).
[25] S. Ito and A. Dechant, Stochastic time evolution, information geometry, and the Cramér-Rao bound, Physical Review X, 021056 (2020).
[26] Eq. (8) holds for a larger class of divergences beyond the KL divergence, because the Fisher information is the unique invariant metric [18].
[27] X.-X. Wei and A. A. Stocker, Mutual information, Fisher information, and efficient coding, Neural Computation, 305 (2016).
[28] T. Sagawa and M. Ueda, Nonequilibrium thermodynamics of feedback control, Physical Review E, 021104 (2012).
[29] M. L. Rosinberg and J. M. Horowitz, Continuous information flow fluctuations, EPL (Europhysics Letters), 10007 (2016).
[30] H. Risken, Fokker-Planck equation, in The Fokker-Planck Equation (Springer, 1996) pp. 63–95.
[31] L. Barnett, A. B. Barrett, and A. K. Seth, Granger causality and transfer entropy are equivalent for Gaussian variables, Physical Review Letters, 238701 (2009).
[32] A. Auconi, A. Giansanti, and E. Klipp, Information thermodynamics for time series of signal-response models, Entropy, 177 (2019).
[33] W. Bialek, C. G. Callan, and S. P. Strong, Field theories for learning probability distributions, Physical Review Letters, 4693 (1996).
[34] W. Bialek, S. E. Palmer, and D. J. Schwab, What makes it possible to learn probability distributions in the natural world?, arXiv preprint arXiv:2008.12279 (2020).

Supplementary Information for the manuscript "Fluctuation-response theorem for Kullback-Leibler divergences to quantify causation"

Andrea Auconi, Benjamin Friedrich, and Andrea Giansanti
A. Convention on expectation symbols
In order not to overload the formalism, when taking expectations we do not specify the variables over which they are taken. However, here we show how they can be understood immediately from the context. As an example, let us consider the following expression (the numerator of Eq. (20) in the main text),

$$ \Big\langle \big\langle \partial_x \ln p(y_\tau | x, y) \,\big|\, y_\tau \big\rangle^2 \Big\rangle , \qquad (1) $$

where p(y_τ|x, y) denotes the conditional probability of y_τ given the knowledge of (x, y). The term ∂_x ln p(y_τ|x, y) is a function of the three variables (x, y, y_τ). The expectation ⟨∂_x ln p(y_τ|x, y)|y_τ⟩ is conditional on the knowledge of y_τ, therefore it is taken over the remaining variables (x, y) with respect to the conditional probability p(x, y|y_τ),

$$ \big\langle \partial_x \ln p(y_\tau|x,y) \,\big|\, y_\tau \big\rangle \equiv \int\!\!\int dx\, dy\; p(x,y|y_\tau)\, \partial_x \ln p(y_\tau|x,y) . \qquad (2) $$

The outer expectation, being left with only the variable y_τ, is necessarily taken with respect to the unconditional probability p(y_τ),

$$ \Big\langle \big\langle \partial_x \ln p(y_\tau|x,y) \,\big|\, y_\tau \big\rangle^2 \Big\rangle = \int dy_\tau\; p(y_\tau) \left( \int\!\!\int dx\, dy\; p(x,y|y_\tau)\, \partial_x \ln p(y_\tau|x,y) \right)^{\!2} . \qquad (3) $$

As a second example, consider the Kullback-Leibler divergence

$$ D\big[\, \langle p(y_\tau|x+\epsilon,y)\rangle \,\big\|\, p(y_\tau) \,\big] \equiv \int dy_\tau \; \langle p(y_\tau|x+\epsilon,y)\rangle \, \ln\!\left( \frac{\langle p(y_\tau|x+\epsilon,y)\rangle}{p(y_\tau)} \right) , \qquad (4) $$

which quantifies the distinguishability of two probability distributions defined over the same variable y_τ. When introduced in its context (below Eq. (29)), we know that (x, y) are stochastic variables, while ε is a scalar. Therefore the expectation ⟨p(y_τ|x+ε, y)⟩ is necessarily taken over the conditions (x, y),

$$ \langle p(y_\tau|x+\epsilon,y)\rangle = \int\!\!\int dx\, dy\; p(x,y)\, p(y_\tau|x+\epsilon,y) . \qquad (5) $$
1. Iterated conditioning theorem
We often use the iterated conditioning theorem, ⟨⟨y|x⟩⟩ = ⟨y⟩:

$$ \langle\langle y|x\rangle\rangle = \int dx\, p(x) \int dy\, p(y|x)\, y = \int dy\, p(y)\, y \int dx\, p(x|y) = \int dy\, p(y)\, y = \langle y \rangle . \qquad (6) $$

For a more general proof in terms of σ-algebras, see [1, 2]. Importantly, note that also ⟨⟨⟨y_τ|x⟩|y⟩⟩ = ⟨y_τ⟩:

$$ \langle\langle\langle y_\tau|x\rangle|y\rangle\rangle = \int dy\, p(y) \int dx\, p(x|y) \int dy_\tau\, p(y_\tau|x)\, y_\tau = \int dx\, p(x) \int dy_\tau\, p(y_\tau|x)\, y_\tau = \langle y_\tau \rangle . \qquad (7) $$
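A quick Monte Carlo illustration of Eq. (6) (added here, not in the original SI), using a correlated Gaussian pair whose conditional mean is known in closed form; the correlation and scales are arbitrary choices.

```python
import numpy as np

# Check <<y|x>> = <y> by Monte Carlo for a correlated Gaussian pair (x, y):
# for zero means, <y|x> has the closed form rho*(sd_y/sd_x)*x.
rng = np.random.default_rng(3)
rho, sd_x, sd_y = 0.6, 1.0, 2.0
cov = [[sd_x ** 2, rho * sd_x * sd_y], [rho * sd_x * sd_y, sd_y ** 2]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=500_000).T

cond_mean_y_given_x = rho * (sd_y / sd_x) * x   # <y|x> for this Gaussian model
print(cond_mean_y_given_x.mean(), y.mean())     # both estimate <y> = 0
```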
B. Perturbation divergence in an underdamped Brownian particle

Let us consider an underdamped Brownian particle of mass m immersed in a thermal reservoir at temperature T and with viscous damping λ. The stochastic dynamics of its velocity v_t follows the Langevin equation [3],

$$ m\,\dot v_t = -\lambda v_t + \xi_t + F_t , \qquad (8) $$

where ξ_t denotes Gaussian white noise with covariance ⟨ξ_t ξ_{t'}⟩ = 2λT δ(t − t'), and we put the Boltzmann constant to unity, k_B = 1. The external perturbation is exerted through a force of intensity f/Δt applied to the particle during the time interval [0, Δt], that is written F_t = (f/Δt) I_{[0,Δt]}(t), and converges to a pulse for Δt → 0. Before the perturbation is applied at time t = 0, the ensemble is at equilibrium and velocities are Gaussian distributed: p(v_0) = G_{v_0}(0, T/m). For a generic time instant t in 0 ≤ t ≤ Δt the formal solution for v_t is written

$$ v_t = v_0\, e^{-\frac{\lambda}{m}t} + \frac{1}{m}\int_0^t dt'\; \xi_{t'}\, e^{-\frac{\lambda}{m}(t-t')} + \frac{f}{\lambda\,\Delta t}\left(1 - e^{-\frac{\lambda}{m}t}\right), \qquad (9) $$

the three terms corresponding respectively to the deterministic relaxation of the initial condition, the noise, and the perturbation. Let us evaluate the total variation of the velocity at the end of the perturbation period,

$$ \Delta v \equiv v_{\Delta t} - v_0 = \left(\frac{f}{\lambda\,\Delta t} - v_0\right)\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) + \frac{1}{m}\int_0^{\Delta t} dt'\; \xi_{t'}\, e^{-\frac{\lambda}{m}(\Delta t - t')}. \qquad (10) $$

Let us consider its average over the noise realizations conditional on the initial velocity v_0,

$$ \langle \Delta v \,|\, v_0\rangle = \left(\frac{f}{\lambda\,\Delta t} - v_0\right)\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) = \frac{f}{m} + O(\Delta t), \qquad (11) $$

which is independent of v_0 at zero order in Δt. The fluctuations around ⟨Δv|v_0⟩ are described by the variance

$$ \sigma^2_{\Delta v|v_0} \equiv \left\langle \big(\Delta v - \langle \Delta v|v_0\rangle\big)^2 \,\Big|\, v_0 \right\rangle = \frac{1}{m^2}\int_0^{\Delta t}\!\!\int_0^{\Delta t}\! dt'\, dt''\; \langle \xi_{t'}\xi_{t''}\rangle\, e^{-\frac{\lambda}{m}(2\Delta t - t' - t'')} = \frac{T}{m}\left(1 - e^{-\frac{2\lambda}{m}\Delta t}\right) = O(\Delta t), \qquad (12) $$

which vanishes in the limit of a pulse perturbation Δt → 0. Therefore, if the perturbation is performed fast enough, the Gaussian p(Δv|v_0) converges pointwise to the Dirac delta δ(Δv − ⟨Δv⟩), and we can simply write

$$ \Delta v = \langle \Delta v\rangle = \frac{f}{m}. \qquad (13) $$

Let us now consider the amount of work required to perform the perturbation,

$$ W \equiv \int_0^{\Delta t} F_t\, v_t\, dt = \frac{f}{\Delta t}\int_0^{\Delta t} v_t\, dt = \frac{f}{\Delta t}\left[ v_0\,\frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) + \frac{1}{m}\int_0^{\Delta t}\! dt \int_0^{t}\! dt'\; \xi_{t'}\, e^{-\frac{\lambda}{m}(t-t')} + \frac{f}{\lambda\,\Delta t}\left(\Delta t - \frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right)\right)\right]. \qquad (14) $$

Its conditional expectation is

$$ \langle W|v_0\rangle = \frac{f}{\Delta t}\left[ v_0\,\frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right) + \frac{f}{\lambda\,\Delta t}\left(\Delta t - \frac{m}{\lambda}\left(1 - e^{-\frac{\lambda}{m}\Delta t}\right)\right)\right] = f\left(v_0 + \frac{f}{2m}\right) + O(\Delta t) = m\,\Delta v\left(v_0 + \frac{\Delta v}{2}\right) + O(\Delta t), \qquad (15) $$

and its variance is bounded as

$$ \sigma^2_{W|v_0} = \left(\frac{f}{m\,\Delta t}\right)^2 \int_0^{\Delta t}\!dt\int_0^{\Delta t}\!ds\int_0^{t}\!dt'\int_0^{s}\!ds'\; \langle \xi_{t'}\xi_{s'}\rangle\, e^{-\frac{\lambda}{m}(t-t'+s-s')} \le 2\lambda T\left(\frac{f}{m\,\Delta t}\right)^2 \int_0^{\Delta t}\!dt\int_0^{\Delta t}\!ds\int_0^{\min(t,s)}\!dt' \le 2\lambda T\left(\frac{f}{m}\right)^2 \Delta t = O(\Delta t). \qquad (16) $$

We see that the work performed in a pulse perturbation is W = f(v_0 + f/(2m)), which strongly depends on the condition v_0 for small f. The ensemble average is ⟨W⟩ = f²/(2m) = (m/2)(Δv)², and the variance is σ²_W = f²σ²_{v_0} = f²T/m = 2⟨W⟩T. We can identify the average work ⟨W⟩ = (m/2)(Δv)² with the instantaneous change in kinetic energy of the ensemble, which is therefore reversible [4].

Let us consider the perturbation divergence, Eq. (4) in the main text, for this 1D example,

$$ c_v(\Delta v) = D\big[\,p(v_0 - \Delta v)\,\big\|\,p(v_0)\,\big] = D\big[\,p(v_0)\,\big\|\,p(v_0 + \Delta v)\,\big] = \int dv_0\; G_{v_0}\!\left(0, \tfrac{T}{m}\right) \ln\!\left(\frac{G_{v_0}(0, \tfrac{T}{m})}{G_{v_0}(\Delta v, \tfrac{T}{m})}\right) = \frac{m}{2T}\int dv_0\; G_{v_0}\!\left(0, \tfrac{T}{m}\right)\big(\!-2 v_0\, \Delta v + (\Delta v)^2\big) = \frac{m\,(\Delta v)^2}{2T} = \frac{\langle W\rangle}{T}. \qquad (17) $$

For the underdamped Brownian particle, Eq. (17) formalizes the equivalence of the information-theoretic cost and the thermodynamic cost of perturbations, up to a factor being the temperature T.
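The following Langevin simulation (an addition, not in the original SI) illustrates the claim ⟨W⟩/T = m(Δv)²/(2T) for a short force pulse; mass, damping, temperature, pulse impulse, and step sizes are all arbitrary choices.

```python
import numpy as np

# Sketch of the pulse-work identity <W>/T = m*dv**2/(2*T) for an underdamped
# Brownian particle, m*dv/dt = -lam*v + xi + F, with <xi xi'> = 2*lam*T*delta.
rng = np.random.default_rng(2)
m, lam, T = 1.0, 1.0, 1.0            # mass, damping, temperature (k_B = 1)
f, dt_pulse = 0.2, 1e-3              # impulse f applied as a force f/dt_pulse over [0, dt_pulse]
dt, n_part = 1e-5, 200_000

v = rng.normal(0.0, np.sqrt(T / m), n_part)   # equilibrium initial velocities
W = np.zeros(n_part)
force = f / dt_pulse
for _ in range(int(dt_pulse / dt)):
    W += force * v * dt                        # accumulate the work F*v*dt
    noise = rng.normal(0.0, np.sqrt(2 * lam * T * dt), n_part)
    v += (-lam * v + force) * dt / m + noise / m

dv = f / m                                     # pulse-limit velocity shift
print(W.mean() / T, m * dv ** 2 / (2 * T))     # agree up to O(dt_pulse) and sampling error
```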
C. Proof of Equation (6) - Fisher information

Here we review the basic connection between KL divergence and Fisher information, applied to our framework. The interventional causality requirement, Eq. (3) in the main text, imposes that for positive τ > 0 the perturbed prediction is p(y_τ|x, y; x ⇒ x+ε) ≡ p(y_τ|x+ε, y), thus the local response divergence is d^{x→y}_τ(x, y, ε) = D[p(y_τ|x+ε, y)||p(y_τ|x, y)]. Let us take its ensemble average and expand in orders of the perturbation,

$$ \begin{aligned} \langle d^{x\to y}_\tau(x,y,\epsilon)\rangle &= \big\langle D\big[\,p(y_\tau|x+\epsilon,y)\,\big\|\,p(y_\tau|x,y)\,\big]\big\rangle \\ &= \int\!\!\int dx\,dy\; p(x,y)\int dy_\tau\; p(y_\tau|x+\epsilon,y)\,\ln\!\left(\frac{p(y_\tau|x+\epsilon,y)}{p(y_\tau|x,y)}\right) \\ &= \int\!\!\int dx\,dy\; p(x-\epsilon,y)\int dy_\tau\; p(y_\tau|x,y)\,\ln\!\left(\frac{p(y_\tau|x,y)}{p(y_\tau|x-\epsilon,y)}\right) \\ &= -\frac{\epsilon^2}{2}\int\!\!\int dx\,dy\; p(x-\epsilon,y)\int dy_\tau\; p(y_\tau|x,y)\,\partial_x^2 \ln p(y_\tau|x,y) + O(\epsilon^3) \\ &= -\frac{\epsilon^2}{2}\int\!\!\int dx\,dy\; p(x,y)\int dy_\tau\; p(y_\tau|x,y)\,\partial_x^2 \ln p(y_\tau|x,y) + O(\epsilon^3) \\ &= -\frac{\epsilon^2}{2}\,\big\langle \partial_x^2 \ln p(y_\tau|x,y)\big\rangle + O(\epsilon^3), \end{aligned} \qquad (18) $$

where we used the probability normalization,

$$ \int dy_\tau\; p(y_\tau|x,y)\,\partial_x \ln p(y_\tau|x,y) = \int dy_\tau\; \partial_x p(y_\tau|x,y) = \partial_x \int dy_\tau\; p(y_\tau|x,y) = \partial_x 1 = 0 . $$

The quantity −⟨∂²_x ln p(y_τ|x, y)⟩ is called Fisher information [5].

D. Proof of Equation (13)
Here we derive the relation between the variances σ²_{y_τ|x,y} and σ²_{y_τ|y}, starting from the linearity of conditional expectations, ⟨y_τ|x, y⟩ = x ∂_x⟨y_τ|x, y⟩ + y ∂_y⟨y_τ|x, y⟩, and ⟨x|y⟩ = y ∂_y⟨x|y⟩. Recall that the current state of confounding variables z is absorbed into y, so that ⟨y_τ|x, y⟩ ≡ ⟨y_τ|x, y, z⟩, and y ∂_y⟨y_τ|x, y⟩ ≡ y ∂_y⟨y_τ|x, y, z⟩ + z ∂_z⟨y_τ|x, y, z⟩.

We apply the iterated conditioning to p(y_τ|y),

$$ p(y_\tau|y) = \int dx\; p(x, y_\tau|y) = \int dx\; p(x|y)\, p(y_\tau|x,y) = \int dx\; G_x\!\big(\langle x|y\rangle, \sigma^2_{x|y}\big)\, G_{y_\tau}\!\big(\langle y_\tau|x,y\rangle, \sigma^2_{y_\tau|x,y}\big) = A \int dx\; e^{-Bx^2 + Cx} = A\,\sqrt{\frac{\pi}{B}}\; e^{\frac{C^2}{4B}}, \qquad (19) $$

where in the last line we recognized the form of a Gaussian integral with

$$ A = \frac{1}{2\pi\, \sigma_{x|y}\, \sigma_{y_\tau|x,y}}\, \exp\!\left[ -\frac{\big(y_\tau - y\,\partial_y\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|x,y}} - \frac{\langle x|y\rangle^2}{2\,\sigma^2_{x|y}} \right], \qquad (20) $$

$$ B = \frac{1}{2\,\sigma^2_{x|y}} + \frac{\big(\partial_x\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|x,y}}, \qquad (21) $$

$$ C = \frac{\big(y_\tau - y\,\partial_y\langle y_\tau|x,y\rangle\big)\,\partial_x\langle y_\tau|x,y\rangle}{\sigma^2_{y_\tau|x,y}} + \frac{\langle x|y\rangle}{\sigma^2_{x|y}}. \qquad (22) $$

Now we equate the expression of Eq. (19) with the Gaussian p(y_τ|y) = G_{y_τ}(⟨y_τ|y⟩, σ²_{y_τ|y}), which can be done already by equating the prefactors, and obtain

$$ \sigma^2_{y_\tau|y} - \sigma^2_{y_\tau|x,y} = \sigma^2_{x|y}\,\big(\partial_x\langle y_\tau|x,y\rangle\big)^2, \qquad (23) $$

which relates a reduction in variance to the corresponding linear regression coefficient.
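Equation (23) is a general identity for Gaussian triples; the short sketch below (an addition, not in the original SI) verifies it numerically for a random covariance matrix.

```python
import numpy as np

# Check Eq. (23) for a generic Gaussian triple with variable order (x, y, y_tau):
# sigma^2_{ytau|y} - sigma^2_{ytau|x,y} = sigma^2_{x|y} * (d<ytau|x,y>/dx)^2.
rng = np.random.default_rng(4)
M = rng.normal(size=(3, 3))
S = M @ M.T                                         # a random (positive definite) covariance

coef = np.linalg.solve(S[:2, :2], S[:2, 2])         # regression of y_tau on (x, y)
var_ytau_xy = S[2, 2] - S[2, :2] @ coef
var_ytau_y = S[2, 2] - S[1, 2] ** 2 / S[1, 1]
var_x_y = S[0, 0] - S[0, 1] ** 2 / S[1, 1]

print(var_ytau_y - var_ytau_xy, var_x_y * coef[0] ** 2)   # equal
```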
E. Proof of Equation (16)

Here we derive the form of the local contribution t^{x→y}_τ(x, y) to the transfer entropy T^{x→y}_τ, defined by

$$ T^{x\to y}_\tau \equiv \int dx\, dy\, dy_\tau\; p(x,y,y_\tau)\, \ln\!\left(\frac{p(y_\tau|x,y)}{p(y_\tau|y)}\right) \equiv \int dx\, dy\; p(x,y)\; t^{x\to y}_\tau(x,y), \qquad (24) $$

where we identify t^{x→y}_τ(x, y) = D[p(y_τ|x, y)||p(y_τ|y)]. Substituting the Gaussian expressions p(y_τ|x, y) = G(⟨y_τ|x, y⟩, σ²_{y_τ|x,y}) and p(y_τ|y) = G(⟨y_τ|y⟩, σ²_{y_τ|y}), we get

$$ \begin{aligned} t^{x\to y}_\tau(x,y) &= \int dy_\tau\; p(y_\tau|x,y)\, \ln\!\left(\frac{p(y_\tau|x,y)}{p(y_\tau|y)}\right) \\ &= -\frac{1}{2} + \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right) + \frac{1}{2\,\sigma^2_{y_\tau|y}}\int dy_\tau\; p(y_\tau|x,y)\,\big(y_\tau - \langle y_\tau|x,y\rangle + \langle y_\tau|x,y\rangle - \langle y_\tau|y\rangle\big)^2 \\ &= -\frac{1}{2} + \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right) + \frac{\sigma^2_{y_\tau|x,y} + \big(\langle y_\tau|x,y\rangle - \langle y_\tau|y\rangle\big)^2}{2\,\sigma^2_{y_\tau|y}} . \end{aligned} \qquad (25) $$

From the linear regression ⟨y_τ|x, y⟩ = x ∂_x⟨y_τ|x, y⟩ + y ∂_y⟨y_τ|x, y⟩ we find

$$ \langle y_\tau|x,y\rangle - \langle y_\tau|y\rangle = \langle y_\tau|x,y\rangle - \int dx'\; p(x'|y)\,\langle y_\tau|x',y\rangle = \big(x - \langle x|y\rangle\big)\,\partial_x\langle y_\tau|x,y\rangle . \qquad (26) $$

Transfer entropy and Granger causality are equivalent in linear systems [6],

$$ T^{x\to y}_\tau = \frac{1}{2}\ln\!\left(\frac{\sigma^2_{y_\tau|y}}{\sigma^2_{y_\tau|x,y}}\right), \qquad (27) $$

as can be found immediately by substituting the corresponding Gaussian expressions in Eq. (24). This relation together with Eq. (23) and Eq. (26) gives

$$ t^{x\to y}_\tau(x,y) - T^{x\to y}_\tau = \frac{\big(\partial_x\langle y_\tau|x,y\rangle\big)^2}{2\,\sigma^2_{y_\tau|y}}\Big[ \big(x - \langle x|y\rangle\big)^2 - \sigma^2_{x|y} \Big], \qquad (28) $$

which relates the deviation of the local transfer entropy t^{x→y}_τ(x, y) from its macroscopic counterpart T^{x→y}_τ to the local conditional log-likelihood through (x − ⟨x|y⟩)². The minimum local transfer entropy is attained at x = ⟨x|y⟩ for any y, and it gives min_x t^{x→y}_τ(x, y) = (1/2)(e^{−2T^{x→y}_τ} + 2T^{x→y}_τ − 1) ≥ 0.

F. Proof of Eq. (20) - Ensemble information response
Here we give a more detailed derivation of the fluctuation-response theorem for the ensemble information response. Consider the ensemble response divergence of Eqs. (18)-(19) in the main text,

$$ \tilde d^{\,x\to y}_\tau(\epsilon) \equiv D\big[\, p(y_\tau|x\Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\big] = D\big[\, \langle p(y_\tau|x,y;\, x\Rightarrow x+\epsilon)\rangle \,\big\|\, p(y_\tau) \,\big] = D\big[\, \langle p(y_\tau|x+\epsilon,y)\rangle \,\big\|\, p(y_\tau) \,\big], \qquad (29) $$

where the last equality holds only for τ > 0. Here p(y_τ|x ⇒ x+ε) is the probability of y_τ given that at time t = 0 the perturbation x ⇒ x+ε was applied, but without knowledge of the current state (x, y). Assuming p(x, y, y_τ) to be smooth, and expanding in orders of the perturbation, from Eq. (29) we obtain

$$ \begin{aligned} \tilde d^{\,x\to y}_\tau(\epsilon) &= \int dy_\tau\; \langle p(y_\tau|x+\epsilon,y)\rangle\, \ln\!\left(\frac{\langle p(y_\tau|x+\epsilon,y)\rangle}{p(y_\tau)}\right) \\ &= \int dy_\tau\; \big[ p(y_\tau) + \epsilon\,\langle \partial_x p(y_\tau|x,y)\rangle + \tfrac{\epsilon^2}{2}\,\langle \partial^2_x p(y_\tau|x,y)\rangle \big]\, \ln\!\left( 1 + \frac{\epsilon}{p(y_\tau)}\langle \partial_x p(y_\tau|x,y)\rangle + \frac{\epsilon^2}{2\,p(y_\tau)}\langle \partial^2_x p(y_\tau|x,y)\rangle \right) + O(\epsilon^3) \\ &= \frac{\epsilon^2}{2}\int dy_\tau\; \frac{\langle \partial_x p(y_\tau|x,y)\rangle^2}{p(y_\tau)} + O(\epsilon^3) \\ &= \frac{\epsilon^2}{2}\,\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle^2 \Big\rangle + O(\epsilon^3), \end{aligned} \qquad (30) $$

where we used ⟨p(y_τ|x, y)⟩ = p(y_τ); from the second line we expanded the logarithm, ln(1+δ) = δ − δ²/2 + O(δ³), inverted the order of integration and derivation, and used the normalization of probability, namely

$$ \int dy_\tau\; \langle \partial^n_x p(y_\tau|x,y)\rangle = \left\langle \partial^n_x \int dy_\tau\; p(y_\tau|x,y) \right\rangle = \langle \partial^n_x 1\rangle = 0, \qquad (31) $$

for any n ∈ ℕ⁺. The last line of Eq. (30) follows from

$$ \langle \partial_x p(y_\tau|x,y)\rangle = \langle p(y_\tau|x,y)\, \partial_x \ln p(y_\tau|x,y)\rangle = \int dx\, dy\; p(x,y,y_\tau)\, \partial_x \ln p(y_\tau|x,y) = p(y_\tau)\,\big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle . \qquad (32) $$

We define the ensemble information response as

$$ \tilde\Gamma^{x\to y}_\tau \equiv \lim_{\epsilon\to 0} \frac{\tilde d^{\,x\to y}_\tau(\epsilon)}{c_x(\epsilon)}, \qquad (33) $$

and from Eq. (30) the fluctuation-response theorem directly follows,

$$ \tilde\Gamma^{x\to y}_\tau = - \frac{\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle^2 \Big\rangle}{\big\langle \partial^2_x \ln p(x|y)\big\rangle} . \qquad (34) $$

G. Proof of Equation (21)
Let us substitute the Gaussian expressions for the probabilities of a linear Ornstein-Uhlenbeck process into the ensemble information response (Eq. (34)),

$$ \begin{aligned} \tilde\Gamma^{x\to y}_\tau &= - \frac{\Big\langle \big\langle \partial_x \ln p(y_\tau|x,y)\,\big|\,y_\tau\big\rangle^2 \Big\rangle}{\big\langle \partial^2_x \ln p(x|y)\big\rangle} = \sigma^2_{x|y}\left(\frac{\partial_x\langle y_\tau|x,y\rangle}{\sigma^2_{y_\tau|x,y}}\right)^{\!2} \Big\langle \big\langle y_\tau - \langle y_\tau|x,y\rangle \,\big|\, y_\tau\big\rangle^2 \Big\rangle \\ &= \Gamma^{x\to y}_\tau\, \frac{1}{\sigma^2_{y_\tau|x,y}}\, \Big\langle \big( y_\tau - \langle\langle y_\tau|x,y\rangle|y_\tau\rangle \big)^2 \Big\rangle = \Gamma^{x\to y}_\tau\, \frac{\sigma^2_{y_\tau}}{\sigma^2_{y_\tau|x,y}}\, \big( 1 - \partial_{y_\tau}\langle x|y_\tau\rangle\, \partial_x\langle y_\tau|x,y\rangle - \partial_{y_\tau}\langle y|y_\tau\rangle\, \partial_y\langle y_\tau|x,y\rangle \big)^2, \end{aligned} \qquad (35) $$

where in the third passage we used Γ^{x→y}_τ = (∂_x⟨y_τ|x, y⟩)² σ²_{x|y} / σ²_{y_τ|x,y} (Eq. (11) in the main text), and ⟨x|y_τ⟩ = y_τ ∂_{y_τ}⟨x|y_τ⟩. The last line can be written more compactly as

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, \frac{\sigma^2_{y_\tau}}{\sigma^2_{y_\tau|x,y}}\, \big( 1 - \partial_{y_\tau}\langle\langle y_\tau|x,y\rangle|y_\tau\rangle \big)^2 . \qquad (36) $$

FIG. 1. Information response Γ^{x→y}_τ and its ensemble counterpart Γ̃^{x→y}_τ as a function of the timescale τ. The model is the (linear) OU process of Eq. (44) with parameters t_R = 10, β = 0.…, α = 0.…, D = 0.….

To relate it with Shannon information and transfer entropy, let us consider the conditional variance

$$ \sigma^2_{y_\tau|x,y} \equiv \langle y^2_\tau|x,y\rangle - \langle y_\tau|x,y\rangle^2 . \qquad (37) $$

Using the linear property of variances being independent of the conditions, ∂_x σ²_{y_τ|x,y} = 0, which implies ⟨σ²_{y_τ|x,y}⟩ = σ²_{y_τ|x,y}, we take the expectation of Eq. (37), obtaining

$$ \sigma^2_{y_\tau|x,y} = \big\langle \langle y^2_\tau|x,y\rangle \big\rangle - \big\langle \langle y_\tau|x,y\rangle^2 \big\rangle . \qquad (38) $$

We see that the first term on the RHS is simply the unconditional variance σ²_{y_τ} = σ²_y. Indeed, from iterated conditioning,

$$ \big\langle \langle y^2_\tau|x,y\rangle \big\rangle = \int\!\!\int dx\, dy\; p(x,y)\, \langle y^2_\tau|x,y\rangle = \int\!\!\int\!\!\int dx\, dy\, dy_\tau\; p(x,y)\, p(y_\tau|x,y)\, y^2_\tau = \int dy_\tau\; p(y_\tau)\, y^2_\tau = \langle y^2_\tau\rangle = \sigma^2_y . \qquad (39) $$

Then, substituting in Eq. (38) we get

$$ \begin{aligned} \sigma^2_y - \sigma^2_{y_\tau|x,y} &= \big\langle \langle y_\tau|x,y\rangle^2 \big\rangle = \big\langle \langle y_\tau|x,y\rangle \big( x\,\partial_x\langle y_\tau|x,y\rangle + y\,\partial_y\langle y_\tau|x,y\rangle \big) \big\rangle \\ &= \langle x\, y_\tau\rangle\, \partial_x\langle y_\tau|x,y\rangle + \langle y\, y_\tau\rangle\, \partial_y\langle y_\tau|x,y\rangle = \sigma^2_y\, \big( \partial_{y_\tau}\langle x|y_\tau\rangle\, \partial_x\langle y_\tau|x,y\rangle + \partial_{y_\tau}\langle y|y_\tau\rangle\, \partial_y\langle y_\tau|x,y\rangle \big) = \sigma^2_y\, \partial_{y_\tau}\langle\langle y_\tau|x,y\rangle|y_\tau\rangle, \end{aligned} \qquad (40) $$

which substituted into Eq. (36) gives

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, \frac{\sigma^2_{y_\tau|x,y}}{\sigma^2_{y_\tau}} . \qquad (41) $$

Let us introduce the total information as the mutual information between the couple of variables (x, y) and y_τ,

$$ I_{xy,y_\tau} \equiv D\big[\, p(x,y,y_\tau) \,\big\|\, p(x,y)\, p(y_\tau) \,\big] = D\big[\, p(y,y_\tau) \,\big\|\, p(y)\, p(y_\tau) \,\big] + \big\langle D\big[\, p(x,y_\tau|y) \,\big\|\, p(x|y)\, p(y_\tau|y) \,\big] \big\rangle = I_{y,y_\tau} + T^{x\to y}_\tau . \qquad (42) $$

In linear systems the mutual information is I_{y,y_τ} = (1/2) ln(σ²_{y_τ}/σ²_{y_τ|y}) and the transfer entropy is T^{x→y}_τ = (1/2) ln(σ²_{y_τ|y}/σ²_{y_τ|x,y}), so that the total information is I_{xy,y_τ} = I_{y,y_τ} + T^{x→y}_τ = (1/2) ln(σ²_{y_τ}/σ²_{y_τ|x,y}), and from Eq. (41) we obtain

$$ \tilde\Gamma^{x\to y}_\tau = \Gamma^{x\to y}_\tau\, e^{-2 I_{xy,y_\tau}} = e^{-2 I_{y,y_\tau}}\big( 1 - e^{-2 T^{x\to y}_\tau} \big), \qquad (43) $$

which relates the two definitions of information response. In Fig. 1 we plot them for the 2D hierarchical OU process [9] (Eq. (15) in the main text),

$$ \frac{dx}{dt} = -\frac{x}{t_R} + \eta_t , \qquad \frac{dy}{dt} = \alpha x - \beta y , \qquad (44) $$

with ⟨η_t η_{t'}⟩ = q δ(t − t'), and parameters α, β > 0, t_R > 0, q > 0.
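The sketch below (an addition, not in the original SI) checks Eq. (43) for the linear OU process of Eq. (44) via its exact Gaussian covariances; parameter values are illustrative guesses, as in the earlier sketches.

```python
import numpy as np

# Gaussian check of Eq. (43): Gamma_tilde = Gamma*exp(-2*I_xy) = exp(-2*I_y)*(1 - exp(-2*T)).
t_R, q, alpha, beta, tau = 10.0, 0.1, 0.5, 0.2, 3.0   # illustrative guesses
A = np.array([[1.0 / t_R, 0.0], [-alpha, beta]])
Q = np.array([[q, 0.0], [0.0, 0.0]])
I2 = np.eye(2)

C = np.linalg.solve(np.kron(A, I2) + np.kron(I2, A), Q.flatten()).reshape(2, 2)
w, V = np.linalg.eig(-A * tau)
lagged = (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real @ C    # <xi(tau) xi(0)^T>

S = np.array([[C[0, 0],      C[0, 1],      lagged[1, 0]],
              [C[0, 1],      C[1, 1],      lagged[1, 1]],
              [lagged[1, 0], lagged[1, 1], C[1, 1]]])            # covariance of (x0, y0, y_tau)

coef = np.linalg.solve(S[:2, :2], S[:2, 2])
var_ytau_xy = S[2, 2] - S[2, :2] @ coef
var_ytau_y = S[2, 2] - S[1, 2] ** 2 / S[1, 1]
var_x_y = S[0, 0] - S[0, 1] ** 2 / S[1, 1]

Gamma = coef[0] ** 2 * var_x_y / var_ytau_xy                     # Eq. (11)
T = 0.5 * np.log(var_ytau_y / var_ytau_xy)                       # transfer entropy
I_y = 0.5 * np.log(S[2, 2] / var_ytau_y)                         # mutual information I_{y,ytau}
Gamma_tilde = Gamma * var_ytau_xy / S[2, 2]                      # Eq. (41)
print(Gamma_tilde, np.exp(-2 * I_y) * (1 - np.exp(-2 * T)))      # Eq. (43), equal
```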
H. Nonlinear example

We considered the nonlinear SDE

$$ \frac{dx}{dt} = -\frac{x}{t_{\rm rel}} + \eta_t , \qquad \frac{dy}{dt} = \alpha x^2 - \beta y , \qquad (45) $$

with white noise ⟨η_t η_{t'}⟩ = q δ(t − t'), and parameters α, β > 0, t_rel > 0, q > 0. For intuition, x can be interpreted as an external fluctuating concentration signal with timescale t_rel, and y as a noiseless biochemical response that is more activated when the signal is far from its average value x = 0 in either the positive or the negative direction. We checked numerically that the equivalence between transfer entropy and information response for linear OU processes (Eq. (14) in the main text) does not hold here (see Fig. 3), and the transfer entropy is not easily connected to interventional causation. For a specific τ = 3 we plot the local contributions to the response divergence and transfer entropy, see Fig. 2. The local response divergence is governed, at least qualitatively, by the squared derivative of the quadratic interaction, ∼(∂_x x²)² ∼ x². As a result, the product d^{x→y}_τ(x, y, ε) p(x, y) is bimodal. The conditional local density p(x|y), at least for large y, is also bimodal because of the quadratic driving and the finite correlation time of the signal. For a given y, the local transfer entropy t^{x→y}_τ(x, y) ≡ D[p(y_τ|x, y)||p(y_τ|y)] is larger for unlikely x, which means, given the bimodality of p(x|y), in addition to the increase in the two tails x → ±∞ as in the linear case, also towards a peak at x = 0. Therefore, when multiplied by the local density p(x, y), the local transfer entropy contribution t^{x→y}_τ(x, y) p(x, y) has four peaks for a fixed y (three for small y).

I. General perturbations
This manuscript is based on a particular type of perturbation, namely an ε-shift of a variable at t = 0. In the local response divergence, since the measurement completely resolves the uncertainty, the perturbation corresponds to a shift of the corresponding delta distribution, δ(x(0)−x) δ(y(0)−y) ⇒ δ(x(0)−x−ε) δ(y(0)−y). In the ensemble response divergence, instead, the perturbation is written p(x, y) ⇒ p(x−ε, y). Note that in both cases we use the information-theoretic cost at the ensemble level, c_x ≡ D[p(x, y)||p(x−ε, y)], since the KL divergence between two different Dirac deltas is not defined.

More generally, a perturbation of x at the ensemble level can be written in the form

$$ p(x|y) \;\Rightarrow\; p(x|y)\,\big[1 + \epsilon\, h_x(x,y)\big] \equiv p^*(x|y), \qquad (46) $$

with ∫ dx p(x|y) h_x(x, y) = 0. The perturbed probability of y_τ is written

$$ p^*(y_\tau) = \int\!\!\int dx\, dy\; p(y_\tau|x,y)\, p(y)\, p^*(x|y) = \int\!\!\int dx\, dy\; p(y_\tau|x,y)\, p(y)\, p(x|y)\,\big[1+\epsilon\, h_x(x,y)\big] = p(y_\tau)\,\big[1 + \epsilon\, \langle h_x(x,y)|y_\tau\rangle\big]. \qquad (47) $$

FIG. 2. Local information response and local transfer entropy in the nonlinear model of Eq. (45) with parameters t_rel = 10, β = 0.…, α = 0.…, D = 0.1, for a timescale τ = 3.

We can define the generalized response divergence as a functional,

$$ \begin{aligned} d^{x\to y}_\tau[h](\epsilon) &\equiv D\big[\, p^*(y_\tau) \,\big\|\, p(y_\tau) \,\big] = \int dy_\tau\; p(y_\tau)\,\big[1+\epsilon\,\langle h_x(x,y)|y_\tau\rangle\big]\, \ln\!\big(1+\epsilon\,\langle h_x(x,y)|y_\tau\rangle\big) \\ &= \epsilon\,\big\langle \langle h_x(x,y)|y_\tau\rangle \big\rangle + \frac{\epsilon^2}{2}\,\big\langle \langle h_x(x,y)|y_\tau\rangle^2 \big\rangle + O(\epsilon^3) = \frac{\epsilon^2}{2}\,\big\langle \langle h_x(x,y)|y_\tau\rangle^2 \big\rangle + O(\epsilon^3), \end{aligned} \qquad (48) $$

where in the last passage we used the iterated conditioning theorem and the normalization of h_x(x, y), ⟨⟨h_x(x, y)|y_τ⟩⟩ = ⟨h_x(x, y)⟩ = ⟨∫ dx p(x|y) h_x(x, y)⟩ = 0. Similarly, the information-theoretic cost is

$$ c_x[h](\epsilon) \equiv D\big[\, p^*(x|y)\, p(y) \,\big\|\, p(x,y) \,\big] = \big\langle D\big[\, p^*(x|y) \,\big\|\, p(x|y) \,\big] \big\rangle = \left\langle \int dx\; p(x|y)\,\big[1+\epsilon\, h_x(x,y)\big]\, \ln\!\big[1+\epsilon\, h_x(x,y)\big] \right\rangle = \frac{\epsilon^2}{2}\,\big\langle h_x(x,y)^2 \big\rangle + O(\epsilon^3). \qquad (49) $$

Then the generalized information response and its corresponding fluctuation-response theorem are written

$$ \tilde\Gamma^{x\to y}_\tau[h] \equiv \lim_{\epsilon\to 0} \frac{d^{x\to y}_\tau[h](\epsilon)}{c_x[h](\epsilon)} = \frac{\big\langle \langle h_x(x,y)|y_\tau\rangle^2 \big\rangle}{\big\langle h_x(x,y)^2 \big\rangle} . \qquad (50) $$
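A minimal Monte Carlo sketch of Eq. (50) (an addition, not in the original SI), for the special tilt h_x(x) = x applied to a correlated Gaussian pair (x, y_τ); in this case the ratio reduces to the squared correlation ρ², which the binned estimate below should recover. The distribution, the tilt, and the binning scheme are all arbitrary choices for illustration.

```python
import numpy as np

# Generalized response of Eq. (50) with the zero-mean linear tilt h_x(x) = x:
# Gamma[h] = <<x|y_tau>^2> / <x^2>, which equals rho^2 for a Gaussian pair.
rng = np.random.default_rng(5)
rho, n = 0.6, 2_000_000
x = rng.normal(size=n)
y_tau = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

# Estimate the conditional expectation <x | y_tau> by binning y_tau into quantiles.
edges = np.quantile(y_tau, np.linspace(0, 1, 201))
idx = np.clip(np.digitize(y_tau, edges[1:-1]), 0, 199)
cond_mean = np.bincount(idx, weights=x, minlength=200) / np.bincount(idx, minlength=200)
weights = np.bincount(idx, minlength=200) / n

gamma_h = np.sum(weights * cond_mean ** 2) / np.mean(x ** 2)
print(gamma_h, rho ** 2)   # close, up to binning and sampling error
```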
In this section we review the classical fluctuation-response theorem [7, 10–13] and the linear fluctuation-responseinequality for the corresponding KL divergence [14], and we motivate the introduction of the information response inthis framework. Let us expand the average response of y τ to the small perturbation x ⇒ x + (cid:15) , for τ > (cid:10) y τ (cid:12)(cid:12) x ⇒ x + (cid:15) (cid:11) = (cid:10)(cid:10) y τ (cid:12)(cid:12) x + (cid:15), y (cid:11)(cid:11) ≡ R R R dx dy dy τ y τ p ( x , y ) p ( y τ (cid:12)(cid:12) x + (cid:15), y )= R R R dx dy dy τ y τ p ( y τ (cid:12)(cid:12) x , y ) p ( x − (cid:15), y )= R R R dx dy dy τ y τ p ( y τ (cid:12)(cid:12) x , y ) (cid:2) p ( x , y ) − (cid:15)∂ x p ( x , y ) + O ( (cid:15) ) (cid:3) = h y τ i − (cid:15) h y τ ∂ x ln p ( x , y ) i + O ( (cid:15) ) . (51) x y e T x y x y e I ( y , y ) (1 e T x y ) FIG. 3. Information response Γ x → yτ and its ensemble counterpart ^ Γ x → yτ as a function of the timescale τ , compared to thecorresponding combination of mutual informations they reduce to in linear systems. The model is the nonlinear Langevinsystem of Eq. (45) with parameters t rel = 10, β = 0 . α = 0 . D = 0 . In the limit (cid:15) → fluctuation-response theorem :lim (cid:15) → (cid:10) y τ (cid:12)(cid:12) x ⇒ x + (cid:15) (cid:11) − h y τ i (cid:15) = − h y τ ∂ x ln p ( x , y ) i , (52)which equates the linear response coefficient to a correlation evaluated in the unperturbed dynamics.For those systems having a symmetry in the correlation function, h y τ ∂ x ln p ( x , y ) i = ± h y − τ ∂ x ln p ( x , y ) i , theWiener-Kintchine theorem applied to Eq. (52) gives the equivalence between subseptibility and cross-spectral density,that applied to Brownian motion gives the celebrated Einstein relation [7].Let us now take the absolute value of both sides in the fluctuation-response theorem (Eq. (52)), apply the iteratedconditioning to the RHS, and then the Cauchy-Schwarz inequality (cid:12)(cid:12) R f ( x ) g ( x ) dx (cid:12)(cid:12) ≤ R (cid:12)(cid:12) f ( x ) (cid:12)(cid:12) dx R (cid:12)(cid:12) g ( x ) (cid:12)(cid:12) dx , toobtain (cid:12)(cid:12)(cid:12)(cid:12) lim (cid:15) → D y τ (cid:12)(cid:12) x ⇒ x + (cid:15) E −h y τ i (cid:15) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) h y τ ∂ x ln p ( x , y ) i (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) (cid:10) y τ (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) (cid:10) ( y τ − h y τ i ) (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) (cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) R dy τ p ( y τ ) ( y τ − h y τ i ) (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11) (cid:12)(cid:12)(cid:12)(cid:12) ≤ qR dy τ p ( y τ ) ( y τ − h y τ i ) R dy τ p ( y τ ) (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11) = r σ y τ D(cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11) E , (53)where we used (cid:10) h y τ i (cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) = h y τ i (cid:10)(cid:10) ∂ x ln p ( x , y ) (cid:12)(cid:12) y τ (cid:11)(cid:11) = h y τ i h ∂ x ln p ( x , y ) i = 0, and identifiedthe variance σ y τ = R dy τ p ( y τ ) ( y τ − h y τ i ) . Using the expressions for the ensemble response divergence of Eq. 
Using the expressions for the ensemble response divergence of Eqs. (29)-(30), namely $\widetilde{d}^{\,x \to y}_\tau(\epsilon) \equiv D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] = \frac{\epsilon^2}{2} \bigl\langle \bigl\langle \partial_x \ln p(x,y) \,\big|\, y_\tau \bigr\rangle^2 \bigr\rangle + O(\epsilon^3)$, we obtain the linear fluctuation-response inequality [14]
\begin{equation}
\bigl| \bigl\langle y_\tau \,\big|\, x \Rightarrow x+\epsilon \bigr\rangle - \langle y_\tau \rangle \bigr| \le \sigma_{y_\tau} \sqrt{ 2 \, D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] } + O(\epsilon^2) , \tag{54}
\end{equation}
which identifies the KL divergence $D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr]$ as the information-theoretic bound on the response of $y_\tau$ relative to its natural fluctuations $\sigma_{y_\tau}$.
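As a worked example of the inequality (54) (our illustration, under the assumption of linear Gaussian dynamics $y_\tau = a x + b y + s \xi$), the perturbation $x \Rightarrow x + \epsilon$ only shifts the mean of $y_\tau$, so that
\[
p(y_\tau \,|\, x \Rightarrow x+\epsilon) = \mathcal{N}\bigl( \langle y_\tau \rangle + a \epsilon , \; \sigma^2_{y_\tau} \bigr) , \qquad
D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] = \frac{a^2 \epsilon^2}{2 \sigma^2_{y_\tau}} ,
\]
and therefore
\[
\sigma_{y_\tau} \sqrt{ 2 \, D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr] } = | a \epsilon | = \bigl| \bigl\langle y_\tau \,\big|\, x \Rightarrow x+\epsilon \bigr\rangle - \langle y_\tau \rangle \bigr| ,
\]
i.e., the bound is saturated for linear Gaussian dynamics; for nonlinear dynamics the perturbation also distorts the shape of $p(y_\tau)$ and the inequality is in general strict.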
The two fundamental results derived above suggest the possibility of a fluctuation-response theorem for KL divergences, which is what we derive in the main text. In particular, starting from the KL divergence $D\bigl[\, p(y_\tau \,|\, x \Rightarrow x+\epsilon) \,\big\|\, p(y_\tau) \,\bigr]$, which describes the response, we define a second KL divergence to quantify the information-theoretic cost of perturbations. We then expand these two KL divergences separately; both are of order $O(\epsilon^2)$ for $\epsilon \to 0$, with the corresponding Taylor coefficients having the form of Fisher information. The resulting linear response coefficient is then a ratio of Fisher information, a relation that we interpret as an information-theoretic fluctuation-response theorem.
Here we sketch the analogy between our result and the classical fluctuation-response theorem:
\[
\lim_{\epsilon \to 0} \frac{\bigl\langle y_\tau \,\big|\, x \to x+\epsilon \bigr\rangle - \langle y_\tau \rangle}{\epsilon} = - \bigl\langle y_\tau \, \partial_x \ln p(x,y) \bigr\rangle .
\]
[Schematic annotation: the terms of both relations are labeled Response, Perturbation, and Correlation.]
\[
\hat{\Gamma}^{x \to y}_\tau \equiv \lim_{\epsilon \to 0} \frac{ D\bigl[\, \bigl\langle p(y_\tau \,|\, x+\epsilon, y) \bigr\rangle \,\big\|\, p(y_\tau) \,\bigr] }{ D\bigl[\, p(x-\epsilon, y) \,\big\|\, p(x,y) \,\bigr] }
= \frac{ \Bigl\langle \bigl\langle \partial_x \ln p(y_\tau \,|\, x, y) \,\big|\, y_\tau \bigr\rangle^2 \Bigr\rangle }{ \bigl\langle \bigl( \partial_x \ln p(x \,|\, y) \bigr)^2 \bigr\rangle }
\overset{\text{linear}}{=} e^{-2 I(y, y_\tau)} \bigl( 1 - e^{-2 T^{x \to y}_\tau} \bigr) ,
\]
where we added the connection between fluctuation-response theory and mutual informations obtained for linear systems (Eq. (21) in the main text).
We outlined the analogy of the classical fluctuation-dissipation theorem with our ensemble information response $\hat{\Gamma}^{x \to y}_\tau$, but in the main text we first focus on the information response $\Gamma^{x \to y}_\tau$, which is the averaged conditional (local) version of $\hat{\Gamma}^{x \to y}_\tau$. While the connection with the original fluctuation-response theorem is loose, the structure of perturbation-response-correlation is analogous,
\[
\Gamma^{x \to y}_\tau \equiv \lim_{\epsilon \to 0} \frac{ \bigl\langle D\bigl[\, p(y_\tau \,|\, x+\epsilon, y) \,\big\|\, p(y_\tau \,|\, x, y) \,\bigr] \bigr\rangle }{ D\bigl[\, p(x,y) \,\big\|\, p(x-\epsilon, y) \,\bigr] }
= \frac{ \bigl\langle \bigl( \partial_x \ln p(y_\tau \,|\, x, y) \bigr)^2 \bigr\rangle }{ \bigl\langle \bigl( \partial_x \ln p(x \,|\, y) \bigr)^2 \bigr\rangle }
\overset{\text{linear}}{=} e^{2 T^{x \to y}_\tau} - 1 .
\]
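For completeness, here is a short consistency check of the linear-system relations quoted above (our explicit parametrization, assuming Gaussian conditionals): take $p(y_\tau \,|\, x, y) = \mathcal{N}(a x + b y, s^2)$ and let $\sigma^2_{x|y}$ denote the conditional variance of $x$ given $y$. Then
\[
\Gamma^{x \to y}_\tau = \frac{ \bigl\langle \bigl( \partial_x \ln p(y_\tau \,|\, x, y) \bigr)^2 \bigr\rangle }{ \bigl\langle \bigl( \partial_x \ln p(x \,|\, y) \bigr)^2 \bigr\rangle } = \frac{a^2 / s^2}{1 / \sigma^2_{x|y}} = \frac{a^2 \sigma^2_{x|y}}{s^2} , \qquad
T^{x \to y}_\tau = \frac{1}{2} \ln \frac{\sigma^2_{y_\tau | y}}{\sigma^2_{y_\tau | x, y}} = \frac{1}{2} \ln \Bigl( 1 + \frac{a^2 \sigma^2_{x|y}}{s^2} \Bigr) ,
\]
so that $\Gamma^{x \to y}_\tau = e^{2 T^{x \to y}_\tau} - 1$. Similarly, $\hat{\Gamma}^{x \to y}_\tau = a^2 \sigma^2_{x|y} / \sigma^2_{y_\tau} = e^{-2 I(y, y_\tau)} \bigl( 1 - e^{-2 T^{x \to y}_\tau} \bigr)$, using $e^{-2 I(y, y_\tau)} = \sigma^2_{y_\tau | y} / \sigma^2_{y_\tau}$ and $\sigma^2_{y_\tau | y} = a^2 \sigma^2_{x|y} + s^2$.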
K. Application to data science

In the main text, we present our results in relation to the current literature in theoretical fields such as fluctuation-response theory, information theory, and nonequilibrium thermodynamics. Here we also motivate our study in relation to current trends in data science.
The accuracy of predictions is one of the main goals in statistics and applied physics. In general, predictions can be obtained from mechanistic models, where physical intuition plays a role in selecting the relevant observables and characterizing their interactions [15], or from machine-learning approaches, where the wide availability of (labeled) data enables high-dimensional computing architectures to be trained for pattern recognition [16]. In the latter case, predictions are not explainable in terms of intuitive mechanisms or geometrical relations. In other words, the ability to make predictions does not imply understanding [17, 18].
With the aim of explainability, a helpful representation of the dynamics is given by causal networks [9, 19], where weighted directed links between nodes represent the propagation of perturbations between variables in the network, or the information flow. Causal networks are coarse-grained representations of the dynamics and its interactions, limited to a set of scalars quantifying how much a variable influences the dynamics of the other variables, with the dynamics observed over a timescale $\tau$ (possibly tunable). As an example, the simplest way to define a causal network is the correlation matrix, which, however, is not always the most appropriate choice.
To quantify this degree of causation we motivate the use of our information response, defined as the ratio of the change of a prediction over the change of a predictor, both evaluated as KL divergences. It has the form of an information-theoretic fluctuation-response theorem, and therefore it has both the invariance properties from information theory and the physical interpretation of a propagation of perturbations. While the present setting is limited to dyadic relations between variables, a generalization in terms of simultaneous perturbations of multiple variables is possible, and will be discussed in a future manuscript.
Once a particular definition of causation is chosen, determining and quantifying the strength of causal links becomes a problem of statistical estimation, which is the subject of causal inference [20, 21]. In this manuscript we are interested in the former problem, i.e., defining a quantitative measure of causation.
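As a minimal illustration of this estimation problem, the following Python sketch (our illustration under a linear-Gaussian assumption, not an implementation from the main text) builds a simple causal network from multivariate time series: for each ordered pair of variables it estimates the Gaussian transfer entropy at lag $\tau$ from residual variances of linear regressions, and assigns the link weight $\Gamma = e^{2T} - 1$ of the linear-system relation discussed above. All function names, the toy data, and the parameter values are placeholders.

```python
# Minimal sketch (linear-Gaussian assumption, not from the paper): a causal
# network whose directed link weights are Gamma[i, j] = exp(2*T_{i->j}) - 1,
# with T the Gaussian transfer entropy at lag tau estimated from regressions.
import numpy as np

def gaussian_transfer_entropy(x, y, tau):
    """T_{x->y} at lag tau for scalar time series, under a Gaussian assumption."""
    y_future, y_past, x_past = y[tau:], y[:-tau], x[:-tau]
    # residual variance of y_future given y_past only
    r1 = y_future - np.polyval(np.polyfit(y_past, y_future, 1), y_past)
    # residual variance of y_future given (x_past, y_past)
    A = np.column_stack([x_past, y_past, np.ones_like(y_past)])
    coef, *_ = np.linalg.lstsq(A, y_future, rcond=None)
    r2 = y_future - A @ coef
    return 0.5 * np.log(r1.var() / r2.var())

def causal_network(series, tau=1):
    """Matrix of link weights Gamma[i, j] for the columns of `series`."""
    n = series.shape[1]
    gamma = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                t_ij = gaussian_transfer_entropy(series[:, i], series[:, j], tau)
                gamma[i, j] = np.exp(2.0 * t_ij) - 1.0
    return gamma

# toy usage: series 0 drives series 1 with a one-step delay, series 2 is noise
rng = np.random.default_rng(3)
T_len = 20_000
x = rng.standard_normal(T_len)
y = np.zeros(T_len)
for t in range(1, T_len):
    y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + 0.5 * rng.standard_normal()
z = rng.standard_normal(T_len)
print(causal_network(np.column_stack([x, y, z]), tau=1).round(3))
```

For this toy data only the link from the first to the second series is appreciably different from zero; nonlinear dependencies would require nonparametric estimators of the divergences, which is the estimation problem addressed by causal inference.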
[1] S. E. Shreve, Stochastic calculus for finance II: Continuous-time models, Vol. 11 (Springer Science & Business Media, 2004).
[2] I. Karatzas and S. E. Shreve, in Brownian Motion and Stochastic Calculus (Springer, 1998), pp. 47–127.
[3] R. Kubo, M. Toda, and N. Hashitsume, Statistical physics II: Nonequilibrium statistical mechanics, Vol. 31 (Springer Science & Business Media, 2012).
[4] J. M. Horowitz and H. Sandberg, New Journal of Physics 16, 125007 (2014).
[5] S.-i. Amari, Information geometry and its applications, Vol. 194 (Springer, 2016).
[6] L. Barnett, A. B. Barrett, and A. K. Seth, Physical Review Letters 103, 238701 (2009).
[7] H. Risken, in