Causal blankets: Theory and algorithmic framework
Fernando E. Rosas, Pedro A.M. Mediano, Martin Biehl, Shamil Chandaria, Daniel Polani
Centre for Psychedelic Research, Imperial College London, London SW7 2DD, UK
Data Science Institute, Imperial College London, London SW7 2AZ, UK
Centre for Complexity Science, Imperial College London, London SW7 2AZ, UK
Department of Psychology, University of Cambridge, Cambridge CB2 3EB, UK
Araya Inc., Tokyo 107-6024, Japan
Institute of Philosophy, School of Advanced Study, University of London, UK
Dept. of Computer Science, University of Hertfordshire, Hatfield AL10 9AB, UK
Abstract.
We introduce a novel framework to identify perception-action loops (PALOs) directly from data based on the principles of computational mechanics. Our approach is based on the notion of causal blanket, which captures sensory and active variables as dynamical sufficient statistics — i.e. as the "differences that make a difference." Furthermore, our theory provides a broadly applicable procedure to construct PALOs that requires neither a steady state nor Markovian dynamics. Using our theory, we show that every bipartite stochastic process has a causal blanket, but the extent to which this leads to an effective PALO formulation varies depending on the integrated information of the bipartition.
Keywords:
Perception-action loops · Computational mechanics · Integrated information · Stochastic processes
The perception-action loop (PALO) is one of the most important constructs of cognitive science, and plays a fundamental role in many other disciplines including reinforcement learning and computational neuroscience. Despite its importance and pervasiveness, fundamental questions about what kind of systems can be properly described by a PALO are still to a large extent unanswered. The aim of this paper is to introduce a framework that allows us to identify PALOs directly from data, which complements existing approaches and serves to deepen our understanding of the essential elements that make a PALO.
F.R. was supported by the Ad Astra Chandaria foundation. P.M. was funded by the Wellcome Trust (grant no. 210920/Z/18/Z). M.B. was supported by a grant from Templeton World Charity Foundation, Inc. (TWCF). The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of TWCF.
One of the most encompassing accounts of PALOs can be found in the Free Energy Principle (FEP) literature, which formalises them via Markov blankets (MBs) [14]. An interesting contribution of this literature is to characterise "sensory" (S) and "active" (A) variables as having two defining properties: (i) they mediate the interactions between internal variables of the agent (M) and external variables of its environment (E), and (ii) they impose a specific causal structure on these interactions — e.g. sensory variables may affect internal variables, but are not (directly) affected by them [14].

Formally, MBs were originally introduced by Pearl [21] for Markov and Bayesian networks. Within the FEP literature, MBs are usually employed in multivariate stochastic processes with ergodic Markovian dynamics, whose steady-state distribution p^* is required to satisfy [20]

    p^*(e_t, m_t | s_t, a_t) = p^*(e_t | s_t, a_t) p^*(m_t | s_t, a_t).    (1)

However, Eq. (1) does not suffice to guarantee a PALO structure, as noted in Ref. [7]. In effect, the MB condition is insufficient to establish requirement (ii): its symmetry with respect to internal and external variables makes it impossible to infer the direction of the loop; additionally, the fact that the condition holds across variables synchronously makes it unsuitable to guarantee a causal relationship [22]. Recent reports [11] acknowledge that this synchronous condition needs to be complemented with additional diachronic restrictions on the system's dynamics, which can be written, for instance, as a set of coupled stochastic differential equations of the form

    dm_t/dt = f_in(m_t, a_t, s_t) + ω^in_t,    da_t/dt = f_a(m_t, a_t, s_t) + ω^a_t,
    de_t/dt = f_ex(e_t, a_t, s_t) + ω^ex_t,    ds_t/dt = f_s(e_t, a_t, s_t) + ω^s_t.    (2)

Above, the functions f_in, f_a, f_ex, f_s determine the flow, and ω^in_t, ω^a_t, ω^ex_t, ω^s_t denote additive Gaussian noise. Interestingly, it has been shown that Eq. (2) implies Eq.
(1) under additional assumptions: either block-diagonality conditions over the solenoidal flow [11], or strong dissipation [12, Appendix]. Note, however, that in the general case neither Eq. (1) nor Eq. (2) implies the other [7] — hence they need to be taken as complementary conditions. With this caveat, PALOs could be interpreted as coupled stochastic dynamical systems of the form in Eq. (2), as long as the flow satisfies either of the two mentioned conditions.

Fig. 1. Two visualisations of PALOs in the FEP literature, either based on (a) Markov blankets according to Eq. (1), or (b) Langevin dynamics following Eq. (2).

Despite its elegance, this formalisation of PALOs has important limitations. First, it relies strongly on Langevin dynamics, making it difficult to extend to PALOs appearing in discrete systems. Secondly, it depends on a set of assumptions — for one, the aforementioned conditions over the flow and the restriction to systems in their steady state — that might be too restrictive for some scenarios of interest. Finally, and perhaps most importantly, Eq. (1) forces all interactions between M_t and E_t to be accounted for by (S_t, A_t), which imposes — due to the data processing inequality [9] — an information bottleneck of the form I(M_t; E_t) ≤ I(M_t; A_t, S_t). Therefore, the MB formalism forbids interdependencies induced by past events that are kept in memory but do not directly influence the present state of the blanket. This information kept in memory arguably plays an important role in many PALOs, and includes uncontroversial features of cognition (such as old memories that an agent retains which are neither caused by a present sensation nor causing a present action), yet such features are forbidden by MBs.
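The bottleneck implied by Eq. (1) can be checked numerically. The following sketch (a toy distribution of our own construction; all dimensions and names are illustrative) builds a joint p(m, s, a, e) satisfying the Markov-blanket factorisation and verifies that I(M; E) ≤ I(M; S, A):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Markov-blanket distribution (illustrative, not from the manuscript):
# p(m, s, a, e) = p(s, a) p(m | s, a) p(e | s, a), so M and E are
# conditionally independent given the blanket (S, A).
nm, ns, na, ne = 3, 2, 2, 3
p_sa = rng.random((ns, na)); p_sa /= p_sa.sum()
p_m = rng.random((nm, ns, na)); p_m /= p_m.sum(axis=0, keepdims=True)  # p(m | s, a)
p_e = rng.random((ne, ns, na)); p_e /= p_e.sum(axis=0, keepdims=True)  # p(e | s, a)
p = np.einsum('sa,msa,esa->msae', p_sa, p_m, p_e)                      # p[m, s, a, e]

def mutual_info(pxy):
    """I(X; Y) in bits from a 2-D joint distribution table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px * py)[mask])).sum())

i_me = mutual_info(p.sum(axis=(1, 2)))                   # I(M; E)
i_msa = mutual_info(p.sum(axis=3).reshape(nm, ns * na))  # I(M; S, A)
print(i_me <= i_msa + 1e-12)  # True: the data processing inequality holds
```

Any distribution of this factorised form makes M–(S, A)–E a Markov chain, so the inequality is guaranteed regardless of the random draws.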
Computational mechanics is a method for studying patterns and statistical regularities observed in stochastic processes by uncovering their hidden causal structure [24,25]. A key insight is that an optimal, minimal representation of a process can be revealed by grouping past trajectories according to their forecasting abilities into so-called causal states. More precisely, the causal states of a (possibly non-Markovian) time series {Z_t}_{t∈Z} are the equivalence classes of past trajectories z_{≤t} := (…, z_{t−1}, z_t) given by the relationship

    z_{≤t} ≡_ε z'_{≤t}  iff  p(z_{t+1} | z_{≤t}) = p(z_{t+1} | z'_{≤t}) ∀ z_{t+1}.

It can be shown that the causal states are the coarsest coarse-graining of past trajectories that retains full predictive power over future variables [10,13]. Moreover, the corresponding process over causal states always has Markovian dynamics, providing the simplest, yet encompassing, representation of the system's information dynamics on a latent space — known as the epsilon-machine.

Please note that the causal states of a system are guaranteed to provide counterfactual relationships [22] only if the system at hand is fully observed. (We thank Nathaniel Virgo for first noting this issue.) In the case of partially observed scenarios, causal states ought to be understood in the Granger sense, i.e. as states of maximal non-mediated predictive ability [8].
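The grouping of pasts into causal states can be sketched in a few lines. The example below is our own illustration: it assumes finite histories of length k and exact equality of empirical predictive distributions, which is only reliable for deterministic sequences (real data would require a statistical test of equality):

```python
from collections import Counter, defaultdict

def causal_states(seq, k):
    """Group length-k histories of a symbolic sequence into (empirical) causal
    states: histories sharing the same predictive distribution p(next | history)."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[tuple(seq[i:i + k])][seq[i + k]] += 1
    states = defaultdict(list)
    for hist, cnt in counts.items():
        total = sum(cnt.values())
        # The predictive distribution acts as the history's "signature"
        signature = tuple(sorted((sym, n / total) for sym, n in cnt.items()))
        states[signature].append(hist)
    return list(states.values())

# Period-3 sequence 0,0,1,0,0,1,...: histories (0,1) and (1,0) both predict the
# next symbol 0 with certainty, so they merge into a single causal state.
states = causal_states([0, 0, 1] * 40, k=2)
print(sorted(len(s) for s in states))  # [1, 2]: two causal states from three histories
```

The merged classes are exactly the "differences that make a difference": histories are distinguished only insofar as they change the forecast.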
In this paper we introduce an operationalisation of PALOs based on causal blankets (CBs), a construction based on a novel definition of dynamical statistical sufficiency. CBs capture properties (i) and (ii) in a single mathematical construction by applying informational constructs directly to dynamical conditions. Moreover, CBs can be constructed with great generality for any bipartite system without imposing further conditions, and hence can be applied to non-ergodic, non-Markovian stochastic processes. This generality allows us to explore novel connections between PALOs and integrated information. In the rest of the manuscript, we:
1) provide a rigorous definition of CBs (Definition 2); and
2) show that every agent-environment partition has a CB, and thus can be described as a PALO (Proposition 1); although
3) not all systems are equally well described as a PALO, and this can be quantified via information geometry and integrated information (Sec. 3) — providing a principled measure to distinguish preferable candidates for PALOs.

We consider the perspective of a scientist who repeatedly measures a system composed of two interacting parts X_t and Y_t. We assume that, from these observations, a reliable statistical model of the corresponding discrete-time stochastic process can be built — of which all the resulting marginal and conditional distributions are well-defined. Random variables are denoted by capital letters (e.g. X, Y) and their realisations by lower-case letters (e.g. x, y); stochastic processes at discrete times (i.e. time series) are represented as bold letters without subscript, X = {X_t}_{t∈Z}, and X_{≤t} := (…, X_{t−1}, X_t) denotes the infinite past of X up to and including t.

Given two random variables X and Y, a statistic U = f(X) is said to be Bayesian sufficient of X w.r.t. Y if X ⊥⊥ Y | U, which implies that all the common variability between X and Y is accounted for by U [9].
The first step in our construction is to introduce a dynamical version of statistical sufficiency.

Definition 1 (D-BaSS).
Given two stochastic processes X, Y, a process U is a dynamical Bayesian sufficient statistic (D-BaSS) of X w.r.t. Y if, for all t ∈ Z, the following conditions hold:
i. Precedence: there exists a function F(·) such that U_t = F(X_{≤t}) for all t ∈ Z.
ii. Sufficiency: Y_{t+1} ⊥⊥ X_{≤t} | (U_t, Y_{≤t}).
Moreover, a stochastic process M is a minimal D-BaSS of X with respect to Y if it is itself a D-BaSS and for any D-BaSS U there exists a function f(·) such that f(U_t) = M_t, ∀ t ∈ Z.

The proofs of our results can be found in the Appendix.
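The sufficiency condition can be verified numerically for small systems. The sketch below is a toy of our own construction, restricted to one-step (Markov) histories for simplicity, whereas the definition conditions on full pasts; it builds a system in which the next Y depends on X only through the lumping U = X mod 2, and checks that conditioning on U instead of X loses no predictive information:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical one-step system: X in {0,1,2,3}, Y in {0,1}. By construction
# Y_{t+1} depends on X_t only through U_t = X_t mod 2, so U satisfies the
# sufficiency condition Y_{t+1} _||_ X_t | (U_t, Y_t).
nx, ny = 4, 2
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()  # distribution of (X_t, Y_t)
p_next = rng.random((2, ny, ny))
p_next /= p_next.sum(axis=2, keepdims=True)      # p(y' | u, y)

# Joint table p[x, y, y'] with p(y' | x, y) = p(y' | x mod 2, y)
p = np.array([[[p_xy[x, y] * p_next[x % 2, y, yn] for yn in range(ny)]
               for y in range(ny)] for x in range(nx)])

def cond_mi(p3):
    """I(A; C | B) in bits from a joint table p3[a, b, c]."""
    pb, pab, pbc = p3.sum(axis=(0, 2)), p3.sum(axis=2), p3.sum(axis=0)
    total = 0.0
    for a, b, c in np.ndindex(*p3.shape):
        if p3[a, b, c] > 0:
            total += p3[a, b, c] * np.log2(
                p3[a, b, c] * pb[b] / (pab[a, b] * pbc[b, c]))
    return total

te_x = cond_mi(p)                                           # I(X_t; Y_{t+1} | Y_t)
p_u = np.stack([p[0::2].sum(axis=0), p[1::2].sum(axis=0)])  # lump X into U = X mod 2
te_u = cond_mi(p_u)                                         # I(U_t; Y_{t+1} | Y_t)
print(abs(te_x - te_u) < 1e-10)  # True: the lumping loses no predictive information
```

This equality of conditional mutual informations is exactly the transfer-entropy identity established in Lemma 1 below.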
The first condition above states that U is no more than a simpler, coarse-grained representation of X, and the second implies that the influence of X_{≤t} on Y_{t+1} given Y_{≤t} is fully mediated by U_t. This has interesting consequences for transfer entropy, as seen in the next lemma.

Lemma 1. If U is a D-BaSS of X w.r.t. Y, then

    TE(X → Y)_t := I(X_{≤t}; Y_{t+1} | Y_{≤t}) = I(U_t; Y_{t+1} | Y_{≤t}).    (3)

There are many such D-BaSSs; e.g. U_t = X_{≤t} would be one valid D-BaSS of X w.r.t. Y. However, Theorem 1 shows that minimal D-BaSSs are unique (up to bijective transformations).

Theorem 1 (Existence and uniqueness of the minimal D-BaSS).
Given stochastic processes X, Y, the minimal D-BaSS of X w.r.t. Y corresponds to the partition of past trajectories x_{≤t} induced by the following equivalence relationship:

    x_{≤t} ≡_p x'_{≤t}  iff  p(y_{t+1} | x_{≤t}, y_{≤t}) = p(y_{t+1} | x'_{≤t}, y_{≤t}) ∀ y_{≤t}, y_{t+1}.

Therefore, the minimal D-BaSS is always well-defined, and is unique up to an isomorphism.
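For a finite system with a known one-step (Markov) kernel, the past collapses to the current x value and the equivalence relation above can be evaluated directly. The sketch below (a hypothetical kernel of our own construction) partitions x-states by their predictive signature:

```python
from collections import defaultdict
from itertools import product

def minimal_dbass_partition(kernel, xs, ys, decimals=12):
    """Partition x-states by their predictive kernel p(y' | x, y): states with
    identical signatures are equivalent (x ≡_p x') and form one class of the
    minimal D-BaSS. One-step Markov sketch: pasts collapse to current x."""
    classes = defaultdict(list)
    for x in xs:
        signature = tuple(round(kernel(x, y, yn), decimals)
                          for y, yn in product(ys, ys))
        classes[signature].append(x)
    return list(classes.values())

# Hypothetical kernel: the next y depends on x only via x mod 2
kernel = lambda x, y, yn: 0.9 if yn == (x + y) % 2 else 0.1

parts = minimal_dbass_partition(kernel, xs=range(4), ys=(0, 1))
print(sorted(sorted(c) for c in parts))  # [[0, 2], [1, 3]]: the blanket variable is x mod 2
```

The rounding tolerance is a pragmatic concession: with estimated kernels, exact equality of distributions would have to be replaced by a statistical test.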
This result shows that D-BaSSs can be built irrespective of any other possibly latent influences on X and Y, as the construction is defined purely on the joint statistics of these two processes. Moreover, Theorem 1 provides a recipe to build a D-BaSS: group together all the past trajectories that lead to the same predictions, which is a key principle of computational mechanics [10,13,24,25]. Therefore, a minimal D-BaSS distinguishes only "differences that make a difference" for the future dynamics, generalising the construction presented in Ref. [6, Definition 1] for Markovian dynamical systems, and being closely related to the notion of sensory equivalence presented in Ref. [3]. With these ideas at hand, we can formulate our definition of causal blanket.

Definition 2 (Causal blanket).
Given two stochastic processes X, Y, a reciprocal D-BaSS (ReD-BaSS) is a stochastic process R which satisfies:
i. Joint precedence: R_t = F(X_{≤t}, Y_{≤t}) for some function F(·).
ii. Reciprocal sufficiency: R is a D-BaSS of X w.r.t. Y, and also a D-BaSS of Y w.r.t. X.
A causal blanket (CB) is a minimal ReD-BaSS: a time series M, itself a ReD-BaSS, such that for all ReD-BaSSs R there exists a function f(·) such that M_t = f(R_t), ∀ t ∈ Z.

This definition satisfies the two key desiderata discussed in Section 1.1: (i) a CB mediates the interactions that take place between X and Y, and (ii) it assesses causality by focusing on statistical relationships between past and future. From this perspective, CBs are the "informational layer" that causally decouples the agent's and environment's temporal evolution from each other (see Proposition 2). Additionally, our next result guarantees that CBs always exist, and are unique to each bipartite system.

Fig. 2.
Causal blanket {S, A}, which acts as a sufficient statistic mediating the interactions between X and Y.

Proposition 1.
Given X, Y, their CB always exists and is unique (up to an isomorphism). Moreover, their CB is isomorphic to a pair {S, A}, where A is a minimal D-BaSS of X w.r.t. Y, and S is a minimal D-BaSS of Y w.r.t. X.

Proposition 1 has two important consequences: it guarantees that CBs always exist, and that they naturally resemble a PALO — as visualised in Fig. 2. Please note that this type of PALO formalisation has a rich history, being studied in Refs. [4,5], with variations considered in Refs. [15,16,26]. In contrast, our framework follows Refs. [3,6] and does not assume active and sensory variables as given, but discovers them directly from the data. As a matter of fact, the "sensory" (S) and "active" (A) variables of CBs correspond (due to Definition 2) to minimal sufficient statistics that mediate the interdependencies between the past and future of X and Y. The construction of CBs imposes no requirements on the system's statistics or its structure beyond the bipartition, holding also for non-ergodic and non-stationary systems, and for systems with non-Markovian dynamics.

It is also possible to build internal and external states M_t, E_t such that (M_t, A_t) = X_t and (E_t, S_t) = Y_t with great generality. This can be done via an orthogonal completion of the phase space; the details of this procedure will be made explicit in a future publication. In this way, CBs can be thought of as suggesting implicit "equations of motion" somewhat equivalent to Eq. (2), as shown in Figure 2. However, it is important to remark that this representation does not provide counterfactual guarantees for partially observed systems (see Section 1.2).

Example 1.
Consider a multivariate stochastic process M, A, E, S whose dynamics follow

    M_{t+1} = f_in(M_t, A_t, S_t) + N^in_t,    A_{t+1} = f_a(M_t, A_t, S_t) + N^a_t,
    E_{t+1} = f_ex(E_t, A_t, S_t) + N^ex_t,    S_{t+1} = f_s(E_t, A_t, S_t) + N^s_t,    (4)

with N^in_t, N^a_t, N^ex_t, N^s_t being independent of M_t, A_t, E_t, S_t (note that Eq. (4) corresponds to a discrete-time version of Eq. (2)). Then, by defining X_t = (M_t, A_t) and Y_t = (E_t, S_t), one can show using Definition 2 that {S, A} is the CB of X, Y — as long as the partial derivatives of f_in, f_a, f_ex, f_s with respect to their corresponding arguments are nonzero.

According to Def. 2, CBs don't depend on the joint distribution p(x_{t+1}, y_{t+1} | x_{≤t}, y_{≤t}), but only on the marginals p(x_{t+1} | x_{≤t}, y_{≤t}) and p(y_{t+1} | x_{≤t}, y_{≤t}). Here we study how meaningful the CB (and the description of the system as a PALO) is when the joint process's dynamics differ from the product of these two marginals.

Let us start by introducing the synergistic coefficient ξ_t ∈ R, which is a random variable given by

    ξ_t := log [ p(X_{t+1}, Y_{t+1} | X_{≤t}, Y_{≤t}) / ( p(X_{t+1} | X_{≤t}, Y_{≤t}) p(Y_{t+1} | X_{≤t}, Y_{≤t}) ) ].    (5)

A process (X, Y) is said to have factorisable dynamics if ξ_t = 0 a.s. for all t ∈ Z.

Proposition 2 (Conditional independence of trajectories). If R is a ReD-BaSS and the dynamics of X, Y are factorisable, then X ⊥⊥ Y | R. Thus, such a system is perfectly described as a PALO, and R is an MB (in Pearl's sense).

A direct consequence of this proposition is that a ReD-BaSS does not guarantee statistical independence of X, Y at the trajectory level in non-factorisable systems. Therefore, in such systems there are interactions between X and Y that are not mediated by the CB. Please note that this is not a weakness of the CB construction — which is optimal in capturing all the directed influences, as shown in Proposition 1.
Instead, this result suggests that non-factorisable systems might not be well-suited to be described as a PALO.

To further understand this, let us explore the integrated information in the system (X, Y) using information geometry [19]. For this, consider the manifolds

    M_1 = { q_t : q(x_{t+1}, y_{t+1} | x_{≤t}, y_{≤t}) = q(x_{t+1} | x_{≤t}, y_{≤t}) q(y_{t+1} | x_{≤t}, y_{≤t}) },
    M_2 = { q_t : q(x_{t+1}, y_{t+1} | x_{≤t}, y_{≤t}) = q(x_{t+1} | x_{≤t}) q(y_{t+1} | y_{≤t}) }.

Manifold M_1 corresponds to all systems with factorisable dynamics, and M_2 to all systems where the dynamics of agent and environment are fully decoupled. The information-geometric projection of an arbitrary system p_t onto M_2,

    φ̃_t := min_{q_t ∈ M_2} D(p_t || q_t),    (6)

has been proposed as a measure of integrated information [2,18]. Using the Pythagorean theorem [1] together with the fact that M_2 ⊂ M_1, one can decompose φ̃_t as

    φ̃_t = E{ξ_t} + [ TE(A → Y)_t + TE(S → X)_t ],    (7)

where the left-hand side equals D(p_t || q^(2)_t), the first term equals D(p_t || q^(1)_t), the bracketed transfer terms equal D(q^(1)_t || q^(2)_t), and q^(k)_t := argmin_{q_t ∈ M_k} D(p_t || q_t). (Note that in non-ergodic scenarios the expected values are not calculated over individual trajectories, but over the ensemble statistics that define the probability.)
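The synergistic term E{ξ_t} of this decomposition can be computed directly for small systems. The following is a minimal numerical sketch (binary one-step toy dynamics of our own construction, not from the manuscript) contrasting factorisable dynamics, where E{ξ_t} = 0, with a shared-noise system whose two next states are driven by a single coin, yielding one bit of synergy:

```python
import numpy as np

def expected_xi(p):
    """E[xi_t] = I(X_{t+1}; Y_{t+1} | X_t, Y_t) in bits, from a one-step joint
    table p[x, y, xn, yn] (histories truncated to the current state)."""
    p_cur = p.sum(axis=(2, 3), keepdims=True)  # p(x, y)
    p_xn = p.sum(axis=3, keepdims=True)        # p(x, y, xn)
    p_yn = p.sum(axis=2, keepdims=True)        # p(x, y, yn)
    ratio = np.divide(p * p_cur, p_xn * p_yn, out=np.ones_like(p), where=p > 0)
    return float((p * np.log2(ratio)).sum())

uniform = np.full((2, 2), 0.25)
half = np.array([0.5, 0.5])
# Factorisable dynamics: next states are independent coin flips given the present
fact = np.einsum('xy,n,m->xynm', uniform, half, half)
# Synergistic dynamics: one shared coin drives both next states (x' = y')
syn = np.einsum('xy,nm->xynm', uniform, np.array([[0.5, 0.0], [0.0, 0.5]]))
print(expected_xi(fact), expected_xi(syn))  # 0.0 1.0
```

In the synergistic case both transfer entropies vanish while E{ξ_t} = 1 bit, so the entire integrated information is of the kind a CB cannot mediate.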
This decomposition confirms previous results showing that integrated information is a construct that combines low-order transfer and high-order synergies [17]. Thanks to Lemma 1, Eq. (7) states that the transfer component of φ̃_t (i.e. D(q^(1)_t || q^(2)_t)) is what is properly mediated by the CB. In contrast, the part of φ̃_t related to high-order statistics, i.e. E{ξ_t} = I(X_{t+1}; Y_{t+1} | X_{≤t}, Y_{≤t}), is not accounted for by the CB. This last term can either reflect spurious synchronous correlations (due e.g. to sub-sampling), or be due to synergistic dynamics that are a signature of emergent phenomena [23].

In summary, our results suggest that the dynamics of a system (X, Y) that is too synergistically integrated are poorly represented as a PALO, even if the CB formally still exists. Additionally, the synergistic component of integrated information can be used as a measure of this mismatch.

This manuscript introduced a data-driven method to build PALOs leveraging principles of computational mechanics. Our construction provides an informational interpretation of sensory and actuation variables: sensory (resp. active) variables encode all the changes from "outside" (resp. "inside") that affect the future evolution of the "inside" (resp. "outside"). Our framework is broadly applicable, depending only on the underlying bipartition while not imposing any further conditions on the system's dynamics or distribution. Furthermore, we illustrated how this construction allows one to relate — within a PALO framework — the separation of a system from its environment to the integrated information encompassing the two.

It is to be noted that the CB construction relies on discrete time, which, while being immediately applicable to digitally sampled data, might not be natural in some scenarios. Also, CB theory at this stage does not provide explicit links with probabilistic inference. As shown in Example 1, CBs provide a natural extension of Eq.
(2) to the discrete-time case, so one possibility would be to combine them with the MB condition in Eq. (1). The exploration of such "causal Markov blankets," which would satisfy both Eq. (1) and Definition 2, is an interesting avenue for future research.

It is our hope that the CB construction may enrich the toolbox of researchers studying PALOs and help to further illuminate our understanding of the nature of agency.
References
1. Amari, S.i., Nagaoka, H.: Methods of Information Geometry, vol. 191. American Mathematical Soc. (2007)
2. Ay, N.: Information geometry on complexity and stochastic interaction. Entropy (4), 2432–2458 (2015)
3. Ay, N., Löhr, W.: The Umwelt of an embodied agent: A measure-theoretic definition. Theory in Biosciences (3-4), 105–116 (2015)
4. Bertschinger, N., Olbrich, E., Ay, N., Jost, J.: Information and closure in systems theory. In: Explorations in the Complexity of Possible Life. Proceedings of the 7th German Workshop of Artificial Life, pp. 9–21 (2006)
5. Bertschinger, N., Olbrich, E., Ay, N., Jost, J.: Autonomy: An information theoretic perspective. Biosystems (2), 331–345 (2008)
6. Biehl, M., Polani, D.: Action and perception for spatiotemporal patterns. In: Artificial Life Conference Proceedings 14, pp. 68–75. MIT Press (2017)
7. Biehl, M., Pollock, F.A., Kanai, R.: A technical critique of the free energy principle as presented in "Life as we know it". arXiv:2001.06408 (2020)
8. Bressler, S.L., Seth, A.K.: Wiener–Granger causality: A well established methodology. NeuroImage (2), 323–329 (2011)
9. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons (2012)
10. Crutchfield, J.P., Young, K.: Inferring statistical complexity. Physical Review Letters (2), 105 (1989)
11. Friston, K., Da Costa, L., Parr, T.: Some interesting observations on the free energy principle. arXiv:2002.04501 (2020)
12. Friston, K.J., Fagerholm, E.D., Zarghami, T.S., Parr, T., Hipólito, I., Magrou, L., Razi, A.: Parcels and particles: Markov blankets in the brain. arXiv:2007.09704 (2020)
13. Grassberger, P.: Toward a quantitative theory of self-generated complexity. International Journal of Theoretical Physics (9), 907–938 (1986)
14. Kirchhoff, M., Parr, T., Palacios, E., Friston, K., Kiverstein, J.: The Markov blankets of life: Autonomy, active inference and the free energy principle. Journal of The Royal Society Interface (138), 20170792 (2018)
15.
Klyubin, A.S., Polani, D., Nehaniv, C.L.: Organization of the information flow in the perception-action loop of evolved agents. In: Proceedings of the 2004 NASA/DoD Conference on Evolvable Hardware, pp. 177–180. IEEE (2004)
16. Klyubin, A.S., Polani, D., Nehaniv, C.L.: Representations of space and time in the maximization of information flow in the perception-action loop. Neural Computation (9), 2387–2432 (2007)
17. Mediano, P.A., Rosas, F., Carhart-Harris, R.L., Seth, A.K., Barrett, A.B.: Beyond integrated information: A taxonomy of information dynamics phenomena. arXiv:1909.02297 (2019)
18. Mediano, P.A., Seth, A.K., Barrett, A.B.: Measuring integrated information: Comparison of candidate measures in theory and simulation. Entropy (1), 17 (2019)
19. Oizumi, M., Tsuchiya, N., Amari, S.i.: Unified framework for information integration based on information geometry. Proceedings of the National Academy of Sciences (51), 14817–14822 (2016)
20. Parr, T., Da Costa, L., Friston, K.: Markov blankets, information geometry and stochastic thermodynamics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences (2164), 20190159 (2020)
21. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann (1988)
22. Pearl, J.: Causality. Cambridge University Press (2009)
23. Rosas, F.E., Mediano, P.A., Jensen, H.J., Seth, A.K., Barrett, A.B., Carhart-Harris, R.L., Bor, D.: Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data. arXiv:2004.08220 (2020)
24. Shalizi, C.R., Crutchfield, J.P.: Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics (3-4), 817–879 (2001)
25. Shalizi, C.R.: Causal Architecture, Complexity, and Self-Organization in Time Series and Cellular Automata. PhD thesis, University of Wisconsin–Madison (2001)
26.
Tishby, N., Polani, D.: Information theory of decisions and actions. In: Perception-Action Cycle, pp. 601–636. Springer (2011)

A Proofs
Proof (Lemma 1).
Let us consider U to be a D-BaSS of X w.r.t. Y. Then, property (ii) of a D-BaSS is equivalent to

    I(X_{≤t}; Y_{t+1} | U_t, Y_{≤t}) = 0.    (8)

Using this, one can verify that

    I(X_{≤t}; Y_{t+1} | Y_{≤t}) = I(U_t, X_{≤t}; Y_{t+1} | Y_{≤t}) = I(U_t; Y_{t+1} | Y_{≤t}).

Here, the first equality holds because U_t is a deterministic function of X_{≤t}, and the second follows from an application of the chain rule and Eq. (8).

Proof (Theorem 1).
Consider the function F(·) that maps each x_{≤t} to its corresponding equivalence class F(x_{≤t}) established by the equivalence relationship ≡_p, and define M_t = F(X_{≤t}). As this construction satisfies the requirement of precedence in Def. 1, let us show the sufficiency of M. By definition of M_t, it is clear that if m_t = F(x_{≤t}) then p(y_{t+1} | x_{≤t}, y_{≤t}) = p(y_{t+1} | m_t, y_{≤t}), which implies that H(Y_{t+1} | X_{≤t}, Y_{≤t}) = H(Y_{t+1} | M_t, Y_{≤t}). As a consequence,

    I(X_{≤t}; Y_{t+1} | Y_{≤t}) = H(Y_{t+1} | Y_{≤t}) − H(Y_{t+1} | X_{≤t}, Y_{≤t})
                                = H(Y_{t+1} | Y_{≤t}) − H(Y_{t+1} | M_t, Y_{≤t})
                                = I(M_t; Y_{t+1} | Y_{≤t}).    (9)

From this, sufficiency follows from noticing that

    I(X_{≤t}; Y_{t+1} | M_t, Y_{≤t}) = I(X_{≤t}, M_t; Y_{t+1} | Y_{≤t}) − I(M_t; Y_{t+1} | Y_{≤t})
                                     = I(X_{≤t}; Y_{t+1} | Y_{≤t}) − I(M_t; Y_{t+1} | Y_{≤t})
                                     = 0.

Above, the first equality is due to the chain rule, the second follows from the fact that M_t is a function of X_{≤t}, and the third uses Eq. (9).

To finish the proof, let us show that M is minimal. For this, consider U to be another D-BaSS of X w.r.t. Y. As U_t = G(X_{≤t}) for some function G(·), U corresponds to another partition of the trajectories x_{≤t}. If there exists no function f such that f(U_t) = M_t, then the partition that corresponds to M is not a coarsening of the partition for U, and therefore there exist x_{≤t} and x'_{≤t} such that G(x_{≤t}) = G(x'_{≤t}) while p(y_{t+1} | x_{≤t}, y_{≤t}) ≠ p(y_{t+1} | x'_{≤t}, y_{≤t}). This, in turn, implies that there exists an x'_{≤t} such that p(y_{t+1} | u_t, x'_{≤t}, y_{≤t}) ≠ p(y_{t+1} | u_t, y_{≤t}) = Σ_{x_{≤t}} p(y_{t+1} | u_t, x_{≤t}, y_{≤t}) p(x_{≤t} | u_t, y_{≤t}), showing that X_{≤t} is not conditionally independent of Y_{t+1} given (U_t, Y_{≤t}), contradicting the fact that U is a D-BaSS. This contradiction proves that the partition induced by U is a refinement of the partition induced by M, proving the minimality of the latter.

Proof (Proposition 1).
Let us denote by A the minimal D-BaSS of X w.r.t. Y, and by S the minimal D-BaSS of Y w.r.t. X, which are known to exist and be unique thanks to Theorem 1. Then, by defining M_t := (S_t, A_t), one can directly verify that M is a ReD-BaSS of (X, Y). To prove its minimality, let us consider another ReD-BaSS of (X, Y) denoted by N. As N is a D-BaSS of X w.r.t. Y, the minimality of A guarantees the existence of a mapping g(·) such that g(N_t) = A_t. Similarly, thanks to the minimality of S, there is another mapping f(·) such that f(N_t) = S_t. Therefore, the function F(·) = (f, g) satisfies F(N_t) = (S_t, A_t) = M_t, which confirms the minimality of M.

Proof (Proposition 2).
The proof is based on the principle that if p(A, B, C) = f(A, C) g(B, C), then A ⊥⊥ B | C. Building on that rationale, a direct calculation shows that

    p(x, y) = ∏_{τ=−∞}^{∞} p(x_{τ+1}, y_{τ+1} | x_{≤τ}, y_{≤τ})
            = ∏_{τ=−∞}^{∞} exp{ξ_τ} p(x_{τ+1} | x_{≤τ}, y_{≤τ}) p(y_{τ+1} | x_{≤τ}, y_{≤τ}),    (10)

where the second equality uses Eq. (5). (The infinite products in this proof are just a formal device to acknowledge products that can be taken up to arbitrary times.) Additionally, if, as per the assumption of the proposition, R is a ReD-BaSS of (X, Y), then

    p(x_{τ+1} | x_{≤τ}, y_{≤τ}) = p(x_{τ+1} | x_{≤τ}, y_{≤τ}, r_τ) = p(x_{τ+1} | x_{≤τ}, r_τ),

where the first equality uses the fact that r_τ (by definition) is a function of (x_{≤τ}, y_{≤τ}), and the second uses the sufficiency of D-BaSSs. Following an analogous derivation, one can show that p(y_{τ+1} | x_{≤τ}, y_{≤τ}) = p(y_{τ+1} | r_τ, y_{≤τ}). Then, with the assumption that the dynamics of (X, Y) are factorisable and hence ξ_t = 0, it follows from Eq. (10) that

    p(x, y) = ∏_{τ=−∞}^{∞} p(x_{τ+1} | r_τ, x_{≤τ}) p(y_{τ+1} | r_τ, y_{≤τ}).