A Technical Critique of Some Parts of the Free Energy Principle

Abstract

We summarize the original formulation of the free energy principle, and highlight some technical issues. We discuss how these issues affect related results involving generalised coordinates and, where appropriate, mention consequences for and reveal, up to now unacknowledged, differences to newer formulations of the free energy principle. In particular, we reveal that various definitions of the "Markov blanket" proposed in different works are not equivalent. We show that crucial steps in the free energy argument which involve rewriting the equations of motion of systems with Markov blankets, are not generally correct without additional (previously unstated) assumptions. We prove by counterexample that the original free energy lemma, when taken at face value, is wrong. We show further that this free energy lemma, when it does hold, implies equality of variational density and ergodic conditional density. The interpretation in terms of Bayesian inference hinges on this point, and we hence conclude that it is not sufficiently justified. Additionally, we highlight that the variational densities presented in newer formulations of the free energy principle and lemma are parameterised by different variables than in older works, leading to a substantially different interpretation of the theory. Note that we only highlight some specific problems in the discussed publications. These problems do not rule out conclusively that the general ideas behind the free energy principle are worth pursuing.


arXiv [q-bio.NC]

A technical critique of the free energy principle as presented in “Life as we know it” and related works

Martin Biehl ∗, Felix A. Pollock ∗, Ryota Kanai

Araya Inc., Tokyo 105-0003, Japan
School of Physics and Astronomy, Monash University, Clayton, Victoria 3800, Australia

February 10, 2020

Abstract

We summarize the argument in Friston (2013) and highlight some technical errors. We also discuss how these errors affect the very similar Friston et al. (2014) and, where appropriate, mention consequences for the newer proposals in Friston (2019); Parr et al. (2019). The errors call into question the purported interpretation that the internal coordinates of every system with a Markov blanket will appear to engage in Bayesian inference. In particular, in addition to highlighting the implicit restriction to linear models, we identify three formal errors in the main argument of Friston (2013): The first concerns the rewriting of the equations of motion of systems with Markov blankets which turns out not to be generally correct. We prove the non-equivalence with a counterexample that exhibits a Markov blanket but does not satisfy the rewritten equations. Our counterexample also invalidates the corresponding (but more general) rewritten equations in the more recent Friston (2019). The second error concerns the Free Energy Lemma itself, which we prove, by counterexample, to be wrong in general. The third is the claim that the Free Energy Lemma, when it does hold, implies equality of variational density and ergodic conditional density. The interpretation in terms of Bayesian inference hinges on this point, and we hence conclude that it is unjustified. Additionally, we highlight that the definitions of the Markov blanket in Friston (2013); Parr et al. (2019) are not equivalent and that the assumptions in Parr et al. (2019) may be too strong to allow for meaningful interpretation.

Overview

In Friston (2013) it is argued that the internal coordinates of an ergodic random dynamical system with a Markov blanket necessarily appear to engage in active [...]

∗ These authors contributed equally to this work.

[Figure 1: Argument visualization. Numbers labelling edges indicate corresponding steps in this paper. Struck out edges indicate implications that we prove incorrect. The main argument in Friston (2013) takes the left path. The box in the top right indicates the relations between Conditions 1 to 3 and their role in Parr et al. (2019). Merged edges indicate a logical AND combination of the parent nodes.]

[...] a priori, as presented in, e.g., Friston et al. (2015). None of Friston (2013); Friston et al. (2014); Friston (2019); Parr et al. (2019) make this assumption; they instead aim to identify the conditions under which such agents will emerge within a given stochastic process. We now briefly introduce the setting of Friston (2013) and then sketch the content of this paper.

The starting point is a random dynamical system whose evolution is governed by the Langevin equation

$$\dot{x} = f(x) + \omega, \tag{1}$$

where the system state $x$ and vector field $f(x)$ are multi-dimensional, and $\omega$ is a Gaussian noise term. There is an additional assumption that the system is ergodic, such that the steady state probability density $p^*(x)$ is well defined.
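As a concrete illustration of Eq. (1), the following minimal sketch integrates a linear instance of the Langevin equation with the Euler-Maruyama scheme and estimates the covariance of the steady state density from many independent trajectories. Several things here are assumptions made for illustration and are not taken from the paper: the drift is linear, $f(x) = Mx$, with a hypothetical stable matrix `M`; the noise is taken to satisfy $\langle \omega(t)\omega(t')^\top \rangle = 2\Gamma\delta(t - t')$; and the function name `simulate_langevin` and all numerical parameters are invented for this example.

```python
import numpy as np

def simulate_langevin(M, Gamma, n_traj=5000, n_steps=1000, dt=0.01, seed=0):
    """Euler-Maruyama integration of dx/dt = M x + omega for many trajectories.

    Assumption (not from the text): <omega(t) omega(t')^T> = 2 * Gamma * delta(t - t').
    Returns the final states as an array of shape (n_traj, dim).
    """
    rng = np.random.default_rng(seed)
    dim = M.shape[0]
    # The noise increment over one step has covariance 2 * Gamma * dt.
    noise_factor = np.linalg.cholesky(2.0 * Gamma * dt)
    x = np.zeros((n_traj, dim))
    for _ in range(n_steps):
        xi = rng.standard_normal((n_traj, dim))
        x = x + dt * (x @ M.T) + xi @ noise_factor.T
    return x

# Hypothetical two-dimensional linear drift f(x) = M x with a stable M.
M = np.array([[-1.0, 0.5],
              [0.0, -1.0]])
Gamma = np.eye(2)

samples = simulate_langevin(M, Gamma)
empirical_cov = np.cov(samples, rowvar=False)
print(empirical_cov)
```

Under the stated noise convention, the stationary covariance $C$ of this linear system solves the Lyapunov equation $MC + CM^\top = -2\Gamma$, which for the matrices above gives $C_{11} = 1.125$, $C_{12} = 0.25$, $C_{22} = 1$. The empirical estimate should be close to these values, up to sampling and discretisation error.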
It is then assumed that there is a coordinate system $x = (\psi, s, a, \lambda)$ with $\psi = (\psi_1, ..., \psi_{n_\psi})$, $s = (s_1, ..., s_{n_s})$, $a = (a_1, ..., a_{n_a})$, and $\lambda = (\lambda_1, ..., \lambda_{n_\lambda})$, referred to as external, sensory, active, and internal coordinates respectively, such that the following condition holds:

Condition 1. The function $f(x)$ can be written as

$$f(x) = \begin{pmatrix} f_\psi(\psi, s, a) \\ f_s(\psi, s, a) \\ f_a(s, a, \lambda) \\ f_\lambda(s, a, \lambda) \end{pmatrix}. \tag{2}$$

This particular structure is described as “[formalizing] the dependencies implied by the Markov blanket” (Friston, 2013). In contrast, more recent works Friston (2019); Parr et al. (2019) formulate the Markov blanket in terms of statistical dependencies of the ergodic density $p^*(x) = p^*(\psi, s, a, \lambda)$. Specifically, the following condition is presented:

Condition 2. The ergodic density factorises as

$$p^*(\psi, s, a, \lambda) = p^*(\psi \mid s, a)\, p^*(\lambda \mid s, a)\, p^*(s, a). \tag{3}$$

In other words, the internal and external coordinates are independently distributed when conditioned on the sensory and active coordinates. This means we have two different formal expressions of what constitutes a Markov blanket in these publications, and their relationship has not previously been established.

Footnote: In the original paper, the ergodic density is simply denoted $p(x)$. We here add a star to highlight that it is a time independent probability density.

Footnote: These [coordinates] are called “states” in Friston (2013).

Taking Condition 1 to hold, the argument of Friston (2013) then proceeds along the following steps:

Step 1: Rewrite the vector field $f(\psi, s, a, \lambda)$ describing the dynamics of the system in terms of the gradient of negative logarithm of the ergodic density $p^*(\psi, s, a, \lambda)$ of that system.

Step 2: Rewrite the components $f_\lambda(s, a, \lambda)$ and $f_a(s, a, \lambda)$ of the vector field $f(\psi, s, a, \lambda)$ in terms of only partial gradients of the negative logarithm of $p^*(\psi, s, a, \lambda)$.

Step 3: Assert (in the Free Energy Lemma) the existence of a density $q(\psi \mid \lambda)$ over the external coordinates $\psi$ parameterized by the internal coordinates $\lambda$, and that $f(\psi, s, a, \lambda)$ can again be rewritten, this time in terms of a free energy depending on $q(\Psi \mid \lambda)$.

Step 4: Claim that equivalence of the equations of motion in Step 2 and Step 3 implies that certain partial gradients of the KL divergence between $q(\Psi \mid \lambda)$ and the conditional ergodic density $p^*(\Psi \mid s, a, \lambda)$ must vanish.

Step 5: Claim that it follows from Step 4 that $q(\Psi \mid \lambda)$ and $p^*(\Psi \mid s, a, \lambda)$ are “rendered” equal.

Step 6: Interpret
• $p^*(\Psi \mid s, a, \lambda)$ as a posterior over external coordinates given particular values of sensor, active, and internal coordinates,
• $q(\Psi \mid \lambda)$ as encoding Bayesian beliefs about the external coordinates by the internal coordinates, and
• their equality as the internal coordinates appearing to “solve the problem of Bayesian inference”.

In the present paper, we make the following main observations:
• The re-expression of Eq. (1) in the form chosen in Step 1 is derived under the assumption that the system is linear and subject to Gaussian and Markov noise.
• Condition 1 and Condition 2 are independent from each other.
• Condition 1 and Condition 3 together lead to a system where the interpretation of $s$ and $a$ as sensory and active coordinates is questionable.
• Under both Conditions 1 and 2, the expressions of $f_\lambda(s, a, \lambda)$ and $f_a(s, a, \lambda)$ resulting from Step 2 are not as general as those contained in the result of Step 1. The more general alternative expression derived in Friston (2019) remains insufficiently general.
• Under both Conditions 1 and 2, the Free Energy Lemma is wrong and cannot be salvaged by using alternatives in Step 2.
• Under both Conditions 1 and 2, contrary to Step 5 the vanishing of the gradient of the KL divergence does not imply equality of $q(\Psi \mid \lambda)$ and $p^*(\Psi \mid s, a, \lambda)$.

Footnote: Here, and whenever it would otherwise be ambiguous, we use a capitalized $\Psi$ to indicate full distributions, rather than the probability density for a specific value of $\psi$.

As a consequence, the basic preconditions for the interpretations in Step 6 are not implied by either of the two proposed Markov blanket Conditions 1 and 2.

The later Friston et al. (2014) presents an argument almost identical to the one in the original Friston (2013). In Section 7 we discuss how our observations apply to this publication.
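The relationship between Condition 1 and Condition 2 listed among the observations above can be explored numerically in the linear, Gaussian setting. The sketch below is an illustration under assumptions, not the counterexample of the paper's Appendix A: it takes a hypothetical four-dimensional linear drift $f(x) = Mx$ with one coordinate per group, assumes the noise convention $\langle \omega(t)\omega(t')^\top \rangle = 2\Gamma\delta(t - t')$, and uses two standard facts: the stationary covariance $C$ of such a system solves the Lyapunov equation $MC + CM^\top = -2\Gamma$, and in a Gaussian density two groups of variables are conditionally independent given the rest exactly when the corresponding block of the inverse covariance vanishes. All function names are invented for this example. The block also recovers the matrix multiplying $\nabla \ln p^*(x)$ in the Step 1 rewriting and splits it into symmetric and antisymmetric parts.

```python
import numpy as np

# Coordinate order (psi, s, a, lambda), one dimension each (illustrative choice).
PSI, S, A, LAM = 0, 1, 2, 3

def stationary_covariance(M, Gamma):
    """Solve the Lyapunov equation M C + C M^T = -2 Gamma for C.

    Assumes linear drift f(x) = M x and noise covariance 2 * Gamma * delta(t - t').
    Uses row-major vectorisation: vec(M C) = kron(M, I) vec(C) and
    vec(C M^T) = kron(I, M) vec(C).
    """
    n = M.shape[0]
    eye = np.eye(n)
    lhs = np.kron(M, eye) + np.kron(eye, M)
    rhs = (-2.0 * Gamma).reshape(-1)
    return np.linalg.solve(lhs, rhs).reshape(n, n)

def satisfies_condition_1(M, tol=1e-12):
    """Condition 1 for linear drift: f_a, f_lambda ignore psi; f_psi, f_s ignore lambda."""
    entries = [M[A, PSI], M[LAM, PSI], M[S, LAM], M[PSI, LAM]]
    return all(abs(e) < tol for e in entries)

def satisfies_condition_2(U, tol=1e-9):
    """Condition 2 for a Gaussian density: precision entry coupling psi and lambda is zero."""
    return bool(abs(U[PSI, LAM]) < tol)

# Hypothetical system: s drives both psi and lambda; a is decoupled.
M = -np.eye(4)
M[PSI, S] = 1.0
M[LAM, S] = 1.0
Gamma = np.eye(4)

C = stationary_covariance(M, Gamma)
U = np.linalg.inv(C)  # precision matrix of the Gaussian ergodic density

# Step 1 form: f(x) = Q grad ln p*(x) with grad ln p*(x) = -U x, hence Q = -M C.
Q = -M @ C
sym_part = 0.5 * (Q + Q.T)      # symmetric part, equals Gamma
antisym_part = 0.5 * (Q - Q.T)  # antisymmetric part

print("Condition 1 holds:", satisfies_condition_1(M))
print("Condition 2 holds:", satisfies_condition_2(U))
print("U[psi, lambda] =", U[PSI, LAM])
```

For this particular choice, the drift has the dependency structure of Condition 1, yet the precision entry coupling $\psi$ and $\lambda$ is non-zero, because both coordinates integrate the history of $s$ and remain correlated given its current value. The example covers only one direction (Condition 1 without Condition 2) and is meant as a sanity check rather than a substitute for the proofs in the paper.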

Here we introduce the expression of the system’s dynamics Eq. (1) in the form used for the Free Energy Lemma (Lemma 2.1 in Friston, 2013). This form expresses the dynamics of internal and active coordinates of the given ergodic random dynamical system in terms of the gradient of the ergodic density $p^*(x)$. In accordance with the results of Kwon et al. (2005), $f(x)$ is rewritten as (see Eq. (2.5) in Friston, 2013):

$$f(x) = (\Gamma + R) \cdot \nabla \ln p^*(x), \tag{4}$$

where $\Gamma$ is the diffusion matrix, which we will take to be block diagonal, and $R$ is an antisymmetric matrix, defined through the relation

$$M R + R M^T = M \Gamma - \Gamma M^T, \tag{5}$$

with

$$M_{ij} = \nabla_j f_i(x). \tag{6}$$

Both $\Gamma$ and $R$ are assumed constant. We emphasise here that Eq. (4) is derived in the literature under the explicit assumption that the fluctuations $\omega$ be Gaussian and Markov, and that $f(x)$ is a linear function (Ao, 2004; Kwon et al., 2005). When $f(x)$ is nonlinear, Eq. (4) must be modified (Kwon and Ao, 2011). In other words, by requiring Eq. (4) to hold generically, one is restricted to the class of Ornstein-Uhlenbeck processes, and the ergodic density $p^*(x) = p^*(\psi, s, a, \lambda)$ is necessarily a multivariate Gaussian with zero mean. Specifically, following Kwon et al. (2005),

$$p^*(\psi, s, a, \lambda) := \frac{1}{Z} \exp\left[ -\frac{1}{2} (\psi, s, a, \lambda)\, U\, (\psi, s, a, \lambda)^\top \right], \tag{7}$$

where $(\psi, s, a, \lambda)$ is a row vector and $Z$ is a suitable normalisation constant. From Eq. (4) it can be seen that

$$U = -(\Gamma + R)^{-1} M. \tag{8}$$

This concludes Step 1.

Footnote: In Friston (2013), and later work such as Friston (2019), $\Gamma$ is taken to be proportional to the identity matrix.

Footnote: There may be more general processes for which Eq. (50) of Kwon and Ao (2011) is zero (and hence Eq. (4) is valid), but these will necessarily be finely tuned, and Eq. (4) will not be robust under small perturbations to the dynamics.

[...] $M$ and $U$. Firstly, since it effectively states that $\nabla_\psi f_a(x) = \nabla_\psi f_\lambda(x) = \nabla_\lambda f_s(x) = \nabla_\lambda f_\psi(x) = 0$,

$$\text{Condition 1} \Leftrightarrow M_{a\psi} = M_{\lambda\psi} = M_{s\lambda} = M_{\psi\lambda} = 0, \tag{9}$$

with $M_{\alpha\beta}$ a block sub-matrix of $M$ in general. Secondly, because of the multivariate Gaussian nature of $p^*(\psi, s, a, \lambda)$, the dependencies of conditional distributions are encoded in the inverse $U$ of the covariance matrix; we therefore have that

$$\text{Condition 2} \Leftrightarrow U_{\psi\lambda} = U_{\lambda\psi} = 0, \tag{10}$$

where $U_{\alpha\beta}$ is a block sub-matrix of $U$. These implications bring us to our first observation:

Observation 1.

Neither Condition 1 (the vector field dependency structure) nor Condition 2 (conditional independence in the ergodic distribution) implies the other:

$$\text{Condition 1} \nRightarrow \text{Condition 2}, \tag{11}$$

$$\text{Condition 1} \nLeftarrow \text{Condition 2}. \tag{12}$$

Proof.

In Appendix A, we provide direct counterexamples, using the equivalent constraints on the matrices $M$ and $U$ in Eqs. (9) and (10), to implication in either direction. That is, there exists a system obeying Condition 1 that does not obey Condition 2 (proving Eq. (11)), and there exists one obeying Condition 2 that does not obey Condition 1 (proving Eq. (12)).

Henceforth, unless otherwise stated, we will assume both Condition 1 and Condition 2. Any implications that fail to hold in this special case cannot hold generally.

For Step 2 we focus on the components $f_\lambda = (f_{\lambda_1}, ..., f_{\lambda_{n_\lambda}})$ and $f_a = (f_{a_1}, ..., f_{a_{n_a}})$ of $f$. Without loss of generality we can rewrite them from Eq. (4) as:

$$f_a(s, a, \lambda) = \left( R_{a\psi} \cdot \nabla_\psi + R_{as} \cdot \nabla_s + (\Gamma_{aa} + R_{aa}) \cdot \nabla_a + R_{a\lambda} \cdot \nabla_\lambda \right) \ln p^*(\psi, s, a, \lambda), \tag{13}$$

$$f_\lambda(s, a, \lambda) = \left( R_{\lambda\psi} \cdot \nabla_\psi + R_{\lambda s} \cdot \nabla_s + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda}) \cdot \nabla_\lambda + R_{\lambda a} \cdot \nabla_a \right) \ln p^*(\psi, s, a, \lambda), \tag{14}$$

where $\Gamma_{nm}$ ($R_{nm}$) is the block of $\Gamma$ ($R$) connecting derivatives with respect to the $m$ coordinates to the time derivatives of the $n$ coordinates. The expectation value with respect to $p^*(\psi \mid s, a, \lambda)$ leaves the left hand side of these equations unchanged. A few manipulations (cf. Friston, 2019, Eq. (12.14), p. 129) reveal that, on the right hand side, this leads to the ergodic density $p^*(\psi, s, a, \lambda)$ being replaced by the marginalised ergodic density $p^*(s, a, \lambda)$ so that we get

$$f_a(s, a, \lambda) = \left( R_{a\psi} \cdot \nabla_\psi + R_{as} \cdot \nabla_s + (\Gamma_{aa} + R_{aa}) \cdot \nabla_a + R_{a\lambda} \cdot \nabla_\lambda \right) \ln p^*(s, a, \lambda), \tag{15}$$

$$f_\lambda(s, a, \lambda) = \left( R_{\lambda\psi} \cdot \nabla_\psi + R_{\lambda s} \cdot \nabla_s + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda}) \cdot \nabla_\lambda + R_{\lambda a} \cdot \nabla_a \right) \ln p^*(s, a, \lambda). \tag{16}$$

Since $\nabla_\psi \ln p^*(s, a, \lambda) = 0$, the terms involving $\nabla_\psi$ drop out:

$$f_a(s, a, \lambda) = \left( R_{as} \cdot \nabla_s + (\Gamma_{aa} + R_{aa}) \cdot \nabla_a + R_{a\lambda} \cdot \nabla_\lambda \right) \ln p^*(s, a, \lambda), \tag{17}$$

$$f_\lambda(s, a, \lambda) = \left( R_{\lambda s} \cdot \nabla_s + R_{\lambda a} \cdot \nabla_a + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda}) \cdot \nabla_\lambda \right) \ln p^*(s, a, \lambda). \tag{18}$$

We are not aware of how to further simplify this equation without additional assumptions. However, in Friston (2013, Eq. (2.5) and Eq. (2.6)) all but the diagonal terms are implicitly assumed to vanish, i.e., Eq. (4) is equated with:

$$f_a(s, a, \lambda) = (\Gamma_{aa} + R_{aa}) \cdot \nabla_a \ln p^*(s, a, \lambda), \tag{19}$$

$$f_\lambda(s, a, \lambda) = (\Gamma_{\lambda\lambda} + R_{\lambda\lambda}) \cdot \nabla_\lambda \ln p^*(s, a, \lambda). \tag{20}$$

This equation is the result of Step 2.

In the more recent Friston (2019, Appendix B) a more detailed discussion of Eq. (4) is presented, where it is claimed that Condition 1 implies Condition 2 (cf. our Observation 1) along with the following simplification of Eqs. (17) and (18) (Friston, 2019, Eqs. (12.8-12.11, 12.15), pp. 126-129):

$$f_a(s, a, \lambda) = \left( (\Gamma_{aa} + R_{aa}) \cdot \nabla_a + R_{a\lambda} \cdot \nabla_\lambda \right) \ln p^*(s, a, \lambda), \tag{21}$$

$$f_\lambda(s, a, \lambda) = \left( R_{\lambda a} \cdot \nabla_a + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda}) \cdot \nabla_\lambda \right) \ln p^*(s, a, \lambda). \tag{22}$$

However, Eqs. (21) and (22) are still provably less general than Eqs. (13) and (14), even when both Condition 1 and Condition 2 are satisfied.

Observation 2.

Given a random dynamical system obeying Eq. (1), ergodicity, and both Condition 1 and Condition 2, none of Eqs. (19) to (22) generally hold.

Proof. By counterexample, see Appendix B. There, we show explicitly that a model satisfying the above assumptions does not satisfy the equations in question.

In order to arrive at Eqs. (21) and (22) from Eqs. (17) and (18) in general, one must remove the offending “solenoidal flow” terms by fiat. That is, one assumes $R_{as} = R_{\lambda s} = 0$. In Friston (2019, Eq. (12.4)), the following, even stronger, condition is assumed as an alternative starting point (along with Condition 2):

Condition 3.

The blocks of the $R$ matrix appearing in Eq. (4) coupling $(s, a)$ coordinates to $\lambda$ and $\psi$ coordinates, and $\psi$ coordinates to $\lambda$ coordinates vanish, i.e.

$$R_{\psi s} = R_{\psi a} = R_{\psi\lambda} = R_{s\lambda} = R_{a\lambda} = 0. \tag{23}$$

This is claimed to imply $M_{\psi\lambda} = M_{\lambda\psi} = 0$, but not the full Condition 1. However, in Parr et al. (2019), both Condition 1 and Condition 3 are assumed (along with $R_{as} = 0$). This prompts our next observation.

Observation 3. In a system satisfying both Condition 1 and Condition 3, the internal coordinates cannot be directly influenced by the sensory coordinates: $f_\lambda(s, a, \lambda) = f_\lambda(a, \lambda)$, and the external coordinates cannot be directly influenced by the active coordinates: $f_\psi(\psi, s, a) = f_\psi(\psi, s)$.

Proof. From Eq. (5), it follows that

$$M = (\Gamma + R)\, M^T (\Gamma - R)^{-1}, \tag{24}$$

with the inverse replaced by a pseudoinverse if $\Gamma - R$ is not invertible. Therefore, if $\Gamma_{\alpha\beta} = \delta_{\alpha\beta} \Gamma_{\alpha\alpha}$ and $R_{\alpha\beta} = \delta_{\alpha\beta} R_{\alpha\alpha}$ for blocks of coordinates labelled by $\alpha$ and $\beta$, then

$$M_{\alpha\beta} = (\Gamma_{\alpha\alpha} + R_{\alpha\alpha})\, M^T_{\beta\alpha} (\Gamma_{\beta\beta} - R_{\beta\beta})^{-1}, \tag{25}$$

and $M_{\beta\alpha} = 0 \Rightarrow M_{\alpha\beta} = 0$.

Condition 3 implies that the only nonzero blocks of $R$ are $R_{\psi\psi}$, $R_{ss}$, $R_{sa}$, $R_{as}$, $R_{aa}$, and $R_{\lambda\lambda}$, and $\Gamma$ is assumed to be block diagonal. As noted in Eq. (9), Condition 1 requires that $M_{a\psi} = M_{\lambda\psi} = M_{s\lambda} = M_{\psi\lambda} = 0$. Through Eq. (25), these together imply that $M_{\lambda s} = M_{\psi a} = 0$, and hence that

$$f(x) = \begin{pmatrix} f_\psi(\psi, s) \\ f_s(\psi, s, a) \\ f_a(s, a, \lambda) \\ f_\lambda(a, \lambda) \end{pmatrix}, \tag{26}$$

as was to be shown.

In this case, the four sets of coordinates interact in a chain, and it is questionable whether the $s$ and $a$ coordinates can be meaningfully interpreted, respectively, as sensory inputs to the internal coordinates or their boundary-mediated influence on the external coordinates.

The relation of the dynamics of the internal coordinates to Bayesian beliefs is made by introducing a density (called the variational density) $q(\Psi \mid \lambda)$ that is then interpreted as encoding a Bayesian belief. It is parameterized by the internal coordinates $\lambda$ and claimed to be “arbitrary”.
The existence of the variational density $q(\Psi \mid \lambda)$ is asserted by the Free Energy Lemma (see Lemma 2.1 in Friston, 2013).

More precisely, the Free Energy Lemma (and Step 3) asserts that for every ergodic density $p^*(\psi, s, a, \lambda)$ of a system obeying Eqs. (19) and (20) there is a free energy $F(s, a, \lambda)$, defined as

$$F(s, a, \lambda) := -\ln p^*(s, a, \lambda) + \int q(\psi \mid \lambda) \ln \frac{q(\psi \mid \lambda)}{p^*(\psi \mid s, a, \lambda)}\, d\psi \tag{27}$$

$$= -\ln p^*(s, a, \lambda) + D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)], \tag{28}$$

in terms of the “posterior density” $p^*(\Psi \mid s, a, \lambda)$, such that Eqs. (19) and (20) can be rewritten as:

$$f_a(s, a, \lambda) = -(\Gamma + R)_{aa} \cdot \nabla_a F(s, a, \lambda), \tag{29}$$

$$f_\lambda(s, a, \lambda) = -(\Gamma + R)_{\lambda\lambda} \cdot \nabla_\lambda F(s, a, \lambda). \tag{30}$$

Footnote: Explicitly, the Free Energy Lemma asserts the existence of a free energy $F(s, a, \lambda)$ in terms of which $f(\psi, s, a, \lambda)$ can be expressed and not the existence of $q(\Psi \mid \lambda)$. However, since the free energy is defined as a functional of $q(\Psi \mid \lambda)$, it exists if and only if a suitable $q(\Psi \mid \lambda)$ exists.

Footnote: Equivalently as expressed in Friston (2013), for every Gibbs energy $G(x) := -\ln p^*(\psi, s, a, \lambda)$.

It is worth considering what a proof of the Free Energy Lemma could look like. A proof of existence of a free energy (and therefore of the Free Energy Lemma) would need to show that, for every system satisfying the given assumptions, there always exists a $q(\Psi \mid \lambda)$ such that the right hand sides of Eqs. (29) and (30) are equal to the right hand sides of Eqs. (19) and (20). Expanding Eqs. (29) and (30) using Eq. (28) leads to:

$$f_a(s, a, \lambda) = (\Gamma + R)_{aa} \cdot \nabla_a \ln p^*(s, a, \lambda) - (\Gamma + R)_{aa} \cdot \nabla_a D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)], \tag{31}$$

$$f_\lambda(s, a, \lambda) = (\Gamma + R)_{\lambda\lambda} \cdot \nabla_\lambda \ln p^*(s, a, \lambda) - (\Gamma + R)_{\lambda\lambda} \cdot \nabla_\lambda D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)]. \tag{32}$$

For equality of the right hand sides to those of Eqs. (19) and (20) we need:

$$(\Gamma + R)_{aa} \cdot \nabla_a D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)] = 0, \tag{33}$$

$$(\Gamma + R)_{\lambda\lambda} \cdot \nabla_\lambda D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)] = 0. \tag{34}$$

In words, these equations say that the Free Energy Lemma holds if any of the following three conditions (of strictly increasing strength) is given:

1. There is a $q(\Psi \mid \lambda)$ such that the partial gradients $\nabla_a$ and $\nabla_\lambda$ of the KL divergence between the variational density and the conditional ergodic density are elements of the nullspaces of $(\Gamma + R)_{aa}$ and $(\Gamma + R)_{\lambda\lambda}$ respectively.

2. There is a $q(\Psi \mid \lambda)$ such that the gradients of the KL divergence to $p^*(\Psi \mid s, a, \lambda)$ are equal to the nullvector:

$$\nabla_a D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)] = 0, \tag{35}$$

$$\nabla_\lambda D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)] = 0. \tag{36}$$

Then they are always elements of the nullspaces of $(\Gamma + R)_{aa}$ and $(\Gamma + R)_{\lambda\lambda}$ respectively.

3. There is a $q(\Psi \mid \lambda)$ such that $q(\Psi \mid \lambda) = p^*(\Psi \mid s, a, \lambda)$ (and hence $p^*(\Psi \mid s, a, \lambda) = p^*(\Psi \mid \lambda)$), which implies that the KL divergence to $p^*(\Psi \mid s, a, \lambda)$ vanishes for all $a, \lambda$, so that the two partial gradients are always nullvectors and therefore elements of the corresponding nullspaces.

The Free Energy Lemma can then be proven by showing that one of these three cases follows from the conditions of the lemma. However, no attempt is made in Friston (2013) to establish this. Instead the given proof discusses purported consequences of the existence of a suitable $q(\Psi \mid \lambda)$. These will be discussed in Steps 4 and 5.

Footnote: Here, we keep the conditioning argument $\lambda$, as in Friston (2013), and do not explicitly assume Condition 2, though our conclusions are unaffected by it.

[...] $q(\Psi \mid \lambda)$ such that

$$f_a(s, a, \lambda) = -\left( (\Gamma_{aa} + R_{aa}) \cdot \nabla_a + R_{a\lambda} \cdot \nabla_\lambda \right) F(s, a, \lambda), \tag{37}$$

$$f_\lambda(s, a, \lambda) = -\left( R_{\lambda a} \cdot \nabla_a + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda}) \cdot \nabla_\lambda \right) F(s, a, \lambda), \tag{38}$$

or

$$f_a(s, a, \lambda) = -\left( R_{as} \cdot \nabla_s + (\Gamma_{aa} + R_{aa}) \cdot \nabla_a + R_{a\lambda} \cdot \nabla_\lambda \right) F(s, a, \lambda), \tag{39}$$

$$f_\lambda(s, a, \lambda) = -\left( R_{\lambda s} \cdot \nabla_s + R_{\lambda a} \cdot \nabla_a + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda}) \cdot \nabla_\lambda \right) F(s, a, \lambda), \tag{40}$$

hold respectively. However, we find this not to be the case in general.

Observation 4.

Given a random dynamical system obeying Eq. (1), ergodicity, Condition 1 and Condition 2, there need not exist a free energy expressed in terms of a variational density $q(\Psi \mid \lambda)$ such that:

(i) Eqs. (29) and (30) hold if Eqs. (19) and (20) do;

(ii) Eqs. (37) and (38) hold if Eqs. (19) and (20) don’t hold but Eqs. (21) and (22) do;

(iii) Eqs. (39) and (40) hold if neither Eqs. (19) and (20) nor Eqs. (21) and (22) hold but Eqs. (17) and (18) do.

Proof.

In Appendix C, we derive a set of conditions on the $R$ and $U$ matrices, and on the putative variational density $q(\Psi \mid \lambda)$, that follow from each of the pairs of equations in cases (i-iii). We show that, in general, each pair leads to a contradiction and, in each case, provide a counterexample that falls in the corresponding system class.

Before proceeding, we note that later works present an alternative version of the Free Energy Lemma, where the conditioning argument of $q(\Psi \mid \lambda)$ is replaced by the most likely value of $\lambda$ conditional on the $(s, a)$ coordinates (Friston, 2019; Parr et al., 2019). We here concern ourselves with the version apparent in Friston (2013), where $q(\Psi \mid \lambda)$ is parameterised by the internal states themselves, but will briefly comment on the interpretation of the alternative approach in Step 6.

As mentioned in Step 3, the proof of the Free Energy Lemma in Friston (2013) only discusses its consequences. The first proposed consequence is that expressing the vector field in terms of a free energy as in Eqs. (29) and (30) “requires” that the gradients with respect to $a$ and $\lambda$ of the KL divergence vanish, i.e. that Eqs. (35) and (36) hold.

We mentioned in Step 3 that the implication in the opposite direction holds. This can be seen from Eqs. (33) and (34). However, if the nullspace of $(\Gamma + R)_{aa}$ or $(\Gamma + R)_{\lambda\lambda}$ is non-trivial, then the gradient may be a non-zero element of this subspace and Eqs. (29) and (30) will still hold. In that case the vanishing gradients would not be necessary for the Free Energy Lemma.

The conditions under which a non-trivial nullspace exists are discussed in Kwon et al. (2005). In short, the nullspace is guaranteed to be trivial in the special case where $\Gamma$ is positive definite. Whether or not ergodic systems with a Markov blanket can ever admit a non-trivial nullspace, and hence divergences in Eqs. (31) and (32) with non-vanishing gradients, is not immediately clear. However, in order to establish the necessity of Eqs. (35) and (36) this remains to be proven.

[...] $q(\Psi \mid \lambda)$ and $p^*(\Psi \mid s, a, \lambda)$

The proof of the Free Energy Lemma in Friston (2013) also proposes that the vanishing of gradients of the KL divergence, of the variational density $q(\Psi \mid \lambda)$ from the conditional ergodic density $p^*(\Psi \mid s, a, \lambda)$, implies the equality of these densities. We mentioned in Step 4 that the implication in the opposite direction holds. This can also be seen from Eqs. (33) and (34). Concerning the implication in the direction proposed by Friston (2013), let us now assume that for a given system Eqs. (19) and (20) hold, a variational density $q(\Psi \mid \lambda)$ does exist, and the gradients of the KL divergence of the variational and ergodic densities vanish, i.e. Eqs. (35) and (36) hold. Then consider the argument by Friston (2013) in this direct quote (comments in square brackets by us):

“However, equation (2.6) [Eqs. (19) and (20) above] requires the gradients of the divergence to be zero [Eqs. (35) and (36)], which means the divergence must be minimized with respect to internal states. This means that the variational and posterior densities must be equal:

$$q(\psi \mid \lambda) = p^{[*]}(\psi \mid s, a, \lambda) \Rightarrow D_{KL} = 0 \Rightarrow \begin{cases} (\Gamma + R) \cdot \nabla_\lambda D_{KL} = 0, \\ (\Gamma + R) \cdot \nabla_a D_{KL} = 0. \end{cases}$$

In other words, the flow of internal and active states minimizes free energy, rendering the variational density equivalent to the posterior density over external states.”

The first problem in the above quote is that the minimization of the divergence does not follow from the vanishing gradients. On the contrary, since Eqs. (35) and (36) must hold for all $(s, a, \lambda)$, the KL divergence $D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)]$ cannot depend on $(\lambda, a)$; it therefore has no extremum (and thus no minimum) with respect to either of these coordinates.

The second problem pertains to the identification of the two distributions at a minimum.
In general, if we try to find the minimum of a KL divergence between a given probability density $p(Y)$ and a family of densities $p(Y \mid \theta)$ parameterized by $\theta$, then the lowest possible value of zero is achieved only if there is a parameter $\theta$ such that $p(Y \mid \theta) = p(Y)$. If there is no such $\theta$, then the minimum value will be larger than zero. So, even if the divergence were minimized, it would not need to be zero. More generally, the divergence, which by the argument above can depend only on $s$ and may thus be written $K(s)$, need not be zero for any value of $s$.

There is therefore no satisfactory reason given why the variational density $q(\Psi \mid \lambda)$ and the posterior density $p^*(\Psi \mid s, a, \lambda)$ should be equal or have low KL divergence. In fact they need not be.

Observation 5.

Consider a random dynamical system obeying Eq. (1), ergodicity, Condition 1 and Condition 2. If, additionally,

(i) Eqs. (19) and (20) hold and the Free Energy Lemma holds, i.e. there exists a probability density $q(\Psi \mid \lambda)$ such that Eqs. (29) and (30) hold, or

(ii) Eqs. (21) and (22) hold and there exists $q(\Psi \mid \lambda)$ such that Eqs. (37) and (38) hold, or

(iii) Eqs. (17) and (18) hold and there exists $q(\Psi \mid \lambda)$ such that Eqs. (39) and (40) hold,

then there is no $c \geq 0$ for which it can be guaranteed that

$$D_{KL}[q(\Psi \mid \lambda) \,\|\, p^*(\Psi \mid s, a, \lambda)] < c. \tag{41}$$

In particular, it does not follow from these conditions that

$$q(\Psi \mid \lambda) = p^*(\Psi \mid s, a, \lambda). \tag{42}$$

Proof.

By example; see Appendix D. To show that the implication does not generally hold for a given system and densities $q(\Psi \mid \lambda)$ that obey Eqs. (19), (20), (29) and (30), Eqs. (21), (22), (37) and (38), or Eqs. (17), (18), (39) and (40), we only have to consider a system that obeys all three pairs of equations, Eqs. (19) and (20), Eqs. (21) and (22), and Eqs. (17) and (18), and for which suitable $q(\Psi \mid \lambda)$ exist. For this system we then need to show that the $q(\Psi \mid \lambda)$ that obey Eqs. (29) and (30) are not necessarily equal (or similar) to $p^*(\Psi \mid s, a, \lambda)$.

We use a variant of the model used in Appendix B as such a counterexample. This system obeys all three of Eqs. (19) and (20), Eqs. (21) and (22), and Eqs. (17) and (18), and the nullspace of the associated $\Gamma + R$ is trivial. We identify a set of possible $q(\Psi \mid \lambda)$ satisfying Eqs. (29) and (30), which implies that the gradients of the KL divergence between those $q(\Psi \mid \lambda)$ and $p^*(\Psi \mid s, a, \lambda)$ vanish, i.e. Eqs. (35) and (36) hold. We then demonstrate that for the $q(\Psi \mid \lambda)$ in this set the value of the KL divergence to $p^*(\Psi \mid s, a, \lambda)$ can be arbitrarily large.

Finally, we turn our attention to the interpretation in terms of Bayesian inference, i.e. Step 6. We again quote directly from Friston (2013):

“Because (by Gibbs inequality) this divergence [$D_{KL}[q(\psi \mid \lambda) \,\|\, p^*(\psi \mid s, a, \lambda)]$] cannot be less than zero, the internal flow will appear to have minimized the divergence between the variational and posterior density. In other words, the internal states will appear to have solved the problem of Bayesian inference by encoding posterior beliefs about hidden (external) states, under a generative model provided by the Gibbs energy.”
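Gibbs' inequality, invoked in the quote above, bounds the divergence below by zero, but it does not say that the bound is attained. The second problem raised in Step 5 (a minimised KL divergence need not vanish when the parameterised family does not contain the target density) can be made concrete with univariate Gaussians. The following is a minimal sketch under assumptions chosen purely for illustration: the target is a standard normal, the hypothetical variational family is $N(\theta, 2^2)$ with only the mean $\theta$ adjustable, and the closed-form KL divergence between two Gaussians is used. It is unrelated to the specific counterexample of Appendix D.

```python
import math

def kl_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    """KL divergence D_KL[N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)] in nats."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

# Target density: standard normal. Variational family: N(theta, 2^2),
# i.e. only the mean theta is adjustable (hypothetical family for illustration).
SIGMA_Q = 2.0
thetas = [i / 100.0 for i in range(-300, 301)]
divergences = [kl_gaussian(t, SIGMA_Q, 0.0, 1.0) for t in thetas]

# Minimise over the grid of means.
best_kl = min(divergences)
best_theta = thetas[divergences.index(best_kl)]
print(best_theta, best_kl)
```

The minimum over the grid is reached at $\theta = 0$, where the divergence equals $3/2 - \ln 2 > 0$: the divergence is minimised with respect to the parameter, yet the two densities differ.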
We have shown that, in general, there is no suitable variational density and that, even if there is one, it can be arbitrarily different from the posterior density. Since the arguments for the internal flow appearing to minimize the divergence between variational and posterior density are therefore incorrect, there is no reason why the internal states should appear to have solved the problem of Bayesian inference.

As mentioned in Step 3, some newer works (e.g., Friston (2019); Parr et al. (2019)) formulate a version of the Free Energy principle where the variational density of beliefs is parameterised not by the internal coordinates $\lambda$ but by $\bar{\lambda}(s, a) = \arg\max_\lambda p^*(\lambda \mid s, a)$, the most likely value of the internal coordinates given the sensory and active ones. In this case, many of the arguments we raised in Steps 3-5 do not apply. However, the new parameters $\bar{\lambda}(s, a)$ are strictly a function of the sensory and active coordinates. This means we have a Markov chain $\Lambda \to (S, A) \to \bar{\Lambda}$ and, by the data processing inequality (Cover and Thomas, 2006), the mutual information between the sensory and active coordinates and the belief parameter $\bar{\lambda}$ upper bounds that between the internal coordinates and the belief parameter. It is therefore not clear to what extent the internal coordinates $\lambda$, rather than the active and sensory coordinates $(s, a)$ themselves, can be said to be encoding beliefs about the external coordinates. Note also that, on any given trajectory, unless the distribution $p^*(\lambda \mid s, a)$ is sufficiently peaked and unimodal, the internal coordinates are not guaranteed to spend most of their time close to their most likely conditional value, and (by definition if Condition 2 holds) they will not be better predictors of the external coordinates than those in the Markov blanket.

Footnote: As discussed in Step 1 and Step 2, these [newer works] make a slightly different set of assumptions from Friston (2013), but rely on similar arguments to the ones we disprove here.

Footnote: With capitalisations indicating random variables associated to the corresponding lower case coordinates (or functions of coordinates).

Friston et al. (2014) argues for the same interpretation as Friston (2013) but there are some differences in the argument. The differences are the following:

• In Friston et al. (2014), Eq. (1) is formulated for “generalized states,” which we refer to here as generalized coordinates. This means that the variable $x$ is replaced by a multidimensional variable denoted $\tilde{x} = (x, x', x'', ...)$.
• The Markov blanket structure is not explicitly defined via Eq. (2). Formally, it is introduced directly (see Friston et al., 2014, Eq. (10)) in a less general form corresponding to Eqs. (19) and (20). Therefore, our observations concerning Steps 1 to 3 are not directly relevant to this paper.
• The internal coordinate $\lambda$ is renamed to $r$ and the role of matrix $R$ is played by the matrix $-Q$.
• The proof of the Free Energy Lemma given in Friston et al. (2014) is different. It (implicitly) suggests setting the variational density equal to the ergodic conditional posterior.
• The proof of the Free Energy Lemma no longer contains the proposition that the gradient of the KL divergence of the variational density and the ergodic conditional density vanishes, i.e. Step 4.
• The proof also no longer contains the claim that the vanishing gradients of the KL divergence of the variational density and the ergodic conditional density imply equality of those densities, i.e. Step 5 is not present.

Footnote: At the same time Friston (2013) is referenced in connection to the Markov blanket so there seems to be no intention to replace the original definition with the stronger one.

The interpretation in terms of Bayesian inference is unchanged and still relies on the equality of the variational and the ergodic conditional density.

Since there are no explicit generalized coordinate versions of Steps 1, 2, 4 and 5 in Friston et al. (2014) we do not discuss those steps here.
We only disprove the Free Energy Lemma and the claim that, when the Free Energy Lemma holds, the variational and ergodic conditional density become equal. For this we present a way to translate the counterexamples used in Observations 4 and 5 into counterexamples in generalized coordinates. The interpretation in terms of Bayesian inference given in Friston et al. (2014) is therefore just as unjustified as the one in Friston (2013).

For completeness, we first state the generalized coordinate versions of the Langevin equation Eq. (1),

$$\dot{\tilde x} = f(\tilde x) + \tilde\omega, \tag{43}$$

the less general version of the Markov blanket structure Eq. (2),

$$\begin{aligned}
f_{\tilde\psi}(\tilde\psi,\tilde s,\tilde a) &= (\Gamma - Q)_{\tilde\psi\tilde\psi}\,\nabla_{\tilde\psi}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r)\\
f_{\tilde s}(\tilde\psi,\tilde s,\tilde a) &= (\Gamma - Q)_{\tilde s\tilde s}\,\nabla_{\tilde s}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r)\\
f_{\tilde a}(\tilde s,\tilde a,\tilde r) &= (\Gamma - Q)_{\tilde a\tilde a}\,\nabla_{\tilde a}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r)\\
f_{\tilde r}(\tilde s,\tilde a,\tilde r) &= (\Gamma - Q)_{\tilde r\tilde r}\,\nabla_{\tilde r}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r),
\end{aligned} \tag{44}$$

the expression of the $\tilde a$ and $\tilde r$ components of the vector field in terms of the marginalised ergodic density, Eqs. (19) and (20),

$$f_{\tilde a}(\tilde s,\tilde a,\tilde r) = (\Gamma - Q)_{\tilde a\tilde a}\cdot\nabla_{\tilde a}\ln p^*(\tilde s,\tilde a,\tilde r), \tag{45}$$

$$f_{\tilde r}(\tilde s,\tilde a,\tilde r) = (\Gamma - Q)_{\tilde r\tilde r}\cdot\nabla_{\tilde r}\ln p^*(\tilde s,\tilde a,\tilde r), \tag{46}$$

and in terms of free energy, Eqs. (29) and (30):

$$f_{\tilde a}(\tilde s,\tilde a,\tilde r) = (Q - \Gamma)_{\tilde a\tilde a}\cdot\nabla_{\tilde a}F(\tilde s,\tilde a,\tilde r), \tag{47}$$

$$f_{\tilde r}(\tilde s,\tilde a,\tilde r) = (Q - \Gamma)_{\tilde r\tilde r}\cdot\nabla_{\tilde r}F(\tilde s,\tilde a,\tilde r). \tag{48}$$

The Free Energy Lemma then requires that there exists $q(\tilde\Psi|\tilde r)$ such that the KL divergence between it and $p^*(\tilde\Psi|\tilde s,\tilde a,\tilde r)$ vanishes. Without going into further details of the difference between the proof in Friston et al. (2014) and that in Friston (2013), we can prove the former wrong by translating the counterexample used for the latter into generalised coordinates.

Observation 6. There is a general way to translate a system in ordinary coordinates into a system of generalised coordinates that corresponds to an infinite number of independent copies of the original system. This means all properties of the original system (e.g. linearity, ergodicity, the Gaussian and Markovian property of the noise, Conditions 1 and 2, properties of $\Gamma$, $R$, $U$) are preserved during this translation.

Proof. By construction, see Appendix E.

This implies that the counterexamples used in proving Observations 4 and 5 directly translate to the setting of the generalised coordinates. The Free Energy Lemma is therefore also wrong for generalised coordinates and the variational density $q(\tilde\Psi|\tilde r)$ is not “ensured” (Friston et al., 2014) to be equal to the conditional ergodic density $p^*(\tilde\Psi|\tilde s,\tilde a,\tilde r)$.

Conclusion

We found that the two different Markov blanket conditions proposed in Friston (2013, 2019); Parr et al. (2019) are independent of each other. We then showed that under both of those Markov blanket conditions, among the six steps contained in the argument in Friston (2013), three do not hold independently of each other. We also showed that fixing the second of those steps (Step 2) does not provide a valid alternative. The line of reasoning of Friston (2013) therefore does not support its claim that the internal coordinates of a Markov blanket “appear to have solved the problem of Bayesian inference by encoding posterior beliefs about hidden (external) [coordinates],...”. We have also shown that using generalised coordinates as in Friston et al. (2014) does not remedy the situation. Additionally, we identified a technical error in Friston (2019) and an interpretational issue resulting from possibly too strong assumptions (both Conditions 1 and 3) in Parr et al. (2019). We also remarked that the latter publications both argue that it is the most likely internal coordinates given sensory and active coordinates that encode posterior beliefs about external states, instead of the internal coordinates themselves. This is a different proposal that is not subject to our technical critique.

Acknowledgments

All authors are grateful to Karl Friston and Thomas Parr for constructive feedback on an earlier version of this work. MB wants to thank Yen Yu for helpful discussions on generalized coordinates. The work by MB and RK on this publication was made possible through the support of a grant from Templeton World Charity Foundation, Inc. The opinions expressed in this publication are those of the authors and do not necessarily reflect the views of Templeton World Charity Foundation, Inc. FAP acknowledges support from Monash University’s Network of Excellence scheme.

A Counterexamples for Observation 1

Consider a four dimensional linear system obeying Eq. (1) for which there arecoordinates x = ( ψ, s, a, λ ) with n ψ = n s = n a = n λ = 1 and f ( x ) = M x, (49)15ith the parameterisation M =  − m m m m − m m m m − m m m m −  . (50)From Eq. (9), it is clear that the system obeys Condition 1 if m = 0. In thiscase, taking Γ to be the identity matrix, it is possible to show that U ψλ = − m ( m − m + 2)( m + m m − m m − m − m + m − m + 4)( m + 5 m − m m + 4 m + 4) . (51)For ﬁxed, ﬁnite m , this is zero only for a few discrete values of m , such as m = m −

2; that it is generically non-zero proves Eq. (11). As a concreteexample, the following M =  − − / − / − / − − / − / − − / − / − / −  , (52)has (full rank and hence ergodic) U =  /

255 127 / − / − / /

255 274 /

255 206 /

255 31 / /

85 206 /

255 274 /

255 127 / − /

85 31 /

85 127 /

255 236 /  , (53)and hence ergodic density p ∗ ( ψ, s, a, λ ) = r π exp (cid:20) − (cid:0) a + s ) + 118( ψ + λ )+127( ψs + aλ ) + 93( ψa + sλ ) + 206 as − ψλ (cid:1) (cid:21) , (54)which does not conditionally factorise.Taking the same parameterisation as in Eq. (50), and ﬁxing m = m = − / ,we can search for a non-zero value of m that leads to U ψλ = 0 (equivalent toCondition 2 through Eq. (10)). We ﬁnd such a value in the real root c ≃ − . c − c − c + 31 c + 40 c + 3 = 0. That is, with M =  − − / − / c − / − − / cc − / − − / c − / − / −  , (55)which does not satisfy Condition 1, we have U =  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  , (56)which has non-zero determinant (i.e., the dynamics is ergodic) and an ergodicdensity satisfying Condition 2. This proves Eq. (12).16 Counterexample for Step 2

Here, we consider a linear system, as in the previous appendix. We again assumeΓ equal to the identity matrix and choose a force matrix of the form M =  − − √ − √ − √ − − − − √ √ − √ −  (57)which explicitly satisﬁes Condition 1 and has full rank such that the system isergodic. Using Eq. (5) this leads to U =  √ √ √ − √ √ √ − √ √  (58)which shows that this system also satisﬁes Condition 2 since U ψλ = U λψ = 0.We also ﬁnd R =  − √ √ − √ − √ − √ √ − √ − √  , (59)which shows that all entries or R that can be non-zero for an anti-symmetricmatrix are non-zero. For the marginal ergodic density we ﬁnd p ∗ ( s, a, λ ) = 23916 √ π / exp (cid:20) − a − as √ √ aψ − s sψ − ψ (cid:21) (60)The diﬀerence between the right hand sides of Eqs. (17) and (19) is R as ∇ s ln p ∗ ( s, a, λ ) + R aλ ∇ λ ln p ∗ ( s, a, λ ) = 37 a + 69 √ λ − s = 0 , (61)which shows that Eq. (19) is wrong in this example and therefore not generallyequivalent to Eq. (17). Similarly, computing the diﬀerence between the righthand sides of Eqs. (18) and (20), one ﬁnds R λs ∇ s ln p ∗ ( s, a, λ ) + R λa ∇ a ln p ∗ ( s, a, λ ) = 2 a − √ λ + 27 s √ = 0 , (62)and hence Eq. (20) is also incorrect in general.Performing the same comparison for the diﬀerence between the general ex-pression in Eqs. (17) and (18) and the expressions taken from Friston (2019),one ﬁnds R a i s ∇ s ln p ∗ ( s, a, λ ) = 73 (cid:0) a + 552 √ λ − s (cid:1) = 0 (63)17or the diﬀerence between the right hand sides of Eqs. (17) and (21), and R λs ∇ s ln p ∗ ( s, a, λ ) = − (cid:0) a + 552 √ λ − s (cid:1) √ = 0 , (64)for the diﬀerence between the right hand sides of Eqs. (18) and (22). Therefore,Eqs. (21) and (22) are also incorrect in general, even when Condition 1 andCondition 2 both hold. C Counterexamples for Step 3

We saw in Appendix B that Eqs. (19) and (20) are not generally equivalent to Eq. (4), even when Condition 1 and Condition 2 hold simultaneously. We now show that if we instead use Eqs. (17) and (18), which are generally equivalent to Eq. (4), the Free Energy Lemma does not hold in general.

The original Free Energy Lemma requires that (see Eqs. (31) and (32))

$$(\Gamma + R)_{aa}\cdot\nabla_a D_{\mathrm{KL}}[q(\Psi|\lambda)\,\|\,p^*(\Psi|s,a,\lambda)] = 0 \tag{65}$$

$$(\Gamma + R)_{\lambda\lambda}\cdot\nabla_\lambda D_{\mathrm{KL}}[q(\Psi|\lambda)\,\|\,p^*(\Psi|s,a,\lambda)] = 0. \tag{66}$$

Replacing the partial gradient in Eqs. (29) and (30) with the full gradient and including the entire matrix $(\Gamma + R)$ leads to the corresponding requirement for the more general case:

$$\left(R_{as}\cdot\nabla_s + (\Gamma_{aa} + R_{aa})\cdot\nabla_a + R_{a\lambda}\cdot\nabla_\lambda\right) D_{\mathrm{KL}}[q(\Psi|\lambda)\,\|\,p^*(\Psi|s,a,\lambda)] = 0 \tag{67}$$

$$\left(R_{\lambda s}\cdot\nabla_s + R_{\lambda a}\cdot\nabla_a + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda})\cdot\nabla_\lambda\right) D_{\mathrm{KL}}[q(\Psi|\lambda)\,\|\,p^*(\Psi|s,a,\lambda)] = 0. \tag{68}$$

Similarly, the version based on the equations taken from Friston (2019) implies

$$\left((\Gamma_{aa} + R_{aa})\cdot\nabla_a + R_{a\lambda}\cdot\nabla_\lambda\right) D_{\mathrm{KL}}[q(\Psi|\lambda)\,\|\,p^*(\Psi|s,a,\lambda)] = 0 \tag{69}$$

$$\left(R_{\lambda a}\cdot\nabla_a + (\Gamma_{\lambda\lambda} + R_{\lambda\lambda})\cdot\nabla_\lambda\right) D_{\mathrm{KL}}[q(\Psi|\lambda)\,\|\,p^*(\Psi|s,a,\lambda)] = 0. \tag{70}$$

Using the rules of Gaussian integration, we can write the logarithm of the conditional ergodic density as

$$\ln p^*(\psi|s,a,\lambda) = -\tfrac{1}{2}\left|U_{\psi\psi}^{1/2}\psi + U_{\psi\psi}^{-1/2}U_{\psi s}s + U_{\psi\psi}^{-1/2}U_{\psi a}a + U_{\psi\psi}^{-1/2}U_{\psi\lambda}\lambda\right|^2 + C, \tag{71}$$

with $C$ a constant (and remembering that each of $\psi$, $s$, $a$ and $\lambda$ is a vector of coordinates in general).
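The Gaussian conditioning behind Eq. (71) can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes that the ergodic density has the form $p^*(x) \propto \exp(-\tfrac{1}{2}x^\top U x)$, so that $U$ acts as a precision matrix, and it uses a hypothetical $4 \times 4$ matrix `U` with one-dimensional $\psi$, $s$, $a$ and $\lambda$. The function names are illustrative. Under that assumption, Eq. (71) says that $\psi$ given $(s, a, \lambda)$ has mean $-U_{\psi\psi}^{-1}(U_{\psi s}s + U_{\psi a}a + U_{\psi\lambda}\lambda)$ and variance $U_{\psi\psi}^{-1}$; the sketch compares this with standard conditioning on the covariance $U^{-1}$.

```python
import numpy as np

# Hypothetical symmetric, positive-definite precision matrix for x = (psi, s, a, lam).
# The numbers are illustrative only and are NOT taken from the paper.
U = np.array([
    [2.0, -0.5, 0.3, 0.1],
    [-0.5, 1.5, -0.4, 0.2],
    [0.3, -0.4, 1.8, -0.6],
    [0.1, 0.2, -0.6, 1.2],
])


def conditional_from_precision(U, rest):
    """Mean and variance of psi given rest = (s, a, lam), read off from Eq. (71)."""
    U_pp = U[0, 0]
    mean = -(U[0, 1:] @ rest) / U_pp
    return mean, 1.0 / U_pp


def conditional_from_covariance(U, rest):
    """Same quantities via standard Gaussian conditioning on Sigma = U^{-1}."""
    Sigma = np.linalg.inv(U)
    S_pr = Sigma[0, 1:]      # covariance between psi and (s, a, lam)
    S_rr = Sigma[1:, 1:]     # covariance of (s, a, lam)
    mean = S_pr @ np.linalg.solve(S_rr, rest)
    var = Sigma[0, 0] - S_pr @ np.linalg.solve(S_rr, S_pr)
    return mean, var


rest = np.array([0.7, -1.2, 0.4])  # arbitrary values of (s, a, lam)
print(conditional_from_precision(U, rest))
print(conditional_from_covariance(U, rest))
```

Both routes should agree to numerical precision. The precision form is the one used in the text because the blocks $U_{\psi s}$, $U_{\psi a}$ and $U_{\psi\lambda}$ then appear directly in the gradients of the KL divergence.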
We can then expand out the derivatives of the KL divergence18o express them in terms of the coordinates: ∇ s D KL [ q (Ψ | λ ) || p ∗ ( ψ | s, a, λ )] = − Z d ψ q ( ψ | λ ) ∇ s ln p ∗ ( ψ | s, a, λ )= U sψ U − ψψ (cid:0) U ψs s + U ψa a + U ψλ λ + U ψψ h ψ i q (Ψ | λ ) (cid:1) , (72) ∇ a D KL [ q (Ψ | λ ) || p ∗ ( ψ | s, a, λ )] = − Z d ψ q ( ψ | λ ) ∇ a ln p ∗ ( ψ | s, a, λ )= U aψ U − ψψ (cid:0) U ψs s + U ψa a + U ψλ λ + U ψψ h ψ i q (Ψ | λ ) (cid:1) , (73) ∇ λ D KL [ q (Ψ | λ ) || p ∗ ( ψ | s, a, λ )] = Z d ψ h (ln q ( ψ | λ ) − ln p ∗ ( ψ | s, a, λ ) + 1) ∇ λ q ( ψ | λ ) − q ( ψ | λ ) ∇ λ ln p ∗ ( ψ | s, a, λ ) i = U λψ U − ψψ (cid:0) U ψs s + U ψa a + U ψλ λ + U ψψ h ψ i q (Ψ | λ ) (cid:1) + ∇ λ h ψ i q (Ψ | λ ) (cid:0) U ψs s + U ψa a + U ψλ λ (cid:1) + ∇ λ (cid:0) h ψ T U ψψ ψ i q (Ψ | λ ) − H [ q (Ψ | λ )] (cid:1) , (74)with h g ( ψ ) i q (Ψ | λ ) := R d ψ q ( ψ | λ ) g ( ψ ) and H the Shannon entropy.Substituting Eqs. (73) and (74) into Eqs. (65) and (66) leads to(Γ aa + R aa ) U aψ U − ψψ (cid:0) U ψs s + U ψa a + U ψλ λ + U ψψ h ψ i q (Ψ | λ ) (cid:1) = 0 , (75)and 0 =(Γ λλ + R λλ ) U λψ U − ψψ (cid:0) U ψs s + U ψa a + U ψλ λ + U ψψ h ψ i q (Ψ | λ ) (cid:1) + (Γ λλ + R λλ ) ∇ λ h ψ i q (Ψ | λ ) (cid:0) U ψs s + U ψa a + U ψλ λ (cid:1) + (Γ λλ + R λλ ) ∇ λ (cid:0) h ψ T U ψψ ψ i q (Ψ | λ ) − H [ q (Ψ | λ )] (cid:1) . (76)Since these must hold for all values of the coordinates, they put strong require-ments on the U and R matrices. Speciﬁcally,(Γ aa + R aa ) U aψ U − ψψ U ψs = 0 , (77)(Γ aa + R aa ) U aψ U − ψψ U ψa = 0 , (78)(Γ aa + R aa ) U aψ U − ψψ U ψλ = 0 . (79)In other words, since U ψψ and Γ aa must be nonzero for the dynamics to beergodic, it must be that U ψa = 0. Speciﬁcally, consider the system speciﬁedby the force matrix M =  − − − −  (80) This is equivalent to p ∗ (Ψ | s, a, λ ) = p ∗ (Ψ | s, λ ). So if Condition 2 also holds we must have p ∗ (Ψ | s, a, λ ) = p ∗ (Ψ | s ) in order for there to be a suitable q (Ψ | λ ). R =  −

00 0 0 0  (81)and U =  −

00 1 0 0 −

00 0 0 1  . (82)Here M is full rank so the system is ergodic, clearly it also satisﬁes Condition 1due to the structure of M . Since R as = R aλ = R λs = 0 it obeys Eqs. (19)and (20) and since Q ψλ = 0 it also obeys Condition 2. Additionally, we ﬁnd U ψa = − / which is a contradiction.For the more general version, substituting Eqs. (72) to (74) into Eq. (67),one ﬁnds0 = (cid:16) ( R as U sψ + ( R aa + Γ aa ) U aψ + R aλ U λψ ) U − ψψ + R aλ ∇ λ h ψ i q (Ψ | λ ) (cid:17) U ψs s + (cid:16) ( R as U sψ + ( R aa + Γ aa ) U aψ + R aλ U λψ ) U − ψψ + R aλ ∇ λ h ψ i q (Ψ | λ ) (cid:17) U ψa a + (cid:16) ( R as U sψ + ( R aa + Γ aa ) U aψ + R aλ U λψ ) U − ψψ + R aλ ∇ λ h ψ i q (Ψ | λ ) (cid:17) U ψλ λ + ( R as U sψ + ( R aa + Γ aa ) U aψ + R aλ U λψ ) h ψ i q (Ψ | λ ) + R aλ ∇ λ (cid:0) h ψ T U ψψ ψ i q (Ψ | λ ) − H [ q (Ψ | λ )] (cid:1) , (83)which, considering that the coordinates can take any values, implies that( R as U sψ + ( R aa + Γ aa ) U aψ + R aλ U λψ ) U − ψψ + R aλ ∇ λ h ψ i q (Ψ | λ ) (84)lies in a common (left) nullspace of U ψs , U ψa and U ψλ . However, the existenceof such a nontrivial nullspace would imply that the corresponding subspace of ψ coordinates is independent of the s , a and λ coordinates (to see this, considermarginalising over their complement in Eq. (71)). In other words, if only ψ coordinates that play a nontrivial role in the dynamics are considered, thenEq. (67) must imply the quantity in Eq. (84) is zero, and hence that R aλ ∇ λ h ψ i q (Ψ | λ ) = − ( R as U sψ + ( R aa + Γ aa ) U aψ + R aλ U λψ ) U − ψψ . (85)However, through a similar procedure, one ﬁnds that Eq. 
(68) is equivalent20o 0 = (cid:16) ( R λs U sψ + R λa U aψ + (Γ λλ + R λλ ) U λψ ) U − ψψ + (Γ λλ + R λλ ) ∇ λ h ψ i q (Ψ | λ ) (cid:17) U ψs s + (cid:16) ( R λs U sψ + R λa U aψ + (Γ λλ + R λλ ) U λψ ) U − ψψ + (Γ λλ + R λλ ) ∇ λ h ψ i q (Ψ | λ ) (cid:17) U ψa a + (cid:16) ( R λs U sψ + R λa U aψ + (Γ λλ + R λλ ) U λψ ) U − ψψ + (Γ λλ + R λλ ) ∇ λ h ψ i q (Ψ | λ ) (cid:17) U ψλ λ + ( R λs U sψ + R λa U aψ + (Γ λλ + R λλ ) U λψ ) h ψ i q (Ψ | λ ) + (Γ λλ + R λλ ) ∇ λ (cid:0) h ψ T U ψψ ψ i q (Ψ | λ ) − H [ q (Ψ | λ )] (cid:1) , (86)implying that(Γ λλ + R λλ ) ∇ λ h ψ i q (Ψ | λ ) = − ( R λs U sψ + R λa U aψ + (Γ λλ + R λλ ) U λψ ) U − ψψ . (87)Unless R aλ and (Γ λλ + R λλ ) share a common nullspace, or the U and R matricesare ﬁnely tuned, then Eqs. (85) and (87) contradict one another. In this case,there cannot exist a q (Ψ | λ ) that satisﬁes both Eqs. (67) and (68), and hencethe modiﬁed Free Energy Lemma is invalid in general. In particular, using theexample from Appendix B, if we solve Eq. (85) for ∇ λ h ψ i q (Ψ | λ ) we ﬁnd ∇ λ h ψ i q (Ψ | λ ) = − , (88)and from Eq. (87) we get ∇ λ h ψ i q (Ψ | λ ) = 29239 , (89)which is a contradiction.If we now perform the same procedure for Eqs. (69) and (70), we arrive atthe following conditions on the gradient of the variational density: R aλ ∇ λ h ψ i q (Ψ | λ ) = − (( R aa + Γ aa ) U aψ + R aλ U λψ ) U − ψψ . (90)and R aλ ∇ λ h ψ i q (Ψ | λ ) = − ( R λa U aψ + (Γ λλ + R λλ ) U λψ ) U − ψψ . (91)Even when Condition 2 holds and U ψλ = 0, these will be inconsistent in general.As a speciﬁc counterexample, take the system with force matrix M =  − − − − √ −  , (92)21ith correposponding R = 

13 13 √ − − √ − √ √  , (93)and U = 

00 1 0 0 − √ − √

34 34  . (94)This model is ergodic (full rank U ), and it satisﬁes both Condition 1 and Con-dition 2. Moreover, the forces satisfy Eqs. (21) and (22). However, substitutingthe relevant elements of U and R matrices into Eq. (90), we ﬁnd ∇ λ h ψ i q (Ψ | λ ) = 1 √ , (95)but doing the same for Eq. (91) gives ∇ λ h ψ i q (Ψ | λ ) = 13 , (96)which is a contradiction. D Counterexample for Step 5

Here we provide an example system for which Conditions 1 and 2 as well asSteps 1 to 4 are valid but Step 5 fails. We use a system with f ( x ) = M x (97)where M :=  − / / − / / − / / −  . (98)This system is ergodic, satisﬁes Condition 1 and as we will will see satisﬁesEqs. (19) and (20) as well. Using Eq. (5) we ﬁnd R =   (99)and from Eq. (8) U = − M (100)22hich means that Condition 2 is also satisﬁed.This leads to the ergodic density p ∗ ( ψ, s, a, λ ) = √ π e − ( ψ − ψs + s − sa + a − aλ + λ ) (101)which can be used to check that Eqs. (19) and (20) hold for this example. Theconditional ergodic density is p ∗ ( ψ | s, a, λ ) = p ∗ ( ψ | s ) = 1 √ π e − ( ψ − s ) . (102)If we now deﬁne q ( ψ | λ ) = q ( ψ ) = exp( − ( ψ − µ ) / / √ π as a Gaussiandistribution with mean µ and variance one, we can compute the KL divergenceto get: D KL [ q (Ψ) || p ∗ (Ψ | s, a, λ )] = K ( s ) = 12 (cid:18) µ − s (cid:19) . (103)Clearly, for this choice of q ( ψ | λ ) the gradients with respect to a and λ of theKL divergence vanish everywhere (Eqs. (35) and (36) hold). This also means wecan express f a , f λ in terms of a free energy i.e. the Free Energy Lemma holdsfor this system. However, for any proposed bound c ≥ s for which it is exceeded, whatever the choice of µ . Moreover,we can choose a µ such that the KL divergence is larger than any given c , evenwhen s = 0. E Translating systems into generalized coordi-nates systems

We show how to get a generalized coordinate system from a finite dimensional system. By definition the generalized coordinates are infinite dimensional. For all $n \in \mathbb{N}$ and a coordinate $x$ they also include the $n$-th time derivative of $x$.

Assume as given an ergodic, linear, random dynamical system described by

$$\dot x = Mx + \omega \tag{104}$$

where $x = (x_1, \dots, x_k)$ is a $k$-dimensional vector, $M$ is a $k \times k$ real valued matrix, and $\dot x := \frac{d}{dt}x$. We can look at the second time derivative of the state by differentiating both sides:

$$\frac{d}{dt}\dot x = \frac{d}{dt}(Mx + \omega) \tag{105}$$

$$\ddot x = M\dot x + \dot\omega \tag{106}$$

Similarly for the third time derivative:

$$\frac{d}{dt}\ddot x = \frac{d}{dt}(M\dot x + \dot\omega) \tag{107}$$

$$= M\ddot x + \ddot\omega \tag{108}$$

Similarly for all higher derivatives:

$$\frac{d^n}{dt^n}x = M\frac{d^{n-1}}{dt^{n-1}}x + \frac{d^{n-1}}{dt^{n-1}}\omega. \tag{109}$$

Now define the generalized coordinates $\tilde x = (x, x', x'', \dots)$ as

\begin{align}
x &= x \tag{110}\\
x' &= \frac{d}{dt}x \tag{111}\\
x'' &= \frac{d^2}{dt^2}x \tag{112}\\
&\;\;\vdots \tag{113}\\
x^{(n)} &= \frac{d^n}{dt^n}x \tag{114}\\
&\;\;\vdots \tag{115}
\end{align}

Define also

$$\tilde\omega := \left(\omega, \frac{d}{dt}\omega, \frac{d^2}{dt^2}\omega, \dots, \frac{d^n}{dt^n}\omega, \dots\right). \tag{116}$$

Without further clarification, the derivatives of $\omega$ are not well defined when the latter is a Gaussian white noise process, as explicitly assumed in writing the vector field $f(x)$ in terms of the ergodic density (Ao, 2004; Kwon et al., 2005; Kwon and Ao, 2011). As discussed in van Kampen (1981), delta-correlated Markovian noise is always a limiting approximation of noise with a finite correlation time. Meaningfully taking the derivatives requires first choosing a functional form for the (co)variance whose limit is a delta function. However, different choices can lead to vastly different central moments of the generalized noise distribution, including those that vanish or diverge at all orders. In the former case, the process in terms of generalized coordinates may not be ergodic (Cornfeld et al., 1982); in the latter case, the process is not well defined.
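To make the dependence on the chosen covariance concrete, here is a minimal sketch that is not part of the paper. It assumes a stationary, zero-mean Gaussian noise process whose autocovariance is the hypothetical Gaussian-shaped kernel $k(\tau) = \exp(-\tau^2 / 2\ell^2) / (\ell\sqrt{2\pi})$, which has unit integral and tends to a delta function as the correlation time $\ell \to 0$. For a mean-square differentiable stationary process, the variance of the first time derivative is $-k''(0)$; for this particular kernel that is $1/(\ell^3\sqrt{2\pi})$, which diverges in the white-noise limit. The function names are illustrative.

```python
import math


def gaussian_kernel(tau, ell):
    """Assumed autocovariance k(tau): unit integral, delta function as ell -> 0."""
    return math.exp(-tau**2 / (2.0 * ell**2)) / (ell * math.sqrt(2.0 * math.pi))


def derivative_variance(kernel, ell, h=None):
    """Variance of the first derivative, -k''(0), via a central finite difference."""
    h = ell * 1e-3 if h is None else h
    second = (kernel(h, ell) - 2.0 * kernel(0.0, ell) + kernel(-h, ell)) / h**2
    return -second


def exact_derivative_variance(ell):
    """Closed form of -k''(0) for the Gaussian-shaped kernel above."""
    return 1.0 / (ell**3 * math.sqrt(2.0 * math.pi))


# The derivative's variance grows without bound as the correlation time shrinks.
for ell in (1.0, 0.1, 0.01):
    print(ell, derivative_variance(gaussian_kernel, ell), exact_derivative_variance(ell))
```

A different kernel with the same delta-function limit would give different values, which is the sense in which the choice of functional form matters. For example, an exponential kernel $\exp(-|\tau|/\ell)$ is not twice differentiable at $\tau = 0$, so the derivative process has no finite variance for any $\ell$.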
In general, the Kramers-Moyal coefficients will not vanish beyond second order, meaning that an approach for which there is an equivalent Fokker-Planck equation, and hence Eq. (4), is not valid (Risken, 1996). (Another, more direct approach would be in terms of generalized functions, but here too additional information is required to specify the derivatives (Oberguggenberger, 1995).)

Here, we therefore assume that the noise is such that the derivatives in Eq. (116) can be treated as Markov and Gaussian. We also assume that $\frac{d^n}{dt^n}\omega$ is independently and identically distributed to $\frac{d^{n-1}}{dt^{n-1}}\omega$ for all $n$. Finally, we can then define the (infinite) matrix $\bar M$ as the block diagonal matrix with all blocks equal to $M$:

$$\bar M := \begin{pmatrix} M & 0 & \cdots \\ 0 & M & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix} \tag{117}$$

The time derivative of $\omega$ is independent of $\omega$, as the changes are independent of the value of $\omega$. So we actually get an infinite number of independent and identical equations:

$$\dot{\tilde x} = \bar M\tilde x + \tilde\omega. \tag{118}$$

These equations describe a random dynamical system composed of an infinite number of independent linear random dynamical systems, all governed by the same matrix $M$ and driven by independently and identically distributed noise. Since the first of these systems (for the variables $x$) is ergodic by assumption, all of the subsystems are also ergodic and, therefore, the whole system is ergodic with the ergodic density equal to a product of the original ergodic density:

$$\bar p^*(\tilde x) = p^*(x)\,p^*(x')\,p^*(x'') \cdots p^*(x^{(n)}) \cdots. \tag{119}$$

Additionally, if $M$ is such that

$$\begin{aligned}
M_\psi\cdot(\psi,s,a,r)^\top &= f_\psi(\psi,s,a) = (\Gamma - Q)_{\psi\psi}\nabla_\psi\ln p^*(\psi,s,a,r)\\
M_s\cdot(\psi,s,a,r)^\top &= f_s(\psi,s,a) = (\Gamma - Q)_{ss}\nabla_s\ln p^*(\psi,s,a,r)\\
M_a\cdot(\psi,s,a,r)^\top &= f_a(s,a,r) = (\Gamma - Q)_{aa}\nabla_a\ln p^*(\psi,s,a,r)\\
M_r\cdot(\psi,s,a,r)^\top &= f_r(s,a,r) = (\Gamma - Q)_{rr}\nabla_r\ln p^*(\psi,s,a,r),
\end{aligned} \tag{120}$$

(which is the case for the $M$ in the counterexample to Step 5) then for

$$\begin{aligned}
(x_1, x_2, x_3, x_4) &:= (\psi, s, a, r)\\
(x'_1, x'_2, x'_3, x'_4) &:= (\psi', s', a', r')\\
(x''_1, x''_2, x''_3, x''_4) &:= (\psi'', s'', a'', r'')\\
&\;\;\vdots\\
(x^{(n)}_1, x^{(n)}_2, x^{(n)}_3, x^{(n)}_4) &:= (\psi^{(n)}, s^{(n)}, a^{(n)}, r^{(n)})\\
&\;\;\vdots
\end{aligned} \tag{121}$$

$$\bar Q := \begin{pmatrix} Q & 0 & \cdots \\ 0 & Q & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \tag{122}$$

$$\bar\Gamma := \begin{pmatrix} \Gamma & 0 & \cdots \\ 0 & \Gamma & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \tag{123}$$

and using Eq. (8) and that the inverse of a block diagonal matrix is block diagonal

$$\bar U := \begin{pmatrix} U & 0 & \cdots \\ 0 & U & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}, \tag{124}$$

we also have:

$$\begin{aligned}
\bar M_{\tilde\psi}\cdot(\tilde\psi,\tilde s,\tilde a,\tilde r)^\top &= f_{\tilde\psi}(\tilde\psi,\tilde s,\tilde a) = (\bar\Gamma - \bar Q)_{\tilde\psi\tilde\psi}\nabla_{\tilde\psi}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r)\\
\bar M_{\tilde s}\cdot(\tilde\psi,\tilde s,\tilde a,\tilde r)^\top &= f_{\tilde s}(\tilde\psi,\tilde s,\tilde a) = (\bar\Gamma - \bar Q)_{\tilde s\tilde s}\nabla_{\tilde s}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r)\\
\bar M_{\tilde a}\cdot(\tilde\psi,\tilde s,\tilde a,\tilde r)^\top &= f_{\tilde a}(\tilde s,\tilde a,\tilde r) = (\bar\Gamma - \bar Q)_{\tilde a\tilde a}\nabla_{\tilde a}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r)\\
\bar M_{\tilde r}\cdot(\tilde\psi,\tilde s,\tilde a,\tilde r)^\top &= f_{\tilde r}(\tilde s,\tilde a,\tilde r) = (\bar\Gamma - \bar Q)_{\tilde r\tilde r}\nabla_{\tilde r}\ln p^*(\tilde\psi,\tilde s,\tilde a,\tilde r).
\end{aligned} \tag{125}$$

The ergodic density of such a system is a product of ergodic densities of the original system Eq. (102):

$$\bar p^*(\tilde\psi,\tilde s,\tilde a,\tilde r) = p^*(\psi,s,a,r)\,p^*(\psi',s',a',r')\,p^*(\psi'',s'',a'',r'') \cdots. \tag{126}$$

Thus any property of the original system is also a property of the generalized coordinate system.

References

Ao, P. (2004). Potential in stochastic differential equations: novel construction. Journal of Physics A: Mathematical and General, 37(3):L25–L30.

Cornfeld, I. P., Fomin, S. V., and Sinai, Y. G. (1982). Ergodic Theory. Springer-Verlag, New York.

Cover, T. M. and Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience, Hoboken, N.J.

Friston, K. (2013). Life as we know it. Journal of The Royal Society Interface, 10(86):2013.0475.

Friston, K. (2019). A free energy principle for a particular physics. arXiv:1906.10184 [q-bio]. arXiv: 1906.10184.

Friston, K., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., and Pezzulo, G. (2015). Active inference and epistemic value. Cognitive Neuroscience, 6(4):187–214.

Friston, K., Sengupta, B., and Auletta, G. (2014). Cognitive Dynamics: From Attractors to Active Inference. Proceedings of the IEEE, 102(4):427–445.

Kwon, C. and Ao, P. (2011). Nonequilibrium steady state of a stochastic system driven by a nonlinear drift force. Physical Review E, 84:061106.

Kwon, C., Ao, P., and Thouless, D. J. (2005). Structure of stochastic dynamics near fixed points. Proceedings of the National Academy of Sciences, 102(37):13029–13033.

Oberguggenberger, M. (1995). Generalized functions and stochastic processes. In Bolthausen, E., Dozzi, M., and Russo, F., editors, Seminar on Stochastic Analysis, Random Fields and Applications. Progress in Probability, vol 36, pages 215–230. Birkhäuser, Basel.

Parr, T., Da Costa, L., and Friston, K. (2019). Markov blankets, information geometry and stochastic thermodynamics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 378(4):2019.0159.

Risken, H. (1996). The Fokker-Planck Equation: Methods of Solution and Applications. Springer, Berlin, Heidelberg.

van Kampen, N. G. (1981).