Limits on Inferring the Past
Nathaniel Rupprecht and Dervis C. Vural∗
University of Notre Dame
(Dated: May 14, 2018)

Here we define and study the properties of retrodictive inference. We derive equations relating retrodiction entropy and thermodynamic entropy, and as a special case, show that under equilibrium conditions the two are identical. We demonstrate relations involving the KL divergence and retrodiction probability, and bound the time rate of change of retrodiction entropy. As a specific case, we invert various Langevin processes, inferring the initial condition of N particles given their final positions at some later time. We evaluate the retrodiction entropy for Langevin dynamics exactly for special cases, and find that one's ability to infer the initial state of a system can exhibit two possible qualitative behaviors depending on the potential energy landscape: either decreasing indefinitely, or asymptotically approaching a fixed value. We also study how well we can retrodict points that evolve based on the logistic map. We find singular changes in the retrodictability near bifurcations. Counterintuitively, the transition to chaos is accompanied by maximal retrodictability.

INTRODUCTION
Many astonishing facts about the origin of the universe, the evolution of life, or the history of civilizations will never be directly observed, but will only be inferred in the light of their manifestations in the present. Evolved forward in time, any state of knowledge, regardless of how exact, will invariably deteriorate into an entropy-maximizing probability distribution [1–4]. How rapidly does our knowledge of the past, as inferred from a measurement made in the present, deteriorate going backwards in time?

While methods exist for inferring the origin of an observed final state [5–8], or for inferring some original data after it has been corrupted [9, 10], we know little about how accurately the initial state of a many-body system can be characterized given its present state, how quickly a system forgets its initial state due to thermal fluctuations, and how the limit on our ability to infer the past depends on system parameters. The answers to these questions should lie in non-equilibrium statistical mechanics, where thermal motion is incorporated into mechanical laws [11–13]. In systems where thermal collisions erase the information pertaining to past states of particles, the Fokker-Planck equation constitutes the groundwork of nonequilibrium analysis [14–20].

Here we determine the theoretical limits to inferring the initial state of a system, to which we refer as "retrodiction" – in contrast to prediction. We quantify the quality of retrodiction in terms of retrodiction entropy, S_R. We derive a relationship between thermodynamic entropy and retrodiction entropy, and report a lower bound on its generation rate. Then, to apply these ideas to a specific problem, we consider a collection of particles coupled to a thermal bath, and obtain the time dependence of S_R in convex, concave and flat potentials. To establish whether chaos fundamentally influences retrodictability, we also investigate the retrodiction entropy of the logistic map as it transitions from the non-chaotic regime to the chaotic regime. Finally, we conclude our discussion with a comparison of retrodiction entropy to other inverse statistical methods and methods for comparing predictability and retrodictability.

DEFINITIONS AND NOTATION
Our system consists of a set of states Ω, a prior distribution on the set of states, P, and a "transition probability" function T. The state space Ω will depend on the problem at hand; it could for example be the space of all possible positions and velocities of a collection of particles (i.e. phase space). The prior distribution specifies how the system will be initialized: P(α) is the probability that the system will be prepared in the state α ∈ Ω. The transition probability T(ω|α; t) is the probability that the system ends in the state ω ∈ Ω given that it started in the state α ∈ Ω and evolved for a time t. We will generally suppress the time variable.

The probability R(α|ω; t) = R_ω(α) that the initial state was α given the final state ω is given by Bayes' theorem,

R(α|ω; t) = T(ω|α; t) P(α) / P_t(ω) = T(ω|α; t) P(α) / Σ_{α'} T(ω|α'; t) P(α'),   (1)

where P_t is the prior distribution P evolved forward in time. R would typically be called the likelihood or the posterior distribution. In the present context, we will refer to it as the retrodiction probability, and define the entropy associated with it as the retrodiction entropy,

S_R(ω) = −Σ_α R_ω(α) log R_ω(α).   (2)

Intuitively, the larger S_R(ω) is, the less accurately the initial state can be inferred given a measurement of the final state, ω.

Note that S_R is a function of the final state observed after a single realization of a stochastic process. If the process were to be run again, the particles would end up elsewhere, and have a different S_R associated with that final state. As such, it will be useful to define S_R averaged over all possible final measurements, ⟨S_R⟩.

A related quantity of interest is the Kullback-Leibler (KL) divergence, D(p‖q) = Σ_x p(x) log[p(x)/q(x)], which measures how much a distribution p(x) differs from a reference distribution q(x) [21]. Thus, another useful measure of retrodictability is the KL divergence D(R_ω‖P) between R and P, which quantifies the amount of information gained over the prior upon a measurement. As our ability to infer the past decreases, the retrodiction probability coincides more with the prior probability, and the KL divergence decreases. Ultimately, D(R_ω‖P) = 0 when the measurement ω provides no additional information regarding the initial state beyond what we already knew: the prior, P.
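As a concrete illustration of Eqs. (1) and (2), the following minimal sketch (ours, not part of the original analysis; the three-state transition matrix and prior are arbitrary placeholders) computes the retrodiction probability and the retrodiction entropy for a discrete system:

```python
import numpy as np

# Toy discrete system with 3 states; the dynamics are arbitrary placeholders.
# T[w, a] = T(w | a): probability of ending in state w given initial state a.
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
P = np.array([0.5, 0.3, 0.2])      # prior P(a) over initial states

Pt = T @ P                         # evolved prior, P_t(w) = sum_a T(w|a) P(a)

# Retrodiction probability, Eq. (1): R[w, a] = T(w|a) P(a) / P_t(w)
R = T * P[None, :] / Pt[:, None]

def entropy(p):
    """Shannon entropy, skipping zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Retrodiction entropy, Eq. (2), for each possible final measurement w
S_R = np.array([entropy(R[w]) for w in range(3)])
print(S_R)    # the larger S_R(w), the harder the initial state is to infer
```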
Notation

Throughout, we denote the average over all free parameters by ⟨·⟩. However, there are two different types of averages indicated by this notation: averages over the distribution on initial states, and averages over the distribution on final states. When we average over quantities where the free variable ranges over initial states, we use a probability weight P for each such free variable. For quantities where the free variable ranges over final states, we use a probability weight P_t for each such free variable. In the case where there are multiple states being averaged over, we include a subscript to indicate each free variable to be averaged over. For example,

⟨S_R⟩ = Σ_ω P_t(ω) S_R(ω),
⟨S_T⟩ = Σ_α P(α) S_T(α),
⟨D(T_{α₁}‖T_{α₂})⟩ = Σ_{α₁,α₂} P(α₁) P(α₂) D(T_{α₁}‖T_{α₂}).

GENERAL PROPERTIES OF RETRODICTION

Relation between retrodiction and thermodynamics
To facilitate readability onwards, we expose only the crucial steps in the main text, leaving the proofs and derivations to the appendices.

Our first key result is the relationship between retrodiction entropy and thermodynamic entropy,

⟨S_R⟩ = ⟨S_T⟩ − (S_t − S).   (3)

Here ⟨S_T⟩ is the average entropy associated with the transition probability T_α(ω), whereas S and S_t are the entropies associated with the prior probability P and the observation probability P_t. Eq. (3) relates our ability to infer the past, ⟨S_R⟩, to our ability to predict the future, ⟨S_T⟩ and S_t. This identity is derived in Appendix A.

Note that Eq. (3) holds for processes both in and out of equilibrium, and provides useful insights on the general properties of S_R. For short times, P_t ≃ P, so lim_{t→0} ⟨S_R⟩/⟨S_T⟩ = 1. For long times, if the system converges to a stationary distribution P_∞ (as is the case in a bounded space or trapping potential), then P_t and T_α(ω) must approach P_∞ independent of the starting state, and (3) implies lim_{t→∞} ⟨S_R⟩ = S, i.e. we cannot guess the initial state any better than using whatever we already knew before making the measurement.

As another interesting special case, we consider what happens if the prior probability P coincides with the stationary state probability P_∞ (assuming one exists). Then S_t = S for all times t, and (3) implies

⟨S_R(t)⟩ = ⟨S_T(t)⟩.   (4)

For example, if we are inferring the past of a system in equilibrium, we would draw the initial state of the system out of the equilibrium distribution, i.e. use P(s) = e^{−βE(s)}/Z as the prior probability, measure the positions of some particles, and ask where they used to be. Eq. (4) tells us that in equilibrium, the rate of thermodynamic entropy generation and retrodiction entropy generation is the same. Our ability to predict the future fades at exactly the same rate as our ability to infer the original state of the system.

No such correspondence need hold for non-equilibrium processes. For a system with equilibrium entropy S_eq, if S > S_eq then S_t will decrease from S at t = 0 to S_eq as t → ∞. Thus ⟨S_R⟩ > ⟨S_T⟩. In this case, we know that particles will gather, so we know better where they will be in the future than where they were originally. In contrast, if S < S_eq, S_t will increase in time and ⟨S_R⟩ < ⟨S_T⟩. Here, we know more about where the particles were originally than where they will be in the future. To sum up, the more certain we can be about the state of the system in the future, the less certain we are about where the system started out in the past.
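Identity (3) is also easy to verify numerically. The sketch below (ours; the transition matrix and prior are randomly generated placeholders) checks ⟨S_R⟩ = ⟨S_T⟩ − (S_t − S) for a random five-state system:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                       # number of states (arbitrary)

# Random column-stochastic transition matrix T[w, a] and random prior P(a).
T = rng.random((n, n)); T /= T.sum(axis=0)
P = rng.random(n);      P /= P.sum()

Pt = T @ P                                  # observation probability P_t
R = T * P / Pt[:, None]                     # retrodiction probability R[w, a]

H = lambda p: -np.sum(p * np.log(p))        # Shannon entropy (entries > 0 here)

S      = H(P)                               # prior entropy
S_t    = H(Pt)                              # observation entropy
avg_ST = sum(P[a] * H(T[:, a]) for a in range(n))    # <S_T>
avg_SR = sum(Pt[w] * H(R[w]) for w in range(n))      # <S_R>

assert np.isclose(avg_SR, avg_ST - (S_t - S))        # Eq. (3)
```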
Experimental measurement of retrodictability

It is instructive to view (3) from a practical, empirical perspective. Consider a system of particles evolving in a potential energy landscape U(x⃗) while coupled to a heat bath. Can we estimate bounds on ⟨S_R⟩ without knowing the microscopic dynamics of the system (e.g. the inter-particle interactions) or the potential energy landscape, but only using thermodynamic measurements?

This is possible under certain conditions. We can initialize a system such that particles are in state α with probability P(α), let the particles evolve for a time t, calorimetrically obtain the change in thermodynamic entropy via ΔS_α = ∫_α dQ/T, and then average this over multiple instances to obtain ⟨S_T⟩_s (the sample average of entropy). The identity dS = dQ/T holds when the system moves along a reversible path. While it is not trivial to measure S_t for processes out of equilibrium, we can use the equilibrium result ⟨S_T⟩ = ⟨S_R⟩ (Eq. 4) and the second law to place an upper bound on the average retrodiction entropy for any process (in or out of equilibrium), ⟨S_R⟩ < ⟨S_T⟩_s + S.

Under special conditions, we can do better than an inequality. If the prior distribution is uncorrelated, P(x₁, ..., x_N) = p(x₁)p(x₂)...p(x_N), and if interactions between particles are negligible, then

P_t(y₁, ..., y_N) = Π_{k=1}^{N} ( Σ_{x_k} T(y_k|x_k; t) p(x_k) ) ≡ Π_k q(y_k).

Since each term in this product is independent, the entropy is extensive: S_t = N H[q] and S = N H[p]. Thus, an experimentalist can measure S_t − S by preparing a gas of M ≫ 1 particles with density M p(x), allowing the particles to evolve for a time t, and again calorimetrically integrating ΔS = ∫ dQ/T to obtain S_t − S ≃ N ΔS/M. Note that since M ≫ 1, ΔS will be deterministic. Thus from (3) the retrodictability becomes a difference of two entropy measurements,

⟨S_R⟩ = ⟨ΔS⟩_s − (N/M) ΔS.   (5)

The first term on the right is measured by initializing particles individually at α with probability P(α) and averaging over all outcomes, whereas the second term comes from a single-shot measurement of a gas initialized with density M P(α). We emphasize that this experimental protocol to obtain (5) will be valid only when inter-particle interactions are negligible and for an uncorrelated prior; but as long as these assumptions hold, ⟨S_R⟩ can be known by performing only thermodynamic measurements, without needing to know the underlying potential or microscopic dynamics.

Continuous space and divergence relations
For a continuous state space, we may consider S_R to be a differential entropy, which is not invariant under a change of variables. In contrast, D(R_ω‖P) is invariant under changes of variables, and therefore may be a more desirable measure. We derive, in a similar manner to (3),

⟨D(R_ω‖P)⟩ = S_t − ⟨S_T⟩ = S − ⟨S_R⟩.

Markovian stochastic processes are known to have a KL divergence that is non-increasing in time [21]. Thus we are motivated to ask how the KL divergence between two forward processes T_α compares to the KL divergence between two retrodiction probabilities R_ω. First, we show (cf. Appendix B)

⟨D(R_{ω₁}‖R_{ω₂})⟩ = ⟨D(P_t‖T_α)⟩ + S_t − ⟨S_T⟩,
⟨D(P_t‖T_α)⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩ + ⟨S_T⟩ − S_t.

Combining these gives us the relationship

⟨D(R_{ω₁}‖R_{ω₂})⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩.   (6)

Thus, the average divergence between different retrodiction probability distributions is exactly equal to the average divergence between different forward distributions (cf. Appendix B). Taking the time derivative of both sides tells us that the average rate of change is the same for forward and reverse probabilities, and that this quantity is non-increasing [21]. In Appendix B, we list all the KL divergence relations between the distributions T, R, P, and P_t.
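The same toy setup confirms Eq. (6) directly; in the sketch below (ours, again with randomly generated placeholder distributions), the two averaged divergences agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
T = rng.random((n, n)); T /= T.sum(axis=0)   # T[w, a] = T(w|a)
P = rng.random(n);      P /= P.sum()
Pt = T @ P
R = T * P / Pt[:, None]                      # R[w, a]

KL = lambda p, q: np.sum(p * np.log(p / q))

# <D(R_w1 || R_w2)>: averaged over final states with weight P_t x P_t
D_RR = sum(Pt[w1] * Pt[w2] * KL(R[w1], R[w2])
           for w1 in range(n) for w2 in range(n))
# <D(T_a1 || T_a2)>: averaged over initial states with weight P x P
D_TT = sum(P[a1] * P[a2] * KL(T[:, a1], T[:, a2])
           for a1 in range(n) for a2 in range(n))

assert np.isclose(D_RR, D_TT)                # Eq. (6)
```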
Lower bound to retrodiction entropy generation

We can establish a lower bound on the time rate of change of retrodiction entropy in terms of forward entropies and KL divergences. Differentiating (3) and using the convexity of −log gives us an upper bound on the rate of change of S_t (cf. Appendix C),

Ṡ_t ≤ ⟨Ṡ_T⟩ + ∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ − ⟨∂/∂t D(P‖R_ω)⟩.   (7)

Using the theorem on Markov processes, we know that the second term in (7) is ≤ 0. The last term in (7) measures the divergence between the prior state and the retrodiction probability, which should decrease with time as the reconstructed probability approaches the prior. Rearranging (7), we get

∂⟨S_R⟩/∂t ≥ −∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ + ⟨∂/∂t D(P‖R_ω)⟩.

Information theoretical interpretation
From an information theoretic point of view, retrodiction entropy is the amount of information required to specify in which state the system was initialized, given an observation of its final state. The KL divergence between the retrodiction probability R_ω and the prior distribution P is a measure of how much information has been gained by making a measurement (above and beyond the information contained in the prior). The KL divergence is asymmetric in its arguments, D(R_ω‖P) ≠ D(P‖R_ω). However, there is a good reason for preferring D(R_ω‖P) over D(P‖R_ω). Letting X, X_t be the random variables for the configuration at times 0 and t, it can be shown that ⟨D(R_ω‖P)⟩ = I(X; X_t), where I(·,·) is the mutual information. In other words, the average KL divergence between retrodiction probabilities and the prior is the mutual information between the initial and final states of the system. We can use this and our other formulas to write retrodiction entropy in terms of mutual information,

⟨S_R⟩ = S − I(X; X_t) = H(X) − I(X; X_t).   (8)

While it is impossible to evaluate quantities like D(R_ω‖P) or S_R(ω) for a specific ω without being given a specific problem (and being able to evaluate the transition probabilities for that problem), Eqs. (1) to (4) and (6) to (8) hold true quite generally, for any system in or out of equilibrium.
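Eq. (8) can likewise be checked by computing the mutual information directly from the joint distribution Pr(X = α, X_t = ω) = P(α) T(ω|α). A short sketch (ours, with placeholder distributions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
T = rng.random((n, n)); T /= T.sum(axis=0)   # T[w, a] = T(w|a)
P = rng.random(n);      P /= P.sum()

J  = T * P                                   # joint J[w, a] = Pr(X_t = w, X = a)
Pt = J.sum(axis=1)                           # marginal over final states

H = lambda p: -np.sum(p * np.log(p))
I = H(P) + H(Pt) - H(J.ravel())              # mutual information I(X; X_t)

R = J / Pt[:, None]                          # retrodiction probability R[w, a]
avg_SR = sum(Pt[w] * H(R[w]) for w in range(n))

assert np.isclose(avg_SR, H(P) - I)          # Eq. (8)
```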
RETRODICTION OF BROWNIAN PARTICLES IN A POTENTIAL

Following these general results, we now study a specific physical system: the retrodiction entropy of Brownian particles diffusing in a potential. The position of the k-th particle will be written as x^(k) = {x^(k)_α}, where α indexes the coordinates, and the initial position will be written as y = {y_α}. In other words, Latin superscripts index particles 1, ..., N, while Greek subscripts indicate their coordinates, 1, ..., d.

Suppose N particles are released at the same position at t = 0 and evolve in a potential U(x⃗) according to Langevin dynamics. The evolution of the state probability distribution p(x⃗, t) is governed by the general Fokker-Planck equation,

∂p(x⃗, t)/∂t = Σ_{α,β} ∂²[D_{αβ}(x⃗, t) p(x⃗, t)]/∂x_α∂x_β − Σ_α ∂[μ_α(x⃗, t) p(x⃗, t)]/∂x_α,

where μ_α(x⃗, t) is a drift term and D_{αβ}(x⃗, t) is the diffusion tensor. Since particles are independent and follow identical transition rules, the probability that N particles starting at state y end in states x^(1), ..., x^(N) is

T(x^(1), ..., x^(N)|y; t) = Π_{k=1}^{N} p(x^(k)|y; t).   (9)

The retrodiction probability R(y|x^(1), ..., x^(N)) is then the probability that the initial position of the cluster of particles was y given the N observed final positions {x^(k)}.
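For concreteness, here is a minimal simulation sketch (ours; N, D, θ, the time step, and the release point y0 are arbitrary placeholders) of the forward process that is being inverted: N independent overdamped Langevin particles released from a common point in a harmonic potential U(x) = θx², integrated with the Euler-Maruyama scheme:

```python
import numpy as np

rng = np.random.default_rng(3)

# Overdamped Langevin dynamics, dx = -U'(x) dt + sqrt(2 D) dW, in one dimension,
# for N independent particles released from the same point y0.
N, D, theta = 5, 1.0, 0.5          # placeholder parameters
dt, steps   = 1e-3, 2000
y0 = 0.3                           # common (unknown, to-be-retrodicted) start

x = np.full(N, y0)
for _ in range(steps):
    drift = -2.0 * theta * x       # -U'(x) for U(x) = theta * x**2
    x += drift * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal(N)

print(x)   # final positions x^(1), ..., x^(N): the data from which y0 is inferred
```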
Retrodiction Entropy of a Gaussian process

Consider a process with (individual) probability distributions

p(x^(k)|y; t) = Π_{α=1}^{d} exp[−(x^(k)_α − λ_α(t) y_α)²/D_α(t)] / √(πD_α(t)).   (10)

Here, the transition probability T(x^(1), ..., x^(N)|y) is

T = Π_{α=1}^{d} exp[−Σ_{k=1}^{N} (x^(k)_α − λ_α(t) y_α)²/D_α(t)] / (πD_α(t))^{N/2},   (11)

since all particles start at the same point y. Note that we allow the generalized diffusion and drift to be different in every dimension α. Suppose the prior probability for the initial position of the cluster of particles is Gaussian, centered at the origin,

P(y) = Π_{α=1}^{d} (2πσ_α²)^{−1/2} exp[−y_α²/(2σ_α²)].   (12)

The observation probability of a configuration is then

P_t(x^(1), ..., x^(N); t) = Π_{α=1}^{d} (2π^N σ_α² D_α(t)^N)^{−1/2} √(D_α(t)κ_α(t)/(Nλ_α(t)²)) exp[−(N/D_α(t))(⟨x_α²⟩ − κ_α(t)⟨x_α⟩²)],   (13)

where κ_α(t) = [1 + D_α(t)/(2Nσ_α²λ_α(t)²)]^{−1} and ⟨x_α^n⟩ = Σ_{k=1}^{N} [x^(k)_α]^n/N. From this and T, P, we can evaluate the retrodiction probability

R(y|x^(1), ..., x^(N); t) = Π_{α=1}^{d} √(Nλ_α(t)²/(πD_α(t)κ_α(t))) exp[−(Nλ_α(t)²/(D_α(t)κ_α(t))) (y_α − (κ_α(t)/λ_α(t))⟨x_α⟩)²].

As this is a Gaussian distribution, it is straightforward to evaluate its entropy, the retrodiction entropy,

S_R = (1/2) log[(πe/N)^d Π_{α=1}^{d} D_α(t)/(λ_α(t)² + D_α(t)/(2σ_α²N))].   (14)

Note that in the limit σ_α → ∞ in all directions, we obtain the case of a uniform (non-normalizable) prior over all space. In this case, or in the case that the σ's are finite and particles are "scattered off" by external forces, i.e. λ_α(t) → ∞ as t → ∞, the retrodiction entropy is

S_R = (d/2) log[πe D_GM(t)/(N λ_GM(t)²)],

where the subscript "GM" indicates a geometric mean over the different directions α. The individual entropies of the distributions T, P, and P_t are listed in Appendix A, which also serves to verify (3).

FIG. 1.
Retrodiction Entropy Generation. Left: The retrodiction entropy S_R of five particles in a convex, flat and concave potential U(x⃗) with a uniform prior. S_R quantifies how poorly the initial state of the particles can be inferred, backwards in time. Free and trapped particles forget their origin monotonically, whereas particles dispersing in a concave potential remember their past no matter how much time passes. Right: An analogous plot, but with a Gaussian prior instead of a uniform prior. Free and trapped particles saturate to maximum retrodiction entropy, whereas particles in a concave potential still remember their past, just as in the case of a uniform prior.
Convex and concave potentials
Two processes that have analytical solutions to the Fokker-Planck equation are the Wiener and Ornstein-Uhlenbeck processes, describing Brownian particles in flat, U(x⃗) = α + β⃗·x⃗, and parabolic, U(x⃗) = α + β⃗·x⃗ + θx⃗², potentials. We evaluate the retrodiction entropy for these special cases, and find that it diverges for particles random-walking in flat and convex potentials (θ ≥ 0), indicating that the system steadily forgets its past. In contrast, concave (θ ≤ 0) potentials have a retrodiction entropy that asymptotically approaches a constant less than S_P, the entropy of the prior, indicating that the system always retains memory of its initial state (see Fig. 1).

The distribution of a free Brownian particle is

p(x|y; t) = Π_{α=1}^{d} (4πD_α t)^{−1/2} exp[−(x_α − y_α)²/(4D_α t)].

In this case, the functions in (10) are D_α(t) = 4D_α t and λ_α(t) = 1. Thus,

S_R = (1/2) log[(4πe)^d Π_{α=1}^{d} σ_α² D_α t/(2D_α t + σ_α² N)].

FIG. 2.
The Role of Convexity. Left: Prior entropy S, average thermodynamic entropy ⟨S_T⟩, and observational entropy S_t for two particles in a harmonic trap, as derived in Appendix A. The retrodiction entropy is related to the other three through ⟨S_R⟩ = ⟨S_T⟩ − (S_t − S). The prior distribution is Gaussian with σ = 5. Right: S_R for the Ornstein-Uhlenbeck process at different times with a Gaussian prior, σ = 5. For positive convexity, S_R converges to S; for negative convexity, S_R converges to some smaller value, meaning some information can still be recovered.

In the limit of σ_α → ∞, S_R increases at a logarithmic rate at all times. If the σ's are finite, then at long times S_R → (d/2) log[2πe σ_GM²], which is just the entropy of the prior distribution P. For short times, we have S_R ∼ (d/2) log(4πe D_GM t/N).

Next, we consider Brownian particles in a convex or concave harmonic potential, U(x⃗) = θx⃗², described by the Ornstein-Uhlenbeck process. The probability distribution given an initial position y is

p(x|y; t) = Π_{α=1}^{d} [2πD_α θ^{−1}(1 − e^{−2θt})]^{−1/2} exp[−θ(x_α − y_α e^{−θt})²/(2D_α(1 − e^{−2θt}))],

meaning that D_α(t) = 2D_α θ^{−1}(1 − e^{−2θt}) and λ_α(t) = e^{−θt}. Thus,

S_R = (1/2) log[(2πe)^d Π_{α=1}^{d} σ_α² D_α(1 − e^{−2θt})/(σ_α² Nθ e^{−2θt} + D_α(1 − e^{−2θt}))].

In the limit of infinite σ's, we get two very different long-time behaviors depending on the sign of θ. For θ > 0, as t → ∞, S_R ∼ dθt. For θ < 0, we have a potential that tends to quickly force particles away from the origin. In this case,

S_R = (d/2) log[2πe (1 − e^{−2|θ|t}) D_GM/(N|θ|)].

Therefore, as t → ∞, S_R ∼ const. − (d/2) e^{−2|θ|t}. Thus, after some initial transient loss of information, our ability to reconstruct the initial state plateaus, i.e. the system always retains information about its initial state for arbitrarily long times (see Fig. 1a). For finite σ's, S_R has three distinct temporal regimes: it starts logarithmic, crosses over to linear, and then finally saturates to S (see Figs. 1b, 2).

In Fig. 1a, we have plotted the average retrodiction entropy as a function of time for five particles in potentials with various concavities (θ parameters, U(x) = θx²). The prior is a non-normalizable uniform prior. The process is an Ornstein-Uhlenbeck process when θ ≠ 0, and the Wiener process when θ = 0. For concave potentials (in blue), the retrodiction entropy converges to a finite value. For a potential with θ = 0, we recover the Wiener process, and S_R increases logarithmically. For convex potentials, S_R is asymptotically linear, diverging much more quickly than for the Wiener process.

In Fig. 1b, we have shown the analogous plot, but for a Gaussian prior. For concave potentials, the retrodiction entropy still saturates to a value below the prior entropy. For convex potentials, the retrodiction entropy starts logarithmic, becomes linear, and then quickly saturates to S. For the Wiener process, the retrodiction entropy does eventually approach the value of S, though very slowly – at t = 1000, it is still 2.5% away from S.

In Fig. 2a, we show the time dependence of the entropies S, ⟨S_T⟩, S_t, and ⟨S_R⟩ for two particles in a convex potential with a Gaussian prior. This illustrates the fact that ⟨S_R⟩ = ⟨S_T⟩ − (S_t − S). The linear behavior of S_R in the intermediate regime can be seen before it exponentially approaches the value of the entropy of the prior, S.

In Fig. 2b, we plot the average retrodiction entropy of the Ornstein-Uhlenbeck process at specific times, starting with a Gaussian prior. In the long time limit, if the convexity is positive, the retrodiction entropy approaches the entropy of the prior distribution, S; hence the black line is flat for all θ ≥ 0. However, if the convexity is negative (a concave potential), we can see that the retrodiction entropy converges to a value less than S. This indicates that by making a measurement, we gain information about the initial state of the system even after arbitrarily long times.
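The closed forms above are straightforward to tabulate. The sketch below (ours; d = 1, a uniform prior σ → ∞, and placeholder values of D, N, θ) evaluates S_R = (d/2) log[πe D(t)/(N λ(t)²)] for the Wiener and Ornstein-Uhlenbeck cases, reproducing the three qualitative behaviors of Fig. 1a:

```python
import numpy as np

D, N, d = 1.0, 5, 1        # placeholder diffusion constant, particle count, dims

def S_R(t, theta):
    """Retrodiction entropy with a uniform prior: (d/2) log[pi e D(t) / (N lam^2)]."""
    if theta == 0.0:                       # Wiener process
        Dt, lam = 4.0 * D * t, 1.0
    else:                                  # Ornstein-Uhlenbeck process, U = theta x^2
        Dt  = 2.0 * D * (1.0 - np.exp(-2.0 * theta * t)) / theta
        lam = np.exp(-theta * t)
    return 0.5 * d * np.log(np.pi * np.e * Dt / (N * lam**2))

for theta in (0.5, 0.0, -0.5):
    print(theta, [round(S_R(t, theta), 2) for t in (1.0, 10.0, 100.0)])
# theta > 0: S_R ~ d*theta*t (diverges linearly); theta = 0: ~ (d/2) log t;
# theta < 0: saturates to a constant, i.e. the initial state is never forgotten.
```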
RETRODICTION OF A CHAOTIC SYSTEM

To study how chaos relates to retrodictability we consider the simplest of chaotic systems, the logistic map,

X_t = r X_{t−1}(1 − X_{t−1}),

characterized by a single parameter r which determines whether the system is chaotic. Our key result here is somewhat counter-intuitive: we find that the system is maximally retrodictable right before and right after it transitions into chaos.

The asymptotic properties of the logistic map are well known [22]. The values X_n takes as n tends to infinity, i.e. the attractors, are shown in the bifurcation diagram (Fig. 3a). For small values of r, the trajectories are periodic. As r is increased, there is a sequence of period doublings (cf. Fig. 3, blue vertical dashes) until the system transitions to chaos at r ≃ 3.57 (red vertical dashes). Within the chaotic regime, there are occasional islands of stability where periodic attractors exist. For example, at r ≃ 3.83 there is a period-3 attractor (green vertical dashes).

Since the logistic map is purely deterministic, in order to define probabilities and entropies we suppose that the state of the system cannot be measured with infinite accuracy – similar to how probability and entropy arise in classical statistical mechanics. To avoid artifacts stemming from the precise details of coarse graining, we pick very small bins with randomized positions. Specifically, we coarse grain the interval [0, 1] randomly into b bins by picking 0 < x₁ < x₂ < ... < x_{b−1} < 1. We then uniformly and randomly sample s points from each bin, and iterate each point τ times via the logistic map. This way, we construct the probability transition matrix T^{(τ)}_{ji}, the probability that a point selected randomly from bin j ends in bin i after τ logistic steps. Using this, and assuming a uniform prior on picking the initial point, we can obtain the retrodiction probability matrix R^{(τ)}_{ji} and the average retrodiction entropy.

As the binning is random, the value of the average retrodiction entropy is slightly different for each realization of the binning, so we average over many different random binnings. We note that we are essentially calculating the information dimension of the retrodiction probability. Information dimension [23, 24] is one of several common ways to calculate fractal dimension. Our prescription here is only different in that we are applying it to our retrodiction of the original state, not to the calculation of the final state.

Figure 3 contains several panels related to the retrodictability of the logistic map. The top panel is the bifurcation diagram for the logistic map, which we align with the other two panels to use as a reference.

The middle panel shows which initial states converge to which final state. Here we see the basins of attraction of the logistic map. The vertical axis indicates the initial position of the point, whereas the color represents the value the point has after 250 iterations. We can see how the unit interval splits into domains at each bifurcation point. At the onset of chaos, even points very near each other can end up in different phase oscillations.

FIG. 3.
Retrodiction Entropy, Bifurcations and Chaos. The vertical dashed lines, from left to right, are (1) period doubling, (2) period quadrupling, (3) period ×8, (4) the onset of chaos (red), and (5) the onset of one particular island of stability where chaos breaks off into periodic motion (green). Top: The bifurcation diagram for the logistic map, showing X_t for multiple large t values. Middle: Basins of attraction. The initial state X₀ determines the final state X_t (t = 200) within [0, 1], which is mapped to a color gradient from dark red (0) to light yellow (1). The change in the number of basins can be clearly seen near the vertical lines. As the system transitions into chaos, nearby points start converging to distinct final points. The system becomes chaotic (at r ≃ 3.58) and then becomes "well mixed" at larger r. Bottom: The (normalized) retrodiction entropy vs. the logistic parameter r, plotted at several different times t. The black line (t = 500) is an excellent approximation to the asymptotic limit of ⟨S_R⟩. Note that in the non-chaotic regime, the retrodiction entropy converges to flat steps, whereas in the chaotic regime, the retrodiction entropy converges to steps (with values equal to those in the non-chaotic regime) with occasional dips coinciding with islands of stability. Coarse graining was done with b = 500 bins, with 10000 sample initial points per bin.

The degree of chaos increases several times, as subdomains of the unit interval become more mixed. This occurs for example at r = 3.58 and r = 3.59, before the point of complete mixing. The bottom panel of Fig. 3 shows the average retrodiction entropy at several different times; the black line, at t = 500 steps, is a good approximation of the asymptotic limit of ⟨S_R⟩. For parameter values below the first period doubling, retrodiction entropy is at a maximum, since all points in the unit interval converge to a single value; observing that value therefore does not provide any useful information about the initial state of the system. Accordingly, ⟨S_R⟩ = S = log V = 0, since the "volume" V of the unit interval is 1. At the period doubling, the asymptotic value of S_R drops to −log 2. This reflects the fact that in the period-two region, the measure of the set of points that converge to each branch is 1/2. Therefore, the retrodiction entropy given either of the two ending positions is −log 2. This trend of reduction in average retrodiction entropy continues with every period doubling, as an equal measure of points converges to different basins.

Note that, as period doublings occur more rapidly with increasing r, our finite bin size prohibits us from resolving the discrete steps close to the onset of chaos. As period doublings happen exponentially quickly and exponentially close together, an exponential number of bins becomes necessary to distinguish between the entropy drops associated with successive bifurcations.

The blue vertical dashed lines in Fig. 3 show the locations of the period doublings. Near the period doubling points, there is a dramatic slowdown in the convergence of S_R to its asymptotic value, which reflects the slowdown in the convergence of sequences to the periodic attractor.

As period multiplicities of every power of 2 occur before the onset of chaos, the long-time limit of the differential retrodiction entropy approaches negative infinity (in the limit of an infinite number of bins). Even with a limited number of bins, the asymptotic retrodiction entropy hits a minimum right at the chaotic transition.

Past the point of chaos, retrodiction entropy ascends in steps with the same asymptotic values as the descending steps. The reason the steps have the same values can be seen in the middle panel of Fig. 3. As r approaches chaos, the system breaks the unit interval of starting positions into sub-domains that map to different periodic attractors (subdivided somewhat similarly to a Cantor set). After the onset of chaos, the sub-domains undergo mixing, as previously mentioned, where any point that started in a domain has an equal chance of ending up in any attractor in any sub-domain of that domain.

The retrodiction entropy in the chaotic regime also has occasional dips, which correlate with the "islands of stability". For example, we have marked the value r = 3.83 in green, which is where the logistic map has a period-three oscillation. The dips around r = 3.63 and at other values mark more such islands, where not just a single point (x, r), but an entire neighborhood in the unit interval, converges to the same attractor.
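A condensed sketch (ours) of the coarse-graining procedure described above; the values of b, s, and τ are scaled down from the b = 500, s = 10000 used in Fig. 3, and the entropy is left unnormalized:

```python
import numpy as np

rng = np.random.default_rng(4)

def avg_retrodiction_entropy(r, b=100, s=200, tau=500):
    """<S_R> for the logistic map, coarse-grained into b random bins."""
    edges = np.concatenate(([0.0], np.sort(rng.random(b - 1)), [1.0]))
    counts = np.zeros((b, b))                  # counts[j, i]: bin j -> bin i
    for j in range(b):
        x = rng.uniform(edges[j], edges[j + 1], size=s)
        for _ in range(tau):
            x = r * x * (1.0 - x)              # logistic map step
        i = np.searchsorted(edges, x, side='right') - 1
        np.add.at(counts, (j, np.clip(i, 0, b - 1)), 1)

    T = counts / s                             # T[j, i] = T(i | j); rows sum to 1
    P = np.diff(edges)                         # uniform prior: mass of each bin
    Pt = P @ T                                 # observation probability over final bins
    R = T * P[:, None] / np.where(Pt > 0, Pt, 1.0)[None, :]   # R[j, i] = R(j | i)

    S_R = np.zeros(b)
    for i in range(b):                         # entropy over initial bins, per final bin
        p = R[:, i][R[:, i] > 0]
        S_R[i] = -np.sum(p * np.log(p))
    return np.sum(Pt * S_R)

print(avg_retrodiction_entropy(3.2))   # periodic regime: lower <S_R>
print(avg_retrodiction_entropy(3.7))   # chaotic regime: higher <S_R>
```

Sweeping r over [2.8, 4] with this function traces out, at coarser resolution, the staircase structure of the bottom panel of Fig. 3.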
DISCUSSION

The approach of using retrodiction entropy bears some similarities to other methods of inference, particularly maximum a posteriori (MAP) estimation and other Bayesian methods, but also has significant differences. Philosophically, our goal in defining S_R is not to find the mode of a distribution (this is the usual goal of Bayesian inference), but to characterize the information contained in the distribution as a whole. Identifying modes, or the most likely initial state, can be very misleading. For example, in highly degenerate systems, there could be many peaks in R, each containing a small amount of probability mass. In contrast, S_R characterizes the information content of the entire probability distribution.

That being said, entropy does not constitute a complete characterization of a probability distribution either. For example, it might be informative to pull out a guess from R and compare it with the actual initial state,

∫∫ R_y(x₁)(x₁ − x₂)² R_y(x₂) dx₁ dx₂.

Since entropy does not take into account information about the spatial location of probability mass, it would not inform on this quantity.
Comparison with other approaches
There is a long history of inference and information theory in the development of statistical mechanics. Here, we briefly review a few similar methods of doing inference and measuring predictability.

Problems in inverse statistical mechanics are generally solved using maximum likelihood estimation (MLE) or, if prior information is available, maximum a posteriori estimation (MAP). Other methods are available, for example the pseudolikelihood [25]. However, most of the problems typically treated in inverse statistical physics are lattice problems, and the typical goal is to find microscopic parameters of the system given some number of (generally independent) measurements, rather than finding the state of the system in the past. For example, a prototypical inverse statistical mechanics problem is the inverse Ising problem [26], where the connections J_ij between spin variables are unknown, the spin configuration is sampled some number of times from the equilibrium distribution, and the problem is to infer the most likely matrix J_ij.

A line of papers by J. Crutchfield and C. J. Ellison treat semi-infinite chains of random variables as consecutive states in discrete time, and suggest that the mutual information between semi-infinite sets of variables is a good measure of the amount of information about the past stored in the present [27–30]. Their backwards entropy, h_μ = lim_{n→∞} H(X_{−n+1}, ..., X₀)/n, differs from our retrodiction entropy, which, in compatible notation, becomes ⟨S_R⟩ = H(X) − I(X; X_t) (cf. (8)). Note that while h_μ is defined for a chain of infinite time points, retrodiction entropy operates between two specific times.

The goals of computational mechanics and our retrodiction entropy approach are different. Computational mechanics asks what finite state machine can statistically reproduce a sequence of random variables. Furthermore, many of the examples they treat are not physical systems, but finite state computational processes – they look at e.g. the random insertion process [27], the random noisy copy, and the golden mean process [28], though in [30] the authors look at reproducing the patterns in different Ising systems.

In addition, the constraint of having infinite pasts and futures amounts to studying systems only in equilibrium, which is not a case we would typically be interested in when studying retrodiction entropy.

Possible generalizations
We can loosen our formalism to make it applicable to general inference problems, not just problems in statistical mechanics. An inference problem is typically of the form where there is a space of sets of possible model parameters, A, and a space of possible observed outcomes, Ω. The transition probability is the probability that an observable event occurs given a set of model parameters. There is not necessarily any variable that serves as "time." As the problem is one of reconstructing parameters, and there is no time, hence no "past," we would call the Bayesian inverse of T the reconstruction probability, and call the corresponding S_R the reconstruction entropy (instead of retrodiction probability and entropy).

Reconstruction entropy is a measure of how well we can determine the parameters of a system given an observed event generated from a model with unknown parameters. Retrodiction entropy is a special case of this where the set of parameters is the same as the set of observables (A = Ω), e.g. both are phase space. Additionally, when retrodicting, we consider a parameterized family of transition probabilities, understanding this parameter to be our system time. For the more general reconstruction entropy, most of the formulas we have derived still hold, for example Eqs. (1) to (4), (6) and (8), and the KL divergence relations in Appendix B. On the other hand, results like (7) do not hold if there is no time parameter.

CONCLUSION
We introduced the notion of retrodiction entropy as a measure of our ability to infer the past state of a collection of particles based on a single measurement of the system, and derived a relationship between this and thermodynamic entropy. We have established bounds on the retrodiction entropy generation rate, derived a set of KL divergence relations between the different relevant probabilities, and outlined retrodiction entropy's asymptotic properties. We also showed that for systems where the initial state is an equilibrium distribution, the average forward and retrodiction entropies are identical. Lastly, we analytically solved two concrete examples, quantifying how rapidly a system of particles forgets its initial state in convex, concave and flat potentials, and analyzing macrostate retrodiction entropy for a chaotic system. In particular, we saw that in a concave potential there is an upper limit to the loss of information pertaining to the initial state, and for the logistic map, we saw sharp changes in the asymptotic retrodiction entropy at period doublings, and could identify islands of stability in the chaotic regime by dips in retrodiction entropy.

The connection between thermodynamic quantities, ⟨S_T⟩ and S_t, and a purely information theoretical one, S_R, is in accordance with the seminal works of Maxwell, Smoluchowski, Landauer, Szilard, Bekenstein, and others [1–4]. We now know, from (3), that the thermodynamic entropy at the present time not only quantifies the information content of the state of the system at the present time, it also relates to how precisely information about the original state of the system can be recovered after some amount of time has passed.

APPENDIX A: DERIVATION OF THE RELATIONSHIP BETWEEN RETRODICTION ENTROPY AND THERMODYNAMIC ENTROPY
We use sum notation throughout, although the sums could be replaced with integrals. Suppose P is normalized. Then (3) can be proved through direct summation:

⟨S_R⟩ = Σ_ω P_t(ω) S_R(ω)
     = −Σ_ω P_t(ω) Σ_α R_ω(α) log R_ω(α)
     = −Σ_{ω,α} T_α(ω) P(α) log[T_α(ω) P(α)/P_t(ω)]
     = −Σ_{ω,α} P(α) T_α(ω) log T_α(ω) + Σ_ω P_t(ω) log P_t(ω) − Σ_α P(α) log P(α)
     = ⟨S_T⟩ − (S_t − S),

where we substituted P_t(ω) = Σ_α T_α(ω) P(α).

As an explicit example of this, consider the Gaussian process family we discussed in the paper, with T, P, P_t given by (11), (12) and (13). For this case,

S_T = ⟨S_T⟩ = (1/2) log[π^{Nd} Π_{α=1}^{d} D_α(t)^N] + Nd/2,
S = (1/2) log[(2πe)^d Π_{α=1}^{d} σ_α²],
S_t = (1/2) log[π^{Nd} Π_{α=1}^{d} D_α(t)^{N−1}(D_α(t) + 2Nσ_α²λ_α(t)²)] + Nd/2,
S_R = ⟨S_R⟩ = (1/2) log[(πe/N)^d Π_{α=1}^{d} D_α(t)κ_α(t)/λ_α(t)²].

APPENDIX B: KL-DIVERGENCE RELATIONS
Here, we derive (6). We start with the definition of the KL divergence:

D(R_{ω₁}‖R_{ω₂}) = Σ_α R_{ω₁}(α) log[R_{ω₁}(α)/R_{ω₂}(α)]
= Σ_α [T_α(ω₁)P(α)/P_t(ω₁)] { log[T_α(ω₁)/T_α(ω₂)] + log[P_t(ω₂)/P_t(ω₁)] }.

Averaging over ω₁ and ω₂ with the probability weight P_t(ω₁)P_t(ω₂), the first term in the braces gives

Σ_{ω₁,ω₂,α} P_t(ω₂) T_α(ω₁) P(α) log[T_α(ω₁)/T_α(ω₂)]
= Σ_{ω₁,α} P(α) T_α(ω₁) log T_α(ω₁) − Σ_{ω₂,α} P(α) P_t(ω₂) log T_α(ω₂)
= S_t − ⟨S_T⟩ + ⟨D(P_t‖T_α)⟩

(we have used the fact that −Σ_ω P_t(ω) log T_α(ω) = D(P_t‖T_α) + S_t and Σ_ω T_α(ω) = 1), whereas the second term gives

Σ_{ω₁,ω₂,α} P_t(ω₂) T_α(ω₁) P(α) log[P_t(ω₂)/P_t(ω₁)]
= Σ_{ω₁,ω₂} P_t(ω₁) P_t(ω₂) log[P_t(ω₂)/P_t(ω₁)]
= S_t − S_t = 0.

Putting everything together,

⟨D(R_{ω₁}‖R_{ω₂})⟩ = ⟨D(P_t‖T_α)⟩ + S_t − ⟨S_T⟩.

The second term here is

⟨D(P_t‖T_{α₁})⟩ = −Σ_{α₁,ω} P(α₁) P_t(ω) log[T_{α₁}(ω)/P_t(ω)]
= −Σ_{α₁,α₂,ω} P(α₁) P(α₂) T_{α₂}(ω) log T_{α₁}(ω) − S_t
= ⟨D(T_{α₂}‖T_{α₁})⟩ + ⟨S_T⟩ − S_t.

Putting these equations together gives us Eq. (6).

We can take the KL divergence between any pair of distributions that have a common domain. It is natural to compare only distributions that are either both on the final state or both on the initial state. Furthermore, as the KL divergence is asymmetric, we can ask about both orderings. The six options are (T, T), (T, P_t), (P_t, T), (R, R), (R, P), and (P, R). In a similar way to the derivations above, we can find relations between the averages of the KL divergences between all these pairs, in terms of each other or in terms of entropies:

⟨D(T_{α₁}‖T_{α₂})⟩ = ⟨D(P‖R_ω)⟩ + S_t − ⟨S_T⟩
⟨D(T_α‖P_t)⟩ = ⟨D(R_ω‖P)⟩ = S − ⟨S_R⟩ = S_t − ⟨S_T⟩
⟨D(P_t‖T_α)⟩ = ⟨D(P‖R_ω)⟩
⟨D(T_{α₁}‖T_{α₂})⟩ = ⟨D(R_{ω₁}‖R_{ω₂})⟩

One can put these together to derive relations for the averages of the symmetric combinations of KL divergences:

⟨D(T_α‖P_t) + D(P_t‖T_α)⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩
⟨D(R_ω‖P) + D(P‖R_ω)⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩

APPENDIX C: LIMITS ON THE SIZE OF OBSERVATIONAL AND RETRODICTION ENTROPY
We can use Jensen's inequality to put an upper bound on the time rate of change of S_t. Since −log x is a convex function, we have the inequality

−log Σ_α P(α) T_α(ω) ≤ −Σ_α P(α) log T_α(ω).

Start with the definition of S_t, then apply Jensen's inequality (using Σ_ω Ṗ_t(ω) = 0):

Ṡ_t = −Σ_ω Ṗ_t(ω) log P_t(ω) = −Σ_ω Ṗ_t(ω) log Σ_α P(α) T_α(ω)
≤ −Σ_{α,ω} P(α) Ṗ_t(ω) log T_α(ω)
= −Σ_{α,ω} P(α) Ṗ_t(ω) { log[T_α(ω)/P_t(ω)] + log P_t(ω) }
= −Σ_{α,ω} P(α) Ṗ_t(ω) log[T_α(ω)/P_t(ω)] + Ṡ_t.

Canceling the Ṡ_t terms on both sides yields

0 ≤ −Σ_α P(α) Σ_ω Ṗ_t(ω) log[T_α(ω)/P_t(ω)],

which bears some similarity to the KL divergence. The derivative of an arbitrary KL divergence is

∂/∂t D(p‖q) = −Σ ṗ log(q/p) − Σ (p/q) q̇.

Using this in the preceding inequality, we get

0 ≤ ∂/∂t ⟨D(P_t‖T_α)⟩ + Σ_α P(α) Σ_ω P_t(ω) ∂/∂t log T_α(ω)
= ∂/∂t ⟨D(P_t‖T_α)⟩ + Σ_α P(α) Σ_ω P_t(ω) ∂/∂t log[R_ω(α) P_t(ω)/P(α)]
= ∂/∂t ⟨D(P_t‖T_α)⟩ + Σ_ω P_t(ω) Σ_α P(α) ∂/∂t log[R_ω(α)/P(α)]
= ∂/∂t ⟨D(P_t‖T_α)⟩ − ⟨∂/∂t D(P‖R_ω)⟩,

where we used Σ_ω P_t(ω) ∂/∂t log P_t(ω) = Σ_ω Ṗ_t(ω) = 0 and the time-independence of P. Using the expression we previously derived for ⟨D(P_t‖T_α)⟩, we can reintroduce Ṡ_t to the equation,

Ṡ_t ≤ ⟨Ṡ_T⟩ + ∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ − ⟨∂/∂t D(P‖R_ω)⟩.

This in turn bounds ∂⟨S_R⟩/∂t via (3):

∂⟨S_R⟩/∂t ≥ −∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ + ⟨∂/∂t D(P‖R_ω)⟩.   (15)

Now we will make use of the fact that for a Markov process, the relative entropy between two distributions is non-increasing [21]. We include this theorem below for the sake of completeness.

Theorem:
Consider two probability distributions p, q on the same state space, evolving under the same Markov process. Then at any times t₁ < t₂, D(p_{t₁}‖q_{t₁}) ≥ D(p_{t₂}‖q_{t₂}).

Proof:
Let s < t. By the chain rule for relative entropy,

D(p(x_s, x_t)‖q(x_s, x_t)) = D(p(x_t)‖q(x_t)) + D(p(x_s|x_t)‖q(x_s|x_t))
                          = D(p(x_s)‖q(x_s)) + D(p(x_t|x_s)‖q(x_t|x_s)).

Since both distributions evolve under the same Markov process, p(x_t|x_s) = q(x_t|x_s), so D(p(x_t|x_s)‖q(x_t|x_s)) = 0. Then, subtracting the two lines, we get

D(p_t‖q_t) − D(p_s‖q_s) = −D(p(x_s|x_t)‖q(x_s|x_t)) ≤ 0.

Hence ∂/∂t D(T_{α₁}‖T_{α₂}) ≤ 0 for any α₁, α₂. Therefore, the first term on the right hand side of Eq. (15) is non-negative.

The second term of Eq. (15) is harder to work with. Intuitively, we expect R to approach P as we lose information about the past due to stochastic events. So we expect D(P‖R_ω) to eventually reach a minimum for any fixed ω. As long as ⟨D(P‖R_ω)⟩ decreases more slowly than ⟨D(T_{α₁}‖T_{α₂})⟩, this bound is good enough to guarantee that ∂⟨S_R⟩/∂t ≥ 0.

∗ Corresponding Author: [email protected]
[1] E. T. Jaynes, Physical Review 106, 620 (1957).
[2] E. T. Jaynes, Physical Review 108, 171 (1957).
[3] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (University of Illinois Press, 1998).
[4] H. S. Leff and A. F. Rex, Maxwell's Demon: Entropy, Information, Computing (Princeton University Press, 2014).
[5] G. E. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis, Vol. 40 (John Wiley & Sons, 2011).
[6] M. Welling and Y. W. Teh, in Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011) pp. 681–688.
[7] B. A. Desmarais and S. J. Cranmer, Physica A: Statistical Mechanics and its Applications, 1865 (2012).
[8] V. A. T. Nguyen and D. C. Vural, Physical Review E, 032314 (2017).
[9] P. C. Hansen, J. G. Nagy, and D. P. O'Leary, Deblurring Images: Matrices, Spectra, and Filtering (SIAM, 2006).
[10] R. H. Chan and K. Chen, SIAM Journal on Scientific Computing, 1043 (2010).
[11] P. Ullersma, Physica 32, 27 (1966).
[12] H.-Y. Yu, D. M. Eckmann, P. S. Ayyaswamy, and R. Radhakrishnan, Physical Review E, 052303 (2015).
[13] W. T. Coffey and Y. P. Kalmykov, The Langevin Equation: With Applications to Stochastic Problems in Physics, Chemistry and Electrical Engineering, Vol. 27 (World Scientific, 2012).
[14] F. Wolf, Journal of Mathematical Physics 29, 305 (1988).
[15] M. Hashemi, Physica A: Statistical Mechanics and its Applications, 141 (2015).
[16] M. Bernstein and L. S. Brown, Physical Review Letters, 1933 (1984).
[17] J. A. Carrillo and G. Toscani, Mathematical Methods in the Applied Sciences, 1269 (1998).
[18] G. Toscani, Quarterly of Applied Mathematics 57, 521 (1999).
[19] V. Schwämmle, E. M. Curado, and F. D. Nobre, The European Physical Journal B – Condensed Matter and Complex Systems, 159 (2007).
[20] A. R. Plastino, H. G. Miller, and A. Plastino, Physical Review E, 3927 (1997).
[21] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012).
[22] R. M. May, Nature 261, 459 (1976).
[23] P. Grassberger and I. Procaccia, Physical Review Letters 50, 346 (1983).
[24] J. D. Farmer, Zeitschrift für Naturforschung A 37, 1304 (1982).
[25] J. Besag, Journal of the Royal Statistical Society, Series B (Methodological) 36, 192 (1974).
[26] H. C. Nguyen, R. Zecchina, and J. Berg, Advances in Physics 66, 197 (2017).
[27] J. P. Crutchfield, C. J. Ellison, and J. R. Mahoney, Physical Review Letters 103, 094101 (2009).
[28] C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield, Journal of Statistical Physics 136