Limits on Inferring the Past
Nathaniel Rupprecht and Dervis C. Vural∗
University of Notre Dame
(Dated: May 14, 2018)

Here we define and study the properties of retrodictive inference. We derive equations relating retrodiction entropy and thermodynamic entropy, and as a special case, show that under equilibrium conditions the two are identical. We demonstrate relations involving the KL divergence and retrodiction probability, and bound the time rate of change of retrodiction entropy. As a specific case, we invert various Langevin processes, inferring the initial condition of N particles given their final positions at some later time. We evaluate the retrodiction entropy for Langevin dynamics exactly for special cases, and find that one's ability to infer the initial state of a system can exhibit two possible qualitative behaviors depending on the potential energy landscape: either decreasing indefinitely, or asymptotically approaching a fixed value. We also study how well we can retrodict points that evolve based on the logistic map. We find singular changes in the retrodictability near bifurcations. Counterintuitively, the transition to chaos is accompanied by maximal retrodictability.

INTRODUCTION
Many astonishing facts about the origin of the universe, the evolution of life, or the history of civilizations will never be directly observed, but will only be inferred in the light of their manifestations in the present. Evolved forward in time, any state of knowledge, regardless of how exact, will invariably deteriorate into an entropy-maximizing probability distribution [1–4]. How rapidly does our knowledge of the past, as inferred from a measurement made in the present, deteriorate going backwards in time?

While methods exist for inferring the origin of an observed final state [5–8], or for inferring some original data after it has been corrupted [9, 10], we know little about how accurately the initial state of a many-body system can be characterized given its present state, how quickly a system forgets its initial state due to thermal fluctuations, and how the limit on our ability to infer the past depends on system parameters. The answers to these questions should lie in non-equilibrium statistical mechanics, where thermal motion is incorporated into mechanical laws [11–13]. In systems where thermal collisions erase the information pertaining to past states of particles, the Fokker-Planck equation constitutes the groundwork of nonequilibrium analysis [14–20].

Here we determine the theoretical limits to inferring the initial state of a system, to which we refer as "retrodiction" – in contrast to prediction. We quantify the quality of retrodiction in terms of retrodiction entropy, S_R. We derive a relationship between thermodynamic entropy and retrodiction entropy, and report a lower bound on its generation rate. Then, to apply these ideas to a specific problem, we consider a collection of particles coupled to a thermal bath, and obtain the time dependence of S_R in convex, concave and flat potentials. To establish whether chaos fundamentally influences retrodictability, we also investigate the retrodiction entropy of the logistic map as it transitions from the non-chaotic regime to the chaotic regime. Finally, we conclude our discussion with a comparison of retrodiction entropy to other inverse statistical methods and methods for comparing predictability and retrodictability.

DEFINITIONS AND NOTATION
Our system consists of a set of states Ω, a prior distribution on the set of states, P, and a "transition probability" function T. The state space Ω will depend on the problem at hand; it could for example be the space of all possible positions and velocities of a collection of particles (i.e. phase space). The prior distribution specifies how the system will be initialized: P(α) is the probability that the system will be prepared in the state α ∈ Ω. The transition probability T(ω|α; t) is the probability that the system ends in the state ω ∈ Ω given that it started in the state α ∈ Ω and evolved for a time t. We will generally suppress the time variable.

The probability R(α|ω; t) = R_ω(α) that the initial state was α given the final state ω is given by Bayes' theorem,

R(α|ω; t) = T(ω|α; t) P(α) / P_t(ω) = T(ω|α; t) P(α) / Σ_{α'} T(ω|α'; t) P(α'),   (1)

where P_t is the prior distribution P evolved forward in time. R would typically be called the likelihood or the posterior distribution. In the present context, we will refer to it as the retrodiction probability, and define the entropy associated with it as the retrodiction entropy,

S_R(ω) = −Σ_α R_ω(α) log R_ω(α).   (2)

Intuitively, the larger S_R(ω) is, the less accurately the initial state can be inferred given a measurement of the final state, ω.

Note that S_R is a function of the final state observed after a single realization of a stochastic process. If the process were to be run again, the particles would end up elsewhere, and have a different S_R associated with that final state. As such, it will be useful to define S_R averaged over all possible final measurements, ⟨S_R⟩.

A related quantity of interest is the Kullback-Leibler (KL) divergence, D(p‖q) = Σ_x p(x) log[p(x)/q(x)], which measures how much a distribution p(x) differs from a reference distribution q(x) [21]. Thus, another useful measure of retrodictability is the KL divergence D(R_ω‖P) between R and P, which quantifies the amount of information gained over the prior upon a measurement. As our ability to infer the past decreases, the retrodiction probability coincides more with the prior probability, and the KL divergence decreases. Ultimately, D(R_ω‖P) = 0 when the measurement ω provides no additional information regarding the initial state beyond what we already knew: the prior, P.
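As a concrete illustration of Eqs. (1) and (2), the following minimal sketch (ours, not part of the original analysis; the three-state transition matrix and prior are arbitrary placeholders) computes the retrodiction probability and the retrodiction entropy for a discrete system:

```python
import numpy as np

# Toy discrete system with 3 states; the dynamics are arbitrary placeholders.
# T[w, a] = T(w | a): probability of ending in state w given initial state a.
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
P = np.array([0.5, 0.3, 0.2])      # prior P(a) over initial states

Pt = T @ P                         # evolved prior, P_t(w) = sum_a T(w|a) P(a)

# Retrodiction probability, Eq. (1): R[w, a] = T(w|a) P(a) / P_t(w)
R = T * P[None, :] / Pt[:, None]

def entropy(p):
    """Shannon entropy, skipping zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Retrodiction entropy, Eq. (2), for each possible final measurement w
S_R = np.array([entropy(R[w]) for w in range(3)])
print(S_R)    # the larger S_R(w), the harder the initial state is to infer
```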
Notation

Throughout, we denote the average over all free parameters by ⟨·⟩. However, there are two different types of averages indicated by this notation: averages over the distribution on initial states, and averages over the distribution on final states. When we average over quantities where the free variable ranges over initial states, we use a probability weight P for each such free variable. For quantities where the free variable ranges over final states, we use a probability weight P_t for each such free variable. In the case where there are multiple states being averaged over, we include a subscript to indicate each free variable to be averaged over. For example,

⟨S_R⟩ = Σ_ω P_t(ω) S_R(ω),
⟨S_T⟩ = Σ_α P(α) S_T(α),
⟨D(T_{α₁}‖T_{α₂})⟩ = Σ_{α₁,α₂} P(α₁) P(α₂) D(T_{α₁}‖T_{α₂}).

GENERAL PROPERTIES OF RETRODICTION

Relation between retrodiction and thermodynamics
To facilitate readability onwards, we expose only the crucial steps in the main text, leaving the proofs and derivations to the appendices.

Our first key result is the relationship between retrodiction entropy and thermodynamic entropy,

⟨S_R⟩ = ⟨S_T⟩ − (S_t − S).   (3)

Here ⟨S_T⟩ is the average entropy associated with the transition probability T_α(ω), whereas S and S_t are the entropies associated with the prior probability P and the observation probability P_t. Eq. (3) relates our ability to infer the past, ⟨S_R⟩, to our ability to predict the future, ⟨S_T⟩ and S_t. This identity is derived in Appendix A.

Note that Eq. (3) holds for processes both in and out of equilibrium, and provides useful insights on the general properties of S_R. For short times, P_t ≃ P, so lim_{t→0} ⟨S_R⟩/⟨S_T⟩ = 1. For long times, if the system converges to a stationary distribution P_∞ (as is the case in a bounded space or trapping potential), then P_t and T_α(ω) must approach P_∞ independent of the starting state, and (3) implies lim_{t→∞} ⟨S_R⟩ = S, i.e. we cannot guess the initial state any better than using whatever we already knew before making the measurement.

As another interesting special case, we consider what happens if the prior probability P coincides with the stationary state probability P_∞ (assuming one exists). Then S_t = S for all times t, and (3) implies

⟨S_R(t)⟩ = ⟨S_T(t)⟩.   (4)

For example, if we are inferring the past of a system in equilibrium, we would draw the initial state of the system out of the equilibrium distribution, i.e. use P(s) = e^{−βE(s)}/Z as the prior probability, measure the positions of some particles, and ask where they used to be. Eq. (4) tells us that in equilibrium, the rate of thermodynamic entropy generation and retrodiction entropy generation is the same. Our ability to predict the future fades at exactly the same rate as our ability to infer the original state of the system.

No such correspondence need hold for non-equilibrium processes. For a system with equilibrium entropy S_eq, if S > S_eq then S_t will decrease from S at t = 0 to S_eq as t → ∞. Thus ⟨S_R⟩ > ⟨S_T⟩. In this case, we know that particles will gather, so we know better where they will be in the future than where they were originally. In contrast, if S < S_eq, S_t will increase in time and ⟨S_R⟩ < ⟨S_T⟩. Here, we know more about where the particles were originally than where they will be in the future. To sum up, the more certain we can be about the state of the system in the future, the less certain we are about where the system started out in the past.
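Identity (3) is also easy to verify numerically. The sketch below (ours; the transition matrix and prior are randomly generated placeholders) checks ⟨S_R⟩ = ⟨S_T⟩ − (S_t − S) for a random five-state system:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                       # number of states (arbitrary)

# Random column-stochastic transition matrix T[w, a] and random prior P(a).
T = rng.random((n, n)); T /= T.sum(axis=0)
P = rng.random(n);      P /= P.sum()

Pt = T @ P                                  # observation probability P_t
R = T * P / Pt[:, None]                     # retrodiction probability R[w, a]

H = lambda p: -np.sum(p * np.log(p))        # Shannon entropy (entries > 0 here)

S      = H(P)                               # prior entropy
S_t    = H(Pt)                              # observation entropy
avg_ST = sum(P[a] * H(T[:, a]) for a in range(n))    # <S_T>
avg_SR = sum(Pt[w] * H(R[w]) for w in range(n))      # <S_R>

assert np.isclose(avg_SR, avg_ST - (S_t - S))        # Eq. (3)
```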
Experimental measurement of retrodictability

It is instructive to view (3) from a practical, empirical perspective. Consider a system of particles evolving in a potential energy landscape U(x⃗) while coupled to a heat bath. Can we estimate bounds on ⟨S_R⟩ without knowing the microscopic dynamics of the system (e.g. the inter-particle interactions) or the potential energy landscape, but only using thermodynamic measurements?

This is possible under certain conditions. We can initialize a system such that particles are in state α with probability P(α), let the particles evolve for a time t, calorimetrically obtain the change in thermodynamic entropy via ΔS_α = ∫_α dQ/T, and then average this over multiple instances to obtain ⟨S_T⟩_s (the sample average of entropy). The identity dS = dQ/T holds when the system moves along a reversible path. While it is not trivial to measure S_t for processes out of equilibrium, we can use the equilibrium result ⟨S_T⟩ = ⟨S_R⟩ (Eq. 4) and the second law to place an upper bound on the average retrodiction entropy for any process (in or out of equilibrium), ⟨S_R⟩ < ⟨S_T⟩_s + S.

Under special conditions, we can do better than an inequality. If the prior distribution is uncorrelated, P(x₁, ..., x_N) = p(x₁)p(x₂)...p(x_N), and if interactions between particles are negligible, then

P_t(y₁, ..., y_N) = Π_{k=1}^{N} ( Σ_{x_k} T(y_k|x_k; t) p(x_k) ) ≡ Π_k q(y_k).

Since each term in this product is independent, the entropy is extensive: S_t = N H[q] and S = N H[p]. Thus, an experimentalist can measure S_t − S by preparing a gas of M ≫ 1 particles with density M p(x), allowing the particles to evolve for a time t, and again calorimetrically integrating ΔS = ∫ dQ/T to obtain S_t − S ≃ N ΔS/M. Note that since M ≫ 1, ΔS will be deterministic. Thus from (3) the retrodictability becomes a difference of two entropy measurements,

⟨S_R⟩ = ⟨ΔS⟩_s − (N/M) ΔS.   (5)

The first term on the right is measured by initializing particles individually at α with probability P(α) and averaging over all outcomes, whereas the second term comes from a single-shot measurement of a gas initialized with density M P(α). We emphasize that this experimental protocol to obtain (5) will be valid only when inter-particle interactions are negligible and for an uncorrelated prior; but as long as these assumptions hold, ⟨S_R⟩ can be known by performing only thermodynamic measurements, without needing to know the underlying potential or microscopic dynamics.

Continuous space and divergence relations
For a continuous state space, we may consider S_R to be a differential entropy, which is not invariant under a change of variables. In contrast, D(R_ω‖P) is invariant under changes of variables, and therefore may be a more desirable measure. We derive, in a similar manner to (3),

⟨D(R_ω‖P)⟩ = S_t − ⟨S_T⟩ = S − ⟨S_R⟩.

Markovian stochastic processes are known to have a KL divergence that is non-increasing in time [21]. Thus we are motivated to ask how the KL divergence between two forward processes T_α compares to the KL divergence between two retrodiction probabilities R_ω. First, we show (cf. Appendix B)

⟨D(R_{ω₁}‖R_{ω₂})⟩ = ⟨D(P_t‖T_α)⟩ + S_t − ⟨S_T⟩,
⟨D(P_t‖T_α)⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩ + ⟨S_T⟩ − S_t.

Combining these gives us the relationship

⟨D(R_{ω₁}‖R_{ω₂})⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩.   (6)

Thus, the average divergence between different retrodiction probability distributions is exactly equal to the average divergence between different forward distributions (cf. Appendix B). Taking the time derivative of both sides tells us that the average rate of change is the same for forward and reverse probabilities, and that this quantity is non-increasing [21]. In Appendix B, we list all the KL divergence relations between the distributions T, R, P, and P_t.
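The same toy setup confirms Eq. (6) directly; in the sketch below (ours, again with randomly generated placeholder distributions), the two averaged divergences agree to machine precision:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
T = rng.random((n, n)); T /= T.sum(axis=0)   # T[w, a] = T(w|a)
P = rng.random(n);      P /= P.sum()
Pt = T @ P
R = T * P / Pt[:, None]                      # R[w, a]

KL = lambda p, q: np.sum(p * np.log(p / q))

# <D(R_w1 || R_w2)>: averaged over final states with weight P_t x P_t
D_RR = sum(Pt[w1] * Pt[w2] * KL(R[w1], R[w2])
           for w1 in range(n) for w2 in range(n))
# <D(T_a1 || T_a2)>: averaged over initial states with weight P x P
D_TT = sum(P[a1] * P[a2] * KL(T[:, a1], T[:, a2])
           for a1 in range(n) for a2 in range(n))

assert np.isclose(D_RR, D_TT)                # Eq. (6)
```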
Lower bound to retrodiction entropy generation

We can establish a lower bound on the time rate of change of retrodiction entropy in terms of forward entropies and KL divergences. Differentiating (3) and using the convexity of −log gives us an upper bound on the rate of change of S_t (cf. Appendix C),

Ṡ_t ≤ ⟨Ṡ_T⟩ + ∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ − ⟨∂/∂t D(P‖R_ω)⟩.   (7)

Using the theorem on Markov processes, we know that the second term in (7) is ≤ 0. The last term in (7) measures the divergence between the prior state and the retrodiction probability, which should decrease with time as the reconstructed probability approaches the prior. Rearranging (7), we get

∂⟨S_R⟩/∂t ≥ −∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ + ⟨∂/∂t D(P‖R_ω)⟩.

Information theoretical interpretation
From an information theoretic point of view, retrodiction entropy is the amount of information required to specify in which state the system was initialized, given an observation of its final state. The KL divergence between the retrodiction probability R_ω and the prior distribution P is a measure of how much information has been gained by making a measurement (above and beyond the information contained in the prior). The KL divergence is asymmetric in its arguments, D(R_ω‖P) ≠ D(P‖R_ω). However, there is a good reason for preferring D(R_ω‖P) over D(P‖R_ω). Letting X, X_t be the random variables for the configuration at times 0 and t, it can be shown that ⟨D(R_ω‖P)⟩ = I(X; X_t), where I(·,·) is the mutual information. In other words, the average KL divergence between retrodiction probabilities and the prior is the mutual information between the initial and final states of the system. We can use this and our other formulas to write retrodiction entropy in terms of mutual information,

⟨S_R⟩ = S − I(X; X_t) = H(X) − I(X; X_t).   (8)

While it is impossible to evaluate quantities like D(R_ω‖P) or S_R(ω) for a specific ω without being given a specific problem (and being able to evaluate the transition probabilities for that problem), Eqs. (1) to (4) and (6) to (8) hold true quite generally, for any system in or out of equilibrium.
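Eq. (8) can likewise be checked by computing the mutual information directly from the joint distribution Pr(X = α, X_t = ω) = P(α) T(ω|α). A short sketch (ours, with placeholder distributions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
T = rng.random((n, n)); T /= T.sum(axis=0)   # T[w, a] = T(w|a)
P = rng.random(n);      P /= P.sum()

J  = T * P                                   # joint J[w, a] = Pr(X_t = w, X = a)
Pt = J.sum(axis=1)                           # marginal over final states

H = lambda p: -np.sum(p * np.log(p))
I = H(P) + H(Pt) - H(J.ravel())              # mutual information I(X; X_t)

R = J / Pt[:, None]                          # retrodiction probability R[w, a]
avg_SR = sum(Pt[w] * H(R[w]) for w in range(n))

assert np.isclose(avg_SR, H(P) - I)          # Eq. (8)
```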
RETRODICTION OF BROWNIAN PARTICLES IN A POTENTIAL

Following these general results, we now study a specific physical system: the retrodiction entropy of Brownian particles diffusing in a potential. The position of the k-th particle will be written as x^(k) = {x^(k)_α}, where α indexes the coordinates, and the initial position will be written as y = {y_α}. In other words, Latin superscripts index particles 1, ..., N, while Greek subscripts indicate their coordinates, 1, ..., d.

Suppose N particles are released at the same position at t = 0 and evolve in a potential U(x⃗) according to Langevin dynamics. The evolution of the state probability distribution p(x⃗, t) is governed by the general Fokker-Planck equation,

∂p(x⃗, t)/∂t = Σ_{α,β} ∂²[D_{αβ}(x⃗, t) p(x⃗, t)]/∂x_α∂x_β − Σ_α ∂[μ_α(x⃗, t) p(x⃗, t)]/∂x_α,

where μ_α(x⃗, t) is a drift term and D_{αβ}(x⃗, t) is the diffusion tensor. Since particles are independent and follow identical transition rules, the probability that N particles starting at state y end in states x^(1), ..., x^(N) is

T(x^(1), ..., x^(N)|y; t) = Π_{k=1}^{N} p(x^(k)|y; t).   (9)

The retrodiction probability R(y|x^(1), ..., x^(N)) is then the probability that the initial position of the cluster of particles was y given the N observed final positions {x^(k)}.
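For concreteness, here is a minimal simulation sketch (ours; N, D, θ, the time step, and the release point y0 are arbitrary placeholders) of the forward process that is being inverted: N independent overdamped Langevin particles released from a common point in a harmonic potential U(x) = θx², integrated with the Euler-Maruyama scheme:

```python
import numpy as np

rng = np.random.default_rng(3)

# Overdamped Langevin dynamics, dx = -U'(x) dt + sqrt(2 D) dW, in one dimension,
# for N independent particles released from the same point y0.
N, D, theta = 5, 1.0, 0.5          # placeholder parameters
dt, steps   = 1e-3, 2000
y0 = 0.3                           # common (unknown, to-be-retrodicted) start

x = np.full(N, y0)
for _ in range(steps):
    drift = -2.0 * theta * x       # -U'(x) for U(x) = theta * x**2
    x += drift * dt + np.sqrt(2.0 * D * dt) * rng.standard_normal(N)

print(x)   # final positions x^(1), ..., x^(N): the data from which y0 is inferred
```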
Retrodiction Entropy of a Gaussian process

Consider a process with (individual) probability distributions

p(x^(k)|y; t) = Π_{α=1}^{d} exp[−(x^(k)_α − λ_α(t) y_α)²/D_α(t)] / √(πD_α(t)).   (10)

Here, the transition probability T(x^(1), ..., x^(N)|y) is

T = Π_{α=1}^{d} exp[−Σ_{k=1}^{N} (x^(k)_α − λ_α(t) y_α)²/D_α(t)] / (πD_α(t))^{N/2},   (11)

since all particles start at the same point y. Note that we allow the generalized diffusion and drift to be different in every dimension α. Suppose the prior probability for the initial position of the cluster of particles is Gaussian, centered at the origin,

P(y) = Π_{α=1}^{d} (2πσ_α²)^{−1/2} exp[−y_α²/(2σ_α²)].   (12)

The observation probability of a configuration is then

P_t(x^(1), ..., x^(N); t) = Π_{α=1}^{d} (2π^N σ_α² D_α(t)^N)^{−1/2} √(D_α(t)κ_α(t)/(Nλ_α(t)²)) exp[−(N/D_α(t))(⟨x_α²⟩ − κ_α(t)⟨x_α⟩²)],   (13)

where κ_α(t) = [1 + D_α(t)/(2Nσ_α²λ_α(t)²)]^{−1} and ⟨x_α^n⟩ = Σ_{k=1}^{N} [x^(k)_α]^n/N. From this and T, P, we can evaluate the retrodiction probability

R(y|x^(1), ..., x^(N); t) = Π_{α=1}^{d} √(Nλ_α(t)²/(πD_α(t)κ_α(t))) exp[−(Nλ_α(t)²/(D_α(t)κ_α(t))) (y_α − (κ_α(t)/λ_α(t))⟨x_α⟩)²].

As this is a Gaussian distribution, it is straightforward to evaluate its entropy, the retrodiction entropy,

S_R = (1/2) log[(πe/N)^d Π_{α=1}^{d} D_α(t)/(λ_α(t)² + D_α(t)/(2σ_α²N))].   (14)

Note that in the limit σ_α → ∞ in all directions, we obtain the case of a uniform (non-normalizable) prior over all space. In this case, or in the case that the σ's are finite and particles are "scattered off" by external forces, i.e. λ_α(t) → ∞ as t → ∞, the retrodiction entropy is

S_R = (d/2) log[πe D_GM(t)/(N λ_GM(t)²)],

where the subscript "GM" indicates a geometric mean over the different directions α. The individual entropies of the distributions T, P, and P_t are listed in Appendix A, which also serves to verify (3).

FIG. 1.
Retrodiction Entropy Generation. Left: The retrodiction entropy S_R of five particles in a convex, flat and concave potential U(x⃗) with a uniform prior. S_R quantifies how poorly the initial state of the particles can be inferred, backwards in time. Free and trapped particles forget their origin monotonically, whereas particles dispersing in a concave potential remember their past no matter how much time passes. Right: An analogous plot, but with a Gaussian prior instead of a uniform prior. Free and trapped particles saturate to maximum retrodiction entropy, whereas particles in a concave potential still remember their past, just as in the case of a uniform prior.
Convex and concave potentials
Two processes that have analytical solutions to the Fokker-Planck equation are the Wiener and Ornstein-Uhlenbeck processes, describing Brownian particles in flat, U(x⃗) = α + β⃗·x⃗, and parabolic, U(x⃗) = α + β⃗·x⃗ + θx⃗², potentials. We evaluate the retrodiction entropy for these special cases, and find that it diverges for particles random-walking in flat and convex potentials (θ ≥ 0), indicating that the system steadily forgets its past. In contrast, concave (θ ≤ 0) potentials have a retrodiction entropy that asymptotically approaches a constant less than S_P, the entropy of the prior, indicating that the system always retains memory of its initial state (see Fig. 1).

The distribution of a free Brownian particle is

p(x|y; t) = Π_{α=1}^{d} (4πD_α t)^{−1/2} exp[−(x_α − y_α)²/(4D_α t)].

In this case, the functions in (10) are D_α(t) = 4D_α t and λ_α(t) = 1. Thus,

S_R = (1/2) log[(4πe)^d Π_{α=1}^{d} σ_α² D_α t/(2D_α t + σ_α² N)].

FIG. 2.
The Role of Convexity. Left: Prior entropy S, average thermodynamic entropy ⟨S_T⟩, and observational entropy S_t for two particles in a harmonic trap, as derived in Appendix A. The retrodiction entropy is related to the other three through ⟨S_R⟩ = ⟨S_T⟩ − (S_t − S). The prior distribution is Gaussian with σ = 5. Right: S_R for the Ornstein-Uhlenbeck process at different times with a Gaussian prior, σ = 5. For positive convexity, S_R converges to S; for negative convexity, S_R converges to some smaller value, meaning some information can still be recovered.

In the limit of σ_α → ∞, S_R increases at a logarithmic rate at all times. If the σ's are finite, then at long times S_R → (d/2) log[2πe σ_GM²], which is just the entropy of the prior distribution P. For short times, we have S_R ∼ (d/2) log(4πe D_GM t/N).

Next, we consider Brownian particles in a convex or concave harmonic potential, U(x⃗) = θx⃗², described by the Ornstein-Uhlenbeck process. The probability distribution given an initial position y is

p(x|y; t) = Π_{α=1}^{d} [2πD_α θ^{−1}(1 − e^{−2θt})]^{−1/2} exp[−θ(x_α − y_α e^{−θt})²/(2D_α(1 − e^{−2θt}))],

meaning that D_α(t) = 2D_α θ^{−1}(1 − e^{−2θt}) and λ_α(t) = e^{−θt}. Thus,

S_R = (1/2) log[(2πe)^d Π_{α=1}^{d} σ_α² D_α(1 − e^{−2θt})/(σ_α² Nθ e^{−2θt} + D_α(1 − e^{−2θt}))].

In the limit of infinite σ's, we get two very different long-time behaviors depending on the sign of θ. For θ > 0, as t → ∞, S_R ∼ dθt. For θ < 0, we have a potential that tends to quickly force particles away from the origin. In this case,

S_R = (d/2) log[2πe (1 − e^{−2|θ|t}) D_GM/(N|θ|)].

Therefore, as t → ∞, S_R ∼ const. − (d/2) e^{−2|θ|t}. Thus, after some initial transient loss of information, our ability to reconstruct the initial state plateaus, i.e. the system always retains information about its initial state for arbitrarily long times (see Fig. 1a). For finite σ's, S_R has three distinct temporal regimes: it starts logarithmic, crosses over to linear, and then finally saturates to S (see Figs. 1b, 2).

In Fig. 1a, we have plotted the average retrodiction entropy as a function of time for five particles in potentials with various concavities (θ parameters, U(x) = θx²). The prior is a non-normalizable uniform prior. The process is an Ornstein-Uhlenbeck process when θ ≠ 0, and the Wiener process when θ = 0. For concave potentials (in blue), the retrodiction entropy converges to a finite value. For a potential with θ = 0, we recover the Wiener process, and S_R increases logarithmically. For convex potentials, S_R is asymptotically linear, diverging much more quickly than for the Wiener process.

In Fig. 1b, we have shown the analogous plot, but for a Gaussian prior. For concave potentials, the retrodiction entropy still saturates to a value below the prior entropy. For convex potentials, the retrodiction entropy starts logarithmic, becomes linear, and then quickly saturates to S. For the Wiener process, the retrodiction entropy does eventually approach the value of S, though very slowly – at t = 1000, it is still 2.5% away from S.

In Fig. 2a, we show the time dependence of the entropies S, ⟨S_T⟩, S_t, and ⟨S_R⟩ for two particles in a convex potential with a Gaussian prior. This illustrates the fact that ⟨S_R⟩ = ⟨S_T⟩ − (S_t − S). The linear behavior of S_R in the intermediate regime can be seen before it exponentially approaches the value of the entropy of the prior, S.

In Fig. 2b, we plot the average retrodiction entropy of the Ornstein-Uhlenbeck process at specific times, starting with a Gaussian prior. In the long time limit, if the convexity is positive, the retrodiction entropy approaches the entropy of the prior distribution, S; hence the black line is flat for all θ ≥ 0. However, if the convexity is negative (a concave potential), we can see that the retrodiction entropy converges to a value less than S. This indicates that by making a measurement, we gain information about the initial state of the system even after arbitrarily long times.
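The closed forms above are straightforward to tabulate. The sketch below (ours; d = 1, a uniform prior σ → ∞, and placeholder values of D, N, θ) evaluates S_R = (d/2) log[πe D(t)/(N λ(t)²)] for the Wiener and Ornstein-Uhlenbeck cases, reproducing the three qualitative behaviors of Fig. 1a:

```python
import numpy as np

D, N, d = 1.0, 5, 1        # placeholder diffusion constant, particle count, dims

def S_R(t, theta):
    """Retrodiction entropy with a uniform prior: (d/2) log[pi e D(t) / (N lam^2)]."""
    if theta == 0.0:                       # Wiener process
        Dt, lam = 4.0 * D * t, 1.0
    else:                                  # Ornstein-Uhlenbeck process, U = theta x^2
        Dt  = 2.0 * D * (1.0 - np.exp(-2.0 * theta * t)) / theta
        lam = np.exp(-theta * t)
    return 0.5 * d * np.log(np.pi * np.e * Dt / (N * lam**2))

for theta in (0.5, 0.0, -0.5):
    print(theta, [round(S_R(t, theta), 2) for t in (1.0, 10.0, 100.0)])
# theta > 0: S_R ~ d*theta*t (diverges linearly); theta = 0: ~ (d/2) log t;
# theta < 0: saturates to a constant, i.e. the initial state is never forgotten.
```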
RETRODICTION OF A CHAOTIC SYSTEM

To study how chaos relates to retrodictability we consider the simplest of chaotic systems, the logistic map,

X_t = r X_{t−1}(1 − X_{t−1}),

characterized by a single parameter r which determines whether the system is chaotic. Our key result here is somewhat counter-intuitive: we find that the system is maximally retrodictable right before and right after it transitions into chaos.

The asymptotic properties of the logistic map are well known [22]. The values X_n takes as n tends to infinity, i.e. the attractors, are shown in the bifurcation diagram (Fig. 3a). For small values of r, the trajectories are periodic. As r is increased, there is a sequence of period doublings (cf. Fig. 3, blue vertical dashes) until the system transitions to chaos at r ≃ 3.57 (red vertical dashes). Within the chaotic regime, there are occasional islands of stability where periodic attractors exist. For example, at r ≃ 3.83 there is a period-3 attractor (green vertical dashes).

Since the logistic map is purely deterministic, in order to define probabilities and entropies we suppose that the state of the system cannot be measured with infinite accuracy – similar to how probability and entropy arise in classical statistical mechanics. To avoid artifacts stemming from the precise details of coarse graining, we pick very small bins with randomized positions. Specifically, we coarse grain the interval [0, 1] randomly into b bins by picking 0 < x₁ < x₂ < ... < x_{b−1} < 1. We then uniformly and randomly sample s points from each bin, and iterate each point τ times via the logistic map. This way, we construct the probability transition matrix T^{(τ)}_{ji}, the probability that a point selected randomly from bin j ends in bin i after τ logistic steps. Using this, and assuming a uniform prior on picking the initial point, we can obtain the retrodiction probability matrix R^{(τ)}_{ji} and the average retrodiction entropy.

As the binning is random, the value of the average retrodiction entropy is slightly different for each realization of the binning, so we average over many different random binnings. We note that we are essentially calculating the information dimension of the retrodiction probability. Information dimension [23, 24] is one of several common ways to calculate fractal dimension. Our prescription here is only different in that we are applying it to our retrodiction of the original state, not to the calculation of the final state.

Figure 3 contains several panels related to the retrodictability of the logistic map. The top panel is the bifurcation diagram for the logistic map, which we align with the other two panels to use as a reference.

The middle panel shows which initial states converge to which final state. Here we see the basins of attraction of the logistic map. The vertical axis indicates the initial position of the point, whereas the color represents the value the point has after 250 iterations. We can see how the unit interval splits into domains at each bifurcation point. At the onset of chaos, even points very near each other can end up in different phase oscillations.

FIG. 3.
Retrodiction Entropy, Bifurcations and Chaos. The vertical dashed lines, from left to right, are (1) period doubling, (2) period quadrupling, (3) period ×8, (4) the onset of chaos (red), and (5) the onset of one particular island of stability where chaos breaks off into periodic motion (green). Top: The bifurcation diagram for the logistic map, showing X_t for multiple large t values. Middle: Basins of attraction. The initial state X₀ determines the final state X_t (t = 200) within [0, 1], which is mapped to a color gradient from dark red (0) to light yellow (1). The change in the number of basins can be clearly seen near the vertical lines. As the system transitions into chaos, nearby points start converging to distinct final points. The system becomes chaotic (at r ≃ 3.58) and then becomes "well mixed" at larger r. Bottom: The (normalized) retrodiction entropy vs. the logistic parameter r, plotted at several different times t. The black line (t = 500) is an excellent approximation to the asymptotic limit of ⟨S_R⟩. Note that in the non-chaotic regime, the retrodiction entropy converges to flat steps, whereas in the chaotic regime, the retrodiction entropy converges to steps (with values equal to those in the non-chaotic regime) with occasional dips coinciding with islands of stability. Coarse graining was done with b = 500 bins, with 10000 sample initial points per bin.

The degree of chaos increases several times, as subdomains of the unit interval become more mixed. This occurs for example at r = 3.58 and r = 3.59, before the point of complete mixing. The bottom panel of Fig. 3 shows the average retrodiction entropy at several different times; the black line, at t = 500 steps, is a good approximation of the asymptotic limit of ⟨S_R⟩. For parameter values below the first period doubling, retrodiction entropy is at a maximum, since all points in the unit interval converge to a single value; observing that value therefore does not provide any useful information about the initial state of the system. Accordingly, ⟨S_R⟩ = S = log V = 0, since the "volume" V of the unit interval is 1. At the period doubling, the asymptotic value of S_R drops to −log 2. This reflects the fact that in the period-two region, the measure of the set of points that converge to each branch is 1/2. Therefore, the retrodiction entropy given either of the two ending positions is −log 2. This trend of reduction in average retrodiction entropy continues with every period doubling, as an equal measure of points converges to different basins.

Note that, as period doublings occur more rapidly with increasing r, our finite bin size prohibits us from resolving the discrete steps close to the onset of chaos. As period doublings happen exponentially quickly and exponentially close together, an exponential number of bins becomes necessary to distinguish between the entropy drops associated with successive bifurcations.

The blue vertical dashed lines in Fig. 3 show the locations of the period doublings. Near the period doubling points, there is a dramatic slowdown in the convergence of S_R to its asymptotic value, which reflects the slowdown in the convergence of sequences to the periodic attractor.

As period multiplicities of every power of 2 occur before the onset of chaos, the long-time limit of the differential retrodiction entropy approaches negative infinity (in the limit of an infinite number of bins). Even with a limited number of bins, the asymptotic retrodiction entropy hits a minimum right at the chaotic transition.

Past the point of chaos, retrodiction entropy ascends in steps with the same asymptotic values as the descending steps. The reason the steps have the same values can be seen in the middle panel of Fig. 3. As r approaches chaos, the system breaks the unit interval of starting positions into sub-domains that map to different periodic attractors (subdivided somewhat similarly to a Cantor set). After the onset of chaos, the sub-domains undergo mixing, as previously mentioned, where any point that started in a domain has an equal chance of ending up in any attractor in any sub-domain of that domain.

The retrodiction entropy in the chaotic regime also has occasional dips, which correlate with the "islands of stability". For example, we have marked the value r = 3.83 in green, which is where the logistic map has a period-three oscillation. The dips around r = 3.63 and at other values mark more such islands, where not just a single point (x, r), but an entire neighborhood in the unit interval, converges to the same attractor.
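A condensed sketch (ours) of the coarse-graining procedure described above; the values of b, s, and τ are scaled down from the b = 500, s = 10000 used in Fig. 3, and the entropy is left unnormalized:

```python
import numpy as np

rng = np.random.default_rng(4)

def avg_retrodiction_entropy(r, b=100, s=200, tau=500):
    """<S_R> for the logistic map, coarse-grained into b random bins."""
    edges = np.concatenate(([0.0], np.sort(rng.random(b - 1)), [1.0]))
    counts = np.zeros((b, b))                  # counts[j, i]: bin j -> bin i
    for j in range(b):
        x = rng.uniform(edges[j], edges[j + 1], size=s)
        for _ in range(tau):
            x = r * x * (1.0 - x)              # logistic map step
        i = np.searchsorted(edges, x, side='right') - 1
        np.add.at(counts, (j, np.clip(i, 0, b - 1)), 1)

    T = counts / s                             # T[j, i] = T(i | j); rows sum to 1
    P = np.diff(edges)                         # uniform prior: mass of each bin
    Pt = P @ T                                 # observation probability over final bins
    R = T * P[:, None] / np.where(Pt > 0, Pt, 1.0)[None, :]   # R[j, i] = R(j | i)

    S_R = np.zeros(b)
    for i in range(b):                         # entropy over initial bins, per final bin
        p = R[:, i][R[:, i] > 0]
        S_R[i] = -np.sum(p * np.log(p))
    return np.sum(Pt * S_R)

print(avg_retrodiction_entropy(3.2))   # periodic regime: lower <S_R>
print(avg_retrodiction_entropy(3.7))   # chaotic regime: higher <S_R>
```

Sweeping r over [2.8, 4] with this function traces out, at coarser resolution, the staircase structure of the bottom panel of Fig. 3.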
DISCUSSION

The approach of using retrodiction entropy bears some similarities to other methods of inference, particularly maximum a posteriori (MAP) estimation and other Bayesian methods, but also has significant differences. Philosophically, our goal in defining S_R is not to find the mode of a distribution (this is the usual goal of Bayesian inference), but to characterize the information contained in the distribution as a whole. Identifying modes, or the most likely initial state, can be very misleading. For example, in highly degenerate systems, there could be many peaks in R, each containing a small amount of probability mass. In contrast, S_R characterizes the information content of the entire probability distribution.

That being said, entropy does not constitute a complete characterization of a probability distribution either. For example, it might be informative to pull out a guess from R and compare it with the actual initial state,

∫∫ R_y(x₁)(x₁ − x₂)² R_y(x₂) dx₁ dx₂.

Since entropy does not take into account information about the spatial location of probability mass, it would not inform on this quantity.
Comparison with other approaches
There is a long history of inference and information theory in the development of statistical mechanics. Here, we briefly review a few similar methods of doing inference and measuring predictability.

Problems in inverse statistical mechanics are generally solved using maximum likelihood estimation (MLE) or, if prior information is available, maximum a posteriori estimation (MAP). Other methods are available, for example the pseudolikelihood [25]. However, most of the problems typically treated in inverse statistical physics are lattice problems, and the typical goal is to find microscopic parameters of the system given some number of (generally independent) measurements, rather than finding the state of the system in the past. For example, a prototypical inverse statistical mechanics problem is the inverse Ising problem [26], where the connections J_ij between spin variables are unknown, the spin configuration is sampled some number of times from the equilibrium distribution, and the problem is to infer the most likely matrix J_ij.

A line of papers by J. Crutchfield and C. J. Ellison treat semi-infinite chains of random variables as consecutive states in discrete time, and suggest that the mutual information between semi-infinite sets of variables is a good measure of the amount of information about the past stored in the present [27–30]. Their backwards entropy, h_μ = lim_{n→∞} H(X_{−n+1}, ..., X₀)/n, differs from our retrodiction entropy, which, in compatible notation, becomes ⟨S_R⟩ = H(X) − I(X; X_t) (cf. (8)). Note that while h_μ is defined for a chain of infinite time points, retrodiction entropy operates between two specific times.

The goals of computational mechanics and our retrodiction entropy approach are different. Computational mechanics asks what finite state machine can statistically reproduce a sequence of random variables. Furthermore, many of the examples they treat are not physical systems, but finite state computational processes – they look at e.g. the random insertion process [27], the random noisy copy, and the golden mean process [28], though in [30] the authors look at reproducing the patterns in different Ising systems.

In addition, the constraint of having infinite pasts and futures amounts to studying systems only in equilibrium, which is not a case we would typically be interested in when studying retrodiction entropy.

Possible generalizations
We can loosen our formalism to make it applicable to general inference problems, not just problems in statistical mechanics. An inference problem is typically of the form where there is a space of sets of possible model parameters, A, and a space of possible observed outcomes, Ω. The transition probability is the probability that an observable event occurs given a set of model parameters. There is not necessarily any variable that serves as "time." As the problem is one of reconstructing parameters, and there is no time, hence no "past," we would call the Bayesian inverse of T the reconstruction probability, and call the corresponding S_R the reconstruction entropy (instead of retrodiction probability and entropy).

Reconstruction entropy is a measure of how well we can determine the parameters of a system given an observed event generated from a model with unknown parameters. Retrodiction entropy is a special case of this where the set of parameters is the same as the set of observables (A = Ω), e.g. both are phase space. Additionally, when retrodicting, we consider a parameterized family of transition probabilities, understanding this parameter to be our system time. For the more general reconstruction entropy, most of the formulas we have derived still hold, for example Eqs. (1) to (4), (6) and (8), and the KL divergence relations in Appendix B. On the other hand, results like (7) do not hold if there is no time parameter.

CONCLUSION
We introduced the notion of retrodiction entropy as a measure of our ability to infer the past state of a collection of particles based on a single measurement of the system, and derived a relationship between this and thermodynamic entropy. We have established bounds on the retrodiction entropy generation rate, derived a set of KL divergence relations between the different relevant probabilities, and outlined retrodiction entropy's asymptotic properties. We also showed that for systems where the initial state is an equilibrium distribution, the average forward and retrodiction entropies are identical. Lastly, we analytically solved two concrete examples, quantifying how rapidly a system of particles forgets its initial state in convex, concave and flat potentials, and analyzing macrostate retrodiction entropy for a chaotic system. In particular, we saw that in a concave potential there is an upper limit to the loss of information pertaining to the initial state, and for the logistic map, we saw sharp changes in the asymptotic retrodiction entropy at period doublings, and could identify islands of stability in the chaotic regime by dips in retrodiction entropy.

The connection between thermodynamic quantities, ⟨S_T⟩ and S_t, and a purely information theoretical one, S_R, is in accordance with the seminal works of Maxwell, Smoluchowski, Landauer, Szilard, Bekenstein, and others [1–4]. We now know, from (3), that the thermodynamic entropy at the present time not only quantifies the information content of the state of the system at the present time, it also relates to how precisely information about the original state of the system can be recovered after some amount of time has passed.

APPENDIX A: DERIVATION OF THE RELATIONSHIP BETWEEN RETRODICTION ENTROPY AND THERMODYNAMIC ENTROPY
We use sum notation throughout, although the sums could be replaced with integrals. Suppose P is normalized. Then (3) can be proved through direct summation:

⟨S_R⟩ = Σ_ω P_t(ω) S_R(ω)
     = −Σ_ω P_t(ω) Σ_α R_ω(α) log R_ω(α)
     = −Σ_{ω,α} T_α(ω) P(α) log[T_α(ω) P(α)/P_t(ω)]
     = −Σ_{ω,α} P(α) T_α(ω) log T_α(ω) + Σ_ω P_t(ω) log P_t(ω) − Σ_α P(α) log P(α)
     = ⟨S_T⟩ − (S_t − S),

where we substituted P_t(ω) = Σ_α T_α(ω) P(α).

As an explicit example of this, consider the Gaussian process family we discussed in the paper, with T, P, P_t given by (11), (12) and (13). For this case,

S_T = ⟨S_T⟩ = (1/2) log[π^{Nd} Π_{α=1}^{d} D_α(t)^N] + Nd/2,
S = (1/2) log[(2πe)^d Π_{α=1}^{d} σ_α²],
S_t = (1/2) log[π^{Nd} Π_{α=1}^{d} D_α(t)^{N−1}(D_α(t) + 2Nσ_α²λ_α(t)²)] + Nd/2,
S_R = ⟨S_R⟩ = (1/2) log[(πe/N)^d Π_{α=1}^{d} D_α(t)κ_α(t)/λ_α(t)²].

APPENDIX B: KL-DIVERGENCE RELATIONS
Here, we derive (6). We start with the definition of the KL divergence:

D(R_{ω₁}‖R_{ω₂}) = Σ_α R_{ω₁}(α) log[R_{ω₁}(α)/R_{ω₂}(α)]
= Σ_α [T_α(ω₁)P(α)/P_t(ω₁)] { log[T_α(ω₁)/T_α(ω₂)] + log[P_t(ω₂)/P_t(ω₁)] }.

Averaging over ω₁ and ω₂ with the probability weight P_t(ω₁)P_t(ω₂), the first term in the braces gives

Σ_{ω₁,ω₂,α} P_t(ω₂) T_α(ω₁) P(α) log[T_α(ω₁)/T_α(ω₂)]
= Σ_{ω₁,α} P(α) T_α(ω₁) log T_α(ω₁) − Σ_{ω₂,α} P(α) P_t(ω₂) log T_α(ω₂)
= S_t − ⟨S_T⟩ + ⟨D(P_t‖T_α)⟩

(we have used the fact that −Σ_ω P_t(ω) log T_α(ω) = D(P_t‖T_α) + S_t and Σ_ω T_α(ω) = 1), whereas the second term gives

Σ_{ω₁,ω₂,α} P_t(ω₂) T_α(ω₁) P(α) log[P_t(ω₂)/P_t(ω₁)]
= Σ_{ω₁,ω₂} P_t(ω₁) P_t(ω₂) log[P_t(ω₂)/P_t(ω₁)]
= S_t − S_t = 0.

Putting everything together,

⟨D(R_{ω₁}‖R_{ω₂})⟩ = ⟨D(P_t‖T_α)⟩ + S_t − ⟨S_T⟩.

The second term here is

⟨D(P_t‖T_{α₁})⟩ = −Σ_{α₁,ω} P(α₁) P_t(ω) log[T_{α₁}(ω)/P_t(ω)]
= −Σ_{α₁,α₂,ω} P(α₁) P(α₂) T_{α₂}(ω) log T_{α₁}(ω) − S_t
= ⟨D(T_{α₂}‖T_{α₁})⟩ + ⟨S_T⟩ − S_t.

Putting these equations together gives us Eq. (6).

We can take the KL divergence between any pair of distributions that have a common domain. It is natural to compare only distributions that are either both on the final state or both on the initial state. Furthermore, as the KL divergence is asymmetric, we can ask about both orderings. The six options are (T, T), (T, P_t), (P_t, T), (R, R), (R, P), and (P, R). In a similar way to the derivations above, we can find relations between the averages of the KL divergences between all these pairs, in terms of each other or in terms of entropies:

⟨D(T_{α₁}‖T_{α₂})⟩ = ⟨D(P‖R_ω)⟩ + S_t − ⟨S_T⟩
⟨D(T_α‖P_t)⟩ = ⟨D(R_ω‖P)⟩ = S − ⟨S_R⟩ = S_t − ⟨S_T⟩
⟨D(P_t‖T_α)⟩ = ⟨D(P‖R_ω)⟩
⟨D(T_{α₁}‖T_{α₂})⟩ = ⟨D(R_{ω₁}‖R_{ω₂})⟩

One can put these together to derive relations for the averages of the symmetric combinations of KL divergences:

⟨D(T_α‖P_t) + D(P_t‖T_α)⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩
⟨D(R_ω‖P) + D(P‖R_ω)⟩ = ⟨D(T_{α₁}‖T_{α₂})⟩

APPENDIX C: LIMITS ON THE SIZE OF OBSERVATIONAL AND RETRODICTION ENTROPY
We can use Jensen's inequality to put an upper bound on the time rate of change of S_t. Since −log x is a convex function, we have the inequality

−log Σ_α P(α) T_α(ω) ≤ −Σ_α P(α) log T_α(ω).

Start with the definition of S_t, then apply Jensen's inequality (using Σ_ω Ṗ_t(ω) = 0):

Ṡ_t = −Σ_ω Ṗ_t(ω) log P_t(ω) = −Σ_ω Ṗ_t(ω) log Σ_α P(α) T_α(ω)
≤ −Σ_{α,ω} P(α) Ṗ_t(ω) log T_α(ω)
= −Σ_{α,ω} P(α) Ṗ_t(ω) { log[T_α(ω)/P_t(ω)] + log P_t(ω) }
= −Σ_{α,ω} P(α) Ṗ_t(ω) log[T_α(ω)/P_t(ω)] + Ṡ_t.

Canceling the Ṡ_t terms on both sides yields

0 ≤ −Σ_α P(α) Σ_ω Ṗ_t(ω) log[T_α(ω)/P_t(ω)],

which bears some similarity to the KL divergence. The derivative of an arbitrary KL divergence is

∂/∂t D(p‖q) = −Σ ṗ log(q/p) − Σ (p/q) q̇.

Using this in the preceding inequality, we get

0 ≤ ∂/∂t ⟨D(P_t‖T_α)⟩ + Σ_α P(α) Σ_ω P_t(ω) ∂/∂t log T_α(ω)
= ∂/∂t ⟨D(P_t‖T_α)⟩ + Σ_α P(α) Σ_ω P_t(ω) ∂/∂t log[R_ω(α) P_t(ω)/P(α)]
= ∂/∂t ⟨D(P_t‖T_α)⟩ + Σ_ω P_t(ω) Σ_α P(α) ∂/∂t log[R_ω(α)/P(α)]
= ∂/∂t ⟨D(P_t‖T_α)⟩ − ⟨∂/∂t D(P‖R_ω)⟩,

where we used Σ_ω P_t(ω) ∂/∂t log P_t(ω) = Σ_ω Ṗ_t(ω) = 0 and the time-independence of P. Using the expression we previously derived for ⟨D(P_t‖T_α)⟩, we can reintroduce Ṡ_t to the equation,

Ṡ_t ≤ ⟨Ṡ_T⟩ + ∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ − ⟨∂/∂t D(P‖R_ω)⟩.

This in turn bounds ∂⟨S_R⟩/∂t via (3):

∂⟨S_R⟩/∂t ≥ −∂/∂t ⟨D(T_{α₁}‖T_{α₂})⟩ + ⟨∂/∂t D(P‖R_ω)⟩.   (15)

Now we will make use of the fact that for a Markov process, the relative entropy between two distributions is non-increasing [21]. We include this theorem below for the sake of completeness.

Theorem:
Consider two probability distributions p, q on the same state space, evolving under the same Markov process. Then at any times t₁ < t₂, D(p_{t₁}‖q_{t₁}) ≥ D(p_{t₂}‖q_{t₂}).

Proof:
Let s < t. By the chain rule for relative entropy,

D(p(x_s, x_t)‖q(x_s, x_t)) = D(p(x_t)‖q(x_t)) + D(p(x_s|x_t)‖q(x_s|x_t))
                          = D(p(x_s)‖q(x_s)) + D(p(x_t|x_s)‖q(x_t|x_s)).

Since both distributions evolve under the same Markov process, p(x_t|x_s) = q(x_t|x_s), so D(p(x_t|x_s)‖q(x_t|x_s)) = 0. Then, subtracting the two lines, we get

D(p_t‖q_t) − D(p_s‖q_s) = −D(p(x_s|x_t)‖q(x_s|x_t)) ≤ 0.

Hence ∂/∂t D(T_{α₁}‖T_{α₂}) ≤ 0 for any α₁, α₂. Therefore, the first term on the right hand side of Eq. (15) is non-negative.

The second term of Eq. (15) is harder to work with. Intuitively, we expect R to approach P as we lose information about the past due to stochastic events. So we expect D(P‖R_ω) to eventually reach a minimum for any fixed ω. As long as ⟨D(P‖R_ω)⟩ decreases more slowly than ⟨D(T_{α₁}‖T_{α₂})⟩, this bound is good enough to guarantee that ∂⟨S_R⟩/∂t ≥ 0.

∗ Corresponding Author: [email protected]
[1] E. T. Jaynes, Physical Review 106, 620 (1957).
[2] E. T. Jaynes, Physical Review 108, 171 (1957).
[3] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (University of Illinois Press, 1998).
[4] H. S. Leff and A. F. Rex, Maxwell's Demon: Entropy, Information, Computing (Princeton University Press, 2014).
[5] G. E. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis, Vol. 40 (John Wiley & Sons, 2011).
[6] M. Welling and Y. W. Teh, in Proceedings of the 28th International Conference on Machine Learning (ICML-11) (2011) pp. 681–688.
[7] B. A. Desmarais and S. J. Cranmer, Physica A: Statistical Mechanics and its Applications, 1865 (2012).
[8] V. A. T. Nguyen and D. C. Vural, Physical Review E, 032314 (2017).
[9] P. C. Hansen, J. G. Nagy, and D. P. O'Leary, Deblurring Images: Matrices, Spectra, and Filtering (SIAM, 2006).
[10] R. H. Chan and K. Chen, SIAM Journal on Scientific Computing, 1043 (2010).
[11] P. Ullersma, Physica 32, 27 (1966).
[12] H.-Y. Yu, D. M. Eckmann, P. S. Ayyaswamy, and R. Radhakrishnan, Physical Review E, 052303 (2015).
[13] W. T. Coffey and Y. P. Kalmykov, The Langevin Equation: With Applications to Stochastic Problems in Physics, Chemistry and Electrical Engineering, Vol. 27 (World Scientific, 2012).
[14] F. Wolf, Journal of Mathematical Physics 29, 305 (1988).
[15] M. Hashemi, Physica A: Statistical Mechanics and its Applications, 141 (2015).
[16] M. Bernstein and L. S. Brown, Physical Review Letters, 1933 (1984).
[17] J. A. Carrillo and G. Toscani, Mathematical Methods in the Applied Sciences, 1269 (1998).
[18] G. Toscani, Quarterly of Applied Mathematics 57, 521 (1999).
[19] V. Schwämmle, E. M. Curado, and F. D. Nobre, The European Physical Journal B – Condensed Matter and Complex Systems, 159 (2007).
[20] A. R. Plastino, H. G. Miller, and A. Plastino, Physical Review E, 3927 (1997).
[21] T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012).
[22] R. M. May, Nature 261, 459 (1976).
[23] P. Grassberger and I. Procaccia, Physical Review Letters 50, 346 (1983).
[24] J. D. Farmer, Zeitschrift für Naturforschung A 37, 1304 (1982).
[25] J. Besag, Journal of the Royal Statistical Society, Series B (Methodological) 36, 192 (1974).
[26] H. C. Nguyen, R. Zecchina, and J. Berg, Advances in Physics 66, 197 (2017).
[27] J. P. Crutchfield, C. J. Ellison, and J. R. Mahoney, Physical Review Letters 103, 094101 (2009).
[28] C. J. Ellison, J. R. Mahoney, and J. P. Crutchfield, Journal of Statistical Physics 136