Bounding the coarse graining error in hidden Markov dynamics
David Andrieux
Center for Nonlinear Phenomena and Complex Systems, Université Libre de Bruxelles, B-1050 Brussels, Belgium
Lumping a Markov process introduces a coarser level of description that is useful in many contexts and applications. The dynamics on the coarse grained states is often approximated by its Markovian component. In this letter we derive finite-time bounds on the error in this approximation. These results hold for non-reversible dynamics and for probabilistic mappings between microscopic and coarse grained states.
I. INTRODUCTION
Markov processes are a standard modeling tool used in many applications ranging from finance [1] and telecommunications [2] to physics [3], chemistry [4], biology [5], and computer science [6]. For theoretical and practical reasons, it is often convenient to partition the state space into aggregates and to view the dynamics at a coarser level. The coarse graining operation bridges the gap between different levels of description by introducing "mesostates", each representing many microstates. Analysis of experimental data naturally leads to coarse graining, as observation techniques may not be able to resolve the set of microscopic states and only give access to mesostates. Similarly, disregarding the environment or part of a system provides an effective description of the remaining (sub)system of interest.

A deterministic mapping between micro and mesostates is too restrictive to be applicable in many problems of interest. The concept of coarse graining can be extended to include the case where the mapping is a probabilistic function of the microstates. The resulting model, called a hidden Markov model, is a doubly embedded stochastic process [7]. Such processes are especially known for their applications in temporal pattern recognition, such as speech and handwriting recognition [8], and in bioinformatics [9].

Although the dynamics on the aggregates is not Markovian in general [10], there is a natural choice for a Markov dynamics on the set of lumped states [10-12]. This dynamics reproduces the influence of the first past state on the transition probabilities and neglects the higher-order memory effects. It matches the time evolution of the original unlumped chain started in the stationary state. Furthermore, the probability transfers between aggregates match those arising from the original chain.

In the present work we analyze the accuracy of such coarse grained models as compared to the exact microscopic behavior.
This problem was first envisaged by Hoffmann and Salamon in [12] for the special case of deterministic coarse graining and reversible Markov chains. Reversible Markov chains have transition matrices diagonally similar to symmetric matrices, a strong symmetry property at the basis of their analysis. Here we generalize their approach to the case of probabilistic coarse graining and non-reversible dynamics. We obtain bounds on the error made by using the lumped dynamics as a model of the unlumped dynamics.
II. LUMPED MARKOV CHAINS
We consider a Markov chain characterized by a transition matrix $G$ on the finite state space $\Sigma$. The probability distribution $\boldsymbol{p} = (p_1, p_2, \ldots, p_N)$ evolves in discrete time steps according to

$$\boldsymbol{p}(n+1) = \boldsymbol{p}(n)\, G . \tag{1}$$

We assume that the Markov chain is primitive, i.e., there exists an $n$ such that $G^n$ has all positive entries. This guarantees that $G$ has a unique stationary distribution $\boldsymbol{\pi}$ such that

$$\boldsymbol{\pi} = \boldsymbol{\pi}\, G . \tag{2}$$

Our goal is to analyze lumped dynamics. Let $\{\omega_j\} \in \Omega$ be the set of mesostates. The mesostate $\omega$ is observed with probability $b_i(\omega)$ when the system is in microstate $i$. We collect these conditional probability distributions into the matrix $C$ with elements

$$C_{i\omega} = b_i(\omega) . \tag{3}$$

The matrix $C$ serves to specify the lumped probability distribution $\hat{\boldsymbol{p}} = \boldsymbol{p}\, C$ on $\Omega$ corresponding to a distribution $\boldsymbol{p}$ on $\Sigma$. We also introduce the matrix $D$ with elements

$$D_{\omega i} = \frac{\pi_i\, b_i(\omega)}{\sum_k \pi_k\, b_k(\omega)} . \tag{4}$$

The element $D_{\omega i}$ is the conditional probability to be in state $i$ given the observation $\omega$. In the case of a deterministic association between microstates and aggregates, the operators $C$ and $D$ reduce to the operators introduced in [12]. Their successive action defines a stochastic operator $CD$ that satisfies

$$\boldsymbol{\pi} = \boldsymbol{\pi}\, CD . \tag{5}$$

Following [11], we now introduce the lumped dynamics with transition matrix

$$\hat{G} = DGC . \tag{6}$$

This matrix is stochastic: $\hat{G}_{\omega\omega'} \ge 0$ and $\sum_{\omega'} \hat{G}_{\omega\omega'} = 1$. This choice of the transition matrix ensures that the distribution $\hat{\boldsymbol{\pi}} = \boldsymbol{\pi}\, C$ is the stationary distribution of $\hat{G}$:

$$\hat{\boldsymbol{\pi}}\, \hat{G} = \boldsymbol{\pi}\, CDGC = \boldsymbol{\pi}\, GC = \boldsymbol{\pi}\, C = \hat{\boldsymbol{\pi}} . \tag{7}$$

By construction, the dynamics $\hat{G}$ also preserves the probability fluxes between states in the coarse grained description. Precisely, the dynamics $\hat{G}$ arises from the Markovian approximation of a stationary sequence of observed mesostates; $\hat{G}$ is the unique Markov chain on $\Omega$ that satisfies this condition.
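The construction above is straightforward to check numerically. The following minimal sketch (assuming NumPy; the chain and the observation probabilities are randomly generated for illustration only) builds $C$, $D$, and $\hat{G} = DGC$, and verifies the stationarity properties (5) and (7):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 2                                    # N microstates, M mesostates

# Primitive transition matrix G: strictly positive rows, normalized to one
G = rng.random((N, N)) + 0.1
G /= G.sum(axis=1, keepdims=True)

# Stationary distribution pi, Eq. (2): left Perron eigenvector of G
w, V = np.linalg.eig(G.T)
pi = np.real(V[:, np.argmax(w.real)])
pi /= pi.sum()

# Observation matrix C, Eq. (3): row i holds the distribution b_i(.) over mesostates
C = rng.random((N, M))
C /= C.sum(axis=1, keepdims=True)

# "Disaggregation" matrix D, Eq. (4): Bayesian inversion of C with respect to pi
D = (C * pi[:, None]).T
D /= D.sum(axis=1, keepdims=True)

# CD is stochastic and leaves pi invariant, Eq. (5)
assert np.allclose((C @ D).sum(axis=1), 1.0)
assert np.allclose(pi @ C @ D, pi)

# Lumped dynamics G_hat = DGC, Eq. (6), with stationary distribution pi_hat = pi C, Eq. (7)
G_hat = D @ G @ C
pi_hat = pi @ C
assert np.allclose(G_hat.sum(axis=1), 1.0)
assert np.allclose(pi_hat @ G_hat, pi_hat)
print("lumped dynamics consistent with Eqs. (5)-(7)")
```

In the deterministic-lumping limit, $C$ has a single unit entry per row and $D$ reduces to the aggregation operators of [12].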
III. BOUNDING COARSE GRAINING ERRORS

Starting from a distribution $\boldsymbol{p}$ on $\Sigma$, its time evolution among the aggregates with the unlumped dynamics is $\boldsymbol{p}\, G^n C$, while its time evolution with the lumped dynamics is $\boldsymbol{p}\, C \hat{G}^n$. The main question considered here is how different these two dynamics can be.

To address this question, we define the norm $\|\boldsymbol{v}\|_\pi = \|\boldsymbol{v}\, U_\pi^{-1}\|$, where $U_\pi = \operatorname{diag}(\sqrt{\pi_1}, \sqrt{\pi_2}, \ldots, \sqrt{\pi_N})$ and $\|\cdot\|$ is the 2-norm. The corresponding operator norm is

$$\|A\|_\pi = \|U_\pi A\, U_\pi^{-1}\| . \tag{8}$$

The difference between the two probability distributions after $n$ time steps can be expressed as

$$\big\| \boldsymbol{p}\, C \hat{G}^n - \boldsymbol{p}\, G^n C \big\|_\pi = \big\| \boldsymbol{p}\, C (DGC)^n - \boldsymbol{p}\, G^n C \big\|_\pi = \big\| \boldsymbol{p}\, (CDG)^n C - \boldsymbol{p}\, G^n C \big\|_\pi = \big\| \boldsymbol{p}\, \big( (CDG)^n - G^n \big)\, C \big\|_\pi . \tag{9}$$

As emphasized in [12], we observe the prominent role of the operator $CDG \equiv H$, which specifies a dynamics on the original state space $\Sigma$.

Our goal will be to bound the $n$-step difference $\|H^n - G^n\|_\pi$. The $n$-step difference can transiently grow, but must eventually decline to zero as, by construction, the lumped and the unlumped chain converge to the same stationary distribution.

We use the fact that $H$ and $G$ have the common stationary distribution $\boldsymbol{\pi}$. We define the projection operator $P_\pi = \boldsymbol{u}^T \boldsymbol{\pi}$, where $\boldsymbol{u}$ is the vector $(1, 1, \ldots, 1) \in \mathbb{R}^N$, and the complementary projection $P_\sigma = I - P_\pi$. We end up with the following representation of $G$ and $H$:

$$G = (P_\pi + P_\sigma)\, G = P_\pi + P_\sigma G \tag{10}$$

and

$$H = (P_\pi + P_\sigma)\, H = P_\pi + P_\sigma H . \tag{11}$$

From (10) and (11) we obtain $H - G = P_\sigma H - P_\sigma G$, as well as $P_\pi P_\sigma G = P_\sigma G P_\pi = P_\pi P_\sigma H = P_\sigma H P_\pi = 0$.

The norms $\|P_\sigma H\|_\pi$ and $\|P_\sigma G\|_\pi$ will play a crucial role in our analysis. In this regard, we prove the following theorem.
Theorem 1. Let $G$ be a transition probability matrix, and let $P_\sigma$ be the projection operator introduced above. Then

$$\|P_\sigma G\|_\pi < 1 . \tag{12}$$

Furthermore, $\|P_\sigma G\|_\pi = \sigma_2$, where $\sigma_2$ is the second-largest singular value of $U_\pi G\, U_\pi^{-1}$.

PROOF OF THEOREM 1. We will use the following result [13]. Let $A$ be a matrix with nonnegative entries and spectral radius $\rho(A)$, and suppose that there exist left and right positive Perron eigenvectors $\boldsymbol{v}$ and $\boldsymbol{w}$, respectively. Then $\rho(A) = \|X A X^{-1}\|$, where $X = \operatorname{diag}(v_j^{1/2} w_j^{-1/2})$.

In our case this translates into $\|G\|_\pi = \|U_\pi G\, U_\pi^{-1}\| = \rho(G) = 1$. Because the 2-norm of a matrix is given by its dominant singular value, we deduce that the dominant singular value $\sigma_1$ of $U_\pi G\, U_\pi^{-1}$ equals 1. Equivalently, the dominant eigenvalue of $(U_\pi G U_\pi^{-1})(U_\pi G U_\pi^{-1})^T$ is $\sigma_1^2 = 1$.

We now turn to the norm $\|P_\sigma G\|_\pi$. Note that $P_\sigma G\, \boldsymbol{u}^T = (G - P_\pi)\, \boldsymbol{u}^T = 0$, so that $P_\sigma G$ has negative elements and the above construction cannot be applied. Introducing the notation $U_\pi A\, U_\pi^{-1} \equiv \bar{A}$, the norm $\|P_\sigma G\|_\pi = \|\bar{P}_\sigma \bar{G}\|$. It is thus given by the square root of the largest eigenvalue of $\bar{P}_\sigma \bar{G} \bar{G}^T \bar{P}_\sigma^T$.

First, we consider the projection operator $\bar{P}_\pi$. A direct calculation shows that $P_\pi$ is symmetrized by the transformation $\bar{P}_\pi = U_\pi P_\pi U_\pi^{-1}$. It follows that $\bar{P}_\sigma = I - \bar{P}_\pi$ is symmetric as well. Now, because $G$ and $P_\sigma$ commute, $\bar{G}$ and $\bar{P}_\sigma$ commute with each other. Accordingly, $\bar{P}_\sigma \bar{G} \bar{G}^T \bar{P}_\sigma^T = \bar{P}_\sigma \bar{G} \bar{G}^T$. Furthermore, there exists an eigenbasis such that the eigenvalues of $\bar{P}_\sigma \bar{G} \bar{G}^T$ take the form $\alpha_i \beta_i$, where $\alpha_i$ and $\beta_i$ are the eigenvalues of $\bar{P}_\sigma$ and $\bar{G}\bar{G}^T$, respectively. Because $\bar{P}_\sigma$ is a projection operator, it has $(N-1)$ eigenvalues 1 and one eigenvalue 0. The latter corresponds to the right eigenvector $(\sqrt{\pi_i})$.

The vector $(\sqrt{\pi_i})$ is a right eigenvector of $\bar{G}\bar{G}^T$ with eigenvalue 1. From $\sigma_1 = 1$, we have that 1 is the dominant eigenvalue of $\bar{G}\bar{G}^T$. Therefore, the eigenvalues of $\bar{P}_\sigma \bar{G}\bar{G}^T$ are given by the eigenvalues of $\bar{G}\bar{G}^T$, except for its dominant eigenvalue 1, which is replaced by 0. In particular, $\|P_\sigma G\|_\pi$ is given by the square root of the dominant eigenvalue of $\bar{P}_\sigma \bar{G}\bar{G}^T$ or, equivalently, by the second-largest singular value $\sigma_2$ of $\bar{G} = U_\pi G\, U_\pi^{-1}$.

We now have to prove that $\sigma_2 < 1$. We already know that $\sigma_2 \le \sigma_1 = 1$, but we now show that the inequality is strict. This follows from the Perron-Frobenius theorem applied to $\bar{G}\bar{G}^T$. Indeed, because $G$ is a primitive transition matrix, $\bar{G}\bar{G}^T$ is nonnegative and primitive, so that its dominant eigenvalue $\sigma_1^2 = 1$ is simple and $\sigma_2 < 1$. Recalling that the norm of an operator is always greater than or equal to its spectral radius, $\rho(A) \le \|A\|$, we arrive at

$$|\lambda_2| \le \|P_\sigma G\|_\pi = \sigma_2 < 1 , \tag{13}$$

where $\lambda_2$ is the second-largest (in modulus) eigenvalue of $G$. $\Box$

Because $H$ is a transition matrix, we deduce from Theorem 1 that $\|P_\sigma H\|_\pi = \eta_2 < 1$, with $\eta_2$ the second-largest singular value of $U_\pi H\, U_\pi^{-1}$. Furthermore, we have

$$\|P_\sigma H\|_\pi = \|CDG - P_\pi\|_\pi = \|CDG - CD P_\pi\|_\pi = \|CD\, (G - P_\pi)\|_\pi \le \|CD\|_\pi\, \|P_\sigma G\|_\pi . \tag{14}$$

Noting that $CD$ is stochastic and that the similarity transform by $U_\pi$ symmetrizes $CD$, we conclude that $\|CD\|_\pi = 1$. This leads to $\|P_\sigma H\|_\pi \le \|P_\sigma G\|_\pi$ or

$$\eta_2 \le \sigma_2 . \tag{15}$$
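Theorem 1 and the inequality (15) lend themselves to a direct numerical check. In this sketch (assuming NumPy; the chain and observation matrix are randomly generated for illustration), $\sigma_2$ and $\eta_2$ are read off from singular value decompositions and compared with $\|P_\sigma G\|_\pi$ and $|\lambda_2|$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 6, 3
G = rng.random((N, N)) + 0.1
G /= G.sum(axis=1, keepdims=True)              # primitive stochastic matrix
w, V = np.linalg.eig(G.T)
pi = np.real(V[:, np.argmax(w.real)])
pi /= pi.sum()                                 # stationary distribution
C = rng.random((N, M)); C /= C.sum(axis=1, keepdims=True)
D = (C * pi[:, None]).T; D /= D.sum(axis=1, keepdims=True)
H = C @ D @ G                                  # H = CDG

U, Uinv = np.diag(np.sqrt(pi)), np.diag(1 / np.sqrt(pi))
P_pi = np.outer(np.ones(N), pi)                # P_pi = u^T pi
P_sig = np.eye(N) - P_pi

def pi_norm(A):                                # operator norm (8)
    return np.linalg.norm(U @ A @ Uinv, 2)

svals_G = np.linalg.svd(U @ G @ Uinv, compute_uv=False)
sigma2 = svals_G[1]                            # second-largest singular value
eta2 = np.linalg.svd(U @ H @ Uinv, compute_uv=False)[1]
lam2 = sorted(np.abs(np.linalg.eigvals(G)))[-2]

print(np.isclose(svals_G[0], 1.0))             # dominant singular value equals 1
print(np.isclose(pi_norm(P_sig @ G), sigma2))  # Theorem 1
print(sigma2 < 1.0)                            # Eq. (12)
print(lam2 <= sigma2 + 1e-12)                  # Eq. (13)
print(eta2 <= sigma2 + 1e-12)                  # Eq. (15)
```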
We are now in a position to derive our first bound. The following theorem bounds the $n$-step difference $\|H^n - G^n\|_\pi$ in terms of the one-step difference $\|H - G\|_\pi$.

Theorem 2. Let $G$ be a transition probability matrix, and let $H = CDG$ with $C$ and $D$ as introduced above. Define

$$\delta = \|H - G\|_\pi . \tag{16}$$

Then

$$\|H^n - G^n\|_\pi \le \delta\, K'(n) \le \delta\, K(n) , \tag{17}$$

where $K'(n) = (\sigma_2^n - \eta_2^n)/(\sigma_2 - \eta_2)$ and $K(n) = n\, \sigma_2^{n-1}$. Here $\sigma_2$ and $\eta_2$ are the second-largest singular values of $U_\pi G\, U_\pi^{-1}$ and $U_\pi H\, U_\pi^{-1}$, respectively.

PROOF OF THEOREM 2. We start by expressing the $n$-step difference in terms of the one-step difference as in [12]. We note that

$$\|H^n - G^n\|_\pi = \big\| (H - G)\, H^{n-1} + G\, (H^{n-1} - G^{n-1}) \big\|_\pi . \tag{18}$$

Iterating, we find

$$\|H^n - G^n\|_\pi = \Big\| \sum_{k=0}^{n-1} G^k (H - G)\, H^{n-k-1} \Big\|_\pi \le \sum_{k=0}^{n-1} \big\| G^k (H - G)\, H^{n-k-1} \big\|_\pi . \tag{19}$$

Then, for integers $0 \le k \le n-1$,

$$\big\| G^k (H - G)\, H^{n-k-1} \big\|_\pi = \big\| (P_\pi + P_\sigma G)^k (P_\sigma H - P_\sigma G)(P_\pi + P_\sigma H)^{n-k-1} \big\|_\pi = \big\| (P_\sigma G)^k (P_\sigma H - P_\sigma G)(P_\sigma H)^{n-k-1} \big\|_\pi \le \|P_\sigma G\|_\pi^k\, \delta\, \|P_\sigma H\|_\pi^{n-k-1} . \tag{20}$$

Combining (19) and (20) we find

$$\|H^n - G^n\|_\pi \le \delta \sum_{k=0}^{n-1} \|P_\sigma G\|_\pi^k\, \|P_\sigma H\|_\pi^{n-k-1} . \tag{21}$$

Carrying out the summation in (21) and using Theorem 1 we obtain the bound

$$\|H^n - G^n\|_\pi \le \delta\, K'(n) = \delta\, \frac{\sigma_2^n - \eta_2^n}{\sigma_2 - \eta_2} . \tag{22}$$

This expression can be bounded from above by combining (21) and (15) to give

$$K'(n) \le K(n) = n\, \sigma_2^{n-1} , \tag{23}$$

with equality when $\eta_2 = \sigma_2$. $\Box$
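Theorem 2 can likewise be probed numerically. The sketch below (assuming NumPy, with a randomly generated chain and observation matrix as before) compares the exact $n$-step difference with the bounds $\delta K'(n)$ and $\delta K(n)$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 3
G = rng.random((N, N)) + 0.1
G /= G.sum(axis=1, keepdims=True)
w, V = np.linalg.eig(G.T)
pi = np.real(V[:, np.argmax(w.real)]); pi /= pi.sum()
C = rng.random((N, M)); C /= C.sum(axis=1, keepdims=True)
D = (C * pi[:, None]).T; D /= D.sum(axis=1, keepdims=True)
H = C @ D @ G

U, Uinv = np.diag(np.sqrt(pi)), np.diag(1 / np.sqrt(pi))

def pi_norm(A):                                # operator norm (8)
    return np.linalg.norm(U @ A @ Uinv, 2)

sigma2 = np.linalg.svd(U @ G @ Uinv, compute_uv=False)[1]
eta2 = np.linalg.svd(U @ H @ Uinv, compute_uv=False)[1]
delta = pi_norm(H - G)                         # one-step difference, Eq. (16)

for n in range(1, 11):
    diff = pi_norm(np.linalg.matrix_power(H, n) - np.linalg.matrix_power(G, n))
    K_prime = (sigma2**n - eta2**n) / (sigma2 - eta2)
    K = n * sigma2**(n - 1)
    assert diff <= delta * K_prime + 1e-12 <= delta * K + 1e-12   # Eq. (17)
print("Theorem 2 bound verified for n = 1..10")
```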
We now derive a bound independent of the one-step difference $\|H - G\|_\pi$.

Theorem 3. Let $G$ be a transition probability matrix, and let $H = CDG$ with $C$ and $D$ as introduced above. Then

$$\|H^n - G^n\|_\pi \le \eta_2^n + \sigma_2^n \le 2\, \sigma_2^n , \tag{24}$$

where $\sigma_2$ and $\eta_2$ are the second-largest singular values of $U_\pi G\, U_\pi^{-1}$ and $U_\pi H\, U_\pi^{-1}$, respectively.

PROOF OF THEOREM 3. We observe that $G^n = (P_\pi + P_\sigma G)^n = P_\pi + (P_\sigma G)^n$ and $H^n = (P_\pi + P_\sigma H)^n = P_\pi + (P_\sigma H)^n$. Hence we have

$$\|H^n - G^n\|_\pi = \|(P_\sigma H)^n - (P_\sigma G)^n\|_\pi \le \|(P_\sigma H)^n\|_\pi + \|(P_\sigma G)^n\|_\pi \le \|P_\sigma H\|_\pi^n + \|P_\sigma G\|_\pi^n . \tag{25}$$

From Theorem 1 and $\|P_\sigma H\|_\pi \le \|P_\sigma G\|_\pi$ we deduce the inequalities (24). $\Box$

We end this section with the following remarks. The bound $2\sigma_2^n$ from Theorem 3 is independent of $\delta$ and $\eta_2$. Accordingly, it is valid for any probabilistic coarse graining of the original chain. This bound also shows that the coarse graining error decreases exponentially in time. Relation (13) reveals that the fastest possible decay rate of the bound is $|\lambda_2|$. We show in the next section that this rate is achieved for special classes of Markov dynamics.
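The $\delta$-independent bound of Theorem 3 can be checked in the same spirit (a sketch assuming NumPy; matrices randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 6, 3
G = rng.random((N, N)) + 0.1
G /= G.sum(axis=1, keepdims=True)
w, V = np.linalg.eig(G.T)
pi = np.real(V[:, np.argmax(w.real)]); pi /= pi.sum()
C = rng.random((N, M)); C /= C.sum(axis=1, keepdims=True)
D = (C * pi[:, None]).T; D /= D.sum(axis=1, keepdims=True)
H = C @ D @ G

U, Uinv = np.diag(np.sqrt(pi)), np.diag(1 / np.sqrt(pi))
sigma2 = np.linalg.svd(U @ G @ Uinv, compute_uv=False)[1]
eta2 = np.linalg.svd(U @ H @ Uinv, compute_uv=False)[1]

for n in range(1, 16):
    diff = np.linalg.norm(
        U @ (np.linalg.matrix_power(H, n) - np.linalg.matrix_power(G, n)) @ Uinv, 2)
    assert diff <= eta2**n + sigma2**n + 1e-12 <= 2 * sigma2**n + 1e-12   # Eq. (24)
print("Theorem 3 bound verified; decay rate sigma_2 =", round(float(sigma2), 3))
```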
IV. SPECIAL MARKOV CHAINS

A. Reversible Markov chains

A Markov chain is reversible if it obeys the detailed balance conditions

$$\pi_i\, G_{ij} = \pi_j\, G_{ji} \quad \forall\, i, j \in \Sigma . \tag{26}$$

This corresponds to an equilibrium situation where no probability currents are present in the stationary state. We note that in this case $\hat{G}$ is also reversible. Under the detailed balance conditions (26), the operator $U_\pi P_\sigma G\, U_\pi^{-1}$ is symmetric. This strong symmetry property of reversible Markov chains is at the basis of Hoffmann and Salamon's analysis. Accordingly, we have

$$\|P_\sigma G\|_\pi = |\lambda_2| , \tag{27}$$

where $\lambda_2$ is the second-largest eigenvalue of $G$ in modulus. In this way we recover the bound $K(n) = n\, |\lambda_2|^{n-1}$ of [12].
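For a reversible chain, Eq. (27) identifies $\|P_\sigma G\|_\pi$ with $|\lambda_2|$. A quick numerical illustration (assuming NumPy; the chain is a random walk on a randomly weighted undirected graph, a standard way to generate detailed-balance dynamics):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 6
A = rng.random((N, N))
A = A + A.T                                    # symmetric positive edge weights
G = A / A.sum(axis=1, keepdims=True)           # random walk on the weighted graph
pi = A.sum(axis=1) / A.sum()                   # stationary distribution

# Detailed balance (26): pi_i G_ij = A_ij / sum(A) is symmetric in (i, j)
assert np.allclose(pi[:, None] * G, (pi[:, None] * G).T)

U, Uinv = np.diag(np.sqrt(pi)), np.diag(1 / np.sqrt(pi))
P_sig = np.eye(N) - np.outer(np.ones(N), pi)
norm_PsG = np.linalg.norm(U @ (P_sig @ G) @ Uinv, 2)
lam2 = sorted(np.abs(np.linalg.eigvals(G)))[-2]
print(np.isclose(norm_PsG, lam2))              # Eq. (27)
```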
B. Doubly stochastic matrices

Doubly stochastic matrices are characterized by the property

$$\sum_i G_{ij} = 1 \quad \forall\, j \in \Sigma , \tag{28}$$

i.e., both their rows and columns sum to one. This implies that the stationary distribution is uniform, $\boldsymbol{\pi} = (1/N, \ldots, 1/N)$. In particular, we have $U_\pi A\, U_\pi^{-1} = A$ for any operator $A$. Doubly stochastic matrices do not necessarily satisfy the reversibility conditions (26). Notably, the class of normal stochastic matrices, i.e., stochastic matrices that commute with their transpose, is contained in the class of doubly stochastic matrices [14]. Normal matrices have the moduli of their eigenvalues as singular values. Noting that $P_\sigma G$ is normal if $G$ is, we conclude that for a normal chain $\|P_\sigma G\|_\pi = \|P_\sigma G\| = \sigma_2 = |\lambda_2|$, yielding

$$K(n) = n\, |\lambda_2|^{n-1} . \tag{29}$$
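A $3 \times 3$ circulant chain gives a concrete instance: it is doubly stochastic and normal but not reversible, and its second singular value equals $|\lambda_2|$, the decay rate appearing in Eq. (29). A sketch assuming NumPy:

```python
import numpy as np

# Circulant stochastic matrix: doubly stochastic and normal, but not symmetric,
# hence (with uniform pi) not reversible.
G = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.3, 0.2, 0.5]])
assert np.allclose(G.sum(axis=0), 1.0) and np.allclose(G.sum(axis=1), 1.0)
assert np.allclose(G @ G.T, G.T @ G)           # normality
assert not np.allclose(G, G.T)                 # breaks detailed balance for uniform pi

# pi is uniform, so U_pi is proportional to the identity and ||.||_pi = ||.||_2
svals = np.linalg.svd(G, compute_uv=False)
lam2 = sorted(np.abs(np.linalg.eigvals(G)))[-2]
print(np.isclose(svals[1], lam2))              # sigma_2 = |lambda_2|, the rate in Eq. (29)
```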
V. CONTINUOUS-TIME MARKOV PROCESSES

The previous construction can be extended to continuous-time Markov processes using the concept of uniformization [15, 16]. The probability distribution $\boldsymbol{p}(t)$ now obeys the dynamics

$$\frac{\mathrm{d}\boldsymbol{p}(t)}{\mathrm{d}t} = \boldsymbol{p}(t)\, L , \tag{30}$$

with the rate matrix satisfying $L_{ij} \ge 0$ for $i \ne j$ and $L_{ii} = -\sum_{j \ne i} L_{ij}$. Note that the rate matrix has negative diagonal elements. We assume it has a unique stationary distribution $\boldsymbol{\pi}$ such that $0 = \boldsymbol{\pi}\, L$.

We define the matrices $C$ and $D$ as above. The lumped dynamics $\hat{L} = DLC$ is verified to be a rate matrix: $\hat{L}_{\omega\omega'} \ge 0$ for $\omega \ne \omega'$ and $\hat{L}_{\omega\omega} = -\sum_{\omega' \ne \omega} \hat{L}_{\omega\omega'}$.

Starting from a distribution $\boldsymbol{p}$, the difference between the two dynamics after a time $t$ reads

$$\big\| \boldsymbol{p}\, C e^{t\hat{L}} - \boldsymbol{p}\, e^{tL} C \big\|_\pi = \big\| \boldsymbol{p}\, \big( e^{tH} - e^{tL} \big)\, C \big\|_\pi , \tag{31}$$

where we defined the operator $CDL \equiv H$.

We now introduce the transition matrix $T(\beta) = I + L/\beta$, where $I$ is the identity operator and $\beta$ the uniformization parameter [15, 16]. To ensure that $T(\beta)$ is a proper transition matrix, $\beta$ must satisfy $\beta \ge \max_i |L_{ii}|$. The distribution $\boldsymbol{\pi}$ is also the stationary distribution of $T(\beta)$: $\boldsymbol{\pi}\, T(\beta) = \boldsymbol{\pi}\, (I + L/\beta) = \boldsymbol{\pi}$ for all $\beta$. We thus have

$$\big\| e^{tL} - P_\pi \big\|_\pi = \big\| P_\sigma\, e^{t\beta (T - I)} \big\|_\pi = e^{-t\beta}\, \big\| P_\sigma\, e^{t\beta T} \big\|_\pi \le e^{-t\beta}\, e^{t\beta\, \|P_\sigma T\|_\pi} = e^{-t\beta\, [1 - \sigma_2(\beta)]} . \tag{32}$$

In the first equality we used that $P_\pi e^{tL} = P_\pi$; in the inequality, that $P_\sigma T^n = (P_\sigma T)^n$ together with $\|P_\sigma\|_\pi = 1$, which gives $\|P_\sigma e^{t\beta T}\|_\pi \le e^{t\beta \|P_\sigma T\|_\pi}$. Here $\sigma_2(\beta) = \|P_\sigma T(\beta)\|_\pi < 1$ is the second-largest singular value of $U_\pi T(\beta)\, U_\pi^{-1}$ (Theorem 1). We have $\|H\|_\pi \le \|CD\|_\pi\, \|L\|_\pi = \|L\|_\pi$, from which we deduce $\exp(t \|H\|_\pi) \le \exp(t \|L\|_\pi)$, so that an analogous estimate holds for $e^{tH}$.
Taking into account (32) we obtain the bound

$$\big\| e^{tH} - e^{tL} \big\|_\pi \le 2\, e^{-\beta t\, [1 - \sigma_2(\beta)]} , \tag{33}$$

which decreases exponentially in time. This bound can be further optimized by minimizing over the uniformization parameter $\beta$.
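The uniformization bound (32) can be checked numerically as well. In this sketch (assuming NumPy; the rate matrix is randomly generated, and the matrix exponential is computed by eigendecomposition for simplicity), the decay of $\|e^{tL} - P_\pi\|_\pi$ is compared with $e^{-\beta t [1 - \sigma_2(\beta)]}$:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 5
L = rng.random((N, N))
np.fill_diagonal(L, 0.0)
np.fill_diagonal(L, -L.sum(axis=1))            # proper rate matrix: rows sum to zero

w, V = np.linalg.eig(L.T)
pi = np.real(V[:, np.argmin(np.abs(w))])
pi /= pi.sum()                                 # stationary distribution, 0 = pi L

U, Uinv = np.diag(np.sqrt(pi)), np.diag(1 / np.sqrt(pi))
P_pi = np.outer(np.ones(N), pi)

def expm(A):                                   # dense matrix exponential via eigendecomposition
    vals, vecs = np.linalg.eig(A)
    return np.real(vecs @ np.diag(np.exp(vals)) @ np.linalg.inv(vecs))

beta = np.max(np.abs(np.diag(L)))              # smallest admissible uniformization parameter
T = np.eye(N) + L / beta                       # uniformized transition matrix T(beta)
sigma2 = np.linalg.svd(U @ T @ Uinv, compute_uv=False)[1]

for t in (0.5, 1.0, 2.0, 4.0):
    lhs = np.linalg.norm(U @ (expm(t * L) - P_pi) @ Uinv, 2)
    assert lhs <= np.exp(-beta * t * (1.0 - sigma2)) + 1e-10   # bound (32)
print("uniformization bound holds; sigma_2(beta) =", round(float(sigma2), 3))
```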
VI. CONCLUSIONS

We derived quantitative bounds on the error made by using a lumped Markov process instead of the unlumped dynamics. Notably, the deviations between the two levels of description can be uniformly bounded in terms of their deviation in one time step. The bounds are expressed in terms of the second-largest singular values of the transition probability matrices. These results generalize the work by Hoffmann and Salamon [12] for reversible Markov chains and deterministic coarse graining. Our construction holds for discrete and continuous time, and for non-reversible processes and probabilistic coarse graining. An important finding is that our bounds hold for all times and are not just asymptotic.

The main technique making our bounds possible is the use of a carefully chosen operator norm. Exploiting the fact that transition matrices are nonnegative, we find a norm that equals the dominant singular value. On the other hand, important observables such as the statistics of current fluctuations are described in terms of generalized transition operators that are nonnegative but non-stochastic [16]. As our approach relies on the fact that the subdominant singular values of stochastic matrices are smaller than one, it is not clear how to extend our arguments to these non-stochastic operators. In addition, the impact of coarse graining on observables nonlinear in the probability distribution, such as the entropy production, remains to be investigated [17].

Other dynamics on the aggregates that are consistent with the stationary state $\hat{\boldsymbol{\pi}}$ can be defined. These dynamics can satisfy further requirements, such as that the net probability transfers between aggregates match those derived from the unlumped chain. The "gauge" freedom available in choosing the dynamics might be used to minimize the coarse graining error while preserving relevant dynamical features.
Acknowledgments.
This work is supported by the F.R.S.-FNRS Belgium.

[1] M. Kijima, Stochastic Processes with Applications to Finance (Chapman and Hall, 2002).
[2] G. Giambene, Queuing Theory and Telecommunications: Networks and Applications (Springer, 2010).
[3] N. G. van Kampen, Stochastic Processes in Physics and Chemistry (North-Holland, Amsterdam, 1992).
[4] C. Gardiner, Handbook of Stochastic Methods: for Physics, Chemistry and the Natural Sciences (Springer, 2004).
[5] L. J. S. Allen, An Introduction to Stochastic Processes with Applications to Biology (Chapman and Hall, 2010).
[6] G. Bolch, S. Greiner, H. de Meer, and K. S. Trivedi, Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications (Wiley-Interscience, 2006).
[7] L. E. Baum and T. Petrie, Ann. Math. Stat. 37, 1554 (1966).
[8] G. A. Fink, Markov Models for Pattern Recognition: From Theory to Applications (Springer, 2010).
[9] T. Koski, Hidden Markov Models of Bioinformatics (Springer, 2002).
[10] J. G. Kemeny and J. L. Snell, Finite Markov Chains (Van Nostrand, Princeton, NJ, 1960).
[11] C. J. Burke and M. Rosenblatt, Ann. Math. Statist. 29, 1112 (1958).
[12] K. H. Hoffmann and P. Salamon, Appl. Math. Lett. 22, 1471 (2009).
[13] D. Hershkowitz, W. Huang, H. Schneider, and H. Weinberger, SIAM J. Matrix Anal. Appl., 249 (1997).
[14] R. Sinkhorn, Linear Algebra Appl., 225 (1981).
[15] A. Jensen, Skand. Aktuarietidskr. 36, 87 (1953).
[16] D. Andrieux, Phys. Rev. E, 031124 (2010).
[17] G. Nicolis, Phys. Rev. E 83.