Shannon Entropy Rate of Hidden Markov Processes
Alexandra M. Jurgens* and James P. Crutchfield†
Complexity Sciences Center, Physics Department, University of California at Davis, Davis, California 95616
(Dated: September 1, 2020)
* [email protected]  † [email protected]

Hidden Markov chains are widely applied statistical models of stochastic processes, from fundamental physics and chemistry to finance, health, and artificial intelligence. The hidden Markov processes they generate are notoriously complicated, however, even if the chain is finite state: no finite expression for their Shannon entropy rate exists, as the set of their predictive features is generically infinite. As such, to date one cannot make general statements about how random they are nor how structured. Here, we address the first part of this challenge by showing how to efficiently and accurately calculate their entropy rates. We also show how this method gives the minimal set of infinite predictive features. A sequel addresses the challenge's second part on structure.
Keywords: Markov process, Shannon entropy, iterated function system, mixed state, predictive feature, optimal prediction, Blackwell measure
I. INTRODUCTION
Randomness is as necessary to physics as determinism. Indeed, since Henri Poincaré's failed attempt to establish the orderliness of planetary motion, it has been understood that both determinism and randomness are essential and unavoidable in the study of physical systems [1–4]. In the 1960s and 1970s, the rise of dynamical systems theory and the exploration of the statistical physics of critical phenomena offered up new perspectives on this duality. The lesson was that intricate structures in a system's state space amplify uncertainty, guiding it and eventually installing it—paradoxically—in complex spatiotemporal patterns. Accepting this state of affairs prompts basic, but as-yet unanswered questions. How is this emergence monitored? How do we measure a system's randomness or quantify its patterns and their organization?

The tools needed to address these questions arose over recent decades during the integration of Turing's computation theory [5–7], Shannon's information theory [8], and Kolmogorov's dynamical systems theory [9–13]. This established the vital role that information plays in physical theories of complex systems. In particular, the application of hidden Markov chains to model and analyze the randomness and structure of physical systems has seen considerable success, not only in complex systems [14], but also in coding theory [15], stochastic processes [16], stochastic thermodynamics [17], speech recognition [18], computational biology [19, 20], epidemiology [21], and finance [22], to offer a nonexhaustive list of examples.

A highly useful property of certain hidden Markov chains (HMCs) is unifilarity [23], a structural constraint on their state transitions. Shannon showed that, given a process generated by a finite-state unifilar HMC, one may directly and accurately calculate the process' irreducible randomness [8]—now called the Shannon entropy rate. Furthermore, for such a process, there is a unique minimal finite-state unifilar HMC that generates the process [24], known as the ε-machine. The ε-machine states—the process' causal states—are the minimal set of maximally predictive features. One consequence of the ε-machine's uniqueness and minimality is that its mathematical description gives a constructive definition of a process' structural complexity as the amount of memory required to generate the process.

Loosening the unifilar constraint to consider a wider class of generated processes, however, leads to major roadblocks. Predicting a process generated by a finite-state nonunifilar HMC requires an infinite set of causal states [25]. That is, though "finitely" generated, the process cannot be predicted by any finite unifilar HMC. Practically, this precludes directly determining the process' entropy rate using Shannon's result and, at best, obscures any insight into its internal structure.

That said, its causal states are (in general, see Appendix B) equivalent to the uncountable set of mixed states, or predictive features, formally introduced by Blackwell over a half century ago [26]. To date, working with infinite mixed states required coarse-graining to produce a finite set of predictive features.
Fortunately, the tradeoffs between resource constraints and predictive power induced by such coarse-graining can be systematically laid out [27–29].

The following introduces an alternative and more direct approach to working with mixed states, though. It casts generating mixed states as a chaotic dynamical system—specifically, a (place-dependent) iterated function system (IFS). This obviates analyzing the underlying HMC via coarse-graining. Rather, the complex dynamics of the new system directly captures the information-theoretic properties of the original process. Specifically, this allows exactly calculating the entropy rate of the process generated by the original nonunifilar finite-state HMC. Additionally, the IFS interpretation of the nonunifilar HMC provides new insight into the structure and complexity of infinite-state processes. This has direct application to the study of randomness and structure in a wide range of physical systems.

In point of fact, the following and its sequel [30] were preceded by two companions that applied the theoretical results here to two, rather different, physical domains. The first analyzed the origin of randomness and structural complexity engendered by quantum measurement [31]. The second solved a longstanding problem on exactly determining the thermodynamic functioning of Maxwellian demons, aka information engines [32]. That is, the following and its sequel lay out the mathematical and algorithmic tools required to successfully analyze these applied problems. We believe the new approach is destined to find even wider applications.

Section II recalls the necessary background in stochastic processes, hidden Markov chains, and information theory. Section III reviews the needed results on iterated function systems, while Sec. IV develops mixed states and their dynamic—the mixed-state presentation. The main result connecting these then follows in Sec. V, showing that the mixed-state presentation is an IFS and that it produces an ergodic process. Section VI recalls Blackwell's theory, updating it for our present purpose of determining the entropy rate of any HMC. The Supplementary Materials provide background on the asymptotic equipartition property and minimality of the mixed states. They also constructively work through the results for several example nonunifilar HMCs. They close with the statistical error analysis underlying entropy-rate estimation.
II. HIDDEN MARKOV PROCESSES

A stochastic process P is a probability measure over a bi-infinite chain ... X_{t−2} X_{t−1} X_t X_{t+1} X_{t+2} ... of random variables, each denoted by a capital letter. A particular realization ... x_{t−2} x_{t−1} x_t x_{t+1} x_{t+2} ... is denoted via lowercase letters. We assume the values x_t belong to a discrete alphabet A. We work with blocks X_{t:t'}, where the first index is inclusive and the second exclusive: X_{t:t'} = X_t ... X_{t'−1}. P's measure is defined via the collection of distributions over blocks: {Pr(X_{t:t'}) : t < t'; t, t' ∈ Z}.

To simplify the development, we restrict to stationary, ergodic processes: those for which Pr(X_{t:t+ℓ}) = Pr(X_{0:ℓ}) for all t ∈ Z, ℓ ∈ Z⁺. In such cases, we only need to consider a process's length-ℓ word distributions Pr(X_{0:ℓ}).

A Markov process is one for which Pr(X_t | X_{−∞:t}) = Pr(X_t | X_{t−1}). A hidden Markov process is the output of a memoryless channel [33] whose input is a Markov process [16].

FIG. 1. A hidden Markov chain (HMC) with two states, {σ_1, σ_2}, and two symbols, {□, 1}. This machine is unifilar.

Working with processes directly is cumbersome, so we turn to consider finitely-specified mechanistic models that generate them.

Definition 1.
A finite-state edge-labeled hidden Markov chain (HMC) consists of:

1. a finite set of states S = {σ_1, ..., σ_N},

2. a finite alphabet A of k symbols x ∈ A, and

3. a set of N by N symbol-labeled transition matrices T^(x), x ∈ A, with T^(x)_{ij} = Pr(σ_j, x | σ_i). The corresponding overall state-to-state transitions are described by the row-stochastic matrix T = Σ_{x∈A} T^(x).

Any given stochastic process can be generated by any number of HMCs. These are called a process' presentations.
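As a computational aside, an HMC of this kind is conveniently encoded as a mapping from symbols to transition matrices. The following minimal Python sketch fixes this representation and checks the row-stochasticity of T; the specific two-state machine is our own illustrative assumption (chosen to be unifilar and to have invariant distribution (2/3, 1/3), matching the value quoted in Fig. 2), not necessarily the exact machine of Fig. 1.

import numpy as np

# Hypothetical 2-state unifilar HMC over symbols {"box", "one"}:
# state 1 emits "box" (self-loop) or "one" (to state 2), each w.p. 1/2;
# state 2 emits "one" (back to state 1) w.p. 1.
T = {
    "box": np.array([[0.5, 0.0],
                     [0.0, 0.0]]),
    "one": np.array([[0.0, 0.5],
                     [1.0, 0.0]]),
}

total = sum(T.values())
assert np.allclose(total.sum(axis=1), 1.0)  # T = sum_x T^(x) is row-stochastic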
We now introduce a structural property of HMCs that has important consequences in characterizing process randomness and structure.

Definition 2. A unifilar HMC (uHMC) is an HMC such that for each state σ_i ∈ S and each symbol x ∈ A there is at most one outgoing edge from state σ_i labeled with symbol x.

Although there are many presentations for a process P, there is a canonical presentation that is unique: a process' ε-machine.

Definition 3. An ε-machine is a uHMC with probabilistically distinct states: For each pair of distinct states σ_i, σ_j ∈ S there exists a finite word w = x_0 x_1 ... x_{ℓ−1} such that:

  Pr(X_{0:ℓ} = w | S_0 = σ_i) ≠ Pr(X_{0:ℓ} = w | S_0 = σ_j) .

A process' ε-machine is its optimal, minimal presentation, in the sense that its set of predictive states is minimal in size |S| compared to all its other unifilar presentations [34].

A. Entropy Rate of HMCs
A process' intrinsic randomness is the information in the present measurement, discounted by having observed the information in an infinitely long history. It is measured by Shannon's source entropy rate [8].
Definition 4.
A process' entropy rate h_µ is the asymptotic average entropy per symbol [35]:

  h_µ = lim_{ℓ→∞} H[X_{0:ℓ}] / ℓ ,    (1)

where H[X_{0:ℓ}] is the Shannon entropy of the length-ℓ block X_{0:ℓ}:

  H[X_{0:ℓ}] = − Σ_{x_{0:ℓ} ∈ A^ℓ} Pr(x_{0:ℓ}) log Pr(x_{0:ℓ}) .    (2)

Given a finite-state unifilar presentation M_u of a process P, we may directly calculate the entropy rate from the transition matrices of the uHMC [8]:

  h_µ(P) = h_µ(M_u) = − Σ_{σ∈S} Pr(σ) Σ_{x∈A} T^(x)_{σσ'} log T^(x)_{σσ'} ,    (3)

where, by unifilarity, σ' denotes the unique x-successor of σ. Blackwell showed, though, that in general for processes generated by HMCs there is no closed-form expression for the entropy rate [26]. For a process generated by a nonunifilar HMC M, applying Eq. (3) to M typically overestimates the true entropy rate of the process h_µ(P): h_µ(M) ≥ h_µ(P). Overcoming this limitation is one of our central results. We now embark on introducing the necessary tools for this.
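As an aside, Eq. (3) is immediate to implement numerically. A minimal sketch, reusing the illustrative two-state unifilar machine above; stationary(), shannon_entropy_rate(), and the machine itself are our own names and assumptions:

import numpy as np

T = {"box": np.array([[0.5, 0.0], [0.0, 0.0]]),
     "one": np.array([[0.0, 0.5], [1.0, 0.0]])}

def stationary(T):
    """Stationary state distribution: left eigenvector of T = sum_x T^(x)."""
    total = sum(T.values())
    evals, evecs = np.linalg.eig(total.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def shannon_entropy_rate(T):
    """Eq. (3); correct only when the HMC is unifilar."""
    pi = stationary(T)
    h = 0.0
    for Tx in T.values():
        rows, cols = np.nonzero(Tx)      # edges (source state, target state)
        p = Tx[rows, cols]               # their transition probabilities
        h -= np.sum(pi[rows] * p * np.log2(p))
    return h

print(shannon_entropy_rate(T))  # 2/3 bit per symbol for this machine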
III. ITERATED FUNCTION SYSTEMS
To get there, we must take a short detour to review iterated function systems (IFSs) [36], as they play a critical role in analyzing HMCs. Speaking simply, we show that HMCs are dynamical systems—namely, IFSs.

Let (Δ_N, d) be a compact metric space with d(·,·) a distance. This notation anticipates our later application, in which Δ_N is the N-simplex of discrete-event probability distributions (see Section IV A). However, the results here are general.

Let f^(x) : Δ_N → Δ_N for x = 1, ..., k be a set of Lipschitz functions with:

  d( f^(x)(η), f^(x)(ζ) ) ≤ τ^(x) d(η, ζ) ,

for all η, ζ ∈ Δ_N and where τ^(x) is a constant. This notation is chosen to draw an explicit parallel to the stochastic processes discussed in Section II and to avoid confusion with the lowercase Latin characters used for realizations of stochastic processes. In particular, note that the superscript (x) here and elsewhere parallels that of the HMC symbol-labeled transition matrices T^(x). The reasons for this will soon become clear.

The Lipschitz constant τ^(x) is the contractivity of map f^(x). Let p^(x) : Δ_N → [0, 1] be continuous, with p^(x)(η) ≥ 0 and Σ_{x=1}^{k} p^(x)(η) = 1 for all η ∈ Δ_N. The triplet {Δ_N, {p^(x)}, {f^(x)} : x ∈ A} defines a place-dependent IFS.

A place-dependent IFS generates a stochastic process over η ∈ Δ_N as follows. Given an initial position η_0 ∈ Δ_N, the probability distribution {p^(x)(η_0) : x = 1, ..., k} is sampled. According to the sample x, apply f^(x) to map η_0 to the next position η_1 = f^(x)(η_0). Resample x from the distribution at η_1 and continue, generating η_2, η_3, η_4, ....

If each map f^(x) is a contraction—i.e., τ^(x) < 1—it is well known that there exists a unique nonempty compact set Λ ⊂ Δ_N that is invariant under the IFS's action:

  Λ = ⋃_{x=1}^{k} f^(x)(Λ) .

Λ is the IFS's attractor.

Consider the operator V : M(Δ_N) → M(Δ_N) on the space of Borel measures on the N-simplex:

  V µ(B) = Σ_{x=1}^{k} ∫_{(f^(x))^{−1}(B)} p^(x)(η) dµ(η) .    (4)

A Borel probability measure µ is said to be invariant or stationary if Vµ = µ. It is attractive if for any probability measure ν in M(Δ_N):

  ∫ g d(V^n ν) → ∫ g dµ ,

for all g in the space of bounded continuous functions on Δ_N.

Let's recall here a key result concerning the existence of attractive, invariant measures for place-dependent IFSs.

Theorem 1. [37, Thm. 2.1] Suppose there exist r < 1 and q > 0 such that:

  Σ_{x∈A} p^(x)(η) d^q( f^(x)(η), f^(x)(ζ) ) ≤ r^q d^q(η, ζ) ,

for all η, ζ ∈ Δ_N. Assume that the modulus of uniform continuity of each p^(x) satisfies Dini's condition and that there exists a δ > 0 such that:

  Σ_{x : d(f^(x)(η), f^(x)(ζ)) ≤ r d(η,ζ)} p^(x)(η) p^(x)(ζ) ≥ δ ,    (5)

for all η, ζ ∈ Δ_N. Then there is an attractive, unique, invariant probability measure for the Markov process generated by the place-dependent IFS.

In addition, under these same conditions Ref. [38] established an ergodic theorem for IFS orbits. That is, for any η ∈ Δ_N and any bounded continuous g : Δ_N → R:

  (1/(n+1)) Σ_{k=0}^{n} g( f^(x_k) ∘ ··· ∘ f^(x_0)(η) ) → ∫ g dµ ,    (6)

for almost every symbol sequence x_0 x_1 x_2 ....
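To make Eq. (6) concrete, the following sketch simulates a place-dependent IFS orbit and time-averages an observable. The two affine contractions on [0, 1] and their probability functions are illustrative assumptions; the point is that time averages from different starting points agree, as the ergodic theorem asserts.

import numpy as np

rng = np.random.default_rng(42)

maps = [lambda e: 0.5 * e,            # f^(0): contraction toward 0
        lambda e: 0.5 * e + 0.5]      # f^(1): contraction toward 1

def probs(e):
    """Place-dependent probabilities p^(x)(eta); continuous, summing to 1."""
    return np.array([0.25 + 0.5 * e, 0.75 - 0.5 * e])

def time_average(g, e0, n=200_000):
    """Left side of Eq. (6): average g along one sampled orbit from e0."""
    total, e = 0.0, e0
    for _ in range(n):
        e = maps[rng.choice(2, p=probs(e))](e)
        total += g(e)
    return total / n

g = lambda e: e * e
print(time_average(g, 0.1), time_average(g, 0.9))  # both -> integral of g dmu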
IV. MIXED-STATE PRESENTATION

We now return to stochastic processes and their HMC presentations. When calculating entropy rates from various presentations, we noted that HMC presentations led to difficulties: (i) the internal Markov-chain entropy rate overestimates the process' entropy rate and (ii) there is no closed-form entropy-rate expression. To develop the tools needed to resolve these problems, we introduce HMC mixed states and their dynamic.

Assume that an observer has a finite HMC presentation M for a process P. Since the process is hidden, the observer does not directly measure M's internal states. Absent output data, the best guess for M's hidden states is that they occur according to the stationary state distribution π. The observer can improve on this guess by monitoring the output data x_0 x_1 x_2 ... that M generates. Given knowledge of M, determining the internal state from observed data is the problem of observer-process synchronization.

A. Mixed States
For a length-ℓ word w generated by M, let η(w) = Pr(S | w) be the observer's belief distribution as to the process' current state after observing w:

  η(w) ≡ Pr(S_ℓ | X_{0:ℓ} = w, S_0 ∼ π) .    (7)

When observing an N-state machine, the vector ⟨η(w)| lives in the (N−1)-simplex Δ_{N−1}:

  Δ_{N−1} = { η ∈ R^N : ⟨η|1⟩ = 1, ⟨η|δ_i⟩ ≥ 0, i = 1, ..., N } ,

where ⟨δ_i| = (0 ... 0 1 0 ... 0), with the 1 in the i-th position. The 0-simplex Δ_0 is the single point |η⟩ = (1); the 1-simplex Δ_1 is the line segment [0, 1] from |η⟩ = (0, 1) to |η⟩ = (1, 0); and so on. The set of states η(w) that an HMC can visit defines its set of mixed states:

  R = { η(w) : w ∈ A⁺, Pr(w) > 0 } .

Generically, the mixed-state set R for an N-state HMC is infinite, even for finite N [26].

Note that when a mixed state appears in probability expressions, the notation refers to the random variable η, not the row vector |η⟩, and we drop the bra-ket notation. Bra-ket notation is used in vector-matrix expressions.

B. Mixed-State Dynamic
The probability of transitioning from ⟨η(w)| to ⟨η(wx)| on observing symbol x follows from Eq. (7) immediately; we have:

  Pr(η(wx) | η(w)) = Pr(x | S_ℓ ∼ η(w)) .

This defines the mixed-state transition dynamic W. Together the mixed states and their dynamic define an HMC that is unifilar by construction. This is a process' mixed-state presentation (MSP) U(P) = {R, W}.

We defined a process' U abstractly. The U typically has an uncountably infinite set of mixed states, making it challenging to work with in the form laid out in Section IV A. Usefully, however, given any HMC M that generates the process, we may explicitly write down the dynamic W. Assume we have an (N+1)-state HMC presentation M with k symbols x ∈ A. The initial condition is the invariant probability π over the states of M, so that ⟨η_0| = ⟨δ_π|. In the context of the mixed-state dynamic, mixed-state subscripts denote time.

The probability of generating symbol x when in mixed state η is:

  Pr(x | η) = ⟨η| T^(x) |1⟩ ,    (8)

where T^(x) is the symbol-labeled transition matrix associated with the symbol x.

From η, we calculate the probability of seeing each x ∈ A. Upon seeing symbol x, the current mixed state ⟨η_t| is updated according to:

  ⟨η_{t+1,x}| = ⟨η_t| T^(x) / ⟨η_t| T^(x) |1⟩ .    (9)

Thus, given an HMC presentation we can restate Eq. (7) as:

  ⟨η(w)| = ⟨η_0| T^(w) / ⟨η_0| T^(w) |1⟩ = ⟨π| T^(w) / ⟨π| T^(w) |1⟩ .

Equation (9) tells us that, by construction, the MSP is unifilar, since each possible output symbol uniquely determines the next (mixed) state. Taken together, Eqs. (8) and (9) define the mixed-state transition dynamic W as:

  Pr(η_{t+1}, x | η_t) = Pr(x | η_t) = ⟨η_t| T^(x) |1⟩ ,

for all η ∈ R, x ∈ A.

To find the MSP U = {R, W} for a given HMC M we apply the mixed-state construction method:

1. Set U = {R = ∅, W = ∅}.

FIG. 2. Determining the mixed-state presentation (MSP) of the 2-state unifilar HMC shown in (A): The invariant state distribution π = (2/3, 1/3) sets the initial mixed state η_0, used in (B) to calculate the next set of mixed states. (C) The full set of mixed states seen from all allowed words. In this case, we recover the unifilar HMC shown in (A) as the MSP's recurrent states.
2. Calculate M's invariant state distribution: π = πT.

3. Take η_0 to be ⟨δ_π| and add it to R.

4. For each current mixed state η_t ∈ R, use Eq. (8) to calculate Pr(x | η_t) for each x ∈ A.

5. For η_t ∈ R, use Eq. (9) to find the updated mixed state η_{t+1,x} for each x ∈ A.

6. Add η_t's transitions to W and each η_{t+1,x} to R, merging duplicate states.

7. For each new η_{t+1}, repeat steps 4–6 until no new mixed states are produced.

With the MSP U(M) in hand, the next issue is determining its (equivalent) ε-machine. There are several cases.

Beginning with a finite, unifilar HMC M generating a process P, the MSP U(M) is a finite, optimally-predictive rival presentation to P's ε-machine, as seen in Fig. 2. In this case, the starting HMC depicted in Fig. 2(A) is an ε-machine, and reducing the MSP in Fig. 2(C) by trimming the transient states returns the process' recurrent-state ε-machine. When starting with the ε-machine, trimming the resultant U(ε-machine) in this way always returns the ε-machine.

In general, if U(M) is finite, we find the ε-machine by minimizing U(M) via merging duplicate states: repeat mixed-state construction on U(M) and trim transient states once more. Minimizing countably-infinite and uncountably-infinite U(M) is discussed further in Appendix B.

The MSPs of unifilar presentations are interesting and contain additional information beyond the unifilar presentations. For example, containing transient causal states, they are employed in calculating many complexity measures that track convergence statistics [39].

However, here we focus on the mixed-state presentations of nonunifilar HMCs, which typically have an infinite mixed-state set R. Figure 3 illustrates applying mixed-state construction to a finite, nonunifilar HMC. This produces an infinite sequence of mixed states on Δ_1 = [0, 1]. In this case, R is countably infinite, allowing us to better understand the underlying process P; compared, say, to the 2-state nonunifilar HMC in Fig. 3(A). MSPs of nonunifilar HMCs typically have an uncountably-infinite mixed-state set R.
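The construction method translates directly into an enumeration over belief states. A sketch, assuming an HMC given as a dict of numpy matrices; states are merged when equal to within a tolerance, and, since R is generically infinite, max_states truncates the search:

import numpy as np

def mixed_state_presentation(T, tol=1e-10, max_states=10_000):
    """Enumerate mixed states by repeated application of Eqs. (8)-(9).

    T: dict mapping symbol -> N x N matrix T^(x). States within `tol` of
    each other are merged; for generic nonunifilar HMCs the true set is
    infinite, so `max_states` truncates the enumeration.
    """
    total = sum(T.values())
    evals, evecs = np.linalg.eig(total.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi /= pi.sum()                                    # eta_0 = pi

    key = lambda eta: tuple(np.round(eta / tol).astype(np.int64))
    states = {key(pi): pi}
    frontier = [pi]
    transitions = {}                  # (state key, symbol) -> (next key, prob)
    while frontier and len(states) < max_states:
        eta = frontier.pop()
        for x, Tx in T.items():
            p = eta @ Tx @ np.ones(len(eta))          # Eq. (8): Pr(x | eta)
            if p <= 0:
                continue
            nxt = (eta @ Tx) / p                      # Eq. (9): belief update
            k = key(nxt)
            if k not in states:
                states[k] = nxt
                frontier.append(nxt)
            transitions[(key(eta), x)] = (k, p)
    return states, transitions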
V. MSP AS AN IFS

With this setup, our intentions in reviewing iterated function systems (IFSs) become explicit. The mixed-state presentation (MSP) exactly defines a place-dependent IFS: the mapping functions are the symbol-labeled mixed-state update functions given in Eq. (9) and the place-dependent probability functions are given by Eq. (8). We then have a mapping function and associated probability function for each symbol x ∈ A, each derived from the symbol-labeled transition matrix T^(x).

If these probability and mapping functions meet the conditions of Theorem 1, we identify the attractor Λ as the set of mixed states R and the invariant measure µ as the invariant distribution of the potentially infinite-state U. This is the original HMC's Blackwell measure. Since all Lipschitz continuous functions are Dini continuous, the probability functions meet the conditions by inspection. We now establish that the maps are contractions, by appealing to Birkhoff's 1957 proof that a positive linear map preserving a convex cone is a contraction under the Hilbert projective metric [40].

Given an integer N ≥ 2, let C_N be the nonnegative cone in R^N, so that C_N consists of all vectors z = (z_1, z_2, ..., z_N) satisfying z ≠ 0 and z_i ≥ 0 for all i.

FIG. 3. Determining the mixed-state presentation of the 2-state nonunifilar HMC shown in (A). The invariant distribution π = (1/2, 1/2) sets the initial mixed state η_0, used in (B) to calculate the next set of mixed states. (B) plots the mixed states along the 1-simplex Δ_1 = [0, 1].

The projective distance d : C_N × C_N → [0, ∞) is defined [42]:

  d(z, y) := max { | log( (z_r y_s) / (z_s y_r) ) | : r, s = 1, ..., N; r ≠ s } ,    (10)

for z, y ∈ C_N, where d(z, z) = 0. If one of the points is on the cone boundary, the distance is taken to be +∞. Note that the projective distance, by construction, satisfies d(αz, βy) = d(z, y) for α, β ∈ R⁺. In other words, for two mixed states η, ζ ∈ Δ_N, d( f^(x)(η), f^(x)(ζ) ) = d( ηT^(x), ζT^(x) ).

If T^(x) is an N × N positive matrix, we have d( zT^(x), yT^(x) ) < d(z, y) for every z, y ∈ C_N such that d(y, z) > 0.
We define the projective contractivity τ^(x) associated with T^(x) as:

  τ^(x) := sup_{z,y ∈ C_N : d(z,y) > 0} d( zT^(x), yT^(x) ) / d(z, y) ,

so that τ^(x) satisfies τ^(x) ≤ 1. As the theorem below indicates, this inequality is strict.

Theorem 2. ([41, Thm. 1].) Let the integers m, n ≥ 2. For a matrix T^(x) = [ t^(x)_{ij} ] of order m × n with positive components, τ^(x) is given by the following Birkhoff formula:

  τ^(x) = ( 1 − φ(T^(x))^{1/2} ) / ( 1 + φ(T^(x))^{1/2} ) ,

where:

  φ(T^(x)) := min_{r,s,j,k} ( t^(x)_{rj} t^(x)_{sk} ) / ( t^(x)_{sj} t^(x)_{rk} ) .

By inspection we see that φ(T^(x)) > 0 for a positive matrix and, therefore, τ^(x) < 1. This is not generally true for our transition matrices—they are restricted merely to be nonnegative. However, the above result extends to any nonnegative matrix T^(x) for which there exists an N ∈ Z⁺ such that (T^(x))^N is a positive matrix. Then there will be a τ^(x) < 1 with d( η(T^(x))^N, ζ(T^(x))^N ) ≤ τ^(x) d(η, ζ) < d(η, ζ). This is equivalent to a requirement that T^(x) be aperiodic and irreducible.

FIG. 4. Simple Nonunifilar Source (SNS): The symbol-labeled transition matrices given in Eq. (11) are both reducible, but the place-dependent IFS still has an attractor with an invariant probability distribution. By setting p = q = 1/2, we return the nonunifilar HMC from Fig. 3.
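Theorem 2's formula is a finite minimization and is simple to evaluate. A sketch for a strictly positive matrix; the example matrix is our own:

import numpy as np
from itertools import product

def birkhoff_contraction(Tx):
    """Birkhoff contraction coefficient tau of a positive matrix (Theorem 2)."""
    m, n = Tx.shape
    phi = min(
        Tx[r, j] * Tx[s, k] / (Tx[s, j] * Tx[r, k])
        for r, s in product(range(m), repeat=2)
        for j, k in product(range(n), repeat=2)
    )
    return (1 - phi**0.5) / (1 + phi**0.5)

Tx = np.array([[0.6, 0.4],
               [0.2, 0.8]])
print(birkhoff_contraction(Tx))  # approx 0.42, strictly less than 1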
Still, we are not guaranteed irreducibility and aperiodicity for our symbol-labeled transition matrices. Indeed, the Simple Nonunifilar Source, depicted in Fig. 4, has the symbol-labeled transition matrices:

  T^(1) = ( 1−p   p  )          T^(□) = ( 0  0 )
          ( 0    1−q )   and            ( q  0 ) .    (11)

Both T^(1) and T^(□) are reducible. A quick check is to examine Fig. 4 and ask if there is a length-n sequence consisting of only a single symbol that reaches every state from every other state. Nonetheless, the HMC has a countable set of mixed states R and an invariant measure µ. We can determine this from the mapping functions:

  f^(1)(η) = ( ⟨η|δ_1⟩(1−p) / (1 − (1 − ⟨η|δ_1⟩)q) , (⟨η|δ_1⟩ p + (1 − ⟨η|δ_1⟩)(1−q)) / (1 − (1 − ⟨η|δ_1⟩)q) )    (12)

and

  f^(□)(η) = [1, 0] .    (13)

From any initial state η_0, other than η_0 = σ_1 = [1, 0], the probability of emitting a □ is positive. Once a □ is emitted, the mixed state is guaranteed to be η = σ_1 = [1, 0]. When observing a symbol determines the mixed state uniquely in this way, regardless of the prior mixed state, we call the symbol a synchronizing symbol. From σ_1, the set of mixed states is generated by repeated emissions of 1s, so that R = { (f^(1))^n(σ_1) : n = 0, ..., ∞ }. This is visually depicted in Fig. 3 for the specific case of p = q = 1/2. For all p and q, the measure can be determined analytically; see Ref. [43]. Note that this is due to the HMC's highly structured topology. In general, the set of mixed states is uncountable—either a fractal or continuous set—and the measure cannot be analytically expressed.

Assuming the HMC generates an ergodic process ensures that the total transition matrix T = Σ_x T^(x) is nonnegative, irreducible, and aperiodic. Define for any word w = x_0 ... x_{ℓ−1} ∈ A⁺ the associated matrix product T^(w) = T^(x_0) ··· T^(x_{ℓ−1}) and mapping function f^(w) = f^(x_{ℓ−1}) ∘ ··· ∘ f^(x_0). Consider a word w in the process' typical set of realizations (see Appendix A), a set which approaches measure one as |w| → ∞. Due to ergodicity, it must be the case that f^(w) is either (i) a constant mapping—and, therefore, infinitely contracting—or (ii) T^(w) is irreducible.

As an example of the former case, we see that any composition of the SNS functions Eqs. (12) and (13) is a constant function, so long as there is at least one □ in the word, the probability of which approaches one as the word grows in length.

As an example of the latter case, imagine adding to the SNS in Fig. 4 a transition on □ from σ_2 back to σ_2. Then, both symbol-labeled transition matrices are still reducible, but the composite transition matrix for any word including both symbols is now irreducible. Therefore, the map is contracting. While this is not the case for words composed of all □s or of all 1s, these sequences have measure zero as word length grows. Appendix A discusses this further.
VI. ENTROPY OF GENERAL HMCS

Blackwell analyzed the entropy of functions of finite-state Markov chains [26]. With a shift in notation, functions of Markov chains can be identified as general hidden Markov chains. This is to say, both presentation classes generate the same class of stochastic processes. As we have discussed, the entropy-rate problem for unifilar hidden Markov chains is solved by Shannon's entropy rate expression, Eq. (3). However, according to Blackwell, there is no analogous closed-form expression for the entropy rate of a nonunifilar HMC.
A. Blackwell Entropy Rate
That said, Blackwell gave an expression for the entropy rate of general HMCs, by introducing mixed states over stationary, ergodic, finite-state chains. (Although he does not refer to them as such.) His main result, retaining his notation, is transcribed here and adapted by us to constructively solve the HMC entropy-rate problem.
Theorem 3. ([26, Thm. 1].) Let {x_n, −∞ < n < ∞} be a stationary ergodic Markov process with states i = 1, ..., I and transition matrix M = ‖m(i, j)‖. Let Φ be a function defined on 1, ..., I with values a = 1, ..., A and let y_n = Φ(x_n). The entropy of the {y_n} process is given by:

  H = − ∫ Σ_a r_a(w) log r_a(w) dQ(w) ,    (14)

where Q is a probability distribution on the Borel sets of the set W of vectors w = (w_1, ..., w_I) with w_i ≥ 0, Σ_i w_i = 1, and r_a(w) = Σ_{i=1}^{I} Σ_{j : Φ(j)=a} w_i m(i, j). The distribution Q is concentrated on the sets W_1, ..., W_A, where W_a consists of all w ∈ W with w_i = 0 for Φ(i) ≠ a, and satisfies:

  Q(E) = Σ_a ∫_{f_a^{−1} E} r_a(w) dQ(w) ,    (15)

where f_a maps W into W_a, with the j-th coordinate of f_a(w) given by Σ_i w_i m(i, j) / r_a(w) for Φ(j) = a.

We can identify the w vectors in Theorem 3 as exactly the mixed states of Section IV. Furthermore, it is clear by inspection that r_a(w) and f_a(w) are the probability and mapping functions of Eqs. (8) and (9), respectively, with a playing the role of our observed symbol x. Therefore, Blackwell's expression Eq. (14) for the HMC entropy rate, in effect, replaces the average over a finite set S of unifilar states in Shannon's entropy-rate formula Eq. (3) with (i) the mixed states R and (ii) an integral over the Blackwell measure µ. In our notation, we write Blackwell's entropy formula as:

  h^B_µ = − ∫_R dµ(η) Σ_{x∈A} p^(x)(η) log p^(x)(η) .    (16)

Thus, as with Shannon's original expression, this too uses unifilar states—now, though, states from the mixed-state presentation U. This, in turn, maintains the finite-to-one internal (mixed-) state sequence to observed-sequence mapping. Therefore, one can identify the mixed-state entropy rate itself as the process' entropy rate.

B. Calculating the Blackwell HMC Entropy
Appealing to Ref. [38], we have that contractivity of our substochastic transition-matrix mappings guarantees ergodicity over the words generated by the mixed-state presentation. And so, we can replace Eq. (16)'s integral over R with a time average over a mixed-state trajectory η_0, η_1, ... determined by a long allowed word, using Eqs. (8) and (9). This gives a new limit expression for the HMC entropy rate:

  ĥ^B_µ = − lim_{ℓ→∞} (1/ℓ) Σ_{t=0}^{ℓ−1} Σ_{x∈A} Pr(x | η_t) log Pr(x | η_t) ,    (17)

where η_t = η(w_{0:t}) and w_{0:t} is the first t symbols of an arbitrarily long sequence w_∞ generated by the process.

Note that w_{0:ℓ} will be a typical trajectory, if ℓ is sufficiently long. To remove convergence-slowing contributions from transient mixed states, one can ignore some number of the initial mixed states. The exact number of transient states that should be ignored is unknown in general. That said, it depends on the initial mixed state η_0, which is generally taken to be ⟨δ_π|, and the diameter of the attractor.

This completes our development of the HMC entropy rate. Appendix C applies the theory and associated algorithm to a number of examples, with both countable and uncountable mixed states, and reveals a number of surprising properties. We now turn to practical issues of the resources needed for accurate estimation.
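Eq. (17) yields a one-pass estimation algorithm: evolve the mixed state by Eqs. (8) and (9) along a sampled realization and average the branching entropies, discarding an initial transient. A sketch, applied to the SNS of Eq. (11) with p = q = 1/2; the burn-in and sample lengths are illustrative choices:

import numpy as np

def blackwell_entropy_rate(T, n=500_000, burn_in=1_000, seed=0):
    """Estimate h_mu via Eq. (17): time-average the branching entropy
    H[X | eta_t] along one mixed-state trajectory, Eqs. (8)-(9)."""
    rng = np.random.default_rng(seed)
    symbols = list(T)
    total = sum(T.values())
    evals, evecs = np.linalg.eig(total.T)
    eta = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    eta /= eta.sum()                                   # eta_0 = pi
    ones = np.ones(len(eta))

    h_sum, count = 0.0, 0
    for t in range(n):
        p = np.array([eta @ T[x] @ ones for x in symbols])  # Eq. (8)
        p /= p.sum()                                   # guard float drift
        if t >= burn_in:
            nz = p[p > 0]
            h_sum -= (nz * np.log2(nz)).sum()
            count += 1
        i = rng.choice(len(symbols), p=p)
        eta = (eta @ T[symbols[i]]) / p[i]             # Eq. (9)
    return h_sum / count

# Simple Nonunifilar Source, Eq. (11), with p = q = 1/2:
p = q = 0.5
T_sns = {"one": np.array([[1 - p, p], [0.0, 1 - q]]),
         "box": np.array([[0.0, 0.0], [q, 0.0]])}
print(blackwell_entropy_rate(T_sns))  # approx 0.678 bits per symbol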
C. Data Requirements

Although we developed our HMC entropy-rate expression in terms of IFSs, determining a process' entropy rate can be recast as Markov chain Monte Carlo (MCMC) estimation. In MCMC, the mean of a function f(x) of interest over a desired probability distribution π(x) is estimated by designing a Markov chain with stationary distribution π. For HMCs the desired distribution is the Blackwell measure µ, which is the stationary distribution over the MSP states R. Then, the Markov chain is simply the transition dynamic W over R.

With this setting, we estimate the entropy rate ĥ^B_µ as the mean of the stochastic process defined by taking the entropy H[X|η] over symbols emitted from state η for a sequence of mixed states generated by W. In effect, we estimate the entropy rate as the mean of this stochastic process:

  ĥ^B_µ = ⟨ H[X|η] ⟩_µ .    (18)

Mathematically, little has changed. The advantage, though, of this alternative description is that it invokes the extensive body of results on MCMC estimation. In this, it is well known that there are two fundamental sources of error in the estimation. First, there is that due to initialization bias or undesired statistical trends introduced by the initial transient data produced by the Markov chain before it reaches the desired stationary distribution. Second, there are errors induced by autocorrelation in equilibrium. That is, the samples produced by the Markov chain are correlated. And, the consequence is that statistical error cannot be estimated as 1/√N, as done for N independent samples.

To address these two sources of error, we follow common MCMC practice, considering two "time scales" that arise during estimation. Consider the autocorrelation of the stationary stochastic process:

  C_f(t) = ⟨ f_s f_{s+t} ⟩ − µ_f² ,

where µ_f is f's mean. Also, consider the normalized autocorrelation, defined:

  ρ_f(t) = C_f(t) / C_f(0) .

If the autocorrelation decays exponentially with time, we define the exponential autocorrelation time:

  τ_{exp,f} = lim sup_{t→∞} t / ( − log |ρ_f(t)| )

and τ_exp = sup_f τ_{exp,f}. So, τ_exp upper bounds the rate of convergence from an initial nonequilibrium distribution to the equilibrium distribution.

For a given observable, we also define the integrated autocorrelation time τ_{int,f} as:

  τ_{int,f} = (1/2) Σ_{t=−∞}^{∞} ρ_f(t) .    (19)

This relates the correlated samples selected by the chain to the variance of independent samples for the particular function f of interest. The variance of f(x)'s sample mean in MCMC is higher by a factor of 2τ_{int,f}. In other words, the errors for a sample of length N are of order √(2τ_{int,f}/N). Thus, targeting 1% accuracy requires on the order of 10⁴ τ_{int,f} samples.

In practice, it is difficult to find τ_exp and τ_int for a generic Markov chain. There are two options. The first is to use numerical approximations that estimate the autocorrelation function, and therefore τ, from data. If we have the nonunifilar model in hand, it is a simple matter of sweeping through increasingly long strings of generated data until we observe convergence of the autocorrelation function.

Alternatively, taking inspiration from previous treatments of nonunifilar models, we make a finite-state approximation to the MSP by coarse-graining the simplex into boxes of length ε and employ a suitable method, such as Ulam's method, to approximate the transition operator. Using methods previously discussed in Ref. [44], this allows calculating the autocorrelation function directly. Appendix D shows that the approximation error vanishes as ε → 0.
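Both time scales can be estimated from the sampled branching entropies themselves. A sketch of the integrated autocorrelation time, Eq. (19), with a fixed summation window standing in for a principled truncation rule (the window length is our own choice):

import numpy as np

def tau_int(samples, window=1_000):
    """Integrated autocorrelation time, Eq. (19), from a scalar time series."""
    x = np.asarray(samples, dtype=float)
    x -= x.mean()
    c0 = np.mean(x * x)
    rho = [np.mean(x[:-t] * x[t:]) / c0 for t in range(1, window)]
    return 0.5 + np.sum(rho)       # (1/2) sum over all t, using rho(-t) = rho(t)

def corrected_stderr(samples):
    """Error of the sample mean, inflated by 2 tau_int relative to i.i.d."""
    n = len(samples)
    return np.std(samples) * np.sqrt(2.0 * tau_int(samples) / n)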
VII. CONCLUSION

We opened this development considering the role that determinism and randomness play in the behavior of complex physical systems. A central challenge in this has been quantifying randomness, patterns, and structure, and doing so in a mathematically-consistent but calculable manner. For well over a half century the Shannon entropy rate has stood as the standard by which to quantify randomness in a time series. Until now, however, calculating it for processes generated by nonunifilar HMCs has been difficult, at best.

We began our analysis of this problem by recalling that, in general, hidden Markov chains that are not unifilar have no closed-form expression for the Shannon entropy rate of the processes they generate. Despite this, these HMCs can be unifilarized by calculating the mixed states. The resulting mixed-state presentations are themselves HMCs that generate the process. However, adopting a unifilar presentation comes at a heavy cost: Generically, they are infinite state and so Shannon's expression cannot be used. Nonetheless, we showed how to work constructively with these mixed-state presentations. In particular, we showed that they fall into a common class of dynamical system: the mixed-state presentation is an iterated function system. Due to this, a number of results from dynamical systems theory can be applied.

Specifically, analyzing the IFS dynamics associated with a finite-state nonunifilar HMC allows one to extract useful properties of the original process. For instance, we can easily find the entropy rate of the generated process from long orbits of the IFS. That is, one may select any arbitrary starting point in the mixed-state simplex and calculate the entropy over the IFS's place-dependent probability distribution. We evolve the mixed state according to the IFS and sequentially sample the entropy of the place-dependent probability distribution at each step. Using an arbitrarily long word and taking the mean of these entropies, the method converges on the process' entropy rate.

Although others consider the IFS-HMC connection [45, 46], our development expanded previous work to include the much broader, more general class of nonunifilar HMCs. In addition, we demonstrated not only the mixed-state presentation's role in calculating the entropy rate, but also its connection to existing approaches to randomness and structure in complex systems. In particular, while our results focused on quantifying and calculating a process' randomness, we left open questions of pattern and structure. However, the path to achieving the results introduced here strongly suggests that the mixed-state presentation offers insight into answering these questions. For instance, Fig. 3 demonstrated how the highly structured nature of the Simple Nonunifilar Source is made topologically explicit through calculating its mixed-state presentation—which is also its ε-machine.

Though space will not let us develop it further here, this connection is not spurious. Indeed, many information-theoretic properties of the underlying process may be directly extracted from its mixed-state presentation. This follows from our showing how the attractor of the IFS defined by an HMC is exactly the set of mixed states R of that HMC. These sets are often fractal in nature and quite visually striking. See Fig. S6 for several examples.

The sequel [30] to this development establishes that the fractal dimension of the mixed-state attractor is exactly the divergence rate of the statistical complexity [24]—a measure of a process' structural complexity that tracks memory. Furthermore, the sequel introduces a method to calculate the fractal dimension of the mixed-state attractor from the Lyapunov spectrum of the mixed-state IFS. In this way, it demonstrates that coarse-graining the simplex—the previous approach to studying the structure of infinite-state processes—may be avoided altogether.

To close, we note that these structural tools and the entropy-rate method introduced here have already been put to practical use in two previous works. One diagnosed the origin of randomness and structural complexity in quantum measurement [31]. The other exactly determined the thermodynamic functioning of Maxwellian information engines [32], when there had been no previous method for this. At this point, however, we must leave the full explication of these techniques and further analysis of how mixed states reveal the underlying structure of processes generated by hidden Markov chains to the sequel [30].

ACKNOWLEDGMENTS
The authors thank Sam Loomis, Greg Wimsatt, Ryan James, David Gier, and Ariadna Venegas-Li for helpful discussions, the Telluride Science Research Center for hospitality during visits, and the participants of the Information Engines Workshops there. JPC acknowledges the kind hospitality of the Santa Fe Institute, the Institute for Advanced Study at the University of Amsterdam, and the California Institute of Technology during visits. This material is based upon work supported by, or in part by, FQXi Grant number FQXi-RFP-IPW-1902, and the U.S. Army Research Laboratory and the U.S. Army Research Office under contract W911NF-13-1-0390 and grant W911NF-18-1-0028.
[1] D. Goroff, editor. H. Poincaré, New Methods of Celestial Mechanics, 1: Periodic and Asymptotic Solutions. American Institute of Physics, New York, 1991.
[2] D. Goroff, editor. H. Poincaré, New Methods of Celestial Mechanics, 2: Approximations by Series. American Institute of Physics, New York, 1993.
[3] D. Goroff, editor. H. Poincaré, New Methods of Celestial Mechanics, 3: Integral Invariants and Asymptotic Properties of Certain Solutions. American Institute of Physics, New York, 1993.
[4] J. P. Crutchfield, N. H. Packard, J. D. Farmer, and R. S. Shaw. Chaos. Sci. Am., 255:46–57, 1986.
[5] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. Ser. 2, 42:230, 1936.
[6] C. E. Shannon. A universal Turing machine with two internal states. In C. E. Shannon and J. McCarthy, editors, Automata Studies, number 34 in Annals of Mathematical Studies, pages 157–165. Princeton University Press, Princeton, New Jersey, 1956.
[7] M. Minsky. Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
[8] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27:379–423, 623–656, 1948.
[9] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea Publishing Company, New York, second edition, 1956.
[10] A. N. Kolmogorov. Three approaches to the concept of the amount of information. Prob. Info. Trans., 1:1, 1965.
[11] A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russ. Math. Surveys, 38:29–40, 1983.
[12] A. N. Kolmogorov. Entropy per unit time as a metric invariant of automorphisms. Dokl. Akad. Nauk. SSSR, 124:754, 1959. (Russian) Math. Rev. vol. 21, no. 2035b.
[13] Ja. G. Sinai. On the notion of entropy of a dynamical system. Dokl. Akad. Nauk. SSSR, 124:768, 1959.
[14] J. P. Crutchfield. Between order and chaos. Nature Physics, 8(January):17–24, 2012.
[15] B. Marcus, K. Petersen, and T. Weissman, editors. Entropy of Hidden Markov Processes and Connections to Dynamical Systems, volume 385 of Lecture Notes Series. London Mathematical Society, 2011.
[16] Y. Ephraim and N. Merhav. Hidden Markov processes. IEEE Trans. Info. Th., 48(6):1518–1569, 2002.
[17] J. Bechhoefer. Hidden Markov models for stochastic thermodynamics. New J. Phys., 17:075003, 2015.
[18] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, January:4–16, 1986.
[19] E. Birney. Hidden Markov models in biological sequence analysis. IBM J. Res. Dev., 45(3.4):449–454, 2001.
[20] S. Eddy. What is a hidden Markov model? Nature Biotech., 22:1315–1316, Oct 2004.
[21] C. Bretó, D. He, E. L. Ionides, and A. A. King. Time series analysis via mechanistic models. Ann. App. Statistics, 3(1):319–348, Mar 2009.
[22] T. Rydén, T. Teräsvirta, and S. Åsbrink. Stylized facts of daily return series and the hidden Markov model. J. App. Econometrics, 13:217–244, 1998.
[23] R. B. Ash. Information Theory. John Wiley and Sons, New York, 1965.
[24] J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, 1989.
[25] J. P. Crutchfield. The calculi of emergence: Computation, dynamics, and induction. Physica D, 75:11–54, 1994.
[26] D. Blackwell. The entropy of functions of finite-state Markov chains. In Transactions of the First Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, volume 28, pages 13–20, Prague, Czechoslovakia, 1957. Publishing House of the Czechoslovak Academy of Sciences.
[27] F. Creutzig, A. Globerson, and N. Tishby. Past-future information bottleneck in dynamical systems. Phys. Rev. E, 79(4):041925, 2009.
[28] S. Still, J. P. Crutchfield, and C. J. Ellison. Optimal causal inference: Estimating stored information and approximating causal architecture. CHAOS, 20(3):037111, 2010.
[29] S. Marzen and J. P. Crutchfield. Predictive rate-distortion for infinite-order Markov processes. J. Stat. Phys., 163(6):1312–1338, 2014.
[30] A. Jurgens and J. P. Crutchfield. Infinite complexity of finite state hidden Markov processes. In preparation, 2020.
[31] A. Venegas-Li, A. Jurgens, and J. P. Crutchfield. Measurement-induced randomness and structure in quantum dynamics. arXiv:1908.09053, 2019.
[32] A. Jurgens and J. P. Crutchfield. Functional thermodynamics of Maxwellian ratchets: Constructing and deconstructing patterns, randomizing and derandomizing behaviors. Phys. Rev. Research, 2(3):033334, 2020.
[33] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006.
[34] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys., 104:817–879, 2001.
[35] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. CHAOS, 13(1):25–54, 2003.
[36] M. Barnsley. Fractals Everywhere. Academic Press, New York, 1988.
[37] M. F. Barnsley, S. G. Demko, J. H. Elton, and J. S. Geronimo. Invariant measures arising from iterated function systems with place-dependent probabilities. Ann. Inst. H. Poincaré, 24:367–394, 1988.
[38] J. H. Elton. An ergodic theorem for iterated maps. Ergod. Th. Dynam. Sys., 7:481–488, 1987.
[39] J. P. Crutchfield, P. Riechers, and C. J. Ellison. Exact complexity: Spectral decomposition of intrinsic computation. Phys. Lett. A, 380(9-10):998–1002, 2016.
[40] G. Birkhoff. Extensions of Jentzsch's theorem. Trans. Am. Math. Soc., 85(1):219–227, 1957.
[41] R. Cavazos-Cadena. An alternative derivation of Birkhoff's formula for the contraction coefficient of a positive matrix. Linear Algebra Appl., 375:291–297, 2003.
[42] E. Kohlberg and J. W. Pratt. The contraction mapping approach to the Perron-Frobenius theory: Why Hilbert's metric? Math. Oper. Res., 7(2), 1982.
[43] S. Marzen and J. P. Crutchfield. Information anatomy of stochastic equilibria. Entropy, 16(9):4713–4748, 2014.
[44] P. Riechers and J. P. Crutchfield. Spectral simplicity of apparent complexity, Part II: Exact complexities and complexity spectra. Chaos, 28:033116, 2018.
[45] M. Rezaeian. Hidden Markov process: A new representation, entropy rate and estimation entropy. arXiv:0606114.
[46] W. Słomczyński, J. Kwapień, and K. Życzkowski. Entropy computing via integration over fractal measures. Chaos, 10(1):180–188, Mar 2000.
[47] S. E. Marzen and J. P. Crutchfield. Nearly maximally predictive features and their dimensions. Phys. Rev. E, 95(5):051301(R), 2017.
Supplementary Materials: The Shannon Entropy Rate of Hidden Markov Processes
Alexandra M. Jurgens and James P. Crutchfield
The Supplementary Materials to follow review the notion of typical sets of realizations in a stochastic process, discuss minimality of infinite-state mixed-state presentations, determine the entropy rates of a suite of example hidden Markov chains with infinite mixed-state presentations, and give details of errors that arise when estimating autocorrelation.
Appendix A: Asymptotic Equipartition and the Typical Set Contraction
The asymptotic equipartition property (AEP) states that for a discrete-time, ergodic, stationary process X:

  −(1/n) log Pr(X_0, X_1, ..., X_n) → h_µ(X) ,    (S1)

as n → ∞ [33]. This effectively divides the set of sequences into two sets: the typical set—sequences for which the AEP holds—and the atypical set, for which it does not. As a consequence of the AEP, it must be the case that the typical set is measure one in the space of all allowed realizations and all sequences in the atypical set approach measure zero as n → ∞.

We argue that while our IFS class includes reducible maps, any composition of maps corresponding to a word in the typical set will be irreducible. This can be seen intuitively by considering the SNS, shown in Fig. 4, and adding an additional transition on a □ from σ_2 back to σ_2. This produces an HMC with two reducible symbol-labeled transition matrices, but an irreducible total transition matrix. However, as |w| → ∞, the only words such that T^(w) remains reducible are □^N and 1^N. We can see that these words cannot possibly be in the typical set, since −(1/n) log Pr(□^n) = −log Pr(□) ≠ h_µ(X). The entropy rate h_µ is by definition the branching entropy averaged over the mixed states. And so, any word that visits only a restricted subset of the mixed states—i.e., a word with a reducible transition matrix—cannot approach h_µ, regardless of length. Therefore, only words with an irreducible mapping will be in the typical set, implying that there exists an integer word length beyond which typical words induce contracting maps.

Appendix B: Minimality of U(M)

The minimality of infinite-state mixed-state presentations U(M) is an open question. As demonstrated in Appendix C 1, it is possible to construct MSPs with an uncountably infinite number of states for a process that requires only one state.

A proposed solution to this problem is a short and simple check on mergeability of mixed states, which here refers to any two distinct mixed states that have the same conditional probability distribution over future strings; i.e., any two mixed states η and ζ for which:

  Pr(X_{0:L} | η) = Pr(X_{0:L} | ζ) ,    (S1)

for all L ∈ Z⁺.

Although minimality does not impact the entropy-rate calculation, one benefit of the IFS formalization of the MSP is the ability to directly check for duplicated states and therefore determine if the MSP is nonminimal. We check this by considering, for an (N+1)-state machine M with alphabet A = {0, 1, ..., k}, the dynamic not only over mixed states, but over probability distributions over symbols. Let:

  P(η) = ( p^(0)(η), ..., p^(k−1)(η) )    (S2)

and consider Fig. S1. For each mixed state η ∈ Δ_N, Eq. (S2) gives the corresponding probability distribution P(η) ∈ Δ_k over the symbols x ∈ A. Let M emit symbol x; then the dynamic from one such probability distribution ρ_t ∈ Δ_k to the next is given by:

  g^(x)(ρ_t) = P ∘ f^(x) ∘ P^{−1}(ρ_t) = ρ_{t+1,x} .    (S3)

From this, we see that if Eq. (S2) is invertible, g^(x) : Δ_k → Δ_k is well defined and has the same functional properties as f^(x). In other words, in this case, it is not possible to have two distinct mixed states η, ζ ∈ Δ_N with the same probability distribution over symbols. And, the probability distributions can only converge under the action of g^(x) if the mixed states also converge under the action of f^(x). Shortly, we consider several cases where P is not invertible over the entire symbol simplex.
FIG. S1. Commuting diagram for probability functions P = {p^(x)}, mixed-state mapping functions f^(x), and proposed symbol-distribution mapping functions g^(x).

If every mixed state in R corresponds to a unique probability distribution over symbols, we conjecture that the corresponding U(M) is the minimal unifilar representation of the underlying process P. If we then trim the transient states of U(M), leaving the recurrent set R_R, the result is the ε-machine.
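The check is straightforward to automate at the one-step level. The sketch below computes P(η) of Eq. (S2) for a collection of mixed states and flags pairs with identical next-symbol distributions. Note the hedge in the comments: matching one-step distributions is necessary but not by itself sufficient for Eq. (S1), which quantifies over all future lengths L.

import numpy as np

def symbol_distribution(eta, T, symbols):
    """P(eta) of Eq. (S2): the distribution over next symbols from eta."""
    ones = np.ones(len(eta))
    return np.array([eta @ T[x] @ ones for x in symbols])

def mergeable_pairs(mixed_states, T, symbols, tol=1e-9):
    """Flag distinct mixed states with identical next-symbol distributions.
    This one-step test is necessary, not sufficient, for Eq. (S1); a full
    check would compare distributions over longer futures as well."""
    dists = [symbol_distribution(eta, T, symbols) for eta in mixed_states]
    pairs = []
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            if np.allclose(dists[i], dists[j], atol=tol):
                pairs.append((i, j))
    return pairs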
Appendix C: Examples

The following illustrates how to apply the theory and algorithms from the main text to accurately and efficiently calculate the entropy rate of processes generated by HMCs with countable and uncountable mixed states. It highlights a number of curious and nontrivial properties of these processes and their MSPs.
1. Cantor Set MSP
We first analyze a process with an MSP whose uncountable mixed states lie in a Cantor set. Surprisingly, this MSP is far from minimal, as the process is, in fact, generated by a biased coin—that is, a single-state ε-machine.

a. The Cantor Set
The Cantor set is perhaps the most well-known example of a nontrivial self-similar (fractal) set. The familiar middle-thirds version is constructed by starting with the unit interval C_0 = [0, 1] and removing the middle third, giving the set C_1 = [0, 1/3] ∪ [2/3, 1]. Repeating this action on C_1 produces C_2, and so on. The Cantor set C consists of points that remain after infinitely repeating this action: C = ∩_{n=1}^{∞} C_n.

FIG. S2. Nonunifilar HMC M_C that generates Cantor sets of mixed states on the simplex, for values of 0 < a < 1 and s > 2. For s = 3, we find the middle-thirds Cantor set.

The Cantor set is uncountably infinite and has Hausdorff dimension:

  dim_H C = log 2 / log 3 .

For the parametrized family of Cantor sets generated by repeating C_1 = [0, 1/s] ∪ [(s−1)/s, 1] (i.e., removing the middle (s−2)/s), the Hausdorff dimension is:

  dim_H C = log 2 / log s .
Simply stated, the dimension is the logarithm of the number of copies of the original unit interval made at each iteration, divided by the logarithm of the length ratio between the original object and its copy.

b. The Cantor Machine
The Cantor set, due to its familiarity, makes for a useful first object of study for uncountable-mixed-state HMCs. Figure S2 shows an HMC M_C that generates a Cantor set of mixed states. There 0 < a < 1 and s > 2, and the symbol-labeled transition matrices are:

  T^(□) = ( a/s    a(s−1)/s )          T^(1) = ( 1−a            0       )
          ( 0      a        )   and            ( (1−a)(s−1)/s  (1−a)/s ) .

This allows us to immediately write down the probability functions and mapping functions, recalling that in the two-state case the vectors on the simplex take the form ⟨η| = (⟨η|δ_1⟩, 1 − ⟨η|δ_1⟩):

  p^(□)(η) = ⟨η| T^(□) |1⟩ = a ,
  p^(1)(η) = ⟨η| T^(1) |1⟩ = 1 − a

and:

  f^(□)(η) = ⟨η| T^(□) / ⟨η| T^(□) |1⟩ = ( ⟨η|δ_1⟩/s , 1 − ⟨η|δ_1⟩/s ) ,
  f^(1)(η) = ⟨η| T^(1) / ⟨η| T^(1) |1⟩ = ( (s − 1 + ⟨η|δ_1⟩)/s , (1 − ⟨η|δ_1⟩)/s ) .

It is easily seen, by considering ⟨η|δ_1⟩ = 0 and ⟨η|δ_1⟩ = 1, that these maps, in fact, map the simplex to the first and second intervals of C_1, respectively.

The Cantor Machine MSP U(M_C) is shown in Fig. S3 (Top). It has an uncountably-infinite number of recurrent states, which correspond exactly to the elements of the Cantor set.

FIG. S3. Two valid alternative presentations of the Cantor set machine: (Top) The MSP U(M_C) of the Cantor set machine M_C in Fig. S2. The set of mixed states η_w is uncountably infinite. (Bottom) A unifilar hidden Markov model, commonly called the Biased Coin, that generates the same process as the nonunifilar Cantor machine M_C in Fig. S2.

Since the probability functions do not depend on ⟨η|, we do not need to invoke the Ergodic Theorem, but instead can calculate the entropy exactly:

  h_µ(U(M_C)) = − ∫ Σ_x p^(x)(η) log p^(x)(η) dµ_{U(M_C)}(η)
              = − a log(a) − (1 − a) log(1 − a)
              = H(a) .

c. A Biased Coin

However, there is an important caveat here, noted in Appendix B. The MSP may contain states that are probabilistically equivalent. The probability mapping functions are noninvertible and, in fact, every single mixed state corresponds to the same conditional probability distribution over symbols. This means that the uncountably-infinite MSP is not a minimal presentation. There is a markedly simpler unifilar model for the Cantor set machine M_C. In fact, all mixed states in U(M_C) collapse into a single state, giving the minimal unifilar model of the Cantor set machine as the Biased Coin HMC shown in Fig. S3 (Bottom). This HMC generates the same process as the Cantor machine, but requires only a single state.

FIG. S4. Three-state, nonunifilar machine M_S.
2. Countable MSP with 3-State HMC

Now, we explore a different, but related case that introduces a condition for a countable MSP and again highlights the role of minimality.

a. 3-State HMC with a Countable MSP

Consider the 3-state HMC M_S of Fig. S4. The transition matrices for this machine are:

  T^(□) = [ 0 0 0 ; 0 0 1 ; 0 0 0 ]   and   T^(■) = [ 1/2 1/2 0 ; 0 0 0 ; 1/3 1/3 1/3 ] .

These give the mapping and probability functions:

  p^(□)(η) = η₁ ,
  p^(■)(η) = 1 − η₁ ,

and:

  f^(□)(η) = (0, 0, 1) ,
  f^(■)(η) = ( (η₀/2 + η₂/3)/(1 − η₁) , (η₀/2 + η₂/3)/(1 − η₁) , (η₂/3)/(1 − η₁) ) .

Consider the probability functions first. P is not invertible over all of ∆, but is partially invertible over a restricted domain. Given a line in the simplex where η₀ and η₂ are functions of η₁, we can invert P(η). The question becomes: What is the appropriate restricted domain?

Note that for both f^(□) and f^(■), η₀ = η₁. In the simplex this corresponds to all the mixed states lying along a line in ∆: the line (η₀, η₀, 1 − 2η₀). This, then, is the restricted domain over which the states of U(M_S) correspond to unique probability distributions. The fact that this space is a line implies that the generative machine can be written with only two states.

The constancy of the mapping function for □ contributes further structure, ensuring that the set of mixed states will be countably infinite. We can write the mixed states down in series, in terms of how many ■s we have seen since the last □:

  η(n) = ( 2(3^n − 2^n)/(4·3^n − 3·2^n) , 2(3^n − 2^n)/(4·3^n − 3·2^n) , 2^n/(4·3^n − 3·2^n) ) ,

where n = 0 is taken to be the mixed state η(0) = (0, 0, 1), and

  Pr(□ | η(n)) = 2(3^n − 2^n)/(4·3^n − 3·2^n) ,
  Pr(■ | η(n)) = (2·3^n − 2^n)/(4·3^n − 3·2^n) .

The MSP U(M_S) is shown in Fig. S5. If the initial condition is η(0) = (0, 0, 1), it is the process's ε-machine. Since R is countable, we can find µ by hand, by solving the set of equations π_{n+1} = π_n Pr(■ | η(n)), with Σ_n π_n = 1. This gives π_n = (2/7)(2^{2−n} − 3^{1−n}) and we find:

  h_µ(U(M_S)) = Σ_{n=0}^∞ π_n H( Pr(□ | η(n)) ) ≈ 0.686 bits.

FIG. S5. Unifilar HMC that generates the same process generated by the nonunifilar machine M_S in Fig. S4.

b. Actually, A 2-State Machine

As mentioned above, the restricted domain over which P is invertible implies a smaller state set for the process generated by the nonunifilar machine M_S. For all relevant mixed states, Pr(σ₁) = Pr(σ₂), suggesting that we devise an HMC combining the two states. However, the mapping function for □ must still project definitively to a single state, to retain the countable infinity of mixed states. In fact, these restrictions ensure that the minimal nonunifilar HMC for the process is the HMC for the Simple Nonunifilar Source, discussed in Section V.

If we declare that η̂(0) = (0, 1), we can generate the sequence of mixed states η̂(n), again indexed by the number of ■s seen since the last □, by using the mapping functions in Section V. The next two states are:

  η̂(1) = ( 1 − q , q )   and
  η̂(2) = ( (1 − p + pq − q²)/(1 − p + pq) , q²/(1 − p + pq) ) .

For the underlying process to remain the same, the condition that must be met is P(η(n)) = P̂(η̂(n)). This determines p and q. For n = 0 this is trivially met. For n = 1 we have:

  P̂(η̂(1)) = ( p(1 − q) , 1 − (1 − q)p ) = ( 1/3 , 2/3 ) ,

so that p(1 − q) = 1/3. Substituting this into the η̂(2) condition we get:

  η̂(2) = ( 1 − (3/2)q² , (3/2)q² ) .

Substituting this into the probability distribution constraint for n = 2 gives 6q² − 5q + 1 = 0, so that q = 1/2 or q = 1/3 (and, correspondingly, p = 2/3 or p = 1/2): two different 2-state nonunifilar HMCs that generate the same process as the 3-state HMC. This further emphasizes the lack of uniqueness of generative models. That said, by examining the underlying IFS, their HMCs can be recovered.
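Since π_n decays like 2^(−n), the entropy-rate series above converges geometrically and is trivial to evaluate numerically. The following sketch is our own illustration, using the closed forms for Pr(□ | η(n)) and π_n just derived; it reproduces the quoted value.

```python
import numpy as np

def h_binary(q):
    """Binary entropy in bits, with the 0 log 0 = 0 convention."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

h_mu = 0.0
for n in range(200):                          # terms decay like 2**-n
    D = 4 * 3**n - 3 * 2**n
    p_box = 2 * (3**n - 2**n) / D             # Pr(square | eta(n))
    pi_n = (2 / 7) * (2.0**(2 - n) - 3.0**(1 - n))
    h_mu += pi_n * h_binary(p_box)

print(f"h_mu(U(M_S)) ~= {h_mu:.4f} bits")     # ~= 0.686
```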
3. Parametrized HMCs and Their MSPs
Finally, consider an HMC with 3 symbols and 3 states:

  T^(□) = [ αy βx βx ; αx βy βx ; αx βx βy ] ,
  T^(■) = [ βy αx βx ; βx αy βx ; βx αx βy ] ,
  T^(∘) = [ βy βx αx ; βx βy αx ; βx βx αy ] ,   (S1)

with β = (1 − α)/2 and y = 1 − 2x. From inspection, we see that α can take on any value from 0 to 1 and x may range from 0 to 1/2.

FIG. S6. Panels (a)–(d): 100,000 mixed states of the HMC defined by Eq. (S1), at four settings of (α, x). The parametrized 3-state HMC generates MSPs in a variety of structures, depending on x and α. However, due to the rotational symmetry in the transition matrices, the attractor is radially symmetric around the simplex center.

Fixing α and letting x range over [0, 1/2] gives us an MSP that first fills nearly the entire simplex, with probability mass concentrated at the corners, then shrinks to a finite machine with 3 states at x = 1/3, and finally grows once again into a fractal measure, as Fig. S6 illustrates. To demonstrate the ease and efficiency of calculating their entropy rates, Fig. S7 plots h_µ as a function of (x, α) ∈ [0, 1/2] × [0, 1].
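Each point of that surface can be reproduced by iterating the mixed-state IFS of Eq. (S1) and time-averaging the branching entropy, invoking the Ergodic Theorem as in the main text. The sketch below is our own minimal implementation of that procedure; the orbit length, burn-in, and sample point (α, x) = (0.7, 0.2) are illustrative choices.

```python
import numpy as np

def transition_matrices(alpha, x):
    """Symbol-labeled matrices of Eq. (S1), with beta = (1-alpha)/2, y = 1-2x."""
    b, y = (1 - alpha) / 2, 1 - 2 * x
    T1 = np.array([[alpha*y, b*x, b*x], [alpha*x, b*y, b*x], [alpha*x, b*x, b*y]])
    T2 = np.array([[b*y, alpha*x, b*x], [b*x, alpha*y, b*x], [b*x, alpha*x, b*y]])
    T3 = np.array([[b*y, b*x, alpha*x], [b*x, b*y, alpha*x], [b*x, b*x, alpha*y]])
    return [T1, T2, T3]

def entropy_rate(alpha, x, steps=50_000, burn=1_000, seed=0):
    Ts = transition_matrices(alpha, x)
    rng = np.random.default_rng(seed)
    eta = np.full(3, 1 / 3)                     # start at the simplex center
    h = 0.0
    for t in range(burn + steps):
        probs = np.array([eta @ T @ np.ones(3) for T in Ts])
        probs /= probs.sum()                    # guard against round-off
        if t >= burn:                           # ergodic average of branching entropy
            h -= np.sum(probs * np.log2(probs))
        k = rng.choice(3, p=probs)              # sample a symbol...
        eta = eta @ Ts[k]                       # ...and apply its mixed-state map
        eta /= eta.sum()
    return h / steps

print(entropy_rate(alpha=0.7, x=0.2))
```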
FIG. S7. Entropy rates of the parametrized HMC defined in Eq. (S1) over the (x, α) parameter plane.

Appendix D: Estimation Errors for Finite-State Autocorrelation
Coarse-graining the mixed-state simplex into a set C of boxes of width ε, we may construct a finite-state approximation of the infinite-state MSP. It has been shown that, given such an approximation, for any given box c the difference in entropy over the symbol distribution between the coarse-grained approximation and a mixed state within that box is bounded by:

  |H[X | C = c] − H[X | η ∈ c]| ≤ H_b( √|G| ε ) ,   (S1)

where H_b(·) is the binary entropy function [47]. Our task here is to consider the error in the autocorrelation of the sequence of mixed states since, if we can show that this is bounded, the error in the autocorrelation of the branching entropy must also be bounded.

At time zero, the autocorrelation is A(L = 0) = ⟨η η⟩, so for the finite-state approximation we have:

  A_C(L = 0) = Σ_i π_C(i) c_i c_i ,

where π_C is the stationary distribution over the coarse-grained mixed states, π_C(i) is the stationary probability of cell i, and c_i is the center of cell i. For the true process, we have:

  A(L = 0) = ∫_R dµ(η) ηη = Σ_i π_C(i) ∫_{η ∈ C_i} dµ(η | i) ηη ,

where dµ(η | i) is the distribution over mixed states within cell i. The maximum distance between any two mixed states in a cell i is bounded by:

  ‖η − ζ‖ ≤ √|G| ε ,

the length of the longest diagonal in a hypercube of dimension |G|, by construction. Since the gradient of the L₂ norm is simply ∇‖x‖ = x/‖x‖, we have a bound on the difference in the autocorrelation at time zero:

  |A_C(L = 0) − A(L = 0)| ≤ N √|G| ε .

With increasing length we have:

  A_C(L) = Σ_i π_C(i) c_i Σ_{w ∈ A^L} f^(w)(c_i) p^(w)(c_i)

and:

  A(L) = ∫_R dµ(η) η Σ_{w ∈ A^L} f^(w)(η) p^(w)(η)
       = Σ_i π_C(i) ∫_{η ∈ C_i} dµ(η | i) η Σ_{w ∈ A^L} f^(w)(η) p^(w)(η) .

Let η = c_i + δ for some mixed state in cell i. Then we can write:

  |A_C(L) − A(L)| ≤ Σ_i π_C(i) [ c_i Σ_{w ∈ A^L} f^(w)(c_i) p^(w)(c_i) − (c_i + δ) Σ_{w ∈ A^L} f^(w)(c_i + δ) p^(w)(c_i + δ) ] .

Now, note that:

  p^(w)(c_i + δ) ≈ p^(w)(c_i) + ∇p^(w)(c_i) · δ

and:

  f^(w)(c_i + δ) ≈ f^(w)(c_i) + e^{λ_w} δ ,

where λ_w is the leading Lyapunov exponent of the mapping function f^(w). Substituting this and eliminating terms of order δ² gives us:

  |A_C(L) − A(L)| ≤ Σ_i π_C(i) [ c_i Σ_{w ∈ A^L} ( f^(w)(c_i) ∇p^(w) · δ + e^{λ_w} δ p^(w)(c_i) ) + δ Σ_{w ∈ A^L} f^(w)(c_i) p^(w)(c_i) ] .

These terms identify three sources of approximation error: (i) that due to a difference in the probability distribution over symbols, (ii) that in the mapping functions, and (iii) that from approximating the points by the centers of their cells.

For the first, we note that the total variation in the probability distribution over symbols is bounded by the distance between the mixed states at which the distributions are computed. So, for any two mixed states in the same cell:

  ‖Pr(X = x | η) − Pr(X = x | ζ)‖_TV ≤ √|G| ε .

Then, the first term is the error due to the difference in the expectation value of the next state, given that we have calculated the probability distribution at c_i + δ rather than at c_i. Using Hölder's inequality, for two distributions P(X) and Q(X), we may say:

  E[f]_Q − E[f]_P = Σ_x f(x)(P(x) − Q(x))
                  ≤ Σ_x f(x)|P(x) − Q(x)|
                  ≤ ‖f‖_p ‖P − Q‖_q ,

where 1/p + 1/q = 1. Setting q = 1:

  E[f]_Q − E[f]_P ≤ ‖f‖_∞ ‖P − Q‖_TV .

So, after taking the product with the cell centers c_i, we have that the first error is bounded by N√|G| ε at all lengths.

For the second, we note that since the maps are contractions, λ_x < 0, and the distance between f^(x)(η) and f^(x)(ζ), where η and ζ are in the same cell i, is bounded by √|G| ε. As the length of a word w grows, λ_w → −∞ and the distance f^(w)(η) − f^(w)(ζ) → 0. At large L this term vanishes, at a rate equal to the average maximal Lyapunov exponent of the IFS. The final error is that in the autocorrelation from the cell approximation which is, likewise, bounded by the cell size; this is the same error as for A(0), viz. N√|G| ε.

And so, in combination with the bound on the entropy, we may say, loosely speaking, that the error in the autocorrelation vanishes as ε → 0. Therefore, to find τ and estimate the error in Eq. (18) as a function of sample size, we take finer coarse-grained approximations until convergence in the autocorrelation curve is observed, and then calculate τ.
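Concretely, the convergence check just described can be scripted in a few lines. The sketch below is our own illustration: the box widths, lag range, and tolerances are arbitrary choices, and the orbit file name is hypothetical; any orbit of mixed states generated as in the earlier examples will do.

```python
import numpy as np

def autocorrelation(orbit, L):
    """A(L) ~ <eta_t . eta_{t+L}>, averaged along the orbit."""
    head = orbit[:-L] if L > 0 else orbit
    return float(np.mean(np.sum(head * orbit[L:], axis=1)))

def coarse_grain(orbit, eps):
    """Snap each mixed state to the center of its simplex box of width eps."""
    return (np.floor(orbit / eps) + 0.5) * eps

orbit = np.load("mixed_state_orbit.npy")   # hypothetical (N, |G|) array of mixed states

lags = range(50)
prev = None
for eps in (0.1, 0.05, 0.025, 0.0125):     # successively finer coarse-grainings
    curve = np.array([autocorrelation(coarse_grain(orbit, eps), L) for L in lags])
    if prev is not None and np.max(np.abs(curve - prev)) < 1e-3:
        break                              # autocorrelation curve has converged
    prev = curve

# tau: first lag at which A(L) is within tolerance of its asymptotic value
tau = next(L for L in lags if abs(curve[L] - curve[-1]) < 1e-2)
print("tau =", tau)
```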