Shannon Entropy Rate of Hidden Markov Processes
Alexandra M. Jurgens* and James P. Crutchfield†
Complexity Sciences Center, Physics Department, University of California at Davis, Davis, California 95616
(Dated: September 1, 2020)
* [email protected]  † [email protected]

Hidden Markov chains are widely applied statistical models of stochastic processes, from fundamental physics and chemistry to finance, health, and artificial intelligence. The hidden Markov processes they generate are notoriously complicated, however, even if the chain is finite state: no finite expression for their Shannon entropy rate exists, as the set of their predictive features is generically infinite. As such, to date one cannot make general statements about how random they are nor how structured. Here, we address the first part of this challenge by showing how to efficiently and accurately calculate their entropy rates. We also show how this method gives the minimal set of infinite predictive features. A sequel addresses the challenge's second part on structure.
Keywords: Markov process, Shannon entropy, iterated function system, mixed state, predictive feature, optimal prediction, Blackwell measure
I. INTRODUCTION
Randomness is as necessary to physics as determinism. Indeed, since Henri Poincaré's failed attempt to establish the orderliness of planetary motion, it has been understood that both determinism and randomness are essential and unavoidable in the study of physical systems [1–4]. In the 1960s and 1970s, the rise of dynamical systems theory and the exploration of the statistical physics of critical phenomena offered up new perspectives on this duality. The lesson was that intricate structures in a system's state space amplify uncertainty, guiding it and eventually installing it—paradoxically—in complex spatiotemporal patterns. Accepting this state of affairs prompts basic, but as-yet unanswered questions. How is this emergence monitored? How do we measure a system's randomness or quantify its patterns and their organization?

The tools needed to address these questions arose over recent decades during the integration of Turing's computation theory [5–7], Shannon's information theory [8], and Kolmogorov's dynamical systems theory [9–13]. This established the vital role that information plays in physical theories of complex systems. In particular, the application of hidden Markov chains to model and analyze the randomness and structure of physical systems has seen considerable success, not only in complex systems [14], but also in coding theory [15], stochastic processes [16], stochastic thermodynamics [17], speech recognition [18], computational biology [19, 20], epidemiology [21], and finance [22], to offer a nonexhaustive list of examples.

A highly useful property of certain hidden Markov chains (HMCs) is unifilarity [23], a structural constraint on their state transitions. Shannon showed that, given a process generated by a finite-state unifilar HMC, one may directly and accurately calculate the process' irreducible randomness [8]—now called the Shannon entropy rate. Furthermore, for such a process, there is a unique minimal finite-state unifilar HMC that generates the process [24], known as the ε-machine. The ε-machine states—the process' causal states—are the minimal set of maximally predictive features. One consequence of the ε-machine's uniqueness and minimality is that its mathematical description gives a constructive definition of a process' structural complexity as the amount of memory required to generate the process.

Loosening the unifilar constraint to consider a wider class of generated processes, however, leads to major roadblocks. Predicting a process generated by a finite-state nonunifilar HMC requires an infinite set of causal states [25]. That is, though "finitely" generated, the process cannot be predicted by any finite unifilar HMC. Practically, this precludes directly determining the process' entropy rate using Shannon's result and, at best, obscures any insight into its internal structure.

That said, its causal states are (in general, see Appendix B) equivalent to the uncountable set of mixed states, or predictive features, formally introduced by Blackwell over a half century ago [26]. To date, working with infinite mixed states required coarse-graining to produce a finite set of predictive features.
Fortunately, the tradeoffs between resource constraints and predictive power induced by such coarse-graining can be systematically laid out [27–29].

The following introduces an alternative and more direct approach to working with mixed states, though. It casts generating mixed states as a chaotic dynamical system—specifically, a (place-dependent) iterated function system (IFS). This obviates analyzing the underlying HMC via coarse-graining. Rather, the complex dynamics of the new system directly captures the information-theoretic properties of the original process. Specifically, this allows exactly calculating the entropy rate of the process generated by the original nonunifilar finite-state HMC. Additionally, the IFS interpretation of the nonunifilar HMC provides new insight into the structure and complexity of infinite-state processes. This has direct application to the study of randomness and structure in a wide range of physical systems.

In point of fact, the following and its sequel [30] were preceded by two companions that applied the theoretical results here to two, rather different, physical domains. The first analyzed the origin of randomness and structural complexity engendered by quantum measurement [31]. The second solved a longstanding problem on exactly determining the thermodynamic functioning of Maxwellian demons, aka information engines [32]. That is, the following and its sequel lay out the mathematical and algorithmic tools required to successfully analyze these applied problems. We believe the new approach is destined to find even wider applications.

Section II recalls the necessary background in stochastic processes, hidden Markov chains, and information theory. Section III reviews the needed results on iterated function systems, while Sec. IV develops mixed states and their dynamic—the mixed-state presentation. The main result connecting these then follows in Sec. V, showing that the mixed-state presentation is an IFS and that it produces an ergodic process. Section VI recalls Blackwell's theory, updating it for our present purpose of determining the entropy rate of any HMC. The Supplementary Materials provide background on the asymptotic equipartition property and minimality of the mixed states. They also constructively work through the results for several example nonunifilar HMCs. They close with the statistical error analysis underlying entropy-rate estimation.
II. HIDDEN MARKOV PROCESSES

A stochastic process P is a probability measure over a bi-infinite chain ... X_{t−2} X_{t−1} X_t X_{t+1} X_{t+2} ... of random variables, each denoted by a capital letter. A particular realization ... x_{t−2} x_{t−1} x_t x_{t+1} x_{t+2} ... is denoted via lowercase letters. We assume the values x_t belong to a discrete alphabet A. We work with blocks X_{t:t'}, where the first index is inclusive and the second exclusive: X_{t:t'} = X_t ... X_{t'−1}. P's measure is defined via the collection of distributions over blocks: {Pr(X_{t:t'}) : t < t'; t, t' ∈ Z}.

To simplify the development, we restrict to stationary, ergodic processes: those for which Pr(X_{t:t+ℓ}) = Pr(X_{0:ℓ}) for all t ∈ Z, ℓ ∈ Z⁺. In such cases, we only need to consider a process's length-ℓ word distributions Pr(X_{0:ℓ}).

A Markov process is one for which Pr(X_t | X_{−∞:t}) = Pr(X_t | X_{t−1}). A hidden Markov process is the output of a memoryless channel [33] whose input is a Markov process [16].

FIG. 1. A hidden Markov chain (HMC) with two states, {σ_1, σ_2}, and two symbols, {□, 1}. This machine is unifilar.

Working with processes directly is cumbersome, so we turn to consider finitely-specified mechanistic models that generate them.

Definition 1.
A finite-state edge-labeled hidden Markov chain (HMC) consists of:

1. a finite set of states S = {σ_1, ..., σ_N},

2. a finite alphabet A of k symbols x ∈ A, and

3. a set of N by N symbol-labeled transition matrices T^(x), x ∈ A, with T^(x)_{ij} = Pr(σ_j, x | σ_i). The corresponding overall state-to-state transitions are described by the row-stochastic matrix T = Σ_{x∈A} T^(x).

Any given stochastic process can be generated by any number of HMCs. These are called a process' presentations.
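As a computational aside, an HMC of this kind is conveniently encoded as a mapping from symbols to transition matrices. The following minimal Python sketch fixes this representation and checks the row-stochasticity of T; the specific two-state machine is our own illustrative assumption (chosen to be unifilar and to have invariant distribution (2/3, 1/3), matching the value quoted in Fig. 2), not necessarily the exact machine of Fig. 1.

import numpy as np

# Hypothetical 2-state unifilar HMC over symbols {"box", "one"}:
# state 1 emits "box" (self-loop) or "one" (to state 2), each w.p. 1/2;
# state 2 emits "one" (back to state 1) w.p. 1.
T = {
    "box": np.array([[0.5, 0.0],
                     [0.0, 0.0]]),
    "one": np.array([[0.0, 0.5],
                     [1.0, 0.0]]),
}

total = sum(T.values())
assert np.allclose(total.sum(axis=1), 1.0)  # T = sum_x T^(x) is row-stochastic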
We now introduce a structural property of HMCs that has important consequences in characterizing process randomness and structure.

Definition 2. A unifilar HMC (uHMC) is an HMC such that for each state σ_i ∈ S and each symbol x ∈ A there is at most one outgoing edge from state σ_i labeled with symbol x.

Although there are many presentations for a process P, there is a canonical presentation that is unique: a process' ε-machine.

Definition 3. An ε-machine is a uHMC with probabilistically distinct states: For each pair of distinct states σ_i, σ_j ∈ S there exists a finite word w = x_0 x_1 ... x_{ℓ−1} such that:

  Pr(X_{0:ℓ} = w | S_0 = σ_i) ≠ Pr(X_{0:ℓ} = w | S_0 = σ_j) .

A process' ε-machine is its optimal, minimal presentation, in the sense that its set of predictive states is minimal in size |S| compared to all its other unifilar presentations [34].

A. Entropy Rate of HMCs
A process' intrinsic randomness is the information in the present measurement, discounted by having observed the information in an infinitely long history. It is measured by Shannon's source entropy rate [8].
Definition 4.
A process' entropy rate h_µ is the asymptotic average entropy per symbol [35]:

  h_µ = lim_{ℓ→∞} H[X_{0:ℓ}] / ℓ ,    (1)

where H[X_{0:ℓ}] is the Shannon entropy of the length-ℓ block X_{0:ℓ}:

  H[X_{0:ℓ}] = − Σ_{x_{0:ℓ} ∈ A^ℓ} Pr(x_{0:ℓ}) log Pr(x_{0:ℓ}) .    (2)

Given a finite-state unifilar presentation M_u of a process P, we may directly calculate the entropy rate from the transition matrices of the uHMC [8]:

  h_µ(P) = h_µ(M_u) = − Σ_{σ∈S} Pr(σ) Σ_{x∈A} T^(x)_{σσ'} log T^(x)_{σσ'} ,    (3)

where, by unifilarity, σ' denotes the unique x-successor of σ. Blackwell showed, though, that in general for processes generated by HMCs there is no closed-form expression for the entropy rate [26]. For a process generated by a nonunifilar HMC M, applying Eq. (3) to M typically overestimates the true entropy rate of the process h_µ(P): h_µ(M) ≥ h_µ(P). Overcoming this limitation is one of our central results. We now embark on introducing the necessary tools for this.
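As an aside, Eq. (3) is immediate to implement numerically. A minimal sketch, reusing the illustrative two-state unifilar machine above; stationary(), shannon_entropy_rate(), and the machine itself are our own names and assumptions:

import numpy as np

T = {"box": np.array([[0.5, 0.0], [0.0, 0.0]]),
     "one": np.array([[0.0, 0.5], [1.0, 0.0]])}

def stationary(T):
    """Stationary state distribution: left eigenvector of T = sum_x T^(x)."""
    total = sum(T.values())
    evals, evecs = np.linalg.eig(total.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()

def shannon_entropy_rate(T):
    """Eq. (3); correct only when the HMC is unifilar."""
    pi = stationary(T)
    h = 0.0
    for Tx in T.values():
        rows, cols = np.nonzero(Tx)      # edges (source state, target state)
        p = Tx[rows, cols]               # their transition probabilities
        h -= np.sum(pi[rows] * p * np.log2(p))
    return h

print(shannon_entropy_rate(T))  # 2/3 bit per symbol for this machine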
III. ITERATED FUNCTION SYSTEMS
To get there, we must take a short detour to review iterated function systems (IFSs) [36], as they play a critical role in analyzing HMCs. Speaking simply, we show that HMCs are dynamical systems—namely, IFSs.

Let (Δ_N, d) be a compact metric space with d(·,·) a distance. This notation anticipates our later application, in which Δ_N is the N-simplex of discrete-event probability distributions (see Section IV A). However, the results here are general.

Let f^(x) : Δ_N → Δ_N for x = 1, ..., k be a set of Lipschitz functions with:

  d( f^(x)(η), f^(x)(ζ) ) ≤ τ^(x) d(η, ζ) ,

for all η, ζ ∈ Δ_N and where τ^(x) is a constant. This notation is chosen to draw an explicit parallel to the stochastic processes discussed in Section II and to avoid confusion with the lowercase Latin characters used for realizations of stochastic processes. In particular, note that the superscript (x) here and elsewhere parallels that of the HMC symbol-labeled transition matrices T^(x). The reasons for this will soon become clear.

The Lipschitz constant τ^(x) is the contractivity of map f^(x). Let p^(x) : Δ_N → [0, 1] be continuous, with p^(x)(η) ≥ 0 and Σ_{x=1}^{k} p^(x)(η) = 1 for all η ∈ Δ_N. The triplet {Δ_N, {p^(x)}, {f^(x)} : x ∈ A} defines a place-dependent IFS.

A place-dependent IFS generates a stochastic process over η ∈ Δ_N as follows. Given an initial position η_0 ∈ Δ_N, the probability distribution {p^(x)(η_0) : x = 1, ..., k} is sampled. According to the sample x, apply f^(x) to map η_0 to the next position η_1 = f^(x)(η_0). Resample x from the distribution at η_1 and continue, generating η_2, η_3, η_4, ....

If each map f^(x) is a contraction—i.e., τ^(x) < 1—it is well known that there exists a unique nonempty compact set Λ ⊂ Δ_N that is invariant under the IFS's action:

  Λ = ⋃_{x=1}^{k} f^(x)(Λ) .

Λ is the IFS's attractor.

Consider the operator V : M(Δ_N) → M(Δ_N) on the space of Borel measures on the N-simplex:

  V µ(B) = Σ_{x=1}^{k} ∫_{(f^(x))^{−1}(B)} p^(x)(η) dµ(η) .    (4)

A Borel probability measure µ is said to be invariant or stationary if Vµ = µ. It is attractive if for any probability measure ν in M(Δ_N):

  ∫ g d(V^n ν) → ∫ g dµ ,

for all g in the space of bounded continuous functions on Δ_N.

Let's recall here a key result concerning the existence of attractive, invariant measures for place-dependent IFSs.

Theorem 1. [37, Thm. 2.1] Suppose there exist r < 1 and q > 0 such that:

  Σ_{x∈A} p^(x)(η) d^q( f^(x)(η), f^(x)(ζ) ) ≤ r^q d^q(η, ζ) ,

for all η, ζ ∈ Δ_N. Assume that the modulus of uniform continuity of each p^(x) satisfies Dini's condition and that there exists a δ > 0 such that:

  Σ_{x : d(f^(x)(η), f^(x)(ζ)) ≤ r d(η,ζ)} p^(x)(η) p^(x)(ζ) ≥ δ ,    (5)

for all η, ζ ∈ Δ_N. Then there is an attractive, unique, invariant probability measure for the Markov process generated by the place-dependent IFS.

In addition, under these same conditions Ref. [38] established an ergodic theorem for IFS orbits. That is, for any η ∈ Δ_N and any bounded continuous g : Δ_N → R:

  (1/(n+1)) Σ_{k=0}^{n} g( f^(x_k) ∘ ··· ∘ f^(x_0)(η) ) → ∫ g dµ ,    (6)

for almost every symbol sequence x_0 x_1 x_2 ....
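To make Eq. (6) concrete, the following sketch simulates a place-dependent IFS orbit and time-averages an observable. The two affine contractions on [0, 1] and their probability functions are illustrative assumptions; the point is that time averages from different starting points agree, as the ergodic theorem asserts.

import numpy as np

rng = np.random.default_rng(42)

maps = [lambda e: 0.5 * e,            # f^(0): contraction toward 0
        lambda e: 0.5 * e + 0.5]      # f^(1): contraction toward 1

def probs(e):
    """Place-dependent probabilities p^(x)(eta); continuous, summing to 1."""
    return np.array([0.25 + 0.5 * e, 0.75 - 0.5 * e])

def time_average(g, e0, n=200_000):
    """Left side of Eq. (6): average g along one sampled orbit from e0."""
    total, e = 0.0, e0
    for _ in range(n):
        e = maps[rng.choice(2, p=probs(e))](e)
        total += g(e)
    return total / n

g = lambda e: e * e
print(time_average(g, 0.1), time_average(g, 0.9))  # both -> integral of g dmu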
IV. MIXED-STATE PRESENTATION

We now return to stochastic processes and their HMC presentations. When calculating entropy rates from various presentations, we noted that HMC presentations led to difficulties: (i) the internal Markov-chain entropy rate overestimates the process' entropy rate and (ii) there is no closed-form entropy-rate expression. To develop the tools needed to resolve these problems, we introduce HMC mixed states and their dynamic.

Assume that an observer has a finite HMC presentation M for a process P. Since the process is hidden, the observer does not directly measure M's internal states. Absent output data, the best guess for M's hidden states is that they occur according to the stationary state distribution π. The observer can improve on this guess by monitoring the output data x_0 x_1 x_2 ... that M generates. Given knowledge of M, determining the internal state from observed data is the problem of observer-process synchronization.

A. Mixed States
For a length-ℓ word w generated by M, let η(w) = Pr(S | w) be the observer's belief distribution as to the process' current state after observing w:

  η(w) ≡ Pr(S_ℓ | X_{0:ℓ} = w, S_0 ∼ π) .    (7)

When observing an N-state machine, the vector ⟨η(w)| lives in the (N−1)-simplex Δ_{N−1}:

  Δ_{N−1} = { η ∈ R^N : ⟨η|1⟩ = 1, ⟨η|δ_i⟩ ≥ 0, i = 1, ..., N } ,

where ⟨δ_i| = (0 ... 0 1 0 ... 0), with the 1 in the i-th position. The 0-simplex Δ_0 is the single point |η⟩ = (1); the 1-simplex Δ_1 is the line segment [0, 1] from |η⟩ = (0, 1) to |η⟩ = (1, 0); and so on. The set of states η(w) that an HMC can visit defines its set of mixed states:

  R = { η(w) : w ∈ A⁺, Pr(w) > 0 } .

Generically, the mixed-state set R for an N-state HMC is infinite, even for finite N [26].

Note that when a mixed state appears in probability expressions, the notation refers to the random variable η, not the row vector |η⟩, and we drop the bra-ket notation. Bra-ket notation is used in vector-matrix expressions.

B. Mixed-State Dynamic
The probability of transitioning from ⟨η(w)| to ⟨η(wx)| on observing symbol x follows from Eq. (7) immediately; we have:

  Pr(η(wx) | η(w)) = Pr(x | S_ℓ ∼ η(w)) .

This defines the mixed-state transition dynamic W. Together the mixed states and their dynamic define an HMC that is unifilar by construction. This is a process' mixed-state presentation (MSP) U(P) = {R, W}.

We defined a process' U abstractly. The U typically has an uncountably infinite set of mixed states, making it challenging to work with in the form laid out in Section IV A. Usefully, however, given any HMC M that generates the process, we may explicitly write down the dynamic W. Assume we have an (N+1)-state HMC presentation M with k symbols x ∈ A. The initial condition is the invariant probability π over the states of M, so that ⟨η_0| = ⟨δ_π|. In the context of the mixed-state dynamic, mixed-state subscripts denote time.

The probability of generating symbol x when in mixed state η is:

  Pr(x | η) = ⟨η| T^(x) |1⟩ ,    (8)

where T^(x) is the symbol-labeled transition matrix associated with the symbol x.

From η, we calculate the probability of seeing each x ∈ A. Upon seeing symbol x, the current mixed state ⟨η_t| is updated according to:

  ⟨η_{t+1,x}| = ⟨η_t| T^(x) / ⟨η_t| T^(x) |1⟩ .    (9)

Thus, given an HMC presentation we can restate Eq. (7) as:

  ⟨η(w)| = ⟨η_0| T^(w) / ⟨η_0| T^(w) |1⟩ = ⟨π| T^(w) / ⟨π| T^(w) |1⟩ .

Equation (9) tells us that, by construction, the MSP is unifilar, since each possible output symbol uniquely determines the next (mixed) state. Taken together, Eqs. (8) and (9) define the mixed-state transition dynamic W as:

  Pr(η_{t+1}, x | η_t) = Pr(x | η_t) = ⟨η_t| T^(x) |1⟩ ,

for all η ∈ R, x ∈ A.

To find the MSP U = {R, W} for a given HMC M we apply the mixed-state construction method:

1. Set U = {R = ∅, W = ∅}.

FIG. 2. Determining the mixed-state presentation (MSP) of the 2-state unifilar HMC shown in (A): The invariant state distribution π = (2/3, 1/3) sets the initial mixed state η_0, used in (B) to calculate the next set of mixed states. (C) The full set of mixed states seen from all allowed words. In this case, we recover the unifilar HMC shown in (A) as the MSP's recurrent states.
2. Calculate M's invariant state distribution: π = πT.

3. Take η_0 to be ⟨δ_π| and add it to R.

4. For each current mixed state η_t ∈ R, use Eq. (8) to calculate Pr(x | η_t) for each x ∈ A.

5. For η_t ∈ R, use Eq. (9) to find the updated mixed state η_{t+1,x} for each x ∈ A.

6. Add η_t's transitions to W and each η_{t+1,x} to R, merging duplicate states.

7. For each new η_{t+1}, repeat steps 4–6 until no new mixed states are produced.

With the MSP U(M) in hand, the next issue is determining its (equivalent) ε-machine. There are several cases.

Beginning with a finite, unifilar HMC M generating a process P, the MSP U(M) is a finite, optimally-predictive rival presentation to P's ε-machine, as seen in Fig. 2. In this case, the starting HMC depicted in Fig. 2(A) is an ε-machine, and reducing the MSP in Fig. 2(C) by trimming the transient states returns the process' recurrent-state ε-machine. When starting with the ε-machine, trimming the resultant U(ε-machine) in this way always returns the ε-machine.

In general, if U(M) is finite, we find the ε-machine by minimizing U(M) via merging duplicate states: repeat mixed-state construction on U(M) and trim transient states once more. Minimizing countably-infinite and uncountably-infinite U(M) is discussed further in Appendix B.

The MSPs of unifilar presentations are interesting and contain additional information beyond the unifilar presentations. For example, containing transient causal states, they are employed in calculating many complexity measures that track convergence statistics [39].

However, here we focus on the mixed-state presentations of nonunifilar HMCs, which typically have an infinite mixed-state set R. Figure 3 illustrates applying mixed-state construction to a finite, nonunifilar HMC. This produces an infinite sequence of mixed states on Δ_1 = [0, 1]. In this case, R is countably infinite, allowing us to better understand the underlying process P; compared, say, to the 2-state nonunifilar HMC in Fig. 3(A). MSPs of nonunifilar HMCs typically have an uncountably-infinite mixed-state set R.
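The construction method translates directly into an enumeration over belief states. A sketch, assuming an HMC given as a dict of numpy matrices; states are merged when equal to within a tolerance, and, since R is generically infinite, max_states truncates the search:

import numpy as np

def mixed_state_presentation(T, tol=1e-10, max_states=10_000):
    """Enumerate mixed states by repeated application of Eqs. (8)-(9).

    T: dict mapping symbol -> N x N matrix T^(x). States within `tol` of
    each other are merged; for generic nonunifilar HMCs the true set is
    infinite, so `max_states` truncates the enumeration.
    """
    total = sum(T.values())
    evals, evecs = np.linalg.eig(total.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi /= pi.sum()                                    # eta_0 = pi

    key = lambda eta: tuple(np.round(eta / tol).astype(np.int64))
    states = {key(pi): pi}
    frontier = [pi]
    transitions = {}                  # (state key, symbol) -> (next key, prob)
    while frontier and len(states) < max_states:
        eta = frontier.pop()
        for x, Tx in T.items():
            p = eta @ Tx @ np.ones(len(eta))          # Eq. (8): Pr(x | eta)
            if p <= 0:
                continue
            nxt = (eta @ Tx) / p                      # Eq. (9): belief update
            k = key(nxt)
            if k not in states:
                states[k] = nxt
                frontier.append(nxt)
            transitions[(key(eta), x)] = (k, p)
    return states, transitions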
V. MSP AS AN IFS

With this setup, our intentions in reviewing iterated function systems (IFSs) become explicit. The mixed-state presentation (MSP) exactly defines a place-dependent IFS: the mapping functions are the symbol-labeled mixed-state update functions given in Eq. (9) and the place-dependent probability functions are given by Eq. (8). We then have a mapping function and associated probability function for each symbol x ∈ A, each derived from the symbol-labeled transition matrix T^(x).

If these probability and mapping functions meet the conditions of Theorem 1, we identify the attractor Λ as the set of mixed states R and the invariant measure µ as the invariant distribution of the potentially infinite-state U. This is the original HMC's Blackwell measure. Since all Lipschitz continuous functions are Dini continuous, the probability functions meet the conditions by inspection. We now establish that the maps are contractions, by appealing to Birkhoff's 1957 proof that a positive linear map preserving a convex cone is a contraction under the Hilbert projective metric [40].

Given an integer N ≥ 2, let C_N be the nonnegative cone in R^N, so that C_N consists of all vectors z = (z_1, z_2, ..., z_N) satisfying z ≠ 0 and z_i ≥ 0 for all i.

FIG. 3. Determining the mixed-state presentation of the 2-state nonunifilar HMC shown in (A). The invariant distribution π = (1/2, 1/2) sets the initial mixed state η_0, used in (B) to calculate the next set of mixed states. (B) plots the mixed states along the 1-simplex Δ_1 = [0, 1].

The projective distance d : C_N × C_N → [0, ∞) is defined [42]:

  d(z, y) := max { | log( (z_r y_s) / (z_s y_r) ) | : r, s = 1, ..., N; r ≠ s } ,    (10)

for z, y ∈ C_N, where d(z, z) = 0. If one of the points is on the cone boundary, the distance is taken to be +∞. Note that the projective distance, by construction, satisfies d(αz, βy) = d(z, y) for α, β ∈ R⁺. In other words, for two mixed states η, ζ ∈ Δ_N, d( f^(x)(η), f^(x)(ζ) ) = d( ηT^(x), ζT^(x) ).

If T^(x) is an N × N positive matrix, we have d( zT^(x), yT^(x) ) < d(z, y) for every z, y ∈ C_N such that d(y, z) > 0.
We define the projective contractivity τ^(x) associated with T^(x) as:

  τ^(x) := sup_{z,y ∈ C_N : d(z,y) > 0} d( zT^(x), yT^(x) ) / d(z, y) ,

so that τ^(x) satisfies τ^(x) ≤ 1. As the theorem below indicates, this inequality is strict.

Theorem 2. ([41, Thm. 1].) Let the integers m, n ≥ 2. For a matrix T^(x) = [ t^(x)_{ij} ] of order m × n with positive components, τ^(x) is given by the following Birkhoff formula:

  τ^(x) = ( 1 − φ(T^(x))^{1/2} ) / ( 1 + φ(T^(x))^{1/2} ) ,

where:

  φ(T^(x)) := min_{r,s,j,k} ( t^(x)_{rj} t^(x)_{sk} ) / ( t^(x)_{sj} t^(x)_{rk} ) .

By inspection we see that φ(T^(x)) > 0 for a positive matrix and, therefore, τ^(x) < 1. This is not generally true for our transition matrices—they are restricted merely to be nonnegative. However, the above result extends to any nonnegative matrix T^(x) for which there exists an N ∈ Z⁺ such that (T^(x))^N is a positive matrix. Then there will be a τ^(x) < 1 with d( η(T^(x))^N, ζ(T^(x))^N ) ≤ τ^(x) d(η, ζ) < d(η, ζ). This is equivalent to a requirement that T^(x) be aperiodic and irreducible.

FIG. 4. Simple Nonunifilar Source (SNS): The symbol-labeled transition matrices given in Eq. (11) are both reducible, but the place-dependent IFS still has an attractor with an invariant probability distribution. By setting p = q = 1/2, we return the nonunifilar HMC from Fig. 3.
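Theorem 2's formula is a finite minimization and is simple to evaluate. A sketch for a strictly positive matrix; the example matrix is our own:

import numpy as np
from itertools import product

def birkhoff_contraction(Tx):
    """Birkhoff contraction coefficient tau of a positive matrix (Theorem 2)."""
    m, n = Tx.shape
    phi = min(
        Tx[r, j] * Tx[s, k] / (Tx[s, j] * Tx[r, k])
        for r, s in product(range(m), repeat=2)
        for j, k in product(range(n), repeat=2)
    )
    return (1 - phi**0.5) / (1 + phi**0.5)

Tx = np.array([[0.6, 0.4],
               [0.2, 0.8]])
print(birkhoff_contraction(Tx))  # approx 0.42, strictly less than 1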
Still, we are not guaranteed irreducibility and aperiodicity for our symbol-labeled transition matrices. Indeed, the Simple Nonunifilar Source, depicted in Fig. 4, has the symbol-labeled transition matrices:

  T^(1) = ( 1−p   p  )          T^(□) = ( 0  0 )
          ( 0    1−q )   and            ( q  0 ) .    (11)

Both T^(1) and T^(□) are reducible. A quick check is to examine Fig. 4 and ask if there is a length-n sequence consisting of only a single symbol that reaches every state from every other state. Nonetheless, the HMC has a countable set of mixed states R and an invariant measure µ. We can determine this from the mapping functions:

  f^(1)(η) = ( ⟨η|δ_1⟩(1−p) / (1 − (1 − ⟨η|δ_1⟩)q) , (⟨η|δ_1⟩ p + (1 − ⟨η|δ_1⟩)(1−q)) / (1 − (1 − ⟨η|δ_1⟩)q) )    (12)

and

  f^(□)(η) = [1, 0] .    (13)

From any initial state η_0, other than η_0 = σ_1 = [1, 0], the probability of emitting a □ is positive. Once a □ is emitted, the mixed state is guaranteed to be η = σ_1 = [1, 0]. When observing a symbol determines the mixed state uniquely in this way, regardless of the prior mixed state, we call the symbol a synchronizing symbol. From σ_1, the set of mixed states is generated by repeated emissions of 1s, so that R = { (f^(1))^n(σ_1) : n = 0, ..., ∞ }. This is visually depicted in Fig. 3 for the specific case of p = q = 1/2. For all p and q, the measure can be determined analytically; see Ref. [43]. Note that this is due to the HMC's highly structured topology. In general, the set of mixed states is uncountable—either a fractal or continuous set—and the measure cannot be analytically expressed.

Assuming the HMC generates an ergodic process ensures that the total transition matrix T = Σ_x T^(x) is nonnegative, irreducible, and aperiodic. Define for any word w = x_0 ... x_{ℓ−1} ∈ A⁺ the associated matrix product T^(w) = T^(x_0) ··· T^(x_{ℓ−1}) and mapping function f^(w) = f^(x_{ℓ−1}) ∘ ··· ∘ f^(x_0). Consider a word w in the process' typical set of realizations (see Appendix A), a set which approaches measure one as |w| → ∞. Due to ergodicity, it must be the case that f^(w) is either (i) a constant mapping—and, therefore, infinitely contracting—or (ii) T^(w) is irreducible.

As an example of the former case, we see that any composition of the SNS functions Eqs. (12) and (13) is a constant function, so long as there is at least one □ in the word, the probability of which approaches one as the word grows in length.

As an example of the latter case, imagine adding to the SNS in Fig. 4 a transition on □ from σ_2 back to σ_2. Then, both symbol-labeled transition matrices are still reducible, but the composite transition matrix for any word including both symbols is now irreducible. Therefore, the map is contracting. While this is not the case for words composed of all □s or of all 1s, these sequences have measure zero as word length grows. Appendix A discusses this further.
VI. ENTROPY OF GENERAL HMCS

Blackwell analyzed the entropy of functions of finite-state Markov chains [26]. With a shift in notation, functions of Markov chains can be identified as general hidden Markov chains. This is to say, both presentation classes generate the same class of stochastic processes. As we have discussed, the entropy-rate problem for unifilar hidden Markov chains is solved by Shannon's entropy rate expression, Eq. (3). However, according to Blackwell, there is no analogous closed-form expression for the entropy rate of a nonunifilar HMC.
A. Blackwell Entropy Rate
That said, Blackwell gave an expression for the entropy rate of general HMCs, by introducing mixed states over stationary, ergodic, finite-state chains. (Although he does not refer to them as such.) His main result, retaining his notation, is transcribed here and adapted by us to constructively solve the HMC entropy-rate problem.
Theorem 3. ([26, Thm. 1].) Let {x_n, −∞ < n < ∞} be a stationary ergodic Markov process with states i = 1, ..., I and transition matrix M = ‖m(i, j)‖. Let Φ be a function defined on 1, ..., I with values a = 1, ..., A and let y_n = Φ(x_n). The entropy of the {y_n} process is given by:

  H = − ∫ Σ_a r_a(w) log r_a(w) dQ(w) ,    (14)

where Q is a probability distribution on the Borel sets of the set W of vectors w = (w_1, ..., w_I) with w_i ≥ 0, Σ_i w_i = 1, and r_a(w) = Σ_{i=1}^{I} Σ_{j : Φ(j)=a} w_i m(i, j). The distribution Q is concentrated on the sets W_1, ..., W_A, where W_a consists of all w ∈ W with w_i = 0 for Φ(i) ≠ a, and satisfies:

  Q(E) = Σ_a ∫_{f_a^{−1} E} r_a(w) dQ(w) ,    (15)

where f_a maps W into W_a, with the j-th coordinate of f_a(w) given by Σ_i w_i m(i, j) / r_a(w) for Φ(j) = a.

We can identify the w vectors in Theorem 3 as exactly the mixed states of Section IV. Furthermore, it is clear by inspection that r_a(w) and f_a(w) are the probability and mapping functions of Eqs. (8) and (9), respectively, with a playing the role of our observed symbol x. Therefore, Blackwell's expression Eq. (14) for the HMC entropy rate, in effect, replaces the average over a finite set S of unifilar states in Shannon's entropy-rate formula Eq. (3) with (i) the mixed states R and (ii) an integral over the Blackwell measure µ. In our notation, we write Blackwell's entropy formula as:

  h^B_µ = − ∫_R dµ(η) Σ_{x∈A} p^(x)(η) log p^(x)(η) .    (16)

Thus, as with Shannon's original expression, this too uses unifilar states—now, though, states from the mixed-state presentation U. This, in turn, maintains the finite-to-one internal (mixed-) state sequence to observed-sequence mapping. Therefore, one can identify the mixed-state entropy rate itself as the process' entropy rate.

B. Calculating the Blackwell HMC Entropy
Appealing to Ref. [38], we have that contractivity of our substochastic transition-matrix mappings guarantees ergodicity over the words generated by the mixed-state presentation. And so, we can replace Eq. (16)'s integral over R with a time average over a mixed-state trajectory η_0, η_1, ... determined by a long allowed word, using Eqs. (8) and (9). This gives a new limit expression for the HMC entropy rate:

  ĥ^B_µ = − lim_{ℓ→∞} (1/ℓ) Σ_{t=0}^{ℓ−1} Σ_{x∈A} Pr(x | η_t) log Pr(x | η_t) ,    (17)

where η_t = η(w_{0:t}) and w_{0:t} is the first t symbols of an arbitrarily long sequence w_∞ generated by the process.

Note that w_{0:ℓ} will be a typical trajectory, if ℓ is sufficiently long. To remove convergence-slowing contributions from transient mixed states, one can ignore some number of the initial mixed states. The exact number of transient states that should be ignored is unknown in general. That said, it depends on the initial mixed state η_0, which is generally taken to be ⟨δ_π|, and the diameter of the attractor.

This completes our development of the HMC entropy rate. Appendix C applies the theory and associated algorithm to a number of examples, with both countable and uncountable mixed states, and reveals a number of surprising properties. We now turn to practical issues of the resources needed for accurate estimation.
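Eq. (17) yields a one-pass estimation algorithm: evolve the mixed state by Eqs. (8) and (9) along a sampled realization and average the branching entropies, discarding an initial transient. A sketch, applied to the SNS of Eq. (11) with p = q = 1/2; the burn-in and sample lengths are illustrative choices:

import numpy as np

def blackwell_entropy_rate(T, n=500_000, burn_in=1_000, seed=0):
    """Estimate h_mu via Eq. (17): time-average the branching entropy
    H[X | eta_t] along one mixed-state trajectory, Eqs. (8)-(9)."""
    rng = np.random.default_rng(seed)
    symbols = list(T)
    total = sum(T.values())
    evals, evecs = np.linalg.eig(total.T)
    eta = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    eta /= eta.sum()                                   # eta_0 = pi
    ones = np.ones(len(eta))

    h_sum, count = 0.0, 0
    for t in range(n):
        p = np.array([eta @ T[x] @ ones for x in symbols])  # Eq. (8)
        p /= p.sum()                                   # guard float drift
        if t >= burn_in:
            nz = p[p > 0]
            h_sum -= (nz * np.log2(nz)).sum()
            count += 1
        i = rng.choice(len(symbols), p=p)
        eta = (eta @ T[symbols[i]]) / p[i]             # Eq. (9)
    return h_sum / count

# Simple Nonunifilar Source, Eq. (11), with p = q = 1/2:
p = q = 0.5
T_sns = {"one": np.array([[1 - p, p], [0.0, 1 - q]]),
         "box": np.array([[0.0, 0.0], [q, 0.0]])}
print(blackwell_entropy_rate(T_sns))  # approx 0.678 bits per symbol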
C. Data Requirements

Although we developed our HMC entropy-rate expression in terms of IFSs, determining a process' entropy rate can be recast as Markov chain Monte Carlo (MCMC) estimation. In MCMC, the mean of a function f(x) of interest over a desired probability distribution π(x) is estimated by designing a Markov chain with stationary distribution π. For HMCs the desired distribution is the Blackwell measure µ, which is the stationary distribution over the MSP states R. Then, the Markov chain is simply the transition dynamic W over R.

With this setting, we estimate the entropy rate ĥ^B_µ as the mean of the stochastic process defined by taking the entropy H[X|η] over symbols emitted from state η for a sequence of mixed states generated by W. In effect, we estimate the entropy rate as the mean of this stochastic process:

  ĥ^B_µ = ⟨ H[X|η] ⟩_µ .    (18)

Mathematically, little has changed. The advantage, though, of this alternative description is that it invokes the extensive body of results on MCMC estimation. In this, it is well known that there are two fundamental sources of error in the estimation. First, there is that due to initialization bias or undesired statistical trends introduced by the initial transient data produced by the Markov chain before it reaches the desired stationary distribution. Second, there are errors induced by autocorrelation in equilibrium. That is, the samples produced by the Markov chain are correlated. And, the consequence is that statistical error cannot be estimated as 1/√N, as done for N independent samples.

To address these two sources of error, we follow common MCMC practice, considering two "time scales" that arise during estimation. Consider the autocorrelation of the stationary stochastic process:

  C_f(t) = ⟨ f_s f_{s+t} ⟩ − µ_f² ,

where µ_f is f's mean. Also, consider the normalized autocorrelation, defined:

  ρ_f(t) = C_f(t) / C_f(0) .

If the autocorrelation decays exponentially with time, we define the exponential autocorrelation time:

  τ_{exp,f} = lim sup_{t→∞} t / ( − log |ρ_f(t)| )

and τ_exp = sup_f τ_{exp,f}. So, τ_exp upper bounds the rate of convergence from an initial nonequilibrium distribution to the equilibrium distribution.

For a given observable, we also define the integrated autocorrelation time τ_{int,f} as:

  τ_{int,f} = (1/2) Σ_{t=−∞}^{∞} ρ_f(t) .    (19)

This relates the correlated samples selected by the chain to the variance of independent samples for the particular function f of interest. The variance of f(x)'s sample mean in MCMC is higher by a factor of 2τ_{int,f}. In other words, the errors for a sample of length N are of order √(2τ_{int,f}/N). Thus, targeting 1% accuracy requires on the order of 10⁴ τ_{int,f} samples.

In practice, it is difficult to find τ_exp and τ_int for a generic Markov chain. There are two options. The first is to use numerical approximations that estimate the autocorrelation function, and therefore τ, from data. If we have the nonunifilar model in hand, it is a simple matter of sweeping through increasingly long strings of generated data until we observe convergence of the autocorrelation function.

Alternatively, taking inspiration from previous treatments of nonunifilar models, we make a finite-state approximation to the MSP by coarse-graining the simplex into boxes of length ε and employ a suitable method, such as Ulam's method, to approximate the transition operator. Using methods previously discussed in Ref. [44], this allows calculating the autocorrelation function directly. Appendix D shows that the approximation error vanishes as ε → 0.
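Both time scales can be estimated from the sampled branching entropies themselves. A sketch of the integrated autocorrelation time, Eq. (19), with a fixed summation window standing in for a principled truncation rule (the window length is our own choice):

import numpy as np

def tau_int(samples, window=1_000):
    """Integrated autocorrelation time, Eq. (19), from a scalar time series."""
    x = np.asarray(samples, dtype=float)
    x -= x.mean()
    c0 = np.mean(x * x)
    rho = [np.mean(x[:-t] * x[t:]) / c0 for t in range(1, window)]
    return 0.5 + np.sum(rho)       # (1/2) sum over all t, using rho(-t) = rho(t)

def corrected_stderr(samples):
    """Error of the sample mean, inflated by 2 tau_int relative to i.i.d."""
    n = len(samples)
    return np.std(samples) * np.sqrt(2.0 * tau_int(samples) / n)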
VII. CONCLUSION

We opened this development considering the role that determinism and randomness play in the behavior of complex physical systems. A central challenge in this has been quantifying randomness, patterns, and structure, and doing so in a mathematically-consistent but calculable manner. For well over a half century the Shannon entropy rate has stood as the standard by which to quantify randomness in a time series. Until now, however, calculating it for processes generated by nonunifilar HMCs has been difficult, at best.

We began our analysis of this problem by recalling that, in general, hidden Markov chains that are not unifilar have no closed-form expression for the Shannon entropy rate of the processes they generate. Despite this, these HMCs can be unifilarized by calculating the mixed states. The resulting mixed-state presentations are themselves HMCs that generate the process. However, adopting a unifilar presentation comes at a heavy cost: Generically, they are infinite state and so Shannon's expression cannot be used. Nonetheless, we showed how to work constructively with these mixed-state presentations. In particular, we showed that they fall into a common class of dynamical system: the mixed-state presentation is an iterated function system. Due to this, a number of results from dynamical systems theory can be applied.

Specifically, analyzing the IFS dynamics associated with a finite-state nonunifilar HMC allows one to extract useful properties of the original process. For instance, we can easily find the entropy rate of the generated process from long orbits of the IFS. That is, one may select any arbitrary starting point in the mixed-state simplex and calculate the entropy over the IFS's place-dependent probability distribution. We evolve the mixed state according to the IFS and sequentially sample the entropy of the place-dependent probability distribution at each step. Using an arbitrarily long word and taking the mean of these entropies, the method converges on the process' entropy rate.

Although others consider the IFS-HMC connection [45, 46], our development expanded previous work to include the much broader, more general class of nonunifilar HMCs. In addition, we demonstrated not only the mixed-state presentation's role in calculating the entropy rate, but also its connection to existing approaches to randomness and structure in complex systems. In particular, while our results focused on quantifying and calculating a process' randomness, we left open questions of pattern and structure. However, the path to achieving the results introduced here strongly suggests that the mixed-state presentation offers insight into answering these questions. For instance, Fig. 3 demonstrated how the highly structured nature of the Simple Nonunifilar Source is made topologically explicit through calculating its mixed-state presentation—which is also its ε-machine.

Though space will not let us develop it further here, this connection is not spurious. Indeed, many information-theoretic properties of the underlying process may be directly extracted from its mixed-state presentation. This follows from our showing how the attractor of the IFS defined by an HMC is exactly the set of mixed states R of that HMC. These sets are often fractal in nature and quite visually striking. See Fig. S6 for several examples.

The sequel [30] to this development establishes that the fractal dimension of the mixed-state attractor is exactly the divergence rate of the statistical complexity [24]—a measure of a process' structural complexity that tracks memory. Furthermore, the sequel introduces a method to calculate the fractal dimension of the mixed-state attractor from the Lyapunov spectrum of the mixed-state IFS. In this way, it demonstrates that coarse-graining the simplex—the previous approach to studying the structure of infinite-state processes—may be avoided altogether.

To close, we note that these structural tools and the entropy-rate method introduced here have already been put to practical use in two previous works. One diagnosed the origin of randomness and structural complexity in quantum measurement [31]. The other exactly determined the thermodynamic functioning of Maxwellian information engines [32], when there had been no previous method for this. At this point, however, we must leave the full explication of these techniques and further analysis of how mixed states reveal the underlying structure of processes generated by hidden Markov chains to the sequel [30].

ACKNOWLEDGMENTS
The authors thank Sam Loomis, Greg Wimsatt, Ryan James, David Gier, and Ariadna Venegas-Li for helpful discussions, the Telluride Science Research Center for hospitality during visits, and the participants of the Information Engines Workshops there. JPC acknowledges the kind hospitality of the Santa Fe Institute, the Institute for Advanced Study at the University of Amsterdam, and the California Institute of Technology during visits. This material is based upon work supported by, or in part by, FQXi Grant number FQXi-RFP-IPW-1902, and the U.S. Army Research Laboratory and the U.S. Army Research Office under contract W911NF-13-1-0390 and grant W911NF-18-1-0028.
[1] D. Goroff, editor. H. Poincaré, New Methods of Celestial Mechanics, 1: Periodic and Asymptotic Solutions. American Institute of Physics, New York, 1991.
[2] D. Goroff, editor. H. Poincaré, New Methods of Celestial Mechanics, 2: Approximations by Series. American Institute of Physics, New York, 1993.
[3] D. Goroff, editor. H. Poincaré, New Methods of Celestial Mechanics, 3: Integral Invariants and Asymptotic Properties of Certain Solutions. American Institute of Physics, New York, 1993.
[4] J. P. Crutchfield, N. H. Packard, J. D. Farmer, and R. S. Shaw. Chaos. Sci. Am., 255:46–57, 1986.
[5] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. Ser. 2, 42:230, 1936.
[6] C. E. Shannon. A universal Turing machine with two internal states. In C. E. Shannon and J. McCarthy, editors, Automata Studies, number 34 in Annals of Mathematical Studies, pages 157–165. Princeton University Press, Princeton, New Jersey, 1956.
[7] M. Minsky. Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
[8] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27:379–423, 623–656, 1948.
[9] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea Publishing Company, New York, second edition, 1956.
[10] A. N. Kolmogorov. Three approaches to the concept of the amount of information. Prob. Info. Trans., 1:1, 1965.
[11] A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russ. Math. Surveys, 38:29–40, 1983.
[12] A. N. Kolmogorov. Entropy per unit time as a metric invariant of automorphisms. Dokl. Akad. Nauk. SSSR, 124:754, 1959. (Russian) Math. Rev. vol. 21, no. 2035b.
[13] Ja. G. Sinai. On the notion of entropy of a dynamical system. Dokl. Akad. Nauk. SSSR, 124:768, 1959.
[14] J. P. Crutchfield. Between order and chaos. Nature Physics, 8(January):17–24, 2012.
[15] B. Marcus, K. Petersen, and T. Weissman, editors. Entropy of Hidden Markov Processes and Connections to Dynamical Systems, volume 385 of Lecture Notes Series. London Mathematical Society, 2011.
[16] Y. Ephraim and N. Merhav. Hidden Markov processes. IEEE Trans. Info. Th., 48(6):1518–1569, 2002.
[17] J. Bechhoefer. Hidden Markov models for stochastic thermodynamics. New J. Phys., 17:075003, 2015.
[18] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, January:4–16, 1986.
[19] E. Birney. Hidden Markov models in biological sequence analysis. IBM J. Res. Dev., 45(3.4):449–454, 2001.
[20] S. Eddy. What is a hidden Markov model? Nature Biotech., 22:1315–1316, Oct 2004.
[21] C. Bretó, D. He, E. L. Ionides, and A. A. King. Time series analysis via mechanistic models. Ann. App. Statistics, 3(1):319–348, Mar 2009.
[22] T. Rydén, T. Teräsvirta, and S. Åsbrink. Stylized facts of daily return series and the hidden Markov model. J. App. Econometrics, 13:217–244, 1998.
[23] R. B. Ash. Information Theory. John Wiley and Sons, New York, 1965.
[24] J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, 1989.
[25] J. P. Crutchfield. The calculi of emergence: Computation, dynamics, and induction. Physica D, 75:11–54, 1994.
[26] D. Blackwell. The entropy of functions of finite-state Markov chains. In Transactions of the First Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, volume 28, pages 13–20, Prague, Czechoslovakia, 1957. Publishing House of the Czechoslovak Academy of Sciences.
[27] F. Creutzig, A. Globerson, and N. Tishby. Past-future information bottleneck in dynamical systems. Phys. Rev. E, 79(4):041925, 2009.
[28] S. Still, J. P. Crutchfield, and C. J. Ellison. Optimal causal inference: Estimating stored information and approximating causal architecture. CHAOS, 20(3):037111, 2010.
[29] S. Marzen and J. P. Crutchfield. Predictive rate-distortion for infinite-order Markov processes. J. Stat. Phys., 163(6):1312–1338, 2014.
[30] A. Jurgens and J. P. Crutchfield. Infinite complexity of finite state hidden Markov processes. In preparation, 2020.
[31] A. Venegas-Li, A. Jurgens, and J. P. Crutchfield. Measurement-induced randomness and structure in quantum dynamics. arXiv:1908.09053, 2019.
[32] A. Jurgens and J. P. Crutchfield. Functional thermodynamics of Maxwellian ratchets: Constructing and deconstructing patterns, randomizing and derandomizing behaviors. Phys. Rev. Research, 2(3):033334, 2020.
[33] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006.
[34] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys., 104:817–879, 2001.
[35] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. CHAOS, 13(1):25–54, 2003.
[36] M. Barnsley. Fractals Everywhere. Academic Press, New York, 1988.
[37] M. F. Barnsley, S. G. Demko, J. H. Elton, and J. S. Geronimo. Invariant measures arising from iterated function systems with place-dependent probabilities. Ann. Inst. H. Poincaré, 24:367–394, 1988.
[38] J. H. Elton. An ergodic theorem for iterated maps. Ergod. Th. Dynam. Sys., 7:481–488, 1987.
[39] J. P. Crutchfield, P. Riechers, and C. J. Ellison. Exact complexity: Spectral decomposition of intrinsic computation. Phys. Lett. A, 380(9-10):998–1002, 2016.
[40] G. Birkhoff. Extensions of Jentzsch's theorem. Trans. Am. Math. Soc., 85(1):219–227, 1957.
[41] R. Cavazos-Cadena. An alternative derivation of Birkhoff's formula for the contraction coefficient of a positive matrix. Linear Algebra Appl., 375:291–297, 2003.
[42] E. Kohlberg and J. W. Pratt. The contraction mapping approach to the Perron-Frobenius theory: Why Hilbert's metric? Math. Oper. Res., 7(2), 1982.
[43] S. Marzen and J. P. Crutchfield. Information anatomy of stochastic equilibria. Entropy, 16(9):4713–4748, 2014.
[44] P. Riechers and J. P. Crutchfield. Spectral simplicity of apparent complexity, Part II: Exact complexities and complexity spectra. Chaos, 28:033116, 2018.
[45] M. Rezaeian. Hidden Markov process: A new representation, entropy rate and estimation entropy. arXiv:0606114.
[46] W. Słomczyński, J. Kwapień, and K. Życzkowski. Entropy computing via integration over fractal measures. Chaos, 10(1):180–188, Mar 2000.
[47] S. E. Marzen and J. P. Crutchfield. Nearly maximally predictive features and their dimensions. Phys. Rev. E, 95(5):051301(R), 2017.
Supplementary Materials: The Shannon Entropy Rate of Hidden Markov Processes
Alexandra M. Jurgens and James P. Crutchfield
The Supplementary Materials to follow review the notion of typical sets of realizations in a stochastic process, discuss minimality of infinite-state mixed-state presentations, determine the entropy rates of a suite of example hidden Markov chains with infinite mixed-state presentations, and give details of errors that arise when estimating autocorrelation.
Appendix A: Asymptotic Equipartition and the Typical Set Contraction
The asymptotic equipartition property (AEP) states that for a discrete-time, ergodic, stationary process X:

  −(1/n) log Pr(X_0, X_1, ..., X_n) → h_µ(X) ,    (S1)

as n → ∞ [33]. This effectively divides the set of sequences into two sets: the typical set—sequences for which the AEP holds—and the atypical set, for which it does not. As a consequence of the AEP, it must be the case that the typical set is measure one in the space of all allowed realizations and all sequences in the atypical set approach measure zero as n → ∞.

We argue that while our IFS class includes reducible maps, any composition of maps corresponding to a word in the typical set will be irreducible. This can be seen intuitively by considering the SNS, shown in Fig. 4, and adding an additional transition on a □ from σ_2 back to σ_2. This produces an HMC with two reducible symbol-labeled transition matrices, but an irreducible total transition matrix. However, as |w| → ∞, the only words such that T^(w) remains reducible are □^N and 1^N. We can see that these words cannot possibly be in the typical set, since −(1/n) log Pr(□^n) = −log Pr(□) ≠ h_µ(X). The entropy rate h_µ is by definition the branching entropy averaged over the mixed states. And so, any word that visits only a restricted subset of the mixed states—i.e., a word with a reducible transition matrix—cannot approach h_µ, regardless of length. Therefore, only words with an irreducible mapping will be in the typical set, implying that there exists an integer word length beyond which typical words induce contracting maps.

Appendix B: Minimality of U(M)

The minimality of infinite-state mixed-state presentations U(M) is an open question. As demonstrated in Appendix C 1, it is possible to construct MSPs with an uncountably infinite number of states for a process that requires only one state.

A proposed solution to this problem is a short and simple check on mergeability of mixed states, which here refers to any two distinct mixed states that have the same conditional probability distribution over future strings; i.e., any two mixed states η and ζ for which:

  Pr(X_{0:L} | η) = Pr(X_{0:L} | ζ) ,    (S1)

for all L ∈ Z⁺.

Although minimality does not impact the entropy-rate calculation, one benefit of the IFS formalization of the MSP is the ability to directly check for duplicated states and therefore determine if the MSP is nonminimal. We check this by considering, for an (N+1)-state machine M with alphabet A = {0, 1, ..., k}, the dynamic not only over mixed states, but over probability distributions over symbols. Let:

  P(η) = ( p^(0)(η), ..., p^(k−1)(η) )    (S2)

and consider Fig. S1. For each mixed state η ∈ Δ_N, Eq. (S2) gives the corresponding probability distribution P(η) ∈ Δ_k over the symbols x ∈ A. Let M emit symbol x; then the dynamic from one such probability distribution ρ_t ∈ Δ_k to the next is given by:

  g^(x)(ρ_t) = P ∘ f^(x) ∘ P^{−1}(ρ_t) = ρ_{t+1,x} .    (S3)

From this, we see that if Eq. (S2) is invertible, g^(x) : Δ_k → Δ_k is well defined and has the same functional properties as f^(x). In other words, in this case, it is not possible to have two distinct mixed states η, ζ ∈ Δ_N with the same probability distribution over symbols. And, the probability distributions can only converge under the action of g^(x) if the mixed states also converge under the action of f^(x). Shortly, we consider several cases where P is not invertible over the entire symbol simplex.
FIG. S1. Commuting diagram for probability functions P = {p^(x)}, mixed-state mapping functions f^(x), and proposed symbol-distribution mapping functions g^(x).

If every mixed state in R corresponds to a unique probability distribution over symbols, we conjecture that the corresponding U(M) is the minimal unifilar representation of the underlying process P. If we then trim the transient states of U(M), leaving the recurrent set R_R, the result is the ε-machine.
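The check is straightforward to automate at the one-step level. The sketch below computes P(η) of Eq. (S2) for a collection of mixed states and flags pairs with identical next-symbol distributions. Note the hedge in the comments: matching one-step distributions is necessary but not by itself sufficient for Eq. (S1), which quantifies over all future lengths L.

import numpy as np

def symbol_distribution(eta, T, symbols):
    """P(eta) of Eq. (S2): the distribution over next symbols from eta."""
    ones = np.ones(len(eta))
    return np.array([eta @ T[x] @ ones for x in symbols])

def mergeable_pairs(mixed_states, T, symbols, tol=1e-9):
    """Flag distinct mixed states with identical next-symbol distributions.
    This one-step test is necessary, not sufficient, for Eq. (S1); a full
    check would compare distributions over longer futures as well."""
    dists = [symbol_distribution(eta, T, symbols) for eta in mixed_states]
    pairs = []
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            if np.allclose(dists[i], dists[j], atol=tol):
                pairs.append((i, j))
    return pairs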
Appendix C: Examples

The following illustrates how to apply the theory and algorithms from the main text to accurately and efficiently calculate the entropy rate of processes generated by HMCs with countable and uncountable mixed states. It highlights a number of curious and nontrivial properties of these processes and their MSPs.
1. Cantor Set MSP
We first analyze a process with an MSP whose uncountable mixed states lie in a Cantor set. Surprisingly, this MSP is far from minimal, as the process is, in fact, generated by a biased coin—that is, a single-state ε-machine.

a. The Cantor Set
The Cantor set is perhaps the most well-known example of a nontrivial self-similar (fractal) set. The familiar middle-thirds version is constructed by starting with the unit interval C_0 = [0, 1] and removing the middle third, giving the set C_1 = [0, 1/3] ∪ [2/3, 1]. Repeating this action on C_1 produces C_2, and so on. The Cantor set C consists of points that remain after infinitely repeating this action: C = ∩_{n=1}^{∞} C_n.

FIG. S2. Nonunifilar HMC M_C that generates Cantor sets of mixed states on the simplex, for values of 0 < a < 1 and s > 2. For s = 3, we find the middle-thirds Cantor set.

The Cantor set is uncountably infinite and has Hausdorff dimension:

  dim_H C = log 2 / log 3 .

For the parametrized family of Cantor sets generated by repeating C_1 = [0, 1/s] ∪ [(s−1)/s, 1] (i.e., removing the middle (s−2)/s), the Hausdorff dimension is:

  dim_H C = log 2 / log s .
Simply stated, the dimension is the logarithm of the number of copies of the original unit interval made at each iteration, divided by the logarithm of the length ratio between the original object and its copy.

b. The Cantor Machine
The Cantor set, due to its familiarity, makes for a useful first object of study for uncountable-mixed-state HMCs. Figure S2 shows an HMC M_C that generates a Cantor set of mixed states. There 0 < a < 1 and s > 2, and the symbol-labeled transition matrices are:

  T^(□) = ( a/s    a(s−1)/s )          T^(1) = ( 1−a            0       )
          ( 0      a        )   and            ( (1−a)(s−1)/s  (1−a)/s ) .

This allows us to immediately write down the probability functions and mapping functions, recalling that in the two-state case the vectors on the simplex take the form ⟨η| = (⟨η|δ_1⟩, 1 − ⟨η|δ_1⟩):

  p^(□)(η) = ⟨η| T^(□) |1⟩ = a ,
  p^(1)(η) = ⟨η| T^(1) |1⟩ = 1 − a

and:

  f^(□)(η) = ⟨η| T^(□) / ⟨η| T^(□) |1⟩ = ( ⟨η|δ_1⟩/s , 1 − ⟨η|δ_1⟩/s ) ,
  f^(1)(η) = ⟨η| T^(1) / ⟨η| T^(1) |1⟩ = ( (s − 1 + ⟨η|δ_1⟩)/s , (1 − ⟨η|δ_1⟩)/s ) .

It is easily seen, by considering ⟨η|δ_1⟩ = 0 and ⟨η|δ_1⟩ = 1, that these maps, in fact, map the simplex to the first and second intervals of C_1, respectively.

The Cantor Machine MSP U(M_C) is shown in Fig. S3 (Top). It has an uncountably-infinite number of recurrent states, which correspond exactly to the elements of the Cantor set.

FIG. S3. Two valid alternative presentations of the Cantor set machine: (Top) The MSP U(M_C) of the Cantor set machine M_C in Fig. S2. The set of mixed states η_w is uncountably infinite. (Bottom) A unifilar hidden Markov model, commonly called the Biased Coin, that generates the same process as the nonunifilar Cantor machine M_C in Fig. S2.

Since the probability functions do not depend on ⟨η|, we do not need to invoke the Ergodic Theorem, but instead can calculate the entropy exactly:

  h_µ(U(M_C)) = − ∫ Σ_x p^(x)(η) log p^(x)(η) dµ_{U(M_C)}(η)
              = − a log(a) − (1 − a) log(1 − a)
              = H(a) .

c. A Biased Coin

However, there is an important caveat here, noted in Appendix B. The MSP may contain states that are probabilistically equivalent. The probability mapping functions are noninvertible and, in fact, every single mixed state corresponds to the same conditional probability distribution over symbols. This means that the uncountably-infinite MSP is not a minimal presentation. There is a markedly simpler unifilar model for the Cantor set machine M_C. In fact, all mixed states in U(M_C) collapse into a single state, giving the minimal unifilar model of the Cantor set machine as the Biased Coin HMC shown in Fig. S3 (Bottom). This HMC generates the same process as the Cantor machine, but requires only a single state.

FIG. S4. Three-state, nonunifilar machine M_S.
2. Countable MSP with 3-State HMC

Now, we explore a different, but related case that introduces a condition for a countable MSP and again highlights the role of minimality.

a. 3-State HMC with a Countable MSP

Consider the 3-state HMC M_S of Fig. S4. The transition matrices for this machine are:

  T^(□) = [ 0 0 0 ; 0 0 1 ; 0 0 0 ]   and   T^(■) = [ 1/2 1/2 0 ; 0 0 0 ; 1/3 1/3 1/3 ] .

These give the mapping and probability functions:

  p^(□)(η) = η₁ ,
  p^(■)(η) = 1 − η₁ ,

and:

  f^(□)(η) = (0, 0, 1) ,
  f^(■)(η) = ( (η₀/2 + η₂/3)/(1 − η₁) , (η₀/2 + η₂/3)/(1 − η₁) , (η₂/3)/(1 − η₁) ) .

Consider the probability functions first. P is not invertible over all of ∆, but is partially invertible over a restricted domain. Given a line in the simplex where η₀ and η₂ are functions of η₁, we can invert P(η). The question becomes: What is the appropriate restricted domain?

Note that for both f^(□) and f^(■), η₀ = η₁. In the simplex this corresponds to all the mixed states lying along a line in ∆: the line (η₀, η₀, 1 − 2η₀). This, then, is the restricted domain over which the states of U(M_S) correspond to unique probability distributions. The fact that this space is a line implies that the generative machine can be written with only two states.

The constancy of the mapping function for □ contributes further structure, ensuring that the set of mixed states will be countably infinite. We can write the mixed states down in series, in terms of how many ■s we have seen since the last □:

  η(n) = ( 2(3^n − 2^n)/(4·3^n − 3·2^n) , 2(3^n − 2^n)/(4·3^n − 3·2^n) , 2^n/(4·3^n − 3·2^n) ) ,

where n = 0 is taken to be the mixed state η(0) = (0, 0, 1), and

  Pr(□ | η(n)) = 2(3^n − 2^n)/(4·3^n − 3·2^n) ,
  Pr(■ | η(n)) = (2·3^n − 2^n)/(4·3^n − 3·2^n) .

The MSP U(M_S) is shown in Fig. S5. If the initial condition is η(0) = (0, 0, 1), it is the process's ε-machine. Since R is countable, we can find µ by hand, by solving the set of equations π_{n+1} = π_n Pr(■ | η(n)), with Σ_n π_n = 1. This gives π_n = (2/7)(2^{2−n} − 3^{1−n}) and we find:

  h_µ(U(M_S)) = Σ_{n=0}^∞ π_n H( Pr(□ | η(n)) ) ≈ 0.686 bits.

FIG. S5. Unifilar HMC that generates the same process generated by the nonunifilar machine M_S in Fig. S4.

b. Actually, A 2-State Machine

As mentioned above, the restricted domain over which P is invertible implies a smaller state set for the process generated by the nonunifilar machine M_S. For all relevant mixed states, Pr(σ₁) = Pr(σ₂), suggesting that we devise an HMC combining the two states. However, the mapping function for □ must still project definitively to a single state, to retain the countable infinity of mixed states. In fact, these restrictions ensure that the minimal nonunifilar HMC for the process is the HMC for the Simple Nonunifilar Source, discussed in Section V.

If we declare that η̂(0) = (0, 1), we can generate the sequence of mixed states η̂(n), again indexed by the number of ■s seen since the last □, by using the mapping functions in Section V. The next two states are:

  η̂(1) = ( 1 − q , q )   and
  η̂(2) = ( (1 − p + pq − q²)/(1 − p + pq) , q²/(1 − p + pq) ) .

For the underlying process to remain the same, the condition that must be met is P(η(n)) = P̂(η̂(n)). This determines p and q. For n = 0 this is trivially met. For n = 1 we have:

  P̂(η̂(1)) = ( p(1 − q) , 1 − (1 − q)p ) = ( 1/3 , 2/3 ) ,

so that p(1 − q) = 1/3. Substituting this into the η̂(2) condition we get:

  η̂(2) = ( 1 − (3/2)q² , (3/2)q² ) .

Substituting this into the probability distribution constraint for n = 2 gives 6q² − 5q + 1 = 0, so that q = 1/2 or q = 1/3 (and, correspondingly, p = 2/3 or p = 1/2): two different 2-state nonunifilar HMCs that generate the same process as the 3-state HMC. This further emphasizes the lack of uniqueness of generative models. That said, by examining the underlying IFS, their HMCs can be recovered.
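Since π_n decays like 2^(−n), the entropy-rate series above converges geometrically and is trivial to evaluate numerically. The following sketch is our own illustration, using the closed forms for Pr(□ | η(n)) and π_n just derived; it reproduces the quoted value.

```python
import numpy as np

def h_binary(q):
    """Binary entropy in bits, with the 0 log 0 = 0 convention."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

h_mu = 0.0
for n in range(200):                          # terms decay like 2**-n
    D = 4 * 3**n - 3 * 2**n
    p_box = 2 * (3**n - 2**n) / D             # Pr(square | eta(n))
    pi_n = (2 / 7) * (2.0**(2 - n) - 3.0**(1 - n))
    h_mu += pi_n * h_binary(p_box)

print(f"h_mu(U(M_S)) ~= {h_mu:.4f} bits")     # ~= 0.686
```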
3. Parametrized HMCs and Their MSPs
Finally, consider an HMC with 3 symbols and 3 states:

  T^(□) = [ αy βx βx ; αx βy βx ; αx βx βy ] ,
  T^(■) = [ βy αx βx ; βx αy βx ; βx αx βy ] ,
  T^(∘) = [ βy βx αx ; βx βy αx ; βx βx αy ] ,   (S1)

with β = (1 − α)/2 and y = 1 − 2x. From inspection, we see that α can take on any value from 0 to 1 and x may range from 0 to 1/2.

FIG. S6. Panels (a)–(d): 100,000 mixed states of the HMC defined by Eq. (S1), at four settings of (α, x). The parametrized 3-state HMC generates MSPs in a variety of structures, depending on x and α. However, due to the rotational symmetry in the transition matrices, the attractor is radially symmetric around the simplex center.

Fixing α and letting x range over [0, 1/2] gives us an MSP that first fills nearly the entire simplex, with probability mass concentrated at the corners, then shrinks to a finite machine with 3 states at x = 1/3, and finally grows once again into a fractal measure, as Fig. S6 illustrates. To demonstrate the ease and efficiency of calculating their entropy rates, Fig. S7 plots h_µ as a function of (x, α) ∈ [0, 1/2] × [0, 1].
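Each point of that surface can be reproduced by iterating the mixed-state IFS of Eq. (S1) and time-averaging the branching entropy, invoking the Ergodic Theorem as in the main text. The sketch below is our own minimal implementation of that procedure; the orbit length, burn-in, and sample point (α, x) = (0.7, 0.2) are illustrative choices.

```python
import numpy as np

def transition_matrices(alpha, x):
    """Symbol-labeled matrices of Eq. (S1), with beta = (1-alpha)/2, y = 1-2x."""
    b, y = (1 - alpha) / 2, 1 - 2 * x
    T1 = np.array([[alpha*y, b*x, b*x], [alpha*x, b*y, b*x], [alpha*x, b*x, b*y]])
    T2 = np.array([[b*y, alpha*x, b*x], [b*x, alpha*y, b*x], [b*x, alpha*x, b*y]])
    T3 = np.array([[b*y, b*x, alpha*x], [b*x, b*y, alpha*x], [b*x, b*x, alpha*y]])
    return [T1, T2, T3]

def entropy_rate(alpha, x, steps=50_000, burn=1_000, seed=0):
    Ts = transition_matrices(alpha, x)
    rng = np.random.default_rng(seed)
    eta = np.full(3, 1 / 3)                     # start at the simplex center
    h = 0.0
    for t in range(burn + steps):
        probs = np.array([eta @ T @ np.ones(3) for T in Ts])
        probs /= probs.sum()                    # guard against round-off
        if t >= burn:                           # ergodic average of branching entropy
            h -= np.sum(probs * np.log2(probs))
        k = rng.choice(3, p=probs)              # sample a symbol...
        eta = eta @ Ts[k]                       # ...and apply its mixed-state map
        eta /= eta.sum()
    return h / steps

print(entropy_rate(alpha=0.7, x=0.2))
```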
FIG. S7. Entropy rates of the parametrized HMC defined in Eq. (S1) over the (x, α) parameter plane.

Appendix D: Estimation Errors for Finite-State Autocorrelation
Coarse-graining the mixed-state simplex into a set C of boxes of width ε, we may construct a finite-state approximation of the infinite-state MSP. It has been shown that, given such an approximation, for any given box c the difference in entropy over the symbol distribution between the coarse-grained approximation and a mixed state within that box is bounded by:

  |H[X | C = c] − H[X | η ∈ c]| ≤ H_b( √|G| ε ) ,   (S1)

where H_b(·) is the binary entropy function [47]. Our task here is to consider the error in the autocorrelation of the sequence of mixed states since, if we can show that this is bounded, the error in the autocorrelation of the branching entropy must also be bounded.

At time zero, the autocorrelation is A(L = 0) = ⟨η η⟩, so for the finite-state approximation we have:

  A_C(L = 0) = Σ_i π_C(i) c_i c_i ,

where π_C is the stationary distribution over the coarse-grained mixed states, π_C(i) is the stationary probability of cell i, and c_i is the center of cell i. For the true process, we have:

  A(L = 0) = ∫_R dµ(η) ηη = Σ_i π_C(i) ∫_{η ∈ C_i} dµ(η | i) ηη ,

where dµ(η | i) is the distribution over mixed states within cell i. The maximum distance between any two mixed states in a cell i is bounded by:

  ‖η − ζ‖ ≤ √|G| ε ,

the length of the longest diagonal in a hypercube of dimension |G|, by construction. Since the gradient of the L₂ norm is simply ∇‖x‖ = x/‖x‖, we have a bound on the difference in the autocorrelation at time zero:

  |A_C(L = 0) − A(L = 0)| ≤ N √|G| ε .

With increasing length we have:

  A_C(L) = Σ_i π_C(i) c_i Σ_{w ∈ A^L} f^(w)(c_i) p^(w)(c_i)

and:

  A(L) = ∫_R dµ(η) η Σ_{w ∈ A^L} f^(w)(η) p^(w)(η)
       = Σ_i π_C(i) ∫_{η ∈ C_i} dµ(η | i) η Σ_{w ∈ A^L} f^(w)(η) p^(w)(η) .

Let η = c_i + δ for some mixed state in cell i. Then we can write:

  |A_C(L) − A(L)| ≤ Σ_i π_C(i) [ c_i Σ_{w ∈ A^L} f^(w)(c_i) p^(w)(c_i) − (c_i + δ) Σ_{w ∈ A^L} f^(w)(c_i + δ) p^(w)(c_i + δ) ] .

Now, note that:

  p^(w)(c_i + δ) ≈ p^(w)(c_i) + ∇p^(w)(c_i) · δ

and:

  f^(w)(c_i + δ) ≈ f^(w)(c_i) + e^{λ_w} δ ,

where λ_w is the leading Lyapunov exponent of the mapping function f^(w). Substituting this and eliminating terms of order δ² gives us:

  |A_C(L) − A(L)| ≤ Σ_i π_C(i) [ c_i Σ_{w ∈ A^L} ( f^(w)(c_i) ∇p^(w) · δ + e^{λ_w} δ p^(w)(c_i) ) + δ Σ_{w ∈ A^L} f^(w)(c_i) p^(w)(c_i) ] .

These terms identify three sources of approximation error: (i) that due to a difference in the probability distribution over symbols, (ii) that in the mapping functions, and (iii) that from approximating the points by the centers of their cells.

For the first, we note that the total variation in the probability distribution over symbols is bounded by the distance between the mixed states at which the distributions are computed. So, for any two mixed states in the same cell:

  ‖Pr(X = x | η) − Pr(X = x | ζ)‖_TV ≤ √|G| ε .

Then, the first term is the error due to the difference in the expectation value of the next state, given that we have calculated the probability distribution at c_i + δ rather than at c_i. Using Hölder's inequality, for two distributions P(X) and Q(X), we may say:

  E[f]_Q − E[f]_P = Σ_x f(x)(P(x) − Q(x))
                  ≤ Σ_x f(x)|P(x) − Q(x)|
                  ≤ ‖f‖_p ‖P − Q‖_q ,

where 1/p + 1/q = 1. Setting q = 1:

  E[f]_Q − E[f]_P ≤ ‖f‖_∞ ‖P − Q‖_TV .

So, after taking the product with the cell centers c_i, we have that the first error is bounded by N√|G| ε at all lengths.

For the second, we note that since the maps are contractions, λ_x < 0, and the distance between f^(x)(η) and f^(x)(ζ), where η and ζ are in the same cell i, is bounded by √|G| ε. As the length of a word w grows, λ_w → −∞ and the distance f^(w)(η) − f^(w)(ζ) → 0. At large L this term vanishes, at a rate equal to the average maximal Lyapunov exponent of the IFS. The final error is that in the autocorrelation from the cell approximation which is, likewise, bounded by the cell size; this is the same error as for A(0), viz. N√|G| ε.

And so, in combination with the bound on the entropy, we may say, loosely speaking, that the error in the autocorrelation vanishes as ε → 0. Therefore, to find τ and estimate the error in Eq. (18) as a function of sample size, we take finer coarse-grained approximations until convergence in the autocorrelation curve is observed, and then calculate τ.
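Concretely, the convergence check just described can be scripted in a few lines. The sketch below is our own illustration: the box widths, lag range, and tolerances are arbitrary choices, and the orbit file name is hypothetical; any orbit of mixed states generated as in the earlier examples will do.

```python
import numpy as np

def autocorrelation(orbit, L):
    """A(L) ~ <eta_t . eta_{t+L}>, averaged along the orbit."""
    head = orbit[:-L] if L > 0 else orbit
    return float(np.mean(np.sum(head * orbit[L:], axis=1)))

def coarse_grain(orbit, eps):
    """Snap each mixed state to the center of its simplex box of width eps."""
    return (np.floor(orbit / eps) + 0.5) * eps

orbit = np.load("mixed_state_orbit.npy")   # hypothetical (N, |G|) array of mixed states

lags = range(50)
prev = None
for eps in (0.1, 0.05, 0.025, 0.0125):     # successively finer coarse-grainings
    curve = np.array([autocorrelation(coarse_grain(orbit, eps), L) for L in lags])
    if prev is not None and np.max(np.abs(curve - prev)) < 1e-3:
        break                              # autocorrelation curve has converged
    prev = curve

# tau: first lag at which A(L) is within tolerance of its asymptotic value
tau = next(L for L in lags if abs(curve[L] - curve[-1]) < 1e-2)
print("tau =", tau)
```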