Divergent Predictive States: The Statistical Complexity Dimension of Stationary, Ergodic Hidden Markov Processes
arXiv:2102.XXXXX
Alexandra M. Jurgens∗ and James P. Crutchfield†
Complexity Sciences Center, Physics and Astronomy Department,
University of California at Davis, One Shields Avenue, Davis, CA 95616
(Dated: February 23, 2021)

Inferring models from samples of stochastic processes is challenging, even in the most basic setting in which processes are stationary and ergodic. A principal reason for this, discovered by Blackwell in 1957, is that finite-state generators produce realizations with arbitrarily-long dependencies that require an uncountable infinity of probabilistic features be tracked for optimal prediction. This, in turn, means predictive models are generically infinite-state. Specifically, hidden Markov chains, even if finite, generate stochastic processes that are irreducibly complicated. The consequences are dramatic. For one, no finite expression for their Shannon entropy rate exists. Said simply, one cannot make general statements about how random they are, and finite models incur an irreducible excess degree of unpredictability. This was the state of affairs until a recently-introduced method showed how to accurately calculate their entropy rate and, constructively, to determine the minimal set of infinite predictive features. Leveraging this, here we address the complementary challenge of determining how structured hidden Markov processes are by calculating the rate of statistical complexity divergence—the information dimension of the minimal set of predictive features.

∗ [email protected]
† [email protected]
Keywords: Hidden Markov process, hidden Markov model, iterated function system, optimal prediction, predictive feature, mixed-state presentation, Blackwell measure, spectrum of Lyapunov characteristic exponents, fractal dimension, information dimension
I. INTRODUCTION
A paradox lives at the heart of highly complex systems—the intricate patterns they generate arise through an interplay between determinism and stochasticity. Despite progress identifying and measuring their degrees of randomness and unpredictability, basic questions remain. Specifically, how do we quantify correlation and "structure"? Can we detect a system's emergent patterns and quantify their organization?

Clearly posing these questions and developing the tools to answer them required, over the recent decades, integrating Turing's computation theory [1–3], Shannon's information theory [4], and Kolmogorov's dynamical systems theory [5–9]. Together they highlighted the central role that information—its generation, transmission, and storage—plays in investigating complex systems. Drawing from the convergence, computational mechanics [10] introduced a definition of the structural organization and memory of stochastic processes—the statistical complexity, which measures the number and distribution of optimally-predictive features.

Answers to the randomness-structure problem have been carefully outlined and successfully implemented for processes that can be optimally predicted with countably many predictive features [11–13]. However, many complex systems arising in engineering, physical, and biological settings [14–18] require an infinite number of predictive features. Somewhat soberingly, these truly complex systems are implicated in a range of natural phenomena, from the geophysics of earthquakes [19] and physiological measurements of neural avalanches [20] to semantics in natural language [21] and cascading failures in power transmission grids [22].

In point of fact, as first established by Blackwell in the 1950s [14], calculating the entropy rate of processes generated by discrete-time, N-state hidden Markov chains (HMCs) requires tracking an uncountably-infinite set of distributions over an HMC's states. The following establishes that optimally predicting these processes must use these sets of distributions as predictive features. These sets live on the (N − 1)-simplex and are an HMC's mixed states. Recently, a companion work [32] showed how to accurately calculate the entropy rate of processes generated by HMCs. This gave a new and efficient tool for consistently describing the randomness of truly complex systems. However, those results did not address the complementary side of the complex-system paradox—the structural aspect of the interplay between structure and randomness. Troublingly, for processes with uncountably infinite sets of predictive features, the statistical complexity diverges, substantially circumscribing its usefulness as a metric of system organization. As we will show, quantifying the structure of these truly complex systems requires a new approach and new tools and methods.

Historically, the need for such a measure of divergent information storage and Blackwell's discovery were perhaps anticipated by Shannon's definition in the 1940s of dimension rate [4]:

λ = lim_{δ→0} lim_{ε→0} lim_{T→∞} log N(ε, δ, T) / (T log(1/ε)),

where N(ε, δ, T) is the smallest number of elements that may be chosen such that all elements of a trajectory ensemble generated over time T, apart from a set of measure δ, are within the distance ε of at least one chosen trajectory. This is the minimal "number of dimensions" required to specify a member of a trajectory (or message) ensemble. Unfortunately, Shannon devotes barely a paragraph to the concept, leaving it largely unmotivated and uninterpreted.
Nonetheless, we take inspiration from this to develop the statistical complexity dimension, the asymptotic growth rate of the statistical complexity for an uncountably infinite set of predictive features. We conjecture that the statistical complexity dimension is the same dimension rate proposed by Shannon. However, the following goes beyond Shannon's brief mention to provide constructive and accurate methods for determining this important system invariant. Technically, statistical complexity dimension is defined as the information dimension of the (self-similar set of) predictive states. Several distinct steps are involved. Determining this information dimension requires establishing ergodicity, calculating the Lyapunov spectrum of an HMC's mixed-state presentation, and applying a suitably modified version of the Lyapunov-information dimension conjecture from dynamical systems that connects the spectrum to the dimension.

To highlight the usefulness of these informational quantities, which otherwise appear rather abstracted from natural systems, it should be noted that the following and its predecessor [32] were preceded by two companions that applied the theoretical results here to two, rather different, physical domains. The first analyzed the origin of randomness and structural complexity engendered by quantum measurement [33]. The second solved a long-standing problem on exactly determining the thermodynamic functioning of Maxwellian demons, aka information engines [18]. That is, the predecessor and the present development (along with a sequel to be announced in the conclusion) lay out the mathematical and algorithmic tools required to successfully analyze structure and randomness in these applied problems. Taken together, we believe the new approach will find even wider use than in these application areas.

In the following, we introduce a practical and computable measure of structural complexity analogous to Shannon's dimension rate in the form of the statistical complexity dimension d_µ. Section II recalls the necessary background in stochastic processes, hidden Markov chains, and information theory. Section III recounts mixed states and their dynamic—the mixed-state presentation—as well as the connection to iterated function systems (IFSs) previously demonstrated. The main results follow in Secs. IV and V, where the Lyapunov-information dimension conjecture is reviewed and updated to our needs and the statistical complexity dimension d_µ is introduced. Finally, in Sec. VI, d_µ of a three-state parametrized HMC is calculated across a wide region of parameter space, demonstrating the insights afforded by and computational efficiency of our methods.

II. HIDDEN MARKOV PROCESSES
Our main objects of study are stochastic processes and the mechanisms that generate them—hidden Markov chains, mixed states, and the ε-machine. We touch on several of their important properties, including stationarity, ergodicity, randomness, and memory. Readers familiar with the previous work in this series [32] may skip to Section II D.
A. Processes

A stochastic process P is a probability measure over a bi-infinite chain … X_{t−2} X_{t−1} X_t X_{t+1} X_{t+2} … of random variables, each X_t denoted by a capital letter. A particular realization … x_{t−2} x_{t−1} x_t x_{t+1} x_{t+2} … is denoted via lowercase. We assume the values x_t belong to a discrete alphabet A. We work with blocks X_{t:t′}, where the first index is inclusive and the second exclusive: X_{t:t′} = X_t … X_{t′−1}. P's measure is defined via the collection of distributions over blocks: {Pr(X_{t:t′}) : t < t′, t, t′ ∈ Z}.

To simplify the development, we restrict to stationary, ergodic processes: those for which Pr(X_{t:t+ℓ}) = Pr(X_{0:ℓ}) for all t ∈ Z, ℓ ∈ Z⁺, and for which individual realizations obey all of those statistics. In such cases, we need only consider a process' length-ℓ word distributions Pr(X_{0:ℓ}).

A Markov process is one that exhibits memory over a single time step: Pr(X_t | X_{−∞:t}) = Pr(X_t | X_{t−1}). A hidden Markov process is the output of a memoryless channel [34] whose input is a Markov process [25].

FIG. 1. A hidden Markov chain with two states S = {σ_1, σ_2} and two symbols A = {□, △}. It is nonunifilar and parametrized with s ∈ [1, ∞). It becomes unifilar in the limit of s → ∞.

B. Presentations
Directly working with processes—nominally, infinite sets of infinite sequences and their probabilities—is cumbersome. So, we turn to consider finitely-specified mechanistic models that generate them.
Definition 1.
A finite-state edge-labeled hidden Markov chain (HMC) consists of:

1. a finite set of states S = {σ_1, …, σ_N},
2. a finite alphabet A of k symbols x ∈ A, and
3. a set of N × N symbol-labeled transition matrices T^{(x)}, x ∈ A: T^{(x)}_{ij} = Pr(σ_j, x | σ_i).

The associated state-to-state transitions are described by the row-stochastic matrix T = Σ_{x∈A} T^{(x)}. The internal-state Markov chain is given by {S, T}. The asymptotic, stationary state distribution is π = {Pr(σ), σ ∈ S} and, as a vector, is given by T's left eigenvector normalized in probability: π = πT.

A given stochastic process can be generated by any number of HMCs. These are called a process' presentations. We now introduce a structural property of HMCs that has important consequences in determining a process' randomness and structure.

Definition 2. A unifilar HMC (uHMC) is an HMC such that for each state σ_i ∈ S and each symbol x ∈ A there is at most one outgoing edge from state σ_i labeled with symbol x.

One consequence is that a uHMC's states are predictive, in the sense that the distribution of words following a state is the same as P's word distribution conditioned on the words that lead to the state. (This need not hold for the states of nonunifilar HMCs.) Although there can be many presentations for a process P, there is a canonical presentation that is unique: a process' ε-machine [10].

Definition 3. An ε-machine is a uHMC with probabilistically distinct states: For each pair of distinct states σ_i, σ_j ∈ S there exists a finite word w = x_0 … x_{ℓ−1} such that:

Pr(X_{0:ℓ} = w | S_0 = σ_i) ≠ Pr(X_{0:ℓ} = w | S_0 = σ_j).

A process' ε-machine is its optimally-predictive, minimal presentation, in the sense that the set S of predictive states is minimal compared to all its other unifilar presentations. That said, S may be finite, countably infinite, or uncountably infinite. By capturing a process' structure and not merely being predictive, an ε-machine's states are called causal states.

C. Process Intrinsic Randomness: HMC Entropy Rate
A process’ intrinsic randomness is the information inthe present measurement, discounted by having observedthe preceding infinitely-long history. It is measured byShannon’s source entropy rate [4].
Definition 4.
A process’ entropy rate h µ is the asymp-totic average Shannon entropy per symbol [11]: h µ = lim ‘ →∞ H[ X ‘ ] /‘ , (1) where H [ X ‘ ] is the Shannon entropy of block X ‘ : H [ X ‘ ] = − X x ‘ ∈A ‘ Pr( x ‘ ) log Pr( x ‘ ) . (2)Given a finite-state unifilar presentation M u of a pro-cess P , we may directly calculate the process’ entropyrate from its uHMC’s transition matrices [4]: h µ = − X σ ∈S π σ X x ∈A Pr( x | σ ) log Pr( x | σ ) . (3)In stark contrast, for processes generated by nonunifilarHMCs there is no closed-form expression for the entropyrate [14]. For these processes, the closed-form expressionEq. (3) applied to the HMC states and transition ma-trices substantially misestimates the generated process’entropy rate.Addressing this nonunifilar case was the focus ofour previous development [32]. We showed that the en-tropy rate of a general HMC may be determined usingits mixed states ; reviewed shortly in Section III. Track-ing an HMC’s mixed states allows one to find the entropyrate of the generated process and so the latter’s intrinsicrandomness. D. Process Intrinsic Structure
A process’ memory is determined using its (cid:15) -machine(minimal) presentation M . Depending on the specificneed, this may be measured either in terms of the num-ber | S | of M ’s causal states or the amount of historicalShannon entropy they store—that is, the statistical com-plexity C µ . Definition 5.
A process’ statistical complexity is theShannon entropy stored in its (cid:15) -machine’s causal states: C µ = H[Pr( S )]= − X σ ∈S π σ log π σ . (4)From the definitions above, a process’ (cid:15) -machineis its smallest uHMC presentation, in the sense thatboth | S | and C µ are minimized by a process’ (cid:15) -machine,compared to all other unifilar (predictive) presentations.Due to the (cid:15) -machine’s minimality, we can identify the (cid:15) -machine’s C µ as the process’ statistical complexity.A challenge similar to that encountered with the en-tropy rate of nonunifilar HMCs arises: there is no closed-form expression for the C µ of the processes they generate.We now turn to give a constructive answer to this chal-lenge. The preceding presentation types, though, givea useful path to understanding how a process’ differentpresentations help or hinder determining process proper-ties. The strategy in the following turns on yet anotherpresentation type. Here on in, with nothing else said, ref-erence to an HMC means the general case—a nonunfilarHMC. III. OBSERVER-PROCESSSYNCHRONIZATION
Previously, we introduced mixed-state presentations of HMCs and established their equivalence to random dynamical systems known as iterated function systems (IFSs) [32]. We now briefly review this construction. (Readers familiar with the previous results may skip to Section IV.)

Assume that an observer has a finite HMC presentation M for a process P that it is monitoring. Consider the observer-process synchronization problem, in which the observer determines at each moment P's HMC state from observed process data. (An equivalent framing is that we assume M is generating P and the observer seeks to determine at each moment in which state σ the machine M is.)

Since the process is hidden (and since M is an HMC), the observer cannot directly detect the state. The observer's best guess initially is that the states occur according to M's internal-state stationary distribution π. Using knowledge of M's structure, the observer then refines this guess by monitoring the output data x_0 x_1 x_2 … that M generates. If and when the observer knows with certainty in which state the process is, they have synchronized to the process.

A. Mixed State Presentation
1. Of A Process
For a length-ℓ word w generated by M, let η(w) = Pr(S | w) be the observer's belief distribution as to the process' current state after observing w:

η(w) ≡ Pr(S_ℓ | X_{0:ℓ} = w, S_0 ∼ π). (5)

When observing an N-state machine, the vector ⟨η(w)| lives in the (N−1)-simplex ∆_{N−1}, the set such that:

{η ∈ R^N : ⟨η|1⟩ = 1, ⟨η|δ_i⟩ ≥ 0, i = 1, …, N},

where ⟨δ_i| = (0 … 1 … 0), with the 1 in the ith position, and |1⟩ = (1 … 1)^⊤. We use this notation for components of the mixed-state vector η to avoid confusion with temporal indexing.

Synchronization then occurs when a word w is observed such that Pr(S_ℓ = σ | X_{0:ℓ} = w) = 1, for one state σ. The belief distributions η(w) that an HMC can visit define its set of mixed states:

R = {η(w) : w ∈ A⁺, Pr(w) > 0}.

Generically, the mixed-state set R for an N-state HMC is infinite, even for finite N [14]. Figure 2 shows a case where the HMC generates the Cantor set as its mixed-state set.

The probability of transitioning from ⟨η(w)| to ⟨η(wx)| on observing symbol x follows from Eq. (5) immediately; we have:

Pr(η(wx) | η(w)) = Pr(x | S_ℓ ∼ η(w)).

This defines the mixed-state transition dynamic W. Together the mixed states and their dynamic define an HMC that is unifilar by construction. This is a process' mixed-state presentation (MSP) U(P) = {R, W}.
2. Of an HMC
Above, we defined a process' U abstractly. However, given any HMC M that generates the process, we may explicitly write down the MSP U(M) = {R, W}. Assume we have an (N+1)-state HMC presentation M with k symbols x ∈ A. We let the initial mixed state be the invariant probability π over the states of M, so ⟨η_0| = ⟨δ_π|. In the context of the mixed-state dynamic, mixed-state subscripts denote time.

FIG. 2. Constructing the mixed-state presentation of the 2-state s-parametrized nonunifilar HMC shown in (A): The invariant state distribution π = (1/2, 1/2) gives the initial mixed state η_0, used in (B) to calculate the next set of mixed states—those having seen past words of length one. (C) In this example, infinitely many mixed states are generated and R is the middle-1/s Cantor set. Color indicates the relative closeness of each mixed state to the original states in (A). From η_0, one would need to see an infinite word of □s to reach σ_1. In (D) the mixed states for s = 3 are pictured on the 1-simplex—the unit interval from η = (0, 1) to η = (1, 0).

The probability of generating symbol x when in mixed state η is:

Pr(x | η) = ⟨η| T^{(x)} |1⟩, (6)

where T^{(x)} is the symbol-labeled transition matrix associated with the symbol x. Now, given a mixed state at time t, we may calculate the probability of seeing each x ∈ A. Upon seeing symbol x, the current mixed state ⟨η_t| is updated according to:

⟨η_{t+1,x}| = ⟨η_t| T^{(x)} / ⟨η_t| T^{(x)} |1⟩. (7)

Equation (7) tells us that, by construction, the MSP is unifilar, since each possible output symbol uniquely determines the next (mixed) state. Taken together, Eqs. (6) and (7) define the mixed-state transition dynamic W as:

Pr(η_{t+1,x}, x | η_t) = Pr(x | η_t) = ⟨η_t| T^{(x)} |1⟩,

for all η ∈ R, x ∈ A.
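For concreteness, here are Eqs. (6) and (7) in code. This is a sketch (Python/NumPy); the 2-state matrices are hypothetical and the helper names are ours.

```python
import numpy as np

def symbol_probability(eta, Tx):
    """Eq. (6): Pr(x | eta) = <eta| T^(x) |1>."""
    return float(eta @ Tx @ np.ones(len(eta)))

def update_mixed_state(eta, Tx):
    """Eq. (7): <eta_{t+1,x}| = <eta_t| T^(x) / <eta_t| T^(x) |1>."""
    v = eta @ Tx
    return v / v.sum()

# Hypothetical 2-state nonunifilar HMC; T0 + T1 is row stochastic.
T0 = np.array([[0.25, 0.25], [0.25, 0.00]])
T1 = np.array([[0.50, 0.00], [0.25, 0.50]])
eta = np.array([0.5, 0.5])     # initial belief; the MSP starts from pi
for Tx in (T0, T1, T1):        # condition on observing the word 011
    print(symbol_probability(eta, Tx))
    eta = update_mixed_state(eta, Tx)
print("mixed state after 011:", eta)
```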
B. Constructing the Mixed-State Presentation

FIG. 3. Simple HMCs generate MSPs with a wide variety of structures, many fractal in nature. Each subplot displays 10… mixed states of a different, highly-nonunifilar 3-state hidden Markov chain: (a) the 3-state "alpha" machine; (b) the 3-state machine from Eq. (16), with α = 0.… and x = 0.1; (c) the 3-state "beta" machine. The HMCs themselves are specified in Appendix A.

To find the MSP U = {R, W} for a given HMC M, we apply mixed-state construction:
1. Set U = {R = ∅, W = ∅}.
2. Calculate M's invariant state distribution: π = πT.
3. Take η_0 to be ⟨δ_π| and add it to R.
4. For each current mixed state η_t ∈ R, use Eq. (6) to calculate Pr(x | η_t) for each x ∈ A.
5. For η_t ∈ R, use Eq. (7) to find the updated mixed state η_{t+1,x} for each x ∈ A.
6. Add η_t's transitions to W and each η_{t+1,x} to R, merging duplicate states.
7. For each new η_{t+1}, repeat steps 4–6 until no new mixed states are produced.

This algorithm need not terminate, as shown in Fig. 2, which depicts the MSP construction for the HMC in Fig. 1. However, it can terminate for HMCs described by finite-state ε-machines. When dealing with HMCs for which the algorithm does not terminate, one must impose a limit on the number of generated mixed states, effectively setting a level of approximation for R. (A code sketch of this capped construction closes this subsection.)

One may ask: given that mixed-state construction returns a unifilar HMC of the underlying process, is the MSP the same as the ε-machine? It is not guaranteed to be so, as indeed is the case in Fig. 2. While the so-called Cantor machine generates an uncountably infinite set of mixed states, the ε-machine of the underlying process is the single-state Fair Coin machine. We can see this by noting that the symbol-branching probabilities depicted in Fig. 2(C) are identical for every generated mixed state. This may seem like a cause for concern, as we seem to have overestimated the necessary state size for the process in Fig. 2 by an infinite factor. However, this case is both rare and easy to check for, as discussed in Appendix B. By applying a simple check on the uniqueness of mixed states, we can confirm whether the MSP is the ε-machine of the underlying process. Unless otherwise noted, we take this to be true.

Although the most common purpose of applying mixed-state construction is to unifilarize an HMC, we may find the MSP of uHMCs as well. The MSPs of unifilar presentations are interesting and contain additional information beyond the initial unifilar presentation. For example, they typically contain transient causal states, and these are employed to calculate many complexity measures that track convergence statistics [35].

Here, we focus on the mixed-state presentations of nonunifilar HMCs, which typically have an infinite mixed-state set R. Even applying mixed-state construction to nominally simple, finite-state, finite-alphabet nonunifilar HMCs results in an explosion of mixed states. Figure 3 gives three examples of MSPs with fractal mixed-state sets R, each generated by a three-state nonunifilar HMC.
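A minimal sketch of the construction above, in Python/NumPy (function and variable names are ours). Duplicate mixed states are merged by numerical tolerance, and a cap on |R| stands in for the algorithm's possible non-termination.

```python
import numpy as np

def mixed_state_construction(Ts, max_states=1000, tol=1e-10):
    """Steps 1-7 above for the HMC given by symbol-labeled matrices `Ts`.
    Returns the mixed-state set R and the dynamic W as a dictionary
    {(i, x): (Pr(x | eta_i), j)}.  `max_states` caps |R| when the
    construction does not terminate."""
    T = sum(Ts)
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    pi /= pi.sum()

    R, W, queue = [pi], {}, [0]
    while queue and len(R) < max_states:
        i = queue.pop(0)
        for x, Tx in enumerate(Ts):
            p = float(R[i] @ Tx @ np.ones(len(pi)))   # Eq. (6)
            if p <= tol:
                continue                              # symbol cannot occur
            eta = (R[i] @ Tx) / p                     # Eq. (7)
            for j, zeta in enumerate(R):              # merge duplicates
                if np.allclose(eta, zeta, atol=tol):
                    break
            else:
                R.append(eta)
                j = len(R) - 1
                queue.append(j)
            W[(i, x)] = (p, j)
    return R, W
```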
C. MSP as an IFS

Specifying MSP construction in this way reveals that generating mixed states is a type of random dynamical system known as a place-dependent iterated function system (IFS) [32]. For finite k and space ∆, a place-dependent IFS is characterized by a set of mapping functions:

{ f^{(x)} : ∆ → ∆ | x ∈ {1, …, k} },

and associated probability functions:

{ p^{(x)} : ∆ → [0, 1] | x ∈ {1, …, k} }.

A place-dependent IFS generates a stochastic process over η ∈ ∆_N as follows. Given an initial position η_0 ∈ ∆_N, the probability distribution {p^{(x)}(η_0) : x = 1, …, k} is sampled. According to the sample x, apply f^{(x)} to map η_0 to the next position η_1 = f^{(x)}(η_0). Resample x from the distribution and continue, generating η_1, η_2, η_3, ….

An N-state HMC's associated mixed-state construction defines a place-dependent IFS over the (N−1)-simplex, with each T^{(x)} defining a mapping function (Eq. (7)) and associated probability function (Eq. (6)). As our previous results showed [32], an IFS defined by an ergodic HMC has a unique attractor, which is the set of mixed states R. Additionally, this attractor has a unique, attracting, invariant measure known as the Blackwell measure µ_B(R).
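The IFS view yields a direct way to sample R and, asymptotically, the Blackwell measure: iterate the maps, drawing each symbol from p^{(x)}(η). A sketch (Python/NumPy; ours, assuming an ergodic HMC given by its symbol-labeled matrices, with a burn-in period of the kind used for the figures):

```python
import numpy as np

def sample_blackwell(Ts, n_steps=10_000, burn_in=5_000, seed=0):
    """Iterate the place-dependent IFS defined by the HMC `Ts`.  The
    visited points settle onto the attractor R, distributed according
    to the Blackwell measure; `burn_in` discards the transient."""
    rng = np.random.default_rng(seed)
    N = Ts[0].shape[0]
    one = np.ones(N)
    eta = np.full(N, 1.0 / N)    # any initial belief converges toward R
    samples = []
    for t in range(n_steps):
        probs = np.array([eta @ Tx @ one for Tx in Ts])   # p^(x)(eta)
        x = rng.choice(len(Ts), p=probs / probs.sum())
        eta = (eta @ Ts[x]) / probs[x]                    # f^(x)(eta)
        if t >= burn_in:
            samples.append(eta.copy())
    return np.array(samples)
```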
IV. STRUCTURE OF INFINITE-STATE PROCESSES

Our prior development showed how to use the MSP to find a process' intrinsic randomness—in the form of the Shannon entropy rate h_µ. Our goal here is to complement the measure of randomness with a measure of structure or memory.

Recall that for processes generated by unifilar HMCs, a unique minimal machine known as the ε-machine exists, so we can uniquely define the statistical complexity C_µ for a process using Eq. (4). Nonunifilar HMCs have no such canonical minimal presentation. However, as discussed in Section III, we may find the mixed-state presentation U(M) of nonunifilar HMCs to unifilarize them. Using the MSP to develop a measure of structure for processes generated by HMCs is a natural choice, since the MSP for a given process is unifilar and unique [32]—indeed, in most cases, the MSP is the ε-machine of the underlying process (see Appendix B).

However, the naive approach of simply measuring structure with the statistical complexity C_µ introduces a problem: the statistical complexity diverges for an HMC with an uncountably-infinite state set R. In general, MSPs of HMCs are uncountably-infinite state, precluding distinguishing them via C_µ. This being said, it is visually clear from Fig. 3 that HMCs with uncountably-infinite state spaces still have significant and distinct structures. We wish to find a way to measure and distinguish such structure. For this, we take inspiration from Shannon's dimension rate [4] and call on a familiar tool.

Fractal dimension measures the rate at which a chosen size metric of a set diverges with the scale at which the set is observed [36–40]. Fractal dimension is also useful to probe the "size" of objects when cardinality is not informative. For example, the mixed-state presentation, generically, has an uncountable infinity of causal states. That observation is far too coarse, though, to distinguish the clearly distinct mixed-state sets R in Fig. 3. Each is uncountably infinite, but the R's geometries differ. Determining their fractal and other dimensions will allow us to distinguish them and will introduce additional insights into the original process' intrinsic information processing.

A. Dimensions
Consider the mixed-state set R on the simplex for an N-state HMC M that generates a process P. We consider two types of dimension for R: the Minkowski–Bouligand or box-counting dimension, often simply called the fractal dimension, and the information dimension.

To calculate the first, coarse-grain the simplex with evenly spaced subsimplex cells of side length ε. Let F(ε) be the set of cells that encompass at least one mixed state. Then R's box-counting dimension is:

d_0(R) = − lim_{ε→0} log |F(ε)| / log ε, (8)

where |C| is the size of set C.

The information dimension considers how the measure over R scales. Let each cell in F(ε) be a state and approximate the dynamic over U(M) by grouping all transitions to and from states encompassed by the same cell. This results in a Markov chain that generates an approximation of the original process P and that has a stationary distribution µ(F(ε)). Then R's information dimension is:

d_1(µ(R)) = lim_{ε→0} H_µ[F(ε)] / log(1/ε), (9)

where H_µ[F(ε)] = − Σ_{C_i ∈ F(ε)} µ(C_i) log µ(C_i) is the Shannon entropy over the set F(ε) of cells that cover attractor R, with respect to µ.

B. Dimensions and Scaling of HMCs
These dimensions give two complementary resource-scaling laws for HMC-generated processes. Rearranging Eq. (8), we see that the number of mixed states in our finite-state approximation to U(M) scales algebraically with R's box-counting dimension:

|F(ε)| ∼ ε^{−d_0(R)}. (10)

In other words, for an uncountably infinite MSP, the rate of growth of mixed states is d_0(R).

Similarly, the entropy of the mixed-state set scales with the information dimension. Rearranging Eq. (9) shows that the state entropy of the finite-state approximation to U(M) scales logarithmically with R's information dimension with respect to the Blackwell measure:

H_µ[F(ε)] ∼ d_1(µ) · log(1/ε). (11)

As ε → 0, |F(ε)| and H_µ[F(ε)] diverge, and d_0 and d_1 are their respective divergence rates. The remainder focuses on d_1 as applied to the ε-machine, for which it describes the rate of divergence of the statistical complexity C_µ.
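For intuition, the scaling laws Eqs. (10) and (11) can be estimated from a finite sample of mixed states by histogramming at several scales. A rough sketch (Python/NumPy; a finite-sample estimate only, not a substitute for the Lyapunov route developed below):

```python
import numpy as np

def dimension_estimates(points, eps_list):
    """Finite-sample box-counting (d_0) and information (d_1) estimates
    for a cloud of mixed states, via the scalings Eqs. (10) and (11):
    slopes of log|F(eps)| and H_mu[F(eps)] against log(1/eps)."""
    log_inv_eps, log_counts, entropies = [], [], []
    for eps in eps_list:
        cells, counts = np.unique(np.floor(points / eps).astype(int),
                                  axis=0, return_counts=True)
        mu = counts / counts.sum()
        log_inv_eps.append(np.log(1.0 / eps))
        log_counts.append(np.log(len(cells)))
        entropies.append(-np.sum(mu * np.log(mu)))
    d0 = np.polyfit(log_inv_eps, log_counts, 1)[0]
    d1 = np.polyfit(log_inv_eps, entropies, 1)[0]
    return d0, d1

# e.g. d0, d1 = dimension_estimates(samples, [0.1, 0.05, 0.02, 0.01])
```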
V. STATISTICAL COMPLEXITY DIMENSION

We call the information dimension d_1(µ) of the ε-machine the statistical complexity dimension d_µ. Applying d_1 to the Blackwell measure µ_B(R) gives the rate of divergence of C_µ as one constructs increasingly better finite-state approximations to the infinite-state ε-machine. In this way, d_µ describes the divergence of memory resources when attempting to optimally predict a process that requires an uncountably-infinite number of predictive features. This is a unique, minimal description of the process' structural complexity. This solves the challenge posed in the introduction: quantifying structure for truly complex systems.

When a process may be optimally predicted with a countable number of predictive features, the statistical complexity dimension vanishes. In this case, the more relevant complexity measure is the original ε-machine statistical complexity C_µ, which is finite. The statistical complexity dimension for a process that may be minimally generated with N states is less than or equal to N − 1, which we call the embedding dimension, since the mixed states lie in a space of dimension N − 1. Directly calculating the information dimension—and so d_µ—is nontrivial, as it requires estimating a fractal measure. Fortunately, to calculate the d_µ of the mixed-state attractor R, we can leverage the associated generating dynamical system (see Section III C).

A. Dimension from Dynamical (In)Stabilities
We can link the information dimension of an MSP's mixed-state set R to the stability properties of the associated IFS. This starts with determining the local time-average stability and instability of orbits within an attractor via the spectrum of Lyapunov characteristic exponents Γ = {λ_1, …, λ_N : λ_i ≥ λ_{i+1}} [41, 42]. Individual LCEs λ_i measure the average local growth or decay rate of orbit perturbations. The net result is a list of quantities that indicate long-term orbit instability (λ_i > 0) and orbit stability (λ_i < 0) in complementary directions.

Usefully, their sum gives the net state-space divergence—volume loss for dissipative systems. The sum of the positive LCEs is the dynamical system's entropy rate [40, 41, 43–45]—the net information generation.

To motivate our present, somewhat indirect, use of the Lyapunov spectrum, it will help to develop a simple intuition for the LCEs as quantities of stretching and contraction. Imagine the attractor of a two-dimensional (N = 2) map with LCEs λ_1 > 0 > λ_2 and whose state space is covered with equally spaced squares of side length ε. After iterating the map q times, for ε small enough, the local action of the map is approximately linear. From the LCE definition, this means it takes the initial square cells to rectangles of average length e^{λ_1 q} ε and average width e^{λ_2 q} ε. Now, covering the attractor with squares of side length e^{λ_2 q} ε requires roughly e^{(λ_1 − λ_2) q} squares per rectangle. In this way, the Lyapunov exponents describe scalings analogous to those seen with the box-counting dimension [46].

This suggests defining a Lyapunov dimension in terms of the spectrum Γ [47]:

d_Γ = k + (Σ_{i=1}^{k} λ_i) / |λ_{k+1}|,  if Σ_{i=1}^{N} λ_i < 0;
d_Γ = N,  if Σ_{i=1}^{N} λ_i ≥ 0, (12)

where k is the largest index such that the summation Σ_{i=1}^{k} λ_i remains positive. If λ_1 < 0, then d_Γ = 0. Its name helps distinguish the conditions under which the relationships between the various dimensions actually hold. (There are many conditions and system classes that this summary necessarily leaves out.)

It was conjectured [47] that for "typical dynamical systems" the Lyapunov dimension d_Γ equals the information dimension d_1. What has been shown, however, is that for any ergodic, invariant probability measure µ:

d_1(µ) ≤ d_Γ,

with equality when µ is a Sinai–Bowen–Ruelle (SBR) measure [48]. This remarkable relationship directly relates a system's dynamics to the geometry and natural measure of its attractor. Additionally, and usefully, Eq. (12) gives us a tractable method to find the information dimension of an attractor generated by a dynamical system.
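Eq. (12) translates directly into code. A sketch (Python; ours), assuming the spectrum is given sorted λ_1 ≥ λ_2 ≥ …:

```python
def lyapunov_dimension(spectrum):
    """Kaplan-Yorke dimension, Eq. (12); `spectrum` is sorted with the
    largest exponent first."""
    if spectrum[0] < 0:
        return 0.0
    if sum(spectrum) >= 0:
        return float(len(spectrum))     # partial sums never go negative
    partial, k = 0.0, 0
    while partial + spectrum[k] >= 0:   # largest k keeping the sum positive
        partial += spectrum[k]
        k += 1
    return k + partial / abs(spectrum[k])

print(lyapunov_dimension([0.4, -0.8]))  # 1 + 0.4/0.8 = 1.5
```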
B. Calculating Statistical Complexity Dimension

We now have in hand two important pieces. First, the definition of the statistical complexity dimension d_µ as the information dimension of an ε-machine. Second, we have a bound on the information dimension of an attractor, given knowledge of the generating system's dynamics. To complete our picture, we now address the final puzzle piece.

As Section V A discussed, there is a direct relationship between the information dimension of a chaotic attractor and the dynamics of the system to which the attractor belongs. Furthermore, as discussed in Section III C, every HMC has an associated random dynamical system—the iterated function system (IFS)—which has the HMC's set of mixed states R as its unique attractor. Combining these two facts allows us to exactly calculate d_µ in many cases of interest.

FIG. 4. Overlap problem on the 1-simplex ∆_1: Two distinct IFSs are considered, each with two mapping functions. The images of the mapping functions over the entire simplex are depicted in red and blue. (A) Images of the mapping functions f^{(0)} and f^{(1)} do not overlap—every mixed state η_t ∈ R has a unique pre-image. (B) Images of the mapping functions overlap (purple Overlapping Region)—there exist η, η′ ∈ R such that f^{(0)}(η) = f^{(1)}(η′). This case is an overlapping IFS.

An advantage of working with IFSs defined by HMCs is a clean division exhibited by their Lyapunov spectrum. It has been shown that the entropy rate of the generated process is equivalent to the largest Lyapunov exponent [49, 50]. Calculating this was the main topic of the prequel to the present work [32]. Furthermore, due to the contractivity of the IFS mapping functions, all other Lyapunov exponents are necessarily negative. For a review of calculating the Lyapunov exponents for IFSs, see Appendix E.

Therefore, the Lyapunov dimension of an IFS is:

d̃_Γ = k + (h_µ + Σ_{i=1}^{k−1} λ_i) / |λ_k|,  if h_µ + Σ_{i=1}^{N−1} λ_i < 0;
d̃_Γ = N − 1,  if h_µ + Σ_{i=1}^{N−1} λ_i ≥ 0, (13)

where k is now the largest index for which h_µ + Σ_{i=1}^{k−1} λ_i > 0, h_µ taking the place of the largest Lyapunov exponent.

Under specific technical conditions, to be discussed shortly, the IFS d̃_Γ is exactly equal to the information dimension of the IFS's attractor and, therefore, is d_µ [51]. In general, relaxing those conditions, d̃_Γ upper bounds the statistical complexity dimension:

d̃_Γ ≥ d_µ. (14)

Assembling these pieces together determines the basic algorithm to calculate (or bound) the statistical complexity dimension:

1. For an N-state HMC M with |A| = k, write down the associated IFS with k symbol-labeled mapping functions and probability functions.
2. Calculate the entropy rate h_µ using the Blackwell limit (see [32]).
3. Calculate the negative Lyapunov exponents {λ_1, …, λ_{N−1}} (see Appendix E).
4. Compute the Lyapunov dimension d̃_Γ using Eq. (13) (sketched below).

As mentioned, in specific cases the Lyapunov dimension is exactly equal to the statistical complexity dimension, and our task is complete. However, there are major technical concerns with when we have only the bound in Eq. (14) and with its tightness then.
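The modified formula Eq. (13) is the same bookkeeping as Eq. (12), with h_µ in the role of the leading exponent. A sketch (Python; ours), assuming h_µ and the negative exponents λ_1 ≥ … ≥ λ_{N−1} have already been computed as in steps 2 and 3:

```python
def ifs_lyapunov_dimension(h_mu, neg_spectrum):
    """Eq. (13): Lyapunov dimension of the mixed-state IFS, with the
    entropy rate h_mu as the largest exponent and `neg_spectrum` the
    negative exponents lambda_1 >= ... >= lambda_{N-1}."""
    full = [h_mu] + list(neg_spectrum)
    if sum(full) >= 0:
        return float(len(neg_spectrum))   # saturates at N - 1
    partial, k = 0.0, 0
    while partial + full[k] >= 0:
        partial += full[k]
        k += 1
    return k + partial / abs(full[k])
```

Under the OSC discussed next, this value equals d_µ; otherwise, per Eq. (14), it only upper-bounds it.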
C. The Overlap Problem

A subtle disadvantage of working with IFSs is a direct result of their stochastic nature as random dynamical systems. We must consider the overlap problem, which concerns the ranges of the symbol-labeled mapping functions f^{(i)}, illustrated in Fig. 4. Specifically, the problem means that we must distinguish between IFSs that meet the open set condition and those that do not.

Definition 6.
An iterated function system with mapping functions f^{(x)} : ∆_N → ∆_N satisfies the open set condition (OSC) if there exists an open set U ⊂ ∆_N such that for all x, y ∈ A:

f^{(x)}(U) ∩ f^{(y)}(U) = ∅,  x ≠ y.

IFSs that meet the OSC are nonoverlapping IFSs.

When the images of the symbol-labeled mappings overlap, the inequality in Eq. (14) is strict. To briefly outline the consequences: for an overlapping IFS, the entropy rate h_µ does not accurately capture state-space expansion. And this causes the IFS d̃_Γ (Eq. (13)) to overestimate the information dimension. As a rule of thumb, the degree to which the mappings overlap determines the magnitude of the bound's error. The impact of overlaps is significant. It is explored both in Section VI, where we calculate the statistical complexity dimension for HMCs with and without overlap, as well as in the sequel, which diagnoses the problem's origins and outlines a solution.

For now, to give a workable approach, we simply introduce two extra steps to the d_µ algorithm from the previous section:

5. Determine whether the open set condition is met, using the mapping functions f^{(i)} (sketched below for the 1-simplex).
6. If the OSC is not met, estimate the degree of overlap to determine the closeness of the bound on d_µ.
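On the 1-simplex, the OSC check of step 5 is elementary: each map sends ∆_1 to an interval, and the condition fails exactly when two of these intervals share interior points. A sketch (Python/NumPy; ours, assuming every symbol has positive probability from each state so the maps are well defined on all of ∆_1; for the polygon intersections used in Sec. VI, the text points to Appendix F):

```python
import numpy as np

def image_interval(Tx):
    """Image of the 1-simplex under the map f^(x) of Eq. (7) for a
    2-state HMC.  In the coordinate eta = (q, 1 - q), f^(x) is a Mobius
    map in q, hence monotone, so the image is the interval between its
    values at the endpoints q = 0 and q = 1."""
    def f(q):
        v = np.array([q, 1.0 - q]) @ Tx
        return v[0] / v.sum()
    return tuple(sorted((f(0.0), f(1.0))))

def satisfies_osc(Ts):
    """True when the maps' images are pairwise non-overlapping intervals."""
    ivals = sorted(image_interval(Tx) for Tx in Ts)
    return all(hi <= next_lo
               for (_, hi), (next_lo, _) in zip(ivals, ivals[1:]))
```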
D. Statistical Complexity Dimension for Processes Generated by Two-State HMCs

Finally, we analyze two-state HMCs, for which Eq. (13) simplifies significantly and gives exact results for d_µ. (Thus, the concerns just outlined occur only for HMCs with three or more states, which are explored in the next section.)

For two-state nonunifilar HMCs, the mixed-state set lives on the 1-simplex ∆_1—the unit interval from η = (0, 1) to η = (1, 0). The mixed states η ∈ R and the dynamic on them exist in a one-dimensional space and, thus, there is a single negative Lyapunov exponent λ < 0. The Lyapunov exponent of a one-dimensional map η_{n+1} = f(η_n) is:

λ(η_0) = lim_{N→∞} (1/N) Σ_{i=0}^{N−1} log |df(η_i)/dη|,

for an orbit starting at η_0. For an IFS with a set of mapping functions {f^{(x)}}, we find λ as the weighted average of the Lyapunov exponents of each map:

λ_µ = ∫ Σ_x p^{(x)}(η) log |df^{(x)}(η)/dη| dµ,

where µ is the IFS's Blackwell measure. We can apply ergodicity to transform this into a summation over time, for ease of calculation.

If a two-state HMC has an MSP with an uncountable infinity of mixed states and its corresponding IFS satisfies the OSC, there is a simple relationship between the entropy rate, the Lyapunov exponent, and the statistical complexity dimension. This is given by:

d_µ(µ) = − h_µ / λ_µ, (15)

recalling that h_µ > 0 and λ_µ < 0, so that the dimension is always positive. Failing the OSC, this ratio is an upper bound on the dimension of the measure µ [52]. For a discussion of the intuition behind this formula, see Appendix C.
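Putting the two-state pieces together: the sketch below (Python/NumPy; ours, with both averages taken along a single orbit by the ergodicity just invoked) estimates h_µ and λ_µ and returns the ratio of Eq. (15).

```python
import numpy as np

def d_mu_two_state(Ts, n_steps=200_000, burn_in=1_000, seed=0):
    """Estimate d_mu = -h_mu / lambda_mu (Eq. (15)) for a 2-state HMC
    whose IFS satisfies the OSC.  h_mu and lambda_mu are accumulated as
    time averages along one ergodic orbit of the mixed-state IFS."""
    rng = np.random.default_rng(seed)
    q = 0.5                       # mixed state (q, 1 - q) on the 1-simplex
    h_sum = lam_sum = 0.0
    n = 0
    for t in range(n_steps):
        probs = np.array([(q * Tx[0] + (1 - q) * Tx[1]).sum() for Tx in Ts])
        x = rng.choice(len(Ts), p=probs / probs.sum())
        Tx, s = Ts[x], probs[x]
        v0 = q * Tx[0, 0] + (1 - q) * Tx[1, 0]
        dv0 = Tx[0, 0] - Tx[1, 0]            # d v0 / d q
        ds = Tx[0].sum() - Tx[1].sum()       # d s / d q
        if t >= burn_in:
            h_sum -= np.log(s)               # -log Pr(x_t | eta_t)
            lam_sum += np.log(abs((dv0 * s - v0 * ds) / s**2))  # log|f'(q)|
            n += 1
        q = v0 / s                           # apply f^(x), Eq. (7)
    return -(h_sum / n) / (lam_sum / n)      # log base cancels in the ratio
```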
VI. MULTI-STATE HMC EXAMPLES
Notably, the Lyapunov dimension Eq. (15) for more-than-two-state HMCs is easily shown to be correct when the maps are similitudes and the probability functions are constant. The latter is seen, for example, with the Sierpinski triangle, as discussed in Appendix D.

However, we are generally interested in multi-state HMCs that do not produce the perfectly-self-similar fractals that arise under those conditions. Furthermore, we are often interested in considering physical systems described by parametrized HMCs, such as those that arose in the two prequels on quantum measurement processes and information-engine functionality [18, 33]. In such cases, an HMC determined by an application may meet the OSC in some regions of parameter space and fail to do so in others. We will consider an HMC that spans the breadth of these possible behaviors, from zero overlap to complete overlap. This will demonstrate the range of applicability of the statistical complexity dimension algorithm laid out above.

Consider the following HMC with 3 symbols {□, △, ◦} and 3 states:

T^{□} = ( αy  βx  βx
          αx  βy  βx
          αx  βx  βy ),

T^{△} = ( βy  αx  βx
          βx  αy  βx
          βx  αx  βy ),  and

T^{◦} = ( βy  βx  αx
          βx  βy  αx
          βx  βx  αy ), (16)

with β = (1 − α)/2 and y = 1 − 2x. By inspection, we see that α takes on any value from 0 to 1 and x may range from 0 to 1/2. Figure 5 displays the resulting mixed-state attractors across the (α, x) parameter space. Each black dot is a generated mixed state, while the colored regions show the range of each symbol-labeled map.

For example, on one hand, in the top left corner with α = 0.01 and x = 0.01, we find an attractor that extends across the simplex, with moderate amounts of overlap. On the other, α = 0.79 and x = 0.11 produces an attractor with no overlap and clearly defined regions. Moreover, for any α, choosing x = 1/3 sets x = y, so that each symbol-labeled mapping function becomes a constant. Similarly, when α = β = 1/3, all symbol-labeled mapping functions are identical. Therefore, the attractor is the single fixed point shared by all three maps—a single-state HMC. This is a case of maximal possible overlap. Along both lines in parameter space the MSP collapses to a finite-state HMC, so d_µ = 0, by definition. However, these different mechanisms of state collapse are treated differently in the calculation of d_Γ via Eq. (13).
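For reproducibility, here is Eq. (16) in code, as reconstructed above (Python/NumPy; a sketch, where the assertion checks that the three matrices sum to a row-stochastic T for the chosen (α, x)):

```python
import numpy as np

def eq16_matrices(alpha, x):
    """Symbol-labeled matrices of Eq. (16), with beta = (1 - alpha)/2
    and y = 1 - 2x; alpha in [0, 1], x in [0, 1/2]."""
    a, b, y = alpha, (1 - alpha) / 2, 1 - 2 * x
    T_box = np.array([[a*y, b*x, b*x], [a*x, b*y, b*x], [a*x, b*x, b*y]])
    T_tri = np.array([[b*y, a*x, b*x], [b*x, a*y, b*x], [b*x, a*x, b*y]])
    T_cir = np.array([[b*y, b*x, a*x], [b*x, b*y, a*x], [b*x, b*x, a*y]])
    return [T_box, T_tri, T_cir]

Ts = eq16_matrices(0.79, 0.11)               # the no-overlap example
assert np.allclose(sum(Ts).sum(axis=1), 1.0) # T is row stochastic
```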
FIG. 5. Mixed-state attractors generated by the 3-state HMC of Eq. (16), parametrized over α ∈ [0, 1] and x ∈ [0, 1/2]. 10,000 mixed states are plotted for each attractor, with the initial 5,000 states thrown away as transients. The ranges of the symbol-labeled maps are color shaded, revealing regions of their image overlap on the attractor. Comparing to Eq. (16), the red, blue, and green regions represent the images of the mapping functions defined by T^{□}, T^{△}, and T^{◦}, respectively.

First, consider Fig. 6a, which illustrates the estimated area on the simplex taken up by the attractor across parameter space. For a discussion of how attractor area was estimated, please see Appendix F. This figure matches the (x, α) grid in Fig. 5: lower values of x produce larger attractors, excepting the region near α = 1/3, where the area drops to zero. We know from our analysis of the attractor grid that along this line the attractor is finite state, and so the statistical complexity dimension d_µ vanishes. However, this is not accurately reflected by d_Γ, as seen in Fig. 6b.

FIG. 6. Attractor area, overlap regions, and Lyapunov dimension d_Γ of the mixed-state attractors shown in Fig. 5, parametrized by α ∈ [0, 1] and x ∈ [0, 1/2]. (a) Mixed-state attractor area estimated across α and x; the area is minimized along the lines α = 1/3 and x = 1/3. (b) The Lyapunov dimension d_Γ does not detect the collapse to zero dimension along α = 1/3, which is due to overlaps. (c) Percentage of attractor area in which there is overlap. (d) Overlap in the attractor area; comparing with (c), for much of this area, the overlap is very small.

That said, d_Γ clearly—and correctly—vanishes when x = 1/3. This is the other line in parameter space where the MSP is finite state and d_µ is known to be zero. This disparity is due to the different mechanisms driving the collapse. When x = 1/3, the symbol-labeled mapping functions become constants, which is reflected in the Lyapunov exponents and consequently in d_Γ: the constant functions have Lyapunov exponents of negative infinity, sending Eq. (13) to zero. In contrast, along the α = 1/3 line, d_Γ badly overestimates d_µ. This illustrates the importance of the OSC to the bound.

This poses the question: which regions in HMC parameter space exhibit overlapping IFS maps? Figure 6d depicts the parameter space as overlap or no overlap. Figure 6c shows this as a percentage of the total attractor area. For a discussion of how overlap was determined, please see Appendix F. Comparing the two, we see that there is a significant region of overlap for x < 0.15 and a smaller region where x > 0.48. However, for much of that region the attractor's overlap area is relatively small. As a rule of thumb, the gap between d_Γ and d_µ for the mixed-state set measure µ(R) is determined by the percentage of the attractor that is affected by overlap. If the overlap region is relatively small, in comparison to the size of the attractor, d_Γ may be very close to d_µ.

However, if the overlap is very large, d_Γ may be a dramatic overestimation of d_µ. This occurs when α = 1/3 and x < 0.15: the statistical complexity dimension d_µ vanishes, yet the Lyapunov dimension saturates at d_Γ = 2. This divergence appears along α = 1/3 but not along x = 1/3. This is because state-space collapse due to overlap requires the maps to be identical, and even minute differences in the symbol-labeled transition matrices will produce an uncountably-infinite MSP, potentially with d_µ = 2.

VII. CONCLUSION
Our development opened by considering the challenge of quantifying the structure of complex systems. For well over half a century the Shannon entropy rate stood as the standard by which to quantify randomness in time series and in chaotic dynamical systems. Quantifying observable patterns remained a more elusive goal. However, with developments from computational mechanics, it has become possible to answer questions of structure and pattern, at least for stochastic processes generated by finite-state predictive machines, including the symbolic dynamics generated by chaotic dynamical systems.

To handle the processes generated by finite-state nonpredictive (nonunifilar) hidden Markov chains, we developed the mixed-state presentation. This unifilarized general HMCs, giving a predictive presentation that itself generates the process. However, adopting a unifilar presentation came at a heavy cost: generically, such presentations are infinite-state, and so previous structural measures diverge. Nonetheless, we showed how to work constructively with these infinite mixed-state presentations. In particular, we showed that they fall into a common class of dynamical system: the mixed-state presentation is an iterated function system. Due to this, a number of results from dynamical systems theory can be applied to more fully describe the original stochastic process.

Previously, others considered the IFS-HMC connection [54, 55]. Complementing those efforts, we expanded the role of the mixed-state presentation to calculate entropy rate and demonstrated its usefulness in determining the underlying structural properties of the generated process. Indeed, Figs. 3 and 5 show how visually striking—and distinct—the mixed-state sets generated by HMCs are.

Here, moving in a new direction beyond previous efforts, we established that the information dimension of the mixed-state attractor is exactly the divergence rate of the statistical complexity [56]—a measure of a process' structural complexity that tracks memory. Thus, processes in this class effectively increase their use of memory, "creating" mixed or causal states on the fly. Furthermore, we introduced a method to calculate the information dimension of the mixed-state attractor from the Lyapunov spectrum of the mixed-state IFS. In this way, we demonstrated that coarse-graining the mixed-state simplex—the previous method for studying the structure of infinite-state processes [57]—can be avoided altogether. This greatly improves accuracy and calculational speed.

During the development, we noted several obstacles. Most importantly, the presence of overlap and failure to meet the conditions of the OSC cause the Lyapunov dimension to be a strict upper bound, and sometimes quite a poor one, on the statistical complexity dimension. The final work of our trilogy [53] introduces a measure of how badly the entropy rate overestimates the expansion of the mixed-state set. Combining this measure with the Lyapunov-information dimension conjecture finally yields a correct Eq. (13) to apply to HMCs with overlap; that is, for processes generated by all general HMCs.

To close, we note that the structural tools introduced here and the entropy-rate method introduced previously [32] have been put to practical use in two previous works. One diagnosed the origin of randomness and structural complexity in quantum measurement [33]. The other exactly determined the thermodynamic functioning of Maxwellian information engines [18], when there had been no previous method for this kind of detailed and accurate analysis.
At this point, however, we leave the full explication of these techniques and further analysis of how mixed states reveal the underlying structure of processes generated by hidden Markov chains to the sequel [53].
ACKNOWLEDGMENTS
The authors thank Sam Loomis, Sarah Marzen, Ariadna Venegas-Li, and Ryan James for helpful discussions, and the Telluride Science Research Center for hospitality during visits, as well as the participants of the Information Engines Workshops there. JPC acknowledges the kind hospitality of the Santa Fe Institute, the Institute for Advanced Study at the University of Amsterdam, and the California Institute of Technology. This material is based upon work supported by, or in part by, FQXi grants FQXi-RFP-IPW-1902 and FQXI-RFP-CPW-2007 and U.S. Army Research Laboratory and U.S. Army Research Office grants W911NF-18-1-0028 and W911NF-21-1-0048.

[1] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. Ser. 2, 42:230, 1936.
[2] C. E. Shannon. A universal Turing machine with two internal states. In C. E. Shannon and J. McCarthy, editors, Automata Studies, number 34 in Annals of Mathematical Studies, pages 157–165. Princeton University Press, Princeton, New Jersey, 1956.
[3] M. Minsky. Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs, New Jersey, 1967.
[4] C. E. Shannon. A mathematical theory of communication. Bell Sys. Tech. J., 27:379–423, 623–656, 1948.
[5] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea Publishing Company, New York, second edition, 1956.
[6] A. N. Kolmogorov. Three approaches to the concept of the amount of information. Prob. Info. Trans., 1:1, 1965.
[7] A. N. Kolmogorov. Combinatorial foundations of information theory and the calculus of probabilities. Russ. Math. Surveys, 38:29–40, 1983.
[8] A. N. Kolmogorov. Entropy per unit time as a metric invariant of automorphisms. Dokl. Akad. Nauk. SSSR, 124:754, 1959. (Russian) Math. Rev. vol. 21, no. 2035b.
[9] Ja. G. Sinai. On the notion of entropy of a dynamical system. Dokl. Akad. Nauk. SSSR, 124:768, 1959.
[10] J. P. Crutchfield. Between order and chaos. Nature Physics, 8:17–24, 2012.
[11] J. P. Crutchfield and D. P. Feldman. Regularities unseen, randomness observed: Levels of entropy convergence. CHAOS, 13(1):25–54, 2003.
[12] S. Marzen, M. R. DeWeese, and J. P. Crutchfield. Time resolution dependence of information measures for spiking neurons: Scaling and universality. Front. Comput. Neurosci., 9:109, 2015.
[13] C. R. Shalizi and J. P. Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. J. Stat. Phys., 104:817–879, 2001.
[14] D. Blackwell. The entropy of functions of finite-state Markov chains. In Transactions of the First Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, volume 28, pages 13–20, Prague, Czechoslovakia, 1957. Publishing House of the Czechoslovak Academy of Sciences.
[15] J. P. Crutchfield and K. Young. Computation at the onset of chaos. In W. Zurek, editor, Entropy, Complexity, and the Physics of Information, volume VIII of SFI Studies in the Sciences of Complexity, pages 223–269, Reading, Massachusetts, 1990. Addison-Wesley.
[16] N. Travers and J. P. Crutchfield. Infinite excess entropy processes with countable-state generators. Entropy, 16:1396–1413, 2014.
[17] L. Debowski. On hidden Markov processes with infinite excess entropy. J. Theo. Prob., 27(2):539–551, 2012.
[18] A. Jurgens and J. P. Crutchfield. Functional thermodynamics of Maxwellian ratchets: Constructing and deconstructing patterns, randomizing and derandomizing behaviors. Phys. Rev. Research, 2(3):033334, 2020.
[19] D. L. Turcotte. Fractals and Chaos in Geology and Geophysics. Cambridge University Press, Cambridge, United Kingdom, second edition, 1997.
[20] J. M. Beggs and D. Plenz. Neuronal avalanches in neocortical circuits. J. Neurosci., 23(35):11167–11177, 2003.
[21] L. Debowski. Excess entropy in natural language: Present state and perspectives. Chaos, 21(3):037105, 2011.
[22] I. Dobson, B. A. Carreras, V. E. Lynch, and D. E. Newman. Complex systems analysis of series of blackouts: Cascading failure, critical points, and self-organization. CHAOS, 17(2):26103, 2007.
[23] J. P. Crutchfield. The calculi of emergence: Computation, dynamics, and induction. Physica D, 75:11–54, 1994.
[24] B. Marcus, K. Petersen, and T. Weissman, editors. Entropy of Hidden Markov Process and Connections to Dynamical Systems, volume 385 of Lecture Notes Series. London Mathematical Society, 2011.
[25] Y. Ephraim and N. Merhav. Hidden Markov processes. IEEE Trans. Info. Th., 48(6):1518–1569, 2002.
[26] J. Bechhoefer. Hidden Markov models for stochastic thermodynamics. New. J. Phys., 17:075003, 2015.
[27] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, January:4–16, 1986.
[28] E. Birney. Hidden Markov models in biological sequence analysis. IBM J. Res. Dev., 45(3.4):449–454, 2001.
[29] S. Eddy. What is a hidden Markov model? Nature Biotech., 22:1315–1316, Oct 2004.
[30] C. Bretó, D. He, E. L. Ionides, and A. A. King. Time series analysis via mechanistic models. Ann. App. Statistics, 3(1):319–348, Mar 2009.
[31] T. Rydén, T. Teräsvirta, and S. Åsbrink. Stylized facts of daily return series and the hidden Markov model. J. App. Econometrics, 13:217–244, 1998.
[32] A. Jurgens and J. P. Crutchfield. Shannon entropy rate of hidden Markov processes. arXiv:2008.12886, 2020.
[33] A. Venegas-Li, A. Jurgens, and J. P. Crutchfield. Measurement-induced randomness and structure in controlled qubit processes. Phys. Rev. E, 102(4):040102(R), 2020.
[34] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, second edition, 2006.
[35] J. P. Crutchfield, P. Riechers, and C. J. Ellison. Exact complexity: Spectral decomposition of intrinsic computation. Phys. Lett. A, 380(9-10):998–1002, 2016.
[36] A. Renyi. On the dimension and entropy of probability distributions. Acta Math. Hung., 10:193, 1959.
[37] B. B. Mandelbrot. The Fractal Geometry of Nature. W. H. Freeman and Company, San Francisco, California, 1982.
[38] G. A. Edgar. Measure, Topology, and Fractal Geometry. Springer-Verlag, New York, 1990.
[39] K. Falconer. Fractal Geometry: Mathematical Foundations and Applications. John Wiley, Chichester, 1990.
[40] Ya. B. Pesin. Dimension Theory in Dynamical Systems: Contemporary Views and Applications. Chicago Lectures in Mathematics. University of Chicago Press, Chicago, Illinois, 1997.
[41] I. Shimada and T. Nagashima. A numerical approach to ergodic problem of dissipative dynamical systems. Prog. Theo. Phys., 61:1605, 1979.
[42] G. Benettin, L. Galgani, A. Giorgilli, and J.-M. Strelcyn. Lyapunov characteristic exponents for smooth dynamical systems and for Hamiltonian systems; a method for computing all of them. Meccanica, 15:9, 1980.
[43] Ja. B. Pesin. Ljapunov characteristic exponents and ergodic properties of smooth dynamical systems with an invariant measure. Doklady Akademii Nauk, 226:774–777, 1976.
[44] D. Ruelle. Sensitive dependence on initial conditions and turbulent behavior of dynamical systems. Annals of the New York Academy of Sciences, 316(1):408–416, 1977.
[45] G. Benettin, L. Galgani, and J.-M. Strelcyn. Kolmogorov entropy and numerical experiments. Phys. Rev. A, 14(6):2338–2345, 1976.
[46] J. D. Farmer, E. Ott, and J. A. Yorke. The dimension of chaotic attractors. Physica, 7D:153, 1983.
[47] J. Kaplan and J. Yorke. Chaotic behavior of multidimensional difference equations. In Functional Differential Equations and Approximation of Fixed Points, volume 730 of Lecture Notes in Mathematics, pages 204–227. Springer, 1979.
[48] F. Ledrappier and L. S. Young. The metric entropy of diffeomorphisms: Part II: Relations between entropy, exponents and dimension. Ann. Mathematics, 122:540–574, 1985.
[49] P. Jacquet, G. Seroussi, and W. Szpankowski. On the entropy of a hidden Markov process. Theo. Comp. Sci., 395:203–219, 2008.
[50] T. Holliday, A. Goldsmith, and P. Glynn. Capacity of finite state channels based on Lyapunov exponents of random matrices. IEEE Trans. Info. Th., 52:3509–3532, 2006.
[51] B. Barany. On the Ledrappier-Young formula for self-affine measures. Math. Proc. Cambridge Phil. Soc., 159(3):405–432, 2015.
[52] J. Jaroszewska and M. Rams. On the Hausdorff dimension of invariant measures of weakly contracting on average measurable IFS. J. Stat. Physics, 2008.
[53] A. Jurgens and J. P. Crutchfield. Backwards entropy rate of hidden Markov processes. In preparation, 2021.
[54] M. Rezaeian. Hidden Markov process: A new representation, entropy rate and estimation entropy. arXiv:0606114, 2006.
[55] W. Słomczyński, J. Kwapień, and K. Życzkowski. Entropy computing via integration over fractal measures. Chaos: Interdisc. J. Nonlin. Sci., 10(1):180–188, Mar 2000.
[56] J. P. Crutchfield and K. Young. Inferring statistical complexity. Phys. Rev. Let., 63:105–108, 1989.
[57] S. E. Marzen and J. P. Crutchfield. Nearly maximally predictive features and their dimensions. Phys. Rev. E, 95(5):051301(R), 2017.
[58] A. Jurgens and J. P. Crutchfield. Minimal embedding dimension of minimally infinite hidden Markov processes. In preparation, 2021.
[59] V. I. Oseledets. A multiplicative ergodic theorem. Lyapunov characteristic numbers for dynamical systems. Trans. Moscow Math. Soc., 19:197–231, 1968.
[60] J. H. Elton. An ergodic theorem for iterated maps. Ergod. Th. Dynam. Sys., 7:481–488, 1987.
[61] G. Froyland and K. Aihara. Rigorous numerical estimation of Lyapunov exponents and invariant measures of iterated function systems and random matrix products. Intl. J. Bifn. Chaos, 10(1):103–122, 2000.
[62] A. Boyarsky and Y. S. Lou. A matrix method for approximating fractal measures. Intl. J. Bifn. Chaos, 02:167–175, 1992.
[63] Shapely: Geometric objects, predicates, and operations. Version 1.7.1.
Supplementary Materials
Divergent Predictive States: The Statistical Complexity Dimension of Stationary, Ergodic Hidden Markov Processes
Alexandra Jurgens and James P. Crutchfield
arXiv:2102.XXXXX
The Supplementary Materials give the HMCs for the example processes considered, review MSP minimality, draw a correspondence with the Baker's map on the unit square, lay out an HMC whose mixed-state set R is the well-known Sierpinski triangle fractal, and review Lyapunov characteristic exponents and the calculation of their spectrum for IFSs.

Appendix A: Nonunifilar HMC Examples
We reproduce here the HMCs used to create Fig. 3. First, the "alpha" HMC, from Fig. 3a, is specified by its three symbol-labeled substochastic transition matrices, given numerically in Eq. (S1). [The individual numerical entries of Eqs. (S1) and (S2) are not legible in this extraction.] Figure 3b is given by Eq. (16), at α = 0.… and x = 0.1. The "beta" HMC, in Fig. 3c, is specified by the corresponding three matrices in Eq. (S2). Due to finite numerical accuracy, reproduction of the attractors using these specifications may differ slightly from Fig. 3.
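Since the numerical entries of Eqs. (S1) and (S2) are not reproduced above, the following minimal sketch shows how, given any set of labeled transition matrices, the corresponding mixed-state attractor can be regenerated by iterating the mixed-state map η → ηT^(x)/p^(x)(η). The function name, parameter defaults, and use of NumPy are our illustrative choices, not part of the original specification.

import numpy as np

def sample_mixed_states(Ts, n_steps=50_000, burn_in=5_000, seed=0):
    """Sample the mixed-state attractor of an HMC from its labeled
    transition matrices Ts = [T0, T1, ...] (each N x N; their sum is the
    row-stochastic state-to-state matrix).  Returns the mixed states
    (points of the N-simplex) visited after the burn-in period."""
    rng = np.random.default_rng(seed)
    T = sum(Ts)
    # stationary distribution of T as the initial mixed state
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    eta = pi / pi.sum()

    states = []
    one = np.ones(len(eta))
    for t in range(n_steps):
        # next-symbol probabilities from the current mixed state
        probs = np.array([eta @ Tx @ one for Tx in Ts])
        x = rng.choice(len(Ts), p=probs / probs.sum())
        eta = eta @ Ts[x]
        eta = eta / eta.sum()          # mixed-state map f^(x)(eta)
        if t >= burn_in:
            states.append(eta.copy())
    return np.array(states)

With the matrices of Eq. (S1) or (S2) substituted for Ts, plotting two coordinates of the returned states reproduces attractors of the kind shown in Fig. 3, subject to the numerical-accuracy caveat above.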
Appendix B: Mixed-State Presentation Minimality
Given an HMC M, minimality of infinite-state mixed-state presentations U(M) is an open problem: MSPs are not guaranteed to be minimal. In fact, it is possible to construct an MSP with an uncountably infinite number of states for a process that requires only one state to optimally predict, as seen with the so-called Cantor-state process in Figs. 1 and 2. Note that while this HMC generates an uncountable number of mixed states, each one has the same emitted-symbol probability distribution, indicating that all states can be merged into a single state with no loss of predictability. Indeed, the ε-machine for the HMC depicted in Fig. 1 is simply the single-state Fair Coin HMC.

FIG. S1. Commuting diagram for probability functions P = {p^(x)}, mixed-state mapping functions f^(x), and proposed symbol-distribution mapping functions g^(x).

A proposed remedy for this presentation verbosity is a short and simple check on the mergeability of mixed states. This refers to any two distinct mixed states that have the same conditional probability distribution over future strings; i.e., any two mixed states η and ζ for which:

Pr(X^L | η) = Pr(X^L | ζ),   (S1)

for all L ∈ N^+.

A benefit of the IFS formalization of the MSP is the ability to directly check for duplicated states and therefore determine if the MSP is nonminimal. We check this by considering, for an (N + 1)-state HMC M with alphabet A = {0, 1, ..., k − 1}, the dynamic not only over mixed states, but also over probability distributions over symbols. Let:

P(η) = ( p^(0)(η), ..., p^(k−1)(η) )   (S2)

and consider Fig. S1. For each mixed state η ∈ Δ_N, Eq. (S2) gives the corresponding probability distribution ρ = P(η) ∈ Δ_k over the symbols x ∈ A. Let M emit symbol x; then the dynamic from one such probability distribution ρ_t ∈ Δ_k to the next is given by:

g^(x)(ρ_t) = P ∘ f^(x) ∘ P^(−1)(ρ_t) = ρ_{t+1}.   (S3)

From this we see that when Eq. (S3) is well defined, i.e., when P is invertible, g^(x): Δ_k → Δ_k has the same functional properties as f^(x). In other words, in this case it is not possible to have two distinct mixed states η, ζ ∈ Δ_N with the same probability distribution over symbols. And the probability distributions can only converge under the action of g^(x) if the mixed states also converge under the action of f^(x). When every mixed state has a unique outgoing probability distribution, these states are also the causal states, and the MSP is the process' ε-machine. Our companion work [58] elaborates on this and the implications for identifying the embedding dimension of minimal generators.
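As a concrete illustration of this check, here is a minimal sketch (our own, with illustrative names and tolerance) that computes P(η) for a mixed state and tests whether two mixed states satisfy the necessary one-step condition for merging, P(η) = P(ζ).

import numpy as np

def symbol_distribution(eta, Ts):
    """P(eta): next-symbol probability distribution induced by mixed state
    eta, given the HMC's labeled transition matrices Ts = [T0, ..., T_{k-1}]."""
    one = np.ones(len(eta))
    p = np.array([eta @ Tx @ one for Tx in Ts])
    return p / p.sum()

def merge_candidates(eta, zeta, Ts, atol=1e-12):
    """Necessary one-step condition for merging two mixed states: they must
    induce identical next-symbol distributions.  When P is invertible, no two
    distinct mixed states share a distribution, so the MSP is already minimal."""
    return np.allclose(symbol_distribution(eta, Ts),
                       symbol_distribution(zeta, Ts), atol=atol)

For the Cantor-state HMC of Fig. 1, every pair of mixed states passes this test, since each mixed state emits the Fair Coin symbol distribution; its uncountably many mixed states therefore all merge into the single ε-machine state.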
Appendix C: Correspondence with Baker's Map

The simple dimension formula in Eq. (15) may not seem easily motivated, especially considering that, in general, both positive and negative Lyapunov exponents are required to have a nontrivial attractor. However, for iterated function systems all Lyapunov exponents are negative, and the expansive role played by positive Lyapunov exponents is instead played by an IFS's stochastic map selection, as measured by the entropy rate h_µ.

This is more intuitively appreciated by comparing a two-map IFS with the Baker's map. Consider the Baker's map:

x_{n+1} = x_n / s             for y_n < p,
x_{n+1} = (x_n + s − 1) / s   for y_n ≥ p,

and

y_{n+1} = y_n / p             for y_n < p,
y_{n+1} = (y_n − p) / (1 − p) for y_n ≥ p.

It has LCE spectrum Λ = {λ_1, λ_2}, where:

λ_1 = p log(1/p) + (1 − p) log(1/(1 − p)),
λ_2 = p log(1/s) + (1 − p) log(1/s).

Note that λ_1 > 0 and λ_2 < 0. Then, the Lyapunov dimension is:

d_Γ = 1 − λ_1 / λ_2.

To compare this to an IFS, take:

{f^(x)} = { x_n / s, (x_n + s − 1) / s }   and   {p^(x)} = { p, 1 − p }.

Thus, we identify the y coordinate as controlling the stochastic map choice. The dynamic over position in the y direction exactly determines the IFS entropy rate. Since the Baker's map is volume preserving in y, the extra dimension always contributes a plus one in the dimension formula. In other words, the dimension along a slice of constant y equals the IFS dimension.
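To make the correspondence quantitative, the following sketch (parameter names and example values are illustrative) computes the Baker's map LCE spectrum and Lyapunov dimension and compares d_Γ − 1 with the IFS dimension h_µ/|λ| of the associated two-map system; the two agree for any p and s.

import numpy as np

def bakers_map_dimension(p, s):
    """Lyapunov spectrum and Lyapunov (Kaplan-Yorke) dimension of the
    Baker's map with branch probability p and contraction factor 1/s."""
    lam1 = p * np.log(1 / p) + (1 - p) * np.log(1 / (1 - p))  # expanding (y)
    lam2 = p * np.log(1 / s) + (1 - p) * np.log(1 / s)        # contracting (x)
    return lam1, lam2, 1 - lam1 / lam2

def ifs_dimension(p, s):
    """Same quantity from the IFS viewpoint: entropy rate of the map choice
    divided by the magnitude of the (negative) contraction exponent."""
    h_mu = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    lam = np.log(1 / s)
    return h_mu / abs(lam)

if __name__ == "__main__":
    p, s = 0.3, 2.0
    l1, l2, d_baker = bakers_map_dimension(p, s)
    print(d_baker - 1.0, ifs_dimension(p, s))  # identical: the +1 is the y slice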
Appendix D: Sierpinski's Triangle

The Sierpinski triangle is a canonical Cantor set in two dimensions. An HMC that generates an MSP attractor that is the Sierpinski triangle is specified by three 3 × 3 symbol-labeled transition matrices T^(0), T^(1), and T^(2), given in Eq. (S1), whose entries are rational functions of two parameters: s controls the contraction coefficient and a controls the probability of selecting the maps. [The individual entries of Eq. (S1) are not legible in this extraction.] This HMC produces constant probability functions, with p^(0) = as and with p^(1) and p^(2) splitting the remaining probability 1 − as, and, therefore, linear mappings, since f^(i)(η) = ⟨η| T^(i) / p^(i)(η). The constant probability functions make the entropy rate trivial to calculate, and the linearity of the mappings does the same for the Lyapunov exponents.

Setting s = 2 and a = 1/6 results in equal probability for all maps and gives the standard Sierpinski triangle shown in Fig. S2. In this case, the entropy rate is h_µ = log 3 and the Lyapunov exponents are both −log 2. Plugging this into Eq. (13) returns the well-known fractal dimension of the Sierpinski triangle, log 3 / log 2 ≈ 1.585.

FIG. S2. Attractor of the 3-state, 3-symbol machine specified in Eq. (S1), with s = 2 and a = 1/6.
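As a sanity check on these numbers, a minimal chaos-game sketch (vertex coordinates and names are our illustrative choices) samples the s = 2, equal-probability attractor with three contractions toward fixed vertices and recomputes the dimension h_µ/|λ| = log 3 / log 2.

import numpy as np

def sierpinski_points(n=100_000, seed=0):
    """Chaos game for the Sierpinski triangle: three maps, each contracting
    by 1/2 toward one vertex, chosen with equal probability 1/3."""
    rng = np.random.default_rng(seed)
    vertices = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
    pts = np.empty((n, 2))
    z = np.array([0.25, 0.25])
    for i in range(n):
        v = vertices[rng.integers(3)]
        z = (z + v) / 2.0              # contraction by s = 2 toward vertex v
        pts[i] = z
    return pts

# dimension check: h_mu / |lambda| = log 3 / log 2
h_mu, lam = np.log(3), -np.log(2)
print(h_mu / abs(lam))                 # ~1.585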
Appendix E: Lyapunov Exponents

A positive Lyapunov characteristic exponent for a dynamical system measures the exponential rate of separation of trajectories that begin infinitesimally close. Since, typically, the separation rate depends on the direction of the initial separation, we use a spectrum of Lyapunov exponents, with one exponent for each state-space dimension. In a chaotic dynamical system, at least one Lyapunov exponent is positive. In general, the Lyapunov exponent spectrum for an N-dimensional dynamical system with mapping x_{n+1} = F(x_n) depends on the initial condition x_0. However, here we consider ergodic systems, for which the spectrum does not.

Consider the map's Jacobian matrix:

J = ∂F/∂x

and the evolution of vectors in the tangent space, controlled by:

Ẏ = Y J,

where Y(0) = I_N and Y(t) describes how an infinitesimal change in x(0) has propagated to x(t). Let {y_1, ..., y_N} be the eigenvalues of the matrix Y(t) Y(t)^⊤. Then, the Lyapunov exponents are:

λ_i = lim_{t→∞} (1/2t) log y_i.

The Lyapunov numbers were introduced and proven to exist by Oseledets [59]; the Lyapunov exponents are simply the logarithms of the Lyapunov numbers.

The most common way of calculating an IFS's Lyapunov spectrum, employed to produce the results in Fig. 6, is the pull-back method: for the IFS on the (N − 1)-simplex induced by an N-state HMC, the N − 1 exponents are estimated by sampling a long symbol sequence and tracking how tangent vectors contract under the corresponding sequence of mixed-state maps. Two sources of statistical error then arise. First, there is initialization bias or undesired statistical trends introduced by the initial transient data produced by the Markov chain before it reaches the desired stationary distribution. Second, there are errors induced by autocorrelation in equilibrium. That is, the samples produced by the Markov chain are correlated, and the consequence is that statistical error cannot be estimated by 1/√N, as done for N independent samples.

Bounding these error sources requires estimating the autocorrelation function, which can be done from long sequences of samples. If we have the nonunifilar HMC in hand, it is a simple matter of sweeping through increasingly long sequences of generated samples until we observe convergence of the autocorrelation function. An alternative, approximating the infinite-state HMC with a finite-state presentation, is discussed in detail in our previous work [32]. The upshot is that the method used here efficiently leads to accurate estimates of the LCE spectrum. For completeness, we note that there are alternative methods for calculating Lyapunov exponents; see, e.g., Refs. [61, 62]. These may be more appropriate in specific applications. That said, the accuracy and applicability of Lyapunov exponent estimation is not the focus here.
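For concreteness, here is a minimal pull-back-style sketch in the spirit of Refs. [41, 42], not the implementation used for Fig. 6: it samples a symbol path from the labeled matrices, pushes an orthonormal tangent frame through the Jacobian of each applied mixed-state map, and re-orthonormalizes with QR, accumulating exponents after a burn-in that discards the initial transient. All function and variable names are illustrative.

import numpy as np

def simplex_tangent_basis(N):
    """Orthonormal basis (as columns) for the zero-sum hyperplane, the
    tangent space of the (N-1)-simplex embedded in R^N."""
    A = np.eye(N) - np.ones((N, N)) / N
    Q, _ = np.linalg.qr(A)
    return Q[:, : N - 1]

def lce_spectrum(Ts, n_steps=100_000, burn_in=1_000, seed=0):
    """Estimate the mixed-state IFS Lyapunov spectrum of an HMC with labeled
    transition matrices Ts by evolving a tangent frame along a sampled symbol
    path with per-step QR re-orthonormalization (cf. Refs. [41, 42])."""
    rng = np.random.default_rng(seed)
    N = Ts[0].shape[0]
    one = np.ones(N)
    eta = one / N                      # start in the simplex interior
    Q = simplex_tangent_basis(N)
    sums = np.zeros(N - 1)
    kept = 0
    for t in range(n_steps):
        probs = np.array([eta @ Tx @ one for Tx in Ts])
        x = rng.choice(len(Ts), p=probs / probs.sum())
        w = eta @ Ts[x]
        s = w.sum()
        # Jacobian of f^(x)(eta) = eta T^(x) / (eta T^(x) 1), row-vector form
        J = Ts[x] / s - np.outer(Ts[x] @ one, w) / s**2
        eta = w / s
        Q, R = np.linalg.qr(J.T @ Q)   # re-orthonormalize the tangent frame
        if t >= burn_in:
            sums += np.log(np.abs(np.diag(R)))
            kept += 1
    return np.sort(sums / kept)[::-1]

Applied to 3-state HMCs such as those of Appendix A, this returns the two exponents of the mixed-state IFS on the 2-simplex.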
Appendix F: Overlap Estimation

To estimate the size of a mixed-state attractor and the overlap of mapping functions in Fig. 6, a combination of techniques was used. We briefly summarize the method here. First, 250,000 different HMCs were generated using a 500 × 500 parameter grid over α = [0, 1] and x = [0, …]. For each, …,000 mixed states were generated from an initial randomized state, throwing away the first 5,…