Comparing Information-Theoretic Measures of Complexity in Boltzmann Machines

Abstract

In the past three decades, many theoretical measures of complexity have been proposed to help understand complex systems. In this work, for the first time, we place these measures on a level playing field, to explore the qualitative similarities and differences between them, and their shortcomings. Specifically, using the Boltzmann machine architecture (a fully connected recurrent neural network) with uniformly distributed weights as our model of study, we numerically measure how complexity changes as a function of network dynamics and network parameters. We apply an extension of one such information-theoretic measure of complexity to understand incremental Hebbian learning in Hopfield networks, a fully recurrent architecture model of autoassociative memory. In the course of Hebbian learning, the total information flow reflects a natural upward trend in complexity as the network attempts to learn more and more patterns.


Maxinder S. Kanwal *, Joshua A. Grochow and Nihat Ay

University of California, Berkeley; University of Colorado, Boulder; Santa Fe Institute; Max Planck Institute for Mathematics in the Sciences; University of Leipzig

* Correspondence: [email protected]


Keywords: complexity; information integration; information geometry; Boltzmann machine; Hopfield network; Hebbian learning

1. Introduction

Many systems, across a wide array of disciplines, have been labeled “complex”. The striking analogies between these systems [1,2] beg the question: What collective properties do complex systems share and what quantitative techniques can we use to analyze these systems as a whole? With new measurement techniques and ever-increasing amounts of data becoming available about larger and larger systems, we are in a better position than ever before to understand the underlying dynamics and properties of these systems.

While few researchers agree on a specific definition of a complex system, common terms used to describe complex systems include “emergence” and “self-organization”, which characterize high-level properties in a system composed of many simpler sub-units. Often these sub-units follow local rules that can be described with much better accuracy than those governing the global system. Most definitions of complex systems include, in one way or another, the hallmark feature that the whole is more than the sum of its parts.

In the unified study of complex systems, a vast number of measures have been introduced to concretely quantify an intuitive notion of complexity (see, e.g., [3,4]). As Shalizi points out [4], among the plethora of complexity measures proposed, roughly, there are two main threads: those that build on the notion of Kolmogorov complexity and those that use the tools of Shannon’s information theory. There are many systems for which the nature of their complexity seems to stem either from logical/computational/descriptive forms of complexity (hence, Kolmogorov complexity) and/or from information-theoretic forms of complexity. In this paper we focus on information-theoretic measures.

While the unified study of complex systems is the ultimate goal, due to the broad nature of the field, there are still many sub-fields within complexity science [1,2,5]. One such sub-field is the study of networks, and in particular, stochastic networks (broadly defined).
Complexity in a stochastic network is often considered to be directly proportional to the level of stochastic interaction of the units that compose the network; this is where tools from information theory come in handy.

Within the framework of considering stochastic interaction as a proxy for complexity, a few candidate measures of complexity have been developed and refined over the past decade. There is no consensus best measure, as each individual measure frequently captures some aspects of stochastic interaction better than others.

In this paper, we empirically examine four measures (described in detail later): (1) multi-information, (2) synergistic information, (3) total information flow, (4) geometric integrated information. Additional notable information-theoretic measures that we do not examine include those of Tononi et al., first proposed in [6] and most recently refined in [7], as a measure of consciousness, as well as similar measures of integrated information described by Barrett & Seth [8], and Oizumi et al. [9].

The term “humpology,” first coined by Crutchfield [5], attempts to qualitatively describe a long-standing and generally understood feature that a natural measure of complexity ought to have. In particular, as stochasticity varies from 0% to 100%, the structural complexity should be unimodal, with a maximum somewhere in between the extremes [10]. For a physical analogy, consider the spectrum of molecular randomness spanning from a rigid crystal (complete order) to a random gas (complete disorder). At both extremes, we intuitively expect no complexity: a crystal has no fluctuations, while a totally random gas has complete unpredictability across time. Somewhere in between, structural complexity will be maximized (assuming it is always finite).

We now describe the four complexity measures of interest in this study. We assume a compositional structure of the system and consider a finite set $V$ of nodes.
With each node $v \in V$, we associate a finite set $\mathsf{X}_v$ of states. In the prime example of this article, the Boltzmann machine, we have $V = \{1, \dots, N\}$ and $\mathsf{X}_v = \{\pm 1\}$ for all $v$. For any subset $A \subseteq V$, we define the state set of all nodes in $A$ as the Cartesian product $\mathsf{X}_A := \prod_{v \in A} \mathsf{X}_v$ and use the abbreviation $\mathsf{X} := \mathsf{X}_V$. In what follows, we want to consider stochastic processes in $\mathsf{X}$ and assign various complexity measures to these processes. With a probability vector $p(x)$, $x \in \mathsf{X}$, and a stochastic matrix $P(x, x')$, $x, x' \in \mathsf{X}$, we associate a pair $(X, X')$ of random variables satisfying

$$p(x, x') := \Pr(X = x, X' = x') = p(x)\, P(x, x'), \qquad x, x' \in \mathsf{X}. \tag{1}$$

Any such pair of random variables satisfies $\Pr(X = x) = p(x)$, and $\Pr(X' = x' \mid X = x) = P(x, x')$ whenever $p(x) > 0$. As we want to assign complexity measures to transitions of the system state in time, we also use the more suggestive notation $X \to X'$ instead of $(X, X')$. If we iterate the transition, we obtain a Markov chain $X_n = (X_{n,v})_{v \in V}$, $n = 1, 2, \dots$, in $\mathsf{X}$, with

$$p(x_1, x_2, \dots, x_n) := \Pr(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n) = p(x_1) \prod_{k=2}^{n} P(x_{k-1}, x_k), \qquad n = 1, 2, \dots, \tag{2}$$

where, by the usual convention, the product on the right-hand side of this equation equals one if the index set is empty, that is, for $n = 1$. We have $\Pr(X_1 = x) = p(x)$, and $\Pr(X_{n+1} = x' \mid X_n = x) = P(x, x')$ whenever $p(x) > 0$. Throughout the whole paper, we will assume that the probability vector $p$ is stationary with respect to the stochastic matrix $P$. More precisely, we assume that for all $x' \in \mathsf{X}$ the following equality holds:

$$p(x') = \sum_{x \in \mathsf{X}} p(x)\, P(x, x').$$

With this assumption, we have $\Pr(X_n = x) = p(x)$, and the distribution of $(X_n, X_{n+1})$ does not depend on $n$. This will allow us to restrict attention to only one transition $X \to X'$. In what follows, we define various information-theoretic measures associated with such a transition.

1.1.1. Multi-Information, $MI$

The multi-information is a measure proposed by McGill [11] that captures the extent to which the whole is greater than the sum of its parts when averaging over time. For the above random variable $X$, it is defined as

$$MI(X) \triangleq \sum_{v \in V} H(X_v) - H(X), \tag{3}$$

where the Shannon entropy is $H(X) = -\sum_{x \in \mathsf{X}} p(x) \log p(x)$. (Here, and throughout this article, we take logarithms with respect to base 2.) It holds that $MI(X) = 0$ if and only if all of the parts, $X_i$, are mutually independent.

1.1.2. Synergistic Information, $SI$

The synergistic information, proposed by Edlund et al. [12], measures the extent to which the (one-step) predictive information of the whole is greater than that of the parts. (For details related to the predictive information, see [13–15].) It builds on the multi-information by including the dynamics through time in the measure:

$$SI(X \to X') \triangleq I(X; X') - \sum_{v \in V} I(X_v; X'_v), \tag{4}$$

where $I(X; X')$ denotes the mutual information between $X$ and $X'$. One potential issue with the synergistic information is that it may be negative. This is not ideal, as it is difficult to interpret a negative value of complexity.
Furthermore, a preferred baseline minimum value of 0 serves as a reference point against which one can objectively compare systems.

The subsequent two measures (total information flow and geometric integrated information) have geometric formulations that make use of tools from information geometry. In information geometry, the Kullback-Leibler divergence (KL divergence) is used to measure the dissimilarity between two discrete probability distributions. Applied to our context, we measure the dissimilarity between two stochastic matrices $P$ and $Q$ with respect to $p$ by

$$D^{p}_{KL}(P \,\|\, Q) = \sum_{x \in \mathsf{X}} p(x) \sum_{x' \in \mathsf{X}} P(x, x') \log \frac{P(x, x')}{Q(x, x')}. \tag{5}$$

For simplicity, let us assume that $P$ and $Q$ are strictly positive and that $p$ is the stationary distribution of $P$. In that case, we do not explicitly refer to the stationary distribution $p$ and simply write $D_{KL}(P \,\|\, Q)$. The KL divergence between $P$ and $Q$ can be interpreted by considering their corresponding Markov chains with distributions (2) (e.g., see [16] for additional details on this formulation). Denoting the chain of $P$ by $X_n$, $n = 1, 2, \dots$, and the chain of $Q$ by $Y_n$, $n = 1, 2, \dots$, with some initial distributions $p$ and $q$, respectively, we obtain

$$\begin{aligned}
&\frac{1}{n} \sum_{x_1, x_2, \dots, x_n} \Pr(X_1 = x_1, \dots, X_n = x_n) \log \frac{\Pr(X_1 = x_1, \dots, X_n = x_n)}{\Pr(Y_1 = x_1, \dots, Y_n = x_n)} \\
&\quad = \frac{1}{n} \Bigg( \sum_{x} \Pr(X_1 = x) \log \frac{\Pr(X_1 = x)}{\Pr(Y_1 = x)} + \sum_{k=1}^{n-1} \sum_{x} \Pr(X_k = x) \sum_{x'} \Pr(X_{k+1} = x' \mid X_k = x) \log \frac{\Pr(X_{k+1} = x' \mid X_k = x)}{\Pr(Y_{k+1} = x' \mid Y_k = x)} \Bigg) \\
&\quad = \frac{1}{n} \sum_{x} p(x) \log \frac{p(x)}{q(x)} + \frac{n-1}{n} \sum_{x} p(x) \sum_{x'} P(x, x') \log \frac{P(x, x')}{Q(x, x')} \\
&\quad \xrightarrow{\; n \to \infty \;} D_{KL}(P \,\|\, Q).
\end{aligned}$$

We can use the KL divergence (5) to answer our original question (“To what extent is the whole greater than the sum of its parts?”) by comparing a system of interest to its most similar (least dissimilar) system whose whole is exactly equal to the sum of its parts. When comparing a transition $P$ to $Q$ using the KL divergence, one measures the amount of information lost when $Q$ is used to approximate $P$. Hence, by constraining $Q$ to be equal to the sum of its parts, we can then arrive at a natural measure of complexity by taking the minimum extent to which our distribution $P$ is greater (in the sense that it contains more information) than some distribution $Q$, since $Q$ represents a system of zero complexity. Formally, one defines a manifold $\mathcal{S}$ of so-called “split” systems consisting of all those distributions that are equal to the sum of their parts, and then measures the minimum distance to that manifold:

$$\mathrm{Complexity}(P) \triangleq \min_{Q \in \mathcal{S}} D_{KL}(P \,\|\, Q). \tag{6}$$

It is important to note here that there are many different viable choices of split manifold $\mathcal{S}$. This approach was first introduced by Ay for a general class of manifolds $\mathcal{S}$ [17]. Amari [18] and Oizumi et al. [19] proposed variants of this quantity as measures of information integration. In what follows, we consider measures of the form (6) for two different choices of $\mathcal{S}$.

1.1.3. Total Information Flow, $IF$

The total information flow, also known as the stochastic interaction, expands on the multi-information (like $SI$) to include temporal dynamics. Proposed by Ay in [17,20], the measure can be expressed by constraining $Q$ to the manifold of distributions, $\mathcal{S}^{(1)}$, where there exist functions $f_v(x_v, x'_v)$, $v \in V$, such that $Q$ is of the form:

$$Q(x, x') = Q\big((x_v)_{v \in V}, (x'_v)_{v \in V}\big) = \frac{e^{\sum_{v \in V} f_v(x_v, x'_v)}}{Z(x)}, \tag{7}$$

where $Z(x)$ denotes the partition function that properly normalizes the distribution.
Note that any stochastic matrix of this kind satisfies the property that $Q(x, x') = \prod_{v \in V} \Pr(X'_v = x'_v \mid X_v = x_v)$. This results in

$$IF(X \to X') \triangleq \min_{Q \in \mathcal{S}^{(1)}} D_{KL}(P \,\|\, Q) \tag{8}$$

$$\phantom{IF(X \to X')} = \sum_{v \in V} H(X'_v \mid X_v) - H(X' \mid X). \tag{9}$$

The total information flow is non-negative, as are all measures that can be expressed as a KL divergence. One issue of note, as pointed out in [18,19], is that $IF(X \to X')$ can exceed $I(X; X')$. One can formulate the mutual information $I(X; X')$ as

$$I(X; X') = \min_{Q \in \mathcal{S}^{(2)}} D_{KL}(P \,\|\, Q), \tag{10}$$

where $\mathcal{S}^{(2)}$ consists of stochastic matrices $Q$ that satisfy

$$Q(x, x') = Q\big((x_v)_{v \in V}, (x'_v)_{v \in V}\big) = \frac{e^{f_V(x')}}{Z(x)}, \tag{11}$$

for some function $f_V(x')$. Under this constraint, $Q(x, x') = \Pr(X' = x')$. In other words, all spatio-temporal interactions $X \to X'$ are lost. Thus, it has been postulated that no measure of information integration, such as the total information flow, should exceed the mutual information [9]. The total information flow violates this postulate because $IF(X \to X')$ quantifies same-time interactions in $X'$ (due to the lack of an undirected edge in the output in Figure 1B). Consider, for instance, a stochastic matrix $P$ that satisfies (11), $P(x, x') = p(x')$ for some probability vector $p$. In that case we have $I(X; X') = 0$. Yet, (9) then reduces to the multi-information (3) of $X' = (X'_v)_{v \in V}$, which is a measure of stochastic dependence.
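The example above can be checked numerically. The following Python sketch is illustrative only and is not the code used for the results in this paper; the helper names (`entropy`, `marginal`, `mutual_information`, `information_flow`, `multi_information`) and the particular distribution `r` are assumptions introduced here. It builds a two-node transition whose output ignores the input, $P(x, x') = r(x')$, and evaluates $I$ from its definition, $IF$ from (9), and $MI$ from (3).

```python
import itertools
import math

def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)

def marginal(joint, key):
    """Sum a joint dict {(x, x'): prob} onto the outcome key(x, x')."""
    out = {}
    for (x, y), q in joint.items():
        k = key(x, y)
        out[k] = out.get(k, 0.0) + q
    return out

def mutual_information(joint):
    """I(X; X') = H(X) + H(X') - H(X, X')."""
    h_in = entropy(marginal(joint, lambda x, y: x))
    h_out = entropy(marginal(joint, lambda x, y: y))
    return h_in + h_out - entropy(joint)

def information_flow(joint, n):
    """IF via Eq. (9): sum_v H(X'_v | X_v) - H(X' | X)."""
    h_whole = entropy(joint) - entropy(marginal(joint, lambda x, y: x))
    h_parts = 0.0
    for v in range(n):
        h_pair = entropy(marginal(joint, lambda x, y: (x[v], y[v])))
        h_single = entropy(marginal(joint, lambda x, y: x[v]))
        h_parts += h_pair - h_single
    return h_parts - h_whole

def multi_information(dist, n):
    """MI via Eq. (3) for a dict {state: probability} over n nodes."""
    as_joint = {(x, x): q for x, q in dist.items()}  # lets us reuse marginal()
    parts = sum(entropy(marginal(as_joint, lambda x, y: x[v])) for v in range(n))
    return parts - entropy(dist)

# Hypothetical output distribution r(x') with correlated nodes (an assumption).
r = {(+1, +1): 0.4, (-1, -1): 0.4, (+1, -1): 0.1, (-1, +1): 0.1}

# P(x, x') = r(x') has stationary distribution r, so p(x, x') = r(x) * r(x').
joint = {(x, y): r[x] * r[y] for x, y in itertools.product(r, repeat=2)}

info = mutual_information(joint)     # mutual information I(X; X')
flow = information_flow(joint, 2)    # total information flow IF
mi_out = multi_information(r, 2)     # multi-information of X'
print(info, flow, mi_out)
```

Under these assumptions, `info` vanishes up to floating-point error while `flow` coincides with `mi_out`, which is the behavior the argument in the text predicts.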

[Figure 1 panels, graphical models of a two-node network with time running from $X$ to $X'$: Full Model, $P(x, x')$; (B) Total Information Flow, $Q(x, x')$; (C) Geometric Integrated Information, $Q(x, x')$.]

Figure 1.

Using graphical models, we can visualize different ways to define the “split” constraint on manifold $\mathcal{S}$ in (6). Here we consider a two-node network $X = (X_1, X_2)$ and its spatio-temporal stochastic interactions. (A) $I(X; X')$ uses constraint (11). (B) $IF(X \to X')$ uses constraint (7). (C) $\Phi_G(X \to X')$ uses constraint (13). Dashed lines represent correlations that either may or may not be present in the input distribution $p$. We do not represent these correlations with solid lines in order to highlight (with solid lines) the structure imposed on the stochastic matrices. Adapted and modified from [19].

1.1.4. Geometric Integrated Information, $\Phi_G$

In order to obtain a measure of information integration that does not exceed the mutual information $I(X; X')$, Amari [18] (Section 6.9) defines $\Phi_G(X \to X')$ as

$$\Phi_G(X \to X') \triangleq \min_{Q \in \mathcal{S}^{(3)}} D_{KL}(P \,\|\, Q), \tag{12}$$

where $\mathcal{S}^{(3)}$ contains not only the split matrices (7), but also those matrices that satisfy (11). More precisely, the set $\mathcal{S}^{(3)}$ consists of all stochastic matrices for which there exist functions $f_v(x_v, x'_v)$, $v \in V$, and $f_V(x')$ such that

$$Q(x, x') = Q\big((x_v)_{v \in V}, (x'_v)_{v \in V}\big) = \frac{e^{\sum_{v \in V} f_v(x_v, x'_v) + f_V(x')}}{Z(x)}. \tag{13}$$

Here, $Q$ belongs to the set of matrices where only time-lagged interactions are removed. Note that the manifold $\mathcal{S}^{(3)}$ contains $\mathcal{S}^{(1)}$, the model of split matrices used for $IF$, as well as $\mathcal{S}^{(2)}$, the manifold used for the mutual information. This measure thus satisfies both postulates that $SI$ and $IF$ only partially satisfy:

$$0 \le \Phi_G(X \to X') \le I(X; X'). \tag{14}$$

However, unlike $IF(X \to X')$, there is no closed-form expression to use when computing $\Phi_G(X \to X')$.
In this paper, we use the iterative scaling algorithm described in [21] (Section 5.1) to compute $\Phi_G(X \to X')$ for the first time in concrete systems of interest.

Note that, in defining $\Phi_G(X \to X')$, the notion of a split model used by Amari [18] is related, but not identical, to that used by Oizumi et al. [19]. The manifold considered in the latter work is defined in terms of conditional independence statements and forms a curved exponential family.

In the remainder of this article, we also use the shorthand notation $MI$, $SI$, $IF$, and $\Phi_G$, without explicit reference to $X$ and $X'$, as already indicated in each measure’s respective subsection heading. We also use $I$ as shorthand for the mutual information.

1.2. Boltzmann Machine

In this paper, we look at the aforementioned candidate measures in a concrete system in order to gain an intuitive sense of what is frequently discussed at a heavily theoretical and abstract level. Our system of interest is the Boltzmann machine (a fully-recurrent neural network with sigmoidal activation units).

We parameterize a network of $N$ binary nodes by $W \in \mathbb{R}^{N \times N}$, which denotes the connectivity matrix of weights between each directed pair of nodes. Each node $i$ takes a value $X_i \in \{\pm 1\}$, and updates to $X'_i \in \{\pm 1\}$ according to:

$$\Pr(X'_i = +1 \mid X) = \mathrm{sigmoid}\Big(\beta \sum_{j=1}^{N} w_{ji} \cdot X_j\Big), \tag{15}$$

where $\mathrm{sigmoid}(t) = \frac{1}{1 + e^{-t}}$, $\beta$ denotes a global inverse-temperature parameter, and $w_{ji}$ denotes the directed weight from $X_j$ to $X_i$.

[Figure 2 axes: horizontal, $\sum_{j \in V} w_{ji} \cdot X_j$; vertical, $\Pr(X'_i = +1 \mid X)$.]

Figure 2.

The sigmoidal update rule as a function of the global inverse-temperature: As $\beta$ increases, the stochastic update rule becomes closer to the deterministic one given by a step function.

This stochastic update rule implies that every node updates probabilistically according to a weighted sum of the node’s parents (or inputs), which, in the case of our fully recurrent neural network, is every node in the network. Every node $i$ has some weight, $w_{ij}$, with which it influences node $j$ on the next update. As the weighted sum of the inputs to a node becomes more positive, the likelihood of that node updating to the state $+1$ increases. The stochasticity of the update is set by the parameter $\beta$, commonly known as the global inverse-temperature of the network. $\beta$ effectively controls the extent to which the system is influenced by random noise: It quantifies the system’s deviation from deterministic updating. In networks, the noise level directly correlates with what we call the “pseudo-temperature” $T$ of the network, where $T = 1/\beta$. To contextualize what $T$ might represent in a real-life complex system, consider the example of a biological neural network, where we can think of the pseudo-temperature as a parameter that encompasses all of the variables (beyond just a neuron’s synaptic inputs) that influence whether a neuron fires or not in a given moment (e.g., delays in integrating inputs, random fluctuations from the release of neurotransmitters in vesicles, firing of variable strength). As $\beta \to 0$ ($T \to \infty$), the interactions are governed entirely by randomness. On the other hand, as $\beta \to \infty$ ($T \to 0$), the updates become deterministic. For finite $\beta$ there is always a unique stationary distribution on the stochastic network state space.
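To make the model concrete, the sketch below builds the full transition matrix of a small Boltzmann machine from update rule (15) and approximates its stationary distribution by power iteration. It is an illustration under stated assumptions rather than the implementation behind the results in this paper: all nodes are assumed to update simultaneously and conditionally independently given the current state, the weight matrix `w` is a made-up 3-node example, and the function names are hypothetical.

```python
import itertools
import math

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def transition_matrix(w, beta):
    """Return (states, P) with P[a][b] = Pr(X' = states[b] | X = states[a]).

    Assumes synchronous updates: P(x, x') = prod_i Pr(X'_i = x'_i | X = x),
    where w[j][i] is the directed weight from node j to node i.
    """
    n = len(w)
    xs = list(itertools.product((+1, -1), repeat=n))
    P = []
    for x in xs:
        # Probability that each node i takes the value +1 on the next step.
        on = [sigmoid(beta * sum(w[j][i] * x[j] for j in range(n)))
              for i in range(n)]
        row = []
        for y in xs:
            pr = 1.0
            for i in range(n):
                pr *= on[i] if y[i] == +1 else 1.0 - on[i]
            row.append(pr)
        P.append(row)
    return xs, P

def stationary_distribution(P, iters=10_000, tol=1e-13):
    """Power iteration p <- p P, started from the uniform distribution."""
    m = len(P)
    p = [1.0 / m] * m
    for _ in range(iters):
        q = [sum(p[a] * P[a][b] for a in range(m)) for b in range(m)]
        converged = max(abs(q[b] - p[b]) for b in range(m)) < tol
        p = q
        if converged:
            break
    return p

# Hypothetical 3-node weight matrix, chosen only for illustration.
w = [[0.0, 0.5, -0.3],
     [0.2, 0.0, 0.8],
     [-0.6, 0.4, 0.0]]

xs, P = transition_matrix(w, beta=1.0)
p = stationary_distribution(P)
```

Because every entry of `P` is strictly positive at finite `beta`, the power iteration converges to the unique stationary distribution; at `beta = 0` every entry of the matrix equals $2^{-N}$, matching the fully random limit described above.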

2. Results

What follows are plots comparing and contrasting the four introduced complexity measures in their specified settings. The qualitative trends shown in the plots empirically hold regardless of network size; a 5-node network was used to generate the plots below.

In Figure 3a, we see that when weights are uniformly distributed between 0 and 1, $IF$ and $\Phi_G$ are very similar qualitatively, with the additional property that $\Phi_G \le IF$, which directly follows from $\mathcal{S}^{(1)} \subseteq \mathcal{S}^{(3)}$. $MI$ monotonically increases, which contradicts the intuition prescribed by humpology. Finally, $SI$ is peculiar in that it is not lower-bounded by 0. This makes for difficult interpretation: What does a negative complexity mean as opposed to zero complexity? Furthermore, in Figure 3b, we see that $\Phi_G$ satisfies constraint (14), with the mutual information in fact upper bounding both $IF$ and $\Phi_G$.

[Figure 3 panels: complexity plotted against $\beta$; (a) shows $MI$, $SI$, $IF$, and $\Phi_G$; (b) shows $I$, $IF$, and $\Phi_G$.]

Figure 3. (a) Measures of complexity when using random weight initializations sampled uniformly between 0 and 1 (averaged over 100 trials, with error bars). (b) The mutual information $I$ upper bounds $IF$ and $\Phi_G$ when using random weight initializations sampled uniformly between 0 and 1 (averaged over 100 trials, with error bars).
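A small worked case, constructed here for illustration and not taken from the experiments, shows how (4) can go negative. Assume two nodes that always agree, $X_1 = X_2$, with each value $\pm 1$ having probability $1/2$, and let the dynamics copy the state unchanged, $X' = X$:

```latex
\begin{align*}
I(X; X') &= H(X) = 1 \text{ bit}, \\
I(X_v; X'_v) &= H(X_v) = 1 \text{ bit}, \quad v \in \{1, 2\}, \\
SI(X \to X') &= I(X; X') - \sum_{v \in V} I(X_v; X'_v) = 1 - 2 = -1 \text{ bit}.
\end{align*}
```

Each node on its own predicts its next state perfectly, so the parts together appear to carry more predictive information than the whole, and this redundancy shows up as a negative value.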

It is straightforward to see the symmetry between selecting weights uniformly between 0 and $+1$ and between $-1$ and 0. Figure 4 shows single instances in which weights are sampled uniformly between $-1$ and $+1$. In Figure 4a, $IF$ monotonically increases (like $MI$ in Figure 3a), a departure from the humpology intuition. Meanwhile, $\Phi_G$ behaves qualitatively differently, such that $\Phi_G \to 0$ as $\beta \to \infty$. In Figure 4b, we see an instance where all measures limit to some non-zero value as $\beta \to \infty$. Finally, in Figure 4c, we see an instance where $IF$ exceeds $I$ while $\Phi_G$ satisfies constraint (14), despite the common unimodality of both measures.

An overly simplistic interpretation of the idea that humpology attempts to capture may lead one to believe that Figure 4b is a negative result discrediting all four measures. We claim, however, that this result suggests that the simple humpology intuition described in Section 1.1 needs additional nuance when applied to quantifying the complexity of dynamical systems. In Figure 4b, we observe a certain richness to the network dynamics, despite its deterministic nature. A network dynamics that deterministically oscillates around a non-trivial attractor is not analogous to the “frozen” state of a rigid crystal (no complexity). Rather, one may instead associate the crystal state with a network whose dynamics is the identity map, which can indeed be represented by a split stochastic matrix. Therefore, whenever the stochastic matrix $P$ converges to the identity matrix (the “frozen” matrix) for $\beta \to \infty$, the complexity will asymptote to zero (as in Figure 3b). In other words, for dynamical systems, a “frozen” system is exactly that: a network dynamics that has settled into a single fixed-point dynamics. Consequently, in our results, as $\beta \to \infty$, we should expect that the change in complexity depends on the dynamics that the network is settling into as it becomes deterministic, and the corresponding richness (e.g., number of attractors and their lengths) of that asymptotic dynamics.
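The contrast between a frozen network and a deterministic but non-trivial dynamics can be illustrated with a toy calculation. The sketch below is an example built on explicit assumptions and is not part of the experiments reported here: it takes two deterministic maps on a two-node network (the identity and a hypothetical swap of the two node states), uses the uniform distribution (stationary for both, since each map permutes the state space), and evaluates $IF$ as the KL divergence from $P$ to the product of node-wise conditionals. The function name `information_flow_deterministic` is introduced here.

```python
import itertools
import math

def information_flow_deterministic(step, n):
    """IF in bits for a deterministic map x -> step(x) on n binary nodes.

    Assumes a uniform input distribution. For deterministic P, the KL
    divergence to Q(x, x') = prod_v Pr(X'_v = x'_v | X_v = x_v) reduces to
    -sum_x p(x) * sum_v log2 Pr(X'_v = step(x)_v | X_v = x_v).
    """
    xs = list(itertools.product((+1, -1), repeat=n))
    p = 1.0 / len(xs)
    total = 0.0
    for v in range(n):
        # Joint distribution of (X_v, X'_v) and marginal of X_v.
        pair, single = {}, {}
        for x in xs:
            y = step(x)
            pair[(x[v], y[v])] = pair.get((x[v], y[v]), 0.0) + p
            single[x[v]] = single.get(x[v], 0.0) + p
        for x in xs:
            y = step(x)
            cond = pair[(x[v], y[v])] / single[x[v]]  # Pr(X'_v | X_v)
            total -= p * math.log2(cond)
    return total

# "Frozen" dynamics: every state is a fixed point.
if_identity = information_flow_deterministic(lambda x: x, 2)

# Hypothetical swap dynamics: each node copies the other node's state.
if_swap = information_flow_deterministic(lambda x: (x[1], x[0]), 2)

print(if_identity, if_swap)  # 0.0 2.0
```

The identity map is a split matrix and yields zero, while the swap map, although equally deterministic, yields 2 bits because each node's next state is determined entirely by the other node.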
[Figure 4 panels (a), (b), (c): complexity plotted against $\beta$; (a) and (b) show $MI$, $SI$, $IF$, and $\Phi_G$; (c) shows $I$, $IF$, and $\Phi_G$.]

Figure 4. Measures of complexity in single instances of using random weight initializations sampled uniformly between $-1$ and 1.

So far, it may seem to be the case that $\Phi_G$ is without flaw; however, there are shortcomings that warrant further study. In particular, in formulating $\Phi_G$, the undirected output edge in Figure 5B (purple) was deemed necessary to avoid quantifying external influences to the system that $IF$ would consider as intrinsic information flow. Yet, in the model studied here (the Boltzmann machine) there are no such external influences (i.e., $Y =$ […] $\Phi_G$ and $IF$ in our setting. More precisely, a full model that lacks an undirected output edge at the start should not lead to a “split”-projection that incorporates such an edge. However, this is not generally true for the projection that $\Phi_G$ computes because the undirected output edge present in the split model will in fact capture causal interactions within the system by deviously interpreting them as same-time interactions in the output (Figure 5). This counterintuitive phenomenon suggests that we should have preferred $IF$ to be precisely equal to its ideal form $\Phi_{\mathrm{ideal}}$ in the case of the Boltzmann machine, and yet, almost paradoxically, this would imply that the improved form would still violate constraint (14). This puzzling conundrum begs further study of how to properly disentangle external influences when attempting to strictly quantify the intrinsic causal interactions.

The preceding phenomenon, in fact, also calls into question the very postulate that the mutual information ought to be an upper bound on information integration.
As we see in Figure 5A, the undirected output edge used in the “split”-projection for computing the mutual information $I$ is capable of producing the very same problematic phenomenon. Thus, the mutual information does not fully quantify the total causal influences intrinsic to a system. In fact, the assumption itself that $I$ quantified the total intrinsic causal influences was based on the assumption that one can distinguish between intrinsic and extrinsic influences in the first place, which may not be the case.

[Figure 5 panels: Full Model, $P(x, x')$, with internal nodes $X$ and external nodes $Y$; (A) Mutual Information, $Q(x, x')$; (B) Geometric Integrated Information, $Q(x, x')$.]

Figure 5. A full model (left) can have both intrinsic (blue) and extrinsic (red) causal interactions contributing to its overall dynamics. Split models (A,B) formulated with an undirected output edge (purple) attempt to exclusively quantify extrinsic causal interactions (so as to strictly preserve intrinsic causal interactions after the “split”-projection). However, the output edge can end up explaining away interactions from both external factors $Y$ and (some) internal factors $X$ (red + blue = purple). As a result, using such a family of split models does not properly capture the total intrinsic causal interactions present in a system.

3. Application

In this section, we apply one of the preceding measures ($IF$) and examine its dynamics during network learning. We wish to exemplify the insights one can gain by exploring measures of complexity in a more general sense. The results presented in Section 2 showed the promising nature of information-geometric formulations of complexity, such as $IF$ and $\Phi_G$. Here, however, we restrict ourselves to studying $IF$ as a first step due to the provable properties of its closed-form expression that we are able to exploit to study it in greater depth in the context of autoassociative memory networks. It would be useful to extend this analysis to $\Phi_G$, but that is beyond the scope of this work.

Autoassociative memory in a network is a form of “collective computation” where, given an incomplete input pattern, the network can accurately recall a previously stored pattern by evolving from the input to the stored pattern. For example, a pattern might be a binary image, in which each pixel in the image corresponds to a node in the network with a value in $\{-1, +1\}$. In this case, an autoassociative memory model with a stored image could then take as input a noisy version of the stored image and accurately recall the fully denoised original image. This differs from a “serial computation” approach to the same problem where one would simply store the patterns in a database and, when given an input, search all images in the database for the most similar stored image to output.

One mechanism by which a network can achieve collective computation has deep connections to concepts from statistical mechanics (e.g., the Ising model, Glauber dynamics, Gibbs sampling). This theory is explained in detail in [22]. The clever idea behind autoassociative memory models heavily leverages the existence of an energy function (sometimes called a Lyapunov function) to govern the evolution of the network towards a locally minimal energy state.
Thus, by engineering the network's weighted edges such that local minima in the energy function correspond to stored patterns, one can show that if an input state is close enough (in Hamming distance) to a desired stored state, then the network will evolve towards the correct lower-energy state, which will in fact be a stable fixed point of the network.

The above, however, is only true up to a limit. A network can only store so many patterns before it becomes saturated. As more and more patterns are stored, various problems arise, such as desirable fixed points becoming unstable optima, as well as the emergence of unwanted fixed points in the network that do not correspond to any stored patterns (i.e., spin glass states).

In 1982, Hopfield put many of these ideas together to formalize what is today known as the Hopfield model, a fully recurrent neural network capable of autoassociative memory. Hopfield's biggest contribution in his seminal paper was assigning an energy function to the network model:

$$E = -\frac{1}{2} \sum_{i,j} w_{ij} X_i X_j. \tag{16}$$

For our study, we assume that we are storing random patterns in the network. In this scenario, Hebb's rule (Equation (17)) is a natural choice for assigning weights to each connection between nodes in the network such that the random patterns are close to stable local minimizers of the energy function.

Let $\{\xi^{(1)}, \xi^{(2)}, \ldots, \xi^{(T)}\}$ denote the set of $N$-bit binary patterns that we desire to store. Then, under Hebb's rule, the weight between nodes $i$ and $j$ should be assigned as follows:

$$w_{ij} = \frac{1}{T} \sum_{\mu=1}^{T} \xi_i^{(\mu)} \xi_j^{(\mu)}, \tag{17}$$

where $\xi_i^{(\mu)}$ denotes the $i$th bit of pattern $\xi^{(\mu)}$. Notice that all weights are symmetric, $w_{ij} = w_{ji}$. Hebb's rule is frequently used to model learning, as it is both local and incremental, two desirable properties of a biologically plausible learning rule.
Hebb's rule is local because weights are set based strictly on local information (i.e., the two nodes that the weight connects) and is incremental because new patterns can be learned one at a time without having to reconsider information from already learned patterns. Hence, under Hebb's rule, training a Hopfield network is relatively simple and straightforward.

The update rule that governs the network's dynamics is the same sigmoidal function used in the Boltzmann machine described in Section 1.2. We will have this update rule take effect synchronously for all nodes (note: Hopfield's original model was described in the asynchronous, deterministic case but can also be studied more generally):

$$\Pr(X'_i = +1 \mid X) = \frac{1}{1 + e^{-\beta \sum_{j \in V} X_j \cdot w_{ji}}}. \tag{18}$$

At finite $\beta$, our Hopfield model obeys a stochastic sigmoidal update rule. Thus, there exists a unique and strictly positive stationary distribution of the network dynamics.

Here, we study incremental Hebbian learning, in which multiple patterns are stored in a Hopfield network in succession. We use total information flow (Section 1.1.3) to explore how incremental Hebbian learning changes complexity, or more specifically, how the complexity relates to the number of patterns stored.

Before continuing, we wish to make clear an important disclaimer: the results we describe are qualitatively different when one uses asynchronous dynamics instead of the synchronous dynamics used here. With asynchronous dynamics, no significant overall trend manifests, but other phenomena emerge that need further exploration.

When we synchronously update nodes, we see very interesting behavior during learning: incremental Hebbian learning appears to increase complexity, on average (Figures 6a, 6b). The dependence on $\beta$ is not entirely clear, but as one can infer from Figures 6a and 6b, it appears that increasing $\beta$ increases the magnitude of the average complexity while learning, while also increasing the variance of the complexity.
So as $\beta$ increases, the average case becomes more and more unrepresentative of the individual cases of incremental Hebbian learning.

[Figure 6 plots omitted: panels (a) and (b) show complexity (vertical axis) under synchronous updating, one panel per value of $\beta$; the panel titles give $N$, $\beta$, and the number of trials.]

Figure 6.

Incremental Hebbian learning in a 9-node stochastic Hopfield network with synchronous updating (averaged over 100 trials of storing random 9-bit patterns).
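The experiment behind Figure 6 can be sketched in a few lines of standard-library Python. This is only an illustrative sketch, not the code used for the paper: the function names, the network size `n = 3`, the value `beta = 1.0`, the random seed, and the use of power iteration to approximate the stationary distribution are all assumptions made here. The sigmoid is implemented as Equation (18) is written above, and entropies are measured in bits.

```python
import itertools
import math
import random


def hebbian_weights(patterns):
    """Hebb's rule, Equation (17): w_ij = (1/T) * sum over patterns of xi_i * xi_j."""
    T, n = len(patterns), len(patterns[0])
    return [[sum(p[i] * p[j] for p in patterns) / T for j in range(n)]
            for i in range(n)]


def transition_matrix(w, beta):
    """Stochastic matrix of the synchronous sigmoidal update, Equation (18)."""
    n = len(w)
    states = list(itertools.product([-1, 1], repeat=n))
    P = []
    for x in states:
        # Probability that each node is +1 at the next step, given state x.
        up = [1.0 / (1.0 + math.exp(-beta * sum(x[j] * w[j][i] for j in range(n))))
              for i in range(n)]
        # Nodes update independently given x, so each row entry is a product.
        P.append([math.prod(up[i] if y[i] == 1 else 1.0 - up[i] for i in range(n))
                  for y in states])
    return states, P


def stationary(P, iters=2000):
    """Approximate the stationary distribution by repeatedly applying p <- p P."""
    m = len(P)
    p = [1.0 / m] * m
    for _ in range(iters):
        p = [sum(p[a] * P[a][b] for a in range(m)) for b in range(m)]
    return p


def cond_entropy(joint):
    """H(B | A) in bits, from a dict mapping (a, b) to a joint probability."""
    pa = {}
    for (a, _), q in joint.items():
        pa[a] = pa.get(a, 0.0) + q
    return -sum(q * math.log2(q / pa[a]) for (a, _), q in joint.items() if q > 0)


def information_flow(states, P, p):
    """IF = sum_v H(X'_v | X_v) - H(X' | X) for input distribution p."""
    full = {(x, y): p[a] * P[a][b]
            for a, x in enumerate(states) for b, y in enumerate(states)}
    total = -cond_entropy(full)
    for v in range(len(states[0])):
        marg = {}
        for (x, y), q in full.items():
            marg[(x[v], y[v])] = marg.get((x[v], y[v]), 0.0) + q
        total += cond_entropy(marg)
    return total


if __name__ == "__main__":
    random.seed(0)  # arbitrary seed, for repeatability only
    n, beta = 3, 1.0  # illustrative values, not those used for Figure 6
    patterns = []
    for t in range(1, 5):
        # Store one more random pattern, then measure IF at stationarity.
        patterns.append([random.choice([-1, 1]) for _ in range(n)])
        states, P = transition_matrix(hebbian_weights(patterns), beta)
        print(t, information_flow(states, P, stationary(P)))
```

Because the full $2^N \times 2^N$ matrix is enumerated, this approach is only practical for small networks. Averaging the printed values over many independent pattern draws would give a curve comparable in spirit to those in Figure 6.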

We can also study the deterministic version of the Hopfield model. This corresponds to letting $\beta \to \infty$ in the stochastic model. With a deterministic network, many stationary distributions on the network dynamics may exist, unlike in the stochastic case. As discussed above, if we want to recall a stored image, we would like for that image to be a fixed point in the network (corresponding to a stationary distribution equal to the Dirac measure at that state). Storing multiple images corresponds to the desire to have multiple Dirac measures acting as stationary distributions of the network. Furthermore, in the deterministic setting the nodal update rule becomes a step rather than a sigmoid function.

Without a unique stationary distribution in the deterministic setting, we must decide how to select an input distribution to use in calculating the complexity. If there are multiple stationary distributions in a network, not all starting distributions on the network eventually lead to a single stationary distribution (as was the case in the stochastic model), but instead the stationary distribution that the network eventually reaches is sensitive to the initial state of the network. When there are multiple stationary distributions, there are actually infinitely many stationary distributions, as any convex combination of stationary distributions is also stationary. If there exist $N$ orthogonal stationary distributions of a network, then there is in fact an entire $(N - 1)$-simplex of stationary distributions, any of which could be used as the input distribution for calculating the complexity.

In order to address this issue, it is fruitful to realize that the complexity measure we are working with is concave with respect to the input distribution (Theorem A1 in Appendix A). As a function of the input distribution, there is thus an "apex" to the complexity.
In other words, there is a unique local maximum value of the complexity function, which is therefore also the global maximum (although the maximizer need not be unique, since the complexity is not strictly concave). This means that the optimization problem of finding the supremum over the entire complexity landscape with respect to the input distribution is relatively simple and can be viably solved via standard gradient-based methods.

We can naturally define a new quantity to measure the complexity of a stochastic matrix $P$ in this setting, the complexity capacity:

$$C_{\mathrm{cap}}(X \to X' \mid P) \triangleq \max_{p} C(X \to X' \mid p, P), \tag{19}$$

where the maximum is taken over all stationary distributions $p$ of $P$. Physically, the complexity capacity measures the maximal extent (over possible input distributions) to which the whole is more than the sum of its parts. By considering the entire convex hull of stationary input distributions and optimizing for complexity, we can find this unique maximal value and use it to represent the complexity of a network with multiple stationary distributions.

Again, in the synchronous-update setting, we see that incremental Hebbian learning increases complexity capacity (Figures 7a, 7b). It is also worth noting that the complexity capacity in this setting approaches the absolute upper bound on the complexity, which can never exceed the number of binary nodes in the network. Physically, this corresponds to each node attempting to store one full bit (the most information a binary node can store), and all of this information flowing through the network between time-steps, as more and more patterns are learned. This approach of the complexity capacity towards a maximum (as the network saturates with information) is more gradual as the size of the network increases. This observed behavior matches the intuition that larger networks should be able to store more information than smaller networks.
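To make the complexity capacity concrete, the following sketch works through a toy case by hand rather than a Hopfield network from the experiments. It assumes a hypothetical two-node deterministic network in which each node copies the other at every synchronous step (`swap`); its stationary distributions are exactly those with $p(+1,-1) = p(-1,+1)$. The function names and the coarse grid search (used in place of a gradient-based method) are likewise assumptions for illustration, and entropies are in bits.

```python
import math


def swap(x):
    """Toy deterministic synchronous update: each node copies the other."""
    return (x[1], x[0])


def cond_entropy(joint):
    """H(B | A) in bits, from a dict mapping (a, b) to a joint probability."""
    pa = {}
    for (a, _), q in joint.items():
        pa[a] = pa.get(a, 0.0) + q
    return -sum(q * math.log2(q / pa[a]) for (a, _), q in joint.items() if q > 0)


def information_flow(p):
    """IF = sum_v H(X'_v | X_v) - H(X' | X) for the swap dynamics."""
    full = {(x, swap(x)): q for x, q in p.items()}
    total = -cond_entropy(full)  # zero here, since the dynamics are deterministic
    for v in range(2):
        marg = {}
        for (x, y), q in full.items():
            marg[(x[v], y[v])] = marg.get((x[v], y[v]), 0.0) + q
        total += cond_entropy(marg)
    return total


def stationary_family(a, b):
    """Stationary distribution of swap with p(+,+) = a and p(-,-) = b."""
    c = max(0.0, 1.0 - a - b) / 2.0  # shared mass of (+,-) and (-,+)
    return {(1, 1): a, (-1, -1): b, (1, -1): c, (-1, 1): c}


def complexity_capacity(grid=20):
    """Grid search for the maximum of IF over the stationary family."""
    return max(information_flow(stationary_family(i / grid, j / grid))
               for i in range(grid + 1) for j in range(grid + 1 - i))


if __name__ == "__main__":
    print(complexity_capacity())
```

In this toy case the maximum is attained at the uniform distribution, where the complexity equals 2 bits, the number of binary nodes. The Dirac measures at $(+1,+1)$ and $(-1,-1)$ are also stationary but give zero complexity, which illustrates why the choice of input distribution matters in the deterministic setting.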
[Figure 7 plots omitted: panels (a) and (b) show complexity capacity (vertical axis) under synchronous updating with $\beta \to \infty$, one panel per network size $N$; the panel titles give $N$ and the number of trials.]

Figure 7.

Incremental Hebbian learning in an $N$-node deterministic ($\beta \to \infty$) Hopfield network with synchronous updating (averaged over 100 trials of storing random $N$-bit patterns).
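The recall behavior described at the start of this section can be sketched for the deterministic ($\beta \to \infty$) network as follows. This is a minimal illustration under stated assumptions: the two 9-bit patterns are hand-picked rather than random, ties in the step rule are broken towards $+1$, and the energy uses the conventional factor of $1/2$, which only rescales $E$ and does not move its minima. All function and variable names are hypothetical.

```python
def hebbian_weights(patterns):
    """Hebb's rule, Equation (17): w_ij = (1/T) * sum over patterns of xi_i * xi_j."""
    T, n = len(patterns), len(patterns[0])
    return [[sum(p[i] * p[j] for p in patterns) / T for j in range(n)]
            for i in range(n)]


def step(w, x):
    """One synchronous deterministic update (step-function limit of the sigmoid)."""
    n = len(x)
    # A zero input is mapped to +1; this tie-breaking rule is an assumption.
    return [1 if sum(x[j] * w[j][i] for j in range(n)) >= 0 else -1
            for i in range(n)]


def energy(w, x):
    """Energy of state x, E = -(1/2) * sum_ij w_ij x_i x_j."""
    n = len(x)
    return -0.5 * sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


# Two hand-picked 9-bit patterns (illustrative only).
xi1 = [1, 1, 1, -1, -1, -1, 1, 1, 1]
xi2 = [1, -1, 1, -1, 1, -1, 1, -1, 1]
w = hebbian_weights([xi1, xi2])

# Corrupt the first stored pattern by flipping one bit, then update once.
noisy = list(xi1)
noisy[0] = -noisy[0]
recalled = step(w, noisy)

print("recalled == xi1:", recalled == xi1)
print("energy before and after:", energy(w, noisy), energy(w, recalled))
```

For this particular pair of patterns, both stored patterns are fixed points of `step`, and the corrupted input moves to a lower-energy state. With many stored patterns, or under synchronous updating in general, neither property is guaranteed, which is the saturation effect discussed earlier in this section.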

4. Conclusions

In summary, we have seen four different measures of complexity applied in concrete, parameterized systems. We observed that the synergistic information was difficult to interpret on its own due to the lack of an intuitive lower bound on the measure. Building off the primitive multi-information, the total information flow and the geometric integrated information were closely related, frequently (but not always) showing the same qualitative behavior. The geometric integrated information satisfies the additional postulate (14) stating that a measure of complexity should not exceed the temporal mutual information, a property that the total information flow frequently violated in the numerical experiments where connection weights were allowed to be both negative and positive. The geometric integrated information was recently proposed to build on and correct the original flaws in the total information flow, which it appears to have done quite singularly based on the examination in the present study. While the geometric integrated information is a step in the right direction, further study is needed to properly disentangle external from internal causal influences that contribute to network dynamics (see final paragraphs of Section 2). Nonetheless, it is encouraging to see a semblance of convergence with regards to quantifying complexity from an information-theoretic perspective.

Acknowledgments:

The authors would like to thank the Santa Fe Institute NSF REU program (NSF grant

Author Contributions:

N.A. and J.A.G. proposed the research. M.S.K. carried out most of the research and took the main responsibility for writing the article. All authors contributed to joint discussions. All authors read and approved the final manuscript.

Conflicts of Interest:

The authors declare no conflict of interest.

Appendix A

Theorem A1 (Concavity of $IF(X \to X')$). The complexity measure

$$IF(X \to X') \triangleq \sum_{v \in V} H(X'_v \mid X_v) - H(X' \mid X)$$

is concave with respect to the input distribution $p(x) = \Pr(X = x)$, $x \in \mathcal{X}$, for stochastic matrix $P$ fixed.

Note that in the definition of the complexity capacity (19), we take the supremum over all stationary input distributions. Since such distributions form a convex subset of the set of all input distributions, concavity of $IF$ is preserved by the corresponding restriction.

Proof.

The proof of the above statement follows from first rewriting the complexity measure in terms of a negative KL divergence between two distributions, both affine with respect to the input distribution, and then using the fact that the KL divergence is convex with respect to a pair of distributions (see [23] (Chapter 2)) to demonstrate that the complexity measure is indeed concave.

Let $P$ denote the fixed stochastic matrix governing the evolution of $X \to X'$. Let $p$ denote the input distribution on the states of $X$.

First, note that the domain of $p$ forms a convex set: for an $N$-unit network, the set of all valid distributions $p$ forms a $(2^N - 1)$-simplex.

Next, we expand $IF$:

$$
\begin{aligned}
IF(X \to X') &= \sum_{v \in V} H(X'_v \mid X_v) - H(X' \mid X) \\
&= -\sum_{v \in V} \left( \sum_{x_v \in \mathcal{X}_v} \Pr(X_v = x_v) \sum_{x'_v \in \mathcal{X}_v} \Pr(X'_v = x'_v \mid X_v = x_v) \cdot \log \Pr(X'_v = x'_v \mid X_v = x_v) \right) \\
&\quad + \sum_{x \in \mathcal{X}} \Pr(X = x) \sum_{x' \in \mathcal{X}} \Pr(X' = x' \mid X = x) \cdot \log \Pr(X' = x' \mid X = x).
\end{aligned}
$$

Notice that the expanded expression for $H(X' \mid X)$ is affine in the input distribution $p$, since the terms $\Pr(X' = x' \mid X = x)$ are just constants given by $P(x, x')$.
Hence, $-H(X' \mid X)$ is concave, and all that is left to show is that the expansion of $H(X'_v \mid X_v)$ is also concave for all $v \in V$:

$$
\begin{aligned}
H(X'_v \mid X_v)
&= -\sum_{x_v \in \mathcal{X}_v} \Pr(X_v = x_v) \sum_{x'_v \in \mathcal{X}_v} \Pr(X'_v = x'_v \mid X_v = x_v) \cdot \log \Pr(X'_v = x'_v \mid X_v = x_v) \\
&= -\sum_{x_v \in \mathcal{X}_v} \sum_{x'_v \in \mathcal{X}_v} \Pr(X'_v = x'_v, X_v = x_v) \cdot \log \frac{\Pr(X'_v = x'_v, X_v = x_v)}{\Pr(X_v = x_v)} \\
&= -\sum_{x_v \in \mathcal{X}_v} \sum_{x'_v \in \mathcal{X}_v} \Pr(X'_v = x'_v, X_v = x_v) \cdot \log \frac{\Pr(X'_v = x'_v, X_v = x_v)}{\Pr(X_v = x_v)} + \log \frac{1}{|\mathcal{X}_v|} - \log \frac{1}{|\mathcal{X}_v|} \\
&= -\sum_{x_v \in \mathcal{X}_v} \sum_{x'_v \in \mathcal{X}_v} \Pr(X'_v = x'_v, X_v = x_v) \cdot \log \frac{\Pr(X'_v = x'_v, X_v = x_v)}{\Pr(X_v = x_v)} \\
&\quad + \sum_{x_v \in \mathcal{X}_v} \sum_{x'_v \in \mathcal{X}_v} \Pr(X'_v = x'_v, X_v = x_v) \log \frac{1}{|\mathcal{X}_v|} - \log \frac{1}{|\mathcal{X}_v|} \\
&= -\sum_{x_v \in \mathcal{X}_v} \sum_{x'_v \in \mathcal{X}_v} \left( \Pr(X'_v = x'_v, X_v = x_v) \cdot \log \frac{\Pr(X'_v = x'_v, X_v = x_v)}{\frac{1}{|\mathcal{X}_v|} \cdot \Pr(X_v = x_v)} \right) - \log \frac{1}{|\mathcal{X}_v|}.
\end{aligned}
$$

Ignoring the constant $-\log \frac{1}{|\mathcal{X}_v|}$, as this does not change the concavity of the expression, we can rewrite the summation as

$$
-\sum_{x_v \in \mathcal{X}_v} \sum_{x'_v \in \mathcal{X}_v} \left( \left( \sum_{x_r \in \mathcal{X}_{V \setminus v}} \Pr(X'_v = x'_v \mid X_v = x_v, X_{V \setminus v} = x_r) \cdot \Pr(X_v = x_v, X_{V \setminus v} = x_r) \right) \cdot \log \frac{\sum_{x_r \in \mathcal{X}_{V \setminus v}} \Pr(X'_v = x'_v \mid X_v = x_v, X_{V \setminus v} = x_r) \cdot \Pr(X_v = x_v, X_{V \setminus v} = x_r)}{\frac{1}{|\mathcal{X}_v|} \cdot \sum_{x_r \in \mathcal{X}_{V \setminus v}} \Pr(X_v = x_v, X_{V \setminus v} = x_r)} \right),
$$

where $X_{V \setminus v}$ denotes the state of all nodes excluding $X_v$.
This expansion has made use of the fact that

$$\Pr(X'_v = x'_v, X_v = x_v) = \sum_{x_r \in \mathcal{X}_{V \setminus v}} \Pr(X'_v = x'_v \mid X_v = x_v, X_{V \setminus v} = x_r) \cdot \Pr(X_v = x_v, X_{V \setminus v} = x_r)$$

and

$$\Pr(X_v = x_v) = \sum_{x_r \in \mathcal{X}_{V \setminus v}} \Pr(X_v = x_v, X_{V \setminus v} = x_r).$$

The constant $\Pr(X'_v = x'_v \mid X_v = x_v, X_{V \setminus v} = x_r) = \Pr(X'_v = x'_v \mid X = (x_v, x_r))$ can be computed directly as a marginal over the stochastic matrix $P$. Furthermore, the term $\Pr(X_v = x_v, X_{V \setminus v} = x_r) = \Pr(X = (x_v, x_r))$ comes directly from the input distribution $p$, making the entire expression for $\Pr(X'_v = x'_v, X_v = x_v)$ affine with respect to the input distribution.

Finally, we get that the summation equals

$$
\begin{aligned}
&-D_{KL}\left( \sum_{x_r \in \mathcal{X}_{V \setminus v}} \Pr(X'_v = x'_v \mid X_v = x_v, X_{V \setminus v} = x_r) \cdot \Pr(X_v = x_v, X_{V \setminus v} = x_r) \;\Big\|\; \frac{1}{|\mathcal{X}_v|} \cdot \sum_{x_r \in \mathcal{X}_{V \setminus v}} \Pr(X_v = x_v, X_{V \setminus v} = x_r) \right) \\
&\quad = -D_{KL}\left( \Pr(X'_v = x'_v, X_v = x_v) \;\Big\|\; \frac{1}{|\mathcal{X}_v|} \cdot \Pr(X_v = x_v) \right),
\end{aligned}
$$

the negative KL divergence between two distributions, both of which have been written so as to explicitly show them as affine in the input distribution $p$, and then simplified to show that both are valid joint distributions over the states on the pair $(X'_v, X_v)$. Since the KL divergence is convex in the pair of distributions, its negative is concave, and thus the overall expression is concave with respect to the input distribution.

References

1. Miller, J.H.; Page, S.E. Complex Adaptive Systems: An Introduction to Computational Models of Social Life; Princeton University Press, 2007.
2. Mitchell, M. Complexity: A Guided Tour
6. [...] Proceedings of the National Academy of Sciences of the United States of America, 5033–5037.
7. Oizumi, M.; Albantakis, L.; Tononi, G. From the Phenomenology to the Mechanisms of Consciousness: Integrated Information Theory 3.0. PLoS Comput Biol, 1–25.
8. Barrett, A.B.; Seth, A.K. Practical Measures of Integrated Information for Time-Series Data. PLoS Comput Biol, 1–18.
9. Oizumi, M.; Amari, S.i.; Yanagawa, T.; Fujii, N.; Tsuchiya, N. Measuring Integrated Information from the Decoding Perspective. PLoS Comput Biol, 1–18.
10. Gell-Mann, M. The Quark and the Jaguar: Adventures in the Simple and the Complex; W. H. Freeman, 1994.
11. McGill, W.J. Multivariate information transmission. Psychometrika, 97–116.
12. Edlund, J.A.; Chaumont, N.; Hintze, A.; Koch, C.; Tononi, G.; Adami, C. Integrated Information Increases with Fitness in the Evolution of Animats. PLoS Comput Biol, 1–13.
13. Bialek, W.; Nemenman, I.; Tishby, N. Predictability, complexity, and learning. Neural Computation, 2409–2463.
14. Grassberger, P. Toward a quantitative theory of self-generated complexity. International Journal of Theoretical Physics, 907–938.
15. Crutchfield, J.P.; Feldman, D.P. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos: An Interdisciplinary Journal of Nonlinear Science, 25–54.
16. Nagaoka, H. The exponential family of Markov chains and its information geometry. The 28th Symposium on Information Theory and Its Applications (SITA2005), 2005, pp. 601–604.
17. Ay, N. Information Geometry on Complexity and Stochastic Interaction. MPI MIS Preprint 95.
18. Amari, S. Information Geometry and Its Applications; Springer Japan, 2016.
19. Oizumi, M.; Tsuchiya, N.; Amari, S.i. Unified framework for information integration based on information geometry. Proceedings of the National Academy of Sciences, 14817–14822.
20. Ay, N. Information Geometry on Complexity and Stochastic Interaction. Entropy, 2432–2458.
21. Csiszár, I.; Shields, P. Information Theory and Statistics: A Tutorial. Foundations and Trends® in Communications and Information Theory, 417–528.
22. Hertz, J.; Krogh, A.; Palmer, R.G. Introduction to the Theory of Neural Computation; Perseus Publishing, 1991.
23. Cover, T.M.; Thomas, J.A.