Assessing complexity by means of maximum entropy models
Gregor Chliamovitch,∗ Bastien Chopard, and Lino Velasquez
Department of Computer Science, University of Geneva, Switzerland
Department of Theoretical Physics, University of Geneva, Switzerland
(Dated: November 12, 2018)

We discuss a characterization of complexity based on successive approximations of the probability density describing a system by means of maximum entropy methods, thereby quantifying the respective role played by different orders of interaction. This characterization is applied to simple cellular automata in order to put it in perspective with the usual notion of complexity for such systems based on Wolfram classes. The overlap is shown to be good, but not perfect. This suggests that complexity in the sense of Wolfram emerges as an intermediate regime of maximum entropy-based complexity, but also gives insights regarding the role of initial conditions in complexity-related issues.
I. INTRODUCTION
In the course of the last few decades, so many characterizations (sometimes at odds with each other) of complexity have appeared that it would be illusory to give a one-sentence statement encapsulating them all. Perhaps one of the most widely accepted such characterizations could be, in deliberately fuzzy terms, that “complexity arises when a system is more than the collection of its parts” [1]. It is nonetheless far from easy to understand what “being more than one's parts” really means. A similar idea is conveyed by the statement that “a complex system cannot be fully understood by looking separately at its elements”, which is barely more transparent but, as we shall see, paves the way to quantitative interpretations (see also [2]). As an historical aside, let us note that besides being the most popular, this notion of complexity also turns out to be one of the oldest, since it can be traced back (in somewhat different terms) to Aristotle's writings [3]. For convenience, we shall refer to this definition of complexity as Aristotle's complexity or simply A-complexity.
In recent years, it has been proposed [4–6] to give a mathematical meaning to these abstract principles in a probabilistic framework by means of maximum entropy (ME) models. The ME approach provides a conceptually simple way to build generic models based on observational constraints. Following this line of reasoning, Aristotle's principle could be reformulated by asserting that a system is complex when its ME approximation built on the knowledge of small subparts provides a poor approximation of the system as a whole.

ME models come in different versions, depending on the observational constraints retained for consideration. Recently, impressive successes have been obtained in the study of neural networks by considering ME models based on constrained two-point correlations and firing rates [5]. However, little emphasis has been put on quantifying the limitations of this approach, and correlations are certainly not the only nor the most general quantity worth being considered. On another side, it has been suggested to build ME reconstructions on the knowledge of marginals up to a certain order [4, 6]. While this second approach is more general and certainly more in accordance with the spirit of information theory, it has the drawback that the models so generated are more difficult to investigate analytically than their correlation-based counterparts, often resulting in fairly abstract statements and conclusions.

The purpose of the present work is to display the ME method “at work” by carrying through a numerical investigation of these marginal-based models. To this end we shall turn our attention to so-called one-dimensional elementary cellular automata (ECA), since for this kind of system a notion of complexity is well established and may therefore serve as a benchmark. More precisely, ECA have long raised huge interest due to the fact that they may be classified in four classes ranging from trivial to complex behaviour, in a sense to be discussed further later on. While this classification scheme (essentially due to Wolfram [7], which is why we shall refer to this notion of complexity as W-complexity) is the most famous one, many others have been proposed over time (see [8] and references therein). Some attention will also be devoted in this work to Langton's parameter [9], which, rather than a classification scheme, provides a parametrization of the CA space.

Despite innumerable attempts to apply usual information-theoretic tools to the study of cellular automata, inspired by the belief that complexity should, in the end, have something to do with information, we are not aware of a situation where these tools provide convincingly original and deep insights into these systems. It will turn out that ME tools give quite significant results in this context, namely by establishing a link between W-complexity and the dependence of A-complexity on the size of the system, somewhat at odds with the idea that systems become more and more complex when they grow larger due to more and more room left for building synergistic interactions.

The outline of this paper is as follows. We start with a review of maximum entropy methods, and then show how these tools may be used to cast Aristotle's intuition into a proper mathematical scheme. We continue with a reminder on ECA and discuss how to implement these systems in a way that fits our purpose, after which our results are presented.

II. MAXIMUM ENTROPY APPROXIMATIONS
The philosophy underlying ME models is, so to speak, opposite to the constructive one, where a model is built and tuned to match observed properties (top-down). In the ME approach, on the other side, we proceed by seeking the least structured model compatible with a given set of observations (bottom-up). This is done by noting that the most general (that is, the least structured) probability density is the one which has the largest entropy while still satisfying the observational constraints, which are usually provided by some set of observables f_k (k = 1, ..., K) the average values of which are known, ⟨f_k⟩ = µ_k. Note that a drawback of this procedure lies in the fact that it only yields a probability density. An energy function can however be deduced by analogy, but this important point will not be of much concern in the present paper.

Assume we look for a probability distribution p on a set of N variables X_1, ..., X_N (collectively denoted by X), such that H(X) = -\sum_x p(x) \ln p(x) is maximal, while making sure that the constraints ⟨f_k⟩ := \sum_x f_k(x) p(x) = µ_k and \sum_x p(x) = 1 are enforced. Using Lagrange's method, the quantity we have to equate to zero is

\frac{\partial}{\partial p(y)} \left[ -\sum_x p(x) \ln p(x) + \lambda_0 \left( \sum_x p(x) - 1 \right) + \sum_{k=1}^{K} \lambda_k \left( \sum_x f_k(x) p(x) - \mu_k \right) \right],    (1)

where the λ's are the multipliers. Some straightforward algebra yields that the sought-after distribution may be written as

p(x) = \frac{1}{Z} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x) \right),    (2)

where the multipliers have to be chosen to match the constraints. Dividing by the partition function Z := \sum_x \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x) \right) ensures that p is properly normalized. For instance, in the most elementary case where no constraint besides normalization is imposed, the ME distribution is nothing but the uniform distribution (if the support of p is unbounded we have to impose finite mean and variance in order to recover a meaningful distribution, which turns out to be a Gaussian).

While in the present context we use the ME approach as a tool to investigate how well the reconstructed distribution matches the true one, which requires focussing on small systems whose distribution is known at any time, we should emphasize that the ME procedure may also be most usefully employed when the true probability density is unknown. In such a case, it is implicitly assumed that the observables providing the constraints are easier to determine with accuracy than the distribution itself, thereby providing a way to reconstruct this distribution. It is nevertheless difficult in this case to assess the accuracy of the reconstruction resulting from the ME procedure.

In this paper we shall deal with the case where the constraints are provided by marginals instead of averages. Fortunately, an appropriate use of delta functions allows generalizing the previous results in a straightforward way. Taking for illustration the case N = 4 (i.e. x = (w, x, y, z)), and assuming the tri-variate marginal p(a, b, c) is known, putting f_{abc}(x) = \delta(w, a) \delta(x, b) \delta(y, c) allows writing

\langle f_{abc} \rangle = \sum_x f_{abc}(x) p(x) = \sum_{w,x,y} \delta(w, a) \delta(x, b) \delta(y, c) \sum_z p(x) = \sum_{w,x,y} \delta(w, a) \delta(x, b) \delta(y, c)\, p(w, x, y) = p(a, b, c).    (3)

Applying the result above to all possible values of the arguments then yields

p(x) = \frac{1}{Z} \exp\left( \sum_{a,b,c} \lambda(a, b, c) f_{abc}(x) \right) = \frac{1}{Z} \exp\left( \lambda(w, x, y) \right),    (4)

where λ now denotes a well-chosen function.
This generalizes to any number of marginals in a straightforward way; for instance, if besides the previous one the marginals over (w, x, z) and (y, z) are given, we get

p(x) = \frac{1}{Z} \exp\left( \lambda_1(w, x, y) + \lambda_2(w, x, z) + \lambda_3(y, z) \right)    (5)

for some functions λ_1, λ_2, λ_3. Sadly, this elegant result is actually of little use, since the determination of these λ's is a difficult problem. An important exception is the simple case in which the constrained marginals are univariate. Then λ_1(w) = ln p(w) (and λ_2(x) = ln p(x), etc.) obviously satisfies the requirements, so that the ME distribution compatible with given univariate marginals is the factorized distribution. In all other cases, we have to resort to the so-called iterative scaling algorithm, which allows numerical calculations.

Brown [10] was among the first to describe this algorithm, the principle of which is, starting from some initial distribution, to consider all possible n-tuples of variables in sequence, each time adjusting the corresponding marginal. If we denote by p^{(k)} the distribution obtained after k such adjustments, S_k the subset considered at the k-th step and p_{S_k} the marginal distribution of this set, then the procedure is defined by

p^{(k)} := p^{(k-1)} \frac{p_{S_k}}{p^{(k-1)}_{S_k}}.    (6)

The order in which the n-tuples are examined does not alter the distribution we converge to. From our experience, it seems that going twice through each n-tuple is enough to reach a satisfying solution. See [10] for proofs of convergence. This scheme is demanding due to the fact that the number of n-tuples in a set grows factorially.
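As an illustration, the following minimal sketch implements the update (6) for binary variables, constraining all marginals of a given order (in the experiments below only adjacent subsets would be retained). The joint distribution is stored as a numpy array of shape (2, ..., 2); the function names are ours and the snippet is an idealized sketch, not the exact code used in this study.

```python
import numpy as np
from itertools import combinations

def marginalize(p, subset):
    """Marginal of the joint array p (shape (2,)*N) over the variables in subset."""
    axes = tuple(i for i in range(p.ndim) if i not in subset)
    return p.sum(axis=axes)

def iterative_scaling(p_true, order, sweeps=2):
    """ME reconstruction of p_true from all its marginals of the given order,
    using the update p^(k) = p^(k-1) * p_{S_k} / p^(k-1)_{S_k} of Eq. (6)."""
    N = p_true.ndim
    p = np.full((2,) * N, 2.0 ** -N)       # start from the uniform distribution
    for _ in range(sweeps):                # two sweeps usually suffice (see text)
        for subset in combinations(range(N), order):
            target = marginalize(p_true, subset)
            current = marginalize(p, subset)
            # ratio of marginals, with the convention 0/0 = 0
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            # broadcast the ratio back onto the full joint distribution
            shape = tuple(2 if i in subset else 1 for i in range(N))
            p = p * ratio.reshape(shape)
    return p
```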
III. DECOMPOSING MULTI-INFORMATION

When looking at the ME reconstruction based on the knowledge of, say, tri-variate marginals, it is not easy a priori to know whether it is accurate because it takes into account strong triplet-wise interactions between variables, or whether the same could have been achieved by the reconstruction based on bi-variate marginals only. We need a way to disentangle different orders of interaction, as well as a tool to compare distributions. Both are provided by the Kullback-Leibler (KL) divergence between two distributions p and q (living on the same support), which is defined as

D(p, q) := \sum_x p(x) \ln \frac{p(x)}{q(x)}.    (7)

This quantity provides a pseudo-distance on the space of probability distributions [11].

The idea now is to use the KL divergence to compare the distribution we want to approximate with the ME distribution based on marginals of a certain order k by computing D(p, p^{(k)}_{ME}), where p^{(k)}_{ME} denotes the ME reconstruction based on k-marginals. It is intuitively clear that the larger the subsets considered, the more accurate the resulting ME approximation will be (indeed smaller-order marginals may be recovered from larger-order ones), whence the inequality D(p, p^{(k-1)}_{ME}) ≥ D(p, p^{(k)}_{ME}). The difference may therefore be interpreted as the gain in accuracy when basing our guess on subsets of size k instead of k − 1, and therefore quantifies specifically the role played by interactions of order k. We define accordingly

C_k := D(p, p^{(k-1)}_{ME}) - D(p, p^{(k)}_{ME}) \geq 0.    (8)

This quantity is sometimes referred to as the connected multi-information of order k [4], but in our opinion this name is unfortunate (where does connectedness enter into play?) and will not be used here.

Performing the telescopic sum of all these coefficients gives

\sum_{k=2}^{N} C_k = D(p, p^{(1)}_{ME}) - D(p, p^{(N)}_{ME}).    (9)

Since p^{(N)}_{ME} is trivially p itself (whence D(p, p^{(N)}_{ME}) = 0), and since, as mentioned above, p^{(1)}_{ME} = \prod_{i=1}^{N} p(x_i), we finally get

\sum_{k=2}^{N} C_k = D\left( p(x), \prod_{i=1}^{N} p(x_i) \right) = \sum_x p(x) \ln \frac{p(x)}{\prod_{i=1}^{N} p(x_i)} := M.    (10)

The quantity M is known as the multi-information [12] and quantifies the total amount of interdependence inside a set of variables. (As an aside, note that the sum could be started from k = 1, in which case it would add up to the KL distance between p and the uniform distribution. The total amount of interdependence could indeed be defined that way, but it is usual to say that this quantity is given by the multi-information as defined above. Moreover, the standard convention attributes null interdependence to a set of independent variables, which would not be the case otherwise, where only uniform distributions would be said to have no interdependence. This alternative way of defining things could nevertheless be considered occasionally.) This formula shows, as could have been expected, that the total interdependence in a system is built by addition of pair-wise, triplet-wise, and so on, interactions.

An important question is whether all subsets of a given order should be treated on the same footing when there exists a notion of distance between variables, as would be the case for instance if the system were put on a graph and each variable assigned to a node (more generally, such a distance has to exist as soon as a focus is put on “multi-scaleness”). Considering for instance the case of pairs, co-dependence between variables remote from each other will intuitively be much smaller than between two neighbouring variables, so that it seems acceptable to discard pairs constituted by distant elements. On another side, these loose pairs are by far more numerous than tightly-bound ones, so that though their contribution is weak when envisaged individually, it cannot be neglected anymore when considered globally. But the very fact that pairs (more generally n-tuples) are so many implies that considering them all becomes numerically difficult (factorial growth). Since in the experiments to follow we focus on systems which display such a notion of distance, provided by the topology of the graph on which we put our variables, our viewpoint will be to consider as admissible n-tuples only those consisting of connected variables (i.e. n-tuples whose restricted graph is connected). A few checks (see below) tend to suggest that our conclusions are not drastically altered in the more general case where all subsets are retained. We have to admit that this simplification is questionable, and we intend to address this issue more specifically elsewhere.
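To make the decomposition concrete, here is a minimal sketch of how the divergences D(p, p^{(k)}_{ME}) and the coefficients C_k of Eq. (8) could be computed, building on the iterative_scaling function sketched in section II; the multi-information M is then recovered as their sum, Eq. (10).

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(p, q) of Eq. (7), with the convention 0 ln 0 = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def coefficients(p_true):
    """The coefficients C_k = D(p, p_ME^(k-1)) - D(p, p_ME^(k)) for k = 2..N."""
    N = p_true.ndim
    d = [kl_divergence(p_true, iterative_scaling(p_true, k))
         for k in range(1, N + 1)]
    return [d[k - 2] - d[k - 1] for k in range(2, N + 1)]

# By the telescopic sum (10), sum(coefficients(p)) equals the multi-information
# M, i.e. the divergence between p and the product of its univariate marginals.
```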
IV. ELEMENTARY CELLULAR AUTOMATA AND LANGTON'S PARAMETER

One-dimensional ECA have been very thoroughly investigated (see [13] for an introduction). Wolfram [7] noted that these may be classified in four different classes: class I regroups ECA converging to some homogeneous pattern; class II displays an inhomogeneous stable pattern or periodic behaviour; class III displays completely chaotic behaviour; finally, class IV regroups automata which exhibit slowly building up and decaying sub-patterns. It is believed that ECA belonging to class IV are the closest to our intuitive conception of complexity. While this classification is very widely used, many alternatives have been proposed (see [8] and references therein). One of these alternatives is to consider Langton's λ parameter, which provides a parametrization of the space of cellular automata. In the case of elementary one-dimensional automata, it is computed very easily as the percentage of configurations in the lookup table giving rise to a living cell (i.e. a cell taking value 1), but this parameter may be generalized to any kind of automaton [9]. A leitmotiv of Langton's reflexion was to suggest that, while simple (classes I and II) automata correspond to small values of λ and chaotic ones (class III) to values close to λ = 1/2, complex CA should emerge somewhere in between, at what has been popularized as the edge of chaos.

The idea of quantifying the role played by successive orders of interaction in the informational content of cellular automata has already been addressed by Lindgren and Nordahl [14, 15]. However, these authors do not resort to ME methods but use instead a simpler decomposition of the entropy rate, which makes this tool restricted to one-dimensional topologies (this limitation of their work was actually an important motivation for undertaking the present study).
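For ECA, λ therefore reduces to the fraction of ones in the 8-bit rule table. A one-line sketch, assuming the usual Wolfram rule-numbering convention (the function name is ours):

```python
def langton_lambda(rule):
    """Langton's lambda for an ECA rule: the fraction of the 8 neighbourhood
    configurations mapped to a living cell (value 1)."""
    return bin(rule & 0xFF).count("1") / 8.0

# For instance rule 90 has four ones in its lookup table, so
# langton_lambda(90) == 0.5, consistent with its chaotic (class III) behaviour.
```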
V. COMPUTATIONAL FRAMEWORK
Before moving on to discuss our results, we should say a few words about our computational framework and address how the temporal evolution of the probability density is handled. Often this is done by means of Monte-Carlo methods, by evolving copies of the system and reconstructing the probabilities by sampling trajectories (following this approach, see [16] for a recent work on a closely related topic). Here we follow an alternative approach, which is to determine the time evolution of the probability density exactly. Then we may, so to speak, follow simultaneously all trajectories down to the least probable ones. This amounts to a description in terms of Markov chains, where the knowledge of history allows one to predict towards which states the system could evolve. We will restrict ourselves here to the case where the knowledge of a finite history is sufficient to predict possible futures. By suitably extending the state space, all such processes may actually be recast in the form of memoryless Markov chains (or simply Markov chains), by which we mean that the probabilities of the forthcoming states may be predicted knowing the current state of the process only. In the case of the ECA considered here, the Markov process is memoryless by construction.

While this approach seems to outperform sampling in terms of accuracy, it actually suffers from its numerical cost when the system's size increases. Assuming for instance we deal with a system constituted by N variables taking binary values, we have in this case 2^N possible configurations, while assuming the system is driven by a dynamics with a k-step memory yields 2^{kN} possible histories to take into consideration, which soon becomes intractable even for small values of N and k.

Nonetheless, this formalism has some advantages which justify its adoption in this paper. In particular, it allows a more straightforward transition from numerical exploration to theoretical investigation. Still more importantly, as we already mentioned, this approach does not require selecting (arbitrarily) an initial configuration, but handles them all as long as they are not explicitly assigned probability zero from the beginning. This will turn out to be a crucial feature.
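A minimal sketch of this exact propagation for a periodic ECA, assuming configurations are encoded as integers between 0 and 2^N − 1 (the joint array used in the ME sketches above is then just this vector reshaped to (2, ..., 2)); the function names are ours.

```python
import numpy as np

def eca_step(config, rule, N):
    """One synchronous update of an N-cell periodic ECA, configurations
    being encoded as N-bit integers (bit i = state of cell i)."""
    bits = [(config >> i) & 1 for i in range(N)]
    new = 0
    for i in range(N):
        # neighbourhood (left, centre, right) read as a 3-bit rule-table index
        idx = 4 * bits[(i - 1) % N] + 2 * bits[i] + bits[(i + 1) % N]
        new |= ((rule >> idx) & 1) << i
    return new

def evolve_distribution(p, rule, N):
    """Exact one-step evolution of the probability vector p over all 2^N
    configurations: each configuration pushes its whole mass onto its image."""
    q = np.zeros_like(p)
    for config in range(2 ** N):
        q[eca_step(config, rule, N)] += p[config]
    return q

# Starting from the uniform distribution, all initial configurations are
# followed simultaneously, down to the least probable ones:
N = 10
p = np.full(2 ** N, 2.0 ** -N)
for _ in range(30):
    p = evolve_distribution(p, 110, N)
```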
VI. RESULTS

We first computed the time evolution of the C_k coefficients for ECA of size N = 10 put on a periodic string-like topology, as well as the behaviour of the multi-information. Two instances are presented below.

FIG. 1: Time evolution of the C_k coefficients in rule 110 (adjacent subsets). Coefficients are displayed cumulatively, i.e. we show successively (from bottom to top) C_2, C_2 + C_3, etc. The sum converges to M (red curve).

FIG. 2: Time evolution of the C_k coefficients in rule 110 (all subsets).

FIG. 3: Time evolution of the C_k coefficients (adjacent subsets) in rule 90. Only one coefficient contributes.

Figure 1 shows that several coefficients provide a significant contribution to the multi-information (note that the coefficients are displayed cumulatively; see caption). Remark that after a transient phase of around ten steps, the system enters a periodic regime. While the multi-information is almost constant in this regime, the respective contributions present a much stronger variability. One coefficient, for instance, alternately reaches a significant value and then decays close to zero. On the contrary, another contribution is nearly constant in the stationary regime after reaching a peak during the transient phase.

For comparison we show in figure 2 the corresponding picture when all subsets of a given order, instead of adjacent ones only, are taken into account. The main difference is that two of the coefficients no longer play a significant role, but on the whole the behaviour of the remaining coefficients is not qualitatively altered. Note however how our decision to rule out non-adjacent subsets introduced spurious high-order dependences.

This behaviour is in sharp contrast with the one shown in figure 3, which displays the same plot for rule 90. This rule is characterized by the fact that the sole contribution to the multi-information is provided by a single coefficient, all other orders of interaction vanishing (admittedly this could hardly be guessed from the plot alone). Very interestingly, rule 90 is nothing but the dynamics obtained by applying the XOR operator to the two neighbours of a variable and assigning the result to the variable. This highlights the fact that the notion of “order of interaction” as employed in the current context does not quite overlap with what we could expect from, for instance, classical kinetic theory.
There, “interaction of order n” would be understood in terms of the functional form of the energy function (in the sense that the latter could not be decomposed as a sum of functions involving fewer than n variables). The case of rule 90 would then correspond to interactions of order two, and ECA in general to interactions of order three. It would then be difficult to justify the appearance of higher orders of interaction. An interesting question is to find a dynamics such that all informational content is brought in by interactions of order k, i.e. C_k = M and C_l = 0 for all l ≠ k.

We should actually better get rid of the transient phase, the details of which depend on the probabilities initially assigned to each configuration. Moreover, in order to get more easily displayable results, we shall discard temporal variations of the C_k coefficients. From now on, coefficients will therefore be averaged (when necessary) over the stationary phase, so that all forthcoming statements about C_k coefficients will be statements about the average value of these coefficients in the stationary regime.
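As a side note, the claim made above that rule 90 amounts to XOR-ing the two neighbours of a cell is easily checked against its rule table (a minimal sketch):

```python
# Rule 90 maps each neighbourhood (l, c, r) to l XOR r, irrespective of c:
for l in (0, 1):
    for c in (0, 1):
        for r in (0, 1):
            assert (90 >> (4 * l + 2 * c + r)) & 1 == l ^ r
```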
FIG. 4: Size of subsets required to reconstruct a ME distribution recovering 50%, 70% and 90% of the multi-information respectively.
Figure 4 displays, for each value of the Langton parameter λ, up to which size ⟨⟨Σ⟩_T⟩_λ the subsets should be considered in order to recover, respectively, 50%, 70% and 90% of the multi-information in the stationary regime (while heavy, this notation has the merit of making clear that this is the size obtained from coefficients averaged over time and over all rules having the same λ). We only consider the range λ ∈ [0, 1/2], since any ECA with λ > 1/2 is equivalent to an ECA with λ′ = 1 − λ. We observe that ⟨⟨Σ⟩_T⟩_λ increases monotonically with λ, which means that A-complexity grows with λ, and this statement holds whatever the percentage of multi-information targeted.

We now turn to the dependence of ⟨⟨Σ⟩_T⟩_λ on the size of the ECA, for the case where a 90% reconstruction of the multi-information is targeted. Results are displayed in figure 5. The curve for λ = 0 is special since it comprises only one rule (R0). For this rule, considering individual elements (actually only one of them) is enough to characterize the system, so that in this case ⟨⟨Σ⟩_T⟩_{λ=0} = 1 whatever the size of the system. For other values of λ, ⟨⟨Σ⟩_T⟩_λ grows with N, in a way which is investigated below.

The fact that ⟨⟨Σ⟩_T⟩_λ increases monotonically with λ seems to be common to all sizes, except for N = 4 and N = 8, where it fails by a small amount. Though it seems difficult to single out one specific cause for this inversion, the forthcoming discussion should make the issue clearer.

Figure 5 says nothing about possible heterogeneity inside the set of rules having the same λ. It will turn out that this intra-λ variability is indeed extremely important, so that a discussion in terms of λ is actually rather irrelevant. We will therefore now discuss the question in detail by examining rules for themselves, without trying to tie links to the λ-parametrization.

FIG. 5: Order of subsets required to recover 90% of the multi-information as a function of N, for each value of λ (λ = 0: cyan; λ = 1/8: green; λ = 1/4: blue; λ = 3/8: red; λ = 1/2: black).
A careful examination of the size of subsets required to reconstruct at least 90% of the multi-information M for all 88 inequivalent rules separately allows us to single out some representative behaviours, typified by the rules displayed in figure 6 (the selection is arbitrary to a certain extent). The simplest behaviour is provided by rule 14, which belongs to Wolfram class II and exhibits a stable translating stationary configurational pattern (by configurational pattern we shall always mean the spatial pattern obtained by evolving some initial configuration). In this case the number of coefficients to take into account is seen to be the same (here ⟨Σ⟩_T = 5) whatever the size of the system, except for small sizes. This behaviour may be encountered in many rules, with some variations regarding the value of ⟨Σ⟩_T or the length of the transient phase (meant as the transient in terms of N; recall that time plays no role here since coefficients are averaged). Two interesting variations on this theme are to be found in rules 77 (reaching a stable inhomogeneous configuration and thus classified as class II) and 32 (reaching a stable homogeneous configuration and thus classified as class I). In the former case, ⟨Σ⟩_T seems to get stabilized at ⟨Σ⟩_T = 4, but jumps to ⟨Σ⟩_T = 3 when N increases from N = 8 to N = 9. This illustrates a weakness of our display of results, since ⟨Σ⟩_T changes abruptly when, say, the first three coefficients contribute to 89% of M, or when these same three coefficients reach 91% of M. ⟨Σ⟩_T would then jump from ⟨Σ⟩_T = 4 to ⟨Σ⟩_T = 3 although the change actually occurred almost smoothly. The case of rule 32 is more relevant to our purpose. There we oscillate between ⟨Σ⟩_T = 2 and ⟨Σ⟩_T = 3 depending on the parity of N. This may be explained by noting that there exists one initial configuration which does not converge to a stationary homogeneous pattern; namely, the alternate configuration 01010101010101 gets replicated again and again over time, except for a one-cell shift at each iteration. Such an initial pattern is however only possible for even values of N. It therefore happens that while the dynamics is rightly classified as class I for odd values, it should not be so for even ones (at least not for all initial configurations). It is therefore no longer unexpected to detect a size-dependent amount of A-complexity.

A completely different behaviour is exemplified by rule 90, already discussed above. Here ⟨Σ⟩_T is close to its largest possible value ⟨Σ⟩_T = N whatever the value of N. This is in accordance with the temporal behaviour of the C_k's observed in figure 3. Note the special case N = 8, for which ⟨Σ⟩_T drops to ⟨Σ⟩_T = 2. A similar behaviour may be found in rule 106, which is perhaps even more convincing due to the lack of drop. Interestingly, while rules 90 and 106 present similarities, they are usually assigned to different Wolfram classes (R90 is class III while R106 is class IV). We shall come back to this question later on.

Both types of behaviour analyzed so far constitute extreme situations: in one case (rules 14, 77 and 32), ⟨Σ⟩_T shows little or no dependence at all on the size of the system, while in the other (rules 90 and 106) ⟨Σ⟩_T seems to grow linearly with N. The following rules do not display such unambiguous behaviours. Rule 105, for instance, oscillates between a “basis line” at ⟨Σ⟩_T = 2 and an “upper bound” at ⟨Σ⟩_T = N, visiting intermediate values for some other values of N.
Rule 15 behaves similarly, except that oscillations are sharper (intermediate values are never visited, at least for the range of sizes we were able to explore). Some other rules also exhibit what could be daringly (given the small range of sizes we were able to scan) characterized as sub-linear growth. Rule 22 illustrates this nicely, as well as rules 73, 110 and 54 (though in this latter case it is tempting to assert that ⟨Σ⟩_T eventually gets stabilized at ⟨Σ⟩_T = 8).

Before moving on to examine if and how these typical behaviours may be related to Wolfram's classification scheme, let us have a side look at the situation where all subsets are considered instead of adjacent ones only. The counterpart of figure 6 is displayed in figure 7. It should first be noted that, as expected, the size of subsets to be considered in order to reconstruct a specified fraction of M is almost always smaller when all subsets may be considered than when unconnected ones are to be discarded. Even if the results obtained in these two cases are somewhat different, the distinctive features underlined above are preserved. In particular, low A-complexity expressed in terms of adjacent subsets is coherent with low A-complexity expressed in terms of all subsets, and similarly for high A-complexity.

The primary purpose of this study was to put alongside the notion of complexity promoted by ME methods applied to Aristotle's principle and complexity as meant in Wolfram's classification. We should therefore examine whether or not some behaviours can be found to be common to all rules pertaining to a given Wolfram class. Table I lists all inequivalent rules and the class they belong to [8].

CLASS  RULES
I      0, 8, 32, 40, 128, 136, 160, 168
II     1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 19, 23, 24, 25, 26, 27, 28, 29, 33, 34, 35, 36, 37, 38, 42, 43, 44, 46, 50, 51, 56, 57, 58, 72, 73, 74, 76, 77, 78, 104, 108, 130, 131, 132, 133, 134, 138, 140, 142, 152, 154, 156, 162, 164, 170, 172, 178, 184, 200, 204, 232
III    18, 22, 30, 45, 60, 90, 105, 129, 146, 150, 161
IV     41, 54, 106, 110
TABLE I: A compendium of the 88 inequivalent one-dimensional ECA.

All eight rules in class I display the simple behaviour discussed above where ⟨Σ⟩_T = const. The value of ⟨Σ⟩_T varies from one rule to the other. The tricky case of rule 32 has already been discussed above, and R160 is very similar. This homogeneity of behaviours is in agreement with the simplicity of the configurational patterns converged to in the stationary regime.

At the other end of the spectrum, 11 rules belong to class III. All of them display linear or sub-linear growth, possibly with drops for certain values of N (cf. the discussion of R90 and R105 above). None of these rules shows the simple behaviour encountered in class I: here again, the ME approach is appropriate to catch the kind of complexity displayed by chaotic dynamics.

Things are no longer that clear when we come to considering classes II and IV. Class II regroups as many as 65 of the 88 inequivalent rules. At least 41 of them display the same simple type of behaviour already encountered in class I, which is fine since rules in class II are not expected to present a high level of complexity. Nonetheless, the remaining 24 rules behave in a way which is typical of class III. While this might suggest that classifications based on A-complexity on the one side and W-complexity on the other are definitively at odds with each other (which does not necessarily imply that one should be preferred to the other), it is also possible that the initial configuration plays an important role in this respect. We already met above a rule (R32) whose classification depended tightly on the initial configuration chosen and, implicitly, on the size of the system. Another such instance is provided by rule 73, which is classified as class II due to the appearance of “walls” splitting the configurations into sub-configurations which, being of finite size, will necessarily repeat themselves, whence the attribution to class II. It may however happen that the initial configuration is chosen in such a way as to forbid the appearance of such separating walls, in which case the dynamics should better be classified as class III.

FIG. 6: A representative sample of rules (panels: R14, R15, R22, R32, R54, R73, R77, R90, R105, R106, R110, R154). The horizontal axis displays the size of the automaton, the vertical axis the required size of subsets to recover 90% of the informational content. The dotted line corresponds to maximal A-complexity ⟨Σ⟩_T = N.

Class IV, finally, regroups only four inequivalent rules. Among these, three of which are shown in figure 6, one (R106) shows linear growth while another (R54) seems to converge towards a constant value of ⟨Σ⟩_T = 8, as dynamics in class I or II would do (note however the large value of ⟨Σ⟩_T). Rule 110 lies somewhere in between. The fourth rule (R41) in class IV is similar to R110.
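The count of 88 inequivalent rules used throughout this section comes from identifying rules related by reflection and by the exchange of the roles of 0 and 1. A minimal sketch checking this count (the transformation code is ours):

```python
def mirror(rule):
    """Exchange the left and right neighbours in the lookup table."""
    m = 0
    for n in range(8):
        l, c, r = (n >> 2) & 1, (n >> 1) & 1, n & 1
        m |= ((rule >> n) & 1) << (4 * r + 2 * c + l)
    return m

def conjugate(rule):
    """Exchange the roles of 0 and 1, for inputs and output alike."""
    return sum((1 - ((rule >> (7 - n)) & 1)) << n for n in range(8))

canonical = {min(r, mirror(r), conjugate(r), mirror(conjugate(r)))
             for r in range(256)}
assert len(canonical) == 88   # the 88 equivalence classes of Table I
```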
VII. CONCLUDING REMARKS

It should be noted that we did not actually take full advantage of the ME machinery in this study, focussing instead on averaged quantities; the case of rule 110 serves as an illustration, since obviously the interplay of the various coefficients shown in figure 1 is much richer than the averaged values considered in most of our analysis.
FIG. 7: The same picture as in figure 6 (panels: R14, R15, R22, R32, R54, R73, R77, R90, R105, R106, R110, R154), now for the case where all subsets of a given size are taken as admissible. Only sizes up to N = 10 are considered.

In spite of this, our results highlighted a tight relationship between A-complexity and W-complexity which is all the more remarkable. We cannot elude, however, that in some cases, namely in class II rules exhibiting very A-complex behaviour, considerable discrepancies arose between these two schemes. In our opinion this may be interpreted in two different ways. Firstly, it might be that our study should indeed be refined by looking at more subtle quantities than averaged ones. Nonetheless, we have met several cases (e.g. rules 32, 73, 160) where unexpected A-complexity could be explained by misattributions to such-and-such a class due to insufficient attention paid to particular initial conditions. A great strength of our probabilistic approach is that we cannot be fooled by such effects, since all possible initial configurations are considered in our framework. Moreover, if one wishes, it allows a separate treatment of these rogue configurations by simply assigning them probability zero. The issue lies in the determination of these special initial conditions, which would require a considerable amount of work.

Assuming all ambivalent rules may indeed be explained by an adequate splitting of initial configurations (which in itself would shed some light on the interplay between dynamics, initial conditions, and complexity of behaviour), it is very tempting to sketch the following global picture.
1) In automata converging to a stable or periodic configurational pattern, the knowledge of subsets of some finite size is sufficient to reconstruct accurately (here up to a 10% error) the informational content of the system. 2) In chaotic automata the size of these subsets grows linearly with the system size, meaning that any inference based on small subsystems yields intrinsically flawed results. 3) Complex systems would then lie somewhere in between, perhaps exhibiting sub-linear growth of the required subsets. In other words, W-complexity corresponds to an intermediate regime of A-complexity, quite akin to Langton's edge of chaos, which is therefore recovered starting from a completely different vantage point.

The behaviour encountered in classes I and II is somewhat reminiscent of the situation prevailing in classical kinetic theory, where the BBGKY hierarchy of equations may be truncated without too much harm after two steps for most gaseous or fluid systems of interest, neglecting in a sense the contribution of higher-order reduced densities. Pushing this analogy further, this would suggest that such systems cannot be characterized as “complex” in the sense investigated here. This comparison is however made very conjectural by the fact that in the present study the criterion for truncation is provided by the informational content of the system, which is not the case in the context of kinetic theory, where such a criterion is not so clearly stated. Actually, our criterion is rather arbitrary; indeed, reconstructing its informational content is but a first step towards the understanding of a system, since there is no one-to-one relationship between probability distribution and multi-information.

This should remind us that the ME method employed is susceptible of several refinements, without even mentioning the fact that the cellular automata studied here are a very specific kind of system. As we already mentioned, different types of observational constraints may be used in the reconstruction of the probability density, and we see no reason why the C_k coefficients could not be diversified accordingly, perhaps giving rise to tractable analytical expressions. We also emphasized that the selection of the subsets to take into account deserves careful attention. Lastly, it is unfortunate that these ME models are so difficult to handle analytically and so demanding computationally, precluding the exploration of larger systems; advances on the theoretical as well as on the numerical side are therefore mandatory if one wishes to gain insight into the underlying physics.

ACKNOWLEDGMENTS
The authors would like to thank Joris Borgdorff, Christophe Charpilloz, Raphael Conradin, Alexandre Dupuis and Anton Golub for their helpful advice and comments. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement 317534 (Sophocles).

∗ [email protected]
[1] Y. Bar-Yam, Complexity, 15 (2004)
[2] Y. Bar-Yam, Advances in Complex Systems, 47 (2004)
[3] Aristotle, Metaphysics (Harvard University Press, Cambridge, Massachusetts, 1933)
[4] E. Schneidman, S. Still, M. J. Berry, and W. Bialek, Physical Review Letters (2003)
[5] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, Nature, 1007 (2006)
[6] N. Ay, E. Olbrich, N. Bertschinger, and J. Jost, Chaos (2011)
[7] S. Wolfram, Reviews of Modern Physics, 601 (1983)
[8] G. Martinez, arXiv (2013)
[9] C. G. Langton, Physica D, 12 (1990)
[10] D. T. Brown, Information and Control, 386 (1959)
[11] T. Cover and J. Thomas, Elements of Information Theory (Wiley-Interscience, New York, 2006)
[12] S. Watanabe, IBM Journal, 66 (1960)
[13] B. Chopard and M. Droz, Cellular Automata Modeling of Physical Systems (Cambridge University Press, Cambridge, 1998)
[14] K. Lindgren, Complex Systems, 529 (1987)
[15] K. Lindgren and M. G. Nordahl, Complex Systems, 409 (1988)
[16] R. Quax, A. Apolloni, and P. M. A. Sloot, Journal of the Royal Society Interface