Reconstructing cellular automata rules from observations at nonconsecutive times
Veit Elser
Laboratory of Atomic and Solid State Physics, Cornell University, Ithaca, NY 14853-2501, USA
(Dated: February 16, 2021)

Recent experiments by Springer and Kenyon [1] have shown that a deep neural network can be trained to predict the action of t steps of Conway's Game of Life automaton given millions of examples of this action on random initial states. However, training was never completely successful for t > 1, and even when successful, a reconstruction of the elementary rule (t = 1) could not be extracted from the trained network. We present a constraint-based method that reconstructs the elementary rule, together with the unobserved intermediate states, from observations at just two nonconsecutive times, assuming only that every cell is updated by the same rule acting on n adjacent cells. The unknown rules in our experiments are not restricted to simple rules derived from a few linear functions on the inputs (as in Game of Life), but include all 2^(2^n) possible rules on n inputs. Our results extend to n = 6, for which exhaustive rule-search is not feasible. By relaxing translational symmetry in space and also time, our method is attractive as a platform for the learning of binary data, since the discreteness of the variables does not pose the same challenge it does for gradient-based methods.

I. INTRODUCTION

From a hardware perspective, cellular automata (CA) are a natural model of computation. While too simple as a serious model of the universe itself [2], their dynamics exhibit many of the same qualitative modes of behavior seen in physical systems. CA have translational symmetry and as such are of interest in machine learning, where neural networks with convolutional filters are routinely used to detect spatial patterns, no matter where they occur in an image.

With convolutional filters matching in size the input field of an automaton, a network has the capacity to represent the automaton rules. The challenge of training a network to learn the rules was recently taken up by Springer and Kenyon (SK) [1] with Conway's Game of Life. In this 2D binary-valued automaton, the value of a cell at the next time step is uniquely determined by the value of a linear filter applied to the 3 × 3 neighborhood of the cell, and SK trained networks to predict the t-step Game of Life evolution of the input pattern. Using standard gradient-based optimization of the network parameters, the rule was learned reliably only for t = 1. Results were mixed for t > 1, even when (convolutionally) adding many extra parameters as is common practice in machine learning.

Because SK did not impose time-translational symmetry on their filters, their network cannot be faulted for not reconstructing the elementary (t = 1) CA rule, even when it was able to correctly predict the t > 1 evolution. Their study was framed in terms of the lottery ticket hypothesis [3] of gradient-based optimization on networks, for which the CA prediction problem is an instructive test case. On the other hand, now that one approach to this problem has been tried, it seems appropriate to consider its difficulty and what methods are available to solve it.

The case t = 1 is trivial for any number of CA inputs n: one simply examines states at two consecutive times and constructs the CA rule as a look-up table. For a binary automaton, a random input state having size of order n 2^n will contain all 2^n possible patterns that completely define the CA rule. For small enough n the case t > 1 is also easy: one enumerates all 2^(2^n) (binary) CA rules on the input to find one that gives a match to the output when evolved by t steps. Again, a single large random data instance suffices, although now one should expect non-uniqueness, such as when the output state has low entropy (e.g. a uniform state). The CA rule reconstruction problem is therefore interesting for t > 1 and larger n. Since 2^(2^6) = 2^64 ≈ 2 × 10^19, n = 6 is already an interesting case.

We present a method for reconstructing CA rules that has several parallels with neural networks. Variables are arranged at the nodes of a layered feed-forward network, with data applied at the input and output layers. "Training" is done with a single input-output pair. When successful, the variables on the intervening layers reveal the unseen states of the CA.
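For contrast with the method outlined below, the two baseline approaches just mentioned (the t = 1 look-up table and exhaustive rule search) can be sketched in a few lines. This is a minimal 1D illustration with assumed conventions (ring topology, input window {p − 1, p, p + 1}, rule bits indexed by the base-2 input label), not the constraint-based method developed in this paper:

```python
def step(state, rule, n=3):
    # One CA time step on a ring: cell p reads cells p-1, p, p+1, forms the
    # input label s from their values as base-2 digits, and outputs bit s of
    # the rule integer (bit-order convention assumed).
    L = len(state)
    out = []
    for p in range(L):
        s = sum(state[(p - 1 + k) % L] << k for k in range(n))
        out.append((rule >> s) & 1)
    return out

def lookup_reconstruct(before, after, n=3):
    # t = 1 case: read the rule off as a look-up table from one consecutive pair.
    rule = 0
    for p in range(len(before)):
        s = sum(before[(p - 1 + k) % len(before)] << k for k in range(n))
        if after[p]:
            rule |= 1 << s
    return rule

def exhaustive_search(before, after, t, n=3):
    # t > 1 case: try all 2^(2^n) rules; feasible only for small n.
    matches = []
    for rule in range(1 << (1 << n)):
        x = before
        for _ in range(t):
            x = step(x, rule, n)
        if x == after:
            matches.append(rule)
    return matches

# A cyclic de Bruijn sequence contains every length-3 pattern, so a single
# consecutive pair of states determines the rule completely.
state = [0, 0, 0, 1, 0, 1, 1, 1]
```

With `after = step(state, R)`, `lookup_reconstruct(state, after)` returns R exactly, while `exhaustive_search` on a t-step gap returns every rule consistent with the data, generally more than one on such a short ring.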
There are also variables on the network edges, connecting every node not in the input layer with its n inputs in the layer one time step earlier. However, these are not weight parameters, as in standard neural networks, but auxiliary variables used for "splitting" the reconstruction problem into constraints among independent sets of variables. The actual network parameters in our method are the unknown 2^n bits of the CA rule. An important point of departure from standard practice is that the parameters are not optimized by minimizing a loss. Instead, the parameter-bits along with the states in the unseen layers are recovered from the fixed point of an iterative feasibility solver. This alternative approach [4] has been demonstrated for the training of standard network models and seems especially well suited for the CA rule reconstruction problem.

After defining the network variables for a general CA in section II, we show in section III that the constraints they must satisfy can be partitioned into two sets such that the corresponding projections — to satisfy the constraints with least change — are easy, local computations. In section IV we briefly review the general purpose RRR algorithm we will use for finding feasible points, that is, points that satisfy both sets of constraints. The method is first applied, in section V, to n = 3 automata in one dimension, featuring Wolfram's Rules [5] 30 and 110 as examples of chaotic and Turing-complete CAs. Although a reconstruction algorithm is not needed for n = 3, we find that the new method appears to find rules without exploring all 2^(2^3) = 256 possibilities. To demonstrate the method in a setting where we know of no practical alternatives, we turn to a CA with n = 6. Finally, in section VI we describe how the same scheme might be used in a new model for unsupervised learning called Boolean generative networks, where the task is to discover how strings of bits are generated from fewer uncorrelated bits.

II. NETWORK VARIABLES
We use ℓ = 0, . . . , t to label the layers of the network, and p ∈ Λ for the points/nodes in the identical and translationally invariant layers Λ. By giving Λ the topology of a torus, a CA rule learned on a finite Λ also applies to infinite Λ [?]. A subset I ⊂ Λ of size n defines the input field of the automaton. Node (ℓ, p) receives inputs from nodes (ℓ − 1, p + i), for each i ∈ I. The possible input states of the CA rule are labelled s = 0, . . . , 2^n − 1, with the convention that the states of the input cells are the base-2 digits of s.

There are three sets of variables, x, y and z, all of which take values 0 or 1 in a solution. The CA rule is expressed by the variables z(ℓ, p, s), ℓ > 0, where z(ℓ, p, s) = 1 means that the cell/node at (ℓ, p) has adopted the rule of being in state 1 when its input field (in layer ℓ − 1) is in state s. There will be a constraint that all nodes use the same rule.

FIG. 1. Node and edge variables highlighting (filled node, solid edges) those taking part in one CA rule constraint. In this example the CA has three inputs and each node variable has three upward projecting versions on edges (two of which are shown dashed).

The structure of the variables x and y is shown in Figure 1 and is identical to the scheme used for constraint-based training of standard networks [4], except that the edges here do not also hold parameters (weights). Our drawings use the neural network convention, where time (forward propagation) is upward, the opposite of the CA convention.

There is a CA state variable y(ℓ, p) at every position p ∈ Λ and layer ℓ = 0, . . . , t. These are constrained directly by the data in layers ℓ = 0 and ℓ = t. Figure 1 shows the cell (ℓ, p) receiving inputs from cells (ℓ − 1, p + i), i ∈ I (shown for i = 0, 1, 2), on which the state is also expressed by a y variable. We also see variables x(ℓ, p − i, p), i ∈ I attached to the edges projecting upward from cell (ℓ − 1, p). In a solution these have the same value as y(ℓ − 1, p), but this is relaxed when imposing the CA rule constraints. In particular, we can think of x(ℓ, p, p) as an independent version of y(ℓ − 1, p) that is used in the rule constraint at cell (ℓ, p). The full set of edge variables on which the rule constraint is imposed is x(ℓ, p, p + i), i ∈ I, where these are versions of the state variables y(ℓ − 1, p + i). By allowing the CA rule constraint to act on independent versions of the input variables, the enforcement of the constraint becomes an easy, local computation. Likewise, imposing equality of the multiple versions x(ℓ, p − i, p), i ∈ I with y(ℓ − 1, p) is also a local computation.

III. CONSTRAINTS AND PROJECTIONS
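Before writing down the constraints, it may help to see the bookkeeping as code. This sketch (assumed conventions: a small 1D ring, input offsets I = {0, 1, 2} as in Figure 1) lists which variables enter each local constraint:

```python
# Illustrative indexing only: a 1D ring of L cells, t layers above the input,
# and three input offsets, so n = 3.
L, t, I = 8, 3, (0, 1, 2)

def rule_constraint_vars(l, p):
    # Variables entering the rule constraint at node (l, p): one node
    # variable y, the n edge replicas x(l, p, p+i), and the 2^n rule bits z.
    y = ('y', l, p)
    xs = [('x', l, p, (p + i) % L) for i in I]
    zs = [('z', l, p, s) for s in range(1 << len(I))]
    return y, xs, zs

def concur_group(l, p):
    # Edge replicas that the concur constraint forces to agree with y(l, p):
    # the copies x(l+1, p-i, p) on edges projecting upward from node (l, p).
    return [('x', l + 1, (p - i) % L, p) for i in I]

# Every edge replica appears in exactly one rule constraint and exactly one
# concur group, which is what makes both projections local computations.
rule_xs = {v for l in range(1, t + 1) for p in range(L)
           for v in rule_constraint_vars(l, p)[1]}
concur_xs = {v for l in range(t) for p in range(L)
             for v in concur_group(l, p)}
```

The final two sets coincide, confirming that the replicas are partitioned once by rule constraints and once by concur groups.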
The splitting scheme described above is an example of "divide and concur" [6], where the concur-constraint imposes equality of variables that have multiple versions. Even when the variables of the problem are required to be discrete, it makes sense to embed them in the continuum, since then a real-valued "concur-value" holds information about the degree to which one discrete value is favored over the other. All of our variables, x, y and z, will be continuous to take advantage of this. The constraints are therefore sets in Euclidean space, and nearness to a constraint is the standard distance to the set. The projection to constraint set S, of an arbitrary point (x, y, z) in our space of variables, is the point (x′, y′, z′) = P_S(x, y, z) ∈ S that minimizes the distance to (x, y, z).

Since projections minimize distance, and the three variable types represent quite different things, there is no reason to assume that the real numbers 0 and 1 are the best encoding of the discrete CA for all three of them. We therefore let x ∈ {0, 1}, y ∈ {0, η} and z ∈ {0, ζ} be the discrete choices for the variable types, where η and ζ will have the same role as hyperparameters in machine learning.

The "rule constraint" is local to each (ℓ, p), and each local constraint set is the union of 2^(n+1) point sets:

∀ (ℓ = 1, . . . , t; p ∈ Λ):    (A)
    ∪_{s = 0, ..., 2^n − 1} { z(ℓ, p, s)/ζ = y(ℓ, p)/η ∈ {0, 1};  ∀ i ∈ I: x(ℓ, p, p + i) = s_i }.

Here we use s_i to denote the bit of state s associated with input index i. This constraint set, called A, is one of the two constraint sets upon which the RRR algorithm, described below, is built. To see that the projection P_A to this set is an easy computation we need only observe that different (ℓ, p) have no variables in common and projecting to each local constraint involves finding the minimum of 2^(n+1) distances. As an example, consider a CA with n = 6. Each local constraint then involves 6 x's, one y, and 64 z's. Given arbitrary real values of these variables, the projection to the constraint outputs discrete values of the x's (an instance of the rule-inputs), a discrete y (the corresponding rule-output), and changes only a single z — the one associated with the discrete settings of the 6 x's — to the same value selected for y. The case ℓ = t is a special case of this constraint in that the variables y(t, p) are directly specified by the data.

The B constraint implements "concur", or variable equality, of two kinds. First, we require that the same CA rule is applied at all layers and positions of the automaton:

∀ s = 0, . . . , 2^n − 1, ∃ z_B(s) ∈ R:    (B1)
    ∀ (ℓ = 1, . . . , t; p ∈ Λ): z(ℓ, p, s) = z_B(s).

Second, the edge variables projecting upward from the same node should be equal to the node variable at that node:

∀ (ℓ = 0, . . . , t − 1; p ∈ Λ):    (B2)
    ∀ i ∈ I: x(ℓ + 1, p − i, p) = y(ℓ, p)/η.

For this constraint ℓ = 0 is a special case because the values y(0, p) are directly specified by the data. As both kinds of concur constraint are simple linear constraints on small sets of independent variables, the computation of the projection P_B is easy.

It is interesting that the choice of the Euclidean distance, when defining projections, leads to the simple rule that the concur value is just the arithmetic average. That is, the smallest sum-of-squares change to a set of real numbers that makes them equal is to replace them by their average. When the 2-norm is replaced by the 1-norm, the concur value becomes the median of the numbers and is not unique when the number of numbers is even.

The final step in establishing the constraint formulation is to show that any point (x, y, z) ∈ A ∩ B, where the variables satisfy all constraints, is a solution of the CA rule reconstruction problem. Starting with the B constraints, when these are satisfied the same CA rule is used at all nodes and the x variables on edges are truly exact copies of the y variables on nodes. This constraint also sets the y's on the input layer to their values in the data. When the A constraint is also satisfied, with its x variables bound by the B constraint to y variables in the lower layer, then a CA rule as represented by z holds at each node, and constraint B ensures that the same rule is used at each node. Geometrically, A is a point set that derives some of its structure from the final CA state, while B is a single hyperplane whose parameters depend on the initial CA state. Finding a point in the intersection of these sets is made hard by the property of set A being nonconvex.

IV. RRR ALGORITHM FOR FINDING FEASIBLE POINTS
The relaxed-reflect-reflect (RRR) algorithm [7] is an iterative method for finding points x ∈ A ∩ B, where A and B are subsets of R^m. In the CA rule reconstruction problem x is the vector comprising all the variables (denoted (x, y, z) above). The algorithm is completely specified by the projections P_A and P_B that take an arbitrary point x to the nearest point, by the Euclidean distance, on the respective constraint sets. The RRR iteration

x → x′ = x + β (P_B(2 P_A(x) − x) − P_A(x)),    (2)

with time-step parameter β, has two key properties. First, it is easy to see that if x∗ is a fixed point of the iteration, then

x = P_B(2 P_A(x∗) − x∗) = P_A(x∗) ∈ B ∩ A,    (3)

is a solution. The second property, which relies on the "reflector" in the argument of P_B, is that its fixed points are attractive. When RRR is written entirely in terms of reflectors R = 2P − I, the iteration takes the form x′ = (1 − β/2) x + (β/2) R_B(R_A(x)), and the relaxation parameter β/2 may lie anywhere in (0, 1) for convergence in the convex case. Since even nonconvex sets A and B are usually locally convex, or are well approximated as such, this generous range for the time step holds even in the nonconvex case. The RRR iteration is asymmetric in the sets A and B, so interchanging them gives another algorithm. While all of the results we report use (2), we also found solutions using the alternate form.

RRR is the generalization to arbitrary constraint sets of the most successful algorithm for phase retrieval [8]. It has a strong record with combinatorially difficult problems where gradient methods perform poorly [9]. Most recently it was used in the training of neural networks [4], with the same structure of node/edge variables for splitting constraints as we use here.

In loose analogy with gradient-based optimization in machine learning, where progress is assessed by a decreasing loss, proximity of a solution fixed point with RRR is reflected in the distance moved in each iteration, ∆ = ‖x′ − x‖. But whereas the evolution of loss in gradient optimization is mostly unremarkable, in hard feasibility problems ∆ drops abruptly to a small value, in an apparent "aha" moment, after a long meander with large ∆. The CA rule reconstruction problem, especially when the solution is unique, is a hard problem and it is not realistic to expect any other kind of behavior in the evolution of ∆.

Since ∆ mostly just serves as an indicator for solution discovery, one needs other means for assessing the quality of the RRR search. For our application the concur values of the CA rule, z_B(s), serve that purpose. These 2^n numbers are the output of the projection to constraint (B1) and convey the tendency toward 0 or ζ for each combination of inputs. Their evolution with RRR iteration reveals the rate at which qualitatively different rules are being considered in the iterative search.

V. EXPERIMENTS
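The update (2) is simple to implement. As a self-contained sanity check, independent of the CA setting, here is the iteration applied to a toy two-set feasibility problem (the unit circle intersected with a horizontal line, a nonconvex instance; the sets and all names are illustrative, not the paper's constraints):

```python
import math

def P_A(v):
    # projection onto the unit circle (a nonconvex set)
    r = math.hypot(v[0], v[1])
    return (v[0] / r, v[1] / r) if r > 0 else (1.0, 0.0)

def P_B(v):
    # projection onto the horizontal line y = 0.5 (a convex set)
    return (v[0], 0.5)

def rrr(v, beta=0.5, iters=20000, tol=1e-12):
    # RRR update, eq. (2): x -> x + beta*(P_B(2*P_A(x) - x) - P_A(x))
    for _ in range(iters):
        a = P_A(v)
        refl = (2 * a[0] - v[0], 2 * a[1] - v[1])   # reflect through A
        b = P_B(refl)
        v_new = (v[0] + beta * (b[0] - a[0]), v[1] + beta * (b[1] - a[1]))
        delta = math.hypot(v_new[0] - v[0], v_new[1] - v[1])
        v = v_new
        if delta < tol:         # Delta plunges when a fixed point is reached
            break
    return P_A(v)               # by eq. (3) the solution is P_A of the fixed point

sol = rrr((0.3, -0.7))
# sol lies numerically on both sets: unit norm and second coordinate 0.5
```

Which of the two intersection points is found depends on the starting point; the monitored quantity delta plays the role of ∆ in the text.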
In this section we present results on CA rule reconstruction using the RRR algorithm on the constraint formulation described in section III. It is appropriate to view these results as experiments in that the run-time (number of iterations) of RRR, itself a chaotic dynamical system, is beyond our ability to estimate. Going into these experiments we had no hypotheses about the nature of the rule search, only that the hardness would increase dramatically both with the number of inputs n and the number of time steps t separating the initial and final states.

The algorithm was implemented as a C program and requires only the standard libraries. All software and data used in the experiments is freely accessible at https://github.com/veitelser/rulerecon.

FIG. 2. The Rule 30 (left) and 110 (right) automata evolving with periodic boundary conditions, upward, from the same initial state.

A. Rules 30 and 110
Wolfram's Rule 30 and 110 automata are interesting because the former exhibits the characteristics of a chaotic dynamical system [10], while the latter was shown to be Turing-complete [11]. How these properties translate into the hardness of reconstructing the CA rule from t-step evolution data is an interesting question we address for the first time. Of course n = 3 rules are trivially found by exhaustive search, so "hardness" is interpreted through the lens of methods, such as ours, that continue to be practical even when exhaustive search is impossible.

Figure 2 shows the evolution of the two automata (time running upward) from the periodic initial state of length L = 30 also used in the experiments. The data provided to the RRR algorithm is just this initial string of bits together with the string at time step t. To run RRR all that needs to be specified is the initialization of the (x, y, z) variables and the three hyperparameters η, ζ and β. Because RRR itself has strongly mixing dynamics, there is no benefit from clever initialization and we simply use uniform random numbers bounded by the discrete values of these variables.

By minimizing the number of iterations to the solution, for Rule 30 with t = 5, we obtained the settings η = 0.…, ζ = 0.… and β = 0.2. The same settings were then used at all t and also for Rule 110. Whereas the behavior with β, described below, is systematic and interpretable, the optimal scales of the three types of bits are entirely empirical. That ζ > η improves performance indicates that rule-inconsistency (over all cells in the time evolution) should receive a higher penalty than a wrong rule-output (z-bits have a greater scale than y-bits).

TABLE I. Success rate for reconstructing Rule 30 from t = 5 data in RRR time T = 4 × 10^…, as a function of the time step β.

That small β improves performance was first noticed in experiments on the bit retrieval problem [7]. To interpret this phenomenon for the problem at hand, we define the solution time T as the product of the time step β and the number of iterations taken by RRR to find the solution. The β → 0 limit of T, say averaged over runs from different starting points, is the continuous time taken by RRR to arrive at a solution fixed point. Since the speed of the dynamical system, the fluctuating quantity ∆/β, has a roughly constant running average over the course of the search, T is also proportional to the distance (in R^m) traveled in finding the solution.

Fixing T = 4 × 10^… for the Rule 30, t = 5 instance, we can test how solution discovery depends on β, the time-discretization of RRR. Table I gives the success rate averaged over 100 runs from random starting points at various β. We interpret the plunge in the success rate at large β as follows: when β is too large, distant variables (in space or time) are effectively uncoupled because only their time average is noticed. Although RRR by construction never stagnates in a strict sense when the constraints are not perfectly satisfied, we believe the quasi-independence of distant variables stabilizes thermodynamic-like states, where variables have stationary (non-solution) distributions.
Our experiments indicate that "finite temperature" traps of this kind are completely eliminated when the RRR time step is sufficiently small.

When β is small and all runs succeed, we can ask whether T, averaged over starting points, converges to a finite value T∗ in the limit of small β. If so, then T∗ is the mean solution time of continuous-time RRR. We find this to be the case, with convergence already for β < 0.…. For reconstructing Rules 30 and 110 from t = 5 data we find T∗ to be respectively 7.… × 10^… and 1.… × 10^….

FIG. 3. RRR velocities for the three variable types in a run of a Rule 30, t = 5 reconstruction. The rule and unseen states were reconstructed in under 2.… × 10^… iterations.

Figure 3 shows the RRR velocity, broken down among the three variable types, in a typical run of a Rule 30, t = 5 reconstruction. In the plot, v_x = ‖x′ − x‖/β, v_y = ‖y′ − y‖/(η β), v_z = ‖z′ − z‖/(ζ β) are the velocity components in units of bits per continuous time. The many wiggles in these curves might suggest that RRR is trying out many of the 256 rules for n = 3, when in fact very few are being considered in the search. This can be seen in the time series of the concur estimate of the rule, z_B(0), . . . , z_B(7), rendered in Figure 4 for the same run as Figure 3, with RRR iterations running left to right. Not only do particular bit patterns persist over many iterations, only a small fraction of the 256 rules are seen at all. We interpret this to mean that most of the work in the RRR iterations goes into the slow process of making the x and y variables consistent throughout space and time. When additionally this collective dynamics of x and y is required to be consistent with a shared rule, it appears that very few rule candidates come under consideration. Curiously, the rules that appear with the greatest frequency over the course of the search, when reconstructing Rule 30 or 110, are the linear CA rules (linear over the field of two elements). For example, in Figure 4 we see Rule 150 as the leading candidate in over one-third of the iterations.

Table II compares the work (RRR iterations) in reconstructing Rules 30 and 110 as a function of the number of steps t between the data strings. All runs, for both rules, used the initial state shown in Figure 2, η = 0.… and β = 0.2. We did not find significant differences in optimal hyperparameters for the two rules. Only the optimal ζ exhibited a significant trend with t. The improvement seen with decreasing ζ, at larger t, means the search is more productive when rule consistency is attenuated in response to the increased number of independent rule constraints.

FIG. 4. Evolution of the 8 rule-bits given by the concur estimates z_B(0), . . . , z_B(7), for the same run shown in Figure 3. RRR "time" runs left to right. The black rows on the right are the 1-bits of Rule 30.

Rule 110 appears to be consistently harder to reconstruct than Rule 30, but not overwhelmingly so. It is interesting that the increase in the RRR iterations with t is somewhat erratic, such as for Rule 30 at t = 4. We believe this is transient behavior that might be eliminated if the initial state is sampled from the stationary distribution of the rule instead of the uniform distribution.

B. A rule on six inputs
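A rule on n inputs can be packed into the bits of a single integer, with input pattern s selecting bit s; this is the encoding in which the rule below is specified. A direct simulation sketch (Python; the input window I = {0, ..., n − 1} and the bit ordering are assumed conventions for illustration):

```python
import random

def step(state, rule, n):
    # One time step on a ring of cells: cell p reads the n adjacent cells
    # p, p+1, ..., p+n-1, forms the input label s from their values as
    # base-2 digits, and outputs bit s of the rule integer.
    L = len(state)
    return [(rule >> sum(state[(p + k) % L] << k for k in range(n))) & 1
            for p in range(L)]

X = 6489248685664986109     # the randomly selected n = 6 rule of this section

random.seed(0)
state = [random.randint(0, 1) for _ in range(200)]
for _ in range(3):          # a t = 3 evolution, as in the reconstructions
    state = step(state, X, 6)
```

The same `step` generalizes the n = 3 automata of the previous subsection; only the 2^6 = 64 rule bits and the wider input window change.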
To test our method for a CA where exhaustive rule search is not possible, we randomly selected a Rule X for n = 6, where

X = 6489248685664986109.    (4)

The time evolution of Rule X from a random periodic state of length L = 200 is shown in Figure 5.

Rule reconstruction for n = 6 is made harder both because the number of z variables at each node has grown to 2^6 = 64, and also because the spatial extent of the network (L = 200) needs to be large enough to sample all 2^6 input patterns for the rule to be determined uniquely.

TABLE II. Growth with t in the average number of RRR iterations to reconstruct CA rules. All entries are based on 100 runs, all successful, with η = 0.… and β = 0.2.
t    ζ      Rule 30       Rule 110
2    1.6    8.… × 10^…    …

Our particular choice of initial state in fact only includes 63 of the possible inputs and therefore cannot determine a unique rule for t = 1. Even having the benefit of rule consistency on another whole layer of the network, for t = 2, does not uniquely determine the rule, as our method finds three rules (differing in two bits) that can account for the data for that number of time steps. However, for t = 3 the method always finds the same rule and it is exactly the rule (4) we used to generate the data.

It is a remarkable empirical fact that the RRR algorithm is able to discover the CA rule with a number of iterations much smaller than the 2^64 rules that would have to be considered in an exhaustive search. There is currently no comprehensive theory of how RRR manages to find needles in similarly complex haystacks (e.g. phase retrieval). From the evolution (in RRR time) of the concur estimates of the rule (z_B) shown in Figure 6, in a reconstruction from t = 2 data, we see that RRR is able to identify a number of "branching bits" that, in persisting over many iterations, seem to be guiding the search at a high level. This branching scheme is an automatic consequence of the obvious splitting of constraints into sets A and B to make them independent; cleverness was not involved.

RRR performance on reconstructing Rule X is summarized in Table III. For L = 200 and t = 3 our C program does 925 iterations per second, which limited us to only 20 trials in this case. Since the projection to the L × t rule constraints (A) dominates the time, and these could have been done concurrently, a parallel implementation could gain a factor of 600 in time. Hyperparameter settings are essential for good results. Fixing β = 0.… and T = 2 × 10^… for the t = 3 reconstruction, the success rate remained 100% for ζ within 0.… ± 0.….

VI. BOOLEAN GENERATIVE NETWORKS FOR BINARY DATA
What began as a case study in machine learning [1], and then turned to questions about CAs, now returns to the subject of machine learning. In particular, we propose applying our methodology for reconstructing CA rules to the construction of generative models.

TABLE III. Hyperparameters and average RRR iterations for reconstructing Rule X.
t    η      ζ       β      trials    average iterations
2    1.2    0.35    0.2    100       8.… × 10^…
3    …      …       …      20        …

FIG. 5. Time evolution (upward) of Rule X (4) from a random periodic state of length 200.

FIG. 6. Complete evolution of the 64 rule-bits given by the concur estimates z_B(0), . . . , z_B(63) in a reconstruction of Rule X from t = 2 data. Over the course of the search (5.… × 10^… RRR iterations) many rule bits persist over long times. The fixed-point bits of Rule X appear on the right.

Generative models may loosely be defined as schemes for generating fake data from sufficiently many examples of genuine data. The capacity to create convincing fakes directly demonstrates generalization and implies some understanding of the structure of the data.

All network-based generative models create fake data by sampling a smaller space than the space in which the data resides, called the internal representation. In variational autoencoders (VAEs) [12, 13] a network is trained to encode data samples into an internal representation and then decode them back to the data with high fidelity. If additionally the distribution of "codes" in the internal representation is trained to have a chosen form, then sampling from that distribution and decoding constitutes a generative model. Generative adversarial networks (GANs) [14] are decoder-discriminator pairs. Here the idea is to train, in tandem, a decoder that makes increasingly convincing fakes that fool the discriminator, and a discriminator that continues to be able to flag the ever improving fakes. Since the decoders (for VAEs and GANs alike) in standard neural networks are continuous maps, by adopting a universal (e.g. multi-variate normal) model for the code distribution, the success of both of these methods is limited when the data distribution is very different in character, say in having a complex support.
Another drawback is that the gradient-descentbased loss optimization methods for standard networkscan only promise local optima.Our proposal, called Boolean generative networks(BGNs), while having smaller scope than VAEs andGANs, is built on a more explicit definition of “gener-alization.” There is only a decoder and generalizationcapacity is specified by its depth. Figure 7 shows a smallBGN decoder of depth 2. The fully connected layers ofnodes should be interpreted as a Boolean circuit thattakes two Boolean inputs, in the code layer, and outputsBoolean T ( True ) and F ( False ) at the four outputnodes. By associating T with 1 and F with 0, the cir-cuit is able to generate 2 , 4-bit strings of data in itsoutputs. More generally, a BGN with m inputs and n outputs generates 2 m , n -bit data strings.By fixing the number of inputs, or the entropy of thegenerated data, the generalization capacity of a BGN isstrictly a function of the number of network edges whenwe adopt a uniform circuit construction rule. It is inthis respect that the BGN scheme intersects with the CA dataycodewx FIG. 7. A Boolean generative network for 4-bit binary dataand 2-bit codes, showing the three variable types: w , x (onedges) and y (on nodes). rule reconstruction problem. As a first proposal, we haveconsidered the rule where all the gates are Or , and eachedge can be in one of three states: ∅ , W , and W . Thesecorrespond, respectively, to the absence or presence ofa non-negating ( W ) or negating ( W ) wire. Instead of Or gates at all the non-input nodes, we could have cho-sen Nor , And , or
Nand , as these are equivalent withsuitable negations applied to the wires.Given enough depth, arbitrary Boolean functions canbe synthesized with just Or gates and negation, andtherefore BGNs have the capacity to represent arbitrarybinary data. However, the strength (or weakness) of ourproposal depends on whether binary data of interest canbe generated with networks of modest depth. We do notexplore this question here, but consider two very simpletoy data sets below to convey that depths as small as twoare already interesting. BGN representations have theproperty of being disentangled in that all combinationsof Boolean inputs are admissible for generating data.We depart from VAEs and GANs also by taking ad-vantage of the RRR fixed-point method for training. Asin CA rule reconstruction, the RRR algorithm is ableto reconstruct the Boolean circuit — wires and nega-tions — at the same time it is reconstructing the Booleanvariables at the nodes, including the code at the inputs.One difference from CA rule reconstruction is that all thevariables have a data item label, and the CA-rule concurconstraint (B1) is replaced by consistency of the BGN’swire variables across all data items. Large data sets canbe processed in batches [4] by running RRR for somenumber of iterations on one batch and using the concurestimate of the wire states in that run to warm-start therun on the next batch.The three variable types used to reconstruct a BGN,from a data set of output bit strings, is shown in Figure 7.The x variables on edges and y variables on nodes areexactly as they were in CA rule reconstruction. A smalldifference is the absence of y variables in the input/codelayer as there are neither gates nor data constraints at TABLE IV. BGN hyperparameters defining the discrete set-tings of the two-component wire variables and the twoBoolean variable types. The three wire states are ∅ (no wire),non-negating wire ( W ), and negating wire ( W ). ∅ W Ww (0 ,
0) ( σ, ω ) ( − σ, ω ) False True x False True y η this layer. The counterpart of constraint (B2) in theinput layer simply imposes equality of all the incident x edge-variables. Another difference is that there arewire variables w at all the edges of the network. Sincethese take three discrete values in a solution, the wirevariables w are variable-pairs at each edge since a 2D realspace is required to represent the most general metricalrelationship among three points.The BGN counterpart of the CA rule constraint (A)applies to the inputs x , wires w , and output y of every Or gate of the circuit. As in the CA rule constraint, we exer-cise our freedom in choosing the discrete settings of thesevariables to introduce three hyperparameters. These aredefined in Table IV. When ω = 0 the wire states couldhave been represented by a single real number, but wefind that ω = 0 improves the search behavior of RRR.An attractive feature of implementing arbitrary logicby 3-state wires (and only Or gates) is that the projec-tions to the gate constraints are highly local computa-tions. For each gate one considers both output states ofthe Or . When the output is F , all incident edges musthave ( w, x ) be in one of the four F states, ( ∅ , T ), ( ∅ , F ),( W, F ) or (
W , T ), and the projection selects the nearest.When the Or output is T , then w and x on each edgeare independently set to their nearest states, and if theresulting pair is one of the four F combinations, thenthe extra distance to the nearest of the T combinations( W, T ) and (
W , F ) must be computed as well. Theseextra distances are only used if all of the incident edgeshave F combinations, in which case the edge with thesmallest extra distance to T is changed to that T combi-nation. Whichever of the two cases of Or output has thesmallest projection distance is the one that gets selectedfor the projection.A reasonable objection to the strict logic of BGNs isthat real-world data is never free of noise and/or mayhave outliers that are not modeled by the logic of themodel. Noise is a serious problem in phase retrieval aswell [8], and we can use the same RRR strategy for ad-dressing it here. In the presence of noise, the RRR ve-locity does not drop all the way to zero (as in Figure3), but remains finite and small upon arriving at a near-solution. RRR iterations are terminated at such eventsand the search variables are interpreted as a solution thathas been corrupted by noise. In the case of BGNs, onewould project the concur estimates w B of the wire vari-ables to the nearest wire states and the concur values y B of the codes to the nearest Booleans (for each data item).The data generated with these wires and codes can thenbe compared with the true data and assessed for bit-flipor outlier errors. Another approach, that preserves thefixed-point behavior of RRR, is to attenuate the con-straint at the data nodes by allowing some number offlipped bits, or exempting some number of outlier data.Both methods of managing noise were demonstrated inthe study [4] that used RRR to train standard neuralnetwork models. A. Correlated partitions
In our first toy application of BGNs we consider data where every bit has an unbiased distribution and at all pairs of positions the bits are either perfectly uncorrelated or perfectly correlated, either positively or negatively. Since the property of being correlated is an equivalence relation, the data bits partition into independent, perfectly correlated subsets. The task of the generative model is to discover this partition and the pattern of negations within each subset. The circuit in Figure 8 shows how data of this type can always be represented by BGNs of depth 1. When the number of BGN inputs m matches the number of partitions in the data, the wire states in a solution are unique up to the order-m!2^m group of code permutations and negations.

RRR easily discovers valid circuits for this type of data. We present results for an instance with m = 8, n = 16, where the data bits partition as 1+1+1+1+2+2+4+4. To test generalization we look for valid solutions when the number of data processed by RRR is less than the number of possible data, 2^m = 256. Table V summarizes our results for the average number of RRR iterations per solution in 100 trials, all successful, when the BGN hyperparameters are tuned as the number of data items is reduced.

FIG. 8. Wire state (∅, W, W̄) assignment in a fully connected 3 → … network.

TABLE V.

data items    σ     ω     β    average iterations
   128       0.9   0.2   0.7     640
    64       1.5   0.3   0.7     360
    32       1.3   0.5   0.7     270
    16       1.3   0.5   0.7     440

Since the number of w variables participating in each concur constraint equals the number of data, we might have expected a larger variation in the optimal hyperparameters for these variables. Depth-1 networks have no y variables and there is no need to set η. We did not optimize with respect to β but observed that performance degrades overall for larger β.

B. Binary encoding
Our first application of BGNs only made trivial use of the Or gate. The second application, chosen mostly for historical interest, rectifies this.

FIG. 9. Data bits, in rows, for an 8 → 16 instance of correlated partitions. Columns 3 and 5, for example, are identical because the bits at those positions are perfectly correlated.

The 1985 paper [15] that introduced the back-propagation formula for parameter optimization is noteworthy also for some novel applications of the new methodology. Here we revisit the problem of training an autoencoder tasked with compressing 2^m (real-valued) data vectors to binary codes of m bits. By using sigmoid activation functions in the code layer, the internal representation was expected to be binary, in that values close to 0 and 1 are easily realized as outputs of the sigmoid. However, in results reported for the case m = 3, that were successful as far as reconstruction of the data, often the encoding would be such that half of the codes would include the value 1/2. In the Boolean version of the problem we consider depth-1 decoder circuits in which, for the 2^m settings of the code, exactly one of the 2^m Or gates outputs False. When the data includes all 2^m one-False vectors, the decoder circuit is unique up to the (2^m)! permutations of the codes with respect to the position of the F in the output. Table VI summarizes RRR results for runs with all the data. We do not understand the change in the optimized hyperparameter settings and the sharp rise in the number of iterations at m = 5.

TABLE VI. Hyperparameter settings and average number of RRR iterations in 100 trials, for finding depth-1 binary decoder circuits from one-False data.

m     σ      ω     β    average iterations
3    0.6    0.6   0.5      140
4    0.5    0.2   0.5      680
5    0.15   0.55  0.5    57000

By state counting we know a depth-1, m → 2^m network does not have the capacity to binary-decode a general set of 2^m Boolean data vectors. A fully connected depth-2 BGN, with architecture m → 2^m → n, has sufficient capacity, but will the decoding circuits still be interpretable? We get an interpretable design by combining a one-False decoder (Figure 10) with a second stage of the kind shown in Figure 11, with only negating wires. We find, as shown in Figure 12, that these decoder designs are also the circuits found by RRR when the data vectors are generic (randomly generated).

FIG. 10. A BGN that generates the 2^m one-False vectors. For any setting of the nodes in the code layer a single Or in the data layer will be False.
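The Figure 10 construction is easy to state in code. The sketch below is our own illustration, not an implementation from the paper: output Or gate j receives a negating wire from code bit i exactly when bit i of j is 1 (an assumed wiring convention), so that gate j sees all-False inputs, and hence outputs False, only when the code equals j's bit pattern.

```python
from itertools import product

def one_false_decoder(m):
    """Depth-1 BGN: m code bits fan out to 2**m Or gates through
    non-negating (W) and negating (W-bar) wires. Gate j negates code
    bit i exactly when bit i of j is set, so its Or is False only for
    the code matching j's bit pattern."""
    patterns = [tuple((j >> i) & 1 == 1 for i in range(m)) for j in range(2 ** m)]

    def decode(code):
        # each gate input is the code bit, flipped when the wire negates;
        # the Or is False only when every (possibly negated) input is False
        return [any(c != p for c, p in zip(code, pat)) for pat in patterns]

    return decode

# every setting of the code yields a one-False output vector
decode = one_false_decoder(3)
for code in product([False, True], repeat=3):
    assert decode(code).count(False) == 1
```

Distinct codes place the single F at distinct output positions, which is the sense in which the circuit is a binary decoder.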
Changing the RRR starting point only has the effect of permuting (in the solution) the codes with respect to one-False positions, and the latter with respect to the data labels.
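Putting the two stages together gives a depth-2 decoder for arbitrary data. The sketch below is again our own illustration, with an assumed wiring rule, following the Figure 10 and Figure 11 designs: a one-False first stage, then a second stage using only negating wires, where data-layer gate k taps stage-1 node j exactly when bit k of the j-th data vector is True. Since exactly one stage-1 node is False for any code, each output gate fires precisely when its bit is set in the selected data vector.

```python
import random

def two_stage_decoder(data):
    """Depth-2 BGN, m -> 2**m -> n: a one-False stage (Figure 10)
    followed by a stage with only negating wires (Figure 11).
    data: dict mapping each m-bit code (tuple of bools) to an n-bit list."""
    m = len(next(iter(data)))
    n = len(next(iter(data.values())))
    patterns = [tuple((j >> i) & 1 == 1 for i in range(m)) for j in range(2 ** m)]

    def decode(code):
        # stage 1: node j is False only when code == pattern j
        stage1 = [any(c != p for c, p in zip(code, pat)) for pat in patterns]
        # stage 2: Or over negating wires from the nodes selected by the data
        return [any(not stage1[j] for j in range(2 ** m) if data[patterns[j]][k])
                for k in range(n)]

    return decode

# demo: 3-bit codes decoding to random 8-bit data vectors
random.seed(0)
data = {tuple((j >> i) & 1 == 1 for i in range(3)):
        [random.random() < 0.5 for _ in range(8)] for j in range(8)}
decode = two_stage_decoder(data)
assert all(decode(code) == vec for code, vec in data.items())
```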
VII. CONCLUSIONS
The contrast between traditional machine learning and the new constraint-based approach, when applied to the CA prediction problem, could not be more stark. Not only is the new method reliable when multiple time steps separate the data, but it is able to do so with just a single data item instead of millions [1]. Although we did not apply the method to the Game of Life in the setting where it is one of 2^512 possibilities, in the appendix we show the method succeeds when modified for the easier problem considered by Springer and Kenyon [1], where the rule can be formulated with linear filters.

The CA prediction problem and simple toy applications, such as binary encoding [15], challenge the working premise upon which much of machine learning is currently based. For CA prediction, this is the belief that, given sufficient parameters and data, the models will eventually "understand" an elementary CA rule and not just be increasingly good at mimicking its consequences. However, because there is no compelling evidence this is happening spontaneously in at least this application, alternative designs that do offer this functionality should be considered.

Relatively modest modifications of the standard network design can improve data representation at a semantic level by simply being interpretable. A technical obstacle is that the network variables and parameters should be able to take discrete values. We have shown that the RRR optimizer is up to this task and can be efficiently deployed on networks.

FIG. 11. A possible circuit in the second stage of a 3 → 8 → 8 decoder, acting on one-False vectors from a first stage of the kind shown in Figure 10. For example, if the F is at the leftmost input node, the decoder output is the data vector F F T F T F F F.

FIG. 12. RRR evolution (σ = 0.…, ω = 0.…, η = 0.…, β = 0.…) of the component of w_B that encodes negation, in a 3 → 8 → 8 decoder reconstruction.

In the CA rule reconstruction application, interpretability took the form of making "the rule" be the parameters of the model and then imposing this rule at all time steps.
In Boolean generative networks (BGNs), interpretability was introduced by imposing the constraint that all the data is expressible as the output of the same fixed-depth logic circuit. Both applications, in having different variable types (on network nodes, edges, etc.), presented new challenges to RRR, which usually works with homogeneous variable types (e.g. image pixels). Introducing scale hyperparameters for the various discrete variables greatly improved the performance of RRR in this new setting. In the binary encoding problem for random data vectors we saw an instance where the wire variables have very different distributions in different layers (Figure 12) and might benefit from having different scale hyperparameters. An automatic hyperparameter tuning mechanism would in any case make the method more user-friendly.

Our implementation of BGNs, where the Boolean circuit is designed just through 3-state settings of the edges (wire, negating wire, no wire), was meant mostly as a demonstration of what kinds of interpretable representations are possible while staying within a network framework. An alternative model, even closer to the model used for CA rule reconstruction, is to have fixed (non-negating) wires, only 2-input gates, and to give each gate the freedom to select its own truth table.

It is noteworthy that, while both applications we considered are deeply tied to a notion of time (CA evolution, logical implication), this detail played almost no role in the constraint formulation used by RRR. Constraints were imposed (through projections) concurrently at all CA time-steps and layers of the BGN. If a similar mechanism is responsible for generating representations of data in the brain, then what we understand as "logic" may just be the compatible software that runs on the particular brand of hardware we have available for representing the world.

But time reasserts itself as data is being distilled and representations are formed. 
It is in this respect that the subject properly falls in the domain of physics. Gradient descent and RRR are dynamical systems, and the strengths and weaknesses of these methods rest on their behavior in time. Because the time evolution of the discretely constrained variables of RRR so closely resembles a CA, to avoid confusion we made a point of orienting time left-to-right (Figs. 4, 6 and 12), in contrast to the vertical convention for CAs. The dynamics of RRR is poorly understood. In applications such as phase retrieval [7] and sudoku [9], with just one type of variable, the model of strongly mixing dynamics as a mode of exhaustive search has worked well. However, when there are multiple variable types, such as in the applications we considered here, non-productive dynamical behavior can arise as well. A potential problem is posed when the variable types define quasi-independent subsystems, like the phonons and electrons in a conventional metal, and fail to find a solution to the joint system of constraints. So far, by tuning hyperparameters and decreasing the RRR time step β, we have been able to achieve "superconductivity" even in these more complex dynamical systems.

ACKNOWLEDGMENTS
I thank Jonathan Yedidia for bringing reference [1] to my attention, and Neil Sloane for reminding me of his and Conway's motto, of going as far as any reasonable person, and then going further.
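The appendix that follows expresses an automaton rule as a 'gate' defined by two binary masks and two integer bounds. As a quick sanity check of that formulation for the Game of Life (b1 = b2 = 3), here is a short sketch of our own; the specific mask choice, the full 3 × 3 block for w1 and the center-removed block for w2, is our reconstruction of what Figure 13 (top) depicts, not something taken verbatim from the text.

```python
import random

def life_step_masks(grid):
    """One Game of Life step on an n x n torus via the two-mask gate
    y = 1 iff w1.x >= 3 and w2.x <= 3, with w1 the full 3x3 mask
    (9 wires) and w2 the 3x3 mask without the center (8 wires).
    The mask choice is our reconstruction of Figure 13 (top)."""
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s9 = sum(grid[(i + di) % n][(j + dj) % n]
                     for di in (-1, 0, 1) for dj in (-1, 0, 1))  # w1 . x
            s8 = s9 - grid[i][j]                                 # w2 . x
            out[i][j] = 1 if (s9 >= 3 and s8 <= 3) else 0
    return out

def life_step_standard(grid):
    """Reference rule: birth on 3 live neighbors, survival on 2 or 3."""
    n = len(grid)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s8 = sum(grid[(i + di) % n][(j + dj) % n]
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)) - grid[i][j]
            out[i][j] = 1 if (s8 == 3 or (grid[i][j] == 1 and s8 == 2)) else 0
    return out

# the two formulations agree on a random state of density 0.38
random.seed(1)
g = [[1 if random.random() < 0.38 else 0 for _ in range(16)] for _ in range(16)]
assert life_step_masks(g) == life_step_standard(g)
```

The equivalence is easy to verify by hand: a dead cell (center 0) satisfies both inequalities exactly when its neighbor count is 3, and a live cell exactly when it is 2 or 3.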
Appendix: Reconstructing Game of Life ‘gates’
Reconstructing the rule of a general binary automaton with n = 9 inputs (one of 2^512 candidates) is probably intractable. We therefore consider a restricted formulation which happens to be close to the convolutional formulation studied by Springer and Kenyon [1]. The idea is to express the rule as the conjunction of two linear inequalities. If x is the binary vector of n cell states at time t that determine the state y of a cell at time t + 1, then

    y = 1, if w1 · x ≥ b1 and w2 · x ≤ b2;  y = 0, otherwise,    (A.1)

where the weight vectors w1 and w2 are non-negative. For simplicity, and also for the relationship to BGNs, we restrict w1 and w2 to be binary, or indicators for 'wires'. We will refer to the pattern of wires to the field of inputs as 'masks', and the automaton rule as a gate defined by two masks. Again for simplicity we chose not to have the network also learn the values of the two integer bounds b1 and b2. The Game of Life gate has b1 = b2 = 3 and the two 3 × 3 masks shown at the top of Figure 13; the Alien Life gate has the two 5 × 5 masks shown at the bottom, with b1 = b2 = 2. A naive rule search in this case would involve 2^50 possibilities.

FIG. 13. Lower and upper bound masks: the 3 × 3 masks of the Game of Life gate (top) and the 5 × 5 masks of the Alien Life gate (bottom).

One might criticize our 'mask' formulation as being easier than the one used in reference [1] in that the weight parameters are restricted to a discrete set. However, the discrete option is the natural one for the application at hand, and would also have been used in [1] had it been within the scope of the gradient-based optimizer. By choosing to work with the restricted model we reaffirm that discrete states need not be off limits in neural networks.

The network variables for mask reconstruction are for the most part the same as in section II, for general rule reconstruction. Cells of the CA reside on a square lattice Λ with square-torus topology. The field of inputs I is a 3 × 3 (or 5 × 5) block of cells, and there are two kinds of variables, x and y, for the cell states. The y's reside on nodes and the x's on edges between nodes on adjacent layers. Each y and the corresponding x's incident from the past participate in a rule constraint (A), now specialized as a gate parameterized by two masks. One difference is that now there are two sets of edge variables, x1 and x2, one for each inequality in (A.1). Likewise, there are two sets of weights, w1 and w2, also on edges, that are interpreted as indicator variables for wires to the gates. The B constraint again imposes consistency on the various variable replicas. Constraint B1 ensures that all gates, over all space and time, have exactly the same pair of masks (choice of wires). For example, in Life the projection to constraint set B results in 9 + 9 real-valued averages of the mask variables at each gate. Constraint B2 imposes equality at each node of the y and the x's incident from the future. The corresponding projection, also given by an average, is an estimate of the cell states over the nodes of the network.

By far the most elaborate projection is to the rule constraint, as expressed by (A.1), but with separate vectors x1 and x2 for the two inequalities. Participating in each local constraint is a single y, the gate output, two sets of gate inputs, x1 and x2, and corresponding wire variables w1 and w2. In constraint set A all of these take binary values, which we denote 0 and 1 here, but have a scale set by hyperparameters in the actual algorithm. These scales, such as η for y, are all relative to x1 and x2, for which we chose the same scale. For the wire variables we found that, for the harder 5 × 5 masks, it was important to have separate scales ω1 and ω2 for w1 and w2.

The discreteness of constraint A makes the corresponding projection easier, not harder. In the following description of the projection algorithm we mostly want to convey that the complexity does not grow combinatorially; in fact, the 5 × 5 case is hardly harder than 3 × 3. The projection considers, in turn, the two output cases y ∈ {0, 1}. In either case, the next step is to greedily project all the edge variables (x's and w's) to their nearest discrete values. While doing this the algorithm also records, for each edge and mask inequality (1 and 2), the projection to the nearest flipped product. For example, suppose the nearest (x, w) on a particular edge is (1, 0), with product 0. The nearest flipped-product combination is then (1, 1). For each edge the algorithm thus records (i) the nearest (x, w), (ii) the squared distance to that state, (iii) the nearest flipped-product (x, w), and (iv) the extra squared distance to the flipped-product state.

In the second stage of the gate projection, the sums of the products for each inequality (w1 · x1 and w2 · x2), for the greedy discrete state projections, are compared with the bounds b1 and b2. If we are considering y = 1 and both inequalities are satisfied, then the constraint is satisfied and the squared distance has a contribution from y and the greedy values (ii) above from all the edges. If either of the inequalities is not satisfied, some number of the products in w · x must be flipped, 0 → 1 or 1 → 0. By ranking the extra distances (iv) of the edges, the nearest state with y = 1 can be efficiently computed, as is the distance to that state. A similar set of computations is performed for the case y = 0. Now the greedy edge projections are output when either of the inequalities is violated. When both inequalities are satisfied by the greedy edge projections, then the extra squared distances in violating one or the other, obtained by summing ranked edge contributions, are compared to decide which of the two should be violated. Whichever of the two cases y ∈ {0, 1} has the smallest projection distance is the one selected for the gate projection. A combinatorial explosion is avoided because contributions from edges can be sorted, and the minimum distance for changing the status of an inequality is a projection to the equality case. The projection is simplified when the output node is in layer ℓ = t, where y is specified by the data.

We found that the two masks of the Life rule were easily reconstructed even with several time steps t between the observations. As in our experiments with general rule reconstruction, the reconstruction (training) used just a single data item. Following reference [1], the initial random state had a density 0.38 of 1's. For this kind of initial state we obtained the (unique) Life rule for Λ as small as 16 × 16. With hyperparameters ω1 = 0.…, ω2 = 0.7, and η = 2.0, the RRR algorithm with β = 0.75 discovered the correct masks, or close approximations, very early in the search. It appears that most of the work goes into reconstructing the CA states at the unseen times. The average number of iterations per solution in 20 runs, all successful, and up to t = 5, is given in Table VII. As an RRR iteration corresponds computationally to a gradient step in standard training, the numbers in Table VII should be compared to the millions of gradient steps (total data items) used in reference [1]. Even so, the success rate in the minimal architecture (like ours) was 0% already for t = 2. Only when the network capacity was increased by a factor of 10 did the success rate for t = 2 reach 100%. The gradient-trained networks were never successful for t = 5, even with the 10-fold parameter enhancement.

TABLE VII. Average number of RRR iterations needed to reconstruct the Game of Life masks (Figure 13, top) up to t = 5 time steps between observations. All results are based on 20 experiments and used hyperparameters ω1 = 0.…, ω2 = 0.7, η = 2.0, and RRR time step β = 0.75.

Reconstructing the rules of Alien Life proved to be much more challenging. Since this rule is less familiar, Figure 14 shows three steps in the evolution from a random state on a 32 × 32 torus. Our data, on the same size system, had a random initial state at density 0.5. Reconstructions always yielded the Alien Life masks (Figure 13, bottom) that were used to generate the final data state.

FIG. 15. Complete evolution of the two 5 × 5 masks in a reconstruction from t = 2 data. The two 5-wire masks of Alien Life appear on the right.

Results were very sensitive to the settings of the hyperparameters ω1 and ω2 associated with the two masks, in particular, their difference. For example, when ω1 − ω2 > 0.4, the variables of mask 1 are sufficiently less compliant than those of mask 2 that they are effectively static while RRR tries to (unsuccessfully) resolve all inconsistencies via mask 2. This behavior can be detected by comparing the corresponding RRR velocities v_i = ‖w_i′ − w_i‖/(ω_i β), i = 1, 2, which yields v1 ≪ v2 when the hyperparameters differ by 0.4. The activity of the two masks in this kind of unproductive search is reversed when ω1 − ω2 <