Cellular automata as convolutional neural networks
William Gilpin∗
Department of Applied Physics, Stanford University
(Dated: January 20, 2020)

Deep learning techniques have recently demonstrated broad success in predicting complex dynamical systems ranging from turbulence to human speech, motivating broader questions about how neural networks encode and represent dynamical rules. We explore this problem in the context of cellular automata (CA), simple dynamical systems that are intrinsically discrete and thus difficult to analyze using standard tools from dynamical systems theory. We show that any CA may readily be represented using a convolutional neural network with a network-in-network architecture. This motivates our development of a general convolutional multilayer perceptron architecture, which we find can learn the dynamical rules for arbitrary CA when given videos of the CA as training data. In the limit of large network widths, we find that training dynamics are nearly identical across replicates, and that common patterns emerge in the structure of networks trained on different CA rulesets. We train ensembles of networks on randomly-sampled CA, and we probe how the trained networks internally represent the CA rules using an information-theoretic technique based on distributions of layer activation patterns. We find that CA with simpler rule tables produce trained networks with hierarchical structure and layer specialization, while more complex CA produce shallower representations—illustrating how the underlying complexity of the CA's rules influences the specificity of these internal representations. Our results suggest how the entropy of a physical process can affect its representation when learned by neural networks.
I. INTRODUCTION
Recent studies have demonstrated the surprising ability of deep neural networks to learn predictive representations of dynamical systems [1–5]. For example, certain types of recurrent neural networks, when trained on short-timescale samples of a high-dimensional chaotic process, can learn transition operators for that process that rival traditional simulation techniques [2, 6, 7]. More broadly, neural networks can learn and predict general features of dynamical systems—ranging from turbulent energy spectra [8], to Hamiltonian ground states [9, 10], to topological invariants [11]. Such successes mirror well-known findings in applied domains [12], which have convincingly demonstrated that neural networks may not only represent, but also learn, generators for processes ranging from speech generation [13] to video prediction [14]. However, open questions remain about how the underlying structure of a physical process affects its representation by a neural network trained using standard optimization techniques.

We aim to study such questions in the context of cellular automata (CA), among the simplest dynamical systems due to the underlying discreteness of both their domain and the dynamical variables that they model. The most widely-known CA is Conway's Game of Life, which consists of an infinite square grid of sites ("cells") that can only take on a value of zero ("dead") or one ("alive"). Starting from an initial binary pattern, each cell is synchronously updated based on its current state, as well as its current number of living and non-living neighbors.

∗ Also at Quantitative Biology Initiative, Harvard University; [email protected]
Despite its simple dynamical rules, the Game of Life has been found to exhibit remarkable properties ranging from self-replication to Turing universality [15]. Such versatility offers a vignette of broader questions in CA research, because many CA offer minimal examples of complexity emerging from apparent simplicity [16–20]. For this reason, CA have previously been natural candidates for evaluating the expressivity and capability of machine learning techniques such as genetic algorithms [21, 22].

Here, we show that deep convolutional neural networks are capable of representing arbitrary cellular automata, and we demonstrate an example network architecture that smoothly and repeatably learns an arbitrary CA using standard gradient-based training on a loss function. Our approach takes advantage of the "mean field limit" for large networks [23–25], for which we find that trained networks express a universal sparse representation of CA based on depthwise consolidation of similar inputs. The effective depth of this representation, however, depends on the entropy of the CA's underlying rules.
II. EQUIVALENCE BETWEEN CELLULAR AUTOMATA AND CONVOLUTIONAL NEURAL NETWORKS
Cellular automata.
We define a CA as a dynamical system with M possible states, which updates its value based on its current value and D other cells—usually its immediate neighbors in a square lattice. There are M^D possible unique M-ary input strings to a CA function, which we individually refer to as σ. A cellular automaton implements an operator G(σ) that is fully specified by a list of transition rules σ → m, m ∈ {0, 1, ..., M − 1}, and there are M^(M^D) possible unique G(σ), each implementing a different ruleset. For the Game of Life, M = 2, D = 9, and so G(σ) is a Boolean function that maps each of the 2^9 = 512 possible 9-bit input strings to a single bit. A defining feature of CA is the locality of the dynamical update rule, which ensures that the rule domain is small; the size of D thus sets an upper bound on the rate at which information propagates across space.

Convolutional neural networks.
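As a concrete illustration of these definitions (a minimal Python sketch, not code from the paper), the Game of Life's operator G(σ) can be enumerated as an explicit lookup table over all 2^9 = 512 neighborhood strings:

```python
import itertools

def game_of_life_rule(sigma):
    """Game of Life output bit for a flattened 3x3 binary neighborhood.

    sigma is a 9-tuple; index 4 is the center cell, the rest are neighbors.
    """
    center = sigma[4]
    alive = sum(sigma) - center
    if center == 1:
        return 1 if alive in (2, 3) else 0   # survival
    return 1 if alive == 3 else 0            # birth

# The full rule table G(sigma): M^D = 2^9 = 512 input strings, one bit each.
rule_table = {
    sigma: game_of_life_rule(sigma)
    for sigma in itertools.product((0, 1), repeat=9)
}
```

Any binary CA with D = 9 corresponds to one such 512-entry table, which is why there are 2^512 possible binary rulesets.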
We define a convolutional neural network as a function that takes as input a multichannel image, to which it applies a series of local convolutions via a trainable "kernel." The same kernel is applied to all pixels in the image, and each convolutional layer consolidates information within a fixed local radius of each pixel in the input image [12]. Many standard convolutional architectures include "pooling" layers, which downsample the previous layer and thereby consolidate local information across progressively larger spatial scales; however, none of the CNNs discussed in this paper includes downsampling steps, and all thus preserve the full dimensionality of the input image.
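To preview the correspondence developed below, the Game of Life update can already be evaluated in exactly this form: one local convolution stage that extracts features (the center value and the live-neighbor count, as in Figure 1), followed by a pointwise rule applied identically at every pixel. A NumPy sketch, assuming periodic boundary conditions:

```python
import numpy as np

def life_step(state):
    """One Game of Life update, organized like a network-in-network:
    a 3x3 convolution stage followed by a pointwise (1x1) decision rule."""
    # "3x3 convolution" with periodic boundaries: a live-neighbor count map.
    neighbors = sum(
        np.roll(np.roll(state, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Pointwise stage: the same rule applied independently at every pixel.
    return ((neighbors == 3) | ((state == 1) & (neighbors == 2))).astype(int)

# A glider translates one cell diagonally every four steps.
glider = np.zeros((8, 8), dtype=int)
glider[np.array([0, 1, 2, 2, 2]), np.array([1, 2, 0, 1, 2])] = 1

state = glider
for _ in range(4):
    state = life_step(state)
assert np.array_equal(state, np.roll(np.roll(glider, 1, axis=0), 1, axis=1))
```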
Cellular automata as recurrent mlpconv networks.
The primary analogy between cellular automata and traditional convolutional neural networks arises from (1) the locality of the dynamics, and (2) simultaneous temporal updating of all spatial points. Because neural networks can, in principle, act as universal function approximators [26], a sufficiently complex neural network architecture can be used to fully approximate each rule σ → m that comprises the CA function G(σ). This single-neighborhood operator can then be implemented as a convolutional operator as part of a CNN, allowing it to be applied synchronously to all pixel neighborhoods in an input image.

Representing a CA with a CNN thus requires two steps: feature extraction, in order to identify each of the M^D input cases describing each neighborhood, followed by association of each neighborhood with an appropriate output pixel. In the appendix, we show explicitly how to represent any CA using a single convolutional layer followed by repeated 1 × 1 convolutional layers. One approach matches each M^D input σ against a template, while another approach treats layers of the network like levels in a tree search that iteratively narrows down each input σ to the desired output m. A key aspect of our approach is our usage of only one non-unity convolutional layer (with size 3 × 3, set by the neighborhood size D of the CA); all subsequent layers consist of 1 × 1 convolutions.

III. A GENERAL NETWORK ARCHITECTURE FOR LEARNING ARBITRARY CELLULAR AUTOMATA
Having proven that arbitrary cellular automata may be analytically represented by convolutional perceptrons with finitely many layers and units, we next ask whether automated training of neural networks on time series of cellular automata images is sufficient to learn their rules. We investigate this process by training ensembles of convolutional neural networks on random images and random CA rulesets. We start by defining a CA as an explicit mapping between each of the 2^9 = 512 possible 3 × 3 binary neighborhoods and a single binary output. We choose the size of the training images (10 × 10 pixels) and of the training data batches (500 images) to ensure that the training data contains at least one instance of each rule. On average, each image contains an equal number of black and white pixels; for sufficiently large images this ensures that each of the 512 input states is equally probable. We note that, in principle, training the network would proceed much faster if the network were shown an example of only one rule at a time. However, such a process causes the network structure to depend strongly on the order in which individual rules were shown, whereas presenting all input cases simultaneously forces the network to learn internal rule representations based on their relative importance for maximizing accuracy.

Figure 1. Conway's Game of Life as a convolutional neural network. Two convolutional filters identify the value of the center pixel and count the number of neighbors. These features are then scored and summed to generate a prediction for the system at the next timepoint.
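The training-data generation step described above can be sketched as follows (a NumPy sketch with our own conventions, namely periodic boundaries and a fixed bit ordering, rather than the paper's released pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

def apply_ca(rule_table, images):
    """Apply a binary CA (given as a 512-entry lookup table) to a batch of
    images with periodic boundaries. Each 3x3 neighborhood is flattened
    into a 9-bit index into the table; bit 4 is the center cell."""
    idx = np.zeros(images.shape, dtype=int)
    bit = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            idx += np.roll(np.roll(images, -dy, axis=1), -dx, axis=2) << bit
            bit += 1
    return rule_table[idx]

# A random binary ruleset, and a batch of random images containing, on
# average, equal numbers of black and white pixels.
rule_table = rng.integers(0, 2, size=512)
x_train = rng.integers(0, 2, size=(500, 10, 10))
y_train = apply_ca(rule_table, x_train)
```

The network is then trained to map each image in `x_train` to the corresponding image in `y_train`.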
Network architecture and training parameters.
Figure 2 shows the network used in our training experiments. Our network consists of a basic mlpconv architecture corresponding to a single 3 × 3 convolutional layer, followed by a series of 1 × 1 convolutional layers with ReLU activations and a final layer that sums the resulting activations into an output image; this structure generalizes naturally to CA with M > 2. Our network may thus be considered a fully convolutional linear committee machine.

We trained our networks using the Adam optimizer with an L2 norm loss function, with hyperparameters (learning rate, initial weights, etc.) optimized via a grid search (see the appendix for all hyperparameters). Because generating new training data is computationally inexpensive, a new, unseen validation dataset was generated for each stage of hyperparameter tuning. Additionally, validation was performed using randomly-chosen, unseen CA rulesets in order to ensure that network hyperparameters were not tuned to specific CA rulesets. During training, a second validation dataset, 20% of the size of the training data, was generated from the same CA ruleset. Training was stopped when the network prediction accuracy reached 100% on this secondary validation dataset, after rounding predictions to the nearest integer. The loss used to compute gradients for the optimizer was not rounded. The final, trained networks were then applied to a new dataset of unseen test data (equal in size to five batches of training data).

We found that training successfully converged for all CA rulesets studied, and we note that our explicit use of a convolutional network architecture simplifies learning of the full rule table. Because we are primarily interested in using CNNs as a way to study internal representations of CA rulesets, we emphasize that 100% performance on the second validation dataset was a condition for stopping training. As a result, all trained networks had identical performance; however, the duration and dynamics of training varied considerably by CA ruleset (discussed below). Regardless of whether weight-based regularization was used during training, we found that performance on the unseen test data was within ∼0.3% of the training data for all networks studied (after outputs are rounded, performance reaches 100%, as expected). We caution, however, that this equal train-test performance should not be interpreted as a measure of generalizability, as would be the case for CNNs used to classify images, etc. [32]. Rather, because a CA has only M^D possible input-output pairs (rather than an unlimited space of inputs), this result simply demonstrates that training was stopped at a point where the model had encountered and learned all inputs. In fact, we note that it would be impossible to train a network to represent an arbitrary CA without exposing it to all of its inputs: since an arbitrary CA can send any given input σ to any given output m, there is no way for a network to predict the output for a symbol without having encountered it previously. However, we note that a network could, in principle, encode a prior expectation for an unseen input symbol σ if it was trained primarily on CA of a certain type.

In a previous work that used a one-layer network to learn the rules of a chaotic CA, it was found that training without weight-sharing prevents full learning, because different spatial regions on the system's attractor have different dynamical complexity [30]. In the results below, we deliberately use very large networks with 12 hidden layers—one 3 × 3 convolutional layer followed by eleven 1 × 1 convolutional layers.

Training dynamics of networks.
Consistent with prior reports that large networks approach a "mean field" limit [24, 25, 33], we find that training is highly repeatable for the large networks that we study, even when different training data is used, different CA rules are learned, or the hyperparameters are altered slightly from their optimal values (although this extends the duration of training). We also find that doubling the depth and width of our networks does not qualitatively affect our results, consistent with a large-network limit. Additionally, we trained alternative networks using a different optimizer (vanilla stochastic gradient descent) and loss function (cross-entropy loss), and found nearly identical internal structure in the trained networks (as discussed below); however, the form of the loss curves during training was more concave for such networks. See the supplementary material for further details of networks and training.

Figure 3A shows the results of training a single network on the Game of Life, and then applying the trained network to the "glider," a known soliton-like solution to the Game. During the early stages of training, the activations appear random and intermittent. As training proceeds, the network adjusts to the scale of output values generated by the input data, and then begins to learn clusters of related rules—leading to tightening of the output image and trimming of spurious activation patterns.
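The forward pass of the mlpconv architecture described in this section can be sketched in NumPy with randomly initialized weights (the layer width and depth below are illustrative assumptions, not the tuned values used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def conv3x3(image, kernels):
    """Periodic 3x3 convolution: (H, W) image -> (H, W, F) feature maps."""
    h, w = image.shape
    out = np.zeros((h, w, kernels.shape[-1]))
    for i, dy in enumerate((-1, 0, 1)):
        for j, dx in enumerate((-1, 0, 1)):
            shifted = np.roll(np.roll(image, -dy, axis=0), -dx, axis=1)
            out += shifted[..., None] * kernels[i, j]
    return out

def mlpconv_forward(image, w3, pointwise):
    """A single 3x3 convolutional layer followed by a stack of 1x1
    (pointwise) ReLU layers and a linear single-channel readout."""
    h = np.maximum(conv3x3(image, w3), 0.0)       # 3x3 conv + ReLU
    for w in pointwise[:-1]:
        h = np.maximum(h @ w, 0.0)                # 1x1 conv = per-pixel matmul
    return (h @ pointwise[-1])[..., 0]

width = 100                                       # illustrative layer width
w3 = rng.normal(size=(3, 3, width))
pointwise = [rng.normal(size=(width, width)) for _ in range(3)]
pointwise.append(rng.normal(size=(width, 1)))

image = rng.integers(0, 2, size=(10, 10))
out = mlpconv_forward(image, w3, pointwise)
```

Because the only non-unity convolution is periodic and all later stages are pointwise, the whole map is translation-equivariant, mirroring the synchronous, local structure of a CA update.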
IV. ANALYSIS OF TRAINED NETWORKS
We next consider the relevance of our training observations to the general properties of binary cellular automata. Intuition suggests that certain sets of CA rules are intrinsically easier to learn, regardless of M and D; for example, a null CA that sends every input to zero in a single timestep requires a trivial network structure, while the Game of Life should require a structure like Figure 1 that can identify each possible neighborhood count. We thus repeat the training data generation and CA network training process described above, except this time we sample CA at random from the 2^512 ≈ 10^154 possible rulesets for binary CA. The complexity of the dynamics produced by a given rule is generally difficult to ascertain a priori, and typical efforts to systematically investigate the full CA rule space have focused on comparative simulations of different rules [16, 17]. For example, the Game of Life is a member of a unique set of "Class IV" CA capable of both chaotic and regular dynamics depending on their initial state; membership in this class has been hypothesized to be a prerequisite for supporting computational universality [15, 16]. General prediction of dynamical class remains an open question in the CA literature [21]; however, there is a known, approximate relationship between the complexity of simulated dynamics and the relative fraction λ of transitions to zero and one among the full set of 512 possible input cases: λ = 0 and λ = 1 correspond to null CA, whereas intermediate values of λ are associated with more complex dynamics. Rather than using λ directly, we parametrize the space of CA equivalently using the effective "rule entropy," H_ca. We define H_ca by starting from a maximum-entropy image with a uniform distribution of input symbols (p_σ ≈ 1/M^D for all σ), to which we apply the CA rule once and then record the new distribution of input cases, p′_σ. The residual Shannon entropy H_ca ≡ −Σ_σ p′_σ log₂ p′_σ provides a measure of the degree to which the CA rules compress the space of available states. H_ca(λ) monotonically increases from H_ca(0) = 0 until it reaches a global maximum at H_ca(1/2) = 9, after which it symmetrically decreases back to H_ca(1) = 0.

Figure 3B shows the result of training 2560 randomly-sampled CA with different values of H_ca. Ensembles of 512 related cellular automata were generated by randomly selecting single symbols in the input space to transition to 1 (starting with the null case σ → 0 for all σ), one at a time, until reaching the case σ → 1 for all σ. This "table walk" sampling approach [17] was then replicated 5 times for different starting conditions.

We observe that the initial 10–100 training epochs are universal across H_ca. Detailed analysis of the activation patterns across the network (Supplementary Material) suggests that this transient corresponds to initialization, wherein the network learns the scale and bounds of the input data. Recent studies of networks trained on real-world data suggest that this initialization period consists of the network finding an optimal representation of the input data [34]. During the next stage of training, the network begins to learn specific rules: the number of neurons activated in each layer begins to decrease, as the network becomes more selective regarding which inputs provoke non-zero network outputs (see Supplementary Material). Because H_ca determines the sparsity of the rule table—and thus the degree to which the rules may be compressed—H_ca strongly affects the dynamics of this phase of training, with simpler CA learning faster and shallower representations of the rule table, resulting in smaller final loss values (Figure 3B, inset). This behavior confirms the general intuition that more complicated CA rules require more precise representations, making them harder to learn.

A key feature of using large networks to fit simple functions like CA is strong repeatability of training across different initializations and CA rulesets. In the appendix, we reproduce all results shown in the main text using networks with different sizes and depths, and even a different optimizer, loss function, and other hyperparameters, and we report nearly identical results (for both training and test data) as those found using the network architecture described above. On both the training data and the test data, we find similar universal training curves that depend on H_ca, as well as similar distributions of activation patterns.
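The rule entropy H_ca defined above can be estimated numerically; the following sketch assumes periodic boundaries and estimates p′_σ from a single large random image:

```python
import numpy as np

rng = np.random.default_rng(2)

def neighborhood_indices(image):
    """9-bit index of every cell's 3x3 neighborhood (periodic boundaries)."""
    idx = np.zeros(image.shape, dtype=int)
    bit = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            idx += np.roll(np.roll(image, -dy, axis=0), -dx, axis=1) << bit
            bit += 1
    return idx

def rule_entropy(rule_table, size=256, rng=rng):
    """Estimate H_ca: apply the rule once to a maximum-entropy random image,
    then measure the Shannon entropy (in bits) of the new distribution of
    3x3 neighborhood symbols."""
    image = rng.integers(0, 2, size=(size, size))
    new_image = rule_table[neighborhood_indices(image)]
    counts = np.bincount(neighborhood_indices(new_image).ravel(), minlength=512)
    p = counts / counts.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

null_ca = np.zeros(512, dtype=int)     # lambda = 0: every input maps to 0
assert rule_entropy(null_ca) == 0.0    # all neighborhoods collapse to zero
```

For the null CA every neighborhood collapses to the all-zero symbol, giving H_ca = 0; a maximum-entropy rule table leaves the neighborhood distribution nearly uniform, approaching H_ca = 9.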
This universality is not observed in "narrow" networks with fewer neurons per layer, for which training proceeds as a series of plateaus in the loss punctuated by large drops when the stochastic optimizer happens upon new rules. In this limit, randomly-chosen CA rulesets will not consistently result in training successfully finding all correct rules and terminating. Moreover, small networks that do terminate do not display apparent patterns when their internal structure is analyzed using the approaches described below—consistent with a random search. Similar loss dynamics have previously been observed when CA are learned using genetic algorithms, in which the loss function remains mostly flat, punctuated by occasional leaps when a mutant encounters a new rule [21]. For gradient-based training, similar kinetic trapping occurs in the vicinity of shallow minima or saddle points [35, 36], but these effects are reduced in larger networks such as those used here.

Figure 2. Architecture of a trainable convolutional neural network for learning cellular automata. A schematic of the mlpconv network trained on binary cellular automata. Dimensions, where not marked, are determined by the dimensionality of the previous layer.
V. INFORMATION-THEORETIC QUANTIFICATION OF ACTIVATIONS
That training thousands of arbitrary CA yields extremely similar training dynamics suggests that deep networks trained using gradient optimizers learn a universal approach to approximating simple functions like CA. This motivates us to next investigate how exactly the trained networks represent the underlying CA rule table—do the networks simply match entire input patterns, or do they learn consolidated features such as neighbor counts? Because the intrinsic entropy of the CA rule table affects training, we reason that the entropy of activated representations at each layer is a natural heuristic for analyzing the internal states of the network. We thus define a binary measure of activity for each neuron in a fully-trained network: when the network encounters a given input σ, any neurons that produce a non-zero output are marked as 1 (or 0 otherwise), resulting in a new set of binary strings a(σ) denoting the rounded activation pattern for each input σ. For example, in an mlpconv network with only 3 layers and 3 neurons per layer, an activation pattern for a specific input σ could take a form such as a(σ) = {101, 110, 010}, with commas demarcating layers. Our approach constitutes a simplified version of efforts to study deep neural networks by inspecting activation pattern "images" of neurons in downstream layers when specific input images are fed into the network [25, 37–39].
However, for our system binary strings (thresholded activation patterns) are sufficient to characterize the trained networks, due to the finite space of input-output pairs for binary CA and the large size of our networks; in our investigations, no cases were found in which two different inputs (σ, σ′) produced different unrounded activation patterns but identical patterns after binarization (a(σ), a(σ′)).

Given the ensemble of input symbols σ ∈ {0, 1}^D, and a network consisting of L layers each containing N neurons, we can define separate symbol spaces representing activations of the entire network, a_T(σ) ∈ {0, 1}^(LN); each individual layer, a_L,i(σ) ∈ {0, 1}^N, i ∈ [0, L − 1]; and each individual neuron, a_N,ij(σ) ∈ {0, 1}, i ∈ [0, L − 1], j ∈ [0, N − 1]. Given the M^D unique input cases σ, we can then calculate the probability p_α,k of observing a given unique symbol a_k at a level α ∈ {T, L, N} in the network. We quantify the uniformity of each activation symbol distribution p using the entropy H_α = −Σ_k p_α,k log₂ p_α,k, which satisfies H_α ≤ dim(α). We condense notation and refer to the activation entropies H_T, H_L,i, H_N,ij as the total entropy, the entropy of the ith layer, and the entropy of the jth neuron in the ith layer. We note that, in addition to readily quantifying the number of unique activation patterns and their uniformity across input cases, the Shannon entropy naturally discounts zero-entropy "dead neurons," a common artifact of training high-dimensional ReLU networks [31]. Our general analysis approach is related to a recently-developed class of techniques for analyzing trained networks [40], in which an ensemble of training data (here, a uniform distribution of σ) is fed into a trained network in order to generate a new statistical observable (here, H).

Figure 3. Training 2560 convolutional neural networks on random cellular automata. (A) A network trained on the Game of Life for different durations, and then applied to images of each stage of the "glider" solution. (B) The loss versus time during training, colored by the rule entropy H_ca. Groups of 512 related cellular automata were generated by iteratively choosing random input symbols σ and changing their outputs from 0 to 1. 5 replicates were performed. Loss values represent the sum over the batch; values of 10 or smaller imply that only small rounding errors were present at the end of training. The entropy of the resulting rule table is characteristic of the CA, and is indicated by color from H_ca = 0 (blue, minimum-entropy CA) to H_ca = 9 (magenta, maximum-entropy CA). (Inset) The final loss for each network at the end of training, shown as a function of H_ca.

We expect and observe that ⟨H_N,ij⟩_ij < ⟨H_L,i⟩_i ≤ H_T. Unsurprisingly, the maximum entropy of a single neuron is log₂ 2 = 1. We find that H_T ≈ 9 for all trained networks—even for trivial rules such as λ = 0, where a network with all zero weights and biases would both correctly represent the rule table and have identical firing patterns for all inputs (H_T = 0). This effect directly arises from training using gradient-based methods, for which at least some early layers in the network produce unique activation patterns for each σ that are never condensed during later training stages. Accordingly, regularization using a total weight cost or dropout both reduce H_T.

Comparing H_L,i across models and layers demonstrates that early layers in the network tend to generate a broad set of activation patterns that closely follow the uniform input symbol distribution (Figure 4A). These early layers in the network thus remain saturated at H_L,i = H_T ≈ 9. In Figure 4A, we also compare H_L,i to theoretical predictions for the layerwise entropy for the different types of ways that a CNN can represent the CA. The uppermost dashed curve corresponds to a network that generates a maximum-entropy set of 512 equiprobable activation patterns in each layer. This case corresponds to a "shallow" network that matches each input case to a unique template at each layer. Lower dashed curves correspond to predictions for networks that implement the CA as a layerwise search, in which σ that map to the same output m are mapped to the same activation pattern at some point before the final layer. This corresponds to a progressive decrease in the number of unique activation patterns in each layer.
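The activation-symbol entropies H_T and H_L,i defined in this section can be computed directly from thresholded activation patterns. The sketch below uses synthetic (hypothetical) activations rather than a trained network:

```python
import numpy as np
from collections import Counter

def symbol_entropy(symbols):
    """Shannon entropy (bits) of a collection of hashable symbols."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def layerwise_entropies(activations):
    """activations: binary array of shape (num_inputs, L, N).

    Returns the total entropy H_T and the per-layer entropies H_L,i,
    treating each thresholded activation pattern as one symbol a_k."""
    h_total = symbol_entropy([a.tobytes() for a in activations])
    h_layers = [
        symbol_entropy([a.tobytes() for a in activations[:, i, :]])
        for i in range(activations.shape[1])
    ]
    return h_total, h_layers

# Hypothetical two-layer "network": layer 0 assigns a unique pattern to each
# of the 512 inputs, while layer 1 collapses every input to one pattern.
acts = np.zeros((512, 2, 16), dtype=np.uint8)
acts[:, 0, :] = (np.arange(512)[:, None] >> np.arange(16)) & 1
h_total, h_layers = layerwise_entropies(acts)
```

In this toy case the first layer is saturated at log₂ 512 = 9 bits, while the collapsing layer has zero entropy, the two extremes discussed in the text.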
Figure 4. Internal representations of cellular automata by trained networks. (A) The individual layerwise entropy (H_L,i/D) for the 2560 networks shown in the previous figure. Noise has been added to the horizontal coordinates (layer index) to facilitate visualization. As in previous figures, coloration corresponds to the entropy H_ca of the underlying CA. Dashed lines correspond to expected trends for theoretical networks that eliminate 0% of cases in each layer (i.e., a pattern-matching implementation), 45% of cases, and 50% of cases (top to bottom). (Inset) The Pearson correlation coefficient r between the rule entropy H_ca and the layer entropy H_L,i. Error ranges correspond to bootstrapped 25%–75% quantiles. (B) The normalized layerwise entropy (H_L,i/D) versus the normalized total layerwise neuron entropy (H_N,ij/N), with the linear scaling annotated.

The two dashed curves shown correspond to theoretical networks that eliminate 45% and 50% of unique activation patterns at each layer. We find that higher-entropy rules H_ca (red points) tend to produce shallower networks, due to the rule table being less intrinsically compressible, whereas simpler CA (blue points) produce networks with more binary tree-like structure. This relationship has high variance in early layers, making it difficult to discern visually in the panel except in the last layer. However, explicit calculation of the Pearson correlation r(H_ca, H_L,i) confirms its presence across all layers of the network, and shows that it becomes more prominent in deeper layers (Figure 4A, inset). This trend is a consequence of training the network using backpropagation-based techniques, in which loss gradients computed at the final, Lth hidden layer are used to update the weights in the previous (L − 1)th layer, which are then used to update the (L − 2)th layer, and so forth [41]. During training, the entropy of the final layer increases continuously until it reaches a plateau determined by the network size and by H_ca. The penultimate layer then increases in entropy until reaching a plateau, and so forth until H_T = 9 across all σ—at which point training stops, because the test error reaches zero (training dynamics are further analyzed in the Supplementary Material). This general correlation between CA entropy and network structure is consistent with earlier studies in which networks were trained to label CA rulesets by their dynamical complexity class [42].

The role of H_ca in shaping the internal representation distributions p_L can be further analyzed using Zipf plots of activation pattern a_k frequency versus rank (Supplementary Material): the resulting plots show that the distribution of activation symbols is initially uniform (because the training data has a uniform distribution of σ), but the distribution becomes progressively narrower and more peaked in later layers. This process occurs more sharply for networks trained on CA with smaller H_ca.

We next consider how the entropy of our observed layer activation patterns relates to the entropy of the individual neurons H_N,ij that comprise them; we suspect there is a relation because the individual firing entropies determine the "effective" number of neurons in a layer, N_eff = 2^(Σ_j H_N,ij).
Across all layers, we observe a linear relationship between the total neuron entropy Σ_j H_N,ij and H_L,i, which saturates when H_L,i ≈ H_T (Figure 4B). The lower-H_ca CA lie within the linear portion of this plot, suggesting that variation in activation patterns in this regime results from layers recruiting varying numbers of neurons. Conversely, higher-entropy CA localize in a saturated region where each layer encodes a unique activation pattern for each unique input state, leading to no dependence on the total effective number of neurons. This plot explains our earlier observation that the dynamics of training do not depend on the exact network shape as long as the network has sufficiently many neurons: for low H_ca, layers never saturate, and are free to recruit more neurons, until they are able to pattern-match every unique input (at intermediate and large H_ca). A CA with more possible input states (larger M or D) would thus require more neurons per layer to enter this large-network limit.

We also consider the degree to which the decrease of H_L,i with i arises from deeper layers becoming "specialized" to specific input features, a common observation for deep neural networks [12, 38, 41]. We quantify the layer specialization using the total correlation, a measure of the mutual information between the activation patterns of a layer and the neurons within that layer: I_i = Σ_j H_N,ij − H_L,i. This quantity is minimized (I_i = 0) when the single-neuron activations within a layer are independent of one another; conversely, at its maximum value, individual neurons activate only jointly, in the context of forming a specific layer activation pattern. Plots of I_i vs. i (Supplementary Material) reveal that in early layers individual neurons tend to fire independently, consistent with multi-neuron features being unique to each input case.
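The total correlation I_i can likewise be computed from thresholded activations; a sketch on synthetic patterns illustrating its two extremes:

```python
import numpy as np

def binary_entropy(p):
    """Entropy (bits) of a Bernoulli variable with firing probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def total_correlation(layer_acts):
    """I_i = sum_j H_N,ij - H_L,i for one layer's thresholded activations,
    given as a binary array of shape (num_inputs, N)."""
    # Per-neuron entropies, estimated from each neuron's firing frequency.
    h_neurons = sum(binary_entropy(p) for p in layer_acts.mean(axis=0))
    # Layer entropy, estimated from the distribution of whole-layer patterns.
    _, counts = np.unique(layer_acts, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_layer = -(p * np.log2(p)).sum()
    return h_neurons - h_layer

rng = np.random.default_rng(3)
# Independently firing neurons: I_i vanishes up to sampling noise.
independent = rng.integers(0, 2, size=(1024, 4))
# Perfectly coupled neurons: all fire together, so I_i = (N - 1) H(neuron).
coupled = np.repeat(rng.integers(0, 2, size=(1024, 1)), 4, axis=1)
```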
In these early layers, I_i is large because the number of possible activation patterns in a single layer of the large network (2^N) is much larger than the number of input cases (2^9). In later layers, however, the correlation begins to decrease, consistent with individual neurons being activated in the context of multiple input cases—indicating that these neurons are associated with features found in multiple input cases, like the states of specific neighbors. Calculation of r(I_i, H_ca) confirms that this effect varies with H_ca.

VI. DISCUSSION
We have shown an analogy between convolutional neural networks and cellular automata, and demonstrated a type of network capable of learning arbitrary binary CA using standard techniques. Our approach uses a simple architecture that applies a single 3 × 3 convolutional layer followed by repeated 1 × 1 convolutional layers. An open problem in CA research is relating a CA's a priori rules to its apparent dynamical complexity during simulation [16, 18, 22]—for example, do Class IV and other complex CA impose unique structures upon fitted neural networks, or can neural networks predict their computational complexity given a rule table? These problems and more general studies of dynamical systems will require more sophisticated approaches, such as unsupervised training and generative architectures (such as restricted Boltzmann machines). More broadly, we note that studying the bounded space of CA has motivated our development of general entropy-based approaches to probing trained neural networks. In future work we hope to relate our observations to more general patterns observed in studies of deep networks, such as the information bottleneck [34]. Such results may inform analysis of open-ended dynamical prediction tasks, such as video prediction, by showing a simple manner in which process complexity manifests as structural motifs.

VII. ACKNOWLEDGMENTS
W.G. was supported by the NSF-Simons Center for Mathematical and Statistical Analysis of Biology at Harvard University, under NSF Grant No. DMS-1764269, and by the Harvard FAS Quantitative Biology Initiative. He was also supported by the U.S. Department of Defense through the NDSEG fellowship program, as well as by the Stanford EDGE-STEM fellowship program.
VIII. CODE AVAILABILITY
Convolutional neural networks were implemented in Python 3.4 using TensorFlow 1.8 [43]. Source code is available at https://github.com/williamgilpin/convoca.

IX. APPENDIX

A. Representing arbitrary CA with convolutional neural networks
Here we show explicitly how a standard mlpconv multilayer perceptron architecture with ReLU activation is capable of representing an arbitrary M-state cellular automaton with a finite depth and neuron count [28]. We provide the following explicit examples primarily as an illustration of the ways in which 1 × 1 convolutions can implement a CA rule table.
1. Pattern-matching the rule table with a shallow network
An arbitrary M-state cellular automaton can first be converted into a one-hot binary representation. Given an L × L image, we seek to generate an L × L × M stack of binary activation images:

1. Convolve the input layer with M distinct 1 × 1 convolutional filters with unit weights and biases 0, −1, −2, ..., −(M − 1), followed by a ReLU activation.

2. Apply a second layer of M − 1 distinct 1 × 1 convolutional filters, each of which tests a different consecutive pair of the resulting activation images: [1, −b, 0, ..., 0], [0, 1, −b, 0, ..., 0], [0, 0, 1, −b, 0, ..., 0], ..., [0, ..., 0, 1, −b], where b is any positive constant b ≥ M/(M − 1). Together, these two layers produce an activation volume of dimensions L × L × (M − 1).

We now have an L × L × (M − 1) array corresponding to the one-hot encoding of each pixel's value in an L × L lattice. We now pattern-match each of the M^D possible inputs against its corresponding correct output value. We note that the steps we take below represent an upper bound; if the number of quiescent versus active states in the cellular automaton is known in advance (equal to λM^D, where λ is Langton's parameter [17]), then the number of patterns to match (and thus the total parameter count) may be reduced by a factor of λ, because only the non-quiescent "active" rules that produce non-zero output values need to be matched.

1. Construct a block of M^D S × S × (M − 1) convolutional filters, where S corresponds to the neighborhood size of the CA (S = 3 for a standard CA with a Moore neighborhood). Each of the M^D filters simply corresponds to an image of one possible input state, with entries equal to one for each non-zero site, and large negative values (greater in magnitude than D(M − 1)) at all other sites.

2. Prune this block, keeping only the filters that match rules with non-zero outputs in the cellular automaton's rule table.

3. For each remaining S × S × (M − 1) filter matching an input state with q non-zero sites, assign a bias of −(q − 1). Because the negative entries have magnitude at least D(M − 1) ≥ q, any mismatched input yields a non-positive pre-activation, so after ReLU activation each filter produces an output of one only when its input state is matched exactly.
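The shallow pattern-matching step above can be sketched numerically for a binary CA (M = 2, D = 9). This is a minimal illustration with assumed names (`match_filters`) and an arbitrary negative weight c = 10 > D(M − 1); it is not the paper's implementation:

```python
import numpy as np
from itertools import product

def relu(x):
    return np.maximum(x, 0)

def match_filters(neighborhoods, c=10.0):
    """One flattened 3x3 filter and bias per target neighborhood, chosen so
    that ReLU(w . x + b) equals 1 exactly when x matches that neighborhood."""
    filters, biases = [], []
    for nbhd in neighborhoods:
        nbhd = np.asarray(nbhd, dtype=float)
        w = np.where(nbhd == 1, 1.0, -c)   # +1 on live sites, -c elsewhere
        q = nbhd.sum()                     # number of live sites
        filters.append(w)
        biases.append(-(q - 1))            # exact match gives q + b = 1
    return np.array(filters), np.array(biases)

# Two example rule-table entries (3x3 neighborhoods flattened to 9-vectors)
targets = [(1, 0, 1, 0, 0, 0, 1, 1, 1), (0, 0, 0, 0, 1, 0, 0, 0, 0)]
W, b = match_filters(targets)

# Each filter fires only on its own target, across all 512 binary inputs
for x in product([0, 1], repeat=9):
    h = relu(W @ np.array(x, dtype=float) + b)
    for k, t in enumerate(targets):
        assert (h[k] == 1.0) == (tuple(x) == t)
```

The exhaustive loop verifies the bias argument above: a missing live site caps the pre-activation at zero, and any extra live site lands on a −c entry, driving it negative.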
2. Searching the rule table with a deep network
Another way to represent a cellular automaton with a multilayer perceptron amounts to searching a subset of all possible inputs in each layer. This approach requires all input cases σ that map to the same output symbol m to also map to the same activation pattern at some layer of the network. This coalescence of different input states can occur at any point in the network before the final layer; here we outline a general approach for constructing maps to the same output symbol using large networks.

Assigning input cases to unique binary strings.
Assume there are N convolutional filters. If there are M^D unique input cases, these filters can be used to generate an n-hot encoding of the input states, where n is chosen such that (N choose n) ≥ M^D. Here, we assume a binary CA with a Moore neighborhood (M = 2, D = 9). If N = 100 neurons are present in each layer, then a two-hot binary string (n = 2) is sufficient to uniquely represent every possible input state of a binary Moore CA, using the following steps:

1. The D-pixel neighborhood is split into n sub-neighborhoods, with sizes we refer to as D_1, D_2, ..., D_n. For example, for the binary Moore CA, we can split the neighborhood into the first 5 pixels (counted from the top-left to the center) and the remaining 4 pixels (from after the center pixel to the bottom-right corner). Note that the number and dimensionality of these sub-neighborhoods must satisfy the condition: if Q ≡ M^{D_1} + M^{D_2} + ... + M^{D_n}, then (N choose Q) ≥ M^D.

2. Define M^{D_1} + M^{D_2} + ... + M^{D_n} filters, which together match each possible sub-neighborhood. For example, for the neighborhood reading 101000111 from upper-left to bottom-right, two filters can be defined that will match the sub-neighborhoods consisting of the first 5 bits and the last 4 bits, using the approach described above for pattern-matching. In this case, these filters would be (1, −b, 1, −b, −b, 0, 0, 0, 0) and (0, 0, 0, 0, 0, −b, 1, 1, 1), for a large positive constant b.

3. The resulting activations constitute an n-hot binary encoding of the input state, because each unique input case will match the same n filters from the set of N, thus creating a unique representation.

Assigning input case binary strings to matching output symbols.
At this stage in the network, each input case has been mapped to a unique N-digit binary string with exactly n ones within it. Successive 1 × 1 convolutional layers can then merge strings that map to the same output symbol. For example, if N = 5, then the possible two-hot input cases are σ ∈ {11000, 10100, 10010, ...}. Many of these cases can be merged by applying a filter consisting of three ones, followed by a bias of b = −1. For example, using the filter W = (−2, −2, 1, 1, 1) to perform the operation h = ReLU(W · σ + b) will result in an output of 1 for the cases {00110, 00101, 00011} only. To match strings with no overlapping bits, more than two cases must be merged simultaneously; in general, to merge H cases using this approach, the merging filter must be chosen so that all H strings, and no others, produce the same positive pre-activation. Because only the non-quiescent rules must be matched (a fraction given by the λ parameter of the CA rule table), the depth (and thus minimum number of layers) needed to perform this search grows at most logarithmically with the number of matched cases, λM^D.
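The merging step above can be checked numerically; this is a small sketch with N = 5, two-hot codes, and a hypothetical merging filter with three positive entries:

```python
import numpy as np
from itertools import combinations

def relu(x):
    return np.maximum(x, 0)

N, n = 5, 2
# All two-hot binary strings of length N (the n-hot input codes)
codes = []
for idx in combinations(range(N), n):
    s = np.zeros(N)
    s[list(idx)] = 1.0
    codes.append(s)

# A filter with three ones merges every code whose set bits both fall
# among the last three positions; all other codes yield ReLU(.) = 0
W = np.array([-2.0, -2.0, 1.0, 1.0, 1.0])
b = -1.0
merged = [tuple(int(v) for v in s) for s in codes if relu(W @ s + b) == 1.0]
print(merged)  # [(0, 0, 1, 1, 0), (0, 0, 1, 0, 1), (0, 0, 0, 1, 1)]
```

A code with only one set bit among the positive positions gives a pre-activation of 1 − 2 + b < 0, so only the three fully contained codes are merged into the same output activation.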
3. Network representation of the Game of Life
We note that there are many other ways to implement a CA that are not exactly layerwise depth search, nor a shallow pattern match, depending on the number and type of features being checked at each layer of the network. For example, each of the D pixels in the neighborhood of the CA can be checked with separate convolutional kernels all in the first layer, and then different combinations of these values could be checked in subsequent steps. The shallow network described above represents an extreme case, in which every value of the full input space is explicitly checked in the first layer. This implementation is efficient for many CA, because of the low cost of performing multiple numerical convolutions. However, for CA with large M or D, the layer-wise search method may be preferable.

For the Game of Life, we can use knowledge of the structure of a CA in order to design a better implementation. The Game of Life is an outer totalistic CA, meaning that the next state of the system is fully determined by the current value of the center pixel, and the total number of ones and zeros among its immediate neighbors. For this reason, only two unique convolutional filters are needed. The first filter is the identity, which is applied with bias 0. The second filter is the neighbor-counting filter, with weight zero at the center pixel and weight one at each of the eight surrounding pixels.

Due specifically to our use of ReLU activation functions throughout our networks (rather than sigmoids), several copies of this filter must be applied in order to detect different specific neighbor counts. In particular, because the Game of Life rules require specific information about whether the total number of "alive" neighbors is < 2, 2, 3, or ≥ 4, we need four duplicates of the neighbor-counting filter, with biases (−1, −2, −3, −4). The first layer of the network thus convolves the L × L binary input image with 5 total 3 × 3 filters, producing an L × L × 5 activation volume encoding the state of the center pixel together with whether the neighbor count is < 2, = 2, = 3, or ≥ 4. Each depth-5 fiber of the activation volume now contains a unique activation pattern that can be matched against the appropriate output case. In the next layer of the network, two 1 × 1 × 5 convolutional filters match the two cases that produce a living cell in the next timestep (exactly three live neighbors, or a live center pixel with exactly two live neighbors), producing an L × L × 2 activation volume. To produce the final L × L output corresponding to the next state of the automaton, this volume is summed along its depth, which can be performed efficiently as a final convolution with a single 1 × 1 filter with weights (1, 1) along its depth, and no bias. This will produce an L × L output image corresponding to the next state of the Game.

For an example implementation of this algorithm in TensorFlow, see the function ca_funcs.make_game_of_life() in https://github.com/williamgilpin/convoca/blob/master/ca_funcs.py.

In principle, this architecture can work for any outer-totalistic cellular automaton, such as Life without Death, HighLife, etc., although depending on the number of unique neighbor count and center pixel pairings that determine the ruleset, the number of neighbor filters may need to be adjusted. For example, in the Game of Life the cases of 0 living and 1 living neighbors do not need to be distinguished by the network, because both cases result in the center pixel having a value of zero in the next timestep. Likewise, for a purely totalistic cellular automaton (such as a majority-vote rule), only a single convolutional filter (consisting of 9 identical values) is necessary, because the value of the center pixel does not need to be resolved by the network.

B. Neural network training details
Convolutional neural networks were implemented in Python 3.4 using TensorFlow 1.8 [43]. Source code is available at https://github.com/williamgilpin/convoca.

For all convolutions, periodic boundary conditions were implemented by manually copying pixel values from each edge of the input image, and then appending them onto the opposite edges. The padding option "VALID" was then used for the first convolutional filter layer in the TensorFlow graph.

Hyperparameters for the large networks described in the main text were optimized using a grid search. For each training run performed while optimizing hyperparameters, a new validation set of unseen binary images associated with an unseen cellular automaton ruleset was created, in order to prevent the cellular automaton ruleset from biasing the choice of hyperparameters. Once hyperparameters were chosen, and training on arbitrary cellular automata started, an additional validation set of binary images was generated for each ruleset. These images were used to determine when to stop training. Finally, an unseen set of binary images was used as a test partition, in order to compute the final accuracy of the trained networks. The training and test accuracies (before rounding the CNN output to the nearest integer) were within 0.3% for all networks studied, which is a direct consequence of the network's ability to represent all input cases exactly. After rounding the CNN output to the nearest integer, both the train and test datasets had 100% accuracy. The unrounded train and test performance during the training of one network are shown as a function of training epoch in Figure S1 of the supplementary material.

The default networks contained one 3 × 3 convolutional layer followed by a series of 1 × 1 convolutional layers (Table I).

Table I. Hyperparameters for networks used in the main text.
Parameter              Value
Input dimensions       px
Number of layers       12
Neurons per layer      100
Input samples          images
Batch size             images
Weight initialization  He Normal [44]
Weight scale
Learning rate
Max train epochs
Optimizer              Adam
Loss                   L2
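The periodic-boundary scheme described in this section (copy each edge onto the opposite side, then convolve with "VALID" padding) can be sketched with NumPy; `wraparound_pad` is an assumed helper name, not necessarily the repo's implementation:

```python
import numpy as np

def wraparound_pad(img, pad=1):
    """Copy pixel values from each edge of `img` onto the opposite edges,
    so a subsequent 'VALID' 3x3 convolution sees periodic boundaries."""
    return np.pad(img, pad, mode="wrap")  # 'wrap' copies opposite-edge values

img = np.arange(9).reshape(3, 3)
padded = wraparound_pad(img)

assert padded.shape == (5, 5)
assert np.array_equal(padded[1:4, 1:4], img)  # interior is unchanged
assert padded[0, 1] == img[-1, 0]             # top row copied from bottom edge
assert padded[0, 0] == img[-1, -1]            # corners wrap diagonally
```

After this padding, a 3 × 3 convolution with "VALID" padding returns an output of the original L × L size while treating the lattice as a torus.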
We also considered the degree to which the exact dimensions of the "large network" affect our results. We trained another ensemble of networks with loss function, hyperparameters, and optimizer identical to the main text, but with the number of layers and the number of neurons per layer doubled (Table II). As we observe in the main text, our results remain almost identical (Figure S2 of the supplementary material, left panel). We attribute this to the relatively small number of unique input cases that the networks need to learn (512) as compared to the potential expressivity of large networks.

As a control against the choice of optimizer and loss affecting training, we also trained a replicate ensemble of networks that had the same network shapes (12 layers with 100 neurons each) but a different loss function and optimizer, for which different optimal hyperparameters were found using a new grid search (Table III). We compare results using this alternative network to the default network described in the main text, and we find the results are nearly identical.

Table II. Hyperparameters for the large network.
Parameter              Value
Input dimensions       px
Number of layers       24
Neurons per layer      200
Input samples          images
Batch size             images
Weight initialization  He Normal [44]
Weight scale
Learning rate
Max train epochs
Optimizer              Adam
Loss                   L2
Table III. Hyperparameters for the alternative network.
Parameter              Value
Input dimensions       px
Number of layers       12
Neurons per layer      100
Input samples          images
Batch size             images
Weight initialization  He Normal [44]
Weight scale
Learning rate
Max train epochs
Optimizer              S.G.D.
Loss                   cross-entropy
Figures S2 (right panel) and S3 of the supplementary material show the results of training a network using these parameters. The shape of the training curve is slightly different, with the universal transient (during which the network learns general features of the input data, such as the range and number of unique cases) being much longer for this network. However, the later phases of training continue similarly to the standard network, with H_ca strongly affecting the later stages of training and the final loss. Moreover, after training has concluded, the dependence of the internal representations of the network on H_ca (Figure S2 of the supplementary material) matches the patterns seen in the default network above.

X. SUPPLEMENTARY MATERIALS AND ADDITIONAL EXPERIMENTS
For this arXiv submission, additional supplementary material has been uploaded as an ancillary file. To obtain the SI, navigate to the abstract landing page, and then on the right-hand side select "Other formats" under the "Download" header. On the resulting page, click the link for "Download source" under the "Source" header. A numbered file will download. Either extract the file from the terminal, or rename the file with the extension ".gz" and then double-click the file to extract a directory containing the PDF of supplementary material.

[1] L. Zdeborová, Nature Physics, 420 (2017).
[2] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Physical Review Letters, 024102 (2018).
[3] J. Carrasquilla and R. G. Melko, Nature Physics, 431 (2017).
[4] E. P. Van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nature Physics, 435 (2017).
[5] G. Torlai, G. Mazzola, J. Carrasquilla, M. Troyer, R. Melko, and G. Carleo, Nature Physics, 447 (2018).
[6] H. Jaeger and H. Haas, Science, 78 (2004).
[7] Y. Bar-Sinai, S. Hoyer, J. Hickey, and M. P. Brenner, arXiv preprint arXiv:1808.04930 (2018).
[8] J. N. Kutz, Journal of Fluid Mechanics, 1 (2017).
[9] G. Carleo and M. Troyer, Science, 602 (2017).
[10] G. Torlai and R. G. Melko, Physical Review B, 165134 (2016).
[11] P. Zhang, H. Shen, and H. Zhai, Physical Review Letters, 066401 (2018).
[12] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436 (2015).
[13] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, in SSW (2016) p. 125.
[14] M. Mathieu, C. Couprie, and Y. LeCun, arXiv preprint arXiv:1511.05440 (2015).
[15] A. Adamatzky, Game of Life Cellular Automata, Vol. 1 (Springer, 2010).
[16] S. Wolfram, Reviews of Modern Physics, 601 (1983).
[17] C. G. Langton, Physica D: Nonlinear Phenomena, 12 (1990).
[18] D. P. Feldman, C. S. McTague, and J. P. Crutchfield, Chaos: An Interdisciplinary Journal of Nonlinear Science, 043106 (2008).
[19] E. Fredkin, Physica D: Nonlinear Phenomena, 254 (1990).
[20] A. Adamatzky and J. Durand-Lose, in Handbook of Natural Computing (Springer, 2012) pp. 1949–1978.
[21] M. Mitchell, J. P. Crutchfield, R. Das, et al., in Proceedings of the First International Conference on Evolutionary Computation and Its Applications (EvCA'96), Vol. 8 (Moscow, 1996).
[22] M. Mitchell and P. T. Hraber, Complex Systems, 89 (1993).
[23] P.-M. Nguyen, arXiv preprint arXiv:1902.02880 (2019).
[24] R. M. Neal, Bayesian Learning for Neural Networks, Vol. 118 (Springer Science & Business Media, 2012).
[25] M. Chen, J. Pennington, and S. S. Schoenholz, arXiv preprint arXiv:1806.05394 (2018).
[26] G. Cybenko, Mathematics of Control, Signals and Systems, 303 (1989).
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) pp. 1–9.
[28] M. Lin, Q. Chen, and S. Yan, arXiv preprint arXiv:1312.4400 (2013).
[29] Ł. Kaiser and I. Sutskever, in International Conference on Machine Learning (2016).
[30] N. Wulff and J. A. Hertz, in Advances in Neural Information Processing Systems (1993) pp. 631–638.
[31] V. Nair and G. E. Hinton, in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010) pp. 807–814.
[32] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[33] H. Sompolinsky, A. Crisanti, and H.-J. Sommers, Physical Review Letters, 259 (1988).
[34] R. Shwartz-Ziv and N. Tishby, arXiv preprint arXiv:1703.00810 (2017).
[35] D. Saad and S. A. Solla, Physical Review E, 4225 (1995).
[36] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, in Advances in Neural Information Processing Systems (2014) pp. 2933–2941.
[37] S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein, in International Conference on Machine Learning (2017).
[38] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, University of Montreal, 1 (2009).
[39] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli, in Advances in Neural Information Processing Systems (2016) pp. 3360–3368.
[40] M. Raghu, J. Gilmer, J. Yosinski, and J. Sohl-Dickstein, in Advances in Neural Information Processing Systems (2017) pp. 6076–6085.
[41] S. Arora, A. Bhaskara, R. Ge, and T. Ma, in