Generic predictions of output probability based on complexities of inputs and outputs
Kamaludin Dingle, Guillermo Valle Pérez, Ard A. Louis
Centre for Applied Mathematics and Bioinformatics, Department of Mathematics and Natural Sciences, Gulf University for Science and Technology, Kuwait
Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Parks Road, Oxford, OX1 3PU, United Kingdom
(Dated: October 3, 2019)

For a broad class of input-output maps, arguments based on the coding theorem from algorithmic information theory (AIT) predict that simple (low Kolmogorov complexity) outputs are exponentially more likely to occur upon uniform random sampling of inputs than complex outputs are. Here, we derive probability bounds that are based on the complexities of the inputs as well as the outputs, rather than just on the complexities of the outputs. The more that outputs deviate from the coding theorem bound, the lower the complexity of their inputs. Our new bounds are tested for an RNA sequence to structure map, a finite state transducer and a perceptron. These results open avenues for AIT to be more widely used in physics.
Deep links between physics and theories of computation [1, 2] are being increasingly exploited to uncover new fundamental physics and to provide novel insights into theories of computation. For example, advances in understanding quantum entanglement are often expressed in sophisticated information-theoretic language, while providing new results in computational complexity theory such as polynomial-time algorithms for integer factorization [3]. These connections are typically expressed in terms of Shannon information, with its natural analogy with thermodynamic entropy.

There is, however, another branch of information theory, called algorithmic information theory (AIT) [4], which is concerned with the information content of individual objects. It has been much less applied in physics (although notable exceptions occur, see [5] for a recent overview). Reasons for this relative lack of attention include that AIT's central concept, the Kolmogorov complexity $K_U(x)$ of a string $x$, defined as the length of the shortest program that generates $x$ on a universal Turing machine (UTM) $U$, is formally uncomputable due to its link to the famous halting problem of UTMs [6]. Moreover, many important results, such as the invariance theorem, which states that for two UTMs $U$ and $W$ the Kolmogorov complexities are equivalent, $K_U(x) = K_W(x) + O(1)$, hold asymptotically up to $O(1)$ terms that are independent of $x$ but not always well understood, and therefore hard to control.

Another reason applications of AIT to many practical problems have been hindered can be understood in terms of hierarchies of computing power. For example, one of the oldest such categorisations, the Chomsky hierarchy [7], ranks automata into four different classes, of which the UTMs are the most powerful and finite state machines (FSMs) are the least. Many key results in AIT are derived by exploiting the power of UTMs. Interestingly, if physical processes can be mapped onto UTMs, then certain properties can be shown to be uncomputable [8, 9]. However, many problems in physics are fully computable, and therefore lower on the Chomsky hierarchy than UTMs. For example, finite Markov processes are equivalent to FSMs, and RNA secondary structure (SS) folding algorithms can be recast as context-free grammars, the second level in the hierarchy. Thus, an important cluster of questions for applications of AIT revolves around extending its methods to processes lower in computational power than UTMs.

To explore ways of moving beyond these limitations and towards practical applications, we consider here one of the most iconic results of AIT, namely the coding theorem of Solomonoff and Levin [10, 11], which predicts that, upon running randomly chosen programs, the probability $P_U(x)$ that a universal Turing machine (UTM) generates output $x$ can be bounded as $2^{-K(x)} \leq P_U(x) \leq 2^{-K(x)+O(1)}$. Given this profound prediction of a general exponential bias towards simplicity (low Kolmogorov complexity), one might have expected widespread study and applications in science and engineering. This has not been the case because the theorem unfortunately suffers from the general issues of AIT described above (see however [12-14] for important attempts to apply the full coding theorem). Nevertheless, it has recently been shown [15, 16] that a related exponential bias towards low complexity outputs obtains for a range of non-universal input-output maps $f: I \to O$ that are lower on the Chomsky hierarchy than UTMs.
In particular, an upper bound on the probability $P(x)$ that an output obtains upon uniform random sampling of inputs,

$$P(x) \leq 2^{-a\tilde{K}(x) - b} \qquad (1)$$

was recently derived [15] using a computable approximation $\tilde{K}(x)$ to the Kolmogorov complexity of $x$, typically calculated using lossless compression techniques. Here $a$ and $b$ are constants that are independent of $x$ and which can often be determined from some basic information about the map.

FIG. 1: The probability $P(x)$ that a particular output arises upon random sampling of inputs versus output complexity $\tilde{K}(x)$ shows clear simplicity bias for: (a) a length $L = 15$ RNA sequence to SS mapping, (b) an FST, sampled over all $2^{30}$ binary inputs of length 30, and (c) a 7-input perceptron with weights discretised to 3 bits. The black solid line is the simplicity bias bound (1) (with $a$ and $b$ fit). For all these maps high complexity outputs occur with low probability. The outputs are colour coded by the maximum complexity $K_{\max}(p|n)$ of the set of inputs mapping to output $x$. Outputs further from the bound have lower input complexities. Figs. (d) length $L = 15$ RNA, (e) the FST and (f) the perceptron show the data plotted against the lower bound (8) (black line), with only the intercept fit to the data; the slope is a prediction. The orange line uses Eq. (8) with a normalised probability as a parameter-free predictor. Including the complexity of the input through $K_{\max}(p|n)$ reduces the spread in the data, and so provides more predictive power than $K(x)$ alone.

The so-called simplicity bias bound (1) holds for computable maps $f$ where the number of inputs $N_I$ is much greater than the number of outputs $N_O$ and the map $f$ is simple, meaning that asymptotically $K(f) + K(n) \ll K(x) + O(1)$ for a typical output $x$, where $n$ specifies the size of $N_I$, e.g. $N_I = 2^n$. Eq. (1) typically works better for larger $N_I$ and $N_O$. Approximating the true Kolmogorov complexity also means that the bound shouldn't work for maps where a significant fraction of outputs have complexities that are not qualitatively captured by compression based approximations. For example, many pseudo-random number generators are designed to produce outputs that appear to be complex when measured by compression or other types of Kolmogorov complexity approximations. Yet these outputs must have low $K(x)$ because they are generated by relatively simple algorithms with short descriptions. Nevertheless, it has been shown that the bound (1) works remarkably well for a wide class of input-output maps, ranging from the sequence to RNA secondary structure map, to systems of coupled ordinary differential equations, to a stochastic financial trading model, to the parameter-function map for several classes of deep neural networks [15, 17, 18].

The simplicity bias bound (1) predicts that high $P(x)$ outputs will be simple, and that complex outputs will have a low $P(x)$. But, in sharp contrast to the full AIT coding theorem, it does not have a lower bound, allowing low $\tilde{K}(x)$ outputs with low $P(x)$ that are far from the bound. Indeed, this behaviour is generically observed for many (non-universal) maps [15, 17] (see also Fig. 1), but should not be the case for UTMs that obey the full coding theorem.
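As a concrete illustration of how the constants in bound (1) might be obtained from data, the following minimal Python sketch fits $a$ and $b$ by linear regression along the upper envelope of sampled (complexity, probability) pairs. The envelope-regression choice and all names are illustrative assumptions, not necessarily the fitting procedure used for Fig. 1.

import numpy as np

def fit_simplicity_bias(K_tilde, P):
    """Fit a, b in P(x) <= 2**(-a*K_tilde(x) - b) from sampled data.
    K_tilde, P: complexity estimates and empirical probabilities, one entry per output."""
    K_tilde = np.asarray(K_tilde, dtype=float)
    logP = np.log2(np.asarray(P, dtype=float))
    # Keep only the most probable output at each complexity value (the upper envelope).
    env = {}
    for k, lp in zip(K_tilde, logP):
        env[k] = max(env.get(k, -np.inf), lp)
    ks = np.array(sorted(env))
    lps = np.array([env[k] for k in ks])
    # Straight-line fit log2 P = -a*K_tilde - b, so slope = -a and intercept = -b.
    slope, intercept = np.polyfit(ks, lps, 1)
    return -slope, -intercept

# Example with made-up numbers:
# a, b = fit_simplicity_bias([2, 5, 9, 12], [0.4, 0.1, 1e-3, 1e-5])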
Understanding the behaviour of outputs far from the bound should shed light on fundamental differences between UTMs and maps with less computational power that are lower on the Chomsky hierarchy, and may open up avenues for wider applications of AIT in physics.

With this challenge in mind, we take an approach that contrasts with the traditional coding theorem of AIT or with the simplicity bias bound, which only consider the complexity of the outputs. Instead, we derive bounds that also take into account the complexity of the inputs that generate a particular output $x$. While this approach is not possible for UTMs, since the halting problem means one cannot enumerate all inputs [4], and so averages over input complexity cannot be calculated, it can be achieved for non-UTM maps. Among our main results, we show that the further outputs are from the simplicity bias bound (1), the lower the complexity of the set of inputs. Since, by simple counting arguments, most strings are complex [4], the cumulative probability of outputs far from the bound is therefore limited. We also show that by combining the complexities of the output with those of the inputs, we can obtain better bounds on and estimates of $P(x)$.

Whether such bounds nevertheless have real predictive power needs to be tested empirically. Because input based bounds typically need exhaustive sampling, full testing is only possible for smaller systems, which restricts us here to maps where finite size effects may still play a role [15]. We test our bounds on three systems: the famous RNA sequence to secondary structure map (which falls into the context-free class in the Chomsky hierarchy), here for a relatively small size with length $L = 15$ sequences; a finite state transducer (FST), a very simple input-output map that is lowest on the Chomsky hierarchy [7], with length $L = 30$ binary inputs; and finally the parameter-function map [18, 19] of a perceptron [20] with discretized weights, to allow complexities of inputs to be calculated. The perceptron plays a key role in deep learning neural network architectures [21]. Nevertheless, as can be seen in Fig. 1(a-c), all three maps exhibit the simplicity bias predicted by Eq. (1), even if they are relatively small. In Ref. [15], much cleaner simplicity bias behaviour can be observed for larger RNA maps, but these are too big to exhaustively sample inputs. Similarly, cleaner simplicity bias behaviour occurs for the undiscretised perceptron [18], but then it is hard to analyse the complexity of the inputs.

Fig. 1(a-c) shows that the complexity of the input strings that generate each output $x$ decreases with increasing distance from the simplicity bias bound. This is the kind of phenomenon that we will attempt to explain.

To study input based bounds, consider a map $f: I \to O$ between $N_I$ inputs and $N_O$ outputs that satisfies the requirements for simplicity bias [15]. Let $f(p) = x$, where $p$ is some input program $p \in I$ producing output $x \in O$. For simplicity let $p \in \{0,1\}^n$, so that all inputs have length $n$ and $N_I = 2^n$ (this restriction can be relaxed later). Define $f^{-1}(x)$ to be the set of all the inputs that map to $x$, so that the probability that $x$ obtains upon sampling inputs uniformly at random is

$$P(x) = \frac{|f^{-1}(x)|}{2^n} \qquad (2)$$

Any arbitrary input $p$ can be described using the following $O(1)$ procedure [15]: assuming $f$ and $n$ are given, first enumerate all $2^n$ inputs and map them to outputs using $f$. The index of a specific input $p$ within the set $f^{-1}(x)$ can then be described using at most $\log_2(|f^{-1}(x)|)$ bits.
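A minimal sketch of the setup around Eq. (2): a toy computable map $f$ from all $2^n$ binary inputs to shorter binary outputs, with $P(x) = |f^{-1}(x)|/2^n$ computed by exhaustive enumeration. The particular map (ANDing adjacent bits) and the small value of $n$ are illustrative choices, not one of the systems studied here.

from collections import Counter
from itertools import product

n = 12  # input length, kept small so that enumerating all 2**n inputs is cheap

def f(p):
    """Toy computable map: output bit i is p[i] AND p[i+1]."""
    return tuple(p[i] & p[i + 1] for i in range(len(p) - 1))

preimage_size = Counter()                 # x -> |f^{-1}(x)|
for p in product((0, 1), repeat=n):
    preimage_size[f(p)] += 1

P = {x: count / 2 ** n for x, count in preimage_size.items()}
print(len(P), "distinct outputs; largest P(x) =", max(P.values()))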
In other words, this procedure identifies each input by first finding the output $x$ it maps to, and then finding its label within the set $f^{-1}(x)$. Given $f$ and $n$, an output $x = f(p)$ can be described using $K(x|f,n) + O(1)$ bits [15]. Thus, the Kolmogorov complexity of $p$, given $f$ and $n$, can be bounded as:

$$K(p|f,n) \leq K(x|f,n) + \log_2(|f^{-1}(x)|) + O(1). \qquad (3)$$

We note that this bound holds in principle for all $p$, but that it is tightest for the maximum-complexity input, $K_{\max}(p|f,n) \equiv \max_{p \in f^{-1}(x)} K(p|f,n)$. More generally, we can expect these bounds to be fairly tight for the maximum complexity $K_{\max}(p|f,n)$ of inputs due to the following argument. First note that

$$K_{\max}(p|f,n) \geq \log_2(|f^{-1}(x)|) + O(1) \qquad (4)$$

because any set of $|f^{-1}(x)|$ different elements must contain strings of at least this complexity. Next,

$$K(x|f,n) \leq K(p|f,n) + O(1) \qquad (5)$$

because each $p$ can be used to generate $x$. Therefore:

$$\max\bigl(K(x|f,n),\, \log_2(|f^{-1}(x)|)\bigr) \leq K_{\max}(p|f,n) + O(1), \qquad (6)$$

so the bound (3) cannot be too weak. In the worst case scenario, where $K_{\max}(p|n) \approx \log_2(|f^{-1}(x)|) \approx K(x|f,n)$, the right hand side of the bound (3) is approximately twice the left hand side (up to additive $O(1)$ terms). It is tighter if either $K(x|f,n)$ is small, or if $K(x|f,n)$ is big relative to $\log_2(|f^{-1}(x)|)$. As is often the case for AIT predictions, the stronger the constraint/prediction, the more likely it is to be observed in practice, because, for example, the $O(1)$ terms are less likely to drown out the effects.

By combining with Eq. (2), the bound (3) can be rewritten in two complementary ways. Firstly, a lower bound on $P(x)$ can be derived of the form:

$$P(x) \geq 2^{-K(x|f,n) - [n - K(p|f,n)] + O(1)} \qquad (7)$$

$\forall p \in f^{-1}(x)$, which complements the simplicity bias upper bound (1). This bound is tightest for $K_{\max}(p|n)$. In Ref. [15] it was shown that $P(x) \leq 2^{-K(x|f,n)+O(1)}$ by using a similar counting argument to that used above, together with a Shannon-Fano-Elias code procedure. Similar results can be found in standard works [4, 22].

A key step is to move from the conditional complexity to one that is independent of the map and of $n$. If $f$ is simple, then the explicit dependence on $n$ and $f$ can be removed by noting that since $K(x) \leq K(x|f,n) + K(f) + K(n) + O(1)$, and $K(x|f,n) \leq K(x) + O(1)$, then $K(x|f,n) \approx K(x) + O(1)$. In Eq. (1) this is further approximated as $K(x|f,n) + O(1) \approx a\tilde{K}(x) + b$, leading to a practically useable upper bound. The same argument can be used to remove explicit dependence on $n$ and $f$ for $K(p|f,n)$.

FIG. 2: Deviation of $P(x)$ from the simplicity bias upper bound (1) increases with increasing randomness deficit $\delta_{\max}(x) = n - K_{\max}(p|n)$ for (a) $L = 15$ RNA, (b) $L = 30$ FST, (c) perceptron with weights discretised to 4 bits. For the perceptron, all functions with the same $P(x)$ and $K(x)$ are averaged together to reduce scatter. Points are colour coded by output complexity $K(x)$. For the upper bound (9) (black line) we fit the intercept, but the slope is a prediction; if we treat it as a normalised probability we obtain the orange line, which is a direct prediction with no free parameters.

If we define a maximum randomness deficit $\delta_{\max}(x) = n - K_{\max}(p|n)$, then this tightest version of bound (7) can be written in a simpler form as

$$P(x) \geq 2^{-(a\tilde{K}(x) + b) - \delta_{\max}(x) + O(1)} \qquad (8)$$
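As a purely illustrative worked instance of bound (7), with hypothetical numbers not taken from any of the maps studied here: suppose $n = 30$, an output has $K(x|f,n) \approx 10$ bits, and the most complex input in $f^{-1}(x)$ has $K_{\max}(p|f,n) \approx 25$ bits, so that $\delta_{\max}(x) = 5$. Then bound (7) gives

$$P(x) \geq 2^{-10 - (30 - 25) + O(1)} = 2^{-15 + O(1)},$$

to be compared with the output-only upper bound $P(x) \leq 2^{-10 + O(1)}$: the 5-bit randomness deficit of the inputs quantifies how far $P(x)$ may sit below that upper bound.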
In Figs. 1(d-f) we plot this lower bound for all three maps studied. Throughout the paper we use a scaled complexity measure, which ensures that $\tilde{K}(x)$ ranges between $\approx 0$ and $\approx n$ bits for strings of length $n$, as expected for Kolmogorov complexity. See Methods for more details.

When comparing the data in Figs. 1(d-f) to Figs. 1(a-c), it is clear that including the input complexities reduces the spread in the data for RNA and the FST, although for the perceptron model the difference is less pronounced. This success suggests using the bound (8) as a predictor, $P(x) \approx 2^{-K(x|f,n) - \delta_{\max}(x)}$, with the additional constraint that $\sum_x P(x) = 1$ to normalise it. As can be seen in Figs. 1(d-f), this simple procedure works reasonably well, showing that the input complexity provides additional predictive power to estimate $P(x)$ from some very generic properties of the inputs and outputs.

A second, complementary way that bound (3) can be expressed is in terms of how far $P(x)$ differs from the simplicity bias bound (1):

$$\log_2(P_0(x)) - \log_2(P(x)) \leq [n - K(p|f,n)] + O(1) \qquad (9)$$

where $P_0(x) = 2^{-K(x|f,n)} \approx 2^{-a\tilde{K}(x) - b}$ is the upper bound (1) shown in Figs. 1(a-c).

For a random input $p$, with high probability we expect $K(p|f,n) = n + O(1)$ [4]. Thus, Eqs. (7) and (9) immediately imply that large deviations from the simplicity bias bound (1) are only possible with highly non-random inputs with a large randomness deficit $\delta_{\max}(x)$. In Fig. 2(a)-(c) we directly examine bound (9), showing explicitly the prediction that a drop of probability $P(x)$ by $\Delta$ bits from the simplicity bias bound (1) corresponds to a $\Delta$ bit randomness deficit in the set of inputs.

Simple counting arguments can be used to show that the number of non-random inputs is a small fraction of the total number of inputs [23]. For example, for binary strings of length $n$, with $N_I = 2^n$, the number of inputs with complexity $K = n - \delta$ is approximately $2^{-\delta} N_I$. If we define a set $\mathcal{D}(f)$ of all outputs $x_i$ that satisfy $\log_2(P_0(x_i)) - \log_2(P(x_i)) \geq \Delta$, i.e. the set of all outputs for which $\log_2 P(x)$ is at least $\Delta$ bits below the simplicity bias bound (1), then this counting argument leads to the following cumulative bound:

$$\sum_{x \in \mathcal{D}(f)} P(x) \leq 2^{-\Delta + 1 + O(1)} \qquad (10)$$

which predicts that, upon randomly sampling inputs, most of the probability weight is for outputs with $P(x)$ relatively close to the upper bound. There may be many outputs that are far from the bound, but their cumulative probability drops off exponentially the further they are from the bound, because the number of simple inputs is exponentially limited. Note that this argument is for a cumulative probability over all inputs. It does not predict that for a given complexity $K(x)$ the outputs should all be near the bound. In that sense this lower bound is not like that of the original coding theorem, which holds for any output $x$.

Bound (10) does not need an exhaustive enumeration to be tested. In Fig. 3 we show this bound for a series of different maps, including many maps from [15]. The cumulative probability weight scales roughly as expected, implying that most of the probability weight is relatively close to the bound (at least on a log scale).

FIG. 3: The cumulative probability versus the distance from the bound $\Delta$ correlates with the cumulative bound (10) (red line) for (a) L=15 RNA, (b) L=30 FST, (c) perceptron, (d) fully connected 2-layer neural network from [19], (e) coarse-grained ordinary differential equation map from [15], (f) Ornstein-Uhlenbeck financial model from [15], (g) L-systems from [15], (h) simple matrix map from [15]. The solid red line is the prediction $2^{-\Delta+1}$ from Eq. (10); the dashed line denotes 10% cumulative probability.
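A minimal sketch of how the cumulative bound (10) can be checked numerically, assuming the empirical probabilities $P(x)$ and the fitted upper-bound values $P_0(x)$ from Eq. (1) have already been computed for every output (all names are illustrative):

import numpy as np

def cumulative_weight_below_bound(P, P0, Delta):
    """Total probability of outputs lying at least Delta bits below the upper bound.
    To be compared with the prediction 2**(1 - Delta) of Eq. (10)."""
    P, P0 = np.asarray(P, dtype=float), np.asarray(P0, dtype=float)
    drop = np.log2(P0) - np.log2(P)     # distance of each output below the bound, in bits
    return P[drop >= Delta].sum()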
What is the physical nature of these low complexity, low probability outputs that occur far from the bound? They must arise in one way or another from the lower computational power of these maps, since they do not occur in the full AIT coding theorem. Low complexity, low probability outputs correspond to output patterns which are simple, but which the given computable map is not good at generating.

In RNA it is easy to construct outputs which are simple but will have low probability. Compare two $L = 15$ structures, $S_1 = $ ((.(.(...).).)). and $S_2 = $ .((.((...)).))., which are both symmetric and thus have a relatively low complexity $K(S_1) = K(S_2) = 21.4$.
Nevertheless, they differ significantly in probability. $S_1$ has several single bonds, which are much harder to make according to the biophysics of RNA. Only specially ordered input sequences can make $S_1$; in other words those inputs are simple, with $K_{\max}(p|n) = 8.6$.
By contrast, the inputs of $S_2$ are much more complex, with $K_{\max}(p|n) \approx 21$.

Despite the well-known formal difficulties of AIT, such as uncomputability and uncontrolled $O(1)$ terms, that have led to the general neglect of AIT in the physics literature, the bounds are undoubtedly successful. It appears that, just as is found in other areas of physics, these relationships hold well outside of the asymptotic regime where they can be proven to be correct. This practical success opens up the promise of using such AIT based techniques to derive other results for computable maps from across physics.

Many new questions arise. Can it be proven when the $O(1)$ terms are relatively unimportant? Why do our rather simple approximations to $K(x)$ work? It would be interesting to find maps where these classical objections to the practical use of AIT are important. There may also be connections between our work and finite state complexity [24] or minimum description length [25] approaches. Progress in these domains should generate new fundamental understandings of the physics of information.

K.D. acknowledges partial financial support from the Kuwait Foundation for the Advancement of Sciences (KFAS) grant number P115-12SL-06. G.V.P. acknowledges financial support from EPSRC through grant EP/G03706X/1.
[1] M. Mezard and A. Montanari, Information, Physics, and Computation (Oxford University Press, USA, 2009).
[2] C. Moore and S. Mertens, The Nature of Computation (OUP Oxford, 2011).
[3] P. W. Shor, in Proceedings 35th Annual Symposium on Foundations of Computer Science (IEEE, 1994), pp. 124-134.
[4] M. Li and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications (Springer-Verlag New York Inc, 2008).
[5] S. Devine, Algorithmic information theory: Review for physicists and natural scientists (2014).
[6] A. M. Turing, J. of Math, 5 (1936).
[7] N. Chomsky, IRE Transactions on Information Theory, 113 (1956).
[8] S. Lloyd, Physical Review Letters, 943 (1993).
[9] T. S. Cubitt, D. Perez-Garcia, and M. M. Wolf, Nature, 207 (2015).
[10] R. J. Solomonoff, Information and Control, 1 (1964).
[11] L. Levin, Problemy Peredachi Informatsii, 30 (1974).
[12] J. Delahaye and H. Zenil, Appl. Math. Comput., 63 (2012).
[13] H. Zenil, F. Soler-Toscano, K. Dingle, and A. A. Louis, Physica A: Statistical Mechanics and its Applications, 341 (2014).
[14] F. Soler-Toscano, H. Zenil, J.-P. Delahaye, and N. Gauvrit, PLoS ONE, e96223 (2014).
[15] K. Dingle, C. Q. Camargo, and A. A. Louis, Nature Communications, 761 (2018).
[16] H. Zenil, L. Badillo, S. Hernández-Orozco, and F. Hernández-Quiroz, International Journal of Parallel, Emergent and Distributed Systems, 161 (2019).
[17] G. V. Pérez, A. A. Louis, and C. Q. Camargo, arXiv preprint arXiv:1805.08522 (2018).
[18] C. Mingard, J. Skalse, G. Valle-Pérez, D. Martínez-Rubio, V. Mikulik, and A. A. Louis, arXiv preprint arXiv:1909.11522 (2019).
[19] G. Valle-Pérez, C. Q. Camargo, and A. A. Louis, arXiv preprint arXiv:1805.08522 (2018).
[20] F. Rosenblatt, Psychological Review, 386 (1958).
[21] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436 (2015).
[22] P. Gács, Lecture notes on descriptional complexity and randomness (Boston University, Graduate School of Arts and Sciences, Computer Science Department, 1988).
[23] G. Chaitin, IEEE Transactions on Information Theory, 10 (1974).
[24] C. S. Calude, K. Salomaa, and T. K. Roblot, Theoretical Computer Science, 5668 (2011).
[25] P. Grünwald and T. Roos, arXiv preprint arXiv:1908.08484 (2019).

Supplementary Information for: Generic predictions of output probability based on complexities of inputs and outputs
Kamaludin Dingle, Guillermo Valle Pérez, Ard A. Louis
Centre for Applied Mathematics and Bioinformatics, Department of Mathematics and Natural Sciences, Gulf University for Science and Technology, Kuwait
Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Parks Road, Oxford, OX1 3PU, United Kingdom
(Dated: October 3, 2019)
RNA SEQUENCE TO SECONDARY STRUCTURE MAPPING
RNA is made of a linear sequence of 4 different kinds of nucleotides, so that there are $N_I = 4^L$ possible sequences for any particular length $L$. A versatile molecule, it can store information, as messenger RNA, or else perform catalytic or structural functions. For functional RNA, the three-dimensional (3D) structure plays an important role in its function. In spite of decades of research, it remains difficult to reliably predict the 3D structure from the sequence alone. However, there are fast and accurate algorithms to calculate the so-called secondary structure (SS) that determines which base binds to which base. Given a sequence, these methods typically minimize the Turner model [1] for the free energy of a particular bonding pattern. The main contributions in the Turner model are the hydrogen bonding and stacking interactions between the nucleotides, as well as some entropic factors to take into account motifs such as loops. Fast algorithms based on dynamic programming allow for rapid calculations of these SS, and so this mapping from sequences to SS has been a popular model for many studies in biophysics.

In this context, we view it as an input-output map, from $N_I$ input sequences to $N_O$ output SS structures. This map has been extensively studied (see e.g. [2-10]) and has provided profound insights into the biophysics of folding and evolution.

Here we use the popular Vienna package [3] to fold sequences to structures, with all parameters set to their default values (e.g. the temperature $T = 37^{\circ}$C). We folded all $N_I = 4^{15} \approx 10^9$ sequences of length 15 into 346 different structures, which were the free-energy minimum structures for those sequences. The number of sequences mapping to a structure is often called the neutral set size.

The structures can be abstracted in standard dot-bracket notation, where brackets denote bonds, and dots denote unbonded bases. For example, ...((....)).... means that the first three bases are not bonded, the fourth and fifth are bonded, the sixth through ninth are unbonded, the tenth base is bonded to the fifth base, the eleventh base is bonded to the fourth base, and the final four bases are unbonded.

To estimate the complexity of an RNA SS, we first converted the dot-bracket representation of the structure into a binary string $x$, and then used the complexity estimators described in the Methods section below to estimate its complexity. To convert to binary strings, we replaced each dot with the bits 00, each left bracket with the bits 10, and each right bracket with 01. Thus an RNA SS of length $n$ becomes a bit string of length $2n$. As an example, the following $n = 15$ structure yields the displayed 30-bit string: ..(((...)))....  →  000010101000000001010100000000
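A minimal sketch of the encoding just described: dot-bracket secondary structure to a binary string (dot -> 00, left bracket -> 10, right bracket -> 01), ready for the LZ-based complexity estimator of the Methods section. The commented ViennaRNA call is indicative only and assumes the Python bindings of the Vienna package (the RNA module) are installed.

CODE = {".": "00", "(": "10", ")": "01"}

def dotbracket_to_bits(ss):
    """Encode a dot-bracket secondary structure of length n as a 2n-bit string."""
    return "".join(CODE[c] for c in ss)

print(dotbracket_to_bits("..(((...)))...."))   # -> 000010101000000001010100000000

# import RNA                                   # ViennaRNA Python bindings
# ss, mfe = RNA.fold("GGGAAACCCAAAUUU")        # minimum free-energy structure of a sequence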
FINITE STATE TRANSDUCER

Finite state transducers (FSTs) are a generalization of finite state machines that produce an output. They are defined by a finite set of states $S$, finite input and output alphabets $I$ and $O$, and a transition function $T: S \times I \to S \times O$ defining, for each state and input symbol, a next state and output symbol. One also needs to define a distinguished state $S_0 \in S$, which will be the initial state before any input symbol has been read. Given an input sequence of $L$ input symbols, the system visits different states, and simultaneously produces an output sequence of $L$ output symbols.

FSTs form a popular toy system for computable maps. They can express any computable function that requires only a finite amount of memory, and the number of states in the FSTs offers a good parameter to control the complexity of the map. The class of machines we described above is also known as Mealy machines [12]. If one restricts the transition function to only depend on the current state, one obtains Moore machines [13]. If one considers the input sequence to a Moore machine to be stochastic, it immediately follows that its state sequence follows a Markov chain, and its output sequence is a Markov information source. Therefore, FSTs can be used to model many stochastic systems in nature and engineering which can be described by finite-state Markov dynamics.

FSTs lie in the lowest class in the Chomsky hierarchy. However, they appear to be biased towards simple outputs in a manner similar to Levin's coding theorem. In particular, Zenil et al. [14] show evidence of this by correlating the probability of FSTs and UTMs producing particular outputs. More precisely, they sampled random FSTs with random inputs, and random UTMs with random inputs, and then compared the empirical frequencies with which individual output strings are obtained by both families, after many samples of machines and inputs. For both types of machines, simple strings were much more likely to be produced than complex strings.

We use randomly generated FSTs with 5 states. The FSTs are generated by uniformly sampling complete initially connected DFAs (where every state is reachable from the initial state, and the transition function is defined for every input) using the library FAdo [15], which uses the algorithm developed by Almeida et al. [16]. Output symbols are then added to each transition independently and with uniform probability. In our experiments, the inputs are binary strings and the outputs are binary strings of length $L = 30$. The outputs for the whole set of $2^L$ input strings are computed using the HFST library (https://hfst.github.io/). Not all FSTs show bias, but we have observed that all those that show bias show simplicity bias, and have the same behavior as shown in the main-text figures for low complexity, low probability outputs.

We can see why some simple outputs will occur with low probability by considering system specific details of the FST. For an FST, producing an output of length $n$ of the form $0^{n/2}1^{n/2}$ requires counting up to $n/2$, while FSTs have only finite memory. We can also prove that, for instance, an FST that only produces such strings (for any $n$) is impossible. The set of possible strings that an FST can produce comprises a regular language, as constructed by using the output symbols at each transition as input symbols, giving us a non-deterministic finite automaton. Finally, using the pumping lemma [17], it is easy to see that this family of strings isn't a regular language.
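A minimal self-contained sketch of this kind of experiment in plain Python, rather than with the FAdo/HFST libraries used here, and without enforcing initial connectedness; the number of states, the input length and the random seed are illustrative choices:

from collections import Counter
from itertools import product
import random

random.seed(0)
S, L = 5, 14        # number of states; input length (kept small so enumeration is cheap)

# transition[(state, input_symbol)] = (next_state, output_symbol): a random Mealy machine
transition = {(s, a): (random.randrange(S), random.randrange(2))
              for s in range(S) for a in (0, 1)}

def fst_output(bits, start=0):
    """Run the transducer on an input bit sequence, returning the output sequence."""
    state, out = start, []
    for a in bits:
        state, o = transition[(state, a)]
        out.append(o)
    return tuple(out)

freq = Counter(fst_output(p) for p in product((0, 1), repeat=L))
P = {x: c / 2 ** L for x, c in freq.items()}
print(len(P), "distinct outputs; most probable output has P =", max(P.values()))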
PERCEPTRON

The perceptron [18] is the simplest type of artificial neural network. It consists of a single linear layer and a single output neuron with binary activation. Because modern deep neural network architectures are typically made of many layers of perceptrons, this simple system is important to study [19]. In this paper we use perceptrons with Boolean inputs and discretized weights. For inputs $x \in \{0,1\}^n$, the discretized perceptron uses the following parametrized class of functions: $f_{w,b}(x) = \Theta(w \cdot x + b)$, where $\Theta$ denotes the binary step activation, and $w \in \{-a, -a+\delta, \ldots, a-\delta, a\}^n$ and $b \in \{-a, -a+\delta, \ldots, a-\delta, a\}$ are the weight vector and bias term, which take values in a discrete lattice with $D := 2a/\delta + 1$ possible values per weight. We used $D = 2^k$, so that each weight can be represented by $k$ bits, with $a = (2^k - 1)/2$ and $\delta = 1$. Note that rescaling all the weights $w$ and the bias $b$ by the same fixed constant wouldn't change the family of functions.

To obtain the results of the main text, for which $n = 7$, we represented the weights and bias with $k = 3$ bits. We exhaustively enumerated all $2^{24}$ possible values of the weights and bias, and we counted how many times we obtained each possible Boolean function on the Boolean hypercube $\{0,1\}^7$. The weight-bias pair was represented using $3 \times (7 + 1) = 24$ bits. A pair $(w, b)$ is an input to the parameter-function map of the perceptron. The complexity of inputs to this map can therefore be approximated by computing the Lempel-Ziv complexity of the 24-bit representation of the pair $(w, b)$.

In Figure 1 below, we compare the simplicity bias of a perceptron with real-valued weights and bias, sampled from a standard Normal distribution, to the simplicity bias of the perceptron with discretized weights. We observe that both display similar simplicity bias, although the profile of the upper bound changes slightly.

For the perceptron we can also understand some simple examples of low complexity, low probability outputs. For example, the function which is 0 everywhere except for a 1 at the inputs (1,0,0,0,0,0,0) and (0,1,0,0,0,0,0) has a similar complexity to the function which only has 1s at the inputs (1,0,0,0,0,0,0) and (0,1,1,1,1,1,1). However, the latter has much lower probability. One can understand this because if we take the dot product of a random weight vector $w$ with two different inputs $x_1$ and $x_2$, the results have correlation given by $x_1 \cdot x_2 / (\|x_1\| \|x_2\|)$. Therefore we expect the input (0,1,1,1,1,1,1) to be correlated with more other inputs than (0,1,0,0,0,0,0), so that the probability of it having a different value than the majority of inputs (as is the case for the second function) is expected to be significantly lower.

FIG. 1: Probability versus complexity $\tilde{K}(x)$ (measured here as $C_{LZ}(x)$ from Eq. (1)) shows simplicity bias in the perceptron for (a) full continuous weights and (b) discretised weights. Since weights and biases are real-valued in (a), it is not straightforward to measure the complexity of the inputs, as it is for the discretised weights of (b).
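A minimal sketch of the enumeration just described, scaled down (n = 3 inputs and k = 2 bits per weight) so the full enumeration stays tiny; the step activation with the decision taken at w·x + b > 0, and all names, are assumptions made for illustration:

import numpy as np
from itertools import product
from collections import Counter

n, k, delta = 3, 2, 1
a = (2 ** k - 1) / 2                            # so that 2a/delta + 1 = 2**k weight levels
levels = [-a + i * delta for i in range(2 ** k)]
X = np.array(list(product((0, 1), repeat=n)))   # all 2**n Boolean inputs

freq = Counter()
for wb in product(levels, repeat=n + 1):        # n weights plus one bias
    w, b = np.array(wb[:n]), wb[n]
    func = tuple((X @ w + b > 0).astype(int))   # the Boolean function as a 2**n-entry tuple
    freq[func] += 1

total = len(levels) ** (n + 1)
print(len(freq), "distinct Boolean functions; max P =", max(freq.values()) / total)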
METHODS TO ESTIMATE COMPLEXITY $\tilde{K}(x)$

Lempel-Ziv compression

There is a much more extensive discussion of different ways to estimate the Kolmogorov complexity in the supplementary information of [11] and [20]. Here we use compression based measures, and as in these previous papers, these are based on the 1976 Lempel-Ziv (LZ) algorithm [21], but with some small changes:

$$C_{LZ}(x) = \begin{cases} \log_2(n), & x = 0^n \text{ or } 1^n \\ \log_2(n)\,[N_w(x_1 \ldots x_n) + N_w(x_n \ldots x_1)]/2, & \text{otherwise} \end{cases} \qquad (1)$$

Here $N_w(x)$ is the number of code words found by the LZ algorithm. The reason for distinguishing $0^n$ and $1^n$ is merely an artefact of $N_w(x)$, which assigns complexity $K = 1$ to the string 0 or 1, but complexity 2 to $0^n$ or $1^n$ for $n \geq 2$,
whereas the Kolmogorov complexity of such a trivial string actually scales as $\log_2(n)$, as one only needs to encode $n$. In this way we ensure that our $C_{LZ}(x)$ measure not only gives the correct behaviour for complex strings in the limit $n \to \infty$, but also the correct behaviour for the simplest strings. In addition to the $\log_2(n)$ correction, taking the mean of the complexity of the forward and reversed strings makes the measure more fine-grained, since it allows more values for the complexity of a string. Note that $C_{LZ}(x)$ can also be used for strings of larger alphabet sizes than just 0/1 binary alphabets.
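A self-contained sketch of this estimator, using the standard Kaspar-Schuster formulation of the 1976 Lempel-Ziv phrase count for $N_w$; the exact phrase-counting variant used here may differ in detail:

from math import log2

def lz76_phrases(s):
    """Number of code words (phrases) in the LZ76 parsing of the string s."""
    n = len(s)
    if n < 2:
        return 1
    i, k, l, k_max, c = 0, 1, 1, 1, 1
    while True:
        if s[i + k - 1] == s[l + k - 1]:
            k += 1
            if l + k > n:
                c += 1
                break
        else:
            if k > k_max:
                k_max = k
            i += 1
            if i == l:                 # all earlier starting points exhausted: new phrase
                c += 1
                l += k_max
                if l + 1 > n:
                    break
                i, k, k_max = 0, 1, 1
            else:
                k = 1
    return c

def c_lz(x):
    """C_LZ(x) of Eq. (1): log2(n) for the trivial strings 0^n and 1^n, otherwise
    log2(n) times the mean phrase count of the string and of its reverse."""
    n = len(x)
    if x == x[0] * n:
        return log2(n)
    return log2(n) * (lz76_phrases(x) + lz76_phrases(x[::-1])) / 2

# e.g. c_lz("010101010101") is much smaller than c_lz of a typical random 12-bit string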
Scaling complexities

To directly test the input based measures we typically need fairly small systems, where the LZ based measure above may show some anomalies (see also the supplementary information of [11] for a more detailed description).
Thus, for such small systems, or when comparing different types and sizes of objects (e.g. RNA SS and RNA sequences), a slightly different scaling may be more appropriate, which accounts not only for the fact that $C_{LZ}(x)$ can exceed $n$ for strings of length $n$, but also for the fact that the lower complexity limit may not be $\approx 0$, which it should be (see also the discussion in the supplementary information of [11]). Hence we use a different rescaling of the complexity measure

$$\tilde{K}(x) = \log_2(N_O) \cdot \frac{C_{LZ}(x) - \min(C_{LZ}(x))}{\max(C_{LZ}(x)) - \min(C_{LZ}(x))} \qquad (2)$$

which will now range over $0 \leq \tilde{K}(x) \leq \log_2(N_O) = n$ if, for example, $N_O = 2^n$. For large objects, this different scaling will reduce to the simpler one, because $\max(C_{LZ}(x)) \gg \min(C_{LZ}(x))$.

We note that there is nothing fundamental about using LZ to generate approximations to the true Kolmogorov complexity. Many other approximations could be used, and their merits may depend on the details of the problems involved. For further discussion of other complexity measures, see for example the supplementary information of Refs. [11, 22].
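A minimal sketch of the rescaling in Eq. (2), assuming the $C_{LZ}$ values of all outputs of the map are available to fix the minimum and maximum (names are illustrative):

from math import log2

def k_tilde(c_lz_x, c_lz_all, N_O):
    """Rescaled complexity of Eq. (2).
    c_lz_x   : C_LZ value of the output of interest
    c_lz_all : C_LZ values of all outputs of the map (fix the min and max)
    N_O      : number of distinct outputs of the map"""
    lo, hi = min(c_lz_all), max(c_lz_all)
    return log2(N_O) * (c_lz_x - lo) / (hi - lo)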
AN ALTERNATIVE WAY TO DERIVE THE CUMULATIVE BOUND

Here we examine other ways of deriving what are effectively lower bounds on the probability, as expressed in the cumulative bound (10) of the main text. First consider, as in [11], the function

$$q(x) = \frac{P_0(x)}{P(x)} \qquad (3)$$

where $P_0(x) = 2^{-K(x|f,n)+O(1)}$. Here $q(x)$ measures the ratio of the upper bound $P_0(x)$ to the probability $P(x)$ that an output $x$ is generated by random sampling of inputs. Because we work with computable maps, $\sum_x P(x) = 1$ by definition. However, the bound $P_0(x)$ is not normalised, as it is an upper bound on the true probability. One measure of its cumulative tightness is to calculate the expected value of $q(x)$ over all inputs, which we call $E_I$. This can be written as a sum over all outputs, where every output is weighted by $P(x)$:

$$E_I = \frac{1}{N_I} \sum_{i=1}^{N_I} q(x(p_i)) = \sum_{j=1}^{N_O} P(x_j)\, q(x_j) = \sum_{j=1}^{N_O} P_0(x_j) \qquad (4)$$
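A quick numerical sketch of $E_I$, assuming the approximate upper bound $2^{-(a\tilde{K}(x)+b)}$ is adopted for $P_0(x)$ and that the complexity estimates and the constants $a$, $b$ are already available (names are illustrative):

import numpy as np

def expected_tightness(K_tilde, a, b):
    """E_I of Eq. (4): the sum over all distinct outputs of the upper bound P_0(x).
    Values close to 1 indicate a cumulatively tight bound."""
    P0 = 2.0 ** (-(a * np.asarray(K_tilde, dtype=float) + b))
    return P0.sum()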
By definition of an upper bound, $q(x) \geq 1$, so that $E_I = \sum_{x \in O} 2^{-K(x|f,n)+O(1)} \geq 1$. Interestingly, because the $K(x|f,n)$ are the lengths of a prefix code, $\sum_{x \in O} 2^{-K(x|f,n)} \leq 1$, so that any excess of $E_I$ above 1 must come from the $O(1)$ terms.

In [11] Markov's inequality was used to derive a lower bound upon uniform random sampling of inputs,

$$\frac{r}{E_I}\, P_0(x) \leq P(x) \leq P_0(x) \qquad (5)$$

which holds with a probability of at least $1 - r$. The upper bound, given approximately by Eq. (1) of the main text, always holds of course. We measured $E_I$ explicitly for the maps in the main text and compared it to our approximate upper bound.

The cumulative bound (10) of the main text follows from a very simple argument. Recall that $\mathcal{D}(f)$ is defined as the set of all outputs $x_i$ that satisfy $\log_2(P_0(x_i)) - \log_2(P(x_i)) \geq \Delta$. Recall also that the upper bound is defined as $P_0(x) = 2^{-K(x|f,n)+O(1)}$. Then we can obtain the bound as follows:

$$\sum_{x \in \mathcal{D}(f)} P(x) \leq \sum_{x \in \mathcal{D}(f)} P_0(x)\, 2^{-\Delta} = \sum_{x \in \mathcal{D}(f)} 2^{-K(x|f,n)+O(1)-\Delta} = 2^{-\Delta+O(1)} \sum_{x \in \mathcal{D}(f)} 2^{-K(x|f,n)} \leq 2^{-\Delta+O(1)} \sum_{x} 2^{-K(x|f,n)} \leq 2^{-\Delta+O(1)},$$

where the last step follows from the Kraft inequality [23], which applies because the $K(x|f,n)$ are the lengths of a prefix code. If instead Eq. (4) were used for $\sum_x P_0(x) = E_I$ in the derivation above, then we would obtain

$$\sum_{x \in \mathcal{D}(f)} P(x) \leq E_I\, 2^{-\Delta} \qquad (6)$$

Although these arguments result in essentially the same bound as the cumulative bound in the main text, the connection with the complexity of inputs is more opaque. However, these derivations highlight other aspects of the bound, such as the role of the $O(1)$ term in the exponent. Therefore, the two derivations may give insight into the tightness or looseness of the bound in different situations.

[1] D. H. Mathews, M. D. Disney, J. L. Childs, S. J. Schroeder, M. Zuker, and D. H. Turner, Proceedings of the National Academy of Sciences, 7287 (2004).
[2] P. Schuster, W. Fontana, P. Stadler, and I. Hofacker, Proceedings: Biological Sciences, 279 (1994), ISSN 0962-8452.
[3] I. Hofacker, W. Fontana, P. Stadler, L. Bonhoeffer, M. Tacker, and P. Schuster, Monatshefte für Chemie/Chemical Monthly, 167 (1994).
[4] W. Fontana, BioEssays, 1164 (2002), ISSN 1521-1878.
[5] M. Cowperthwaite, E. Economo, W. Harcombe, E. Miller, and L. Meyers, PLoS Computational Biology, e1000110 (2008).
[6] J. Aguirre, J. M. Buldú, M. Stich, and S. C. Manrubia, PLoS ONE, e26324 (2011).
[7] S. Schaper and A. A. Louis, PLoS ONE, e86635 (2014).
[8] A. Wagner, Robustness and Evolvability in Living Systems (Princeton University Press, Princeton, NJ, 2005).
[9] A. Wagner,