The holographic principle and the language of genes
arXiv preprint [physics.bio-ph], May
Dirson Jian Li∗ and Shengli Zhang
Department of Applied Physics, Xi’an Jiaotong University, Xi’an 710049, China
We show that the holographic principle in quantum gravity imposes a strong constraint on life. The degrees of freedom of an organism, which are constrained by the entropy bound, can be estimated according to the theory of Boolean networks. Hence we can explain the languages in protein sequences or in DNA sequences. The overall evolution of biological complexity can be illustrated, and some general properties of protein length distributions can be explained by a linguistic mechanism.
PACS numbers: 87.10.-e, 87.18.-h, 04.60.Bc
Keywords: entropy bound, Boolean networks, formal language
INTRODUCTION
The general principles that govern non-living systems play significant roles in living systems. How do the principles of gravity theory or of quantum mechanics affect our understanding of life? An organism cannot remain active without a supply of energy, owing to the first law of thermodynamics, and it cannot live long without a supply of negative entropy, owing to the second law of thermodynamics. But there seem to be no direct effects of the relativity principle or the uncertainty principle on life. We found that the holographic principle, which is likely only one of several independent conceptual advances needed for progress in quantum gravity [1][2][3][4][5], profoundly constrains the forms of life and substantially affects the evolution of life.

The holographic principle states that there is a precise, general and surprisingly strong limit on the information content of spacetime regions. The entropy of a spatial region, i.e. the logarithm of the number of its quantum states, is bounded from above by the area of its boundary surface measured in units of four Planck areas. This entropy bound is a strong constraint on any theory of our universe. If this principle is true, neither field theory nor string theory, in which there are infinitely many degrees of freedom, can be the ultimate theory. And if this principle can be applied to the phenomenon of life, the degrees of freedom of a living system will also be constrained. From this point of view, the principles of relativity and of quantum theory constrain life in an alternative way. The holographic principle indicates that there is a strict relationship between the information storage capacity of space and the complexity of any organism within it. This basic idea can be illustrated by a simple example.
Whatever a living system with $n$ degrees of freedom is, it can never exist in a universe with a horizon area less than $4 n l_p^2$, where $l_p$ is the Planck length.

In this paper, we estimate the immense degrees of freedom of living systems according to the theory of gene regulatory networks and Boolean networks [6][7]. We find a contradiction between the possible degrees of freedom of living systems and the maximum information storage capacity of the observed universe. We then reconcile this contradiction in terms of the causality between the possible sequences of macromolecules in actual living systems, which is equivalent to the existence of a language of genes. We present evidence for the language of genes and explain the outline of protein length distributions by a linguistic mechanism for generating protein sequences. We can also explain the leaps in the evolution of biological complexity according to the entropy bound.

IMMENSE DEGREES OF FREEDOM IN LIVING SYSTEMS
Information properly bridges biology and physics [8][9][10], which gives deep insights into the nature of life. With the development of genetics, we know that gene regulatory networks play significant roles in the development and evolution of life [6]. Based on the theory of self-organization, Kauffman proposed a general theory of Boolean networks to describe gene regulatory networks, in which the interactions between genes are represented by Boolean operations between the nodes of the network [7]. Thus, the degrees of freedom of a living system can be estimated from the number of states of the corresponding Boolean network.

Proteins are the elementary units in the activities of life, so a living organism can be represented by a dynamical system of all the proteins in its body. We denote by $P$ the set of all possible protein sequences up to a cutoff protein length $l$. Proteins are chains built from 20 amino acids, so there are $m = \sum_{k=1}^{l} 20^k$ elements in the set $P$. We define $N$ as the Boolean network whose nodes are the elements of $P$ (Fig. 1a). By the definition of Boolean networks, each node has two states, “on” or “off” [7]. A state of $N$ specifies that some nodes are “on” while the others are “off”. A proteome can therefore be represented by a state of $N$ in which only the nodes corresponding to protein sequences in the proteome are “on”. The state space $S$ consists of all possible states of $N$, whose number is
$$n'_s = 2^m. \tag{1}$$
An actual species can be represented by a point in $S$, and the evolution of a species by a trajectory in $S$ (Fig. 2b). As a preliminary consideration, the degrees of freedom of a living system can be estimated by the logarithm of the number of states,
$$d' \sim \ln n'_s \sim 20^l \ln 2, \tag{2}$$
which we will reconsider later on.

According to the holographic principle, we can calculate that the information in the observed universe is about [11]
$$I_{\rm univ} = 10^{122}\ \text{bits}. \tag{3}$$
This value is too large for non-living systems.
For example, the information of the black-body cosmic background photons is about $10^{\ldots}$ bits, which may be the largest number of degrees of freedom available to non-living systems, yet it is still much less than $I_{\rm univ}$. The remaining information storage capacity of our universe has not been wasted, however, because there are living systems. The degrees of freedom of living systems are so immense that they may exceed the maximum information storage capacity of the observed universe.

The chain structure of genetic macromolecules essentially provides immense degrees of freedom for living systems, because the number of possible protein sequences can be as large as $20^l$. For a living system, the degrees of freedom become comparable to those of the observed universe if the protein length is about $n^* = 94$ amino acids. Interestingly, the most frequent protein length for life on our planet is about $n^*$. The immense degrees of freedom of living systems originate from the great number of possible sequences in $N$. Most of the degrees of freedom come from the states of $N$ in which about half the nodes are “on”. On the other hand, degrees of freedom can also come from states in which only a minority of nodes are “on”. Living systems on our planet belong to the latter case: there are only thousands of proteins in actual proteomes.

ENTROPY BOUND AND THE CAUSALITY OF SEQUENCES
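The preliminary estimate of the previous section is easy to check numerically. The following minimal Python sketch evaluates $m = \sum_{k=1}^{l} 20^k$ and $d' \sim m \ln 2$, and searches for the shortest cutoff length at which $d'$ reaches the capacity of the observed universe. It assumes $I_{\rm univ} \approx 10^{122}$ bits, the order of magnitude commonly quoted for the holographic bound; the function names are illustrative and are not part of the original analysis.

```python
import math

AMINO_ACIDS = 20        # size of the protein alphabet
LOG10_I_UNIV = 122      # assumption: log10 of the capacity of the observed universe


def num_sequences(l):
    """Number m of protein sequences with length 1..l over 20 amino acids."""
    return sum(AMINO_ACIDS ** k for k in range(1, l + 1))


def log10_preliminary_dof(l):
    """log10 of the preliminary estimate d' ~ ln(2**m) = m * ln 2."""
    return math.log10(num_sequences(l)) + math.log10(math.log(2))


def critical_length(log10_bound=LOG10_I_UNIV):
    """Shortest cutoff length l at which d' reaches the assumed bound."""
    l = 1
    while log10_preliminary_dof(l) < log10_bound:
        l += 1
    return l


n_star = critical_length()
```

Under these assumptions the search stops at a cutoff of 94 amino acids, in line with the value $n^*$ quoted above.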
The above estimate of the immense degrees of freedom of a living system, however, seriously contradicts the holographic principle if we consider the actual life around us. The average protein length in a proteome ranges from about 250 to about 550 amino acids, and a certain number of proteins are thousands of amino acids long. According to the preliminary estimate, the degrees of freedom of the actual living systems on our planet would be much larger than the maximum degrees of freedom in the observed universe, $I_{\rm univ}$. We have to reconcile the contradiction between the preliminary estimate of the degrees of freedom of living systems and the conclusion of the holographic principle. If the holographic principle is valid, we must find ways to shrink the preliminary estimate of the degrees of freedom of living systems.

FIG. 1: Boolean networks and the causal relationship between the macromolecular sequences. a, The nodes of the Boolean network $N$ consist of all possible protein sequences in $P$ with length up to $l$ amino acids. Each node has two states, “on” or “off”, so there are $2^m$ states for $N$. The state $s$ is a state of $N$ in which some nodes are “on” (represented by black dots). b, Only a part of the states in $N$ may have biological meaning in an actual living system. The number of states in $N$ may exceed $I_{\rm univ}$, but the number of states in $U$ cannot be greater than $I_{\rm univ}$.

We introduce the causality between the states of $N$ to reveal the additional constraint that the entropy bound places on the degrees of freedom. At the beginning of evolution on the planet, the first living system may be denoted as an initial state $s$. When the degrees of freedom of $N$ are greater than $I_{\rm univ}$, not all the states of $N$ can have a causal relationship with $s$ unless the holographic principle is untrue. We define the set $U$ as all the states that have a causal relationship with $s$; it has $n_s$ states and is only a proper subset of $S$ (Fig. 1b). The nodes of $U$ constitute $L$, which is a subset of $P$. An actual living system at present corresponds to a dynamical system evolving only in the state space $U$, and a biologically meaningful protein sequence must belong to $L$. The degrees of freedom of a living system can therefore be defined by the number of states in $U$:
$$d \equiv \ln n_s, \tag{4}$$
where $n_s$ is much less than $n'_s$, so that $d$ can indeed be less than $I_{\rm univ}$.

FIG. 2: The finite degrees of freedom require a language of sequences. a, A toy model explains the necessity of a language in sequences. Suppose that the entropy bound requires that the information cannot be greater than 2 bits in a tiny universe. Then no living system can correspond to the Boolean network with the 3 nodes aa, ab and ba. We can choose a subset, aa and ab, as the nodes of an admissible Boolean network, which corresponds to an actual living system in this universe. Thus we obtain a language consisting of 2 words: aa and ab. b, Only the states in the set $U \subset S$ have a causal relationship with the initial state $s$, owing to the entropy bound. We can obtain a language $L$ of the sequences, which is a subset (dots on $N$) of $P$. The number of elements in $L$ must be less than $I_{\rm univ}$.

THE LANGUAGE REQUIRED BY THE HOLOGRAPHIC PRINCIPLE
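The toy model of Fig. 2a can be enumerated explicitly. The sketch below, in Python with illustrative function names, lists every candidate language (non-empty subset of the words) whose Boolean network fits within the bound. It rests on the assumption that a network with $k$ nodes has $2^k$ states and therefore carries $k$ bits.

```python
from itertools import combinations


def network_bits(language):
    """Bits carried by the Boolean network on a language: one bit per node."""
    return len(language)


def admissible_languages(words, bound_bits):
    """All non-empty subsets of `words` whose network needs at most `bound_bits` bits."""
    languages = []
    for size in range(1, len(words) + 1):
        for subset in combinations(words, size):
            if network_bits(subset) <= bound_bits:
                languages.append(subset)
    return languages


WORDS = ("aa", "ab", "ba")                       # the three nodes of the toy model
LANGUAGES = admissible_languages(WORDS, bound_bits=2)
```

With a 2-bit bound the full three-word network is excluded, while the pair aa, ab chosen in Fig. 2a is among the admissible languages.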
Causality provides a physical explanation for distinguishing a subset of sequences $L$ from all possible sequences $P$. Not all amino acid chains or base chains are meaningful in biology. According to the theory of formal languages, a language is defined as a subset of all the sequences formed by concatenating letters of a given alphabet [12]. The choice of a subset $L$ from $P$ is a natural way to define a formal language (Fig. 2). The protein and DNA languages originate in the constraint that the entropy bound places on the degrees of freedom of life. The alphabet of the protein language consists of 20 amino acids, and the alphabet of the language of genes consists of 4 bases. The arrangement of the letters in the sequences should be determined by some grammars. Although there are various entropy bounds, all of them equally require finite degrees of freedom in life and hence a language of genes. To some extent, the language of genes is a consequence of the principles of quantum gravity. The phenomenon of life is constrained strictly by the entropy bound. The order that the grammars impose on sequences cannot be explained in the context of classical physics, because there the degrees of freedom of life can be infinite.

The ability of human beings to speak is determined by genes. That we can communicate with each other instinctively can be attributed to our common genes. Human language can be viewed as a transformation of cell language [13]. The information storage capacity of a natural language can also be estimated by a calculation similar to the one above. For instance, we estimate that up to $I_{\rm human} = 26^{l'}$ bits of information can be written in a language with 26 letters and a word length $l' \sim 10$, which is much less than the protein length. In this sense, natural language is simpler than the language of genes. The value $I_{\rm human}$ is much less than the information in the observed universe, $I_{\rm univ}$. So the description of the universe by natural language is always a simplified version of the actual complex world. Interestingly, it has not been rare in the history of the natural sciences for different routes to reach the same goal, as with Riemannian geometry and general relativity, or the theory of bundles and gauge theories. Such encounters may arise because the descriptions in different subjects share a common ultimate theory of all the information in the universe, although we cannot understand all the details of the world through any one subject alone.

THE LANGUAGE OF GENES AND UNDERLYING ORDER IN SEQUENCES
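The comparison between natural language and the language of genes made at the end of the previous section amounts to a short calculation. Taking $l' = 10$ for words and, purely as an illustrative choice, the length $n^* = 94$ quoted earlier for proteins, one finds

$$I_{\rm human} = 26^{10} \approx 1.4 \times 10^{14}, \qquad 20^{94} = 10^{94 \log_{10} 20} \approx 10^{122.3},$$

so the count for protein sequences exceeds the count for words by roughly 108 orders of magnitude.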
Several attempts have been made over the past three decades to combine linguistic theory with biology [14][15]. The distribution of the number of occurrences of protein domains in a genome is well fitted by the power-law distribution known in linguistics as Zipf’s law, and protein linguistics can be distinguished from the language of genes according to the theory of formal languages [15]. Experimental observations thus support the existence of languages in the sequences of macromolecules. On the one hand, these languages are required by the holographic principle. On the other hand, they are consequences of the evolution of life at the molecular level [16][17][18][19]. The alphabets of amino acids and bases formed at the beginning of life, and the genetic code developed and became fixed in the early stage of evolution. All these factors can determine whether a sequence is permitted in a living system, which is equivalent to the role of grammars at the molecular level.

FIG. 3: Relationship between the average protein length $\bar{l}$ and the frequency $f_m$ of the highest peak of the discrete Fourier transform of the protein length distribution. (Horizontal axis: $\bar{l}$; vertical axis: $f_m$; legend: Archaebacteria, Eubacteria, Eukaryote, Mycoplasma, Virus.) The distribution of the species from the three domains resembles a rainbow. Even for a group of closely related species such as the mycoplasmas (belonging to Eubacteria), the distribution also forms an “arch” of the rainbow. This is strong evidence for an underlying mechanism behind the protein length distributions.

We found strong evidence of an underlying mechanism in the organization of amino acids in protein sequences by studying the correlations between protein length distributions, which indicates the existence of languages in the protein sequences. The protein length distribution corresponds to a vector
$$\mathbf{D} = (D(1), D(2), \ldots, D(g), \ldots, D(c)), \tag{5}$$
where there are $D(g)$ proteins of length $g$ in the complete proteome of a species and $c = 3000$ is the cutoff of protein length. Our protein length distributions are obtained from 106 complete proteomes in the database Predictions for Entire Proteomes [20]. The discrete Fourier transform of the protein length distribution is
$$\tilde{D}(f) = \frac{1}{\sqrt{c}} \sum_{g=1}^{c} D(g)\, e^{2\pi i (g-1)(f-1)/c}. \tag{6}$$
Let $f_m$ denote the frequency of the highest peak $\tilde{D}(f_m)$ in the discrete Fourier transform of the protein length distribution of a species. We found an interesting relationship between the frequency $f_m$ and the average protein length $\bar{l}$ of species. The distribution of species in the $\bar{l}$-$f_m$ plane shows a regular pattern: the species of the three domains (Archaebacteria, Eubacteria and Eukaryotes) gather in three rainbow-like arches, respectively (Fig. 3). This pattern strongly indicates an intrinsic correlation among the protein length distributions, which could never arise if the protein length distributions were stochastic. The periodic-like fluctuations in the protein length distribution [21] may also originate in the underlying mechanism of generation of protein sequences.
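The transform of Eq. (6) and the definition of $f_m$ can be sketched directly. The following Python code is a minimal illustration, not the analysis pipeline used for Fig. 3: it implements the transform with the same normalization and 1-based frequency labels, and it assumes that the peak is sought among $f = 2, \ldots, \lfloor c/2 \rfloor + 1$, i.e. excluding the constant term $f = 1$ and the mirror-image half of the spectrum of a real-valued distribution. The synthetic input distribution is hypothetical.

```python
import cmath
import math


def dft(D):
    """Transform of Eq. (6); entry f-1 of the returned list holds D~(f)."""
    c = len(D)
    norm = 1.0 / math.sqrt(c)
    # The 0-based loop indices g and f play the roles of (g-1) and (f-1) in Eq. (6).
    return [
        norm * sum(D[g] * cmath.exp(2j * math.pi * g * f / c) for g in range(c))
        for f in range(c)
    ]


def peak_frequency(D):
    """1-based frequency f_m of the highest peak, skipping f = 1 (an assumption)."""
    spectrum = dft(D)
    half = len(D) // 2
    # 0-based indices 1..half correspond to the 1-based frequencies 2..half+1.
    best = max(range(1, half + 1), key=lambda f0: abs(spectrum[f0]))
    return best + 1


# Hypothetical distribution: constant background plus one component with 5 cycles.
c = 64
D = [10 + 5 * math.cos(2 * math.pi * 5 * g / c) for g in range(c)]
f_m = peak_frequency(D)
```

For this synthetic input the sketch gives $f_m = 6$, one more than the five cycles of the periodic component, because $f = 1$ labels the constant term. For the full cutoff $c = 3000$ a fast Fourier transform routine would be preferable to this quadratic-time loop.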
EXPLANATION OF THE ORDER IN PROTEIN SEQUENCES
We propose a model, based on tree adjoining grammar [22], to reveal the underlying mechanism in protein sequences. In the model, protein sequences are generated by tree adjoining operations, i.e., by substituting the initial tree or auxiliary trees into each other through identification of the inner nodes (Fig. 4a) [22]. The model has only one variable, $t$, which is the probability of substitution in the adjoining operations and distinguishes different species. A certain number of proteins can be generated once $t$ is fixed, and hence we obtain a protein length distribution from the model (Fig. 4b). The properties of protein length distributions can be explained by the simulation: the outline and the fluctuations of the simulated protein length distribution agree in principle with the actual protein length distributions.

We show that there is a close relationship between protein length distributions and grammar rules. The fluctuations in the distributions are determined by the grammar rules: the same grammar rule yields the same distribution, and changing the grammar rules yields different outlines and fluctuations. This result suggests that the fluctuations in actual protein length distributions are intrinsic properties of particular species and may reflect the underlying mechanism behind the order of protein sequences.

THE MACROEVOLUTION OF BIOLOGICAL COMPLEXITY
The evolution of the complexity of life is not a linear course of increase [23][24]. The entropy bound can also explain the leaps in the evolution of biological complexity, and consequently we can outline the macroevolution of life. Gene regulatory networks are accelerating networks [25][26]. According to this theory, the growth in complexity of any accelerating network has to slow down and will stop at an upper limit of complexity. Hence there must be upper limits of complexity in the evolution of both prokaryotes and eukaryotes, for which the entropy bound is a natural upper limit. The whole evolution of biological complexity can therefore be divided into three steps: the evolution of unicellular life, the evolution of multicellular life and the evolution of human society. The Cambrian explosion separated the first two steps. We found that the evolution of multicellular life has reached its upper limit, because the maximum non-coding DNA content is now close to 1. The civilization of human beings then appeared, which can be taken as an alternative form of biological complexity. The entire evolution of biological complexity should be governed by a universal mechanism of evolution. The universal language of genes across species may harmonize the evolution of life in the biosphere.

FIG. 4: Simulation of protein length distributions by a linguistic model. a, The tree adjoining grammar. There are one initial tree and two auxiliary trees, where $S$ and $T$ are inner nodes and $x$ or $xx$ are leaves which represent the amino acids. b, The simulation of the protein length distribution by the tree adjoining grammar (vertical axis: frequency). The properties of protein length distributions, such as the outline and fluctuations, can be simulated by the linguistic model.

We thank Hefeng Wang, Liu Zhao, Yachao Liu and Lei Zhang for valuable discussions. Supported by NSF of China Grant No. 10374075.

∗ [email protected]
[1] J. D. Bekenstein, Black holes and the second law. Lett. Nuovo Cim. , 737 (1972).
[2] S. Hawking, Black hole explosions? Commun. Math. Phys. , 199 (1975).
[3] G. ’t Hooft, Dimensional reduction in quantum gravity. arXiv Preprint Archive
[4] J. Math. Phys. , 6377 (1995).
[5] R. Bousso, The holographic principle. Rev. Mod. Phys. , 825-874 (2002).
[6] E. H. Davidson, The regulatory genome: gene regulatory networks in development and evolution (Elsevier, Amsterdam, 2006).
[7] S. A. Kauffman,
The origins of order (Oxford Univ. Press, New York, 1993).
[8] J. A. Wheeler, Information, physics, quantum: the search for the links, in Proceedings of 3rd international symposium foundations of quantum mechanics, pp. 354-368 (Tokyo, 1989).
[9] A. Wada, Bioinformatics - the necessity of the quest for ‘first principles’ in life. Bioinformatics , 663-664 (2000).
[10] B.-O. Küppers, Information and the origin of life (MIT Press, Cambridge Mass., 1990).
[11] L. Susskind and J. Lindesay, An introduction to black holes, information, and the string theory revolution: the holographic universe (World Scientific, Singapore, 2005).
[12] J. E. Hopcroft et al., Introduction to automata theory, languages, and computation (Addison Wesley, 2001).
[13] S. Ji, Isomorphism between cell and human languages: molecular biological, bioinformatic and linguistic implications. BioSystems , 17-39 (1997).
[14] M. Gimona, Protein linguistics. Nature Rev. Mol. Cell Biol. , 68-73 (2006).
[15] D. B. Searls, Language of genes. Nature , 211-217 (2002).
[16] R. D. Knight and L. F. Landweber, The early evolution of the genetic code. Cell , 569-572 (2000).
[17] E. Szathmáry, Why are there four letters in the genetic alphabet? Nature Rev. Genetics , 995-1001 (2003).
[18] S. Osawa et al., Recent evidence for evolution of the genetic code. Microbiol. Rev. 56, 229-264 (1992).
[19] E. N. Trifonov, The triplet code from first principles. J. Biomol. Struct. Dyn. , 1-11 (2004).
[20] The database with Predictions for Entire Proteomes on URL: http://cubic.bioc.columbia.edu/pep.
[21] A. L. Berman, E. Kolker and E. N. Trifonov, Underlying order in protein sequence organization. Proc. Natl. Acad. Sci. USA , 4044-4047 (1994).
[22] A. K. Joshi and Y. Schabes, in Handbook of Formal Languages, eds G. Rozenberg and A. Salomaa, pp. 69-214 (Springer, Heidelberg, 1997).
[23] J. S. Mattick, RNA regulation: a new genetics? Nature Rev. Genet. , 316-323 (2004).
[24] C. Adami, C. Ofria and T. C. Collier, Evolution of biological complexity. Proc. Natl. Acad. Sci. USA , 4463-4468 (2000).
[25] J. S. Mattick and M. J. Gagen, Accelerating networks. Science , 856-857 (2005).
[26] L. J. Croft, M. J. Lercher, M. J. Gagen and J. S. Mattick, Is prokaryotic complexity limited by accelerated growth in regulatory overhead? arXiv Preprint Archive