The C-value enigma and timing of the Cambrian explosion
arXiv [q-bio.GN]
Dirson Jian Li ∗ & Shengli Zhang Department of Applied Physics, Xi’an Jiaotong University, Xi’an 710049, China ∗ E-mail: [email protected].
ABSTRACT
The Cambrian explosion is a grand challenge to science today, and its study is multidisciplinary. The event is generally believed to result from genetic innovations, environmental factors and ecological interactions, although the nature and timing of metazoan origins remain contested. The crux of the matter is that an entire roadmap of evolution is missing, one that would allow us to discern the transition in biological complexity and to evaluate the critical role of the Cambrian explosion in the overall evolutionary context. Here we calculate the time of the Cambrian explosion with an innovative "C-value clock"; our result (560 million years ago) fits the fossil record well. We argue that the intrinsic dynamics of genome evolution determined the Cambrian explosion. We have found a general formula for evaluating the genome size of different species, with which major questions of the C-value enigma can be addressed and genome size evolution can be illustrated. The Cambrian explosion is essentially a major transition of biological complexity, which corresponds to a turning point in genome size evolution. The observed maximum prokaryotic complexity is a relic of the Cambrian explosion, and it is bounded by the maximum information storage capability of the observed universe. Our results open a new prospect for studying metazoan origins and molecular evolution.
INTRODUCTION
The broad outline of Cambrian diversification has been known for more than a century, but only in the post-genomic era have the data necessary to explain the nature of the Cambrian explosion become available. The problem originated in paleontology and stratigraphy, and the debate about it may be as old as the problem itself [1][2][3]. Some authors ascribe the Cambrian explosion to intrinsic causes, while others believe that it was triggered by environmental factors. Innovative ideas have multiplied in the past decade with new fossil discoveries and progress in biogeochemistry, molecular systematics and developmental genetics [4][5][6][7]. However, insights from other fields, such as genome size evolution, self-organization, complexity theory and the holographic principle [8][9][10][11], are still needed to fully resolve this long-running problem.

There is a profound relationship between the Cambrian explosion and the C-value enigma. Why did so many complex creatures appear in the late Neoproterozoic and Cambrian, but not earlier or later? We believe that the nature and timing of the Cambrian explosion are determined by the evolution of genome size (see the schematic in Supplementary Figure 1). We devised a "C-value clock" to calculate the time of the Cambrian explosion from genomic data. The C-value clock rests on two notions: that evolutionary relationships are revealed by the correlation of protein length distributions, and that genome size evolution can serve as a chronometer.

Our theory starts from a formula for evaluating the genome size (namely the C-value) of different species. With this formula, the major component questions of the C-value enigma can be addressed and genome size evolution can be illustrated. Consequently, genome size evolution can be used as a chronometer for studying macroevolution. We found a unique turning point in genome size evolution and calculated its time, which corresponds to the Cambrian explosion. We believe that the Cambrian explosion was essentially a major transition of biological complexity that occurred when prokaryotic complexity reached its maximum value. We suggest that biological complexity is bounded by the maximum information storage capability of the observed universe.
RESULTS AND DISCUSSIONS
Genome size evolution.
Genome sizes vary extensively within and between taxa. We found that the genome size $S$ can be determined by two variables: the non-coding DNA content $\eta$ and the correlation polar angle $\theta$. Hence we obtained an empirical formula for the genome size of any contemporary species:

$$S(\eta, \theta) = s_0 \exp\left(\frac{\eta}{a} - \frac{\theta}{b}\right), \qquad (1)$$

where $s_0 = 7.[\ldots] \times 10^{[\ldots]}$ base pairs (bp), $a = 0.[\ldots]$ and $b = 0.[\ldots]$ were obtained by least squares from the data of $S$, $\eta$ and $\theta$ for the species listed in Supplementary Tables 1 and 2. We also obtained an empirical formula for the gene number,

$$N(\eta, \theta) = 1.[\ldots] \times 10^{[\ldots]} \exp\left(\frac{\eta}{[\ldots]} - \frac{\theta}{[\ldots]}\right),$$

and the relationship between non-coding DNA and coding DNA for eukaryotes, $\log N_{nc} = 2.81 \log N_c - [\ldots]$. The predictions of these formulae agree well with the experimental observations (Fig. 1a, 1b). The empirical genome size formula is the starting point of our theory; it is supported by many agreements between its predictions and experimental observations (especially the detailed agreements in Figs. 1, 3 and 4).

The genome size formula for contemporary species helps us write down the formula of genome size evolution from $t = T_0 = 3{,}800$ million years ago (Ma), the beginning of life [12], to $t = 0$ (today). We introduced a function $s(t)$ to describe the overall trend of genome size evolution according to the distribution of species in the $\eta$-$\theta$ plane. This is the main assumption of our theory. We can distinguish two phases in genome size evolution (Fig. 2a). In phase I, all the species in the lower triangle of the $\eta$-$\theta$ plane are simple prokaryotes with low non-coding DNA content. In phase II, all the species in the upper triangle of the $\eta$-$\theta$ plane are eukaryotes, and the non-coding DNA content increased to the maximum value $\eta^*$. It is reasonable, therefore, to take the critical event that divides the two phases as the Cambrian explosion.

Thus we obtain the formula of genome size evolution: $s_I(t) = s_1 \exp(t/\tau_1)$ for phase I and $s_{II}(t) = s_2 \exp(t/\tau_2)$ for phase II, where $s_1 = 1.[\ldots] \times 10^{[\ldots]}$ bp, $\tau_1 = 644$ million years and $s_2 = 1.[\ldots] \times 10^{[\ldots]}$ bp, $\tau_2 = 106$ million years (Fig. 2b). The result agrees qualitatively with the straightforward (but somewhat coarse) estimate of genome size evolution in Ref. [13] in that (i) genome size increases exponentially in both cases (namely linearly in the logarithmic plot of Fig. 2b) and (ii) both our result and that estimate show a unique turning point in genome size evolution (Fig. 2b). As expected, the dividing value of genome size in our theory, $s_I(T_c) = s_{II}(T_c)$, agrees with the observed maximum prokaryotic genome size [8].
Explanation of the C-value enigma.
The C-value enigma apparently concerns the lack of correlation between genome size and morphological complexity, but more profoundly it concerns the nature of the Cambrian explosion. From the genome size formula we obtained some general properties of genome size evolution, with which major questions of the C-value enigma can be explained.

According to the genome size evolution formula, we can distinguish two speeds of genome size evolution. In phase I, the genome size doubled about every $\tau_1 \ln 2 \approx 446$ million years on the whole, and in phase II about every $\tau_2 \ln 2 \approx 73$ million years. Genome size evolution in phase II (mainly an increase of non-coding DNA) is therefore much faster than in phase I (mainly an increase of coding DNA). The exponential pattern follows from the relation $\Delta s(t) \propto s\,\Delta t$ in each of the two phases. The overall picture of genome size evolution reflects the entire roadmap of the evolution of biological complexity, which helps in understanding macroevolution.

The Cambrian explosion helps to account for the genome size ranges within taxa. All phyla appeared almost simultaneously in the Cambrian explosion. In evolution, therefore, $\eta$ increases from $\bar\eta$ to $\eta^*$ for each phylum (Fig. 2a). The genome size in a phylum varies by about $\Delta = \lg \exp\left(\frac{\eta^* - \bar\eta}{a}\right) \sim [\ldots]$ orders of magnitude (Fig. 3). The history of a class is generally shorter than that of a phylum, so the genome size range in a class is smaller than that in a phylum; it varies by about $\delta = \lg \exp\left(\frac{\Delta\theta}{b}\right) \sim [\ldots]$ orders of magnitude (Fig. 3), where the uncertainty $\Delta\theta$ is estimated as $[\ldots]$ (Fig. 2a). Furthermore, we can explain the lack of correlation between genome size and morphological complexity. The origin of phyla in the Cambrian explosion was related to the appearance of kernels of gene regulatory networks, whose complexity varied notably, whereas the C-values of species in different phyla did not vary notably [5].
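The doubling times and size ranges discussed above follow from the exponential form of $s(t)$: a quantity proportional to $\exp(t/\tau)$ doubles every $|\tau| \ln 2$, and a change $x$ in the exponent corresponds to $x / \ln 10$ orders of magnitude. The short Python sketch below illustrates this arithmetic. The function names are hypothetical (introduced here for illustration), and only the two time constants are taken from the text.

```python
import math

def doubling_time(tau_my: float) -> float:
    """Doubling time (million years) of a quantity growing as exp(t/|tau|)."""
    return abs(tau_my) * math.log(2.0)

def orders_of_magnitude(exponent_change: float) -> float:
    """lg(exp(x)): orders of magnitude spanned when the exponent changes by x."""
    return exponent_change / math.log(10.0)

# Time constants quoted in the text (million years).
TAU_PHASE_I = 644.0
TAU_PHASE_II = 106.0

if __name__ == "__main__":
    print(f"phase I doubling time:  {doubling_time(TAU_PHASE_I):.0f} My")
    print(f"phase II doubling time: {doubling_time(TAU_PHASE_II):.1f} My")
```

With the fitted constants, the phylum-level range would be `orders_of_magnitude((eta_star - eta_bar) / a)` and the class-level range `orders_of_magnitude(delta_theta / b)`.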
The discrepancy between genome size and eukaryotic complexity was therefore present from the very beginning (Fig. 3).

Three clusters of prokaryotes, $C_{Gram-}$, $C_{Gram+}$ and $C_{small}$, can be distinguished in the lower triangle of the $\eta$-$\theta$ plane (Fig. 2a), in which Gram-negative bacteria, Gram-positive bacteria and bacteria with small genomes, respectively, are in the majority [14]. We evenly distributed dots (representing "species") in three symmetric areas enclosing $C_{Gram-}$, $C_{Gram+}$ and $C_{small}$ in Fig. 4a (the same areas as in Fig. 2a). After mapping the three symmetric areas with the non-linear transformation of Eqn. 1, we obtained three asymmetric areas $C'_{Gram-}$, $C'_{Gram+}$ and $C'_{small}$ in the $\eta$-$s$ plane in Fig. 4c. Finally, we obtained the prokaryotic genome size distribution in Fig. 4b by counting the numbers of species in genome size bins of identical width in Fig. 4c.
Timing of the Cambrian explosion.
The time $T_c$ of the Cambrian explosion can be calculated from the formula of genome size evolution. The function $s_I(t)$ represents the evolution of coding DNA; its extrapolated value $s_I(0) = s_1$ represents the size of coding DNA at present, and the value $s_{II}(0) = s_2$ represents the total genome size at present. For the coding DNA content at present, we equate the experimental value with the theoretical prediction, $1 - \eta^* = s_1/s_2$, where $s_1$ and $s_2$ are functions of $T_c$. Solving this equation gives

$$T_c = T_0\left(1 - \left(\frac{b}{1-\bar\eta}\ln(1-\eta^*) + \frac{b}{a}\,\frac{\eta^* - \bar\eta}{1-\bar\eta} + 1\right)^{-1}\right) \equiv f(\eta^*). \qquad (2)$$

This is the formula for calculating the time of the Cambrian explosion with the C-value clock, which differs radically from molecular clock estimates (Fig. 2c) [15][16]. The value $\eta^*$ should be that of the species whose $\eta$ is the largest and whose complexity is the greatest. The best choice is human: $\eta^* = 0.[\ldots]$ [17][18]. We therefore obtained the Cambrian explosion time $T_c = f(0.[\ldots]) = 560$ Ma. Our result agrees well with the fossil record (Fig. 2d).

This main result of the C-value clock shows that the Cambrian explosion corresponds to a turning point in genome size evolution. To our knowledge, it is the first time that paleontology and molecular biology have been reconciled on the timing of the Cambrian explosion. Considering the sensitive dependence of $T_c$ on $\eta^*$ (Fig. 2d), it is remarkable that almost the exact time of the Cambrian explosion can be calculated from the non-coding DNA content of the human genome. The relationship $T_c = f(\eta^*)$ indicates a close connection between the rapid expansion of non-coding DNA and the cause of the Cambrian explosion. The genetic mechanism can give us a clear and in-depth understanding of the Cambrian explosion. Both the development and the evolution of animal body plans should be studied at the level of gene regulatory networks [5][19]. The appearance of genomic regulatory systems may be a prerequisite for animal evolution.
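Eqn. 2 can be checked numerically. The Python sketch below evaluates $f(\eta^*)$ and verifies that the resulting $T_c$ satisfies the defining condition $1 - \eta^* = s_1/s_2$, using the expressions for $s_1$ and $s_2$ given in Methods. The function names are hypothetical, and the parameter values are illustrative placeholders chosen only to exercise the formula; they are not the fitted values of $a$, $b$, $\bar\eta$ or $\eta^*$ used in the paper.

```python
import math

def cambrian_time(eta_star: float, eta_bar: float, a: float, b: float,
                  T0: float = 3800.0) -> float:
    """Eqn. 2: T_c = T0 * (1 - 1/(q + 1)), in the same units as T0 (Ma)."""
    q = (b * math.log(1.0 - eta_star)
         + (b / a) * (eta_star - eta_bar)) / (1.0 - eta_bar)
    return T0 * (1.0 - 1.0 / (q + 1.0))

def coding_fraction(Tc: float, eta_star: float, eta_bar: float,
                    a: float, b: float, T0: float = 3800.0) -> float:
    """s_1/s_2 from the Methods expressions; the prefactor s_0 cancels."""
    ln_s1 = eta_bar / a - (Tc - eta_bar * T0) / (b * (Tc - T0))
    ln_s2 = eta_star / a - eta_bar / b
    return math.exp(ln_s1 - ln_s2)

# Hypothetical parameter values, for illustration only (NOT the fitted values).
params = dict(eta_star=0.98, eta_bar=0.12, a=0.16, b=0.17)

if __name__ == "__main__":
    Tc = cambrian_time(**params)
    # Consistency check: Eqn. 2 should solve 1 - eta* = s_1/s_2.
    print(Tc, coding_fraction(Tc, **params), 1.0 - params["eta_star"])
```

Because $T_c$ is proportional to $T_0$ in Eqn. 2, changing the assumed date of the beginning of life rescales the estimate by the same factor.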
The phylum-specific or subphylum-specific kernels of gene regulatory networks may explain the conservation of major phyletic characters ever since the Cambrian [19].

According to Eqn. 2, we obtained

$$\frac{\Delta T_c}{T_c} = -[\ldots]\frac{\Delta\eta^*}{\eta^*} + \frac{\Delta T_0}{T_0} - [\ldots]\frac{\Delta a}{a} + 0.[\ldots]\frac{\Delta b}{b} - [\ldots]\frac{\Delta\bar\eta}{\bar\eta}.$$

The error of the predicted $T_c$ therefore comes mainly from the parameter $\eta^*$. Considering the uncertainty in human gene prediction, the error of the coding DNA content of the human genome is about $[\ldots]$ [17]. Hence the predicted value of $T_c$ ranges between 560 Ma and 502 Ma (Fig. 2d). Even if the databases of complete genomes and proteomes expand considerably in the future, the parameters in Eqn. 1 should change only slightly, so the main results of this paper should remain valid. Incidentally, if $T_0$ is instead chosen as the date of the earliest known microfossils, i.e., $T_0 = 3{,}500$ Ma, the prediction becomes $T_c = 516$ Ma. There is a notable discrepancy between the molecular clock estimates and the fossil record [16][20][21]. For this problem the C-value clock performs better than the molecular clocks, and we conclude that the C-value clock estimate agrees with the fossil record in principle (Fig. 2d).

If the history of life is compared to a single day, why did complex life appear not in the morning or in the afternoon but around half past eight in the evening? In terms of the overall picture of genome size evolution in Fig. 2b, we can explain why simple life predominated on the planet for most of evolutionary history (about $(T_0 - T_c)/T_0 \approx 85\%$ of it). The reason is that non-coding DNA evolves much faster than coding DNA. The Cambrian explosion could not have happened in the first half of evolutionary history: $s_1$ is always less than $s_2$, so the turning point had to appear later than the time $T_0/2$. Furthermore, the Cambrian explosion must have happened very late because $s_1$ is in fact much less than $s_2$ at present; that is, the slope for the evolution of non-coding DNA is much steeper than the slope for the evolution of coding DNA (Fig. 2b).
Nature of the Cambrian explosion.
The formula of genome size evolution opens up an opportunity to investigate the entire roadmap of evolution in terms of biological complexity. It is observed that biological complexity increases faster and faster, but not smoothly [22][23][24]. The pattern of mass extinctions followed by rapid evolutionary radiations is widely considered to have fundamentally shaped the history of life, but it does not account for the Cambrian explosion. Evolution is not merely a mixture of accidental events. One with less perseverance could never spend billions of years assembling a jaguar from quarks! An overall mechanism of evolution is required to explain the Cambrian explosion. Genome size evolution is precisely a problem of macroevolution. In our theory, the function $s(t)$ represents not only the trend of genome size evolution but also the trend of the evolution of biological complexity, because prokaryotic complexity is related to genome size and eukaryotic complexity is related to non-coding DNA content [18]. The turning point in genome size evolution implies that there was a critical value of biological complexity in evolution, which is supported by the fact that neither the genome size nor the complexity of prokaryotes has ever reached that of eukaryotes. The constraint on prokaryotic complexity demanded a leap in biological complexity, and complex organisms bypassed this constraint during the Cambrian.

Several explanations have been proposed for the maximum prokaryotic complexity [25][26][8]. Its existence can be explained by the theory of accelerating networks [27]. It has been suggested that prokaryotic complexity may have been limited throughout evolution by regulatory overhead, and conversely that complex eukaryotes must have bypassed this constraint by novel strategies [25][22]. We give another explanation based on Kauffman's theory and the holographic principle [9][11][28]. The theory of self-organization provides deep insight into the spontaneous emergence of order that graces the living world [9]. Prokaryotic complexity should be understood as a property of a dynamical system at the level of gene networks. We can therefore define prokaryotic complexity by the information stored in Boolean networks, which is so immense that it can reach the maximum information content $I_{univ}$ of the observed universe. The holographic bound in physics imposes a strict limit on biological complexity. Information bridges biology and physics [29][30]. We believe that the maximum prokaryotic complexity is constrained by the upper limit of information storage capacity in our universe. Hence the maximum complexity of accelerating networks in the above explanation can be given concretely.

The Cambrian explosion of animal phyla differs radically from all other radiations, such as the radiations of modern birds and mammals in the early Tertiary, because it corresponds to the unique critical event in genome size evolution. The intrinsic dynamics of genome evolution determined the Cambrian explosion, during which biological complexity leapt not only at the anatomical level but also at the molecular level. The stability of the genomic system became low before the Cambrian explosion because the old mechanism of evolution was stifled. At this critical moment, any extrinsic factor could have turned evolution in a new direction. Numerous complex animal body plans were destined to arise at a certain time. In contrast, the causes of other radiations were full of uncertainty. The nature of the Cambrian explosion must be studied in a broader context than before. The Cambrian explosion and the origin of life were the most important events in the evolution from nonliving to living systems. We believe that the C-value enigma and the Cambrian explosion will help us uncover the intricate mechanism of evolution. Our work establishes a multidisciplinary framework for explaining the Cambrian explosion (see Supplementary Figure 1), which will shed light on the essence of evolution.
METHODS
The definition of the correlation polar angle $\theta$ and its biological meaning. The correlation polar angle indicates evolutionary relationship; its role in the C-value clock is as important as the role of sequence similarity in molecular clocks. The correlation polar angle is defined from protein length distributions, and it led to the discovery of the genome size formula once we realized the relationship between the genome size $S$ and the correlation polar angle $\theta$. In the following, we first define the correlation polar angle and then explain its biological meaning.

The protein length distribution is an intrinsic property of a species. It is defined as a distribution (namely a vector) $\mathbf{D} = (D_1, D_2, \ldots, D_n, \ldots)$, where $D_n$ is the number of proteins of length $n$ in the complete proteome of the species. Our protein length distributions were obtained from the complete proteomes in the database Predictions for Entire Proteomes [32]. The normalized vector of the protein length distribution, $\mathbf{d}$, is defined by the direction of the vector $\mathbf{D}$:

$$\mathbf{d} \equiv \frac{\mathbf{D}}{\sqrt{\mathbf{D}\cdot\mathbf{D}}} = \frac{\mathbf{D}}{\sqrt{\sum_n D_n^2}}.$$

Because there are few proteins longer than 3,000 amino acids in a complete proteome (Supplementary Figure 2c), we neglect them and set the length 3,000 as the cutoff of protein length in the calculation. Hence both $\mathbf{D}$ and $\mathbf{d}$ are 3000-dimensional vectors, and each species corresponds to a point on the 3000-dimensional unit sphere (Supplementary Figure 4a). The polar axis of the spherical coordinates (Supplementary Figure 4a) is defined by the direction of the vector of the total protein length distribution of the species (Supplementary Figure 2c),

$$\mathbf{Z} = \sum_{i \in \mathrm{species}} \mathbf{D}(i).$$

We denote the normalized vector of $\mathbf{Z}$ as the unit vector $\mathbf{z}$ of the polar axis; the corresponding point lies at the center of the swarm of points on the unit sphere (Supplementary Figure 4a).
The correlation polar angle $\theta$ of a species is defined as the polar angle of the corresponding vector of protein length distribution:

$$\theta \equiv \frac{2}{\pi}\arccos(\mathbf{d}\cdot\mathbf{z}),$$

where the factor $2/\pi$ is included so that the value of $\theta$ ranges from 0 to 1.

The biological meaning of the correlation polar angle can be interpreted as the average evolutionary relationship between a species and all the other species (Supplementary Figure 3b). The smaller the value of $\theta$, the closer the average evolutionary relationship. This interpretation is based on the following two considerations. (1) Let the vectors $\mathbf{d}(i)$ and $\mathbf{d}(j)$ correspond to the protein length distributions of two species $i$ and $j$ (Supplementary Figure 4a). The correlation between the two protein length distributions can be defined by their inner product $C_{ij} = \mathbf{d}(i)\cdot\mathbf{d}(j)$. Hence we obtain the correlation matrix $(C_{ij})$ (Supplementary Figure 3a). We can see that evolutionary relationship is closely related to the correlation between protein length distributions. The correlation polar angle $\theta$ of species $i$ can be interpreted as the average evolutionary relationship according to (compare Supplementary Figures 3a and 3b)

$$\cos\left(\frac{\pi}{2}\theta\right) = \sum_{j\in\mathrm{species}} \frac{\mathbf{d}(i)\cdot\mathbf{D}(j)}{\sqrt{\mathbf{Z}\cdot\mathbf{Z}}} = \sum_{j\in\mathrm{species}} \cos\left(\frac{\pi}{2}\theta_{ij}\right) w(j),$$

where $\theta_{ij} = \frac{2}{\pi}\arccos(C_{ij})$ is the correlation angle between two species and $w(j) = \sqrt{\mathbf{D}(j)\cdot\mathbf{D}(j)}/\sqrt{\mathbf{Z}\cdot\mathbf{Z}}$ is the weight of species $j$ in the summation. (2) An auxiliary polar axis $\mathbf{z}'$ can be defined by another direction, different from the polar axis. For example, we chose the direction corresponding to the distribution in Supplementary Figure 2d; the auxiliary polar angle is then defined by $\phi \equiv \frac{2}{\pi}\arccos(\mathbf{d}\cdot\mathbf{z}')$ (Supplementary Figure 4a).
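The definitions above (normalized length distribution, polar axis, correlation polar angle) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names are hypothetical, and the toy histograms use 4 length bins instead of the 3000 used in the paper.

```python
import math

def normalize(D):
    """Unit vector d = D / sqrt(D . D) of a protein length histogram."""
    norm = math.sqrt(sum(x * x for x in D))
    return [x / norm for x in D]

def polar_axis(distributions):
    """Unit vector z along Z, the sum of all species' length histograms."""
    n = len(distributions[0])
    Z = [sum(D[k] for D in distributions) for k in range(n)]
    return normalize(Z)

def correlation_angle(u, v):
    """(2/pi) * arccos(u . v) for unit vectors u and v."""
    c = sum(x * y for x, y in zip(u, v))
    c = max(-1.0, min(1.0, c))  # guard against rounding slightly outside [-1, 1]
    return (2.0 / math.pi) * math.acos(c)

# Hypothetical toy histograms: counts of proteins in 4 length bins.
toy = {
    "species_A": [5, 20, 10, 1],
    "species_B": [2, 15, 25, 8],
    "species_C": [0, 3, 12, 30],
}

z = polar_axis(list(toy.values()))
theta = {name: correlation_angle(normalize(D), z) for name, D in toy.items()}

if __name__ == "__main__":
    for name, value in theta.items():
        print(name, round(value, 3))
```

Because all histogram counts are non-negative, every inner product lies between 0 and 1, so each angle falls in the range 0 to 1, as stated in the text. The same `correlation_angle` function applied to two species' unit vectors gives $\theta_{ij}$, and applied to an auxiliary axis gives $\phi$.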
The high-dimensional unit sphere (dim = 3000) can then be projected onto the two-dimensional $\theta$-$\phi$ plane, where eukaryotes, archaebacteria and eubacteria gather in three separate areas (Supplementary Figure 4b) and closely related species also form clusters. The correlation polar angle is therefore a useful tool for studying evolutionary relationships. The conclusion remains valid if other directions are chosen for the auxiliary polar axis.

Derivation of Eqn. 1: the genome size $S(\eta, \theta)$. We found that $\ln S$ decreases linearly with $\theta$ (Supplementary Figure 5a) but increases linearly with $\eta$ (Supplementary Figure 5b) on the whole. Hence we wrote down the relation

$$\ln S = \ln s_0 + \frac{\eta}{a} - \frac{\theta}{b}.$$

From the biological data on genome size, $\eta$ and $\theta$ (Supplementary Table 2), we obtained the empirical genome size formula, Eqn. 1, and its coefficients $a$, $b$ and $s_0$ by least squares. Similarly, we obtained the gene number formula

$$N(\eta, \theta) = n_0 \exp\left(\frac{\eta}{a'} - \frac{\theta}{b'}\right).$$

The value of $\eta$ varies little among prokaryotes in both formulae and $b \approx b'$, so the genome size is approximately proportional to the gene number:

$$\frac{S}{N} \approx \frac{s_0}{n_0}\exp\left(\bar\eta\left(\frac{1}{a} - \frac{1}{a'}\right)\right) = 842,$$

which is close to the observed ratio [8]. This linear relationship breaks down for eukaryotes because of the wide variation of $\eta$.

The relationship between non-coding DNA $N_{nc}$ and coding DNA $N_c$ for eukaryotes. The average protein length for eukaryotes is about 450 amino acids, so the logarithm of coding DNA for eukaryotes is about $\log N_c = \log(3 \times 450\, n_0) + \eta/a' - \bar\eta/b'$ according to the gene number formula, and the logarithm of non-coding DNA is about $\log N_{nc} = \log s_0 + \eta/a - \bar\eta/b + \log\eta$ according to the genome size formula. Eliminating $\eta/a$ between the two expressions, we have

$$\log N_{nc} = \frac{a'}{a}\log N_c + \log s_0 - \frac{a'}{a}\log(1350\, n_0) - \frac{\bar\eta}{b} + \frac{\bar\eta}{b'}\,\frac{a'}{a} + \log\eta \approx 2.81\log N_c - [\ldots],$$

where we let $\log\eta \approx \log 0.[\ldots]$ in the calculation. From the experimental observation (Figure 1 in Ref. [33]), we obtain the relationship $\log N_{nc} = 2.82\log N_c - [\ldots]$ between non-coding DNA and coding DNA for actual species on the whole, if the two points $(6.[\ldots], [\ldots])$ and $(7.[\ldots], [\ldots])$ in Figure 1 of Ref. [33] are chosen to determine the line. Our result agrees closely with the experimental observation.

The genome size evolution function $s(t)$. We observe a right-angled distribution of contemporary species in the $\eta$-$\theta$ plane (Fig. 2a). The prokaryotes and the eukaryotes are separated by the diagonal line $\eta = \theta$. An underlying mechanism of genome size evolution is needed to account for this distribution. Some species originated earlier and others later, so the distribution of species in the $\eta$-$\theta$ plane has recorded information about genome size evolution. Hence we can write down the genome size evolution function.

The prokaryotes lie around the horizontal line $\eta = \bar\eta = 0.[\ldots]$, where $\bar\eta$ is the average of $\eta$ for prokaryotes (see Supplementary Tables 1 and 2). According to Eqn. 1, the genome size of prokaryotes increases as $\theta$ decreases. When $\theta$ is close to 1, there are few species because the genome size is too small for contemporary species. The eukaryotes, on the other hand, lie around the vertical line $\theta = \bar\eta$, and their genome size increases as $\eta$ increases.

We introduced a function $s(t)$ to describe the overall trend of genome size evolution according to the observed right-angled distribution, whose turning point corresponds to the largest genome size of prokaryotes (Fig. 2a). It is reasonable to define the genome size evolution function $s(t)$ as evolving leftwards along the horizontal line $\eta = \bar\eta$ and subsequently upwards along the vertical line $\theta = \bar\eta$ in the $\eta$-$\theta$ plane. This definition agrees not only with the right-angled distribution of species in the $\eta$-$\theta$ plane but also with the overall trend of genome size evolution from small to large.

Derivation of Eqn. 2: the Cambrian explosion time $T_c$. In phase I, $\eta(t) = \bar\eta$ and $\theta(t)$ decreases linearly from 1 to $\bar\eta$, i.e.,

$$\theta(t) = 1 - \frac{(1-\bar\eta)(t - T_0)}{T_c - T_0}.$$

So we have

$$s_I(t) \equiv s_0\exp\left(\frac{\eta(t)}{a} - \frac{\theta(t)}{b}\right) = s_1\exp(t/\tau_1),$$

where $s_1 = s_0\exp\left(\frac{\bar\eta}{a} - \frac{T_c - \bar\eta T_0}{b(T_c - T_0)}\right)$ and $\tau_1 = \frac{b(T_c - T_0)}{1-\bar\eta}$. Incidentally, we have $s' \equiv s_I(T_0) = s_0\exp\left(\frac{\bar\eta}{a} - \frac{1}{b}\right)$. In phase II, $\theta(t) = \bar\eta$ and $\eta(t) = \eta^* - (\eta^* - \bar\eta)\,t/T_c$. So we have

$$s_{II}(t) \equiv s_0\exp\left(\frac{\eta(t)}{a} - \frac{\theta(t)}{b}\right) = s_2\exp(t/\tau_2),$$

where $s_2 = s_0\exp\left(\frac{\eta^*}{a} - \frac{\bar\eta}{b}\right)$ and $\tau_2 = \frac{a(-T_c)}{\eta^* - \bar\eta}$. Finally, substituting the expressions for $s_1$ and $s_2$ into the equation $1 - \eta^* = s_1/s_2$, we obtained Eqn. 2.
Upper limit of the prokaryotic complexity.
Boolean networks have for several decades received much attention as models of the underlying mechanisms in evo-devo biology [9][34]. We define the network $N_L$ as a Boolean network whose nodes are all possible protein sequences of length less than $L$. The size of the state space of $N_L$ is $\sim 2^{20^L}$, since $N_L$ has about $20^L$ nodes. According to Shannon's theory, the information stored in this network is $I_{net} \sim \log_2 2^{20^L} = 20^L$ bits (Supplementary Figure 6). Types of prokaryotes can be interpreted as attractors of the Boolean network $N_L$, which are robust against perturbations in evolution [34]. The actual genome of an organism can be represented by one point among the $\sim 2^{20^L}$ points in the state space of $N_L$. Considering that biological complexity should be evaluated at the level of gene regulatory networks, prokaryotic complexity can be defined by the information $I_{net}$ stored in $N_L$. Its value is much greater than the information stored in the genetic sequences; the latter is not sufficient to measure biological complexity because it overlooks the complexity at the level of gene networks. This definition does not apply to eukaryotic complexity, which may involve RNA regulation [22].

We can show that the constrained maximum complexity of unicellular organisms can be explained by the upper limit of the information stored in a finite space. A major advance in the understanding of the fundamental laws of nature originated in the field of quantum gravity [35][36][28]. It states that the information storage capacity of a spatially finite system must be limited by its boundary area, measured in units of four Planck areas, unless the second law of thermodynamics is untrue. Consequently, the maximum information storage capacity of the observable universe is $I_{univ} \approx [\ldots]$ bits [37], which is a strict limit on the information content not only of physical systems but also of living organisms.
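The counting argument above can be made concrete with a short Python sketch. It works with base-10 logarithms, since $20^L$ quickly exceeds floating-point range, and it inverts $20^L = I$ for a given information bound $I$. The function names are hypothetical, and the bound used in the usage lines is an arbitrary placeholder for illustration, not the value of $I_{univ}$ from Ref. [37].

```python
import math

def log10_network_information(L: float) -> float:
    """log10 of I_net = 20**L bits, the information stored in N_L."""
    return L * math.log10(20.0)

def length_for_information(log10_bound: float) -> float:
    """Length L at which 20**L bits equals a bound given as log10(bits)."""
    return log10_bound / math.log10(20.0)

# Hypothetical bound, for illustration only: 10**100 bits.
ILLUSTRATIVE_LOG10_BOUND = 100.0

if __name__ == "__main__":
    L = length_for_information(ILLUSTRATIVE_LOG10_BOUND)
    print(f"20**L reaches 10**{ILLUSTRATIVE_LOG10_BOUND:.0f} bits at L = {L:.1f}")
```

Because $I_{net}$ grows exponentially with $L$, the length at which it reaches any fixed bound depends only logarithmically on that bound, so even a large uncertainty in the bound shifts $L$ only slightly.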
Setting $I_{net} \sim I_{univ}$, we obtained $L \sim [\ldots]$ amino acids, which corresponds strikingly to the most probable protein length for prokaryotes (Supplementary Figure 2b) [38]. The information stored in prokaryotic gene networks is thus so large as to be comparable to $I_{univ}$. We have thereby demonstrated the equivalence between the prokaryotic complexity and the information content $I_{univ}$ of our universe. One might say that the kind of spacetime determines the kind of life: a sufficiently vast spacetime is necessary to accommodate the immense information stored in life.

We are grateful to Hefeng Wang, Lei Zhang, and Yachao Liu for valuable discussions. Supported by NSF of China Grant No. 10374075.
References
[1] Valentine, J. W., Erwin, D. H. & Jablonski, D. Developmental evolution of metazoan body plans: the fossil evidence.
Dev. Biol., 373-381 (1996).
[2] Knoll, A. H. & Carroll, S. B. Early animal evolution: emerging views from comparative biology and geology. Science, 2129-2137 (1999).
[3] Conway Morris, S. The Cambrian explosion: Slow-fuse or megatonnage? Proc. Natl. Acad. Sci. USA, 4426-4429 (2000).
[4] Shu, D.-G. et al. Primitive deuterostomes from the Chengjiang Lagerstätte (Lower Cambrian, China). Nature, 419-424 (2001).
[5] Davidson, E. H. & Erwin, D. H. Gene regulatory networks and the evolution of animal body plans. Science, 796-800 (2006).
[6] Larroux, C. The NK Homeobox Gene Cluster Predates the Origin of Hox Genes. Current Biology, 706-710 (2007).
[7] Peltier, W. R., Liu, Y. & Crowley, J. W. Snowball Earth prevention by dissolved organic carbon remineralization. Nature, 813-818 (2007).
[8] Gregory, T. R. ed. The evolution of the genome (Elsevier, Amsterdam, 2005).
[9] Kauffman, S. A. The origins of order (Oxford Univ. Press, New York, 1993).
[10] Solé, R. V., Fernández, P. & Kauffman, S. A. Adaptive walks in a gene network model of morphogenesis: insights into the Cambrian explosion. arXiv Preprint Archive [online], http://arXiv.org/abstract/q-bio/0311013v1 (2003).
[11] Bekenstein, J. D. Information in the holographic universe. Sci. Am., 58-65 (2003).
[12] Mojzsis, S. J. et al. Evidence for life on Earth before 3,800 million years ago. Nature, 55-59 (1996).
[13] Sharov, A. A. Genome increase as a clock for the origin and evolution of life. Biology Direct :17 (2007).
[14] Trevors, J. T. Genome size in bacteria. Antonie van Leeuwenhoek, 293-303 (1996).
[15] Peterson, K. J. et al. Estimating metazoan divergence times with a molecular clock. Proc. Natl. Acad. Sci. USA, 6536-6541 (2004).
[16] Kumar, S. Molecular clocks: four decades of evolution. Nat. Rev. Genet., 654-662 (2005).
[17] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature, 931-945 (2004).
[18] Taft, R. J. & Mattick, J. S. Increasing biological complexity is positively correlated with the relative genome-wide expansion of non-protein-coding DNA sequences. arXiv Preprint Archive
[19] The regulatory genome: gene regulatory networks in development and evolution (Elsevier, Amsterdam, 2006).
[20] Hedges, S. B. & Kumar, S. Genomic clocks and evolutionary timescales. Trends Genet., 200-206 (2003).
[21] Bromham, L. & Penny, D. The modern molecular clocks. Nat. Rev. Genet., 216-224 (2003).
[22] Mattick, J. S. RNA regulation: a new genetics? Nature Rev. Genet., 316-323 (2004).
[23] Knoll, A. H. Proterozoic and early Cambrian protists: evidence for accelerating evolutionary tempo. Proc. Natl. Acad. Sci. USA, 6743-6750 (1994).
[24] Adami, C., Ofria, C. & Collier, T. C. Evolution of biological complexity. Proc. Natl. Acad. Sci. USA, 4463-4468 (2000).
[25] Croft, L. J., Lercher, M. J., Gagen, M. J. & Mattick, J. S. Is prokaryotic complexity limited by accelerated growth in regulatory overhead. arXiv Preprint Archive [online], http://arxiv.org/abs/q-bio.MN/0311021 (2003).
[26] Lynch, M. & Conery, J. S. The origins of genome complexity. Science, 1401-1404 (2003).
[27] Mattick, J. S. & Gagen, M. J. Accelerating Networks. Science, 856-857 (2005).
[28] Bousso, R. The holographic principle. Rev. Mod. Phys., 825-874 (2002).
[29] Wada, A. Bioinformatics - the necessity of the quest for 'first principles' in life. Bioinformatics, 663-664 (2000).
[30] Wheeler, J. A. Information, physics, quantum: the search for the links, in Proceedings of 3rd international symposium foundations of quantum mechanics, pp. 354-368 (Tokyo, 1989).
[31] Gregory, T. R. Macroevolution, hierarchy theory, and the C-value enigma. Paleobiology, 179-202 (2004).
[32] Carter, P., Liu, J. & Rost, B. PEP: Predictions for Entire Proteomes. Nucleic Acids Research, 410-413 (2003). The URL of the database PEP is http://cubic.bioc.columbia.edu/pep.
[33] Ahnert, S. E., Fink, T. M. A. & Zinovyev, A. How much non-coding DNA do eukaryotes require? arXiv Preprint Archive
[34] Proc. Natl. Acad. Sci. USA, 13439-13444 (2005).
[35] Bekenstein, J. D. Black holes and the second law. Lett. Nuovo. Cim., 737-740 (1972).
[36] Hawking, S. W. Black hole explosions? Nature, 30-31 (1974).
[37] Susskind, L. & Lindesay, J. An introduction to black holes, information, and the string theory revolution (World Scientific, Singapore, 2005).
[38] Rost, B. Did evolution leap to create the protein universe? Curr. Opin. Stru. Biol., 409-416 (2002).

[Figure 1 panels: a, observation of genome size versus prediction of genome size; b, observation of gene number versus prediction of gene number. Legend: Eukaryotes, Eubacteria, Archaebacteria.]
Figure 1:
Comparison between predictions and observations for genome size and gene number.
Our results fit the experimental observations well, not only for prokaryotes but also for eukaryotes. a, Genome size (correlation coefficient r = 0 . ). b, Gene number (correlation coefficient r = 0 . ).

[Figure 2 panels. a, Eubacteria, Eukaryotes and Archaebacteria in the θ-η plane, with Phase I, Phase II, the line η = θ, the curves s_I(t) and s_II(t), η*, and the labels C_Gram−, C_Gram+, C_Small and human. b, s(t) against t, with s_I(t), s_II(t) and the turning point. c, Molecular clock versus C-value clock: evolutionary relationship (similarity of sequences versus correlation of protein length distributions) and chronometer (molecular evolution versus genome size evolution). d, T_c (Ma) against η*, T_c = f(η*), with 560 Ma marked; fossil records ~570-510 Ma; molecular clock estimates ~1100-573 Ma; C-value clock estimates ~560-502 Ma.]
Figure 2:
Genome size evolution and the nature and timing of the Cambrian explosion. a, The distribution of species in the θ-η plane and the function of genome size evolution. b, The turning point of genome size evolution (red: total genetic DNA; blue: coding DNA). Our result (solid lines) is supported by a coarse estimate (thick dotted lines; the estimated times and genome sizes for taxa are taken from Ref. [13]). c, Comparison between the molecular clock and the C-value clock. d, The sensitive relationship T_c = f(η*): a small change in η* changes T_c considerably. According to the C-value clock estimate, T_c ranges approximately from 560 Ma to 502 Ma. The C-value clock result agrees with the fossil record [3] better than the molecular clock estimates do [15][16]. The usual molecular clock method is therefore likely to carry notable systematic errors.
[Figure 3 schematic. Complexity of phyla against lg (genome size), axis ticks 5 to 10, marking prokaryotes, species A and B, lg s′, lg s, lg s + ∆, and the ranges ∆ (phyla) and δ (class).]
Figure 3:
Explanation of C-value enigma: genome size range and eukaryotic complexity.
The ranges in genome size by order of magnitude ( ∆ ∼ . for phyla and δ ∼ . for classes) fit the experimental observations in general (see Fig. 1 in Ref. [31]). In observation, the genome sizes of the majority of phyla also vary by about magnitudes and the genome sizes of the majority of classes vary by less than magnitude [8]. In general, a species inherits its complexity from its phylum, so species A in a more complex phylum can potentially outstrip species B in a less complex phylum in complexity, even though the genome of A is much smaller than that of B.

[Figure 4 panels. a, Dots in the θ-η plane. b, Number of "species" against genome size (bp), marking peak I, peak II, shoulder, saddle and slope, with C_Gram−, C_Gram+ and C_small. c, s-η plane with C′_small, C′_Gram+ and C′_Gram−.]
Figure 4:
Explanation of C-value enigma: prokaryotic genome size distribution. a,
Evenly distributed dots (representing "species") in three symmetric areas in the θ-η plane. b, The predicted prokaryotic genome size distribution fits the experimental observation well. The principal features (the two peaks and the ratio of their heights) and even detailed features such as the shoulder, saddle and slope are almost the same as in the actual genome size distribution in Fig. 10.12 in Ref. [8]. c, The predicted prokaryotic distribution in the s-η plane (three asymmetric areas enclosed by lines) fits the intricate distribution of prokaryotes (green dots) well.

SUPPLEMENTARY INFORMATION
• Supplementary figures 1 ∼
• Supplementary tables 1 ∼

[Flowchart node labels: START; genome size formula (Eqn. 1) [Fig. 1a]; gene number formula [Fig. 1b]; relation between non-coding and coding DNA; C-VALUE ENIGMA; CAMBRIAN EXPLOSION; origin of phyla; kernels of gene regulatory networks; TURNING POINT in genome size evolution [Fig. 2b]; observed maximum prokaryotic complexity; genome size evolution formula as C-value clock [Fig. 2b, 2c]; Cambrian explosion time formula (Eqn. 2) [Fig. 2d]; accelerating networks; definition of prokaryotic complexity; Boolean networks; maximum information storage capacity in the observed universe; genome size distribution [Fig. 4b]; distribution in the s-η plane [Fig. 4c]; range of genome size [Fig. 3]; C-value vs. eukaryotic complexity [Fig. 3]; environmental factors; ecological interactions; MACROEVOLUTION: FROM NON-LIVING WORLD TO LIVING WORLD (BIG BANG, ORIGIN OF LIFE, CAMBRIAN EXPLOSION).]
Figure 5:
Supplementary Figure 1: The multidisciplinary framework to explain the Cambrian explosion.
We found a close relationship between the C-value enigma and the Cambrian explosion. Hence we invented a new method, the C-value clock, which depends on the empirical formula for genome size. The unique turning point in genome size evolution corresponds to the critical event of the Cambrian explosion. The constraint on unicellular genome evolution resulted in the upper limit on the complexity of unicellular organisms. We believe that the limited information storage capacity may determine the complexity of gene networks. The origin of life and the Cambrian explosion were the most important milestones in the evolution of biological complexity from nonliving systems to living systems.
[Supplementary Figure 2 panels. a, b, Protein number against protein length (axis 200 to 2000). c, d, Frequency against protein length. Labels: E. coli; Prokaryotes in database PEP; Total distribution as axis z; The distribution as axis z′.]
Figure 6:
Supplementary Figure 2: Protein length distributions. a, An example: the protein length distribution of E. coli. b, Total protein length distribution for prokaryotes. c, Total protein length distribution for all the species in the database PEP, which can be taken as the polar axis z in Supplementary Figure 4a. d, An outline of the protein length distribution, which can be taken as the polar axis z′ in Supplementary Figure 4a.

[Supplementary Figure 3 panels. a, Correlation between the protein length distributions of species i and j, plotted as species i against species j. b, θ against species i.]
Figure 7:
Supplementary Figure 3: The evolutionary relationship can be revealed by the correlation between protein length distributions. a, The correlation matrix (C_ij) represents the evolutionary relationship between any pair of species i and j among the 106 species. The species in the matrix are ordered by average protein length, from short to long, within archaebacteria, eubacteria, viruses and eukaryotes respectively. Concretely, from the 1st position to the 106th position in the correlation matrix, the species are given by their serial numbers in Supplementary Table 1: 3, 8, 84, 65, 66, 83, 95, 51, 64, 82, 96, 63, 87, 9, 104, 49, 10, 40, 31, 93, 76, 91, 45, 94, 78, 57, 21, 90, 86, 53, 89, 11, 59, 61, 58, 62, 42, 13, 50, 34, 101, 4, 41, 60, 48, 33, 47, 26, 106, 56, 20, 35, 5, 100, 39, 2, 97, 46, 44, 37, 6, 54, 92, 85, 16, 81, 102, 38, 28, 15, 73, 77, 19, 23, 70, 18, 22, 24, 14, 69, 80, 17, 27, 103, 36, 79, 98, 30, 74, 29, 32, 99, 1, 75, 72, 12, 71, 52, 68, 25, 55, 7, 67, 105, 88, 43. b, The correlation polar angle θ for each of the species (see Supplementary Table 2) can be interpreted as the average evolutionary relationship: the larger the average correlation between protein length distributions, the smaller the value of θ; and the smaller the value of θ, the closer the average evolutionary relationship.

[Supplementary Figure 4 panels. a, Unit sphere (dimension = 3000) with axes z and z′, angles θ, φ, θ′ and φ′, and distances d and d′. b, Eubacteria, Archaebacteria and Eukaryotes plotted against θ and φ.]
Figure 8:
Supplementary Figure 4: The correlation polar angle and the evolutionary relationship. a, The correlation polar angle θ and the auxiliary angle φ. b, Distribution of the three domains in the θ-η plane. The evolutionary relationship can be reflected by the correlation between the protein length distributions. Species in different domains cluster in different areas.
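The quantities named in Supplementary Figures 2 to 4 can be illustrated with a minimal sketch. It rests on assumptions that are not confirmed by the captions: that each protein length distribution is treated as a vector over lengths 1 to 3000 (suggested by the "dimension = 3000" label), that θ is the angle between a species' vector and the pooled distribution taken as axis z, and that the correlation matrix is a Pearson correlation. The function names and the toy data are hypothetical, and the synthetic lengths stand in for real proteomes from PEP.

```python
import numpy as np

def length_distribution(protein_lengths, max_len=3000):
    """Histogram of protein lengths 1..max_len, scaled to a unit vector."""
    lengths = np.asarray(protein_lengths, dtype=int)
    lengths = lengths[(lengths >= 1) & (lengths <= max_len)]
    counts = np.bincount(lengths, minlength=max_len + 1)[1:].astype(float)
    norm = np.linalg.norm(counts)
    if norm == 0:
        raise ValueError("no protein lengths in the range 1..max_len")
    return counts / norm

def polar_angle(vec, axis):
    """Angle in radians between a vector and a reference axis (assumed role of θ)."""
    cos_theta = np.dot(vec, axis) / (np.linalg.norm(vec) * np.linalg.norm(axis))
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def correlation_matrix(distributions):
    """Pearson correlation between every pair of distributions (assumed form of C_ij)."""
    return np.corrcoef(np.vstack(distributions))

# Toy data only: synthetic protein lengths, not taken from PEP.
rng = np.random.default_rng(0)
toy_lengths = {
    "toy_A": rng.gamma(shape=2.0, scale=150.0, size=2000).astype(int) + 1,
    "toy_B": rng.gamma(shape=2.0, scale=160.0, size=3000).astype(int) + 1,
    "toy_C": rng.gamma(shape=2.5, scale=200.0, size=5000).astype(int) + 1,
}
vectors = {name: length_distribution(v) for name, v in toy_lengths.items()}

# Pooled distribution of all toy species, playing the role of the polar axis z.
z_axis = length_distribution(np.concatenate(list(toy_lengths.values())))

thetas = {name: polar_angle(v, z_axis) for name, v in vectors.items()}
C = correlation_matrix(list(vectors.values()))

for name, theta in thetas.items():
    print(f"{name}: theta = {theta:.4f} rad")
print(np.round(C, 3))
```

Because every histogram is non-negative, the angle computed this way always lies between 0 and π/2, and a species whose distribution resembles the pooled distribution gets a small angle, which is at least consistent with the caption's reading of small θ as a close average relationship.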
[Supplementary Figure 5 panels. a, ln S against θ. b, ln S against η (η axis 0 to 1). Series: Eubacteria, Archaebacteria, Eukaryotes.]
Figure 9:
Supplementary Figure 5: Relationship between the genome size, the non-coding DNA content and the correlation polar angle. a, On the whole, ln S increases as θ decreases. b, On the whole, ln S increases as η increases.

[Supplementary Figure 6 schematic. Boolean network N_L and the state space of N_L. A node on the Boolean network N_L is one of ~20^L possible protein sequences; each node has two states. An attractor (a set of nodes) can be interpreted as the proteome of a species, which corresponds to one dot in the state space of N_L. Information capacity: 20^L bits. Arrow labels: evolution, proteome.]
Figure 10:
Supplementary Figure 6: Explanation of prokaryotic complexity by the Boolean network N_L and its state space. Each node on the network N_L is one of ∼20^L possible amino acid sequences and has two states, "on" or "off", according to the theory of Boolean networks. Each point in the state space of N_L represents a "proteome" (a set of "proteins", as an attractor of the Boolean network N_L, whose states are "on"). The attractor is robust against perturbations in evolution. The evolution of a species can be described by the trajectory of its evolving proteome in the state space of N_L. An underlying evolutionary mechanism is necessary to determine the movement of the species in the global state space of N_L, so the complexity of the life system is proportional to the number of points in the state space of N_L. The information stored in gene networks (20^L bits) reflects the complexity of the life system, which is compatible with the maximum information stored in the observed universe, I_univ.

Supplementary Table 1: Organisms in the database Predictions for Entire Proteomes (PEP)
Notes: There are 7 eukaryotes, 12 archaebacteria, 85 eubacteria and 2 viruses in PEP.

(No. 1) PEP FILE: achfl.pep ORGANISM: Acholeplasma florum (Mesoplasma florum); A (M) florum; achfl DOMAIN: Eubacteria
(No. 2) PEP FILE: aciad.pep ORGANISM: Acinetobacter sp (strain ADP1); A sp ADP1; aciad DOMAIN: Eubacteria
(No. 3) PEP FILE: aerpe.pep ORGANISM: Aeropyrum pernix K1; A pernix K1; aerpe DOMAIN: Archaebacteria
(No. 4) PEP FILE: agrt5.pep ORGANISM: Agrobacterium tumefaciens (strain C58 / ATCC 33970); A tumefaciens; agrt5 DOMAIN: Eubacteria
(No. 5) PEP FILE: agrtu.pep ORGANISM: Agrobacterium tumefaciens; A tumefaciens; agrtu DOMAIN: Eubacteria
(No. 6) PEP FILE: aquae.pep ORGANISM: Aquifex aeolicus; A aeolicus; aquae DOMAIN: Eubacteria
(No. 7) PEP FILE: arath.pep ORGANISM: Arabidopsis thaliana; A thaliana; arath DOMAIN: Eukaryote
(No. 8) PEP FILE: arcfu.pep ORGANISM: Archaeoglobus fulgidus; A fulgidus; arcfu DOMAIN: Archaebacteria
(No. 9) PEP FILE: bacaa.pep ORGANISM: Bacillus anthracis (strain Ames); B anthracis Ames; bacaa DOMAIN: Eubacteria
(No. 10) PEP FILE: bacce.pep ORGANISM: Bacillus cereus (ATCC 14579); B cereus (ATCC 14579); bacce DOMAIN: Eubacteria
(No. 11) PEP FILE: bacsu.pep ORGANISM: Bacillus subtilis; B subtilis; bacsu DOMAIN: Eubacteria
(No. 12) PEP FILE: bactn.pep ORGANISM: Bacteroides thetaiotaomicron VPI-5482; B thetaiotaomicron VPI-5482; bactn DOMAIN: Eubacteria
(No. 13) PEP FILE: barhe.pep ORGANISM: Bartonella henselae (Houston-1); B henselae Houston-1; barhe DOMAIN: Eubacteria
(No. 14) PEP FILE: barqu.pep ORGANISM: Bartonella quintana (Toulouse); B quintana Toulouse; barqu DOMAIN: Eubacteria
(No. 15) PEP FILE: bdeba.pep ORGANISM: Bdellovibrio bacteriovorus; B bacteriovorus; bdeba DOMAIN: Eubacteria
(No. 16) PEP FILE: borbr.pep ORGANISM: Bordetella bronchiseptica RB50; B bronchiseptica RB50; borbr DOMAIN: Eubacteria
(No. 17) PEP FILE: borbu.pep ORGANISM: Borrelia burgdorferi; B burgdorferi; borbu DOMAIN: Eubacteria
(No. 18) PEP FILE: borpa.pep ORGANISM: Bordetella parapertussis; B parapertussis; borpa DOMAIN: Eubacteria
(No. 19) PEP FILE: borpe.pep ORGANISM: Bordetella pertussis; B pertussis; borpe DOMAIN: Eubacteria
(No. 20) PEP FILE: braja.pep ORGANISM: Bradyrhizobium japonicum; B japonicum; braja DOMAIN: Eubacteria
(No. 21) PEP FILE: brume.pep ORGANISM: Brucella melitensis; B melitensis; brume DOMAIN: Eubacteria
(No. 22) PEP FILE: bucai.pep ORGANISM: Buchnera aphidicola (subsp. Acyrthosiphon pisum); B aphidicola (subsp. Acyrthosiphon pisum); bucai DOMAIN: Eubacteria
(No. 23) PEP FILE: bucap.pep ORGANISM: Buchnera aphidicola (subsp. Schizaphis graminum); B aphidicola (subsp. Schizaphis graminum); bucap DOMAIN: Eubacteria
(No. 24) PEP FILE: bucbp.pep ORGANISM: Buchnera aphidicola (subsp. Baizongia pistaciae); B aphidicola (subsp. Baizongia pistaciae); bucbp DOMAIN: Eubacteria
(No. 25) PEP FILE: caeel.pep ORGANISM: Caenorhabditis elegans; C elegans; caeel DOMAIN: Eukaryote
(No. 26) PEP FILE: camje.pep ORGANISM: Campylobacter jejuni; C jejuni; camje DOMAIN: Eubacteria
(No. 27) PEP FILE: canbf.pep ORGANISM: Candidatus Blochmannia floridanus; C Blochmannia floridanus; canbf DOMAIN: Eubacteria
(No. 28) PEP FILE: caucr.pep ORGANISM: Caulobacter crescentus; C crescentus; caucr DOMAIN: Eubacteria
(No. 29) PEP FILE: chlcv.pep ORGANISM: Chlamydophila caviae; C caviae; chlcv DOMAIN: Eubacteria
(No. 30) PEP FILE: chlmu.pep ORGANISM: Chlamydia muridarum; C muridarum; chlmu DOMAIN: Eubacteria
(No. 31) PEP FILE: chlte.pep ORGANISM: Chlorobium tepidum; C tepidum; chlte DOMAIN: Eubacteria
(No. 32) PEP FILE: chltr.pep ORGANISM: Chlamydia trachomatis; C trachomatis; chltr DOMAIN: Eubacteria
(No. 33) PEP FILE: chrvo.pep ORGANISM: Chromobacterium violaceum ATCC 12472; C violaceum ATCC 12472; chrvo DOMAIN: Eubacteria
(No. 34) PEP FILE: cloab.pep ORGANISM: Clostridium acetobutylicum; C acetobutylicum; cloab DOMAIN: Eubacteria
(No. 35) PEP FILE: clope.pep ORGANISM: Clostridium perfringens; C perfringens; clope DOMAIN: Eubacteria
(No. 36) PEP FILE: clote.pep ORGANISM: Clostridium tetani; C tetani; clote DOMAIN: Eubacteria
(No. 37) PEP FILE: cordi.pep ORGANISM: Corynebacterium diphtheriae NCTC 13129; C diphtheriae NCTC 13129; cordi DOMAIN: Eubacteria
(No. 38) PEP FILE: coref.pep ORGANISM: Corynebacterium efficiens; C efficiens; coref DOMAIN: Eubacteria
(No. 39) PEP FILE: corgl.pep ORGANISM: Corynebacterium glutamicum; C glutamicum; corgl DOMAIN: Eubacteria
(No. 40) PEP FILE: coxbu.pep ORGANISM: Coxiella burnetii; C burnetii; coxbu DOMAIN: Eubacteria
(No. 41) PEP FILE: deira.pep ORGANISM: Deinococcus radiodurans; D radiodurans; deira DOMAIN: Eubacteria
(No. 42) PEP FILE: desvh.pep ORGANISM: Desulfovibrio vulgaris subsp. vulgaris str. Hildenborough; D vulgaris subsp. vulgaris str. Hildenborough; desvh DOMAIN: Eubacteria
(No. 43) PEP FILE: drome.pep ORGANISM: Drosophila melanogaster; D melanogaster; drome DOMAIN: Eukaryote
(No. 44) PEP FILE: ecoli.pep ORGANISM: Escherichia coli; E coli; ecoli DOMAIN: Eubacteria
(No. 45) PEP FILE: entfa.pep ORGANISM: Enterococcus faecalis; E faecalis; entfa DOMAIN: Eubacteria
(No. 46) PEP FILE: erwca.pep ORGANISM: Erwinia carotovora; E carotovora; erwca DOMAIN: Eubacteria
(No. 47) PEP FILE: fusnu.pep ORGANISM: Fusobacterium nucleatum; F nucleatum; fusnu DOMAIN: Eubacteria
(No. 48) PEP FILE: glovi.pep ORGANISM: Gloeobacter violaceus; G violaceus; glovi DOMAIN: Eubacteria
(No. 49) PEP FILE: haedu.pep ORGANISM: Haemophilus ducreyi; H ducreyi; haedu DOMAIN: Eubacteria
(No. 50) PEP FILE: haein.pep ORGANISM: Haemophilus influenzae; H influenzae; haein DOMAIN: Eubacteria
(No. 51) PEP FILE: haln1.pep ORGANISM: Halobacterium sp. (strain NRC-1); H sp. (strain NRC-1); haln1 DOMAIN: Archaebacteria
(No. 52) PEP FILE: hcmva.pep ORGANISM: Human cytomegalovirus (strain AD169); HCMV (strain AD169); hcmva DOMAIN: virus
(No. 53) PEP FILE: helhe.pep ORGANISM: Helicobacter heilmannii; H heilmannii; helhe DOMAIN: Eubacteria
(No. 54) PEP FILE: helpy.pep ORGANISM: Helicobacter pylori; H pylori; helpy DOMAIN: Eubacteria
(No. 55) PEP FILE: human.pep ORGANISM: Homo sapiens; H sapiens; human DOMAIN: Eukaryote
(No. 56) PEP FILE: lacjo.pep ORGANISM: Lactobacillus johnsonii; L johnsonii; lacjo DOMAIN: Eubacteria
(No. 57) PEP FILE: lacla.pep ORGANISM: Lactococcus lactis (subsp. lactis); L lactis (subsp. lactis); lacla DOMAIN: Eubacteria
(No. 58) PEP FILE: lacpl.pep ORGANISM: Lactobacillus plantarum WCFS1; L plantarum WCFS1; lacpl DOMAIN: Eubacteria
(No. 59) PEP FILE: leixx.pep ORGANISM: Leifsonia xyli (subsp. xyli); L xyli (subsp. xyli); leixx DOMAIN: Eubacteria
(No. 60) PEP FILE: lepic.pep ORGANISM: Leptospira interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni); L interrogans (serogroup Icterohaemorrhagiae / serovar Copenhageni); lepic DOMAIN: Eubacteria
(No. 61) PEP FILE: lisin.pep ORGANISM: Listeria innocua; L innocua; lisin DOMAIN: Eubacteria
(No. 62) PEP FILE: lismo.pep ORGANISM: Listeria monocytogenes; L monocytogenes; lismo DOMAIN: Eubacteria
(No. 63) PEP FILE: metac.pep ORGANISM: Methanosarcina acetivorans; M acetivorans; metac DOMAIN: Archaebacteria
(No. 64) PEP FILE: metka.pep ORGANISM: Methanopyrus kandleri; M kandleri; metka DOMAIN: Archaebacteria
(No. 65) PEP FILE: metth.pep ORGANISM: Methanobacterium thermoautotrophicum; M thermoautotrophicum; metth DOMAIN: Archaebacteria
(No. 66) PEP FILE: mettm.pep ORGANISM: Methanobacterium thermoautotrophicum; M thermoautotrophicum; mettm DOMAIN: Archaebacteria
(No. 67) PEP FILE: mouse.pep ORGANISM: Mus musculus; M musculus; mouse DOMAIN: Eukaryote
(No. 68) PEP FILE: muhv4.pep ORGANISM: Murine herpesvirus 68 strain WUMS; Murine herpesvirus 68 strain WUMS; muhv4 DOMAIN: virus
(No. 69) PEP FILE: mycav.pep ORGANISM: Mycobacterium avium; M avium; mycav DOMAIN: Eubacteria
(No. 70) PEP FILE: mycbo.pep ORGANISM: Mycobacterium bovis AF2122/97; M bovis AF2122/97; mycbo DOMAIN: Eubacteria
(No. 71) PEP FILE: mycga.pep ORGANISM: Mycoplasma gallisepticum; M gallisepticum; mycga DOMAIN: Eubacteria
(No. 72) PEP FILE: mycge.pep ORGANISM: Mycoplasma genitalium; M genitalium; mycge DOMAIN: Eubacteria
(No. 73) PEP FILE: mycms.pep ORGANISM: Mycoplasma mycoides (subsp. mycoides SC); M mycoides (subsp. mycoides SC); mycms DOMAIN: Eubacteria
(No. 74) PEP FILE: mycpn.pep ORGANISM: Mycoplasma pneumoniae; M pneumoniae; mycpn DOMAIN: Eubacteria
(No. 75) PEP FILE: mycpu.pep ORGANISM: Mycoplasma pulmonis; M pulmonis; mycpu DOMAIN: Eubacteria
(No. 76) PEP FILE: neime.pep ORGANISM: Neisseria meningitidis; N meningitidis; neime DOMAIN: Eubacteria
(No. 77) PEP FILE: niteu.pep ORGANISM: Nitrosomonas europaea; N europaea; niteu DOMAIN: Eubacteria
(No. 78) PEP FILE: oceih.pep ORGANISM: Oceanobacillus iheyensis; O iheyensis; oceih DOMAIN: Eubacteria
(No. 79) PEP FILE: porgi.pep ORGANISM: Porphyromonas gingivalis; P gingivalis; porgi DOMAIN: Eubacteria
(No. 80) PEP FILE: pseae.pep ORGANISM: Pseudomonas aeruginosa; P aeruginosa; pseae DOMAIN: Eubacteria
(No. 81) PEP FILE: psepu.pep ORGANISM: Pseudomonas putida; P putida; psepu DOMAIN: Eubacteria
(No. 82) PEP FILE: pyrab.pep ORGANISM: Pyrococcus abyssi; P abyssi; pyrab DOMAIN: Archaebacteria
(No. 83) PEP FILE: pyrfu.pep ORGANISM: Pyrococcus furiosus; P furiosus; pyrfu DOMAIN: Archaebacteria
(No. 84) PEP FILE: pyrho.pep ORGANISM: Pyrococcus horikoshii; P horikoshii; pyrho DOMAIN: Archaebacteria
(No. 85) PEP FILE: ralso.pep ORGANISM: Ralstonia solanacearum; R solanacearum; ralso DOMAIN: Eubacteria
(No. 86) PEP FILE: rhilo.pep ORGANISM: Rhizobium loti; R loti; rhilo DOMAIN: Eubacteria
(No. 87) PEP FILE: riccn.pep ORGANISM: Rickettsia conorii; R conorii; riccn DOMAIN: Eubacteria
(No. 88) PEP FILE: schpo.pep ORGANISM: Schizosaccharomyces pombe; S pombe; schpo DOMAIN: Eukaryote
(No. 89) PEP FILE: shifl.pep ORGANISM: Shigella flexneri; S flexneri; shifl DOMAIN: Eubacteria
(No. 90) PEP FILE: staau.pep ORGANISM: Staphylococcus aureus; S aureus; staau DOMAIN: Eubacteria
(No. 91) PEP FILE: strag.pep ORGANISM: Streptococcus agalactiae; S agalactiae; strag DOMAIN: Eubacteria
(No. 92) PEP FILE: strco.pep ORGANISM: Streptomyces coelicolor; S coelicolor; strco DOMAIN: Eubacteria
(No. 93) PEP FILE: strpn.pep ORGANISM: Streptococcus pneumoniae; S pneumoniae; strpn DOMAIN: Eubacteria
(No. 94) PEP FILE: strpy.pep ORGANISM: Streptococcus pyogenes; S pyogenes; strpy DOMAIN: Eubacteria
(No. 95) PEP FILE: sulso.pep ORGANISM: Sulfolobus solfataricus; S solfataricus; sulso DOMAIN: Archaebacteria
(No. 96) PEP FILE: theac.pep ORGANISM: Thermoplasma acidophilum; T acidophilum; theac DOMAIN: Archaebacteria
(No. 97) PEP FILE: thema.pep ORGANISM: Thermotoga maritima; T maritima; thema DOMAIN: Eubacteria
(No. 98) PEP FILE: trepa.pep ORGANISM: Treponema pallidum; T pallidum; trepa DOMAIN: Eubacteria
(No. 99) PEP FILE: ureur.pep ORGANISM: Ureaplasma urealyticum; U urealyticum; ureur DOMAIN: Eubacteria
(No. 100) PEP FILE: vibch.pep ORGANISM: Vibrio cholerae; V cholerae; vibch DOMAIN: Eubacteria
(No. 101) PEP FILE: vibpa.pep ORGANISM: Vibrio parahaemolyticus RIMD 2210633; V parahaemolyticus RIMD 2210633; vibpa DOMAIN: Eubacteria
(No. 102) PEP FILE: wolsu.pep ORGANISM: Wolinella succinogenes; W succinogenes; wolsu DOMAIN: Eubacteria
(No. 103) PEP FILE: xanac.pep ORGANISM: Xanthomonas axonopodis (pv. citri); X axonopodis (pv. citri); xanac DOMAIN: Eubacteria
(No. 104) PEP FILE: xylfa.pep ORGANISM: Xylella fastidiosa; X fastidiosa; xylfa DOMAIN: Eubacteria
(No. 105) PEP FILE: yeast.pep ORGANISM: Saccharomyces cerevisiae; S cerevisiae; yeast DOMAIN: Eukaryote
(No. 106) PEP FILE: yerpe.pep ORGANISM: Yersinia pestis; Y pestis; yerpe DOMAIN: Eubacteria

Supplementary Table 2:
Data of η, θ and the comparison between theoretical predictions and experimental observations for genome size and gene number.

Notes: The serial numbers for organisms here are the same as those in Supplementary Table 1. The non-coding DNA contents η and the genome sizes are obtained from Ref. [18], of which species ( eukaryotes, archaebacteria and eubacteria, i.e., prokaryotes in total) can also be found in the database PEP. The gene numbers are the numbers of Open Reading Frames (ORFs) in the proteomes in PEP. In this table, the human non-coding content is taken from the draft human genome, following Ref. [18]. To calculate the accurate time of the Cambrian explosion, however, we choose the more precise value of η* given by the finished euchromatic sequence of the human genome in Ref. [17].

| No. | η | θ | genome size | S(η, θ) | gene number | N(η, θ) |
|---|---|---|---|---|---|---|
| 41 | 0.0910 | 0.2753 | 3284156 | 2.8916e+006 | 3099 | 3.1197e+003 |
| 42 | | 0.3107 | | | 3524 | |
| 43 | 0.8100 | 0.2562 | 120000000 | 2.5164e+008 | 18358 | 1.6650e+004 |
| 44 | 0.1220 | 0.2247 | 4641000 | 4.6505e+006 | 4281 | 4.6032e+003 |
| 45 | 0.1200 | 0.2852 | 3218031 | 3.2588e+006 | 3145 | 3.1186e+003 |
| 46 | | 0.2226 | | | 4463 | |
| 47 | 0.1020 | 0.3149 | 2714500 | 2.4678e+006 | 2067 | 2.4821e+003 |
| 48 | | 0.2462 | | | 4425 | |
| 49 | | 0.4102 | | | 1715 | |
| 50 | 0.1500 | 0.3434 | 4524893 | 2.8080e+006 | 1709 | 2.2966e+003 |
| 51 | | 0.3185 | | | 2058 | |
| 52 | | 0.6795 | | | 202 | |
| 53 | 0.0700 | 0.3495 | 1799146 | 1.6699e+006 | 1874 | 1.8581e+003 |
| 54 | 0.0920 | 0.3633 | 1643831 | 1.7640e+006 | 1564 | 1.7844e+003 |
| 55 | 0.9830 | 0.1889 | 3.0000e+009 | 1.0522e+009 | 37229 | 3.7131e+004 |
| 56 | | 0.3399 | | | 1813 | |
| 57 | 0.1260 | 0.3358 | 2365589 | 2.5342e+006 | 2266 | 2.2879e+003 |
| 58 | | 0.2637 | | | 3002 | |
| 59 | | 0.3320 | | | 2023 | |
| 60 | | 0.2837 | | | 3652 | |
| 61 | 0.0970 | 0.2748 | 3011209 | 3.0073e+006 | 2968 | 3.1706e+003 |
| 62 | 0.0970 | 0.2622 | 2944528 | 3.2293e+006 | 2833 | 3.4342e+003 |
| 63 | | 0.2999 | | | 4540 | |
| 64 | | 0.3418 | | | 1687 | |
| 65 | 0.0800 | 0.3228 | 1751377 | 2.0652e+006 | 1873 | 2.2511e+003 |
| 66 | | 0.3222 | | | 1869 | |
| 67 | 0.9500 | 0.1828 | 2.5000e+009 | 8.9214e+008 | 28085 | 3.5960e+004 |
| 68 | | 0.8092 | | | 80 | |
| 69 | | 0.2537 | | | 4340 | |
| 70 | 0.0900 | 0.2451 | 4345492 | 3.4112e+006 | 3906 | 3.7721e+003 |
| 71 | | 0.5086 | | | 726 | |
| 72 | 0.1200 | 0.5416 | 580070 | 7.5934e+005 | 484 | 609.2149 |
| 73 | | 0.5674 | | | 1016 | |
| 74 | | 0.4804 | | | 686 | |
| 75 | 0.0860 | 0.4867 | 963879 | 8.4373e+005 | 778 | 802.6126 |
| 76 | 0.1710 | 0.3407 | 2184406 | 3.2388e+006 | 2065 | 2.4452e+003 |
| 77 | | 0.3436 | | | 2461 | |
| 78 | | 0.2513 | | | 3496 | |
| 79 | | 0.3800 | | | 1909 | |
| 80 | 0.1060 | 0.2167 | 6264403 | 4.4184e+006 | 5563 | 4.6810e+003 |