A Unified Theory on Construction and Evolution of the Genetic Code


Liaofu Luo*

Laboratory of Theoretical Biophysics, Faculty of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China
*Email address: [email protected]

Abstract

A quantitative theory on the construction and the evolution of the genetic code is proposed. By introducing the concept of mutational deterioration (MD) and developing a theoretical formalism for MD minimization we have proved that: 1, the redundancy distribution of codons in the genetic code obeys the MD minimization principle; 2, the hydrophilic-hydrophobic distribution of amino acids on the code table is global MD (GMD) minimal; 3, the standard genetic code can be deduced from the adaptive minimization of GMD; 4, the variants of the standard genetic code can be explained quantitatively by use of the GMD formalism, and the general trend of the evolution is GMD non-increasing, which reflects the selection on the code. We have demonstrated that the redundancy distribution of codons and the hydrophobic-hydrophilic (H-P) distribution of amino acids are robust in the code relative to the mutational parameters, and indicated that the GMD can be regarded as a non-fitness function on the adaptive landscape. Finally, an important aspect of the symmetry of the code construction, the Yin-Yang duality, is investigated. The Yin-Yang duality among codons affords a sound basis for understanding the H-P structure in the genetic code.

The approximate universality of the canonical genetic code and the discoveries of various deviant codes in a wide range of organisms strongly indicate that the genetic code is still evolving. Several mechanisms of code evolution have been proposed, for example, codon capture and ambiguous decoding by tRNA [Knight et al, 2001; Santos et al, 2004]. However, a unified theory that fully explains the evolution of the genetic code, both its high universality and its various deviations, is still lacking. Evidently, the point is closely related to the construction of the code. The construction of the genetic code obeys some general rules that afford a basis for understanding the universality and changeability of the code.
On the other hand, the error minimization property of the genetic code was analyzed by several authors [Di Giulio et al, 1994; Freeland & Hurst, 1998]. But it is still unclear why the canonical genetic code takes the standard form although its error is not fully minimized, and what the evolutionary constraints for deducing the standard code are. In this article we emphasize the unified understanding of code construction and code evolution. We shall show that the unification between code construction and code evolution can be achieved by introducing the concept of mutational deterioration (MD) and developing a theoretical formalism for MD minimization. The material is organized as follows. In the first section we review the mutational deterioration theory on the redundancy distribution in the genetic code. The adaptive minimization of global mutational deterioration and the accuracy of the genetic code are discussed in the second section. Next, in the third section, we study the evolvability of the genetic code from the point of view of the unified mutational deterioration theory. Finally, an important aspect of the symmetry of the code construction, namely the Yin-Yang duality in the genetic code, is investigated in the last section.

Mutational deterioration theory on the redundancy distribution in the genetic code

The constancy of the genetic code among different organisms is one of the most striking, interesting, and challenging phenomena in life. The mathematical relation behind the constancy has intrigued many biologists and physicists [di Giulio, 1997; Trifonov et al, 1997; Freeland et al, 1998; Maeshiro et al, 1998; Judson et al, 1999; Jimenez-Montano, 1999; Knight et al, 1999; Knight et al, 2000; Freeland et al, 2000; Weberndorfer et al, 2003; Chechetkin, 2003; Copley et al, 2005; Yang, 2005; Goodarzi et al, 2005; Chechetkin, 2006]. Historically, there are two different kinds of theories regarding the origin and evolution of the genetic code [Yockey, 1992; Freeland et al, 2003]. The first approach originated from Gamow [1954]. His “Diamond code” model opened up a way to explain the origin of the universal amino acid code through stereochemical interactions between codons or anticodons and amino acids [Woese et al, 1966; Woese, 1967; and recently, Knight et al, 2001; Yarus, 2000]. The second approach is called the “frozen accident” theory. The term “frozen accident”, first used by Crick, means that all living organisms evolved from a single ancient ancestor, and that after the evolutionary expansion of the descendants started, changes in the amino acid assignments of codons were no longer possible [Crick, 1968]. In fact, each of the two theories can explain part of the observations and experiments from its own standpoint, but neither is comprehensive. For example, the point that the canonical code is superior due to some specific fit or affinity between each amino acid and its codon (through the tRNA molecule as an adaptor) [Crick et al, 1961] has never been proved rigorously. The deviant codon assignments discovered since 1979 demonstrate that other codes are also possible [Jukes & Osawa, 1991]. On the other hand, the formation of the genetic code cannot be explained as a fully accidental event.
The hydrophobic order of amino acids is consistent with that of their anticodonic dinucleotides [Larcey et al, 1983; 1992]. This important fact shows that the codon assignments may be required thermodynamically and that some stereochemical relations may exist between the amino acids and the codons. So the historical accident and the stereochemical constraint both exist and play their roles together in the formation of the prevalent code. In later decades many new variants of the above two theories were proposed. For example, the frozen accident has been related to the expansion of the amino acid alphabet: Trifonov et al studied the temporal order of the 20 kinds of amino acids [Trifonov et al, 1997; Trifonov, 2004]. Wong proposed the coevolution theory, which holds that amino acids and the code coevolved [Wong, 1975; 1988; 2005].

From 1988 on we have proposed an alternative approach to the problem [Luo, 1988; 1989]. We started from the observation of the pattern of codon degeneracy and the investigation of the synonym redundancy distribution in the code. From the genetic code (Figure 1) we find that each degenerate codon doublet is located on the upper side or lower side of one of the 4×4 blocks in the table, that is, the first two nucleotides are the same and the third ones are related by a transitional mutation, a mutation that does not change the purine or pyrimidine type of the nucleotide. We also find that each degenerate codon quartet is located in one of the 4×4 blocks, namely, the first two nucleotides are common to the quartet. The hexamerous multiplets each occupy one and a half blocks. Ile and the terminators are both triplets, but their codons are arranged differently in the code table. These rules hold even for deviant codes. How can these rules be explained? Our theory is based on the following assumptions. The mutation of a code word (codon) causes wrong coding for an amino acid or terminator. It is lethal and will be eliminated by evolution (selection). Regardless of the complexity of the mutation and translation mechanisms and of possible alterations in the tRNAs, what concerns us is only the code relation between code words and encoded amino acids. The prevalent code is a product of long-term evolution. The high universality of the code and of its degeneracy rule over a wide range of organisms indicates its selective non-lethality. That is, as compared with other ideal codes, the real code is the most advantageous due to selection. Mathematically, for each ideal codon multiplet, one can define a mutational deterioration (MD) function that represents the mutational frequency of the multiplet and the deterioration caused by the mutation. The degeneracy rule of the code can be deduced from the minimization of the MD function.
The MDs are classified into three categories and parameterized as follows: the non-synonymous transitional MD (MD caused by a transitional mutation, U ↔ C or G ↔ A, between non-synonymous codons), denoted by $u$; the non-synonymous transversional MD (MD caused by a transversional mutation, U ↔ A, U ↔ G, C ↔ G or C ↔ A, between non-synonymous codons), denoted by $v$; and the wobble MDs $w_u$ and $w_v$, which describe the additional effect of a third-letter mutation in a sense codon [Crick, 1966]. That is, we set

$$u_1 = u_2 = u, \quad v_1 = v_2 = v, \quad u_3 = u + w_u, \quad v_3 = v + w_v \qquad (u, v, w_u, w_v > 0)$$

for sense codons (the subscript 1, 2, or 3 means the position in a codon). For nonsense codons (terminators) $w_u = w_v = 0$ should be taken. Here we emphasize that only the non-synonymous transitional and transversional mutations are considered, since a synonymous mutation has no lethal effect. The MD function for an ideal multiplet is equal to the sum of the mutational deterioration of all single-base mutations for codons belonging to the multiplet. The double- and triple-base mutations are neglected because their frequencies are much smaller than those of single-base mutations. From these assumptions we can deduce all degeneracy rules in the genetic code [Luo, 1989; Luo, 2000; Luo et al, 2002b].

Two codons that cannot be related to each other through a single-base mutation (for example, UGC and CAU) are called non-neighboring. Conversely, two codons related to each other through a single-base mutation are called neighboring. If they are related by a transitional mutation between first bases, the two codons are called T1-neighboring. If they are related by a transversional mutation between first bases, the two codons are called V1-neighboring. Likewise, one can define T2-, T3-, V2-, and V3-neighboring of two codons.

For example, consider an MD comparison for all possible ideal degenerate doublets. For an ideal codon arrangement it is easy to deduce the MD function as the sum of contributions from all possible single-base mutations in the doublet. The results are

Non-neighboring: $MD = c_2 = 2c = 6u + 12v + 2w_u + 4w_v$ (1.1)
T1 or T2 neighboring: $MD = c_2 - 2u$ (1.2)
V1 or V2 neighboring: $MD = c_2 - 2v$ (1.3)
T3 neighboring: $MD = c_2 - 2u - 2w_u$ (1.4)
V3 neighboring: $MD = c_2 - 2v - 2w_v$ (1.5)

As $u > v$ and $w_u > w_v > 0$, Eq. (1.4) yields the smallest value. The relations $u > v$ and $w_u > w_v > 0$ mean that the rate of transitional mutation is larger than that of transversional mutation and that the rate of mutation at the third position of a codon is larger than at the other positions. Under these conditions, the minimization of MD leads to the degenerate doublet taking the form of T3 neighboring. In fact, the nine amino acids with degenerate doublets in the standard code table all take this form of codon arrangement. The physics behind the above deduction is that the deterioration of a nucleotide mutation comes from the amino acid substitution, and an amino acid substitution in an organism generally leads to an amount of selective death. A synonymous mutation, however, has no lethal effect. So, to reduce the mutational deterioration as much as possible, the nucleotides in a multiplet should be arranged so that a large portion of base mutations, especially transitional mutations in the third codon position, are synonymous mutations within the multiplet.

The above approach can be generalized to other degenerate multiplets. For a multiplet with degenerate degree $k$, there are $k(k-1)/2$ ways of pairing codons. The connection of each pair may be T1 (or T2), T3, V1 (or V2), V3 or non-neighboring. Suppose that each codon arrangement is called a graph. There are generally $5^{k(k-1)/2}$ graphs for the multiplet with degenerate degree $k$. For a given graph, suppose there are $n$ connections being T1 (or T2) neighboring, $n'$ connections being T3 neighboring, $m$ connections being V1 (or V2) neighboring, $m'$ connections being V3 neighboring, and the others non-neighboring. The MD for this graph is

$$MD(k) = c_k - n(2u) - n'(2u + 2w_u) - m(2v) - m'(2v + 2w_v), \qquad c_k = kc = k(3u + 6v + w_u + 2w_v) \qquad (1.6)$$

There are 125 graphs for a degenerate triplet.
Many of them are forbidden due to inconsistent connections. The parameters $n$, $n'$, $m$, and $m'$ for allowable graphs are listed in Table 1-1. The first line of the table gives the name k.p.l for each graph ($k$ = degenerate degree, here $k = 3$; $p = n + n' + m + m'$; $l$ numbers the graphs for given $k$ and $p$); the second line gives an example for the graph. From Table 1-1 we find that the graph 3.3.1 has minimal MD when $u > v$, $w_u > w_v > 0$, $w_v > u - v$. The corresponding minimal MD level is

$$MD = c_3 - 2u - 4v - 2w_u - 4w_v \qquad (1.7)$$

So the degenerate triplet of codons should take this arrangement with MD (1.7). In fact, for Ile, the three codons are distributed in this manner in the code table. However, for the three terminators, though the MD for an ideal codon arrangement is still expressed by (1.6) (with $k = 3$), the relations $w_u = w_v = 0$ should be taken into account. For the graph corresponding to the third column of Table 1-1 (graph 3.2.1), one has

$$MD(TC) = 5u + 18v \qquad (1.8)$$

With $w_u = w_v = 0$ the competing graphs 3.3.1 and 3.3.2 both give $7u + 14v$, so (1.8) is the minimum when $u > 2v$. So the terminators should be arranged in this form, which is different from that of the amino acid Ile.

For the quartet there are $5^6$ graphs. The parameters $n$, $n'$, $m$, and $m'$ for allowable graphs are listed in Table 1-2. To save space, only graphs with $p > 3$ are listed. From Table 1-2 we find that the graph 4.6.1 has minimal MD when $u > v$, $w_u > w_v > 0$, $2w_v > u - v$. Its MD level is

$$MD = c_4 - 4u - 8v - 4w_u - 8w_v \qquad (1.9)$$

So the degenerate quartet should take this arrangement with MD given by (1.9). In fact, the five amino acids with degenerate quartets in the standard code table all take this arrangement of codons.
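The MD of an ideal multiplet can also be checked by direct enumeration rather than through the closed formulas. The Python sketch below sums the deterioration of every single-base mutation that leads out of the multiplet. The function names and the numerical values u = 3, v = 1, w_u = 12, w_v = 4 are illustrative assumptions, chosen only so that u > 2v, w_v > u - v and w_u - w_v > u + v hold; they are not taken from the text.

```python
BASES = "UCAG"
PURINES = {"A", "G"}

def mutation_cost(pos, a, b, u, v, wu, wv):
    """Deterioration of one non-synonymous mutation a -> b at codon position pos (0, 1, 2)."""
    transition = (a in PURINES) == (b in PURINES)
    base = u if transition else v
    # The wobble term is added only for third-position mutations.
    wobble = (wu if transition else wv) if pos == 2 else 0
    return base + wobble

def md(codons, u, v, wu, wv):
    """MD of a multiplet: sum over all single-base mutations leading outside the multiplet."""
    members = set(codons)
    total = 0
    for codon in members:
        for pos in range(3):
            for b in BASES:
                mutant = codon[:pos] + b + codon[pos + 1:]
                if b != codon[pos] and mutant not in members:
                    total += mutation_cost(pos, codon[pos], b, u, v, wu, wv)
    return total

# Illustrative parameter values (an assumption, not taken from the text).
P = dict(u=3, v=1, wu=12, wv=4)

doublets = {
    "non-neighboring": ["UGC", "CAU"],
    "T1": ["UUU", "CUU"],
    "V1": ["UUU", "AUU"],
    "T3": ["UUU", "UUC"],
    "V3": ["UUU", "UUA"],
}
for name, pair in doublets.items():
    print(name, md(pair, **P))

# Ile-like triplet (graph 3.3.1) and the terminators (graph 3.2.1, wobble terms set to zero).
print("Ile", md(["AUU", "AUC", "AUA"], **P))
print("Ter", md(["UAA", "UAG", "UGA"], u=3, v=1, wu=0, wv=0))
```

With these assumed values the T3-neighboring doublet gives the smallest doublet MD, and the triplet results agree with $7u + 14v + w_u + 2w_v$ for Ile and $5u + 18v$ for the terminators.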

For hexamerous multiplets there are $5^{15}$ graphs. The parameters $n$, $n'$, $m$, and $m'$ for allowable graphs are listed in Table 1-3. To save space, only graphs with $p > 6$ are listed. The graph of a hexamerous multiplet can be built from some quartet as nucleus. Evidently, the nucleus for a graph is not unique. To give an intuitive picture, a possible nucleus for each graph is shown in the second line of Table 1-3. From Table 1-3 we find that graph 6.9.1 has the minimal MD level, graph 6.9.2 the first excited level, graph 6.7.1 the second excited level and graph 6.9.3 the third excited level when $u > v$, $w_u > w_v > 0$, $w_v > u - v$, $w_u - w_v > u + v$ are assumed. The three lowest levels correspond to Leu, Arg, and Ser, respectively. To summarize, under the assumption

$$u > v, \quad w_v > u - v, \quad w_u - w_v > u + v \qquad (1.10)$$

one can deduce all degeneracy rules in the genetic code. The condition (1.10) can easily be understood, since it expresses the difference between transitional and transversional mutations and the importance of the wobble mutation. From experimental data on single-base mutations in pseudogenes, one finds that the rate of transitional mutation is larger than that of transversional mutation by a factor of 2 to 3. Likewise, from the comparison of rates of synonymous and non-synonymous substitution, one finds that the mutational rate at the third codon position is larger than at the first two positions by a factor of 4 to 8 [Li, 1997]. Eq (1.10) is consistent with these data. Thus, we have succeeded in deducing the codon arrangements for all amino acid and terminator multiplets with different degenerate degrees from a unified point of view. Taking the experimental data on base mutation into account and assuming $w_u / w_v = u / v$, we shall choose

$$u = \ldots\, v, \quad w_u = \ldots\, v, \quad w_v = \ldots\, v \qquad (1.11)$$

in the following calculation, which is in accordance with Eq. (1.10).
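Because MD is linear in the four parameters, the levels of the three hexamerous multiplets can be compared without fixing numbers. The sketch below counts, for each multiplet, how many outgoing single-base mutations contribute to each parameter. The function names are hypothetical, and the numerical values passed to the condition check are an illustrative assumption rather than the choice (1.11).

```python
BASES = "UCAG"
PURINES = {"A", "G"}

def md_coefficients(codons):
    """Return (a, b, c, d) such that MD = a*u + b*v + c*w_u + d*w_v for a sense multiplet."""
    members = set(codons)
    a = b = c = d = 0
    for codon in members:
        for pos in range(3):
            for base in BASES:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if base == codon[pos] or mutant in members:
                    continue  # no mutation at all, or a synonymous one
                transition = (codon[pos] in PURINES) == (base in PURINES)
                third = 1 if pos == 2 else 0
                if transition:
                    a += 1
                    c += third
                else:
                    b += 1
                    d += third
    return a, b, c, d

def satisfies_1_10(u, v, wu, wv):
    """Conditions assumed in the text for the degeneracy rules."""
    return u > v and wu > wv > 0 and wv > u - v and wu - wv > u + v

HEXAMERS = {
    "Leu": ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"],
    "Arg": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "Ser": ["UCU", "UCC", "UCA", "UCG", "AGU", "AGC"],
}
for name, codons in HEXAMERS.items():
    print(name, md_coefficients(codons))
```

The counts correspond to $8u + 28v + 4w_v$ for Leu, $12u + 24v + 4w_v$ for Arg and $12u + 28v + 4w_v$ for Ser, so Leu lies below Arg whenever $u > v$, and Arg lies below Ser for any positive $v$.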
Let $m(i)$ denote the minimum MD of the multiplet with degenerate degree $i$, and let $\Delta(i)$ denote the difference (gap) between the minimum MD (ground state) and the first higher MD (first excited state, corresponding to some ideal codon arrangement). The calculation results are summarized as follows:

$m(1) = 3u + 6v + w_u + 2w_v$
$m(2) = 4u + 12v + 4w_v$, $\Delta(2) = 2(w_u - w_v + u - v)$
$m(3) = 7u + 14v + w_u + 2w_v$
$m(4) = 8u + 16v$
$m(6) \equiv m(6, Leu) = 8u + 28v + 4w_v$
$m(6, Arg) = 12u + 24v + 4w_v$
$m(6, Ser) = 12u + 28v + 4w_v$
$\Delta(6) = 2(u - v) + 2(w_u - w_v)$ (gap between graph 6.9.3 and 6.9.1)
$m(7) = 9u + 30v + w_u + 2w_v$
$m(8) = 8u + 32v$ (1.12)

$m(1, Ter) = 3u + 6v$
$m(2, Ter) = 4u + 12v$, $\Delta(2, Ter) = 2(u - v)$
$m(3, Ter) = 5u + 18v$, $\Delta(3, Ter) = 2(u - 2v)$ (1.13)

Simultaneously, we have the following ratios of gap to minimum MD, $\Delta / m$:

32% 32% 86% 50% 19% 8% 15%

Note: $m(6) \equiv m(6, Leu)$, $m(6, Arg)$ and $m(6, Ser)$ are the MDs for Leu, Arg and Ser, respectively; $\Delta(6)$ is the gap between graph 6.9.3 and the ground state graph 6.9.1.

So, as seen from the table, $\Delta(i)$ is generally not a small quantity as compared with $m(i)$. The only exceptions are $\Delta(6, Arg)$ and $\Delta(6, Ser)$. This shows that, with the exception of hexamerous degenerate codons, the other MD ground states (states with minimum MD) are all relatively stable under selection. It seems difficult to attain MD-excited states through statistical fluctuation for these multiplets. However, for hexamerous degenerate multiplets, the MD gap between Arg (or Ser) and Leu is smaller. It may give a clue to understanding why hexamerous degenerate codons have taken the arrangements of Arg and Ser in the code table, which are different from the ground state Leu.

Table 1-1 Parameters n, n', m, and m' for allowable graphs of degenerate triplets

graph    3.3.1        3.3.2        3.2.1        3.2.2        3.2.3        3.2.4        3.1.1        3.1.2        3.1.3        3.1.4        3.0.1
example  AUU AUC AUA  AUU GUU CUU  AUU AUC GUU  AUU AUC CUU  AUU AUA GUU  AUU AUA CUU  AUU AUC GCU  AUU AUA GCU  AUU GUU GCA  AUU CUU GCA  AUU CCU GCA
n        0            1            1            0            1            0            0            0            1            0            0
n'       1            0            1            1            0            0            1            0            0            0            0
m        0            2            0            1            0            1            0            0            0            1            0
m'       2            0            0            0            1            1            0            1            0            0            0
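The graph parameters in Table 1-1 follow mechanically from the codons of each example. The sketch below classifies every pair of codons in a multiplet and evaluates Eq. (1.6) from the resulting counts. The function names are hypothetical and the numerical parameters used in the demonstration are an illustrative assumption.

```python
from itertools import combinations

PURINES = {"A", "G"}

def graph_parameters(codons):
    """Count (n, n', m, m') = (T1/T2, T3, V1/V2, V3) connections in a codon multiplet."""
    n = n3 = m = m3 = 0
    for x, y in combinations(codons, 2):
        diff = [k for k in range(3) if x[k] != y[k]]
        if len(diff) != 1:
            continue  # non-neighboring pair
        k = diff[0]
        transition = (x[k] in PURINES) == (y[k] in PURINES)
        if k == 2:
            n3 += 1 if transition else 0
            m3 += 0 if transition else 1
        else:
            n += 1 if transition else 0
            m += 0 if transition else 1
    return n, n3, m, m3

def md_from_graph(k, n, n3, m, m3, u, v, wu, wv):
    """Eq. (1.6): MD(k) = c_k - n(2u) - n'(2u + 2w_u) - m(2v) - m'(2v + 2w_v)."""
    c_k = k * (3 * u + 6 * v + wu + 2 * wv)
    return c_k - n * 2 * u - n3 * (2 * u + 2 * wu) - m * 2 * v - m3 * (2 * v + 2 * wv)

# Two columns of Table 1-1, evaluated with assumed parameter values.
for example in (["AUU", "AUC", "AUA"], ["AUU", "AUC", "GUU"]):
    params = graph_parameters(example)
    print(example, params, md_from_graph(3, *params, u=3, v=1, wu=12, wv=4))
```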

Table 1-2 Parameters n, n', m, and m' for allowable graphs (p > 3) of degenerate quartets

graph    4.6.1            4.6.2            4.4.1            4.4.2            4.4.3            4.4.4            4.4.5            4.4.6            4.4.7            4.4.8
example  GUU GUC GUA GUG  GUU AUU CUU UUU  GUU GUC AUU AUC  GUU GUC CUU CUC  GUU GUA AUU AUA  GUU GUA CUU CUA  GUU GUC GUA AUU  GUU GUC GUA CUU  GUU AUU CUU GUC  GUU AUU CUU GUA
n        0                2                2                0                2                0                1                0                1                1
n'       2                0                2                2                0                0                1                1                1                0
m        0                4                0                2                0                2                0                1                2                2
m'       4                0                0                0                2                2                2                2                0                1
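The ordering of the quartet graphs can be read off by applying Eq. (1.6) to each column of Table 1-2. The following sketch does this for one set of parameter values; the values and the helper names are illustrative assumptions, and a different admissible choice may reorder the higher levels.

```python
# Columns of Table 1-2: graph name -> (n, n', m, m').
QUARTET_GRAPHS = {
    "4.6.1": (0, 2, 0, 4), "4.6.2": (2, 0, 4, 0),
    "4.4.1": (2, 2, 0, 0), "4.4.2": (0, 2, 2, 0),
    "4.4.3": (2, 0, 0, 2), "4.4.4": (0, 0, 2, 2),
    "4.4.5": (1, 1, 0, 2), "4.4.6": (0, 1, 1, 2),
    "4.4.7": (1, 1, 2, 0), "4.4.8": (1, 0, 2, 1),
}

def md_quartet(n, n3, m, m3, u, v, wu, wv):
    """Eq. (1.6) with k = 4."""
    c4 = 4 * (3 * u + 6 * v + wu + 2 * wv)
    return c4 - n * 2 * u - n3 * (2 * u + 2 * wu) - m * 2 * v - m3 * (2 * v + 2 * wv)

def ranked(u, v, wu, wv):
    """Graphs of Table 1-2 sorted by increasing MD."""
    return sorted((md_quartet(*g, u, v, wu, wv), name) for name, g in QUARTET_GRAPHS.items())

# Assumed parameter values satisfying u > v, w_u > w_v > 0 and 2 w_v > u - v.
for value, name in ranked(u=3, v=1, wu=12, wv=4):
    print(name, value)
```

For graph 4.6.1 the result equals $8u + 16v$, since all wobble contributions cancel inside a complete block.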

Table 1-3 Parameters n, n', m, and m' for allowable graphs (p > 6) of hexamerous multiplets

graph    6.9.1   6.9.2   6.9.3   6.9.4   6.9.5   6.9.6   6.9.7   6.9.8   6.9.9   6.9.10  6.9.11  6.9.12  6.9.13  6.9.14
nucleus  4.6.1   4.6.1   4.6.1   4.6.1   4.6.2   4.6.2   4.6.2   4.6.2   4.4.1   4.4.1   4.4.2   4.4.3   4.6.1   4.6.2
n        2       0       2       0       3       2       3       2       3       2       0       2       1       2
n'       3       3       2       2       2       2       0       0       2       3       2       0       2       1
m        0       2       0       2       4       5       4       5       0       4       3       4       2       4
m'       4       4       5       5       0       0       2       2       4       0       4       3       4       2

Table 1-3 Parameters n, n', m, and m' for allowable graphs (p > 6) of hexamerous multiplets (continued)

graph    6.8.1   6.8.2   6.8.3   6.8.4   6.8.5   6.8.6   6.8.7   6.8.8   6.8.9   6.8.10
nucleus  4.6.1   4.6.1   4.6.2   4.6.2   4.4.1   4.4.2   4.4.3   4.4.4   4.6.1   4.4.1
n        2       2       1       2       1       2       2       1       0       2
n'       2       0       2       2       1       1       2       2       2       1
m        1       4       2       3       2       3       0       4       2       4
m'       4       4       1       2       2       2       3       3       4       0

graph    6.7.1   6.7.2   6.7.3   6.7.4   6.7.5   6.7.6   6.7.7   6.7.8   6.7.9   6.7.10
nucleus  4.6.1   4.6.1   4.6.1   4.6.1   4.6.2   4.6.2   4.6.2   4.6.2   4.4.1   4.4.1
n        0       0       1       0       2       2       3       2       2       2
n'       3       2       2       2       1       0       0       0       2       2
m        0       0       0       1       4       4       4       5       1       2
m'       4       5       4       4       0       1       0       0       2       1

graph    6.7.11  6.7.12  6.7.13  6.7.14  6.7.15  6.7.16  6.7.17  6.7.18  6.7.19  6.7.20
nucleus  4.4.2   4.4.2   4.4.2   4.4.3   4.4.3   4.4.4   4.4.4   4.4.4   4.4.4   4.4.3
n        2       0       1       1       1       1       1       2       2       0
n'       …
m        …
m'       …

In conclusion, we have demonstrated that the redundancy distribution in the genetic code is determined by the mutational parameters, that is, the relative rates between transitional and transversional mutations and between substitutions at codon positions 1-2 and at the 3rd codon position, and that the distribution is robust relative to the parameter choice as long as Eqs (1.10) are fulfilled.

The averaged mutational deterioration

Take the MD of the codons corresponding to a given amino acid and divide it by the multiplicity (degeneracy degree). We call this quantity the averaged mutational deterioration (AMD) of the amino acid. We have:

Trp, Met (singlet): $AMD = 3u + 6v + w_u + 2w_v$ (1.14)
Cys, Tyr, His, Phe, Gln, Lys, Asn, Asp, Glu (doublet): $AMD = 2u + 6v + 2w_v$ (1.15)
Ile (triplet): $AMD = \frac{1}{3}(7u + 14v + w_u + 2w_v)$ (1.16)
Ser (hexamerous): $AMD = 2u + \frac{14}{3}v + \frac{2}{3}w_v$ (1.17)
Arg (hexamerous): $AMD = 2u + 4v + \frac{2}{3}w_v$ (1.18)
Leu (hexamerous): $AMD = \frac{4}{3}u + \frac{14}{3}v + \frac{2}{3}w_v$ (1.19)
Pro, Thr, Gly, Val, Ala (quartet): $AMD = 2u + 4v$ (1.20)

Their differences are:

(1.14) - (1.15) = $u + w_u$
(1.15) - (1.16) = $\frac{1}{3}(4v - u + 4w_v - w_u)$
(1.16) - (1.17) = $\frac{1}{3}(u + w_u)$
(1.17) - (1.18) = $\frac{2}{3}v$
(1.18) - (1.19) = $\frac{2}{3}(u - v)$
(1.19) - (1.20) = $\frac{2}{3}(v - u + w_v)$

As

$$u > v, \quad w_v > u - v, \quad w_v + v > \tfrac{1}{4}(w_u + u) \qquad (1.21)$$

these differences are all positive, namely, the equations from Eq (1.14) to Eq (1.20) are indeed arranged in the order of decreasing AMD. In fact, from Eq. (1.10) and

$$u < 4v, \quad w_u < 4w_v \qquad (1.22)$$

we obtain (1.21). The parameter choice (1.11) satisfies these constraints.

On the other hand, using the sequence data of hemoglobins, a table of mutual replaceabilities of amino acids can be obtained. On the basis of this table the degree of irreplaceability of amino acid residues was established [Vokenstein, 1982]. The results are

Table 1-4 The relative irreplaceability of amino acid residues

Trp Met Cys Tyr Phe Gln Lys Asn Asp Glu Ile Ser Leu Thr Gly Val Ala
0.76 0.65 0.64 Pro 0.61 Arg 0.60 0.58 0.56 0.56 0.54 0.52

By comparing the irreplaceability and the AMD we find that they agree well with each other (the only exception is Pro). The agreement means that the rarer the substitution by other residues, the higher the value of AMD. This is not surprising, because a higher mutational deterioration of some amino acid means a larger average distance to other amino acids and, in turn, a larger irreplaceability.

Adaptive minimization of global mutational deterioration and the accuracy of the genetic code

General formalism of global mutational deterioration and hydrophilic-hydrophobic domain in the genetic code

We have derived the minimum arrangement of codons in each degenerate multiplet. This is the local minimum. Then, what is the global minimum of mutational deterioration for the code table as a whole? Is the mutational deterioration of the prevalent code globally minimized? This is a problem of the distribution of the twenty kinds of amino acids on the code table. To have a clear understanding we shall first investigate the block distribution of amino acids in the code, namely, the hydrophobic-hydrophilic domain-like distribution.

There are several methods to obtain an experimental scale of hydrophobicity. Usually they can be classified into two categories. The first is to measure the solubility difference between water and some apolar solvent. The second is to measure the tendency of an amino acid residue to be sequestered inside the folded molecule. From the theoretical standpoint, the hydrophobicity is related to all kinds of quantum interactions between solute and solvent as well as to the entropy factor. But, in the final analysis, it is determined by the chemical structure of the residue. We suggest that the hydrophobicity of an amino acid is determined by the type of atoms at the end of the side-chain. If the atom at the end of the side-chain is NH or OH, then the amino acid is hydrophilic. If the atom at the end of the side-chain is CH or SH, then the amino acid is hydrophobic. If the end is a ring, then it is hydrophilic when there exists NH or OH in the ring, and hydrophobic otherwise. The case of Gly is more complex. The recognition site of the anticodon is NH at the end of the peptide [Davydov, 1989]. Thus, we obtain the hydrophobicity scale as follows:

Hydrophilic: Arg, Lys, Asp, Glu (charged); Asn, Gln, His (strong polar); Tyr, Ser, Thr (polar); Gly;
Hydrophobic: Ile, Val, Leu, Phe (strong hydrophobic); Met, Ala, Trp, Cys (hydrophobic); Pro.

Sometimes, cysteine is classified as an independent subclass, since this residue has some special properties, for instance its ability to form disulfide bridges, which plays an important role in protein folding. The above hydrophilic-hydrophobic classification of the twenty amino acids is consistent with other works apart from small differences. In Kyte and Doolittle’s scale [Kyte & Doolittle, 1982] and Eisenberg’s scale [Eisenberg & McLachlan, 1986] Gly is hydrophobic and Pro is hydrophilic. However, from the consideration of free energy difference, Pro has high hydrophobicity while Gly has strong hydrophilicity [Nozaki & Tanford, 1971], which is consistent with our classification based on the chemical structure.

According to the above classification we can divide Figure 1 of the genetic code into two regions. The amino acids inside the solid line are hydrophilic, those outside it hydrophobic. The case of the dinucleotide UC (framed by a dotted line) should be considered carefully, since serine is hydrophilic but the 3'-dinucleotide in the anticodon corresponding to UC is hydrophobic [Wong, 1988]. We named the above-mentioned distribution of amino acids the hydrophobic (H)-hydrophilic (P) domain [Luo, 1989]. Although the measure of hydrophobicity is not unique in biology and the amino acids with medium hydrophobicity can change their positions in the hydrophobic order, one can always divide the amino acids into a hydrophobic domain and a hydrophilic domain on the code table. If the conventional base order UCAG is changed to UCGA on the code table, then the hydrophobic-hydrophilic domain can be displayed in a more symmetric fashion, as seen in the figure [Luo, 1992]. The meaning of the base order UCGA will be discussed in section 4.

How can the hydrophobic-hydrophilic domain of the amino acid distribution in the code table be explained? We shall deduce it from the global minimization of the mutational deterioration of the genetic code. In the previous discussions on mutational deterioration only the difference between synonymous and non-synonymous mutations was taken into account; the more detailed differences of deterioration among amino acids in non-synonymous mutations were not considered. In fact, the selective death caused by amino acid replacement is an important factor of mutational deterioration which should be studied carefully. For example, if an amino acid substitution due to a base mutation changes the hydrophobicity drastically, then the mutation is explicitly lethal. On the contrary, if the amino acid substitution does not alter the hydrophobicity, then the lethal effect is small.
To take the difference in amino acid substitution into account we define the global MD (GMD) for an ideal code table $U$ as follows [Luo, 2000; Luo et al, 2002a; 2002b]:

$$Q(U) = \sum_{i,j} \sum_{\alpha \neq \beta} U_{i\alpha} U_{j\beta} f_{ij} D_{\alpha\beta} \qquad (2.1)$$

The GMD $Q(U)$ is a quantity that measures the accuracy of the table $U$. Let $[U]$ be a 64 × 21 matrix that represents an ideal code: $U_{i\alpha} = 1$ means that the $i$-th codon codes for the $\alpha$-th amino acid (or terminator); otherwise $U_{i\alpha} = 0$. In other words, the $U_{i\alpha}$ ($i$ = 1…64) describe the codon distribution of the $\alpha$-th amino acid (or terminator) in the code. So one has

$$\sum_i U_{i\alpha} = \text{degeneracy degree of amino acid } \alpha, \qquad \sum_\alpha U_{i\alpha} = 1 \qquad (2.2)$$

$f_{ij}$ denotes the mutational deterioration for codon $i$ mutated to codon $j$. It has been parameterized through $u$, $v$, $w_u$ and $w_v$ introduced in the previous section. Namely, if $i$ and $j$ are related by a non-synonymous single-base mutation, one has

$f_{ij} = u$ if $i$ and $j$ are T1- or T2-neighboring
$f_{ij} = v$ if $i$ and $j$ are V1- or V2-neighboring
$f_{ij} = u + w_u$ if $i$ and $j$ are T3-neighboring
$f_{ij} = v + w_v$ if $i$ and $j$ are V3-neighboring (2.3)

($f_{ij} = 0$ if $i$ and $j$ cannot be related by any single-base mutation). For the mutation between a terminator and an amino acid, $w_u = w_v = 0$ should be taken in Eq. (2.3). The GMD of a code table is the sum of MDs from all pairs of codons. As shown in Eq (2.1), the GMD depends on two factors: one is the mutational rate and the other is the selective force, represented by the distance $D_{\alpha\beta}$ between the amino acids of the initial and final states. If the distance between a pair of amino acids is large (the similarity of the two is small), then the corresponding mutational deterioration will be serious. On the contrary, a small distance for a pair of amino acids implies a weak deterioration in their replacement. So the mutational deterioration of a code depends not only on $f_{ij}$, but also on $D_{\alpha\beta}$, the distance between amino acids $\alpha$ and $\beta$. The common approach to defining the amino acid distance is based on evolutionary data (PAM matrix data). There have been many new developments and applications in recent years (for example, see Wyckoff et al, 2000). However, the evolutionary approach is purely empirical and has been criticized as tautologous in its application to the study of the origin of the genetic code [Di Giulio, 2001].
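Eqs. (2.1) and (2.3) translate directly into a double sum over codon pairs. The following Python sketch builds the standard code table, implements $f_{ij}$ and evaluates $Q(U)$ for a distance function supplied by the caller. The function names, the placeholder distance (1 between any two different amino acids, 10 to a terminator) and the numerical mutation parameters are assumptions made only for illustration; they are not the values used in the text.

```python
BASES = "UCAG"
PURINES = {"A", "G"}
# Standard genetic code listed in UCAG order; '*' marks a terminator.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
STANDARD = dict(zip(CODONS, AMINO))

def f(i, j, code, u, v, wu, wv):
    """Eq. (2.3): deterioration rate for codon i mutating to codon j (0 if not neighbors)."""
    diff = [k for k in range(3) if i[k] != j[k]]
    if len(diff) != 1:
        return 0
    k = diff[0]
    transition = (i[k] in PURINES) == (j[k] in PURINES)
    rate = u if transition else v
    # The wobble term applies to third-position mutations between sense codons only.
    if k == 2 and "*" not in (code[i], code[j]):
        rate += wu if transition else wv
    return rate

def gmd(code, dist, u, v, wu, wv):
    """Eq. (2.1): sum of f_ij * D over ordered codon pairs with different meanings."""
    return sum(
        f(i, j, code, u, v, wu, wv) * dist(code[i], code[j])
        for i in CODONS for j in CODONS
        if code[i] != code[j]
    )

def toy_distance(a, b, d_ter=10):
    """Placeholder distance (an assumption): 1 between amino acids, d_ter to a terminator."""
    return d_ter if "*" in (a, b) else 1

print(gmd(STANDARD, toy_distance, u=3, v=1, wu=12, wv=4))
```

Replacing `toy_distance` by a lookup into a physico-chemical distance table gives the GMD for that choice of $D_{\alpha\beta}$.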
From the standpoint of basic research we prefer using the difference of physico-chemical properties between a pair of amino acids to define their distance. Following Grantham [Grantham, 1974] we define the physico-chemical distance between amino acids $\alpha$ and $\beta$ as

$$D_{\alpha\beta} = \left[ c_1 (c_\alpha - c_\beta)^2 + c_2 (p_\alpha - p_\beta)^2 + c_3 (v_\alpha - v_\beta)^2 \right]^{1/2}$$

(here $c$ = composition, $p$ = polarity and $v$ = molecular volume; $c_1$, $c_2$, $c_3$ are the weighting constants of Grantham's formula). Evidently, it leads to $D_{\alpha\beta} = 0$ for $\alpha = \beta$. On the other hand, we assume $D_{\alpha,ter} = D_{ter,\alpha}$ to be a large enough number (ter means terminators), because the similarity between any amino acid $\alpha$ and the terminators is very small.

The genetic code table (Fig 1) is constructed from 4 × 4 blocks $(m, n)$ ($m, n$ = 1…4, representing U, C, A, G respectively; $m$ refers to the first letter of a codon and $n$ to its second letter). The GMD $Q(U)$ has the following symmetries:
1) $Q$ remains invariant when the 4 × 4 table is transposed, namely, when the $(m, n)$ element is exchanged with the $(n, m)$ element.
2) $Q$ remains invariant when the 1st and 2nd rows are exchanged with the 3rd and 4th rows, or the 1st and 2nd columns are exchanged with the 3rd and 4th columns; $Q$ remains invariant when the 1st and 2nd rows are exchanged between themselves, or the 3rd and 4th rows (columns) are exchanged between themselves.
3) $Q$ remains invariant when the 1st and 2nd elements in each block are exchanged with the 3rd and 4th elements; $Q$ remains invariant when the 1st and 2nd elements, or the 3rd and 4th elements, in each block are exchanged between themselves.
In accordance with the above symmetries the ideal code tables can be classified into many representations. A representation can be identified through fixation of the terminators and some amino acid on given sites of the table. Two different representations are connected by a symmetry operation. Evidently, it is enough to investigate the GMD spectrum in one particular representation.

Table 2-1 Amino acid distance $D_{\alpha\beta}$

     Tyr His Gln Arg Thr Asn Lys Asp Glu Gly Phe Leu Ala Ser Pro Ile Met Val Cys Trp
Tyr    0  83  99  77  92 143  85 160 122 147  22  36 112 144 110  33  36  55 194  37
His   83   0  24  29  47  68  32  81  40  98 100  99  86  89  77  94  87  84 174 115
Gln   99  24   0  43  42  46  53  61  29  87 116 113  91  68  76 109 101  96 154   …
Arg   77  29  43   0  71  86  26  96  54 125  97 102 112 110 103  97  91  96 180 102
Thr   92  47  42  71   0  65  78  85  65  59 103  92  58  58  38  89  81  69 149 128
Asn  143  68  46  86  65   0  94  23  42  80 158 153 111  46  91 149 142 133 139 174
Lys   85  32  53  26  78  94   0 101  56 127 102 107 106 121 103 102  95  97 202 110
Asp  160  81  61  96  85  23 101   0  45  94 177 172 126  65 108 168 160 152 154 181
Glu  122  40  29  54  65  42  56  45   0  98 140 138 107  80  93 134 126 121 170 152
Gly  147  98  87 125  59  80 127  94  98   0 153 138  60  56  42 135 127 109 159 184
Phe   22 100 116  97 103 158 102 177 140 153   0  22 113 155 114  21  28  50 205  40
Leu   36  99 113 102  92 153 107 172 138 138  22   0  96 145  98   5  15  32 198  61
Ala  112  86  91 112  58 111 106 126 107  60 113  96   0  99  27  94  84  64 195 148
Ser  144  89  68 110  58  46 121  65  80  56 155 145  99   0  74 142 135 124 112 177
Pro  110  77  76 103  38  91 103 108  93  42 114  98  27  74   0  95  87  68 169 147
Ile   33  94 109  97  89 149 102 168 134 135  21   5  94 142  95   0  10  29 198  61
Met   36  87 101  91  81 142  95 160 126 127  28  15  84 135  87  10   0  21 196  67
Val   55  84  96  96  69 133  97 152 121 109  50  32  64 124  68  29  21   0 192  88
Cys  194 174 154 180 149 139 202 154 170 159 205 198 195 112 169 198 196 192   0   …
Trp   37 115   … 102 128 174 110 181 152 184  40  61 148 177 147  61  67  88   …   0

(After Grantham, 1974)

Now we shall discuss the minimization of $Q(U)$.

The largest terms in Eq (2.1) are those with $(\alpha, \beta)$ = (terminator, amino acid) or (amino acid, terminator). By splitting out the leading terms one obtains

$$Q(U) = q_{ter}(U) + Q'(U) \qquad (2.4)$$

$$q_{ter}(U) = 2 D_{ter} \sum_{i,j} \sum_{\beta \neq ter} U_{i,ter} U_{j\beta} f_{ij}, \qquad Q'(U) = \sum_{i,j}^{61} \sum_{\alpha \neq \beta}^{20} U_{i\alpha} U_{j\beta} f_{ij} D_{\alpha\beta} \qquad (2.5)$$

where $D_{ter} = D_{\alpha,ter} = D_{ter,\alpha}$ = const., and the factor 2 comes from the equal contributions of a terminator mutating to an amino acid and its reverse. $q_{ter}(U)$ is the leading term. The minimization of $q_{ter}(U)$ is just like the procedure used in deducing Eq (1.8). It leads to the minimal MD described by Eq (1.8) and the corresponding optimal $U_{i,ter}$ (codons 1 and 2 T3-neighboring; codons 1 and 3 T1- or T2-neighboring). Evidently, the solution for the optimal $U_{i,ter}$ is not unique. The arrangement UAA, UAG, and UGA of the three codons in the prevalent code is one of the minimal solutions. To remove the degeneracy, we shall discuss the minimization in a particular representation where the terminators have been fixed.

The next step is the minimization of $Q'(U)$ (Eq. (2.5)), assuming the terminators have been fixed. To deduce the hydrophilic-hydrophobic domain we shall investigate a simplified model. Suppose the amino acids are classified into several categories and neglect their differences within each category. In this approximation, one assumes

$D_{\alpha\beta} = \lambda$ if $\alpha$, $\beta$ are in different categories
$D_{\alpha\beta} = \lambda - \delta$ if $\alpha$, $\beta$ are in the same category but $\alpha \neq \beta$
$D_{\alpha\beta} = 0$ if $\alpha = \beta$ (2.6)

Denote the corresponding MD function as $Q'_{model}(U)$. We have

$$Q'_{model}(U) = \lambda \sum_{\alpha} \sum_{i \in \alpha} \sum_{j \notin \alpha} f_{ij} - \delta \sum_{\alpha \neq \alpha'} \sum_{i \in \alpha,\, j \in \alpha'} f_{ij} \qquad (2.7)$$

($\alpha \neq \alpha'$ denoting different amino acids in the same category). The minimization of the first term leads to the degeneracy rules for each amino acid multiplet, which have been discussed in section 1. But there are many different distributions of amino acids satisfying the same degeneracy rules. The minimization of the second term (the term proportional to $\delta$) selects some of the possible distributions satisfying the degeneracy rules. It will lead to the H-P domain. By inspection of the amino acid distance data, Table 2-1, we shall classify the 20 amino acids into three categories, the H-, P-, and C (cysteine)-class, from the consideration of distances. The distance between amino acids in different classes is obviously larger than that within the same class, as seen in Table 2-1. The three categories of amino acids occupy three domains: 7 and 1/4 H-blocks, 7 and 1/2 P-blocks and 1/2 C-block on the code table. The distribution of these H-blocks, P-blocks and C-block on the standard code table can be found in Fig 1. Any ideal assignment ($U$) of codons leads to a particular distribution of the three types of blocks and gives a $Q'_{model}(U)$. They are differentiated through the $\delta$ term. By comparing different block distributions we found that under conditions Eq (1.10) with the complement $w_v > 2u - v$, $Q'_{model}$ reaches its minimum for a distribution as in the standard code. There exist several minimal hydrophobic-hydrophilic distributions; the distribution in the standard code (Fig 1) is one of the minimal distributions of $Q'_{model}(U)$, i.e.
the minimization of GMD, can lead to the d understandable be hydrophilic and hy o that the mutatiopercentage of amino acid replacements since the mutation within a group contributes a smaller istances Table 2-1 will be used in the following calcu d through permutation of the rows of matr , there are 61! permutations so th egree has been given for each amino acid and the degeneracy rule has been satisfied for each . (1.10) and (2.8), have been assumed and the parameter choice Eq. (1.11) satisfies these constraints. For simplicity, the codons e amino acids are assumed to be arranged as t obeying their degeneracy rules respectively. The problem is then converted to the permutation of steps. Step

For several typical H-, P-, and C- block distributions the calculated results on δ term in Q ’ model ( U ) were given in literatures [Luo, 1989; Luo, 2000; Luo, 2004]. By comparison of the δ term in plement (2.8) imal δ term. The block distributionns. Therefore, the minimization of Q ’omain-like distribution of amino acids in the prevalent code. The result iscause the mutational deterioration of an ideal code is minimal only when thedrophobic amino acids are arranged in two separately-connected regions ingood order, s ns within a group (hydrophilic or hydrophobic) comprise a larger deterioration to the code. Note that the equation (2.8) is only a modification of the second equation of (1.10). These equations mean under a larger transitional-to-transversional ratio and a larger wobble-to-non-wobble ratio, not only the redundancy distribution but also the hydrophilic- hydrophobic distribution are robust in the code relative to the mutational parameter choice. Deducing the optimal code from GMD minimization

We have succeeded in deducing the hydrophobic–hydrophilic domain distribution of amino acids through a simplified model, namely, through the minimization of Q ’ model ( U ). Now we will search for the global minimum of mutational deterioration Q ’( U ), Eq.(2.5). The mutational parameters Eq.(1.11) and the amino acid dlation. The minimization of Q ’( U ) can be accomplisheix U since one permutation equivalent to one ideal code. Howeveris is computationally intractable. To simplify the calculation, we assume that the degeneracydmultiplet, since the constraints on mutational parameters, Eqsof hexamerous degenerat one quartet and one double20 amino acids in 4 × Q’(U) can be completed on a PC computer. Formally, the minimization can be done in the following . The triplet Ile should be grouped with a codon singlet in a block. The distance between Met and Ile is 10, much smaller than Trp and Ile (Trp-Ile distance 61, see Table 2-1). So Ile shares a block with initial codon Met. Step 2.

The codon singlet that shares a half-block with terminator UGA should be Trp, since another singlet Met has been grouped with Ile.

Step 3 . The terminators UAA and UAG should be grouped with a codon doublet in a block. The est candidate for the doublet is Cys, since Cys has large distance with all other amino yr. Step (Gln, His), and (Lys, Arg(2)), since the sum of above four distances takes the smallest value 141. cids are (Asp,Asn), (Glu, Ser(2)), (Gln, min acids.

Step 4 . The half-block Trp-ter(UGA) should be grouped with a codon doublet, too. The best candidate of the doublet is Tyr, since the amino acid which has the smallest distance with Trp is T . Phe should be grouped with Leu since their distance is small and Tyr has been groupedwith Trp (both the distance between Phe and Leu, and the distance between Phe and Tyr being 22, Table 2-1). Step 6 . The blocks ( m , n )=(1,3) and (1,4) have been fixed on account of step 2 to 4. The remaining 14 blocks are divided into 7 hydrophobic, namely (1,1),(1,2),(2,1), (2,2),(3,1),(4,1) and (4,2) (called H-blocks), and 7 hydrophilic, namely (2,3),(2,4),(3,2), (3,3),(3,4),(4,3) and (4,4) (called P-blocks). The 14 blocks code for 17 amino acids, in which there are 7 doublets, namely, Asp, Glu, Asn, Lys, Gln, His, and Phe, and 3 hexamerous multiplets ⎯ Leu, Arg and Ser. A hexamerous multiplet occupies one and a half blocks. The half-blocks are denoted by Leu(2), Arg(2) and Ser(2) respectively.

Step 7.

The case of two doublets A and B located in one block is called “doublet bundle”, denoted as (A,B). The most favorable combinations of 6 doublets (except Phe in 7 doublets which has been grouped with Leu) and Arg (2) and Ser (2) are (Asp, Glu), (Asn, Ser(2)), The next favorable combinations of the 8 amino aHis), and (Lys, Arg(2)). The sum of their distances is 153.
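The two sums quoted in Step 7 can be reproduced by brute force. The sketch below is not the author's program; it simply enumerates all 105 ways of pairing the eight doublet components and ranks them by the total Table 2-1 distance. The function names (`distance`, `pairings`, `ranked_bundles`) are illustrative, and Arg and Ser stand for the half-blocks Arg(2) and Ser(2).

```python
# Distances between the eight doublet components of Step 7, copied from Table 2-1.
DIST = {
    ("Asp", "Glu"): 45, ("Asp", "Asn"): 23, ("Asp", "Lys"): 101, ("Asp", "Gln"): 61,
    ("Asp", "His"): 81, ("Asp", "Arg"): 96, ("Asp", "Ser"): 65,
    ("Glu", "Asn"): 42, ("Glu", "Lys"): 56, ("Glu", "Gln"): 29, ("Glu", "His"): 40,
    ("Glu", "Arg"): 54, ("Glu", "Ser"): 80,
    ("Asn", "Lys"): 94, ("Asn", "Gln"): 46, ("Asn", "His"): 68, ("Asn", "Arg"): 86,
    ("Asn", "Ser"): 46,
    ("Lys", "Gln"): 53, ("Lys", "His"): 32, ("Lys", "Arg"): 26, ("Lys", "Ser"): 121,
    ("Gln", "His"): 24, ("Gln", "Arg"): 43, ("Gln", "Ser"): 68,
    ("His", "Arg"): 29, ("His", "Ser"): 89,
    ("Arg", "Ser"): 110,
}


def distance(a, b):
    """Symmetric lookup of the tabulated distance between a and b."""
    return DIST[(a, b)] if (a, b) in DIST else DIST[(b, a)]


def pairings(items):
    """Yield every way of splitting `items` into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def ranked_bundles(items):
    """Return (total distance, pairing) tuples sorted by total distance."""
    scored = [(sum(distance(a, b) for a, b in p), p) for p in pairings(items)]
    return sorted(scored, key=lambda entry: entry[0])


AMINO_ACIDS = ["Asp", "Glu", "Asn", "Lys", "Gln", "His", "Arg", "Ser"]
ranking = ranked_bundles(AMINO_ACIDS)
for total, bundles in ranking[:2]:
    print(total, bundles)
```

Only the pairwise distances enter this check; the mutational parameters of Eq (1.11) play no role at this stage.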

Step 8.

The hydrophobic amino acids Leu, Phe, Ile, Met and Val occupy 4 H-blocks. The hydrophilic amino acids Arg, Ser, Asp, Glu, His, Gln, Asn and Lys occupy 6 P- blocks (by use of the most favorable combinations indicated in step 7). The amino acids Gly, Thr, Pro and Ala with medium hydrophobicity can change their positions in hydrophobic order. One of the four (say, Thr) is chosen to be filled in a P-block and other three are chosen to be filled in H-blocks. Thus we have 7 H-blocks of hydrophobic amino acids and 7 P-blocks of hydrophilic amino acids. For each distribution of H blocks we permute 7 P blocks and search for the minimal distribution (distribution with minimal GMD). Then, under the fixed minimal distribution of P blocks we permute 7 H blocks and search for the new minimal distribution. Repeating the above steps, finally one obtains the self-consistent minimal solution.

Step 9 . Based on the result obtained in step 8, taking the possible change of hydrophobic order into account we further permute four amino acids with medium hydrophobicity – Gly, Thr, Pro and Ala and find the minimal GMD.

Step 10. To verify the above calculation, one can make some checks. The first check is: by use of the next favorable combinations of 8 amino acids, (Asp, Asn), (His, Gln), (Glu, Ser(2)) and (Lys, Arg(2)), indicated in step 7, instead of the most favorable combinations, we repeat steps 8 and 9 and compare the result with that obtained in step 9. The second check is: we take Cys, instead of Tyr, grouped with Trp-ter(UGA), and Tyr, instead of Cys, grouped with the terminators UAA and UAG. By the above procedures we can check whether the global minimum deduced in step 9 is true.

Through steps 1 to 10, setting v = 1 in Eq (1.11), we obtain the global minimum Q = 41722 and the corresponding minimal table shown in Figure 2 [Luo and Li, 2002a]. The minimal code has the following properties.

1) The GMD spectrum near the ground state (minimal code) has abundant structure. For example, an exchange between two amino acid doublets in a block in the vicinity of the ground state leads to a change of the Q value (global minimum Q') of about 10 to 30; an exchange between some quartets (namely, Gly, Ala, Thr and Pro) that have similar properties and are located in the lower-left site of the minimal table also causes a change of the Q value of about 10 to 30. These results are related to the robustness of the genetic code.

2) We find that many doublet bundles in the minimal code are the same as those in the standard code. However, the important differences are: the Cys / Trp-ter bundle and the Tyr / ter bundle in the standard table have been changed to the Tyr / Trp-ter bundle and the Cys / ter bundle in the minimal table; the Arg doublet / Ser doublet bundle and the Lys / Asn bundle in the standard table have been changed to the Ser / Asn and Arg / Lys bundles in the minimal table. These changes largely lower the Q value of GMD. The point can easily be estimated from the amino acid distance data $D_{\alpha\beta}$ (see Table 2-1).

3) The three hexamerous degenerate codons in the minimal table have all been arranged following the same degeneracy rule of the ground state, namely, T1 or T2 neighboring between their quartet and doublet components (graph 6.9.1 of Table 1-3).

4) Amino acids with similar hydrophobicity are arranged as near as possible in the minimal table. For example, the strongly hydrophobic amino acids are located in the upper-left sites of the table, the strongly hydrophilic amino acids in the lower-right sites of the table, and the medium hydrophobic-hydrophilic amino acids in the lower-left sites of the table. The result is consistent with the above analysis in the simplified model.

Deducing the standard genetic code

By use of the same parameters one calculates the GMD for the standard code and obtains $Q_{std}$ = 54940. On the other hand, if a matrix [U] is stochastic, namely, the amino acid distribution is assumed to be random and the degeneracy rules on synonymous codon arrangement have been broken to the utmost extent, one obtains the maximal Q value, $Q_{max}$, near $1.75 \times 10^5$. By use of $Q_{max} - Q_{min}$ as a measure of the maximum distance one finds the distance between the standard code and the minimal code (i.e. $Q_{std} - Q_{min}$) to be about 9.86% of the maximum.

Why does Nature select the standard code rather than the minimal code to encode amino acids? The point is not difficult to understand, since optimization (minimization) alone could not determine the structure of the prevalent genetic code. Not only the optimization (minimization) with respect to some parameters, but also the adaptive constraints in the early stage of evolution (the abundance of pre-synthesized amino acids, the precursor-product relations in biosynthetic pathways, etc.) should be taken into account. The coevolution theory suggests that early on in the genetic code only precursor amino acids were codified and later, as these precursors gave rise to new products, their codons underwent subdivision and some of the codons of each precursor were transferred to its product [Wong, 2005].

The optimization of GMD means the error minimization of the genetic code. The error minimization in the previous paragraph was done under the constraint of 20 amino acids with the same multiplicity distribution as in the standard code. In fact, the number of encoded amino acids and their degeneracy degrees change in evolution. If there is enough knowledge of the amino-acid chronology (including the historical variation of the degeneracies of these amino acids) then we are able to deduce a more realistic picture of the genetic code evolution through GMD minimization under the varying constraints. Trifonov (2004) indicated two important features of amino acid evolution: the amino acids synthesized in Miller experiments appeared first, and those associated with codon capture events (when all 64 triplets are already engaged and codons for a new amino acid have to be captured from the established codon repertoires) came last.

Due to the lack of knowledge of the amino acid degeneracy we propose a simplified model as follows. Assume that GMD is minimized under the same multiplicity distribution of 20 amino acids as in the prevalent standard code and introduce two additional constraints. The first constraint is the (Cys, Trp) bundle and the (Lys, Asn) bundle, which are related to the later evolutionary stage of codon capture events. Lys and Asn have a common precursor Asp, while Cys and Trp have a common precursor Ser [Wong, 1988]. The precursor amino acid may have been encoded by some codons. The codons of Cys and Trp may also have been borrowed from the original repertoire UGN for terminators. The second constraint relates to the early stage of amino acid evolution. We assume the initial fixation of Gly, Ala, Ser(4) (the quartet component of Ser) and Arg(4) (the quartet component of Arg) in the code, namely, Gly encoded by GG, Ala encoded by GC, Ser(4) encoded by UC and Arg(4) encoded by CG. The meaning of this assumption is: Gly, Ala and Ser were early amino acids [Wong, 1988; Trifonov, 2004]; Ser(4) fixed in the hydrophobic region of the table should be a frozen accident; Arg was possibly recruited earlier due to its ability to interact with and stabilize nucleic acids by ionic forces [Houen, 1999] or due to its significant probability of codon/binding site association in the earlier RNA world [Knight et al, 2000]. So, the encoding of these four amino acids by GG, GC, UC and CG may be an earlier event. Under these two constraints, by use of the same calculation given in the previous paragraph, we can deduce the standard code table through minimization of Q'(U) logically. The main steps are:

Step 1. Cys and Trp-ter(UGA) are fixed in the block (1,4) and Tyr and ter(UAA/UAG) are fixed in the block (1,3) due to the assumption of the (Cys, Trp) bundle.

Step 2.

Met should be grouped with Ile, and Phe should be grouped with Leu, as stated in steps 1 and 5 of the previous paragraph. Under the assumption of the (Lys, Asn) bundle the most favorable combinations of the other doublets, namely Asp, Glu, Asn, Lys, Gln, His, Arg(2) and Ser(2), should be (Lys, Asn), (Asp, Glu), (His, Gln) and (Ser(2), Arg(2)). Considering the early fixation of Ser(4) and Arg(4), the coding of Arg(2) and Ser(2) may be a later independent event.

Step 3.

Following our assumption, the blocks (1,2) and (4,2) have been filled by Ser(4) and Ala respectively, and the blocks (2,4) and (4,4) have been filled by Arg(4) and Gly respectively. So, only five blocks in the H-domain, namely (1,1), (2,1), (2,2), (3,1) and (4,1), and five blocks in the P-domain, namely (2,3), (3,2), (3,3), (3,4) and (4,3), are to be determined. By permutation of these ten blocks we have succeeded in deducing the minimum of GMD and have therefore proved that the standard code is Q'(U) minimal under the two constraints. So, the twenty amino acids in the standard genetic code are distributed according to the principle of minimization of the GMD function [Luo and Li, 2002a]. In the deduction of the standard code we have also obtained a table for an intermediate case.
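The relative distance between the standard and the minimal code quoted earlier in this section (about 9.86% of $Q_{max} - Q_{min}$) can be checked with a few lines of arithmetic. The sketch below is only a consistency check on the quoted numbers: it infers the span $Q_{max} - Q_{min}$ implied by the 9.86% figure and confirms that the resulting $Q_{max}$ lies near $1.75 \times 10^5$. The helper `relative_distance` is an illustrative name, not part of the original work.

```python
Q_MIN = 41722            # global minimum of GMD (minimal code)
Q_STD = 54940            # GMD of the standard code
QUOTED_FRACTION = 0.0986 # (Q_std - Q_min) / (Q_max - Q_min) as quoted in the text

gap = Q_STD - Q_MIN                   # absolute distance between the two codes
implied_span = gap / QUOTED_FRACTION  # value of Q_max - Q_min implied by 9.86%
implied_q_max = Q_MIN + implied_span  # should be close to 1.75e5


def relative_distance(q_a, q_b, span):
    """Distance between two codes as a fraction of the span Q_max - Q_min."""
    return abs(q_a - q_b) / span


print(gap, round(implied_span), round(implied_q_max))
print(round(relative_distance(Q_STD, Q_MIN, implied_span), 4))
```

The same helper can be applied to any other code table once its Q value is known.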

If only the first constraint (the doublet bundles of Cys / Trp and Lys / Asn) is introduced, the minimization of Q'(U) leads to a table with Q = 51039. Its distance to the standard table is 2.91% of ($Q_{max} - Q_{min}$) [Luo, 2000; 2004]. So, to deduce the standard code by use of GMD minimization, the constraint on the early fixation of several amino acids (Gly, Ala, Ser and Arg) on the table as the initial condition is necessary.

The above discussions are held under the mutational parameters given by Eq (1.11). However, the results are insensitive to the parameter choice. By changing the mutational parameters within the range of experimental data (the transitional-to-transversional ratio about 2-3 and the synonymous-to-nonsynonymous ratio about 4-8) the standard code can always be deduced from the minimization of Q'(U). Likewise, the results do not change substantially under some possible alterations of the amino acid distances. For example, if the doublet bundles and other amino acid clustering rules (described in steps 1 to 7 of the minimization of GMD) remain unchanged by use of new distances, and if the new distances are still classified into two or three categories according to the hydrophobicity scale, then basically the same code can be deduced [Luo and Li, 2002a].

In the present study the multiplicity of each amino acid (and of the stop codons) has been assumed in advance. In fact, a codon may disappear from a coding sequence due to some mutational pressure, and then reappear and acquire a new function, which results in a change of the multiplicity distribution of codons. If the multiplicities of some amino acids and terminators have been changed, the minimal code should be deduced by use of the new multiplicity constraints. So, the deviant assignment of codons and the evolvability of the genetic code could be accounted for in a generalized mutational deterioration theory. The point will be discussed in the following section.

Remarks

1. We have proposed a unified theory on the construction and evolution of the genetic code, from the local MD minimization of a codon multiplet to the global MD minimization of the whole table. The theory explains the robustness of the synonym redundancy distribution of codons and of the hydrophilic-hydrophobic distribution of amino acids in the genetic code, and these properties have been used for parameter choice and computational check in the global MD minimization. The meaning of GMD (Eq (2.1)) is twofold. On the one hand, the GMD can be regarded as a measure of non-fitness of the genetic code and its minimization is comparable with Wright's adaptive landscape theory [Wright, 1932]. The minimization of the non-fitness through changing the amino acid code reflects the real selection process in the code evolution. On the other hand, the GMD can be looked upon as an error function which contains two factors, the base-mutational error and the translational error. As compared with other error minimizations of the genetic code, the two factors can be estimated independently in our theory. The mutational error can be minimized by the parameter choice based on the determination of the synonym redundancy distribution. The translational error is minimized through the appropriate arrangement of amino acids on the optimized hydrophilic-hydrophobic domain. Simultaneously, different from Haig and Hurst (1991), the division of codon space (the 64 possible codons) into 21 nonoverlapping sets is not fixed in our theory but is changeable with amino acid replacement in variant ideal codes. Therefore, the minimal table, Fig 2, obtained by us is a unique one, different from those in other error minimization theories, for example, Fig 3 given by Di Giulio (1994) and Fig 4 given by Freeland and Hurst (1998).

2. In our approach, the prevalent standard code has been deduced logically from GMD minimization under some constraints. It has a lower MD value, but not the minimal one (deviating from the minimum by about 9.86%). The result is reasonable due to the existence of constraints that relate to the amino acid expansion and the frozen accident that occurred in the early stages. In some error minimization theories [Freeland and Hurst, 1998], it was argued that the standard code is a "one in a million" event, but how the prevalent standard code emerged from the evolutionary history is not clear. The merit of our approach is: we demonstrate that the natural code is not far from the minimal code and that it is evolutionarily accessible through introducing some constraints that reflect the adaptation in early evolution. The remarkable capacity of the proposed approach is due to GMD being not only a measure of error, but also a quantity for describing the adaptive evolution of the genetic code. In the meantime, the fact that the standard code is deducible through GMD minimization under two constraints adapted to the early environment also suggests the big-bang-like formation of the standard code in a relatively short time after the Last Universal Common Ancestor of extant life (LUCA) [Knight et al, 2000; Chechetkin, 2003].

3. The mutational deterioration of molecular sequence [Ji and Luo, in Luo, 2000]. The concept of mutational deterioration for the genetic code can be generalized to molecular sequences. Set P(j) as the normalized frequency of codon j in the sequence, $\sum_j P(j) = 64$. Define the mutational deterioration of a molecular sequence

$$J = \sum_{i,j} \sum_{\alpha,\beta} U_{i\alpha}\, U_{j\beta}\, P(i)\, f_{ij}\, D_{\alpha\beta} \tag{2.9}$$

Eq (2.9) reduces to Q(U), Eq (2.1), as P(i) = 1. The meaning of J is a measure of the natural selection strength on the molecular sequence. Through calculation we find that the differences of J among various coding sequences are generally smaller than 5%. Virus, phage and Ras oncogene have comparatively large J, which may be related to the stronger mutation ability or selective death of these genes.

Evolution of the genetic code from the viewpoint of mutational deterioration theory

The evolution of the genetic code is closely related to the amino acid expansion and the change of the synonym multiplicity in the genetic code.

In the previous section the GMD (non-fitness of the code) minimization was accomplished under given degeneracy degrees of amino acids and terminators. The ideal code (represented by $U_{i\alpha}$ in Eq (2.1)) is the coordinate of the landscape, and the encoded amino acid number and the degeneracy degree of each multiplet (constraints Eq (2.2)) determine the adaptive landscape of the code. Now we will discuss the possible change of the constraints on $U_{i\alpha}$ and therefore the alteration of the fittest genetic code.

Since 1979 a number of departures or changes from the universal genetic code have been discovered in mitochondria. It was pointed out that mitochondria had very small genomes and, in contrast to whole organisms, can tolerate changes in the code. However, this changed in 1985. Some deviant codon assignments have also been discovered in nuclear genomes [Barrell et al, 1979; Jukes & Osawa, 1991]. The deviant assignments of codons are summarized in Table 3-1 [Maeshiro & Kimura, 1998; Knight et al, 2001]. In the 30 deviant assignments there are 16 cases of stop codons changing to sense codons, 2 cases of the reversed reassignment (sense codons changing to stop codons), and 12 cases related to alternative codes for amino acids. The latter include:

① AUA (Ile) codes for Met deviantly (four cases);
② AGR (Arg) codes for Ser deviantly (three cases);
③ AGR (Arg) codes for Gly deviantly (one case);
④ AAA (Lys) codes for Asn deviantly (two cases);
⑤ CUN (Leu) codes for Thr deviantly (one case); and
⑥ CUG (Leu) codes for Ser deviantly (one case) (3.1)

Table 3-1 Deviant assignments of codons

No.   Codon  Standard code  Abnormal code  Representative system
1a    UGA    stop           Trp            Mitochondrial yeasts
1b    AUA    Ile            Met
1c    CUN    Leu            Thr
2a    UGA    stop           Trp            Mitochondrial platyhelminths
2b    AAA    Lys            Asn
2c    AGR    Arg            Ser
2d    UAA    stop           Tyr
3a    UGA    stop           Trp            Mitochondrial nematoda, arthropoda, mollusca
3b    AGR    Arg            Ser
3c    AUA    Ile            Met
4a    UGA    stop           Trp            Mitochondrial echinodermata
4b    AAA    Lys            Asn
4c    AGR    Arg            Ser
5a    UGA    stop           Trp            Mitochondrial tunicata
5b    AUA    Ile            Met
5c    AGR    Arg            Gly
6a    UGA    stop           Trp            Mitochondrial vertebrata
6b    AUA    Ile            Met
6c    AGR    Arg            stop
7a    UGA    stop           Trp            Mitochondrial euascomycetes
8a    UAG    stop           Leu            Mitochondrial in some green plants *
8b    UAG    stop           Ala
8c    UCA    Ser            stop
9a    UGA    stop           Trp            Nuclear mycoplasma
10a   UGA    stop           Cys            Nuclear euplotes
11a   UAR    stop           Gln            Nuclear acetabularia
12a   UAG    stop           Gln            Nuclear blepharisma
13a   CUG    Leu            Ser            Nuclear candida
14a   UGA    stop           SeCys          Nuclear **
15a   UAG    stop           PyLys          Nuclear **

(After Maeshiro et al, 1998; * 8a, 8b and 8c taken from Knight et al, 2001; ** new amino acid)
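The case counts quoted in the text (16 stop-to-sense, 2 sense-to-stop and 12 sense-to-sense reassignments, the latter split into six classes) can be recovered directly from the rows of Table 3-1. The sketch below is an illustrative tally only; the tuple layout and the helper name `kind` are assumptions made for this example.

```python
from collections import Counter

# (codon, standard meaning, deviant meaning, system), one tuple per row of Table 3-1.
ASSIGNMENTS = [
    ("UGA", "stop", "Trp", "mt yeasts"), ("AUA", "Ile", "Met", "mt yeasts"),
    ("CUN", "Leu", "Thr", "mt yeasts"),
    ("UGA", "stop", "Trp", "mt platyhelminths"), ("AAA", "Lys", "Asn", "mt platyhelminths"),
    ("AGR", "Arg", "Ser", "mt platyhelminths"), ("UAA", "stop", "Tyr", "mt platyhelminths"),
    ("UGA", "stop", "Trp", "mt nematoda"), ("AGR", "Arg", "Ser", "mt nematoda"),
    ("AUA", "Ile", "Met", "mt nematoda"),
    ("UGA", "stop", "Trp", "mt echinodermata"), ("AAA", "Lys", "Asn", "mt echinodermata"),
    ("AGR", "Arg", "Ser", "mt echinodermata"),
    ("UGA", "stop", "Trp", "mt tunicata"), ("AUA", "Ile", "Met", "mt tunicata"),
    ("AGR", "Arg", "Gly", "mt tunicata"),
    ("UGA", "stop", "Trp", "mt vertebrata"), ("AUA", "Ile", "Met", "mt vertebrata"),
    ("AGR", "Arg", "stop", "mt vertebrata"),
    ("UGA", "stop", "Trp", "mt euascomycetes"),
    ("UAG", "stop", "Leu", "mt green plants"), ("UAG", "stop", "Ala", "mt green plants"),
    ("UCA", "Ser", "stop", "mt green plants"),
    ("UGA", "stop", "Trp", "nuclear mycoplasma"), ("UGA", "stop", "Cys", "nuclear euplotes"),
    ("UAR", "stop", "Gln", "nuclear acetabularia"), ("UAG", "stop", "Gln", "nuclear blepharisma"),
    ("CUG", "Leu", "Ser", "nuclear candida"),
    ("UGA", "stop", "SeCys", "nuclear"), ("UAG", "stop", "PyLys", "nuclear"),
]


def kind(standard, deviant):
    """Classify a reassignment by whether a stop codon is involved."""
    if standard == "stop":
        return "stop->sense"
    if deviant == "stop":
        return "sense->stop"
    return "sense->sense"


by_kind = Counter(kind(std, dev) for _, std, dev, _ in ASSIGNMENTS)
sense_classes = Counter(
    (codon, std, dev) for codon, std, dev, _ in ASSIGNMENTS
    if kind(std, dev) == "sense->sense"
)

print(dict(by_kind))
for key, n in sense_classes.items():
    print(key, n)
```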

These discoveries revealed that the genetic code is still evolving. Subsequently, two evolutionary theories, the codon capture theory and the ambiguous intermediate theory, were proposed [Knight et al, 2001; Santos et al, 2004]. In both theories the evolutionary mechanisms are complex and diverse, including base modification and RNA editing in tRNAs, genetic code ambiguity, genome base composition, codon usage and codon reassignment, etc. The alteration in the tRNA, the mutation or disappearance of some tRNA species, is a key step in the alternative code evolution. To follow each detail of the tRNA alteration theoretically is not easy. Ignoring these details we shall give a quantitative observation on the deviant codes from the mutational deterioration theory [Luo, 1989; Luo et al, 2002b]. There are five evolutionary modes of the alternative genetic codes:

Mode 1. Reassignment of a stop codon to a sense codon (via codon capture, Santos et al, 2004). In this mode the constraint conditions in the GMD minimization should be changed from Eq (2.2), namely $\sum_i U_{i\alpha} = R_\alpha$ (degeneracy degree of multiplet α), $\sum_\alpha U_{i\alpha} = 1$, to

$$\sum_i U_{i\alpha} = \begin{cases} R_\alpha & (\alpha \neq \alpha', \tau) \\ R_\tau - 1 & (\alpha = \tau,\ \text{terminator}) \\ R_{\alpha'} + 1 & (\alpha = \alpha',\ \text{some amino acid}) \end{cases} \qquad \sum_\alpha U_{i\alpha} = 1 \tag{3.2}$$

The codon reassignment can be viewed as a virtual codon interaction described by the equation

$$A_i + T_j \to A_{i+1} + T_{j-1} \tag{3.3}$$

or

$$T_j \to A_1 + T_{j-1} \quad \text{(for cases 14a, 15a)} \tag{3.4}$$

where $A_i$ describes an amino acid with codon multiplicity i and $T_j$ the terminators with multiplicity j. Following Eq (2.4) the leading term in GMD is $q_{ter}$. One may assume that D ($= D_{\alpha,ter} = D_{ter,\alpha}$) is a large number as compared with the other terms. So, the processes (3.3) and (3.4) lower the mutational deterioration and are selective-favorable. This explains the 16 cases of stop codons changing to sense codons in Table 3-1.

Mode 2. Reassignment of a sense codon to a stop codon. The constraint conditions in the GMD minimization should be changed from Eq (2.2) to

$$\sum_i U_{i\alpha} = \begin{cases} R_\alpha & (\alpha \neq \alpha', \tau) \\ R_{\alpha'} - 1 & (\alpha = \alpha') \\ R_\tau + 1 & (\alpha = \tau) \end{cases} \qquad \sum_\alpha U_{i\alpha} = 1 \tag{3.5}$$

The virtual codon interaction equation reads

$$A_i + T_j \to A_{i-1} + T_{j+1} \tag{3.6}$$

The GMD increases in the process if $D_{\alpha,ter}$ ($= D_{ter,\alpha}$) is a large enough number.

Mode 3. Sense codon reassignment via codon capture (Santos et al, 2004; Knight et al, 2001). The constraint conditions in the GMD minimization are changed from Eq (2.2) to

$$\sum_i U_{i\alpha} = \begin{cases} R_\alpha & (\alpha \neq \alpha', \beta') \\ R_{\alpha'} - 1 & (\alpha = \alpha') \\ R_{\beta'} + 1 & (\alpha = \beta') \end{cases} \qquad \sum_\alpha U_{i\alpha} = 1 \tag{3.7}$$

The codon interaction equation is of the form

$$A_i + A_j \to A_k + A_l \quad (i + j = k + l) \tag{3.8}$$

where $i = R_{\alpha'}$, $j = R_{\beta'}$, $k = R_{\alpha'} - 1$, $l = R_{\beta'} + 1$ for the present case.

Mode 4. Sense codon reassignment via ambiguous intermediate (Santos et al, 2004; Knight et al, 2001). The constraint conditions in the GMD minimization are changed from Eq (2.2) to

$$\sum_i U_{i\alpha} = \begin{cases} R_\alpha & (\alpha \neq \alpha', \beta') \\ R_{\alpha'} & (\alpha = \alpha') \\ R_{\beta'} + 1 & (\alpha = \beta') \end{cases} \qquad \sum_\alpha U_{i\alpha} = \begin{cases} 1 & (i \neq i') \\ 2 & (i = i',\ \text{coding for } \alpha' \text{ and } \beta') \end{cases} \tag{3.9}$$

for the ambiguous intermediate and finally to

$$\sum_i U_{i\alpha} = \begin{cases} R_\alpha & (\alpha \neq \alpha', \beta') \\ R_{\alpha'} - 1 & (\alpha = \alpha') \\ R_{\beta'} + 1 & (\alpha = \beta') \end{cases} \qquad \sum_\alpha U_{i\alpha} = 1 \tag{3.10}$$

The final virtual codon interaction equation is also of the form of Eq (3.8).

Mode 5. Reassignment via two steps. The first step is the same as Mode 2, the reassignment of a sense codon to a stop; then the second step (as Mode 1) follows, the reassignment of the new stop codon to another sense codon. The successive codon interaction equations in the two steps are

$$A_{R_\alpha} + T_{R_\beta} \to A_{R_\alpha - 1} + T_{R_\beta + 1}$$

$$T_{R_\beta + 1} + A_{R_\gamma} \to T_{R_\beta} + A_{R_\gamma + 1} \tag{3.11}$$

and the total codon interaction is the sum of the above two equations,

$$A_{R_\alpha} + A_{R_\gamma} \to A_{R_\alpha - 1} + A_{R_\gamma + 1} \tag{3.12}$$

again in the form of Eq (3.8).

The GMD variations in Modes 1 and 2 have been indicated above. Now we will discuss the variation of the optimal value of GMD in Modes 3 to 5. If the differences among the amino acid distances in Q'(U) (Eq (2.5)) are neglected, $D_{\alpha\beta}$ = constant (the IAAA approximation), the optimal Q'(U) can easily be deduced. For a given code with the ideal multiplicity distribution {$n_j$} (j = multiplicity) the approximation leads to

$$Q'(U_{min}) = \sum_j n_j\, m(j) \tag{3.13}$$

where $U_{min}$ means the minimal code for the given multiplicity distribution {$n_j$}, and m(j) represents the corresponding local minimum of MD that has been found in Eqs (1.12). Since an alternative genetic code contains only one or a small number of deviant codon reassignments the IAAA is a good approximation. By use of Eq (3.13) we are able to deduce the GMD variation in the codon reassignment Modes 3 to 5.

The multiplicity distribution {$n_j$} is supposed to satisfy the constraints

$$\sum_j j\, n_j = 61, \qquad \sum_j n_j = 20 \tag{3.14}$$

From the Lagrange multiplier method one has

$$\delta \Big( \sum_j n_j\, m(j) - \lambda \sum_j n_j - \mu \sum_j j\, n_j \Big) = 0 \tag{3.15}$$

It gives

$$m(j) = \lambda + \mu j \tag{3.16}$$

One can easily check that m(j) given by Eq (1.12) (with the parameter choice (1.11)) satisfies Eq (3.16) approximately. So the code table consisting of degenerate multiplets deduced from local minima of MD is approximately globally minimized. However, no information about {$n_j$} has been obtained from the above deduction. There is much room for the choice of the multiplet distribution in the code.

The minimization of GMD under given constraints (amino acid number and multiplicity) is a process of selective optimization in the evolution. The choice from the comparison of two optimal codes under different constraints has the similar meaning of selective optimization. Such a deduced code with lower GMD should be selective-favorable. Since the codon reassignments in Modes 3 to 5 can be reduced to the fundamental virtual process (3.8) we define the selective potential R,

$$R = \frac{m(i) + m(j) - m(k) - m(l)}{m(i) + m(j)} \tag{3.17}$$

for the process. The reassignment of codons will be selective-favorable if R > 0. If the linear relation (3.16) holds rigorously then R = 0 for all processes of type (3.8). But Eq (3.16) is only an approximate one and R differs from zero in reality. So the reassignment of codons may increase or decrease the fitness of the code according to whether R > 0 or R < 0.

The deviant codon assignment ① of Eq (3.1) can be expressed as $A_3 + A_1 \to A_2 + A_2$ and by use of Eqs (3.17), (1.12) and (1.11) one has

R = … (3.18)

which is selective-favorable. The deviant codon assignment ② is also selective-favorable, since from $A_6(\mathrm{Arg}) + A_6(\mathrm{Ser}) \to A_4 + A_8$ we have

R = … (3.19)

The deviant codon assignment ③ can be explained in the same way, through $A_6(\mathrm{Arg}) + A_4(\mathrm{Gly}) \to A_4 + A_6$ and

R = … (3.20)

By use of the same calculation, for the deviant codon assignment ④ represented by $A_2 + A_2 \to A_1 + A_3$ one has

R = … (3.21)

For the deviant codon assignment ⑤ represented by $A_6(\mathrm{Leu}) + A_4 \to A_2 + A_8^{*}$ (here the star means not at the MD minimum) one has

R = … (3.22)

For the deviant codon assignment ⑥ represented by $A_6(\mathrm{Leu}) + A_6(\mathrm{Ser}) \to A_5^{*} + A_7^{*}$ one has

R = … (3.23)

The above deviant codon assignments can be classified into 3 categories. The reassignments of class ① ② an has R = . We assume that the reassignment of codon CUG from Le da cylindracea (class ⑥ ) follows the odes, the Mode ⑥ shows uncommon character of GMD-incre that the alternative genetic codes of class ⑥ evolves across an intermediate stop codon should d to each other by changing several nucleotides. The factor is also important for understanding the reassignment. mmary, the evolution of the alternative genetic code is classified into five categories: two related to the reassignment of a stop codon to a sense codon or its reverse and three related to on between different amino acids. The variati important quantity for describing the evolvability of the code. In IAAA approximation it can be 17)). From the 30 reassignments of codons (Table 3- ely 6c 8c (referring to Mode

2, eassignment of a sense codon to a stop) and 13a (referring to

Mode

5, that includes an intermediate step of the reassignment of a sense codon to a stop) are explicitly GMD-increasing. It is in apart from the reassignment of a sense codevolution of alternative genetic code has a general trend of GMD non-increasing which reflects intermediate theory been cleared up naturally in the evolution through the selection role of MD minimizatio ally observed reassignment is a selective- advantageous or neutral one ( ). As for the abnormality of GMD variation in the reassignment of a sense codon to a e partly due to the lack of an accuraphysico-chemical distance between amino acid and terminator since we have generally assumed D calculation.

Synonym multiplicity distribution in the genetic code

In the last paragraph w e will give an explanation on the distribution of codon multiplicities in the genetic code . d ③ (8 in 12 cases of Eq (3.1) lower the GMD value as compared with the standard code. The reassignments of class ④ and ⑤ (3 in 12 cases) leads to a higher GMD but near the standard code. On may assume that these reassignments ① to ⑤ follow the evolutionary Mode

3 and 4. So, the general trend of the genetic code evolution is towards a lower GMD (or keeping the value unchanged). However, the deviant codon assignment ⑥ (of

Eq (3.1)) -34%, which should be explained by other evolutionary mechanismu to Ser in

Candi evolutionary

Mode . Different from other four mcodon reassignment cl asing.

The assumption wait for further test. Of course, the tRNA Leu and tRNA Ser are structurally similar and they can be converseIn suthe reassignment of a cod on of optimal GMD is ancalculated through the local minima of MD (selective potential R , Eq (3.1), we find only three cases, namthe rteresting to note that, on to a stop codon, the the selection on the code. In fact, many ambiguities of the intermediate (as in the ambiguous ) may haven and the fin R ≥ stop codon, it may b te calculation method on the , , ( ) ter ter D D α α a very large constant in all cases of GM Consider the fundamental process （） and calculate the total MD for a pair of multiplets with given codon number N = i + j = k + l . By use of Eqs. (3.8) (1.12) and parameter choice (1.11) -lying) and the first excited sta in Table 3–2. From the table we find low-lying pairs (with minimal total MD) 2with v =1 we obtain the total MD in the ground (low tes shown +2 for N =4, 1+4 for N =5, 2+4 for N =6, 3+4 for N =7, 4+4 for N =8, 1+8 for N =9, 2+8 for N =10, 3+8 for N =11, and 4+8 for N =12, etc. No A and A occur in low-lying pairs. This explains the multiplicity distribution in the standard code and the disappearance of A and A in it. They may occur in abnormal code but scarcely. A does not occur in low-lying pairs, too, but it occurs in the first excited pairs near the ground pairs (namely, in 4+6 and 1+6). The calculation also shows that the pair 2+2 is slightly lower than 1 + 3, so the doublet occurs more frequently in the code table. If the virtual 3-body interaction A i + A j + A k → A i + A m + A n is taken into account the above conclusion remains unchanged. For example, for case N=7 the low-lying state is 1+2+4 and the first excited state is 2+2+3 which are comparable with Table 3-2. Table 3 -2 Total MD for a pair of amino acids with given N N Low-lying ( MD ) First excited ( )( )

MDMD −

4 2+2 71 1+3 3% 5 1+4 62 2+3 30% 6 2+4 69 1+5 24% 7 3+4 79 1+6 12% 8 4+4 67 2+6 43% 9 1+8 78 4+5 17% 10 2+8 85 4+6 11% 11 3+8 96 1+10 8.6% 12 4+8 83 2+10 35% (MD) and (MD) mean the total MD value for a pair of amino acids in low-lying state and the first excited state respectively. Yin-Yang duality in the genetic code eoretic symmetry behind the genetic Code The group-th roup theory is an appropriate tool for studying the symmetry of a system. For continuous metries seem too Gsymmetry the groups with 64 dimensional irreducible representations are SU (2), SU (3), SU (4), Sp (4), Sp (6), SO (13), SO (14) and G . The Sp (6) symmetry was introduced in the genetic code study by several authors [Hornos & Hornos, 1993]. But these continuous sym igh to describe the genetic code. Should such a high symmetry sp(6) among four nucleotides the real process of temporal refinement in the codon recognition [Nieselt-Struwe & Wills, 1997]. Therefore, we shall consider the discrete symmetry. Following Cayley theorem, any group G of the symmetry among 4 nucleotides, the group S is most appropriate. So the triplet code should be described by S ⊃ S ⊗ S ⊗ S . Of course, the S symmetry may be still too high and it should Z ： { （），（）（），（）， e } or { e ， a ， b （ = a ）， c （ = a ） ⏐ a = e } (4.1) （ e means the identity ） . Another is Klein- 4 group V . Its elements are V ： { （）（），（）（），（）（） , e } or { e ,a , b , c ⏐ ecba === , ab = ba = c , etc} (4.2) Which is the exist in code, it must be seriously broken. But the decomposition of sp(6) symmetry did not reflect order n is isomorphic with a subgroup of the symmetric grou [Hamermesh, 1962] . To describe be broken furt o subgroups er 4. One is cyclic group with elements mo priate candidate for describing the symmetry behind the genetic code, or ？ The elements in , apart from the ntity, are all 2-cycles. They may have clear 1999; Luo, 2000]. The elemen be defi ugh (ˆ (ˆ(ˆ abcd dcbaabcd badcabcd γβ p n S her. 
S contains tw of ord Z st appro V ide Z V biological meaning. While in Z , the elements include 4-cycles, such as (1 2 3 4),(1 4 3 2) ， etc. which lack biological meaning. So, V is the best candidate. The Klein 4-group as a relevant group-theoretic descri ez-Montano, ption has been discussed in literatures [Finley et al , 1982; Jiments of V can ned thro )() cdab = )() = )() =α (ˆ abcde （） Set the re ion betwe ur nuc a,b,c,d ) as =a+b+c+d = a + b G = a–b )() abcd = lat en fo leotides and ( U C – c – d + c–d (4.4) Evidently, U,C,G,A are eigenstates of A = a– b– c+ d α ˆ 、 β ˆ 、 γ ˆ 、 e ˆ respectively. Their eigenvalues are given in Table -1. Table 4-1 The eigenvalues of U C G A V e ˆ α ˆ ˆ β γ ˆ +1 +1 –1 –1 +1 –1 –1 +1 +1 –1 +1 –1 +1 +1 +1 +1 o, α ˆ is the operation classifying purine ( α ˆ = –1) and pyrimidine ( α ˆ =+1), β ˆ is the operation classifying strong bond ( β ˆ = –1) and weak bond ( β ˆ =+1). However, because there is sharp distinction in physicochemical properties between different pu es, pyrimidines, strong bonds or weak bonds, the V symmetry should be broken further. The broken V symmetry can be manifested thr gh in-Y ng uality (see below rinou Y a d ). nese d ional me e a d ancienin an . T m ids and nucleotides). In f surface of globular protein but n in its interior. So, t lity. On the other nglit ddes are classified into purine and pyrimidine according to their chemical struct ssi tson - Crick pairs according to the number of hydrogen bonds. So there exist the structural invariance between U and C or A and G and the ion between U and A or C and G.. To express this symmetry we suppose that four ,C,A,G are expressed by two lines (upper line and lower line) and each line takes two states , Yin denoted by   and Yang denoted by ━━ . The two states of upper line are classify purine an a i pyrimidine. The W tions he following we will use the former symmetrical, C and G are Yin-Yang symmetrical, too. 
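The eigenvalue table of the Klein 4-group can be checked mechanically. The following is a minimal sketch, assuming the reading of Eqs (4.3) and (4.4) in which each operator permutes the abstract components (a, b, c, d) and each base is a signed sum of them; the names `BASES`, `OPS`, `apply_op`, `compose` and `eigenvalue` are illustrative and do not come from the paper.

```python
# Coefficients of each base over the components (a, b, c, d), following Eq (4.4).
BASES = {
    "U": (1, 1, 1, 1),
    "C": (1, 1, -1, -1),
    "G": (1, -1, 1, -1),
    "A": (1, -1, -1, 1),
}

# Eq (4.3): slot i of the image takes the component at index OPS[name][i],
# e.g. alpha(abcd) = (badc). All four permutations are their own inverses.
OPS = {
    "e": (0, 1, 2, 3),
    "alpha": (1, 0, 3, 2),
    "beta": (3, 2, 1, 0),
    "gamma": (2, 3, 0, 1),
}


def apply_op(op, vec):
    """Permute the coefficient vector of a base with the named operator."""
    return tuple(vec[i] for i in OPS[op])


def compose(p, q):
    """Return the name of the operator obtained by applying q, then p."""
    perm = tuple(OPS[q][j] for j in OPS[p])
    return next(name for name, other in OPS.items() if other == perm)


def eigenvalue(op, base):
    """Return +1 or -1 if the base is an eigenstate of the operator."""
    vec = BASES[base]
    image = apply_op(op, vec)
    if image == vec:
        return 1
    if image == tuple(-x for x in vec):
        return -1
    raise ValueError(f"{base} is not an eigenstate of {op}")


# Rows are operators, columns are the bases in the order U, C, G, A.
TABLE = {op: [eigenvalue(op, b) for b in "UCGA"]
         for op in ("alpha", "beta", "gamma", "e")}

for op, row in TABLE.items():
    print(f"{op:>5}", row)
```

In this sketch the row for `alpha` separates pyrimidines (U, C) from purines (G, A), and the row for `beta` separates weak-bond bases (U, A) from strong-bond bases (C, G), in line with the interpretation given in the text. `compose` also confirms the group relations of Eq (4.2), for instance that `alpha` followed by `beta` equals `gamma`.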
The Duality of Genetic Code

According to Chinese traditional medicine and ancient philosophy, life is the unity of a pair of contradictory factors, namely Yin and Yang. The Yin-Yang duality is displayed not only in the stratum of cells, but also in a deeper stratum: molecules (amino acids and nucleotides). In fact, in protein folding, the hydrophilic residues are exposed on the surface of a globular protein but the hydrophobic residues are buried in its interior. So, the hydrophilicity and hydrophobicity can be seen as a kind of Yin-Yang duality. On the other hand, the biosynthesis of protein is under the instruction of the nucleotide sequence. It is unimaginable that there is no Yin-Yang duality in nucleic acids. On account of this, we propose the assumption of Yin-Yang duality of nucleotides. We emphasize the duality property of nucleotides and its relation to the characteristics and classification of codons and amino acids. In the literature, a similar model has been suggested by Swanson but from a different point of view [Swanson, 1984].

Four nucleotides are classified into purine and pyrimidine according to their chemical structure. They are classified into two kinds of Watson-Crick pairs according to the number of hydrogen bonds. So there exist the structural invariance between U and C or A and G, and the symmetrical relation between U and A or C and G. To express this symmetry we suppose that the four nucleotides U, C, A, G are expressed by two lines (upper line and lower line) and each line takes two states, Yin denoted by – – and Yang denoted by ━━. The two states of the upper line are introduced to classify purine and pyrimidine, while those of the lower line are for sub-classification within purine or within pyrimidine. The W-C pair occurs between Yin and Yang. The representations are given by Figure 5a or equivalently by Figure 5b. In the following we will use the former representation of Figure 5a. U and A are Yin-Yang symmetrical, and C and G are Yin-Yang symmetrical, too. W-C bonds take place between them. The above representation of bases is called the assumption of duality or Yin-Yang of nucleotides.

The representation is taken from 《The Book of Changes》 (《I Ching》), a book of ancient Chinese philosophy. In this book Yin and Yang are introduced as two universal and fundamental properties (mutually contradictory and dependent on each other) of all things in nature and society. The four doublets of Yin and Yang are called the four Yis, which give the detailed classification of Yin and Yang. U is called L-Yang (Lao-Yang or Large Yang); C, S-Yang (Shao-Yang or Small Yang); G, S-Yin (Shao-Yin or Small Yin); and A, L-Yin (Lao-Yin or Large Yin). The eight triplets of Yin and Yang are called the eight Guas. Each Gua characterizes a phenomenon in nature. A pair of Guas (namely the hexagram of Yin and Yang) describes a change (a changing state) of things. The above dual representation of nucleotides was introduced by us in 1992 [Luo, 1992]. But in the popular book 《Who wrote the book of life? – A history of the genetic code》 [Kay, 2000] we read that "Around 1969 several individuals in Europe and the United States observed, from very different professional vantage points, that the ancient Chinese I Ching and the newly completed genetic code shared remarkable similarity. The three-thousand-years old Book of Changes – a symbolic system for comprehending human experience – and the genetic Book of Life exhibited striking correspondence." It seems that many people have noticed the similarity between the genetic code and the ancient Chinese I Ching. They take nearly the same view without prior consultation.

The dual representation of nucleotides reflects not only the intrinsic symmetry between the four bases, but also their similarity order. U is placed at one end, A is placed at the other end, and C and G lie between them. The similarity between U and C (or A and G) is larger than that between U and G (or A and C), and the latter is, in turn, larger than that between U and A, since the upper line classifying purine and pyrimidine has a higher weight than the lower line. Two bases with large similarity will have a high mutational rate between them. The observation of pseudo-gene mutations supports the supposition (see Table 4-3) [Li, 1997].

Table 4-3 Relative mutational frequencies in pseudo genes

Ori \ Mut to    A    T    C    G
A               –    4.7 ± …

… ≥ T (in presence of O, C and G may be transposed). The result is consistent with the theoretical calculation of resonance energy per π electron, 0.32 β, 0.27 β, 0.23 β, 0.19 β, 0.17 β for A, G, C, U, T respectively [Pullman & Pullman, 1964]. Moreover, the representation is also consistent with the hydrophobicity order of nucleoside 5'-monophosphate, AMP > GMP > CMP > UMP, see Table 4-4 [Lacey & Mullins, 1983].

Table 4-4 The hydrophilicity A and hydrophobicity B for nucleoside 5'-monophosphate

     AMP   GMP   CMP   UMP
A    …
B    …

(A is taken from Weber & Lacey 1978; B is taken from Garel 1973; see Lacey & Mullins, 1983)

The definite order of the four nucleotides, UCGA, is an important factor in understanding the broken symmetry.

The base order has been changed from the conventional UCAG to UCGA in the above Yin-Yang representation. This is also consistent with the structural regularity in nucleobases: the number of sp2 nitrogen atoms in the nucleobase is 0 in U, 1 in C, 2 in G and 3 in A, as was indicated by Yang [Yang, 2005].
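The two-line representation and the resulting base order can be illustrated with a short sketch. It assumes the encoding U = (Yang, Yang), C = (Yang, Yin), G = (Yin, Yang), A = (Yin, Yin) for (upper line, lower line); the lower-line assignment of C and G is an assumption about Figure 5a, and the weights 2 and 1 are hypothetical values chosen only so that the upper line outweighs the lower one. None of the names below come from the paper.

```python
YANG, YIN = 1, 0

# (upper line, lower line) for each base; the upper line separates
# pyrimidines (Yang) from purines (Yin). Lower lines of C and G are assumed.
DUAL = {
    "U": (YANG, YANG),  # Lao-Yang
    "C": (YANG, YIN),   # Shao-Yang
    "G": (YIN, YANG),   # Shao-Yin
    "A": (YIN, YIN),    # Lao-Yin
}


def complement(base):
    """Watson-Crick partner: flip Yin and Yang on both lines."""
    upper, lower = DUAL[base]
    flipped = (1 - upper, 1 - lower)
    return next(b for b, lines in DUAL.items() if lines == flipped)


def dissimilarity(x, y, upper_weight=2, lower_weight=1):
    """Weighted count of differing lines (hypothetical weights)."""
    (ux, lx), (uy, ly) = DUAL[x], DUAL[y]
    return upper_weight * (ux != uy) + lower_weight * (lx != ly)


# Sorting the bases by their dissimilarity from U gives the Yin-Yang order.
ORDER_FROM_U = sorted(DUAL, key=lambda b: dissimilarity("U", b))
print(ORDER_FROM_U)
print({b: complement(b) for b in DUAL})
```

Under these assumptions the Watson-Crick partners are exactly the pairs with both lines flipped (U with A, C with G), and ranking the bases by weighted dissimilarity from U gives the order U, C, G, A used in the text.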

The genetic code consists of triplets of nucleotides. Each codon should be represented by a diagram with 6 lines. We suppose that the double lines corresponding to the first base are put in the center (the 3rd and 4th line of the six-line-diagram), the double lines corresponding to the second base are put on its upper and lower sides (the 2nd and 5th line), and the double lines corresponding to the third base are put on the exterior of the diagram (the 1st and 6th line). For example, tryptophan (Trp, codon UGG) is expressed by

– –
– –
━━
━━
━━
━━
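The layout rule above can be written as a small function. This is a sketch under the assumption that each base carries an (upper line, lower line) pair with U = (Yang, Yang), C = (Yang, Yin), G = (Yin, Yang), A = (Yin, Yin), and that the upper line of each base goes to the upper half of the diagram; the function names are illustrative only.

```python
YANG, YIN = "━━", "– –"

# (upper line, lower line) of each base; the C and G lower lines are assumed.
DUAL = {
    "U": (YANG, YANG),
    "C": (YANG, YIN),
    "G": (YIN, YANG),
    "A": (YIN, YIN),
}


def six_lines(codon):
    """Lines 1..6, top to bottom: third, second, first base (upper lines),
    then first, second, third base (lower lines)."""
    first, second, third = (DUAL[base] for base in codon)
    return [third[0], second[0], first[0], first[1], second[1], third[1]]


def central_four(codon):
    """Lines 2..5, which depend only on the first two bases."""
    return six_lines(codon)[1:5]


print("\n".join(six_lines("UGG")))  # diagram for Trp
```

For UGG the sketch gives two Yin lines followed by four Yang lines, which agrees with the Trp example. `central_four` keeps only the lines set by the first two bases of a codon.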

Since the first two bases are more important in the determination of the property of the amino acid, we suppose that the central four lines are fundamental in the six-line-diagram. The genetic code table represented by the central four lines of the 64 codons is shown in Fig. 6.

Several symmetry operations can be defined as follows:

1) Star operation. By changing ━━ (Yang) to – – (Yin) and – – to ━━, a codon X is transformed to its antonym (the codon composed of complementary bases) X*;

2) R operation. By interchanging each upper line with the corresponding lower line, a codon X is transformed to X^R. If X^R = X it is called R-symmetrical; if X^R = X* it is called R-antisymmetrical. The initiator and terminators AU, UA (first two bases) are R-symmetrical. CC, GG, CG, and GC are R-antisymmetrical.

3) T operation. It is defined by the interchange of lines (3,4) and (2,5) in the six-line-diagram, that is, the interchange of the first and second letter of a codon. The higher weight of the second position of a codon than the first position means asymmetry under the T operation.

Hydrophilicity and hydrophobicity are a kind of universal Yin-Yang duality of life expressed at the level of amino acids. In Fig. 6 the genetic code has been represented through Yin lines (– –) and Yang lines (━━). We find that the more Yin lines the di-nucleotide contains, the stronger the hydrophilicity of the encoded amino acid is; the more Yang lines the di-nucleotide contains, the stronger the hydrophobicity of the encoded amino acid is. If the number of Yin lines is equal to that of Yang lines for a di-nucleotide, then the higher weight of upper lines as compared with the lower ones, and the higher weight of the second position of a codon (2nd or 5th in the six lines) as compared with the first position (3rd or 4th in the six lines), should be considered. Following these rules we can divide the genetic code of Figure 6 into two regions. The amino acids inside the solid line (the lower-right part of the figure) are hydrophilic and those outside it (upper-left part) are hydrophobic. When the conventional order of bases, namely UCAG, has been changed to UCGA (the order of Yin-Yang), a very symmetric fashion of the hydrophilic-hydrophobic domain in the code table is obtained. The above hydrophilic-hydrophobic classification of amino acids is consistent with experimental data. (The case of di-nucleotide UC framed by a dotted line should be considered carefully; it has been discussed in section 2.) So, the Yin-Yang duality provides a new explanation of the domain-like distribution of amino acids in the genetic code: the base A (Yin lines) in a codon contributes more to the amino acid hydrophilicity and the base U (Yang lines) contributes more to the amino acid hydrophobicity. The bases G and C are in the middle between A and U [Luo, 1992; 2000; 2004].

In section 2 we have deduced the hydrophilic-hydrophobic domain in the genetic code under the condition (1.10) and its complement (2.8). These inequalities on mutational parameters reflect the existence of some definite order about the base property among U, C, G and A. Now the base order and symmetry have been summarized by the formulation of Yin-Yang duality. Thus, the Yin-Yang duality can serve as a basic idea for understanding the hydrophilic-hydrophobic distribution of amino acids in the genetic code.

According to the proposed diagram representation, it is easy to find that a codon and its antonym behave differently in their Yin-Yang. So, if any amino acid, apart from Ser, is hydrophilic (hydrophobic), then the amino acid encoded by its antonym is hydrophobic (hydrophilic). In fact, the mutation rate between a codon and its antonym is very small since it is a three-base mutation and it occurs between Watson-Crick pairs. By use of a method similar to the one described earlier we can prove that a codon and its antonym are arranged in regions with different hydrophobicity.

Why can the amino acid hydrophobicity be deduced so successfully from the dual representation of nucleotides? The molecular mechanism is related to the tRNA structure and the origin of the genetic code. The selective interaction in the formation of the tRNA molecule leads to the hydrophilic amino acid recognizing a hydrophilic anticodon, and the hydrophobic amino acid recognizing a hydrophobic anticodon. Considering that the base A in a codon contributes more to the amino acid hydrophilicity while the base U in a codon contributes more to the amino acid hydrophobicity, the base A (in an anticodon) should contribute more hydrophobicity to the dinucleoside monophosphate in the anticodon and the base U (in an anticodon) should contribute more hydrophilicity to the dinucleoside monophosphate in the anticodon. For example, the dinucleoside monophosphates AA (Phe, Leu) and UU (Lys, Asn) have the lowest and highest hydrophilicity values, 0.023 and 0.389 respectively (see Table 4-4, where data on nucleoside 5'-monophosphate are listed; similar data on dinucleoside monophosphate can be found in Lacey & Mullins, 1983).

The Yin-Yang duality affords a sound basis for understanding the hydrophilic-hydrophobic domain structure in the genetic code. It also provides an explanation of the robustness of the distribution under the variation of amino acids in the evolution.

Another important characteristic of an amino acid is its volume (Table 4-5). Amino acids are roughly classified into two categories: the first ten are small amino acids while the last ten are large amino acids.
The large amino acid is stiffer while the small one is more flexible. So, like the hydrophobicity, the volume of an amino acid also plays an important role in protein folding. The volume classification of amino acids is shown in Figure 7, where codons encircled by solid lines code for small amino acids and those outside code for amino acids with large volume. The two kinds of amino acids classified by volume are also located in separate domains in the code table [Luo, 1992].

Table 4-5 Amino acid volume

Gly   Ala   Ser   Cys   Asp   Pro   Thr   Val   Asn   Glu
3     14    21    30    30    31    32    36    36    41

Ile   Leu   Gln   His   Met   Lys   Phe   Tyr   Arg   Trp
46    46    47    50    52    58    62    69    70    83

(in units of cubic angstrom, with a scale factor 2.01)

Conclusions

1. The synonym redundancy distribution in the genetic code is determined by the mutational parameters, the relative rates between transitional and transversional mutations and between 1-2 codon position and 3rd codon position mutations. The distribution is robust relative to the parameter choice. Under the constraints on the mutational parameters given by Eq (1.10) the pattern of codon degeneracy in the code can always be deduced.

2. The hydrophilic-hydrophobic domain in the genetic code is also robust under the mutational parameter choice and the variation of the distribution of amino acids in the code table. The robustness reflects the Yin-Yang duality existing among the four nucleobases and 64 codons. The Yin-Yang duality emphasizes the definite order and the duality-symmetry among the four nucleotides in codons.

3. MD theory gives an estimate of the accuracy of the genetic coding. The error of the genetic code comes from base mutation and translation. The two factors can be considered independently in the GMD formulation (Eq (2.1)). The mutational error can be minimized by the parameter choice based on the determination of the synonym redundancy distribution, and the translational error can be minimized through the appropriate arrangement of amino acids on the optimized hydrophilic-hydrophobic domain. In the proposed theory the optimal code is deduced through GMD minimization under the constraint of a given amino acid number and a given degeneracy degree for each amino acid. Apart from the estimation of the genetic coding accuracy, the GMD minimization reflects the selection process in the code evolution. The GMD is essentially a measure of non-fitness of the genetic code, and the ideal code (expressed by [U] in Eq (2.1)) serves as the coordinate of Wright's adaptive landscape. The landscape changes, adaptive to the constraints on the coordinate [U]. Then, the fittest code is selected out on the adaptive landscape. Therefore, in MD theory the genetic code origin is a problem of the evolution towards the optimal code (the fittest code) adiabatically on a given adaptive landscape, if the landscape changes much more slowly than the codon mutation and selection. The historical variation of the adaptive landscape (changing with the constraints on the degeneracy degree of each amino acid and the total number of encoded amino acids) is a central issue to be clarified for founding a comprehensive evolutionary theory.

4. However, from the preliminary calculation of GMD minimization under 20 amino acids with multiplicity distribution as in the standard code, we find that under the initial fixation of some early amino acids on the code and under the doublet bundle of pairs of late amino acids with a common precursor, the standard code can be deduced logically. It shows the evolutionary accessibility of the prevalent standard code and may imply the big-bang-like formation of the standard code in a relatively short time after the Last Universal Common Ancestor of extant life (LUCA).

5. The mechanism for the evolvability of the prevalent standard code is mainly due to the alteration in tRNAs. The variation of optimal GMD can be calculated by using the local minima of MD, which provides an approach to study the evolvability of the code. We find that, apart from the reassignment of a sense codon to a stop codon, the evolution of the alternative genetic code has a general trend of GMD non-increasing, and the finally observed reassignments are selective-advantageous or nearly neutral ones. Many ambiguities of the intermediate have been cleared up naturally through the selection role of MD minimization.

Acknowledgement

The author is indebted to Dr Li Xiaoqin for her help in numerical calculation on the deduction of the minimal code table.

Figure 1 The standard genetic code

Amino acids inside the solid line are hydrophilic and outside it—hydrophobic. The domain-like distribution of amino acids in the code table is called hydrophobic-hydrophilic domain. For details see section 2.

Figure 2 The minimal genetic code deduced from minimization of GMD Q'(U) (Q_min = 41722, see text, section 2)

Figure 3 The minimal code deduced by Di Giulio et al (1…

Figure 4 The lower code deduced by Freeland and Hurst (1998)

Figure 5 The dual representation of nucleotides (see text). (a) Bases in the order U C G A. (b) Bases in the order A G C U.

Figure 6 The genetic code plotted with dual representation of nucleotides

The base order has been changed to UCGA and a more symmetric hydrophobic-hydrophilic domain can be obtained in this order, as shown in the figure. Hydrophobic amino acids are located outside the solid line, and hydrophilic amino acids inside the solid line.

Figure 7 The volume classification of amino acids

The small amino acids are encircled by solid lines while the large amino acids are located outside them.

References

Barre n mitochondria. Nature hechetkin VR. 2003. Block structure and stability of the genetic code. J Theor. Biol. , Chec ol. , 922-934. Copl iation of amino acids with their codons and t S.A. , 4442-4447. Crick FHC, Barnett L, Brenner S, Watt eneral nature of the genetic code for proteins. Nature , 511-518. reeland SJ, Wu T, Keulmann N. 2003. The case for an error minimizing standard genetic code. Origins of Life , 412-41Hamermesh M. 19 . Massachusetts: Addison-WesleyHornos JEM, Hornos ic code. Phys. 404. Houen G. 1999. Evolu ce versa . BioSystems , 47 – 64. ll BG, Bankier AT, Drouin J. 1979. A different genetic code in huma282,189-194. C 177-188. hetkin VR. 2006. Genetic code from tRNA point of view. J Theor. Biey SD, Smith E, Morowitz HJ. 2005. A mechanism for the assoche origin of the genetic code. Proc. Natl. Acad. Sci. U.s-Tobin RJ. 1961. Gorigin of the genetic code. J. Mol. Biol. , 928. anco M, Medugno M. 1994. On the optimization of the ino acids in the evolution of the genetic code. J. Theor. Bi the origin of the genetic. J. Theor. Biol. ,141-144. lan A D. 1986. Solvation energy in protein folding and bineor. Biol. adi HS, Nejad HA, Torabi N. 2005. The impact of including tR of the genetic code. Bull. Math. Biol. , 1355-1368. mino acid difference formula to help explain protein evolut1991. A quantitative measure of error minimization in the geneti7. 62. Group Theory and Its Application to Physical Problems

Pub. YMM. 1993. Algebraic model for the evolution of the genetRev. Letters vi ukes TH, Osawa S. 1991. Recent evide of the genetic code. In: Phylogenetic Analysis of DNA Sequences . p79-p95. Edi Miyamoto M, Cracraft J., U.K.: Oxford Univ Juds D. 1999. The genetic code: What is it good for? An analysis of the effects of Kay

Who wrote the book of life? A history of the genetic code . California: Stanford Knig . 2001. Rewiring the keyboard: evolution of the genetic character of protein. n the origin of the Li W nd, Massachusetts: Sinauer Associates, Inc. Luo distribution of amino acids in genetic code. Origins of Life.

Collected Works on Theoretical

Luo

Physical Aspects of Life Evolution . Shanghai: Shanghai Science and Technology Luo ch to Molecular Biology.

Shanghai: Shanghai Science Luo rules for amino acids in the genetic code – The genetic code is netic code. Proc. Natl. Acad. Sci. U.S.A. , 5088-5093. e tions. J. Biol. Chem , 539-550. LE. 2000. University Press. Knight RD, Freeland SJ, Landweber LF. 1999. Selection, history and chemistry: the three faces of the genetic code. Trends Biochem. Sci. , 569-572. ht RD, Freeland SJ, Landweber LFcode. Nature Reviews Genetics. , 49-58. Kyte J, Doolittle RF. 1982. A simple method for displaying the hydropathicJ. Mol. Biol. － a review. Origins of Life , 3-42. Lacey JC, Wickramasinghe NSMD, Cook GW. 1992. Experimental studies ogenetic code and the process of protein synthesis － a review updata. Origins of life Molecular Evolution . SundarlaLuo LF. 1988. The degeneracy rule of genetic code. Origins of Life

Biophysics . Edi. Luo. Hohhot: Inner Mongolia University Press. LF. 2000. Publisher (in Chinese). LF. 2004.

Theoretic-physical Approa and Technology Publisher. LF and Li XQ. 2002a. Codinga minimal code of mutational deterioration. Origins of Life , 23-33. Luo LF and Li XQ. 2002b. Construction of genetic code from evolutionary stability. BioSystems , 83-97. Maeshiro M and Kimura M. 1998. The role of ribustness and changeability on the origin and evolution of geNieselt-Struwe K, Wills PR. 1997. The emergence of genetic coding in physical system. J. Theor. Biol. , 1-14. Nozaki Y, Tanford C. 1971. The solubility of amino acids and two glycine peptides in acqueous thanol and dioxane soluPullman B, Pullman A. 1963. Quantum Biochemistry.

N.Y.:Wiley Intersci. os MAS, Moura G, Massey SEgenetic codes. Trends in Genetics. odons. Gene , 1-11. enstein TrifoVolk MV. 1982. Physics and Biology . New York: Acad. Press, Inc. Oe etic code. Cold Spring Harbor Symp. Quant. Biol.

Wong c code at age thirty. BioEssays. , 416-425. P ional Congress of Genetics

I, 356. Yaru igand chemistry: A testable source for the genetic code. RNA , 475-484. niversity Note

The as published from 1988 to , and nces ） . It has been five years since the last paper in this series was resea still be regarded as one of the best theories on the genetic code expre mbined these papers have ed but the results remain the same. Except for a few new papers are added into the acceWeberndorfer G, Hofacker IL, Stadler PF. 2003. On the evolution of primitive genetic codes. rigins of Life . Woese CR, Dugre DH, Dugre SA, Kondo M, Saxinger WC. 1996. On the fundamental nature and volution of the genWoese CR. 1967.

The Genetic Code . New York: Harper & Row. JT. 1975. A co-evolution theory of the genetic code1909-1912. Wong JT. 1988. Evolution of the genetic code. Microbiol. Sci. , 174-181. JT. 2005. Coevolution theory of the genetiWright S.1932. The roles of mutation, inbreeding, cross-breeding and selection in evolution. In roceedings of the Sixth Internat Wyckoff G.J, Wang W, Wu CI. 2000. Rapid evolution of male reproductive genes in the descent of man. Nature , 304-309. Yang CM. 2005. On the structural regularity in nucleobases and amino acids and relationship to the origin and evolution of the genetic code. Origins of Life

Information theory and molecular biology . London: Cambridge UPress. pp178-p206

Note Added in Publication

The series of work on the origin and evolution of the genetic code was published from 1988 to 2004 (see papers: Luo 1988; Luo 1989; Luo 1992; Luo 2000; Luo & Li, 2002a; Luo & Li, 2002b, and Luo 2004 listed in references). It has been five years since the last paper in this series was published. However, we feel that till now the proposed theory, with its related calculation and research results in the series, can still be regarded as one of the best theories on the genetic code evolution. Considering that part of the work was published in domestic journals and part of the views was expressed in physical language unfamiliar to the circle of biologists, we combined these papers into one and rephrased it in a way which biologists may find easy to understand. All calculations have been checked but the results remain the same. Except for a few new papers added into the references, most referred publications are dated before 2004. We hope that through the platform of open access, this theory and its research questions will attract wider attention.