Chirality in a quaternionic representation of the genetic code

Abstract

A quaternionic representation of the genetic code, previously reported by the authors, is updated in order to incorporate the chirality of nucleotide bases and amino acids. The original representation assigns to each nucleotide base a prime integer quaternion of norm 7 and involves a function that associates with each codon, represented by three of these quaternions, another integer quaternion (the amino acid type quaternion) in such a way that the essentials of the standard genetic code (particularly its degeneracy) are preserved. To show the advantages of such a quaternionic representation we have, in turn, associated with each amino acid of a given protein, besides the type quaternion, a real quaternion determined by its position along the protein (the order quaternion), and have designed an algorithm to go from the primary to the tertiary structure of the protein by using type and order quaternions. In this context, we incorporate chirality into our representation by observing that the set of eight integer quaternions of norm 7 can be partitioned into a pair of subsets of cardinality four whose elements are mutually conjugate, and by putting them in one-to-one correspondence with the two sets of enantiomers (D and L) of the four nucleotide bases adenine, cytosine, guanine and uracil, respectively. Thus, guided by two diagrams, specifically proposed to describe the hypothetical evolution of the genetic codes corresponding to the two chiral systems of affinities (D-nucleotide bases/L-amino acids and L-nucleotide bases/D-amino acids, at reading frames 5′ → 3′ and 3′ → 5′, respectively), we define functions that in each case assign an L- (D-) amino acid type integer quaternion to the triplets of D- (L-) bases. The assignment is such that, for a given D-amino acid, the associated integer quaternion is the conjugate of the one corresponding to the L enantiomer. The chiral type quaternions obtained for the amino acids are used, together with a common set of order quaternions, to describe the folding of the two classes, L and D, of homochiral proteins.



C. Manuel Carlevaro a,b, Ramiro M. Irastorza a,c, Fernando Vericat a,d,∗

a Instituto de Física de Líquidos y Sistemas Biológicos, 59 Nro. 780, 1900, La Plata, Argentina.
b Universidad Tecnológica Nacional, Facultad Regional Buenos Aires, Mozart Nro. 2300, C14071VT, Buenos Aires, Argentina.
c Instituto de Ingeniería y Agronomía, Universidad Nacional Arturo Jauretche, 1888 Florencio Varela, Buenos Aires, Argentina.
d Grupo de Aplicaciones Matemáticas y Estadísticas de la Facultad de Ingeniería (GAMEFI), Universidad Nacional de La Plata, Calle 115 y 48, 1900 La Plata, Argentina.

Keywords:

Genetic code representation; homochirality; homochiral protein folding

1. Introduction

Homochirality of nucleic acids and proteins is one of the attributes that characterize life on Earth [1, 2]. At present, all living organisms on our planet have nucleic acids (DNA, RNA, etc.) whose nucleotide bases take, of their two possible chiral forms, only the one usually labelled D (for dextro, or right-handed); likewise, their proteins are chains of amino acids that are all enantiomers of (exclusively) the class L (for levo, or left-handed), except for the amino acid glycine, which is not a chiral molecule. To understand why the combination D for nucleotide bases and L for amino acids (and not any other) occurs in living systems is one of the greatest quests of Biology. The attempts to answer this question involve either biotic or abiotic arguments. In general, hypotheses of the first type assume that homochirality is determined by biological necessity and that it is the result of diverse selection mechanisms. Among the abiotic hypotheses are those that propose a "frozen accident" as responsible for the homogeneous chirality and those that assume the existence of some asymmetric force that selects just one of the chiral forms [3–5]. The problem with these theories is that, in general, they cannot be checked experimentally. This difficulty is (at least partially) overcome by some hypotheses that claim that homochirality and the universal genetic code arose closely related; more precisely, that the genetic code, its translation direction and homochirality emerged through a common natural process of selection [6]. This approach has the advantage that it can incorporate the idea of an ancestral direct affinity between amino acids and nucleotide triplets, with the latter subsequently acquiring new functions, within a more modern translation machinery, in the form of codons and anticodons. The key point is that this early affinity between triplets of nucleotide bases and amino acids can at present be studied in standard laboratories by simply synthesizing small RNA oligonucleotides [7]. In this way the preference of D-base triplets for L-amino acids has been demonstrated [8–16]. Moreover, the non-biological affinity of L-codons for D-amino acids has also been observed [8, 16, 17].

∗ Corresponding author.
Besides, the hypothesis of coevolution of homochirality with the genetic code makes a number of additional predictions, such as the possibility that each codon can encode at least two different amino acids (according to the reading direction: 5′ → 3′ or 3′ → 5′) […] L-ribonucleic sites for D-amino acids, there is also experimental evidence of the impossibility of incorporating D-amino acids into protein structures in present biosynthetic pathways. Specifically, chiral discrimination has been reported during aminoacylation in the active site of the aminoacyl-tRNA synthetase and also during peptide bond formation in the ribosomal peptidyl transferase center [18, 19].

The aim of this article is to show how a mathematical representation of the genetic code recently reported by us [20] can naturally incorporate chirality in such a way that the resulting description is consistent with many of the previous observations. Our original representation is guided by a diagram that we proposed to sketch the evolution of the genetic code (see Figure 2 in the next Section). The diagram is based on pioneering ideas by Crick [21, 22] and includes the physical concept of broken symmetry [23–25] in a very simple form that resembles the energy levels of an atom. The representation uses Hamilton quaternions [26, 27] as its main tool. These mathematical objects are a generalization of the complex numbers and obey an algebra similar to theirs in many respects, but with the very important (for our purposes) property that the product is, in general, non-commutative. In addition, quaternions are ideal for representing spatial rotations [28], with important advantages over the classical matrix representation.

In our quaternionic representation we assign to each amino acid in a given protein two quaternions: an integer one according to which of the 20 standard amino acids it is (type quaternion) and a real one that determines its order within the protein primary structure (order quaternion). The type quaternions are obtained as a quaternionic function of the codons, where each of the three nucleotide bases is associated with an integer quaternion; the form of this function is inspired by the code evolution diagram.
The order quaternions, on the other hand, play a fundamental role in relation to the folding of the protein [29, 30].

The integer quaternions that we choose to associate with the nucleotide bases belong to a maximum-cardinality subset of the set

$$H_7(\mathbb{Z}) = \left\{ (a_0, a_1, a_2, a_3) : a_0, a_1, a_2, a_3 \in \mathbb{Z};\ a_0^2 + a_1^2 + a_2^2 + a_3^2 = 7,\ a_0 > 0 \right\}$$

with the property that it does not contain pairs of conjugate quaternions. The set $H_7(\mathbb{Z})$ has 7 + 1 = 8 elements [31], and so the chosen subset has 4 quaternions, as it should. This suggests partitioning the set $H_7(\mathbb{Z})$ in the form $H_7(\mathbb{Z}) = H_{7,D}(\mathbb{Z}) \cup H_{7,L}(\mathbb{Z})$ (where the four quaternions of $H_{7,D}(\mathbb{Z})$ and the four quaternions of $H_{7,L}(\mathbb{Z})$ are mutually conjugate) and associating the elements of $H_{7,D}(\mathbb{Z})$ and $H_{7,L}(\mathbb{Z})$ with the four D-nucleotide bases and the four L-nucleotide bases of D-RNA and L-RNA molecules, respectively. In this way, chirality can be included in our formalism.

In the next Section two diagrams for the evolution of the genetic code are presented, assuming that the two possible combinations (D, L) and (L, D) for the chirality of base triplets and amino acids were present from the beginning. The diagrams freeze at a certain step, displaying two different genetic codes; the one corresponding to the combination (D, L) gives the standard code of present-day living systems. Guided by these diagrams, in Section 3 we consider the quaternionic representation of both codes and assign type quaternions to L- and D-amino acids in such a way that, for a given D-amino acid, the associated integer quaternion is the conjugate of the one corresponding to the L enantiomer. Section 4 is devoted to showing the advantages of the quaternionic representation by considering the folding of L- and D-proteins. Some remarks are finally made in Section 5.
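The counting argument above can be illustrated with a minimal Python sketch. It assumes that the eight elements in question are the norm-7 integer quaternions with real part 2, which is what the explicit lists in Eqs. (4) and (5) suggest. The helper names are hypothetical, and the greedy split shown is only one of several possible conjugate-free choices, so it need not coincide with the assignment used in the paper.

```python
from itertools import product


def conjugate(q):
    """Quaternion conjugate: flip the sign of the vector part."""
    a0, a1, a2, a3 = q
    return (a0, -a1, -a2, -a3)


def norm(q):
    """Norm in the sense used in the text: sum of squared components."""
    return sum(c * c for c in q)


# Assumption: the eight elements are the norm-7 integer quaternions with real part 2.
H7 = [(2, a1, a2, a3) for a1, a2, a3 in product((1, -1), repeat=3)]


def split_conjugate_free(elements):
    """Greedily keep one member of each conjugate pair; return both halves."""
    chosen, partners = [], []
    for q in elements:
        if q not in chosen and q not in partners:
            chosen.append(q)
            partners.append(conjugate(q))
    return chosen, partners


half_a, half_b = split_conjugate_free(H7)
print(len(H7), len(half_a), len(half_b))
```

Each half contains four quaternions and no conjugate pair, and every element of one half is the conjugate of an element of the other.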

2. Chiral diagrams for the evolution of the genetic code

Here we generalize, by including chirality, the diagram proposed in Ref. [20] to describe the evolution of the genetic code. First we take into account that Miller-Urey-like experiments on the synthesis of organic molecules in a primordial environment [32, 33] show the formation of amino acids in racemic mixtures. Also, as already mentioned, studies of the affinity of amino acids for small RNA oligonucleotides have shown the preference of D- and L-base triplets for L- and D-amino acids, respectively [16], so we assume that, from the beginning, the two chirality systems (D-bases/L-amino acids) and (L-bases/D-amino acids) evolved independently of each other. Moreover, as we describe next, the proposed evolution diagrams for the two systems are very similar, since we assume that the changes in the corresponding genetic codes are basically independent of chirality. Chirality manifests itself only in settling the two affinity systems.

Figure 1: Diagram for the beginnings of the genetic code evolution. The one-letter convention for amino acids is used. The two chiral combinations (D-bases/L-amino acids and L-bases/D-amino acids) are identified with appropriate subindices (see text). Note the achirality of the amino acid glycine (G). The translational direction considered in each case is also shown. Rectangles with four bases imply fourfold degeneracy with respect to those bases, so at that moment each of the four amino acids considered was encoded by 16 triplets.

To construct our diagram of the code evolution in Ref. [20] we followed Crick [22] and considered that, in the first evolution steps, only the second base of the base triplets was effective in codifying (binding) amino acids. Here we assume that this fact is valid for the L-base triplets as well as for the D-ones. […] $C_I$ (I-cytosine), $G_I$ (I-guanine), $U_I$ (I-uracil) and $A_I$ (I-adenine) (I = D (L), respectively), independently of which the first and third bases are. In Figure 1, which sketches this first step of the evolution of the codes, this fact is denoted by a rectangle containing the four letters. This is consistent with Crick's suggestion that only a few amino acids were coded at the beginning.

[Figure 2 diagram; panel labels: First base of codon, Second base of codon, Third base of codon, First and second bases of codon, Amino acid.]

Figure 2: Complete diagram of the genetic code evolution for the chiral combination D-bases/L-amino acids. The one-letter convention for amino acids is used and, for simplicity, the chirality subindices on bases and amino acids have been dropped. The direction of temporal evolution is from left to right. Rectangles with two or more bases imply degeneracy with respect to those bases. The broken lines link different sets of codons that encode the same amino acid in the case of sixfold degeneracy. Arrows and plain lines indicate which codons continue to codify the same amino acid and which start to codify a new one, respectively, after the symmetry is broken (see text). The natural numbers at the right of the amino acids give the temporal order of the amino acids in the Trifonov consensus scale [34].

[Figure 3 diagram; panel labels: First base of codon, Second base of codon, Third base of codon, Second and third bases of codon, Amino acid.]

Figure 3: Complete diagram of the genetic code evolution for the chiral combination L-bases/D-amino acids. The one-letter convention for amino acids is used and, for simplicity, the chirality subindices on bases and amino acids have been dropped. The direction of temporal evolution is from right to left. Rectangles with two or more bases imply degeneracy with respect to those bases. The broken lines link different sets of codons that encode the same amino acid in the case of sixfold degeneracy. Arrows and plain lines indicate which codons continue to codify the same amino acid and which start to codify a new one, respectively, after the symmetry is broken (see text).

According to the diagram, $C_I$ would codify J-alanine ($A_J$); $G_I$, glycine (G); $U_I$, J-valine ($V_J$); and $A_I$, J-aspartic acid ($D_J$) (I,J = D,L or L,D), whatever the first and third bases are. It is worth noting here that the four amino acids that we assume were the first ones to be codified are the first four in the Trifonov [34] consensus temporal order scale for the appearance of the amino acids (column of natural numbers in Figure 2). The four amino acids A, G, V and D were also the first four to appear under simulation of primitive Earth conditions in Miller experiments [32, 33]. We must also point out the two reading frames we are considering: 5′ → 3′ for D-triplets and 3′ → 5′ for L-triplets. In Figures 2 and 3 we show how the evolution proceeds for the pairs (D,L) and (L,D), respectively.
Since no confusion is possible, we have dropped the subindices that denote chiral class to simplify the notation. As the left (right) part of the diagram of Figure 2 (3) shows, our version of the primitive code is highly degenerate: in principle each of the four amino acids, $A_J$, G, $V_J$ and $D_J$, could be encoded by $4^2 = 16$ codons (see also Figure 1). Physically, the idea of degeneracy is closely related to the concept of symmetry, and a very illustrative way to think about these concepts is by analogy with the energy levels of an atom. In our case we would have four levels, each indexed by the letter corresponding to the second codon base, say $C_I$, $G_I$, $U_I$ and $A_I$ (main quantum number). We thus assume that, as the code evolves, the symmetry that makes the amino acid codification independent of the first (third) base of the codon disappears. Because of this symmetry breaking, part of the degeneracy also disappears. In the diagrams each of the four initial levels splits into four new levels, one for each of the possible bases ($C_I$, $G_I$, $U_I$ and $A_I$) at the first (third) place of the codon (secondary quantum number). Now we have a total of 16 levels, each indexed by two letters (the first and second (second and third) bases of the codon). Each level is fourfold degenerate in the codon's third (first) base. One of the new levels continues to codify the same amino acid as before the splitting, whereas the other three codify a new amino acid each. We indicate with an arrow the four groups of codons that conserve the amino acid and with a simple line those that replace the amino acid by a new one.

As the code continues to evolve it undergoes new symmetry breakings, so that the third (first) base of some codons comes into use or, in the atomic analogy, some of the fourfold degenerate levels split into two levels, each twofold degenerate.
Those levels marked with an arrow continue to codify the same amino acid, whereas the other levels replace it with a new one. Eventually, in subsequent steps, a few of the twofold degenerate levels split once more, giving two non-degenerate levels each. This is the case for the codons that codify methionine ($M_J$), tryptophan ($W_J$) and (again) the stop signal. The case of isoleucine ($I_J$) is a particular one, since the split level coincides with the twofold one that represents the two codons which continue to codify the same amino acid. In this way, isoleucine is the only amino acid coded by three codons. The stop signal is also threefold degenerate, since it is coded by two groups of codons, one twofold degenerate and the other non-degenerate. At this step of the evolution the code froze to give what would be its present form. It is worth mentioning that the code evolution gives as a particular result that the amino acids serine ($S_J$), arginine ($R_J$) and leucine ($L_J$) are (would be) at present coded by two groups of codons each. In the three cases one of the groups is fourfold degenerate and the other is twofold degenerate, so that these are the only three sixfold-degenerate amino acids. We point out this property in the diagrams with a broken line linking the two groups of codons. The two groups of codons that codify the stop signal are also linked by a broken line. We remark again on the similarity of the two evolution diagrams, which means that, in our description, the symmetry breaking and the code freezing must depend on causes other than chirality.

We observe that the diagram of Figure 2 is consistent with the present-day standard genetic code (Figure 4). The diagram of Figure 3, on the other hand, gives the hypothetical code shown in Figure 5. This code is not actually observed in Nature.
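The degeneracy pattern described above can be checked directly against the standard genetic code. The sketch below is only illustrative: the 64-letter string is the usual compact encoding of the standard code with bases ordered U, C, A, G (an encoding convention assumed here, not taken from the text), and `*` marks the stop signal.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code; codons are enumerated with the third base varying fastest.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of codons per amino acid (or stop signal) in the present-day code.
degeneracy = Counter(CODE.values())

sixfold = sorted(aa for aa, n in degeneracy.items() if n == 6)
nondegenerate = sorted(aa for aa, n in degeneracy.items() if n == 1)

# Primitive stage of the diagram: only the second base matters,
# so each of the four initial levels groups 16 codons.
primitive_levels = Counter(codon[1] for codon in CODE)

print(sixfold, nondegenerate, degeneracy["I"], degeneracy["*"])
print(dict(primitive_levels))
```

The counts reproduce the statements in the text: serine, arginine and leucine are sixfold degenerate, methionine and tryptophan are non-degenerate, and isoleucine and the stop signal each have three codons.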
We would argue that, although our world (and perhaps the whole Universe) has the right symmetry for the existence of at least two systems of homochirality, namely (D-nucleic acids and L-proteins) and (L-nucleic acids and D-proteins), the refinements in the decoding and synthesis machineries of living organisms, perhaps in search of better robustness and optimized error-correcting tools, have broken that symmetry by discriminating between the two systems in the course of evolution, causing the disappearance of the second one from the Earth.

Figure 4: Standard genetic code. The three-letter convention for the amino acids is used and the third base of the codons is highlighted in bold. For simplicity the subindices that identify the enantiomer class have been dropped. The reading direction of codons is 5′ → 3′.

Figure 5: […] L-bases/D-amino acids. The three-letter convention for the amino acids is used and the first base of the codons is highlighted in bold. For simplicity the subindices that identify the enantiomer class have been dropped. The reading direction of codons is 3′ → 5′.

3. Quaternionic representation of the genetic code and chirality

Based on the diagrams of Figures 2 and 3, we propose, within a common formalism, quaternionic representations of the genetic codes corresponding to both homochiral combinations (D,L) and (L,D) according to the scheme

$$
\begin{array}{ccc}
\mathcal{B}_I & \longrightarrow & \mathcal{A}_J \\
\downarrow & & \downarrow \\
H^3_{7,I}(\mathbb{Z}) & \longrightarrow & H(\mathbb{Z})
\end{array}
\qquad \text{with } I,J = D,L \text{ and } L,D \tag{1}
$$

where $H(\mathbb{Z})$ denotes the set of integer quaternions (Lipschitz integers),

$$\mathcal{B}_I = \{ C_I, G_I, U_I, A_I \}^3 \quad (I = D, L), \tag{2}$$

$$\mathcal{A}_J = \{ P_J, A_J, S_J, T_J, R_J, G, C_J, W_J, L_J, V_J, F_J, I_J, M_J, H_J, Q_J, D_J, E_J, Y_J, N_J, K_J, \mathrm{Stop}_J \} \quad (J = L, D), \tag{3}$$

$$H_{7,D}(\mathbb{Z}) = \{ (2,1,1,1),\ (2,-1,1,1),\ (2,1,-1,1),\ (2,1,1,-1) \} \tag{4}$$

and

$$H_{7,L}(\mathbb{Z}) = \{ (2,-1,-1,-1),\ (2,1,-1,-1),\ (2,-1,1,-1),\ (2,-1,-1,1) \}. \tag{5}$$
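Here $\mathcal{B}_I$ is the set of the 64 I-codons (I = D, L) and we assume that the correspondences $\mathcal{B}_I \to \mathcal{A}_J$ (I,J = D,L; L,D) are the genetic codes described by Figures 4 and 5, whereas the functions $\mathcal{B}_I \to H^3_{7,I}(\mathbb{Z})$ assign to each I-codon a triplet of quaternions of the set $H_{7,I}(\mathbb{Z})$ (I = D, L). Since the norm of all the quaternions of $H_7(\mathbb{Z})$ is 7, which is a prime number, all the elements of $H_7(\mathbb{Z})$ are prime quaternions [31]. Assigning prime quaternions to the nucleotide bases gives them a certain character of elemental molecules in the present context.

In what follows, in order to simplify the notation, we assign natural numbers to identify the bases and the amino acids:

C → 1, G → 2, U → 3, A → 4;

P → 1, A → 2, S → 3, T → 4, R → 5, G → 6, C → 7, W → 8, L → 9, V → 10, F → 11, I → 12, M → 13, H → 14, Q → 15, D → 16, E → 17, Y → 18, N → 19, K → 20, Stop → 21.

We then define the functions

$$F_{IJ} : H^3_{7,I}(\mathbb{Z}) \to H(\mathbb{Z}), \qquad (q_{\beta I}, q_{\gamma I}, q_{\delta I}) \mapsto \alpha_{iJ} = F_{IJ}[(q_{\beta I}, q_{\gamma I}, q_{\delta I})] \qquad (I,J = D,L \text{ and } L,D) \tag{6}$$

by Eqs. (7) and (8):

$$
\begin{array}{ll}
P_L \to \alpha_{1L} = q_{1D}\,q_{1D} & (\beta=1,\ \gamma=1,\ \delta=1,2,3,4)\\
A_L \to \alpha_{2L} = q_{2D}\,q_{1D} & (\beta=2,\ \gamma=1,\ \delta=1,2,3,4)\\
S_L \to \alpha_{3L} = q_{3D}\,q_{1D} = q_{4D}\,q_{2D} + \gamma_{2D;13} & (\beta=3,\ \gamma=1,\ \delta=1,2,3,4 \ \text{or}\ \beta=4,\ \gamma=2,\ \delta=1,3)\\
T_L \to \alpha_{4L} = q_{4D}\,q_{1D} & (\beta=4,\ \gamma=1,\ \delta=1,2,3,4)\\
R_L \to \alpha_{5L} = q_{1D}\,q_{2D} = q_{4D}\,q_{2D} + \gamma_{2D;24} & (\beta=1,\ \gamma=2,\ \delta=1,2,3,4 \ \text{or}\ \beta=4,\ \gamma=2,\ \delta=2,4)\\
G \to \alpha_{6} = \tfrac{7}{2}\,( q_{2D}\,q_{2D} + \tilde q_{2D}\,\tilde q_{2D} ) & (\beta=2,\ \gamma=2,\ \delta=1,2,3,4)\\
C_L \to \alpha_{7L} = q_{3D}\,q_{2D} + \gamma_{2D;13} & (\beta=3,\ \gamma=2,\ \delta=1,3)\\
W_L \to \alpha_{8L} = q_{3D}\,q_{2D} + \gamma_{2D;24} + \delta_{2D;2} & (\beta=3,\ \gamma=2,\ \delta=2)\\
L_L \to \alpha_{9L} = q_{1D}\,q_{3D} = q_{3D}\,q_{3D} + \gamma_{3D;24} & (\beta=1,\ \gamma=3,\ \delta=1,2,3,4 \ \text{or}\ \beta=3,\ \gamma=3,\ \delta=2,4)\\
V_L \to \alpha_{10L} = q_{2D}\,q_{3D} & (\beta=2,\ \gamma=3,\ \delta=1,2,3,4)\\
F_L \to \alpha_{11L} = q_{3D}\,q_{3D} + \gamma_{3D;13} & (\beta=3,\ \gamma=3,\ \delta=1,3)\\
I_L \to \alpha_{12L} = q_{4D}\,q_{3D} + \gamma_{3D;13} = q_{4D}\,q_{3D} + \gamma_{3D;24} + \delta_{3D;4} & (\beta=4,\ \gamma=3,\ \delta=1,3,4)\\
M_L \to \alpha_{13L} = q_{4D}\,q_{3D} + \gamma_{3D;24} + \delta_{3D;2} & (\beta=4,\ \gamma=3,\ \delta=2)\\
H_L \to \alpha_{14L} = q_{1D}\,q_{4D} + \gamma_{4D;13} & (\beta=1,\ \gamma=4,\ \delta=1,3)\\
Q_L \to \alpha_{15L} = q_{1D}\,q_{4D} + \gamma_{4D;24} & (\beta=1,\ \gamma=4,\ \delta=2,4)\\
D_L \to \alpha_{16L} = q_{2D}\,q_{4D} + \gamma_{4D;13} & (\beta=2,\ \gamma=4,\ \delta=1,3)\\
E_L \to \alpha_{17L} = q_{2D}\,q_{4D} + \gamma_{4D;24} & (\beta=2,\ \gamma=4,\ \delta=2,4)\\
Y_L \to \alpha_{18L} = q_{3D}\,q_{4D} + \gamma_{4D;13} & (\beta=3,\ \gamma=4,\ \delta=1,3)\\
N_L \to \alpha_{19L} = q_{4D}\,q_{4D} + \gamma_{4D;13} & (\beta=4,\ \gamma=4,\ \delta=1,3)\\
K_L \to \alpha_{20L} = q_{4D}\,q_{4D} + \gamma_{4D;24} & (\beta=4,\ \gamma=4,\ \delta=2,4)\\
\mathrm{Stop}_L \to \alpha_{21L} = q_{3D}\,q_{2D} + \gamma_{2D;24} + \delta_{2D;4} = q_{3D}\,q_{4D} + \gamma_{4D;24} & (\beta=3,\ \gamma=2,\ \delta=4 \ \text{or}\ \beta=3,\ \gamma=4,\ \delta=2,4)
\end{array}
\tag{7}
$$

$$
\begin{array}{ll}
P_D \to \alpha_{1D} = q_{1L}\,q_{1L} & (\beta=1,2,3,4,\ \gamma=1,\ \delta=1)\\
A_D \to \alpha_{2D} = q_{1L}\,q_{2L} & (\beta=1,2,3,4,\ \gamma=1,\ \delta=2)\\
S_D \to \alpha_{3D} = q_{1L}\,q_{3L} = q_{2L}\,q_{4L} + \gamma_{2L;13} & (\beta=1,2,3,4,\ \gamma=1,\ \delta=3 \ \text{or}\ \beta=1,3,\ \gamma=2,\ \delta=4)\\
T_D \to \alpha_{4D} = q_{1L}\,q_{4L} & (\beta=1,2,3,4,\ \gamma=1,\ \delta=4)\\
R_D \to \alpha_{5D} = q_{2L}\,q_{1L} = q_{2L}\,q_{4L} + \gamma_{2L;24} & (\beta=1,2,3,4,\ \gamma=2,\ \delta=1 \ \text{or}\ \beta=2,4,\ \gamma=2,\ \delta=4)\\
G \to \alpha_{6} = \tfrac{7}{2}\,( q_{2L}\,q_{2L} + \tilde q_{2L}\,\tilde q_{2L} ) & (\beta=1,2,3,4,\ \gamma=2,\ \delta=2)\\
C_D \to \alpha_{7D} = q_{2L}\,q_{3L} + \gamma_{2L;13} & (\beta=1,3,\ \gamma=2,\ \delta=3)\\
W_D \to \alpha_{8D} = q_{2L}\,q_{3L} + \gamma_{2L;24} + \delta_{2L;2} & (\beta=2,\ \gamma=2,\ \delta=3)\\
L_D \to \alpha_{9D} = q_{3L}\,q_{1L} = q_{3L}\,q_{3L} + \gamma_{3L;24} & (\beta=1,2,3,4,\ \gamma=3,\ \delta=1 \ \text{or}\ \beta=2,4,\ \gamma=3,\ \delta=3)\\
V_D \to \alpha_{10D} = q_{3L}\,q_{2L} & (\beta=1,2,3,4,\ \gamma=3,\ \delta=2)\\
F_D \to \alpha_{11D} = q_{3L}\,q_{3L} + \gamma_{3L;13} & (\beta=1,3,\ \gamma=3,\ \delta=3)\\
I_D \to \alpha_{12D} = q_{3L}\,q_{4L} + \gamma_{3L;13} = q_{3L}\,q_{4L} + \gamma_{3L;24} + \delta_{3L;4} & (\beta=1,3,4,\ \gamma=3,\ \delta=4)\\
M_D \to \alpha_{13D} = q_{3L}\,q_{4L} + \gamma_{3L;24} + \delta_{3L;2} & (\beta=2,\ \gamma=3,\ \delta=4)\\
H_D \to \alpha_{14D} = q_{4L}\,q_{1L} + \gamma_{4L;13} & (\beta=1,3,\ \gamma=4,\ \delta=1)\\
Q_D \to \alpha_{15D} = q_{4L}\,q_{1L} + \gamma_{4L;24} & (\beta=2,4,\ \gamma=4,\ \delta=1)\\
D_D \to \alpha_{16D} = q_{4L}\,q_{2L} + \gamma_{4L;13} & (\beta=1,3,\ \gamma=4,\ \delta=2)\\
E_D \to \alpha_{17D} = q_{4L}\,q_{2L} + \gamma_{4L;24} & (\beta=2,4,\ \gamma=4,\ \delta=2)\\
Y_D \to \alpha_{18D} = q_{4L}\,q_{3L} + \gamma_{4L;13} & (\beta=1,3,\ \gamma=4,\ \delta=3)\\
N_D \to \alpha_{19D} = q_{4L}\,q_{4L} + \gamma_{4L;13} & (\beta=1,3,\ \gamma=4,\ \delta=4)\\
K_D \to \alpha_{20D} = q_{4L}\,q_{4L} + \gamma_{4L;24} & (\beta=2,4,\ \gamma=4,\ \delta=4)\\
\mathrm{Stop}_D \to \alpha_{21D} = q_{2L}\,q_{3L} + \gamma_{2L;24} + \delta_{2L;4} = q_{4L}\,q_{3L} + \gamma_{4L;24} & (\beta=4,\ \gamma=2,\ \delta=3 \ \text{or}\ \beta=2,4,\ \gamma=4,\ \delta=3)
\end{array}
\tag{8}
$$

The importance of working with objects that obey a non-commutative algebra is evident from these functions, since otherwise the amino acids $A_J$ and $R_J$, and also $S_J$ and $L_J$, would be associated with the same quaternion. The expression for the amino acid glycine (G) takes into account that it is not chiral. The factor 7 has to do with the facts that: a) the norm of the type quaternions can roughly be taken as a measure of the information needed to codify the corresponding amino acid, in the sense that the larger the norm, the larger the necessary information (see Ref. [20]); b) G is fourfold degenerate and the norm of all the other fourfold-degenerate amino acids is 49 (see Eqs. 7-10).

In Eqs. (7) and (8), the quaternions $\gamma_{iD;jk}$ ($\gamma_{iL;jk}$) account for the level splitting when the second base of the codon is i and the third (first) base is jk = 13 ($C_I U_I$) or 24 ($G_I A_I$), I = D (L). Analogously, the quaternion $\delta_{iD;j}$ ($\delta_{iL;j}$) accounts for the level splitting when the second base of the codon is i and the third (first) base is j = 2 ($G_I$) or 4 ($A_I$), I = D (L). Thus, in principle, the unknown quaternions are $\gamma_{2I;13}$, $\gamma_{2I;24}$, $\gamma_{3I;13}$, $\gamma_{3I;24}$, $\gamma_{4I;13}$, $\gamma_{4I;24}$ and $\delta_{2I;2}$, $\delta_{2I;4}$, $\delta_{3I;2}$ and $\delta_{3I;4}$.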
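Of these 10 unknown quaternions we can find 5, namely $\gamma_{2I;13}$, $\gamma_{2I;24}$, $\gamma_{3I;13}$, $\gamma_{3I;24}$ and $\gamma_{4I;24}$, by requiring that those amino acids which are coded by two different groups of codons (the sixfold-degenerate amino acids and the stop signal) be associated with a unique quaternion, and also that the two ways to reach isoleucine (I) give the same quaternion (see Figure 2 (Figure 3)). To obtain the quaternions $\delta_{2I;2}$, $\delta_{2I;4}$, $\delta_{3I;2}$ and $\delta_{3I;4}$ we have assigned to those levels that cannot split any further (non-degenerate levels) the product of the quaternions associated with each of the corresponding bases. For I = D these are $\alpha_{8L} = q_{3D}\,q_{2D}\,q_{2D}$; $\alpha_{12L} = q_{4D}\,q_{3D}\,q_{4D}$; $\alpha_{13L} = q_{4D}\,q_{3D}\,q_{2D}$; $\alpha_{21L} = q_{3D}\,q_{2D}\,q_{4D}$; the case I = L is analogous, with the codons of Eq. (8). Finally, for the remaining unknown quaternion $\gamma_{4I;13}$ we have proposed $\gamma_{4I;13} = -\gamma_{4I;24}$.

Taking $q_{1D} = (2,1,1,1)$, $q_{2D} = (2,-1,1,1)$, $q_{3D} = (2,1,-1,1)$ and $q_{4D} = (2,1,1,-1)$ in Eq. (7), we have explicitly obtained

$$
\begin{array}{lll}
\alpha_{1L} = (1,4,4,4) & \alpha_{8L} = (6,-15,-1,9) & \alpha_{15L} = (16,-3,7,1)\\
\alpha_{2L} = (3,0,6,2) & \alpha_{9L} = (3,6,0,2) & \alpha_{16L} = (-8,3,3,-3)\\
\alpha_{3L} = (3,2,0,6) & \alpha_{10L} = (5,2,2,4) & \alpha_{17L} = (18,-7,5,-1)\\
\alpha_{4L} = (3,6,2,0) & \alpha_{11L} = (2,17,1,3) & \alpha_{18L} = (-8,9,1,1)\\
\alpha_{5L} = (3,0,2,6) & \alpha_{12L} = (6,17,3,-3) & \alpha_{19L} = (-12,9,3,-5)\\
\alpha_{6} = (7,0,0,0) & \alpha_{13L} = (18,3,-1,3) & \alpha_{20L} = (14,-1,5,-3)\\
\alpha_{7L} = (3,-2,-6,8) & \alpha_{14L} = (-10,7,5,-1) & \alpha_{21L} = (18,-1,3,3).
\end{array}
\tag{9}
$$

Analogously, taking $q_{1L} = \tilde q_{1D} = (2,-1,-1,-1)$, $q_{2L} = \tilde q_{2D} = (2,1,-1,-1)$, $q_{3L} = \tilde q_{3D} = (2,-1,1,-1)$ and $q_{4L} = \tilde q_{4D} = (2,-1,-1,1)$ in Eq. (8), we have

$$
\begin{array}{lll}
\alpha_{1D} = (1,-4,-4,-4) & \alpha_{8D} = (6,15,1,-9) & \alpha_{15D} = (16,3,-7,-1)\\
\alpha_{2D} = (3,0,-6,-2) & \alpha_{9D} = (3,-6,0,-2) & \alpha_{16D} = (-8,-3,-3,3)\\
\alpha_{3D} = (3,-2,0,-6) & \alpha_{10D} = (5,-2,-2,-4) & \alpha_{17D} = (18,7,-5,1)\\
\alpha_{4D} = (3,-6,-2,0) & \alpha_{11D} = (2,-17,-1,-3) & \alpha_{18D} = (-8,-9,-1,-1)\\
\alpha_{5D} = (3,0,-2,-6) & \alpha_{12D} = (6,-17,-3,3) & \alpha_{19D} = (-12,-9,-3,5)\\
\alpha_{6} = (7,0,0,0) & \alpha_{13D} = (18,-3,1,-3) & \alpha_{20D} = (14,1,-5,3)\\
\alpha_{7D} = (3,2,6,-8) & \alpha_{14D} = (-10,-7,-5,1) & \alpha_{21D} = (18,1,-3,-3).
\end{array}
\tag{10}
$$

We denote the sets of quaternions assigned to the L- and D-amino acids, as given by Eqs. (9) and (10), by $H_{\alpha L}(\mathbb{Z})$ and $H_{\alpha D}(\mathbb{Z})$, respectively. We see that the elements of $H_{\alpha D}(\mathbb{Z})$ and $H_{\alpha L}(\mathbb{Z})$ verify $\alpha_{iD} = \tilde\alpha_{iL}$ (i = 1, 2, …, 21); that is, the quaternions assigned to the two enantiomers of a given amino acid are mutually conjugate.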
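The construction of the type quaternions can be retraced numerically. The following Python sketch is an illustration under stated assumptions rather than the authors' code: it assumes that each product in Eq. (7) is taken in the order first base times second base (times third base for the non-degenerate levels), that Eq. (8) uses the same factors in reverse order, and that the splitting quaternions are fixed by the consistency conditions described in the text. The function and variable names are hypothetical.

```python
def qmul(p, q):
    """Hamilton product of two quaternions given as 4-tuples."""
    a0, a1, a2, a3 = p
    b0, b1, b2, b3 = q
    return (a0*b0 - a1*b1 - a2*b2 - a3*b3,
            a0*b1 + a1*b0 + a2*b3 - a3*b2,
            a0*b2 - a1*b3 + a2*b0 + a3*b1,
            a0*b3 + a1*b2 - a2*b1 + a3*b0)

def qadd(p, q): return tuple(x + y for x, y in zip(p, q))
def qsub(p, q): return tuple(x - y for x, y in zip(p, q))
def conj(p): return (p[0], -p[1], -p[2], -p[3])

def type_quaternions(q, reverse=False):
    """Type quaternions alpha_1..alpha_21 from base quaternions q[1..4].

    reverse=False follows Eq. (7); reverse=True multiplies the same
    factors in the opposite order (assumed form of Eq. (8)).
    """
    def m(*idx):
        idx = idx[::-1] if reverse else idx
        out = q[idx[0]]
        for k in idx[1:]:
            out = qmul(out, q[k])
        return out

    a = {1: m(1, 1), 2: m(2, 1), 3: m(3, 1), 4: m(4, 1),
         5: m(1, 2), 9: m(1, 3), 10: m(2, 3)}
    gg = m(2, 2)
    a[6] = tuple(7 * x // 2 for x in qadd(gg, conj(gg)))   # achiral glycine
    # Non-degenerate levels: plain triple products of the base quaternions.
    a[8], a[12], a[13], a[21] = m(3, 2, 2), m(4, 3, 4), m(4, 3, 2), m(3, 2, 4)
    g2_13 = qsub(a[3], m(4, 2))      # serine must have a unique quaternion
    g3_13 = qsub(a[12], m(4, 3))     # two routes to isoleucine must agree
    g4_24 = qsub(a[21], m(3, 4))     # stop signal must have a unique quaternion
    g4_13 = tuple(-x for x in g4_24) # proposed closure: gamma_{4;13} = -gamma_{4;24}
    a[7] = qadd(m(3, 2), g2_13)
    a[11] = qadd(m(3, 3), g3_13)
    for i, (first, g) in zip(range(14, 21),
                             [(1, g4_13), (1, g4_24), (2, g4_13), (2, g4_24),
                              (3, g4_13), (4, g4_13), (4, g4_24)]):
        a[i] = qadd(m(first, 4), g)
    return a

q_D = {1: (2, 1, 1, 1), 2: (2, -1, 1, 1), 3: (2, 1, -1, 1), 4: (2, 1, 1, -1)}
q_L = {k: conj(v) for k, v in q_D.items()}

alpha_L = type_quaternions(q_D)                 # L-amino acids from D-bases
alpha_D = type_quaternions(q_L, reverse=True)   # D-amino acids from L-bases
print(alpha_L[2], alpha_L[5])   # alanine vs arginine: the factor order matters
```

Under these assumptions the two dictionaries are related entry by entry through quaternion conjugation, which is the property stated for Eqs. (9) and (10).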

4. Folding of L- and D-proteins in the quaternions formalism

We have presented quaternionic representations of the standard genetic code for living systems, where the chiral combination (D-bases/L-amino acids) is preferred, and also of a hypothetical genetic code for systems in which we assume that the combination (L-bases/D-amino acids) prevails over other possible chiral combinations. These representations reproduce the structure of the corresponding codes, particularly their degeneracy. However, what distinguishes the quaternionic representation from most of the available mathematical representations is the resulting assignment of quaternions to the amino acids. Because of the advantages of using quaternions to describe spatial rotations, the association of amino acids with quaternions opens new horizons beyond the representation of the genetic code. In this context, we consider here the suitability of this association, together with our characterization of chirality, to account for the folding of homochiral proteins formed exclusively by L- or D-amino acids.

The primary structure of a J-protein (J = L, D) formed by N J-amino acids is a sequence $A^J_1, A^J_2, \dots, A^J_N$ with $A^J_i \in \mathcal{A}_J$. Our aim is to obtain from this sequence the spatial coordinates of each of the atoms of all the amino acids that constitute the protein when it is in the native (or functional) state (tertiary structure). For L-proteins, we take as such the one corresponding to the protein in physiological solution, whose coordinates can be obtained, after crystallization, by applying, for example, X-ray diffraction methods.
That is the case for most of the proteins whose coordinates are stored in the Protein Data Bank [35]. For D-proteins, since very few such proteins have been synthesized in the laboratory [36, 37], we take as tertiary structure the mirror image of the corresponding experimental L-protein structure.

In principle we restrict ourselves to determining the coordinates of just the alpha-carbon atoms of the protein chain, which is not a severe restriction since it is known that there exist (at least for L-proteins) very efficient algorithms for going from this trace representation to the full-atom one [38]. We also take into account that, in our quaternionic representation, the J-amino acid sequence is expressed as a sequence of quaternions $p^J_1, p^J_2, \dots, p^J_N$ with $p^J_i \in H_{\alpha J}(\mathbb{Z})$. Under these conditions we now present an algorithm to determine the spatial coordinates of the alpha-carbon atoms of the protein.

First we observe that, although adjacent alpha-carbon atoms are not covalently bonded, their distance is notably stable and takes very similar values for all pairs within a given protein and also for those belonging to different proteins. So in our calculations we assume that all these distances are equal to a unique value $d_{C_\alpha - C_\alpha} = 3.80$ Å. We then determine on the unit sphere centered at the origin a point for each of the amino acids (alpha-carbon atoms) in the protein sequence. To the last one we assign the origin directly; the preceding one is located at the intersection of the z axis with the sphere surface (versor $\hat{e}_z$). To each of the remaining alpha-carbon atoms we assign the point on the sphere surface that results from rotating the versor $\hat{e}_z$ (north pole) by a quaternion. For the i-th alpha-carbon atom in the J-sequence, the quaternion responsible for the rotation is denoted $\hat\beta^J_i$ (i = 3, 4, …, N). We then expand the chain of alpha-carbon atoms from their location on the sphere into the three-dimensional configuration of the protein backbone (see Figure 6) by means of the following iterative procedure, where initially the $\mathbf{r}_j$'s are on the sphere surface:

    do i = 1, N − 1
        δr = r_{i+1}
        do j = 1, i
            r_j = r_j + δr
        end do
    end do

According to the algorithm, the distance between adjacent alpha-carbon atoms is unity, so, to establish the correct distance, we must multiply the final calculated coordinates by $d_{C_\alpha - C_\alpha}$.

Figure 6: Development of the alpha-carbon backbones of a hypothetical L-protein of length N and the corresponding D-enantiomer from their positions on the sphere surface into their spatial configuration (schematic). The two chains are mirror images of each other. In each spatial chain the last two alpha-carbon atoms, as well as some of the first ones, are labelled by their order number within the sequence.

It remains to determine how to calculate the quaternions $\hat\beta^J_i$ (i = 3, 4, …, N). In Ref. [20] we do this in a somewhat heuristic way. We take into account that the i-th amino acid interacts in some way with the i − 1 preceding ones and the N − i subsequent ones. Of course, in these interactions the effect of the medium should be incorporated in some form, for example in the form of effective interactions between amino acids.
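The iterative expansion procedure described above can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the sphere points are random placeholders rather than points obtained from rotation quaternions, the upper loop bound N − 1 is the reading assumed here, and the function names are hypothetical. The second function shows that, under this reading, the double loop amounts to a reverse cumulative sum of the sphere points.

```python
import numpy as np


def expand_chain(sphere_points):
    """Literal transcription of the double loop; input has shape (N, 3)."""
    r = np.array(sphere_points, dtype=float)
    n = len(r)
    for i in range(n - 1):        # i = 1, ..., N-1 in the text (0-based here)
        delta = r[i + 1].copy()   # r_{i+1} is still its original sphere point
        r[: i + 1] += delta       # shift atoms 1..i by delta
    return r


def expand_chain_cumsum(sphere_points):
    """Equivalent closed form: r_j becomes the sum of the sphere points k >= j."""
    s = np.asarray(sphere_points, dtype=float)
    return np.cumsum(s[::-1], axis=0)[::-1]


# Placeholder data: random unit vectors standing in for the rotated north pole.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
points /= np.linalg.norm(points, axis=1, keepdims=True)
points[-1] = 0.0                  # last atom is assigned the origin
points[-2] = (0.0, 0.0, 1.0)      # preceding atom sits at the north pole

chain = expand_chain(points)
steps = np.linalg.norm(np.diff(chain, axis=0), axis=1)

D_CA = 3.80                       # alpha-carbon spacing in angstroms, from the text
coordinates = D_CA * chain
print(steps)
```

Because each expanded position differs from the next one by a single unit vector, all consecutive distances equal one before the final rescaling by the alpha-carbon spacing.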
Actually, we are attempting a sort of decoding, and so we are not directly interested in the detailed form of the interactions; we recognize, however, that any encoding of information that involves those interactions should carry some trace of their general form. In general it is reasonable to think that the global interaction includes just two-body (effective) interactions so, by analogy, we choose with generality for $\hat{\beta}^J_i$ the normalized version of the quaternion

$$\beta^J_i = \sum_{\substack{r = 1, 2, \ldots, N \\ r \neq i}} c^J_r \, (p^J_r \bullet p^J_i) \qquad (J = L, D;\ i = 1, 2, \ldots, N) \qquad (11)$$

where $\bullet$ denotes the quaternionic dot product,

$$p^J_r \bullet p^J_i = (p^J_r)_0 (p^J_i)_0 + (p^J_r)_1 (p^J_i)_1 + (p^J_r)_2 (p^J_i)_2 + (p^J_r)_3 (p^J_i)_3 ,$$

and $c^J_r \in \mathbb{H}(\mathbb{R})$ ($r = 1, 2, \ldots, N$) are unknown real quaternions (order quaternions), which are determined by means of an optimization technique. For this we use the particle swarm optimization (PSO) procedure of Kennedy and Eberhart [39], taking as fitness function the difference between the coordinates of the alpha-carbon atoms calculated with the previous procedure and the corresponding experimental ones. For the latter we take those read directly from the PDB (for L-proteins) or their mirror images (for D-proteins). We take the rmsd (root-mean-square deviation) as the measure of this difference, using for that purpose Bosco K. Ho's implementation of the Kabsch algorithm [40].

Actually, it is enough to consider J = L, since $p^D_r = \tilde{p}^L_r$ so, if we take

$$c^D_r = \mathbf{i} \, c^L_r \, \tilde{\mathbf{i}} = \tilde{\mathbf{i}} \, c^L_r \, \mathbf{i} , \qquad (12)$$

with $\mathbf{i} = (0, 1, 0, 0)$, we obtain for $\beta^D_i$:

$$\beta^D_i = \mathbf{i} \, \beta^L_i \, \tilde{\mathbf{i}} = \tilde{\mathbf{i}} \, \beta^L_i \, \mathbf{i} = \left( (\beta^L_i)_0 , (\beta^L_i)_1 , -(\beta^L_i)_2 , -(\beta^L_i)_3 \right) . \qquad (13)$$

We observe that if we denote by $x^L_i = \left( (x^L_i)_1 , (x^L_i)_2 , (x^L_i)_3 \right)$ the point on the sphere that results from rotating the versor $\hat{e}_z$ by the quaternion $\hat{\beta}^L_i$, that is, $(0, x^L_i) = \hat{\beta}^L_i \, (0, \hat{e}_z) \, \tilde{\hat{\beta}}^L_i$, then the point on the sphere that results from rotating the north pole by $\hat{\beta}^D_i$ is given by

$$(0, x^D_i) = \hat{\beta}^D_i \, (0, \hat{e}_z) \, \tilde{\hat{\beta}}^D_i = \mathbf{i} \, (0, x^L_i) \, \mathbf{i} = \left( 0 , -(x^L_i)_1 , (x^L_i)_2 , (x^L_i)_3 \right) , \qquad (14)$$

that is, $x^D_i$ is the mirror image of $x^L_i$ with respect to the plane $x = 0$ in a Cartesian system of axes $(x, y, z)$ (see figure 6). When we expand the two chains (L and D) of alpha-carbon atoms from their locations on the sphere by using the previous iterative procedure, they turn out to be mirror images of each other.

In figures 7 to 9 we show the L and D enantiomers of three small proteins as obtained with our procedure: in figure 7, the hormone glucagon (PDB ID: 1GCN, length: 29 amino acids); in figure 8, the ion channel inhibitor osk1 toxin (PDB ID: 2CK5, length: 31 amino acids); and, in figure 9, a type III antifreeze protein (PDB ID: 1HG7, length: 66 amino acids). In all cases the theoretical curves are compared with the corresponding experimental ones, obtained directly from the PDB for L-proteins and from their mirror images for D-proteins. The corresponding rmsd's for the L-proteins are 0.103 Å for 1GCN, 0.091 Å for 2CK5 and 0.163 Å for 1HG7.
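The relations in Eqs. (11) to (14) can be checked numerically with a short Python sketch. It is illustrative only: the helper names (`qmul`, `conj`, `dot`, `normalize`, `beta`, `rotate_north_pole`, `to_d`) and the sample type and order quaternions are hypothetical and are not taken from the paper. The check relies on the fact that the dot product is unchanged when both quaternions are conjugated, so $p^D_r \bullet p^D_i = p^L_r \bullet p^L_i$ and the choice of Eq. (12) carries over term by term to Eq. (13).

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions stored as (q0, q1, q2, q3)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (a0*b0 - a1*b1 - a2*b2 - a3*b3,
            a0*b1 + a1*b0 + a2*b3 - a3*b2,
            a0*b2 - a1*b3 + a2*b0 + a3*b1,
            a0*b3 + a1*b2 - a2*b1 + a3*b0)

def conj(q):
    """Quaternion conjugate."""
    return (q[0], -q[1], -q[2], -q[3])

def dot(a, b):
    """Quaternionic dot product (component-wise, four terms)."""
    return sum(x * y for x, y in zip(a, b))

def normalize(q):
    n = math.sqrt(dot(q, q))
    return tuple(x / n for x in q)

def beta(i, p, c):
    """Eq. (11): sum over r != i of c_r (p_r . p_i); i is 0-based here."""
    total = [0.0] * 4
    for r in range(len(p)):
        if r != i:
            w = dot(p[r], p[i])
            for k in range(4):
                total[k] += w * c[r][k]
    return tuple(total)

def rotate_north_pole(q):
    """Vector part of q (0, e_z) conj(q) for a unit quaternion q."""
    return qmul(qmul(q, (0.0, 0.0, 0.0, 1.0)), conj(q))[1:]

I_UNIT = (0.0, 1.0, 0.0, 0.0)

def to_d(q):
    """Conjugation by the unit i, as in Eqs. (12) and (13)."""
    return qmul(qmul(I_UNIT, q), conj(I_UNIT))

# Hypothetical sample data, chosen only to exercise the formulas.
p_L = [(1, 2, 1, 1), (2, 1, -1, 1), (1, -1, 2, 1)]
c_L = [(0.3, -1.2, 0.5, 0.8), (1.0, 0.4, -0.7, 0.2), (-0.6, 0.9, 0.1, 1.5)]
p_D = [conj(p) for p in p_L]      # p^D_r is the conjugate of p^L_r
c_D = [to_d(c) for c in c_L]      # Eq. (12)

for i in range(len(p_L)):
    x_L = rotate_north_pole(normalize(beta(i, p_L, c_L)))
    x_D = rotate_north_pole(normalize(beta(i, p_D, c_D)))
    print(i, x_L, x_D)            # x_D should be x_L with its first component negated
```

In this sketch the L and D sphere points differ only in the sign of their first coordinate, which is the reflection in the plane $x = 0$ stated after Eq. (14).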

5. Remarks

Assuming that in the beginning amino acids and nucleotide bases were synthesized from primordial elements in racemic mixtures, and that D-bases always had a greater affinity for L-amino acids whereas L-bases preferred D-amino acids, we have proposed diagrams for the evolution of genetic codes which at present would settle the correspondence between codons and amino acids according to those affinities. Actually, of the two codes, only the one corresponding to the affinity system (D-bases/L-amino acids) is nowadays observed on Earth. Although the existence of the chiral combination (L-bases/D-amino acids) is in principle possible, none of the organisms that live on our planet shows L-nucleic acids or D-proteins. However, the existence of life on other planets, including the possibility that it is governed by such a genetic code (Figure 5), is an open issue.

Figure 7: Trace representation of the alpha-carbon atom backbone for L- and D-1GCN. Red (dark grey) inner tube: from the coordinates obtained using our procedure. Cyan (light grey) external transparent tube: from the coordinates stored at the PDB for L and its mirror image for D.

Our evolution diagrams (Figures 2 and 3) are based on pioneering ideas by Crick and introduce in a very simple way the concept of broken symmetry. The fact that the diagrams describe the degeneration breaking and the code freezing following a similar pattern for both affinity systems, (D-bases/L-amino acids) and (L-bases/D-amino acids), means that we consider that these aspects basically do not depend on the chirality of the molecules but on other physical, chemical and/or biological causes.

Inspired by the evolution diagrams, we propose a quaternion-based mathematical representation of the corresponding genetic codes. We assign to each nucleotide base an integer quaternion, so the codons are triplets of such quaternions.
The representation assigns to each triplet another integer quaternion that is associated with one of the 20 amino acids (type quaternions). The base quaternions belong to the set of eight prime integer quaternions of norm 7, and the chirality of the nucleotide bases is introduced by partitioning this set into two subsets (Eqs. 4 and 5) of cardinality 4 each, with their elements mutually conjugate, and associating their elements with the D- and L-bases, respectively. The correspondences between triplets of quaternions and type quaternions for both chiral combinations, (D/L) and (L/D), are given by functions (Eqs. 7 and 8) that use the sum and ordinary product of quaternions and are such that the type quaternions assigned to the two enantiomers of a given amino acid are mutually conjugate.

Figure 8: Trace representation of the alpha-carbon atom backbone for L- and D-2CK5. Red (dark grey) inner tube: from the coordinates obtained using our procedure. Cyan (light grey) external transparent tube: from the coordinates stored at the PDB for L and its mirror image for D.

Apart from preserving the degeneration of the genetic codes, as they should, our representations stand out among other mathematical representations of the genetic code because they assign quaternions to the amino acids as a final result so that, in view of the close relationship between quaternions and spatial rotations, a door opens towards the study of protein folding. In this context we propose an algorithm to go from the primary to the tertiary structure of L- as well as D-proteins.
The algorithm uses, besides the integer type quaternions, a set of real quaternions associated with the order of the amino acids in the protein sequence. These order quaternions are basically the same for L- and D-proteins, so the algorithm is such that, for a given primary sequence, the 3D structures of the L- and D-proteins are mirror images of each other.

Finally, one more observation about this algorithm, whose critical step is the construction of the quaternion $\beta^J_i$ (Eq. 11). In ref. [20] we use for it an expression that involves the ordinary product between the type quaternion corresponding to position $i$ and all the others. However, such an expression is not adequate for describing the folding of L- and D-proteins with a common set of order quaternions. To overcome this problem, in this article we have replaced the ordinary product by a dot product between type quaternions. As a consequence, the number of sets of order quaternions that fit a given protein diminishes, so that the search for the unique set that describes the folding of all proteins (if it exists!) would be facilitated. We are currently working on this issue.

Acknowledgments

Support of this work by Universidad Nacional de La Plata, Universidad Nacional de Rosario and Consejo Nacional de Investigaciones Científicas y Técnicas of Argentina is greatly appreciated. The authors are members of CONICET.

Figure 9: Trace representation of the alpha-carbon atom backbone for L- and D-1HG7. Red (dark grey) inner tube: from the coordinates obtained using our procedure. Cyan (light grey) external transparent tube: from the coordinates stored at the PDB for L and its mirror image for D.

References

[1] G. Pályi, C. Zucchi and L. Cagliotti (Editors), Advances in BioChirality (Elsevier Science Ltd., Oxford, 1999).
[2] G. Pályi, C. Zucchi and L. Cagliotti (Editors), Progress in Biological Chirality (Elsevier Science Ltd., Oxford, 2004).
[3] W.A. Bonner, Homochirality and life. EXS 85, 159-188 (1998).
[4] A. Jorissen and C. Cerf, Asymmetric photoreactions as the origin of biomolecular homochirality: A critical review. Origins Life Evol. Biosphere 32, 129-142 (2002).
[5] G. Goodman and M.E. Gershwin, The origin of life and the left-handed amino-acids excess: The furthest heavens and the deepest seas? Exp. Biol. Med. 231, 1587-1592 (2006).
[6] R.S. Root-Bernstein, Simultaneous origin of homochirality, the genetic code and its direction. Bioessays 29, 689-698 (2007).
[7] M. Yarus, J.G. Caporaso and R. Knight, Origins of the genetic code: The escaped triplet theory. Annu. Rev. Biochem. 74, 179-198 (2005).
[8] M. Yarus, A specific amino acid binding site composed of RNA. Science 240, 1751-1758 (1988).
[9] M. Yarus, RNA-ligand chemistry: A testable source for the genetic code. RNA 6, 475-484 (2000).
[10] I. Majerfield, D. Puthenvendu and M. Yarus, RNA affinity for molecular L-histidine; genetic code origins. J. Mol. Evol. 61, 226-235 (2005).
[11] I. Majerfield and M. Yarus, A diminute and specific RNA binding site for L-tryptophan. Nucl. Acids Res. 33, 5482-5493 (2005).
[12] M. Legiewics and M. Yarus, A more complex isoleucine with a cognate triplet. J. Biol. Chem. 280, 19815-19822 (2005).
[13] M.K. Hobish, N.S. Wickramasinghe and C. Ponnamperuma, Direct interaction between amino acids and nucleotides as a possible basis for the origin of the genetic code. Adv. Space Res. 13, 365-382 (1995).
[14] C. Saxinger, C. Ponnamperuma and C. Woese, Evidence for the interaction of nucleotides with immobilized amino acids and its significance for the origin of the genetic code. Nat. New Biol. 234, 172-174 (1971).
[15] G.W. Walker, Nucleotide-binding site data and the origin of the genetic code. Biosystems 9, 139-150 (1977).
[16] R.S. Root-Bernstein, Experimental test of L- and D-amino acid binding to L- and D-codons suggests that homochirality and codon directionality emerged with the genetic code. Symmetry 2, 1180-1200 (2010).
[17] A.T. Profy and D.A. Usher, Stereoselective aminoacylation of a dinucleotide monophosphate by imidazolides of DL-alanine and N-(tert-butoxycarbonyl)-DL-alanine. J. Mol. Evol. 20, 147-156 (1984).