Did Sequence Dependent Geometry Influence the Evolution of the Genetic Code?

Abstract

The genetic code is the function from the set of codons to the set of amino acids by which a DNA sequence encodes proteins. Since the codons also influence the shape of the DNA molecule itself, the same sequence that encodes a protein also has a separate geometric interpretation. A question then arises: How well-duplexed are these two "codes"? In other words, in choosing a genetic sequence to encode a particular protein, how much freedom does one still have to vary the geometry (or vice versa). A recent paper by the first author addressed this question using two different methods. After reviewing those results, this paper addresses the same question with a third method: the use of Monte Carlo and Gaussian sampling methods to approximate a multi-integral representing the mutual information of a variety of possible genetic codes. Once again, it is found that the genetic code used in nuclear DNA has a slightly lower than average duplexing efficiency as compared with other hypothetical genetic codes. A concluding section discusses the significance of these surprising results.


arXiv [q-bio.OT]

ALEX KASMAN AND BRENTON LEMESURIER


The first author's talk at the AMS Special Session on the Topology of Biopolymers explored the mathematical relationship between two different roles that a DNA sequence serves in living cells: encoding proteins to be produced and influencing the shape of the DNA molecule itself. Those results were subsequently published as a journal article [5]. After briefly summarizing the main results of that published paper, this article takes them a step further using a more sophisticated approach to the numerical computation of the mutual information. By combining Gaussian and Monte Carlo sampling methods with a new geometric inversion formula for computing the geometries, this new approach provides a more reliable result which strengthens and reconfirms the previously announced conclusions.

1. Measuring the Efficiency of Duplexed Codes

1.1. A Motivating Example.

Consider the following unlikely situation: You will soon need to send a text message conveying a two letter word to your friend Georgina and you also have to send a two letter word by text message to your friend Fred. However, because of your restrictive data plan, you must achieve this by sending a single two character message to both of them at the same time.

You can hope to achieve this by teaching Fred one of the two functions $f_i$ and teaching Georgina the function $g$ shown in Table 1. Each of those functions turns one of the integers from 0 to 5 into a letter and can therefore be used as a simple "code". For example, since Georgina knows the function $g$ you can send her the numerical message "24" and she would interpret it as "$g(2)g(4)$ = NO". Alternatively, she would interpret the message "53" as the exclamation "OH". Similarly, using either of the two functions $f_1$ or $f_2$, Fred could recognize the signal "04" as the greeting "HI".

    c    f_1(c)   f_2(c)   g(c)
    0    H        H        N
    1    O        O        H
    2    O        H        N
    3    I        O        H
    4    I        I        O
    5    H        I        O

Table 1. The functions $f_1$, $f_2$ and $g$ used in this introduction to illustrate duplexing and mutual information.

Date: March 4, 2020.
2010 Mathematics Subject Classification. Primary 92B05, 94A17; Secondary 65D30.

The really interesting thing is that you could send the same two digit message to both Georgina and Fred and they would interpret it differently. That is the defining characteristic of duplexed codes: the same signal has two different interpretations.

Let us first suppose that Fred has memorized $f_1$ and Georgina knows the code $g$. If you wanted to send Georgina a message that will be interpreted as "NO", you have four different choices of signal which would convey that message to her and each one would mean something different to Fred. For instance, you could send "04" which Fred will interpret as "HI" or you could send "25" which has the same interpretation for Georgina but which Fred will interpret as "OH". In this scenario, you have the freedom to send different messages to Fred while still sending the desired message to Georgina at the same time.

In contrast, things would be different if Fred had learned $f_2$ as his code instead. Even though you would still have a choice of four signals to send Georgina that would be interpreted as "NO", you would not be able to separately control the message that was sent to Fred because all four of the signals that mean "NO" under the code $g$ would be interpreted as "HI" using code $f_2$. There would be no way to send Fred the message "OH" or any other message besides "HI" if Georgina's message is to be interpreted as "NO".
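The two scenarios just described can be checked by brute force. The following is a minimal Python sketch, not part of the original paper: the strings F1, F2 and G simply list the columns of Table 1 from $c = 0$ to $c = 5$, and the helper names decode, signals_for and fred_options are illustrative choices.

```python
from itertools import product

# Columns of Table 1, read from c = 0 to c = 5 (illustrative encoding).
F1 = "HOOIIH"  # f_1(0), ..., f_1(5)
F2 = "HOHOII"  # f_2(0), ..., f_2(5)
G = "NHNHOO"   # g(0), ..., g(5)


def decode(code, signal):
    """Interpret a string of digits using the given code."""
    return "".join(code[int(digit)] for digit in signal)


def signals_for(code, word):
    """All digit strings of the same length as `word` that decode to `word`."""
    digits = "012345"
    candidates = ("".join(p) for p in product(digits, repeat=len(word)))
    return [s for s in candidates if decode(code, s) == word]


def fred_options(fred_code, georgina_word):
    """Messages Fred can receive while Georgina receives `georgina_word`."""
    return sorted({decode(fred_code, s) for s in signals_for(G, georgina_word)})


print(signals_for(G, "NO"))    # the four signals Georgina reads as NO
print(fred_options(F1, "NO"))  # several different messages for Fred
print(fred_options(F2, "NO"))  # only one possible message for Fred
```

With $f_1$ the four signals give Fred four different words, whereas with $f_2$ they all collapse to the single word "HI", which is the obstruction discussed in the text.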
Even though there is nothing wrong with the code $f_2$ on its own, there is something unfortunate about its relationship to $g$ which creates an obstruction to sending the message "NO" to Georgina while simultaneously sending the message "OH" to Fred.

Loosely speaking, we say that two codes are well-duplexed if such obstructions to encoding two messages simultaneously are rare. Conversely, they are poorly-duplexed if the choice of a message for one recipient severely restricts the messages that can be sent to the other recipient with the same signal. A more rigorous and quantifiable method of determining whether two codes are well-duplexed or poorly-duplexed is to use the concept of mutual information from the branch of mathematics known as information theory.

1.2. Duplexed Codes and Mutual Information.

Let us say that $f$ and $g$ are duplexed codes whenever $f : \mathcal{C} \to \mathcal{X}$ and $g : \mathcal{C} \to \mathcal{Y}$ are two functions with the same domain. The terminology makes sense when one imagines sending a single "signal" $c \in \mathcal{C}$ to two recipients, each of whom knows one of those two codes. The goal of this section is to introduce a number associated to any duplexed codes which measures how much freedom you have to send different messages to one recipient even after the message for the other recipient is fixed.

For a randomly selected element $c \in \mathcal{C}$, let $P_f(x)$ denote the probability that $f(c) = x \in \mathcal{X}$, $P_g(y)$ be the probability that $g(c) = y$, and $P_{f,g}(x \text{ and } y)$ be the probability that both $f(c) = x$ and $g(c) = y$.

For example, using the functions defined in Table 1 with domain $\mathcal{C} = \{0, 1, 2, 3, 4, 5\}$, we see $g(c) = \mathrm{N}$ is true for two of the six possible values of $c$ and so $P_g(\mathrm{N}) = 2/6 = 1/3$. Moreover, $P_{f_1,g}(\mathrm{H}, \mathrm{N}) = 1/6$ since the only way $f_1(c) = \mathrm{H}$ and $g(c) = \mathrm{N}$ could both be true is if $c = 0$. However, $P_{f_2,g}(\mathrm{H}, \mathrm{N}) = 1/3$ since both $c = 0$ and $c = 2$ satisfy $f_2(c) = \mathrm{H}$ and $g(c) = \mathrm{N}$.

The mutual information (measured in bits) of the duplexed codes $f : \mathcal{C} \to \mathcal{X}$ and $g : \mathcal{C} \to \mathcal{Y}$ is defined to be

(1.1)  $M(f, g) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} P_{f,g}(x \text{ and } y) \log_2 \left( \frac{P_{f,g}(x \text{ and } y)}{P_f(x) P_g(y)} \right).$

It is easy to see that $0 \le M(f, g)$ is true for any two codes $f$ and $g$. The minimum possible value of 0 occurs when $P_{f,g}(x \text{ and } y) = P_f(x) P_g(y)$ for all choices of $x$ and $y$. A familiar fact from probability theory is that the joint probability is equal to the product of the two probabilities precisely when the events are independent. Indeed, the same idea applies here, although we now interpret it in terms of the independence of the two codes. If the mutual information of two codes is zero then this tells us that the codes are very well-duplexed in that the selection of a message to one recipient does not restrict the message that can be sent to the other.

Since a mutual information of 0 represents the best possible duplexing of codes, larger mutual information means that the codes are not as well-duplexed. For example, we can compute that $M(f_1, g) = \log_2(3/2) \approx 0.585$ and $M(f_2, g) = \log_2 3 \approx 1.585$ for the codes $f_1$, $f_2$ and $g$ from Table 1 in the previous section. The combination of functions $f_2$ and $g$ is a bad choice for duplexing since if we were using those as codes for messages to send Fred and Georgina then we could not separately choose a message for each recipient. In contrast, $f_1$ and $g$ work better as a combination because even after we have chosen the message for one of the intended recipients we still have a choice of message that can be sent to the other. This is reflected here in the fact that $M(f_1, g) < M(f_2, g)$; the mutual information when using $f_1$ is closer to zero and therefore closer to being optimal for duplexing.

1.3. Comparisons with Expected Values.
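Before turning to expected values, it may help to evaluate formula (1.1) numerically on the codes of Table 1, since $M(f_1, g)$ and $M(f_2, g)$ are used again in this subsection. The following is a minimal Python sketch rather than anything taken from the paper: the strings F1, F2 and G (the table columns read from $c = 0$ to $c = 5$) and the function name mutual_information are illustrative choices, and the signal $c$ is assumed to be chosen uniformly at random.

```python
from collections import Counter
from math import log2

# Columns of Table 1, read from c = 0 to c = 5 (illustrative encoding).
F1 = "HOOIIH"
F2 = "HOHOII"
G = "NHNHOO"


def mutual_information(f, g):
    """Mutual information (in bits) of two codes on a common finite domain.

    `f` and `g` are equal-length sequences; position c holds f(c) and g(c),
    and c is assumed to be uniformly distributed.
    """
    n = len(f)
    count_f = Counter(f)
    count_g = Counter(g)
    joint = Counter(zip(f, g))
    total = 0.0
    # Only pairs that actually occur are summed, which matches the usual
    # convention that terms with zero joint probability contribute 0.
    for (x, y), k in joint.items():
        total += (k / n) * log2(k * n / (count_f[x] * count_g[y]))
    return total


m1 = mutual_information(F1, G)
m2 = mutual_information(F2, G)
mean = (m1 + m2) / 2  # expected value over the index set {1, 2}
print(round(m1, 3), round(m2, 3), round(mean, 3))
```

Under these assumptions the two values are $\log_2(3/2)$ and $\log_2 3$, so the code $f_1$ has the smaller mutual information with $g$ and lies below the average of the two.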

Let $F : S \to \mathbb{R}$ be a real-valued function on the finite set $S = \{\sigma_1, \ldots, \sigma_n\}$. Then define the expected value $E_S(F(\sigma))$ by the familiar formula

$E_S(F(\sigma)) = \frac{1}{n} \sum_{\sigma \in S} F(\sigma).$

You will probably notice that this is nothing other than the mean of the values that $F$ takes. The terminology "expected value", taken from probability theory, is a notion analogous to the average in the context of random variables. The way to interpret it here is to imagine an experiment in which you randomly select an element $\sigma$ from $S$ and make a measurement of it to find the value $F(\sigma)$. Then $E_S(F(\sigma))$ is the expected value in the sense that it would be the average of the measurements after a large number of experiments. In particular, if for a particular $\hat\sigma \in S$ one has

$F(\hat\sigma) < E_S(F(\sigma))$

then one can say that the value of $F(\hat\sigma)$ is lower than the value one would expect for a randomly selected element of $S$.

For example, using the functions $f_1$, $f_2$ and $g$ from the motivating example above, we can consider the mutual information $M(f_\sigma, g)$ as a real-valued function on the index set $S = \{1, 2\}$. Then

$0.585 \approx M(f_1, g) < E_S(M(f_\sigma, g)) \approx 1.085$

tells us that the duplexing of the code $f_1$ with $g$ is better than average for codes selected from $\{f_1, f_2\}$. Although we already knew that in this case simply by comparing the individual mutual information values, this notation will prove useful below where we will be doing something similar but with a very large index set.

Footnote to (1.1): When $P_{f,g}(x \text{ and } y) = 0$ it is understood that $P_{f,g}(x \text{ and } y) \log_2 \left( \frac{P_{f,g}(x \text{ and } y)}{P_f(x) P_g(y)} \right) = 0$.

2. A Natural Example of Duplexed Codes Associated to DNA

2.1. The Genetic Code.

Let $\mathcal{B} = \{A, C, G, T\}$ be the set of DNA bases. Because DNA sequences of length 2 and 3 will play special roles in this paper, let us introduce the following terminology and notation: The set of dimers (length two sequences) is $\mathcal{D} = \{b_1 b_2 : b_i \in \mathcal{B}\}$ and the set of codons (length three sequences) is $\mathcal{C} = \{b_1 b_2 b_3 : b_i \in \mathcal{B}\}$.

A genetic code is simply a function $f_I$ from the set $\mathcal{C}$ of codons to the set $\mathcal{X}$ of amino acids (and the word "stop"):

$\mathcal{X} = \{$I, L, V, F, M, C, A, G, P, T, S, Y, W, Q, N, H, E, D, K, R, Stop$\}$.

The genetic code used by the nuclear DNA in humans is shown in Table 2, and this is the same genetic code used by nearly all known living organisms [8, 9]. We will refer to the particular genetic code given in Table 2 as "the natural genetic code" so as to distinguish it from other hypothetical codes that are not found in biology but will be used for comparison later in the paper.

However, it is important to realize that there are other genetic codes that are used by biological organisms (notably, mitochondria use a different genetic code) and that scientists have also introduced artificial genetic codes which nevertheless seem to function well enough to support life [2, 6, 7, 10, 12, 13, 14]. So, there is no physical law requiring this to be the genetic code. In theory, the genetic code could have been different and it is reasonable to ask the question "Why do nearly all living organisms use this particular genetic code?"

There is evidence to support the hypothesis that the natural genetic code is the result of a combination of coincidences and evolutionary pressures (see [15] and references therein). For example, two codons for the same amino acid differ only in the third base much more frequently than would be predicted by chance if the genetic code was to be constructed entirely randomly. This has evolutionary advantages in that it decreases the likelihood that a mutation or mis-pairing of mRNA and tRNA will produce a different protein [3, 1]. It is therefore presumed that this feature is not a coincidence but an example of the effect of natural selection on the formation of the genetic code.

    Codon (c)                          Amino Acid (f_I(c))
    ATT, ATC, ATA                      I
    CTT, CTC, CTA, CTG, TTA, TTG       L
    GTT, GTC, GTA, GTG                 V
    TTT, TTC                           F
    ATG                                M
    TGT, TGC                           C
    GCT, GCC, GCA, GCG                 A
    GGT, GGC, GGA, GGG                 G
    CCT, CCC, CCA, CCG                 P
    ACT, ACC, ACA, ACG                 T
    TCT, TCC, TCA, TCG, AGT, AGC       S
    TAT, TAC                           Y
    TGG                                W
    CAA, CAG                           Q
    AAT, AAC                           N
    CAT, CAC                           H
    GAA, GAG                           E
    GAT, GAC                           D
    AAA, AAG                           K
    CGT, CGC, CGA, CGG, AGA, AGG       R
    TAA, TAG, TGA                      Stop

Table 2. This table defines the genetic code $f_I : \mathcal{C} \to \mathcal{X}$. Each of the codons from $\mathcal{C}$ in the first column is a pre-image of the corresponding amino acid (or "Stop") in the second.

2.2. Sequence Dependent DNA Geometry.

When shown in illustrations, DNA often looks like a perfectly straight double-helix, a twisted ladder with "rungs" that are the base pairs carrying the genetic sequence. However, real DNA is not straight; it is bent and twisted into compact shapes that fit into living cells.

It is perhaps not surprising that the way that a DNA molecule bends is affected by the sequence of bases which make it up. After all, A, C, G, and T in $\mathcal{B}$ are not just abstract mathematical symbols. They represent actual chemical structures that form the base pairs in a DNA molecule. Hence, the electrical repulsion and attraction between successive "rungs" in the DNA ladder will vary with that sequence.

Olson et al. [11] experimentally determined the geometry of each of the 16 dimers $d \in \mathcal{D}$ by repeatedly measuring the configurations of DNA strands that were two base pairs long. They computed the average and standard deviation of each of the six Hassan-Calladine dimer step parameters (see [4, 5]). Their results are shown in Table 3.

Assuming that the geometric configuration of each dimer in a longer sequence has the same expected values and standard deviations as the isolated dimers in those experiments, it is possible to make a similar table for the geometric configurations associated to each of the 64 codons in $\mathcal{C}$. The function

(2.1)  $\bar g : \mathcal{C} \to \mathbb{R}^3$

shown in Table 4 associates to each codon $c \in \mathcal{C}$ a 3-tuple of numbers $\bar g(c)$ which gives the location of the center of the top of the codon in Angstroms if its base is located at the origin and if each of the dimers takes exactly the expected geometry according to Olson et al. (Note: In [5], this role is played by a function $\bar\Gamma : \mathcal{C} \to \mathbb{R}^6$ whose image has six components because it has angular information as well, but for simplicity in this note we are considering only the first three components which encode the location of the center of the third rung and not the way it is tilted.)

Table 3. This table shows the mean ($\bar\Delta_i$, $\bar\theta_i$) and standard deviation ($\hat\Delta_i$, $\hat\theta_i$) of each of the Hassan-Calladine step parameters for each dimer as determined experimentally in [11].
[Table 3 body: columns $d$, $\bar\Delta_1(d)$ ($\hat\Delta_1(d)$), $\bar\Delta_2(d)$ ($\hat\Delta_2(d)$), $\bar\Delta_3(d)$ ($\hat\Delta_3(d)$), $\bar\theta_1(d)$ ($\hat\theta_1(d)$), $\bar\theta_2(d)$ ($\hat\theta_2(d)$), $\bar\theta_3(d)$ ($\hat\theta_3(d)$); one row for each of the 16 dimers AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT; entries as tabulated in [11].]

Figure 1 shows just the projection of $\bar g(c)$ onto its first two coordinates for each of the 64 codons $c \in \mathcal{C}$. You can imagine that a codon (a DNA sequence of length 3) is coming straight out of the $xy$-plane at you. Each point in this figure represents a codon and they all start out at the origin, but because the expected dimer step parameters depend on the particular bases involved, by the time they get up to their third rung they are in slightly different positions. In particular, the points indicate the locations of the center of the third rung (with units given in Angstroms) if each of the dimer step parameters takes its expected values in agreement with the experiments of Olson et al.

As you can see, the different codons do have slightly different expected geometries. It is important to realize that these small differences can combine in dramatic ways when considering longer sequences made up of many successive codons. For instance, Figure 2 shows the expected geometry for two different DNA sequences. Clearly, the sequence

$S_1$ = AAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGC

bends significantly more than sequence

$S_2$ = AAGAATGGGCAGAAGCGTGCGAAGACTGGAAAGAATGGCCAGAAGCGTGCAAAAACGGGT.

So, geometrically they are quite different. But, consider how each of these two sequences is translated into a protein according to the natural genetic code. The first codon in $S_1$ (AAA) and the first codon in $S_2$ (AAG) both encode the amino acid K. Similarly, the second codon in each encodes the amino acid N.

In fact, the corresponding codons in each sequence always are mapped by the natural genetic code to the same amino acid.
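The claim that corresponding codons always translate to the same amino acid can be verified mechanically. The following minimal Python sketch (not part of the original paper) encodes the natural genetic code of Table 2 as a dictionary and translates both sequences; the names TABLE_2, F_I and translate are illustrative choices.

```python
# Natural genetic code of Table 2: amino acid -> its codons.
TABLE_2 = {
    "I": "ATT ATC ATA",
    "L": "CTT CTC CTA CTG TTA TTG",
    "V": "GTT GTC GTA GTG",
    "F": "TTT TTC",
    "M": "ATG",
    "C": "TGT TGC",
    "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG",
    "P": "CCT CCC CCA CCG",
    "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC",
    "Y": "TAT TAC",
    "W": "TGG",
    "Q": "CAA CAG",
    "N": "AAT AAC",
    "H": "CAT CAC",
    "E": "GAA GAG",
    "D": "GAT GAC",
    "K": "AAA AAG",
    "R": "CGT CGC CGA CGG AGA AGG",
    "Stop": "TAA TAG TGA",
}

# Invert the table to obtain the function f_I : codon -> amino acid.
F_I = {codon: aa for aa, codons in TABLE_2.items() for codon in codons.split()}


def translate(sequence):
    """Split a DNA sequence into successive codons and apply f_I to each."""
    usable = len(sequence) - len(sequence) % 3
    return [F_I[sequence[i:i + 3]] for i in range(0, usable, 3)]


S1 = "AAAAACGGGC" * 6
S2 = "AAGAATGGGCAGAAGCGTGCGAAGACTGGAAAGAATGGCCAGAAGCGTGCAAAAACGGGT"

protein_1 = translate(S1)
protein_2 = translate(S2)
print("".join(protein_1))
print(protein_1 == protein_2)
```

Both sequences have 60 bases, hence 20 codons each, and the two translations agree position by position even though most of the corresponding codons differ in their third base.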
So, although $S_1$ and $S_2$ encode exactly the same protein according to the natural genetic code, according to the function $\bar g$ one of them exhibits a much greater curvature than the other.

2.3. The Geometric Pressure Hypothesis.

Note that the last example of two DNA sequences with very different expected geometries is in some ways similar to the opening example of a simple duplexed code. Just as Fred and Georgina can have different interpretations of the same signal of numbers, a sequence of codons can be interpreted either as encoding a protein or as influencing the shape of the DNA molecule.

The shape of a DNA molecule is relevant to its biological function. It must bend in the right places and be straight in the right places in order for the enzymes and RNA responsible for transcription to be able to function. This gives biological importance to the question of how well-duplexed the genetic and geometric codes discussed in the previous two sections are. For instance, if they are very poorly-duplexed, then it could often be the case that a DNA molecule cannot encode the protein that a creature needs unless it bends in a bad way. Conversely, it would be to a creature's advantage for the codes to be well-duplexed because then it would always be able to simultaneously encode whatever protein and geometry are optimal.

The geometric code was not something that evolution could act upon since it is determined by the laws of physics and chemistry. However, as we have seen, the genetic code could have been different and likely was influenced by natural selection. In [5], it was hypothesized that one of the factors that influenced that selection was pressure to ensure that the genetic and geometric codes were well-duplexed.

Table 4. The expected location $\bar g(c)$ of the center of the third rung of each codon relative to the position of the first rung, in Angstroms. (See Figure 1 for a visual representation of this data and [5] for mathematical details.)
[Table 4 body: columns Codon ($c$), $\bar g(c)$; one row for each of the 64 codons AAA, AAC, ..., TTT.]

Figure 1. This graphic shows the $x$ and $y$-coordinates of the images of the function $\bar g(c)$ (Table 4) for each codon $c \in \mathcal{C}$. They show the ways that the DNA molecule carrying that sequence would likely bend. If one were looking down at all 64 codons, each with its bottom rung fixed at the origin and with each dimer exhibiting its expected geometry (see Table 3) as it comes up out of the page towards you, the projections of the centers of the top rungs would be at the locations indicated (with axes measured in Angstroms).

Figure 2. The sequences of the two DNA molecules shown ($S_1$ and $S_2$) encode the same protein, but due to the differences in expected codon geometry, one of them is noticeably more bent than the other.

3. How Well Duplexed is the Real Genetic Code as Compared with Alternatives?

To test the "Geometric Pressure Hypothesis" (GPH), two different measures of duplexing efficiency were developed in [5]. Then, the duplexing efficiency of the geometric code with the natural genetic code was compared with its average duplexing efficiency with a large set of reasonable alternative genetic codes.

3.1. Alternative Genetic Codes.

Let $\mathrm{Sym}(\mathcal{C})$ be the group of permutations on the set $\mathcal{C}$ of codons. Then for any $\sigma \in \mathrm{Sym}(\mathcal{C})$, $f_\sigma = f_I \circ \sigma$ is the function from $\mathcal{C}$ to $\mathcal{X}$ which first replaces the codon $c \in \mathcal{C}$ with its image $\sigma(c)$ and then applies the natural genetic code function $f_I$ to that. (In other words, the function $f_\sigma$ would be represented by a table very much like Table 2 for the natural genetic code above, but the codons would be rearranged according to the permutation $\sigma$.)

Notice that no matter which permutation $\sigma$ is selected, the alternative genetic code $f_\sigma$ has in common with the natural genetic code $f_I$ not only that it is a map from $\mathcal{C}$ to $\mathcal{X}$ but also that for each amino acid $a \in \mathcal{X}$ the preimages are of the same size: $|f_I^{-1}(a)| = |f_\sigma^{-1}(a)|$.

However, not all of those alternative genetic codes are realistic. For most choices of permutation $\sigma$, the alternative genetic code $f_\sigma$ will not have the property that two codons are more likely to encode the same amino acid (or chemically similar amino acids) when their first two bases are equal, which we have already noted is a property of the real genetic code which has evolutionary advantages. Since we only want to consider alternative codes that also have this property, the permutations considered in [5] were further restricted: we considered not arbitrary permutations but only ones with the property that the first two bases in the codons $\sigma(c)$ and $\sigma(c')$ are equal if and only if the first two bases of $c$ and $c'$ are equal. Let $S$ be the set of such permutations. Symbolically, we can define the restricted set $S \subset \mathrm{Sym}(\mathcal{C})$ of permutations using the map $d(b_1 b_2 b_3) = b_1 b_2$ which projects a codon onto its initial dimer as follows:

$S = \{\sigma \in \mathrm{Sym}(\mathcal{C}) : \forall c, c' \in \mathcal{C},\ d(\sigma(c)) = d(\sigma(c')) \Leftrightarrow d(c) = d(c')\}.$

To test the GPH in [5], the duplexing efficiency of the natural genetic code with the geometric code was compared with the expected value for the duplexing efficiency of the alternatives indexed by the set $S$. Since there are $(4^2)!\,(4!)^{4^2}$ permutations in $S$, this is a very large set of permutations to consider.

3.2. Total Network Length.
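The alternative codes $f_\sigma$ that enter the comparison in this subsection are indexed by the restricted set $S$ of Section 3.1. As a minimal Python sketch of how such a permutation could be sampled (this is an illustration, not the procedure used in [5]): an element of $S$ amounts to a permutation of the 16 initial dimers together with, for each dimer, a permutation of the four possible third bases. The function names below are hypothetical.

```python
import math
import random
from itertools import product

BASES = "ACGT"
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]
CODONS = ["".join(p) for p in product(BASES, repeat=3)]


def random_restricted_permutation(rng):
    """Sample sigma from S: permute the initial dimers, then permute the
    third bases independently within each block of four codons."""
    dimer_image = dict(zip(DIMERS, rng.sample(DIMERS, len(DIMERS))))
    sigma = {}
    for dimer in DIMERS:
        third_bases = rng.sample(list(BASES), len(BASES))
        for base, new_base in zip(BASES, third_bases):
            sigma[dimer + base] = dimer_image[dimer] + new_base
    return sigma


def is_in_S(sigma):
    """Check that sigma is a bijection on codons satisfying the defining
    condition d(sigma(c)) = d(sigma(c')) <=> d(c) = d(c')."""
    if sorted(sigma.values()) != sorted(CODONS):
        return False
    return all(
        (sigma[c][:2] == sigma[c2][:2]) == (c[:2] == c2[:2])
        for c in CODONS
        for c2 in CODONS
    )


# Number of elements of S: (4^2)! * (4!)^(4^2).
size_of_S = math.factorial(4 ** 2) * math.factorial(4) ** (4 ** 2)

sigma = random_restricted_permutation(random.Random(0))
print(is_in_S(sigma), size_of_S)
```

Composing such a dictionary with the natural genetic code gives the table of $f_\sigma = f_I \circ \sigma$; because the count above is far too large to enumerate, expected values over $S$ have to be estimated from random samples of this kind.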

The natural genetic code is shown in Table 2 and the expected geometries of the codons are shown in Figure 1. One way to combine this information is to draw an edge on the figure between any two codons that encode the same amino acid, turning it into a graph with vertices and edges. Thus, for instance, an edge would be drawn between the vertices labeled TAT and TAC because they both encode the amino acid Y, while the vertex labeled ATG would not be connected to any other vertices.

Each connected component of the graph corresponds to an amino acid. If there are only short edges or no edges in the component associated to an amino acid that you wish to encode in a sequence, then that means you have almost no geometric choice in the DNA molecule's expected geometry at that point. On the other hand, if there are long edges in the connected component, then you would have a choice of different codons that encode that same amino acid but which would cause a very different geometric configuration of the DNA molecule.

One could do the same for an alternative genetic code $f_\sigma$. That graph would have the same number of edges as the graph for the natural genetic code, but they would not have the same lengths.

With all of this in mind, we define the "total network length" of the genetic code $f_\sigma$ for any $\sigma \in S$ to be the sum of the lengths of the edges in the graph:

$T_\sigma = \sum_{a \in \mathcal{X}} \left( \sum_{c, c' \in f_\sigma^{-1}(a)} |\bar g(c) - \bar g(c')| \right).$

(The second sum is over all distinct, unordered pairs in the pre-image $f_\sigma^{-1}(a)$ and the length denotes the ordinary metric on $\mathbb{R}^3$, i.e. $|(a, b, c)| = \sqrt{a^2 + b^2 + c^2}$.)

In the case of the natural genetic code, $T_I$ was found to be 45.[…]. If the GPH were correct, one would expect this value to be large as compared with the total network length for the alternative codes which are not found in nature.

Disappointingly, that is not what was found in [5]. A 95% confidence interval was constructed for the expected value of the total network length over all permutations in $S$. It was found that the average total network length $E_S(T_\sigma)$ is probably between 45.[…] and […], an interval which lies entirely above $T_I$. Thus, the total network length $T_I$ for the natural genetic code is apparently a bit smaller than average rather than being especially large.

3.3. Mutual Information with a Discretized Geometric Code.

The previous paper [5] also uses the concept of mutual information to quantify how well-duplexed DNA's geometric and genetic codes are. Using mutual information as a measure of duplexing efficiency has two big advantages over the use of total network length as described in the previous section:

• Firstly, it is a well-known measure of duplexing efficiency which is widely studied and used, whereas total network length is an ad hoc approach developed only for this particular project.
• Secondly, total network length was based only on the expected values in Table 3 and therefore ignored the standard deviations that represented the flexibility of the dimers. Of course, once that flexibility is taken into account, the "geometric code" is no longer a function since there is more than one possible configuration for each codon. Because the definition of mutual information in (1.1) involves probabilities, it is well suited to address this situation.

In [5], the geometry of a codon was represented by a point in $\mathbb{R}^6$ where the first three numbers (like the image of $\bar g$ above) indicate the location of the center of the last "rung" of the codon relative to the first and the other three were angles indicating how it was tilted and twisted relative to the first. Then, $\mathbb{R}^6$ was divided into $4096$ $(= 4^6)$ subsets called 'bins'. If each bin is indexed by an element of a set $\mathcal{Y}$ of 4096 indices then the geometric information is encoded into a map $g : \mathcal{C} \to \mathcal{Y}$.

Unlike any of the maps discussed earlier, $g$ is not a function, since a given codon can be in many different possible geometric configurations due to its flexibility. However, it is more likely to be in certain configurations than others, and so $g(c)$ is a random variable for any given codon $c$. In order to compute the mutual information of the genetic codes $f_\sigma$ with this geometric map $g$, we need to be able to compute the associated probabilities. In [5] that was done by running a computer program which looped through a large number of different configurations and recorded the number of the 'bin' in which they ended up. In other words, the probabilities were computed empirically, using the assumption that the dimer step parameters are normally distributed with the mean and standard deviation shown in Table 3.

Using this information it is now possible to compute (or, perhaps it would be better to say "approximate") the mutual information of any of the genetic codes $f_\sigma$ with this geometric code $g$. When this was done in [5], it was found that the mutual information of the natural genetic code $f_I$ with the geometric code $g$ is about $M(f_I, g) \approx$ […]. For the alternative codes $f_\sigma$ with $\sigma \in S$, a 95% confidence interval found that the average mutual information $E_S(M(f_\sigma, g))$ is probably between 0.[…] and […], an interval which lies entirely below $M(f_I, g)$.

Since a smaller mutual information (closer to the ideal value of 0) represents better duplexed codes, this means that the duplexing of the real genetic code $f_I$ is worse than average. This, again, is the opposite of what would have been predicted by the GPH.

4. New Results: Mutual Information via a Monte-Carlo style discretization

Since the geometric parameters take values in a continuous space, the functions P f,g and P g which appear in the formula for the mutual information are actuallyprobability distribution functions whose values only become probabilities when in-tegrated over regions of that space. In the previous paper this computation wasdiscretized in a rigid way by “binning” the data into ﬁxed and pre-determinedsubsets of equal size.That approach is plausible, but not uniquely so. A more standard approach innumerical analysis is to consider a discretization based on choosing a suitable ran-dom sample Y N of N geometries that are chosen taking into account the Gaussiandistributions in Table 3, and replace the usual discrete mutual information by M N ( f, g ) = 1 W X y ∈ Y N X a P f,g ( a, y ) log (cid:18) P f,g ( a, y ) P f ( a ) P g ( y ) (cid:19) with W = X y ∈ Y N X a P f,g ( a, y ) , where y is a codon geometry, a is an amino acid, f = f σ is one of the genetic codesmapping codons to amino acids. The normalization by factor 1 /W is the volumeelement for approximate integration over the probability density P a P f,g ( a, y ). Itnormalizes the sum appropriately, in the sense of M N ( f, g ) approaching a commonvalue M ( f, g ) as N increases.The randomness of the sample Y N is one of the main diﬀerences between theprior results and this new approach. Another diﬀerence is the randomness used inthe approximation of the values of the probability distributions themselves. Unlikethe previous approach in which the probabilities were estimated by using rigidlychosen deviations from the expected values, this time a Monte Carlo approach willbe utilized. 
In particular, here we construct random choices for $N = 64n$ geometry samples as the union of a set of $n$ sample points for each codon, with the samples for each codon constructed from $n$ sets of random values for the twelve Hassan-Calladine parameters for the two dimers; the randomness is based on assuming that these parameters are all independent and that each is normally distributed with mean and standard deviation as in Table 3.

The final, and perhaps most interesting, difference between the previous approach and the one taken in this section is an inversion of the geometric data which directly computes the probability that a given codon will take on a given geometric conformation. The basic quantity needed is the probability distribution $P_d(\sigma)$ for the values of the dimer step $\sigma$ for dimer $d$ with Hassan-Calladine parameters $\Delta_1, \Delta_2, \Delta_3, \theta_1, \theta_2, \theta_3$. As above, this is based on the assumption that these parameters are independent and each is normally distributed, so:
\[
P_d(\sigma) = \prod_{i=1}^{3} \frac{1}{s_{\Delta_i}(d)\sqrt{2\pi}} \exp\left( -\frac{(\Delta_i - \bar{\Delta}_i(d))^2}{2\, s_{\Delta_i}(d)^2} \right)
\prod_{j=1}^{3} \frac{1}{s_{\theta_j}(d)\sqrt{2\pi}} \exp\left( -\frac{(\theta_j - \bar{\theta}_j(d))^2}{2\, s_{\theta_j}(d)^2} \right),
\tag{4.1}
\]
where $\bar{\Delta}_i(d)$, $\bar{\theta}_j(d)$ are the means and $s_{\Delta_i}(d)$, $s_{\theta_j}(d)$ the standard deviations for dimer $d$ given in Table 3.
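The sampling assumption and the density of Eq. (4.1) can be illustrated with a minimal Python sketch. It is not the code used for the computations reported here: the function names are hypothetical, and the means and standard deviations are placeholder numbers standing in for one row of Table 3, not values taken from it.

```python
import math
import random

def sample_step(means, sds, rng):
    """Draw one dimer step: six independent normal Hassan-Calladine parameters."""
    return [rng.gauss(mu, sd) for mu, sd in zip(means, sds)]

def step_density(step, means, sds):
    """Evaluate Eq. (4.1): a product of six one-dimensional normal densities."""
    p = 1.0
    for x, mu, sd in zip(step, means, sds):
        z = (x - mu) / sd
        p *= math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return p

# Placeholder statistics for a single dimer (NOT the values of Table 3),
# ordered as Delta_1, Delta_2, Delta_3, theta_1, theta_2, theta_3.
MEANS = [0.0, -0.3, 3.3, 0.0, 3.0, 34.0]
SDS = [0.5, 0.6, 0.2, 3.0, 5.0, 5.0]

rng = random.Random(0)  # fixed seed so the sample is reproducible
samples = [sample_step(MEANS, SDS, rng) for _ in range(1000)]
peak = step_density(MEANS, MEANS, SDS)  # the density is largest at the mean
```

Because the six parameters are assumed independent, both the sampling and the density factor coordinate by coordinate, which is why no covariance matrix appears in the sketch.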

Consider codon $C$ consisting of dimers $d_1$ and $d_2$, and for a choice of their dimer steps $\sigma_1$ and $\sigma_2$, denote the resulting codon geometry $y$ as the “step product” $\sigma_1 * \sigma_2$. Then $P_C(y)$, the probability density of the codon $C$ having geometry $y$, is given by an integral over all paths to $y$:
\[
P_C(y) = \int_{\sigma_1} \Biggl[ \sum_{\sigma_2 \,\mid\, \sigma_1 * \sigma_2 = y} P_{d_2}(\sigma_2) \Biggr] P_{d_1}(\sigma_1)\, d\sigma_1 .
\]
For each pair of values for $y$ and $\sigma_1$, we must solve for all possible values $\sigma_2$; fortunately this can be done explicitly, and generically there are only two such paths; this is detailed in the next subsection.

For each path the quantity $P_{d_2}(\sigma_2)$ will be evaluated using Eq. (4.1). The outer integral over a six dimensional space is instead dealt with by a Monte-Carlo method: noting that it is the integral w.r.t. a probability measure, we approximate by choosing a sample of $m$ random sets of values for the Hassan-Calladine parameters, in turn determining a set $\Sigma_m$ of values for $\sigma_1$, and averaging:
\[
P_C(y) \approx \frac{1}{m} \sum_{\sigma_1 \in \Sigma_m} \; \sum_{\sigma_2 \,\mid\, \sigma_1 * \sigma_2 = y} P_{d_2}(\sigma_2).
\]
Then we assemble the pieces:
\[
P_g(y) = \frac{1}{64} \sum_{C} P_C(y), \qquad
P_{f,g}(a, y) = \frac{1}{64} \sum_{C \,\mid\, f(C) = a} P_C(y),
\]
and the easy one, the fraction of codons that give a specified amino acid:
\[
P_f(a) = \frac{1}{64} \bigl| \{ C \mid f(C) = a \} \bigr| .
\]

4.1. Reconstructing the second dimer step.
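Taking the densities $P_C(y)$ as given for the moment, the assembly of $P_g$, $P_{f,g}$ and $P_f$ into the estimate $M_N(f, g)$ is short. The following Python sketch is illustrative only and rests on stated assumptions: the function name and data layout are hypothetical, the natural logarithm is used because the base is not fixed in the text, and the number of codons is read from the code table (64 in the setting of this paper) so that a toy example can be run.

```python
import math
from collections import defaultdict

def mutual_information(pc, code):
    """Return M_N(f, g).

    pc   -- list over sample geometries y_k; pc[k][C] is the density P_C(y_k)
    code -- dict mapping each codon C to its amino acid f(C)
    """
    n = len(code)  # 64 for a full codon table
    # P_f(a): fraction of codons that give amino acid a.
    pf = defaultdict(float)
    for a in code.values():
        pf[a] += 1.0 / n
    total = 0.0   # running sum of P_fg * log(P_fg / (P_f * P_g))
    weight = 0.0  # W, the sum of P_fg over all pairs (a, y)
    for row in pc:
        pg = sum(row[c] for c in code) / n   # P_g(y)
        pfg = defaultdict(float)             # P_{f,g}(a, y)
        for c, a in code.items():
            pfg[a] += row[c] / n
        for a, p in pfg.items():
            weight += p
            if p > 0.0:  # a zero term contributes nothing to the sum
                total += p * math.log(p / (pf[a] * pg))
    return total / weight

# Toy check with four hypothetical codons and two amino acids.
toy_code = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
# Geometry that ignores the codon entirely.
flat = [{c: 1.0 for c in toy_code}] * 3
# Geometry that reveals the amino acid.
split = [{"c1": 1.0, "c2": 1.0, "c3": 0.0, "c4": 0.0},
         {"c1": 0.0, "c2": 0.0, "c3": 1.0, "c4": 1.0}]
mi_flat = mutual_information(flat, toy_code)
mi_split = mutual_information(split, toy_code)
```

In the toy data the first table gives mutual information 0 (perfect duplexing) and the second gives $\log 2$, the largest value possible with two equally likely amino acids. The only missing ingredient is then the evaluation of $P_C(y)$ itself.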

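What remains is to solve $\sigma_1 * \sigma_2 = y$ for the second dimer step $\sigma_2$; that is, find the corresponding Hassan-Calladine angles $\theta_i$ and then the lengths $\Delta_i$.

Except for one case noted below (of negligible probability), the angles $\theta_i$, $i = 1$–$3$, are determined up to negation of the pair $\theta_1, \theta_2$. This comes from the formula
\[
T = M\, R_z(\theta_3/2 - \phi)\, R_y(\eta)\, R_z(\theta_3/2 + \phi)
\tag{4.2}
\]
as in Section 2.2 of [5] (see also Eqs. (9) of [4]). The matrices $M$ and $T$ are the frames respectively for the end of the first dimer and the end of the codon, $R_y$ and $R_z$ are the familiar matrices for rotations about the $y$ and $z$ axes
\[
R_y(\eta) = \begin{pmatrix} \cos\eta & 0 & \sin\eta \\ 0 & 1 & 0 \\ -\sin\eta & 0 & \cos\eta \end{pmatrix},
\qquad
R_z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix},
\]
and
\[
\eta = \operatorname{sign}(\theta_2)\sqrt{\theta_1^2 + \theta_2^2}, \qquad \sin\phi = \theta_1/\eta \quad\text{with}\quad -\frac{\pi}{2} \le \phi \le \frac{\pi}{2}.
\]
From this, $\theta_2 = \operatorname{sign}(\theta_2)\sqrt{\eta^2 - \theta_1^2} = \eta\cos\phi$.

Let $R = M^{-1}T = R_z(\theta_3/2 - \phi)\, R_y(\eta)\, R_z(\theta_3/2 + \phi)$ be the combined rotation. Since rotations about the $z$ axis fix $e_3$,
\[
R_{33} = \bigl(R_z(\theta_3/2 - \phi)^T e_3\bigr)^T R_y(\eta) \bigl(R_z(\theta_3/2 + \phi)\, e_3\bigr) = e_3^T R_y(\eta)\, e_3 = (R_y(\eta))_{33} = \cos\eta,
\]
so $\eta = \pm\arccos R_{33}$, with $\arccos R_{33} \in [0, \pi]$.

Case 1 (Generic): $-1 < R_{33} < 1$, so $0 < |\eta| < \pi$. Defining $\alpha = \theta_3/2 + \phi$ and $\beta = \theta_3/2 - \phi$, $\alpha$ is determined in $(-\pi, \pi]$ by
\[
\cos\alpha = -R_{31}/\sin\eta, \qquad \sin\alpha = R_{32}/\sin\eta,
\]
and likewise $\beta$ by
\[
\cos\beta = R_{13}/\sin\eta, \qquad \sin\beta = R_{23}/\sin\eta.
\]
Here $\sin\eta \neq 0$, so these quotients are well defined. Then $\theta_3 = \alpha + \beta$, $\phi = (\alpha - \beta)/2$, $\theta_1 = \eta\sin\phi$, $\theta_2 = \eta\cos\phi$. The two choices for $\eta$ likewise negate $\theta_1$ and $\theta_2$, but only shift $\theta_3$ by an irrelevant increment of $2\pi$.

Case 2: $R_{33} = 1$, so $\eta = 0$. Then $\theta_1 = \theta_2 = 0$, and $R = R_z(\theta_3)$ so $\theta_3$ is determined easily.

Case 3: $R_{33} = -1$, so $\eta = \pm\pi$. This is the problem case, as $R$ now depends only on $\phi$, not $\theta_3$, so the latter is not constrained at all. However, the value $\eta = \pm\pi$ is extremely unlikely: $\eta^2 = \theta_1^2 + \theta_2^2$, and as seen in Table 3, the values for the latter two angles are far too small.

Reconstructing the remaining Hassan-Calladine parameters $\Delta_i$ is now straightforward; they are related to the known positions of the ends of each dimer and the Hassan-Calladine angles by a system of linear equations, as seen in the formula for the mid-step frame, written here as $T_{\mathrm{mid}}$,
\[
T_{\mathrm{mid}} = \bigl( v_1^\top \; v_2^\top \; v_3^\top \bigr) = M\, R_z\!\left( \frac{\theta_3}{2} - \phi \right) R_y\!\left( \frac{\eta}{2} \right) R_z(\phi),
\]
with the (known) positions $p_2$ and $p_3$ of the ends of the second and third bases related by
\[
p_3 = p_2 + \Delta_1 v_1 + \Delta_2 v_2 + \Delta_3 v_3
\]
from [5]; see also Eqs. (10, 11) of [4].

4.2. Numerical Results.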
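As a numerical check on the case analysis of Section 4.1, the generic Case 1 can be exercised by building $R$ from known angles and recovering them. The Python sketch below is an illustration under stated assumptions, not the code behind the results that follow: all function names are hypothetical, angles are in radians, the test angles are made-up values, and the degenerate Cases 2 and 3 are deliberately left out.

```python
import math

def rot_y(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_z(t):
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def step_rotation(t1, t2, t3):
    """Forward map: R = R_z(t3/2 - phi) R_y(eta) R_z(t3/2 + phi)."""
    eta = math.copysign(math.hypot(t1, t2), t2)
    phi = math.asin(t1 / eta) if eta != 0.0 else 0.0
    return matmul(rot_z(t3 / 2 - phi), matmul(rot_y(eta), rot_z(t3 / 2 + phi)))

def recover_angles(r):
    """Case 1 only: return the two angle triples (theta1, theta2, theta3)."""
    r33 = r[2][2]
    if not -1.0 < r33 < 1.0:
        raise ValueError("degenerate case: R_33 = +/-1 needs separate handling")
    eta = math.acos(r33)                            # the choice eta > 0
    s = math.sin(eta)
    alpha = math.atan2(r[2][1] / s, -r[2][0] / s)   # from R_32 and R_31
    beta = math.atan2(r[1][2] / s, r[0][2] / s)     # from R_23 and R_13
    phi = (alpha - beta) / 2
    first = (eta * math.sin(phi), eta * math.cos(phi), alpha + beta)
    # The choice -eta shifts alpha and beta by pi each: theta1 and theta2
    # change sign and theta3 moves by 2*pi.
    second = (-first[0], -first[1], first[2] + 2 * math.pi)
    return first, second

def max_diff(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(3) for j in range(3))

original = (0.05, 0.10, 0.60)   # hypothetical theta1, theta2, theta3 in radians
R = step_rotation(*original)
first, second = recover_angles(R)
```

For these test angles `first` reproduces the original triple, while `second` is the negated pair with $\theta_3$ shifted by $2\pi$; feeding either triple back through `step_rotation` returns the same matrix $R$, which is the two-path ambiguity described in Case 1.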

The most accurate calculation so far for the true genetic code is with $N = 4096$ samples and $m = 2048$ samples for each evaluation of $P_C(y)$. This gives $M_N(f_I, g) \approx$ … $N = 2048$ and $m = 1024$; the random codes then give a mean $E_S(M_N(f_\sigma, g))$ value of …
\[
\ldots \le E_S(M_N(f_\sigma, g)) \le \ldots < M_N(f_I, g) \approx \ldots
\]

5. Conclusions

A given genetic sequence can be interpreted as encoding a protein and also as influencing the geometry of the DNA molecule that carries it. This is therefore a situation like the one in the Introduction where Georgina and Fred are each interpreting the same signal differently. It is therefore of interest to understand how well-duplexed these two “codes” are.


Combining the results of the previous paper [5] with the new results in Section 4, this has now been done in three different ways. Disappointingly, each time the answer has turned out to be “about what you’d expect if the real genetic code was just selected randomly, but maybe a little worse”. In other words, contrary to the predictions of the Geometric Pressure Hypothesis (GPH), the natural genetic code does not appear to be especially well-duplexed. Indeed, if one replaced the natural genetic code with a random alternative, it is likely that there would be more freedom in the geometry of the DNA molecule while encoding any given protein.

It is interesting to speculate about what that tells us about the evolution of the genetic code. On the one hand, it could simply imply that there is not much evolutionary advantage in having the ability to alter the shape of the DNA molecule without changing the protein it encodes. However, that is not the only possible explanation. Another intriguing possibility, which was raised during the discussion after the talk in the session for which this volume serves as the proceedings, is that the code became fixed before the molecules became large enough for it to matter. In particular, it does seem likely that the geometry of the molecule is not very important when the chromosome is very short. So, if the genetic code that we are familiar with was shaped during an early period in evolution when the genomes of living creatures were all very small, the GPH might not have applied. And, since the genetic code is no longer very malleable (as demonstrated by its near ubiquity), it might no longer have been able to change once the molecules grew large enough for their geometry and topology to matter.

In any case, whatever the explanation may be, the new computations have only re-confirmed the answer found previously to the question of the title. Since the natural genetic code appears only slightly less well-duplexed with the geometric code than an average alternative, it does not appear that the evolution of the genetic code was shaped by any pressure to optimize it.

References

1. Alexander RW, Schimmel P, (2001) “Wobble Hypothesis” in Encyclopedia of Genetics (S Brenner and JH Miller, eds) Elsevier.
2. Barrell BG, Bankier AT, Drouin J. A different genetic code in human mitochondria. Nature 1979;282:189-194.
3. Berg JM, Tymoczko JL, Stryer L, (2002) Biochemistry. 5th Edition. WH Freeman. Section 5.5.1.
4. Hassan MA and Calladine CR “The Assessment of the Geometry of Dinucleotide Steps in Double-Helical DNA; a New Local Calculation Scheme” J. Mol. Biol. (1995) 251 648-664.
5. Kasman, A. “The Duplexing of the Genetic Code and Sequence-Dependent DNA Geometry” Bull Math Biol (2018). https://doi.org/10.1007/s11538-018-0486-3
6. Kawaguchi Y, Honda H, Taniguchi-Morimura J, Iwasaki S. The codon CUG is read as serine in an asporogenic yeast Candida cylindracea. Nature 1989;341:164-166.
7. Kiga D, Sakamoto K, Kodama K, Kigawa T, Matsuda T, Yabuki T, Shirouzu M, Harada Y, Nakayama H, Takio K, et al. An engineered Escherichia coli tyrosyl-tRNA synthetase for site-specific incorporation of an unnatural amino acid into proteins in eukaryotic translation and its application in a wheat germ cell-free system. Proc. Natl Acad. Sci. USA 2002;99:9715-9720.
8. Koonin EV and Novozhilov AS, “Origin and Evolution of the Universal Genetic Code” Annu. Rev. Genet.

J Mol Biol.

11. Olson WK, Gorin AA, Lu XJ, Hock LM, and Zhurkin VB “DNA sequence-dependent deformability deduced from protein-DNA crystal complexes” Proc. Natl. Acad. Sci. USA Vol. 95, pp. 11163–11168, September 1998.
12. Srinivasan G, James CM, Krzycki JA. Pyrrolysine encoded by UAG in Archaea: Science. 2002 May 24;296(5572):1459-62.
13. Wang L, Brock A, Herberich B, Schultz PG. Expanding the genetic code of Escherichia coli. Science 2001;292:498-500.
14. Yamao F, Muto A, Kawauchi Y, Iwami M, Iwagami S, Azumi Y, Osawa S. UGA is read as tryptophan in Mycoplasma capricolum. Proc. Natl Acad. Sci. USA 1985;82:2306-2309.
15. Zhang Z and Yu J, “On the Organizational Dynamics of the Genetic Code”, Genomics, Proteomics & Bioinformatics Vol. 9, 1–2, April 2011, pp. 21–29.

Department of Mathematics, College of Charleston, Charleston, SC 29424
