Did Sequence Dependent Geometry Influence the Evolution of the Genetic Code?
ALEX KASMAN AND BRENTON LEMESURIER
Abstract.
The genetic code is the function from the set of codons to the set of amino acids by which a DNA sequence encodes proteins. Since the codons also influence the shape of the DNA molecule itself, the same sequence that encodes a protein also has a separate geometric interpretation. A question then arises: How well-duplexed are these two “codes”? In other words, in choosing a genetic sequence to encode a particular protein, how much freedom does one still have to vary the geometry (or vice versa)? A recent paper by the first author addressed this question using two different methods. After reviewing those results, this paper addresses the same question with a third method: the use of Monte Carlo and Gaussian sampling methods to approximate a multi-integral representing the mutual information of a variety of possible genetic codes. Once again, it is found that the genetic code used in nuclear DNA has a slightly lower than average duplexing efficiency as compared with other hypothetical genetic codes. A concluding section discusses the significance of these surprising results.
The first author’s talk at the AMS Special Session on the Topology of Biopolymers explored the mathematical relationship between two different roles that a DNA sequence serves in living cells: encoding proteins to be produced and influencing the shape of the DNA molecule itself. Those results were subsequently published as a journal article [5]. After briefly summarizing the main results of that published paper, this article takes them a step further using a more sophisticated approach to the numerical computation of the mutual information. By combining Gaussian and Monte Carlo sampling methods with a new geometric inversion formula for computing the geometries, this new approach provides a more reliable result which strengthens and reconfirms the previously announced conclusions.

1. Measuring the Efficiency of Duplexed Codes

1.1. A Motivating Example.
Consider the following unlikely situation: You will soon need to send a text message conveying a two letter word to your friend Georgina and you also have to send a two letter word by text message to your friend Fred. However, because of your restrictive data plan, you must achieve this by sending a single two character message to both of them at the same time.

You can hope to achieve this by teaching Fred one of the two functions $f_1$ or $f_2$ and teaching Georgina the function $g$ shown in Table 1. Each of those functions turns one of the integers from 0 to 5 into a letter and can therefore be used as a simple “code”. For example, since Georgina knows the function $g$ you can send her the numerical message “24” and she would interpret it as “$g(2)g(4)$ = NO”. Alternatively, she would interpret the message “53” as the exclamation “OH”. Similarly, using either of the two functions $f_1$ or $f_2$, Fred could recognize the signal “04” as the greeting “HI”.

Date: March 4, 2020.
2010 Mathematics Subject Classification. Primary 92B05, 94A17; Secondary 65D30.

c    f_1(c)    f_2(c)    g(c)
0    H         H         N
1    O         O         H
2    O         H         N
3    I         O         H
4    I         I         O
5    H         I         O

Table 1. The functions $f_1$, $f_2$ and $g$ used in this introduction to illustrate duplexing and mutual information.

The really interesting thing is that you could send the same two digit message to both Georgina and Fred and they would interpret it differently. That is the defining characteristic of duplexed codes: the same signal has two different interpretations.

Let us first suppose that Fred has memorized $f_1$ and Georgina knows the code $g$. If you wanted to send Georgina a message that will be interpreted as “NO”, you have four different choices of signal (“04”, “05”, “24” and “25”) which would convey that message to her, and each one would mean something different to Fred. For instance, you could send “04” which Fred will interpret as “HI” or you could send “25” which has the same interpretation for Georgina but which Fred will interpret as “OH”. In this scenario, you have the freedom to send different messages to Fred while still sending the desired message to Georgina at the same time.

In contrast, things would be different if Fred had learned $f_2$ as his code instead. Even though you would still have a choice of four signals to send Georgina that would be interpreted as “NO”, you would not be able to separately control the message that was sent to Fred because all four of the signals that mean “NO” under the code $g$ would be interpreted as “HI” using code $f_2$. There would be no way to send Fred the message “OH” or any other message besides “HI” if Georgina’s message is to be interpreted as “NO”. Even though there is nothing wrong with the code $f_2$ on its own, there is something unfortunate about its relationship to $g$ which creates an obstruction to sending the message “NO” to Georgina while simultaneously sending the message “OH” to Fred.

Loosely speaking, we say that two codes are well-duplexed if such obstructions to encoding two messages simultaneously are rare. Conversely, they are poorly-duplexed if the choice of a message for one recipient severely restricts the messages that can be sent to the other recipient with the same signal. A more rigorous and quantifiable method of determining whether two codes are well-duplexed or poorly-duplexed uses the concept of mutual information, which is part of the branch of mathematics known as information theory.

1.2. Duplexed Codes and Mutual Information.
Let us say that $f$ and $g$ are duplexed codes whenever $f : \mathcal{C} \to \mathcal{X}$ and $g : \mathcal{C} \to \mathcal{Y}$ are two functions with the same domain. The terminology makes sense when one imagines sending a single “signal” $c \in \mathcal{C}$ to two recipients each of whom knows one of those two codes. The goal of this section is to introduce a number associated to any duplexed codes which measures how much freedom you have to send different messages to one recipient even after the message for the other recipient is fixed.

For a randomly selected element $c \in \mathcal{C}$, let $P_f(x)$ denote the probability that $f(c) = x \in \mathcal{X}$, $P_g(y)$ be the probability that $g(c) = y$, and $P_{f,g}(x \text{ and } y)$ be the probability that both $f(c) = x$ and $g(c) = y$.

For example, using the functions defined in Table 1 with domain $\mathcal{C} = \{0, 1, 2, 3, 4, 5\}$, we see $g(c) = N$ is true for two of the six possible values of $c$ and so $P_g(N) = 2/6 = 1/3$. Moreover, $P_{f_1,g}(H, N) = 1/6$ since the only way that $f_1(c) = H$ and $g(c) = N$ could both be true is if $c = 0$. However, $P_{f_2,g}(H, N) = 1/3$ since both $c = 0$ and $c = 2$ satisfy $f_2(c) = H$ and $g(c) = N$.

The mutual information (measured in bits) of the duplexed codes $f : \mathcal{C} \to \mathcal{X}$ and $g : \mathcal{C} \to \mathcal{Y}$ is defined to be

(1.1)  $M(f, g) = \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} P_{f,g}(x \text{ and } y) \log_2\left(\frac{P_{f,g}(x \text{ and } y)}{P_f(x)\,P_g(y)}\right).$

It is easy to see that $0 \le M(f, g)$ is true for any two codes $f$ and $g$. The minimum possible value of 0 occurs when $P_{f,g}(x \text{ and } y) = P_f(x) P_g(y)$ for all choices of $x$ and $y$. A familiar fact from probability theory is that the joint probability is equal to the product of the two probabilities precisely when the events are independent. Indeed, the same idea applies here, although we now interpret it in terms of the independence of the two codes. If the mutual information of two codes is zero then this tells us that the codes are very well-duplexed in that the selection of a message to one recipient does not restrict the message that can be sent to the other.

Since a mutual information of 0 represents the best possible duplexing of codes, larger mutual information means that the codes are not as well-duplexed. For example, we can compute that

$M(f_1, g) \approx 0.585$ and $M(f_2, g) \approx 1.585$

for the codes $f_1$, $f_2$ and $g$ from Table 1 in the previous section. The combination of functions $f_2$ and $g$ is a bad choice for duplexing since if we were using those as codes for messages to send Fred and Georgina then we could not separately choose a message for each recipient. In contrast, $f_1$ and $g$ work better as a combination because even after we have chosen the message for one of the intended recipients we still have a choice of message that can be sent to the other. This is reflected here in the fact that $M(f_1, g) < M(f_2, g)$; the mutual information when using $f_1$ is closer to zero and therefore closer to being optimal for duplexing.

1.3. Comparisons with Expected Values.
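Since the comparison in this section builds on the mutual information of the small codes in Table 1, it may help to see how formula (1.1) can be evaluated numerically. The following is a minimal Python sketch, assuming that the signal $c$ is drawn uniformly from the domain and that logarithms are taken base 2 (bits). The function name `mutual_information` and the dictionary encoding of the codes are illustrative choices, not part of the original paper.

```python
from collections import Counter
from math import log2


def mutual_information(f, g, domain):
    """Mutual information in bits of two codes f and g on a common finite
    domain, following (1.1) with c drawn uniformly from the domain."""
    n = len(domain)
    count_f = Counter(f[c] for c in domain)            # how often f(c) = x
    count_g = Counter(g[c] for c in domain)            # how often g(c) = y
    count_fg = Counter((f[c], g[c]) for c in domain)   # how often both hold
    total = 0.0
    # Pairs (x, y) that never occur have probability 0 and contribute 0.
    for (x, y), k in count_fg.items():
        p_xy = k / n
        total += p_xy * log2(p_xy / ((count_f[x] / n) * (count_g[y] / n)))
    return total


# The three codes of Table 1, written as dictionaries on {0, ..., 5}.
domain = range(6)
f1 = dict(zip(domain, "HOOIIH"))
f2 = dict(zip(domain, "HOHOII"))
g = dict(zip(domain, "NHNHOO"))

m1 = mutual_information(f1, g, domain)
m2 = mutual_information(f2, g, domain)
mean = (m1 + m2) / 2  # average over the index set {1, 2}
print(round(m1, 3), round(m2, 3), round(mean, 3))
```

With these inputs the first value works out to $\log_2(3/2)$ and the second to $\log_2 3$, so the pairing of $f_1$ with $g$ has the smaller mutual information, in line with the discussion of Table 1.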
Let $F : S \to \mathbb{R}$ be a real-valued function on the finite set $S = \{\sigma_1, \ldots, \sigma_n\}$. Then define the expected value $E_S(F(\sigma))$ by the familiar formula

$E_S(F(\sigma)) = \frac{1}{n} \sum_{\sigma \in S} F(\sigma).$

You will probably notice that this is nothing other than the mean of the values that $F$ takes. The terminology “expected value”, taken from probability theory, is a notion analogous to the average in the context of random variables. The way to interpret it here is to imagine an experiment in which you randomly select an element $\sigma$ from $S$ and make a measurement of it to find the value $F(\sigma)$. Then $E_S(F(\sigma))$ is the expected value in the sense that it would be the average of the measurements after a large number of experiments. In particular, if for a particular $\hat\sigma \in S$ one has

$F(\hat\sigma) < E_S(F(\sigma))$

then one can say that the value of $F(\hat\sigma)$ is lower than the value one would expect for a randomly selected element of $S$.

(Footnote to (1.1): When $P_{f,g}(x \text{ and } y) = 0$ it is understood that $P_{f,g}(x \text{ and } y) \log_2\left(\frac{P_{f,g}(x \text{ and } y)}{P_f(x)\,P_g(y)}\right) = 0$.)

For example, using the functions $f_1$, $f_2$ and $g$ from the motivating example above, we can consider the mutual information $M(f_\sigma, g)$ as a real-valued function on the index set $S = \{1, 2\}$. Then

$0.585 \approx M(f_1, g) < E_S(M(f_\sigma, g)) \approx \frac{0.585 + 1.585}{2} \approx 1.085$

tells us that the duplexing of the code $f_1$ with $g$ is better than average for codes selected from $\{f_1, f_2\}$. Although we already knew that in this case simply by comparing the individual mutual information values, this notation will prove useful below where we will be doing something similar but with a very large index set.

2. A Natural Example of Duplexed Codes Associated to DNA
2.1. The Genetic Code.
Let $\mathcal{B} = \{A, C, G, T\}$ be the set of DNA bases. Because DNA sequences of length 2 and 3 will play special roles in this paper, let us introduce the following terminology and notation: The set of dimers (length two sequences) is $\mathcal{D} = \{b_1 b_2 : b_i \in \mathcal{B}\}$ and the set of codons (length three sequences) is $\mathcal{C} = \{b_1 b_2 b_3 : b_i \in \mathcal{B}\}$.

A genetic code is simply a function $f_I$ from the set $\mathcal{C}$ of codons to the set $\mathcal{X}$ of amino acids (and the word “stop”):

$\mathcal{X} = \{I, L, V, F, M, C, A, G, P, T, S, Y, W, Q, N, H, E, D, K, R, \text{Stop}\}.$

The genetic code used by the nuclear DNA in humans is shown in Table 2, and this is the same genetic code used by nearly all known living organisms [8, 9]. We will refer to the particular genetic code given in Table 2 as “the natural genetic code” so as to distinguish it from other hypothetical codes that are not found in biology but will be used for comparison later in the paper.

However, it is important to realize that there are other genetic codes that are used by biological organisms (notably, mitochondria use a different genetic code) and that scientists have also introduced artificial genetic codes which nevertheless seem to function well enough to support life [2, 6, 7, 10, 12, 13, 14]. So, there is no physical law requiring this to be the genetic code. In theory, the genetic code could have been different and it is reasonable to ask the question “Why do nearly all living organisms use this particular genetic code?”

There is evidence to support the hypothesis that the natural genetic code is the result of a combination of coincidences and evolutionary pressures (see [15] and references therein). For example, two codons for the same amino acid differ only in the third base much more frequently than would be predicted by chance if the genetic code were constructed entirely randomly. This has evolutionary advantages in that it decreases the likelihood that a mutation or mis-pairing of mRNA and tRNA will produce a different protein [3, 1]. It is therefore presumed that this feature is not a coincidence but an example of the effect of natural selection on the formation of the genetic code.

Codon (c)                              Amino Acid (f_I(c))
ATT, ATC, ATA                          I
CTT, CTC, CTA, CTG, TTA, TTG           L
GTT, GTC, GTA, GTG                     V
TTT, TTC                               F
ATG                                    M
TGT, TGC                               C
GCT, GCC, GCA, GCG                     A
GGT, GGC, GGA, GGG                     G
CCT, CCC, CCA, CCG                     P
ACT, ACC, ACA, ACG                     T
TCT, TCC, TCA, TCG, AGT, AGC           S
TAT, TAC                               Y
TGG                                    W
CAA, CAG                               Q
AAT, AAC                               N
CAT, CAC                               H
GAA, GAG                               E
GAT, GAC                               D
AAA, AAG                               K
CGT, CGC, CGA, CGG, AGA, AGG           R
TAA, TAG, TGA                          Stop

Table 2. This table defines the genetic code $f_I : \mathcal{C} \to \mathcal{X}$. Each of the codons from $\mathcal{C}$ in the first column is a pre-image of the corresponding amino acid (or “Stop”) in the second.

2.2. Sequence Dependent DNA Geometry.
When shown in illustrations, DNA often looks like a perfectly straight double-helix, a twisted ladder with “rungs” that are the base pairs carrying the genetic sequence. However, real DNA is not straight; it is bent and twisted into compact shapes that fit into living cells.

It is perhaps not surprising that the way that a DNA molecule bends is affected by the sequence of bases which make it up. After all, A, C, G, and T in $\mathcal{B}$ are not just abstract mathematical symbols. They represent actual chemical structures that form the base pairs in a DNA molecule. Hence, the electrical repulsion and attraction between successive “rungs” in the DNA ladder will vary with that sequence.

Olson et al. [11] experimentally determined the geometry of each of the 16 dimers $d \in \mathcal{D}$ by repeatedly measuring the configurations of DNA strands that were two base pairs long. They computed the average and standard deviation of each of the six Hassan-Calladine dimer step parameters (see [4, 5]). Their results are shown in Table 3.

Assuming that the geometric configuration of each dimer in a longer sequence has the same expected values and standard deviations as the isolated dimers in those experiments, it is possible to make a similar table for the geometric configurations associated to each of the 64 codons in $\mathcal{C}$. The function

(2.1)  $\bar g : \mathcal{C} \to \mathbb{R}^3$

shown in Table 4 associates to each codon $c \in \mathcal{C}$ a 3-tuple of numbers $\bar g(c)$ which gives the location of the center of the top of the codon in Angstroms if its base is located at the origin and if each of the dimers takes exactly the expected geometry according to Olson et al. (Note: In [5], this role is played by a function $\bar\Gamma : \mathcal{C} \to \mathbb{R}^6$ whose image has six components because it has angular information as well, but for simplicity in this note we are considering only the first three components which encode the location of the center of the third rung and not the way it is tilted.)

Table 3. (Columns: dimer $d$, then $\bar\Delta_i(d)$ $(\hat\Delta_i(d))$ for $i = 1, 2, 3$ and $\bar\theta_i(d)$ $(\hat\theta_i(d))$ for $i = 1, 2, 3$; one row for each of the 16 dimers AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT.) This table shows the mean ($\bar\Delta_i$) and standard deviation ($\hat\Delta_i$) of each of the Hassan-Calladine step parameters for each dimer as determined experimentally in [11].

Figure 1 shows just the projection of $\bar g(c)$ onto its first two coordinates for each of the 64 codons $c \in \mathcal{C}$. You can imagine that a codon (a DNA sequence of length 3) is coming straight out of the $xy$-plane at you. Each point in this figure represents a codon and they all start out at the origin, but because the expected dimer step parameters depend on the particular bases involved, by the time they get up to their third rung they are in slightly different positions. In particular, the points indicate the locations of the center of the third rung (with units given in Angstroms) if each of the dimer step parameters takes its expected values in agreement with the experiments of Olson et al.

As you can see, the different codons do have slightly different expected geometries. It is important to realize that these small differences can combine in dramatic ways when considering longer sequences made up of many successive codons. For instance, Figure 2 shows the expected geometry for two different DNA sequences. Clearly, the sequence

$S_1$ = AAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGC

bends significantly more than the sequence

$S_2$ = AAGAATGGGCAGAAGCGTGCGAAGACTGGAAAGAATGGCCAGAAGCGTGCAAAAACGGGT.

So, geometrically they are quite different. But consider how each of these two sequences is translated into a protein according to the natural genetic code. The first codon in $S_1$ (AAA) and the first codon in $S_2$ (AAG) both encode the amino acid K. Similarly, the second codon in each (AAC and AAT, respectively) encodes the amino acid N.

In fact, the corresponding codons in each sequence are always mapped by the natural genetic code to the same amino acid.
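The codon-by-codon claim above can be checked mechanically against Table 2. The following is a small Python sketch of that check; the dictionary `TABLE_2` simply transcribes Table 2, while the helper name `translate` is an illustrative choice and not something defined in the paper.

```python
# Table 2 transcribed as: amino acid -> its codons (space separated).
TABLE_2 = {
    "I": "ATT ATC ATA",
    "L": "CTT CTC CTA CTG TTA TTG",
    "V": "GTT GTC GTA GTG",
    "F": "TTT TTC",
    "M": "ATG",
    "C": "TGT TGC",
    "A": "GCT GCC GCA GCG",
    "G": "GGT GGC GGA GGG",
    "P": "CCT CCC CCA CCG",
    "T": "ACT ACC ACA ACG",
    "S": "TCT TCC TCA TCG AGT AGC",
    "Y": "TAT TAC",
    "W": "TGG",
    "Q": "CAA CAG",
    "N": "AAT AAC",
    "H": "CAT CAC",
    "E": "GAA GAG",
    "D": "GAT GAC",
    "K": "AAA AAG",
    "R": "CGT CGC CGA CGG AGA AGG",
    "Stop": "TAA TAG TGA",
}

# Invert the table to obtain the natural genetic code f_I: codon -> amino acid.
GENETIC_CODE = {
    codon: amino for amino, codons in TABLE_2.items() for codon in codons.split()
}


def translate(sequence):
    """Split a DNA sequence into consecutive codons and apply f_I to each."""
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    return [GENETIC_CODE[sequence[i:i + 3]] for i in range(0, usable, 3)]


S1 = "AAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGCAAAAACGGGC"
S2 = "AAGAATGGGCAGAAGCGTGCGAAGACTGGAAAGAATGGCCAGAAGCGTGCAAAAACGGGT"

protein_1 = translate(S1)
protein_2 = translate(S2)
print(protein_1 == protein_2, "".join(protein_1))
```

Both sequences are 60 bases long, so each yields 20 amino acids, and the comparison `protein_1 == protein_2` tests exactly the statement made in the text.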
So, $S_1$ and $S_2$ encode exactly the same protein according to the natural genetic code but, according to the function $\bar g$, one of them exhibits a much greater curvature than the other.

2.3. The Geometric Pressure Hypothesis.
Note that the last example of two DNA sequences with very different expected geometries is in some ways similar to the opening example of a simple duplexed code. Just as Fred and Georgina can have different interpretations of the same signal of numbers, a sequence of codons can be interpreted either as encoding a protein or as influencing the shape of the DNA molecule.

Table 4. (Columns: codon $c$ and $\bar g(c)$; one row for each of the 64 codons.) The expected location of the center of the third rung of each codon relative to the position of the first rung, in Angstroms. (See Figure 1 for a visual representation of this data and [5] for mathematical details.)

The shape of a DNA molecule is relevant to its biological function. It must bend in the right places and be straight in the right places in order for the enzymes and RNA responsible for transcription to be able to do their work. This gives biological importance to the question of how well-duplexed the genetic and geometric codes discussed in the previous two sections are. For instance, if they are very poorly-duplexed, then it could often be the case that a DNA molecule cannot encode the protein that a creature needs unless it bends in a bad way. Conversely, it would be to a creature’s advantage for the codes to be well-duplexed because then it would always be able to simultaneously encode whatever protein and geometry are optimal.

Figure 1. This graphic shows the $x$ and $y$-coordinates of the images of the function $\bar g(c)$ (Table 4) for each codon $c \in \mathcal{C}$. They show the ways that the DNA molecule carrying that sequence would likely bend. If one was looking down at all 64 codons, each with its bottom rung fixed at the origin and with each dimer exhibiting its expected geometry (see Table 3) as it comes up out of the page towards you, the projections of the centers of the top rungs would be located at the locations indicated (with axes measured in Angstroms).

The geometric code was not something that evolution could act upon since it is determined by the laws of physics and chemistry. However, as we have seen, the genetic code could have been different and likely was influenced by natural selection. In [5], it was hypothesized that one of the factors that influenced that selection was pressure to ensure that the genetic and geometric codes were well-duplexed.

Figure 2. (Panels labeled $S_1$ and $S_2$.) The sequences of the two DNA molecules shown encode the same protein, but due to the differences in expected codon geometry, one of them is noticeably more bent than the other.

3. How Well Duplexed is the Real Genetic Code as Compared with Alternatives?
To test the “Geometric Pressure Hypothesis” (GPH), two different measures of duplexing efficiency were developed in [5]. Then, the duplexing efficiency of the geometric code with the natural genetic code was compared with its average duplexing efficiency with a large set of reasonable alternative genetic codes.

3.1. Alternative Genetic Codes.
Let $\mathrm{Sym}(\mathcal{C})$ be the group of permutations on the set $\mathcal{C}$ of codons. Then for any $\sigma \in \mathrm{Sym}(\mathcal{C})$, $f_\sigma = f_I \circ \sigma$ is the function from $\mathcal{C}$ to $\mathcal{X}$ which first replaces the codon $c \in \mathcal{C}$ with its image $\sigma(c)$ and then applies the natural genetic code function $f_I$ to that. (In other words, the function $f_\sigma$ would be represented by a table very much like Table 2 for the natural genetic code above, but the codons would be rearranged according to the permutation $\sigma$.)

Notice that no matter which permutation $\sigma$ is selected, the alternative genetic code $f_\sigma$ has in common with the natural genetic code $f_I$ not only that it is a map from $\mathcal{C}$ to $\mathcal{X}$ but also that for each amino acid $a \in \mathcal{X}$ the preimages are of the same size:

$|f_I^{-1}(a)| = |f_\sigma^{-1}(a)|.$

However, not all of those alternative genetic codes are realistic. For most choices of permutation $\sigma$, the alternative genetic code $f_\sigma$ will not have the property that two codons are more likely to encode the same amino acid (or chemically similar amino acids) when their first two bases are equal, which we have already noted is a property of the real genetic code which has evolutionary advantages. Since we only want to consider alternative codes that also have this property, the permutations considered in [5] were further restricted: we considered not arbitrary permutations but only ones with the property that the first two bases in the codons $\sigma(c)$ and $\sigma(c')$ are equal if and only if the first two bases of $c$ and $c'$ are equal. Let $S$ be the set of such permutations. Symbolically, we can define the restricted set $S \subset \mathrm{Sym}(\mathcal{C})$ of permutations using the map $d(b_1 b_2 b_3) = b_1 b_2$ which projects a codon onto its initial dimer as follows:

$S = \{\sigma \in \mathrm{Sym}(\mathcal{C}) : \forall c, c' \in \mathcal{C},\ d(\sigma(c)) = d(\sigma(c')) \Leftrightarrow d(c) = d(c')\}.$

To test the GPH in [5], the duplexing efficiency of the natural genetic code with the geometric code was compared with the expected value for the duplexing efficiency of the alternatives indexed by the set $S$. Since there are $(4^2)!\,(4!)^{16}$ permutations in $S$ (a permutation of the 16 initial dimers together with an independent rearrangement of the third base within each of the 16 blocks), this is a very large set of permutations to consider.

3.2. Total Network Length.
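Before working with network lengths, the restricted set $S$ of Section 3.1 can be made concrete. The sketch below, in Python, counts the elements of $S$ and draws one of them at random by permuting the 16 initial dimers and then independently rearranging the third base within each block. The function name `random_restricted_permutation` and the fixed seed are illustrative assumptions; this is not the sampling code used in [5].

```python
import random
from itertools import product
from math import factorial

BASES = "ACGT"
DIMERS = ["".join(p) for p in product(BASES, repeat=2)]
CODONS = ["".join(p) for p in product(BASES, repeat=3)]

# 16! ways to permute the dimer blocks times 4! bijections inside each block.
size_of_S = factorial(len(DIMERS)) * factorial(len(BASES)) ** len(DIMERS)


def random_restricted_permutation(rng):
    """Return a random permutation sigma of the codons (as a dict) such that
    sigma(c) and sigma(c') share an initial dimer exactly when c and c' do."""
    dimer_image = dict(zip(DIMERS, rng.sample(DIMERS, len(DIMERS))))
    sigma = {}
    for dimer in DIMERS:
        third_bases = rng.sample(list(BASES), len(BASES))  # shuffled A, C, G, T
        for base, new_base in zip(BASES, third_bases):
            sigma[dimer + base] = dimer_image[dimer] + new_base
    return sigma


sigma = random_restricted_permutation(random.Random(2020))
print(size_of_S, sigma["AAA"])
```

Because every combination of a dimer permutation and 16 third-base rearrangements is equally likely and distinct combinations give distinct permutations, this draws uniformly from $S$. Composing the result with the natural genetic code gives one alternative code $f_\sigma = f_I \circ \sigma$.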
The natural genetic code is shown in Table 2 and the expected geometries of the codons are shown in Figure 1. One way to combine this information is to draw an edge on the figure between any two codons that encode the same amino acid, turning it into a graph with vertices and edges. Thus, for instance, an edge would be drawn between the vertices labeled TAT and TAC because they both encode the amino acid Y, while the vertex labeled ATG would not be connected to any other vertices.

Each connected component of the graph corresponds to an amino acid. If there are only short edges or no edges in the component associated to an amino acid that you wish to encode in a sequence, then that means you have almost no geometric choice in the DNA molecule’s expected geometry at that point. On the other hand, if there are long edges in the connected component, then you would have a choice of different codons that encode that same amino acid but which would cause a very different geometric configuration of the DNA molecule.

One could do the same for an alternative genetic code $f_\sigma$. That graph would have the same number of edges as the graph for the natural genetic code, but they would not have the same lengths.

With all of this in mind, we define the “total network length” of the genetic code $f_\sigma$ for any $\sigma \in S$ to be the sum of the lengths of the edges in the graph:

$T_\sigma = \sum_{a \in \mathcal{X}} \sum_{c, c' \in f_\sigma^{-1}(a)} |\bar g(c) - \bar g(c')|.$

(Footnote: The second sum is over all distinct, unordered pairs in the pre-image $f_\sigma^{-1}(a)$ and the length denotes the ordinary metric on $\mathbb{R}^3$, i.e. $|(a, b, c)| = \sqrt{a^2 + b^2 + c^2}$.)

In the case of the natural genetic code, $T_I$ was found to be 45.[…]. If the GPH were correct, one would expect this value to be large as compared with the total network length for the alternative codes which are not found in nature.

Disappointingly, that is not what was found in [5]. A 95% confidence interval was constructed for the expected value of the total network length over all permutations in $S$. It was found that the average total network length $E_S(T_\sigma)$ is probably between 45.[…] and […]. Since $T_I < […] < E_S(T_\sigma) < […]$, the total network length $T_I$ for the natural genetic code is apparently a bit smaller than average rather than being especially large.

3.3. Mutual Information with a Discretized Geometric Code.
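For comparison with the mutual information approach described in this section, the total network length $T_\sigma$ of Section 3.2 is straightforward to compute once a code and a table of codon positions are available. The Python sketch below implements the double sum over unordered pairs of codons with the same amino acid. The coordinates in `toy_geometry` are hypothetical values chosen to make the arithmetic easy to follow; they are not the entries of Table 4, and the function name is an illustrative choice.

```python
from itertools import combinations
from math import dist


def total_network_length(code, geometry):
    """Sum of Euclidean distances between all unordered pairs of distinct
    codons that the code sends to the same amino acid."""
    preimages = {}
    for codon, amino in code.items():
        preimages.setdefault(amino, []).append(codon)
    return sum(
        dist(geometry[c], geometry[c_prime])
        for codons in preimages.values()
        for c, c_prime in combinations(codons, 2)
    )


# A toy code on five codons (a fragment of Table 2).
toy_code = {"TAT": "Y", "TAC": "Y", "ATG": "M", "AAA": "K", "AAG": "K"}

# Hypothetical positions in Angstroms, NOT the values of Table 4.
toy_geometry = {
    "TAT": (0.0, 0.0, 6.0),
    "TAC": (3.0, 4.0, 6.0),
    "ATG": (1.0, 1.0, 6.0),
    "AAA": (0.0, 0.0, 6.0),
    "AAG": (0.0, 1.0, 6.0),
}

toy_length = total_network_length(toy_code, toy_geometry)
print(toy_length)
```

In this toy example only two edges exist (TAT to TAC and AAA to AAG), and ATG contributes nothing because it is the only codon for M. Applying the same function to $f_\sigma$ for many sampled $\sigma \in S$ would give the kind of empirical average that the confidence interval in Section 3.2 describes.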
The previous paper [5] also uses the concept of mutual information to quantify the duplexing efficiency of DNA’s geometric and genetic codes. Using mutual information as a measure of duplexing efficiency has two big advantages over the use of total network length as described in the previous section:

• Firstly, it is a well-known measure of duplexing efficiency which is widely studied and used, whereas total network length is an ad hoc approach developed only for this particular project.
• Secondly, total network length was based only on the expected values in Table 3 and therefore ignored the standard deviations that represent the flexibility of the dimers. Of course, once that flexibility is taken into account, the “geometric code” is no longer a function since there is more than one possible configuration for each codon. Because the definition of mutual information in (1.1) involves probabilities, it is well suited to address this situation.

In [5], the geometry of a codon was represented by a point in $\mathbb{R}^6$ where the first three numbers (like the image of $\bar g$ above) indicate the location of the center of the last “rung” of the codon relative to the first and the other three were angles indicating how it was tilted and twisted relative to the first. Then, $\mathbb{R}^6$ was divided into $4096\ (= 4^6)$ subsets called ‘bins’. If each bin is indexed by an element of a 4096-element index set $\mathcal{Y}$, then the geometric information is encoded into a map $g : \mathcal{C} \to \mathcal{Y}$.

Unlike any of the maps discussed earlier, $g$ is not a function since a given codon can be in many different possible geometric configurations due to its flexibility. It is more likely to be in certain configurations than others, and so $g(c)$ is a random variable for any given codon $c$. In order to compute the mutual information of the genetic codes $f_\sigma$ with this geometric map $g$, we need to be able to compute the associated probabilities. In [5] that was done by running a computer program which looped through a large number of different configurations and recorded the number of the ‘bin’ in which they ended up. In other words, the probabilities were computed empirically, using the assumption that the dimer step parameters for each dimer are normally distributed with the mean and standard deviation shown in Table 3.

Using this information it is now possible to compute (or, perhaps it would be better to say “approximate”) the mutual information of any of the genetic codes $f_\sigma$ with this geometric code $g$. When this was done in [5], it was found that the mutual information of the natural genetic code $f_I$ with the geometric code $g$ is about $M(f_I, g) \approx […]$. For the alternative codes $f_\sigma$ with $\sigma \in S$, a 95% confidence interval found that the average mutual information is probably between […] and […]:

$[…] \le E_S(M(f_\sigma, g)) \le […] < M(f_I, g) \approx […].$

Since a smaller mutual information (closer to the ideal value of 0) represents better duplexed codes, this means that the duplexing of the real genetic code $f_I$ is worse than average. This, again, is the opposite of what would have been predicted by the GPH.

4. New Results: Mutual Information via a Monte-Carlo style discretization
Since the geometric parameters take values in a continuous space, the functions $P_{f,g}$ and $P_g$ which appear in the formula for the mutual information are actually probability distribution functions whose values only become probabilities when integrated over regions of that space. In the previous paper this computation was discretized in a rigid way by “binning” the data into fixed and pre-determined subsets of equal size.

That approach is plausible, but not uniquely so. A more standard approach in numerical analysis is to consider a discretization based on choosing a suitable random sample $Y_N$ of $N$ geometries that are chosen taking into account the Gaussian distributions in Table 3, and replace the usual discrete mutual information by

$M_N(f, g) = \frac{1}{W} \sum_{y \in Y_N} \sum_a P_{f,g}(a, y) \log_2\left(\frac{P_{f,g}(a, y)}{P_f(a)\,P_g(y)}\right)$ with $W = \sum_{y \in Y_N} \sum_a P_{f,g}(a, y),$

where $y$ is a codon geometry, $a$ is an amino acid, and $f = f_\sigma$ is one of the genetic codes mapping codons to amino acids. The normalization by the factor $1/W$ is the volume element for approximate integration over the probability density $\sum_a P_{f,g}(a, y)$. It normalizes the sum appropriately, in the sense of $M_N(f, g)$ approaching a common value $M(f, g)$ as $N$ increases.

The randomness of the sample $Y_N$ is one of the main differences between the prior results and this new approach. Another difference is the randomness used in the approximation of the values of the probability distributions themselves. Unlike the previous approach in which the probabilities were estimated by using rigidly chosen deviations from the expected values, this time a Monte Carlo approach will be utilized. In particular, here we construct random choices for $N = 64n$ geometry samples as the union of a set of $n$ sample points for each codon, with the samples for each codon constructed from $n$ sets of random values for the twelve Hassan-Calladine parameters for the two dimers; the randomness is based on assuming that these parameters are all independent and that each is normally distributed with mean and standard deviation as in Table 3.

The final, and perhaps most interesting, difference between the previous approach and the one taken in this section is an inversion of the geometric data which directly computes the probability that a given codon will take on a given geometric conformation. The basic quantity needed is the probability distribution $P_d(\sigma)$ for the values of the dimer step $\sigma$ for dimer $d$ with Hassan-Calladine parameters $\Delta_1, \Delta_2, \Delta_3, \theta_1, \theta_2, \theta_3$. As above, this is based on the assumption that these parameters are independent and each is normally distributed, so:

(4.1)  $P_d(\sigma) = \prod_{i=1}^{3} \frac{1}{\hat\Delta_i(d)\sqrt{2\pi}} \exp\left(-\frac{(\Delta_i - \bar\Delta_i(d))^2}{2\hat\Delta_i(d)^2}\right) \prod_{j=1}^{3} \frac{1}{\hat\theta_j(d)\sqrt{2\pi}} \exp\left(-\frac{(\theta_j - \bar\theta_j(d))^2}{2\hat\theta_j(d)^2}\right).$
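The density (4.1) and the random draws it governs are simple to express in code. The Python sketch below evaluates the product of six one-dimensional normal densities and samples one dimer step under the independence assumption stated above. The means and standard deviations are hypothetical placeholders, not values taken from Table 3, and the function names `step_density` and `sample_step` are illustrative.

```python
import math
import random


def step_density(step, means, stds):
    """Product of six independent normal densities, as in (4.1).
    step, means and stds each list (Delta_1, Delta_2, Delta_3,
    theta_1, theta_2, theta_3)."""
    density = 1.0
    for value, mean, std in zip(step, means, stds):
        exponent = -((value - mean) ** 2) / (2 * std ** 2)
        density *= math.exp(exponent) / (std * math.sqrt(2 * math.pi))
    return density


def sample_step(means, stds, rng):
    """Draw one random dimer step with independent normal parameters."""
    return [rng.gauss(mean, std) for mean, std in zip(means, stds)]


# Hypothetical parameters for a single dimer, NOT values from Table 3.
means = [0.0, -0.1, 3.3, 0.0, 1.0, 35.0]
stds = [0.5, 0.4, 0.2, 3.0, 5.0, 4.0]

rng = random.Random(0)
step = sample_step(means, stds, rng)
peak = step_density(means, means, stds)   # density at the mean step
print(step_density(step, means, stds) <= peak)
```

The density is largest at the mean step, where every exponential factor equals 1, so any sampled step has a density no greater than `peak`. Drawing $n$ such steps for each of the two dimers of a codon gives the kind of sample described in the text.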
Consider a codon $C$ consisting of dimers $d_1$ and $d_2$, and for a choice of their dimer steps $\sigma_1$ and $\sigma_2$, denote the resulting codon geometry $y$ as the “step product” $\sigma_1 * \sigma_2$. Then $P_C(y)$, the probability density of the codon $C$ having geometry $y$, is given by an integral over all paths to $y$:
\[ P_C(y) = \int_{\sigma_1} \sum_{\sigma_2 \,\mid\, \sigma_1 * \sigma_2 = y} P_{d_1}(\sigma_1)\, P_{d_2}(\sigma_2)\, d\sigma_1 . \]
For each pair of values for $y$ and $\sigma_1$, we must solve for all possible values $\sigma_2$; fortunately this can be done explicitly, and generically there are only two such paths; this is detailed in the next subsection.
For each path the quantity $P_{d_2}(\sigma_2)$ will be evaluated using Eq. (4.1). The outer integral over a six dimensional space is instead dealt with by a Monte Carlo method: noting that it is the integral with respect to a probability measure, we approximate by choosing a sample of $m$ random sets of values for the Hassan-Calladine parameters, in turn determining a set $\Sigma_m$ of values for $\sigma_1$, and averaging:
\[ P_C(y) \approx \frac{1}{m} \sum_{\sigma_1 \in \Sigma_m} \; \sum_{\sigma_2 \,\mid\, \sigma_1 * \sigma_2 = y} P_{d_2}(\sigma_2). \]
Then we assemble the pieces:
\[ P_g(y) = \frac{1}{64} \sum_C P_C(y), \qquad P_{f,g}(a,y) = \frac{1}{64} \sum_{C \,\mid\, f(C) = a} P_C(y), \]
and the easy one, the fraction of codons that give a specified amino acid:
\[ P_f(a) = \frac{1}{64} \bigl| \{ C \mid f(C) = a \} \bigr| . \]
4.1. Reconstructing the second dimer step.
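What remains is to solve $\sigma_1 * \sigma_2 = y$ for the second dimer step $\sigma_2$; that is, find the corresponding Hassan-Calladine angles $\theta_i$ and then the lengths $\Delta_i$.
Except for one case noted below (of negligible probability), the angles $\theta_i$, $i = 1$–$3$, are determined up to negation of the pair $\theta_1, \theta_2$. This comes from the formula
\[ T = M\, R_3(\theta_3/2 - \phi)\, R_2(\eta)\, R_3(\theta_3/2 + \phi) \tag{4.2} \]
as in Section 2.2 of [5] (see also Eq's (9) of [4]). The matrices $M$ and $T$ are the frames respectively for the end of the first dimer and the end of the codon, $R_2$ and $R_3$ are the familiar matrices for rotations about the $y$ and $z$ axes,
\[ R_2(\eta) = \begin{pmatrix} \cos\eta & 0 & \sin\eta \\ 0 & 1 & 0 \\ -\sin\eta & 0 & \cos\eta \end{pmatrix}, \qquad R_3(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}, \]
and
\[ \eta = \operatorname{sign}(\theta_2)\sqrt{\theta_1^2 + \theta_2^2}, \qquad \sin\phi = \theta_1/\eta \quad\text{with}\quad -\pi/2 \le \phi \le \pi/2 . \]
From this, $\theta_2 = \operatorname{sign}(\theta_2)\sqrt{\eta^2 - \theta_1^2} = \eta\cos\phi$.
Let $R = M^{-1}T = R_3(\theta_3/2 - \phi)\, R_2(\eta)\, R_3(\theta_3/2 + \phi)$ be the combined rotation. Then
\[ R_{33} = \bigl(R_3(\theta_3/2 - \phi)^T e_3\bigr)^T R_2(\eta) \bigl(R_3(\theta_3/2 + \phi)\, e_3\bigr) = e_3^T R_2(\eta)\, e_3 = (R_2(\eta))_{33} = \cos\eta, \]
so $\eta = \pm\arccos R_{33}$, with $\arccos R_{33} \in [0, \pi]$.
Case 1 (Generic): $-1 < R_{33} < 1$, so $0 < |\eta| < \pi$.
Defining $\alpha = \theta_3/2 + \phi$ and $\beta = \theta_3/2 - \phi$, $\alpha$ is determined in $(-\pi, \pi]$ by
\[ \cos\alpha = -R_{31}/\sin\eta, \qquad \sin\alpha = R_{32}/\sin\eta, \]
and likewise $\beta$ by
\[ \cos\beta = R_{13}/\sin\eta, \qquad \sin\beta = R_{23}/\sin\eta . \]
Here $\sin\eta \neq 0$, so no problems arise. Then $\theta_3 = \alpha + \beta$, $\phi = (\alpha - \beta)/2$, $\theta_1 = \eta\sin\phi$, $\theta_2 = \eta\cos\phi$.
The two choices for $\eta$ likewise negate $\theta_1$ and $\theta_2$, but only shift $\theta_3$ by an irrelevant increment of $2\pi$.
Case 2: $R_{33} = 1$, so $\eta = 0$.
Then $\theta_1 = \theta_2 = 0$, and $R = R_3(\theta_3)$, so $\theta_3$ is determined easily.
Case 3: $R_{33} = -1$, so $\eta = \pm\pi$.
This is the problem case, as $R$ now depends only on $\phi$, not $\theta_3$, so the latter is not constrained at all. However, the value $\eta = \pm\pi$ is extremely unlikely: $\eta^2 = \theta_1^2 + \theta_2^2$, and as seen in Table 3, the values for the latter two angles are far too small.
Reconstructing the remaining Hassan-Calladine parameters $\Delta_i$ is now straightforward; they are related to the known positions of the ends of each dimer and the Hassan-Calladine angles by a system of linear equations, as seen in the formula for the mid-step frame
\[ T_{\mathrm{mid}} = \bigl( v_1^\top \; v_2^\top \; v_3^\top \bigr) = M\, R_3\!\left( \frac{\theta_3}{2} - \phi \right) R_2\!\left( \frac{\eta}{2} \right) R_3(\phi), \]
with the (known) positions $p_2$ and $p_3$ of the ends of the second and third bases related by
\[ p_3 = p_2 + \Delta_1 v_1 + \Delta_2 v_2 + \Delta_3 v_3 \]
from [5]; see also Eq's (10,11) of [4].
4.2. Numerical Results.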
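The generic case of the angle reconstruction described above, on which these computations rely, can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes the conventions $R = R_3(\theta_3/2 - \phi)\,R_2(\eta)\,R_3(\theta_3/2 + \phi)$ with $R_2$, $R_3$ the standard rotations about the $y$ and $z$ axes, $\eta = \operatorname{sign}(\theta_2)\sqrt{\theta_1^2+\theta_2^2}$ and $\sin\phi = \theta_1/\eta$. All function names are hypothetical, and the non-generic cases $R_{33} = \pm 1$ are not handled.

```python
import numpy as np

def rot_y(eta):
    """R_2(eta): rotation about the y axis."""
    c, s = np.cos(eta), np.sin(eta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(theta):
    """R_3(theta): rotation about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def combined_rotation(theta1, theta2, theta3):
    """Forward map from the three angles to R (assumed conventions, see lead-in)."""
    eta = np.copysign(np.hypot(theta1, theta2), theta2)
    phi = 0.0 if eta == 0.0 else np.arcsin(np.clip(theta1 / eta, -1.0, 1.0))
    return rot_z(theta3 / 2 - phi) @ rot_y(eta) @ rot_z(theta3 / 2 + phi)

def recover_angles(R, tol=1e-12):
    """Generic case -1 < R33 < 1: return the two candidate angle triples."""
    r33 = R[2, 2]
    if abs(r33) >= 1.0 - tol:
        raise ValueError("non-generic case (eta = 0 or +-pi) is not handled here")
    eta = np.arccos(r33)                           # positive branch, 0 < eta < pi
    s = np.sin(eta)
    alpha = np.arctan2(R[2, 1] / s, -R[2, 0] / s)  # alpha = theta3/2 + phi
    beta = np.arctan2(R[1, 2] / s, R[0, 2] / s)    # beta  = theta3/2 - phi
    theta3 = alpha + beta
    phi = 0.5 * (alpha - beta)
    theta1, theta2 = eta * np.sin(phi), eta * np.cos(phi)
    # The other branch, eta -> -eta, negates (theta1, theta2) and shifts theta3 by 2*pi.
    return [(theta1, theta2, theta3), (-theta1, -theta2, theta3 + 2.0 * np.pi)]
```

Feeding either returned triple back through `combined_rotation` reproduces the input matrix, which is a convenient round-trip check. The second triple carries the $2\pi$ shift in $\theta_3$ explicitly so that this check holds exactly.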
The most accurate calculation so far for the true genetic code is with $N = 4096$ samples and $m = 2048$ samples for each evaluation of $P_C(y)$. This gives $M_N(f_I, g) \approx$ . . $N = 2048$ and $m = 1024$; the random codes then give a mean $E_S(M_N(f_\sigma, g))$ value of 0. . , . . . $\le E_S(M_N(f_\sigma, g)) \le$ . $< M_N(f_I, g) \approx$ . .
Conclusions
A given genetic sequence can be interpreted as encoding a protein and also as influencing the geometry of the DNA molecule that carries it. This is therefore a situation like the one in the Introduction where Georgina and Fred are each interpreting the same signal differently. It is therefore of interest to understand how well-duplexed these two “codes” are.
Combining the results of the previous paper [5] with the new results in Section 4, this has now been done in three different ways. Disappointingly, each time the answer has turned out to be “about what you’d expect if the real genetic code was just selected randomly, but maybe a little worse”. In other words, contrary to the predictions of the Geometric Pressure Hypothesis (GPH), the natural genetic code does not appear to be especially well-duplexed. Put differently, if one replaced the natural genetic code with a random alternative, it is likely that there would be more freedom in the geometry of the DNA molecule while encoding any given protein.
It is interesting to speculate about what that tells us about the evolution of the genetic code. On the one hand, it could simply imply that there is not much evolutionary advantage in having the ability to alter the shape of the DNA molecule without changing the protein it encodes. However, that is not the only possible explanation. Another intriguing possibility, which was raised during the discussion after the talk in the session for which this volume serves as the proceedings, is that the code became fixed before the molecules became large enough for it to matter. In particular, it does seem likely that the geometry of the molecule is not very important when the chromosome is very short. So, if the genetic code that we are familiar with was shaped during an early period in evolution when the genomes of living creatures were all very small, the GPH might not have applied. And, since the genetic code is no longer very malleable (as demonstrated by its near ubiquity), it might no longer have been able to change once the molecules grew large enough for their geometry and topology to matter.
In any case, whatever the explanation may be, the new computations have only re-confirmed the answer found previously to the question of the title. Since the natural genetic code appears only slightly less well-duplexed with the geometric code than an average alternative, it does not appear that the evolution of the genetic code was shaped by any pressure to optimize it.
References
1. Alexander RW, Schimmel P, (2001) “Wobble Hypothesis” in Encyclopedia of Genetics (S Brenner and JH Miller, eds) Elsevier.
2. Barrell BG, Bankier AT, Drouin J. A different genetic code in human mitochondria. Nature 1979;282:189-194.
3. Berg JM, Tymoczko JL, Stryer L, (2002) Biochemistry. 5th Edition. WH Freeman. Section 5.5.1.
4. Hassan MA and Calladine CR “The Assessment of the Geometry of Dinucleotide Steps in Double-Helical DNA; a New Local Calculation Scheme” J. Mol. Biol. (1995) 251 648-664
5. Kasman, A. “The Duplexing of the Genetic Code and Sequence-Dependent DNA Geometry” Bull Math Biol (2018). https://doi.org/10.1007/s11538-018-0486-3
6. Kawaguchi Y, Honda H, Taniguchi-Morimura J, Iwasaki S. The codon CUG is read as serine in an asporogenic yeast Candida cylindracea. Nature 1989;341:164-166.
7. Kiga D, Sakamoto K, Kodama K, Kigawa T, Matsuda T, Yabuki T, Shirouzu M, Harada Y, Nakayama H, Takio K, et al. An engineered Escherichia coli tyrosyl-tRNA synthetase for site-specific incorporation of an unnatural amino acid into proteins in eukaryotic translation and its application in a wheat germ cell-free system. Proc. Natl Acad. Sci. USA 2002;99:9715-9720.
8. Koonin EV and Novozhilov AS, “Origin and Evolution of the Universal Genetic Code” Annu. Rev. Genet.
J Mol Biol.
11. Olson WK, Gorin AA, Lu XJ, Hock LM, and Zhurkin VB “DNA sequence-dependent deformability deduced from protein-DNA crystal complexes” Proc. Natl. Acad. Sci. USA Vol. 95, pp. 11163–11168, September 1998
12. Srinivasan G, James CM, Krzycki JA. Pyrrolysine encoded by UAG in Archaea: Science. 2002 May 24;296 (5572) 1459-62.
13. Wang L, Brock A, Herberich B, Schultz PG. Expanding the genetic code of Escherichia coli. Science 2001;292:498-500.
14. Yamao F, Muto A, Kawauchi Y, Iwami M, Iwagami S, Azumi Y, Osawa S. UGA is read as tryptophan in Mycoplasma capricolum. Proc. Natl Acad. Sci. USA 1985;82:2306-2309.
15. Zhang Z and Yu J, “On the Organizational Dynamics of the Genetic Code”, Genomics, Proteomics & Bioinformatics Vol. 9, 1–2, April 2011, pp. 21–29
Department of Mathematics, College of Charleston, Charleston, SC 29424
E-mail address: [email protected]
Department of Mathematics, College of Charleston, Charleston, SC 29424
E-mail address: