Chaos in DNA Evolution

Abstract

In this paper, we explain why the chaotic model (CM) of Bahi and Michel (2008) accurately simulates gene mutations over time. First, we demonstrate that the CM model is truly chaotic in the sense of Devaney. Then, we show that mutations occurring in genes exhibit the same chaotic dynamics, which makes chaotic models relevant for genome evolution.


arXiv [q-bio.GN]

Jacques M. Bahi (b, c), Christophe Guyeux (b, c), Antoine Perasso (a, c)

(a) Chrono-environnement laboratory, UMR 6249 CNRS
(b) FEMTO-ST Institute, UMR 6174 CNRS
(c) University of Franche-Comté, Besançon, France


Keywords: Genome evolution models, Mutations, Mathematical topology, Devaney’s chaos

1. Introduction

Codons are not uniformly distributed in the genome. Over time, mutations have introduced variations in their frequency of occurrence. It can be attractive to study the genetic patterns (blocks of more than one nucleotide: dinucleotides, trinucleotides...) that appear and disappear depending on mutation parameters. Mathematical models allow the prediction of such an evolution, in such a way that statistical values observed in current genomes can be recovered from hypotheses on past DNA sequences. A first model for genome evolution was proposed in 1969 by Thomas Jukes and Charles Cantor [20]. This first model is very simple, as it supposes that each nucleotide A, C, G, T has the probability $m$ to mutate to any other nucleotide, as described in the following mutation matrix,

$$\begin{pmatrix} 1-3m & m & m & m \\ m & 1-3m & m & m \\ m & m & 1-3m & m \\ m & m & m & 1-3m \end{pmatrix}$$

In this matrix, the coefficient in row 3, column 2 represents the probability that the nucleotide G mutates into C during the next time interval, i.e., $P(G \to C)$.

(Preprint submitted to Elsevier, October 18, 2018.)

This first attempt has been followed up by Motoo Kimura [21], who reasonably considered that transitions ($A \leftrightarrow G$ and $T \leftrightarrow C$) should not have the same mutation rate as transversions ($A \leftrightarrow T$, $A \leftrightarrow C$, $T \leftrightarrow G$, and $C \leftrightarrow G$), leading to the following mutation matrix.

$$\begin{pmatrix} 1-a-2b & b & a & b \\ b & 1-a-2b & b & a \\ a & b & 1-a-2b & b \\ b & a & b & 1-a-2b \end{pmatrix}$$

This model was refined by Kimura in 1981 (three constant parameters, to make a distinction between natural $A \leftrightarrow T$, $C \leftrightarrow G$ and unnatural transversions), and then by Joseph Felsenstein [16] and by Masami Hasegawa, Hirohisa Kishino, and Taka-Aki Yano [19]. The differences between these models are in the number of parameters they use, but all of them manipulate constant parameters. However, they all are rudimentary, as they only allow the study of nucleotide evolution, not of genetic pattern mutations.

From 1990 to 1994, Didier Arquès and Christian Michel proposed models based on the RY purine/pyrimidine alphabet [4, 3, 5, 8, 6, 1]. These models have been abandoned by their own authors in favor of models over the $\{A, C, G, T\}$ alphabet. More precisely, in 1998 Didier Arquès, Jean-Paul Fallot, and Christian Michel proposed a first evolutionary model on the $\{A, C, G, T\}$ alphabet that is based on trinucleotides [2]. With such a model, the mutation matrix now has a size of $64 \times 64$ (there are 64 trinucleotides). This model comprises 3 parameters $p, q, r$ that correspond, for a given trinucleotide $XYZ$, to the probability $p$ of mutation of the first nucleotide $X$, the mutation probability $q$ of $Y$, and the probability $r$ that $Z$ mutates. As for the nucleotide-based models, this new approach has only taken into account constant parameters. In 2004, Jacques M. Bahi and Christian Michel published novel research work in which the 1998 model was improved by replacing constant parameters by new parameters dependent on time [12]. In this way, it has been possible to simulate a gene evolution that is non-linear.
However, in the following years, these researchers returned to models embedding constant parameters, probably because the 2004 model leads to poor results: only one of the twelve studied cases allows the recovery of values that are close to reality. For instance, in 2006, Gabriel Frey and Christian Michel proposed a model that uses 6 constant parameters [18], whereas in 2007, Christian Michel used a model with 9 constant parameters that generalizes those of 1998 and 2006 [23]. Finally, Jacques M. Bahi and Christian Michel have recently introduced in [10] a last model with 3 constant parameters, but whose evolution matrix evolves over time. In other words, the trinucleotides that have to mutate (modifying trinucleotide content without changing their location) are not fixed, but are randomly picked among a subset of potentially mutable trinucleotides. This model, called “chaotic model” (CM), allows good recovery of various statistical properties detected in the genome. Furthermore, this model fits well with the hypothesis of some primitive genes that have mutated over time.

In this paper, we ask why the CM model yields such good results. Obviously, it is reasonable to assume that not all of the trinucleotides have to mutate each time: for instance, the stop codons have very small mutation probabilities. However, such a biological claim is not sufficient to explain the success of the CM model in accurately simulating the dynamics of mutations in genomes. Our proposal is that the dynamics of genome evolution is indeed chaotic, as defined by Devaney’s formulation [15, 14]. This is why linear non-chaotic models of evolution are far from what they attempt to model, leading to poor accuracy in their predictions. By contrast, we have recently established that discrete dynamical systems in chaotic iterations satisfy Devaney’s definition of chaos [11].
Thus the CM model, which is the first mutation model based on chaotic iterations [10] (considering that the set of trinucleotides that can possibly mutate evolves over time), uses a chaotic dynamical system to describe a chaotic behavior, leading to a model of the same nature as the phenomenon under study. We finally demonstrate that, in contrast to inversions, mutations occurring in genomes have chaotic dynamics. So at least one type of genome reorganization process is chaotic, according to the formulation of such a behavior in the mathematical theory of chaos.

The remainder of this research work is organized as follows. In Section 2, the CM model of genome evolution is recalled and its performance is summarized. Then, in the next section, basic reminders concerning chaotic iterations and Devaney’s chaos are given. Genomic mutations are formalized through a discrete dynamical system and studied in Section 4. In particular, they are proven to be chaotic according to Devaney. Other categories of genomic rearrangements are investigated too, namely transpositions and inversions. This research work ends with a conclusion section, where the contribution is summarized and intended future work is listed.
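As an illustration of the two nucleotide-level mutation matrices recalled in this introduction, the following minimal Python sketch builds the one-parameter matrix (rate $m$) and the two-parameter matrix (transition rate $a$, transversion rate $b$) in the order A, C, G, T. The function names `jukes_cantor` and `kimura` and the numerical rates are illustrative assumptions, not taken from the cited works.

```python
# Illustrative sketch only: function names and rate values are assumptions.
NUCLEOTIDES = "ACGT"  # row/column order used for both matrices

# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def jukes_cantor(m):
    """One-parameter matrix: off-diagonal entries m, diagonal 1 - 3m."""
    return [[1 - 3 * m if i == j else m for j in range(4)] for i in range(4)]


def kimura(a, b):
    """Two-parameter matrix: transitions a, transversions b, diagonal 1 - a - 2b."""
    matrix = []
    for x in NUCLEOTIDES:
        row = []
        for y in NUCLEOTIDES:
            if x == y:
                row.append(1 - a - 2 * b)
            elif (x, y) in TRANSITIONS:
                row.append(a)
            else:
                row.append(b)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    jc = jukes_cantor(0.01)
    k2 = kimura(0.02, 0.005)
    # Row 3, column 2 (1-based) is the probability that G mutates into C.
    print("P(G -> C), one-parameter model:", jc[2][1])
    print("P(A -> G), two-parameter model:", k2[0][2])
    print("row sums:", [sum(row) for row in jc], [sum(row) for row in k2])
```

Each row sums to 1, which is what makes these matrices transition-probability matrices for a single time interval.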

2. The CM Model of Genome Evolution

In this section, the CM model is presented, its capability to reasonably approximate mutations in genomes is recalled, and its relationship with chaotic iterations is stated.

When considering the model of 2007 with 9 constant parameters [23], which generalizes the models of 1998 and 2006 ([2] and [18] respectively), all of the trinucleotides have to mutate at each time. These models do not take into account the low mutability of the stop codons. Furthermore, they do not allow mutation strategies to be applied to certain given codons while the other codons do not mutate. This is why the model with 3 constant parameters and a chaotic strategy has been proposed in [10, 13]. In this model, the set of trinucleotides is divided into two subsets at each time $t$: the first one comprises trinucleotides that can possibly mutate at time $t$, whereas in the second set, trinucleotides cannot change at the considered time. The trinucleotides that mutate with replacement at time $t$ are randomly picked following a uniform distribution on the set of all possible subsets of trinucleotides (other probability distributions, like a discrete Poisson process, have not been considered by these authors). Consequently, the size and the constitution of the subset of mutable trinucleotides change at each time $t$. This subset is denoted by $J(t)$, and this new model has been called “chaotic model” (CM) by the authors of [10, 13], as opposed to the former “standard model” of 1998 [2], due to its relationship with the chaotic iterations recalled below.

Since the trinucleotides that do not mutate in the chaotic model are not derived from the mutation of other trinucleotides (as, at each iteration, we focus only on the subset of trinucleotides that are allowed to mutate at the considered time), their probabilities of occurrence are constant. Conversely, mutation parameters of the mutable trinucleotides are the same as in the 1998 model: $p$, $q$, and $r$ with $p + q + r = 1$, for each of the three nucleotide sites. Let $P_i(t)$ be the probability of occurrence of the trinucleotide $i$ at time $t$.
Let $A(t)$ be the mutation matrix at time $t$, whose element $(i, j)$ is $P^{(t)}(i \to j)$: the probability that the $i$-th trinucleotide (in lexical order) mutates into the $j$-th one. For instance, in row 1 and column 2, there is $P^{(t)}(1 \to 2) = P^{(t)}(AAA \to AAC)$. The previous remarks lead to the following formulation:

$$\begin{cases} P_i'(t) = 0 & \text{if } i \notin J(t), \\ P_i'(t) = \displaystyle\sum_{j=1}^{64} (A(t) - I)_{ji} \, P_j(t) & \text{if } i \in J(t). \end{cases}$$

Obviously, this new model is a generalization of the 1998 version. Indeed, if we suppose that $A(t) = A$ for every $t$ and that $J(t)$ is the set of all the trinucleotides at every time $t$, then the latter system reduces to its second line, which is exactly the 1998 model. As the number of mutable trinucleotides changes over time, the mutation matrix is not constant, so the resolution method used in the standard model cannot be applied here. To solve the system, the authors of [10, 13] have considered discrete times small enough to be sure that the mutation matrix does not change between $t_i$ and $t_{i+1}$, where the length of $[t_i, t_{i+1}]$ is small enough compared to the mutation rate. Let $A(k)$ be the (constant) mutation matrix during the time interval $[t_{k-1}, t_k]$. To compute $P_i'(t_{k-1})$, the authors of [10, 13] have considered that:

$$\frac{dP_i(t_{k-1})}{dt} = \frac{P_i(t_k) - P_i(t_{k-1})}{h},$$

where $h = t_k - t_{k-1}$ is supposed small and constant. By putting this formula into the previous system, these authors have finally obtained:

$$\begin{cases} P_i(t_k) = P_i(t_{k-1}) & \text{if } i \notin J(t_k), \\ P_i(t_k) = h \displaystyle\sum_{j=1}^{64} (A(k) - I)_{ji} \, P_j(t_{k-1}) + P_i(t_{k-1}) & \text{if } i \in J(t_k). \end{cases}$$

This model has been called the “discrete time chaotic evolution model CM” in [10, 13]. We will show that this discrete version is, indeed, a gene evolution model that uses chaotic iterations. To understand the interest of this discrete time chaotic evolution model, we must first recall the discovery by Michel et al. of a $C^3$ code and its properties [7].

2.2. Relevance of the CM model

By computing the frequency of each trinucleotide in the 3 frames of genes, in a large gene population (protein coding regions) of both eukaryotes and prokaryotes, it was established in 1996 that the distribution of trinucleotides in these frames is not uniform [7]. Such a surprising result has led to the definition of 3 subsets of trinucleotides, denoted by $X_0$, $X_1$, and $X_2$. These sets are defined as follows. For each of the 60 trinucleotides different from AAA, TTT, CCC, GGG, compute its frequency in the reading frame $R_0$, in the frame $R_1$ obtained by a shift of 1 nucleotide to the left of $R_0$, and in the frame $R_2$ obtained by a shift of 2 nucleotides. If the considered trinucleotide is more frequent in $R_0$ (resp. $R_1$, $R_2$), put it in $X_0$ (resp. $X_1$, $X_2$). This procedure is repeated, with small variations, until $X_0$, $X_1$, and $X_2$ are each made up of 20 trinucleotides.

These sets are linked by the following permutation property:

$$X_1 = \{P(t), \, t \in X_0\}, \qquad X_2 = \{P(t), \, t \in X_1\},$$

where $P$ is defined for every trinucleotide $t = n_0 n_1 n_2$ by $P(t) = n_1 n_2 n_0$. Additionally, if we denote by $c : \mathcal{N} \to \mathcal{N}$ the complementary function defined on the set of nucleotides $\mathcal{N} = \{A, T, C, G\}$ by $c(A) = T$, $c(T) = A$, $c(C) = G$, $c(G) = C$, and, for all words of nucleotides $u$ and $v$, $c(uv) = c(v)c(u)$, then we have $c(X_0) = X_0$, $c(X_1) = X_2$, and $c(X_2) = X_1$, which is referred to as the “complementarity property”. More details about the research context, the constitution of these sets, and their properties ($C^3$ code, rarity, largest window length, higher frequency of “misplaced” trinucleotides, flexibility) can be found in [10, 13]. Among other things, it has been proven that $X_0$ occurs with the highest probability (48.8%) in genes (reading frame 0), whereas $X_1$ and $X_2$ occur mainly in the frames 1 and 2 respectively. In other words, $X_0$ is not pure in the reading frame (its probability is less than 1): it is mixed with $X_1$ and $X_2$.
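The permutation $P$ and the complementary function $c$ used in the two properties above can be sketched in a few lines of Python. The function names `permute` and `complement` are illustrative assumptions; the sketch does not list the sets $X_0$, $X_1$, $X_2$ themselves, which are given in the cited references.

```python
# Illustrative sketch only: function names are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def permute(trinucleotide):
    """Circular permutation P: n0 n1 n2 -> n1 n2 n0."""
    return trinucleotide[1:] + trinucleotide[0]


def complement(word):
    """Complementary function c, extended to words by c(uv) = c(v)c(u).

    Applying c(uv) = c(v)c(u) recursively amounts to complementing
    each nucleotide and reversing the word.
    """
    return "".join(COMPLEMENT[n] for n in reversed(word))


def permute_set(trinucleotides):
    """Image of a set of trinucleotides under P."""
    return {permute(t) for t in trinucleotides}


if __name__ == "__main__":
    print(permute("ACG"))        # first nucleotide moved to the end
    print(complement("ACG"))     # reverse complement of the word
    # c is an involution, and P has order 3 on trinucleotides.
    print(complement(complement("ACG")), permute(permute(permute("ACG"))))
```

With such helpers, the permutation property can be checked on any candidate sets by testing `permute_set(X0) == X1` and `permute_set(X1) == X2`.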
Such a property has been explained by the authors of [10, 13] as follows. Suppose that $X_0$ represents the set of trinucleotides used to build the genes of the last common ancestor of the considered set of species. Random mutations have introduced noise during evolution, leading to a decreased probability of $X_0$ [10, 13]. Another fundamental property is asymmetry, in the sense that the codes $X_1$ and $X_2$ satisfy $P(X_1) < P(X_2)$. The standard and chaotic models (with particular strategies for the stop codons) can explain both the decreased probability of the code $X_0$ and the asymmetry between the codes $X_1$ and $X_2$ in genes, by the following procedure. Construct the “primitive” genes, i.e., genes before random substitutions, with trinucleotides of the circular code $X_0$. Starting from this initial condition, the systems (standard or chaotic) are launched, iterating their processes until a stop condition is met. By doing so, and for carefully chosen rates, it is possible to get close to the current frequency of each of the three codes $X_0$, $X_1$, and $X_2$ in genes. In this situation, CM models largely outperform the standard models, being closer to the observed probabilities for $X_0$, $X_1$, and $X_2$ discussed above. In particular, the chaotic model called “$CM_{TAA}$”, with low mutability of the stop codon TAA, matches as closely as possible the probability discrepancy between the circular codes $X_1$ and $X_2$ observed in reality. For further details, the reader is referred to [10, 13].

All the properties described before show that the gene mutation prediction is suitable to describe these phenomena. This kind of manifestation of chaos in genomics is somewhat surprising and needs, in our opinion, to be further investigated, in order to determine whether more fundamental reasons can justify the success of chaotic models in simulating genome evolution. In the following section, we will propose some reasons explaining why chaos is related to genomes.
More precisely, we will show that some genome evolution mechanisms, as modeled in the present article, are chaotic according to Devaney. To achieve this goal, we first need to recall the bases of the mathematical theory of chaos.
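The discrete time update of the CM model recalled in this section (probabilities outside $J(t_k)$ are kept, probabilities inside $J(t_k)$ receive an explicit Euler step of size $h$) can be sketched as follows. This is a minimal illustration under stated assumptions: the names `cm_step` and `random_strategy` are hypothetical, indices are 0-based, and the usage example relies on a toy 3-state matrix rather than the 64 × 64 trinucleotide matrix of [10, 13].

```python
# Illustrative sketch only: names, toy matrix and step size are assumptions.
import random


def cm_step(P, A, J, h):
    """One discrete-time step of the chaotic model.

    P : list of occurrence probabilities at time t_{k-1}
    A : mutation matrix on [t_{k-1}, t_k], A[j][i] = probability that j mutates into i
    J : set of (0-based) indices allowed to mutate at time t_k
    h : time step t_k - t_{k-1}
    """
    n = len(P)
    new_P = list(P)  # entries outside J are left unchanged
    for i in J:
        # sum over j of (A - I)_{ji} * P_j
        drift = sum((A[j][i] - (1.0 if j == i else 0.0)) * P[j] for j in range(n))
        new_P[i] = P[i] + h * drift
    return new_P


def random_strategy(n, rng):
    """Draw a subset of {0, ..., n-1} uniformly among all 2^n subsets.

    Including each index independently with probability 1/2 gives
    exactly the uniform distribution on subsets.
    """
    return {i for i in range(n) if rng.random() < 0.5}


if __name__ == "__main__":
    rng = random.Random(0)
    A = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]  # toy row-stochastic matrix
    P = [1.0, 0.0, 0.0]
    for k in range(5):
        J = random_strategy(len(P), rng)
        P = cm_step(P, A, J, h=0.1)
        print(k, sorted(J), P)
```

When `J` contains every index, the update reduces to the constant-strategy (standard) scheme and the total probability is preserved for a row-stochastic matrix; with a strict subset, only the selected entries move.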

3. Basic Reminders

Let us now rigorously introduce the notions of Devaney’s chaos and of chaotic iterations, with their respective links.

Consider a topological space $(\mathcal{X}, \tau)$ and a continuous function $f : \mathcal{X} \to \mathcal{X}$.

Definition 1

Function $f$ is said to be topologically transitive if, for any pair of non-empty open sets $U, V \subset \mathcal{X}$, there exists $k > 0$ such that $f^k(U) \cap V \neq \emptyset$.

Definition 2

The point $x \in \mathcal{X}$ is a periodic point for $f$ of period $n \in \mathbb{N}^*$ if $f^n(x) = x$.

Definition 3

Function $f$ is said to be regular on $(\mathcal{X}, \tau)$ if the set of periodic points for $f$ is dense in $\mathcal{X}$: for any point $x$ in $\mathcal{X}$, any neighborhood of $x$ contains at least one periodic point.

Definition 4

Function $f$ is said to be chaotic on $(\mathcal{X}, \tau)$ if $f$ is regular and topologically transitive.

In cases where the topology $\tau$ can be described by a metric $d$, the chaos property is strongly linked to the notion of “sensitivity”, defined on a metric space $(\mathcal{X}, d)$ by:

Definition 5

Function $f$ has sensitive dependence on initial conditions if there exists $\delta > 0$ such that, for any $x \in \mathcal{X}$ and any neighborhood $V$ of $x$, there exist $y \in V$ and $n \geq 0$ such that $d(f^n(x), f^n(y)) > \delta$. Then $\delta$ is called the constant of sensitivity of $f$.

Indeed, Banks et al. have proven in [14] that when $f$ is chaotic on $(\mathcal{X}, d)$, then $f$ has the property of sensitive dependence on initial conditions (this property was formerly an element of the definition of chaos). To sum up, quoting Devaney in [15], a chaotic dynamical system “is unpredictable because of the sensitive dependence on initial conditions. It cannot be broken down or simplified into two subsystems which do not interact because of topological transitivity. And in the midst of this random behavior, we nevertheless have an element of regularity”. Fundamentally different behaviors are consequently possible and occur in an unpredictable way.

3.2. Chaotic Iterations

Definition 6

Let $X$ be a set, $N \in \mathbb{N}^*$, $f : X^N \to X^N$ be a function, and $S$ be a sequence of subsets of $\llbracket 1, N \rrbracket$ called a “chaotic strategy”. The chaotic iterations are the sequence $(x^n)_{n \in \mathbb{N}}$ of elements of $X^N$ defined by $x^0 \in X^N$ and

$$\forall n \in \mathbb{N}^*, \ \forall i \in \llbracket 1, N \rrbracket, \quad x_i^n = \begin{cases} x_i^{n-1} & \text{if } i \notin S^n, \\ \left( f(x^{n-1}) \right)_i & \text{if } i \in S^n. \end{cases}$$

In other words, at the $n$-th iteration, only the components of $S^n$ are “iterated”. Note that the term “chaotic” in the name of these iterations has a priori no link with the mathematical theory of chaos recalled above. However, it has been proven in [9] that, for a large variety of functions, chaotic iterations are indeed really chaotic.
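The update rule of Definition 6 can be made concrete with a short Python sketch. The function name `chaotic_iterations`, the 0-based component indices, and the choice of the vectorial Boolean negation as the iterated function are illustrative assumptions made for this example only.

```python
# Illustrative sketch only: names and the example function are assumptions.
def chaotic_iterations(f, x0, strategy):
    """Return the orbit (x^0, x^1, ...) of the chaotic iterations of f.

    f        : function mapping a list of N components to a list of N components
    x0       : initial state x^0 (sequence of N components)
    strategy : iterable of sets of 0-based component indices (S^1, S^2, ...)
    """
    x = list(x0)
    orbit = [tuple(x)]
    for S_n in strategy:
        fx = f(x)
        # Only the components listed in S^n are updated; the others are kept.
        x = [fx[i] if i in S_n else x[i] for i in range(len(x))]
        orbit.append(tuple(x))
    return orbit


def negation(x):
    """Vectorial Boolean negation, used here as a simple example for f."""
    return [1 - b for b in x]


if __name__ == "__main__":
    orbit = chaotic_iterations(negation, (0, 0, 0), [{0}, {1, 2}, {0, 2}])
    for n, state in enumerate(orbit):
        print(n, state)
```

Taking every $S^n$ equal to the full index set recovers the usual parallel iterations $x^n = f(x^{n-1})$, which is why a fixed full strategy corresponds to the standard (non-chaotic) scheme.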

4. Genomics Mutations as a Discrete Dynamical System

We now ask whether the evolution of a DNA sequence under mutations can be predicted or not. In this section, we will more specifically focus on the following questions. Firstly, given a genome (or any DNA sequence) $G$ of interest, and a more or less precise idea of the mutations that it will probably face in the future (for instance, some areas in genomes are known to mutate more frequently than others), is it possible to infer a set of the most probable genomes that can result, in the future, from this original sequence $G$ after mutations? Secondly, given a sequence known at the current generation (say, at time $t_n$), is it possible to determine the most probable aspect of this sequence in the past (at time $t_m$, $m < n$)? Thirdly, given two DNA sequences, the second one being the result of some mutations on the first one, is it possible to find the mutation sequence that has changed the first sequence into the second one (taking into account the fact that a given nucleotide can mutate several times)?

Obviously, with no information about the mutation rate and history of the considered DNA sequence, this prediction is quite impossible. But what happens if we can follow the DNA sequence over several generations, learning by doing so information about the possible form of its mutations sequence? For instance, following a lineage of Escherichia coli during 40000 generations gives us a lot of information concerning the behavior of mutations in the genomes of the considered lineage. Is it possible to use this knowledge to predict the genome of this lineage at generation number 45000? In other words, knowing the initial DNA sequence $G^0$ at time $t_0$ and the first 40000 terms of the mutations sequence, can we predict the DNA sequence at generation 45000? With the knowledge of $G^0$ and of the whole mutations sequence $S$ up to generation 45000, the genome at generation 45000 can be obtained without prediction, but what happens to our ability to make a prediction when using only the head of this sequence, that is, its first 40000 terms?
This head can be seen as an approximation of the true mutations sequence $S$, and if the evolution dynamics of the mutations is quite stable through approximations, in the sense where a small perturbation at the origin yields a small perturbation at the end of the process, then this prediction makes sense. To measure the stability of the mutation dynamics through small errors or approximations, and the capability to predict the evolution of genomes under mutations, we must first write this mutation operation as a dynamical system, provide an accurate distance that corresponds to the “approximation” mentioned above, and measure the effects of our ignorance of the complete mutations sequence on the prediction of genome evolution.

A genome having $N$ nucleotides is formalized here as a sequence of $N$ integers belonging to $\{0, 1, 2, 3\}$, where 0 (resp. 1, 2, and 3) refers to adenine (resp. cytosine, guanine, and thymine). The benefit of using the integers $\{0, 1, 2, 3\}$ instead of $\{A, C, G, T\}$ is justified by the construction of a metric for the mutation process (see Section 4.3). An evolution under nucleotide mutations of this genome is a sequence of couples of $\llbracket 1, N \rrbracket \times \llbracket 0, 3 \rrbracket$, where we infer that:

• time has been divided into a sequence $t_0, t_1, \ldots, t_n, \ldots$ such that at most one mutation can occur between two consecutive times,

• the $i$-th couple of the mutation sequence is equal to $(m, n)$ if and only if the $m$-th nucleotide of the genome is replaced by the nucleotide $n$. If the $m$-th nucleotide was already $n$, then no mutation has occurred at time $t_i$.

Such a sequence will be called “mutations sequence” in the remainder of this document. $\mathcal{S}_N = \bigcup_{n \in \mathbb{N}} \left( \llbracket 1, N \rrbracket \times \llbracket 0, 3 \rrbracket \right)^n$ will denote the (infinite) set of all possible (finite) mutations sequences. We introduce the phase space $\mathcal{X}_N = \llbracket 0, 3 \rrbracket^N \times \mathcal{S}_N$ as the set of mutating genomes.
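The encoding just described (nucleotides as integers 0 to 3, mutations as couples made of a 1-based position and a new nucleotide) can be sketched in Python as follows. The helper names `encode`, `decode`, and `evolve`, as well as the small five-nucleotide genome used in the usage example, are illustrative assumptions.

```python
# Illustrative sketch only: helper names and the sample genome are assumptions.
NUCLEOTIDES = "ACGT"  # A = 0, C = 1, G = 2, T = 3


def encode(dna):
    """Turn a DNA string into a list of integers in {0, 1, 2, 3}."""
    return [NUCLEOTIDES.index(letter) for letter in dna]


def decode(genome):
    """Turn a list of integers in {0, 1, 2, 3} back into a DNA string."""
    return "".join(NUCLEOTIDES[g] for g in genome)


def evolve(genome, mutations):
    """Apply a finite mutations sequence and return every intermediate genome.

    Each mutation is a couple (m, n): the m-th nucleotide (1-based position)
    is replaced by nucleotide n. If it already equals n, nothing changes.
    """
    current = list(genome)
    history = [list(current)]
    for position, nucleotide in mutations:
        current[position - 1] = nucleotide
        history.append(list(current))
    return history


if __name__ == "__main__":
    start = encode("AACAG")
    mutations = [(2, 1), (2, 2), (1, 3)]
    for step, genome in enumerate(evolve(start, mutations)):
        print(step, genome, decode(genome))
```

Truncating `mutations` to its first terms gives the kind of approximate knowledge of the evolution discussed above: the genomes agree as long as the known head of the sequence is being applied, and nothing is known afterwards.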
The phase space $\mathcal{X}_N$ is constituted by couples that store the information of a genome and its future evolution: the first coordinate of the couple is the current DNA sequence, whereas the second coordinate is the sequence of mutations that will appear in the future (the problem is that, in practice, this sequence can at best only be approximated).

Example 1

For instance, the point $((0, 0, 1, 0, 2), ((2, 1), (2, 2), (1, 3))) \in \mathcal{X}_5$ corresponds to the evolution {AACAG, ACCAG, AGCAG, TGCAG}: the left coordinate $(0, 0, 1, 0, 2)$ means that we start with the sequence AACAG, whereas the second coordinate $((2, 1), (2, 2), (1, 3))$ explains that:

1. the first mutation $(2, 1)$ is a substitution of the second nucleotide by C,
2. the second mutation $(2, 2)$ is a substitution of the second nucleotide by G,
3. the third and last mutation $(1, 3)$ refers to the substitution of the first nucleotide by T, which is denoted here by 3.

Let us now introduce the initial and shift operators $i$ and $\sigma$ defined respectively by

$$i : \mathcal{S}_N \to \llbracket 1, N \rrbracket \times \llbracket 0, 3 \rrbracket, \quad (s^0, s^1, \ldots) \mapsto s^0, \qquad \sigma : \mathcal{S}_N \to \mathcal{S}_N, \quad (s^0, s^1, \ldots) \mapsto (s^1, s^2, \ldots).$$

The shift operator corresponds to the so-called symbolic dynamics, a well-studied mathematical example of chaotic dynamics [17]. With this material, the mutation operation $M$ can be written as follows:

$$M : \mathcal{X}_N \to \mathcal{X}_N, \quad \left( (G_1, \ldots, G_N), S \right) \mapsto \left( (G_1, \ldots, G_{i(S)_1 - 1}, i(S)_2, G_{i(S)_1 + 1}, \ldots, G_N), \sigma(S) \right). \qquad (1)$$

In other words, the nucleotide at position $i(S)_1$ in the genome $(G_1, \ldots, G_N)$ is replaced by the nucleotide $i(S)_2$, and the first substitution $i(S)$ in the mutation sequence $S$ is removed (as the mutation has already been achieved). Thus the DNA evolution as the generations pass can finally be written as the following discrete dynamical system:

$$\begin{cases} X^0 = (G, S) \in \mathcal{X}_N, \\ X^{n+1} = M(X^n). \end{cases} \qquad (2)$$

Example 2

Let us consider Example 1 another time. As stated before, X “pp , , , , q , pp , q , p , q , p , qqq P X . Then X “ M p X q “ pp , , , , q , pp , q , p , qqq , X “ M p X q “ pp , , , , q , pp , qqq , and X “ M p X q “pp , , , , q , ∅ q . The last DNA sequence, obtained after 3 mutations (3 itera-tions of the dynamical system), is thus equal to G “ X “ p , , , , q , whichis T GCAG . A relevant metric must now be introduced in order to measure the correctnessof the prediction, and to give consistency to the notion of approximation thathas occurred several times in the previous section. This distance must be deﬁnedon the set X N , to measure how close is a predicted DNA evolution to the realone. It will be constructed as follows: given X “ p X , X q , Y “ p Y , Y q P X N ,the number d p X, Y q : • has an integer part that computes the diﬀerences between the two DNAsequences X (for instance, the predicted or approximated genome) and Y (the real genome), that is, the number of nucleotides that do not cor-respond in the two genomes. • has a fractional part that must be as small as the evolution processes X , Y will coincide for a long enough duration. More precisely, the k ´ thdigit of d p X, Y q will be equal to if and only if, after k generations, thesame position (nucleotide) will be changed in both X and Y genomes,and the same nucleotide is inserted in each case.9uch requirements lead to the introduction of the following function: @ X, Y P X N , d p X, Y q “ d G p X , Y q ` d S p X , Y q where $’’’’&’’’’% d G p X , Y q “ N ÿ k “ δ p X k , Y k q ,d S p X , Y q “ N ÿ k “ F p X k ´ Y k q k ` , where δ is the discrete metric on R , that is, for x, y P R , δ p x, y q “ if x ‰ y, else,and F : R ÝÑ R ` is given by F p x , x q “ | x | ` δ p , x q . Proposition 1

Function $d$ is a metric on $\mathcal{X}_N$.

Proof

The function $d_G$ is clearly a metric on $\llbracket 0, 3 \rrbracket^N$, as it is the 1-product metric of the $N$ metric spaces $(\llbracket 0, 3 \rrbracket, \delta)$. We now prove that $d_S$ is a metric. Firstly, $d_S$ is well defined since, for $X_2, Y_2 \in \mathcal{S}_N$, the quantity $F(X_2^k - Y_2^k)$ is bounded independently of $k \in \mathbb{N}$, implying the convergence of the series in the definition of $d_S$. Coincidence axiom and symmetry being obvious, we only prove the subadditivity of $d_S$. If $x_2, y_2 \in \mathbb{R}$ are such that $\delta(0, x_2 - y_2) = 1$, then $x_2 \neq y_2$, and for every $z_2 \in \mathbb{R}$, either $x_2 \neq z_2$ or $y_2 \neq z_2$, and so $\delta(0, x_2 - z_2) + \delta(0, z_2 - y_2) \geq 1$. Consequently, for every $x, y, z \in \mathbb{R}^2$, $F(x - y) \leq F(x - z) + F(z - y)$. The series being convergent for every $X_2, Y_2 \in \mathcal{S}_N$, one deduces that $d_S$ satisfies the subadditivity property on $\mathcal{S}_N$ and is a metric on this set. As a consequence, $d$ is a metric on $\mathcal{X}_N$.

Let us start by proving that,

Proposition 2

The mutation operation $M$ is a continuous function on $(\mathcal{X}_N, d)$.

Proof

This result will be established by using the sequential characterizationof the continuity. Let p X n q n P N a sequence of X N that converges to ℓ P X N . Wemust prove that M p X n q ÝÑ M p ℓ q in p X N , d q . Let ε ą . X n ÝÑ ℓ in p X N , d q ,then D n P N , @ n ě n , d p X n , ℓ q ď . . So all X n for n ě n have the sameﬁrst coordinate (genome), which is ℓ . Furthermore, consequently to the deﬁni-tion of d S , the ﬁrst term of each X n for n ě n is equal to the ﬁrst term of ℓ .So, @ n ě n , M p X n q “ M p ℓ q .Let k “ r ´ log p ε q s . As d p X n , ℓ q ÝÑ , D n ě n such that, for n ě n , d S p X n , ℓ q ď k ` , meaning that the sequences X n , n ě n and ℓ startall with the same k ` terms. As the operation of M on the second co-ordinate of points of X N is a shift of one term to the left, we conclude that @ n ě n , d G p M p X n q , M p ℓ q q “ and d S p M p X n q , M p ℓ q q ă k ď ε , andthus M p X n q ÝÑ M p ℓ q in p X N , d q , which ends the proof.10s M is continuous, we can thus study the chaotic behavior of the discretedynamical system of Eq. 2. We ﬁrst prove that,

Proposition 3

$M$ is topologically transitive on $(\mathcal{X}_N, d)$.

Proof

Let $X = (G, S)$ and $\check{X} = (\check{G}, \check{S})$ be two points of $\mathcal{X}_N$, and $\varepsilon > 0$. We will find $n \in \mathbb{N}$ and a point $X' = (G', S') \in B(X, \varepsilon)$, the open ball centered on $X$ with radius $\varepsilon$, such that $M^n(X') = \check{X}$. Let $k = \lceil -\log(\varepsilon) \rceil$. Thus any point of the form $(G, (S^0, S^1, \ldots, S^k, s^{k+1}, s^{k+2}, \ldots))$, with $s^{k+1}, s^{k+2}, \ldots \in \llbracket 1, N \rrbracket$, is in $B(X, \varepsilon)$. Suppose that $G$ and $\check{G}$ have $m$ different nucleotides, in positions $i_1, \ldots, i_m \in \llbracket 1, N \rrbracket$. Thus the point

$$X' = \left( G, \left( S^0, S^1, \ldots, S^k, (i_1, \check{G}_{i_1}), \ldots, (i_m, \check{G}_{i_m}), \check{S}^0, \check{S}^1, \ldots \right) \right) \in B(X, \varepsilon)$$

is such that $M^{k+m+1}(X') = \check{X}$, leading to the transitivity of $M$.

Remark 1

A stronger result than topological transitivity has in fact been established in the proof above. It is called strong transitivity and is defined by: for all $X, Y \in \mathcal{X}$ and for every neighborhood $V$ of $X$, there exist $n \in \mathbb{N}$ and $X' \in V$ such that $M^n(X') = Y$. Obviously, strong transitivity implies the transitivity property.

We now prove that,

Proposition 4

$M$ is regular on $(\mathcal{X}_N, d)$.

Proof

Let $X \in \mathcal{X}_N$ and $\varepsilon > 0$. We have to exhibit a periodic point $X' \in B(X, \varepsilon)$. Let $k = \lceil -\log(\varepsilon) \rceil$. Suppose that $X = (G, (S^0, S^1, \ldots))$, and that the genome $M^k(X)_1$ differs from $G$ in $m$ nucleotides, in positions $i_1, \ldots, i_m \in \llbracket 1, N \rrbracket$. Then the point

$$X' = \left( G, \left( S^0, \ldots, S^k, (i_1, G_{i_1}), \ldots, (i_m, G_{i_m}), S^0, \ldots, S^k, (i_1, G_{i_1}), \ldots, (i_m, G_{i_m}), \ldots \right) \right)$$

is a periodic point in $B(X, \varepsilon)$.

Let us finally demonstrate that:

Proposition 5

The mutation operator M has sensitive dependence on initialcondition, and its constant of sensitivity is equal to N ` t N u ` N . Proof

Let X “ p G, S q P X N and ε ą . Let k “ r ´ log p ε q s . Consider aﬁnite sequence of nucleotides p n , . . . , n N q P v , w N such that for each i P v , N w , n i ‰ ` M k ` N p X q ˘ i , and an inﬁnite sequence p s j q j P N such that for every j P N ,11 s j “ $&% N if ` M k ` N ` j p X q ˘ ď N , else . • s j “ if ` M k ` N ` j p X q ˘ “ , else . Then the point X “ p G, p S , . . . , S k , p , n q , . . . p N , n N q , s , s , . . . q P B p X, ε q and is such that d p M k ` N p X q , M k ` N p X qq ě N ` t N u ` N . Due to the deﬁni-tion of p n , . . . , n N q and p s j q j P N , the inﬁmum in the latter equality is optimal,and the distance cannot be enlarged systematically for the neighborhood of allpoints.The three previous propositions lead to the following result. Theorem 1

Genome mutations as modeled by our approach have a chaotic behavior according to Devaney.

4.5. Further Investigations

4.5.1. Quantitative properties

Genomic mutations possess the instability property:

Deﬁnition 7

A dynamical system $(\mathcal{X}, f)$ is unstable if for all $x \in \mathcal{X}$, the orbit $\gamma_x : n \in \mathbb{N} \mapsto f^n(x)$ is unstable in the following sense,

$$\exists \varepsilon > 0, \ \forall \delta > 0, \ \exists y \in \mathcal{X}, \ \exists n \in \mathbb{N}, \ \text{s.t. } d(x, y) < \delta \text{ and } d(f^n(x), f^n(y)) \geq \varepsilon.$$

This property, which is implied by sensitive dependence on initial conditions, leads to the fact that in every neighborhood of any genome evolution $(G, S)$ there are points that can be separated by a distance larger than $\varepsilon$ in the future through mutations.

Let us now recall another common quantitative measure of disorder.

Definition 8

A function $f$ is said to have the property of expansivity if

$$\exists \varepsilon > 0, \ \forall x \neq y, \ \exists n \in \mathbb{N}, \ d(f^n(x), f^n(y)) \geq \varepsilon.$$

Then $\varepsilon$ is the constant of expansivity of $f$: an arbitrarily small error on any initial condition is always amplified up to $\varepsilon$.

Let us prove that,

Theorem 2

Mutation operator $M$ is expansive and its constant of expansivity is at least equal to 1.

Proof

If $X_1 \neq Y_1$, then $d(X, Y) = d(M^0(X), M^0(Y)) \geq 1$. Or else necessarily $X_2 \neq Y_2$. Let $n = \min \{ k \in \mathbb{N}, X_2^k \neq Y_2^k \}$. Then $\forall k < n$, $M^k(X)_1 = M^k(Y)_1$ and $M^n(X)_1 \neq M^n(Y)_1$, so $d(M^n(X), M^n(Y)) \geq 1$.

4.5.2. Qualitative properties

Firstly, the topological transitivity property implies indecomposability [24].

Deﬁnition 9

A dynamical system $(\mathcal{X}, f)$ is indecomposable if it is not the union of two closed sets $A, B \subset \mathcal{X}$ such that $f(A) \subset A$, $f(B) \subset B$.

Hence, taking into account only a small part of a genome in the modeling process, in order to simplify the complexity of the studied dynamics, deprives us of a global vision of mutations. Moreover, we will prove that genomic mutations are topologically mixing, which is a strong version of transitivity:

Definition 10

A discrete dynamical system is said to be topologically mixing if and only if, for any couple of disjoint open sets $U, V \neq \emptyset$, an integer $n_0 \in \mathbb{N}$ can be found so that $\forall n \geq n_0$, $f^n(U) \cap V \neq \emptyset$.

We have the result,

Theorem 3 $(\mathcal{X}_N, M)$ is topologically mixing.

This property is an immediate consequence of the lemma below.

Lemma 1

For every open ball $B \subset \mathcal{X}_N$, there exists an integer $n \in \mathbb{N}^*$ such that $M^{(n)}(B) = \mathcal{X}_N$, where $M^{(n)}$ is the $n$-th composition of the operator $M$ defined in (1).

Proof

Let $B = \mathcal{B}\big(((G_1, \dots, G_N), (S^1, \dots)), \varepsilon\big)$, $k = -\log(|\varepsilon|)$, and $\check{X} = \big((\check{G}_1, \dots, \check{G}_N), (\check{S}^1, \dots)\big) \in \mathcal{X}_N$. We define

$$X = \big((G_1, \dots, G_N), (S^1, \dots, S^k, (1, \check{G}_1), \dots, (N, \check{G}_N), \check{S}^1, \dots)\big).$$

This point is such that $X \in B$ and $M^{(k+N)}(X) = \check{X}$.

Mutations $M$ also satisfy the notion of chaos according to Knudsen, which is defined by [22]:

Definition 11

A discrete dynamical system is chaotic according to Knudsen if:
• it is sensitive to the initial conditions,
• there is a dense orbit.

Let us prove that,

Proposition 6

The mutation operator is chaotic according to Knudsen on $(\mathcal{X}_N, d)$.

Proof

The sensitivity to the initial conditions has already been stated. Let us define a point $X$ of $\mathcal{X}_N$ having a dense orbit under iterations of $M$. $\mathcal{X}_N = \llbracket 1, 4 \rrbracket^N \times \mathcal{S}_N$ being the Cartesian product of two countable sets, it is countably infinite too: there exists a bijection $\sigma : \mathbb{N} \longrightarrow \mathcal{X}_N$. Let $G : \mathcal{X}_N \longrightarrow \llbracket 1, 4 \rrbracket^N$, $(G, S) \longmapsto G$ be the first projection. Then $X$ can be defined as follows:

$$X = \big((1, 1, \dots, 1), ((1, G(\sigma(0))_1), (2, G(\sigma(0))_2), \dots, (N, G(\sigma(0))_N), \sigma(0), (1, G(\sigma(1))_1), (2, G(\sigma(1))_2), \dots, (N, G(\sigma(1))_N), \sigma(1), \dots)\big).$$

$X$ is such that $\forall Y \in \mathcal{X}_N$, $\exists n_Y \in \mathbb{N}$, $M^{n_Y}(X) = Y$, which is stronger than the required density.

To a certain extent, this notion of chaos is less restrictive than Devaney's. More precisely, Devaney's chaos implies Knudsen's chaos in compact spaces [17].

The conclusion of this study of mutations is that they present a chaotic behavior, which makes it impossible to qualify the long-term effect of an error in predicting the location and frequency of mutations in genomes. In the worst-case scenario, this error will be amplified until a completely different genome is obtained (all the nucleotides are different, as the constant of sensitivity is greater than the length of the genome). However, this case is rather marginal: mutations occur less frequently than generations pass, and a mutation changes only one nucleotide. This leads to the opinion that, at least in the short term, the general aspect of the genome under consideration remains under control when only mutations occur.

Inversions and transpositions are other genomic rearrangements, which mostly affect more than one nucleotide. Thus an error in the prediction of these operations can potentially have a larger impact on genome evolution. To qualify such an impact, we first give some definitions useful to formalize inversions.

Definition 12

The complementary function $c : \llbracket 1, 4 \rrbracket \longrightarrow \llbracket 1, 4 \rrbracket$ is defined by $c(1) = 4$, $c(2) = 3$, $c(3) = 2$, and $c(4) = 1$.

Then the complement of adenine A is thymine T, and $c(2) = 3$ means, for instance, that the complement of cytosine is guanine. We can now define the inversion process on a chromosome:

Definition 13

Let $N \in \mathbb{N}^*$, and $(n_1, \dots, n_N)$ a chromosome. Inversions have the form:

$$(n_1, \dots, n_{i-1}, n_i, n_{i+1}, \dots, n_{j-1}, n_j, n_{j+1}, \dots, n_N) \longrightarrow (n_1, \dots, n_{i-1}, c(n_j), c(n_{j-1}), \dots, c(n_{i+1}), c(n_i), n_{j+1}, \dots, n_N).$$

Example 3

For instance, ACCTGTAATGTTA is a possible inversion of ACCTTTACTGTTA.

Obviously, it is impossible to map the DNA sequence AAAAAAAA into CCCCCCCC using only inversions, as the complement of A is T. This fact is in contradiction with the property of transitivity, leading to the statement that,

Proposition 7

The inversion rearrangement is not chaotic on the set of all genomes of size $N$.

Transpositions have the form:

$$(n_1, \dots, n_{i-1}, n_i, \dots, n_j, n_{j+1}, \dots, n_k, n_{k+1}, \dots, n_N) \longrightarrow (n_1, \dots, n_{i-1}, n_{j+1}, \dots, n_k, n_i, \dots, n_j, n_{k+1}, \dots, n_N).$$

Obviously this transposition cannot fit the requirements of transitivity, as the numbers of adenines, thymines, guanines, and cytosines are preserved. For instance, it is impossible to reach a genome with a high rate of thymine by applying transpositions to a genome with a low rate of T. Thus,

Proposition 8

Transposition of transposons is not chaotic according to Devaney.
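The two obstructions to transitivity used above can be checked on small examples. The following is only an illustrative sketch, not part of the original model: it works on plain strings over A, C, G, T instead of the integer coding, and the names `invert`, `transpose`, and `at_count` are hypothetical helpers introduced here. It reproduces the inversion of Example 3 and shows the quantities that each rearrangement leaves unchanged.

```python
from collections import Counter

# Watson-Crick complement, playing the role of the complementary function c.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def invert(chromosome: str, i: int, j: int) -> str:
    """Inversion of the segment n_i..n_j (1-based, inclusive):
    the segment is reversed and each nucleotide is complemented."""
    if not 1 <= i <= j <= len(chromosome):
        raise ValueError("need 1 <= i <= j <= N")
    segment = chromosome[i - 1:j]
    flipped = "".join(COMPLEMENT[n] for n in reversed(segment))
    return chromosome[:i - 1] + flipped + chromosome[j:]


def transpose(chromosome: str, i: int, j: int, k: int) -> str:
    """Transposition: the blocks n_i..n_j and n_{j+1}..n_k swap places (1-based)."""
    if not 1 <= i <= j < k <= len(chromosome):
        raise ValueError("need 1 <= i <= j < k <= N")
    return (chromosome[:i - 1] + chromosome[j:k]
            + chromosome[i - 1:j] + chromosome[k:])


def at_count(chromosome: str) -> int:
    """Number of A or T nucleotides; an inversion only swaps A with T and C with G,
    so this count is unchanged by any inversion."""
    return chromosome.count("A") + chromosome.count("T")


genome = "ACCTTTACTGTTA"
print(invert(genome, 5, 8))                        # ACCTGTAATGTTA, as in Example 3
print(at_count("AAAAAAAA"), at_count("CCCCCCCC"))  # 8 0: not reachable by inversions
print(Counter(transpose(genome, 2, 4, 7)) == Counter(genome))  # True
```

Since `at_count` is invariant under `invert` and the full nucleotide count is invariant under `transpose`, no sequence of either operation can connect two genomes on which these quantities differ, which is the argument behind Propositions 7 and 8.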

5. Conclusion

In this document, the three operations of genomic rearrangement that can be modeled by discrete dynamical systems (due to the preservation of the size of the genomes) have been studied using mathematical topology. It has been stated that mutations are chaotic, whereas transpositions and inversions are not. The proposed models suggest that genome evolution generates moderate chaos, and that this evolution can probably be predicted to a certain extent.

This claim will be further investigated in our future work, by making a larger and more complete study of all the possible rearrangements in genomes, measuring and studying their frequency using the related literature, and discussing to what extent this prediction can be realized. In particular, the authors will study the set of mutations, transpositions, and inversion strategies, to take into account the presence of recombination hotspots.

References

[1] D. G. Arquès and C. J. Michel. Analytical expression of the purine/pyrimidine autocorrelation function after and before random mutations. Math Biosci, 123(1):103–125, Sep 1994.
[2] D. G. Arquès, J. P. Fallot, and C. J. Michel. An evolutionary analytical model of a complementary circular code simulating the protein coding genes, the 5' and 3' regions. Bull Math Biol, 60(1):163–194, Jan 1998.
[3] D. G. Arquès and C. J. Michel. A model of DNA sequence evolution. Bull Math Biol, 52(6):741–772, 1990.
[4] D. G. Arquès and C. J. Michel. Periodicities in coding and noncoding regions of the genes. J Theor Biol, 143(3):307–318, Apr 1990.
[5] D. G. Arquès and C. J. Michel. A simulation of the genetic periodicities modulo 2 and 3 with processes of nucleotide insertions and deletions. J Theor Biol, 156(1):113–127, May 1992.
[6] D. G. Arquès and C. J. Michel. Identification and simulation of new non-random statistical properties common to different eukaryotic gene subpopulations. Biochimie, 75(5):399–407, 1993.
[7] D. G. Arquès and C. J. Michel. A complementary circular code in the protein coding genes. Journal of Theoretical Biology, 182(1):45–58, 1996.
[8] D. G. Arquès, C. J. Michel, and K. Orieux. Identification and simulation of new non-random statistical properties common to different populations of eukaryotic non-coding genes. J Theor Biol, 161(3):329–342, Apr 1993.
[9] Jacques Bahi and Christophe Guyeux. Hash functions using chaotic iterations. Journal of Algorithms & Computational Technology, 4(2):167–181, 2010.
[10] Jacques M. Bahi and Christophe Guyeux. Chaotic iterations and topological chaos. 2008.
[11] Jacques M. Bahi and Christophe Guyeux. Hash functions using chaotic iterations. Journal of Algorithms & Computational Technology, 4(2):167–182, 2010.
[12] Jacques M. Bahi and Christian J. Michel. A stochastic gene evolution model with time dependent mutations. Bull Math Biol, 66(4):763–778, Jul 2004.
[13] Jacques M. Bahi and Christian J. Michel. A stochastic model of gene evolution with chaotic mutations. J Theor Biol, 255(1):53–63, Nov 2008.
[14] J. Banks, J. Brooks, G. Cairns, and P. Stacey. On Devaney's definition of chaos. Amer. Math. Monthly, 99:332–334, 1992.
[15] Robert L. Devaney. An Introduction to Chaotic Dynamical Systems. Addison-Wesley, Redwood City, CA, 2nd edition, 1989.
[16] J. Felsenstein. A view of population genetics. Science, 208(4449):1253, Jun 1980.
[17] Enrico Formenti. Automates cellulaires et chaos : de la vision topologique à la vision algorithmique. PhD thesis, École Normale Supérieure de Lyon, 1998.
[18] Gabriel Frey and Christian J. Michel. An analytical model of gene evolution with six mutation parameters: an application to archaeal circular codes. Comput Biol Chem, 30(1):1–11, Feb 2006.
[19] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol, 22(2):160–174, 1985.
[20] T. H. Jukes and C. R. Cantor. Evolution of Protein Molecules. Academic Press, 1969.
[21] Motoo Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16:111–120, 1980. 10.1007/BF01731581.
[22] C. Knudsen. Aspects of noninvertible dynamics and chaos. PhD thesis, Technical University of Denmark, 1994.
[23] Christian J. Michel. An analytical model of gene evolution with 9 mutation parameters: an application to the amino acids coded by the common circular code. Bull Math Biol, 69(2):677–698, Feb 2007.
[24] Sylvie Ruette.