Evidence for strong co-evolution of mitochondrial and somatic genomes
Michael G. Sadovsky
Institute of Computational Modelling of SB RAS, 660036 Russia, Krasnoyarsk; [email protected]
Abstract
We studied the relation between the triplet frequency composition of mitochondrial genomes and the phylogeny of their bearers. First, clusters in 63-dimensional space were identified with the K-means technique. Second, the clade composition of those clusters was studied. The genomes were found to be distributed among the clusters very regularly, with a strong correlation to taxonomy. Strong co-evolution manifests itself through this correlation: proximity in frequency space was determined over the mitochondrial genomes, while proximity in taxonomy was determined morphologically.
1. Introduction
A study of the statistical properties of nucleotide sequences can still tell a researcher a lot about the relation between the structure of a sequence and the biological issues encoded in it. Hereafter, a frequency dictionary of a nucleotide sequence is taken to be its structure (Sadovsky et al., 2008; Sadovsky, 2003, 2006; Садовский, 2009). A consistent and comprehensive study of frequency dictionaries answers questions concerning the statistical and information properties of DNA sequences. A frequency dictionary, however one defines it, is a rather multidimensional entity. The relation between the structure (i.e., the oligonucleotide composition and frequencies) and the taxonomy of the bearers of DNA sequences is of great importance. Here we studied this relation for a set of mitochondrial genomes.

To begin with, let us introduce the basic definitions. Consider a continuous symbol sequence from the four-letter alphabet $\aleph = \{A, C, G, T\}$ of length $N$; the length here is just the total number of symbols in the sequence. This sequence corresponds to some genetic entity (genome, chromosome, etc.). By assumption, no other symbols or gaps occur in the sequence (see details in Section 2). Any coherent string $\omega = \nu_1 \nu_2 \ldots \nu_q$ of length $q$ makes a word. The set of all words occurring within a sequence yields the support of that sequence. Counting the numbers of copies $n_\omega$ of the words, one gets a finite dictionary; replacing the numbers with the frequencies

$$f_\omega = \frac{n_\omega}{N}$$

one gets the frequency dictionary $W_q$ of thickness $q$. This is the main object of our study. Again, more technical details can be found in Section 2.

Further, we shall concentrate on the frequency dictionaries $W_3$ (i.e., the triplet composition) only. Thus, any genetic entity is represented by a point in 63-dimensional space. Obviously, two genetic entities with identical frequency dictionaries $W_3^{(1)}$ and $W_3^{(2)}$ are mapped into the same point in the space.
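The definition of the frequency dictionary given above can be illustrated with a minimal Python sketch. It assumes that the sequence is closed into a ring, so that each of the $N$ nucleotides starts exactly one word, and that symbols outside the alphabet are simply dropped before counting. The function name `frequency_dictionary` is hypothetical and is not part of the software used in this study.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def frequency_dictionary(sequence, q=3):
    """Return the frequency dictionary W_q of a sequence closed into a ring.

    Every possible word of length q is present in the result, so that
    different genomes map to vectors of the same length (64 entries for q=3).
    """
    # Keep only symbols of the four-letter alphabet (assumption: other
    # symbols are dropped and the remaining fragments are concatenated).
    seq = "".join(ch for ch in sequence.upper() if ch in ALPHABET)
    n = len(seq)
    if n < q:
        raise ValueError("sequence is shorter than the word length")
    # Closing the ring: append the first q-1 symbols to the end.
    ring = seq + seq[: q - 1]
    counts = Counter(ring[i : i + q] for i in range(n))
    words = ("".join(p) for p in product(ALPHABET, repeat=q))
    # f_w = n_w / N for every word w, including words that never occur.
    return {w: counts.get(w, 0) / n for w in words}


w3 = frequency_dictionary("ACGTAC")
print(len(w3), sum(w3.values()))
```

Because the 64 frequencies always sum to one, any single triplet can be dropped without loss of information, which is why a genome is a point in 63-dimensional rather than 64-dimensional space.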
It is evident that the absolute congruency of two frequency dictionaries $W_3^{(1)}$ and $W_3^{(2)}$ does not mean a complete coincidence of the original sequences standing behind the dictionaries. Nonetheless, two such sequences are indistinguishable from the point of view of their triplet composition.

Definitely, some entities may have very close frequencies of all the triplets, while others may not, which makes the distribution of the points in 63-dimensional space inhomogeneous. So, the key question here is: what is the pattern of the distribution of mitochondrial genomes in that space? Are there discrete clusters and, if so, is there a correlation between the clusters and the phylogeny of the genome bearers?

To address these questions, we implemented an unsupervised classification of mitochondrial genomes in the (metric) space of triplet frequencies. Then the taxa composition of the classes produced by the classification was studied; a considerable correlation between taxa composition and class occupation was found. Some results of a study of the distribution of bacterial taxa in the information value space, developed over 16S RNA, are presented in (Sadovsky et al., 2008; Gorban et al., 2001, 2000).

This paper presents evidence of a strong relation between the structure of mitochondrial genomes and the taxonomy of their bearers. The genetic entity in our case is a mitochondrial genome.

2. Materials and methods

We used the sequences from EMBL-bank (release of March, 2011). This release contains ∼ . × entries. The final database used for our studies lists 1132 entries.

This discrimination comes from a (non-obvious) constraint: we had to keep in the database only those entries that are not the sole representatives of rather highly ranked clades. In simple words, a taxon of order rank (or higher) represented by a single genome yields a "signal" that is strong enough to deteriorate the general pattern, but fails to produce a distinguishable detail in the pattern. Thus, we included in the final database the entries representing an order with five species or more. Some genomes contain symbols other than those indicated above; we omitted those "junk" symbols and concatenated the sequence fragments separated by them into a coherent entity.

Table 1. Database structure; M is the abundance of a taxon.

Order           M
Batrachia
Chondrostei
Crocodylidae
Cryptodira
Dinosauria
Eutheria
Gymnophiona
Metatheria
Neopterygii
Squamata

The structure of the database is the following: it contains the genomes of animals only, with 988 entries of the Chordata phylum and 144 entries of the Arthropoda phylum. The further subdivision of the Chordata phylum is shown in Table 1. The Arthropoda phylum consists of 83 entries of Endopterygota, 30 entries of Paraneoptera and 30 entries of
Orthopteroidea.

Unsupervised classification by K-means was implemented to develop the classes (see details and many extensions in (Gorban et al., 2007; Gorban and Zinovyev, 2010; Y Shi et al., 2014; Fukunaga, 1990; Горбань, Россиев, 1996)). We sequentially developed classifications with two, three and four classes. No class separability was checked. To develop a classification, we reduced the dimension of the data to 63: this reduction comes from the fact that the sum of all frequencies must be equal to 1. Formally speaking, any triplet could be excluded from the data set; in practice, we excluded the triplet GCG, since it yields the least standard deviation among the triplets ($\sigma_{GCG}$ = 0 . ). In the K-means applications, Euclidean distance was used. All the results were obtained with the ViDaExpert software by A. Zinovyev (http://bioinfo-out.curie.fr/projects/vidaexpert/).

3. Results

Here we present the results of the classification carried out with the K-means technique. Consider first the classification with two classes. This classification is very stable and very clear-cut: there are no volatile genomes under this classification. We have carried out runs of the classification, and in 13 runs a class consisting of a single element was observed. All other runs very stably yielded a classification with two classes comprising 154 and 978 entries, respectively. The composition of the classes is the following: 142 genomes of Arthropoda always form a class, and only two genomes belong to the opposite class. These two genomes belong to Reticulitermes flavipes and Gampsocleis gratiosa (accession numbers EF206314 and EU527333, respectively).

Table 2 shows the results of the classification with three classes. Turtles occupy a single class, with no exceptions. This class is also occupied by fossils (Archosauria and Lepidosauria). Three clades (the Neoptera division, Mammalia and fossils) are distributed between two classes only, and the distribution is extremely biased: there are two single genomes, of Mammalia and of fossils, respectively, belonging to the opposite class. That turtles and fossils occupy the same class is no surprise. Less clear is the amalgamation of two rather distinct clades (Mammalia and Neoptera) into a single class. The rate of escaped genomes here is less than 0.5 % and 1.5 %, respectively.

Table 2. Distribution of clades for the unsupervised classification implemented for three classes; N is the abundance of a taxonomy group, and I, II and III are the classes.

Taxon                           N     I    II   III
Actinopterygii                510   464    46     0
Amphibia                       65    40    17     8
Archosauria and Lepidosauria  177     1   176     0
Mammalia                      212     0     1   211
Neoptera                      143     0     4   139
Testudines                     25     0    25     0

In the classification with two classes, 976 genomes from
Chordata phylum form another class, with 12 genomes escaping to the opposite class. Again, the set of escaping genomes is absolutely stable and includes the following species (accession numbers are in parentheses): Ranodon sibiricus (AJ419960), Aneides flavipunctatus (AY728214), Ensatina eschscholtzii (AY728216), Rhyacotriton variegatus (AY728219), Desmognathus fuscus (AY728227), Hydromantes brunus (AY728234), Geotrypetes seraphini (AY954505), Pachyhynobius shangchengensis (DQ333812), Onychodactylus fischeri (DQ333820), Dermophis mexicanus (GQ244467), Dicamptodon aterrimus (GQ368657), Hemiechinus auritus (AB099481).

The classification into three classes with K-means was also very good and stable. Again, runs of the classification development have been carried out. Only three patterns of the classification were observed, differing in the abundances of the classes. Namely, the abundance distributions were the following:
(i) ÷ ÷ entries, 18 cases,
(ii) ÷ ÷ entries, 854 cases,
(iii) ÷ ÷ entries, 136 cases.
Again, there were no volatile genomes (i.e., those permanently changing their class attribution).

Fishes tend to occupy two classes more explicitly. The majority of the genomes of this clade occupy the first class (together with Amphibia), while 10 % of the fish genomes are located in the second class. This ratio differs from the similar one observed for the other clades (except Amphibia).

Amphibia exhibit the most intriguing behaviour. This is the only clade that occupies all three classes to a remarkable extent. Moreover, the distribution in triplet frequency space is sensitive at quite a low taxonomy level. Amphibia included 9 genomes of the Caudata order and 13 genomes of the Anura order. These two orders are separated into two classes of the statistical classification: all Anura genomes fall into the second class, while the third class includes 7 genomes of the Caudata order, in comparison to two genomes of that order found in the second class.

Fig. 1 illustrates the distribution of clades in the rigid map (see (Gorban and Zinovyev, 2010; Gorban et al., 2007) for details on the rigid map technique). Labels for the clades are in the figure legend. Fig. 2 shows the distributions of the genomes for the three- and two-class classifications.
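The workflow behind these results (K-means in triplet frequency space, followed by counting the clades in each class) can be sketched as follows. This is a plain Lloyd-type K-means with Euclidean distance, written for illustration only; it is not the ViDaExpert implementation used to obtain the results. The names `k_means` and `clade_table`, the initialisation from randomly chosen data points, and the toy data are all assumptions.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    """Assign each point to one of k classes (Lloyd-type iteration)."""
    rng = random.Random(seed)
    # Assumption: initial centres are k distinct data points chosen at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no genome changed its class: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty class keeps its previous centre
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def clade_table(labels, clades):
    """Count genomes per (clade, class) pair, as in a clade-by-class table."""
    return Counter(zip(clades, labels))


# Toy data standing in for 63-dimensional frequency vectors.
points = [[0.0, 0.0], [0.2, 0.0], [0.5, 0.0],
          [10.0, 0.0], [10.3, 0.0], [10.4, 0.0]]
clades = ["Mammalia"] * 3 + ["Testudines"] * 3
labels = k_means(points, 2, seed=1)
print(clade_table(labels, clades))
```

Repeating the call with different `seed` values mimics repeated runs of the classification. Class indices may be permuted from run to run, so runs should be compared by the partition they produce rather than by the raw labels; a genome whose class membership changes between runs would be a volatile one.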
4. Discussion
The study presented in this paper was done within the scope of population genomics methodology. The idea to figure out the relation between taxonomy and the triplet composition of genetic entities was originally proposed in (Gorban et al., 2000; Горбань и др., 2003). These papers studied the relation between bacterial taxonomy and triplet frequency dictionaries determined over the 16S RNAs of bacteria. Here we apply those ideas to the study of structure–taxonomy relations observed over an entire genetic object (that is, a mitochondrial genome).
Figure 1. Distribution of genomes in the rigid map. The labelled clades are Actinopterygii, Amphibia, Archosauria, Lepidosauria, Mammalia, Neoptera and Testudines.

Mitochondria are another very good object for population studies of this type: they have a rather simple and short genome (typical length is about nucleotides); they encode absolutely the same function in any organism; and the genome consists of a single chromosome, thus making taxonomy the only factor affecting the difference among them. Another good genetic system that might be used for this kind of study is chloroplasts. Moreover, one might want to study the mutual distribution of two genomic systems: the former is that of mitochondria, and the latter is that of chloroplasts.

Probably, the database structure is the key problem in this kind of study. We used an unsupervised classification technique to distribute the genomes into a few groups. The results of such a classification are usually quite sensitive to the composition of the original database (Fukunaga, 1990; Горбань, Россиев, 1996). We have found that a database containing too many entries representing higher taxa with a single (or, maybe, two) species shows nothing in terms of a distribution of genomes into classes, regardless of the specific composition of the classes. In other words, no separation into classes takes place for such databases.

By increasing the "concentration" of relatively close species in the database, one can figure out various patterns in the set of genetic entities. In this respect, the best database should list an equal number of entries in every genus. Whether the number of genera should also be equal across the higher taxonomy ranks in such databases is an open question. Indeed, the results presented above unambiguously demonstrate the efficiency of the unsupervised linear classification technique for rather non-equilibrium databases.

Figure 2. Two-class (left) and three-class (right) distributions of the genomes.
Another option to retrieve more knowledge about the relation between structure and taxonomy (or function) is to switch to some other type of structure. Indeed, here we used the most general structure, namely the frequency dictionary $W_3$ of triplets. Here each nucleotide generates a triplet (for the sequence connected into a ring), and the frame shift step is equal to one nucleotide. There could be other dictionaries, first of all those obtained by varying the frame shift step. In particular, the paper (Gorban et al., 2003) shows the results of bacterial genome clusterization obtained through the comparison of three different dictionaries $\widetilde{W}_3^{(1)}$, $\widetilde{W}_3^{(2)}$ and $\widetilde{W}_3^{(3)}$ counted over non-overlapping triplets, for three different starting points of the count. Obviously, merging these three dictionaries yields the standard dictionary $W_3$.

Evidently, the set of frequency dictionaries (of triplets) is mapped into a linear subspace of co-dimension one. All the points representing the genomes are located on the simplex determined by the normalization constraint

$$\sum_{\omega} f_\omega = 1 ,$$

where the sum runs over all 64 triplets $\omega$. This constraint makes the mutual location of the points quite bound. Replacing the real frequencies of triplets with their information values, one can break through the constraint mentioned above. The information value of a triplet here is the ratio of the real frequency $f_{\nu_1\nu_2\nu_3}$ to the expected one $\tilde{f}_{\nu_1\nu_2\nu_3}$. Obviously, an expected frequency could be determined in various ways; the approach based on the construction of the most probable continuation of dinucleotides into triplets yields the formula

$$\tilde{f}_{\nu_1\nu_2\nu_3} = \frac{f_{\nu_1\nu_2} \times f_{\nu_2\nu_3}}{f_{\nu_2}} \qquad (1)$$

for the expected frequency $\tilde{f}_{\nu_1\nu_2\nu_3}$. The details of this idea can be found in (Bugaenko et al., 1998; Gorban et al., 2001; Sadovsky et al., 2008; Sadovsky, 2003, 2006; Садовский, 2009). Thus, the information value of a triplet is the ratio of the real frequency to the expected one determined by (1).
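Formula (1) and the resulting information value can be illustrated with a short Python sketch. It assumes that the mono-, di- and trinucleotide dictionaries are all counted over the sequence closed into a ring with a frame shift of one nucleotide; the function names `ring_frequencies` and `information_values` are hypothetical.

```python
from collections import Counter


def ring_frequencies(seq, q):
    """Frequencies of words of length q in a sequence closed into a ring."""
    n = len(seq)
    ring = seq + seq[: q - 1]
    counts = Counter(ring[i : i + q] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def information_values(seq):
    """Ratio of the real triplet frequency to the expected one from formula (1)."""
    f1 = ring_frequencies(seq, 1)
    f2 = ring_frequencies(seq, 2)
    f3 = ring_frequencies(seq, 3)
    values = {}
    for triplet, real in f3.items():
        # Expected frequency: f(v1 v2) * f(v2 v3) / f(v2).
        expected = f2[triplet[:2]] * f2[triplet[1:]] / f1[triplet[1]]
        values[triplet] = real / expected
    return values


print(information_values("AAC"))
```

Only triplets that actually occur in the sequence are listed, so every denominator is positive. A value close to one means that the triplet frequency is fully explained by the frequencies of its dinucleotides; values far from one mark triplets that carry additional information.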
This approach was very fruitful for the study of the correlation between the structure and taxonomy of bacteria (Gorban et al., 2000; Горбань и др., 2003).

References
Bugaenko N. N., Gorban A. N., Sadovsky M. G. 1998. Maximum entropy method in analysis of genetic text and measurement of its information content. Open Systems & Information Dyn.: 265–278.

Gorban A. N., Zinovyev A. Yu. 2010. Principal manifolds and graphs in practice: from molecular biology to dynamical systems. International Journal of Neural Systems: 219–232.

Gorban A. N., Kegl B., Wünsch D. C., Zinovyev A. Yu. (eds.). 2007. Principal Manifolds for Data Visualisation and Dimension Reduction. Lecture Notes in Computational Science and Engineering, Springer, Berlin – Heidelberg – New York. 332 p.

Gorban A. N., Popova T. G., Zinovyev A. Yu. 2003. Seven clusters in genomic triplet distributions. In Silico Biology: 471–482.

Gorban A. N., Popova T. G., Sadovsky M. G., Wünsch D. C. 2001. Information content of the frequency dictionaries, reconstruction, transformation and classification of dictionaries and genetic texts. Intelligent Engineering Systems through Artificial Neural Networks, 11: Smart Engineering System Design, N.-Y.: ASME Press, pp. 657–663.

Gorban A. N., Popova T. G., Sadovsky M. G. 2000. Classification of symbol sequences over their frequency dictionaries: towards the connection between structure and natural taxonomy. Open Systems & Information Dyn.: 1–17.

Fukunaga K. 1990. Introduction to statistical pattern recognition. 2nd edition. Academic Press: London. 591 p.

Sadovsky M. G., Shchepanovsky A. S., Putintzeva Yu. A. 2008. Genes, Information and Sense: Complexity and Knowledge Retrieval. Theory in Biosciences: 69–78.

Sadovsky M. G. 2003. Comparison of real frequencies of strings vs. the expected ones reveals the information capacity of macromoleculae. J. of Biol. Physics: 23–38.

Sadovsky M. G. 2006. Information capacity of nucleotide sequences and its applications. Bulletin of Math. Biology: 156–178.

Y Shi, Gorban A. N., Yang T. Y. 2014. Is it possible to predict long-term success with k-NN? Case study of four market indices (FTSE100, DAX, HANGSENG, NASDAQ). J. Phys.: Conf. Ser.