L. sibirica Ledeb. Chloroplast Genome Yields Unusual Seven-Cluster Structure
Michael G. Sadovsky, Eugenia I. Bondar, Yuliya A. Putintseva (2, 3), and Konstantin V. Krutovsky (4, 2, 5, 6)
(1) Institute of Computational Modelling of SD of RAS; 660036 Russia, Krasnoyarsk, Akademgorodok.
(2) Siberian Federal University; 660041 Russia, Krasnoyarsk, Svobodny prosp. 79.
(3) Institute of Forest of SD RAS; 660036 Russia, Krasnoyarsk, Akademgorodok.
(4) Georg-August-University of Göttingen, Büsgenweg, 2, Göttingen, D-37077, Germany.
(5) N. I. Vavilov Institute of General Genetics of RAS; Gubkin Str., 3, Moscow, 119333, Russia.
(6) Texas A&M University, HFSB 305, 2138 TAMU, College Station, Texas, 77843, USA.
We studied the structuredness of the chloroplast genome of Siberian larch. Clusters in 63-dimensional space were identified with the elastic map technique, where the objects to be clustered are fragments of the genome. The seven-cluster structure in the distribution of those fragments, reported previously, has been found. Unlike the previous results, we have found a drastically different composition of the clusters comprising the fragments extracted from coding and non-coding regions of the genome.
PACS numbers: 87.10.+e, 87.14.Gg, 87.15.Cc, 02.50.-r
Keywords: frequency; triplet; order; cluster; elastic map; evolution
I. INTRODUCTION
Molecular biology provides mathematics with a number of mathematically sound problems and questions. Arguably, structure identification and the detection of order in an ensemble of finite sequences are the most interesting among them. Finite symbol sequences, being a typical mathematical object, are naturally present as genetic matter in any living being, namely as DNA sequences. Below we consider the finite symbol sequences of the chloroplast genomes of five plant species, including that of Larix sibirica Ledeb., which was recently completely sequenced, assembled and annotated in the Laboratory of Forest Genomics at the Genome Research and Education Centre of Siberian Federal University [1, 2].

This sequence consists of 122 561 symbols (letters) from the four-letter alphabet $\aleph = \{A, C, G, T\}$. No other symbols or blank spaces are supposed to be found in a sequence; a sequence under consideration is also supposed to be coherent (i.e., consisting of a single piece).

Identification and search of structures in DNA sequences is a main objective of mathematical bioinformatics, biophysics and related scientific fields, including computer programming and information theory. Structures observed within a sequence reveal an order and make it easier to understand the functional roles of a sequence or its fragments. A new function (or a connection between function and structure, or taxonomy) might be discovered through a search for new patterns in symbol sequences corresponding to DNA molecules.

Previously, an intriguing seven-cluster structure has been reported in various genomes [3–5]. In brief, the observed pattern consists of seven groups of rather short fragments of a genome (say, a few hundred nucleotides long) arranged into seven clusters, depending on the information content encoded in those fragments. Three clusters forming a triangle in the seven-cluster pattern gather the fragments encoding genes, etc.; three others gather the fragments that are complementary to the former ones; and the seventh, central cluster gathers the fragments found in the non-coding regions of a genome.

Here we report a similar structure observed over the chloroplast genome of L. sibirica Ledeb. Unlike the patterns described in [3–5], the structure observed in the chloroplast genome shows a drastically different arrangement of the nodes found in the clusterization: the nodes comprising coding and non-coding regions are located in a fundamentally different manner.
II. MATERIAL AND METHODS
The chloroplast genome of Siberian larch (L. sibirica Ledeb.) was sequenced using the Illumina HiSeq2000 sequencer at the Laboratory of Forest Genomics of the Siberian Federal University [1, 2] (see also [15]). The chloroplast genome contains 121 coding regions, with 69 of them found in the leading strand and 52 in the lagging one. The total length of all the genes is 68 307 bp. The average length of a coding region is 593 bp, ranging from 70 bp (the shortest one) to 6 560 bp (the longest one). The standard deviation of the lengths of coding regions is $\sigma_{CDS} \approx 1028$ bp.

TABLE I: Product is a list of typical protein products of a group of genes; M is the number of corresponding genes in the group.

Product                                            M
tRNA                                               31
rRNA                                               3
Ribosomal proteins                                 25
Photosystems I, II                                 22
Cytochrome (b−f, b)                                10
RNA polymerase                                     5
ATP synthase                                       4
Light-independent protochlorophyllide reductase    3
Translation initiation factor 1                    1
NAD(P)H-quinone oxidoreductase                     1
Cell division protein FtsH                         1
Maturase K                                         1
Ribulose bisphosphate carboxylase                  1
Acetyl-coenzyme A carboxyl transferase             1
ATP-dependent Clp protease                         1
Hypothetical protein                               10

Consider a text $T$ of length $N$ over the four-letter alphabet $\aleph$ mentioned above. No gaps occur in a text. A word $\omega = \nu_1 \nu_2 \ldots \nu_{q-1} \nu_q$ of length $q$ is a string occurring in the text $T$. Here $\nu_j$ is the symbol occupying the $j$-th position in the word; $\nu_j \in \aleph$.

Everywhere below we consider only words three symbols (nucleotides) long and call them triplets. The $(q, l)$-frequency dictionary $W_q(l)$ is the set of all the words of length $q$ counted within the text $T$ with a step of $l$ symbols, so that each word is accompanied by its frequency. The frequency of a word $\omega$ is defined traditionally: it is the number $n_\omega$ of copies of the word divided by the total number of all copies of all the words [6–9]. The parameter $l$ is arbitrary in a dictionary; everywhere below we consider only the $W_3(3)$ frequency dictionaries. Evidently, a $W_3(3)$ frequency dictionary comprises a set of codons in some cases.

Triplets play the key role in the processing of inherited information, and this is the basic idea behind the choice. Besides, we follow the classic papers [3–5], where triplet frequencies were used to reveal the cluster structure of bacterial and yeast genomes. The authors of [3–5] also tried other $q$-tuples ($q = 2$ and $q = 4$) and found that such a choice provides significantly less information about the structuredness of a nucleotide sequence. Another informal support for this choice comes from observations on the information capacity of various genomes [12], where triplets are also shown to stand out from other $q$-tuples with $q \neq 3$.

A frequency dictionary $W_3(3)$ unambiguously maps a text $T$ into a 64-dimensional space, with the triplets being the coordinates and their frequencies being the coordinate values.
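To make the definition concrete, the following minimal Python sketch computes a $(q, l)$-frequency dictionary and the 64-dimensional triplet vector of a text. It is an illustration only: the function names `frequency_dictionary` and `triplet_vector` and the lexicographic ordering of the triplets are assumptions made here, not part of the original computations.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
# All 64 triplets in lexicographic order (the ordering is an assumption).
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]


def frequency_dictionary(text, q=3, step=3):
    """Return the (q, step)-frequency dictionary of `text` as {word: frequency}.

    Words of length q are read at positions 0, step, 2*step, ...;
    a frequency is the count of a word divided by the total number of words read.
    """
    words = [text[i:i + q] for i in range(0, len(text) - q + 1, step)]
    if not words:
        raise ValueError("text is shorter than the word length q")
    counts = Counter(words)
    total = sum(counts.values())
    return {word: counts[word] / total for word in sorted(counts)}


def triplet_vector(text):
    """Map a text to a point in 64-dimensional space: one coordinate per triplet."""
    freq = frequency_dictionary(text, q=3, step=3)
    return [freq.get(triplet, 0.0) for triplet in TRIPLETS]
```

For instance, `frequency_dictionary("ACGACGTTT")` reads the three words ACG, ACG and TTT, so ACG gets the frequency 2/3 and TTT gets 1/3; all other coordinates of `triplet_vector` are zero.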
A frequency dictionary thus represents the short-range (or, at most, meso-scale) structuredness of a symbol sequence. Consider, then, a window selecting a fragment $F$ of length $S$ in a text $T$. Then $R(S, d)$ is an $(S, d)$-lattice, that is, the set of fragments of length $S$ consecutively selected along the text $T$ by the window of length $S$ with the step $d$. Obviously, a lattice consists of overlapping fragments if $d < S$.

This is the basic object for the further analysis of the statistical properties of a symbol sequence representing a chloroplast genome. The key idea of the paper is to check whether the fragments obtained for some $(S, d)$-lattice $R(S, d)$ differ in their statistical properties or not. Only the properties expressed in terms of the $(3, 3)$-frequency dictionary $W_3(3)$ are considered.

A. Clusterization techniques
We identified clusters in the data set with an approach based on the elastic map technique [10, 11, 13]. The basic idea of this method is to approximate multidimensional data with a manifold of smaller dimension; the elastic map technique implies approximation with a two-dimensional manifold (see details in [11]). In brief, the procedure is as follows. At the first step, the first and the second principal components are found, and a plane is built on these two axes. At the second step, each data point is projected onto the plane and connected to its projection by an elastic spring. At the third step, the plane is allowed to bend and expand, so that the system is released to reach the minimum of the total energy (deformation plus spring extension). At the fourth step, each data point is re-determined on the jammed map: the new image of a data point is the point on the map that is closest to the original point in terms of the chosen metric. Finally, the jammed map is "smoothed" by the inverse non-linear transformation (for more details see [11, 13]). All the results were obtained with the ViDaExpert software by A. Zinovyev.

III. RESULTS
To begin with, we describe how the data set was developed. First, the genome sequence was covered with the $(S, d)$-lattice with $S = 303$ and $d = 10$. It was important that $d \not\equiv 0 \pmod 3$. Each fragment in the lattice was labeled with the position number of the central symbol of the fragment. Next, each fragment of the lattice was transformed into its $W_3(3)$ frequency dictionary. Hence, the sequence was mapped into a set of points in a metric 64-dimensional space. We use the Euclidean metric hereafter.

Since the linear constraint

$$\sum_{\omega} f_{\omega} = 1 \qquad (1)$$

brings an additional parasitic signal, one of the triplets must be eliminated from the set. Formally, any triplet could be eliminated, but in practice the choice may affect the results of further treatment. Two strategies could be implemented here: either to exclude the triplet with the greatest frequency, or to remove the triplet yielding the least contribution to the separation and discrimination of the points. We pursued the second strategy: the triplet CGC, which has the minimal standard deviation ($\sigma_{CGC} = 0.\ldots$), was excluded.

The ViDaExpert software by A. Zinovyev (http://bioinfo-out.curie.fr/projects/vidaexpert/) was used. The standard parameter configuration was used to develop the map (see Fig. 1, left), which depicts the famous seven-cluster structure [3–5] explicitly, with the due edge-node pattern. The left picture in the figure shows the distribution of the fragments over the clusters; colors indicate the local density of the fragments, with the maximal density labeled in red and the lowest one in blue.

The right picture in Fig. 1 shows the distribution of the fragments located within coding vs. non-coding regions of the genome. First, we explain the idea of the phase of a fragment. The (absolute) phase of a fragment labeled by the number $S_j$ (here $S_j$ denotes the position of the central nucleotide along the genome) yields three possible values for the remainder of the division of $S_j$ by 3: 0, 1 and 2.
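The construction of the data set described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure implemented in ViDaExpert: the function names are hypothetical, positions are counted from 1, the population standard deviation is used, ties for the minimal deviation are resolved by taking the first triplet in lexicographic order, and the text is assumed to contain only the letters A, C, G and T.

```python
from itertools import product
from statistics import pstdev

# All 64 triplets in lexicographic order (the ordering is an assumption).
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def lattice(text, S=303, d=10):
    """Yield (centre, fragment) for windows of odd length S taken with step d.

    `centre` is the 1-based position of the central symbol of the fragment.
    """
    for start in range(0, len(text) - S + 1, d):
        yield start + S // 2 + 1, text[start:start + S]


def triplet_frequencies(fragment):
    """Frequencies of the 64 triplets read with step 3 (non-overlapping)."""
    n = len(fragment) // 3
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(0, 3 * n, 3):
        counts[fragment[i:i + 3]] += 1  # KeyError on symbols outside ACGT
    return [counts[t] / n for t in TRIPLETS]


def build_dataset(text, S=303, d=10):
    """Return (kept_triplets, centres, points, phases, dropped_triplet)."""
    centres, rows = [], []
    for centre, fragment in lattice(text, S, d):
        centres.append(centre)
        rows.append(triplet_frequencies(fragment))
    if not rows:
        raise ValueError("text is shorter than the window length S")
    # The frequencies sum to 1, so one coordinate is redundant: drop the
    # triplet whose frequency varies least over the lattice.
    spreads = [pstdev(column) for column in zip(*rows)]
    drop = spreads.index(min(spreads))
    kept = [t for k, t in enumerate(TRIPLETS) if k != drop]
    points = [[x for k, x in enumerate(row) if k != drop] for row in rows]
    # Absolute phase: remainder of the central position divided by 3.
    phases = [centre % 3 for centre in centres]
    return kept, centres, points, phases, TRIPLETS[drop]
```

Each row of `points` is then a point in 63-dimensional space. Because $d = 10$ is not divisible by 3, consecutive fragments cycle through all three absolute phases.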
The remainder of $S_j$ modulo 3 is thus the absolute phase of a fragment.

Meanwhile, the absolute phase may have nothing to do with the biological charge of a fragment: indeed, it measures the location of a fragment against the first nucleotide in a sequence, while a gene (or a coding region) may or may not have a fixed absolute phase. Thus, one has to introduce the relative phase, which determines the location of a fragment with respect to a coding region. The location (expressed in nucleotide numbers) of a gene or another functionally charged site within the genome is provided by the genome annotation. Hence, we determined the relative location of each fragment against the coding region whenever a fragment is embedded in the region. Again, the relative phase is defined as the remainder of the division by 3 of the length of the string connecting the start position of the coding region and the fragment; obviously, the relative phase takes the values 0, 1 and 2. Hence, the relative phase identifies the fragments in the $(S, d)$-lattice starting at the same position (by the remainder, to be exact) within any coding region.

Fig. 1 (right picture) shows the distribution of the relative phases over the clusters obtained by the clusterization procedure described above (for the $(303, 10)$-lattice).

IV. DISCUSSION
Seven-cluster structures in various genomes (mainly bacterial ones) were reported earlier [3–5]. It was found that the pattern is formed by two triangles whose vertices are the clusters; whether one observes a seven-cluster or a four-cluster structure depends strongly on the GC-content of a genome. For genomes with GC-content close to the equilibrium one (say, about 0.… ÷ 0.…), the seven-cluster structure is observed, while a shift of the GC-content toward one of the poles yields a kind of degeneration of the pattern into a four-cluster structure. The GC-content of the chloroplast genome under consideration is equal to 0.43, which makes the observed seven-cluster structure concordant with the previously reported ones.

FIG. 1: The seven-cluster structure identified over the L. sibirica Ledeb. chloroplast genome (left). The right figure shows the distribution of the fragments from coding vs. non-coding regions of the genome.

Surely, the choice of the specific values for the $(S, d)$-lattice may strongly affect the clusterization results. Yet we have no comprehensive and substantiated idea of how to choose these values. Probably, some heuristic approaches could be used for that. For technical reasons, $S$ must be odd and divisible by 3; in turn, $d$ may vary significantly, but it must not be divisible by 3. A growth of $d$ results just in a decay of the capacity of the data set: the total number of fragments taken into consideration goes down (linearly). The choice of $S$ is less evident, as Fig. 3 illustrates. It shows a similar seven-cluster pattern observed for a rice chloroplast genome with $S = 3003$ (see Fig. 3). Probably, a natural constraint on the choice of $S$ is to take it close to the (average) length of a gene or another functionally charged site found within the genome. This idea makes the value $S = 303$ taken for the studies described above rather natural and informative.

FIG. 3: An example of clusterization similar to the one shown in Fig. 1, with window length $S = 3003$. The pattern is shown in the principal components visualization scheme.

An obvious bias in the mutual location of the fragments from coding vs. non-coding regions raises a new question about the distribution of genes between those clusters (these are the clusters […]). […] in the distribution of genes over the clusters failed. The next group of genes (in terms of abundance) is found in a sole cluster. Finally, the smallest number of genes is distributed among two clusters (see Fig. 2). The observed pattern is not a matter of surprise: this is one more manifestation of the (very diverse) relation between structure and function. Yet the pattern should be checked and verified: we did not distinguish the genes located in the leading strand from those located in the lagging one. Taking this into account, one may expect the pattern to change slightly. It will definitely not be destroyed completely, nor will it change into an absolutely different form; nonetheless, some minor yet important changes may take place.

The distribution of the fragments located within coding and non-coding regions of the genome deviates significantly from the results presented in [3–5]. The authors of [3–5] stipulate that the nodes of the seven-cluster structure (as shown in Fig. 1) could be arranged into two triangles, where the first triangle comprises the fragments from the coding regions, the second triangle also comprises the coding regions (but with dual triplets that make the so-called complementary palindromes, or the couples defined according to Chargaff's parity rule), and the central cluster gathers the fragments located within the non-coding regions. This might be true for bacterial genomes, which are known to bear very few non-coding regions.

Roughly, the ratio of the length of coding vs. non-coding regions for the L. sibirica Ledeb. chloroplast genome is about 0.5. Maybe this fact results in the pattern observed for the genome (see Fig. 1): two dual triangles comprise the fragments belonging to coding and non-coding regions, separately. An immediate and, to some extent, innocent explanation of this observation may come from the length of the genome: it might be short enough for finite sampling effects to take place, manifesting themselves through this separation of coding vs. non-coding fragments. Since chloroplast genomes are typically not too long, one may address the question through simulation. Taking a bacterial genome (or, better, any other one with close values of GC-content and of the coding vs. non-coding ratio), one can take several randomly chosen parts of such a genome as a surrogate "chloroplast genome". If a similar pattern is observed, then the finite sampling effect takes place.

If no finite sampling effect takes place, then the observed pattern is expected to result from the biology of chloroplast genomes. To figure out whether the biology affects the combinatorial properties of the nucleotide sequences, one should carry out a comparative study of the clusterization over a family of chloroplast genomes of various species; this topic falls beyond the scope of the paper.

Here we present some preliminary results on the clusterization observed over the chloroplast genome of L. sibirica Ledeb. Originally, the seven-cluster structure was observed for bacterial and yeast genomes [3–5]. Chloroplasts are supposed to originate from the bacterial world. Nonetheless, the structure we have found differs strongly from the similar one observed in bacterial genomes. Thus, further studies should address the comparative analysis of the pattern described above with similar studies of the chloroplast genomes of Abies sibirica and Pinus sibirica; obviously, one should also compare the patterns observed in conifers with those observed in angiosperm species.

Additionally, we have collected a number of portraits of chloroplast genomes of some species; the collection supports the idea of a clusterization of coding vs. non-coding fragments, while the pattern may differ significantly from a "classical" seven-cluster one. In all these pictures, orange points correspond to non-coding fragments, crimson indicates phase 0, green indicates phase 1, and yellow indicates phase 2. Just enjoy the pictures!
V. CONCLUSION
A seven-cluster structure was found in the chloroplast genome of L. sibirica. This is the fundamental structure of any genome; the pattern found is not degenerate, since the frequency of nucleotide A differs significantly from the frequency of nucleotide G. The absence of degeneracy may point to the prototypic genome that gave origin to chloroplasts; it is supposed to be a bacterial one (following the symbiotic theory of organelle origin). Unlike the genomes of bacteria, the chloroplast genome yields a more complex structure of (at least two) clusters: these seem to consist of two and three subclusters, respectively. The detailed structure of these complex clusters needs more study, but may bring a new understanding of fine structural details, or of the relations between structure and function in the chloroplast genome.

VI. ACKNOWLEDGEMENTS
This study was supported by research grant No. 14.Y26.31.0004 from the Government of the Russian Federation. We also thank Maria Yu. Senashova from ICM SB RAS (Krasnoyarsk) for help in calculations and data visualization.

[1] Krutovsky K. V., Oreshkova N. V., Putintseva Yu. A., Ibe A. A., Deutsch K. O., Shilkina E. A. Some preliminary results of a full genome de novo sequencing of Larix sibirica Ledeb. and Pinus sibirica Du Tour. Siberian Forest Journal (in Russian, English abstract), (4): 79–83 (2014).
[2] Bondar E. I., Putintseva Yu. A., Oreshkova N. V., Krutovsky K. V. Study of Siberian larch (Larix sibirica Ledeb.) chloroplast genome and development of polymorphic chloroplast markers. In: Proc. of the 4th Int. Conf. "Conservation of Forest Genetic Resources in Siberia", August 24–29, Barnaul, Russia. P. 20–21 (2015).
[3] Gorban A. N., Zinovyev A. Yu., Popova T. G. Seven clusters in genomic triplet distributions. In Silico Biology: 39–45 (2003).
[4] Gorban A. N., Zinovyev A. Yu., Popova T. G. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. In Silico Biology: 25–37 (2005).
[5] Gorban A. N., Zinovyev A. Yu., Popova T. G. Universal seven-cluster structure of genome fragment distribution: basic symmetry in triplet frequencies. In: Bioinformatics of Genome Regulation and Structure II (Eds. N. Kolchanov and R. Hofestaedt). Springer Science+Business Media, Inc. P. 153–163 (2005).
[6] Bugaenko N. N., Gorban A. N., Sadovsky M. G. Maximum entropy method in analysis of genetic text and measurement of its information content. Open Systems & Information Dyn., 265–281 (1998).
[7] Sadovsky M. G., Shchepanovsky A. S., Putintzeva Yu. A. Genes, Information and Sense: Complexity and Knowledge Retrieval. Theory in Biosciences, 69 (2008).
[8] Sadovsky M. G. Comparison of real frequencies of strings vs. the expected ones reveals the information capacity of macromoleculae. J. of Biol. Physics, 23 (2003).
[9] Sadovsky M. G. Information capacity of nucleotide sequences and its applications. Bulletin of Math. Biology, 156 (2006).
[10] Gorban A. N., Zinovyev A. Yu. Principal manifolds and graphs in practice: from molecular biology to dynamical systems. Int. J. of Neural Systems, 219 (2010).
[11] Gorban A. N., Kögl B., Wünsch D. C., Zinovyev A. Yu. (eds.). Principal Manifolds for Data Visualisation and Dimension Reduction. Lecture Notes in Computational Science and Engineering, Springer, Berlin – Heidelberg – New York. 332 p. (2007).
[12] Sadovsky M. G. Information Capacity of Biological Macromoleculae Reloaded. arXiv:q-bio/0501011 (2005).
[13] Gorban A. N., Zinovyev A. Yu. Principal Graphs and Manifolds. In: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques, Olivas E. S. et al. eds. Information Science Reference, IGI Global: Hershey, PA, USA, pp. 28–59 (2009).
[14] Fukunaga K. Introduction to statistical pattern recognition. 2nd edition. Academic Press: London. 591 p. (1990).
[15] Sadovsky M. G., Krutovsky K. V., Oreshkova N. V., Putintseva Yu. A., Bondar Eu. I., Vaganov Eu. F. Seven-Cluster Structure of Larch Chloroplast Genome. Journal of Siberian Federal University. Biology, (1), 128–146 (2015).

FIG. 4: Larix decidua chloroplast DNA; accession number is AB501189.
FIG. 5: Physcomitrella patens chloroplast DNA; accession number is AP005672. This is a moss.
FIG. 6: Equisetum arvense chloroplast DNA; accession number is GU191334. This is a mare's-tail.
FIG. 7: Ginkgo biloba chloroplast DNA; accession number is AB684440.
FIG. 8: Pinus taeda chloroplast DNA; accession number is KC427273. This is a pine.
FIG. 9: Arabidopsis thaliana (thale cress) chloroplast DNA; accession number is AP000423.
FIG. 10: