Codon Usage Bias Measured Through Entropy Approach

Abstract

A measure of codon usage bias is defined as the mutual entropy of the real codon frequency distribution calculated against a quasi-equilibrium one. The latter is defined in three ways: (1) the frequencies of synonymous codons are supposed to be equal (i.e., each equals the arithmetic mean of their frequencies); (2) it coincides with the frequency distribution of triplets; and, finally, (3) it is defined as the expected frequency of codons derived from the dinucleotide frequency distribution. The measure of bias in codon usage is calculated for 125 bacterial genomes.


arXiv [q-bio.GN], Jun

Michael G. Sadovsky (a, b, ∗), Julia A. Putintzeva (b)

(a) Institute of Computational Modelling of RAS
(b) Siberian Federal University, Institute of Natural Sciences & Humanities


Key words: frequency, expected frequency, information value, entropy, correlation, classification

∗ To whom the correspondence should be addressed. Email addresses: [email protected] (Michael G. Sadovsky), [email protected] (Julia A. Putintzeva).

Introduction

It is well known that the genetic code is degenerate. All amino acids except two are encoded by two or more codons; such codons are called synonymous and usually differ in the nucleotide occupying the third position of the codon. Synonymous codons occur with different frequencies, and this difference is observed both between various genomes (Sharp, Li, 1987; Jansen et al., 2003; Zeeberg, 2002; Supek, Vlahoviček, 2005) and between different genes of the same genome (Zeeberg, 2002; Supek, Vlahoviček, 2005; Xiu-Feng et al., 2004; Suzuki et al., 2004). Synonymous codon usage bias could be explained in various ways, including mutational bias (shaping genomic G+C composition) and translational selection by tRNA abundance (acting mainly on highly expressed genes). Still, the reported results are somewhat contradictory (Suzuki et al., 2004). The contradiction may result from differences in the statistical methods used to estimate codon usage bias. Here one should clearly understand what factors affect the method and the numerical result.

Boltzmann entropy theory (Gibbs, 1902; Gorban, Karlin, 2005) has been applied to estimate the degree of deviation from equal codon usage (Frappat et al., 2003; Zeeberg, 2002). The key point here is that the measure of codon usage bias should be independent of biological issues. It is highly desirable to avoid any biological assumptions (such as mutational bias or translational selection); the measure must be defined in a purely mathematical way. The idea of entropy seems to suit best here. The additional constraints on codon usage resulting from the amino acid frequency distribution affect the entropy values, thus obscuring the effects directly linked to biases in synonymous codon usage.

Here we propose three new indices of codon usage bias, which take into account all three important aspects of amino acid usage, i.e., (1) the number of distinct amino acids, (2) their relative frequencies, and (3) their degree of codon degeneracy. All the indices are based on the calculation of the mutual entropy $S$. They differ in the codon frequency distribution supposed to be "quasi-equilibrium"; indeed, the difference between the indices lies in the definition of the latter.

Consider a genetic entity, say, a genome, of length $N$; the latter is the number of nucleotides composing the entity. A word $\omega$ (of length $q$) is a string of length $q$, $1 \le q \le N$, observed within the entity. The set of all words occurring within an entity makes the support $V$ of the entity (or $q$-support, if the length $q$ must be indicated). Accompanying each element $\omega \in V$ with the number $n_\omega$ of its copies, one gets the (finite) dictionary of the entity. Changing $n_\omega$ for the frequency

$$ f_\omega = \frac{n_\omega}{N}, $$

one gets the frequency dictionary $W_q$ of the entity (of thickness $q$).

Everywhere below, for the purposes of this paper, we shall distinguish the codon frequency distribution from the triplet frequency distribution. A triplet frequency distribution is the frequency dictionary $W_3$ of thickness $q = 3$, where triplets are identified regardless of the specific position of a triplet within the sequence. On the contrary, the codon distribution is the frequency distribution of the triplets occupying specific places within an entity: a codon is a triplet embedded into a sequence at a coding position only. Thus, the number of copies of words of length $q = 3$ involved in the codon distribution is three times less than in the frequency dictionary $W_3$ of triplets. Further, we shall denote the codon frequency dictionary as $W$; no lower index will be used, since the thickness of the dictionary is fixed (and equal to $q = 3$).

The tables of codon usage frequency were taken from the Kazusa Institute site. The corresponding genome sequences were retrieved from EMBL-bank. Codon usage tables containing no fewer than 10000 codons were used. Here we studied bacterial genomes (see Table 1).

Let $F$ denote the codon frequency distribution, $F = \{f_\nu\}$; here $f_\nu$ is the frequency of a codon $\nu$. Further, let $\tilde F$ denote a quasi-equilibrium frequency distribution of codons. Then the measure $I$ of the codon usage bias is defined as the mutual entropy of the real frequency distribution $F$ calculated against the quasi-equilibrium one $\tilde F$:

$$ I = \sum_{\omega=1}^{64} f_\omega \cdot \ln\left(\frac{f_\omega}{\tilde f_\omega}\right). \tag{1} $$

Here the index $\omega$ enumerates the codons, and $\tilde f_\omega \in \tilde F$ is the quasi-equilibrium frequency. The measure (1) itself is rather simple and clear; the definition of the quasi-equilibrium distribution of codons is the matter of discussion here. We propose three ways to define the distribution $\tilde F$; they provide three different indices of codon usage bias. The relation between the values of these indices observed for the same genome is the key issue for our study.

2.1 Locally equilibrium codon distribution

It is a well-known fact that various amino acids occur with different frequencies within a genome or a gene. Synonymous codons, in turn, occur with different frequencies within similar genetic entities. Thus, the equality of the frequencies of all the synonymous codons encoding the same amino acid,

$$ \tilde f_j = \frac{1}{L} \sum_{j \in J_i} f_j, \qquad \sum_{j \in J_i} f_j = \sum_{j \in J_i} \tilde f_j = \varphi_i, \tag{2} $$

is the first way to determine a quasi-equilibrium codon frequency distribution. Here the index $j$ enumerates the synonymous codons encoding the same amino acid, $J_i$ is the set of such codons for the $i$-th amino acid, and $\varphi_i$ is the frequency of the latter; the second identity in (2) implies that $L$ is the number of codons in $J_i$. Surely, the list of amino acids must be extended with the stop signal (encoded by three codons). Obviously, $\tilde f_j = \tilde f_k$ for any pair $j, k \in J_i$.

A triplet distribution gives the second way to define the quasi-equilibrium codon frequency distribution. Since the codon frequency is determined with respect to the specific locations of the strings of length $q = 3$, two thirds of the copies of these strings fall beyond the calculation of the codon frequency distribution. Thus, one can compare the codon frequency distribution with the similar distribution taken over the entire sequence, with no gaps in string location. So, the frequency dictionary of thickness $q = 3$ (with triplet frequencies $\hat f_l$),

$$ \tilde f_l = \hat f_l, \qquad 1 \le l \le 64, \tag{3} $$

is the quasi-equilibrium codon distribution here.
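The measure (1) with the quasi-equilibrium distributions (2) and (3) can be illustrated with a short sketch. This is not the authors' code: the toy sequence and all function names are hypothetical, the standard genetic code is assumed, frequencies are normalized by the number of counted words rather than by $N$, and the sequence is not closed into a ring.

```python
from collections import Counter
from math import log

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def frequency_dictionary(seq, q=3, step=1):
    """Word -> relative frequency; step=1 gives triplets, step=3 gives codons."""
    words = [seq[i:i + q] for i in range(0, len(seq) - q + 1, step)]
    total = len(words)
    return {w: n / total for w, n in Counter(words).items()}


def locally_equilibrium(codon_freq):
    """Definition (2): each codon gets the mean frequency of its synonymous group."""
    groups = {}
    for codon, aa in CODE.items():
        groups.setdefault(aa, []).append(codon)
    quasi = {}
    for codons in groups.values():
        mean = sum(codon_freq.get(c, 0.0) for c in codons) / len(codons)
        quasi.update((c, mean) for c in codons)
    return quasi


def mutual_entropy(real, quasi):
    """Measure (1): sum over codons of f * ln(f / f~), skipping f = 0."""
    return sum(f * log(f / quasi[w]) for w, f in real.items() if f > 0)


seq = "ATGGCTGCTGCAGCTAAAGAAGAAGAATTGTTGCTGTAA"  # hypothetical toy gene
codons = frequency_dictionary(seq, step=3)     # in-frame words only
triplets = frequency_dictionary(seq, step=1)   # every position

index_I = mutual_entropy(codons, locally_equilibrium(codons))  # uses (2)
index_S_star = mutual_entropy(codons, triplets)                # uses (3)
print(f"I = {index_I:.4f}, S* = {index_S_star:.4f}")
```

If all synonymous codons of every amino acid are used equally often, the first index vanishes, which is the sense in which (2) is an equilibrium reference.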

Finally, the third way to define the quasi-equilibrium codon frequency distribution is to derive it from the frequency distribution of the dinucleotides composing the codon. Having the codon frequency distribution $F$, one can always derive the frequency composition $F_2$ of the dinucleotides composing the codons. To do that, one must sum up the frequencies of the codons differing in the third (or the first) nucleotide. Such a transformation is unambiguous (here one must close up a sequence into a ring). The situation gets worse when one tries to obtain a codon distribution through the inverse transformation. An upward transformation yields a family of dictionaries $\{F\}$ instead of a single one $F$. To eliminate the ambiguity, one should implement some principle of choice; here it is the distribution $\tilde F$ with maximal entropy among the entities composing the family $\{F\}$. This approach allows one to calculate the frequencies of codons explicitly:

$$ \tilde f_{ijk} = \frac{f_{ij} \times f_{jk}}{f_j}, \tag{4} $$

where $\tilde f_{ijk}$ is the expected frequency of codon $ijk$, $f_{ij}$ is the frequency of dinucleotide $ij$, and $f_j$ is the frequency of nucleotide $j$; here $i, j, k \in \{A, C, G, T\}$.

Thus, the calculation of the measure (1) maps each genome into three-dimensional space. Table 1 shows the data calculated for 115 bacterial genomes.

We have examined 115 bacterial genomes. The three indices (1 – 4) and the absolute entropy of the codon distribution are shown in Table 1.

Table 1: Indices of codon usage bias; I is the index calculated according to (2), S* stands for the index defined by (3), and T is the index defined by (4). S is the absolute entropy of the codon distribution. C is the class attribution (see Section 3.1).

Genomes  I  S*  T  S  C
Acinetobacter sp. ADP1  0.1308  0.1526  0.1332  3.9111  1
Aeropyrum pernix K1  0.1381  0.1334  0.1611  3.9302  2
Agrobacterium tumefaciens str. C58  0.1995  0.1730  0.2681  3.8504  2
Aquifex aeolicus VF5  0.1144  0.1887  0.2273  3.8507  2
Archaeoglobus fulgidus DSM 4304  0.1051  0.2008  0.2264  3.9011  2
Bacillus anthracis str. Ames  0.1808  0.1880  0.1301  3.8232  1
Bacillus anthracis str. Sterne  0.1800  0.1873  0.1300  3.8236  1
Bacillus anthracis str. 'Ames Ancestor'  0.1788  0.1850  0.1278  3.8246  1
Bacillus cereus ATCC 10987  0.1750  0.1791  0.1254  3.8291  1
Bacillus cereus ATCC 14579  0.1807  0.1853  0.1290  3.8220  1
Bacillus halodurans C-125  0.0538  0.1296  0.0967  3.9733  1
Bacillus subtilis subsp. subtilis str. 168  0.0581  0.1231  0.1117  3.9605  2
Bacteroides fragilis YCH46  0.0499  0.1201  0.1305  3.9824  2
Bacteroides thetaiotaomicron VPI-5482  0.0557  0.1258  0.1364  3.9713  2
Bartonella henselae str. Houston-1  0.1555  0.1650  0.1077  3.8913  1
Bartonella quintana str. Toulouse  0.1525  0.1616  0.1039  3.8954  1
Bdellovibrio bacteriovorus HD100  0.1197  0.1593  0.2404  3.9232  2
Bifidobacterium longum NCC2705  0.2459  0.2315  0.3666  3.8011  2
Bordetella bronchiseptica RB50  0.4884  0.3165  0.5598  3.5485  2
Borrelia burgdorferi B31  0.2330  0.1555  0.0988  3.6709  1
Borrelia garinii Pbi  0.2421  0.1616  0.1008  3.6630  1
Bradyrhizobium japonicum USDA 110  0.3163  0.2236  0.3789  3.7368  2
Campylobacter jejuni RM1221  0.2839  0.1994  0.1357  3.6617  1
Campylobacter jejuni subsp. jejuni NCTC11168  0.2846  0.2010  0.1379  3.6660  1
Caulobacter crescentus CB15  0.4250  0.2890  0.5045  3.6062  2
Chlamydophila caviae GPIC  0.1079  0.1199  0.0990  3.9445  1
Chlamydophila pneumoniae CWL029  0.0803  0.1054  0.0778  3.9748  1
Chlamydophila pneumoniae J138  0.0801  0.1050  0.0772  3.9755  1
Chlamydophila pneumoniae TW-183  0.0802  0.1037  0.0764  3.9760  1
Chlorobium tepidum TLS  0.1767  0.1809  0.2935  3.8777  2
Chromobacterium violaceum ATCC 12472  0.4245  0.3004  0.5354  3.6218  2
Chlamydophila pneumoniae AR39  0.0804  0.1055  0.0773  3.9748  2
Clostridium acetobutylicum ATCC 824  0.2431  0.1951  0.1305  3.7142  1
Clostridium perfringens str. 13  0.3602  0.2752  0.1943  3.5816  1
Clostridium tetani E88  0.3240  0.2381  0.1767  3.6088  1
Corynebacterium efficiens YS-314  0.2983  0.2379  0.3980  3.7494  2
Corynebacterium glutamicum ATCC 13032  0.0964  0.1510  0.1674  3.9498  2
Coxiella burnetii RSA 493  0.0843  0.1050  0.0892  3.9648  2
Desulfovibrio vulgaris subsp. vulgaris str. Hildenborough  0.2459  0.1980  0.3183  3.8090  2
Enterococcus faecalis V583  0.1592  0.1838  0.1295  3.8453  1
Escherichia coli CFT073  0.1052  0.1305  0.1734  3.9576  2
Escherichia coli K12 MG1655  0.1206  0.1463  0.1933  3.9372  2
Helicobacter hepaticus ATCC 51449  0.1760  0.1513  0.1065  3.8315  1
Helicobacter pylori 26695  0.1420  0.1646  0.1843  3.8454  2
Helicobacter pylori J99  0.1404  0.1660  0.1895  3.8479  2
Lactobacillus johnsonii NCC 533  0.2113  0.1937  0.1481  3.7856  1
Lactobacillus plantarum WCFS1  0.0813  0.1453  0.1544  3.9537  2
Lactococcus lactis subsp. lactis Il1403  0.1923  0.1857  0.1173  3.8068  1
Legionella pneumophila subsp. pneumophila str. Philadelphia 1  0.1018  0.1098  0.0880  3.9339  1
Leifsonia xyli subsp. xyli str. CTCB07  0.3851  0.2411  0.4032  3.6490  2
Listeria monocytogenes str. 4b F2365  0.1389  0.1766  0.1012  3.8600  1
Mannheimia succiniciproducens MBEL55E  0.1390  0.1624  0.1571  3.8943  1
Mesorhizobium loti MAFF303099  0.2734  0.2019  0.3402  3.7751  2
Methanocaldococcus jannaschii DSM 2661  0.2483  0.2108  0.1324  3.6751  2
Methanopyrus kandleri AV19  0.2483  0.2108  0.1324  3.6751  1
Methanosarcina acetivorans C2A  0.0530  0.1223  0.0876  3.9718  1
Methanosarcina mazei Go1  0.0739  0.1314  0.0889  3.9468  1
Methylococcus capsulatus str. Bath  0.2847  0.2096  0.3738  3.7709  2
Mycobacterium avium subsp. paratuberculosis str. K10  0.4579  0.2779  0.4819  3.6038  2
Mycobacterium bovis AF2122/97  0.2449  0.1688  0.2862  3.7931  2
Mycobacterium leprae TN  0.1075  0.1216  0.1717  3.9513  2
Mycobacterium tuberculosis CDC1551  0.2387  0.1618  0.2749  3.8029  2
Mycobacterium tuberculosis H37Rv  0.2457  0.1696  0.2878  3.7929  2
Mycoplasma mycoides subsp. mycoides SC  0.4748  0.2571  0.2247  3.4356  1
Mycoplasma penetrans HF-2  0.4010  0.2320  0.2047  3.5294  1
Neisseria gonorrhoeae FA 1090  0.1610  0.1740  0.2343  3.8852  2
Neisseria meningitidis MC58  0.1481  0.1708  0.2244  3.8969  2
Neisseria meningitidis Z2491 serogroup A str. Z2491  0.1541  0.1786  0.2342  3.8898  2
Nitrosomonas europaea ATCC 19718  0.0824  0.1104  0.1587  3.9806  2
Nocardia farcinica IFM 10152  0.4842  0.2917  0.4968  3.5343  2
Nostoc sp. PCC7120  0.0877  0.1308  0.1124  3.9638  1
Parachlamydia sp. UWE25  0.1689  0.1397  0.1027  3.8561  1
Photorhabdus luminescens subsp. laumondii TTO1  0.0704  0.1183  0.1068  3.9838  1
Porphyromonas gingivalis W83  0.0476  0.1167  0.1559  4.0034  2
Prochlorococcus marinus str. MIT 9313  0.0472  0.0956  0.0773  4.0203  1
Prochlorococcus marinus subsp. marinus str. CCMP1375  0.1729  0.1423  0.1177  3.8697  1
Prochlorococcus marinus subsp. pastoris str. CCMP1986  0.2556  0.1671  0.1412  3.7354  1
Propionibacterium acnes KPA171202  0.1277  0.1338  0.1700  3.9293  2
Pseudomonas aeruginosa PAO1  0.4648  0.3204  0.5733  3.5827  2
Pseudomonas putida KT2440  0.2847  0.2255  0.4061  3.7696  2
Pseudomonas syringae pv. tomato str. DC3000  0.1960  0.1736  0.3013  3.8633  2
Pyrococcus abyssi GE5  0.0983  0.1962  0.1996  3.8887  2
Pyrococcus furiosus DSM 3638  0.1000  0.1641  0.1079  3.8847  1
Pyrococcus horikoshii OT3  0.0899  0.1508  0.1260  3.9105  1
Salmonella enterica subsp. enterica serovar Typhi Ty2  0.1272  0.1465  0.2068  3.9327  2
Salmonella typhimurium LT2  0.1293  0.1490  0.2100  3.9300  2
Shewanella oneidensis MR-1  0.0700  0.1320  0.1329  3.9795  2
Shigella flexneri 2a str. 2457T  0.1196  0.1429  0.1913  3.9416  2
Shigella flexneri 2a str. 301  0.1097  0.1343  0.1791  3.9529  2
Sinorhizobium meliloti 1021  0.1960  0.2199  0.3013  3.8633  2
Staphylococcus aureus subsp. aureus MRSA252  0.2338  0.2086  0.1531  3.7572  1
Staphylococcus aureus subsp. aureus MSSA476  0.2356  0.2071  0.1554  3.7557  1
Staphylococcus aureus subsp. aureus Mu50  0.2318  0.2056  0.1522  3.7591  1
Staphylococcus aureus subsp. aureus MW2  0.2368  0.2106  0.1562  3.7535  1
Staphylococcus aureus subsp. aureus N315  0.2348  0.2083  0.1543  3.7564  1
Staphylococcus epidermidis ATCC 12228  0.2277  0.2036  0.1399  3.7613  1
Staphylococcus haemolyticus JCSC1435  0.2304  0.2043  0.1526  3.7619  1
Streptococcus agalactiae 2603V/R  0.1690  0.1794  0.1200  3.8372  1
Streptococcus agalactiae NEM316  0.1679  0.1790  0.1209  3.8371  1
Streptococcus mutans UA159  0.1577  0.1783  0.1240  3.8468  1
Streptococcus pneumoniae R6  0.0952  0.1529  0.1210  3.9152  1
Streptococcus pneumoniae TIGR4  0.0957  0.1525  0.1209  3.9168  1
Streptococcus pyogenes M1 GAS  0.1227  0.1619  0.1137  3.8900  1
Streptococcus pyogenes MGAS10394  0.1167  0.1596  0.1101  3.8974  1
Streptococcus pyogenes MGAS315  0.1189  0.1636  0.1108  3.8929  1
Streptococcus pyogenes MGAS5005  0.1215  0.1612  0.1115  3.8929  1
Streptococcus pyogenes MGAS8232  0.1194  0.1608  0.1114  3.8932  1
Streptococcus pyogenes SSI-1  0.1189  0.1597  0.1111  3.8932  1
Streptococcus thermophilus CNRZ1066  0.1210  0.1710  0.1325  3.8908  1
Streptococcus thermophilus LMG 18311  0.1235  0.1737  0.1339  3.8881  1
Sulfolobus tokodaii str. 7  0.1932  0.1639  0.1253  3.7954  1
Thermoplasma acidophilum DSM 1728  0.0920  0.1668  0.2228  3.9315  2
Thermoplasma volcanium GSS1  0.0692  0.1345  0.1247  3.9379  2
Treponema pallidum str. Nichols  0.0548  0.0894  0.1095  4.0205  2
Ureaplasma parvum serovar 3 str. ATCC700970  0.4111  0.2316  0.1950  3.5023  1

Thus, each genome is mapped into the three-dimensional space determined by the indices (1 – 4). The table also provides a fourth dimension, that is, the absolute entropy of a codon distribution. Further (see Section 3.1), we shall not take this dimension into consideration, since it deteriorates the pattern observed in the three-dimensional case.

Meanwhile, the data on the absolute entropy of the codon distribution for various bacterial genomes are rather interesting. Keeping in mind that the maximal value of the entropy is $S_{\max} = \ln 64 \approx 4.1589$, one sees that the absolute entropy values observed over the set of genomes vary rather significantly. Treponema pallidum str. Nichols exhibits the maximal absolute entropy value, equal to 4.0205, while Mycoplasma mycoides subsp. mycoides SC has the minimal level of absolute entropy (equal to 3.4356).

Consider the dispersion of the genomes in the space defined by the indices (1 – 4). The scattering is shown in Figure 1. The dispersion pattern shown in this figure is two-horned; thus, a two-class pattern of the dispersion is hypothesized.
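Each point of this space is obtained from a codon frequency table. As a rough illustration of the third coordinate and of the absolute entropy, the sketch below computes the expected frequencies (4), the corresponding index, and the entropy for a toy sequence. It rests on assumptions not stated in the text: the dinucleotide frequencies $f_{ij}$ and $f_{jk}$ are taken as the marginals of the codon distribution over the third and the first nucleotide, $f_j$ is the frequency of the middle nucleotide, and the sequence and function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log


def codon_frequencies(seq):
    """In-frame codon frequencies of a coding sequence read from position 0."""
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def expected_from_dinucleotides(freq):
    """Formula (4): f~_ijk = f_ij * f_jk / f_j, built from codon marginals."""
    f_ij, f_jk, f_j = Counter(), Counter(), Counter()
    for codon, f in freq.items():
        f_ij[codon[:2]] += f   # sum over the third nucleotide
        f_jk[codon[1:]] += f   # sum over the first nucleotide
        f_j[codon[1]] += f     # middle nucleotide (assumed meaning of f_j)
    return {i + j + k: f_ij[i + j] * f_jk[j + k] / f_j[j]
            for i, j, k in product("ACGT", repeat=3) if f_j[j] > 0}


def index_T(freq):
    """Measure (1) taken against the expected frequencies (4)."""
    expected = expected_from_dinucleotides(freq)
    return sum(f * log(f / expected[c]) for c, f in freq.items() if f > 0)


def absolute_entropy(freq):
    """Absolute entropy of a codon distribution; at most ln 64."""
    return -sum(f * log(f) for f in freq.values() if f > 0)


seq = "ATGGCTGCTGCAGCTAAAGAAGAAGAATTGTTGCTGTAA"  # hypothetical toy gene
freq = codon_frequencies(seq)
print(f"T = {index_T(freq):.4f}, S = {absolute_entropy(freq):.4f}, "
      f"S_max = {log(64):.4f}")
```

With these marginals the expected frequencies sum to one, and a perfectly uniform codon table gives a zero index together with the maximal entropy ln 64.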
Moreover, the genomes in the three-dimensional space determined by the indices (1 – 4) occupy a nearly planar subspace. Evidently, the dispersion of the genomes in this space appears to consist of two classes. Whether the proximity of genomes observed in the space defined by the three indices (1 – 4) corresponds to proximity in some other sense is the key question of our investigation. Taxonomy is the most natural idea of proximity for genomes. Thus, the question arises whether the genomes closely located in the space of the indices (1 – 4) belong to the same or closely related taxa. To answer this question, we developed an unsupervised classification of the genomes in the three-dimensional space determined by the indices (1 – 4).

Fig. 1. The distribution of genomes in the space determined by the indices (1 – 4). The axes are the I-based, S*-based, and T-based indices of codon usage bias.

To develop such a classification, one must first split the genomes into K classes randomly. Then, for each class the center is determined; the latter is the arithmetic mean of each coordinate corresponding to the specific index. Then each genome (i.e., each point in the three-dimensional space) is checked for proximity to each of the K classes. If a genome is closer to another class than to the one it was originally attributed to, it must be transferred to that class. As soon as all the genomes are redistributed among the classes, the centers must be recalculated, and all the genomes are checked again for proximity to their class; a redistribution takes place where necessary. This procedure runs until no genome changes its class attribution. Then, the discernibility of the classes must be verified. There are various discernibility conditions (see, e.g., (Gorban, Rossiev, 2004)).

Here we executed a simplified version of the unsupervised classification. First, we did not check the class discernibility; next, the center of a class differs from a regular one.
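The general reassignment procedure described above can be sketched as follows. The sketch uses ordinary point centers (arithmetic means), not the modified centers of the simplified version; the function names, the fixed random seed, the iteration cap, and the handling of an empty class are assumptions made for illustration. The four points are (I, S*, T) triples taken from Table 1.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def classify(points, k, seed=0, max_iter=100):
    """Random initial split, then reassign points until no class changes."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]      # random initial split
    centers = []
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:                          # assumption: re-seed an empty class
                members = [rng.choice(points)]
            # center = arithmetic mean of each coordinate
            centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:                     # no genome changed its class
            break
        labels = new_labels
    return labels, centers


genomes = {
    "Bacillus anthracis str. Ames": (0.1808, 0.1880, 0.1301),
    "Staphylococcus aureus subsp. aureus Mu50": (0.2318, 0.2056, 0.1522),
    "Pseudomonas aeruginosa PAO1": (0.4648, 0.3204, 0.5733),
    "Bordetella bronchiseptica RB50": (0.4884, 0.3165, 0.5598),
}
points = list(genomes.values())
labels, centers = classify(points, k=2)
for name, label in zip(genomes, labels):
    print(label, name)
```

The class numbers produced by the sketch are arbitrary labels; only the grouping of points is meaningful.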
A straight line in the space determined by the indices (1 – 4) is supposed to be the center of a class, rather than a point in it. So, the classification was developed with respect to these two issues. Table 1 also shows the class attribution for each genome (see the last column, indicated as C).

A clear, concise and comprehensive investigation of the peculiarities of codon bias distribution may reveal valuable new knowledge about the relation between the function (in a general sense) and the structure of nucleotide sequences. Indeed, here we studied the relation between the taxonomy of a genome bearer and the structure of the genome. A structure may be defined in many ways, and here we explore the idea of an ensemble of (considerably short) fragments of a sequence. In particular, the structure here is understood in terms of the frequency dictionary (see Section 1; see also (Bugaenko et al., 1996, 1998; Sadovsky, 2003, 2006) for details).

Figure 1 shows the dispersion of genomes in the three-dimensional space determined by the indices (1 – 4). The projection shown in this figure yields the most suitable view of the pattern; a comprehensive study of the distribution pattern seen in various projections shows that it is located in a plane (or close to a plane). Thus, the three indices (1 – 4) are not independent.

Next, the dispersion of the genomes in the space of the indices (1 – 4) suggests a two-class distribution of the entities. Indeed, the unsupervised classification developed for the set of genomes confirms it. First of all, the genomes of the same genus belong to the same class, as a rule. Some rare exceptions to this rule result from the specific location of the entities within the "bullet" shown in Figure 1.

A measure of codon usage bias has been a matter of study for many researchers (see, e.g., (Nakamura et al., 2000; Galtier et al., 2006; Carbone et al., 2003; Sueoka, Kawanishi, 2000; Bierne, Eyre-Walker, 2006)).
Numerous approaches to the implementation of a bias index have been explored. Basically, some indices are based on the statistical or probabilistic features of the codon frequency distribution (Sharp, Li, 1987; Jansen et al., 2003; Nakamura et al., 2000), others are based on the entropy calculation of the distribution (Zeeberg, 2002; Frappat et al., 2003), and still others on multidimensional data analysis and visualization techniques (Carbone et al., 2003, 2005). The implementation of an index (or a set of indices) strongly affects the sense and meaning of the observed data; here the question arises about the similarity of the observations obtained through various indices, and about discerning the fine peculiarities standing behind those indices.

Entropy seems to be the most universal and sustainable characteristic of a frequency distribution of any nature (Gibbs, 1902; Gorban, 1984). Thus, the entropy-based approach to the study of codon usage bias seems to be the most powerful. In particular, this approach was used by Suzuki et al. (2004), where the entropy of the codon frequency distribution was calculated for various genomes and various fragments of a genome. The data presented in that paper show a significant correspondence to those shown above; here we take advantage of the general approach provided by Suzuki et al. (2004) through the calculation of a more specific index, that is, the mutual entropy.

The implementation of an index (or indices) of codon usage bias is of merit not in itself, but when it brings a new comprehension of the biological issues standing behind it. Some biological mechanisms affecting codon usage bias are rather well known (Bierne, Eyre-Walker, 2006; Galtier et al., 2006; Jansen et al., 2003; Sharp et al., 2005; Supek, Vlahoviček, 2005; Xiu-Feng et al., 2004). The rate of translation processes is the key issue here.
Quantitatively, the codon usage bias manifests a significant correlation with the C+G content of a genetic entity. Evidently, the C+G content is an important factor (see, e.g., (Carbone et al., 2003, 2005)); an intriguing observation on the correspondence between C+G content and the taxonomy of bacteria is considered in (Gorban, Zinovyev, 2007).

Probably, the distribution of genomes shown in Figure 1 could result from the C+G content; yet, one may not exclude some other mechanisms and biological issues determining it. An exact and reliable consideration of the relation between the structure (that is, the codon usage bias indices) and the function encoded in a sequence is still obstructed by the wide variety of functions observed at different sites of a sequence. Thus, a comprehensive study of such a relation strongly requires the clarification and identification of the function to be considered as an entity. Moreover, one should make some additional effort to prove the absence of interference between two (or more) functions encoded by the sites.

A relation between the structure (that is, the codon usage bias) and taxonomy seems to be less deteriorated by the variety of features to be considered. Previously, a significant dependence between the triplet composition of 16S RNA of bacteria and their taxonomy was reported (Gorban et al., 2000, 2001). We have pursued a similar approach here. We studied the correlation between the class determined by the proximity in the space defined by the codon usage bias indices (1 – 4) and the taxonomy of bacterial genomes.

The data shown in Table 1 reveal a significant correlation of class attribution with the taxonomy of bacterial genomes. First of all, the correlation is the highest at the species and/or strain levels.
Some exceptions observed for the Bacillus genus may result from the modification of the unsupervised classification implementation; on the other hand, the entities of that genus are located at the head of the bullet (see Figure 1). The distribution of genomes over the two classes looks rather complicated and quite irregular. This fact may follow from the general situation with the disposition of higher taxa of bacteria.

Nevertheless, the introduced indices of codon usage bias provide a researcher with a new tool for knowledge retrieval concerning the relation between structure and function, and between structure and taxonomy of the bearers of genetic entities.

Acknowledgements

We are thankful to Professor Alexander Gorban from Leicester University for encouraging discussions of this work.

References

Bierne, N., Eyre-Walker, A. Variation in synonymous codon use and DNA polymorphism within the Drosophila genome. J. Evol. Biol. (1), 1–11 (2006)
Bugaenko, N.N., Gorban, A.N., Sadovsky, M.G. Towards the determination of information content of nucleotide sequences. Russian J. of Mol. Biol., 529–541 (1996)
Bugaenko, N.N., Gorban, A.N., Sadovsky, M.G. Maximum entropy method in analysis of genetic text and measurement of its information content. Open Sys. & Information Dyn., 265–278 (1998)
Carbone, A., Zinovyev, A., Képès, F. Codon adaptation index as a measure of dominating codon bias. Bioinformatics (16), 2005–2015 (2003)
Carbone, A., Képès, F., Zinovyev, A. Codon bias signatures, organization of microorganisms in codon space, and lifestyle. Mol. Biol. Evol. (3), 547–561 (2006)
Frappat, L., Minichini, C., Sciarrino, A., Sorba, P. Universality and Shannon entropy of codon usage. Phys. Review E, 061910 (2003)
Fuglsang, A. Estimating the "Effective Number of Codons": The Wright Way of Determining Codon Homozygosity Leads to Superior Estimates. Genetics, 1301–1307 (2006)
Galtier, N., Bazin, E., Bierne, N. GC-biased segregation of non-coding polymorphisms in Drosophila. Genetics, 221–228 (2006)
Gibbs, J.W. Elementary Principles in Statistical Mechanics, Developed with Especial Reference to the Rational Foundation of Thermodynamics. C. Scribner's Sons, New Haven (1902)
Gorban, A.N., Zinovyev, A.Yu. The Mystery of Two Straight Lines in Bacterial Genome Statistics. Release 2007. arXiv:q-bio/0412015
Gorban, A.N., Karlin, I.V. Invariant Manifolds for Physical and Chemical Kinetics. Lect. Notes Phys. 660, Springer, Berlin, Heidelberg (2005)
Gorban, A.N., Rossiev, D.A. Neurocomputers on PC. Nauka plc., Novosibirsk (2004)
Gorban, A.N., Popova, T.G., Sadovsky, M.G., Wunsch, D.C. Information content of the frequency dictionaries, re-construction, transformation and classification of dictionaries and genetic texts. Intelligent Engineering Systems through Artificial Neural Networks: Smart Engineering System Design, N.-Y.: ASME Press, 657–663 (2001)
Gorban, A.N., Popova, T.G., Sadovsky, M.G. Classification of symbol sequences over their frequency dictionaries: towards the connection between structure and natural taxonomy. Open Systems & Information Dynamics (1), 1–17 (2000)
Gorban, A.N. Equilibrium Encircling. Equations of Chemical Kinetics and their Thermodynamic Analysis. Novosibirsk, Nauka Publ. (1984) 256 p.
Jansen, R., Bussemaker, H.J., Gerstein, M. Revisiting the codon adaptation index from a whole-genome perspective: analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models. NAR, 2242–2251 (2003)
Nakamura, Y., Gojobori, T., Ikemura, T. Codon usage tabulated from international DNA sequence databases: status for the year 2000. Nucleic Acids Res., 292 (2000)
Sadovsky, M.G. Comparison of real frequencies of strings vs. the expected ones reveals the information capacity of macromoleculae. Journal of Biol. Phys., 23–38 (2003)
Sadovsky, M.G. Information capacity of nucleotide sequences and its applications. Bulletin of Math. Biology, 156–178 (2006)
Sharp, P.M., Wen-Hsiung Li. The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications. NAR, 1281–1295 (1987)
Sharp, P.M., Bailes, E., Grocock, R.J., Peden, J.F., Sockett, R.E. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Research, 1141–1153 (2005)
Sueoka, N., Kawanishi, Y. DNA G+C content of the third codon position and codon usage biases of human genes. Gene (1), 53–62 (2000)
Supek, F., Vlahoviček, K. Comparison of codon usage measures and their applicability in prediction of microbial gene expressivity. BMC Bioinformatics, 182–197 (2005)
Suzuki, H., Saito, R., Tomita, M. The 'weighted sum of relative entropy': a new index for synonymous codon usage bias. Gene, 19–23 (2004)
Xiu-Feng Wan, Dong Xu, Kleinhofs, A., Jizhong Zhou. Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes. BMC Evolutionary Biology, 19–30 (2004)
Zeeberg, B. Shannon Information Theoretic Computation of Synonymous Codon Usage Biases in Coding Regions of Human and Mouse Genomes. Genome Res. 12